[jira] [Comment Edited] (TIKA-1252) Tika is not indexing all authors of a PDF

2014-03-03 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13918643#comment-13918643
 ] 

Uwe Schindler edited comment on TIKA-1252 at 3/3/14 10:17 PM:
--

I did a quick check in 
[https://svn.apache.org/repos/asf/lucene/dev/trunk/solr/contrib/extraction/src/java/org/apache/solr/handler/extraction/SolrContentHandler.java]

Solr does not seem to remove duplicate keys (see {{addMetadata()}} and 
{{addField(String fname, String fval, String[] vals)}}). Furthermore, if the 
field is *not* multivalued, the data is concatenated with whitespace and put 
into *one* field (see line 226 ff).

So this looks like a configuration problem or really a bug in TIKA.
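
The behaviour described above can be sketched roughly as follows. This is a simplified, stand-alone illustration and not the actual SolrContentHandler code; the class and method names (FieldCollector, add, valuesFor) are made up for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a handler that collects metadata values per field. */
public class FieldCollector {
    // Every value is kept, duplicates included, in insertion order.
    private final Map<String, List<String>> fields = new LinkedHashMap<>();

    /** Records one value for a field without de-duplicating. */
    public void add(String name, String value) {
        fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    /**
     * Returns what would end up in the field: all values if it is
     * multivalued, otherwise a single whitespace-joined value.
     */
    public List<String> valuesFor(String name, boolean multiValued) {
        List<String> vals = fields.getOrDefault(name, new ArrayList<>());
        if (multiValued || vals.size() <= 1) {
            return new ArrayList<>(vals);
        }
        List<String> joined = new ArrayList<>();
        joined.add(String.join(" ", vals));
        return joined;
    }

    public static void main(String[] args) {
        FieldCollector c = new FieldCollector();
        c.add("author", "Author 1");
        c.add("author", "Author 2");
        System.out.println(c.valuesFor("author", true));  // [Author 1, Author 2]
        System.out.println(c.valuesFor("author", false)); // [Author 1 Author 2]
    }
}
```

In both cases the second author survives, which is why a missing "Author 2" points away from this part of the code.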


was (Author: thetaphi):
I did a quick check in 
[https://svn.apache.org/repos/asf/lucene/dev/trunk/solr/contrib/extraction/src/java/org/apache/solr/handler/extraction/SolrContentHandler.java]

Solr does not seem to remove duplicate values (see {{addMetadata()}} and 
{{addField(String fname, String fval, String[] vals)}}). Furthermore, if the 
field is *not* multivalued, the data is concatenated with whitespace and put 
into *one* field (see line 226 ff).

So this looks like a configuration problem or really a bug in TIKA.

> Tika is not indexing all authors of a PDF
> -
>
> Key: TIKA-1252
> URL: https://issues.apache.org/jira/browse/TIKA-1252
> Project: Tika
> Issue Type: Bug
> Components: metadata, parser
> Affects Versions: 1.4
> Environment: Ubuntu 12.04 (x64) Solr 4.6.0 (Amazon Web Services, 
> Bitnami Stack)
> Reporter: Alexandre Madurell
>
> When submitting a PDF with this information in its XMP metadata:
> ...
>   
> 
>   Author 1
>   Author 2
> 
>   
> ...
> Only the first one appears in the collection:
> ...
> "author":["Author 1"],
> "author_s":"Author 1",
> ...
> In spite of having set the field to multiValued in the Solr schema:
>  multiValued="true"/>
> Let me know if there's any further specific information I could provide.
> Thanks in advance! 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (TIKA-1252) Tika is not indexing all authors of a PDF

2014-03-03 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13918643#comment-13918643
 ] 

Uwe Schindler commented on TIKA-1252:
-

I did a quick check in 
[https://svn.apache.org/repos/asf/lucene/dev/trunk/solr/contrib/extraction/src/java/org/apache/solr/handler/extraction/SolrContentHandler.java]

Solr does not seem to remove duplicate values (see {{addMetadata()}} and 
{{addField(String fname, String fval, String[] vals)}}). Furthermore, if the 
field is *not* multivalued, the data is concatenated with whitespace and put 
into *one* field (see line 226 ff).

So this looks like a configuration problem or really a bug in TIKA.



[jira] [Commented] (TIKA-1252) Tika is not indexing all authors of a PDF

2014-03-03 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13918634#comment-13918634
 ] 

Uwe Schindler commented on TIKA-1252:
-

This could be a problem in Solr's DataImportHandler. I am not 100% sure whether 
it supports multiple values per key. Maybe it is using a Map... In any 
case, if this is caused by Solr, I will move the issue over to SOLR.
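
To illustrate the concern about a plain Map, here is a small sketch. It is not DataImportHandler code; the class and method names are invented for the example, and whether DataImportHandler actually stores metadata this way is exactly the open question above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MapValueLoss {
    /** One value per key: a later put() silently replaces the earlier value. */
    public static Map<String, String> singleValued(String[][] pairs) {
        Map<String, String> m = new HashMap<>();
        for (String[] p : pairs) {
            m.put(p[0], p[1]);
        }
        return m;
    }

    /** A list per key, so repeated keys accumulate instead of overwriting. */
    public static Map<String, List<String>> multiValued(String[][] pairs) {
        Map<String, List<String>> m = new HashMap<>();
        for (String[] p : pairs) {
            m.computeIfAbsent(p[0], k -> new ArrayList<>()).add(p[1]);
        }
        return m;
    }

    public static void main(String[] args) {
        String[][] pairs = {{"author", "Author 1"}, {"author", "Author 2"}};
        System.out.println(singleValued(pairs).get("author")); // Author 2
        System.out.println(multiValued(pairs).get("author"));  // [Author 1, Author 2]
    }
}
```

Note that put() keeps the last value, while the report shows the first author surviving, so a single-valued map would only explain the symptom if the first value were the one retained.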



[jira] [Commented] (TIKA-1252) Tika is not indexing all authors of a PDF

2014-03-03 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13918529#comment-13918529
 ] 

Nick Burch commented on TIKA-1252:
--

Tika supports multiple values for a given metadata key, though not all parsers 
support extracting multiple values for all keys. 

I'd suggest you first off try with a recent copy of the tika-app jar, just to 
check if it's a problem with how you're integrating with SOLR. If that can't 
return multiple author tags, any chance you could upload a small PDF that shows 
this problem, so someone can look into why the parser isn't doing so?



[jira] [Commented] (TIKA-1254) No warning when Tika does not find a parser.

2014-03-03 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13918528#comment-13918528
 ] 

Nick Burch commented on TIKA-1254:
--

This is as expected: it's assumed that if you're depending on only tika-core 
then you're supplying all your own custom parsers, or just using the detection 
parts.

If you ask TikaConfig nicely, it'll tell you the parsers it has registered, and 
the mimetypes it can handle parsing for. See tika-app for an example of how to 
do that. You're best off adding that sort of check to your own code if you 
risk missing key parsers, as only you know which mime types matter for your use 
case.
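
A check of that kind might look roughly like the sketch below. It only covers the comparison step: the class name, method name and the hard-coded "supported" set are assumptions for illustration, and how to obtain the real set of supported types from TikaConfig is left to the tika-app example mentioned above.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

public class ParserCoverageCheck {
    /** Returns the required MIME types that are absent from the supported set, sorted. */
    public static Set<String> missingTypes(Set<String> supported, Set<String> required) {
        Set<String> missing = new TreeSet<>(required);
        missing.removeAll(supported);
        return missing;
    }

    public static void main(String[] args) {
        // Assumption: in real code this set would be built from whatever
        // parsers Tika reports as registered; it is hard-coded here.
        Set<String> supported = new TreeSet<>(Arrays.asList("text/plain"));
        // The MIME types that matter for this particular application.
        Set<String> required = new TreeSet<>(Arrays.asList("application/pdf", "text/plain"));

        Set<String> missing = missingTypes(supported, required);
        if (!missing.isEmpty()) {
            // Fail loudly at startup instead of silently getting empty text back.
            System.err.println("No parser registered for: " + missing);
        }
    }
}
```

Running such a check once at startup would surface a missing tika-parsers dependency before any document comes back as an empty string.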

> No warning when Tika does not find a parser.
> 
>
> Key: TIKA-1254
> URL: https://issues.apache.org/jira/browse/TIKA-1254
> Project: Tika
> Issue Type: Wish
> Reporter: Ankit Gupta
> Priority: Minor
>
> When using Tika via Gradle or Maven, if the dependency is specified only on 
> tika-core and not on tika-parsers, then there is no warning to let you know 
> that there is a library missing and the function returns an empty string.





[jira] [Commented] (TIKA-1252) Tika is not indexing all authors of a PDF

2014-03-03 Thread Alexandre Madurell (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13918450#comment-13918450
 ] 

Alexandre Madurell commented on TIKA-1252:
--

Hmmm... maybe I need to build a DublinCoreAdapter on top of Tika's Metadata 
class as mentioned here? 
http://lucene.472066.n3.nabble.com/Metadata-use-by-Apache-Java-projects-td645477.html#a645484

Kind of a newbie here... any help is appreciated.



[jira] [Created] (TIKA-1254) No warning when Tika does not find a parser.

2014-03-03 Thread Ankit Gupta (JIRA)
Ankit Gupta created TIKA-1254:
-

Summary: No warning when Tika does not find a parser.
Key: TIKA-1254
URL: https://issues.apache.org/jira/browse/TIKA-1254
Project: Tika
Issue Type: Wish
Reporter: Ankit Gupta
Priority: Minor


When using Tika via Gradle or Maven, if the dependency is specified only on 
tika-core and not on tika-parsers, then there is no warning to let you know 
that there is a library missing and the function returns an empty string.





[jira] [Commented] (TIKA-1253) SLF4J: The requested version 1.5.6 by your slf4j binding is not compatible with [1.6, 1.7]

2014-03-03 Thread Ken Krugler (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13918267#comment-13918267
 ] 

Ken Krugler commented on TIKA-1253:
---

Hi sudheshna,

Please start a discussion about this issue on the Tika user mailing list. 
Following that, you'd be in a better position to decide whether to file an 
issue for Tika in Jira to upgrade to a newer version of slf4j. Also note that a 
request to upgrade Tika to use a newer component isn't a bug, it's (maybe) an 
improvement...but using a newer version of slf4j can cause issues for other 
projects that use an older version.

Your best bet short-term might be to pull & build a version of Tika with the 
target slf4j version.

> SLF4J: The requested version 1.5.6 by your slf4j binding is not compatible 
> with [1.6, 1.7]
> --
>
> Key: TIKA-1253
> URL: https://issues.apache.org/jira/browse/TIKA-1253
> Project: Tika
> Issue Type: Bug
> Reporter: sudheshna iyer
> Priority: Blocker
>
> I am receiving the following error with Tika 4.0
> SLF4J: The requested version 1.5.6 by your slf4j binding is not compatible 
> with [1.6, 1.7]
> pom.xml file entry:
> <dependency>
>   <groupId>org.apache.tika</groupId>
>   <artifactId>tika-app</artifactId>
>   <version>1.4</version>
> </dependency>
> I have to incorporate tika project with other projects which use 1.7 of 
> SLF4J. Since Tika is not compatible with 1.7, I am not able to run my Tika 
> service.
> Why is Tika using lower versions of SLF4J? What is the workaround? 





[jira] [Updated] (TIKA-1253) SLF4J: The requested version 1.5.6 by your slf4j binding is not compatible with [1.6, 1.7]

2014-03-03 Thread sudheshna iyer (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sudheshna iyer updated TIKA-1253:
-

Priority: Blocker  (was: Major)



[jira] [Created] (TIKA-1253) SLF4J: The requested version 1.5.6 by your slf4j binding is not compatible with [1.6, 1.7]

2014-03-03 Thread sudheshna iyer (JIRA)
sudheshna iyer created TIKA-1253:


Summary: SLF4J: The requested version 1.5.6 by your slf4j binding 
is not compatible with [1.6, 1.7]
Key: TIKA-1253
URL: https://issues.apache.org/jira/browse/TIKA-1253
Project: Tika
Issue Type: Bug
Reporter: sudheshna iyer


I am receiving the following error with Tika 4.0

SLF4J: The requested version 1.5.6 by your slf4j binding is not compatible with 
[1.6, 1.7]

pom.xml file entry:

<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-app</artifactId>
  <version>1.4</version>
</dependency>

I have to incorporate tika project with other projects which use 1.7 of SLF4J. 
Since Tika is not compatible with 1.7, I am not able to run my Tika service.

Why is Tika using lower versions of SLF4J? What is the workaround? 





Tika 1.5 vs 1.4 testing

2014-03-03 Thread Hong-Thai Nguyen
Hi all,

I've checked on the same corpus. Here's the comparison:
||Tika||POI||PDFbox||Failed docs||
|1.4|3.9|1.8.1|92|
|1.5|3.10-beta2|1.8.4|182|

== TIKA 1.4 
- pdf (7)
   * (1) 
com.polyspot.document.converter.ConversionException: 
org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
org.apache.tika.parser.ParserDecorator$1@4d39a96c
   * (3) 
com.polyspot.document.converter.ConversionException: 
org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from 
org.apache.tika.parser.ParserDecorator$1@4d39a96c
   * (3) 
com.polyspot.document.converter.ConversionException: 
org.apache.tika.exception.TikaException: Unable to extract PDF content
- pptx (8)
   * (7) 
com.polyspot.document.converter.ConversionException: 
org.apache.tika.exception.TikaException: Error creating OOXML extractor
   * (1) 
com.polyspot.document.converter.ConversionException: 
org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
org.apache.tika.parser.ParserDecorator$1@4db190a5
- doc (2)
   * (2) 
com.polyspot.document.converter.ConversionException: 
org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
org.apache.tika.parser.ParserDecorator$1@6ddd7ea2
- ppt (40)
   * (39) 
com.polyspot.document.converter.ConversionException: 
org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
org.apache.tika.parser.ParserDecorator$1@6ddd7ea2
   * (1) 
com.polyspot.document.converter.ConversionException: 
org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from 
org.apache.tika.parser.ParserDecorator$1@6ddd7ea2
- xls (9)
   * (7) 
com.polyspot.document.converter.ConversionException: 
org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
org.apache.tika.parser.ParserDecorator$1@6ddd7ea2
   * (2) 
com.polyspot.document.converter.ConversionException: 
org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from 
org.apache.tika.parser.ParserDecorator$1@6ddd7ea2
- dwg (4)
   * (4) 
com.polyspot.document.converter.ConversionException: 
org.apache.tika.exception.TikaException: Unsupported AutoCAD drawing version: 
AC1014
- odp (2)
   * (2) 
com.polyspot.document.converter.ConversionException: 
org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from 
org.apache.tika.parser.ParserDecorator$1@7286f080
- rtf (13)
   * (13) 
com.polyspot.document.converter.ConversionException: 
org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
org.apache.tika.parser.ParserDecorator$1@455a7af4
- pps (5)
   * (5) 
com.polyspot.document.converter.ConversionException: 
org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
org.apache.tika.parser.ParserDecorator$1@6ddd7ea2

== TIKA 1.5 
- pdf (16)
   * (10) 
com.polyspot.document.converter.ConversionException: 
org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
org.apache.tika.parser.ParserDecorator$1@1e59efa5
   * (3) 
com.polyspot.document.converter.ConversionException: 
org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from 
org.apache.tika.parser.ParserDecorator$1@1e59efa5
   * (3) 
com.polyspot.document.converter.ConversionException: 
org.apache.tika.exception.TikaException: Unable to extract PDF content
- pptx (19)
   * (7) 
com.polyspot.document.converter.ConversionException: 
org.apache.tika.exception.TikaException: Error creating OOXML extractor
   * (12) 
com.polyspot.document.converter.ConversionException: 
org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
org.apache.tika.parser.ParserDecorator$1@2b195ebd
- doc (11)
   * (9) 
com.polyspot.document.converter.ConversionException: 
org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
org.apache.tika.parser.ParserDecorator$1@7b796022
   * (2) 
com.polyspot.document.converter.ConversionException: org.xml.sax.SAXException: 
Namespace http://www.w3.org/1999/xhtml not declared
- ppt (47)
   * (46) 
com.polyspot.document.converter.ConversionE