[jira] [Comment Edited] (TIKA-1252) Tika is not indexing all authors of a PDF
[ https://issues.apache.org/jira/browse/TIKA-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13918643#comment-13918643 ]

Uwe Schindler edited comment on TIKA-1252 at 3/3/14 10:17 PM:
--------------------------------------------------------------

I did a quick check in [https://svn.apache.org/repos/asf/lucene/dev/trunk/solr/contrib/extraction/src/java/org/apache/solr/handler/extraction/SolrContentHandler.java]

Solr does not seem to remove duplicate keys (see {{addMetadata()}} and {{addField(String fname, String fval, String[] vals)}}). Furthermore, if the field is *not* multivalued, the data is concatenated with whitespace and put into *one* field (see line 226 ff).

So this looks like a configuration problem or really a bug in TIKA.

was (Author: thetaphi):
I did a quick check in [https://svn.apache.org/repos/asf/lucene/dev/trunk/solr/contrib/extraction/src/java/org/apache/solr/handler/extraction/SolrContentHandler.java]

Solr does not seem to remove duplicate values (see {{addMetadata()}} and {{addField(String fname, String fval, String[] vals)}}). Furthermore, if the field is *not* multivalued, the data is concatenated with whitespace and put into *one* field (see line 226 ff).

So this looks like a configuration problem or really a bug in TIKA.

> Tika is not indexing all authors of a PDF
> -----------------------------------------
>
> Key: TIKA-1252
> URL: https://issues.apache.org/jira/browse/TIKA-1252
> Project: Tika
> Issue Type: Bug
> Components: metadata, parser
> Affects Versions: 1.4
> Environment: Ubuntu 12.04 (x64) Solr 4.6.0 (Amazon Web Services, Bitnami Stack)
> Reporter: Alexandre Madurell
>
> When submitting a PDF with this information in its XMP metadata:
> ...
>
>
> Author 1
> Author 2
>
>
> ...
> Only the first one appears in the collection:
> ...
> "author":["Author 1"],
> "author_s":"Author 1",
> ...
> In spite of having set the field to multiValued in the Solr schema:
> multiValued="true"/>
> Let me know if there's any further specific information I could provide.
> Thanks in advance!
-- This message was sent by Atlassian JIRA (v6.2#6252)
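The field handling described in the comment above (every value kept for a multivalued field, values joined with whitespace into one field otherwise) can be modelled in a few lines. This is a minimal sketch of that description only, not Solr's actual code: the class `FieldValueSketch` and the method `valuesForField` are hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of the behaviour described in the comment; NOT Solr's SolrContentHandler.
class FieldValueSketch {

    // Returns the values that would end up in the target field.
    static List<String> valuesForField(boolean multiValued, String... extracted) {
        if (multiValued) {
            // Multivalued field: every extracted value is kept, duplicates included.
            return new ArrayList<>(Arrays.asList(extracted));
        }
        // Single-valued field: all values are joined with whitespace into one value.
        List<String> single = new ArrayList<>();
        single.add(String.join(" ", extracted));
        return single;
    }

    public static void main(String[] args) {
        System.out.println(valuesForField(true, "Author 1", "Author 2"));  // [Author 1, Author 2]
        System.out.println(valuesForField(false, "Author 1", "Author 2")); // [Author 1 Author 2]
    }
}
```

Under this model neither branch would produce the reported result of a lone "Author 1", which is consistent with the comment's conclusion that the second author is probably lost before the values reach this step.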
[jira] [Commented] (TIKA-1252) Tika is not indexing all authors of a PDF
[ https://issues.apache.org/jira/browse/TIKA-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13918643#comment-13918643 ]

Uwe Schindler commented on TIKA-1252:
-------------------------------------

I did a quick check in [https://svn.apache.org/repos/asf/lucene/dev/trunk/solr/contrib/extraction/src/java/org/apache/solr/handler/extraction/SolrContentHandler.java]

Solr does not seem to remove duplicate values (see {{addMetadata()}} and {{addField(String fname, String fval, String[] vals)}}). Furthermore, if the field is *not* multivalued, the data is concatenated with whitespace and put into *one* field (see line 226 ff).

So this looks like a configuration problem or really a bug in TIKA.
[jira] [Commented] (TIKA-1252) Tika is not indexing all authors of a PDF
[ https://issues.apache.org/jira/browse/TIKA-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13918634#comment-13918634 ]

Uwe Schindler commented on TIKA-1252:
-------------------------------------

This could be a problem in Solr's DataImportHandler. I am not 100% sure if this one supports multiple values per key. Maybe it is using a Map... In any case, if this is caused by Solr, I will move the issue over to SOLR.
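To illustrate the concern about "using a Map": a plain one-value-per-key map silently drops repeated keys, while a map of lists keeps them. The sketch below is a generic illustration with hypothetical names (`SingleValueMapSketch`, `collectSingle`, `collectMulti`) and makes no claim about how DataImportHandler is actually written. Note that with `put()` the last value survives, whereas the report shows the first author surviving, so this shows the general failure mode rather than a diagnosis.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Generic illustration only; not taken from Solr or Tika sources.
class SingleValueMapSketch {

    // One value per key: a later put() overwrites the earlier value.
    static Map<String, String> collectSingle(String[][] pairs) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String[] pair : pairs) {
            result.put(pair[0], pair[1]);
        }
        return result;
    }

    // A list per key: repeated keys accumulate their values in order.
    static Map<String, List<String>> collectMulti(String[][] pairs) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (String[] pair : pairs) {
            result.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
        }
        return result;
    }

    public static void main(String[] args) {
        String[][] pairs = { { "author", "Author 1" }, { "author", "Author 2" } };
        System.out.println(collectSingle(pairs)); // {author=Author 2}
        System.out.println(collectMulti(pairs));  // {author=[Author 1, Author 2]}
    }
}
```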
[jira] [Commented] (TIKA-1252) Tika is not indexing all authors of a PDF
[ https://issues.apache.org/jira/browse/TIKA-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13918529#comment-13918529 ]

Nick Burch commented on TIKA-1252:
----------------------------------

Tika supports multiple values for a given metadata key, though not all parsers support extracting multiple values for all keys.

I'd suggest you first off try with a recent copy of the tika-app jar, just to check if it's a problem with how you're integrating with SOLR. If that can't return multiple author tags, any chance you could upload a small PDF that shows this problem, so someone can look into why the parser isn't doing so?
[jira] [Commented] (TIKA-1254) No warning when Tika does not find a parser.
[ https://issues.apache.org/jira/browse/TIKA-1254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13918528#comment-13918528 ]

Nick Burch commented on TIKA-1254:
----------------------------------

This is as expected - it's assumed that if you're depending on only tika-core then you're supplying all your own custom parsers, or just using the detection parts.

If you ask TikaConfig nicely, it'll tell you the parsers it has registered, and the mimetypes it can handle parsing for. See tika-app for an example of how to do that. You're best off adding that sort of check to your own code if you risk missing key parsers, as only you know which mime types matter for your use case.

> No warning when Tika does not find a parser.
> --------------------------------------------
>
> Key: TIKA-1254
> URL: https://issues.apache.org/jira/browse/TIKA-1254
> Project: Tika
> Issue Type: Wish
> Reporter: Ankit Gupta
> Priority: Minor
>
> When using Tika using Gradle or Maven, if the dependency is specified only on
> tika-core and not on tika-parsers, then there is no warning to let you know
> that there is a library missing and the function returns an empty string.
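The kind of startup check suggested above could look roughly like the following. It is a stdlib-only sketch: the class `ParserCoverageSketch`, the method `missingTypes`, and the example type sets are all hypothetical. In a real application the "supported" set would presumably be built from what the configured Tika parser reports (for instance via `Parser.getSupportedTypes(ParseContext)` on the parser returned by `TikaConfig`); that wiring is an assumption and is not shown here.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper: compares the mime types an application needs
// with the mime types its registered parsers claim to handle.
class ParserCoverageSketch {

    // Returns the required types that no registered parser supports (sorted).
    static Set<String> missingTypes(Set<String> required, Set<String> supported) {
        Set<String> missing = new TreeSet<>(required);
        missing.removeAll(supported);
        return missing;
    }

    public static void main(String[] args) {
        // Example data only; with just tika-core on the classpath the supported set would be small.
        Set<String> required = new HashSet<>(Arrays.asList("application/pdf", "application/msword"));
        Set<String> supported = new HashSet<>(Arrays.asList("text/plain", "application/pdf"));

        Set<String> missing = missingTypes(required, supported);
        if (!missing.isEmpty()) {
            System.err.println("No parser registered for: " + missing);
        }
    }
}
```

Failing fast (or at least logging) when the missing set is non-empty gives the warning the reporter asked for, without Tika having to guess which mime types matter.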
[jira] [Commented] (TIKA-1252) Tika is not indexing all authors of a PDF
[ https://issues.apache.org/jira/browse/TIKA-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13918450#comment-13918450 ]

Alexandre Madurell commented on TIKA-1252:
------------------------------------------

Hmmm... maybe I need to build a DublinCoreAdapter on top of Tika's Metadata class as mentioned here?

http://lucene.472066.n3.nabble.com/Metadata-use-by-Apache-Java-projects-td645477.html#a645484

Kind of a newbie here... any help is appreciated.
[jira] [Created] (TIKA-1254) No warning when Tika does not find a parser.
Ankit Gupta created TIKA-1254:
------------------------------

Summary: No warning when Tika does not find a parser.
Key: TIKA-1254
URL: https://issues.apache.org/jira/browse/TIKA-1254
Project: Tika
Issue Type: Wish
Reporter: Ankit Gupta
Priority: Minor

When using Tika via Gradle or Maven, if the dependency is specified only on tika-core and not on tika-parsers, then there is no warning to let you know that a library is missing, and the function returns an empty string.
[jira] [Commented] (TIKA-1253) SLF4J: The requested version 1.5.6 by your slf4j binding is not compatible with [1.6, 1.7]
[ https://issues.apache.org/jira/browse/TIKA-1253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13918267#comment-13918267 ]

Ken Krugler commented on TIKA-1253:
-----------------------------------

Hi sudheshna,

Please start a discussion about this issue on the Tika user mail list.

Following that, you'd be in a better position to decide whether to file an issue for Tika in Jira to upgrade to a newer version of slf4j.

Also note that a request to upgrade Tika to use a newer component isn't a bug, it's (maybe) an improvement...but using a newer version of slf4j can cause issues for other projects that use an older version.

Your best bet short-term might be to pull & build a version of Tika with the target slf4j version.

> SLF4J: The requested version 1.5.6 by your slf4j binding is not compatible with [1.6, 1.7]
> ------------------------------------------------------------------------------------------
>
> Key: TIKA-1253
> URL: https://issues.apache.org/jira/browse/TIKA-1253
> Project: Tika
> Issue Type: Bug
> Reporter: sudheshna iyer
> Priority: Blocker
>
> I am receiving the following error with Tika 4.0
> SLF4J: The requested version 1.5.6 by your slf4j binding is not compatible with [1.6, 1.7]
> pom.xml file entry:
>
> org.apache.tika
> tika-app
> 1.4
>
>
> I have to incorporate tika project with other projects which use 1.7 of
> SLF4J. Since Tika is not compatible with 1.7, I am not able to run my Tika
> service.
> Why is Tika using lower versions of SLF4J? What is the workaround?
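For readers unfamiliar with the quoted warning: it says the logging binding on the classpath asks for API version 1.5.6 while the SLF4J API in use accepts only 1.6 and 1.7. The sketch below models such a check as a simple prefix comparison. That the real check works this way is an assumption on my part, and `Slf4jVersionCheckSketch` and `isCompatible` are hypothetical names, not SLF4J code.

```java
import java.util.Arrays;
import java.util.List;

// Simplified, hypothetical model of a version sanity check; not SLF4J's implementation.
class Slf4jVersionCheckSketch {

    // True if the version requested by the binding starts with any accepted prefix.
    static boolean isCompatible(String requestedByBinding, List<String> acceptedPrefixes) {
        for (String prefix : acceptedPrefixes) {
            if (requestedByBinding.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> accepted = Arrays.asList("1.6", "1.7");
        System.out.println(isCompatible("1.5.6", accepted)); // false -> warning as in the report
        System.out.println(isCompatible("1.7.5", accepted)); // true
    }
}
```

Seen this way, the mismatch is between an old binding and a newer API on the same classpath, which is why rebuilding Tika against the target slf4j version, as suggested above, is one way to make the two agree.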
[jira] [Updated] (TIKA-1253) SLF4J: The requested version 1.5.6 by your slf4j binding is not compatible with [1.6, 1.7]
[ https://issues.apache.org/jira/browse/TIKA-1253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sudheshna iyer updated TIKA-1253:
---------------------------------

Priority: Blocker (was: Major)
[jira] [Created] (TIKA-1253) SLF4J: The requested version 1.5.6 by your slf4j binding is not compatible with [1.6, 1.7]
sudheshna iyer created TIKA-1253:
---------------------------------

Summary: SLF4J: The requested version 1.5.6 by your slf4j binding is not compatible with [1.6, 1.7]
Key: TIKA-1253
URL: https://issues.apache.org/jira/browse/TIKA-1253
Project: Tika
Issue Type: Bug
Reporter: sudheshna iyer

I am receiving the following error with Tika 4.0
SLF4J: The requested version 1.5.6 by your slf4j binding is not compatible with [1.6, 1.7]

pom.xml file entry:

org.apache.tika
tika-app
1.4

I have to incorporate tika project with other projects which use 1.7 of SLF4J. Since Tika is not compatible with 1.7, I am not able to run my Tika service.

Why is Tika using lower versions of SLF4J? What is the workaround?
Tika 1.5 vs 1.4 testing
Hi all,

I've run both versions on the same corpus. Here's the comparison:

||Tika||POI||PDFBox||Failed docs||
|1.4|3.9|1.8.1|92|
|1.5|3.10-beta2|1.8.4|182|

== TIKA 1.4

- pdf (7)
  * (1) com.polyspot.document.converter.ConversionException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.ParserDecorator$1@4d39a96c
  * (3) com.polyspot.document.converter.ConversionException: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.ParserDecorator$1@4d39a96c
  * (3) com.polyspot.document.converter.ConversionException: org.apache.tika.exception.TikaException: Unable to extract PDF content
- pptx (8)
  * (7) com.polyspot.document.converter.ConversionException: org.apache.tika.exception.TikaException: Error creating OOXML extractor
  * (1) com.polyspot.document.converter.ConversionException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.ParserDecorator$1@4db190a5
- doc (2)
  * (2) com.polyspot.document.converter.ConversionException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.ParserDecorator$1@6ddd7ea2
- ppt (40)
  * (39) com.polyspot.document.converter.ConversionException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.ParserDecorator$1@6ddd7ea2
  * (1) com.polyspot.document.converter.ConversionException: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.ParserDecorator$1@6ddd7ea2
- xls (9)
  * (7) com.polyspot.document.converter.ConversionException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.ParserDecorator$1@6ddd7ea2
  * (2) com.polyspot.document.converter.ConversionException: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.ParserDecorator$1@6ddd7ea2
- dwg (4)
  * (4) com.polyspot.document.converter.ConversionException: org.apache.tika.exception.TikaException: Unsupported AutoCAD drawing version: AC1014
- odp (2)
  * (2) com.polyspot.document.converter.ConversionException: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.ParserDecorator$1@7286f080
- rtf (13)
  * (13) com.polyspot.document.converter.ConversionException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.ParserDecorator$1@455a7af4
- pps (5)
  * (5) com.polyspot.document.converter.ConversionException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.ParserDecorator$1@6ddd7ea2

== TIKA 1.5

- pdf (16)
  * (10) com.polyspot.document.converter.ConversionException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.ParserDecorator$1@1e59efa5
  * (3) com.polyspot.document.converter.ConversionException: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.ParserDecorator$1@1e59efa5
  * (3) com.polyspot.document.converter.ConversionException: org.apache.tika.exception.TikaException: Unable to extract PDF content
- pptx (19)
  * (7) com.polyspot.document.converter.ConversionException: org.apache.tika.exception.TikaException: Error creating OOXML extractor
  * (12) com.polyspot.document.converter.ConversionException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.ParserDecorator$1@2b195ebd
- doc (11)
  * (9) com.polyspot.document.converter.ConversionException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.ParserDecorator$1@7b796022
  * (2) com.polyspot.document.converter.ConversionException: org.xml.sax.SAXException: Namespace http://www.w3.org/1999/xhtml not declared
- ppt (47)
  * (46) com.polyspot.document.converter.ConversionE