[jira] [Commented] (TIKA-1559) SecureContentHandler.SecureSAXException is not serializable
[ https://issues.apache.org/jira/browse/TIKA-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14333749#comment-14333749 ]

Alexander Pogrenbyak commented on TIKA-1559:

See this issue in the elasticsearch-mapper-attachments project for the use case:
https://github.com/elasticsearch/elasticsearch-mapper-attachments/issues/113

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TIKA-1559) SecureContentHandler.SecureSAXException is not serializable
Alexander Pogrenbyak created TIKA-1559:
--

Summary: SecureContentHandler.SecureSAXException is not serializable
Key: TIKA-1559
URL: https://issues.apache.org/jira/browse/TIKA-1559
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.7
Reporter: Alexander Pogrenbyak

I am using the [elasticsearch-mapper-attachments plugin|https://github.com/elasticsearch/elasticsearch-mapper-attachments]. It, in turn, uses the Tika library to convert attachments to plain text. If things do not go according to plan, it wraps the TikaException in its own exception and tries to send it down the wire to the client (me). Unfortunately, if the exception is not serializable, it gets converted to a somewhat cryptic *NotSerializableTransportException*, which makes it harder to parse on the client side.

{{SecureContentHandler.SecureSAXException}} is not serializable because it is an inner class of {{SecureContentHandler}}, which is itself not serializable and is present as the parent (enclosing) object on the exception object.

One fix would be to convert {{SecureSAXException}} to a *static* nested class, add {{SecureContentHandler}} as a transient field, and explicitly initialize it in the constructor. With these minimal modifications, {{SecureSAXException}} will become fully serializable.
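
The proposed fix can be illustrated with a small, self-contained sketch. It does not use Tika's real classes: {{Handler}}, {{InnerException}} and {{NestedException}} are hypothetical stand-ins for {{SecureContentHandler}} and the two variants of {{SecureSAXException}}, and {{canSerialize}} is a helper invented for the demonstration. The inner class keeps an implicit reference to its non-serializable enclosing instance, so writing it fails; the static nested class holds the handler in a transient field, so writing it succeeds.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;

public class SerializableNestedExceptionSketch {

    // Hypothetical stand-in for SecureContentHandler: deliberately NOT Serializable.
    static class Handler {

        // Inner (non-static) class: every instance carries a hidden, non-transient
        // reference to the enclosing Handler, which is used in isCausedBy().
        class InnerException extends Exception {
            InnerException(String message) {
                super(message);
            }

            boolean isCausedBy(Handler handler) {
                return Handler.this == handler;
            }
        }

        // Static nested class: the back-reference is an explicit transient field,
        // so it is skipped when the exception is serialized.
        static class NestedException extends Exception {
            private final transient Handler source;

            NestedException(String message, Handler source) {
                super(message);
                this.source = source;
            }

            boolean isCausedBy(Handler handler) {
                return source == handler;
            }
        }
    }

    // Returns true if Java serialization can write the object, false if it
    // hits a non-serializable object somewhere in the graph.
    static boolean canSerialize(Object o) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Handler handler = new Handler();
        System.out.println("inner class serializable:  "
                + canSerialize(handler.new InnerException("inner")));
        System.out.println("static nested serializable: "
                + canSerialize(new Handler.NestedException("nested", handler)));
    }
}
```

One consequence of the transient field is that the handler reference is null after deserialization, so an {{isCausedBy}}-style check is only meaningful in the JVM that threw the exception.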
[jira] [Commented] (TIKA-1526) ExternalParser should trap/ignore/workarround JDK-8047340 & JDK-8055301 so Turkish Tika users can still use non-external parsers
[ https://issues.apache.org/jira/browse/TIKA-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14333628#comment-14333628 ]

Uwe Schindler commented on TIKA-1526:
-

Thanks David!

> ExternalParser should trap/ignore/workarround JDK-8047340 & JDK-8055301 so Turkish Tika users can still use non-external parsers
>
> Key: TIKA-1526
> URL: https://issues.apache.org/jira/browse/TIKA-1526
> Project: Tika
> Issue Type: Wish
> Reporter: Hoss Man
>
> The JDK has numerous pain points regarding the Turkish locale, "posix_spawn" lowercasing being one of them:
> https://bugs.openjdk.java.net/browse/JDK-8047340
> https://bugs.openjdk.java.net/browse/JDK-8055301
> As of Tika 1.7, the TesseractOCRParser (which is an ExternalParser) is enabled and configured by default in Tika, and uses ExternalParser.check to see if tesseract is available. Because of the JDK bug, this means that Tika fails fast for Turkish users on BSD/UNIX variants (including Mac OS X) like so:
> {noformat}
> [junit4]> Throwable #1: java.lang.Error: posix_spawn is not a supported process launch mechanism on this platform.
> [junit4]>    at java.lang.UNIXProcess$1.run(UNIXProcess.java:105)
> [junit4]>    at java.lang.UNIXProcess$1.run(UNIXProcess.java:94)
> [junit4]>    at java.security.AccessController.doPrivileged(Native Method)
> [junit4]>    at java.lang.UNIXProcess.<init>(UNIXProcess.java:92)
> [junit4]>    at java.lang.ProcessImpl.start(ProcessImpl.java:130)
> [junit4]>    at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
> [junit4]>    at java.lang.Runtime.exec(Runtime.java:620)
> [junit4]>    at java.lang.Runtime.exec(Runtime.java:485)
> [junit4]>    at org.apache.tika.parser.external.ExternalParser.check(ExternalParser.java:344)
> [junit4]>    at org.apache.tika.parser.ocr.TesseractOCRParser.hasTesseract(TesseractOCRParser.java:117)
> [junit4]>    at org.apache.tika.parser.ocr.TesseractOCRParser.getSupportedTypes(TesseractOCRParser.java:90)
> [junit4]>    at org.apache.tika.parser.CompositeParser.getParsers(CompositeParser.java:81)
> [junit4]>    at org.apache.tika.parser.DefaultParser.getParsers(DefaultParser.java:95)
> [junit4]>    at org.apache.tika.parser.CompositeParser.getSupportedTypes(CompositeParser.java:229)
> [junit4]>    at org.apache.tika.parser.CompositeParser.getParsers(CompositeParser.java:81)
> [junit4]>    at org.apache.tika.parser.CompositeParser.getParser(CompositeParser.java:209)
> [junit4]>    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
> [junit4]>    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
> {noformat}
> ...unless they go out of their way to whitelist only the parsers they need, so that TesseractOCRParser (and any other ExternalParser) is never even check()ed.
> It would be nice if Tika's ExternalParser class added a hack/workaround similar to what was done in SOLR-6387 to trap these types of errors. In Solr we just propagate a better error explaining why Java hates the Turkish language:
> {code}
> } catch (Error err) {
>   if (err.getMessage() != null && (err.getMessage().contains("posix_spawn") || err.getMessage().contains("UNIXProcess"))) {
>     log.warn("Error forking command due to JVM locale bug (see https://issues.apache.org/jira/browse/SOLR-6387): " + err.getMessage());
>     return "(error executing: " + cmd + ")";
>   }
> }
> {code}
> ...but with Tika, it might be better for all ExternalParsers to just "opt out", as if they don't recognize the file type, when they detect this type of error from the check method (or perhaps it would be better if AutoDetectParser handled this? I'm not really sure how it would best fit into Tika's architecture).
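
The "opt out" idea from the issue description can be sketched roughly as follows. This is not the patch that went into Tika: {{isLocaleSpawnBug}} and {{checkAvailable}} are hypothetical helpers, and the {{Runnable}} stands in for whatever code actually launches the external process. The message test is the one from the Solr snippet quoted above. The first lines of {{main}} only show the general kind of locale-sensitive case mapping that bites Turkish users; they make no claim about the exact code path inside the JDK.

```java
import java.util.Locale;

public class TurkishLocaleSketch {

    // Same message test as the Solr workaround quoted above, applied to any Throwable.
    static boolean isLocaleSpawnBug(Throwable t) {
        String msg = t.getMessage();
        return t instanceof Error
                && msg != null
                && (msg.contains("posix_spawn") || msg.contains("UNIXProcess"));
    }

    // Hypothetical availability check: if launching fails because of the JVM
    // locale bug, report "not available" instead of letting the Error escape.
    static boolean checkAvailable(Runnable launcher) {
        try {
            launcher.run();
            return true;
        } catch (Error err) {
            if (isLocaleSpawnBug(err)) {
                return false; // opt out quietly; other parsers keep working
            }
            throw err; // unrelated errors are not swallowed
        }
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr-TR");

        // In the Turkish locale, 'i' uppercases to a dotted capital I (U+0130),
        // so a locale-sensitive round trip no longer matches the ASCII constant.
        System.out.println("POSIX_SPAWN".equals("posix_spawn".toUpperCase(turkish)));
        System.out.println("POSIX_SPAWN".equals("posix_spawn".toUpperCase(Locale.ROOT)));

        // Simulate the failure from the stack trace above.
        boolean available = checkAvailable(() -> {
            throw new Error("posix_spawn is not a supported process launch mechanism on this platform.");
        });
        System.out.println("external tool available: " + available);
    }
}
```

Rethrowing unrelated errors keeps the workaround narrow: only the known locale failure is turned into "tool not available".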
[jira] [Commented] (TIKA-1526) ExternalParser should trap/ignore/workarround JDK-8047340 & JDK-8055301 so Turkish Tika users can still use non-external parsers
[ https://issues.apache.org/jira/browse/TIKA-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14333470#comment-14333470 ]

David Pilato commented on TIKA-1526:

I just ran a test on my machine:
With Tika 1.7 and Locale=tr, it fails.
With Tika 1.8-SNAPSHOT and Locale=tr, it passes.
So I can confirm that the current patch in trunk fixes it. \o/
Hope this helps.
[jira] [Comment Edited] (TIKA-1558) Create a Parser Blacklist
[ https://issues.apache.org/jira/browse/TIKA-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14333400#comment-14333400 ]

Uwe Schindler edited comment on TIKA-1558 at 2/23/15 4:06 PM:
--

Hi,
Lucene uses SPI for its index codecs, so we are familiar with SPI. But we have no problems with the "order" of the classpath. We just preserve what Java delivers in ClassLoader.getResources(). But order is not really important (it was important for testing in Lucene 4.x, but that's history since last Friday).
We already have custom TikaConfig support in the extraction module, so I am happy to use that. In our case we would only put the SPI exclusion into our test classpath. But TikaConfig is also fine.

was (Author: thetaphi):
Hi,
Lucene uses SPI for its index codecs, so we are familiar with SPI. But we have no problems with the "order" of the classpath. We just preserve what Java delivers in ClassLoader.getResources(). But order is not really important (it was important for testing in Lucene 4.x, but that's history since last Friday).
We already have a custom TikaConfig class so I am happy to use that. In our case we would only put the SPI exclusion into our test classpath. But TikaConfig is also fine.

> Create a Parser Blacklist
> -
>
> Key: TIKA-1558
> URL: https://issues.apache.org/jira/browse/TIKA-1558
> Project: Tika
> Issue Type: New Feature
> Reporter: Tyler Palsulich
> Assignee: Tyler Palsulich
> Fix For: 1.8
>
> As talked about in TIKA-1555 and TIKA-1557, it would be nice to be able to disable Parsers without pulling their dependencies out. In some cases (e.g. disabling all ExternalParsers), there may not be an easy way to exclude the dependencies via Maven.
> So, an initial design would be to include another file like {{META-INF/services/org.apache.tika.parser.Parser.blacklist}}. We create a new method {{ServiceLoader#loadServiceProviderBlacklist}}. Then, in {{ServiceLoader#loadServiceProviders}}, we remove all elements of the list that are assignable to an element in {{ServiceLoader#loadServiceProviderBlacklist}}.
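
The filtering step of this design can be sketched as below. It is an illustration under assumptions, not Tika's {{ServiceLoader}} code: the blacklist is passed in as an already-loaded collection of classes rather than read from a {{META-INF/services}} file, {{removeBlacklisted}} is a hypothetical helper, and the parser types are dummy stand-ins. A provider is dropped when its class is assignable to any blacklisted class, so blacklisting a base class also removes its subclasses.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

public class ProviderBlacklistSketch {

    // Dummy stand-ins for a service interface and some implementations.
    interface DemoParser { }
    static class PlainTextParser implements DemoParser { }
    static class ExternalToolParser implements DemoParser { }
    static class OcrToolParser extends ExternalToolParser { }

    // Keeps only the providers whose class is not assignable to any blacklisted class.
    static <T> List<T> removeBlacklisted(List<T> providers, Collection<Class<?>> blacklist) {
        List<T> kept = new ArrayList<>();
        for (T provider : providers) {
            boolean blocked = false;
            for (Class<?> banned : blacklist) {
                if (banned.isAssignableFrom(provider.getClass())) {
                    blocked = true;
                    break;
                }
            }
            if (!blocked) {
                kept.add(provider);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<DemoParser> loaded = Arrays.asList(
                new PlainTextParser(), new ExternalToolParser(), new OcrToolParser());
        Collection<Class<?>> blacklist =
                Collections.<Class<?>>singletonList(ExternalToolParser.class);

        for (DemoParser parser : removeBlacklisted(loaded, blacklist)) {
            System.out.println(parser.getClass().getSimpleName());
        }
    }
}
```

Matching by assignability rather than by exact class name is what makes a single entry such as "all ExternalParsers" possible, at the cost of also excluding subclasses a user may have wanted to keep.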
Re: help for Use of Tika
Hi Brijesh,

Thank you for your email! Please send a blank email to dev-subscr...@tika.apache.org and follow the instructions from there to get subscribed. We’d be happy to talk with you about your project ideas! :)

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++

-Original Message-
From: Brijesh saini
Date: Monday, February 23, 2015 at 7:12 AM
To: "dev-ow...@tika.apache.org"
Subject: help for Use of Tika

>Sir/Ma'am,
>
>Please help with our project development.
>
>We need to build a tool that parses a file (Word, PDF and many more) to extract its content, so that we can use that content further.
>
>So please help us with how we can use Tika's API in our project to create that parser,
>
>or subscribe me to Tika's mailing list so I can discuss my topic with the developers.
>
>-Thanks
[jira] [Commented] (TIKA-1558) Create a Parser Blacklist
[ https://issues.apache.org/jira/browse/TIKA-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1494#comment-1494 ]

Chris A. Mattmann commented on TIKA-1558:
-

I agree that longer term we should move more to a config file, but there is a lot of work that needs to be done between now and then. This is a good interim solution and the code can keep evolving, so if someone comes up with a better patch by all means. Nick's solution was great; we now have a solution that Tyler added; and later maybe we can trump both of them with the config file.
[jira] [Commented] (TIKA-1558) Create a Parser Blacklist
[ https://issues.apache.org/jira/browse/TIKA-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14333266#comment-14333266 ]

Tim Allison commented on TIKA-1558:
---

I agree with Nick that I'd prefer to migrate more and more control to a config file than relying on SPI in the long term. As Nick observes on TIKA-1557, there is a way to do this now with the config file, but more work remains before we're ready to fully move to a config file.
[~thetaphi], from a Solr/DIH perspective, would Solr users prefer SPI or a config file?