[jira] [Commented] (TIKA-1559) SecureContentHandler.SecureSAXException is not serializable

2015-02-23 Thread Alexander Pogrenbyak (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14333749#comment-14333749
 ] 

Alexander Pogrenbyak commented on TIKA-1559:


See this issue in the elasticsearch-mapper-attachments project for the use case: 
https://github.com/elasticsearch/elasticsearch-mapper-attachments/issues/113

> SecureContentHandler.SecureSAXException is not serializable
> ---
>
> Key: TIKA-1559
> URL: https://issues.apache.org/jira/browse/TIKA-1559
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.7
> Reporter: Alexander Pogrenbyak
>
> I am using the [elasticsearch-mapper-attachments 
> plugin|https://github.com/elasticsearch/elasticsearch-mapper-attachments].
> It, in turn, uses the Tika library to convert attachments to plain text.
> If things do not go according to plan, it wraps the TikaException in its own 
> exception and tries to send it down the wire to the client (me).
> Unfortunately, if the exception is not serializable, it gets converted to a 
> somewhat cryptic *NotSerializableTransportException*, which makes it harder 
> to parse on the client side.
> {{SecureContentHandler.SecureSAXException}} is not serializable because it is 
> an inner class of {{SecureContentHandler}}, which is itself not serializable 
> and is present as the parent object on the exception object.
> One fix would be to convert {{SecureSAXException}} to a *static* nested class, 
> add {{SecureContentHandler}} as a transient field, and explicitly initialize 
> it in the constructor.  With these minimal modifications, 
> {{SecureSAXException}} will become fully serializable.
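
The proposed fix can be illustrated with a minimal sketch. It does not use Tika's classes: `Handler`, `InnerException` and `NestedException` are hypothetical stand-ins for `SecureContentHandler` and the two possible shapes of `SecureSAXException`. It only shows, under those assumptions, why an inner class drags its non-serializable outer instance into serialization and why a static nested class with a transient field does not.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;

class InnerExceptionSerializationSketch {

    // Hypothetical stand-in for SecureContentHandler: deliberately NOT Serializable.
    static class Handler {

        // Non-static inner class: each instance holds a hidden reference to the
        // enclosing Handler, and serialization tries to write that reference too.
        class InnerException extends Exception {
            private static final long serialVersionUID = 1L;

            InnerException(String message) {
                super(message);
            }

            boolean isCausedBy(Handler other) {
                return Handler.this == other; // uses the hidden outer reference
            }
        }

        // Static nested class: no hidden outer reference. The handler lives in an
        // explicit transient field, so serialization skips it.
        static class NestedException extends Exception {
            private static final long serialVersionUID = 1L;
            private final transient Handler handler;

            NestedException(String message, Handler handler) {
                super(message);
                this.handler = handler;
            }

            boolean isCausedBy(Handler other) {
                return handler == other; // null after deserialization
            }
        }

        Exception inner(String message) {
            return new InnerException(message);
        }

        Exception nested(String message) {
            return new NestedException(message, this);
        }
    }

    // Returns true if Java serialization can write the object graph.
    static boolean isSerializable(Object o) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Handler handler = new Handler();
        System.out.println("inner class:         " + isSerializable(handler.inner("limit exceeded")));
        System.out.println("static nested class: " + isSerializable(handler.nested("limit exceeded")));
    }
}
```

One trade-off of the transient field is that the handler reference is lost on the receiving side, so any check that compares against the original handler only makes sense before the exception is sent over the wire.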



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TIKA-1559) SecureContentHandler.SecureSAXException is not serializable

2015-02-23 Thread Alexander Pogrenbyak (JIRA)
Alexander Pogrenbyak created TIKA-1559:
--

 Summary: SecureContentHandler.SecureSAXException is not 
serializable
 Key: TIKA-1559
 URL: https://issues.apache.org/jira/browse/TIKA-1559
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.7
Reporter: Alexander Pogrenbyak


I am using the [elasticsearch-mapper-attachments 
plugin|https://github.com/elasticsearch/elasticsearch-mapper-attachments].

It, in turn, uses the Tika library to convert attachments to plain text.

If things do not go according to plan, it wraps the TikaException in its own 
exception and tries to send it down the wire to the client (me).

Unfortunately, if the exception is not serializable, it gets converted to a 
somewhat cryptic *NotSerializableTransportException*, which makes it harder to 
parse on the client side.

{{SecureContentHandler.SecureSAXException}} is not serializable because it is 
an inner class of {{SecureContentHandler}}, which is itself not serializable 
and is present as the parent object on the exception object.

One fix would be to convert {{SecureSAXException}} to a *static* nested class, 
add {{SecureContentHandler}} as a transient field, and explicitly initialize it 
in the constructor.  With these minimal modifications, {{SecureSAXException}} 
will become fully serializable.





[jira] [Commented] (TIKA-1526) ExternalParser should trap/ignore/workarround JDK-8047340 & JDK-8055301 so Turkish Tika users can still use non-external parsers

2015-02-23 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14333628#comment-14333628
 ] 

Uwe Schindler commented on TIKA-1526:
-

Thanks David!

> ExternalParser should trap/ignore/workarround JDK-8047340 & JDK-8055301 so 
> Turkish Tika users can still use non-external parsers
> 
>
> Key: TIKA-1526
> URL: https://issues.apache.org/jira/browse/TIKA-1526
> Project: Tika
> Issue Type: Wish
> Reporter: Hoss Man
>
> The JDK has numerous pain points regarding the Turkish locale, "posix_spawn" 
> lowercasing being one of them...
> https://bugs.openjdk.java.net/browse/JDK-8047340
> https://bugs.openjdk.java.net/browse/JDK-8055301
> As of Tika 1.7, the TesseractOCRParser (which is an ExternalParser) is 
> enabled & configured by default in Tika, and uses ExternalParser.check to see 
> if tesseract is available -- but because of the JDK bug, this means that Tika 
> fails fast for Turkish users on BSD/UNIX variants (including MacOSX) like 
> so...
> {noformat}
>   [junit4]> Throwable #1: java.lang.Error: posix_spawn is not a supported process launch mechanism on this platform.
>   [junit4]>   at java.lang.UNIXProcess$1.run(UNIXProcess.java:105)
>   [junit4]>   at java.lang.UNIXProcess$1.run(UNIXProcess.java:94)
>   [junit4]>   at java.security.AccessController.doPrivileged(Native Method)
>   [junit4]>   at java.lang.UNIXProcess.<init>(UNIXProcess.java:92)
>   [junit4]>   at java.lang.ProcessImpl.start(ProcessImpl.java:130)
>   [junit4]>   at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
>   [junit4]>   at java.lang.Runtime.exec(Runtime.java:620)
>   [junit4]>   at java.lang.Runtime.exec(Runtime.java:485)
>   [junit4]>   at org.apache.tika.parser.external.ExternalParser.check(ExternalParser.java:344)
>   [junit4]>   at org.apache.tika.parser.ocr.TesseractOCRParser.hasTesseract(TesseractOCRParser.java:117)
>   [junit4]>   at org.apache.tika.parser.ocr.TesseractOCRParser.getSupportedTypes(TesseractOCRParser.java:90)
>   [junit4]>   at org.apache.tika.parser.CompositeParser.getParsers(CompositeParser.java:81)
>   [junit4]>   at org.apache.tika.parser.DefaultParser.getParsers(DefaultParser.java:95)
>   [junit4]>   at org.apache.tika.parser.CompositeParser.getSupportedTypes(CompositeParser.java:229)
>   [junit4]>   at org.apache.tika.parser.CompositeParser.getParsers(CompositeParser.java:81)
>   [junit4]>   at org.apache.tika.parser.CompositeParser.getParser(CompositeParser.java:209)
>   [junit4]>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
>   [junit4]>   at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
> {noformat}
> ...unless they go out of their way to whitelist only the parsers they 
> need/want so TesseractOCRParser (and any other ExternalParsers) will never 
> even be check()ed.
> It would be nice if Tika's ExternalParser class added a hack/workaround 
> similar to what was done in SOLR-6387 to trap these types of errors. 
> In Solr we just propagate a better error explaining why Java hates the 
> Turkish language...
> {code}
> } catch (Error err) {
>   if (err.getMessage() != null && (err.getMessage().contains("posix_spawn")
>       || err.getMessage().contains("UNIXProcess"))) {
>     log.warn("Error forking command due to JVM locale bug (see "
>         + "https://issues.apache.org/jira/browse/SOLR-6387): " + err.getMessage());
>     return "(error executing: " + cmd + ")";
>   }
> }
> {code}
> ...but with Tika, it might be better for all ExternalParsers to just "opt 
> out" as if they don't recognize the filetype when they detect this type of 
> error from the check method (or perhaps it would be better if 
> AutoDetectParser handled this? ... I'm not really sure how it would best fit 
> into Tika's architecture)
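
The "opt out" idea in the description can be sketched as follows. This is not Tika's `ExternalParser.check` and not the patch that went into trunk: `Launcher`, `check` and `isLocaleLaunchBug` are hypothetical names, and the launcher is an interface so the sketch runs without forking a real process. The message test is copied from the Solr snippet quoted above. The `upperInTurkish` helper shows the kind of locale-sensitive case conversion that the linked JDK reports appear to be about.

```java
import java.util.Locale;

class ExternalCheckSketch {

    // Hypothetical launcher abstraction, so the sketch runs without forking a process.
    interface Launcher {
        int run(String... command) throws Exception;
    }

    // Upper-cases with Turkish rules, where 'i' maps to dotted capital 'İ' (U+0130).
    static String upperInTurkish(String s) {
        return s.toUpperCase(Locale.forLanguageTag("tr-TR"));
    }

    // Same message test as the Solr snippet quoted in the issue description.
    static boolean isLocaleLaunchBug(Error err) {
        String msg = err.getMessage();
        return msg != null && (msg.contains("posix_spawn") || msg.contains("UNIXProcess"));
    }

    // Returns true only if the command ran and exited with status 0.
    // A launch failure caused by the JVM locale bug is reported as "tool not available".
    static boolean check(Launcher launcher, String... command) {
        try {
            return launcher.run(command) == 0;
        } catch (Error err) {
            if (isLocaleLaunchBug(err)) {
                return false; // opt out instead of failing every parse
            }
            throw err; // unrelated errors still propagate
        } catch (Exception e) {
            return false; // for example, the executable is missing
        }
    }

    public static void main(String[] args) {
        // The Turkish upper-case form no longer matches the ASCII constant.
        System.out.println(upperInTurkish("posix_spawn").equals("POSIX_SPAWN"));

        Launcher broken = command -> {
            throw new Error("posix_spawn is not a supported process launch mechanism on this platform.");
        };
        Launcher healthy = command -> 0;

        System.out.println(check(broken, "tesseract"));
        System.out.println(check(healthy, "tesseract"));
    }
}
```

Returning `false` from the check keeps the other parsers usable, at the cost of silently disabling the external tool, so a real implementation would probably also log a warning as Solr does.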





[jira] [Commented] (TIKA-1526) ExternalParser should trap/ignore/workarround JDK-8047340 & JDK-8055301 so Turkish Tika users can still use non-external parsers

2015-02-23 Thread David Pilato (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14333470#comment-14333470
 ] 

David Pilato commented on TIKA-1526:


I just ran a test on my machine:

With Tika 1.7 and Locale=tr, it fails.
With Tika 1.8-SNAPSHOT and Locale=tr, it passes.

So I can confirm that the current patch in trunk fixes it. \o/

Hope this helps.






[jira] [Comment Edited] (TIKA-1558) Create a Parser Blacklist

2015-02-23 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14333400#comment-14333400
 ] 

Uwe Schindler edited comment on TIKA-1558 at 2/23/15 4:06 PM:
--

Hi,
Lucene uses SPI for its index codecs, so we are familiar with SPI. But we have 
no problems with "order" of classpath. We just preserve what Java delivers in 
ClassLoader.getResources(). But order is not really important (it was important 
for testing in Lucene 4.x, but that's history since last Friday).

We already have custom TikaConfig support in the extraction module, so I am 
happy to use that. In our case we would only put the SPI exclusion into our 
test classpath. But TikaConfig is also fine.


was (Author: thetaphi):
Hi,
Lucene uses SPI for its index codecs, so we are familiar with SPI. But we have 
no problems with "order" of classpath. We just preserve what Java delivers in 
ClassLoader.getResources(). But order is not really important (it was important 
for testing in Lucene 4.x, but that's history since last Friday).

We already have a custom TikaConfig class so I am happy to use that. In our 
case we would only put the SPI exclusion into our test classpath. But 
TikaConfig is also fine.
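
A minimal sketch of what "preserve what Java delivers in ClassLoader.getResources()" can look like. This is not Lucene's loader code: the class and method names are hypothetical, and the service name used in `main` is a placeholder.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.util.Collections;
import java.util.List;

class SpiResourceOrderSketch {

    // Lists every copy of a META-INF/services file visible to the loader,
    // in exactly the order ClassLoader.getResources() returns them.
    static List<URL> serviceFiles(ClassLoader loader, String serviceName) {
        try {
            return Collections.list(loader.getResources("META-INF/services/" + serviceName));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // "org.example.Codec" is a placeholder service name, not a real Lucene or Tika SPI.
        List<URL> files = serviceFiles(ClassLoader.getSystemClassLoader(), "org.example.Codec");
        for (URL url : files) {
            System.out.println(url);
        }
        System.out.println(files.size() + " service file(s) found");
    }
}
```

Because the list is never sorted or deduplicated, whatever order the class loader reports is the order the providers are seen in.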

> Create a Parser Blacklist
> -
>
> Key: TIKA-1558
> URL: https://issues.apache.org/jira/browse/TIKA-1558
> Project: Tika
> Issue Type: New Feature
> Reporter: Tyler Palsulich
> Assignee: Tyler Palsulich
> Fix For: 1.8
>
>
> As talked about in TIKA-1555 and TIKA-1557, it would be nice to be able to 
> disable Parsers without pulling their dependencies out. In some cases (e.g. 
> disable all ExternalParsers), there may not be an easy way to exclude the 
> dependencies via Maven.
> So, an initial design would be to include another file like 
> {{META-INF/services/org.apache.tika.parser.Parser.blacklist}}. We create a 
> new method {{ServiceLoader#loadServiceProviderBlacklist}}. Then, in 
> {{ServiceLoader#loadServiceProviders}}, we remove all elements of the list 
> that are assignable to an element in 
> {{ServiceLoader#loadServiceProviderBlacklist}}.
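
The filtering step of that initial design can be sketched like this. It is not Tika's `ServiceLoader`: `removeBlacklisted` and the small parser types are hypothetical, and reading the blacklist file itself is left out. The sketch only shows the "assignable to" rule, under which blacklisting a class also removes its subclasses.

```java
import java.util.ArrayList;
import java.util.List;

class ProviderBlacklistSketch {

    // Hypothetical provider types, standing in for real parsers.
    interface Parser {}
    static class ExternalParser implements Parser {}
    static class OcrParser extends ExternalParser {}
    static class TextParser implements Parser {}

    // Keeps only the providers whose class is not assignable to any blacklisted type.
    static <T> List<T> removeBlacklisted(List<T> providers, List<Class<?>> blacklist) {
        List<T> kept = new ArrayList<>();
        for (T provider : providers) {
            boolean blocked = false;
            for (Class<?> banned : blacklist) {
                if (banned.isAssignableFrom(provider.getClass())) {
                    blocked = true; // same class or a subclass of a blacklisted type
                    break;
                }
            }
            if (!blocked) {
                kept.add(provider);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Parser> loaded = new ArrayList<>();
        loaded.add(new ExternalParser());
        loaded.add(new OcrParser());
        loaded.add(new TextParser());

        List<Class<?>> blacklist = new ArrayList<>();
        blacklist.add(ExternalParser.class);

        for (Parser parser : removeBlacklisted(loaded, blacklist)) {
            System.out.println(parser.getClass().getSimpleName());
        }
    }
}
```

Matching by assignability rather than by exact class name is what makes a single entry enough for the "disable all ExternalParsers" case mentioned in the description.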





Re: help for Use of Tika

2015-02-23 Thread Mattmann, Chris A (3980)
Hi Brijesh,

Thank you for your email! Please send a blank email to
dev-subscr...@tika.apache.org and follow the instructions
from there to get subscribed.

We’d be happy to talk with you about your project ideas! :)

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++






-Original Message-
From: Brijesh saini 
Date: Monday, February 23, 2015 at 7:12 AM
To: "dev-ow...@tika.apache.org" 
Subject: help for Use of Tika

>Sir/Ma'am,
>
>   Please help with our project development.
>
>We need to build a tool that parses a file (Word, PDF, and many
>more) and extracts its content so that we can use that content
>further.
>
>So please help us with how we can use Tika's API in our project to create
>that parser,
>
>or subscribe me to Tika's mailing list so I can discuss my topic with
>the developers.
>
>
>
>-Thanks



[jira] [Commented] (TIKA-1558) Create a Parser Blacklist

2015-02-23 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14333400#comment-14333400
 ] 

Uwe Schindler commented on TIKA-1558:
-

Hi,
Lucene uses SPI for its index codecs, so we are familiar with SPI. But we have 
no problems with "order" of classpath. We just preserve what Java delivers in 
ClassLoader.getResources(). But order is not really important (it was important 
for testing in Lucene 4.x, but that's history since last Friday).

We already have a custom TikaConfig class so I am happy to use that. In our 
case we would only put the SPI exclusion into our test classpath. But 
TikaConfig is also fine.






[jira] [Commented] (TIKA-1558) Create a Parser Blacklist

2015-02-23 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1494#comment-1494
 ] 

Chris A. Mattmann commented on TIKA-1558:
-

I agree that longer term we should move more to a config file, but there is a 
lot of work that needs to be done between now and then. This is a good interim 
solution and the code can keep evolving, so if someone comes up with a better 
patch, by all means go ahead. Nick's solution was great; we now have a solution 
that Tyler added; and later maybe we can trump both of them with the config file.






[jira] [Commented] (TIKA-1558) Create a Parser Blacklist

2015-02-23 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14333266#comment-14333266
 ] 

Tim Allison commented on TIKA-1558:
---

I agree with Nick: in the long term, I'd prefer to migrate more and more 
control to a config file rather than rely on SPI.  As Nick observes on 
TIKA-1557, there is a way to do this now with the config file, but more work 
remains before we're ready to fully move to a config file.

[~thetaphi], from a Solr/DIH perspective, would Solr users prefer SPI or a 
config file? 



