[jira] [Created] (TIKA-1553) Let's add an evil parser to be used in testing parser drivers
Tim Allison created TIKA-1553: - Summary: Let's add an evil parser to be used in testing parser drivers Key: TIKA-1553 URL: https://issues.apache.org/jira/browse/TIKA-1553 Project: Tika Issue Type: Test Reporter: Tim Allison Assignee: Tim Allison Priority: Minor As part of TIKA-1302 and as part of making Tika more robust generally, it would be useful to have an evil parser that will throw exceptions/errors and hang for lengths of time. This will allow us to test timeouts and handling of exceptions and errors in tika-server and in tika-batch. We could also use this for tests with ForkParser. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
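A minimal sketch of the kind of driver test this issue has in mind, not the EvilParser that was eventually committed. It uses no Tika classes: `evilTask`, its mode names, and `runWithTimeout` are all hypothetical stand-ins for "a parse call that sleeps, throws an exception, or throws an error" and "a driver that must survive it".

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class EvilTaskDemo {

    /** Hypothetical stand-in for a misbehaving parse call; the mode names are illustrative. */
    static Callable<String> evilTask(String mode, long sleepMillis) {
        return () -> {
            switch (mode) {
                case "sleep":
                    Thread.sleep(sleepMillis);   // simulates a hang
                    return "slept";
                case "null_pointer":
                    throw new NullPointerException("simulated failure");
                case "fake_oom":
                    // An Error is thrown, but no memory is actually exhausted.
                    throw new OutOfMemoryError("simulated");
                default:
                    return "nothing bad";
            }
        };
    }

    /** Runs the task on a worker thread and reports how it ended instead of propagating. */
    static String runWithTimeout(Callable<String> task, long timeoutMillis) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Future<String> future = pool.submit(task);
        try {
            return "ok: " + future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true);                 // interrupt the sleeping worker
            return "timeout";
        } catch (ExecutionException e) {
            // Exceptions and Errors thrown by the task both arrive here as the cause.
            return "failed: " + e.getCause().getClass().getSimpleName();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "interrupted";
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        for (String mode : new String[] {"nothing_bad", "sleep", "null_pointer", "fake_oom"}) {
            System.out.println(mode + " -> " + runWithTimeout(evilTask(mode, 5000), 200));
        }
    }
}
```

A real driver such as tika-server or tika-batch would presumably need more than this (for example, a task that ignores interruption cannot be stopped by `cancel(true)`), which is part of why a dedicated test parser is useful.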
[jira] [Resolved] (TIKA-1553) Let's add an evil parser to be used in testing parser drivers
[ https://issues.apache.org/jira/browse/TIKA-1553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-1553. Resolution: Fixed r1661129
[jira] [Commented] (TIKA-1553) Let's add an evil parser to be used in testing parser drivers
[ https://issues.apache.org/jira/browse/TIKA-1553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14328991#comment-14328991 ]
Hudson commented on TIKA-1553:
SUCCESS: Integrated in tika-trunk-jdk1.7 #499 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/499/])
TIKA-1553: add an EvilParser for testing purposes (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1661129)
* /tika/trunk/CHANGES.txt
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/TikaTest.java
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/evil
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/evil/EvilParser.java
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/evil/EvilParserTest.java
* /tika/trunk/tika-parsers/src/test/resources/META-INF
* /tika/trunk/tika-parsers/src/test/resources/META-INF/services
* /tika/trunk/tika-parsers/src/test/resources/META-INF/services/org.apache.tika.parser.Parser
* /tika/trunk/tika-parsers/src/test/resources/org
* /tika/trunk/tika-parsers/src/test/resources/org/apache
* /tika/trunk/tika-parsers/src/test/resources/org/apache/tika
* /tika/trunk/tika-parsers/src/test/resources/org/apache/tika/mime
* /tika/trunk/tika-parsers/src/test/resources/org/apache/tika/mime/custom-mimetypes.xml
* /tika/trunk/tika-parsers/src/test/resources/test-documents/evil
* /tika/trunk/tika-parsers/src/test/resources/test-documents/evil/fake_oom.evil
* /tika/trunk/tika-parsers/src/test/resources/test-documents/evil/heavy_hang.evil
* /tika/trunk/tika-parsers/src/test/resources/test-documents/evil/nothing_bad.evil
* /tika/trunk/tika-parsers/src/test/resources/test-documents/evil/null_pointer.evil
* /tika/trunk/tika-parsers/src/test/resources/test-documents/evil/null_pointer_no_msg.evil
* /tika/trunk/tika-parsers/src/test/resources/test-documents/evil/real_oom.evil
* /tika/trunk/tika-parsers/src/test/resources/test-documents/evil/sleep.evil
[jira] [Commented] (TIKA-1557) Create TesseractOCR Option to Never Run
[ https://issues.apache.org/jira/browse/TIKA-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329523#comment-14329523 ]
Uwe Schindler commented on TIKA-1557:
I would not make this a special option only for tesseract. As said on TIKA-1555, it would be better to have a general way to blacklist some parsers through TikaConfig. Currently you have to maintain the whole list of parsers (or parse META-INF yourself) and pass the full list to TikaConfig / AutodetectParser / CompositeParser. I would like to have an option in TIKA config to blacklist parsers. Ideally this should also work for subclasses, so one could disable all ForkParser subclasses by adding ForkParser to blacklist.
Create TesseractOCR Option to Never Run
Key: TIKA-1557
URL: https://issues.apache.org/jira/browse/TIKA-1557
Project: Tika
Issue Type: New Feature
Components: parser
Reporter: Tyler Palsulich
Assignee: Tyler Palsulich
Priority: Minor
Fix For: 1.8
As brought up in TIKA-1555, TesseractOCRParser should have an option to never be run. So, we can add an {{enabled}} option to the Config.
[jira] [Commented] (TIKA-1557) Create TesseractOCR Option to Never Run
[ https://issues.apache.org/jira/browse/TIKA-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329547#comment-14329547 ]
Luis Filipe Nassif commented on TIKA-1557:
I think the same problem that happens with TesseractOCRParser can occur with any ExternalParser, like StringsParser or ffmpeg. Maybe it will be better to add this option to ExternalParser?
[jira] [Commented] (TIKA-1557) Create TesseractOCR Option to Never Run
[ https://issues.apache.org/jira/browse/TIKA-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329509#comment-14329509 ]
David Pilato commented on TIKA-1557:
Thanks! I'd not qualify it as a bug though. :)
[jira] [Comment Edited] (TIKA-1557) Create TesseractOCR Option to Never Run
[ https://issues.apache.org/jira/browse/TIKA-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329523#comment-14329523 ]
Uwe Schindler edited comment on TIKA-1557 at 2/20/15 9:05 PM:
I would not make this a special option only for tesseract. As said on TIKA-1555, it would be better to have a general way to blacklist some parsers through TikaConfig. Currently you have to maintain the whole list of parsers (or parse META-INF yourself) and pass the full list to TikaConfig / AutodetectParser / CompositeParser. I would like to have an option in TIKA config to blacklist parsers. Ideally this should also work for subclasses, so one could disable all ExternalParser subclasses by adding ExternalParser to blacklist.
was (Author: thetaphi):
I would not make this a special option only for tesseract. As said on TIKA-1555, it would be better to have a general way to blacklist some parsers through TikaConfig. Currently you have to maintain the whole list of parsers (or parse META-INF yourself) and pass the full list to TikaConfig / AutodetectParser / CompositeParser. I would like to have an option in TIKA config to blacklist parsers. Ideally this should also work for subclasses, so one could disable all ForkParser subclasses by adding ForkParser to blacklist.
[jira] [Created] (TIKA-1558) Create a Parser Blacklist
Tyler Palsulich created TIKA-1558: - Summary: Create a Parser Blacklist Key: TIKA-1558 URL: https://issues.apache.org/jira/browse/TIKA-1558 Project: Tika Issue Type: New Feature Reporter: Tyler Palsulich As talked about in TIKA-1555 and TIKA-1557, it would be nice to be able to disable Parsers without pulling their dependencies out. In some cases (e.g. disable all ExternalParsers), there may not be an easy way to exclude the dependencies via Maven. So, an initial design would be to include another file like {{META-INF/services/org.apache.tika.parser.Parser.blacklist}}. We create a new method {{ServiceLoader#loadServiceProviderBlacklist}}. Then, in {{ServiceLoader#loadServiceProviders}}, we remove all elements of the list that are assignable to an element in {{ServiceLoader#loadServiceProviderBlacklist}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
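The "remove all elements that are assignable to an element in the blacklist" step can be sketched as follows. This is only an illustration of the filtering idea, not the code in Tika's ServiceLoader: `removeBlacklisted` and the four parser-like classes are hypothetical names invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class BlacklistFilterDemo {

    // Hypothetical stand-ins for a small parser class hierarchy.
    static class BaseParser {}
    static class ExternalLikeParser extends BaseParser {}
    static class OcrLikeParser extends ExternalLikeParser {}
    static class TextLikeParser extends BaseParser {}

    /** Keeps only providers whose class is neither blacklisted nor a subclass of a blacklisted class. */
    static <T> List<T> removeBlacklisted(List<T> providers, List<? extends Class<?>> blacklist) {
        List<T> kept = new ArrayList<>();
        for (T provider : providers) {
            boolean blocked = false;
            for (Class<?> banned : blacklist) {
                // True when provider's class is 'banned' itself or any subclass of it.
                if (banned.isAssignableFrom(provider.getClass())) {
                    blocked = true;
                    break;
                }
            }
            if (!blocked) {
                kept.add(provider);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<BaseParser> loaded = Arrays.asList(
                new ExternalLikeParser(), new OcrLikeParser(), new TextLikeParser());
        List<BaseParser> kept = removeBlacklisted(loaded, Arrays.asList(ExternalLikeParser.class));
        for (BaseParser p : kept) {
            System.out.println(p.getClass().getSimpleName());
        }
    }
}
```

Using `isAssignableFrom` is what makes a single blacklist entry cover a whole family: listing the base class also removes every subclass, which matches the "disable all ExternalParsers" use case mentioned in the issue.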
[jira] [Comment Edited] (TIKA-1557) Create TesseractOCR Option to Never Run
[ https://issues.apache.org/jira/browse/TIKA-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329523#comment-14329523 ]
Uwe Schindler edited comment on TIKA-1557 at 2/20/15 8:42 PM:
I would not make this a special option only for tesseract. As said on TIKA-1555, it would be better to have a general way to blacklist some parsers through TikaConfig. Currently you have to maintain the whole list of parsers (or parse META-INF yourself) and pass the full list to TikaConfig / AutodetectParser / CompositeParser. I would like to have an option in TIKA config to blacklist parsers. Ideally this should also work for subclasses, so one could disable all ForkParser subclasses by adding ForkParser to blacklist.
was (Author: thetaphi):
I would not make this a special option only for tesseract. As said on TIKA-1555, it would be better to have a general way to blacklist some parsers through TikaConfig. Currently you have to maintain the whole list of parsers (or parse META-INF yourself) and pass the full list to TikaConfig / AutodetectParser / CompositeParser. I would like to have an option in TIKA config to blacklist parsers. Ideally this should work alos for subclasses, so one could disable all ForkParser subclasses by adding ForkParser to blacklist.
[jira] [Updated] (TIKA-1557) Create TesseractOCR Option to Never Run
[ https://issues.apache.org/jira/browse/TIKA-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich updated TIKA-1557: Issue Type: New Feature (was: Bug)
[jira] [Closed] (TIKA-1557) Create TesseractOCR Option to Never Run
[ https://issues.apache.org/jira/browse/TIKA-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich closed TIKA-1557. - Resolution: Won't Fix Fix Version/s: (was: 1.8) Closing this as Won't Fix for a clean record. I'll open a new issue regarding a Parser blacklist. Create TesseractOCR Option to Never Run --- Key: TIKA-1557 URL: https://issues.apache.org/jira/browse/TIKA-1557 Project: Tika Issue Type: New Feature Components: parser Reporter: Tyler Palsulich Assignee: Tyler Palsulich Priority: Minor Attachments: TIKA-1557.palsulich.patch As brought up in TIKA-1555, TesseractOCRParser should have an option to never be run. So, we can add an {{enabled}} option to the Config. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1558) Create a Parser Blacklist
[ https://issues.apache.org/jira/browse/TIKA-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14330010#comment-14330010 ]
Hudson commented on TIKA-1558:
SUCCESS: Integrated in tika-trunk-jdk1.7 #501 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/501/])
TIKA-1558. Enable blacklisting of Parsers and other services with a servicename.blacklist META-INF file. (tpalsulich: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1661284)
* /tika/trunk/CHANGES.txt
* /tika/trunk/tika-core/src/main/java/org/apache/tika/config/ServiceLoader.java
* /tika/trunk/tika-core/src/test/java/org/apache/tika/parser/BlacklistedParser.java
* /tika/trunk/tika-core/src/test/java/org/apache/tika/parser/BlacklistedParserSubclass.java
* /tika/trunk/tika-core/src/test/java/org/apache/tika/parser/BlacklistedParserTest.java
* /tika/trunk/tika-core/src/test/resources/META-INF
* /tika/trunk/tika-core/src/test/resources/META-INF/services
* /tika/trunk/tika-core/src/test/resources/META-INF/services/org.apache.tika.parser.Parser
* /tika/trunk/tika-core/src/test/resources/META-INF/services/org.apache.tika.parser.Parser.blacklist
* /tika/trunk/tika-core/src/test/resources/org/apache/tika/mime/custom-mimetypes.xml
* /tika/trunk/tika-core/src/test/resources/org/apache/tika/parser/blacklist2_file.blacklist2
* /tika/trunk/tika-core/src/test/resources/org/apache/tika/parser/blacklist_file.blacklist
[jira] [Closed] (TIKA-1187) java.lang.OutOfMemoryError: Java heap space
[ https://issues.apache.org/jira/browse/TIKA-1187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tyler Palsulich closed TIKA-1187.
Resolution: Cannot Reproduce
java.lang.OutOfMemoryError: Java heap space
Key: TIKA-1187
URL: https://issues.apache.org/jira/browse/TIKA-1187
Project: Tika
Issue Type: Bug
Components: general
Affects Versions: 1.3
Environment: Ubuntu
Reporter: GURFAN
Priority: Critical
Original Estimate: 612h
Remaining Estimate: 612h
Hi, While parsing the content we are getting below exception in parse method. The file which we are parsing is 1 mb.
TIKA JAR: tika-core-1.3.jar
File size: 1 MB.
Parser parser = new AutoDetectParser();
parser.parse(is, handler, metaData, new ParseContext());
java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:2734)
    at java.util.ArrayList.ensureCapacity(ArrayList.java:167)
    at java.util.ArrayList.add(ArrayList.java:351)
    at org.apache.fontbox.ttf.GlyfCompositeDescript.<init>(GlyfCompositeDescript.java:60)
    at org.apache.fontbox.ttf.GlyphData.initData(GlyphData.java:63)
    at org.apache.fontbox.ttf.GlyphTable.initData(GlyphTable.java:71)
    at org.apache.fontbox.ttf.AbstractTTFParser.parseTables(AbstractTTFParser.java:163)
    at org.apache.fontbox.ttf.TTFParser.parseTables(TTFParser.java:61)
    at org.apache.fontbox.ttf.AbstractTTFParser.parseTTF(AbstractTTFParser.java:90)
    at org.apache.fontbox.ttf.TTFParser.parseTTF(TTFParser.java:26)
    at org.apache.fontbox.ttf.AbstractTTFParser.parseTTF(AbstractTTFParser.java:66)
    at org.apache.fontbox.ttf.TTFParser.parseTTF(TTFParser.java:26)
    at org.apache.tika.parser.font.TrueTypeParser.parse(TrueTypeParser.java:65)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at com.impetus.vajra.parser.tika.TikaParser.processContent(TikaParser.java:96)
    at com.impetus.vajra.storm.helper.TextAnalyserBoltHelper.execute(TextAnalyserBoltHelper.java:283)
    at com.impetus.vajra.storm.TextAnalyserBolt.execute(TextAnalyserBolt.java:182)
    at backtype.storm.daemon.executor$fn__4050$tuple_action_fn__4052.invoke(executor.clj:566)
    at backtype.storm.daemon.executor$mk_task_receiver$fn__3976.invoke(executor.clj:345)
    at backtype.storm.disruptor$clojure_handler$reify__1606.onEvent(disruptor.clj:43)
    at backtype.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:84)
    at backtype.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:58)
    at backtype.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:62)
    at backtype.storm.daemon.executor$fn__4050$fn__4059$fn__4106.invoke(executor.clj:658)
    at backtype.storm.util$async_loop$fn__465.invoke(util.clj:377)
    at clojure.lang.AFn.run(AFn.java:24)
    at java.lang.Thread.run(Thread.java:662)
[jira] [Closed] (TIKA-1250) Process loops infinitely processing a CHM file
[ https://issues.apache.org/jira/browse/TIKA-1250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich closed TIKA-1250. Resolution: Cannot Reproduce We can't reproduce this without the file. And, there were some significant CHM parsing updates. So, I'm closing this off. Process loops infinitely processing a CHM file - Key: TIKA-1250 URL: https://issues.apache.org/jira/browse/TIKA-1250 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.5 Environment: Java 7 on Linux Reporter: Gary Murphy Priority: Critical Parsing process loops infinitely on certain CHM files. This is NOT the same as TIKA-1152
[jira] [Commented] (TIKA-1194) Missing text from MS Word (DOC) file
[ https://issues.apache.org/jira/browse/TIKA-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14330021#comment-14330021 ]
Tyler Palsulich commented on TIKA-1194:
[~tssk], were you ever able to create a safe version of the file? Do you still have it? It's been a while since this issue was opened.
Missing text from MS Word (DOC) file
Key: TIKA-1194
URL: https://issues.apache.org/jira/browse/TIKA-1194
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.4
Reporter: Tomas Safarik
Priority: Critical
Hello, we noticed that the filtered text from some MS Word DOC files is missing one line (in a table cell) of the original document.
- If you add or remove one character anywhere before the problematic line/cell, the filtered text is correct. If you change the text back to the original, the filtering problem returns.
- If the file is resaved as DOCX, filtering works fine.
I will provide a sample document. Please let me know if more information is needed.
Regards, Tomas
[jira] [Closed] (TIKA-1239) Using Spring and Tika together. Need to extract the content and metadata.
[ https://issues.apache.org/jira/browse/TIKA-1239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tyler Palsulich closed TIKA-1239.
Resolution: Cannot Reproduce
Using Spring and Tika together. Need to extract the content and metadata.
Key: TIKA-1239
URL: https://issues.apache.org/jira/browse/TIKA-1239
Project: Tika
Issue Type: Task
Components: general, metadata, parser
Reporter: sudheshna iyer
Priority: Critical
I need to use spring with Tika. Is it thread safe to use the following injected from bean context. I am injecting parseContext, handler and parser into my class TikaImpl.
<bean name="parseContext" class="org.apache.tika.parser.ParseContext"></bean>
<bean name="parser" class="org.apache.tika.parser.AutoDetectParser"></bean>
<bean name="handler" class="org.xml.sax.helpers.DefaultHandler"></bean>
<bean id="tikaService" class="com.intech.tika.TikaImpl">
  <property name="parseContext" ref="parseContext"></property>
  <property name="parser" ref="parser"></property>
  <property name="handler" ref="handler"></property>
  <property name="resourcesize"><value>10485760</value></property>
</bean>
===
In my class I have 3 methods: 1. to retrieve metadata, 2. to retrieve content, 3. to retrieve both.
1. To retrieve metadata, I am using:
parser.parse(stream, handler, metadata, parseContext)
2. To retrieve the content, I am using:
Tika tika = new Tika();
tika.setMaxStringLength(resourcesize);
String content = tika.parseToString(stream);
3. To retrieve both, I am using:
BodyContentHandler bodyContentHandler = new BodyContentHandler(resourcesize);
Metadata metadata = new Metadata();
parser.parse(TikaInputStream.get(stream), bodyContentHandler, metadata, parseContext);
Question is: Is my approach thread safe? Introduced 3 methods, thinking that just getting metadata from the first method is faster than the 3rd method. Need your suggestion badly. Thank you in advance.
[jira] [Resolved] (TIKA-1558) Create a Parser Blacklist
[ https://issues.apache.org/jira/browse/TIKA-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich resolved TIKA-1558. Resolution: Fixed Fix Version/s: 1.8 Assignee: Tyler Palsulich Above strategy added in r1661284. You can now blacklist Parsers by adding names to {{META-INF/services/org.apache.tika.parser.Parser.blacklist}} with the same format as the normal services file. If a class is blacklisted, all of its subclasses are automatically blacklisted.
[jira] [Commented] (TIKA-1437) encoding issue in AutoDetectReader
[ https://issues.apache.org/jira/browse/TIKA-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14330017#comment-14330017 ]
Tyler Palsulich commented on TIKA-1437:
[~Lukeliush], can you make a couple of updates to make this easier to test? First, come up with a small (few line) file with this problem. That way, we can be sure we can legally include the file within Tika. Also, can you reformat your testing script as a Tika JUnit TestCase? You can see an example [here|https://github.com/apache/tika/blob/trunk/tika-parsers/src/test/java/org/apache/tika/parser/txt/CharsetDetectorTest.java]. The file you have might just be corrupted -- giving different results. And, as Tim mentioned, no detector will be perfect, so different detectors will give different results. But, the above changes will help us narrow it down. Thanks!
encoding issue in AutoDetectReader
Key: TIKA-1437
URL: https://issues.apache.org/jira/browse/TIKA-1437
Project: Tika
Issue Type: Bug
Components: detector, parser
Affects Versions: 1.6
Environment: Windows 8
Reporter: Luke sh
Priority: Critical
Attachments: EncodingProblem.java, computrabajo-ar-20121108.tsv, e9.jpg, ef.jpg
We are having an encoding problem with Tika's AutoDetectReader. We use AutoDetectReader to read a stream and extract string values by calling AutoDetectReader's readLine(). We find that the encoding problem happens in UniversalEncodingDetector, which AutoDetectReader calls when reading the input stream passed as one of the arguments to our TSVParser’s parse method.
We are using AutoDetectReader in our parser and believed it could auto-detect the correct encoding of the input stream passed to it, but we are seeing several garbled characters in the converted files our parser outputs. We found that the encoding problem happens in UniversalEncodingDetector, which returns UTF-8, so AutoDetectReader reads the stream as UTF-8, which is incorrect; the correct encoding is ISO-8859-1. I am attaching screenshots of the character differences between the input TSV file and the converted output file: e9.jpg and ef.jpg; please read the description for details. The problem is that AutoDetectReader decodes and reads the characters with the incorrect encoding. BTW, we were able to work around this problem with CharsetDetector, which for the moment seems to produce a valid encoding that we can use to read the TSV file properly. However, this means we cannot use AutoDetectReader; we have to create our own TSVAutoDetectReader that uses CharsetDetector in its detect method. AutoDetectReader is not flexible enough for us to extend: many of its methods are private, so we cannot manually set the encoding or override the existing encoding detection. In addition, I am not confident about CharsetDetector either, as I am seeing different encodings produced by CharsetDetector and AutoDetectReader for different TSV files. But for now we can live with CharsetDetector, as it solves the current encoding problem.
Finally, I would also like to give you my test program (attached: EncodingProblem.java), which reads a directory of TSV files and displays, for each file, the encodings produced by AutoDetectReader, UniversalEncodingDetector (which is called by AutoDetectReader), and CharsetDetector, so you can see the difference: they produce different encodings for some TSV files.
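The symptom described in this report, ISO-8859-1 bytes read as UTF-8, can be reproduced without Tika. The sketch below is only an illustration of that mismatch using the JDK's own decoders; the sample word and the class name `EncodingMismatchDemo` are made up for the example, and it says nothing about why the detector chose UTF-8 for the reporter's file.

```java
import java.nio.charset.StandardCharsets;

final class EncodingMismatchDemo {

    static String decodeLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    static String decodeUtf8(byte[] bytes) {
        // Malformed UTF-8 input is replaced with U+FFFD rather than rejected.
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "café" as stored in an ISO-8859-1 file: the é is the single byte 0xE9.
        byte[] raw = {'c', 'a', 'f', (byte) 0xE9};

        String right = decodeLatin1(raw);
        String wrong = decodeUtf8(raw);

        System.out.println("ISO-8859-1 decode equals caf\\u00E9: " + right.equals("caf\u00E9"));
        System.out.println("UTF-8 decode contains U+FFFD:       " + wrong.contains("\uFFFD"));
    }
}
```

In UTF-8 the byte 0xE9 announces a multi-byte sequence, so a lone 0xE9 is malformed and the accented character is lost on decode. That loss is not recoverable afterwards, which is why the charset has to be right at read time.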
[jira] [Created] (TIKA-1554) Improve EMF file detection
Luis Filipe Nassif created TIKA-1554:
Summary: Improve EMF file detection
Key: TIKA-1554
URL: https://issues.apache.org/jira/browse/TIKA-1554
Project: Tika
Issue Type: Bug
Components: detector
Affects Versions: 1.7
Reporter: Luis Filipe Nassif
I am getting many files being incorrectly detected as application/x-emf. I think the current magic is too common. According to MS documentation (https://msdn.microsoft.com/en-us/library/cc230635.aspx and https://msdn.microsoft.com/en-us/library/dd240211.aspx), it can be improved to:
{code}
<mime-type type="application/x-emf">
  <acronym>EMF</acronym>
  <_comment>Extended Metafile</_comment>
  <glob pattern="*.emf"/>
  <magic priority="50">
    <match value="0x0100" type="string" offset="0">
      <match value=" EMF" type="string" offset="40"/>
    </match>
  </magic>
</mime-type>
{code}
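To make the proposed two-level match concrete, here is a rough Java equivalent of "leading record type, and also the ' EMF' signature at offset 40". It is a sketch only, not Tika's detector. The class and method names are hypothetical, and the four leading bytes 01 00 00 00 are an assumption taken from the EMF header layout in the Microsoft documentation linked above (a little-endian record type of 1), since the value shown in this message may be truncated.

```java
final class EmfHeaderCheck {

    /**
     * Returns true only when both conditions hold:
     * bytes 0..3 are 01 00 00 00 (assumed header record type, little-endian)
     * and bytes 40..43 are the ASCII signature " EMF".
     */
    static boolean looksLikeEmf(byte[] head) {
        if (head == null || head.length < 44) {
            return false; // too short to contain the signature at offset 40
        }
        boolean recordType = head[0] == 0x01 && head[1] == 0x00
                && head[2] == 0x00 && head[3] == 0x00;
        boolean signature = head[40] == 0x20 && head[41] == 'E'
                && head[42] == 'M' && head[43] == 'F';
        return recordType && signature;
    }

    public static void main(String[] args) {
        byte[] onlyLeadingBytes = {1, 0, 0, 0};
        System.out.println("4 leading bytes only: " + looksLikeEmf(onlyLeadingBytes));

        byte[] header = new byte[44];
        header[0] = 1;
        header[40] = 0x20;
        header[41] = 'E';
        header[42] = 'M';
        header[43] = 'F';
        System.out.println("with signature at 40: " + looksLikeEmf(header));
    }
}
```

Requiring the second match is what cuts down false positives: a short leading value on its own is common in arbitrary binary data, while the combination with a fixed signature 40 bytes in is far less likely to occur by chance.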
[jira] [Created] (TIKA-1555) posix_spawn is not a supported process launch mechanism on this platform
David Pilato created TIKA-1555:
Summary: posix_spawn is not a supported process launch mechanism on this platform
Key: TIKA-1555
URL: https://issues.apache.org/jira/browse/TIKA-1555
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.7
Environment: MacOS X 10.10.2
Reporter: David Pilato
On some systems, posix_spawn may not be a supported process launch mechanism. We run randomized tests that simulate different Locales, so I sometimes hit this issue:
{code}
java.lang.Error: posix_spawn is not a supported process launch mechanism on this platform.
    at java.lang.UNIXProcess$1.run(UNIXProcess.java:104)
    at java.lang.UNIXProcess$1.run(UNIXProcess.java:93)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.lang.UNIXProcess.<clinit>(UNIXProcess.java:91)
    at java.lang.ProcessImpl.start(ProcessImpl.java:130)
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:1022)
    at java.lang.Runtime.exec(Runtime.java:617)
    at java.lang.Runtime.exec(Runtime.java:485)
    at org.apache.tika.parser.external.ExternalParser.check(ExternalParser.java:344)
    at org.apache.tika.parser.ocr.TesseractOCRParser.hasTesseract(TesseractOCRParser.java:117)
    at org.apache.tika.parser.ocr.TesseractOCRParser.getSupportedTypes(TesseractOCRParser.java:90)
    at org.apache.tika.parser.CompositeParser.getParsers(CompositeParser.java:81)
    at org.apache.tika.parser.DefaultParser.getParsers(DefaultParser.java:95)
    at org.apache.tika.parser.CompositeParser.getSupportedTypes(CompositeParser.java:229)
    at org.apache.tika.parser.CompositeParser.getParsers(CompositeParser.java:81)
    at org.apache.tika.parser.CompositeParser.getParser(CompositeParser.java:209)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.Tika.parseToString(Tika.java:506)
{code}
It sounds like it's related to this: http://java.thedizzyheights.com/2014/07/java-error-posix_spawn-is-not-a-supported-process-launch-mechanism-on-this-platform-when-trying-to-spawn-a-process/
Though I have a hard time reproducing it!
BTW, I wonder if we could add a setting that makes {{TesseractOCRParser#hasTesseract}} return {{false}} even if tesseract is available. For example, let's say that my machine hosts multiple applications and for one of them I don't want any OCR on my documents.
Hope this helps. Let me know if you need more details.
[jira] [Commented] (TIKA-1554) Improve EMF file detection
[ https://issues.apache.org/jira/browse/TIKA-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329138#comment-14329138 ]
Nick Burch commented on TIKA-1554:
Do you have any small files which incorrectly trigger it now? One of those would be good for a unit test for this!
[jira] [Updated] (TIKA-1554) Improve EMF file detection
[ https://issues.apache.org/jira/browse/TIKA-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luis Filipe Nassif updated TIKA-1554: - Attachment: nonEmf.dat Yes, I have attached a very simple one, consisting only of the current 4-byte magic. Improve EMF file detection -- Key: TIKA-1554 URL: https://issues.apache.org/jira/browse/TIKA-1554 Project: Tika Issue Type: Bug Components: detector Affects Versions: 1.7 Reporter: Luis Filipe Nassif Attachments: nonEmf.dat
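The tightened magic proposed in this issue can be illustrated with a standalone byte check. This is only a sketch, not Tika's detector: the class name `EmfSniffer` and the method `looksLikeEmf` are hypothetical, and it assumes (based on the MS-EMF documentation linked in the issue) that the header record type is stored as the little-endian bytes 01 00 00 00 at offset 0 and that the ASCII signature " EMF" sits at offset 40.

```java
import java.nio.charset.StandardCharsets;

class EmfSniffer {
    // Assumption: EMF header record type (value 1) as four little-endian bytes.
    private static final byte[] HEADER_TYPE = {0x01, 0x00, 0x00, 0x00};
    // Signature from the proposed magic: a space followed by "EMF".
    private static final byte[] SIGNATURE = " EMF".getBytes(StandardCharsets.US_ASCII);
    private static final int SIGNATURE_OFFSET = 40;

    /** Returns true only if both the record type and the signature match. */
    public static boolean looksLikeEmf(byte[] data) {
        return matchesAt(data, 0, HEADER_TYPE)
                && matchesAt(data, SIGNATURE_OFFSET, SIGNATURE);
    }

    // Compares expected bytes against data at the given offset, with bounds checks.
    private static boolean matchesAt(byte[] data, int offset, byte[] expected) {
        if (data == null || data.length < offset + expected.length) {
            return false;
        }
        for (int i = 0; i < expected.length; i++) {
            if (data[offset + i] != expected[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // A file holding nothing but the 4-byte magic, similar in spirit to nonEmf.dat.
        byte[] onlyMagic = {0x01, 0x00, 0x00, 0x00};
        System.out.println("4-byte magic only: " + looksLikeEmf(onlyMagic));
    }
}
```

Requiring the second match at offset 40 is what rejects short or unrelated files that merely start with the common 4-byte prefix.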
[jira] [Commented] (TIKA-1460) Could not parse predefined CMAP file for 'Adobe-GBK1-UCS2'
[ https://issues.apache.org/jira/browse/TIKA-1460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14330031#comment-14330031 ] Tyler Palsulich commented on TIKA-1460: --- Hi [~onyas]. The dialog isn't in a very intuitive spot. It's under More > Attach files. I found a PostScript version of the file under {{/usr/share/fonts/cmap/}}, but not a PDF. I'm also curious whether a newer version of Tika would solve your problem.

Could not parse predefined CMAP file for 'Adobe-GBK1-UCS2' -- Key: TIKA-1460 URL: https://issues.apache.org/jira/browse/TIKA-1460 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.3 Environment: win7,myeclipse8.5 Reporter: onyas Priority: Critical

For some reason I could not upload the file, so here is the info. I checked all the versions in the \org\apache\pdfbox\resources\cmap directory and did not find the 'Adobe-GBK1-UCS2' file.

org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@d640af
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
Caused by: java.lang.IllegalArgumentException: Position 66048 past the end of the file
    at org.apache.poi.poifs.nio.FileBackedDataSource.read(FileBackedDataSource.java:50)
    at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.getBlockAt(NPOIFSFileSystem.java:420)
    at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.readBAT(NPOIFSFileSystem.java:397)
    at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.readCoreContents(NPOIFSFileSystem.java:356)
    at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>(NPOIFSFileSystem.java:202)
    at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>(NPOIFSFileSystem.java:184)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:156)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    ... 21 more

The main code is:

    Parser parser = new AutoDetectParser();
    ContentHandler handler = new BodyContentHandler(getNum());
    Metadata metadata = new Metadata();
    ParseContext context = new ParseContext();
    InputStream stream = null;
    StringBuffer content = new StringBuffer();
    try {
        stream = new FileInputStream(file);
        if (stream != null) {
            parser.parse(stream, handler, metadata, context);
            content = content.append(handler);
            if(StringUtils.isNotBlank(content.toString())){
                hasContent = true;
                handler = null;
                metadata = null;
                context = null;
            }
        }

And the exception is thrown at this line: parser.parse(stream, handler, metadata, context);
[jira] [Resolved] (TIKA-1521) Handle password protected 7zip files
[ https://issues.apache.org/jira/browse/TIKA-1521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich resolved TIKA-1521. --- Resolution: Fixed Thanks for finding a workaround, Tim! Closing this now that Jenkins is happy. Handle password protected 7zip files Key: TIKA-1521 URL: https://issues.apache.org/jira/browse/TIKA-1521 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.7 Reporter: Nick Burch Fix For: 1.8 While working on TIKA-1028, I noticed that while Commons Compress doesn't currently handle decrypting password-protected zip files, it does handle password-protected 7zip files. We should therefore add logic to the package parser to spot password-protected 7zip files and fetch the password for them from a PasswordProvider, if given.
[jira] [Commented] (TIKA-1555) posix_spawn is not a supported process launch mechanism on this platform
[ https://issues.apache.org/jira/browse/TIKA-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14329276#comment-14329276 ] Uwe Schindler commented on TIKA-1555: - Also, this issue in the JDK is already fixed in Java 7u80 and 8u40 (to be released in the next 2 months): https://bugs.openjdk.java.net/browse/JDK-8047340 posix_spawn is not a supported process launch mechanism on this platform Key: TIKA-1555 URL: https://issues.apache.org/jira/browse/TIKA-1555 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.7 Environment: MacOS X 10.10.2 Reporter: David Pilato Labels: ocr, parser
[jira] [Commented] (TIKA-1555) posix_spawn is not a supported process launch mechanism on this platform
[ https://issues.apache.org/jira/browse/TIKA-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14329282#comment-14329282 ] Uwe Schindler commented on TIKA-1555: - @UweSays: https://twitter.com/UweSays/status/501425093613207552 posix_spawn is not a supported process launch mechanism on this platform Key: TIKA-1555 URL: https://issues.apache.org/jira/browse/TIKA-1555 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.7 Environment: MacOS X 10.10.2 Reporter: David Pilato Labels: ocr, parser
[jira] [Commented] (TIKA-1555) posix_spawn is not a supported process launch mechanism on this platform
[ https://issues.apache.org/jira/browse/TIKA-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14329300#comment-14329300 ] David Pilato commented on TIKA-1555: Thank you Uwe. I don't understand why I was not able to find the other issue! I'm pretty sure I searched for it before opening this one... I guess we can close this one as a duplicate then? Thanks! posix_spawn is not a supported process launch mechanism on this platform Key: TIKA-1555 URL: https://issues.apache.org/jira/browse/TIKA-1555 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.7 Environment: MacOS X 10.10.2 Reporter: David Pilato Labels: ocr, parser
[jira] [Commented] (TIKA-1555) posix_spawn is not a supported process launch mechanism on this platform
[ https://issues.apache.org/jira/browse/TIKA-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14329272#comment-14329272 ] Uwe Schindler commented on TIKA-1555: - This is a duplicate of TIKA-1526. posix_spawn is not a supported process launch mechanism on this platform Key: TIKA-1555 URL: https://issues.apache.org/jira/browse/TIKA-1555 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.7 Environment: MacOS X 10.10.2 Reporter: David Pilato Labels: ocr, parser
[jira] [Closed] (TIKA-1555) posix_spawn is not a supported process launch mechanism on this platform
[ https://issues.apache.org/jira/browse/TIKA-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich closed TIKA-1555. - Resolution: Duplicate Assignee: Tyler Palsulich posix_spawn is not a supported process launch mechanism on this platform Key: TIKA-1555 URL: https://issues.apache.org/jira/browse/TIKA-1555 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.7 Environment: MacOS X 10.10.2 Reporter: David Pilato Assignee: Tyler Palsulich Labels: ocr, parser
[jira] [Commented] (TIKA-1526) ExternalParser should trap/ignore/workarround JDK-8047340 JDK-8055301 so Turkish Tika users can still use non-external parsers
[ https://issues.apache.org/jira/browse/TIKA-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14329350#comment-14329350 ] Uwe Schindler commented on TIKA-1526: - I was not able to test this, because I have no MacOSX computer and FreeBSD is only a Jenkins server. Maybe [~dadoonet] can try the same with the elasticsearch-mapper-attachments module.

ExternalParser should trap/ignore/workarround JDK-8047340 JDK-8055301 so Turkish Tika users can still use non-external parsers Key: TIKA-1526 URL: https://issues.apache.org/jira/browse/TIKA-1526 Project: Tika Issue Type: Wish Reporter: Hoss Man

The JDK has numerous pain points regarding the Turkish locale, posix_spawn lowercasing being one of them... https://bugs.openjdk.java.net/browse/JDK-8047340 https://bugs.openjdk.java.net/browse/JDK-8055301

As of Tika 1.7, the TesseractOCRParser (which is an ExternalParser) is enabled and configured by default in Tika, and uses ExternalParser.check to see if tesseract is available -- but because of the JDK bug, this means that Tika fails fast for Turkish users on BSD/UNIX variants (including MacOSX) like so...

{noformat}
[junit4] Throwable #1: java.lang.Error: posix_spawn is not a supported process launch mechanism on this platform.
[junit4] at java.lang.UNIXProcess$1.run(UNIXProcess.java:105)
[junit4] at java.lang.UNIXProcess$1.run(UNIXProcess.java:94)
[junit4] at java.security.AccessController.doPrivileged(Native Method)
[junit4] at java.lang.UNIXProcess.<clinit>(UNIXProcess.java:92)
[junit4] at java.lang.ProcessImpl.start(ProcessImpl.java:130)
[junit4] at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
[junit4] at java.lang.Runtime.exec(Runtime.java:620)
[junit4] at java.lang.Runtime.exec(Runtime.java:485)
[junit4] at org.apache.tika.parser.external.ExternalParser.check(ExternalParser.java:344)
[junit4] at org.apache.tika.parser.ocr.TesseractOCRParser.hasTesseract(TesseractOCRParser.java:117)
[junit4] at org.apache.tika.parser.ocr.TesseractOCRParser.getSupportedTypes(TesseractOCRParser.java:90)
[junit4] at org.apache.tika.parser.CompositeParser.getParsers(CompositeParser.java:81)
[junit4] at org.apache.tika.parser.DefaultParser.getParsers(DefaultParser.java:95)
[junit4] at org.apache.tika.parser.CompositeParser.getSupportedTypes(CompositeParser.java:229)
[junit4] at org.apache.tika.parser.CompositeParser.getParsers(CompositeParser.java:81)
[junit4] at org.apache.tika.parser.CompositeParser.getParser(CompositeParser.java:209)
[junit4] at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
[junit4] at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
{noformat}

...unless they go out of their way to whitelist only the parsers they need/want so TesseractOCRParser (and any other ExternalParsers) will never even be check()ed.

It would be nice if Tika's ExternalParser class added a similar hack/workaround to what was done in SOLR-6387 to trap these types of errors. In Solr we just propagate a better error explaining why Java hates the Turkish language...

{code}
} catch (Error err) {
  if (err.getMessage() != null && (err.getMessage().contains("posix_spawn") || err.getMessage().contains("UNIXProcess"))) {
    log.warn("Error forking command due to JVM locale bug (see https://issues.apache.org/jira/browse/SOLR-6387): " + err.getMessage());
    return "(error executing: " + cmd + ")";
  }
}
{code}

...but with Tika, it might be better for all ExternalParsers to just opt out as if they don't recognize the filetype when they detect this type of error from the check method (or perhaps it would be better if AutoDetectParser handled this? ... I'm not really sure how it would best fit into Tika's architecture)
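The "opt out instead of failing" idea from this issue can be sketched without Tika. This is a hypothetical helper, not the change that went into Tika for TIKA-1526: `SafeExternalCheck`, `isLocaleSpawnBug` and `check` are made-up names, and the message matching is borrowed from the Solr snippet quoted in the issue.

```java
import java.io.IOException;
import java.io.InputStream;

class SafeExternalCheck {
    /** True if the Error looks like the JVM locale bug (same message test as the Solr snippet). */
    public static boolean isLocaleSpawnBug(Error err) {
        String msg = err.getMessage();
        return msg != null && (msg.contains("posix_spawn") || msg.contains("UNIXProcess"));
    }

    /**
     * Returns true only if the command starts and exits with status 0.
     * A missing command or the locale bug both mean "tool not available".
     */
    public static boolean check(String... command) {
        try {
            Process process = new ProcessBuilder(command).redirectErrorStream(true).start();
            // Drain the output so the child process cannot block on a full pipe.
            try (InputStream in = process.getInputStream()) {
                byte[] buffer = new byte[4096];
                while (in.read(buffer) != -1) {
                    // discard
                }
            }
            return process.waitFor() == 0;
        } catch (IOException e) {
            return false; // command not found or not executable
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } catch (Error err) {
            if (isLocaleSpawnBug(err)) {
                return false; // opt out quietly instead of failing the whole parse
            }
            throw err; // unrelated Errors (e.g. OutOfMemoryError) must still propagate
        }
    }

    public static void main(String[] args) {
        System.out.println(check("no-such-command-tika-1526"));
    }
}
```

Rethrowing every Error that does not match keeps the workaround narrow, so real failures are not silently swallowed.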
[jira] [Commented] (TIKA-1555) posix_spawn is not a supported process launch mechanism on this platform
[ https://issues.apache.org/jira/browse/TIKA-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14329344#comment-14329344 ] Uwe Schindler commented on TIKA-1555: - Hi David, can you try to compile Tika from a current trunk checkout and test it with ES? If this fixes the issue with the Turkish locale, could you report on TIKA-1526? For me it's hard to reproduce with Windows or Linux. I have only analyzed the issue, reported the bug to Oracle, and fixed Solr 5.0, but I did no thorough testing on the Tika issue. posix_spawn is not a supported process launch mechanism on this platform Key: TIKA-1555 URL: https://issues.apache.org/jira/browse/TIKA-1555 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.7 Environment: MacOS X 10.10.2 Reporter: David Pilato Assignee: Tyler Palsulich Labels: ocr, parser
[jira] [Resolved] (TIKA-1323) Improve exception reporting in JAX-RS server
[ https://issues.apache.org/jira/browse/TIKA-1323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-1323. --- Resolution: Fixed r1661193 Commandline option -includeStack will enable this behavior. I centralized and made exception handling more uniform for parsing for /tika, /unpack and /rmeta. /meta is still slightly different for backwards compatibility. Improve exception reporting in JAX-RS server Key: TIKA-1323 URL: https://issues.apache.org/jira/browse/TIKA-1323 Project: Tika Issue Type: Improvement Components: server Reporter: Tim Allison Priority: Minor I'd like to use tika-server for TIKA-1302. As part of that, I'd like to record exception stacktraces per document. I see two options: transmit the info back to the client (assuming a doc didn't bring the server down :) ) along with the current error code or log the document id and stacktrace via the server. Given my current design thoughts, I'd prefer the first option. Any objections or recommendations? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
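For readers wondering what "returning the stack trace to the client" amounts to, here is a minimal JDK-only sketch. It is not the tika-server code from r1661193: `StackTraceBody`, `render` and the `includeStack` parameter are hypothetical stand-ins for whatever the -includeStack option toggles on the server.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

class StackTraceBody {
    /**
     * Renders the throwable's stack trace as text for a response body,
     * or an empty string when stack traces are switched off.
     */
    public static String render(Throwable t, boolean includeStack) {
        if (!includeStack) {
            return "";
        }
        StringWriter buffer = new StringWriter();
        try (PrintWriter writer = new PrintWriter(buffer)) {
            t.printStackTrace(writer); // includes any "Caused by:" chain
        }
        return buffer.toString();
    }

    public static void main(String[] args) {
        Exception failure = new IllegalStateException("parse failed", new RuntimeException("root cause"));
        System.out.print(render(failure, true));
    }
}
```

Keeping the switch as a plain boolean mirrors the opt-in nature of the option: clients only see internals when the operator asked for them.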
[jira] [Resolved] (TIKA-1556) Clean up whitespace in tika-server
[ https://issues.apache.org/jira/browse/TIKA-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-1556. --- Resolution: Fixed r1661200. Clean up whitespace in tika-server -- Key: TIKA-1556 URL: https://issues.apache.org/jira/browse/TIKA-1556 Project: Tika Issue Type: Task Reporter: Tim Allison Assignee: Tim Allison Priority: Trivial Fix For: 1.8 We have 2- and 4-space indents in different parts of tika-server's code. Let's make it consistent with the rest of Tika. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1555) posix_spawn is not a supported process launch mechanism on this platform
[ https://issues.apache.org/jira/browse/TIKA-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14329364#comment-14329364 ] Uwe Schindler commented on TIKA-1555: - bq. BTW I wonder if we could add a setting which can return false for TesseractOCRParser#hasTesseract even if we have tesseract available. You can remove / add custom parsers through the TikaConfig. But I agree, it's hard to maintain, because you have to provide a static list. I would really like to have a separate TikaConfig option to explicitly disable some parsers, so I can use the default SPI lookup, but blacklist parsers. We would like to do the same in Solr, too. posix_spawn is not a supported process launch mechanism on this platform Key: TIKA-1555 URL: https://issues.apache.org/jira/browse/TIKA-1555 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.7 Environment: MacOS X 10.10.2 Reporter: David Pilato Assignee: Tyler Palsulich Labels: ocr, parser
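The wish raised in this thread (keep the default SPI lookup, but exclude selected parsers such as the Tesseract one) could look roughly like the sketch below. It deliberately avoids Tika's real TikaConfig API, which, per the comment in this thread, had no such option; `ParserBlacklist`, `withoutExcluded` and the two nested stand-in parser classes are hypothetical. In a real setup the input would come from something like `java.util.ServiceLoader`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class ParserBlacklist {
    /** Keeps every discovered service whose class name is not in the exclusion set. */
    public static <T> List<T> withoutExcluded(Iterable<T> discovered, Set<String> excludedClassNames) {
        List<T> kept = new ArrayList<>();
        for (T candidate : discovered) {
            if (!excludedClassNames.contains(candidate.getClass().getName())) {
                kept.add(candidate);
            }
        }
        return kept;
    }

    // Hypothetical stand-ins for parsers found through the SPI lookup.
    public static class FakeOcrParser {}
    public static class FakePdfParser {}

    public static void main(String[] args) {
        List<Object> discovered = Arrays.asList(new FakeOcrParser(), new FakePdfParser());
        Set<String> excluded = new HashSet<>(Arrays.asList(FakeOcrParser.class.getName()));
        System.out.println(withoutExcluded(discovered, excluded).size());
    }
}
```

Filtering by class name means the exclusion list can live in plain configuration, and newly added parsers stay enabled by default, which is the maintenance benefit asked for over a static whitelist.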
[jira] [Created] (TIKA-1556) Clean up whitespace in tika-server
Tim Allison created TIKA-1556: - Summary: Clean up whitespace in tika-server Key: TIKA-1556 URL: https://issues.apache.org/jira/browse/TIKA-1556 Project: Tika Issue Type: Task Reporter: Tim Allison Assignee: Tim Allison Priority: Trivial Fix For: 1.8 We have 2- and 4-space indents in different parts of tika-server's code. Let's make it consistent with the rest of Tika. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1323) Improve exception reporting in JAX-RS server
[ https://issues.apache.org/jira/browse/TIKA-1323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329426#comment-14329426 ] Tim Allison commented on TIKA-1323: --- Now running with this option on TIKA-1301's server: 162.209.99.130, port 9998. Improve exception reporting in JAX-RS server Key: TIKA-1323 URL: https://issues.apache.org/jira/browse/TIKA-1323 Project: Tika Issue Type: Improvement Components: server Reporter: Tim Allison Priority: Minor I'd like to use tika-server for TIKA-1302. As part of that, I'd like to record exception stacktraces per document. I see two options: transmit the info back to the client (assuming a doc didn't bring the server down :) ) along with the current error code, or log the document id and stacktrace via the server. Given my current design thoughts, I'd prefer the first option. Any objections or recommendations?
[jira] [Commented] (TIKA-1323) Improve exception reporting in JAX-RS server
[ https://issues.apache.org/jira/browse/TIKA-1323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329432#comment-14329432 ] Hudson commented on TIKA-1323: -- SUCCESS: Integrated in tika-trunk-jdk1.7 #500 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/500/]) TIKA-1323: allow tika-server to return stack traces from parse exceptions for easier analysis of parser exceptions via tika-server. (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1661193) * /tika/trunk/CHANGES.txt * /tika/trunk/tika-server/pom.xml * /tika/trunk/tika-server/src/main/java/org/apache/tika/server/MetadataResource.java * /tika/trunk/tika-server/src/main/java/org/apache/tika/server/RecursiveMetadataResource.java * /tika/trunk/tika-server/src/main/java/org/apache/tika/server/TikaExceptionMapper.java * /tika/trunk/tika-server/src/main/java/org/apache/tika/server/TikaResource.java * /tika/trunk/tika-server/src/main/java/org/apache/tika/server/TikaServerCli.java * /tika/trunk/tika-server/src/main/java/org/apache/tika/server/TikaServerParseException.java * /tika/trunk/tika-server/src/main/java/org/apache/tika/server/TikaServerParseExceptionMapper.java * /tika/trunk/tika-server/src/main/java/org/apache/tika/server/UnpackerResource.java * /tika/trunk/tika-server/src/test/java/org/apache/tika/server/CXFTestBase.java * /tika/trunk/tika-server/src/test/java/org/apache/tika/server/DetectorResourceTest.java * /tika/trunk/tika-server/src/test/java/org/apache/tika/server/MetadataResourceTest.java * /tika/trunk/tika-server/src/test/java/org/apache/tika/server/StackTraceOffTest.java * /tika/trunk/tika-server/src/test/java/org/apache/tika/server/StackTraceTest.java * /tika/trunk/tika-server/src/test/java/org/apache/tika/server/TikaResourceTest.java * /tika/trunk/tika-server/src/test/java/org/apache/tika/server/UnpackerResourceTest.java * /tika/trunk/tika-server/src/test/resources/META-INF * /tika/trunk/tika-server/src/test/resources/META-INF/services * 
/tika/trunk/tika-server/src/test/resources/META-INF/services/org.apache.tika.parser.Parser * /tika/trunk/tika-server/src/test/resources/evil * /tika/trunk/tika-server/src/test/resources/evil/null_pointer.evil * /tika/trunk/tika-server/src/test/resources/mime * /tika/trunk/tika-server/src/test/resources/mime/custom-mimetypes.xml
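As a rough illustration of the first option discussed in TIKA-1323 (sending the stack trace back to the client), the sketch below renders a Throwable into a string that could serve as a response body. It is not the code committed in r1661193: the class StackTraceBodySketch, the method render, and the includeStackTrace flag are all hypothetical names, and the JAX-RS wiring is left out entirely.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

// Hypothetical sketch: none of these names come from tika-server.
class StackTraceBodySketch {

    /**
     * Returns the full stack trace of t as text when the (assumed)
     * includeStackTrace option is on, and an empty body otherwise.
     */
    static String render(Throwable t, boolean includeStackTrace) {
        if (!includeStackTrace) {
            return "";
        }
        StringWriter buffer = new StringWriter();
        try (PrintWriter writer = new PrintWriter(buffer)) {
            // printStackTrace writes the exception's toString() first,
            // followed by one "at ..." line per stack frame.
            t.printStackTrace(writer);
        }
        return buffer.toString();
    }

    public static void main(String[] args) {
        Throwable failure = new NullPointerException("null pointer from a test parser");
        System.out.println(render(failure, true));
        System.out.println("body length with option off: " + render(failure, false).length());
    }
}
```

A client that receives such a body alongside the error code can store the trace per document, which is what TIKA-1302 needs; the trade-off is that internal details are exposed to callers, so keeping the behaviour behind an option seems prudent.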
[jira] [Commented] (TIKA-1556) Clean up whitespace in tika-server
[ https://issues.apache.org/jira/browse/TIKA-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329433#comment-14329433 ] Hudson commented on TIKA-1556: -- SUCCESS: Integrated in tika-trunk-jdk1.7 #500 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/500/]) TIKA-1556 clean up whitespace in tika-server (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1661200) * /tika/trunk/tika-server/src/main/java/org/apache/tika/server/CSVMessageBodyWriter.java * /tika/trunk/tika-server/src/main/java/org/apache/tika/server/DetectorResource.java * /tika/trunk/tika-server/src/main/java/org/apache/tika/server/HTMLHelper.java * /tika/trunk/tika-server/src/main/java/org/apache/tika/server/JSONMessageBodyWriter.java * /tika/trunk/tika-server/src/main/java/org/apache/tika/server/MetadataListMessageBodyWriter.java * /tika/trunk/tika-server/src/main/java/org/apache/tika/server/MetadataResource.java * /tika/trunk/tika-server/src/main/java/org/apache/tika/server/RecursiveMetadataResource.java * /tika/trunk/tika-server/src/main/java/org/apache/tika/server/RichTextContentHandler.java * /tika/trunk/tika-server/src/main/java/org/apache/tika/server/TarWriter.java * /tika/trunk/tika-server/src/main/java/org/apache/tika/server/TextMessageBodyWriter.java * /tika/trunk/tika-server/src/main/java/org/apache/tika/server/TikaDetectors.java * /tika/trunk/tika-server/src/main/java/org/apache/tika/server/TikaLoggingFilter.java * /tika/trunk/tika-server/src/main/java/org/apache/tika/server/TikaMimeTypes.java * /tika/trunk/tika-server/src/main/java/org/apache/tika/server/TikaParsers.java * /tika/trunk/tika-server/src/main/java/org/apache/tika/server/TikaResource.java * /tika/trunk/tika-server/src/main/java/org/apache/tika/server/TikaServerCli.java * /tika/trunk/tika-server/src/main/java/org/apache/tika/server/TikaServerParseExceptionMapper.java * /tika/trunk/tika-server/src/main/java/org/apache/tika/server/TikaVersion.java * 
/tika/trunk/tika-server/src/main/java/org/apache/tika/server/TikaWelcome.java * /tika/trunk/tika-server/src/main/java/org/apache/tika/server/UnpackerResource.java * /tika/trunk/tika-server/src/main/java/org/apache/tika/server/XMPMessageBodyWriter.java * /tika/trunk/tika-server/src/main/java/org/apache/tika/server/ZipWriter.java * /tika/trunk/tika-server/src/test/java/org/apache/tika/server/CXFTestBase.java * /tika/trunk/tika-server/src/test/java/org/apache/tika/server/DetectorResourceTest.java * /tika/trunk/tika-server/src/test/java/org/apache/tika/server/MetadataResourceTest.java * /tika/trunk/tika-server/src/test/java/org/apache/tika/server/RecursiveMetadataResourceTest.java * /tika/trunk/tika-server/src/test/java/org/apache/tika/server/StackTraceOffTest.java * /tika/trunk/tika-server/src/test/java/org/apache/tika/server/StackTraceTest.java * /tika/trunk/tika-server/src/test/java/org/apache/tika/server/TikaDetectorsTest.java * /tika/trunk/tika-server/src/test/java/org/apache/tika/server/TikaMimeTypesTest.java * /tika/trunk/tika-server/src/test/java/org/apache/tika/server/TikaParsersTest.java * /tika/trunk/tika-server/src/test/java/org/apache/tika/server/TikaResourceTest.java * /tika/trunk/tika-server/src/test/java/org/apache/tika/server/TikaVersionTest.java * /tika/trunk/tika-server/src/test/java/org/apache/tika/server/TikaWelcomeTest.java * /tika/trunk/tika-server/src/test/java/org/apache/tika/server/UnpackerResourceTest.java
[jira] [Commented] (TIKA-1555) posix_spawn is not a supported process launch mechanism on this platform
[ https://issues.apache.org/jira/browse/TIKA-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329438#comment-14329438 ] Tyler Palsulich commented on TIKA-1555: --- You can also disable OCR by setting the Tesseract path to an empty string in the [TesseractOCRConfig|https://github.com/apache/tika/blob/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRConfig.java].
[jira] [Commented] (TIKA-1555) posix_spawn is not a supported process launch mechanism on this platform
[ https://issues.apache.org/jira/browse/TIKA-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329446#comment-14329446 ] David Pilato commented on TIKA-1555: I read the code, and it looks to me like an empty string is already the default value. The executable name is then appended to this path.
[jira] [Commented] (TIKA-1555) posix_spawn is not a supported process launch mechanism on this platform
[ https://issues.apache.org/jira/browse/TIKA-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329452#comment-14329452 ] David Pilato commented on TIKA-1555: Well, I could try, but so far I have not managed to reproduce it 100% of the time. I need to think about it and understand what is wrong with my test config. Sadly, when I hit the issue I could not see the Locale and the other settings in use. But I will for sure.
[jira] [Commented] (TIKA-1555) posix_spawn is not a supported process launch mechanism on this platform
[ https://issues.apache.org/jira/browse/TIKA-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329470#comment-14329470 ] Tyler Palsulich commented on TIKA-1555: --- My mistake. Please see [this test|https://github.com/apache/tika/blob/trunk/tika-parsers/src/test/java/org/apache/tika/parser/ocr/TesseractOCRParserTest.java#L56]. Try setting the path to gibberish.
[jira] [Commented] (TIKA-1555) posix_spawn is not a supported process launch mechanism on this platform
[ https://issues.apache.org/jira/browse/TIKA-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329474#comment-14329474 ] Uwe Schindler commented on TIKA-1555: - bq. You can also disable OCR by setting the Tesseract path to an empty string in the TesseractOCRConfig. This did not work. If this disabled the fork, I would be happy. But it only disables the parser as a side effect, because it still tries to fork an invalid process path built from the empty string and some suffix.
[jira] [Created] (TIKA-1557) Create TesseractOCR Option to Never Run
Tyler Palsulich created TIKA-1557: - Summary: Create TesseractOCR Option to Never Run Key: TIKA-1557 URL: https://issues.apache.org/jira/browse/TIKA-1557 Project: Tika Issue Type: Bug Components: parser Reporter: Tyler Palsulich Assignee: Tyler Palsulich Priority: Minor Fix For: 1.8 As brought up in TIKA-1555, TesseractOCRParser should have an option to never be run. So, we can add an {{enabled}} option to the Config.
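To make the difference between the two approaches concrete, here is a minimal sketch of how an {{enabled}} option could short-circuit the availability check before any process is launched. Everything in it is an assumption for illustration: OcrEnabledSketch, its Config class, and the launcher predicate are hypothetical stand-ins, not Tika's TesseractOCRConfig or ExternalParser API. The launcher is passed in as a Predicate so the sketch can count launch attempts without actually forking anything.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Predicate;

// Hypothetical sketch: these names are illustrative, not Tika's API.
class OcrEnabledSketch {

    /** Minimal stand-in for an OCR config carrying the proposed flag. */
    static final class Config {
        final boolean enabled;
        final String tesseractPath;

        Config(boolean enabled, String tesseractPath) {
            this.enabled = enabled;
            this.tesseractPath = tesseractPath;
        }
    }

    /**
     * Returns false immediately when OCR is disabled, so the launcher
     * (which stands in for forking the tesseract binary) is never called.
     */
    static boolean hasTesseract(Config config, Predicate<String> launcher) {
        if (!config.enabled) {
            return false;
        }
        // As noted in TIKA-1555, the executable name is appended to the path.
        return launcher.test(config.tesseractPath + "tesseract");
    }

    public static void main(String[] args) {
        AtomicInteger attempts = new AtomicInteger();
        // Fake launcher: records the attempt and reports failure.
        Predicate<String> failingLauncher = executable -> {
            attempts.incrementAndGet();
            return false;
        };

        // Workaround from TIKA-1555: a bad path still triggers a launch attempt.
        boolean viaBadPath = hasTesseract(new Config(true, "/no/such/dir/"), failingLauncher);
        // Proposed option: disabled means no launch attempt at all.
        boolean viaFlag = hasTesseract(new Config(false, ""), failingLauncher);

        // prints "false false attempts=1"
        System.out.println(viaBadPath + " " + viaFlag + " attempts=" + attempts.get());
    }
}
```

Both configurations report that Tesseract is unavailable, but only the bad-path variant reaches the launcher. That is the distinction raised in TIKA-1555: on a platform where process launching itself throws, only the flag avoids the failing call.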
[jira] [Commented] (TIKA-1555) posix_spawn is not a supported process launch mechanism on this platform
[ https://issues.apache.org/jira/browse/TIKA-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329486#comment-14329486 ] Tyler Palsulich commented on TIKA-1555: --- [~thetaphi], that's true. Please see TIKA-1557.