Re: Miredot License Key for Apache Tika Project
Dear Lewis,

The free licence key for the tika-server artifact is:

licenceUHJvamVjdHxvcmcuYXBhY2hlLnRpa2EudGlrYS1zZXJ2ZXJ8MjAxNi0wOC0wMXx0cnVlI01Dd0NGRklXRzRqRmNTZXNJb2laRElKZVF4RXpieUNTQWhSMHBmTzZCMUdMbDBPQ1B1WmJYQ3NpZElZSCtRPT0=/licence

This licence key is valid for two years (until August 1st, 2016). After this period has expired, you can request a new licence key if you wish. If you have any questions or remarks about using MireDot, do not hesitate to ask us.

Kind Regards,
Yves

Lewis John Mcgibbney wrote on 22/07/2014 22:54:

Hi Yves,

Thanks for the feedback on this one.

On Fri, Jul 18, 2014 at 11:38 PM, Yves Vandewoude yves.vandewo...@miredot.com wrote:

Due to the open source nature of Apache Tika, you are indeed granted free licence key(s) if you wish to use MireDot to document your REST API. If you provide me with the groupId/artifactId of the Maven module(s) in which your REST API interfaces reside, I will send you the key(s).

OK, so the code this will be used in resides here:
https://svn.apache.org/repos/asf/tika/trunk/tika-server
groupId - org.apache.tika
artifactId - tika-server

Also, we currently do not list our customers on the MireDot website. However, should we ever decide to do so, are we allowed to mention the Apache Foundation as a user?

We got no blocking feedback on the above, so I suggest that it is OK for the time being.

Thank you
Lewis
[jira] [Commented] (TIKA-1371) passing parameters via URL no longer works (regression)
[ https://issues.apache.org/jira/browse/TIKA-1371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071516#comment-14071516 ]

Sergey Beryozkin commented on TIKA-1371:

Can you please clarify what exactly does not work? Is the absolute request URI not logged? Can you please post a sample request URI issued against Tika 1.5 and explain what you expect the server to do?

Thanks, Sergey

passing parameters via URL no longer works (regression)
---
Key: TIKA-1371
URL: https://issues.apache.org/jira/browse/TIKA-1371
Project: Tika
Issue Type: Bug
Components: server
Affects Versions: 1.5
Reporter: Rob Tulloh

In Tika 1.1 and 1.2, it was possible to add some values to the URL that get logged like this:

http://localhost:9998/tika/GUID/FILENAME

This was very useful for correlating between client and server in a distributed compute environment. In 1.5 and in the nightly builds (for 1.6), this feature no longer works. Not having this makes it very difficult to troubleshoot problems with document processing in a distributed environment. Please add back this feature so that operations and development teams can more easily figure out which Tika instance is processing which document and what the result of the processing was.

--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (TIKA-1371) passing parameters via URL no longer works (regression)
[ https://issues.apache.org/jira/browse/TIKA-1371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071625#comment-14071625 ]

Rob Tulloh commented on TIKA-1371:

In Tika 1.2, we pass a URL http://HOST:9998/tika/GUID/FN that results in this kind of entry in the Tika log:

2014-07-23_06:53:16.57840 INFO: tika/10083908/n/a (text/html)

When we use the same approach in Tika 1.5

curl -T http://localhost:9998/123456/x.csv

we see this in the Tika log:

2014-07-23_11:23:33.92903 WARNING: No operation matching request path /tika/123456/x.csv is found, Relative Path: /123456/x.csv, HTTP Method: PUT, ContentType: */*, Accept: */*,. Please enable FINE/TRACE log level for more details.
2014-07-23_11:23:33.93747 Jul 23, 2014 6:23:33 AM org.apache.cxf.jaxrs.impl.WebApplicationExceptionMapper toResponse
2014-07-23_11:23:33.93749 WARNING: javax.ws.rs.ClientErrorException
2014-07-23_11:23:33.93750 at org.apache.cxf.jaxrs.utils.JAXRSUtils.findTargetMethod(JAXRSUtils.java:503)
2014-07-23_11:23:33.93753 at org.apache.cxf.jaxrs.interceptor.JAXRSInInterceptor.processRequest(JAXRSInInterceptor.java:218)
2014-07-23_11:23:33.93753 at org.apache.cxf.jaxrs.interceptor.JAXRSInInterceptor.handleMessage(JAXRSInInterceptor.java:90)
2014-07-23_11:23:33.93754 at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:272)
2014-07-23_11:23:33.93754 at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
2014-07-23_11:23:33.93755 at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.serviceRequest(JettyHTTPDestination.java:35
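The GUID/FILENAME convention discussed in this issue can be illustrated with a small standalone sketch. This is not Tika Server code: the class `CorrelationPath` and its methods are hypothetical, and the output format only mimics the Tika 1.2 log line quoted in this thread (`tika/10083908/n/a (text/html)`), on the assumption that missing segments are logged as `n/a`.

```java
/**
 * Hypothetical helper (not part of Tika): splits a request path of the form
 * /tika/GUID/FILENAME into its correlation parts so they can be logged.
 */
final class CorrelationPath {
    final String guid;
    final String fileName;

    private CorrelationPath(String guid, String fileName) {
        this.guid = guid;
        this.fileName = fileName;
    }

    /** Returns null when the path does not start with the /tika segment. */
    static CorrelationPath parse(String requestPath) {
        // A leading slash yields an empty first element: ["", "tika", "GUID", "FILENAME"]
        String[] parts = requestPath.split("/");
        if (parts.length < 2 || !parts[1].equals("tika")) {
            return null;
        }
        // Assumption: absent segments are reported as "n/a", as in the quoted 1.2 log line
        String guid = parts.length > 2 ? parts[2] : "n/a";
        String fileName = parts.length > 3 ? parts[3] : "n/a";
        return new CorrelationPath(guid, fileName);
    }

    /** Builds a log entry in the style of the Tika 1.2 line quoted in the issue. */
    String toLogLine(String contentType) {
        return "tika/" + guid + "/" + fileName + " (" + contentType + ")";
    }

    public static void main(String[] args) {
        CorrelationPath path = CorrelationPath.parse("/tika/123456/x.csv");
        System.out.println(path.toLogLine("text/csv"));
    }
}
```

Treating everything after `/tika` as opaque correlation data keeps the client free to choose its own identifiers; whether the server should accept such paths at all is the question the issue raises.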
[jira] [Commented] (TIKA-1371) passing parameters via URL no longer works (regression)
[ https://issues.apache.org/jira/browse/TIKA-1371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071631#comment-14071631 ]

Rob Tulloh commented on TIKA-1371:

Also, I was trying out the detect/stream URL that is documented on the Tika wiki http://wiki.apache.org/tika/TikaJAXRS and that doesn't seem to work either. I tried both tika/detect/stream and detect/stream:

2014-07-23_11:32:59.68485 WARNING: No operation matching request path /tika/detect/stream is found, Relative Path: /detect/stream, HTTP Method: PUT, ContentType: */*, Accept: */*,. Please enable FINE/TRACE log level for more details.
2014-07-23_11:33:21.14176 WARNING: No operation matching request path /detect/stream is found, Relative Path: /detect/stream, HTTP Method: PUT, ContentType: */*, Accept: */*,. Please enable FINE/TRACE log level for more details.
[jira] [Commented] (TIKA-1371) passing parameters via URL no longer works (regression)
[ https://issues.apache.org/jira/browse/TIKA-1371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071637#comment-14071637 ]

Rob Tulloh commented on TIKA-1371:

Also, we have been using Tika to externally pre-process documents before we index them into Solr. We are using Solr 4.7 and it seems that the additional meta-data that Tika 1.5 now produces is not being recognized by Solr. We purposely decouple Tika from Solr because we don't want document processing to destabilize indexing/search (separating concerns). When we use the supported URL format, we get back meta-data and content and this confuses Solr's language detection. This is a separate question, but I want to mention it as it may explain more about how we have been using Tika and the problems we have encountered with upgrading to Tika 1.5.
[jira] [Closed] (TIKA-1372) PDCheckbox NPE
[ https://issues.apache.org/jira/browse/TIKA-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison closed TIKA-1372.
-
Resolution: Fixed

[~tilman], thank you for notifying us. Yes, that was Tika's (well, my) fault. I fixed that thanks to a doc in govdocs1 and work on TIKA-1302. Tika SNAPSHOT works on the file submitted with PDFBOX-2218, and we should have Tika 1.6 out shortly.

[~mdhussain], thank you for submitting the issue to PDFBOX. Let us know if you are having problems with Tika trunk and your file.

PDCheckbox NPE
--
Key: TIKA-1372
URL: https://issues.apache.org/jira/browse/TIKA-1372
Project: Tika
Issue Type: Bug
Reporter: Tilman Hausherr

One of your users, [~mdhussain], opened PDFBOX-2218: PDF parsing fails for attached PDF.

Stack trace of failure:
{code}
Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@1747c
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at com.sabax.extraction.FileExtractionHandler.getFileData(FileExtractionHandler.java:145)
at GenerateIndex.main(GenerateIndex.java:59)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
Caused by: java.lang.NullPointerException
at org.apache.pdfbox.pdmodel.interactive.form.PDCheckbox.getOnValue(PDCheckbox.java:141)
at org.apache.pdfbox.pdmodel.interactive.form.PDCheckbox.isChecked(PDCheckbox.java:79)
at org.apache.pdfbox.pdmodel.interactive.form.PDRadioCollection.getValue(PDRadioCollection.java:128)
at org.apache.tika.parser.pdf.PDF2XHTML.addFieldString(PDF2XHTML.java:507)
at org.apache.tika.parser.pdf.PDF2XHTML.processAcroField(PDF2XHTML.java:461)
at org.apache.tika.parser.pdf.PDF2XHTML.processAcroField(PDF2XHTML.java:479)
at org.apache.tika.parser.pdf.PDF2XHTML.extractAcroForm(PDF2XHTML.java:447)
at org.apache.tika.parser.pdf.PDF2XHTML.endDocument(PDF2XHTML.java:195)
at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:341)
at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:106)
at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:143)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
... 9 more
{code}

Sample code used to parse:
{code}
TikaInputStream tikaStream = TikaInputStream.get(stream);
TikaResultWrapper result;
try {
    long streamSize = tikaStream.getLength();
    Metadata metadata = constructMetadata(fileName, mimeType, streamSize);
    // Only parse streams below the size limit; larger ones get metadata only
    if (streamSize < maxFileSize) {
        SamplingSaxHandler handler = new SamplingSaxHandler(samplingSize, metadata);
        handler.setBufferLimit(bufferSize);
        parser.parse(tikaStream, handler, metadata, new ParseContext());
        result = handler.getResult();
    } else {
        result = new TikaResultWrapper(null, metadata);
    }
} finally {
    tikaStream.close();
}
return result;
{code}
[jira] [Commented] (TIKA-1373) AutoDetectParser extracts no text when SourceCodeParser is selected
[ https://issues.apache.org/jira/browse/TIKA-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071643#comment-14071643 ]

Hong-Thai Nguyen commented on TIKA-1373:

Can you format your description with the {code} annotation? And if I understand correctly, the output of the first section is empty?

AutoDetectParser extracts no text when SourceCodeParser is selected
---
Key: TIKA-1373
URL: https://issues.apache.org/jira/browse/TIKA-1373
Project: Tika
Issue Type: Bug
Affects Versions: 1.5
Reporter: Andrés Aguilar-Umaña

When using the AutoDetectParser in Java code, and the SourceCodeParser is selected (i.e. Java files), the handler gets no text.

I have this test program:

String data = "public class HelloWorld {}";
ByteArrayInputStream bais = new ByteArrayInputStream(data.getBytes());
Parser autoDetectParser = new AutoDetectParser();
autoDetectParser = new SourceCodeParser();
BodyContentHandler bch = new BodyContentHandler(50);
ParseContext parseContext = new ParseContext();
Metadata metadata = new Metadata();
metadata.set(Metadata.CONTENT_TYPE, "text/x-java-source");
try {
    autoDetectParser.parse(bais, bch, metadata, parseContext);
} catch (Exception e) {
    e.printStackTrace();
}
System.out.println("Text extracted: " + bch.toString());

It returns (using the SourceCodeParser):

Text extracted:

But when I use this code:

String data = "public class HelloWorld {}";
ByteArrayInputStream bais = new ByteArrayInputStream(data.getBytes());
Parser autoDetectParser = new AutoDetectParser();
autoDetectParser = new SourceCodeParser();
BodyContentHandler bch = new BodyContentHandler(50);
ParseContext parseContext = new ParseContext();
Metadata metadata = new Metadata();
metadata.set(Metadata.CONTENT_TYPE, "text/plain");
try {
    autoDetectParser.parse(bais, bch, metadata, parseContext);
} catch (Exception e) {
    e.printStackTrace();
}
System.out.println("Text extracted: " + bch.toString());

The Text Parser is used and I get:

Text extracted: public class HelloWorld {}

I have also tested this command:

java -jar tika-app-1.5.jar -t D:\text.java (no text)
[jira] [Commented] (TIKA-1371) passing parameters via URL no longer works (regression)
[ https://issues.apache.org/jira/browse/TIKA-1371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071693#comment-14071693 ]

Nick Burch commented on TIKA-1371:

The /detect/stream URL ought to work; see DetectorResourceTest for examples of calling it that the tests use. Note that there's no /tika prefix on it.

The Tika Server now (post-1.5) is able to tell you in a very basic way what URLs it supports; you can use that to see what the URLs are / aren't.
[jira] [Commented] (TIKA-1373) AutoDetectParser extracts no text when SourceCodeParser is selected
[ https://issues.apache.org/jira/browse/TIKA-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071697#comment-14071697 ]

Nick Burch commented on TIKA-1373:

I've just tried it with svn trunk, and I think I see the issue there. The problem looks to be that we're getting back HTML when we ask for text:

{code}
$ tika --text tika-core/src/main/java/org/apache/tika/Tika.java | head
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="content-type" content="text/html; charset=ISO-8859-1" />
<meta name="generator" content="JHighlight v1.0 (http://jhighlight.dev.java.net)" />
<title>Tika.java</title>
<link rel="Help" href="http://jhighlight.dev.java.net" />
<style type="text/css">
.java_type {
{code}

Note that I've asked for text, but got HTML!
[jira] [Commented] (TIKA-1373) AutoDetectParser extracts no text when SourceCodeParser is selected
[ https://issues.apache.org/jira/browse/TIKA-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071713#comment-14071713 ]

Hong-Thai Nguyen commented on TIKA-1373:

Yes, I saw the trouble when implementing this parser. How can we tell that text is being requested instead of HTML? Is checking whether the handler is an instance of BodyContentHandler enough?
[jira] [Comment Edited] (TIKA-1373) AutoDetectParser extracts no text when SourceCodeParser is selected
[ https://issues.apache.org/jira/browse/TIKA-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071643#comment-14071643 ]

Hong-Thai Nguyen edited comment on TIKA-1373 at 7/23/14 1:42 PM:

Can you format your description with {noformat}{code}{noformat} annotation and if I understand well the output of 1st section is empty ?

was (Author: thaichat04):
Can you format your description with {code} annotation and if I understand well the output of 1st section is empty ?
[jira] [Updated] (TIKA-1373) AutoDetectParser extracts no text when SourceCodeParser is selected
[ https://issues.apache.org/jira/browse/TIKA-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tyler Palsulich updated TIKA-1373:
--
Description:

When using the AutoDetectParser in Java code, and the SourceCodeParser is selected (i.e. Java files), the handler gets no text.

I have this test program:
{code}
String data = "public class HelloWorld {}";
ByteArrayInputStream bais = new ByteArrayInputStream(data.getBytes());
Parser autoDetectParser = new AutoDetectParser();
autoDetectParser = new SourceCodeParser();
BodyContentHandler bch = new BodyContentHandler(50);
ParseContext parseContext = new ParseContext();
Metadata metadata = new Metadata();
metadata.set(Metadata.CONTENT_TYPE, "text/x-java-source");
try {
    autoDetectParser.parse(bais, bch, metadata, parseContext);
} catch (Exception e) {
    e.printStackTrace();
}
System.out.println("Text extracted: " + bch.toString());
{code}

It returns (using the SourceCodeParser):
{code}
Text extracted:
{code}

But when I use this code:
{code}
String data = "public class HelloWorld {}";
ByteArrayInputStream bais = new ByteArrayInputStream(data.getBytes());
Parser autoDetectParser = new AutoDetectParser();
autoDetectParser = new SourceCodeParser();
BodyContentHandler bch = new BodyContentHandler(50);
ParseContext parseContext = new ParseContext();
Metadata metadata = new Metadata();
metadata.set(Metadata.CONTENT_TYPE, "text/plain");
try {
    autoDetectParser.parse(bais, bch, metadata, parseContext);
} catch (Exception e) {
    e.printStackTrace();
}
System.out.println("Text extracted: " + bch.toString());
{code}

The Text Parser is used and I get:
{code}
Text extracted: public class HelloWorld {}
{code}

I have also tested this command:
{code}
java -jar tika-app-1.5.jar -t D:\text.java (no text)
{code}
[jira] [Commented] (TIKA-1371) passing parameters via URL no longer works (regression)
[ https://issues.apache.org/jira/browse/TIKA-1371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071723#comment-14071723 ]

Rob Tulloh commented on TIKA-1371:

curl -T x.csv http://localhost:9998/detect/stream

results in

2014-07-23_11:33:21.14176 WARNING: No operation matching request path /detect/stream is found, Relative Path: /detect/stream, HTTP Method: PUT, ContentType: */*, Accept: */*,. Please enable FINE/TRACE log level for more details.
[jira] [Commented] (TIKA-1373) AutoDetectParser extracts no text when SourceCodeParser is selected
[ https://issues.apache.org/jira/browse/TIKA-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071744#comment-14071744 ]

Nick Burch commented on TIKA-1373:

{quote}Yes, I saw the trouble when implementing this parser. How can we get that we are asking for text instead of HTML?{quote}

It isn't up to the parser. The parser should be outputting its HTML as SAX events to the handler. Text-only extraction happens downstream.

It would seem that at the moment, the parser's generated HTML isn't being correctly turned into / passed as SAX events to the content handler.
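The point that text-only extraction happens downstream of the parser's SAX events can be sketched with plain JAXP, without any Tika classes. This is only an illustration of the idea, not how Tika's BodyContentHandler is implemented: `TextOnlyHandler` and `extractText` are hypothetical names, and the markup fed in is a made-up stand-in for the HTML a highlighting parser might emit.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/**
 * Hypothetical handler (not a Tika class): ignores element events and keeps
 * only character data, so markup produced upstream becomes plain text here.
 */
final class TextOnlyHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        // Element start/end events are dropped; only text content is collected
        text.append(ch, start, length);
    }

    String getText() {
        return text.toString();
    }

    /** Feeds well-formed markup through a SAX parser and returns the text seen by the handler. */
    static String extractText(String markup) {
        TextOnlyHandler handler = new TextOnlyHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(markup)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse markup", e);
        }
        return handler.getText();
    }

    public static void main(String[] args) {
        String markup = "<html><body><p>public class <b>HelloWorld</b> {}</p></body></html>";
        System.out.println("Text extracted: " + extractText(markup));
    }
}
```

The producer of the events never needs to know whether the consumer wants text or HTML; the choice is made entirely by which handler receives the events.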
[jira] [Commented] (TIKA-1371) passing parameters via URL no longer works (regression)
[ https://issues.apache.org/jira/browse/TIKA-1371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14071748#comment-14071748 ] Nick Burch commented on TIKA-1371: -- I've just tried with a recent nightly build, and it worked fine: {code}$ curl -T test.xlsx http://localhost:9998/detect/stream application/vnd.openxmlformats-officedocument.spreadsheetml.sheet{code} There has been a load of work on the Tika Server since 1.5, so make sure you're using a recent nightly build / svn trunk checkout build. passing parameters via URL no longer works (regression) --- Key: TIKA-1371 URL: https://issues.apache.org/jira/browse/TIKA-1371 Project: Tika Issue Type: Bug Components: server Affects Versions: 1.5 Reporter: Rob Tulloh In Tika 1.1 and 1.2, it was possible to add some values to the URL that get logged like this: http://localhost:9998/tika/GUID/FILENAME This was very useful for correlating between client and server in a distributed compute environment. In 1.5 and in the nightly builds (for 1.6), this feature no longer works. Not having this makes it very difficult to troubleshoot problems with document processing in a distributed environment. Please add back this feature so that operations and development teams can more easily figure out which tika instance is processing which document and what the processing resulted in. -- This message was sent by Atlassian JIRA (v6.2#6252)
Re: How should video files with audio be handled by parsers?
On Tue, 22 Jul 2014, Ray Gauss wrote: The info on what the streams are and how they relate can be conveyed via PBCore, i.e.: pbcore:instantiationTracks=1 video track, English and Spanish audio, Director's commentary audio Ah, that's good. Looks a sensible enough and easy to follow standard to crib from ... pbcore:instantiationEssenceTrack[0]/pbcore:essenceTrackType=Video ... pbcore:instantiationEssenceTrack[1]/pbcore:essenceTrackType=Audio pbcore:instantiationEssenceTrack[1]/pbcore:essenceTrackLanguage=eng I'm not quite so keen on these metadata keys though. Do we gain anything from this long form, vs stream[0]/pbcore:essenceTrackType=Video stream[1]/pbcore:essenceTrackType=Audio stream[1]/pbcore:essenceTrackLanguage=eng ? The current FFmpeg parser wouldn't be able to extract things like annotations, but it was only targeting the intrinsic metadata. The Ogg parsers should be able to output that fairly easily, the only reason they don't is that I didn't know what to output as! Nick
Re: Miredot License Key for Apache Tika Project
Fantastic, thank you. Best Lewis On Wed, Jul 23, 2014 at 12:00 AM, Yves Vandewoude yves.vandewo...@miredot.com wrote: Dear Lewis, The free licence key for the tika-server artifact is: licenceUHJvamVjdHxvcmcuYXBhY2hlLnRpa2EudGlrYS1zZXJ2ZXJ8MjAxNi0wOC0wMXx0cnVlI01Dd0NGRklXRzRqRmNTZXNJb2laRElKZVF4RXpieUNTQWhSMHBmTzZCMUdMbDBPQ1B1WmJYQ3NpZElZSCtRPT0=/licence This licence key is valid for two years (until August 1st, 2016). After this period has expired, you can request a new licence key if you wish. If you have any questions or remarks about using MireDot, do not hesitate to ask us. Kind Regards, Yves Lewis John Mcgibbney wrote on 22/07/2014 22:54: Hi Yves, Thanks for the feedback on this one. On Fri, Jul 18, 2014 at 11:38 PM, Yves Vandewoude yves.vandewo...@miredot.com wrote: Due to the open source nature of Apache Tika, you are indeed granted free licence key(s) if you wish to use MireDot to document your REST API. If you provide me with the groupid/artifactid of the maven module(s) in which your rest api interfaces reside, I will send you the key(s). OK so the code this will be used in resides here https://svn.apache.org/repos/asf/tika/trunk/tika-server groupId - org.apache.tika artifactId - tika-server Also, we currently do not list our customers on the MireDot website. However, should we ever decide to do so, are we allowed to mention the Apache Foundation as a user? We got no blocking feedback on the above, so I suggest that it is OK for the time being. Thank you Lewis -- *Lewis*
[jira] [Commented] (TIKA-1371) passing parameters via URL no longer works (regression)
[ https://issues.apache.org/jira/browse/TIKA-1371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14071893#comment-14071893 ] Rob Tulloh commented on TIKA-1371: -- OK, so it seems fixed in the pending 1.6 release. That is good news. Now, going back to the original question, how can we get details of the document we are indexing into the log file, as was possible in 1.2? Is there a way we can tell Tika server to log the GUID (which the client specifies) and the filename? This was easy to do in 1.2 because it was part of the URL and Tika logged this information by default. Another useful feature would be to allow the server to return not XHTML, but just the body text (à la --text on the tika-app). Since our service is using Tika server to pre-process all the documents in a separate service, it would be helpful if Tika server had an option to just return the body text and not the full XHTML. I have code that will parse the result document and extract the body, but it makes more sense to allow the Tika service to just return this the way it used to in previous versions. I am particularly concerned because this post-processing of the XHTML adds memory overhead to our server JVM. So, an option to launch the Tika server with an option like --text (or a URL like tika/text) and have it just return the body content would make it compatible with previous versions that did this. Then our application logic would be much simpler and the Solr integration would work as it did with Tika server 1.2. 
passing parameters via URL no longer works (regression) --- Key: TIKA-1371 URL: https://issues.apache.org/jira/browse/TIKA-1371 Project: Tika Issue Type: Bug Components: server Affects Versions: 1.5 Reporter: Rob Tulloh In Tika 1.1 and 1.2, it was possible to add some values to the URL that get logged like this: http://localhost:9998/tika/GUID/FILENAME This was very useful for correlating between client and server in a distributed compute environment. In 1.5 and in the nightly builds (for 1.6), this feature no longer works. Not having this makes it very difficult to troubleshoot problems with document processing in a distributed environment. Please add back this feature so that operations and development teams can more easily figure out which tika instance is processing which document and what the processing resulted in. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (TIKA-1373) AutoDetectParser extracts no text when SourceCodeParser is selected
[ https://issues.apache.org/jira/browse/TIKA-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrés Aguilar-Umaña updated TIKA-1373: --- Description: When using the AutoDetectParser in java code, and the SourceCodeParser is selected (i.e. java files), the handler gets no text: I have this test program: {code} String data = public class HelloWorld {}; ByteArrayInputStream bais = new ByteArrayInputStream(data.getBytes()); Parser autoDetectParser = new AutoDetectParser(); BodyContentHandler bch = new BodyContentHandler(50); ParseContext parseContext = new ParseContext(); Metadata metadata = new Metadata(); metadata.set(Metadata.CONTENT_TYPE, text/x-java-source); try { autoDetectParser.parse(bais, bch, metadata, parseContext); } catch (Exception e) { e.printStackTrace(); } System.out.println(Text extracted: +bch.toString()) {code} It returns (using the SourceCodeParser): {code} Text extracted: {code} But when I use this code: {code} String data = public class HelloWorld {}; ByteArrayInputStream bais = new ByteArrayInputStream(data.getBytes()); Parser autoDetectParser = new AutoDetectParser(); BodyContentHandler bch = new BodyContentHandler(50); ParseContext parseContext = new ParseContext(); Metadata metadata = new Metadata(); metadata.set(Metadata.CONTENT_TYPE, text/plain); try { autoDetectParser.parse(bais, bch, metadata, parseContext); } catch (Exception e) { e.printStackTrace(); } System.out.println(Text extracted: +bch.toString()) {code} The Text Parser is used and I get: {code} Text extracted: public class HelloWorld {} {code} I have also tested this command: {code} java -jar tika-app-1.5.jar -t D:\text.java (no text) {code} was: When using the AutoDetectParser in java code, and the SourceCodeParser is selected (i.e. 
java files), the handler gets no text: I have this test program: {code} String data = public class HelloWorld {}; ByteArrayInputStream bais = new ByteArrayInputStream(data.getBytes()); Parser autoDetectParser = new AutoDetectParser(); autoDetectParser = new SourceCodeParser(); BodyContentHandler bch = new BodyContentHandler(50); ParseContext parseContext = new ParseContext(); Metadata metadata = new Metadata(); metadata.set(Metadata.CONTENT_TYPE, text/x-java-source); try { autoDetectParser.parse(bais, bch, metadata, parseContext); } catch (Exception e) { e.printStackTrace(); } System.out.println(Text extracted: +bch.toString()) {code} It returns (using the SourceCodeParser): {code} Text extracted: {code} But when I use this code: {code} String data = public class HelloWorld {}; ByteArrayInputStream bais = new ByteArrayInputStream(data.getBytes()); Parser autoDetectParser = new AutoDetectParser(); autoDetectParser = new SourceCodeParser(); BodyContentHandler bch = new BodyContentHandler(50); ParseContext parseContext = new ParseContext(); Metadata metadata = new Metadata(); metadata.set(Metadata.CONTENT_TYPE, text/plain); try { autoDetectParser.parse(bais, bch, metadata, parseContext); } catch (Exception e) { e.printStackTrace(); } System.out.println(Text extracted: +bch.toString()) {code} The Text Parser is used and I get: {code} Text extracted: public class HelloWorld {} {code} I have also tested this command: {code} java -jar tika-app-1.5.jar -t D:\text.java (no text) {code} AutoDetectParser extracts no text when SourceCodeParser is selected --- Key: TIKA-1373 URL: https://issues.apache.org/jira/browse/TIKA-1373 Project: Tika Issue Type: Bug Affects Versions: 1.5 Reporter: Andrés Aguilar-Umaña When using the AutoDetectParser in java code, and the SourceCodeParser is selected (i.e. 
java files), the handler gets no text: I have this test program: {code} String data = public class HelloWorld {}; ByteArrayInputStream bais = new ByteArrayInputStream(data.getBytes()); Parser autoDetectParser = new AutoDetectParser(); BodyContentHandler bch = new BodyContentHandler(50); ParseContext parseContext = new ParseContext(); Metadata metadata = new Metadata(); metadata.set(Metadata.CONTENT_TYPE, text/x-java-source); try { autoDetectParser.parse(bais, bch, metadata, parseContext); } catch (Exception e) { e.printStackTrace(); } System.out.println(Text extracted: +bch.toString()) {code} It returns (using the SourceCodeParser): {code} Text extracted: {code} But when I use this code: {code} String data = public class HelloWorld {}; ByteArrayInputStream bais = new ByteArrayInputStream(data.getBytes()); Parser autoDetectParser = new
[jira] [Commented] (TIKA-1373) AutoDetectParser extracts no text when SourceCodeParser is selected
[ https://issues.apache.org/jira/browse/TIKA-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14071942#comment-14071942 ] Tyler Palsulich commented on TIKA-1373: --- The only SAX event in SourceCodeParser is {{xhtml.element(p, codeAsHtml);}}. codeAsHtml is formatted by jhighlight, a syntax highlighter. So, it _looks_ like the --text is returning the text, but it's just that the text content is html. I'm not sure how we can turn the jhighlight html tags into SAX events. Tika HtmlParser? Something like {code} XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata); Renderer renderer = getRenderer(type.toString()); String content = out.toString(); String codeAsHtml = renderer.highlight(name, content, charset.name(), false); HtmlParser htmlParser = new HtmlParser(); htmlParser.parse(new ByteArrayInputStream(content.getBytes()), xhtml, metadata, context); {code} AutoDetectParser extracts no text when SourceCodeParser is selected --- Key: TIKA-1373 URL: https://issues.apache.org/jira/browse/TIKA-1373 Project: Tika Issue Type: Bug Affects Versions: 1.5 Reporter: Andrés Aguilar-Umaña When using the AutoDetectParser in java code, and the SourceCodeParser is selected (i.e. 
java files), the handler gets no text: I have this test program: {code} String data = public class HelloWorld {}; ByteArrayInputStream bais = new ByteArrayInputStream(data.getBytes()); Parser autoDetectParser = new AutoDetectParser(); BodyContentHandler bch = new BodyContentHandler(50); ParseContext parseContext = new ParseContext(); Metadata metadata = new Metadata(); metadata.set(Metadata.CONTENT_TYPE, text/x-java-source); try { autoDetectParser.parse(bais, bch, metadata, parseContext); } catch (Exception e) { e.printStackTrace(); } System.out.println(Text extracted: +bch.toString()) {code} It returns (using the SourceCodeParser): {code} Text extracted: {code} But when I use this code: {code} String data = public class HelloWorld {}; ByteArrayInputStream bais = new ByteArrayInputStream(data.getBytes()); Parser autoDetectParser = new AutoDetectParser(); BodyContentHandler bch = new BodyContentHandler(50); ParseContext parseContext = new ParseContext(); Metadata metadata = new Metadata(); metadata.set(Metadata.CONTENT_TYPE, text/plain); try { autoDetectParser.parse(bais, bch, metadata, parseContext); } catch (Exception e) { e.printStackTrace(); } System.out.println(Text extracted: +bch.toString()) {code} The Text Parser is used and I get: {code} Text extracted: public class HelloWorld {} {code} I have also tested this command: {code} java -jar tika-app-1.5.jar -t D:\text.java (no text) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
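To illustrate the distinction discussed in the comments on this issue (markup handed to the handler as plain characters versus markup turned into SAX events), here is a minimal sketch using only JDK classes, not Tika. The names `SaxTextSketch`, `TextCollector`, `asCharacters` and `asSaxEvents` are hypothetical, the handler only approximates what a text-only content handler might do, and the markup string is a made-up stand-in rather than real jhighlight output.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxTextSketch {

    /** Collects only character data and ignores element events (text-only extraction). */
    static class TextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    /** Hands the highlighted markup to the handler as one opaque run of characters. */
    static String asCharacters(String markup) throws Exception {
        TextCollector handler = new TextCollector();
        handler.startDocument();
        handler.characters(markup.toCharArray(), 0, markup.length());
        handler.endDocument();
        return handler.text.toString();
    }

    /** Parses the markup so tags become element events and only text reaches characters(). */
    static String asSaxEvents(String markup) throws Exception {
        TextCollector handler = new TextCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(markup)), handler);
        return handler.text.toString();
    }

    public static void main(String[] args) throws Exception {
        // Made-up, well-formed stand-in for syntax-highlighter output (an assumption).
        String markup = "<p><span class=\"kw\">public</span> class HelloWorld {}</p>";
        System.out.println(asCharacters(markup)); // tags survive as text
        System.out.println(asSaxEvents(markup));  // tags are consumed as events
    }
}
```

In the first path the tags end up inside the extracted text; in the second, a downstream text-only handler sees just the source code. Whether this fully accounts for the empty output in the report is not settled in the thread.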
[jira] [Comment Edited] (TIKA-1373) AutoDetectParser extracts no text when SourceCodeParser is selected
[ https://issues.apache.org/jira/browse/TIKA-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14071942#comment-14071942 ] Tyler Palsulich edited comment on TIKA-1373 at 7/23/14 4:52 PM: The only SAX event in SourceCodeParser is {{xhtml.element(p, codeAsHtml);}}. codeAsHtml is formatted by jhighlight, a syntax highlighter. So, it _looks_ like the --text isn't returning the text, but it's just that the text content is html. I'm not sure how we can turn the jhighlight html tags into SAX events. Tika HtmlParser? Something like {code} XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata); Renderer renderer = getRenderer(type.toString()); String content = out.toString(); String codeAsHtml = renderer.highlight(name, content, charset.name(), false); HtmlParser htmlParser = new HtmlParser(); htmlParser.parse(new ByteArrayInputStream(content.getBytes()), xhtml, metadata, context); {code} was (Author: tpalsulich): The only SAX event in SourceCodeParser is {{xhtml.element(p, codeAsHtml);}}. codeAsHtml is formatted by jhighlight, a syntax highlighter. So, it _looks_ like the --text is returning the text, but it's just that the text content is html. I'm not sure how we can turn the jhighlight html tags into SAX events. Tika HtmlParser? 
Something like {code} XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata); Renderer renderer = getRenderer(type.toString()); String content = out.toString(); String codeAsHtml = renderer.highlight(name, content, charset.name(), false); HtmlParser htmlParser = new HtmlParser(); htmlParser.parse(new ByteArrayInputStream(content.getBytes()), xhtml, metadata, context); {code} AutoDetectParser extracts no text when SourceCodeParser is selected --- Key: TIKA-1373 URL: https://issues.apache.org/jira/browse/TIKA-1373 Project: Tika Issue Type: Bug Affects Versions: 1.5 Reporter: Andrés Aguilar-Umaña When using the AutoDetectParser in java code, and the SourceCodeParser is selected (i.e. java files), the handler gets no text: I have this test program: {code} String data = public class HelloWorld {}; ByteArrayInputStream bais = new ByteArrayInputStream(data.getBytes()); Parser autoDetectParser = new AutoDetectParser(); BodyContentHandler bch = new BodyContentHandler(50); ParseContext parseContext = new ParseContext(); Metadata metadata = new Metadata(); metadata.set(Metadata.CONTENT_TYPE, text/x-java-source); try { autoDetectParser.parse(bais, bch, metadata, parseContext); } catch (Exception e) { e.printStackTrace(); } System.out.println(Text extracted: +bch.toString()) {code} It returns (using the SourceCodeParser): {code} Text extracted: {code} But when I use this code: {code} String data = public class HelloWorld {}; ByteArrayInputStream bais = new ByteArrayInputStream(data.getBytes()); Parser autoDetectParser = new AutoDetectParser(); BodyContentHandler bch = new BodyContentHandler(50); ParseContext parseContext = new ParseContext(); Metadata metadata = new Metadata(); metadata.set(Metadata.CONTENT_TYPE, text/plain); try { autoDetectParser.parse(bais, bch, metadata, parseContext); } catch (Exception e) { e.printStackTrace(); } System.out.println(Text extracted: +bch.toString()) {code} The Text Parser is used and I get: {code} Text extracted: public 
class HelloWorld {} {code} I have also tested this command: {code} java -jar tika-app-1.5.jar -t D:\text.java (no text) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (TIKA-1371) passing parameters via URL no longer works (regression)
[ https://issues.apache.org/jira/browse/TIKA-1371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14071953#comment-14071953 ] Nick Burch commented on TIKA-1371: -- {quote}Another useful feature would be to allow the server to return not XHTML, but just the body text{quote} That's already supported though! Just send an Accept header of text/plain and you'll get that back. See the "Get the Text of a Document" section on the wiki for examples. passing parameters via URL no longer works (regression) --- Key: TIKA-1371 URL: https://issues.apache.org/jira/browse/TIKA-1371 Project: Tika Issue Type: Bug Components: server Affects Versions: 1.5 Reporter: Rob Tulloh In Tika 1.1 and 1.2, it was possible to add some values to the URL that get logged like this: http://localhost:9998/tika/GUID/FILENAME This was very useful for correlating between client and server in a distributed compute environment. In 1.5 and in the nightly builds (for 1.6), this feature no longer works. Not having this makes it very difficult to troubleshoot problems with document processing in a distributed environment. Please add back this feature so that operations and development teams can more easily figure out which tika instance is processing which document and what the processing resulted in. -- This message was sent by Atlassian JIRA (v6.2#6252)
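For Java clients, the Accept header suggestion might look roughly like the sketch below. It only builds the request and inspects it; nothing is sent. The class and method names are hypothetical, and the use of PUT against http://localhost:9998/tika is an assumption based on the curl -T call and the URLs quoted in this thread, so check the wiki for the endpoint your server version expects.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

public class PlainTextRequestSketch {

    /** Builds (but does not send) a PUT request that asks the server for plain text. */
    static HttpRequest buildRequest(String serverUrl, byte[] document) {
        return HttpRequest.newBuilder(URI.create(serverUrl))
                .header("Accept", "text/plain") // ask for body text rather than XHTML
                .PUT(HttpRequest.BodyPublishers.ofByteArray(document))
                .build();
    }

    public static void main(String[] args) {
        byte[] doc = "public class HelloWorld {}".getBytes(StandardCharsets.UTF_8);
        // Endpoint is an assumption taken from the URLs mentioned in this thread.
        HttpRequest request = buildRequest("http://localhost:9998/tika", doc);
        System.out.println(request.method() + " " + request.uri());
        System.out.println("Accept: " + request.headers().firstValue("Accept").orElse(""));
        // To actually send it (assumes a Tika server is listening on that port):
        // HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    }
}
```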
[jira] [Commented] (TIKA-1371) passing parameters via URL no longer works (regression)
[ https://issues.apache.org/jira/browse/TIKA-1371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14072046#comment-14072046 ] Rob Tulloh commented on TIKA-1371: -- Perfect! Though I have to say the examples on the wiki are not exactly clear. Just tried this and it worked as expected. Thank you. One more thought on the URL: if Tika just logged what it received, we could pass the extra context via parameters like this: http://localhost:9998/tika?guid=XXX I tested this and it produced the correct result. If Tika server could just log the URL received, that would help immensely in the distributed world where many clients interact with Tika server and correlation is useful for diagnosing what was sent where. passing parameters via URL no longer works (regression) --- Key: TIKA-1371 URL: https://issues.apache.org/jira/browse/TIKA-1371 Project: Tika Issue Type: Bug Components: server Affects Versions: 1.5 Reporter: Rob Tulloh In Tika 1.1 and 1.2, it was possible to add some values to the URL that get logged like this: http://localhost:9998/tika/GUID/FILENAME This was very useful for correlating between client and server in a distributed compute environment. In 1.5 and in the nightly builds (for 1.6), this feature no longer works. Not having this makes it very difficult to troubleshoot problems with document processing in a distributed environment. Please add back this feature so that operations and development teams can more easily figure out which tika instance is processing which document and what the processing resulted in. -- This message was sent by Atlassian JIRA (v6.2#6252)
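On the client side, attaching a correlation id as a query parameter could be sketched as follows. This covers only building the URL; whether the server logs the full request URI is exactly the open question in this issue. `CorrelationUrlSketch` and `withGuid` are hypothetical names, and the parameter name `guid` is taken from the example URL above.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.UUID;

public class CorrelationUrlSketch {

    /** Appends a URL-encoded guid query parameter to the server endpoint. */
    static URI withGuid(String endpoint, String guid) {
        String encoded = URLEncoder.encode(guid, StandardCharsets.UTF_8);
        return URI.create(endpoint + "?guid=" + encoded);
    }

    public static void main(String[] args) {
        // A random id stands in for whatever identifier the client already tracks.
        String guid = UUID.randomUUID().toString();
        System.out.println(withGuid("http://localhost:9998/tika", guid));
    }
}
```

Encoding the value keeps the URL valid even if the identifier (or a filename passed the same way) contains spaces or other reserved characters.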
[jira] [Updated] (TIKA-1191) ForkParser / ClassLoaderProxy does not define package
[ https://issues.apache.org/jira/browse/TIKA-1191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicolas Belisle updated TIKA-1191: -- Attachment: test.eml Test.java Test for ForkParser exception. ForkParser / ClassLoaderProxy does not define package - Key: TIKA-1191 URL: https://issues.apache.org/jira/browse/TIKA-1191 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.4 Reporter: Nicolas Belisle Attachments: ClassLoaderProxy.java.patch, Test.java, test.eml ForkParser will throw an Exception in some cases : org.apache.tika.exception.TikaException: Invalid embedded resource at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc(AbstractPOIFSExtractor.java:189) at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:135) at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:186) at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.tika.fork.ForkServer.call(ForkServer.java:144) at org.apache.tika.fork.ForkServer.processRequests(ForkServer.java:124) at org.apache.tika.fork.ForkServer.main(ForkServer.java:69) Caused by: java.lang.NullPointerException at org.apache.tika.mime.MimeTypesFactory.create(MimeTypesFactory.java:136) at org.apache.tika.mime.MimeTypes.getDefaultMimeTypes(MimeTypes.java:499) at org.apache.tika.config.TikaConfig.getDefaultMimeTypes(TikaConfig.java:60) at org.apache.tika.config.TikaConfig.init(TikaConfig.java:169) at org.apache.tika.config.TikaConfig.getDefaultConfig(TikaConfig.java:268) at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.getTikaConfig(AbstractPOIFSExtractor.java:72) at 
org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.getDetector(AbstractPOIFSExtractor.java:79) at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc(AbstractPOIFSExtractor.java:176) ... 10 more A patch will follow -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (TIKA-1191) ForkParser / ClassLoaderProxy does not define package
[ https://issues.apache.org/jira/browse/TIKA-1191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14072104#comment-14072104 ] Nicolas Belisle commented on TIKA-1191: --- I was able to reproduce a similar issue with another file using Tika 1.5. See attached eml.test and the test (Test.java). The exception : Exception in thread main org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.mail.RFC822Parser@6743bc0f at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at java.lang.reflect.Method.invoke(Unknown Source) at org.apache.tika.fork.ForkServer.call(ForkServer.java:144) at org.apache.tika.fork.ForkServer.processRequests(ForkServer.java:124) at org.apache.tika.fork.ForkServer.main(ForkServer.java:69) Caused by: java.lang.NullPointerException at org.apache.tika.mime.MimeTypesFactory.create(MimeTypesFactory.java:158) at org.apache.tika.mime.MimeTypes.getDefaultMimeTypes(MimeTypes.java:516) at org.apache.tika.config.TikaConfig.getDefaultMimeTypes(TikaConfig.java:60) at org.apache.tika.config.TikaConfig.init(TikaConfig.java:169) at org.apache.tika.config.TikaConfig.getDefaultConfig(TikaConfig.java:268) at org.apache.tika.parser.AutoDetectParser.init(AutoDetectParser.java:51) at org.apache.tika.parser.mail.RFC822Parser.adaptedExtractMultipart(RFC822Parser.java:167) at org.apache.tika.parser.mail.RFC822Parser.adaptedExtractMultipart(RFC822Parser.java:156) at org.apache.tika.parser.mail.RFC822Parser.parse(RFC822Parser.java:101) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) ... 
9 more ForkParser / ClassLoaderProxy does not define package - Key: TIKA-1191 URL: https://issues.apache.org/jira/browse/TIKA-1191 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.4 Reporter: Nicolas Belisle Attachments: ClassLoaderProxy.java.patch, Test.java, test.eml ForkParser will throw an Exception in some cases : org.apache.tika.exception.TikaException: Invalid embedded resource at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc(AbstractPOIFSExtractor.java:189) at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:135) at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:186) at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.tika.fork.ForkServer.call(ForkServer.java:144) at org.apache.tika.fork.ForkServer.processRequests(ForkServer.java:124) at org.apache.tika.fork.ForkServer.main(ForkServer.java:69) Caused by: java.lang.NullPointerException at org.apache.tika.mime.MimeTypesFactory.create(MimeTypesFactory.java:136) at org.apache.tika.mime.MimeTypes.getDefaultMimeTypes(MimeTypes.java:499) at org.apache.tika.config.TikaConfig.getDefaultMimeTypes(TikaConfig.java:60) at org.apache.tika.config.TikaConfig.init(TikaConfig.java:169) at org.apache.tika.config.TikaConfig.getDefaultConfig(TikaConfig.java:268) at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.getTikaConfig(AbstractPOIFSExtractor.java:72) at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.getDetector(AbstractPOIFSExtractor.java:79) at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc(AbstractPOIFSExtractor.java:176) ... 
10 more A patch will follow -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (TIKA-1191) ForkParser / ClassLoaderProxy does not define package
[ https://issues.apache.org/jira/browse/TIKA-1191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicolas Belisle updated TIKA-1191: -- Affects Version/s: 1.5 ForkParser / ClassLoaderProxy does not define package - Key: TIKA-1191 URL: https://issues.apache.org/jira/browse/TIKA-1191 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.4, 1.5 Reporter: Nicolas Belisle Attachments: ClassLoaderProxy.java.patch, Test.java, test.eml ForkParser will throw an Exception in some cases : org.apache.tika.exception.TikaException: Invalid embedded resource at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc(AbstractPOIFSExtractor.java:189) at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:135) at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:186) at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.tika.fork.ForkServer.call(ForkServer.java:144) at org.apache.tika.fork.ForkServer.processRequests(ForkServer.java:124) at org.apache.tika.fork.ForkServer.main(ForkServer.java:69) Caused by: java.lang.NullPointerException at org.apache.tika.mime.MimeTypesFactory.create(MimeTypesFactory.java:136) at org.apache.tika.mime.MimeTypes.getDefaultMimeTypes(MimeTypes.java:499) at org.apache.tika.config.TikaConfig.getDefaultMimeTypes(TikaConfig.java:60) at org.apache.tika.config.TikaConfig.init(TikaConfig.java:169) at org.apache.tika.config.TikaConfig.getDefaultConfig(TikaConfig.java:268) at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.getTikaConfig(AbstractPOIFSExtractor.java:72) at 
org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.getDetector(AbstractPOIFSExtractor.java:79) at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc(AbstractPOIFSExtractor.java:176) ... 10 more A patch will follow -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (TIKA-1269) Self-hosted documentation for the JAX-RS Server
[ https://issues.apache.org/jira/browse/TIKA-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated TIKA-1269: --- Attachment: TIKA-1269-miredot.patch Patch for enabling recent free license from Miredot. The documentation looks great with a fully navigable tree for clean and efficient navigation of tika-server documentation. I propose to commit this patch, hook it up to the Jenkins build and then open a subsequent issue to fully document all methods in tika-server with appropriate Javadoc. The license key is documented within the patch and is valid for around 2 years... around then we can apply for a new key. Self-hosted documentation for the JAX-RS Server --- Key: TIKA-1269 URL: https://issues.apache.org/jira/browse/TIKA-1269 Project: Tika Issue Type: Improvement Components: server Affects Versions: 1.5 Reporter: Nick Burch Fix For: 1.7 Attachments: TIKA-1269-miredot.patch, enable-enunciate.patch Currently, if you fire up the JAX-RS Tika Server, and go to the root of the server in a web browser, you get an empty page back. You have to know to head over to https://wiki.apache.org/tika/TikaJAXRS to find out what the available URLs are. We should self-host some simple documentation on the server at the root of it, so that people can discover what it offers. Ideally, this should be largely auto-generated based on the endpoints, so that we don't risk missing things when we add new features. This will also allow us to potentially offer a sample running version of the server for people to discover Tika with. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (TIKA-1269) Self-hosted documentation for the JAX-RS Server
[ https://issues.apache.org/jira/browse/TIKA-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14072311#comment-14072311 ] Nick Burch commented on TIKA-1269: -- I guess we'll need a bit of Maven / Ant-in-Maven magic to bundle the generated html up into the Tika Server jar, so it can be hosted from within the Server? Self-hosted documentation for the JAX-RS Server --- Key: TIKA-1269 URL: https://issues.apache.org/jira/browse/TIKA-1269 Project: Tika Issue Type: Improvement Components: server Affects Versions: 1.5 Reporter: Nick Burch Fix For: 1.7 Attachments: TIKA-1269-miredot.patch, enable-enunciate.patch Currently, if you fire up the JAX-RS Tika Server, and go to the root of the server in a web browser, you get an empty page back. You have to know to head over to https://wiki.apache.org/tika/TikaJAXRS to find out what the available URLs are. We should self-host some simple documentation on the server at the root of it, so that people can discover what it offers. Ideally, this should be largely auto-generated based on the endpoints, so that we don't risk missing things when we add new features. This will also allow us to potentially offer a sample running version of the server for people to discover Tika with. -- This message was sent by Atlassian JIRA (v6.2#6252)
Re: How should video files with audio be handled by parsers?
They are a bit verbose, but:

1) I'd really like to stick to the specification as closely as possible.

2) There are several PBCore instantiation properties that apply to the entire file, like duration and tracks, that we'd want prefixed with pbcore, so I think it would be odd to see:

pbcore:instantiationDuration=00:00:05.20
stream[0]/pbcore:essenceTrackType=Video

3) PBCore allows for essence track types like text that might not necessarily be considered 'streams'.

That's great that the Ogg parsers will be able to do the informational side!

Regards,
Ray

On July 23, 2014 at 10:17:29 AM, Nick Burch (apa...@gagravarr.org) wrote:

...
pbcore:instantiationEssenceTrack[0]/pbcore:essenceTrackType=Video
...
pbcore:instantiationEssenceTrack[1]/pbcore:essenceTrackType=Audio
pbcore:instantiationEssenceTrack[1]/pbcore:essenceTrackLanguage=eng

I'm not quite so keen on these metadata keys though. Do we gain anything from this long form, vs

stream[0]/pbcore:essenceTrackType=Video
stream[1]/pbcore:essenceTrackType=Audio
stream[1]/pbcore:essenceTrackLanguage=eng

?

The current FFmpeg parser wouldn't be able to extract things like annotations, but it was only targeting the intrinsic metadata.

The Ogg parsers should be able to output that fairly easily, the only reason they don't is that I didn't know what to output it as!

Nick
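The two key styles compared in this thread can be put side by side in a few lines of Java. This is only an illustrative sketch using a plain `Map`, not the Tika `Metadata` API and not the output of any existing parser; the class `TrackKeys` and its methods are hypothetical names. The property names and values are the ones quoted in the messages above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Tika: builds the two key styles
// discussed in the thread for per-track PBCore properties.
class TrackKeys {
    // Long form, mirroring the PBCore element names.
    static String longForm(int track, String property) {
        return "pbcore:instantiationEssenceTrack[" + track + "]/pbcore:" + property;
    }

    // Short form proposed as an alternative.
    static String shortForm(int track, String property) {
        return "stream[" + track + "]/pbcore:" + property;
    }

    // The example from the thread in long form: a file-level duration,
    // one video track, and one English audio track.
    static Map<String, String> example() {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("pbcore:instantiationDuration", "00:00:05.20");
        metadata.put(longForm(0, "essenceTrackType"), "Video");
        metadata.put(longForm(1, "essenceTrackType"), "Audio");
        metadata.put(longForm(1, "essenceTrackLanguage"), "eng");
        return metadata;
    }

    public static void main(String[] args) {
        example().forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

With the long form, every key in the map starts with the same `pbcore:` prefix, which is the consistency argument made in point 2; the short form trades that for brevity.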
[jira] [Commented] (TIKA-1269) Self-hosted documentation for the JAX-RS Server
[ https://issues.apache.org/jira/browse/TIKA-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072368#comment-14072368 ]

Lewis John McGibbney commented on TIKA-1269:
---

Hi [~gagravarr], yeah, I have a couple of thoughts here:

* use the maven-assembly-plugin to generate the following release artifacts: server without packaged dependencies, server with packaged dependencies, and embedded server (similar to what we currently have), which is invoked via a bash script and runs via org.mortbay.jetty:jetty-runner.
* also put work into the .war file, which we can run as a web app within any servlet container... as we also do with Any23.

For the artifacts generated by the maven-assembly-plugin, we would simply define the generated Miredot documentation as resources within the plugin's XML assembly descriptors. As for the WAR, we would define the documentation as WebResources, which would then be picked up by the maven-war-plugin when we generate the WAR artifact.

I therefore propose (if you guys are happy with using Miredot for the documentation) that we commit the current patch on this issue, then address the Javadoc as well as the JAX-RS annotations in a separate issue, before writing the assembly descriptors and the web application WAR file, all in separate issues. wdyt?