Re: Miredot License Key for Apache Tika Project

2014-07-23 Thread Yves Vandewoude

Dear Lewis,

The free licence key for the tika-server artifact is:

<licence>UHJvamVjdHxvcmcuYXBhY2hlLnRpa2EudGlrYS1zZXJ2ZXJ8MjAxNi0wOC0wMXx0cnVlI01Dd0NGRklXRzRqRmNTZXNJb2laRElKZVF4RXpieUNTQWhSMHBmTzZCMUdMbDBPQ1B1WmJYQ3NpZElZSCtRPT0=</licence>

This licence key is valid for two years (until August 1st, 2016). After 
this period has expired, you can request a new licence key if you wish.


If you have any questions or remarks using MireDot, do not hesitate to 
ask us.

Kind Regards,
Yves


Lewis John Mcgibbney schreef op 22/07/2014 22:54:

Hi Yves,
Thanks for the feedback on this one.

On Fri, Jul 18, 2014 at 11:38 PM, Yves Vandewoude 
<yves.vandewo...@miredot.com> wrote:



Due to the open source nature of Apache Tika, you are indeed
granted free licence key(s) if you wish to use MireDot to generate
your REST API documentation. If you provide me with the
groupId/artifactId of the Maven module(s) in which your REST API
interfaces reside, I will send you the key(s).


OK so the code this will be used in resides here
https://svn.apache.org/repos/asf/tika/trunk/tika-server
groupId - org.apache.tika
artifactId - tika-server


Also, we currently do not list our customers on the MireDot
website. However, should we ever decide to do so, are we allowed
to mention the Apache Foundation as a user?


We got no blocking feedback on the above, so I suggest that it is OK 
for the time being.

Thank you
Lewis




[jira] [Commented] (TIKA-1371) passing parameters via URL no longer works (regression)

2014-07-23 Thread Sergey Beryozkin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071516#comment-14071516
 ] 

Sergey Beryozkin commented on TIKA-1371:


Can you please clarify what exactly does not work?
Is the absolute request URI not being logged? Can you please give a sample request URI 
issued against Tika 1.5 and explain what you expect the server to do?

Thanks, Sergey

 passing parameters via URL no longer works (regression)
 ---

 Key: TIKA-1371
 URL: https://issues.apache.org/jira/browse/TIKA-1371
 Project: Tika
  Issue Type: Bug
  Components: server
Affects Versions: 1.5
Reporter: Rob Tulloh

 In Tika 1.1 and 1.2, it was possible to add some values to the URL that get 
 logged like this:
 http://localhost:9998/tika/GUID/FILENAME
 This was very useful for correlating between client and server in a 
 distributed compute environment. In 1.5 and in the nightly builds (for 1.6), 
 this feature no longer works. Not having this makes it very difficult to 
 troubleshoot problems with document processing in a distributed environment. 
 Please add back this feature so that operations and development teams can 
 more easily figure out which Tika instance is processing which document and 
 what the processing resulted in.
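
The correlation scheme described in this report can be sketched as follows. This is only an illustration of splitting a /tika/GUID/FILENAME request URI into its parts for logging, using the JDK alone; the class name RequestPathInfo and its fields are hypothetical and are not taken from the Tika server code.

```java
import java.net.URI;

/** Hypothetical helper, not part of Tika: splits /tika/GUID/FILENAME into its parts. */
final class RequestPathInfo {
    final String guid;
    final String fileName;

    private RequestPathInfo(String guid, String fileName) {
        this.guid = guid;
        this.fileName = fileName;
    }

    /** Parses a request URI of the form http://HOST:PORT/tika/GUID/FILENAME. */
    static RequestPathInfo parse(String requestUri) {
        String path = URI.create(requestUri).getPath();
        // A limit of 4 keeps any further slashes inside the file name segment.
        String[] parts = path.split("/", 4);
        if (parts.length != 4 || !parts[1].equals("tika")) {
            throw new IllegalArgumentException("Unexpected request path: " + path);
        }
        return new RequestPathInfo(parts[2], parts[3]);
    }

    public static void main(String[] args) {
        RequestPathInfo info = parse("http://localhost:9998/tika/123456/x.csv");
        // Both values can then be written to the log next to the processing result.
        System.out.println("guid=" + info.guid + " file=" + info.fileName);
    }
}
```

A client and a server that both log the GUID can then be matched up line by line, which is the use case the report describes.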



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (TIKA-1371) passing parameters via URL no longer works (regression)

2014-07-23 Thread Rob Tulloh (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071625#comment-14071625
 ] 

Rob Tulloh commented on TIKA-1371:
--

In tika 1.2, we pass a URL

http://HOST:9998/tika/GUID/FN

that results in this kind of log in the tika log:

2014-07-23_06:53:16.57840 INFO: tika/10083908/n/a (text/html)

When we use the same approach in tika 1.5

curl -T http://localhost:9998/123456/x.csv

we see this in the tika log

2014-07-23_11:23:33.92903 WARNING: No operation matching request path 
/tika/123456/x.csv is found, Relative Path: /123456/x.csv, HTTP Method: PUT, 
ContentType: */*, Accept: */*,. Please enable FINE/TRACE log level for more 
details.
2014-07-23_11:23:33.93747 Jul 23, 2014 6:23:33 AM 
org.apache.cxf.jaxrs.impl.WebApplicationExceptionMapper toResponse
2014-07-23_11:23:33.93749 WARNING: javax.ws.rs.ClientErrorException
2014-07-23_11:23:33.93750   at 
org.apache.cxf.jaxrs.utils.JAXRSUtils.findTargetMethod(JAXRSUtils.java:503)
2014-07-23_11:23:33.93753   at 
org.apache.cxf.jaxrs.interceptor.JAXRSInInterceptor.processRequest(JAXRSInInterceptor.java:218)
2014-07-23_11:23:33.93753   at 
org.apache.cxf.jaxrs.interceptor.JAXRSInInterceptor.handleMessage(JAXRSInInterceptor.java:90)
2014-07-23_11:23:33.93754   at 
org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:272)
2014-07-23_11:23:33.93754   at 
org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
2014-07-23_11:23:33.93755   at 
org.apache.cxf.transport.http_jetty.JettyHTTPDestination.serviceRequest(JettyHTTPDestination.java:35



[jira] [Commented] (TIKA-1371) passing parameters via URL no longer works (regression)

2014-07-23 Thread Rob Tulloh (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071631#comment-14071631
 ] 

Rob Tulloh commented on TIKA-1371:
--

Also, I was trying out the detect/stream URL that is documented on the Tika wiki

http://wiki.apache.org/tika/TikaJAXRS

and that doesn't seem to work either. I tried both tika/detect/stream and 
detect/stream

2014-07-23_11:32:59.68485 WARNING: No operation matching request path 
/tika/detect/stream is found, Relative Path: /detect/stream, HTTP Method: 
PUT, ContentType: */*, Accept: */*,. Please enable FINE/TRACE log level for 
more details.

2014-07-23_11:33:21.14176 WARNING: No operation matching request path 
/detect/stream is found, Relative Path: /detect/stream, HTTP Method: PUT, 
ContentType: */*, Accept: */*,. Please enable FINE/TRACE log level for more 
details.





[jira] [Commented] (TIKA-1371) passing parameters via URL no longer works (regression)

2014-07-23 Thread Rob Tulloh (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071637#comment-14071637
 ] 

Rob Tulloh commented on TIKA-1371:
--

Also, we have been using tika to externally pre-process documents before we 
index them into Solr. We are using Solr 4.7 and it seems that the additional 
meta-data that Tika 1.5 now produces is not being recognized by Solr. We 
purposely decouple Tika from Solr because we don't want document processing to 
destabilize indexing/search (separating concerns).

When we use the supported URL format, we get back meta-data and content and 
this confuses Solr's language detection. This is a separate question, but I 
want to mention it as it may explain more about how we have been using Tika and 
the problems we have encountered with upgrading to Tika 1.5.



[jira] [Closed] (TIKA-1372) PDCheckbox NPE

2014-07-23 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison closed TIKA-1372.
-

Resolution: Fixed

[~tilman], thank you for notifying us.  Yes, that was Tika's (well, my) fault.  I 
fixed that thanks to a doc in govdocs1 and work on TIKA-1302.  Tika SNAPSHOT 
works on the file submitted with PDFBOX-2218, and we should have TIKA 1.6 out 
shortly.  

[~mdhussain], thank you for submitting the issue to PDFBOX. Let us know if you 
are having problems with Tika trunk and your file.

 PDCheckbox NPE
 --

 Key: TIKA-1372
 URL: https://issues.apache.org/jira/browse/TIKA-1372
 Project: Tika
  Issue Type: Bug
Reporter: Tilman Hausherr

 One of your users, [~mdhussain], opened PDFBOX-2218:
 PDF parsing fails for attached PDF.
 Stack trace of failure:
 {code}
 Exception in thread "main" org.apache.tika.exception.TikaException: 
 Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@1747c
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   at 
 org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
   at 
 com.sabax.extraction.FileExtractionHandler.getFileData(FileExtractionHandler.java:145)
   at GenerateIndex.main(GenerateIndex.java:59)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
   at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
   at java.lang.reflect.Method.invoke(Unknown Source)
   at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
 Caused by: java.lang.NullPointerException
   at 
 org.apache.pdfbox.pdmodel.interactive.form.PDCheckbox.getOnValue(PDCheckbox.java:141)
   at 
 org.apache.pdfbox.pdmodel.interactive.form.PDCheckbox.isChecked(PDCheckbox.java:79)
   at 
 org.apache.pdfbox.pdmodel.interactive.form.PDRadioCollection.getValue(PDRadioCollection.java:128)
   at 
 org.apache.tika.parser.pdf.PDF2XHTML.addFieldString(PDF2XHTML.java:507)
   at 
 org.apache.tika.parser.pdf.PDF2XHTML.processAcroField(PDF2XHTML.java:461)
   at 
 org.apache.tika.parser.pdf.PDF2XHTML.processAcroField(PDF2XHTML.java:479)
   at 
 org.apache.tika.parser.pdf.PDF2XHTML.extractAcroForm(PDF2XHTML.java:447)
   at org.apache.tika.parser.pdf.PDF2XHTML.endDocument(PDF2XHTML.java:195)
   at 
 org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:341)
   at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:106)
   at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:143)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   ... 9 more
 {code}
 Sample code used to parse
 {code}
 TikaInputStream tikaStream = TikaInputStream.get(stream);
 TikaResultWrapper result;
 try {
   long streamSize = tikaStream.getLength();
   Metadata metadata =
 constructMetadata(fileName, mimeType, streamSize);
   if (streamSize < maxFileSize) {
 SamplingSaxHandler handler =
   new SamplingSaxHandler(samplingSize, metadata);
 handler.setBufferLimit(bufferSize);
 parser.parse(tikaStream, handler, metadata, new ParseContext());
 result = handler.getResult();
   } else {
 result = new TikaResultWrapper(null, metadata);
   }
 } finally {
   tikaStream.close();
 }
 return result;
 {code}





[jira] [Commented] (TIKA-1373) AutoDetectParser extracts no text when SourceCodeParser is selected

2014-07-23 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071643#comment-14071643
 ] 

Hong-Thai Nguyen commented on TIKA-1373:


Can you format your description with the {code} annotation? Also, if I understand 
correctly, the output of the first section is empty?

 AutoDetectParser extracts no text when SourceCodeParser is selected
 ---

 Key: TIKA-1373
 URL: https://issues.apache.org/jira/browse/TIKA-1373
 Project: Tika
  Issue Type: Bug
Affects Versions: 1.5
Reporter: Andrés Aguilar-Umaña

 When using the AutoDetectParser in java code, and the SourceCodeParser is 
 selected (i.e. java files), the handler gets no text:
 I have this test program:
 String data = "public class HelloWorld {}";
 ByteArrayInputStream bais = new ByteArrayInputStream(data.getBytes());
 Parser autoDetectParser = new AutoDetectParser();
 autoDetectParser = new SourceCodeParser();
 BodyContentHandler bch = new BodyContentHandler(50);
 ParseContext parseContext = new ParseContext();
 Metadata metadata = new Metadata();
 metadata.set(Metadata.CONTENT_TYPE, "text/x-java-source");
 try {
    autoDetectParser.parse(bais, bch, metadata, parseContext);
 } catch (Exception e) {
    e.printStackTrace();
 }
 System.out.println("Text extracted: " + bch.toString());
 It returns (using the SourceCodeParser): 
  Text extracted: 
 But when I use this code:
 String data = "public class HelloWorld {}";
 ByteArrayInputStream bais = new ByteArrayInputStream(data.getBytes());
 Parser autoDetectParser = new AutoDetectParser();
 autoDetectParser = new SourceCodeParser();
 BodyContentHandler bch = new BodyContentHandler(50);
 ParseContext parseContext = new ParseContext();
 Metadata metadata = new Metadata();
 metadata.set(Metadata.CONTENT_TYPE, "text/plain");
 try {  autoDetectParser.parse(bais, bch, metadata, parseContext);  } 
 catch (Exception e) {  e.printStackTrace();  }
 System.out.println("Text extracted: " + bch.toString());
 The Text Parser is used and I get:
  Text extracted: public class HelloWorld {}
 I have also tested this command: 
  java -jar tika-app-1.5.jar -t D:\text.java
   (no text)
  





[jira] [Commented] (TIKA-1371) passing parameters via URL no longer works (regression)

2014-07-23 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071693#comment-14071693
 ] 

Nick Burch commented on TIKA-1371:
--

The /detect/stream URL ought to work; see DetectorResourceTest for examples of 
how the tests call it. Note that there's no /tika prefix on it.

The Tika Server is now (post-1.5) able to tell you, in a very basic way, which 
URLs it supports; you can use that to check which URLs are and aren't available.
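
As a rough Java counterpart to uploading a file with curl -T, the sketch below builds (but does not send) a PUT request against /detect/stream without a /tika prefix. It relies only on the JDK's java.net.http API (Java 11 or later); the class name DetectRequestSketch, the method buildDetectRequest, and the sample payload are hypothetical, and whether a given Tika Server version accepts the request is not something this snippet can confirm.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch, not part of Tika: builds a PUT to /detect/stream. */
final class DetectRequestSketch {

    /** Builds the request only; no network traffic happens here. */
    static HttpRequest buildDetectRequest(String baseUrl, byte[] body) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/detect/stream"))
                .PUT(HttpRequest.BodyPublishers.ofByteArray(body))
                .build();
    }

    public static void main(String[] args) {
        byte[] sample = "a,b,c\n".getBytes(StandardCharsets.UTF_8);
        HttpRequest request = buildDetectRequest("http://localhost:9998", sample);
        System.out.println(request.method() + " " + request.uri());
    }
}
```

Sending the request would additionally need an HttpClient and a running server on the assumed host and port.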



[jira] [Commented] (TIKA-1373) AutoDetectParser extracts no text when SourceCodeParser is selected

2014-07-23 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071697#comment-14071697
 ] 

Nick Burch commented on TIKA-1373:
--

I've just tried it with svn trunk, and I think I see the issue there. The 
problem looks to be that we're getting back html when we ask for text

{code}$ tika --text tika-core/src/main/java/org/apache/tika/Tika.java | head
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="content-type" content="text/html; charset=ISO-8859-1" />
<meta name="generator" content="JHighlight v1.0 (http://jhighlight.dev.java.net)" />
<title>Tika.java</title>
<link rel="Help" href="http://jhighlight.dev.java.net" />
<style type="text/css">
.java_type {
{code}

Note that I've asked for Text, but got HTML!



[jira] [Commented] (TIKA-1373) AutoDetectParser extracts no text when SourceCodeParser is selected

2014-07-23 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071713#comment-14071713
 ] 

Hong-Thai Nguyen commented on TIKA-1373:


Yes, I saw this problem when implementing this parser. How can we tell that 
text is being requested instead of HTML? Would checking whether the handler is 
an instance of BodyContentHandler be enough?



[jira] [Comment Edited] (TIKA-1373) AutoDetectParser extracts no text when SourceCodeParser is selected

2014-07-23 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071643#comment-14071643
 ] 

Hong-Thai Nguyen edited comment on TIKA-1373 at 7/23/14 1:42 PM:
-

Can you format your description with the {noformat}{code}{noformat} annotation? 
Also, if I understand correctly, the output of the first section is empty?


was (Author: thaichat04):
Can you format your description with the {code} annotation? Also, if I understand 
correctly, the output of the first section is empty?



[jira] [Updated] (TIKA-1373) AutoDetectParser extracts no text when SourceCodeParser is selected

2014-07-23 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich updated TIKA-1373:
--

Description: 
When using the AutoDetectParser in java code, and the SourceCodeParser is 
selected (i.e. java files), the handler gets no text:

I have this test program:
{code}
String data = "public class HelloWorld {}";
ByteArrayInputStream bais = new ByteArrayInputStream(data.getBytes());
Parser autoDetectParser = new AutoDetectParser();
autoDetectParser = new SourceCodeParser();
BodyContentHandler bch = new BodyContentHandler(50);
ParseContext parseContext = new ParseContext();
Metadata metadata = new Metadata();
metadata.set(Metadata.CONTENT_TYPE, "text/x-java-source");
try {
   autoDetectParser.parse(bais, bch, metadata, parseContext);
} catch (Exception e) {
   e.printStackTrace();
}
System.out.println("Text extracted: " + bch.toString());
{code}
It returns (using the SourceCodeParser): 
{code}  Text extracted: {code}

But when I use this code:
{code}
String data = "public class HelloWorld {}";
ByteArrayInputStream bais = new ByteArrayInputStream(data.getBytes());
Parser autoDetectParser = new AutoDetectParser();
autoDetectParser = new SourceCodeParser();
BodyContentHandler bch = new BodyContentHandler(50);
ParseContext parseContext = new ParseContext();
Metadata metadata = new Metadata();
metadata.set(Metadata.CONTENT_TYPE, "text/plain");
try {  autoDetectParser.parse(bais, bch, metadata, parseContext);  } catch 
(Exception e) {  e.printStackTrace();  }
System.out.println("Text extracted: " + bch.toString());
{code}





The Text Parser is used and I get:

{code}  Text extracted: public class HelloWorld {} {code}


I have also tested this command: 
{code}
 java -jar tika-app-1.5.jar -t D:\text.java
  (no text)
{code}


  was:
When using the AutoDetectParser in java code, and the SourceCodeParser is 
selected (i.e. java files), the handler gets no text:

I have this test program:

String data = "public class HelloWorld {}";
ByteArrayInputStream bais = new ByteArrayInputStream(data.getBytes());
Parser autoDetectParser = new AutoDetectParser();
autoDetectParser = new SourceCodeParser();
BodyContentHandler bch = new BodyContentHandler(50);
ParseContext parseContext = new ParseContext();
Metadata metadata = new Metadata();
metadata.set(Metadata.CONTENT_TYPE, "text/x-java-source");
try {
   autoDetectParser.parse(bais, bch, metadata, parseContext);
} catch (Exception e) {
   e.printStackTrace();
}
System.out.println("Text extracted: " + bch.toString());

It returns (using the SourceCodeParser): 
 Text extracted: 

But when I use this code:

String data = "public class HelloWorld {}";
ByteArrayInputStream bais = new ByteArrayInputStream(data.getBytes());
Parser autoDetectParser = new AutoDetectParser();
autoDetectParser = new SourceCodeParser();
BodyContentHandler bch = new BodyContentHandler(50);
ParseContext parseContext = new ParseContext();
Metadata metadata = new Metadata();
metadata.set(Metadata.CONTENT_TYPE, "text/plain");
try {  autoDetectParser.parse(bais, bch, metadata, parseContext);  } catch 
(Exception e) {  e.printStackTrace();  }
System.out.println("Text extracted: " + bch.toString());






The Text Parser is used and I get:

 Text extracted: public class HelloWorld {}


I have also tested this command: 

 java -jar tika-app-1.5.jar -t D:\text.java
  (no text)
 




[jira] [Commented] (TIKA-1371) passing parameters via URL no longer works (regression)

2014-07-23 Thread Rob Tulloh (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071723#comment-14071723
 ] 

Rob Tulloh commented on TIKA-1371:
--

curl -T x.csv http://localhost:9998/detect/stream 

results in

2014-07-23_11:33:21.14176 WARNING: No operation matching request path 
/detect/stream is found, Relative Path: /detect/stream, HTTP Method: PUT, 
ContentType: */*, Accept: */*,. Please enable FINE/TRACE log level for more details.



[jira] [Commented] (TIKA-1373) AutoDetectParser extracts no text when SourceCodeParser is selected

2014-07-23 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071744#comment-14071744
 ] 

Nick Burch commented on TIKA-1373:
--

{quote}Yes, I saw the trouble when implementing this parser. How can we get 
that we are asking for text instead of HTML?{quote}

It isn't up to the parser. The parser should output its HTML as SAX 
events to the handler; text-only extraction happens downstream.

It would seem that, at the moment, the parser's generated HTML isn't being 
correctly turned into SAX events and passed to the content handler.
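
To make the point about SAX events concrete, here is a minimal sketch using only the JDK's SAX classes. It is not how SourceCodeParser or BodyContentHandler are implemented; the names SaxTextOnlyDemo and TextCollector are hypothetical. The markup is delivered as SAX events, and a downstream handler that keeps only character data ends up with plain text.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical demo class, not part of Tika. */
final class SaxTextOnlyDemo {

    /** Keeps only character data and ignores element start/end events. */
    static final class TextCollector extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        String text() {
            return text.toString();
        }
    }

    /** Feeds well-formed XHTML through a SAX parser into the collector. */
    static String extractText(String xhtml) throws Exception {
        TextCollector handler = new TextCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.text();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body><p>public class HelloWorld {}</p></body></html>";
        System.out.println(extractText(xhtml)); // prints: public class HelloWorld {}
    }
}
```

If the producer writes its HTML as one opaque string instead of emitting element and character events, a handler like this has nothing to collect, which would be consistent with the empty output reported in this issue.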

 AutoDetectParser extracts no text when SourceCodeParser is selected
 ---

 Key: TIKA-1373
 URL: https://issues.apache.org/jira/browse/TIKA-1373
 Project: Tika
  Issue Type: Bug
Affects Versions: 1.5
Reporter: Andrés Aguilar-Umaña

 When using the AutoDetectParser in java code, and the SourceCodeParser is 
 selected (i.e. java files), the handler gets no text:
 I have this test program:
 {code}
 String data = "public class HelloWorld {}";
 ByteArrayInputStream bais = new ByteArrayInputStream(data.getBytes());
 Parser autoDetectParser = new AutoDetectParser();
 autoDetectParser = new SourceCodeParser();
 BodyContentHandler bch = new BodyContentHandler(50);
 ParseContext parseContext = new ParseContext();
 Metadata metadata = new Metadata();
 metadata.set(Metadata.CONTENT_TYPE, "text/x-java-source");
 try {
     autoDetectParser.parse(bais, bch, metadata, parseContext);
 } catch (Exception e) {
     e.printStackTrace();
 }
 System.out.println("Text extracted: " + bch.toString());
 {code}
 It returns (using the SourceCodeParser): 
 {code}  Text extracted: {code}
 But when I use this code:
 {code}
 String data = "public class HelloWorld {}";
 ByteArrayInputStream bais = new ByteArrayInputStream(data.getBytes());
 Parser autoDetectParser = new AutoDetectParser();
 autoDetectParser = new SourceCodeParser();
 BodyContentHandler bch = new BodyContentHandler(50);
 ParseContext parseContext = new ParseContext();
 Metadata metadata = new Metadata();
 metadata.set(Metadata.CONTENT_TYPE, "text/plain");
 try {
     autoDetectParser.parse(bais, bch, metadata, parseContext);
 } catch (Exception e) {
     e.printStackTrace();
 }
 System.out.println("Text extracted: " + bch.toString());
 {code}
 The Text Parser is used and I get:
 {code}  Text extracted: public class HelloWorld {} {code}
 I have also tested this command: 
 {code}
  java -jar tika-app-1.5.jar -t D:\text.java
   (no text)
 {code}





[jira] [Commented] (TIKA-1371) passing parameters via URL no longer works (regression)

2014-07-23 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14071748#comment-14071748
 ] 

Nick Burch commented on TIKA-1371:
--

I've just tried with a recent nightly build, and it worked fine:

{code}$ curl -T test.xlsx http://localhost:9998/detect/stream
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet{code}

There has been a load of work on the Tika Server since 1.5, so make sure you're 
using a recent nightly build / svn trunk checkout build



Re: How should video files with audio be handled by parsers?

2014-07-23 Thread Nick Burch

On Tue, 22 Jul 2014, Ray Gauss wrote:
The info on what the streams are and how they relate can be conveyed via 
PBCore, i.e.:


pbcore:instantiationTracks=1 video track, English and Spanish audio, 
Director's commentary audio


Ah, that's good. Looks a sensible enough and easy to follow standard to 
crib from



...
pbcore:instantiationEssenceTrack[0]/pbcore:essenceTrackType=Video
...
pbcore:instantiationEssenceTrack[1]/pbcore:essenceTrackType=Audio
pbcore:instantiationEssenceTrack[1]/pbcore:essenceTrackLanguage=eng


I'm not quite so keen on these metadata keys though. Do we gain anything 
from this long form, vs


stream[0]/pbcore:essenceTrackType=Video
stream[1]/pbcore:essenceTrackType=Audio
stream[1]/pbcore:essenceTrackLanguage=eng

?
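
To make the comparison concrete, here is a small sketch that builds both key forms for the same three properties. The helper names `key` and `describe` are hypothetical, and a plain `Map` stands in for Tika's Metadata object; only the key strings themselves come from the thread.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class StreamKeysSketch {

    /** Builds an indexed key such as "stream[1]/pbcore:essenceTrackLanguage". */
    static String key(String prefix, int index, String property) {
        return prefix + "[" + index + "]/" + property;
    }

    /** Describes one video track and one English audio track under the given prefix. */
    static Map<String, String> describe(String prefix) {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put(key(prefix, 0, "pbcore:essenceTrackType"), "Video");
        metadata.put(key(prefix, 1, "pbcore:essenceTrackType"), "Audio");
        metadata.put(key(prefix, 1, "pbcore:essenceTrackLanguage"), "eng");
        return metadata;
    }

    public static void main(String[] args) {
        // Long form, as in the PBCore-style proposal.
        describe("pbcore:instantiationEssenceTrack")
                .forEach((k, v) -> System.out.println(k + "=" + v));
        // Short form, as in the alternative raised in the reply.
        describe("stream").forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

Both forms carry the same information; the difference is only the length of the prefix before the index.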

The current FFmpeg parser wouldn't be able to extract things like 
annotations, but it was only targeting the intrinsic metadata.


The Ogg parsers should be able to output that fairly easily, the only 
reason they don't is that I didn't know what to output as!


Nick


Re: Miredot License Key for Apache Tika Project

2014-07-23 Thread Lewis John Mcgibbney
Fantastic, thank you.
Best
Lewis



-- 
*Lewis*


[jira] [Commented] (TIKA-1371) passing parameters via URL no longer works (regression)

2014-07-23 Thread Rob Tulloh (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14071893#comment-14071893
 ] 

Rob Tulloh commented on TIKA-1371:
--

OK, so it seems fixed in the pending 1.6 release. That is good news. Now, going 
back to the original question: how can we get details of the document we are 
indexing into the log file, as was possible in 1.2? Is there a way we can tell 
Tika server to log the GUID (which the client specifies) and the filename? This 
was easy to do in 1.2 because it was part of the URL and Tika logged this 
information by default.

Another useful feature would be to allow the server to return not XHTML, but 
just the body text (à la --text on the tika-app). Since our service uses 
Tika server to pre-process all the documents in a separate service, it would be 
helpful if Tika server had an option to return just the body text and 
not the full XHTML. I have code that will parse the result document and extract 
the body, but it makes more sense to allow the Tika service to just return this 
the way it used to in previous versions. I am particularly concerned because 
this post-processing of the XHTML adds memory overhead to our server JVM. So, 
an option to launch the Tika server with an option like --text (or a URL like 
tika/text) and have it just return the body content would make it compatible 
with previous versions that did this. Then our application logic would be much 
simpler and the Solr integration would work as it did with Tika server 1.2.





[jira] [Updated] (TIKA-1373) AutoDetectParser extracts no text when SourceCodeParser is selected

2014-07-23 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/TIKA-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrés Aguilar-Umaña updated TIKA-1373:
---

Description: 
When using the AutoDetectParser in java code, and the SourceCodeParser is 
selected (i.e. java files), the handler gets no text:

I have this test program:
{code}
String data = "public class HelloWorld {}";
ByteArrayInputStream bais = new ByteArrayInputStream(data.getBytes());
Parser autoDetectParser = new AutoDetectParser();
BodyContentHandler bch = new BodyContentHandler(50);
ParseContext parseContext = new ParseContext();
Metadata metadata = new Metadata();
metadata.set(Metadata.CONTENT_TYPE, "text/x-java-source");
try {
    autoDetectParser.parse(bais, bch, metadata, parseContext);
} catch (Exception e) {
    e.printStackTrace();
}
System.out.println("Text extracted: " + bch.toString());
{code}
It returns (using the SourceCodeParser): 
{code}  Text extracted: {code}

But when I use this code:
{code}
String data = "public class HelloWorld {}";
ByteArrayInputStream bais = new ByteArrayInputStream(data.getBytes());
Parser autoDetectParser = new AutoDetectParser();
BodyContentHandler bch = new BodyContentHandler(50);
ParseContext parseContext = new ParseContext();
Metadata metadata = new Metadata();
metadata.set(Metadata.CONTENT_TYPE, "text/plain");
try {
    autoDetectParser.parse(bais, bch, metadata, parseContext);
} catch (Exception e) {
    e.printStackTrace();
}
System.out.println("Text extracted: " + bch.toString());
{code}





The Text Parser is used and I get:

{code}  Text extracted: public class HelloWorld {} {code}


I have also tested this command: 
{code}
 java -jar tika-app-1.5.jar -t D:\text.java
  (no text)
{code}


  was:
When using the AutoDetectParser in java code, and the SourceCodeParser is 
selected (i.e. java files), the handler gets no text:

I have this test program:
{code}
String data = "public class HelloWorld {}";
ByteArrayInputStream bais = new ByteArrayInputStream(data.getBytes());
Parser autoDetectParser = new AutoDetectParser();
autoDetectParser = new SourceCodeParser();
BodyContentHandler bch = new BodyContentHandler(50);
ParseContext parseContext = new ParseContext();
Metadata metadata = new Metadata();
metadata.set(Metadata.CONTENT_TYPE, "text/x-java-source");
try {
    autoDetectParser.parse(bais, bch, metadata, parseContext);
} catch (Exception e) {
    e.printStackTrace();
}
System.out.println("Text extracted: " + bch.toString());
{code}
It returns (using the SourceCodeParser): 
{code}  Text extracted: {code}

But when I use this code:
{code}
String data = "public class HelloWorld {}";
ByteArrayInputStream bais = new ByteArrayInputStream(data.getBytes());
Parser autoDetectParser = new AutoDetectParser();
autoDetectParser = new SourceCodeParser();
BodyContentHandler bch = new BodyContentHandler(50);
ParseContext parseContext = new ParseContext();
Metadata metadata = new Metadata();
metadata.set(Metadata.CONTENT_TYPE, "text/plain");
try {
    autoDetectParser.parse(bais, bch, metadata, parseContext);
} catch (Exception e) {
    e.printStackTrace();
}
System.out.println("Text extracted: " + bch.toString());
{code}





The Text Parser is used and I get:

{code}  Text extracted: public class HelloWorld {} {code}


I have also tested this command: 
{code}
 java -jar tika-app-1.5.jar -t D:\text.java
  (no text)
{code}




[jira] [Commented] (TIKA-1373) AutoDetectParser extracts no text when SourceCodeParser is selected

2014-07-23 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14071942#comment-14071942
 ] 

Tyler Palsulich commented on TIKA-1373:
---

The only SAX event in SourceCodeParser is {{xhtml.element("p", codeAsHtml);}}. 
codeAsHtml is formatted by jhighlight, a syntax highlighter. So, it  _looks_ 
like the --text is returning the text, but it's just that the text content is 
html.

I'm not sure how we can turn the jhighlight html tags into SAX events. Tika 
HtmlParser? Something like
{code}
XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
Renderer renderer = getRenderer(type.toString());
String content = out.toString();
String codeAsHtml = renderer.highlight(name, content, charset.name(), false);
// Parse the highlighted markup (not the raw source) so its tags become SAX events.
HtmlParser htmlParser = new HtmlParser();
htmlParser.parse(new ByteArrayInputStream(codeAsHtml.getBytes(charset)), xhtml,
        metadata, context);
{code}
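
The idea of re-parsing the highlighted markup can be tried out with the JDK alone. This sketch is an assumption-laden stand-in: it uses the JDK's XML SAX parser instead of Tika's HtmlParser, so it only works when the fragment is well-formed XML, and the `keyword` class name in the sample fragment is invented for illustration rather than taken from jhighlight's real output.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class MarkupToEventsSketch {

    /** Records which elements were seen and accumulates the character data. */
    static class Collector extends DefaultHandler {
        final List<String> elements = new ArrayList<>();
        final StringBuilder text = new StringBuilder();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            elements.add(qName);
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    /** Wraps the fragment in a root element and parses it into SAX events. */
    static Collector parseFragment(String fragment) {
        Collector collector = new Collector();
        try {
            String document = "<div>" + fragment + "</div>";
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(document)), collector);
        } catch (Exception e) {
            throw new IllegalStateException("fragment is not well-formed XML", e);
        }
        return collector;
    }

    public static void main(String[] args) {
        Collector result =
                parseFragment("<span class=\"keyword\">public</span> class HelloWorld {}");
        System.out.println("elements: " + result.elements);
        System.out.println("text: " + result.text);
    }
}
```

Once the markup is parsed instead of being passed along as a single string, the tags show up as element events and the text content is the bare source code. Real highlighter output may not be well-formed XML, which is one reason a lenient HTML parser was suggested in the comment.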



[jira] [Comment Edited] (TIKA-1373) AutoDetectParser extracts no text when SourceCodeParser is selected

2014-07-23 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14071942#comment-14071942
 ] 

Tyler Palsulich edited comment on TIKA-1373 at 7/23/14 4:52 PM:


The only SAX event in SourceCodeParser is {{xhtml.element("p", codeAsHtml);}}. 
codeAsHtml is formatted by jhighlight, a syntax highlighter. So, it  _looks_ 
like the --text isn't returning the text, but it's just that the text content 
is html.

I'm not sure how we can turn the jhighlight html tags into SAX events. Tika 
HtmlParser? Something like
{code}
XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
Renderer renderer = getRenderer(type.toString());
String content = out.toString();
String codeAsHtml = renderer.highlight(name, content, charset.name(), false);
// Parse the highlighted markup (not the raw source) so its tags become SAX events.
HtmlParser htmlParser = new HtmlParser();
htmlParser.parse(new ByteArrayInputStream(codeAsHtml.getBytes(charset)), xhtml,
        metadata, context);
{code}


was (Author: tpalsulich):
The only SAX event in SourceCodeParser is {{xhtml.element("p", codeAsHtml);}}. 
codeAsHtml is formatted by jhighlight, a syntax highlighter. So, it  _looks_ 
like the --text is returning the text, but it's just that the text content is 
html.

I'm not sure how we can turn the jhighlight html tags into SAX events. Tika 
HtmlParser? Something like
{code}
XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
Renderer renderer = getRenderer(type.toString());
String content = out.toString();
String codeAsHtml = renderer.highlight(name, content, charset.name(), false);
// Parse the highlighted markup (not the raw source) so its tags become SAX events.
HtmlParser htmlParser = new HtmlParser();
htmlParser.parse(new ByteArrayInputStream(codeAsHtml.getBytes(charset)), xhtml,
        metadata, context);
{code}



[jira] [Commented] (TIKA-1371) passing parameters via URL no longer works (regression)

2014-07-23 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14071953#comment-14071953
 ] 

Nick Burch commented on TIKA-1371:
--

{quote}Another useful feature would be to allow the server to return not XHTML, 
but just the body text{quote}

That's already supported though! Just send an Accept header of text/plain and 
you'll get that back. See the "Get the Text of a Document" section on the wiki 
for examples.
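
For clients written in Java, the Accept header approach might look roughly like the sketch below. It assumes a server at http://localhost:9998/tika that accepts PUT (the URL and method used elsewhere in this thread) and assumes the response body is UTF-8; the class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class PlainTextRequestSketch {

    /** Prepares (but does not send) a PUT request that asks for plain text. */
    static HttpURLConnection prepare(String endpoint) {
        try {
            HttpURLConnection conn = (HttpURLConnection) new URL(endpoint).openConnection();
            conn.setRequestMethod("PUT");
            conn.setDoOutput(true);
            // Ask the server for body text only, rather than XHTML.
            conn.setRequestProperty("Accept", "text/plain");
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Sends the document and returns the response body (assumed to be UTF-8). */
    static String extractText(String endpoint, byte[] document) throws IOException {
        HttpURLConnection conn = prepare(endpoint);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(document);
        }
        try (InputStream in = conn.getInputStream()) {
            ByteArrayOutputStream body = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) != -1) {
                body.write(chunk, 0, n);
            }
            return new String(body.toByteArray(), StandardCharsets.UTF_8);
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) {
        // Only shows the prepared request; nothing is sent over the network here.
        HttpURLConnection conn = prepare("http://localhost:9998/tika");
        System.out.println(conn.getRequestMethod() + " " + conn.getURL()
                + " Accept=" + conn.getRequestProperty("Accept"));
    }
}
```

Calling `extractText` requires a running server; `main` only prints the prepared request so the sketch can be run without one.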



[jira] [Commented] (TIKA-1371) passing parameters via URL no longer works (regression)

2014-07-23 Thread Rob Tulloh (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14072046#comment-14072046
 ] 

Rob Tulloh commented on TIKA-1371:
--

Perfect! Though I have to say the examples on the wiki are not exactly clear. 
Just tried this and it worked as expected. Thank you.

One more thought on the URL, if Tika just logged what it received, we could 
pass the extra context via parameters like this:

http://localhost:9998/tika?guid=XXX

I tested this and it produced the correct result.

If Tika server could just log the URL received, that would help immensely in 
the distributed world where many clients interact with Tika server and 
correlation is useful for diagnosing what was sent where.
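
On the client side, attaching the context as query parameters could be sketched as follows. The `guid` parameter comes from the example URL above; the `filename` parameter and the helper names are hypothetical additions for illustration. Whether the server logs the received URL is exactly what this comment is asking for, so the sketch only covers building the request URL.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class CorrelationUrlSketch {

    /** Appends URL-encoded correlation parameters to the base endpoint. */
    static String withContext(String baseUrl, String guid, String filename) {
        return baseUrl + "?guid=" + encode(guid) + "&filename=" + encode(filename);
    }

    /** Encodes a value for use in a query string. */
    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a conforming JVM.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(withContext("http://localhost:9998/tika", "XXX", "my report.docx"));
    }
}
```

Encoding the values keeps file names with spaces or other reserved characters from breaking the URL.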



[jira] [Updated] (TIKA-1191) ForkParser / ClassLoaderProxy does not define package

2014-07-23 Thread Nicolas Belisle (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicolas Belisle updated TIKA-1191:
--

Attachment: test.eml
Test.java

Test for ForkParser exception.

 ForkParser / ClassLoaderProxy does not define package
 -

 Key: TIKA-1191
 URL: https://issues.apache.org/jira/browse/TIKA-1191
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
Reporter: Nicolas Belisle
 Attachments: ClassLoaderProxy.java.patch, Test.java, test.eml


 ForkParser will throw an Exception in some cases:
 org.apache.tika.exception.TikaException: Invalid embedded resource
   at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc(AbstractPOIFSExtractor.java:189)
   at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:135)
   at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:186)
   at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
   at java.lang.reflect.Method.invoke(Method.java:597)
   at org.apache.tika.fork.ForkServer.call(ForkServer.java:144)
   at org.apache.tika.fork.ForkServer.processRequests(ForkServer.java:124)
   at org.apache.tika.fork.ForkServer.main(ForkServer.java:69)
 Caused by: java.lang.NullPointerException
   at org.apache.tika.mime.MimeTypesFactory.create(MimeTypesFactory.java:136)
   at org.apache.tika.mime.MimeTypes.getDefaultMimeTypes(MimeTypes.java:499)
   at org.apache.tika.config.TikaConfig.getDefaultMimeTypes(TikaConfig.java:60)
   at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:169)
   at org.apache.tika.config.TikaConfig.getDefaultConfig(TikaConfig.java:268)
   at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.getTikaConfig(AbstractPOIFSExtractor.java:72)
   at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.getDetector(AbstractPOIFSExtractor.java:79)
   at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc(AbstractPOIFSExtractor.java:176)
   ... 10 more
 A patch will follow.





[jira] [Commented] (TIKA-1191) ForkParser / ClassLoaderProxy does not define package

2014-07-23 Thread Nicolas Belisle (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14072104#comment-14072104
 ] 

Nicolas Belisle commented on TIKA-1191:
---

I was able to reproduce a similar issue with another file using Tika 1.5. 
See the attached test.eml and the test (Test.java).
The exception:

Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.mail.RFC822Parser@6743bc0f
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at org.apache.tika.fork.ForkServer.call(ForkServer.java:144)
    at org.apache.tika.fork.ForkServer.processRequests(ForkServer.java:124)
    at org.apache.tika.fork.ForkServer.main(ForkServer.java:69)
Caused by: java.lang.NullPointerException
    at org.apache.tika.mime.MimeTypesFactory.create(MimeTypesFactory.java:158)
    at org.apache.tika.mime.MimeTypes.getDefaultMimeTypes(MimeTypes.java:516)
    at org.apache.tika.config.TikaConfig.getDefaultMimeTypes(TikaConfig.java:60)
    at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:169)
    at org.apache.tika.config.TikaConfig.getDefaultConfig(TikaConfig.java:268)
    at org.apache.tika.parser.AutoDetectParser.<init>(AutoDetectParser.java:51)
    at org.apache.tika.parser.mail.RFC822Parser.adaptedExtractMultipart(RFC822Parser.java:167)
    at org.apache.tika.parser.mail.RFC822Parser.adaptedExtractMultipart(RFC822Parser.java:156)
    at org.apache.tika.parser.mail.RFC822Parser.parse(RFC822Parser.java:101)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    ... 9 more




[jira] [Updated] (TIKA-1191) ForkParser / ClassLoaderProxy does not define package

2014-07-23 Thread Nicolas Belisle (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicolas Belisle updated TIKA-1191:
--

Affects Version/s: 1.5

 ForkParser / ClassLoaderProxy does not define package
 -

 Key: TIKA-1191
 URL: https://issues.apache.org/jira/browse/TIKA-1191
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4, 1.5
Reporter: Nicolas Belisle
 Attachments: ClassLoaderProxy.java.patch, Test.java, test.eml


 ForkParser will throw an Exception in some cases : 
 org.apache.tika.exception.TikaException: Invalid embedded resource
   at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc(AbstractPOIFSExtractor.java:189)
   at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:135)
   at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:186)
   at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
   at java.lang.reflect.Method.invoke(Method.java:597)
   at org.apache.tika.fork.ForkServer.call(ForkServer.java:144)
   at org.apache.tika.fork.ForkServer.processRequests(ForkServer.java:124)
   at org.apache.tika.fork.ForkServer.main(ForkServer.java:69)
 Caused by: java.lang.NullPointerException
   at org.apache.tika.mime.MimeTypesFactory.create(MimeTypesFactory.java:136)
   at org.apache.tika.mime.MimeTypes.getDefaultMimeTypes(MimeTypes.java:499)
   at org.apache.tika.config.TikaConfig.getDefaultMimeTypes(TikaConfig.java:60)
   at org.apache.tika.config.TikaConfig.init(TikaConfig.java:169)
   at org.apache.tika.config.TikaConfig.getDefaultConfig(TikaConfig.java:268)
   at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.getTikaConfig(AbstractPOIFSExtractor.java:72)
   at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.getDetector(AbstractPOIFSExtractor.java:79)
   at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc(AbstractPOIFSExtractor.java:176)
   ... 10 more
 A patch will follow
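
 The issue title suggests the NullPointerException comes from code that calls 
 Class.getPackage() on a class whose loader never defined the package. The 
 following is a minimal sketch of the general remedy, not the attached patch: a 
 custom ClassLoader that defines the package before it defines the class. The 
 class name PackageDefiningLoader, the nested Sample class and the way the 
 bytes are read from the parent loader are all assumptions made for 
 illustration; ClassLoaderProxy itself receives its bytes differently.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical illustration only; this is not Tika's ClassLoaderProxy.
public class PackageDefiningLoader extends ClassLoader {

    /** Tiny class that the demo reloads through the custom loader. */
    public static class Sample { }

    // Binary name of the single class this loader defines itself.
    private final String target;

    public PackageDefiningLoader(ClassLoader parent, String target) {
        super(parent);
        this.target = target;
    }

    @Override
    protected Class<?> loadClass(String name, boolean resolve)
            throws ClassNotFoundException {
        if (!name.equals(target)) {
            return super.loadClass(name, resolve); // normal parent delegation
        }
        synchronized (getClassLoadingLock(name)) {
            Class<?> c = findLoadedClass(name);
            if (c == null) {
                c = findClass(name);
            }
            if (resolve) {
                resolveClass(c);
            }
            return c;
        }
    }

    @Override
    protected Class<?> findClass(String name) throws ClassNotFoundException {
        byte[] bytes = readClassBytes(name);
        int dot = name.lastIndexOf('.');
        if (dot > 0) {
            String pkg = name.substring(0, dot);
            // Define the package once, before the class, so that
            // Class.getPackage() has something to return on older JVMs.
            if (getPackage(pkg) == null) {
                definePackage(pkg, null, null, null, null, null, null, null);
            }
        }
        return defineClass(name, bytes, 0, bytes.length);
    }

    // Assumption for the demo: the bytes come from the parent's classpath.
    private byte[] readClassBytes(String name) throws ClassNotFoundException {
        String resource = name.replace('.', '/') + ".class";
        InputStream in = getParent().getResourceAsStream(resource);
        if (in == null) {
            throw new ClassNotFoundException(name);
        }
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new ClassNotFoundException(name, e);
        } finally {
            try { in.close(); } catch (IOException ignored) { }
        }
    }

    public static void main(String[] args) throws Exception {
        String name = Sample.class.getName();
        PackageDefiningLoader loader = new PackageDefiningLoader(
                PackageDefiningLoader.class.getClassLoader(), name);
        Class<?> reloaded = loader.loadClass(name);
        System.out.println("defined by custom loader: "
                + (reloaded.getClassLoader() == loader));
    }
}
```

 Because the demo class lives in the default package, the definePackage branch 
 is only exercised for classes in a named package such as org.apache.tika.mime. 
 Newer JVMs define the package automatically in defineClass, so the guard 
 mainly matters on the Java 6 era runtime shown in the stack trace.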





[jira] [Updated] (TIKA-1269) Self-hosted documentation for the JAX-RS Server

2014-07-23 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated TIKA-1269:
---

Attachment: TIKA-1269-miredot.patch

Patch enabling the free license recently granted by Miredot.
The documentation looks great, with a fully navigable tree for clean and 
efficient navigation of the tika-server documentation.

I propose to commit this patch, hook it up to the Jenkins build and then open a 
subsequent issue to fully document all methods in tika-server with appropriate 
Javadoc. The license key is documented within the patch and is valid for around 
2 years; when it expires we can apply for a new key.

 Self-hosted documentation for the JAX-RS Server
 ---

 Key: TIKA-1269
 URL: https://issues.apache.org/jira/browse/TIKA-1269
 Project: Tika
  Issue Type: Improvement
  Components: server
Affects Versions: 1.5
Reporter: Nick Burch
 Fix For: 1.7

 Attachments: TIKA-1269-miredot.patch, enable-enunciate.patch


 Currently, if you fire up the JAX-RS Tika Server, and go to the root of the 
 server in a web browser, you get an empty page back. You have to know to head 
 over to https://wiki.apache.org/tika/TikaJAXRS to find out what the available 
 URLs are.
 We should self-host some simple documentation on the server at the root of 
 it, so that people can discover what it offers. Ideally, this should be 
 largely auto-generated based on the endpoints, so that we don't risk missing 
 things when we add new features.
 This will also allow us to potentially offer a sample running version of the 
 server for people to discover Tika with.
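
 As a rough sketch of what "auto-generated based on the endpoints" could mean, 
 the listing below walks a resource class by reflection and prints one line per 
 annotated method. Everything in it is hypothetical: the Route annotation is a 
 local stand-in for the JAX-RS annotations (so the example needs no extra 
 dependency), and DemoResource with its two paths is only a placeholder, not 
 the real tika-server code.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class EndpointIndex {

    // Hypothetical stand-in for JAX-RS annotations such as @PUT and @Path.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.METHOD)
    public @interface Route {
        String verb();
        String path();
    }

    // Placeholder resource; method names and paths are illustrative only.
    public static class DemoResource {
        @Route(verb = "PUT", path = "/tika")
        public String extractText() { return ""; }

        @Route(verb = "PUT", path = "/meta")
        public String extractMetadata() { return ""; }

        public void notAnEndpoint() { }
    }

    /** Returns one sorted "VERB path -> method" line per annotated method. */
    public static List<String> describe(Class<?> resource) {
        List<String> lines = new ArrayList<String>();
        for (Method m : resource.getDeclaredMethods()) {
            Route r = m.getAnnotation(Route.class);
            if (r != null) {
                lines.add(r.verb() + " " + r.path() + " -> " + m.getName());
            }
        }
        Collections.sort(lines); // reflection order is unspecified
        return lines;
    }

    public static void main(String[] args) {
        for (String line : describe(DemoResource.class)) {
            System.out.println(line);
        }
    }
}
```

 Because the index is derived from the annotated methods, a newly added 
 endpoint shows up without anyone having to remember to edit a separate page.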





[jira] [Commented] (TIKA-1269) Self-hosted documentation for the JAX-RS Server

2014-07-23 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14072311#comment-14072311
 ] 

Nick Burch commented on TIKA-1269:
--

I guess we'll need a bit of Maven / Ant-in-Maven magic to bundle the generated 
html up into the Tika Server jar, so it can be hosted from within the Server?



Re: How should video files with audio be handled by parsers?

2014-07-23 Thread Ray Gauss
They are a bit verbose, but:

1) I'd really like to stick to the specification as closely as possible.

2) There are several PBCore instantiation properties that apply to the 
entire file, like duration and tracks, that we'd want prefixed with pbcore, so I 
think it would be odd to see:

  pbcore:instantiationDuration=00:00:05.20
  stream[0]/pbcore:essenceTrackType=Video

3) PBCore allows for essence track types like text that might not necessarily 
be considered 'streams'.
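
To make the long-form keys concrete, here is a small sketch of how they could 
be built consistently. The PbcoreKeys class and its trackKey helper are 
hypothetical names for illustration, not part of Tika or the FFmpeg parser; 
only the key strings themselves come from this thread.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; only the key layout is taken from the discussion.
public class PbcoreKeys {

    static final String TRACK = "pbcore:instantiationEssenceTrack";

    /** Builds e.g. pbcore:instantiationEssenceTrack[0]/pbcore:essenceTrackType. */
    static String trackKey(int index, String property) {
        return TRACK + "[" + index + "]/pbcore:" + property;
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order so the output reads top-down.
        Map<String, String> metadata = new LinkedHashMap<String, String>();
        metadata.put("pbcore:instantiationDuration", "00:00:05.20");
        metadata.put(trackKey(0, "essenceTrackType"), "Video");
        metadata.put(trackKey(1, "essenceTrackType"), "Audio");
        metadata.put(trackKey(1, "essenceTrackLanguage"), "eng");

        for (Map.Entry<String, String> e : metadata.entrySet()) {
            System.out.println(e.getKey() + "=" + e.getValue());
        }
    }
}
```

With this layout every key, whether file-level or per-track, starts with the 
pbcore prefix, which is the consistency argued for in point 2.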

That's great that the Ogg parsers will be able to do the informational side!

Regards,

Ray


On July 23, 2014 at 10:17:29 AM, Nick Burch (apa...@gagravarr.org) wrote:

  ...
  pbcore:instantiationEssenceTrack[0]/pbcore:essenceTrackType=Video
  ...
  pbcore:instantiationEssenceTrack[1]/pbcore:essenceTrackType=Audio
  pbcore:instantiationEssenceTrack[1]/pbcore:essenceTrackLanguage=eng
  
 I'm not quite so keen on these metadata keys though. Do we gain anything
 from this long form, vs
  
 stream[0]/pbcore:essenceTrackType=Video
 stream[1]/pbcore:essenceTrackType=Audio
 stream[1]/pbcore:essenceTrackLanguage=eng
  
 ?
  
  The current FFmpeg parser wouldn't be able to extract things like
  annotations, but it was only targeting the intrinsic metadata.
  
 The Ogg parsers should be able to output that fairly easily, the only
 reason they don't is that I didn't know what to output as!
  
 Nick
  


[jira] [Commented] (TIKA-1269) Self-hosted documentation for the JAX-RS Server

2014-07-23 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14072368#comment-14072368
 ] 

Lewis John McGibbney commented on TIKA-1269:


Hi [~gagravarr] yeah I have a couple of thoughts here:
 * use the maven-assembly-plugin to generate the following release artifacts: 
server without packaged dependencies, server with packaged dependencies, and 
embedded server (similar to what we currently have), which is invoked via bash 
script and runs via org.mortbay.jetty:jetty-runner.
 * also put work into the .war file, which we can run as a web app within any 
servlet container, as we also do with Any23.
For the artifacts produced by the assembly plugin, we would simply define the 
generated Miredot documentation as resources within the plugin's XML assembly 
descriptors.
As for the WAR documentation, we would define it as WebResources, which would 
then be picked up by the maven-war-plugin when we generate the WAR artifact.

I therefore propose (if you guys are happy with using Miredot for the 
documentation) that we commit the current patch on this issue, then address the 
Javadoc as well as JAX-RS Annotations in a separate issue, before writing the 
assembly descriptors and web application WAR file all in separate issues.

wdyt?
