[ https://issues.apache.org/jira/browse/TIKA-1371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071893#comment-14071893 ]

Rob Tulloh commented on TIKA-1371:
----------------------------------

OK, so it seems fixed in the pending 1.6 release. That is good news.

Now, going back to the original question: how can we get details of the document we are indexing into the log file, as was possible in 1.2? Is there a way we can tell Tika server to log the GUID (which the client specifies) and the filename? This was easy to do in 1.2 because it was part of the URL and Tika logged this information by default.

Another useful feature would be to allow the server to return not XHTML, but just the body text (a la --text on the tika-app). Since our service uses Tika server to pre-process all the documents in a separate service, it would be helpful if Tika server had an option to return just the body text and not the full XHTML. I have code that will parse the result document and extract the body, but it makes more sense to allow the Tika service to return this the way it did in previous versions. I am particularly concerned because this post-processing of the XHTML adds memory overhead to our server JVM.

So, being able to launch the Tika server with a flag like --text (or to use a URL like tika/text) and have it return just the body content would make it compatible with previous versions that did this. Then our application logic would be much simpler and the Solr integration would work as it did with Tika server 1.2.

> passing parameters via URL no longer works (regression)
> -------------------------------------------------------
>
>                 Key: TIKA-1371
>                 URL: https://issues.apache.org/jira/browse/TIKA-1371
>             Project: Tika
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 1.5
>            Reporter: Rob Tulloh
>
> In Tika 1.1 and 1.2, it was possible to add some values to the URL that get
> logged like this:
> http://localhost:9998/tika/GUID/FILENAME
> This was very useful for correlating between client and server in a
> distributed compute environment.
> In 1.5 and in the nightly builds (for 1.6),
> this feature no longer works. Not having this makes it very difficult to
> troubleshoot problems with document processing in a distributed environment.
> Please add back this feature so that operations and development teams can
> more easily figure out which Tika instance is processing which document and
> what the result of the processing was.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
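
For reference, the client-side workaround mentioned in the comment (pulling only the body text out of the XHTML that Tika server returns) might look roughly like the sketch below. It is not the reporter's actual code, which runs in a JVM; it only illustrates the idea with Python's standard xml.sax module. The names BodyTextHandler and extract_body_text are hypothetical, and the sketch assumes the response is well-formed XHTML with a single <body> element. Because it streams SAX events and writes text straight to an output stream, it never builds a full document tree, which is the memory concern raised above.

```python
import io
import xml.sax


class BodyTextHandler(xml.sax.ContentHandler):
    """Hypothetical handler: forwards character data found inside <body>."""

    def __init__(self, out):
        super().__init__()
        self._out = out    # any object with a write(str) method
        self._depth = 0    # greater than 0 while we are inside <body>

    def startElement(self, name, attrs):
        # Start counting at <body>; keep counting for every nested element.
        if name == "body" or self._depth:
            self._depth += 1

    def endElement(self, name):
        if self._depth:
            self._depth -= 1

    def characters(self, content):
        # Text outside <body> (for example the <title>) is skipped.
        if self._depth:
            self._out.write(content)


def extract_body_text(xhtml_stream, out):
    """Stream-parse XHTML from a file-like object, writing body text to out."""
    xml.sax.parse(xhtml_stream, BodyTextHandler(out))


if __name__ == "__main__":
    # Stand-in for a server response; real output will be larger.
    sample = (
        b'<html xmlns="http://www.w3.org/1999/xhtml">'
        b"<head><title>ignored</title></head>"
        b"<body><p>Hello</p><p>World</p></body></html>"
    )
    buf = io.StringIO()
    extract_body_text(io.BytesIO(sample), buf)
    print(buf.getvalue())
```

In practice the first argument would be the HTTP response stream rather than an in-memory buffer, so only the current chunk of text is held at any time.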