[ 
https://issues.apache.org/jira/browse/TIKA-1371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071893#comment-14071893
 ] 

Rob Tulloh commented on TIKA-1371:
----------------------------------

OK, so it seems to be fixed in the pending 1.6 release. That is good news. Now, 
going back to the original question: how can we get details of the document we 
are indexing into the log file, as was possible in 1.2? Is there a way to tell 
Tika server to log the GUID (which the client specifies) and the filename? This 
was easy to do in 1.2 because both were part of the URL and Tika logged this 
information by default.
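
For reference, here is a minimal sketch of how a client might build the 
1.2-style URL from the issue description (http://localhost:9998/tika/GUID/FILENAME). 
The function name tika_url and the sample values are hypothetical, and the 
sketch only shows the URL shape; it does not establish what any given Tika 
server version does with the extra path segments.

```python
from urllib.parse import quote


def tika_url(base, guid, filename):
    """Build a URL of the form BASE/tika/GUID/FILENAME (hypothetical helper)."""
    # safe="" also percent-encodes "/", so a filename cannot add path segments.
    return "{}/tika/{}/{}".format(
        base.rstrip("/"), quote(guid, safe=""), quote(filename, safe="")
    )
```

Percent-encoding each segment keeps spaces or slashes in a filename from 
changing the path that ends up in the server log.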

Another useful feature would be to allow the server to return just the body 
text rather than XHTML (à la --text in tika-app). Our service uses Tika server 
to pre-process all documents in a separate service, so an option to return 
only the body text and not the full XHTML would be helpful. I have code that 
parses the result document and extracts the body, but it makes more sense for 
the Tika service to return just the text, the way it did in previous versions. 
I am particularly concerned because this post-processing of the XHTML adds 
memory overhead to our server JVM. An option to launch Tika server with 
something like --text (or a URL like tika/text) that returns just the body 
content would make it compatible with previous versions. Our application logic 
would then be much simpler, and the Solr integration would work as it did with 
Tika server 1.2.
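
The client-side post-processing described above could look roughly like the 
following sketch. It is written in Python for brevity and is not the code 
referred to in this comment. It assumes the server response is well-formed 
XHTML with a single body element; the names BodyTextHandler and 
extract_body_text are hypothetical. A SAX parser is used so that no DOM tree 
is built, although the extracted text itself is still held in memory.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, feature_external_ges


class BodyTextHandler(ContentHandler):
    """Collects character data that appears inside the <body> element."""

    def __init__(self):
        super().__init__()
        self._depth_in_body = 0  # 0 means the parser is outside <body>
        self._chunks = []

    def startElement(self, name, attrs):
        # Track nesting so text in descendants of <body> is kept as well.
        if self._depth_in_body or name == "body":
            self._depth_in_body += 1

    def endElement(self, name):
        if self._depth_in_body:
            self._depth_in_body -= 1

    def characters(self, content):
        if self._depth_in_body:
            self._chunks.append(content)

    def text(self):
        return "".join(self._chunks)


def extract_body_text(xhtml_stream):
    """Return the text content of <body> from a binary XHTML stream."""
    handler = BodyTextHandler()
    parser = xml.sax.make_parser()
    # Do not fetch external entities referenced by the document.
    parser.setFeature(feature_external_ges, False)
    parser.setContentHandler(handler)
    parser.parse(xhtml_stream)
    return handler.text()


if __name__ == "__main__":
    sample = (b'<html xmlns="http://www.w3.org/1999/xhtml">'
              b'<head><title>T</title></head>'
              b'<body><p>Hello</p>\n<p>world</p></body></html>')
    print(extract_body_text(io.BytesIO(sample)))
```

A server-side option that returns only the body text would make this step 
unnecessary on the client.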



> passing parameters via URL no longer works (regression)
> -------------------------------------------------------
>
>                 Key: TIKA-1371
>                 URL: https://issues.apache.org/jira/browse/TIKA-1371
>             Project: Tika
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 1.5
>            Reporter: Rob Tulloh
>
> In Tika 1.1 and 1.2, it was possible to add some values to the URL that get 
> logged like this:
> http://localhost:9998/tika/GUID/FILENAME
> This was very useful for correlating between client and server in a 
> distributed compute environment. In 1.5 and in the nightly builds (for 1.6), 
> this feature no longer works. Not having this makes it very difficult to 
> troubleshoot problems with document processing in a distributed environment. 
> Please add this feature back so that operations and development teams can 
> more easily figure out which Tika instance is processing which document and 
> what the result of the processing was.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
