Hi Dave,

Apologies for coming back to this over a month later. We had worked around (or simply not seen) the issue for a while, but it has come back as we start to ramp up our testing. Having investigated it from several angles today, the problem seems to be that SOME PNG files fail when parsed by Tika, but only when the -T or -t switch is applied.

So I am currently running Tika locally (under Java 1.6.0_26) using the following command:

    java -jar ~/software/tika/tika-app-1.3.jar -t -s -p 9100

And then running the following Ruby code (under Ruby 1.8.7 patch 371, although I think this would work on all releases):

    #!/usr/bin/env ruby

    require 'socket'

    class FileStreamer

      attr_reader :filename

      def initialize(filename)
        @filename = filename
      end

      def do_it
        TCPSocket.open('127.0.0.1', 9100) do |socket|
          File.open(filename) do |file|
            content = file.read
            socket.write(content)
            socket.close_write
            puts socket.read
          end
        end
      end

    end

    file_streamer = FileStreamer.new('./Pictures/test.png').do_it

--> This then throws the following error:

    Errno::ECONNRESET: (eval):19:in `read': Connection reset by peer
        from /home/bturner/.rbenv/versions/1.8.7-p371/lib/ruby/gems/1.8/gems/interactive_editor-0.0.10/lib/interactive_editor.rb:55:in `eval'
        from (eval):19:in `do_it'
        from (eval):15:in `open'
        from (eval):15:in `do_it'
        from (eval):14:in `open'
        from (eval):14:in `do_it'
        from (eval):26

The file I am using to cause this error can be downloaded from http://imgur.com/r/quotesporn/hUGXn using the "Download Full Resolution" link - or this direct link: http://bit.ly/ZLT9Xs

Our process is trying to extract content only (and not metadata) from all files that are thrown at it. We realise this means PNG and JPEG files will return nothing, but we're trying to handle all files the same way where possible, as we can't be 100% sure of the file types before processing. Hence we use the -t flag, and NOT the -m flag.

Note that changing the -t flag to -m causes the PNG to be processed correctly, with a blank return value. We have also not experienced this behaviour with JPEGs or other "no textual content" formats so far.
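
For anyone who wants to reproduce this outside an interactive session, the same client can be sketched as a standalone helper. This is only a sketch under assumptions: `stream_to_tika` is a hypothetical name, and the binary file mode and 64 KB chunk size are my own choices rather than anything known to affect the problem. It does not fix the reset; it just reports it and returns nil instead of raising.

```ruby
require 'socket'

# Hypothetical helper (not part of Tika): streams a file to a Tika server
# started with "-t -s -p <port>" and returns whatever the server sends back.
def stream_to_tika(filename, host = '127.0.0.1', port = 9100)
  TCPSocket.open(host, port) do |socket|
    # Assumption: 'rb' keeps binary data such as PNGs byte-for-byte intact.
    File.open(filename, 'rb') do |file|
      # Send the file in chunks rather than reading it all into memory.
      while (chunk = file.read(64 * 1024))
        socket.write(chunk)
      end
    end
    socket.close_write # signal end of input so the server can reply
    socket.read        # block value, returned by TCPSocket.open
  end
rescue Errno::ECONNRESET, Errno::EPIPE => e
  # The server dropped the connection (the failure described above).
  warn "#{filename}: #{e.class}"
  nil
end
```

Usage would be along the lines of `puts stream_to_tika('./Pictures/test.png')`, with the host and port passed as extra arguments if the server is not on 127.0.0.1:9100.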

Thanks and regards,
Ben

On 13 March 2013 11:12, Dave Meikle <loo...@gmail.com> wrote:

> Hi Ben,
>
> On 12 Mar 2013, at 05:33, Ben Turner <ben.tur...@pobox.com> wrote:
>
> > * We then talk to it via ruby sockets (for non-rubyists, this streams a
> > document from the file system into our local tika server over a simple
> > socket) :
> >
> > #!/usr/bin/env ruby
> > require 'socket'
> > TCPSocket.open('127.0.0.1', 12345) do |socket|
> >   File.open('/tmp/test.png', 'r') do |chunk|
> >     socket.write(chunk)
> >   end
> >   socket.close_write
> >   puts socket.read
> > end
>
> There is no known fault around this so tried this locally, and with a wee
> tweak to the Ruby code to use socket.write(chunk.read), it works for me
> with all document types. I also used -m on the server to make sure the PNG
> was being processed and it dumps back the metadata.
>
> Is there anything else in the way over the network (firewall, IDS, etc)?
>
> Cheers,
> Dave