Some results of looking at commons-fileupload... There are 4 source files that have "catch" in them:

./src/java/org/apache/commons/fileupload/MultipartStream.java
./src/java/org/apache/commons/fileupload/util/Streams.java
./src/java/org/apache/commons/fileupload/FileUploadBase.java
./src/java/org/apache/commons/fileupload/disk/DiskFileItem.java

Of these 4, it looks like two have the potential for silently eating thrown IOExceptions:

./src/java/org/apache/commons/fileupload/MultipartStream.java
./src/java/org/apache/commons/fileupload/disk/DiskFileItem.java

DiskFileItem can eat exceptions if it gets them upon reading a multipart section from disk. MultipartStream can eat exceptions if there are any I/O errors reading the input stream. I suspect the latter is what might be happening here. If anyone wants to verify this: the code in question first converts IOExceptions to MalformedStreamExceptions. Then, later, it eats most MalformedStreamExceptions and treats the stream as being empty (a schematic of the pattern is sketched below).

Karl
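A rough schematic of that two-step failure mode (hypothetical code written to illustrate the described pattern, not the actual MultipartStream source; PartReader and BOUNDARY_MARKER are made-up names):

    import java.io.IOException;
    import java.io.InputStream;

    class MalformedStreamException extends IOException {
        MalformedStreamException(String message) { super(message); }
    }

    class PartReader {
        private static final int BOUNDARY_MARKER = 0x2D; // '-', stand-in boundary byte
        private final InputStream in;

        PartReader(InputStream in) { this.in = in; }

        // Step 1: a low-level read wraps any IOException -- including a
        // socket reset -- in a MalformedStreamException, discarding the cause.
        int readByte() throws MalformedStreamException {
            int b;
            try {
                b = in.read();
            } catch (IOException e) {
                throw new MalformedStreamException("Stream ended unexpectedly");
            }
            if (b == -1) {
                throw new MalformedStreamException("Stream ended unexpectedly");
            }
            return b;
        }

        // Step 2: a higher-level caller swallows the wrapper exception, so a
        // genuine I/O error on a partially received post is indistinguishable
        // from the stream simply having no more parts.
        boolean readBoundary() {
            try {
                while (readByte() != BOUNDARY_MARKER) {
                    // keep scanning for the boundary
                }
                return true;   // boundary found
            } catch (MalformedStreamException e) {
                return false;  // I/O error silently becomes "no more parts"
            }
        }
    }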
-----Original Message-----
From: Wright Karl (Nokia-S/Cambridge)
Sent: Thursday, June 10, 2010 1:34 PM
To: [email protected]
Subject: RE: Solr spewage and dropped documents, while indexing

Hmmm. I did a run of the proposed change and it did not help. If anything, the system behaved worse and generated many more 400s than before. So the change is probably having the intended effect, but the extra file-deletion time is interfering even further with "Solr keeping up".

Further analysis shows that there are actually two problems. The first problem is that perfectly reasonable documents sometimes generate 400s. The second problem is the connection reset (which is what actually kills the client), and which could well be due to a socket timeout. The client reports this trace, which simply shows that the post-response socket was closed by somebody on the server end:

java.net.SocketException: Connection reset
        at java.net.SocketInputStream.read(SocketInputStream.java:168)
        at HttpPoster.getResponse(HttpPoster.java:280)
        at HttpPoster.indexPost(HttpPoster.java:191)
        at ParseAndLoad$PostThread.run(ParseAndLoad.java:638)

Could the 400 error be due to a similar socket timeout issue? Well, that would depend on whether commons-fileupload is capable of silently eating socket exceptions and instead truncating the post it has partially received. And, of course, on what Jetty's default socket parameters look like. Can anyone save me some time and give me a pointer to where/how/what those parameters are set to, for the example?

Karl

-----Original Message-----
From: Wright Karl (Nokia-S/Cambridge)
Sent: Wednesday, June 09, 2010 11:24 AM
To: [email protected]
Subject: RE: Solr spewage and dropped documents, while indexing

Ah, the old "misleading documentation" trick! I'll have to give this a try and see if my problem goes away.

Karl

-----Original Message-----
From: ext Mark Miller [mailto:[email protected]]
Sent: Wednesday, June 09, 2010 11:19 AM
To: [email protected]
Subject: Re: Solr spewage and dropped documents, while indexing

Hang on though - I saw a commons JIRA issue from '08 that claimed the javadoc for this class was misleading and that there was no default cleaner set. That issue was resolved, but the javadoc *still* seemed to indicate there was a default cleaner in use... so I wondered whether the code had changed, or the javadoc was still misleading.

Looking at getFileCleaningTracker(), it also says:

    An instance of FileCleaningTracker, defaults to FileCleaner.getInstance().

But then looking at the code, I don't see how that is possible. It really appears to default to null (no cleaner). So I ran a quick test, printing out the cleaning tracker, and it prints 'null'.
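That quick test amounts to something like this (a minimal sketch, assuming the commons-fileupload 1.2.x API in which DiskFileItemFactory exposes getFileCleaningTracker()):

    import org.apache.commons.fileupload.disk.DiskFileItemFactory;

    public class TrackerCheck {
        public static void main(String[] args) {
            DiskFileItemFactory factory = new DiskFileItemFactory();
            // The javadoc claims this defaults to FileCleaner.getInstance(),
            // but it prints "null" -- i.e. no cleaner is actually installed.
            System.out.println(factory.getFileCleaningTracker());
        }
    }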
So, perhaps we try setting one and see where your problem is? It really appears the javadoc I'm seeing does not match the code.

- Mark

On 6/9/10 8:01 AM, [email protected] wrote:
> Ok, that theory bites the dust then...
>
> I'll have to work on some diagnostics then to see why the content doesn't
> get added.
>
> Karl
>
> -----Original Message-----
> From: ext Mark Miller [mailto:[email protected]]
> Sent: Wednesday, June 09, 2010 10:39 AM
> To: [email protected]
> Subject: Re: Solr spewage and dropped documents, while indexing
>
> On 6/9/10 6:01 AM, [email protected] wrote:
>>
>> but if I correctly recall how DiskFileItemFactory works, it creates
>> files and registers them to be cleaned up on JVM exit. If that's the
>> only cleanup, that's not going to cut it for a real-world system.
>
> Class DiskFileItemFactory
>
> "Temporary files are automatically deleted as soon as they are no longer
> needed. (More precisely, when the corresponding instance of File is
> garbage collected.) Cleaning up those files is done by an instance of
> FileCleaningTracker, and an associated thread."

--
- Mark
http://www.lucidimagination.com
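For reference, "setting one" as Mark suggests would look roughly like the following sketch (assuming commons-fileupload 1.2.x with commons-io on the classpath; newFactory is a made-up helper name):

    import org.apache.commons.fileupload.disk.DiskFileItemFactory;
    import org.apache.commons.io.FileCleaningTracker;

    public class TrackerSetup {
        public static DiskFileItemFactory newFactory() {
            DiskFileItemFactory factory = new DiskFileItemFactory();
            // With a tracker explicitly installed, the temp files backing
            // uploaded parts are deleted once their DiskFileItem is
            // garbage collected, rather than lingering until JVM exit.
            factory.setFileCleaningTracker(new FileCleaningTracker());
            return factory;
        }
    }

In a long-running server, the tracker's reaper thread would presumably also need to be stopped at shutdown, via FileCleaningTracker's exitWhenFinished().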
