[ https://issues.apache.org/jira/browse/SOLR-798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12669932#action_12669932 ]
Fergus McMenemie commented on SOLR-798:
---------------------------------------

Despite the above: in my experience, using solutions provided by Messrs Verity and Autonomy, letting the search engine walk a directory tree with millions of documents always lets you down. It could take days to recover from some situations. You have to manage the collection of files yourself and, while doing so, build bulk insert/delete files (BIF files) which are passed to the search engine to control indexing. So it is perhaps a blessing in disguise to see that Solr won't even let me walk large directory trees.

I have a vague intention to write a DIH enhancement to implement reading BIF files containing a list of add/delete instructions. If only my Java were better!

However, for the record, how large a directory tree were you able to walk? I am currently walking about 40,000 documents, but that is only while messing about trying to get a feel for Solr; this strategy could not be used in production.

> FileListEntityProcessor can't handle directories containing lots of files
> --------------------------------------------------------------------------
>
>                 Key: SOLR-798
>                 URL: https://issues.apache.org/jira/browse/SOLR-798
>             Project: Solr
>          Issue Type: Bug
>          Components: contrib - DataImportHandler
>            Reporter: Grant Ingersoll
>            Priority: Minor
>
> The FileListEntityProcessor currently tries to process all documents in a single directory at once, and stores the results in a HashMap. On directories containing a large number of documents, this quickly causes OutOfMemory errors.
> Unfortunately, the typical fix for this is to hack FileFilter to do the work for you and always return false from the accept method. It may be possible to hook up some type of Producer/Consumer multithreaded FileFilter approach, whereby the FileFilter blocks until the nextRow() mechanism requests another row, thereby avoiding the need to cache everything in the map.
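For illustration only, here is roughly what the Producer/Consumer FileFilter idea described in the issue might look like. This is not code from FileListEntityProcessor; the class and method names are made up, and the bounded queue size is arbitrary:

{code:java}
import java.io.File;
import java.io.FileFilter;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/**
 * Sketch of a streaming directory walk: accept() never accumulates results,
 * it just hands each file to a bounded queue that nextRow() can drain.
 */
public class StreamingFileLister {

    private static final File END_MARKER = new File("");               // poison pill
    private final BlockingQueue<File> queue = new ArrayBlockingQueue<File>(256);

    /** Walks the directory on a background thread; accept() blocks when the queue is full. */
    public void startWalk(final File baseDir) {
        Thread producer = new Thread(new Runnable() {
            public void run() {
                baseDir.listFiles(new FileFilter() {
                    public boolean accept(File f) {
                        try {
                            queue.put(f);          // blocks until the consumer catches up
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        }
                        return false;              // never let listFiles() build a big array
                    }
                });
                try {
                    queue.put(END_MARKER);         // signal the end of the directory
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
        producer.setDaemon(true);
        producer.start();
    }

    /** Would be called from something like nextRow(); returns null when the walk is done. */
    public File nextFile() throws InterruptedException {
        File f = queue.take();
        return (f == END_MARKER) ? null : f;
    }
}
{code}

Because the queue is bounded, memory use stays constant no matter how many files the directory holds; the trade-off is a second thread and the usual interruption handling.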
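And to illustrate the kind of BIF-driven processing mentioned in the comment above: the one-instruction-per-line format below is invented purely for illustration (it is not Verity's real BIF syntax), and the handler methods are placeholders for whatever a DIH enhancement would eventually call.

{code:java}
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;

/**
 * Hypothetical reader for a simple bulk insert/delete (BIF-style) file,
 * e.g. lines of the form "add /path/to/document.xml" or "delete unique-id".
 */
public class BifFileReader {

    public void process(File bifFile) throws IOException {
        BufferedReader reader = new BufferedReader(new FileReader(bifFile));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                line = line.trim();
                if (line.length() == 0 || line.startsWith("#")) {
                    continue;                                   // skip blanks and comments
                }
                int space = line.indexOf(' ');
                String action = (space < 0) ? line : line.substring(0, space);
                String target = (space < 0) ? ""   : line.substring(space + 1).trim();

                if ("add".equalsIgnoreCase(action)) {
                    addDocument(new File(target));
                } else if ("delete".equalsIgnoreCase(action)) {
                    deleteDocument(target);
                }
            }
        } finally {
            reader.close();
        }
    }

    private void addDocument(File doc)     { /* hand off to the indexing pipeline */ }

    private void deleteDocument(String id) { /* issue a delete-by-id */ }
}
{code}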