[ https://issues.apache.org/jira/browse/SOLR-798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12669932#action_12669932 ]
Fergus McMenemie commented on SOLR-798:
---------------------------------------

Despite the above: in my experience, using solutions provided by Messrs Verity and Autonomy, letting the search engine walk a directory tree with millions of documents always lets you down. It could take days to recover from some situations. You have to manage the collection of files yourself and, while doing so, build bulk insert/delete files (BIF files) which are passed to the search engine to control indexing. So it is perhaps a blessing in disguise to see that Solr won't even let me walk large directory trees.

I have a vague intention to write a DIH enhancement to implement reading BIF files containing a list of add/delete instructions. If only my Java were better!

However, for the record, how large a directory tree were you able to walk? I am currently walking about 40,000 documents, but that is only while messing about trying to get a feel for Solr; this strategy could not be used in production.

> FileListEntityProcessor can't handle directories containing lots of files
> --------------------------------------------------------------------------
>
>                 Key: SOLR-798
>                 URL: https://issues.apache.org/jira/browse/SOLR-798
>             Project: Solr
>          Issue Type: Bug
>          Components: contrib - DataImportHandler
>            Reporter: Grant Ingersoll
>            Priority: Minor
>
> The FileListEntityProcessor currently tries to process all documents in a single directory at once, and stores the results in a HashMap. On directories containing a large number of documents, this quickly causes OutOfMemory errors.
> Unfortunately, the typical fix for this is to hack FileFilter to do the work for you and always return false from the accept method. It may be possible to hook up some type of Producer/Consumer multithreaded FileFilter approach, whereby the FileFilter blocks until the nextRow() mechanism requests another row, thereby avoiding the need to cache everything in the map.
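For illustration only, here is roughly what the Producer/Consumer FileFilter idea described in the issue might look like. This is not code from FileListEntityProcessor; the class and method names are made up, and the bounded queue size is arbitrary:

{code:java}
import java.io.File;
import java.io.FileFilter;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/**
 * Sketch of a streaming directory walk: accept() never accumulates results,
 * it just hands each file to a bounded queue that nextRow() can drain.
 */
public class StreamingFileLister {

    private static final File END_MARKER = new File("");               // poison pill
    private final BlockingQueue<File> queue = new ArrayBlockingQueue<File>(256);

    /** Walks the directory on a background thread; accept() blocks when the queue is full. */
    public void startWalk(final File baseDir) {
        Thread producer = new Thread(new Runnable() {
            public void run() {
                baseDir.listFiles(new FileFilter() {
                    public boolean accept(File f) {
                        try {
                            queue.put(f);          // blocks until the consumer catches up
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        }
                        return false;              // never let listFiles() build a big array
                    }
                });
                try {
                    queue.put(END_MARKER);         // signal the end of the directory
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
        producer.setDaemon(true);
        producer.start();
    }

    /** Would be called from something like nextRow(); returns null when the walk is done. */
    public File nextFile() throws InterruptedException {
        File f = queue.take();
        return (f == END_MARKER) ? null : f;
    }
}
{code}

Because the queue is bounded, memory use stays constant no matter how many files the directory holds; the trade-off is a second thread and the usual interruption handling.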
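And to illustrate the kind of BIF-driven processing mentioned in the comment above: the one-instruction-per-line format below is invented purely for illustration (it is not Verity's real BIF syntax), and the handler methods are placeholders for whatever a DIH enhancement would eventually call.

{code:java}
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;

/**
 * Hypothetical reader for a simple bulk insert/delete (BIF-style) file,
 * e.g. lines of the form "add /path/to/document.xml" or "delete unique-id".
 */
public class BifFileReader {

    public void process(File bifFile) throws IOException {
        BufferedReader reader = new BufferedReader(new FileReader(bifFile));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                line = line.trim();
                if (line.length() == 0 || line.startsWith("#")) {
                    continue;                                   // skip blanks and comments
                }
                int space = line.indexOf(' ');
                String action = (space < 0) ? line : line.substring(0, space);
                String target = (space < 0) ? ""   : line.substring(space + 1).trim();

                if ("add".equalsIgnoreCase(action)) {
                    addDocument(new File(target));
                } else if ("delete".equalsIgnoreCase(action)) {
                    deleteDocument(target);
                }
            }
        } finally {
            reader.close();
        }
    }

    private void addDocument(File doc)     { /* hand off to the indexing pipeline */ }

    private void deleteDocument(String id) { /* issue a delete-by-id */ }
}
{code}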