[ 
https://issues.apache.org/jira/browse/SOLR-2864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13156615#comment-13156615
 ] 

Shalin Shekhar Mangar commented on SOLR-2864:
---------------------------------------------

Thanks for the patch, Gabriel.

1. Your use-case seems to be about making sure that when you add new files to a 
directory, an old file does not overwrite the new records -- one could use 
"newerThan" to process only the new files.
2. FileListEntityProcessor has never guaranteed order so technically this is 
not a bug.
3. What you have proposed is a very arbitrary sort order. In particular, the 
sort order for directories differs from the order for files. It is probably 
relevant to your use-case, but once we start down this path, we will have 
people asking for other sort orders.

That being said, I'd hate for your work to go to waste. A change in walk order 
shouldn't affect anyone because the order was never guaranteed anyway, but at 
the least, we should have the same sort order for both directories and files; 
otherwise the scenario you've described in the issue description can still 
happen.
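
To illustrate the last point, here is a minimal sketch of a directory walk that 
applies one comparator to every entry, file or directory alike. It is not the 
attached patch and not FileListEntityProcessor's actual code; the class name 
DeterministicWalk, the helper sortEntries, and the choice of ordering by name 
are all assumptions made for illustration, and it assumes Java 8 or later.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration only; not part of Solr or of the attached patch.
public class DeterministicWalk {

    // A single comparator used for every entry, whether file or directory.
    static final Comparator<File> BY_NAME = Comparator.comparing(File::getName);

    // Returns a sorted copy of the given entries; the input array is untouched.
    static File[] sortEntries(File[] entries) {
        File[] copy = entries.clone();
        Arrays.sort(copy, BY_NAME);
        return copy;
    }

    // Depth-first walk that visits entries in the same order at every level.
    static List<File> walk(File dir) {
        List<File> out = new ArrayList<>();
        File[] children = dir.listFiles(); // order is unspecified by the JDK
        if (children == null) {
            return out; // not a directory, or an I/O error occurred
        }
        for (File child : sortEntries(children)) {
            if (child.isDirectory()) {
                out.addAll(walk(child));
            } else {
                out.add(child);
            }
        }
        return out;
    }
}
```

Because directories and files go through the same comparator, the visiting 
order no longer depends on what the underlying file system happens to return.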
                
> DataImportHandler has non-deterministic sort order for XML files
> ----------------------------------------------------------------
>
>                 Key: SOLR-2864
>                 URL: https://issues.apache.org/jira/browse/SOLR-2864
>             Project: Solr
>          Issue Type: Bug
>          Components: contrib - DataImportHandler
>    Affects Versions: 3.4
>            Reporter: Gabriel Cooper
>            Priority: Minor
>              Labels: dataimport, patch, xml
>             Fix For: 3.5
>
>         Attachments: lucene-2864.patch, lucene-2864.patch
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> DataImportHandler's FileListEntityProcessor relies on Java's File.list() 
> method to retrieve a list of files from the configured dataimport directory, 
> but list() does not guarantee a sort order ^(1)^. This means that if you have 
> two files that update the same record, the results are non-deterministic. 
> Typically, list() does in fact return them lexicographically sorted, but this 
> is not guaranteed ^(2)^.
> An example of how you can get into trouble is to imagine the following:
> xyz.xml -- Created one hour ago. Contains updates to records "Foo" and "Bar".
> abc.xml -- Created one minute ago. Contains updates to records "Bar" and 
> "Baz".
> In this case, the newest file, abc.xml, would (likely, but not guaranteed) 
> be run first, updating the "Bar" and "Baz" records. Next, the older file, 
> xyz.xml, would update "Foo" and overwrite "Bar" with outdated changes.
>  (1) Per 
> http://download.oracle.com/javase/1,5,0/docs/api/java/io/File.html#list%28%29
> "There is no guarantee that the name strings in the resulting array will 
> appear in any specific order; they are not, in particular, guaranteed to 
> appear in alphabetical order."
>  (2)  Even if it was guaranteed, lexicographical sorting would give you the 
> following sort order:
>   1.xml
>   10.xml
>   2.xml
>   ...
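
The ordering in footnote (2) can be reproduced with a short sketch. Plain 
String ordering compares names character by character, so "10.xml" sorts 
before "2.xml". The second method shows one possible alternative that orders 
by the leading integer in the name. The class NameOrderDemo and its methods 
are hypothetical, are not part of Solr or the patch, and assume Java 8 or 
later and numeric prefixes small enough to fit in an int.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical illustration only; not part of Solr or of the attached patch.
public class NameOrderDemo {

    // Plain String ordering: compares names character by character.
    static String[] lexicographic(String[] names) {
        String[] copy = names.clone();
        Arrays.sort(copy);
        return copy;
    }

    // Assumed alternative: order by the integer at the start of each name.
    static String[] byLeadingNumber(String[] names) {
        String[] copy = names.clone();
        Arrays.sort(copy, Comparator.comparingInt(NameOrderDemo::leadingNumber));
        return copy;
    }

    // Parses the leading run of digits; names without one sort last.
    static int leadingNumber(String name) {
        int end = 0;
        while (end < name.length() && Character.isDigit(name.charAt(end))) {
            end++;
        }
        return end == 0 ? Integer.MAX_VALUE : Integer.parseInt(name.substring(0, end));
    }
}
```

Neither ordering reflects file age, which is what the xyz.xml / abc.xml 
example in the description actually depends on.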
