[jira] [Updated] (SOLR-445) XmlUpdateRequestHandler bad documents mid batch aborts rest of batch
[ https://issues.apache.org/jira/browse/SOLR-445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Erick Erickson updated SOLR-445:

    Attachment: SOLR-445.patch

So, Grant, how do you feel about refactorings? I got bitten by this problem again, so I decided to dust off the patch, and I re-created it. This one shouldn't have the gratuitous re-formatting. But after I added the bookkeeping, the method got even more unwieldy, so I extracted some of the code into methods in XMLLoader. I also have the un-refactored version if this one is too painful.

This patch incorporates the changes you suggested months ago. I'm a little uncertain whether putting a constant in UpdateParams.java was the correct place, but it seemed like the pattern used for other parameters.

One minor issue: the behavior is the same as it used to be if you don't start the packet with <add>: an NPE is thrown. That's because the addCmd variable isn't initialized until the <add> tag is encountered, and the NPE results from using addCmd later (I think I was seeing it at line 118). I think it would be better to fail explicitly if the first element isn't an <add> element, rather than failing only because that happens to cause an NPE.

While I'm at it, though, what do you think about making this robust enough to ignore <?xml ...?> and/or <!DOCTYPE ...> entries? Or is that just not worth the bother?

Erick

XmlUpdateRequestHandler bad documents mid batch aborts rest of batch

Key: SOLR-445
URL: https://issues.apache.org/jira/browse/SOLR-445
Project: Solr
Issue Type: Bug
Components: update
Affects Versions: 1.3
Reporter: Will Johnson
Assignee: Grant Ingersoll
Fix For: Next
Attachments: SOLR-445-3_x.patch, SOLR-445.patch, SOLR-445.patch, SOLR-445.patch, SOLR-445.patch, SOLR-445_3x.patch, solr-445.xml

Has anyone run into the problem of handling bad documents / failures mid batch? I.e.:

<add>
  <doc>
    <field name="id">1</field>
  </doc>
  <doc>
    <field name="id">2</field>
    <field name="myDateField">I_AM_A_BAD_DATE</field>
  </doc>
  <doc>
    <field name="id">3</field>
  </doc>
</add>

Right now Solr adds the first doc and then aborts. It would seem like it should either fail the entire batch, or log a message/return a code and then continue on to add doc 3. Option 1 would seem to be much harder to accomplish and would possibly require more memory, while Option 2 would require more information to come back from the API. I'm about to dig into this, but I thought I'd ask to see if anyone had any suggestions, thoughts or comments.

--
This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira

To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
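The fail-fast behavior Erick proposes for a packet that doesn't start with <add> could look roughly like the sketch below. This is a hypothetical illustration, not the actual XMLLoader code; the class and method names here are invented, and the real patch works inside a larger StAX parsing loop.

```java
// Hypothetical sketch: reject an update packet whose first element is not
// <add>, instead of letting an uninitialized addCmd surface later as an NPE.
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import java.io.StringReader;

public class UpdateStartCheck {

    // Returns the root element's local name, throwing a descriptive error
    // when it is anything other than "add" (or when the packet is empty).
    public static String requireAddRoot(String xml) {
        try {
            XMLStreamReader parser = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (parser.hasNext()) {
                if (parser.next() == XMLStreamConstants.START_ELEMENT) {
                    String name = parser.getLocalName();
                    if (!"add".equals(name)) {
                        throw new IllegalArgumentException(
                            "Update packet must start with <add>, found <" + name + ">");
                    }
                    return name;
                }
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Malformed update packet", e);
        }
        throw new IllegalArgumentException("Empty update packet");
    }
}
```

A check like this also gives a natural place to skip over <?xml ...?> prologs, since StAX reports those as separate events before the first START_ELEMENT.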
[jira] Updated: (SOLR-445) XmlUpdateRequestHandler bad documents mid batch aborts rest of batch
[ https://issues.apache.org/jira/browse/SOLR-445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Erick Erickson updated SOLR-445:

    Attachment: SOLR-445_3x.patch
                SOLR-445.patch

OK, I think this is ready to go if someone wants to take a look and commit. This patch includes the ability to turn on continuing to process documents after the first failure, as per Erik H's comments. The default is the old behavior of stopping at the first error. I changed the example solrconfig.xml to include the new parameter set to false (mimicking the old behavior) in both 3x and trunk.
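The solrconfig.xml change described above might look something like the fragment below. This is only a sketch: the parameter name `continueOnError` is a placeholder, since the real name is whatever constant the patch added to UpdateParams.java.

```xml
<!-- Hypothetical sketch of the example-config change described above.
     The parameter name is illustrative, not necessarily what the patch uses;
     false mimics the old behavior of aborting the batch on the first error. -->
<requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
  <lst name="defaults">
    <bool name="continueOnError">false</bool>
  </lst>
</requestHandler>
```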
[jira] Updated: (SOLR-445) XmlUpdateRequestHandler bad documents mid batch aborts rest of batch
[ https://issues.apache.org/jira/browse/SOLR-445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Erick Erickson updated SOLR-445:

    Attachment: SOLR-445-3_x.patch
                SOLR-445.patch

I think it's ready for review, both trunk and 3_x. Would someone look this over and commit it if they think it's ready?

Note to self: do NOT call initCore in a test case just because you need a different schema. The problem I was having with running tests was that I needed a schema file with a required field, so I naively called initCore with schema11.xml in spite of the fact that @BeforeClass had already called it with just schema.xml. Which apparently does bad things to the state of *something* and caused other tests to fail... I can get TestDistributedSearch to fail on unchanged source code simply by calling initCore with schema11.xml and doing nothing else in a new test case in BasicFunctionalityTest. So I put my new tests that require schema11 in a new file instead.

The attached XML file is not intended to be committed; it is just a convenience for anyone checking out this patch to run against a Solr instance to see what is returned. This seems to return the data in the SolrJ case as well.

NOTE: This does change the behavior of Solr. Without this patch, the first incorrect document stops processing. Now it continues merrily on, adding documents as it can. Is this desirable behavior? It would be easy to abort on the first error if that's the consensus, and I could take some tedious record-keeping out. I think there's no big problem with continuing on, since the state of committed documents is already indeterminate when errors occur, so worrying about that should be part of a bigger issue.
[jira] Updated: (SOLR-445) XmlUpdateRequestHandler bad documents mid batch aborts rest of batch
[ https://issues.apache.org/jira/browse/SOLR-445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Erick Erickson updated SOLR-445:

    Attachment: solr-445.xml
                SOLR-445.patch

Here's a cut at an improvement, at least. The attached XML file contains an add packet with a number of documents illustrating a number of errors. The XML file can be POSTed to Solr for indexing via the post.jar file so you can see the output.

This patch attempts to report back to the user the following for each document that failed:

1. the ordinal position in the file where the error occurred (e.g. the first, second, etc. <doc> tag);
2. the uniqueKey, if available;
3. the error.

The general idea is to accrue the errors in a StringBuilder and eventually re-throw the error after processing as far as possible.

Issues:

1. The reported format in the log file is kind of hard to read. I pipe-delimited the various doc entries, but they run together in a Windows DOS window. What happens on Unix I'm not quite sure. Suggestions welcome.

2. From the original post, rolling this back will be tricky. Very tricky. The autocommit feature makes it indeterminate what's been committed to the index, so I don't know how to even approach rolling back everything.

3. The intent here is to give the user a clue where to start when figuring out which document(s) failed, so they don't have to guess.

4. Tests fail, but I have no clue why. I checked out a new copy of trunk and that fails as well, so I don't think this patch is the cause of the errors. But let's not commit this until we can be sure.

5. What do you think about limiting the number of docs that fail before quitting? One could imagine some ratio (say 10%) that have to fail before quitting (with some safeguards, like not bothering to calculate the ratio until 20 docs had been processed, or...). Or an absolute number. Should this be a parameter? Or hard-coded? The assumption here is that if 10 (or 100 or...) docs fail, there's something pretty fundamentally wrong and it's a waste to keep on. I don't have any strong feeling here; I can argue it either way.

6. Sorry, all, but I reflexively hit the reformat keystrokes, so the raw patch may be hard to read. But I'm pretty well in the camp that you *have* to reformat as you go or the code will be held hostage to the last person who *didn't* format properly. I'm pretty sure I'm using the right codestyle.xml file, but let me know if not.

7. I doubt that this has any bearing on, say, SolrJ indexing. Should that be another bug (or is there one already)? Anybody got a clue where I'd look for that, since I'm in the area anyway?

Erick
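The accrue-and-rethrow approach described in the comment can be sketched roughly as follows. This is a simplified stand-in, not the patch itself: the class name, the plain String documents, and the RuntimeException are all placeholders for the real Solr types, but the shape (record ordinal position plus id for each failure, keep going, throw one combined pipe-delimited error at the end) matches what the comment describes.

```java
// Hypothetical sketch of the accrue-errors-then-rethrow pattern described above.
import java.util.List;
import java.util.function.Consumer;

public class TolerantBatch {

    // Feeds each document to the indexer; failures are recorded with their
    // ordinal position and id, and a single combined error is thrown at the
    // end so the successful documents are still added.
    public static void addAll(List<String> docs, Consumer<String> indexer) {
        StringBuilder errors = new StringBuilder();
        int ordinal = 0;
        for (String doc : docs) {
            ordinal++;
            try {
                indexer.accept(doc);
            } catch (RuntimeException e) {
                errors.append("doc #").append(ordinal)
                      .append(" (id=").append(doc).append("): ")
                      .append(e.getMessage()).append(" | ");
            }
        }
        if (errors.length() > 0) {
            throw new RuntimeException("Some documents failed: " + errors);
        }
    }
}
```

The pipe delimiter here mirrors the formatting concern in issue 1 above; swapping it for a newline per failed document would likely read better in both DOS and Unix terminals.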
[jira] Updated: (SOLR-445) XmlUpdateRequestHandler bad documents mid batch aborts rest of batch
[ https://issues.apache.org/jira/browse/SOLR-445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shalin Shekhar Mangar updated SOLR-445:
---------------------------------------
    Fix Version/s: 1.4