[ https://issues.apache.org/jira/browse/SOLR-445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hoss Man updated SOLR-445:
--------------------------
    Attachment: SOLR-445.patch


I started playing around with this patch a bit to see if I could help move it 
forward.  I'm a little out of my depth with a lot of the details of how 
distributed updates work, but the more I tried to make sense of it, the more 
convinced I became that there were a lot of things that just weren't very well 
accounted for in the existing tests (which were consistently failing, but the 
failures themselves weren't consistent between runs).

Here's a summary of what's new/different in the patch I'm attaching...


* DistributedUpdateProcessor.DistribPhase
** not sure why this enum was made non-static in earlier patches ... I reverted 
this unneeded change.
* TolerantUpdateProcessor (see the sketch after this list for the basic shape 
of what it's supposed to do)
** processDelete
*** the method has a couple of glaringly obvious bugs that apparently don't 
trip under the current tests
*** added several nocommits for things that jumped out at me
* DistribTolerantUpdateProcessorTest
** beefed up the assertion msgs in assertUSucceedsWithErrors
** fixed testValidAdds so it's not dead code
** testInvalidAdds
*** sanity check code wasn't passing reliably
**** details of what failed are lost depending on how the update is routed 
(random seed)
**** relaxed this check to be reliable, with a nocommit comment to see if we 
can tighten it up
*** assuming the sanity check passes, assertUSucceedsWithErrors (still) fails 
on some seeds w/ a null error list
**** I'm guessing this is what Anshum alluded to in his last comment: "Node2 as 
of now return an HTTP OK and doesn't throw an exception, the 
StreamingSolrClient used but the Distributed Updated Processor doesn't realize 
the error that was consumed by the leader of shard 1"
* TestTolerantUpdateProcessorCloud
** new MiniSolrCloudCluster based test to try and demonstrate all the possible 
distrib code paths I could think of (see below)
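For anyone skimming along, this is the basic shape of the tolerant pattern 
these classes deal with -- a paraphrased sketch, *not* the actual code in the 
patch (the errors/numAdds/maxErrors fields are stand-ins for whatever the 
processor really tracks)...

{code}
// paraphrased sketch of the tolerant pattern, NOT the patch itself:
// swallow per-document failures, record them, and only rethrow once
// more than maxErrors have accumulated
@Override
public void processAdd(AddUpdateCommand cmd) throws IOException {
  try {
    super.processAdd(cmd);
    numAdds++;                                          // assumed counter field
  } catch (Exception e) {
    errors.add(cmd.getPrintableId(), e.getMessage());   // assumed NamedList field
    if (errors.size() > maxErrors) {                    // assumed config field
      // too many failures: surface the last error to the client
      throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
                              e.getMessage(), e);
    }
  }
}
{code}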

TestTolerantUpdateProcessorCloud is the real meat of what I've added here.  
Starting with the basic behavior/assertions currently tested in 
TolerantUpdateProcessorTest, I built it up to try and exercise every possible 
distributed update code path I could imagine (updates with docs all on one 
shard some of which fail, updates with docs for diff shards where some from 
each shard fail, updates with docs for diff shards but only one shard fails, 
etc...) -- but only tested against a MiniSolrCloud collection that actually had 
1 node, 1 shard, 1 replica and an HttpSolrClient talking directly to that 
node.  Once all those assertions were passing, I changed it to use 5 nodes, 2 
shards, 2 replicas and started testing all of those scenarios against 5 
HttpSolrClients pointed at every individual node (one of which hosts no 
replicas) as well as a ZK aware CloudSolrClient.  All 6 tests against all 6 
clients currently fail (reliably) at some point in these scenarios.
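To make that setup concrete, the wiring looks roughly like this -- a sketch 
from memory against the test-framework APIs of that era, not a copy of the 
test (the "test_col" name and configName are placeholders, and the exact 
constructors may differ)...

{code}
// rough sketch of the cluster/client wiring, not the actual test code
// (imports and test-class boilerplate elided)
MiniSolrCloudCluster cluster =
    new MiniSolrCloudCluster(5, createTempDir(), JettyConfig.builder().build());
cluster.createCollection("test_col", 2, 2, configName, null);

List<SolrClient> clients = new ArrayList<>();
for (JettySolrRunner jetty : cluster.getJettySolrRunners()) {
  // one direct client per node -- with 5 nodes and a 2x2 collection,
  // one of these nodes hosts no replica of the collection at all
  clients.add(new HttpSolrClient(jetty.getBaseUrl() + "/test_col"));
}
CloudSolrClient cloudClient =
    new CloudSolrClient(cluster.getZkServer().getZkAddress());
cloudClient.setDefaultCollection("test_col");
clients.add(cloudClient);  // 6 clients total, each run through every scenario
{code}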

----

Independent of all the things I still need to make sense of in the existing 
code to try and help get these tests passing, I still have one big question 
about what the desired/expected behavior should be for clients when maxErrors 
is exceeded -- at the moment, in single node setups, the client gets a 400 
error with the top level "error" section corresponding to whatever error 
caused it to exceed maxErrors, but the responseHeader is still populated with 
the individual errors and the appropriate numAdds & numErrors, for example...

{code}
$ curl -v -X POST 'http://localhost:8983/solr/techproducts/update?indent=true&commit=true&update.chain=tolerant' \
    -H 'Content-Type: application/json' --data-binary \
    '[{"id":"hoss1","foo_i":42},{"id":"bogus1","foo_i":"bogus"},{"id":"hoss2","foo_i":66},{"id":"bogus2","foo_i":"bogus"},{"id":"bogus3","foo_i":"bogus"},{"id":"hoss3","foo_i":42}]'
* Hostname was NOT found in DNS cache
*   Trying 127.0.0.1...
* Connected to localhost (127.0.0.1) port 8983 (#0)
> POST /solr/techproducts/update?indent=true&commit=true&update.chain=tolerant HTTP/1.1
> User-Agent: curl/7.38.0
> Host: localhost:8983
> Accept: */*
> Content-Type: application/json
> Content-Length: 175
> 
* upload completely sent off: 175 out of 175 bytes
< HTTP/1.1 400 Bad Request
< Content-Type: text/plain;charset=utf-8
< Transfer-Encoding: chunked
< 
{
  "responseHeader":{
    "numErrors":3,
    "errors":{
      "bogus1":{
        "message":"ERROR: [doc=bogus1] Error adding field 'foo_i'='bogus' msg=For input string: \"bogus\""},
      "bogus2":{
        "message":"ERROR: [doc=bogus2] Error adding field 'foo_i'='bogus' msg=For input string: \"bogus\""},
      "bogus3":{
        "message":"ERROR: [doc=bogus3] Error adding field 'foo_i'='bogus' msg=For input string: \"bogus\""}},
    "numAdds":2,
    "status":400,
    "QTime":4},
  "error":{
    "msg":"ERROR: [doc=bogus3] Error adding field 'foo_i'='bogus' msg=For input string: \"bogus\"",
    "code":400}}
* Connection #0 to host localhost left intact
{code}

...but because this is a 400 error, if you use HttpSolrClient you're not going 
to get access to any of that detailed error information at all -- you'll just 
get a RemoteSolrException with the bare details, as the snippet below 
illustrates.
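(An illustrative SolrJ snippet, not from the patch -- the collection name and 
chain param mirror the curl example above.)

{code}
// illustrative only: what an HttpSolrClient caller actually sees when the
// tolerant chain gives up with a 400 (imports and client cleanup elided;
// assume enough "bogus" docs in the batch to exceed the chain's maxErrors)
HttpSolrClient client =
    new HttpSolrClient("http://localhost:8983/solr/techproducts");
SolrInputDocument good = new SolrInputDocument();
good.addField("id", "hoss1");
good.addField("foo_i", 42);
SolrInputDocument bad = new SolrInputDocument();
bad.addField("id", "bogus1");
bad.addField("foo_i", "bogus");  // won't parse as an int

UpdateRequest req = new UpdateRequest();
req.setParam("update.chain", "tolerant");
req.add(Arrays.asList(good, bad));
try {
  req.process(client);
} catch (HttpSolrClient.RemoteSolrException e) {
  // this is *all* the caller gets: the bare code and top-level msg --
  // the structured numAdds/numErrors/errors detail is unreachable
  System.err.println("code=" + e.code() + " msg=" + e.getMessage());
}
{code}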

* Should the use of this processor force *all* "error" responses to be 
rewritten as HTTP 200s?
* Should the solrj clients be updated so that RemoteSolrException still 
provides an accessor to get the parsed/structured SolrResponse (assuming the 
HTTP response body can be parsed w/o any other errors)?
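If we went the second route, I'm imagining something like this -- purely 
hypothetical, neither this accessor nor the plumbing to populate it exists in 
solrj today...

{code}
// HYPOTHETICAL sketch of option #2 -- nothing like this exists in solrj yet
public static class RemoteSolrException extends SolrException {
  private final NamedList<Object> parsedBody;  // null if body wasn't parseable

  public RemoteSolrException(int code, String msg, Throwable cause,
                             NamedList<Object> parsedBody) {
    super(ErrorCode.getErrorCode(code), msg, cause);
    this.parsedBody = parsedBody;
  }

  /** the parsed response body (responseHeader, errors, etc.) if available */
  public NamedList<Object> getParsedResponse() {
    return parsedBody;
  }
}
{code}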


> Update Handlers abort with bad documents
> ----------------------------------------
>
>                 Key: SOLR-445
>                 URL: https://issues.apache.org/jira/browse/SOLR-445
>             Project: Solr
>          Issue Type: Improvement
>          Components: update
>    Affects Versions: 1.3
>            Reporter: Will Johnson
>            Assignee: Anshum Gupta
>         Attachments: SOLR-445-3_x.patch, SOLR-445-alternative.patch, 
> SOLR-445-alternative.patch, SOLR-445-alternative.patch, 
> SOLR-445-alternative.patch, SOLR-445.patch, SOLR-445.patch, SOLR-445.patch, 
> SOLR-445.patch, SOLR-445.patch, SOLR-445.patch, SOLR-445.patch, 
> SOLR-445.patch, SOLR-445_3x.patch, solr-445.xml
>
>
> Has anyone run into the problem of handling bad documents / failures mid 
> batch.  Ie:
> <add>
>   <doc>
>     <field name="id">1</field>
>   </doc>
>   <doc>
>     <field name="id">2</field>
>     <field name="myDateField">I_AM_A_BAD_DATE</field>
>   </doc>
>   <doc>
>     <field name="id">3</field>
>   </doc>
> </add>
> Right now solr adds the first doc and then aborts.  It would seem like it 
> should either fail the entire batch or log a message/return a code and then 
> continue on to add doc 3.  Option 1 would seem to be much harder to 
> accomplish and possibly require more memory while Option 2 would require more 
> information to come back from the API.  I'm about to dig into this but I 
> thought I'd ask to see if anyone had any suggestions, thoughts or comments.   
>  


