Shawn, thank you again for your help.

The problem appears to have been with httpclient.
I turned on debug logging for all libraries and saw the message "Garbage in
response" coming from httpclient just before the failure.
This is a log snippet:



31 Oct 2013 18:10:40,360 [explicit-fetchindex-cmd] DEBUG
DefaultClientConnection - Sending request: GET
/solr-master/replication?command=filecontent&generation=6814&qt=%2Freplication&file=_aa7_Lucene41_0.pos&checksum=true&wt=filestream
HTTP/1.1
31 Oct 2013 18:10:40,361 [explicit-fetchindex-cmd] DEBUG wire - >> "GET
/solr-master/replication?command=filecontent&generation=6814&qt=%2Freplication&file=_aa7_Lucene41_0.pos&checksum=true&wt=filestream
HTTP/1.1[\r][\n]"
31 Oct 2013 18:10:40,361 [explicit-fetchindex-cmd] DEBUG wire - >>
"User-Agent: Solr[org.apache.solr.client.solrj.impl.HttpSolrServer]
1.0[\r][\n]"
31 Oct 2013 18:10:40,361 [explicit-fetchindex-cmd] DEBUG wire - >> "Host:
solr-master.saltdev.sealdoc.com:8081[\r][\n]"
31 Oct 2013 18:10:40,361 [explicit-fetchindex-cmd] DEBUG wire - >>
"Connection: Keep-Alive[\r][\n]"
31 Oct 2013 18:10:40,361 [explicit-fetchindex-cmd] DEBUG wire - >>
"[\r][\n]"
31 Oct 2013 18:10:40,361 [explicit-fetchindex-cmd] DEBUG headers - >> GET
/solr-master/replication?command=filecontent&generation=6814&qt=%2Freplication&file=_aa7_Lucene41_0.pos&checksum=true&wt=filestream
HTTP/1.1
31 Oct 2013 18:10:40,361 [explicit-fetchindex-cmd] DEBUG headers - >>
User-Agent: Solr[org.apache.solr.client.solrj.impl.HttpSolrServer] 1.0
31 Oct 2013 18:10:40,361 [explicit-fetchindex-cmd] DEBUG headers - >> Host:
solr-master.saltdev.sealdoc.com:8081
31 Oct 2013 18:10:40,361 [explicit-fetchindex-cmd] DEBUG headers - >>
Connection: Keep-Alive
31 Oct 2013 18:10:40,361 [explicit-fetchindex-cmd] DEBUG wire - <<
"[\r][\n]"
31 Oct 2013 18:10:40,361 [explicit-fetchindex-cmd] DEBUG
DefaultHttpResponseParser - Garbage in response:
31 Oct 2013 18:10:40,361 [explicit-fetchindex-cmd] DEBUG wire - <<
"4[\r][\n]"
31 Oct 2013 18:10:40,361 [explicit-fetchindex-cmd] DEBUG
DefaultHttpResponseParser - Garbage in response: 4
31 Oct 2013 18:10:40,361 [explicit-fetchindex-cmd] DEBUG wire - <<
"[0x0][0x0][0x0][0x0][\r][\n]"
31 Oct 2013 18:10:40,361 [explicit-fetchindex-cmd] DEBUG
DefaultHttpResponseParser - Garbage in response: ^@^@^@^@
31 Oct 2013 18:10:40,361 [explicit-fetchindex-cmd] DEBUG wire - <<
"0[\r][\n]"
31 Oct 2013 18:10:40,361 [explicit-fetchindex-cmd] DEBUG
DefaultHttpResponseParser - Garbage in response: 0
31 Oct 2013 18:10:40,361 [explicit-fetchindex-cmd] DEBUG wire - <<
"[\r][\n]"
31 Oct 2013 18:10:40,361 [explicit-fetchindex-cmd] DEBUG
DefaultHttpResponseParser - Garbage in response:
31 Oct 2013 18:10:40,398 [explicit-fetchindex-cmd] DEBUG
DefaultClientConnection - Connection 0.0.0.0:55266<->172.16.77.121:8081 closed
31 Oct 2013 18:10:40,398 [explicit-fetchindex-cmd] DEBUG
DefaultClientConnection - Connection 0.0.0.0:55266<->172.16.77.121:8081 shut down
31 Oct 2013 18:10:40,398 [explicit-fetchindex-cmd] DEBUG
DefaultClientConnection - Connection 0.0.0.0:55266<->172.16.77.121:8081 closed
31 Oct 2013 18:10:40,398 [explicit-fetchindex-cmd] DEBUG
PoolingClientConnectionManager - Connection released: [id: 0][route:
{}->http://solr-master.saltdev.sealdoc.com:8081][total kept alive: 1; route
allocated: 1 of 10000; total allocated: 1 of 10000]
31 Oct 2013 18:10:40,425 [explicit-fetchindex-cmd] DEBUG
CachingDirectoryFactory - Releasing directory:
/opt/watchdox/solr-slave/data/index 2 false
31 Oct 2013 18:10:40,425 [explicit-fetchindex-cmd] DEBUG
CachingDirectoryFactory - Reusing cached directory:
CachedDir<<refCount=1;path=/opt/watchdox/solr-slave/data;done=false>>
31 Oct 2013 18:10:40,425 [explicit-fetchindex-cmd] DEBUG
CachingDirectoryFactory - Releasing directory:
/opt/watchdox/solr-slave/data 0 false
31 Oct 2013 18:10:40,425 [explicit-fetchindex-cmd] DEBUG
CachingDirectoryFactory - Reusing cached directory:
CachedDir<<refCount=1;path=/opt/watchdox/solr-slave/data;done=false>>
31 Oct 2013 18:10:40,427 [explicit-fetchindex-cmd] DEBUG
CachingDirectoryFactory - Releasing directory:
/opt/watchdox/solr-slave/data 0 false
31 Oct 2013 18:10:40,428 [explicit-fetchindex-cmd] DEBUG
CachingDirectoryFactory - Done with dir:
CachedDir<<refCount=1;path=/opt/watchdox/solr-slave/data/index.20131031180837277;done=true>>
31 Oct 2013 18:10:40,428 [explicit-fetchindex-cmd] DEBUG
CachingDirectoryFactory - Releasing directory:
/opt/watchdox/solr-slave/data/index.20131031180837277 0 true
31 Oct 2013 18:10:40,428 [explicit-fetchindex-cmd] INFO
CachingDirectoryFactory - looking to close
/opt/watchdox/solr-slave/data/index.20131031180837277
[CachedDir<<refCount=0;path=/opt/watchdox/solr-slave/data/index.20131031180837277;done=true>>]
31 Oct 2013 18:10:40,428 [explicit-fetchindex-cmd] INFO
CachingDirectoryFactory - Closing directory:
/opt/watchdox/solr-slave/data/index.20131031180837277
31 Oct 2013 18:10:40,428 [explicit-fetchindex-cmd] INFO
CachingDirectoryFactory - Removing directory before core close:
/opt/watchdox/solr-slave/data/index.20131031180837277
31 Oct 2013 18:10:40,878 [explicit-fetchindex-cmd] DEBUG
CachingDirectoryFactory - Removing from cache:
CachedDir<<refCount=0;path=/opt/watchdox/solr-slave/data/index.20131031180837277;done=true>>
31 Oct 2013 18:10:40,878 [explicit-fetchindex-cmd] DEBUG
CachingDirectoryFactory - Releasing directory:
/opt/watchdox/solr-slave/data/index 1 false
31 Oct 2013 18:10:40,879 [explicit-fetchindex-cmd] ERROR ReplicationHandler
- SnapPull failed :org.apache.solr.common.SolrException: Unable to download
_aa7_Lucene41_0.pos completely. Downloaded 0!=1081710
        at
org.apache.solr.handler.SnapPuller$DirectoryFileFetcher.cleanup(SnapPuller.java:1212)
        at
org.apache.solr.handler.SnapPuller$DirectoryFileFetcher.fetchFile(SnapPuller.java:1092)
        at
org.apache.solr.handler.SnapPuller.downloadIndexFiles(SnapPuller.java:719)
        at
org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:397)
        at
org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:317)
        at
org.apache.solr.handler.ReplicationHandler$1.run(ReplicationHandler.java:218)

31 Oct 2013 18:10:40,910 [http-bio-8080-exec-8] DEBUG
CachingDirectoryFactory - Reusing cached directory:
CachedDir<<refCount=2;path=/opt/watchdox/solr-slave/data/index;done=false>>
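
One possible reading of the inbound ("<<") wire lines above, offered as an
assumption rather than a confirmed diagnosis: the bytes look like the tail of
an HTTP/1.1 chunked body (one 4-byte chunk followed by the zero-size
terminating chunk) arriving where httpclient expected a status line. The
Python sketch below only checks that framing. parse_chunked is a hypothetical
helper, not part of httpclient or Solr, and it assumes no chunk extensions
beyond a simple ";" suffix and no trailers.

```python
# Inbound bytes copied from the "wire - <<" lines in the log snippet.
INBOUND = (
    b"\r\n"                      # would close a preceding chunk (assumption)
    b"4\r\n"                     # chunk size, hexadecimal
    b"\x00\x00\x00\x00\r\n"      # four bytes of chunk data
    b"0\r\n"                     # zero-size last chunk
    b"\r\n"                      # end of chunked body
)


def parse_chunked(data: bytes):
    """Decode chunked framing; return (list_of_chunks, bytes_consumed).

    Hypothetical helper for illustration. Raises ValueError on bad framing.
    """
    chunks = []
    pos = 0
    while True:
        eol = data.index(b"\r\n", pos)            # end of the size line
        size_field = data[pos:eol].split(b";")[0]  # drop any chunk extension
        size = int(size_field.decode("ascii"), 16)
        pos = eol + 2
        if size == 0:
            # Last chunk: with no trailers, a bare CRLF must follow.
            if data[pos:pos + 2] != b"\r\n":
                raise ValueError("missing final CRLF")
            return chunks, pos + 2
        chunk = data[pos:pos + size]
        if len(chunk) != size or data[pos + size:pos + size + 2] != b"\r\n":
            raise ValueError("truncated or unterminated chunk")
        chunks.append(chunk)
        pos += size + 2


# Skip the leading CRLF, then decode the rest as chunked framing.
chunks, consumed = parse_chunked(INBOUND[2:])
print(chunks, consumed, len(INBOUND) - 2)
```

If the framing is valid, every inbound byte after the first CRLF is consumed
and the only payload is four zero bytes. What those bytes mean, and why they
arrived at that point on the connection, is not established by this log.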




So I upgraded the httpcomponents jars to the latest 4.3.x versions and the
problem disappeared.
The httpcomponents jars that solrj depends on were at version 4.2.x; I
upgraded to httpclient-4.3.1, httpcore-4.3 and httpmime-4.3.1.
I have run the replication a few times since with no problems at all; it is
now working as expected.
It seems that the upgrade is necessary only on the slave side, but I'm going
to upgrade the master too.
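
For anyone checking their own deployment, here is a rough Python sketch that
flags httpcomponents jars older than 4.3 in a list of jar file names. It
assumes the jars follow the usual <artifact>-<version>.jar naming with purely
numeric versions; old_httpcomponents and the example file names are
hypothetical, and jars with suffixes such as "-beta1" are simply ignored.

```python
import re

# Matches e.g. "httpclient-4.2.3.jar"; numeric dotted versions only.
JAR_RE = re.compile(r"^(httpclient|httpcore|httpmime)-(\d+(?:\.\d+)*)\.jar$")


def old_httpcomponents(jar_names, minimum=(4, 3)):
    """Return (artifact, version) pairs whose version is below `minimum`."""
    old = []
    for name in sorted(jar_names):
        match = JAR_RE.match(name)
        if not match:
            continue  # not an httpcomponents jar, or an unrecognized name
        version = tuple(int(part) for part in match.group(2).split("."))
        if version < minimum:
            old.append((match.group(1), match.group(2)))
    return old


# Example file names, made up for illustration.
example = [
    "httpclient-4.2.3.jar",
    "httpcore-4.3.jar",
    "httpmime-4.3.1.jar",
    "solr-solrj-4.5.0.jar",
]
print(old_httpcomponents(example))
```

In practice the names could come from os.listdir() on the webapp's lib
directory, wherever that is in your installation.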


Thank you so much for your help.

Shalom








On Thu, Oct 31, 2013 at 6:46 PM, Shawn Heisey <s...@elyograg.org> wrote:

> On 10/31/2013 7:26 AM, Shalom Ben-Zvii Kazaz wrote:
> > Shawn, Thank you for your answer.
> > for the purpose of testing it we have a test environment where we are not
> > indexing anymore. We also disabled the DIH delta import. so as I
> > understand there shouldn't be any commits on the master.
> > I also tried with
> > <str name="commitReserveDuration">50:50:50</str>
> > and get the same failure.
>
> If it's in an environment where there are no commits, that's really
> odd.  I would suspect underlying filesystem or network issues.  There's
> one problem that's not well known, but is very common - problems with
> NIC firmware, most commonly Broadcom NICs.  These problems result in
> things working correctly almost all the time, but when there is a high
> network load, things break in strange ways, and the resulting errors
> rarely look like they are network-related.
>
> Most embedded NICs are either Broadcom or Realtek, both of which are
> famous for their firmware problems.  Broadcom NICs are very common on
> Dell and HP servers.  Upgrading the firmware (which is not usually the
> same thing as upgrading the driver) is the only fix.  NICs from other
> manufacturers also have upgradable firmware, but don't usually have the
> same kind of high-profile problems as Broadcom.
>
> The NIC firmware might not have anything to do with this problem, but
> it's the only thing left that I can think of.  I personally haven't used
> replication since Solr 1.4.1, but a lot of people do.  I can't say that
> there's no bugs, but so far I'm not seeing the kind of problem reports
> that appear when a bug in a critical piece of the software exists.
>
> Thanks,
> Shawn
>
>
