Today we again had some leftover attempt dirs -- of the 5 reduce tasks that ran a redundant speculative attempt, 2 left their extra attempt directory behind:
part-5:
conf data solr_attempt_201101270134_42328_r_05_1.1
part-6:
conf data solr_attempt_201101270134_42328_r_06_1.1
Is it possible that a nearly full disk could cause transient filesystem glitches that don't surface as exceptions?
On Sat, Oct 29, 2011 at 2:23 PM, Meng Mao meng...@gmail.com wrote:
We've been getting up to speed on SOLR, and one of the recent problems
we've run into is with successful jobs delivering corrupt index shards.
This is a look at a 12-sharded index we built and then copied to local
disk off of HDFS.
$ ls -l part-1
total 16
drwxr-xr-x 2 vmc visible 4096 Oct 29 09:39 conf
drwxr-xr-x 4 vmc visible 4096 Oct 29 09:42 data
$ ls -l part-4/
total 16
drwxr-xr-x 4 vmc visible 4096 Oct 29 09:54 data
drwxr-xr-x 2 vmc visible 4096 Oct 29 09:54 solr_attempt_201101270134_42143_r_04_1.1
Right away you can see that some cleanup either didn't happen or didn't complete: shard 04 is missing its conf directory and has an empty attempt directory lying around.
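For anyone hitting the same thing, a quick way to flag bad shards after the copy-out is to scan for exactly these two symptoms. This is just a sketch against the part-N layout shown in the listings below; the directory names are taken from this job's output and may differ for yours.

```shell
# Flag any part-* shard that is missing conf/ or still contains a
# leftover solr_attempt_* directory (the two symptoms seen above).
check_shards() {
    dir=$1
    for shard in "$dir"/part-*; do
        [ -d "$shard" ] || continue
        # a healthy shard should have a conf/ directory
        [ -d "$shard/conf" ] || echo "${shard##*/}: missing conf/"
        # and no leftover task-attempt directories
        for att in "$shard"/solr_attempt_*; do
            [ -d "$att" ] && echo "${shard##*/}: leftover attempt dir ${att##*/}"
        done
    done
}
```

Run it as `check_shards /path/to/copied/index` before pushing shards to serving.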
This is what a complete shard listing looks like:
$ ls -l part-1/*/*
-rw-r--r-- 1 vmc visible 33402 Oct 29 09:39 part-1/conf/schema.xml
part-1/data/index:
total 13088776
-rw-r--r-- 1 vmc visible 6036701453 Oct 29 09:40 _1m3.fdt
-rw-r--r-- 1 vmc visible 246345692 Oct 29 09:40 _1m3.fdx   # missing from shard 04
-rw-r--r-- 1 vmc visible 211 Oct 29 09:40 _1m3.fnm
-rw-r--r-- 1 vmc visible 3516034769 Oct 29 09:41 _1m3.frq
-rw-r--r-- 1 vmc visible 92379637 Oct 29 09:41 _1m3.nrm
-rw-r--r-- 1 vmc visible 695935796 Oct 29 09:41 _1m3.prx
-rw-r--r-- 1 vmc visible 28548963 Oct 29 09:41 _1m3.tii
-rw-r--r-- 1 vmc visible 2773769958 Oct 29 09:42 _1m3.tis
-rw-r--r-- 1 vmc visible 284 Oct 29 09:42 segments_2
-rw-r--r-- 1 vmc visible 20 Oct 29 09:42 segments.gen
part-1/data/spellchecker:
total 16
-rw-r--r-- 1 vmc visible 32 Oct 29 09:42 segments_1
-rw-r--r-- 1 vmc visible 20 Oct 29 09:42 segments.gen
And shard 04:
$ ls -l part-4/*/*
#missing conf/
part-4/data/index:
total 12818420
-rw-r--r-- 1 vmc visible 6036000614 Oct 29 09:52 _1m1.fdt
-rw-r--r-- 1 vmc visible 211 Oct 29 09:52 _1m1.fnm
-rw-r--r-- 1 vmc visible 3515333900 Oct 29 09:53 _1m1.frq
-rw-r--r-- 1 vmc visible 92361544 Oct 29 09:53 _1m1.nrm
-rw-r--r-- 1 vmc visible 696258210 Oct 29 09:54 _1m1.prx
-rw-r--r-- 1 vmc visible 28552866 Oct 29 09:54 _1m1.tii
-rw-r--r-- 1 vmc visible 2744647680 Oct 29 09:54 _1m1.tis
-rw-r--r-- 1 vmc visible 283 Oct 29 09:54 segments_2
-rw-r--r-- 1 vmc visible 20 Oct 29 09:54 segments.gen
part-4/data/spellchecker:
total 16
-rw-r--r-- 1 vmc visible 32 Oct 29 09:54 segments_1
-rw-r--r-- 1 vmc visible 20 Oct 29 09:54 segments.gen
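Note the concrete corruption: in a Lucene index of this vintage, every segment's stored-fields data file (_N.fdt) travels with a matching stored-fields index file (_N.fdx), and shard 04 has _1m1.fdt but no _1m1.fdx. A hedged check for that specific symptom (assuming the data/index layout shown above):

```shell
# Report any _N.fdt in a Lucene index directory that lacks its
# matching _N.fdx (stored-fields data without its index).
check_fdx_pairs() {
    idx=$1
    for fdt in "$idx"/*.fdt; do
        [ -e "$fdt" ] || continue   # glob matched nothing
        fdx=${fdt%.fdt}.fdx
        [ -e "$fdx" ] || echo "missing ${fdx##*/}"
    done
}
```

E.g. `check_fdx_pairs part-4/data/index` would have caught shard 04 here.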
What might cause that attempt path to still be lying around at completion? Has anyone seen anything like this? My gut says that if we disabled speculative execution, this would probably go away. But that might be an overreaction.
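On the disable-speculation hunch: for Hadoop of this era (0.20/1.x), speculation can be switched off per job with the two properties below. The jar name, driver class, and paths are placeholders, and the -D generic options are only honored if the job driver goes through ToolRunner.

```shell
# Disable map- and reduce-side speculative execution for one job
# (Hadoop 0.20/1.x property names). Jar, class, and paths below are
# hypothetical stand-ins for the actual indexing job.
hadoop jar solr-index-job.jar com.example.IndexDriver \
  -Dmapred.map.tasks.speculative.execution=false \
  -Dmapred.reduce.tasks.speculative.execution=false \
  /PROD/input /PROD/output/solr
```

If only the reduce outputs are affected, leaving map-side speculation on would be a smaller change to test with.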
In this job, 5 of the 12 reduce tasks had an extra speculative attempt. Of those 5, in 2 cases the speculative attempt won out over the first attempt, and one of those 2 showed this output inconsistency.
Here is an excerpt from the task log for shard 04:
2011-10-29 07:08:33,152 INFO org.apache.solr.hadoop.SolrRecordWriter:
SolrHome:
/hadoop/hadoop-metadata/cache/mapred/local/taskTracker/archive/prod1-ha-vip/tmp/f615e27f-38cd-485b-8cf1-c45e77c2a2b8.solr.zip
2011-10-29 07:08:33,889 INFO org.apache.solr.hadoop.SolrRecordWriter:
Constructed instance information solr.home
/hadoop/hadoop-metadata/cache/mapred/local/taskTracker/archive/prod1-ha-vip/tmp/f615e27f-38cd-485b-8cf1-c45e77c2a2b8.solr.zip
(/hadoop/hadoop-metadata/cache/mapred/local/taskTracker/archive/prod1-ha-vip/tmp/f615e27f-38cd-485b-8cf1-c45e77c2a2b8.solr.zip),
instance dir
/hadoop/hadoop-metadata/cache/mapred/local/taskTracker/archive/prod1-ha-vip/tmp/f615e27f-38cd-485b-8cf1-c45e77c2a2b8.solr.zip/,
conf dir
/hadoop/hadoop-metadata/cache/mapred/local/taskTracker/archive/prod1-ha-vip/tmp/f615e27f-38cd-485b-8cf1-c45e77c2a2b8.solr.zip/conf/,
writing index to temporary directory
/hadoop/hadoop-metadata/cache/mapred/local/taskTracker/jobcache/job_201101270134_42143/work/solr_attempt_201101270134_42143_r_04_1.1/data,
with permdir /PROD/output/solr/solr-20111029063514-12/part-4
2011-10-29 07:08:35,868 INFO org.apache.solr.schema.IndexSchema: Reading
Solr Schema
... much later ...
2011-10-29 07:08:37,263 WARN org.apache.solr.core.SolrCore: [core1] Solr
index directory
'/hadoop/hadoop-metadata/cache/mapred/local/taskTracker/jobcache/job_201101270134_42143/work/solr_attempt_201101270134_42143_r_04_1.1/data/index'
doesn't exist. Creating new index...
The log doesn't look much