I'm running Cassandra 0.7 and I'm trying to get the Pig integration to work
correctly. I'm using Pig 0.8 running against Hadoop 0.20.2; I've also tried
running against CDH2.
I can log into the grunt shell and execute scripts, but when they run they
don't read all of the data from Cassandra. The job results in only one mapper
being created, and that mapper reads only a small fraction of the data on one
node. I don't see any obvious error messages anywhere, so I'm not sure how to
pinpoint the problem.
To confirm that I had the cluster set up correctly, I wrote a simple MapReduce
job in Java that appears to use ColumnFamilyInputFormat correctly and to
distribute the work across all the nodes in the cluster. It did end with a
small number of killed tasks, and I'm not sure whether that is a symptom of
something; it looked like the map phase would have finished much faster if
those tasks weren't waiting to be killed. The output was correct, though: I
compared it to a job that operated on the source data I used to populate the
cluster, and the output was identical. In case it's interesting, this data is
134 million records; the Cassandra MapReduce job ran in 14 minutes, while the
same calculation running on the raw data in HDFS took three minutes.
I suspected at first that I was not connecting the grunt shell to the cluster
correctly, but when grunt starts it reports the correct URLs for both HDFS and
the job tracker.
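For what it's worth, here are the Cassandra connection settings I believe pig_cassandra picks up from the environment. The variable names are the ones documented in the contrib/pig README of 0.7-era trees, so treat them as an assumption if your build reads its connection info differently; the partitioner value is also an assumption about the cluster:

```shell
# Assumption: CassandraStorage in the 0.7 contrib tree reads these
# environment variables (names from contrib/pig's README; they may
# differ in other builds).
export PIG_INITIAL_ADDRESS=rdcl000   # any live Cassandra node
export PIG_RPC_PORT=9160             # Thrift listen port (default)
export PIG_PARTITIONER=org.apache.cassandra.dht.RandomPartitioner  # assumed
```
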
When the job appears in the job tracker web UI, it executes only one map.
What's really interesting is that Pig reports it read 65k input records. When
I multiply 65k by the number of maps spawned by the Java MapReduce job that
actually works, I get 134 million, which is the number of records I'm reading.
So it looks like the input split size is being calculated correctly, but only
one of the maps gets executed. That has me stumped.
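To spell out that arithmetic (assuming the "65k" figure is the 65,536-row default input split size used by ColumnFamilyInputFormat, which is an assumption on my part):

```shell
# Sanity check: total records / rows-per-split ≈ expected number of maps.
# 65536 is an assumed default split size; the "65k" Pig reported is
# consistent with reading exactly one such split.
records=134000000
rows_per_split=65536
echo $(( records / rows_per_split ))   # prints 2044, i.e. ~2045 maps expected
```

So roughly two thousand maps should be running, not one.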
Here is the grunt session with line numbers prepended:
1 cassandra@rdcl000:~/benchmark/cassandra-0.7.0/contrib/pig$ bin/pig_cassandra
2 2011-02-01 12:47:02,353 [main] INFO org.apache.pig.Main - Logging error messages to: /home/cassandra/benchmark/cassandra-0.7.0/contrib/pig/pig_1296582422349.log
3 2011-02-01 12:47:02,538 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://rdcl000:9000
4 2011-02-01 12:47:02,644 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: rdcl000:9001
5 grunt> register /home/hadoop/local/pig/pig-0.8.0-core.jar; register /home/cassandra/benchmark/cassandra-0.7.0/lib/libthrift-0.5.jar;
6 grunt> rows = LOAD 'cassandra://rdclks/mycftest' USING CassandraStorage();
7 grunt> countthis = GROUP rows ALL;
8 grunt> countedrows = FOREACH countthis GENERATE COUNT(rows.$0);
9 grunt> dump countedrows;
10 2011-02-01 12:47:31,219 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: GROUP_BY
11 2011-02-01 12:47:31,219 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - pig.usenewlogicalplan is set to true. New logical plan will be used.
12 2011-02-01 12:47:31,397 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - (Name: countedrows: Store(hdfs://rdcl000:9000/tmp/temp-1188844399/tmp1986503871:org.apache.pig.impl.io.InterStorage) - scope-10 Operator Key: scope-10)
13 2011-02-01 12:47:31,408 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
14 2011-02-01 12:47:31,419 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.CombinerOptimizer - Choosing to move algebraic foreach to combiner
15 2011-02-01 12:47:31,447 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
16 2011-02-01 12:47:31,447 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
17 2011-02-01 12:47:31,478 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
18 2011-02-01 12:47:31,491 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
19 2011-02-01 12:47:35,418 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
20 2011-02-01 12:47:35,478 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
21 2011-02-01 12:47:35,980 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
22 2011-02-01 12:47:35,995 [T