Cassandra + Hadoop - 2 task attempts with millions of rows

2013-04-22 Thread Shamim
Hello all,   recently we have upgraded our cluster (6 nodes) from Cassandra version 1.1.6 to 1.2.1. Our cluster is evenly partitioned (Murmur3Partitioner). We are using Pig to parse and compute aggregate data. When we submit a job through Pig, what I consistently see is that, while most of the ta
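
(For context: when Pig reads from Cassandra it drives ColumnFamilyInputFormat under the hood. A minimal sketch of the equivalent job-side setup is below; the keyspace, column family and host names are placeholders, not details from this thread.)

    import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
    import org.apache.cassandra.hadoop.ConfigHelper;
    import org.apache.cassandra.thrift.SlicePredicate;
    import org.apache.cassandra.thrift.SliceRange;
    import org.apache.cassandra.utils.ByteBufferUtil;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class CassandraInputSetup {
        // Configure a Hadoop job to scan a Cassandra column family, roughly what
        // Pig's CassandraStorage does behind the scenes. Keyspace, column family
        // and host names are placeholders.
        public static void configure(Job job) {
            Configuration conf = job.getConfiguration();

            job.setInputFormatClass(ColumnFamilyInputFormat.class);
            ConfigHelper.setInputInitialAddress(conf, "p00nosql01");  // any live node
            ConfigHelper.setInputRpcPort(conf, "9160");               // Thrift port
            ConfigHelper.setInputColumnFamily(conf, "MyKeyspace", "Events");

            // Must match the partitioner the cluster actually runs with.
            ConfigHelper.setInputPartitioner(conf,
                    "org.apache.cassandra.dht.Murmur3Partitioner");

            // Read up to 100 columns of each row per slice.
            SlicePredicate predicate = new SlicePredicate().setSlice_range(
                    new SliceRange(ByteBufferUtil.EMPTY_BYTE_BUFFER,
                                   ByteBufferUtil.EMPTY_BYTE_BUFFER, false, 100));
            ConfigHelper.setInputSlicePredicate(conf, predicate);
        }
    }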

Re: Cassandra + Hadoop - 2 task attempts with millions of rows

2013-04-22 Thread Shamim
We are using Hadoop 1.0.3 and Pig version 0.11.1. -- Best regards,   Shamim A. 22.04.2013, 21:48, "Shamim": > Hello all, >   recently we have upgraded our cluster (6 nodes) from Cassandra version 1.1.6 > to 1.2.1. Our cluster is evenly partitioned (Murmur3Partitioner). We are > using Pig for par

Re: Cassandra + Hadoop - 2 task attempts with millions of rows

2013-04-23 Thread aaron morton
>> Our cluster is evenly partitioned (Murmur3Partitioner) Murmur3Partitioner is only available in 1.2, and changing partitioners is not supported. Did you change from RandomPartitioner under 1.1? Are you using virtual nodes in your 1.2 cluster? >> We have roughly 97 million rows in our cluster.

Re: Cassandra + Hadoop - 2 task attempts with millions of rows

2013-04-25 Thread Shamim
Hello Aaron, I have got the following log from the server (sorry for being late): job_201304231203_0004 attempt_201304231203_0004_m_000501_0 2013-04-23 16:09:14,196 INFO org.apache.hadoop.util.NativeCodeLoader: Loaded the native-hadoop library 2013-04-23 16:09:14,438 INF

Re: Cassandra + Hadoop - 2 task attempts with millions of rows

2013-04-25 Thread aaron morton
> 2013-04-23 16:09:17,838 INFO > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader: > Current split being processed ColumnFamilySplit((9197470410121435301, '-1] > @[p00nosql02.00, p00nosql01.00]) > Why is it splitting data across two nodes? We have a 6-node Cassandra cluster +
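
(The host list on a ColumnFamilySplit is the set of replica endpoints that own that token range, so a split naturally names more than one node when the replication factor is above one. The number of rows covered by each split is also configurable; a small sketch with an illustrative value, not a recommendation from this thread:)

    import org.apache.cassandra.hadoop.ConfigHelper;
    import org.apache.hadoop.conf.Configuration;

    public class SplitTuning {
        // cassandra.input.split.size controls how many rows each ColumnFamilySplit
        // covers, and therefore how many map tasks the job fans out into.
        public static void tune(Configuration conf) {
            ConfigHelper.setInputSplitSize(conf, 65536);  // rows per split; illustrative value
        }
    }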

Re: Cassandra + Hadoop - 2 task attempts with millions of rows

2013-05-27 Thread Arya Goudarzi
We haven't tried using Pig. However, we had a problem where our MapReduce job blew up for a subset of the data. It turned out we had a bug in our code that had generated a row as big as 3 GB. It caused long GC pauses and GC thrashing, and the Hadoop job would, of course, time out.
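
(When oversized rows push tasks into long GC pauses, the usual job-side knobs are the Hadoop task timeout and the Cassandra range batch size. A sketch with illustrative values, not a description of what was actually done here:)

    import org.apache.cassandra.hadoop.ConfigHelper;
    import org.apache.hadoop.conf.Configuration;

    public class TimeoutTuning {
        public static void tune(Configuration conf) {
            // Give map attempts more headroom before the TaskTracker kills them
            // (Hadoop 1.x property; the default is 600000 ms).
            conf.setLong("mapred.task.timeout", 1800000L);

            // Fetch fewer rows per Thrift request so each response stays small.
            ConfigHelper.setRangeBatchSize(conf, 1024);
        }
    }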