If that works, it's a neat/fancy trick.  But after looking into the docs
for that setting (it's related to vnodes, which I'm not using), I'm not
comfortable messing with it on my production cluster.

I guess this implies that the # of mappers is related to the number of
physical/virtual nodes of the cassandra cluster, which makes sense...

I think I'm just going to hack up a script that I can run locally on a
cassandra box, and only run queries/writes/deletes against the local token
range (and get parallelism from cassandra's distributed nature).
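
Roughly what I have in mind is below.  It's just a sketch: the keyspace/CF
name (my_ks.my_cf), the 'key' partition key, the hard-coded Murmur3 token
bounds, and the use of the DataStax python driver are all assumptions on my
part, and the real bounds would come from nodetool ring for the local node.

# Sketch: walk this node's token range in small chunks and delete rows that
# fail the criterion.  Assumes Murmur3Partitioner and that the CF is visible
# to CQL3 as my_ks.my_cf with partition key 'key'.
from cassandra.cluster import Cluster

LOCAL_RANGE_START = -2**63        # placeholder: this node's range start
LOCAL_RANGE_END = 2**63 - 1       # placeholder: this node's range end
CHUNKS = 1000                     # small chunks, so a failure only redoes one chunk

def should_delete(row):
    # stand-in for the real "peek at the row" criterion
    return False

cluster = Cluster(['127.0.0.1'])  # talk only to the local cassandra box
session = cluster.connect()

step = (LOCAL_RANGE_END - LOCAL_RANGE_START) // CHUNKS + 1
for lo in range(LOCAL_RANGE_START, LOCAL_RANGE_END + 1, step):
    hi = min(lo + step - 1, LOCAL_RANGE_END)
    rows = session.execute(
        "SELECT * FROM my_ks.my_cf "
        "WHERE token(key) >= %d AND token(key) <= %d" % (lo, hi))
    for row in rows:
        if should_delete(row):
            session.execute("DELETE FROM my_ks.my_cf WHERE key = %s",
                            (row.key,))
cluster.shutdown()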

will


On Sun, Apr 6, 2014 at 7:30 AM, Suraj Nayak <snay...@gmail.com> wrote:

> Hi Will,
>
> Try changing the value of num_tokens in conf/cassandra.yaml.
>
> Set it to the number of map tasks you want, minus 1.
>
> Example, if you want 5 map tasks to run, set the value of num_tokens to 4
> (default is 256)
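>
> For example (assuming the stock conf/cassandra.yaml layout) the change is
> just:
>
>     num_tokens: 4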
>
> I ran into almost the same situation when I was trying to load or write
> very small amounts of data from/into Cassandra. It was launching 257 map
> tasks.  When the num_tokens value was reduced to 1, Pig launched only 2 map
> tasks. Do restart the Cassandra service after the change.
>
> Hope it might help..
>
> --
> Suraj
> On 04-Apr-2014 11:44 PM, "William Oberman" <ober...@civicscience.com>
> wrote:
>
>> Apologies for cross-posting!
>>
>> My core issue is unblocked, but I'm still curious about one aspect of my
>> question to the cassandra mailing list.  How does Pig/Hadoop decide how
>> many tasks there are?  The forwarded email below has the gory details, but
>> basically:
>> -My Pig loadFunc was CassandraStorage
>> -The "table" (column family in cassandra) has something like a billion
>> rows
>> in it, and I want to say ~3TB of data.
>> -No matter what I tried(*), Pig/Hadoop decided this was worthy of 20 tasks
>>
>> (*) I changed settings in the loadFunc, I booted hadoop clusters with more
>> or fewer task slots, etc...
>>
>> I'm using AWS's EMR, which claims to be hadoop 1.0.3 + pig 11.
>>
>> will
>>
>> ---------- Forwarded message ----------
>> From: William Oberman <ober...@civicscience.com>
>> Date: Fri, Apr 4, 2014 at 12:24 PM
>> Subject: using hadoop + cassandra for CF mutations (delete)
>> To: "u...@cassandra.apache.org" <u...@cassandra.apache.org>
>>
>>
>> Hi,
>>
>> I have some history with cassandra + hadoop:
>> 1.) Single DC + integrated hadoop = Was "ok" until I needed steady
>> performance (the single DC was used in a production environment)
>> 2.) Two DC's + integrated hadoop on 1 of the 2 DCs = Was "ok" until my data
>> grew; in AWS, compute is expensive compared to data storage, e.g.
>> running a DC 24x7 was a lot more expensive than the following solution...
>> 3.) Single DC + a constant "ETL" to S3 = Is still ok: I can spawn an
>> "arbitrarily large" EMR cluster.  And 24x7 data storage + transient EMR is
>> cost effective.
>>
>> But, one of my CF's has had a change of usage pattern that makes a large %
>> of the data (but not all of it) fairly pointless to store.  I thought I'd
>> write a Pig UDF that could peek at a row of data and delete it if it fails
>> my criteria.  And it "works" in terms of logic, but not in terms of
>> practical execution.  The CF in question has O(billion) keys, and
>> afterwards it will have ~10% of that at most.
>>
>> I basically keep losing the jobs due to too many task failures, all rooted
>> in:
>> Caused by: TimedOutException()
>> at
>>
>> org.apache.cassandra.thrift.Cassandra$get_range_slices_result.read(Cassandra.java:13020)
>>
>> And yes, I've messed around with:
>> -Number of failures for map/reduce/tracker (in the hadoop confs)
>> -split_size (on the URL)
>> -cassandra.range.batch.size
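>>
>> (Concretely -- the keyspace/CF names and the numbers here are made up -- the
>> Pig side of those attempts looked roughly like:
>>
>>   -- per-job override of the cassandra batch size
>>   SET cassandra.range.batch.size 1024;
>>   rows = LOAD 'cassandra://MyKeyspace/MyColumnFamily?split_size=262144'
>>          USING org.apache.cassandra.hadoop.pig.CassandraStorage();
>>
>> with the failure limits (mapred.map.max.attempts and friends) bumped in the
>> hadoop confs.)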
>>
>> But it hasn't helped.  My failsafe is to roll my own distributed process,
>> rather than falling into a pit of internal hadoop settings.  But I feel
>> like I'm close.
>>
>> The problem in my opinion, watching how things are going, is the
>> correlation of splits <-> tasks.  I'm obviously using Pig, so this part of
>> the process is fairly opaque to me at the moment.  But, "something
>> somewhere" is picking 20 tasks for my job, and this is fairly independent
>> of the # of task slots (I've booted EMR clusters with different #'s and
>> always get 20).  Why does this matter?  When a task fails, it retries from
>> the start, which is a killer for me because I "delete as I go": the retry
>> redoes work that is now pointless and massively increases the odds of an
>> overall job failure.  If hadoop/pig chose a larger number of tasks, the
>> retries would be much
>> less of a burden.  But, I don't see where/what lets me mess with that
>> logic.
>>
>> Pig gives the ability to mess with reducers (PARALLEL), but I'm in the
>> load
>> path, which is all mappers.  I've never jumped to the lower, raw hadoop
>> level before.  But, I'm worried that will be the "falling into a pit"
>> issue...
>>
>> I'm using Cassandra 1.2.15.
>>
>> will
>>
>
