Hi Paulo,

I’m using C* 1.2.15 and have no easy option to upgrade (at least not to the 2.0.* 
branch).
I’ve started looking into implementing my own variant of InputFormat.
Thanks a lot for the hint — I will definitely check how it’s done in 2.0.6 and 
whether it can be backported to the 1.2.* branch.


WBR,
Anton

From: Paulo Ricardo Motta Gomes <paulo.mo...@chaordicsystems.com>
Sent: Thursday, May 15, 2014 3:21 AM
To: user@cassandra.apache.org
Subject: Re: Cassandra token range support for Hadoop (ColumnFamilyInputFormat)

Hello Anton,

What version of Cassandra are you using? In versions between 1.2.6 and 2.0.6, 
setInputRange(startToken, endToken) does not work.

This was fixed in 2.0.7: https://issues.apache.org/jira/browse/CASSANDRA-6436

If you can't upgrade, you can copy AbstractCFIF and CFIF into your project and 
apply the patch there.

Cheers,

Paulo

On Wed, May 14, 2014 at 10:29 PM, Anton Brazhnyk 
<anton.brazh...@genesys.com> wrote:
Greetings,

I'm reading data from C* with Spark (via ColumnFamilyInputFormat) and I'd like 
to read just part of it — something like Spark's sample() function.
Cassandra's API seems to allow this with its 
ConfigHelper.setInputRange(jobConfiguration, startToken, endToken) method, but 
it doesn't work: the range is simply ignored and the entire column family is 
scanned. It seems this kind of feature is just not supported, and the source of 
AbstractColumnFamilyInputFormat.getSplits confirms that (IMO).
Questions:
1. Am I right that there is no way to read data limited by a token range with 
ColumnFamilyInputFormat?
2. Is there another way to limit the amount of data read from Cassandra with 
Spark and ColumnFamilyInputFormat, so that the amount is predictable (e.g. 5% 
of the entire dataset)?
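[Editor's note on question 2: with RandomPartitioner (the default in C* 1.2), 
tokens are spread roughly uniformly over [0, 2^127), so once a working 
setInputRange is available (e.g. after backporting the CASSANDRA-6436 fix), a 
roughly predictable sample can be taken by choosing a token sub-range whose 
width is proportional to the desired fraction. A minimal sketch of that 
arithmetic — the class and method names here are illustrative, not part of any 
Cassandra API:]

```java
import java.math.BigInteger;

// Sketch: compute a token sub-range covering a given percentage of the
// RandomPartitioner token space [0, 2^127). The resulting start/end tokens
// could then be passed (as strings) to ConfigHelper.setInputRange.
public class TokenSampleSketch {
    // Size of the RandomPartitioner token space: 2^127.
    static final BigInteger RING = BigInteger.ONE.shiftLeft(127);

    // Returns {startToken, endToken} covering `percent` of the ring,
    // starting at token 0 for simplicity.
    static BigInteger[] rangeForPercent(int percent) {
        BigInteger end = RING.multiply(BigInteger.valueOf(percent))
                             .divide(BigInteger.valueOf(100));
        return new BigInteger[] { BigInteger.ZERO, end };
    }

    public static void main(String[] args) {
        BigInteger[] r = rangeForPercent(5); // ~5% of the ring
        System.out.println(r[0] + " .. " + r[1]);
    }
}
```

[Because row keys hash uniformly onto the ring, scanning 5% of the token space 
should read roughly 5% of the rows, though the exact row count will vary with 
the data.]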


WBR,
Anton




--
Paulo Motta

Chaordic | Platform
www.chaordic.com.br
+55 48 3232.3200
