Hi Paulo,

I’m using C* 1.2.15 and have no easy option to upgrade (at least not to the 2.0.* branch). I’ve started looking into whether I can implement my own variant of the InputFormat. Thanks a lot for the hint; I will definitely check how it’s done in 2.0.6 and whether it can be backported to the 1.2.* branch.
WBR,
Anton

From: Paulo Ricardo Motta Gomes [mailto:paulo.mo...@chaordicsystems.com]
Sent: Thursday, May 15, 2014 3:21 AM
To: user@cassandra.apache.org
Subject: Re: Cassandra token range support for Hadoop (ColumnFamilyInputFormat)

Hello Anton,

What version of Cassandra are you using? Between 1.2.6 and 2.0.6, setInputRange(startToken, endToken) is not working. This was fixed in 2.0.7: https://issues.apache.org/jira/browse/CASSANDRA-6436

If you can't upgrade, you can copy AbstractCFIF and CFIF into your project and apply the patch there.

Cheers,
Paulo

On Wed, May 14, 2014 at 10:29 PM, Anton Brazhnyk <anton.brazh...@genesys.com> wrote:

Greetings,

I'm reading data from C* with Spark (via ColumnFamilyInputFormat) and I'd like to read just part of it - something like Spark's sample() function. Cassandra's API seems to allow this with its ConfigHelper.setInputRange(jobConfiguration, startToken, endToken) method, but it doesn't work: the limit is just ignored and the entire column family is scanned. It seems this kind of feature is simply not supported, and the source of AbstractColumnFamilyInputFormat.getSplits confirms that (IMO).

Questions:
1. Am I right that there is no way to read data limited by a token range with ColumnFamilyInputFormat?
2. Is there another way to limit the amount of data read from Cassandra with Spark and ColumnFamilyInputFormat, so that the amount read is predictable (like 5% of the entire dataset)?

WBR,
Anton

--
Paulo Motta
Chaordic | Platform
www.chaordic.com.br
+55 48 3232.3200
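Regarding question 2 (reading a predictable fraction such as 5%): once a working setInputRange is available (2.0.7+, or the backported patch), one approach is to compute a token sub-range covering the desired fraction of the ring yourself and pass its endpoints as the start/end tokens. A minimal sketch of that arithmetic, assuming the default Murmur3Partitioner and its token range of [-2^63, 2^63 - 1]; the helper name `token_subrange` is made up for illustration:

```python
# Sketch: pick a contiguous token sub-range covering `fraction` of the
# Murmur3Partitioner ring. The returned tokens could then be passed (as
# strings) to ConfigHelper.setInputRange(conf, startToken, endToken).
# Assumes roughly uniform token distribution, which Murmur3 provides.

MIN_TOKEN = -2**63        # Murmur3Partitioner minimum token
MAX_TOKEN = 2**63 - 1     # Murmur3Partitioner maximum token

def token_subrange(fraction, offset=0.0):
    """Return (start_token, end_token) covering `fraction` of the ring,
    starting `offset` of the way around it (both in [0, 1])."""
    ring = MAX_TOKEN - MIN_TOKEN
    start = MIN_TOKEN + int(offset * ring)
    end = start + int(fraction * ring)
    return start, min(end, MAX_TOKEN)

# Roughly 5% of the ring, starting at the minimum token:
start, end = token_subrange(0.05)
print(start, end)
```

Because tokens are (approximately) uniformly distributed over the ring, a sub-range covering 5% of the token space should read roughly 5% of the data, which is the predictability asked about above.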