Hi Anton,

One approach you could look at is to write a custom InputFormat that
allows you to limit the token range of rows that you fetch (if the
AbstractColumnFamilyInputFormat does not do what you want).  Doing so
is not too much work.

If you look at the class RowIterator within CqlRecordReader, you can
see code in the constructor that creates a query with a certain token
range:

            ResultSet rs = session.execute(cqlQuery,
                    type.compose(type.fromString(split.getStartToken())),
                    type.compose(type.fromString(split.getEndToken())));
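
For instance, here is a minimal sketch of how that tweak could look.
It assumes the Murmur3Partitioner (whose tokens are longs), that
"conf" is the Hadoop Configuration already available in the reader,
and two invented configuration keys for the token window you want;
it clamps each split's range to that window before binding the
tokens into the query:

            // Sketch only -- assumes Murmur3Partitioner (long tokens).
            // "my.job.start.token" / "my.job.end.token" are invented keys.
            long splitStart = Long.parseLong(split.getStartToken());
            long splitEnd   = Long.parseLong(split.getEndToken());
            long wantStart  = conf.getLong("my.job.start.token", Long.MIN_VALUE);
            long wantEnd    = conf.getLong("my.job.end.token", Long.MAX_VALUE);

            // Intersect the split's range with the requested window.
            // (If boundStart > boundEnd, the split lies entirely outside
            // the window and could simply be skipped.)
            long boundStart = Math.max(splitStart, wantStart);
            long boundEnd   = Math.min(splitEnd, wantEnd);

            ResultSet rs = session.execute(cqlQuery,
                    type.compose(type.fromString(Long.toString(boundStart))),
                    type.compose(type.fromString(Long.toString(boundEnd))));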

I think you can make your own version of the InputFormat (and its
RecordReader) and tweak this query to achieve what you want.
Alternatively, if you just want a sample of the data, you could
change the InputFormat itself so that it queries only a subset of the
total input splits (or CfSplits). That might be easier (see the
sketch below).
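
A rough sketch of that alternative, assuming you extend
CqlInputFormat (any of the concrete input formats would work the same
way) and an invented "sample.fraction" configuration key:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    import org.apache.cassandra.hadoop.cql3.CqlInputFormat;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;

    // Sketch only: keep a random subset of the splits the parent computes.
    public class SampledCqlInputFormat extends CqlInputFormat
    {
        @Override
        public List<InputSplit> getSplits(JobContext context) throws IOException
        {
            // "sample.fraction" is an invented key; 0.05f ~= 5% of splits.
            float fraction = context.getConfiguration()
                                    .getFloat("sample.fraction", 0.05f);
            Random rng = new Random(42); // fixed seed => repeatable sample
            List<InputSplit> sampled = new ArrayList<InputSplit>();
            for (InputSplit split : super.getSplits(context))
            {
                if (rng.nextFloat() < fraction)
                    sampled.add(split);
            }
            return sampled;
        }
    }

Since the splits are roughly equal-sized token ranges, keeping about
5% of them should read roughly 5% of the data, which sounds like what
you are after.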

Best regards,
Clint

On Wed, May 14, 2014 at 6:29 PM, Anton Brazhnyk
<anton.brazh...@genesys.com> wrote:
> Greetings,
>
> I'm reading data from C* with Spark (via ColumnFamilyInputFormat) and I'd 
> like to read just part of it - something like Spark's sample() function.
> Cassandra's API seems to allow it with its
> ConfigHelper.setInputRange(jobConfiguration, startToken, endToken) method,
> but it doesn't work.
> The limit is just ignored and the entire column family is scanned. It seems
> this kind of feature is just not supported, and the source of
> AbstractColumnFamilyInputFormat.getSplits confirms that (IMO).
> Questions:
> 1. Am I right that there is no way to fetch data limited by token range
> with ColumnFamilyInputFormat?
> 2. Is there another way to limit the amount of data read from Cassandra
> with Spark and ColumnFamilyInputFormat, so that the amount is predictable
> (like 5% of the entire dataset)?
>
>
> WBR,
> Anton
>
>