RE: Cassandra token range support for Hadoop (ColumnFamilyInputFormat)

Anton Brazhnyk Tue, 20 May 2014 15:07:35 -0700

I went with the recommendations to create my own input format or backport the 
2.0.7 code, and it works now.
To be more specific:
AbstractColumnFamilyInputFormat.getSplits(JobContext) handled only the case 
of an ordered partitioner with ranges based on keys.
It did convert keys to tokens and used all the low-level support that is 
there (which is probably what you are talking about).
BUT there was no way to engage that support via ColumnFamilyInputFormat and 
ConfigHelper.setInputRange(startToken, endToken)
prior to 2.0.7 without modifying the C* code.
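
For anyone landing on this thread later, here is a minimal sketch of how a 
startToken/endToken pair for a slice of the ring could be computed. It assumes 
Murmur3Partitioner (tokens from -2^63 to 2^63-1); token_slice and its 
parameters are hypothetical names for illustration, not part of C*. The tokens 
are returned as strings because ConfigHelper.setInputRange takes them as 
strings.

```python
# Token bounds of Murmur3Partitioner (signed 64-bit).
# Assumption: the cluster uses this partitioner.
MIN_TOKEN = -2**63
MAX_TOKEN = 2**63 - 1


def token_slice(percent, offset_percent=0):
    """Return (start_token, end_token) as strings.

    The slice covers `percent` of the token space and begins
    `offset_percent` of the way into it. Integer arithmetic is used
    so that no precision is lost on 64-bit values.
    """
    if not 1 <= percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    if not 0 <= offset_percent <= 100 - percent:
        raise ValueError("slice would run past the end of the token space")
    span = MAX_TOKEN - MIN_TOKEN
    start = MIN_TOKEN + span * offset_percent // 100
    end = MIN_TOKEN + span * (offset_percent + percent) // 100
    return str(start), str(end)


start_token, end_token = token_slice(5)
print(start_token, end_token)
```

The two strings would then be handed to 
ConfigHelper.setInputRange(jobConfiguration, startToken, endToken). Whether 
the range is honoured still depends on the C* version, as described above.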



-----Original Message-----
From: Aaron Morton [mailto:aa...@thelastpickle.com] 
Sent: Monday, May 19, 2014 11:58 PM
To: Cassandra User
Subject: Re: Cassandra token range support for Hadoop (ColumnFamilyInputFormat)

>> "between 1.2.6 and 2.0.6 the setInputRange(startToken, endToken) is not 
>> working"
> Can you confirm or disprove?


My reading of the code is that it will consider the part of each token range 
(from vnodes or initial tokens) that overlaps with the provided token range.
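
As a rough illustration of that overlap logic (not the actual C* code), the 
sketch below clips each ring range to the requested range and drops the ones 
that do not overlap. The function names are made up for this example, and it 
assumes plain (start, end) integer pairs that do not wrap around the ring.

```python
def intersect(ring_range, wanted):
    """Return the overlap of two (start, end) ranges, or None if there is none.

    Assumption: neither range wraps around the ring.
    """
    start = max(ring_range[0], wanted[0])
    end = min(ring_range[1], wanted[1])
    return (start, end) if start < end else None


def restrict_ranges(ring_ranges, wanted):
    """Keep only the parts of the ring ranges that fall inside `wanted`."""
    clipped = (intersect(r, wanted) for r in ring_ranges)
    return [r for r in clipped if r is not None]


# Toy ring with three ranges; only the overlapping parts survive.
ring = [(0, 100), (100, 200), (200, 300)]
print(restrict_ranges(ring, (50, 150)))
```

Splits would then be computed from the clipped ranges rather than from the 
full ring ranges.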

> I've already got one confirmation that in C* version I use (1.2.15) setting 
> limits with setInputRange(startToken, endToken) doesn't work.
Can you be more specific?

> work only for ordered partitioners (in 1.2.15).

It will work with ordered and unordered partitioners equally. The difference is 
probably in what you consider "working" to mean. The token ranges are handled 
the same way; it's the rows in them that change.
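
A toy sketch of that difference, under loud assumptions: the "ordered" case 
uses the key itself as the token, and the "hashed" case uses MD5 reduced to a 
2^127 token space, which only imitates a hashing partitioner and is not C*'s 
exact token function. All names here are invented for the example.

```python
import hashlib

RING = 2**127  # size of the toy hashed token space (assumption)


def hashed_token(key):
    """Toy token: MD5 of the key mapped into [0, RING)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % RING


keys = ["user%05d" % i for i in range(10000)]

# Ordered partitioner: the token is the key, so a token range
# selects one contiguous block of keys.
ordered_hits = [k for k in keys if "user00000" < k <= "user00500"]

# Hashing partitioner: the first 5% of the token space selects
# keys scattered across the whole key set.
low, high = 0, RING // 20
hashed_hits = [k for k in keys if low < hashed_token(k) <= high]

print(len(ordered_hits), ordered_hits[0], ordered_hits[-1])
print(len(hashed_hits), sorted(hashed_hits)[:3])
```

In both cases the range is applied the same way; only the set of rows that 
falls inside it differs. With hashed tokens the count should land near 5% of 
the keys, which is the sampling behaviour asked about earlier in the thread.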

Cheers
Aaron

-----------------
Aaron Morton
New Zealand
@aaronmorton

Co-Founder & Principal Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

On 20/05/2014, at 11:37 am, Anton Brazhnyk <anton.brazh...@genesys.com> wrote:

> Hi Aaron,
>  
> I've seen the code you describe (working with splits and intersections), 
> but that range is derived from keys and works only for ordered partitioners 
> (in 1.2.15).
> I've already got one confirmation that in C* version I use (1.2.15) setting 
> limits with setInputRange(startToken, endToken) doesn't work.
> "between 1.2.6 and 2.0.6 the setInputRange(startToken, endToken) is not 
> working"
> Can you confirm or disprove?
>  
> WBR,
> Anton
>  
> From: Aaron Morton [mailto:aa...@thelastpickle.com]
> Sent: Monday, May 19, 2014 1:58 AM
> To: Cassandra User
> Subject: Re: Cassandra token range support for Hadoop 
> (ColumnFamilyInputFormat)
>  
> The limit is just ignored and the entire column family is scanned.
> Which limit?
> 
> 
> 1. Am I right that there is no way to get some data limited by token range 
> with ColumnFamilyInputFormat?
> From what I understand, the input range is used when calculating the 
> splits. The token ranges in the cluster are iterated, and if a range 
> intersects with the supplied range, the overlapping part, rather than the 
> full token range, is used to calculate the split.
>  
> 2. Is there another way to limit the amount of data read from Cassandra 
> with Spark and ColumnFamilyInputFormat, so that this amount is predictable 
> (like 5% of entire dataset)?
> If you supplied a token range that is 5% of the possible range of token 
> values, that should be close to a random 5% sample.
>  
>  
> Hope that helps. 
> Aaron
>  
> -----------------
> Aaron Morton
> New Zealand
> @aaronmorton
>  
> Co-Founder & Principal Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com
>  
> On 14/05/2014, at 10:46 am, Anton Brazhnyk <anton.brazh...@genesys.com> wrote:
> 
> 
> Greetings,
> 
> I'm reading data from C* with Spark (via ColumnFamilyInputFormat) and I'd 
> like to read just part of it - something like Spark's sample() function.
> Cassandra's API seems to allow this with its 
> ConfigHelper.setInputRange(jobConfiguration, startToken, endToken) method, 
> but it doesn't work.
> The limit is just ignored and the entire column family is scanned. It 
> seems this kind of feature is just not supported, and the source of 
> AbstractColumnFamilyInputFormat.getSplits confirms that (IMO).
> Questions:
> 1. Am I right that there is no way to get some data limited by token range 
> with ColumnFamilyInputFormat?
> 2. Is there another way to limit the amount of data read from Cassandra 
> with Spark and ColumnFamilyInputFormat, so that this amount is predictable 
> (like 5% of entire dataset)?
> 
> 
> WBR,
> Anton
>
