Hi Avi,

The Spark project documentation is quite good, as is the
spark-cassandra-connector GitHub project, which contains some basic
examples you can easily draw inspiration from. A few random pieces of
advice you might find useful:
- You will want one Spark worker on each node, and a Spark master on
either one of the nodes or on a separate node.
- Pay close attention to your port configuration (firewall), as the Spark
error log does not always give you the right hint.
- Pay close attention to your heap sizes. Make sure to configure them such
that Cassandra heap size + Spark heap size < your node's memory (taking
into account Cassandra off-heap usage if enabled, the OS, etc.).
- If your Cassandra data center is used in production, make sure you
throttle reads/writes from Spark, pay attention to your latencies, and
consider using a separate analytics Cassandra data center if you get
serious with Spark.
- More or less everyone I know finds that writing Spark jobs in Scala is
natural, while writing them in Java is painful :D

Getting Spark running will be a bit of an investment at the beginning, but
overall you will find that it lets you run queries you can't naturally run
in Cassandra, like the one you described.
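
In case it helps, here is a rough Scala sketch of what the "unique
partition keys" job could look like with the spark-cassandra-connector.
The keyspace, contact host and the read-throttling property are
assumptions on my side; double-check them against the connector version
you end up using.

import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

object DistinctIds {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("distinct-ids")
      // Point this at one of your Cassandra nodes.
      .set("spark.cassandra.connection.host", "10.0.0.1")
      // Throttle reads if the cluster also serves production traffic
      // (verify the exact property name for your connector version).
      .set("spark.cassandra.input.reads_per_sec", "1000")

    val sc = new SparkContext(conf)

    // With workers collocated on the Cassandra nodes, each worker scans the
    // token ranges owned by its local node, in parallel.
    val ids = sc.cassandraTable("my_keyspace", "my_table")
      .select("id")
      .map(_.getString("id"))
      .distinct()                    // ids repeat once per clustering row

    ids.collect().foreach(println)   // or write them out instead of collecting
    sc.stop()
  }
}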

Cheers,

Christophe

On 21 August 2017 at 16:16, Avi Levi <a...@indeni.com> wrote:

> Thanks Christophe,
> We didn't want to add too many moving parts, but it sounds like a good
> solution. Do you have any reference / link that I can look at?
>
> Cheers
> Avi
>
> On Mon, Aug 21, 2017 at 3:43 AM, Christophe Schmitz <
> christo...@instaclustr.com> wrote:
>
>> Hi Avi,
>>
>> Have you thought of using Spark for that work? If you collocate the Spark
>> workers on each Cassandra node, the spark-cassandra-connector will
>> automatically split the token range for you in such a way that each Spark
>> worker only hits its local Cassandra node. This will also be done in
>> parallel. It should be much faster that way.
>>
>> Cheers,
>> Christophe
>>
>>
>> On 21 August 2017 at 01:34, Avi Levi <a...@indeni.com> wrote:
>>
>>> Thank you very much, one question. You wrote that I do not need
>>> DISTINCT here since it's part of the primary key, but only the
>>> combination is unique (*PRIMARY KEY (id, timestamp)*). Also, if I take
>>> the last token and feed it back as you showed, wouldn't I get
>>> overlapping boundaries?
>>>
>>> On Sun, Aug 20, 2017 at 6:18 PM, Eric Stevens <migh...@gmail.com> wrote:
>>>
>>>> You should be able to fairly efficiently iterate all the partition keys
>>>> like:
>>>>
>>>> select id, token(id) from table where token(id) >= -9204925292781066255
>>>> limit 1000;
>>>>  id                                         | system.token(id)
>>>> --------------------------------------------+----------------------
>>>> ...
>>>>  0xb90ea1db5c29f2f6d435426dccf77cca6320fac9 | -7821793584824523686
>>>>
>>>> Take the last token you receive and feed it back in, skipping
>>>> duplicates from the previous page (on the unlikely chance that you have
>>>> two IDs with a token collision on the page boundary):
>>>>
>>>> select id, token(id) from table where token(id) >=
>>>> -7821793584824523686 limit 1000;
>>>>  id                                         | system.token(id)
>>>> --------------------------------------------+---------------------
>>>> ...
>>>>  0xc6289d729c9087fb5a1fe624b0b883ab82a9bffe | -434806781044590339
>>>>
>>>> Continue until you have no more results. You don't really need
>>>> distinct here: since it's part of your primary key, it must already be
>>>> distinct.
>>>>
>>>> If you want to parallelize it, split the ring into *n* ranges and
>>>> include each range's upper bound in the query for its segment:
>>>>
>>>> select id, token(id) from table where token(id) >= -9204925292781066255
>>>> AND token(id) < $rangeUpperBound limit 1000;
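>>>>
>>>> For reference, a rough Scala sketch of this pagination loop (DataStax
>>>> Java driver; the contact point and keyspace are placeholders, and it
>>>> uses SELECT DISTINCT so each page advances by whole partitions):
>>>>
>>>> import com.datastax.driver.core.Cluster
>>>> import scala.collection.JavaConverters._
>>>>
>>>> object IterateIds extends App {
>>>>   val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
>>>>   val session = cluster.connect("my_keyspace")
>>>>
>>>>   val pageSize = 1000
>>>>   var token    = Long.MinValue     // start of the Murmur3 token ring
>>>>   var lastIds  = Set.empty[String] // ids seen at the previous page boundary
>>>>   var done     = false
>>>>
>>>>   // To parallelize, run one such loop per ring segment and add
>>>>   // "AND token(id) < $rangeUpperBound" to the query.
>>>>   while (!done) {
>>>>     val rows = session.execute(
>>>>       s"SELECT DISTINCT id, token(id) FROM my_table " +
>>>>       s"WHERE token(id) >= $token LIMIT $pageSize").all().asScala
>>>>
>>>>     // Skip ids already handled at the previous page boundary.
>>>>     rows.map(_.getString("id")).filterNot(lastIds).foreach(println)
>>>>
>>>>     if (rows.size < pageSize) done = true
>>>>     else {
>>>>       token   = rows.last.getLong(1)  // feed the last token back in
>>>>       lastIds = rows.filter(_.getLong(1) == token).map(_.getString("id")).toSet
>>>>     }
>>>>   }
>>>>   cluster.close()
>>>> }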
>>>>
>>>>
>>>> On Sun, Aug 20, 2017 at 12:33 AM Avi Levi <a...@indeni.com> wrote:
>>>>
>>>>> I need to get all unique keys (not the complete primary key, just the
>>>>> partition key) in order to aggregate all the relevant records of that key
>>>>> and apply some calculations on it.
>>>>>
>>>>> *CREATE TABLE my_table (
>>>>>     id text,
>>>>>     timestamp bigint,
>>>>>     value double,
>>>>>     PRIMARY KEY (id, timestamp) )*
>>>>>
>>>>> I know that a query like this
>>>>>
>>>>> *SELECT DISTINCT id FROM my_table *
>>>>>
>>>>> is not very efficient, but how about the approach presented here
>>>>> <http://www.scylladb.com/2017/02/13/efficient-full-table-scans-with-scylla-1-6/>,
>>>>> sending queries in parallel and using the token:
>>>>>
>>>>> *SELECT DISTINCT id FROM my_table WHERE token(id) >= -9204925292781066255 
>>>>> AND token(id) <= -9223372036854775808; *
>>>>>
>>>>> *Or I can just maintain another table with the unique keys:*
>>>>>
>>>>> *CREATE TABLE id_only (
>>>>>     id text,
>>>>>     PRIMARY KEY (id) )*
>>>>>
>>>>> but I tend not to, since it is error prone and will require other
>>>>> procedures to maintain data integrity between those two tables.
>>>>>
>>>>> Any ideas?
>>>>>
>>>>> Thanks
>>>>>
>>>>> Avi
>>>>>
>>>>>
>>>
>>
>>
>> --
>>
>>
>> *Christophe Schmitz*
>> *Director of consulting EMEA*
>>
>
>


-- 


*Christophe Schmitz*
*Director of consulting EMEA*
