I am not sure what use case you want to demonstrate with SELECT COUNT in 
general. Maybe you can elaborate a bit more on what your use case is.

Aside from this: this is a Cassandra issue. What is the setup of Cassandra? 
Dedicated nodes? How many? Replication strategy? Consistency configuration? How 
is the data spread across the nodes?
Cassandra is better suited for use cases where you have a lot of data but 
select only a small subset of it, or where you have a lot of single writes.
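
To make that concrete, here is a minimal sketch of the intended access pattern 
with the spark-cassandra-connector. The column name "id" and the key value are 
hypothetical; the point is that the where() predicate on the partition key is 
pushed down to Cassandra, so only one partition is read instead of the whole 
table:

    // Sketch, assuming a hypothetical partition key column "id" on blocks.
    // The where() predicate is pushed down to Cassandra, so only the
    // matching partition is read instead of the full table.
    import com.datastax.spark.connector._

    val subset = sc.cassandraTable("cuneiform", "blocks")
      .where("id = ?", "some-key")
    subset.take(10).foreach(println)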

If you want to analyze the data, you have to export it once to Parquet, ORC, 
etc. and then run your queries on that. Depending on your use case, you may 
want to go for Hive2 + Tez + LLAP or Spark for this.
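
A rough sketch of such a one-off export through the connector's DataFrame 
source (the output path is hypothetical); once exported, the count and any 
other analytical queries run against the Parquet copy instead of Cassandra:

    // One-off export: copy the Cassandra table into Parquet files.
    val df = spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "cuneiform", "table" -> "blocks"))
      .load()
    df.write.parquet("hdfs:///data/blocks_parquet")  // hypothetical path

    // Counting the columnar copy is cheap compared to a full Cassandra scan.
    spark.read.parquet("hdfs:///data/blocks_parquet").count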

> On 24 Nov 2016, at 20:52, kant kodali <kanth...@gmail.com> wrote:
> 
> Some accurate numbers here: it took me 1 hr 30 mins to count 698,705,723 
> rows (~700 million)
> 
> and my code is just this:
> 
> sc.cassandraTable("cuneiform", "blocks").cassandraCount
> 
> 
> 
>> On Thu, Nov 24, 2016 at 10:48 AM, kant kodali <kanth...@gmail.com> wrote:
>> Take a look at this https://github.com/brianmhess/cassandra-count
>> 
>> Now it is just a matter of incorporating it into the 
>> spark-cassandra-connector, I guess.
>> 
>>> On Thu, Nov 24, 2016 at 1:01 AM, kant kodali <kanth...@gmail.com> wrote:
>>> According to this link 
>>> https://github.com/datastax/spark-cassandra-connector/blob/master/doc/3_selection.md
>>> 
>>> I tried the following, but it still looks like it is taking forever:
>>> 
>>> sc.cassandraTable(keyspace, table).cassandraCount
>>> 
>>>> On Thu, Nov 24, 2016 at 12:56 AM, kant kodali <kanth...@gmail.com> wrote:
>>>> I would be glad if SELECT COUNT(*) FROM hello could return any value for 
>>>> that size :) I can say for sure it didn't return anything for 30 mins, and 
>>>> I would probably need to build up more patience to sit for a few more hours 
>>>> after that! Cassandra recommends using nodetool cfstats (ColumnFamilyStats), 
>>>> which gives a pretty good estimate but not an accurate value.
>>>> 
>>>>> On Thu, Nov 24, 2016 at 12:48 AM, Anastasios Zouzias <zouz...@gmail.com> 
>>>>> wrote:
>>>>> How fast is Cassandra without Spark on the count operation?
>>>>> 
>>>>> cqlsh> SELECT COUNT(*) FROM hello;
>>>>> 
>>>>> (this is not equivalent to what you are doing, but it might help you find 
>>>>> the root cause)
>>>>> 
>>>>>> On Thu, Nov 24, 2016 at 9:03 AM, kant kodali <kanth...@gmail.com> wrote:
>>>>>> I have the following code
>>>>>> 
>>>>>> I invoke spark-shell as follows
>>>>>> 
>>>>>>     ./spark-shell --conf spark.cassandra.connection.host=170.99.99.134 
>>>>>> --executor-memory 15G --executor-cores 12 --conf 
>>>>>> spark.cassandra.input.split.size_in_mb=67108864
>>>>>> 
>>>>>> code
>>>>>> 
>>>>>>     scala> val df = spark.sql("SELECT test from hello") // A billion rows 
>>>>>> in hello; the test column is 1KB
>>>>>>     
>>>>>>     df: org.apache.spark.sql.DataFrame = [test: binary]
>>>>>>     
>>>>>>     scala> df.count
>>>>>>     
>>>>>>     [Stage 0:>   (0 + 2) / 13] // I don't know what these numbers mean 
>>>>>> precisely.
>>>>>> 
>>>>>> If I invoke spark-shell as follows
>>>>>> 
>>>>>>     ./spark-shell --conf spark.cassandra.connection.host=170.99.99.134
>>>>>> 
>>>>>> code
>>>>>> 
>>>>>> 
>>>>>>     val df = spark.sql("SELECT test from hello") // This has about a 
>>>>>> billion rows
>>>>>>     
>>>>>>     scala> df.count
>>>>>>     
>>>>>>     
>>>>>>     [Stage 0:=>  (686 + 2) / 24686] // What are these numbers precisely?
>>>>>> 
>>>>>> 
>>>>>> Neither of these versions worked; Spark keeps running forever, and I 
>>>>>> have been waiting for more than 15 mins with no response. Any ideas on 
>>>>>> what could be wrong and how to fix this?
>>>>>> 
>>>>>> I am using Spark 2.0.2
>>>>>> and spark-cassandra-connector_2.11-2.0.0-M3.jar
>>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> -- 
>>>>> -- Anastasios Zouzias
>>>> 
>>> 
>> 
> 
