Take a look at this https://github.com/brianmhess/cassandra-count

Now it is just a matter of incorporating it into spark-cassandra-connector, I
guess.
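
As an aside, and only a guess since nothing in the thread confirms it: spark.cassandra.input.split.size_in_mb is denominated in megabytes, and the value 67108864 passed to spark-shell below is exactly 64 * 1024 * 1024, i.e. 64 MB written in bytes. Read as megabytes, that asks each Spark partition to cover roughly 64 TB of Cassandra data. A quick arithmetic sanity check (the 1 TB table size and the 64 MB default are assumptions for illustration; 64 is the default I recall from the 2.0-era connector docs):

```scala
// Sanity-check sketch: what split.size_in_mb=67108864 would imply.
// The setting is read in megabytes; 67108864 MB is 64 MB mis-pasted in bytes.
val pastedMb  = 67108864L                 // value from the spark-shell command
val defaultMb = 64L                       // assumed connector default (64 MB)
val tableMb   = 1024L * 1024L             // hypothetical 1 TB table, for illustration

println(pastedMb == 64L * 1024L * 1024L)  // true: it is 64 MB expressed in bytes
println(tableMb / defaultMb)              // 16384 splits with the assumed default
println(math.max(1L, tableMb / pastedMb)) // 1 split with the pasted value
```

With one giant split the whole scan lands on a couple of tasks, which would match the `(0 + 2) / 13` progress line quoted further down.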

On Thu, Nov 24, 2016 at 1:01 AM, kant kodali <kanth...@gmail.com> wrote:

> According to this link https://github.com/datastax/
> spark-cassandra-connector/blob/master/doc/3_selection.md
>
> I tried the following, but it still looks like it is taking forever:
>
> sc.cassandraTable(keyspace, table).cassandraCount
>
>
> On Thu, Nov 24, 2016 at 12:56 AM, kant kodali <kanth...@gmail.com> wrote:
>
>> I would be glad if SELECT COUNT(*) FROM hello returned any value at all for
>> that size :) I can say for sure it didn't return anything for 30 mins, and I
>> would probably need a few more hours of patience after that! Cassandra
>> recommends using nodetool cfstats (ColumnFamilyStats), which gives a pretty
>> good estimate but not an accurate value.
>>
>> On Thu, Nov 24, 2016 at 12:48 AM, Anastasios Zouzias <zouz...@gmail.com>
>> wrote:
>>
>>> How fast is Cassandra without Spark on the count operation?
>>>
>>> cqlsh> SELECT COUNT(*) FROM hello;
>>>
>>> (this is not equivalent to what you are doing, but it might help you find
>>> the root cause)
>>>
>>> On Thu, Nov 24, 2016 at 9:03 AM, kant kodali <kanth...@gmail.com> wrote:
>>>
>>>> I have the following code.
>>>>
>>>> I invoke spark-shell as follows:
>>>>
>>>>     ./spark-shell --conf spark.cassandra.connection.host=170.99.99.134
>>>> --executor-memory 15G --executor-cores 12 --conf
>>>> spark.cassandra.input.split.size_in_mb=67108864
>>>>
>>>> code
>>>>
>>>>     scala> val df = spark.sql("SELECT test from hello") // Billion rows
>>>> in hello and test column is 1KB
>>>>
>>>>     df: org.apache.spark.sql.DataFrame = [test: binary]
>>>>
>>>>     scala> df.count
>>>>
>>>>     [Stage 0:>   (0 + 2) / 13] // I don't know what these numbers mean
>>>> precisely.
>>>>
>>>> If I invoke spark-shell as follows:
>>>>
>>>>     ./spark-shell --conf spark.cassandra.connection.host=170.99.99.134
>>>>
>>>> code
>>>>
>>>>
>>>>     val df = spark.sql("SELECT test from hello") // This has about
>>>> billion rows
>>>>
>>>>     scala> df.count
>>>>
>>>>
>>>>     [Stage 0:=>  (686 + 2) / 24686] // What are these numbers precisely?
>>>>
>>>>
>>>> Neither version worked; Spark keeps running forever, and I have been
>>>> waiting for more than 15 mins with no response. Any ideas on what could
>>>> be wrong and how to fix this?
>>>>
>>>> I am using Spark 2.0.2
>>>> and spark-cassandra-connector_2.11-2.0.0-M3.jar
>>>>
>>>>
>>>
>>>
>>> --
>>> -- Anastasios Zouzias
>>> <a...@zurich.ibm.com>
>>>
>>
>>
>