Re: spark sql and cassandra. spark generate 769 tasks to read 3 lines from cassandra table

2015-06-17 Thread Serega Sheypak
version
We are on DSE 4.7 (Cassandra 2.1) and Spark 1.2.1.

cqlsh
select * from site_users
returns fast, subsecond, only 3 rows

Can you show some code how you're doing the reads?
dse beeline
!connect ...
select * from site_users
--table has 3 rows, several columns in each row. Spark creates 769 tasks and
estimates the input as 80 TB

0: jdbc:hive2://dsenode01:1 select count(*) from site_users;

+------+
| _c0  |
+------+
| 3    |
+------+

1 row selected (41.635 seconds)


Spark and Cassandra-connector

/usr/share/dse/spark/lib/spark-cassandra-connector-java_2.10-1.2.1.jar

/usr/share/dse/spark/lib/spark-cassandra-connector_2.10-1.2.1.jar


Re: spark sql and cassandra. spark generate 769 tasks to read 3 lines from cassandra table

2015-06-17 Thread Yana Kadiyska
Can you show some code how you're doing the reads? Have you successfully
read other stuff from Cassandra (i.e. do you have a lot of experience with
this path and this particular table is causing issues or are you trying to
figure out the right way to do a read).

What version of Spark and Cassandra-connector are you using?
Also, what do you get for select count(*) from foo -- is that just as bad?


Re: spark sql and cassandra. spark generate 769 tasks to read 3 lines from cassandra table

2015-06-17 Thread Serega Sheypak
Hi, can somebody suggest a way to reduce the number of tasks?

2015-06-15 18:26 GMT+02:00 Serega Sheypak serega.shey...@gmail.com:

 Hi, I'm running Spark SQL against a Cassandra table. I have 3 C* nodes, each
 of them running a Spark worker.
 The problem is that Spark runs 769 tasks to read 3 rows: select bar from
 foo.
 I've tried these properties:

 #try to avoid 769 tasks per dummy select foo from bar query
 spark.cassandra.input.split.size_in_mb=32mb
 spark.cassandra.input.fetch.size_in_rows=1000
 spark.cassandra.input.split.size=1

 but it doesn't help.
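
 For reference, these connector properties can also be set on the SparkConf when the
 context is created, rather than passed to the thrift server. A minimal sketch (the
 host name and values here are assumptions, not from this thread; note also that
 spark.cassandra.input.split.size_in_mb expects a plain number of megabytes, so a
 suffixed value like "32mb" may not parse):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical setup; the host and values are illustrative only.
val conf = new SparkConf()
  .setAppName("cassandra-read")
  .set("spark.cassandra.connection.host", "dsenode01")
  // Plain number of megabytes, not "32mb".
  .set("spark.cassandra.input.split.size_in_mb", "32")
  .set("spark.cassandra.input.fetch.size_in_rows", "1000")

val sc = new SparkContext(conf)
```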

 Here are the mean metrics for the job:
 input1 = 8388608.0 TB
 input2 = -320 B
 input3 = -400 B

 I'm confused by these input figures; there are only 3 rows in the C* table.
 Definitely, I don't have 8388608.0 TB of data :)
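
One observation on those numbers (mine, not from the thread): 8388608 TB is exactly
2^23 TB, i.e. 2^63 bytes, one past Long.MaxValue, and negative byte counts are also
what an overflowed signed 64-bit size counter would look like. A quick check:

```scala
// 8388608 TB in bytes is 2^23 * 2^40 = 2^63, which does not fit in a
// signed 64-bit Long (Long.MaxValue = 2^63 - 1), so a counter holding
// it would overflow and could print garbage or negative values.
object SizeCheck {
  val bytesPerTb: BigInt = BigInt(2).pow(40)
  val reported: BigInt = BigInt(8388608) * bytesPerTb

  def main(args: Array[String]): Unit = {
    println(reported == BigInt(2).pow(63))      // true
    println(reported > BigInt(Long.MaxValue))   // true
  }
}
```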






Re: spark sql and cassandra. spark generate 769 tasks to read 3 lines from cassandra table

2015-06-17 Thread Serega Sheypak
So, here is some new input:

The problem could be in the Spark SQL Thrift server.
When I submit the SQL query from the Spark console, it takes 10 seconds and
produces a reasonable number of tasks.

import com.datastax.spark.connector._
import org.apache.spark.sql.cassandra.CassandraSQLContext

val cc = new CassandraSQLContext(sc)

cc.sql("select su.user_id from appdata.site_users su join appdata.user_orders uo on uo.user_id = su.user_id").count()

res8: Long = 2

If the same query is submitted through beeline, it takes minutes and Spark
creates up to 2000 tasks to read 3 rows of data.

We think the Spark SQL Thrift server (spark-sql-thriftserver) has a bug in it.
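
A toy model of why a bad size estimate explodes the task count (this is not the
connector's actual planner, which splits by token ranges; it only illustrates the
proportionality): the planner creates roughly one task per split.size_in_mb of
*estimated* data, so an inflated estimate yields a huge task count no matter how
few rows really exist.

```scala
object SplitModel {
  // Roughly one task per splitSizeMb of estimated input, at least one.
  def estimatedTasks(estimatedMb: Double, splitSizeMb: Double): Long =
    math.max(1L, math.ceil(estimatedMb / splitSizeMb).toLong)

  def main(args: Array[String]): Unit = {
    // A table that is really ~1 MB with 32 MB splits: a single task.
    println(estimatedTasks(1.0, 32.0))            // 1
    // The same table with a bogus 100 GB estimate: thousands of tasks.
    println(estimatedTasks(100.0 * 1024, 32.0))   // 3200
  }
}
```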
