Re: spark sql and cassandra. spark generate 769 tasks to read 3 lines from cassandra table
> version

We are on DSE 4.7 (Cassandra 2.1) and Spark 1.2.1. In cqlsh, select * from site_users returns fast, sub-second, and only 3 rows.

> Can you show some code how you're doing the reads?

dse beeline
!connect ...
select * from site_users;  -- table has 3 rows, several columns in each row. Spark runs 769 tasks and estimates the input as 80 TB.

0: jdbc:hive2://dsenode01:1> select count(*) from site_users;
+------+
| _c0  |
+------+
| 3    |
+------+
1 row selected (41.635 seconds)

> Spark and Cassandra-connector

/usr/share/dse/spark/lib/spark-cassandra-connector-java_2.10-1.2.1.jar
/usr/share/dse/spark/lib/spark-cassandra-connector_2.10-1.2.1.jar

2015-06-17 13:52 GMT+02:00 Yana Kadiyska yana.kadiy...@gmail.com:

> Can you show some code how you're doing the reads? Have you successfully read other stuff from Cassandra (i.e. do you have a lot of experience with this path and this particular table is causing issues, or are you trying to figure out the right way to do a read)? What version of Spark and Cassandra-connector are you using? Also, what do you get for select count(*) from foo -- is that just as bad?

On Wed, Jun 17, 2015 at 4:37 AM, Serega Sheypak serega.shey...@gmail.com wrote:

> Hi, can somebody suggest a way to reduce the number of tasks?

2015-06-15 18:26 GMT+02:00 Serega Sheypak serega.shey...@gmail.com:

> Hi, I'm running Spark SQL against a Cassandra table. I have 3 C* nodes, each of them running a Spark worker. The problem is that Spark runs 769 tasks to read 3 rows: select bar from foo. I've tried these properties:
>
> # try to avoid 769 tasks per dummy "select foo from bar" query
> spark.cassandra.input.split.size_in_mb=32mb
> spark.cassandra.input.fetch.size_in_rows=1000
> spark.cassandra.input.split.size=1
>
> but it doesn't help. Here are the mean metrics for the job:
>
> input1 = 8388608.0 TB
> input2 = -320 B
> input3 = -400 B
>
> I'm confused by the input figures: there are only 3 rows in the C* table. I definitely don't have 8388608.0 TB of data :)
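As an aside on the properties tried above: connector read-tuning options generally only take effect if they are set on the SparkConf before the SparkContext is created, and the split-size values are plain numbers, so the "32mb" suffix may not parse the way the connector expects (that is an assumption, not something confirmed in the thread). A minimal sketch, using the property spellings from this thread (they vary between connector versions) and a hypothetical app name and host:

```scala
// Sketch only: read-tuning properties must be on the SparkConf before the
// context exists; setting them afterwards does not affect already-created RDDs.
// Property spellings follow this thread and may differ between connector versions.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.cassandra.CassandraSQLContext

val conf = new SparkConf()
  .setAppName("site-users-read")                        // hypothetical app name
  .set("spark.cassandra.connection.host", "dsenode01")  // assumption: one C* node's host
  .set("spark.cassandra.input.split.size", "100000")    // coarser splits -> fewer tasks
  .set("spark.cassandra.input.fetch.size_in_rows", "1000")

val sc = new SparkContext(conf)
val cc = new CassandraSQLContext(sc)
```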
Re: spark sql and cassandra. spark generate 769 tasks to read 3 lines from cassandra table
So, there is some input: the problem could be in the Spark SQL Thrift Server. When I use the Spark console to submit the SQL query, it takes 10 seconds and a reasonable number of tasks:

import com.datastax.spark.connector._
import org.apache.spark.sql.cassandra.CassandraSQLContext

val cc = new CassandraSQLContext(sc)
cc.sql("select su.user_id from appdata.site_users su join appdata.user_orders uo on uo.user_id = su.user_id").count()
res8: Long = 2

If the same query is submitted through beeline, it takes minutes and Spark creates up to 2000 tasks to read 3 rows of data. We think the Spark SQL Thrift Server has bugs in it.
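A back-of-the-envelope way to see why the task count explodes: the connector sizes its Spark partitions by dividing its estimate of the table's data size by the configured split size, so a garbage estimate (like the 8388608.0 TB reported in this thread) yields an enormous partition count no matter how few rows the table actually holds. This is only an illustration of that arithmetic, not the connector's actual code:

```scala
// Illustration only (not connector source): ceiling-divide the size estimate
// by the split size to get the number of Spark partitions, i.e. read tasks.
def numSplits(estimatedBytes: BigInt, splitSizeBytes: BigInt): BigInt =
  (estimatedBytes + splitSizeBytes - 1) / splitSizeBytes

val mb = BigInt(1) << 20
val tb = BigInt(1) << 40

println(numSplits(BigInt(300), 32 * mb))           // realistic 3-row table: 1 task
println(numSplits(BigInt(8388608) * tb, 32 * mb))  // bogus 8388608 TB estimate: ~2.7e11 tasks
```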