RE: Need some Cassandra integration help

2015-06-01 Thread Mohammed Guller
Hi Yana,
Not sure whether you have already solved this issue. As far as I know, DataFrame 
support in the Spark Cassandra Connector was added in version 1.3. The first 
milestone release of SCC v1.3 was just announced.

Mohammed
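[Editor's note] With the connector's DataFrame support, both sides can be loaded through one SQLContext so a single planner handles the union. A minimal sketch, not tested: it assumes Spark 1.3 with the SCC 1.3 milestone on the classpath, and the data-source name and option keys ("keyspace", "table") are taken from later connector documentation and may differ in the milestone release.

```scala
import org.apache.spark.sql.SQLContext

// Assumes an existing SparkContext `sc` and spark-cassandra-connector
// 1.3 (milestone) on the classpath, with connection.host configured.
val sqlContext = new SQLContext(sc)

// Load the Cassandra table through the generic data-source API so the
// resulting DataFrame belongs to the same SQLContext as the parquet one.
// Keyspace/table names are the ones from the error message in this thread.
val cassDF = sqlContext.load(
  "org.apache.spark.sql.cassandra",
  Map("keyspace" -> "yana_test", "table" -> "test_connector"))

val dfHDFS = sqlContext.parquetFile("foo.parquet")

// One planner sees both relations, so no "No plan for CassandraRelation".
dfHDFS.unionAll(cassDF).count
```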

From: Yana Kadiyska [mailto:yana.kadiy...@gmail.com]
Sent: Tuesday, May 26, 2015 1:31 PM
To: user@spark.apache.org
Subject: Need some Cassandra integration help

Hi folks, for those of you working with Cassandra: I am wondering if anyone has 
successfully processed a mix of Cassandra and HDFS data. I have a dataset that 
is stored partially in HDFS and partially in Cassandra (the schema is the same 
in both places).

I am trying to do the following:

val dfHDFS = sqlContext.parquetFile("foo.parquet")
val cassDF = cassandraContext.sql("SELECT * FROM keyspace.user")

dfHDFS.unionAll(cassDF).count

This is failing for me with the following -

Exception in thread "main" java.lang.AssertionError: assertion failed: No plan for CassandraRelation TableDef(yana_test,test_connector,ArrayBuffer(ColumnDef(key,PartitionKeyColumn,IntType,false)),ArrayBuffer(),ArrayBuffer(ColumnDef(value,RegularColumn,VarCharType,false))), None, None
    at scala.Predef$.assert(Predef.scala:179)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
    at org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:278)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
    at org.apache.spark.sql.execution.SparkStrategies$BasicOperators$$anonfun$14.apply(SparkStrategies.scala:292)
    at org.apache.spark.sql.execution.SparkStrategies$BasicOperators$$anonfun$14.apply(SparkStrategies.scala:292)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.immutable.List.foreach(List.scala:318)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
    at scala.collection.AbstractTraversable.map(Traversable.scala:105)
    at org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:292)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
    at org.apache.spark.sql.execution.SparkStrategies$HashAggregation$.apply(SparkStrategies.scala:152)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
    at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:1092)
    at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:1090)
    at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:1096)
    at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:1096)
    at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:887)
    at org.apache.spark.sql.DataFrame.count(DataFrame.scala:899)
    at com.akamai.rum.test.unit.SparkTest.sparkPerfTest(SparkTest.java:123)
    at com.akamai.rum.test.unit.SparkTest.main(SparkTest.java:37)

Is there a way to pull the data out of Cassandra on each executor but not try 
to push logic down into Cassandra?
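[Editor's note] The assertion suggests the Cassandra relation's plan is being executed by a planner that lacks the connector's strategies. One possible workaround along the lines asked here is to read the table through the connector's RDD API, which fetches token ranges directly on each executor and never goes through Catalyst, then convert to a DataFrame. A sketch only, not tested; the case class mirrors the schema shown in the error (int key, text value):

```scala
import com.datastax.spark.connector._

// Hypothetical row class matching the table from the error message:
// (key int PRIMARY KEY, value text).
case class TestRow(key: Int, value: String)

// cassandraTable reads partition ranges on the executors via the RDD
// API, so no SQL pushdown or Catalyst planning of the Cassandra side.
val cassRDD = sc.cassandraTable[TestRow]("yana_test", "test_connector")

// Convert to a DataFrame owned by the same SQLContext as dfHDFS.
import sqlContext.implicits._
val cassDF = cassRDD.toDF()

dfHDFS.unionAll(cassDF).count
```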



