Evaluating spark + Cassandra for our use cases

2015-08-18 Thread Benjamin Ross
My company is interested in building a real-time time-series querying solution 
using Spark and Cassandra. Specifically, we're interested in setting up Spark 
against Cassandra and exposing it through the Hive Thrift server. We need to be 
able to perform real-time queries on time-series data - things like: how many 
accounts have spent more than $300 in total on product X in the past 3 months, 
and also purchased product Y in the past month?
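For concreteness, a query of that shape might be expressed roughly as follows (just a sketch - the monthly_rollup table, column names, and dates are placeholders rather than our actual schema, and sqlContext is assumed to be a HiveContext):

val qualifying = sqlContext.sql("""
  SELECT x.account_id
  FROM monthly_rollup x
  LEFT SEMI JOIN (
    SELECT account_id FROM monthly_rollup
    WHERE product = 'Y' AND month >= '2015-07-01'
  ) y ON x.account_id = y.account_id
  WHERE x.product = 'X' AND x.month >= '2015-05-01'
  GROUP BY x.account_id
  HAVING SUM(x.amount) > 300
""")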

These queries need to be fast - preferably sub-second, though we can live with a 
few seconds if absolutely necessary. The data sizes are in the millions of 
records once rolled up to monthly records - something on the order of 100M per 
customer.

My question is, based on experience, how hard would it be to get Cassandra and 
Spark working together to give us sub-second response times for this use case? 
Note that we'll need to use DataStax Enterprise (which is unappealing from a 
cost standpoint) because it's the only distribution that provides the Hive/Spark 
Thrift server against Cassandra.

The two top contenders for our solution are Spark+Cassandra and Druid.

Neither of these solutions works perfectly out of the box:

-  Druid would need to be modified, possibly hacked, to support the 
queries we require. I'm also not clear on how operationally ready it is.

-  Cassandra and Spark would require paying for DataStax 
Enterprise. It also feels like it's going to be tricky to configure 
Cassandra and Spark to be lightning fast for our use case. Finally, window 
functions (which we need - see above) are not supported unless we use a 
pre-release milestone of the DataStax Spark Cassandra connector.

I was wondering if anyone had any thoughts.  How easy is it to get Spark and 
Cassandra down to sub-second speeds in our use case?

Thanks,
Ben


RE: Evaluating spark + Cassandra for our use cases

2015-08-18 Thread Benjamin Ross
Hi Jorn,
Of course we're planning on doing a proof of concept here - the difficulty is 
that our timeline is short, so we cannot afford too many PoCs before we have to 
make a decision.  We also need to figure out *which* databases to proof of 
concept.

Note that one tricky aspect of our problem is that we need to support window 
functions partitioned on a per-account basis. I've found that support for 
window functions is very limited in most databases, and even where available 
they're generally slow.
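To make that concrete, what we need is roughly the following (a sketch only; df and the column names are hypothetical):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// rolling three-month spend, computed separately for each account
val w = Window.partitionBy("account_id").orderBy("month").rowsBetween(-2, 0)
val rolling = df.select(col("account_id"), col("month"), sum("amount").over(w).as("spend_3m"))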

Also, one customer certainly does not have 100M transactions per month. There 
are 100M transactions total for a given customer when we roll everything up to 
monthly granularity. We do not care about granularity finer than a month. There 
are also many columns that we care about - on the order of several thousand.

What makes you suggest that we do not need in-memory technology?

Ben



From: Jörn Franke [jornfra...@gmail.com]
Sent: Tuesday, August 18, 2015 4:14 PM
To: Benjamin Ross; user@spark.apache.org
Cc: Ron Gonzalez
Subject: Re: Evaluating spark + Cassandra for our use cases


Hi,

First, you need to make your SLAs clear. It does not sound to me like they are 
defined very well, or that your proposed solution is necessary for the scenario. I also 
find it hard to believe that one customer has 100 million transactions per month.

Time series data is easy to precalculate - you do not necessarily need 
in-memory technology here.
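For instance (a sketch only, with made-up table and column names), a monthly roll-up can be computed once in batch and then queried cheaply:

import org.apache.spark.sql.functions._

// one row per (account, product, month), written out ahead of query time
val monthly = transactions
  .groupBy("account_id", "product", "month")
  .agg(sum("amount").as("monthly_spend"))
monthly.write.parquet("/data/rollups/monthly")  // hypothetical output path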

I recommend that your company do a proof of concept and get more 
details/clarification on the requirements before risking millions of dollars of 
investment.



RE: Is there any external dependencies for lag() and lead() when using data frames?

2015-08-11 Thread Benjamin Ross
Jerry,
I was able to use window functions without the Hive Thrift server. Using HiveContext 
does not imply that you need the Hive Thrift server running.

Here's what I used to test this out:

var conf = new SparkConf(true).set("spark.cassandra.connection.host", "127.0.0.1")

val sc = new SparkContext(conf)
val sqlContext = new HiveContext(sc)
val df = sqlContext
  .read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "kv", "keyspace" -> "test"))
  .load()
val w = Window.orderBy("value").rowsBetween(-2, 0)


I then submitted this using spark-submit.
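For reference, the invocation was simply along these lines (the main class name here is a placeholder):

spark-submit --class com.example.WindowTest test-2.0.5-SNAPSHOT-jar-with-dependencies.jar

If the jar isn't shaded, the connector can instead be pulled in at submit time with --packages com.datastax.spark:spark-cassandra-connector_2.10:1.4.0-M1.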



From: Jerry [mailto:jerry.c...@gmail.com]
Sent: Monday, August 10, 2015 10:55 PM
To: Michael Armbrust
Cc: user
Subject: Re: Is there any external dependencies for lag() and lead() when using 
data frames?

By the way, if Hive is present in the Spark install, does it show up in the text when 
you start the spark shell? Are there any commands I can run to check whether it exists? I 
didn't set up the Spark machine that I use, so I don't know what's present or 
absent.
Thanks,
Jerry
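(One quick way to check, assuming a stock 1.4.x build: the spark-shell creates its sqlContext as a HiveContext only when Spark was built with Hive support, so inspecting its runtime class answers the question.)

scala> sqlContext.getClass.getName
// expected: "org.apache.spark.sql.hive.HiveContext" if Hive support is compiled in,
// plain "org.apache.spark.sql.SQLContext" otherwise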

On Mon, Aug 10, 2015 at 2:38 PM, Jerry 
jerry.c...@gmail.commailto:jerry.c...@gmail.com wrote:
Thanks...   looks like I now hit that bug about HiveMetaStoreClient as I now 
get the message about being unable to instantiate it. On a side note, does 
anyone know where hive-site.xml is typically located?
Thanks,
Jerry

On Mon, Aug 10, 2015 at 2:03 PM, Michael Armbrust 
mich...@databricks.commailto:mich...@databricks.com wrote:
You will need to use a HiveContext for window functions to work.
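For reference, a minimal pattern that works once a HiveContext is in place (a sketch; df and the column names are hypothetical):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lag

// previous value within each account, ordered by month
val w = Window.partitionBy("account_id").orderBy("month")
df.select(df("account_id"), df("month"), lag(df("value"), 1).over(w).as("prev_value"))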

On Mon, Aug 10, 2015 at 1:26 PM, Jerry 
jerry.c...@gmail.commailto:jerry.c...@gmail.com wrote:
Hello,
Using Apache Spark 1.4.1, I'm unable to use lag or lead when making queries against a 
data frame, and I'm trying to figure out whether I just have a bad setup or whether this 
is a bug. As for the exceptions I get: when using selectExpr() with a string as 
an argument, I get "NoSuchElementException: key not found: lag", and when using 
the select method and ...spark.sql.functions.lag I get an AnalysisException. If 
I replace lag with abs in the first case, Spark runs without exception, so none 
of the other syntax is incorrect.
As for how I'm running it: the code is written in Java, with a static method 
that takes the SparkContext as an argument; that is used to create a 
JavaSparkContext, which in turn creates an SQLContext, which loads a JSON 
file from local disk and runs those queries on that data frame object. FYI: 
the Java code is compiled, jarred, and then pointed to with -cp when starting the 
spark shell, so all I do is Test.run(sc) in the shell.
Let me know what to look for to debug this problem - I'm not sure where to start.
Thanks,
Jerry





RE: Is there any external dependencies for lag() and lead() when using data frames?

2015-08-11 Thread Benjamin Ross
I forgot to mention, my setup was:

-  Spark 1.4.1 running in standalone mode

-  DataStax Spark Cassandra connector 1.4.0-M1

-  Cassandra DB

-  Scala version 2.10.4







How to run start-thrift-server in debug mode?

2015-08-07 Thread Benjamin Ross
Hi,
I'm trying to run the Hive Thrift server in debug mode. I've tried to simply 
pass -Xdebug 
-Xrunjdwp:transport=dt_socket,address=127.0.0.1:,server=y,suspend=n to 
start-thriftserver.sh as a driver option, but it doesn't seem to host a debug server. 
I've then tried to edit the various shell scripts that launch the Thrift server, 
but couldn't get things to work. It seems that there must be an easier way to 
do this. I've also tried to run it directly in Eclipse, but ran into 
Scala-related issues that I haven't quite figured out yet.

start-thriftserver.sh --driver-java-options 
-agentlib:jdwp=transport=dt_socket,address=localhost:8000,server=y,suspend=n 
-XX:MaxPermSize=512  --master yarn://localhost:9000 --num-executors 2


jdb -attach localhost:8000
java.net.ConnectException: Connection refused
at java.net.PlainSocketImpl.socketConnect(Native Method)
at 
java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
at 
java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
at 
java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:579)
at 
com.sun.tools.jdi.SocketTransportService.attach(SocketTransportService.java:222)
at 
com.sun.tools.jdi.GenericAttachingConnector.attach(GenericAttachingConnector.java:116)
at 
com.sun.tools.jdi.SocketAttachingConnector.attach(SocketAttachingConnector.java:90)
at 
com.sun.tools.example.debug.tty.VMConnection.attachTarget(VMConnection.java:519)
at 
com.sun.tools.example.debug.tty.VMConnection.open(VMConnection.java:328)
at com.sun.tools.example.debug.tty.Env.init(Env.java:63)
at com.sun.tools.example.debug.tty.TTY.main(TTY.java:1066)

Let me know if I'm missing something here...
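One thing worth double-checking (an assumption on my part, not something I've verified against this setup): without quotes, everything after the first space never reaches --driver-java-options, so the agentlib string, the MaxPermSize flag, and the master arguments get split apart by the shell. The quoted form would be roughly:

start-thriftserver.sh \
  --driver-java-options "-agentlib:jdwp=transport=dt_socket,address=localhost:8000,server=y,suspend=n -XX:MaxPermSize=512m" \
  --master yarn://localhost:9000 --num-executors 2

(Note that MaxPermSize needs a unit, e.g. 512m.)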
Thanks in advance,
Ben


Failed to load class for data source: org.apache.spark.sql.cassandra

2015-07-30 Thread Benjamin Ross
Hey all,
I'm running what should be a very straightforward application of the Cassandra 
SQL connector, and I'm getting an error:

Exception in thread "main" java.lang.RuntimeException: Failed to load class for 
data source: org.apache.spark.sql.cassandra
at scala.sys.package$.error(package.scala:27)
at 
org.apache.spark.sql.sources.ResolvedDataSource$.lookupDataSource(ddl.scala:220)
at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:233)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:114)
at com.latticeengines.test.CassandraTest$.main(CassandraTest.scala:33)
at com.latticeengines.test.CassandraTest.main(CassandraTest.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:665)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:170)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
15/07/30 15:34:47 INFO spark.SparkContext: Invoking stop() from shutdown hook

My jar is shaded, so I assume this shouldn't happen?

Here's the code I'm trying to run:
object CassandraTest {
  def main(args: Array[String]) {
    println("Hello, scala!")

    var conf = new SparkConf(true).set("spark.cassandra.connection.host", "127.0.0.1")

    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    val df = sqlContext
      .read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("table" -> "kv", "keyspace" -> "test"))
      .load()
    val w = Window.orderBy("value").rowsBetween(-2, 0)
    df.select(mean("value").over(w))
  }
}



RE: Failed to load class for data source: org.apache.spark.sql.cassandra

2015-07-30 Thread Benjamin Ross
I'm submitting the application this way:
spark-submit  test-2.0.5-SNAPSHOT-jar-with-dependencies.jar

I've confirmed that org.apache.spark.sql.cassandra and org.apache.cassandra 
classes are in the jar.

Apologies for this relatively newbie question - I'm still new to both spark and 
scala.
Thanks,
Ben





RE: Failed to load class for data source: org.apache.spark.sql.cassandra

2015-07-30 Thread Benjamin Ross
If anyone's curious, the issue here is that I was using version 1.2.4 of the 
DataStax Spark Cassandra connector, rather than the 1.4.0-M1 pre-release. 
1.2.4 doesn't fully support data frames, and DataFrame support is presumably still 
experimental in 1.4.0-M1.
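For anyone hitting the same thing, the dependency I switched to looks like this in the pom (for a Scala 2.10 build):

<dependency>
  <groupId>com.datastax.spark</groupId>
  <artifactId>spark-cassandra-connector_2.10</artifactId>
  <version>1.4.0-M1</version>
</dependency>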

Ben





NoClassDefFoundError: scala/collection/GenTraversableOnce$class

2015-07-29 Thread Benjamin Ross
Hello all,
I'm new to both Spark and Scala, and am running into an annoying error while 
attempting to prototype some Spark functionality. From forums I've read 
online, this error should only present itself if there's a version mismatch 
between the version of Scala used to compile Spark and the Scala version that 
I'm using. However, that's not the case for me: I'm using Scala 2.10.4, and 
Spark was compiled against Scala 2.10.x. Perhaps I'm missing something here.

Also, the NoClassDefFoundError presents itself when debugging in Eclipse, but 
when running directly via the jar, the following error appears:
Exception in thread "main" java.lang.NoClassDefFoundError: scala/collection/Seq
at com.latticeengines.test.CassandraTest.main(CassandraTest.scala)
Caused by: java.lang.ClassNotFoundException: scala.collection.Seq
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
... 1 more

I am getting the following warning when trying to invoke maven, but it doesn't 
seem to be related to the underlying issue:
[INFO] Checking for multiple versions of scala
[WARNING]  Expected all dependencies to require Scala version: 2.10.4
[WARNING]  com.mycompany:test:2.0.5-SNAPSHOT requires scala version: 2.10.4
[WARNING]  com.twitter:chill_2.10:0.5.0 requires scala version: 2.10.4
[WARNING]  org.spark-project.akka:akka-remote_2.10:2.3.4-spark requires scala 
version: 2.10.4
[WARNING]  org.spark-project.akka:akka-actor_2.10:2.3.4-spark requires scala 
version: 2.10.4
[WARNING]  org.spark-project.akka:akka-slf4j_2.10:2.3.4-spark requires scala 
version: 2.10.4
[WARNING]  org.apache.spark:spark-core_2.10:1.4.1 requires scala version: 2.10.4
[WARNING]  org.json4s:json4s-jackson_2.10:3.2.10 requires scala version: 2.10.0
[WARNING] Multiple versions of scala libraries detected!
[INFO] includes = [**/*.scala,**/*.java,]

Here's the code I'm trying to run:

object CassandraTest {
  def main(args: Array[String]) {
    println("Hello, scala!")

    val conf = new SparkConf(true)
      .set("spark.cassandra.connection.host", "127.0.0.1")
      .set("spark.driver.extraClassPath",
        "/home/bross/.m2/repository/com/datastax/spark/spark-cassandra-connector_2.10/1.2.4/spark-cassandra-connector_2.10-1.2.4.jar;/home/bross/.m2/repository/com/datastax/spark/spark-cassandra-connector_2.10/1.2.4/spark-cassandra-connector_2.10-1.2.4.jar;/home/bross/.m2/repository/org/scala-lang/scala-library/2.10.4/scala-library-2.10.4.jar")

    val sc = new SparkContext("local", "test", conf)
    val sqlContext = new SQLContext(sc)
    val df = sqlContext
      .read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("table" -> "kv", "keyspace" -> "test"))
      .load()
    val w = Window.orderBy("value").rowsBetween(-2, 0)
    df.select(mean("value").over(w))
  }
}

Here's my maven file:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">

    <modelVersion>4.0.0</modelVersion>
    <artifactId>test</artifactId>
    <packaging>jar</packaging>
    <name>${component-name}</name>

    <properties>
        <component-name>le-sparkdb</component-name>
        <hadoop.version>2.6.0.2.2.0.0-2041</hadoop.version>
        <scala.version>2.10.4</scala.version>
        <spark.version>1.4.1</spark.version>
        <avro.version>1.7.7</avro.version>
        <parquet.avro.version>1.4.3</parquet.avro.version>
        <le.domain.version>2.0.5-SNAPSHOT</le.domain.version>
        <le.common.version>2.0.5-SNAPSHOT</le.common.version>
        <le.eai.version>2.0.5-SNAPSHOT</le.eai.version>
        <spark.cassandra.version>1.2.4</spark.cassandra.version>
    </properties>

    <parent>
        <groupId>com.mycompany</groupId>
        <artifactId>le-parent</artifactId>
        <version>2.0.5-SNAPSHOT</version>
        <relativePath>le-parent</relativePath>
    </parent>

    <build>
        <plugins>
            <plugin>
                <groupId>org.scala-tools</groupId>
                <artifactId>maven-scala-plugin</artifactId>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-eclipse-plugin</artifactId>
                <version>${maven.eclipse.version}</version>
                <configuration>

RE: NoClassDefFoundError: scala/collection/GenTraversableOnce$class

2015-07-29 Thread Benjamin Ross
Hey Ted,
Thanks for the quick response.  Sadly, all of those are 2.10.x:
$ mvn dependency:tree | grep -A 2 -B 2 org.scala-lang
[INFO] |  |  \- org.tukaani:xz:jar:1.0:compile
[INFO] |  \- org.slf4j:slf4j-api:jar:1.6.4:compile
[INFO] +- org.scala-lang:scala-library:jar:2.10.4:compile
[INFO] +- org.apache.spark:spark-core_2.10:jar:1.4.1:compile
[INFO] |  +- com.twitter:chill_2.10:jar:0.5.0:compile
--
[INFO] |  |  \- org.json4s:json4s-core_2.10:jar:3.2.10:compile
[INFO] |  | +- org.json4s:json4s-ast_2.10:jar:3.2.10:compile
[INFO] |  | \- org.scala-lang:scalap:jar:2.10.0:compile
[INFO] |  +- com.sun.jersey:jersey-server:jar:1.9:compile
[INFO] |  |  \- asm:asm:jar:3.1:compile
--
[INFO] +- org.apache.spark:spark-sql_2.10:jar:1.4.1:compile
[INFO] |  +- org.apache.spark:spark-catalyst_2.10:jar:1.4.1:compile
[INFO] |  |  +- org.scala-lang:scala-compiler:jar:2.10.4:compile
[INFO] |  |  \- org.scalamacros:quasiquotes_2.10:jar:2.0.1:compile
[INFO] |  \- org.jodd:jodd-core:jar:3.6.3:compile
--
[INFO] |  +- org.joda:joda-convert:jar:1.2:compile
[INFO] |  +- com.twitter:jsr166e:jar:1.1.0:compile
[INFO] |  \- org.scala-lang:scala-reflect:jar:2.10.5:compile
[INFO] +- 
com.datastax.spark:spark-cassandra-connector-java_2.10:jar:1.2.4:compile
[INFO] +- commons-codec:commons-codec:jar:1.4:compile

Ben


From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: Wednesday, July 29, 2015 8:30 PM
To: Benjamin Ross
Cc: user@spark.apache.org
Subject: Re: NoClassDefFoundError: scala/collection/GenTraversableOnce$class

You can generate dependency tree using:

mvn dependency:tree

and grep for 'org.scala-lang' in the output to see if there is any clue.

Cheers
