RE: Evaluating spark + Cassandra for our use cases
Hi Jörn,

Of course we're planning on doing a proof of concept here - the difficulty is that our timeline is short, so we cannot afford too many PoCs before we have to make a decision. We also need to figure out *which* databases to proof-of-concept.

Note that one tricky aspect of our problem is that we need to support window functions partitioned on a per-account basis. I've found that support for window functions is very limited in most databases, and that they're generally slow where they are available.

Also, one customer certainly does not have 100M transactions per month. There are 100M transactions total for a given customer when we roll everything up to be per-month. We do not care about granularity smaller than a month. There are also many columns that we care about - on the order of thousands.

What makes you suggest that we do not need in-memory technology?

Ben

From: Jörn Franke [jornfra...@gmail.com]
Sent: Tuesday, August 18, 2015 4:14 PM
To: Benjamin Ross; user@spark.apache.org
Cc: Ron Gonzalez
Subject: Re: Evaluating spark + Cassandra for our use cases

Hi,

First, you need to make your SLAs clear. It does not sound to me as if they are well defined, or as if your solution is necessary for the scenario. I also find it hard to believe that one customer has 100 million transactions per month. Time-series data is easy to precalculate - you do not necessarily need in-memory technology here. I recommend your company do a proof of concept and get more details/clarification on the requirements before risking millions of dollars of investment.

On Tue, Aug 18, 2015 at 9:18 PM, Benjamin Ross <br...@lattice-engines.com> wrote:

My company is interested in building a real-time time-series querying solution using Spark and Cassandra. Specifically, we're interested in setting up a Spark system against Cassandra running a Hive thrift server. We need to be able to perform real-time queries on time-series data - things like: how many accounts have spent, in total, more than $300 on product X in the past 3 months, and purchased product Y in the past month? These queries need to be fast - preferably sub-second, but we can deal with a few seconds if absolutely necessary. The data sizes are in the millions of records when rolled up to per-month records - something on the order of 100M per customer.

My question is: based on experience, how hard would it be to get Cassandra and Spark working together to give us sub-second response times in this use case? Note that we'll need to use DataStax Enterprise (which is unappealing from a cost standpoint) because it's the only thing that provides the Hive Spark thrift server for Cassandra.

The two top contenders for our solution are Spark+Cassandra and Druid. Neither works perfectly out of the box:
- Druid would need to be modified, possibly hacked, to support the queries we require. I'm also not clear on how operationally ready it is.
- Cassandra and Spark would require paying for DataStax Enterprise.

It really feels like it's going to be tricky to configure Cassandra and Spark to be lightning fast for our use case. Finally, window functions (which we need - see above) are not supported unless we use a pre-release milestone of the DataStax Spark Cassandra connector.

I was wondering if anyone had any thoughts. How easy is it to get Spark and Cassandra down to sub-second speeds in our use case?

Thanks,
Ben
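P.S. To make the window-function requirement concrete: the per-account computation we need is shaped roughly like the sketch below (hypothetical account_id/month/spent columns on a rolled-up table - not our real schema):

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._

    // Trailing three-month spend for every account: one row per (account, month),
    // where each window covers the current month and the two before it.
    val w = Window.partitionBy("account_id").orderBy("month").rowsBetween(-2, 0)
    val rolling = df.select(col("account_id"), col("month"),
      sum("spent").over(w).as("spend_3mo"))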
Evaluating spark + Cassandra for our use cases
My company is interested in building a real-time time-series querying solution using Spark and Cassandra. Specifically, we're interested in setting up a Spark system against Cassandra running a Hive thrift server. We need to be able to perform real-time queries on time-series data - things like: how many accounts have spent, in total, more than $300 on product X in the past 3 months, and purchased product Y in the past month? These queries need to be fast - preferably sub-second, but we can deal with a few seconds if absolutely necessary. The data sizes are in the millions of records when rolled up to per-month records - something on the order of 100M per customer.

My question is: based on experience, how hard would it be to get Cassandra and Spark working together to give us sub-second response times in this use case? Note that we'll need to use DataStax Enterprise (which is unappealing from a cost standpoint) because it's the only thing that provides the Hive Spark thrift server for Cassandra.

The two top contenders for our solution are Spark+Cassandra and Druid. Neither works perfectly out of the box:
- Druid would need to be modified, possibly hacked, to support the queries we require. I'm also not clear on how operationally ready it is.
- Cassandra and Spark would require paying for DataStax Enterprise.

It really feels like it's going to be tricky to configure Cassandra and Spark to be lightning fast for our use case. Finally, window functions (which we need - see above) are not supported unless we use a pre-release milestone of the DataStax Spark Cassandra connector.

I was wondering if anyone had any thoughts. How easy is it to get Spark and Cassandra down to sub-second speeds in our use case?

Thanks,
Ben
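P.S. For concreteness, the first example query above would look something like this in the DataFrame API (a sketch only - hypothetical account_id/product/month/spend columns, not our real schema):

    import org.apache.spark.sql.functions._

    // Accounts with > $300 total on product X in the past 3 months...
    val spentOnX = df.filter(col("product") === "X" && col("month") >= "2015-05")
      .groupBy("account_id")
      .agg(sum("spend").as("total_spend"))
      .filter(col("total_spend") > 300)

    // ...that also purchased product Y in the past month.
    val boughtY = df.filter(col("product") === "Y" && col("month") >= "2015-07")
      .select("account_id")
      .distinct()

    val matches = spentOnX.join(boughtY, "account_id").count()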
RE: Is there any external dependencies for lag() and lead() when using data frames?
I forgot to mention, my setup was:
- Spark 1.4.1 running in standalone mode
- DataStax Spark Cassandra connector 1.4.0-M1
- Cassandra DB
- Scala version 2.10.4

From: Benjamin Ross
Sent: Tuesday, August 11, 2015 10:16 AM
To: Jerry; Michael Armbrust
Cc: user
Subject: RE: Is there any external dependencies for lag() and lead() when using data frames?

Jerry,
I was able to use window functions without the Hive thrift server. HiveContext does not imply that you need the Hive thrift server running. Here's what I used to test this out:

    val conf = new SparkConf(true).set("spark.cassandra.connection.host", "127.0.0.1")
    val sc = new SparkContext(conf)
    val sqlContext = new HiveContext(sc)
    val df = sqlContext
      .read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("table" -> "kv", "keyspace" -> "test"))
      .load()
    val w = Window.orderBy("value").rowsBetween(-2, 0)

I then submitted this using spark-submit.

From: Jerry [mailto:jerry.c...@gmail.com]
Sent: Monday, August 10, 2015 10:55 PM
To: Michael Armbrust
Cc: user
Subject: Re: Is there any external dependencies for lag() and lead() when using data frames?

By the way, if Hive is present in the Spark install, does it show up in the text when you start the Spark shell? Are there any commands I can run to check whether it exists? I didn't set up the Spark machine that I use, so I don't know what's present or absent.
Thanks,
Jerry

On Mon, Aug 10, 2015 at 2:38 PM, Jerry <jerry.c...@gmail.com> wrote:
Thanks... looks like I've now hit that bug about HiveMetaStoreClient, as I now get the message about being unable to instantiate it. On a side note, does anyone know where hive-site.xml is typically located?
Thanks,
Jerry

On Mon, Aug 10, 2015 at 2:03 PM, Michael Armbrust <mich...@databricks.com> wrote:
You will need to use a HiveContext for window functions to work.

On Mon, Aug 10, 2015 at 1:26 PM, Jerry <jerry.c...@gmail.com> wrote:
Hello,
Using Apache Spark 1.4.1, I'm unable to use lag or lead when making queries against a data frame, and I'm trying to figure out whether I just have a bad setup or this is a bug. As for the exceptions I get: when using selectExpr() with a string as an argument, I get "NoSuchElementException: key not found: lag", and when using the select method with ...spark.sql.functions.lag I get an AnalysisException. If I replace lag with abs in the first case, Spark runs without exception, so none of the other syntax is incorrect.

As for how I'm running it: the code is written in Java, with a static method that takes the SparkContext as an argument, which is used to create a JavaSparkContext, which in turn is used to create an SQLContext, which loads a JSON file from the local disk and runs those queries on that data frame object. FYI: the Java code is compiled, jarred, and then pointed to with -cp when starting the Spark shell, so all I do is run "Test.run(sc)" in the shell.

Let me know what to look for to debug this problem. I'm not sure where to look to solve it.
Thanks,
Jerry
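P.S. Jerry - re: checking whether your build has Hive support: one quick check from the Spark shell (treat this as a guess on my part; I haven't verified it across builds) is to see whether the pre-built sqlContext is actually a HiveContext:

    // true when the shell's build includes Hive support
    sqlContext.isInstanceOf[org.apache.spark.sql.hive.HiveContext]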
RE: Is there any external dependencies for lag() and lead() when using data frames?
Jerry,
I was able to use window functions without the Hive thrift server. HiveContext does not imply that you need the Hive thrift server running. Here's what I used to test this out:

    val conf = new SparkConf(true).set("spark.cassandra.connection.host", "127.0.0.1")
    val sc = new SparkContext(conf)
    val sqlContext = new HiveContext(sc)
    val df = sqlContext
      .read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("table" -> "kv", "keyspace" -> "test"))
      .load()
    val w = Window.orderBy("value").rowsBetween(-2, 0)

I then submitted this using spark-submit.

From: Jerry [mailto:jerry.c...@gmail.com]
Sent: Monday, August 10, 2015 10:55 PM
To: Michael Armbrust
Cc: user
Subject: Re: Is there any external dependencies for lag() and lead() when using data frames?

By the way, if Hive is present in the Spark install, does it show up in the text when you start the Spark shell? Are there any commands I can run to check whether it exists? I didn't set up the Spark machine that I use, so I don't know what's present or absent.
Thanks,
Jerry

On Mon, Aug 10, 2015 at 2:38 PM, Jerry <jerry.c...@gmail.com> wrote:
Thanks... looks like I've now hit that bug about HiveMetaStoreClient, as I now get the message about being unable to instantiate it. On a side note, does anyone know where hive-site.xml is typically located?
Thanks,
Jerry

On Mon, Aug 10, 2015 at 2:03 PM, Michael Armbrust <mich...@databricks.com> wrote:
You will need to use a HiveContext for window functions to work.

On Mon, Aug 10, 2015 at 1:26 PM, Jerry <jerry.c...@gmail.com> wrote:
Hello,
Using Apache Spark 1.4.1, I'm unable to use lag or lead when making queries against a data frame, and I'm trying to figure out whether I just have a bad setup or this is a bug. As for the exceptions I get: when using selectExpr() with a string as an argument, I get "NoSuchElementException: key not found: lag", and when using the select method with ...spark.sql.functions.lag I get an AnalysisException. If I replace lag with abs in the first case, Spark runs without exception, so none of the other syntax is incorrect.

As for how I'm running it: the code is written in Java, with a static method that takes the SparkContext as an argument, which is used to create a JavaSparkContext, which in turn is used to create an SQLContext, which loads a JSON file from the local disk and runs those queries on that data frame object. FYI: the Java code is compiled, jarred, and then pointed to with -cp when starting the Spark shell, so all I do is run "Test.run(sc)" in the shell.

Let me know what to look for to debug this problem. I'm not sure where to look to solve it.
Thanks,
Jerry
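P.S. The snippet above stops at defining the window; an actual lag over the same table would look roughly like this (a sketch against my test table's value column - lag takes its offset as an argument, so a plain ordered window is enough):

    import org.apache.spark.sql.functions._

    val wl = Window.orderBy("value")
    df.select(lag(col("value"), 1).over(wl).as("prev_value")).show()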
How to run start-thrift-server in debug mode?
Hi,
I'm trying to run the Hive thrift server in debug mode. I've tried simply passing

    -Xdebug -Xrunjdwp:transport=dt_socket,address=127.0.0.1:,server=y,suspend=n

to start-thriftserver.sh as a driver option, but it doesn't seem to host a debug server. I then tried to edit the various shell scripts that run the Hive thrift server, but couldn't get things to work. It seems there must be an easier way to do this. I've also tried to run it directly in Eclipse, but ran into Scala-related issues that I haven't quite figured out yet.

Here's the invocation and the failed attach:

    start-thriftserver.sh --driver-java-options "-agentlib:jdwp=transport=dt_socket,address=localhost:8000,server=y,suspend=n -XX:MaxPermSize=512" --master yarn://localhost:9000 --num-executors 2

    jdb -attach localhost:8000
    java.net.ConnectException: Connection refused
            at java.net.PlainSocketImpl.socketConnect(Native Method)
            at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
            at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
            at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
            at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
            at java.net.Socket.connect(Socket.java:579)
            at com.sun.tools.jdi.SocketTransportService.attach(SocketTransportService.java:222)
            at com.sun.tools.jdi.GenericAttachingConnector.attach(GenericAttachingConnector.java:116)
            at com.sun.tools.jdi.SocketAttachingConnector.attach(SocketAttachingConnector.java:90)
            at com.sun.tools.example.debug.tty.VMConnection.attachTarget(VMConnection.java:519)
            at com.sun.tools.example.debug.tty.VMConnection.open(VMConnection.java:328)
            at com.sun.tools.example.debug.tty.Env.init(Env.java:63)
            at com.sun.tools.example.debug.tty.TTY.main(TTY.java:1066)

Let me know if I'm missing something here...
Thanks in advance,
Ben
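P.S. One variant I still mean to try (an untested sketch, not something I've confirmed works): setting suspend=y so the driver JVM blocks at startup until a debugger attaches. If the server then hangs waiting, at least I'll know the agent options reached the right JVM:

    start-thriftserver.sh --driver-java-options \
      "-agentlib:jdwp=transport=dt_socket,address=8000,server=y,suspend=y"
    jdb -attach localhost:8000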
RE: Failed to load class for data source: org.apache.spark.sql.cassandra
If anyone's curious, the issue here is that I was using the 1.2.4 release of the DataStax Spark Cassandra connector, rather than the 1.4.0-M1 pre-release. 1.2.4 doesn't fully support data frames, and they're presumably still only experimental in 1.4.0-M1.
Ben

From: Benjamin Ross
Sent: Thursday, July 30, 2015 4:14 PM
To: user@spark.apache.org
Subject: RE: Failed to load class for data source: org.apache.spark.sql.cassandra

I'm submitting the application this way:

    spark-submit test-2.0.5-SNAPSHOT-jar-with-dependencies.jar

I've confirmed that org.apache.spark.sql.cassandra and org.apache.cassandra classes are in the jar. Apologies for this relatively newbie question - I'm still new to both Spark and Scala.
Thanks,
Ben

From: Benjamin Ross
Sent: Thursday, July 30, 2015 3:45 PM
To: user@spark.apache.org
Subject: Failed to load class for data source: org.apache.spark.sql.cassandra

Hey all,
I'm running what should be a very straightforward application of the Cassandra SQL connector, and I'm getting an error:

    Exception in thread "main" java.lang.RuntimeException: Failed to load class for data source: org.apache.spark.sql.cassandra
            at scala.sys.package$.error(package.scala:27)
            at org.apache.spark.sql.sources.ResolvedDataSource$.lookupDataSource(ddl.scala:220)
            at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:233)
            at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:114)
            at com.latticeengines.test.CassandraTest$.main(CassandraTest.scala:33)
            at com.latticeengines.test.CassandraTest.main(CassandraTest.scala)
            at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
            at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
            at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
            at java.lang.reflect.Method.invoke(Method.java:606)
            at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:665)
            at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:170)
            at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193)
            at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
            at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
    15/07/30 15:34:47 INFO spark.SparkContext: Invoking stop() from shutdown hook

My jar is shaded, so I assume this shouldn't be happening? Here's the code I'm trying to run:

    object CassandraTest {
      def main(args: Array[String]) {
        println("Hello, scala!")
        val conf = new SparkConf(true).set("spark.cassandra.connection.host", "127.0.0.1")
        val sc = new SparkContext(conf)
        val sqlContext = new SQLContext(sc)
        val df = sqlContext
          .read
          .format("org.apache.spark.sql.cassandra")
          .options(Map("table" -> "kv", "keyspace" -> "test"))
          .load()
        val w = Window.orderBy("value").rowsBetween(-2, 0)
        df.select(mean("value").over(w))
      }
    }
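For anyone who finds this thread later: concretely, the fix amounted to bumping the connector coordinate in our pom from 1.2.4 to the milestone release, along these lines:

    <dependency>
      <groupId>com.datastax.spark</groupId>
      <artifactId>spark-cassandra-connector_2.10</artifactId>
      <version>1.4.0-M1</version>
    </dependency>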
RE: Failed to load class for data source: org.apache.spark.sql.cassandra
I'm submitting the application this way:

    spark-submit test-2.0.5-SNAPSHOT-jar-with-dependencies.jar

I've confirmed that org.apache.spark.sql.cassandra and org.apache.cassandra classes are in the jar. Apologies for this relatively newbie question - I'm still new to both Spark and Scala.
Thanks,
Ben

From: Benjamin Ross
Sent: Thursday, July 30, 2015 3:45 PM
To: user@spark.apache.org
Subject: Failed to load class for data source: org.apache.spark.sql.cassandra

Hey all,
I'm running what should be a very straightforward application of the Cassandra SQL connector, and I'm getting an error:

    Exception in thread "main" java.lang.RuntimeException: Failed to load class for data source: org.apache.spark.sql.cassandra
            at scala.sys.package$.error(package.scala:27)
            at org.apache.spark.sql.sources.ResolvedDataSource$.lookupDataSource(ddl.scala:220)
            at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:233)
            at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:114)
            at com.latticeengines.test.CassandraTest$.main(CassandraTest.scala:33)
            at com.latticeengines.test.CassandraTest.main(CassandraTest.scala)
            at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
            at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
            at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
            at java.lang.reflect.Method.invoke(Method.java:606)
            at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:665)
            at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:170)
            at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193)
            at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
            at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
    15/07/30 15:34:47 INFO spark.SparkContext: Invoking stop() from shutdown hook

My jar is shaded, so I assume this shouldn't be happening? Here's the code I'm trying to run:

    object CassandraTest {
      def main(args: Array[String]) {
        println("Hello, scala!")
        val conf = new SparkConf(true).set("spark.cassandra.connection.host", "127.0.0.1")
        val sc = new SparkContext(conf)
        val sqlContext = new SQLContext(sc)
        val df = sqlContext
          .read
          .format("org.apache.spark.sql.cassandra")
          .options(Map("table" -> "kv", "keyspace" -> "test"))
          .load()
        val w = Window.orderBy("value").rowsBetween(-2, 0)
        df.select(mean("value").over(w))
      }
    }
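P.S. I suppose I could also sidestep the shading question entirely and let spark-submit resolve the connector itself - something like the line below (untested on my end; the coordinates should match whatever connector version you're on) - but I'd still like to understand why the shaded jar doesn't work:

    spark-submit --packages com.datastax.spark:spark-cassandra-connector_2.10:1.2.4 \
      test-2.0.5-SNAPSHOT-jar-with-dependencies.jar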
Failed to load class for data source: org.apache.spark.sql.cassandra
Hey all,
I'm running what should be a very straightforward application of the Cassandra SQL connector, and I'm getting an error:

    Exception in thread "main" java.lang.RuntimeException: Failed to load class for data source: org.apache.spark.sql.cassandra
            at scala.sys.package$.error(package.scala:27)
            at org.apache.spark.sql.sources.ResolvedDataSource$.lookupDataSource(ddl.scala:220)
            at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:233)
            at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:114)
            at com.latticeengines.test.CassandraTest$.main(CassandraTest.scala:33)
            at com.latticeengines.test.CassandraTest.main(CassandraTest.scala)
            at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
            at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
            at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
            at java.lang.reflect.Method.invoke(Method.java:606)
            at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:665)
            at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:170)
            at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193)
            at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
            at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
    15/07/30 15:34:47 INFO spark.SparkContext: Invoking stop() from shutdown hook

My jar is shaded, so I assume this shouldn't be happening? Here's the code I'm trying to run:

    object CassandraTest {
      def main(args: Array[String]) {
        println("Hello, scala!")
        val conf = new SparkConf(true).set("spark.cassandra.connection.host", "127.0.0.1")
        val sc = new SparkContext(conf)
        val sqlContext = new SQLContext(sc)
        val df = sqlContext
          .read
          .format("org.apache.spark.sql.cassandra")
          .options(Map("table" -> "kv", "keyspace" -> "test"))
          .load()
        val w = Window.orderBy("value").rowsBetween(-2, 0)
        df.select(mean("value").over(w))
      }
    }
RE: NoClassDefFoundError: scala/collection/GenTraversableOnce$class
Hey Ted,
Thanks for the quick response. Sadly, all of those are 2.10.x:

    $ mvn dependency:tree | grep -A 2 -B 2 org.scala-lang
    [INFO] |  |  \- org.tukaani:xz:jar:1.0:compile
    [INFO] |  \- org.slf4j:slf4j-api:jar:1.6.4:compile
    [INFO] +- org.scala-lang:scala-library:jar:2.10.4:compile
    [INFO] +- org.apache.spark:spark-core_2.10:jar:1.4.1:compile
    [INFO] |  +- com.twitter:chill_2.10:jar:0.5.0:compile
    --
    [INFO] |  |  \- org.json4s:json4s-core_2.10:jar:3.2.10:compile
    [INFO] |  |     +- org.json4s:json4s-ast_2.10:jar:3.2.10:compile
    [INFO] |  |     \- org.scala-lang:scalap:jar:2.10.0:compile
    [INFO] |  +- com.sun.jersey:jersey-server:jar:1.9:compile
    [INFO] |  |  \- asm:asm:jar:3.1:compile
    --
    [INFO] +- org.apache.spark:spark-sql_2.10:jar:1.4.1:compile
    [INFO] |  +- org.apache.spark:spark-catalyst_2.10:jar:1.4.1:compile
    [INFO] |  |  +- org.scala-lang:scala-compiler:jar:2.10.4:compile
    [INFO] |  |  \- org.scalamacros:quasiquotes_2.10:jar:2.0.1:compile
    [INFO] |  \- org.jodd:jodd-core:jar:3.6.3:compile
    --
    [INFO] |  +- org.joda:joda-convert:jar:1.2:compile
    [INFO] |  +- com.twitter:jsr166e:jar:1.1.0:compile
    [INFO] |  \- org.scala-lang:scala-reflect:jar:2.10.5:compile
    [INFO] +- com.datastax.spark:spark-cassandra-connector-java_2.10:jar:1.2.4:compile
    [INFO] +- commons-codec:commons-codec:jar:1.4:compile

Ben

From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: Wednesday, July 29, 2015 8:30 PM
To: Benjamin Ross
Cc: user@spark.apache.org
Subject: Re: NoClassDefFoundError: scala/collection/GenTraversableOnce$class

You can generate a dependency tree using:

    mvn dependency:tree

and grep for 'org.scala-lang' in the output to see if there is any clue.
Cheers

On Wed, Jul 29, 2015 at 5:14 PM, Benjamin Ross <br...@lattice-engines.com> wrote:
Hello all,
I'm new to both Spark and Scala, and am running into an annoying error attempting to prototype some Spark functionality. From forums I've read online, this error should only present itself if there's a version mismatch between the version of Scala used to compile Spark and the Scala version that I'm using. However, that's not the case for me: I'm using Scala 2.10.4, and Spark was compiled against Scala 2.10.x. Perhaps I'm missing something here.

Also, the NoClassDefFoundError presents itself when debugging in Eclipse, but when running directly via the jar, the following error appears:

    Exception in thread "main" java.lang.NoClassDefFoundError: scala/collection/Seq
            at com.latticeengines.test.CassandraTest.main(CassandraTest.scala)
    Caused by: java.lang.ClassNotFoundException: scala.collection.Seq
            at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
            at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
            at java.security.AccessController.doPrivileged(Native Method)
            at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
            at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
            at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
            at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
            ... 1 more

I am also getting the following warning when invoking Maven, but it doesn't seem to be related to the underlying issue:

    [INFO] Checking for multiple versions of scala
    [WARNING]  Expected all dependencies to require Scala version: 2.10.4
    [WARNING]  com.mycompany:test:2.0.5-SNAPSHOT requires scala version: 2.10.4
    [WARNING]  com.twitter:chill_2.10:0.5.0 requires scala version: 2.10.4
    [WARNING]  org.spark-project.akka:akka-remote_2.10:2.3.4-spark requires scala version: 2.10.4
    [WARNING]  org.spark-project.akka:akka-actor_2.10:2.3.4-spark requires scala version: 2.10.4
    [WARNING]  org.spark-project.akka:akka-slf4j_2.10:2.3.4-spark requires scala version: 2.10.4
    [WARNING]  org.apache.spark:spark-core_2.10:1.4.1 requires scala version: 2.10.4
    [WARNING]  org.json4s:json4s-jackson_2.10:3.2.10 requires scala version: 2.10.0
    [WARNING] Multiple versions of scala libraries detected!
    [INFO] includes = [**/*.scala,**/*.java,]

Here's the code I'm trying to run:

    object CassandraTest {
      def main(args: Array[String]) {
        println("Hello, scala!")
        val conf = new SparkConf(true).set("spark.cassandra.connection.host", "127.0.0.1").set(
          "spark.driver.extraClassPath",
          "/home/bross/.m2/repository/com/datastax/spark/spark-cassandra-connector_2.10/1.2.4/spark-cassandra-connector_2.10-1.2.4.jar;/home/bross/.m2/repository/com/datastax/spark/spark-cassandra-connector_2.10/1.2.4/spark-cassandra-connector_2.10-1.2.4.jar;/home/bross/.m2/repository/org/scala-lang/scala-library/2.10.4/scala-library-2.10.4.jar")
        val sc = new SparkContext("local", "test", conf)
        val sqlContext = new SQLContext(sc)
        val df = sqlContext
          .read
          .format("org.apache.spark.sql.cassandra")
          .options(Map("table" -> "kv", "keyspace" -> "test"))
          .load()
        val w = Window.orderBy("value").rowsBetween(-2, 0)
        df.select(mean("value").over(w))
      }
    }
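P.S. Another data point I can grab (just an idea at this point, not a confirmed diagnosis): print which Scala runtime the process actually loads, since a NoClassDefFoundError on scala.collection.Seq looks more like the Scala library missing from the runtime classpath than a compile-time version mismatch:

    // Prints e.g. "version 2.10.4"; scala.util.Properties ships with scala-library,
    // so this line failing at runtime would itself point at the missing jar.
    println(scala.util.Properties.versionString)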
NoClassDefFoundError: scala/collection/GenTraversableOnce$class
Hello all,
I'm new to both Spark and Scala, and am running into an annoying error attempting to prototype some Spark functionality. From forums I've read online, this error should only present itself if there's a version mismatch between the version of Scala used to compile Spark and the Scala version that I'm using. However, that's not the case for me: I'm using Scala 2.10.4, and Spark was compiled against Scala 2.10.x. Perhaps I'm missing something here.

Also, the NoClassDefFoundError presents itself when debugging in Eclipse, but when running directly via the jar, the following error appears:

    Exception in thread "main" java.lang.NoClassDefFoundError: scala/collection/Seq
            at com.latticeengines.test.CassandraTest.main(CassandraTest.scala)
    Caused by: java.lang.ClassNotFoundException: scala.collection.Seq
            at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
            at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
            at java.security.AccessController.doPrivileged(Native Method)
            at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
            at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
            at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
            at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
            ... 1 more

I am also getting the following warning when invoking Maven, but it doesn't seem to be related to the underlying issue:

    [INFO] Checking for multiple versions of scala
    [WARNING]  Expected all dependencies to require Scala version: 2.10.4
    [WARNING]  com.mycompany:test:2.0.5-SNAPSHOT requires scala version: 2.10.4
    [WARNING]  com.twitter:chill_2.10:0.5.0 requires scala version: 2.10.4
    [WARNING]  org.spark-project.akka:akka-remote_2.10:2.3.4-spark requires scala version: 2.10.4
    [WARNING]  org.spark-project.akka:akka-actor_2.10:2.3.4-spark requires scala version: 2.10.4
    [WARNING]  org.spark-project.akka:akka-slf4j_2.10:2.3.4-spark requires scala version: 2.10.4
    [WARNING]  org.apache.spark:spark-core_2.10:1.4.1 requires scala version: 2.10.4
    [WARNING]  org.json4s:json4s-jackson_2.10:3.2.10 requires scala version: 2.10.0
    [WARNING] Multiple versions of scala libraries detected!
    [INFO] includes = [**/*.scala,**/*.java,]

Here's the code I'm trying to run:

    object CassandraTest {
      def main(args: Array[String]) {
        println("Hello, scala!")
        val conf = new SparkConf(true).set("spark.cassandra.connection.host", "127.0.0.1").set(
          "spark.driver.extraClassPath",
          "/home/bross/.m2/repository/com/datastax/spark/spark-cassandra-connector_2.10/1.2.4/spark-cassandra-connector_2.10-1.2.4.jar;/home/bross/.m2/repository/com/datastax/spark/spark-cassandra-connector_2.10/1.2.4/spark-cassandra-connector_2.10-1.2.4.jar;/home/bross/.m2/repository/org/scala-lang/scala-library/2.10.4/scala-library-2.10.4.jar")
        val sc = new SparkContext("local", "test", conf)
        val sqlContext = new SQLContext(sc)
        val df = sqlContext
          .read
          .format("org.apache.spark.sql.cassandra")
          .options(Map("table" -> "kv", "keyspace" -> "test"))
          .load()
        val w = Window.orderBy("value").rowsBetween(-2, 0)
        df.select(mean("value").over(w))
      }
    }

Here's my maven file:

    <!-- NOTE: the XML tags of this pom were stripped in the archive; the structure
         below is reconstructed from the surviving values, and elements marked
         "guessed" are inferred rather than original. -->
    <project xmlns="http://maven.apache.org/POM/4.0.0"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
      <modelVersion>4.0.0</modelVersion>
      <artifactId>test</artifactId>
      <packaging>jar</packaging>
      <name>${component-name}</name>
      <parent>
        <groupId>com.mycompany</groupId>
        <artifactId>le-parent</artifactId>
        <version>2.0.5-SNAPSHOT</version>
        <relativePath>le-parent</relativePath>
      </parent>
      <properties>
        <component-name>le-sparkdb</component-name>
        <hadoop.version>2.6.0.2.2.0.0-2041</hadoop.version>  <!-- guessed name -->
        <scala.version>2.10.4</scala.version>
        <spark.version>1.4.1</spark.version>
        <!-- further version values whose property names were lost:
             1.7.7, 1.4.3, 2.0.5-SNAPSHOT (x3), 1.2.4 (spark-cassandra-connector) -->
      </properties>
      <build>
        <plugins>
          <plugin>
            <groupId>org.scala-tools</groupId>
            <artifactId>maven-scala-plugin</artifactId>
            <executions>
              <execution>
                <goals>
                  <goal>compile</goal>
                  <goal>testCompile</goal>
                </goals>
              </execution>
            </executions>
          </plugin>
          <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-eclipse-plugin</artifactId>
            <version>${maven.eclipse.version}</version>
            <configuration>
              <downloadSources>true</downloadSources>    <!-- guessed names for the -->
              <downloadJavadocs>true</downloadJavadocs>  <!-- two "true" values -->
              <projectnatures>
                <projectnature>org.scala-ide.sdt.core.scalanature</projectnature>
                <projectnature>org.eclipse.jdt.core.javanature</projectnature>
              </projectnatures>
              <buildcommands>
                <buildcommand>org.scala-ide.sdt.core.scalabuilder</buildcommand>
              </buildcommands>
              <classpathContainers>
                <classpathContainer>org.scala-ide.sdt.launching.SCALA_CONTAINER</classpathContainer>
                <classpathContainer>org.eclipse.jdt.launching.JRE_CONTAINER</classpathContainer>
              </classpathContainers>
              <excludes>
                <exclude>org.scala-lang:scala-library</exclude>
                <exclude>org.scala-lang:scala-compiler</exclude>
              </excludes>
              <sourceIncludes>
                <sourceInclude>**/*.scala</sourceInclude>
                <sourceInclude>**/*.java</sourceInclude>
              </sourceIncludes>
            </configuration>
          </plugin>
        </plugins>
      </build>
    </project>
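One check I still need to run, in case the class is simply absent from the shaded jar at runtime (my own guess at the failure mode, not a confirmed diagnosis):

    jar tf test-2.0.5-SNAPSHOT-jar-with-dependencies.jar | grep scala/collection/Seq

If nothing comes back, the Scala runtime never made it into the fat jar, and the multiple-versions warning above is probably a red herring.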