Re: SQL on Spark - Shark or SparkSQL
This is a great question. We are in the same position, having not invested in Hive yet and looking at various options for SQL-on-Hadoop.

On Sat, Mar 29, 2014 at 9:48 PM, Manoj Samel manojsamelt...@gmail.com wrote:

Hi,

In the context of the recent Spark SQL announcement (http://databricks.com/blog/2014/03/26/Spark-SQL-manipulating-structured-data-using-Spark.html): if there is no existing investment in Hive/Shark, would it be worth starting new SQL work using Spark SQL rather than Shark?

* It seems the Shark SQL core will use more and more of Spark SQL.
* From the blog, it seems Shark has baggage from Hive that is not needed in this case.

On the other hand, there seem to be two shortcomings of Spark SQL (from a quick scan of the blog and docs):

* Spark SQL will have fewer features than Shark/Hive QL, at least for now.
* The standalone SharkServer feature will not be available in Spark SQL.

Can someone from Databricks shed light on the long-term roadmap? It would help in avoiding investing in older/two technologies for work with no Hive needs.

Thanks,

PS: Great work on Spark SQL
Re: WikipediaPageRank Data Set
The GraphX team has been using Wikipedia dumps from http://dumps.wikimedia.org/enwiki/. Unfortunately, these are in a less convenient format than the Freebase dumps. In particular, an article may span multiple lines, so more involved input parsing is required. Dan Crankshaw (cc'd) wrote a driver that uses a Hadoop InputFormat XML parser from Mahout: see WikiPipelineBenchmark.scala (https://github.com/amplab/graphx/blob/860918486a81cb4c88a056a9b64b1f7d8b0ed5ff/graphx/src/main/scala/org/apache/spark/graphx/WikiPipelineBenchmark.scala#L157) and WikiArticle.scala (https://github.com/amplab/graphx/blob/860918486a81cb4c88a056a9b64b1f7d8b0ed5ff/graphx/src/main/scala/org/apache/spark/graphx/WikiArticle.scala). However, we plan to upload a parsed version of this dataset to S3 for easier access from Spark and GraphX.

Ankur
http://www.ankurdave.com/

On 27 Mar, 2014, at 9:45 pm, Niko Stahl r.niko.st...@gmail.com wrote:

I would like to run the WikipediaPageRank example (https://github.com/amplab/graphx/blob/f8544981a6d05687fa950639cb1eb3c31e9b6bf5/examples/src/main/scala/org/apache/spark/examples/bagel/WikipediaPageRank.scala), but the Wikipedia dump XML files are no longer available on Freebase. Does anyone know an alternative source for the data?
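If you want to parse the raw dump yourself before the S3 upload lands, the core of that approach looks roughly like this (a sketch: the Mahout class and its config keys are from memory and should be checked against the linked driver, and the input path is just an example):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.mahout.text.wikipedia.XmlInputFormat

val hadoopConf = new Configuration()
// Treat everything between <page> and </page> as one record, even when the
// article spans many lines of the dump.
hadoopConf.set(XmlInputFormat.START_TAG_KEY, "<page>")
hadoopConf.set(XmlInputFormat.END_TAG_KEY, "</page>")

val pages = sc.newAPIHadoopFile(
  "hdfs:///data/enwiki-latest-pages-articles.xml", // example path
  classOf[XmlInputFormat], classOf[LongWritable], classOf[Text], hadoopConf
).map(_._2.toString) // each element is one full <page> element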
Re: WikipediaPageRank Data Set
In particular, we are using this dataset: http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

Ankur
http://www.ankurdave.com/

On Sun, Mar 30, 2014 at 12:45 AM, Ankur Dave ankurd...@gmail.com wrote:

The GraphX team has been using Wikipedia dumps from http://dumps.wikimedia.org/enwiki/. ...
Re: Cross validation is missing in machine learning examples
Aureliano, you're correct that this is not validation error, which is computed as the residuals on out-of-training-sample data and helps minimize overfit variance. However, in this example the errors are correctly referred to as training error, which is what you might compute on a per-iteration basis in a gradient-descent optimizer, in order to see how you're doing with respect to minimizing the in-sample residuals. There's nothing special about Spark ML algorithms that claims to escape these bias-variance considerations.

--
Christopher T. Nguyen
Co-founder CEO, Adatao
http://adatao.com
linkedin.com/in/ctnguyen

On Sat, Mar 29, 2014 at 10:25 PM, Aureliano Buendia buendia...@gmail.com wrote:

Hi,

I noticed the Spark machine learning examples use training data to validate regression models. For instance, in the linear regression example (http://spark.apache.org/docs/0.9.0/mllib-guide.html):

// Evaluate model on training examples and compute training error
val valuesAndPreds = parsedData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
...

Here the training data was used to validate a model created from the very same training data. This only estimates bias; cross-validation (http://en.wikipedia.org/wiki/Cross-validation_%28statistics%29) is missing here. In order to cross-validate, we need to partition the data into an in-sample set for training and an out-of-sample set for validation. Please correct me if this does not apply to the ML algorithms implemented in Spark.
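A minimal held-out split along those lines might look like this (a sketch, not part of the guide; the split fraction, seed, and iteration count are arbitrary):

import org.apache.spark.SparkContext._
import org.apache.spark.mllib.regression.LinearRegressionWithSGD

// Hold out ~30% of the data for validation.
val training = parsedData.sample(withReplacement = false, fraction = 0.7, seed = 42)
val test = parsedData.subtract(training)

val model = LinearRegressionWithSGD.train(training, 100)

// Out-of-sample (validation) error, in contrast to the in-sample training
// error computed in the guide's example.
val testMSE = test.map { point =>
  val err = point.label - model.predict(point.features)
  err * err
}.mean()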
Can we convert scala.collection.ArrayBuffer[(Int,Double)] to org.spark.RDD[(Int,Double)]
Hi, Can we convert a Scala collection directly to a Spark RDD type without using the parallelize method? Is there any way to create a custom converted RDD from a Scala type using some typecast-like mechanism? Please suggest.
Error in SparkSQL Example
Hi,

On http://people.apache.org/~pwendell/catalyst-docs/sql-programming-guide.html, I am trying to run the code under "Writing Language-Integrated Relational Queries" (I have the 1.0.0 snapshot). I am running into an error on

val people: RDD[Person] // An RDD of case class objects, from the first example.

scala> val people: RDD[Person]
<console>:19: error: not found: type RDD
       val people: RDD[Person]
                   ^

scala> val people: org.apache.spark.rdd.RDD[Person]
<console>:18: error: class $iwC needs to be abstract, since value people is not defined
class $iwC extends Serializable {
      ^

Any idea what the issue is? Also, it's not clear what the RDD[Person] brings. I can run the DSL without the case class objects RDD ...

val people = sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt))
val teenagers = people.where('age >= 13).where('age <= 19)

Thanks,
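PS: For reference, here is the complete definition I would have expected to work, with the import spelled out (a sketch; assumes the Person case class from the first example):

import org.apache.spark.rdd.RDD

// A bare "val people: RDD[Person]" is only a declaration, which the REPL
// rejects; RDD also has to be imported before the type annotation resolves.
val people: RDD[Person] =
  sc.textFile("examples/src/main/resources/people.txt")
    .map(_.split(","))
    .map(p => Person(p(0), p(1).trim.toInt))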
Shouldn't the UNION of SchemaRDDs produce a SchemaRDD?
Hi,

I am trying Spark SQL based on the example in the doc ...

val people = sc.textFile("/data/spark/examples/src/main/resources/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt))

val olderThanTeens = people.where('age > 19)
val youngerThanTeens = people.where('age < 13)
val nonTeens = youngerThanTeens.union(olderThanTeens)

I can do an orderBy('age) on the first two (which are SchemaRDDs) but not on the third. nonTeens is a UnionRDD, which does not support orderBy. This seems different from SQL behavior, where the union of two SQL results is itself a SQL result with the same functionality ... It is not clear why the union of two SchemaRDDs does not produce a SchemaRDD.

Thanks,
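PS: If this snapshot's SchemaRDD has a unionAll method (I have not verified that it does), that would presumably be the schema-preserving way to combine them (sketch):

// union comes from the plain RDD API and loses the schema; unionAll
// (assumed available on SchemaRDD) should keep it, so orderBy still works.
val nonTeens = youngerThanTeens.unionAll(olderThanTeens)
nonTeens.orderBy('age).foreach(println)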
SparkSQL where with BigDecimal type gives stacktrace
Hi,

If I do a where on a BigDecimal column, I get a stack trace. Changing BigDecimal to Double works ...

scala> case class JournalLine(account: String, credit: BigDecimal, debit: BigDecimal, date: String, company: String, currency: String, costcenter: String, region: String)
defined class JournalLine
...
scala> jl.where('credit > 0).foreach(println)
scala.MatchError: scala.BigDecimal (of class scala.reflect.internal.Types$TypeRef$$anon$3)
at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:41)
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:45)
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:45)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:45)
at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:38)
at org.apache.spark.sql.catalyst.ScalaReflection$.attributesFor(ScalaReflection.scala:32)
at org.apache.spark.sql.execution.ExistingRdd$.fromProductRdd(basicOperators.scala:128)
at org.apache.spark.sql.SQLContext.createSchemaRDD(SQLContext.scala:79)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:39)
at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:44)
at $iwC$$iwC$$iwC$$iwC.<init>(<console>:46)
at $iwC$$iwC$$iwC.<init>(<console>:48)
at $iwC$$iwC.<init>(<console>:50)
at $iwC.<init>(<console>:52)
at <init>(<console>:54)
at .<init>(<console>:58)
at .<clinit>(<console>)
at .<init>(<console>:7)
at .<clinit>(<console>)
at $print(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:777)
at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1045)
at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:614)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:645)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:609)
at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:795)
at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:840)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:752)
at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:600)
at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:607)
at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:610)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:935)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:883)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:883)
at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:883)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:981)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)

Thanks,
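PS: For completeness, the Double variant that works (a sketch of the same session):

scala> case class JournalLine(account: String, credit: Double, debit: Double, date: String, company: String, currency: String, costcenter: String, region: String)
defined class JournalLine
...
scala> jl.where('credit > 0).foreach(println)   // no MatchError once the fields are Double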
Spark webUI - application details page
Is there a way to see the 'Application Detail UI' page (at master:4040) for completed applications? Currently, I can see that page only for running applications; I would like to see the various numbers for an application after it has completed.
Re: SparkSQL where with BigDecimal type gives stacktrace
Can I get the whole operation? Then I can try to locate the error.

smallmonkey...@hotmail.com

From: Manoj Samel
Date: 2014-03-31 01:16
To: user
Subject: SparkSQL where with BigDecimal type gives stacktrace

Hi, If I do a where on a BigDecimal column, I get a stack trace. Changing BigDecimal to Double works ...

scala> jl.where('credit > 0).foreach(println)
scala.MatchError: scala.BigDecimal (of class scala.reflect.internal.Types$TypeRef$$anon$3)
at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:41)
... (full stack trace as in the original message)
Re: Spark webUI - application details page
This will be a feature in Spark 1.0, but it is not yet released. In 1.0, Spark applications can persist their state so that the UI can be reloaded after they have completed.

- Patrick

On Sun, Mar 30, 2014 at 10:30 AM, David Thomas dt5434...@gmail.com wrote:

Is there a way to see the 'Application Detail UI' page (at master:4040) for completed applications? Currently, I can see that page only for running applications; I would like to see the various numbers for an application after it has completed.
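A sketch of what enabling that persistence looks like, assuming the event-log settings land in 1.0 under these names (unverified until release):

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical until 1.0 ships: log UI events to a shared directory so the
// application detail page can be reconstructed after the app finishes.
val conf = new SparkConf()
  .setAppName("my-app")
  .set("spark.eventLog.enabled", "true")
  .set("spark.eventLog.dir", "hdfs:///tmp/spark-events") // directory is an example
val sc = new SparkContext(conf)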
Spark-ec2 setup is getting slower and slower
Hi,

Spark-ec2 uses rsync to deploy many applications. It seems that over time more and more applications have been added to the script, which has significantly slowed down the setup time. Perhaps the script could be restructured this way: instead of rsyncing N times, once per application, we could have one rsync that deploys all N applications. This should remarkably speed up the setup, especially for clusters with many nodes.
Re: Spark-ec2 setup is getting slower and slower
That is a good idea, though I am not sure how much it will help, as the time to rsync also depends on the amount of data being copied. The other problem is that we sometimes have dependencies across packages, so the first needs to be running before the second can start, etc. However, I agree that it takes too long to launch, say, a 100-node cluster right now. If you want to take a shot at trying out some changes, you can fork the spark-ec2 repo at https://github.com/mesos/spark-ec2/tree/v2 and modify the number of rsync calls (each call to /root/spark-ec2/copy-dir launches an rsync now).

Thanks
Shivaram

On Sun, Mar 30, 2014 at 3:12 PM, Aureliano Buendia buendia...@gmail.com wrote:

Hi, Spark-ec2 uses rsync to deploy many applications. It seems that over time more and more applications have been added to the script, which has significantly slowed down the setup time. Perhaps the script could be restructured this way: instead of rsyncing N times, once per application, we could have one rsync that deploys all N applications. This should remarkably speed up the setup, especially for clusters with many nodes.
Re: Can we convert scala.collection.ArrayBuffer[(Int,Double)] to org.spark.RDD[(Int,Double)]
The Scala object needs to be sent to the workers to be used as an RDD; parallelize is a way to do that. What are you looking to do? You can also serialize the Scala object to HDFS/disk and load it from there.

Regards
Mayur

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi (https://twitter.com/mayur_rustagi)

On Sun, Mar 30, 2014 at 6:22 AM, yh18190 yh18...@gmail.com wrote:

Hi, Can we convert a Scala collection directly to a Spark RDD type without using the parallelize method? Is there any way to create a custom converted RDD from a Scala type using some typecast-like mechanism? Please suggest.
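A minimal sketch of the parallelize route, for reference (buffer contents are made up):

import scala.collection.mutable.ArrayBuffer
import org.apache.spark.rdd.RDD

val buf = ArrayBuffer((1, 2.0), (2, 3.5), (3, 0.25))
// parallelize copies the local collection out to the cluster; there is no
// zero-cost cast, since an RDD is a distributed handle, not a local value.
val rdd: RDD[(Int, Double)] = sc.parallelize(buf)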
Re: SQL on Spark - Shark or SparkSQL
+1. Have done a few installations of Shark with customers using Hive; they love it. Would be good to maintain compatibility with the Hive Metastore/QL till we have a substantial reason to break off (like BlinkDB).

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi (https://twitter.com/mayur_rustagi)

On Sun, Mar 30, 2014 at 2:46 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote:

This is a great question. We are in the same position, having not invested in Hive yet and looking at various options for SQL-on-Hadoop. ...
Re: SparkSQL where with BigDecimal type gives stacktrace
Hi,

Would the same issue be present for other Java types like Date? Converting the person/teenager example on Patrick's page reproduces the problem ...

Thanks,

scala> import scala.math
import scala.math

scala> case class Person(name: String, age: BigDecimal)
defined class Person

scala> val people = sc.textFile("/data/spark/examples/src/main/resources/people.txt").map(_.split(",")).map(p => Person(p(0), BigDecimal(p(1).trim.toInt)))
14/03/31 00:23:40 INFO MemoryStore: ensureFreeSpace(32960) called with curMem=0, maxMem=308713881
14/03/31 00:23:40 INFO MemoryStore: Block broadcast_0 stored as values to memory (estimated size 32.2 KB, free 294.4 MB)
people: org.apache.spark.rdd.RDD[Person] = MappedRDD[3] at map at <console>:20

scala> people take 1
...

scala> val t = people.where('age > 12)
scala.MatchError: scala.BigDecimal (of class scala.reflect.internal.Types$TypeRef$$anon$3)
at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:41)
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:45)
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:45)
...
at org.apache.spark.sql.catalyst.ScalaReflection$.attributesFor(ScalaReflection.scala:32)
at org.apache.spark.sql.execution.ExistingRdd$.fromProductRdd(basicOperators.scala:128)
at org.apache.spark.sql.SQLContext.createSchemaRDD(SQLContext.scala:79)
... (remaining frames identical to the trace in my first message)

On Sun, Mar 30, 2014 at 11:04 AM, Aaron Davidson ilike...@gmail.com wrote:

Well, the error is coming from this case statement not matching on the BigDecimal type: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala#L41

This seems to be a bug, because there is a corresponding Catalyst DataType for BigDecimal, just no way to produce a schema for it. A patch should be straightforward enough: match against typeOf[BigDecimal], assuming this was not for some reason intentional.

On Sun, Mar 30, 2014 at 10:43 AM, smallmonkey...@hotmail.com smallmonkey...@hotmail.com wrote:

can I get the whole operation? then i can try to locate the error

From: Manoj Samel manojsamelt...@gmail.com
Date: 2014-03-31 01:16
To: user user@spark.apache.org
Subject: SparkSQL where with BigDecimal type gives stacktrace

Hi, If I do a where on BigDecimal, I get a stack trace. Changing BigDecimal to Double works ...
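To make Aaron's suggestion concrete, here is a self-contained sketch of the kind of match involved (the shape is assumed from the linked file and not a patch against it; string stand-ins are used instead of Catalyst's DataType objects):

import scala.reflect.runtime.universe._

// schemaFor apparently pattern-matches on the reflected Type of each
// case-class field; a case along these lines for BigDecimal is what
// appears to be missing.
def sketchSchemaFor(tpe: Type): String = tpe match {
  case t if t <:< typeOf[BigDecimal] => "DecimalType" // the absent case
  case t if t <:< typeOf[String]     => "StringType"
  case t if t <:< typeOf[Double]     => "DoubleType"
  case _                             => "UnknownType"
}

def schemaOf[T: TypeTag]: String = sketchSchemaFor(typeOf[T])
// schemaOf[BigDecimal] == "DecimalType"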
Re: [shark-users] SQL on Spark - Shark or SparkSQL
Hi Manoj,

At the current time, for a drop-in replacement of Hive, it will be best to stick with Shark. Over time, Shark will use the Spark SQL backend, but it should remain deployable the way it is today (including launching the SharkServer, using the Hive CLI, etc). Spark SQL is better for accessing Hive data within a Spark program, though, where its APIs are richer and easier to link to than the SharkContext.sql2rdd we had previously provided in Shark.

So in a nutshell: if you have a Shark deployment today, or need the HiveServer, then going with Shark will be fine, and we will switch out the backend in a future release (we'll probably create a preview of this even before we're ready to fully switch). If you just want to run SQL queries or load SQL data within a Spark program, try out Spark SQL.

Matei

On Mar 30, 2014, at 4:46 PM, Mayur Rustagi mayur.rust...@gmail.com wrote:

+1. Have done a few installations of Shark with customers using Hive; they love it. ...
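For a sense of the API difference Matei describes, a rough sketch (the Spark SQL method names are taken from the current snapshot docs and may still change):

// Shark: SQL from inside a program goes through SharkContext.sql2rdd, e.g.
//   val rows = sharkContext.sql2rdd("SELECT name, age FROM people")

// Spark SQL: queries return SchemaRDDs that compose with ordinary RDD ops.
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext._ // brings in the implicit RDD-to-SchemaRDD conversion

people.registerAsTable("people") // people: RDD[Person], as in the guide
val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)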
groupBy RDD does not have grouping column ?
Hi, If I create a groupBy('a)(Sum('b) as 'foo, Sum('c) as 'bar), then the resulting RDD should have 'a, 'foo, and 'bar. The resulting RDD shows just 'foo and 'bar and is missing 'a. Thoughts? Thanks, Manoj
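PS: If the DSL follows SQL aggregation semantics, one workaround (an assumption on my part, untested) would be to list the grouping attribute among the output expressions, the same way a SQL SELECT repeats the GROUP BY column:

// Hypothetical: include 'a itself in the aggregate expression list so the
// grouped result carries the grouping column alongside the aggregates.
val grouped = rdd.groupBy('a)('a, Sum('b) as 'foo, Sum('c) as 'bar)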
Re: Using ProtoBuf 2.5 for messages with Spark Streaming
I'm using ScalaBuff (which depends on protobuf 2.5) and facing the same issue. Any word on this one?

On Mar 27, 2014, at 6:41 PM, Kanwaldeep kanwal...@gmail.com wrote:

We are using Protocol Buffers 2.5 to send messages to Spark Streaming 0.9 with a Kafka stream setup. I have Protocol Buffers 2.5 as part of the uber jar deployed on each of the Spark worker nodes. The message is compiled using 2.5, but at runtime it is being deserialized by 2.4.1, as I'm getting the following exception:

java.lang.VerifyError (java.lang.VerifyError: class com.snc.sinet.messages.XServerMessage$XServer overrides final method getUnknownFields.()Lcom/google/protobuf/UnknownFieldSet;)
java.lang.ClassLoader.defineClass1(Native Method)
java.lang.ClassLoader.defineClassCond(ClassLoader.java:631)
java.lang.ClassLoader.defineClass(ClassLoader.java:615)
java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)

Any suggestions on how I could still use protobuf 2.5? Based on https://spark-project.atlassian.net/browse/SPARK-995 we should be able to use a different version of protobuf in the application.
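One build-side thing worth trying (my assumption, not something confirmed in this thread) is pinning protobuf in the build so the uber jar cannot resolve a transitive 2.4.1, e.g. in sbt:

// build.sbt (sketch): force protobuf-java 2.5.0 over any transitive 2.4.1
// pulled in through the Hadoop/Spark dependency tree.
dependencyOverrides += "com.google.protobuf" % "protobuf-java" % "2.5.0"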
batching the output
Hi,

I need to batch the values in my final RDD before writing them out to HDFS. The idea is to batch multiple rows into one protobuf and write those batches out, mostly to save some space, since a lot of the metadata is the same. E.g. for values 1,2,3,4,5,6, batch them as (1,2), (3,4), (5,6) and save three records instead of six.

What I'm doing is using mapPartitions with the grouped function of the iterator, giving it a groupSize:

val protoRDD: RDD[MyProto] = rdd.mapPartitions(
  _.grouped(groupSize).map { seq =>
    val profiles = MyProto(...)
    seq.foreach { x =>
      val row = new Row(x._1.toString)
      row.setFloatValue(x._2)
      profiles.addRow(row)
    }
    profiles
  }
)

I haven't been able to test it out because of a separate issue (a protobuf version mismatch, discussed in a different thread), but I'm hoping it will work. Is there a better/more straightforward way of doing this?

Thanks
Vipul
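PS: The grouped batching itself can be sanity-checked without the protobuf pieces; a self-contained sketch:

// Six (Int, Float) pairs in a single partition batch into three groups of
// two. Note that grouping is per partition, so with multiple partitions the
// trailing group in each partition can be smaller than groupSize.
val sample = sc.parallelize(Seq((1, 1.0f), (2, 2.0f), (3, 3.0f), (4, 4.0f), (5, 5.0f), (6, 6.0f)), numSlices = 1)
val batches = sample.mapPartitions(_.grouped(2).map(_.toList))
batches.collect().foreach(println)
// List((1,1.0), (2,2.0))
// List((3,3.0), (4,4.0))
// List((5,5.0), (6,6.0))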