Re: SQL on Spark - Shark or SparkSQL

2014-03-30 Thread Nicholas Chammas
This is a great question. We are in the same position, having not invested
in Hive yet and looking at various options for SQL-on-Hadoop.


On Sat, Mar 29, 2014 at 9:48 PM, Manoj Samel manojsamelt...@gmail.com wrote:

 Hi,

 In context of the recent Spark SQL announcement (
 http://databricks.com/blog/2014/03/26/Spark-SQL-manipulating-structured-data-using-Spark.html
 ).

 If there is no existing investment in Hive/Shark, would it be worth
 starting new SQL work using SparkSQL rather than Shark?

 * It seems the Shark SQL core will use more and more of SparkSQL
 * From the blog, it seems Shark carries baggage from Hive that is not needed
 in this case

 On the other hand, there seem to be two shortcomings of SparkSQL (from a
 quick scan of the blog and docs)

 * SparkSQL will have fewer features than Shark/Hive QL, at least for now.
 * The standalone SharkServer feature will not be available in SparkSQL.

 Can someone from Databricks shed light on the long-term roadmap? It would
 help avoid investing in an older technology (or in two technologies) for
 work that has no Hive needs.

 Thanks,

 PS: Great work on SparkSQL




Re: WikipediaPageRank Data Set

2014-03-30 Thread Ankur Dave
The GraphX team has been using Wikipedia dumps from
http://dumps.wikimedia.org/enwiki/. Unfortunately, these are in a less
convenient format than the Freebase dumps. In particular, an article may
span multiple lines, so more involved input parsing is required.

Dan Crankshaw (cc'd) wrote a driver that uses a Hadoop InputFormat XML
parser from Mahout: see WikiPipelineBenchmark.scala
(https://github.com/amplab/graphx/blob/860918486a81cb4c88a056a9b64b1f7d8b0ed5ff/graphx/src/main/scala/org/apache/spark/graphx/WikiPipelineBenchmark.scala#L157)
and WikiArticle.scala
(https://github.com/amplab/graphx/blob/860918486a81cb4c88a056a9b64b1f7d8b0ed5ff/graphx/src/main/scala/org/apache/spark/graphx/WikiArticle.scala).

However, we plan to upload a parsed version of this dataset to S3 for
easier access from Spark and GraphX.
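
Until then, a rough sketch of reading the raw multi-line dump with Mahout's
XmlInputFormat; the package path and tag-key constants below are assumptions
and vary across Mahout versions:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.mahout.text.wikipedia.XmlInputFormat  // package differs in older Mahout releases

val conf = new Configuration()
conf.set(XmlInputFormat.START_TAG_KEY, "<page>")
conf.set(XmlInputFormat.END_TAG_KEY, "</page>")

// One full <page>...</page> element per record, even when it spans many lines.
val rawPages = sc.newAPIHadoopFile(
  "hdfs:///data/enwiki-latest-pages-articles.xml",
  classOf[XmlInputFormat], classOf[LongWritable], classOf[Text], conf)
  .map { case (_, page) => page.toString }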

Ankur http://www.ankurdave.com/

On 27 Mar, 2014, at 9:45 pm, Niko Stahl r.niko.st...@gmail.com wrote:

I would like to run the WikipediaPageRank example
(https://github.com/amplab/graphx/blob/f8544981a6d05687fa950639cb1eb3c31e9b6bf5/examples/src/main/scala/org/apache/spark/examples/bagel/WikipediaPageRank.scala),
but the Wikipedia dump XML files are no longer available on
Freebase. Does anyone know an alternative source for the data?



Re: WikipediaPageRank Data Set

2014-03-30 Thread Ankur Dave
In particular, we are using this dataset:
http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

Ankur http://www.ankurdave.com/


On Sun, Mar 30, 2014 at 12:45 AM, Ankur Dave ankurd...@gmail.com wrote:

 The GraphX team has been using Wikipedia dumps from
 http://dumps.wikimedia.org/enwiki/. Unfortunately, these are in a less
 convenient format than the Freebase dumps. In particular, an article may
 span multiple lines, so more involved input parsing is required.

 Dan Crankshaw (cc'd) wrote a driver that uses a Hadoop InputFormat XML
 parser from Mahout: see WikiPipelineBenchmark.scala
 (https://github.com/amplab/graphx/blob/860918486a81cb4c88a056a9b64b1f7d8b0ed5ff/graphx/src/main/scala/org/apache/spark/graphx/WikiPipelineBenchmark.scala#L157)
 and WikiArticle.scala
 (https://github.com/amplab/graphx/blob/860918486a81cb4c88a056a9b64b1f7d8b0ed5ff/graphx/src/main/scala/org/apache/spark/graphx/WikiArticle.scala).

 However, we plan to upload a parsed version of this dataset to S3 for
 easier access from Spark and GraphX.

 Ankur http://www.ankurdave.com/

 On 27 Mar, 2014, at 9:45 pm, Niko Stahl r.niko.st...@gmail.com wrote:

 I would like to run the WikipediaPageRank example
 (https://github.com/amplab/graphx/blob/f8544981a6d05687fa950639cb1eb3c31e9b6bf5/examples/src/main/scala/org/apache/spark/examples/bagel/WikipediaPageRank.scala),
 but the Wikipedia dump XML files are no longer available on
 Freebase. Does anyone know an alternative source for the data?





Re: Cross validation is missing in machine learning examples

2014-03-30 Thread Christopher Nguyen
Aureliano, you're correct that this is not validation error, which is
computed as the residuals on out-of-training-sample data, and helps
minimize overfit variance.

However, in this example, the errors are correctly referred to as training
error, which is what you might compute on a per-iteration basis in a
gradient-descent optimizer, in order to see how you're doing with respect
to minimizing the in-sample residuals.

There's nothing special about Spark ML algorithms that claims to escape
these bias-variance considerations.
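
For concreteness, a minimal held-out-validation sketch, assuming the
RDD[LabeledPoint] called parsedData from the guide's example and a Spark
version where RDD.randomSplit is available:

import org.apache.spark.SparkContext._
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}
import org.apache.spark.rdd.RDD

// Hold out 30% of the data; train only on the remaining 70%.
val Array(training, test) = parsedData.randomSplit(Array(0.7, 0.3), seed = 42L)
val model = LinearRegressionWithSGD.train(training, 100)

// Mean squared error of the trained model on any dataset.
def mse(data: RDD[LabeledPoint]): Double =
  data.map(p => math.pow(p.label - model.predict(p.features), 2)).mean()

println("training MSE = " + mse(training) + ", validation MSE = " + mse(test))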

--
Christopher T. Nguyen
Co-founder & CEO, Adatao (http://adatao.com)
linkedin.com/in/ctnguyen



On Sat, Mar 29, 2014 at 10:25 PM, Aureliano Buendia buendia...@gmail.com wrote:

 Hi,

 I noticed Spark machine learning examples use training data to validate
 regression models. For instance, in the linear regression example
 (http://spark.apache.org/docs/0.9.0/mllib-guide.html):

 // Evaluate model on training examples and compute training error
 val valuesAndPreds = parsedData.map { point =>
   val prediction = model.predict(point.features)
   (point.label, prediction)
 }
 ...


  Here training data was used to validate a model which was created from
 the very same training data. This is just a biased estimate, and cross
 validation (http://en.wikipedia.org/wiki/Cross-validation_%28statistics%29) is
 missing here. In order to cross validate, we need to partition the data
 into an in-sample set for training and an out-of-sample set for validation.

 Please correct me if this does not apply to ML algorithms implemented in
 Spark.



Can we convert scala.collection.ArrayBuffer[(Int,Double)] to org.spark.RDD[(Int,Double])

2014-03-30 Thread yh18190
Hi,

Can we directly convert a Scala collection to a Spark RDD data type without
using the parallelize method?
Is there any way to create a custom converted RDD datatype from a Scala type
using some typecast like that?

Please suggest.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Can-we-convert-scala-collection-ArrayBuffer-Int-Double-to-org-spark-RDD-Int-Double-tp3486.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Error in SparkSQL Example

2014-03-30 Thread Manoj Samel
Hi,

On
http://people.apache.org/~pwendell/catalyst-docs/sql-programming-guide.html,
I am trying to run the code under Writing Language-Integrated Relational Queries
(I have a 1.0.0 snapshot).

I am running into an error on

val people: RDD[Person] // An RDD of case class objects, from the first
example.

scala> val people: RDD[Person]
<console>:19: error: not found: type RDD
       val people: RDD[Person]
                   ^

scala> val people: org.apache.spark.rdd.RDD[Person]
<console>:18: error: class $iwC needs to be abstract, since value people is
not defined
class $iwC extends Serializable {
      ^

Any idea what the issue is?

Also, it's not clear what the RDD[Person] brings. I can run the DSL
without the case class objects RDD ...

val people =
  sc.textFile("examples/src/main/resources/people.txt")
    .map(_.split(","))
    .map(p => Person(p(0), p(1).trim.toInt))

val teenagers = people.where('age >= 13).where('age <= 19)
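
For what it's worth, a guess at what the guide intends here: the placeholder
line needs the RDD type imported and an actual definition, since the REPL
cannot accept a bare declaration. A sketch:

import org.apache.spark.rdd.RDD

case class Person(name: String, age: Int)

val people: RDD[Person] =
  sc.textFile("examples/src/main/resources/people.txt")
    .map(_.split(","))
    .map(p => Person(p(0), p(1).trim.toInt))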

Thanks,


Shouldn't the UNION of SchemaRDDs produce SchemaRDD ?

2014-03-30 Thread Manoj Samel
Hi,

I am trying SparkSQL based on the example in the docs ...



val people =
  sc.textFile("/data/spark/examples/src/main/resources/people.txt")
    .map(_.split(","))
    .map(p => Person(p(0), p(1).trim.toInt))

val olderThanTeans = people.where('age > 19)
val youngerThanTeans = people.where('age < 13)
val nonTeans = youngerThanTeans.union(olderThanTeans)

I can do an orderBy('age) on the first two (which are SchemaRDDs) but not on the
third. The nonTeans is a UnionRDD that does not support orderBy. This
seems different from SQL behavior, where the union of two SQL queries is itself
a query with the same functionality ...

It is not clear why the union of two SchemaRDDs does not produce a SchemaRDD.
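
One workaround sketch, assuming the registerAsTable/sql API and the SQLContext
setup (import sqlContext._) from the same guide: express the union as a single
SQL query, whose result is a SchemaRDD and therefore still supports ordering.

// Register the case-class RDD as a table and union via SQL instead of RDD.union.
people.registerAsTable("people")
val nonTeens = sql("SELECT * FROM people WHERE age < 13 OR age > 19 ORDER BY age")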


Thanks,


SparkSQL where with BigDecimal type gives stacktrace

2014-03-30 Thread Manoj Samel
Hi,

If I do a where on BigDecimal, I get a stack trace. Changing BigDecimal to
Double works ...

scala> case class JournalLine(account: String, credit: BigDecimal, debit:
BigDecimal, date: String, company: String, currency: String, costcenter:
String, region: String)
defined class JournalLine
...
scala> jl.where('credit > 0).foreach(println)
scala.MatchError: scala.BigDecimal (of class
scala.reflect.internal.Types$TypeRef$$anon$3)
at
org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:41)
at
org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:45)
at
org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:45)
at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at
org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:45)
at
org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:38)
at
org.apache.spark.sql.catalyst.ScalaReflection$.attributesFor(ScalaReflection.scala:32)
at
org.apache.spark.sql.execution.ExistingRdd$.fromProductRdd(basicOperators.scala:128)
at org.apache.spark.sql.SQLContext.createSchemaRDD(SQLContext.scala:79)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:39)
at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:44)
at $iwC$$iwC$$iwC$$iwC.<init>(<console>:46)
at $iwC$$iwC$$iwC.<init>(<console>:48)
at $iwC$$iwC.<init>(<console>:50)
at $iwC.<init>(<console>:52)
at <init>(<console>:54)
at .<init>(<console>:58)
at .<clinit>(<console>)
at .<init>(<console>:7)
at .<clinit>(<console>)
at $print(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:777)
at
org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1045)
at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:614)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:645)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:609)
at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:795)
at
org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:840)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:752)
at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:600)
at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:607)
at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:610)
at
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:935)
at
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:883)
at
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:883)
at
scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:883)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:981)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)

Thanks,


Spark webUI - application details page

2014-03-30 Thread David Thomas
Is there a way to see the 'Application Detail UI' page (at master:4040) for
completed applications? Currently, I can see that page only for running
applications; I would like to see various metrics for an application after
it has completed.


Re: SparkSQL where with BigDecimal type gives stacktrace

2014-03-30 Thread smallmonkey...@hotmail.com
Can I get the whole operation? Then I can try to locate the error.




smallmonkey...@hotmail.com

From: Manoj Samel
Date: 2014-03-31 01:16
To: user
Subject: SparkSQL where with BigDecimal type gives stacktrace
Hi,


If I do a where on BigDecimal, I get a stack trace. Changing BigDecimal to 
Double works ...

scala> case class JournalLine(account: String, credit: BigDecimal, debit:
BigDecimal, date: String, company: String, currency: String, costcenter:
String, region: String)
defined class JournalLine
...
scala> jl.where('credit > 0).foreach(println)
scala.MatchError: scala.BigDecimal (of class 
scala.reflect.internal.Types$TypeRef$$anon$3)
at 
org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:41)
at 
org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:45)
at 
org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:45)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at 
org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:45)
at 
org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:38)
at 
org.apache.spark.sql.catalyst.ScalaReflection$.attributesFor(ScalaReflection.scala:32)
at 
org.apache.spark.sql.execution.ExistingRdd$.fromProductRdd(basicOperators.scala:128)
at org.apache.spark.sql.SQLContext.createSchemaRDD(SQLContext.scala:79)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:39)
at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:44)
at $iwC$$iwC$$iwC$$iwC.<init>(<console>:46)
at $iwC$$iwC$$iwC.<init>(<console>:48)
at $iwC$$iwC.<init>(<console>:50)
at $iwC.<init>(<console>:52)
at <init>(<console>:54)
at .<init>(<console>:58)
at .<clinit>(<console>)
at .<init>(<console>:7)
at .<clinit>(<console>)
at $print(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:777)
at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1045)
at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:614)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:645)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:609)
at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:795)
at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:840)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:752)
at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:600)
at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:607)
at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:610)
at 
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:935)
at 
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:883)
at 
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:883)
at 
scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:883)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:981)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)


Thanks,

Re: Spark webUI - application details page

2014-03-30 Thread Patrick Wendell
This will be a feature in Spark 1.0 but is not yet released. In 1.0 Spark
applications can persist their state so that the UI can be reloaded after
they have completed.
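
A sketch of what enabling that persistence looks like on the 1.0 line; the
configuration keys below are taken from the current 1.0 docs and may still
change before release:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("my-app")
  .set("spark.eventLog.enabled", "true")             // persist UI event data as the app runs
  .set("spark.eventLog.dir", "hdfs:///spark-events") // shared location to reload the UI from
val sc = new SparkContext(conf)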

- Patrick


On Sun, Mar 30, 2014 at 10:30 AM, David Thomas dt5434...@gmail.com wrote:

 Is there a way to see the 'Application Detail UI' page (at master:4040) for
 completed applications? Currently, I can see that page only for running
 applications; I would like to see various metrics for an application after
 it has completed.



Spark-ec2 setup is getting slower and slower

2014-03-30 Thread Aureliano Buendia
Hi,

Spark-ec2 uses rsync to deploy many applications. It seems that over time more
and more applications have been added to the script, which has
significantly slowed down the setup time.

Perhaps the script could be restructured this way: instead of rsyncing
N times, once per application, we could have one rsync which deploys all N
applications.

This should noticeably speed up the setup, especially for clusters with
many nodes.


Re: Spark-ec2 setup is getting slower and slower

2014-03-30 Thread Shivaram Venkataraman
That is a good idea, though I am not sure how much it will help, as the time to
rsync also depends on the amount of data being copied. The other problem
is that sometimes we have dependencies across packages, so the first needs
to be running before the second can start, etc.
However, I agree that it takes too long to launch, say, a 100-node cluster
right now. If you want to take a shot at trying out some changes, you can
fork the spark-ec2 repo at https://github.com/mesos/spark-ec2/tree/v2 and
modify the number of rsync calls (each call to /root/spark-ec2/copy-dir
launches an rsync now).

Thanks
Shivaram


On Sun, Mar 30, 2014 at 3:12 PM, Aureliano Buendia buendia...@gmail.com wrote:

 Hi,

 Spark-ec2 uses rsync to deploy many applications. It seems that over time more
 and more applications have been added to the script, which has
 significantly slowed down the setup time.

 Perhaps the script could be restructured this way: instead of
 rsyncing N times, once per application, we could have one rsync which deploys
 all N applications.

 This should noticeably speed up the setup, especially for clusters
 with many nodes.



Re: Can we convert scala.collection.ArrayBuffer[(Int,Double)] to org.spark.RDD[(Int,Double])

2014-03-30 Thread Mayur Rustagi
The Scala object needs to be sent to the workers to be used as an RDD;
parallelize is a way to do that. What are you looking to do?
You can also serialize the Scala object to HDFS/disk and load it from there.
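
A minimal sketch of both routes; the paths are illustrative:

import scala.collection.mutable.ArrayBuffer

val buf = ArrayBuffer((1, 1.0), (2, 2.0), (3, 3.0))

// Route 1: ArrayBuffer is a Seq, so parallelize takes it directly.
val rdd: org.apache.spark.rdd.RDD[(Int, Double)] = sc.parallelize(buf)

// Route 2: serialize to HDFS/disk once, and load it back as an RDD later.
rdd.saveAsObjectFile("hdfs:///tmp/pairs")
val reloaded = sc.objectFile[(Int, Double)]("hdfs:///tmp/pairs")
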
Regards
Mayur

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi



On Sun, Mar 30, 2014 at 6:22 AM, yh18190 yh18...@gmail.com wrote:

 Hi,

 Can we directly convert a Scala collection to a Spark RDD data type without
 using the parallelize method?
 Is there any way to create a custom converted RDD datatype from a Scala type
 using some typecast like that?

 Please suggest.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Can-we-convert-scala-collection-ArrayBuffer-Int-Double-to-org-spark-RDD-Int-Double-tp3486.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: SQL on Spark - Shark or SparkSQL

2014-03-30 Thread Mayur Rustagi
+1. I have done a few installations of Shark with customers using Hive; they
love it. It would be good to maintain compatibility with the Metastore & QL till
we have a substantial reason to break off (like BlinkDB).

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi



On Sun, Mar 30, 2014 at 2:46 AM, Nicholas Chammas 
nicholas.cham...@gmail.com wrote:

 This is a great question. We are in the same position, having not invested
 in Hive yet and looking at various options for SQL-on-Hadoop.


 On Sat, Mar 29, 2014 at 9:48 PM, Manoj Samel manojsamelt...@gmail.com wrote:

 Hi,

 In context of the recent Spark SQL announcement (
 http://databricks.com/blog/2014/03/26/Spark-SQL-manipulating-structured-data-using-Spark.html
 ).

 If there is no existing investment in Hive/Shark, would it be worth
 starting new SQL work using SparkSQL rather than Shark?

 * It seems the Shark SQL core will use more and more of SparkSQL
 * From the blog, it seems Shark carries baggage from Hive that is not needed
 in this case

 On the other hand, there seem to be two shortcomings of SparkSQL (from a
 quick scan of the blog and docs)

 * SparkSQL will have fewer features than Shark/Hive QL, at least for now.
 * The standalone SharkServer feature will not be available in SparkSQL.

 Can someone from Databricks shed light on the long-term roadmap? It would
 help avoid investing in an older technology (or in two technologies) for work
 that has no Hive needs.

 Thanks,

 PS: Great work on SparkSQL





Re: SparkSQL where with BigDecimal type gives stacktrace

2014-03-30 Thread Manoj Samel
Hi,

Would the same issue be present for other Java types like Date?

Converting the person/teenager example on Patrick's page reproduces the
problem ...

Thanks,


scala> import scala.math
import scala.math

scala> case class Person(name: String, age: BigDecimal)
defined class Person

scala> val people =
  sc.textFile("/data/spark/examples/src/main/resources/people.txt")
    .map(_.split(","))
    .map(p => Person(p(0), BigDecimal(p(1).trim.toInt)))
14/03/31 00:23:40 INFO MemoryStore: ensureFreeSpace(32960) called with
curMem=0, maxMem=308713881
14/03/31 00:23:40 INFO MemoryStore: Block broadcast_0 stored as values to
memory (estimated size 32.2 KB, free 294.4 MB)
people: org.apache.spark.rdd.RDD[Person] = MappedRDD[3] at map at
<console>:20

scala> people take 1
...

scala> val t = people.where('age > 12)
scala.MatchError: scala.BigDecimal (of class
scala.reflect.internal.Types$TypeRef$$anon$3)
at
org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:41)
at
org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:45)
at
org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:45)
at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at
org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:45)
at
org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:38)
at
org.apache.spark.sql.catalyst.ScalaReflection$.attributesFor(ScalaReflection.scala:32)
at
org.apache.spark.sql.execution.ExistingRdd$.fromProductRdd(basicOperators.scala:128)
at org.apache.spark.sql.SQLContext.createSchemaRDD(SQLContext.scala:79)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:22)
at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:27)
at $iwC$$iwC$$iwC$$iwC.<init>(<console>:29)
at $iwC$$iwC$$iwC.<init>(<console>:31)
at $iwC$$iwC.<init>(<console>:33)
at $iwC.<init>(<console>:35)
at <init>(<console>:37)
at .<init>(<console>:41)
at .<clinit>(<console>)
at .<init>(<console>:7)
at .<clinit>(<console>)
at $print(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:777)
at
org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1045)
at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:614)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:645)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:609)
at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:795)
at
org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:840)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:752)
at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:600)
at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:607)
at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:610)
at
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:935)
at
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:883)
at
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:883)
at
scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:883)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:981)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
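
For reference, a minimal workaround sketch along the lines mentioned in the
original post: keep schema-visible fields as types ScalaReflection already
handles (e.g. Double for amounts, String for dates) and convert to
BigDecimal/Date outside the schema. Names mirror the example above:

case class Person(name: String, age: Double)

val people =
  sc.textFile("/data/spark/examples/src/main/resources/people.txt")
    .map(_.split(","))
    .map(p => Person(p(0), p(1).trim.toDouble))

val t = people.where('age > 12)  // no MatchError: Double has a schema mapping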



On Sun, Mar 30, 2014 at 11:04 AM, Aaron Davidson ilike...@gmail.com wrote:

 Well, the error is coming from this case statement not matching on the
 BigDecimal type:
 https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala#L41

 This seems to be a bug because there is a corresponding Catalyst DataType
 for BigDecimal, just no way to produce a schema for it. A patch should be
 straightforward enough to match against typeOf[BigDecimal] assuming this
 was not for some reason intentional.


 On Sun, Mar 30, 2014 at 10:43 AM, smallmonkey...@hotmail.com 
 smallmonkey...@hotmail.com wrote:

  can I get the whole operation? then i can try to locate  the error

 --
  smallmonkey...@hotmail.com

  *From:* Manoj Samel manojsamelt...@gmail.com
 *Date:* 2014-03-31 01:16
 *To:* user user@spark.apache.org
 *Subject:* SparkSQL where with BigDecimal type gives stacktrace
  Hi,

 If I do a where on BigDecimal, I get a stack trace. Changing 

Re: [shark-users] SQL on Spark - Shark or SparkSQL

2014-03-30 Thread Matei Zaharia
Hi Manoj,

At the current time, for drop-in replacement of Hive, it will be best to stick 
with Shark. Over time, Shark will use the Spark SQL backend, but should remain 
deployable the way it is today (including launching the SharkServer, using the 
Hive CLI, etc). Spark SQL is better for accessing Hive data within a Spark 
program though, where its APIs are richer and easier to link to than the 
SharkContext.sql2rdd we had previously provided in Shark.

So in a nutshell, if you have a Shark deployment today, or need the HiveServer, 
then going with Shark will be fine and we will switch out the backend in a 
future release (we’ll probably create a preview of this even before we’re ready 
to fully switch). If you just want to run SQL queries or load SQL data within a 
Spark program, try out Spark SQL.

Matei

On Mar 30, 2014, at 4:46 PM, Mayur Rustagi mayur.rust...@gmail.com wrote:

 +1. I have done a few installations of Shark with customers using Hive; they
 love it. It would be good to maintain compatibility with the Metastore & QL till we
 have a substantial reason to break off (like BlinkDB).
 
 Mayur Rustagi
 Ph: +1 (760) 203 3257
 http://www.sigmoidanalytics.com
 @mayur_rustagi
 
 
 
 On Sun, Mar 30, 2014 at 2:46 AM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:
 This is a great question. We are in the same position, having not invested in 
 Hive yet and looking at various options for SQL-on-Hadoop.
 
 
 On Sat, Mar 29, 2014 at 9:48 PM, Manoj Samel manojsamelt...@gmail.com wrote:
 Hi,
 
 In context of the recent Spark SQL announcement 
 (http://databricks.com/blog/2014/03/26/Spark-SQL-manipulating-structured-data-using-Spark.html).
 
 If there is no existing investment in Hive/Shark, would it be worth starting
 new SQL work using SparkSQL rather than Shark?

 * It seems the Shark SQL core will use more and more of SparkSQL
 * From the blog, it seems Shark carries baggage from Hive that is not needed in
 this case

 On the other hand, there seem to be two shortcomings of SparkSQL (from a
 quick scan of the blog and docs)

 * SparkSQL will have fewer features than Shark/Hive QL, at least for now.
 * The standalone SharkServer feature will not be available in SparkSQL.

 Can someone from Databricks shed light on the long-term roadmap? It would
 help avoid investing in an older technology (or in two technologies) for work
 with no Hive needs.
 
 Thanks,
 
 PS: Great work on SparkSQL
 
 
 
 



groupBy RDD does not have grouping column ?

2014-03-30 Thread Manoj Samel
Hi,

If I create a groupBy('a)(Sum('b) as 'foo, Sum('c) as 'bar), then the
resulting RDD should have 'a, 'foo and 'bar.

The result RDD just shows 'foo and 'bar and is missing 'a

Thoughts?
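
One likely workaround sketch, assuming the DSL returns exactly the listed
aggregate expressions: include the grouping column among them. The exact form
may differ in the current snapshot:

val grouped = rdd.groupBy('a)('a, Sum('b) as 'foo, Sum('c) as 'bar)  // 'a now appears in the output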

Thanks,

Manoj


Re: Using ProtoBuf 2.5 for messages with Spark Streaming

2014-03-30 Thread Vipul Pandey
I'm using ScalaBuff (which depends on protobuf 2.5) and am facing the same issue.
Any word on this one?
On Mar 27, 2014, at 6:41 PM, Kanwaldeep kanwal...@gmail.com wrote:

 We are using Protocol Buffers 2.5 to send messages to Spark Streaming 0.9 with
 a Kafka stream setup. I have Protocol Buffers 2.5 as part of the uber jar deployed
 on each of the Spark worker nodes.
 The message is compiled using 2.5, but at runtime it is being
 de-serialized by 2.4.1, as I'm getting the following exception:
 
 java.lang.VerifyError (java.lang.VerifyError: class
 com.snc.sinet.messages.XServerMessage$XServer overrides final method
 getUnknownFields.()Lcom/google/protobuf/UnknownFieldSet;)
 java.lang.ClassLoader.defineClass1(Native Method)
 java.lang.ClassLoader.defineClassCond(ClassLoader.java:631)
 java.lang.ClassLoader.defineClass(ClassLoader.java:615)
 java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
 
 Any suggestions on how I could still use ProtoBuf 2.5? Based on the ticket
 https://spark-project.atlassian.net/browse/SPARK-995, we should be able to
 use a different version of protobuf in the application.
 
 
 
 
 
 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Using-ProtoBuf-2-5-for-messages-with-Spark-Streaming-tp3396.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.



batching the output

2014-03-30 Thread Vipul Pandey
Hi,

I need to batch the values in my final RDD before writing out to HDFS. The idea
is to batch multiple rows into a protobuf and write those batches out - mostly
to save some space, as a lot of the metadata is the same.
e.g. for 1,2,3,4,5,6, just batch them as (1,2), (3,4), (5,6) and save three
records instead of 6.

What I'm doing is using mapPartitions with the grouped function of
the iterator, giving it a groupSize.

val protoRDD: RDD[MyProto] =
  rdd.mapPartitions[Profiles](_.grouped(groupSize).map(seq => {
    val profiles = MyProto(...)
    seq.foreach(x => {
      val row = new Row(x._1.toString)
      row.setFloatValue(x._2)
      profiles.addRow(row)
    })
    profiles
  }))
I haven't been able to test it out because of a separate issue (protobuf
version mismatch - in a different thread), but I'm hoping it will work.

Is there a better/straight-forward way of doing this?

Thanks
Vipul
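
For reference, a self-contained sketch of the same grouped-iterator pattern
with plain tuples (hypothetical stand-ins for the MyProto/Row types above),
runnable in the shell:

val groupSize = 2
val rdd = sc.parallelize(Seq(("a", 1.0f), ("b", 2.0f), ("c", 3.0f), ("d", 4.0f), ("e", 5.0f), ("f", 6.0f)))

// Each output record carries a whole batch of (key, value) rows.
val batched = rdd.mapPartitions(_.grouped(groupSize).map(_.toVector))

batched.saveAsObjectFile("hdfs:///tmp/batched-output")  // or map each batch to a protobuf before writing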