Re: SQL on Spark - Shark or SparkSQL

2014-03-30 Thread Nicholas Chammas
This is a great question. We are in the same position, having not invested in Hive yet and looking at various options for SQL-on-Hadoop. On Sat, Mar 29, 2014 at 9:48 PM, Manoj Samel manojsamelt...@gmail.com wrote: Hi, In context of the recent Spark SQL announcement (

Re: WikipediaPageRank Data Set

2014-03-30 Thread Ankur Dave
The GraphX team has been using Wikipedia dumps from http://dumps.wikimedia.org/enwiki/. Unfortunately, these are in a less convenient format than the Freebase dumps. In particular, an article may span multiple lines, so more involved input parsing is required. Dan Crankshaw (cc'd) wrote a driver
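
For reference, a minimal sketch of one way to cope with pages spanning multiple lines: set a custom record delimiter so each input record covers one whole article element. This is not the driver Dan wrote, and the textinputformat.record.delimiter property assumes a Hadoop version that supports it; the path is illustrative:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    val conf = new Configuration(sc.hadoopConfiguration)
    // End each record at </page> so one record spans a whole multi-line article
    conf.set("textinputformat.record.delimiter", "</page>")
    val pages = sc.newAPIHadoopFile("hdfs:///enwiki/pages-articles.xml",
        classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
      .map(_._2.toString.trim)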

Re: WikipediaPageRank Data Set

2014-03-30 Thread Ankur Dave
In particular, we are using this dataset: http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2 Ankur http://www.ankurdave.com/ On Sun, Mar 30, 2014 at 12:45 AM, Ankur Dave ankurd...@gmail.com wrote: The GraphX team has been using Wikipedia dumps from

Re: Cross validation is missing in machine learning examples

2014-03-30 Thread Christopher Nguyen
Aureliano, you're correct that this is not validation error, which is computed as the residuals on out-of-training-sample data, and helps minimize overfit variance. However, in this example, the errors are correctly referred to as training error, which is what you might compute on a per-iteration
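
A minimal sketch of what held-out validation error (as opposed to the example's training error) could look like; the sample/subtract split, the parsedData name, and the iteration count are assumptions, not code from the thread:

    import org.apache.spark.SparkContext._
    import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}
    import org.apache.spark.rdd.RDD

    // parsedData: RDD[LabeledPoint] is assumed to exist, as in the MLlib guide.
    // Hold ~30% of the data out of training so the error is out-of-sample.
    val training = parsedData.sample(false, 0.7, 11)
    val heldOut = parsedData.subtract(training)

    val model = LinearRegressionWithSGD.train(training, 100)

    def mse(data: RDD[LabeledPoint]): Double = {
      val sq = data.map { p => val d = model.predict(p.features) - p.label; d * d }
      sq.sum / sq.count
    }
    println("training MSE: " + mse(training) + ", validation MSE: " + mse(heldOut))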

Can we convert scala.collection.ArrayBuffer[(Int,Double)] to org.spark.RDD[(Int,Double)]

2014-03-30 Thread yh18190
Hi, Can we directly convert a Scala collection to a Spark RDD without using the parallelize method? Is there any way to create an RDD from a Scala type using some kind of typecast? Please suggest.
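
For reference, a minimal sketch of the standard approach, assuming an existing SparkContext named sc as in spark-shell:

    import scala.collection.mutable.ArrayBuffer
    import org.apache.spark.rdd.RDD

    val buf = ArrayBuffer((1, 2.0), (2, 3.5), (3, 0.1))
    // parallelize ships the local collection to the cluster as an RDD;
    // there is no implicit cast from a Scala collection to an RDD
    val rdd: RDD[(Int, Double)] = sc.parallelize(buf)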

Error in SparkSQL Example

2014-03-30 Thread Manoj Samel
Hi, On http://people.apache.org/~pwendell/catalyst-docs/sql-programming-guide.html, I am trying to run the code under Writing Language-Integrated Relational Queries (I have the 1.0.0 snapshot). I am running into an error on the line val people: RDD[Person] // An RDD of case class objects, from the first example.
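
That line in the guide is a type-annotation placeholder rather than runnable code, which is the likely source of the error. A sketch of what it stands for, following the guide's first example (the file path is illustrative):

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext._

    case class Person(name: String, age: Int)

    // Build the RDD[Person] the placeholder line refers to
    val people = sc.textFile("examples/src/main/resources/people.txt")
                   .map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt))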

Shouldn't the UNION of SchemaRDDs produce SchemaRDD?

2014-03-30 Thread Manoj Samel
Hi, I am trying SparkSQL based on the example in the doc ... val people = sc.textFile("/data/spark/examples/src/main/resources/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt)) val olderThanTeens = people.where('age > 19) val youngerThanTeens = people.where('age < 13) val
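
The where calls produce SchemaRDDs, but RDD.union is declared on plain RDDs, so its static result type drops the schema. A hedged sketch of the alternative, assuming this snapshot already carries the unionAll operator found on SchemaRDD in later Spark SQL builds:

    // unionAll is declared on SchemaRDD itself, so the schema is preserved
    val notTeens = olderThanTeens.unionAll(youngerThanTeens)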

SparkSQL where with BigDecimal type gives stacktrace

2014-03-30 Thread Manoj Samel
Hi, If I do a where on BigDecimal, I get a stack trace. Changing BigDecimal to Double works ... scala> case class JournalLine(account: String, credit: BigDecimal, debit: BigDecimal, date: String, company: String, currency: String, costcenter: String, region: String) defined class JournalLine

Spark webUI - application details page

2014-03-30 Thread David Thomas
Is there a way to see the 'Application Detail UI' page (at master:4040) for completed applications? Currently, I can see that page only for running applications; I would like to see the various numbers for the application after it has completed.

Re: SparkSQL where with BigDecimal type gives stacktrace

2014-03-30 Thread smallmonkey...@hotmail.com
Can I get the whole operation? Then I can try to locate the error. smallmonkey...@hotmail.com From: Manoj Samel Date: 2014-03-31 01:16 To: user Subject: SparkSQL where with BigDecimal type gives stacktrace Hi, If I do a where on BigDecimal, I get a stack trace. Changing BigDecimal to

Re: Spark webUI - application details page

2014-03-30 Thread Patrick Wendell
This will be a feature in Spark 1.0 but is not yet released. In 1.0 Spark applications can persist their state so that the UI can be reloaded after they have completed. - Patrick On Sun, Mar 30, 2014 at 10:30 AM, David Thomas dt5434...@gmail.com wrote: Is there a way to see 'Application
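
For reference, in the 1.0 release this surfaced as event logging. A sketch of the configuration (the app name and HDFS path are illustrative, and the property names assume the 1.0 APIs):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("MyApp")                                // illustrative name
      .set("spark.eventLog.enabled", "true")              // record UI events
      .set("spark.eventLog.dir", "hdfs:///spark-events")  // illustrative log dir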

Spark-ec2 setup is getting slower and slower

2014-03-30 Thread Aureliano Buendia
Hi, Spark-ec2 uses rsync to deploy many applications. It seems that over time more and more applications have been added to the script, which has significantly slowed down the setup time. Perhaps the script could be restructured this way: instead of rsyncing N times, once per application, we could have

Re: Spark-ec2 setup is getting slower and slower

2014-03-30 Thread Shivaram Venkataraman
That is a good idea, though I am not sure how much it will help, as the time to rsync also depends on the amount of data being copied. The other problem is that sometimes we have dependencies across packages, so the first needs to be running before the second can start, etc. However, I agree that it

Re: Can we convert scala.collection.ArrayBuffer[(Int,Double)] to org.spark.RDD[(Int,Double)]

2014-03-30 Thread Mayur Rustagi
The Scala object needs to be sent to the workers to be used as an RDD; parallelize is a way to do that. What are you looking to do? You can also serialize the Scala object to hdfs/disk and load it from there. Regards Mayur Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi
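
A sketch of the serialize-to-disk alternative Mayur mentions, using the object-file helpers (the path and sample data are illustrative):

    import org.apache.spark.rdd.RDD

    // Write the RDD out as sequence files of serialized objects...
    sc.parallelize(Seq((1, 2.0), (2, 3.5))).saveAsObjectFile("hdfs:///tmp/pairs")
    // ...and load it back later, here or in another application
    val restored: RDD[(Int, Double)] = sc.objectFile[(Int, Double)]("hdfs:///tmp/pairs")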

Re: SQL on Spark - Shark or SparkSQL

2014-03-30 Thread Mayur Rustagi
+1. I have done a few installations of Shark with customers using Hive; they love it. It would be good to maintain compatibility with the metastore and QL till we have a substantial reason to break off (like BlinkDB). Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi

Re: SparkSQL where with BigDecimal type gives stacktrace

2014-03-30 Thread Manoj Samel
Hi, Would the same issue be present for other Java types like Date? Converting the person/teenager example on Patrick's page reproduces the problem ... Thanks, scala> import scala.math import scala.math scala> case class Person(name: String, age: BigDecimal) defined class Person scala> val
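
One interim workaround (an assumption on my part, not the list's resolution): carry the values in types the reflection-based schema conversion already understands, e.g. for the JournalLine class from this thread:

    // Double in place of BigDecimal; an ISO-8601 String (or epoch-millis Long)
    // in place of java.util.Date
    case class JournalLine(account: String, credit: Double, debit: Double,
                           date: String, company: String, currency: String,
                           costcenter: String, region: String)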

Re: [shark-users] SQL on Spark - Shark or SparkSQL

2014-03-30 Thread Matei Zaharia
Hi Manoj, At the current time, for a drop-in replacement of Hive, it will be best to stick with Shark. Over time, Shark will use the Spark SQL backend, but it should remain deployable the way it is today (including launching the SharkServer, using the Hive CLI, etc). Spark SQL is better for

groupBy RDD does not have grouping column ?

2014-03-30 Thread Manoj Samel
Hi, If I create a groupBy('a)(Sum('b) as 'foo, Sum('c) as 'bar), then the resulting RDD should have 'a, 'foo and 'bar. The resulting RDD shows just 'foo and 'bar and is missing 'a. Thoughts? Thanks, Manoj
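
For reference, in the catalyst DSL the output columns of an aggregation are exactly the expressions passed in the second argument list, so a hedged fix (assuming a relation people with columns 'a, 'b, 'c) is to name the grouping attribute there as well:

    // Listing 'a among the aggregate expressions carries it into the output
    val grouped = people.groupBy('a)('a, Sum('b) as 'foo, Sum('c) as 'bar)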

Re: Using ProtoBuf 2.5 for messages with Spark Streaming

2014-03-30 Thread Vipul Pandey
I'm using ScalaBuff (which depends on protobuf 2.5) and facing the same issue. Any word on this one? On Mar 27, 2014, at 6:41 PM, Kanwaldeep kanwal...@gmail.com wrote: We are using Protocol Buffer 2.5 to send messages to Spark Streaming 0.9 with a Kafka stream setup. I have Protocol Buffer 2.5

batching the output

2014-03-30 Thread Vipul Pandey
Hi, I need to batch the values in my final RDD before writing them out to hdfs. The idea is to batch multiple rows into a protobuf and write those batches out, mostly to save some space, as a lot of the metadata is the same. E.g. given 1,2,3,4,5,6, just batch them as (1,2), (3,4), (5,6) and save three records
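
A sketch of one way to form such batches with mapPartitions and Iterator.grouped (the batch size, path, and text sink are assumptions; the real code would pack each group into a protobuf instead):

    // Group each partition's values into fixed-size batches (size 2 here);
    // note that batches do not cross partition boundaries
    val rdd = sc.parallelize(1 to 6)
    val batched = rdd.mapPartitions(_.grouped(2).map(_.toList))
    // yields records like List(1, 2), List(3, 4), List(5, 6)
    batched.saveAsTextFile("hdfs:///tmp/batched")  // a protobuf writer would go here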