This is a great question. We are in the same position, having not invested
in Hive yet and looking at various options for SQL-on-Hadoop.
On Sat, Mar 29, 2014 at 9:48 PM, Manoj Samel manojsamelt...@gmail.com wrote:
Hi,
In context of the recent Spark SQL announcement (
The GraphX team has been using Wikipedia dumps from
http://dumps.wikimedia.org/enwiki/. Unfortunately, these are in a less
convenient format than the Freebase dumps. In particular, an article may
span multiple lines, so more involved input parsing is required.
Dan Crankshaw (cc'd) wrote a driver
In particular, we are using this dataset:
http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
Ankur http://www.ankurdave.com/
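The multi-line issue Ankur describes can be sketched in plain Scala: group the dump's lines into `<page>...</page>` blocks. This is a minimal illustrative sketch, not Dan's actual driver; the function name and the per-partition approach are assumptions, and a real driver would need a proper Hadoop XML input format, since a page can span partition boundaries and this approach would split it.

```scala
// Group a stream of dump lines into whole <page>...</page> blocks.
// Only safe per partition: pages crossing a partition boundary are lost.
def pages(lines: Iterator[String]): Iterator[String] = {
  val buf = scala.collection.mutable.ArrayBuffer[String]()
  var inPage = false
  lines.flatMap { line =>
    if (line.trim.startsWith("<page>")) { inPage = true; buf.clear() }
    if (inPage) buf += line
    if (line.trim.startsWith("</page>")) {
      inPage = false
      Iterator(buf.mkString("\n"))
    } else Iterator.empty
  }
}

// Applied per partition on the decompressed dump (path is a placeholder):
// val articles = sc.textFile("enwiki-latest-pages-articles.xml").mapPartitions(pages)
```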
On Sun, Mar 30, 2014 at 12:45 AM, Ankur Dave ankurd...@gmail.com wrote:
The GraphX team has been using Wikipedia dumps from
Aureliano, you're correct that this is not validation error, which is
computed as the residuals on out-of-training-sample data and helps
minimize overfitting variance.
However, in this example, the errors are correctly referred to as training
error, which is what you might compute on a per-iteration
Hi,
Can we convert a Scala collection directly to a Spark RDD without
using the parallelize method?
Is there any way to create a custom RDD from a Scala type
using some kind of typecast?
Please suggest.
Hi,
On
http://people.apache.org/~pwendell/catalyst-docs/sql-programming-guide.html,
I am trying to run the code under "Writing Language-Integrated Relational Queries"
(I have the 1.0.0 snapshot).
I am running into an error on
val people: RDD[Person] // An RDD of case class objects, from the first
example.
Hi,
I am trying SparkSQL based on the example on doc ...
val people = sc.textFile("/data/spark/examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
val olderThanTeens = people.where('age > 19)
val youngerThanTeens = people.where('age < 13)
val
Hi,
If I do a where on BigDecimal, I get a stack trace. Changing BigDecimal to
Double works ...
scala> case class JournalLine(account: String, credit: BigDecimal, debit:
BigDecimal, date: String, company: String, currency: String, costcenter:
String, region: String)
defined class JournalLine
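Since the thread reports that Double works where BigDecimal throws, one hedged workaround is to keep amounts as Double in the case class backing the schema and convert at load time. The field list is copied from JournalLine above; the D-suffixed class name and the helper are made up for illustration.

```scala
// Workaround sketch: Double-backed schema class, converted from BigDecimal
// before registering the RDD as a table.
case class JournalLineD(account: String, credit: Double, debit: Double,
                        date: String, company: String, currency: String,
                        costcenter: String, region: String)

def fromBig(account: String, credit: BigDecimal, debit: BigDecimal,
            date: String, company: String, currency: String,
            costcenter: String, region: String): JournalLineD =
  JournalLineD(account, credit.toDouble, debit.toDouble,
               date, company, currency, costcenter, region)
```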
Is there a way to see the 'Application Detail UI' page (at master:4040) for
completed applications? Currently, I can see that page only for running
applications; I would like to see the various numbers for an application after
it has completed.
Can I get the whole operation? Then I can try to locate the error.
smallmonkey...@hotmail.com
From: Manoj Samel
Date: 2014-03-31 01:16
To: user
Subject: SparkSQL where with BigDecimal type gives stacktrace
Hi,
If I do a where on BigDecimal, I get a stack trace. Changing BigDecimal to
This will be a feature in Spark 1.0 but is not yet released. In 1.0 Spark
applications can persist their state so that the UI can be reloaded after
they have completed.
- Patrick
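For reference, a sketch of what that configuration looks like once 1.0 is out; the property names below assume the event-log mechanism Patrick refers to and should be checked against the 1.0 docs, and the log directory is a placeholder.

```
# conf/spark-defaults.conf (sketch, assumed property names)
spark.eventLog.enabled  true
spark.eventLog.dir      hdfs:///spark-events
```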
On Sun, Mar 30, 2014 at 10:30 AM, David Thomas dt5434...@gmail.com wrote:
Is there a way to see 'Application
Hi,
Spark-ec2 uses rsync to deploy many applications. It seems that over time more
and more applications have been added to the script, which has
significantly slowed down the setup time.
Perhaps the script could be restructured this way: instead of rsyncing
N times per application, we could have
That is a good idea, though I am not sure how much it will help, as the time to
rsync also depends on the amount of data being copied. The other problem
is that sometimes we have dependencies across packages, so the first needs
to be running before the second can start, etc.
However, I agree that it
The Scala object needs to be sent to the workers to be used as an RDD;
parallelize is one way to do that. What are you looking to do?
You can also serialize the Scala object to HDFS/disk and load it from there.
Regards
Mayur
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi
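Mayur's two options can be sketched as follows. This is illustrative, not code from the thread; the path is a placeholder, and the block assumes a live SparkContext, so it only runs inside a Spark application.

```scala
import org.apache.spark.SparkContext

def demo(sc: SparkContext): Unit = {
  // Option 1: parallelize ships a local collection to the workers as an RDD.
  val rdd = sc.parallelize(Seq(1, 2, 3))

  // Option 2: persist once to HDFS/disk, then load it directly as an RDD later.
  rdd.saveAsObjectFile("hdfs:///tmp/myData")
  val reloaded = sc.objectFile[Int]("hdfs:///tmp/myData")
}
```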
+1. I have done a few installations of Shark with customers using Hive; they
love it. It would be good to maintain compatibility with the Metastore and QL until
we have a substantial reason to break off (like BlinkDB).
Hi,
Would the same issue be present for other Java types like Date?
Converting the person/teenager example on Patrick's page reproduces the
problem ...
Thanks,
scala> import scala.math
import scala.math
scala> case class Person(name: String, age: BigDecimal)
defined class Person
scala> val
Hi Manoj,
At the current time, for drop-in replacement of Hive, it will be best to stick
with Shark. Over time, Shark will use the Spark SQL backend, but should remain
deployable the way it is today (including launching the SharkServer, using the
Hive CLI, etc). Spark SQL is better for
Hi,
If I create a groupBy('a)(Sum('b) as 'foo, Sum('c) as 'bar), then the
resulting RDD should have 'a, 'foo and 'bar.
The resulting RDD shows only 'foo and 'bar; it is missing 'a.
Thoughts?
Thanks,
Manoj
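The behavior Manoj sees suggests the DSL's output schema is exactly the aggregate-expression list. If so, one hedged workaround is to repeat the grouping attribute in that list; this is a guess against the 1.0-snapshot Catalyst DSL, not a confirmed fix, and cannot run outside a Spark SQL application.

```scala
// Sketch: repeat 'a in the aggregate list so it appears in the output
// schema alongside the sums (assumed 1.0-snapshot SchemaRDD DSL).
// val grouped = rdd.groupBy('a)('a, Sum('b) as 'foo, Sum('c) as 'bar)
```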
I'm using ScalaBuff (which depends on protobuf 2.5) and am facing the same issue.
Any word on this one?
On Mar 27, 2014, at 6:41 PM, Kanwaldeep kanwal...@gmail.com wrote:
We are using Protocol Buffers 2.5 to send messages to Spark Streaming 0.9 with a
Kafka stream setup. I have Protocol Buffers 2.5
Hi,
I need to batch the values in my final RDD before writing them out to HDFS. The idea
is to batch multiple rows into one protobuf and write those batches out, mostly
to save some space, since a lot of the metadata is the same.
E.g., for 1,2,3,4,5,6, batch them as (1,2), (3,4), (5,6) and save three records.
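One way to do this batching is `Iterator.grouped` inside `mapPartitions`. A minimal sketch; the batch size of 2 comes from the example, while the output path and the use of object files (rather than the protobuf writer) are assumptions.

```scala
// Batch rows per partition before writing; grouped(2) yields
// (1,2), (3,4), (5,6) for the example sequence 1..6.
def batch[T](rows: Iterator[T], size: Int = 2): Iterator[Seq[T]] =
  rows.grouped(size)

// On the RDD (SparkContext assumed; path is a placeholder):
// rdd.mapPartitions(batch(_)).saveAsObjectFile("hdfs:///out/batched")
```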