Hi Daniel,
Thanks for your email! We don't have a book (yet?) specifically on SparkR,
but here's a list of helpful tutorials / links you can check out (I am
listing them in roughly basic-to-advanced order):
- AMPCamp5 SparkR exercises
http://ampcamp.berkeley.edu/5/exercises/sparkr.html. This
Hi Pranay,
If this data format is to be assumed, then I believe the issue starts at these lines:
lines <- textFile(sc, "/sparkdev/datafiles/covariance.txt")
totals <- lapply(lines, function(lines)
After the first line, `lines` becomes an RDD of strings, each of which
is a line of the form "1,1".
I agree that this is definitely useful.
One related project I know of is Sparkling [1] (also see talk at Spark
Summit 2014 [2]), but it'd be great (and I imagine somewhat
challenging) to visualize the *physical execution* graph of a Spark
job.
[1] http://pr01.uml.edu/
[2]
countDistinct was recently added and is in 1.0.2. If you are using that
or the master branch, you could try something like:
r.select('keyword, countDistinct('userId)).groupBy('keyword)
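If you're on an older version where countDistinct isn't available, one fallback (just a sketch with made-up data, assuming a spark-shell where `sc` is already defined) is to compute distinct users per keyword with plain RDD operations:
val events = sc.parallelize(Seq(("spark", "u1"), ("spark", "u2"), ("spark", "u1"), ("sql", "u3")))
val distinctUsersPerKeyword = events.distinct().mapValues(_ => 1L).reduceByKey(_ + _)   // drop duplicate (keyword, userId) pairs, then count per keyword
distinctUsersPerKeyword.collect().foreach(println)   // (spark,2), (sql,1)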
On Thu, Jul 31, 2014 at 12:27 PM, buntu buntu...@gmail.com wrote:
I'm looking to write a select statement
Buntu Dev buntu...@gmail.com wrote:
Thanks Zongheng for the pointer. Is there a way to achieve the same in 1.0.0?
On Thu, Jul 31, 2014 at 1:43 PM, Zongheng Yang zonghen...@gmail.com wrote:
countDistinct was recently added and is in 1.0.2. If you are using that
or the master branch, you could
To add to this: for this many (>= 20) machines I usually use at least
--wait 600.
On Wed, Jul 30, 2014 at 9:10 AM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
William,
The error you are seeing is misleading. There is no need to terminate the
cluster and start over.
Just re-run your
As Hao already mentioned, using 'hive' (the HiveContext) throughout would
work.
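For example, a minimal sketch of that (assuming a spark-shell where `sc` is defined and the 1.0.x-era API; the table names are only illustrative): create a single HiveContext up front and derive every SchemaRDD from it, so they are all bound to the same context.
import org.apache.spark.sql.hive.HiveContext
val hive = new HiveContext(sc)
val sample = hive.hql("SELECT key, value FROM src")   // SchemaRDD bound to `hive`
sample.registerAsTable("sample")                      // later queries go through `hive` as well
val filtered = hive.hql("SELECT key FROM sample WHERE key > 10")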
On Monday, July 28, 2014, Cheng, Hao hao.ch...@intel.com wrote:
In your code snippet, sample is actually a SchemaRDD, and a SchemaRDD
is bound to a particular SQLContext at runtime; I don't think we can
Do you mean that the texts of the SQL queries are hardcoded in the
code? What do you mean by not being able to share the SQL with all workers?
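For what it's worth, here is a rough sketch of one way to build the SQL text at runtime instead of hardcoding it (assuming a spark-shell where `sc` is defined; the case class, table name, and keyword value are made up for illustration):
import org.apache.spark.sql.SQLContext
case class Event(keyword: String, userId: String)
val sqlContext = new SQLContext(sc)
import sqlContext._                                   // brings in createSchemaRDD and sql()
sc.parallelize(Seq(Event("spark", "u1"), Event("sql", "u2"))).registerAsTable("events")
val kw = "spark"                                      // imagine this arrives at runtime (user input, config, ...)
val matches = sql(s"SELECT userId FROM events WHERE keyword = '$kw'")
matches.collect().foreach(println)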
On Tue, Jul 22, 2014 at 4:03 PM, hsy...@gmail.com hsy...@gmail.com wrote:
Hi guys,
I'm able to run some Spark SQL examples, but the SQL is static in the code. I
Siyuan
On Tue, Jul 22, 2014 at 4:15 PM, Zongheng Yang zonghen...@gmail.com wrote:
Do you mean that the texts of the SQL queries are hardcoded in the
code? What do you mean by not being able to share the SQL with all workers?
On Tue, Jul 22, 2014 at 4:03 PM, hsy...@gmail.com hsy...@gmail.com
wrote:
Hi guys
One way is to set this in your conf/spark-defaults.conf:
spark.executor.extraLibraryPath /path/to/native/lib
The key is documented here:
http://spark.apache.org/docs/latest/configuration.html
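Equivalently (a sketch; the app name and path are placeholders), the same key can be set programmatically on the SparkConf that your application passes to its SparkContext:
import org.apache.spark.SparkConf
val conf = new SparkConf()
  .setAppName("app-with-native-libs")                              // placeholder name
  .set("spark.executor.extraLibraryPath", "/path/to/native/lib")   // same key as in spark-defaults.conf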
On Thu, Jul 17, 2014 at 1:25 PM, Eric Friedman
eric.d.fried...@gmail.com wrote:
I used to use
Sounds like a job for Spark SQL:
http://spark.apache.org/docs/latest/sql-programming-guide.html !
On Tue, Jul 15, 2014 at 11:25 AM, Nick Pentreath
nick.pentre...@gmail.com wrote:
You can use .distinct.count on your user RDD.
What are you trying to achieve with the time group by?
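For example (made-up sample data, assuming a spark-shell where `sc` is defined):
val users = sc.parallelize(Seq("u1", "u2", "u1", "u3"))
users.distinct().count()   // => 3 unique users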
FWIW, I am unable to reproduce this using the example program locally.
On Tue, Jul 15, 2014 at 11:56 AM, Keith Simmons keith.simm...@gmail.com wrote:
Nope. All of them are registered from the driver program.
However, I think we've found the culprit. If the join column between two
tables is
Hi Keith,
I did reproduce this using local-cluster[2,2,1024], and the errors
look almost the same. Just wondering, despite the errors did your
program output any result for the join? On my machine, I could see the
correct output.
Zongheng
On Tue, Jul 15, 2014 at 1:46 PM,
Hi Keith and gorenuru,
This patch (https://github.com/apache/spark/pull/1423) solves the
errors for me in my local tests. If possible, can you guys test this
out to see if it fixes your test programs?
Thanks,
Zongheng
On Tue, Jul 15, 2014 at 3:08 PM, Zongheng Yang zonghen...@gmail.com wrote:
Hey Jerry,
When you ran these queries using different methods, did you see any
discrepancy in the returned results (i.e. the counts)?
On Thu, Jul 10, 2014 at 5:55 PM, Michael Armbrust
mich...@databricks.com wrote:
Yeah, sorry. I think you are seeing some weirdness with partitioned tables
that
Hi Haoming,
For your spark-submit question: can you try using an assembly jar
(sbt/sbt assembly will build it for you)? Another thing to check is
if there is any package structure that contains your SimpleApp; if so,
you should include the hierarchical (fully qualified) class name.
Zongheng
On Thu, Jul 10, 2014 at 11:33
Hi durin,
I just tried this example (nice data, by the way!), *with each JSON
object on one line*, and it worked fine:
scala> rdd.printSchema()
root
 |-- entities: org.apache.spark.sql.catalyst.types.StructType$@13b6cdef
 |    |-- friends:
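For reference, a rough sketch of how such a SchemaRDD can be produced (assuming a spark-shell on a build that includes the recently merged JSON support; the file path is illustrative):
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val rdd = sqlContext.jsonFile("/path/to/tweets.json")   // expects one JSON object per line
rdd.printSchema()                                       // schema is inferred from the data
rdd.registerAsTable("tweets")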
Hi Stuti,
Yes, you do need to install R on all nodes. Furthermore, the rJava
library is also required, which can be installed simply by running
install.packages("rJava") in the R shell. Some more installation
instructions after that step can be found in the README here:
If your input data is JSON, you can also try out the recently merged
initial JSON support:
https://github.com/apache/spark/commit/d2f4f30b12f99358953e2781957468e2cfe3c916
On Wed, Jun 18, 2014 at 5:27 PM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
That’s pretty neat! So I guess if you
I may be wrong, but I think RDDs must be created inside a
SparkContext. To somehow preserve the order of the list, perhaps you
could try something like:
sc.parallelize((1 to xs.size).zip(xs))
On Fri, Jun 13, 2014 at 6:08 PM, SK skrishna...@gmail.com wrote:
Hi,
I have a List[ (String, Int,
Sorry I wasn't being clear. The idea off the top of my head was that
you could append an original position index to each element (using the
line above), and modify whatever processing functions you have in
mind to make them aware of these indices. And I think you are right
that RDD collections
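Concretely, a small sketch of that idea (assuming a spark-shell where `sc` is defined; the list and the "processing" step are made up): carry the original position alongside each element, process the values, then sort by position to restore the order.
val xs = List("c", "a", "b")
val indexed = sc.parallelize((1 to xs.size).zip(xs))          // RDD[(Int, String)], original position kept
val processed = indexed.mapValues(_.toUpperCase)              // index-aware processing of the values
val inOriginalOrder = processed.sortByKey().values.collect()  // Array(C, A, B)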
Hi,
Just wondering if you can try this:
val obj = sql("select manufacturer, count(*) as examcount from pft group by manufacturer order by examcount desc")
obj.collect()
obj.queryExecution.executedPlan.executeCollect()
and time the third line alone. It could be that Spark SQL is taking
some time to
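One rough way to time just that third line (a sketch, assuming `obj` from the snippet above; not a rigorous benchmark):
val start = System.nanoTime()
obj.queryExecution.executedPlan.executeCollect()              // physical execution only
val elapsedMs = (System.nanoTime() - start) / 1e6
println(s"executeCollect() took $elapsedMs ms")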