Re: Book: Data Analysis with SparkR

2014-11-21 Thread Zongheng Yang
Hi Daniel, Thanks for your email! We don't have a book (yet?) specifically on SparkR, but here's a list of helpful tutorials / links you can check out (I am listing them in roughly basic -> advanced order): - AMPCamp5 SparkR exercises http://ampcamp.berkeley.edu/5/exercises/sparkr.html. This

Re: SparkR : lapplyPartition transforms the data in vertical format

2014-08-07 Thread Zongheng Yang
Hi Pranay, If this data format is to be assumed, then I believe the issue starts at these lines: lines <- textFile(sc, "/sparkdev/datafiles/covariance.txt") and totals <- lapply(lines, function(lines) ... After the first line, `lines` becomes an RDD of strings, each of which is a line of the form 1,1.

Re: Visualizing stage task dependency graph

2014-08-04 Thread Zongheng Yang
I agree that this is definitely useful. One related project I know of is Sparkling [1] (also see talk at Spark Summit 2014 [2]), but it'd be great (and I imagine somewhat challenging) to visualize the *physical execution* graph of a Spark job. [1] http://pr01.uml.edu/ [2]

Re: SchemaRDD select expression

2014-07-31 Thread Zongheng Yang
countDistinct was recently added and is in 1.0.2. If you are using that version or the master branch, you could try something like: r.select('keyword, countDistinct('userId)).groupBy('keyword) On Thu, Jul 31, 2014 at 12:27 PM, buntu buntu...@gmail.com wrote: I'm looking to write a select statement
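For reference, a minimal sketch of the same aggregation through the SQL string API, assuming a spark-shell session (so sc is in scope) on 1.0.2 or master; the Event row type, file path, and table name are illustrative assumptions, not from the thread:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD  // implicit RDD -> SchemaRDD conversion

    // Hypothetical row type; substitute the real schema.
    case class Event(keyword: String, userId: String)

    val events = sc.textFile("path/to/events.csv")
      .map(_.split(","))
      .map(r => Event(r(0), r(1)))
    events.registerAsTable("events")  // the 1.0.x-era API

    // COUNT(DISTINCT ...) expressed as a plain SQL string.
    sqlContext.sql("SELECT keyword, COUNT(DISTINCT userId) FROM events GROUP BY keyword")
      .collect()
      .foreach(println)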

Re: SchemaRDD select expression

2014-07-31 Thread Zongheng Yang
Buntu Dev buntu...@gmail.com wrote: Thanks Zongheng for the pointer. Is there a way to achieve the same in 1.0.0? On Thu, Jul 31, 2014 at 1:43 PM, Zongheng Yang zonghen...@gmail.com wrote: countDistinct was recently added and is in 1.0.2. If you are using that or the master branch, you could

Re: the EC2 setup script often will not allow me to SSH into my machines. Ideas?

2014-07-30 Thread Zongheng Yang
To add to this: for this many (>= 20) machines I usually use at least --wait 600. On Wed, Jul 30, 2014 at 9:10 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: William, The error you are seeing is misleading. There is no need to terminate the cluster and start over. Just re-run your
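--wait controls how long spark-ec2 waits for instances to reach an SSH-able state before it gives up. A hypothetical launch of a 20-slave cluster with the longer wait might look like the following (key pair name, identity file, and cluster name are placeholders):

    ./spark-ec2 -k my-keypair -i ~/my-keypair.pem -s 20 --wait 600 launch my-cluster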

Re: SparkSQL can not use SchemaRDD from Hive

2014-07-29 Thread Zongheng Yang
As Hao already mentioned, using 'hive' (the HiveContext) throughout would work. On Monday, July 28, 2014, Cheng, Hao hao.ch...@intel.com wrote: In your code snippet, sample is actually a SchemaRDD, and a SchemaRDD is bound to a certain SQLContext at runtime; I don't think we can
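A minimal sketch of what using one HiveContext throughout might look like in a spark-shell session; the table name src and the queries are illustrative assumptions:

    import org.apache.spark.sql.hive.HiveContext

    // One context for everything: a SchemaRDD is tied to the context that
    // created it, so mixing SQLContext and HiveContext does not work.
    val hive = new HiveContext(sc)

    val sample = hive.hql("SELECT key, value FROM src")  // hql() is the 1.0.x entry point
    sample.registerAsTable("sample_data")
    hive.hql("SELECT count(*) FROM sample_data").collect().foreach(println)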

Re: How to do an interactive Spark SQL

2014-07-22 Thread Zongheng Yang
Do you mean that the texts of the SQL queries are hardcoded in the code? What do you mean by cannot share the sql to all workers? On Tue, Jul 22, 2014 at 4:03 PM, hsy...@gmail.com hsy...@gmail.com wrote: Hi guys, I'm able to run some Spark SQL example but the sql is static in the code. I
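If the concern is only that the query text is hardcoded, note that sql() takes an ordinary runtime string; a minimal sketch that reads queries interactively from stdin (purely illustrative, and it assumes tables are already registered with this same context):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)

    // Execute query strings typed at the console until "quit" is entered.
    for (query <- scala.io.Source.stdin.getLines().takeWhile(_.trim != "quit")) {
      sqlContext.sql(query).collect().foreach(println)
    }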

Re: How to do an interactive Spark SQL

2014-07-22 Thread Zongheng Yang
Siyuan On Tue, Jul 22, 2014 at 4:15 PM, Zongheng Yang zonghen...@gmail.com wrote: Do you mean that the texts of the SQL queries are hardcoded in the code? What do you mean by cannot share the sql to all workers? On Tue, Jul 22, 2014 at 4:03 PM, hsy...@gmail.com hsy...@gmail.com wrote: Hi guys

Re: replacement for SPARK_LIBRARY_PATH ?

2014-07-17 Thread Zongheng Yang
One way is to set this in your conf/spark-defaults.conf: spark.executor.extraLibraryPath /path/to/native/lib The key is documented here: http://spark.apache.org/docs/latest/configuration.html On Thu, Jul 17, 2014 at 1:25 PM, Eric Friedman eric.d.fried...@gmail.com wrote: I used to use

Re: Count distinct with groupBy usage

2014-07-15 Thread Zongheng Yang
Sounds like a job for Spark SQL: http://spark.apache.org/docs/latest/sql-programming-guide.html ! On Tue, Jul 15, 2014 at 11:25 AM, Nick Pentreath nick.pentre...@gmail.com wrote: You can use .distinct.count on your user RDD. What are you trying to achieve with the time group by? — Sent from
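Both suggestions can be sketched with the core RDD API; the (time, userId) pair shape of the data is an assumption about the poster's RDD:

    import org.apache.spark.SparkContext._  // pair-RDD functions (1.0.x)

    // Assumed shape: (time bucket, userId) pairs.
    val users = sc.parallelize(Seq(
      ("2014-07-15", "u1"), ("2014-07-15", "u2"), ("2014-07-16", "u1")))

    // Distinct users overall -- the .distinct.count suggestion.
    val totalDistinct = users.map(_._2).distinct().count()

    // Distinct users per time bucket, without Spark SQL.
    val perTime = users.distinct().map { case (t, _) => (t, 1L) }.reduceByKey(_ + _)
    perTime.collect().foreach(println)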

Re: Error while running Spark SQL join when using Spark 1.0.1

2014-07-15 Thread Zongheng Yang
FWIW, I am unable to reproduce this using the example program locally. On Tue, Jul 15, 2014 at 11:56 AM, Keith Simmons keith.simm...@gmail.com wrote: Nope. All of them are registered from the driver program. However, I think we've found the culprit. If the join column between two tables is

Re: Error while running Spark SQL join when using Spark 1.0.1

2014-07-15 Thread Zongheng Yang
- user@incubator Hi Keith, I did reproduce this using local-cluster[2,2,1024], and the errors look almost the same. Just wondering: despite the errors, did your program output any result for the join? On my machine, I could see the correct output. Zongheng On Tue, Jul 15, 2014 at 1:46 PM,

Re: Error while running Spark SQL join when using Spark 1.0.1

2014-07-15 Thread Zongheng Yang
Hi Keith and gorenuru, This patch (https://github.com/apache/spark/pull/1423) solves the errors for me in my local tests. If possible, can you guys test this out to see if it fixes your test programs? Thanks, Zongheng On Tue, Jul 15, 2014 at 3:08 PM, Zongheng Yang zonghen...@gmail.com wrote

Re: Using HQL is terribly slow: Potential Performance Issue

2014-07-11 Thread Zongheng Yang
Hey Jerry, When you ran these queries using different methods, did you see any discrepancy in the returned results (i.e. the counts)? On Thu, Jul 10, 2014 at 5:55 PM, Michael Armbrust mich...@databricks.com wrote: Yeah, sorry. I think you are seeing some weirdness with partitioned tables that

Re: SPARKSQL problem with implementing Scala's Product interface

2014-07-10 Thread Zongheng Yang
Hi Haoming, For your spark-submit question: can you try using an assembly jar (sbt/sbt assembly will build it for you)? Another thing to check is whether there is any package structure containing your SimpleApp; if so, you should use the fully qualified (hierarchical) class name. Zongheng On Thu, Jul 10, 2014 at 11:33
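A hypothetical invocation showing both points together -- building the assembly jar, then passing the fully qualified class name to spark-submit (package, jar path, and master URL are placeholders):

    sbt/sbt assembly
    ./bin/spark-submit \
      --class com.example.SimpleApp \
      --master local[4] \
      target/scala-2.10/simpleapp-assembly-1.0.jar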

Re: jsonFile function in SQLContext does not work

2014-06-25 Thread Zongheng Yang
Hi durin, I just tried this example (nice data, by the way!), *with each JSON object on one line*, and it worked fine:

    scala> rdd.printSchema()
    root
     |-- entities: org.apache.spark.sql.catalyst.types.StructType$@13b6cdef
     |    |-- friends:
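A minimal sketch of the one-object-per-line requirement in a spark-shell session; the file path and table name are placeholders:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)

    // jsonFile expects one complete JSON object per input line; a single
    // pretty-printed document spanning many lines will not parse as intended.
    val tweets = sqlContext.jsonFile("path/to/tweets.json")
    tweets.printSchema()
    tweets.registerAsTable("tweets")
    sqlContext.sql("SELECT count(*) FROM tweets").collect().foreach(println)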

Re: SparkR Installation

2014-06-19 Thread Zongheng Yang
Hi Stuti, Yes, you do need to install R on all nodes. Furthermore, the rJava library is also required, which can be installed simply by running install.packages("rJava") in the R shell. Some more installation instructions after that step can be found in the README here:

Re: Patterns for making multiple aggregations in one pass

2014-06-18 Thread Zongheng Yang
If your input data is JSON, you can also try out the initial JSON support that was recently merged in: https://github.com/apache/spark/commit/d2f4f30b12f99358953e2781957468e2cfe3c916 On Wed, Jun 18, 2014 at 5:27 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: That’s pretty neat! So I guess if you

Re: convert List to RDD

2014-06-13 Thread Zongheng Yang
I may be wrong, but I think RDDs must be created via a SparkContext. To somehow preserve the order of the list, perhaps you could try something like: sc.parallelize((1 to xs.size).zip(xs)) On Fri, Jun 13, 2014 at 6:08 PM, SK skrishna...@gmail.com wrote: Hi, I have a List[ (String, Int,

Re: convert List to RDD

2014-06-13 Thread Zongheng Yang
Sorry I wasn't being clear. The idea off the top of my head was that you could append an original position index to each element (using the line above), and modify whatever processing functions you have in mind to make them aware of these indices; see the sketch below. And I think you are right that RDD collections
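A minimal sketch of the index-tagging idea, using zipWithIndex as a slightly tidier equivalent of (1 to xs.size).zip(xs); the element type is an assumption:

    import org.apache.spark.SparkContext._  // pair-RDD functions (sortByKey)

    val xs = List(("a", 1), ("b", 2), ("c", 3))

    // Tag each element with its original position before parallelizing.
    val indexed = sc.parallelize(xs.zipWithIndex.map { case (x, i) => (i, x) })

    // ...apply index-aware transformations here...

    // Recover the original order at the end.
    val ordered = indexed.sortByKey().map(_._2).collect()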

Re: SQLContext and HiveContext Query Performance

2014-06-04 Thread Zongheng Yang
Hi, Just wondering if you can try this:

    val obj = sql("select manufacturer, count(*) as examcount from pft group by manufacturer order by examcount desc")
    obj.collect()
    obj.queryExecution.executedPlan.executeCollect()

and time the third line alone. It could be that Spark SQL is taking some time to
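The intent of timing the third line by itself is to separate query analysis/optimization/planning cost from raw execution cost; a minimal timing sketch, assuming the same shell session where sql() is in scope:

    val obj = sql("select manufacturer, count(*) as examcount from pft group by manufacturer order by examcount desc")
    obj.collect()  // first run: includes analysis, optimization, and physical planning

    // Time only the already-planned physical execution.
    val t0 = System.nanoTime()
    obj.queryExecution.executedPlan.executeCollect()
    println("executeCollect: " + (System.nanoTime() - t0) / 1e6 + " ms")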