Hive on Spark Vs Spark SQL

2015-11-15 Thread kiran lonikar
I would like to know if Hive on Spark uses or shares the execution code with Spark SQL or DataFrames. More specifically, does Hive on Spark benefit from the changes made to Spark SQL, i.e. Project Tungsten? Or is it a completely different execution path where it creates its own plan and executes on RDDs?

Re: Hive on Spark Vs Spark SQL

2015-11-15 Thread kiran lonikar
So it does not benefit from Project Tungsten, right? On Mon, Nov 16, 2015 at 12:07 PM, Reynold Xin <r...@databricks.com> wrote: > It's a completely different path. > > > On Sun, Nov 15, 2015 at 10:37 PM, kiran lonikar <loni...@gmail.com> wrote: > >> I would l

Fwd: Code generation for GPU

2015-09-03 Thread kiran lonikar
Hi, I am speaking at the Spark Europe Summit on exploiting GPUs for columnar DataFrame operations. I was going through various blogs, talks and JIRAs by all the key Spark folks and trying to figure out

Re: RDD of RDDs

2015-06-09 Thread kiran lonikar
Possibly in the future, if and when the Spark architecture allows workers to launch Spark jobs (the functions passed to transformation or action APIs of RDD), it will be possible to have an RDD of RDDs. On Tue, Jun 9, 2015 at 1:47 PM, kiran lonikar loni...@gmail.com wrote: A similar question was asked

Re: RDD of RDDs

2015-06-09 Thread kiran lonikar
A similar question was asked before: http://apache-spark-user-list.1001560.n3.nabble.com/Rdd-of-Rdds-td17025.html Here is one of the reasons why I think RDD[RDD[T]] is not possible: - RDD is only a handle to the actual data partitions. It has a reference/pointer to the *SparkContext* object
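The SparkContext point above can be sketched in plain Scala without any Spark dependency. The class names below (`DriverContext`, `Handle`) are hypothetical stand-ins, not Spark's real classes; the sketch only shows why a handle that points back to a driver-only context cannot itself be nested inside another handle and shipped to workers:

```scala
// A "context" that exists only on the driver JVM (stand-in for SparkContext).
class DriverContext(val name: String)

// An RDD-like handle: just metadata plus a pointer back to the driver context.
// Operations on it are ultimately scheduled *by* that context.
class Handle[T](val ctx: DriverContext, val data: Seq[T]) {
  def map[U](f: T => U): Handle[U] = new Handle(ctx, data.map(f))
}

object NestedHandleDemo {
  def main(args: Array[String]): Unit = {
    val ctx = new DriverContext("driver")
    val outer = new Handle(ctx, Seq(Seq(1, 2), Seq(3, 4)))

    // A Handle[Handle[Int]] would ship inner handles to workers, and each
    // inner handle drags its ctx reference along. Workers have no driver
    // context to schedule with, so that shape cannot work; the usual fix
    // is to flatten instead of nesting.
    val flat = new Handle(ctx, outer.data.flatten)
    println(flat.data.sum) // 10
  }
}
```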

Re: RDD of RDDs

2015-06-09 Thread kiran lonikar
On Tue, Jun 9, 2015 at 1:34 AM, kiran lonikar loni...@gmail.com wrote: Possibly in the future, if and when the Spark architecture allows workers to launch Spark jobs (the functions passed to transformation or action APIs of RDD), it will be possible to have an RDD of RDDs. On Tue, Jun 9, 2015 at 1:47 PM

Re: Optimisation advice for Avro-Parquet merge job

2015-06-08 Thread kiran lonikar
James, As I can see, there are three distinct parts to your program: - for loop - synchronized block - final outputFrame.save statement Can you do a separate timing measurement by putting a simple System.currentTimeMillis() call around each of these blocks to see how long they take, and then
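The timing suggestion can be wrapped in a small helper. This is a minimal plain-Scala sketch; the `"for loop"` label is a placeholder for whichever of the three sections (for loop, synchronized block, outputFrame.save) is being measured:

```scala
object Timing {
  // Run a block, print how long it took, and return its result unchanged.
  def timed[A](label: String)(block: => A): A = {
    val start = System.currentTimeMillis()
    val result = block // the section being measured
    println(s"$label took ${System.currentTimeMillis() - start} ms")
    result
  }

  def main(args: Array[String]): Unit = {
    // Placeholder workload standing in for one of the three program parts.
    val total = timed("for loop") { (1 to 1000).sum }
    println(total) // 500500
  }
}
```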

Re: Column operation on Spark RDDs.

2015-06-08 Thread kiran lonikar
Two simple suggestions: 1. No need to call zipWithIndex twice. Use the earlier RDD dt. 2. Replace zipWithIndex with zipWithUniqueId, which does not trigger a Spark job. Below is your code with the above changes: var dataRDD = sc.textFile("/test.csv").map(_.split(",")) val dt =
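The reason zipWithUniqueId avoids a job is that each partition can assign ids from its own index alone, using the scheme id = k * numPartitions + partitionIndex (Spark's documented scheme, where k is the position within the partition); zipWithIndex must first run a job to learn every partition's size so indices can be contiguous. A plain-Scala sketch of the id scheme, with partitions modeled as nested Seqs:

```scala
object UniqueIdDemo {
  // Assign ids the way zipWithUniqueId does: no cross-partition
  // coordination needed, so no extra job to compute partition sizes.
  def zipWithUniqueId[T](partitions: Seq[Seq[T]]): Seq[Seq[(T, Long)]] = {
    val n = partitions.length
    partitions.zipWithIndex.map { case (part, p) =>
      part.zipWithIndex.map { case (x, k) => (x, k.toLong * n + p) }
    }
  }

  def main(args: Array[String]): Unit = {
    // Two partitions: ids come out unique but not contiguous.
    println(zipWithUniqueId(Seq(Seq("a", "b"), Seq("c"))))
    // partition 0: a -> 0, b -> 2; partition 1: c -> 1
  }
}
```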

Re: columnar structure of RDDs from Parquet or ORC files

2015-06-08 Thread kiran lonikar
are also lazily evaluated. However, DataFrame transformations like filter(), select(), agg() return a DataFrame rather than an RDD. Other methods like show() and collect() are actions. Cheng On 6/8/15 1:33 PM, kiran lonikar wrote: Thanks for replying twice :) I think I sent this question
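The transformation/action split Cheng describes can be illustrated without Spark at all. Here is a plain-Scala analogy (not Spark's API) using Iterator, whose map is likewise deferred until a terminal call forces it, just as filter()/select()/agg() only build a plan that show() or collect() executes:

```scala
object LazyDemo {
  def main(args: Array[String]): Unit = {
    var evaluated = 0
    // "Transformation": recorded, but nothing runs yet.
    val plan = Iterator(1, 2, 3, 4).map { x => evaluated += 1; x * 2 }
    println(evaluated)    // 0 -- no element has been touched
    // "Action": forces the whole pipeline to execute.
    val result = plan.sum
    println(evaluated)    // 4
    println(result)       // 20
  }
}
```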

Re: columnar structure of RDDs from Parquet or ORC files

2015-06-08 Thread kiran lonikar
DataFrame and RDD are both lazily evaluated. Cheng On 6/8/15 8:11 PM, kiran lonikar wrote: Thanks. Can you point me to a place in the SQL programming guide or the DataFrame scaladoc where these transformations and actions are grouped as in the case of RDD? Also if you can tell me

Re: Optimisation advice for Avro-Parquet merge job

2015-06-08 Thread kiran lonikar
at 12:30 PM, kiran lonikar loni...@gmail.com wrote: James, As I can see, there are three distinct parts to your program: - for loop - synchronized block - final outputFrame.save statement Can you do a separate timing measurement by putting a simple System.currentTimeMillis

Re: columnar structure of RDDs from Parquet or ORC files

2015-06-07 Thread kiran lonikar
Thanks for replying twice :) I think I sent this question by email and somehow thought I did not send it, hence created the other one on the web interface. Let's retain this thread since you have provided more details here. Great, it confirms my intuition about DataFrame. It's similar to Shark

columnar structure of RDDs from Parquet or ORC files

2015-06-03 Thread kiran lonikar
When Spark reads Parquet files (sqlContext.parquetFile), it creates a DataFrame RDD. I would like to know if the resulting DataFrame has a columnar structure (many rows of a column coalesced together in memory) or a row-wise structure like a Spark RDD has. The section Spark SQL and DataFrames
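The two layouts being asked about can be contrasted in plain Scala. This is only an illustration of the memory shapes, not of Spark's internals: the same tiny table stored row-wise (one record per row, as a plain RDD holds it) versus column-wise (one array per column, the shape a columnar format like Parquet uses):

```scala
object LayoutDemo {
  // Row-wise: one record per row.
  val rows: Seq[(String, Int)] = Seq(("a", 1), ("b", 2), ("c", 3))

  // Columnar: each column's values stored contiguously, which makes
  // per-column scans and compression cheap.
  val names: Array[String] = Array("a", "b", "c")
  val values: Array[Int]   = Array(1, 2, 3)

  def main(args: Array[String]): Unit = {
    // Summing one column touches only that column's array.
    println(values.sum) // 6
    // Both layouts hold the same data.
    assert(rows.map(_._2).sum == values.sum)
  }
}
```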