Re: columnar structure of RDDs from Parquet or ORC files

2015-06-08 Thread Cheng Lian
For DataFrames, there are also transformations and actions, and transformations are likewise lazily evaluated. However, DataFrame transformations like filter(), select(), and agg() return a DataFrame rather than an RDD. Other methods, like show() and collect(), are actions. Cheng On 6/8/15 1:33 PM,
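Cheng's distinction can be sketched as follows — a minimal Spark 1.x example, where the path and column names are hypothetical and `sqlContext` is assumed to already exist (e.g. in spark-shell):

```scala
// Transformations build a logical plan lazily; actions trigger execution.
val df = sqlContext.parquetFile("/path/to/people.parquet") // hypothetical path

val adults = df.filter(df("age") > 21)   // transformation: returns a DataFrame, no job runs
val names  = adults.select("name")       // transformation: still lazy

names.show()               // action: runs a Spark job and prints rows
val rows = names.collect() // action: materializes the result on the driver
```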

Re: columnar structure of RDDs from Parquet or ORC files

2015-06-08 Thread ayan guha
I would think DF = RDD + schema + some additional methods. In fact, a DF object has a DF.rdd in it, so you can (if needed) convert a DF to an RDD really easily. On Mon, Jun 8, 2015 at 5:41 PM, kiran lonikar loni...@gmail.com wrote: Thanks. Can you point me to a place in the documentation of SQL programming
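The DF = RDD + schema idea can be sketched like this (Spark 1.x; the path and the column index are hypothetical, and `sqlContext` is assumed to exist):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

val df = sqlContext.parquetFile("/path/to/people.parquet") // hypothetical path
println(df.schema)          // the "schema" part: a StructType describing the columns
val rdd: RDD[Row] = df.rdd  // the "RDD" part: drop down to an RDD[Row] when needed
rdd.map(r => r.getString(0)).take(5) // from here on, plain RDD operations
```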

Re: columnar structure of RDDs from Parquet or ORC files

2015-06-08 Thread kiran lonikar
Thanks. Can you point me to a place in the documentation, the SQL programming guide, or the DataFrame scaladoc where these transformations and actions are grouped as in the case of RDDs? Also, can you tell me whether sqlContext.load and unionAll are transformations or actions... I answered a question on

Re: columnar structure of RDDs from Parquet or ORC files

2015-06-08 Thread Cheng Lian
You may refer to the DataFrame Scaladoc http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame Methods listed under Language Integrated Queries and RDD Operations can be viewed as transformations, and those listed under Actions are, of course, actions. As for

Re: columnar structure of RDDs from Parquet or ORC files

2015-06-08 Thread kiran lonikar
Hi Cheng, Ayan, thanks for the answers. I like the rule of thumb. I cursorily went through the DataFrame, SQLContext, and sql.execution.basicOperators.scala code. It is apparent that these functions are lazily evaluated. The SQLContext.load functions are similar to SparkContext.textFile kind of
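The laziness being described can be sketched as follows (Spark 1.x API; the paths are hypothetical, and labeling unionAll a transformation reflects this thread's conclusion rather than an official grouping):

```scala
// Like SparkContext.textFile, SQLContext.load only records the data source;
// nothing is read from disk until an action runs.
val df1 = sqlContext.load("/data/part1.parquet", "parquet") // lazy
val df2 = sqlContext.load("/data/part2.parquet", "parquet") // lazy
val all = df1.unionAll(df2) // transformation: combines the plans, still no job
println(all.count())        // action: only now are the files actually scanned
```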

Re: columnar structure of RDDs from Parquet or ORC files

2015-06-07 Thread Cheng Lian
Interesting, just posted on another thread asking exactly the same question :) My answer there quoted below: For the following code: val df = sqlContext.parquetFile(path) `df` remains columnar (actually it just reads from the columnar Parquet file on disk). For the following code:
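A sketch of the two cases Cheng contrasts (Spark 1.x; `path` is a hypothetical string, and `sqlContext` is assumed to exist):

```scala
val df = sqlContext.parquetFile(path) // stays columnar: scans the Parquet file directly
// Dropping to the RDD API assembles the columns into row objects:
val rows = df.rdd                     // RDD[Row]: a row-wise view of the same data
rows.first()                          // action: column values get stitched into a Row here
```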

Re: columnar structure of RDDs from Parquet or ORC files

2015-06-07 Thread kiran lonikar
Thanks for replying twice :) I think I sent this question by email and somehow thought I had not sent it, hence I created the other one on the web interface. Let's retain this thread since you have provided more details here. Great, it confirms my intuition about DataFrame. It's similar to Shark

columnar structure of RDDs from Parquet or ORC files

2015-06-03 Thread kiran lonikar
When Spark reads Parquet files (sqlContext.parquetFile), it creates a DataFrame. I would like to know if the resulting DataFrame has a columnar structure (many rows of a column coalesced together in memory) or the row-wise structure that a Spark RDD has. The section Spark SQL and DataFrames
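For reference, the scenario being asked about looks like this (Spark 1.x; the path is hypothetical, and `sqlContext` is assumed to exist):

```scala
val df = sqlContext.parquetFile("/data/events.parquet") // hypothetical path
df.printSchema() // the column structure as recorded in the Parquet file's metadata
// The question is whether df's in-memory layout is columnar or row-wise.
```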