Re: Running spark function on parquet without sql

2015-03-15 Thread Cheng Lian
That's an unfortunate documentation bug in the programming guide... We 
failed to update it after making the change.


Cheng

On 2/28/15 8:13 AM, Deborah Siegel wrote:

Hi Michael,

Would you help me understand the apparent difference here?

The Spark 1.2.1 programming guide indicates:

Note that if you call schemaRDD.cache() rather than
sqlContext.cacheTable(...), tables will *not* be cached using the
in-memory columnar format, and therefore
sqlContext.cacheTable(...) is strongly recommended for this use case.


Yet the API doc shows that:

def cache(): SchemaRDD.this.type
https://spark.apache.org/docs/1.2.0/api/scala/org/apache/spark/sql/SchemaRDD.html

Overridden cache function will always use the in-memory
columnar caching.



links
https://spark.apache.org/docs/latest/sql-programming-guide.html#caching-data-in-memory
https://spark.apache.org/docs/1.2.1/api/scala/index.html#org.apache.spark.sql.SchemaRDD

Thanks
Sincerely
Deb

On Fri, Feb 27, 2015 at 2:13 PM, Michael Armbrust 
mich...@databricks.com wrote:


From Zhan Zhang's reply, yes, I still get Parquet's advantage.

You will need to at least use SQL or the DataFrame API (coming in
Spark 1.3) to specify the columns that you want in order to get
the Parquet benefits. The rest of your operations can be
standard Spark.

My next question is, if I operate on a SchemaRDD, will I get the
advantage of Spark SQL's in-memory columnar store when the table
is cached using cacheTable()?


Yes, SchemaRDDs always use the in-memory columnar cache for
cacheTable and .cache() since Spark 1.2+
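
For concreteness, here is a minimal sketch of the two caching paths being
compared, using the Spark 1.2 Scala API. The Parquet path, table name, and
columns are made up for illustration:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)  // sc is an existing SparkContext

// Load a Parquet file as a SchemaRDD and register it as a temporary table.
val events = sqlContext.parquetFile("hdfs:///data/events.parquet")
events.registerTempTable("events")

// Option 1: cache through the SQLContext.
sqlContext.cacheTable("events")

// Option 2: cache the SchemaRDD directly. As discussed above, since Spark 1.2
// the overridden cache() also stores the data in the in-memory columnar format.
events.cache()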






Re: Running spark function on parquet without sql

2015-02-27 Thread tridib
Somehow my posts are not getting accepted, and replies are not visible here.
But I got following reply from Zhan.

From Zhan Zhang's reply, yes, I still get Parquet's advantage.

My next question is, if I operate on a SchemaRDD, will I get the advantage of
Spark SQL's in-memory columnar store when the table is cached using
cacheTable()?







Re: Running spark function on parquet without sql

2015-02-27 Thread Deborah Siegel
Hi Michael,

Would you help me understand the apparent difference here?

The Spark 1.2.1 programming guide indicates:

Note that if you call schemaRDD.cache() rather than
sqlContext.cacheTable(...), tables will *not* be cached using the in-memory
columnar format, and therefore sqlContext.cacheTable(...) is strongly
recommended for this use case.

Yet the API doc shows that:

def cache(): SchemaRDD.this.type
https://spark.apache.org/docs/1.2.0/api/scala/org/apache/spark/sql/SchemaRDD.html

Overridden cache function will always use the in-memory columnar
caching.


links
https://spark.apache.org/docs/latest/sql-programming-guide.html#caching-data-in-memory
https://spark.apache.org/docs/1.2.1/api/scala/index.html#org.apache.spark.sql.SchemaRDD

Thanks
Sincerely
Deb

On Fri, Feb 27, 2015 at 2:13 PM, Michael Armbrust mich...@databricks.com
wrote:

 From Zhan Zhang's reply, yes, I still get Parquet's advantage.


 You will need to at least use SQL or the DataFrame API (coming in Spark
 1.3) to specify the columns that you want in order to get the Parquet
 benefits. The rest of your operations can be standard Spark.

 My next question is, if I operate on a SchemaRDD, will I get the advantage of
 Spark SQL's in-memory columnar store when the table is cached using
 cacheTable()?


 Yes, SchemaRDDs always use the in-memory columnar cache for cacheTable and
 .cache() since Spark 1.2+



Re: Running spark function on parquet without sql

2015-02-27 Thread Michael Armbrust

 From Zhan Zhang's reply, yes, I still get Parquet's advantage.


You will need to at least use SQL or the DataFrame API (coming in Spark
1.3) to specify the columns that you want in order to get the Parquet
benefits. The rest of your operations can be standard Spark.
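
A rough sketch of that pattern with the Spark 1.2 Scala API (the path, table
name, and columns are invented for illustration): name only the columns you
need in the SQL, then continue with ordinary RDD operations on the result.

import org.apache.spark.SparkContext._  // pair RDD functions such as reduceByKey
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)  // sc is an existing SparkContext
sqlContext.parquetFile("hdfs:///data/events.parquet").registerTempTable("events")

// Only userId and amount are requested, so the Parquet reader can prune the rest.
val projected = sqlContext.sql("SELECT userId, amount FROM events")

// From here on it is plain Spark: a SchemaRDD is an RDD[Row].
val totalsByUser = projected
  .map(row => (row.getString(0), row.getDouble(1)))
  .reduceByKey(_ + _)

totalsByUser.take(10).foreach(println)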

My next question is, if I operate on a SchemaRDD, will I get the advantage of
 Spark SQL's in-memory columnar store when the table is cached using
 cacheTable()?


Yes, SchemaRDDs always use the in-memory columnar cache for cacheTable and
.cache() since Spark 1.2+


Running spark function on parquet without sql

2015-02-26 Thread tridib
Hello Experts,
In one of my projects we have Parquet files, and we are using Spark SQL
for our analytics. I am encountering situations where simple SQL does not
get me what I need, or more complex SQL is not supported by Spark SQL. In
scenarios like this I am able to get things done using low-level Spark
constructs like MapFunction and reducers.

My question is: if I create a JavaSchemaRDD on Parquet and use basic Spark
constructs, will I still get the benefit of Parquet's columnar format? Will
my aggregation be as fast as it would have been if I had used SQL?

Please advise.

Thanks & Regards
Tridib







Re: Running spark function on parquet without sql

2015-02-26 Thread Zhan Zhang
When you use SQL (or the API from SchemaRDD/DataFrame) to read data from
Parquet, the optimizer will do column pruning, predicate pushdown, etc.
Thus you get the benefit of Parquet's columnar format. After that, you can
operate on the SchemaRDD (DF) like a regular RDD.
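
As a small, hypothetical illustration of that (path, column names, and filter
are made up), the column list and WHERE clause below go through the optimizer,
so the Parquet scan reads only the referenced columns and, with pushdown
enabled, can skip non-matching data before it ever reaches Spark; the result
is still an RDD[Row]:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)  // sc is an existing SparkContext

// Opt in to Parquet filter pushdown (off by default in Spark 1.2, as I understand it).
sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")

sqlContext.parquetFile("hdfs:///data/events.parquet").registerTempTable("events")

// Column pruning: only userId and amount are read.
// Predicate pushdown: the amount > 100.0 filter can be handled by the Parquet reader.
val bigSpenders = sqlContext.sql(
  "SELECT userId, amount FROM events WHERE amount > 100.0")

// Continue with regular RDD operations on the result.
println(bigSpenders.map(_.getString(0)).distinct().count())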

Thanks.

Zhan Zhang
 
On Feb 26, 2015, at 1:50 PM, tridib tridib.sama...@live.com wrote:

 Hello Experts,
 In one of my projects we have Parquet files, and we are using Spark SQL
 for our analytics. I am encountering situations where simple SQL does not
 get me what I need, or more complex SQL is not supported by Spark SQL. In
 scenarios like this I am able to get things done using low-level Spark
 constructs like MapFunction and reducers.
 
 My question is: if I create a JavaSchemaRDD on Parquet and use basic Spark
 constructs, will I still get the benefit of Parquet's columnar format? Will
 my aggregation be as fast as it would have been if I had used SQL?
 
 Please advise.
 
 Thanks & Regards
 Tridib
 
 
 

