Re: Running spark function on parquet without sql
That's an unfortunate documentation bug in the programming guide... We failed to update it after making the change.

Cheng

On 2/28/15 8:13 AM, Deborah Siegel wrote:
> Hi Michael,
>
> Would you help me understand the apparent difference here? The Spark 1.2.1 programming guide indicates:
>
> "Note that if you call schemaRDD.cache() rather than sqlContext.cacheTable(...), tables will *not* be cached using the in-memory columnar format, and therefore sqlContext.cacheTable(...) is strongly recommended for this use case."
>
> Yet the API doc for def cache(): SchemaRDD.this.type says:
>
> "Overridden cache function will always use the in-memory columnar caching."
>
> Links:
> https://spark.apache.org/docs/latest/sql-programming-guide.html#caching-data-in-memory
> https://spark.apache.org/docs/1.2.1/api/scala/index.html#org.apache.spark.sql.SchemaRDD
> https://spark.apache.org/docs/1.2.0/api/scala/org/apache/spark/sql/SchemaRDD.html
>
> Thanks,
> Deb
>
> On Fri, Feb 27, 2015 at 2:13 PM, Michael Armbrust <mich...@databricks.com> wrote:
>> > From Zhan Zhang's reply, yes I still get Parquet's advantage.
>>
>> You will need to at least use SQL or the DataFrame API (coming in Spark 1.3) to specify the columns that you want in order to get the Parquet benefits. The rest of your operations can be standard Spark.
>>
>> > My next question is, if I operate on a SchemaRDD, will I get the advantage of Spark SQL's in-memory columnar store when the table is cached using cacheTable()?
>>
>> Yes, SchemaRDDs always use the in-memory columnar cache for cacheTable and .cache() since Spark 1.2+.
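[Editor's note: for anyone following the thread, here is a minimal sketch of the two caching paths being discussed, as I understand them on Spark 1.2. It assumes an existing SparkContext named sc and a hypothetical Parquet file people.parquet.]

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)

    // Load Parquet data; parquetFile returns a SchemaRDD
    val people = sqlContext.parquetFile("people.parquet")

    // Path 1: register a temp table and cache it by name
    people.registerTempTable("people")
    sqlContext.cacheTable("people")   // in-memory columnar format

    // Path 2: cache the SchemaRDD directly; since Spark 1.2, cache() is
    // overridden on SchemaRDD to use the same in-memory columnar format
    people.cache()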
Re: Running spark function on parquet without sql
Somehow my posts are not getting accepted, and replies are not visible here. But I got the following reply from Zhan.

From Zhan Zhang's reply, yes, I still get Parquet's advantage. My next question is: if I operate on a SchemaRDD, will I get the advantage of Spark SQL's in-memory columnar store when the table is cached using cacheTable()?
Re: Running spark function on parquet without sql
Hi Michael,

Would you help me understand the apparent difference here? The Spark 1.2.1 programming guide indicates:

"Note that if you call schemaRDD.cache() rather than sqlContext.cacheTable(...), tables will *not* be cached using the in-memory columnar format, and therefore sqlContext.cacheTable(...) is strongly recommended for this use case."

Yet the API doc for def cache(): SchemaRDD.this.type says:

"Overridden cache function will always use the in-memory columnar caching."

Links:
https://spark.apache.org/docs/latest/sql-programming-guide.html#caching-data-in-memory
https://spark.apache.org/docs/1.2.1/api/scala/index.html#org.apache.spark.sql.SchemaRDD
https://spark.apache.org/docs/1.2.0/api/scala/org/apache/spark/sql/SchemaRDD.html

Thanks,
Deb

On Fri, Feb 27, 2015 at 2:13 PM, Michael Armbrust <mich...@databricks.com> wrote:
> > From Zhan Zhang's reply, yes I still get Parquet's advantage.
>
> You will need to at least use SQL or the DataFrame API (coming in Spark 1.3) to specify the columns that you want in order to get the Parquet benefits. The rest of your operations can be standard Spark.
>
> > My next question is, if I operate on a SchemaRDD, will I get the advantage of Spark SQL's in-memory columnar store when the table is cached using cacheTable()?
>
> Yes, SchemaRDDs always use the in-memory columnar cache for cacheTable and .cache() since Spark 1.2+.
Re: Running spark function on parquet without sql
> From Zhan Zhang's reply, yes I still get Parquet's advantage.

You will need to at least use SQL or the DataFrame API (coming in Spark 1.3) to specify the columns that you want in order to get the Parquet benefits. The rest of your operations can be standard Spark.

> My next question is, if I operate on a SchemaRDD, will I get the advantage of Spark SQL's in-memory columnar store when the table is cached using cacheTable()?

Yes, SchemaRDDs always use the in-memory columnar cache for cacheTable and .cache() since Spark 1.2+.
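[Editor's note: a short sketch of the split Michael describes, on Spark 1.2. The table name "people" and its age column are assumptions for illustration; it assumes the Parquet file has already been registered as a temp table as in the earlier sketch.]

    // Narrow to the needed column through SQL so Spark SQL can apply
    // column pruning against the Parquet scan
    val ages = sqlContext.sql("SELECT age FROM people")

    // A SchemaRDD is also an RDD[Row], so standard Spark operations
    // handle the logic that is hard to express in SQL
    val totalAge = ages.map(row => row.getInt(0)).reduce(_ + _)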
Running spark function on parquet without sql
Hello Experts,

In one of my projects we have Parquet files, and we are using Spark SQL to get our analytics. I am encountering situations where simple SQL does not get me what I need, or the complex SQL is not supported by Spark SQL. In scenarios like these I am able to get things done using low-level Spark constructs like MapFunction and reducers.

My question is: if I create a JavaSchemaRDD on Parquet and use basic Spark constructs, will I still get the benefit of Parquet's columnar format? Will my aggregation be as fast as it would have been if I had used SQL?

Please advise.

Thanks & Regards
Tridib
Re: Running spark function on parquet without sql
When you use SQL (or the API from SchemaRDD/DataFrame) to read data from Parquet, the optimizer will do column pruning, predicate pushdown, etc. Thus you get Parquet's columnar benefits. After that, you can operate on the SchemaRDD (DataFrame) like a regular RDD.

Thanks.

Zhan Zhang

On Feb 26, 2015, at 1:50 PM, tridib <tridib.sama...@live.com> wrote:
> Hello Experts,
>
> In one of my projects we have Parquet files, and we are using Spark SQL to get our analytics. I am encountering situations where simple SQL does not get me what I need, or the complex SQL is not supported by Spark SQL. In scenarios like these I am able to get things done using low-level Spark constructs like MapFunction and reducers.
>
> My question is: if I create a JavaSchemaRDD on Parquet and use basic Spark constructs, will I still get the benefit of Parquet's columnar format? Will my aggregation be as fast as it would have been if I had used SQL?
>
> Please advise.
>
> Thanks & Regards
> Tridib
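[Editor's note: a minimal sketch of what Zhan describes above, on Spark 1.2. The people.parquet path and the name/age columns are assumptions for illustration.]

    // Reading through Spark SQL lets the optimizer prune columns and push
    // the age predicate down toward the Parquet scan
    val people = sqlContext.parquetFile("people.parquet")
    people.registerTempTable("people")
    val adults = sqlContext.sql("SELECT name, age FROM people WHERE age >= 18")

    // After that, treat the result as a regular RDD[Row]
    val names = adults.map(r => r.getString(0)).collect()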