Re: SchemaRDD: SQL Queries vs Language Integrated Queries

2015-03-11 Thread Tobias Pfeiffer
Hi,

On Wed, Mar 11, 2015 at 11:05 PM, Cesar Flores ces...@gmail.com wrote:

 Thanks for both answers. One final question: isn't registerTempTable an
 extra step that the SQL queries need, and that may decrease performance
 compared with the language-integrated method calls?


As far as I know, registerTempTable is just an insertion into a
Map[String, SchemaRDD], so its cost should not be measurable. No
distributed/RDD operations are involved, I think.
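
To make that concrete, here is a minimal toy sketch in plain Python. It is
not Spark code: ToySchemaRDD, ToySQLContext, register_temp_table and lookup
are hypothetical names invented for this illustration, and it only assumes
that registration works the way described above (a name-to-object map).

```python
# Toy model only: these classes are hypothetical and are NOT Spark's API.
class ToySchemaRDD:
    """Stand-in for a SchemaRDD that counts how many jobs were run on it."""

    def __init__(self, name):
        self.name = name
        self.jobs_run = 0  # would only increase if an action were executed


class ToySQLContext:
    """Stand-in for a SQLContext that keeps a name -> object catalog."""

    def __init__(self):
        self.catalog = {}  # table name -> ToySchemaRDD

    def register_temp_table(self, rdd, table_name):
        # Only a dictionary insertion; the data itself is never touched.
        self.catalog[table_name] = rdd

    def lookup(self, table_name):
        # A SQL query naming the table would resolve it through this map.
        return self.catalog[table_name]


ctx = ToySQLContext()
features = ToySchemaRDD("features")
ctx.register_temp_table(features, "features")

# The catalog hands back the very same object; nothing was copied or computed.
same_object = ctx.lookup("features") is features
```

Under that assumption, registering the input and output of every transformer
would cost one map insertion each, independent of the size of the data.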

Tobias


Re: SchemaRDD: SQL Queries vs Language Integrated Queries

2015-03-11 Thread Cesar Flores
Hi:

Thanks for both answers. One final question: isn't registerTempTable an
extra step that the SQL queries need, and that may decrease performance
compared with the language-integrated method calls? The thing is that I
am planning to use them in the current version of the ML Pipeline
transformer classes for feature extraction, and if I need to register the
input and maybe the output SchemaRDD of the transform function in every
transformer, this may not be very efficient.


Thanks

On Tue, Mar 10, 2015 at 8:20 PM, Tobias Pfeiffer t...@preferred.jp wrote:

 Hi,

 On Tue, Mar 10, 2015 at 2:13 PM, Cesar Flores ces...@gmail.com wrote:

 I am new to the SchemaRDD class, and I am trying to decide between using
 SQL queries and Language Integrated Queries (
 https://spark.apache.org/docs/1.2.0/api/scala/index.html#org.apache.spark.sql.SchemaRDD
 ).

 Can someone tell me what the main difference between the two approaches
 is, besides the different syntax? Are they interchangeable? Which one has
 better performance?


 One difference is that language-integrated queries are method calls on
 the SchemaRDD you want to work on, which requires that you have access to
 the object itself. SQL queries are passed to a method of the SQLContext,
 and you have to call registerTempTable() on the SchemaRDD you want to use
 beforehand, which can happen at basically any location in your program.
 (I am not sure I expressed that clearly.) This may influence how you
 design your program and how its different parts work together.

 Tobias




-- 
Cesar Flores


Re: SchemaRDD: SQL Queries vs Language Integrated Queries

2015-03-10 Thread Tobias Pfeiffer
Hi,

On Tue, Mar 10, 2015 at 2:13 PM, Cesar Flores ces...@gmail.com wrote:

 I am new to the SchemaRDD class, and I am trying to decide between using
 SQL queries and Language Integrated Queries (
 https://spark.apache.org/docs/1.2.0/api/scala/index.html#org.apache.spark.sql.SchemaRDD
 ).

 Can someone tell me what the main difference between the two approaches
 is, besides the different syntax? Are they interchangeable? Which one has
 better performance?


One difference is that language-integrated queries are method calls on
the SchemaRDD you want to work on, which requires that you have access to
the object itself. SQL queries are passed to a method of the SQLContext,
and you have to call registerTempTable() on the SchemaRDD you want to use
beforehand, which can happen at basically any location in your program.
(I am not sure I expressed that clearly.) This may influence how you
design your program and how its different parts work together.

Tobias


Re: SchemaRDD: SQL Queries vs Language Integrated Queries

2015-03-10 Thread Reynold Xin
They should have the same performance, as they are compiled down to the
same execution plan.
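
As a rough illustration of what "compiled down to the same plan" means,
here is a toy sketch in plain Python. It is not Spark or Catalyst code:
Scan, Filter, Project, ToyRelation and parse_sql are hypothetical names
made up for this example, and the parser handles only one fixed query shape.

```python
from collections import namedtuple

# Hypothetical logical-plan nodes; NOT Spark's actual classes.
Scan = namedtuple("Scan", ["table"])
Filter = namedtuple("Filter", ["column", "op", "value", "child"])
Project = namedtuple("Project", ["columns", "child"])


class ToyRelation:
    """Method-call front end: each call wraps the current plan in a new node."""

    def __init__(self, plan):
        self.plan = plan

    def where(self, column, op, value):
        return ToyRelation(Filter(column, op, value, self.plan))

    def select(self, *columns):
        return ToyRelation(Project(tuple(columns), self.plan))


def parse_sql(query):
    """SQL front end for 'SELECT cols FROM table WHERE col op int' only."""
    tokens = query.replace(",", " ").split()
    from_idx = tokens.index("FROM")
    where_idx = tokens.index("WHERE")
    columns = tuple(tokens[1:from_idx])
    table = tokens[from_idx + 1]
    column, op, value = tokens[where_idx + 1:where_idx + 4]
    return Project(columns, Filter(column, op, int(value), Scan(table)))


# Same query expressed through both front ends.
sql_plan = parse_sql("SELECT name FROM people WHERE age > 21")
method_plan = ToyRelation(Scan("people")).where("age", ">", 21).select("name").plan

# Both front ends build the same tree, so anything executed from it is identical.
plans_match = sql_plan == method_plan
```

In this toy the two syntaxes differ only in how the plan tree gets built;
whatever runs afterwards sees the same tree either way.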

Note that starting in Spark 1.3, SchemaRDD has been renamed to DataFrame:

https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html



On Tue, Mar 10, 2015 at 2:13 PM, Cesar Flores ces...@gmail.com wrote:


 I am new to the SchemaRDD class, and I am trying to decide between using
 SQL queries and Language Integrated Queries (
 https://spark.apache.org/docs/1.2.0/api/scala/index.html#org.apache.spark.sql.SchemaRDD
 ).

 Can someone tell me what the main difference between the two approaches
 is, besides the different syntax? Are they interchangeable? Which one has
 better performance?


 Thanks a lot
 --
 Cesar Flores