Re: SchemaRDD: SQL Queries vs Language Integrated Queries
Hi,

On Wed, Mar 11, 2015 at 11:05 PM, Cesar Flores ces...@gmail.com wrote:
> Thanks for both answers. One final question: *isn't registerTempTable an
> extra step that the SQL queries need and that may decrease performance
> compared with the language-integrated method calls?*

As far as I know, registerTempTable is just a Map[String, SchemaRDD] insertion, nothing that would be measurable. I don't think any distributed/RDD operations are involved.

Tobias
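To illustrate the point above, here is a toy model in plain Scala (no Spark dependency). It is only a sketch of the idea that registering a table is a name-to-object map insertion that does not touch the data. `LazyTable`, `ToyCatalog` and `ToyCatalogDemo` are hypothetical names invented for this illustration; they are not Spark classes and do not reproduce Spark's actual catalog code.

```scala
import scala.collection.mutable

// Hypothetical stand-in for a SchemaRDD: it holds a deferred computation
// and counts how often that computation is actually run.
final class LazyTable(compute: () => Seq[Int]) {
  var evaluations = 0
  def collect(): Seq[Int] = { evaluations += 1; compute() }
}

// Hypothetical catalog: a plain name -> table map.
final class ToyCatalog {
  private val tables = mutable.Map.empty[String, LazyTable]

  // Registration stores a reference under a name; the data is not evaluated.
  def registerTempTable(name: String, table: LazyTable): Unit =
    tables(name) = table

  def lookup(name: String): Option[LazyTable] = tables.get(name)
}

object ToyCatalogDemo {
  def main(args: Array[String]): Unit = {
    val catalog = new ToyCatalog
    val numbers = new LazyTable(() => (1 to 5).map(_ * 2))

    catalog.registerTempTable("numbers", numbers)
    // Still zero here: registering did not run the computation.
    println(s"evaluations after registering: ${numbers.evaluations}")

    // Work happens only when a query looks the table up and asks for rows.
    val rows = catalog.lookup("numbers").map(_.collect())
    println(s"rows: $rows")
    println(s"evaluations after querying: ${numbers.evaluations}")
  }
}
```

In this model the cost of registering is one hash-map update per table, independent of how much data the table describes.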
Re: SchemaRDD: SQL Queries vs Language Integrated Queries
Hi,

Thanks for both answers. One final question: *isn't registerTempTable an extra step that the SQL queries need and that may decrease performance compared with the language-integrated method calls?* The thing is that I am planning to use them in the current version of the ML Pipeline transformer classes for feature extraction, and if I need to save the input and maybe the output SchemaRDD of the transform function in every transformer, this may not be very efficient.

Thanks

On Tue, Mar 10, 2015 at 8:20 PM, Tobias Pfeiffer t...@preferred.jp wrote:
> Hi,
>
> On Tue, Mar 10, 2015 at 2:13 PM, Cesar Flores ces...@gmail.com wrote:
> > I am new to the SchemaRDD class, and I am trying to decide between using
> > SQL queries or Language Integrated Queries (
> > https://spark.apache.org/docs/1.2.0/api/scala/index.html#org.apache.spark.sql.SchemaRDD
> > ). Can someone tell me what the main difference between the two
> > approaches is, besides the different syntax? Are they interchangeable?
> > Which one has better performance?
>
> One difference is that the language integrated queries are method calls on
> the SchemaRDD you want to work on, which requires that you have access to
> that object. The SQL queries are passed to a method of the SQLContext, and
> you have to call registerTempTable() on the SchemaRDD you want to use
> beforehand, which can basically happen at an arbitrary location of your
> program. (I don't know if I could express what I wanted to say.) That may
> have an influence on how you design your program and how the different
> parts work together.
>
> Tobias

--
Cesar Flores
Re: SchemaRDD: SQL Queries vs Language Integrated Queries
Hi,

On Tue, Mar 10, 2015 at 2:13 PM, Cesar Flores ces...@gmail.com wrote:
> I am new to the SchemaRDD class, and I am trying to decide between using
> SQL queries or Language Integrated Queries (
> https://spark.apache.org/docs/1.2.0/api/scala/index.html#org.apache.spark.sql.SchemaRDD
> ). Can someone tell me what the main difference between the two
> approaches is, besides the different syntax? Are they interchangeable?
> Which one has better performance?

One difference is that the language integrated queries are method calls on the SchemaRDD you want to work on, which requires that you have access to that object. The SQL queries are passed to a method of the SQLContext, and you have to call registerTempTable() on the SchemaRDD you want to use beforehand, which can basically happen at an arbitrary location of your program. (I don't know if I could express what I wanted to say.) That may have an influence on how you design your program and how the different parts work together.

Tobias
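A minimal sketch of the design difference described above, assuming Spark 1.2.x (spark-core and spark-sql) is on the classpath. The `Person` case class, the `people` table name, the sample rows and the two helper functions are made up for this illustration. The language-integrated helper needs the SchemaRDD passed in, while the SQL helper needs only the SQLContext and a table that was registered somewhere earlier.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{SQLContext, SchemaRDD}

// Hypothetical record type; kept at the top level so Spark can infer its schema.
case class Person(name: String, age: Int)

object QueryStyles {
  // Language-integrated style: the caller must hand over the SchemaRDD itself.
  def adultsViaDsl(sqlContext: SQLContext, people: SchemaRDD): SchemaRDD = {
    import sqlContext._ // implicit conversions from Symbols and literals to expressions
    people.where('age >= 18).select('name)
  }

  // SQL style: only the SQLContext is needed; the table is found by name.
  def adultsViaSql(sqlContext: SQLContext): SchemaRDD =
    sqlContext.sql("SELECT name FROM people WHERE age >= 18")

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("query-styles").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    val peopleRdd = sc.parallelize(Seq(Person("Ann", 34), Person("Bob", 12)))
    val people: SchemaRDD = sqlContext.createSchemaRDD(peopleRdd)

    // Registration can happen anywhere before the SQL query runs.
    people.registerTempTable("people")

    adultsViaDsl(sqlContext, people).collect().foreach(println)
    adultsViaSql(sqlContext).collect().foreach(println)

    sc.stop()
  }
}
```

Both helpers express the same query, so they should return the same rows; which one fits better depends on whether the calling code has the SchemaRDD object at hand or only knows the table name.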
Re: SchemaRDD: SQL Queries vs Language Integrated Queries
They should have the same performance, as they are compiled down to the same execution plan.

Note that starting in Spark 1.3, SchemaRDD is renamed DataFrame:
https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html

On Tue, Mar 10, 2015 at 2:13 PM, Cesar Flores ces...@gmail.com wrote:
> I am new to the SchemaRDD class, and I am trying to decide between using
> SQL queries or Language Integrated Queries (
> https://spark.apache.org/docs/1.2.0/api/scala/index.html#org.apache.spark.sql.SchemaRDD
> ). Can someone tell me what the main difference between the two
> approaches is, besides the different syntax? Are they interchangeable?
> Which one has better performance?
>
> Thanks a lot
> --
> Cesar Flores