Is cacheTable similar to registerTempTable from before?

> On 19 Jan, 2016, at 4:18 am, George Sigletos <sigle...@textkernel.nl> wrote:
> 
> Thanks Kevin for your reply.
> 
> I was suspecting the same thing as well, although it still does not make much
> sense to me why you would need to do both:
> myData.cache()
> sqlContext.cacheTable("myData")
> 
> if you are using both the sqlContext and DataFrames to execute queries.
> 
> dataframe.select(...) and sqlContext.sql("select ...") are equivalent, as far
> as I understand.
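> 
> To be concrete, this is the equivalence I mean (a small sketch; "myData" is
> the table from your example below):
> myData.select("col1", "col2").show()
> sqlContext.sql("SELECT col1, col2 FROM myData").show()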
> 
> Kind regards,
> George
> 
>> On Fri, Jan 15, 2016 at 6:15 PM, Kevin Mellott <kevin.r.mell...@gmail.com> 
>> wrote:
>> Hi George,
>> 
>> I believe that sqlContext.cacheTable("tableName") is to be used when you 
>> want to cache the data that is being used within a Spark SQL query. For 
>> example, take a look at the code below.
>>  
>>> val myData = sqlContext.load("com.databricks.spark.csv", Map("path" ->
>>> "hdfs://somepath/file", "header" -> "false")).toDF("col1", "col2")
>>> myData.registerTempTable("myData")
>> 
>> Here, the call to cache() will affect ONLY the myData.select query:
>>> myData.cache() // lazily marks the DataFrame for caching
>>> myData.select("col1", "col2").show()
>>  
>> Here, the call to cacheTable() will affect ONLY the sqlContext.sql query:
>>> sqlContext.cacheTable("myData") // caches the table registered as "myData"
>>> sqlContext.sql("SELECT col1, col2 FROM myData").show()
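>> 
>> As a quick aside, if you want to verify that the table really is cached,
>> sqlContext.isCached takes the table name (using "myData" from above):
>>> sqlContext.isCached("myData") // returns true once the table is marked as cached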
>> 
>> Thanks,
>> Kevin
>> 
>>> On Fri, Jan 15, 2016 at 7:00 AM, George Sigletos <sigle...@textkernel.nl> 
>>> wrote:
>>> According to the documentation they are exactly the same, but in my queries
>>> 
>>> dataFrame.cache()
>>> 
>>> results in much faster execution times than doing
>>> 
>>> sqlContext.cacheTable("tableName")
>>> 
>>> Is there any explanation for this? I am not caching the RDD prior to
>>> creating the dataframe. I am using PySpark on Spark 1.5.2.
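>>> 
>>> Roughly what I am doing, sketched here in Scala (my actual code is PySpark;
>>> the path, table, and column names are placeholders):
>>> 
>>> val df = sqlContext.read.format("com.databricks.spark.csv").load("hdfs://somepath/file")
>>> df.registerTempTable("tableName")
>>> // Variant 1: cache via the DataFrame API, then query
>>> df.cache()
>>> df.select("col1", "col2").show()
>>> // Variant 2: cache via the SQLContext, then query
>>> sqlContext.cacheTable("tableName")
>>> sqlContext.sql("SELECT col1, col2 FROM tableName").show()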
>>> 
>>> Kind regards,
>>> George
> 
