Is cacheTable similar to asTempTable before?
> On 19 Jan, 2016, at 4:18 am, George Sigletos <sigle...@textkernel.nl> wrote:
>
> Thanks Kevin for your reply.
>
> I was suspecting the same thing as well, although it still does not make much
> sense to me why you would need to do both:
>
> myData.cache()
> sqlContext.cacheTable("myData")
>
> in case you are using both sqlContext and DataFrames to execute queries.
>
> dataframe.select(...) and sqlContext.sql("select ...") are equivalent, as far
> as I understand.
>
> Kind regards,
> George
>
>> On Fri, Jan 15, 2016 at 6:15 PM, Kevin Mellott <kevin.r.mell...@gmail.com> wrote:
>>
>> Hi George,
>>
>> I believe that sqlContext.cacheTable("tableName") is to be used when you
>> want to cache the data that is being used within a Spark SQL query. For
>> example, take a look at the code below.
>>
>>> val myData = sqlContext.load("com.databricks.spark.csv", Map("path" ->
>>>   "hdfs://somepath/file", "header" -> "false")).toDF("col1", "col2")
>>> myData.registerTempTable("myData")
>>
>> Here, the usage of cache() will affect ONLY the myData.select query.
>>
>>> myData.cache()
>>> myData.select("col1", "col2").show()
>>
>> Here, the usage of cacheTable will affect ONLY the sqlContext.sql query.
>>
>>> sqlContext.cacheTable("myData")
>>> sqlContext.sql("SELECT col1, col2 FROM myData").show()
>>
>> Thanks,
>> Kevin
>>
>>> On Fri, Jan 15, 2016 at 7:00 AM, George Sigletos <sigle...@textkernel.nl> wrote:
>>>
>>> According to the documentation they are exactly the same, but in my queries
>>>
>>> dataFrame.cache()
>>>
>>> results in much faster execution times vs doing
>>>
>>> sqlContext.cacheTable("tableName")
>>>
>>> Is there any explanation for this? I am not caching the RDD prior to
>>> creating the DataFrame. Using PySpark on Spark 1.5.2.
>>>
>>> Kind regards,
>>> George
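For reference, here is a minimal PySpark sketch of the two caching calls discussed in this thread. It is a sketch only: the table name `myData` and the two-row sample data are placeholders, and it assumes the Spark 1.5-era API (`SQLContext`, `registerTempTable`) that the thread is about.

```python
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="cache-vs-cacheTable")
sqlContext = SQLContext(sc)

# Build a small DataFrame and register it as a temp table,
# standing in for the CSV load in the thread above.
df = sqlContext.createDataFrame([(1, "a"), (2, "b")], ["col1", "col2"])
df.registerTempTable("myData")

# Route 1: cache via the DataFrame handle.
df.cache()
df.select("col1", "col2").show()

# Route 2: cache via the SQL catalog, using the registered table name.
sqlContext.cacheTable("myData")
sqlContext.sql("SELECT col1, col2 FROM myData").show()

sc.stop()
```

As far as I understand, both routes go through Spark's cache manager and are keyed on the query plan, so calling both on the same data should not normally be necessary.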