+1.
Caching is way too slow.
On Wed, Apr 1, 2015 at 12:33 PM, SamyaMaiti samya.maiti2...@gmail.com wrote:
Hi Experts,
I have a Parquet dataset of 550 MB (9 blocks) in HDFS. I want to run SQL
queries against it repeatedly.
Few questions :
1. When I do the below (persist to memory after reading from disk), it takes
a long time to persist to memory. Any suggestions on how to tune this?
import org.apache.spark.storage.StorageLevel

val inputP = sqlContext.parquetFile("...")  // some HDFS path
inputP.registerTempTable("sample_table")
inputP.persist(StorageLevel.MEMORY_ONLY)
val result = sqlContext.sql("...")  // some SQL query
result.count
Note: Once the data is persisted to memory, it takes a fraction of a second to
return the query result from the second query onwards. So my concern is how to
reduce the time when the data is first loaded into the cache.
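(A sketch that may help with question 1, assuming a Spark 1.x SQLContext: instead of persist(MEMORY_ONLY) on the SchemaRDD, you can cache the registered table with sqlContext.cacheTable, which stores it in Spark SQL's compressed in-memory columnar format and often reduces both caching time and memory footprint. The path and query below are placeholders.)

```scala
import org.apache.spark.sql.SQLContext

// Assumes an existing SparkContext `sc`; "..." values are placeholders.
val sqlContext = new SQLContext(sc)
val inputP = sqlContext.parquetFile("...")  // some HDFS path
inputP.registerTempTable("sample_table")

// Optional: enable compression of the columnar cache
// (the default in recent 1.x releases).
sqlContext.setConf("spark.sql.inMemoryColumnarStorage.compressed", "true")

// Cache the table in columnar form; the cache is populated lazily,
// so the first query materializes it and later queries hit memory.
sqlContext.cacheTable("sample_table")
val result = sqlContext.sql("...")  // some SQL query
result.count
```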
2. I have observed that if I omit the below line,
inputP.persist(MEMORY_ONLY)
the first query execution is comparatively quick (say it takes
1 min), as the load-to-memory time is saved, but to my surprise the second
time I run the same query it takes only 30 sec, even though inputP is not
reconstructed from disk (checked from the UI).
So my question is: does Spark use some kind of internal caching for inputP
in this scenario?
Thanks in advance
Regards,
Sam
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/persist-MEMORY-ONLY-takes-lot-of-time-tp22343.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
--
Christian Perez
Silicon Valley Data Science
Data Analyst
christ...@svds.com
@cp_phd