Re: persist(MEMORY_ONLY) takes lot of time

2015-04-02 Thread Christian Perez
+1.

Caching is way too slow.

On Wed, Apr 1, 2015 at 12:33 PM, SamyaMaiti samya.maiti2...@gmail.com wrote:
 Hi Experts,

 I have a Parquet dataset of 550 MB (9 blocks) in HDFS. I want to run SQL
 queries on it repeatedly.

 A few questions:

 1. When I do the below (persist to memory after reading from disk), it takes
 a lot of time to persist to memory. Any suggestions on how to tune this?

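  import org.apache.spark.storage.StorageLevel

  val inputP = sqlContext.parquetFile("some HDFS path")  // placeholder path
  inputP.registerTempTable("sample_table")
  inputP.persist(StorageLevel.MEMORY_ONLY)  // lazy: only marks inputP for caching
  val result = sqlContext.sql("some sql query")  // placeholder query
  result.count  // first action triggers the read and the caching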

 Note: Once the data is persisted to memory, it takes a fraction of a second
 to return the query result from the second query onwards. So my concern is
 how to reduce the time taken when the data is first loaded into the cache.


 2. I have observed that if I omit the line below,
  inputP.persist(StorageLevel.MEMORY_ONLY)
 the first query execution is comparatively quick (say it takes 1 min), as
 the time to load into memory is saved. But to my surprise, the second time
 I run the same query it takes 30 sec, because inputP is not constructed
 from disk (checked in the UI).

 So my question is: does Spark use some kind of internal caching for inputP
 in this scenario?

 Thanks in advance

 Regards,
 Sam



 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/persist-MEMORY-ONLY-takes-lot-of-time-tp22343.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




-- 
Christian Perez
Silicon Valley Data Science
Data Analyst
christ...@svds.com
@cp_phd




persist(MEMORY_ONLY) takes lot of time

2015-04-01 Thread SamyaMaiti
Hi Experts,

I have a Parquet dataset of 550 MB (9 blocks) in HDFS. I want to run SQL
queries on it repeatedly.

A few questions:

1. When I do the below (persist to memory after reading from disk), it takes
a lot of time to persist to memory. Any suggestions on how to tune this?
 
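 import org.apache.spark.storage.StorageLevel

 val inputP = sqlContext.parquetFile("some HDFS path")  // placeholder path
 inputP.registerTempTable("sample_table")
 inputP.persist(StorageLevel.MEMORY_ONLY)  // lazy: only marks inputP for caching
 val result = sqlContext.sql("some sql query")  // placeholder query
 result.count  // first action triggers the read and the caching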

Note: Once the data is persisted to memory, it takes a fraction of a second
to return the query result from the second query onwards. So my concern is
how to reduce the time taken when the data is first loaded into the cache.
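
One way to separate the one-time cache load from the steady-state query time
is to time each run on its own. The following is only a minimal sketch: the
`timed` helper and the `TimingSketch` object are hypothetical names (not part
of Spark), and the workload in `main` is a stand-in for `result.count` so that
the example runs without a cluster.

```scala
object TimingSketch {
  // Hypothetical helper (not a Spark API): evaluates `body` once and returns
  // its result together with the elapsed wall-clock time in milliseconds.
  def timed[A](label: String)(body: => A): (A, Double) = {
    val start = System.nanoTime()
    val result = body
    val elapsedMs = (System.nanoTime() - start) / 1e6
    println(f"$label%s took $elapsedMs%.1f ms")
    (result, elapsedMs)
  }

  def main(args: Array[String]): Unit = {
    // Stand-in workload; in the scenario above the body would be
    // `result.count`, timed once for the first run and once for the second.
    val (_, firstMs)  = timed("first run")((1L to 1000000L).sum)
    val (_, secondMs) = timed("second run")((1L to 1000000L).sum)
    println(f"difference: ${firstMs - secondMs}%.1f ms")
  }
}
```

With the snippet above, wrapping each `result.count` call as
`TimingSketch.timed("first run")(result.count)` would show how much of the
first run is cache loading as opposed to query execution.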


2. I have observed that if I omit the line below,
 inputP.persist(StorageLevel.MEMORY_ONLY)
the first query execution is comparatively quick (say it takes 1 min), as
the time to load into memory is saved. But to my surprise, the second time
I run the same query it takes 30 sec, because inputP is not constructed
from disk (checked in the UI).

So my question is: does Spark use some kind of internal caching for inputP
in this scenario?

Thanks in advance

Regards,
Sam



