Re: Spark SQL - Long running job
I meant using saveAsParquetFile. As for the partition number, you can always control it with the spark.sql.shuffle.partitions property.

Cheng

On 2/23/15 1:38 PM, nitin wrote:
> I believe calling processedSchemaRdd.persist(DISK) and
> processedSchemaRdd.checkpoint() only persist the data; I will lose all the
> RDD metadata, so when I restart my driver, that data is kind of useless for
> me (correct me if I am wrong).
>
> I thought of doing processedSchemaRdd.saveAsParquetFile (on HDFS), but I
> fear that if my HDFS block size is smaller than the partition file size, I
> will get more partitions when reading back than the original schemaRdd had.
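For illustration, a minimal sketch of that suggestion against the Spark 1.2-era SchemaRDD API; the paths, table names, and the query itself are placeholders, not anything from this thread:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("long-running-sql"))
    val sqlContext = new SQLContext(sc)

    // Register the source Parquet data as a table (path is illustrative).
    sqlContext.parquetFile("hdfs:///data/raw").registerTempTable("raw_table")

    // Fix the number of post-shuffle partitions, so the result is written
    // out as a predictable number of Parquet part-files.
    sqlContext.setConf("spark.sql.shuffle.partitions", "64")

    val processed = sqlContext.sql(
      "SELECT key, count(*) AS cnt FROM raw_table GROUP BY key")

    // Persist the computed result; the schema is stored with the data,
    // so it survives a driver restart.
    processed.saveAsParquetFile("hdfs:///warehouse/processed_table")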
Re: Spark SQL - Long running job
How about persisting the computed result table first, before caching it? Then, after restarting your service, you only need to cache the persisted result table, without recomputing it. Somewhat like checkpointing.

Cheng

On 2/22/15 12:55 AM, nitin wrote:
> Is it possible to persist processed/cached RDDs on disk such that my
> system up time is less when restarted after failure/going down?
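A sketch of what that restart path might look like, again assuming the Spark 1.2-era API; the path and table name are placeholders:

    import org.apache.spark.sql.SQLContext

    def warmUp(sqlContext: SQLContext): Unit = {
      // On restart, skip the expensive recomputation: reload the persisted
      // result table straight from Parquet (path/name are illustrative)...
      val restored = sqlContext.parquetFile("hdfs:///warehouse/processed_table")
      restored.registerTempTable("processed_table")

      // ...and rebuild the in-memory columnar cache from it, instead of
      // re-running the original time-consuming computation.
      sqlContext.cacheTable("processed_table")
    }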
Spark SQL - Long running job
Hi All,

I intend to build a long-running Spark application which fetches data/tuples from Parquet, does some time-consuming processing, and then caches the processed table (InMemoryColumnarTableScan). My use case needs good retrieval time for SQL queries (the benefits of the Spark SQL optimizer) and data compression (built into the in-memory caching).

The problem is that if my driver goes down, I have to fetch the data for all the tables again, recompute it, and cache it, which is time consuming. Is it possible to persist processed/cached RDDs on disk so that my system's downtime is shorter when it is restarted after a failure?

On a side note, the data processing contains a shuffle step which creates huge temporary shuffle files on local disk in the temp folder, and as per the current logic, shuffle files don't get deleted while their executors are running. Since this is a long-running Spark job, my local disk fills up quickly and runs out of space. (I'm running Spark in yarn-client mode, by the way.)

Thanks
-Nitin
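For context, the workflow described above might look roughly like this (Spark 1.2-era API; all paths, table names, and the query are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("long-running-sql"))
    val sqlContext = new SQLContext(sc)

    // 1. Fetch tuples from Parquet (path is illustrative).
    sqlContext.parquetFile("hdfs:///data/events").registerTempTable("events")

    // 2. Time-consuming processing; the GROUP BY introduces the shuffle
    //    step that writes temporary shuffle files to local disk.
    val processed = sqlContext.sql(
      "SELECT userId, count(*) AS eventCount FROM events GROUP BY userId")
    processed.registerTempTable("processed_events")

    // 3. Cache as an in-memory columnar table (InMemoryColumnarTableScan).
    //    This cache lives in the executors and is lost if the driver dies,
    //    which is the recomputation problem described above.
    sqlContext.cacheTable("processed_events")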