On Thursday 03 March 2016 11:03 AM, Angel Angel wrote:
For the above case, one approach that can help a lot is to convert the lines[0] to a table and then do a join on it instead of individual searches. Something like:

  val linesRDD = sc.parallelize(lines, 1)  // number of lines is small, so 1 partition should be fine

If you do need to scan the DataFrame multiple times, then this will end up scanning the csv file, formatting etc. in every loop. I would suggest caching in memory or saving to parquet/orc format for faster access. If there is enough memory then the SQLContext.cacheTable API can be used, else save to a parquet file:

  dfCustomers1.write.parquet("database.parquet")

then read that back as dfCustomers2 and use it everywhere. Normally parquet file scanning should be much faster than CSV scan+format. You can also try various values of "spark.sql.parquet.compression.codec" (lzo, snappy, uncompressed) instead of the default gzip, and check whether that reduces the runtime. The fastest option will be sqlContext.cacheTable if there is enough memory, but I doubt that will be possible since you say it is a big table.
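A rough end-to-end sketch of the above, assuming lines is a small Array[String], the join column is named "Key", and dfCustomers1/sqlContext are as in your code (Spark 1.x DataFrame API); the names and column are illustrative, not from your job:

  import org.apache.spark.sql.SQLContext

  val sqlContext = new SQLContext(sc)
  import sqlContext.implicits._

  // 1. Build a small one-column table from the lines and join once,
  //    instead of searching dfCustomers1 separately for every line.
  val linesRDD = sc.parallelize(lines, 1)             // few lines, 1 partition is enough
  val linesDF  = linesRDD.map(Tuple1(_)).toDF("Key")  // "Key" is an assumed column name
  val joined   = dfCustomers1.join(linesDF, "Key")

  // 2a. If the table fits in memory, cache it and reuse it across queries.
  dfCustomers1.registerTempTable("customers")
  sqlContext.cacheTable("customers")

  // 2b. Otherwise write it out once as parquet and scan the parquet copy everywhere.
  sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")  // also try lzo / uncompressed
  dfCustomers1.write.parquet("database.parquet")
  val dfCustomers2 = sqlContext.read.parquet("database.parquet")

With the parquet copy (dfCustomers2), each subsequent pass reads a compact columnar file rather than re-parsing and re-formatting the CSV.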
thanks

--
Sumedh Wale
SnappyData (http://www.snappydata.io)