On Thursday 03 March 2016 09:15 PM, Gourav Sengupta wrote:
Yes, that will simplify the code and avoid the explicit split/map (though the code below is simple enough as is). However, that is not the basic cause of the performance problem. Note that a DataFrame, whether created via the spark-csv package or otherwise, is just an access point into the underlying database.txt file, so multiple scans of the DataFrame, as in the code below, lead to multiple tokenization/parse passes over database.txt, which is quite expensive. The join approach reduces this to a single scan for the case below and should definitely be used if possible, but if more queries need to be run against the DataFrame, then saving it as parquet/orc (or using cacheTable, if possible) is faster in my experience.
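As a minimal sketch of those two alternatives (assuming a Spark 1.x SQLContext with the spark-csv package on the classpath; the file paths, reader options, and table name are hypothetical placeholders, not from the original code):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Each action on this DataFrame re-reads and re-parses database.txt.
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/path/to/database.txt")

// Alternative 1: cache the parsed rows in memory so that subsequent
// queries skip the CSV tokenization/parse entirely.
df.registerTempTable("database")
sqlContext.cacheTable("database")

// Alternative 2: write the data out once as parquet and run further
// queries against the columnar copy instead of the text file.
df.write.parquet("/path/to/database.parquet")
val parquetDf = sqlContext.read.parquet("/path/to/database.parquet")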
thanks
--
Sumedh Wale
SnappyData (http://www.snappydata.io)