I am using Spark 1.4.1, in standalone mode, on a cluster of 3 nodes. Using Spark SQL with a HiveContext, I am trying to run a simple scan query on an existing Hive table (an external table whose rows live in plain text files stored in HDFS; it is NOT Parquet, ORC, or any other richer format):
    DataFrame res = hiveCtx.sql("SELECT * FROM lineitem WHERE L_LINENUMBER < 0");

What I observe is that the performance of this full scan in Spark is not comparable with Hive (it is almost 4 times slower). Checking resource usage, I see that the workers/executors do not scan in parallel but on a per-node basis: first the executors on node 1 read from disk while the other two nodes do no I/O and only receive data from the first node over the network; then the 2nd node does its scan, and then the third.

I also noticed that if I load the same data file directly through my Spark context (using textFile()) and run count() on it (not using Spark SQL), I get better performance by increasing the number of partitions. So I tried to do the same thing (increase the number of partitions up front) in Spark SQL:

    var tab = sqlContext.read.table("lineitem");
    tab.repartition(1000); // OR tab.coalesce(1000);

but neither repartition() nor coalesce() actually works: they do not return an error, yet if I check

    var p = tab.rdd.partitions.size;

before and after calling either of them, it returns the same number of partitions.

I am just wondering how I can change the number of partitions of a Hive external table in Spark SQL. Any help/suggestion would be appreciated.
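For completeness, here is a self-contained sketch of the two experiments described above (the HDFS path and app name are placeholders; everything else is as described):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object PartitionExperiment {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("partition-experiment"))

    // Experiment 1: plain-RDD scan. Asking for more input splits up front
    // (the 2nd argument of textFile, minPartitions) improves the count() time.
    val lines = sc.textFile("hdfs:///path/to/lineitem", 1000) // path is a placeholder
    println(lines.count())

    // Experiment 2: attempting the same idea through Spark SQL.
    val sqlContext = new HiveContext(sc)
    val tab = sqlContext.read.table("lineitem")
    println(tab.rdd.partitions.size) // initial partition count of the table scan
    tab.repartition(1000)            // no error, but...
    println(tab.rdd.partitions.size) // ...still reports the same number as before

    sc.stop()
  }
}
```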