On Thursday 03 March 2016 11:03 AM, Angel Angel wrote:
For the above case, one approach that can help a lot is to convert the lines[0] to a table and then do a join on it instead of individual searches. Something like:

  val linesRDD = sc.parallelize(lines, 1)  // number of lines is small, so 1 partition should be fine

If you do need to scan the DataFrame multiple times, then this will end up scanning the csv file, formatting etc. in every loop. I would suggest caching in memory or saving to parquet/orc format for faster access. If there is enough memory then the SQLContext.cacheTable API can be used, else save to a parquet file:

  dfCustomers1.write.parquet("database.parquet")

then read that back as dfCustomers2 and use it everywhere. Normally parquet file scanning should be much faster than CSV scan+format. You can also try various values of "spark.sql.parquet.compression.codec" (lzo, snappy, uncompressed) instead of the default gzip, and check whether that reduces the runtime. The fastest option will be sqlContext.cacheTable if there is enough memory, but I doubt that will be possible since you say it is a big table.
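A rough end-to-end sketch of the above, assuming lines is a small Array[String], the join column is named "Key", and dfCustomers1/sqlContext are as in your code (Spark 1.x DataFrame API); the names and column are illustrative, not from your job:

  import org.apache.spark.sql.SQLContext

  val sqlContext = new SQLContext(sc)
  import sqlContext.implicits._

  // 1. Build a small one-column table from the lines and join once,
  //    instead of searching dfCustomers1 separately for every line.
  val linesRDD = sc.parallelize(lines, 1)             // few lines, 1 partition is enough
  val linesDF  = linesRDD.map(Tuple1(_)).toDF("Key")  // "Key" is an assumed column name
  val joined   = dfCustomers1.join(linesDF, "Key")

  // 2a. If the table fits in memory, cache it and reuse it across queries.
  dfCustomers1.registerTempTable("customers")
  sqlContext.cacheTable("customers")

  // 2b. Otherwise write it out once as parquet and scan the parquet copy everywhere.
  sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")  // also try lzo / uncompressed
  dfCustomers1.write.parquet("database.parquet")
  val dfCustomers2 = sqlContext.read.parquet("database.parquet")

With the parquet copy (dfCustomers2), each subsequent pass reads a compact columnar file rather than re-parsing and re-formatting the CSV.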
thanks

--
Sumedh Wale
SnappyData (http://www.snappydata.io)