Hello,

I am using FPGrowth to generate frequent item sets, and the model itself works
fine. If I select n rows of the result I can see the data.

However, when I try to save the result using any of the write methods
(write.orc, saveAsTable, saveAsParquet), it takes an unusually long time to
write the data.

If I save the data before running the model, all of these methods work
perfectly; I only see the issue after running the model. Below is the sample
code I am using. Can you let me know if I am doing anything wrong, or whether
any configuration changes need to be made?
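
(For reference, the quick check I mean by "selecting n rows" is roughly the
following, using the result DataFrame from the code below; this returns almost
immediately, so the model output itself looks fine.)

result.show(10)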

rdd_buckets.write.orc('/data/buckets')
# before running the model this works and writes the data in less than 2 minutes

from pyspark.mllib.fpm import FPGrowth
import pyspark.sql.functions as F

transactions = rdd_buckets.rdd.map(lambda line: line.buckets.split('::'))
model = FPGrowth.train(transactions, minSupport=0.000001, numPartitions=200)
result = model.freqItemsets().toDF()

size_1_buckets = result.filter(F.size(result.items) == 1)
size_2_buckets = result.filter(F.size(result.items) == 2)

size_1_buckets.registerTempTable('size_1_buckets')
hive_context.sql("use buckets")
hive_context.sql("create table size_1_buckets as select * from size_1_buckets")
# this step takes a long time (10 hours) to complete the write

size_2_buckets.registerTempTable('size_2_buckets')
hive_context.sql("use buckets")
hive_context.sql("create table size_2_buckets as select * from size_2_buckets")
# this step also takes a long time (10 hours) to complete the write
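
For completeness, the direct DataFrame writes I mentioned above look roughly
like this (same size_1_buckets DataFrame; the path and table name here are just
examples) and show the same slowness:

size_1_buckets.write.orc('/data/size_1_buckets')
size_1_buckets.write.saveAsTable('buckets.size_1_buckets')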

Below is the command that we are using to submit the job.

/usr/hdp/current/spark-client/bin/spark-submit --master yarn-client
--num-executors 10 --conf spark.executor.memory=10g --conf
spark.yarn.queue=batch --conf spark.rpc.askTimeout=100s --conf
spark.driver.memory=2g --conf spark.kryoserializer.buffer.max=256m --conf
spark.executor.cores=5 --conf spark.driver.cores=4 python/buckets.py

Thanks,
Goutham.
