Not able to save data after running FPGrowth in PySpark
Hello,

I am using FPGrowth to generate frequent itemsets, and the model itself appears to work fine: if I select n rows, I can see the data. But when I try to save the data using any of the methods such as write.orc, saveAsTable, or saveAsParquet, it takes an unusually long time. If I save data before running the model, all of these methods work perfectly; the issue only appears after running the model. Below is the sample code I am using. Can you let me know if there is anything wrong with what I am doing, or whether any configuration changes need to be made?

```python
rdd_buckets.write.orc('/data/buckets')
# Before running the model, this works and writes the data in under 2 minutes.

transactions = rdd_buckets.rdd.map(lambda line: line.buckets.split('::'))
model = FPGrowth.train(transactions, minSupport=0.01, numPartitions=200)
result = model.freqItemsets().toDF()

size_1_buckets = result.filter(F.size(result.items) == 1)
size_2_buckets = result.filter(F.size(result.items) == 2)

size_1_buckets.registerTempTable('size_1_buckets')
hive_context.sql("use buckets")
hive_context.sql("create table size_1_buckets as select * from size_1_buckets")
# This step takes a long time (10 hours) to complete the write.

size_2_buckets.registerTempTable('size_2_buckets')
hive_context.sql("use buckets")
hive_context.sql("create table size_2_buckets as select * from size_2_buckets")
# This step also takes a long time (10 hours) to complete the write.
```

Below is the command we are using to submit the job:

```
/usr/hdp/current/spark-client/bin/spark-submit --master yarn-client \
  --num-executors 10 \
  --conf spark.executor.memory=10g \
  --conf spark.yarn.queue=batch \
  --conf spark.rpc.askTimeout=100s \
  --conf spark.driver.memory=2g \
  --conf spark.kryoserializer.buffer.max=256m \
  --conf spark.executor.cores=5 \
  --conf spark.driver.cores=4 \
  python/buckets.py
```

Thanks,
Goutham.
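For completeness, here is a variant I am considering but have not yet tested. It assumes the slowness comes from Spark's lazy evaluation: since `freqItemsets()` is only computed when an action runs, each of the two `create table ... as select` statements would re-trigger the full FPGrowth computation. The sketch below persists `result` and forces it to materialize once before the writes, and uses the DataFrame writer instead of CTAS through the HiveContext; the table names `buckets.size_1_buckets` and `buckets.size_2_buckets` mirror the ones above.

```python
# Untested sketch: persist the frequent-itemset DataFrame so the FPGrowth
# computation is not re-run by each subsequent write action.
from pyspark import StorageLevel
import pyspark.sql.functions as F

result = model.freqItemsets().toDF()
result.persist(StorageLevel.MEMORY_AND_DISK)
result.count()  # action to force materialization once, up front

size_1_buckets = result.filter(F.size(result.items) == 1)
size_2_buckets = result.filter(F.size(result.items) == 2)

# Write via the DataFrame writer rather than registerTempTable + CTAS.
size_1_buckets.write.saveAsTable('buckets.size_1_buckets')
size_2_buckets.write.saveAsTable('buckets.size_2_buckets')

result.unpersist()
```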