ORC file writing hangs in pyspark

James Barney Tue, 23 Feb 2016 06:06:05 -0800

I'm trying to write an ORC file after running the FPGrowth algorithm on a
dataset of only about 2 GB. The algorithm itself performs well: if I convert
the output of freqItemsets() to a DataFrame, I can display results by
calling take(n) on it.
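
A minimal sketch of the step described above, for illustration only. It
assumes the MLlib FPGrowth API from Spark 1.5 and a HiveContext. The
function names, the text-file input, the column names, and the min_support
value are all hypothetical placeholders, not the actual job.

```python
from collections import namedtuple

# Stand-in with the same fields (items, freq) as the elements of the RDD
# returned by FPGrowthModel.freqItemsets(); used here for local checks only.
FreqItemset = namedtuple("FreqItemset", ["items", "freq"])


def itemset_to_row(itemset):
    """Flatten one frequent itemset into a plain (items, freq) tuple."""
    return (list(itemset.items), int(itemset.freq))


def preview_itemsets(transactions_path, min_support=0.2, n=10):
    """Run FPGrowth and return the first n rows of the resulting DataFrame.

    Assumption: one space-separated transaction per line in a text file.
    The real input in this thread comes from a Hive query instead.
    """
    # Spark imports are kept local so the helper above works without Spark.
    from pyspark import SparkContext
    from pyspark.sql import HiveContext
    from pyspark.mllib.fpm import FPGrowth

    sc = SparkContext(appName="fpgrowth-sketch")
    sqlContext = HiveContext(sc)

    transactions = sc.textFile(transactions_path) \
        .map(lambda line: line.strip().split(" "))
    model = FPGrowth.train(transactions, minSupport=min_support)

    # Convert the RDD of frequent itemsets into a two-column DataFrame.
    rows = model.freqItemsets().map(itemset_to_row)
    result = sqlContext.createDataFrame(rows, ["items", "freq"])
    return result.take(n)
```

The conversion helper is kept separate from the Spark calls so it can be
checked on its own, without a cluster.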


I'm using Spark 1.5.2 on HDP 2.3.4 with Python 3.4.2, running on YARN.

The input data comes from querying a Hive table (also in ORC format) and
then running a number of maps, joins, and filters on it.

The program then attempts to write the files:
    result.write.orc('/data/staged/raw_result')
    size_1_buckets.write.orc('/data/staged/size_1_results')
    filter_size_2_buckets.write.orc('/data/staged/size_2_results')

The first path, /data/staged/raw_result, is created with a _temporary
folder, but the data is never written. The job hangs at this point,
apparently indefinitely.

Additionally, no logs are recorded or available for the jobs on the history
server.

What could be the problem?
