Hi all, my Google/SO searching is somehow failing on this: I simply want to compute a histogram for a column in a Spark DataFrame.
There are two SO hits on this question:

- https://stackoverflow.com/questions/39154325/pyspark-show-histogram-of-a-data-frame-column
- https://stackoverflow.com/questions/36043256/making-histogram-with-spark-dataframe-column

I actually submitted an answer to the second one, but I now realize that my answer is wrong: even though the code looks fine and simple, in my testing the `flatMap` is 'bad' because it really slows things down, since it pulls every value out of the JVM and into Python.

```python
import matplotlib.pyplot as plt

# Show histogram of the 'C1' column
bins, counts = df.select('C1').rdd.flatMap(lambda x: x).histogram(20)

# This is a bit awkward but I believe this is the correct way to do it:
# RDD.histogram() returns one more bin edge than counts, so plot with weights.
plt.hist(bins[:-1], bins=bins, weights=counts)
```

I've also looked at `QuantileDiscretizer`:

```python
from pyspark.ml.feature import QuantileDiscretizer

discretizer = QuantileDiscretizer(numBuckets=20, inputCol='query_length', outputCol='query_hist')
result = discretizer.fit(spark_df).transform(spark_df)
```

but I feel like this might be the wrong path, since it produces quantile-based buckets rather than the equal-width bins a histogram normally uses. So the general question is: what's the best way to compute histograms in PySpark on columns that have a large number of rows?
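For what it's worth, one direction I've been sketching is to do the binning with DataFrame aggregations so all the heavy lifting stays in the JVM and only the bin counts come back to the driver. This is a minimal sketch, not tested at scale; `df_histogram` is just a name I made up, and it assumes a non-null numeric column:

```python
from pyspark.sql import functions as F

def df_histogram(df, colname, num_bins=20):
    # One pass over the data to get the column's range.
    mn, mx = df.select(F.min(colname), F.max(colname)).first()
    width = (mx - mn) / num_bins or 1.0  # guard against a constant column
    # Map every value to a bin index inside the JVM; clamp the max into the last bin.
    binned = df.select(
        F.least(
            F.floor((F.col(colname) - F.lit(mn)) / F.lit(width)),
            F.lit(num_bins - 1),
        ).alias("bin")
    )
    # Only num_bins rows ever reach the driver.
    bin_counts = {r["bin"]: r["count"] for r in binned.groupBy("bin").count().collect()}
    edges = [mn + i * width for i in range(num_bins + 1)]
    return edges, [bin_counts.get(i, 0) for i in range(num_bins)]
```

Plotting would then work the same way as before, e.g. `edges, counts = df_histogram(df, 'C1'); plt.hist(edges[:-1], bins=edges, weights=counts)`. The appeal over the `flatMap` version is that no per-row data crosses the Python boundary, just the aggregated counts.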
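Alternatively, if the ML-style API is preferred, `Bucketizer` looks like the fixed-split cousin of `QuantileDiscretizer` and also keeps the bucketing in the JVM. A rough sketch, assuming `query_length` is numeric (cast to double if needed) and that its min and max differ so the splits are strictly increasing:

```python
from pyspark.ml.feature import Bucketizer
from pyspark.sql import functions as F

# Equal-width splits over the column's range (the min/max costs one extra pass).
mn, mx = spark_df.select(F.min("query_length"), F.max("query_length")).first()
splits = [mn + i * (mx - mn) / 20.0 for i in range(21)]
splits[0], splits[-1] = float("-inf"), float("inf")  # absorb boundary/rounding issues

bucketizer = Bucketizer(splits=splits, inputCol="query_length", outputCol="bin")
hist = bucketizer.transform(spark_df).groupBy("bin").count().orderBy("bin")
hist.show()
```

Whether either of these is actually the idiomatic answer is exactly what I'm hoping someone can confirm.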