If you want to avoid pulling the values into Python, you can use the Hive function "histogram_numeric". You need to call `SparkSession.builder.enableHiveSupport()` when creating the session, but note that calling a Hive function from Spark will also slow down performance. Spark SQL hasn't implemented "histogram_numeric" natively yet, but I think it will be added in the future.
On Wed, Sep 27, 2017 at 11:50 PM, Brian Wylie <briford.wy...@gmail.com> wrote:
> Hi All,
>
> My google/SO searching is somehow failing on this. I simply want to compute
> histograms for a column in a Spark dataframe.
>
> There are two SO hits on this question:
> - https://stackoverflow.com/questions/39154325/pyspark-show-histogram-of-a-data-frame-column
> - https://stackoverflow.com/questions/36043256/making-histogram-with-spark-dataframe-column
>
> I've actually submitted an answer to the second one, but I realize that my
> answer is wrong because even though this code looks fine/simple... in my
> testing the flatMap is 'bad' because it really slows things down,
> since it pulls each value into Python.
>
> # Show histogram of the 'C1' column
> bins, counts = df.select('C1').rdd.flatMap(lambda x: x).histogram(20)
> # This is a bit awkward but I believe this is the correct way to do it
> plt.hist(bins[:-1], bins=bins, weights=counts)
>
> I've looked at QuantileDiscretizer...
>
> from pyspark.ml.feature import QuantileDiscretizer
> discretizer = QuantileDiscretizer(numBuckets=20, inputCol='query_length',
>                                   outputCol='query_hist')
> result = discretizer.fit(spark_df).transform(spark_df)
>
> but I feel like this might be the wrong path... so the general question is:
> what's the best way to compute histograms in pyspark on columns that have a
> large number of rows?