Re: pyspark histogram

2017-09-27 Thread Weichen Xu
If you want to avoid pulling values into Python, you can use the Hive function
"histogram_numeric". You need to set `SparkSession.enableHiveSupport()`, but
note that calling Hive functions in Spark will also slow down performance.
Spark SQL hasn't implemented "histogram_numeric" natively yet, but I think it
will be added in the future.


pyspark histogram

2017-09-27 Thread Brian Wylie
Hi All,

My Google/SO searching is somehow failing on this; I simply want to compute
histograms for a column in a Spark DataFrame.

There are two SO hits on this question:
- https://stackoverflow.com/questions/39154325/pyspark-show-histogram-of-a-data-frame-column
- https://stackoverflow.com/questions/36043256/making-histogram-with-spark-dataframe-column

I've actually submitted an answer to the second one, but I realize that my
answer is wrong: even though this code looks fine/simple, in my testing the
flatMap is 'bad' because it really slows things down, since it pulls each
value into Python.

# Show a histogram of the 'C1' column
import matplotlib.pyplot as plt

# .rdd.flatMap(...) deserializes every value into Python before binning
bins, counts = df.select('C1').rdd.flatMap(lambda x: x).histogram(20)
# This is a bit awkward, but I believe this is the correct way to plot
# pre-computed bins/counts with matplotlib
plt.hist(bins[:-1], bins=bins, weights=counts)

I've looked at QuantileDiscretizer...

from pyspark.ml.feature import QuantileDiscretizer

discretizer = QuantileDiscretizer(numBuckets=20, inputCol='query_length',
                                  outputCol='query_hist')
result = discretizer.fit(spark_df).transform(spark_df)
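
To sanity-check what that gives you, something like this should work (a
sketch; `query_hist` holds the bucket index for each row):

# QuantileDiscretizer splits on quantiles, so these bucket counts come out
# roughly equal by construction (equal-frequency, not equal-width bins)
result.groupBy('query_hist').count().orderBy('query_hist').show()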

but I feel like this might be the wrong path, so... the general question is:
what's the best way to compute histograms in PySpark on columns that have a
large number of rows?