Re: [pyspark 2.3+] CountDistinct
I can't exactly reproduce this. Here is what I tried quickly:

    import uuid

    import findspark
    findspark.init()  # noqa

    import pyspark
    from pyspark.sql import functions as F  # noqa: N812

    spark = pyspark.sql.SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [[str(uuid.uuid4())] for i in range(45)],
        ['col1'])

    print('Spark version:', spark.sparkContext.version)
    print('Null count:', df.filter(F.col('col1').isNull()).count())
    print('Value count:', df.filter(F.col('col1').isNotNull()).count())
    print('Distinct Count 1:', df.agg(F.countDistinct(F.col('col1'))).collect()[0][0])
    print('Distinct Count 2:', df.agg(F.countDistinct(F.col('col1'))).collect()[0][0])

This always returns:

    Spark version: 2.4.0
    Null count: 0
    Value count: 45
    Distinct Count 1: 45
    Distinct Count 2: 45

On Sat, Jun 29, 2019 at 6:51 PM Rishi Shah wrote:

> Thanks Abdeali! Please find details below:
>
> df.agg(countDistinct(col('col1'))).show()   --> 450089
> df.agg(countDistinct(col('col1'))).show()   --> 450076
> df.filter(col('col1').isNull()).count()     --> 0
> df.filter(col('col1').isNotNull()).count()  --> 450063
>
> col1 is a string
> Spark version 2.4.0
> data size: ~500 GB
>
> On Sat, Jun 29, 2019 at 5:33 AM Abdeali Kothari wrote:
>
>> How large is the data frame, and what data type are you counting distinct
>> for?
>> I use count distinct quite a bit and haven't noticed anything peculiar.
>>
>> Also, which exact version in 2.3.x?
>> And are you performing any operations on the DF before the countDistinct?
>>
>> I recall there was a bug when I did countDistinct(PythonUDF(x)) in the
>> same query, which was resolved in one of the minor versions of 2.3.x.
>>
>> On Sat, Jun 29, 2019, 10:32 Rishi Shah wrote:
>>
>>> Hi All,
>>>
>>> Just wanted to check in to see if anyone has any insight into this
>>> behavior. Any pointers would help.
>>>
>>> Thanks,
>>> Rishi
>>>
>>> On Fri, Jun 14, 2019 at 7:05 AM Rishi Shah wrote:
>>>
>>>> Hi All,
>>>>
>>>> Recently we noticed that countDistinct on a larger dataframe doesn't
>>>> always return the same value. Any idea? If this is the case, then what is
>>>> the difference between countDistinct and approx_count_distinct?
>>>>
>>>> --
>>>> Regards,
>>>> Rishi Shah
Re: [pyspark 2.3+] CountDistinct
Thanks Abdeali! Please find details below:

    df.agg(countDistinct(col('col1'))).show()   --> 450089
    df.agg(countDistinct(col('col1'))).show()   --> 450076
    df.filter(col('col1').isNull()).count()     --> 0
    df.filter(col('col1').isNotNull()).count()  --> 450063

col1 is a string
Spark version 2.4.0
data size: ~500 GB

On Sat, Jun 29, 2019 at 5:33 AM Abdeali Kothari wrote:

> How large is the data frame, and what data type are you counting distinct
> for?
> I use count distinct quite a bit and haven't noticed anything peculiar.
>
> Also, which exact version in 2.3.x?
> And are you performing any operations on the DF before the countDistinct?
>
> I recall there was a bug when I did countDistinct(PythonUDF(x)) in the
> same query, which was resolved in one of the minor versions of 2.3.x.
>
> On Sat, Jun 29, 2019, 10:32 Rishi Shah wrote:
>
>> Hi All,
>>
>> Just wanted to check in to see if anyone has any insight into this
>> behavior. Any pointers would help.
>>
>> Thanks,
>> Rishi
>>
>> On Fri, Jun 14, 2019 at 7:05 AM Rishi Shah wrote:
>>
>>> Hi All,
>>>
>>> Recently we noticed that countDistinct on a larger dataframe doesn't
>>> always return the same value. Any idea? If this is the case, then what is
>>> the difference between countDistinct and approx_count_distinct?
>>>
>>> --
>>> Regards,
>>> Rishi Shah

--
Regards,
Rishi Shah
Re: [pyspark 2.3+] CountDistinct
How large is the data frame, and what data type are you counting distinct for?
I use count distinct quite a bit and haven't noticed anything peculiar.

Also, which exact version in 2.3.x?
And are you performing any operations on the DF before the countDistinct?

I recall there was a bug when I did countDistinct(PythonUDF(x)) in the same query, which was resolved in one of the minor versions of 2.3.x.

On Sat, Jun 29, 2019, 10:32 Rishi Shah wrote:

> Hi All,
>
> Just wanted to check in to see if anyone has any insight into this
> behavior. Any pointers would help.
>
> Thanks,
> Rishi
>
> On Fri, Jun 14, 2019 at 7:05 AM Rishi Shah wrote:
>
>> Hi All,
>>
>> Recently we noticed that countDistinct on a larger dataframe doesn't
>> always return the same value. Any idea? If this is the case, then what is
>> the difference between countDistinct and approx_count_distinct?
>>
>> --
>> Regards,
>> Rishi Shah
Re: [pyspark 2.3+] CountDistinct
Hi All,

Just wanted to check in to see if anyone has any insight about this behavior. Any pointers would help.

Thanks,
Rishi

On Fri, Jun 14, 2019 at 7:05 AM Rishi Shah wrote:

> Hi All,
>
> Recently we noticed that countDistinct on a larger dataframe doesn't
> always return the same value. Any idea? If this is the case then what is
> the difference between countDistinct & approx_count_distinct?
>
> --
> Regards,
> Rishi Shah

--
Regards,
Rishi Shah
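[Editor's note] On the question raised in the thread, countDistinct is an exact aggregate (conceptually, counting a de-duplicated set of values), while approx_count_distinct trades a small, bounded error for far less memory by using the HyperLogLog++ sketch. The contrast can be illustrated without a Spark cluster. The sketch below is plain Python and uses a k-minimum-values (KMV) estimator, which is *not* Spark's actual algorithm (Spark uses HyperLogLog++); the data and the parameter k are made up for the example.

```python
import hashlib


def exact_distinct(values):
    """Exact distinct count: what countDistinct does conceptually.
    Memory grows with the number of distinct values."""
    return len(set(values))


def approx_distinct(values, k=1024):
    """K-minimum-values estimator: an approximate distinct count.
    Hash each value to a uniform number in [0, 1) and keep only the
    k smallest hashes. If the k-th smallest hash is h, the distinct
    count is roughly (k - 1) / h. Memory is bounded by k, and the
    relative error is about 1 / sqrt(k)."""
    max_hash = float(2 ** 64)
    hashes = set()
    for v in values:
        digest = hashlib.md5(str(v).encode()).digest()
        hashes.add(int.from_bytes(digest[:8], 'big') / max_hash)
    smallest = sorted(hashes)[:k]
    if len(smallest) < k:
        # Fewer than k distinct hashes seen: the count is exact.
        return len(smallest)
    return int((k - 1) / smallest[-1])


values = [f'id-{i % 10000}' for i in range(50000)]  # 10000 distinct values
print(exact_distinct(values))    # 10000, always
print(approx_distinct(values))   # close to 10000, within a few percent
```

Both estimators are deterministic for a fixed input, so neither explains countDistinct returning different values on the same data; in Spark, exact countDistinct should be stable unless the underlying data or a non-deterministic expression (such as a Python UDF, as mentioned above) changes between runs.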