[ https://issues.apache.org/jira/browse/SPARK-35504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Nikolay Sokolov resolved SPARK-35504.
-------------------------------------
    Resolution: Fixed

I had not fully understood what was written in the documentation; it helped me figure this out.

> count distinct asterisk
> ------------------------
>
>                 Key: SPARK-35504
>                 URL: https://issues.apache.org/jira/browse/SPARK-35504
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.0
>         Environment: {code:java}
> uname -a
> Linux 5.4.0-1038-aws #40~18.04.1-Ubuntu SMP Sat Feb 6 01:56:56 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
> {code}
> {code:java}
> lsb_release -a
> No LSB modules are available.
> Distributor ID: Ubuntu
> Description:    Ubuntu 18.04.4 LTS
> Release:        18.04
> Codename:       bionic
> {code}
> {code:java}
> /opt/spark/bin/spark-submit --version
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/ '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 3.0.0
>       /_/
>
> Using Scala version 2.12.10, OpenJDK 64-Bit Server VM, 1.8.0_292
> Branch HEAD
> Compiled by user ubuntu on 2020-06-06T13:05:28Z
> Revision 3fdfce3120f307147244e5eaf46d61419a723d50
> Url https://gitbox.apache.org/repos/asf/spark.git
> Type --help for more information.
> {code}
> {code:java}
> lscpu
> Architecture:        x86_64
> CPU op-mode(s):      32-bit, 64-bit
> Byte Order:          Little Endian
> CPU(s):              4
> On-line CPU(s) list: 0-3
> Thread(s) per core:  2
> Core(s) per socket:  2
> Socket(s):           1
> NUMA node(s):        1
> Vendor ID:           GenuineIntel
> CPU family:          6
> Model:               85
> Model name:          Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz
> Stepping:            7
> CPU MHz:             3602.011
> BogoMIPS:            6000.01
> Hypervisor vendor:   KVM
> Virtualization type: full
> L1d cache:           32K
> L1i cache:           32K
> L2 cache:            1024K
> L3 cache:            36608K
> NUMA node0 CPU(s):   0-3
> Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke
> {code}
>
>            Reporter: Nikolay Sokolov
>            Priority: Minor
>         Attachments: SPARK-35504_first_query_plan.log, SPARK-35504_second_query_plan.log
>
> Hi everyone,
> I hope you're well!
>
> Today I came across a very interesting case in which the result of counting unique rows differs depending on the form of the Spark SQL query: count(distinct *) versus count(*) over a derived table.
> I still can't figure out on my own whether this is a bug or a feature, so I would like to share what I found.
>
> I run Spark SQL queries through the Thrift server (among other ways), connecting to the Spark cluster, and I use the DBeaver app to execute them.
>
> So, I have two Spark SQL queries that are identical from an algorithmic point of view but return different results.
>
> The first query:
> {code:sql}
> select count(distinct *) unique_amt from storage_datamart.olympiads
> ; -- Rows: 13437678
> {code}
>
> The second query:
> {code:sql}
> select count(*) from (select distinct * from storage_datamart.olympiads)
> ; -- Rows: 36901430
> {code}
>
> The result of the two queries is different. (But it should be the same, right?)
> {code:sql}
> select 'The first query' description, count(distinct *) unique_amt from storage_datamart.olympiads
> union all
> select 'The second query', count(*) from (select distinct * from storage_datamart.olympiads)
> ;
> {code}
>
> The result of the above query is the following:
> {code:java}
> The first query     13437678
> The second query    36901430
> {code}
>
> I can also easily calculate the number of key combinations that occur exactly once in the table:
> {code:sql}
> select count(*) from (
>   select student_id, olympiad_id, tour, grade
>   from storage_datamart.olympiads
>   group by student_id, olympiad_id, tour, grade
>   having count(*) = 1
> )
> ; -- Rows: 36901365
> {code}
>
> The table DDL is the following:
> {code:sql}
> CREATE TABLE `storage_datamart`.`olympiads` (
>   `ptn_date` DATE,
>   `student_id` BIGINT,
>   `olympiad_id` STRING,
>   `grade` BIGINT,
>   `grade_type` STRING,
>   `tour` STRING,
>   `created_at` TIMESTAMP,
>   `created_at_local` TIMESTAMP,
>   `olympiad_num` BIGINT,
>   `olympiad_name` STRING,
>   `subject` STRING,
>   `started_at` TIMESTAMP,
>   `ended_at` TIMESTAMP,
>   `region_id` BIGINT,
>   `region_name` STRING,
>   `municipality_name` STRING,
>   `school_id` BIGINT,
>   `school_name` STRING,
>   `school_status` BOOLEAN,
>   `oly_n_common` INT,
>   `num_day` INT,
>   `award_type` STRING,
>   `new_student_legacy` INT,
>   `segment` STRING,
>   `total_start` TIMESTAMP,
>   `total_end` TIMESTAMP,
>   `year_learn` STRING,
>   `parent_id` BIGINT,
>   `teacher_id` BIGINT,
>   `parallel` BIGINT,
>   `olympiad_type` STRING)
> USING parquet
> LOCATION 's3a://uchiru-bi-dwh/storage/datamart/olympiads.parquet'
> ;
> {code}
>
> Could you please tell me why, in the first Spark SQL
query, counting the unique number of rows with the construction `count(distinct *)` does not count correctly, and why the results of the two Spark SQL queries differ?
> Thanks in advance.
>
> P.S. I could not find a description of such behaviour of the function `count(distinct *)` in the [official Spark documentation|https://spark.apache.org/docs/latest/sql-ref-functions-builtin.html#aggregate-functions]:
> {quote}count(DISTINCT expr[, expr...]) -> Returns the number of rows for which the supplied expression(s) are unique and non-null.
> {quote}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
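For what it's worth, the documentation snippet quoted in the issue already hints at an explanation: `count(DISTINCT expr[, expr...])` counts only rows whose expression values are non-null, whereas `select distinct *` keeps rows containing NULLs as distinct rows, so the two forms can legitimately disagree on a table with NULLs. A minimal pure-Python sketch of that semantic difference (the sample data is made up; `None` stands in for SQL NULL):

```python
# Hypothetical 4-row table; None models SQL NULL.
rows = [
    (1, "math"),
    (1, "math"),      # exact duplicate of the first row
    (2, None),        # row containing a NULL
    (3, "physics"),
]

# count(distinct *) semantics per the quoted docs: only rows whose
# values are unique AND non-null are counted, so (2, None) is dropped.
count_distinct_star = len({r for r in rows if all(v is not None for v in r)})

# count(*) over (select distinct *): every distinct row counts,
# including rows that contain NULLs.
count_over_distinct = len(set(rows))

print(count_distinct_star)  # 2
print(count_over_distinct)  # 3
```

If the `olympiads` table has NULLs in any column, a gap like 13437678 vs. 36901430 between the two queries would be consistent with this NULL-exclusion rule rather than a counting bug.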