[ https://issues.apache.org/jira/browse/SPARK-32515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17171682#comment-17171682 ]
Jayce Jiang edited comment on SPARK-32515 at 8/5/20, 7:10 PM:
--------------------------------------------------------------

Okay, I am trying to get all distinct values of the "username" column of this df. The expected result is what I show with filter_df.toPandas()["username"].unique: all usernames are in the correct format, and the username column contains only the characters [a-z], [A-Z], [0-9], and the underscore.

!unknown.png|width=631,height=251!

For example, "danielrainge", "dgreen_14". But this is not what I am actually getting.

Actual results: the problem appears when I use Spark functions instead of converting to a pandas dataframe first. As you can see in the image, in In [134], when I call the collect() method, I get results like ['#classic'], ['#foodforthought'], etc. These bracketed results come from another column and shouldn't be there; none of the strings in the username column contain brackets or hashtags (#).

!unknown1.png|width=576,height=272!

I am trying it in Google Colab right now to see whether it is a Jupyter notebook problem. Will keep you updated.

was (Author: tigaiii123):
Okay, I am trying to get all distinct values of the "username" column of this df. The expected result is what I show with filter_df.toPandas()["username"].unique: all usernames are in the correct format, and the username column contains only the characters [a-z], [A-Z], [0-9], and the underscore.

!unknown.png|width=631,height=251!

For example, "danielrainge", "dgreen_14".

What I am actually getting: the problem appears when I use Spark functions instead of converting to a pandas dataframe first. As you can see in the image, in In [134], when I call the collect() method, I get results like ["#classic"] and other random results with brackets []. Those results shouldn't be there; none of the strings in the username column contain brackets or hashtags (#).

!unknown1.png|width=576,height=272!
I am trying it in Google Colab right now to see whether it is a Jupyter notebook problem. Will keep you updated.

> Distinct Function Weird Bug
> ---------------------------
>
> Key: SPARK-32515
> URL: https://issues.apache.org/jira/browse/SPARK-32515
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.6
> Environment: Windows 10 and Mac, both have the same issues.
> Using Scala version 2.11.12
> Python 3.6.10
> java version "1.8.0_261"
> Reporter: Jayce Jiang
> Priority: Major
> Attachments: Capture.PNG, Capture1.png, Capture2.PNG,
> Screen_Shot_2020-08-05_at_2.46.42_PM.png, image-2020-08-03-07-03-55-716.png,
> unknown.png, unknown1.png, unknown2.png
>
>
> A weird Spark display and counting error. When I load my CSV file into
> Spark and try to check all distinct values of a column in a dataframe,
> everything I try in Spark results in a wrong answer. But if I convert my
> Spark dataframe into a pandas dataframe, it works. Please help. This bug
> only happens with this one CSV file; all my other CSV files work properly.
> Here are the pictures.
>
> !image-2020-08-01-21-19-06-402.png!!image-2020-08-01-21-19-03-289.png!!image-2020-08-01-21-18-58-625.png!

--
This message was sent by Atlassian Jira
(v8.3.4#803005)