[jira] [Commented] (SPARK-8380) SparkR mis-counts
[ https://issues.apache.org/jira/browse/SPARK-8380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14588196#comment-14588196 ]

Shivaram Venkataraman commented on SPARK-8380:
----------------------------------------------

Thanks for the update. I'm going to mark this issue as resolved. BTW, if there are documentation changes that you think will be helpful, feel free to create JIRAs / PRs for them.

> SparkR mis-counts
> -----------------
>
>                 Key: SPARK-8380
>                 URL: https://issues.apache.org/jira/browse/SPARK-8380
>             Project: Spark
>          Issue Type: Bug
>          Components: SparkR
>    Affects Versions: 1.4.0
>            Reporter: Rick Moritz
>
> On my dataset of ~9 million rows x 30 columns, queried via Hive, I can perform count operations on the entirety of the dataset and get the correct value, as double-checked against the same code in Scala.
> When I start to add conditions or even do a simple partial ascending histogram, I get discrepancies.
> In particular, there are missing values in SparkR, and massively so:
> A top 6 count of a certain feature in my dataset yields numbers an order of magnitude smaller than those I get via Scala.
> The following logic, which I consider equivalent, is the basis for this report:
> counts<-summarize(groupBy(df, df$col_name), count = n(tdf$col_name))
> head(arrange(counts, desc(counts$count)))
> versus:
> val table = sql("SELECT col_name, count(col_name) as value from df group by col_name order by value desc")
> The first, in particular, is taken directly from the SparkR programming guide. Since summarize isn't documented from what I can see, I'd hope it does what the programming guide indicates. In that case this would be a pretty serious logic bug (no errors are thrown). Otherwise, there's the possibility that a lack of documentation and a badly worded example in the guide are behind my misperception of SparkR's functionality.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8380) SparkR mis-counts
[ https://issues.apache.org/jira/browse/SPARK-8380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14586277#comment-14586277 ]

Shivaram Venkataraman commented on SPARK-8380:
----------------------------------------------

[~RPCMoritz] Couple of things that would be interesting to see:
1. Does the `sql` command in SparkR work correctly?
2. Can you try the dataframe statements in Scala and see what results you get?

cc [~rxin]
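For reference, the second suggestion above (running the DataFrame statements in Scala) might look roughly like the sketch below. It is not taken from the thread. It assumes Spark 1.4 with Hive support on the classpath, a Hive table that is actually named df, and the placeholder column name col_name used in the report. The names TopCounts and topCounts are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{count, desc}
import org.apache.spark.sql.hive.HiveContext

// Hypothetical helper object; not part of Spark or of the original report.
object TopCounts {

  // DataFrame counterpart of:
  //   SELECT col, count(col) AS value FROM ... GROUP BY col ORDER BY value DESC
  // count(col) counts only non-null values, as in SQL.
  def topCounts(df: DataFrame, column: String): DataFrame =
    df.groupBy(column)
      .agg(count(column).as("value"))
      .orderBy(desc("value"))

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SPARK-8380-check"))
    val sqlContext = new HiveContext(sc)

    // Assumption: the data is reachable as a table called "df".
    val df = sqlContext.table("df")

    // Print the six largest groups, mirroring head(arrange(...)) in SparkR.
    topCounts(df, "col_name").show(6)

    sc.stop()
  }
}
```

If the numbers printed here match the SQL query but not the SparkR output, that would point at the SparkR side rather than at the DataFrame aggregation itself.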
[jira] [Commented] (SPARK-8380) SparkR mis-counts
[ https://issues.apache.org/jira/browse/SPARK-8380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14586262#comment-14586262 ]

Rick Moritz commented on SPARK-8380:
------------------------------------

I will attempt to reproduce this with an alternate dataset ASAP, but getting large-volume datasets into this cluster is difficult.