[jira] [Commented] (SPARK-8380) SparkR mis-counts

2015-06-16 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14588196#comment-14588196
 ] 

Shivaram Venkataraman commented on SPARK-8380:
--

Thanks for the update. I'm going to mark this issue as resolved. BTW, if there 
are documentation changes that you think would be helpful, feel free to create 
JIRAs / PRs for them.

> SparkR mis-counts
> -
>
> Key: SPARK-8380
> URL: https://issues.apache.org/jira/browse/SPARK-8380
> Project: Spark
> Issue Type: Bug
> Components: SparkR
> Affects Versions: 1.4.0
> Reporter: Rick Moritz
>
> On my dataset of ~9 million rows x 30 columns, queried via Hive, I can 
> perform count operations on the entire dataset and get the correct 
> value, as double-checked against the same code in Scala.
> When I start to add conditions, or even do a simple partial ascending 
> histogram, I get discrepancies.
> In particular, there are missing values in SparkR, and massively so:
> a top-6 count of a certain feature in my dataset yields numbers an order of 
> magnitude smaller than those I get via Scala.
> The following logic, which I consider equivalent, is the basis for this report:
> counts<-summarize(groupBy(df, df$col_name), count = n(tdf$col_name))
> head(arrange(counts, desc(counts$count)))
> versus:
> val table = sql("SELECT col_name, count(col_name) as value from df group by col_name order by value desc")
> The first, in particular, is taken directly from the SparkR programming 
> guide. Since summarize isn't documented, as far as I can see, I'd hope it does 
> what the programming guide indicates. In that case this would be a pretty 
> serious logic bug (no errors are thrown). Otherwise, a lack of documentation 
> and a badly worded example in the guide may be behind my misperception of 
> SparkR's functionality.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8380) SparkR mis-counts

2015-06-15 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14586277#comment-14586277
 ] 

Shivaram Venkataraman commented on SPARK-8380:
--

[~RPCMoritz] A couple of things that would be interesting to see:

1. Does the `sql` command in SparkR work correctly?
2. Can you try the DataFrame statements in Scala and see what results you get?

cc [~rxin]
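
For item 2, the DataFrame version of the SQL query from the report could look 
roughly like the following in Scala. This is only a sketch, not code from the 
thread: it assumes Spark 1.4 built with Hive support and a Hive table named 
`df` with a column `col_name`, as in the reported query. The object name 
`CountCheck` and the application name are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.functions.{count, desc}

// Hypothetical driver object; the name is an assumption for illustration.
object CountCheck {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SPARK-8380-check"))
    val sqlContext = new HiveContext(sc)

    // Assumes the Hive table is called "df", as in the SQL query in the report.
    val df = sqlContext.table("df")

    // Group by col_name and count the non-null values per group,
    // mirroring "count(col_name) as value ... group by col_name".
    val counts = df.groupBy("col_name").agg(count("col_name").as("value"))

    // Largest groups first; print the top 6 to compare with head() in SparkR.
    counts.orderBy(desc("value")).show(6)

    sc.stop()
  }
}
```

If these numbers match the SQL result but not the SparkR result, that would 
point at the SparkR side rather than at the DataFrame aggregation itself.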




[jira] [Commented] (SPARK-8380) SparkR mis-counts

2015-06-15 Thread Rick Moritz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14586262#comment-14586262
 ] 

Rick Moritz commented on SPARK-8380:


I will attempt to reproduce this with an alternate dataset ASAP, but getting 
large-volume datasets into this cluster is difficult.
