[jira] [Commented] (SPARK-12741) DataFrame count method return wrong size.
[ https://issues.apache.org/jira/browse/SPARK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17172226#comment-17172226 ]

Yu Gan commented on SPARK-12741:

Aha, I came across a similar issue. My SQL is:

select p_brand, p_size, count(ps_suppkey) as supplier_cnt
from tpch.partsupp inner join tpch.part on p_partkey = ps_partkey
group by P_BRAND, p_size

The total row counts are different: dataSet.count()=1179, dataSet.rdd().count()=1178

Finally I found the root cause: org.apache.spark.sql.execution.datasources.FailureSafeParser#parse throws BadRecordException. In PermissiveMode, when a corrupted record exists, the resulting row is a None record, and that None record is filtered out.

> DataFrame count method return wrong size.
> -----------------------------------------
>
>                 Key: SPARK-12741
>                 URL: https://issues.apache.org/jira/browse/SPARK-12741
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.0
>            Reporter: Sasi
>            Priority: Major
>
> Hi,
> I'm updating my report.
> I'm working with Spark 1.5.2 (used to be 1.5.0). I have a DataFrame and two methods, one to collect data and the other to count.
> Method doQuery looks like:
> {code}
> dataFrame.collect()
> {code}
> Method doQueryCount looks like:
> {code}
> dataFrame.count()
> {code}
> I have a few scenarios with different results:
> 1) No data exists in my NoSQL database. Results: count 0 and collect() 0.
> 2) 3 rows exist. Results: count 0 and collect 3.
> 3) 5 rows exist. Results: count 2 and collect 5.
> I tried changing the count code to the code below, but got the same results as mentioned above.
> {code}
> dataFrame.sql("select count(*) from tbl").count/collect[0]
> {code}
> Thanks,
> Sasi

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
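A minimal sketch of the mechanism Yu Gan describes, written as a plain-Java toy model with no Spark dependency. It assumes that one counting path never parses record contents while the other parses each record and drops the ones that fail; whether Spark's count() behaves this way for the query above is an assumption, not something the comment confirms. All names here (PermissiveParseSketch, parsePermissive, countRaw, parseAll) are hypothetical and are not Spark classes or methods.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Toy model only: none of these names are Spark classes or Spark behavior.
final class PermissiveParseSketch {

    // Hypothetical permissive parser: a record that is not a number yields
    // Optional.empty() instead of throwing (the "None record" in the comment).
    static Optional<Integer> parsePermissive(String raw) {
        try {
            return Optional.of(Integer.parseInt(raw.trim()));
        } catch (NumberFormatException e) {
            return Optional.empty();
        }
    }

    // Counts raw input records without looking at their contents.
    static long countRaw(List<String> raw) {
        return raw.size();
    }

    // Parses every record and keeps only those that produced a row.
    static List<Integer> parseAll(List<String> raw) {
        List<Integer> rows = new ArrayList<>();
        for (String r : raw) {
            parsePermissive(r).ifPresent(rows::add);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Assumed sample input: three good records and one corrupted one.
        List<String> raw = Arrays.asList("1", "2", "oops", "4");
        System.out.println("raw records: " + countRaw(raw));
        System.out.println("parsed rows: " + parseAll(raw).size());
    }
}
```

In this model the two paths differ by exactly the number of corrupted records, which is the shape of the 1179 vs 1178 gap reported above.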
[jira] [Commented] (SPARK-12741) DataFrame count method return wrong size.
[ https://issues.apache.org/jira/browse/SPARK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15242596#comment-15242596 ]

Sean Owen commented on SPARK-12741:
-----------------------------------

master = the latest code on the master branch.
[jira] [Commented] (SPARK-12741) DataFrame count method return wrong size.
[ https://issues.apache.org/jira/browse/SPARK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15242132#comment-15242132 ]

Stephane Maarek commented on SPARK-12741:
-----------------------------------------

Hi Sean,

What do you mean by the behavior on master? Do you want me to run the query on something other than spark-shell or spark-shell --master yarn --deploy-mode client?

Sorry, I'm just starting with these kinds of bug reports and I don't have the expertise to dive into the Spark code. Thanks for working with me through that.

Regards,
Stephane
[jira] [Commented] (SPARK-12741) DataFrame count method return wrong size.
[ https://issues.apache.org/jira/browse/SPARK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15240811#comment-15240811 ]

Sean Owen commented on SPARK-12741:
-----------------------------------

Yes, and I couldn't reproduce your test case. The question is a) what is the behavior on master, and b) what happens in Stephane's case?
[jira] [Commented] (SPARK-12741) DataFrame count method return wrong size.
[ https://issues.apache.org/jira/browse/SPARK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15240808#comment-15240808 ]

Sasi commented on SPARK-12741:
------------------------------

I tested it on Spark 1.5.2, the latest version, and got the same results.
[jira] [Commented] (SPARK-12741) DataFrame count method return wrong size.
[ https://issues.apache.org/jira/browse/SPARK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15240804#comment-15240804 ]

Sean Owen commented on SPARK-12741:
-----------------------------------

I mean it's a little better to write your test case here than link to an external post. Also, try running the latest build of Spark, since sometimes something has already been fixed, and that much would be good to know.
[jira] [Commented] (SPARK-12741) DataFrame count method return wrong size.
[ https://issues.apache.org/jira/browse/SPARK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15238769#comment-15238769 ]

Sean Owen commented on SPARK-12741:
-----------------------------------

Please try vs master, and please inline your test case here.
[jira] [Commented] (SPARK-12741) DataFrame count method return wrong size.
[ https://issues.apache.org/jira/browse/SPARK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15238723#comment-15238723 ]

Stephane Maarek commented on SPARK-12741:
-----------------------------------------

Can we please re-open the issue?
[jira] [Commented] (SPARK-12741) DataFrame count method return wrong size.
[ https://issues.apache.org/jira/browse/SPARK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15227885#comment-15227885 ]

Sasi commented on SPARK-12741:
------------------------------

Thanks Stephane, it's the same issue.
[jira] [Commented] (SPARK-12741) DataFrame count method return wrong size.
[ https://issues.apache.org/jira/browse/SPARK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15227688#comment-15227688 ]

Stephane Maarek commented on SPARK-12741:
-----------------------------------------

Hi,

This may be related to:
http://stackoverflow.com/questions/36438898/spark-dataframe-count-doesnt-return-the-same-results-when-run-twice

I don't have code to generate the input file; it's just a simple Hive table, though.

Cheers,
Stephane
[jira] [Commented] (SPARK-12741) DataFrame count method return wrong size.
[ https://issues.apache.org/jira/browse/SPARK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110523#comment-15110523 ]

Sasi commented on SPARK-12741:
------------------------------

I changed the way I use the DataFrame since my last ticket. Now I have a DataFrame without any cache or persist operation, so each time I add or remove a row I see it in the DataFrame.
The only problem I'm having right now is the count operation. I'll try to create a new DataFrame for each count call; maybe that will resolve the problem.
By the way, I do see the count result change, e.g. from 3 to 5, but the count does not equal the real DataFrame size.

Thanks again!
Sasi
[jira] [Commented] (SPARK-12741) DataFrame count method return wrong size.
[ https://issues.apache.org/jira/browse/SPARK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110553#comment-15110553 ]

Sasi commented on SPARK-12741:
------------------------------

Creating a new DataFrame didn't resolve the issue. I still think it's a bug.

Thanks,
Sasi
[jira] [Commented] (SPARK-12741) DataFrame count method return wrong size.
[ https://issues.apache.org/jira/browse/SPARK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110584#comment-15110584 ]

Sasi commented on SPARK-12741:
------------------------------

I checked my DB, which is Aerospike, and got the same results as my collect.
I'm creating my DataFrame with Aerospark, a connector written by Sasha:
https://github.com/sasha-polev/aerospark/
I'm using the DataFrame actions as described in the sql-programming-guide:
https://spark.apache.org/docs/1.3.0/sql-programming-guide.html
I know there are two ways to run actions on a DataFrame:
1) SQL way.
{code}
dataFrame.sqlContext().sql("select count(*) from tbl").collect()[0]
{code}
2) DataFrame way.
{code}
dataFrame.where("...").count()
{code}
I'm using the DataFrame way, which is simpler to understand and to read as Java code.
[jira] [Commented] (SPARK-12741) DataFrame count method return wrong size.
[ https://issues.apache.org/jira/browse/SPARK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110563#comment-15110563 ]

Sasi commented on SPARK-12741:
------------------------------

If I run the following code:
{code}
dataFrame.where("...").count()
{code}
the result is the same as the collect.
[jira] [Commented] (SPARK-12741) DataFrame count method return wrong size.
[ https://issues.apache.org/jira/browse/SPARK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110572#comment-15110572 ]

Sean Owen commented on SPARK-12741:
-----------------------------------

I can't reproduce this. I always get the same count and collect result on an example data set. Above, I think you mean sqlContext, not dataFrame, right? This makes me wonder what you're really executing. Also, couldn't it be an issue with your DB?
[jira] [Commented] (SPARK-12741) DataFrame count method return wrong size.
[ https://issues.apache.org/jira/browse/SPARK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110503#comment-15110503 ]

Sasi commented on SPARK-12741:
------------------------------

I updated the report. Can you verify it again?

Thanks!
Sasi

> DataFrame count method return wrong size.
> -----------------------------------------
>
>                 Key: SPARK-12741
>                 URL: https://issues.apache.org/jira/browse/SPARK-12741
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.0
>            Reporter: Sasi
>
> Hi,
> I'm updating my report.
> I'm working with Spark 1.5.2 (used to be 1.5.0). I have a DataFrame and two methods, one to collect data and the other to count.
> Method doQuery looks like:
> {code}
> subscribersDataFrame.collect()
> {code}
> Method doQueryCount looks like:
> {code}
> subscribersDataFrame.count()
> {code}
> I have a few scenarios with different results:
> 1) No data exists in my NoSQL database. Results: count 0 and collect() 0.
> 2) 3 rows exist. Results: count 0 and collect 3.
> 3) 5 rows exist. Results: count 2 and collect 5.
> I tried changing the count code to the code below, but got the same results as mentioned above.
> {code}
> subscribersDataFrame.sql("select count(*) from tbl").count/collect[0]
> {code}
> Thanks,
> Sasi
[jira] [Commented] (SPARK-12741) DataFrame count method return wrong size.
[ https://issues.apache.org/jira/browse/SPARK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110613#comment-15110613 ]

Sasi commented on SPARK-12741:
------------------------------

Additional update:
If I use the following code, I get the same length as the collect.
{code}
subscribersDataFrame.rdd().count();
{code}
[jira] [Commented] (SPARK-12741) DataFrame count method return wrong size.
[ https://issues.apache.org/jira/browse/SPARK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110511#comment-15110511 ]

Sean Owen commented on SPARK-12741:
-----------------------------------

I recall from other JIRAs that you're not updating the DataFrame / RDD but updating the table? That's not going to cause the result to change. Or it may; it's undefined.
[jira] [Commented] (SPARK-12741) DataFrame count method return wrong size.
[ https://issues.apache.org/jira/browse/SPARK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110701#comment-15110701 ]

Sasi commented on SPARK-12741:
------------------------------

That's not what I meant.
I just gave an example of each case, the SQL way and the DataFrame way. I know that count() on select count(*) returns 1 row; that's why I wrote collect()[0], which gives back the value.
As I said before:
dataFrame.count() and dataFrame.where("...").count() return the wrong size compared with dataFrame.where("...").collect().length or dataFrame.collect().length.
I'm pretty sure that count() doesn't work as expected.

Sasi
[jira] [Commented] (SPARK-12741) DataFrame count method return wrong size.
[ https://issues.apache.org/jira/browse/SPARK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110762#comment-15110762 ]

Sean Owen commented on SPARK-12741:
-----------------------------------

OK, that's what you wrote at the outset though. Then I can't reproduce it. I always get the correct count both ways. {{where("...")}} isn't what you're really executing; what are you writing? Are you sure that's not the problem? I ask because you have no predicate in the query you're comparing to. It's important to be clear about what you're comparing.
[jira] [Commented] (SPARK-12741) DataFrame count method return wrong size.
[ https://issues.apache.org/jira/browse/SPARK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110676#comment-15110676 ]

Sean Owen commented on SPARK-12741:
-----------------------------------

Wait, is this what you mean? "select count(*) ..." returns 1 row, which contains a number, which is the number of rows matching the predicate. count() returns 1 because there is 1 row in the result set. collect() collects (an Array of) that number of rows. Those are different; they're different things. That seems to be your problem.
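The distinction between counting the rows of an aggregate result and reading the value inside it can be sketched in plain Java without a Spark dependency. This is only a model under stated assumptions: a List stands in for a DataFrame, and the names CountVsCollectSketch, selectCountStar and count are hypothetical stand-ins, not Spark APIs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java model of the thread's point; a List stands in for a DataFrame.
final class CountVsCollectSketch {

    // Stand-in for running "select count(*) from tbl": the result always has
    // exactly one row, whose single column holds the number of table rows.
    static List<long[]> selectCountStar(List<String> table) {
        List<long[]> result = new ArrayList<>();
        result.add(new long[] { table.size() });
        return result;
    }

    // Stand-in for count(): the number of rows in whatever result set it is
    // called on, regardless of what those rows contain.
    static long count(List<?> rows) {
        return rows.size();
    }

    public static void main(String[] args) {
        // Assumed sample table with five rows.
        List<String> table = Arrays.asList("a", "b", "c", "d", "e");
        List<long[]> result = selectCountStar(table);

        System.out.println("count() on the aggregate result: " + count(result));
        System.out.println("value in collect()[0]: " + result.get(0)[0]);
        System.out.println("count() on the table itself: " + count(table));
    }
}
```

Calling count() on the aggregate result measures the result set, which is always one row, while the table size is the value stored in that row, which is what collect()[0] exposes.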