[jira] [Commented] (SPARK-12741) DataFrame count method return wrong size.

2020-08-06 Thread Yu Gan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17172226#comment-17172226
 ] 

Yu Gan commented on SPARK-12741:


Aha, I came across the similar issue. My sql is 

select
p_brand,
p_size,
count(ps_suppkey) as supplier_cnt 
from
tpch.partsupp 
inner join
tpch.part 
on p_partkey = ps_partkey 
group by
P_BRAND,
p_size


the total row count are different:

dataSet.count()=1179, dataSet.rdd().count()=1178

 

Finally i found the root cause:

In org.apache.spark.sql.execution.datasources.FailureSafeParser#parse throws 
BadRecordException, when  in PermissiveMode and corrupted record exists the 
result row would be None record. In this case, the none record will be 
filtered. 

 

> DataFrame count method return wrong size.
> -
>
> Key: SPARK-12741
> URL: https://issues.apache.org/jira/browse/SPARK-12741
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Sasi
>Priority: Major
>
> Hi,
> I'm updating my report.
> I'm working with Spark 1.5.2, (used to be 1.5.0), I have a DataFrame and I 
> have 2 method, one for collect data and other for count.
> method doQuery looks like:
> {code}
> dataFrame.collect()
> {code}
> method doQueryCount looks like:
> {code}
> dataFrame.count()
> {code}
> I have few scenarios with few results:
> 1) Non data exists on my NoSQLDatabase results: count 0 and collect() 0
> 2) 3 rows exists results: count 0 and collect 3.
> 3) 5 rows exists results: count 2 and collect 5. 
> I tried to change the count code to the below code, but got the same results 
> as I mentioned above.
> {code}
> dataFrame.sql("select count(*) from tbl").count/collect[0]
> {code}
> Thanks,
> Sasi



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12741) DataFrame count method return wrong size.

2016-04-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15242596#comment-15242596
 ] 

Sean Owen commented on SPARK-12741:
---

master = the latest code on the master branch.

> DataFrame count method return wrong size.
> -
>
> Key: SPARK-12741
> URL: https://issues.apache.org/jira/browse/SPARK-12741
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Sasi
>
> Hi,
> I'm updating my report.
> I'm working with Spark 1.5.2, (used to be 1.5.0), I have a DataFrame and I 
> have 2 method, one for collect data and other for count.
> method doQuery looks like:
> {code}
> dataFrame.collect()
> {code}
> method doQueryCount looks like:
> {code}
> dataFrame.count()
> {code}
> I have few scenarios with few results:
> 1) Non data exists on my NoSQLDatabase results: count 0 and collect() 0
> 2) 3 rows exists results: count 0 and collect 3.
> 3) 5 rows exists results: count 2 and collect 5. 
> I tried to change the count code to the below code, but got the same results 
> as I mentioned above.
> {code}
> dataFrame.sql("select count(*) from tbl").count/collect[0]
> {code}
> Thanks,
> Sasi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12741) DataFrame count method return wrong size.

2016-04-14 Thread Stephane Maarek (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15242132#comment-15242132
 ] 

Stephane Maarek commented on SPARK-12741:
-

Hi Sean,
What do you mean by the behavior on master? Do you want me to run the query on 
something different than spark-shell or spark-shell --master yarn --deploy-mode 
client ? Sorry I'm just starting with these kind of bugs reports and I don't 
have the expertise to dive down in the Spark code. 
Thanks for working with me through that
Regards,
Stephane

> DataFrame count method return wrong size.
> -
>
> Key: SPARK-12741
> URL: https://issues.apache.org/jira/browse/SPARK-12741
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Sasi
>
> Hi,
> I'm updating my report.
> I'm working with Spark 1.5.2, (used to be 1.5.0), I have a DataFrame and I 
> have 2 method, one for collect data and other for count.
> method doQuery looks like:
> {code}
> dataFrame.collect()
> {code}
> method doQueryCount looks like:
> {code}
> dataFrame.count()
> {code}
> I have few scenarios with few results:
> 1) Non data exists on my NoSQLDatabase results: count 0 and collect() 0
> 2) 3 rows exists results: count 0 and collect 3.
> 3) 5 rows exists results: count 2 and collect 5. 
> I tried to change the count code to the below code, but got the same results 
> as I mentioned above.
> {code}
> dataFrame.sql("select count(*) from tbl").count/collect[0]
> {code}
> Thanks,
> Sasi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12741) DataFrame count method return wrong size.

2016-04-14 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15240811#comment-15240811
 ] 

Sean Owen commented on SPARK-12741:
---

Yes, and I couldn't reproduce your test case. The question is a) what is the 
behavior on master, and b) what happens in Stephane's case?

> DataFrame count method return wrong size.
> -
>
> Key: SPARK-12741
> URL: https://issues.apache.org/jira/browse/SPARK-12741
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Sasi
>
> Hi,
> I'm updating my report.
> I'm working with Spark 1.5.2, (used to be 1.5.0), I have a DataFrame and I 
> have 2 method, one for collect data and other for count.
> method doQuery looks like:
> {code}
> dataFrame.collect()
> {code}
> method doQueryCount looks like:
> {code}
> dataFrame.count()
> {code}
> I have few scenarios with few results:
> 1) Non data exists on my NoSQLDatabase results: count 0 and collect() 0
> 2) 3 rows exists results: count 0 and collect 3.
> 3) 5 rows exists results: count 2 and collect 5. 
> I tried to change the count code to the below code, but got the same results 
> as I mentioned above.
> {code}
> dataFrame.sql("select count(*) from tbl").count/collect[0]
> {code}
> Thanks,
> Sasi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12741) DataFrame count method return wrong size.

2016-04-14 Thread Sasi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15240808#comment-15240808
 ] 

Sasi commented on SPARK-12741:
--

I test it on Spark 1.5.2 latest version and got the same results.

> DataFrame count method return wrong size.
> -
>
> Key: SPARK-12741
> URL: https://issues.apache.org/jira/browse/SPARK-12741
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Sasi
>
> Hi,
> I'm updating my report.
> I'm working with Spark 1.5.2, (used to be 1.5.0), I have a DataFrame and I 
> have 2 method, one for collect data and other for count.
> method doQuery looks like:
> {code}
> dataFrame.collect()
> {code}
> method doQueryCount looks like:
> {code}
> dataFrame.count()
> {code}
> I have few scenarios with few results:
> 1) Non data exists on my NoSQLDatabase results: count 0 and collect() 0
> 2) 3 rows exists results: count 0 and collect 3.
> 3) 5 rows exists results: count 2 and collect 5. 
> I tried to change the count code to the below code, but got the same results 
> as I mentioned above.
> {code}
> dataFrame.sql("select count(*) from tbl").count/collect[0]
> {code}
> Thanks,
> Sasi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12741) DataFrame count method return wrong size.

2016-04-14 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15240804#comment-15240804
 ] 

Sean Owen commented on SPARK-12741:
---

I mean it's a little better to write your test case here than link to an 
external post.
Also, try running the latest build of Spark, since sometimes something has 
already been fixed, and that much would be good to know.

> DataFrame count method return wrong size.
> -
>
> Key: SPARK-12741
> URL: https://issues.apache.org/jira/browse/SPARK-12741
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Sasi
>
> Hi,
> I'm updating my report.
> I'm working with Spark 1.5.2, (used to be 1.5.0), I have a DataFrame and I 
> have 2 method, one for collect data and other for count.
> method doQuery looks like:
> {code}
> dataFrame.collect()
> {code}
> method doQueryCount looks like:
> {code}
> dataFrame.count()
> {code}
> I have few scenarios with few results:
> 1) Non data exists on my NoSQLDatabase results: count 0 and collect() 0
> 2) 3 rows exists results: count 0 and collect 3.
> 3) 5 rows exists results: count 2 and collect 5. 
> I tried to change the count code to the below code, but got the same results 
> as I mentioned above.
> {code}
> dataFrame.sql("select count(*) from tbl").count/collect[0]
> {code}
> Thanks,
> Sasi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12741) DataFrame count method return wrong size.

2016-04-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15238769#comment-15238769
 ] 

Sean Owen commented on SPARK-12741:
---

Please try vs master, and please inline your test case here. 

> DataFrame count method return wrong size.
> -
>
> Key: SPARK-12741
> URL: https://issues.apache.org/jira/browse/SPARK-12741
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Sasi
>
> Hi,
> I'm updating my report.
> I'm working with Spark 1.5.2, (used to be 1.5.0), I have a DataFrame and I 
> have 2 method, one for collect data and other for count.
> method doQuery looks like:
> {code}
> dataFrame.collect()
> {code}
> method doQueryCount looks like:
> {code}
> dataFrame.count()
> {code}
> I have few scenarios with few results:
> 1) Non data exists on my NoSQLDatabase results: count 0 and collect() 0
> 2) 3 rows exists results: count 0 and collect 3.
> 3) 5 rows exists results: count 2 and collect 5. 
> I tried to change the count code to the below code, but got the same results 
> as I mentioned above.
> {code}
> dataFrame.sql("select count(*) from tbl").count/collect[0]
> {code}
> Thanks,
> Sasi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12741) DataFrame count method return wrong size.

2016-04-13 Thread Stephane Maarek (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15238723#comment-15238723
 ] 

Stephane Maarek commented on SPARK-12741:
-

can we please re-open the issue?

> DataFrame count method return wrong size.
> -
>
> Key: SPARK-12741
> URL: https://issues.apache.org/jira/browse/SPARK-12741
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Sasi
>
> Hi,
> I'm updating my report.
> I'm working with Spark 1.5.2, (used to be 1.5.0), I have a DataFrame and I 
> have 2 method, one for collect data and other for count.
> method doQuery looks like:
> {code}
> dataFrame.collect()
> {code}
> method doQueryCount looks like:
> {code}
> dataFrame.count()
> {code}
> I have few scenarios with few results:
> 1) Non data exists on my NoSQLDatabase results: count 0 and collect() 0
> 2) 3 rows exists results: count 0 and collect 3.
> 3) 5 rows exists results: count 2 and collect 5. 
> I tried to change the count code to the below code, but got the same results 
> as I mentioned above.
> {code}
> dataFrame.sql("select count(*) from tbl").count/collect[0]
> {code}
> Thanks,
> Sasi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12741) DataFrame count method return wrong size.

2016-04-06 Thread Sasi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15227885#comment-15227885
 ] 

Sasi commented on SPARK-12741:
--

Thanks Stephane, its the same issue.

> DataFrame count method return wrong size.
> -
>
> Key: SPARK-12741
> URL: https://issues.apache.org/jira/browse/SPARK-12741
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Sasi
>
> Hi,
> I'm updating my report.
> I'm working with Spark 1.5.2, (used to be 1.5.0), I have a DataFrame and I 
> have 2 method, one for collect data and other for count.
> method doQuery looks like:
> {code}
> dataFrame.collect()
> {code}
> method doQueryCount looks like:
> {code}
> dataFrame.count()
> {code}
> I have few scenarios with few results:
> 1) Non data exists on my NoSQLDatabase results: count 0 and collect() 0
> 2) 3 rows exists results: count 0 and collect 3.
> 3) 5 rows exists results: count 2 and collect 5. 
> I tried to change the count code to the below code, but got the same results 
> as I mentioned above.
> {code}
> dataFrame.sql("select count(*) from tbl").count/collect[0]
> {code}
> Thanks,
> Sasi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12741) DataFrame count method return wrong size.

2016-04-05 Thread Stephane Maarek (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15227688#comment-15227688
 ] 

Stephane Maarek commented on SPARK-12741:
-

Hi,

May be related to:
http://stackoverflow.com/questions/36438898/spark-dataframe-count-doesnt-return-the-same-results-when-run-twice
 

I don't have code to generate the input file, it's just a simple hive table 
though.

Cheers,
Stephane

> DataFrame count method return wrong size.
> -
>
> Key: SPARK-12741
> URL: https://issues.apache.org/jira/browse/SPARK-12741
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Sasi
>
> Hi,
> I'm updating my report.
> I'm working with Spark 1.5.2, (used to be 1.5.0), I have a DataFrame and I 
> have 2 method, one for collect data and other for count.
> method doQuery looks like:
> {code}
> dataFrame.collect()
> {code}
> method doQueryCount looks like:
> {code}
> dataFrame.count()
> {code}
> I have few scenarios with few results:
> 1) Non data exists on my NoSQLDatabase results: count 0 and collect() 0
> 2) 3 rows exists results: count 0 and collect 3.
> 3) 5 rows exists results: count 2 and collect 5. 
> I tried to change the count code to the below code, but got the same results 
> as I mentioned above.
> {code}
> dataFrame.sql("select count(*) from tbl").count/collect[0]
> {code}
> Thanks,
> Sasi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12741) DataFrame count method return wrong size.

2016-01-21 Thread Sasi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110523#comment-15110523
 ] 

Sasi commented on SPARK-12741:
--

I changed the way I used the DataFrame from my last ticket.
Now, I have dataframe without any cache or persist operation, so each time I 
add/remove row I see it on the dataframe.
The only problem I'm having right now is the count operation.
I'll try to create new dataframe for each count calls, maybe it will resolve 
the problem.

By the way, I see the count result change, e.g. from 3 to 5, but the count size 
not equals to the real dataframe size.

Thanks again!
Sasi


> DataFrame count method return wrong size.
> -
>
> Key: SPARK-12741
> URL: https://issues.apache.org/jira/browse/SPARK-12741
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Sasi
>
> Hi,
> I'm updating my report.
> I'm working with Spark 1.5.2, (used to be 1.5.0), I have a DataFrame and I 
> have 2 method, one for collect data and other for count.
> method doQuery looks like:
> {code}
> dataFrame.collect()
> {code}
> method doQueryCount looks like:
> {code}
> dataFrame.count()
> {code}
> I have few scenarios with few results:
> 1) Non data exists on my NoSQLDatabase results: count 0 and collect() 0
> 2) 3 rows exists results: count 0 and collect 3.
> 3) 5 rows exists results: count 2 and collect 5. 
> I tried to change the count code to the below code, but got the same results 
> as I mentioned above.
> {code}
> dataFrame.sql("select count(*) from tbl").count/collect[0]
> {code}
> Thanks,
> Sasi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12741) DataFrame count method return wrong size.

2016-01-21 Thread Sasi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110553#comment-15110553
 ] 

Sasi commented on SPARK-12741:
--

Create new DataFame didn't resolve the issue.
I still think its bug.

Thanks,
Sasi

> DataFrame count method return wrong size.
> -
>
> Key: SPARK-12741
> URL: https://issues.apache.org/jira/browse/SPARK-12741
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Sasi
>
> Hi,
> I'm updating my report.
> I'm working with Spark 1.5.2, (used to be 1.5.0), I have a DataFrame and I 
> have 2 method, one for collect data and other for count.
> method doQuery looks like:
> {code}
> dataFrame.collect()
> {code}
> method doQueryCount looks like:
> {code}
> dataFrame.count()
> {code}
> I have few scenarios with few results:
> 1) Non data exists on my NoSQLDatabase results: count 0 and collect() 0
> 2) 3 rows exists results: count 0 and collect 3.
> 3) 5 rows exists results: count 2 and collect 5. 
> I tried to change the count code to the below code, but got the same results 
> as I mentioned above.
> {code}
> dataFrame.sql("select count(*) from tbl").count/collect[0]
> {code}
> Thanks,
> Sasi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12741) DataFrame count method return wrong size.

2016-01-21 Thread Sasi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110584#comment-15110584
 ] 

Sasi commented on SPARK-12741:
--

I checked my DB which is Aerospike, and I got the same results of my collect.
I'm creating my DataFrame with Aerospark which is a connector written by Sasha, 
https://github.com/sasha-polev/aerospark/
I'm using the DataFrame actions as describe in sql-programming-guide, 
https://spark.apache.org/docs/1.3.0/sql-programming-guide.html

I know there're two ways to do actions on DataFrame:
1) SQL way.
{code}
dataFrame.sqlContext().sql("select count(*) from tbl").collect()[0]
{code}
2) DataFrame way.
{code}
dataFrame.where("...").count()
{code}

I'm using the DataFrame way which is more simple to understand and to read as a 
JAVA code.

> DataFrame count method return wrong size.
> -
>
> Key: SPARK-12741
> URL: https://issues.apache.org/jira/browse/SPARK-12741
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Sasi
>
> Hi,
> I'm updating my report.
> I'm working with Spark 1.5.2, (used to be 1.5.0), I have a DataFrame and I 
> have 2 method, one for collect data and other for count.
> method doQuery looks like:
> {code}
> dataFrame.collect()
> {code}
> method doQueryCount looks like:
> {code}
> dataFrame.count()
> {code}
> I have few scenarios with few results:
> 1) Non data exists on my NoSQLDatabase results: count 0 and collect() 0
> 2) 3 rows exists results: count 0 and collect 3.
> 3) 5 rows exists results: count 2 and collect 5. 
> I tried to change the count code to the below code, but got the same results 
> as I mentioned above.
> {code}
> dataFrame.sql("select count(*) from tbl").count/collect[0]
> {code}
> Thanks,
> Sasi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12741) DataFrame count method return wrong size.

2016-01-21 Thread Sasi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110563#comment-15110563
 ] 

Sasi commented on SPARK-12741:
--

If I'm running the following code:
{code}
dataFrame.where("...").count()
{code}

The result is the same as the collect.

> DataFrame count method return wrong size.
> -
>
> Key: SPARK-12741
> URL: https://issues.apache.org/jira/browse/SPARK-12741
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Sasi
>
> Hi,
> I'm updating my report.
> I'm working with Spark 1.5.2, (used to be 1.5.0), I have a DataFrame and I 
> have 2 method, one for collect data and other for count.
> method doQuery looks like:
> {code}
> dataFrame.collect()
> {code}
> method doQueryCount looks like:
> {code}
> dataFrame.count()
> {code}
> I have few scenarios with few results:
> 1) Non data exists on my NoSQLDatabase results: count 0 and collect() 0
> 2) 3 rows exists results: count 0 and collect 3.
> 3) 5 rows exists results: count 2 and collect 5. 
> I tried to change the count code to the below code, but got the same results 
> as I mentioned above.
> {code}
> dataFrame.sql("select count(*) from tbl").count/collect[0]
> {code}
> Thanks,
> Sasi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12741) DataFrame count method return wrong size.

2016-01-21 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110572#comment-15110572
 ] 

Sean Owen commented on SPARK-12741:
---

I can't reproduce this. I always get the same count and collect result on an 
example data set. Above I think you mean sqlContext not dataFrame right? this 
makes me wonder what you're really executing. Also, couldn't it be an issue 
with your DB?

> DataFrame count method return wrong size.
> -
>
> Key: SPARK-12741
> URL: https://issues.apache.org/jira/browse/SPARK-12741
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Sasi
>
> Hi,
> I'm updating my report.
> I'm working with Spark 1.5.2, (used to be 1.5.0), I have a DataFrame and I 
> have 2 method, one for collect data and other for count.
> method doQuery looks like:
> {code}
> dataFrame.collect()
> {code}
> method doQueryCount looks like:
> {code}
> dataFrame.count()
> {code}
> I have few scenarios with few results:
> 1) Non data exists on my NoSQLDatabase results: count 0 and collect() 0
> 2) 3 rows exists results: count 0 and collect 3.
> 3) 5 rows exists results: count 2 and collect 5. 
> I tried to change the count code to the below code, but got the same results 
> as I mentioned above.
> {code}
> dataFrame.sql("select count(*) from tbl").count/collect[0]
> {code}
> Thanks,
> Sasi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12741) DataFrame count method return wrong size.

2016-01-21 Thread Sasi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110503#comment-15110503
 ] 

Sasi commented on SPARK-12741:
--

I updated the report, can you verify it again.

Thanks!
Sasi

> DataFrame count method return wrong size.
> -
>
> Key: SPARK-12741
> URL: https://issues.apache.org/jira/browse/SPARK-12741
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Sasi
>
> Hi,
> I'm updating my report.
> I'm working with Spark 1.5.2, (used to be 1.5.0), I have a DataFrame and I 
> have 2 method, one for collect data and other for count.
> method doQuery looks like:
> {code}
> subscribersDataFrame.collect()
> {code}
> method doQueryCount looks like:
> {code}
> subscribersDataFrame.count()
> {code}
> I have few scenarios with few results:
> 1) Non data exists on my NoSQLDatabase results: count 0 and collect() 0
> 2) 3 rows exists results: count 0 and collect 3.
> 3) 5 rows exists results: count 2 and collect 5. 
> I tried to change the count code to the below code, but got the same results 
> as I mentioned above.
> {code}
> subscribersDataFrame.sql("select count(*) from tbl").count/collect[0]
> {code}
> Thanks,
> Sasi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12741) DataFrame count method return wrong size.

2016-01-21 Thread Sasi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110613#comment-15110613
 ] 

Sasi commented on SPARK-12741:
--

Addtional update:

If I use the following code, then I get the same length of the collect.
{code}
subscribersDataFrame.rdd().count();
{code}

> DataFrame count method return wrong size.
> -
>
> Key: SPARK-12741
> URL: https://issues.apache.org/jira/browse/SPARK-12741
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Sasi
>
> Hi,
> I'm updating my report.
> I'm working with Spark 1.5.2, (used to be 1.5.0), I have a DataFrame and I 
> have 2 method, one for collect data and other for count.
> method doQuery looks like:
> {code}
> dataFrame.collect()
> {code}
> method doQueryCount looks like:
> {code}
> dataFrame.count()
> {code}
> I have few scenarios with few results:
> 1) Non data exists on my NoSQLDatabase results: count 0 and collect() 0
> 2) 3 rows exists results: count 0 and collect 3.
> 3) 5 rows exists results: count 2 and collect 5. 
> I tried to change the count code to the below code, but got the same results 
> as I mentioned above.
> {code}
> dataFrame.sql("select count(*) from tbl").count/collect[0]
> {code}
> Thanks,
> Sasi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12741) DataFrame count method return wrong size.

2016-01-21 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110511#comment-15110511
 ] 

Sean Owen commented on SPARK-12741:
---

I recall from other JIRAs that you're not updating the DataFrame / RDD but 
updating the table? that's not going to cause the result to change -- or it 
may, it's undefined.

> DataFrame count method return wrong size.
> -
>
> Key: SPARK-12741
> URL: https://issues.apache.org/jira/browse/SPARK-12741
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Sasi
>
> Hi,
> I'm updating my report.
> I'm working with Spark 1.5.2, (used to be 1.5.0), I have a DataFrame and I 
> have 2 method, one for collect data and other for count.
> method doQuery looks like:
> {code}
> dataFrame.collect()
> {code}
> method doQueryCount looks like:
> {code}
> dataFrame.count()
> {code}
> I have few scenarios with few results:
> 1) Non data exists on my NoSQLDatabase results: count 0 and collect() 0
> 2) 3 rows exists results: count 0 and collect 3.
> 3) 5 rows exists results: count 2 and collect 5. 
> I tried to change the count code to the below code, but got the same results 
> as I mentioned above.
> {code}
> dataFrame.sql("select count(*) from tbl").count/collect[0]
> {code}
> Thanks,
> Sasi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12741) DataFrame count method return wrong size.

2016-01-21 Thread Sasi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110701#comment-15110701
 ] 

Sasi commented on SPARK-12741:
--

That's not what I meant.
I just set an example for each case, SQL way and DataFrame way.
I know that count() on select count(*) return 1 row, that's why I wrote 
collect()[0] which give back the value.

As I said before: dataFrame.count() and dataFrame.where("...").count() results 
wrong size when running dataFrame.where("...").collect().length or 
dataFrame.collect().length.

I'm pretty sure that count() doesn't work as expected.

Sasi


> DataFrame count method return wrong size.
> -
>
> Key: SPARK-12741
> URL: https://issues.apache.org/jira/browse/SPARK-12741
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Sasi
>
> Hi,
> I'm updating my report.
> I'm working with Spark 1.5.2, (used to be 1.5.0), I have a DataFrame and I 
> have 2 method, one for collect data and other for count.
> method doQuery looks like:
> {code}
> dataFrame.collect()
> {code}
> method doQueryCount looks like:
> {code}
> dataFrame.count()
> {code}
> I have few scenarios with few results:
> 1) Non data exists on my NoSQLDatabase results: count 0 and collect() 0
> 2) 3 rows exists results: count 0 and collect 3.
> 3) 5 rows exists results: count 2 and collect 5. 
> I tried to change the count code to the below code, but got the same results 
> as I mentioned above.
> {code}
> dataFrame.sql("select count(*) from tbl").count/collect[0]
> {code}
> Thanks,
> Sasi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12741) DataFrame count method return wrong size.

2016-01-21 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110762#comment-15110762
 ] 

Sean Owen commented on SPARK-12741:
---

OK, that's what you wrote at the outset though. Then I can't reproduce it.  I 
always get the correct count both ways. {{where("...")}} isn't what you're 
really executing; what are you writing? are you sure that's not the problem? 
because you have no predicate in the query you're comparing to. It's important 
to be clear what you're comparing.

> DataFrame count method return wrong size.
> -
>
> Key: SPARK-12741
> URL: https://issues.apache.org/jira/browse/SPARK-12741
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Sasi
>
> Hi,
> I'm updating my report.
> I'm working with Spark 1.5.2, (used to be 1.5.0), I have a DataFrame and I 
> have 2 method, one for collect data and other for count.
> method doQuery looks like:
> {code}
> dataFrame.collect()
> {code}
> method doQueryCount looks like:
> {code}
> dataFrame.count()
> {code}
> I have few scenarios with few results:
> 1) Non data exists on my NoSQLDatabase results: count 0 and collect() 0
> 2) 3 rows exists results: count 0 and collect 3.
> 3) 5 rows exists results: count 2 and collect 5. 
> I tried to change the count code to the below code, but got the same results 
> as I mentioned above.
> {code}
> dataFrame.sql("select count(*) from tbl").count/collect[0]
> {code}
> Thanks,
> Sasi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12741) DataFrame count method return wrong size.

2016-01-21 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110676#comment-15110676
 ] 

Sean Owen commented on SPARK-12741:
---

Wait, is this what you mean? "select count(*) ..." returns 1 row, which 
contains a number, which is the number of rows matching the predicate. count() 
returns 1 because there is 1 row in the result set. collect() collects (an 
Array of) that number of rows. Those are different; they're different things. 
That seems to be your problem.

> DataFrame count method return wrong size.
> -
>
> Key: SPARK-12741
> URL: https://issues.apache.org/jira/browse/SPARK-12741
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Sasi
>
> Hi,
> I'm updating my report.
> I'm working with Spark 1.5.2, (used to be 1.5.0), I have a DataFrame and I 
> have 2 method, one for collect data and other for count.
> method doQuery looks like:
> {code}
> dataFrame.collect()
> {code}
> method doQueryCount looks like:
> {code}
> dataFrame.count()
> {code}
> I have few scenarios with few results:
> 1) Non data exists on my NoSQLDatabase results: count 0 and collect() 0
> 2) 3 rows exists results: count 0 and collect 3.
> 3) 5 rows exists results: count 2 and collect 5. 
> I tried to change the count code to the below code, but got the same results 
> as I mentioned above.
> {code}
> dataFrame.sql("select count(*) from tbl").count/collect[0]
> {code}
> Thanks,
> Sasi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org