[jira] [Updated] (SPARK-11008) Spark window function returns inconsistent/wrong results

2015-10-08 Thread Prasad Chalasani (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prasad Chalasani updated SPARK-11008:
-
Summary: Spark window function returns inconsistent/wrong results  (was: 
Spark window function return inconsistent/wrong results)

> Spark window function returns inconsistent/wrong results
> 
>
> Key: SPARK-11008
> URL: https://issues.apache.org/jira/browse/SPARK-11008
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.4.0, 1.5.0
> Environment: Amazon Linux AMI (Amazon Linux version 2015.09)
>Reporter: Prasad Chalasani
>Priority: Blocker
>
> Summary: applying a windowing function to a data frame, followed by count(), 
> gives widely varying results across repeated runs: none exceeds the correct 
> value, but all but one are wrong. On large data sets I sometimes get a result 
> as small as HALF of the correct value.
> Here is a minimal reproducible example:
> (1) start spark-shell
> (2) run these:
> val data = 1.to(100).map(x => (x,1))
> import sqlContext.implicits._
> val tbl = sc.parallelize(data).toDF("id", "time")
> tbl.write.parquet("s3n://path/to/mybucket/id-time-tiny.pqt")
> (3) exit the shell (this is important)
> (4) start spark-shell again
> (5) run these:
> import org.apache.spark.sql.expressions.Window
> val df = sqlContext.read.parquet("s3n://path/to/mybucket/id-time-tiny.pqt")
> val win = Window.partitionBy("id").orderBy("time")
> df.select($"id", 
> (rank().over(win)).alias("rnk")).filter("rnk=1").select("id").count()
> I get 98, but the correct result is 100. 
> If I re-run the above, then the result gets "fixed" and I always get 100.
> Note this is only a minimal example to reproduce the error. In my real 
> application the data set is much larger and the window function is not as 
> trivial as above (i.e. there are multiple "time" values per "id", etc.), and I 
> see results sometimes as small as HALF of the correct value (e.g. 120,000 when 
> the correct value is 250,000). So this is a serious problem.
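Since rank() = 1 holds for exactly one row per "id" partition (absent ties on 
"time"), the windowed count should equal the number of distinct ids. A minimal 
cross-check sketch, reusing the reporter's placeholder path, that compares the 
windowed count against a plain distinct count over the same freshly read input:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rank
import sqlContext.implicits._

val df  = sqlContext.read.parquet("s3n://path/to/mybucket/id-time-tiny.pqt")
val win = Window.partitionBy("id").orderBy("time")

// rank() = 1 selects one row per "id" partition, so this should equal the
// number of distinct ids in the input (100 in this example) ...
val viaWindow = df.select($"id", rank().over(win).alias("rnk"))
  .filter("rnk = 1").select("id").count()

// ... and should agree with a plain distinct count over the same rows.
val viaDistinct = df.select("id").distinct().count()
println(s"window: $viaWindow, distinct: $viaDistinct")

If the two counts disagree, or both fall short of 100, the problem more likely 
lies in the scan of the input than in the window operator itself.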





[jira] [Updated] (SPARK-11008) Spark window function returns inconsistent/wrong results

2015-10-08 Thread Prasad Chalasani (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prasad Chalasani updated SPARK-11008:
-
Description: 
Summary: applying a windowing function to a data frame, followed by count(), 
gives widely varying results across repeated runs: none exceeds the correct 
value, but all but one are wrong. On large data sets I sometimes get a result 
as small as HALF of the correct value.

Here is a minimal reproducible example:

(1) start spark-shell
(2) run these:
val data = 1.to(100).map(x => (x,1))
import sqlContext.implicits._
val tbl = sc.parallelize(data).toDF("id", "time")
tbl.write.parquet("s3n://path/to/mybucket/id-time-tiny.pqt")

(3) exit the shell (this is important)
(4) start spark-shell again
(5) run these:
import org.apache.spark.sql.expressions.Window
val df = sqlContext.read.parquet("s3n://path/to/mybucket/id-time-tiny.pqt")
val win = Window.partitionBy("id").orderBy("time")

df.select($"id", 
(rank().over(win)).alias("rnk")).filter("rnk=1").select("id").count()

I get 98, but the correct result is 100. 
If I re-run the code in step 5 in the same shell, then the result gets "fixed" 
and I always get 100.

Note this is only a minimal example to reproduce the error. In my real 
application the data set is much larger and the window function is not as 
trivial as above (i.e. there are multiple "time" values per "id", etc.), and I 
see results sometimes as small as HALF of the correct value (e.g. 120,000 when 
the correct value is 250,000). So this is a serious problem.



  was:
Summary: applying a windowing function to a data frame, followed by count(), 
gives widely varying results across repeated runs: none exceeds the correct 
value, but all but one are wrong. On large data sets I sometimes get a result 
as small as HALF of the correct value.

Here is a minimal reproducible example:

(1) start spark-shell
(2) run these:
val data = 1.to(100).map(x => (x,1))
import sqlContext.implicits._
val tbl = sc.parallelize(data).toDF("id", "time")
tbl.write.parquet("s3n://path/to/mybucket/id-time-tiny.pqt")

(3) exit the shell (this is important)
(4) start spark-shell again
(5) run these:
import org.apache.spark.sql.expressions.Window
val df = sqlContext.read.parquet("s3n://path/to/mybucket/id-time-tiny.pqt")
val win = Window.partitionBy("id").orderBy("time")

df.select($"id", 
(rank().over(win)).alias("rnk")).filter("rnk=1").select("id").count()

I get 98, but the correct result is 100. 
If I re-run the above, then the result gets "fixed" and I always get 100.

Note this is only a minimal example to reproduce the error. In my real 
application the data set is much larger and the window function is not as 
trivial as above (i.e. there are multiple "time" values per "id", etc.), and I 
see results sometimes as small as HALF of the correct value (e.g. 120,000 when 
the correct value is 250,000). So this is a serious problem.




> Spark window function returns inconsistent/wrong results
> 
>
> Key: SPARK-11008
> URL: https://issues.apache.org/jira/browse/SPARK-11008
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.4.0, 1.5.0
> Environment: Amazon Linux AMI (Amazon Linux version 2015.09)
>Reporter: Prasad Chalasani
>Priority: Blocker
>
> Summary: applying a windowing function to a data frame, followed by count(), 
> gives widely varying results across repeated runs: none exceeds the correct 
> value, but all but one are wrong. On large data sets I sometimes get a result 
> as small as HALF of the correct value.
> Here is a minimal reproducible example:
> (1) start spark-shell
> (2) run these:
> val data = 1.to(100).map(x => (x,1))
> import sqlContext.implicits._
> val tbl = sc.parallelize(data).toDF("id", "time")
> tbl.write.parquet("s3n://path/to/mybucket/id-time-tiny.pqt")
> (3) exit the shell (this is important)
> (4) start spark-shell again
> (5) run these:
> import org.apache.spark.sql.expressions.Window
> val df = sqlContext.read.parquet("s3n://path/to/mybucket/id-time-tiny.pqt")
> val win = Window.partitionBy("id").orderBy("time")
> df.select($"id", 
> (rank().over(win)).alias("rnk")).filter("rnk=1").select("id").count()
> I get 98, but the correct result is 100. 
> If I re-run the code in step 5 in the same shell, then the result gets 
> "fixed" and I always get 100.
> Note this is only a minimal example to reproduce the error. In my real 
> application the data set is much larger and the window function is not as 
> trivial as above (i.e. there are multiple "time" values per "id", etc.), and I 
> see results sometimes as small as HALF of the correct value (e.g. 120,000 when 
> the correct value is 250,000). So this is a serious problem.
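Since a re-run in the same shell returns the correct count, one hedged 
mitigation (an assumption, not a confirmed fix) is to force a full 
materialization of the input before running the window query, so the query runs 
against pinned rows rather than a fresh S3 listing:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rank
import sqlContext.implicits._

// Read and pin the input first; count() forces a full scan, and also reveals
// immediately whether the scan itself came up short.
val df = sqlContext.read.parquet("s3n://path/to/mybucket/id-time-tiny.pqt").cache()
println(s"rows read: ${df.count()}")  // 100 expected for the tiny example

val win = Window.partitionBy("id").orderBy("time")
df.select($"id", rank().over(win).alias("rnk"))
  .filter("rnk = 1").select("id").count()

Whether this avoids the short count depends on why the first scan is short; it 
only narrows the window in which an inconsistent listing of the input can bite.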

[jira] [Updated] (SPARK-11008) Spark window function returns inconsistent/wrong results

2015-10-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11008:
--
Priority: Minor  (was: Blocker)

[~pchalasani] don't set "Blocker". I agree that most bets are off when using 
S3, so I doubt this is a Spark problem.
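One way to test that hypothesis is to repeat the repro against a non-S3 path; a 
sketch, assuming a writable local directory (the file:///tmp path below is 
hypothetical, not from the report):

// Same repro, but writing and reading a local path instead of s3n://, to
// separate S3 (eventual-consistency) effects from Spark itself.
import sqlContext.implicits._
val data = (1 to 100).map(x => (x, 1))
sc.parallelize(data).toDF("id", "time")
  .write.parquet("file:///tmp/id-time-tiny.pqt")  // hypothetical local path

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rank
val df  = sqlContext.read.parquet("file:///tmp/id-time-tiny.pqt")
val win = Window.partitionBy("id").orderBy("time")
df.select($"id", rank().over(win).alias("rnk"))
  .filter("rnk = 1").select("id").count()  // 100 expected if S3 is the culprit

If the count is stable at 100 across fresh shells here but not against s3n://, 
that points at the storage layer rather than the window operator.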

> Spark window function returns inconsistent/wrong results
> 
>
> Key: SPARK-11008
> URL: https://issues.apache.org/jira/browse/SPARK-11008
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.4.0, 1.5.0
> Environment: Amazon Linux AMI (Amazon Linux version 2015.09)
>Reporter: Prasad Chalasani
>Priority: Minor
>
> Summary: applying a windowing function to a data frame, followed by count(), 
> gives widely varying results across repeated runs: none exceeds the correct 
> value, but all but one are wrong. On large data sets I sometimes get a result 
> as small as HALF of the correct value.
> Here is a minimal reproducible example:
> (1) start spark-shell
> (2) run these:
> val data = 1.to(100).map(x => (x,1))
> import sqlContext.implicits._
> val tbl = sc.parallelize(data).toDF("id", "time")
> tbl.write.parquet("s3n://path/to/mybucket/id-time-tiny.pqt")
> (3) exit the shell (this is important)
> (4) start spark-shell again
> (5) run these:
> import org.apache.spark.sql.expressions.Window
> val df = sqlContext.read.parquet("s3n://path/to/mybucket/id-time-tiny.pqt")
> val win = Window.partitionBy("id").orderBy("time")
> df.select($"id", 
> (rank().over(win)).alias("rnk")).filter("rnk=1").select("id").count()
> I get 98, but the correct result is 100. 
> If I re-run the code in step 5 in the same shell, then the result gets 
> "fixed" and I always get 100.
> Note this is only a minimal example to reproduce the error. In my real 
> application the data set is much larger and the window function is not as 
> trivial as above (i.e. there are multiple "time" values per "id", etc.), and I 
> see results sometimes as small as HALF of the correct value (e.g. 120,000 when 
> the correct value is 250,000). So this is a serious problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org