[jira] [Commented] (SPARK-10914) Incorrect empty join sets when executor-memory >= 32g

2015-10-08 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14948511#comment-14948511
 ] 

Sean Owen commented on SPARK-10914:
---

I don't think having it on one machine necessarily matters. You still have two 
JVMs in play; whereas you can't reproduce when just one JVM is involved. What 
if the driver has oops on, but the executor does not? and the results from the 
executor are parsed somewhere as if oops are on? Normally this would be wholly 
transparent to JVM bytecode but tungsten / SizeEstimator are depending in part 
on the actual representation of the object in memory.

> Incorrect empty join sets when executor-memory >= 32g
> -
>
> Key: SPARK-10914
> URL: https://issues.apache.org/jira/browse/SPARK-10914
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
> Environment: Ubuntu 14.04 (spark-slave), 12.04 (master)
>Reporter: Ben Moran
>
> Using an inner join, to match together two integer columns, I generally get 
> no results when there should be matches.  But the results vary and depend on 
> whether the dataframes are coming from SQL, JSON, or cached, as well as the 
> order in which I cache things and query them.
> This minimal example reproduces it consistently for me in the spark-shell, on 
> new installs of both 1.5.0 and 1.5.1 (pre-built against Hadoop 2.6 from 
> http://spark.apache.org/downloads.html.)
> {code}
> /* x is {"xx":1}{"xx":2} and y is just {"yy":1}{"yy:2} */
> val x = sql("select 1 xx union all select 2") 
> val y = sql("select 1 yy union all select 2")
> x.join(y, $"xx" === $"yy").count() /* expect 2, get 0 */
> /* If I cache both tables it works: */
> x.cache()
> y.cache()
> x.join(y, $"xx" === $"yy").count() /* expect 2, get 2 */
> /* but this still doesn't work: */
> x.join(y, $"xx" === $"yy").filter("yy=1").count() /* expect 1, get 0 */
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10914) Incorrect empty join sets when executor-memory >= 32g

2015-10-08 Thread Ben Moran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14948492#comment-14948492
 ] 

Ben Moran commented on SPARK-10914:
---

I just tried moving the master to the worker box, so it's entirely on one 
machine.  (Ubuntu 14.04 + now Oracle JDK 1.8).

It still reproduces the bug.  So, entirely on spark-worker:

{code}
spark@spark-worker:~/spark-1.5.1-bin-hadoop2.6$ sbin/start-master.sh
spark@spark-worker:~/spark-1.5.1-bin-hadoop2.6$ sbin/start-slave.sh --master 
spark://spark-worker:7077
spark@spark-worker:~/spark-1.5.1-bin-hadoop2.6$ bin/spark-shell --master 
spark://spark-worker:7077  --conf 
"spark.executor.extraJavaOptions=-XX:-UseCompressedOops"
log4j:WARN No appenders could be found for logger 
(org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more 
info.
Using Spark's repl log4j profile: 
org/apache/spark/log4j-defaults-repl.properties
To adjust logging level use sc.setLogLevel("INFO")
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.5.1
  /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_60)
Type in expressions to have them evaluated.
Type :help for more information.
15/10/08 12:15:12 WARN MetricsSystem: Using default name DAGScheduler for 
source because spark.app.id is not set.
Spark context available as sc.
15/10/08 12:15:14 WARN Connection: BoneCP specified but not present in 
CLASSPATH (or one of dependencies)
15/10/08 12:15:14 WARN Connection: BoneCP specified but not present in 
CLASSPATH (or one of dependencies)
15/10/08 12:15:19 WARN ObjectStore: Version information not found in metastore. 
hive.metastore.schema.verification is not enabled so recording the schema 
version 1.2.0
15/10/08 12:15:20 WARN ObjectStore: Failed to get database default, returning 
NoSuchObjectException
15/10/08 12:15:21 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
15/10/08 12:15:21 WARN Connection: BoneCP specified but not present in 
CLASSPATH (or one of dependencies)
15/10/08 12:15:21 WARN Connection: BoneCP specified but not present in 
CLASSPATH (or one of dependencies)
SQL context available as sqlContext.

scala> val x = sql("select 1 xx union all select 2")
x: org.apache.spark.sql.DataFrame = [xx: int]

scala> val y = sql("select 1 yy union all select 2")
y: org.apache.spark.sql.DataFrame = [yy: int]

scala> 

scala> x.join(y, $"xx" === $"yy").count() /* expect 2, get 0 */ 
res0: Long = 0

{code}

does give me the incorrect count.

> Incorrect empty join sets when executor-memory >= 32g
> -
>
> Key: SPARK-10914
> URL: https://issues.apache.org/jira/browse/SPARK-10914
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
> Environment: Ubuntu 14.04 (spark-slave), 12.04 (master)
>Reporter: Ben Moran
>
> Using an inner join, to match together two integer columns, I generally get 
> no results when there should be matches.  But the results vary and depend on 
> whether the dataframes are coming from SQL, JSON, or cached, as well as the 
> order in which I cache things and query them.
> This minimal example reproduces it consistently for me in the spark-shell, on 
> new installs of both 1.5.0 and 1.5.1 (pre-built against Hadoop 2.6 from 
> http://spark.apache.org/downloads.html.)
> {code}
> /* x is {"xx":1}{"xx":2} and y is just {"yy":1}{"yy:2} */
> val x = sql("select 1 xx union all select 2") 
> val y = sql("select 1 yy union all select 2")
> x.join(y, $"xx" === $"yy").count() /* expect 2, get 0 */
> /* If I cache both tables it works: */
> x.cache()
> y.cache()
> x.join(y, $"xx" === $"yy").count() /* expect 2, get 2 */
> /* but this still doesn't work: */
> x.join(y, $"xx" === $"yy").filter("yy=1").count() /* expect 1, get 0 */
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10914) Incorrect empty join sets when executor-memory >= 32g

2015-10-08 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14948486#comment-14948486
 ] 

Sean Owen commented on SPARK-10914:
---

Still kind of guessing here... but what if the problem is that the computation 
spans machines that have a different oops configuration (driver vs executor) 
and that breaks some assumption in the low-level byte munging?

> Incorrect empty join sets when executor-memory >= 32g
> -
>
> Key: SPARK-10914
> URL: https://issues.apache.org/jira/browse/SPARK-10914
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
> Environment: Ubuntu 14.04 (spark-slave), 12.04 (master)
>Reporter: Ben Moran
>
> Using an inner join, to match together two integer columns, I generally get 
> no results when there should be matches.  But the results vary and depend on 
> whether the dataframes are coming from SQL, JSON, or cached, as well as the 
> order in which I cache things and query them.
> This minimal example reproduces it consistently for me in the spark-shell, on 
> new installs of both 1.5.0 and 1.5.1 (pre-built against Hadoop 2.6 from 
> http://spark.apache.org/downloads.html.)
> {code}
> /* x is {"xx":1}{"xx":2} and y is just {"yy":1}{"yy:2} */
> val x = sql("select 1 xx union all select 2") 
> val y = sql("select 1 yy union all select 2")
> x.join(y, $"xx" === $"yy").count() /* expect 2, get 0 */
> /* If I cache both tables it works: */
> x.cache()
> y.cache()
> x.join(y, $"xx" === $"yy").count() /* expect 2, get 2 */
> /* but this still doesn't work: */
> x.join(y, $"xx" === $"yy").filter("yy=1").count() /* expect 1, get 0 */
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10914) Incorrect empty join sets when executor-memory >= 32g

2015-10-08 Thread Ben Moran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14948460#comment-14948460
 ] 

Ben Moran commented on SPARK-10914:
---

On latest master for me .count() also always seems to return 5 for everything! 
I think that is a separate bug - I think I saw it filed already but I can't 
find it now.

> Incorrect empty join sets when executor-memory >= 32g
> -
>
> Key: SPARK-10914
> URL: https://issues.apache.org/jira/browse/SPARK-10914
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
> Environment: Ubuntu 14.04 (spark-slave), 12.04 (master)
>Reporter: Ben Moran
>
> Using an inner join, to match together two integer columns, I generally get 
> no results when there should be matches.  But the results vary and depend on 
> whether the dataframes are coming from SQL, JSON, or cached, as well as the 
> order in which I cache things and query them.
> This minimal example reproduces it consistently for me in the spark-shell, on 
> new installs of both 1.5.0 and 1.5.1 (pre-built against Hadoop 2.6 from 
> http://spark.apache.org/downloads.html.)
> {code}
> /* x is {"xx":1}{"xx":2} and y is just {"yy":1}{"yy:2} */
> val x = sql("select 1 xx union all select 2") 
> val y = sql("select 1 yy union all select 2")
> x.join(y, $"xx" === $"yy").count() /* expect 2, get 0 */
> /* If I cache both tables it works: */
> x.cache()
> y.cache()
> x.join(y, $"xx" === $"yy").count() /* expect 2, get 2 */
> /* but this still doesn't work: */
> x.join(y, $"xx" === $"yy").filter("yy=1").count() /* expect 1, get 0 */
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10914) Incorrect empty join sets when executor-memory >= 32g

2015-10-08 Thread Ben Moran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14948513#comment-14948513
 ] 

Ben Moran commented on SPARK-10914:
---

I think you've got it - if I also turn off UseCompressedOops for the driver as 
well as the executor, it gives correct results:

 bin/spark-shell --master spark://spark-worker:7077  --conf 
"spark.executor.extraJavaOptions=-XX:-UseCompressedOops" --driver-java-options 
"-XX:-UseCompressedOops"


Does this leave me with a viable workaround?  I'm not sure of the impact of 
UseCompressedOops

> Incorrect empty join sets when executor-memory >= 32g
> -
>
> Key: SPARK-10914
> URL: https://issues.apache.org/jira/browse/SPARK-10914
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
> Environment: Ubuntu 14.04 (spark-slave), 12.04 (master)
>Reporter: Ben Moran
>
> Using an inner join, to match together two integer columns, I generally get 
> no results when there should be matches.  But the results vary and depend on 
> whether the dataframes are coming from SQL, JSON, or cached, as well as the 
> order in which I cache things and query them.
> This minimal example reproduces it consistently for me in the spark-shell, on 
> new installs of both 1.5.0 and 1.5.1 (pre-built against Hadoop 2.6 from 
> http://spark.apache.org/downloads.html.)
> {code}
> /* x is {"xx":1}{"xx":2} and y is just {"yy":1}{"yy:2} */
> val x = sql("select 1 xx union all select 2") 
> val y = sql("select 1 yy union all select 2")
> x.join(y, $"xx" === $"yy").count() /* expect 2, get 0 */
> /* If I cache both tables it works: */
> x.cache()
> y.cache()
> x.join(y, $"xx" === $"yy").count() /* expect 2, get 2 */
> /* but this still doesn't work: */
> x.join(y, $"xx" === $"yy").filter("yy=1").count() /* expect 1, get 0 */
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10914) Incorrect empty join sets when executor-memory >= 32g

2015-10-08 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14948438#comment-14948438
 ] 

Sean Owen commented on SPARK-10914:
---

I ran the latest master in standalone mode, with {{/bin/spark-shell --master 
spark://localhost:7077 --executor-memory 31g --conf 
"spark.executor.extraJavaOptions=-XX:-UseCompressedOops"}}

{code}
scala> val x = sql("select 1 xx union all select 2")
x: org.apache.spark.sql.DataFrame = [xx: int]

scala> val y = sql("select 1 yy union all select 2")
y: org.apache.spark.sql.DataFrame = [yy: int]

scala> x.join(y, $"xx" === $"yy").count()
res0: Long = 5  
{code}

Without the {{-XX:-UseCompressedOops}} the answer is 2.

Could be unrelated to {{SizeEstimator}}, yes. I get 5 when setting 
{{spark.test.useCompressedOops=false}} too, which seems to indicate it's not 
{{SizeEstimator}}.

Could it be something in the Tungsten machinery? I see it's part of the plan 
above. I wasn't sure how to test with that disabled.

> Incorrect empty join sets when executor-memory >= 32g
> -
>
> Key: SPARK-10914
> URL: https://issues.apache.org/jira/browse/SPARK-10914
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
> Environment: Ubuntu 14.04 (spark-slave), 12.04 (master)
>Reporter: Ben Moran
>
> Using an inner join, to match together two integer columns, I generally get 
> no results when there should be matches.  But the results vary and depend on 
> whether the dataframes are coming from SQL, JSON, or cached, as well as the 
> order in which I cache things and query them.
> This minimal example reproduces it consistently for me in the spark-shell, on 
> new installs of both 1.5.0 and 1.5.1 (pre-built against Hadoop 2.6 from 
> http://spark.apache.org/downloads.html.)
> {code}
> /* x is {"xx":1}{"xx":2} and y is just {"yy":1}{"yy:2} */
> val x = sql("select 1 xx union all select 2") 
> val y = sql("select 1 yy union all select 2")
> x.join(y, $"xx" === $"yy").count() /* expect 2, get 0 */
> /* If I cache both tables it works: */
> x.cache()
> y.cache()
> x.join(y, $"xx" === $"yy").count() /* expect 2, get 2 */
> /* but this still doesn't work: */
> x.join(y, $"xx" === $"yy").filter("yy=1").count() /* expect 1, get 0 */
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10914) Incorrect empty join sets when executor-memory >= 32g

2015-10-08 Thread Ben Moran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14948494#comment-14948494
 ] 

Ben Moran commented on SPARK-10914:
---

Either using the large heap, or -XX:-UseCompressedOops triggers the bug.

> Incorrect empty join sets when executor-memory >= 32g
> -
>
> Key: SPARK-10914
> URL: https://issues.apache.org/jira/browse/SPARK-10914
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
> Environment: Ubuntu 14.04 (spark-slave), 12.04 (master)
>Reporter: Ben Moran
>
> Using an inner join, to match together two integer columns, I generally get 
> no results when there should be matches.  But the results vary and depend on 
> whether the dataframes are coming from SQL, JSON, or cached, as well as the 
> order in which I cache things and query them.
> This minimal example reproduces it consistently for me in the spark-shell, on 
> new installs of both 1.5.0 and 1.5.1 (pre-built against Hadoop 2.6 from 
> http://spark.apache.org/downloads.html.)
> {code}
> /* x is {"xx":1}{"xx":2} and y is just {"yy":1}{"yy:2} */
> val x = sql("select 1 xx union all select 2") 
> val y = sql("select 1 yy union all select 2")
> x.join(y, $"xx" === $"yy").count() /* expect 2, get 0 */
> /* If I cache both tables it works: */
> x.cache()
> y.cache()
> x.join(y, $"xx" === $"yy").count() /* expect 2, get 2 */
> /* but this still doesn't work: */
> x.join(y, $"xx" === $"yy").filter("yy=1").count() /* expect 1, get 0 */
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10914) Incorrect empty join sets when executor-memory >= 32g

2015-10-08 Thread Ben Moran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14948464#comment-14948464
 ] 

Ben Moran commented on SPARK-10914:
---

I also don't see it if I run spark-shell without setting --master.

> Incorrect empty join sets when executor-memory >= 32g
> -
>
> Key: SPARK-10914
> URL: https://issues.apache.org/jira/browse/SPARK-10914
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
> Environment: Ubuntu 14.04 (spark-slave), 12.04 (master)
>Reporter: Ben Moran
>
> Using an inner join, to match together two integer columns, I generally get 
> no results when there should be matches.  But the results vary and depend on 
> whether the dataframes are coming from SQL, JSON, or cached, as well as the 
> order in which I cache things and query them.
> This minimal example reproduces it consistently for me in the spark-shell, on 
> new installs of both 1.5.0 and 1.5.1 (pre-built against Hadoop 2.6 from 
> http://spark.apache.org/downloads.html.)
> {code}
> /* x is {"xx":1}{"xx":2} and y is just {"yy":1}{"yy:2} */
> val x = sql("select 1 xx union all select 2") 
> val y = sql("select 1 yy union all select 2")
> x.join(y, $"xx" === $"yy").count() /* expect 2, get 0 */
> /* If I cache both tables it works: */
> x.cache()
> y.cache()
> x.join(y, $"xx" === $"yy").count() /* expect 2, get 2 */
> /* but this still doesn't work: */
> x.join(y, $"xx" === $"yy").filter("yy=1").count() /* expect 1, get 0 */
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10914) Incorrect empty join sets when executor-memory >= 32g

2015-10-07 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14947711#comment-14947711
 ] 

Reynold Xin commented on SPARK-10914:
-

I don't think size estimator would impact the result. 

If I understand this correctly, this fails with a small heap and compressed 
oops turned off? I can't reproduce it locally. I tried launching spark-shell 
using
{code}
bin/spark-shell --driver-java-options "-XX:-UseCompressedOops"
{code}


> Incorrect empty join sets when executor-memory >= 32g
> -
>
> Key: SPARK-10914
> URL: https://issues.apache.org/jira/browse/SPARK-10914
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
> Environment: Ubuntu 14.04 (spark-slave), 12.04 (master)
>Reporter: Ben Moran
>
> Using an inner join, to match together two integer columns, I generally get 
> no results when there should be matches.  But the results vary and depend on 
> whether the dataframes are coming from SQL, JSON, or cached, as well as the 
> order in which I cache things and query them.
> This minimal example reproduces it consistently for me in the spark-shell, on 
> new installs of both 1.5.0 and 1.5.1 (pre-built against Hadoop 2.6 from 
> http://spark.apache.org/downloads.html.)
> {code}
> /* x is {"xx":1}{"xx":2} and y is just {"yy":1}{"yy:2} */
> val x = sql("select 1 xx union all select 2") 
> val y = sql("select 1 yy union all select 2")
> x.join(y, $"xx" === $"yy").count() /* expect 2, get 0 */
> /* If I cache both tables it works: */
> x.cache()
> y.cache()
> x.join(y, $"xx" === $"yy").count() /* expect 2, get 2 */
> /* but this still doesn't work: */
> x.join(y, $"xx" === $"yy").filter("yy=1").count() /* expect 1, get 0 */
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10914) Incorrect empty join sets when executor-memory >= 32g

2015-10-07 Thread Ben Moran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14946623#comment-14946623
 ] 

Ben Moran commented on SPARK-10914:
---

I added a properties file to set spark.test.useCompressedOops=false.  This 
didn't seem to make any effect - it still gives wrong results for >=32gb, and 
correct results for <=31gb.

I'm not sufficiently familiar with the code to investigate what SizeEstimator 
thinks.  Have you any suggestions?

> Incorrect empty join sets when executor-memory >= 32g
> -
>
> Key: SPARK-10914
> URL: https://issues.apache.org/jira/browse/SPARK-10914
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
> Environment: Ubuntu 14.04 (spark-slave), 12.04 (master)
>Reporter: Ben Moran
>
> Using an inner join, to match together two integer columns, I generally get 
> no results when there should be matches.  But the results vary and depend on 
> whether the dataframes are coming from SQL, JSON, or cached, as well as the 
> order in which I cache things and query them.
> This minimal example reproduces it consistently for me in the spark-shell, on 
> new installs of both 1.5.0 and 1.5.1 (pre-built against Hadoop 2.6 from 
> http://spark.apache.org/downloads.html.)
> /* x is {"xx":1}{"xx":2} and y is just {"yy":1}{"yy:2} */
> val x = sql("select 1 xx union all select 2") 
> val y = sql("select 1 yy union all select 2")
> x.join(y, $"xx" === $"yy").count() /* expect 2, get 0 */
> /* If I cache both tables it works: */
> x.cache()
> y.cache()
> x.join(y, $"xx" === $"yy").count() /* expect 2, get 2 */
> /* but this still doesn't work: */
> x.join(y, $"xx" === $"yy").filter("yy=1").count() /* expect 1, get 0 */



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10914) Incorrect empty join sets when executor-memory >= 32g

2015-10-07 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14947307#comment-14947307
 ] 

Sean Owen commented on SPARK-10914:
---

It fails with a small heap, right -- if you set -XX:-UseCompressedOops too? I 
mean do that, but also set spark.test.useCompressedOops=false to make sure 
SizeEstimator definitely thinks oops are disabled. Or: use a big heap, and 
spark.test.useCompressedOops=false

I don't have a great system for investigating SizeEstimator other than to debug 
if you can, or just copy/paste that method that figures out if oops are on and 
see what it prints in your program.

> Incorrect empty join sets when executor-memory >= 32g
> -
>
> Key: SPARK-10914
> URL: https://issues.apache.org/jira/browse/SPARK-10914
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
> Environment: Ubuntu 14.04 (spark-slave), 12.04 (master)
>Reporter: Ben Moran
>
> Using an inner join, to match together two integer columns, I generally get 
> no results when there should be matches.  But the results vary and depend on 
> whether the dataframes are coming from SQL, JSON, or cached, as well as the 
> order in which I cache things and query them.
> This minimal example reproduces it consistently for me in the spark-shell, on 
> new installs of both 1.5.0 and 1.5.1 (pre-built against Hadoop 2.6 from 
> http://spark.apache.org/downloads.html.)
> /* x is {"xx":1}{"xx":2} and y is just {"yy":1}{"yy:2} */
> val x = sql("select 1 xx union all select 2") 
> val y = sql("select 1 yy union all select 2")
> x.join(y, $"xx" === $"yy").count() /* expect 2, get 0 */
> /* If I cache both tables it works: */
> x.cache()
> y.cache()
> x.join(y, $"xx" === $"yy").count() /* expect 2, get 2 */
> /* but this still doesn't work: */
> x.join(y, $"xx" === $"yy").filter("yy=1").count() /* expect 1, get 0 */



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10914) Incorrect empty join sets when executor-memory >= 32g

2015-10-06 Thread Ben Moran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14945443#comment-14945443
 ] 

Ben Moran commented on SPARK-10914:
---

Ah, I did it the wrong way around!  With a *small* heap and the option disabled 
it does indeed fail:

--executor-memory 31g --conf 
"spark.executor.extraJavaOptions=-XX:-UseCompressedOops"

gives the wrong result.

> Incorrect empty join sets when executor-memory >= 32g
> -
>
> Key: SPARK-10914
> URL: https://issues.apache.org/jira/browse/SPARK-10914
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
> Environment: Ubuntu 14.04 (spark-slave), 12.04 (master)
>Reporter: Ben Moran
>
> Using an inner join, to match together two integer columns, I generally get 
> no results when there should be matches.  But the results vary and depend on 
> whether the dataframes are coming from SQL, JSON, or cached, as well as the 
> order in which I cache things and query them.
> This minimal example reproduces it consistently for me in the spark-shell, on 
> new installs of both 1.5.0 and 1.5.1 (pre-built against Hadoop 2.6 from 
> http://spark.apache.org/downloads.html.)
> /* x is {"xx":1}{"xx":2} and y is just {"yy":1}{"yy:2} */
> val x = sql("select 1 xx union all select 2") 
> val y = sql("select 1 yy union all select 2")
> x.join(y, $"xx" === $"yy").count() /* expect 2, get 0 */
> /* If I cache both tables it works: */
> x.cache()
> y.cache()
> x.join(y, $"xx" === $"yy").count() /* expect 2, get 2 */
> /* but this still doesn't work: */
> x.join(y, $"xx" === $"yy").filter("yy=1").count() /* expect 1, get 0 */



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10914) Incorrect empty join sets when executor-memory >= 32g

2015-10-06 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14945638#comment-14945638
 ] 

Sean Owen commented on SPARK-10914:
---

Hm, it could be a valid lead after all. The size estimator code is aware of 
32-bit vs 64-bit pointers but a next line of inquiry might be to determine if 
somehow in your case it's detected incorrectly and that causes an error.  It 
sounds like compressed oops are off, but it thinks it's on. You could try 
adding "-Dspark.test.useCompressedOops=false" to see if it works then. That 
would pretty much confirm it. Then the question is what, for example 
SizeEstimator thinks for these values; if you can dig in to that and see what 
it comes up with that would help.

I'm assuming it is related to the part of the code that looks for compressed 
oops, but I wonder if somehow this affects Tungsten? CC [~rxin] for what may be 
a dumb question.

> Incorrect empty join sets when executor-memory >= 32g
> -
>
> Key: SPARK-10914
> URL: https://issues.apache.org/jira/browse/SPARK-10914
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
> Environment: Ubuntu 14.04 (spark-slave), 12.04 (master)
>Reporter: Ben Moran
>
> Using an inner join, to match together two integer columns, I generally get 
> no results when there should be matches.  But the results vary and depend on 
> whether the dataframes are coming from SQL, JSON, or cached, as well as the 
> order in which I cache things and query them.
> This minimal example reproduces it consistently for me in the spark-shell, on 
> new installs of both 1.5.0 and 1.5.1 (pre-built against Hadoop 2.6 from 
> http://spark.apache.org/downloads.html.)
> /* x is {"xx":1}{"xx":2} and y is just {"yy":1}{"yy:2} */
> val x = sql("select 1 xx union all select 2") 
> val y = sql("select 1 yy union all select 2")
> x.join(y, $"xx" === $"yy").count() /* expect 2, get 0 */
> /* If I cache both tables it works: */
> x.cache()
> y.cache()
> x.join(y, $"xx" === $"yy").count() /* expect 2, get 2 */
> /* but this still doesn't work: */
> x.join(y, $"xx" === $"yy").filter("yy=1").count() /* expect 1, get 0 */



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10914) Incorrect empty join sets when executor-memory >= 32g

2015-10-06 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14945436#comment-14945436
 ] 

Sean Owen commented on SPARK-10914:
---

Sure, pass -XX:-UseCompressedOops. So if it fails with a small heap and this 
setting, maybe we have something. Worth checking, at least because it seems 
weird that it magically works with a certain heap size or below, and 32g is the 
magic boundary above which compressed oops can't be enabled, and references go 
back to being 64-bit instead of 32-bit.

> Incorrect empty join sets when executor-memory >= 32g
> -
>
> Key: SPARK-10914
> URL: https://issues.apache.org/jira/browse/SPARK-10914
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
> Environment: Ubuntu 14.04 (spark-slave), 12.04 (master)
>Reporter: Ben Moran
>
> Using an inner join, to match together two integer columns, I generally get 
> no results when there should be matches.  But the results vary and depend on 
> whether the dataframes are coming from SQL, JSON, or cached, as well as the 
> order in which I cache things and query them.
> This minimal example reproduces it consistently for me in the spark-shell, on 
> new installs of both 1.5.0 and 1.5.1 (pre-built against Hadoop 2.6 from 
> http://spark.apache.org/downloads.html.)
> /* x is {"xx":1}{"xx":2} and y is just {"yy":1}{"yy:2} */
> val x = sql("select 1 xx union all select 2") 
> val y = sql("select 1 yy union all select 2")
> x.join(y, $"xx" === $"yy").count() /* expect 2, get 0 */
> /* If I cache both tables it works: */
> x.cache()
> y.cache()
> x.join(y, $"xx" === $"yy").count() /* expect 2, get 2 */
> /* but this still doesn't work: */
> x.join(y, $"xx" === $"yy").filter("yy=1").count() /* expect 1, get 0 */



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10914) Incorrect empty join sets

2015-10-06 Thread Ben Moran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14945358#comment-14945358
 ] 

Ben Moran commented on SPARK-10914:
---

Thanks for looking into it.  I have narrowed it down a lot now.  It depends on 
the --executor-memory setting!

For me using "bin/spark-shell" locally I don't see the problem, but I do see it 
when I use a standalone cluster. It reliably reproduces whenever I specify 
"--executor-memory 32g" or greater, but if I leave executor-memory unset or 
specify a value of 31g or less I get the correct result.

Here's a correct run:
spark@spark-master:~/spark-1.5.1-bin-hadoop2.6$ bin/spark-shell --master 
spark://spark-master:7077   --executor-memory 31g 

scala> val x = sql("select 1 xx union all select 2")
x: org.apache.spark.sql.DataFrame = [xx: int]

scala> val y = sql("select 1 yy union all select 2")
y: org.apache.spark.sql.DataFrame = [yy: int]

scala> x.join(y, $"xx" === $"yy").count() /* expect 2, get 2 */
res0: Long = 2 


Here's an incorrect run, with an explain plan (the explain is the same in any 
case):

spark@spark-master:~/spark-1.5.1-bin-hadoop2.6$ bin/spark-shell --master 
spark://spark-master:7077   --executor-memory 32g

scala> val x = sql("select 1 xx union all select 2")
x: org.apache.spark.sql.DataFrame = [xx: int]

scala> val y = sql("select 1 yy union all select 2")
y: org.apache.spark.sql.DataFrame = [yy: int]

scala> x.join(y, $"xx" === $"yy").explain()
== Physical Plan ==
BroadcastHashJoin [xx#0], [yy#2], BuildRight
 Union
  TungstenProject [1 AS xx#0]
   Scan OneRowRelation[]
  TungstenProject [2 AS _c0#1]
   Scan OneRowRelation[]
 Union
  TungstenProject [1 AS yy#2]
   Scan OneRowRelation[]
  TungstenProject [2 AS _c0#3]
   Scan OneRowRelation[]

scala> x.join(y, $"xx" === $"yy").count() /* expect 2, get 0 */
res1: Long = 0   


I have two machines in my cluster:
- one is Ubuntu 12.04, running the spark-master node
- one is Ubuntu 14.04, running the spark-slave node.

Both have 256Gb RAM.

JVM on both machines is: oracle-java7-installer from PPA:
Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_80)



> Incorrect empty join sets
> -
>
> Key: SPARK-10914
> URL: https://issues.apache.org/jira/browse/SPARK-10914
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
> Environment: Ubuntu 14.04 (spark-slave), 12.04 (master)
>Reporter: Ben Moran
>
> Using an inner join, to match together two integer columns, I generally get 
> no results when there should be matches.  But the results vary and depend on 
> whether the dataframes are coming from SQL, JSON, or cached, as well as the 
> order in which I cache things and query them.
> This minimal example reproduces it consistently for me in the spark-shell, on 
> new installs of both 1.5.0 and 1.5.1 (pre-built against Hadoop 2.6 from 
> http://spark.apache.org/downloads.html.)
> /* x is {"xx":1}{"xx":2} and y is just {"yy":1}{"yy:2} */
> val x = sql("select 1 xx union all select 2") 
> val y = sql("select 1 yy union all select 2")
> x.join(y, $"xx" === $"yy").count() /* expect 2, get 0 */
> /* If I cache both tables it works: */
> x.cache()
> y.cache()
> x.join(y, $"xx" === $"yy").count() /* expect 2, get 2 */
> /* but this still doesn't work: */
> x.join(y, $"xx" === $"yy").filter("yy=1").count() /* expect 1, get 0 */



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10914) Incorrect empty join sets when executor-memory >= 32g

2015-10-06 Thread Ben Moran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14945441#comment-14945441
 ] 

Ben Moran commented on SPARK-10914:
---

I just ran with 
--executor-memory 100g --conf 
"spark.executor.extraJavaOptions=-XX:-UseCompressedOops"

but the problem persists.  In the worker log it shows:


15/10/06 18:36:36 INFO ExecutorRunner: Launch command: 
"/usr/lib/jvm/java-7-oracle/jre/bin/java" "-cp" 
"/home/spark/spark-1.5.1-bin-hadoop2.6/sbin/../conf/:/home/spark/spark-1.5.1-bin-hadoop2.6/lib/spark-assembly-1.5.1-hadoop2.6.0.jar:/home/spark/spark-1.5.1-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/home/spark/spark-1.5.1-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/home/spark/spark-1.5.1-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar"
 "-Xms102400M" "-Xmx102400M" "-Dspark.driver.port=53169" 
"-XX:-UseCompressedOops" "-XX:MaxPermSize=256m" 
"org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" 
"akka.tcp://sparkDriver@10.122.82.99:53169/user/CoarseGrainedScheduler" 
"--executor-id" "0" "--hostname" "10.122.82.99" "--cores" "20" "--app-id" 
"app-20151006183636-0019" "--worker-url" 
"akka.tcp://sparkWorker@10.122.82.99:51402/user/Worker"


> Incorrect empty join sets when executor-memory >= 32g
> -
>
> Key: SPARK-10914
> URL: https://issues.apache.org/jira/browse/SPARK-10914
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
> Environment: Ubuntu 14.04 (spark-slave), 12.04 (master)
>Reporter: Ben Moran
>
> Using an inner join, to match together two integer columns, I generally get 
> no results when there should be matches.  But the results vary and depend on 
> whether the dataframes are coming from SQL, JSON, or cached, as well as the 
> order in which I cache things and query them.
> This minimal example reproduces it consistently for me in the spark-shell, on 
> new installs of both 1.5.0 and 1.5.1 (pre-built against Hadoop 2.6 from 
> http://spark.apache.org/downloads.html.)
> /* x is {"xx":1}{"xx":2} and y is just {"yy":1}{"yy:2} */
> val x = sql("select 1 xx union all select 2") 
> val y = sql("select 1 yy union all select 2")
> x.join(y, $"xx" === $"yy").count() /* expect 2, get 0 */
> /* If I cache both tables it works: */
> x.cache()
> y.cache()
> x.join(y, $"xx" === $"yy").count() /* expect 2, get 2 */
> /* but this still doesn't work: */
> x.join(y, $"xx" === $"yy").filter("yy=1").count() /* expect 1, get 0 */



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10914) Incorrect empty join sets when executor-memory >= 32g

2015-10-06 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14945417#comment-14945417
 ] 

Sean Owen commented on SPARK-10914:
---

One long-shot question -- can you figure out whether your JVM is using 
-XX:+UseCompressedOops by default? using -XX:+PrintFlagsFinal or something? it 
should switch off at 32gb but is often on by default in JVMs. I hope it's not 
the case that compressed OOPS throws off Tungsten.

It could also be that bigger heap sizes let tasks all execute on the same 
machine and that happens to matter to reproducing this.

> Incorrect empty join sets when executor-memory >= 32g
> -
>
> Key: SPARK-10914
> URL: https://issues.apache.org/jira/browse/SPARK-10914
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
> Environment: Ubuntu 14.04 (spark-slave), 12.04 (master)
>Reporter: Ben Moran
>
> Using an inner join, to match together two integer columns, I generally get 
> no results when there should be matches.  But the results vary and depend on 
> whether the dataframes are coming from SQL, JSON, or cached, as well as the 
> order in which I cache things and query them.
> This minimal example reproduces it consistently for me in the spark-shell, on 
> new installs of both 1.5.0 and 1.5.1 (pre-built against Hadoop 2.6 from 
> http://spark.apache.org/downloads.html.)
> /* x is {"xx":1}{"xx":2} and y is just {"yy":1}{"yy:2} */
> val x = sql("select 1 xx union all select 2") 
> val y = sql("select 1 yy union all select 2")
> x.join(y, $"xx" === $"yy").count() /* expect 2, get 0 */
> /* If I cache both tables it works: */
> x.cache()
> y.cache()
> x.join(y, $"xx" === $"yy").count() /* expect 2, get 2 */
> /* but this still doesn't work: */
> x.join(y, $"xx" === $"yy").filter("yy=1").count() /* expect 1, get 0 */



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10914) Incorrect empty join sets when executor-memory >= 32g

2015-10-06 Thread Ben Moran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14945426#comment-14945426
 ] 

Ben Moran commented on SPARK-10914:
---

java  -XX:+PrintFlagsFinal says:
 bool UseCompressedOops:= true
{lp64_product}  

Is there a way to turn that off and retest?

> Incorrect empty join sets when executor-memory >= 32g
> -
>
> Key: SPARK-10914
> URL: https://issues.apache.org/jira/browse/SPARK-10914
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
> Environment: Ubuntu 14.04 (spark-slave), 12.04 (master)
>Reporter: Ben Moran
>
> Using an inner join, to match together two integer columns, I generally get 
> no results when there should be matches.  But the results vary and depend on 
> whether the dataframes are coming from SQL, JSON, or cached, as well as the 
> order in which I cache things and query them.
> This minimal example reproduces it consistently for me in the spark-shell, on 
> new installs of both 1.5.0 and 1.5.1 (pre-built against Hadoop 2.6 from 
> http://spark.apache.org/downloads.html.)
> /* x is {"xx":1}{"xx":2} and y is just {"yy":1}{"yy:2} */
> val x = sql("select 1 xx union all select 2") 
> val y = sql("select 1 yy union all select 2")
> x.join(y, $"xx" === $"yy").count() /* expect 2, get 0 */
> /* If I cache both tables it works: */
> x.cache()
> y.cache()
> x.join(y, $"xx" === $"yy").count() /* expect 2, get 2 */
> /* but this still doesn't work: */
> x.join(y, $"xx" === $"yy").filter("yy=1").count() /* expect 1, get 0 */



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10914) Incorrect empty join sets when executor-memory >= 32g

2015-10-06 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14945418#comment-14945418
 ] 

Sean Owen commented on SPARK-10914:
---

One long-shot question -- can you figure out whether your JVM is using 
-XX:+UseCompressedOops by default? using -XX:+PrintFlagsFinal or something? it 
should switch off at 32gb but is often on by default in JVMs. I hope it's not 
the case that compressed OOPS throws off Tungsten.

It could also be that bigger heap sizes let tasks all execute on the same 
machine and that happens to matter to reproducing this.

> Incorrect empty join sets when executor-memory >= 32g
> -
>
> Key: SPARK-10914
> URL: https://issues.apache.org/jira/browse/SPARK-10914
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
> Environment: Ubuntu 14.04 (spark-slave), 12.04 (master)
>Reporter: Ben Moran
>
> Using an inner join, to match together two integer columns, I generally get 
> no results when there should be matches.  But the results vary and depend on 
> whether the dataframes are coming from SQL, JSON, or cached, as well as the 
> order in which I cache things and query them.
> This minimal example reproduces it consistently for me in the spark-shell, on 
> new installs of both 1.5.0 and 1.5.1 (pre-built against Hadoop 2.6 from 
> http://spark.apache.org/downloads.html.)
> /* x is {"xx":1}{"xx":2} and y is just {"yy":1}{"yy:2} */
> val x = sql("select 1 xx union all select 2") 
> val y = sql("select 1 yy union all select 2")
> x.join(y, $"xx" === $"yy").count() /* expect 2, get 0 */
> /* If I cache both tables it works: */
> x.cache()
> y.cache()
> x.join(y, $"xx" === $"yy").count() /* expect 2, get 2 */
> /* but this still doesn't work: */
> x.join(y, $"xx" === $"yy").filter("yy=1").count() /* expect 1, get 0 */



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10914) Incorrect empty join sets

2015-10-05 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14943855#comment-14943855
 ] 

Wenchen Fan commented on SPARK-10914:
-

hi [~benm], I can't reproduce this bug on spark 1.5.1(downloaded binary 
version) on my mac locally, can you provide more details(local or cluster? jvm 
options?)? you can use `df.explain(true)` to print the plan tree so that we can 
see what's going on there.

> Incorrect empty join sets
> -
>
> Key: SPARK-10914
> URL: https://issues.apache.org/jira/browse/SPARK-10914
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
> Environment: Ubuntu 14.04 (spark-slave), 12.04 (master)
>Reporter: Ben Moran
>
> Using an inner join, to match together two integer columns, I generally get 
> no results when there should be matches.  But the results vary and depend on 
> whether the dataframes are coming from SQL, JSON, or cached, as well as the 
> order in which I cache things and query them.
> This minimal example reproduces it consistently for me in the spark-shell, on 
> new installs of both 1.5.0 and 1.5.1 (pre-built against Hadoop 2.6 from 
> http://spark.apache.org/downloads.html.)
> /* x is {"xx":1}{"xx":2} and y is just {"yy":1}{"yy:2} */
> val x = sql("select 1 xx union all select 2") 
> val y = sql("select 1 yy union all select 2")
> x.join(y, $"xx" === $"yy").count() /* expect 2, get 0 */
> /* If I cache both tables it works: */
> x.cache()
> y.cache()
> x.join(y, $"xx" === $"yy").count() /* expect 2, get 2 */
> /* but this still doesn't work: */
> x.join(y, $"xx" === $"yy").filter("yy=1").count() /* expect 1, get 0 */



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org