[jira] [Commented] (SPARK-10914) Incorrect empty join sets when executor-memory >= 32g
[ https://issues.apache.org/jira/browse/SPARK-10914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14948513#comment-14948513 ]

Ben Moran commented on SPARK-10914:
-----------------------------------

I think you've got it. If I turn off UseCompressedOops for the driver as well as the executor, it gives correct results:

bin/spark-shell --master spark://spark-worker:7077 --conf "spark.executor.extraJavaOptions=-XX:-UseCompressedOops" --driver-java-options "-XX:-UseCompressedOops"

Does this leave me with a viable workaround? I'm not sure of the impact of disabling UseCompressedOops.

> Incorrect empty join sets when executor-memory >= 32g
> -----------------------------------------------------
>
>                 Key: SPARK-10914
>                 URL: https://issues.apache.org/jira/browse/SPARK-10914
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.0, 1.5.1
>         Environment: Ubuntu 14.04 (spark-slave), 12.04 (master)
>            Reporter: Ben Moran
>
> Using an inner join, to match together two integer columns, I generally get
> no results when there should be matches. But the results vary and depend on
> whether the dataframes are coming from SQL, JSON, or cached, as well as the
> order in which I cache things and query them.
> This minimal example reproduces it consistently for me in the spark-shell, on
> new installs of both 1.5.0 and 1.5.1 (pre-built against Hadoop 2.6 from
> http://spark.apache.org/downloads.html.)
> {code}
> /* x is {"xx":1}{"xx":2} and y is just {"yy":1}{"yy":2} */
> val x = sql("select 1 xx union all select 2")
> val y = sql("select 1 yy union all select 2")
> x.join(y, $"xx" === $"yy").count() /* expect 2, get 0 */
> /* If I cache both tables it works: */
> x.cache()
> y.cache()
> x.join(y, $"xx" === $"yy").count() /* expect 2, get 2 */
> /* but this still doesn't work: */
> x.join(y, $"xx" === $"yy").filter("yy=1").count() /* expect 1, get 0 */
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10914) Incorrect empty join sets when executor-memory >= 32g
[ https://issues.apache.org/jira/browse/SPARK-10914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14948511#comment-14948511 ]

Sean Owen commented on SPARK-10914:
-----------------------------------

I don't think having it on one machine necessarily matters. You still have two JVMs in play, whereas you can't reproduce it when just one JVM is involved. What if the driver has compressed oops on but the executor does not, and the results from the executor are parsed somewhere as if oops were on? Normally this would be wholly transparent to JVM bytecode, but Tungsten and SizeEstimator depend in part on the actual representation of objects in memory.
[jira] [Commented] (SPARK-10914) Incorrect empty join sets when executor-memory >= 32g
[ https://issues.apache.org/jira/browse/SPARK-10914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14948494#comment-14948494 ]

Ben Moran commented on SPARK-10914:
-----------------------------------

Either using the large heap or passing -XX:-UseCompressedOops triggers the bug.
[jira] [Commented] (SPARK-10914) Incorrect empty join sets when executor-memory >= 32g
[ https://issues.apache.org/jira/browse/SPARK-10914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14948492#comment-14948492 ]

Ben Moran commented on SPARK-10914:
-----------------------------------

I just tried moving the master to the worker box, so it's entirely on one machine (Ubuntu 14.04 + now Oracle JDK 1.8). It still reproduces the bug.

So, entirely on spark-worker:

{code}
spark@spark-worker:~/spark-1.5.1-bin-hadoop2.6$ sbin/start-master.sh
spark@spark-worker:~/spark-1.5.1-bin-hadoop2.6$ sbin/start-slave.sh --master spark://spark-worker:7077
spark@spark-worker:~/spark-1.5.1-bin-hadoop2.6$ bin/spark-shell --master spark://spark-worker:7077 --conf "spark.executor.extraJavaOptions=-XX:-UseCompressedOops"
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark's repl log4j profile: org/apache/spark/log4j-defaults-repl.properties
To adjust logging level use sc.setLogLevel("INFO")
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.5.1
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_60)
Type in expressions to have them evaluated.
Type :help for more information.
15/10/08 12:15:12 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
Spark context available as sc.
15/10/08 12:15:14 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
15/10/08 12:15:14 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
15/10/08 12:15:19 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
15/10/08 12:15:20 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
15/10/08 12:15:21 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/10/08 12:15:21 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
15/10/08 12:15:21 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
SQL context available as sqlContext.

scala> val x = sql("select 1 xx union all select 2")
x: org.apache.spark.sql.DataFrame = [xx: int]

scala> val y = sql("select 1 yy union all select 2")
y: org.apache.spark.sql.DataFrame = [yy: int]

scala> x.join(y, $"xx" === $"yy").count() /* expect 2, get 0 */
res0: Long = 0
{code}

does give me the incorrect count.
[jira] [Commented] (SPARK-10914) Incorrect empty join sets when executor-memory >= 32g
[ https://issues.apache.org/jira/browse/SPARK-10914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14948486#comment-14948486 ]

Sean Owen commented on SPARK-10914:
-----------------------------------

Still kind of guessing here... but what if the problem is that the computation spans machines that have a different oops configuration (driver vs executor) and that breaks some assumption in the low-level byte munging?
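
To make the "different oops configuration" idea concrete, here is a minimal sketch that prints one layout-dependent number from the running JVM: the offset from the start of a byte[] object to its first element. This is not Spark's code. The object name LayoutProbe is hypothetical, and sun.misc.Unsafe is an internal HotSpot API used here only as a probe. Whether this particular offset is what goes wrong in the join is not established anywhere in this thread.

```scala
object LayoutProbe {
  // sun.misc.Unsafe is an internal API; reaching it through reflection is an
  // assumption about the JVM in use (it works on the Oracle/OpenJDK HotSpot builds).
  private val unsafe: sun.misc.Unsafe = {
    val field = classOf[sun.misc.Unsafe].getDeclaredField("theUnsafe")
    field.setAccessible(true)
    field.get(null).asInstanceOf[sun.misc.Unsafe]
  }

  /** Offset in bytes from the start of a byte[] object to its first element. */
  def byteArrayBaseOffset: Int = unsafe.arrayBaseOffset(classOf[Array[Byte]])

  def main(args: Array[String]): Unit =
    println(s"byte[] base offset: $byteArrayBaseOffset")
}
```

Running it once normally and once with -XX:-UseCompressedOops should show whether the two JVMs agree. On the 64-bit HotSpot 7 and 8 builds I am aware of, the two runs tend to report 16 and 24 respectively, but treat those numbers as an assumption. Any code that records such an offset in one JVM and reuses it in another would read the wrong bytes if the values differ.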
[jira] [Commented] (SPARK-10914) Incorrect empty join sets when executor-memory >= 32g
[ https://issues.apache.org/jira/browse/SPARK-10914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14948464#comment-14948464 ]

Ben Moran commented on SPARK-10914:
-----------------------------------

I also don't see it if I run spark-shell without setting --master.
[jira] [Commented] (SPARK-10914) Incorrect empty join sets when executor-memory >= 32g
[ https://issues.apache.org/jira/browse/SPARK-10914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14948460#comment-14948460 ]

Ben Moran commented on SPARK-10914:
-----------------------------------

On the latest master, .count() also always seems to return 5 for everything for me! I think that is a separate bug. I believe I saw it filed already, but I can't find it now.
[jira] [Commented] (SPARK-10914) Incorrect empty join sets when executor-memory >= 32g
[ https://issues.apache.org/jira/browse/SPARK-10914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14948438#comment-14948438 ]

Sean Owen commented on SPARK-10914:
-----------------------------------

I ran the latest master in standalone mode, with {{/bin/spark-shell --master spark://localhost:7077 --executor-memory 31g --conf "spark.executor.extraJavaOptions=-XX:-UseCompressedOops"}}

{code}
scala> val x = sql("select 1 xx union all select 2")
x: org.apache.spark.sql.DataFrame = [xx: int]

scala> val y = sql("select 1 yy union all select 2")
y: org.apache.spark.sql.DataFrame = [yy: int]

scala> x.join(y, $"xx" === $"yy").count()
res0: Long = 5
{code}

Without the {{-XX:-UseCompressedOops}} the answer is 2.

Could be unrelated to {{SizeEstimator}}, yes. I get 5 when setting {{spark.test.useCompressedOops=false}} too, which seems to indicate it's not {{SizeEstimator}}.

Could it be something in the Tungsten machinery? I see it's part of the plan above. I wasn't sure how to test with that disabled.
[jira] [Commented] (SPARK-10914) Incorrect empty join sets when executor-memory >= 32g
[ https://issues.apache.org/jira/browse/SPARK-10914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14947711#comment-14947711 ]

Reynold Xin commented on SPARK-10914:
-------------------------------------

I don't think SizeEstimator would impact the result.

If I understand this correctly, this fails with a small heap and compressed oops turned off? I can't reproduce it locally. I tried launching spark-shell using

{code}
bin/spark-shell --driver-java-options "-XX:-UseCompressedOops"
{code}
[jira] [Commented] (SPARK-10914) Incorrect empty join sets when executor-memory >= 32g
[ https://issues.apache.org/jira/browse/SPARK-10914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14947307#comment-14947307 ]

Sean Owen commented on SPARK-10914:
-----------------------------------

It fails with a small heap, right, if you set -XX:-UseCompressedOops too? I mean do that, but also set spark.test.useCompressedOops=false to make sure SizeEstimator definitely thinks oops are disabled.

Or: use a big heap, and spark.test.useCompressedOops=false.

I don't have a great system for investigating SizeEstimator other than to debug if you can, or just copy/paste the method that figures out whether oops are on and see what it prints in your program.
[jira] [Commented] (SPARK-10914) Incorrect empty join sets when executor-memory >= 32g
[ https://issues.apache.org/jira/browse/SPARK-10914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14946623#comment-14946623 ]

Ben Moran commented on SPARK-10914:
-----------------------------------

I added a properties file to set spark.test.useCompressedOops=false. This didn't seem to have any effect: it still gives wrong results for >=32gb, and correct results for <=31gb.

I'm not sufficiently familiar with the code to investigate what SizeEstimator thinks. Have you any suggestions?
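
One way to see what a given JVM itself reports, independent of SizeEstimator, is to ask the HotSpot diagnostic MXBean for the UseCompressedOops flag. The following is a minimal sketch, not a copy of Spark's detection logic. The object name OopsCheck is hypothetical, and the code assumes a HotSpot JVM on Java 7 or later, where com.sun.management.HotSpotDiagnosticMXBean is available.

```scala
import java.lang.management.ManagementFactory
import com.sun.management.HotSpotDiagnosticMXBean
import scala.util.Try

object OopsCheck {
  /** Some(true/false) when the JVM exposes the flag, None when it cannot be read
    * (for example on a JVM that does not define UseCompressedOops). */
  def compressedOops: Option[Boolean] = Try {
    val bean = ManagementFactory.getPlatformMXBean(classOf[HotSpotDiagnosticMXBean])
    bean.getVMOption("UseCompressedOops").getValue.toBoolean
  }.toOption

  def main(args: Array[String]): Unit = {
    // Report the heap limit next to the flag, since the JVM picks the default from it.
    println(s"max heap (MiB): ${Runtime.getRuntime.maxMemory / (1024 * 1024)}")
    println(s"UseCompressedOops: ${compressedOops.map(_.toString).getOrElse("unknown")}")
  }
}
```

Evaluating OopsCheck.compressedOops in the spark-shell gives the driver's view. Something like sc.parallelize(1 to 1).map(_ => OopsCheck.compressedOops).collect() should give an executor's view, assuming the object is defined in the same shell session so that it is shipped with the task.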
[jira] [Commented] (SPARK-10914) Incorrect empty join sets when executor-memory >= 32g
[ https://issues.apache.org/jira/browse/SPARK-10914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14945638#comment-14945638 ]

Sean Owen commented on SPARK-10914:
-----------------------------------

Hm, it could be a valid lead after all. The size estimator code is aware of 32-bit vs 64-bit pointers, but a next line of inquiry might be to determine whether in your case the setting is somehow detected incorrectly and that causes an error. It sounds like compressed oops are off, but it thinks they're on. You could try adding "-Dspark.test.useCompressedOops=false" to see if it works then. That would pretty much confirm it.

Then the question is what, for example, SizeEstimator thinks for these values; if you can dig into that and see what it comes up with, that would help.

I'm assuming it is related to the part of the code that looks for compressed oops, but I wonder if somehow this affects Tungsten? CC [~rxin] for what may be a dumb question.
[jira] [Commented] (SPARK-10914) Incorrect empty join sets when executor-memory >= 32g
[ https://issues.apache.org/jira/browse/SPARK-10914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14945443#comment-14945443 ]

Ben Moran commented on SPARK-10914:
-----------------------------------

Ah, I did it the wrong way around! With a *small* heap and the option disabled it does indeed fail:

--executor-memory 31g --conf "spark.executor.extraJavaOptions=-XX:-UseCompressedOops"

gives the wrong result.
[jira] [Commented] (SPARK-10914) Incorrect empty join sets when executor-memory >= 32g
[ https://issues.apache.org/jira/browse/SPARK-10914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14945441#comment-14945441 ]

Ben Moran commented on SPARK-10914:
-----------------------------------

I just ran with

--executor-memory 100g --conf "spark.executor.extraJavaOptions=-XX:-UseCompressedOops"

but the problem persists. In the worker log it shows:

15/10/06 18:36:36 INFO ExecutorRunner: Launch command: "/usr/lib/jvm/java-7-oracle/jre/bin/java" "-cp" "/home/spark/spark-1.5.1-bin-hadoop2.6/sbin/../conf/:/home/spark/spark-1.5.1-bin-hadoop2.6/lib/spark-assembly-1.5.1-hadoop2.6.0.jar:/home/spark/spark-1.5.1-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/home/spark/spark-1.5.1-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/home/spark/spark-1.5.1-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar" "-Xms102400M" "-Xmx102400M" "-Dspark.driver.port=53169" "-XX:-UseCompressedOops" "-XX:MaxPermSize=256m" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "akka.tcp://sparkDriver@10.122.82.99:53169/user/CoarseGrainedScheduler" "--executor-id" "0" "--hostname" "10.122.82.99" "--cores" "20" "--app-id" "app-20151006183636-0019" "--worker-url" "akka.tcp://sparkWorker@10.122.82.99:51402/user/Worker"
[jira] [Commented] (SPARK-10914) Incorrect empty join sets when executor-memory >= 32g
[ https://issues.apache.org/jira/browse/SPARK-10914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14945436#comment-14945436 ]

Sean Owen commented on SPARK-10914:
-----------------------------------

Sure, pass -XX:-UseCompressedOops. So if it fails with a small heap and this setting, maybe we have something. Worth checking, at least because it seems weird that it magically works with a certain heap size or below, and 32g is the magic boundary above which compressed oops can't be enabled, and references go back to being 64-bit instead of 32-bit.
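
The 32g figure follows from simple arithmetic, sketched below under two stated assumptions: a compressed reference is 32 bits wide, and objects are aligned to HotSpot's default of 8 bytes, so each reference value can name one 8-byte slot. The object name OopsBoundary is hypothetical, and the exact heap size at which a given JVM switches compressed oops off may sit slightly below this theoretical limit.

```scala
object OopsBoundary {
  // Assumption: compressed references are 32 bits wide.
  val referenceBits: Int = 32
  // Assumption: default HotSpot object alignment of 8 bytes.
  val objectAlignmentBytes: Long = 8L

  /** Largest heap, in bytes, that a 32-bit reference counting 8-byte slots can cover. */
  val maxCompressedHeapBytes: Long = (1L << referenceBits) * objectAlignmentBytes

  def main(args: Array[String]): Unit =
    println(s"${maxCompressedHeapBytes / (1L << 30)} GiB") // prints "32 GiB"
}
```

That is 2^32 slots of 8 bytes each, or 2^35 bytes, which matches the observation in this thread that results are correct at 31g and wrong at 32g and above.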
[jira] [Commented] (SPARK-10914) Incorrect empty join sets when executor-memory >= 32g
[ https://issues.apache.org/jira/browse/SPARK-10914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14945426#comment-14945426 ]

Ben Moran commented on SPARK-10914:
-----------------------------------

java -XX:+PrintFlagsFinal says:

    bool UseCompressedOops := true {lp64_product}

Is there a way to turn that off and retest?
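For repeating this check across several machines or heap sizes, a small filter over the -XX:+PrintFlagsFinal output may help. This is a minimal sketch: the function name `jvm_flag_value` is an assumption, and it only relies on the line format quoted above (type, flag name, `:=` or `=`, value).

```shell
# Hypothetical helper: reads -XX:+PrintFlagsFinal output on stdin and prints
# the value of the flag named in $1. Accepts both "=" and ":=" separators.
jvm_flag_value() {
  sed -n "s/^.*[[:space:]]${1}[[:space:]]*:\{0,1\}=[[:space:]]*\([^[:space:]]*\).*/\1/p"
}

# Demonstration on the line reported in this thread (prints: true).
printf '%s\n' '     bool UseCompressedOops   := true   {lp64_product}' |
  jvm_flag_value UseCompressedOops

# Live check, assuming a HotSpot `java` on PATH (left commented out so the
# sketch runs without a JVM):
#   java -XX:+PrintFlagsFinal -version 2>/dev/null | jvm_flag_value UseCompressedOops
```

Adding a heap option such as -Xmx to the live check shows the value the JVM would pick for that heap size, which is the comparison that matters around the 32g boundary.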
[jira] [Commented] (SPARK-10914) Incorrect empty join sets when executor-memory >= 32g
[ https://issues.apache.org/jira/browse/SPARK-10914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14945418#comment-14945418 ]

Sean Owen commented on SPARK-10914:
-----------------------------------

One long-shot question: can you figure out whether your JVM is using -XX:+UseCompressedOops by default, for example with -XX:+PrintFlagsFinal? It is often on by default in 64-bit JVMs and should switch off once the heap reaches 32 GB. I hope it's not the case that compressed oops throws off Tungsten. It could also be that bigger heap sizes let all tasks execute on the same machine, and that this happens to matter for reproducing the problem.