[ https://issues.apache.org/jira/browse/SPARK-10914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14948492#comment-14948492 ]
Ben Moran commented on SPARK-10914:
-----------------------------------

I just tried moving the master to the worker box, so it's entirely on one machine (Ubuntu 14.04, now with Oracle JDK 1.8). It still reproduces the bug.

So, entirely on spark-worker:

{code}
spark@spark-worker:~/spark-1.5.1-bin-hadoop2.6$ sbin/start-master.sh
spark@spark-worker:~/spark-1.5.1-bin-hadoop2.6$ sbin/start-slave.sh --master spark://spark-worker:7077
spark@spark-worker:~/spark-1.5.1-bin-hadoop2.6$ bin/spark-shell --master spark://spark-worker:7077 --conf "spark.executor.extraJavaOptions=-XX:-UseCompressedOops"
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark's repl log4j profile: org/apache/spark/log4j-defaults-repl.properties
To adjust logging level use sc.setLogLevel("INFO")
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.5.1
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_60)
Type in expressions to have them evaluated.
Type :help for more information.
15/10/08 12:15:12 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
Spark context available as sc.
15/10/08 12:15:14 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
15/10/08 12:15:14 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
15/10/08 12:15:19 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
15/10/08 12:15:20 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
15/10/08 12:15:21 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/10/08 12:15:21 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
15/10/08 12:15:21 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
SQL context available as sqlContext.

scala> val x = sql("select 1 xx union all select 2")
x: org.apache.spark.sql.DataFrame = [xx: int]

scala> val y = sql("select 1 yy union all select 2")
y: org.apache.spark.sql.DataFrame = [yy: int]

scala> x.join(y, $"xx" === $"yy").count() /* expect 2, get 0 */
res0: Long = 0
{code}

This does give me the incorrect count.

> Incorrect empty join sets when executor-memory >= 32g
> -----------------------------------------------------
>
>                 Key: SPARK-10914
>                 URL: https://issues.apache.org/jira/browse/SPARK-10914
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.0, 1.5.1
>        Environment: Ubuntu 14.04 (spark-slave), 12.04 (master)
>            Reporter: Ben Moran
>
> Using an inner join to match together two integer columns, I generally get no results when there should be matches. The results vary and depend on whether the dataframes come from SQL, JSON, or a cache, as well as the order in which I cache things and query them.
> This minimal example reproduces it consistently for me in the spark-shell, on fresh installs of both 1.5.0 and 1.5.1 (pre-built against Hadoop 2.6 from http://spark.apache.org/downloads.html).
> {code}
> /* x is {"xx":1}{"xx":2} and y is just {"yy":1}{"yy":2} */
> val x = sql("select 1 xx union all select 2")
> val y = sql("select 1 yy union all select 2")
> x.join(y, $"xx" === $"yy").count() /* expect 2, get 0 */
>
> /* If I cache both tables it works: */
> x.cache()
> y.cache()
> x.join(y, $"xx" === $"yy").count() /* expect 2, get 2 */
>
> /* but this still doesn't work: */
> x.join(y, $"xx" === $"yy").filter("yy=1").count() /* expect 1, get 0 */
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
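For reference (not part of the original thread): the expected semantics of the failing query can be modeled with plain Scala collections, no Spark involved. The `JoinExpectation` name below is just for illustration. Note that HotSpot automatically turns off compressed oops for heaps of roughly 32 GB or more, which is presumably why passing `-XX:-UseCompressedOops` explicitly, as in the repro above, exercises the same code path as `executor-memory >= 32g`.

```scala
// Plain-Scala model of the failing query: two single-column tables
// holding the rows (1, 2), inner-joined on equality. A correct engine
// must return 2 matching rows, not the 0 observed in the bug report.
object JoinExpectation {
  def main(args: Array[String]): Unit = {
    val x = Seq(1, 2) // rows of column xx
    val y = Seq(1, 2) // rows of column yy
    val joined = for (xx <- x; yy <- y if xx == yy) yield (xx, yy)
    println(joined.size)                              // prints 2
    println(joined.count { case (_, yy) => yy == 1 }) // prints 1
  }
}
```

The second count mirrors the `filter("yy=1")` case from the issue description, which should keep exactly one of the two matched rows.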