[jira] [Assigned] (SPARK-12480) add Hash expression that can calculate hash value for a group of expressions
[ https://issues.apache.org/jira/browse/SPARK-12480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12480: Assignee: Apache Spark > add Hash expression that can calculate hash value for a group of expressions > > > Key: SPARK-12480 > URL: https://issues.apache.org/jira/browse/SPARK-12480 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Wenchen Fan >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12480) add Hash expression that can calculate hash value for a group of expressions
[ https://issues.apache.org/jira/browse/SPARK-12480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12480: Assignee: (was: Apache Spark) > add Hash expression that can calculate hash value for a group of expressions > > > Key: SPARK-12480 > URL: https://issues.apache.org/jira/browse/SPARK-12480 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12480) add Hash expression that can calculate hash value for a group of expressions
[ https://issues.apache.org/jira/browse/SPARK-12480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15068268#comment-15068268 ] Apache Spark commented on SPARK-12480: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/10435 > add Hash expression that can calculate hash value for a group of expressions > > > Key: SPARK-12480 > URL: https://issues.apache.org/jira/browse/SPARK-12480 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Wenchen Fan >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12479) sparkR collect on GroupedData throws R error "missing value where TRUE/FALSE needed"
Paulo Magalhaes created SPARK-12479:
---------------------------------------
Summary: sparkR collect on GroupedData throws R error "missing value where TRUE/FALSE needed"
Key: SPARK-12479
URL: https://issues.apache.org/jira/browse/SPARK-12479
Project: Spark
Issue Type: Bug
Components: R, SparkR
Affects Versions: 1.5.1
Reporter: Paulo Magalhaes

sparkR collect on GroupedData throws "missing value where TRUE/FALSE needed" (Spark version: 1.5.1, R version: 3.2.2).

I tracked down the root cause of this exception to a specific key for which the hashCode could not be calculated. The following code recreates the problem when run in SparkR:

hashCode <- getFromNamespace("hashCode", "SparkR")
hashCode("bc53d3605e8a5b7de1e8e271c2317645")
Error in if (value > .Machine$integer.max) { : missing value where TRUE/FALSE needed

I went one step further and realised that the problem happens because the bitwise shift below returns NA:

bitwShiftL(-1073741824, 1)

where bitwShiftL is an R function. I believe the bitwShiftL function is working as it is supposed to. Therefore, my PR will fix it in the SparkR package.

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
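The NA comes from R's overflow semantics: bitwShiftL() returns NA when a shift overflows a 32-bit signed integer, whereas Java-style hash codes (which SparkR's hashCode() mimics) simply wrap around. A minimal Python sketch of the wraparound behaviour the fix needs; the function name is illustrative, not SparkR's:

```python
def shift_left_32(value, positions):
    """Left-shift with Java-style 32-bit signed wraparound.

    R's bitwShiftL(-1073741824, 1) returns NA because the result
    overflows a 32-bit signed integer; Java instead wraps around to
    -2147483648, which is what a string hashCode implementation
    relies on.
    """
    result = (value << positions) & 0xFFFFFFFF  # keep the low 32 bits
    # Reinterpret the unsigned result as a signed 32-bit integer.
    return result - 0x100000000 if result >= 0x80000000 else result

print(shift_left_32(-1073741824, 1))  # -2147483648, not NA
```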
[jira] [Created] (SPARK-12480) add Hash expression that can calculate hash value for a group of expressions
Wenchen Fan created SPARK-12480: --- Summary: add Hash expression that can calculate hash value for a group of expressions Key: SPARK-12480 URL: https://issues.apache.org/jira/browse/SPARK-12480 Project: Spark Issue Type: New Feature Components: SQL Reporter: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
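The issue carries no description, but the intent is an expression that folds the hash values of several child expressions into a single hash. A hedged Python sketch of the general shape only: the eventual Spark implementation uses Murmur3 over each child's bytes, not this simple 31-multiplier combine, and the function name here is invented:

```python
def hash_expressions(values, seed=42):
    """Fold the hashes of several values into one 32-bit hash value.

    Illustrative only: a Hash expression over a group of expressions
    hashes each child in turn, carrying the running hash forward, so
    the result depends on every value and on their order.
    """
    h = seed
    for v in values:
        h = (31 * h + hash(v)) & 0xFFFFFFFF  # combine, stay within 32 bits
    return h
```

Because the running hash is threaded through, swapping two columns changes the result, which is the property a grouped hash expression needs.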
[jira] [Updated] (SPARK-12482) CLONE - Spark fileserver not started on same IP as using spark.driver.host
[ https://issues.apache.org/jira/browse/SPARK-12482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kyle Sutton updated SPARK-12482:
--------------------------------
Description:

The issue of the file server using the default IP instead of the IP address configured through {{spark.driver.host}} still exists in _Spark 1.5.2_.

The problem is that, while the file server is listening on all interfaces on the file server host, the _Spark_ service attempts to call back to the default IP of the host, to which it may or may not have connectivity.

For instance, the following setup causes a {{java.net.SocketTimeoutException}} when the _Spark_ service tries to contact the _Spark_ driver host for a JAR:

* Driver host has a default IP of {{192.168.1.2}} and a secondary LAN connection IP of {{172.30.0.2}}
* _Spark_ service is on the LAN with an IP of {{172.30.0.3}}
* A connection is made from the driver host to the _Spark_ service
** {{spark.driver.host}} is set to the IP of the driver host on the LAN, {{172.30.0.2}}
** {{spark.driver.port}} is set to {{50003}}
** {{spark.fileserver.port}} is set to {{50005}}
* Locally (on the driver host), the following listeners are active:
** {{0.0.0.0:50005}}
** {{172.30.0.2:50003}}
* The _Spark_ service calls back to the file server host for a JAR file using the driver host's default IP: {{http://192.168.1.2:50005/jars/code.jar}}
* The _Spark_ service, being on a different network than the driver host, cannot see the {{192.168.1.0/24}} address space, and fails to connect to the file server
** A {{netstat}} on the _Spark_ service host will show the connection to the file server host as being in {{SYN_SENT}} state until the process gives up trying to connect

{code:title=Stacktrace|borderStyle=solid}
15/12/22 12:48:33 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 172.30.0.3): java.net.SocketTimeoutException: connect timed out
	at java.net.PlainSocketImpl.socketConnect(Native Method)
	at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
	at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
	at java.net.Socket.connect(Socket.java:589)
	at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
	at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
	at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
	at sun.net.www.http.HttpClient.<init>(HttpClient.java:211)
	at sun.net.www.http.HttpClient.New(HttpClient.java:308)
	at sun.net.www.http.HttpClient.New(HttpClient.java:326)
	at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1169)
	at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1105)
	at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:999)
	at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:933)
	at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:555)
	at org.apache.spark.util.Utils$.fetchFile(Utils.scala:356)
	at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:405)
	at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:397)
	at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
	at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
	at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
	at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
	at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
	at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
	at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
	at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:397)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:193)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
{code}

Given that the configured {{spark.driver.host}} must necessarily be accessible by the _Spark_ service, it makes sense to reuse it for fileserver connections and any other _Spark_ service -> driver host connections.

Original report follows: I initially inquired about this here:
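The fix the description argues for can be sketched in a few lines: prefer the explicitly configured spark.driver.host for any service-to-driver callback, and fall back to the default local IP only when nothing was configured. This is a hedged Python sketch with a plain dict standing in for SparkConf, not Spark's actual code:

```python
import socket

def callback_host(conf):
    """Host that executors should use to reach back to the driver.

    Reuses spark.driver.host when set (executors can reach it by
    construction, since they already hold a driver connection on it);
    falling back to the default local IP is exactly what produces the
    unreachable http://192.168.1.2:50005/... URL in the report.
    """
    configured = conf.get("spark.driver.host")
    if configured:
        return configured
    return socket.gethostbyname(socket.gethostname())  # default-IP fallback

conf = {"spark.driver.host": "172.30.0.2", "spark.fileserver.port": "50005"}
jar_url = "http://%s:%s/jars/code.jar" % (callback_host(conf),
                                          conf["spark.fileserver.port"])
print(jar_url)  # http://172.30.0.2:50005/jars/code.jar
```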
[jira] [Resolved] (SPARK-12456) Add ExpressionDescription to misc functions
[ https://issues.apache.org/jira/browse/SPARK-12456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-12456. - Resolution: Fixed Assignee: Xiu(Joe) Guo Fix Version/s: 2.0.0 > Add ExpressionDescription to misc functions > --- > > Key: SPARK-12456 > URL: https://issues.apache.org/jira/browse/SPARK-12456 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Assignee: Xiu(Joe) Guo > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12482) Spark fileserver not started on same IP as configured in spark.driver.host
[ https://issues.apache.org/jira/browse/SPARK-12482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15068562#comment-15068562 ]

Kyle Sutton commented on SPARK-12482:
-------------------------------------

Wasn't able to reopen the original one, so I tried cloning to preserve as much info from the original as possible. Should I still create a new one?

> Spark fileserver not started on same IP as configured in spark.driver.host
> --------------------------------------------------------------------------
>
> Key: SPARK-12482
> URL: https://issues.apache.org/jira/browse/SPARK-12482
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.2.1, 1.5.2
> Reporter: Kyle Sutton
>
> The issue of the file server using the default IP instead of the IP address configured through {{spark.driver.host}} still exists in _Spark 1.5.2_.
> The problem is that, while the file server is listening on all interfaces on the file server host, the _Spark_ service attempts to call back to the default IP of the host, to which it may or may not have connectivity.
> For instance, the following setup causes a {{java.net.SocketTimeoutException}} when the _Spark_ service tries to contact the _Spark_ driver host for a JAR:
> * Driver host has a default IP of {{192.168.1.2}} and a secondary LAN connection IP of {{172.30.0.2}}
> * _Spark_ service is on the LAN with an IP of {{172.30.0.3}}
> * A connection is made from the driver host to the _Spark_ service
> ** {{spark.driver.host}} is set to the IP of the driver host on the LAN, {{172.30.0.2}}
> ** {{spark.driver.port}} is set to {{50003}}
> ** {{spark.fileserver.port}} is set to {{50005}}
> * Locally (on the driver host), the following listeners are active:
> ** {{0.0.0.0:50005}}
> ** {{172.30.0.2:50003}}
> * The _Spark_ service calls back to the file server host for a JAR file using the driver host's default IP: {{http://192.168.1.2:50005/jars/code.jar}}
> * The _Spark_ service, being on a different network than the driver host, cannot see the {{192.168.1.0/24}} address space, and fails to connect to
the > file server > ** A {{netstat}} on the _Spark_ service host will show the connection to the > file server host as being in {{SYN_SENT}} state until the process gives up > trying to connect > {code:title=Driver|borderStyle=solid} > SparkConf conf = new SparkConf() > .setMaster("spark://172.30.0.3:7077") > .setAppName("TestApp") > .set("spark.driver.host", "172.30.0.2") > .set("spark.driver.port", "50003") > .set("spark.fileserver.port", "50005"); > JavaSparkContext sc = new JavaSparkContext(conf); > sc.addJar("target/code.jar"); > {code} > {code:title=Stacktrace|borderStyle=solid} > 15/12/22 12:48:33 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, > 172.30.0.3): java.net.SocketTimeoutException: connect timed out > at java.net.PlainSocketImpl.socketConnect(Native Method) > at > java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350) > at > java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206) > at > java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188) > at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) > at java.net.Socket.connect(Socket.java:589) > at sun.net.NetworkClient.doConnect(NetworkClient.java:175) > at sun.net.www.http.HttpClient.openServer(HttpClient.java:432) > at sun.net.www.http.HttpClient.openServer(HttpClient.java:527) > at sun.net.www.http.HttpClient.(HttpClient.java:211) > at sun.net.www.http.HttpClient.New(HttpClient.java:308) > at sun.net.www.http.HttpClient.New(HttpClient.java:326) > at > sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1169) > at > sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1105) > at > sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:999) > at > sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:933) > at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:555) > at 
org.apache.spark.util.Utils$.fetchFile(Utils.scala:356) > at > org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:405) > at > org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:397) > at > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) > at > scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) > at > scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) > at >
[jira] [Comment Edited] (SPARK-6476) Spark fileserver not started on same IP as using spark.driver.host
[ https://issues.apache.org/jira/browse/SPARK-6476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15068572#comment-15068572 ]

Kyle Sutton edited comment on SPARK-6476 at 12/22/15 7:14 PM:
--------------------------------------------------------------

The issue of the file server using the default IP instead of the IP address configured through {{spark.driver.host}} still exists in _Spark 1.5.2_.

The problem is that, while the file server is listening on all interfaces on the file server host, the _Spark_ service attempts to call back to the default IP of the host, to which it may or may not have connectivity.

For instance, the following setup causes a {{java.net.SocketTimeoutException}} when the _Spark_ service tries to contact the _Spark_ driver host for a JAR:

* Driver host has a default IP of {{192.168.1.2}} and a secondary LAN connection IP of {{172.30.0.2}}
* _Spark_ service is on the LAN with an IP of {{172.30.0.3}}
* A connection is made from the driver host to the _Spark_ service
** {{spark.driver.host}} is set to the IP of the driver host on the LAN, {{172.30.0.2}}
** {{spark.driver.port}} is set to {{50003}}
** {{spark.fileserver.port}} is set to {{50005}}
* Locally (on the driver host), the following listeners are active:
** {{0.0.0.0:50005}}
** {{172.30.0.2:50003}}
* The _Spark_ service calls back to the file server host for a JAR file using the driver host's default IP: {{http://192.168.1.2:50005/jars/code.jar}}
* The _Spark_ service, being on a different network than the driver host, cannot see the {{192.168.1.0/24}} address space, and fails to connect to the file server
** A {{netstat}} on the _Spark_ service host will show the connection to the file server host as being in {{SYN_SENT}} state until the process gives up trying to connect

{code:title=Driver|borderStyle=solid}
SparkConf conf = new SparkConf()
    .setMaster("spark://172.30.0.3:7077")
    .setAppName("TestApp")
    .set("spark.driver.host", "172.30.0.2")
    .set("spark.driver.port", "50003")
    .set("spark.fileserver.port", "50005");
JavaSparkContext sc = new JavaSparkContext(conf); sc.addJar("target/code.jar"); {code} {code:title=Stacktrace|borderStyle=solid} 15/12/22 12:48:33 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 172.30.0.3): java.net.SocketTimeoutException: connect timed out at java.net.PlainSocketImpl.socketConnect(Native Method) at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350) at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206) at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188) at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) at java.net.Socket.connect(Socket.java:589) at sun.net.NetworkClient.doConnect(NetworkClient.java:175) at sun.net.www.http.HttpClient.openServer(HttpClient.java:432) at sun.net.www.http.HttpClient.openServer(HttpClient.java:527) at sun.net.www.http.HttpClient.(HttpClient.java:211) at sun.net.www.http.HttpClient.New(HttpClient.java:308) at sun.net.www.http.HttpClient.New(HttpClient.java:326) at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1169) at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1105) at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:999) at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:933) at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:555) at org.apache.spark.util.Utils$.fetchFile(Utils.scala:356) at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:405) at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:397) at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) at 
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226) at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39) at scala.collection.mutable.HashMap.foreach(HashMap.scala:98) at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:397) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:193) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at
[jira] [Updated] (SPARK-12488) LDA describeTopics() Generates Invalid Term IDs
[ https://issues.apache.org/jira/browse/SPARK-12488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ilya Ganelin updated SPARK-12488:
---------------------------------
Description:

When running the LDA model and using the describeTopics function, invalid values appear in the termID list that is returned. The example below generates 10 topics on a data set with a vocabulary of 685 terms.

{code}
// Set LDA parameters
val numTopics = 10
val lda = new LDA().setK(numTopics).setMaxIterations(10)
val ldaModel = lda.run(docTermVector)
val distModel = ldaModel.asInstanceOf[org.apache.spark.mllib.clustering.DistributedLDAModel]
{code}

{code}
scala> ldaModel.describeTopics()(0)._1.sorted.reverse
res40: Array[Int] = Array(2064860663, 2054149956, 1991041659, 1986948613, 1962816105, 1858775243, 1842920256, 1799900935, 1792510791, 1792371944, 1737877485, 1712816533, 1690397927, 1676379181, 1664181296, 1501782385, 1274389076, 1260230987, 1226545007, 1213472080, 1068338788, 1050509279, 714524034, 678227417, 678227086, 624763822, 624623852, 618552479, 616917682, 551612860, 453929488, 371443786, 183302140, 58762039, 42599819, 9947563, 617, 616, 615, 612, 603, 597, 596, 595, 594, 593, 592, 591, 590, 589, 588, 587, 586, 585, 584, 583, 582, 581, 580, 579, 578, 577, 576, 575, 574, 573, 572, 571, 570, 569, 568, 567, 566, 565, 564, 563, 562, 561, 560, 559, 558, 557, 556, 555, 554, 553, 552, 551, 550, 549, 548, 547, 546, 545, 544, 543, 542, 541, 540, 539, 538, 537, 536, 535, 534, 533, 532, 53...
{code}

{code}
scala> ldaModel.describeTopics()(0)._1.sorted
res41: Array[Int] = Array(-2087809139, -2001127319, -1979718998, -1833443915, -1811530305, -1765302237, -1668096260, -1527422175, -1493838005, -1452770216, -1452508395, -1452502074, -1452277147, -1451720206, -1450928740, -1450237612, -1448730073, -1437852514, -1420883015, -1418557080, -1397997340, -1397995485, -1397991169, -1374921919, -1360937376, -1360533511, -1320627329, -1314475604, -1216400643, -1210734882, -1107065297, -1063529036, -1062984222, -1042985412, -1009109620, -951707740, -894644371, -799531743, -627436045, -586317106, -563544698, -326546674, -174108802, -155900771, -80887355, -78916591, -26690004, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 4...
{code}

> LDA describeTopics() Generates Invalid Term IDs
> -----------------------------------------------
>
> Key: SPARK-12488
> URL: https://issues.apache.org/jira/browse/SPARK-12488
> Project: Spark
> Issue Type: Bug
> Components: MLlib
> Affects Versions: 1.5.2
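A quick way to see that the reported term IDs are bogus: with a vocabulary of 685 terms, every ID returned by describeTopics() must lie in [0, vocabSize). A small Python check (a hypothetical helper, not part of MLlib):

```python
def invalid_term_ids(term_ids, vocab_size):
    """Return the IDs that cannot index a vocabulary of vocab_size terms."""
    return [t for t in term_ids if not 0 <= t < vocab_size]

# A few values taken from the report, with the reported vocabulary size 685:
sample = [2064860663, 617, 616, -2087809139, 0, 684]
print(invalid_term_ids(sample, 685))  # [2064860663, -2087809139]
```

The billions-scale and negative values flagged by such a check look like raw hash values leaking through where vocabulary indices were expected.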
[jira] [Resolved] (SPARK-12471) Spark daemons should log their pid in the log file
[ https://issues.apache.org/jira/browse/SPARK-12471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-12471. - Resolution: Fixed Assignee: Nong Li (was: Apache Spark) Fix Version/s: 2.0.0 > Spark daemons should log their pid in the log file > -- > > Key: SPARK-12471 > URL: https://issues.apache.org/jira/browse/SPARK-12471 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Nong Li >Assignee: Nong Li > Fix For: 2.0.0 > > > This is useful when debugging from the log files without the processes > running. This information makes it possible to combine the log files with > other system information (e.g. dmesg output) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
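For reference, stamping a pid onto every log line is a one-line formatter change. This sketch uses Python's logging module as a stand-in (Spark's daemons use log4j, where the equivalent would be a pattern-layout field); the in-memory stream is only there to make the output inspectable:

```python
import io
import logging
import os

# Route records through an in-memory stream so the output can be inspected.
buf = io.StringIO()
handler = logging.StreamHandler(buf)
# %(process)d stamps each record with the emitting process's pid, which is
# what lets log lines be correlated with dmesg and other system output
# after the process has exited.
handler.setFormatter(logging.Formatter("%(process)d %(levelname)s %(message)s"))
logger = logging.getLogger("daemon-demo")
logger.addHandler(handler)
logger.warning("daemon started")

print(buf.getvalue())  # e.g. "12345 WARNING daemon started"
```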
[jira] [Resolved] (SPARK-12475) Upgrade Zinc from 0.3.5.3 to 0.3.9
[ https://issues.apache.org/jira/browse/SPARK-12475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-12475. Resolution: Fixed Assignee: Josh Rosen (was: Apache Spark) Fix Version/s: 2.0.0 Fixed by my PR for Spark 2.0.0 > Upgrade Zinc from 0.3.5.3 to 0.3.9 > -- > > Key: SPARK-12475 > URL: https://issues.apache.org/jira/browse/SPARK-12475 > Project: Spark > Issue Type: Improvement > Components: Build, Project Infra >Reporter: Josh Rosen >Assignee: Josh Rosen > Fix For: 2.0.0 > > > We should update to the latest version of Zinc in order to match our SBT > version. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12482) CLONE - Spark fileserver not started on same IP as using spark.driver.host
[ https://issues.apache.org/jira/browse/SPARK-12482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kyle Sutton updated SPARK-12482:
--------------------------------
Description:

The issue of the file server using the default IP instead of the IP address configured through {{spark.driver.host}} still exists in _Spark 1.5.2_.

The problem is that, while the file server is listening on all interfaces on the file server host, the _Spark_ service attempts to call back to the default IP of the host, to which it may or may not have connectivity.

For instance, the following setup causes a {{java.net.SocketTimeoutException}} when the _Spark_ service tries to contact the _Spark_ driver host for a JAR:

* Driver host has a default IP of {{192.168.1.2}} and a secondary LAN connection IP of {{172.30.0.2}}
* _Spark_ service is on the LAN with an IP of {{172.30.0.3}}
* A connection is made from the driver host to the _Spark_ service
** {{spark.driver.host}} is set to the IP of the driver host on the LAN, {{172.30.0.2}}
** {{spark.driver.port}} is set to {{50003}}
** {{spark.fileserver.port}} is set to {{50005}}
* Locally (on the driver host), the following listeners are active:
** {{0.0.0.0:50005}}
** {{172.30.0.2:50003}}
* The _Spark_ service calls back to the file server host for a JAR file using the driver host's default IP: {{http://192.168.1.2:50005/jars/code.jar}}
* The _Spark_ service, being on a different network than the driver host, cannot see the {{192.168.1.0/24}} address space, and fails to connect to the file server
** A {{netstat}} on the _Spark_ service host will show the connection to the file server host as being in {{SYN_SENT}} state until the process gives up trying to connect

{code:title=Driver|borderStyle=solid}
SparkConf conf = new SparkConf()
    .setMaster("spark://172.30.0.3:7077")
    .setAppName("TestApp")
    .set("spark.driver.host", "172.30.0.2")
    .set("spark.driver.port", "50003")
    .set("spark.fileserver.port", "50005");

JavaSparkContext sc = new JavaSparkContext(conf);
sc.addJar("target/code.jar"); {code} {code:title=Stacktrace|borderStyle=solid} 15/12/22 12:48:33 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 172.30.0.3): java.net.SocketTimeoutException: connect timed out at java.net.PlainSocketImpl.socketConnect(Native Method) at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350) at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206) at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188) at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) at java.net.Socket.connect(Socket.java:589) at sun.net.NetworkClient.doConnect(NetworkClient.java:175) at sun.net.www.http.HttpClient.openServer(HttpClient.java:432) at sun.net.www.http.HttpClient.openServer(HttpClient.java:527) at sun.net.www.http.HttpClient.(HttpClient.java:211) at sun.net.www.http.HttpClient.New(HttpClient.java:308) at sun.net.www.http.HttpClient.New(HttpClient.java:326) at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1169) at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1105) at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:999) at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:933) at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:555) at org.apache.spark.util.Utils$.fetchFile(Utils.scala:356) at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:405) at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:397) at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) at 
scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226) at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39) at scala.collection.mutable.HashMap.foreach(HashMap.scala:98) at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:397) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:193) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745)
[jira] [Commented] (SPARK-11437) createDataFrame shouldn't .take() when provided schema
[ https://issues.apache.org/jira/browse/SPARK-11437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15068627#comment-15068627 ]

Maciej Bryński commented on SPARK-11437:
----------------------------------------

[~davies] Are you sure that this patch is OK? Right now, if I'm creating a DataFrame from an RDD of Rows, there is no schema validation, so we can create a schema with wrong types:

{code}
from pyspark.sql.types import *
schema = StructType([StructField("id", IntegerType()), StructField("name", IntegerType())])
sqlCtx.createDataFrame(sc.parallelize([Row(id=1, name="abc")]), schema).collect()
[Row(id=1, name=None)]
{code}

Even better, columns can change places:

{code}
from pyspark.sql.types import *
schema = StructType([StructField("name", IntegerType())])
sqlCtx.createDataFrame(sc.parallelize([Row(id=1, name="abc")]), schema).collect()
[Row(name=1)]
{code}

> createDataFrame shouldn't .take() when provided schema
> ------------------------------------------------------
>
> Key: SPARK-11437
> URL: https://issues.apache.org/jira/browse/SPARK-11437
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Reporter: Jason White
> Assignee: Jason White
> Fix For: 1.6.0
>
> When creating a DataFrame from an RDD in PySpark, `createDataFrame` calls `.take(10)` to verify the first 10 rows of the RDD match the provided schema. Similar to https://issues.apache.org/jira/browse/SPARK-8070, but that issue affected cases where a schema was not provided.
> Verifying the first 10 rows is of limited utility and causes the DAG to be executed non-lazily. If necessary, I believe this verification should be done lazily on all rows. However, since the caller is providing a schema to follow, I think it's acceptable to simply fail if the schema is incorrect.
> https://github.com/apache/spark/blob/master/python/pyspark/sql/context.py#L321-L325

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
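The missing check the comment complains about can be sketched independently of Spark: validate each row's fields against the declared schema instead of silently coercing mismatches to None. A hedged Python sketch with dicts standing in for Row and StructType; this is not PySpark's implementation:

```python
def validate_row(row, schema):
    """Raise if a field's value does not match its declared type.

    `schema` maps field name -> Python type; None values pass, as a
    nullable field would. With a check like this, the Row(id=1,
    name="abc") vs IntegerType mismatch above would fail loudly
    instead of collecting as Row(id=1, name=None).
    """
    for name, expected in schema.items():
        value = row.get(name)
        if value is not None and not isinstance(value, expected):
            raise TypeError("field %r: expected %s, got %s"
                            % (name, expected.__name__, type(value).__name__))
    return True

schema = {"id": int, "name": str}
print(validate_row({"id": 1, "name": "abc"}, schema))  # True
```

Applied lazily per row (rather than via an eager .take(10)), this keeps the DAG lazy while still surfacing wrong schemas.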
[jira] [Created] (SPARK-12483) Data Frame as() does not work in Java
Andrew Davidson created SPARK-12483: --- Summary: Data Frame as() does not work in Java Key: SPARK-12483 URL: https://issues.apache.org/jira/browse/SPARK-12483 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.2 Environment: Mac El Cap 10.11.2 Java 8 Reporter: Andrew Davidson

Following unit test demonstrates a bug in as(). The column name for aliasDF was not changed

@Test
public void bugDataFrameAsTest() {
    DataFrame df = createData();
    df.printSchema();
    df.show();

    DataFrame aliasDF = df.select("id").as("UUID");
    aliasDF.printSchema();
    aliasDF.show();
}

DataFrame createData() {
    Features f1 = new Features(1, category1);
    Features f2 = new Features(2, category2);
    ArrayList data = new ArrayList(2);
    data.add(f1);
    data.add(f2);
    //JavaRDD rdd = javaSparkContext.parallelize(Arrays.asList(f1, f2));
    JavaRDD rdd = javaSparkContext.parallelize(data);
    DataFrame df = sqlContext.createDataFrame(rdd, Features.class);
    return df;
}

This is the output I got (without the spark log msgs)

root
 |-- id: integer (nullable = false)
 |-- labelStr: string (nullable = true)

+---+------------+
| id|    labelStr|
+---+------------+
|  1|       noise|
|  2|questionable|
+---+------------+

root
 |-- id: integer (nullable = false)

+---+
| id|
+---+
|  1|
|  2|
+---+
[jira] [Comment Edited] (SPARK-11437) createDataFrame shouldn't .take() when provided schema
[ https://issues.apache.org/jira/browse/SPARK-11437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15068627#comment-15068627 ] Maciej Bryński edited comment on SPARK-11437 at 12/22/15 8:24 PM: -- [~davies], [~jason.white] Are you sure that this patch is OK ? In 1.6.0 if I'm creating DataFrame from RDD of Rows there is no schema validation. So we can create schema with wrong types. {code} schema = StructType([StructField("id", IntegerType()), StructField("name", IntegerType())]) sqlCtx.createDataFrame(sc.parallelize([Row(id=1, name="abc")]), schema).collect() [Row(id=1, name=None)] {code} Even better. Column can change places. {code} schema = StructType([StructField("name", IntegerType())]) sqlCtx.createDataFrame(sc.parallelize([Row(id=1, name="abc")]), schema).collect() [Row(name=1)] {code} was (Author: maver1ck): [~davies], [~jason.white] Are you sure that this patch is OK ? Right now if I'm creating DataFrame from RDD of Rows there is no schema validation. So we can create schema with wrong types. {code} schema = StructType([StructField("id", IntegerType()), StructField("name", IntegerType())]) sqlCtx.createDataFrame(sc.parallelize([Row(id=1, name="abc")]), schema).collect() [Row(id=1, name=None)] {code} Even better. Column can change places. {code} schema = StructType([StructField("name", IntegerType())]) sqlCtx.createDataFrame(sc.parallelize([Row(id=1, name="abc")]), schema).collect() [Row(name=1)] {code} > createDataFrame shouldn't .take() when provided schema > -- > > Key: SPARK-11437 > URL: https://issues.apache.org/jira/browse/SPARK-11437 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Jason White >Assignee: Jason White > Fix For: 1.6.0 > > > When creating a DataFrame from an RDD in PySpark, `createDataFrame` calls > `.take(10)` to verify the first 10 rows of the RDD match the provided schema. 
> Similar to https://issues.apache.org/jira/browse/SPARK-8070, but that issue > affected cases where a schema was not provided. > Verifying the first 10 rows is of limited utility and causes the DAG to be > executed non-lazily. If necessary, I believe this verification should be done > lazily on all rows. However, since the caller is providing a schema to > follow, I think it's acceptable to simply fail if the schema is incorrect. > https://github.com/apache/spark/blob/master/python/pyspark/sql/context.py#L321-L325
[jira] [Assigned] (SPARK-12477) [SQL] Tungsten projection fails for null values in array fields
[ https://issues.apache.org/jira/browse/SPARK-12477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12477: Assignee: Apache Spark > [SQL] Tungsten projection fails for null values in array fields > --- > > Key: SPARK-12477 > URL: https://issues.apache.org/jira/browse/SPARK-12477 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2, 1.6.0 >Reporter: Pierre Borckmans >Assignee: Apache Spark > > Accessing null elements in an array field fails when tungsten is enabled. > It works in Spark 1.3.1, and in Spark > 1.5 with Tungsten disabled. > Example: > {code} > // Array of String > case class AS( as: Seq[String] ) > val dfAS = sc.parallelize( Seq( AS ( Seq("a",null,"b") ) ) ).toDF > dfAS.registerTempTable("T_AS") > for (i <- 0 to 2) { println(i + " = " + sqlContext.sql(s"select as[$i] from > T_AS").collect.mkString(","))} > {code} > With Tungsten disabled: > {code} > 0 = [a] > 1 = [null] > 2 = [b] > {code} > With Tungsten enabled: > {code} > 0 = [a] > 15/12/22 09:32:50 ERROR Executor: Exception in task 7.0 in stage 1.0 (TID 15) > java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.UnsafeRowWriters$UTF8StringWriter.getSize(UnsafeRowWriters.java:90) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown > Source) > at > org.apache.spark.sql.execution.TungstenProject$$anonfun$3$$anonfun$apply$3.apply(basicOperators.scala:90) > at > org.apache.spark.sql.execution.TungstenProject$$anonfun$3$$anonfun$apply$3.apply(basicOperators.scala:88) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > {code} > -- > More examples below. 
> The following code works in Spark 1.3.1, and in Spark > 1.5 with Tungsten > disabled: > {code} > // Array of String > case class AS( as: Seq[String] ) > val dfAS = sc.parallelize( Seq( AS ( Seq("a",null,"b") ) ) ).toDF > dfAS.registerTempTable("T_AS") > for (i <- 0 to 10) { println(i + " = " + sqlContext.sql(s"select as[$i] from > T_AS").collect.mkString(","))} > // Array of Int > case class AI( ai: Seq[Option[Int]] ) > val dfAI = sc.parallelize( Seq( AI ( Seq(Some(1),None,Some(2) ) ) ) ).toDF > dfAI.registerTempTable("T_AI") > for (i <- 0 to 10) { println(i + " = " + sqlContext.sql(s"select ai[$i] from > T_AI").collect.mkString(","))} > // Array of struct[Int,String] > case class B(x: Option[Int], y: String) > case class A( b: Seq[B] ) > val df1 = sc.parallelize( Seq( A ( Seq( B(Some(1),"a"),B(Some(2),"b"), > B(None, "c"), B(Some(4),null), B(None,null), null ) ) ) ).toDF > df1.registerTempTable("T1") > val df2 = sc.parallelize( Seq( A ( Seq( B(Some(1),"a"),B(Some(2),"b"), > B(None, "c"), B(Some(4),null), B(None,null), null ) ), A(null) ) ).toDF > df2.registerTempTable("T2") > for (i <- 0 to 10) { println(i + " = " + sqlContext.sql(s"select b[$i].x, > b[$i].y from T1").collect.mkString(","))} > for (i <- 0 to 10) { println(i + " = " + sqlContext.sql(s"select b[$i].x, > b[$i].y from T2").collect.mkString(","))} > // Struct[Int,String] > case class C(b: B) > val df3 = sc.parallelize( Seq( C ( B(Some(1),"test") ), C(null) ) ).toDF > df3.registerTempTable("T3") > sqlContext.sql("select b.x, b.y from T3").collect > {code} > With Tungsten enabled, it reaches NullPointerException.
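The behaviour the reporter expects from `select as[$i]` with Tungsten disabled can be modelled in a few lines. This is only a plain-Python sketch of null-safe element access, not the Catalyst/Tungsten code path:

```python
# Plain-Python model of null-safe array element access (the semantics the
# report shows with Tungsten disabled): out-of-range indices and null
# elements both yield None instead of raising.

def element_at(arr, i):
    """Null-tolerant indexing: None for null arrays, null elements,
    or out-of-range indices."""
    if arr is None or i < 0 or i >= len(arr):
        return None
    return arr[i]  # may itself be None, which is the expected answer

row = ["a", None, "b"]  # mirrors AS(Seq("a", null, "b"))
print([element_at(row, i) for i in range(4)])  # -> ['a', None, 'b', None]
```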
[jira] [Updated] (SPARK-12482) Spark fileserver not started on same IP as using spark.driver.host
[ https://issues.apache.org/jira/browse/SPARK-12482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kyle Sutton updated SPARK-12482: Summary: Spark fileserver not started on same IP as using spark.driver.host (was: CLONE - Spark fileserver not started on same IP as using spark.driver.host) > Spark fileserver not started on same IP as using spark.driver.host > -- > > Key: SPARK-12482 > URL: https://issues.apache.org/jira/browse/SPARK-12482 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.2.1, 1.5.2 >Reporter: Kyle Sutton > > The issue of the file server using the default IP instead of the IP address > configured through {{spark.driver.host}} still exists in _Spark 1.5.2_ > The problem is that, while the file server is listening on all ports on the > file server host, the _Spark_ service attempts to call back to the default > port of the host, to which it may or may not have connectivity. > For instance, the following setup causes a > {{java.net.SocketTimeoutException}} when the _Spark_ service tries to contact > the _Spark_ driver host for a JAR: > * Driver host has a default IP of {{192.168.1.2}} and a secondary LAN > connection IP of {{172.30.0.2}} > * _Spark_ service is on the LAN with an IP of {{172.30.0.3}} > * A connection is made from the driver host to the _Spark_ service > ** {{spark.driver.host}} is set to the IP of the driver host on the LAN > {{172.30.0.2}} > ** {{spark.driver.port}} is set to {{50003}} > ** {{spark.fileserver.port}} is set to {{50005}} > * Locally (on the driver host), the following listeners are active: > ** {{0.0.0.0:50005}} > ** {{172.30.0.2:50003}} > * The _Spark_ service calls back to the file server host for a JAR file using > the driver host's default IP: {{http://192.168.1.2:50005/jars/code.jar}} > * The _Spark_ service, being on a different network than the driver host, > cannot see the {{192.168.1.0/24}} address space, and fails to connect to the > file server > ** A 
{{netstat}} on the _Spark_ service host will show the connection to the > file server host as being in {{SYN_SENT}} state until the process gives up > trying to connect > {code:title=Driver|borderStyle=solid} > SparkConf conf = new SparkConf() > .setMaster("spark://172.30.0.3:7077") > .setAppName("TestApp") > .set("spark.driver.host", "172.30.0.2") > .set("spark.driver.port", "50003") > .set("spark.fileserver.port", "50005"); > JavaSparkContext sc = new JavaSparkContext(conf); > sc.addJar("target/code.jar"); > {code} > {code:title=Stacktrace|borderStyle=solid} > 15/12/22 12:48:33 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, > 172.30.0.3): java.net.SocketTimeoutException: connect timed out > at java.net.PlainSocketImpl.socketConnect(Native Method) > at > java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350) > at > java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206) > at > java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188) > at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) > at java.net.Socket.connect(Socket.java:589) > at sun.net.NetworkClient.doConnect(NetworkClient.java:175) > at sun.net.www.http.HttpClient.openServer(HttpClient.java:432) > at sun.net.www.http.HttpClient.openServer(HttpClient.java:527) > at sun.net.www.http.HttpClient.(HttpClient.java:211) > at sun.net.www.http.HttpClient.New(HttpClient.java:308) > at sun.net.www.http.HttpClient.New(HttpClient.java:326) > at > sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1169) > at > sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1105) > at > sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:999) > at > sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:933) > at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:555) > at org.apache.spark.util.Utils$.fetchFile(Utils.scala:356) > at > 
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:405) > at > org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:397) > at > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) > at > scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) > at > scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) > at > scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226) > at
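The fix the reporter implies is to bind and advertise the file server on the configured spark.driver.host rather than the machine's default interface. A minimal, Spark-free sketch, using 127.0.0.1 as a stand-in for the configured driver host (`start_fileserver` is a hypothetical helper, not a Spark API):

```python
import socket

# Bind the file server to the configured driver-host address and build the
# callback URL from the *bound* address, so the URL handed to executors is
# one they can actually reach.

def start_fileserver(driver_host, port=0):
    """Hypothetical helper: bind on driver_host, return (socket, jar URL)."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind((driver_host, port))  # not ("0.0.0.0", port)
    srv.listen(1)
    host, bound_port = srv.getsockname()
    return srv, "http://%s:%d/jars/code.jar" % (host, bound_port)

srv, url = start_fileserver("127.0.0.1")
print(url)  # advertises the configured host, never the default interface
srv.close()
```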
[jira] [Comment Edited] (SPARK-11437) createDataFrame shouldn't .take() when provided schema
[ https://issues.apache.org/jira/browse/SPARK-11437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15068627#comment-15068627 ] Maciej Bryński edited comment on SPARK-11437 at 12/22/15 7:45 PM: -- [~davies], [~jason.white] Are you sure that this patch is OK ? Right now if I'm creating DataFrame from RDD of Rows there is no schema validation. So we can create schema with wrong types. {code} schema = StructType([StructField("id", IntegerType()), StructField("name", IntegerType())]) sqlCtx.createDataFrame(sc.parallelize([Row(id=1, name="abc")]), schema).collect() [Row(id=1, name=None)] {code} Even better. Column can change places. {code} schema = StructType([StructField("name", IntegerType())]) sqlCtx.createDataFrame(sc.parallelize([Row(id=1, name="abc")]), schema).collect() [Row(name=1)] {code} was (Author: maver1ck): [~davies] Are you sure that this patch is OK ? Right now if I'm creating DataFrame from RDD of Rows there is no schema validation. So we can create schema with wrong types. {code} schema = StructType([StructField("id", IntegerType()), StructField("name", IntegerType())]) sqlCtx.createDataFrame(sc.parallelize([Row(id=1, name="abc")]), schema).collect() [Row(id=1, name=None)] {code} Even better. Column can change places. {code} schema = StructType([StructField("name", IntegerType())]) sqlCtx.createDataFrame(sc.parallelize([Row(id=1, name="abc")]), schema).collect() [Row(name=1)] {code} > createDataFrame shouldn't .take() when provided schema > -- > > Key: SPARK-11437 > URL: https://issues.apache.org/jira/browse/SPARK-11437 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Jason White >Assignee: Jason White > Fix For: 1.6.0 > > > When creating a DataFrame from an RDD in PySpark, `createDataFrame` calls > `.take(10)` to verify the first 10 rows of the RDD match the provided schema. > Similar to https://issues.apache.org/jira/browse/SPARK-8070, but that issue > affected cases where a schema was not provided. 
> Verifying the first 10 rows is of limited utility and causes the DAG to be > executed non-lazily. If necessary, I believe this verification should be done > lazily on all rows. However, since the caller is providing a schema to > follow, I think it's acceptable to simply fail if the schema is incorrect. > https://github.com/apache/spark/blob/master/python/pyspark/sql/context.py#L321-L325
[jira] [Commented] (SPARK-12462) Add ExpressionDescription to misc non-aggregate functions
[ https://issues.apache.org/jira/browse/SPARK-12462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15068670#comment-15068670 ] Apache Spark commented on SPARK-12462: -- User 'xguo27' has created a pull request for this issue: https://github.com/apache/spark/pull/10437 > Add ExpressionDescription to misc non-aggregate functions > - > > Key: SPARK-12462 > URL: https://issues.apache.org/jira/browse/SPARK-12462 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Assignee: Apache Spark >
[jira] [Commented] (SPARK-12484) DataFrame withColumn() does not work in Java
[ https://issues.apache.org/jira/browse/SPARK-12484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15068696#comment-15068696 ] Andrew Davidson commented on SPARK-12484: - What I am really trying to do is rewrite the following Python code in Java. Ideally I would implement this code as an MLlib transformation; however, that does not seem possible at this point in time using the Java API. Kind regards, Andy

def convertMultinomialLabelToBinary(dataFrame):
    newColName = "binomialLabel"
    binomial = udf(lambda labelStr: labelStr if (labelStr == "noise") else "signal", StringType())
    ret = dataFrame.withColumn(newColName, binomial(dataFrame["label"]))
    return ret

> DataFrame withColumn() does not work in Java > > > Key: SPARK-12484 > URL: https://issues.apache.org/jira/browse/SPARK-12484 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 > Environment: mac El Cap. 10.11.2 > Java 8 >Reporter: Andrew Davidson > Attachments: UDFTest.java > > > DataFrame transformerdDF = df.withColumn(fieldName, newCol); raises > org.apache.spark.sql.AnalysisException: resolved attribute(s) _c0#2 missing > from id#0,labelStr#1 in operator !Project [id#0,labelStr#1,_c0#2 AS > transformedByUDF#3]; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44) > at > 
org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914) > at org.apache.spark.sql.DataFrame.(DataFrame.scala:132) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154) > at org.apache.spark.sql.DataFrame.select(DataFrame.scala:691) > at org.apache.spark.sql.DataFrame.withColumn(DataFrame.scala:1150) > at com.pws.fantasySport.ml.UDFTest.test(UDFTest.java:75)
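For reference, the mapping inside the udf above is easy to verify in plain Python, independent of Spark:

```python
# The udf's mapping as ordinary Python: "noise" stays "noise", everything
# else collapses to "signal".

def binomial(label_str):
    return label_str if label_str == "noise" else "signal"

print([binomial(s) for s in ["noise", "questionable"]])  # -> ['noise', 'signal']
```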
[jira] [Updated] (SPARK-12485) Rename "dynamic allocation" to "elastic scaling"
[ https://issues.apache.org/jira/browse/SPARK-12485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-12485: -- Target Version/s: 2.0.0 > Rename "dynamic allocation" to "elastic scaling" > > > Key: SPARK-12485 > URL: https://issues.apache.org/jira/browse/SPARK-12485 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Andrew Or >Assignee: Andrew Or > > Fewer syllables, sounds more natural.
[jira] [Updated] (SPARK-12488) LDA Describe Topics Generates Invalid Term IDs
[ https://issues.apache.org/jira/browse/SPARK-12488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ilya Ganelin updated SPARK-12488: - Description: When running the LDA model, and using the describeTopics function, invalid values appear in the termID list that is returned: The below example generated 10 topics on a data set with a vocabulary of 685. {code} // Set LDA parameters val numTopics = 10 val lda = new LDA().setK(numTopics).setMaxIterations(10) val ldaModel = lda.run(docTermVector) val distModel = ldaModel.asInstanceOf[org.apache.spark.mllib.clustering.DistributedLDAModel] {code} {code} scala> ldaModel.describeTopics()(0)._1.sorted.reverse res40: Array[Int] = Array(2064860663, 2054149956, 1991041659, 1986948613, 1962816105, 1858775243, 1842920256, 1799900935, 1792510791, 1792371944, 1737877485, 1712816533, 1690397927, 1676379181, 1664181296, 1501782385, 1274389076, 1260230987, 1226545007, 1213472080, 1068338788, 1050509279, 714524034, 678227417, 678227086, 624763822, 624623852, 618552479, 616917682, 551612860, 453929488, 371443786, 183302140, 58762039, 42599819, 9947563, 617, 616, 615, 612, 603, 597, 596, 595, 594, 593, 592, 591, 590, 589, 588, 587, 586, 585, 584, 583, 582, 581, 580, 579, 578, 577, 576, 575, 574, 573, 572, 571, 570, 569, 568, 567, 566, 565, 564, 563, 562, 561, 560, 559, 558, 557, 556, 555, 554, 553, 552, 551, 550, 549, 548, 547, 546, 545, 544, 543, 542, 541, 540, 539, 538, 537, 536, 535, 534, 533, 532, 53... 
{code} {code} scala> ldaModel.describeTopics()(0)._1.sorted res41: Array[Int] = Array(-2087809139, -2001127319, -1979718998, -1833443915, -1811530305, -1765302237, -1668096260, -1527422175, -1493838005, -1452770216, -1452508395, -1452502074, -1452277147, -1451720206, -1450928740, -1450237612, -1448730073, -1437852514, -1420883015, -1418557080, -1397997340, -1397995485, -1397991169, -1374921919, -1360937376, -1360533511, -1320627329, -1314475604, -1216400643, -1210734882, -1107065297, -1063529036, -1062984222, -1042985412, -1009109620, -951707740, -894644371, -799531743, -627436045, -586317106, -563544698, -326546674, -174108802, -155900771, -80887355, -78916591, -26690004, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 4... {code} was: When running the LDA model, and using the describeTopics function, invalid values appear in the termID list that is returned: {code} // Set LDA parameters val numTopics = 10 val lda = new LDA().setK(numTopics).setMaxIterations(10) val ldaModel = lda.run(docTermVector) val distModel = ldaModel.asInstanceOf[org.apache.spark.mllib.clustering.DistributedLDAModel] {code} > LDA Describe Topics Generates Invalid Term IDs > -- > > Key: SPARK-12488 > URL: https://issues.apache.org/jira/browse/SPARK-12488 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.5.2 >Reporter: Ilya Ganelin > > When running the LDA model, and using the describeTopics function, invalid > values appear in the termID list that is returned: > The below example generated 10 topics on a data set with a vocabulary of 685. 
> {code} > // Set LDA parameters > val numTopics = 10 > val lda = new LDA().setK(numTopics).setMaxIterations(10) > val ldaModel = lda.run(docTermVector) > val distModel = > ldaModel.asInstanceOf[org.apache.spark.mllib.clustering.DistributedLDAModel] > {code} > {code} > scala> ldaModel.describeTopics()(0)._1.sorted.reverse > res40: Array[Int] = Array(2064860663, 2054149956, 1991041659, 1986948613, > 1962816105, 1858775243, 1842920256, 1799900935, 1792510791, 1792371944, > 1737877485, 1712816533, 1690397927, 1676379181, 1664181296, 1501782385, > 1274389076, 1260230987, 1226545007, 1213472080, 1068338788, 1050509279, > 714524034, 678227417, 678227086, 624763822, 624623852, 618552479, 616917682, > 551612860, 453929488, 371443786, 183302140, 58762039, 42599819, 9947563, 617, > 616, 615, 612, 603, 597, 596, 595, 594, 593, 592, 591, 590, 589, 588, 587, > 586, 585, 584, 583, 582, 581, 580, 579, 578, 577, 576, 575, 574, 573, 572, > 571, 570, 569, 568, 567, 566, 565, 564, 563, 562, 561, 560, 559, 558, 557, > 556, 555, 554, 553, 552, 551, 550, 549, 548, 547, 546, 545, 544, 543, 542, > 541, 540, 539, 538, 537, 536, 535, 534, 533, 532, 53... > {code} > {code} > scala> ldaModel.describeTopics()(0)._1.sorted > res41: Array[Int] = Array(-2087809139, -2001127319, -1979718998, -1833443915, > -1811530305, -1765302237, -1668096260, -1527422175, -1493838005, -1452770216, > -1452508395, -1452502074, -1452277147, -1451720206, -1450928740, -1450237612, > -1448730073, -1437852514, -1420883015, -1418557080, -1397997340, -1397995485, > -1397991169,
[jira] [Updated] (SPARK-12482) CLONE - Spark fileserver not started on same IP as using spark.driver.host
[ https://issues.apache.org/jira/browse/SPARK-12482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kyle Sutton updated SPARK-12482: Description: The issue of the file server using the default IP instead of the IP address configured through {{spark.driver.host}} still exists in _Spark 1.5.2_ The problem is that, while the file server is listening on all ports on the file server host, the _Spark_ service attempts to call back to the default port of the host, to which it may or may not have connectivity. For instance, the following setup causes a {{java.net.SocketTimeoutException}} when the _Spark_ service tries to contact the _Spark_ driver host for a JAR: * Launch host has a default IP of {{192.168.1.2}} and a secondary LAN connection IP of {{172.30.0.2}} * _Spark_ service is on the LAN with an IP of {{172.30.0.3}} * A connection is made from the launch host to the _Spark_ service ** {{spark.driver.host}} is set to the IP of the launch host on the LAN {{172.30.0.2}} ** {{spark.driver.port}} is set to {{50003}} ** {{spark.fileserver.port}} is set to {{50005}} * Locally (on the launch host), the following listeners are active: ** {{0.0.0.0:50005}} ** {{172.30.0.2:50003}} * The _Spark_ service calls back to the file server host for a JAR file using the file server host's default IP: {{http://192.168.1.2:50005/jars/code.jar}} * The _Spark_ service, being on a different network than the launch host, cannot see the {{192.168.1.0/24}} address space, and fails to connect to the file server ** A {{netstat}} on the _Spark_ service host will show the connection to the file server host as being in {{SYN_SENT}} state until the process gives up trying to connect {code:title=Stacktrace|borderStyle=solid} 15/12/22 12:48:33 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 172.30.0.3): java.net.SocketTimeoutException: connect timed out at java.net.PlainSocketImpl.socketConnect(Native Method) at 
java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350) at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206) at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188) at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) at java.net.Socket.connect(Socket.java:589) at sun.net.NetworkClient.doConnect(NetworkClient.java:175) at sun.net.www.http.HttpClient.openServer(HttpClient.java:432) at sun.net.www.http.HttpClient.openServer(HttpClient.java:527) at sun.net.www.http.HttpClient.(HttpClient.java:211) at sun.net.www.http.HttpClient.New(HttpClient.java:308) at sun.net.www.http.HttpClient.New(HttpClient.java:326) at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1169) at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1105) at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:999) at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:933) at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:555) at org.apache.spark.util.Utils$.fetchFile(Utils.scala:356) at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:405) at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:397) at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226) at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39) at scala.collection.mutable.HashMap.foreach(HashMap.scala:98) at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) at 
org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:397) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:193) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {code} Given that the configured {{spark.driver.host}} must necessarily be accessible by the _Spark_ service, it makes sense to reuse it for fileserver connections and any other _Spark_ service -> driver host connections. \\ \\ Original report follows: I initially inquired about this here:
[jira] [Commented] (SPARK-1865) Improve behavior of cleanup of disk state
[ https://issues.apache.org/jira/browse/SPARK-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15068542#comment-15068542 ] Charles Allen commented on SPARK-1865: -- This is compounded by the fact that some of the shutdown processes will stop() during the normal course of the main thread, then will fail to wait for completion if stop() is ALSO called via the shutdown hook. > Improve behavior of cleanup of disk state > - > > Key: SPARK-1865 > URL: https://issues.apache.org/jira/browse/SPARK-1865 > Project: Spark > Issue Type: Improvement > Components: Deploy, Spark Core >Reporter: Aaron Davidson > > Right now the behavior of disk cleanup is centered around the exit hook of > the executor, which attempts to cleanup shuffle files and disk manager > blocks, but may fail. We should make this behavior more predictable, perhaps > by letting the Standalone Worker cleanup the disk state, and adding a flag to > disable having the executor cleanup its own state.
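The stop() hazard Charles describes, entered from both the main thread and a shutdown hook, is usually handled by making stop() idempotent and having the second caller block until the first has finished. A hedged sketch of that pattern (class and method names are illustrative, not Spark's actual Executor code):

```python
import threading

# Idempotent stop(): the first caller does the cleanup; any later caller
# (e.g. a shutdown hook) waits until that cleanup has completed instead of
# returning early.

class Service:
    def __init__(self):
        self._stop_started = False
        self._stopped = threading.Event()
        self._lock = threading.Lock()

    def stop(self):
        with self._lock:
            first = not self._stop_started
            self._stop_started = True
        if first:
            # ... release disk state, shuffle files, etc. here ...
            self._stopped.set()
        else:
            self._stopped.wait()  # shutdown hook waits for completion

svc = Service()
svc.stop()
svc.stop()  # safe: returns once the first stop() has finished
print(svc._stopped.is_set())  # -> True
```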
[jira] [Updated] (SPARK-12483) Data Frame as() does not work in Java
[ https://issues.apache.org/jira/browse/SPARK-12483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Davidson updated SPARK-12483: Attachment: SPARK_12483_unitTest.java added a unit test file > Data Frame as() does not work in Java > - > > Key: SPARK-12483 > URL: https://issues.apache.org/jira/browse/SPARK-12483 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 > Environment: Mac El Cap 10.11.2 > Java 8 >Reporter: Andrew Davidson > Attachments: SPARK_12483_unitTest.java > > > Following unit test demonstrates a bug in as(). The column name for aliasDF > was not changed >@Test > public void bugDataFrameAsTest() { > DataFrame df = createData(); > df.printSchema(); > df.show(); > > DataFrame aliasDF = df.select("id").as("UUID"); > aliasDF.printSchema(); > aliasDF.show(); > } > DataFrame createData() { > Features f1 = new Features(1, category1); > Features f2 = new Features(2, category2); > ArrayList data = new ArrayList(2); > data.add(f1); > data.add(f2); > //JavaRDD rdd = > javaSparkContext.parallelize(Arrays.asList(f1, f2)); > JavaRDD rdd = javaSparkContext.parallelize(data); > DataFrame df = sqlContext.createDataFrame(rdd, Features.class); > return df; > } > This is the output I got (without the spark log msgs) > root > |-- id: integer (nullable = false) > |-- labelStr: string (nullable = true) > +---+------------+ > | id|    labelStr| > +---+------------+ > |  1|       noise| > |  2|questionable| > +---+------------+ > root > |-- id: integer (nullable = false) > +---+ > | id| > +---+ > |  1| > |  2| > +---+
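Worth noting for readers hitting this: in Spark, as() attaches an alias to the whole DataFrame (like SELECT ... FROM t AS u in SQL); renaming a column is a separate operation (withColumnRenamed, or col(...).alias(...) inside a select). A toy model of the distinction, not Spark's API:

```python
# Toy illustration of why as() left the column name unchanged: a dataset
# alias and a column rename are different operations. Frame, as_, and
# with_column_renamed are made-up names for this sketch.

class Frame:
    def __init__(self, columns, alias=None):
        self.columns = list(columns)
        self.alias = alias

    def as_(self, name):
        # Dataset alias: the columns are untouched.
        return Frame(self.columns, alias=name)

    def with_column_renamed(self, old, new):
        # Column rename: the alias is untouched.
        return Frame([new if c == old else c for c in self.columns], self.alias)

df = Frame(["id"])
print(df.as_("UUID").columns)                  # -> ['id']
print(df.with_column_renamed("id", "UUID").columns)  # -> ['UUID']
```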
[jira] [Commented] (SPARK-12488) LDA describeTopics() Generates Invalid Term IDs
[ https://issues.apache.org/jira/browse/SPARK-12488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15068761#comment-15068761 ] Ilya Ganelin commented on SPARK-12488: -- @jkbradley Would love your feedback here. Thanks! > LDA describeTopics() Generates Invalid Term IDs > --- > > Key: SPARK-12488 > URL: https://issues.apache.org/jira/browse/SPARK-12488 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.5.2 >Reporter: Ilya Ganelin > > When running the LDA model, and using the describeTopics function, invalid > values appear in the termID list that is returned: > The below example generates 10 topics on a data set with a vocabulary of 685. > {code} > // Set LDA parameters > val numTopics = 10 > val lda = new LDA().setK(numTopics).setMaxIterations(10) > val ldaModel = lda.run(docTermVector) > val distModel = > ldaModel.asInstanceOf[org.apache.spark.mllib.clustering.DistributedLDAModel] > {code} > {code} > scala> ldaModel.describeTopics()(0)._1.sorted.reverse > res40: Array[Int] = Array(2064860663, 2054149956, 1991041659, 1986948613, > 1962816105, 1858775243, 1842920256, 1799900935, 1792510791, 1792371944, > 1737877485, 1712816533, 1690397927, 1676379181, 1664181296, 1501782385, > 1274389076, 1260230987, 1226545007, 1213472080, 1068338788, 1050509279, > 714524034, 678227417, 678227086, 624763822, 624623852, 618552479, 616917682, > 551612860, 453929488, 371443786, 183302140, 58762039, 42599819, 9947563, 617, > 616, 615, 612, 603, 597, 596, 595, 594, 593, 592, 591, 590, 589, 588, 587, > 586, 585, 584, 583, 582, 581, 580, 579, 578, 577, 576, 575, 574, 573, 572, > 571, 570, 569, 568, 567, 566, 565, 564, 563, 562, 561, 560, 559, 558, 557, > 556, 555, 554, 553, 552, 551, 550, 549, 548, 547, 546, 545, 544, 543, 542, > 541, 540, 539, 538, 537, 536, 535, 534, 533, 532, 53... 
> {code} > {code} > scala> ldaModel.describeTopics()(0)._1.sorted > res41: Array[Int] = Array(-2087809139, -2001127319, -1979718998, -1833443915, > -1811530305, -1765302237, -1668096260, -1527422175, -1493838005, -1452770216, > -1452508395, -1452502074, -1452277147, -1451720206, -1450928740, -1450237612, > -1448730073, -1437852514, -1420883015, -1418557080, -1397997340, -1397995485, > -1397991169, -1374921919, -1360937376, -1360533511, -1320627329, -1314475604, > -1216400643, -1210734882, -1107065297, -1063529036, -1062984222, -1042985412, > -1009109620, -951707740, -894644371, -799531743, -627436045, -586317106, > -563544698, -326546674, -174108802, -155900771, -80887355, -78916591, > -26690004, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, > 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, > 38, 39, 40, 41, 42, 43, 44, 45, 4... > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
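One observation about the reported output, offered as a hypothesis rather than a confirmed diagnosis: the out-of-range term IDs (large positive and large negative 32-bit integers) look like raw Java hash codes rather than vocabulary indices, which would point at how the document-term vectors were built rather than at describeTopics() itself. The sketch below only shows that hashCode() routinely produces values far outside a 685-term vocabulary, in both signs; the method name is invented for illustration:

```java
// Hypothesis illustration: term IDs derived from hashCode() fall far outside
// a small vocabulary range such as [0, 685), and can be negative, matching
// the shape of the invalid IDs in the report.
public class HashLikeIds {
    public static int idFor(String term) {
        return term.hashCode(); // 32-bit value, any sign, unbounded by vocab size
    }

    public static void main(String[] args) {
        System.out.println(idFor("spark"));
        System.out.println(idFor("topic model"));
    }
}
```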
[jira] [Commented] (SPARK-12482) Spark fileserver not started on same IP as configured in spark.driver.host
[ https://issues.apache.org/jira/browse/SPARK-12482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15068568#comment-15068568 ] Kyle Sutton commented on SPARK-12482: - Thanks! I did. I think he's saying that fileserver is listening on all ports, but if the Spark service can't see the IP given it by the Spark driver, the ports are immaterial. > Spark fileserver not started on same IP as configured in spark.driver.host > -- > > Key: SPARK-12482 > URL: https://issues.apache.org/jira/browse/SPARK-12482 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.2.1, 1.5.2 >Reporter: Kyle Sutton > > The issue of the file server using the default IP instead of the IP address > configured through {{spark.driver.host}} still exists in _Spark 1.5.2_ > The problem is that, while the file server is listening on all ports on the > file server host, the _Spark_ service attempts to call back to the default > port of the host, to which it may or may not have connectivity. 
> For instance, the following setup causes a > {{java.net.SocketTimeoutException}} when the _Spark_ service tries to contact > the _Spark_ driver host for a JAR: > * Driver host has a default IP of {{192.168.1.2}} and a secondary LAN > connection IP of {{172.30.0.2}} > * _Spark_ service is on the LAN with an IP of {{172.30.0.3}} > * A connection is made from the driver host to the _Spark_ service > ** {{spark.driver.host}} is set to the IP of the driver host on the LAN > {{172.30.0.2}} > ** {{spark.driver.port}} is set to {{50003}} > ** {{spark.fileserver.port}} is set to {{50005}} > * Locally (on the driver host), the following listeners are active: > ** {{0.0.0.0:50005}} > ** {{172.30.0.2:50003}} > * The _Spark_ service calls back to the file server host for a JAR file using > the driver host's default IP: {{http://192.168.1.2:50005/jars/code.jar}} > * The _Spark_ service, being on a different network than the driver host, > cannot see the {{192.168.1.0/24}} address space, and fails to connect to the > file server > ** A {{netstat}} on the _Spark_ service host will show the connection to the > file server host as being in {{SYN_SENT}} state until the process gives up > trying to connect > {code:title=Driver|borderStyle=solid} > SparkConf conf = new SparkConf() > .setMaster("spark://172.30.0.3:7077") > .setAppName("TestApp") > .set("spark.driver.host", "172.30.0.2") > .set("spark.driver.port", "50003") > .set("spark.fileserver.port", "50005"); > JavaSparkContext sc = new JavaSparkContext(conf); > sc.addJar("target/code.jar"); > {code} > {code:title=Stacktrace|borderStyle=solid} > 15/12/22 12:48:33 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, > 172.30.0.3): java.net.SocketTimeoutException: connect timed out > at java.net.PlainSocketImpl.socketConnect(Native Method) > at > java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350) > at > java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206) > at > 
java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188) > at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) > at java.net.Socket.connect(Socket.java:589) > at sun.net.NetworkClient.doConnect(NetworkClient.java:175) > at sun.net.www.http.HttpClient.openServer(HttpClient.java:432) > at sun.net.www.http.HttpClient.openServer(HttpClient.java:527) > at sun.net.www.http.HttpClient.(HttpClient.java:211) > at sun.net.www.http.HttpClient.New(HttpClient.java:308) > at sun.net.www.http.HttpClient.New(HttpClient.java:326) > at > sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1169) > at > sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1105) > at > sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:999) > at > sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:933) > at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:555) > at org.apache.spark.util.Utils$.fetchFile(Utils.scala:356) > at > org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:405) > at > org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:397) > at > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) > at > scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) > at > scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) > at >
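The fix direction discussed in this ticket, binding the file server to the configured spark.driver.host instead of advertising the machine's default IP, can be illustrated with plain java.net. This is a hedged sketch of the idea only (the helper name is invented; Spark's HttpServer actually uses Jetty):

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

// Illustration of the proposed behavior: bind the server socket to the host
// configured in spark.driver.host, so remote executors are given an address
// on an interface they can actually reach, rather than the default IP.
public class BindToConfiguredHost {
    public static String boundHost(String configuredHost) {
        try (ServerSocket server = new ServerSocket()) {
            // Port 0 lets the OS pick a free port; the bind address is the
            // important part for this ticket.
            server.bind(new InetSocketAddress(configuredHost, 0));
            return server.getInetAddress().getHostAddress();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // Loopback stands in for the LAN IP (e.g. 172.30.0.2 in the report).
        System.out.println(boundHost("127.0.0.1"));
    }
}
```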
[jira] [Updated] (SPARK-12484) DataFrame withColumn() does not work in Java
[ https://issues.apache.org/jira/browse/SPARK-12484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Davidson updated SPARK-12484: Attachment: UDFTest.java Add a unit test file > DataFrame withColumn() does not work in Java > > > Key: SPARK-12484 > URL: https://issues.apache.org/jira/browse/SPARK-12484 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 > Environment: mac El Cap. 10.11.2 > Java 8 >Reporter: Andrew Davidson > Attachments: UDFTest.java > > > DataFrame transformerdDF = df.withColumn(fieldName, newCol); raises > org.apache.spark.sql.AnalysisException: resolved attribute(s) _c0#2 missing > from id#0,labelStr#1 in operator !Project [id#0,labelStr#1,_c0#2 AS > transformedByUDF#3]; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914) > at org.apache.spark.sql.DataFrame.(DataFrame.scala:132) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154) > at org.apache.spark.sql.DataFrame.select(DataFrame.scala:691) > at org.apache.spark.sql.DataFrame.withColumn(DataFrame.scala:1150) > at com.pws.fantasySport.ml.UDFTest.test(UDFTest.java:75) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: 
issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12486) Executors are not always terminated successfully by the worker.
Nong Li created SPARK-12486: --- Summary: Executors are not always terminated successfully by the worker. Key: SPARK-12486 URL: https://issues.apache.org/jira/browse/SPARK-12486 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Nong Li There are cases when the executor is not killed successfully by the worker. One way this can happen is if the executor is in a bad state, fails to heartbeat, and the master tells the worker to kill the executor. The executor is in such a bad state that the kill request is ignored; this can happen, for example, when the executor is in heavy GC. The cause is that the Process.destroy() API is not forceful enough. In Java 8, a new API, destroyForcibly(), was added. We should use that if available.
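The "use destroyForcibly() if available" idea can be sketched with reflection, so the same code still loads on Java 7 and falls back to destroy() there. This is an illustrative sketch under that assumption, not the actual Spark patch:

```java
import java.lang.reflect.Method;

// Sketch of the proposal in SPARK-12486: prefer Process.destroyForcibly()
// (added in Java 8) when it exists, fall back to Process.destroy() otherwise.
// Reflection keeps the code loadable on Java 7, where the method is absent.
public class ForcefulKill {
    public static boolean hasDestroyForcibly() {
        try {
            Process.class.getMethod("destroyForcibly");
            return true;
        } catch (NoSuchMethodException e) {
            return false;
        }
    }

    public static void kill(Process p) {
        try {
            if (hasDestroyForcibly()) {
                // Forceful kill: not ignorable the way a polite destroy() can be
                // when the target is wedged (e.g. in heavy GC).
                Process.class.getMethod("destroyForcibly").invoke(p);
            } else {
                p.destroy(); // best-effort kill on Java 7
            }
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("destroyForcibly available: " + hasDestroyForcibly());
    }
}
```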
[jira] [Updated] (SPARK-12488) LDA describeTopics() Generates Invalid Term IDs
[ https://issues.apache.org/jira/browse/SPARK-12488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ilya Ganelin updated SPARK-12488: - Summary: LDA describeTopics() Generates Invalid Term IDs (was: LDA Describe Topics Generates Invalid Term IDs) > LDA describeTopics() Generates Invalid Term IDs > --- > > Key: SPARK-12488 > URL: https://issues.apache.org/jira/browse/SPARK-12488 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.5.2 >Reporter: Ilya Ganelin > > When running the LDA model, and using the describeTopics function, invalid > values appear in the termID list that is returned: > The below example generated 10 topics on a data set with a vocabulary of 685. > {code} > // Set LDA parameters > val numTopics = 10 > val lda = new LDA().setK(numTopics).setMaxIterations(10) > val ldaModel = lda.run(docTermVector) > val distModel = > ldaModel.asInstanceOf[org.apache.spark.mllib.clustering.DistributedLDAModel] > {code} > {code} > scala> ldaModel.describeTopics()(0)._1.sorted.reverse > res40: Array[Int] = Array(2064860663, 2054149956, 1991041659, 1986948613, > 1962816105, 1858775243, 1842920256, 1799900935, 1792510791, 1792371944, > 1737877485, 1712816533, 1690397927, 1676379181, 1664181296, 1501782385, > 1274389076, 1260230987, 1226545007, 1213472080, 1068338788, 1050509279, > 714524034, 678227417, 678227086, 624763822, 624623852, 618552479, 616917682, > 551612860, 453929488, 371443786, 183302140, 58762039, 42599819, 9947563, 617, > 616, 615, 612, 603, 597, 596, 595, 594, 593, 592, 591, 590, 589, 588, 587, > 586, 585, 584, 583, 582, 581, 580, 579, 578, 577, 576, 575, 574, 573, 572, > 571, 570, 569, 568, 567, 566, 565, 564, 563, 562, 561, 560, 559, 558, 557, > 556, 555, 554, 553, 552, 551, 550, 549, 548, 547, 546, 545, 544, 543, 542, > 541, 540, 539, 538, 537, 536, 535, 534, 533, 532, 53... 
> {code} > {code} > scala> ldaModel.describeTopics()(0)._1.sorted > res41: Array[Int] = Array(-2087809139, -2001127319, -1979718998, -1833443915, > -1811530305, -1765302237, -1668096260, -1527422175, -1493838005, -1452770216, > -1452508395, -1452502074, -1452277147, -1451720206, -1450928740, -1450237612, > -1448730073, -1437852514, -1420883015, -1418557080, -1397997340, -1397995485, > -1397991169, -1374921919, -1360937376, -1360533511, -1320627329, -1314475604, > -1216400643, -1210734882, -1107065297, -1063529036, -1062984222, -1042985412, > -1009109620, -951707740, -894644371, -799531743, -627436045, -586317106, > -563544698, -326546674, -174108802, -155900771, -80887355, -78916591, > -26690004, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, > 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, > 38, 39, 40, 41, 42, 43, 44, 45, 4... > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12482) CLONE - Spark fileserver not started on same IP as using spark.driver.host
[ https://issues.apache.org/jira/browse/SPARK-12482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kyle Sutton updated SPARK-12482: Description: The issue of the file server using the default IP instead of the IP address configured through {{spark.driver.host}} still exists in _Spark 1.5.2_ The problem is that, while the file server is listening on all ports on the file server host, the _Spark_ service attempts to call back to the default port of the host, to which it may or may not have connectivity. For instance, the following setup causes a {{java.net.SocketTimeoutException}} when the _Spark_ service tries to contact the _Spark_ driver host for a JAR: * Launch host has a default IP of {{192.168.1.2}} and a secondary LAN connection IP of {{172.30.0.2}} * A _Spark_ connection is made from the launch host to a service running on that LAN, {{172.30.0.0/16}} ** {{spark.driver.host}} is set to the IP of the launch host on that network {{172.30.0.2}} ** {{spark.driver.port}} is set to {{50003}} ** {{spark.fileserver.port}} is set to {{50005}} * Locally (on the launch host), the following listeners are active: ** {{0.0.0.0:50005}} ** {{172.30.0.2:50003}} * The _Spark_ service calls back to the file server host for a JAR file using the file server host's default IP: {{http://192.168.1.2:50005/jars/code.jar}} * The _Spark_ service, being on a different network than the launch host, cannot see the {{192.168.1.0/24}} address space, and fails to connect to the file server ** A {{netstat}} on the _Spark_ service host will show the connection to the file server host as being in {{SYN_SENT}} state until the process gives up trying to connect Given that the configured {{spark.driver.host}} must necessarily be accessible by the _Spark_ service, it makes sense to reuse it for fileserver connections and any other _Spark_ service -> driver host connections. 
\\ \\ Original report follows: I initially inquired about this here: http://mail-archives.apache.org/mod_mbox/spark-user/201503.mbox/%3ccalq9kxcn2mwfnd4r4k0q+qh1ypwn3p8rgud1v6yrx9_05lv...@mail.gmail.com%3E If the Spark driver host has multiple IPs and spark.driver.host is set to one of them, I would expect the fileserver to start on the same IP. I checked HttpServer and the jetty Server is started the default IP of the machine: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/HttpServer.scala#L75 Something like this might work instead: {code:title=HttpServer.scala#L75} val server = new Server(new InetSocketAddress(conf.get("spark.driver.host"), 0)) {code} was: The issue of the file server using the default IP instead of the IP address configured through {{spark.driver.host}} still exists in _Spark 1.5.2_ The problem is that, while the file server is listening on all ports on the file server host, the _Spark_ service attempts to call back to the default port of the host, to which it may or may not have connectivity. 
For instance, the following setup causes a {{java.net.SocketTimeoutException}} when the _Spark_ service tries to contact the _Spark_ driver host for a JAR: * Launch host has a default IP of {{192.168.1.2}} and a secondary LAN connection IP of {{172.30.0.2}} * A _Spark_ connection is made from the launch host to a service running on that LAN, {{172.30.0.0/16}} ** {{spark.driver.host}} is set to the IP of that _Spark_ service host {{172.30.0.3}} ** {{spark.driver.port}} is set to {{50003}} ** {{spark.fileserver.port}} is set to {{50005}} * Locally (on the launch host), the following listeners are active: ** {{0.0.0.0:50005}} ** {{172.30.0.2:50003}} * The _Spark_ service calls back to the file server host for a JAR file using the file server host's default IP: {{http://192.168.1.2:50005/jars/code.jar}} * The _Spark_ service, being on a different network than the launch host, cannot see the {{192.168.1.0/24}} address space, and fails to connect to the file server ** A {{netstat}} on the _Spark_ service host will show the connection to the file server host as being in {{SYN_SENT}} state until the process gives up trying to connect Given that the configured {{spark.driver.host}} must necessarily be accessible by the _Spark_ service, it makes sense to reuse it for fileserver connections and any other _Spark_ service -> driver host connections. \\ \\ Original report follows: I initially inquired about this here: http://mail-archives.apache.org/mod_mbox/spark-user/201503.mbox/%3ccalq9kxcn2mwfnd4r4k0q+qh1ypwn3p8rgud1v6yrx9_05lv...@mail.gmail.com%3E If the Spark driver host has multiple IPs and spark.driver.host is set to one of them, I would expect the fileserver to start on the same IP. I checked HttpServer and the jetty Server is started the default IP of the machine: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/HttpServer.scala#L75 Something like this might work
[jira] [Commented] (SPARK-12484) DataFrame withColumn() does not work in Java
[ https://issues.apache.org/jira/browse/SPARK-12484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15068698#comment-15068698 ] Andrew Davidson commented on SPARK-12484: - related issue https://issues.apache.org/jira/browse/SPARK-12483 > DataFrame withColumn() does not work in Java > > > Key: SPARK-12484 > URL: https://issues.apache.org/jira/browse/SPARK-12484 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 > Environment: mac El Cap. 10.11.2 > Java 8 >Reporter: Andrew Davidson > Attachments: UDFTest.java > > > DataFrame transformerdDF = df.withColumn(fieldName, newCol); raises > org.apache.spark.sql.AnalysisException: resolved attribute(s) _c0#2 missing > from id#0,labelStr#1 in operator !Project [id#0,labelStr#1,_c0#2 AS > transformedByUDF#3]; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914) > at org.apache.spark.sql.DataFrame.(DataFrame.scala:132) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154) > at org.apache.spark.sql.DataFrame.select(DataFrame.scala:691) > at org.apache.spark.sql.DataFrame.withColumn(DataFrame.scala:1150) > at com.pws.fantasySport.ml.UDFTest.test(UDFTest.java:75)
[jira] [Created] (SPARK-12488) LDA Describe Topics Generates Invalid Term IDs
Ilya Ganelin created SPARK-12488: Summary: LDA Describe Topics Generates Invalid Term IDs Key: SPARK-12488 URL: https://issues.apache.org/jira/browse/SPARK-12488 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.5.2 Reporter: Ilya Ganelin When running the LDA model, and using the describeTopics function, invalid values appear in the termID list that is returned: {code} // Set LDA parameters val numTopics = 10 val lda = new LDA().setK(numTopics).setMaxIterations(10) val ldaModel = lda.run(docTermVector) val distModel = ldaModel.asInstanceOf[org.apache.spark.mllib.clustering.DistributedLDAModel] {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12487) Add docs for Kafka message handler
[ https://issues.apache.org/jira/browse/SPARK-12487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15068748#comment-15068748 ] Apache Spark commented on SPARK-12487: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/10439 > Add docs for Kafka message handler > -- > > Key: SPARK-12487 > URL: https://issues.apache.org/jira/browse/SPARK-12487 > Project: Spark > Issue Type: Documentation > Components: Documentation >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12489) Fix minor issues found by Findbugs
Shixiong Zhu created SPARK-12489: Summary: Fix minor issues found by Findbugs Key: SPARK-12489 URL: https://issues.apache.org/jira/browse/SPARK-12489 Project: Spark Issue Type: Bug Components: MLlib, Spark Core, SQL Reporter: Shixiong Zhu Priority: Minor Just used FindBugs to scan the code and fixed some real issues: 1. Close `java.sql.Statement`. 2. Fix incorrect `asInstanceOf`. 3. Remove unnecessary `synchronized` and `ReentrantLock`.
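The first item, an unclosed java.sql.Statement, is the classic resource leak that try-with-resources closes on every exit path, including exceptions. The sketch below uses a stand-in AutoCloseable so it runs without a JDBC driver; with real JDBC the resource would be a Statement obtained from a Connection:

```java
// Pattern sketch for the "close java.sql.Statement" fix: try-with-resources
// guarantees close() runs even if the body throws. FakeStatement stands in
// for java.sql.Statement so the example is self-contained.
public class CloseStatement {
    static class FakeStatement implements AutoCloseable {
        boolean closed = false;
        @Override public void close() { closed = true; }
    }

    public static FakeStatement runQuery() {
        FakeStatement stmt = new FakeStatement();
        try (FakeStatement s = stmt) {
            // execute the query here; s.close() runs on any exit path
        }
        return stmt;
    }

    public static void main(String[] args) {
        System.out.println("closed: " + runQuery().closed);
    }
}
```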
[jira] [Assigned] (SPARK-12490) Don't use Javascript for web UI's paginated table navigation controls
[ https://issues.apache.org/jira/browse/SPARK-12490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12490: Assignee: Apache Spark (was: Josh Rosen) > Don't use Javascript for web UI's paginated table navigation controls > - > > Key: SPARK-12490 > URL: https://issues.apache.org/jira/browse/SPARK-12490 > Project: Spark > Issue Type: Improvement > Components: Web UI >Reporter: Josh Rosen >Assignee: Apache Spark > > The web UI's paginated table uses Javascript to implement certain navigation > controls, such as table sorting and the "go to page" form. This is > unnecessary and should be simplified to use plain HTML form controls and > links. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12102) Cast a non-nullable struct field to a nullable field during analysis
[ https://issues.apache.org/jira/browse/SPARK-12102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-12102: - Target Version/s: (was: 1.6.0) > Cast a non-nullable struct field to a nullable field during analysis > > > Key: SPARK-12102 > URL: https://issues.apache.org/jira/browse/SPARK-12102 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Assignee: Dilip Biswal > Fix For: 2.0.0 > > > If you try {{sqlContext.sql("select case when 1>0 then struct(1, 2, 3, > cast(hash(4) as int)) else struct(1, 2, 3, 4) end").printSchema}}, you will > see {{org.apache.spark.sql.AnalysisException: cannot resolve 'CASE WHEN (1 > > 0) THEN > struct(1,2,3,cast(HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFHash(4) > as int)) ELSE struct(1,2,3,4)' due to data type mismatch: THEN and ELSE > expressions should all be same type or coercible to a common type; line 1 pos > 85}}. > The problem is the nullability difference between {{4}} (non-nullable) and > {{hash(4)}} (nullable). > Seems it makes sense to cast the nullability in the analysis. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12483) Data Frame as() does not work in Java
[ https://issues.apache.org/jira/browse/SPARK-12483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15069020#comment-15069020 ] Xiao Li commented on SPARK-12483: - That means, it is not a bug. Thanks! > Data Frame as() does not work in Java > - > > Key: SPARK-12483 > URL: https://issues.apache.org/jira/browse/SPARK-12483 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 > Environment: Mac El Cap 10.11.2 > Java 8 >Reporter: Andrew Davidson > Attachments: SPARK_12483_unitTest.java > > > Following unit test demonstrates a bug in as(). The column name for aliasDF > was not changed >@Test > public void bugDataFrameAsTest() { > DataFrame df = createData(); > df.printSchema(); > df.show(); > > DataFrame aliasDF = df.select("id").as("UUID"); > aliasDF.printSchema(); > aliasDF.show(); > } > DataFrame createData() { > Features f1 = new Features(1, category1); > Features f2 = new Features(2, category2); > ArrayList data = new ArrayList(2); > data.add(f1); > data.add(f2); > //JavaRDD rdd = > javaSparkContext.parallelize(Arrays.asList(f1, f2)); > JavaRDD rdd = javaSparkContext.parallelize(data); > DataFrame df = sqlContext.createDataFrame(rdd, Features.class); > return df; > } > This is the output I got (without the spark log msgs) > root > |-- id: integer (nullable = false) > |-- labelStr: string (nullable = true) > +---++ > | id|labelStr| > +---++ > | 1| noise| > | 2|questionable| > +---++ > root > |-- id: integer (nullable = false) > +---+ > | id| > +---+ > | 1| > | 2| > +---+ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12483) Data Frame as() does not work in Java
[ https://issues.apache.org/jira/browse/SPARK-12483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15069027#comment-15069027 ] Andrew Davidson commented on SPARK-12483: - Hi Xiao thanks for looking at the issue is there a way to change a column name? If you do a select() using a data frame, the column name is really strange see attachement for https://issues.apache.org/jira/browse/SPARK-12484 // get column from data frame call df.withColumnName Column newCol = udfDF.col("_c0"); renaming data frame columns is very common in R Kind regards Andy > Data Frame as() does not work in Java > - > > Key: SPARK-12483 > URL: https://issues.apache.org/jira/browse/SPARK-12483 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 > Environment: Mac El Cap 10.11.2 > Java 8 >Reporter: Andrew Davidson > Attachments: SPARK_12483_unitTest.java > > > Following unit test demonstrates a bug in as(). The column name for aliasDF > was not changed >@Test > public void bugDataFrameAsTest() { > DataFrame df = createData(); > df.printSchema(); > df.show(); > > DataFrame aliasDF = df.select("id").as("UUID"); > aliasDF.printSchema(); > aliasDF.show(); > } > DataFrame createData() { > Features f1 = new Features(1, category1); > Features f2 = new Features(2, category2); > ArrayList data = new ArrayList(2); > data.add(f1); > data.add(f2); > //JavaRDD rdd = > javaSparkContext.parallelize(Arrays.asList(f1, f2)); > JavaRDD rdd = javaSparkContext.parallelize(data); > DataFrame df = sqlContext.createDataFrame(rdd, Features.class); > return df; > } > This is the output I got (without the spark log msgs) > root > |-- id: integer (nullable = false) > |-- labelStr: string (nullable = true) > +---++ > | id|labelStr| > +---++ > | 1| noise| > | 2|questionable| > +---++ > root > |-- id: integer (nullable = false) > +---+ > | id| > +---+ > | 1| > | 2| > +---+ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: 
issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12478) Dataset fields of product types can't be null
[ https://issues.apache.org/jira/browse/SPARK-12478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15069044#comment-15069044 ] Cheng Lian commented on SPARK-12478: I'm leaving this ticket open since we also need to backport this to branch-1.6 after the release. > Dataset fields of product types can't be null > - > > Key: SPARK-12478 > URL: https://issues.apache.org/jira/browse/SPARK-12478 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0, 2.0.0 >Reporter: Cheng Lian >Assignee: Apache Spark > Labels: backport-needed > > Spark shell snippet for reproduction: > {code} > import sqlContext.implicits._ > case class Inner(f: Int) > case class Outer(i: Inner) > Seq(Outer(null)).toDS().toDF().show() > Seq(Outer(null)).toDS().show() > {code} > Expected output should be: > {noformat} > ++ > | i| > ++ > |null| > ++ > ++ > | i| > ++ > |null| > ++ > {noformat} > Actual output: > {noformat} > +--+ > | i| > +--+ > |[null]| > +--+ > java.lang.RuntimeException: Error while decoding: java.lang.RuntimeException: > Null value appeared in non-nullable field Inner.f of type scala.Int. If the > schema is inferred from a Scala tuple/case class, or a Java bean, please try > to use scala.Option[_] or other nullable types (e.g. java.lang.Integer > instead of int/scala.Int). 
> newinstance(class $iwC$$iwC$Outer,if (isnull(input[0, > StructType(StructField(f,IntegerType,false))])) null else newinstance(class > $iwC$$iwC$Inner,assertnotnull(input[0, > StructType(StructField(f,IntegerType,false))].f,Inner,f,scala.Int),false,ObjectType(class > $iwC$$iwC$Inner),Some($iwC$$iwC@6616b9e0)),false,ObjectType(class > $iwC$$iwC$Outer),Some($iwC$$iwC@6ab35ce3)) > +- if (isnull(input[0, StructType(StructField(f,IntegerType,false))])) null > else newinstance(class $iwC$$iwC$Inner,assertnotnull(input[0, > StructType(StructField(f,IntegerType,false))].f,Inner,f,scala.Int),false,ObjectType(class > $iwC$$iwC$Inner),Some($iwC$$iwC@6616b9e0)) >:- isnull(input[0, StructType(StructField(f,IntegerType,false))]) >: +- input[0, StructType(StructField(f,IntegerType,false))] >:- null >+- newinstance(class $iwC$$iwC$Inner,assertnotnull(input[0, > StructType(StructField(f,IntegerType,false))].f,Inner,f,scala.Int),false,ObjectType(class > $iwC$$iwC$Inner),Some($iwC$$iwC@6616b9e0)) > +- assertnotnull(input[0, > StructType(StructField(f,IntegerType,false))].f,Inner,f,scala.Int) > +- input[0, StructType(StructField(f,IntegerType,false))].f > +- input[0, StructType(StructField(f,IntegerType,false))] > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.fromRow(ExpressionEncoder.scala:224) > at > org.apache.spark.sql.Dataset$$anonfun$collect$2.apply(Dataset.scala:704) > at > org.apache.spark.sql.Dataset$$anonfun$collect$2.apply(Dataset.scala:704) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) > at > scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108) > at 
org.apache.spark.sql.Dataset.collect(Dataset.scala:704) > at org.apache.spark.sql.Dataset.take(Dataset.scala:725) > at org.apache.spark.sql.Dataset.showString(Dataset.scala:240) > at org.apache.spark.sql.Dataset.show(Dataset.scala:230) > at org.apache.spark.sql.Dataset.show(Dataset.scala:193) > at org.apache.spark.sql.Dataset.show(Dataset.scala:201) > at > $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:33) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:38) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:40) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:42) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:44) > at $iwC$$iwC$$iwC$$iwC$$iwC.(:46) > at $iwC$$iwC$$iwC$$iwC.(:48) > at $iwC$$iwC$$iwC.(:50) > at $iwC$$iwC.(:52) > at $iwC.(:54) > at (:56) > at .(:60) > at .() > at .(:7) > at .() > at $print() > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:483) > at >
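The exception above boils down to a primitive-vs-boxed distinction: a field typed as scala.Int (like Java's int) has no representation for null, while java.lang.Integer does, which is exactly why the message suggests scala.Option[_] or java.lang.Integer. A minimal, Spark-free sketch of the same failure mode in plain Java (class and method names here are invented for illustration):

```java
public class NullableFieldDemo {

    // Auto-unboxing a boxed Integer into an int: this is the analogue of
    // decoding a value into a non-nullable scala.Int field.
    static int unbox(Integer boxed) {
        return boxed; // throws NullPointerException if boxed == null
    }

    // A null simply cannot be forced into a primitive slot.
    static boolean unboxingNullThrows() {
        try {
            unbox(null);
            return false;
        } catch (NullPointerException e) {
            return true; // analogue of "Null value appeared in non-nullable field"
        }
    }

    public static void main(String[] args) {
        System.out.println(unbox(Integer.valueOf(42))); // prints 42
        System.out.println(unboxingNullThrows());       // prints true
    }
}
```

The advice in the error message follows directly: a nullable container (Option, Integer) gives the encoder a way to represent the missing value instead of failing at unboxing time.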
[jira] [Commented] (SPARK-10486) Spark intermittently fails to recover from a worker failure (in standalone mode)
[ https://issues.apache.org/jira/browse/SPARK-10486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15069054#comment-15069054 ] Liang Chen commented on SPARK-10486: I am hitting the same problem. > Spark intermittently fails to recover from a worker failure (in standalone > mode) > > > Key: SPARK-10486 > URL: https://issues.apache.org/jira/browse/SPARK-10486 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.4.1 >Reporter: Cheuk Lam >Priority: Critical > > We have run into a problem where some Spark job is aborted after one worker > is killed in a 2-worker standalone cluster. The problem is intermittent, but > we can consistently reproduce it. The problem only appears to happen when we > kill a worker. It doesn't seem to happen when we kill an executor directly. > The program we use to reproduce the problem is an iterative program based > on GraphX, although the nature of the issue doesn't seem to be GraphX > related. This is how we reproduce the problem: > * Set up a standalone cluster of 2 workers; > * Run a Spark application of some iterative program (ours is based on > GraphX); > * Kill a worker process (and thus the associated executor); > * Intermittently, some job will be aborted. > The driver and the executor logs are available, as well as the application > history (event log file). But they are quite large and can't be attached > here. > ~ > After looking into the log files, we think the failure is caused by the > following two things combined: > * The BlockManagerMasterEndpoint in the driver has some stale block info > corresponding to the dead executor after the worker has been killed. The > driver does appear to handle the "RemoveExecutor" message and clean up all > related block info. But subsequently, and intermittently, it receives some > Akka messages to re-register the dead BlockManager and re-add some of its > blocks. 
As a result, upon GetLocations requests from the remaining executor, > the driver responds with some stale block info, instructing the remaining > executor to fetch blocks from the dead executor. Please see the driver log > excerpt below that shows the sequence of events described above. In the > log, there are two executors: 1.2.3.4 was the one which got shut down, while > 5.6.7.8 is the remaining executor. The driver also ran on 5.6.7.8. > * When the remaining executor's BlockManager issues a doGetRemote() call to > fetch the block of data, it fails because the targeted BlockManager which > resided in the dead executor is gone. This failure results in an exception > forwarded to the caller, bypassing the mechanism in the doGetRemote() > function to trigger a re-computation of the block. I don't know whether that > is intentional or not. > Driver log excerpt that shows that the driver received messages to > re-register the dead executor after handling the RemoveExecutor message: > 11690 15/09/02 20:35:16 [sparkDriver-akka.actor.default-dispatcher-15] DEBUG > AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1: [actor] handled message > (172.236378 ms) > AkkaMessage(RegisterExecutor(0,AkkaRpcEndpointRef(Actor[akka.tcp://sparkExecutor@1.2.3.4:36140/user/Executor#670388190]),1.2.3.4:36140,8,Map(stdout > -> > http://1.2.3.4:8081/logPage/?appId=app-20150902203512-=0=stdout, > stderr -> > http://1.2.3.4:8081/logPage/?appId=app-20150902203512-=0=stderr)),true) > from Actor[akka.tcp://sparkExecutor@1.2.3.4:36140/temp/$f] > 11717 15/09/02 20:35:16 [sparkDriver-akka.actor.default-dispatcher-15] DEBUG > AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1: [actor] received message > AkkaMessage(RegisterBlockManager(BlockManagerId(0, 1.2.3.4, > 52615),6667936727,AkkaRpcEndpointRef(Actor[akka.tcp://sparkExecutor@1.2.3.4:36140/user/BlockManagerEndpoint1#-21635])),true) > from Actor[akka.tcp://sparkExecutor@1.2.3.4:36140/temp/$g] > 11717 15/09/02 20:35:16 
[sparkDriver-akka.actor.default-dispatcher-15] DEBUG > AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1: Received RPC message: > AkkaMessage(RegisterBlockManager(BlockManagerId(0, 1.2.3.4, > 52615),6667936727,AkkaRpcEndpointRef(Actor[akka.tcp://sparkExecutor@1.2.3.4:36140/user/BlockManagerEndpoint1#-21635])),true) > 11718 15/09/02 20:35:16 [sparkDriver-akka.actor.default-dispatcher-15] INFO > BlockManagerMasterEndpoint: Registering block manager 1.2.3.4:52615 with 6.2 > GB RAM, BlockManagerId(0, 1.2.3.4, 52615) > 11719 15/09/02 20:35:16 [sparkDriver-akka.actor.default-dispatcher-15] DEBUG > AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1: [actor] handled message > (1.498313 ms) AkkaMessage(RegisterBlockManager(BlockManagerId(0, 1.2.3.4, >
[jira] [Assigned] (SPARK-12468) getParamMap in Pyspark ML API returns empty dictionary in example for Documentation
[ https://issues.apache.org/jira/browse/SPARK-12468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12468: Assignee: Apache Spark > getParamMap in Pyspark ML API returns empty dictionary in example for > Documentation > --- > > Key: SPARK-12468 > URL: https://issues.apache.org/jira/browse/SPARK-12468 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.5.2 >Reporter: Zachary Brown >Assignee: Apache Spark >Priority: Minor > > The `extractParamMap()` method for a model that has been fit returns an empty > dictionary, e.g. (from the [Pyspark ML API > Documentation](http://spark.apache.org/docs/latest/ml-guide.html#example-estimator-transformer-and-param)): > ```python > from pyspark.mllib.linalg import Vectors > from pyspark.ml.classification import LogisticRegression > from pyspark.ml.param import Param, Params > # Prepare training data from a list of (label, features) tuples. > training = sqlContext.createDataFrame([ > (1.0, Vectors.dense([0.0, 1.1, 0.1])), > (0.0, Vectors.dense([2.0, 1.0, -1.0])), > (0.0, Vectors.dense([2.0, 1.3, 1.0])), > (1.0, Vectors.dense([0.0, 1.2, -0.5]))], ["label", "features"]) > # Create a LogisticRegression instance. This instance is an Estimator. > lr = LogisticRegression(maxIter=10, regParam=0.01) > # Print out the parameters, documentation, and any default values. > print "LogisticRegression parameters:\n" + lr.explainParams() + "\n" > # Learn a LogisticRegression model. This uses the parameters stored in lr. > model1 = lr.fit(training) > # Since model1 is a Model (i.e., a transformer produced by an Estimator), > # we can view the parameters it used during fit(). > # This prints the parameter (name: value) pairs, where names are unique IDs > for this > # LogisticRegression instance. 
> print "Model 1 was fit using parameters: " > print model1.extractParamMap() > ``` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12468) getParamMap in Pyspark ML API returns empty dictionary in example for Documentation
[ https://issues.apache.org/jira/browse/SPARK-12468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12468: Assignee: (was: Apache Spark) > getParamMap in Pyspark ML API returns empty dictionary in example for > Documentation > --- > > Key: SPARK-12468 > URL: https://issues.apache.org/jira/browse/SPARK-12468 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.5.2 >Reporter: Zachary Brown >Priority: Minor > > The `extractParamMap()` method for a model that has been fit returns an empty > dictionary, e.g. (from the [Pyspark ML API > Documentation](http://spark.apache.org/docs/latest/ml-guide.html#example-estimator-transformer-and-param)): > ```python > from pyspark.mllib.linalg import Vectors > from pyspark.ml.classification import LogisticRegression > from pyspark.ml.param import Param, Params > # Prepare training data from a list of (label, features) tuples. > training = sqlContext.createDataFrame([ > (1.0, Vectors.dense([0.0, 1.1, 0.1])), > (0.0, Vectors.dense([2.0, 1.0, -1.0])), > (0.0, Vectors.dense([2.0, 1.3, 1.0])), > (1.0, Vectors.dense([0.0, 1.2, -0.5]))], ["label", "features"]) > # Create a LogisticRegression instance. This instance is an Estimator. > lr = LogisticRegression(maxIter=10, regParam=0.01) > # Print out the parameters, documentation, and any default values. > print "LogisticRegression parameters:\n" + lr.explainParams() + "\n" > # Learn a LogisticRegression model. This uses the parameters stored in lr. > model1 = lr.fit(training) > # Since model1 is a Model (i.e., a transformer produced by an Estimator), > # we can view the parameters it used during fit(). > # This prints the parameter (name: value) pairs, where names are unique IDs > for this > # LogisticRegression instance. 
> print "Model 1 was fit using parameters: " > print model1.extractParamMap() > ``` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-12061) Persist for Map/filter with Lambda Functions don't always read from Cache
[ https://issues.apache.org/jira/browse/SPARK-12061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15058419#comment-15058419 ] Xiao Li edited comment on SPARK-12061 at 12/23/15 12:22 AM: The logical plan's cleanArgs do not match when calling sameResult. was (Author: smilegator): Start working on it. Thanks! > Persist for Map/filter with Lambda Functions don't always read from Cache > - > > Key: SPARK-12061 > URL: https://issues.apache.org/jira/browse/SPARK-12061 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Xiao Li > > So far, the existing caching mechanisms do not work on dataset operations > when using map/filter with lambda functions. For example, > {code} > test("persist and then map/filter with lambda functions") { > val f = (i: Int) => i + 1 > val ds = Seq(1, 2, 3).toDS() > val mapped = ds.map(f) > mapped.cache() > val mapped2 = ds.map(f) > assertCached(mapped2) > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12386) Setting "spark.executor.port" leads to NPE in SparkEnv
[ https://issues.apache.org/jira/browse/SPARK-12386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-12386: - Fix Version/s: (was: 1.6.1) (was: 2.0.0) 1.6.0 > Setting "spark.executor.port" leads to NPE in SparkEnv > -- > > Key: SPARK-12386 > URL: https://issues.apache.org/jira/browse/SPARK-12386 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Critical > Fix For: 1.6.0 > > > From the list: > {quote} > when we set spark.executor.port in 1.6, we get thrown a NPE in > SparkEnv$.create(SparkEnv.scala:259). > {quote} > Fix is simple; probably should make it to 1.6.0 since it will affect anyone > using that config options, but I'll leave that to the release manager's > discretion. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12410) "." and "|" used for String.split directly
[ https://issues.apache.org/jira/browse/SPARK-12410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-12410: - Fix Version/s: (was: 1.6.1) (was: 2.0.0) 1.6.0 > "." and "|" used for String.split directly > -- > > Key: SPARK-12410 > URL: https://issues.apache.org/jira/browse/SPARK-12410 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.4.1, 1.5.2, 1.6.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Minor > Fix For: 1.4.2, 1.5.3, 1.6.0 > > > String.split accepts a regular expression, so we should escape "." and "|". -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
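The one-line summary above is easy to trip over: java.lang.String.split takes a regular expression, so an unescaped "." (which matches any character) or "|" (an empty alternation) silently mis-splits the input. A small, Spark-independent sketch of the escaping the fix describes (helper names here are invented for the example):

```java
import java.util.Arrays;

public class SplitEscapeDemo {

    // Unescaped "." is a regex wildcard: every character matches, all tokens
    // are empty, and split() drops trailing empty strings, so nothing is left.
    static String[] splitOnDotUnescaped(String s) {
        return s.split(".");
    }

    // Escaping the dot gives the intended literal split.
    static String[] splitOnDotEscaped(String s) {
        return s.split("\\.");
    }

    // Same idea for "|": escape it to split on the literal pipe character.
    static String[] splitOnPipeEscaped(String s) {
        return s.split("\\|");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitOnDotUnescaped("a.b.c"))); // []
        System.out.println(Arrays.toString(splitOnDotEscaped("a.b.c")));   // [a, b, c]
        System.out.println(Arrays.toString(splitOnPipeEscaped("a|b")));    // [a, b]
    }
}
```

An alternative to manual escaping is `s.split(java.util.regex.Pattern.quote("."))`, which quotes the whole delimiter string.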
[jira] [Commented] (SPARK-12484) DataFrame withColumn() does not work in Java
[ https://issues.apache.org/jira/browse/SPARK-12484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15069025#comment-15069025 ] Xiao Li commented on SPARK-12484: - Please check the email answer by [~zjffdu] Thanks! > DataFrame withColumn() does not work in Java > > > Key: SPARK-12484 > URL: https://issues.apache.org/jira/browse/SPARK-12484 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 > Environment: mac El Cap. 10.11.2 > Java 8 >Reporter: Andrew Davidson > Attachments: UDFTest.java > > > DataFrame transformerdDF = df.withColumn(fieldName, newCol); raises > org.apache.spark.sql.AnalysisException: resolved attribute(s) _c0#2 missing > from id#0,labelStr#1 in operator !Project [id#0,labelStr#1,_c0#2 AS > transformedByUDF#3]; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914) > at org.apache.spark.sql.DataFrame.(DataFrame.scala:132) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154) > at org.apache.spark.sql.DataFrame.select(DataFrame.scala:691) > at org.apache.spark.sql.DataFrame.withColumn(DataFrame.scala:1150) > at com.pws.fantasySport.ml.UDFTest.test(UDFTest.java:75) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To 
unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12429) Update documentation to show how to use accumulators and broadcasts with Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-12429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-12429: -- Assignee: Shixiong Zhu (was: Apache Spark) > Update documentation to show how to use accumulators and broadcasts with > Spark Streaming > > > Key: SPARK-12429 > URL: https://issues.apache.org/jira/browse/SPARK-12429 > Project: Spark > Issue Type: Documentation > Components: Documentation, Streaming >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > Fix For: 1.6.0 > > > Accumulators and Broadcasts with Spark Streaming cannot work perfectly when > restarting on driver failures. We need to add some example to guide the user. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12429) Update documentation to show how to use accumulators and broadcasts with Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-12429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-12429. --- Resolution: Fixed Fix Version/s: 1.6.0 > Update documentation to show how to use accumulators and broadcasts with > Spark Streaming > > > Key: SPARK-12429 > URL: https://issues.apache.org/jira/browse/SPARK-12429 > Project: Spark > Issue Type: Documentation > Components: Documentation, Streaming >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > Fix For: 1.6.0 > > > Accumulators and Broadcasts with Spark Streaming cannot work perfectly when > restarting on driver failures. We need to add some example to guide the user. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12483) Data Frame as() does not work in Java
[ https://issues.apache.org/jira/browse/SPARK-12483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15069055#comment-15069055 ] Xiao Li commented on SPARK-12483: - Hi, Andy, See this example: {code} DataFrame df = hc.range(0, 100).unionAll(hc.range(0, 100)).select(col("id").as("value")); {code} Good luck, Xiao > Data Frame as() does not work in Java > - > > Key: SPARK-12483 > URL: https://issues.apache.org/jira/browse/SPARK-12483 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 > Environment: Mac El Cap 10.11.2 > Java 8 >Reporter: Andrew Davidson > Attachments: SPARK_12483_unitTest.java > > > The following unit test demonstrates a bug in as(): the column name for aliasDF > was not changed. >@Test > public void bugDataFrameAsTest() { > DataFrame df = createData(); > df.printSchema(); > df.show(); > > DataFrame aliasDF = df.select("id").as("UUID"); > aliasDF.printSchema(); > aliasDF.show(); > } > DataFrame createData() { > Features f1 = new Features(1, category1); > Features f2 = new Features(2, category2); > ArrayList data = new ArrayList(2); > data.add(f1); > data.add(f2); > //JavaRDD rdd = > javaSparkContext.parallelize(Arrays.asList(f1, f2)); > JavaRDD rdd = javaSparkContext.parallelize(data); > DataFrame df = sqlContext.createDataFrame(rdd, Features.class); > return df; > } > This is the output I got (without the spark log msgs) > root > |-- id: integer (nullable = false) > |-- labelStr: string (nullable = true)
> +---+------------+
> | id|    labelStr|
> +---+------------+
> |  1|       noise|
> |  2|questionable|
> +---+------------+
> root > |-- id: integer (nullable = false)
> +---+
> | id|
> +---+
> |  1|
> |  2|
> +---+
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12484) DataFrame withColumn() does not work in Java
[ https://issues.apache.org/jira/browse/SPARK-12484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15069061#comment-15069061 ] Andrew Davidson commented on SPARK-12484: - Hi Xiao thanks for looking into this Andy > DataFrame withColumn() does not work in Java > > > Key: SPARK-12484 > URL: https://issues.apache.org/jira/browse/SPARK-12484 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 > Environment: mac El Cap. 10.11.2 > Java 8 >Reporter: Andrew Davidson > Attachments: UDFTest.java > > > DataFrame transformerdDF = df.withColumn(fieldName, newCol); raises > org.apache.spark.sql.AnalysisException: resolved attribute(s) _c0#2 missing > from id#0,labelStr#1 in operator !Project [id#0,labelStr#1,_c0#2 AS > transformedByUDF#3]; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914) > at org.apache.spark.sql.DataFrame.(DataFrame.scala:132) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154) > at org.apache.spark.sql.DataFrame.select(DataFrame.scala:691) > at org.apache.spark.sql.DataFrame.withColumn(DataFrame.scala:1150) > at com.pws.fantasySport.ml.UDFTest.test(UDFTest.java:75) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To 
unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12376) Spark Streaming Java8APISuite fails in assertOrderInvariantEquals method
[ https://issues.apache.org/jira/browse/SPARK-12376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-12376: - Fix Version/s: (was: 1.6.1) (was: 2.0.0) 1.6.0 > Spark Streaming Java8APISuite fails in assertOrderInvariantEquals method > > > Key: SPARK-12376 > URL: https://issues.apache.org/jira/browse/SPARK-12376 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 1.6.0 > Environment: Oracle Java 64-bit (build 1.8.0_66-b17) >Reporter: Evan Chen >Assignee: Evan Chen >Priority: Minor > Fix For: 1.6.0 > > > org.apache.spark.streaming.Java8APISuite.java is failing due to trying to > sort immutable list in assertOrderInvariantEquals method. > Here are the errors: > Tests run: 27, Failures: 0, Errors: 4, Skipped: 0, Time elapsed: 5.948 sec > <<< FAILURE! - in org.apache.spark.streaming.Java8APISuite > testMap(org.apache.spark.streaming.Java8APISuite) Time elapsed: 0.217 sec > <<< ERROR! > java.lang.UnsupportedOperationException: null > at java.util.AbstractList.set(AbstractList.java:132) > at java.util.AbstractList$ListItr.set(AbstractList.java:426) > at java.util.List.sort(List.java:482) > at java.util.Collections.sort(Collections.java:141) > at > org.apache.spark.streaming.Java8APISuite.lambda$assertOrderInvariantEquals$1(Java8APISuite.java:444) > testFlatMap(org.apache.spark.streaming.Java8APISuite) Time elapsed: 0.203 > sec <<< ERROR! > java.lang.UnsupportedOperationException: null > at java.util.AbstractList.set(AbstractList.java:132) > at java.util.AbstractList$ListItr.set(AbstractList.java:426) > at java.util.List.sort(List.java:482) > at java.util.Collections.sort(Collections.java:141) > at > org.apache.spark.streaming.Java8APISuite.lambda$assertOrderInvariantEquals$1(Java8APISuite.java:444) > testFilter(org.apache.spark.streaming.Java8APISuite) Time elapsed: 0.209 sec > <<< ERROR! 
> java.lang.UnsupportedOperationException: null > at java.util.AbstractList.set(AbstractList.java:132) > at java.util.AbstractList$ListItr.set(AbstractList.java:426) > at java.util.List.sort(List.java:482) > at java.util.Collections.sort(Collections.java:141) > at > org.apache.spark.streaming.Java8APISuite.lambda$assertOrderInvariantEquals$1(Java8APISuite.java:444) > testTransform(org.apache.spark.streaming.Java8APISuite) Time elapsed: 0.215 > sec <<< ERROR! > java.lang.UnsupportedOperationException: null > at java.util.AbstractList.set(AbstractList.java:132) > at java.util.AbstractList$ListItr.set(AbstractList.java:426) > at java.util.List.sort(List.java:482) > at java.util.Collections.sort(Collections.java:141) > at > org.apache.spark.streaming.Java8APISuite.lambda$assertOrderInvariantEquals$1(Java8APISuite.java:444) > Results : > Tests in error: > > Java8APISuite.testFilter:81->assertOrderInvariantEquals:444->lambda$assertOrderInvariantEquals$1:444 > » UnsupportedOperation > > Java8APISuite.testFlatMap:360->assertOrderInvariantEquals:444->lambda$assertOrderInvariantEquals$1:444 > » UnsupportedOperation > > Java8APISuite.testMap:63->assertOrderInvariantEquals:444->lambda$assertOrderInvariantEquals$1:444 > » UnsupportedOperation > > Java8APISuite.testTransform:168->assertOrderInvariantEquals:444->lambda$assertOrderInvariantEquals$1:444 > » UnsupportedOperation -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
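The root cause in the stack trace above is generic Java behavior rather than anything Spark-specific: Collections.sort ultimately writes elements back through the list (via ListIterator.set), which an immutable list rejects with UnsupportedOperationException. A minimal reproduction plus the usual fix of sorting a mutable copy (class and method names here are invented for the sketch, and the immutable list is built with Collections.unmodifiableList rather than whatever list type the test suite actually used):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class ImmutableSortDemo {

    // Returns true if sorting the given list throws, as in the failing tests.
    static boolean sortThrows(List<Integer> list) {
        try {
            Collections.sort(list);
            return false;
        } catch (UnsupportedOperationException e) {
            return true;
        }
    }

    // The standard fix: copy into a mutable list before sorting in place.
    static List<Integer> sortedCopy(List<Integer> list) {
        List<Integer> copy = new ArrayList<>(list);
        Collections.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        List<Integer> immutable =
            Collections.unmodifiableList(Arrays.asList(3, 1, 2));
        System.out.println(sortThrows(immutable)); // prints true
        System.out.println(sortedCopy(immutable)); // prints [1, 2, 3]
    }
}
```

Note that the exception fires regardless of whether the list is already sorted, because the sort writes every element back unconditionally.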
[jira] [Updated] (SPARK-11749) Duplicate creating the RDD in file stream when recovering from checkpoint data
[ https://issues.apache.org/jira/browse/SPARK-11749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-11749: - Fix Version/s: (was: 1.6.1) (was: 2.0.0) 1.6.0 > Duplicate creating the RDD in file stream when recovering from checkpoint data > -- > > Key: SPARK-11749 > URL: https://issues.apache.org/jira/browse/SPARK-11749 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.5.2, 1.6.0 >Reporter: Jack Hu >Assignee: Jack Hu > Fix For: 1.6.0 > > > I have a case to monitor a HDFS folder, then enrich the incoming data from > the HDFS folder via different table (about 15 reference tables) and send to > different hive table after some operations. > The code is as this: > {code} > val txt = > ssc.textFileStream(folder).map(toKeyValuePair).reduceByKey(removeDuplicates) > val refTable1 = > ssc.textFileStream(refSource1).map(parse(_)).updateStateByKey(...) > txt.join(refTable1).map(..).reduceByKey(...).foreachRDD( > rdd => { > // insert into hive table > } > ) > val refTable2 = > ssc.textFileStream(refSource2).map(parse(_)).updateStateByKey(...) > txt.join(refTable2).map(..).reduceByKey(...).foreachRDD( > rdd => { > // insert into hive table > } > ) > /// more refTables in following code > {code} > > The {{batchInterval}} of this application is set to *30 seconds*, the > checkpoint interval is set to *10 minutes*, every batch in {{txt}} has *60 > files* > After recovered from checkpoint data, I can see lots of log to create the RDD > in file stream: rdd in each batch of file stream was been recreated *15 > times*, and it takes about *5 minutes* to create so much file RDD. During > this period, *10K+ broadcast* had been created and almost used all the block > manager space. 
> After some investigation, we found that {{DStream.restoreCheckpointData}} > would be invoked at each output ({{DStream.foreachRDD}} in this case), with no > flag to indicate that this {{DStream}} had already been restored, so the RDD in the file > stream was recreated. > Suggest adding a flag to control the restore process and avoid the duplicated > work. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12386) Setting "spark.executor.port" leads to NPE in SparkEnv
[ https://issues.apache.org/jira/browse/SPARK-12386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-12386: - Target Version/s: 1.6.0 (was: 1.6.1, 2.0.0) > Setting "spark.executor.port" leads to NPE in SparkEnv > -- > > Key: SPARK-12386 > URL: https://issues.apache.org/jira/browse/SPARK-12386 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Critical > Fix For: 1.6.0 > > > From the list: > {quote} > when we set spark.executor.port in 1.6, we get thrown an NPE in > SparkEnv$.create(SparkEnv.scala:259). > {quote} > The fix is simple; it should probably make it into 1.6.0 since it will affect anyone > using that config option, but I'll leave that to the release manager's > discretion. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11749) Duplicate creating the RDD in file stream when recovering from checkpoint data
[ https://issues.apache.org/jira/browse/SPARK-11749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-11749: - Affects Version/s: (was: 1.5.0) 1.6.0 > Duplicate creating the RDD in file stream when recovering from checkpoint data > -- > > Key: SPARK-11749 > URL: https://issues.apache.org/jira/browse/SPARK-11749 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.5.2, 1.6.0 >Reporter: Jack Hu >Assignee: Jack Hu > Fix For: 1.6.0 > > > I have a case that monitors an HDFS folder, enriches the incoming data from > the HDFS folder via different tables (about 15 reference tables), and sends it to > different Hive tables after some operations. > The code is as follows: > {code} > val txt = > ssc.textFileStream(folder).map(toKeyValuePair).reduceByKey(removeDuplicates) > val refTable1 = > ssc.textFileStream(refSource1).map(parse(_)).updateStateByKey(...) > txt.join(refTable1).map(..).reduceByKey(...).foreachRDD( > rdd => { > // insert into hive table > } > ) > val refTable2 = > ssc.textFileStream(refSource2).map(parse(_)).updateStateByKey(...) > txt.join(refTable2).map(..).reduceByKey(...).foreachRDD( > rdd => { > // insert into hive table > } > ) > /// more refTables in following code > {code} > > The {{batchInterval}} of this application is set to *30 seconds*, the > checkpoint interval is set to *10 minutes*, and every batch in {{txt}} has *60 > files*. > After recovering from checkpoint data, I can see many log entries for creating the RDD > in the file stream: the RDD in each batch of the file stream was recreated *15 > times*, and it took about *5 minutes* to create that many file RDDs. During > this period, *10K+ broadcasts* had been created, using almost all the block > manager space. 
> After some investigation, we found that {{DStream.restoreCheckpointData}} > is invoked for each output ({{DStream.foreachRDD}} in this case), and there is no > flag to indicate that this {{DStream}} has already been restored, so the RDD in the file > stream was recreated repeatedly. > We suggest adding a flag to guard the restore process and avoid the duplicated > work. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12220) Make Utils.fetchFile support files that contain special characters
[ https://issues.apache.org/jira/browse/SPARK-12220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-12220: - Fix Version/s: (was: 1.6.1) (was: 2.0.0) 1.6.0 > Make Utils.fetchFile support files that contain special characters > -- > > Key: SPARK-12220 > URL: https://issues.apache.org/jira/browse/SPARK-12220 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > Fix For: 1.6.0 > > > Currently, if a file name contains special characters, such as a space (" "), > Utils.fetchFile will fail because it doesn't handle this case. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
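The underlying problem is that a raw file name containing characters such as a space is not valid in a URL, so it must be percent-encoded before being fetched. A minimal illustration in plain Python (this is the general encoding idea, not Spark's Utils.fetchFile implementation):

```python
from urllib.parse import quote, unquote

name = "my jar file.jar"         # contains spaces, illegal in a URL path
encoded = quote(name)            # percent-encode for safe use in a URL
print(encoded)                   # my%20jar%20file.jar

# Decoding on the receiving side recovers the original file name.
assert unquote(encoded) == name
```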
[jira] [Commented] (SPARK-12483) Data Frame as() does not work in Java
[ https://issues.apache.org/jira/browse/SPARK-12483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15069019#comment-15069019 ] Xiao Li commented on SPARK-12483: - {code}as{code} is used to return a new DataFrame with an alias set. For example, {code} val x = testData2.as("x") val y = testData2.as("y") val join = x.join(y, $"x.a" === $"y.a", "inner").queryExecution.optimizedPlan {code} > Data Frame as() does not work in Java > - > > Key: SPARK-12483 > URL: https://issues.apache.org/jira/browse/SPARK-12483 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 > Environment: Mac El Cap 10.11.2 > Java 8 >Reporter: Andrew Davidson > Attachments: SPARK_12483_unitTest.java > > > Following unit test demonstrates a bug in as(). The column name for aliasDF > was not changed >@Test > public void bugDataFrameAsTest() { > DataFrame df = createData(); > df.printSchema(); > df.show(); > > DataFrame aliasDF = df.select("id").as("UUID"); > aliasDF.printSchema(); > aliasDF.show(); > } > DataFrame createData() { > Features f1 = new Features(1, category1); > Features f2 = new Features(2, category2); > ArrayList data = new ArrayList(2); > data.add(f1); > data.add(f2); > //JavaRDD rdd = > javaSparkContext.parallelize(Arrays.asList(f1, f2)); > JavaRDD rdd = javaSparkContext.parallelize(data); > DataFrame df = sqlContext.createDataFrame(rdd, Features.class); > return df; > } > This is the output I got (without the spark log msgs) > root > |-- id: integer (nullable = false) > |-- labelStr: string (nullable = true) > +---++ > | id|labelStr| > +---++ > | 1| noise| > | 2|questionable| > +---++ > root > |-- id: integer (nullable = false) > +---+ > | id| > +---+ > | 1| > | 2| > +---+ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12478) Dataset fields of product types can't be null
[ https://issues.apache.org/jira/browse/SPARK-12478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-12478: --- Labels: backport-needed (was: ) > Dataset fields of product types can't be null > - > > Key: SPARK-12478 > URL: https://issues.apache.org/jira/browse/SPARK-12478 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0, 2.0.0 >Reporter: Cheng Lian >Assignee: Apache Spark > Labels: backport-needed > > Spark shell snippet for reproduction: > {code} > import sqlContext.implicits._ > case class Inner(f: Int) > case class Outer(i: Inner) > Seq(Outer(null)).toDS().toDF().show() > Seq(Outer(null)).toDS().show() > {code} > Expected output should be: > {noformat} > ++ > | i| > ++ > |null| > ++ > ++ > | i| > ++ > |null| > ++ > {noformat} > Actual output: > {noformat} > +--+ > | i| > +--+ > |[null]| > +--+ > java.lang.RuntimeException: Error while decoding: java.lang.RuntimeException: > Null value appeared in non-nullable field Inner.f of type scala.Int. If the > schema is inferred from a Scala tuple/case class, or a Java bean, please try > to use scala.Option[_] or other nullable types (e.g. java.lang.Integer > instead of int/scala.Int). 
> newinstance(class $iwC$$iwC$Outer,if (isnull(input[0, > StructType(StructField(f,IntegerType,false))])) null else newinstance(class > $iwC$$iwC$Inner,assertnotnull(input[0, > StructType(StructField(f,IntegerType,false))].f,Inner,f,scala.Int),false,ObjectType(class > $iwC$$iwC$Inner),Some($iwC$$iwC@6616b9e0)),false,ObjectType(class > $iwC$$iwC$Outer),Some($iwC$$iwC@6ab35ce3)) > +- if (isnull(input[0, StructType(StructField(f,IntegerType,false))])) null > else newinstance(class $iwC$$iwC$Inner,assertnotnull(input[0, > StructType(StructField(f,IntegerType,false))].f,Inner,f,scala.Int),false,ObjectType(class > $iwC$$iwC$Inner),Some($iwC$$iwC@6616b9e0)) >:- isnull(input[0, StructType(StructField(f,IntegerType,false))]) >: +- input[0, StructType(StructField(f,IntegerType,false))] >:- null >+- newinstance(class $iwC$$iwC$Inner,assertnotnull(input[0, > StructType(StructField(f,IntegerType,false))].f,Inner,f,scala.Int),false,ObjectType(class > $iwC$$iwC$Inner),Some($iwC$$iwC@6616b9e0)) > +- assertnotnull(input[0, > StructType(StructField(f,IntegerType,false))].f,Inner,f,scala.Int) > +- input[0, StructType(StructField(f,IntegerType,false))].f > +- input[0, StructType(StructField(f,IntegerType,false))] > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.fromRow(ExpressionEncoder.scala:224) > at > org.apache.spark.sql.Dataset$$anonfun$collect$2.apply(Dataset.scala:704) > at > org.apache.spark.sql.Dataset$$anonfun$collect$2.apply(Dataset.scala:704) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) > at > scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108) > at 
org.apache.spark.sql.Dataset.collect(Dataset.scala:704) > at org.apache.spark.sql.Dataset.take(Dataset.scala:725) > at org.apache.spark.sql.Dataset.showString(Dataset.scala:240) > at org.apache.spark.sql.Dataset.show(Dataset.scala:230) > at org.apache.spark.sql.Dataset.show(Dataset.scala:193) > at org.apache.spark.sql.Dataset.show(Dataset.scala:201) > at > $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:33) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:38) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:40) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:42) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:44) > at $iwC$$iwC$$iwC$$iwC$$iwC.(:46) > at $iwC$$iwC$$iwC$$iwC.(:48) > at $iwC$$iwC$$iwC.(:50) > at $iwC$$iwC.(:52) > at $iwC.(:54) > at (:56) > at .(:60) > at .() > at .(:7) > at .() > at $print() > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:483) > at > org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1045) > at >
[jira] [Updated] (SPARK-12482) Spark fileserver not started on same IP as configured in spark.driver.host
[ https://issues.apache.org/jira/browse/SPARK-12482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kyle Sutton updated SPARK-12482: Summary: Spark fileserver not started on same IP as configured in spark.driver.host (was: Spark fileserver not started on same IP as using spark.driver.host) > Spark fileserver not started on same IP as configured in spark.driver.host > -- > > Key: SPARK-12482 > URL: https://issues.apache.org/jira/browse/SPARK-12482 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.2.1, 1.5.2 >Reporter: Kyle Sutton > > The issue of the file server using the default IP instead of the IP address > configured through {{spark.driver.host}} still exists in _Spark 1.5.2_ > The problem is that, while the file server is listening on all interfaces of the > file server host, the _Spark_ service attempts to call back to the default > IP of the host, to which it may or may not have connectivity. > For instance, the following setup causes a > {{java.net.SocketTimeoutException}} when the _Spark_ service tries to contact > the _Spark_ driver host for a JAR: > * Driver host has a default IP of {{192.168.1.2}} and a secondary LAN > connection IP of {{172.30.0.2}} > * _Spark_ service is on the LAN with an IP of {{172.30.0.3}} > * A connection is made from the driver host to the _Spark_ service > ** {{spark.driver.host}} is set to the IP of the driver host on the LAN > {{172.30.0.2}} > ** {{spark.driver.port}} is set to {{50003}} > ** {{spark.fileserver.port}} is set to {{50005}} > * Locally (on the driver host), the following listeners are active: > ** {{0.0.0.0:50005}} > ** {{172.30.0.2:50003}} > * The _Spark_ service calls back to the file server host for a JAR file using > the driver host's default IP: {{http://192.168.1.2:50005/jars/code.jar}} > * The _Spark_ service, being on a different network than the driver host, > cannot see the {{192.168.1.0/24}} address space, and fails to connect to the > file server > ** A 
{{netstat}} on the _Spark_ service host will show the connection to the > file server host as being in {{SYN_SENT}} state until the process gives up > trying to connect > {code:title=Driver|borderStyle=solid} > SparkConf conf = new SparkConf() > .setMaster("spark://172.30.0.3:7077") > .setAppName("TestApp") > .set("spark.driver.host", "172.30.0.2") > .set("spark.driver.port", "50003") > .set("spark.fileserver.port", "50005"); > JavaSparkContext sc = new JavaSparkContext(conf); > sc.addJar("target/code.jar"); > {code} > {code:title=Stacktrace|borderStyle=solid} > 15/12/22 12:48:33 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, > 172.30.0.3): java.net.SocketTimeoutException: connect timed out > at java.net.PlainSocketImpl.socketConnect(Native Method) > at > java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350) > at > java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206) > at > java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188) > at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) > at java.net.Socket.connect(Socket.java:589) > at sun.net.NetworkClient.doConnect(NetworkClient.java:175) > at sun.net.www.http.HttpClient.openServer(HttpClient.java:432) > at sun.net.www.http.HttpClient.openServer(HttpClient.java:527) > at sun.net.www.http.HttpClient.(HttpClient.java:211) > at sun.net.www.http.HttpClient.New(HttpClient.java:308) > at sun.net.www.http.HttpClient.New(HttpClient.java:326) > at > sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1169) > at > sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1105) > at > sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:999) > at > sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:933) > at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:555) > at org.apache.spark.util.Utils$.fetchFile(Utils.scala:356) > at > 
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:405) > at > org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:397) > at > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) > at > scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) > at > scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) > at >
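The reporter's point is that the callback URL should be built from the configured {{spark.driver.host}}, not from the host's default IP. The distinction can be sketched with a hypothetical helper (`jar_url` is illustrative only, not Spark code):

```python
def jar_url(conf, jar_name, default_ip):
    # Prefer the explicitly configured driver host; falling back to the
    # machine's default-route IP reproduces the bug when that IP is not
    # reachable from the Spark service's network.
    host = conf.get("spark.driver.host", default_ip)
    port = conf.get("spark.fileserver.port", "0")
    return f"http://{host}:{port}/jars/{jar_name}"


conf = {
    "spark.driver.host": "172.30.0.2",   # LAN address reachable by the service
    "spark.fileserver.port": "50005",
}

# With the configured host, the service on 172.30.0.0/24 can connect:
print(jar_url(conf, "code.jar", default_ip="192.168.1.2"))
# http://172.30.0.2:50005/jars/code.jar
```

The buggy behavior corresponds to ignoring `spark.driver.host` and always advertising `default_ip`, which in the scenario above yields the unreachable `http://192.168.1.2:50005/jars/code.jar`.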
[jira] [Reopened] (SPARK-12482) Spark fileserver not started on same IP as configured in spark.driver.host
[ https://issues.apache.org/jira/browse/SPARK-12482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kyle Sutton reopened SPARK-12482: - Reopening SPARK-6476, since it's still an issue > Spark fileserver not started on same IP as configured in spark.driver.host > -- > > Key: SPARK-12482 > URL: https://issues.apache.org/jira/browse/SPARK-12482 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.2.1, 1.5.2 >Reporter: Kyle Sutton > > The issue of the file server using the default IP instead of the IP address > configured through {{spark.driver.host}} still exists in _Spark 1.5.2_ > The problem is that, while the file server is listening on all ports on the > file server host, the _Spark_ service attempts to call back to the default > port of the host, to which it may or may not have connectivity. > For instance, the following setup causes a > {{java.net.SocketTimeoutException}} when the _Spark_ service tries to contact > the _Spark_ driver host for a JAR: > * Driver host has a default IP of {{192.168.1.2}} and a secondary LAN > connection IP of {{172.30.0.2}} > * _Spark_ service is on the LAN with an IP of {{172.30.0.3}} > * A connection is made from the driver host to the _Spark_ service > ** {{spark.driver.host}} is set to the IP of the driver host on the LAN > {{172.30.0.2}} > ** {{spark.driver.port}} is set to {{50003}} > ** {{spark.fileserver.port}} is set to {{50005}} > * Locally (on the driver host), the following listeners are active: > ** {{0.0.0.0:50005}} > ** {{172.30.0.2:50003}} > * The _Spark_ service calls back to the file server host for a JAR file using > the driver host's default IP: {{http://192.168.1.2:50005/jars/code.jar}} > * The _Spark_ service, being on a different network than the driver host, > cannot see the {{192.168.1.0/24}} address space, and fails to connect to the > file server > ** A {{netstat}} on the _Spark_ service host will show the connection to the > file server host as being in 
{{SYN_SENT}} state until the process gives up > trying to connect > {code:title=Driver|borderStyle=solid} > SparkConf conf = new SparkConf() > .setMaster("spark://172.30.0.3:7077") > .setAppName("TestApp") > .set("spark.driver.host", "172.30.0.2") > .set("spark.driver.port", "50003") > .set("spark.fileserver.port", "50005"); > JavaSparkContext sc = new JavaSparkContext(conf); > sc.addJar("target/code.jar"); > {code} > {code:title=Stacktrace|borderStyle=solid} > 15/12/22 12:48:33 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, > 172.30.0.3): java.net.SocketTimeoutException: connect timed out > at java.net.PlainSocketImpl.socketConnect(Native Method) > at > java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350) > at > java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206) > at > java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188) > at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) > at java.net.Socket.connect(Socket.java:589) > at sun.net.NetworkClient.doConnect(NetworkClient.java:175) > at sun.net.www.http.HttpClient.openServer(HttpClient.java:432) > at sun.net.www.http.HttpClient.openServer(HttpClient.java:527) > at sun.net.www.http.HttpClient.(HttpClient.java:211) > at sun.net.www.http.HttpClient.New(HttpClient.java:308) > at sun.net.www.http.HttpClient.New(HttpClient.java:326) > at > sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1169) > at > sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1105) > at > sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:999) > at > sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:933) > at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:555) > at org.apache.spark.util.Utils$.fetchFile(Utils.scala:356) > at > 
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:405) > at > org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:397) > at > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) > at > scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) > at > scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) > at > scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226) > at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39) > at
[jira] [Commented] (SPARK-6476) Spark fileserver not started on same IP as using spark.driver.host
[ https://issues.apache.org/jira/browse/SPARK-6476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15068572#comment-15068572 ] Kyle Sutton commented on SPARK-6476: The issue of the file server using the default IP instead of the IP address configured through {{spark.driver.host}} still exists in _Spark 1.5.2_ The problem is that, while the file server is listening on all ports on the file server host, the _Spark_ service attempts to call back to the default port of the host, to which it may or may not have connectivity. For instance, the following setup causes a {{java.net.SocketTimeoutException}} when the _Spark_ service tries to contact the _Spark_ driver host for a JAR: * Driver host has a default IP of {{192.168.1.2}} and a secondary LAN connection IP of {{172.30.0.2}} * _Spark_ service is on the LAN with an IP of {{172.30.0.3}} * A connection is made from the driver host to the _Spark_ service ** {{spark.driver.host}} is set to the IP of the driver host on the LAN {{172.30.0.2}} ** {{spark.driver.port}} is set to {{50003}} ** {{spark.fileserver.port}} is set to {{50005}} * Locally (on the driver host), the following listeners are active: ** {{0.0.0.0:50005}} ** {{172.30.0.2:50003}} * The _Spark_ service calls back to the file server host for a JAR file using the driver host's default IP: {{http://192.168.1.2:50005/jars/code.jar}} * The _Spark_ service, being on a different network than the driver host, cannot see the {{192.168.1.0/24}} address space, and fails to connect to the file server ** A {{netstat}} on the _Spark_ service host will show the connection to the file server host as being in {{SYN_SENT}} state until the process gives up trying to connect {code:title=Driver|borderStyle=solid} SparkConf conf = new SparkConf() .setMaster("spark://172.30.0.3:7077") .setAppName("TestApp") .set("spark.driver.host", "172.30.0.2") .set("spark.driver.port", "50003") .set("spark.fileserver.port", "50005"); JavaSparkContext sc = new 
JavaSparkContext(conf); sc.addJar("target/code.jar"); {code} {code:title=Stacktrace|borderStyle=solid} 15/12/22 12:48:33 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 172.30.0.3): java.net.SocketTimeoutException: connect timed out at java.net.PlainSocketImpl.socketConnect(Native Method) at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350) at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206) at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188) at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) at java.net.Socket.connect(Socket.java:589) at sun.net.NetworkClient.doConnect(NetworkClient.java:175) at sun.net.www.http.HttpClient.openServer(HttpClient.java:432) at sun.net.www.http.HttpClient.openServer(HttpClient.java:527) at sun.net.www.http.HttpClient.(HttpClient.java:211) at sun.net.www.http.HttpClient.New(HttpClient.java:308) at sun.net.www.http.HttpClient.New(HttpClient.java:326) at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1169) at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1105) at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:999) at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:933) at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:555) at org.apache.spark.util.Utils$.fetchFile(Utils.scala:356) at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:405) at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:397) at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) at 
scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226) at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39) at scala.collection.mutable.HashMap.foreach(HashMap.scala:98) at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:397) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:193) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at
[jira] [Comment Edited] (SPARK-11437) createDataFrame shouldn't .take() when provided schema
[ https://issues.apache.org/jira/browse/SPARK-11437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15068627#comment-15068627 ] Maciej Bryński edited comment on SPARK-11437 at 12/22/15 7:44 PM: -- [~davies] Are you sure that this patch is OK ? Right now if I'm creating DataFrame from RDD of Rows there is no schema validation. So we can create schema with wrong types. {code} schema = StructType([StructField("id", IntegerType()), StructField("name", IntegerType())]) sqlCtx.createDataFrame(sc.parallelize([Row(id=1, name="abc")]), schema).collect() [Row(id=1, name=None)] {code} Even better. Column can change places. {code} schema = StructType([StructField("name", IntegerType())]) sqlCtx.createDataFrame(sc.parallelize([Row(id=1, name="abc")]), schema).collect() [Row(name=1)] {code} was (Author: maver1ck): [~davies] Are you sure that this patch is OK ? Right now if I'm creating DataFrame from RDD of Rows there is no schema validation. So we can create schema with wrong types. {code} from pyspark.sql.types import * schema = StructType([StructField("id", IntegerType()), StructField("name", IntegerType())]) sqlCtx.createDataFrame(sc.parallelize([Row(id=1, name="abc")]), schema).collect() [Row(id=1, name=None)] {code} Even better. Column can change places. {code} from pyspark.sql.types import * schema = StructType([StructField("name", IntegerType())]) sqlCtx.createDataFrame(sc.parallelize([Row(id=1, name="abc")]), schema).collect() [Row(name=1)] {code} > createDataFrame shouldn't .take() when provided schema > -- > > Key: SPARK-11437 > URL: https://issues.apache.org/jira/browse/SPARK-11437 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Jason White >Assignee: Jason White > Fix For: 1.6.0 > > > When creating a DataFrame from an RDD in PySpark, `createDataFrame` calls > `.take(10)` to verify the first 10 rows of the RDD match the provided schema. 
> Similar to https://issues.apache.org/jira/browse/SPARK-8070, but that issue > affected cases where a schema was not provided. > Verifying the first 10 rows is of limited utility and causes the DAG to be > executed non-lazily. If necessary, I believe this verification should be done > lazily on all rows. However, since the caller is providing a schema to > follow, I think it's acceptable to simply fail if the schema is incorrect. > https://github.com/apache/spark/blob/master/python/pyspark/sql/context.py#L321-L325 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
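The behaviour Maciej describes, where values silently become None or columns shift, is the kind of mismatch an eager type check of rows against the declared schema would surface. A simplified sketch of such validation in plain Python (the `validate_row` helper and its `[(name, type)]` schema shape are illustrative, not PySpark's API):

```python
def validate_row(row, schema):
    """Check a dict-like row against a [(name, type)] schema; raise on mismatch."""
    for name, expected in schema:
        if name not in row:
            raise ValueError(f"missing field {name!r}")
        if not isinstance(row[name], expected):
            raise TypeError(
                f"field {name!r}: expected {expected.__name__}, "
                f"got {type(row[name]).__name__}"
            )


# 'name' is wrongly declared as int, mirroring the report's example schema.
schema = [("id", int), ("name", int)]
try:
    validate_row({"id": 1, "name": "abc"}, schema)
except TypeError as e:
    print(e)  # field 'name': expected int, got str
```

Failing loudly like this matches the original ticket's position: if the caller provides a schema, it is acceptable to simply fail when the data does not conform, rather than silently coercing values to None.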
[jira] [Updated] (SPARK-12483) Data Frame as() does not work in Java
[ https://issues.apache.org/jira/browse/SPARK-12483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Davidson updated SPARK-12483: Attachment: SPARK_12483_unitTest.java add a unit test file > Data Frame as() does not work in Java > - > > Key: SPARK-12483 > URL: https://issues.apache.org/jira/browse/SPARK-12483 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 > Environment: Mac El Cap 10.11.2 > Java 8 >Reporter: Andrew Davidson > Attachments: SPARK_12483_unitTest.java > > > Following unit test demonstrates a bug in as(). The column name for aliasDF > was not changed >@Test > public void bugDataFrameAsTest() { > DataFrame df = createData(); > df.printSchema(); > df.show(); > > DataFrame aliasDF = df.select("id").as("UUID"); > aliasDF.printSchema(); > aliasDF.show(); > } > DataFrame createData() { > Features f1 = new Features(1, category1); > Features f2 = new Features(2, category2); > ArrayList data = new ArrayList(2); > data.add(f1); > data.add(f2); > //JavaRDD rdd = > javaSparkContext.parallelize(Arrays.asList(f1, f2)); > JavaRDD rdd = javaSparkContext.parallelize(data); > DataFrame df = sqlContext.createDataFrame(rdd, Features.class); > return df; > } > This is the output I got (without the spark log msgs) > root > |-- id: integer (nullable = false) > |-- labelStr: string (nullable = true) > +---++ > | id|labelStr| > +---++ > | 1| noise| > | 2|questionable| > +---++ > root > |-- id: integer (nullable = false) > +---+ > | id| > +---+ > | 1| > | 2| > +---+ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12484) DataFrame withColumn() does not work in Java
Andrew Davidson created SPARK-12484: --- Summary: DataFrame withColumn() does not work in Java Key: SPARK-12484 URL: https://issues.apache.org/jira/browse/SPARK-12484 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.2 Environment: mac El Cap. 10.11.2 Java 8 Reporter: Andrew Davidson DataFrame transformerdDF = df.withColumn(fieldName, newCol); raises org.apache.spark.sql.AnalysisException: resolved attribute(s) _c0#2 missing from id#0,labelStr#1 in operator !Project [id#0,labelStr#1,_c0#2 AS transformedByUDF#3]; at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49) at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49) at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44) at org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914) at org.apache.spark.sql.DataFrame.(DataFrame.scala:132) at org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154) at org.apache.spark.sql.DataFrame.select(DataFrame.scala:691) at org.apache.spark.sql.DataFrame.withColumn(DataFrame.scala:1150) at com.pws.fantasySport.ml.UDFTest.test(UDFTest.java:75) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12484) DataFrame withColumn() does not work in Java
[ https://issues.apache.org/jira/browse/SPARK-12484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15068684#comment-15068684 ] Andrew Davidson commented on SPARK-12484: - you can find some more back ground on the email thread 'should I file a bug? Re: trouble implementing Transformer and calling DataFrame.withColumn()' > DataFrame withColumn() does not work in Java > > > Key: SPARK-12484 > URL: https://issues.apache.org/jira/browse/SPARK-12484 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 > Environment: mac El Cap. 10.11.2 > Java 8 >Reporter: Andrew Davidson > Attachments: UDFTest.java > > > DataFrame transformerdDF = df.withColumn(fieldName, newCol); raises > org.apache.spark.sql.AnalysisException: resolved attribute(s) _c0#2 missing > from id#0,labelStr#1 in operator !Project [id#0,labelStr#1,_c0#2 AS > transformedByUDF#3]; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914) > at org.apache.spark.sql.DataFrame.(DataFrame.scala:132) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154) > at org.apache.spark.sql.DataFrame.select(DataFrame.scala:691) > at org.apache.spark.sql.DataFrame.withColumn(DataFrame.scala:1150) > at 
com.pws.fantasySport.ml.UDFTest.test(UDFTest.java:75) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12470) Incorrect calculation of row size in o.a.s.sql.catalyst.expressions.codegen.GenerateUnsafeRowJoiner
[ https://issues.apache.org/jira/browse/SPARK-12470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15068693#comment-15068693 ] Pete Robbins commented on SPARK-12470: -- I'm fairly sure the code in my PR is correct, but it is causing an ExchangeCoordinatorSuite test to fail. I'm struggling to see why this test is failing with the change I made. The failure is: determining the number of reducers: aggregate operator *** FAILED *** 3 did not equal 2 (ExchangeCoordinatorSuite.scala:316) Putting some debug into the test, I see that before my change the pre-shuffle partition sizes are 600, 600, 600, 600, 600 and after my change are 800, 800, 800, 800, 720, but I have no idea why. I'd really appreciate anyone with knowledge of this area a) checking my PR and b) helping explain the failing test. > Incorrect calculation of row size in > o.a.s.sql.catalyst.expressions.codegen.GenerateUnsafeRowJoiner > --- > > Key: SPARK-12470 > URL: https://issues.apache.org/jira/browse/SPARK-12470 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 >Reporter: Pete Robbins >Priority: Minor > > While looking into https://issues.apache.org/jira/browse/SPARK-12319 I > noticed that the row size is incorrectly calculated. > The "sizeReduction" value is calculated in words: >// The number of words we can reduce when we concat two rows together. > // The only reduction comes from merging the bitset portion of the two > rows, saving 1 word. > val sizeReduction = bitset1Words + bitset2Words - outputBitsetWords > but then it is subtracted from the size of the row in bytes: >|out.pointTo(buf, ${schema1.size + schema2.size}, sizeInBytes - > $sizeReduction); > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
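The units mismatch in the quoted snippet (a reduction counted in 8-byte words subtracted from a size in bytes) can be checked with plain arithmetic. The sketch below is not Spark's actual code; the field counts are hypothetical and the method names merely echo the identifiers in the report, assuming 64-bit bitset words:

```java
// Arithmetic sketch of the words-vs-bytes mismatch in SPARK-12470.
public class RowSizeDemo {
    // Bitset words needed to hold one null bit per field (64 bits per word).
    static int bitsetWords(int numFields) {
        return (numFields + 63) / 64;
    }

    // Reduction from merging the two rows' bitsets, measured in WORDS.
    static int sizeReductionWords(int fields1, int fields2) {
        return bitsetWords(fields1) + bitsetWords(fields2)
                - bitsetWords(fields1 + fields2);
    }

    public static void main(String[] args) {
        int reduction = sizeReductionWords(70, 10); // 2 + 1 - 2 = 1 word
        int sizeInBytes = 800;                      // hypothetical row size

        // Subtracting a word count from a byte count mixes units:
        int wrong = sizeInBytes - reduction;
        // Converting words to bytes first keeps the units consistent:
        int right = sizeInBytes - reduction * 8;

        System.out.println(wrong + " vs " + right);
    }
}
```

Each saved word is 8 bytes, so subtracting the raw word count understates the saving by 7 bytes per merged word; scaling by 8 before subtracting gives the intended byte count.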
[jira] [Assigned] (SPARK-12489) Fix minor issues found by Findbugs
[ https://issues.apache.org/jira/browse/SPARK-12489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12489: Assignee: (was: Apache Spark) > Fix minor issues found by Findbugs > -- > > Key: SPARK-12489 > URL: https://issues.apache.org/jira/browse/SPARK-12489 > Project: Spark > Issue Type: Bug > Components: MLlib, Spark Core, SQL >Reporter: Shixiong Zhu >Priority: Minor > > Just used FindBugs to scan the codes and fixed some real issues: > 1. Close `java.sql.Statement` > 2. Fix incorrect `asInstanceOf`. > 3. Remove unnecessary `synchronized` and `ReentrantLock`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12489) Fix minor issues found by Findbugs
[ https://issues.apache.org/jira/browse/SPARK-12489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12489: Assignee: Apache Spark > Fix minor issues found by Findbugs > -- > > Key: SPARK-12489 > URL: https://issues.apache.org/jira/browse/SPARK-12489 > Project: Spark > Issue Type: Bug > Components: MLlib, Spark Core, SQL >Reporter: Shixiong Zhu >Assignee: Apache Spark >Priority: Minor > > Just used FindBugs to scan the codes and fixed some real issues: > 1. Close `java.sql.Statement` > 2. Fix incorrect `asInstanceOf`. > 3. Remove unnecessary `synchronized` and `ReentrantLock`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
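The first FindBugs item above, closing `java.sql.Statement`, is typically addressed with try-with-resources. A sketch with a stand-in `AutoCloseable`, since no JDBC connection is available here; `FakeStatement` is illustrative, not Spark code:

```java
// Try-with-resources pattern for releasing a statement-like resource,
// even when the body throws. FakeStatement stands in for java.sql.Statement.
public class CloseDemo {
    static class FakeStatement implements AutoCloseable {
        boolean closed = false;
        @Override
        public void close() {
            closed = true;
        }
    }

    public static void main(String[] args) {
        FakeStatement st = new FakeStatement();
        try (FakeStatement s = st) {
            // execute queries here; close() runs on any exit path
        }
        System.out.println(st.closed);
    }
}
```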
[jira] [Created] (SPARK-12491) UDAF result differs in SQL if alias is used
Tristan created SPARK-12491: --- Summary: UDAF result differs in SQL if alias is used Key: SPARK-12491 URL: https://issues.apache.org/jira/browse/SPARK-12491 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.2 Reporter: Tristan Using the GeometricMean UDAF example (https://databricks.com/blog/2015/09/16/spark-1-5-dataframe-api-highlights-datetimestring-handling-time-intervals-and-udafs.html), I found the following discrepancy in results:

scala> sqlContext.sql("select group_id, gm(id) from simple group by group_id").show()
+--------+---+
|group_id|_c1|
+--------+---+
|       0|0.0|
|       1|0.0|
|       2|0.0|
+--------+---+

scala> sqlContext.sql("select group_id, gm(id) as GeometricMean from simple group by group_id").show()
+--------+-----------------+
|group_id|    GeometricMean|
+--------+-----------------+
|       0|8.981385496571725|
|       1|7.301716979342118|
|       2|7.706253151292568|
+--------+-----------------+

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
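For reference, the geometric mean in the linked blog example is exp(mean(log x)). A standalone sketch with hypothetical inputs; it helps judge the two results above, since a geometric mean of 0.0 is impossible for strictly positive inputs:

```java
// Reference implementation of the geometric mean: exp(mean(log x)).
// Inputs are hypothetical, not the `simple` table from the report.
public class GeoMeanDemo {
    static double geometricMean(double[] xs) {
        double logSum = 0.0;
        for (double x : xs) {
            logSum += Math.log(x); // assumes strictly positive inputs
        }
        return Math.exp(logSum / xs.length);
    }

    public static void main(String[] args) {
        double[] ids = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
        // For positive inputs the result is always > 0, which is what makes
        // the 0.0 values from the unaliased query suspicious.
        System.out.println(geometricMean(ids));
    }
}
```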
[jira] [Resolved] (SPARK-12102) Cast a non-nullable struct field to a nullable field during analysis
[ https://issues.apache.org/jira/browse/SPARK-12102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-12102. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 10156 [https://github.com/apache/spark/pull/10156] > Cast a non-nullable struct field to a nullable field during analysis > > > Key: SPARK-12102 > URL: https://issues.apache.org/jira/browse/SPARK-12102 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai > Fix For: 2.0.0 > > > If you try {{sqlContext.sql("select case when 1>0 then struct(1, 2, 3, > cast(hash(4) as int)) else struct(1, 2, 3, 4) end").printSchema}}, you will > see {{org.apache.spark.sql.AnalysisException: cannot resolve 'CASE WHEN (1 > > 0) THEN > struct(1,2,3,cast(HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFHash(4) > as int)) ELSE struct(1,2,3,4)' due to data type mismatch: THEN and ELSE > expressions should all be same type or coercible to a common type; line 1 pos > 85}}. > The problem is the nullability difference between {{4}} (non-nullable) and > {{hash(4)}} (nullable). > Seems it makes sense to cast the nullability in the analysis. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12102) Cast a non-nullable struct field to a nullable field during analysis
[ https://issues.apache.org/jira/browse/SPARK-12102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-12102: - Assignee: Dilip Biswal > Cast a non-nullable struct field to a nullable field during analysis > > > Key: SPARK-12102 > URL: https://issues.apache.org/jira/browse/SPARK-12102 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Assignee: Dilip Biswal > Fix For: 2.0.0 > > > If you try {{sqlContext.sql("select case when 1>0 then struct(1, 2, 3, > cast(hash(4) as int)) else struct(1, 2, 3, 4) end").printSchema}}, you will > see {{org.apache.spark.sql.AnalysisException: cannot resolve 'CASE WHEN (1 > > 0) THEN > struct(1,2,3,cast(HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFHash(4) > as int)) ELSE struct(1,2,3,4)' due to data type mismatch: THEN and ELSE > expressions should all be same type or coercible to a common type; line 1 pos > 85}}. > The problem is the nullability difference between {{4}} (non-nullable) and > {{hash(4)}} (nullable). > Seems it makes sense to cast the nullability in the analysis. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-12470) Incorrect calculation of row size in o.a.s.sql.catalyst.expressions.codegen.GenerateUnsafeRowJoiner
[ https://issues.apache.org/jira/browse/SPARK-12470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15068693#comment-15068693 ] Pete Robbins edited comment on SPARK-12470 at 12/22/15 9:47 PM: I'm fairly sure the code in my PR is correct, but it is causing an ExchangeCoordinatorSuite test to fail. I'm struggling to see why this test is failing with the change I made. The failure is: determining the number of reducers: aggregate operator *** FAILED *** 3 did not equal 2 (ExchangeCoordinatorSuite.scala:316) Putting some debug into the test, I see that before my change the pre-shuffle partition sizes are 600, 600, 600, 600, 600 and after my change are 800, 800, 800, 800, 720, but I have no idea why. I'd really appreciate anyone with knowledge of this area a) checking my PR and b) helping explain the failing test. EDIT: Please ignore. Merged with the latest head, including the changes for SPARK-12388; all tests now pass. was (Author: robbinspg): I'm fairly sure the code in my PR is correct, but it is causing an ExchangeCoordinatorSuite test to fail. I'm struggling to see why this test is failing with the change I made. The failure is: determining the number of reducers: aggregate operator *** FAILED *** 3 did not equal 2 (ExchangeCoordinatorSuite.scala:316) Putting some debug into the test, I see that before my change the pre-shuffle partition sizes are 600, 600, 600, 600, 600 and after my change are 800, 800, 800, 800, 720, but I have no idea why. I'd really appreciate anyone with knowledge of this area a) checking my PR and b) helping explain the failing test. 
> Incorrect calculation of row size in > o.a.s.sql.catalyst.expressions.codegen.GenerateUnsafeRowJoiner > --- > > Key: SPARK-12470 > URL: https://issues.apache.org/jira/browse/SPARK-12470 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 >Reporter: Pete Robbins >Priority: Minor > > While looking into https://issues.apache.org/jira/browse/SPARK-12319 I > noticed that the row size is incorrectly calculated. > The "sizeReduction" value is calculated in words: >// The number of words we can reduce when we concat two rows together. > // The only reduction comes from merging the bitset portion of the two > rows, saving 1 word. > val sizeReduction = bitset1Words + bitset2Words - outputBitsetWords > but then it is subtracted from the size of the row in bytes: >|out.pointTo(buf, ${schema1.size + schema2.size}, sizeInBytes - > $sizeReduction); > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12247) Documentation for spark.ml's ALS and collaborative filtering in general
[ https://issues.apache.org/jira/browse/SPARK-12247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15068908#comment-15068908 ] Timothy Hunter commented on SPARK-12247: It seems to me that the calculation of false positives is more relevant for the movie ratings, and that the RMSE right above it in the example is already a good example. What do you think? > Documentation for spark.ml's ALS and collaborative filtering in general > --- > > Key: SPARK-12247 > URL: https://issues.apache.org/jira/browse/SPARK-12247 > Project: Spark > Issue Type: Sub-task > Components: Documentation, MLlib >Affects Versions: 1.5.2 >Reporter: Timothy Hunter > > We need to add a section in the documentation about collaborative filtering > in the dataframe API: > - copy explanations about collaborative filtering and ALS from spark.mllib > - provide an example with spark.ml's ALS -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12487) Add docs for Kafka message handler
[ https://issues.apache.org/jira/browse/SPARK-12487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-12487. --- Resolution: Fixed Fix Version/s: 1.6.0 > Add docs for Kafka message handler > -- > > Key: SPARK-12487 > URL: https://issues.apache.org/jira/browse/SPARK-12487 > Project: Spark > Issue Type: Documentation > Components: Documentation >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > Fix For: 1.6.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12490) Don't use Javascript for web UI's paginated table navigation controls
Josh Rosen created SPARK-12490: -- Summary: Don't use Javascript for web UI's paginated table navigation controls Key: SPARK-12490 URL: https://issues.apache.org/jira/browse/SPARK-12490 Project: Spark Issue Type: Improvement Components: Web UI Reporter: Josh Rosen Assignee: Josh Rosen The web UI's paginated table uses Javascript to implement certain navigation controls, such as table sorting and the "go to page" form. This is unnecessary and should be simplified to use plain HTML form controls and links. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12490) Don't use Javascript for web UI's paginated table navigation controls
[ https://issues.apache.org/jira/browse/SPARK-12490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15068802#comment-15068802 ] Apache Spark commented on SPARK-12490: -- User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/10441 > Don't use Javascript for web UI's paginated table navigation controls > - > > Key: SPARK-12490 > URL: https://issues.apache.org/jira/browse/SPARK-12490 > Project: Spark > Issue Type: Improvement > Components: Web UI >Reporter: Josh Rosen >Assignee: Josh Rosen > > The web UI's paginated table uses Javascript to implement certain navigation > controls, such as table sorting and the "go to page" form. This is > unnecessary and should be simplified to use plain HTML form controls and > links. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12488) LDA describeTopics() Generates Invalid Term IDs
[ https://issues.apache.org/jira/browse/SPARK-12488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15068827#comment-15068827 ] Ilya Ganelin commented on SPARK-12488: -- Further investigation identifies the issue as stemming from the docTermVector containing zero-vectors (as in no words from the vocabulary present in the document). > LDA describeTopics() Generates Invalid Term IDs > --- > > Key: SPARK-12488 > URL: https://issues.apache.org/jira/browse/SPARK-12488 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.5.2 >Reporter: Ilya Ganelin > > When running the LDA model, and using the describeTopics function, invalid > values appear in the termID list that is returned: > The below example generates 10 topics on a data set with a vocabulary of 685. > {code} > // Set LDA parameters > val numTopics = 10 > val lda = new LDA().setK(numTopics).setMaxIterations(10) > val ldaModel = lda.run(docTermVector) > val distModel = > ldaModel.asInstanceOf[org.apache.spark.mllib.clustering.DistributedLDAModel] > {code} > {code} > scala> ldaModel.describeTopics()(0)._1.sorted.reverse > res40: Array[Int] = Array(2064860663, 2054149956, 1991041659, 1986948613, > 1962816105, 1858775243, 1842920256, 1799900935, 1792510791, 1792371944, > 1737877485, 1712816533, 1690397927, 1676379181, 1664181296, 1501782385, > 1274389076, 1260230987, 1226545007, 1213472080, 1068338788, 1050509279, > 714524034, 678227417, 678227086, 624763822, 624623852, 618552479, 616917682, > 551612860, 453929488, 371443786, 183302140, 58762039, 42599819, 9947563, 617, > 616, 615, 612, 603, 597, 596, 595, 594, 593, 592, 591, 590, 589, 588, 587, > 586, 585, 584, 583, 582, 581, 580, 579, 578, 577, 576, 575, 574, 573, 572, > 571, 570, 569, 568, 567, 566, 565, 564, 563, 562, 561, 560, 559, 558, 557, > 556, 555, 554, 553, 552, 551, 550, 549, 548, 547, 546, 545, 544, 543, 542, > 541, 540, 539, 538, 537, 536, 535, 534, 533, 532, 53... 
> {code} > {code} > scala> ldaModel.describeTopics()(0)._1.sorted > res41: Array[Int] = Array(-2087809139, -2001127319, -1979718998, -1833443915, > -1811530305, -1765302237, -1668096260, -1527422175, -1493838005, -1452770216, > -1452508395, -1452502074, -1452277147, -1451720206, -1450928740, -1450237612, > -1448730073, -1437852514, -1420883015, -1418557080, -1397997340, -1397995485, > -1397991169, -1374921919, -1360937376, -1360533511, -1320627329, -1314475604, > -1216400643, -1210734882, -1107065297, -1063529036, -1062984222, -1042985412, > -1009109620, -951707740, -894644371, -799531743, -627436045, -586317106, > -563544698, -326546674, -174108802, -155900771, -80887355, -78916591, > -26690004, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, > 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, > 38, 39, 40, 41, 42, 43, 44, 45, 4... > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
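The follow-up comment attributes the invalid term IDs to all-zero vectors in `docTermVector` (documents with no words from the vocabulary). A hypothetical mitigation is to drop empty documents before training; this sketch uses plain arrays rather than Spark's Vector type, and `hasTerms`/`filterEmptyDocs` are illustrative names, not Spark API:

```java
import java.util.ArrayList;
import java.util.List;

// Pre-filter sketch: remove all-zero documents from a term-count corpus
// before handing it to LDA, as a workaround for the issue described above.
public class ZeroVectorFilter {
    // True if the document has at least one non-zero term count.
    static boolean hasTerms(double[] termCounts) {
        for (double c : termCounts) {
            if (c != 0.0) return true;
        }
        return false;
    }

    // Keep only documents that contain at least one vocabulary term.
    static List<double[]> filterEmptyDocs(List<double[]> corpus) {
        List<double[]> kept = new ArrayList<>();
        for (double[] doc : corpus) {
            if (hasTerms(doc)) kept.add(doc);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<double[]> corpus = new ArrayList<>();
        corpus.add(new double[]{1, 0, 2}); // normal document
        corpus.add(new double[]{0, 0, 0}); // empty document: dropped
        System.out.println(filterEmptyDocs(corpus).size());
    }
}
```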
[jira] [Updated] (SPARK-12441) Fixing missingInput in all Logical/Physical operators
[ https://issues.apache.org/jira/browse/SPARK-12441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-12441: Summary: Fixing missingInput in all Logical/Physical operators (was: Fixing missingInput in Generate/MapPartitions/AppendColumns/MapGroups/CoGroup) > Fixing missingInput in all Logical/Physical operators > - > > Key: SPARK-12441 > URL: https://issues.apache.org/jira/browse/SPARK-12441 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2, 1.6.0 >Reporter: Xiao Li > > The value of missingInput in > Generate/MapPartitions/AppendColumns/MapGroups/CoGroup is incorrect. > {code} > val df = Seq((1, "a b c"), (2, "a b"), (3, "a")).toDF("number", "letters") > val df2 = > df.explode('letters) { > case Row(letters: String) => letters.split(" ").map(Tuple1(_)).toSeq > } > df2.explain(true) > {code} > {code} > == Parsed Logical Plan == > 'Generate UserDefinedGenerator('letters), true, false, None > +- Project [_1#0 AS number#2,_2#1 AS letters#3] >+- LocalRelation [_1#0,_2#1], [[1,a b c],[2,a b],[3,a]] > == Analyzed Logical Plan == > number: int, letters: string, _1: string > Generate UserDefinedGenerator(letters#3), true, false, None, [_1#8] > +- Project [_1#0 AS number#2,_2#1 AS letters#3] >+- LocalRelation [_1#0,_2#1], [[1,a b c],[2,a b],[3,a]] > == Optimized Logical Plan == > Generate UserDefinedGenerator(letters#3), true, false, None, [_1#8] > +- LocalRelation [number#2,letters#3], [[1,a b c],[2,a b],[3,a]] > == Physical Plan == > !Generate UserDefinedGenerator(letters#3), true, false, > [number#2,letters#3,_1#8] > +- LocalTableScan [number#2,letters#3], [[1,a b c],[2,a b],[3,a]] > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12482) Spark fileserver not started on same IP as configured in spark.driver.host
[ https://issues.apache.org/jira/browse/SPARK-12482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-12482. --- Resolution: Duplicate I reopened the other one and reclosed this. Please see Rares's comments there though. > Spark fileserver not started on same IP as configured in spark.driver.host > -- > > Key: SPARK-12482 > URL: https://issues.apache.org/jira/browse/SPARK-12482 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.2.1, 1.5.2 >Reporter: Kyle Sutton > > The issue of the file server using the default IP instead of the IP address > configured through {{spark.driver.host}} still exists in _Spark 1.5.2_ > The problem is that, while the file server is listening on all ports on the > file server host, the _Spark_ service attempts to call back to the default > port of the host, to which it may or may not have connectivity. > For instance, the following setup causes a > {{java.net.SocketTimeoutException}} when the _Spark_ service tries to contact > the _Spark_ driver host for a JAR: > * Driver host has a default IP of {{192.168.1.2}} and a secondary LAN > connection IP of {{172.30.0.2}} > * _Spark_ service is on the LAN with an IP of {{172.30.0.3}} > * A connection is made from the driver host to the _Spark_ service > ** {{spark.driver.host}} is set to the IP of the driver host on the LAN > {{172.30.0.2}} > ** {{spark.driver.port}} is set to {{50003}} > ** {{spark.fileserver.port}} is set to {{50005}} > * Locally (on the driver host), the following listeners are active: > ** {{0.0.0.0:50005}} > ** {{172.30.0.2:50003}} > * The _Spark_ service calls back to the file server host for a JAR file using > the driver host's default IP: {{http://192.168.1.2:50005/jars/code.jar}} > * The _Spark_ service, being on a different network than the driver host, > cannot see the {{192.168.1.0/24}} address space, and fails to connect to the > file server > ** A {{netstat}} on the _Spark_ service host will 
show the connection to the > file server host as being in {{SYN_SENT}} state until the process gives up > trying to connect > {code:title=Driver|borderStyle=solid} > SparkConf conf = new SparkConf() > .setMaster("spark://172.30.0.3:7077") > .setAppName("TestApp") > .set("spark.driver.host", "172.30.0.2") > .set("spark.driver.port", "50003") > .set("spark.fileserver.port", "50005"); > JavaSparkContext sc = new JavaSparkContext(conf); > sc.addJar("target/code.jar"); > {code} > {code:title=Stacktrace|borderStyle=solid} > 15/12/22 12:48:33 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, > 172.30.0.3): java.net.SocketTimeoutException: connect timed out > at java.net.PlainSocketImpl.socketConnect(Native Method) > at > java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350) > at > java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206) > at > java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188) > at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) > at java.net.Socket.connect(Socket.java:589) > at sun.net.NetworkClient.doConnect(NetworkClient.java:175) > at sun.net.www.http.HttpClient.openServer(HttpClient.java:432) > at sun.net.www.http.HttpClient.openServer(HttpClient.java:527) > at sun.net.www.http.HttpClient.(HttpClient.java:211) > at sun.net.www.http.HttpClient.New(HttpClient.java:308) > at sun.net.www.http.HttpClient.New(HttpClient.java:326) > at > sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1169) > at > sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1105) > at > sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:999) > at > sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:933) > at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:555) > at org.apache.spark.util.Utils$.fetchFile(Utils.scala:356) > at > 
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:405) > at > org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:397) > at > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) > at > scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) > at > scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) > at > scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226) > at
[jira] [Reopened] (SPARK-6476) Spark fileserver not started on same IP as using spark.driver.host
[ https://issues.apache.org/jira/browse/SPARK-6476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reopened SPARK-6476: -- > Spark fileserver not started on same IP as using spark.driver.host > -- > > Key: SPARK-6476 > URL: https://issues.apache.org/jira/browse/SPARK-6476 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.2.1 >Reporter: Rares Vernica > > I initially inquired about this here: > http://mail-archives.apache.org/mod_mbox/spark-user/201503.mbox/%3ccalq9kxcn2mwfnd4r4k0q+qh1ypwn3p8rgud1v6yrx9_05lv...@mail.gmail.com%3E > If the Spark driver host has multiple IPs and spark.driver.host is set to one > of them, I would expect the fileserver to start on the same IP. I checked > HttpServer and the jetty Server is started the default IP of the machine: > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/HttpServer.scala#L75 > Something like this might work instead: > {code:title=HttpServer.scala#L75} > val server = new Server(new InetSocketAddress(conf.get("spark.driver.host"), > 0)) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
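The suggested one-line change above builds the listen address from the configured `spark.driver.host` instead of letting the server bind to the machine's default interface. The same construction in plain Java, with loopback standing in for the configured host and no server actually started:

```java
import java.net.InetSocketAddress;

// Build a bind address from an explicitly configured host, as the
// HttpServer.scala suggestion above does, rather than binding to the
// default interface. "127.0.0.1" stands in for spark.driver.host here.
public class BindAddressDemo {
    static InetSocketAddress bindAddress(String driverHost) {
        // Port 0 asks the OS for an ephemeral port, as in the snippet above.
        return new InetSocketAddress(driverHost, 0);
    }

    public static void main(String[] args) {
        InetSocketAddress addr = bindAddress("127.0.0.1");
        System.out.println(addr.getAddress().getHostAddress());
    }
}
```

A server constructed from such an address listens only on the configured interface, so callbacks built from `spark.driver.host` always reach it.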
[jira] [Commented] (SPARK-12482) Spark fileserver not started on same IP as configured in spark.driver.host
[ https://issues.apache.org/jira/browse/SPARK-12482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15068565#comment-15068565 ] Kyle Sutton commented on SPARK-12482: - Actually, how do I reopen a ticket I didn't write? > Spark fileserver not started on same IP as configured in spark.driver.host > -- > > Key: SPARK-12482 > URL: https://issues.apache.org/jira/browse/SPARK-12482 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.2.1, 1.5.2 >Reporter: Kyle Sutton > > The issue of the file server using the default IP instead of the IP address > configured through {{spark.driver.host}} still exists in _Spark 1.5.2_ > The problem is that, while the file server is listening on all ports on the > file server host, the _Spark_ service attempts to call back to the default > port of the host, to which it may or may not have connectivity. > For instance, the following setup causes a > {{java.net.SocketTimeoutException}} when the _Spark_ service tries to contact > the _Spark_ driver host for a JAR: > * Driver host has a default IP of {{192.168.1.2}} and a secondary LAN > connection IP of {{172.30.0.2}} > * _Spark_ service is on the LAN with an IP of {{172.30.0.3}} > * A connection is made from the driver host to the _Spark_ service > ** {{spark.driver.host}} is set to the IP of the driver host on the LAN > {{172.30.0.2}} > ** {{spark.driver.port}} is set to {{50003}} > ** {{spark.fileserver.port}} is set to {{50005}} > * Locally (on the driver host), the following listeners are active: > ** {{0.0.0.0:50005}} > ** {{172.30.0.2:50003}} > * The _Spark_ service calls back to the file server host for a JAR file using > the driver host's default IP: {{http://192.168.1.2:50005/jars/code.jar}} > * The _Spark_ service, being on a different network than the driver host, > cannot see the {{192.168.1.0/24}} address space, and fails to connect to the > file server > ** A {{netstat}} on the _Spark_ service host will show the connection to 
the > file server host as being in {{SYN_SENT}} state until the process gives up > trying to connect > {code:title=Driver|borderStyle=solid} > SparkConf conf = new SparkConf() > .setMaster("spark://172.30.0.3:7077") > .setAppName("TestApp") > .set("spark.driver.host", "172.30.0.2") > .set("spark.driver.port", "50003") > .set("spark.fileserver.port", "50005"); > JavaSparkContext sc = new JavaSparkContext(conf); > sc.addJar("target/code.jar"); > {code} > {code:title=Stacktrace|borderStyle=solid} > 15/12/22 12:48:33 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, > 172.30.0.3): java.net.SocketTimeoutException: connect timed out > at java.net.PlainSocketImpl.socketConnect(Native Method) > at > java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350) > at > java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206) > at > java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188) > at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) > at java.net.Socket.connect(Socket.java:589) > at sun.net.NetworkClient.doConnect(NetworkClient.java:175) > at sun.net.www.http.HttpClient.openServer(HttpClient.java:432) > at sun.net.www.http.HttpClient.openServer(HttpClient.java:527) > at sun.net.www.http.HttpClient.(HttpClient.java:211) > at sun.net.www.http.HttpClient.New(HttpClient.java:308) > at sun.net.www.http.HttpClient.New(HttpClient.java:326) > at > sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1169) > at > sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1105) > at > sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:999) > at > sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:933) > at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:555) > at org.apache.spark.util.Utils$.fetchFile(Utils.scala:356) > at > 
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:405) > at > org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:397) > at > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) > at > scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) > at > scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) > at > scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226) > at
[jira] [Assigned] (SPARK-12462) Add ExpressionDescription to misc non-aggregate functions
[ https://issues.apache.org/jira/browse/SPARK-12462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12462: Assignee: (was: Apache Spark) > Add ExpressionDescription to misc non-aggregate functions > - > > Key: SPARK-12462 > URL: https://issues.apache.org/jira/browse/SPARK-12462 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12462) Add ExpressionDescription to misc non-aggregate functions
[ https://issues.apache.org/jira/browse/SPARK-12462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12462: Assignee: Apache Spark > Add ExpressionDescription to misc non-aggregate functions > - > > Key: SPARK-12462 > URL: https://issues.apache.org/jira/browse/SPARK-12462 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-12488) LDA describeTopics() Generates Invalid Term IDs
[ https://issues.apache.org/jira/browse/SPARK-12488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15068761#comment-15068761 ] Ilya Ganelin edited comment on SPARK-12488 at 12/22/15 9:32 PM: [~josephkb] Would love your feedback here. Thanks! was (Author: ilganeli): @jkbradley Would love your feedback here. Thanks! > LDA describeTopics() Generates Invalid Term IDs > --- > > Key: SPARK-12488 > URL: https://issues.apache.org/jira/browse/SPARK-12488 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.5.2 >Reporter: Ilya Ganelin > > When running the LDA model, and using the describeTopics function, invalid > values appear in the termID list that is returned: > The below example generates 10 topics on a data set with a vocabulary of 685. > {code} > // Set LDA parameters > val numTopics = 10 > val lda = new LDA().setK(numTopics).setMaxIterations(10) > val ldaModel = lda.run(docTermVector) > val distModel = > ldaModel.asInstanceOf[org.apache.spark.mllib.clustering.DistributedLDAModel] > {code} > {code} > scala> ldaModel.describeTopics()(0)._1.sorted.reverse > res40: Array[Int] = Array(2064860663, 2054149956, 1991041659, 1986948613, > 1962816105, 1858775243, 1842920256, 1799900935, 1792510791, 1792371944, > 1737877485, 1712816533, 1690397927, 1676379181, 1664181296, 1501782385, > 1274389076, 1260230987, 1226545007, 1213472080, 1068338788, 1050509279, > 714524034, 678227417, 678227086, 624763822, 624623852, 618552479, 616917682, > 551612860, 453929488, 371443786, 183302140, 58762039, 42599819, 9947563, 617, > 616, 615, 612, 603, 597, 596, 595, 594, 593, 592, 591, 590, 589, 588, 587, > 586, 585, 584, 583, 582, 581, 580, 579, 578, 577, 576, 575, 574, 573, 572, > 571, 570, 569, 568, 567, 566, 565, 564, 563, 562, 561, 560, 559, 558, 557, > 556, 555, 554, 553, 552, 551, 550, 549, 548, 547, 546, 545, 544, 543, 542, > 541, 540, 539, 538, 537, 536, 535, 534, 533, 532, 53... 
> {code} > {code} > scala> ldaModel.describeTopics()(0)._1.sorted > res41: Array[Int] = Array(-2087809139, -2001127319, -1979718998, -1833443915, > -1811530305, -1765302237, -1668096260, -1527422175, -1493838005, -1452770216, > -1452508395, -1452502074, -1452277147, -1451720206, -1450928740, -1450237612, > -1448730073, -1437852514, -1420883015, -1418557080, -1397997340, -1397995485, > -1397991169, -1374921919, -1360937376, -1360533511, -1320627329, -1314475604, > -1216400643, -1210734882, -1107065297, -1063529036, -1062984222, -1042985412, > -1009109620, -951707740, -894644371, -799531743, -627436045, -586317106, > -563544698, -326546674, -174108802, -155900771, -80887355, -78916591, > -26690004, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, > 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, > 38, 39, 40, 41, 42, 43, 44, 45, 4... > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
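What makes the listings above suspicious is that every valid term ID should be an index into the 685-term vocabulary, i.e. lie in [0, 685). Values like {{2064860663}} and the large negatives look like raw 32-bit hash values leaking through rather than vocabulary indices — a hypothesis about the symptom, not a confirmed root cause. A minimal Java sketch of the sanity check, using the non-negative-modulus idiom that hashing vectorizers such as Spark's {{HashingTF}} apply to map a hash into the vocabulary range (the helper name here is illustrative, not Spark's actual API):

```java
public class TermIdCheck {
    // Map an arbitrary 32-bit hash into [0, vocabSize). This mirrors the
    // non-negative-modulus idiom used by hashing vectorizers; the helper
    // name is illustrative, not Spark's actual API.
    static int nonNegativeMod(int x, int mod) {
        int rawMod = x % mod;
        return rawMod + (rawMod < 0 ? mod : 0);
    }

    public static void main(String[] args) {
        int vocabSize = 685; // vocabulary size from the report
        int[] reported = {2064860663, -2087809139, 617, 0};
        for (int id : reported) {
            // A valid term ID must satisfy 0 <= id < vocabSize; the first two
            // reported values fail this check, the last two pass it.
            boolean valid = id >= 0 && id < vocabSize;
            System.out.println(id + " valid=" + valid
                    + " asIndex=" + nonNegativeMod(id, vocabSize));
        }
    }
}
```

Running the check over the arrays returned by {{describeTopics()}} would separate genuine indices from whatever out-of-range values are being emitted.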
[jira] [Updated] (SPARK-12482) CLONE - Spark fileserver not started on same IP as using spark.driver.host
[ https://issues.apache.org/jira/browse/SPARK-12482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kyle Sutton updated SPARK-12482: Description: The issue of the file server starting up on the default IP instead of the IP address configured through {{spark.driver.host}} still exists in _Spark 1.5.2_. The problem is that, while the file server is listening on all interfaces of the file server host, the _Spark_ service attempts to call back to the host's default IP, to which it may or may not have connectivity. For instance, the following setup causes a {{java.net.SocketTimeoutException}} when the _Spark_ service tries to contact the _Spark_ driver host for a JAR: * Launch host has a default IP of {{192.168.1.2}} and a secondary LAN connection IP of {{172.30.0.2}} * A _Spark_ connection is made from the launch host to a service running on that LAN, {{172.30.0.0/16}} ** {{spark.driver.host}} is set to the IP of that _Spark_ service host {{172.30.0.3}} ** {{spark.driver.port}} is set to {{50003}} ** {{spark.fileserver.port}} is set to {{50005}} * Locally (on the launch host), the following listeners are active: ** {{0.0.0.0:50005}} ** {{172.30.0.2:50003}} * The _Spark_ service calls back to the file server host for a JAR file using the file server host's default IP: {{http://192.168.1.2:50005/jars/code.jar}} * The _Spark_ service, being on a different network than the launch host, cannot see the {{192.168.1.0/24}} address space, and fails to connect to the file server ** A {{netstat}} on the _Spark_ service host will show the connection to the file server host as being in {{SYN_SENT}} state until the process gives up trying to connect Given that the configured {{spark.driver.host}} must necessarily be accessible by the _Spark_ service, it makes sense to reuse it for fileserver connections and any other _Spark_ service -> driver host connections. 
Original report follows: I initially inquired about this here: http://mail-archives.apache.org/mod_mbox/spark-user/201503.mbox/%3ccalq9kxcn2mwfnd4r4k0q+qh1ypwn3p8rgud1v6yrx9_05lv...@mail.gmail.com%3E If the Spark driver host has multiple IPs and spark.driver.host is set to one of them, I would expect the fileserver to start on the same IP. I checked HttpServer and the jetty Server is started on the default IP of the machine: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/HttpServer.scala#L75 Something like this might work instead: {code:title=HttpServer.scala#L75} val server = new Server(new InetSocketAddress(conf.get("spark.driver.host"), 0)) {code} was: I initially inquired about this here: http://mail-archives.apache.org/mod_mbox/spark-user/201503.mbox/%3ccalq9kxcn2mwfnd4r4k0q+qh1ypwn3p8rgud1v6yrx9_05lv...@mail.gmail.com%3E If the Spark driver host has multiple IPs and spark.driver.host is set to one of them, I would expect the fileserver to start on the same IP. 
I checked HttpServer and the jetty Server is started on the default IP of the machine: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/HttpServer.scala#L75 Something like this might work instead: {code:title=HttpServer.scala#L75} val server = new Server(new InetSocketAddress(conf.get("spark.driver.host"), 0)) {code} > CLONE - Spark fileserver not started on same IP as using spark.driver.host > -- > > Key: SPARK-12482 > URL: https://issues.apache.org/jira/browse/SPARK-12482 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.2.1 >Reporter: Kyle Sutton > > The issue of the file server starting up on the default IP instead of the IP > address configured through {{spark.driver.host}} still exists in _Spark 1.5.2_. > The problem is that, while the file server is listening on all interfaces of > the file server host, the _Spark_ service attempts to call back to the host's > default IP, to which it may or may not have connectivity. > For instance, the following setup causes a > {{java.net.SocketTimeoutException}} when the _Spark_ service tries to contact > the _Spark_ driver host for a JAR: > * Launch host has a default IP of {{192.168.1.2}} and a secondary LAN > connection IP of {{172.30.0.2}} > * A _Spark_ connection is made from the launch host to a service running on > that LAN, {{172.30.0.0/16}} > ** {{spark.driver.host}} is set to the IP of that _Spark_ service host > {{172.30.0.3}} > ** {{spark.driver.port}} is set to {{50003}} > ** {{spark.fileserver.port}} is set to {{50005}} > * Locally (on the launch host), the following listeners are active: > ** {{0.0.0.0:50005}} > ** {{172.30.0.2:50003}} > * The _Spark_ service calls back to the file server host for a JAR file using > the file server host's default IP: {{http://192.168.1.2:50005/jars/code.jar}}
[jira] [Updated] (SPARK-12482) CLONE - Spark fileserver not started on same IP as using spark.driver.host
[ https://issues.apache.org/jira/browse/SPARK-12482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kyle Sutton updated SPARK-12482: Description: The issue of the file server using the default IP instead of the IP address configured through {{spark.driver.host}} still exists in _Spark 1.5.2_. The problem is that, while the file server is listening on all interfaces of the file server host, the _Spark_ service attempts to call back to the host's default IP, to which it may or may not have connectivity. For instance, the following setup causes a {{java.net.SocketTimeoutException}} when the _Spark_ service tries to contact the _Spark_ driver host for a JAR: * Launch host has a default IP of {{192.168.1.2}} and a secondary LAN connection IP of {{172.30.0.2}} * A _Spark_ connection is made from the launch host to a service running on that LAN, {{172.30.0.0/16}} ** {{spark.driver.host}} is set to the IP of that _Spark_ service host {{172.30.0.3}} ** {{spark.driver.port}} is set to {{50003}} ** {{spark.fileserver.port}} is set to {{50005}} * Locally (on the launch host), the following listeners are active: ** {{0.0.0.0:50005}} ** {{172.30.0.2:50003}} * The _Spark_ service calls back to the file server host for a JAR file using the file server host's default IP: {{http://192.168.1.2:50005/jars/code.jar}} * The _Spark_ service, being on a different network than the launch host, cannot see the {{192.168.1.0/24}} address space, and fails to connect to the file server ** A {{netstat}} on the _Spark_ service host will show the connection to the file server host as being in {{SYN_SENT}} state until the process gives up trying to connect Given that the configured {{spark.driver.host}} must necessarily be accessible by the _Spark_ service, it makes sense to reuse it for fileserver connections and any other _Spark_ service -> driver host connections. 
Original report follows: I initially inquired about this here: http://mail-archives.apache.org/mod_mbox/spark-user/201503.mbox/%3ccalq9kxcn2mwfnd4r4k0q+qh1ypwn3p8rgud1v6yrx9_05lv...@mail.gmail.com%3E If the Spark driver host has multiple IPs and spark.driver.host is set to one of them, I would expect the fileserver to start on the same IP. I checked HttpServer and the jetty Server is started on the default IP of the machine: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/HttpServer.scala#L75 Something like this might work instead: {code:title=HttpServer.scala#L75} val server = new Server(new InetSocketAddress(conf.get("spark.driver.host"), 0)) {code} was: The issue of the file server starting up on the default IP instead of the IP address configured through {{spark.driver.host}} still exists in _Spark 1.5.2_. The problem is that, while the file server is listening on all interfaces of the file server host, the _Spark_ service attempts to call back to the host's default IP, to which it may or may not have connectivity. 
For instance, the following setup causes a {{java.net.SocketTimeoutException}} when the _Spark_ service tries to contact the _Spark_ driver host for a JAR: * Launch host has a default IP of {{192.168.1.2}} and a secondary LAN connection IP of {{172.30.0.2}} * A _Spark_ connection is made from the launch host to a service running on that LAN, {{172.30.0.0/16}} ** {{spark.driver.host}} is set to the IP of that _Spark_ service host {{172.30.0.3}} ** {{spark.driver.port}} is set to {{50003}} ** {{spark.fileserver.port}} is set to {{50005}} * Locally (on the launch host), the following listeners are active: ** {{0.0.0.0:50005}} ** {{172.30.0.2:50003}} * The _Spark_ service calls back to the file server host for a JAR file using the file server host's default IP: {{http://192.168.1.2:50005/jars/code.jar}} * The _Spark_ service, being on a different network than the launch host, cannot see the {{192.168.1.0/24}} address space, and fails to connect to the file server ** A {{netstat}} on the _Spark_ service host will show the connection to the file server host as being in {{SYN_SENT}} state until the process gives up trying to connect Given that the configured {{spark.driver.host}} must necessarily be accessible by the _Spark_ service, it makes sense to reuse it for fileserver connections and any other _Spark_ service -> driver host connections. Original report follows: I initially inquired about this here: http://mail-archives.apache.org/mod_mbox/spark-user/201503.mbox/%3ccalq9kxcn2mwfnd4r4k0q+qh1ypwn3p8rgud1v6yrx9_05lv...@mail.gmail.com%3E If the Spark driver host has multiple IPs and spark.driver.host is set to one of them, I would expect the fileserver to start on the same IP. I checked HttpServer and the jetty Server is started on the default IP of the machine: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/HttpServer.scala#L75 Something like this might work instead: {code:title=HttpServer.scala#L75} val server = new Server(new InetSocketAddress(conf.get("spark.driver.host"), 0)) {code}
[jira] [Updated] (SPARK-12482) CLONE - Spark fileserver not started on same IP as using spark.driver.host
[ https://issues.apache.org/jira/browse/SPARK-12482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kyle Sutton updated SPARK-12482: Affects Version/s: 1.5.2 > CLONE - Spark fileserver not started on same IP as using spark.driver.host > -- > > Key: SPARK-12482 > URL: https://issues.apache.org/jira/browse/SPARK-12482 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.2.1, 1.5.2 >Reporter: Kyle Sutton > > The issue of the file server using the default IP instead of the IP address > configured through {{spark.driver.host}} still exists in _Spark 1.5.2_. > The problem is that, while the file server is listening on all interfaces of > the file server host, the _Spark_ service attempts to call back to the host's > default IP, to which it may or may not have connectivity. > For instance, the following setup causes a > {{java.net.SocketTimeoutException}} when the _Spark_ service tries to contact > the _Spark_ driver host for a JAR: > * Launch host has a default IP of {{192.168.1.2}} and a secondary LAN > connection IP of {{172.30.0.2}} > * A _Spark_ connection is made from the launch host to a service running on > that LAN, {{172.30.0.0/16}} > ** {{spark.driver.host}} is set to the IP of that _Spark_ service host > {{172.30.0.3}} > ** {{spark.driver.port}} is set to {{50003}} > ** {{spark.fileserver.port}} is set to {{50005}} > * Locally (on the launch host), the following listeners are active: > ** {{0.0.0.0:50005}} > ** {{172.30.0.2:50003}} > * The _Spark_ service calls back to the file server host for a JAR file using > the file server host's default IP: {{http://192.168.1.2:50005/jars/code.jar}} > * The _Spark_ service, being on a different network than the launch host, > cannot see the {{192.168.1.0/24}} address space, and fails to connect to the > file server > ** A {{netstat}} on the _Spark_ service host will show the connection to the > file server host as being in {{SYN_SENT}} state until the process gives up > 
trying to connect > Given that the configured {{spark.driver.host}} must necessarily be > accessible by the _Spark_ service, it makes sense to reuse it for fileserver > connections and any other _Spark_ service -> driver host connections. > > Original report follows: > I initially inquired about this here: > http://mail-archives.apache.org/mod_mbox/spark-user/201503.mbox/%3ccalq9kxcn2mwfnd4r4k0q+qh1ypwn3p8rgud1v6yrx9_05lv...@mail.gmail.com%3E > If the Spark driver host has multiple IPs and spark.driver.host is set to one > of them, I would expect the fileserver to start on the same IP. I checked > HttpServer and the jetty Server is started on the default IP of the machine: > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/HttpServer.scala#L75 > Something like this might work instead: > {code:title=HttpServer.scala#L75} > val server = new Server(new InetSocketAddress(conf.get("spark.driver.host"), > 0)) > {code}
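The suggested one-line fix binds Jetty to the configured driver host instead of the wildcard address. The same binding difference can be sketched with plain {{java.net.ServerSocket}} (no Jetty dependency; this illustrates the bind semantics only and is not the actual HttpServer code):

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

public class BindDemo {
    // Bind an ephemeral-port listener to one specific local address,
    // analogous to new Server(new InetSocketAddress(conf.get("spark.driver.host"), 0)),
    // rather than the wildcard 0.0.0.0 an unconfigured server falls back to.
    static ServerSocket bindTo(InetAddress host) throws IOException {
        ServerSocket ss = new ServerSocket();
        ss.bind(new InetSocketAddress(host, 0)); // port 0 = OS-chosen ephemeral port
        return ss;
    }

    public static void main(String[] args) throws IOException {
        try (ServerSocket ss = bindTo(InetAddress.getLoopbackAddress())) {
            // The listener now accepts connections on exactly this address,
            // and this is also the address peers should be told to dial.
            System.out.println(ss.getLocalSocketAddress());
        }
    }
}
```

The second point is the crux of the bug report: the address advertised to the _Spark_ service must be the one the listener was bound to, which is why reusing {{spark.driver.host}} for the file server URL closes the gap.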
[jira] [Updated] (SPARK-12482) CLONE - Spark fileserver not started on same IP as using spark.driver.host
[ https://issues.apache.org/jira/browse/SPARK-12482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kyle Sutton updated SPARK-12482: Description: The issue of the file server using the default IP instead of the IP address configured through {{spark.driver.host}} still exists in _Spark 1.5.2_. The problem is that, while the file server is listening on all interfaces of the file server host, the _Spark_ service attempts to call back to the host's default IP, to which it may or may not have connectivity. For instance, the following setup causes a {{java.net.SocketTimeoutException}} when the _Spark_ service tries to contact the _Spark_ driver host for a JAR: * Launch host has a default IP of {{192.168.1.2}} and a secondary LAN connection IP of {{172.30.0.2}} * A _Spark_ connection is made from the launch host to a service running on that LAN, {{172.30.0.0/16}} ** {{spark.driver.host}} is set to the IP of that _Spark_ service host {{172.30.0.3}} ** {{spark.driver.port}} is set to {{50003}} ** {{spark.fileserver.port}} is set to {{50005}} * Locally (on the launch host), the following listeners are active: ** {{0.0.0.0:50005}} ** {{172.30.0.2:50003}} * The _Spark_ service calls back to the file server host for a JAR file using the file server host's default IP: {{http://192.168.1.2:50005/jars/code.jar}} * The _Spark_ service, being on a different network than the launch host, cannot see the {{192.168.1.0/24}} address space, and fails to connect to the file server ** A {{netstat}} on the _Spark_ service host will show the connection to the file server host as being in {{SYN_SENT}} state until the process gives up trying to connect Given that the configured {{spark.driver.host}} must necessarily be accessible by the _Spark_ service, it makes sense to reuse it for fileserver connections and any other _Spark_ service -> driver host connections. 
\\ \\ Original report follows: I initially inquired about this here: http://mail-archives.apache.org/mod_mbox/spark-user/201503.mbox/%3ccalq9kxcn2mwfnd4r4k0q+qh1ypwn3p8rgud1v6yrx9_05lv...@mail.gmail.com%3E If the Spark driver host has multiple IPs and spark.driver.host is set to one of them, I would expect the fileserver to start on the same IP. I checked HttpServer and the jetty Server is started on the default IP of the machine: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/HttpServer.scala#L75 Something like this might work instead: {code:title=HttpServer.scala#L75} val server = new Server(new InetSocketAddress(conf.get("spark.driver.host"), 0)) {code} was: The issue of the file server using the default IP instead of the IP address configured through {{spark.driver.host}} still exists in _Spark 1.5.2_. The problem is that, while the file server is listening on all interfaces of the file server host, the _Spark_ service attempts to call back to the host's default IP, to which it may or may not have connectivity. 
For instance, the following setup causes a {{java.net.SocketTimeoutException}} when the _Spark_ service tries to contact the _Spark_ driver host for a JAR: * Launch host has a default IP of {{192.168.1.2}} and a secondary LAN connection IP of {{172.30.0.2}} * A _Spark_ connection is made from the launch host to a service running on that LAN, {{172.30.0.0/16}} ** {{spark.driver.host}} is set to the IP of that _Spark_ service host {{172.30.0.3}} ** {{spark.driver.port}} is set to {{50003}} ** {{spark.fileserver.port}} is set to {{50005}} * Locally (on the launch host), the following listeners are active: ** {{0.0.0.0:50005}} ** {{172.30.0.2:50003}} * The _Spark_ service calls back to the file server host for a JAR file using the file server host's default IP: {{http://192.168.1.2:50005/jars/code.jar}} * The _Spark_ service, being on a different network than the launch host, cannot see the {{192.168.1.0/24}} address space, and fails to connect to the file server ** A {{netstat}} on the _Spark_ service host will show the connection to the file server host as being in {{SYN_SENT}} state until the process gives up trying to connect Given that the configured {{spark.driver.host}} must necessarily be accessible by the _Spark_ service, it makes sense to reuse it for fileserver connections and any other _Spark_ service -> driver host connections. Original report follows: I initially inquired about this here: http://mail-archives.apache.org/mod_mbox/spark-user/201503.mbox/%3ccalq9kxcn2mwfnd4r4k0q+qh1ypwn3p8rgud1v6yrx9_05lv...@mail.gmail.com%3E If the Spark driver host has multiple IPs and spark.driver.host is set to one of them, I would expect the fileserver to start on the same IP. I checked HttpServer and the jetty Server is started on the default IP of the machine: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/HttpServer.scala#L75 Something like this might work instead: {code:title=HttpServer.scala#L75} val server = new Server(new InetSocketAddress(conf.get("spark.driver.host"), 0)) {code}
[jira] [Updated] (SPARK-12482) CLONE - Spark fileserver not started on same IP as using spark.driver.host
[ https://issues.apache.org/jira/browse/SPARK-12482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kyle Sutton updated SPARK-12482: Description: The issue of the file server using the default IP instead of the IP address configured through {{spark.driver.host}} still exists in _Spark 1.5.2_. The problem is that, while the file server is listening on all interfaces of the file server host, the _Spark_ service attempts to call back to the host's default IP, to which it may or may not have connectivity. For instance, the following setup causes a {{java.net.SocketTimeoutException}} when the _Spark_ service tries to contact the _Spark_ driver host for a JAR: * Launch host has a default IP of {{192.168.1.2}} and a secondary LAN connection IP of {{172.30.0.2}} * A _Spark_ connection is made from the launch host to a service running on that LAN, {{172.30.0.0/16}} ** {{spark.driver.host}} is set to the IP of that _Spark_ service host {{172.30.0.3}} ** {{spark.driver.port}} is set to {{50003}} ** {{spark.fileserver.port}} is set to {{50005}} * Locally (on the launch host), the following listeners are active: ** {{0.0.0.0:50005}} ** {{172.30.0.2:50003}} * The _Spark_ service calls back to the file server host for a JAR file using the file server host's default IP: {{http://192.168.1.2:50005/jars/code.jar}} * The _Spark_ service, being on a different network than the launch host, cannot see the {{192.168.1.0/24}} address space, and fails to connect to the file server ** A {{netstat}} on the _Spark_ service host will show the connection to the file server host as being in {{SYN_SENT}} state until the process gives up trying to connect Given that the configured {{spark.driver.host}} must necessarily be accessible by the _Spark_ service, it makes sense to reuse it for fileserver connections and any other _Spark_ service -> driver host connections. 
Original report follows: I initially inquired about this here: http://mail-archives.apache.org/mod_mbox/spark-user/201503.mbox/%3ccalq9kxcn2mwfnd4r4k0q+qh1ypwn3p8rgud1v6yrx9_05lv...@mail.gmail.com%3E If the Spark driver host has multiple IPs and spark.driver.host is set to one of them, I would expect the fileserver to start on the same IP. I checked HttpServer and the jetty Server is started on the default IP of the machine: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/HttpServer.scala#L75 Something like this might work instead: {code:title=HttpServer.scala#L75} val server = new Server(new InetSocketAddress(conf.get("spark.driver.host"), 0)) {code} was: The issue of the file server using the default IP instead of the IP address configured through {{spark.driver.host}} still exists in _Spark 1.5.2_. The problem is that, while the file server is listening on all interfaces of the file server host, the _Spark_ service attempts to call back to the host's default IP, to which it may or may not have connectivity. 
For instance, the following setup causes a {{java.net.SocketTimeoutException}} when the _Spark_ service tries to contact the _Spark_ driver host for a JAR: * Launch host has a default IP of {{192.168.1.2}} and a secondary LAN connection IP of {{172.30.0.2}} * A _Spark_ connection is made from the launch host to a service running on that LAN, {{172.30.0.0/16}} ** {{spark.driver.host}} is set to the IP of that _Spark_ service host {{172.30.0.3}} ** {{spark.driver.port}} is set to {{50003}} ** {{spark.fileserver.port}} is set to {{50005}} * Locally (on the launch host), the following listeners are active: ** {{0.0.0.0:50005}} ** {{172.30.0.2:50003}} * The _Spark_ service calls back to the file server host for a JAR file using the file server host's default IP: {{http://192.168.1.2:50005/jars/code.jar}} * The _Spark_ service, being on a different network than the launch host, cannot see the {{192.168.1.0/24}} address space, and fails to connect to the file server ** A {{netstat}} on the _Spark_ service host will show the connection to the file server host as being in {{SYN_SENT}} state until the process gives up trying to connect Given that the configured {{spark.driver.host}} must necessarily be accessible by the _Spark_ service, it makes sense to reuse it for fileserver connections and any other _Spark_ service -> driver host connections. Original report follows: I initially inquired about this here: http://mail-archives.apache.org/mod_mbox/spark-user/201503.mbox/%3ccalq9kxcn2mwfnd4r4k0q+qh1ypwn3p8rgud1v6yrx9_05lv...@mail.gmail.com%3E If the Spark driver host has multiple IPs and spark.driver.host is set to one of them, I would expect the fileserver to start on the same IP. I checked HttpServer and the jetty Server is started on the default IP of the machine: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/HttpServer.scala#L75 Something like this might work instead: {code:title=HttpServer.scala#L75} val server = new Server(new InetSocketAddress(conf.get("spark.driver.host"), 0)) {code}
[jira] [Assigned] (SPARK-12479) sparkR collect on GroupedData throws R error "missing value where TRUE/FALSE needed"
[ https://issues.apache.org/jira/browse/SPARK-12479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12479: Assignee: (was: Apache Spark) > sparkR collect on GroupedData throws R error "missing value where > TRUE/FALSE needed" > -- > > Key: SPARK-12479 > URL: https://issues.apache.org/jira/browse/SPARK-12479 > Project: Spark > Issue Type: Bug > Components: R, SparkR >Affects Versions: 1.5.1 >Reporter: Paulo Magalhaes > > sparkR collect on GroupedData throws "missing value where TRUE/FALSE needed" > Spark Version: 1.5.1 > R Version: 3.2.2 > I tracked down the root cause of this exception to a specific key for which > the hashCode could not be calculated. > The following code recreates the problem when run in sparkR: > hashCode <- getFromNamespace("hashCode","SparkR") > hashCode("bc53d3605e8a5b7de1e8e271c2317645") > Error in if (value > .Machine$integer.max) { : > missing value where TRUE/FALSE needed > I went one step further and realised that the problem happens because of the > bitwise shift below returning NA: > bitwShiftL(-1073741824,1) > where bitwShiftL is an R function. > I believe the bitwShiftL function is working as it is supposed to. Therefore, > this PR fixes it in the SparkR package: > https://github.com/apache/spark/pull/10436
[jira] [Commented] (SPARK-12479) sparkR collect on GroupedData throws R error "missing value where TRUE/FALSE needed"
[ https://issues.apache.org/jira/browse/SPARK-12479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15069092#comment-15069092 ] Apache Spark commented on SPARK-12479: -- User 'paulomagalhaes' has created a pull request for this issue: https://github.com/apache/spark/pull/10436 > sparkR collect on GroupedData throws R error "missing value where > TRUE/FALSE needed" > -- > > Key: SPARK-12479 > URL: https://issues.apache.org/jira/browse/SPARK-12479 > Project: Spark > Issue Type: Bug > Components: R, SparkR >Affects Versions: 1.5.1 >Reporter: Paulo Magalhaes > > sparkR collect on GroupedData throws "missing value where TRUE/FALSE needed" > Spark Version: 1.5.1 > R Version: 3.2.2 > I tracked down the root cause of this exception to a specific key for which > the hashCode could not be calculated. > The following code recreates the problem when run in sparkR: > hashCode <- getFromNamespace("hashCode","SparkR") > hashCode("bc53d3605e8a5b7de1e8e271c2317645") > Error in if (value > .Machine$integer.max) { : > missing value where TRUE/FALSE needed > I went one step further and realised that the problem happens because of the > bitwise shift below returning NA: > bitwShiftL(-1073741824,1) > where bitwShiftL is an R function. > I believe the bitwShiftL function is working as it is supposed to. Therefore, > this PR fixes it in the SparkR package: > https://github.com/apache/spark/pull/10436
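The failure mode is easiest to see by contrast with Java's 32-bit arithmetic: SparkR's {{hashCode}} appears to mirror Java's {{String.hashCode}} (h = 31*h + c), but R's {{bitwShiftL}} returns {{NA}} when a shift leaves 32-bit integer range, whereas Java simply wraps. A small Java sketch of the wrapping behaviour (the reimplemented {{hash}} below matches {{String.hashCode}} by construction; the R-side details are as described in the report):

```java
public class HashShift {
    // Java-style 32-bit string hash: h = 31*h + c, wrapping modulo 2^32 at
    // every step. This is the behaviour a faithful R port must emulate, and
    // the step where R's bitwShiftL instead yields NA on overflow.
    static int hash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i); // silently wraps on overflow
        }
        return h;
    }

    public static void main(String[] args) {
        // The exact shift that returned NA in R wraps to Integer.MIN_VALUE here:
        // 0xC0000000 << 1 == 0x80000000.
        System.out.println(-1073741824 << 1); // -2147483648
        // The key from the bug report hashes without error under wrapping arithmetic.
        System.out.println(hash("bc53d3605e8a5b7de1e8e271c2317645"));
    }
}
```

Because R has no wrapping 32-bit integers, the SparkR fix must detect or avoid the out-of-range intermediate rather than rely on overflow semantics.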
[jira] [Commented] (SPARK-12385) Push projection into Join
[ https://issues.apache.org/jira/browse/SPARK-12385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15069133#comment-15069133 ] Xiao Li commented on SPARK-12385: - I will work on it, if nobody has started it. Thanks! > Push projection into Join > - > > Key: SPARK-12385 > URL: https://issues.apache.org/jira/browse/SPARK-12385 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu > > We usually have a Join followed by a projection to prune some columns, but > Join already has a result projection to produce UnsafeRow; we should combine > them together.
[jira] [Resolved] (SPARK-11164) Add InSet pushdown filter back for Parquet
[ https://issues.apache.org/jira/browse/SPARK-11164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-11164. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 10278 [https://github.com/apache/spark/pull/10278] > Add InSet pushdown filter back for Parquet > -- > > Key: SPARK-11164 > URL: https://issues.apache.org/jira/browse/SPARK-11164 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh >Assignee: Xiao Li > Fix For: 2.0.0 > >
[jira] [Updated] (SPARK-11164) Add InSet pushdown filter back for Parquet
[ https://issues.apache.org/jira/browse/SPARK-11164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-11164: --- Assignee: Xiao Li > Add InSet pushdown filter back for Parquet > -- > > Key: SPARK-11164 > URL: https://issues.apache.org/jira/browse/SPARK-11164 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh >Assignee: Xiao Li >
[jira] [Created] (SPARK-12492) SQL page of Spark-sql is always blank
meiyoula created SPARK-12492: Summary: SQL page of Spark-sql is always blank Key: SPARK-12492 URL: https://issues.apache.org/jira/browse/SPARK-12492 Project: Spark Issue Type: Bug Components: SQL Reporter: meiyoula When I run a SQL query in spark-sql, the Execution page of the SQL tab is always blank, but it is not blank for the JDBCServer.