[jira] [Commented] (SPARK-6006) Optimize count distinct in case of high cardinality columns
[ https://issues.apache.org/jira/browse/SPARK-6006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14336430#comment-14336430 ] Apache Spark commented on SPARK-6006: - User 'saucam' has created a pull request for this issue: https://github.com/apache/spark/pull/4764 Optimize count distinct in case of high cardinality columns --- Key: SPARK-6006 URL: https://issues.apache.org/jira/browse/SPARK-6006 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.1, 1.2.1 Reporter: Yash Datta Priority: Minor Fix For: 1.3.0 In case there are a lot of distinct values, count distinct becomes too slow since it tries to hash partial results to one map. It can be improved by creating buckets/partial maps in an intermediate stage where same key from multiple partial maps of first stage hash to the same bucket. Later we can sum the size of these buckets to get total distinct count. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
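To make the bucketing idea above concrete, here is a minimal standalone sketch of a two-stage distinct count in plain RDD terms. It assumes a local SparkContext and a synthetic key column, and it only illustrates the scheme (partial distinct per partition, then re-bucketing so equal keys meet in the same bucket); it is not the change made in the pull request.

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

object BucketedDistinctCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("bucketed-distinct").setMaster("local[*]"))
    val values = sc.parallelize(1 to 1000000).map(i => s"key-${i % 250000}")  // synthetic high-cardinality column
    val numBuckets = 200

    val distinctCount = values
      .mapPartitions(_.toSet.iterator)                          // stage 1: partial distinct per partition
      .map(k => ((k.hashCode & Int.MaxValue) % numBuckets, k))  // equal keys always land in the same bucket
      .groupByKey(numBuckets)                                   // intermediate stage: one partial map per bucket
      .map { case (_, keys) => keys.toSet.size.toLong }         // distinct size of each bucket
      .sum()                                                    // total distinct count = sum of bucket sizes

    println(distinctCount)  // 250000.0 for the synthetic data above
    sc.stop()
  }
}
{code}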
[jira] [Commented] (SPARK-5983) Don't respond to HTTP TRACE in HTTP-based UIs
[ https://issues.apache.org/jira/browse/SPARK-5983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14336440#comment-14336440 ] Apache Spark commented on SPARK-5983: - User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/4765 Don't respond to HTTP TRACE in HTTP-based UIs - Key: SPARK-5983 URL: https://issues.apache.org/jira/browse/SPARK-5983 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Sean Owen Priority: Minor This was flagged a while ago during a routine security scan: the HTTP-based Spark services respond to an HTTP TRACE command. This is basically an HTTP verb that has no practical use, and has a pretty theoretical chance of being an exploit vector. It is flagged as a security issue by one common tool, however. Spark's HTTP services are based on Jetty, which by default does not enable TRACE (like Tomcat). However, the services do reply to TRACE requests. I think it is because the use of Jetty is pretty 'raw' and does not enable much of the default additional configuration you might get by using Jetty as a standalone server. I know that it is at least possible to stop the reply to TRACE with a few extra lines of code, so I think it is worth shutting off TRACE requests. Although the security risk is quite theoretical, it should be easy to fix and bring the Spark services into line with the common default of HTTP servers today. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
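For readers unfamiliar with how this is usually done, the following is a minimal sketch of shutting off TRACE with Jetty's constraint machinery. The class names are from the standard (unshaded) Jetty security API; this illustrates the general technique, not necessarily the exact change made in the pull request.

{code}
import org.eclipse.jetty.security.{ConstraintMapping, ConstraintSecurityHandler}
import org.eclipse.jetty.servlet.ServletContextHandler
import org.eclipse.jetty.util.security.Constraint

// A constraint that requires authentication but grants no roles; mapping it to the
// TRACE method makes Jetty reject TRACE requests on every path.
def disableTrace(handler: ServletContextHandler): Unit = {
  val constraint = new Constraint()
  constraint.setName("Disable TRACE")
  constraint.setAuthenticate(true)

  val mapping = new ConstraintMapping()
  mapping.setConstraint(constraint)
  mapping.setMethod("TRACE")
  mapping.setPathSpec("/")

  val security = new ConstraintSecurityHandler()
  security.addConstraintMapping(mapping)
  handler.setSecurityHandler(security)
}
{code}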
[jira] [Created] (SPARK-6007) Add numRows param in DataFrame.show
Jacky Li created SPARK-6007: --- Summary: Add numRows param in DataFrame.show Key: SPARK-6007 URL: https://issues.apache.org/jira/browse/SPARK-6007 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.1 Reporter: Jacky Li Priority: Minor Fix For: 1.3.0 Currently, DataFrame.show always displays the first 20 rows; it would be useful if the user could decide how many rows to show by passing the count as a parameter to show() -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
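A small usage sketch of what the request amounts to, against the Spark 1.3-era DataFrame API; the exact signature merged for this issue may differ, and the example table is made up.

{code}
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

val sc = new SparkContext("local[*]", "show-numrows")
val sqlContext = new SQLContext(sc)

// Hypothetical example table, just to have something to display.
val df = sqlContext.createDataFrame((1 to 100).map(i => (i, s"name-$i"))).toDF("id", "name")

df.show()    // current behaviour: always prints the first 20 rows
df.show(50)  // requested: the caller decides how many rows to print
{code}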
[jira] [Resolved] (SPARK-5666) Improvements in Mqtt Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-5666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-5666. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 4178 [https://github.com/apache/spark/pull/4178] Improvements in Mqtt Spark Streaming - Key: SPARK-5666 URL: https://issues.apache.org/jira/browse/SPARK-5666 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Prabeesh K Priority: Minor Fix For: 1.4.0 Clean up the source code related to Mqtt Spark Streaming to adhere to accepted coding standards. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6005) Flaky test: o.a.s.streaming.kafka.DirectKafkaStreamSuite.offset recovery
[ https://issues.apache.org/jira/browse/SPARK-6005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14336542#comment-14336542 ] Iulian Dragos commented on SPARK-6005: -- Looks similar, but unless I miss something, that fix didn't fix this one. The failure I report is from a recent build that includes the fixes in 3912d332464dcd124c60b734724c34d9742466a4 Flaky test: o.a.s.streaming.kafka.DirectKafkaStreamSuite.offset recovery Key: SPARK-6005 URL: https://issues.apache.org/jira/browse/SPARK-6005 Project: Spark Issue Type: Bug Components: Streaming Reporter: Iulian Dragos Labels: flaky-test, kafka, streaming [Link to failing test on Jenkins|https://ci.typesafe.com/view/Spark/job/spark-nightly-build/lastCompletedBuild/testReport/org.apache.spark.streaming.kafka/DirectKafkaStreamSuite/offset_recovery/] {code} The code passed to eventually never returned normally. Attempted 208 times over 10.00622791 seconds. Last failure message: strings.forall({ ((elem: Any) = DirectKafkaStreamSuite.collectedData.contains(elem)) }) was false. {code} {code:title=Stack trace} sbt.ForkMain$ForkError: The code passed to eventually never returned normally. Attempted 208 times over 10.00622791 seconds. Last failure message: strings.forall({ ((elem: Any) = DirectKafkaStreamSuite.collectedData.contains(elem)) }) was false. at org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420) at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438) at org.apache.spark.streaming.kafka.KafkaStreamSuiteBase.eventually(KafkaStreamSuite.scala:49) at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:307) at org.apache.spark.streaming.kafka.KafkaStreamSuiteBase.eventually(KafkaStreamSuite.scala:49) at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite$$anonfun$5.org$apache$spark$streaming$kafka$DirectKafkaStreamSuite$$anonfun$$sendDataAndWaitForReceive$1(DirectKafkaStreamSuite.scala:225) at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite$$anonfun$5.apply$mcV$sp(DirectKafkaStreamSuite.scala:287) at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite$$anonfun$5.apply(DirectKafkaStreamSuite.scala:211) at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite$$anonfun$5.apply(DirectKafkaStreamSuite.scala:211) at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) at org.scalatest.Suite$class.withFixture(Suite.scala:1122) at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.org$scalatest$BeforeAndAfter$$super$runTest(DirectKafkaStreamSuite.scala:39) at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200) at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.runTest(DirectKafkaStreamSuite.scala:39) at 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) at scala.collection.immutable.List.foreach(List.scala:318) at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208) at org.scalatest.FunSuite.runTests(FunSuite.scala:1555) at org.scalatest.Suite$class.run(Suite.scala:1424) at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) at
[jira] [Commented] (SPARK-5837) HTTP 500 if try to access Spark UI in yarn-cluster or yarn-client mode
[ https://issues.apache.org/jira/browse/SPARK-5837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14336475#comment-14336475 ] Marco Capuccini commented on SPARK-5837: setting yarn.resourcemanager.hostname solved the problem. Thanks a lot. HTTP 500 if try to access Spark UI in yarn-cluster or yarn-client mode -- Key: SPARK-5837 URL: https://issues.apache.org/jira/browse/SPARK-5837 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.2.0, 1.2.1 Reporter: Marco Capuccini Both Spark 1.2.0 and Spark 1.2.1 return this error when I try to access the Spark UI if I run over yarn (version 2.4.0): HTTP ERROR 500 Problem accessing /proxy/application_1423564210894_0017/. Reason: Connection refused Caused by: java.net.ConnectException: Connection refused at java.net.PlainSocketImpl.socketConnect(Native Method) at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339) at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200) at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182) at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) at java.net.Socket.connect(Socket.java:579) at java.net.Socket.connect(Socket.java:528) at java.net.Socket.init(Socket.java:425) at java.net.Socket.init(Socket.java:280) at org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:80) at org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:122) at org.apache.commons.httpclient.HttpConnection.open(HttpConnection.java:707) at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:387) at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171) at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397) at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:346) at org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.proxyLink(WebAppProxyServlet.java:187) at org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.doGet(WebAppProxyServlet.java:344) at javax.servlet.http.HttpServlet.service(HttpServlet.java:707) at javax.servlet.http.HttpServlet.service(HttpServlet.java:820) at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221) at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:66) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:900) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834) at org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:79) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795) at com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163) at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58) at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118) at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109) at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1192) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399) at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216) at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182) at
[jira] [Updated] (SPARK-5666) Improvements in Mqtt Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-5666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5666: - Assignee: Prabeesh K Improvements in Mqtt Spark Streaming - Key: SPARK-5666 URL: https://issues.apache.org/jira/browse/SPARK-5666 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Prabeesh K Assignee: Prabeesh K Priority: Minor Fix For: 1.4.0 Clean up the source code related to Mqtt Spark Streaming to adhere to accepted coding standards. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5947) First class partitioning support in data sources API
[ https://issues.apache.org/jira/browse/SPARK-5947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14336556#comment-14336556 ] Philippe Girolami commented on SPARK-5947: -- For some workloads, it can make more sense to use SKEWED ON rather than PARTITION in order to prevent creating thousands of tiny partitions just to handle a few large partitions. As far as I can tell, these two cases can't be inferred from a directory layout so maybe it would make sense to make PARTITION SKEW part of Spark too, and rely on meta-data defined by the application rather than directory discovery ? First class partitioning support in data sources API Key: SPARK-5947 URL: https://issues.apache.org/jira/browse/SPARK-5947 Project: Spark Issue Type: Improvement Components: SQL Reporter: Cheng Lian For file system based data sources, implementing Hive style partitioning support can be complex and error prone. To be specific, partitioning support include: # Partition discovery: Given a directory organized similar to Hive partitions, discover the directory structure and partitioning information automatically, including partition column names, data types, and values. # Reading from partitioned tables # Writing to partitioned tables It would be good to have first class partitioning support in the data sources API. For example, add a {{FileBasedScan}} trait with callbacks and default implementations for these features. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
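As a point of reference for the first item in the list above (partition discovery), here is a standalone sketch of what recovering Hive-style partition columns from a directory path looks like. The path and column names are invented for illustration; this is not the data sources API hook being proposed.

{code}
// A Hive-style layout encodes partition columns in directory names as column=value.
val path = "hdfs:///warehouse/events/year=2015/month=02/part-00000.parquet"

// Recover partition column names and (string-typed) values from the path segments.
val partitionValues: Map[String, String] =
  path.split("/")
    .filter(_.contains("="))
    .map { segment =>
      val Array(column, value) = segment.split("=", 2)
      column -> value
    }
    .toMap

println(partitionValues)  // Map(year -> 2015, month -> 02)
{code}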
[jira] [Created] (SPARK-6006) Optimize count distinct in case high cardinality columns
Yash Datta created SPARK-6006: - Summary: Optimize count distinct in case high cardinality columns Key: SPARK-6006 URL: https://issues.apache.org/jira/browse/SPARK-6006 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.2.1, 1.1.1 Reporter: Yash Datta Priority: Minor Fix For: 1.3.0 In case there are a lot of distinct values, count distinct becomes too slow since it tries to hash partial results to one map. It can be improved by creating buckets/partial maps in an intermediate stage where same key from multiple partial maps of first stage hash to the same bucket. Later we can sum the size of these buckets to get total distinct count. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6006) Optimize count distinct in case of high cardinality columns
[ https://issues.apache.org/jira/browse/SPARK-6006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yash Datta updated SPARK-6006: -- Summary: Optimize count distinct in case of high cardinality columns (was: Optimize count distinct in case high cardinality columns) Optimize count distinct in case of high cardinality columns --- Key: SPARK-6006 URL: https://issues.apache.org/jira/browse/SPARK-6006 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.1, 1.2.1 Reporter: Yash Datta Priority: Minor Fix For: 1.3.0 In case there are a lot of distinct values, count distinct becomes too slow since it tries to hash partial results to one map. It can be improved by creating buckets/partial maps in an intermediate stage where same key from multiple partial maps of first stage hash to the same bucket. Later we can sum the size of these buckets to get total distinct count. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5978) Spark examples cannot compile with Hadoop 2
[ https://issues.apache.org/jira/browse/SPARK-5978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14336443#comment-14336443 ] Sean Owen commented on SPARK-5978: -- Hm, not sure I can reproduce this. If you build for Hadoop 2 and look at the dependency tree (e.g. {{mvn -Phadoop-2.4 -Dhadoop.version=2.4.0 dependency:tree -pl examples}}) then it looks like you get Hadoop 2, HBase's Hadoop 2 flavor, etc.: {code} [INFO] | +- org.apache.hadoop:hadoop-client:jar:2.4.0:compile ... [INFO] +- org.apache.hbase:hbase-testing-util:jar:0.98.7-hadoop2:compile {code} Is that how you built it? Spark examples cannot compile with Hadoop 2 --- Key: SPARK-5978 URL: https://issues.apache.org/jira/browse/SPARK-5978 Project: Spark Issue Type: Bug Components: Examples, PySpark Affects Versions: 1.2.0, 1.2.1 Reporter: Michael Nazario Labels: hadoop-version This is a regression from Spark 1.1.1. The Spark Examples includes an example for an avro converter for PySpark. When I was trying to debug a problem, I discovered that even though you can build with Hadoop 2 for Spark 1.2.0, an hbase dependency depends on Hadoop 1 somewhere else in the examples code. An easy fix would be to separate the examples into hadoop specific versions. Another way would be to fix the hbase dependencies so that they don't rely on hadoop 1 specific code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6007) Add numRows param in DataFrame.show
[ https://issues.apache.org/jira/browse/SPARK-6007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14336470#comment-14336470 ] Apache Spark commented on SPARK-6007: - User 'jackylk' has created a pull request for this issue: https://github.com/apache/spark/pull/4767 Add numRows param in DataFrame.show --- Key: SPARK-6007 URL: https://issues.apache.org/jira/browse/SPARK-6007 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.1 Reporter: Jacky Li Priority: Minor Fix For: 1.3.0 Currently, DataFrame.show always displays the first 20 rows; it would be useful if the user could decide how many rows to show by passing the count as a parameter to show() -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5771) Number of Cores in Completed Applications of Standalone Master Web Page always be 0 if sc.stop() is called
[ https://issues.apache.org/jira/browse/SPARK-5771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-5771. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 4567 [https://github.com/apache/spark/pull/4567] Number of Cores in Completed Applications of Standalone Master Web Page always be 0 if sc.stop() is called -- Key: SPARK-5771 URL: https://issues.apache.org/jira/browse/SPARK-5771 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.2.1 Reporter: Liangliang Gu Priority: Minor Fix For: 1.4.0 In Standalone mode, the number of cores in Completed Applications of the Master Web Page will always be zero if sc.stop() is called, but the number will be correct if sc.stop() is not called. The likely reason: after sc.stop() is called, the removeExecutor function of class ApplicationInfo is called, reducing the variable coresGranted to zero. The variable coresGranted is used to display the number of Cores on the Web Page. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4010) Spark UI returns 500 in yarn-client mode
[ https://issues.apache.org/jira/browse/SPARK-4010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14336654#comment-14336654 ] Sean Owen commented on SPARK-4010: -- [~Hanchen] see SPARK-5837 for a possible explanation. Spark UI returns 500 in yarn-client mode - Key: SPARK-4010 URL: https://issues.apache.org/jira/browse/SPARK-4010 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.2.0 Reporter: Guoqiang Li Assignee: Guoqiang Li Priority: Blocker Fix For: 1.1.1, 1.2.0 http://host/proxy/application_id/stages/ returns this result: {noformat} HTTP ERROR 500 Problem accessing /proxy/application_1411648907638_0281/stages/. Reason: Connection refused Caused by: java.net.ConnectException: Connection refused at java.net.PlainSocketImpl.socketConnect(Native Method) at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339) at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200) at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182) at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) at java.net.Socket.connect(Socket.java:579) at java.net.Socket.connect(Socket.java:528) at java.net.Socket.init(Socket.java:425) at java.net.Socket.init(Socket.java:280) at org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:80) at org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:122) at org.apache.commons.httpclient.HttpConnection.open(HttpConnection.java:707) at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:387) at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171) at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397) at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:346) at org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.proxyLink(WebAppProxyServlet.java:185) at org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.doGet(WebAppProxyServlet.java:336) at javax.servlet.http.HttpServlet.service(HttpServlet.java:707) at javax.servlet.http.HttpServlet.service(HttpServlet.java:820) at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221) at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:66) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:900) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795) at com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163) at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58) at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118) at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1183) at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399) at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216) at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182) at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766) at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450) at
[jira] [Commented] (SPARK-5940) Graph Loader: refactor + add more formats
[ https://issues.apache.org/jira/browse/SPARK-5940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14336677#comment-14336677 ] Takeshi Yamamuro commented on SPARK-5940: - I made a quick fix as follows; https://github.com/maropu/spark/commit/3c55be83f48e4be5baf3e58ca49850c6cb48df12 Loaders for multiple formats (e.g., RDF and others) are implemented in graphx/impl. Graph Loader: refactor + add more formats - Key: SPARK-5940 URL: https://issues.apache.org/jira/browse/SPARK-5940 Project: Spark Issue Type: New Feature Components: GraphX Reporter: lukovnikov Priority: Minor Currently, the only graph loader is GraphLoader.edgeListFile. [SPARK-5280] adds a RDF graph loader. However, as Takeshi Yamamuro suggested on github [SPARK-5280], https://github.com/apache/spark/pull/4650, it might be interesting to make GraphLoader an interface with several implementations for different formats. And maybe it's good to make a façade graph loader that provides a unified interface to all loaders. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
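To illustrate the shape of the refactor being discussed, here is a minimal sketch of a loader trait plus a facade. The trait and object names are hypothetical; only the edge-list implementation delegates to an API that actually exists today (GraphLoader.edgeListFile).

{code}
import org.apache.spark.SparkContext
import org.apache.spark.graphx.{Graph, GraphLoader}

// Hypothetical interface: one implementation per on-disk format.
trait GraphFormatLoader {
  def load(sc: SparkContext, path: String): Graph[Int, Int]
}

// The existing edge-list loader, wrapped behind the interface.
object EdgeListLoader extends GraphFormatLoader {
  def load(sc: SparkContext, path: String): Graph[Int, Int] =
    GraphLoader.edgeListFile(sc, path)
}

// Facade giving a single entry point keyed by format name; an RDF loader
// (SPARK-5280) would simply be another entry in this map.
object GraphLoaders {
  private val loaders: Map[String, GraphFormatLoader] = Map("edgelist" -> EdgeListLoader)
  def load(sc: SparkContext, format: String, path: String): Graph[Int, Int] =
    loaders(format).load(sc, path)
}
{code}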
[jira] [Commented] (SPARK-4010) Spark UI returns 500 in yarn-client mode
[ https://issues.apache.org/jira/browse/SPARK-4010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14336644#comment-14336644 ] Hanchen Su commented on SPARK-4010: --- I still have the problem in Spark 1.2.1 Spark UI returns 500 in yarn-client mode - Key: SPARK-4010 URL: https://issues.apache.org/jira/browse/SPARK-4010 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.2.0 Reporter: Guoqiang Li Assignee: Guoqiang Li Priority: Blocker Fix For: 1.1.1, 1.2.0 http://host/proxy/application_id/stages/ returns this result: {noformat} HTTP ERROR 500 Problem accessing /proxy/application_1411648907638_0281/stages/. Reason: Connection refused Caused by: java.net.ConnectException: Connection refused at java.net.PlainSocketImpl.socketConnect(Native Method) at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339) at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200) at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182) at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) at java.net.Socket.connect(Socket.java:579) at java.net.Socket.connect(Socket.java:528) at java.net.Socket.init(Socket.java:425) at java.net.Socket.init(Socket.java:280) at org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:80) at org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:122) at org.apache.commons.httpclient.HttpConnection.open(HttpConnection.java:707) at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:387) at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171) at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397) at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:346) at org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.proxyLink(WebAppProxyServlet.java:185) at org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.doGet(WebAppProxyServlet.java:336) at javax.servlet.http.HttpServlet.service(HttpServlet.java:707) at javax.servlet.http.HttpServlet.service(HttpServlet.java:820) at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221) at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:66) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:900) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795) at com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163) at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58) at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118) at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1183) at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399) at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216) at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182) at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766) at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450) at
[jira] [Updated] (SPARK-5771) Number of Cores in Completed Applications of Standalone Master Web Page always be 0 if sc.stop() is called
[ https://issues.apache.org/jira/browse/SPARK-5771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5771: - Assignee: Liangliang Gu Number of Cores in Completed Applications of Standalone Master Web Page always be 0 if sc.stop() is called -- Key: SPARK-5771 URL: https://issues.apache.org/jira/browse/SPARK-5771 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.2.1 Reporter: Liangliang Gu Assignee: Liangliang Gu Priority: Minor Fix For: 1.4.0 In Standalone mode, the number of cores in Completed Applications of the Master Web Page will always be zero if sc.stop() is called, but the number will be correct if sc.stop() is not called. The likely reason: after sc.stop() is called, the removeExecutor function of class ApplicationInfo is called, reducing the variable coresGranted to zero. The variable coresGranted is used to display the number of Cores on the Web Page. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6009) IllegalArgumentException thrown by TimSort when SQL ORDER BY RAND ()
Paul Barber created SPARK-6009: -- Summary: IllegalArgumentException thrown by TimSort when SQL ORDER BY RAND () Key: SPARK-6009 URL: https://issues.apache.org/jira/browse/SPARK-6009 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.1, 1.2.0 Environment: Centos 7, Hadoop 2.6.0, Hive 0.15.0 java version 1.7.0_75 OpenJDK Runtime Environment (rhel-2.5.4.2.el7_0-x86_64 u75-b13) OpenJDK 64-Bit Server VM (build 24.75-b04, mixed mode) Reporter: Paul Barber Running the following SparkSQL query over JDBC: {noformat} SELECT * FROM FAA WHERE Year = 1998 AND Year = 1999 ORDER BY RAND () LIMIT 10 {noformat} This results in one or more workers throwing the following exception, with variations for {{mergeLo}} and {{mergeHi}}. {noformat} :java.lang.IllegalArgumentException: Comparison method violates its general contract! - at java.util.TimSort.mergeHi(TimSort.java:868) - at java.util.TimSort.mergeAt(TimSort.java:485) - at java.util.TimSort.mergeCollapse(TimSort.java:410) - at java.util.TimSort.sort(TimSort.java:214) - at java.util.Arrays.sort(Arrays.java:727) - at org.spark-project.guava.common.collect.Ordering.leastOf(Ordering.java:708) - at org.apache.spark.util.collection.Utils$.takeOrdered(Utils.scala:37) - at org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1.apply(RDD.scala:1138) - at org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1.apply(RDD.scala:1135) - at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601) - at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601) - at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) - at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) - at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) - at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) - at org.apache.spark.scheduler.Task.run(Task.scala:56) - at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196) - at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) - at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) - at java.lang.Thread.run(Thread.java:745) {noformat} We have tested with both Spark 1.2.0 and Spark 1.2.1 and have seen the same error in both. The query sometimes succeeds, but fails more often than not. Whilst this sounds similar to bugs 3032 and 3656, we believe it it is not the same. The {{ORDER BY RAND ()}} is using TimSort to produce the random ordering by sorting a list of random values. Having spent some time looking at the issue with jdb, it appears that the problem is triggered by the random values being changed during the sort - the code which triggers this is in {{sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Row.scala}} - class RowOrdering, function compare, line 250 - where a new random number is taken for the same row. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
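A small standalone illustration of the failure mode described above, assuming nothing about the Spark internals: an ordering that draws a fresh random number on every comparison is not consistent across calls and can trip TimSort's contract check, while drawing one random key per element up front is stable.

{code}
import scala.util.Random

// Problematic pattern: compare(a, b) can give different answers on repeated calls,
// which is exactly the "Comparison method violates its general contract!" condition.
val unstableOrdering: Ordering[String] =
  Ordering.by[String, Double](_ => Random.nextDouble())

// Safer pattern: materialize one random key per element first, then sort by that
// fixed key, so every comparison of the same pair agrees.
def randomOrder[T](xs: Seq[T]): Seq[T] =
  xs.map(x => (Random.nextDouble(), x)).sortBy(_._1).map(_._2)

println(randomOrder(Seq("a", "b", "c", "d")))  // e.g. List(c, a, d, b)
{code}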
[jira] [Commented] (SPARK-5750) Document that ordering of elements in shuffled partitions is not deterministic across runs
[ https://issues.apache.org/jira/browse/SPARK-5750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14336795#comment-14336795 ] Ilya Ganelin commented on SPARK-5750: - Did you have a particular doc in mind to update? I feel like this sort of comment should go in the programming guide but there's not really a good spot for it. One glaring omission in the guide is a general writeup of the shuffle operation and the role that it plays internally. Understanding shuffles is key to writing stable Spark applications yet there isn't really any mention of it outside of the tech talks and presentations from the Spark folks. My suggestion would be to create a section providing an overview of shuffle, what parameters influence its behavior and stability, and then add this comment to that section. Document that ordering of elements in shuffled partitions is not deterministic across runs -- Key: SPARK-5750 URL: https://issues.apache.org/jira/browse/SPARK-5750 Project: Spark Issue Type: Improvement Components: Documentation Reporter: Josh Rosen The ordering of elements in shuffled partitions is not deterministic across runs. For instance, consider the following example: {code} val largeFiles = sc.textFile(...) val airlines = largeFiles.repartition(2000).cache() println(airlines.first) {code} If this code is run twice, then each run will output a different result. There is non-determinism in the shuffle read code that accounts for this: Spark's shuffle read path processes blocks as soon as they are fetched Spark uses [ShuffleBlockFetcherIterator|https://github.com/apache/spark/blob/v1.2.1/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala] to fetch shuffle data from mappers. In this code, requests for multiple blocks from the same host are batched together, so nondeterminism in where tasks are run means that the set of requests can vary across runs. In addition, there's an [explicit call|https://github.com/apache/spark/blob/v1.2.1/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala#L256] to randomize the order of the batched fetch requests. As a result, shuffle operations cannot be guaranteed to produce the same ordering of the elements in their partitions. Therefore, Spark should update its docs to clarify that the ordering of elements in shuffle RDDs' partitions is non-deterministic. Note, however, that the _set_ of elements in each partition will be deterministic: if we used {{mapPartitions}} to sort each partition, then the {{first()}} call above would produce a deterministic result. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6010) Exception thrown when reading Spark SQL generated Parquet files with different but compatible schemas
[ https://issues.apache.org/jira/browse/SPARK-6010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14336796#comment-14336796 ] Apache Spark commented on SPARK-6010: - User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/4768 Exception thrown when reading Spark SQL generated Parquet files with different but compatible schemas - Key: SPARK-6010 URL: https://issues.apache.org/jira/browse/SPARK-6010 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Cheng Lian Assignee: Cheng Lian Priority: Blocker The following test case added in {{ParquetPartitionDiscoverySuite}} can be used to reproduce this issue: {code} test(read partitioned table - merging compatible schemas) { withTempDir { base = makeParquetFile( (1 to 10).map(i = Tuple1(i)).toDF(intField), makePartitionDir(base, defaultPartitionName, pi - 1)) makeParquetFile( (1 to 10).map(i = (i, i.toString)).toDF(intField, stringField), makePartitionDir(base, defaultPartitionName, pi - 2)) load(base.getCanonicalPath, org.apache.spark.sql.parquet).registerTempTable(t) withTempTable(t) { checkAnswer( sql(SELECT * FROM t), (1 to 10).map(i = Row(i, null, 1)) ++ (1 to 10).map(i = Row(i, i.toString, 2))) } } } {code} Exception thrown: {code} [info] java.lang.RuntimeException: could not merge metadata: key org.apache.spark.sql.parquet.row.metadata has conflicting values: [{type:struct,fields:[{name:intField,type:integer,nullable:false,metadata:{}},{name:stringField,type:string,nullable:true,metadata:{}}]}, {type:struct,fields:[{name:intField,type:integer,nullable:false,metadata:{}}]}] [info] at parquet.hadoop.api.InitContext.getMergedKeyValueMetaData(InitContext.java:67) [info] at parquet.hadoop.api.ReadSupport.init(ReadSupport.java:84) [info] at org.apache.spark.sql.parquet.FilteringParquetRowInputFormat.getSplits(ParquetTableOperations.scala:484) [info] at parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:245) [info] at org.apache.spark.sql.parquet.ParquetRelation2$$anon$1.getPartitions(newParquet.scala:461) [info] at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) [info] at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) [info] at scala.Option.getOrElse(Option.scala:120) [info] at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) [info] at org.apache.spark.rdd.NewHadoopRDD$NewHadoopMapPartitionsWithSplitRDD.getPartitions(NewHadoopRDD.scala:239) [info] at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) [info] at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) [info] at scala.Option.getOrElse(Option.scala:120) [info] at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) [info] at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) [info] at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) [info] at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) [info] at scala.Option.getOrElse(Option.scala:120) [info] at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) [info] at org.apache.spark.SparkContext.runJob(SparkContext.scala:1518) [info] at org.apache.spark.rdd.RDD.collect(RDD.scala:813) [info] at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:83) [info] at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:790) [info] at org.apache.spark.sql.QueryTest$.checkAnswer(QueryTest.scala:115) [info] at org.apache.spark.sql.QueryTest.checkAnswer(QueryTest.scala:60) 
[info] at org.apache.spark.sql.parquet.ParquetPartitionDiscoverySuite$$anonfun$8$$anonfun$apply$mcV$sp$18$$anonfun$apply$8.apply$mcV$sp(ParquetPartitionDiscoverySuite.scala:337) [info] at org.apache.spark.sql.parquet.ParquetTest$class.withTempTable(ParquetTest.scala:112) [info] at org.apache.spark.sql.parquet.ParquetPartitionDiscoverySuite.withTempTable(ParquetPartitionDiscoverySuite.scala:35) [info] at org.apache.spark.sql.parquet.ParquetPartitionDiscoverySuite$$anonfun$8$$anonfun$apply$mcV$sp$18.apply(ParquetPartitionDiscoverySuite.scala:336) [info] at org.apache.spark.sql.parquet.ParquetPartitionDiscoverySuite$$anonfun$8$$anonfun$apply$mcV$sp$18.apply(ParquetPartitionDiscoverySuite.scala:325)
[jira] [Commented] (SPARK-5750) Document that ordering of elements in shuffled partitions is not deterministic across runs
[ https://issues.apache.org/jira/browse/SPARK-5750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14336864#comment-14336864 ] Sean Owen commented on SPARK-5750: -- Personally I think that would be a fine way forward, to create a basic overview of the shuffle, what operations shuffle, and the essential things to know. This would be a good place to document things like this topic. Here are a few other sort-of related JIRAs: https://issues.apache.org/jira/browse/SPARK-5836 https://issues.apache.org/jira/browse/SPARK-4227 https://issues.apache.org/jira/browse/SPARK-3441 Document that ordering of elements in shuffled partitions is not deterministic across runs -- Key: SPARK-5750 URL: https://issues.apache.org/jira/browse/SPARK-5750 Project: Spark Issue Type: Improvement Components: Documentation Reporter: Josh Rosen The ordering of elements in shuffled partitions is not deterministic across runs. For instance, consider the following example: {code} val largeFiles = sc.textFile(...) val airlines = largeFiles.repartition(2000).cache() println(airlines.first) {code} If this code is run twice, then each run will output a different result. There is non-determinism in the shuffle read code that accounts for this: Spark's shuffle read path processes blocks as soon as they are fetched Spark uses [ShuffleBlockFetcherIterator|https://github.com/apache/spark/blob/v1.2.1/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala] to fetch shuffle data from mappers. In this code, requests for multiple blocks from the same host are batched together, so nondeterminism in where tasks are run means that the set of requests can vary across runs. In addition, there's an [explicit call|https://github.com/apache/spark/blob/v1.2.1/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala#L256] to randomize the order of the batched fetch requests. As a result, shuffle operations cannot be guaranteed to produce the same ordering of the elements in their partitions. Therefore, Spark should update its docs to clarify that the ordering of elements in shuffle RDDs' partitions is non-deterministic. Note, however, that the _set_ of elements in each partition will be deterministic: if we used {{mapPartitions}} to sort each partition, then the {{first()}} call above would produce a deterministic result. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
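Building on the example in the issue description, a hedged sketch of the point to be documented: the per-partition ordering after repartition() is not stable across runs, but sorting within each partition yields a deterministic first(). The input path is a placeholder.

{code}
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("shuffle-order").setMaster("local[*]"))
val largeFiles = sc.textFile("hdfs:///path/to/input")  // placeholder path, as in the issue's example

// repartition() shuffles, so the order of lines inside each partition can differ per run.
val airlines = largeFiles.repartition(2000).cache()
println(airlines.first())  // may print a different line on each run

// The *set* of lines per partition is deterministic, so sorting within partitions
// makes first() reproducible.
val stableFirst = airlines.mapPartitions(_.toArray.sorted.iterator).first()
println(stableFirst)
{code}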
[jira] [Resolved] (SPARK-6008) zip two rdds derived from pickleFile fails
[ https://issues.apache.org/jira/browse/SPARK-6008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-6008. --- Resolution: Duplicate Fix Version/s: 1.2.2 1.3.0 Assignee: Davies Liu zip two rdds derived from pickleFile fails -- Key: SPARK-6008 URL: https://issues.apache.org/jira/browse/SPARK-6008 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.3.0 Reporter: Charles Hayden Assignee: Davies Liu Fix For: 1.3.0, 1.2.2 Read an rdd from a pickle file. Then create another from the first, and zip them together. from pyspark import SparkContext sc = SparkContext() print sc.version r = sc.parallelize(range(1, 1000)) r.saveAsPickleFile('file') rdd = sc.pickleFile('file') res = rdd.map(lambda row: row, preservesPartitioning=True) z = rdd.zip(res) print z.take(1) Gives the following error: File bug.py, line 30, in module print z.take(1) File /home/ubuntu/spark/python/pyspark/rdd.py, line 1225, in take res = self.context.runJob(self, takeUpToNumLeft, p, True) File /home/ubuntu/spark/python/pyspark/context.py, line 843, in runJob it = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, javaPartitions, allowLocal) File /home/ubuntu/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py, line 538, in __call__ File /home/ubuntu/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py, line 300, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 8, localhost): org.apache.spark.SparkException: Can only zip RDDs with same number of elements in each partition at org.apache.spark.rdd.RDD$$anonfun$zip$1$$anon$1.hasNext(RDD.scala:746) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at org.apache.spark.rdd.RDD$$anonfun$zip$1$$anon$1.foreach(RDD.scala:742) at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:406) at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:244) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1613) at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:205) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6008) zip two rdds derived from pickleFile fails
Charles Hayden created SPARK-6008: - Summary: zip two rdds derived from pickleFile fails Key: SPARK-6008 URL: https://issues.apache.org/jira/browse/SPARK-6008 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.3.0 Reporter: Charles Hayden Read an rdd from a pickle file. Then create another from the first, and zip them together. from pyspark import SparkContext sc = SparkContext() print sc.version r = sc.parallelize(range(1, 1000)) r.saveAsPickleFile('file') rdd = sc.pickleFile('file') res = rdd.map(lambda row: row, preservesPartitioning=True) z = rdd.zip(res) print z.take(1) Gives the following error: File bug.py, line 30, in module print z.take(1) File /home/ubuntu/spark/python/pyspark/rdd.py, line 1225, in take res = self.context.runJob(self, takeUpToNumLeft, p, True) File /home/ubuntu/spark/python/pyspark/context.py, line 843, in runJob it = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, javaPartitions, allowLocal) File /home/ubuntu/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py, line 538, in __call__ File /home/ubuntu/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py, line 300, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 8, localhost): org.apache.spark.SparkException: Can only zip RDDs with same number of elements in each partition at org.apache.spark.rdd.RDD$$anonfun$zip$1$$anon$1.hasNext(RDD.scala:746) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at org.apache.spark.rdd.RDD$$anonfun$zip$1$$anon$1.foreach(RDD.scala:742) at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:406) at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:244) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1613) at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:205) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6010) Exception thrown when reading Spark SQL generated Parquet files with different but compatible schemas
Cheng Lian created SPARK-6010: - Summary: Exception thrown when reading Spark SQL generated Parquet files with different but compatible schemas Key: SPARK-6010 URL: https://issues.apache.org/jira/browse/SPARK-6010 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Cheng Lian Assignee: Cheng Lian Priority: Blocker The following test case added in {{ParquetPartitionDiscoverySuite}} can be used to reproduce this issue: {code} test(read partitioned table - merging compatible schemas) { withTempDir { base = makeParquetFile( (1 to 10).map(i = Tuple1(i)).toDF(intField), makePartitionDir(base, defaultPartitionName, pi - 1)) makeParquetFile( (1 to 10).map(i = (i, i.toString)).toDF(intField, stringField), makePartitionDir(base, defaultPartitionName, pi - 2)) load(base.getCanonicalPath, org.apache.spark.sql.parquet).registerTempTable(t) withTempTable(t) { checkAnswer( sql(SELECT * FROM t), (1 to 10).map(i = Row(i, null, 1)) ++ (1 to 10).map(i = Row(i, i.toString, 2))) } } } {code} Exception thrown: {code} [info] java.lang.RuntimeException: could not merge metadata: key org.apache.spark.sql.parquet.row.metadata has conflicting values: [{type:struct,fields:[{name:intField,type:integer,nullable:false,metadata:{}},{name:stringField,type:string,nullable:true,metadata:{}}]}, {type:struct,fields:[{name:intField,type:integer,nullable:false,metadata:{}}]}] [info] at parquet.hadoop.api.InitContext.getMergedKeyValueMetaData(InitContext.java:67) [info] at parquet.hadoop.api.ReadSupport.init(ReadSupport.java:84) [info] at org.apache.spark.sql.parquet.FilteringParquetRowInputFormat.getSplits(ParquetTableOperations.scala:484) [info] at parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:245) [info] at org.apache.spark.sql.parquet.ParquetRelation2$$anon$1.getPartitions(newParquet.scala:461) [info] at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) [info] at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) [info] at scala.Option.getOrElse(Option.scala:120) [info] at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) [info] at org.apache.spark.rdd.NewHadoopRDD$NewHadoopMapPartitionsWithSplitRDD.getPartitions(NewHadoopRDD.scala:239) [info] at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) [info] at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) [info] at scala.Option.getOrElse(Option.scala:120) [info] at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) [info] at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) [info] at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) [info] at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) [info] at scala.Option.getOrElse(Option.scala:120) [info] at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) [info] at org.apache.spark.SparkContext.runJob(SparkContext.scala:1518) [info] at org.apache.spark.rdd.RDD.collect(RDD.scala:813) [info] at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:83) [info] at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:790) [info] at org.apache.spark.sql.QueryTest$.checkAnswer(QueryTest.scala:115) [info] at org.apache.spark.sql.QueryTest.checkAnswer(QueryTest.scala:60) [info] at org.apache.spark.sql.parquet.ParquetPartitionDiscoverySuite$$anonfun$8$$anonfun$apply$mcV$sp$18$$anonfun$apply$8.apply$mcV$sp(ParquetPartitionDiscoverySuite.scala:337) [info] at 
org.apache.spark.sql.parquet.ParquetTest$class.withTempTable(ParquetTest.scala:112) [info] at org.apache.spark.sql.parquet.ParquetPartitionDiscoverySuite.withTempTable(ParquetPartitionDiscoverySuite.scala:35) [info] at org.apache.spark.sql.parquet.ParquetPartitionDiscoverySuite$$anonfun$8$$anonfun$apply$mcV$sp$18.apply(ParquetPartitionDiscoverySuite.scala:336) [info] at org.apache.spark.sql.parquet.ParquetPartitionDiscoverySuite$$anonfun$8$$anonfun$apply$mcV$sp$18.apply(ParquetPartitionDiscoverySuite.scala:325) [info] at org.apache.spark.sql.parquet.ParquetTest$class.withTempDir(ParquetTest.scala:82) [info] at org.apache.spark.sql.parquet.ParquetPartitionDiscoverySuite.withTempDir(ParquetPartitionDiscoverySuite.scala:35) [info] at org.apache.spark.sql.parquet.ParquetPartitionDiscoverySuite$$anonfun$8.apply$mcV$sp(ParquetPartitionDiscoverySuite.scala:325) [info] at
[jira] [Updated] (SPARK-6012) Deadlock when asking for partitions from CoalescedRDD on top of a TakeOrdered operator
[ https://issues.apache.org/jira/browse/SPARK-6012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Seiden updated SPARK-6012: -- Summary: Deadlock when asking for partitions from CoalescedRDD on top of a TakeOrdered operator (was: Deadlock when asking for SchemaRDD partitions with TakeOrdered operator) Deadlock when asking for partitions from CoalescedRDD on top of a TakeOrdered operator -- Key: SPARK-6012 URL: https://issues.apache.org/jira/browse/SPARK-6012 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.1 Reporter: Max Seiden Priority: Critical h3. Summary I've found that a deadlock occurs when asking for the partitions from a SchemaRDD that has a TakeOrdered as its terminal operator. The problem occurs when a child RDDs asks the DAGScheduler for preferred partition locations (which locks the scheduler) and eventually hits the #execute() of the TakeOrdered operator, which submits tasks but is blocked when it also tries to get preferred locations (in a separate thread). It seems like the TakeOrdered op's #execute() method should not actually submit a task (it is calling #executeCollect() and creating a new RDD) and should instead stay more true to the comment a logically apply a Limit on top of a Sort. In my particular case, I am forcing a repartition of a SchemaRDD with a terminal Limit(..., Sort(...)), which is where the CoalescedRDD comes into play. h3. Stack Traces h4. Task Submission {noformat} main prio=5 tid=0x7f8e7280 nid=0x1303 in Object.wait() [0x00010ed5e000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on 0x0007c4c239b8 (a org.apache.spark.scheduler.JobWaiter) at java.lang.Object.wait(Object.java:503) at org.apache.spark.scheduler.JobWaiter.awaitResult(JobWaiter.scala:73) - locked 0x0007c4c239b8 (a org.apache.spark.scheduler.JobWaiter) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:514) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1321) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1390) at org.apache.spark.rdd.RDD.reduce(RDD.scala:884) at org.apache.spark.rdd.RDD.takeOrdered(RDD.scala:1161) at org.apache.spark.sql.execution.TakeOrdered.executeCollect(basicOperators.scala:183) at org.apache.spark.sql.execution.TakeOrdered.execute(basicOperators.scala:188) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:425) - locked 0x0007c36ce038 (a org.apache.spark.sql.hive.HiveContext$$anon$7) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:425) at org.apache.spark.sql.SchemaRDD.getDependencies(SchemaRDD.scala:127) at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:209) at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:207) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.dependencies(RDD.scala:207) at org.apache.spark.rdd.RDD.firstParent(RDD.scala:1278) at org.apache.spark.sql.SchemaRDD.getPartitions(SchemaRDD.scala:122) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:220) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220) at scala.Option.getOrElse(Option.scala:120) at 
org.apache.spark.rdd.RDD.partitions(RDD.scala:220) at org.apache.spark.ShuffleDependency.init(Dependency.scala:79) at org.apache.spark.rdd.ShuffledRDD.getDependencies(ShuffledRDD.scala:80) at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:209) at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:207) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.dependencies(RDD.scala:207) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1333) at org.apache.spark.scheduler.DAGScheduler.getPreferredLocs(DAGScheduler.scala:1304) - locked 0x0007f55c2238 (a org.apache.spark.scheduler.DAGScheduler) at
[jira] [Created] (SPARK-6015) Python docs' source code links are all broken
Joseph K. Bradley created SPARK-6015: Summary: Python docs' source code links are all broken Key: SPARK-6015 URL: https://issues.apache.org/jira/browse/SPARK-6015 Project: Spark Issue Type: Documentation Components: Documentation, PySpark Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Priority: Minor The Python docs display {code}[source]{code} links which should link to source code, but none work in the documentation provided on the Apache Spark website. E.g., go here [https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#module-pyspark.mllib.regression] and click a source code link on the right. I believe these are generated because we have viewcode set in the doc conf file: [https://github.com/apache/spark/blob/master/python/docs/conf.py#L33] I'm not sure why the source code is not being generated/posted. Should the links be removed, or do we want to add the source code? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6012) Deadlock when asking for SchemaRDD partitions with TakeOrdered operator
[ https://issues.apache.org/jira/browse/SPARK-6012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Seiden updated SPARK-6012: -- Description: h3. Summary I've found that a deadlock occurs when asking for the partitions from a SchemaRDD that has a TakeOrdered as its terminal operator. The problem occurs when a child RDDs asks the DAGScheduler for preferred partition locations (which locks the scheduler) and eventually hits the #execute() of the TakeOrdered operator, which submits tasks but is blocked when it also tries to get preferred locations (in a separate thread). It seems like the TakeOrdered op's #execute() method should not actually submit a task (it is calling #executeCollect() and creating a new RDD) and should instead stay more true to the comment a logically apply a Limit on top of a Sort. In my particular case, I am forcing a repartition of a SchemaRDD with a terminal Limit(..., Sort(...)), which is where the CoalescedRDD comes into play. h3. Stack Traces h4. Task Submission {noformat} main prio=5 tid=0x7f8e7280 nid=0x1303 in Object.wait() [0x00010ed5e000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on 0x0007c4c239b8 (a org.apache.spark.scheduler.JobWaiter) at java.lang.Object.wait(Object.java:503) at org.apache.spark.scheduler.JobWaiter.awaitResult(JobWaiter.scala:73) - locked 0x0007c4c239b8 (a org.apache.spark.scheduler.JobWaiter) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:514) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1321) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1390) at org.apache.spark.rdd.RDD.reduce(RDD.scala:884) at org.apache.spark.rdd.RDD.takeOrdered(RDD.scala:1161) at org.apache.spark.sql.execution.TakeOrdered.executeCollect(basicOperators.scala:183) at org.apache.spark.sql.execution.TakeOrdered.execute(basicOperators.scala:188) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:425) - locked 0x0007c36ce038 (a org.apache.spark.sql.hive.HiveContext$$anon$7) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:425) at org.apache.spark.sql.SchemaRDD.getDependencies(SchemaRDD.scala:127) at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:209) at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:207) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.dependencies(RDD.scala:207) at org.apache.spark.rdd.RDD.firstParent(RDD.scala:1278) at org.apache.spark.sql.SchemaRDD.getPartitions(SchemaRDD.scala:122) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:220) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:220) at org.apache.spark.ShuffleDependency.init(Dependency.scala:79) at org.apache.spark.rdd.ShuffledRDD.getDependencies(ShuffledRDD.scala:80) at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:209) at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:207) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.dependencies(RDD.scala:207) 
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1333) at org.apache.spark.scheduler.DAGScheduler.getPreferredLocs(DAGScheduler.scala:1304) - locked 0x0007f55c2238 (a org.apache.spark.scheduler.DAGScheduler) at org.apache.spark.SparkContext.getPreferredLocs(SparkContext.scala:1148) at org.apache.spark.rdd.PartitionCoalescer.currPrefLocs(CoalescedRDD.scala:175) at org.apache.spark.rdd.PartitionCoalescer$LocationIterator$$anonfun$5$$anonfun$apply$2.apply(CoalescedRDD.scala:192) at org.apache.spark.rdd.PartitionCoalescer$LocationIterator$$anonfun$5$$anonfun$apply$2.apply(CoalescedRDD.scala:191) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:350) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:350) at
[jira] [Commented] (SPARK-6004) Pick the best model when training GradientBoostedTrees with validation
[ https://issues.apache.org/jira/browse/SPARK-6004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14337020#comment-14337020 ] Joseph K. Bradley commented on SPARK-6004: -- Can you please add a short JIRA description stating the current issue and the proposed solution? Pick the best model when training GradientBoostedTrees with validation -- Key: SPARK-6004 URL: https://issues.apache.org/jira/browse/SPARK-6004 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Liang-Chi Hsieh Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5124) Standardize internal RPC interface
[ https://issues.apache.org/jira/browse/SPARK-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14337028#comment-14337028 ] Reynold Xin commented on SPARK-5124: I went and looked at the various use cases of Akka in the code base. Quite a few of them actually also use askWithReply. There are really two categories of message deliveries: one that the sender expects a reply, and one that doesn't. As a result, I think the following interface would make more sense: {code} // receive side: def receive(msg: RpcMessage) def receiveAndReply(msg: RpcMessage) = sys.err(...) // send side: def send(message: Object): Unit def sendWithReply[T](message: Object): Future[T] {code} There is an obvious pairing here. Messages sent via send goes to receive, without requiring an ack (although if we use the transport layer, an implicit ack can be sent). Messages sent via sendWithReply goes to receiveAndReply, and the caller needs to explicitly handle the reply. In most cases, an rpc endpoint only needs to receive messages without reply, and thus can override just receive. Standardize internal RPC interface -- Key: SPARK-5124 URL: https://issues.apache.org/jira/browse/SPARK-5124 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Reynold Xin Assignee: Shixiong Zhu Attachments: Pluggable RPC - draft 1.pdf, Pluggable RPC - draft 2.pdf In Spark we use Akka as the RPC layer. It would be great if we can standardize the internal RPC interface to facilitate testing. This will also provide the foundation to try other RPC implementations in the future. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
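To make the pairing concrete, here is a minimal Scala sketch of the proposed shape. The type and trait names (RpcMessage, RpcEndpoint, RpcEndpointRef) are placeholders lifted from the comment above, not an existing Spark API:
{code}
import scala.concurrent.Future

// Placeholder message type; a real interface would carry sender metadata, etc.
case class RpcMessage(senderAddress: String, content: Any)

// Receive side: fire-and-forget messages land in receive(); messages that
// expect an answer land in receiveAndReply(), which endpoints may leave unimplemented.
trait RpcEndpoint {
  def receive(msg: RpcMessage): Unit
  def receiveAndReply(msg: RpcMessage): Any =
    sys.error(s"${getClass.getName} does not expect messages that require a reply")
}

// Send side: send() surfaces no ack to the caller; sendWithReply() surfaces the
// reply (or failure) through a Future the caller must handle explicitly.
trait RpcEndpointRef {
  def send(message: Any): Unit
  def sendWithReply[T](message: Any): Future[T]
}
{code}
An endpoint that never needs to reply only overrides receive, which keeps the common case minimal.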
[jira] [Created] (SPARK-6014) java.io.IOException: Filesystem is thrown when ctrl+c or ctrl+d spark-sql on YARN
Cheolsoo Park created SPARK-6014: Summary: java.io.IOException: Filesystem is thrown when ctrl+c or ctrl+d spark-sql on YARN Key: SPARK-6014 URL: https://issues.apache.org/jira/browse/SPARK-6014 Project: Spark Issue Type: Bug Affects Versions: 1.3.0 Environment: Hadoop 2.4, YARN Reporter: Cheolsoo Park Priority: Minor This is a regression of SPARK-2261. In branch-1.3 and master, {{EventLoggingListener}} throws {{java.io.IOException: Filesystem closed}} when ctrl+c or ctrl+d the spark-sql shell. The root cause is that DFSClient is already shut down before EventLoggingListener invokes the following HDFS methods, and thus, DFSClient.isClientRunning() check fails- {code} Line #135: hadoopDataStream.foreach(hadoopFlushMethod.invoke(_)) Line #187: if (fileSystem.exists(target)) { {code} The followings are full stack trace- {code} java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$3.apply(EventLoggingListener.scala:135) at org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$3.apply(EventLoggingListener.scala:135) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:135) at org.apache.spark.scheduler.EventLoggingListener.onApplicationEnd(EventLoggingListener.scala:170) at org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:54) at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31) at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31) at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:53) at org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:36) at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:76) at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply(AsynchronousListenerBus.scala:61) at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply(AsynchronousListenerBus.scala:61) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1613) at org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:60) Caused by: java.io.IOException: Filesystem closed at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:707) at org.apache.hadoop.hdfs.DFSOutputStream.flushOrSync(DFSOutputStream.java:1843) at org.apache.hadoop.hdfs.DFSOutputStream.hflush(DFSOutputStream.java:1804) at org.apache.hadoop.fs.FSDataOutputStream.hflush(FSDataOutputStream.java:127) ... 
19 more {code} {code} Exception in thread Thread-3 java.io.IOException: Filesystem closed at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:707) at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1760) at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1124) at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1120) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1120) at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1398) at org.apache.spark.scheduler.EventLoggingListener.stop(EventLoggingListener.scala:187) at org.apache.spark.SparkContext$$anonfun$stop$4.apply(SparkContext.scala:1379) at org.apache.spark.SparkContext$$anonfun$stop$4.apply(SparkContext.scala:1379) at scala.Option.foreach(Option.scala:236) at org.apache.spark.SparkContext.stop(SparkContext.scala:1379) at org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.stop(SparkSQLEnv.scala:66) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$$anon$1.run(SparkSQLCLIDriver.scala:107) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
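One way to picture the failure mode: the listener's flush and stop paths call into HDFS after the shutdown hook has already closed the DFSClient. A defensive wrapper of the kind sketched below (purely illustrative, not the actual fix) would tolerate that specific error at the two call sites quoted in the description:
{code}
import java.io.IOException
import org.apache.hadoop.fs.{FileSystem, Path}

// Illustrative helper only: if the filesystem client was already closed by a
// shutdown hook, treat the existence check as "not there" instead of letting
// the IOException escape into the listener bus.
def safeExists(fileSystem: FileSystem, target: Path): Boolean =
  try {
    fileSystem.exists(target)
  } catch {
    case e: IOException if Option(e.getMessage).exists(_.contains("Filesystem closed")) =>
      false
  }
{code}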
[jira] [Commented] (SPARK-6014) java.io.IOException: Filesystem is thrown when ctrl+c or ctrl+d spark-sql on YARN
[ https://issues.apache.org/jira/browse/SPARK-6014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14337055#comment-14337055 ] Apache Spark commented on SPARK-6014: - User 'piaozhexiu' has created a pull request for this issue: https://github.com/apache/spark/pull/4771 java.io.IOException: Filesystem is thrown when ctrl+c or ctrl+d spark-sql on YARN -- Key: SPARK-6014 URL: https://issues.apache.org/jira/browse/SPARK-6014 Project: Spark Issue Type: Bug Affects Versions: 1.3.0 Environment: Hadoop 2.4, YARN Reporter: Cheolsoo Park Priority: Minor Labels: yarn This is a regression of SPARK-2261. In branch-1.3 and master, {{EventLoggingListener}} throws {{java.io.IOException: Filesystem closed}} when ctrl+c or ctrl+d the spark-sql shell. The root cause is that DFSClient is already shut down before EventLoggingListener invokes the following HDFS methods, and thus, DFSClient.isClientRunning() check fails- {code} Line #135: hadoopDataStream.foreach(hadoopFlushMethod.invoke(_)) Line #187: if (fileSystem.exists(target)) { {code} The followings are full stack trace- {code} java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$3.apply(EventLoggingListener.scala:135) at org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$3.apply(EventLoggingListener.scala:135) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:135) at org.apache.spark.scheduler.EventLoggingListener.onApplicationEnd(EventLoggingListener.scala:170) at org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:54) at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31) at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31) at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:53) at org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:36) at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:76) at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply(AsynchronousListenerBus.scala:61) at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply(AsynchronousListenerBus.scala:61) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1613) at org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:60) Caused by: java.io.IOException: Filesystem closed at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:707) at org.apache.hadoop.hdfs.DFSOutputStream.flushOrSync(DFSOutputStream.java:1843) at org.apache.hadoop.hdfs.DFSOutputStream.hflush(DFSOutputStream.java:1804) at org.apache.hadoop.fs.FSDataOutputStream.hflush(FSDataOutputStream.java:127) ... 
19 more {code} {code} Exception in thread Thread-3 java.io.IOException: Filesystem closed at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:707) at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1760) at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1124) at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1120) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1120) at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1398) at org.apache.spark.scheduler.EventLoggingListener.stop(EventLoggingListener.scala:187) at org.apache.spark.SparkContext$$anonfun$stop$4.apply(SparkContext.scala:1379) at org.apache.spark.SparkContext$$anonfun$stop$4.apply(SparkContext.scala:1379) at scala.Option.foreach(Option.scala:236) at org.apache.spark.SparkContext.stop(SparkContext.scala:1379) at org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.stop(SparkSQLEnv.scala:66) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$$anon$1.run(SparkSQLCLIDriver.scala:107) {code} -- This message was
[jira] [Commented] (SPARK-5750) Document that ordering of elements in shuffled partitions is not deterministic across runs
[ https://issues.apache.org/jira/browse/SPARK-5750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14337104#comment-14337104 ] Ilya Ganelin commented on SPARK-5750: - I'd be happy to pull those in. Is it fine to submit the PR against this issue? Document that ordering of elements in shuffled partitions is not deterministic across runs -- Key: SPARK-5750 URL: https://issues.apache.org/jira/browse/SPARK-5750 Project: Spark Issue Type: Improvement Components: Documentation Reporter: Josh Rosen The ordering of elements in shuffled partitions is not deterministic across runs. For instance, consider the following example: {code} val largeFiles = sc.textFile(...) val airlines = largeFiles.repartition(2000).cache() println(airlines.first) {code} If this code is run twice, then each run will output a different result. There is non-determinism in the shuffle read code that accounts for this: Spark's shuffle read path processes blocks as soon as they are fetched Spark uses [ShuffleBlockFetcherIterator|https://github.com/apache/spark/blob/v1.2.1/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala] to fetch shuffle data from mappers. In this code, requests for multiple blocks from the same host are batched together, so nondeterminism in where tasks are run means that the set of requests can vary across runs. In addition, there's an [explicit call|https://github.com/apache/spark/blob/v1.2.1/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala#L256] to randomize the order of the batched fetch requests. As a result, shuffle operations cannot be guaranteed to produce the same ordering of the elements in their partitions. Therefore, Spark should update its docs to clarify that the ordering of elements in shuffle RDDs' partitions is non-deterministic. Note, however, that the _set_ of elements in each partition will be deterministic: if we used {{mapPartitions}} to sort each partition, then the {{first()}} call above would produce a deterministic result. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
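To illustrate the closing note above, a small sketch (assuming a spark-shell SparkContext named sc; the input path is a placeholder): sorting within each partition makes first() stable across runs even though the fetch order of shuffle blocks is randomized.
{code}
// Placeholder path; run in spark-shell where `sc` is predefined.
val largeFiles = sc.textFile("hdfs:///data/airlines")

val airlines = largeFiles
  .repartition(2000)
  // impose a deterministic order within each shuffled partition
  .mapPartitions(iter => iter.toArray.sorted.iterator)
  .cache()

// The set of elements per partition is deterministic, and sorting removes the
// ordering nondeterminism, so this now prints the same line on every run.
println(airlines.first())
{code}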
[jira] [Commented] (SPARK-5978) Spark examples cannot compile with Hadoop 2
[ https://issues.apache.org/jira/browse/SPARK-5978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14336914#comment-14336914 ] Michael Nazario commented on SPARK-5978: I was building this with 2.0.0-cdh4.7.0. The original version I was using which had this problem also was 2.0.0-cdh4.2.0 which is one of the pre-built distributions of spark. mvn -DskipTests -Phadoop-2.4 -Dhadoop.version=2.0.0-cdh4.7.0 dependency:tree -pl examples will show [INFO] +- org.apache.hbase:hbase-server:jar:0.98.7-hadoop1:compile Spark examples cannot compile with Hadoop 2 --- Key: SPARK-5978 URL: https://issues.apache.org/jira/browse/SPARK-5978 Project: Spark Issue Type: Bug Components: Examples, PySpark Affects Versions: 1.2.0, 1.2.1 Reporter: Michael Nazario Labels: hadoop-version This is a regression from Spark 1.1.1. The Spark Examples includes an example for an avro converter for PySpark. When I was trying to debug a problem, I discovered that even though you can build with Hadoop 2 for Spark 1.2.0, an hbase dependency depends on Hadoop 1 somewhere else in the examples code. An easy fix would be to separate the examples into hadoop specific versions. Another way would be to fix the hbase dependencies so that they don't rely on hadoop 1 specific code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5930) Documented default of spark.shuffle.io.retryWait is confusing
[ https://issues.apache.org/jira/browse/SPARK-5930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14336926#comment-14336926 ] Apache Spark commented on SPARK-5930: - User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/4769 Documented default of spark.shuffle.io.retryWait is confusing - Key: SPARK-5930 URL: https://issues.apache.org/jira/browse/SPARK-5930 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.2.0 Reporter: Andrew Or Priority: Trivial The description makes it sound like the retryWait itself defaults to 15 seconds, when it's actually 5. We should clarify this by changing the wording a little... {code} <tr> <td><code>spark.shuffle.io.retryWait</code></td> <td>5</td> <td> (Netty only) Seconds to wait between retries of fetches. The maximum delay caused by retrying is simply <code>maxRetries * retryWait</code>, by default 15 seconds. </td> </tr> {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3441) Explain in docs that repartitionAndSortWithinPartitions enacts Hadoop style shuffle
[ https://issues.apache.org/jira/browse/SPARK-3441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14336948#comment-14336948 ] Sean Owen commented on SPARK-3441: -- Since another shuffle-related doc ticket came up for action (http://issues.apache.org/jira/browse/SPARK-5750) I wonder is this still live? is it just a matter of documenting something or needs a code change? Explain in docs that repartitionAndSortWithinPartitions enacts Hadoop style shuffle --- Key: SPARK-3441 URL: https://issues.apache.org/jira/browse/SPARK-3441 Project: Spark Issue Type: Improvement Components: Documentation, Spark Core Reporter: Patrick Wendell Assignee: Sandy Ryza I think it would be good to say something like this in the doc for repartitionAndSortWithinPartitions and add also maybe in the doc for groupBy: {code} This can be used to enact a Hadoop Style shuffle along with a call to mapPartitions, e.g.: rdd.repartitionAndSortWithinPartitions(part).mapPartitions(...) {code} It might also be nice to add a version that doesn't take a partitioner and/or to mention this in the groupBy javadoc. I guess it depends a bit whether we consider this to be an API we want people to use more widely or whether we just consider it a narrow stable API mostly for Hive-on-Spark. If we want people to consider this API when porting workloads from Hadoop, then it might be worth documenting better. What do you think [~rxin] and [~matei]? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
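Expanding the one-liner in the description into a fuller sketch (a spark-shell SparkContext named sc is assumed; the input path, key extraction, and reducer body are placeholders):
{code}
import org.apache.spark.HashPartitioner

// Hadoop-style shuffle: partition by key, sort by key within each partition,
// then stream each sorted partition through reducer-like logic in mapPartitions.
val pairs = sc.textFile("hdfs:///input")
  .map(line => (line.split("\t")(0), line))   // (key, record); key extraction is a placeholder

val part = new HashPartitioner(100)

val reduced = pairs
  .repartitionAndSortWithinPartitions(part)   // keys arrive sorted within each partition
  .mapPartitions { iter =>
    // placeholder "reducer": emit one count per record for its key
    iter.map { case (key, _) => (key, 1L) }
  }
{code}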
[jira] [Updated] (SPARK-5978) Spark, Examples have Hadoop1/2 compat issues with Hadoop 2.0.x (e.g. CDH4)
[ https://issues.apache.org/jira/browse/SPARK-5978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5978: - Component/s: (was: PySpark) Build Priority: Critical (was: Major) Labels: (was: hadoop-version) Raising priority since I would like to get more discussion on whether to actually address this or reconsider Hadoop version support. Spark, Examples have Hadoop1/2 compat issues with Hadoop 2.0.x (e.g. CDH4) -- Key: SPARK-5978 URL: https://issues.apache.org/jira/browse/SPARK-5978 Project: Spark Issue Type: Bug Components: Build, Examples Affects Versions: 1.2.0, 1.2.1 Reporter: Michael Nazario Priority: Critical This is a regression from Spark 1.1.1. The Spark Examples includes an example for an avro converter for PySpark. When I was trying to debug a problem, I discovered that even though you can build with Hadoop 2 for Spark 1.2.0, an hbase dependency depends on Hadoop 1 somewhere else in the examples code. An easy fix would be to separate the examples into hadoop specific versions. Another way would be to fix the hbase dependencies so that they don't rely on hadoop 1 specific code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5124) Standardize internal RPC interface
[ https://issues.apache.org/jira/browse/SPARK-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14336974#comment-14336974 ] Aaron Davidson commented on SPARK-5124: --- I tend to prefer having explicit message.reply() semantics over Akka's weird ask. This is how the Transport layer implements RPCs, for instance: https://github.com/apache/spark/blob/master/network/common/src/main/java/org/apache/spark/network/server/RpcHandler.java#L26 I might even supplement message with a reply with failure method, so exceptions can be relayed. I have not found this mechanism to be a bottleneck at many thousands of requests per second. Standardize internal RPC interface -- Key: SPARK-5124 URL: https://issues.apache.org/jira/browse/SPARK-5124 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Reynold Xin Assignee: Shixiong Zhu Attachments: Pluggable RPC - draft 1.pdf, Pluggable RPC - draft 2.pdf In Spark we use Akka as the RPC layer. It would be great if we can standardize the internal RPC interface to facilitate testing. This will also provide the foundation to try other RPC implementations in the future. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
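As a rough illustration of the explicit-reply style described above (a reply path plus a failure path, so exceptions can be relayed), loosely modelled on the transport layer's RpcHandler; all names here are illustrative, not an existing Spark API:
{code}
// Illustrative only: the callback carries both a success and a failure path,
// so the receiving endpoint can relay exceptions back to the sender.
trait RpcCallContext {
  def reply(response: Any): Unit
  def sendFailure(error: Throwable): Unit
}

class DivisionEndpoint {
  // Example handler: replies with a result, or relays the exception on failure.
  def receiveAndReply(msg: (Int, Int), ctx: RpcCallContext): Unit =
    try {
      val (num, den) = msg
      ctx.reply(num / den)
    } catch {
      case e: ArithmeticException => ctx.sendFailure(e)
    }
}
{code}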
[jira] [Commented] (SPARK-6004) Pick the best model when training GradientBoostedTrees with validation
[ https://issues.apache.org/jira/browse/SPARK-6004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14337025#comment-14337025 ] Joseph K. Bradley commented on SPARK-6004: -- The point of the previous PR introducing validation was to allow early stopping. We should keep early stopping as an option, but I do think your JIRA/PR bring up one good point: Even if we never stop early, it may make more sense to return the best model, rather than the last model. (However, some users may want the full model since they went to all of that trouble to train it.) Ping [~MechCoder] --- what do you think? I'd vote for: * allow early stopping based on validationTol * return the best model instead of the full model (if we do not stop early while doing validation) Pick the best model when training GradientBoostedTrees with validation -- Key: SPARK-6004 URL: https://issues.apache.org/jira/browse/SPARK-6004 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Liang-Chi Hsieh Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
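A tiny sketch of the behaviour being proposed, not the MLlib implementation: keep the model with the lowest validation error while still honoring validationTol for early stopping. The function arguments are placeholders standing in for one boosting step and the validation-error computation.
{code}
// Illustrative pseudo-loop: track the best model seen so far, and stop early
// when the validation error stops improving by more than validationTol.
def trainWithValidation[M](
    numIterations: Int,
    validationTol: Double,
    boostOneIteration: Int => M,
    validationError: M => Double): M = {
  var best = boostOneIteration(0)
  var bestErr = validationError(best)
  var prevErr = bestErr
  var i = 1
  var stop = false
  while (i < numIterations && !stop) {
    val model = boostOneIteration(i)
    val err = validationError(model)
    if (err < bestErr) { best = model; bestErr = err }
    if (prevErr - err < validationTol) stop = true  // improvement too small: stop early
    prevErr = err
    i += 1
  }
  best  // the best model seen, not necessarily the last one trained
}
{code}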
[jira] [Created] (SPARK-6013) Add more Python ML examples for spark.ml
Joseph K. Bradley created SPARK-6013: Summary: Add more Python ML examples for spark.ml Key: SPARK-6013 URL: https://issues.apache.org/jira/browse/SPARK-6013 Project: Spark Issue Type: Improvement Components: ML, PySpark Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Now that the spark.ml Pipelines API is supported within Python, we should duplicate the remaining Scala/Java spark.ml examples within Python. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5124) Standardize internal RPC interface
[ https://issues.apache.org/jira/browse/SPARK-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14337028#comment-14337028 ] Reynold Xin edited comment on SPARK-5124 at 2/25/15 7:29 PM: - I went and looked at the various use cases of Akka in the code base. Quite a few of them actually also use askWithReply. There are really two categories of message deliveries: one that the sender expects a reply, and one that doesn't. As a result, I think the following interface would make more sense: {code} // receive side: def receive(msg: RpcMessage) def receiveAndReply(msg: RpcMessage) = sys.err(...) // send side: def send(message: Object): Unit def sendWithReply[T](message: Object): Future[T] {code} There is an obvious pairing here. Messages sent via send goes to receive, without requiring an ack (although if we use the transport layer, an implicit ack can be sent). Messages sent via sendWithReply goes to receiveAndReply, and the caller needs to explicitly handle the reply. In most cases, an rpc endpoint only needs to receive messages without reply, and thus can override just receive. By making obvious distinctions between the two, I think we mitigate the risk of an end point not properly responding to a message, leading to memory leaks. was (Author: rxin): I went and looked at the various use cases of Akka in the code base. Quite a few of them actually also use askWithReply. There are really two categories of message deliveries: one that the sender expects a reply, and one that doesn't. As a result, I think the following interface would make more sense: {code} // receive side: def receive(msg: RpcMessage) def receiveAndReply(msg: RpcMessage) = sys.err(...) // send side: def send(message: Object): Unit def sendWithReply[T](message: Object): Future[T] {code} There is an obvious pairing here. Messages sent via send goes to receive, without requiring an ack (although if we use the transport layer, an implicit ack can be sent). Messages sent via sendWithReply goes to receiveAndReply, and the caller needs to explicitly handle the reply. In most cases, an rpc endpoint only needs to receive messages without reply, and thus can override just receive. Standardize internal RPC interface -- Key: SPARK-5124 URL: https://issues.apache.org/jira/browse/SPARK-5124 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Reynold Xin Assignee: Shixiong Zhu Attachments: Pluggable RPC - draft 1.pdf, Pluggable RPC - draft 2.pdf In Spark we use Akka as the RPC layer. It would be great if we can standardize the internal RPC interface to facilitate testing. This will also provide the foundation to try other RPC implementations in the future. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6011) Out of disk space due to Spark not deleting shuffle files of lost executors
pankaj arora created SPARK-6011: --- Summary: Out of disk space due to Spark not deleting shuffle files of lost executors Key: SPARK-6011 URL: https://issues.apache.org/jira/browse/SPARK-6011 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.1 Environment: Running Spark in Yarn-Client mode Reporter: pankaj arora Fix For: 1.3.1 If executors are lost abruptly, Spark does not delete their shuffle files until the application ends. Ours is a long-running application that serves requests received through REST APIs, and if any executor is lost, its shuffle files are not deleted, which leads to the local disk running out of space. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6011) Out of disk space due to Spark not deleting shuffle files of lost executors
[ https://issues.apache.org/jira/browse/SPARK-6011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14336950#comment-14336950 ] Apache Spark commented on SPARK-6011: - User 'pankajarora12' has created a pull request for this issue: https://github.com/apache/spark/pull/4770 Out of disk space due to Spark not deleting shuffle files of lost executors --- Key: SPARK-6011 URL: https://issues.apache.org/jira/browse/SPARK-6011 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.1 Environment: Running Spark in Yarn-Client mode Reporter: pankaj arora Fix For: 1.3.1 If executors are lost abruptly, Spark does not delete their shuffle files until the application ends. Ours is a long-running application that serves requests received through REST APIs, and if any executor is lost, its shuffle files are not deleted, which leads to the local disk running out of space. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5978) Spark examples cannot compile with Hadoop 2
[ https://issues.apache.org/jira/browse/SPARK-5978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14336961#comment-14336961 ] Sean Owen commented on SPARK-5978: -- Ah right, the key is 2.0.x. This is a subset of the discussion at http://apache-spark-developers-list.1001551.n3.nabble.com/The-default-CDH4-build-uses-avro-mapred-hadoop1-td10699.html really. If I may, I'm going to update the title to broaden it, since I think there is more that doesn't work with Hadoop 2.0.x versions. Although as you can see my own opinion was that it maybe not worth supporting at this point, I think it's still open for discussion. Spark examples cannot compile with Hadoop 2 --- Key: SPARK-5978 URL: https://issues.apache.org/jira/browse/SPARK-5978 Project: Spark Issue Type: Bug Components: Examples, PySpark Affects Versions: 1.2.0, 1.2.1 Reporter: Michael Nazario Labels: hadoop-version This is a regression from Spark 1.1.1. The Spark Examples includes an example for an avro converter for PySpark. When I was trying to debug a problem, I discovered that even though you can build with Hadoop 2 for Spark 1.2.0, an hbase dependency depends on Hadoop 1 somewhere else in the examples code. An easy fix would be to separate the examples into hadoop specific versions. Another way would be to fix the hbase dependencies so that they don't rely on hadoop 1 specific code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5124) Standardize internal RPC interface
[ https://issues.apache.org/jira/browse/SPARK-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-5124: --- Target Version/s: 1.4.0 (was: 1.3.0) Standardize internal RPC interface -- Key: SPARK-5124 URL: https://issues.apache.org/jira/browse/SPARK-5124 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Reynold Xin Assignee: Shixiong Zhu Attachments: Pluggable RPC - draft 1.pdf, Pluggable RPC - draft 2.pdf In Spark we use Akka as the RPC layer. It would be great if we can standardize the internal RPC interface to facilitate testing. This will also provide the foundation to try other RPC implementations in the future. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5996) DataFrame.collect() doesn't recognize UDTs
[ https://issues.apache.org/jira/browse/SPARK-5996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-5996. -- Resolution: Fixed Fix Version/s: 1.3.0 DataFrame.collect() doesn't recognize UDTs -- Key: SPARK-5996 URL: https://issues.apache.org/jira/browse/SPARK-5996 Project: Spark Issue Type: Bug Components: MLlib, SQL Affects Versions: 1.3.0 Reporter: Xiangrui Meng Assignee: Michael Armbrust Priority: Blocker Fix For: 1.3.0 {code} import org.apache.spark.mllib.linalg._ case class Test(data: Vector) val df = sqlContext.createDataFrame(Seq(Test(Vectors.dense(1.0, 2.0 df.collect()[0].getAs[Vector](0) {code} throws an exception. `collect()` returns `Row` instead of `Vector`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
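The snippet in the description appears to have lost its closing parentheses in transit; a runnable rendering of the same repro (assuming a spark-shell SQLContext named sqlContext, and that only parentheses and Scala indexing syntax were dropped) would be:
{code}
import org.apache.spark.mllib.linalg._

case class Test(data: Vector)

// Per the description, collect() handed back a Row instead of the Vector,
// so the getAs cast below threw before the fix.
val df = sqlContext.createDataFrame(Seq(Test(Vectors.dense(1.0, 2.0))))
df.collect()(0).getAs[Vector](0)
{code}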
[jira] [Created] (SPARK-6012) Deadlock when asking for SchemaRDD partitions with TakeOrdered operator
Max Seiden created SPARK-6012: - Summary: Deadlock when asking for SchemaRDD partitions with TakeOrdered operator Key: SPARK-6012 URL: https://issues.apache.org/jira/browse/SPARK-6012 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.1 Reporter: Max Seiden Priority: Critical h3. Summary I've found that a deadlock occurs when asking for the partitions from a SchemaRDD that has a TakeOrdered as its terminal operator. The problem occurs when a child RDDs asks the DAGScheduler for preferred partition locations (which locks the scheduler) and eventually hits the #execute() of the TakeOrdered operator, which submits tasks but is blocked when it also tries to get preferred locations (in a separate thread). It seems like the TakeOrdered op's #execute() method should not actually submit a task (it is calling #executeCollect() and creating a new RDD) and should instead stay more true to the comment a logically apply a Limit on top of a Sort. h3. Stack Traces h4. Task Submission {noformat} main prio=5 tid=0x7f8e7280 nid=0x1303 in Object.wait() [0x00010ed5e000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on 0x0007c4c239b8 (a org.apache.spark.scheduler.JobWaiter) at java.lang.Object.wait(Object.java:503) at org.apache.spark.scheduler.JobWaiter.awaitResult(JobWaiter.scala:73) - locked 0x0007c4c239b8 (a org.apache.spark.scheduler.JobWaiter) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:514) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1321) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1390) at org.apache.spark.rdd.RDD.reduce(RDD.scala:884) at org.apache.spark.rdd.RDD.takeOrdered(RDD.scala:1161) at org.apache.spark.sql.execution.TakeOrdered.executeCollect(basicOperators.scala:183) at org.apache.spark.sql.execution.TakeOrdered.execute(basicOperators.scala:188) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:425) - locked 0x0007c36ce038 (a org.apache.spark.sql.hive.HiveContext$$anon$7) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:425) at org.apache.spark.sql.SchemaRDD.getDependencies(SchemaRDD.scala:127) at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:209) at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:207) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.dependencies(RDD.scala:207) at org.apache.spark.rdd.RDD.firstParent(RDD.scala:1278) at org.apache.spark.sql.SchemaRDD.getPartitions(SchemaRDD.scala:122) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:220) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:220) at org.apache.spark.ShuffleDependency.init(Dependency.scala:79) at org.apache.spark.rdd.ShuffledRDD.getDependencies(ShuffledRDD.scala:80) at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:209) at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:207) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.dependencies(RDD.scala:207) at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1333) at org.apache.spark.scheduler.DAGScheduler.getPreferredLocs(DAGScheduler.scala:1304) - locked 0x0007f55c2238 (a org.apache.spark.scheduler.DAGScheduler) at org.apache.spark.SparkContext.getPreferredLocs(SparkContext.scala:1148) at org.apache.spark.rdd.PartitionCoalescer.currPrefLocs(CoalescedRDD.scala:175) at org.apache.spark.rdd.PartitionCoalescer$LocationIterator$$anonfun$5$$anonfun$apply$2.apply(CoalescedRDD.scala:192) at org.apache.spark.rdd.PartitionCoalescer$LocationIterator$$anonfun$5$$anonfun$apply$2.apply(CoalescedRDD.scala:191) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:350) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:350) at
[jira] [Updated] (SPARK-6016) Cannot read the parquet table after overwriting the existing table when spark.sql.parquet.cacheMetadata=true
[ https://issues.apache.org/jira/browse/SPARK-6016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-6016: Description: saveAsTable is fine and seems we have successfully deleted the old data and written the new data. However, when reading the newly created table, an error will be thrown. {code} Error in SQL statement: java.lang.RuntimeException: java.lang.RuntimeException: could not merge metadata: key org.apache.spark.sql.parquet.row.metadata has conflicting values: at parquet.hadoop.api.InitContext.getMergedKeyValueMetaData(InitContext.java:67) at parquet.hadoop.api.ReadSupport.init(ReadSupport.java:84) at org.apache.spark.sql.parquet.FilteringParquetRowInputFormat.getSplits(ParquetTableOperations.scala:469) at parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:245) at org.apache.spark.sql.parquet.ParquetRelation2$$anon$1.getPartitions(newParquet.scala:461) ... {code} If I set spark.sql.parquet.cacheMetadata to false, it's fine to query the data. Note: the newly created table needs to have more than one file to trigger the bug (if there is only a single file, we will not need to merge metadata). was: saveAsTable is fine and seems we have successfully deleted the old data and written the new data. However, when reading the newly created table, an error will be thrown. {code} Error in SQL statement: java.lang.RuntimeException: java.lang.RuntimeException: could not merge metadata: key org.apache.spark.sql.parquet.row.metadata has conflicting values: at parquet.hadoop.api.InitContext.getMergedKeyValueMetaData(InitContext.java:67) at parquet.hadoop.api.ReadSupport.init(ReadSupport.java:84) at org.apache.spark.sql.parquet.FilteringParquetRowInputFormat.getSplits(ParquetTableOperations.scala:469) at parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:245) at org.apache.spark.sql.parquet.ParquetRelation2$$anon$1.getPartitions(newParquet.scala:461) ... {code} If I set spark.sql.parquet.cacheMetadata to false, it's fine to query the data. Cannot read the parquet table after overwriting the existing table when spark.sql.parquet.cacheMetadata=true Key: SPARK-6016 URL: https://issues.apache.org/jira/browse/SPARK-6016 Project: Spark Issue Type: Bug Components: SQL Reporter: Yin Huai Priority: Blocker saveAsTable is fine and seems we have successfully deleted the old data and written the new data. However, when reading the newly created table, an error will be thrown. {code} Error in SQL statement: java.lang.RuntimeException: java.lang.RuntimeException: could not merge metadata: key org.apache.spark.sql.parquet.row.metadata has conflicting values: at parquet.hadoop.api.InitContext.getMergedKeyValueMetaData(InitContext.java:67) at parquet.hadoop.api.ReadSupport.init(ReadSupport.java:84) at org.apache.spark.sql.parquet.FilteringParquetRowInputFormat.getSplits(ParquetTableOperations.scala:469) at parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:245) at org.apache.spark.sql.parquet.ParquetRelation2$$anon$1.getPartitions(newParquet.scala:461) ... {code} If I set spark.sql.parquet.cacheMetadata to false, it's fine to query the data. Note: the newly created table needs to have more than one file to trigger the bug (if there is only a single file, we will not need to merge metadata). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
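A rough repro sketch of the scenario in the description. The saveAsTable overload and SaveMode usage are assumed from the 1.3-era DataFrame API and should be treated as illustrative; a HiveContext named sqlContext is assumed, and repartition(4) is there only to produce the multiple files needed to trigger the metadata merge.
{code}
import org.apache.spark.sql.SaveMode

case class Rec(id: Int, name: String)

// Write the table twice so the second write overwrites existing data, then
// read it back; with spark.sql.parquet.cacheMetadata=true the read fails.
val data = sqlContext
  .createDataFrame((1 to 1000).map(i => Rec(i, s"name-$i")))
  .repartition(4)                                     // ensure more than one Parquet file

data.saveAsTable("t", "parquet", SaveMode.Overwrite)  // initial table
data.saveAsTable("t", "parquet", SaveMode.Overwrite)  // overwrite the existing table
sqlContext.table("t").count()                         // throws "could not merge metadata"
{code}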
[jira] [Commented] (SPARK-5750) Document that ordering of elements in shuffled partitions is not deterministic across runs
[ https://issues.apache.org/jira/browse/SPARK-5750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14337116#comment-14337116 ] Sean Owen commented on SPARK-5750: -- Yes, although on some of those I'm not as clear what the resolution is. If it's going to extend the scope of this beyond a paragraph in the programming guide, they're best dealt with separately. Document that ordering of elements in shuffled partitions is not deterministic across runs -- Key: SPARK-5750 URL: https://issues.apache.org/jira/browse/SPARK-5750 Project: Spark Issue Type: Improvement Components: Documentation Reporter: Josh Rosen The ordering of elements in shuffled partitions is not deterministic across runs. For instance, consider the following example: {code} val largeFiles = sc.textFile(...) val airlines = largeFiles.repartition(2000).cache() println(airlines.first) {code} If this code is run twice, then each run will output a different result. There is non-determinism in the shuffle read code that accounts for this: Spark's shuffle read path processes blocks as soon as they are fetched Spark uses [ShuffleBlockFetcherIterator|https://github.com/apache/spark/blob/v1.2.1/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala] to fetch shuffle data from mappers. In this code, requests for multiple blocks from the same host are batched together, so nondeterminism in where tasks are run means that the set of requests can vary across runs. In addition, there's an [explicit call|https://github.com/apache/spark/blob/v1.2.1/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala#L256] to randomize the order of the batched fetch requests. As a result, shuffle operations cannot be guaranteed to produce the same ordering of the elements in their partitions. Therefore, Spark should update its docs to clarify that the ordering of elements in shuffle RDDs' partitions is non-deterministic. Note, however, that the _set_ of elements in each partition will be deterministic: if we used {{mapPartitions}} to sort each partition, then the {{first()}} call above would produce a deterministic result. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6011) Out of disk space due to Spark not deleting shuffle files of lost executors
[ https://issues.apache.org/jira/browse/SPARK-6011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6011: - Component/s: (was: Spark Core) YARN Target Version/s: (was: 1.3.1) Fix Version/s: (was: 1.3.1) So from the PR discussion, I don't believe the proposed change can proceed. I am not sure it gets at the underlying issue here either, which is a concern that has been raised, for example, in https://issues.apache.org/jira/browse/SPARK-5836 . I do think it's worth tracking a) at least documenting this, and Marcelo's suggestion that maybe the block manager can later do more proactive cleanup. Have you used {{spark.cleaner.ttl}} by the way? I'm not sure if it helps in this case. I'd like to focus the discussion in one place so would prefer to resolve this as a duplicate and merge into SPARK-5836 Out of disk space due to Spark not deleting shuffle files of lost executors --- Key: SPARK-6011 URL: https://issues.apache.org/jira/browse/SPARK-6011 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.1 Environment: Running Spark in Yarn-Client mode Reporter: pankaj arora If Executors gets lost abruptly spark does not delete its shuffle files till application ends. Ours is long running application which is serving requests received through REST APIs and if any of the executor gets lost shuffle files are not deleted and that leads to local disk going out of space. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5836) Highlight in Spark documentation that by default Spark does not delete its temporary files
[ https://issues.apache.org/jira/browse/SPARK-5836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14337128#comment-14337128 ] Sean Owen commented on SPARK-5836: -- I'd like to take this up, since I've heard versions of this come up frequently lately. A first step is indeed improving documentation. I want to confirm or deny things I only sort of know about how temp files are treated. - Temp files/dirs created by executors may live as long as the executors, but should be deleted with executors? - Shuffle files however may live longer? - {{spark.cleaner.ttl}} is relevant to this or no? If we believe that temp files die when they should (er, well, [~vanzin] is fixing a few things around temp dirs right now), then is the surprising thing here the life of shuffle files? In which case maybe [~ilganeli] can cover this when writing up some basics about how the shuffle works? But I want to figure out definitively what the right thing is to say about behavior right now, even if the behavior should or could be different in the future. CC [~sandyr] Highlight in Spark documentation that by default Spark does not delete its temporary files -- Key: SPARK-5836 URL: https://issues.apache.org/jira/browse/SPARK-5836 Project: Spark Issue Type: Improvement Components: Documentation Reporter: Tomasz Dudziak We recently learnt the hard way (in a prod system) that Spark by default does not delete its temporary files until it is stopped. WIthin a relatively short time span of heavy Spark use the disk of our prod machine filled up completely because of multiple shuffle files written to it. We think there should be better documentation around the fact that after a job is finished it leaves a lot of rubbish behind so that this does not come as a surprise. Probably a good place to highlight that fact would be the documentation of {{spark.local.dir}} property, which controls where Spark temporary files are written. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
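For reference while the documentation question is sorted out, the knobs mentioned in this thread are ordinary Spark configuration entries; a minimal sketch with illustrative values:
{code}
import org.apache.spark.{SparkConf, SparkContext}

// Values are illustrative. spark.local.dir controls where temp/shuffle files
// are written; spark.cleaner.ttl (seconds) enables periodic cleanup of old
// metadata and shuffle data for long-running applications.
val conf = new SparkConf()
  .setAppName("long-running-service")
  .set("spark.local.dir", "/mnt/bigdisk/spark-tmp")
  .set("spark.cleaner.ttl", "3600")
val sc = new SparkContext(conf)
{code}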
[jira] [Created] (SPARK-6016) Cannot read the parquet table after overwriting the existing table when spark.sql.parquet.cacheMetadata=true
Yin Huai created SPARK-6016: --- Summary: Cannot read the parquet table after overwriting the existing table when spark.sql.parquet.cacheMetadata=true Key: SPARK-6016 URL: https://issues.apache.org/jira/browse/SPARK-6016 Project: Spark Issue Type: Bug Components: SQL Reporter: Yin Huai Priority: Blocker saveAsTable is fine and seems we have successfully deleted the old data and written the new data. However, when reading the newly created table, an error will be thrown. {code} Error in SQL statement: java.lang.RuntimeException: java.lang.RuntimeException: could not merge metadata: key org.apache.spark.sql.parquet.row.metadata has conflicting values: at parquet.hadoop.api.InitContext.getMergedKeyValueMetaData(InitContext.java:67) at parquet.hadoop.api.ReadSupport.init(ReadSupport.java:84) at org.apache.spark.sql.parquet.FilteringParquetRowInputFormat.getSplits(ParquetTableOperations.scala:469) at parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:245) at org.apache.spark.sql.parquet.ParquetRelation2$$anon$1.getPartitions(newParquet.scala:461) ... {code} If I set spark.sql.parquet.cacheMetadata to false, it's fine to query the data. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6016) Cannot read the parquet table after overwriting the existing table when spark.sql.parquet.cacheMetadata=true
[ https://issues.apache.org/jira/browse/SPARK-6016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14337189#comment-14337189 ] Yin Huai commented on SPARK-6016: - cc [~lian cheng] Cannot read the parquet table after overwriting the existing table when spark.sql.parquet.cacheMetadata=true Key: SPARK-6016 URL: https://issues.apache.org/jira/browse/SPARK-6016 Project: Spark Issue Type: Bug Components: SQL Reporter: Yin Huai Priority: Blocker saveAsTable is fine and seems we have successfully deleted the old data and written the new data. However, when reading the newly created table, an error will be thrown. {code} Error in SQL statement: java.lang.RuntimeException: java.lang.RuntimeException: could not merge metadata: key org.apache.spark.sql.parquet.row.metadata has conflicting values: at parquet.hadoop.api.InitContext.getMergedKeyValueMetaData(InitContext.java:67) at parquet.hadoop.api.ReadSupport.init(ReadSupport.java:84) at org.apache.spark.sql.parquet.FilteringParquetRowInputFormat.getSplits(ParquetTableOperations.scala:469) at parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:245) at org.apache.spark.sql.parquet.ParquetRelation2$$anon$1.getPartitions(newParquet.scala:461) ... {code} If I set spark.sql.parquet.cacheMetadata to false, it's fine to query the data. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6015) Python docs' source code links are all broken
[ https://issues.apache.org/jira/browse/SPARK-6015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14337247#comment-14337247 ] Joseph K. Bradley commented on SPARK-6015: -- That would be great; would you mind making a PR for that? I'll modify this JIRA to be for that backport. Python docs' source code links are all broken - Key: SPARK-6015 URL: https://issues.apache.org/jira/browse/SPARK-6015 Project: Spark Issue Type: Documentation Components: Documentation, PySpark Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Davies Liu Priority: Minor Fix For: 1.3.0 The Python docs display {code}[source]{code} links which should link to source code, but none work in the documentation provided on the Apache Spark website. E.g., go here [https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#module-pyspark.mllib.regression] and click a source code link on the right. I believe these are generated because we have viewcode set in the doc conf file: [https://github.com/apache/spark/blob/master/python/docs/conf.py#L33] I'm not sure why the source code is not being generated/posted. Should the links be removed, or do we want to add the source code? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6015) Backport Python doc source code link fix to 1.2
[ https://issues.apache.org/jira/browse/SPARK-6015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-6015: - Description: The Python docs display {code}[source]{code} links which should link to source code, but none work in the documentation provided on the Apache Spark website. E.g., go here [https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#module-pyspark.mllib.regression] and click a source code link on the right. The PR for [SPARK-5994] included a fix of this for 1.3 which should be backported to 1.2. was: The Python docs display {code}[source]{code} links which should link to source code, but none work in the documentation provided on the Apache Spark website. E.g., go here [https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#module-pyspark.mllib.regression] and click a source code link on the right. I believe these are generated because we have viewcode set in the doc conf file: [https://github.com/apache/spark/blob/master/python/docs/conf.py#L33] I'm not sure why the source code is not being generated/posted. Should the links be removed, or do we want to add the source code? Backport Python doc source code link fix to 1.2 --- Key: SPARK-6015 URL: https://issues.apache.org/jira/browse/SPARK-6015 Project: Spark Issue Type: Documentation Components: Documentation, PySpark Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Davies Liu Priority: Minor Fix For: 1.3.0 The Python docs display {code}[source]{code} links which should link to source code, but none work in the documentation provided on the Apache Spark website. E.g., go here [https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#module-pyspark.mllib.regression] and click a source code link on the right. The PR for [SPARK-5994] included a fix of this for 1.3 which should be backported to 1.2. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6015) Backport Python doc source code link fix to 1.2
[ https://issues.apache.org/jira/browse/SPARK-6015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-6015: - Summary: Backport Python doc source code link fix to 1.2 (was: Python docs' source code links are all broken) Backport Python doc source code link fix to 1.2 --- Key: SPARK-6015 URL: https://issues.apache.org/jira/browse/SPARK-6015 Project: Spark Issue Type: Documentation Components: Documentation, PySpark Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Davies Liu Priority: Minor Fix For: 1.3.0 The Python docs display {code}[source]{code} links which should link to source code, but none work in the documentation provided on the Apache Spark website. E.g., go here [https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#module-pyspark.mllib.regression] and click a source code link on the right. I believe these are generated because we have viewcode set in the doc conf file: [https://github.com/apache/spark/blob/master/python/docs/conf.py#L33] I'm not sure why the source code is not being generated/posted. Should the links be removed, or do we want to add the source code? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6015) Backport Python doc source code link fix to 1.2
[ https://issues.apache.org/jira/browse/SPARK-6015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-6015: - Affects Version/s: (was: 1.3.0) 1.2.1 Backport Python doc source code link fix to 1.2 --- Key: SPARK-6015 URL: https://issues.apache.org/jira/browse/SPARK-6015 Project: Spark Issue Type: Documentation Components: Documentation, PySpark Affects Versions: 1.2.1 Reporter: Joseph K. Bradley Assignee: Davies Liu Priority: Minor The Python docs display {code}[source]{code} links which should link to source code, but none work in the documentation provided on the Apache Spark website. E.g., go here [https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#module-pyspark.mllib.regression] and click a source code link on the right. The PR for [SPARK-5994] included a fix of this for 1.3 which should be backported to 1.2. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-6015) Backport Python doc source code link fix to 1.2
[ https://issues.apache.org/jira/browse/SPARK-6015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley reopened SPARK-6015: -- Backport Python doc source code link fix to 1.2 --- Key: SPARK-6015 URL: https://issues.apache.org/jira/browse/SPARK-6015 Project: Spark Issue Type: Documentation Components: Documentation, PySpark Affects Versions: 1.2.1 Reporter: Joseph K. Bradley Assignee: Davies Liu Priority: Minor The Python docs display {code}[source]{code} links which should link to source code, but none work in the documentation provided on the Apache Spark website. E.g., go here [https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#module-pyspark.mllib.regression] and click a source code link on the right. The PR for [SPARK-5994] included a fix of this for 1.3 which should be backported to 1.2. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5975) SparkSubmit --jars not present on driver in python
[ https://issues.apache.org/jira/browse/SPARK-5975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-5975: - Component/s: (was: Spark Core) PySpark SparkSubmit --jars not present on driver in python -- Key: SPARK-5975 URL: https://issues.apache.org/jira/browse/SPARK-5975 Project: Spark Issue Type: Bug Components: PySpark, Spark Submit Affects Versions: 1.3.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Critical Reported by [~tdas]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4845) Adding a parallelismRatio to control the partitions num of shuffledRDD
[ https://issues.apache.org/jira/browse/SPARK-4845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-4845. -- Resolution: Won't Fix Fix Version/s: (was: 1.3.0) Target Version/s: (was: 1.3.0) See PR discussion. Adding a parallelismRatio to control the partitions num of shuffledRDD -- Key: SPARK-4845 URL: https://issues.apache.org/jira/browse/SPARK-4845 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.1.0 Reporter: Fei Wang Add a parallelismRatio to control the number of partitions of a shuffled RDD. The rule is: Math.max(1, parallelismRatio * number of partitions of the largest upstream RDD). The ratio defaults to 1.0 to stay compatible with the old behavior; once we have more experience with it, we can change the default. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
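For illustration, a sketch of the proposed rule; the configuration key name {{spark.default.parallelismRatio}} and the helper function are hypothetical, not part of the PR:
{code}
import org.apache.spark.SparkConf

// Hypothetical helper showing the proposed rule:
// partitions = max(1, ratio * partitions of the largest upstream RDD)
def defaultShufflePartitions(conf: SparkConf, upstreamPartitionCounts: Seq[Int]): Int = {
  val ratio = conf.getDouble("spark.default.parallelismRatio", 1.0) // hypothetical key, defaults to 1.0
  math.max(1, (ratio * upstreamPartitionCounts.max).toInt)
}
{code}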
[jira] [Updated] (SPARK-6014) java.io.IOException: Filesystem is thrown when ctrl+c or ctrl+d spark-sql on YARN
[ https://issues.apache.org/jira/browse/SPARK-6014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6014: - Component/s: YARN java.io.IOException: Filesystem is thrown when ctrl+c or ctrl+d spark-sql on YARN -- Key: SPARK-6014 URL: https://issues.apache.org/jira/browse/SPARK-6014 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.3.0 Environment: Hadoop 2.4, YARN Reporter: Cheolsoo Park Priority: Minor Labels: yarn This is a regression of SPARK-2261. In branch-1.3 and master, {{EventLoggingListener}} throws {{java.io.IOException: Filesystem closed}} when ctrl+c or ctrl+d the spark-sql shell. The root cause is that DFSClient is already shut down before EventLoggingListener invokes the following HDFS methods, and thus, DFSClient.isClientRunning() check fails- {code} Line #135: hadoopDataStream.foreach(hadoopFlushMethod.invoke(_)) Line #187: if (fileSystem.exists(target)) { {code} The followings are full stack trace- {code} java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$3.apply(EventLoggingListener.scala:135) at org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$3.apply(EventLoggingListener.scala:135) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:135) at org.apache.spark.scheduler.EventLoggingListener.onApplicationEnd(EventLoggingListener.scala:170) at org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:54) at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31) at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31) at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:53) at org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:36) at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:76) at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply(AsynchronousListenerBus.scala:61) at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply(AsynchronousListenerBus.scala:61) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1613) at org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:60) Caused by: java.io.IOException: Filesystem closed at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:707) at org.apache.hadoop.hdfs.DFSOutputStream.flushOrSync(DFSOutputStream.java:1843) at org.apache.hadoop.hdfs.DFSOutputStream.hflush(DFSOutputStream.java:1804) at org.apache.hadoop.fs.FSDataOutputStream.hflush(FSDataOutputStream.java:127) ... 
19 more {code} {code} Exception in thread Thread-3 java.io.IOException: Filesystem closed at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:707) at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1760) at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1124) at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1120) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1120) at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1398) at org.apache.spark.scheduler.EventLoggingListener.stop(EventLoggingListener.scala:187) at org.apache.spark.SparkContext$$anonfun$stop$4.apply(SparkContext.scala:1379) at org.apache.spark.SparkContext$$anonfun$stop$4.apply(SparkContext.scala:1379) at scala.Option.foreach(Option.scala:236) at org.apache.spark.SparkContext.stop(SparkContext.scala:1379) at org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.stop(SparkSQLEnv.scala:66) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$$anon$1.run(SparkSQLCLIDriver.scala:107) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To
[jira] [Updated] (SPARK-5982) Remove Local Read Time
[ https://issues.apache.org/jira/browse/SPARK-5982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5982: - Component/s: Spark Core Remove Local Read Time -- Key: SPARK-5982 URL: https://issues.apache.org/jira/browse/SPARK-5982 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Kay Ousterhout Assignee: Kay Ousterhout Priority: Blocker LocalReadTime was added to TaskMetrics for the 1.3.0 release. However, this time is actually only a small subset of the local read time, because local shuffle files are memory mapped, so most of the read time occurs later, as data is read from the memory mapped files and the data actually gets read from disk. We should remove this before the 1.3.0 release, so we never expose this incomplete info. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5949) Driver program has to register roaring bitmap classes used by spark with Kryo when number of partitions is greater than 2000
[ https://issues.apache.org/jira/browse/SPARK-5949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5949: - Component/s: Spark Core Driver program has to register roaring bitmap classes used by spark with Kryo when number of partitions is greater than 2000 Key: SPARK-5949 URL: https://issues.apache.org/jira/browse/SPARK-5949 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Peter Torok Labels: kryo, partitioning, serialization When more than 2000 partitions are being used with Kryo, the following classes need to be registered by driver program: - org.apache.spark.scheduler.HighlyCompressedMapStatus - org.roaringbitmap.RoaringBitmap - org.roaringbitmap.RoaringArray - org.roaringbitmap.ArrayContainer - org.roaringbitmap.RoaringArray$Element - org.roaringbitmap.RoaringArray$Element[] - short[] Our project doesn't have dependency on roaring bitmap and HighlyCompressedMapStatus is intended for internal spark usage. Spark should take care of this registration when Kryo is used. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
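Until Spark handles this registration itself, a sketch of the manual workaround the description implies; the class list is taken from the report, and the internal classes are loaded by name because they are not part of Spark's public API:
{code}
import org.apache.spark.SparkConf

// Workaround sketch: register the classes from the driver program when using Kryo.
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array[Class[_]](
    Class.forName("org.apache.spark.scheduler.HighlyCompressedMapStatus"),
    classOf[org.roaringbitmap.RoaringBitmap],
    classOf[org.roaringbitmap.RoaringArray],
    classOf[org.roaringbitmap.ArrayContainer],
    Class.forName("org.roaringbitmap.RoaringArray$Element"),
    Class.forName("[Lorg.roaringbitmap.RoaringArray$Element;"),
    classOf[Array[Short]]
  ))
{code}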
[jira] [Assigned] (SPARK-6015) Python docs' source code links are all broken
[ https://issues.apache.org/jira/browse/SPARK-6015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu reassigned SPARK-6015: - Assignee: Davies Liu Python docs' source code links are all broken - Key: SPARK-6015 URL: https://issues.apache.org/jira/browse/SPARK-6015 Project: Spark Issue Type: Documentation Components: Documentation, PySpark Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Davies Liu Priority: Minor The Python docs display {code}[source]{code} links which should link to source code, but none work in the documentation provided on the Apache Spark website. E.g., go here [https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#module-pyspark.mllib.regression] and click a source code link on the right. I believe these are generated because we have viewcode set in the doc conf file: [https://github.com/apache/spark/blob/master/python/docs/conf.py#L33] I'm not sure why the source code is not being generated/posted. Should the links be removed, or do we want to add the source code? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6015) Python docs' source code links are all broken
[ https://issues.apache.org/jira/browse/SPARK-6015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14337240#comment-14337240 ] Davies Liu commented on SPARK-6015: --- Should we backport that fix into 1.2? Python docs' source code links are all broken - Key: SPARK-6015 URL: https://issues.apache.org/jira/browse/SPARK-6015 Project: Spark Issue Type: Documentation Components: Documentation, PySpark Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Davies Liu Priority: Minor Fix For: 1.3.0 The Python docs display {code}[source]{code} links which should link to source code, but none work in the documentation provided on the Apache Spark website. E.g., go here [https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#module-pyspark.mllib.regression] and click a source code link on the right. I believe these are generated because we have viewcode set in the doc conf file: [https://github.com/apache/spark/blob/master/python/docs/conf.py#L33] I'm not sure why the source code is not being generated/posted. Should the links be removed, or do we want to add the source code? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6015) Python docs' source code links are all broken
[ https://issues.apache.org/jira/browse/SPARK-6015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-6015. --- Resolution: Fixed Fix Version/s: 1.3.0 Python docs' source code links are all broken - Key: SPARK-6015 URL: https://issues.apache.org/jira/browse/SPARK-6015 Project: Spark Issue Type: Documentation Components: Documentation, PySpark Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Davies Liu Priority: Minor Fix For: 1.3.0 The Python docs display {code}[source]{code} links which should link to source code, but none work in the documentation provided on the Apache Spark website. E.g., go here [https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#module-pyspark.mllib.regression] and click a source code link on the right. I believe these are generated because we have viewcode set in the doc conf file: [https://github.com/apache/spark/blob/master/python/docs/conf.py#L33] I'm not sure why the source code is not being generated/posted. Should the links be removed, or do we want to add the source code? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6015) Backport Python doc source code link fix to 1.2
[ https://issues.apache.org/jira/browse/SPARK-6015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-6015: - Target Version/s: 1.2.2 (was: 1.3.0) Backport Python doc source code link fix to 1.2 --- Key: SPARK-6015 URL: https://issues.apache.org/jira/browse/SPARK-6015 Project: Spark Issue Type: Documentation Components: Documentation, PySpark Affects Versions: 1.2.1 Reporter: Joseph K. Bradley Assignee: Davies Liu Priority: Minor The Python docs display {code}[source]{code} links which should link to source code, but none work in the documentation provided on the Apache Spark website. E.g., go here [https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#module-pyspark.mllib.regression] and click a source code link on the right. The PR for [SPARK-5994] included a fix of this for 1.3 which should be backported to 1.2. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6015) Backport Python doc source code link fix to 1.2
[ https://issues.apache.org/jira/browse/SPARK-6015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-6015: - Fix Version/s: (was: 1.3.0) Backport Python doc source code link fix to 1.2 --- Key: SPARK-6015 URL: https://issues.apache.org/jira/browse/SPARK-6015 Project: Spark Issue Type: Documentation Components: Documentation, PySpark Affects Versions: 1.2.1 Reporter: Joseph K. Bradley Assignee: Davies Liu Priority: Minor The Python docs display {code}[source]{code} links which should link to source code, but none work in the documentation provided on the Apache Spark website. E.g., go here [https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#module-pyspark.mllib.regression] and click a source code link on the right. The PR for [SPARK-5994] included a fix of this for 1.3 which should be backported to 1.2. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5970) Temporary directories are not removed (but their content is)
[ https://issues.apache.org/jira/browse/SPARK-5970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5970: - Priority: Minor (was: Major) Assignee: Milan Straka Temporary directories are not removed (but their content is) Key: SPARK-5970 URL: https://issues.apache.org/jira/browse/SPARK-5970 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.1 Environment: Linux, 64bit spark-1.2.1-bin-hadoop2.4.tgz Reporter: Milan Straka Assignee: Milan Straka Priority: Minor Fix For: 1.4.0 How to reproduce: - extract spark-1.2.1-bin-hadoop2.4.tgz - without any further configuration, run bin/pyspark - run sc.stop() and close python shell Expected results: - no temporary directories are left in /tmp Actual results: - four empty temporary directories are created in /tmp, for example after {{ls -d /tmp/spark*}}:{code} /tmp/spark-1577b13d-4b9a-4e35-bac2-6e84e5605f53 /tmp/spark-96084e69-77fd-42fb-ab10-e1fc74296fe3 /tmp/spark-ab2ea237-d875-485e-b16c-5b0ac31bd753 /tmp/spark-ddeb0363-4760-48a4-a189-81321898b146 {code} The issue is caused by changes in {{util/Utils.scala}}. Consider the {{createDirectory}}: {code} /** * Create a directory inside the given parent directory. The directory is guaranteed to be * newly created, and is not marked for automatic deletion. */ def createDirectory(root: String, namePrefix: String = spark): File = ... {code} The {{createDirectory}} is used in two places. The first is in {{createTempDir}}, where it is marked for automatic deletion: {code} def createTempDir( root: String = System.getProperty(java.io.tmpdir), namePrefix: String = spark): File = { val dir = createDirectory(root, namePrefix) registerShutdownDeleteDir(dir) dir } {code} Nevertheless, it is also used in {{getOrCreateLocalDirs}} where it is _not_ marked for automatic deletion: {code} private[spark] def getOrCreateLocalRootDirs(conf: SparkConf): Array[String] = { if (isRunningInYarnContainer(conf)) { // If we are in yarn mode, systems can have different disk layouts so we must set it // to what Yarn on this system said was available. Note this assumes that Yarn has // created the directories already, and that they are secured so that only the // user has access to them. getYarnLocalDirs(conf).split(,) } else { // In non-Yarn mode (or for the driver in yarn-client mode), we cannot trust the user // configuration to point to a secure directory. So create a subdirectory with restricted // permissions under each listed directory. Option(conf.getenv(SPARK_LOCAL_DIRS)) .getOrElse(conf.get(spark.local.dir, System.getProperty(java.io.tmpdir))) .split(,) .flatMap { root = try { val rootDir = new File(root) if (rootDir.exists || rootDir.mkdirs()) { Some(createDirectory(root).getAbsolutePath()) } else { logError(sFailed to create dir in $root. Ignoring this directory.) None } } catch { case e: IOException = logError(sFailed to create local root dir in $root. Ignoring this directory.) None } } .toArray } } {code} Therefore I think the {code} Some(createDirectory(root).getAbsolutePath()) {code} should be replaced by something like (I am not an experienced Scala programmer): {code} val dir = createDirectory(root) registerShutdownDeleteDir(dir) Some(dir.getAbsolutePath()) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5970) Temporary directories are not removed (but their content is)
[ https://issues.apache.org/jira/browse/SPARK-5970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-5970. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 4759 [https://github.com/apache/spark/pull/4759] Temporary directories are not removed (but their content is) Key: SPARK-5970 URL: https://issues.apache.org/jira/browse/SPARK-5970 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.1 Environment: Linux, 64bit spark-1.2.1-bin-hadoop2.4.tgz Reporter: Milan Straka Fix For: 1.4.0 How to reproduce: - extract spark-1.2.1-bin-hadoop2.4.tgz - without any further configuration, run bin/pyspark - run sc.stop() and close python shell Expected results: - no temporary directories are left in /tmp Actual results: - four empty temporary directories are created in /tmp, for example after {{ls -d /tmp/spark*}}:{code} /tmp/spark-1577b13d-4b9a-4e35-bac2-6e84e5605f53 /tmp/spark-96084e69-77fd-42fb-ab10-e1fc74296fe3 /tmp/spark-ab2ea237-d875-485e-b16c-5b0ac31bd753 /tmp/spark-ddeb0363-4760-48a4-a189-81321898b146 {code} The issue is caused by changes in {{util/Utils.scala}}. Consider the {{createDirectory}}: {code} /** * Create a directory inside the given parent directory. The directory is guaranteed to be * newly created, and is not marked for automatic deletion. */ def createDirectory(root: String, namePrefix: String = spark): File = ... {code} The {{createDirectory}} is used in two places. The first is in {{createTempDir}}, where it is marked for automatic deletion: {code} def createTempDir( root: String = System.getProperty(java.io.tmpdir), namePrefix: String = spark): File = { val dir = createDirectory(root, namePrefix) registerShutdownDeleteDir(dir) dir } {code} Nevertheless, it is also used in {{getOrCreateLocalDirs}} where it is _not_ marked for automatic deletion: {code} private[spark] def getOrCreateLocalRootDirs(conf: SparkConf): Array[String] = { if (isRunningInYarnContainer(conf)) { // If we are in yarn mode, systems can have different disk layouts so we must set it // to what Yarn on this system said was available. Note this assumes that Yarn has // created the directories already, and that they are secured so that only the // user has access to them. getYarnLocalDirs(conf).split(,) } else { // In non-Yarn mode (or for the driver in yarn-client mode), we cannot trust the user // configuration to point to a secure directory. So create a subdirectory with restricted // permissions under each listed directory. Option(conf.getenv(SPARK_LOCAL_DIRS)) .getOrElse(conf.get(spark.local.dir, System.getProperty(java.io.tmpdir))) .split(,) .flatMap { root = try { val rootDir = new File(root) if (rootDir.exists || rootDir.mkdirs()) { Some(createDirectory(root).getAbsolutePath()) } else { logError(sFailed to create dir in $root. Ignoring this directory.) None } } catch { case e: IOException = logError(sFailed to create local root dir in $root. Ignoring this directory.) None } } .toArray } } {code} Therefore I think the {code} Some(createDirectory(root).getAbsolutePath()) {code} should be replaced by something like (I am not an experienced Scala programmer): {code} val dir = createDirectory(root) registerShutdownDeleteDir(dir) Some(dir.getAbsolutePath()) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5975) SparkSubmit --jars not present on driver in python
[ https://issues.apache.org/jira/browse/SPARK-5975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-5975: - Summary: SparkSubmit --jars not present on driver in python (was: SparkSubmit --jars not present on driver) SparkSubmit --jars not present on driver in python -- Key: SPARK-5975 URL: https://issues.apache.org/jira/browse/SPARK-5975 Project: Spark Issue Type: Bug Components: PySpark, Spark Submit Affects Versions: 1.3.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Critical Reported by [~tdas]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1182) Sort the configuration parameters in configuration.md
[ https://issues.apache.org/jira/browse/SPARK-1182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14337308#comment-14337308 ] Brennon York commented on SPARK-1182: - [~rxin] I've incorporated all the changes you mentioned (save for #6) and updated all merge conflicts thus far (not fun haha). If we want to move forward with this I'd ask that we move with a bit of brevity on this one before I need to fix a ton of merge conflicts again :/ Thanks! Sort the configuration parameters in configuration.md - Key: SPARK-1182 URL: https://issues.apache.org/jira/browse/SPARK-1182 Project: Spark Issue Type: Task Components: Documentation Reporter: Reynold Xin Assignee: prashant Priority: Minor It is a little bit confusing right now since the config options are all over the place in some arbitrarily sorted order. https://github.com/apache/spark/blob/master/docs/configuration.md -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4924) Factor out code to launch Spark applications into a separate library
[ https://issues.apache.org/jira/browse/SPARK-4924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-4924: -- Target Version/s: 1.4.0 (was: 1.3.0) Factor out code to launch Spark applications into a separate library Key: SPARK-4924 URL: https://issues.apache.org/jira/browse/SPARK-4924 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Marcelo Vanzin Assignee: Marcelo Vanzin Attachments: spark-launcher.txt One of the questions we run into rather commonly is how to start a Spark application from my Java/Scala program?. There currently isn't a good answer to that: - Instantiating SparkContext has limitations (e.g., you can only have one active context at the moment, plus you lose the ability to submit apps in cluster mode) - Calling SparkSubmit directly is doable but you lose a lot of the logic handled by the shell scripts - Calling the shell script directly is doable, but sort of ugly from an API point of view. I think it would be nice to have a small library that handles that for users. On top of that, this library could be used by Spark itself to replace a lot of the code in the current shell scripts, which have a lot of duplication. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1955) VertexRDD can incorrectly assume index sharing
[ https://issues.apache.org/jira/browse/SPARK-1955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur Dave resolved SPARK-1955. --- Resolution: Fixed Fix Version/s: 1.2.2 1.3.0 Assignee: Brennon York (was: Ankur Dave) Issue resolved by pull request 4705 https://github.com/apache/spark/pull/4705 VertexRDD can incorrectly assume index sharing -- Key: SPARK-1955 URL: https://issues.apache.org/jira/browse/SPARK-1955 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 0.9.0, 0.9.1, 1.0.0 Reporter: Ankur Dave Assignee: Brennon York Priority: Minor Fix For: 1.3.0, 1.2.2 Many VertexRDD operations (diff, leftJoin, innerJoin) can use a fast zip join if both operands are VertexRDDs sharing the same index (i.e., one operand is derived from the other). This check is implemented by matching on the operand type and using the fast join strategy if both are VertexRDDs. This is clearly fine when both do in fact share the same index. It is also fine when the two VertexRDDs have the same partitioner but different indexes, because each VertexPartition will detect the index mismatch and fall back to the slow but correct local join strategy. However, when they have different numbers of partitions or different partition functions, an exception or even silently incorrect results can occur. For example: {code} import org.apache.spark._ import org.apache.spark.graphx._ // Construct VertexRDDs with different numbers of partitions val a = VertexRDD(sc.parallelize(List((0L, 1), (1L, 2)), 1)) val b = VertexRDD(sc.parallelize(List((0L, 5)), 8)) // Try to join them. Appears to work... val c = a.innerJoin(b) { (vid, x, y) = x + y } // ... but then fails with java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of partitions c.collect // Construct VertexRDDs with different partition functions val a = VertexRDD(sc.parallelize(List((0L, 1), (1L, 2))).partitionBy(new HashPartitioner(2))) val bVerts = sc.parallelize(List((1L, 5))) val b = VertexRDD(bVerts.partitionBy(new RangePartitioner(2, bVerts))) // Try to join them. We expect (1L, 7). val c = a.innerJoin(b) { (vid, x, y) = x + y } // Silent failure: we get an empty set! c.collect {code} VertexRDD should check equality of partitioners before using the fast zip join. If the partitioners are different, the two datasets should be automatically co-partitioned. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
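For illustration, a sketch of the guard the description asks for; this is not the actual VertexRDD code, just the shape of the check that would decide between the fast zip join and automatic co-partitioning:
{code}
import org.apache.spark.rdd.RDD

// Illustrative check only: a zip join is safe when both sides have the same number of
// partitions and the same partitioner; otherwise one side should be repartitioned
// with the other's partitioner before joining.
def canZipJoin(a: RDD[_], b: RDD[_]): Boolean =
  a.partitions.length == b.partitions.length &&
    a.partitioner.isDefined &&
    a.partitioner == b.partitioner
{code}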
[jira] [Created] (SPARK-6017) Provide transparent secure communication channel on Yarn
Marcelo Vanzin created SPARK-6017: - Summary: Provide transparent secure communication channel on Yarn Key: SPARK-6017 URL: https://issues.apache.org/jira/browse/SPARK-6017 Project: Spark Issue Type: Umbrella Components: YARN Reporter: Marcelo Vanzin A quick description: Currently driver and executors communicate through an insecure channel, so anyone can listen on the network and see what's going on. That prevents Spark from adding some features securely (e.g. SPARK-5342, SPARK-5682) without resorting to using internal Hadoop APIs. Spark 1.3.0 will add SSL support, but properly configuring SSL is not a trivial task for operators, let alone users. In light of those, we should add a more transparent secure transport layer. I've written a short spec to identify the areas in Spark that need work to achieve this, and I'll attach the document to this issue shortly. Note I'm restricting things to Yarn currently, because as far as I know it's the only cluster manager that provides the needed security features to bootstrap the secure Spark transport. The design itself doesn't really rely on Yarn per se, just on a secure way to distribute the initial secret (which the Yarn/HDFS combo provides). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6016) Cannot read the parquet table after overwriting the existing table when spark.sql.parquet.cacheMetadata=true
[ https://issues.apache.org/jira/browse/SPARK-6016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-6016: Description: saveAsTable is fine and seems we have successfully deleted the old data and written the new data. However, when reading the newly created table, an error will be thrown. {code} Error in SQL statement: java.lang.RuntimeException: java.lang.RuntimeException: could not merge metadata: key org.apache.spark.sql.parquet.row.metadata has conflicting values: at parquet.hadoop.api.InitContext.getMergedKeyValueMetaData(InitContext.java:67) at parquet.hadoop.api.ReadSupport.init(ReadSupport.java:84) at org.apache.spark.sql.parquet.FilteringParquetRowInputFormat.getSplits(ParquetTableOperations.scala:469) at parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:245) at org.apache.spark.sql.parquet.ParquetRelation2$$anon$1.getPartitions(newParquet.scala:461) ... {code} If I set spark.sql.parquet.cacheMetadata to false, it's fine to query the data. Note: the newly created table needs to have more than one file to trigger the bug (if there is only a single file, we will not need to merge metadata). To reproduce it, try... {code} import org.apache.spark.sql.SaveMode import sqlContext._ sql(drop table if exists test) val df1 = sqlContext.jsonRDD(sc.parallelize((1 to 10).map(i = s{a:$i}), 2)) // we will save to 2 parquet files. df1.saveAsTable(test, parquet, SaveMode.Overwrite) sql(select * from test).collect.foreach(println) // Warm the FilteringParquetRowInputFormat.footerCache val df2 = sqlContext.jsonRDD(sc.parallelize((1 to 10).map(i = s{b:$i}), 4)) // we will save to 4 parquet files. df2.saveAsTable(test, parquet, SaveMode.Overwrite) sql(select * from test).collect.foreach(println) {code} was: saveAsTable is fine and seems we have successfully deleted the old data and written the new data. However, when reading the newly created table, an error will be thrown. {code} Error in SQL statement: java.lang.RuntimeException: java.lang.RuntimeException: could not merge metadata: key org.apache.spark.sql.parquet.row.metadata has conflicting values: at parquet.hadoop.api.InitContext.getMergedKeyValueMetaData(InitContext.java:67) at parquet.hadoop.api.ReadSupport.init(ReadSupport.java:84) at org.apache.spark.sql.parquet.FilteringParquetRowInputFormat.getSplits(ParquetTableOperations.scala:469) at parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:245) at org.apache.spark.sql.parquet.ParquetRelation2$$anon$1.getPartitions(newParquet.scala:461) ... {code} If I set spark.sql.parquet.cacheMetadata to false, it's fine to query the data. Note: the newly created table needs to have more than one file to trigger the bug (if there is only a single file, we will not need to merge metadata). Cannot read the parquet table after overwriting the existing table when spark.sql.parquet.cacheMetadata=true Key: SPARK-6016 URL: https://issues.apache.org/jira/browse/SPARK-6016 Project: Spark Issue Type: Bug Components: SQL Reporter: Yin Huai Priority: Blocker saveAsTable is fine and seems we have successfully deleted the old data and written the new data. However, when reading the newly created table, an error will be thrown. 
{code} Error in SQL statement: java.lang.RuntimeException: java.lang.RuntimeException: could not merge metadata: key org.apache.spark.sql.parquet.row.metadata has conflicting values: at parquet.hadoop.api.InitContext.getMergedKeyValueMetaData(InitContext.java:67) at parquet.hadoop.api.ReadSupport.init(ReadSupport.java:84) at org.apache.spark.sql.parquet.FilteringParquetRowInputFormat.getSplits(ParquetTableOperations.scala:469) at parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:245) at org.apache.spark.sql.parquet.ParquetRelation2$$anon$1.getPartitions(newParquet.scala:461) ... {code} If I set spark.sql.parquet.cacheMetadata to false, it's fine to query the data. Note: the newly created table needs to have more than one file to trigger the bug (if there is only a single file, we will not need to merge metadata). To reproduce it, try... {code} import org.apache.spark.sql.SaveMode import sqlContext._ sql(drop table if exists test) val df1 = sqlContext.jsonRDD(sc.parallelize((1 to 10).map(i = s{a:$i}), 2)) // we will save to 2 parquet files. df1.saveAsTable(test, parquet, SaveMode.Overwrite) sql(select * from test).collect.foreach(println) // Warm the FilteringParquetRowInputFormat.footerCache val df2 = sqlContext.jsonRDD(sc.parallelize((1 to 10).map(i = s{b:$i}), 4)) // we will save
[jira] [Updated] (SPARK-6017) Provide transparent secure communication channel on Yarn
[ https://issues.apache.org/jira/browse/SPARK-6017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-6017: -- Attachment: secure_spark_on_yarn.pdf First draft of problem statement and proposed solutions. Once there's agreement on these I'll create sub-tasks to do the individual work pieces. Provide transparent secure communication channel on Yarn Key: SPARK-6017 URL: https://issues.apache.org/jira/browse/SPARK-6017 Project: Spark Issue Type: Umbrella Components: YARN Reporter: Marcelo Vanzin Attachments: secure_spark_on_yarn.pdf A quick description: Currently driver and executors communicate through an insecure channel, so anyone can listen on the network and see what's going on. That prevents Spark from adding some features securely (e.g. SPARK-5342, SPARK-5682) without resorting to using internal Hadoop APIs. Spark 1.3.0 will add SSL support, but properly configuring SSL is not a trivial task for operators, let alone users. In light of those, we should add a more transparent secure transport layer. I've written a short spec to identify the areas in Spark that need work to achieve this, and I'll attach the document to this issue shortly. Note I'm restricting things to Yarn currently, because as far as I know it's the only cluster manager that provides the needed security features to bootstrap the secure Spark transport. The design itself doesn't really rely on Yarn per se, just on a secure way to distribute the initial secret (which the Yarn/HDFS combo provides). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6016) Cannot read the parquet table after overwriting the existing table when spark.sql.parquet.cacheMetadata=true
[ https://issues.apache.org/jira/browse/SPARK-6016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-6016: Description: saveAsTable is fine and seems we have successfully deleted the old data and written the new data. However, when reading the newly created table, an error will be thrown. {code} Error in SQL statement: java.lang.RuntimeException: java.lang.RuntimeException: could not merge metadata: key org.apache.spark.sql.parquet.row.metadata has conflicting values: at parquet.hadoop.api.InitContext.getMergedKeyValueMetaData(InitContext.java:67) at parquet.hadoop.api.ReadSupport.init(ReadSupport.java:84) at org.apache.spark.sql.parquet.FilteringParquetRowInputFormat.getSplits(ParquetTableOperations.scala:469) at parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:245) at org.apache.spark.sql.parquet.ParquetRelation2$$anon$1.getPartitions(newParquet.scala:461) ... {code} If I set spark.sql.parquet.cacheMetadata to false, it's fine to query the data. Note: the newly created table needs to have more than one file to trigger the bug (if there is only a single file, we will not need to merge metadata). To reproduce it, try... {code} import org.apache.spark.sql.SaveMode import sqlContext._ sql(drop table if exists test) val df1 = sqlContext.jsonRDD(sc.parallelize((1 to 10).map(i = s{a:$i}), 2)) // we will save to 2 parquet files. df1.saveAsTable(test, parquet, SaveMode.Overwrite) sql(select * from test).collect.foreach(println) // Warm the FilteringParquetRowInputFormat.footerCache val df2 = sqlContext.jsonRDD(sc.parallelize((1 to 10).map(i = s{b:$i}), 4)) // we will save to 4 parquet files. df2.saveAsTable(test, parquet, SaveMode.Overwrite) sql(select * from test).collect.foreach(println) {code} For this example, we have two outdated footers for df1 in footerCache and since we have four parquet files for the new test table, we picked up 2 new footers for df2. Then, we hit the bug. was: saveAsTable is fine and seems we have successfully deleted the old data and written the new data. However, when reading the newly created table, an error will be thrown. {code} Error in SQL statement: java.lang.RuntimeException: java.lang.RuntimeException: could not merge metadata: key org.apache.spark.sql.parquet.row.metadata has conflicting values: at parquet.hadoop.api.InitContext.getMergedKeyValueMetaData(InitContext.java:67) at parquet.hadoop.api.ReadSupport.init(ReadSupport.java:84) at org.apache.spark.sql.parquet.FilteringParquetRowInputFormat.getSplits(ParquetTableOperations.scala:469) at parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:245) at org.apache.spark.sql.parquet.ParquetRelation2$$anon$1.getPartitions(newParquet.scala:461) ... {code} If I set spark.sql.parquet.cacheMetadata to false, it's fine to query the data. Note: the newly created table needs to have more than one file to trigger the bug (if there is only a single file, we will not need to merge metadata). To reproduce it, try... {code} import org.apache.spark.sql.SaveMode import sqlContext._ sql(drop table if exists test) val df1 = sqlContext.jsonRDD(sc.parallelize((1 to 10).map(i = s{a:$i}), 2)) // we will save to 2 parquet files. df1.saveAsTable(test, parquet, SaveMode.Overwrite) sql(select * from test).collect.foreach(println) // Warm the FilteringParquetRowInputFormat.footerCache val df2 = sqlContext.jsonRDD(sc.parallelize((1 to 10).map(i = s{b:$i}), 4)) // we will save to 4 parquet files. 
df2.saveAsTable(test, parquet, SaveMode.Overwrite) sql(select * from test).collect.foreach(println) {code} Cannot read the parquet table after overwriting the existing table when spark.sql.parquet.cacheMetadata=true Key: SPARK-6016 URL: https://issues.apache.org/jira/browse/SPARK-6016 Project: Spark Issue Type: Bug Components: SQL Reporter: Yin Huai Priority: Blocker saveAsTable is fine and seems we have successfully deleted the old data and written the new data. However, when reading the newly created table, an error will be thrown. {code} Error in SQL statement: java.lang.RuntimeException: java.lang.RuntimeException: could not merge metadata: key org.apache.spark.sql.parquet.row.metadata has conflicting values: at parquet.hadoop.api.InitContext.getMergedKeyValueMetaData(InitContext.java:67) at parquet.hadoop.api.ReadSupport.init(ReadSupport.java:84) at org.apache.spark.sql.parquet.FilteringParquetRowInputFormat.getSplits(ParquetTableOperations.scala:469) at parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:245) at
[jira] [Updated] (SPARK-6018) NoSuchMethodError in Spark app is swallowed by YARN AM
[ https://issues.apache.org/jira/browse/SPARK-6018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheolsoo Park updated SPARK-6018: - Description: I discovered this bug while testing 1.3 RC with old 1.2 Spark job that I had. Due to changes in DF and SchemaRDD, my app failed with {{java.lang.NoSuchMethodError}}. However, AM was marked as succeeded, and the error was silently swallowed. The problem is that pattern matching in Spark AM fails to catch NoSuchMethodError- {code} 15/02/25 20:13:27 INFO cluster.YarnClusterScheduler: YarnClusterScheduler.postStartHook done Exception in thread Driver scala.MatchError: java.lang.NoSuchMethodError: org.apache.spark.sql.hive.HiveContext.table(Ljava/lang/String;)Lorg/apache/spark/sql/SchemaRDD; (of class java.lang.NoSuchMethodError) at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:485) {code} was: I discovered while testing 1.3 RC with old 1.2 Spark job that I had. Due to changes in DF and SchemaRDD, my app failed with {{java.lang.NoSuchMethodError}}. However, AM was marked as succeeded, and the error was silently swallowed. The problem is that pattern matching in Spark AM fails to catch NoSuchMethodError- {code} 15/02/25 20:13:27 INFO cluster.YarnClusterScheduler: YarnClusterScheduler.postStartHook done Exception in thread Driver scala.MatchError: java.lang.NoSuchMethodError: org.apache.spark.sql.hive.HiveContext.table(Ljava/lang/String;)Lorg/apache/spark/sql/SchemaRDD; (of class java.lang.NoSuchMethodError) at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:485) {code} NoSuchMethodError in Spark app is swallowed by YARN AM -- Key: SPARK-6018 URL: https://issues.apache.org/jira/browse/SPARK-6018 Project: Spark Issue Type: Bug Components: YARN Reporter: Cheolsoo Park Priority: Minor Labels: yarn I discovered this bug while testing 1.3 RC with old 1.2 Spark job that I had. Due to changes in DF and SchemaRDD, my app failed with {{java.lang.NoSuchMethodError}}. However, AM was marked as succeeded, and the error was silently swallowed. The problem is that pattern matching in Spark AM fails to catch NoSuchMethodError- {code} 15/02/25 20:13:27 INFO cluster.YarnClusterScheduler: YarnClusterScheduler.postStartHook done Exception in thread Driver scala.MatchError: java.lang.NoSuchMethodError: org.apache.spark.sql.hive.HiveContext.table(Ljava/lang/String;)Lorg/apache/spark/sql/SchemaRDD; (of class java.lang.NoSuchMethodError) at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:485) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
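For illustration, a self-contained sketch of the pattern-matching gap; this is not the actual ApplicationMaster code. {{NoSuchMethodError}} extends {{java.lang.Error}} rather than {{Exception}}, and it is also excluded by {{scala.util.control.NonFatal}}, so a handler needs an explicit {{Throwable}} case to report it as a failure:
{code}
// Runnable sketch with stand-in helpers (runUserClass / reportFailure are hypothetical).
def runUserClass(): Unit = throw new NoSuchMethodError("demo")        // stand-in for the user app
def reportFailure(t: Throwable): Unit = println(s"marking AM failed: $t")

try {
  runUserClass()
} catch {
  case e: java.lang.reflect.InvocationTargetException => reportFailure(e.getCause)
  case e: Throwable => reportFailure(e) // also covers Errors such as NoSuchMethodError
}
{code}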
[jira] [Resolved] (SPARK-2770) Rename spark-ganglia-lgpl to ganglia-lgpl
[ https://issues.apache.org/jira/browse/SPARK-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-2770. -- Resolution: Won't Fix Rename spark-ganglia-lgpl to ganglia-lgpl - Key: SPARK-2770 URL: https://issues.apache.org/jira/browse/SPARK-2770 Project: Spark Issue Type: Improvement Components: Build Reporter: Chris Fregly Assignee: Chris Fregly Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6022) GraphX `diff` test incorrectly operating on values (not VertexId's)
[ https://issues.apache.org/jira/browse/SPARK-6022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14337513#comment-14337513 ] Brennon York commented on SPARK-6022: - FWIW I have this fix put in place under my branch for [SPARK-4600|https://issues.apache.org/jira/browse/SPARK-4600], but wanted to point it out here as I'm going to need to remove the original test. This is all assuming I'm not completely insane and missing something here. cc [~ankurd] GraphX `diff` test incorrectly operating on values (not VertexId's) --- Key: SPARK-6022 URL: https://issues.apache.org/jira/browse/SPARK-6022 Project: Spark Issue Type: Bug Components: GraphX Reporter: Brennon York The current GraphX {{diff}} test operates on values rather than the VertexId's and, if {{diff}} were working properly (per [SPARK-4600|https://issues.apache.org/jira/browse/SPARK-4600]), it should fail this test. The code to test {{diff}} should look like the below as it correctly generates {{VertexRDD}}'s with different {{VertexId}}'s to {{diff}} against. {code} test("diff functionality with small concrete values") { withSpark { sc => val setA: VertexRDD[Int] = VertexRDD(sc.parallelize(0L until 2L).map(id => (id, id.toInt))) // setA := Set((0L, 0), (1L, 1)) val setB: VertexRDD[Int] = VertexRDD(sc.parallelize(1L until 3L).map(id => (id, id.toInt+2))) // setB := Set((1L, 3), (2L, 4)) val diff = setA.diff(setB) assert(diff.collect.toSet == Set((2L, 4))) } } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3901) Add SocketSink capability for Spark metrics
[ https://issues.apache.org/jira/browse/SPARK-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14337512#comment-14337512 ] Sean Owen commented on SPARK-3901: -- It sounds like this can't proceed, blocked on features from Metrics upstream? is it a question of a newer version? Add SocketSink capability for Spark metrics --- Key: SPARK-3901 URL: https://issues.apache.org/jira/browse/SPARK-3901 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.0.0, 1.1.0 Reporter: Sreepathi Prasanna Priority: Minor Original Estimate: 48h Remaining Estimate: 48h Spark depends on Coda hale metrics library to collect metrics. Today we can send metrics to console, csv and jmx. We use chukwa as a monitoring framework to monitor the hadoop services. To extend the the framework to collect spark metrics, we need additional socketsink capability which is not there at the moment in Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-725) Ran out of disk space on EC2 master due to Ganglia logs
[ https://issues.apache.org/jira/browse/SPARK-725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-725. - Resolution: Not a Problem Sounds like the best guess was that this wasn't a Spark issue. Ran out of disk space on EC2 master due to Ganglia logs --- Key: SPARK-725 URL: https://issues.apache.org/jira/browse/SPARK-725 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 0.7.0 Reporter: Josh Rosen This morning, I started a Spark Standalone cluster on EC2 using 50 m1.medium instances. When I tried to rebuild Spark ~5.5 hours later, the build failed because the master ran out of disk space. It looks like {{/var/lib/ganglia/rrds/spark}} grew to 4.2 gigabytes, using over half of the AMI's EBS disk space. Is there a default setting that we can change to place a harder limit on the total amount of space used by Ganglia to prevent this from happening? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5974) Add save/load to examples in ML guide
[ https://issues.apache.org/jira/browse/SPARK-5974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-5974. -- Resolution: Fixed Fix Version/s: 1.3.0 Add save/load to examples in ML guide - Key: SPARK-5974 URL: https://issues.apache.org/jira/browse/SPARK-5974 Project: Spark Issue Type: Documentation Components: Documentation, MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Minor Fix For: 1.3.0 We should add save/load (model import/export) to the Scala and Java code examples in the ML guide. This is not yet supported in Python. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1182) Sort the configuration parameters in configuration.md
[ https://issues.apache.org/jira/browse/SPARK-1182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14337555#comment-14337555 ] Reynold Xin commented on SPARK-1182: Merged. Thanks for doing this, [~boyork]. Couple minor things that would be great to be addressed as a followup PR: 1. Link to the Mesos configuration table rather than just the Mesos page. 2. If possible, make Mesos/YARN/Standalone appear in the table of contents on top. Sort the configuration parameters in configuration.md - Key: SPARK-1182 URL: https://issues.apache.org/jira/browse/SPARK-1182 Project: Spark Issue Type: Task Components: Documentation Reporter: Reynold Xin Assignee: prashant Priority: Minor It is a little bit confusing right now since the config options are all over the place in some arbitrarily sorted order. https://github.com/apache/spark/blob/master/docs/configuration.md -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-786) Clean up old work directories in standalone worker
[ https://issues.apache.org/jira/browse/SPARK-786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-786. - Resolution: Duplicate Clean up old work directories in standalone worker -- Key: SPARK-786 URL: https://issues.apache.org/jira/browse/SPARK-786 Project: Spark Issue Type: New Feature Components: Deploy Affects Versions: 0.7.2 Reporter: Matei Zaharia We should add a setting to clean old work directories after X days. Otherwise, the directory gets filled forever with shuffle files and logs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5124) Standardize internal RPC interface
[ https://issues.apache.org/jira/browse/SPARK-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14337733#comment-14337733 ] Shixiong Zhu commented on SPARK-5124: - [~rxin] could you clarify how to reply to the message in `receiveAndReply`? Use the return value, or `message.reply()`? Standardize internal RPC interface -- Key: SPARK-5124 URL: https://issues.apache.org/jira/browse/SPARK-5124 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Reynold Xin Assignee: Shixiong Zhu Attachments: Pluggable RPC - draft 1.pdf, Pluggable RPC - draft 2.pdf In Spark we use Akka as the RPC layer. It would be great if we can standardize the internal RPC interface to facilitate testing. This will also provide the foundation to try other RPC implementations in the future. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6025) Helper method for GradientBoostedTrees to compute validation error
Joseph K. Bradley created SPARK-6025: Summary: Helper method for GradientBoostedTrees to compute validation error Key: SPARK-6025 URL: https://issues.apache.org/jira/browse/SPARK-6025 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Priority: Minor Create a helper method for computing the error at each iteration of boosting. This should be used post-hoc to compute the error efficiently on a new dataset. E.g.: {code} def evaluateEachIteration(data: RDD[LabeledPoint], evaluator): Array[Double] {code} Notes: * It should run in the same big-O time as predict() by keeping a running total (residual). * A different method name could be good. * It could take an evaluator and/or could evaluate using the training metric by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
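A rough sketch of how such a helper could keep the same big-O cost as predict(), by carrying a running prediction per example and adding one tree per iteration. The signature, the loss parameter, and the access to {{model.trees}} / {{model.treeWeights}} are illustrative assumptions, not the proposed API:
{code}
import org.apache.spark.SparkContext._
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.model.GradientBoostedTreesModel
import org.apache.spark.rdd.RDD

// Error after each boosting iteration. Re-using the running prediction keeps the
// whole pass at O(numTrees * |data|) instead of O(numTrees^2 * |data|).
def evaluateEachIteration(
    model: GradientBoostedTreesModel,
    data: RDD[LabeledPoint],
    loss: (Double, Double) => Double): Array[Double] = {
  var running: RDD[(LabeledPoint, Double)] = data.map(lp => (lp, 0.0))
  model.trees.zip(model.treeWeights).map { case (tree, weight) =>
    // Add this tree's weighted prediction to the running total, then score the ensemble so far.
    running = running.map { case (lp, pred) => (lp, pred + weight * tree.predict(lp.features)) }
    running.map { case (lp, pred) => loss(lp.label, pred) }.mean()
  }
}
{code}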
[jira] [Commented] (SPARK-6004) Pick the best model when training GradientBoostedTrees with validation
[ https://issues.apache.org/jira/browse/SPARK-6004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14337793#comment-14337793 ] Joseph K. Bradley commented on SPARK-6004: -- If validation is not being done, we should return the full model. It would be strange if users were unable to get the full model. Pick the best model when training GradientBoostedTrees with validation -- Key: SPARK-6004 URL: https://issues.apache.org/jira/browse/SPARK-6004 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Liang-Chi Hsieh Priority: Minor Since the validation error does not change monotonically, in practice it would be better to pick the best model when training GradientBoostedTrees with validation, rather than stopping early. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6027) Make KafkaUtils work in Python with kafka-assembly provided as --jar
Tathagata Das created SPARK-6027: Summary: Make KafkaUtils work in Python with kafka-assembly provided as --jar Key: SPARK-6027 URL: https://issues.apache.org/jira/browse/SPARK-6027 Project: Spark Issue Type: Bug Components: PySpark, Streaming Reporter: Tathagata Das Priority: Critical -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5124) Standardize internal RPC interface
[ https://issues.apache.org/jira/browse/SPARK-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14337724#comment-14337724 ] Shixiong Zhu commented on SPARK-5124: - [~vanzin], thanks for the suggestions. I agree with most of them. Just a small comment on the following point: {quote} The default Endpoint has no thread-safety guarantees. You can wrap an Endpoint in an EventLoop if you want messages to be handled using a queue, or synchronize your receive() method (although that can block the dispatcher thread, which could be bad). But this would easily allow actors to process multiple messages concurrently if desired. {quote} Every Endpoint with an EventLoop needs its own exclusive thread, so this would significantly increase the number of threads and incur extra context-switch overhead. However, I think we can have a global Dispatcher for the Endpoints that need thread-safety guarantees: an Endpoint registers itself with the Dispatcher, and the Dispatcher delivers messages to these Endpoints while guaranteeing thread safety. Such a Dispatcher only needs a few threads and queues for dispatching messages. Standardize internal RPC interface -- Key: SPARK-5124 URL: https://issues.apache.org/jira/browse/SPARK-5124 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Reynold Xin Assignee: Shixiong Zhu Attachments: Pluggable RPC - draft 1.pdf, Pluggable RPC - draft 2.pdf In Spark we use Akka as the RPC layer. It would be great if we can standardize the internal RPC interface to facilitate testing. This will also provide the foundation to try other RPC implementations in the future. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
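A minimal sketch of the shared-Dispatcher idea in the comment above (the {{Endpoint}}, {{Dispatcher}} and {{Inbox}} names are made up; this is not Spark's actual RPC code). Each endpoint gets its own inbox, and a small thread pool drains one inbox at a time, so {{receive()}} is never invoked concurrently for the same endpoint:
{code}
import java.util.concurrent.{ConcurrentHashMap, ConcurrentLinkedQueue, Executors}
import java.util.concurrent.atomic.AtomicBoolean

trait Endpoint { def receive(message: Any): Unit }

class Dispatcher(numThreads: Int) {
  private class Inbox(val endpoint: Endpoint) {
    val messages = new ConcurrentLinkedQueue[Any]()
    val active = new AtomicBoolean(false) // true while a pool thread is draining this inbox
  }

  private val inboxes = new ConcurrentHashMap[String, Inbox]()
  private val pool = Executors.newFixedThreadPool(numThreads)

  def register(name: String, endpoint: Endpoint): Unit = {
    inboxes.put(name, new Inbox(endpoint))
  }

  def post(name: String, message: Any): Unit = {
    val inbox = inboxes.get(name)
    inbox.messages.add(message)
    schedule(inbox)
  }

  // Hand an inbox to the pool only if nobody is already draining it.
  private def schedule(inbox: Inbox): Unit = {
    if (!inbox.messages.isEmpty && inbox.active.compareAndSet(false, true)) {
      pool.execute(new Runnable {
        override def run(): Unit = {
          try {
            var m = inbox.messages.poll()
            while (m != null) {
              inbox.endpoint.receive(m)
              m = inbox.messages.poll()
            }
          } finally {
            inbox.active.set(false)
          }
          schedule(inbox) // pick up anything that arrived after the last poll
        }
      })
    }
  }
}
{code}
This bounds the thread count by the pool size rather than by the number of endpoints, at the cost of head-of-line blocking within each inbox.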
[jira] [Commented] (SPARK-5124) Standardize internal RPC interface
[ https://issues.apache.org/jira/browse/SPARK-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14337745#comment-14337745 ] Aaron Davidson commented on SPARK-5124: --- [~zsxwing] For receiveAndReply, I think the intention is message.reply(), so it can be performed in a separate thread. My comment on failure would be so that exceptions can be relayed during sendWithReply() (this is natural anyway if sendWithReply returns a Future, it can just be completed with the exception). Standardize internal RPC interface -- Key: SPARK-5124 URL: https://issues.apache.org/jira/browse/SPARK-5124 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Reynold Xin Assignee: Shixiong Zhu Attachments: Pluggable RPC - draft 1.pdf, Pluggable RPC - draft 2.pdf In Spark we use Akka as the RPC layer. It would be great if we can standardize the internal RPC interface to facilitate testing. This will also provide the foundation to try other RPC implementations in the future. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
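A small sketch of the {{message.reply()}} style described here, with failures relayed through the Future returned by {{sendWithReply}}; the trait and method names are illustrative, not the final Spark RPC API:
{code}
import scala.concurrent.{Future, Promise}
import scala.util.control.NonFatal

// The context handed to receiveAndReply; replying (or failing) completes the
// Future the caller got back from sendWithReply, possibly from another thread.
trait RpcCallContext {
  def reply(response: Any): Unit
  def sendFailure(e: Throwable): Unit
}

trait ReplyingEndpoint {
  def receiveAndReply(message: Any, context: RpcCallContext): Unit
}

// Caller side: deliver the message and return a Future the endpoint will complete.
def sendWithReply(endpoint: ReplyingEndpoint, message: Any): Future[Any] = {
  val promise = Promise[Any]()
  val context = new RpcCallContext {
    def reply(response: Any): Unit = promise.trySuccess(response)
    def sendFailure(e: Throwable): Unit = promise.tryFailure(e)
  }
  try {
    endpoint.receiveAndReply(message, context)
  } catch {
    case NonFatal(e) => promise.tryFailure(e) // exceptions are relayed to the caller
  }
  promise.future
}
{code}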
[jira] [Commented] (SPARK-6004) Pick the best model when training GradientBoostedTrees with validation
[ https://issues.apache.org/jira/browse/SPARK-6004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14337762#comment-14337762 ] Joseph K. Bradley commented on SPARK-6004: -- I'm not too worried about stopping early when users call runWithValidation(). If they know enough to use a validation set, then I think it's reasonable to expect them to set validationTol according to what they need. This was discussed a little in [SPARK-5972], where the decision was to provide a helper method for users to do validation post-hoc: [SPARK-6025]. I feel like the current default behavior is good since it chooses efficiency, while still leaving the option to do more expensive training with potentially higher accuracy. Pick the best model when training GradientBoostedTrees with validation -- Key: SPARK-6004 URL: https://issues.apache.org/jira/browse/SPARK-6004 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Liang-Chi Hsieh Priority: Minor Since the validation error does not change monotonically, in practice it would be better to pick the best model when training GradientBoostedTrees with validation, rather than stopping early. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
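For reference, the early-stopping path under discussion looks roughly like this with the 1.3 API (the data path, split ratios, and parameter values are illustrative; {{sc}} is an existing SparkContext):
{code}
import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy
import org.apache.spark.mllib.util.MLUtils

val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
val Array(train, validation) = data.randomSplit(Array(0.8, 0.2))

val boostingStrategy = BoostingStrategy.defaultParams("Regression")
boostingStrategy.numIterations = 100
boostingStrategy.validationTol = 1e-5 // stop once validation error stops improving by more than this

// Stops early based on the validation set; run(train) alone would build all 100 trees.
val model = new GradientBoostedTrees(boostingStrategy).runWithValidation(train, validation)
{code}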
[jira] [Updated] (SPARK-6020) Flaky test: o.a.s.sql.columnar.PartitionBatchPruningSuite
[ https://issues.apache.org/jira/browse/SPARK-6020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-6020: - Assignee: Cheng Lian Flaky test: o.a.s.sql.columnar.PartitionBatchPruningSuite - Key: SPARK-6020 URL: https://issues.apache.org/jira/browse/SPARK-6020 Project: Spark Issue Type: Bug Components: SQL, Tests Affects Versions: 1.3.0 Reporter: Andrew Or Assignee: Cheng Lian Priority: Critical Observed in the following builds, only one of which has something to do with SQL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27931/ https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27930/ https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27929/ org.apache.spark.sql.columnar.PartitionBatchPruningSuite.SELECT key FROM pruningData WHERE NOT (key IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30)) {code} Error Message 8 did not equal 10 Wrong number of read batches: == Parsed Logical Plan == 'Project ['key] 'Filter NOT 'key IN (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30) 'UnresolvedRelation [pruningData], None == Analyzed Logical Plan == Project [key#5245] Filter NOT key#5245 IN (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30) LogicalRDD [key#5245,value#5246], MapPartitionsRDD[3202] at mapPartitions at ExistingRDD.scala:35 == Optimized Logical Plan == Project [key#5245] Filter NOT key#5245 INSET (5,10,24,25,14,20,29,1,6,28,21,9,13,2,17,22,27,12,7,3,18,16,11,26,23,8,30,19,4,15) InMemoryRelation [key#5245,value#5246], true, 10, StorageLevel(true, true, false, true, 1), (PhysicalRDD [key#5245,value#5246], MapPartitionsRDD[3202] at mapPartitions at ExistingRDD.scala:35), Some(pruningData) == Physical Plan == Filter NOT key#5245 INSET (5,10,24,25,14,20,29,1,6,28,21,9,13,2,17,22,27,12,7,3,18,16,11,26,23,8,30,19,4,15) InMemoryColumnarTableScan [key#5245], [NOT key#5245 INSET (5,10,24,25,14,20,29,1,6,28,21,9,13,2,17,22,27,12,7,3,18,16,11,26,23,8,30,19,4,15)], (InMemoryRelation [key#5245,value#5246], true, 10, StorageLevel(true, true, false, true, 1), (PhysicalRDD [key#5245,value#5246], MapPartitionsRDD[3202] at mapPartitions at ExistingRDD.scala:35), Some(pruningData)) Code Generation: false == RDD == Stacktrace sbt.ForkMain$ForkError: 8 did not equal 10 Wrong number of read batches: == Parsed Logical Plan == 'Project ['key] 'Filter NOT 'key IN (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30) 'UnresolvedRelation [pruningData], None == Analyzed Logical Plan == Project [key#5245] Filter NOT key#5245 IN (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30) LogicalRDD [key#5245,value#5246], MapPartitionsRDD[3202] at mapPartitions at ExistingRDD.scala:35 == Optimized Logical Plan == Project [key#5245] Filter NOT key#5245 INSET (5,10,24,25,14,20,29,1,6,28,21,9,13,2,17,22,27,12,7,3,18,16,11,26,23,8,30,19,4,15) InMemoryRelation [key#5245,value#5246], true, 10, StorageLevel(true, true, false, true, 1), (PhysicalRDD [key#5245,value#5246], MapPartitionsRDD[3202] at mapPartitions at ExistingRDD.scala:35), Some(pruningData) == Physical Plan == Filter NOT key#5245 INSET (5,10,24,25,14,20,29,1,6,28,21,9,13,2,17,22,27,12,7,3,18,16,11,26,23,8,30,19,4,15) InMemoryColumnarTableScan [key#5245], [NOT key#5245 INSET (5,10,24,25,14,20,29,1,6,28,21,9,13,2,17,22,27,12,7,3,18,16,11,26,23,8,30,19,4,15)], (InMemoryRelation 
[key#5245,value#5246], true, 10, StorageLevel(true, true, false, true, 1), (PhysicalRDD [key#5245,value#5246], MapPartitionsRDD[3202] at mapPartitions at ExistingRDD.scala:35), Some(pruningData)) Code Generation: false == RDD == at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500) at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555) at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466) at org.apache.spark.sql.columnar.PartitionBatchPruningSuite$$anonfun$checkBatchPruning$1.apply$mcV$sp(PartitionBatchPruningSuite.scala:119) at org.apache.spark.sql.columnar.PartitionBatchPruningSuite$$anonfun$checkBatchPruning$1.apply(PartitionBatchPruningSuite.scala:107) at org.apache.spark.sql.columnar.PartitionBatchPruningSuite$$anonfun$checkBatchPruning$1.apply(PartitionBatchPruningSuite.scala:107) at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at
[jira] [Commented] (SPARK-5981) pyspark ML models should support predict/transform on vector within map
[ https://issues.apache.org/jira/browse/SPARK-5981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14337805#comment-14337805 ] Manoj Kumar commented on SPARK-5981: [~josephkb] Hi, Can I work on this? pyspark ML models should support predict/transform on vector within map --- Key: SPARK-5981 URL: https://issues.apache.org/jira/browse/SPARK-5981 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Currently, most Python models only have limited support for single-vector prediction. E.g., one can call {code}model.predict(myFeatureVector){code} for a single instance, but that fails within a map for Python ML models and transformers which use JavaModelWrapper: {code} data.map(lambda features: model.predict(features)) {code} This fails because JavaModelWrapper.call uses the SparkContext (within the transformation). (It works for linear models, which do prediction within Python.) Supporting prediction within a map would require storing the model and doing prediction/transformation within Python. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5185) pyspark --jars does not add classes to driver class path
[ https://issues.apache.org/jira/browse/SPARK-5185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14337840#comment-14337840 ] Tathagata Das commented on SPARK-5185: -- I also encountered this for KafkaUtils in Python. I am doing the said workaround. But we should fix this for the general case. pyspark --jars does not add classes to driver class path Key: SPARK-5185 URL: https://issues.apache.org/jira/browse/SPARK-5185 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.2.0 Reporter: Uri Laserson Assignee: Andrew Or I have some random class I want access to from an Spark shell, say {{com.cloudera.science.throwaway.ThrowAway}}. You can find the specific example I used here: https://gist.github.com/laserson/e9e3bd265e1c7a896652 I packaged it as {{throwaway.jar}}. If I then run {{bin/spark-shell}} like so: {code} bin/spark-shell --master local[1] --jars throwaway.jar {code} I can execute {code} val a = new com.cloudera.science.throwaway.ThrowAway() {code} Successfully. I now run PySpark like so: {code} PYSPARK_DRIVER_PYTHON=ipython bin/pyspark --master local[1] --jars throwaway.jar {code} which gives me an error when I try to instantiate the class through Py4J: {code} In [1]: sc._jvm.com.cloudera.science.throwaway.ThrowAway() --- Py4JError Traceback (most recent call last) ipython-input-1-4eedbe023c29 in module() 1 sc._jvm.com.cloudera.science.throwaway.ThrowAway() /Users/laserson/repos/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in __getattr__(self, name) 724 def __getattr__(self, name): 725 if name == '__call__': -- 726 raise Py4JError('Trying to call a package.') 727 new_fqn = self._fqn + '.' + name 728 command = REFLECTION_COMMAND_NAME +\ Py4JError: Trying to call a package. {code} However, if I explicitly add the {{--driver-class-path}} to add the same jar {code} PYSPARK_DRIVER_PYTHON=ipython bin/pyspark --master local[1] --jars throwaway.jar --driver-class-path throwaway.jar {code} it works {code} In [1]: sc._jvm.com.cloudera.science.throwaway.ThrowAway() Out[1]: JavaObject id=o18 {code} However, the docs state that {{--jars}} should also set the driver class path. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5185) pyspark --jars does not add classes to driver class path
[ https://issues.apache.org/jira/browse/SPARK-5185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-5185: - Assignee: Andrew Or (was: Burak Yavuz) pyspark --jars does not add classes to driver class path Key: SPARK-5185 URL: https://issues.apache.org/jira/browse/SPARK-5185 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.2.0 Reporter: Uri Laserson Assignee: Andrew Or I have some random class I want access to from an Spark shell, say {{com.cloudera.science.throwaway.ThrowAway}}. You can find the specific example I used here: https://gist.github.com/laserson/e9e3bd265e1c7a896652 I packaged it as {{throwaway.jar}}. If I then run {{bin/spark-shell}} like so: {code} bin/spark-shell --master local[1] --jars throwaway.jar {code} I can execute {code} val a = new com.cloudera.science.throwaway.ThrowAway() {code} Successfully. I now run PySpark like so: {code} PYSPARK_DRIVER_PYTHON=ipython bin/pyspark --master local[1] --jars throwaway.jar {code} which gives me an error when I try to instantiate the class through Py4J: {code} In [1]: sc._jvm.com.cloudera.science.throwaway.ThrowAway() --- Py4JError Traceback (most recent call last) ipython-input-1-4eedbe023c29 in module() 1 sc._jvm.com.cloudera.science.throwaway.ThrowAway() /Users/laserson/repos/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in __getattr__(self, name) 724 def __getattr__(self, name): 725 if name == '__call__': -- 726 raise Py4JError('Trying to call a package.') 727 new_fqn = self._fqn + '.' + name 728 command = REFLECTION_COMMAND_NAME +\ Py4JError: Trying to call a package. {code} However, if I explicitly add the {{--driver-class-path}} to add the same jar {code} PYSPARK_DRIVER_PYTHON=ipython bin/pyspark --master local[1] --jars throwaway.jar --driver-class-path throwaway.jar {code} it works {code} In [1]: sc._jvm.com.cloudera.science.throwaway.ThrowAway() Out[1]: JavaObject id=o18 {code} However, the docs state that {{--jars}} should also set the driver class path. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5185) pyspark --jars does not add classes to driver class path
[ https://issues.apache.org/jira/browse/SPARK-5185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-5185: - Assignee: Burak Yavuz pyspark --jars does not add classes to driver class path Key: SPARK-5185 URL: https://issues.apache.org/jira/browse/SPARK-5185 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.2.0 Reporter: Uri Laserson Assignee: Burak Yavuz I have some random class I want access to from an Spark shell, say {{com.cloudera.science.throwaway.ThrowAway}}. You can find the specific example I used here: https://gist.github.com/laserson/e9e3bd265e1c7a896652 I packaged it as {{throwaway.jar}}. If I then run {{bin/spark-shell}} like so: {code} bin/spark-shell --master local[1] --jars throwaway.jar {code} I can execute {code} val a = new com.cloudera.science.throwaway.ThrowAway() {code} Successfully. I now run PySpark like so: {code} PYSPARK_DRIVER_PYTHON=ipython bin/pyspark --master local[1] --jars throwaway.jar {code} which gives me an error when I try to instantiate the class through Py4J: {code} In [1]: sc._jvm.com.cloudera.science.throwaway.ThrowAway() --- Py4JError Traceback (most recent call last) ipython-input-1-4eedbe023c29 in module() 1 sc._jvm.com.cloudera.science.throwaway.ThrowAway() /Users/laserson/repos/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in __getattr__(self, name) 724 def __getattr__(self, name): 725 if name == '__call__': -- 726 raise Py4JError('Trying to call a package.') 727 new_fqn = self._fqn + '.' + name 728 command = REFLECTION_COMMAND_NAME +\ Py4JError: Trying to call a package. {code} However, if I explicitly add the {{--driver-class-path}} to add the same jar {code} PYSPARK_DRIVER_PYTHON=ipython bin/pyspark --master local[1] --jars throwaway.jar --driver-class-path throwaway.jar {code} it works {code} In [1]: sc._jvm.com.cloudera.science.throwaway.ThrowAway() Out[1]: JavaObject id=o18 {code} However, the docs state that {{--jars}} should also set the driver class path. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2168) History Server rendered page not suitable for load balancing
[ https://issues.apache.org/jira/browse/SPARK-2168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14337845#comment-14337845 ] Apache Spark commented on SPARK-2168: - User 'elyast' has created a pull request for this issue: https://github.com/apache/spark/pull/4778 History Server rendered page not suitable for load balancing --- Key: SPARK-2168 URL: https://issues.apache.org/jira/browse/SPARK-2168 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Lukasz Jastrzebski Priority: Minor Small issue but still. I run the history server through Marathon and balance it through haproxy. The problem is that the links generated by HistoryPage (links to completed applications) are absolute, e.g. {{<a href="http://some-server:port/history/...">completedApplicationName</a>}}, but instead they should be relative, e.g. {{<a href="/history/...">completedApplicationName</a>}}, so they can be load balanced. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
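To illustrate the point with Scala XML (illustrative only, not the actual HistoryPage code): a relative href is resolved against whatever host served the page, so it works behind a load balancer, while an absolute href pins the browser to one backend:
{code}
val appId = "app-20150226120000-0001" // example application id

// Pins the browser to a single backend host; breaks behind haproxy/Marathon.
val absolute = <a href={s"http://some-server:8080/history/$appId"}>{appId}</a>

// Resolved against whichever host served the page, so it load-balances cleanly.
val relative = <a href={s"/history/$appId"}>{appId}</a>
{code}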
[jira] [Created] (SPARK-6024) When a data source table has too many columns, cannot persist it in metastore.
Yin Huai created SPARK-6024: --- Summary: When a data source table has too many columns, cannot persist it in metastore. Key: SPARK-6024 URL: https://issues.apache.org/jira/browse/SPARK-6024 Project: Spark Issue Type: Bug Components: SQL Reporter: Yin Huai Priority: Blocker Because we are using table properties of a Hive metastore table to store the schema, when a schema is too wide, we cannot persist it in metastore. {code} 15/02/25 18:13:50 ERROR metastore.RetryingHMSHandler: Retrying HMSHandler after 1000 ms (attempt 1 of 1) with error: javax.jdo.JDODataStoreException: Put request failed : INSERT INTO TABLE_PARAMS (PARAM_VALUE,TBL_ID,PARAM_KEY) VALUES (?,?,?) at org.datanucleus.api.jdo.NucleusJDOHelper.getJDOExceptionForNucleusException(NucleusJDOHelper.java:451) at org.datanucleus.api.jdo.JDOPersistenceManager.jdoMakePersistent(JDOPersistenceManager.java:732) at org.datanucleus.api.jdo.JDOPersistenceManager.makePersistent(JDOPersistenceManager.java:752) at org.apache.hadoop.hive.metastore.ObjectStore.createTable(ObjectStore.java:719) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.hive.metastore.RawStoreProxy.invoke(RawStoreProxy.java:108) at com.sun.proxy.$Proxy15.createTable(Unknown Source) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.create_table_core(HiveMetaStore.java:1261) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.create_table_with_environment_context(HiveMetaStore.java:1294) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:105) at com.sun.proxy.$Proxy16.create_table_with_environment_context(Unknown Source) at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.createTable(HiveMetaStoreClient.java:558) at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.createTable(HiveMetaStoreClient.java:547) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:89) at com.sun.proxy.$Proxy17.createTable(Unknown Source) at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:613) at org.apache.spark.sql.hive.HiveMetastoreCatalog.createDataSourceTable(HiveMetastoreCatalog.scala:136) at org.apache.spark.sql.hive.execution.CreateMetastoreDataSourceAsSelect.run(commands.scala:243) at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:55) at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:55) at org.apache.spark.sql.execution.ExecutedCommand.execute(commands.scala:65) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:1092) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:1092) at 
org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:1013) at org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:963) at org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:929) at org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:907) at $line39.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:25) at $line39.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:30) at $line39.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:32) at $line39.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:34) at $line39.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:36) at $line39.$read$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:38) at $line39.$read$$iwC$$iwC$$iwC$$iwC.init(console:40) at $line39.$read$$iwC$$iwC$$iwC.init(console:42) at $line39.$read$$iwC$$iwC.init(console:44) at $line39.$read$$iwC.init(console:46) at $line39.$read.init(console:48) at
[jira] [Commented] (SPARK-6024) When a data source table has too many columns, its schema cannot be stored in metastore.
[ https://issues.apache.org/jira/browse/SPARK-6024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14337722#comment-14337722 ] Yin Huai commented on SPARK-6024: - Seems we need to split the schema's string representation... When a data source table has too many columns, it's schema cannot be stored in metastore. - Key: SPARK-6024 URL: https://issues.apache.org/jira/browse/SPARK-6024 Project: Spark Issue Type: Bug Components: SQL Reporter: Yin Huai Priority: Blocker Because we are using table properties of a Hive metastore table to store the schema, when a schema is too wide, we cannot persist it in metastore. {code} 15/02/25 18:13:50 ERROR metastore.RetryingHMSHandler: Retrying HMSHandler after 1000 ms (attempt 1 of 1) with error: javax.jdo.JDODataStoreException: Put request failed : INSERT INTO TABLE_PARAMS (PARAM_VALUE,TBL_ID,PARAM_KEY) VALUES (?,?,?) at org.datanucleus.api.jdo.NucleusJDOHelper.getJDOExceptionForNucleusException(NucleusJDOHelper.java:451) at org.datanucleus.api.jdo.JDOPersistenceManager.jdoMakePersistent(JDOPersistenceManager.java:732) at org.datanucleus.api.jdo.JDOPersistenceManager.makePersistent(JDOPersistenceManager.java:752) at org.apache.hadoop.hive.metastore.ObjectStore.createTable(ObjectStore.java:719) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.hive.metastore.RawStoreProxy.invoke(RawStoreProxy.java:108) at com.sun.proxy.$Proxy15.createTable(Unknown Source) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.create_table_core(HiveMetaStore.java:1261) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.create_table_with_environment_context(HiveMetaStore.java:1294) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:105) at com.sun.proxy.$Proxy16.create_table_with_environment_context(Unknown Source) at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.createTable(HiveMetaStoreClient.java:558) at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.createTable(HiveMetaStoreClient.java:547) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:89) at com.sun.proxy.$Proxy17.createTable(Unknown Source) at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:613) at org.apache.spark.sql.hive.HiveMetastoreCatalog.createDataSourceTable(HiveMetastoreCatalog.scala:136) at org.apache.spark.sql.hive.execution.CreateMetastoreDataSourceAsSelect.run(commands.scala:243) at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:55) at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:55) at 
org.apache.spark.sql.execution.ExecutedCommand.execute(commands.scala:65) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:1092) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:1092) at org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:1013) at org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:963) at org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:929) at org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:907) at $line39.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:25) at $line39.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:30) at $line39.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:32) at $line39.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:34) at $line39.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:36) at
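Regarding the comment above about splitting the schema's string representation: one straightforward approach is to serialize the schema to JSON and spread it across several numbered table properties, reassembling it on read. A minimal sketch, with illustrative property names and chunk size:
{code}
import org.apache.spark.sql.types.{DataType, StructType}

// Split the JSON form of the schema into chunks small enough for TABLE_PARAMS,
// recording the chunk count so the schema can be rebuilt later.
def schemaToProps(schema: StructType, chunkSize: Int = 4000): Map[String, String] = {
  val parts = schema.json.grouped(chunkSize).toSeq
  Map("spark.sql.sources.schema.numParts" -> parts.size.toString) ++
    parts.zipWithIndex.map { case (part, i) => s"spark.sql.sources.schema.part.$i" -> part }
}

def propsToSchema(props: Map[String, String]): StructType = {
  val numParts = props("spark.sql.sources.schema.numParts").toInt
  val json = (0 until numParts).map(i => props(s"spark.sql.sources.schema.part.$i")).mkString
  DataType.fromJson(json).asInstanceOf[StructType]
}
{code}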