[jira] [Commented] (SPARK-6006) Optimize count distinct in case of high cardinality columns
[ https://issues.apache.org/jira/browse/SPARK-6006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14336430#comment-14336430 ] Apache Spark commented on SPARK-6006: - User 'saucam' has created a pull request for this issue: https://github.com/apache/spark/pull/4764 Optimize count distinct in case of high cardinality columns --- Key: SPARK-6006 URL: https://issues.apache.org/jira/browse/SPARK-6006 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.1, 1.2.1 Reporter: Yash Datta Priority: Minor Fix For: 1.3.0 In case there are a lot of distinct values, count distinct becomes too slow since it tries to hash partial results to one map. It can be improved by creating buckets/partial maps in an intermediate stage where same key from multiple partial maps of first stage hash to the same bucket. Later we can sum the size of these buckets to get total distinct count. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
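To make the bucketing idea above concrete, here is a minimal standalone sketch of a two-stage distinct count in plain RDD terms. It assumes a local SparkContext and a synthetic key column, and it only illustrates the scheme (partial distinct per partition, then re-bucketing so equal keys meet in the same bucket); it is not the change made in the pull request.

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

object BucketedDistinctCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("bucketed-distinct").setMaster("local[*]"))
    val values = sc.parallelize(1 to 1000000).map(i => s"key-${i % 250000}")  // synthetic high-cardinality column
    val numBuckets = 200

    val distinctCount = values
      .mapPartitions(_.toSet.iterator)                          // stage 1: partial distinct per partition
      .map(k => ((k.hashCode & Int.MaxValue) % numBuckets, k))  // equal keys always land in the same bucket
      .groupByKey(numBuckets)                                   // intermediate stage: one partial map per bucket
      .map { case (_, keys) => keys.toSet.size.toLong }         // distinct size of each bucket
      .sum()                                                    // total distinct count = sum of bucket sizes

    println(distinctCount)  // 250000.0 for the synthetic data above
    sc.stop()
  }
}
{code}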
[jira] [Commented] (SPARK-5983) Don't respond to HTTP TRACE in HTTP-based UIs
[ https://issues.apache.org/jira/browse/SPARK-5983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14336440#comment-14336440 ] Apache Spark commented on SPARK-5983: - User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/4765 Don't respond to HTTP TRACE in HTTP-based UIs - Key: SPARK-5983 URL: https://issues.apache.org/jira/browse/SPARK-5983 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Sean Owen Priority: Minor This was flagged a while ago during a routine security scan: the HTTP-based Spark services respond to an HTTP TRACE command. This is basically an HTTP verb that has no practical use, and has a pretty theoretical chance of being an exploit vector. It is flagged as a security issue by one common tool, however. Spark's HTTP services are based on Jetty, which by default does not enable TRACE (like Tomcat). However, the services do reply to TRACE requests. I think it is because the use of Jetty is pretty 'raw' and does not enable much of the default additional configuration you might get by using Jetty as a standalone server. I know that it is at least possible to stop the reply to TRACE with a few extra lines of code, so I think it is worth shutting off TRACE requests. Although the security risk is quite theoretical, it should be easy to fix and bring the Spark services into line with the common default of HTTP servers today. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
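For readers unfamiliar with how this is usually done, the following is a minimal sketch of shutting off TRACE with Jetty's constraint machinery. The class names are from the standard (unshaded) Jetty security API; this illustrates the general technique, not necessarily the exact change made in the pull request.

{code}
import org.eclipse.jetty.security.{ConstraintMapping, ConstraintSecurityHandler}
import org.eclipse.jetty.servlet.ServletContextHandler
import org.eclipse.jetty.util.security.Constraint

// A constraint that requires authentication but grants no roles; mapping it to the
// TRACE method makes Jetty reject TRACE requests on every path.
def disableTrace(handler: ServletContextHandler): Unit = {
  val constraint = new Constraint()
  constraint.setName("Disable TRACE")
  constraint.setAuthenticate(true)

  val mapping = new ConstraintMapping()
  mapping.setConstraint(constraint)
  mapping.setMethod("TRACE")
  mapping.setPathSpec("/")

  val security = new ConstraintSecurityHandler()
  security.addConstraintMapping(mapping)
  handler.setSecurityHandler(security)
}
{code}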
[jira] [Created] (SPARK-6007) Add numRows param in DataFrame.show
Jacky Li created SPARK-6007: --- Summary: Add numRows param in DataFrame.show Key: SPARK-6007 URL: https://issues.apache.org/jira/browse/SPARK-6007 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.1 Reporter: Jacky Li Priority: Minor Fix For: 1.3.0 Currently, DataFrame.show always displays the first 20 rows; it would be useful if the user could decide how many rows to show by passing the count as a parameter to show() -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
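A small usage sketch of what the request amounts to, against the Spark 1.3-era DataFrame API; the exact signature merged for this issue may differ, and the example table is made up.

{code}
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

val sc = new SparkContext("local[*]", "show-numrows")
val sqlContext = new SQLContext(sc)

// Hypothetical example table, just to have something to display.
val df = sqlContext.createDataFrame((1 to 100).map(i => (i, s"name-$i"))).toDF("id", "name")

df.show()    // current behaviour: always prints the first 20 rows
df.show(50)  // requested: the caller decides how many rows to print
{code}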
[jira] [Resolved] (SPARK-5666) Improvements in Mqtt Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-5666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-5666. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 4178 [https://github.com/apache/spark/pull/4178] Improvements in Mqtt Spark Streaming - Key: SPARK-5666 URL: https://issues.apache.org/jira/browse/SPARK-5666 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Prabeesh K Priority: Minor Fix For: 1.4.0 Clean up the source code related to Mqtt Spark Streaming to adhere to accepted coding standards. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6005) Flaky test: o.a.s.streaming.kafka.DirectKafkaStreamSuite.offset recovery
[ https://issues.apache.org/jira/browse/SPARK-6005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14336542#comment-14336542 ] Iulian Dragos commented on SPARK-6005: -- Looks similar, but unless I miss something, that fix didn't fix this one. The failure I report is from a recent build that includes the fixes in 3912d332464dcd124c60b734724c34d9742466a4 Flaky test: o.a.s.streaming.kafka.DirectKafkaStreamSuite.offset recovery Key: SPARK-6005 URL: https://issues.apache.org/jira/browse/SPARK-6005 Project: Spark Issue Type: Bug Components: Streaming Reporter: Iulian Dragos Labels: flaky-test, kafka, streaming [Link to failing test on Jenkins|https://ci.typesafe.com/view/Spark/job/spark-nightly-build/lastCompletedBuild/testReport/org.apache.spark.streaming.kafka/DirectKafkaStreamSuite/offset_recovery/] {code} The code passed to eventually never returned normally. Attempted 208 times over 10.00622791 seconds. Last failure message: strings.forall({ ((elem: Any) = DirectKafkaStreamSuite.collectedData.contains(elem)) }) was false. {code} {code:title=Stack trace} sbt.ForkMain$ForkError: The code passed to eventually never returned normally. Attempted 208 times over 10.00622791 seconds. Last failure message: strings.forall({ ((elem: Any) = DirectKafkaStreamSuite.collectedData.contains(elem)) }) was false. at org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420) at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438) at org.apache.spark.streaming.kafka.KafkaStreamSuiteBase.eventually(KafkaStreamSuite.scala:49) at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:307) at org.apache.spark.streaming.kafka.KafkaStreamSuiteBase.eventually(KafkaStreamSuite.scala:49) at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite$$anonfun$5.org$apache$spark$streaming$kafka$DirectKafkaStreamSuite$$anonfun$$sendDataAndWaitForReceive$1(DirectKafkaStreamSuite.scala:225) at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite$$anonfun$5.apply$mcV$sp(DirectKafkaStreamSuite.scala:287) at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite$$anonfun$5.apply(DirectKafkaStreamSuite.scala:211) at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite$$anonfun$5.apply(DirectKafkaStreamSuite.scala:211) at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) at org.scalatest.Suite$class.withFixture(Suite.scala:1122) at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.org$scalatest$BeforeAndAfter$$super$runTest(DirectKafkaStreamSuite.scala:39) at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200) at org.apache.spark.streaming.kafka.DirectKafkaStreamSuite.runTest(DirectKafkaStreamSuite.scala:39) at 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) at scala.collection.immutable.List.foreach(List.scala:318) at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208) at org.scalatest.FunSuite.runTests(FunSuite.scala:1555) at org.scalatest.Suite$class.run(Suite.scala:1424) at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) at
[jira] [Commented] (SPARK-5837) HTTP 500 if try to access Spark UI in yarn-cluster or yarn-client mode
[ https://issues.apache.org/jira/browse/SPARK-5837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14336475#comment-14336475 ] Marco Capuccini commented on SPARK-5837: setting yarn.resourcemanager.hostname solved the problem. Thanks a lot. HTTP 500 if try to access Spark UI in yarn-cluster or yarn-client mode -- Key: SPARK-5837 URL: https://issues.apache.org/jira/browse/SPARK-5837 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.2.0, 1.2.1 Reporter: Marco Capuccini Both Spark 1.2.0 and Spark 1.2.1 return this error when I try to access the Spark UI if I run over yarn (version 2.4.0): HTTP ERROR 500 Problem accessing /proxy/application_1423564210894_0017/. Reason: Connection refused Caused by: java.net.ConnectException: Connection refused at java.net.PlainSocketImpl.socketConnect(Native Method) at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339) at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200) at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182) at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) at java.net.Socket.connect(Socket.java:579) at java.net.Socket.connect(Socket.java:528) at java.net.Socket.init(Socket.java:425) at java.net.Socket.init(Socket.java:280) at org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:80) at org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:122) at org.apache.commons.httpclient.HttpConnection.open(HttpConnection.java:707) at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:387) at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171) at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397) at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:346) at org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.proxyLink(WebAppProxyServlet.java:187) at org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.doGet(WebAppProxyServlet.java:344) at javax.servlet.http.HttpServlet.service(HttpServlet.java:707) at javax.servlet.http.HttpServlet.service(HttpServlet.java:820) at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221) at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:66) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:900) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834) at org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:79) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795) at com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163) at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58) at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118) at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109) at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1192) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399) at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216) at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182) at
[jira] [Updated] (SPARK-5666) Improvements in Mqtt Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-5666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5666: - Assignee: Prabeesh K Improvements in Mqtt Spark Streaming - Key: SPARK-5666 URL: https://issues.apache.org/jira/browse/SPARK-5666 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Prabeesh K Assignee: Prabeesh K Priority: Minor Fix For: 1.4.0 Clean up the source code related to Mqtt Spark Streaming to adhere to accepted coding standards. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5947) First class partitioning support in data sources API
[ https://issues.apache.org/jira/browse/SPARK-5947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14336556#comment-14336556 ] Philippe Girolami commented on SPARK-5947: -- For some workloads, it can make more sense to use SKEWED ON rather than PARTITION in order to prevent creating thousands of tiny partitions just to handle a few large partitions. As far as I can tell, these two cases can't be inferred from a directory layout so maybe it would make sense to make PARTITION SKEW part of Spark too, and rely on meta-data defined by the application rather than directory discovery ? First class partitioning support in data sources API Key: SPARK-5947 URL: https://issues.apache.org/jira/browse/SPARK-5947 Project: Spark Issue Type: Improvement Components: SQL Reporter: Cheng Lian For file system based data sources, implementing Hive style partitioning support can be complex and error prone. To be specific, partitioning support include: # Partition discovery: Given a directory organized similar to Hive partitions, discover the directory structure and partitioning information automatically, including partition column names, data types, and values. # Reading from partitioned tables # Writing to partitioned tables It would be good to have first class partitioning support in the data sources API. For example, add a {{FileBasedScan}} trait with callbacks and default implementations for these features. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
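As a point of reference for the first item in the list above (partition discovery), here is a standalone sketch of what recovering Hive-style partition columns from a directory path looks like. The path and column names are invented for illustration; this is not the data sources API hook being proposed.

{code}
// A Hive-style layout encodes partition columns in directory names as column=value.
val path = "hdfs:///warehouse/events/year=2015/month=02/part-00000.parquet"

// Recover partition column names and (string-typed) values from the path segments.
val partitionValues: Map[String, String] =
  path.split("/")
    .filter(_.contains("="))
    .map { segment =>
      val Array(column, value) = segment.split("=", 2)
      column -> value
    }
    .toMap

println(partitionValues)  // Map(year -> 2015, month -> 02)
{code}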
[jira] [Created] (SPARK-6006) Optimize count distinct in case high cardinality columns
Yash Datta created SPARK-6006: - Summary: Optimize count distinct in case high cardinality columns Key: SPARK-6006 URL: https://issues.apache.org/jira/browse/SPARK-6006 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.2.1, 1.1.1 Reporter: Yash Datta Priority: Minor Fix For: 1.3.0 In case there are a lot of distinct values, count distinct becomes too slow since it tries to hash partial results to one map. It can be improved by creating buckets/partial maps in an intermediate stage where same key from multiple partial maps of first stage hash to the same bucket. Later we can sum the size of these buckets to get total distinct count. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6006) Optimize count distinct in case of high cardinality columns
[ https://issues.apache.org/jira/browse/SPARK-6006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yash Datta updated SPARK-6006: -- Summary: Optimize count distinct in case of high cardinality columns (was: Optimize count distinct in case high cardinality columns) Optimize count distinct in case of high cardinality columns --- Key: SPARK-6006 URL: https://issues.apache.org/jira/browse/SPARK-6006 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.1, 1.2.1 Reporter: Yash Datta Priority: Minor Fix For: 1.3.0 In case there are a lot of distinct values, count distinct becomes too slow since it tries to hash partial results to one map. It can be improved by creating buckets/partial maps in an intermediate stage where same key from multiple partial maps of first stage hash to the same bucket. Later we can sum the size of these buckets to get total distinct count. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5978) Spark examples cannot compile with Hadoop 2
[ https://issues.apache.org/jira/browse/SPARK-5978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14336443#comment-14336443 ] Sean Owen commented on SPARK-5978: -- Hm, not sure I can reproduce this. If you build for Hadoop 2 and look at the dependency tree (e.g. {{mvn -Phadoop-2.4 -Dhadoop.version=2.4.0 dependency:tree -pl examples}}) then it looks like you get Hadoop 2, HBase's Hadoop 2 flavor, etc.: {code} [INFO] | +- org.apache.hadoop:hadoop-client:jar:2.4.0:compile ... [INFO] +- org.apache.hbase:hbase-testing-util:jar:0.98.7-hadoop2:compile {code} Is that how you built it? Spark examples cannot compile with Hadoop 2 --- Key: SPARK-5978 URL: https://issues.apache.org/jira/browse/SPARK-5978 Project: Spark Issue Type: Bug Components: Examples, PySpark Affects Versions: 1.2.0, 1.2.1 Reporter: Michael Nazario Labels: hadoop-version This is a regression from Spark 1.1.1. The Spark Examples includes an example for an avro converter for PySpark. When I was trying to debug a problem, I discovered that even though you can build with Hadoop 2 for Spark 1.2.0, an hbase dependency depends on Hadoop 1 somewhere else in the examples code. An easy fix would be to separate the examples into hadoop specific versions. Another way would be to fix the hbase dependencies so that they don't rely on hadoop 1 specific code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6007) Add numRows param in DataFrame.show
[ https://issues.apache.org/jira/browse/SPARK-6007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14336470#comment-14336470 ] Apache Spark commented on SPARK-6007: - User 'jackylk' has created a pull request for this issue: https://github.com/apache/spark/pull/4767 Add numRows param in DataFrame.show --- Key: SPARK-6007 URL: https://issues.apache.org/jira/browse/SPARK-6007 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.1 Reporter: Jacky Li Priority: Minor Fix For: 1.3.0 Currently, DataFrame.show always displays the first 20 rows; it would be useful if the user could decide how many rows to show by passing the count as a parameter to show() -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5771) Number of Cores in Completed Applications of Standalone Master Web Page always be 0 if sc.stop() is called
[ https://issues.apache.org/jira/browse/SPARK-5771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-5771. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 4567 [https://github.com/apache/spark/pull/4567] Number of Cores in Completed Applications of Standalone Master Web Page always be 0 if sc.stop() is called -- Key: SPARK-5771 URL: https://issues.apache.org/jira/browse/SPARK-5771 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.2.1 Reporter: Liangliang Gu Priority: Minor Fix For: 1.4.0 In Standalone mode, the number of cores in Completed Applications of the Master Web Page will always be zero if sc.stop() is called, but the number will be correct if sc.stop() is not called. The likely reason: after sc.stop() is called, the removeExecutor function of class ApplicationInfo is called, reducing the variable coresGranted to zero. The variable coresGranted is used to display the number of Cores on the Web Page. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4010) Spark UI returns 500 in yarn-client mode
[ https://issues.apache.org/jira/browse/SPARK-4010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14336654#comment-14336654 ] Sean Owen commented on SPARK-4010: -- [~Hanchen] see SPARK-5837 for a possible explanation. Spark UI returns 500 in yarn-client mode - Key: SPARK-4010 URL: https://issues.apache.org/jira/browse/SPARK-4010 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.2.0 Reporter: Guoqiang Li Assignee: Guoqiang Li Priority: Blocker Fix For: 1.1.1, 1.2.0 http://host/proxy/application_id/stages/ returns this result: {noformat} HTTP ERROR 500 Problem accessing /proxy/application_1411648907638_0281/stages/. Reason: Connection refused Caused by: java.net.ConnectException: Connection refused at java.net.PlainSocketImpl.socketConnect(Native Method) at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339) at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200) at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182) at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) at java.net.Socket.connect(Socket.java:579) at java.net.Socket.connect(Socket.java:528) at java.net.Socket.init(Socket.java:425) at java.net.Socket.init(Socket.java:280) at org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:80) at org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:122) at org.apache.commons.httpclient.HttpConnection.open(HttpConnection.java:707) at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:387) at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171) at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397) at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:346) at org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.proxyLink(WebAppProxyServlet.java:185) at org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.doGet(WebAppProxyServlet.java:336) at javax.servlet.http.HttpServlet.service(HttpServlet.java:707) at javax.servlet.http.HttpServlet.service(HttpServlet.java:820) at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221) at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:66) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:900) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795) at com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163) at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58) at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118) at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1183) at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399) at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216) at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182) at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766) at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450) at
[jira] [Commented] (SPARK-5940) Graph Loader: refactor + add more formats
[ https://issues.apache.org/jira/browse/SPARK-5940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14336677#comment-14336677 ] Takeshi Yamamuro commented on SPARK-5940: - I made a quick fix as follows; https://github.com/maropu/spark/commit/3c55be83f48e4be5baf3e58ca49850c6cb48df12 Loaders for multiple formats (e.g., RDF and others) are implemented in graphx/impl. Graph Loader: refactor + add more formats - Key: SPARK-5940 URL: https://issues.apache.org/jira/browse/SPARK-5940 Project: Spark Issue Type: New Feature Components: GraphX Reporter: lukovnikov Priority: Minor Currently, the only graph loader is GraphLoader.edgeListFile. [SPARK-5280] adds a RDF graph loader. However, as Takeshi Yamamuro suggested on github [SPARK-5280], https://github.com/apache/spark/pull/4650, it might be interesting to make GraphLoader an interface with several implementations for different formats. And maybe it's good to make a façade graph loader that provides a unified interface to all loaders. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
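To illustrate the shape of the refactor being discussed, here is a minimal sketch of a loader trait plus a facade. The trait and object names are hypothetical; only the edge-list implementation delegates to an API that actually exists today (GraphLoader.edgeListFile).

{code}
import org.apache.spark.SparkContext
import org.apache.spark.graphx.{Graph, GraphLoader}

// Hypothetical interface: one implementation per on-disk format.
trait GraphFormatLoader {
  def load(sc: SparkContext, path: String): Graph[Int, Int]
}

// The existing edge-list loader, wrapped behind the interface.
object EdgeListLoader extends GraphFormatLoader {
  def load(sc: SparkContext, path: String): Graph[Int, Int] =
    GraphLoader.edgeListFile(sc, path)
}

// Facade giving a single entry point keyed by format name; an RDF loader
// (SPARK-5280) would simply be another entry in this map.
object GraphLoaders {
  private val loaders: Map[String, GraphFormatLoader] = Map("edgelist" -> EdgeListLoader)
  def load(sc: SparkContext, format: String, path: String): Graph[Int, Int] =
    loaders(format).load(sc, path)
}
{code}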
[jira] [Commented] (SPARK-4010) Spark UI returns 500 in yarn-client mode
[ https://issues.apache.org/jira/browse/SPARK-4010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14336644#comment-14336644 ] Hanchen Su commented on SPARK-4010: --- I still have the problem in Spark 1.2.1 Spark UI returns 500 in yarn-client mode - Key: SPARK-4010 URL: https://issues.apache.org/jira/browse/SPARK-4010 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.2.0 Reporter: Guoqiang Li Assignee: Guoqiang Li Priority: Blocker Fix For: 1.1.1, 1.2.0 http://host/proxy/application_id/stages/ returns this result: {noformat} HTTP ERROR 500 Problem accessing /proxy/application_1411648907638_0281/stages/. Reason: Connection refused Caused by: java.net.ConnectException: Connection refused at java.net.PlainSocketImpl.socketConnect(Native Method) at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339) at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200) at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182) at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) at java.net.Socket.connect(Socket.java:579) at java.net.Socket.connect(Socket.java:528) at java.net.Socket.init(Socket.java:425) at java.net.Socket.init(Socket.java:280) at org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:80) at org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:122) at org.apache.commons.httpclient.HttpConnection.open(HttpConnection.java:707) at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:387) at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171) at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397) at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:346) at org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.proxyLink(WebAppProxyServlet.java:185) at org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.doGet(WebAppProxyServlet.java:336) at javax.servlet.http.HttpServlet.service(HttpServlet.java:707) at javax.servlet.http.HttpServlet.service(HttpServlet.java:820) at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221) at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:66) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:900) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795) at com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163) at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58) at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118) at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1183) at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399) at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216) at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182) at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766) at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450) at
[jira] [Updated] (SPARK-5771) Number of Cores in Completed Applications of Standalone Master Web Page always be 0 if sc.stop() is called
[ https://issues.apache.org/jira/browse/SPARK-5771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5771: - Assignee: Liangliang Gu Number of Cores in Completed Applications of Standalone Master Web Page always be 0 if sc.stop() is called -- Key: SPARK-5771 URL: https://issues.apache.org/jira/browse/SPARK-5771 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.2.1 Reporter: Liangliang Gu Assignee: Liangliang Gu Priority: Minor Fix For: 1.4.0 In Standalone mode, the number of cores in Completed Applications of the Master Web Page will always be zero if sc.stop() is called, but the number will be correct if sc.stop() is not called. The likely reason: after sc.stop() is called, the removeExecutor function of class ApplicationInfo is called, reducing the variable coresGranted to zero. The variable coresGranted is used to display the number of Cores on the Web Page. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6009) IllegalArgumentException thrown by TimSort when SQL ORDER BY RAND ()
Paul Barber created SPARK-6009: -- Summary: IllegalArgumentException thrown by TimSort when SQL ORDER BY RAND () Key: SPARK-6009 URL: https://issues.apache.org/jira/browse/SPARK-6009 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.1, 1.2.0 Environment: Centos 7, Hadoop 2.6.0, Hive 0.15.0 java version 1.7.0_75 OpenJDK Runtime Environment (rhel-2.5.4.2.el7_0-x86_64 u75-b13) OpenJDK 64-Bit Server VM (build 24.75-b04, mixed mode) Reporter: Paul Barber Running the following SparkSQL query over JDBC: {noformat} SELECT * FROM FAA WHERE Year = 1998 AND Year = 1999 ORDER BY RAND () LIMIT 10 {noformat} This results in one or more workers throwing the following exception, with variations for {{mergeLo}} and {{mergeHi}}. {noformat} :java.lang.IllegalArgumentException: Comparison method violates its general contract! - at java.util.TimSort.mergeHi(TimSort.java:868) - at java.util.TimSort.mergeAt(TimSort.java:485) - at java.util.TimSort.mergeCollapse(TimSort.java:410) - at java.util.TimSort.sort(TimSort.java:214) - at java.util.Arrays.sort(Arrays.java:727) - at org.spark-project.guava.common.collect.Ordering.leastOf(Ordering.java:708) - at org.apache.spark.util.collection.Utils$.takeOrdered(Utils.scala:37) - at org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1.apply(RDD.scala:1138) - at org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1.apply(RDD.scala:1135) - at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601) - at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601) - at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) - at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) - at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) - at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) - at org.apache.spark.scheduler.Task.run(Task.scala:56) - at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196) - at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) - at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) - at java.lang.Thread.run(Thread.java:745) {noformat} We have tested with both Spark 1.2.0 and Spark 1.2.1 and have seen the same error in both. The query sometimes succeeds, but fails more often than not. Whilst this sounds similar to bugs 3032 and 3656, we believe it it is not the same. The {{ORDER BY RAND ()}} is using TimSort to produce the random ordering by sorting a list of random values. Having spent some time looking at the issue with jdb, it appears that the problem is triggered by the random values being changed during the sort - the code which triggers this is in {{sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Row.scala}} - class RowOrdering, function compare, line 250 - where a new random number is taken for the same row. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
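A small standalone illustration of the failure mode described above, assuming nothing about the Spark internals: an ordering that draws a fresh random number on every comparison is not consistent across calls and can trip TimSort's contract check, while drawing one random key per element up front is stable.

{code}
import scala.util.Random

// Problematic pattern: compare(a, b) can give different answers on repeated calls,
// which is exactly the "Comparison method violates its general contract!" condition.
val unstableOrdering: Ordering[String] =
  Ordering.by[String, Double](_ => Random.nextDouble())

// Safer pattern: materialize one random key per element first, then sort by that
// fixed key, so every comparison of the same pair agrees.
def randomOrder[T](xs: Seq[T]): Seq[T] =
  xs.map(x => (Random.nextDouble(), x)).sortBy(_._1).map(_._2)

println(randomOrder(Seq("a", "b", "c", "d")))  // e.g. List(c, a, d, b)
{code}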
[jira] [Commented] (SPARK-5750) Document that ordering of elements in shuffled partitions is not deterministic across runs
[ https://issues.apache.org/jira/browse/SPARK-5750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14336795#comment-14336795 ] Ilya Ganelin commented on SPARK-5750: - Did you have a particular doc in mind to update? I feel like this sort of comment should go in the programming guide but there's not really a good spot for it. One glaring omission in the guide is a general writeup of the shuffle operation and the role that it plays internally. Understanding shuffles is key to writing stable Spark applications yet there isn't really any mention of it outside of the tech talks and presentations from the Spark folks. My suggestion would be to create a section providing an overview of shuffle, what parameters influence its behavior and stability, and then add this comment to that section. Document that ordering of elements in shuffled partitions is not deterministic across runs -- Key: SPARK-5750 URL: https://issues.apache.org/jira/browse/SPARK-5750 Project: Spark Issue Type: Improvement Components: Documentation Reporter: Josh Rosen The ordering of elements in shuffled partitions is not deterministic across runs. For instance, consider the following example: {code} val largeFiles = sc.textFile(...) val airlines = largeFiles.repartition(2000).cache() println(airlines.first) {code} If this code is run twice, then each run will output a different result. There is non-determinism in the shuffle read code that accounts for this: Spark's shuffle read path processes blocks as soon as they are fetched Spark uses [ShuffleBlockFetcherIterator|https://github.com/apache/spark/blob/v1.2.1/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala] to fetch shuffle data from mappers. In this code, requests for multiple blocks from the same host are batched together, so nondeterminism in where tasks are run means that the set of requests can vary across runs. In addition, there's an [explicit call|https://github.com/apache/spark/blob/v1.2.1/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala#L256] to randomize the order of the batched fetch requests. As a result, shuffle operations cannot be guaranteed to produce the same ordering of the elements in their partitions. Therefore, Spark should update its docs to clarify that the ordering of elements in shuffle RDDs' partitions is non-deterministic. Note, however, that the _set_ of elements in each partition will be deterministic: if we used {{mapPartitions}} to sort each partition, then the {{first()}} call above would produce a deterministic result. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6010) Exception thrown when reading Spark SQL generated Parquet files with different but compatible schemas
[ https://issues.apache.org/jira/browse/SPARK-6010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14336796#comment-14336796 ] Apache Spark commented on SPARK-6010: - User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/4768 Exception thrown when reading Spark SQL generated Parquet files with different but compatible schemas - Key: SPARK-6010 URL: https://issues.apache.org/jira/browse/SPARK-6010 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Cheng Lian Assignee: Cheng Lian Priority: Blocker The following test case added in {{ParquetPartitionDiscoverySuite}} can be used to reproduce this issue: {code} test(read partitioned table - merging compatible schemas) { withTempDir { base = makeParquetFile( (1 to 10).map(i = Tuple1(i)).toDF(intField), makePartitionDir(base, defaultPartitionName, pi - 1)) makeParquetFile( (1 to 10).map(i = (i, i.toString)).toDF(intField, stringField), makePartitionDir(base, defaultPartitionName, pi - 2)) load(base.getCanonicalPath, org.apache.spark.sql.parquet).registerTempTable(t) withTempTable(t) { checkAnswer( sql(SELECT * FROM t), (1 to 10).map(i = Row(i, null, 1)) ++ (1 to 10).map(i = Row(i, i.toString, 2))) } } } {code} Exception thrown: {code} [info] java.lang.RuntimeException: could not merge metadata: key org.apache.spark.sql.parquet.row.metadata has conflicting values: [{type:struct,fields:[{name:intField,type:integer,nullable:false,metadata:{}},{name:stringField,type:string,nullable:true,metadata:{}}]}, {type:struct,fields:[{name:intField,type:integer,nullable:false,metadata:{}}]}] [info] at parquet.hadoop.api.InitContext.getMergedKeyValueMetaData(InitContext.java:67) [info] at parquet.hadoop.api.ReadSupport.init(ReadSupport.java:84) [info] at org.apache.spark.sql.parquet.FilteringParquetRowInputFormat.getSplits(ParquetTableOperations.scala:484) [info] at parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:245) [info] at org.apache.spark.sql.parquet.ParquetRelation2$$anon$1.getPartitions(newParquet.scala:461) [info] at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) [info] at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) [info] at scala.Option.getOrElse(Option.scala:120) [info] at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) [info] at org.apache.spark.rdd.NewHadoopRDD$NewHadoopMapPartitionsWithSplitRDD.getPartitions(NewHadoopRDD.scala:239) [info] at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) [info] at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) [info] at scala.Option.getOrElse(Option.scala:120) [info] at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) [info] at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) [info] at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) [info] at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) [info] at scala.Option.getOrElse(Option.scala:120) [info] at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) [info] at org.apache.spark.SparkContext.runJob(SparkContext.scala:1518) [info] at org.apache.spark.rdd.RDD.collect(RDD.scala:813) [info] at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:83) [info] at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:790) [info] at org.apache.spark.sql.QueryTest$.checkAnswer(QueryTest.scala:115) [info] at org.apache.spark.sql.QueryTest.checkAnswer(QueryTest.scala:60) 
[info] at org.apache.spark.sql.parquet.ParquetPartitionDiscoverySuite$$anonfun$8$$anonfun$apply$mcV$sp$18$$anonfun$apply$8.apply$mcV$sp(ParquetPartitionDiscoverySuite.scala:337) [info] at org.apache.spark.sql.parquet.ParquetTest$class.withTempTable(ParquetTest.scala:112) [info] at org.apache.spark.sql.parquet.ParquetPartitionDiscoverySuite.withTempTable(ParquetPartitionDiscoverySuite.scala:35) [info] at org.apache.spark.sql.parquet.ParquetPartitionDiscoverySuite$$anonfun$8$$anonfun$apply$mcV$sp$18.apply(ParquetPartitionDiscoverySuite.scala:336) [info] at org.apache.spark.sql.parquet.ParquetPartitionDiscoverySuite$$anonfun$8$$anonfun$apply$mcV$sp$18.apply(ParquetPartitionDiscoverySuite.scala:325)
[jira] [Commented] (SPARK-5750) Document that ordering of elements in shuffled partitions is not deterministic across runs
[ https://issues.apache.org/jira/browse/SPARK-5750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14336864#comment-14336864 ] Sean Owen commented on SPARK-5750: -- Personally I think that would be a fine way forward, to create a basic overview of the shuffle, what operations shuffle, and the essential things to know. This would be a good place to document things like this topic. Here are a few other sort-of related JIRAs: https://issues.apache.org/jira/browse/SPARK-5836 https://issues.apache.org/jira/browse/SPARK-4227 https://issues.apache.org/jira/browse/SPARK-3441 Document that ordering of elements in shuffled partitions is not deterministic across runs -- Key: SPARK-5750 URL: https://issues.apache.org/jira/browse/SPARK-5750 Project: Spark Issue Type: Improvement Components: Documentation Reporter: Josh Rosen The ordering of elements in shuffled partitions is not deterministic across runs. For instance, consider the following example: {code} val largeFiles = sc.textFile(...) val airlines = largeFiles.repartition(2000).cache() println(airlines.first) {code} If this code is run twice, then each run will output a different result. There is non-determinism in the shuffle read code that accounts for this: Spark's shuffle read path processes blocks as soon as they are fetched Spark uses [ShuffleBlockFetcherIterator|https://github.com/apache/spark/blob/v1.2.1/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala] to fetch shuffle data from mappers. In this code, requests for multiple blocks from the same host are batched together, so nondeterminism in where tasks are run means that the set of requests can vary across runs. In addition, there's an [explicit call|https://github.com/apache/spark/blob/v1.2.1/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala#L256] to randomize the order of the batched fetch requests. As a result, shuffle operations cannot be guaranteed to produce the same ordering of the elements in their partitions. Therefore, Spark should update its docs to clarify that the ordering of elements in shuffle RDDs' partitions is non-deterministic. Note, however, that the _set_ of elements in each partition will be deterministic: if we used {{mapPartitions}} to sort each partition, then the {{first()}} call above would produce a deterministic result. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
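Building on the example in the issue description, a hedged sketch of the point to be documented: the per-partition ordering after repartition() is not stable across runs, but sorting within each partition yields a deterministic first(). The input path is a placeholder.

{code}
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("shuffle-order").setMaster("local[*]"))
val largeFiles = sc.textFile("hdfs:///path/to/input")  // placeholder path, as in the issue's example

// repartition() shuffles, so the order of lines inside each partition can differ per run.
val airlines = largeFiles.repartition(2000).cache()
println(airlines.first())  // may print a different line on each run

// The *set* of lines per partition is deterministic, so sorting within partitions
// makes first() reproducible.
val stableFirst = airlines.mapPartitions(_.toArray.sorted.iterator).first()
println(stableFirst)
{code}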
[jira] [Resolved] (SPARK-6008) zip two rdds derived from pickleFile fails
[ https://issues.apache.org/jira/browse/SPARK-6008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-6008. --- Resolution: Duplicate Fix Version/s: 1.2.2 1.3.0 Assignee: Davies Liu zip two rdds derived from pickleFile fails -- Key: SPARK-6008 URL: https://issues.apache.org/jira/browse/SPARK-6008 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.3.0 Reporter: Charles Hayden Assignee: Davies Liu Fix For: 1.3.0, 1.2.2 Read an rdd from a pickle file. Then create another from the first, and zip them together. from pyspark import SparkContext sc = SparkContext() print sc.version r = sc.parallelize(range(1, 1000)) r.saveAsPickleFile('file') rdd = sc.pickleFile('file') res = rdd.map(lambda row: row, preservesPartitioning=True) z = rdd.zip(res) print z.take(1) Gives the following error: File bug.py, line 30, in module print z.take(1) File /home/ubuntu/spark/python/pyspark/rdd.py, line 1225, in take res = self.context.runJob(self, takeUpToNumLeft, p, True) File /home/ubuntu/spark/python/pyspark/context.py, line 843, in runJob it = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, javaPartitions, allowLocal) File /home/ubuntu/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py, line 538, in __call__ File /home/ubuntu/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py, line 300, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 8, localhost): org.apache.spark.SparkException: Can only zip RDDs with same number of elements in each partition at org.apache.spark.rdd.RDD$$anonfun$zip$1$$anon$1.hasNext(RDD.scala:746) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at org.apache.spark.rdd.RDD$$anonfun$zip$1$$anon$1.foreach(RDD.scala:742) at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:406) at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:244) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1613) at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:205) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6008) zip two rdds derived from pickleFile fails
Charles Hayden created SPARK-6008: - Summary: zip two rdds derived from pickleFile fails Key: SPARK-6008 URL: https://issues.apache.org/jira/browse/SPARK-6008 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.3.0 Reporter: Charles Hayden Read an rdd from a pickle file. Then create another from the first, and zip them together. from pyspark import SparkContext sc = SparkContext() print sc.version r = sc.parallelize(range(1, 1000)) r.saveAsPickleFile('file') rdd = sc.pickleFile('file') res = rdd.map(lambda row: row, preservesPartitioning=True) z = rdd.zip(res) print z.take(1) Gives the following error: File bug.py, line 30, in module print z.take(1) File /home/ubuntu/spark/python/pyspark/rdd.py, line 1225, in take res = self.context.runJob(self, takeUpToNumLeft, p, True) File /home/ubuntu/spark/python/pyspark/context.py, line 843, in runJob it = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, javaPartitions, allowLocal) File /home/ubuntu/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py, line 538, in __call__ File /home/ubuntu/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py, line 300, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 8, localhost): org.apache.spark.SparkException: Can only zip RDDs with same number of elements in each partition at org.apache.spark.rdd.RDD$$anonfun$zip$1$$anon$1.hasNext(RDD.scala:746) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at org.apache.spark.rdd.RDD$$anonfun$zip$1$$anon$1.foreach(RDD.scala:742) at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:406) at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:244) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1613) at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:205) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6010) Exception thrown when reading Spark SQL generated Parquet files with different but compatible schemas
Cheng Lian created SPARK-6010: - Summary: Exception thrown when reading Spark SQL generated Parquet files with different but compatible schemas Key: SPARK-6010 URL: https://issues.apache.org/jira/browse/SPARK-6010 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Cheng Lian Assignee: Cheng Lian Priority: Blocker The following test case added in {{ParquetPartitionDiscoverySuite}} can be used to reproduce this issue: {code} test(read partitioned table - merging compatible schemas) { withTempDir { base = makeParquetFile( (1 to 10).map(i = Tuple1(i)).toDF(intField), makePartitionDir(base, defaultPartitionName, pi - 1)) makeParquetFile( (1 to 10).map(i = (i, i.toString)).toDF(intField, stringField), makePartitionDir(base, defaultPartitionName, pi - 2)) load(base.getCanonicalPath, org.apache.spark.sql.parquet).registerTempTable(t) withTempTable(t) { checkAnswer( sql(SELECT * FROM t), (1 to 10).map(i = Row(i, null, 1)) ++ (1 to 10).map(i = Row(i, i.toString, 2))) } } } {code} Exception thrown: {code} [info] java.lang.RuntimeException: could not merge metadata: key org.apache.spark.sql.parquet.row.metadata has conflicting values: [{type:struct,fields:[{name:intField,type:integer,nullable:false,metadata:{}},{name:stringField,type:string,nullable:true,metadata:{}}]}, {type:struct,fields:[{name:intField,type:integer,nullable:false,metadata:{}}]}] [info] at parquet.hadoop.api.InitContext.getMergedKeyValueMetaData(InitContext.java:67) [info] at parquet.hadoop.api.ReadSupport.init(ReadSupport.java:84) [info] at org.apache.spark.sql.parquet.FilteringParquetRowInputFormat.getSplits(ParquetTableOperations.scala:484) [info] at parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:245) [info] at org.apache.spark.sql.parquet.ParquetRelation2$$anon$1.getPartitions(newParquet.scala:461) [info] at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) [info] at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) [info] at scala.Option.getOrElse(Option.scala:120) [info] at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) [info] at org.apache.spark.rdd.NewHadoopRDD$NewHadoopMapPartitionsWithSplitRDD.getPartitions(NewHadoopRDD.scala:239) [info] at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) [info] at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) [info] at scala.Option.getOrElse(Option.scala:120) [info] at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) [info] at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) [info] at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) [info] at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) [info] at scala.Option.getOrElse(Option.scala:120) [info] at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) [info] at org.apache.spark.SparkContext.runJob(SparkContext.scala:1518) [info] at org.apache.spark.rdd.RDD.collect(RDD.scala:813) [info] at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:83) [info] at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:790) [info] at org.apache.spark.sql.QueryTest$.checkAnswer(QueryTest.scala:115) [info] at org.apache.spark.sql.QueryTest.checkAnswer(QueryTest.scala:60) [info] at org.apache.spark.sql.parquet.ParquetPartitionDiscoverySuite$$anonfun$8$$anonfun$apply$mcV$sp$18$$anonfun$apply$8.apply$mcV$sp(ParquetPartitionDiscoverySuite.scala:337) [info] at 
org.apache.spark.sql.parquet.ParquetTest$class.withTempTable(ParquetTest.scala:112) [info] at org.apache.spark.sql.parquet.ParquetPartitionDiscoverySuite.withTempTable(ParquetPartitionDiscoverySuite.scala:35) [info] at org.apache.spark.sql.parquet.ParquetPartitionDiscoverySuite$$anonfun$8$$anonfun$apply$mcV$sp$18.apply(ParquetPartitionDiscoverySuite.scala:336) [info] at org.apache.spark.sql.parquet.ParquetPartitionDiscoverySuite$$anonfun$8$$anonfun$apply$mcV$sp$18.apply(ParquetPartitionDiscoverySuite.scala:325) [info] at org.apache.spark.sql.parquet.ParquetTest$class.withTempDir(ParquetTest.scala:82) [info] at org.apache.spark.sql.parquet.ParquetPartitionDiscoverySuite.withTempDir(ParquetPartitionDiscoverySuite.scala:35) [info] at org.apache.spark.sql.parquet.ParquetPartitionDiscoverySuite$$anonfun$8.apply$mcV$sp(ParquetPartitionDiscoverySuite.scala:325) [info] at
[jira] [Updated] (SPARK-6012) Deadlock when asking for partitions from CoalescedRDD on top of a TakeOrdered operator
[ https://issues.apache.org/jira/browse/SPARK-6012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Seiden updated SPARK-6012: -- Summary: Deadlock when asking for partitions from CoalescedRDD on top of a TakeOrdered operator (was: Deadlock when asking for SchemaRDD partitions with TakeOrdered operator) Deadlock when asking for partitions from CoalescedRDD on top of a TakeOrdered operator -- Key: SPARK-6012 URL: https://issues.apache.org/jira/browse/SPARK-6012 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.1 Reporter: Max Seiden Priority: Critical h3. Summary I've found that a deadlock occurs when asking for the partitions from a SchemaRDD that has a TakeOrdered as its terminal operator. The problem occurs when a child RDDs asks the DAGScheduler for preferred partition locations (which locks the scheduler) and eventually hits the #execute() of the TakeOrdered operator, which submits tasks but is blocked when it also tries to get preferred locations (in a separate thread). It seems like the TakeOrdered op's #execute() method should not actually submit a task (it is calling #executeCollect() and creating a new RDD) and should instead stay more true to the comment a logically apply a Limit on top of a Sort. In my particular case, I am forcing a repartition of a SchemaRDD with a terminal Limit(..., Sort(...)), which is where the CoalescedRDD comes into play. h3. Stack Traces h4. Task Submission {noformat} main prio=5 tid=0x7f8e7280 nid=0x1303 in Object.wait() [0x00010ed5e000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on 0x0007c4c239b8 (a org.apache.spark.scheduler.JobWaiter) at java.lang.Object.wait(Object.java:503) at org.apache.spark.scheduler.JobWaiter.awaitResult(JobWaiter.scala:73) - locked 0x0007c4c239b8 (a org.apache.spark.scheduler.JobWaiter) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:514) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1321) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1390) at org.apache.spark.rdd.RDD.reduce(RDD.scala:884) at org.apache.spark.rdd.RDD.takeOrdered(RDD.scala:1161) at org.apache.spark.sql.execution.TakeOrdered.executeCollect(basicOperators.scala:183) at org.apache.spark.sql.execution.TakeOrdered.execute(basicOperators.scala:188) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:425) - locked 0x0007c36ce038 (a org.apache.spark.sql.hive.HiveContext$$anon$7) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:425) at org.apache.spark.sql.SchemaRDD.getDependencies(SchemaRDD.scala:127) at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:209) at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:207) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.dependencies(RDD.scala:207) at org.apache.spark.rdd.RDD.firstParent(RDD.scala:1278) at org.apache.spark.sql.SchemaRDD.getPartitions(SchemaRDD.scala:122) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:220) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220) at scala.Option.getOrElse(Option.scala:120) at 
org.apache.spark.rdd.RDD.partitions(RDD.scala:220) at org.apache.spark.ShuffleDependency.init(Dependency.scala:79) at org.apache.spark.rdd.ShuffledRDD.getDependencies(ShuffledRDD.scala:80) at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:209) at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:207) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.dependencies(RDD.scala:207) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1333) at org.apache.spark.scheduler.DAGScheduler.getPreferredLocs(DAGScheduler.scala:1304) - locked 0x0007f55c2238 (a org.apache.spark.scheduler.DAGScheduler) at
[jira] [Created] (SPARK-6015) Python docs' source code links are all broken
Joseph K. Bradley created SPARK-6015: Summary: Python docs' source code links are all broken Key: SPARK-6015 URL: https://issues.apache.org/jira/browse/SPARK-6015 Project: Spark Issue Type: Documentation Components: Documentation, PySpark Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Priority: Minor The Python docs display {code}[source]{code} links which should link to source code, but none work in the documentation provided on the Apache Spark website. E.g., go here [https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#module-pyspark.mllib.regression] and click a source code link on the right. I believe these are generated because we have viewcode set in the doc conf file: [https://github.com/apache/spark/blob/master/python/docs/conf.py#L33] I'm not sure why the source code is not being generated/posted. Should the links be removed, or do we want to add the source code? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6012) Deadlock when asking for SchemaRDD partitions with TakeOrdered operator
[ https://issues.apache.org/jira/browse/SPARK-6012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Seiden updated SPARK-6012: -- Description: h3. Summary I've found that a deadlock occurs when asking for the partitions from a SchemaRDD that has a TakeOrdered as its terminal operator. The problem occurs when a child RDDs asks the DAGScheduler for preferred partition locations (which locks the scheduler) and eventually hits the #execute() of the TakeOrdered operator, which submits tasks but is blocked when it also tries to get preferred locations (in a separate thread). It seems like the TakeOrdered op's #execute() method should not actually submit a task (it is calling #executeCollect() and creating a new RDD) and should instead stay more true to the comment a logically apply a Limit on top of a Sort. In my particular case, I am forcing a repartition of a SchemaRDD with a terminal Limit(..., Sort(...)), which is where the CoalescedRDD comes into play. h3. Stack Traces h4. Task Submission {noformat} main prio=5 tid=0x7f8e7280 nid=0x1303 in Object.wait() [0x00010ed5e000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on 0x0007c4c239b8 (a org.apache.spark.scheduler.JobWaiter) at java.lang.Object.wait(Object.java:503) at org.apache.spark.scheduler.JobWaiter.awaitResult(JobWaiter.scala:73) - locked 0x0007c4c239b8 (a org.apache.spark.scheduler.JobWaiter) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:514) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1321) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1390) at org.apache.spark.rdd.RDD.reduce(RDD.scala:884) at org.apache.spark.rdd.RDD.takeOrdered(RDD.scala:1161) at org.apache.spark.sql.execution.TakeOrdered.executeCollect(basicOperators.scala:183) at org.apache.spark.sql.execution.TakeOrdered.execute(basicOperators.scala:188) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:425) - locked 0x0007c36ce038 (a org.apache.spark.sql.hive.HiveContext$$anon$7) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:425) at org.apache.spark.sql.SchemaRDD.getDependencies(SchemaRDD.scala:127) at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:209) at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:207) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.dependencies(RDD.scala:207) at org.apache.spark.rdd.RDD.firstParent(RDD.scala:1278) at org.apache.spark.sql.SchemaRDD.getPartitions(SchemaRDD.scala:122) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:220) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:220) at org.apache.spark.ShuffleDependency.init(Dependency.scala:79) at org.apache.spark.rdd.ShuffledRDD.getDependencies(ShuffledRDD.scala:80) at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:209) at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:207) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.dependencies(RDD.scala:207) 
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1333) at org.apache.spark.scheduler.DAGScheduler.getPreferredLocs(DAGScheduler.scala:1304) - locked 0x0007f55c2238 (a org.apache.spark.scheduler.DAGScheduler) at org.apache.spark.SparkContext.getPreferredLocs(SparkContext.scala:1148) at org.apache.spark.rdd.PartitionCoalescer.currPrefLocs(CoalescedRDD.scala:175) at org.apache.spark.rdd.PartitionCoalescer$LocationIterator$$anonfun$5$$anonfun$apply$2.apply(CoalescedRDD.scala:192) at org.apache.spark.rdd.PartitionCoalescer$LocationIterator$$anonfun$5$$anonfun$apply$2.apply(CoalescedRDD.scala:191) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:350) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:350) at
[jira] [Commented] (SPARK-6004) Pick the best model when training GradientBoostedTrees with validation
[ https://issues.apache.org/jira/browse/SPARK-6004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14337020#comment-14337020 ] Joseph K. Bradley commented on SPARK-6004: -- Can you please add a short JIRA description stating the current issue and the proposed solution? Pick the best model when training GradientBoostedTrees with validation -- Key: SPARK-6004 URL: https://issues.apache.org/jira/browse/SPARK-6004 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Liang-Chi Hsieh Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5124) Standardize internal RPC interface
[ https://issues.apache.org/jira/browse/SPARK-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14337028#comment-14337028 ] Reynold Xin commented on SPARK-5124: I went and looked at the various use cases of Akka in the code base. Quite a few of them actually also use askWithReply. There are really two categories of message deliveries: one that the sender expects a reply, and one that doesn't. As a result, I think the following interface would make more sense: {code} // receive side: def receive(msg: RpcMessage) def receiveAndReply(msg: RpcMessage) = sys.err(...) // send side: def send(message: Object): Unit def sendWithReply[T](message: Object): Future[T] {code} There is an obvious pairing here. Messages sent via send goes to receive, without requiring an ack (although if we use the transport layer, an implicit ack can be sent). Messages sent via sendWithReply goes to receiveAndReply, and the caller needs to explicitly handle the reply. In most cases, an rpc endpoint only needs to receive messages without reply, and thus can override just receive. Standardize internal RPC interface -- Key: SPARK-5124 URL: https://issues.apache.org/jira/browse/SPARK-5124 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Reynold Xin Assignee: Shixiong Zhu Attachments: Pluggable RPC - draft 1.pdf, Pluggable RPC - draft 2.pdf In Spark we use Akka as the RPC layer. It would be great if we can standardize the internal RPC interface to facilitate testing. This will also provide the foundation to try other RPC implementations in the future. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
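To make the pairing concrete, here is a minimal Scala sketch of the proposed shape. The type and trait names (RpcMessage, RpcEndpoint, RpcEndpointRef) are placeholders lifted from the comment above, not an existing Spark API:
{code}
import scala.concurrent.Future

// Placeholder message type; a real interface would carry sender metadata, etc.
case class RpcMessage(senderAddress: String, content: Any)

// Receive side: fire-and-forget messages land in receive(); messages that
// expect an answer land in receiveAndReply(), which endpoints may leave unimplemented.
trait RpcEndpoint {
  def receive(msg: RpcMessage): Unit
  def receiveAndReply(msg: RpcMessage): Any =
    sys.error(s"${getClass.getName} does not expect messages that require a reply")
}

// Send side: send() surfaces no ack to the caller; sendWithReply() surfaces the
// reply (or failure) through a Future the caller must handle explicitly.
trait RpcEndpointRef {
  def send(message: Any): Unit
  def sendWithReply[T](message: Any): Future[T]
}
{code}
An endpoint that never needs to reply only overrides receive, which keeps the common case minimal.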
[jira] [Created] (SPARK-6014) java.io.IOException: Filesystem is thrown when ctrl+c or ctrl+d spark-sql on YARN
Cheolsoo Park created SPARK-6014: Summary: java.io.IOException: Filesystem is thrown when ctrl+c or ctrl+d spark-sql on YARN Key: SPARK-6014 URL: https://issues.apache.org/jira/browse/SPARK-6014 Project: Spark Issue Type: Bug Affects Versions: 1.3.0 Environment: Hadoop 2.4, YARN Reporter: Cheolsoo Park Priority: Minor This is a regression of SPARK-2261. In branch-1.3 and master, {{EventLoggingListener}} throws {{java.io.IOException: Filesystem closed}} when ctrl+c or ctrl+d the spark-sql shell. The root cause is that DFSClient is already shut down before EventLoggingListener invokes the following HDFS methods, and thus, DFSClient.isClientRunning() check fails- {code} Line #135: hadoopDataStream.foreach(hadoopFlushMethod.invoke(_)) Line #187: if (fileSystem.exists(target)) { {code} The followings are full stack trace- {code} java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$3.apply(EventLoggingListener.scala:135) at org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$3.apply(EventLoggingListener.scala:135) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:135) at org.apache.spark.scheduler.EventLoggingListener.onApplicationEnd(EventLoggingListener.scala:170) at org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:54) at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31) at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31) at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:53) at org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:36) at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:76) at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply(AsynchronousListenerBus.scala:61) at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply(AsynchronousListenerBus.scala:61) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1613) at org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:60) Caused by: java.io.IOException: Filesystem closed at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:707) at org.apache.hadoop.hdfs.DFSOutputStream.flushOrSync(DFSOutputStream.java:1843) at org.apache.hadoop.hdfs.DFSOutputStream.hflush(DFSOutputStream.java:1804) at org.apache.hadoop.fs.FSDataOutputStream.hflush(FSDataOutputStream.java:127) ... 
19 more {code} {code} Exception in thread Thread-3 java.io.IOException: Filesystem closed at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:707) at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1760) at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1124) at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1120) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1120) at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1398) at org.apache.spark.scheduler.EventLoggingListener.stop(EventLoggingListener.scala:187) at org.apache.spark.SparkContext$$anonfun$stop$4.apply(SparkContext.scala:1379) at org.apache.spark.SparkContext$$anonfun$stop$4.apply(SparkContext.scala:1379) at scala.Option.foreach(Option.scala:236) at org.apache.spark.SparkContext.stop(SparkContext.scala:1379) at org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.stop(SparkSQLEnv.scala:66) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$$anon$1.run(SparkSQLCLIDriver.scala:107) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
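One way to picture the failure mode: the listener's flush and stop paths call into HDFS after the shutdown hook has already closed the DFSClient. A defensive wrapper of the kind sketched below (purely illustrative, not the actual fix) would tolerate that specific error at the two call sites quoted in the description:
{code}
import java.io.IOException
import org.apache.hadoop.fs.{FileSystem, Path}

// Illustrative helper only: if the filesystem client was already closed by a
// shutdown hook, treat the existence check as "not there" instead of letting
// the IOException escape into the listener bus.
def safeExists(fileSystem: FileSystem, target: Path): Boolean =
  try {
    fileSystem.exists(target)
  } catch {
    case e: IOException if Option(e.getMessage).exists(_.contains("Filesystem closed")) =>
      false
  }
{code}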
[jira] [Commented] (SPARK-6014) java.io.IOException: Filesystem is thrown when ctrl+c or ctrl+d spark-sql on YARN
[ https://issues.apache.org/jira/browse/SPARK-6014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14337055#comment-14337055 ] Apache Spark commented on SPARK-6014: - User 'piaozhexiu' has created a pull request for this issue: https://github.com/apache/spark/pull/4771 java.io.IOException: Filesystem is thrown when ctrl+c or ctrl+d spark-sql on YARN -- Key: SPARK-6014 URL: https://issues.apache.org/jira/browse/SPARK-6014 Project: Spark Issue Type: Bug Affects Versions: 1.3.0 Environment: Hadoop 2.4, YARN Reporter: Cheolsoo Park Priority: Minor Labels: yarn This is a regression of SPARK-2261. In branch-1.3 and master, {{EventLoggingListener}} throws {{java.io.IOException: Filesystem closed}} when ctrl+c or ctrl+d the spark-sql shell. The root cause is that DFSClient is already shut down before EventLoggingListener invokes the following HDFS methods, and thus, DFSClient.isClientRunning() check fails- {code} Line #135: hadoopDataStream.foreach(hadoopFlushMethod.invoke(_)) Line #187: if (fileSystem.exists(target)) { {code} The followings are full stack trace- {code} java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$3.apply(EventLoggingListener.scala:135) at org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$3.apply(EventLoggingListener.scala:135) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:135) at org.apache.spark.scheduler.EventLoggingListener.onApplicationEnd(EventLoggingListener.scala:170) at org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:54) at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31) at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31) at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:53) at org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:36) at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:76) at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply(AsynchronousListenerBus.scala:61) at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply(AsynchronousListenerBus.scala:61) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1613) at org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:60) Caused by: java.io.IOException: Filesystem closed at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:707) at org.apache.hadoop.hdfs.DFSOutputStream.flushOrSync(DFSOutputStream.java:1843) at org.apache.hadoop.hdfs.DFSOutputStream.hflush(DFSOutputStream.java:1804) at org.apache.hadoop.fs.FSDataOutputStream.hflush(FSDataOutputStream.java:127) ... 
19 more {code} {code} Exception in thread Thread-3 java.io.IOException: Filesystem closed at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:707) at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1760) at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1124) at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1120) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1120) at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1398) at org.apache.spark.scheduler.EventLoggingListener.stop(EventLoggingListener.scala:187) at org.apache.spark.SparkContext$$anonfun$stop$4.apply(SparkContext.scala:1379) at org.apache.spark.SparkContext$$anonfun$stop$4.apply(SparkContext.scala:1379) at scala.Option.foreach(Option.scala:236) at org.apache.spark.SparkContext.stop(SparkContext.scala:1379) at org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.stop(SparkSQLEnv.scala:66) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$$anon$1.run(SparkSQLCLIDriver.scala:107) {code} -- This message was
[jira] [Commented] (SPARK-5750) Document that ordering of elements in shuffled partitions is not deterministic across runs
[ https://issues.apache.org/jira/browse/SPARK-5750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14337104#comment-14337104 ] Ilya Ganelin commented on SPARK-5750: - I'd be happy to pull those in. Is it fine to submit the PR against this issue? Document that ordering of elements in shuffled partitions is not deterministic across runs -- Key: SPARK-5750 URL: https://issues.apache.org/jira/browse/SPARK-5750 Project: Spark Issue Type: Improvement Components: Documentation Reporter: Josh Rosen The ordering of elements in shuffled partitions is not deterministic across runs. For instance, consider the following example: {code} val largeFiles = sc.textFile(...) val airlines = largeFiles.repartition(2000).cache() println(airlines.first) {code} If this code is run twice, then each run will output a different result. There is non-determinism in the shuffle read code that accounts for this: Spark's shuffle read path processes blocks as soon as they are fetched Spark uses [ShuffleBlockFetcherIterator|https://github.com/apache/spark/blob/v1.2.1/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala] to fetch shuffle data from mappers. In this code, requests for multiple blocks from the same host are batched together, so nondeterminism in where tasks are run means that the set of requests can vary across runs. In addition, there's an [explicit call|https://github.com/apache/spark/blob/v1.2.1/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala#L256] to randomize the order of the batched fetch requests. As a result, shuffle operations cannot be guaranteed to produce the same ordering of the elements in their partitions. Therefore, Spark should update its docs to clarify that the ordering of elements in shuffle RDDs' partitions is non-deterministic. Note, however, that the _set_ of elements in each partition will be deterministic: if we used {{mapPartitions}} to sort each partition, then the {{first()}} call above would produce a deterministic result. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
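To illustrate the closing note above, a small sketch (assuming a spark-shell SparkContext named sc; the input path is a placeholder): sorting within each partition makes first() stable across runs even though the fetch order of shuffle blocks is randomized.
{code}
// Placeholder path; run in spark-shell where `sc` is predefined.
val largeFiles = sc.textFile("hdfs:///data/airlines")

val airlines = largeFiles
  .repartition(2000)
  // impose a deterministic order within each shuffled partition
  .mapPartitions(iter => iter.toArray.sorted.iterator)
  .cache()

// The set of elements per partition is deterministic, and sorting removes the
// ordering nondeterminism, so this now prints the same line on every run.
println(airlines.first())
{code}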
[jira] [Commented] (SPARK-5978) Spark examples cannot compile with Hadoop 2
[ https://issues.apache.org/jira/browse/SPARK-5978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14336914#comment-14336914 ] Michael Nazario commented on SPARK-5978: I was building this with 2.0.0-cdh4.7.0. The original version I was using which had this problem also was 2.0.0-cdh4.2.0 which is one of the pre-built distributions of spark. mvn -DskipTests -Phadoop-2.4 -Dhadoop.version=2.0.0-cdh4.7.0 dependency:tree -pl examples will show [INFO] +- org.apache.hbase:hbase-server:jar:0.98.7-hadoop1:compile Spark examples cannot compile with Hadoop 2 --- Key: SPARK-5978 URL: https://issues.apache.org/jira/browse/SPARK-5978 Project: Spark Issue Type: Bug Components: Examples, PySpark Affects Versions: 1.2.0, 1.2.1 Reporter: Michael Nazario Labels: hadoop-version This is a regression from Spark 1.1.1. The Spark Examples includes an example for an avro converter for PySpark. When I was trying to debug a problem, I discovered that even though you can build with Hadoop 2 for Spark 1.2.0, an hbase dependency depends on Hadoop 1 somewhere else in the examples code. An easy fix would be to separate the examples into hadoop specific versions. Another way would be to fix the hbase dependencies so that they don't rely on hadoop 1 specific code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5930) Documented default of spark.shuffle.io.retryWait is confusing
[ https://issues.apache.org/jira/browse/SPARK-5930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14336926#comment-14336926 ] Apache Spark commented on SPARK-5930: - User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/4769 Documented default of spark.shuffle.io.retryWait is confusing - Key: SPARK-5930 URL: https://issues.apache.org/jira/browse/SPARK-5930 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.2.0 Reporter: Andrew Or Priority: Trivial The description makes it sound like the retryWait itself defaults to 15 seconds, when it's actually 5. We should clarify this by changing the wording a little... {code} <tr> <td><code>spark.shuffle.io.retryWait</code></td> <td>5</td> <td> (Netty only) Seconds to wait between retries of fetches. The maximum delay caused by retrying is simply <code>maxRetries * retryWait</code>, by default 15 seconds. </td> </tr> {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3441) Explain in docs that repartitionAndSortWithinPartitions enacts Hadoop style shuffle
[ https://issues.apache.org/jira/browse/SPARK-3441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14336948#comment-14336948 ] Sean Owen commented on SPARK-3441: -- Since another shuffle-related doc ticket came up for action (http://issues.apache.org/jira/browse/SPARK-5750) I wonder is this still live? is it just a matter of documenting something or needs a code change? Explain in docs that repartitionAndSortWithinPartitions enacts Hadoop style shuffle --- Key: SPARK-3441 URL: https://issues.apache.org/jira/browse/SPARK-3441 Project: Spark Issue Type: Improvement Components: Documentation, Spark Core Reporter: Patrick Wendell Assignee: Sandy Ryza I think it would be good to say something like this in the doc for repartitionAndSortWithinPartitions and add also maybe in the doc for groupBy: {code} This can be used to enact a Hadoop Style shuffle along with a call to mapPartitions, e.g.: rdd.repartitionAndSortWithinPartitions(part).mapPartitions(...) {code} It might also be nice to add a version that doesn't take a partitioner and/or to mention this in the groupBy javadoc. I guess it depends a bit whether we consider this to be an API we want people to use more widely or whether we just consider it a narrow stable API mostly for Hive-on-Spark. If we want people to consider this API when porting workloads from Hadoop, then it might be worth documenting better. What do you think [~rxin] and [~matei]? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
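Expanding the one-liner in the description into a fuller sketch (a spark-shell SparkContext named sc is assumed; the input path, key extraction, and reducer body are placeholders):
{code}
import org.apache.spark.HashPartitioner

// Hadoop-style shuffle: partition by key, sort by key within each partition,
// then stream each sorted partition through reducer-like logic in mapPartitions.
val pairs = sc.textFile("hdfs:///input")
  .map(line => (line.split("\t")(0), line))   // (key, record); key extraction is a placeholder

val part = new HashPartitioner(100)

val reduced = pairs
  .repartitionAndSortWithinPartitions(part)   // keys arrive sorted within each partition
  .mapPartitions { iter =>
    // placeholder "reducer": emit one count per record for its key
    iter.map { case (key, _) => (key, 1L) }
  }
{code}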
[jira] [Updated] (SPARK-5978) Spark, Examples have Hadoop1/2 compat issues with Hadoop 2.0.x (e.g. CDH4)
[ https://issues.apache.org/jira/browse/SPARK-5978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5978: - Component/s: (was: PySpark) Build Priority: Critical (was: Major) Labels: (was: hadoop-version) Raising priority since I would like to get more discussion on whether to actually address this or reconsider Hadoop version support. Spark, Examples have Hadoop1/2 compat issues with Hadoop 2.0.x (e.g. CDH4) -- Key: SPARK-5978 URL: https://issues.apache.org/jira/browse/SPARK-5978 Project: Spark Issue Type: Bug Components: Build, Examples Affects Versions: 1.2.0, 1.2.1 Reporter: Michael Nazario Priority: Critical This is a regression from Spark 1.1.1. The Spark Examples includes an example for an avro converter for PySpark. When I was trying to debug a problem, I discovered that even though you can build with Hadoop 2 for Spark 1.2.0, an hbase dependency depends on Hadoop 1 somewhere else in the examples code. An easy fix would be to separate the examples into hadoop specific versions. Another way would be to fix the hbase dependencies so that they don't rely on hadoop 1 specific code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5124) Standardize internal RPC interface
[ https://issues.apache.org/jira/browse/SPARK-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14336974#comment-14336974 ] Aaron Davidson commented on SPARK-5124: --- I tend to prefer having explicit message.reply() semantics over Akka's weird ask. This is how the Transport layer implements RPCs, for instance: https://github.com/apache/spark/blob/master/network/common/src/main/java/org/apache/spark/network/server/RpcHandler.java#L26 I might even supplement message with a reply with failure method, so exceptions can be relayed. I have not found this mechanism to be a bottleneck at many thousands of requests per second. Standardize internal RPC interface -- Key: SPARK-5124 URL: https://issues.apache.org/jira/browse/SPARK-5124 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Reynold Xin Assignee: Shixiong Zhu Attachments: Pluggable RPC - draft 1.pdf, Pluggable RPC - draft 2.pdf In Spark we use Akka as the RPC layer. It would be great if we can standardize the internal RPC interface to facilitate testing. This will also provide the foundation to try other RPC implementations in the future. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
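As a rough illustration of the explicit-reply style described above (a reply path plus a failure path, so exceptions can be relayed), loosely modelled on the transport layer's RpcHandler; all names here are illustrative, not an existing Spark API:
{code}
// Illustrative only: the callback carries both a success and a failure path,
// so the receiving endpoint can relay exceptions back to the sender.
trait RpcCallContext {
  def reply(response: Any): Unit
  def sendFailure(error: Throwable): Unit
}

class DivisionEndpoint {
  // Example handler: replies with a result, or relays the exception on failure.
  def receiveAndReply(msg: (Int, Int), ctx: RpcCallContext): Unit =
    try {
      val (num, den) = msg
      ctx.reply(num / den)
    } catch {
      case e: ArithmeticException => ctx.sendFailure(e)
    }
}
{code}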
[jira] [Commented] (SPARK-6004) Pick the best model when training GradientBoostedTrees with validation
[ https://issues.apache.org/jira/browse/SPARK-6004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14337025#comment-14337025 ] Joseph K. Bradley commented on SPARK-6004: -- The point of the previous PR introducing validation was to allow early stopping. We should keep early stopping as an option, but I do think your JIRA/PR bring up one good point: Even if we never stop early, it may make more sense to return the best model, rather than the last model. (However, some users may want the full model since they went to all of that trouble to train it.) Ping [~MechCoder] --- what do you think? I'd vote for: * allow early stopping based on validationTol * return the best model instead of the full model (if we do not stop early while doing validation) Pick the best model when training GradientBoostedTrees with validation -- Key: SPARK-6004 URL: https://issues.apache.org/jira/browse/SPARK-6004 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Liang-Chi Hsieh Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
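A tiny sketch of the behaviour being proposed, not the MLlib implementation: keep the model with the lowest validation error while still honoring validationTol for early stopping. The function arguments are placeholders standing in for one boosting step and the validation-error computation.
{code}
// Illustrative pseudo-loop: track the best model seen so far, and stop early
// when the validation error stops improving by more than validationTol.
def trainWithValidation[M](
    numIterations: Int,
    validationTol: Double,
    boostOneIteration: Int => M,
    validationError: M => Double): M = {
  var best = boostOneIteration(0)
  var bestErr = validationError(best)
  var prevErr = bestErr
  var i = 1
  var stop = false
  while (i < numIterations && !stop) {
    val model = boostOneIteration(i)
    val err = validationError(model)
    if (err < bestErr) { best = model; bestErr = err }
    if (prevErr - err < validationTol) stop = true  // improvement too small: stop early
    prevErr = err
    i += 1
  }
  best  // the best model seen, not necessarily the last one trained
}
{code}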
[jira] [Created] (SPARK-6013) Add more Python ML examples for spark.ml
Joseph K. Bradley created SPARK-6013: Summary: Add more Python ML examples for spark.ml Key: SPARK-6013 URL: https://issues.apache.org/jira/browse/SPARK-6013 Project: Spark Issue Type: Improvement Components: ML, PySpark Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Now that the spark.ml Pipelines API is supported within Python, we should duplicate the remaining Scala/Java spark.ml examples within Python. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5124) Standardize internal RPC interface
[ https://issues.apache.org/jira/browse/SPARK-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14337028#comment-14337028 ] Reynold Xin edited comment on SPARK-5124 at 2/25/15 7:29 PM: - I went and looked at the various use cases of Akka in the code base. Quite a few of them actually also use askWithReply. There are really two categories of message deliveries: one that the sender expects a reply, and one that doesn't. As a result, I think the following interface would make more sense: {code} // receive side: def receive(msg: RpcMessage) def receiveAndReply(msg: RpcMessage) = sys.err(...) // send side: def send(message: Object): Unit def sendWithReply[T](message: Object): Future[T] {code} There is an obvious pairing here. Messages sent via send goes to receive, without requiring an ack (although if we use the transport layer, an implicit ack can be sent). Messages sent via sendWithReply goes to receiveAndReply, and the caller needs to explicitly handle the reply. In most cases, an rpc endpoint only needs to receive messages without reply, and thus can override just receive. By making obvious distinctions between the two, I think we mitigate the risk of an end point not properly responding to a message, leading to memory leaks. was (Author: rxin): I went and looked at the various use cases of Akka in the code base. Quite a few of them actually also use askWithReply. There are really two categories of message deliveries: one that the sender expects a reply, and one that doesn't. As a result, I think the following interface would make more sense: {code} // receive side: def receive(msg: RpcMessage) def receiveAndReply(msg: RpcMessage) = sys.err(...) // send side: def send(message: Object): Unit def sendWithReply[T](message: Object): Future[T] {code} There is an obvious pairing here. Messages sent via send goes to receive, without requiring an ack (although if we use the transport layer, an implicit ack can be sent). Messages sent via sendWithReply goes to receiveAndReply, and the caller needs to explicitly handle the reply. In most cases, an rpc endpoint only needs to receive messages without reply, and thus can override just receive. Standardize internal RPC interface -- Key: SPARK-5124 URL: https://issues.apache.org/jira/browse/SPARK-5124 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Reynold Xin Assignee: Shixiong Zhu Attachments: Pluggable RPC - draft 1.pdf, Pluggable RPC - draft 2.pdf In Spark we use Akka as the RPC layer. It would be great if we can standardize the internal RPC interface to facilitate testing. This will also provide the foundation to try other RPC implementations in the future. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6011) Out of disk space due to Spark not deleting shuffle files of lost executors
pankaj arora created SPARK-6011: --- Summary: Out of disk space due to Spark not deleting shuffle files of lost executors Key: SPARK-6011 URL: https://issues.apache.org/jira/browse/SPARK-6011 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.1 Environment: Running Spark in Yarn-Client mode Reporter: pankaj arora Fix For: 1.3.1 If executors are lost abruptly, Spark does not delete their shuffle files until the application ends. Ours is a long-running application that serves requests received through REST APIs, and if any executor is lost, its shuffle files are not deleted, which leads to the local disk running out of space. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6011) Out of disk space due to Spark not deleting shuffle files of lost executors
[ https://issues.apache.org/jira/browse/SPARK-6011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14336950#comment-14336950 ] Apache Spark commented on SPARK-6011: - User 'pankajarora12' has created a pull request for this issue: https://github.com/apache/spark/pull/4770 Out of disk space due to Spark not deleting shuffle files of lost executors --- Key: SPARK-6011 URL: https://issues.apache.org/jira/browse/SPARK-6011 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.1 Environment: Running Spark in Yarn-Client mode Reporter: pankaj arora Fix For: 1.3.1 If executors are lost abruptly, Spark does not delete their shuffle files until the application ends. Ours is a long-running application that serves requests received through REST APIs, and if any executor is lost, its shuffle files are not deleted, which leads to the local disk running out of space. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5978) Spark examples cannot compile with Hadoop 2
[ https://issues.apache.org/jira/browse/SPARK-5978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14336961#comment-14336961 ] Sean Owen commented on SPARK-5978: -- Ah right, the key is 2.0.x. This is a subset of the discussion at http://apache-spark-developers-list.1001551.n3.nabble.com/The-default-CDH4-build-uses-avro-mapred-hadoop1-td10699.html really. If I may, I'm going to update the title to broaden it, since I think there is more that doesn't work with Hadoop 2.0.x versions. Although as you can see my own opinion was that it maybe not worth supporting at this point, I think it's still open for discussion. Spark examples cannot compile with Hadoop 2 --- Key: SPARK-5978 URL: https://issues.apache.org/jira/browse/SPARK-5978 Project: Spark Issue Type: Bug Components: Examples, PySpark Affects Versions: 1.2.0, 1.2.1 Reporter: Michael Nazario Labels: hadoop-version This is a regression from Spark 1.1.1. The Spark Examples includes an example for an avro converter for PySpark. When I was trying to debug a problem, I discovered that even though you can build with Hadoop 2 for Spark 1.2.0, an hbase dependency depends on Hadoop 1 somewhere else in the examples code. An easy fix would be to separate the examples into hadoop specific versions. Another way would be to fix the hbase dependencies so that they don't rely on hadoop 1 specific code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5124) Standardize internal RPC interface
[ https://issues.apache.org/jira/browse/SPARK-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-5124: --- Target Version/s: 1.4.0 (was: 1.3.0) Standardize internal RPC interface -- Key: SPARK-5124 URL: https://issues.apache.org/jira/browse/SPARK-5124 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Reynold Xin Assignee: Shixiong Zhu Attachments: Pluggable RPC - draft 1.pdf, Pluggable RPC - draft 2.pdf In Spark we use Akka as the RPC layer. It would be great if we can standardize the internal RPC interface to facilitate testing. This will also provide the foundation to try other RPC implementations in the future. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5996) DataFrame.collect() doesn't recognize UDTs
[ https://issues.apache.org/jira/browse/SPARK-5996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-5996. -- Resolution: Fixed Fix Version/s: 1.3.0 DataFrame.collect() doesn't recognize UDTs -- Key: SPARK-5996 URL: https://issues.apache.org/jira/browse/SPARK-5996 Project: Spark Issue Type: Bug Components: MLlib, SQL Affects Versions: 1.3.0 Reporter: Xiangrui Meng Assignee: Michael Armbrust Priority: Blocker Fix For: 1.3.0 {code} import org.apache.spark.mllib.linalg._ case class Test(data: Vector) val df = sqlContext.createDataFrame(Seq(Test(Vectors.dense(1.0, 2.0 df.collect()[0].getAs[Vector](0) {code} throws an exception. `collect()` returns `Row` instead of `Vector`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
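The snippet in the description appears to have lost its closing parentheses in transit; a runnable rendering of the same repro (assuming a spark-shell SQLContext named sqlContext, and that only parentheses and Scala indexing syntax were dropped) would be:
{code}
import org.apache.spark.mllib.linalg._

case class Test(data: Vector)

// Per the description, collect() handed back a Row instead of the Vector,
// so the getAs cast below threw before the fix.
val df = sqlContext.createDataFrame(Seq(Test(Vectors.dense(1.0, 2.0))))
df.collect()(0).getAs[Vector](0)
{code}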
[jira] [Created] (SPARK-6012) Deadlock when asking for SchemaRDD partitions with TakeOrdered operator
Max Seiden created SPARK-6012: - Summary: Deadlock when asking for SchemaRDD partitions with TakeOrdered operator Key: SPARK-6012 URL: https://issues.apache.org/jira/browse/SPARK-6012 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.1 Reporter: Max Seiden Priority: Critical h3. Summary I've found that a deadlock occurs when asking for the partitions from a SchemaRDD that has a TakeOrdered as its terminal operator. The problem occurs when a child RDDs asks the DAGScheduler for preferred partition locations (which locks the scheduler) and eventually hits the #execute() of the TakeOrdered operator, which submits tasks but is blocked when it also tries to get preferred locations (in a separate thread). It seems like the TakeOrdered op's #execute() method should not actually submit a task (it is calling #executeCollect() and creating a new RDD) and should instead stay more true to the comment a logically apply a Limit on top of a Sort. h3. Stack Traces h4. Task Submission {noformat} main prio=5 tid=0x7f8e7280 nid=0x1303 in Object.wait() [0x00010ed5e000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on 0x0007c4c239b8 (a org.apache.spark.scheduler.JobWaiter) at java.lang.Object.wait(Object.java:503) at org.apache.spark.scheduler.JobWaiter.awaitResult(JobWaiter.scala:73) - locked 0x0007c4c239b8 (a org.apache.spark.scheduler.JobWaiter) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:514) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1321) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1390) at org.apache.spark.rdd.RDD.reduce(RDD.scala:884) at org.apache.spark.rdd.RDD.takeOrdered(RDD.scala:1161) at org.apache.spark.sql.execution.TakeOrdered.executeCollect(basicOperators.scala:183) at org.apache.spark.sql.execution.TakeOrdered.execute(basicOperators.scala:188) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:425) - locked 0x0007c36ce038 (a org.apache.spark.sql.hive.HiveContext$$anon$7) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:425) at org.apache.spark.sql.SchemaRDD.getDependencies(SchemaRDD.scala:127) at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:209) at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:207) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.dependencies(RDD.scala:207) at org.apache.spark.rdd.RDD.firstParent(RDD.scala:1278) at org.apache.spark.sql.SchemaRDD.getPartitions(SchemaRDD.scala:122) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:220) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:220) at org.apache.spark.ShuffleDependency.init(Dependency.scala:79) at org.apache.spark.rdd.ShuffledRDD.getDependencies(ShuffledRDD.scala:80) at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:209) at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:207) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.dependencies(RDD.scala:207) at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1333) at org.apache.spark.scheduler.DAGScheduler.getPreferredLocs(DAGScheduler.scala:1304) - locked 0x0007f55c2238 (a org.apache.spark.scheduler.DAGScheduler) at org.apache.spark.SparkContext.getPreferredLocs(SparkContext.scala:1148) at org.apache.spark.rdd.PartitionCoalescer.currPrefLocs(CoalescedRDD.scala:175) at org.apache.spark.rdd.PartitionCoalescer$LocationIterator$$anonfun$5$$anonfun$apply$2.apply(CoalescedRDD.scala:192) at org.apache.spark.rdd.PartitionCoalescer$LocationIterator$$anonfun$5$$anonfun$apply$2.apply(CoalescedRDD.scala:191) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:350) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:350) at
[jira] [Updated] (SPARK-6016) Cannot read the parquet table after overwriting the existing table when spark.sql.parquet.cacheMetadata=true
[ https://issues.apache.org/jira/browse/SPARK-6016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-6016: Description: saveAsTable is fine and seems we have successfully deleted the old data and written the new data. However, when reading the newly created table, an error will be thrown. {code} Error in SQL statement: java.lang.RuntimeException: java.lang.RuntimeException: could not merge metadata: key org.apache.spark.sql.parquet.row.metadata has conflicting values: at parquet.hadoop.api.InitContext.getMergedKeyValueMetaData(InitContext.java:67) at parquet.hadoop.api.ReadSupport.init(ReadSupport.java:84) at org.apache.spark.sql.parquet.FilteringParquetRowInputFormat.getSplits(ParquetTableOperations.scala:469) at parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:245) at org.apache.spark.sql.parquet.ParquetRelation2$$anon$1.getPartitions(newParquet.scala:461) ... {code} If I set spark.sql.parquet.cacheMetadata to false, it's fine to query the data. Note: the newly created table needs to have more than one file to trigger the bug (if there is only a single file, we will not need to merge metadata). was: saveAsTable is fine and seems we have successfully deleted the old data and written the new data. However, when reading the newly created table, an error will be thrown. {code} Error in SQL statement: java.lang.RuntimeException: java.lang.RuntimeException: could not merge metadata: key org.apache.spark.sql.parquet.row.metadata has conflicting values: at parquet.hadoop.api.InitContext.getMergedKeyValueMetaData(InitContext.java:67) at parquet.hadoop.api.ReadSupport.init(ReadSupport.java:84) at org.apache.spark.sql.parquet.FilteringParquetRowInputFormat.getSplits(ParquetTableOperations.scala:469) at parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:245) at org.apache.spark.sql.parquet.ParquetRelation2$$anon$1.getPartitions(newParquet.scala:461) ... {code} If I set spark.sql.parquet.cacheMetadata to false, it's fine to query the data. Cannot read the parquet table after overwriting the existing table when spark.sql.parquet.cacheMetadata=true Key: SPARK-6016 URL: https://issues.apache.org/jira/browse/SPARK-6016 Project: Spark Issue Type: Bug Components: SQL Reporter: Yin Huai Priority: Blocker saveAsTable is fine and seems we have successfully deleted the old data and written the new data. However, when reading the newly created table, an error will be thrown. {code} Error in SQL statement: java.lang.RuntimeException: java.lang.RuntimeException: could not merge metadata: key org.apache.spark.sql.parquet.row.metadata has conflicting values: at parquet.hadoop.api.InitContext.getMergedKeyValueMetaData(InitContext.java:67) at parquet.hadoop.api.ReadSupport.init(ReadSupport.java:84) at org.apache.spark.sql.parquet.FilteringParquetRowInputFormat.getSplits(ParquetTableOperations.scala:469) at parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:245) at org.apache.spark.sql.parquet.ParquetRelation2$$anon$1.getPartitions(newParquet.scala:461) ... {code} If I set spark.sql.parquet.cacheMetadata to false, it's fine to query the data. Note: the newly created table needs to have more than one file to trigger the bug (if there is only a single file, we will not need to merge metadata). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
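A rough repro sketch of the scenario in the description. The saveAsTable overload and SaveMode usage are assumed from the 1.3-era DataFrame API and should be treated as illustrative; a HiveContext named sqlContext is assumed, and repartition(4) is there only to produce the multiple files needed to trigger the metadata merge.
{code}
import org.apache.spark.sql.SaveMode

case class Rec(id: Int, name: String)

// Write the table twice so the second write overwrites existing data, then
// read it back; with spark.sql.parquet.cacheMetadata=true the read fails.
val data = sqlContext
  .createDataFrame((1 to 1000).map(i => Rec(i, s"name-$i")))
  .repartition(4)                                     // ensure more than one Parquet file

data.saveAsTable("t", "parquet", SaveMode.Overwrite)  // initial table
data.saveAsTable("t", "parquet", SaveMode.Overwrite)  // overwrite the existing table
sqlContext.table("t").count()                         // throws "could not merge metadata"
{code}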
[jira] [Commented] (SPARK-5750) Document that ordering of elements in shuffled partitions is not deterministic across runs
[ https://issues.apache.org/jira/browse/SPARK-5750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14337116#comment-14337116 ] Sean Owen commented on SPARK-5750: -- Yes, although on some of those I'm not as clear what the resolution is. If it's going to extend the scope of this beyond a paragraph in the programming guide, they're best dealt with separately. Document that ordering of elements in shuffled partitions is not deterministic across runs -- Key: SPARK-5750 URL: https://issues.apache.org/jira/browse/SPARK-5750 Project: Spark Issue Type: Improvement Components: Documentation Reporter: Josh Rosen The ordering of elements in shuffled partitions is not deterministic across runs. For instance, consider the following example: {code} val largeFiles = sc.textFile(...) val airlines = largeFiles.repartition(2000).cache() println(airlines.first) {code} If this code is run twice, then each run will output a different result. There is non-determinism in the shuffle read code that accounts for this: Spark's shuffle read path processes blocks as soon as they are fetched Spark uses [ShuffleBlockFetcherIterator|https://github.com/apache/spark/blob/v1.2.1/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala] to fetch shuffle data from mappers. In this code, requests for multiple blocks from the same host are batched together, so nondeterminism in where tasks are run means that the set of requests can vary across runs. In addition, there's an [explicit call|https://github.com/apache/spark/blob/v1.2.1/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala#L256] to randomize the order of the batched fetch requests. As a result, shuffle operations cannot be guaranteed to produce the same ordering of the elements in their partitions. Therefore, Spark should update its docs to clarify that the ordering of elements in shuffle RDDs' partitions is non-deterministic. Note, however, that the _set_ of elements in each partition will be deterministic: if we used {{mapPartitions}} to sort each partition, then the {{first()}} call above would produce a deterministic result. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6011) Out of disk space due to Spark not deleting shuffle files of lost executors
[ https://issues.apache.org/jira/browse/SPARK-6011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6011: - Component/s: (was: Spark Core) YARN Target Version/s: (was: 1.3.1) Fix Version/s: (was: 1.3.1) So from the PR discussion, I don't believe the proposed change can proceed. I am not sure it gets at the underlying issue here either, which is a concern that has been raised, for example, in https://issues.apache.org/jira/browse/SPARK-5836 . I do think it's worth tracking a) at least documenting this, and Marcelo's suggestion that maybe the block manager can later do more proactive cleanup. Have you used {{spark.cleaner.ttl}} by the way? I'm not sure if it helps in this case. I'd like to focus the discussion in one place so would prefer to resolve this as a duplicate and merge into SPARK-5836 Out of disk space due to Spark not deleting shuffle files of lost executors --- Key: SPARK-6011 URL: https://issues.apache.org/jira/browse/SPARK-6011 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.1 Environment: Running Spark in Yarn-Client mode Reporter: pankaj arora If Executors gets lost abruptly spark does not delete its shuffle files till application ends. Ours is long running application which is serving requests received through REST APIs and if any of the executor gets lost shuffle files are not deleted and that leads to local disk going out of space. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5836) Highlight in Spark documentation that by default Spark does not delete its temporary files
[ https://issues.apache.org/jira/browse/SPARK-5836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14337128#comment-14337128 ] Sean Owen commented on SPARK-5836: -- I'd like to take this up, since I've heard versions of this come up frequently lately. A first step is indeed improving documentation. I want to confirm or deny things I only sort of know about how temp files are treated. - Temp files/dirs created by executors may live as long as the executors, but should be deleted with executors? - Shuffle files however may live longer? - {{spark.cleaner.ttl}} is relevant to this or no? If we believe that temp files die when they should (er, well, [~vanzin] is fixing a few things around temp dirs right now), then is the surprising thing here the life of shuffle files? In which case maybe [~ilganeli] can cover this when writing up some basics about how the shuffle works? But I want to figure out definitively what the right thing is to say about behavior right now, even if the behavior should or could be different in the future. CC [~sandyr] Highlight in Spark documentation that by default Spark does not delete its temporary files -- Key: SPARK-5836 URL: https://issues.apache.org/jira/browse/SPARK-5836 Project: Spark Issue Type: Improvement Components: Documentation Reporter: Tomasz Dudziak We recently learnt the hard way (in a prod system) that Spark by default does not delete its temporary files until it is stopped. WIthin a relatively short time span of heavy Spark use the disk of our prod machine filled up completely because of multiple shuffle files written to it. We think there should be better documentation around the fact that after a job is finished it leaves a lot of rubbish behind so that this does not come as a surprise. Probably a good place to highlight that fact would be the documentation of {{spark.local.dir}} property, which controls where Spark temporary files are written. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
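For reference while the documentation question is sorted out, the knobs mentioned in this thread are ordinary Spark configuration entries; a minimal sketch with illustrative values:
{code}
import org.apache.spark.{SparkConf, SparkContext}

// Values are illustrative. spark.local.dir controls where temp/shuffle files
// are written; spark.cleaner.ttl (seconds) enables periodic cleanup of old
// metadata and shuffle data for long-running applications.
val conf = new SparkConf()
  .setAppName("long-running-service")
  .set("spark.local.dir", "/mnt/bigdisk/spark-tmp")
  .set("spark.cleaner.ttl", "3600")
val sc = new SparkContext(conf)
{code}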
[jira] [Created] (SPARK-6016) Cannot read the parquet table after overwriting the existing table when spark.sql.parquet.cacheMetadata=true
Yin Huai created SPARK-6016: --- Summary: Cannot read the parquet table after overwriting the existing table when spark.sql.parquet.cacheMetadata=true Key: SPARK-6016 URL: https://issues.apache.org/jira/browse/SPARK-6016 Project: Spark Issue Type: Bug Components: SQL Reporter: Yin Huai Priority: Blocker saveAsTable is fine and seems we have successfully deleted the old data and written the new data. However, when reading the newly created table, an error will be thrown. {code} Error in SQL statement: java.lang.RuntimeException: java.lang.RuntimeException: could not merge metadata: key org.apache.spark.sql.parquet.row.metadata has conflicting values: at parquet.hadoop.api.InitContext.getMergedKeyValueMetaData(InitContext.java:67) at parquet.hadoop.api.ReadSupport.init(ReadSupport.java:84) at org.apache.spark.sql.parquet.FilteringParquetRowInputFormat.getSplits(ParquetTableOperations.scala:469) at parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:245) at org.apache.spark.sql.parquet.ParquetRelation2$$anon$1.getPartitions(newParquet.scala:461) ... {code} If I set spark.sql.parquet.cacheMetadata to false, it's fine to query the data. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6016) Cannot read the parquet table after overwriting the existing table when spark.sql.parquet.cacheMetadata=true
[ https://issues.apache.org/jira/browse/SPARK-6016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14337189#comment-14337189 ] Yin Huai commented on SPARK-6016: - cc [~lian cheng] Cannot read the parquet table after overwriting the existing table when spark.sql.parquet.cacheMetadata=true Key: SPARK-6016 URL: https://issues.apache.org/jira/browse/SPARK-6016 Project: Spark Issue Type: Bug Components: SQL Reporter: Yin Huai Priority: Blocker saveAsTable is fine and seems we have successfully deleted the old data and written the new data. However, when reading the newly created table, an error will be thrown. {code} Error in SQL statement: java.lang.RuntimeException: java.lang.RuntimeException: could not merge metadata: key org.apache.spark.sql.parquet.row.metadata has conflicting values: at parquet.hadoop.api.InitContext.getMergedKeyValueMetaData(InitContext.java:67) at parquet.hadoop.api.ReadSupport.init(ReadSupport.java:84) at org.apache.spark.sql.parquet.FilteringParquetRowInputFormat.getSplits(ParquetTableOperations.scala:469) at parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:245) at org.apache.spark.sql.parquet.ParquetRelation2$$anon$1.getPartitions(newParquet.scala:461) ... {code} If I set spark.sql.parquet.cacheMetadata to false, it's fine to query the data. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6015) Python docs' source code links are all broken
[ https://issues.apache.org/jira/browse/SPARK-6015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14337247#comment-14337247 ] Joseph K. Bradley commented on SPARK-6015: -- That would be great; would you mind making a PR for that? I'll modify this JIRA to be for that backport. Python docs' source code links are all broken - Key: SPARK-6015 URL: https://issues.apache.org/jira/browse/SPARK-6015 Project: Spark Issue Type: Documentation Components: Documentation, PySpark Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Davies Liu Priority: Minor Fix For: 1.3.0 The Python docs display {code}[source]{code} links which should link to source code, but none work in the documentation provided on the Apache Spark website. E.g., go here [https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#module-pyspark.mllib.regression] and click a source code link on the right. I believe these are generated because we have viewcode set in the doc conf file: [https://github.com/apache/spark/blob/master/python/docs/conf.py#L33] I'm not sure why the source code is not being generated/posted. Should the links be removed, or do we want to add the source code? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6015) Backport Python doc source code link fix to 1.2
[ https://issues.apache.org/jira/browse/SPARK-6015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-6015: - Description: The Python docs display {code}[source]{code} links which should link to source code, but none work in the documentation provided on the Apache Spark website. E.g., go here [https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#module-pyspark.mllib.regression] and click a source code link on the right. The PR for [SPARK-5994] included a fix of this for 1.3 which should be backported to 1.2. was: The Python docs display {code}[source]{code} links which should link to source code, but none work in the documentation provided on the Apache Spark website. E.g., go here [https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#module-pyspark.mllib.regression] and click a source code link on the right. I believe these are generated because we have viewcode set in the doc conf file: [https://github.com/apache/spark/blob/master/python/docs/conf.py#L33] I'm not sure why the source code is not being generated/posted. Should the links be removed, or do we want to add the source code? Backport Python doc source code link fix to 1.2 --- Key: SPARK-6015 URL: https://issues.apache.org/jira/browse/SPARK-6015 Project: Spark Issue Type: Documentation Components: Documentation, PySpark Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Davies Liu Priority: Minor Fix For: 1.3.0 The Python docs display {code}[source]{code} links which should link to source code, but none work in the documentation provided on the Apache Spark website. E.g., go here [https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#module-pyspark.mllib.regression] and click a source code link on the right. The PR for [SPARK-5994] included a fix of this for 1.3 which should be backported to 1.2. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6015) Backport Python doc source code link fix to 1.2
[ https://issues.apache.org/jira/browse/SPARK-6015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-6015: - Summary: Backport Python doc source code link fix to 1.2 (was: Python docs' source code links are all broken) Backport Python doc source code link fix to 1.2 --- Key: SPARK-6015 URL: https://issues.apache.org/jira/browse/SPARK-6015 Project: Spark Issue Type: Documentation Components: Documentation, PySpark Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Davies Liu Priority: Minor Fix For: 1.3.0 The Python docs display {code}[source]{code} links which should link to source code, but none work in the documentation provided on the Apache Spark website. E.g., go here [https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#module-pyspark.mllib.regression] and click a source code link on the right. I believe these are generated because we have viewcode set in the doc conf file: [https://github.com/apache/spark/blob/master/python/docs/conf.py#L33] I'm not sure why the source code is not being generated/posted. Should the links be removed, or do we want to add the source code? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6015) Backport Python doc source code link fix to 1.2
[ https://issues.apache.org/jira/browse/SPARK-6015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-6015: - Affects Version/s: (was: 1.3.0) 1.2.1 Backport Python doc source code link fix to 1.2 --- Key: SPARK-6015 URL: https://issues.apache.org/jira/browse/SPARK-6015 Project: Spark Issue Type: Documentation Components: Documentation, PySpark Affects Versions: 1.2.1 Reporter: Joseph K. Bradley Assignee: Davies Liu Priority: Minor The Python docs display {code}[source]{code} links which should link to source code, but none work in the documentation provided on the Apache Spark website. E.g., go here [https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#module-pyspark.mllib.regression] and click a source code link on the right. The PR for [SPARK-5994] included a fix of this for 1.3 which should be backported to 1.2. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-6015) Backport Python doc source code link fix to 1.2
[ https://issues.apache.org/jira/browse/SPARK-6015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley reopened SPARK-6015: -- Backport Python doc source code link fix to 1.2 --- Key: SPARK-6015 URL: https://issues.apache.org/jira/browse/SPARK-6015 Project: Spark Issue Type: Documentation Components: Documentation, PySpark Affects Versions: 1.2.1 Reporter: Joseph K. Bradley Assignee: Davies Liu Priority: Minor The Python docs display {code}[source]{code} links which should link to source code, but none work in the documentation provided on the Apache Spark website. E.g., go here [https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#module-pyspark.mllib.regression] and click a source code link on the right. The PR for [SPARK-5994] included a fix of this for 1.3 which should be backported to 1.2. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5975) SparkSubmit --jars not present on driver in python
[ https://issues.apache.org/jira/browse/SPARK-5975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-5975: - Component/s: (was: Spark Core) PySpark SparkSubmit --jars not present on driver in python -- Key: SPARK-5975 URL: https://issues.apache.org/jira/browse/SPARK-5975 Project: Spark Issue Type: Bug Components: PySpark, Spark Submit Affects Versions: 1.3.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Critical Reported by [~tdas]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4845) Adding a parallelismRatio to control the partitions num of shuffledRDD
[ https://issues.apache.org/jira/browse/SPARK-4845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-4845. -- Resolution: Won't Fix Fix Version/s: (was: 1.3.0) Target Version/s: (was: 1.3.0) See PR discussion. Adding a parallelismRatio to control the partitions num of shuffledRDD -- Key: SPARK-4845 URL: https://issues.apache.org/jira/browse/SPARK-4845 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.1.0 Reporter: Fei Wang Add a parallelismRatio to control the number of partitions of a shuffled RDD. The rule is: Math.max(1, parallelismRatio * number of partitions of the largest upstream RDD). The ratio defaults to 1.0 to stay compatible with the old behavior; once we have more experience with it, we can change the default. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
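For illustration, a sketch of the proposed rule; the configuration key name {{spark.default.parallelismRatio}} and the helper function are hypothetical, not part of the PR:
{code}
import org.apache.spark.SparkConf

// Hypothetical helper showing the proposed rule:
// partitions = max(1, ratio * partitions of the largest upstream RDD)
def defaultShufflePartitions(conf: SparkConf, upstreamPartitionCounts: Seq[Int]): Int = {
  val ratio = conf.getDouble("spark.default.parallelismRatio", 1.0) // hypothetical key, defaults to 1.0
  math.max(1, (ratio * upstreamPartitionCounts.max).toInt)
}
{code}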
[jira] [Updated] (SPARK-6014) java.io.IOException: Filesystem is thrown when ctrl+c or ctrl+d spark-sql on YARN
[ https://issues.apache.org/jira/browse/SPARK-6014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6014: - Component/s: YARN java.io.IOException: Filesystem is thrown when ctrl+c or ctrl+d spark-sql on YARN -- Key: SPARK-6014 URL: https://issues.apache.org/jira/browse/SPARK-6014 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.3.0 Environment: Hadoop 2.4, YARN Reporter: Cheolsoo Park Priority: Minor Labels: yarn This is a regression of SPARK-2261. In branch-1.3 and master, {{EventLoggingListener}} throws {{java.io.IOException: Filesystem closed}} when ctrl+c or ctrl+d the spark-sql shell. The root cause is that DFSClient is already shut down before EventLoggingListener invokes the following HDFS methods, and thus, DFSClient.isClientRunning() check fails- {code} Line #135: hadoopDataStream.foreach(hadoopFlushMethod.invoke(_)) Line #187: if (fileSystem.exists(target)) { {code} The followings are full stack trace- {code} java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$3.apply(EventLoggingListener.scala:135) at org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$3.apply(EventLoggingListener.scala:135) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:135) at org.apache.spark.scheduler.EventLoggingListener.onApplicationEnd(EventLoggingListener.scala:170) at org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:54) at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31) at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31) at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:53) at org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:36) at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:76) at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply(AsynchronousListenerBus.scala:61) at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply(AsynchronousListenerBus.scala:61) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1613) at org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:60) Caused by: java.io.IOException: Filesystem closed at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:707) at org.apache.hadoop.hdfs.DFSOutputStream.flushOrSync(DFSOutputStream.java:1843) at org.apache.hadoop.hdfs.DFSOutputStream.hflush(DFSOutputStream.java:1804) at org.apache.hadoop.fs.FSDataOutputStream.hflush(FSDataOutputStream.java:127) ... 
19 more {code} {code} Exception in thread Thread-3 java.io.IOException: Filesystem closed at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:707) at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1760) at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1124) at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1120) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1120) at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1398) at org.apache.spark.scheduler.EventLoggingListener.stop(EventLoggingListener.scala:187) at org.apache.spark.SparkContext$$anonfun$stop$4.apply(SparkContext.scala:1379) at org.apache.spark.SparkContext$$anonfun$stop$4.apply(SparkContext.scala:1379) at scala.Option.foreach(Option.scala:236) at org.apache.spark.SparkContext.stop(SparkContext.scala:1379) at org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.stop(SparkSQLEnv.scala:66) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$$anon$1.run(SparkSQLCLIDriver.scala:107) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To
[jira] [Updated] (SPARK-5982) Remove Local Read Time
[ https://issues.apache.org/jira/browse/SPARK-5982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5982: - Component/s: Spark Core Remove Local Read Time -- Key: SPARK-5982 URL: https://issues.apache.org/jira/browse/SPARK-5982 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Kay Ousterhout Assignee: Kay Ousterhout Priority: Blocker LocalReadTime was added to TaskMetrics for the 1.3.0 release. However, this time is actually only a small subset of the local read time, because local shuffle files are memory mapped, so most of the read time occurs later, as data is read from the memory mapped files and the data actually gets read from disk. We should remove this before the 1.3.0 release, so we never expose this incomplete info. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5949) Driver program has to register roaring bitmap classes used by spark with Kryo when number of partitions is greater than 2000
[ https://issues.apache.org/jira/browse/SPARK-5949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5949: - Component/s: Spark Core Driver program has to register roaring bitmap classes used by spark with Kryo when number of partitions is greater than 2000 Key: SPARK-5949 URL: https://issues.apache.org/jira/browse/SPARK-5949 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Peter Torok Labels: kryo, partitioning, serialization When more than 2000 partitions are being used with Kryo, the following classes need to be registered by driver program: - org.apache.spark.scheduler.HighlyCompressedMapStatus - org.roaringbitmap.RoaringBitmap - org.roaringbitmap.RoaringArray - org.roaringbitmap.ArrayContainer - org.roaringbitmap.RoaringArray$Element - org.roaringbitmap.RoaringArray$Element[] - short[] Our project doesn't have dependency on roaring bitmap and HighlyCompressedMapStatus is intended for internal spark usage. Spark should take care of this registration when Kryo is used. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
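Until Spark handles this registration itself, a sketch of the manual workaround the description implies; the class list is taken from the report, and the internal classes are loaded by name because they are not part of Spark's public API:
{code}
import org.apache.spark.SparkConf

// Workaround sketch: register the classes from the driver program when using Kryo.
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array[Class[_]](
    Class.forName("org.apache.spark.scheduler.HighlyCompressedMapStatus"),
    classOf[org.roaringbitmap.RoaringBitmap],
    classOf[org.roaringbitmap.RoaringArray],
    classOf[org.roaringbitmap.ArrayContainer],
    Class.forName("org.roaringbitmap.RoaringArray$Element"),
    Class.forName("[Lorg.roaringbitmap.RoaringArray$Element;"),
    classOf[Array[Short]]
  ))
{code}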
[jira] [Assigned] (SPARK-6015) Python docs' source code links are all broken
[ https://issues.apache.org/jira/browse/SPARK-6015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu reassigned SPARK-6015: - Assignee: Davies Liu Python docs' source code links are all broken - Key: SPARK-6015 URL: https://issues.apache.org/jira/browse/SPARK-6015 Project: Spark Issue Type: Documentation Components: Documentation, PySpark Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Davies Liu Priority: Minor The Python docs display {code}[source]{code} links which should link to source code, but none work in the documentation provided on the Apache Spark website. E.g., go here [https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#module-pyspark.mllib.regression] and click a source code link on the right. I believe these are generated because we have viewcode set in the doc conf file: [https://github.com/apache/spark/blob/master/python/docs/conf.py#L33] I'm not sure why the source code is not being generated/posted. Should the links be removed, or do we want to add the source code? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6015) Python docs' source code links are all broken
[ https://issues.apache.org/jira/browse/SPARK-6015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14337240#comment-14337240 ] Davies Liu commented on SPARK-6015: --- Should we backport that fix into 1.2? Python docs' source code links are all broken - Key: SPARK-6015 URL: https://issues.apache.org/jira/browse/SPARK-6015 Project: Spark Issue Type: Documentation Components: Documentation, PySpark Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Davies Liu Priority: Minor Fix For: 1.3.0 The Python docs display {code}[source]{code} links which should link to source code, but none work in the documentation provided on the Apache Spark website. E.g., go here [https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#module-pyspark.mllib.regression] and click a source code link on the right. I believe these are generated because we have viewcode set in the doc conf file: [https://github.com/apache/spark/blob/master/python/docs/conf.py#L33] I'm not sure why the source code is not being generated/posted. Should the links be removed, or do we want to add the source code? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6015) Python docs' source code links are all broken
[ https://issues.apache.org/jira/browse/SPARK-6015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-6015. --- Resolution: Fixed Fix Version/s: 1.3.0 Python docs' source code links are all broken - Key: SPARK-6015 URL: https://issues.apache.org/jira/browse/SPARK-6015 Project: Spark Issue Type: Documentation Components: Documentation, PySpark Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Davies Liu Priority: Minor Fix For: 1.3.0 The Python docs display {code}[source]{code} links which should link to source code, but none work in the documentation provided on the Apache Spark website. E.g., go here [https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#module-pyspark.mllib.regression] and click a source code link on the right. I believe these are generated because we have viewcode set in the doc conf file: [https://github.com/apache/spark/blob/master/python/docs/conf.py#L33] I'm not sure why the source code is not being generated/posted. Should the links be removed, or do we want to add the source code? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6015) Backport Python doc source code link fix to 1.2
[ https://issues.apache.org/jira/browse/SPARK-6015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-6015: - Target Version/s: 1.2.2 (was: 1.3.0) Backport Python doc source code link fix to 1.2 --- Key: SPARK-6015 URL: https://issues.apache.org/jira/browse/SPARK-6015 Project: Spark Issue Type: Documentation Components: Documentation, PySpark Affects Versions: 1.2.1 Reporter: Joseph K. Bradley Assignee: Davies Liu Priority: Minor The Python docs display {code}[source]{code} links which should link to source code, but none work in the documentation provided on the Apache Spark website. E.g., go here [https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#module-pyspark.mllib.regression] and click a source code link on the right. The PR for [SPARK-5994] included a fix of this for 1.3 which should be backported to 1.2. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6015) Backport Python doc source code link fix to 1.2
[ https://issues.apache.org/jira/browse/SPARK-6015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-6015: - Fix Version/s: (was: 1.3.0) Backport Python doc source code link fix to 1.2 --- Key: SPARK-6015 URL: https://issues.apache.org/jira/browse/SPARK-6015 Project: Spark Issue Type: Documentation Components: Documentation, PySpark Affects Versions: 1.2.1 Reporter: Joseph K. Bradley Assignee: Davies Liu Priority: Minor The Python docs display {code}[source]{code} links which should link to source code, but none work in the documentation provided on the Apache Spark website. E.g., go here [https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#module-pyspark.mllib.regression] and click a source code link on the right. The PR for [SPARK-5994] included a fix of this for 1.3 which should be backported to 1.2. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5970) Temporary directories are not removed (but their content is)
[ https://issues.apache.org/jira/browse/SPARK-5970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5970: - Priority: Minor (was: Major) Assignee: Milan Straka Temporary directories are not removed (but their content is) Key: SPARK-5970 URL: https://issues.apache.org/jira/browse/SPARK-5970 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.1 Environment: Linux, 64bit spark-1.2.1-bin-hadoop2.4.tgz Reporter: Milan Straka Assignee: Milan Straka Priority: Minor Fix For: 1.4.0 How to reproduce: - extract spark-1.2.1-bin-hadoop2.4.tgz - without any further configuration, run bin/pyspark - run sc.stop() and close python shell Expected results: - no temporary directories are left in /tmp Actual results: - four empty temporary directories are created in /tmp, for example after {{ls -d /tmp/spark*}}:{code} /tmp/spark-1577b13d-4b9a-4e35-bac2-6e84e5605f53 /tmp/spark-96084e69-77fd-42fb-ab10-e1fc74296fe3 /tmp/spark-ab2ea237-d875-485e-b16c-5b0ac31bd753 /tmp/spark-ddeb0363-4760-48a4-a189-81321898b146 {code} The issue is caused by changes in {{util/Utils.scala}}. Consider the {{createDirectory}}: {code} /** * Create a directory inside the given parent directory. The directory is guaranteed to be * newly created, and is not marked for automatic deletion. */ def createDirectory(root: String, namePrefix: String = spark): File = ... {code} The {{createDirectory}} is used in two places. The first is in {{createTempDir}}, where it is marked for automatic deletion: {code} def createTempDir( root: String = System.getProperty(java.io.tmpdir), namePrefix: String = spark): File = { val dir = createDirectory(root, namePrefix) registerShutdownDeleteDir(dir) dir } {code} Nevertheless, it is also used in {{getOrCreateLocalDirs}} where it is _not_ marked for automatic deletion: {code} private[spark] def getOrCreateLocalRootDirs(conf: SparkConf): Array[String] = { if (isRunningInYarnContainer(conf)) { // If we are in yarn mode, systems can have different disk layouts so we must set it // to what Yarn on this system said was available. Note this assumes that Yarn has // created the directories already, and that they are secured so that only the // user has access to them. getYarnLocalDirs(conf).split(,) } else { // In non-Yarn mode (or for the driver in yarn-client mode), we cannot trust the user // configuration to point to a secure directory. So create a subdirectory with restricted // permissions under each listed directory. Option(conf.getenv(SPARK_LOCAL_DIRS)) .getOrElse(conf.get(spark.local.dir, System.getProperty(java.io.tmpdir))) .split(,) .flatMap { root = try { val rootDir = new File(root) if (rootDir.exists || rootDir.mkdirs()) { Some(createDirectory(root).getAbsolutePath()) } else { logError(sFailed to create dir in $root. Ignoring this directory.) None } } catch { case e: IOException = logError(sFailed to create local root dir in $root. Ignoring this directory.) None } } .toArray } } {code} Therefore I think the {code} Some(createDirectory(root).getAbsolutePath()) {code} should be replaced by something like (I am not an experienced Scala programmer): {code} val dir = createDirectory(root) registerShutdownDeleteDir(dir) Some(dir.getAbsolutePath()) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5970) Temporary directories are not removed (but their content is)
[ https://issues.apache.org/jira/browse/SPARK-5970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-5970. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 4759 [https://github.com/apache/spark/pull/4759] Temporary directories are not removed (but their content is) Key: SPARK-5970 URL: https://issues.apache.org/jira/browse/SPARK-5970 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.1 Environment: Linux, 64bit spark-1.2.1-bin-hadoop2.4.tgz Reporter: Milan Straka Fix For: 1.4.0 How to reproduce: - extract spark-1.2.1-bin-hadoop2.4.tgz - without any further configuration, run bin/pyspark - run sc.stop() and close python shell Expected results: - no temporary directories are left in /tmp Actual results: - four empty temporary directories are created in /tmp, for example after {{ls -d /tmp/spark*}}:{code} /tmp/spark-1577b13d-4b9a-4e35-bac2-6e84e5605f53 /tmp/spark-96084e69-77fd-42fb-ab10-e1fc74296fe3 /tmp/spark-ab2ea237-d875-485e-b16c-5b0ac31bd753 /tmp/spark-ddeb0363-4760-48a4-a189-81321898b146 {code} The issue is caused by changes in {{util/Utils.scala}}. Consider the {{createDirectory}}: {code} /** * Create a directory inside the given parent directory. The directory is guaranteed to be * newly created, and is not marked for automatic deletion. */ def createDirectory(root: String, namePrefix: String = spark): File = ... {code} The {{createDirectory}} is used in two places. The first is in {{createTempDir}}, where it is marked for automatic deletion: {code} def createTempDir( root: String = System.getProperty(java.io.tmpdir), namePrefix: String = spark): File = { val dir = createDirectory(root, namePrefix) registerShutdownDeleteDir(dir) dir } {code} Nevertheless, it is also used in {{getOrCreateLocalDirs}} where it is _not_ marked for automatic deletion: {code} private[spark] def getOrCreateLocalRootDirs(conf: SparkConf): Array[String] = { if (isRunningInYarnContainer(conf)) { // If we are in yarn mode, systems can have different disk layouts so we must set it // to what Yarn on this system said was available. Note this assumes that Yarn has // created the directories already, and that they are secured so that only the // user has access to them. getYarnLocalDirs(conf).split(,) } else { // In non-Yarn mode (or for the driver in yarn-client mode), we cannot trust the user // configuration to point to a secure directory. So create a subdirectory with restricted // permissions under each listed directory. Option(conf.getenv(SPARK_LOCAL_DIRS)) .getOrElse(conf.get(spark.local.dir, System.getProperty(java.io.tmpdir))) .split(,) .flatMap { root = try { val rootDir = new File(root) if (rootDir.exists || rootDir.mkdirs()) { Some(createDirectory(root).getAbsolutePath()) } else { logError(sFailed to create dir in $root. Ignoring this directory.) None } } catch { case e: IOException = logError(sFailed to create local root dir in $root. Ignoring this directory.) None } } .toArray } } {code} Therefore I think the {code} Some(createDirectory(root).getAbsolutePath()) {code} should be replaced by something like (I am not an experienced Scala programmer): {code} val dir = createDirectory(root) registerShutdownDeleteDir(dir) Some(dir.getAbsolutePath()) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5975) SparkSubmit --jars not present on driver in python
[ https://issues.apache.org/jira/browse/SPARK-5975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-5975: - Summary: SparkSubmit --jars not present on driver in python (was: SparkSubmit --jars not present on driver) SparkSubmit --jars not present on driver in python -- Key: SPARK-5975 URL: https://issues.apache.org/jira/browse/SPARK-5975 Project: Spark Issue Type: Bug Components: PySpark, Spark Submit Affects Versions: 1.3.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Critical Reported by [~tdas]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1182) Sort the configuration parameters in configuration.md
[ https://issues.apache.org/jira/browse/SPARK-1182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14337308#comment-14337308 ] Brennon York commented on SPARK-1182: - [~rxin] I've incorporated all the changes you mentioned (save for #6) and updated all merge conflicts thus far (not fun haha). If we want to move forward with this I'd ask that we move with a bit of brevity on this one before I need to fix a ton of merge conflicts again :/ Thanks! Sort the configuration parameters in configuration.md - Key: SPARK-1182 URL: https://issues.apache.org/jira/browse/SPARK-1182 Project: Spark Issue Type: Task Components: Documentation Reporter: Reynold Xin Assignee: prashant Priority: Minor It is a little bit confusing right now since the config options are all over the place in some arbitrarily sorted order. https://github.com/apache/spark/blob/master/docs/configuration.md -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4924) Factor out code to launch Spark applications into a separate library
[ https://issues.apache.org/jira/browse/SPARK-4924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-4924: -- Target Version/s: 1.4.0 (was: 1.3.0) Factor out code to launch Spark applications into a separate library Key: SPARK-4924 URL: https://issues.apache.org/jira/browse/SPARK-4924 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Marcelo Vanzin Assignee: Marcelo Vanzin Attachments: spark-launcher.txt One of the questions we run into rather commonly is how to start a Spark application from my Java/Scala program?. There currently isn't a good answer to that: - Instantiating SparkContext has limitations (e.g., you can only have one active context at the moment, plus you lose the ability to submit apps in cluster mode) - Calling SparkSubmit directly is doable but you lose a lot of the logic handled by the shell scripts - Calling the shell script directly is doable, but sort of ugly from an API point of view. I think it would be nice to have a small library that handles that for users. On top of that, this library could be used by Spark itself to replace a lot of the code in the current shell scripts, which have a lot of duplication. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1955) VertexRDD can incorrectly assume index sharing
[ https://issues.apache.org/jira/browse/SPARK-1955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur Dave resolved SPARK-1955. --- Resolution: Fixed Fix Version/s: 1.2.2 1.3.0 Assignee: Brennon York (was: Ankur Dave) Issue resolved by pull request 4705 https://github.com/apache/spark/pull/4705 VertexRDD can incorrectly assume index sharing -- Key: SPARK-1955 URL: https://issues.apache.org/jira/browse/SPARK-1955 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 0.9.0, 0.9.1, 1.0.0 Reporter: Ankur Dave Assignee: Brennon York Priority: Minor Fix For: 1.3.0, 1.2.2 Many VertexRDD operations (diff, leftJoin, innerJoin) can use a fast zip join if both operands are VertexRDDs sharing the same index (i.e., one operand is derived from the other). This check is implemented by matching on the operand type and using the fast join strategy if both are VertexRDDs. This is clearly fine when both do in fact share the same index. It is also fine when the two VertexRDDs have the same partitioner but different indexes, because each VertexPartition will detect the index mismatch and fall back to the slow but correct local join strategy. However, when they have different numbers of partitions or different partition functions, an exception or even silently incorrect results can occur. For example: {code} import org.apache.spark._ import org.apache.spark.graphx._ // Construct VertexRDDs with different numbers of partitions val a = VertexRDD(sc.parallelize(List((0L, 1), (1L, 2)), 1)) val b = VertexRDD(sc.parallelize(List((0L, 5)), 8)) // Try to join them. Appears to work... val c = a.innerJoin(b) { (vid, x, y) = x + y } // ... but then fails with java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of partitions c.collect // Construct VertexRDDs with different partition functions val a = VertexRDD(sc.parallelize(List((0L, 1), (1L, 2))).partitionBy(new HashPartitioner(2))) val bVerts = sc.parallelize(List((1L, 5))) val b = VertexRDD(bVerts.partitionBy(new RangePartitioner(2, bVerts))) // Try to join them. We expect (1L, 7). val c = a.innerJoin(b) { (vid, x, y) = x + y } // Silent failure: we get an empty set! c.collect {code} VertexRDD should check equality of partitioners before using the fast zip join. If the partitioners are different, the two datasets should be automatically co-partitioned. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
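For illustration, a sketch of the guard the description asks for; this is not the actual VertexRDD code, just the shape of the check that would decide between the fast zip join and automatic co-partitioning:
{code}
import org.apache.spark.rdd.RDD

// Illustrative check only: a zip join is safe when both sides have the same number of
// partitions and the same partitioner; otherwise one side should be repartitioned
// with the other's partitioner before joining.
def canZipJoin(a: RDD[_], b: RDD[_]): Boolean =
  a.partitions.length == b.partitions.length &&
    a.partitioner.isDefined &&
    a.partitioner == b.partitioner
{code}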
[jira] [Created] (SPARK-6017) Provide transparent secure communication channel on Yarn
Marcelo Vanzin created SPARK-6017: - Summary: Provide transparent secure communication channel on Yarn Key: SPARK-6017 URL: https://issues.apache.org/jira/browse/SPARK-6017 Project: Spark Issue Type: Umbrella Components: YARN Reporter: Marcelo Vanzin A quick description: Currently driver and executors communicate through an insecure channel, so anyone can listen on the network and see what's going on. That prevents Spark from adding some features securely (e.g. SPARK-5342, SPARK-5682) without resorting to using internal Hadoop APIs. Spark 1.3.0 will add SSL support, but properly configuring SSL is not a trivial task for operators, let alone users. In light of those, we should add a more transparent secure transport layer. I've written a short spec to identify the areas in Spark that need work to achieve this, and I'll attach the document to this issue shortly. Note I'm restricting things to Yarn currently, because as far as I know it's the only cluster manager that provides the needed security features to bootstrap the secure Spark transport. The design itself doesn't really rely on Yarn per se, just on a secure way to distribute the initial secret (which the Yarn/HDFS combo provides). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6016) Cannot read the parquet table after overwriting the existing table when spark.sql.parquet.cacheMetadata=true
[ https://issues.apache.org/jira/browse/SPARK-6016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-6016: Description: saveAsTable is fine and seems we have successfully deleted the old data and written the new data. However, when reading the newly created table, an error will be thrown. {code} Error in SQL statement: java.lang.RuntimeException: java.lang.RuntimeException: could not merge metadata: key org.apache.spark.sql.parquet.row.metadata has conflicting values: at parquet.hadoop.api.InitContext.getMergedKeyValueMetaData(InitContext.java:67) at parquet.hadoop.api.ReadSupport.init(ReadSupport.java:84) at org.apache.spark.sql.parquet.FilteringParquetRowInputFormat.getSplits(ParquetTableOperations.scala:469) at parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:245) at org.apache.spark.sql.parquet.ParquetRelation2$$anon$1.getPartitions(newParquet.scala:461) ... {code} If I set spark.sql.parquet.cacheMetadata to false, it's fine to query the data. Note: the newly created table needs to have more than one file to trigger the bug (if there is only a single file, we will not need to merge metadata). To reproduce it, try... {code} import org.apache.spark.sql.SaveMode import sqlContext._ sql(drop table if exists test) val df1 = sqlContext.jsonRDD(sc.parallelize((1 to 10).map(i = s{a:$i}), 2)) // we will save to 2 parquet files. df1.saveAsTable(test, parquet, SaveMode.Overwrite) sql(select * from test).collect.foreach(println) // Warm the FilteringParquetRowInputFormat.footerCache val df2 = sqlContext.jsonRDD(sc.parallelize((1 to 10).map(i = s{b:$i}), 4)) // we will save to 4 parquet files. df2.saveAsTable(test, parquet, SaveMode.Overwrite) sql(select * from test).collect.foreach(println) {code} was: saveAsTable is fine and seems we have successfully deleted the old data and written the new data. However, when reading the newly created table, an error will be thrown. {code} Error in SQL statement: java.lang.RuntimeException: java.lang.RuntimeException: could not merge metadata: key org.apache.spark.sql.parquet.row.metadata has conflicting values: at parquet.hadoop.api.InitContext.getMergedKeyValueMetaData(InitContext.java:67) at parquet.hadoop.api.ReadSupport.init(ReadSupport.java:84) at org.apache.spark.sql.parquet.FilteringParquetRowInputFormat.getSplits(ParquetTableOperations.scala:469) at parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:245) at org.apache.spark.sql.parquet.ParquetRelation2$$anon$1.getPartitions(newParquet.scala:461) ... {code} If I set spark.sql.parquet.cacheMetadata to false, it's fine to query the data. Note: the newly created table needs to have more than one file to trigger the bug (if there is only a single file, we will not need to merge metadata). Cannot read the parquet table after overwriting the existing table when spark.sql.parquet.cacheMetadata=true Key: SPARK-6016 URL: https://issues.apache.org/jira/browse/SPARK-6016 Project: Spark Issue Type: Bug Components: SQL Reporter: Yin Huai Priority: Blocker saveAsTable is fine and seems we have successfully deleted the old data and written the new data. However, when reading the newly created table, an error will be thrown. 
{code} Error in SQL statement: java.lang.RuntimeException: java.lang.RuntimeException: could not merge metadata: key org.apache.spark.sql.parquet.row.metadata has conflicting values: at parquet.hadoop.api.InitContext.getMergedKeyValueMetaData(InitContext.java:67) at parquet.hadoop.api.ReadSupport.init(ReadSupport.java:84) at org.apache.spark.sql.parquet.FilteringParquetRowInputFormat.getSplits(ParquetTableOperations.scala:469) at parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:245) at org.apache.spark.sql.parquet.ParquetRelation2$$anon$1.getPartitions(newParquet.scala:461) ... {code} If I set spark.sql.parquet.cacheMetadata to false, it's fine to query the data. Note: the newly created table needs to have more than one file to trigger the bug (if there is only a single file, we will not need to merge metadata). To reproduce it, try... {code} import org.apache.spark.sql.SaveMode import sqlContext._ sql(drop table if exists test) val df1 = sqlContext.jsonRDD(sc.parallelize((1 to 10).map(i = s{a:$i}), 2)) // we will save to 2 parquet files. df1.saveAsTable(test, parquet, SaveMode.Overwrite) sql(select * from test).collect.foreach(println) // Warm the FilteringParquetRowInputFormat.footerCache val df2 = sqlContext.jsonRDD(sc.parallelize((1 to 10).map(i = s{b:$i}), 4)) // we will save
[jira] [Updated] (SPARK-6017) Provide transparent secure communication channel on Yarn
[ https://issues.apache.org/jira/browse/SPARK-6017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-6017: -- Attachment: secure_spark_on_yarn.pdf First draft of problem statement and proposed solutions. Once there's agreement on these I'll create sub-tasks to do the individual work pieces. Provide transparent secure communication channel on Yarn Key: SPARK-6017 URL: https://issues.apache.org/jira/browse/SPARK-6017 Project: Spark Issue Type: Umbrella Components: YARN Reporter: Marcelo Vanzin Attachments: secure_spark_on_yarn.pdf A quick description: Currently driver and executors communicate through an insecure channel, so anyone can listen on the network and see what's going on. That prevents Spark from adding some features securely (e.g. SPARK-5342, SPARK-5682) without resorting to using internal Hadoop APIs. Spark 1.3.0 will add SSL support, but properly configuring SSL is not a trivial task for operators, let alone users. In light of those, we should add a more transparent secure transport layer. I've written a short spec to identify the areas in Spark that need work to achieve this, and I'll attach the document to this issue shortly. Note I'm restricting things to Yarn currently, because as far as I know it's the only cluster manager that provides the needed security features to bootstrap the secure Spark transport. The design itself doesn't really rely on Yarn per se, just on a secure way to distribute the initial secret (which the Yarn/HDFS combo provides). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6016) Cannot read the parquet table after overwriting the existing table when spark.sql.parquet.cacheMetadata=true
[ https://issues.apache.org/jira/browse/SPARK-6016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-6016: Description: saveAsTable is fine and seems we have successfully deleted the old data and written the new data. However, when reading the newly created table, an error will be thrown. {code} Error in SQL statement: java.lang.RuntimeException: java.lang.RuntimeException: could not merge metadata: key org.apache.spark.sql.parquet.row.metadata has conflicting values: at parquet.hadoop.api.InitContext.getMergedKeyValueMetaData(InitContext.java:67) at parquet.hadoop.api.ReadSupport.init(ReadSupport.java:84) at org.apache.spark.sql.parquet.FilteringParquetRowInputFormat.getSplits(ParquetTableOperations.scala:469) at parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:245) at org.apache.spark.sql.parquet.ParquetRelation2$$anon$1.getPartitions(newParquet.scala:461) ... {code} If I set spark.sql.parquet.cacheMetadata to false, it's fine to query the data. Note: the newly created table needs to have more than one file to trigger the bug (if there is only a single file, we will not need to merge metadata). To reproduce it, try... {code} import org.apache.spark.sql.SaveMode import sqlContext._ sql(drop table if exists test) val df1 = sqlContext.jsonRDD(sc.parallelize((1 to 10).map(i = s{a:$i}), 2)) // we will save to 2 parquet files. df1.saveAsTable(test, parquet, SaveMode.Overwrite) sql(select * from test).collect.foreach(println) // Warm the FilteringParquetRowInputFormat.footerCache val df2 = sqlContext.jsonRDD(sc.parallelize((1 to 10).map(i = s{b:$i}), 4)) // we will save to 4 parquet files. df2.saveAsTable(test, parquet, SaveMode.Overwrite) sql(select * from test).collect.foreach(println) {code} For this example, we have two outdated footers for df1 in footerCache and since we have four parquet files for the new test table, we picked up 2 new footers for df2. Then, we hit the bug. was: saveAsTable is fine and seems we have successfully deleted the old data and written the new data. However, when reading the newly created table, an error will be thrown. {code} Error in SQL statement: java.lang.RuntimeException: java.lang.RuntimeException: could not merge metadata: key org.apache.spark.sql.parquet.row.metadata has conflicting values: at parquet.hadoop.api.InitContext.getMergedKeyValueMetaData(InitContext.java:67) at parquet.hadoop.api.ReadSupport.init(ReadSupport.java:84) at org.apache.spark.sql.parquet.FilteringParquetRowInputFormat.getSplits(ParquetTableOperations.scala:469) at parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:245) at org.apache.spark.sql.parquet.ParquetRelation2$$anon$1.getPartitions(newParquet.scala:461) ... {code} If I set spark.sql.parquet.cacheMetadata to false, it's fine to query the data. Note: the newly created table needs to have more than one file to trigger the bug (if there is only a single file, we will not need to merge metadata). To reproduce it, try... {code} import org.apache.spark.sql.SaveMode import sqlContext._ sql(drop table if exists test) val df1 = sqlContext.jsonRDD(sc.parallelize((1 to 10).map(i = s{a:$i}), 2)) // we will save to 2 parquet files. df1.saveAsTable(test, parquet, SaveMode.Overwrite) sql(select * from test).collect.foreach(println) // Warm the FilteringParquetRowInputFormat.footerCache val df2 = sqlContext.jsonRDD(sc.parallelize((1 to 10).map(i = s{b:$i}), 4)) // we will save to 4 parquet files. 
df2.saveAsTable(test, parquet, SaveMode.Overwrite) sql(select * from test).collect.foreach(println) {code} Cannot read the parquet table after overwriting the existing table when spark.sql.parquet.cacheMetadata=true Key: SPARK-6016 URL: https://issues.apache.org/jira/browse/SPARK-6016 Project: Spark Issue Type: Bug Components: SQL Reporter: Yin Huai Priority: Blocker saveAsTable is fine and seems we have successfully deleted the old data and written the new data. However, when reading the newly created table, an error will be thrown. {code} Error in SQL statement: java.lang.RuntimeException: java.lang.RuntimeException: could not merge metadata: key org.apache.spark.sql.parquet.row.metadata has conflicting values: at parquet.hadoop.api.InitContext.getMergedKeyValueMetaData(InitContext.java:67) at parquet.hadoop.api.ReadSupport.init(ReadSupport.java:84) at org.apache.spark.sql.parquet.FilteringParquetRowInputFormat.getSplits(ParquetTableOperations.scala:469) at parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:245) at
[jira] [Updated] (SPARK-6018) NoSuchMethodError in Spark app is swallowed by YARN AM
[ https://issues.apache.org/jira/browse/SPARK-6018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheolsoo Park updated SPARK-6018: - Description: I discovered this bug while testing 1.3 RC with old 1.2 Spark job that I had. Due to changes in DF and SchemaRDD, my app failed with {{java.lang.NoSuchMethodError}}. However, AM was marked as succeeded, and the error was silently swallowed. The problem is that pattern matching in Spark AM fails to catch NoSuchMethodError- {code} 15/02/25 20:13:27 INFO cluster.YarnClusterScheduler: YarnClusterScheduler.postStartHook done Exception in thread Driver scala.MatchError: java.lang.NoSuchMethodError: org.apache.spark.sql.hive.HiveContext.table(Ljava/lang/String;)Lorg/apache/spark/sql/SchemaRDD; (of class java.lang.NoSuchMethodError) at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:485) {code} was: I discovered while testing 1.3 RC with old 1.2 Spark job that I had. Due to changes in DF and SchemaRDD, my app failed with {{java.lang.NoSuchMethodError}}. However, AM was marked as succeeded, and the error was silently swallowed. The problem is that pattern matching in Spark AM fails to catch NoSuchMethodError- {code} 15/02/25 20:13:27 INFO cluster.YarnClusterScheduler: YarnClusterScheduler.postStartHook done Exception in thread Driver scala.MatchError: java.lang.NoSuchMethodError: org.apache.spark.sql.hive.HiveContext.table(Ljava/lang/String;)Lorg/apache/spark/sql/SchemaRDD; (of class java.lang.NoSuchMethodError) at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:485) {code} NoSuchMethodError in Spark app is swallowed by YARN AM -- Key: SPARK-6018 URL: https://issues.apache.org/jira/browse/SPARK-6018 Project: Spark Issue Type: Bug Components: YARN Reporter: Cheolsoo Park Priority: Minor Labels: yarn I discovered this bug while testing 1.3 RC with old 1.2 Spark job that I had. Due to changes in DF and SchemaRDD, my app failed with {{java.lang.NoSuchMethodError}}. However, AM was marked as succeeded, and the error was silently swallowed. The problem is that pattern matching in Spark AM fails to catch NoSuchMethodError- {code} 15/02/25 20:13:27 INFO cluster.YarnClusterScheduler: YarnClusterScheduler.postStartHook done Exception in thread Driver scala.MatchError: java.lang.NoSuchMethodError: org.apache.spark.sql.hive.HiveContext.table(Ljava/lang/String;)Lorg/apache/spark/sql/SchemaRDD; (of class java.lang.NoSuchMethodError) at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:485) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
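For illustration, a self-contained sketch of the pattern-matching gap; this is not the actual ApplicationMaster code. {{NoSuchMethodError}} extends {{java.lang.Error}} rather than {{Exception}}, and it is also excluded by {{scala.util.control.NonFatal}}, so a handler needs an explicit {{Throwable}} case to report it as a failure:
{code}
// Runnable sketch with stand-in helpers (runUserClass / reportFailure are hypothetical).
def runUserClass(): Unit = throw new NoSuchMethodError("demo")        // stand-in for the user app
def reportFailure(t: Throwable): Unit = println(s"marking AM failed: $t")

try {
  runUserClass()
} catch {
  case e: java.lang.reflect.InvocationTargetException => reportFailure(e.getCause)
  case e: Throwable => reportFailure(e) // also covers Errors such as NoSuchMethodError
}
{code}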
[jira] [Resolved] (SPARK-2770) Rename spark-ganglia-lgpl to ganglia-lgpl
[ https://issues.apache.org/jira/browse/SPARK-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-2770. -- Resolution: Won't Fix Rename spark-ganglia-lgpl to ganglia-lgpl - Key: SPARK-2770 URL: https://issues.apache.org/jira/browse/SPARK-2770 Project: Spark Issue Type: Improvement Components: Build Reporter: Chris Fregly Assignee: Chris Fregly Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6022) GraphX `diff` test incorrectly operating on values (not VertexId's)
[ https://issues.apache.org/jira/browse/SPARK-6022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14337513#comment-14337513 ] Brennon York commented on SPARK-6022: - FWIW I have this fix put in place under my branch for [SPARK-4600|https://issues.apache.org/jira/browse/SPARK-4600], but wanted to point it out here as I'm going to need to remove the original test. This is all assuming I'm not completely insane and missing something here. cc [~ankurd] GraphX `diff` test incorrectly operating on values (not VertexId's) --- Key: SPARK-6022 URL: https://issues.apache.org/jira/browse/SPARK-6022 Project: Spark Issue Type: Bug Components: GraphX Reporter: Brennon York The current GraphX {{diff}} test operates on values rather than the VertexId's and, if {{diff}} were working properly (per [SPARK-4600|https://issues.apache.org/jira/browse/SPARK-4600]), it should fail this test. The code to test {{diff}} should look like the below as it correctly generates {{VertexRDD}}'s with different {{VertexId}}'s to {{diff}} against. {code} test("diff functionality with small concrete values") { withSpark { sc => val setA: VertexRDD[Int] = VertexRDD(sc.parallelize(0L until 2L).map(id => (id, id.toInt))) // setA := Set((0L, 0), (1L, 1)) val setB: VertexRDD[Int] = VertexRDD(sc.parallelize(1L until 3L).map(id => (id, id.toInt+2))) // setB := Set((1L, 3), (2L, 4)) val diff = setA.diff(setB) assert(diff.collect.toSet == Set((2L, 4))) } } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3901) Add SocketSink capability for Spark metrics
[ https://issues.apache.org/jira/browse/SPARK-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14337512#comment-14337512 ] Sean Owen commented on SPARK-3901: -- It sounds like this can't proceed, blocked on features from Metrics upstream? is it a question of a newer version? Add SocketSink capability for Spark metrics --- Key: SPARK-3901 URL: https://issues.apache.org/jira/browse/SPARK-3901 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.0.0, 1.1.0 Reporter: Sreepathi Prasanna Priority: Minor Original Estimate: 48h Remaining Estimate: 48h Spark depends on Coda hale metrics library to collect metrics. Today we can send metrics to console, csv and jmx. We use chukwa as a monitoring framework to monitor the hadoop services. To extend the the framework to collect spark metrics, we need additional socketsink capability which is not there at the moment in Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-725) Ran out of disk space on EC2 master due to Ganglia logs
[ https://issues.apache.org/jira/browse/SPARK-725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-725. - Resolution: Not a Problem Sounds like the best guess was that this wasn't a Spark issue. Ran out of disk space on EC2 master due to Ganglia logs --- Key: SPARK-725 URL: https://issues.apache.org/jira/browse/SPARK-725 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 0.7.0 Reporter: Josh Rosen This morning, I started a Spark Standalone cluster on EC2 using 50 m1.medium instances. When I tried to rebuild Spark ~5.5 hours later, the build failed because the master ran out of disk space. It looks like {{/var/lib/ganglia/rrds/spark}} grew to 4.2 gigabytes, using over half of the AMI's EBS disk space. Is there a default setting that we can change to place a harder limit on the total amount of space used by Ganglia to prevent this from happening? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5974) Add save/load to examples in ML guide
[ https://issues.apache.org/jira/browse/SPARK-5974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-5974. -- Resolution: Fixed Fix Version/s: 1.3.0 Add save/load to examples in ML guide - Key: SPARK-5974 URL: https://issues.apache.org/jira/browse/SPARK-5974 Project: Spark Issue Type: Documentation Components: Documentation, MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Minor Fix For: 1.3.0 We should add save/load (model import/export) to the Scala and Java code examples in the ML guide. This is not yet supported in Python. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1182) Sort the configuration parameters in configuration.md
[ https://issues.apache.org/jira/browse/SPARK-1182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14337555#comment-14337555 ] Reynold Xin commented on SPARK-1182: Merged. Thanks for doing this, [~boyork]. Couple minor things that would be great to be addressed as a followup PR: 1. Link to the Mesos configuration table rather than just the Mesos page. 2. If possible, make Mesos/YARN/Standalone appear in the table of contents on top. Sort the configuration parameters in configuration.md - Key: SPARK-1182 URL: https://issues.apache.org/jira/browse/SPARK-1182 Project: Spark Issue Type: Task Components: Documentation Reporter: Reynold Xin Assignee: prashant Priority: Minor It is a little bit confusing right now since the config options are all over the place in some arbitrarily sorted order. https://github.com/apache/spark/blob/master/docs/configuration.md -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-786) Clean up old work directories in standalone worker
[ https://issues.apache.org/jira/browse/SPARK-786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-786. - Resolution: Duplicate Clean up old work directories in standalone worker -- Key: SPARK-786 URL: https://issues.apache.org/jira/browse/SPARK-786 Project: Spark Issue Type: New Feature Components: Deploy Affects Versions: 0.7.2 Reporter: Matei Zaharia We should add a setting to clean old work directories after X days. Otherwise, the directory gets filled forever with shuffle files and logs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5124) Standardize internal RPC interface
[ https://issues.apache.org/jira/browse/SPARK-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14337733#comment-14337733 ] Shixiong Zhu commented on SPARK-5124: - [~rxin] could you clarify how to reply to the message in `receiveAndReply`? Use the return value, or `message.reply()`? Standardize internal RPC interface -- Key: SPARK-5124 URL: https://issues.apache.org/jira/browse/SPARK-5124 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Reynold Xin Assignee: Shixiong Zhu Attachments: Pluggable RPC - draft 1.pdf, Pluggable RPC - draft 2.pdf In Spark we use Akka as the RPC layer. It would be great if we can standardize the internal RPC interface to facilitate testing. This will also provide the foundation to try other RPC implementations in the future. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6025) Helper method for GradientBoostedTrees to compute validation error
Joseph K. Bradley created SPARK-6025: Summary: Helper method for GradientBoostedTrees to compute validation error Key: SPARK-6025 URL: https://issues.apache.org/jira/browse/SPARK-6025 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Priority: Minor Create a helper method for computing the error at each iteration of boosting. This should be used post-hoc to compute the error efficiently on a new dataset. E.g.: {code} def evaluateEachIteration(data: RDD[LabeledPoint], evaluator): Array[Double] {code} Notes: * It should run in the same big-O time as predict() by keeping a running total (residual). * A different method name could be good. * It could take an evaluator and/or could evaluate using the training metric by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
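A rough sketch of how such a helper could keep the same big-O cost as predict(), by carrying a running prediction per example and adding one tree per iteration. The signature, the loss parameter, and the access to {{model.trees}} / {{model.treeWeights}} are illustrative assumptions, not the proposed API:
{code}
import org.apache.spark.SparkContext._
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.model.GradientBoostedTreesModel
import org.apache.spark.rdd.RDD

// Error after each boosting iteration. Re-using the running prediction keeps the
// whole pass at O(numTrees * |data|) instead of O(numTrees^2 * |data|).
def evaluateEachIteration(
    model: GradientBoostedTreesModel,
    data: RDD[LabeledPoint],
    loss: (Double, Double) => Double): Array[Double] = {
  var running: RDD[(LabeledPoint, Double)] = data.map(lp => (lp, 0.0))
  model.trees.zip(model.treeWeights).map { case (tree, weight) =>
    // Add this tree's weighted prediction to the running total, then score the ensemble so far.
    running = running.map { case (lp, pred) => (lp, pred + weight * tree.predict(lp.features)) }
    running.map { case (lp, pred) => loss(lp.label, pred) }.mean()
  }
}
{code}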
[jira] [Commented] (SPARK-6004) Pick the best model when training GradientBoostedTrees with validation
[ https://issues.apache.org/jira/browse/SPARK-6004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14337793#comment-14337793 ] Joseph K. Bradley commented on SPARK-6004: -- If validation is not being done, we should return the full model. It would be strange if users were unable to get the full model. Pick the best model when training GradientBoostedTrees with validation -- Key: SPARK-6004 URL: https://issues.apache.org/jira/browse/SPARK-6004 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Liang-Chi Hsieh Priority: Minor Since the validation error does not change monotonically, in practice it would be better to pick the best model when training GradientBoostedTrees with validation, rather than stopping early. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6027) Make KafkaUtils work in Python with kafka-assembly provided as --jar
Tathagata Das created SPARK-6027: Summary: Make KafkaUtils work in Python with kafka-assembly provided as --jar Key: SPARK-6027 URL: https://issues.apache.org/jira/browse/SPARK-6027 Project: Spark Issue Type: Bug Components: PySpark, Streaming Reporter: Tathagata Das Priority: Critical -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5124) Standardize internal RPC interface
[ https://issues.apache.org/jira/browse/SPARK-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14337724#comment-14337724 ] Shixiong Zhu commented on SPARK-5124: - [~vanzin], thanks for the suggestions. I agree with most of them. Just a small comment on the following point: {quote} The default Endpoint has no thread-safety guarantees. You can wrap an Endpoint in an EventLoop if you want messages to be handled using a queue, or synchronize your receive() method (although that can block the dispatcher thread, which could be bad). But this would easily allow actors to process multiple messages concurrently if desired. {quote} Every Endpoint with an EventLoop needs its own exclusive thread, so this would significantly increase the number of threads and incur extra context-switch overhead. However, I think we can have a global Dispatcher for the Endpoints that need thread-safety guarantees: an Endpoint registers itself with the Dispatcher, and the Dispatcher delivers messages to these Endpoints while guaranteeing thread safety. Such a Dispatcher only needs a few threads and queues for dispatching messages. Standardize internal RPC interface -- Key: SPARK-5124 URL: https://issues.apache.org/jira/browse/SPARK-5124 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Reynold Xin Assignee: Shixiong Zhu Attachments: Pluggable RPC - draft 1.pdf, Pluggable RPC - draft 2.pdf In Spark we use Akka as the RPC layer. It would be great if we can standardize the internal RPC interface to facilitate testing. This will also provide the foundation to try other RPC implementations in the future. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
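A minimal sketch of the shared-Dispatcher idea in the comment above (the {{Endpoint}}, {{Dispatcher}} and {{Inbox}} names are made up; this is not Spark's actual RPC code). Each endpoint gets its own inbox, and a small thread pool drains one inbox at a time, so {{receive()}} is never invoked concurrently for the same endpoint:
{code}
import java.util.concurrent.{ConcurrentHashMap, ConcurrentLinkedQueue, Executors}
import java.util.concurrent.atomic.AtomicBoolean

trait Endpoint { def receive(message: Any): Unit }

class Dispatcher(numThreads: Int) {
  private class Inbox(val endpoint: Endpoint) {
    val messages = new ConcurrentLinkedQueue[Any]()
    val active = new AtomicBoolean(false) // true while a pool thread is draining this inbox
  }

  private val inboxes = new ConcurrentHashMap[String, Inbox]()
  private val pool = Executors.newFixedThreadPool(numThreads)

  def register(name: String, endpoint: Endpoint): Unit = {
    inboxes.put(name, new Inbox(endpoint))
  }

  def post(name: String, message: Any): Unit = {
    val inbox = inboxes.get(name)
    inbox.messages.add(message)
    schedule(inbox)
  }

  // Hand an inbox to the pool only if nobody is already draining it.
  private def schedule(inbox: Inbox): Unit = {
    if (!inbox.messages.isEmpty && inbox.active.compareAndSet(false, true)) {
      pool.execute(new Runnable {
        override def run(): Unit = {
          try {
            var m = inbox.messages.poll()
            while (m != null) {
              inbox.endpoint.receive(m)
              m = inbox.messages.poll()
            }
          } finally {
            inbox.active.set(false)
          }
          schedule(inbox) // pick up anything that arrived after the last poll
        }
      })
    }
  }
}
{code}
This bounds the thread count by the pool size rather than by the number of endpoints, at the cost of head-of-line blocking within each inbox.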
[jira] [Commented] (SPARK-5124) Standardize internal RPC interface
[ https://issues.apache.org/jira/browse/SPARK-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14337745#comment-14337745 ] Aaron Davidson commented on SPARK-5124: --- [~zsxwing] For receiveAndReply, I think the intention is message.reply(), so it can be performed in a separate thread. My comment on failure would be so that exceptions can be relayed during sendWithReply() (this is natural anyway if sendWithReply returns a Future, it can just be completed with the exception). Standardize internal RPC interface -- Key: SPARK-5124 URL: https://issues.apache.org/jira/browse/SPARK-5124 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Reynold Xin Assignee: Shixiong Zhu Attachments: Pluggable RPC - draft 1.pdf, Pluggable RPC - draft 2.pdf In Spark we use Akka as the RPC layer. It would be great if we can standardize the internal RPC interface to facilitate testing. This will also provide the foundation to try other RPC implementations in the future. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
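A small sketch of the {{message.reply()}} style described here, with failures relayed through the Future returned by {{sendWithReply}}; the trait and method names are illustrative, not the final Spark RPC API:
{code}
import scala.concurrent.{Future, Promise}
import scala.util.control.NonFatal

// The context handed to receiveAndReply; replying (or failing) completes the
// Future the caller got back from sendWithReply, possibly from another thread.
trait RpcCallContext {
  def reply(response: Any): Unit
  def sendFailure(e: Throwable): Unit
}

trait ReplyingEndpoint {
  def receiveAndReply(message: Any, context: RpcCallContext): Unit
}

// Caller side: deliver the message and return a Future the endpoint will complete.
def sendWithReply(endpoint: ReplyingEndpoint, message: Any): Future[Any] = {
  val promise = Promise[Any]()
  val context = new RpcCallContext {
    def reply(response: Any): Unit = promise.trySuccess(response)
    def sendFailure(e: Throwable): Unit = promise.tryFailure(e)
  }
  try {
    endpoint.receiveAndReply(message, context)
  } catch {
    case NonFatal(e) => promise.tryFailure(e) // exceptions are relayed to the caller
  }
  promise.future
}
{code}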
[jira] [Commented] (SPARK-6004) Pick the best model when training GradientBoostedTrees with validation
[ https://issues.apache.org/jira/browse/SPARK-6004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14337762#comment-14337762 ] Joseph K. Bradley commented on SPARK-6004: -- I'm not too worried about stopping early when users call runWithValidation(). If they know enough to use a validation set, then I think it's reasonable to expect them to set validationTol according to what they need. This was discussed a little in [SPARK-5972], where the decision was to provide a helper method for users to do validation post-hoc: [SPARK-6025]. I feel like the current default behavior is good since it chooses efficiency, while still leaving the option to do more expensive training with potentially higher accuracy. Pick the best model when training GradientBoostedTrees with validation -- Key: SPARK-6004 URL: https://issues.apache.org/jira/browse/SPARK-6004 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Liang-Chi Hsieh Priority: Minor Since the validation error does not change monotonically, in practice it would be better to pick the best model when training GradientBoostedTrees with validation, rather than stopping early. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
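For reference, the early-stopping path under discussion looks roughly like this with the 1.3 API (the data path, split ratios, and parameter values are illustrative; {{sc}} is an existing SparkContext):
{code}
import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy
import org.apache.spark.mllib.util.MLUtils

val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
val Array(train, validation) = data.randomSplit(Array(0.8, 0.2))

val boostingStrategy = BoostingStrategy.defaultParams("Regression")
boostingStrategy.numIterations = 100
boostingStrategy.validationTol = 1e-5 // stop once validation error stops improving by more than this

// Stops early based on the validation set; run(train) alone would build all 100 trees.
val model = new GradientBoostedTrees(boostingStrategy).runWithValidation(train, validation)
{code}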
[jira] [Updated] (SPARK-6020) Flaky test: o.a.s.sql.columnar.PartitionBatchPruningSuite
[ https://issues.apache.org/jira/browse/SPARK-6020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-6020: - Assignee: Cheng Lian Flaky test: o.a.s.sql.columnar.PartitionBatchPruningSuite - Key: SPARK-6020 URL: https://issues.apache.org/jira/browse/SPARK-6020 Project: Spark Issue Type: Bug Components: SQL, Tests Affects Versions: 1.3.0 Reporter: Andrew Or Assignee: Cheng Lian Priority: Critical Observed in the following builds, only one of which has something to do with SQL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27931/ https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27930/ https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27929/ org.apache.spark.sql.columnar.PartitionBatchPruningSuite.SELECT key FROM pruningData WHERE NOT (key IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30)) {code} Error Message 8 did not equal 10 Wrong number of read batches: == Parsed Logical Plan == 'Project ['key] 'Filter NOT 'key IN (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30) 'UnresolvedRelation [pruningData], None == Analyzed Logical Plan == Project [key#5245] Filter NOT key#5245 IN (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30) LogicalRDD [key#5245,value#5246], MapPartitionsRDD[3202] at mapPartitions at ExistingRDD.scala:35 == Optimized Logical Plan == Project [key#5245] Filter NOT key#5245 INSET (5,10,24,25,14,20,29,1,6,28,21,9,13,2,17,22,27,12,7,3,18,16,11,26,23,8,30,19,4,15) InMemoryRelation [key#5245,value#5246], true, 10, StorageLevel(true, true, false, true, 1), (PhysicalRDD [key#5245,value#5246], MapPartitionsRDD[3202] at mapPartitions at ExistingRDD.scala:35), Some(pruningData) == Physical Plan == Filter NOT key#5245 INSET (5,10,24,25,14,20,29,1,6,28,21,9,13,2,17,22,27,12,7,3,18,16,11,26,23,8,30,19,4,15) InMemoryColumnarTableScan [key#5245], [NOT key#5245 INSET (5,10,24,25,14,20,29,1,6,28,21,9,13,2,17,22,27,12,7,3,18,16,11,26,23,8,30,19,4,15)], (InMemoryRelation [key#5245,value#5246], true, 10, StorageLevel(true, true, false, true, 1), (PhysicalRDD [key#5245,value#5246], MapPartitionsRDD[3202] at mapPartitions at ExistingRDD.scala:35), Some(pruningData)) Code Generation: false == RDD == Stacktrace sbt.ForkMain$ForkError: 8 did not equal 10 Wrong number of read batches: == Parsed Logical Plan == 'Project ['key] 'Filter NOT 'key IN (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30) 'UnresolvedRelation [pruningData], None == Analyzed Logical Plan == Project [key#5245] Filter NOT key#5245 IN (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30) LogicalRDD [key#5245,value#5246], MapPartitionsRDD[3202] at mapPartitions at ExistingRDD.scala:35 == Optimized Logical Plan == Project [key#5245] Filter NOT key#5245 INSET (5,10,24,25,14,20,29,1,6,28,21,9,13,2,17,22,27,12,7,3,18,16,11,26,23,8,30,19,4,15) InMemoryRelation [key#5245,value#5246], true, 10, StorageLevel(true, true, false, true, 1), (PhysicalRDD [key#5245,value#5246], MapPartitionsRDD[3202] at mapPartitions at ExistingRDD.scala:35), Some(pruningData) == Physical Plan == Filter NOT key#5245 INSET (5,10,24,25,14,20,29,1,6,28,21,9,13,2,17,22,27,12,7,3,18,16,11,26,23,8,30,19,4,15) InMemoryColumnarTableScan [key#5245], [NOT key#5245 INSET (5,10,24,25,14,20,29,1,6,28,21,9,13,2,17,22,27,12,7,3,18,16,11,26,23,8,30,19,4,15)], (InMemoryRelation 
[key#5245,value#5246], true, 10, StorageLevel(true, true, false, true, 1), (PhysicalRDD [key#5245,value#5246], MapPartitionsRDD[3202] at mapPartitions at ExistingRDD.scala:35), Some(pruningData)) Code Generation: false == RDD == at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500) at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555) at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466) at org.apache.spark.sql.columnar.PartitionBatchPruningSuite$$anonfun$checkBatchPruning$1.apply$mcV$sp(PartitionBatchPruningSuite.scala:119) at org.apache.spark.sql.columnar.PartitionBatchPruningSuite$$anonfun$checkBatchPruning$1.apply(PartitionBatchPruningSuite.scala:107) at org.apache.spark.sql.columnar.PartitionBatchPruningSuite$$anonfun$checkBatchPruning$1.apply(PartitionBatchPruningSuite.scala:107) at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at
[jira] [Commented] (SPARK-5981) pyspark ML models should support predict/transform on vector within map
[ https://issues.apache.org/jira/browse/SPARK-5981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14337805#comment-14337805 ] Manoj Kumar commented on SPARK-5981: [~josephkb] Hi, Can I work on this? pyspark ML models should support predict/transform on vector within map --- Key: SPARK-5981 URL: https://issues.apache.org/jira/browse/SPARK-5981 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Currently, most Python models only have limited support for single-vector prediction. E.g., one can call {code}model.predict(myFeatureVector){code} for a single instance, but that fails within a map for Python ML models and transformers which use JavaModelWrapper: {code} data.map(lambda features: model.predict(features)) {code} This fails because JavaModelWrapper.call uses the SparkContext (within the transformation). (It works for linear models, which do prediction within Python.) Supporting prediction within a map would require storing the model and doing prediction/transformation within Python. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5185) pyspark --jars does not add classes to driver class path
[ https://issues.apache.org/jira/browse/SPARK-5185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14337840#comment-14337840 ] Tathagata Das commented on SPARK-5185: -- I also encountered this for KafkaUtils in Python. I am doing the said workaround. But we should fix this for the general case. pyspark --jars does not add classes to driver class path Key: SPARK-5185 URL: https://issues.apache.org/jira/browse/SPARK-5185 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.2.0 Reporter: Uri Laserson Assignee: Andrew Or I have some random class I want access to from an Spark shell, say {{com.cloudera.science.throwaway.ThrowAway}}. You can find the specific example I used here: https://gist.github.com/laserson/e9e3bd265e1c7a896652 I packaged it as {{throwaway.jar}}. If I then run {{bin/spark-shell}} like so: {code} bin/spark-shell --master local[1] --jars throwaway.jar {code} I can execute {code} val a = new com.cloudera.science.throwaway.ThrowAway() {code} Successfully. I now run PySpark like so: {code} PYSPARK_DRIVER_PYTHON=ipython bin/pyspark --master local[1] --jars throwaway.jar {code} which gives me an error when I try to instantiate the class through Py4J: {code} In [1]: sc._jvm.com.cloudera.science.throwaway.ThrowAway() --- Py4JError Traceback (most recent call last) ipython-input-1-4eedbe023c29 in module() 1 sc._jvm.com.cloudera.science.throwaway.ThrowAway() /Users/laserson/repos/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in __getattr__(self, name) 724 def __getattr__(self, name): 725 if name == '__call__': -- 726 raise Py4JError('Trying to call a package.') 727 new_fqn = self._fqn + '.' + name 728 command = REFLECTION_COMMAND_NAME +\ Py4JError: Trying to call a package. {code} However, if I explicitly add the {{--driver-class-path}} to add the same jar {code} PYSPARK_DRIVER_PYTHON=ipython bin/pyspark --master local[1] --jars throwaway.jar --driver-class-path throwaway.jar {code} it works {code} In [1]: sc._jvm.com.cloudera.science.throwaway.ThrowAway() Out[1]: JavaObject id=o18 {code} However, the docs state that {{--jars}} should also set the driver class path. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5185) pyspark --jars does not add classes to driver class path
[ https://issues.apache.org/jira/browse/SPARK-5185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-5185: - Assignee: Andrew Or (was: Burak Yavuz) pyspark --jars does not add classes to driver class path Key: SPARK-5185 URL: https://issues.apache.org/jira/browse/SPARK-5185 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.2.0 Reporter: Uri Laserson Assignee: Andrew Or I have some random class I want access to from an Spark shell, say {{com.cloudera.science.throwaway.ThrowAway}}. You can find the specific example I used here: https://gist.github.com/laserson/e9e3bd265e1c7a896652 I packaged it as {{throwaway.jar}}. If I then run {{bin/spark-shell}} like so: {code} bin/spark-shell --master local[1] --jars throwaway.jar {code} I can execute {code} val a = new com.cloudera.science.throwaway.ThrowAway() {code} Successfully. I now run PySpark like so: {code} PYSPARK_DRIVER_PYTHON=ipython bin/pyspark --master local[1] --jars throwaway.jar {code} which gives me an error when I try to instantiate the class through Py4J: {code} In [1]: sc._jvm.com.cloudera.science.throwaway.ThrowAway() --- Py4JError Traceback (most recent call last) ipython-input-1-4eedbe023c29 in module() 1 sc._jvm.com.cloudera.science.throwaway.ThrowAway() /Users/laserson/repos/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in __getattr__(self, name) 724 def __getattr__(self, name): 725 if name == '__call__': -- 726 raise Py4JError('Trying to call a package.') 727 new_fqn = self._fqn + '.' + name 728 command = REFLECTION_COMMAND_NAME +\ Py4JError: Trying to call a package. {code} However, if I explicitly add the {{--driver-class-path}} to add the same jar {code} PYSPARK_DRIVER_PYTHON=ipython bin/pyspark --master local[1] --jars throwaway.jar --driver-class-path throwaway.jar {code} it works {code} In [1]: sc._jvm.com.cloudera.science.throwaway.ThrowAway() Out[1]: JavaObject id=o18 {code} However, the docs state that {{--jars}} should also set the driver class path. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5185) pyspark --jars does not add classes to driver class path
[ https://issues.apache.org/jira/browse/SPARK-5185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-5185: - Assignee: Burak Yavuz pyspark --jars does not add classes to driver class path Key: SPARK-5185 URL: https://issues.apache.org/jira/browse/SPARK-5185 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.2.0 Reporter: Uri Laserson Assignee: Burak Yavuz I have some random class I want access to from an Spark shell, say {{com.cloudera.science.throwaway.ThrowAway}}. You can find the specific example I used here: https://gist.github.com/laserson/e9e3bd265e1c7a896652 I packaged it as {{throwaway.jar}}. If I then run {{bin/spark-shell}} like so: {code} bin/spark-shell --master local[1] --jars throwaway.jar {code} I can execute {code} val a = new com.cloudera.science.throwaway.ThrowAway() {code} Successfully. I now run PySpark like so: {code} PYSPARK_DRIVER_PYTHON=ipython bin/pyspark --master local[1] --jars throwaway.jar {code} which gives me an error when I try to instantiate the class through Py4J: {code} In [1]: sc._jvm.com.cloudera.science.throwaway.ThrowAway() --- Py4JError Traceback (most recent call last) ipython-input-1-4eedbe023c29 in module() 1 sc._jvm.com.cloudera.science.throwaway.ThrowAway() /Users/laserson/repos/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in __getattr__(self, name) 724 def __getattr__(self, name): 725 if name == '__call__': -- 726 raise Py4JError('Trying to call a package.') 727 new_fqn = self._fqn + '.' + name 728 command = REFLECTION_COMMAND_NAME +\ Py4JError: Trying to call a package. {code} However, if I explicitly add the {{--driver-class-path}} to add the same jar {code} PYSPARK_DRIVER_PYTHON=ipython bin/pyspark --master local[1] --jars throwaway.jar --driver-class-path throwaway.jar {code} it works {code} In [1]: sc._jvm.com.cloudera.science.throwaway.ThrowAway() Out[1]: JavaObject id=o18 {code} However, the docs state that {{--jars}} should also set the driver class path. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2168) History Server rendered page not suitable for load balancing
[ https://issues.apache.org/jira/browse/SPARK-2168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14337845#comment-14337845 ] Apache Spark commented on SPARK-2168: - User 'elyast' has created a pull request for this issue: https://github.com/apache/spark/pull/4778 History Server rendered page not suitable for load balancing --- Key: SPARK-2168 URL: https://issues.apache.org/jira/browse/SPARK-2168 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Lukasz Jastrzebski Priority: Minor Small issue but still. I run the history server through Marathon and balance it through haproxy. The problem is that the links generated by HistoryPage (links to completed applications) are absolute, e.g. {{<a href="http://some-server:port/history/...">completedApplicationName</a>}}, but instead they should be relative, e.g. {{<a href="/history/...">completedApplicationName</a>}}, so they can be load balanced. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
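To illustrate the point with Scala XML (illustrative only, not the actual HistoryPage code): a relative href is resolved against whatever host served the page, so it works behind a load balancer, while an absolute href pins the browser to one backend:
{code}
val appId = "app-20150226120000-0001" // example application id

// Pins the browser to a single backend host; breaks behind haproxy/Marathon.
val absolute = <a href={s"http://some-server:8080/history/$appId"}>{appId}</a>

// Resolved against whichever host served the page, so it load-balances cleanly.
val relative = <a href={s"/history/$appId"}>{appId}</a>
{code}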
[jira] [Created] (SPARK-6024) When a data source table has too many columns, cannot persist it in metastore.
Yin Huai created SPARK-6024: --- Summary: When a data source table has too many columns, cannot persist it in metastore. Key: SPARK-6024 URL: https://issues.apache.org/jira/browse/SPARK-6024 Project: Spark Issue Type: Bug Components: SQL Reporter: Yin Huai Priority: Blocker Because we are using table properties of a Hive metastore table to store the schema, when a schema is too wide, we cannot persist it in metastore. {code} 15/02/25 18:13:50 ERROR metastore.RetryingHMSHandler: Retrying HMSHandler after 1000 ms (attempt 1 of 1) with error: javax.jdo.JDODataStoreException: Put request failed : INSERT INTO TABLE_PARAMS (PARAM_VALUE,TBL_ID,PARAM_KEY) VALUES (?,?,?) at org.datanucleus.api.jdo.NucleusJDOHelper.getJDOExceptionForNucleusException(NucleusJDOHelper.java:451) at org.datanucleus.api.jdo.JDOPersistenceManager.jdoMakePersistent(JDOPersistenceManager.java:732) at org.datanucleus.api.jdo.JDOPersistenceManager.makePersistent(JDOPersistenceManager.java:752) at org.apache.hadoop.hive.metastore.ObjectStore.createTable(ObjectStore.java:719) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.hive.metastore.RawStoreProxy.invoke(RawStoreProxy.java:108) at com.sun.proxy.$Proxy15.createTable(Unknown Source) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.create_table_core(HiveMetaStore.java:1261) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.create_table_with_environment_context(HiveMetaStore.java:1294) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:105) at com.sun.proxy.$Proxy16.create_table_with_environment_context(Unknown Source) at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.createTable(HiveMetaStoreClient.java:558) at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.createTable(HiveMetaStoreClient.java:547) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:89) at com.sun.proxy.$Proxy17.createTable(Unknown Source) at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:613) at org.apache.spark.sql.hive.HiveMetastoreCatalog.createDataSourceTable(HiveMetastoreCatalog.scala:136) at org.apache.spark.sql.hive.execution.CreateMetastoreDataSourceAsSelect.run(commands.scala:243) at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:55) at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:55) at org.apache.spark.sql.execution.ExecutedCommand.execute(commands.scala:65) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:1092) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:1092) at 
org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:1013) at org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:963) at org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:929) at org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:907) at $line39.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:25) at $line39.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:30) at $line39.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:32) at $line39.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:34) at $line39.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:36) at $line39.$read$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:38) at $line39.$read$$iwC$$iwC$$iwC$$iwC.init(console:40) at $line39.$read$$iwC$$iwC$$iwC.init(console:42) at $line39.$read$$iwC$$iwC.init(console:44) at $line39.$read$$iwC.init(console:46) at $line39.$read.init(console:48) at
[jira] [Commented] (SPARK-6024) When a data source table has too many columns, its schema cannot be stored in metastore.
[ https://issues.apache.org/jira/browse/SPARK-6024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14337722#comment-14337722 ] Yin Huai commented on SPARK-6024: - Seems we need to split the schema's string representation... When a data source table has too many columns, it's schema cannot be stored in metastore. - Key: SPARK-6024 URL: https://issues.apache.org/jira/browse/SPARK-6024 Project: Spark Issue Type: Bug Components: SQL Reporter: Yin Huai Priority: Blocker Because we are using table properties of a Hive metastore table to store the schema, when a schema is too wide, we cannot persist it in metastore. {code} 15/02/25 18:13:50 ERROR metastore.RetryingHMSHandler: Retrying HMSHandler after 1000 ms (attempt 1 of 1) with error: javax.jdo.JDODataStoreException: Put request failed : INSERT INTO TABLE_PARAMS (PARAM_VALUE,TBL_ID,PARAM_KEY) VALUES (?,?,?) at org.datanucleus.api.jdo.NucleusJDOHelper.getJDOExceptionForNucleusException(NucleusJDOHelper.java:451) at org.datanucleus.api.jdo.JDOPersistenceManager.jdoMakePersistent(JDOPersistenceManager.java:732) at org.datanucleus.api.jdo.JDOPersistenceManager.makePersistent(JDOPersistenceManager.java:752) at org.apache.hadoop.hive.metastore.ObjectStore.createTable(ObjectStore.java:719) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.hive.metastore.RawStoreProxy.invoke(RawStoreProxy.java:108) at com.sun.proxy.$Proxy15.createTable(Unknown Source) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.create_table_core(HiveMetaStore.java:1261) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.create_table_with_environment_context(HiveMetaStore.java:1294) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:105) at com.sun.proxy.$Proxy16.create_table_with_environment_context(Unknown Source) at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.createTable(HiveMetaStoreClient.java:558) at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.createTable(HiveMetaStoreClient.java:547) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:89) at com.sun.proxy.$Proxy17.createTable(Unknown Source) at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:613) at org.apache.spark.sql.hive.HiveMetastoreCatalog.createDataSourceTable(HiveMetastoreCatalog.scala:136) at org.apache.spark.sql.hive.execution.CreateMetastoreDataSourceAsSelect.run(commands.scala:243) at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:55) at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:55) at 
org.apache.spark.sql.execution.ExecutedCommand.execute(commands.scala:65) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:1092) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:1092) at org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:1013) at org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:963) at org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:929) at org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:907) at $line39.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:25) at $line39.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:30) at $line39.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:32) at $line39.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:34) at $line39.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:36) at
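Regarding the comment above about splitting the schema's string representation: one straightforward approach is to serialize the schema to JSON and spread it across several numbered table properties, reassembling it on read. A minimal sketch, with illustrative property names and chunk size:
{code}
import org.apache.spark.sql.types.{DataType, StructType}

// Split the JSON form of the schema into chunks small enough for TABLE_PARAMS,
// recording the chunk count so the schema can be rebuilt later.
def schemaToProps(schema: StructType, chunkSize: Int = 4000): Map[String, String] = {
  val parts = schema.json.grouped(chunkSize).toSeq
  Map("spark.sql.sources.schema.numParts" -> parts.size.toString) ++
    parts.zipWithIndex.map { case (part, i) => s"spark.sql.sources.schema.part.$i" -> part }
}

def propsToSchema(props: Map[String, String]): StructType = {
  val numParts = props("spark.sql.sources.schema.numParts").toInt
  val json = (0 until numParts).map(i => props(s"spark.sql.sources.schema.part.$i")).mkString
  DataType.fromJson(json).asInstanceOf[StructType]
}
{code}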