[jira] [Resolved] (SPARK-2067) Spark logo in application UI uses absolute path
[ https://issues.apache.org/jira/browse/SPARK-2067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-2067. Resolution: Fixed Fix Version/s: 1.1.0 1.0.1 Issue resolved by pull request 1006 [https://github.com/apache/spark/pull/1006] Spark logo in application UI uses absolute path --- Key: SPARK-2067 URL: https://issues.apache.org/jira/browse/SPARK-2067 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.0.0 Reporter: Neville Li Priority: Trivial Fix For: 1.0.1, 1.1.0 Link of the Spark logo in application UI (top left corner) is hard coded to /, and points to the wrong page when running with YARN proxy. Should use uiRoot instead. -- This message was sent by Atlassian JIRA (v6.2#6252)
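A minimal Scala sketch of the uiRoot idea described above, for illustration only: it assumes the YARN proxy prefix is exposed via the APPLICATION_WEB_PROXY_BASE environment variable or the spark.ui.proxyBase system property (names assumed here), and shows the logo link built from that prefix instead of a hard-coded /.
{code}
object UiLinkSketch {
  // The prefix the YARN proxy serves the UI under; empty when there is no proxy.
  def uiRoot: String =
    sys.props.get("spark.ui.proxyBase")
      .orElse(sys.env.get("APPLICATION_WEB_PROXY_BASE"))
      .getOrElse("")

  // Prepend uiRoot instead of hard-coding href="/" so the logo link stays correct
  // when the UI is reached through the YARN proxy.
  def logoHref: String = uiRoot + "/"
}
{code}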
[jira] [Updated] (SPARK-2067) Spark logo in application UI uses absolute path
[ https://issues.apache.org/jira/browse/SPARK-2067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2067: --- Assignee: Neville Li Spark logo in application UI uses absolute path --- Key: SPARK-2067 URL: https://issues.apache.org/jira/browse/SPARK-2067 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.0.0 Reporter: Neville Li Assignee: Neville Li Priority: Trivial Fix For: 1.0.1, 1.1.0 Link of the Spark logo in application UI (top left corner) is hard coded to /, and points to the wrong page when running with YARN proxy. Should use uiRoot instead. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2077) Log serializer in use on application startup
Andrew Ash created SPARK-2077: - Summary: Log serializer in use on application startup Key: SPARK-2077 URL: https://issues.apache.org/jira/browse/SPARK-2077 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Andrew Ash In a recent mailing list thread a user was uncertain that their {{spark.serializer}} setting was in effect. Let's log the serializer being used to protect against typos on the setting. -- This message was sent by Atlassian JIRA (v6.2#6252)
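A minimal sketch of the proposed log line, assuming the serializer class is resolved from SparkConf with JavaSerializer as the default; this is an illustration of the idea, not the actual change in the linked pull request.
{code}
import org.apache.spark.SparkConf

object SerializerLogSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    // Resolve the configured serializer class; JavaSerializer is Spark's default.
    val serializerClass = conf.get("spark.serializer",
      "org.apache.spark.serializer.JavaSerializer")
    // Logging this once at startup makes typos in the setting visible immediately.
    println(s"Using serializer: $serializerClass")
  }
}
{code}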
[jira] [Commented] (SPARK-2077) Log serializer in use on application startup
[ https://issues.apache.org/jira/browse/SPARK-2077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14021670#comment-14021670 ] Andrew Ash commented on SPARK-2077: --- https://github.com/apache/spark/pull/1017 Log serializer in use on application startup Key: SPARK-2077 URL: https://issues.apache.org/jira/browse/SPARK-2077 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Andrew Ash In a recent mailing list thread a user was uncertain that their {{spark.serializer}} setting was in effect. Let's log the serializer being used to protect against typos on the setting. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2078) Use ISO8601 date formats in logging
[ https://issues.apache.org/jira/browse/SPARK-2078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14021709#comment-14021709 ] Andrew Ash commented on SPARK-2078: --- https://github.com/apache/spark/pull/1018 Use ISO8601 date formats in logging --- Key: SPARK-2078 URL: https://issues.apache.org/jira/browse/SPARK-2078 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Andrew Ash Currently, logging has 2 digit years and doesn't include milliseconds in logging timestamps. Use ISO8601 date formats instead of the current custom formats. There is some precedent here for ISO8601 format -- it's what [Hadoop uses|https://github.com/apache/hadoop-common/blob/d92a8a29978e35ed36c4d4721a21c356c1ff1d4d/hadoop-common-project/hadoop-minikdc/src/main/resources/log4j.properties] -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2078) Use ISO8601 date formats in logging
Andrew Ash created SPARK-2078: - Summary: Use ISO8601 date formats in logging Key: SPARK-2078 URL: https://issues.apache.org/jira/browse/SPARK-2078 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Andrew Ash Currently, logging has 2 digit years and doesn't include milliseconds in logging timestamps. Use ISO8601 date formats instead of the current custom formats. There is some precedent here for ISO8601 format -- it's what [Hadoop uses|https://github.com/apache/hadoop-common/blob/d92a8a29978e35ed36c4d4721a21c356c1ff1d4d/hadoop-common-project/hadoop-minikdc/src/main/resources/log4j.properties] -- This message was sent by Atlassian JIRA (v6.2#6252)
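For illustration, a log4j.properties fragment that switches to log4j's built-in ISO8601 date format (4-digit year, millisecond precision); the appender names here are generic and may not match Spark's shipped template exactly.
{code}
# conf/log4j.properties (fragment)
log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
# %d{ISO8601} prints e.g. 2014-06-09 15:36:27,123 instead of a 2-digit-year format
log4j.appender.console.layout.ConversionPattern=%d{ISO8601} %p %c{1}: %m%n
{code}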
[jira] [Commented] (SPARK-2044) Pluggable interface for shuffles
[ https://issues.apache.org/jira/browse/SPARK-2044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14021716#comment-14021716 ] Weihua Jiang commented on SPARK-2044: - Hi Matei, Thanks a lot for your reply. 1. I am confused about your idea of sorting flag. ``The goal is to allow diverse shuffle implementations, so it doesn't make sense to add a flag for it. If we add a flag, every ShuffleManager will need to implement this feature. Instead we're trying to make the smallest interface that the code consuming this data needs, so that we can try multiple implementations of ShuffleManager and see which of these features work best. The Ordering object means that keys are comparable. This flag here would be to tell the ShuffleManager to sort the data, so that downstream algorithms like joins can work more efficiently.`` For your first statement, it seems you want to keep interface minimal, thus no need-to-sort flag is allowed. But for your second statement, you are allowing user to ask ShuffleManager to perform sort for the data. From my point of view, it is better to have such a flag to allow user to ask ShuffleManager to perform sort. Thus, operation like SQL order by can be implemented more efficiently. ShuffleManager can provide some utility class to perform general sorting so that not every implementation needs to implement its own sorting logic. 2. I agree that, for ShuffleReader, read a partition range is more efficient. However, if we want to break the barrier between map and reduce stage, we will encounter a situation that, when a reducer starts, not all its partitions are ready. If using partition range, reducer will wait for all partitions to be ready before executing reducer. It is better if reducer can start execution when some (not all) partitions are ready. The POC code can be found at https://github.com/lirui-intel/spark/tree/removeStageBarrier. This is why I think we need another read() function to specify a partition list instead of a range. Pluggable interface for shuffles Key: SPARK-2044 URL: https://issues.apache.org/jira/browse/SPARK-2044 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Reporter: Matei Zaharia Assignee: Matei Zaharia Attachments: Pluggableshuffleproposal.pdf Given that a lot of the current activity in Spark Core is in shuffles, I wanted to propose factoring out shuffle implementations in a way that will make experimentation easier. Ideally we will converge on one implementation, but for a while, this could also be used to have several implementations coexist. I'm suggesting this because I aware of at least three efforts to look at shuffle (from Yahoo!, Intel and Databricks). Some of the things people are investigating are: * Push-based shuffle where data moves directly from mappers to reducers * Sorting-based instead of hash-based shuffle, to create fewer files (helps a lot with file handles and memory usage on large shuffles) * External spilling within a key * Changing the level of parallelism or even algorithm for downstream stages at runtime based on statistics of the map output (this is a thing we had prototyped in the Shark research project but never merged in core) I've attached a design doc with a proposed interface. It's not too crazy because the interface between shuffles and the rest of the code is already pretty narrow (just some iterators for reading data and a writer interface for writing it). 
Bigger changes will be needed in the interaction with DAGScheduler and BlockManager for some of the ideas above, but we can handle those separately, and this interface will allow us to experiment with some short-term stuff sooner. If things go well I'd also like to send a sort-based shuffle implementation for 1.1, but we'll see how the timing on that works out. -- This message was sent by Atlassian JIRA (v6.2#6252)
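To make the discussion above easier to follow, here is a rough Scala sketch of the kind of interface being debated; the trait and method names are illustrative and are not taken from the attached design doc.
{code}
// Illustrative sketch only.
trait ShuffleHandle { def shuffleId: Int }

trait ShuffleWriter[K, V] {
  def write(records: Iterator[(K, V)]): Unit
  def stop(): Unit
}

trait ShuffleReader[K, V] {
  // Returns the combined records for the partitions this reader was created for.
  def read(): Iterator[(K, V)]
}

trait ShuffleManager {
  // Called on the driver to register a shuffle and obtain a handle describing it.
  def registerShuffle(shuffleId: Int, numMaps: Int): ShuffleHandle
  // Writer used by a map task to produce its output for this shuffle.
  def getWriter[K, V](handle: ShuffleHandle, mapId: Int): ShuffleWriter[K, V]
  // Reader over a contiguous partition range [startPartition, endPartition); the
  // thread above asks for an additional read() variant taking an arbitrary partition
  // list so reducers could start before every map output is ready.
  def getReader[K, V](handle: ShuffleHandle, startPartition: Int, endPartition: Int): ShuffleReader[K, V]
}
{code}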
[jira] [Resolved] (SPARK-1308) Add partitions() method to PySpark RDDs
[ https://issues.apache.org/jira/browse/SPARK-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-1308. Resolution: Fixed Fix Version/s: 1.1.0 Assignee: Reynold Xin Add partitions() method to PySpark RDDs --- Key: SPARK-1308 URL: https://issues.apache.org/jira/browse/SPARK-1308 Project: Spark Issue Type: New Feature Components: PySpark Affects Versions: 0.9.0 Reporter: Nicholas Chammas Assignee: Reynold Xin Priority: Minor Fix For: 1.1.0 In Spark, you can do this: {code} // Scala val a = sc.parallelize(List(1, 2, 3, 4), 4) a.partitions.size {code} Please make this possible in PySpark too. The work-around available is quite simple: {code} # Python a = sc.parallelize([1, 2, 3, 4], 4) a._jrdd.splits().size() {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1308) Add getNumPartitions() method to PySpark RDDs
[ https://issues.apache.org/jira/browse/SPARK-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-1308: --- Assignee: Syed A. Hashmi (was: Reynold Xin) Add getNumPartitions() method to PySpark RDDs - Key: SPARK-1308 URL: https://issues.apache.org/jira/browse/SPARK-1308 Project: Spark Issue Type: New Feature Components: PySpark Affects Versions: 0.9.0 Reporter: Nicholas Chammas Assignee: Syed A. Hashmi Priority: Minor Fix For: 1.1.0 In Spark, you can do this: {code} // Scala val a = sc.parallelize(List(1, 2, 3, 4), 4) a.partitions.size {code} Please make this possible in PySpark too. The work-around available is quite simple: {code} # Python a = sc.parallelize([1, 2, 3, 4], 4) a._jrdd.splits().size() {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
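A minimal PySpark sketch of the requested method, built directly on the work-around quoted in the ticket; the final implementation may differ.
{code}
class RDD(object):
    """Simplified stand-in for pyspark.rdd.RDD."""

    def __init__(self, jrdd):
        self._jrdd = jrdd

    def getNumPartitions(self):
        """Return the number of partitions in this RDD."""
        return self._jrdd.splits().size()
{code}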
[jira] [Comment Edited] (SPARK-2044) Pluggable interface for shuffles
[ https://issues.apache.org/jira/browse/SPARK-2044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14021716#comment-14021716 ] Weihua Jiang edited comment on SPARK-2044 at 6/9/14 7:10 AM: - Hi Matei, Thanks a lot for your reply. 1. I am confused about your idea of sorting flag. ??The goal is to allow diverse shuffle implementations, so it doesn't make sense to add a flag for it. If we add a flag, every ShuffleManager will need to implement this feature. Instead we're trying to make the smallest interface that the code consuming this data needs, so that we can try multiple implementations of ShuffleManager and see which of these features work best. The Ordering object means that keys are comparable. This flag here would be to tell the ShuffleManager to sort the data, so that downstream algorithms like joins can work more efficiently.?? For your first statement, it seems you want to keep interface minimal, thus no need-to-sort flag is allowed. But for your second statement, you are allowing user to ask ShuffleManager to perform sort for the data. From my point of view, it is better to have such a flag to allow user to ask ShuffleManager to perform sort. Thus, operation like SQL order by can be implemented more efficiently. ShuffleManager can provide some utility class to perform general sorting so that not every implementation needs to implement its own sorting logic. 2. I agree that, for ShuffleReader, read a partition range is more efficient. However, if we want to break the barrier between map and reduce stage, we will encounter a situation that, when a reducer starts, not all its partitions are ready. If using partition range, reducer will wait for all partitions to be ready before executing reducer. It is better if reducer can start execution when some (not all) partitions are ready. The POC code can be found at https://github.com/lirui-intel/spark/tree/removeStageBarrier. This is why I think we need another read() function to specify a partition list instead of a range. was (Author: whjiang): Hi Matei, Thanks a lot for your reply. 1. I am confused about your idea of sorting flag. ``The goal is to allow diverse shuffle implementations, so it doesn't make sense to add a flag for it. If we add a flag, every ShuffleManager will need to implement this feature. Instead we're trying to make the smallest interface that the code consuming this data needs, so that we can try multiple implementations of ShuffleManager and see which of these features work best. The Ordering object means that keys are comparable. This flag here would be to tell the ShuffleManager to sort the data, so that downstream algorithms like joins can work more efficiently.`` For your first statement, it seems you want to keep interface minimal, thus no need-to-sort flag is allowed. But for your second statement, you are allowing user to ask ShuffleManager to perform sort for the data. From my point of view, it is better to have such a flag to allow user to ask ShuffleManager to perform sort. Thus, operation like SQL order by can be implemented more efficiently. ShuffleManager can provide some utility class to perform general sorting so that not every implementation needs to implement its own sorting logic. 2. I agree that, for ShuffleReader, read a partition range is more efficient. However, if we want to break the barrier between map and reduce stage, we will encounter a situation that, when a reducer starts, not all its partitions are ready. 
If using partition range, reducer will wait for all partitions to be ready before executing reducer. It is better if reducer can start execution when some (not all) partitions are ready. The POC code can be found at https://github.com/lirui-intel/spark/tree/removeStageBarrier. This is why I think we need another read() function to specify a partition list instead of a range. Pluggable interface for shuffles Key: SPARK-2044 URL: https://issues.apache.org/jira/browse/SPARK-2044 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Reporter: Matei Zaharia Assignee: Matei Zaharia Attachments: Pluggableshuffleproposal.pdf Given that a lot of the current activity in Spark Core is in shuffles, I wanted to propose factoring out shuffle implementations in a way that will make experimentation easier. Ideally we will converge on one implementation, but for a while, this could also be used to have several implementations coexist. I'm suggesting this because I aware of at least three efforts to look at shuffle (from Yahoo!, Intel and Databricks). Some of the things people are investigating are: * Push-based shuffle where data moves directly from mappers to reducers * Sorting-based
[jira] [Updated] (SPARK-1308) Add getNumPartitions() method to PySpark RDDs
[ https://issues.apache.org/jira/browse/SPARK-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-1308: --- Summary: Add getNumPartitions() method to PySpark RDDs (was: Add partitions() method to PySpark RDDs) Add getNumPartitions() method to PySpark RDDs - Key: SPARK-1308 URL: https://issues.apache.org/jira/browse/SPARK-1308 Project: Spark Issue Type: New Feature Components: PySpark Affects Versions: 0.9.0 Reporter: Nicholas Chammas Assignee: Reynold Xin Priority: Minor Fix For: 1.1.0 In Spark, you can do this: {code} // Scala val a = sc.parallelize(List(1, 2, 3, 4), 4) a.partitions.size {code} Please make this possible in PySpark too. The work-around available is quite simple: {code} # Python a = sc.parallelize([1, 2, 3, 4], 4) a._jrdd.splits().size() {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-2044) Pluggable interface for shuffles
[ https://issues.apache.org/jira/browse/SPARK-2044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14021716#comment-14021716 ] Weihua Jiang edited comment on SPARK-2044 at 6/9/14 7:11 AM: - Hi Matei, Thanks a lot for your reply. 1. I am confused about your idea of sorting flag. {quote} The goal is to allow diverse shuffle implementations, so it doesn't make sense to add a flag for it. If we add a flag, every ShuffleManager will need to implement this feature. Instead we're trying to make the smallest interface that the code consuming this data needs, so that we can try multiple implementations of ShuffleManager and see which of these features work best. The Ordering object means that keys are comparable. This flag here would be to tell the ShuffleManager to sort the data, so that downstream algorithms like joins can work more efficiently. {quote} For your first statement, it seems you want to keep interface minimal, thus no need-to-sort flag is allowed. But for your second statement, you are allowing user to ask ShuffleManager to perform sort for the data. From my point of view, it is better to have such a flag to allow user to ask ShuffleManager to perform sort. Thus, operation like SQL order by can be implemented more efficiently. ShuffleManager can provide some utility class to perform general sorting so that not every implementation needs to implement its own sorting logic. 2. I agree that, for ShuffleReader, read a partition range is more efficient. However, if we want to break the barrier between map and reduce stage, we will encounter a situation that, when a reducer starts, not all its partitions are ready. If using partition range, reducer will wait for all partitions to be ready before executing reducer. It is better if reducer can start execution when some (not all) partitions are ready. The POC code can be found at https://github.com/lirui-intel/spark/tree/removeStageBarrier. This is why I think we need another read() function to specify a partition list instead of a range. was (Author: whjiang): Hi Matei, Thanks a lot for your reply. 1. I am confused about your idea of sorting flag. ??The goal is to allow diverse shuffle implementations, so it doesn't make sense to add a flag for it. If we add a flag, every ShuffleManager will need to implement this feature. Instead we're trying to make the smallest interface that the code consuming this data needs, so that we can try multiple implementations of ShuffleManager and see which of these features work best. The Ordering object means that keys are comparable. This flag here would be to tell the ShuffleManager to sort the data, so that downstream algorithms like joins can work more efficiently.?? For your first statement, it seems you want to keep interface minimal, thus no need-to-sort flag is allowed. But for your second statement, you are allowing user to ask ShuffleManager to perform sort for the data. From my point of view, it is better to have such a flag to allow user to ask ShuffleManager to perform sort. Thus, operation like SQL order by can be implemented more efficiently. ShuffleManager can provide some utility class to perform general sorting so that not every implementation needs to implement its own sorting logic. 2. I agree that, for ShuffleReader, read a partition range is more efficient. However, if we want to break the barrier between map and reduce stage, we will encounter a situation that, when a reducer starts, not all its partitions are ready. 
If using partition range, reducer will wait for all partitions to be ready before executing reducer. It is better if reducer can start execution when some (not all) partitions are ready. The POC code can be found at https://github.com/lirui-intel/spark/tree/removeStageBarrier. This is why I think we need another read() function to specify a partition list instead of a range. Pluggable interface for shuffles Key: SPARK-2044 URL: https://issues.apache.org/jira/browse/SPARK-2044 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Reporter: Matei Zaharia Assignee: Matei Zaharia Attachments: Pluggableshuffleproposal.pdf Given that a lot of the current activity in Spark Core is in shuffles, I wanted to propose factoring out shuffle implementations in a way that will make experimentation easier. Ideally we will converge on one implementation, but for a while, this could also be used to have several implementations coexist. I'm suggesting this because I aware of at least three efforts to look at shuffle (from Yahoo!, Intel and Databricks). Some of the things people are investigating are: * Push-based shuffle where data moves directly from mappers to reducers *
[jira] [Commented] (SPARK-1944) Document --verbose in spark-shell -h
[ https://issues.apache.org/jira/browse/SPARK-1944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14021726#comment-14021726 ] Andrew Ash commented on SPARK-1944: --- https://github.com/apache/spark/pull/1020 Document --verbose in spark-shell -h Key: SPARK-1944 URL: https://issues.apache.org/jira/browse/SPARK-1944 Project: Spark Issue Type: Documentation Components: Spark Core Affects Versions: 1.0.0 Reporter: Andrew Ash Assignee: Andrew Ash Priority: Minor The below help for spark-submit should make mention of the {{--verbose}} option {noformat} aash@aash-mbp ~/git/spark$ ./bin/spark-submit -h Usage: spark-submit [options] app jar [app options] Options: --master MASTER_URL spark://host:port, mesos://host:port, yarn, or local. --deploy-mode DEPLOY_MODE Mode to deploy the app in, either 'client' or 'cluster'. --class CLASS_NAME Name of your app's main class (required for Java apps). --arg ARG Argument to be passed to your application's main class. This option can be specified multiple times for multiple args. --name NAME The name of your application (Default: 'Spark'). --jars JARS A comma-separated list of local jars to include on the driver classpath and that SparkContext.addJar will work with. Doesn't work on standalone with 'cluster' deploy mode. --files FILES Comma separated list of files to be placed in the working dir of each executor. --properties-file FILE Path to a file from which to load extra properties. If not specified, this will look for conf/spark-defaults.conf. --driver-memory MEM Memory for driver (e.g. 1000M, 2G) (Default: 512M). --driver-java-options Extra Java options to pass to the driver --driver-library-path Extra library path entries to pass to the driver --driver-class-path Extra class path entries to pass to the driver. Note that jars added with --jars are automatically included in the classpath. --executor-memory MEM Memory per executor (e.g. 1000M, 2G) (Default: 1G). Spark standalone with cluster deploy mode only: --driver-cores NUM Cores for driver (Default: 1). --supervise If given, restarts the driver on failure. Spark standalone and Mesos only: --total-executor-cores NUM Total cores for all executors. YARN-only: --executor-cores NUMNumber of cores per executor (Default: 1). --queue QUEUE_NAME The YARN queue to submit to (Default: 'default'). --num-executors NUM Number of executors to (Default: 2). --archives ARCHIVES Comma separated list of archives to be extracted into the working dir of each executor. aash@aash-mbp ~/git/spark$ {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1977) mutable.BitSet in ALS not serializable with KryoSerializer
[ https://issues.apache.org/jira/browse/SPARK-1977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14021797#comment-14021797 ] Santiago M. Mola commented on SPARK-1977: - Xiangrui Meng, I can't reproduce it at the moment. It takes a quite big dataset to reproduce and I have my machines busy. But I'm pretty sure the stacktrace is exactly the same as the one posted by Neville Li. My bet is that this will be fixed with next Twitter Chill release: https://github.com/twitter/chill/commit/b47512c2c75b94b7c5945985306fa303576bf90d mutable.BitSet in ALS not serializable with KryoSerializer -- Key: SPARK-1977 URL: https://issues.apache.org/jira/browse/SPARK-1977 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.0.0 Reporter: Neville Li Priority: Minor OutLinkBlock in ALS.scala has an Array[mutable.BitSet] member. KryoSerializer uses AllScalaRegistrar from Twitter chill but it doesn't register mutable.BitSet. Right now we have to register mutable.BitSet manually. A proper fix would be using immutable.BitSet in ALS or register mutable.BitSet in upstream chill. {code} Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1724.0:9 failed 4 times, most recent failure: Exception failure in TID 68548 on host lon4-hadoopslave-b232.lon4.spotify.net: com.esotericsoftware.kryo.KryoException: java.lang.ArrayStoreException: scala.collection.mutable.HashSet Serialization trace: shouldSend (org.apache.spark.mllib.recommendation.OutLinkBlock) com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626) com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221) com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:43) com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:34) com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:115) org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:125) org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71) org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:155) org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:154) scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:154) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:77) org.apache.spark.rdd.RDD.iterator(RDD.scala:227) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111) org.apache.spark.scheduler.Task.run(Task.scala:51) 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187) java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) java.lang.Thread.run(Thread.java:662) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015) at
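A sketch of the manual-registration work-around mentioned above; the registrator class name and wiring are illustrative, but KryoRegistrator and the spark.kryo.registrator setting are the standard hooks.
{code}
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

// Register mutable.BitSet explicitly until an upstream chill release covers it.
class BitSetRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[scala.collection.mutable.BitSet])
  }
}

// Wiring (class name is illustrative):
//   spark.serializer        org.apache.spark.serializer.KryoSerializer
//   spark.kryo.registrator  BitSetRegistrator
{code}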
[jira] [Updated] (SPARK-1719) spark.executor.extraLibraryPath isn't applied on yarn
[ https://issues.apache.org/jira/browse/SPARK-1719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guoqiang Li updated SPARK-1719: --- Fix Version/s: 1.1.0 spark.executor.extraLibraryPath isn't applied on yarn - Key: SPARK-1719 URL: https://issues.apache.org/jira/browse/SPARK-1719 Project: Spark Issue Type: Sub-task Components: YARN Affects Versions: 1.0.0 Reporter: Thomas Graves Assignee: Guoqiang Li Fix For: 1.1.0 Looking through the code for spark on yarn, I don't see that spark.executor.extraLibraryPath is being properly applied when it launches executors. It is using spark.driver.libraryPath in ClientBase. Note I didn't actually test it, so it's possible I missed something. I also think it is better to use LD_LIBRARY_PATH rather than -Djava.library.path: once java.library.path is set, it doesn't search LD_LIBRARY_PATH. In Hadoop we switched to use LD_LIBRARY_PATH instead of java.library.path. See https://issues.apache.org/jira/browse/MAPREDUCE-4072. I'll split this into a separate jira. -- This message was sent by Atlassian JIRA (v6.2#6252)
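For reference, the setting in question as it would appear in spark-defaults.conf; the path is a placeholder.
{code}
# Not currently honored by the YARN executor launcher, per this issue:
spark.executor.extraLibraryPath  /opt/hadoop/lib/native
{code}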
[jira] [Commented] (SPARK-2071) Package private classes that are deleted from an older version of Spark trigger errors
[ https://issues.apache.org/jira/browse/SPARK-2071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14025239#comment-14025239 ] Prashant Sharma commented on SPARK-2071: Or manually place the jar of the older version on ./spark-class before invoking GenerateMimaIgnore. Package private classes that are deleted from an older version of Spark trigger errors -- Key: SPARK-2071 URL: https://issues.apache.org/jira/browse/SPARK-2071 Project: Spark Issue Type: Sub-task Components: Build Reporter: Patrick Wendell Assignee: Prashant Sharma Fix For: 1.1.0 We should figure out how to fix this. One idea is to run the MIMA exclude generator with sbt itself (rather than ./spark-class) so it can run against the older versions of Spark and make sure to exclude classes that are marked as package private in that version as well. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1291) Link the spark UI to RM ui in yarn-client mode
[ https://issues.apache.org/jira/browse/SPARK-1291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14025244#comment-14025244 ] Thomas Graves commented on SPARK-1291: -- https://github.com/apache/spark/pull/1002 Link the spark UI to RM ui in yarn-client mode -- Key: SPARK-1291 URL: https://issues.apache.org/jira/browse/SPARK-1291 Project: Spark Issue Type: Improvement Affects Versions: 0.9.0, 1.0.0 Reporter: Thomas Graves Currently when you run spark on yarn in the yarn-client mode the spark UI is not linked up to the Yarn Resource manager UI so its harder for a user of YARN to find the UI. Note that in yarn-standalone/yarn-cluster mode it is properly linked up. Ideally the yarn-client UI should also be hooked up to the Yarn RM proxy for security. The challenge with the yarn-client mode is that the UI is started before the application master and it doesn't know what the yarn proxy link is when the UI started. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2075) Hadoop1 distribution of 1.0.0 does not contain classes expected by the Maven 1.0.0 artifact
[ https://issues.apache.org/jira/browse/SPARK-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14025255#comment-14025255 ] Paul R. Brown commented on SPARK-2075: -- The job is run by a Java client that connects to the master (using a SparkContext). Bundling is performed by a Maven build with two shade plugin invocations, one to package a driver uberjar and one to package a worker uberjar. The worker flavor is sent to the worker nodes; the driver contains the code to connect to the master and run the job. The Maven build runs against the JAR from Maven Central, and the deployment uses the Spark 1.0.0 hadoop1 download. (The Spark download is staged to S3 once and then downloaded onto master/worker nodes and set up during cluster provisioning.) The Maven build uses the usual Scala setup with the library as a dependency and the plugin:
{code}
<dependency>
  <groupId>org.scala-lang</groupId>
  <artifactId>scala-library</artifactId>
  <version>2.10.3</version>
</dependency>
{code}
{code}
<plugin>
  <groupId>net.alchim31.maven</groupId>
  <artifactId>scala-maven-plugin</artifactId>
  <executions>
    <execution>
      <goals>
        <goal>compile</goal>
        <goal>testCompile</goal>
      </goals>
    </execution>
  </executions>
  <configuration>
    <scalaVersion>2.10.3</scalaVersion>
    <jvmArgs>
      <jvmArg>-Xms64m</jvmArg>
      <jvmArg>-Xmx4096m</jvmArg>
    </jvmArgs>
  </configuration>
</plugin>
{code}
Hadoop1 distribution of 1.0.0 does not contain classes expected by the Maven 1.0.0 artifact --- Key: SPARK-2075 URL: https://issues.apache.org/jira/browse/SPARK-2075 Project: Spark Issue Type: Bug Components: Build, Spark Core Affects Versions: 1.0.0 Reporter: Paul R. Brown Running a job built against the Maven dep for 1.0.0 and the hadoop1 distribution produces: {code} java.lang.ClassNotFoundException: org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1 {code} Here's what's in the Maven dep as of 1.0.0: {code} jar tvf ~/.m2/repository/org/apache/spark/spark-core_2.10/1.0.0/spark-core_2.10-1.0.0.jar | grep 'rdd/RDD' | grep 'saveAs' 1519 Mon May 26 13:57:58 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class 1560 Mon May 26 13:57:58 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class {code} And here's what's in the hadoop1 distribution: {code} jar tvf spark-assembly-1.0.0-hadoop1.0.4.jar| grep 'rdd/RDD' | grep 'saveAs' {code} I.e., it's not there. It is in the hadoop2 distribution: {code} jar tvf spark-assembly-1.0.0-hadoop2.2.0.jar| grep 'rdd/RDD' | grep 'saveAs' 1519 Mon May 26 07:29:54 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class 1560 Mon May 26 07:29:54 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-2075) Hadoop1 distribution of 1.0.0 does not contain classes expected by the Maven 1.0.0 artifact
[ https://issues.apache.org/jira/browse/SPARK-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14025296#comment-14025296 ] Paul R. Brown edited comment on SPARK-2075 at 6/9/14 3:36 PM: -- As food for thought, [here|http://docs.oracle.com/javase/specs/jvms/se7/html/jvms-4.html#jvms-4.7.6] is the {{code}}InnerClass{{code}} section of the JVM spec. It looks like there have been some changes from 2.10.3 to 2.10.4 (e.g., [SI-6546|https://issues.scala-lang.org/browse/SI-6546], but I didn't dig in. I think the thing most likely to work is to ensure that exactly the same bits are used by all of the distributions and posted to Maven Central. (For some discussion on inner class naming stability, there was quite a bit of it on the Java 8 lambda discussion list, e.g., [this message|http://mail.openjdk.java.net/pipermail/lambda-spec-experts/2013-July/000316.html].) was (Author: paulrbrown): As food for thought, [here|http://docs.oracle.com/javase/specs/jvms/se7/html/jvms-4.html#jvms-4.7.6] is the {code}InnerClass{code} section of the JVM spec. It looks like there have been some changes from 2.10.3 to 2.10.4 (e.g., [SI-6546|https://issues.scala-lang.org/browse/SI-6546], but I didn't dig in. I think the thing most likely to work is to ensure that exactly the same bits are used by all of the distributions and posted to Maven Central. (For some discussion on inner class naming stability, there was quite a bit of it on the Java 8 lambda discussion list, e.g., [this message|http://mail.openjdk.java.net/pipermail/lambda-spec-experts/2013-July/000316.html].) Hadoop1 distribution of 1.0.0 does not contain classes expected by the Maven 1.0.0 artifact --- Key: SPARK-2075 URL: https://issues.apache.org/jira/browse/SPARK-2075 Project: Spark Issue Type: Bug Components: Build, Spark Core Affects Versions: 1.0.0 Reporter: Paul R. Brown Running a job built against the Maven dep for 1.0.0 and the hadoop1 distribution produces: {code} java.lang.ClassNotFoundException: org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1 {code} Here's what's in the Maven dep as of 1.0.0: {code} jar tvf ~/.m2/repository/org/apache/spark/spark-core_2.10/1.0.0/spark-core_2.10-1.0.0.jar | grep 'rdd/RDD' | grep 'saveAs' 1519 Mon May 26 13:57:58 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class 1560 Mon May 26 13:57:58 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class {code} And here's what's in the hadoop1 distribution: {code} jar tvf spark-assembly-1.0.0-hadoop1.0.4.jar| grep 'rdd/RDD' | grep 'saveAs' {code} I.e., it's not there. It is in the hadoop2 distribution: {code} jar tvf spark-assembly-1.0.0-hadoop2.2.0.jar| grep 'rdd/RDD' | grep 'saveAs' 1519 Mon May 26 07:29:54 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class 1560 Mon May 26 07:29:54 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2075) Hadoop1 distribution of 1.0.0 does not contain classes expected by the Maven 1.0.0 artifact
[ https://issues.apache.org/jira/browse/SPARK-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14025296#comment-14025296 ] Paul R. Brown commented on SPARK-2075: -- As food for thought, [here|http://docs.oracle.com/javase/specs/jvms/se7/html/jvms-4.html#jvms-4.7.6] is the {code}InnerClass{code} section of the JVM spec. It looks like there have been some changes from 2.10.3 to 2.10.4 (e.g., [SI-6546|https://issues.scala-lang.org/browse/SI-6546], but I didn't dig in. I think the thing most likely to work is to ensure that exactly the same bits are used by all of the distributions and posted to Maven Central. (For some discussion on inner class naming stability, there was quite a bit of it on the Java 8 lambda discussion list, e.g., [this message|http://mail.openjdk.java.net/pipermail/lambda-spec-experts/2013-July/000316.html].) Hadoop1 distribution of 1.0.0 does not contain classes expected by the Maven 1.0.0 artifact --- Key: SPARK-2075 URL: https://issues.apache.org/jira/browse/SPARK-2075 Project: Spark Issue Type: Bug Components: Build, Spark Core Affects Versions: 1.0.0 Reporter: Paul R. Brown Running a job built against the Maven dep for 1.0.0 and the hadoop1 distribution produces: {code} java.lang.ClassNotFoundException: org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1 {code} Here's what's in the Maven dep as of 1.0.0: {code} jar tvf ~/.m2/repository/org/apache/spark/spark-core_2.10/1.0.0/spark-core_2.10-1.0.0.jar | grep 'rdd/RDD' | grep 'saveAs' 1519 Mon May 26 13:57:58 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class 1560 Mon May 26 13:57:58 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class {code} And here's what's in the hadoop1 distribution: {code} jar tvf spark-assembly-1.0.0-hadoop1.0.4.jar| grep 'rdd/RDD' | grep 'saveAs' {code} I.e., it's not there. It is in the hadoop2 distribution: {code} jar tvf spark-assembly-1.0.0-hadoop2.2.0.jar| grep 'rdd/RDD' | grep 'saveAs' 1519 Mon May 26 07:29:54 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class 1560 Mon May 26 07:29:54 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-2075) Hadoop1 distribution of 1.0.0 does not contain classes expected by the Maven 1.0.0 artifact
[ https://issues.apache.org/jira/browse/SPARK-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14025296#comment-14025296 ] Paul R. Brown edited comment on SPARK-2075 at 6/9/14 3:37 PM: -- As food for thought, [here|http://docs.oracle.com/javase/specs/jvms/se7/html/jvms-4.html#jvms-4.7.6] is the {{InnerClass}} section of the JVM spec. It looks like there have been some changes from 2.10.3 to 2.10.4 (e.g., [SI-6546|https://issues.scala-lang.org/browse/SI-6546], but I didn't dig in. I think the thing most likely to work is to ensure that exactly the same bits are used by all of the distributions and posted to Maven Central. (For some discussion on inner class naming stability, there was quite a bit of it on the Java 8 lambda discussion list, e.g., [this message|http://mail.openjdk.java.net/pipermail/lambda-spec-experts/2013-July/000316.html].) was (Author: paulrbrown): As food for thought, [here|http://docs.oracle.com/javase/specs/jvms/se7/html/jvms-4.html#jvms-4.7.6] is the {{code}}InnerClass{{code}} section of the JVM spec. It looks like there have been some changes from 2.10.3 to 2.10.4 (e.g., [SI-6546|https://issues.scala-lang.org/browse/SI-6546], but I didn't dig in. I think the thing most likely to work is to ensure that exactly the same bits are used by all of the distributions and posted to Maven Central. (For some discussion on inner class naming stability, there was quite a bit of it on the Java 8 lambda discussion list, e.g., [this message|http://mail.openjdk.java.net/pipermail/lambda-spec-experts/2013-July/000316.html].) Hadoop1 distribution of 1.0.0 does not contain classes expected by the Maven 1.0.0 artifact --- Key: SPARK-2075 URL: https://issues.apache.org/jira/browse/SPARK-2075 Project: Spark Issue Type: Bug Components: Build, Spark Core Affects Versions: 1.0.0 Reporter: Paul R. Brown Running a job built against the Maven dep for 1.0.0 and the hadoop1 distribution produces: {code} java.lang.ClassNotFoundException: org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1 {code} Here's what's in the Maven dep as of 1.0.0: {code} jar tvf ~/.m2/repository/org/apache/spark/spark-core_2.10/1.0.0/spark-core_2.10-1.0.0.jar | grep 'rdd/RDD' | grep 'saveAs' 1519 Mon May 26 13:57:58 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class 1560 Mon May 26 13:57:58 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class {code} And here's what's in the hadoop1 distribution: {code} jar tvf spark-assembly-1.0.0-hadoop1.0.4.jar| grep 'rdd/RDD' | grep 'saveAs' {code} I.e., it's not there. It is in the hadoop2 distribution: {code} jar tvf spark-assembly-1.0.0-hadoop2.2.0.jar| grep 'rdd/RDD' | grep 'saveAs' 1519 Mon May 26 07:29:54 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class 1560 Mon May 26 07:29:54 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-2075) Hadoop1 distribution of 1.0.0 does not contain classes expected by the Maven 1.0.0 artifact
[ https://issues.apache.org/jira/browse/SPARK-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14025296#comment-14025296 ] Paul R. Brown edited comment on SPARK-2075 at 6/9/14 3:54 PM: -- As food for thought, [here|http://docs.oracle.com/javase/specs/jvms/se7/html/jvms-4.html#jvms-4.7.6] is the {{InnerClass}} section of the JVM spec. It looks like there have been some changes from 2.10.3 to 2.10.4 (e.g., [SI-6546|https://issues.scala-lang.org/browse/SI-6546]), but I didn't dig in. I think the thing most likely to work is to ensure that exactly the same bits are used by all of the distributions and posted to Maven Central. (For some discussion on inner class naming stability, there was quite a bit of it on the Java 8 lambda discussion list, e.g., [this message|http://mail.openjdk.java.net/pipermail/lambda-spec-experts/2013-July/000316.html].) was (Author: paulrbrown): As food for thought, [here|http://docs.oracle.com/javase/specs/jvms/se7/html/jvms-4.html#jvms-4.7.6] is the {{InnerClass}} section of the JVM spec. It looks like there have been some changes from 2.10.3 to 2.10.4 (e.g., [SI-6546|https://issues.scala-lang.org/browse/SI-6546], but I didn't dig in. I think the thing most likely to work is to ensure that exactly the same bits are used by all of the distributions and posted to Maven Central. (For some discussion on inner class naming stability, there was quite a bit of it on the Java 8 lambda discussion list, e.g., [this message|http://mail.openjdk.java.net/pipermail/lambda-spec-experts/2013-July/000316.html].) Hadoop1 distribution of 1.0.0 does not contain classes expected by the Maven 1.0.0 artifact --- Key: SPARK-2075 URL: https://issues.apache.org/jira/browse/SPARK-2075 Project: Spark Issue Type: Bug Components: Build, Spark Core Affects Versions: 1.0.0 Reporter: Paul R. Brown Running a job built against the Maven dep for 1.0.0 and the hadoop1 distribution produces: {code} java.lang.ClassNotFoundException: org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1 {code} Here's what's in the Maven dep as of 1.0.0: {code} jar tvf ~/.m2/repository/org/apache/spark/spark-core_2.10/1.0.0/spark-core_2.10-1.0.0.jar | grep 'rdd/RDD' | grep 'saveAs' 1519 Mon May 26 13:57:58 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class 1560 Mon May 26 13:57:58 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class {code} And here's what's in the hadoop1 distribution: {code} jar tvf spark-assembly-1.0.0-hadoop1.0.4.jar| grep 'rdd/RDD' | grep 'saveAs' {code} I.e., it's not there. It is in the hadoop2 distribution: {code} jar tvf spark-assembly-1.0.0-hadoop2.2.0.jar| grep 'rdd/RDD' | grep 'saveAs' 1519 Mon May 26 07:29:54 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class 1560 Mon May 26 07:29:54 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2080) Yarn: history UI link missing, wrong reported user
Marcelo Vanzin created SPARK-2080: - Summary: Yarn: history UI link missing, wrong reported user Key: SPARK-2080 URL: https://issues.apache.org/jira/browse/SPARK-2080 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.0.0 Reporter: Marcelo Vanzin In Yarn client mode, the History UI link is not set for finished applications (it is for cluster mode). In Yarn cluster mode, the user reported by the application is wrong - it reports the user running the Yarn service, not the user running the Yarn application. PR is up: https://github.com/apache/spark/pull/1002 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2079) Skip unnecessary wrapping in List when serializing SchemaRDD to Python
Kan Zhang created SPARK-2079: Summary: Skip unnecessary wrapping in List when serializing SchemaRDD to Python Key: SPARK-2079 URL: https://issues.apache.org/jira/browse/SPARK-2079 Project: Spark Issue Type: Improvement Components: PySpark, SQL Affects Versions: 1.0.0 Reporter: Kan Zhang -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2080) Yarn: history UI link missing, wrong reported user
[ https://issues.apache.org/jira/browse/SPARK-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14025346#comment-14025346 ] Marcelo Vanzin commented on SPARK-2080: --- Patrick / someone, I can't seem to be able to assign bugs to myself anymore, could someone do that? Thanks. Yarn: history UI link missing, wrong reported user -- Key: SPARK-2080 URL: https://issues.apache.org/jira/browse/SPARK-2080 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.0.0 Reporter: Marcelo Vanzin In Yarn client mode, the History UI link is not set for finished applications (it is for cluster mode). In Yarn cluster mode, the user reported by the application is wrong - it reports the user running the Yarn service, not the user running the Yarn application. PR is up: https://github.com/apache/spark/pull/1002 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2079) Skip unnecessary wrapping in List when serializing SchemaRDD to Python
[ https://issues.apache.org/jira/browse/SPARK-2079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14025363#comment-14025363 ] Kan Zhang commented on SPARK-2079: -- PR: https://github.com/apache/spark/pull/1023 Skip unnecessary wrapping in List when serializing SchemaRDD to Python -- Key: SPARK-2079 URL: https://issues.apache.org/jira/browse/SPARK-2079 Project: Spark Issue Type: Improvement Components: PySpark, SQL Affects Versions: 1.0.0 Reporter: Kan Zhang Assignee: Kan Zhang Finishing the TODO:
{code}
private[sql] def javaToPython: JavaRDD[Array[Byte]] = {
  val fieldNames: Seq[String] = this.queryExecution.analyzed.output.map(_.name)
  this.mapPartitions { iter =>
    val pickle = new Pickler
    iter.map { row =>
      val map: JMap[String, Any] = new java.util.HashMap
      // TODO: We place the map in an ArrayList so that the object is pickled to a List[Dict].
      // Ideally we should be able to pickle an object directly into a Python collection so we
      // don't have to create an ArrayList every time.
      val arr: java.util.ArrayList[Any] = new java.util.ArrayList
      row.zip(fieldNames).foreach { case (obj, name) => map.put(name, obj) }
      arr.add(map)
      pickle.dumps(arr)
    }
  }
}
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2079) Removing unnecessary wrapping when serializing SchemaRDD to Python
[ https://issues.apache.org/jira/browse/SPARK-2079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kan Zhang updated SPARK-2079: - Summary: Removing unnecessary wrapping when serializing SchemaRDD to Python (was: Skip unnecessary wrapping in List when serializing SchemaRDD to Python) Removing unnecessary wrapping when serializing SchemaRDD to Python -- Key: SPARK-2079 URL: https://issues.apache.org/jira/browse/SPARK-2079 Project: Spark Issue Type: Improvement Components: PySpark, SQL Affects Versions: 1.0.0 Reporter: Kan Zhang Assignee: Kan Zhang Finishing the TODO:
{code}
private[sql] def javaToPython: JavaRDD[Array[Byte]] = {
  val fieldNames: Seq[String] = this.queryExecution.analyzed.output.map(_.name)
  this.mapPartitions { iter =>
    val pickle = new Pickler
    iter.map { row =>
      val map: JMap[String, Any] = new java.util.HashMap
      // TODO: We place the map in an ArrayList so that the object is pickled to a List[Dict].
      // Ideally we should be able to pickle an object directly into a Python collection so we
      // don't have to create an ArrayList every time.
      val arr: java.util.ArrayList[Any] = new java.util.ArrayList
      row.zip(fieldNames).foreach { case (obj, name) => map.put(name, obj) }
      arr.add(map)
      pickle.dumps(arr)
    }
  }
}
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
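A sketch of the direction the new summary describes, pickling each row's map directly instead of wrapping it in a single-element ArrayList; the helper method and signature below are illustrative, not the merged change.
{code}
import java.util.{HashMap => JHashMap, Map => JMap}
import net.razorvine.pickle.Pickler

object PickleSketch {
  // rows and fieldNames stand in for the SchemaRDD internals referenced in the TODO.
  def pickleRows(rows: Iterator[Seq[Any]], fieldNames: Seq[String]): Iterator[Array[Byte]] = {
    val pickle = new Pickler
    rows.map { row =>
      val map: JMap[String, Any] = new JHashMap[String, Any]
      row.zip(fieldNames).foreach { case (obj, name) => map.put(name, obj) }
      // Pickle the map itself so Python receives a dict, skipping the ArrayList wrapper.
      pickle.dumps(map)
    }
  }
}
{code}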
[jira] [Commented] (SPARK-1944) Document --verbose in spark-shell -h
[ https://issues.apache.org/jira/browse/SPARK-1944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14025408#comment-14025408 ] Patrick Wendell commented on SPARK-1944: Accidental edit - my bad! Document --verbose in spark-shell -h Key: SPARK-1944 URL: https://issues.apache.org/jira/browse/SPARK-1944 Project: Spark Issue Type: Documentation Components: Spark Core Affects Versions: 1.0.0 Reporter: Andrew Ash Assignee: Andrew Ash Priority: Minor Fix For: 1.0.1, 1.1.0 The below help for spark-submit should make mention of the {{--verbose}} option {noformat} aash@aash-mbp ~/git/spark$ ./bin/spark-submit -h Usage: spark-submit [options] app jar [app options] Options: --master MASTER_URL spark://host:port, mesos://host:port, yarn, or local. --deploy-mode DEPLOY_MODE Mode to deploy the app in, either 'client' or 'cluster'. --class CLASS_NAME Name of your app's main class (required for Java apps). --arg ARG Argument to be passed to your application's main class. This option can be specified multiple times for multiple args. --name NAME The name of your application (Default: 'Spark'). --jars JARS A comma-separated list of local jars to include on the driver classpath and that SparkContext.addJar will work with. Doesn't work on standalone with 'cluster' deploy mode. --files FILES Comma separated list of files to be placed in the working dir of each executor. --properties-file FILE Path to a file from which to load extra properties. If not specified, this will look for conf/spark-defaults.conf. --driver-memory MEM Memory for driver (e.g. 1000M, 2G) (Default: 512M). --driver-java-options Extra Java options to pass to the driver --driver-library-path Extra library path entries to pass to the driver --driver-class-path Extra class path entries to pass to the driver. Note that jars added with --jars are automatically included in the classpath. --executor-memory MEM Memory per executor (e.g. 1000M, 2G) (Default: 1G). Spark standalone with cluster deploy mode only: --driver-cores NUM Cores for driver (Default: 1). --supervise If given, restarts the driver on failure. Spark standalone and Mesos only: --total-executor-cores NUM Total cores for all executors. YARN-only: --executor-cores NUMNumber of cores per executor (Default: 1). --queue QUEUE_NAME The YARN queue to submit to (Default: 'default'). --num-executors NUM Number of executors to (Default: 2). --archives ARCHIVES Comma separated list of archives to be extracted into the working dir of each executor. aash@aash-mbp ~/git/spark$ {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1944) Document --verbose in spark-shell -h
[ https://issues.apache.org/jira/browse/SPARK-1944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1944: --- Target Version/s: 1.0.1, 1.1.0 Fix Version/s: (was: 1.0.1) (was: 1.1.0) Document --verbose in spark-shell -h Key: SPARK-1944 URL: https://issues.apache.org/jira/browse/SPARK-1944 Project: Spark Issue Type: Documentation Components: Spark Core Affects Versions: 1.0.0 Reporter: Andrew Ash Assignee: Andrew Ash Priority: Minor Fix For: 1.0.1, 1.1.0 The below help for spark-submit should make mention of the {{--verbose}} option {noformat} aash@aash-mbp ~/git/spark$ ./bin/spark-submit -h Usage: spark-submit [options] app jar [app options] Options: --master MASTER_URL spark://host:port, mesos://host:port, yarn, or local. --deploy-mode DEPLOY_MODE Mode to deploy the app in, either 'client' or 'cluster'. --class CLASS_NAME Name of your app's main class (required for Java apps). --arg ARG Argument to be passed to your application's main class. This option can be specified multiple times for multiple args. --name NAME The name of your application (Default: 'Spark'). --jars JARS A comma-separated list of local jars to include on the driver classpath and that SparkContext.addJar will work with. Doesn't work on standalone with 'cluster' deploy mode. --files FILES Comma separated list of files to be placed in the working dir of each executor. --properties-file FILE Path to a file from which to load extra properties. If not specified, this will look for conf/spark-defaults.conf. --driver-memory MEM Memory for driver (e.g. 1000M, 2G) (Default: 512M). --driver-java-options Extra Java options to pass to the driver --driver-library-path Extra library path entries to pass to the driver --driver-class-path Extra class path entries to pass to the driver. Note that jars added with --jars are automatically included in the classpath. --executor-memory MEM Memory per executor (e.g. 1000M, 2G) (Default: 1G). Spark standalone with cluster deploy mode only: --driver-cores NUM Cores for driver (Default: 1). --supervise If given, restarts the driver on failure. Spark standalone and Mesos only: --total-executor-cores NUM Total cores for all executors. YARN-only: --executor-cores NUMNumber of cores per executor (Default: 1). --queue QUEUE_NAME The YARN queue to submit to (Default: 'default'). --num-executors NUM Number of executors to (Default: 2). --archives ARCHIVES Comma separated list of archives to be extracted into the working dir of each executor. aash@aash-mbp ~/git/spark$ {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2081) Undefine output() from the abstract class Command and implement it in concrete subclasses
Zongheng Yang created SPARK-2081: Summary: Undefine output() from the abstract class Command and implement it in concrete subclasses Key: SPARK-2081 URL: https://issues.apache.org/jira/browse/SPARK-2081 Project: Spark Issue Type: Improvement Reporter: Zongheng Yang Priority: Minor It doesn't make too much sense to have that method in the abstract class. Relevant discussions / cases where this issue comes up: https://github.com/apache/spark/pull/956 https://github.com/apache/spark/pull/1003 -- This message was sent by Atlassian JIRA (v6.2#6252)
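A rough sketch of what the refactor could look like (hypothetical; the class, type, and package names only approximate the 1.0-era Catalyst code, and the actual change is discussed in the linked PRs): the abstract class declares output without defining it, and each concrete command supplies the schema it actually returns.
{code}
// Hypothetical sketch: output() becomes abstract and is implemented per command.
import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeReference}
import org.apache.spark.sql.catalyst.plans.logical.LeafNode
import org.apache.spark.sql.catalyst.types.StringType

abstract class Command extends LeafNode {
  self: Product =>
  def output: Seq[Attribute]   // no longer given a default here
}

case class SetCommand(key: Option[String], value: Option[String]) extends Command {
  // A SET command returns key/value pairs, so it declares exactly that schema.
  override def output: Seq[Attribute] = Seq(
    AttributeReference("key", StringType, nullable = false)(),
    AttributeReference("value", StringType, nullable = false)())
}
{code}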
[jira] [Updated] (SPARK-2075) Hadoop1 distribution of 1.0.0 does not contain classes expected by the Maven 1.0.0 artifact
[ https://issues.apache.org/jira/browse/SPARK-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2075: --- Priority: Critical (was: Major) Hadoop1 distribution of 1.0.0 does not contain classes expected by the Maven 1.0.0 artifact --- Key: SPARK-2075 URL: https://issues.apache.org/jira/browse/SPARK-2075 Project: Spark Issue Type: Bug Components: Build, Spark Core Affects Versions: 1.0.0 Reporter: Paul R. Brown Priority: Critical Fix For: 1.0.1 Running a job built against the Maven dep for 1.0.0 and the hadoop1 distribution produces: {code} java.lang.ClassNotFoundException: org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1 {code} Here's what's in the Maven dep as of 1.0.0: {code} jar tvf ~/.m2/repository/org/apache/spark/spark-core_2.10/1.0.0/spark-core_2.10-1.0.0.jar | grep 'rdd/RDD' | grep 'saveAs' 1519 Mon May 26 13:57:58 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class 1560 Mon May 26 13:57:58 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class {code} And here's what's in the hadoop1 distribution: {code} jar tvf spark-assembly-1.0.0-hadoop1.0.4.jar| grep 'rdd/RDD' | grep 'saveAs' {code} I.e., it's not there. It is in the hadoop2 distribution: {code} jar tvf spark-assembly-1.0.0-hadoop2.2.0.jar| grep 'rdd/RDD' | grep 'saveAs' 1519 Mon May 26 07:29:54 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class 1560 Mon May 26 07:29:54 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2075) Hadoop1 distribution of 1.0.0 does not contain classes expected by the Maven 1.0.0 artifact
[ https://issues.apache.org/jira/browse/SPARK-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2075: --- Fix Version/s: 1.0.1 Hadoop1 distribution of 1.0.0 does not contain classes expected by the Maven 1.0.0 artifact --- Key: SPARK-2075 URL: https://issues.apache.org/jira/browse/SPARK-2075 Project: Spark Issue Type: Bug Components: Build, Spark Core Affects Versions: 1.0.0 Reporter: Paul R. Brown Priority: Critical Fix For: 1.0.1 Running a job built against the Maven dep for 1.0.0 and the hadoop1 distribution produces: {code} java.lang.ClassNotFoundException: org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1 {code} Here's what's in the Maven dep as of 1.0.0: {code} jar tvf ~/.m2/repository/org/apache/spark/spark-core_2.10/1.0.0/spark-core_2.10-1.0.0.jar | grep 'rdd/RDD' | grep 'saveAs' 1519 Mon May 26 13:57:58 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class 1560 Mon May 26 13:57:58 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class {code} And here's what's in the hadoop1 distribution: {code} jar tvf spark-assembly-1.0.0-hadoop1.0.4.jar| grep 'rdd/RDD' | grep 'saveAs' {code} I.e., it's not there. It is in the hadoop2 distribution: {code} jar tvf spark-assembly-1.0.0-hadoop2.2.0.jar| grep 'rdd/RDD' | grep 'saveAs' 1519 Mon May 26 07:29:54 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class 1560 Mon May 26 07:29:54 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2081) Undefine output() from the abstract class Command and implement it in concrete subclasses
[ https://issues.apache.org/jira/browse/SPARK-2081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2081: Assignee: Zongheng Yang Undefine output() from the abstract class Command and implement it in concrete subclasses - Key: SPARK-2081 URL: https://issues.apache.org/jira/browse/SPARK-2081 Project: Spark Issue Type: Improvement Reporter: Zongheng Yang Assignee: Zongheng Yang Priority: Minor It doesn't make too much sense to have that method in the abstract class. Relevant discussions / cases where this issue comes up: https://github.com/apache/spark/pull/956 https://github.com/apache/spark/pull/1003 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2034) KafkaInputDStream doesn't close resources and may prevent JVM shutdown
[ https://issues.apache.org/jira/browse/SPARK-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2034: Assignee: (was: Kan Zhang) KafkaInputDStream doesn't close resources and may prevent JVM shutdown -- Key: SPARK-2034 URL: https://issues.apache.org/jira/browse/SPARK-2034 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.0.0 Reporter: Sean Owen Tobias noted today on the mailing list: {quote} I am trying to use Spark Streaming with Kafka, which works like a charm -- except for shutdown. When I run my program with sbt run-main, sbt will never exit, because there are two non-daemon threads left that don't die. I created a minimal example at https://gist.github.com/tgpfeiffer/b1e765064e983449c6b6#file-kafkadoesntshutdown-scala. It starts a StreamingContext and does nothing more than connecting to a Kafka server and printing what it receives. Using the `future { ... }` construct, I shut down the StreamingContext after some seconds and then print the difference between the threads at start time and at end time. The output can be found at https://gist.github.com/tgpfeiffer/b1e765064e983449c6b6#file-output1. There are a number of threads remaining that will prevent sbt from exiting. When I replace `KafkaUtils.createStream(...)` with a call that does exactly the same, except that it calls `consumerConnector.shutdown()` in `KafkaReceiver.onStop()` (which it should, IMO), the output is as shown at https://gist.github.com/tgpfeiffer/b1e765064e983449c6b6#file-output2. Does anyone have *any* idea what is going on here and why the program doesn't shut down properly? The behavior is the same with both kafka 0.8.0 and 0.8.1.1, by the way. {quote} Something similar was noted last year: http://mail-archives.apache.org/mod_mbox/spark-dev/201309.mbox/%3c1380220041.2428.yahoomail...@web160804.mail.bf1.yahoo.com%3E KafkaInputDStream doesn't close ConsumerConnector in onStop(), and does not close the Executor it creates. The latter leaves non-daemon threads and can prevent the JVM from shutting down even if streaming is closed properly. -- This message was sent by Atlassian JIRA (v6.2#6252)
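A minimal sketch of the fix Tobias suggests (member definitions inside the Kafka receiver; the field names are assumptions, not the actual patch): shut down the ConsumerConnector and the fetcher thread pool in onStop() so no non-daemon threads outlive the StreamingContext.
{code}
// Hypothetical sketch of KafkaReceiver cleanup; names are illustrative.
import java.util.concurrent.ExecutorService
import kafka.consumer.ConsumerConnector

private var consumerConnector: ConsumerConnector = _
private var executorPool: ExecutorService = _

def onStop() {
  if (consumerConnector != null) {
    consumerConnector.shutdown()   // closes Kafka fetcher threads and sockets
    consumerConnector = null
  }
  if (executorPool != null) {
    executorPool.shutdown()        // its non-daemon workers otherwise keep the JVM alive
    executorPool = null
  }
}
{code}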
[jira] [Commented] (SPARK-2044) Pluggable interface for shuffles
[ https://issues.apache.org/jira/browse/SPARK-2044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14025580#comment-14025580 ] Matei Zaharia commented on SPARK-2044: -- Hey Weihua, I'll look into the sorting flag; I initially envisioned that the shuffle manager would just tell the calling code whether the data is sorted (otherwise it sorts it by itself), but maybe it does make sense to push sorting into the interface. For the ranges on ShuffleReader, I think you misunderstood my meaning slightly. I don't *want* the reduction code (e.g. combineByKey or groupByKey) to even know that map tasks are running at different times. It should simply request its range of reduce partitions once, and then the shuffle *implementation* should see which maps are ready and start pulling from those. Note also that the partition range there is for reduce partitions (e.g. our job has 100 reduce partitions and we ask for partitions 2-5 because we decided to have just one reduce task for those). It's not for map IDs. Pluggable interface for shuffles Key: SPARK-2044 URL: https://issues.apache.org/jira/browse/SPARK-2044 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Reporter: Matei Zaharia Assignee: Matei Zaharia Attachments: Pluggableshuffleproposal.pdf Given that a lot of the current activity in Spark Core is in shuffles, I wanted to propose factoring out shuffle implementations in a way that will make experimentation easier. Ideally we will converge on one implementation, but for a while, this could also be used to have several implementations coexist. I'm suggesting this because I'm aware of at least three efforts to look at shuffle (from Yahoo!, Intel and Databricks). Some of the things people are investigating are: * Push-based shuffle where data moves directly from mappers to reducers * Sorting-based instead of hash-based shuffle, to create fewer files (helps a lot with file handles and memory usage on large shuffles) * External spilling within a key * Changing the level of parallelism or even algorithm for downstream stages at runtime based on statistics of the map output (this is a thing we had prototyped in the Shark research project but never merged in core) I've attached a design doc with a proposed interface. It's not too crazy because the interface between shuffles and the rest of the code is already pretty narrow (just some iterators for reading data and a writer interface for writing it). Bigger changes will be needed in the interaction with DAGScheduler and BlockManager for some of the ideas above, but we can handle those separately, and this interface will allow us to experiment with some short-term stuff sooner. If things go well I'd also like to send a sort-based shuffle implementation for 1.1, but we'll see how the timing on that works out. -- This message was sent by Atlassian JIRA (v6.2#6252)
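To make the "narrow interface" and the reduce-partition ranges concrete, here is a rough Scala sketch; the trait and method names are illustrative and are not the API from the attached design doc.
{code}
// Illustrative sketch of a pluggable shuffle interface; not the proposed API itself.
trait ShuffleHandle extends Serializable

trait ShuffleWriter[K, V] {
  def write(records: Iterator[Product2[K, V]]): Unit
  def stop(success: Boolean): Unit
}

trait ShuffleReader[K, C] {
  // Returns combined records for the reduce partitions this reader was created for.
  def read(): Iterator[Product2[K, C]]
}

trait ShuffleManager {
  def registerShuffle(shuffleId: Int, numMaps: Int): ShuffleHandle
  def getWriter[K, V](handle: ShuffleHandle, mapId: Int): ShuffleWriter[K, V]
  // The caller asks once for a range of *reduce* partitions (e.g. 2 to 5 out of 100);
  // the implementation decides which map outputs are ready and pulls from them.
  def getReader[K, C](handle: ShuffleHandle, startPartition: Int, endPartition: Int): ShuffleReader[K, C]
}
{code}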
[jira] [Created] (SPARK-2082) Stratified sampling implementation in PairRDDFunctions
Doris Xin created SPARK-2082: - Summary: Stratified sampling implementation in PairRDDFunctions Key: SPARK-2082 URL: https://issues.apache.org/jira/browse/SPARK-2082 Project: Spark Issue Type: New Feature Reporter: Doris Xin Implementation of stratified sampling that guarantees an exact sample size = sum(math.ceil(S_i * samplingRate)), where S_i is the size of each stratum. -- This message was sent by Atlassian JIRA (v6.2#6252)
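As a point of reference for the exact-size guarantee, a naive two-pass version can be written directly against the pair-RDD API. This is an illustrative sketch only (not the proposed PairRDDFunctions implementation); it collects each stratum with groupByKey, which would not scale to large strata.
{code}
// Naive illustration: exact stratified sample of ceil(S_i * samplingRate) per stratum.
import scala.reflect.ClassTag
import scala.util.Random
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

def exactStratifiedSample[K: ClassTag, V: ClassTag](
    rdd: RDD[(K, V)], samplingRate: Double, seed: Long): RDD[(K, V)] = {
  // Pass 1: stratum sizes S_i, then the exact per-stratum targets.
  val targets = rdd.countByKey().map { case (k, s) => (k, math.ceil(s * samplingRate).toInt) }
  // Pass 2: shuffle each stratum locally and keep the first ceil(S_i * rate) elements.
  rdd.groupByKey().flatMap { case (k, vs) =>
    val rng = new Random(seed ^ k.hashCode)
    rng.shuffle(vs.toSeq).take(targets(k)).map(v => (k, v))
  }
}
{code}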
[jira] [Resolved] (SPARK-1522) YARN ClientBase will throw a NPE if there is no YARN application specific classpath.
[ https://issues.apache.org/jira/browse/SPARK-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-1522. -- Resolution: Fixed Fix Version/s: 1.1.0 Assignee: Bernardo Gomez Palacio YARN ClientBase will throw a NPE if there is no YARN application specific classpath. Key: SPARK-1522 URL: https://issues.apache.org/jira/browse/SPARK-1522 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 0.9.0, 0.9.1, 1.0.0 Reporter: Bernardo Gomez Palacio Assignee: Bernardo Gomez Palacio Priority: Critical Labels: YARN Fix For: 1.1.0 The current implementation of ClientBase.getDefaultYarnApplicationClasspath inspects the MRJobConfig class for the field DEFAULT_YARN_APPLICATION_CLASSPATH when it should be really looking into YarnConfiguration. If the Application Configuration has no yarn.application.classpath defined a NPE exception will be thrown. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2083) Allow local task to retry after failure.
Peng Cheng created SPARK-2083: - Summary: Allow local task to retry after failure. Key: SPARK-2083 URL: https://issues.apache.org/jira/browse/SPARK-2083 Project: Spark Issue Type: Improvement Components: Deploy Affects Versions: 1.0.0 Reporter: Peng Cheng Priority: Trivial If a job is submitted to run locally using masterURL = local[X], Spark will not retry a failed task regardless of your spark.task.maxFailures setting. This design facilitates debugging and QA of Spark applications where all tasks are expected to succeed and yield a result. Unfortunately, such a setting prevents a local job from finishing if any of its tasks cannot guarantee a result (e.g. one that visits an external resource/API), and retrying inside the task is less favoured (e.g. the task needs to be executed on a different computer in production). Users can still set masterURL = local[X,Y] to override this (where Y is the local maxFailures), but that form is not documented and hard to manage. A quick fix would be to add a new configuration property spark.local.maxFailures with a default value of 1, so users know exactly what to change when reading the documentation. -- This message was sent by Atlassian JIRA (v6.2#6252)
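For reference, the undocumented form mentioned above looks like this (illustrative snippet; the second number is passed through as the local scheduler's per-task maxFailures):
{code}
// local[X,Y] runs X worker threads and sets the per-task maxFailures to Y,
// so a failed task can be retried even in local mode (plain local[X] allows no retries).
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("local-retry-example")
  .setMaster("local[4,3]")   // 4 threads; a task may fail up to 3 times before the job aborts
val sc = new SparkContext(conf)
{code}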
[jira] [Created] (SPARK-2084) Mention SPARK_JAR in env var section on configuration page
Sandy Ryza created SPARK-2084: - Summary: Mention SPARK_JAR in env var section on configuration page Key: SPARK-2084 URL: https://issues.apache.org/jira/browse/SPARK-2084 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 1.0.0 Reporter: Sandy Ryza -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2085) Apply user-specific regularization instead of uniform regularization in Alternating Least Squares (ALS)
[ https://issues.apache.org/jira/browse/SPARK-2085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shuo Xiang updated SPARK-2085: -- Description: The current implementation of ALS takes a single regularization parameter and apply it on both of the user factors and the product factors. This kind of regularization can be less effective while users number is significantly larger than the number of products (and vice versa). For example, if we have 10M users and 1K product, regularization on user factors will dominate. Following the discussion in [this thread](http://apache-spark-user-list.1001560.n3.nabble.com/possible-bug-in-Spark-s-ALS-implementation-tt2567.html#a2704), the implementation in this PR will regularize each factor vector by #ratings * lambda. Link to PR: https://github.com/apache/spark/pull/1026 was: The current implementation of ALS takes a single regularization parameter and apply it on both of the user factors and the product factors. This kind of regularization can be less effective while users number is significantly larger than the number of products (and vice versa). For example, if we have 10M users and 1K product, regularization on user factors will dominate. Following the discussion in [this thread](http://apache-spark-user-list.1001560.n3.nabble.com/possible-bug-in-Spark-s-ALS-implementation-tt2567.html#a2704), the implementation in this PR will regularize each factor vector by #ratings * lambda. Apply user-specific regularization instead of uniform regularization in Alternating Least Squares (ALS) --- Key: SPARK-2085 URL: https://issues.apache.org/jira/browse/SPARK-2085 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.0.0 Reporter: Shuo Xiang Priority: Minor The current implementation of ALS takes a single regularization parameter and apply it on both of the user factors and the product factors. This kind of regularization can be less effective while users number is significantly larger than the number of products (and vice versa). For example, if we have 10M users and 1K product, regularization on user factors will dominate. Following the discussion in [this thread](http://apache-spark-user-list.1001560.n3.nabble.com/possible-bug-in-Spark-s-ALS-implementation-tt2567.html#a2704), the implementation in this PR will regularize each factor vector by #ratings * lambda. Link to PR: https://github.com/apache/spark/pull/1026 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2085) Apply user-specific regularization instead of uniform regularization in Alternating Least Squares (ALS)
Shuo Xiang created SPARK-2085: - Summary: Apply user-specific regularization instead of uniform regularization in Alternating Least Squares (ALS) Key: SPARK-2085 URL: https://issues.apache.org/jira/browse/SPARK-2085 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.0.0 Reporter: Shuo Xiang Priority: Minor The current implementation of ALS takes a single regularization parameter and applies it to both the user factors and the product factors. This kind of regularization can be less effective when the number of users is significantly larger than the number of products (and vice versa). For example, if we have 10M users and 1K products, regularization on the user factors will dominate. Following the discussion in [this thread](http://apache-spark-user-list.1001560.n3.nabble.com/possible-bug-in-Spark-s-ALS-implementation-tt2567.html#a2704), the implementation in this PR will regularize each factor vector by #ratings * lambda. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2085) Apply user-specific regularization instead of uniform regularization in Alternating Least Squares (ALS)
[ https://issues.apache.org/jira/browse/SPARK-2085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shuo Xiang updated SPARK-2085: -- Description: The current implementation of ALS takes a single regularization parameter and apply it on both of the user factors and the product factors. This kind of regularization can be less effective while user number is significantly larger than the number of products (and vice versa). For example, if we have 10M users and 1K product, regularization on user factors will dominate. Following the discussion in [this thread](http://apache-spark-user-list.1001560.n3.nabble.com/possible-bug-in-Spark-s-ALS-implementation-tt2567.html#a2704), the implementation in this PR will regularize each factor vector by #ratings * lambda. Link to PR: https://github.com/apache/spark/pull/1026 was: The current implementation of ALS takes a single regularization parameter and apply it on both of the user factors and the product factors. This kind of regularization can be less effective while users number is significantly larger than the number of products (and vice versa). For example, if we have 10M users and 1K product, regularization on user factors will dominate. Following the discussion in [this thread](http://apache-spark-user-list.1001560.n3.nabble.com/possible-bug-in-Spark-s-ALS-implementation-tt2567.html#a2704), the implementation in this PR will regularize each factor vector by #ratings * lambda. Link to PR: https://github.com/apache/spark/pull/1026 Apply user-specific regularization instead of uniform regularization in Alternating Least Squares (ALS) --- Key: SPARK-2085 URL: https://issues.apache.org/jira/browse/SPARK-2085 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.0.0 Reporter: Shuo Xiang Priority: Minor The current implementation of ALS takes a single regularization parameter and apply it on both of the user factors and the product factors. This kind of regularization can be less effective while user number is significantly larger than the number of products (and vice versa). For example, if we have 10M users and 1K product, regularization on user factors will dominate. Following the discussion in [this thread](http://apache-spark-user-list.1001560.n3.nabble.com/possible-bug-in-Spark-s-ALS-implementation-tt2567.html#a2704), the implementation in this PR will regularize each factor vector by #ratings * lambda. Link to PR: https://github.com/apache/spark/pull/1026 -- This message was sent by Atlassian JIRA (v6.2#6252)
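The "#ratings * lambda" scheme is often called weighted-lambda regularization: in each least-squares solve, lambda is scaled by the number of ratings belonging to that user (or product). Purely as an illustrative sketch (not the code in the linked PR), the per-user solve could look like this with Breeze:
{code}
// Illustrative only: weighted-lambda normal equations, (Y^T Y + n_u * lambda * I) x_u = Y^T r_u.
import breeze.linalg.{DenseMatrix, DenseVector}

def solveUserFactor(
    itemFactors: DenseMatrix[Double],  // n_u x k: factors of the items this user rated
    ratings: DenseVector[Double],      // length n_u: this user's ratings
    lambda: Double): DenseVector[Double] = {
  val k = itemFactors.cols
  val nu = itemFactors.rows.toDouble   // number of ratings for this user
  val a = itemFactors.t * itemFactors + DenseMatrix.eye[Double](k) * (lambda * nu)
  val b = itemFactors.t * ratings
  a \ b                                // k-dimensional user factor vector
}
{code}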
[jira] [Commented] (SPARK-1493) Apache RAT excludes don't work with file path (instead of file name)
[ https://issues.apache.org/jira/browse/SPARK-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14025895#comment-14025895 ] Erik Erlandson commented on SPARK-1493: --- RAT itself appears to preclude exclusion using a /path/to/file.ext regex because it traverses the directory tree and applies its exclusion filter only to individual file names. The filter never sees an entire path path/to/file.ext, only path, to, and file.ext https://github.com/apache/rat/blob/incubator-site-import/rat/rat-core/src/main/java/org/apache/rat/DirectoryWalker.java#L127 Either RAT needs a new filtering feature that can see an entire path, or the report it generates has to be filtered post-hoc. Apache RAT excludes don't work with file path (instead of file name) Key: SPARK-1493 URL: https://issues.apache.org/jira/browse/SPARK-1493 Project: Spark Issue Type: Bug Components: Project Infra Reporter: Patrick Wendell Labels: starter Fix For: 1.1.0 Right now the way we do RAT checks, it doesn't work if you try to exclude: /path/to/file.ext you have to just exclude file.ext -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-1493) Apache RAT excludes don't work with file path (instead of file name)
[ https://issues.apache.org/jira/browse/SPARK-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14025895#comment-14025895 ] Erik Erlandson edited comment on SPARK-1493 at 6/9/14 11:13 PM: RAT itself appears to preclude exclusion using a /path/to/file.ext regex because it traverses the directory tree and applies its exclusion filter only to individual file names. The filter never sees an entire path path/to/file.ext, only path, to, and file.ext https://github.com/apache/rat/blob/incubator-site-import/rat/rat-core/src/main/java/org/apache/rat/DirectoryWalker.java#L127 Either RAT needs a new filtering feature that can see an entire path, or the report it generates has to be filtered post-hoc. Filed an RFE against RAT: RAT-161 was (Author: eje): RAT itself appears to preclude exclusion using a /path/to/file.ext regex because it traverses the directory tree and applies its exclusion filter only to individual file names. The filter never sees an entire path path/to/file.ext, only path, to, and file.ext https://github.com/apache/rat/blob/incubator-site-import/rat/rat-core/src/main/java/org/apache/rat/DirectoryWalker.java#L127 Either RAT needs a new filtering feature that can see an entire path, or the report it generates has to be filtered post-hoc. Apache RAT excludes don't work with file path (instead of file name) Key: SPARK-1493 URL: https://issues.apache.org/jira/browse/SPARK-1493 Project: Spark Issue Type: Bug Components: Project Infra Reporter: Patrick Wendell Labels: starter Fix For: 1.1.0 Right now the way we do RAT checks, it doesn't work if you try to exclude: /path/to/file.ext you have to just exclude file.ext -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1704) Support EXPLAIN in Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-1704: Assignee: Zongheng Yang (was: Michael Armbrust) Support EXPLAIN in Spark SQL Key: SPARK-1704 URL: https://issues.apache.org/jira/browse/SPARK-1704 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Environment: linux Reporter: Yangjp Assignee: Zongheng Yang Labels: sql Fix For: 1.0.1, 1.1.0 Original Estimate: 612h Remaining Estimate: 612h 14/05/03 22:08:40 INFO ParseDriver: Parsing command: explain select * from src 14/05/03 22:08:40 INFO ParseDriver: Parse Completed 14/05/03 22:08:40 WARN LoggingFilter: EXCEPTION : java.lang.AssertionError: assertion failed: No plan for ExplainCommand (Project [*]) at scala.Predef$.assert(Predef.scala:179) at org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:263) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:263) at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:264) at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:264) at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:260) at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:248) at org.apache.spark.sql.hive.api.java.JavaHiveContext.hql(JavaHiveContext.scala:39) at org.apache.spark.examples.TimeServerHandler.messageReceived(TimeServerHandler.java:72) at org.apache.mina.core.filterchain.DefaultIoFilterChain$TailFilter.messageReceived(DefaultIoFilterChain.java:690) at org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:417) at org.apache.mina.core.filterchain.DefaultIoFilterChain.access$1200(DefaultIoFilterChain.java:47) at org.apache.mina.core.filterchain.DefaultIoFilterChain$EntryImpl$1.messageReceived(DefaultIoFilterChain.java:765) at org.apache.mina.filter.codec.ProtocolCodecFilter$ProtocolDecoderOutputImpl.flush(ProtocolCodecFilter.java:407) at org.apache.mina.filter.codec.ProtocolCodecFilter.messageReceived(ProtocolCodecFilter.java:236) at org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:417) at org.apache.mina.core.filterchain.DefaultIoFilterChain.access$1200(DefaultIoFilterChain.java:47) at org.apache.mina.core.filterchain.DefaultIoFilterChain$EntryImpl$1.messageReceived(DefaultIoFilterChain.java:765) at org.apache.mina.filter.logging.LoggingFilter.messageReceived(LoggingFilter.java:208) at org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:417) at org.apache.mina.core.filterchain.DefaultIoFilterChain.access$1200(DefaultIoFilterChain.java:47) at org.apache.mina.core.filterchain.DefaultIoFilterChain$EntryImpl$1.messageReceived(DefaultIoFilterChain.java:765) at org.apache.mina.core.filterchain.IoFilterAdapter.messageReceived(IoFilterAdapter.java:109) at org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:417) at org.apache.mina.core.filterchain.DefaultIoFilterChain.fireMessageReceived(DefaultIoFilterChain.java:410) at org.apache.mina.core.polling.AbstractPollingIoProcessor.read(AbstractPollingIoProcessor.java:710) at org.apache.mina.core.polling.AbstractPollingIoProcessor.process(AbstractPollingIoProcessor.java:664) at 
org.apache.mina.core.polling.AbstractPollingIoProcessor.process(AbstractPollingIoProcessor.java:653) at org.apache.mina.core.polling.AbstractPollingIoProcessor.access$600(AbstractPollingIoProcessor.java:67) at org.apache.mina.core.polling.AbstractPollingIoProcessor$Processor.run(AbstractPollingIoProcessor.java:1124) at org.apache.mina.util.NamePreservingRunnable.run(NamePreservingRunnable.java:64) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:701) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (SPARK-1704) Support EXPLAIN in Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust reassigned SPARK-1704: --- Assignee: Michael Armbrust Support EXPLAIN in Spark SQL Key: SPARK-1704 URL: https://issues.apache.org/jira/browse/SPARK-1704 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Environment: linux Reporter: Yangjp Assignee: Michael Armbrust Labels: sql Fix For: 1.0.1, 1.1.0 Original Estimate: 612h Remaining Estimate: 612h 14/05/03 22:08:40 INFO ParseDriver: Parsing command: explain select * from src 14/05/03 22:08:40 INFO ParseDriver: Parse Completed 14/05/03 22:08:40 WARN LoggingFilter: EXCEPTION : java.lang.AssertionError: assertion failed: No plan for ExplainCommand (Project [*]) at scala.Predef$.assert(Predef.scala:179) at org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:263) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:263) at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:264) at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:264) at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:260) at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:248) at org.apache.spark.sql.hive.api.java.JavaHiveContext.hql(JavaHiveContext.scala:39) at org.apache.spark.examples.TimeServerHandler.messageReceived(TimeServerHandler.java:72) at org.apache.mina.core.filterchain.DefaultIoFilterChain$TailFilter.messageReceived(DefaultIoFilterChain.java:690) at org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:417) at org.apache.mina.core.filterchain.DefaultIoFilterChain.access$1200(DefaultIoFilterChain.java:47) at org.apache.mina.core.filterchain.DefaultIoFilterChain$EntryImpl$1.messageReceived(DefaultIoFilterChain.java:765) at org.apache.mina.filter.codec.ProtocolCodecFilter$ProtocolDecoderOutputImpl.flush(ProtocolCodecFilter.java:407) at org.apache.mina.filter.codec.ProtocolCodecFilter.messageReceived(ProtocolCodecFilter.java:236) at org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:417) at org.apache.mina.core.filterchain.DefaultIoFilterChain.access$1200(DefaultIoFilterChain.java:47) at org.apache.mina.core.filterchain.DefaultIoFilterChain$EntryImpl$1.messageReceived(DefaultIoFilterChain.java:765) at org.apache.mina.filter.logging.LoggingFilter.messageReceived(LoggingFilter.java:208) at org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:417) at org.apache.mina.core.filterchain.DefaultIoFilterChain.access$1200(DefaultIoFilterChain.java:47) at org.apache.mina.core.filterchain.DefaultIoFilterChain$EntryImpl$1.messageReceived(DefaultIoFilterChain.java:765) at org.apache.mina.core.filterchain.IoFilterAdapter.messageReceived(IoFilterAdapter.java:109) at org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:417) at org.apache.mina.core.filterchain.DefaultIoFilterChain.fireMessageReceived(DefaultIoFilterChain.java:410) at org.apache.mina.core.polling.AbstractPollingIoProcessor.read(AbstractPollingIoProcessor.java:710) at org.apache.mina.core.polling.AbstractPollingIoProcessor.process(AbstractPollingIoProcessor.java:664) at 
org.apache.mina.core.polling.AbstractPollingIoProcessor.process(AbstractPollingIoProcessor.java:653) at org.apache.mina.core.polling.AbstractPollingIoProcessor.access$600(AbstractPollingIoProcessor.java:67) at org.apache.mina.core.polling.AbstractPollingIoProcessor$Processor.run(AbstractPollingIoProcessor.java:1124) at org.apache.mina.util.NamePreservingRunnable.run(NamePreservingRunnable.java:64) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:701) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1704) Support EXPLAIN in Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-1704. - Resolution: Fixed Fix Version/s: 1.0.1 Support EXPLAIN in Spark SQL Key: SPARK-1704 URL: https://issues.apache.org/jira/browse/SPARK-1704 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Environment: linux Reporter: Yangjp Labels: sql Fix For: 1.0.1, 1.1.0 Original Estimate: 612h Remaining Estimate: 612h 14/05/03 22:08:40 INFO ParseDriver: Parsing command: explain select * from src 14/05/03 22:08:40 INFO ParseDriver: Parse Completed 14/05/03 22:08:40 WARN LoggingFilter: EXCEPTION : java.lang.AssertionError: assertion failed: No plan for ExplainCommand (Project [*]) at scala.Predef$.assert(Predef.scala:179) at org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:263) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:263) at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:264) at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:264) at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:260) at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:248) at org.apache.spark.sql.hive.api.java.JavaHiveContext.hql(JavaHiveContext.scala:39) at org.apache.spark.examples.TimeServerHandler.messageReceived(TimeServerHandler.java:72) at org.apache.mina.core.filterchain.DefaultIoFilterChain$TailFilter.messageReceived(DefaultIoFilterChain.java:690) at org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:417) at org.apache.mina.core.filterchain.DefaultIoFilterChain.access$1200(DefaultIoFilterChain.java:47) at org.apache.mina.core.filterchain.DefaultIoFilterChain$EntryImpl$1.messageReceived(DefaultIoFilterChain.java:765) at org.apache.mina.filter.codec.ProtocolCodecFilter$ProtocolDecoderOutputImpl.flush(ProtocolCodecFilter.java:407) at org.apache.mina.filter.codec.ProtocolCodecFilter.messageReceived(ProtocolCodecFilter.java:236) at org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:417) at org.apache.mina.core.filterchain.DefaultIoFilterChain.access$1200(DefaultIoFilterChain.java:47) at org.apache.mina.core.filterchain.DefaultIoFilterChain$EntryImpl$1.messageReceived(DefaultIoFilterChain.java:765) at org.apache.mina.filter.logging.LoggingFilter.messageReceived(LoggingFilter.java:208) at org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:417) at org.apache.mina.core.filterchain.DefaultIoFilterChain.access$1200(DefaultIoFilterChain.java:47) at org.apache.mina.core.filterchain.DefaultIoFilterChain$EntryImpl$1.messageReceived(DefaultIoFilterChain.java:765) at org.apache.mina.core.filterchain.IoFilterAdapter.messageReceived(IoFilterAdapter.java:109) at org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:417) at org.apache.mina.core.filterchain.DefaultIoFilterChain.fireMessageReceived(DefaultIoFilterChain.java:410) at org.apache.mina.core.polling.AbstractPollingIoProcessor.read(AbstractPollingIoProcessor.java:710) at org.apache.mina.core.polling.AbstractPollingIoProcessor.process(AbstractPollingIoProcessor.java:664) at 
org.apache.mina.core.polling.AbstractPollingIoProcessor.process(AbstractPollingIoProcessor.java:653) at org.apache.mina.core.polling.AbstractPollingIoProcessor.access$600(AbstractPollingIoProcessor.java:67) at org.apache.mina.core.polling.AbstractPollingIoProcessor$Processor.run(AbstractPollingIoProcessor.java:1124) at org.apache.mina.util.NamePreservingRunnable.run(NamePreservingRunnable.java:64) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:701) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1493) Apache RAT excludes don't work with file path (instead of file name)
[ https://issues.apache.org/jira/browse/SPARK-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1493: --- Fix Version/s: (was: 1.1.0) Apache RAT excludes don't work with file path (instead of file name) Key: SPARK-1493 URL: https://issues.apache.org/jira/browse/SPARK-1493 Project: Spark Issue Type: Bug Components: Project Infra Reporter: Patrick Wendell Labels: starter Right now the way we do RAT checks, it doesn't work if you try to exclude: /path/to/file.ext you have to just exclude file.ext -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1493) Apache RAT excludes don't work with file path (instead of file name)
[ https://issues.apache.org/jira/browse/SPARK-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14025954#comment-14025954 ] Patrick Wendell commented on SPARK-1493: Thanks for looking into this, Erik. It seems like maybe there isn't a good way to do this unless we want to implement filtering post-hoc (and it might be tricky to support e.g. globbing in that case). Apache RAT excludes don't work with file path (instead of file name) Key: SPARK-1493 URL: https://issues.apache.org/jira/browse/SPARK-1493 Project: Spark Issue Type: Bug Components: Project Infra Reporter: Patrick Wendell Labels: starter Right now the way we do RAT checks, it doesn't work if you try to exclude: /path/to/file.ext you have to just exclude file.ext -- This message was sent by Atlassian JIRA (v6.2#6252)
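If the post-hoc route were taken, the filtering could be as simple as reading RAT's plain-text report and dropping unapproved-license entries whose full paths match exclude patterns. The following is an illustrative sketch with an assumed report file name and marker format, not an existing script.
{code}
// Illustrative sketch: filter RAT's unapproved-license lines by full-path regexes.
import scala.io.Source

val excludes = Seq(".*/path/to/file\\.ext$".r)          // hypothetical exclude patterns

val unapproved = Source.fromFile("rat-report.txt")      // assumed report file name
  .getLines()
  .filter(_.contains("!?????"))                         // lines RAT flags as unknown license
  .map(_.split("\\s+").last)                            // the path is the last column
  .filterNot(p => excludes.exists(_.findFirstIn(p).isDefined))
  .toList

unapproved.foreach(println)
{code}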
[jira] [Updated] (SPARK-2086) Improve output of toDebugString to make shuffle boundaries more clear
[ https://issues.apache.org/jira/browse/SPARK-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2086: --- Description: It would be nice if the toDebugString method of an RDD did a better job of explaining where shuffle boundaries occur in the lineage graph. One way to do this would be to only indent the tree at a shuffle boundary instead of indenting it for every parent. We can determine when a shuffle boundary occurs based on the type of dependency seen in the RDD. was:It would be nice if the toDebugString method of an RDD did a better job of explaining where shuffle boundaries occur in the lineage graph. One way to do this would be to only indent the tree at a shuffle boundary instead of indenting it for every parent. Improve output of toDebugString to make shuffle boundaries more clear - Key: SPARK-2086 URL: https://issues.apache.org/jira/browse/SPARK-2086 Project: Spark Issue Type: Improvement Reporter: Patrick Wendell Assignee: Gregory Owen Priority: Minor It would be nice if the toDebugString method of an RDD did a better job of explaining where shuffle boundaries occur in the lineage graph. One way to do this would be to only indent the tree at a shuffle boundary instead of indenting it for every parent. We can determine when a shuffle boundary occurs based on the type of dependency seen in the RDD. -- This message was sent by Atlassian JIRA (v6.2#6252)
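Purely as a hypothetical illustration of the proposal (not actual Spark output), indenting at every parent turns a lineage like this:
{noformat}
MappedRDD[4]
  ShuffledRDD[3]
    MappedRDD[2]
      FilteredRDD[1]
        HadoopRDD[0]
{noformat}
whereas indenting only where a shuffle dependency occurs would keep narrow dependencies at the same level:
{noformat}
MappedRDD[4]
ShuffledRDD[3]
  MappedRDD[2]
  FilteredRDD[1]
  HadoopRDD[0]
{noformat}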
[jira] [Created] (SPARK-2088) NPE in toString when creationSiteInfo is null after deserialization
Doris Xin created SPARK-2088: Summary: NPE in toString when creationSiteInfo is null after deserialization Key: SPARK-2088 URL: https://issues.apache.org/jira/browse/SPARK-2088 Project: Spark Issue Type: Bug Reporter: Doris Xin After deserialization, the transient field creationSiteInfo does not get backfilled with the default value, but the toString method, which is invoked by the serializer, expects the field to always be non-null. The following issue is encountered during serialization: java.lang.NullPointerException at org.apache.spark.rdd.RDD.getCreationSite(RDD.scala:1198) at org.apache.spark.rdd.RDD.toString(RDD.scala:1263) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1418) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) at scala.collection.immutable.$colon$colon.writeObject(List.scala:379) at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:988) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1495) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at 
java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42) at org.apache.spark.scheduler.ResultTask$.serializeInfo(ResultTask.scala:46) at org.apache.spark.scheduler.ResultTask.writeExternal(ResultTask.scala:125) at java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1458) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1429) at
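A minimal sketch of a guard (illustrative only; the field and method names are taken from the issue title and the stack trace above, and the actual patch may differ): make getCreationSite tolerate a null creationSiteInfo after Java deserialization instead of dereferencing the transient field directly.
{code}
// Hypothetical sketch inside RDD: never assume the transient field survived deserialization.
@transient private[spark] val creationSiteInfo = Utils.getCallSiteInfo

private[spark] def getCreationSite: String =
  Option(creationSiteInfo).map(_.toString).getOrElse("")   // empty string instead of an NPE
{code}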
[jira] [Commented] (SPARK-1305) Support persisting RDD's directly to Tachyon
[ https://issues.apache.org/jira/browse/SPARK-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14026037#comment-14026037 ] Henry Saputra commented on SPARK-1305: -- Sorry to comment on an old JIRA, but does anyone have a PR for this ticket? Support persisting RDD's directly to Tachyon Key: SPARK-1305 URL: https://issues.apache.org/jira/browse/SPARK-1305 Project: Spark Issue Type: New Feature Components: Block Manager Reporter: Patrick Wendell Assignee: Haoyuan Li Priority: Blocker Fix For: 1.0.0 This is already an ongoing pull request - in a nutshell we want to support Tachyon as a storage level in Spark. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1305) Support persisting RDD's directly to Tachyon
[ https://issues.apache.org/jira/browse/SPARK-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14026038#comment-14026038 ] Henry Saputra commented on SPARK-1305: -- Never mind, found it. It was from when Spark was in the incubator. Support persisting RDD's directly to Tachyon Key: SPARK-1305 URL: https://issues.apache.org/jira/browse/SPARK-1305 Project: Spark Issue Type: New Feature Components: Block Manager Reporter: Patrick Wendell Assignee: Haoyuan Li Priority: Blocker Fix For: 1.0.0 This is already an ongoing pull request - in a nutshell we want to support Tachyon as a storage level in Spark. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2071) Package private classes that are deleted from an older version of Spark trigger errors
[ https://issues.apache.org/jira/browse/SPARK-2071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14026045#comment-14026045 ] Patrick Wendell commented on SPARK-2071: Yes, we could use sbt to retrieve them and place them in lib_managed or something similar. Package private classes that are deleted from an older version of Spark trigger errors -- Key: SPARK-2071 URL: https://issues.apache.org/jira/browse/SPARK-2071 Project: Spark Issue Type: Sub-task Components: Build Reporter: Patrick Wendell Assignee: Prashant Sharma Fix For: 1.1.0 We should figure out how to fix this. One idea is to run the MIMA exclude generator with sbt itself (rather than ./spark-class) so it can run against the older versions of Spark and make sure to exclude classes that are marked as package private in that version as well. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2088) NPE in toString when creationSiteInfo is null after deserialization
[ https://issues.apache.org/jira/browse/SPARK-2088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2088: --- Assignee: Doris Xin NPE in toString when creationSiteInfo is null after deserialization --- Key: SPARK-2088 URL: https://issues.apache.org/jira/browse/SPARK-2088 Project: Spark Issue Type: Bug Affects Versions: 1.0.0 Reporter: Doris Xin Assignee: Doris Xin After deserialization, the transient field creationSiteInfo does not get backfilled with the default value, but the toString method, which is invoked by the serializer, expects the field to always be non-null. The following issue is encountered during serialization: java.lang.NullPointerException at org.apache.spark.rdd.RDD.getCreationSite(RDD.scala:1198) at org.apache.spark.rdd.RDD.toString(RDD.scala:1263) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1418) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) at scala.collection.immutable.$colon$colon.writeObject(List.scala:379) at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:988) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1495) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42) at org.apache.spark.scheduler.ResultTask$.serializeInfo(ResultTask.scala:46) at
[jira] [Updated] (SPARK-2088) NPE in toString when creationSiteInfo is null after deserialization
[ https://issues.apache.org/jira/browse/SPARK-2088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2088: --- Target Version/s: 1.0.0, 1.0.1 NPE in toString when creationSiteInfo is null after deserialization --- Key: SPARK-2088 URL: https://issues.apache.org/jira/browse/SPARK-2088 Project: Spark Issue Type: Bug Affects Versions: 1.0.0 Reporter: Doris Xin Assignee: Doris Xin After deserialization, the transient field creationSiteInfo does not get backfilled with the default value, but the toString method, which is invoked by the serializer, expects the field to always be non-null. The following issue is encountered during serialization: java.lang.NullPointerException at org.apache.spark.rdd.RDD.getCreationSite(RDD.scala:1198) at org.apache.spark.rdd.RDD.toString(RDD.scala:1263) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1418) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) at scala.collection.immutable.$colon$colon.writeObject(List.scala:379) at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:988) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1495) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42) at org.apache.spark.scheduler.ResultTask$.serializeInfo(ResultTask.scala:46) at
[jira] [Updated] (SPARK-2088) NPE in toString when creationSiteInfo is null after deserialization
[ https://issues.apache.org/jira/browse/SPARK-2088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2088: --- Affects Version/s: 1.0.0 NPE in toString when creationSiteInfo is null after deserialization --- Key: SPARK-2088 URL: https://issues.apache.org/jira/browse/SPARK-2088 Project: Spark Issue Type: Bug Affects Versions: 1.0.0 Reporter: Doris Xin Assignee: Doris Xin After deserialization, the transient field creationSiteInfo does not get backfilled with the default value, but the toString method, which is invoked by the serializer, expects the field to always be non-null. The following issue is encountered during serialization: java.lang.NullPointerException at org.apache.spark.rdd.RDD.getCreationSite(RDD.scala:1198) at org.apache.spark.rdd.RDD.toString(RDD.scala:1263) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1418) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) at scala.collection.immutable.$colon$colon.writeObject(List.scala:379) at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:988) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1495) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42) at org.apache.spark.scheduler.ResultTask$.serializeInfo(ResultTask.scala:46) at
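For context on the failure mode reported in SPARK-2088 above: Java serialization skips @transient fields and restores them to null rather than re-running their initializers, so any method that assumes such a field is non-null (as toString does with creationSiteInfo) can throw after a round trip. The sketch below is a hypothetical, self-contained illustration of that behavior and of a null-tolerant toString; it is not the RDD code itself.

```scala
// Minimal sketch (hypothetical class, not the RDD code): Java serialization leaves a
// @transient field at null after deserialization, so toString must guard against it.
import java.io._

class Node extends Serializable {
  @transient val creationSite: String = "constructed here" // skipped during serialization

  // Defensive toString that tolerates the null left behind by deserialization.
  override def toString: String =
    s"Node(${Option(creationSite).getOrElse("<unknown creation site>")})"
}

object TransientDemo extends App {
  // Serialize an instance...
  val buf = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(buf)
  out.writeObject(new Node)
  out.close()

  // ...then deserialize it: creationSite is now null, not "constructed here".
  val in = new ObjectInputStream(new ByteArrayInputStream(buf.toByteArray))
  val copy = in.readObject().asInstanceOf[Node]
  println(copy) // prints the guarded fallback rather than throwing an NPE
}
```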
[jira] [Commented] (SPARK-2000) cannot connect to cluster in Standalone mode when running spark-shell on one of the cluster nodes without specifying a master
[ https://issues.apache.org/jira/browse/SPARK-2000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026052#comment-14026052 ] Chen Chao commented on SPARK-2000: -- Hi Patrick, I just thought it was the same problem as https://issues.apache.org/jira/browse/SPARK-1028. Anyway, if you think it is not necessary, please close the issue :) cannot connect to cluster in Standalone mode when running spark-shell on one of the cluster nodes without specifying a master --- Key: SPARK-2000 URL: https://issues.apache.org/jira/browse/SPARK-2000 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.0.0 Reporter: Chen Chao Assignee: Chen Chao Labels: shell Cannot connect to the cluster in Standalone mode when running spark-shell on one of the cluster nodes. -- This message was sent by Atlassian JIRA (v6.2#6252)
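As background for the report above: in standalone mode the shell only attaches to the cluster when the master URL is supplied explicitly (for spark-shell that means passing a master URL at launch; in application code it can be set on the configuration). The snippet below is a minimal sketch of the latter, with a placeholder host and port rather than anything taken from this ticket.

```scala
// Minimal sketch, not from the ticket: explicitly naming the standalone master instead of
// relying on a default. "master-host:7077" is a placeholder for the real master URL.
import org.apache.spark.{SparkConf, SparkContext}

object ExplicitMasterExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("spark://master-host:7077")
      .setAppName("explicit-master-example")
    val sc = new SparkContext(conf)
    println(sc.parallelize(1 to 10).count()) // trivial job to confirm the connection works
    sc.stop()
  }
}
```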
[jira] [Commented] (SPARK-1998) SparkFlumeEvent with body bigger than 1020 bytes is not read properly
[ https://issues.apache.org/jira/browse/SPARK-1998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026053#comment-14026053 ] sunshangchun commented on SPARK-1998: - I've opened a pull request here (https://github.com/apache/spark/pull/951). Could anyone review it and resolve this issue? SparkFlumeEvent with body bigger than 1020 bytes is not read properly -- Key: SPARK-1998 URL: https://issues.apache.org/jira/browse/SPARK-1998 Project: Spark Issue Type: Bug Reporter: sun.sam Attachments: patch.diff -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2089) With YARN, preferredNodeLocalityData isn't honored
Sandy Ryza created SPARK-2089: - Summary: With YARN, preferredNodeLocalityData isn't honored Key: SPARK-2089 URL: https://issues.apache.org/jira/browse/SPARK-2089 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.0.0 Reporter: Sandy Ryza When running in YARN cluster mode, apps can pass preferred locality data when constructing a Spark context that will dictate where to request executor containers. This is currently broken because of a race condition. The Spark-YARN code runs the user class and waits for it to start up a SparkContext. During its initialization, the SparkContext will create a YarnClusterScheduler, which notifies a monitor in the Spark-YARN code that the SparkContext is ready. The Spark-YARN code then immediately fetches the preferredNodeLocationData from the SparkContext and uses it to start requesting containers. But in the SparkContext constructor that takes preferredNodeLocationData, that field is set after the rest of the initialization, so if the Spark-YARN code comes around quickly enough after being notified, the data it fetches is the empty, unset version. This occurred during all of my runs. -- This message was sent by Atlassian JIRA (v6.2#6252)
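The race described above boils down to an ordering problem: the readiness notification fires during SparkContext initialization, before preferredNodeLocationData has been assigned. The sketch below reproduces that ordering with invented class names (it is not the actual SparkContext or Spark-YARN code); a monitor thread woken by the signal reads the field before the constructor has set it.

```scala
// Illustrative sketch of the race: the "ready" signal fires before the locality field is
// assigned, so a fast waiter observes the empty default. Names are stand-ins only.
import java.util.concurrent.CountDownLatch

object LocalityRaceSketch {
  val ready = new CountDownLatch(1)
  @volatile var ctx: UserContext = _

  class UserContext(prefs: Map[String, Set[String]]) {
    var preferredNodeLocationData: Map[String, Set[String]] = Map.empty
    ctx = this
    ready.countDown()                 // scheduler creation notifies the monitor here...
    Thread.sleep(50)                  // ...while the constructor is still running
    preferredNodeLocationData = prefs // ...and the real value is only assigned now
  }

  def main(args: Array[String]): Unit = {
    val monitor = new Thread(new Runnable {
      def run(): Unit = {
        ready.await()
        // Usually prints Map() -- the empty, unset version of the locality data.
        println("fetched locality data: " + ctx.preferredNodeLocationData)
      }
    })
    monitor.start()
    new UserContext(Map("host1" -> Set("rack1")))
    monitor.join()
  }
}
```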
[jira] [Commented] (SPARK-2044) Pluggable interface for shuffles
[ https://issues.apache.org/jira/browse/SPARK-2044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026109#comment-14026109 ] Weihua Jiang commented on SPARK-2044: - Hi Matei, Thanks for the reply. I am glad that you think pushing sorting into the interface is useful. Yes, you are right. I misunderstood the partition id and map id. For the partition id range, I am totally OK with it. Pluggable interface for shuffles Key: SPARK-2044 URL: https://issues.apache.org/jira/browse/SPARK-2044 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Reporter: Matei Zaharia Assignee: Matei Zaharia Attachments: Pluggableshuffleproposal.pdf Given that a lot of the current activity in Spark Core is in shuffles, I wanted to propose factoring out shuffle implementations in a way that will make experimentation easier. Ideally we will converge on one implementation, but for a while, this could also be used to have several implementations coexist. I'm suggesting this because I'm aware of at least three efforts to look at shuffle (from Yahoo!, Intel and Databricks). Some of the things people are investigating are: * Push-based shuffle where data moves directly from mappers to reducers * Sorting-based instead of hash-based shuffle, to create fewer files (helps a lot with file handles and memory usage on large shuffles) * External spilling within a key * Changing the level of parallelism or even the algorithm for downstream stages at runtime based on statistics of the map output (this is a thing we had prototyped in the Shark research project but never merged into core) I've attached a design doc with a proposed interface. It's not too crazy because the interface between shuffles and the rest of the code is already pretty narrow (just some iterators for reading data and a writer interface for writing it). Bigger changes will be needed in the interaction with DAGScheduler and BlockManager for some of the ideas above, but we can handle those separately, and this interface will allow us to experiment with some short-term stuff sooner. If things go well I'd also like to send a sort-based shuffle implementation for 1.1, but we'll see how the timing on that works out. -- This message was sent by Atlassian JIRA (v6.2#6252)
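To make the "pretty narrow interface" remark above concrete: the shuffle boundary is essentially a writer that map tasks push records into and a reader that hands reduce tasks an iterator back. The traits below are only a rough sketch of how that boundary could be factored out for pluggability; the names and signatures are invented here for illustration and are not taken from the attached design doc.

```scala
// Rough sketch of a narrow, pluggable shuffle boundary. Invented names, not the proposal.
trait ShuffleWriter[K, V] {
  /** Write one map task's output records for this shuffle. */
  def write(records: Iterator[Product2[K, V]]): Unit
  /** Close the writer, committing its output only if the task succeeded. */
  def stop(success: Boolean): Unit
}

trait ShuffleReader[K, C] {
  /** Read the (possibly combined) values for the range of reduce partitions this task owns. */
  def read(): Iterator[Product2[K, C]]
}

trait ShuffleManager {
  def getWriter[K, V](shuffleId: Int, mapId: Int): ShuffleWriter[K, V]
  def getReader[K, C](shuffleId: Int, startPartition: Int, endPartition: Int): ShuffleReader[K, C]
}
```

A hash-based and a sort-based implementation could then coexist behind ShuffleManager while the rest of the scheduler code stays unchanged, which is the experimentation the proposal is after.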
[jira] [Resolved] (SPARK-1416) Add support for SequenceFiles in PySpark
[ https://issues.apache.org/jira/browse/SPARK-1416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-1416. -- Resolution: Fixed Fix Version/s: 1.1.0 Target Version/s: 1.1.0 Implemented in https://github.com/apache/spark/pull/455 Add support for SequenceFiles in PySpark Key: SPARK-1416 URL: https://issues.apache.org/jira/browse/SPARK-1416 Project: Spark Issue Type: Improvement Reporter: Matei Zaharia Assignee: Nick Pentreath Fix For: 1.1.0 Just covering the basic Hadoop Writable types (e.g. primitives, arrays of primitives, text) should still let people store data more efficiently. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1416) Add support for SequenceFiles in PySpark
[ https://issues.apache.org/jira/browse/SPARK-1416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-1416: - Assignee: Nick Pentreath Add support for SequenceFiles in PySpark Key: SPARK-1416 URL: https://issues.apache.org/jira/browse/SPARK-1416 Project: Spark Issue Type: Improvement Reporter: Matei Zaharia Assignee: Nick Pentreath Fix For: 1.1.0 Just covering the basic Hadoop Writable types (e.g. primitives, arrays of primitives, text) should still let people store data more efficiently. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2090) spark-shell input text entry not showing on REPL
Richard Conway created SPARK-2090: - Summary: spark-shell input text entry not showing on REPL Key: SPARK-2090 URL: https://issues.apache.org/jira/browse/SPARK-2090 Project: Spark Issue Type: Bug Components: Input/Output, Spark Core Affects Versions: 1.0.0 Environment: Ubuntu 14.04; Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_60) Reporter: Richard Conway Priority: Critical Fix For: 1.0.0 spark-shell doesn't display typed input. On startup it logs: Failed to created SparkJLineReader: java.io.IOException: Permission denied Falling back to SimpleReader. The driver has 2 workers on 2 virtual machines and is error free apart from the above line, so I think it may have something to do with the introduction of the new SecurityManager. The upshot is that when you type, nothing is displayed on the screen. For example, type test at the scala prompt and you won't see the input, but the output will show: scala console:11: error: package test is not a value test ^ -- This message was sent by Atlassian JIRA (v6.2#6252)