[jira] [Resolved] (SPARK-2067) Spark logo in application UI uses absolute path
[ https://issues.apache.org/jira/browse/SPARK-2067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-2067. Resolution: Fixed Fix Version/s: 1.1.0 1.0.1 Issue resolved by pull request 1006 [https://github.com/apache/spark/pull/1006] Spark logo in application UI uses absolute path --- Key: SPARK-2067 URL: https://issues.apache.org/jira/browse/SPARK-2067 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.0.0 Reporter: Neville Li Priority: Trivial Fix For: 1.0.1, 1.1.0 Link of the Spark logo in application UI (top left corner) is hard coded to /, and points to the wrong page when running with YARN proxy. Should use uiRoot instead. -- This message was sent by Atlassian JIRA (v6.2#6252)
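A minimal Scala sketch of the uiRoot idea described above, for illustration only: it assumes the YARN proxy prefix is exposed via the APPLICATION_WEB_PROXY_BASE environment variable or the spark.ui.proxyBase system property (names assumed here), and shows the logo link built from that prefix instead of a hard-coded /.
{code}
object UiLinkSketch {
  // The prefix the YARN proxy serves the UI under; empty when there is no proxy.
  def uiRoot: String =
    sys.props.get("spark.ui.proxyBase")
      .orElse(sys.env.get("APPLICATION_WEB_PROXY_BASE"))
      .getOrElse("")

  // Prepend uiRoot instead of hard-coding href="/" so the logo link stays correct
  // when the UI is reached through the YARN proxy.
  def logoHref: String = uiRoot + "/"
}
{code}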
[jira] [Updated] (SPARK-2067) Spark logo in application UI uses absolute path
[ https://issues.apache.org/jira/browse/SPARK-2067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2067: --- Assignee: Neville Li Spark logo in application UI uses absolute path --- Key: SPARK-2067 URL: https://issues.apache.org/jira/browse/SPARK-2067 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.0.0 Reporter: Neville Li Assignee: Neville Li Priority: Trivial Fix For: 1.0.1, 1.1.0 Link of the Spark logo in application UI (top left corner) is hard coded to /, and points to the wrong page when running with YARN proxy. Should use uiRoot instead. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2077) Log serializer in use on application startup
Andrew Ash created SPARK-2077: - Summary: Log serializer in use on application startup Key: SPARK-2077 URL: https://issues.apache.org/jira/browse/SPARK-2077 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Andrew Ash In a recent mailing list thread a user was uncertain that their {{spark.serializer}} setting was in effect. Let's log the serializer being used to protect against typos on the setting. -- This message was sent by Atlassian JIRA (v6.2#6252)
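A minimal sketch of the proposed log line, assuming the serializer class is resolved from SparkConf with JavaSerializer as the default; this is an illustration of the idea, not the actual change in the linked pull request.
{code}
import org.apache.spark.SparkConf

object SerializerLogSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    // Resolve the configured serializer class; JavaSerializer is Spark's default.
    val serializerClass = conf.get("spark.serializer",
      "org.apache.spark.serializer.JavaSerializer")
    // Logging this once at startup makes typos in the setting visible immediately.
    println(s"Using serializer: $serializerClass")
  }
}
{code}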
[jira] [Commented] (SPARK-2077) Log serializer in use on application startup
[ https://issues.apache.org/jira/browse/SPARK-2077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14021670#comment-14021670 ] Andrew Ash commented on SPARK-2077: --- https://github.com/apache/spark/pull/1017 Log serializer in use on application startup Key: SPARK-2077 URL: https://issues.apache.org/jira/browse/SPARK-2077 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Andrew Ash In a recent mailing list thread a user was uncertain that their {{spark.serializer}} setting was in effect. Let's log the serializer being used to protect against typos on the setting. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2078) Use ISO8601 date formats in logging
[ https://issues.apache.org/jira/browse/SPARK-2078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14021709#comment-14021709 ] Andrew Ash commented on SPARK-2078: --- https://github.com/apache/spark/pull/1018 Use ISO8601 date formats in logging --- Key: SPARK-2078 URL: https://issues.apache.org/jira/browse/SPARK-2078 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Andrew Ash Currently, logging has 2 digit years and doesn't include milliseconds in logging timestamps. Use ISO8601 date formats instead of the current custom formats. There is some precedent here for ISO8601 format -- it's what [Hadoop uses|https://github.com/apache/hadoop-common/blob/d92a8a29978e35ed36c4d4721a21c356c1ff1d4d/hadoop-common-project/hadoop-minikdc/src/main/resources/log4j.properties] -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2078) Use ISO8601 date formats in logging
Andrew Ash created SPARK-2078: - Summary: Use ISO8601 date formats in logging Key: SPARK-2078 URL: https://issues.apache.org/jira/browse/SPARK-2078 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Andrew Ash Currently, logging has 2 digit years and doesn't include milliseconds in logging timestamps. Use ISO8601 date formats instead of the current custom formats. There is some precedent here for ISO8601 format -- it's what [Hadoop uses|https://github.com/apache/hadoop-common/blob/d92a8a29978e35ed36c4d4721a21c356c1ff1d4d/hadoop-common-project/hadoop-minikdc/src/main/resources/log4j.properties] -- This message was sent by Atlassian JIRA (v6.2#6252)
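For illustration, a log4j.properties fragment that switches to log4j's built-in ISO8601 date format (4-digit year, millisecond precision); the appender names here are generic and may not match Spark's shipped template exactly.
{code}
# conf/log4j.properties (fragment)
log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
# %d{ISO8601} prints e.g. 2014-06-09 15:36:27,123 instead of a 2-digit-year format
log4j.appender.console.layout.ConversionPattern=%d{ISO8601} %p %c{1}: %m%n
{code}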
[jira] [Commented] (SPARK-2044) Pluggable interface for shuffles
[ https://issues.apache.org/jira/browse/SPARK-2044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14021716#comment-14021716 ] Weihua Jiang commented on SPARK-2044: - Hi Matei, Thanks a lot for your reply. 1. I am confused about your idea of sorting flag. ``The goal is to allow diverse shuffle implementations, so it doesn't make sense to add a flag for it. If we add a flag, every ShuffleManager will need to implement this feature. Instead we're trying to make the smallest interface that the code consuming this data needs, so that we can try multiple implementations of ShuffleManager and see which of these features work best. The Ordering object means that keys are comparable. This flag here would be to tell the ShuffleManager to sort the data, so that downstream algorithms like joins can work more efficiently.`` For your first statement, it seems you want to keep interface minimal, thus no need-to-sort flag is allowed. But for your second statement, you are allowing user to ask ShuffleManager to perform sort for the data. From my point of view, it is better to have such a flag to allow user to ask ShuffleManager to perform sort. Thus, operation like SQL order by can be implemented more efficiently. ShuffleManager can provide some utility class to perform general sorting so that not every implementation needs to implement its own sorting logic. 2. I agree that, for ShuffleReader, read a partition range is more efficient. However, if we want to break the barrier between map and reduce stage, we will encounter a situation that, when a reducer starts, not all its partitions are ready. If using partition range, reducer will wait for all partitions to be ready before executing reducer. It is better if reducer can start execution when some (not all) partitions are ready. The POC code can be found at https://github.com/lirui-intel/spark/tree/removeStageBarrier. This is why I think we need another read() function to specify a partition list instead of a range. Pluggable interface for shuffles Key: SPARK-2044 URL: https://issues.apache.org/jira/browse/SPARK-2044 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Reporter: Matei Zaharia Assignee: Matei Zaharia Attachments: Pluggableshuffleproposal.pdf Given that a lot of the current activity in Spark Core is in shuffles, I wanted to propose factoring out shuffle implementations in a way that will make experimentation easier. Ideally we will converge on one implementation, but for a while, this could also be used to have several implementations coexist. I'm suggesting this because I aware of at least three efforts to look at shuffle (from Yahoo!, Intel and Databricks). Some of the things people are investigating are: * Push-based shuffle where data moves directly from mappers to reducers * Sorting-based instead of hash-based shuffle, to create fewer files (helps a lot with file handles and memory usage on large shuffles) * External spilling within a key * Changing the level of parallelism or even algorithm for downstream stages at runtime based on statistics of the map output (this is a thing we had prototyped in the Shark research project but never merged in core) I've attached a design doc with a proposed interface. It's not too crazy because the interface between shuffles and the rest of the code is already pretty narrow (just some iterators for reading data and a writer interface for writing it). 
Bigger changes will be needed in the interaction with DAGScheduler and BlockManager for some of the ideas above, but we can handle those separately, and this interface will allow us to experiment with some short-term stuff sooner. If things go well I'd also like to send a sort-based shuffle implementation for 1.1, but we'll see how the timing on that works out. -- This message was sent by Atlassian JIRA (v6.2#6252)
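To make the discussion above easier to follow, here is a rough Scala sketch of the kind of interface being debated; the trait and method names are illustrative and are not taken from the attached design doc.
{code}
// Illustrative sketch only.
trait ShuffleHandle { def shuffleId: Int }

trait ShuffleWriter[K, V] {
  def write(records: Iterator[(K, V)]): Unit
  def stop(): Unit
}

trait ShuffleReader[K, V] {
  // Returns the combined records for the partitions this reader was created for.
  def read(): Iterator[(K, V)]
}

trait ShuffleManager {
  // Called on the driver to register a shuffle and obtain a handle describing it.
  def registerShuffle(shuffleId: Int, numMaps: Int): ShuffleHandle
  // Writer used by a map task to produce its output for this shuffle.
  def getWriter[K, V](handle: ShuffleHandle, mapId: Int): ShuffleWriter[K, V]
  // Reader over a contiguous partition range [startPartition, endPartition); the
  // thread above asks for an additional read() variant taking an arbitrary partition
  // list so reducers could start before every map output is ready.
  def getReader[K, V](handle: ShuffleHandle, startPartition: Int, endPartition: Int): ShuffleReader[K, V]
}
{code}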
[jira] [Resolved] (SPARK-1308) Add partitions() method to PySpark RDDs
[ https://issues.apache.org/jira/browse/SPARK-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-1308. Resolution: Fixed Fix Version/s: 1.1.0 Assignee: Reynold Xin Add partitions() method to PySpark RDDs --- Key: SPARK-1308 URL: https://issues.apache.org/jira/browse/SPARK-1308 Project: Spark Issue Type: New Feature Components: PySpark Affects Versions: 0.9.0 Reporter: Nicholas Chammas Assignee: Reynold Xin Priority: Minor Fix For: 1.1.0 In Spark, you can do this: {code} // Scala val a = sc.parallelize(List(1, 2, 3, 4), 4) a.partitions.size {code} Please make this possible in PySpark too. The work-around available is quite simple: {code} # Python a = sc.parallelize([1, 2, 3, 4], 4) a._jrdd.splits().size() {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1308) Add getNumPartitions() method to PySpark RDDs
[ https://issues.apache.org/jira/browse/SPARK-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-1308: --- Assignee: Syed A. Hashmi (was: Reynold Xin) Add getNumPartitions() method to PySpark RDDs - Key: SPARK-1308 URL: https://issues.apache.org/jira/browse/SPARK-1308 Project: Spark Issue Type: New Feature Components: PySpark Affects Versions: 0.9.0 Reporter: Nicholas Chammas Assignee: Syed A. Hashmi Priority: Minor Fix For: 1.1.0 In Spark, you can do this: {code} // Scala val a = sc.parallelize(List(1, 2, 3, 4), 4) a.partitions.size {code} Please make this possible in PySpark too. The work-around available is quite simple: {code} # Python a = sc.parallelize([1, 2, 3, 4], 4) a._jrdd.splits().size() {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
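A minimal PySpark sketch of the requested method, built directly on the work-around quoted in the ticket; the final implementation may differ.
{code}
class RDD(object):
    """Simplified stand-in for pyspark.rdd.RDD."""

    def __init__(self, jrdd):
        self._jrdd = jrdd

    def getNumPartitions(self):
        """Return the number of partitions in this RDD."""
        return self._jrdd.splits().size()
{code}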
[jira] [Comment Edited] (SPARK-2044) Pluggable interface for shuffles
[ https://issues.apache.org/jira/browse/SPARK-2044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14021716#comment-14021716 ] Weihua Jiang edited comment on SPARK-2044 at 6/9/14 7:10 AM: - Hi Matei, Thanks a lot for your reply. 1. I am confused about your idea of sorting flag. ??The goal is to allow diverse shuffle implementations, so it doesn't make sense to add a flag for it. If we add a flag, every ShuffleManager will need to implement this feature. Instead we're trying to make the smallest interface that the code consuming this data needs, so that we can try multiple implementations of ShuffleManager and see which of these features work best. The Ordering object means that keys are comparable. This flag here would be to tell the ShuffleManager to sort the data, so that downstream algorithms like joins can work more efficiently.?? For your first statement, it seems you want to keep interface minimal, thus no need-to-sort flag is allowed. But for your second statement, you are allowing user to ask ShuffleManager to perform sort for the data. From my point of view, it is better to have such a flag to allow user to ask ShuffleManager to perform sort. Thus, operation like SQL order by can be implemented more efficiently. ShuffleManager can provide some utility class to perform general sorting so that not every implementation needs to implement its own sorting logic. 2. I agree that, for ShuffleReader, read a partition range is more efficient. However, if we want to break the barrier between map and reduce stage, we will encounter a situation that, when a reducer starts, not all its partitions are ready. If using partition range, reducer will wait for all partitions to be ready before executing reducer. It is better if reducer can start execution when some (not all) partitions are ready. The POC code can be found at https://github.com/lirui-intel/spark/tree/removeStageBarrier. This is why I think we need another read() function to specify a partition list instead of a range. was (Author: whjiang): Hi Matei, Thanks a lot for your reply. 1. I am confused about your idea of sorting flag. ``The goal is to allow diverse shuffle implementations, so it doesn't make sense to add a flag for it. If we add a flag, every ShuffleManager will need to implement this feature. Instead we're trying to make the smallest interface that the code consuming this data needs, so that we can try multiple implementations of ShuffleManager and see which of these features work best. The Ordering object means that keys are comparable. This flag here would be to tell the ShuffleManager to sort the data, so that downstream algorithms like joins can work more efficiently.`` For your first statement, it seems you want to keep interface minimal, thus no need-to-sort flag is allowed. But for your second statement, you are allowing user to ask ShuffleManager to perform sort for the data. From my point of view, it is better to have such a flag to allow user to ask ShuffleManager to perform sort. Thus, operation like SQL order by can be implemented more efficiently. ShuffleManager can provide some utility class to perform general sorting so that not every implementation needs to implement its own sorting logic. 2. I agree that, for ShuffleReader, read a partition range is more efficient. However, if we want to break the barrier between map and reduce stage, we will encounter a situation that, when a reducer starts, not all its partitions are ready. 
If using partition range, reducer will wait for all partitions to be ready before executing reducer. It is better if reducer can start execution when some (not all) partitions are ready. The POC code can be found at https://github.com/lirui-intel/spark/tree/removeStageBarrier. This is why I think we need another read() function to specify a partition list instead of a range. Pluggable interface for shuffles Key: SPARK-2044 URL: https://issues.apache.org/jira/browse/SPARK-2044 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Reporter: Matei Zaharia Assignee: Matei Zaharia Attachments: Pluggableshuffleproposal.pdf Given that a lot of the current activity in Spark Core is in shuffles, I wanted to propose factoring out shuffle implementations in a way that will make experimentation easier. Ideally we will converge on one implementation, but for a while, this could also be used to have several implementations coexist. I'm suggesting this because I aware of at least three efforts to look at shuffle (from Yahoo!, Intel and Databricks). Some of the things people are investigating are: * Push-based shuffle where data moves directly from mappers to reducers * Sorting-based
[jira] [Updated] (SPARK-1308) Add getNumPartitions() method to PySpark RDDs
[ https://issues.apache.org/jira/browse/SPARK-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-1308: --- Summary: Add getNumPartitions() method to PySpark RDDs (was: Add partitions() method to PySpark RDDs) Add getNumPartitions() method to PySpark RDDs - Key: SPARK-1308 URL: https://issues.apache.org/jira/browse/SPARK-1308 Project: Spark Issue Type: New Feature Components: PySpark Affects Versions: 0.9.0 Reporter: Nicholas Chammas Assignee: Reynold Xin Priority: Minor Fix For: 1.1.0 In Spark, you can do this: {code} // Scala val a = sc.parallelize(List(1, 2, 3, 4), 4) a.partitions.size {code} Please make this possible in PySpark too. The work-around available is quite simple: {code} # Python a = sc.parallelize([1, 2, 3, 4], 4) a._jrdd.splits().size() {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-2044) Pluggable interface for shuffles
[ https://issues.apache.org/jira/browse/SPARK-2044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14021716#comment-14021716 ] Weihua Jiang edited comment on SPARK-2044 at 6/9/14 7:11 AM: - Hi Matei, Thanks a lot for your reply. 1. I am confused about your idea of sorting flag. {quote} The goal is to allow diverse shuffle implementations, so it doesn't make sense to add a flag for it. If we add a flag, every ShuffleManager will need to implement this feature. Instead we're trying to make the smallest interface that the code consuming this data needs, so that we can try multiple implementations of ShuffleManager and see which of these features work best. The Ordering object means that keys are comparable. This flag here would be to tell the ShuffleManager to sort the data, so that downstream algorithms like joins can work more efficiently. {quote} For your first statement, it seems you want to keep interface minimal, thus no need-to-sort flag is allowed. But for your second statement, you are allowing user to ask ShuffleManager to perform sort for the data. From my point of view, it is better to have such a flag to allow user to ask ShuffleManager to perform sort. Thus, operation like SQL order by can be implemented more efficiently. ShuffleManager can provide some utility class to perform general sorting so that not every implementation needs to implement its own sorting logic. 2. I agree that, for ShuffleReader, read a partition range is more efficient. However, if we want to break the barrier between map and reduce stage, we will encounter a situation that, when a reducer starts, not all its partitions are ready. If using partition range, reducer will wait for all partitions to be ready before executing reducer. It is better if reducer can start execution when some (not all) partitions are ready. The POC code can be found at https://github.com/lirui-intel/spark/tree/removeStageBarrier. This is why I think we need another read() function to specify a partition list instead of a range. was (Author: whjiang): Hi Matei, Thanks a lot for your reply. 1. I am confused about your idea of sorting flag. ??The goal is to allow diverse shuffle implementations, so it doesn't make sense to add a flag for it. If we add a flag, every ShuffleManager will need to implement this feature. Instead we're trying to make the smallest interface that the code consuming this data needs, so that we can try multiple implementations of ShuffleManager and see which of these features work best. The Ordering object means that keys are comparable. This flag here would be to tell the ShuffleManager to sort the data, so that downstream algorithms like joins can work more efficiently.?? For your first statement, it seems you want to keep interface minimal, thus no need-to-sort flag is allowed. But for your second statement, you are allowing user to ask ShuffleManager to perform sort for the data. From my point of view, it is better to have such a flag to allow user to ask ShuffleManager to perform sort. Thus, operation like SQL order by can be implemented more efficiently. ShuffleManager can provide some utility class to perform general sorting so that not every implementation needs to implement its own sorting logic. 2. I agree that, for ShuffleReader, read a partition range is more efficient. However, if we want to break the barrier between map and reduce stage, we will encounter a situation that, when a reducer starts, not all its partitions are ready. 
If using partition range, reducer will wait for all partitions to be ready before executing reducer. It is better if reducer can start execution when some (not all) partitions are ready. The POC code can be found at https://github.com/lirui-intel/spark/tree/removeStageBarrier. This is why I think we need another read() function to specify a partition list instead of a range. Pluggable interface for shuffles Key: SPARK-2044 URL: https://issues.apache.org/jira/browse/SPARK-2044 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Reporter: Matei Zaharia Assignee: Matei Zaharia Attachments: Pluggableshuffleproposal.pdf Given that a lot of the current activity in Spark Core is in shuffles, I wanted to propose factoring out shuffle implementations in a way that will make experimentation easier. Ideally we will converge on one implementation, but for a while, this could also be used to have several implementations coexist. I'm suggesting this because I aware of at least three efforts to look at shuffle (from Yahoo!, Intel and Databricks). Some of the things people are investigating are: * Push-based shuffle where data moves directly from mappers to reducers *
[jira] [Commented] (SPARK-1944) Document --verbose in spark-shell -h
[ https://issues.apache.org/jira/browse/SPARK-1944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14021726#comment-14021726 ] Andrew Ash commented on SPARK-1944: --- https://github.com/apache/spark/pull/1020 Document --verbose in spark-shell -h Key: SPARK-1944 URL: https://issues.apache.org/jira/browse/SPARK-1944 Project: Spark Issue Type: Documentation Components: Spark Core Affects Versions: 1.0.0 Reporter: Andrew Ash Assignee: Andrew Ash Priority: Minor The below help for spark-submit should make mention of the {{--verbose}} option {noformat} aash@aash-mbp ~/git/spark$ ./bin/spark-submit -h Usage: spark-submit [options] app jar [app options] Options: --master MASTER_URL spark://host:port, mesos://host:port, yarn, or local. --deploy-mode DEPLOY_MODE Mode to deploy the app in, either 'client' or 'cluster'. --class CLASS_NAME Name of your app's main class (required for Java apps). --arg ARG Argument to be passed to your application's main class. This option can be specified multiple times for multiple args. --name NAME The name of your application (Default: 'Spark'). --jars JARS A comma-separated list of local jars to include on the driver classpath and that SparkContext.addJar will work with. Doesn't work on standalone with 'cluster' deploy mode. --files FILES Comma separated list of files to be placed in the working dir of each executor. --properties-file FILE Path to a file from which to load extra properties. If not specified, this will look for conf/spark-defaults.conf. --driver-memory MEM Memory for driver (e.g. 1000M, 2G) (Default: 512M). --driver-java-options Extra Java options to pass to the driver --driver-library-path Extra library path entries to pass to the driver --driver-class-path Extra class path entries to pass to the driver. Note that jars added with --jars are automatically included in the classpath. --executor-memory MEM Memory per executor (e.g. 1000M, 2G) (Default: 1G). Spark standalone with cluster deploy mode only: --driver-cores NUM Cores for driver (Default: 1). --supervise If given, restarts the driver on failure. Spark standalone and Mesos only: --total-executor-cores NUM Total cores for all executors. YARN-only: --executor-cores NUMNumber of cores per executor (Default: 1). --queue QUEUE_NAME The YARN queue to submit to (Default: 'default'). --num-executors NUM Number of executors to (Default: 2). --archives ARCHIVES Comma separated list of archives to be extracted into the working dir of each executor. aash@aash-mbp ~/git/spark$ {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1977) mutable.BitSet in ALS not serializable with KryoSerializer
[ https://issues.apache.org/jira/browse/SPARK-1977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14021797#comment-14021797 ] Santiago M. Mola commented on SPARK-1977: - Xiangrui Meng, I can't reproduce it at the moment. It takes a quite big dataset to reproduce and I have my machines busy. But I'm pretty sure the stacktrace is exactly the same as the one posted by Neville Li. My bet is that this will be fixed with next Twitter Chill release: https://github.com/twitter/chill/commit/b47512c2c75b94b7c5945985306fa303576bf90d mutable.BitSet in ALS not serializable with KryoSerializer -- Key: SPARK-1977 URL: https://issues.apache.org/jira/browse/SPARK-1977 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.0.0 Reporter: Neville Li Priority: Minor OutLinkBlock in ALS.scala has an Array[mutable.BitSet] member. KryoSerializer uses AllScalaRegistrar from Twitter chill but it doesn't register mutable.BitSet. Right now we have to register mutable.BitSet manually. A proper fix would be using immutable.BitSet in ALS or register mutable.BitSet in upstream chill. {code} Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1724.0:9 failed 4 times, most recent failure: Exception failure in TID 68548 on host lon4-hadoopslave-b232.lon4.spotify.net: com.esotericsoftware.kryo.KryoException: java.lang.ArrayStoreException: scala.collection.mutable.HashSet Serialization trace: shouldSend (org.apache.spark.mllib.recommendation.OutLinkBlock) com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626) com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221) com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:43) com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:34) com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:115) org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:125) org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71) org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:155) org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:154) scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:154) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:77) org.apache.spark.rdd.RDD.iterator(RDD.scala:227) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111) org.apache.spark.scheduler.Task.run(Task.scala:51) 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187) java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) java.lang.Thread.run(Thread.java:662) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015) at
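A sketch of the manual-registration work-around mentioned above; the registrator class name and wiring are illustrative, but KryoRegistrator and the spark.kryo.registrator setting are the standard hooks.
{code}
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

// Register mutable.BitSet explicitly until an upstream chill release covers it.
class BitSetRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[scala.collection.mutable.BitSet])
  }
}

// Wiring (class name is illustrative):
//   spark.serializer        org.apache.spark.serializer.KryoSerializer
//   spark.kryo.registrator  BitSetRegistrator
{code}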
[jira] [Updated] (SPARK-1719) spark.executor.extraLibraryPath isn't applied on yarn
[ https://issues.apache.org/jira/browse/SPARK-1719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guoqiang Li updated SPARK-1719: --- Fix Version/s: 1.1.0 spark.executor.extraLibraryPath isn't applied on yarn - Key: SPARK-1719 URL: https://issues.apache.org/jira/browse/SPARK-1719 Project: Spark Issue Type: Sub-task Components: YARN Affects Versions: 1.0.0 Reporter: Thomas Graves Assignee: Guoqiang Li Fix For: 1.1.0 Looking through the code for spark on yarn, I don't see that spark.executor.extraLibraryPath is being properly applied when it launches executors. It is using spark.driver.libraryPath in ClientBase. Note I didn't actually test it, so it's possible I missed something. I also think it is better to use LD_LIBRARY_PATH rather than -Djava.library.path: once java.library.path is set, it doesn't search LD_LIBRARY_PATH. In Hadoop we switched to use LD_LIBRARY_PATH instead of java.library.path. See https://issues.apache.org/jira/browse/MAPREDUCE-4072. I'll split this into a separate jira. -- This message was sent by Atlassian JIRA (v6.2#6252)
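For reference, the setting in question as it would appear in spark-defaults.conf; the path is a placeholder.
{code}
# Not currently honored by the YARN executor launcher, per this issue:
spark.executor.extraLibraryPath  /opt/hadoop/lib/native
{code}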
[jira] [Commented] (SPARK-2071) Package private classes that are deleted from an older version of Spark trigger errors
[ https://issues.apache.org/jira/browse/SPARK-2071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14025239#comment-14025239 ] Prashant Sharma commented on SPARK-2071: Or manually place the jar of the older version on ./spark-class before invoking GenerateMimaIgnore. Package private classes that are deleted from an older version of Spark trigger errors -- Key: SPARK-2071 URL: https://issues.apache.org/jira/browse/SPARK-2071 Project: Spark Issue Type: Sub-task Components: Build Reporter: Patrick Wendell Assignee: Prashant Sharma Fix For: 1.1.0 We should figure out how to fix this. One idea is to run the MIMA exclude generator with sbt itself (rather than ./spark-class) so it can run against the older versions of Spark and make sure to exclude classes that are marked as package private in that version as well. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1291) Link the spark UI to RM ui in yarn-client mode
[ https://issues.apache.org/jira/browse/SPARK-1291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14025244#comment-14025244 ] Thomas Graves commented on SPARK-1291: -- https://github.com/apache/spark/pull/1002 Link the spark UI to RM ui in yarn-client mode -- Key: SPARK-1291 URL: https://issues.apache.org/jira/browse/SPARK-1291 Project: Spark Issue Type: Improvement Affects Versions: 0.9.0, 1.0.0 Reporter: Thomas Graves Currently when you run spark on yarn in the yarn-client mode the spark UI is not linked up to the Yarn Resource manager UI so its harder for a user of YARN to find the UI. Note that in yarn-standalone/yarn-cluster mode it is properly linked up. Ideally the yarn-client UI should also be hooked up to the Yarn RM proxy for security. The challenge with the yarn-client mode is that the UI is started before the application master and it doesn't know what the yarn proxy link is when the UI started. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2075) Hadoop1 distribution of 1.0.0 does not contain classes expected by the Maven 1.0.0 artifact
[ https://issues.apache.org/jira/browse/SPARK-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14025255#comment-14025255 ] Paul R. Brown commented on SPARK-2075: -- The job is run by a Java client that connects to the master (using a SparkContext). Bundling is performed by a Maven build with two shade plugin invocations, one to package a driver uberjar and one to package a worker uberjar. The worker flavor is sent to the worker nodes; the driver contains the code to connect to the master and run the job. The Maven build runs against the JAR from Maven Central, and the deployment uses the Spark 1.0.0 hadoop1 download. (The Spark download is staged to S3 once and then downloaded onto master/worker nodes and set up during cluster provisioning.) The Maven build uses the usual Scala setup with the library as a dependency and the plugin:
{code}
<dependency>
  <groupId>org.scala-lang</groupId>
  <artifactId>scala-library</artifactId>
  <version>2.10.3</version>
</dependency>
{code}
{code}
<plugin>
  <groupId>net.alchim31.maven</groupId>
  <artifactId>scala-maven-plugin</artifactId>
  <executions>
    <execution>
      <goals>
        <goal>compile</goal>
        <goal>testCompile</goal>
      </goals>
    </execution>
  </executions>
  <configuration>
    <scalaVersion>2.10.3</scalaVersion>
    <jvmArgs>
      <jvmArg>-Xms64m</jvmArg>
      <jvmArg>-Xmx4096m</jvmArg>
    </jvmArgs>
  </configuration>
</plugin>
{code}
Hadoop1 distribution of 1.0.0 does not contain classes expected by the Maven 1.0.0 artifact --- Key: SPARK-2075 URL: https://issues.apache.org/jira/browse/SPARK-2075 Project: Spark Issue Type: Bug Components: Build, Spark Core Affects Versions: 1.0.0 Reporter: Paul R. Brown Running a job built against the Maven dep for 1.0.0 and the hadoop1 distribution produces: {code} java.lang.ClassNotFoundException: org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1 {code} Here's what's in the Maven dep as of 1.0.0: {code} jar tvf ~/.m2/repository/org/apache/spark/spark-core_2.10/1.0.0/spark-core_2.10-1.0.0.jar | grep 'rdd/RDD' | grep 'saveAs' 1519 Mon May 26 13:57:58 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class 1560 Mon May 26 13:57:58 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class {code} And here's what's in the hadoop1 distribution: {code} jar tvf spark-assembly-1.0.0-hadoop1.0.4.jar| grep 'rdd/RDD' | grep 'saveAs' {code} I.e., it's not there. It is in the hadoop2 distribution: {code} jar tvf spark-assembly-1.0.0-hadoop2.2.0.jar| grep 'rdd/RDD' | grep 'saveAs' 1519 Mon May 26 07:29:54 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class 1560 Mon May 26 07:29:54 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-2075) Hadoop1 distribution of 1.0.0 does not contain classes expected by the Maven 1.0.0 artifact
[ https://issues.apache.org/jira/browse/SPARK-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14025296#comment-14025296 ] Paul R. Brown edited comment on SPARK-2075 at 6/9/14 3:36 PM: -- As food for thought, [here|http://docs.oracle.com/javase/specs/jvms/se7/html/jvms-4.html#jvms-4.7.6] is the {{code}}InnerClass{{code}} section of the JVM spec. It looks like there have been some changes from 2.10.3 to 2.10.4 (e.g., [SI-6546|https://issues.scala-lang.org/browse/SI-6546], but I didn't dig in. I think the thing most likely to work is to ensure that exactly the same bits are used by all of the distributions and posted to Maven Central. (For some discussion on inner class naming stability, there was quite a bit of it on the Java 8 lambda discussion list, e.g., [this message|http://mail.openjdk.java.net/pipermail/lambda-spec-experts/2013-July/000316.html].) was (Author: paulrbrown): As food for thought, [here|http://docs.oracle.com/javase/specs/jvms/se7/html/jvms-4.html#jvms-4.7.6] is the {code}InnerClass{code} section of the JVM spec. It looks like there have been some changes from 2.10.3 to 2.10.4 (e.g., [SI-6546|https://issues.scala-lang.org/browse/SI-6546], but I didn't dig in. I think the thing most likely to work is to ensure that exactly the same bits are used by all of the distributions and posted to Maven Central. (For some discussion on inner class naming stability, there was quite a bit of it on the Java 8 lambda discussion list, e.g., [this message|http://mail.openjdk.java.net/pipermail/lambda-spec-experts/2013-July/000316.html].) Hadoop1 distribution of 1.0.0 does not contain classes expected by the Maven 1.0.0 artifact --- Key: SPARK-2075 URL: https://issues.apache.org/jira/browse/SPARK-2075 Project: Spark Issue Type: Bug Components: Build, Spark Core Affects Versions: 1.0.0 Reporter: Paul R. Brown Running a job built against the Maven dep for 1.0.0 and the hadoop1 distribution produces: {code} java.lang.ClassNotFoundException: org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1 {code} Here's what's in the Maven dep as of 1.0.0: {code} jar tvf ~/.m2/repository/org/apache/spark/spark-core_2.10/1.0.0/spark-core_2.10-1.0.0.jar | grep 'rdd/RDD' | grep 'saveAs' 1519 Mon May 26 13:57:58 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class 1560 Mon May 26 13:57:58 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class {code} And here's what's in the hadoop1 distribution: {code} jar tvf spark-assembly-1.0.0-hadoop1.0.4.jar| grep 'rdd/RDD' | grep 'saveAs' {code} I.e., it's not there. It is in the hadoop2 distribution: {code} jar tvf spark-assembly-1.0.0-hadoop2.2.0.jar| grep 'rdd/RDD' | grep 'saveAs' 1519 Mon May 26 07:29:54 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class 1560 Mon May 26 07:29:54 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2075) Hadoop1 distribution of 1.0.0 does not contain classes expected by the Maven 1.0.0 artifact
[ https://issues.apache.org/jira/browse/SPARK-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14025296#comment-14025296 ] Paul R. Brown commented on SPARK-2075: -- As food for thought, [here|http://docs.oracle.com/javase/specs/jvms/se7/html/jvms-4.html#jvms-4.7.6] is the {code}InnerClass{code} section of the JVM spec. It looks like there have been some changes from 2.10.3 to 2.10.4 (e.g., [SI-6546|https://issues.scala-lang.org/browse/SI-6546], but I didn't dig in. I think the thing most likely to work is to ensure that exactly the same bits are used by all of the distributions and posted to Maven Central. (For some discussion on inner class naming stability, there was quite a bit of it on the Java 8 lambda discussion list, e.g., [this message|http://mail.openjdk.java.net/pipermail/lambda-spec-experts/2013-July/000316.html].) Hadoop1 distribution of 1.0.0 does not contain classes expected by the Maven 1.0.0 artifact --- Key: SPARK-2075 URL: https://issues.apache.org/jira/browse/SPARK-2075 Project: Spark Issue Type: Bug Components: Build, Spark Core Affects Versions: 1.0.0 Reporter: Paul R. Brown Running a job built against the Maven dep for 1.0.0 and the hadoop1 distribution produces: {code} java.lang.ClassNotFoundException: org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1 {code} Here's what's in the Maven dep as of 1.0.0: {code} jar tvf ~/.m2/repository/org/apache/spark/spark-core_2.10/1.0.0/spark-core_2.10-1.0.0.jar | grep 'rdd/RDD' | grep 'saveAs' 1519 Mon May 26 13:57:58 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class 1560 Mon May 26 13:57:58 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class {code} And here's what's in the hadoop1 distribution: {code} jar tvf spark-assembly-1.0.0-hadoop1.0.4.jar| grep 'rdd/RDD' | grep 'saveAs' {code} I.e., it's not there. It is in the hadoop2 distribution: {code} jar tvf spark-assembly-1.0.0-hadoop2.2.0.jar| grep 'rdd/RDD' | grep 'saveAs' 1519 Mon May 26 07:29:54 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class 1560 Mon May 26 07:29:54 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-2075) Hadoop1 distribution of 1.0.0 does not contain classes expected by the Maven 1.0.0 artifact
[ https://issues.apache.org/jira/browse/SPARK-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14025296#comment-14025296 ] Paul R. Brown edited comment on SPARK-2075 at 6/9/14 3:37 PM: -- As food for thought, [here|http://docs.oracle.com/javase/specs/jvms/se7/html/jvms-4.html#jvms-4.7.6] is the {{InnerClass}} section of the JVM spec. It looks like there have been some changes from 2.10.3 to 2.10.4 (e.g., [SI-6546|https://issues.scala-lang.org/browse/SI-6546], but I didn't dig in. I think the thing most likely to work is to ensure that exactly the same bits are used by all of the distributions and posted to Maven Central. (For some discussion on inner class naming stability, there was quite a bit of it on the Java 8 lambda discussion list, e.g., [this message|http://mail.openjdk.java.net/pipermail/lambda-spec-experts/2013-July/000316.html].) was (Author: paulrbrown): As food for thought, [here|http://docs.oracle.com/javase/specs/jvms/se7/html/jvms-4.html#jvms-4.7.6] is the {{code}}InnerClass{{code}} section of the JVM spec. It looks like there have been some changes from 2.10.3 to 2.10.4 (e.g., [SI-6546|https://issues.scala-lang.org/browse/SI-6546], but I didn't dig in. I think the thing most likely to work is to ensure that exactly the same bits are used by all of the distributions and posted to Maven Central. (For some discussion on inner class naming stability, there was quite a bit of it on the Java 8 lambda discussion list, e.g., [this message|http://mail.openjdk.java.net/pipermail/lambda-spec-experts/2013-July/000316.html].) Hadoop1 distribution of 1.0.0 does not contain classes expected by the Maven 1.0.0 artifact --- Key: SPARK-2075 URL: https://issues.apache.org/jira/browse/SPARK-2075 Project: Spark Issue Type: Bug Components: Build, Spark Core Affects Versions: 1.0.0 Reporter: Paul R. Brown Running a job built against the Maven dep for 1.0.0 and the hadoop1 distribution produces: {code} java.lang.ClassNotFoundException: org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1 {code} Here's what's in the Maven dep as of 1.0.0: {code} jar tvf ~/.m2/repository/org/apache/spark/spark-core_2.10/1.0.0/spark-core_2.10-1.0.0.jar | grep 'rdd/RDD' | grep 'saveAs' 1519 Mon May 26 13:57:58 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class 1560 Mon May 26 13:57:58 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class {code} And here's what's in the hadoop1 distribution: {code} jar tvf spark-assembly-1.0.0-hadoop1.0.4.jar| grep 'rdd/RDD' | grep 'saveAs' {code} I.e., it's not there. It is in the hadoop2 distribution: {code} jar tvf spark-assembly-1.0.0-hadoop2.2.0.jar| grep 'rdd/RDD' | grep 'saveAs' 1519 Mon May 26 07:29:54 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class 1560 Mon May 26 07:29:54 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-2075) Hadoop1 distribution of 1.0.0 does not contain classes expected by the Maven 1.0.0 artifact
[ https://issues.apache.org/jira/browse/SPARK-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14025296#comment-14025296 ] Paul R. Brown edited comment on SPARK-2075 at 6/9/14 3:54 PM: -- As food for thought, [here|http://docs.oracle.com/javase/specs/jvms/se7/html/jvms-4.html#jvms-4.7.6] is the {{InnerClass}} section of the JVM spec. It looks like there have been some changes from 2.10.3 to 2.10.4 (e.g., [SI-6546|https://issues.scala-lang.org/browse/SI-6546]), but I didn't dig in. I think the thing most likely to work is to ensure that exactly the same bits are used by all of the distributions and posted to Maven Central. (For some discussion on inner class naming stability, there was quite a bit of it on the Java 8 lambda discussion list, e.g., [this message|http://mail.openjdk.java.net/pipermail/lambda-spec-experts/2013-July/000316.html].) was (Author: paulrbrown): As food for thought, [here|http://docs.oracle.com/javase/specs/jvms/se7/html/jvms-4.html#jvms-4.7.6] is the {{InnerClass}} section of the JVM spec. It looks like there have been some changes from 2.10.3 to 2.10.4 (e.g., [SI-6546|https://issues.scala-lang.org/browse/SI-6546], but I didn't dig in. I think the thing most likely to work is to ensure that exactly the same bits are used by all of the distributions and posted to Maven Central. (For some discussion on inner class naming stability, there was quite a bit of it on the Java 8 lambda discussion list, e.g., [this message|http://mail.openjdk.java.net/pipermail/lambda-spec-experts/2013-July/000316.html].) Hadoop1 distribution of 1.0.0 does not contain classes expected by the Maven 1.0.0 artifact --- Key: SPARK-2075 URL: https://issues.apache.org/jira/browse/SPARK-2075 Project: Spark Issue Type: Bug Components: Build, Spark Core Affects Versions: 1.0.0 Reporter: Paul R. Brown Running a job built against the Maven dep for 1.0.0 and the hadoop1 distribution produces: {code} java.lang.ClassNotFoundException: org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1 {code} Here's what's in the Maven dep as of 1.0.0: {code} jar tvf ~/.m2/repository/org/apache/spark/spark-core_2.10/1.0.0/spark-core_2.10-1.0.0.jar | grep 'rdd/RDD' | grep 'saveAs' 1519 Mon May 26 13:57:58 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class 1560 Mon May 26 13:57:58 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class {code} And here's what's in the hadoop1 distribution: {code} jar tvf spark-assembly-1.0.0-hadoop1.0.4.jar| grep 'rdd/RDD' | grep 'saveAs' {code} I.e., it's not there. It is in the hadoop2 distribution: {code} jar tvf spark-assembly-1.0.0-hadoop2.2.0.jar| grep 'rdd/RDD' | grep 'saveAs' 1519 Mon May 26 07:29:54 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class 1560 Mon May 26 07:29:54 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2080) Yarn: history UI link missing, wrong reported user
Marcelo Vanzin created SPARK-2080: - Summary: Yarn: history UI link missing, wrong reported user Key: SPARK-2080 URL: https://issues.apache.org/jira/browse/SPARK-2080 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.0.0 Reporter: Marcelo Vanzin In Yarn client mode, the History UI link is not set for finished applications (it is for cluster mode). In Yarn cluster mode, the user reported by the application is wrong - it reports the user running the Yarn service, not the user running the Yarn application. PR is up: https://github.com/apache/spark/pull/1002 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2079) Skip unnecessary wrapping in List when serializing SchemaRDD to Python
Kan Zhang created SPARK-2079: Summary: Skip unnecessary wrapping in List when serializing SchemaRDD to Python Key: SPARK-2079 URL: https://issues.apache.org/jira/browse/SPARK-2079 Project: Spark Issue Type: Improvement Components: PySpark, SQL Affects Versions: 1.0.0 Reporter: Kan Zhang -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2080) Yarn: history UI link missing, wrong reported user
[ https://issues.apache.org/jira/browse/SPARK-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14025346#comment-14025346 ] Marcelo Vanzin commented on SPARK-2080: --- Patrick / someone, I can't seem to be able to assign bugs to myself anymore, could someone do that? Thanks. Yarn: history UI link missing, wrong reported user -- Key: SPARK-2080 URL: https://issues.apache.org/jira/browse/SPARK-2080 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.0.0 Reporter: Marcelo Vanzin In Yarn client mode, the History UI link is not set for finished applications (it is for cluster mode). In Yarn cluster mode, the user reported by the application is wrong - it reports the user running the Yarn service, not the user running the Yarn application. PR is up: https://github.com/apache/spark/pull/1002 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2079) Skip unnecessary wrapping in List when serializing SchemaRDD to Python
[ https://issues.apache.org/jira/browse/SPARK-2079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14025363#comment-14025363 ] Kan Zhang commented on SPARK-2079: -- PR: https://github.com/apache/spark/pull/1023 Skip unnecessary wrapping in List when serializing SchemaRDD to Python -- Key: SPARK-2079 URL: https://issues.apache.org/jira/browse/SPARK-2079 Project: Spark Issue Type: Improvement Components: PySpark, SQL Affects Versions: 1.0.0 Reporter: Kan Zhang Assignee: Kan Zhang Finishing the TODO:
{code}
private[sql] def javaToPython: JavaRDD[Array[Byte]] = {
  val fieldNames: Seq[String] = this.queryExecution.analyzed.output.map(_.name)
  this.mapPartitions { iter =>
    val pickle = new Pickler
    iter.map { row =>
      val map: JMap[String, Any] = new java.util.HashMap
      // TODO: We place the map in an ArrayList so that the object is pickled to a List[Dict].
      // Ideally we should be able to pickle an object directly into a Python collection so we
      // don't have to create an ArrayList every time.
      val arr: java.util.ArrayList[Any] = new java.util.ArrayList
      row.zip(fieldNames).foreach { case (obj, name) => map.put(name, obj) }
      arr.add(map)
      pickle.dumps(arr)
    }
  }
}
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2079) Removing unnecessary wrapping when serializing SchemaRDD to Python
[ https://issues.apache.org/jira/browse/SPARK-2079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kan Zhang updated SPARK-2079: - Summary: Removing unnecessary wrapping when serializing SchemaRDD to Python (was: Skip unnecessary wrapping in List when serializing SchemaRDD to Python) Removing unnecessary wrapping when serializing SchemaRDD to Python -- Key: SPARK-2079 URL: https://issues.apache.org/jira/browse/SPARK-2079 Project: Spark Issue Type: Improvement Components: PySpark, SQL Affects Versions: 1.0.0 Reporter: Kan Zhang Assignee: Kan Zhang Finishing the TODO:
{code}
private[sql] def javaToPython: JavaRDD[Array[Byte]] = {
  val fieldNames: Seq[String] = this.queryExecution.analyzed.output.map(_.name)
  this.mapPartitions { iter =>
    val pickle = new Pickler
    iter.map { row =>
      val map: JMap[String, Any] = new java.util.HashMap
      // TODO: We place the map in an ArrayList so that the object is pickled to a List[Dict].
      // Ideally we should be able to pickle an object directly into a Python collection so we
      // don't have to create an ArrayList every time.
      val arr: java.util.ArrayList[Any] = new java.util.ArrayList
      row.zip(fieldNames).foreach { case (obj, name) => map.put(name, obj) }
      arr.add(map)
      pickle.dumps(arr)
    }
  }
}
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
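A sketch of the direction the new summary describes, pickling each row's map directly instead of wrapping it in a single-element ArrayList; the helper method and signature below are illustrative, not the merged change.
{code}
import java.util.{HashMap => JHashMap, Map => JMap}
import net.razorvine.pickle.Pickler

object PickleSketch {
  // rows and fieldNames stand in for the SchemaRDD internals referenced in the TODO.
  def pickleRows(rows: Iterator[Seq[Any]], fieldNames: Seq[String]): Iterator[Array[Byte]] = {
    val pickle = new Pickler
    rows.map { row =>
      val map: JMap[String, Any] = new JHashMap[String, Any]
      row.zip(fieldNames).foreach { case (obj, name) => map.put(name, obj) }
      // Pickle the map itself so Python receives a dict, skipping the ArrayList wrapper.
      pickle.dumps(map)
    }
  }
}
{code}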
[jira] [Commented] (SPARK-1944) Document --verbose in spark-shell -h
[ https://issues.apache.org/jira/browse/SPARK-1944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14025408#comment-14025408 ] Patrick Wendell commented on SPARK-1944: Accidental edit - my bad! Document --verbose in spark-shell -h Key: SPARK-1944 URL: https://issues.apache.org/jira/browse/SPARK-1944 Project: Spark Issue Type: Documentation Components: Spark Core Affects Versions: 1.0.0 Reporter: Andrew Ash Assignee: Andrew Ash Priority: Minor Fix For: 1.0.1, 1.1.0 The below help for spark-submit should make mention of the {{--verbose}} option {noformat} aash@aash-mbp ~/git/spark$ ./bin/spark-submit -h Usage: spark-submit [options] app jar [app options] Options: --master MASTER_URL spark://host:port, mesos://host:port, yarn, or local. --deploy-mode DEPLOY_MODE Mode to deploy the app in, either 'client' or 'cluster'. --class CLASS_NAME Name of your app's main class (required for Java apps). --arg ARG Argument to be passed to your application's main class. This option can be specified multiple times for multiple args. --name NAME The name of your application (Default: 'Spark'). --jars JARS A comma-separated list of local jars to include on the driver classpath and that SparkContext.addJar will work with. Doesn't work on standalone with 'cluster' deploy mode. --files FILES Comma separated list of files to be placed in the working dir of each executor. --properties-file FILE Path to a file from which to load extra properties. If not specified, this will look for conf/spark-defaults.conf. --driver-memory MEM Memory for driver (e.g. 1000M, 2G) (Default: 512M). --driver-java-options Extra Java options to pass to the driver --driver-library-path Extra library path entries to pass to the driver --driver-class-path Extra class path entries to pass to the driver. Note that jars added with --jars are automatically included in the classpath. --executor-memory MEM Memory per executor (e.g. 1000M, 2G) (Default: 1G). Spark standalone with cluster deploy mode only: --driver-cores NUM Cores for driver (Default: 1). --supervise If given, restarts the driver on failure. Spark standalone and Mesos only: --total-executor-cores NUM Total cores for all executors. YARN-only: --executor-cores NUMNumber of cores per executor (Default: 1). --queue QUEUE_NAME The YARN queue to submit to (Default: 'default'). --num-executors NUM Number of executors to (Default: 2). --archives ARCHIVES Comma separated list of archives to be extracted into the working dir of each executor. aash@aash-mbp ~/git/spark$ {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1944) Document --verbose in spark-shell -h
[ https://issues.apache.org/jira/browse/SPARK-1944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1944: --- Target Version/s: 1.0.1, 1.1.0 Fix Version/s: (was: 1.0.1) (was: 1.1.0) Document --verbose in spark-shell -h Key: SPARK-1944 URL: https://issues.apache.org/jira/browse/SPARK-1944 Project: Spark Issue Type: Documentation Components: Spark Core Affects Versions: 1.0.0 Reporter: Andrew Ash Assignee: Andrew Ash Priority: Minor Fix For: 1.0.1, 1.1.0 The below help for spark-submit should make mention of the {{--verbose}} option {noformat} aash@aash-mbp ~/git/spark$ ./bin/spark-submit -h Usage: spark-submit [options] app jar [app options] Options: --master MASTER_URL spark://host:port, mesos://host:port, yarn, or local. --deploy-mode DEPLOY_MODE Mode to deploy the app in, either 'client' or 'cluster'. --class CLASS_NAME Name of your app's main class (required for Java apps). --arg ARG Argument to be passed to your application's main class. This option can be specified multiple times for multiple args. --name NAME The name of your application (Default: 'Spark'). --jars JARS A comma-separated list of local jars to include on the driver classpath and that SparkContext.addJar will work with. Doesn't work on standalone with 'cluster' deploy mode. --files FILES Comma separated list of files to be placed in the working dir of each executor. --properties-file FILE Path to a file from which to load extra properties. If not specified, this will look for conf/spark-defaults.conf. --driver-memory MEM Memory for driver (e.g. 1000M, 2G) (Default: 512M). --driver-java-options Extra Java options to pass to the driver --driver-library-path Extra library path entries to pass to the driver --driver-class-path Extra class path entries to pass to the driver. Note that jars added with --jars are automatically included in the classpath. --executor-memory MEM Memory per executor (e.g. 1000M, 2G) (Default: 1G). Spark standalone with cluster deploy mode only: --driver-cores NUM Cores for driver (Default: 1). --supervise If given, restarts the driver on failure. Spark standalone and Mesos only: --total-executor-cores NUM Total cores for all executors. YARN-only: --executor-cores NUMNumber of cores per executor (Default: 1). --queue QUEUE_NAME The YARN queue to submit to (Default: 'default'). --num-executors NUM Number of executors to (Default: 2). --archives ARCHIVES Comma separated list of archives to be extracted into the working dir of each executor. aash@aash-mbp ~/git/spark$ {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2081) Undefine output() from the abstract class Command and implement it in concrete subclasses
Zongheng Yang created SPARK-2081: Summary: Undefine output() from the abstract class Command and implement it in concrete subclasses Key: SPARK-2081 URL: https://issues.apache.org/jira/browse/SPARK-2081 Project: Spark Issue Type: Improvement Reporter: Zongheng Yang Priority: Minor It doesn't make too much sense to have that method in the abstract class. Relevant discussions / cases where this issue comes up: https://github.com/apache/spark/pull/956 https://github.com/apache/spark/pull/1003 -- This message was sent by Atlassian JIRA (v6.2#6252)
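A rough sketch of what the refactor could look like (hypothetical; the class, type, and package names only approximate the 1.0-era Catalyst code, and the actual change is discussed in the linked PRs): the abstract class declares output without defining it, and each concrete command supplies the schema it actually returns.
{code}
// Hypothetical sketch: output() becomes abstract and is implemented per command.
import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeReference}
import org.apache.spark.sql.catalyst.plans.logical.LeafNode
import org.apache.spark.sql.catalyst.types.StringType

abstract class Command extends LeafNode {
  self: Product =>
  def output: Seq[Attribute]   // no longer given a default here
}

case class SetCommand(key: Option[String], value: Option[String]) extends Command {
  // A SET command returns key/value pairs, so it declares exactly that schema.
  override def output: Seq[Attribute] = Seq(
    AttributeReference("key", StringType, nullable = false)(),
    AttributeReference("value", StringType, nullable = false)())
}
{code}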
[jira] [Updated] (SPARK-2075) Hadoop1 distribution of 1.0.0 does not contain classes expected by the Maven 1.0.0 artifact
[ https://issues.apache.org/jira/browse/SPARK-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2075: --- Priority: Critical (was: Major) Hadoop1 distribution of 1.0.0 does not contain classes expected by the Maven 1.0.0 artifact --- Key: SPARK-2075 URL: https://issues.apache.org/jira/browse/SPARK-2075 Project: Spark Issue Type: Bug Components: Build, Spark Core Affects Versions: 1.0.0 Reporter: Paul R. Brown Priority: Critical Fix For: 1.0.1 Running a job built against the Maven dep for 1.0.0 and the hadoop1 distribution produces: {code} java.lang.ClassNotFoundException: org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1 {code} Here's what's in the Maven dep as of 1.0.0: {code} jar tvf ~/.m2/repository/org/apache/spark/spark-core_2.10/1.0.0/spark-core_2.10-1.0.0.jar | grep 'rdd/RDD' | grep 'saveAs' 1519 Mon May 26 13:57:58 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class 1560 Mon May 26 13:57:58 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class {code} And here's what's in the hadoop1 distribution: {code} jar tvf spark-assembly-1.0.0-hadoop1.0.4.jar| grep 'rdd/RDD' | grep 'saveAs' {code} I.e., it's not there. It is in the hadoop2 distribution: {code} jar tvf spark-assembly-1.0.0-hadoop2.2.0.jar| grep 'rdd/RDD' | grep 'saveAs' 1519 Mon May 26 07:29:54 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class 1560 Mon May 26 07:29:54 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2075) Hadoop1 distribution of 1.0.0 does not contain classes expected by the Maven 1.0.0 artifact
[ https://issues.apache.org/jira/browse/SPARK-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2075: --- Fix Version/s: 1.0.1 Hadoop1 distribution of 1.0.0 does not contain classes expected by the Maven 1.0.0 artifact --- Key: SPARK-2075 URL: https://issues.apache.org/jira/browse/SPARK-2075 Project: Spark Issue Type: Bug Components: Build, Spark Core Affects Versions: 1.0.0 Reporter: Paul R. Brown Priority: Critical Fix For: 1.0.1 Running a job built against the Maven dep for 1.0.0 and the hadoop1 distribution produces: {code} java.lang.ClassNotFoundException: org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1 {code} Here's what's in the Maven dep as of 1.0.0: {code} jar tvf ~/.m2/repository/org/apache/spark/spark-core_2.10/1.0.0/spark-core_2.10-1.0.0.jar | grep 'rdd/RDD' | grep 'saveAs' 1519 Mon May 26 13:57:58 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class 1560 Mon May 26 13:57:58 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class {code} And here's what's in the hadoop1 distribution: {code} jar tvf spark-assembly-1.0.0-hadoop1.0.4.jar| grep 'rdd/RDD' | grep 'saveAs' {code} I.e., it's not there. It is in the hadoop2 distribution: {code} jar tvf spark-assembly-1.0.0-hadoop2.2.0.jar| grep 'rdd/RDD' | grep 'saveAs' 1519 Mon May 26 07:29:54 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class 1560 Mon May 26 07:29:54 PDT 2014 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$2.class {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2081) Undefine output() from the abstract class Command and implement it in concrete subclasses
[ https://issues.apache.org/jira/browse/SPARK-2081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2081: Assignee: Zongheng Yang Undefine output() from the abstract class Command and implement it in concrete subclasses - Key: SPARK-2081 URL: https://issues.apache.org/jira/browse/SPARK-2081 Project: Spark Issue Type: Improvement Reporter: Zongheng Yang Assignee: Zongheng Yang Priority: Minor It doesn't make too much sense to have that method in the abstract class. Relevant discussions / cases where this issue comes up: https://github.com/apache/spark/pull/956 https://github.com/apache/spark/pull/1003 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2034) KafkaInputDStream doesn't close resources and may prevent JVM shutdown
[ https://issues.apache.org/jira/browse/SPARK-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2034: Assignee: (was: Kan Zhang) KafkaInputDStream doesn't close resources and may prevent JVM shutdown -- Key: SPARK-2034 URL: https://issues.apache.org/jira/browse/SPARK-2034 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.0.0 Reporter: Sean Owen Tobias noted today on the mailing list: {quote} I am trying to use Spark Streaming with Kafka, which works like a charm -- except for shutdown. When I run my program with sbt run-main, sbt will never exit, because there are two non-daemon threads left that don't die. I created a minimal example at https://gist.github.com/tgpfeiffer/b1e765064e983449c6b6#file-kafkadoesntshutdown-scala. It starts a StreamingContext and does nothing more than connecting to a Kafka server and printing what it receives. Using the `future { ... }` construct, I shut down the StreamingContext after some seconds and then print the difference between the threads at start time and at end time. The output can be found at https://gist.github.com/tgpfeiffer/b1e765064e983449c6b6#file-output1. There are a number of threads remaining that will prevent sbt from exiting. When I replace `KafkaUtils.createStream(...)` with a call that does exactly the same, except that it calls `consumerConnector.shutdown()` in `KafkaReceiver.onStop()` (which it should, IMO), the output is as shown at https://gist.github.com/tgpfeiffer/b1e765064e983449c6b6#file-output2. Does anyone have *any* idea what is going on here and why the program doesn't shut down properly? The behavior is the same with both kafka 0.8.0 and 0.8.1.1, by the way. {quote} Something similar was noted last year: http://mail-archives.apache.org/mod_mbox/spark-dev/201309.mbox/%3c1380220041.2428.yahoomail...@web160804.mail.bf1.yahoo.com%3E KafkaInputDStream doesn't close ConsumerConnector in onStop(), and does not close the Executor it creates. The latter leaves non-daemon threads and can prevent the JVM from shutting down even if streaming is closed properly. -- This message was sent by Atlassian JIRA (v6.2#6252)
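A minimal sketch of the fix Tobias suggests (member definitions inside the Kafka receiver; the field names are assumptions, not the actual patch): shut down the ConsumerConnector and the fetcher thread pool in onStop() so no non-daemon threads outlive the StreamingContext.
{code}
// Hypothetical sketch of KafkaReceiver cleanup; names are illustrative.
import java.util.concurrent.ExecutorService
import kafka.consumer.ConsumerConnector

private var consumerConnector: ConsumerConnector = _
private var executorPool: ExecutorService = _

def onStop() {
  if (consumerConnector != null) {
    consumerConnector.shutdown()   // closes Kafka fetcher threads and sockets
    consumerConnector = null
  }
  if (executorPool != null) {
    executorPool.shutdown()        // its non-daemon workers otherwise keep the JVM alive
    executorPool = null
  }
}
{code}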
[jira] [Commented] (SPARK-2044) Pluggable interface for shuffles
[ https://issues.apache.org/jira/browse/SPARK-2044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14025580#comment-14025580 ] Matei Zaharia commented on SPARK-2044: -- Hey Weihua, I'll look into the sorting flag; I initially envisioned that the shuffle manager would just tell the calling code whether the data is sorted (otherwise it sorts it by itself), but maybe it does make sense to push sorting into the interface. For the ranges on ShuffleReader, I think you misunderstood my meaning slightly. I don't *want* the reduction code (e.g. combineByKey or groupByKey) to even know that map tasks are running at different times. It should simply request its range of reduce partitions once, and then the shuffle *implementation* should see which maps are ready and start pulling from those. Note also that the partition range there is for reduce partitions (e.g. our job has 100 reduce partitions and we ask for partitions 2-5 because we decided to have just one reduce task for those). It's not for map IDs. Pluggable interface for shuffles Key: SPARK-2044 URL: https://issues.apache.org/jira/browse/SPARK-2044 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Reporter: Matei Zaharia Assignee: Matei Zaharia Attachments: Pluggableshuffleproposal.pdf Given that a lot of the current activity in Spark Core is in shuffles, I wanted to propose factoring out shuffle implementations in a way that will make experimentation easier. Ideally we will converge on one implementation, but for a while, this could also be used to have several implementations coexist. I'm suggesting this because I'm aware of at least three efforts to look at shuffle (from Yahoo!, Intel and Databricks). Some of the things people are investigating are: * Push-based shuffle where data moves directly from mappers to reducers * Sorting-based instead of hash-based shuffle, to create fewer files (helps a lot with file handles and memory usage on large shuffles) * External spilling within a key * Changing the level of parallelism or even algorithm for downstream stages at runtime based on statistics of the map output (this is a thing we had prototyped in the Shark research project but never merged in core) I've attached a design doc with a proposed interface. It's not too crazy because the interface between shuffles and the rest of the code is already pretty narrow (just some iterators for reading data and a writer interface for writing it). Bigger changes will be needed in the interaction with DAGScheduler and BlockManager for some of the ideas above, but we can handle those separately, and this interface will allow us to experiment with some short-term stuff sooner. If things go well I'd also like to send a sort-based shuffle implementation for 1.1, but we'll see how the timing on that works out. -- This message was sent by Atlassian JIRA (v6.2#6252)
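To make the "narrow interface" and the reduce-partition ranges concrete, here is a rough Scala sketch; the trait and method names are illustrative and are not the API from the attached design doc.
{code}
// Illustrative sketch of a pluggable shuffle interface; not the proposed API itself.
trait ShuffleHandle extends Serializable

trait ShuffleWriter[K, V] {
  def write(records: Iterator[Product2[K, V]]): Unit
  def stop(success: Boolean): Unit
}

trait ShuffleReader[K, C] {
  // Returns combined records for the reduce partitions this reader was created for.
  def read(): Iterator[Product2[K, C]]
}

trait ShuffleManager {
  def registerShuffle(shuffleId: Int, numMaps: Int): ShuffleHandle
  def getWriter[K, V](handle: ShuffleHandle, mapId: Int): ShuffleWriter[K, V]
  // The caller asks once for a range of *reduce* partitions (e.g. 2 to 5 out of 100);
  // the implementation decides which map outputs are ready and pulls from them.
  def getReader[K, C](handle: ShuffleHandle, startPartition: Int, endPartition: Int): ShuffleReader[K, C]
}
{code}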
[jira] [Created] (SPARK-2082) Stratified sampling implementation in PairRDDFunctions
Doris Xin created SPARK-2082: - Summary: Stratified sampling implementation in PairRDDFunctions Key: SPARK-2082 URL: https://issues.apache.org/jira/browse/SPARK-2082 Project: Spark Issue Type: New Feature Reporter: Doris Xin Implementation of stratified sampling that guarantees an exact sample size = sum(math.ceil(S_i * samplingRate)), where S_i is the size of each stratum. -- This message was sent by Atlassian JIRA (v6.2#6252)
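As a point of reference for the exact-size guarantee, a naive two-pass version can be written directly against the pair-RDD API. This is an illustrative sketch only (not the proposed PairRDDFunctions implementation); it collects each stratum with groupByKey, which would not scale to large strata.
{code}
// Naive illustration: exact stratified sample of ceil(S_i * samplingRate) per stratum.
import scala.reflect.ClassTag
import scala.util.Random
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

def exactStratifiedSample[K: ClassTag, V: ClassTag](
    rdd: RDD[(K, V)], samplingRate: Double, seed: Long): RDD[(K, V)] = {
  // Pass 1: stratum sizes S_i, then the exact per-stratum targets.
  val targets = rdd.countByKey().map { case (k, s) => (k, math.ceil(s * samplingRate).toInt) }
  // Pass 2: shuffle each stratum locally and keep the first ceil(S_i * rate) elements.
  rdd.groupByKey().flatMap { case (k, vs) =>
    val rng = new Random(seed ^ k.hashCode)
    rng.shuffle(vs.toSeq).take(targets(k)).map(v => (k, v))
  }
}
{code}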
[jira] [Resolved] (SPARK-1522) YARN ClientBase will throw a NPE if there is no YARN application specific classpath.
[ https://issues.apache.org/jira/browse/SPARK-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-1522. -- Resolution: Fixed Fix Version/s: 1.1.0 Assignee: Bernardo Gomez Palacio YARN ClientBase will throw a NPE if there is no YARN application specific classpath. Key: SPARK-1522 URL: https://issues.apache.org/jira/browse/SPARK-1522 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 0.9.0, 0.9.1, 1.0.0 Reporter: Bernardo Gomez Palacio Assignee: Bernardo Gomez Palacio Priority: Critical Labels: YARN Fix For: 1.1.0 The current implementation of ClientBase.getDefaultYarnApplicationClasspath inspects the MRJobConfig class for the field DEFAULT_YARN_APPLICATION_CLASSPATH when it should be really looking into YarnConfiguration. If the Application Configuration has no yarn.application.classpath defined a NPE exception will be thrown. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2083) Allow local task to retry after failure.
Peng Cheng created SPARK-2083: - Summary: Allow local task to retry after failure. Key: SPARK-2083 URL: https://issues.apache.org/jira/browse/SPARK-2083 Project: Spark Issue Type: Improvement Components: Deploy Affects Versions: 1.0.0 Reporter: Peng Cheng Priority: Trivial If a job is submitted to run locally using masterURL = local[X], Spark will not retry a failed task regardless of your spark.task.maxFailures setting. This design facilitates debugging and QA of Spark applications where all tasks are expected to succeed and yield a result. Unfortunately, such a setting prevents a local job from finishing if any of its tasks cannot guarantee a result (e.g. one that visits an external resource/API), and retrying inside the task is less favoured (e.g. the task needs to be executed on a different computer in production). Users can still set masterURL = local[X,Y] to override this (where Y is the local maxFailures), but that form is not documented and hard to manage. A quick fix would be to add a new configuration property spark.local.maxFailures with a default value of 1, so users know exactly what to change when reading the documentation. -- This message was sent by Atlassian JIRA (v6.2#6252)
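For reference, the undocumented form mentioned above looks like this (illustrative snippet; the second number is passed through as the local scheduler's per-task maxFailures):
{code}
// local[X,Y] runs X worker threads and sets the per-task maxFailures to Y,
// so a failed task can be retried even in local mode (plain local[X] allows no retries).
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("local-retry-example")
  .setMaster("local[4,3]")   // 4 threads; a task may fail up to 3 times before the job aborts
val sc = new SparkContext(conf)
{code}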
[jira] [Created] (SPARK-2084) Mention SPARK_JAR in env var section on configuration page
Sandy Ryza created SPARK-2084: - Summary: Mention SPARK_JAR in env var section on configuration page Key: SPARK-2084 URL: https://issues.apache.org/jira/browse/SPARK-2084 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 1.0.0 Reporter: Sandy Ryza -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2085) Apply user-specific regularization instead of uniform regularization in Alternating Least Squares (ALS)
[ https://issues.apache.org/jira/browse/SPARK-2085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shuo Xiang updated SPARK-2085: -- Description: The current implementation of ALS takes a single regularization parameter and apply it on both of the user factors and the product factors. This kind of regularization can be less effective while users number is significantly larger than the number of products (and vice versa). For example, if we have 10M users and 1K product, regularization on user factors will dominate. Following the discussion in [this thread](http://apache-spark-user-list.1001560.n3.nabble.com/possible-bug-in-Spark-s-ALS-implementation-tt2567.html#a2704), the implementation in this PR will regularize each factor vector by #ratings * lambda. Link to PR: https://github.com/apache/spark/pull/1026 was: The current implementation of ALS takes a single regularization parameter and apply it on both of the user factors and the product factors. This kind of regularization can be less effective while users number is significantly larger than the number of products (and vice versa). For example, if we have 10M users and 1K product, regularization on user factors will dominate. Following the discussion in [this thread](http://apache-spark-user-list.1001560.n3.nabble.com/possible-bug-in-Spark-s-ALS-implementation-tt2567.html#a2704), the implementation in this PR will regularize each factor vector by #ratings * lambda. Apply user-specific regularization instead of uniform regularization in Alternating Least Squares (ALS) --- Key: SPARK-2085 URL: https://issues.apache.org/jira/browse/SPARK-2085 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.0.0 Reporter: Shuo Xiang Priority: Minor The current implementation of ALS takes a single regularization parameter and apply it on both of the user factors and the product factors. This kind of regularization can be less effective while users number is significantly larger than the number of products (and vice versa). For example, if we have 10M users and 1K product, regularization on user factors will dominate. Following the discussion in [this thread](http://apache-spark-user-list.1001560.n3.nabble.com/possible-bug-in-Spark-s-ALS-implementation-tt2567.html#a2704), the implementation in this PR will regularize each factor vector by #ratings * lambda. Link to PR: https://github.com/apache/spark/pull/1026 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2085) Apply user-specific regularization instead of uniform regularization in Alternating Least Squares (ALS)
Shuo Xiang created SPARK-2085: - Summary: Apply user-specific regularization instead of uniform regularization in Alternating Least Squares (ALS) Key: SPARK-2085 URL: https://issues.apache.org/jira/browse/SPARK-2085 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.0.0 Reporter: Shuo Xiang Priority: Minor The current implementation of ALS takes a single regularization parameter and applies it to both the user factors and the product factors. This kind of regularization can be less effective when the number of users is significantly larger than the number of products (and vice versa). For example, if we have 10M users and 1K products, regularization on the user factors will dominate. Following the discussion in [this thread](http://apache-spark-user-list.1001560.n3.nabble.com/possible-bug-in-Spark-s-ALS-implementation-tt2567.html#a2704), the implementation in this PR will regularize each factor vector by #ratings * lambda. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2085) Apply user-specific regularization instead of uniform regularization in Alternating Least Squares (ALS)
[ https://issues.apache.org/jira/browse/SPARK-2085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shuo Xiang updated SPARK-2085: -- Description: The current implementation of ALS takes a single regularization parameter and apply it on both of the user factors and the product factors. This kind of regularization can be less effective while user number is significantly larger than the number of products (and vice versa). For example, if we have 10M users and 1K product, regularization on user factors will dominate. Following the discussion in [this thread](http://apache-spark-user-list.1001560.n3.nabble.com/possible-bug-in-Spark-s-ALS-implementation-tt2567.html#a2704), the implementation in this PR will regularize each factor vector by #ratings * lambda. Link to PR: https://github.com/apache/spark/pull/1026 was: The current implementation of ALS takes a single regularization parameter and apply it on both of the user factors and the product factors. This kind of regularization can be less effective while users number is significantly larger than the number of products (and vice versa). For example, if we have 10M users and 1K product, regularization on user factors will dominate. Following the discussion in [this thread](http://apache-spark-user-list.1001560.n3.nabble.com/possible-bug-in-Spark-s-ALS-implementation-tt2567.html#a2704), the implementation in this PR will regularize each factor vector by #ratings * lambda. Link to PR: https://github.com/apache/spark/pull/1026 Apply user-specific regularization instead of uniform regularization in Alternating Least Squares (ALS) --- Key: SPARK-2085 URL: https://issues.apache.org/jira/browse/SPARK-2085 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.0.0 Reporter: Shuo Xiang Priority: Minor The current implementation of ALS takes a single regularization parameter and apply it on both of the user factors and the product factors. This kind of regularization can be less effective while user number is significantly larger than the number of products (and vice versa). For example, if we have 10M users and 1K product, regularization on user factors will dominate. Following the discussion in [this thread](http://apache-spark-user-list.1001560.n3.nabble.com/possible-bug-in-Spark-s-ALS-implementation-tt2567.html#a2704), the implementation in this PR will regularize each factor vector by #ratings * lambda. Link to PR: https://github.com/apache/spark/pull/1026 -- This message was sent by Atlassian JIRA (v6.2#6252)
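The "#ratings * lambda" scheme is often called weighted-lambda regularization: in each least-squares solve, lambda is scaled by the number of ratings belonging to that user (or product). Purely as an illustrative sketch (not the code in the linked PR), the per-user solve could look like this with Breeze:
{code}
// Illustrative only: weighted-lambda normal equations, (Y^T Y + n_u * lambda * I) x_u = Y^T r_u.
import breeze.linalg.{DenseMatrix, DenseVector}

def solveUserFactor(
    itemFactors: DenseMatrix[Double],  // n_u x k: factors of the items this user rated
    ratings: DenseVector[Double],      // length n_u: this user's ratings
    lambda: Double): DenseVector[Double] = {
  val k = itemFactors.cols
  val nu = itemFactors.rows.toDouble   // number of ratings for this user
  val a = itemFactors.t * itemFactors + DenseMatrix.eye[Double](k) * (lambda * nu)
  val b = itemFactors.t * ratings
  a \ b                                // k-dimensional user factor vector
}
{code}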
[jira] [Commented] (SPARK-1493) Apache RAT excludes don't work with file path (instead of file name)
[ https://issues.apache.org/jira/browse/SPARK-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14025895#comment-14025895 ] Erik Erlandson commented on SPARK-1493: --- RAT itself appears to preclude exclusion using a /path/to/file.ext regex because it traverses the directory tree and applies its exclusion filter only to individual file names. The filter never sees an entire path path/to/file.ext, only path, to, and file.ext https://github.com/apache/rat/blob/incubator-site-import/rat/rat-core/src/main/java/org/apache/rat/DirectoryWalker.java#L127 Either RAT needs a new filtering feature that can see an entire path, or the report it generates has to be filtered post-hoc. Apache RAT excludes don't work with file path (instead of file name) Key: SPARK-1493 URL: https://issues.apache.org/jira/browse/SPARK-1493 Project: Spark Issue Type: Bug Components: Project Infra Reporter: Patrick Wendell Labels: starter Fix For: 1.1.0 Right now the way we do RAT checks, it doesn't work if you try to exclude: /path/to/file.ext you have to just exclude file.ext -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-1493) Apache RAT excludes don't work with file path (instead of file name)
[ https://issues.apache.org/jira/browse/SPARK-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14025895#comment-14025895 ] Erik Erlandson edited comment on SPARK-1493 at 6/9/14 11:13 PM: RAT itself appears to preclude exclusion using a /path/to/file.ext regex because it traverses the directory tree and applies its exclusion filter only to individual file names. The filter never sees an entire path path/to/file.ext, only path, to, and file.ext https://github.com/apache/rat/blob/incubator-site-import/rat/rat-core/src/main/java/org/apache/rat/DirectoryWalker.java#L127 Either RAT needs a new filtering feature that can see an entire path, or the report it generates has to be filtered post-hoc. Filed an RFE against RAT: RAT-161 was (Author: eje): RAT itself appears to preclude exclusion using a /path/to/file.ext regex because it traverses the directory tree and applies its exclusion filter only to individual file names. The filter never sees an entire path path/to/file.ext, only path, to, and file.ext https://github.com/apache/rat/blob/incubator-site-import/rat/rat-core/src/main/java/org/apache/rat/DirectoryWalker.java#L127 Either RAT needs a new filtering feature that can see an entire path, or the report it generates has to be filtered post-hoc. Apache RAT excludes don't work with file path (instead of file name) Key: SPARK-1493 URL: https://issues.apache.org/jira/browse/SPARK-1493 Project: Spark Issue Type: Bug Components: Project Infra Reporter: Patrick Wendell Labels: starter Fix For: 1.1.0 Right now the way we do RAT checks, it doesn't work if you try to exclude: /path/to/file.ext you have to just exclude file.ext -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1704) Support EXPLAIN in Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-1704: Assignee: Zongheng Yang (was: Michael Armbrust) Support EXPLAIN in Spark SQL Key: SPARK-1704 URL: https://issues.apache.org/jira/browse/SPARK-1704 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Environment: linux Reporter: Yangjp Assignee: Zongheng Yang Labels: sql Fix For: 1.0.1, 1.1.0 Original Estimate: 612h Remaining Estimate: 612h 14/05/03 22:08:40 INFO ParseDriver: Parsing command: explain select * from src 14/05/03 22:08:40 INFO ParseDriver: Parse Completed 14/05/03 22:08:40 WARN LoggingFilter: EXCEPTION : java.lang.AssertionError: assertion failed: No plan for ExplainCommand (Project [*]) at scala.Predef$.assert(Predef.scala:179) at org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:263) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:263) at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:264) at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:264) at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:260) at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:248) at org.apache.spark.sql.hive.api.java.JavaHiveContext.hql(JavaHiveContext.scala:39) at org.apache.spark.examples.TimeServerHandler.messageReceived(TimeServerHandler.java:72) at org.apache.mina.core.filterchain.DefaultIoFilterChain$TailFilter.messageReceived(DefaultIoFilterChain.java:690) at org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:417) at org.apache.mina.core.filterchain.DefaultIoFilterChain.access$1200(DefaultIoFilterChain.java:47) at org.apache.mina.core.filterchain.DefaultIoFilterChain$EntryImpl$1.messageReceived(DefaultIoFilterChain.java:765) at org.apache.mina.filter.codec.ProtocolCodecFilter$ProtocolDecoderOutputImpl.flush(ProtocolCodecFilter.java:407) at org.apache.mina.filter.codec.ProtocolCodecFilter.messageReceived(ProtocolCodecFilter.java:236) at org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:417) at org.apache.mina.core.filterchain.DefaultIoFilterChain.access$1200(DefaultIoFilterChain.java:47) at org.apache.mina.core.filterchain.DefaultIoFilterChain$EntryImpl$1.messageReceived(DefaultIoFilterChain.java:765) at org.apache.mina.filter.logging.LoggingFilter.messageReceived(LoggingFilter.java:208) at org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:417) at org.apache.mina.core.filterchain.DefaultIoFilterChain.access$1200(DefaultIoFilterChain.java:47) at org.apache.mina.core.filterchain.DefaultIoFilterChain$EntryImpl$1.messageReceived(DefaultIoFilterChain.java:765) at org.apache.mina.core.filterchain.IoFilterAdapter.messageReceived(IoFilterAdapter.java:109) at org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:417) at org.apache.mina.core.filterchain.DefaultIoFilterChain.fireMessageReceived(DefaultIoFilterChain.java:410) at org.apache.mina.core.polling.AbstractPollingIoProcessor.read(AbstractPollingIoProcessor.java:710) at org.apache.mina.core.polling.AbstractPollingIoProcessor.process(AbstractPollingIoProcessor.java:664) at 
org.apache.mina.core.polling.AbstractPollingIoProcessor.process(AbstractPollingIoProcessor.java:653) at org.apache.mina.core.polling.AbstractPollingIoProcessor.access$600(AbstractPollingIoProcessor.java:67) at org.apache.mina.core.polling.AbstractPollingIoProcessor$Processor.run(AbstractPollingIoProcessor.java:1124) at org.apache.mina.util.NamePreservingRunnable.run(NamePreservingRunnable.java:64) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:701) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (SPARK-1704) Support EXPLAIN in Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust reassigned SPARK-1704: --- Assignee: Michael Armbrust Support EXPLAIN in Spark SQL Key: SPARK-1704 URL: https://issues.apache.org/jira/browse/SPARK-1704 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Environment: linux Reporter: Yangjp Assignee: Michael Armbrust Labels: sql Fix For: 1.0.1, 1.1.0 Original Estimate: 612h Remaining Estimate: 612h 14/05/03 22:08:40 INFO ParseDriver: Parsing command: explain select * from src 14/05/03 22:08:40 INFO ParseDriver: Parse Completed 14/05/03 22:08:40 WARN LoggingFilter: EXCEPTION : java.lang.AssertionError: assertion failed: No plan for ExplainCommand (Project [*]) at scala.Predef$.assert(Predef.scala:179) at org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:263) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:263) at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:264) at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:264) at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:260) at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:248) at org.apache.spark.sql.hive.api.java.JavaHiveContext.hql(JavaHiveContext.scala:39) at org.apache.spark.examples.TimeServerHandler.messageReceived(TimeServerHandler.java:72) at org.apache.mina.core.filterchain.DefaultIoFilterChain$TailFilter.messageReceived(DefaultIoFilterChain.java:690) at org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:417) at org.apache.mina.core.filterchain.DefaultIoFilterChain.access$1200(DefaultIoFilterChain.java:47) at org.apache.mina.core.filterchain.DefaultIoFilterChain$EntryImpl$1.messageReceived(DefaultIoFilterChain.java:765) at org.apache.mina.filter.codec.ProtocolCodecFilter$ProtocolDecoderOutputImpl.flush(ProtocolCodecFilter.java:407) at org.apache.mina.filter.codec.ProtocolCodecFilter.messageReceived(ProtocolCodecFilter.java:236) at org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:417) at org.apache.mina.core.filterchain.DefaultIoFilterChain.access$1200(DefaultIoFilterChain.java:47) at org.apache.mina.core.filterchain.DefaultIoFilterChain$EntryImpl$1.messageReceived(DefaultIoFilterChain.java:765) at org.apache.mina.filter.logging.LoggingFilter.messageReceived(LoggingFilter.java:208) at org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:417) at org.apache.mina.core.filterchain.DefaultIoFilterChain.access$1200(DefaultIoFilterChain.java:47) at org.apache.mina.core.filterchain.DefaultIoFilterChain$EntryImpl$1.messageReceived(DefaultIoFilterChain.java:765) at org.apache.mina.core.filterchain.IoFilterAdapter.messageReceived(IoFilterAdapter.java:109) at org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:417) at org.apache.mina.core.filterchain.DefaultIoFilterChain.fireMessageReceived(DefaultIoFilterChain.java:410) at org.apache.mina.core.polling.AbstractPollingIoProcessor.read(AbstractPollingIoProcessor.java:710) at org.apache.mina.core.polling.AbstractPollingIoProcessor.process(AbstractPollingIoProcessor.java:664) at 
org.apache.mina.core.polling.AbstractPollingIoProcessor.process(AbstractPollingIoProcessor.java:653) at org.apache.mina.core.polling.AbstractPollingIoProcessor.access$600(AbstractPollingIoProcessor.java:67) at org.apache.mina.core.polling.AbstractPollingIoProcessor$Processor.run(AbstractPollingIoProcessor.java:1124) at org.apache.mina.util.NamePreservingRunnable.run(NamePreservingRunnable.java:64) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:701) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1704) Support EXPLAIN in Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-1704. - Resolution: Fixed Fix Version/s: 1.0.1 Support EXPLAIN in Spark SQL Key: SPARK-1704 URL: https://issues.apache.org/jira/browse/SPARK-1704 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Environment: linux Reporter: Yangjp Labels: sql Fix For: 1.0.1, 1.1.0 Original Estimate: 612h Remaining Estimate: 612h 14/05/03 22:08:40 INFO ParseDriver: Parsing command: explain select * from src 14/05/03 22:08:40 INFO ParseDriver: Parse Completed 14/05/03 22:08:40 WARN LoggingFilter: EXCEPTION : java.lang.AssertionError: assertion failed: No plan for ExplainCommand (Project [*]) at scala.Predef$.assert(Predef.scala:179) at org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:263) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:263) at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:264) at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:264) at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:260) at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:248) at org.apache.spark.sql.hive.api.java.JavaHiveContext.hql(JavaHiveContext.scala:39) at org.apache.spark.examples.TimeServerHandler.messageReceived(TimeServerHandler.java:72) at org.apache.mina.core.filterchain.DefaultIoFilterChain$TailFilter.messageReceived(DefaultIoFilterChain.java:690) at org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:417) at org.apache.mina.core.filterchain.DefaultIoFilterChain.access$1200(DefaultIoFilterChain.java:47) at org.apache.mina.core.filterchain.DefaultIoFilterChain$EntryImpl$1.messageReceived(DefaultIoFilterChain.java:765) at org.apache.mina.filter.codec.ProtocolCodecFilter$ProtocolDecoderOutputImpl.flush(ProtocolCodecFilter.java:407) at org.apache.mina.filter.codec.ProtocolCodecFilter.messageReceived(ProtocolCodecFilter.java:236) at org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:417) at org.apache.mina.core.filterchain.DefaultIoFilterChain.access$1200(DefaultIoFilterChain.java:47) at org.apache.mina.core.filterchain.DefaultIoFilterChain$EntryImpl$1.messageReceived(DefaultIoFilterChain.java:765) at org.apache.mina.filter.logging.LoggingFilter.messageReceived(LoggingFilter.java:208) at org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:417) at org.apache.mina.core.filterchain.DefaultIoFilterChain.access$1200(DefaultIoFilterChain.java:47) at org.apache.mina.core.filterchain.DefaultIoFilterChain$EntryImpl$1.messageReceived(DefaultIoFilterChain.java:765) at org.apache.mina.core.filterchain.IoFilterAdapter.messageReceived(IoFilterAdapter.java:109) at org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:417) at org.apache.mina.core.filterchain.DefaultIoFilterChain.fireMessageReceived(DefaultIoFilterChain.java:410) at org.apache.mina.core.polling.AbstractPollingIoProcessor.read(AbstractPollingIoProcessor.java:710) at org.apache.mina.core.polling.AbstractPollingIoProcessor.process(AbstractPollingIoProcessor.java:664) at 
org.apache.mina.core.polling.AbstractPollingIoProcessor.process(AbstractPollingIoProcessor.java:653) at org.apache.mina.core.polling.AbstractPollingIoProcessor.access$600(AbstractPollingIoProcessor.java:67) at org.apache.mina.core.polling.AbstractPollingIoProcessor$Processor.run(AbstractPollingIoProcessor.java:1124) at org.apache.mina.util.NamePreservingRunnable.run(NamePreservingRunnable.java:64) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:701) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1493) Apache RAT excludes don't work with file path (instead of file name)
[ https://issues.apache.org/jira/browse/SPARK-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1493: --- Fix Version/s: (was: 1.1.0) Apache RAT excludes don't work with file path (instead of file name) Key: SPARK-1493 URL: https://issues.apache.org/jira/browse/SPARK-1493 Project: Spark Issue Type: Bug Components: Project Infra Reporter: Patrick Wendell Labels: starter Right now the way we do RAT checks, it doesn't work if you try to exclude: /path/to/file.ext you have to just exclude file.ext -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1493) Apache RAT excludes don't work with file path (instead of file name)
[ https://issues.apache.org/jira/browse/SPARK-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14025954#comment-14025954 ] Patrick Wendell commented on SPARK-1493: Thanks for looking into this, Erik. It seems like maybe there isn't a good way to do this unless we want to implement filtering post-hoc (and it might be tricky to support e.g. globbing in that case). Apache RAT excludes don't work with file path (instead of file name) Key: SPARK-1493 URL: https://issues.apache.org/jira/browse/SPARK-1493 Project: Spark Issue Type: Bug Components: Project Infra Reporter: Patrick Wendell Labels: starter Right now the way we do RAT checks, it doesn't work if you try to exclude: /path/to/file.ext you have to just exclude file.ext -- This message was sent by Atlassian JIRA (v6.2#6252)
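If the post-hoc route were taken, the filtering could be as simple as reading RAT's plain-text report and dropping unapproved-license entries whose full paths match exclude patterns. The following is an illustrative sketch with an assumed report file name and marker format, not an existing script.
{code}
// Illustrative sketch: filter RAT's unapproved-license lines by full-path regexes.
import scala.io.Source

val excludes = Seq(".*/path/to/file\\.ext$".r)          // hypothetical exclude patterns

val unapproved = Source.fromFile("rat-report.txt")      // assumed report file name
  .getLines()
  .filter(_.contains("!?????"))                         // lines RAT flags as unknown license
  .map(_.split("\\s+").last)                            // the path is the last column
  .filterNot(p => excludes.exists(_.findFirstIn(p).isDefined))
  .toList

unapproved.foreach(println)
{code}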
[jira] [Updated] (SPARK-2086) Improve output of toDebugString to make shuffle boundaries more clear
[ https://issues.apache.org/jira/browse/SPARK-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2086: --- Description: It would be nice if the toDebugString method of an RDD did a better job of explaining where shuffle boundaries occur in the lineage graph. One way to do this would be to only indent the tree at a shuffle boundary instead of indenting it for every parent. We can determine when a shuffle boundary occurs based on the type of dependency seen in the RDD. was:It would be nice if the toDebugString method of an RDD did a better job of explaining where shuffle boundaries occur in the lineage graph. One way to do this would be to only indent the tree at a shuffle boundary instead of indenting it for every parent. Improve output of toDebugString to make shuffle boundaries more clear - Key: SPARK-2086 URL: https://issues.apache.org/jira/browse/SPARK-2086 Project: Spark Issue Type: Improvement Reporter: Patrick Wendell Assignee: Gregory Owen Priority: Minor It would be nice if the toDebugString method of an RDD did a better job of explaining where shuffle boundaries occur in the lineage graph. One way to do this would be to only indent the tree at a shuffle boundary instead of indenting it for every parent. We can determine when a shuffle boundary occurs based on the type of dependency seen in the RDD. -- This message was sent by Atlassian JIRA (v6.2#6252)
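Purely as a hypothetical illustration of the proposal (not actual Spark output), indenting at every parent turns a lineage like this:
{noformat}
MappedRDD[4]
  ShuffledRDD[3]
    MappedRDD[2]
      FilteredRDD[1]
        HadoopRDD[0]
{noformat}
whereas indenting only where a shuffle dependency occurs would keep narrow dependencies at the same level:
{noformat}
MappedRDD[4]
ShuffledRDD[3]
  MappedRDD[2]
  FilteredRDD[1]
  HadoopRDD[0]
{noformat}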
[jira] [Created] (SPARK-2088) NPE in toString when creationSiteInfo is null after deserialization
Doris Xin created SPARK-2088: Summary: NPE in toString when creationSiteInfo is null after deserialization Key: SPARK-2088 URL: https://issues.apache.org/jira/browse/SPARK-2088 Project: Spark Issue Type: Bug Reporter: Doris Xin After deserialization, the transient field creationSiteInfo does not get backfilled with the default value, but the toString method, which is invoked by the serializer, expects the field to always be non-null. The following issue is encountered during serialization: java.lang.NullPointerException at org.apache.spark.rdd.RDD.getCreationSite(RDD.scala:1198) at org.apache.spark.rdd.RDD.toString(RDD.scala:1263) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1418) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) at scala.collection.immutable.$colon$colon.writeObject(List.scala:379) at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:988) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1495) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at 
java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42) at org.apache.spark.scheduler.ResultTask$.serializeInfo(ResultTask.scala:46) at org.apache.spark.scheduler.ResultTask.writeExternal(ResultTask.scala:125) at java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1458) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1429) at
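A minimal sketch of a guard (illustrative only; the field and method names are taken from the issue title and the stack trace above, and the actual patch may differ): make getCreationSite tolerate a null creationSiteInfo after Java deserialization instead of dereferencing the transient field directly.
{code}
// Hypothetical sketch inside RDD: never assume the transient field survived deserialization.
@transient private[spark] val creationSiteInfo = Utils.getCallSiteInfo

private[spark] def getCreationSite: String =
  Option(creationSiteInfo).map(_.toString).getOrElse("")   // empty string instead of an NPE
{code}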
[jira] [Commented] (SPARK-1305) Support persisting RDD's directly to Tachyon
[ https://issues.apache.org/jira/browse/SPARK-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14026037#comment-14026037 ] Henry Saputra commented on SPARK-1305: -- Sorry to comment on an old JIRA, but does anyone have a PR for this ticket? Support persisting RDD's directly to Tachyon Key: SPARK-1305 URL: https://issues.apache.org/jira/browse/SPARK-1305 Project: Spark Issue Type: New Feature Components: Block Manager Reporter: Patrick Wendell Assignee: Haoyuan Li Priority: Blocker Fix For: 1.0.0 This is already an ongoing pull request - in a nutshell we want to support Tachyon as a storage level in Spark. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1305) Support persisting RDD's directly to Tachyon
[ https://issues.apache.org/jira/browse/SPARK-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14026038#comment-14026038 ] Henry Saputra commented on SPARK-1305: -- Never mind, found it. It was from when Spark was in the incubator. Support persisting RDD's directly to Tachyon Key: SPARK-1305 URL: https://issues.apache.org/jira/browse/SPARK-1305 Project: Spark Issue Type: New Feature Components: Block Manager Reporter: Patrick Wendell Assignee: Haoyuan Li Priority: Blocker Fix For: 1.0.0 This is already an ongoing pull request - in a nutshell we want to support Tachyon as a storage level in Spark. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2071) Package private classes that are deleted from an older version of Spark trigger errors
[ https://issues.apache.org/jira/browse/SPARK-2071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14026045#comment-14026045 ] Patrick Wendell commented on SPARK-2071: Yes, we could use sbt to retrieve them and place them in lib_managed or something similar. Package private classes that are deleted from an older version of Spark trigger errors -- Key: SPARK-2071 URL: https://issues.apache.org/jira/browse/SPARK-2071 Project: Spark Issue Type: Sub-task Components: Build Reporter: Patrick Wendell Assignee: Prashant Sharma Fix For: 1.1.0 We should figure out how to fix this. One idea is to run the MIMA exclude generator with sbt itself (rather than ./spark-class) so it can run against the older versions of Spark and make sure to exclude classes that are marked as package private in that version as well. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2088) NPE in toString when creationSiteInfo is null after deserialization
[ https://issues.apache.org/jira/browse/SPARK-2088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2088: --- Assignee: Doris Xin NPE in toString when creationSiteInfo is null after deserialization --- Key: SPARK-2088 URL: https://issues.apache.org/jira/browse/SPARK-2088 Project: Spark Issue Type: Bug Affects Versions: 1.0.0 Reporter: Doris Xin Assignee: Doris Xin After deserialization, the transient field creationSiteInfo does not get backfilled with the default value, but the toString method, which is invoked by the serializer, expects the field to always be non-null. The following issue is encountered during serialization: java.lang.NullPointerException at org.apache.spark.rdd.RDD.getCreationSite(RDD.scala:1198) at org.apache.spark.rdd.RDD.toString(RDD.scala:1263) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1418) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) at scala.collection.immutable.$colon$colon.writeObject(List.scala:379) at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:988) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1495) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42) at org.apache.spark.scheduler.ResultTask$.serializeInfo(ResultTask.scala:46) at
[jira] [Updated] (SPARK-2088) NPE in toString when creationSiteInfo is null after deserialization
[ https://issues.apache.org/jira/browse/SPARK-2088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2088: --- Target Version/s: 1.0.0, 1.0.1 NPE in toString when creationSiteInfo is null after deserialization --- Key: SPARK-2088 URL: https://issues.apache.org/jira/browse/SPARK-2088 Project: Spark Issue Type: Bug Affects Versions: 1.0.0 Reporter: Doris Xin Assignee: Doris Xin After deserialization, the transient field creationSiteInfo does not get backfilled with the default value, but the toString method, which is invoked by the serializer, expects the field to always be non-null. The following issue is encountered during serialization: java.lang.NullPointerException at org.apache.spark.rdd.RDD.getCreationSite(RDD.scala:1198) at org.apache.spark.rdd.RDD.toString(RDD.scala:1263) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1418) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) at scala.collection.immutable.$colon$colon.writeObject(List.scala:379) at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:988) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1495) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42) at org.apache.spark.scheduler.ResultTask$.serializeInfo(ResultTask.scala:46) at
[jira] [Updated] (SPARK-2088) NPE in toString when creationSiteInfo is null after deserialization
[ https://issues.apache.org/jira/browse/SPARK-2088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2088: --- Affects Version/s: 1.0.0 NPE in toString when creationSiteInfo is null after deserialization --- Key: SPARK-2088 URL: https://issues.apache.org/jira/browse/SPARK-2088 Project: Spark Issue Type: Bug Affects Versions: 1.0.0 Reporter: Doris Xin Assignee: Doris Xin After deserialization, the transient field creationSiteInfo does not get backfilled with the default value, but the toString method, which is invoked by the serializer, expects the field to always be non-null. The following issue is encountered during serialization: java.lang.NullPointerException at org.apache.spark.rdd.RDD.getCreationSite(RDD.scala:1198) at org.apache.spark.rdd.RDD.toString(RDD.scala:1263) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1418) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) at scala.collection.immutable.$colon$colon.writeObject(List.scala:379) at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:988) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1495) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42) at org.apache.spark.scheduler.ResultTask$.serializeInfo(ResultTask.scala:46) at
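For context on the failure mode reported in SPARK-2088 above: Java serialization skips @transient fields and restores them to null rather than re-running their initializers, so any method that assumes such a field is non-null (as toString does with creationSiteInfo) can throw after a round trip. The sketch below is a hypothetical, self-contained illustration of that behavior and of a null-tolerant toString; it is not the RDD code itself.

```scala
// Minimal sketch (hypothetical class, not the RDD code): Java serialization leaves a
// @transient field at null after deserialization, so toString must guard against it.
import java.io._

class Node extends Serializable {
  @transient val creationSite: String = "constructed here" // skipped during serialization

  // Defensive toString that tolerates the null left behind by deserialization.
  override def toString: String =
    s"Node(${Option(creationSite).getOrElse("<unknown creation site>")})"
}

object TransientDemo extends App {
  // Serialize an instance...
  val buf = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(buf)
  out.writeObject(new Node)
  out.close()

  // ...then deserialize it: creationSite is now null, not "constructed here".
  val in = new ObjectInputStream(new ByteArrayInputStream(buf.toByteArray))
  val copy = in.readObject().asInstanceOf[Node]
  println(copy) // prints the guarded fallback rather than throwing an NPE
}
```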
[jira] [Commented] (SPARK-2000) cannot connect to cluster in Standalone mode when running spark-shell on one of the cluster nodes without specifying a master
[ https://issues.apache.org/jira/browse/SPARK-2000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026052#comment-14026052 ] Chen Chao commented on SPARK-2000: -- Hi Patrick, I just thought it was the same problem as https://issues.apache.org/jira/browse/SPARK-1028. Anyway, if you think it is not necessary, please close the issue :) cannot connect to cluster in Standalone mode when running spark-shell on one of the cluster nodes without specifying a master --- Key: SPARK-2000 URL: https://issues.apache.org/jira/browse/SPARK-2000 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.0.0 Reporter: Chen Chao Assignee: Chen Chao Labels: shell Cannot connect to the cluster in Standalone mode when running spark-shell on one of the cluster nodes. -- This message was sent by Atlassian JIRA (v6.2#6252)
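As background for the report above: in standalone mode the shell only attaches to the cluster when the master URL is supplied explicitly (for spark-shell that means passing a master URL at launch; in application code it can be set on the configuration). The snippet below is a minimal sketch of the latter, with a placeholder host and port rather than anything taken from this ticket.

```scala
// Minimal sketch, not from the ticket: explicitly naming the standalone master instead of
// relying on a default. "master-host:7077" is a placeholder for the real master URL.
import org.apache.spark.{SparkConf, SparkContext}

object ExplicitMasterExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("spark://master-host:7077")
      .setAppName("explicit-master-example")
    val sc = new SparkContext(conf)
    println(sc.parallelize(1 to 10).count()) // trivial job to confirm the connection works
    sc.stop()
  }
}
```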
[jira] [Commented] (SPARK-1998) SparkFlumeEvent with body bigger than 1020 bytes is not read properly
[ https://issues.apache.org/jira/browse/SPARK-1998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026053#comment-14026053 ] sunshangchun commented on SPARK-1998: - I've opened a pull request here (https://github.com/apache/spark/pull/951). Could anyone review it and resolve this issue? SparkFlumeEvent with body bigger than 1020 bytes is not read properly -- Key: SPARK-1998 URL: https://issues.apache.org/jira/browse/SPARK-1998 Project: Spark Issue Type: Bug Reporter: sun.sam Attachments: patch.diff -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2089) With YARN, preferredNodeLocalityData isn't honored
Sandy Ryza created SPARK-2089: - Summary: With YARN, preferredNodeLocalityData isn't honored Key: SPARK-2089 URL: https://issues.apache.org/jira/browse/SPARK-2089 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.0.0 Reporter: Sandy Ryza When running in YARN cluster mode, apps can pass preferred locality data when constructing a Spark context that will dictate where to request executor containers. This is currently broken because of a race condition. The Spark-YARN code runs the user class and waits for it to start up a SparkContext. During its initialization, the SparkContext will create a YarnClusterScheduler, which notifies a monitor in the Spark-YARN code that the SparkContext is ready. The Spark-YARN code then immediately fetches the preferredNodeLocationData from the SparkContext and uses it to start requesting containers. But in the SparkContext constructor that takes preferredNodeLocationData, that field is set after the rest of the initialization, so if the Spark-YARN code comes around quickly enough after being notified, the data it fetches is the empty, unset version. This occurred during all of my runs. -- This message was sent by Atlassian JIRA (v6.2#6252)
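The race described above boils down to an ordering problem: the readiness notification fires during SparkContext initialization, before preferredNodeLocationData has been assigned. The sketch below reproduces that ordering with invented class names (it is not the actual SparkContext or Spark-YARN code); a monitor thread woken by the signal reads the field before the constructor has set it.

```scala
// Illustrative sketch of the race: the "ready" signal fires before the locality field is
// assigned, so a fast waiter observes the empty default. Names are stand-ins only.
import java.util.concurrent.CountDownLatch

object LocalityRaceSketch {
  val ready = new CountDownLatch(1)
  @volatile var ctx: UserContext = _

  class UserContext(prefs: Map[String, Set[String]]) {
    var preferredNodeLocationData: Map[String, Set[String]] = Map.empty
    ctx = this
    ready.countDown()                 // scheduler creation notifies the monitor here...
    Thread.sleep(50)                  // ...while the constructor is still running
    preferredNodeLocationData = prefs // ...and the real value is only assigned now
  }

  def main(args: Array[String]): Unit = {
    val monitor = new Thread(new Runnable {
      def run(): Unit = {
        ready.await()
        // Usually prints Map() -- the empty, unset version of the locality data.
        println("fetched locality data: " + ctx.preferredNodeLocationData)
      }
    })
    monitor.start()
    new UserContext(Map("host1" -> Set("rack1")))
    monitor.join()
  }
}
```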
[jira] [Commented] (SPARK-2044) Pluggable interface for shuffles
[ https://issues.apache.org/jira/browse/SPARK-2044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026109#comment-14026109 ] Weihua Jiang commented on SPARK-2044: - Hi Matei, Thanks for the reply. I am glad that you think pushing sorting into the interface is useful. Yes, you are right. I misunderstood the partition id and map id. For the partition id range, I am totally OK with it. Pluggable interface for shuffles Key: SPARK-2044 URL: https://issues.apache.org/jira/browse/SPARK-2044 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Reporter: Matei Zaharia Assignee: Matei Zaharia Attachments: Pluggableshuffleproposal.pdf Given that a lot of the current activity in Spark Core is in shuffles, I wanted to propose factoring out shuffle implementations in a way that will make experimentation easier. Ideally we will converge on one implementation, but for a while, this could also be used to have several implementations coexist. I'm suggesting this because I'm aware of at least three efforts to look at shuffle (from Yahoo!, Intel and Databricks). Some of the things people are investigating are: * Push-based shuffle where data moves directly from mappers to reducers * Sorting-based instead of hash-based shuffle, to create fewer files (helps a lot with file handles and memory usage on large shuffles) * External spilling within a key * Changing the level of parallelism or even the algorithm for downstream stages at runtime based on statistics of the map output (this is a thing we had prototyped in the Shark research project but never merged into core) I've attached a design doc with a proposed interface. It's not too crazy because the interface between shuffles and the rest of the code is already pretty narrow (just some iterators for reading data and a writer interface for writing it). Bigger changes will be needed in the interaction with DAGScheduler and BlockManager for some of the ideas above, but we can handle those separately, and this interface will allow us to experiment with some short-term stuff sooner. If things go well I'd also like to send a sort-based shuffle implementation for 1.1, but we'll see how the timing on that works out. -- This message was sent by Atlassian JIRA (v6.2#6252)
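To make the "pretty narrow interface" remark above concrete: the shuffle boundary is essentially a writer that map tasks push records into and a reader that hands reduce tasks an iterator back. The traits below are only a rough sketch of how that boundary could be factored out for pluggability; the names and signatures are invented here for illustration and are not taken from the attached design doc.

```scala
// Rough sketch of a narrow, pluggable shuffle boundary. Invented names, not the proposal.
trait ShuffleWriter[K, V] {
  /** Write one map task's output records for this shuffle. */
  def write(records: Iterator[Product2[K, V]]): Unit
  /** Close the writer, committing its output only if the task succeeded. */
  def stop(success: Boolean): Unit
}

trait ShuffleReader[K, C] {
  /** Read the (possibly combined) values for the range of reduce partitions this task owns. */
  def read(): Iterator[Product2[K, C]]
}

trait ShuffleManager {
  def getWriter[K, V](shuffleId: Int, mapId: Int): ShuffleWriter[K, V]
  def getReader[K, C](shuffleId: Int, startPartition: Int, endPartition: Int): ShuffleReader[K, C]
}
```

A hash-based and a sort-based implementation could then coexist behind ShuffleManager while the rest of the scheduler code stays unchanged, which is the experimentation the proposal is after.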
[jira] [Resolved] (SPARK-1416) Add support for SequenceFiles in PySpark
[ https://issues.apache.org/jira/browse/SPARK-1416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-1416. -- Resolution: Fixed Fix Version/s: 1.1.0 Target Version/s: 1.1.0 Implemented in https://github.com/apache/spark/pull/455 Add support for SequenceFiles in PySpark Key: SPARK-1416 URL: https://issues.apache.org/jira/browse/SPARK-1416 Project: Spark Issue Type: Improvement Reporter: Matei Zaharia Assignee: Nick Pentreath Fix For: 1.1.0 Just covering the basic Hadoop Writable types (e.g. primitives, arrays of primitives, text) should still let people store data more efficiently. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1416) Add support for SequenceFiles in PySpark
[ https://issues.apache.org/jira/browse/SPARK-1416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-1416: - Assignee: Nick Pentreath Add support for SequenceFiles in PySpark Key: SPARK-1416 URL: https://issues.apache.org/jira/browse/SPARK-1416 Project: Spark Issue Type: Improvement Reporter: Matei Zaharia Assignee: Nick Pentreath Fix For: 1.1.0 Just covering the basic Hadoop Writable types (e.g. primitives, arrays of primitives, text) should still let people store data more efficiently. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2090) spark-shell input text entry not showing on REPL
Richard Conway created SPARK-2090: - Summary: spark-shell input text entry not showing on REPL Key: SPARK-2090 URL: https://issues.apache.org/jira/browse/SPARK-2090 Project: Spark Issue Type: Bug Components: Input/Output, Spark Core Affects Versions: 1.0.0 Environment: Ubuntu 14.04; Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_60) Reporter: Richard Conway Priority: Critical Fix For: 1.0.0 spark-shell doesn't display typed input. On startup it logs: Failed to created SparkJLineReader: java.io.IOException: Permission denied Falling back to SimpleReader. The driver has 2 workers on 2 virtual machines and is error free apart from the above line, so I think it may have something to do with the introduction of the new SecurityManager. The upshot is that when you type, nothing is displayed on the screen. For example, type test at the scala prompt and you won't see the input, but the output will show: scala console:11: error: package test is not a value test ^ -- This message was sent by Atlassian JIRA (v6.2#6252)