[jira] [Commented] (SPARK-1576) Passing of JAVA_OPTS to YARN on command line

2014-04-22 Thread Nishkam Ravi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13977906#comment-13977906
 ] 

Nishkam Ravi commented on SPARK-1576:
-

Hi Sandy, I've initiated a pull request for this fix.

> Passing of JAVA_OPTS to YARN on command line
> 
>
> Key: SPARK-1576
> URL: https://issues.apache.org/jira/browse/SPARK-1576
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 0.9.0, 1.0.0, 0.9.1
>Reporter: Nishkam Ravi
> Fix For: 0.9.0, 1.0.0, 0.9.1
>
> Attachments: SPARK-1576.patch
>
>
> JAVA_OPTS can be passed either as env variables (i.e., SPARK_JAVA_OPTS) or as 
> config vars (after Patrick's recent change). It would be good to allow the 
> user to pass them on the command line as well, to restrict their scope to a 
> single application invocation.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1584) Upgrade Flume dependency to 1.4.0

2014-04-22 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13977901#comment-13977901
 ] 

Sandy Ryza commented on SPARK-1584:
---

This will allow us to take advantage of features like compression and 
encryption in Flume channels.

> Upgrade Flume dependency to 1.4.0
> -
>
> Key: SPARK-1584
> URL: https://issues.apache.org/jira/browse/SPARK-1584
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 0.9.0
>Reporter: Sandy Ryza
>Assignee: Sandy Ryza
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1480) Choose classloader consistently inside of Spark codebase

2014-04-22 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13977897#comment-13977897
 ] 

Patrick Wendell commented on SPARK-1480:


https://github.com/apache/spark/pull/398/files

> Choose classloader consistently inside of Spark codebase
> 
>
> Key: SPARK-1480
> URL: https://issues.apache.org/jira/browse/SPARK-1480
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Patrick Wendell
>Assignee: Patrick Wendell
>Priority: Blocker
> Fix For: 1.0.0
>
>
> The Spark codebase is not always consistent about which class loader it uses 
> when classloaders are explicitly passed to things like serializers. This 
> caused SPARK-1403 and also causes a bug where, when the driver has a modified 
> context class loader, it is not propagated correctly to the (local) executor 
> in local mode.
> In most cases what we want is the following behavior:
> 1. If there is a context classloader on the thread, use that.
> 2. Otherwise use the classloader that loaded Spark.
> We should just have a utility function for this and call that function 
> whenever we need to get a classloader.
> Note that SPARK-1403 is a workaround for this exact problem (it sets the 
> context class loader because downstream code assumes it is set). Once this 
> gets fixed in a more general way SPARK-1403 can be reverted.
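A minimal sketch of the kind of utility described above (the object and method names here are illustrative, not the actual Spark helper):

{code}
// Hypothetical sketch of the "choose a classloader" rule described above:
// prefer the thread's context classloader, fall back to the loader that loaded Spark.
object ClassLoaderUtil {
  def getContextOrSparkClassLoader: ClassLoader = {
    val contextLoader = Thread.currentThread().getContextClassLoader
    if (contextLoader != null) contextLoader else getClass.getClassLoader
  }
}
{code}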



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1584) Upgrade Flume dependency to 1.4.0

2014-04-22 Thread Sandy Ryza (JIRA)
Sandy Ryza created SPARK-1584:
-

 Summary: Upgrade Flume dependency to 1.4.0
 Key: SPARK-1584
 URL: https://issues.apache.org/jira/browse/SPARK-1584
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 0.9.0
Reporter: Sandy Ryza
Assignee: Sandy Ryza
 Fix For: 1.0.0






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1583) Use java.util.HashMap.remove by mistake in BlockManagerMasterActor.removeBlockManager

2014-04-22 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13977895#comment-13977895
 ] 

Shixiong Zhu commented on SPARK-1583:
-

PR: https://github.com/apache/spark/pull/500

> Use java.util.HashMap.remove by mistake in 
> BlockManagerMasterActor.removeBlockManager
> -
>
> Key: SPARK-1583
> URL: https://issues.apache.org/jira/browse/SPARK-1583
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>  Labels: easyfix
>
> The following code in BlockManagerMasterActor.removeBlockManager mistakenly 
> uses a value, rather than the key, to remove an entry from a java.util.HashMap:
>   if (locations.size == 0) {
>     blockLocations.remove(locations)
>   }
> It should be changed to "blockLocations.remove(blockId)".



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1583) Use java.util.HashMap.remove by mistake in BlockManagerMasterActor.removeBlockManager

2014-04-22 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-1583:
---

 Summary: Use java.util.HashMap.remove by mistake in 
BlockManagerMasterActor.removeBlockManager
 Key: SPARK-1583
 URL: https://issues.apache.org/jira/browse/SPARK-1583
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu


The following code in BlockManagerMasterActor.removeBlockManager mistakenly uses 
a value, rather than the key, to remove an entry from a java.util.HashMap:

  if (locations.size == 0) {
    blockLocations.remove(locations)
  }

It should be changed to "blockLocations.remove(blockId)".
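For reference, a self-contained sketch of the intended behavior, with types simplified to Strings (the real map is presumably keyed by the block id, as the suggested fix indicates):

{code}
import java.util.{HashMap => JHashMap}
import scala.collection.mutable

// Entries must be removed by their key (the block id), not by the value that
// was just looked up; removing by value is a silent no-op on java.util.HashMap.
val blockLocations = new JHashMap[String, mutable.HashSet[String]]()
val blockId = "rdd_0_0"
blockLocations.put(blockId, mutable.HashSet("bm-1"))

val locations = blockLocations.get(blockId)
locations.remove("bm-1")
if (locations.size == 0) {
  blockLocations.remove(blockId)   // was: blockLocations.remove(locations)
}
{code}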



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1438) Update RDD.sample() API to make seed parameter optional

2014-04-22 Thread Arun Ramakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13977874#comment-13977874
 ] 

Arun Ramakrishnan commented on SPARK-1438:
--

NEW PR at https://github.com/apache/spark/pull/477

> Update RDD.sample() API to make seed parameter optional
> ---
>
> Key: SPARK-1438
> URL: https://issues.apache.org/jira/browse/SPARK-1438
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Matei Zaharia
>Priority: Blocker
>  Labels: Starter
> Fix For: 1.0.0
>
>
> When a seed is not given, it should pick one based on Math.random().
> This needs to be done in Java and Python as well.
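A minimal sketch of what the change could look like, assuming the seed is defaulted from Math.random() as the issue suggests (the wrapper name is made up; the actual API shape is up to the PR):

{code}
import org.apache.spark.rdd.RDD

// Hypothetical sketch: make the seed optional by giving it a default value
// derived from math.random, so callers may omit it but can still pin one
// explicitly for reproducibility.
def sampleWithOptionalSeed[T](rdd: RDD[T],
                              withReplacement: Boolean,
                              fraction: Double,
                              seed: Long = (math.random * Long.MaxValue).toLong): RDD[T] =
  rdd.sample(withReplacement, fraction, seed)
{code}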



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1577) GraphX mapVertices with KryoSerialization

2014-04-22 Thread Joseph E. Gonzalez (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13977863#comment-13977863
 ] 

Joseph E. Gonzalez commented on SPARK-1577:
---

I have narrowed the issue down to the following line (line 46 in GraphKryoRegistrator):

{code} 
kryo.setReferences(false)
{code}

This is a problem in the Spark REPL, where generated objects contain cyclic 
references that Kryo cannot serialize without reference tracking. Removing this 
line addresses the issue. I will submit a pull request with the fix.

In fact, I can reproduce the bug with the following much simpler block of code:

{code}
class A(a: String) extends Serializable
val x = sc.parallelize(Array.fill(10)(new A("hello")))
x.collect
{code}

tl;dr

Disabling reference tracking in Kryo will break the Spark Shell.  
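A minimal sketch of a registrator that leaves reference tracking at Kryo's default (enabled), which is the behavior the fix restores. The class name and registered type are illustrative, not the full GraphKryoRegistrator:

{code}
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

// Hypothetical sketch: register classes but never call kryo.setReferences(false).
// With reference tracking disabled, cyclic object graphs (as produced by
// REPL-generated classes) recurse until a StackOverflowError, as in the trace above.
class SafeGraphRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[Array[(Long, (String, Int))]])
  }
}
{code}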


> GraphX mapVertices with KryoSerialization
> -
>
> Key: SPARK-1577
> URL: https://issues.apache.org/jira/browse/SPARK-1577
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Reporter: Joseph E. Gonzalez
>
> If Kryo is enabled by setting:
> {code}
> SPARK_JAVA_OPTS+="-Dspark.serializer=org.apache.spark.serializer.KryoSerializer
>  "
> SPARK_JAVA_OPTS+="-Dspark.kryo.registrator=org.apache.spark.graphx.GraphKryoRegistrator
>   "
> {code}
> in conf/spark_env.conf and running the following block of code in the shell:
> {code}
> import org.apache.spark.graphx._
> import org.apache.spark.graphx.lib._
> import org.apache.spark.rdd.RDD
> val vertexArray = Array(
>   (1L, ("Alice", 28)),
>   (2L, ("Bob", 27)),
>   (3L, ("Charlie", 65)),
>   (4L, ("David", 42)),
>   (5L, ("Ed", 55)),
>   (6L, ("Fran", 50))
>   )
> val edgeArray = Array(
>   Edge(2L, 1L, 7),
>   Edge(2L, 4L, 2),
>   Edge(3L, 2L, 4),
>   Edge(3L, 6L, 3),
>   Edge(4L, 1L, 1),
>   Edge(5L, 2L, 2),
>   Edge(5L, 3L, 8),
>   Edge(5L, 6L, 3)
>   )
> val vertexRDD: RDD[(Long, (String, Int))] = sc.parallelize(vertexArray)
> val edgeRDD: RDD[Edge[Int]] = sc.parallelize(edgeArray)
> val graph: Graph[(String, Int), Int] = Graph(vertexRDD, edgeRDD)
> // Define a class to more clearly model the user property
> case class User(name: String, age: Int, inDeg: Int, outDeg: Int)
> // Transform the graph
> val userGraph = graph.mapVertices{ case (id, (name, age)) => User(name, age, 
> 0, 0) }
> {code}
> The following block of code works:
> {code}
> userGraph.vertices.count
> {code}
> and the following block of code generates a Kryo error:
> {code}
> userGraph.vertices.collect
> {code}
> The error:
> {code}
> java.lang.StackOverflowError
>   at 
> sun.reflect.UnsafeFieldAccessorImpl.ensureObj(UnsafeFieldAccessorImpl.java:54)
>   at 
> sun.reflect.UnsafeQualifiedObjectFieldAccessorImpl.get(UnsafeQualifiedObjectFieldAccessorImpl.java:38)
>   at java.lang.reflect.Field.get(Field.java:379)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:552)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
>   at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
>   at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
>   at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
>   at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
> {code}
>  



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1494) Hive Dependencies being checked by MIMA

2014-04-22 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1494:
---

Fix Version/s: 1.0.0

> Hive Dependencies being checked by MIMA
> ---
>
> Key: SPARK-1494
> URL: https://issues.apache.org/jira/browse/SPARK-1494
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra, SQL
>Affects Versions: 1.0.0
>Reporter: Ahir Reddy
>Assignee: Michael Armbrust
>Priority: Minor
> Fix For: 1.0.0
>
>
> It looks like code in companion objects is being invoked by the MIMA checker, 
> as it uses Scala reflection to check all of the interfaces. As a result it 
> starts a Spark context and eventually runs out of memory. As a temporary fix, 
> all classes whose names contain "hive" or "Hive" are excluded from the check.
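A minimal sketch of the kind of name-based exclusion described (illustrative only; the real check lives in the MIMA build code):

{code}
// Hypothetical sketch: skip any class whose name mentions Hive so the MIMA
// checker never reflectively touches Hive-related companion objects.
def isExcludedFromMima(className: String): Boolean =
  className.contains("hive") || className.contains("Hive")
{code}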



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1494) Hive Dependencies being checked by MIMA

2014-04-22 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-1494.


Resolution: Fixed

> Hive Dependencies being checked by MIMA
> ---
>
> Key: SPARK-1494
> URL: https://issues.apache.org/jira/browse/SPARK-1494
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra, SQL
>Affects Versions: 1.0.0
>Reporter: Ahir Reddy
>Assignee: Michael Armbrust
>Priority: Minor
>
> It looks like code in companion objects is being invoked by the MIMA checker, 
> as it uses Scala reflection to check all of the interfaces. As a result it 
> starts a Spark context and eventually runs out of memory. As a temporary fix, 
> all classes whose names contain "hive" or "Hive" are excluded from the check.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1582) Job cancellation does not interrupt threads

2014-04-22 Thread Aaron Davidson (JIRA)
Aaron Davidson created SPARK-1582:
-

 Summary: Job cancellation does not interrupt threads
 Key: SPARK-1582
 URL: https://issues.apache.org/jira/browse/SPARK-1582
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0, 0.9.1
Reporter: Aaron Davidson
Assignee: Aaron Davidson


Cancelling Spark jobs is limited because executors that are blocked are not 
interrupted. In effect, the cancellation will succeed and the job will no 
longer be "running", but executor threads may still be tied up with the 
cancelled job and unable to do further work until they complete. This is 
particularly problematic in the case of deadlock or unlimited/long timeouts.

It would be useful if cancelling a job would call Thread.interrupt() in order 
to interrupt blocking in most situations, such as Object monitors or IO. The 
one caveat is [HDFS-1208|https://issues.apache.org/jira/browse/HDFS-1208], 
where HDFS's DFSClient will not only swallow InterruptedException but may 
reinterpret it as an IOException, causing HDFS to mark a node as permanently 
failed. Thus, this feature must be optional and probably off by default.
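A minimal sketch of the opt-in behavior described above (class and flag names are illustrative, not the actual Spark API):

{code}
// Hypothetical sketch: only interrupt the task's thread when the caller has
// opted in, since interrupting HDFS clients can be unsafe (see HDFS-1208).
class RunningTask(val thread: Thread) {
  def kill(interruptThread: Boolean): Unit = {
    if (interruptThread) {
      thread.interrupt()  // unblocks waits on monitors, sleeps, and most IO
    }
    // otherwise the task is only marked as cancelled and must finish on its own
  }
}
{code}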



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1581) Allow One Flume Avro RPC Server for Each Worker rather than Just One Worker

2014-04-22 Thread Christophe Clapp (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christophe Clapp updated SPARK-1581:


Summary: Allow One Flume Avro RPC Server for Each Worker rather than Just 
One Worker  (was: Allow One Flume Avro RPC Server for Each Worker rather than 
Just One)

> Allow One Flume Avro RPC Server for Each Worker rather than Just One Worker
> ---
>
> Key: SPARK-1581
> URL: https://issues.apache.org/jira/browse/SPARK-1581
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Christophe Clapp
>Priority: Minor
>  Labels: Flume
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1581) Allow One Flume Avro RPC Server for Each Worker rather than Just One

2014-04-22 Thread Christophe Clapp (JIRA)
Christophe Clapp created SPARK-1581:
---

 Summary: Allow One Flume Avro RPC Server for Each Worker rather 
than Just One
 Key: SPARK-1581
 URL: https://issues.apache.org/jira/browse/SPARK-1581
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Reporter: Christophe Clapp
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1562) Exclude internal catalyst classes from scaladoc, or make them package private

2014-04-22 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-1562.


Resolution: Fixed

> Exclude internal catalyst classes from scaladoc, or make them package private
> -
>
> Key: SPARK-1562
> URL: https://issues.apache.org/jira/browse/SPARK-1562
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Patrick Wendell
>Assignee: Michael Armbrust
>Priority: Blocker
> Fix For: 1.0.0
>
>
> Michael - this is up to you. But I noticed there are a ton of internal 
> catalyst types that show up in our scaladoc. I'm not sure if you mean these 
> to be user-facing APIs. If not, it might be good to hide them from the docs 
> or make them package private.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1562) Exclude internal catalyst classes from scaladoc, or make them package private

2014-04-22 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13977767#comment-13977767
 ] 

Patrick Wendell commented on SPARK-1562:


Thanks for taking a look at this!

> Exclude internal catalyst classes from scaladoc, or make them package private
> -
>
> Key: SPARK-1562
> URL: https://issues.apache.org/jira/browse/SPARK-1562
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Patrick Wendell
>Assignee: Michael Armbrust
>Priority: Blocker
> Fix For: 1.0.0
>
>
> Michael - this is up to you. But I noticed there are a ton of internal 
> catalyst types that show up in our scaladoc. I'm not sure if you mean these 
> to be user-facing APIs. If not, it might be good to hide them from the docs 
> or make them package private.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1376) In the yarn-cluster submitter, rename "args" option to "arg"

2014-04-22 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13977757#comment-13977757
 ] 

Patrick Wendell commented on SPARK-1376:


There was follow up work on this by [~mengxr] in 
https://github.com/apache/spark/pull/485

> In the yarn-cluster submitter, rename "args" option to "arg"
> 
>
> Key: SPARK-1376
> URL: https://issues.apache.org/jira/browse/SPARK-1376
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 0.9.0
>Reporter: Sandy Ryza
>Assignee: Sandy Ryza
>
> This was discussed in the SPARK-1126 PR.  "args" will be kept around for 
> backwards compatibility.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1578) Do not require setting of cleaner TTL when creating StreamingContext

2014-04-22 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-1578.


   Resolution: Fixed
Fix Version/s: 1.0.0

> Do not require setting of cleaner TTL when creating StreamingContext
> 
>
> Key: SPARK-1578
> URL: https://issues.apache.org/jira/browse/SPARK-1578
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Blocker
> Fix For: 1.0.0
>
>
> Since shuffles and RDDs that are out of context are automatically cleaned by 
> Spark core (using ContextCleaner) there is no need for setting the cleaner 
> TTL in StreamingContext.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1576) Passing of JAVA_OPTS to YARN on command line

2014-04-22 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13977754#comment-13977754
 ] 

Sandy Ryza commented on SPARK-1576:
---

Hi [~nravi], Spark patches should be submitted as pull requests against the 
Spark github - https://github.com/apache/spark.

> Passing of JAVA_OPTS to YARN on command line
> 
>
> Key: SPARK-1576
> URL: https://issues.apache.org/jira/browse/SPARK-1576
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 0.9.0, 1.0.0, 0.9.1
>Reporter: Nishkam Ravi
> Fix For: 0.9.0, 1.0.0, 0.9.1
>
> Attachments: SPARK-1576.patch
>
>
> JAVA_OPTS can be passed either as env variables (i.e., SPARK_JAVA_OPTS) or as 
> config vars (after Patrick's recent change). It would be good to allow the 
> user to pass them on the command line as well, to restrict their scope to a 
> single application invocation.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1538) SparkUI forgets about all persisted RDD's not directly associated with the Stage

2014-04-22 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-1538.


Resolution: Fixed
  Assignee: Andrew Or

> SparkUI forgets about all persisted RDD's not directly associated with the 
> Stage
> 
>
> Key: SPARK-1538
> URL: https://issues.apache.org/jira/browse/SPARK-1538
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.9.1
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Blocker
> Fix For: 1.0.0
>
>
> The following command creates two RDDs in one Stage:
> sc.parallelize(1 to 1000, 4).persist.map(_ + 1).count
> More specifically, parallelize creates one, and map creates another. If we 
> persist only the first one, it does not actually show up on the StorageTab of 
> the SparkUI.
> This is because StageInfo only keeps around information for the last RDD 
> associated with the stage, but forgets about all of its parents. The proposal 
> here is to have StageInfo climb the RDD dependency ladder to keep a list of 
> all associated RDDInfos.
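A minimal sketch of "climbing the dependency ladder" (illustrative only; not the actual StageInfo code):

{code}
import scala.collection.mutable
import org.apache.spark.rdd.RDD

// Hypothetical sketch: collect an RDD together with every ancestor RDD it
// depends on, so RDDInfo can be kept for parents (e.g. the persisted one) too.
def rddWithAncestors(rdd: RDD[_]): Seq[RDD[_]] = {
  val seen = mutable.LinkedHashSet[RDD[_]]()
  def visit(r: RDD[_]): Unit = {
    if (seen.add(r)) r.dependencies.foreach(dep => visit(dep.rdd))
  }
  visit(rdd)
  seen.toSeq
}
{code}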



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1281) Partitioning in ALS

2014-04-22 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-1281.


   Resolution: Fixed
Fix Version/s: 1.0.0

> Partitioning in ALS
> ---
>
> Key: SPARK-1281
> URL: https://issues.apache.org/jira/browse/SPARK-1281
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Xiangrui Meng
>Priority: Minor
> Fix For: 1.0.0
>
>
> There are some minor issues about partitioning with the current 
> implementation of ALS:
> 1. A mod-based partitioner is used for mapping users/products to blocks. This 
> might cause problems if the ids contain information. For example, the last 
> digit may indicate the user/product type. This can be fixed by hashing.
> 2. HashPartitioner is used on the initial partition. This is the same as the 
> mod-based partitioner when the key is a positive integer. But it is certainly 
> error-prone.
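A minimal sketch of the difference between the two block assignments (function names are made up for illustration):

{code}
import scala.util.hashing.byteswap32

def nonNegativeMod(x: Int, mod: Int): Int = ((x % mod) + mod) % mod

// Mod-based assignment: sensitive to structure in the ids (e.g. a meaningful
// last digit maps whole user/product types to the same block).
def modBlock(id: Int, numBlocks: Int): Int = nonNegativeMod(id, numBlocks)

// Hash-based assignment: scramble the id first so any such structure is lost.
def hashBlock(id: Int, numBlocks: Int): Int = nonNegativeMod(byteswap32(id), numBlocks)
{code}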



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1281) Partitioning in ALS

2014-04-22 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1281:
---

Assignee: Tor Myklebust

> Partitioning in ALS
> ---
>
> Key: SPARK-1281
> URL: https://issues.apache.org/jira/browse/SPARK-1281
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Xiangrui Meng
>Assignee: Tor Myklebust
>Priority: Minor
> Fix For: 1.0.0
>
>
> There are some minor issues about partitioning with the current 
> implementation of ALS:
> 1. A mod-based partitioner is used for mapping users/products to blocks. This 
> might cause problems if the ids contain information. For example, the last 
> digit may indicate the user/product type. This can be fixed by hashing.
> 2. HashPartitioner is used on the initial partition. This is the same as the 
> mod-based partitioner when the key is a positive integer. But it is certainly 
> error-prone.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1119) Make examples and assembly jar naming consistent between maven/sbt

2014-04-22 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1119:
---

Description: 
Right now it's somewhat different. Also it should be consistent with what the 
classpath and example scripts calculate.

Build with Maven:
./assembly/target/scala-2.10/spark-assembly_2.10-1.0.0-SNAPSHOT-hadoop1.0.4.jar
./examples/target/scala-2.10/spark-examples_2.10-assembly-1.0.0-SNAPSHOT.jar

  was:Right now it's somewhat different. Also it should be consistent with what 
the classpath and example scripts calculate.


> Make examples and assembly jar naming consistent between maven/sbt
> --
>
> Key: SPARK-1119
> URL: https://issues.apache.org/jira/browse/SPARK-1119
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 0.9.0
>Reporter: Patrick Wendell
> Fix For: 1.0.0
>
>
> Right now it's somewhat different. Also it should be consistent with what the 
> classpath and example scripts calculate.
> Build with Maven:
> ./assembly/target/scala-2.10/spark-assembly_2.10-1.0.0-SNAPSHOT-hadoop1.0.4.jar
> ./examples/target/scala-2.10/spark-examples_2.10-assembly-1.0.0-SNAPSHOT.jar



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-800) Improve Quickstart Docs to Make Full Deployment More Clear

2014-04-22 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-800.
---

   Resolution: Fixed
Fix Version/s: 1.0.0

This has been fixed with the recent changes to the docs.

> Improve Quickstart Docs to Make Full Deployment More Clear
> --
>
> Key: SPARK-800
> URL: https://issues.apache.org/jira/browse/SPARK-800
> Project: Spark
>  Issue Type: Bug
>Reporter: Patrick Wendell
>Assignee: Patrick Wendell
> Fix For: 1.0.0
>
>
> Some things that should be in the quickstart standalone job section are not 
> super clear to people.
> 1. Explain that people need to package their dependencies either by making an 
> uber jar (which is added to the SC) or by copying to each slave node if they 
> build an actual project. This isn't clear since there are no dependencies in 
> the quickstart itself. Warn them about issues creating an uber-jar with akka.
> 2. Explain that people need to set the required resources in an env variable 
> before launching (or whatever new way Matei added to do this). 
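For point 1, a minimal sketch of shipping an assembled application jar with the context (the app name and jar path are made up for illustration):

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical sketch: the assembled ("uber") jar is listed when the context is
// created so Spark copies it to the executors; the path below is illustrative.
val conf = new SparkConf()
  .setAppName("QuickstartApp")
  .setJars(Seq("target/scala-2.10/quickstart-assembly-0.1.jar"))
val sc = new SparkContext(conf)
{code}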



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-800) Improve Quickstart Docs to Make Full Deployment More Clear

2014-04-22 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-800:
--

Reporter: Patrick Wendell  (was: Patrick Cogan)

> Improve Quickstart Docs to Make Full Deployment More Clear
> --
>
> Key: SPARK-800
> URL: https://issues.apache.org/jira/browse/SPARK-800
> Project: Spark
>  Issue Type: Bug
>Reporter: Patrick Wendell
>Assignee: Patrick Wendell
>
> Some things that should be in the quickstart standalone job section are not 
> super clear to people.
> 1. Explain that people need to package their dependencies either by making an 
> uber jar (which is added to the SC) or by copying to each slave node if they 
> build an actual project. This isn't clear since there are no dependencies in 
> the quickstart itself. Warn them about issues creating an uber-jar with akka.
> 2. Explain that people need to set the required resources in an env variable 
> before launching (or whatever new way Matei added to do this). 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-800) Improve Quickstart Docs to Make Full Deployment More Clear

2014-04-22 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-800:
--

Assignee: Patrick Wendell  (was: Patrick Cogan)

> Improve Quickstart Docs to Make Full Deployment More Clear
> --
>
> Key: SPARK-800
> URL: https://issues.apache.org/jira/browse/SPARK-800
> Project: Spark
>  Issue Type: Bug
>Reporter: Patrick Cogan
>Assignee: Patrick Wendell
>
> Some things that should be in the quickstart standalone job section are not 
> super clear to people.
> 1. Explain that people need to package their dependencies either by making an 
> uber jar (which is added to the SC) or by copying to each slave node if they 
> build an actual project. This isn't clear since there are no dependencies in 
> the quickstart itself. Warn them about issues creating an uber-jar with akka.
> 2. Explain that people need to set the required resources in an env variable 
> before launching (or whatever new way Matei added to do this). 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1580) ALS: Estimate communication and computation costs given a partitioner

2014-04-22 Thread Tor Myklebust (JIRA)
Tor Myklebust created SPARK-1580:


 Summary: ALS: Estimate communication and computation costs given a 
partitioner
 Key: SPARK-1580
 URL: https://issues.apache.org/jira/browse/SPARK-1580
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Tor Myklebust
Priority: Minor


It would be nice to be able to estimate the amount of work needed to solve an 
ALS problem.  The chief components of this "work" are computation time---time 
spent forming and solving the least squares problems---and communication 
cost---the number of bytes sent across the network.  Communication cost depends 
heavily on how the users and products are partitioned.

We currently do not try to cluster users or products so that fewer feature 
vectors need to be communicated.  This is intended as a first step toward that 
end---we ought to be able to tell whether one partitioning is better than 
another.
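A rough sketch of the kind of estimate this asks for, under a simplifying assumption (each user's feature vector is shipped once to every product block it has ratings in; all names are illustrative):

{code}
// Hypothetical sketch: a crude communication estimate, counting how many times a
// user's feature vector must cross into a product block, given a block assignment.
def userVectorsSent(ratings: Seq[(Int, Int)],          // (userId, productId) pairs
                    productBlock: Int => Int): Int =
  ratings.map { case (user, product) => (user, productBlock(product)) }.distinct.size
{code}

Comparing this count (and its product-side counterpart) across candidate partitioners would give the "is one partitioning better than another" signal the issue describes.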



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1579) PySpark should distinguish expected IOExceptions from unexpected ones in the worker

2014-04-22 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-1579:
--

 Summary: PySpark should distinguish expected IOExceptions from 
unexpected ones in the worker
 Key: SPARK-1579
 URL: https://issues.apache.org/jira/browse/SPARK-1579
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Reporter: Patrick Wendell
Assignee: Aaron Davidson
 Fix For: 1.1.0


I chatted with [~adav] a bit about this. Right now we drop IOExceptions because 
they are (in some cases) expected if a Python worker returns before consuming 
its entire input. The issue is this swallows legitimate IO exceptions when they 
occur.

One thought we had was to change the daemon.py file to, instead of closing the 
socket when the function is over, simply busy-wait on the socket being closed. 
We'd transfer the responsibility for closing the socket to the Java reader. The 
Java reader could, when it has finished consuming output from Python, set a 
flag on a volatile variable to indicate that Python has fully returned. Then if 
an IOException is found, we only swallow it if we are expecting it.

This would also let us remove the warning message right now.
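A minimal sketch of the Java-side handshake described above (names are illustrative; the real change would span daemon.py and the Python RDD reader):

{code}
// Hypothetical sketch: the reader flips a volatile flag once it has consumed all
// output from Python, and the writer side only swallows IOExceptions that arrive
// after that point.
@volatile var pythonFinished = false

def onReaderDone(socket: java.net.Socket): Unit = {
  pythonFinished = true
  socket.close()   // the Java side now owns closing the socket
}

def handleWriteFailure(e: java.io.IOException): Unit = {
  if (!pythonFinished) throw e   // a real IO error, surface it
  // otherwise: expected, the Python worker already returned; drop it silently
}
{code}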



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1579) PySpark should distinguish expected IOExceptions from unexpected ones in the worker

2014-04-22 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1579:
---

Description: 
I chatted with [~adav] a bit about this. Right now we drop IOExceptions because 
they are (in some cases) expected if a Python worker returns before consuming 
its entire input. The issue is this swallows legitimate IO exceptions when they 
occur.

One thought we had was to change the daemon.py file to, instead of closing the 
socket when the function is over, simply busy-wait on the socket being closed. 
We'd transfer the responsibility for closing the socket to the Java reader. The 
Java reader could, when it has finished consuming output from Python, set a 
flag on a volatile variable to indicate that Python has fully returned, and 
then close the socket. Then if an IOException is found, we only swallow it if 
we are expecting it.

This would also let us remove the warning message right now.

  was:
I chatted with [~adav] a bit about this. Right now we drop IOExceptions because 
they are (in some cases) expected if a Python worker returns before consuming 
its entire input. The issue is this swallows legitimate IO exceptions when they 
occur.

One thought we had was to change the daemon.py file to, instead of closing the 
socket when the function is over, simply busy-wait on the socket being closed. 
We'd transfer the responsibility for closing the socket to the Java reader. The 
Java reader could, when it has finished consuming output from Python, set a 
flag on a volatile variable to indicate that Python has fully returned. Then if 
an IOException is found, we only swallow it if we are expecting it.

This would also let us remove the warning message right now.


> PySpark should distinguish expected IOExceptions from unexpected ones in the 
> worker
> ---
>
> Key: SPARK-1579
> URL: https://issues.apache.org/jira/browse/SPARK-1579
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Patrick Wendell
>Assignee: Aaron Davidson
> Fix For: 1.1.0
>
>
> I chatted with [~adav] a bit about this. Right now we drop IOExceptions 
> because they are (in some cases) expected if a Python worker returns before 
> consuming its entire input. The issue is this swallows legitimate IO 
> exceptions when they occur.
> One thought we had was to change the daemon.py file to, instead of closing 
> the socket when the function is over, simply busy-wait on the socket being 
> closed. We'd transfer the responsibility for closing the socket to the Java 
> reader. The Java reader could, when it has finished consuming output from 
> Python, set a flag on a volatile variable to indicate that Python has fully 
> returned, and then close the socket. Then if an IOException is found, we only 
> swallow it if we are expecting it.
> This would also let us remove the warning message right now.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1579) PySpark should distinguish expected IOExceptions from unexpected ones in the worker

2014-04-22 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1579:
---

Description: 
I chatted with [~adav] a bit about this. Right now we drop IOExceptions because 
they are (in some cases) expected if a Python worker returns before consuming 
its entire input. The issue is this swallows legitimate IO exceptions when they 
occur.

One thought we had was to change the daemon.py file to, instead of closing the 
socket when the function is over, simply busy-wait on the socket being closed. 
We'd transfer the responsibility for closing the socket to the Java reader. The 
Java reader could, when it has finished consuming output from Python, set a 
flag on a volatile variable to indicate that Python has fully returned, and 
then close the socket. Then if an IOException is thrown in the write thread, it 
only swallows the exception if we are expecting it.

This would also let us remove the warning message right now.

  was:
I chatted with [~adav] a bit about this. Right now we drop IOExceptions because 
they are (in some cases) expected if a Python worker returns before consuming 
its entire input. The issue is this swallows legitimate IO exceptions when they 
occur.

One thought we had was to change the daemon.py file to, instead of closing the 
socket when the function is over, simply busy-wait on the socket being closed. 
We'd transfer the responsibility for closing the socket to the Java reader. The 
Java reader could, when it has finished consuming output from Python, set a 
flag on a volatile variable to indicate that Python has fully returned, and 
then close the socket. Then if an IOException is found, we only swallow it if 
we are expecting it.

This would also let us remove the warning message right now.


> PySpark should distinguish expected IOExceptions from unexpected ones in the 
> worker
> ---
>
> Key: SPARK-1579
> URL: https://issues.apache.org/jira/browse/SPARK-1579
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Patrick Wendell
>Assignee: Aaron Davidson
> Fix For: 1.1.0
>
>
> I chatted with [~adav] a bit about this. Right now we drop IOExceptions 
> because they are (in some cases) expected if a Python worker returns before 
> consuming its entire input. The issue is this swallows legitimate IO 
> exceptions when they occur.
> One thought we had was to change the daemon.py file to, instead of closing 
> the socket when the function is over, simply busy-wait on the socket being 
> closed. We'd transfer the responsibility for closing the socket to the Java 
> reader. The Java reader could, when it has finished consuming output from 
> Python, set a flag on a volatile variable to indicate that Python has fully 
> returned, and then close the socket. Then if an IOException is thrown in the 
> write thread, it only swallows the exception if we are expecting it.
> This would also let us remove the warning message right now.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1571) UnresolvedException when running JavaSparkSQL example

2014-04-22 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-1571:


Priority: Blocker  (was: Major)

> UnresolvedException when running JavaSparkSQL example
> -
>
> Key: SPARK-1571
> URL: https://issues.apache.org/jira/browse/SPARK-1571
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.0
>Reporter: Kan Zhang
>Assignee: Michael Armbrust
>Priority: Blocker
>
> This happens when running the JavaSparkSQL example using spark-submit in local 
> mode (after fixing the class loading issue in SPARK-1570):
> 14/04/22 12:46:47 ERROR Executor: Exception in task ID 0
> org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
> dataType on unresolved object, tree: 'age
>   at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(unresolved.scala:49)
>   at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(unresolved.scala:47)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.c2(Expression.scala:203)
>   at 
> org.apache.spark.sql.catalyst.expressions.GreaterThanOrEqual.eval(predicates.scala:142)
>   at 
> org.apache.spark.sql.catalyst.expressions.And.eval(predicates.scala:84)
>   at 
> org.apache.spark.sql.execution.Filter$$anonfun$2$$anonfun$apply$1.apply(basicOperators.scala:43)
>   at 
> org.apache.spark.sql.execution.Filter$$anonfun$2$$anonfun$apply$1.apply(basicOperators.scala:43)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Assigned] (SPARK-1571) UnresolvedException when running JavaSparkSQL example

2014-04-22 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust reassigned SPARK-1571:
---

Assignee: Michael Armbrust

> UnresolvedException when running JavaSparkSQL example
> -
>
> Key: SPARK-1571
> URL: https://issues.apache.org/jira/browse/SPARK-1571
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.0
>Reporter: Kan Zhang
>Assignee: Michael Armbrust
>
> This happens when running the JavaSparkSQL example using spark-submit in local 
> mode (after fixing the class loading issue in SPARK-1570):
> 14/04/22 12:46:47 ERROR Executor: Exception in task ID 0
> org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
> dataType on unresolved object, tree: 'age
>   at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(unresolved.scala:49)
>   at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(unresolved.scala:47)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.c2(Expression.scala:203)
>   at 
> org.apache.spark.sql.catalyst.expressions.GreaterThanOrEqual.eval(predicates.scala:142)
>   at 
> org.apache.spark.sql.catalyst.expressions.And.eval(predicates.scala:84)
>   at 
> org.apache.spark.sql.execution.Filter$$anonfun$2$$anonfun$apply$1.apply(basicOperators.scala:43)
>   at 
> org.apache.spark.sql.execution.Filter$$anonfun$2$$anonfun$apply$1.apply(basicOperators.scala:43)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1578) Do not require setting of cleaner TTL when creating StreamingContext

2014-04-22 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-1578:
-

Description: Since shuffles and RDDs that are out of context are 
automatically cleaned by Spark core (using ContextCleaner) there is no need for 
setting the cleaner TTL in StreamingContext.

> Do not require setting of cleaner TTL when creating StreamingContext
> 
>
> Key: SPARK-1578
> URL: https://issues.apache.org/jira/browse/SPARK-1578
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Blocker
>
> Since shuffles and RDDs that are out of context are automatically cleaned by 
> Spark core (using ContextCleaner) there is no need for setting the cleaner 
> TTL in StreamingContext.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1578) Do not require setting of cleaner TTL when creating StreamingContext

2014-04-22 Thread Tathagata Das (JIRA)
Tathagata Das created SPARK-1578:


 Summary: Do not require setting of cleaner TTL when creating 
StreamingContext
 Key: SPARK-1578
 URL: https://issues.apache.org/jira/browse/SPARK-1578
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Reporter: Tathagata Das
Assignee: Tathagata Das
Priority: Blocker






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1577) GraphX mapVertices with KryoSerialization

2014-04-22 Thread Joseph E. Gonzalez (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph E. Gonzalez updated SPARK-1577:
--

Description: 
If Kryo is enabled by setting:

{{{quote}
SPARK_JAVA_OPTS+="-Dspark.serializer=org.apache.spark.serializer.KryoSerializer 
"
SPARK_JAVA_OPTS+="-Dspark.kryo.registrator=org.apache.spark.graphx.GraphKryoRegistrator
  "
{quote}}}

in conf/spark_env.conf and running the following block of code in the shell:

{code}
import org.apache.spark.graphx._
import org.apache.spark.graphx.lib._
import org.apache.spark.rdd.RDD

val vertexArray = Array(
  (1L, ("Alice", 28)),
  (2L, ("Bob", 27)),
  (3L, ("Charlie", 65)),
  (4L, ("David", 42)),
  (5L, ("Ed", 55)),
  (6L, ("Fran", 50))
  )
val edgeArray = Array(
  Edge(2L, 1L, 7),
  Edge(2L, 4L, 2),
  Edge(3L, 2L, 4),
  Edge(3L, 6L, 3),
  Edge(4L, 1L, 1),
  Edge(5L, 2L, 2),
  Edge(5L, 3L, 8),
  Edge(5L, 6L, 3)
  )

val vertexRDD: RDD[(Long, (String, Int))] = sc.parallelize(vertexArray)
val edgeRDD: RDD[Edge[Int]] = sc.parallelize(edgeArray)

val graph: Graph[(String, Int), Int] = Graph(vertexRDD, edgeRDD)

// Define a class to more clearly model the user property
case class User(name: String, age: Int, inDeg: Int, outDeg: Int)

// Transform the graph
val userGraph = graph.mapVertices{ case (id, (name, age)) => User(name, age, 0, 
0) }
{code}

The following block of code works:

{code}
userGraph.vertices.count
{code}

and the following block of code generates a Kryo error:

{code}
userGraph.vertices.collect
{code}

The error:

{code}
java.lang.StackOverflowError
at 
sun.reflect.UnsafeFieldAccessorImpl.ensureObj(UnsafeFieldAccessorImpl.java:54)
at 
sun.reflect.UnsafeQualifiedObjectFieldAccessorImpl.get(UnsafeQualifiedObjectFieldAccessorImpl.java:38)
at java.lang.reflect.Field.get(Field.java:379)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:552)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
{code}

 

  was:
If Kryo is enabled by setting:

{{{quote}
SPARK_JAVA_OPTS+="-Dspark.serializer=org.apache.spark.serializer.KryoSerializer 
"
SPARK_JAVA_OPTS+="-Dspark.kryo.registrator=org.apache.spark.graphx.GraphKryoRegistrator
  "
{quote}}}

in conf/spark_env.conf and running the following block of code in the shell:

{{{quote}
import org.apache.spark.graphx._
import org.apache.spark.graphx.lib._
import org.apache.spark.rdd.RDD

val vertexArray = Array(
  (1L, ("Alice", 28)),
  (2L, ("Bob", 27)),
  (3L, ("Charlie", 65)),
  (4L, ("David", 42)),
  (5L, ("Ed", 55)),
  (6L, ("Fran", 50))
  )
val edgeArray = Array(
  Edge(2L, 1L, 7),
  Edge(2L, 4L, 2),
  Edge(3L, 2L, 4),
  Edge(3L, 6L, 3),
  Edge(4L, 1L, 1),
  Edge(5L, 2L, 2),
  Edge(5L, 3L, 8),
  Edge(5L, 6L, 3)
  )

val vertexRDD: RDD[(Long, (String, Int))] = sc.parallelize(vertexArray)
val edgeRDD: RDD[Edge[Int]] = sc.parallelize(edgeArray)

val graph: Graph[(String, Int), Int] = Graph(vertexRDD, edgeRDD)

// Define a class to more clearly model the user property
case class User(name: String, age: Int, inDeg: Int, outDeg: Int)

// Transform the graph
val userGraph = graph.mapVertices{ case (id, (name, age)) => User(name, age, 0, 
0) }
{quote}}}

The following block of code works:

{{{quote}
userGraph.vertices.count
{quote}}}

and the following block of code generates a Kryo error:

{{{quote}
userGraph.vertices.collect
{quote}}}

The error:

{{{quote}
java.lang.StackOverflowError
at 
sun.reflect.UnsafeFieldAccessorImpl.ensureObj(UnsafeFieldAccessorImpl.java:54)
at 
sun.reflect.UnsafeQualifiedObjectFieldAccessorImpl.get(UnsafeQualifiedObjectFieldAccessorImpl.java:38)
at java.lang.reflect.Field.get(Field.java:379)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:552)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.ja

[jira] [Updated] (SPARK-1577) GraphX mapVertices with KryoSerialization

2014-04-22 Thread Joseph E. Gonzalez (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph E. Gonzalez updated SPARK-1577:
--

Description: 
If Kryo is enabled by setting:

{code}
SPARK_JAVA_OPTS+="-Dspark.serializer=org.apache.spark.serializer.KryoSerializer 
"
SPARK_JAVA_OPTS+="-Dspark.kryo.registrator=org.apache.spark.graphx.GraphKryoRegistrator
  "
{code}

in conf/spark_env.conf and running the following block of code in the shell:

{code}
import org.apache.spark.graphx._
import org.apache.spark.graphx.lib._
import org.apache.spark.rdd.RDD

val vertexArray = Array(
  (1L, ("Alice", 28)),
  (2L, ("Bob", 27)),
  (3L, ("Charlie", 65)),
  (4L, ("David", 42)),
  (5L, ("Ed", 55)),
  (6L, ("Fran", 50))
  )
val edgeArray = Array(
  Edge(2L, 1L, 7),
  Edge(2L, 4L, 2),
  Edge(3L, 2L, 4),
  Edge(3L, 6L, 3),
  Edge(4L, 1L, 1),
  Edge(5L, 2L, 2),
  Edge(5L, 3L, 8),
  Edge(5L, 6L, 3)
  )

val vertexRDD: RDD[(Long, (String, Int))] = sc.parallelize(vertexArray)
val edgeRDD: RDD[Edge[Int]] = sc.parallelize(edgeArray)

val graph: Graph[(String, Int), Int] = Graph(vertexRDD, edgeRDD)

// Define a class to more clearly model the user property
case class User(name: String, age: Int, inDeg: Int, outDeg: Int)

// Transform the graph
val userGraph = graph.mapVertices{ case (id, (name, age)) => User(name, age, 0, 
0) }
{code}

The following block of code works:

{code}
userGraph.vertices.count
{code}

and the following block of code generates a Kryo error:

{code}
userGraph.vertices.collect
{code}

The error:

{code}
java.lang.StackOverflowError
at 
sun.reflect.UnsafeFieldAccessorImpl.ensureObj(UnsafeFieldAccessorImpl.java:54)
at 
sun.reflect.UnsafeQualifiedObjectFieldAccessorImpl.get(UnsafeQualifiedObjectFieldAccessorImpl.java:38)
at java.lang.reflect.Field.get(Field.java:379)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:552)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
{code}

 

  was:
If Kryo is enabled by setting:

{{{quote}
SPARK_JAVA_OPTS+="-Dspark.serializer=org.apache.spark.serializer.KryoSerializer 
"
SPARK_JAVA_OPTS+="-Dspark.kryo.registrator=org.apache.spark.graphx.GraphKryoRegistrator
  "
{quote}}}

in conf/spark_env.conf and running the following block of code in the shell:

{code}
import org.apache.spark.graphx._
import org.apache.spark.graphx.lib._
import org.apache.spark.rdd.RDD

val vertexArray = Array(
  (1L, ("Alice", 28)),
  (2L, ("Bob", 27)),
  (3L, ("Charlie", 65)),
  (4L, ("David", 42)),
  (5L, ("Ed", 55)),
  (6L, ("Fran", 50))
  )
val edgeArray = Array(
  Edge(2L, 1L, 7),
  Edge(2L, 4L, 2),
  Edge(3L, 2L, 4),
  Edge(3L, 6L, 3),
  Edge(4L, 1L, 1),
  Edge(5L, 2L, 2),
  Edge(5L, 3L, 8),
  Edge(5L, 6L, 3)
  )

val vertexRDD: RDD[(Long, (String, Int))] = sc.parallelize(vertexArray)
val edgeRDD: RDD[Edge[Int]] = sc.parallelize(edgeArray)

val graph: Graph[(String, Int), Int] = Graph(vertexRDD, edgeRDD)

// Define a class to more clearly model the user property
case class User(name: String, age: Int, inDeg: Int, outDeg: Int)

// Transform the graph
val userGraph = graph.mapVertices{ case (id, (name, age)) => User(name, age, 0, 
0) }
{code}

The following block of code works:

{code}
userGraph.vertices.count
{code}

and the following block of code generates a Kryo error:

{code}
userGraph.vertices.collect
{code}

The error:

{code}
java.lang.StackOverflowError
at 
sun.reflect.UnsafeFieldAccessorImpl.ensureObj(UnsafeFieldAccessorImpl.java:54)
at 
sun.reflect.UnsafeQualifiedObjectFieldAccessorImpl.get(UnsafeQualifiedObjectFieldAccessorImpl.java:38)
at java.lang.reflect.Field.get(Field.java:379)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:552)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
at 
com.eso

[jira] [Updated] (SPARK-1577) GraphX mapVertices with KryoSerialization

2014-04-22 Thread Joseph E. Gonzalez (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph E. Gonzalez updated SPARK-1577:
--

Description: 
If Kryo is enabled by setting:

{{{quote}
SPARK_JAVA_OPTS+="-Dspark.serializer=org.apache.spark.serializer.KryoSerializer 
"
SPARK_JAVA_OPTS+="-Dspark.kryo.registrator=org.apache.spark.graphx.GraphKryoRegistrator
  "
{quote}}}

in conf/spark_env.conf and running the following block of code in the shell:

{{{quote}
import org.apache.spark.graphx._
import org.apache.spark.graphx.lib._
import org.apache.spark.rdd.RDD

val vertexArray = Array(
  (1L, ("Alice", 28)),
  (2L, ("Bob", 27)),
  (3L, ("Charlie", 65)),
  (4L, ("David", 42)),
  (5L, ("Ed", 55)),
  (6L, ("Fran", 50))
  )
val edgeArray = Array(
  Edge(2L, 1L, 7),
  Edge(2L, 4L, 2),
  Edge(3L, 2L, 4),
  Edge(3L, 6L, 3),
  Edge(4L, 1L, 1),
  Edge(5L, 2L, 2),
  Edge(5L, 3L, 8),
  Edge(5L, 6L, 3)
  )

val vertexRDD: RDD[(Long, (String, Int))] = sc.parallelize(vertexArray)
val edgeRDD: RDD[Edge[Int]] = sc.parallelize(edgeArray)

val graph: Graph[(String, Int), Int] = Graph(vertexRDD, edgeRDD)

// Define a class to more clearly model the user property
case class User(name: String, age: Int, inDeg: Int, outDeg: Int)

// Transform the graph
val userGraph = graph.mapVertices{ case (id, (name, age)) => User(name, age, 0, 
0) }
{quote}}}

The following block of code works:

{{{quote}
userGraph.vertices.count
{quote}}}

and the following block of code generates a Kryo error:

{{{quote}
userGraph.vertices.collect
{quote}}}

The error:

{{{quote}
java.lang.StackOverflowError
at 
sun.reflect.UnsafeFieldAccessorImpl.ensureObj(UnsafeFieldAccessorImpl.java:54)
at 
sun.reflect.UnsafeQualifiedObjectFieldAccessorImpl.get(UnsafeQualifiedObjectFieldAccessorImpl.java:38)
at java.lang.reflect.Field.get(Field.java:379)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:552)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
{quote}}}

 

  was:
If Kryo is enabled by setting:

{{{quote}
SPARK_JAVA_OPTS+="-Dspark.serializer=org.apache.spark.serializer.KryoSerializer 
"
SPARK_JAVA_OPTS+="-Dspark.kryo.registrator=org.apache.spark.graphx.GraphKryoRegistrator
  "
{quote}}}

in conf/spark_env.conf and running the following block of code in the shell:

{{{quote}
import org.apache.spark.graphx._
import org.apache.spark.graphx.lib._
import org.apache.spark.rdd.RDD

val vertexArray = Array(
  (1L, ("Alice", 28)),
  (2L, ("Bob", 27)),
  (3L, ("Charlie", 65)),
  (4L, ("David", 42)),
  (5L, ("Ed", 55)),
  (6L, ("Fran", 50))
  )
val edgeArray = Array(
  Edge(2L, 1L, 7),
  Edge(2L, 4L, 2),
  Edge(3L, 2L, 4),
  Edge(3L, 6L, 3),
  Edge(4L, 1L, 1),
  Edge(5L, 2L, 2),
  Edge(5L, 3L, 8),
  Edge(5L, 6L, 3)
  )

val vertexRDD: RDD[(Long, (String, Int))] = sc.parallelize(vertexArray)
val edgeRDD: RDD[Edge[Int]] = sc.parallelize(edgeArray)

val graph: Graph[(String, Int), Int] = Graph(vertexRDD, edgeRDD)

// Define a class to more clearly model the user property
case class User(name: String, age: Int, inDeg: Int, outDeg: Int)

// Transform the graph
val userGraph = graph.mapVertices{ case (id, (name, age)) => User(name, age, 0, 
0) }
{quote}}}

The following block of code works:

{{{quote}
userGraph.vertices.count
{quote}}}

and the following block of code generates a Kryo error:

{{{quote}
degreeGraph.vertices.collect
{quote}}}

The error:

{{{quote}
java.lang.StackOverflowError
at 
sun.reflect.UnsafeFieldAccessorImpl.ensureObj(UnsafeFieldAccessorImpl.java:54)
at 
sun.reflect.UnsafeQualifiedObjectFieldAccessorImpl.get(UnsafeQualifiedObjectFieldAccessorImpl.java:38)
at java.lang.reflect.Field.get(Field.java:379)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:552)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
at com.esotericsoftware.kry

[jira] [Created] (SPARK-1577) GraphX mapVertices with KryoSerialization

2014-04-22 Thread Joseph E. Gonzalez (JIRA)
Joseph E. Gonzalez created SPARK-1577:
-

 Summary: GraphX mapVertices with KryoSerialization
 Key: SPARK-1577
 URL: https://issues.apache.org/jira/browse/SPARK-1577
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Reporter: Joseph E. Gonzalez


If Kryo is enabled by setting:

{code}
SPARK_JAVA_OPTS+="-Dspark.serializer=org.apache.spark.serializer.KryoSerializer "
SPARK_JAVA_OPTS+="-Dspark.kryo.registrator=org.apache.spark.graphx.GraphKryoRegistrator "
{code}

in conf/spark_env.conf and running the following block of code in the shell:

{code}
import org.apache.spark.graphx._
import org.apache.spark.graphx.lib._
import org.apache.spark.rdd.RDD

val vertexArray = Array(
  (1L, ("Alice", 28)),
  (2L, ("Bob", 27)),
  (3L, ("Charlie", 65)),
  (4L, ("David", 42)),
  (5L, ("Ed", 55)),
  (6L, ("Fran", 50))
  )
val edgeArray = Array(
  Edge(2L, 1L, 7),
  Edge(2L, 4L, 2),
  Edge(3L, 2L, 4),
  Edge(3L, 6L, 3),
  Edge(4L, 1L, 1),
  Edge(5L, 2L, 2),
  Edge(5L, 3L, 8),
  Edge(5L, 6L, 3)
  )

val vertexRDD: RDD[(Long, (String, Int))] = sc.parallelize(vertexArray)
val edgeRDD: RDD[Edge[Int]] = sc.parallelize(edgeArray)

val graph: Graph[(String, Int), Int] = Graph(vertexRDD, edgeRDD)

// Define a class to more clearly model the user property
case class User(name: String, age: Int, inDeg: Int, outDeg: Int)

// Transform the graph
val userGraph = graph.mapVertices{ case (id, (name, age)) => User(name, age, 0, 0) }
{code}

The following block of code works:

{code}
userGraph.vertices.count
{code}

and the following block of code generates a Kryo error:

{code}
degreeGraph.vertices.collect
{code}

The error:

{code}
java.lang.StackOverflowError
at 
sun.reflect.UnsafeFieldAccessorImpl.ensureObj(UnsafeFieldAccessorImpl.java:54)
at 
sun.reflect.UnsafeQualifiedObjectFieldAccessorImpl.get(UnsafeQualifiedObjectFieldAccessorImpl.java:38)
at java.lang.reflect.Field.get(Field.java:379)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:552)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
{code}
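
For reference, the same serializer settings can also be applied per application through SparkConf instead of SPARK_JAVA_OPTS; a minimal shell-style sketch using the property names quoted in the report (the app name is illustrative):

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: enable Kryo and the GraphX registrator programmatically, mirroring
// the SPARK_JAVA_OPTS settings from the report above.
val conf = new SparkConf()
  .setAppName("graphx-kryo-repro")  // illustrative app name
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "org.apache.spark.graphx.GraphKryoRegistrator")
val sc = new SparkContext(conf)
{code}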

 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1570) Class loading issue when using Spark SQL Java API

2014-04-22 Thread Kan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kan Zhang resolved SPARK-1570.
--

Resolution: Fixed

> Class loading issue when using Spark SQL Java API
> -
>
> Key: SPARK-1570
> URL: https://issues.apache.org/jira/browse/SPARK-1570
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, SQL
>Affects Versions: 1.0.0
>Reporter: Kan Zhang
>Assignee: Kan Zhang
>Priority: Blocker
> Fix For: 1.0.0
>
>
> ClassNotFoundException in Executor when running JavaSparkSQL example using 
> spark-submit in local mode.
> 14/04/22 12:26:20 ERROR Executor: Exception in task ID 0
> java.lang.ClassNotFoundException: 
> org.apache.spark.examples.sql.JavaSparkSQL.Person
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:190)
>   at 
> org.apache.spark.sql.api.java.JavaSQLContext$$anonfun$1.apply(JavaSQLContext.scala:90)
>   at 
> org.apache.spark.sql.api.java.JavaSQLContext$$anonfun$1.apply(JavaSQLContext.scala:88)
>   at org.apache.spark.rdd.RDD$$anonfun$3.apply(RDD.scala:512)
>   at org.apache.spark.rdd.RDD$$anonfun$3.apply(RDD.scala:512)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-1576) Passing of JAVA_OPTS to YARN on command line

2014-04-22 Thread Nishkam Ravi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13977472#comment-13977472
 ] 

Nishkam Ravi edited comment on SPARK-1576 at 4/22/14 9:36 PM:
--

Patch attached.


was (Author: nravi):
Attached is a patch for this. Please review and suggest changes.

> Passing of JAVA_OPTS to YARN on command line
> 
>
> Key: SPARK-1576
> URL: https://issues.apache.org/jira/browse/SPARK-1576
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 0.9.0, 1.0.0, 0.9.1
>Reporter: Nishkam Ravi
> Fix For: 0.9.0, 1.0.0, 0.9.1
>
> Attachments: SPARK-1576.patch
>
>
> JAVA_OPTS can be passed by using either env variables (i.e., SPARK_JAVA_OPTS) 
> or as config vars (after Patrick's recent change). It would be good to allow 
> the user to pass them on command line as well to restrict scope to single 
> application invocation.
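
As context for the request above, a hedged sketch of the config-var route the description mentions; the property name spark.executor.extraJavaOptions is an assumption for illustration, not something this issue confirms:

{code}
import org.apache.spark.SparkConf

// Sketch only: scoping JVM options to a single application via a config var set
// in the driver program. The property name below is assumed, not confirmed here.
val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions", "-XX:+PrintGCDetails")  // illustrative option
{code}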



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1576) Passing of JAVA_OPTS to YARN on command line

2014-04-22 Thread Nishkam Ravi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishkam Ravi updated SPARK-1576:


Attachment: SPARK-1576.patch

> Passing of JAVA_OPTS to YARN on command line
> 
>
> Key: SPARK-1576
> URL: https://issues.apache.org/jira/browse/SPARK-1576
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 0.9.0, 1.0.0, 0.9.1
>Reporter: Nishkam Ravi
> Fix For: 0.9.0, 1.0.0, 0.9.1
>
> Attachments: SPARK-1576.patch
>
>
> JAVA_OPTS can be passed by using either env variables (i.e., SPARK_JAVA_OPTS) 
> or as config vars (after Patrick's recent change). It would be good to allow 
> the user to pass them on command line as well to restrict scope to single 
> application invocation.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-1576) Passing of JAVA_OPTS to YARN on command line

2014-04-22 Thread Nishkam Ravi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13977472#comment-13977472
 ] 

Nishkam Ravi edited comment on SPARK-1576 at 4/22/14 9:35 PM:
--

Attached is a patch for this. Please review and suggest changes.


was (Author: nravi):
Attached is a patch for this. 

> Passing of JAVA_OPTS to YARN on command line
> 
>
> Key: SPARK-1576
> URL: https://issues.apache.org/jira/browse/SPARK-1576
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 0.9.0, 1.0.0, 0.9.1
>Reporter: Nishkam Ravi
> Fix For: 0.9.0, 1.0.0, 0.9.1
>
> Attachments: SPARK-1576.patch
>
>
> JAVA_OPTS can be passed by using either env variables (i.e., SPARK_JAVA_OPTS) 
> or as config vars (after Patrick's recent change). It would be good to allow 
> the user to pass them on command line as well to restrict scope to single 
> application invocation.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1576) Passing of JAVA_OPTS to YARN on command line

2014-04-22 Thread Nishkam Ravi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13977472#comment-13977472
 ] 

Nishkam Ravi commented on SPARK-1576:
-

Attached is a patch for this. 

> Passing of JAVA_OPTS to YARN on command line
> 
>
> Key: SPARK-1576
> URL: https://issues.apache.org/jira/browse/SPARK-1576
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 0.9.0, 1.0.0, 0.9.1
>Reporter: Nishkam Ravi
> Fix For: 0.9.0, 1.0.0, 0.9.1
>
>
> JAVA_OPTS can be passed by using either env variables (i.e., SPARK_JAVA_OPTS) 
> or as config vars (after Patrick's recent change). It would be good to allow 
> the user to pass them on command line as well to restrict scope to single 
> application invocation.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1576) Passing of JAVA_OPTS to YARN on command line

2014-04-22 Thread Nishkam Ravi (JIRA)
Nishkam Ravi created SPARK-1576:
---

 Summary: Passing of JAVA_OPTS to YARN on command line
 Key: SPARK-1576
 URL: https://issues.apache.org/jira/browse/SPARK-1576
 Project: Spark
  Issue Type: Improvement
Affects Versions: 0.9.0, 1.0.0, 0.9.1
Reporter: Nishkam Ravi
 Fix For: 1.0.0, 0.9.1, 0.9.0


JAVA_OPTS can be passed by using either env variables (i.e., SPARK_JAVA_OPTS) 
or as config vars (after Patrick's recent change). The user should be able to 
pass them on command line as well to restrict scope to single application 
invocation.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1576) Passing of JAVA_OPTS to YARN on command line

2014-04-22 Thread Nishkam Ravi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishkam Ravi updated SPARK-1576:


Description: JAVA_OPTS can be passed by using either env variables (i.e., 
SPARK_JAVA_OPTS) or as config vars (after Patrick's recent change). It would be 
good to allow the user to pass them on command line as well to restrict scope 
to single application invocation.  (was: JAVA_OPTS can be passed by using 
either env variables (i.e., SPARK_JAVA_OPTS) or as config vars (after Patrick's 
recent change). The user should be able to pass them on command line as well to 
restrict scope to single application invocation.)

> Passing of JAVA_OPTS to YARN on command line
> 
>
> Key: SPARK-1576
> URL: https://issues.apache.org/jira/browse/SPARK-1576
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 0.9.0, 1.0.0, 0.9.1
>Reporter: Nishkam Ravi
> Fix For: 0.9.0, 1.0.0, 0.9.1
>
>
> JAVA_OPTS can be passed by using either env variables (i.e., SPARK_JAVA_OPTS) 
> or as config vars (after Patrick's recent change). It would be good to allow 
> the user to pass them on command line as well to restrict scope to single 
> application invocation.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1575) failing tests with master branch

2014-04-22 Thread Nishkam Ravi (JIRA)
Nishkam Ravi created SPARK-1575:
---

 Summary: failing tests with master branch 
 Key: SPARK-1575
 URL: https://issues.apache.org/jira/browse/SPARK-1575
 Project: Spark
  Issue Type: Test
Reporter: Nishkam Ravi


Built the master branch against Hadoop version 2.3.0-cdh5.0.0 with 
SPARK_YARN=true. sbt tests don't go through successfully (tried multiple runs).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1574) ec2/spark_ec2.py should provide option to control number of attempts for ssh operations

2014-04-22 Thread Art Peel (JIRA)
Art Peel created SPARK-1574:
---

 Summary: ec2/spark_ec2.py should provide option to control number 
of attempts for ssh operations
 Key: SPARK-1574
 URL: https://issues.apache.org/jira/browse/SPARK-1574
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Affects Versions: 0.9.0
Reporter: Art Peel
Priority: Minor


EC2 instances are sometimes slow to start up.  When this happens, generating the 
cluster ssh key or sending the generated cluster key to the slaves can fail due 
to an ssh timeout.

The script currently hard-codes the number of tries for ssh operations as 2.

For more flexibility, it should be possible to specify the number of tries with 
a command-line option, --num-ssh-tries, that defaults to 2 to keep the current 
behavior if not provided.





--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1573) slight modification with regards to sbt/sbt test

2014-04-22 Thread Nishkam Ravi (JIRA)
Nishkam Ravi created SPARK-1573:
---

 Summary: slight modification with regards to sbt/sbt test
 Key: SPARK-1573
 URL: https://issues.apache.org/jira/browse/SPARK-1573
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Reporter: Nishkam Ravi


When the sources are built against a certain Hadoop version with 
SPARK_YARN=true, the same settings seem necessary when running sbt/sbt test. 

For example:
SPARK_HADOOP_VERSION=2.3.0-cdh5.0.0 SPARK_YARN=true sbt/sbt assembly

SPARK_HADOOP_VERSION=2.3.0-cdh5.0.0 SPARK_YARN=true sbt/sbt test

Otherwise build errors and failing tests are seen.




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Assigned] (SPARK-1540) Investigate whether we should require keys in PairRDD to be Comparable

2014-04-22 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia reassigned SPARK-1540:


Assignee: Matei Zaharia

> Investigate whether we should require keys in PairRDD to be Comparable
> --
>
> Key: SPARK-1540
> URL: https://issues.apache.org/jira/browse/SPARK-1540
> Project: Spark
>  Issue Type: New Feature
>Reporter: Matei Zaharia
>Assignee: Matei Zaharia
>Priority: Blocker
> Fix For: 1.0.0
>
>
> This is kind of a bigger change, but it would make it easier to do sort-based 
> versions of external operations later. We might also get away without it. 
> Note that sortByKey() already does require an Ordering or Comparables.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1572) Uncaught IO exceptions in Pyspark kill Executor

2014-04-22 Thread Aaron Davidson (JIRA)
Aaron Davidson created SPARK-1572:
-

 Summary: Uncaught IO exceptions in Pyspark kill Executor
 Key: SPARK-1572
 URL: https://issues.apache.org/jira/browse/SPARK-1572
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.0.0, 0.9.1
Reporter: Aaron Davidson
Assignee: Aaron Davidson


If an exception is thrown in the Python "stdin writer" thread during this line:

{code}
PythonRDD.writeIteratorToStream(parent.iterator(split, context), dataOut)
{code}

(e.g., while reading from an HDFS source) then the exception will be handled by 
the default ThreadUncaughtExceptionHandler, which is set in Executor. The 
default behavior is, unfortunately, to call System.exit().

Ideally, normal exceptions while running a task should not bring down all the 
executors of a Spark cluster.
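
For illustration, a self-contained sketch of the failure mode and the obvious mitigation: catch the exception inside the helper thread's run() so it never reaches the default uncaught-exception handler. All names are illustrative; this is not the actual patch.

{code}
// Sketch: an IOException escaping run() would hit the default uncaught-exception
// handler (which, in Executor, calls System.exit()); catching it inside run()
// lets only the task fail.
object WriterThreadSketch {
  def main(args: Array[String]): Unit = {
    val writer = new Thread("stdin writer sketch") {
      override def run(): Unit = {
        try {
          throw new java.io.IOException("simulated read failure")  // stand-in for an HDFS read error
        } catch {
          case e: java.io.IOException =>
            System.err.println(s"writer thread failed; keeping the failure task-local: $e")
        }
      }
    }
    writer.start()
    writer.join()
  }
}
{code}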



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1571) UnresolvedException when running JavaSparkSQL example

2014-04-22 Thread Kan Zhang (JIRA)
Kan Zhang created SPARK-1571:


 Summary: UnresolvedException when running JavaSparkSQL example
 Key: SPARK-1571
 URL: https://issues.apache.org/jira/browse/SPARK-1571
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.0
Reporter: Kan Zhang


When running the JavaSparkSQL example using spark-submit in local mode (this happens 
after fixing the class loading issue in SPARK-1570).

14/04/22 12:46:47 ERROR Executor: Exception in task ID 0
org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
dataType on unresolved object, tree: 'age
at 
org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(unresolved.scala:49)
at 
org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(unresolved.scala:47)
at 
org.apache.spark.sql.catalyst.expressions.Expression.c2(Expression.scala:203)
at 
org.apache.spark.sql.catalyst.expressions.GreaterThanOrEqual.eval(predicates.scala:142)
at 
org.apache.spark.sql.catalyst.expressions.And.eval(predicates.scala:84)
at 
org.apache.spark.sql.execution.Filter$$anonfun$2$$anonfun$apply$1.apply(basicOperators.scala:43)
at 
org.apache.spark.sql.execution.Filter$$anonfun$2$$anonfun$apply$1.apply(basicOperators.scala:43)




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1571) UnresolvedException when running JavaSparkSQL example

2014-04-22 Thread Kan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kan Zhang updated SPARK-1571:
-

Description: 
When running JavaSparkSQL example using spark-submit in local mode (this 
happens after fixing the class loading issue in SPARK-1570).

14/04/22 12:46:47 ERROR Executor: Exception in task ID 0
org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
dataType on unresolved object, tree: 'age
at 
org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(unresolved.scala:49)
at 
org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(unresolved.scala:47)
at 
org.apache.spark.sql.catalyst.expressions.Expression.c2(Expression.scala:203)
at 
org.apache.spark.sql.catalyst.expressions.GreaterThanOrEqual.eval(predicates.scala:142)
at 
org.apache.spark.sql.catalyst.expressions.And.eval(predicates.scala:84)
at 
org.apache.spark.sql.execution.Filter$$anonfun$2$$anonfun$apply$1.apply(basicOperators.scala:43)
at 
org.apache.spark.sql.execution.Filter$$anonfun$2$$anonfun$apply$1.apply(basicOperators.scala:43)


  was:
When run JavaSparkSQL example using spark-submit in local mode (this happens 
after fixing the class loading issue in SPARK-1570).

14/04/22 12:46:47 ERROR Executor: Exception in task ID 0
org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
dataType on unresolved object, tree: 'age
at 
org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(unresolved.scala:49)
at 
org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(unresolved.scala:47)
at 
org.apache.spark.sql.catalyst.expressions.Expression.c2(Expression.scala:203)
at 
org.apache.spark.sql.catalyst.expressions.GreaterThanOrEqual.eval(predicates.scala:142)
at 
org.apache.spark.sql.catalyst.expressions.And.eval(predicates.scala:84)
at 
org.apache.spark.sql.execution.Filter$$anonfun$2$$anonfun$apply$1.apply(basicOperators.scala:43)
at 
org.apache.spark.sql.execution.Filter$$anonfun$2$$anonfun$apply$1.apply(basicOperators.scala:43)



> UnresolvedException when running JavaSparkSQL example
> -
>
> Key: SPARK-1571
> URL: https://issues.apache.org/jira/browse/SPARK-1571
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.0
>Reporter: Kan Zhang
>
> When running JavaSparkSQL example using spark-submit in local mode (this 
> happens after fixing the class loading issue in SPARK-1570).
> 14/04/22 12:46:47 ERROR Executor: Exception in task ID 0
> org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
> dataType on unresolved object, tree: 'age
>   at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(unresolved.scala:49)
>   at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(unresolved.scala:47)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.c2(Expression.scala:203)
>   at 
> org.apache.spark.sql.catalyst.expressions.GreaterThanOrEqual.eval(predicates.scala:142)
>   at 
> org.apache.spark.sql.catalyst.expressions.And.eval(predicates.scala:84)
>   at 
> org.apache.spark.sql.execution.Filter$$anonfun$2$$anonfun$apply$1.apply(basicOperators.scala:43)
>   at 
> org.apache.spark.sql.execution.Filter$$anonfun$2$$anonfun$apply$1.apply(basicOperators.scala:43)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1570) Class loading issue when using Spark SQL Java API

2014-04-22 Thread Kan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kan Zhang updated SPARK-1570:
-

Description: 
ClassNotFoundException in Executor when running JavaSparkSQL example using 
spark-submit in local mode.

14/04/22 12:26:20 ERROR Executor: Exception in task ID 0
java.lang.ClassNotFoundException: 
org.apache.spark.examples.sql.JavaSparkSQL.Person
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:190)
at 
org.apache.spark.sql.api.java.JavaSQLContext$$anonfun$1.apply(JavaSQLContext.scala:90)
at 
org.apache.spark.sql.api.java.JavaSQLContext$$anonfun$1.apply(JavaSQLContext.scala:88)
at org.apache.spark.rdd.RDD$$anonfun$3.apply(RDD.scala:512)
at org.apache.spark.rdd.RDD$$anonfun$3.apply(RDD.scala:512)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)


  was:
ClassNotFoundException in Executor when running JavaSparkSQL example using 
spark-submit.

14/04/22 12:26:20 ERROR Executor: Exception in task ID 0
java.lang.ClassNotFoundException: 
org.apache.spark.examples.sql.JavaSparkSQL.Person
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:190)
at 
org.apache.spark.sql.api.java.JavaSQLContext$$anonfun$1.apply(JavaSQLContext.scala:90)
at 
org.apache.spark.sql.api.java.JavaSQLContext$$anonfun$1.apply(JavaSQLContext.scala:88)
at org.apache.spark.rdd.RDD$$anonfun$3.apply(RDD.scala:512)
at org.apache.spark.rdd.RDD$$anonfun$3.apply(RDD.scala:512)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)



> Class loading issue when using Spark SQL Java API
> -
>
> Key: SPARK-1570
> URL: https://issues.apache.org/jira/browse/SPARK-1570
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, SQL
>Affects Versions: 1.0.0
>Reporter: Kan Zhang
>Assignee: Kan Zhang
>Priority: Blocker
> Fix For: 1.0.0
>
>
> ClassNotFoundException in Executor when running JavaSparkSQL example using 
> spark-submit in local mode.
> 14/04/22 12:26:20 ERROR Executor: Exception in task ID 0
> java.lang.ClassNotFoundException: 
> org.apache.spark.examples.sql.JavaSparkSQL.Person
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:190)
>   at 
> org.apache.spark.sql.api.java.JavaSQLContext$$anonfun$1.apply(JavaSQLContext.scala:90)
>   at 
> org.apache.spark.sql.api.java.JavaSQLContext$$anonfun$1.apply(JavaSQLContext.scala:88)
>   at org.apache.spark.rdd.RDD$$anonfun$3.apply(RDD.scala:512)
>   at org.apache.spark.rdd.RDD$$anonfun$3.apply(RDD.scala:512)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1570) Class loading issue when using Spark SQL Java API

2014-04-22 Thread Kan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13977279#comment-13977279
 ] 

Kan Zhang commented on SPARK-1570:
--

PR: https://github.com/apache/spark/pull/484

> Class loading issue when using Spark SQL Java API
> -
>
> Key: SPARK-1570
> URL: https://issues.apache.org/jira/browse/SPARK-1570
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, SQL
>Affects Versions: 1.0.0
>Reporter: Kan Zhang
>Assignee: Kan Zhang
>Priority: Blocker
> Fix For: 1.0.0
>
>
> ClassNotFoundException in Executor when running JavaSparkSQL example using 
> spark-submit.
> 14/04/22 12:26:20 ERROR Executor: Exception in task ID 0
> java.lang.ClassNotFoundException: 
> org.apache.spark.examples.sql.JavaSparkSQL.Person
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:190)
>   at 
> org.apache.spark.sql.api.java.JavaSQLContext$$anonfun$1.apply(JavaSQLContext.scala:90)
>   at 
> org.apache.spark.sql.api.java.JavaSQLContext$$anonfun$1.apply(JavaSQLContext.scala:88)
>   at org.apache.spark.rdd.RDD$$anonfun$3.apply(RDD.scala:512)
>   at org.apache.spark.rdd.RDD$$anonfun$3.apply(RDD.scala:512)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1570) Class loading issue when using Spark SQL Java API

2014-04-22 Thread Kan Zhang (JIRA)
Kan Zhang created SPARK-1570:


 Summary: Class loading issue when using Spark SQL Java API
 Key: SPARK-1570
 URL: https://issues.apache.org/jira/browse/SPARK-1570
 Project: Spark
  Issue Type: Bug
  Components: Java API, SQL
Affects Versions: 1.0.0
Reporter: Kan Zhang
Assignee: Kan Zhang
Priority: Blocker
 Fix For: 1.0.0


ClassNotFoundException in Executor when running JavaSparkSQL example using 
spark-submit.

14/04/22 12:26:20 ERROR Executor: Exception in task ID 0
java.lang.ClassNotFoundException: 
org.apache.spark.examples.sql.JavaSparkSQL.Person
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:190)
at 
org.apache.spark.sql.api.java.JavaSQLContext$$anonfun$1.apply(JavaSQLContext.scala:90)
at 
org.apache.spark.sql.api.java.JavaSQLContext$$anonfun$1.apply(JavaSQLContext.scala:88)
at org.apache.spark.rdd.RDD$$anonfun$3.apply(RDD.scala:512)
at org.apache.spark.rdd.RDD$$anonfun$3.apply(RDD.scala:512)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1569) Spark on Yarn, authentication broken by pr299

2014-04-22 Thread Thomas Graves (JIRA)
Thomas Graves created SPARK-1569:


 Summary: Spark on Yarn, authentication broken by pr299
 Key: SPARK-1569
 URL: https://issues.apache.org/jira/browse/SPARK-1569
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.0.0
Reporter: Thomas Graves
Priority: Blocker


https://github.com/apache/spark/pull/299 changed the way configuration was done 
and passed to the executors.  This breaks use of authentication as the executor 
needs to know that authentication is enabled before connecting to the driver.  



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1568) Spark 0.9.0 hangs reading s3

2014-04-22 Thread sam (JIRA)
sam created SPARK-1568:
--

 Summary: Spark 0.9.0 hangs reading s3
 Key: SPARK-1568
 URL: https://issues.apache.org/jira/browse/SPARK-1568
 Project: Spark
  Issue Type: Bug
Reporter: sam


I've tried several jobs now and many of the tasks complete, but then it gets stuck 
and just hangs.  The exact same jobs function perfectly fine if I distcp to 
hdfs first and read from hdfs.

Many thanks



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1554) Update doc overview page to not mention building if you get a pre-built distro

2014-04-22 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1554:
---

Assignee: Patrick Wendell

> Update doc overview page to not mention building if you get a pre-built distro
> --
>
> Key: SPARK-1554
> URL: https://issues.apache.org/jira/browse/SPARK-1554
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Matei Zaharia
>Assignee: Patrick Wendell
> Fix For: 1.0.0
>
>
> SBT assembly takes a long time and we should tell people to skip it if they 
> got a binary build (which will likely be the most common case).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1563) Add package-info.java and package.scala files for all packages that appear in docs

2014-04-22 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-1563:
-

Summary: Add package-info.java and package.scala files for all packages 
that appear in docs  (was: Add package-info.java files for all packages that 
appear in Javadoc)

> Add package-info.java and package.scala files for all packages that appear in 
> docs
> --
>
> Key: SPARK-1563
> URL: https://issues.apache.org/jira/browse/SPARK-1563
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Matei Zaharia
>Assignee: Prashant Sharma
>  Labels: Starter
> Fix For: 1.0.0
>
>
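
For a concrete picture of the Scala half of this task, a minimal package.scala sketch (package name and wording are illustrative only):

{code}
// Minimal package.scala sketch so the package gets its own Scaladoc page;
// the package name and description here are illustrative.
package org.apache

/** Core Spark functionality (illustrative description). */
package object spark {
}
{code}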




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1567) Add language tabs to quick start guide

2014-04-22 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-1567.


Resolution: Fixed

> Add language tabs to quick start guide
> --
>
> Key: SPARK-1567
> URL: https://issues.apache.org/jira/browse/SPARK-1567
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Matei Zaharia
>Assignee: Patrick Wendell
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1567) Add language tabs to quick start guide

2014-04-22 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1567:
---

Assignee: Patrick Wendell

> Add language tabs to quick start guide
> --
>
> Key: SPARK-1567
> URL: https://issues.apache.org/jira/browse/SPARK-1567
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Matei Zaharia
>Assignee: Patrick Wendell
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Assigned] (SPARK-1505) [streaming] Add 0.9 to 1.0 migration guide for streaming receiver

2014-04-22 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das reassigned SPARK-1505:


Assignee: Tathagata Das

> [streaming] Add 0.9 to 1.0 migration guide for streaming receiver
> -
>
> Key: SPARK-1505
> URL: https://issues.apache.org/jira/browse/SPARK-1505
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1564) Add JavaScript into Javadoc to turn ::Experimental:: and such into badges

2014-04-22 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1564:
---

Assignee: Andrew Or

> Add JavaScript into Javadoc to turn ::Experimental:: and such into badges
> -
>
> Key: SPARK-1564
> URL: https://issues.apache.org/jira/browse/SPARK-1564
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Matei Zaharia
>Assignee: Andrew Or
>Priority: Minor
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1563) Add package-info.java files for all packages that appear in Javadoc

2014-04-22 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1563:
---

Assignee: Prashant Sharma

> Add package-info.java files for all packages that appear in Javadoc
> ---
>
> Key: SPARK-1563
> URL: https://issues.apache.org/jira/browse/SPARK-1563
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Matei Zaharia
>Assignee: Prashant Sharma
>  Labels: Starter
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1566) Consolidate the Spark Programming Guide with tabs for all languages

2014-04-22 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-1566:


 Summary: Consolidate the Spark Programming Guide with tabs for all 
languages
 Key: SPARK-1566
 URL: https://issues.apache.org/jira/browse/SPARK-1566
 Project: Spark
  Issue Type: Sub-task
Reporter: Matei Zaharia


Right now it's Scala-only and the other ones say "look at the Scala programming 
guide first".



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1567) Add language tabs to quick start guide

2014-04-22 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-1567:


 Summary: Add language tabs to quick start guide
 Key: SPARK-1567
 URL: https://issues.apache.org/jira/browse/SPARK-1567
 Project: Spark
  Issue Type: Sub-task
Reporter: Matei Zaharia






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1480) Choose classloader consistently inside of Spark codebase

2014-04-22 Thread Kan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13977064#comment-13977064
 ] 

Kan Zhang commented on SPARK-1480:
--

[~pwendell] do you mind posting a link to the PR? Thx.

> Choose classloader consistently inside of Spark codebase
> 
>
> Key: SPARK-1480
> URL: https://issues.apache.org/jira/browse/SPARK-1480
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Patrick Wendell
>Assignee: Patrick Wendell
>Priority: Blocker
> Fix For: 1.0.0
>
>
> The Spark codebase is not always consistent on which class loader it uses 
> when classloaders are explicitly passed to things like serializers. This 
> caused SPARK-1403 and also causes a bug where, when the driver has a modified 
> context class loader, it is not translated correctly in local mode to the 
> (local) executor.
> In most cases what we want is the following behavior:
> 1. If there is a context classloader on the thread, use that.
> 2. Otherwise use the classloader that loaded Spark.
> We should just have a utility function for this and call that function 
> whenever we need to get a classloader.
> Note that SPARK-1403 is a workaround for this exact problem (it sets the 
> context class loader because downstream code assumes it is set). Once this 
> gets fixed in a more general way SPARK-1403 can be reverted.
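
A minimal sketch of the utility function described above (object and method names are illustrative, not the actual implementation):

{code}
// Sketch: prefer the thread's context classloader when set, otherwise fall back
// to the classloader that loaded Spark (approximated here by this object's loader).
object ClassLoaderUtil {
  def preferredClassLoader: ClassLoader =
    Option(Thread.currentThread.getContextClassLoader)
      .getOrElse(getClass.getClassLoader)
}

// usage sketch:
// Class.forName(className, true, ClassLoaderUtil.preferredClassLoader)
{code}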



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1560) PySpark SQL depends on Java 7 only jars

2014-04-22 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-1560.


Resolution: Fixed

> PySpark SQL depends on Java 7 only jars
> ---
>
> Key: SPARK-1560
> URL: https://issues.apache.org/jira/browse/SPARK-1560
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Ahir Reddy
>Priority: Blocker
> Fix For: 1.0.0
>
>
> We need to republish the pickler built with java 7. Details below:
> {code}
> 14/04/19 12:31:29 INFO rdd.HadoopRDD: Input split: 
> file:/Users/ceteri/opt/spark-branch-1.0/examples/src/main/resources/people.txt:0+16
> Exception in thread "Local computation of job 1" 
> java.lang.UnsupportedClassVersionError: net/razorvine/pickle/Unpickler : 
> Unsupported major.minor version 51.0
>   at java.lang.ClassLoader.defineClass1(Native Method)
>   at java.lang.ClassLoader.defineClassCond(ClassLoader.java:637)
>   at java.lang.ClassLoader.defineClass(ClassLoader.java:621)
>   at 
> java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
>   at java.net.URLClassLoader.defineClass(URLClassLoader.java:283)
>   at java.net.URLClassLoader.access$000(URLClassLoader.java:58)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:197)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
>   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
>   at 
> org.apache.spark.api.python.PythonRDD$$anonfun$pythonToJavaMap$1.apply(PythonRDD.scala:295)
>   at 
> org.apache.spark.api.python.PythonRDD$$anonfun$pythonToJavaMap$1.apply(PythonRDD.scala:294)
>   at org.apache.spark.rdd.RDD$$anonfun$3.apply(RDD.scala:518)
>   at org.apache.spark.rdd.RDD$$anonfun$3.apply(RDD.scala:518)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:243)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:234)
>   at 
> org.apache.spark.scheduler.DAGScheduler.runLocallyWithinThread(DAGScheduler.scala:700)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:685)
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-1499) Workers continuously produce failing executors

2014-04-22 Thread Adrian Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13976294#comment-13976294
 ] 

Adrian Wang edited comment on SPARK-1499 at 4/22/14 8:35 AM:
-

Have you looked into the log of the failing worker?
I think there must be a lot of lines like
"
ERROR EndpointWriter: AssociationError [akka.tcp://sparkWorker@slave1:45324] -> 
[akka.tcp://sparkExecutor@slave1:59294]: Error [Association failed with 
[akka.tcp://sparkExecutor@slave1:59294]] [
akka.remote.EndpointAssociationException: Association failed with 
[akka.tcp://sparkExecutor@slave1:59294]
Caused by: 
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: 
Connection refused: slave1/192.168.1.2:59294
"


was (Author: adrian-wang):
Have you look into the log of the failing worker?
I think there must be a lot of lines like
"
ERROR EndpointWriter: AssociationError [akka.tcp://sparkWorker@slave1:45324] -> 
[akka.tcp://sparkExecutor@slave1:59294]: Error [Association failed with 
[akka.tcp://sparkExecutor@slave1:59294]] [
akka.remote.EndpointAssociationException: Association failed with 
[akka.tcp://sparkExecutor@slave1:59294]
Caused by: 
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: 
Connection refused: slave1/192.168.1.2:59294
"

> Workers continuously produce failing executors
> --
>
> Key: SPARK-1499
> URL: https://issues.apache.org/jira/browse/SPARK-1499
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Spark Core
>Affects Versions: 1.0.0, 0.9.1
>Reporter: Aaron Davidson
>
> If a node is in a bad state, such that newly started executors fail on 
> startup or first use, the Standalone Cluster Worker will happily keep 
> spawning new ones. A better behavior would be for a Worker to mark itself as 
> dead if it has had a history of continuously producing erroneous executors, 
> or else to somehow prevent a driver from re-registering executors from the 
> same machine repeatedly.
> Reported on mailing list: 
> http://mail-archives.apache.org/mod_mbox/spark-user/201404.mbox/%3ccal8t0bqjfgtf-vbzjq6yj7ckbl_9p9s0trvew2mvg6zbngx...@mail.gmail.com%3E
> Relevant logs: 
> {noformat}
> 14/04/11 19:06:52 INFO client.AppClient$ClientActor: Executor updated: 
> app-20140411190649-0008/4 is now FAILED (Command exited with code 53)
> 14/04/11 19:06:52 INFO cluster.SparkDeploySchedulerBackend: Executor 
> app-20140411190649-0008/4 removed: Command exited with code 53
> 14/04/11 19:06:52 INFO cluster.SparkDeploySchedulerBackend: Executor 4 
> disconnected, so removing it
> 14/04/11 19:06:52 ERROR scheduler.TaskSchedulerImpl: Lost an executor 4 
> (already removed): Failed to create local directory (bad spark.local.dir?)
> 14/04/11 19:06:52 INFO client.AppClient$ClientActor: Executor added: 
> app-20140411190649-0008/27 on 
> worker-20140409212012-ip-172-31-19-11.us-west-1.compute.internal-58614 
> (ip-172-31-19-11.us-west-1.compute.internal:58614) with 8 cores
> 14/04/11 19:06:52 INFO cluster.SparkDeploySchedulerBackend: Granted executor 
> ID app-20140411190649-0008/27 on hostPort 
> ip-172-31-19-11.us-west-1.compute.internal:58614 with 8 cores, 56.9 GB RAM
> 14/04/11 19:06:52 INFO client.AppClient$ClientActor: Executor updated: 
> app-20140411190649-0008/27 is now RUNNING
> 14/04/11 19:06:52 INFO storage.BlockManagerMasterActor$BlockManagerInfo: 
> Registering block manager ip-172-31-24-76.us-west-1.compute.internal:50256 
> with 32.7 GB RAM
> 14/04/11 19:06:52 INFO metastore.HiveMetaStore: 0: get_table : db=default 
> tbl=wikistats_pd
> 14/04/11 19:06:52 INFO HiveMetaStore.audit: ugi=root  ip=unknown-ip-addr  
> cmd=get_table : db=default tbl=wikistats_pd 
> 14/04/11 19:06:53 DEBUG hive.log: DDL: struct wikistats_pd { string 
> projectcode, string pagename, i32 pageviews, i32 bytes}
> 14/04/11 19:06:53 DEBUG lazy.LazySimpleSerDe: 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe initialized with: 
> columnNames=[projectcode, pagename, pageviews, bytes] columnTypes=[string, 
> string, int, int] separator=[[B@29a81175] nullstring=\N 
> lastColumnTakesRest=false
> shark> 14/04/11 19:06:55 INFO cluster.SparkDeploySchedulerBackend: Registered 
> executor: 
> Actor[akka.tcp://sparkexecu...@ip-172-31-19-11.us-west-1.compute.internal:45248/user/Executor#-1002203295]
>  with ID 27
> show 14/04/11 19:06:56 INFO cluster.SparkDeploySchedulerBackend: Executor 27 
> disconnected, so removing it
> 14/04/11 19:06:56 ERROR scheduler.TaskSchedulerImpl: Lost an executor 27 
> (already removed): remote Akka client disassociated
> 14/04/11 19:06:56 INFO client.AppClient$ClientActor: Executor updated: 
> app-20140411190649-0008/27 is now FAILED (Command exited with code 53)
> 14/04/11 19:06:56 INFO cluster.SparkDeploySchedulerBackend: Executor 
> app-201404111906

[jira] [Updated] (SPARK-1565) Spark examples should be changed given spark-submit

2014-04-22 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1565:
---

Description: 
Example: 
https://github.com/pwendell/spark/commit/3fdad8ada28c1cef9b3b367993057de22295b8ed

Now that we have the spark-submit script we should make some changes to the 
examples:

1. We should always create a SparkConf that sets the app name and use the 
SparkContext constructor that accepts a conf.
2. We shouldn't set the master based on a command line argument, instead users 
can do this on their own with spark-submit.
3. The examples projects should mark spark-core/streaming as a "provided" 
dependency, so that the examples assembly jar does not include spark.

Then users can launch examples like this:

{code}
./bin/spark-submit 
examples/target/scala-2.10/spark-examples-assembly-1.0.0-SNAPSHOT.jar \
  --class org.apache.spark.examples.SparkPi \
  --arg 1000
{code}

  was:
https://github.com/pwendell/spark/commit/3fdad8ada28c1cef9b3b367993057de22295b8ed

Now that we have the spark-submit script we should make some changes to the 
examples:

1. We should always create a SparkConf that sets the app name and use the 
SparkContext constructor that accepts a conf.
2. We shouldn't set the master based on a command line argument, instead users 
can do this on their own with spark-submit.
3. The examples projects should mark spark-core/streaming as a "provided" 
dependency, so that the examples assembly jar does not include spark.

Then users can launch examples like this:

{code}
./bin/spark-submit 
examples/target/scala-2.10/spark-examples-assembly-1.0.0-SNAPSHOT.jar \
  --class org.apache.spark.examples.SparkPi \
  --arg 1000
{code}


> Spark examples should be changed given spark-submit
> ---
>
> Key: SPARK-1565
> URL: https://issues.apache.org/jira/browse/SPARK-1565
> Project: Spark
>  Issue Type: Improvement
>  Components: Examples
>Reporter: Patrick Wendell
>Assignee: Prashant Sharma
> Fix For: 1.0.0
>
>
> Example: 
> https://github.com/pwendell/spark/commit/3fdad8ada28c1cef9b3b367993057de22295b8ed
> Now that we have the spark-submit script we should make some changes to the 
> examples:
> 1. We should always create a SparkConf that sets the app name and use the 
> SparkContext constructor that accepts a conf.
> 2. We shouldn't set the master based on a command line argument, instead 
> users can do this on their own with spark-submit.
> 3. The examples projects should mark spark-core/streaming as a "provided" 
> dependency, so that the examples assembly jar does not include spark.
> Then users can launch examples like this:
> {code}
> ./bin/spark-submit 
> examples/target/scala-2.10/spark-examples-assembly-1.0.0-SNAPSHOT.jar \
>   --class org.apache.spark.examples.SparkPi \
>   --arg 1000
> {code}
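
For points 1 and 2 above, a minimal sketch of what a revised example skeleton could look like (class and app name are illustrative):

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: set only the app name via SparkConf and leave the master to
// spark-submit, per points 1 and 2 in the description above.
object SparkPiSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SparkPi")
    val sc = new SparkContext(conf)
    // ... example logic ...
    sc.stop()
  }
}
{code}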



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1565) Spark examples should be changed given spark-submit

2014-04-22 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-1565:
--

 Summary: Spark examples should be changed given spark-submit
 Key: SPARK-1565
 URL: https://issues.apache.org/jira/browse/SPARK-1565
 Project: Spark
  Issue Type: Improvement
  Components: Examples
Reporter: Patrick Wendell
Assignee: Prashant Sharma
 Fix For: 1.0.0


https://github.com/pwendell/spark/commit/3fdad8ada28c1cef9b3b367993057de22295b8ed

Now that we have the spark-submit script we should make some changes to the 
examples:

1. We should always create a SparkConf that sets the app name and use the 
SparkContext constructor that accepts a conf.
2. We shouldn't set the master based on a command line argument, instead users 
can do this on their own with spark-submit.
3. The examples projects should mark spark-core/streaming as a "provided" 
dependency, so that the examples assembly jar does not include spark.

Then users can launch examples like this:

{code}
./bin/spark-submit 
examples/target/scala-2.10/spark-examples-assembly-1.0.0-SNAPSHOT.jar \
  --class org.apache.spark.examples.SparkPi \
  --arg 1000
{code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem

2014-04-22 Thread Aaron Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13976494#comment-13976494
 ] 

Aaron Davidson commented on SPARK-1529:
---

Hey, looking through the code a little more in depth reveals that the rewind 
and truncate functionality of DiskBlockObjectWriter is actually used during 
shuffle file consolidation. The issue is that we store the metadata for each 
consolidated shuffle file as a consecutive set of offsets into the file (an 
"offset table"). That is, if we have 3 blocks stored in the same file, rather 
than storing 3 pairs of (offset, length), we simply store the offsets and use 
the fact that they're laid out consecutively to reconstruct the lengths. This 
means we can't suffer "holes" in the data structure of partial writes, and thus 
rely on the fact that partial writes (which are not included in the data 
structure right now) are always of size 0.

I think getting around this is pretty straightforward, however: We can simply 
store the offsets of all partial writes in the offset table, and just avoid 
storing them in the index we build to look up the positions of particular map 
tasks in the offset table. This will mean we can reconstruct the lengths 
properly, but most importantly it means we will not think that our failed map 
tasks were successful (because index lookups for them will still fail, even 
though they're in the offset table).

This seems like a pretty clean solution that wraps up our usage of 
FileChannels, save where we mmap files back into memory. We will likely want to 
special-case the blocks to make sure we can mmap them directly when reading 
from the local file system.
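
To make the offset-table reasoning above concrete, a small illustrative sketch (all values are made up):

{code}
// Illustrative only: with blocks laid out back-to-back, storing just the start
// offsets (plus the end of the file) is enough to recover each block's length.
object OffsetTableSketch {
  def main(args: Array[String]): Unit = {
    val offsets = Array(0L, 100L, 250L)   // start offset of each of 3 blocks
    val fileEnd = 400L                    // end of the consolidated file
    val bounds  = offsets :+ fileEnd
    val lengths = bounds.zip(bounds.tail).map { case (start, next) => next - start }
    println(lengths.mkString(", "))       // prints: 100, 150, 150
  }
}
{code}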

> Support setting spark.local.dirs to a hadoop FileSystem 
> 
>
> Key: SPARK-1529
> URL: https://issues.apache.org/jira/browse/SPARK-1529
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Patrick Wendell
>Assignee: Cheng Lian
> Fix For: 1.1.0
>
>
> In some environments, like with MapR, local volumes are accessed through the 
> Hadoop filesystem interface. We should allow setting spark.local.dir to a 
> Hadoop filesystem location. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1561) sbt/sbt assembly generates too many local files

2014-04-22 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13976492#comment-13976492
 ] 

Xiangrui Meng commented on SPARK-1561:
--

Tried with a clean repo, the number is 254924, which is probably fine.

> sbt/sbt assembly generates too many local files
> ---
>
> Key: SPARK-1561
> URL: https://issues.apache.org/jira/browse/SPARK-1561
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 1.0.0
>Reporter: Xiangrui Meng
>
> Running `find ./ | wc -l` after `sbt/sbt assembly` returned 
> 564365
> This hits the default limit of #INode of an 8GB EXT FS (the default volume 
> size for an EC2 instance), which means you can do nothing after 'sbt/sbt 
> assembly` on such a partition.
> Most of the small files are under assembly/target/streams and the same folder 
> under examples/.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1561) sbt/sbt assembly generates too many local files

2014-04-22 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-1561:
-

Description: 
Running `find ./ | wc -l` after `sbt/sbt assembly` returned 

564365

This hits the default limit of #INode of an 8GB EXT FS (the default volume size 
for an EC2 instance), which means you can do nothing after 'sbt/sbt assembly` 
on such a partition.

Most of the small files are under assembly/target/streams and the same folder 
under examples/.


  was:
Running `find ./ | wc -l` after `sbt/sbt assembly` returned 



This hits the default limit of #INode of an 8GB EXT FS (the default volume size 
for an EC2 instance), which means you can do nothing after 'sbt/sbt assembly` 
on such a partition.

Most of the small files are under assembly/target/streams and the same folder 
under examples/.



> sbt/sbt assembly generates too many local files
> ---
>
> Key: SPARK-1561
> URL: https://issues.apache.org/jira/browse/SPARK-1561
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 1.0.0
>Reporter: Xiangrui Meng
>
> Running `find ./ | wc -l` after `sbt/sbt assembly` returned 
> 564365
> This hits the default limit of #INode of an 8GB EXT FS (the default volume 
> size for an EC2 instance), which means you can do nothing after 'sbt/sbt 
> assembly` on such a partition.
> Most of the small files are under assembly/target/streams and the same folder 
> under examples/.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1561) sbt/sbt assembly generates too many local files

2014-04-22 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13976487#comment-13976487
 ] 

Xiangrui Meng commented on SPARK-1561:
--

Tried adding 

{code}
assemblyOption in assembly ~= { _.copy(cacheUnzip = false) },
assemblyOption in assembly ~= { _.copy(cacheOutput = false) },
{code}

to SparkBuild.scala, but it didn't help reduce the number of files.

> sbt/sbt assembly generates too many local files
> ---
>
> Key: SPARK-1561
> URL: https://issues.apache.org/jira/browse/SPARK-1561
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 1.0.0
>Reporter: Xiangrui Meng
>
> Running `find ./ | wc -l` after `sbt/sbt assembly` returned 
> 564365
> This hits the default limit of #INode of an 8GB EXT FS (the default volume 
> size for an EC2 instance), which means you can do nothing after 'sbt/sbt 
> assembly` on such a partition.
> Most of the small files are under assembly/target/streams and the same folder 
> under examples/.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1561) sbt/sbt assembly generates too many local files

2014-04-22 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-1561:
-

Description: 
Running `find ./ | wc -l` after `sbt/sbt assembly` returned 



This hits the default limit of #INode of an 8GB EXT FS (the default volume size 
for an EC2 instance), which means you can do nothing after 'sbt/sbt assembly` 
on such a partition.

Most of the small files are under assembly/target/streams and the same folder 
under examples/.


  was:
Running `find ./ | wc -l` after `sbt/sbt assembly` returned 

564365

This hits the default limit of #INode of an 8GB EXT FS (the default volume size 
for an EC2 instance), which means you can do nothing after 'sbt/sbt assembly` 
on such a partition.

Most of the small files are under assembly/target/streams and the same folder 
under examples/.



> sbt/sbt assembly generates too many local files
> ---
>
> Key: SPARK-1561
> URL: https://issues.apache.org/jira/browse/SPARK-1561
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 1.0.0
>Reporter: Xiangrui Meng
>
> Running `find ./ | wc -l` after `sbt/sbt assembly` returned 
> This hits the default limit of #INode of an 8GB EXT FS (the default volume 
> size for an EC2 instance), which means you can do nothing after 'sbt/sbt 
> assembly` on such a partition.
> Most of the small files are under assembly/target/streams and the same folder 
> under examples/.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1145) Memory mapping with many small blocks can cause JVM allocation failures

2014-04-22 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1145:
---

Reporter: Patrick Wendell  (was: Patrick Cogan)

> Memory mapping with many small blocks can cause JVM allocation failures
> ---
>
> Key: SPARK-1145
> URL: https://issues.apache.org/jira/browse/SPARK-1145
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.9.0
>Reporter: Patrick Wendell
>Assignee: Patrick Wendell
> Fix For: 1.0.0
>
>
> During a shuffle each block or block segment is memory mapped to a file. When 
> the segments are very small and there are a large number of them, the memory 
> maps can start failing and eventually the JVM will terminate. It's not clear 
> exactly what's happening but it appears that when the JVM terminates about 
> 265MB of virtual address space is used by memory mapped files. This doesn't 
> seem affected at all by `-XX:MaxDirectMemorySize` - AFAIK that option just 
> gives the JVM its own self-imposed limit rather than letting it run into 
> OS limits. 
> At the time of JVM failure it appears the overall OS memory becomes scarce, 
> so it's possible there are overheads for each memory mapped file that are 
> adding up here. One overhead is that the memory mapping occurs at the 
> granularity of pages, so if blocks are really small there is natural overhead 
> required to pad to the page boundary.
> In the particular case where I saw this, the JVM was running 4 reducers, each 
> of which was trying to access about 30,000 blocks for a total of 120,000 
> concurrent reads. At about 65,000 open files it failed. In this case 
> each file was about 1000 bytes.
> Users should really be coalescing or using fewer reducers if they have 
> 1000-byte shuffle files, but I expect this to happen nonetheless. My proposal 
> was that if the file is smaller than a few pages, we should just read it into 
> a Java buffer and not bother to memory-map it (a sketch follows below). 
> Memory-mapping huge numbers of small files in the JVM is neither recommended 
> nor good for performance, AFAIK.
> Below is the stack trace:
> {code}
> 14/02/27 08:32:35 ERROR storage.BlockManagerWorker: Exception handling buffer 
> message
> java.io.IOException: Map failed
>   at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:888)
>   at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:89)
>   at 
> org.apache.spark.storage.BlockManager.getLocalBytes(BlockManager.scala:285)
>   at 
> org.apache.spark.storage.BlockManagerWorker.getBlock(BlockManagerWorker.scala:90)
>   at 
> org.apache.spark.storage.BlockManagerWorker.processBlockMessage(BlockManagerWorker.scala:69)
>   at 
> org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:44)
>   at 
> org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:44)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at 
> org.apache.spark.storage.BlockMessageArray.foreach(BlockMessageArray.scala:28)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at 
> org.apache.spark.storage.BlockMessageArray.map(BlockMessageArray.scala:28)
>   at 
> org.apache.spark.storage.BlockManagerWorker.onBlockMessageReceive(BlockManagerWorker.scala:44)
>   at 
> org.apache.spark.storage.BlockManagerWorker$$anonfun$1.apply(BlockManagerWorker.scala:34)
>   at 
> org.apache.spark.storage.BlockManagerWorker$$anonfun$1.apply(BlockManagerWorker.scala:34)
>   at 
> org.apache.spark.network.ConnectionManager.org$apache$spark$network$ConnectionManager$$handleMessage(ConnectionManager.scala:512)
>   at 
> org.apache.spark.network.ConnectionManager$$anon$8.run(ConnectionManager.scala:478)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> {code}
> And the JVM error log had a bunch of entries like this:
> {code}
> 7f4b48f89000-7f4b48f8a000 r--s  ca:30 1622077901 
> /mnt4/spark/spark-local-20140227020022-227c/26/shuffle_0_22312_38
> 7f4b48f8a000-7f4b48f8b000 r--s  ca:20 545892715  
> /mnt3/spark/spark-local-20140227020022-5ef5/3a/shuffle_0_26808_20
> 7f4b48f8b000-7f4b48f8c000 r--s  ca:50 1622480741 
> /mnt2/spark/spark-local-20140227020022-315b/1c/shuffle_0_29013_19
> 7f4b48f8c000-7f4b48f8d0
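
The proposal above - read small segments into a plain buffer and only 
memory-map segments larger than a few pages - could look roughly like the 
sketch below. This is a hedged illustration, not the actual patch: the 
threshold minMapBytes and the readSegment helper are made-up names, and in 
Spark itself the natural home for such a check would be DiskStore.getBytes, 
where the failing map call in the stack trace above originates.

{code}
import java.io.{File, IOException, RandomAccessFile}
import java.nio.ByteBuffer
import java.nio.channels.FileChannel.MapMode

// Illustrative threshold: segments smaller than a couple of 4 KB pages are
// copied into a heap buffer instead of being memory-mapped.
val minMapBytes: Long = 2L * 4096

def readSegment(file: File, offset: Long, length: Long): ByteBuffer = {
  val channel = new RandomAccessFile(file, "r").getChannel
  try {
    if (length < minMapBytes) {
      // Small segment: copy it into a heap ByteBuffer, no mmap bookkeeping.
      val buf = ByteBuffer.allocate(length.toInt)
      channel.position(offset)
      while (buf.remaining() > 0) {
        if (channel.read(buf) == -1) {
          throw new IOException("Reached EOF before reading the full segment")
        }
      }
      buf.flip()
      buf
    } else {
      // Large segment: memory-map it, as before.
      channel.map(MapMode.READ_ONLY, offset, length)
    }
  } finally {
    channel.close()
  }
}
{code}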

[jira] [Updated] (SPARK-1145) Memory mapping with many small blocks can cause JVM allocation failures

2014-04-22 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1145:
---

Assignee: Patrick Wendell  (was: Patrick Cogan)

> Memory mapping with many small blocks can cause JVM allocation failures
> ---
>
> Key: SPARK-1145
> URL: https://issues.apache.org/jira/browse/SPARK-1145
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.9.0
>Reporter: Patrick Cogan
>Assignee: Patrick Wendell
> Fix For: 1.0.0
>
>
> During a shuffle each block or block segment is memory mapped to a file. When 
> the segments are very small and there are a large number of them, the memory 
> maps can start failing and eventually the JVM will terminate. It's not clear 
> exactly what's happening but it appears that when the JVM terminates about 
> 265MB of virtual address space is used by memory mapped files. This doesn't 
> seem affected at all by `-XX:MaxDirectMemorySize` - AFAIK that option just 
> gives the JVM its own self-imposed limit rather than letting it run into 
> OS limits. 
> At the time of JVM failure it appears the overall OS memory becomes scarce, 
> so it's possible there are overheads for each memory mapped file that are 
> adding up here. One overhead is that the memory mapping occurs at the 
> granularity of pages, so if blocks are really small there is natural overhead 
> required to pad to the page boundary.
> In the particular case where I saw this, the JVM was running 4 reducers, each 
> of which was trying to access about 30,000 blocks for a total of 120,000 
> concurrent reads. At about 65,000 open files it failed. In this case 
> each file was about 1000 bytes.
> Users should really be coalescing or using fewer reducers if they have 
> 1000-byte shuffle files, but I expect this to happen nonetheless. My proposal 
> was that if the file is smaller than a few pages, we should just read it into 
> a Java buffer and not bother to memory-map it. Memory-mapping huge numbers of 
> small files in the JVM is neither recommended nor good for performance, AFAIK.
> Below is the stack trace:
> {code}
> 14/02/27 08:32:35 ERROR storage.BlockManagerWorker: Exception handling buffer 
> message
> java.io.IOException: Map failed
>   at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:888)
>   at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:89)
>   at 
> org.apache.spark.storage.BlockManager.getLocalBytes(BlockManager.scala:285)
>   at 
> org.apache.spark.storage.BlockManagerWorker.getBlock(BlockManagerWorker.scala:90)
>   at 
> org.apache.spark.storage.BlockManagerWorker.processBlockMessage(BlockManagerWorker.scala:69)
>   at 
> org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:44)
>   at 
> org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:44)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at 
> org.apache.spark.storage.BlockMessageArray.foreach(BlockMessageArray.scala:28)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at 
> org.apache.spark.storage.BlockMessageArray.map(BlockMessageArray.scala:28)
>   at 
> org.apache.spark.storage.BlockManagerWorker.onBlockMessageReceive(BlockManagerWorker.scala:44)
>   at 
> org.apache.spark.storage.BlockManagerWorker$$anonfun$1.apply(BlockManagerWorker.scala:34)
>   at 
> org.apache.spark.storage.BlockManagerWorker$$anonfun$1.apply(BlockManagerWorker.scala:34)
>   at 
> org.apache.spark.network.ConnectionManager.org$apache$spark$network$ConnectionManager$$handleMessage(ConnectionManager.scala:512)
>   at 
> org.apache.spark.network.ConnectionManager$$anon$8.run(ConnectionManager.scala:478)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> {code}
> And the JVM error log had a bunch of entries like this:
> {code}
> 7f4b48f89000-7f4b48f8a000 r--s  ca:30 1622077901 
> /mnt4/spark/spark-local-20140227020022-227c/26/shuffle_0_22312_38
> 7f4b48f8a000-7f4b48f8b000 r--s  ca:20 545892715  
> /mnt3/spark/spark-local-20140227020022-5ef5/3a/shuffle_0_26808_20
> 7f4b48f8b000-7f4b48f8c000 r--s  ca:50 1622480741 
> /mnt2/spark/spark-local-20140227020022-315b/1c/shuffle_0_29013_19
> 7f4b48f8c000-7f4b48f8d000

[jira] [Updated] (SPARK-1145) Memory mapping with many small blocks can cause JVM allocation failures

2014-04-22 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1145:
---

Fix Version/s: 1.0.0

> Memory mapping with many small blocks can cause JVM allocation failures
> ---
>
> Key: SPARK-1145
> URL: https://issues.apache.org/jira/browse/SPARK-1145
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.9.0
>Reporter: Patrick Cogan
>Assignee: Patrick Cogan
> Fix For: 1.0.0
>
>
> During a shuffle each block or block segment is memory mapped to a file. When 
> the segments are very small and there are a large number of them, the memory 
> maps can start failing and eventually the JVM will terminate. It's not clear 
> exactly what's happening but it appears that when the JVM terminates about 
> 265MB of virtual address space is used by memory mapped files. This doesn't 
> seem affected at all by `-XX:MaxDirectMemorySize` - AFAIK that option just 
> gives the JVM its own self-imposed limit rather than letting it run into 
> OS limits. 
> At the time of JVM failure it appears the overall OS memory becomes scarce, 
> so it's possible there are overheads for each memory mapped file that are 
> adding up here. One overhead is that the memory mapping occurs at the 
> granularity of pages, so if blocks are really small there is natural overhead 
> required to pad to the page boundary.
> In the particular case where I saw this, the JVM was running 4 reducers, each 
> of which was trying to access about 30,000 blocks for a total of 120,000 
> concurrent reads. At about 65,000 open files it failed. In this case 
> each file was about 1000 bytes.
> Users should really be coalescing or using fewer reducers if they have 
> 1000-byte shuffle files, but I expect this to happen nonetheless. My proposal 
> was that if the file is smaller than a few pages, we should just read it into 
> a Java buffer and not bother to memory-map it. Memory-mapping huge numbers of 
> small files in the JVM is neither recommended nor good for performance, AFAIK.
> Below is the stack trace:
> {code}
> 14/02/27 08:32:35 ERROR storage.BlockManagerWorker: Exception handling buffer 
> message
> java.io.IOException: Map failed
>   at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:888)
>   at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:89)
>   at 
> org.apache.spark.storage.BlockManager.getLocalBytes(BlockManager.scala:285)
>   at 
> org.apache.spark.storage.BlockManagerWorker.getBlock(BlockManagerWorker.scala:90)
>   at 
> org.apache.spark.storage.BlockManagerWorker.processBlockMessage(BlockManagerWorker.scala:69)
>   at 
> org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:44)
>   at 
> org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:44)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at 
> org.apache.spark.storage.BlockMessageArray.foreach(BlockMessageArray.scala:28)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at 
> org.apache.spark.storage.BlockMessageArray.map(BlockMessageArray.scala:28)
>   at 
> org.apache.spark.storage.BlockManagerWorker.onBlockMessageReceive(BlockManagerWorker.scala:44)
>   at 
> org.apache.spark.storage.BlockManagerWorker$$anonfun$1.apply(BlockManagerWorker.scala:34)
>   at 
> org.apache.spark.storage.BlockManagerWorker$$anonfun$1.apply(BlockManagerWorker.scala:34)
>   at 
> org.apache.spark.network.ConnectionManager.org$apache$spark$network$ConnectionManager$$handleMessage(ConnectionManager.scala:512)
>   at 
> org.apache.spark.network.ConnectionManager$$anon$8.run(ConnectionManager.scala:478)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> {code}
> And the JVM error log had a bunch of entries like this:
> {code}
> 7f4b48f89000-7f4b48f8a000 r--s  ca:30 1622077901 
> /mnt4/spark/spark-local-20140227020022-227c/26/shuffle_0_22312_38
> 7f4b48f8a000-7f4b48f8b000 r--s  ca:20 545892715  
> /mnt3/spark/spark-local-20140227020022-5ef5/3a/shuffle_0_26808_20
> 7f4b48f8b000-7f4b48f8c000 r--s  ca:50 1622480741 
> /mnt2/spark/spark-local-20140227020022-315b/1c/shuffle_0_29013_19
> 7f4b48f8c000-7f4b48f8d000 r--s  ca:30 10082610