[jira] [Updated] (SPARK-1480) Choose classloader consistently inside of Spark codebase

2014-04-13 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1480:
---

Description: 
The Spark codebase is not always consistent about which class loader it uses 
when class loaders are explicitly passed to things like serializers. This 
caused SPARK-1403 and also causes a bug where, when the driver has a modified 
context class loader, it is not propagated correctly.

In most cases what we want is the following behavior:
1. If there is a context classloader on the thread, use that.
2. Otherwise use the classloader that loaded Spark.

We should just have a utility function for this and call that function whenever 
we need to get a classloader.
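
A minimal sketch of what such a utility could look like (the object and method 
names here are illustrative, not necessarily the helper that was actually added):

  // Sketch only: prefer the thread's context classloader, falling back to the
  // classloader that loaded Spark itself.
  object ClassLoaderUtil {
    def getContextOrSparkClassLoader: ClassLoader = {
      val contextLoader = Thread.currentThread().getContextClassLoader
      if (contextLoader != null) contextLoader else getClass.getClassLoader
    }
  }

Call sites that currently pick a loader ad hoc (e.g. when constructing a 
serializer) would then all go through this single method.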

Note that SPARK-1403 is a workaround for this exact problem (it sets the 
context class loader because downstream code assumes it is set). Once this gets 
fixed in a more general way SPARK-1403 can be reverted.



  was:
The Spark codebase is not always consistent about which class loader it uses 
when class loaders are explicitly passed to things like serializers.

In most cases what we want is the following behavior:
1. If there is a context classloader on the thread, use that.
2. Otherwise use the classloader that loaded Spark.

We should just have a utility function for this and call that function whenever 
we need to get a classloader.

Note that SPARK-1403 is a workaround for this exact problem (it sets the 
context class loader because downstream code assumes it is set). Once this gets 
fixed in a more general way SPARK-1403 can be reverted.




 Choose classloader consistently inside of Spark codebase
 

 Key: SPARK-1480
 URL: https://issues.apache.org/jira/browse/SPARK-1480
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Patrick Wendell
Assignee: Patrick Wendell
Priority: Blocker
 Fix For: 1.0.0


 The Spark codebase is not always consistent about which class loader it uses 
 when class loaders are explicitly passed to things like serializers. This 
 caused SPARK-1403 and also causes a bug where, when the driver has a modified 
 context class loader, it is not propagated correctly.
 In most cases what we want is the following behavior:
 1. If there is a context classloader on the thread, use that.
 2. Otherwise use the classloader that loaded Spark.
 We should just have a utility function for this and call that function 
 whenever we need to get a classloader.
 Note that SPARK-1403 is a workaround for this exact problem (it sets the 
 context class loader because downstream code assumes it is set). Once this 
 gets fixed in a more general way SPARK-1403 can be reverted.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1480) Choose classloader consistently inside of Spark codebase

2014-04-13 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1480:
---

Description: 
The Spark codebase is not always consistent about which class loader it uses 
when class loaders are explicitly passed to things like serializers. This 
caused SPARK-1403 and also causes a bug where, when the driver has a modified 
context class loader, it is not propagated correctly to the (local) executor 
in local mode.

In most cases what we want is the following behavior:
1. If there is a context classloader on the thread, use that.
2. Otherwise use the classloader that loaded Spark.

We should just have a utility function for this and call that function whenever 
we need to get a classloader.

Note that SPARK-1403 is a workaround for this exact problem (it sets the 
context class loader because downstream code assumes it is set). Once this gets 
fixed in a more general way SPARK-1403 can be reverted.



  was:
The Spark codebase is not always consistent about which class loader it uses 
when class loaders are explicitly passed to things like serializers. This 
caused SPARK-1403 and also causes a bug where, when the driver has a modified 
context class loader, it is not propagated correctly.

In most cases what we want is the following behavior:
1. If there is a context classloader on the thread, use that.
2. Otherwise use the classloader that loaded Spark.

We should just have a utility function for this and call that function whenever 
we need to get a classloader.

Note that SPARK-1403 is a workaround for this exact problem (it sets the 
context class loader because downstream code assumes it is set). Once this gets 
fixed in a more general way SPARK-1403 can be reverted.




 Choose classloader consistently inside of Spark codebase
 

 Key: SPARK-1480
 URL: https://issues.apache.org/jira/browse/SPARK-1480
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Patrick Wendell
Assignee: Patrick Wendell
Priority: Blocker
 Fix For: 1.0.0


 The Spark codebase is not always consistent about which class loader it uses 
 when class loaders are explicitly passed to things like serializers. This 
 caused SPARK-1403 and also causes a bug where, when the driver has a modified 
 context class loader, it is not propagated correctly to the (local) executor 
 in local mode.
 In most cases what we want is the following behavior:
 1. If there is a context classloader on the thread, use that.
 2. Otherwise use the classloader that loaded Spark.
 We should just have a utility function for this and call that function 
 whenever we need to get a classloader.
 Note that SPARK-1403 is a workaround for this exact problem (it sets the 
 context class loader because downstream code assumes it is set). Once this 
 gets fixed in a more general way SPARK-1403 can be reverted.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1480) Choose classloader consistently inside of Spark codebase

2014-04-13 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1480:
---

Fix Version/s: (was: 1.1.0)
   1.0.0

 Choose classloader consistently inside of Spark codebase
 

 Key: SPARK-1480
 URL: https://issues.apache.org/jira/browse/SPARK-1480
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Patrick Wendell
Assignee: Patrick Wendell
Priority: Blocker
 Fix For: 1.0.0


 The Spark codebase is not always consistent about which class loader it uses 
 when class loaders are explicitly passed to things like serializers.
 In most cases what we want is the following behavior:
 1. If there is a context classloader on the thread, use that.
 2. Otherwise use the classloader that loaded Spark.
 We should just have a utility function for this and call that function 
 whenever we need to get a classloader.
 Note that SPARK-1403 is a workaround for this exact problem (it sets the 
 context class loader because downstream code assumes it is set). Once this 
 gets fixed in a more general way SPARK-1403 can be reverted.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-1479) building spark on 2.0.0-cdh4.4.0 failed

2014-04-13 Thread witgo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13967769#comment-13967769
 ] 

witgo edited comment on SPARK-1479 at 4/13/14 9:34 AM:
---

The CDH 4.4.0 YARN API has changed. Right now Spark doesn't support cdh4.4, 
cdh4.5, or cdh4.6. However, Spark does support cdh4.3.


was (Author: witgo):
The CDH 4.4.0 YARN API has changed. Right now Spark doesn't support cdh4.4, 
cdh4.5, cdh4.6.

 building spark on 2.0.0-cdh4.4.0 failed
 ---

 Key: SPARK-1479
 URL: https://issues.apache.org/jira/browse/SPARK-1479
 Project: Spark
  Issue Type: Question
 Environment: 2.0.0-cdh4.4.0
 Scala code runner version 2.10.4 -- Copyright 2002-2013, LAMP/EPFL
 spark 0.9.1
 java version 1.6.0_32
Reporter: jackielihf
 Attachments: mvn.log


 [INFO] 
 
 [ERROR] Failed to execute goal 
 net.alchim31.maven:scala-maven-plugin:3.1.5:compile (scala-compile-first) on 
 project spark-yarn-alpha_2.10: Execution scala-compile-first of goal 
 net.alchim31.maven:scala-maven-plugin:3.1.5:compile failed. CompileFailed - 
 [Help 1]
 org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute 
 goal net.alchim31.maven:scala-maven-plugin:3.1.5:compile 
 (scala-compile-first) on project spark-yarn-alpha_2.10: Execution 
 scala-compile-first of goal 
 net.alchim31.maven:scala-maven-plugin:3.1.5:compile failed.
   at 
 org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:225)
   at 
 org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
   at 
 org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
   at 
 org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:84)
   at 
 org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:59)
   at 
 org.apache.maven.lifecycle.internal.LifecycleStarter.singleThreadedBuild(LifecycleStarter.java:183)
   at 
 org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:161)
   at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:320)
   at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:156)
   at org.apache.maven.cli.MavenCli.execute(MavenCli.java:537)
   at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:196)
   at org.apache.maven.cli.MavenCli.main(MavenCli.java:141)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
   at java.lang.reflect.Method.invoke(Method.java:597)
   at 
 org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:290)
   at 
 org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:230)
   at 
 org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:409)
   at 
 org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:352)
 Caused by: org.apache.maven.plugin.PluginExecutionException: Execution 
 scala-compile-first of goal 
 net.alchim31.maven:scala-maven-plugin:3.1.5:compile failed.
   at 
 org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:110)
   at 
 org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:209)
   ... 19 more
 Caused by: Compilation failed
   at sbt.compiler.AnalyzingCompiler.call(AnalyzingCompiler.scala:76)
   at sbt.compiler.AnalyzingCompiler.compile(AnalyzingCompiler.scala:35)
   at sbt.compiler.AnalyzingCompiler.compile(AnalyzingCompiler.scala:29)
   at 
 sbt.compiler.AggressiveCompile$$anonfun$4$$anonfun$compileScala$1$1.apply$mcV$sp(AggressiveCompile.scala:71)
   at 
 sbt.compiler.AggressiveCompile$$anonfun$4$$anonfun$compileScala$1$1.apply(AggressiveCompile.scala:71)
   at 
 sbt.compiler.AggressiveCompile$$anonfun$4$$anonfun$compileScala$1$1.apply(AggressiveCompile.scala:71)
   at 
 sbt.compiler.AggressiveCompile.sbt$compiler$AggressiveCompile$$timed(AggressiveCompile.scala:101)
   at 
 sbt.compiler.AggressiveCompile$$anonfun$4.compileScala$1(AggressiveCompile.scala:70)
   at 
 sbt.compiler.AggressiveCompile$$anonfun$4.apply(AggressiveCompile.scala:88)
   at 
 sbt.compiler.AggressiveCompile$$anonfun$4.apply(AggressiveCompile.scala:60)
   at 
 sbt.inc.IncrementalCompile$$anonfun$doCompile$1.apply(Compile.scala:24)
   at 
 sbt.inc.IncrementalCompile$$anonfun$doCompile$1.apply(Compile.scala:22)
   at sbt.inc.Incremental$.cycle(Incremental.scala:40)
   at 

[jira] [Commented] (SPARK-1482) Potential resource leaks in saveAsHadoopDataset and saveAsNewAPIHadoopDataset

2014-04-13 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13967782#comment-13967782
 ] 

Shixiong Zhu commented on SPARK-1482:
-

PR: https://github.com/apache/spark/pull/400

 Potential resource leaks in saveAsHadoopDataset and saveAsNewAPIHadoopDataset
 -

 Key: SPARK-1482
 URL: https://issues.apache.org/jira/browse/SPARK-1482
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Shixiong Zhu
Priority: Minor
  Labels: easyfix

 writer.close should be put in the finally block to avoid potential 
 resource leaks.
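
A minimal sketch of the suggested fix (simplified and illustrative; the real 
save methods also handle Hadoop commit/abort logic, so this only shows the 
try/finally shape):

  // Simplified illustration: close the writer even if writing a record
  // throws, by moving close() into a finally block.
  trait RecordWriter[K, V] {
    def write(key: K, value: V): Unit
    def close(): Unit
  }

  def writePartition[K, V](records: Iterator[(K, V)], writer: RecordWriter[K, V]): Unit = {
    try {
      records.foreach { case (k, v) => writer.write(k, v) }
    } finally {
      writer.close()  // runs on both success and failure, avoiding the leak
    }
  }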



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1479) building spark on 2.0.0-cdh4.4.0 failed

2014-04-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13967784#comment-13967784
 ] 

Sean Owen commented on SPARK-1479:
--

yarn is the slightly more appropriate profile, but read: 
https://github.com/apache/spark/pull/151
What Spark doesn't quite support is YARN beta, and that's what you've got on 
your hands here.

FWIW I am in favor of the change in this PR to make it all work. Soon, support 
for YARN alpha/beta can just go away anyway.

If you are interested in CDH, the best thing is moving to CDH5, which already 
has Spark set up in standalone mode, and which has YARN stable. It also works 
with CDH 4.6 in standalone mode as a parcel.

 building spark on 2.0.0-cdh4.4.0 failed
 ---

 Key: SPARK-1479
 URL: https://issues.apache.org/jira/browse/SPARK-1479
 Project: Spark
  Issue Type: Question
 Environment: 2.0.0-cdh4.4.0
 Scala code runner version 2.10.4 -- Copyright 2002-2013, LAMP/EPFL
 spark 0.9.1
 java version 1.6.0_32
Reporter: jackielihf
 Attachments: mvn.log


 [INFO] 
 
 [ERROR] Failed to execute goal 
 net.alchim31.maven:scala-maven-plugin:3.1.5:compile (scala-compile-first) on 
 project spark-yarn-alpha_2.10: Execution scala-compile-first of goal 
 net.alchim31.maven:scala-maven-plugin:3.1.5:compile failed. CompileFailed - 
 [Help 1]
 org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute 
 goal net.alchim31.maven:scala-maven-plugin:3.1.5:compile 
 (scala-compile-first) on project spark-yarn-alpha_2.10: Execution 
 scala-compile-first of goal 
 net.alchim31.maven:scala-maven-plugin:3.1.5:compile failed.
   at 
 org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:225)
   at 
 org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
   at 
 org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
   at 
 org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:84)
   at 
 org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:59)
   at 
 org.apache.maven.lifecycle.internal.LifecycleStarter.singleThreadedBuild(LifecycleStarter.java:183)
   at 
 org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:161)
   at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:320)
   at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:156)
   at org.apache.maven.cli.MavenCli.execute(MavenCli.java:537)
   at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:196)
   at org.apache.maven.cli.MavenCli.main(MavenCli.java:141)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
   at java.lang.reflect.Method.invoke(Method.java:597)
   at 
 org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:290)
   at 
 org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:230)
   at 
 org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:409)
   at 
 org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:352)
 Caused by: org.apache.maven.plugin.PluginExecutionException: Execution 
 scala-compile-first of goal 
 net.alchim31.maven:scala-maven-plugin:3.1.5:compile failed.
   at 
 org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:110)
   at 
 org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:209)
   ... 19 more
 Caused by: Compilation failed
   at sbt.compiler.AnalyzingCompiler.call(AnalyzingCompiler.scala:76)
   at sbt.compiler.AnalyzingCompiler.compile(AnalyzingCompiler.scala:35)
   at sbt.compiler.AnalyzingCompiler.compile(AnalyzingCompiler.scala:29)
   at 
 sbt.compiler.AggressiveCompile$$anonfun$4$$anonfun$compileScala$1$1.apply$mcV$sp(AggressiveCompile.scala:71)
   at 
 sbt.compiler.AggressiveCompile$$anonfun$4$$anonfun$compileScala$1$1.apply(AggressiveCompile.scala:71)
   at 
 sbt.compiler.AggressiveCompile$$anonfun$4$$anonfun$compileScala$1$1.apply(AggressiveCompile.scala:71)
   at 
 sbt.compiler.AggressiveCompile.sbt$compiler$AggressiveCompile$$timed(AggressiveCompile.scala:101)
   at 
 sbt.compiler.AggressiveCompile$$anonfun$4.compileScala$1(AggressiveCompile.scala:70)
   at 
 sbt.compiler.AggressiveCompile$$anonfun$4.apply(AggressiveCompile.scala:88)
   at 
 sbt.compiler.AggressiveCompile$$anonfun$4.apply(AggressiveCompile.scala:60)
   

[jira] [Comment Edited] (SPARK-1476) 2GB limit in spark for blocks

2014-04-13 Thread Mridul Muralidharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13967854#comment-13967854
 ] 

Mridul Muralidharan edited comment on SPARK-1476 at 4/13/14 2:45 PM:
-

There are multiple issues at play here:

a) If a block goes beyond 2G, everything fails - this is the case for shuffle, 
cached and 'normal' blocks, irrespective of storage level and/or other config.
In a lot of cases, particularly when the data generated 'increases' after a 
map/flatMap/etc, the user has no control over the size increase in the output 
block for a given input block (think iterations over datasets).

b) Increasing the number of partitions is not always an option (even for the 
subset of use cases where it can be done):
1) It has an impact on the number of intermediate files created while doing a 
shuffle (ulimit constraints, IO performance issues, etc).
2) It does not help when there is skew anyway.

c) The compression codec, serializer used, etc. also have an impact.

d) 2G is an extremely low limit to have on modern hardware: it is a 
particularly severe limitation when we have nodes running with 32G to 64G of 
RAM and TBs of disk space available for Spark.


To address specific points raised above:

A) [~pwendell] MapReduce jobs don't fail when the block size of files 
increases - it might be inefficient, but it still runs (not that I know of any 
case where it actually does, but theoretically I guess it can become 
inefficient). So the analogy does not apply.
Also, 2G is not really an unlimited increase in block size - and in MR, the 
output of a map can easily go a couple of orders of magnitude above 2G, 
whether it is followed by a reduce or not.

B) [~matei] In the specific cases where it was failing, the users were not 
caching the data but going directly to shuffle.
There was no skew from what I see: just the data size per key is high, and 
there are a lot of keys too, btw (as iterations increase and nnz increases).
Note that it was an implementation detail that the data was not being cached - 
it could have been.
Additionally, compression and/or serialization also apply implicitly in this 
case, since it was impacting shuffle - the 2G limit was observed on both the 
map and reduce side (in two different jobs).


In general, our effort is to make Spark a drop-in replacement for most use 
cases which are currently handled via MR/Pig/etc.
Limitations of this sort make it difficult to position Spark as a credible 
alternative.


The current approach we are exploring is to remove all direct references to 
ByteBuffer from Spark (except for parts like ConnectionManager, etc.) and rely 
on a BlockData or similar data structure which encapsulates the data 
corresponding to a block. By default a single ByteBuffer should suffice, but 
when it does not, the class will automatically take care of splitting the data 
across multiple buffers.
Similarly, all references to byte-array-backed streams will need to be 
replaced with a wrapper stream which multiplexes over multiple byte array 
streams.
The performance impact for all 'normal' use cases should be minimal, while 
allowing Spark to be used in cases where the 2G limit is being hit.
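
A rough sketch of what such an abstraction might look like (names and API here 
are purely illustrative, not the actual proposal):

  import java.nio.ByteBuffer

  // Illustrative sketch: a block backed by one or more ByteBuffers, so that
  // the total block size is not capped by a single buffer's 2GB limit.
  class BlockData(private val chunks: Array[ByteBuffer]) {
    // Total size is summed into a Long, so it can exceed Int.MaxValue.
    def size: Long = chunks.map(_.remaining().toLong).sum

    // Expose the underlying buffers (duplicated so positions are not shared).
    def toByteBuffers: Array[ByteBuffer] = chunks.map(_.duplicate())
  }

  object BlockData {
    // Common case: a block that still fits in a single buffer.
    def fromSingleBuffer(buf: ByteBuffer): BlockData = new BlockData(Array(buf))
  }

A wrapper InputStream/OutputStream over such chunks would play the same role 
for the byte-array-backed streams mentioned above.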

The only unknown here is Tachyon integration, where the interface is a 
ByteBuffer - and I am not knowledgeable enough to comment on what the issues 
there would be.


was (Author: mridulm80):

There are multiple issues at play here:

a) If a block goes beyond 2G, everything fails - this is the case for shuffle, 
cached and 'normal' blocks, irrespective of storage level and/or other config.
In a lot of cases, particularly when the data generated 'increases' after a 
map/flatMap/etc, the user has no control over the size increase in the output 
block for a given input block (think iterations over datasets).

b) Increasing the number of partitions is not always an option (even for the 
subset of use cases where it can be done):
1) It has an impact on the number of intermediate files created while doing a 
shuffle (ulimit constraints, IO performance issues, etc).
2) It does not help when there is skew anyway.

c) The compression codec, serializer used, etc. also have an impact.

d) 2G is an extremely low limit to have on modern hardware: it is practically 
a severe limitation when we have nodes running with 32G to 64G of RAM and TBs 
of disk space available for Spark.


To address specific points raised above:

A) [~pwendell] MapReduce jobs don't fail when the block size of files 
increases - it might be inefficient, but it still runs (not that I know of any 
case where it actually does, but theoretically I guess it can). So the analogy 
does not apply.
To add, 2G is not really an unlimited increase in block size - and in MR, the 
output of a map can easily go a couple of orders of magnitude above 2G.

B) [~matei] In the specific cases where it was failing, the users were not 
caching the data but going directly to shuffle.
There was no skew from what we see: just the data size per key is high; 

[jira] [Comment Edited] (SPARK-1476) 2GB limit in spark for blocks

2014-04-13 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13967911#comment-13967911
 ] 

Patrick Wendell edited comment on SPARK-1476 at 4/13/14 7:01 PM:
-

[~mrid...@yahoo-inc.com] I think the proposed change would benefit from a 
design doc to explain exactly the cases we want to fix and what trade-offs we 
are willing to make in terms of complexity.

Agreed that there is definitely room for improvement in the out-of-the-box 
behavior here.

Right now the limits as I understand them are (a) the shuffle output from one 
mapper to one reducer cannot be more than 2GB. (b) partitions of an RDD cannot 
exceed 2GB.

I see (a) as the bigger of the two issues. It would be helpful to have specific 
examples of workloads where this causes a problem and the associated data 
sizes, etc. For instance, say I want to do a 1 Terabyte shuffle. Right now the 
number of (mappers * reducers) needs to be > ~1000 for this to work (e.g. 100 
mappers and 10 reducers) assuming uniform partitioning. That doesn't seem too
crazy of an assumption, but if you have skew this would be a much bigger 
problem. 
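
To make the arithmetic concrete (my own back-of-the-envelope illustration, not 
part of the original comment): under uniform partitioning each map-to-reduce 
shuffle block is roughly totalBytes / (mappers * reducers), so a 1 TB shuffle 
needs mappers * reducers > 512 just to stay under 2GB, and ~1000 leaves some 
headroom.

  // Back-of-the-envelope check (illustration only).
  object ShuffleBlockMath extends App {
    val totalBytes = 1L << 40                 // 1 TB shuffle
    val blockLimit = 2L * (1L << 30)          // 2 GB per map->reduce block
    val minTasks   = (totalBytes + blockLimit - 1) / blockLimit
    println(s"need mappers * reducers >= $minTasks")                    // 512
    val perBlockMB = totalBytes / (100 * 10) / (1L << 20)
    println(s"100 mappers * 10 reducers => ~$perBlockMB MB per block")  // ~1048 MB
  }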

Would it be possible to improve (a) but not (b) with a much simpler design? I'm 
not sure (maybe they reduce to the same problem), but it's something a design 
doc could help flesh out.

Popping up a bit - I think our goal should be to handle reasonable workloads 
and not to be 100% compliant with the semantics of Hadoop MapReduce. After all, 
in-memory RDD's are not even a concept in MapReduce. And keep in mind that 
MapReduce became so bloated/complex of a project that it is today no longer 
possible to make substantial changes to it. That's something we definitely want 
to avoid by keeping Spark internals as simple as possible.


was (Author: pwendell):
[~mrid...@yahoo-inc.com] I think the proposed change would benefit from a 
design doc to explain exactly the cases we want to fix and what trade-offs we 
are willing to make in terms of complexity.

Agreed that there is definitely room for improvement in the out-of-the-box 
behavior here.

Right now the limits as I understand them are (a) the shuffle output from one 
mapper to one reducer cannot be more than 2GB. (b) partitions of an RDD cannot 
exceed 2GB.

I see (a) as the bigger of the two issues. It would be helpful to have specific 
examples of workloads where this causes a problem and the associated data 
sizes, etc. For instance, say I want to do a 1 Terabyte shuffle. Right now the 
number of (mappers * reducers) needs to be > ~1000 for this to work (e.g. 100 
mappers and 10 reducers) assuming uniform partitioning. That doesn't seem too
crazy of an assumption, but if you have skew this would be a much bigger 
problem. 

Popping up a bit - I think our goal should be to handle reasonable workloads 
and not to be 100% compliant with the semantics of Hadoop MapReduce. After all, 
in-memory RDD's are not even a concept in MapReduce. And keep in mind that 
MapReduce became so bloated/complex of a project that it is today no longer 
possible to make substantial changes to it. That's something we definitely want 
to avoid by keeping Spark internals as simple as possible.

 2GB limit in spark for blocks
 -

 Key: SPARK-1476
 URL: https://issues.apache.org/jira/browse/SPARK-1476
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
 Environment: all
Reporter: Mridul Muralidharan
Priority: Critical
 Fix For: 1.1.0


 The underlying abstraction for blocks in Spark is a ByteBuffer, which limits 
 the size of a block to 2GB.
 This has implications not just for managed blocks in use, but also for shuffle 
 blocks (memory-mapped blocks are limited to 2GB, even though the API allows 
 for a long), ser/deser via byte-array-backed output streams (SPARK-1391), etc.
 This is a severe limitation for the use of Spark on non-trivial datasets.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1484) MLlib should warn if you are using an iterative algorithm on non-cached data

2014-04-13 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-1484:


 Summary: MLlib should warn if you are using an iterative algorithm 
on non-cached data
 Key: SPARK-1484
 URL: https://issues.apache.org/jira/browse/SPARK-1484
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Matei Zaharia


Not sure what the best way to warn is, but even printing to the log is probably 
fine. We may want to print at the end of the training run as well as the 
beginning to make it more visible.
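
A minimal sketch of one way to do the check (illustration only, not the patch 
that was eventually merged): look at the input RDD's storage level before 
training and log a warning if it is not persisted.

  import org.apache.spark.rdd.RDD
  import org.apache.spark.storage.StorageLevel

  // Illustration only: warn when an iterative algorithm runs on non-cached input.
  object CachingCheck {
    def warnIfNotCached(input: RDD[_], algorithm: String): Unit = {
      if (input.getStorageLevel == StorageLevel.NONE) {
        // A real implementation would go through MLlib's logger; println keeps
        // this sketch self-contained.
        println(s"WARN: $algorithm is iterative but its input RDD is not cached; " +
          "consider calling cache()/persist() on the input first.")
      }
    }
  }

As the description suggests, the same check could be invoked again at the end 
of the training run so the warning is also visible there.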



--
This message was sent by Atlassian JIRA
(v6.2#6252)