[jira] [Comment Edited] (SPARK-1849) Broken UTF-8 encoded data gets character replacements and thus can't be fixed
[ https://issues.apache.org/jira/browse/SPARK-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14000460#comment-14000460 ] Harry Brundage edited comment on SPARK-1849 at 5/16/14 11:02 PM: - I disagree - the data isn't badly encoded, just not UTF-8 encoded, which when we're talking about data from the internet really isn't all that uncommon. You could extend my specific problem of some lines in the source file being a different encoding to a file entirely encoded in iso-8859-1, which is likely something Spark should deal with considering all the effort put into supporting Windows. I don't think asking users to drop down to writing a custom {{InputFormat}} to deal with the realities of large data is a good move if Spark wants to become the fast and general data processing engine for large scale data. I could certainly use {{sc.hadoopFile}} to load in my data and work with the {{org.apache.hadoop.io.Text}} objects myself, but A) why force everyone dealing with this issue to go through the pain of figuring that out, and B) I'm in PySpark where I can't actually do that without fancy Py4J trickery. I think encoding issues should be in your face. was (Author: airhorns): I disagree - the data isn't badly encoded, just not UTF-8 encoded, which when we're talking about data from the internet really isn't all that uncommon. You could extend my specific problem of some lines in the source file being a different encoding to a file entirely encoded in iso-8859-1, which is likely something Spark should deal with considering all the effort put into supporting Windows. I don't think asking users to drop down to writing custom {{InputFormat}}s to deal with the realities of large data is a good move if Spark wants to become the fast and general data processing engine for large scale data. 
I could certainly use {{sc.hadoopFile}} to load in my data and work with the {{org.apache.hadoop.io.Text}} objects myself, but A) why force everyone dealing with this issue to go through the pain of figuring that out, and B) I'm in PySpark where I can't actually do that without fancy Py4J trickery. I think encoding issues should be in your face. Broken UTF-8 encoded data gets character replacements and thus can't be fixed --- Key: SPARK-1849 URL: https://issues.apache.org/jira/browse/SPARK-1849 Project: Spark Issue Type: Bug Reporter: Harry Brundage Fix For: 1.0.0, 0.9.1 Attachments: encoding_test I'm trying to process a file which isn't valid UTF-8 data inside hadoop using Spark via {{sc.textFile()}}. Is this possible, and if not, is this a bug that we should fix? It looks like {{HadoopRDD}} uses {{org.apache.hadoop.io.Text.toString}} on all the data it ever reads, which I believe replaces invalid UTF-8 byte sequences with the UTF-8 replacement character, \uFFFD. Some example code mimicking what {{sc.textFile}} does underneath:
{code}
scala> sc.textFile(path).collect()(0)
res8: String = �pple

scala> sc.hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text]).map(pair => pair._2.toString).collect()(0).getBytes()
res9: Array[Byte] = Array(-17, -65, -67, 112, 112, 108, 101)

scala> sc.hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text]).map(pair => pair._2.getBytes).collect()(0)
res10: Array[Byte] = Array(-60, 112, 112, 108, 101)
{code}
In the above example, the first two snippets show the string representation and byte representation of the example line of text. The third snippet shows what happens if you call {{getBytes}} on the {{Text}} object which comes back from hadoop land: we get the real bytes in the file out. Now, I think this is a bug, though you may disagree.
The text inside my file is perfectly valid iso-8859-1 encoded bytes, which I would like to be able to rescue and re-encode into UTF-8, because I want my application to be smart like that. I think Spark should give me the raw broken string so I can re-encode, but I can't get at the original bytes in order to guess at what the source encoding might be, as they have already been replaced. I'm dealing with data from some CDN access logs which are, to put it nicely, diversely encoded, but I think this is a use case Spark should fully support. So, my suggested fix, for which I'd like some guidance, is to change {{textFile}} to spit out broken strings by not using {{Text}}'s UTF-8 encoding. Further compounding this issue is that my application is actually in PySpark, but we can talk about how bytes fly through to Scala land after this if we agree that this is an issue at all. -- This message was sent by Atlassian JIRA (v6.2#6252)
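The replacement behaviour described above can be reproduced outside Spark. A minimal sketch in plain Scala (no Spark needed), using the byte values from the example: once {{Text.toString}}-style decoding has run, the original byte is gone, whereas decoding the raw bytes with the right charset recovers the text.

```scala
import java.nio.charset.StandardCharsets

// The bytes of "Äpple" in iso-8859-1; 0xC4 (-60) begins an invalid UTF-8 sequence here.
val raw = Array[Byte](-60, 112, 112, 108, 101)

// What Text.toString effectively does: the malformed byte becomes U+FFFD, irreversibly.
val lossy = new String(raw, StandardCharsets.UTF_8)
assert(lossy == "\uFFFDpple")

// With the raw bytes still in hand, the right charset recovers the original text.
val recovered = new String(raw, StandardCharsets.ISO_8859_1)
assert(recovered == "\u00C4pple")
```

There is no inverse mapping from U+FFFD back to the source byte, which is why access to {{Text}}'s bytes (or a bytes-returning variant of {{textFile}}) matters.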
[jira] [Updated] (SPARK-1800) Add broadcast hash join operator
[ https://issues.apache.org/jira/browse/SPARK-1800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-1800: Fix Version/s: 1.1.0 Add broadcast hash join operator Key: SPARK-1800 URL: https://issues.apache.org/jira/browse/SPARK-1800 Project: Spark Issue Type: Improvement Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust Fix For: 1.1.0
[jira] [Commented] (SPARK-1808) bin/pyspark does not load default configuration properties
[ https://issues.apache.org/jira/browse/SPARK-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13999468#comment-13999468 ] Andrew Or commented on SPARK-1808: -- https://github.com/apache/spark/pull/799 bin/pyspark does not load default configuration properties -- Key: SPARK-1808 URL: https://issues.apache.org/jira/browse/SPARK-1808 Project: Spark Issue Type: Bug Affects Versions: 1.0.0 Reporter: Andrew Or Fix For: 1.0.1 ... because it doesn't go through spark-submit. Either we make it go through spark-submit (hard), or we extract the default-configuration loading logic and set those properties for the JVM that launches the py4j GatewayServer (easier). Right now, the only way to set config values for bin/pyspark is to do it through SPARK_JAVA_OPTS in spark-env.sh, which is supposedly deprecated.
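The "easier" option amounts to re-reading the defaults file before launching the GatewayServer JVM. A rough sketch of that parsing step ({{loadDefaults}} is an illustrative name, not Spark's API; the spark-defaults.conf format is one whitespace-separated key/value pair per line, with '#' starting a comment):

```scala
// Illustrative sketch: parse spark-defaults.conf-style text into a property map.
def loadDefaults(text: String): Map[String, String] =
  text.split("\n").iterator
    .map(_.trim)
    .filter(line => line.nonEmpty && !line.startsWith("#")) // skip blanks and comments
    .map { line =>
      val Array(key, value) = line.split("\\s+", 2) // key, then the rest of the line
      key -> value
    }
    .toMap

val conf = loadDefaults("spark.master local[2]\n# a comment\n\nspark.executor.memory 512m")
assert(conf("spark.master") == "local[2]")
```

The resulting map could then be turned into `-D` system properties for the launched JVM.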
[jira] [Updated] (SPARK-1601) CacheManager#getOrCompute() does not return an InterruptibleIterator
[ https://issues.apache.org/jira/browse/SPARK-1601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1601: --- Assignee: Reynold Xin (was: Aaron Davidson) CacheManager#getOrCompute() does not return an InterruptibleIterator Key: SPARK-1601 URL: https://issues.apache.org/jira/browse/SPARK-1601 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0, 0.9.1 Reporter: Aaron Davidson Assignee: Reynold Xin Fix For: 1.0.0 When getOrCompute goes down the compute path for an RDD that should be stored in memory, it returns an iterator over an array, which is not interruptible. This mainly means that any consumers of that iterator, which may consume slowly, will not be interrupted in a timely manner.
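For context, a minimal sketch of what an interruption-aware wrapper looks like. This is illustrative only: Spark's real {{InterruptibleIterator}} consults the {{TaskContext}} kill flag, while the thread interrupt flag here is just a stand-in to show the shape.

```scala
// Sketch: check for cancellation on every hasNext before delegating.
class CheckedIterator[T](underlying: Iterator[T]) extends Iterator[T] {
  override def hasNext: Boolean = {
    if (Thread.currentThread().isInterrupted)
      throw new InterruptedException("task was killed")
    underlying.hasNext
  }
  override def next(): T = underlying.next()
}

// A slow consumer now observes cancellation at each step instead of never.
assert(new CheckedIterator(Iterator(1, 2, 3)).sum == 6)
```

A plain `array.iterator` has no such check, which is why consumers of the cached result cannot be interrupted in a timely manner.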
[jira] [Updated] (SPARK-1553) Support alternating nonnegative least-squares
[ https://issues.apache.org/jira/browse/SPARK-1553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1553: - Priority: Major (was: Minor) Support alternating nonnegative least-squares - Key: SPARK-1553 URL: https://issues.apache.org/jira/browse/SPARK-1553 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 0.9.0 Reporter: Tor Myklebust Assignee: Tor Myklebust Fix For: 1.1.0 There's already an ALS implementation. It can be tweaked to support nonnegative least-squares by conditionally running a nonnegative least-squares solver instead of the unconstrained least-squares solver.
[jira] [Created] (SPARK-1845) Use AllScalaRegistrar for SparkSqlSerializer to register serializers of Scala collections.
Takuya Ueshin created SPARK-1845: Summary: Use AllScalaRegistrar for SparkSqlSerializer to register serializers of Scala collections. Key: SPARK-1845 URL: https://issues.apache.org/jira/browse/SPARK-1845 Project: Spark Issue Type: Bug Components: SQL Reporter: Takuya Ueshin When I execute {{orderBy}} or {{limit}} on a {{SchemaRDD}} including {{ArrayType}} or {{MapType}}, {{SparkSqlSerializer}} throws the following exception: {quote} com.esotericsoftware.kryo.KryoException: Class cannot be created (missing no-arg constructor): scala.collection.immutable.$colon$colon {quote} or {quote} com.esotericsoftware.kryo.KryoException: Class cannot be created (missing no-arg constructor): scala.collection.immutable.Vector {quote} or {quote} com.esotericsoftware.kryo.KryoException: Class cannot be created (missing no-arg constructor): scala.collection.immutable.HashMap$HashTrieMap {quote} and so on. This is because registrations of serializers for the concrete collection classes are missing in {{SparkSqlSerializer}}. I believe it should use {{AllScalaRegistrar}}, which covers serializers for the concrete {{Seq}} and {{Map}} classes backing {{ArrayType}} and {{MapType}}.
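The failure is structural rather than a Kryo bug: Kryo's default instantiation needs a no-arg constructor, and the immutable collection internals don't have one, which is exactly what {{AllScalaRegistrar}}'s registered serializers work around. A quick check on the JVM (a sketch, using reflection only):

```scala
// scala.:: is the cons cell behind immutable List; its only constructor
// takes (head, tail), so Kryo's default instantiator cannot create it,
// producing the "Class cannot be created (missing no-arg constructor)" error.
val paramCounts = classOf[::[_]].getDeclaredConstructors.map(_.getParameterCount).toSeq
assert(!paramCounts.contains(0))
```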
[jira] [Commented] (SPARK-1849) Broken UTF-8 encoded data gets character replacements and thus can't be fixed
[ https://issues.apache.org/jira/browse/SPARK-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14000460#comment-14000460 ] Harry Brundage commented on SPARK-1849: --- I disagree - the data isn't badly encoded, just not UTF-8 encoded, which when we're talking about data from the internet really isn't all that uncommon. You could extend my specific problem of some lines in the source file being a different encoding to a file entirely encoded in iso-8859-1, which is likely something Spark should deal with considering all the effort put into supporting Windows. I don't think asking users to drop down to writing custom {{InputFormat}}s to deal with the realities of large data is a good move if Spark wants to become the fast and general data processing engine for large scale data. I could certainly use {{sc.hadoopFile}} to load in my data and work with the {{org.apache.hadoop.io.Text}} objects myself, but A) why force everyone dealing with this issue to go through the pain of figuring that out, and B) I'm in PySpark where I can't actually do that without fancy Py4J trickery. I think encoding issues should be in your face. Broken UTF-8 encoded data gets character replacements and thus can't be fixed --- Key: SPARK-1849 URL: https://issues.apache.org/jira/browse/SPARK-1849 Project: Spark Issue Type: Bug Reporter: Harry Brundage Fix For: 1.0.0, 0.9.1 Attachments: encoding_test I'm trying to process a file which isn't valid UTF-8 data inside hadoop using Spark via {{sc.textFile()}}. Is this possible, and if not, is this a bug that we should fix? It looks like {{HadoopRDD}} uses {{org.apache.hadoop.io.Text.toString}} on all the data it ever reads, which I believe replaces invalid UTF-8 byte sequences with the UTF-8 replacement character, \uFFFD. 
Some example code mimicking what {{sc.textFile}} does underneath:
{code}
scala> sc.textFile(path).collect()(0)
res8: String = �pple

scala> sc.hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text]).map(pair => pair._2.toString).collect()(0).getBytes()
res9: Array[Byte] = Array(-17, -65, -67, 112, 112, 108, 101)

scala> sc.hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text]).map(pair => pair._2.getBytes).collect()(0)
res10: Array[Byte] = Array(-60, 112, 112, 108, 101)
{code}
In the above example, the first two snippets show the string representation and byte representation of the example line of text. The third snippet shows what happens if you call {{getBytes}} on the {{Text}} object which comes back from hadoop land: we get the real bytes in the file out. Now, I think this is a bug, though you may disagree. The text inside my file is perfectly valid iso-8859-1 encoded bytes, which I would like to be able to rescue and re-encode into UTF-8, because I want my application to be smart like that. I think Spark should give me the raw broken string so I can re-encode, but I can't get at the original bytes in order to guess at what the source encoding might be, as they have already been replaced. I'm dealing with data from some CDN access logs which are, to put it nicely, diversely encoded, but I think this is a use case Spark should fully support. So, my suggested fix, for which I'd like some guidance, is to change {{textFile}} to spit out broken strings by not using {{Text}}'s UTF-8 encoding. Further compounding this issue is that my application is actually in PySpark, but we can talk about how bytes fly through to Scala land after this if we agree that this is an issue at all.
[jira] [Created] (SPARK-1851) Upgrade Avro dependency to 1.7.6 so Spark can read Avro files
Sandy Ryza created SPARK-1851: - Summary: Upgrade Avro dependency to 1.7.6 so Spark can read Avro files Key: SPARK-1851 URL: https://issues.apache.org/jira/browse/SPARK-1851 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Sandy Ryza Priority: Critical I tried to set up a basic example getting a Spark job to read an Avro container file with Avro specifics. This results in a ClassNotFoundException: can't convert GenericData.Record to com.cloudera.sparkavro.User. The reason is: * When creating records, to decide whether to be specific or generic, Avro tries to load a class with the name specified in the schema. * Initially, executors just have the system jars (which include Avro), and load the app jars dynamically with a URLClassLoader that's set as the context classloader for the task threads. * Avro tries to load the generated classes with SpecificData.class.getClassLoader(), which sidesteps this URLClassLoader and goes up to the AppClassLoader. Avro 1.7.6 has a change (AVRO-987) that falls back to the thread's context classloader when loading via SpecificData.class.getClassLoader() fails. I tested with Avro 1.7.6 and did not observe the problem.
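The fallback that AVRO-987 adds can be sketched as follows ({{loadWithFallback}} is an illustrative name, not Avro's exact code): try the defining classloader first, then fall back to the thread context classloader, which on executors is the URLClassLoader holding the application jars.

```scala
// Illustrative sketch of the AVRO-987 classloading fallback.
def loadWithFallback(name: String): Class[_] =
  try Class.forName(name, true, ClassLoader.getSystemClassLoader)
  catch {
    case _: ClassNotFoundException =>
      // Fall back to the context classloader, where dynamically added
      // application jars are visible.
      Class.forName(name, true, Thread.currentThread().getContextClassLoader)
  }

assert(loadWithFallback("java.lang.String") == classOf[String])
```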
[jira] [Created] (SPARK-1863) Allowing user jars to take precedence over Spark jars does not work as expected
koert kuipers created SPARK-1863: Summary: Allowing user jars to take precedence over Spark jars does not work as expected Key: SPARK-1863 URL: https://issues.apache.org/jira/browse/SPARK-1863 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: koert kuipers Priority: Minor See here: http://apache-spark-user-list.1001560.n3.nabble.com/java-serialization-errors-with-spark-files-userClassPathFirst-true-td5832.html The issue seems to be that within ChildExecutorURLClassLoader, userClassLoader has no visibility into classes managed by parentClassLoader because there is no parent/child relationship. What this means is that if a class is loaded by userClassLoader and it refers to a class loaded by parentClassLoader, you get a NoClassDefFoundError.
[jira] [Commented] (SPARK-1849) Broken UTF-8 encoded data gets character replacements and thus can't be fixed
[ https://issues.apache.org/jira/browse/SPARK-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14000397#comment-14000397 ] Mridul Muralidharan commented on SPARK-1849: Looks like textFile is probably the wrong API to use. You cannot recover from badly encoded data ... Better would be to write your own InputFormat which does what you need. Broken UTF-8 encoded data gets character replacements and thus can't be fixed --- Key: SPARK-1849 URL: https://issues.apache.org/jira/browse/SPARK-1849 Project: Spark Issue Type: Bug Reporter: Harry Brundage Fix For: 1.0.0, 0.9.1 Attachments: encoding_test I'm trying to process a file which isn't valid UTF-8 data inside hadoop using Spark via {{sc.textFile()}}. Is this possible, and if not, is this a bug that we should fix? It looks like {{HadoopRDD}} uses {{org.apache.hadoop.io.Text.toString}} on all the data it ever reads, which I believe replaces invalid UTF-8 byte sequences with the UTF-8 replacement character, \uFFFD. Some example code mimicking what {{sc.textFile}} does underneath:
{code}
scala> sc.textFile(path).collect()(0)
res8: String = �pple

scala> sc.hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text]).map(pair => pair._2.toString).collect()(0).getBytes()
res9: Array[Byte] = Array(-17, -65, -67, 112, 112, 108, 101)

scala> sc.hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text]).map(pair => pair._2.getBytes).collect()(0)
res10: Array[Byte] = Array(-60, 112, 112, 108, 101)
{code}
In the above example, the first two snippets show the string representation and byte representation of the example line of text. The third snippet shows what happens if you call {{getBytes}} on the {{Text}} object which comes back from hadoop land: we get the real bytes in the file out. Now, I think this is a bug, though you may disagree.
The text inside my file is perfectly valid iso-8859-1 encoded bytes, which I would like to be able to rescue and re-encode into UTF-8, because I want my application to be smart like that. I think Spark should give me the raw broken string so I can re-encode, but I can't get at the original bytes in order to guess at what the source encoding might be, as they have already been replaced. I'm dealing with data from some CDN access logs which are, to put it nicely, diversely encoded, but I think this is a use case Spark should fully support. So, my suggested fix, for which I'd like some guidance, is to change {{textFile}} to spit out broken strings by not using {{Text}}'s UTF-8 encoding. Further compounding this issue is that my application is actually in PySpark, but we can talk about how bytes fly through to Scala land after this if we agree that this is an issue at all.
[jira] [Commented] (SPARK-1729) Make Flume pull data from source, rather than the current push model
[ https://issues.apache.org/jira/browse/SPARK-1729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14000574#comment-14000574 ] Hari Shreedharan commented on SPARK-1729: - PR: https://github.com/apache/spark/pull/807 Make Flume pull data from source, rather than the current push model Key: SPARK-1729 URL: https://issues.apache.org/jira/browse/SPARK-1729 Project: Spark Issue Type: Sub-task Components: Streaming Affects Versions: 1.0.0 Reporter: Tathagata Das Assignee: Hari Shreedharan Fix For: 1.1.0 This makes sure that if the Spark executor running the receiver goes down, the new receiver on a new node can still get data from Flume. This is not possible in the current model, as Flume is configured to push data to an executor/worker, and if that worker is down, Flume can't push data.
[jira] [Updated] (SPARK-1863) Allowing user jars to take precedence over Spark jars does not work as expected
[ https://issues.apache.org/jira/browse/SPARK-1863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] koert kuipers updated SPARK-1863: - Description: See here: http://apache-spark-user-list.1001560.n3.nabble.com/java-serialization-errors-with-spark-files-userClassPathFirst-true-td5832.html The issue seems to be that within ChildExecutorURLClassLoader, userClassLoader has no visibility into classes managed by parentClassLoader because there is no parent/child relationship. What this means is that if a class is loaded by userClassLoader and it refers to a class loaded by parentClassLoader, you get a NoClassDefFoundError. was: See here: http://apache-spark-user-list.1001560.n3.nabble.com/java-serialization-errors-with-spark-files-userClassPathFirst-true-td5832.html The issue seems to be that within ChildExecutorURLClassLoader, userClassLoader has no visibility into classes managed by parentClassLoader because there is no parent/child relationship. What this means is that if a class is loaded by userClassLoader and it refers to a class loaded by parentClassLoader, you get a NoClassDefFoundError. When I addressed this by creating a new version of ChildExecutorURLClassLoader that does have the proper parent-child relationship and reverses the loading order inside loadClass, class loading seemed to work fine, but now classes like SparkEnv were loaded by ChildExecutorURLClassLoader, leading to NPEs on SparkEnv.get(). To verify that the issue was that SparkEnv was now loaded by ChildExecutorURLClassLoader, I forced SparkEnv to be loaded by the parent classloader. That didn't help. Then I forced all Spark classes to be loaded by the parent classloader, and that did help.
But it causes even bigger problems: java.lang.LinkageError: loader constraint violation: when resolving overridden method myclass.MyRDD.compute(Lorg/apache/spark/Partition;Lorg/apache/spark/TaskContext;)Lscala/collection/Iterator; the class loader (instance of org/apache/spark/executor/ChildExecutorURLClassLoader) of the current class, myclass/MyRDD, and its superclass loader (instance of sun/misc/Launcher$AppClassLoader), have different Class objects for the type TaskContext;)Lscala/collection/Iterator; used in the signature Allowing user jars to take precedence over Spark jars does not work as expected --- Key: SPARK-1863 URL: https://issues.apache.org/jira/browse/SPARK-1863 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: koert kuipers Priority: Minor See here: http://apache-spark-user-list.1001560.n3.nabble.com/java-serialization-errors-with-spark-files-userClassPathFirst-true-td5832.html The issue seems to be that within ChildExecutorURLClassLoader, userClassLoader has no visibility into classes managed by parentClassLoader because there is no parent/child relationship. What this means is that if a class is loaded by userClassLoader and it refers to a class loaded by parentClassLoader, you get a NoClassDefFoundError.
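One direction the fix could take (a sketch under the assumptions above, not Spark's actual {{ChildExecutorURLClassLoader}}): keep a real parent/child relationship but try the user jars first, so user classes win while anything they reference still resolves through the parent (Spark and JDK classes).

```scala
import java.net.{URL, URLClassLoader}

// Sketch of a child-first loader with a proper parent link. The resolve
// step is omitted for brevity; this is illustrative, not production code.
class ChildFirstClassLoader(urls: Array[URL], parent: ClassLoader)
    extends URLClassLoader(urls, parent) {
  override def loadClass(name: String, resolve: Boolean): Class[_] =
    getClassLoadingLock(name).synchronized {
      val already = findLoadedClass(name)
      if (already != null) already
      else
        try findClass(name) // user jars first
        catch { case _: ClassNotFoundException => super.loadClass(name, resolve) }
    }
}

// Classes absent from the user jars still resolve via the parent chain.
val loader = new ChildFirstClassLoader(Array.empty, ClassLoader.getSystemClassLoader)
assert(loader.loadClass("java.lang.String") == classOf[String])
```

Note this is exactly the design that triggers the LinkageError above when both loaders end up defining Spark classes, so the child would additionally need to delegate all Spark/JDK packages to the parent.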
[jira] [Commented] (SPARK-1864) Classpath not correctly sent to executors.
[ https://issues.apache.org/jira/browse/SPARK-1864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14000587#comment-14000587 ] Michael Armbrust commented on SPARK-1864: - https://github.com/apache/spark/pull/808 Classpath not correctly sent to executors. -- Key: SPARK-1864 URL: https://issues.apache.org/jira/browse/SPARK-1864 Project: Spark Issue Type: Bug Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Blocker Fix For: 1.0.0
[jira] [Created] (SPARK-1864) Classpath not correctly sent to executors.
Michael Armbrust created SPARK-1864: --- Summary: Classpath not correctly sent to executors. Key: SPARK-1864 URL: https://issues.apache.org/jira/browse/SPARK-1864 Project: Spark Issue Type: Bug Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Blocker Fix For: 1.0.0
[jira] [Commented] (SPARK-1782) svd for sparse matrix using ARPACK
[ https://issues.apache.org/jira/browse/SPARK-1782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14000620#comment-14000620 ] Xiangrui Meng commented on SPARK-1782: -- If you need the latest Breeze to use eigs, I would prefer calling ARPACK directly. svd for sparse matrix using ARPACK -- Key: SPARK-1782 URL: https://issues.apache.org/jira/browse/SPARK-1782 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Li Pu Original Estimate: 672h Remaining Estimate: 672h Currently the svd implementation in mllib calls the dense matrix svd in breeze, which has a limitation of fitting n^2 Gram matrix entries in memory (n is the number of rows or number of columns of the matrix, whichever is smaller). In many use cases, the original matrix is sparse but the Gram matrix might not be, and we often need only the largest k singular values/vectors. To make svd really scalable, the memory usage must be proportional to the number of non-zero entries in the matrix. One solution is to call the de facto standard eigen-decomposition package ARPACK. For an input matrix M, we compute a few eigenvalues and eigenvectors of M^t*M (or M*M^t if its size is smaller) using ARPACK, then use the eigenvalues/vectors to reconstruct singular values/vectors. ARPACK has a reverse communication interface. The user provides a function to multiply a square matrix to be decomposed with a dense vector provided by ARPACK, and returns the resulting dense vector to ARPACK. Inside, ARPACK uses an Implicitly Restarted Lanczos Method for symmetric matrices. Outside, what we need to provide are the two matrix-vector multiplications, first M*x then M^t*x. These multiplications can be done in Spark in a distributed manner. The working memory used by ARPACK is O(n*k). When k (the number of desired singular values) is small, it can easily fit into the memory of the master machine.
The overall model is that the master machine runs ARPACK and distributes the matrix-vector multiplications onto the executors in each iteration. I made a PR to breeze with an ARPACK-backed svds interface (https://github.com/scalanlp/breeze/pull/240). The interface takes anything that can be multiplied by a DenseVector. On the Spark/MLlib side, we just need to implement the sparse matrix-vector multiplication. It might take some time to optimize and fully test this implementation, so I set the workload estimate to 4 weeks.
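The two multiplications in the reverse-communication loop can be sketched on a small local matrix ({{multiplyGram}} is an illustrative name; in MLlib each step would be a distributed aggregate over an RDD of rows): given x from ARPACK, return z = M^t * (M * x).

```scala
// Sketch: implicit Gram-matrix product without ever forming M^t * M.
def multiplyGram(m: Array[Array[Double]], x: Array[Double]): Array[Double] = {
  val y = m.map(row => row.indices.map(j => row(j) * x(j)).sum) // y = M * x
  val z = new Array[Double](x.length)
  for (i <- m.indices; j <- x.indices) z(j) += m(i)(j) * y(i)   // z = M^t * y
  z
}

// M = [[1,2],[3,4]], x = [1,1]: y = [3,7], z = [1*3+3*7, 2*3+4*7] = [24, 34]
assert(multiplyGram(Array(Array(1.0, 2.0), Array(3.0, 4.0)), Array(1.0, 1.0)).toSeq == Seq(24.0, 34.0))
```

Only M and a few dense vectors are held in memory, which is what keeps the cost proportional to the non-zero entries.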
[jira] [Updated] (SPARK-1866) Closure cleaner does not null shadowed fields when outer scope is referenced
[ https://issues.apache.org/jira/browse/SPARK-1866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Davidson updated SPARK-1866: -- Description: Take the following example:
{code}
val x = 5
val instances = new org.apache.hadoop.fs.Path("/") /* non-serializable */
sc.parallelize(0 until 10).map { _ =>
  val instances = 3
  (instances, x)
}.collect
{code}
This produces a java.io.NotSerializableException: org.apache.hadoop.fs.Path, despite the fact that the outer instances is not actually used within the closure. If you change the name of the outer variable instances to something else, the code executes correctly, indicating that it is the fact that the two variables share a name that causes the issue. Additionally, if the outer scope is not used (i.e., we do not reference x in the above example), the issue does not appear. was: Take the following example:
{code}
val x = 5
val instances = new org.apache.hadoop.fs.Path("/") /* non-serializable */
sc.parallelize(0 until 10).map { _ =>
  val instances = 3
  (instances, x)
}.collect
{code}
This produces a java.io.NotSerializableException: org.apache.hadoop.fs.Path, despite the fact that the outer instances is not actually used within the closure. If you change the name of the outer variable instances to something else, the code executes correctly, indicating that it is the fact that the two variables share a name that causes the issue.
Closure cleaner does not null shadowed fields when outer scope is referenced Key: SPARK-1866 URL: https://issues.apache.org/jira/browse/SPARK-1866 Project: Spark Issue Type: Bug Affects Versions: 1.0.0 Reporter: Aaron Davidson Priority: Critical Fix For: 1.1.0, 1.0.1 Take the following example:
{code}
val x = 5
val instances = new org.apache.hadoop.fs.Path("/") /* non-serializable */
sc.parallelize(0 until 10).map { _ =>
  val instances = 3
  (instances, x)
}.collect
{code}
This produces a java.io.NotSerializableException: org.apache.hadoop.fs.Path, despite the fact that the outer instances is not actually used within the closure. If you change the name of the outer variable instances to something else, the code executes correctly, indicating that it is the fact that the two variables share a name that causes the issue. Additionally, if the outer scope is not used (i.e., we do not reference x in the above example), the issue does not appear.
[jira] [Resolved] (SPARK-1864) Classpath not correctly sent to executors.
[ https://issues.apache.org/jira/browse/SPARK-1864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-1864. Resolution: Fixed Issue resolved by pull request 808 [https://github.com/apache/spark/pull/808] Classpath not correctly sent to executors. -- Key: SPARK-1864 URL: https://issues.apache.org/jira/browse/SPARK-1864 Project: Spark Issue Type: Bug Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Blocker Fix For: 1.0.0
[jira] [Created] (SPARK-1865) Improve behavior of cleanup of disk state
Aaron Davidson created SPARK-1865: - Summary: Improve behavior of cleanup of disk state Key: SPARK-1865 URL: https://issues.apache.org/jira/browse/SPARK-1865 Project: Spark Issue Type: Improvement Components: Deploy, Spark Core Reporter: Aaron Davidson Right now the behavior of disk cleanup is centered around the exit hook of the executor, which attempts to clean up shuffle files and disk manager blocks, but may fail. We should make this behavior more predictable, perhaps by letting the Standalone Worker clean up the disk state and adding a flag to disable having the executor clean up its own state.