[jira] [Comment Edited] (SPARK-1849) Broken UTF-8 encoded data gets character replacements and thus can't be fixed

2014-05-16 Thread Harry Brundage (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14000460#comment-14000460
 ] 

Harry Brundage edited comment on SPARK-1849 at 5/16/14 11:02 PM:
-

I disagree - the data isn't badly encoded, just not UTF-8 encoded, which, when 
we're talking about data from the internet, really isn't all that uncommon. You 
could extend my specific problem of some lines in the source file being in a 
different encoding to a file entirely encoded in iso-8859-1, which is likely 
something Spark should deal with considering all the effort put into supporting 
Windows. I don't think asking users to drop down to writing a custom 
{{InputFormat}} to deal with the realities of large data is a good move if 
Spark wants to become the fast and general data processing engine for 
large-scale data.

I could certainly use {{sc.hadoopFile}} to load in my data and work with the 
{{org.apache.hadoop.io.Text}} objects myself, but A) why force everyone dealing 
with this issue to go through the pain of figuring that out, and B) I'm in 
PySpark where I can't actually do that without fancy Py4J trickery. I think 
encoding issues should be in your face.


was (Author: airhorns):
I disagree - the data isn't badly encoded, just not UTF-8 encoded, which, when 
we're talking about data from the internet, really isn't all that uncommon. You 
could extend my specific problem of some lines in the source file being in a 
different encoding to a file entirely encoded in iso-8859-1, which is likely 
something Spark should deal with considering all the effort put into supporting 
Windows. I don't think asking users to drop down to writing custom 
{{InputFormat}}s to deal with the realities of large data is a good move if 
Spark wants to become the fast and general data processing engine for 
large-scale data.

I could certainly use {{sc.hadoopFile}} to load in my data and work with the 
{{org.apache.hadoop.io.Text}} objects myself, but A) why force everyone dealing 
with this issue to go through the pain of figuring that out, and B) I'm in 
PySpark where I can't actually do that without fancy Py4J trickery. I think 
encoding issues should be in your face.

 Broken UTF-8 encoded data gets character replacements and thus can't be 
 fixed
 ---

 Key: SPARK-1849
 URL: https://issues.apache.org/jira/browse/SPARK-1849
 Project: Spark
  Issue Type: Bug
Reporter: Harry Brundage
 Fix For: 1.0.0, 0.9.1

 Attachments: encoding_test


 I'm trying to process a file which isn't valid UTF-8 data inside hadoop using 
 Spark via {{sc.textFile()}}. Is this possible, and if not, is this a bug that 
 we should fix? It looks like {{HadoopRDD}} uses 
 {{org.apache.hadoop.io.Text.toString}} on all the data it ever reads, which I 
 believe replaces invalid UTF-8 byte sequences with the UTF-8 replacement 
 character, \uFFFD. Some example code mimicking what {{sc.textFile}} does 
 underneath:
 {code}
 scala> sc.textFile(path).collect()(0)
 res8: String = �pple
 scala> sc.hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], 
 classOf[Text]).map(pair => pair._2.toString).collect()(0).getBytes()
 res9: Array[Byte] = Array(-17, -65, -67, 112, 112, 108, 101)
 scala> sc.hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], 
 classOf[Text]).map(pair => pair._2.getBytes).collect()(0)
 res10: Array[Byte] = Array(-60, 112, 112, 108, 101)
 {code}
 In the above example, the first two snippets show the string representation 
 and byte representation of the example line of text. The third snippet shows 
 what happens if you call {{getBytes}} on the {{Text}} object which comes back 
 from hadoop land: we get the real bytes in the file out.
 Now, I think this is a bug, though you may disagree. The text inside my file 
 is perfectly valid iso-8859-1 encoded bytes, which I would like to be able to 
 rescue and re-encode into UTF-8, because I want my application to be smart 
 like that. I think Spark should give me the raw broken string so I can 
 re-encode, but I can't get at the original bytes in order to guess at what 
 the source encoding might be, as they have already been replaced. I'm dealing 
 with data from some CDN access logs which are, to put it nicely, diversely 
 encoded, but I think that's a use case Spark should fully support. So my 
 suggested fix, which I'd like some guidance on, is to change {{textFile}} to 
 spit out broken strings by not using {{Text}}'s UTF-8 encoding.
 Further compounding this issue is that my application is actually in PySpark, 
 but we can talk about how bytes fly through to Scala land after this if we 
 agree that this is an issue at all. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1800) Add broadcast hash join operator

2014-05-16 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-1800:


Fix Version/s: 1.1.0

 Add broadcast hash join operator
 

 Key: SPARK-1800
 URL: https://issues.apache.org/jira/browse/SPARK-1800
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Michael Armbrust
Assignee: Michael Armbrust
 Fix For: 1.1.0






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1808) bin/pyspark does not load default configuration properties

2014-05-16 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13999468#comment-13999468
 ] 

Andrew Or commented on SPARK-1808:
--

https://github.com/apache/spark/pull/799

 bin/pyspark does not load default configuration properties
 --

 Key: SPARK-1808
 URL: https://issues.apache.org/jira/browse/SPARK-1808
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Andrew Or
 Fix For: 1.0.1


 ... because it doesn't go through spark-submit. Either we make it go through 
 spark-submit (hard), or we extract the logic that loads the default 
 configurations and set them for the JVM that launches the Py4J GatewayServer 
 (easier).
 Right now, the only way to set config values for bin/pyspark is to do it 
 through SPARK_JAVA_OPTS in spark-env.sh, which is supposedly deprecated.
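
A rough sketch of the "extract the load-defaults logic" option described above 
(purely illustrative; the file location, parsing rules, and helper name are 
assumptions, not Spark's actual implementation): read conf/spark-defaults.conf 
as whitespace-separated key/value pairs so the code that launches the Py4J 
GatewayServer JVM could pass them along as system properties.

{code}
import java.io.File
import scala.io.Source

// Hypothetical helper: parse conf/spark-defaults.conf (key<whitespace>value,
// '#' starts a comment) into a map the gateway-launching code could turn
// into -Dkey=value JVM options.
def loadDefaultSparkProperties(sparkHome: String): Map[String, String] = {
  val file = new File(sparkHome, "conf/spark-defaults.conf")
  if (!file.isFile) {
    Map.empty
  } else {
    Source.fromFile(file).getLines()
      .map(_.trim)
      .filter(line => line.nonEmpty && !line.startsWith("#"))
      .flatMap { line =>
        line.split("\\s+", 2) match {
          case Array(key, value) => Some(key -> value)
          case _                 => None // ignore malformed lines in this sketch
        }
      }
      .toMap
  }
}
{code}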



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1601) CacheManager#getOrCompute() does not return an InterruptibleIterator

2014-05-16 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1601:
---

Assignee: Reynold Xin  (was: Aaron Davidson)

 CacheManager#getOrCompute() does not return an InterruptibleIterator
 

 Key: SPARK-1601
 URL: https://issues.apache.org/jira/browse/SPARK-1601
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0, 0.9.1
Reporter: Aaron Davidson
Assignee: Reynold Xin
 Fix For: 1.0.0


 When getOrCompute goes down the compute path for an RDD that should be 
 stored in memory, it returns an iterator over an array, which is not 
 interruptible. This mainly means that any consumers of that iterator, which 
 may consume slowly, will not be interrupted in a timely manner.
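
For context, a minimal sketch of the kind of change being proposed (not the 
actual CacheManager code); {{context}} and {{values}} stand in for what 
getOrCompute already has in scope:

{code}
import org.apache.spark.{InterruptibleIterator, TaskContext}

// Wrap the iterator over the cached array so task-kill flags are honoured
// while a (possibly slow) consumer drains it.
def interruptible[T](context: TaskContext, values: Array[T]): Iterator[T] =
  new InterruptibleIterator[T](context, values.iterator)
{code}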



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1553) Support alternating nonnegative least-squares

2014-05-16 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-1553:
-

Priority: Major  (was: Minor)

 Support alternating nonnegative least-squares
 -

 Key: SPARK-1553
 URL: https://issues.apache.org/jira/browse/SPARK-1553
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 0.9.0
Reporter: Tor Myklebust
Assignee: Tor Myklebust
 Fix For: 1.1.0


 There's already an ALS implementation.  It can be tweaked to support 
 nonnegative least-squares by conditionally running a nonnegative 
 least-squares solver instead of the unconstrained least-squares solver.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1845) Use AllScalaRegistrar for SparkSqlSerializer to register serializers of Scala collections.

2014-05-16 Thread Takuya Ueshin (JIRA)
Takuya Ueshin created SPARK-1845:


 Summary: Use AllScalaRegistrar for SparkSqlSerializer to register 
serializers of Scala collections.
 Key: SPARK-1845
 URL: https://issues.apache.org/jira/browse/SPARK-1845
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Takuya Ueshin


When I execute {{orderBy}} or {{limit}} for {{SchemaRDD}} including 
{{ArrayType}} or {{MapType}}, {{SparkSqlSerializer}} throws the following 
exception:

{quote}
com.esotericsoftware.kryo.KryoException: Class cannot be created (missing 
no-arg constructor): scala.collection.immutable.$colon$colon
{quote}

or

{quote}
com.esotericsoftware.kryo.KryoException: Class cannot be created (missing 
no-arg constructor): scala.collection.immutable.Vector
{quote}

or

{quote}
com.esotericsoftware.kryo.KryoException: Class cannot be created (missing 
no-arg constructor): scala.collection.immutable.HashMap$HashTrieMap
{quote}

and so on.

This is because registrations of serializers for the concrete collection classes 
are missing in {{SparkSqlSerializer}}.
I believe it should use {{AllScalaRegistrar}}.
{{AllScalaRegistrar}} covers serializers for many concrete {{Seq}} and {{Map}} 
classes, which back {{ArrayType}} and {{MapType}}.
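
A minimal sketch of the suggestion, assuming the Twitter chill dependency Spark 
already ships with (illustrative only, not the actual {{SparkSqlSerializer}} 
patch):

{code}
import com.esotericsoftware.kryo.Kryo
import com.twitter.chill.AllScalaRegistrar

// Register serializers for the concrete Scala collection classes
// (::, Vector, HashMap$HashTrieMap, ...) on top of any existing registrations.
def registerScalaCollections(kryo: Kryo): Kryo = {
  new AllScalaRegistrar().apply(kryo)
  kryo
}
{code}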



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1849) Broken UTF-8 encoded data gets character replacements and thus can't be fixed

2014-05-16 Thread Harry Brundage (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14000460#comment-14000460
 ] 

Harry Brundage commented on SPARK-1849:
---

I disagree - the data isn't badly encoded, just not UTF-8 encoded, which, when 
we're talking about data from the internet, really isn't all that uncommon. You 
could extend my specific problem of some lines in the source file being in a 
different encoding to a file entirely encoded in iso-8859-1, which is likely 
something Spark should deal with considering all the effort put into supporting 
Windows. I don't think asking users to drop down to writing custom 
{{InputFormat}}s to deal with the realities of large data is a good move if 
Spark wants to become the fast and general data processing engine for 
large-scale data.

I could certainly use {{sc.hadoopFile}} to load in my data and work with the 
{{org.apache.hadoop.io.Text}} objects myself, but A) why force everyone dealing 
with this issue to go through the pain of figuring that out, and B) I'm in 
PySpark where I can't actually do that without fancy Py4J trickery. I think 
encoding issues should be in your face.
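
For anyone hitting this before a fix lands, a minimal Scala sketch of the 
{{sc.hadoopFile}} workaround mentioned above ({{path}} and the iso-8859-1 choice 
are placeholders; note that {{Text}} objects are reused by the record reader, so 
the bytes must be copied before decoding):

{code}
import java.nio.charset.Charset
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

// Read raw line bytes, then decode with a charset of our choosing instead of
// letting Text.toString force UTF-8 with replacement characters.
val lines = sc.hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
  .map { case (_, text) =>
    // Copy only the valid region; Text's backing array is reused and padded.
    java.util.Arrays.copyOfRange(text.getBytes, 0, text.getLength)
  }
  .map(bytes => new String(bytes, Charset.forName("ISO-8859-1")))
{code}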

 Broken UTF-8 encoded data gets character replacements and thus can't be 
 fixed
 ---

 Key: SPARK-1849
 URL: https://issues.apache.org/jira/browse/SPARK-1849
 Project: Spark
  Issue Type: Bug
Reporter: Harry Brundage
 Fix For: 1.0.0, 0.9.1

 Attachments: encoding_test


 I'm trying to process a file which isn't valid UTF-8 data inside hadoop using 
 Spark via {{sc.textFile()}}. Is this possible, and if not, is this a bug that 
 we should fix? It looks like {{HadoopRDD}} uses 
 {{org.apache.hadoop.io.Text.toString}} on all the data it ever reads, which I 
 believe replaces invalid UTF-8 byte sequences with the UTF-8 replacement 
 character, \uFFFD. Some example code mimicking what {{sc.textFile}} does 
 underneath:
 {code}
 scala> sc.textFile(path).collect()(0)
 res8: String = �pple
 scala> sc.hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], 
 classOf[Text]).map(pair => pair._2.toString).collect()(0).getBytes()
 res9: Array[Byte] = Array(-17, -65, -67, 112, 112, 108, 101)
 scala> sc.hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], 
 classOf[Text]).map(pair => pair._2.getBytes).collect()(0)
 res10: Array[Byte] = Array(-60, 112, 112, 108, 101)
 {code}
 In the above example, the first two snippets show the string representation 
 and byte representation of the example line of text. The third snippet shows 
 what happens if you call {{getBytes}} on the {{Text}} object which comes back 
 from hadoop land: we get the real bytes in the file out.
 Now, I think this is a bug, though you may disagree. The text inside my file 
 is perfectly valid iso-8859-1 encoded bytes, which I would like to be able to 
 rescue and re-encode into UTF-8, because I want my application to be smart 
 like that. I think Spark should give me the raw broken string so I can 
 re-encode, but I can't get at the original bytes in order to guess at what 
 the source encoding might be, as they have already been replaced. I'm dealing 
 with data from some CDN access logs which are, to put it nicely, diversely 
 encoded, but I think that's a use case Spark should fully support. So my 
 suggested fix, which I'd like some guidance on, is to change {{textFile}} to 
 spit out broken strings by not using {{Text}}'s UTF-8 encoding.
 Further compounding this issue is that my application is actually in PySpark, 
 but we can talk about how bytes fly through to Scala land after this if we 
 agree that this is an issue at all. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1851) Upgrade Avro dependency to 1.7.6 so Spark can read Avro files

2014-05-16 Thread Sandy Ryza (JIRA)
Sandy Ryza created SPARK-1851:
-

 Summary: Upgrade Avro dependency to 1.7.6 so Spark can read Avro 
files
 Key: SPARK-1851
 URL: https://issues.apache.org/jira/browse/SPARK-1851
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Sandy Ryza
Priority: Critical


I tried to set up a basic example getting a Spark job to read an Avro container 
file with Avro specifics.  This results in a ClassNotFoundException: can't 
convert GenericData.Record to com.cloudera.sparkavro.User.

The reason is:
* When creating records, to decide whether to be specific or generic, Avro 
tries to load a class with the name specified in the schema.
* Initially, executors just have the system jars (which include Avro), and load 
the app jars dynamically with a URLClassLoader that's set as the context 
classloader for the task threads.
* Avro tries to load the generated classes with 
SpecificData.class.getClassLoader(), which sidesteps this URLClassLoader and 
goes up to the AppClassLoader.

Avro 1.7.6 has a change (AVRO-987) that falls back to the thread's context 
classloader when loading via SpecificData.class.getClassLoader() fails.  I 
tested with Avro 1.7.6 and did not observe the problem.
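
To illustrate the AVRO-987 behaviour referred to above (a sketch of the idea, 
not Avro's actual code): prefer SpecificData's own classloader, and fall back 
to the thread's context classloader, which on an executor is the URLClassLoader 
that holds the app jars.

{code}
import org.apache.avro.specific.SpecificData

def loadGeneratedClass(name: String): Class[_] =
  try {
    // Per the description above, older Avro versions only use this loader,
    // which cannot see the dynamically added app jars.
    Class.forName(name, true, classOf[SpecificData].getClassLoader)
  } catch {
    case _: ClassNotFoundException =>
      // AVRO-987-style fallback: the task thread's context classloader can.
      Class.forName(name, true, Thread.currentThread().getContextClassLoader)
  }
{code}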



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1863) Allowing user jars to take precedence over Spark jars does not work as expected

2014-05-16 Thread koert kuipers (JIRA)
koert kuipers created SPARK-1863:


 Summary: Allowing user jars to take precedence over Spark jars 
does not work as expected
 Key: SPARK-1863
 URL: https://issues.apache.org/jira/browse/SPARK-1863
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: koert kuipers
Priority: Minor


See here:
http://apache-spark-user-list.1001560.n3.nabble.com/java-serialization-errors-with-spark-files-userClassPathFirst-true-td5832.html

The issue seems to be that within ChildExecutorURLClassLoader, userClassLoader 
has no visibility on classes managed by parentClassLoader because there is no 
parent/child relationship. What this means is that if a class is loaded by 
userClassLoader and it refers to a class loaded by parentClassLoader, you get a 
NoClassDefFoundError.




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1849) Broken UTF-8 encoded data gets character replacements and thus can't be fixed

2014-05-16 Thread Mridul Muralidharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14000397#comment-14000397
 ] 

Mridul Muralidharan commented on SPARK-1849:


Looks like textFile is probably the wrong API to use.
You cannot recover from badly encoded data ... Better would be to write your 
own InputFormat which does what you need.

 Broken UTF-8 encoded data gets character replacements and thus can't be 
 fixed
 ---

 Key: SPARK-1849
 URL: https://issues.apache.org/jira/browse/SPARK-1849
 Project: Spark
  Issue Type: Bug
Reporter: Harry Brundage
 Fix For: 1.0.0, 0.9.1

 Attachments: encoding_test


 I'm trying to process a file which isn't valid UTF-8 data inside hadoop using 
 Spark via {{sc.textFile()}}. Is this possible, and if not, is this a bug that 
 we should fix? It looks like {{HadoopRDD}} uses 
 {{org.apache.hadoop.io.Text.toString}} on all the data it ever reads, which I 
 believe replaces invalid UTF-8 byte sequences with the UTF-8 replacement 
 character, \uFFFD. Some example code mimicking what {{sc.textFile}} does 
 underneath:
 {code}
 scala> sc.textFile(path).collect()(0)
 res8: String = �pple
 scala> sc.hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], 
 classOf[Text]).map(pair => pair._2.toString).collect()(0).getBytes()
 res9: Array[Byte] = Array(-17, -65, -67, 112, 112, 108, 101)
 scala> sc.hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], 
 classOf[Text]).map(pair => pair._2.getBytes).collect()(0)
 res10: Array[Byte] = Array(-60, 112, 112, 108, 101)
 {code}
 In the above example, the first two snippets show the string representation 
 and byte representation of the example line of text. The third snippet shows 
 what happens if you call {{getBytes}} on the {{Text}} object which comes back 
 from hadoop land: we get the real bytes in the file out.
 Now, I think this is a bug, though you may disagree. The text inside my file 
 is perfectly valid iso-8859-1 encoded bytes, which I would like to be able to 
 rescue and re-encode into UTF-8, because I want my application to be smart 
 like that. I think Spark should give me the raw broken string so I can 
 re-encode, but I can't get at the original bytes in order to guess at what 
 the source encoding might be, as they have already been replaced. I'm dealing 
 with data from some CDN access logs which are, to put it nicely, diversely 
 encoded, but I think that's a use case Spark should fully support. So my 
 suggested fix, which I'd like some guidance on, is to change {{textFile}} to 
 spit out broken strings by not using {{Text}}'s UTF-8 encoding.
 Further compounding this issue is that my application is actually in PySpark, 
 but we can talk about how bytes fly through to Scala land after this if we 
 agree that this is an issue at all. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1729) Make Flume pull data from source, rather than the current push model

2014-05-16 Thread Hari Shreedharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14000574#comment-14000574
 ] 

Hari Shreedharan commented on SPARK-1729:
-

PR: https://github.com/apache/spark/pull/807

 Make Flume pull data from source, rather than the current push model
 

 Key: SPARK-1729
 URL: https://issues.apache.org/jira/browse/SPARK-1729
 Project: Spark
  Issue Type: Sub-task
  Components: Streaming
Affects Versions: 1.0.0
Reporter: Tathagata Das
Assignee: Hari Shreedharan
 Fix For: 1.1.0


 This makes sure that if the Spark executor running the receiver goes 
 down, the new receiver on a new node can still get data from Flume.
 This is not possible in the current model, as Flume is configured to push 
 data to an executor/worker, and if that worker is down, Flume can't push data.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1863) Allowing user jars to take precedence over Spark jars does not work as expected

2014-05-16 Thread koert kuipers (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

koert kuipers updated SPARK-1863:
-

Description: 
See here:
http://apache-spark-user-list.1001560.n3.nabble.com/java-serialization-errors-with-spark-files-userClassPathFirst-true-td5832.html

The issue seems to be that within ChildExecutorURLClassLoader, userClassLoader 
has no visibility on classes managed by parentClassLoader because there is no 
parent/child relationship. What this means is that if a class is loaded by 
userClassLoader and it refers to a class loaded by parentClassLoader, you get a 
NoClassDefFoundError.

  was:
See here:
http://apache-spark-user-list.1001560.n3.nabble.com/java-serialization-errors-with-spark-files-userClassPathFirst-true-td5832.html

The issue seems to be that within ChildExecutorURLClassLoader, userClassLoader 
has no visibility on classes managed by parentClassLoader because there is no 
parent/child relationship. What this means is that if a class is loaded by 
userClassLoader and it refers to a class loaded by parentClassLoader, you get a 
NoClassDefFoundError.

When I addressed this by creating a new version of ChildExecutorURLClassLoader 
that does have the proper parent/child relationship and reverses the loading 
order inside loadClass, class loading seemed to work fine, but now classes 
like SparkEnv were loaded by ChildExecutorURLClassLoader, leading to NPEs on 
SparkEnv.get().

To verify that the issue was that SparkEnv was now loaded by 
ChildExecutorURLClassLoader, I forced SparkEnv to be loaded by the parent 
classloader. That didn't help. Then I forced all Spark classes to be loaded by 
the parent classloader and that did help. But it causes even bigger problems:

java.lang.LinkageError: loader constraint violation: when resolving overridden 
method 
myclass.MyRDD.compute(Lorg/apache/spark/Partition;Lorg/apache/spark/TaskContext;)Lscala/collection/Iterator;
 the class loader (instance of 
org/apache/spark/executor/ChildExecutorURLClassLoader) of the current class, 
myclass/MyRDD, and its superclass loader (instance of 
sun/misc/Launcher$AppClassLoader), have different Class objects for the type 
TaskContext;)Lscala/collection/Iterator; used in the signature
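
For readers following along, a sketch of the child-first delegation pattern 
described in the experiment above (illustrative only; class and parameter names 
are made up). As noted, giving the loader a real parent and trying the user jars 
first also pulls Spark classes bundled in the user jar into the child loader, 
which is what leads to the SparkEnv NPEs and the LinkageError.

{code}
import java.net.{URL, URLClassLoader}

class UserJarFirstClassLoader(userJars: Array[URL], realParent: ClassLoader)
  // null parent: only bootstrap classes are visible to super.loadClass
  extends URLClassLoader(userJars, null) {

  override def loadClass(name: String, resolve: Boolean): Class[_] =
    try {
      super.loadClass(name, resolve)   // user jars win when they contain the class
    } catch {
      case _: ClassNotFoundException =>
        realParent.loadClass(name)     // otherwise fall back to Spark's classpath
    }
}
{code}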



 Allowing user jars to take precedence over Spark jars does not work as 
 expected
 ---

 Key: SPARK-1863
 URL: https://issues.apache.org/jira/browse/SPARK-1863
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: koert kuipers
Priority: Minor

 See here:
 http://apache-spark-user-list.1001560.n3.nabble.com/java-serialization-errors-with-spark-files-userClassPathFirst-true-td5832.html
 The issue seems to be that within ChildExecutorURLClassLoader, userClassLoader 
 has no visibility on classes managed by parentClassLoader because there is no 
 parent/child relationship. What this means is that if a class is loaded by 
 userClassLoader and it refers to a class loaded by parentClassLoader, you get 
 a NoClassDefFoundError.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1864) Classpath not correctly sent to executors.

2014-05-16 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14000587#comment-14000587
 ] 

Michael Armbrust commented on SPARK-1864:
-

https://github.com/apache/spark/pull/808

 Classpath not correctly sent to executors.
 --

 Key: SPARK-1864
 URL: https://issues.apache.org/jira/browse/SPARK-1864
 Project: Spark
  Issue Type: Bug
Reporter: Michael Armbrust
Assignee: Michael Armbrust
Priority: Blocker
 Fix For: 1.0.0






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1864) Classpath not correctly sent to executors.

2014-05-16 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-1864:
---

 Summary: Classpath not correctly sent to executors.
 Key: SPARK-1864
 URL: https://issues.apache.org/jira/browse/SPARK-1864
 Project: Spark
  Issue Type: Bug
Reporter: Michael Armbrust
Assignee: Michael Armbrust
Priority: Blocker
 Fix For: 1.0.0






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1782) svd for sparse matrix using ARPACK

2014-05-16 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14000620#comment-14000620
 ] 

Xiangrui Meng commented on SPARK-1782:
--

If you need the latest Breeze to use eigs, I would prefer calling ARPACK 
directly.

 svd for sparse matrix using ARPACK
 --

 Key: SPARK-1782
 URL: https://issues.apache.org/jira/browse/SPARK-1782
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Li Pu
   Original Estimate: 672h
  Remaining Estimate: 672h

 Currently the svd implementation in mllib calls the dense matrix svd in 
 breeze, which has a limitation of fitting n^2 Gram matrix entries in memory 
 (n is the number of rows or number of columns of the matrix, whichever is 
 smaller). In many use cases, the original matrix is sparse but the Gram 
 matrix might not be, and we often need only the largest k singular 
 values/vectors. To make svd really scalable, the memory usage must be 
 proportional to the number of non-zero entries in the matrix. 
 One solution is to call the de facto standard eigen-decomposition package 
 ARPACK. For an input matrix M, we compute a few eigenvalues and eigenvectors 
 of M^t*M (or M*M^t if its size is smaller) using ARPACK, then use the 
 eigenvalues/vectors to reconstruct singular values/vectors. ARPACK has a 
 reverse communication interface. The user provides a function to multiply a 
 square matrix to be decomposed with a dense vector provided by ARPACK, and 
 return the resulting dense vector to ARPACK. Inside ARPACK it uses an 
 Implicitly Restarted Lanczos Method for symmetric matrices. Outside of ARPACK, 
 what we need to provide are two matrix-vector multiplications, first M*x and 
 then M^t*x. 
 These multiplications can be done in Spark in a distributed manner.
 The working memory used by ARPACK is O(n*k). When k (the number of desired 
 singular values) is small, it easily fits in the memory of the master 
 machine. The overall model is: the master machine runs ARPACK and distributes 
 the matrix-vector multiplications to the executors in each iteration. 
 I made a PR to breeze with an ARPACK-backed svds interface 
 (https://github.com/scalanlp/breeze/pull/240). The interface takes anything 
 that can be multiplied by a DenseVector. On the Spark/MLlib side, we just need 
 to implement the sparse matrix-vector multiplication. 
 It might take some time to optimize and fully test this implementation, so I 
 set the workload estimate to 4 weeks. 
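
As a concrete illustration of the two matrix-vector multiplications the 
reverse-communication loop needs (a sketch only; the sparse row representation 
and the function name are assumptions, not the proposed PR): compute 
y = M^t*(M*x) against an RDD of sparse rows without ever forming the Gram 
matrix.

{code}
import org.apache.spark.rdd.RDD

// A sparse row of M: (column indices, values).
type SparseRow = (Array[Int], Array[Double])

// y = M^t * (M * x); memory stays proportional to the non-zeros plus O(n) per task.
def gramMultiply(rows: RDD[SparseRow], x: Array[Double], n: Int): Array[Double] = {
  val bx = rows.context.broadcast(x)
  rows.aggregate(new Array[Double](n))(
    { case (acc, (indices, values)) =>
      // dot = row . x
      var dot = 0.0
      var i = 0
      while (i < indices.length) { dot += values(i) * bx.value(indices(i)); i += 1 }
      // acc += row * dot
      i = 0
      while (i < indices.length) { acc(indices(i)) += values(i) * dot; i += 1 }
      acc
    },
    { (a, b) =>
      var i = 0
      while (i < n) { a(i) += b(i); i += 1 }
      a
    }
  )
}
{code}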



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1866) Closure cleaner does not null shadowed fields when outer scope is referenced

2014-05-16 Thread Aaron Davidson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron Davidson updated SPARK-1866:
--

Description: 
Take the following example:
{code}
val x = 5
val instances = new org.apache.hadoop.fs.Path("/") /* non-serializable */
sc.parallelize(0 until 10).map { _ =>
  val instances = 3
  (instances, x)
}.collect
{code}

This produces a java.io.NotSerializableException: org.apache.hadoop.fs.Path, 
despite the fact that the outer "instances" is not actually used within the 
closure. If you change the name of the outer variable "instances" to something 
else, the code executes correctly, indicating that it is the fact that the two 
variables share a name that causes the issue.

Additionally, if the outer scope is not used (i.e., we do not reference "x" in 
the above example), the issue does not appear.

  was:
Take the following example:
{code}
val x = 5
val instances = new org.apache.hadoop.fs.Path("/") /* non-serializable */
sc.parallelize(0 until 10).map { _ =>
  val instances = 3
  (instances, x)
}.collect
{code}

This produces a java.io.NotSerializableException: org.apache.hadoop.fs.Path, 
despite the fact that the outer "instances" is not actually used within the 
closure. If you change the name of the outer variable "instances" to something 
else, the code executes correctly, indicating that it is the fact that the two 
variables share a name that causes the issue.


 Closure cleaner does not null shadowed fields when outer scope is referenced
 

 Key: SPARK-1866
 URL: https://issues.apache.org/jira/browse/SPARK-1866
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Aaron Davidson
Priority: Critical
 Fix For: 1.1.0, 1.0.1


 Take the following example:
 {code}
 val x = 5
 val instances = new org.apache.hadoop.fs.Path("/") /* non-serializable */
 sc.parallelize(0 until 10).map { _ =>
   val instances = 3
   (instances, x)
 }.collect
 {code}
 This produces a java.io.NotSerializableException: 
 org.apache.hadoop.fs.Path, despite the fact that the outer "instances" is not 
 actually used within the closure. If you change the name of the outer 
 variable "instances" to something else, the code executes correctly, indicating 
 that it is the fact that the two variables share a name that causes the issue.
 Additionally, if the outer scope is not used (i.e., we do not reference "x" 
 in the above example), the issue does not appear.
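
Until the cleaner handles this, a workaround implied by the description above 
(a sketch, not project guidance) is simply to avoid the name collision:

{code}
val x = 5
val instances = new org.apache.hadoop.fs.Path("/") /* non-serializable */
sc.parallelize(0 until 10).map { _ =>
  val localInstances = 3   // renamed so it no longer shadows the outer "instances"
  (localInstances, x)
}.collect
{code}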



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1864) Classpath not correctly sent to executors.

2014-05-16 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-1864.


Resolution: Fixed

Issue resolved by pull request 808
[https://github.com/apache/spark/pull/808]

 Classpath not correctly sent to executors.
 --

 Key: SPARK-1864
 URL: https://issues.apache.org/jira/browse/SPARK-1864
 Project: Spark
  Issue Type: Bug
Reporter: Michael Armbrust
Assignee: Michael Armbrust
Priority: Blocker
 Fix For: 1.0.0






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1865) Improve behavior of cleanup of disk state

2014-05-16 Thread Aaron Davidson (JIRA)
Aaron Davidson created SPARK-1865:
-

 Summary: Improve behavior of cleanup of disk state
 Key: SPARK-1865
 URL: https://issues.apache.org/jira/browse/SPARK-1865
 Project: Spark
  Issue Type: Improvement
  Components: Deploy, Spark Core
Reporter: Aaron Davidson


Right now the behavior of disk cleanup is centered around the exit hook of the 
executor, which attempts to clean up shuffle files and disk manager blocks, but 
may fail. We should make this behavior more predictable, perhaps by letting the 
Standalone Worker clean up the disk state, and adding a flag to disable having 
the executor clean up its own state.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

