[jira] [Commented] (SPARK-2288) Hide ShuffleBlockManager behind ShuffleManager
[ https://issues.apache.org/jira/browse/SPARK-2288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14116249#comment-14116249 ] Reynold Xin commented on SPARK-2288: Thanks for the design doc, Raymond. Next time it would be better to also comment on the new block type you are adding. Cheers. Hide ShuffleBlockManager behind ShuffleManager -- Key: SPARK-2288 URL: https://issues.apache.org/jira/browse/SPARK-2288 Project: Spark Issue Type: Sub-task Components: Block Manager, Shuffle Reporter: Raymond Liu Assignee: Raymond Liu Attachments: shuffleblockmanager.pdf This is a sub-task of SPARK-2275. At present, in the shuffle write path, the shuffle block manager manages the mapping from a blockId to a FileSegment for the benefit of consolidated shuffle; in doing so it bypasses the block store's blockId-based access mode. Then, in the read path, when reading shuffle block data, the disk store queries the ShuffleBlockManager to override the normal blockId-to-file mapping so that it reads the correct data from the file. This creates a lot of bi-directional dependencies between modules, and the code logic becomes tangled: neither the shuffle block manager nor the BlockManager/DiskStore fully controls the read path, and they are tightly coupled in low-level code. It also makes it hard to implement other shuffle manager logic, e.g. a sort-based shuffle that merges all output from one map partition into a single file, which would need to hack even further into the DiskStore/DiskBlockManager to find the right data to read. Possible approach: it might be better to expose a FileSegment-based read interface on DiskStore in addition to the current blockId-based interface. The logic that maps a blockId to a FileSegment could then live entirely in the specific shuffle manager; if a shuffle manager merges data into a single file, it takes care of the mapping in both the read and write paths and takes responsibility for reading and writing shuffle data. The BlockStore itself should just read and write as requested and should not be involved in the data mapping logic at all. This would make the interfaces between modules clearer and decouple them more cleanly. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
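To make the proposed decoupling concrete, here is a minimal, hypothetical Scala sketch. The trait and method names are illustrative only and are not the actual Spark API: the shuffle manager alone owns the blockId-to-FileSegment mapping, while the disk store exposes a segment-based read path and stays unaware of any mapping logic.
{code}
import java.io.File
import java.nio.ByteBuffer

// A contiguous region of an on-disk file, as in Spark's storage layer.
case class FileSegment(file: File, offset: Long, length: Long)

// Owned by a specific shuffle manager: only it knows how shuffle output
// (possibly consolidated or merged into a single file) maps to file segments.
trait ShuffleBlockResolver {
  def getBlockSegment(shuffleId: Int, mapId: Int, reduceId: Int): FileSegment
}

// Exposed by the disk store in addition to the blockId-based interface:
// it simply reads the bytes it is handed, with no blockId-to-file knowledge.
trait SegmentReadableStore {
  def read(segment: FileSegment): ByteBuffer
}
{code}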
[jira] [Resolved] (SPARK-2288) Hide ShuffleBlockManager behind ShuffleManager
[ https://issues.apache.org/jira/browse/SPARK-2288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-2288. Resolution: Fixed Fix Version/s: 1.2.0 Hide ShuffleBlockManager behind ShuffleManager -- Key: SPARK-2288 URL: https://issues.apache.org/jira/browse/SPARK-2288 Project: Spark Issue Type: Sub-task Components: Block Manager, Shuffle Reporter: Raymond Liu Assignee: Raymond Liu Fix For: 1.2.0 Attachments: shuffleblockmanager.pdf This is a sub-task of SPARK-2275. At present, in the shuffle write path, the shuffle block manager manages the mapping from a blockId to a FileSegment for the benefit of consolidated shuffle; in doing so it bypasses the block store's blockId-based access mode. Then, in the read path, when reading shuffle block data, the disk store queries the ShuffleBlockManager to override the normal blockId-to-file mapping so that it reads the correct data from the file. This creates a lot of bi-directional dependencies between modules, and the code logic becomes tangled: neither the shuffle block manager nor the BlockManager/DiskStore fully controls the read path, and they are tightly coupled in low-level code. It also makes it hard to implement other shuffle manager logic, e.g. a sort-based shuffle that merges all output from one map partition into a single file, which would need to hack even further into the DiskStore/DiskBlockManager to find the right data to read. Possible approach: it might be better to expose a FileSegment-based read interface on DiskStore in addition to the current blockId-based interface. The logic that maps a blockId to a FileSegment could then live entirely in the specific shuffle manager; if a shuffle manager merges data into a single file, it takes care of the mapping in both the read and write paths and takes responsibility for reading and writing shuffle data. The BlockStore itself should just read and write as requested and should not be involved in the data mapping logic at all. This would make the interfaces between modules clearer and decouple them more cleanly. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3305) Remove unused import from UI classes.
[ https://issues.apache.org/jira/browse/SPARK-3305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-3305. Resolution: Fixed Fix Version/s: 1.2.0 Assignee: Kousuke Saruta Remove unused import from UI classes. - Key: SPARK-3305 URL: https://issues.apache.org/jira/browse/SPARK-3305 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.1.0 Reporter: Kousuke Saruta Assignee: Kousuke Saruta Priority: Trivial Fix For: 1.2.0 The UI classes continue to change, and unused imports get left behind. Let's keep them clean. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3323) yarn website's Tracking UI links to the Standby RM
wangfei created SPARK-3323: -- Summary: yarn website's Tracking UI links to the Standby RM Key: SPARK-3323 URL: https://issues.apache.org/jira/browse/SPARK-3323 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.1.0 Reporter: wangfei When running a big application, this situation sometimes occurs: clicking the Tracking UI link of the running application leads to the standby RM, which shows a message like "This is standby RM. Redirecting to the current active RM: some address". But the address of this page is actually the same as that "some address". -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3324) YARN module has nonstandard structure which causes compile errors in IntelliJ
Yi Tian created SPARK-3324: -- Summary: YARN module has nonstandard structure which causes compile errors in IntelliJ Key: SPARK-3324 URL: https://issues.apache.org/jira/browse/SPARK-3324 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.1.0 Environment: Mac OS: 10.9.4 IntelliJ IDEA: 13.1.4 Scala Plugins: 0.41.2 Maven: 3.0.5 Reporter: Yi Tian Priority: Minor The YARN module has a nonstandard path structure:
${SPARK_HOME}
  |--yarn
     |--alpha (contains yarn api support for 0.23 and 2.0.x)
     |--stable (contains yarn api support for 2.2 and later)
     |    |--pom.xml (spark-yarn)
     |--common (Common codes not depending on specific version of Hadoop)
     |--pom.xml (yarn-parent)
When we use Maven to compile the YARN module, Maven imports either the 'alpha' or the 'stable' submodule according to the profile setting, and a submodule like 'stable' uses a build property defined in yarn/pom.xml to add the common code to its source path. As a result, IntelliJ cannot directly recognize the code in the common directory as a source path. I think we should change the YARN module into a unified Maven jar project and choose the version of the YARN API via a Maven profile setting in the pom.xml. That would resolve the compile error in IntelliJ and make the YARN module simpler and clearer. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3324) YARN module has nonstandard structure which causes compile errors in IntelliJ
[ https://issues.apache.org/jira/browse/SPARK-3324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-3324: -- Description: The YARN module has a nonstandard path structure:
{code}
${SPARK_HOME}
  |--yarn
     |--alpha (contains yarn api support for 0.23 and 2.0.x)
     |--stable (contains yarn api support for 2.2 and later)
     |    |--pom.xml (spark-yarn)
     |--common (Common codes not depending on specific version of Hadoop)
     |--pom.xml (yarn-parent)
{code}
When we use Maven to compile the YARN module, Maven imports either the 'alpha' or the 'stable' submodule according to the profile setting, and a submodule like 'stable' uses a build property defined in yarn/pom.xml to add the common code to its source path. As a result, IntelliJ cannot directly recognize the code in the common directory as a source path. I think we should change the YARN module into a unified Maven jar project and choose the version of the YARN API via a Maven profile setting in the pom.xml. That would resolve the compile error in IntelliJ and make the YARN module simpler and clearer.
was: The YARN module has a nonstandard path structure:
${SPARK_HOME}
  |--yarn
     |--alpha (contains yarn api support for 0.23 and 2.0.x)
     |--stable (contains yarn api support for 2.2 and later)
     |    |--pom.xml (spark-yarn)
     |--common (Common codes not depending on specific version of Hadoop)
     |--pom.xml (yarn-parent)
When we use Maven to compile the YARN module, Maven imports either the 'alpha' or the 'stable' submodule according to the profile setting, and a submodule like 'stable' uses a build property defined in yarn/pom.xml to add the common code to its source path. As a result, IntelliJ cannot directly recognize the code in the common directory as a source path. I think we should change the YARN module into a unified Maven jar project and choose the version of the YARN API via a Maven profile setting in the pom.xml. That would resolve the compile error in IntelliJ and make the YARN module simpler and clearer.
YARN module has nonstandard structure which causes compile errors in IntelliJ --- Key: SPARK-3324 URL: https://issues.apache.org/jira/browse/SPARK-3324 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.1.0 Environment: Mac OS: 10.9.4 IntelliJ IDEA: 13.1.4 Scala Plugins: 0.41.2 Maven: 3.0.5 Reporter: Yi Tian Priority: Minor Labels: intellij, maven, yarn The YARN module has a nonstandard path structure:
{code}
${SPARK_HOME}
  |--yarn
     |--alpha (contains yarn api support for 0.23 and 2.0.x)
     |--stable (contains yarn api support for 2.2 and later)
     |    |--pom.xml (spark-yarn)
     |--common (Common codes not depending on specific version of Hadoop)
     |--pom.xml (yarn-parent)
{code}
When we use Maven to compile the YARN module, Maven imports either the 'alpha' or the 'stable' submodule according to the profile setting, and a submodule like 'stable' uses a build property defined in yarn/pom.xml to add the common code to its source path. As a result, IntelliJ cannot directly recognize the code in the common directory as a source path. I think we should change the YARN module into a unified Maven jar project and choose the version of the YARN API via a Maven profile setting in the pom.xml. That would resolve the compile error in IntelliJ and make the YARN module simpler and clearer. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3324) YARN module has nonstandard structure which causes compile errors in IntelliJ
[ https://issues.apache.org/jira/browse/SPARK-3324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14116278#comment-14116278 ] Sean Owen commented on SPARK-3324: -- I agree, I've had a similar problem and just resolved it manually. I imagine the answer is, soon we're going to delete alpha anyway and then this is moot. YARN module has nonstandard structure which causes compile errors in IntelliJ --- Key: SPARK-3324 URL: https://issues.apache.org/jira/browse/SPARK-3324 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.1.0 Environment: Mac OS: 10.9.4 IntelliJ IDEA: 13.1.4 Scala Plugins: 0.41.2 Maven: 3.0.5 Reporter: Yi Tian Priority: Minor Labels: intellij, maven, yarn The YARN module has a nonstandard path structure:
{code}
${SPARK_HOME}
  |--yarn
     |--alpha (contains yarn api support for 0.23 and 2.0.x)
     |--stable (contains yarn api support for 2.2 and later)
     |    |--pom.xml (spark-yarn)
     |--common (Common codes not depending on specific version of Hadoop)
     |--pom.xml (yarn-parent)
{code}
When we use Maven to compile the YARN module, Maven imports either the 'alpha' or the 'stable' submodule according to the profile setting, and a submodule like 'stable' uses a build property defined in yarn/pom.xml to add the common code to its source path. As a result, IntelliJ cannot directly recognize the code in the common directory as a source path. I think we should change the YARN module into a unified Maven jar project and choose the version of the YARN API via a Maven profile setting in the pom.xml. That would resolve the compile error in IntelliJ and make the YARN module simpler and clearer. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3325) Add a parameter to the method print in class DStream.
Yadong Qi created SPARK-3325: Summary: Add a parameter to the method print in class DStream. Key: SPARK-3325 URL: https://issues.apache.org/jira/browse/SPARK-3325 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.2 Reporter: Yadong Qi def print(num: Int = 10) lets the user control the number of elements to print. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
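For illustration, here is how the proposed signature would be used from a streaming job. This is a hypothetical usage sketch: it assumes a socket text stream on localhost:9999, and the parameterized print(num) call is the addition proposed in this issue, not an API that existed at the time of writing.
{code}
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext("local[2]", "PrintExample", Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)

lines.print()    // current behaviour: print the first 10 elements of each batch
lines.print(25)  // proposed: let the caller choose how many elements to print

ssc.start()
ssc.awaitTermination()
{code}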
[jira] [Updated] (SPARK-3325) Add a parameter to the method print in class DStream.
[ https://issues.apache.org/jira/browse/SPARK-3325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yadong Qi updated SPARK-3325: - Component/s: (was: Spark Core) Streaming Add a parameter to the method print in class DStream. - Key: SPARK-3325 URL: https://issues.apache.org/jira/browse/SPARK-3325 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.0.2 Reporter: Yadong Qi def print(num: Int = 10) lets the user control the number of elements to print. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3325) Add a parameter to the method print in class DStream.
[ https://issues.apache.org/jira/browse/SPARK-3325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14116322#comment-14116322 ] Apache Spark commented on SPARK-3325: - User 'watermen' has created a pull request for this issue: https://github.com/apache/spark/pull/2216 Add a parameter to the method print in class DStream. - Key: SPARK-3325 URL: https://issues.apache.org/jira/browse/SPARK-3325 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.0.2 Reporter: Yadong Qi def print(num: Int = 10) lets the user control the number of elements to print. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3326) can't access a static variable after init in mapper
Gavin Zhang created SPARK-3326: -- Summary: can't access a static variable after init in mapper Key: SPARK-3326 URL: https://issues.apache.org/jira/browse/SPARK-3326 Project: Spark Issue Type: Bug Environment: CDH5.1.0 Spark1.0.0 Reporter: Gavin Zhang I wrote an object like: object Foo { private var bar: Bar = null; def init(bar: Bar) { this.bar = bar }; def getSome() = bar.someDef() } In my Spark main method, I read some text from HDFS and init this object, and afterwards call getSome(). I was successful with this code: sc.textFile(args(0)).take(10).map(println(Foo.getSome())) However, when I changed it to write the output to HDFS, I found that the bar variable in the Foo object is null: sc.textFile(args(0)).map(line => Foo.getSome()).saveAsTextFile(args(1)) WHY? -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3326) can't access a static variable after init in mapper
[ https://issues.apache.org/jira/browse/SPARK-3326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14116327#comment-14116327 ] Sean Owen commented on SPARK-3326: -- The call to Foo.getSome() occurs remotely, on a different JVM with a different copy of your class. You may initialize your instance in the driver, but this leaves it uninitialized in the remote workers. You can initialize this in a static block. Or you can simply reference the value of Foo.getSome() directly in your map function, and then it is serialized in the closure. All that you send right now is a function that depends on what Foo.getSome() returns when it's called, not what it happens to return on the driver. Consider broadcast variables if it's large. If that's what's going on then this is normal behavior. can't access a static variable after init in mapper --- Key: SPARK-3326 URL: https://issues.apache.org/jira/browse/SPARK-3326 Project: Spark Issue Type: Bug Environment: CDH5.1.0 Spark1.0.0 Reporter: Gavin Zhang I wrote an object like: object Foo { private var bar: Bar = null; def init(bar: Bar) { this.bar = bar }; def getSome() = bar.someDef() } In my Spark main method, I read some text from HDFS and init this object, and afterwards call getSome(). I was successful with this code: sc.textFile(args(0)).take(10).map(println(Foo.getSome())) However, when I changed it to write the output to HDFS, I found that the bar variable in the Foo object is null: sc.textFile(args(0)).map(line => Foo.getSome()).saveAsTextFile(args(1)) WHY? -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
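A minimal sketch of the two fixes described in the comment above, assuming a SparkContext sc, the args array from the report, and that Foo has been initialized on the driver (names mirror the report and are illustrative):
{code}
// 1) Capture the *value* on the driver so it is serialized into the closure,
//    instead of calling Foo.getSome() on the workers' uninitialized copy of Foo.
val some = Foo.getSome()
sc.textFile(args(0)).map(_ => some).saveAsTextFile(args(1))

// 2) If the value is large, ship it once per executor as a broadcast variable.
val someBc = sc.broadcast(Foo.getSome())
sc.textFile(args(0)).map(_ => someBc.value).saveAsTextFile(args(1))
{code}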
[jira] [Commented] (SPARK-3326) can't access a static variable after init in mapper
[ https://issues.apache.org/jira/browse/SPARK-3326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14116387#comment-14116387 ] Gavin Zhang commented on SPARK-3326: It works. Thanks! can't access a static variable after init in mapper --- Key: SPARK-3326 URL: https://issues.apache.org/jira/browse/SPARK-3326 Project: Spark Issue Type: Bug Environment: CDH5.1.0 Spark1.0.0 Reporter: Gavin Zhang I wrote an object like: object Foo { private var bar: Bar = null; def init(bar: Bar) { this.bar = bar }; def getSome() = bar.someDef() } In my Spark main method, I read some text from HDFS and init this object, and afterwards call getSome(). I was successful with this code: sc.textFile(args(0)).take(10).map(println(Foo.getSome())) However, when I changed it to write the output to HDFS, I found that the bar variable in the Foo object is null: sc.textFile(args(0)).map(line => Foo.getSome()).saveAsTextFile(args(1)) WHY? -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3321) Defining a class within python main script
[ https://issues.apache.org/jira/browse/SPARK-3321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shawn Guo updated SPARK-3321: - Priority: Critical (was: Blocker) Defining a class within python main script -- Key: SPARK-3321 URL: https://issues.apache.org/jira/browse/SPARK-3321 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.1 Environment: Python version 2.6.6 Spark version 1.0.1 jdk1.6.0_43 Reporter: Shawn Guo Priority: Critical *leftOuterJoin(self, other, numPartitions=None)* Perform a left outer join of self and other. For each element (k, v) in self, the resulting RDD will either contain all pairs (k, (v, w)) for w in other, or the pair (k, (v, None)) if no elements in other have key k. *Background*: leftOuterJoin produces None elements in the result dataset. I define a new class 'Null' in the main script to replace every native Python None with a 'Null' object; the 'Null' object overloads the [] operator.
{code:title=Class Null|borderStyle=solid}
class Null(object):
    def __getitem__(self, key): return None
    def __getstate__(self): pass
    def __setstate__(self, dict): pass

def convert_to_null(x):
    return Null() if x is None else x

X = A.leftOuterJoin(B)
X.mapValues(lambda line: (line[0], convert_to_null(line[1])))
{code}
The code seems to run fine in the pyspark console; however, spark-submit fails with the error messages below: /spark-1.0.1-bin-hadoop1/bin/spark-submit --master local[2] /tmp/python_test.py
{noformat}
File "/data/work/spark-1.0.1-bin-hadoop1/python/pyspark/worker.py", line 77, in main
  serializer.dump_stream(func(split_index, iterator), outfile)
File "/data/work/spark-1.0.1-bin-hadoop1/python/pyspark/serializers.py", line 191, in dump_stream
  self.serializer.dump_stream(self._batched(iterator), stream)
File "/data/work/spark-1.0.1-bin-hadoop1/python/pyspark/serializers.py", line 124, in dump_stream
  self._write_with_length(obj, stream)
File "/data/work/spark-1.0.1-bin-hadoop1/python/pyspark/serializers.py", line 134, in _write_with_length
  serialized = self.dumps(obj)
File "/data/work/spark-1.0.1-bin-hadoop1/python/pyspark/serializers.py", line 279, in dumps
  def dumps(self, obj): return cPickle.dumps(obj, 2)
PicklingError: Can't pickle <class '__main__.Null'>: attribute lookup __main__.Null failed
org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:115)
org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:145)
org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:78)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.UnionPartition.iterator(UnionRDD.scala:33)
org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:74)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:200)
org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:175)
org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:175)
org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160)
org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:174)
Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1044) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1028) at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1026) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1026) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:634) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1229) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456)
[jira] [Created] (SPARK-3327) Make broadcasted value mutable for caching useful information
Liang-Chi Hsieh created SPARK-3327: -- Summary: Make broadcasted value mutable for caching useful information Key: SPARK-3327 URL: https://issues.apache.org/jira/browse/SPARK-3327 Project: Spark Issue Type: New Feature Reporter: Liang-Chi Hsieh When implementing some algorithms, it is helpful to be able to cache useful information for later use. Specifically, we would like to perform operation A on each partition of the data, during which some variables are updated, and then run operation B on the data, where operation B uses the variables updated by operation A. One example is the Liblinear-on-Spark work from Dr. Lin; they discuss the problem in Section IV.D of the paper "Large-scale Logistic Regression and Linear Support Vector Machines Using Spark". Currently, broadcast variables only partially satisfy this need. We can broadcast variables to reduce communication costs, but because broadcast values cannot be modified, this does not solve the problem: we may need to collect the updated variables back to the master and broadcast them again before the next data operation. I would like to add an interface to broadcast variables to make them mutable, so that later data operations can reuse them. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
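A rough sketch of the workaround described above (illustrative only; sc, data, initialModel, computeUpdate, merge and applyModel are placeholders, not Spark APIs): because a broadcast value cannot be mutated, the state updated by operation A has to be collected back to the driver and broadcast again before operation B runs.
{code}
// Operation A: compute per-partition updates against the current broadcast value.
val state0 = sc.broadcast(initialModel)
val updates = data.mapPartitions { iter =>
  Iterator(computeUpdate(iter, state0.value))   // placeholder update logic
}.collect()

// Merge on the driver and broadcast the whole state again for the next pass.
val state1 = sc.broadcast(merge(state0.value, updates))  // placeholder merge

// Operation B: uses the freshly broadcast state.
val result = data.map(x => applyModel(x, state1.value))  // placeholder apply
{code}
A mutable (or updatable) broadcast interface would avoid repeating the full re-broadcast between passes.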
[jira] [Commented] (SPARK-3327) Make broadcasted value mutable for caching useful information
[ https://issues.apache.org/jira/browse/SPARK-3327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14116532#comment-14116532 ] Apache Spark commented on SPARK-3327: - User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/2217 Make broadcasted value mutable for caching useful information - Key: SPARK-3327 URL: https://issues.apache.org/jira/browse/SPARK-3327 Project: Spark Issue Type: New Feature Reporter: Liang-Chi Hsieh When implementing some algorithms, it is helpful to be able to cache useful information for later use. Specifically, we would like to perform operation A on each partition of the data, during which some variables are updated, and then run operation B on the data, where operation B uses the variables updated by operation A. One example is the Liblinear-on-Spark work from Dr. Lin; they discuss the problem in Section IV.D of the paper "Large-scale Logistic Regression and Linear Support Vector Machines Using Spark". Currently, broadcast variables only partially satisfy this need. We can broadcast variables to reduce communication costs, but because broadcast values cannot be modified, this does not solve the problem: we may need to collect the updated variables back to the master and broadcast them again before the next data operation. I would like to add an interface to broadcast variables to make them mutable, so that later data operations can reuse them. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2356) Exception: Could not locate executable null\bin\winutils.exe in the Hadoop
[ https://issues.apache.org/jira/browse/SPARK-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14116536#comment-14116536 ] Remus Rusanu commented on SPARK-2356: - HADOOP-11003 is requesting hadoop-common to reduce the severity of the error logged in this case. Exception: Could not locate executable null\bin\winutils.exe in the Hadoop --- Key: SPARK-2356 URL: https://issues.apache.org/jira/browse/SPARK-2356 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Kostiantyn Kudriavtsev Priority: Critical I'm trying to run some transformations on Spark. They work fine on a cluster (YARN, Linux machines). However, when I try to run them on a local machine (Windows 7) in a unit test, I get errors (I don't use Hadoop; I'm reading a file from the local filesystem):
{code}
14/07/02 19:59:31 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/07/02 19:59:31 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
    at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318)
    at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333)
    at org.apache.hadoop.util.Shell.<clinit>(Shell.java:326)
    at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:76)
    at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:93)
    at org.apache.hadoop.security.Groups.<init>(Groups.java:77)
    at org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:240)
    at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:255)
    at org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:283)
    at org.apache.spark.deploy.SparkHadoopUtil.<init>(SparkHadoopUtil.scala:36)
    at org.apache.spark.deploy.SparkHadoopUtil$.<init>(SparkHadoopUtil.scala:109)
    at org.apache.spark.deploy.SparkHadoopUtil$.<clinit>(SparkHadoopUtil.scala)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:228)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:97)
{code}
This happens because the Hadoop config is initialized every time a SparkContext is created, regardless of whether Hadoop is required or not. I propose adding a special flag to indicate whether the Hadoop config is required (or starting this configuration manually). -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-2356) Exception: Could not locate executable null\bin\winutils.exe in the Hadoop
[ https://issues.apache.org/jira/browse/SPARK-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14116536#comment-14116536 ] Remus Rusanu edited comment on SPARK-2356 at 8/30/14 7:50 PM: -- HADOOP-11003 is requesting hadoop-common to reduce the severity of the error logged in this case. The error is raised, but getWinUtilsPath() catches it and logs the stack with error severity. Your code should not see the exception. was (Author: rusanu): HADOOP-11003 is requesting hadoop-common to reduce the severity of the error logged in this case. Exception: Could not locate executable null\bin\winutils.exe in the Hadoop --- Key: SPARK-2356 URL: https://issues.apache.org/jira/browse/SPARK-2356 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Kostiantyn Kudriavtsev Priority: Critical I'm trying to run some transformations on Spark. They work fine on a cluster (YARN, Linux machines). However, when I try to run them on a local machine (Windows 7) in a unit test, I get errors (I don't use Hadoop; I'm reading a file from the local filesystem):
{code}
14/07/02 19:59:31 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/07/02 19:59:31 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
    at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318)
    at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333)
    at org.apache.hadoop.util.Shell.<clinit>(Shell.java:326)
    at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:76)
    at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:93)
    at org.apache.hadoop.security.Groups.<init>(Groups.java:77)
    at org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:240)
    at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:255)
    at org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:283)
    at org.apache.spark.deploy.SparkHadoopUtil.<init>(SparkHadoopUtil.scala:36)
    at org.apache.spark.deploy.SparkHadoopUtil$.<init>(SparkHadoopUtil.scala:109)
    at org.apache.spark.deploy.SparkHadoopUtil$.<clinit>(SparkHadoopUtil.scala)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:228)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:97)
{code}
This happens because the Hadoop config is initialized every time a SparkContext is created, regardless of whether Hadoop is required or not. I propose adding a special flag to indicate whether the Hadoop config is required (or starting this configuration manually). -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
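A commonly used local workaround, separate from the fix proposed in the issue (and an assumption of this note rather than something stated in it): Hadoop's Shell consults the hadoop.home.dir system property, so pointing it at a directory that contains bin\winutils.exe before the SparkContext is created usually avoids this error on Windows. The C:\hadoop path below is hypothetical.
{code}
import org.apache.spark.{SparkConf, SparkContext}

// Assumes winutils.exe has been placed under C:\hadoop\bin on the Windows machine.
System.setProperty("hadoop.home.dir", "C:\\hadoop")

val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("test"))
{code}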
[jira] [Resolved] (SPARK-2889) Spark creates Hadoop Configuration objects inconsistently
[ https://issues.apache.org/jira/browse/SPARK-2889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-2889. -- Resolution: Fixed Fix Version/s: 1.2.0 Spark creates Hadoop Configuration objects inconsistently -- Key: SPARK-2889 URL: https://issues.apache.org/jira/browse/SPARK-2889 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Marcelo Vanzin Assignee: Marcelo Vanzin Priority: Minor Fix For: 1.2.0 Looking through Spark code, there are 3 ways used to get Configuration objects: - Use SparkContext.hadoopConfiguration - Use SparkHadoopUtil.newConfiguration - Call {{new Configuration()}} directly Only the first one supports setting hadoop configs via {{spark.hadoop.*}} properties. We should probably make everybody agree about how to do things. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
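To illustrate the inconsistency, a small sketch (assuming a SparkContext sc whose SparkConf carries a spark.hadoop.* setting, e.g. passed via --conf): of the three ways listed in the issue, only the first picks that setting up.
{code}
import org.apache.hadoop.conf.Configuration

// 1) Sees spark.hadoop.* properties from the SparkConf.
val viaContext: Configuration = sc.hadoopConfiguration

// 3) Sees only what is on the classpath (core-site.xml etc.), not spark.hadoop.*.
val direct: Configuration = new Configuration()
{code}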
[jira] [Updated] (SPARK-3318) The documentation for addFiles is wrong
[ https://issues.apache.org/jira/browse/SPARK-3318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-3318: - Assignee: Holden Karau The documentation for addFiles is wrong --- Key: SPARK-3318 URL: https://issues.apache.org/jira/browse/SPARK-3318 Project: Spark Issue Type: Documentation Reporter: holdenk Assignee: Holden Karau Priority: Trivial It indicates we should use the path rather than the file name, but all of the tests and the functionality work with the file name. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3318) The documentation for addFiles is wrong
[ https://issues.apache.org/jira/browse/SPARK-3318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-3318. -- Resolution: Fixed Fix Version/s: 1.2.0 The documentation for addFiles is wrong --- Key: SPARK-3318 URL: https://issues.apache.org/jira/browse/SPARK-3318 Project: Spark Issue Type: Documentation Reporter: holdenk Assignee: Holden Karau Priority: Trivial Fix For: 1.2.0 It indicates we should use the path rather than the file name, but all of the tests and the functionality work with the file name. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2885) All-pairs similarity via DIMSUM
[ https://issues.apache.org/jira/browse/SPARK-2885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reza Zadeh updated SPARK-2885: -- Description: Build all-pairs similarity algorithm via DIMSUM. Given a dataset of sparse vector data, the all-pairs similarity problem is to find all similar vector pairs according to a similarity function such as cosine similarity, and a given similarity score threshold. Sometimes, this problem is called a “similarity join”. The brute force approach of considering all pairs quickly breaks, since it scales quadratically. For example, for a million vectors, it is not feasible to check all roughly trillion pairs to see if they are above the similarity threshold. Having said that, there exist clever sampling techniques to focus the computational effort on those pairs that are above the similarity threshold, which makes the problem feasible. DIMSUM has a single parameter (called gamma) to tradeoff computation time vs accuracy. Setting gamma from 1 to the largest magnitude allows tradeoff of computation vs accuracy from low computation to high accuracy. For a very large gamma, all cosine similarities are computed exactly with no sampling. Current PR: https://github.com/apache/spark/pull/1778 Justification for adding to MLlib: - All-pairs similarity is missing from MLlib and has been requested several times, e.g. http://bit.ly/XAFGs8 and separately by Jeremy Freeman (see https://github.com/apache/spark/pull/1778#issuecomment-51300825) - Algorithm is used in large-scale production at Twitter. e.g. see https://blog.twitter.com/2012/dimension-independent-similarity-computation-disco . Twitter also open-sourced their version in scalding: https://github.com/twitter/scalding/pull/833 - When used with the gamma parameter set high, this algorithm becomes the normalized gramian matrix, which is useful in RowMatrix alongside the computeGramianMatrix method already in RowMatrix More details about usage at Twitter: https://blog.twitter.com/2014/all-pairs-similarity-via-dimsum For correctness proof, see Theorem 4.3 in http://stanford.edu/~rezab/papers/dimsum.pdf was: Build all-pairs similarity algorithm via DIMSUM. Given a dataset of sparse vector data, the all-pairs similarity problem is to find all similar vector pairs according to a similarity function such as cosine similarity, and a given similarity score threshold. Sometimes, this problem is called a “similarity join”. The brute force approach of considering all pairs quickly breaks, since it scales quadratically. For example, for a million vectors, it is not feasible to check all roughly trillion pairs to see if they are above the similarity threshold. Having said that, there exist clever sampling techniques to focus the computational effort on those pairs that are above the similarity threshold, which makes the problem feasible. DIMSUM has a single parameter (called gamma) to tradeoff computation time vs accuracy. Setting gamma from 1 to the largest magnitude allows tradeoff of computation vs accuracy from low computation to high accuracy. For a very large gamma, all cosine similarities are computed exactly with no sampling. Current PR: https://github.com/apache/spark/pull/1778 Justification for adding to MLlib: - All-pairs similarity is missing from MLlib and has been requested several times, e.g. http://bit.ly/XAFGs8 and separately by Jeremy Freeman (see https://github.com/apache/spark/pull/1778#issuecomment-51300825) - Algorithm is used in large-scale production at Twitter. e.g. 
see https://blog.twitter.com/2012/dimension-independent-similarity-computation-disco . Twitter also open-sourced their version in scalding: https://github.com/twitter/scalding/pull/833 - When used with the gamma parameter set high, this algorithm becomes the normalized gramian matrix, which is useful in RowMatrix alongside the computeGramianMatrix method already in RowMatrix All-pairs similarity via DIMSUM --- Key: SPARK-2885 URL: https://issues.apache.org/jira/browse/SPARK-2885 Project: Spark Issue Type: New Feature Reporter: Reza Zadeh Assignee: Reza Zadeh Build all-pairs similarity algorithm via DIMSUM. Given a dataset of sparse vector data, the all-pairs similarity problem is to find all similar vector pairs according to a similarity function such as cosine similarity, and a given similarity score threshold. Sometimes, this problem is called a “similarity join”. The brute force approach of considering all pairs quickly breaks, since it scales quadratically. For example, for a million vectors, it is not feasible to check all roughly trillion pairs to see if they are above the similarity threshold. Having said that, there exist clever sampling techniques to focus the computational effort on those pairs that
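As a reference point for what the algorithm computes, here is a toy brute-force sketch in plain Scala (illustrative only; it is neither DIMSUM nor the MLlib API): score every pair of sparse vectors by cosine similarity and keep the pairs above the threshold. This is exactly the quadratic baseline that DIMSUM's sampling is designed to avoid.
{code}
// Sparse vectors as index -> value maps; assumes non-zero vectors.
def dot(a: Map[Int, Double], b: Map[Int, Double]): Double =
  a.iterator.map { case (i, v) => v * b.getOrElse(i, 0.0) }.sum

def norm(a: Map[Int, Double]): Double = math.sqrt(dot(a, a))

// O(n^2) pairs: infeasible for a million vectors, hence the sampling in DIMSUM.
def allPairsAbove(vectors: IndexedSeq[Map[Int, Double]], threshold: Double) =
  for {
    i <- vectors.indices
    j <- (i + 1) until vectors.size
    sim = dot(vectors(i), vectors(j)) / (norm(vectors(i)) * norm(vectors(j)))
    if sim >= threshold
  } yield (i, j, sim)
{code}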
[jira] [Updated] (SPARK-3324) YARN module has nonstandard structure which causes compile errors in IntelliJ
[ https://issues.apache.org/jira/browse/SPARK-3324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yi Tian updated SPARK-3324: --- Description: The YARN module has a nonstandard path structure:
{code}
${SPARK_HOME}
  |--yarn
     |--alpha (contains yarn api support for 0.23 and 2.0.x)
     |--stable (contains yarn api support for 2.2 and later)
     |    |--pom.xml (spark-yarn)
     |--common (Common codes not depending on specific version of Hadoop)
     |--pom.xml (yarn-parent)
{code}
When we use Maven to compile the YARN module, Maven imports either the 'alpha' or the 'stable' submodule according to the profile setting, and a submodule like 'stable' uses a build property defined in yarn/pom.xml to add the common code to its source path. As a result, IntelliJ cannot directly recognize the sources in the common directory as a source path. I think we should change the YARN module into a unified Maven jar project and specify the different versions of the YARN API via Maven profile settings. That would resolve the compile error in IntelliJ and make the YARN module simpler and clearer.
was: The YARN module has a nonstandard path structure:
{code}
${SPARK_HOME}
  |--yarn
     |--alpha (contains yarn api support for 0.23 and 2.0.x)
     |--stable (contains yarn api support for 2.2 and later)
     |    |--pom.xml (spark-yarn)
     |--common (Common codes not depending on specific version of Hadoop)
     |--pom.xml (yarn-parent)
{code}
When we use Maven to compile the YARN module, Maven imports either the 'alpha' or the 'stable' submodule according to the profile setting, and a submodule like 'stable' uses a build property defined in yarn/pom.xml to add the common code to its source path. As a result, IntelliJ cannot directly recognize the code in the common directory as a source path. I think we should change the YARN module into a unified Maven jar project and choose the version of the YARN API via a Maven profile setting in the pom.xml. That would resolve the compile error in IntelliJ and make the YARN module simpler and clearer.
YARN module has nonstandard structure which causes compile errors in IntelliJ --- Key: SPARK-3324 URL: https://issues.apache.org/jira/browse/SPARK-3324 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.1.0 Environment: Mac OS: 10.9.4 IntelliJ IDEA: 13.1.4 Scala Plugins: 0.41.2 Maven: 3.0.5 Reporter: Yi Tian Priority: Minor Labels: intellij, maven, yarn The YARN module has a nonstandard path structure:
{code}
${SPARK_HOME}
  |--yarn
     |--alpha (contains yarn api support for 0.23 and 2.0.x)
     |--stable (contains yarn api support for 2.2 and later)
     |    |--pom.xml (spark-yarn)
     |--common (Common codes not depending on specific version of Hadoop)
     |--pom.xml (yarn-parent)
{code}
When we use Maven to compile the YARN module, Maven imports either the 'alpha' or the 'stable' submodule according to the profile setting, and a submodule like 'stable' uses a build property defined in yarn/pom.xml to add the common code to its source path. As a result, IntelliJ cannot directly recognize the sources in the common directory as a source path. I think we should change the YARN module into a unified Maven jar project and specify the different versions of the YARN API via Maven profile settings. That would resolve the compile error in IntelliJ and make the YARN module simpler and clearer. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2558) Mention --queue argument in YARN documentation
[ https://issues.apache.org/jira/browse/SPARK-2558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14116614#comment-14116614 ] Apache Spark commented on SPARK-2558: - User 'kramimus' has created a pull request for this issue: https://github.com/apache/spark/pull/2218 Mention --queue argument in YARN documentation --- Key: SPARK-2558 URL: https://issues.apache.org/jira/browse/SPARK-2558 Project: Spark Issue Type: Documentation Components: YARN Reporter: Matei Zaharia Priority: Trivial Labels: Starter The docs about it went away when we updated the page to spark-submit. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3324) YARN module has nonstandard structure which causes compile errors in IntelliJ
[ https://issues.apache.org/jira/browse/SPARK-3324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14116616#comment-14116616 ] Patrick Wendell commented on SPARK-3324: Hi, I don't totally understand the proposal. But in general, the issue is that these YARN versions have different APIs, so you cannot have a single folder of source code that compiles against both YARN APIs. So if you mean the profile would simply influence the version of YARN we compile against, that's not sufficient. Maybe you have something more elegant in mind? If you could produce a small example of how you'd do this using profiles, I'd be happy to take a look at it and we can see if it's better or worse than what we have now :) Our current approach does have the issue of being IDE-unfriendly. YARN module has nonstandard structure which causes compile errors in IntelliJ --- Key: SPARK-3324 URL: https://issues.apache.org/jira/browse/SPARK-3324 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.1.0 Environment: Mac OS: 10.9.4 IntelliJ IDEA: 13.1.4 Scala Plugins: 0.41.2 Maven: 3.0.5 Reporter: Yi Tian Priority: Minor Labels: intellij, maven, yarn The YARN module has a nonstandard path structure:
{code}
${SPARK_HOME}
  |--yarn
     |--alpha (contains yarn api support for 0.23 and 2.0.x)
     |--stable (contains yarn api support for 2.2 and later)
     |    |--pom.xml (spark-yarn)
     |--common (Common codes not depending on specific version of Hadoop)
     |--pom.xml (yarn-parent)
{code}
When we use Maven to compile the YARN module, Maven imports either the 'alpha' or the 'stable' submodule according to the profile setting, and a submodule like 'stable' uses a build property defined in yarn/pom.xml to add the common code to its source path. As a result, IntelliJ cannot directly recognize the sources in the common directory as a source path. I think we should change the YARN module into a unified Maven jar project and specify the different versions of the YARN API via Maven profile settings. That would resolve the compile error in IntelliJ and make the YARN module simpler and clearer. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3287) When ResourceManager High Availability is enabled, ApplicationMaster webUI is not displayed.
[ https://issues.apache.org/jira/browse/SPARK-3287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3287: --- Component/s: (was: Spark Core) When ResourceManager High Availability is enabled, ApplicationMaster webUI is not displayed. Key: SPARK-3287 URL: https://issues.apache.org/jira/browse/SPARK-3287 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.0.2 Reporter: Benoy Antony Attachments: SPARK-3287.patch When ResourceManager High Availability is enabled, there will be multiple resource managers, each of which could act as a proxy. AmIpFilter is modified to accept multiple proxy hosts, but the Spark ApplicationMaster fails to read the ResourceManager IPs properly from the configuration, so AmIpFilter is initialized with an empty set of proxy hosts. As a result, any access to the ApplicationMaster web UI is redirected to the RM port on the local host. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1239) Don't fetch all map outputs at each reducer during shuffles
[ https://issues.apache.org/jira/browse/SPARK-1239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14116642#comment-14116642 ] bc Wong commented on SPARK-1239: Is anyone working on this? If so, could someone please update the jira? Thanks! Don't fetch all map outputs at each reducer during shuffles --- Key: SPARK-1239 URL: https://issues.apache.org/jira/browse/SPARK-1239 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Patrick Wendell Assignee: Andrew Or Fix For: 1.1.0 Instead we should modify the way we fetch map output statuses to take both a mapper and a reducer - or we should just piggyback the statuses on each task. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2579) Reading from S3 returns an inconsistent number of items with Spark 0.9.1
[ https://issues.apache.org/jira/browse/SPARK-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14116643#comment-14116643 ] Chris Fregly commented on SPARK-2579: - Interesting and possibly related blog post from Netflix earlier this year: http://techblog.netflix.com/2014/01/s3mper-consistency-in-cloud.html Reading from S3 returns an inconsistent number of items with Spark 0.9.1 Key: SPARK-2579 URL: https://issues.apache.org/jira/browse/SPARK-2579 Project: Spark Issue Type: Bug Components: Input/Output Affects Versions: 0.9.1 Reporter: Eemil Lagerspetz Priority: Critical Labels: hdfs, read, s3, skipping I have created a random matrix of 1M rows with 10K items on each row, semicolon-separated. While reading it with Spark 0.9.1 and doing a count, I consistently get less than 1M rows, and a different number every time at that ( !! ). Example below: head -n 1 tool-generate-random-matrix*log ==> tool-generate-random-matrix-999158.log <== Row item counts: 999158 ==> tool-generate-random-matrix.log <== Row item counts: 997163 The data is split into 1000 partitions. When I download it using s3cmd sync, and run the following AWK on it, I get the correct number of rows in each partition (1000x1000 = 1M). What is up?
{code:title=checkrows.sh|borderStyle=solid}
for k in part-0*
do
  echo $k
  awk -F ";" 'NF != 1 { print "Wrong number of items:", NF }
  END { if (NR != 1000) { print "Wrong number of rows:", NR } }' $k
done
{code}
The matrix generation and counting code is below:
{code:title=Matrix.scala|borderStyle=solid}
package fi.helsinki.cs.nodes.matrix

import java.util.Random
import org.apache.spark._
import org.apache.spark.SparkContext._
import scala.collection.mutable.ListBuffer
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel._

object GenerateRandomMatrix {
  def NewGeMatrix(rSeed: Int, rdd: RDD[Int], features: Int) = {
    rdd.mapPartitions(part => part.map(xarr => {
      val rdm = new Random(rSeed + xarr)
      val arr = new Array[Double](features)
      for (i <- 0 until features) arr(i) = rdm.nextDouble()
      new Row(xarr, arr)
    }))
  }

  case class Row(id: Int, elements: Array[Double]) {}

  def rowFromText(line: String) = {
    val idarr = line.split(" ")
    val arr = idarr(1).split(";")
    // -1 to fix saved matrix indexing error
    new Row(idarr(0).toInt - 1, arr.map(_.toDouble))
  }

  def main(args: Array[String]) {
    val master = args(0)
    val tasks = args(1).toInt
    val savePath = args(2)
    val read = args.contains("read")
    val datapoints = 100
    val features = 1
    val sc = new SparkContext(master, "RandomMatrix")
    if (read) {
      val randomMatrix: RDD[Row] = sc.textFile(savePath, tasks).map(rowFromText).persist(MEMORY_AND_DISK)
      println("Row item counts: " + randomMatrix.count)
    } else {
      val rdd = sc.parallelize(0 until datapoints, tasks)
      val bcSeed = sc.broadcast(128)
      /* Generating a matrix of random Doubles */
      val randomMatrix = NewGeMatrix(bcSeed.value, rdd, features).persist(MEMORY_AND_DISK)
      randomMatrix.map(row => row.id + " " + row.elements.mkString(";")).saveAsTextFile(savePath)
    }
    sc.stop
  }
}
{code}
I run this with: appassembler/bin/tool-generate-random-matrix master 1000 s3n://keys@path/to/data 1>matrix.log 2>matrix.err Reading from HDFS gives the right count and the right number of items on each row.
However, I had to run with the full path including the server name; just /matrix does not work (it thinks I want file://): p=hdfs://ec2-54-188-6-77.us-west-2.compute.amazonaws.com:9000/matrix appassembler/bin/tool-generate-random-matrix $( cat /root/spark-ec2/cluster-url ) 1000 $p read 1>readmatrix.log 2>readmatrix.err -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org