[jira] [Commented] (SPARK-2288) Hide ShuffleBlockManager behind ShuffleManager

2014-08-30 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14116249#comment-14116249
 ] 

Reynold Xin commented on SPARK-2288:


Thanks for the design doc, Raymond. Next time it would be better to also 
comment on the new block type you are adding. Cheers.


 Hide ShuffleBlockManager behind ShuffleManager
 --

 Key: SPARK-2288
 URL: https://issues.apache.org/jira/browse/SPARK-2288
 Project: Spark
  Issue Type: Sub-task
  Components: Block Manager, Shuffle
Reporter: Raymond Liu
Assignee: Raymond Liu
 Attachments: shuffleblockmanager.pdf


 This is a sub-task of SPARK-2275. 
 At present, in the shuffle write path, the shuffle block manager maintains the 
 mapping from certain block IDs to FileSegments in order to support consolidated 
 shuffle, bypassing the block store's blockId-based access mode. Then in the 
 read path, when reading shuffle block data, the disk store queries the 
 ShuffleBlockManager to override the normal blockId-to-file mapping so that the 
 data is read from the correct file. This creates a lot of bi-directional 
 dependencies between modules and muddles the code logic: neither the shuffle 
 block manager nor the BlockManager/DiskStore fully controls the read path, and 
 they are tightly coupled in low-level code. It also makes it hard to implement 
 other shuffle manager logic, e.g. a sort-based shuffle that merges all output 
 from one map partition into a single file, which would require hacking even 
 deeper into DiskStore/DiskBlockManager to locate the right data to read.
 Possible approach:
 I think it would be better to expose a FileSegment-based read interface on 
 DiskStore in addition to the current blockId-based interface.
 The blockId-to-FileSegment mapping logic could then reside entirely in the 
 specific shuffle manager: if it needs to merge data into a single file, it 
 handles the mapping in both the read and write paths and takes responsibility 
 for reading and writing shuffle data.
 The BlockStore itself should just read and write as requested and should not be 
 involved in the data mapping logic at all. This would make the interfaces 
 between modules clearer and decouple them more cleanly.
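 A minimal sketch of what such a FileSegment-based read interface might look 
 like (the names and signatures below are illustrative assumptions, not 
 necessarily the design in the attached shuffleblockmanager.pdf):
 {code}
 // Illustrative sketch only -- names and signatures are assumptions, not the final API.
 import java.io.File
 import java.nio.ByteBuffer

 // Spark already models a slice of a file as FileSegment(file, offset, length);
 // redeclared here only to keep the sketch self-contained.
 case class FileSegment(file: File, offset: Long, length: Long)

 trait DiskStoreReads {
   // existing blockId-based access path
   def getBytes(blockId: String): Option[ByteBuffer]

   // proposed addition: read an explicit file segment, so the blockId -> FileSegment
   // mapping can live entirely inside the specific shuffle manager
   def getBytes(segment: FileSegment): Option[ByteBuffer]
 }
 {code}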






[jira] [Resolved] (SPARK-2288) Hide ShuffleBlockManager behind ShuffleManager

2014-08-30 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-2288.


   Resolution: Fixed
Fix Version/s: 1.2.0

 Hide ShuffleBlockManager behind ShuffleManager
 --

 Key: SPARK-2288
 URL: https://issues.apache.org/jira/browse/SPARK-2288
 Project: Spark
  Issue Type: Sub-task
  Components: Block Manager, Shuffle
Reporter: Raymond Liu
Assignee: Raymond Liu
 Fix For: 1.2.0

 Attachments: shuffleblockmanager.pdf


 This is a sub-task of SPARK-2275. 
 At present, in the shuffle write path, the shuffle block manager maintains the 
 mapping from certain block IDs to FileSegments in order to support consolidated 
 shuffle, bypassing the block store's blockId-based access mode. Then in the 
 read path, when reading shuffle block data, the disk store queries the 
 ShuffleBlockManager to override the normal blockId-to-file mapping so that the 
 data is read from the correct file. This creates a lot of bi-directional 
 dependencies between modules and muddles the code logic: neither the shuffle 
 block manager nor the BlockManager/DiskStore fully controls the read path, and 
 they are tightly coupled in low-level code. It also makes it hard to implement 
 other shuffle manager logic, e.g. a sort-based shuffle that merges all output 
 from one map partition into a single file, which would require hacking even 
 deeper into DiskStore/DiskBlockManager to locate the right data to read.
 Possible approach:
 I think it would be better to expose a FileSegment-based read interface on 
 DiskStore in addition to the current blockId-based interface.
 The blockId-to-FileSegment mapping logic could then reside entirely in the 
 specific shuffle manager: if it needs to merge data into a single file, it 
 handles the mapping in both the read and write paths and takes responsibility 
 for reading and writing shuffle data.
 The BlockStore itself should just read and write as requested and should not be 
 involved in the data mapping logic at all. This would make the interfaces 
 between modules clearer and decouple them more cleanly.






[jira] [Resolved] (SPARK-3305) Remove unused import from UI classes.

2014-08-30 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-3305.


   Resolution: Fixed
Fix Version/s: 1.2.0
 Assignee: Kousuke Saruta

 Remove unused import from UI classes.
 -

 Key: SPARK-3305
 URL: https://issues.apache.org/jira/browse/SPARK-3305
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta
Priority: Trivial
 Fix For: 1.2.0


 The UI classes keep changing and have accumulated unused imports.
 Let's keep them clean.






[jira] [Created] (SPARK-3323) yarn website's Tracking UI links to the Standby RM

2014-08-30 Thread wangfei (JIRA)
wangfei created SPARK-3323:
--

 Summary: yarn website's Tracking UI links to the Standby RM
 Key: SPARK-3323
 URL: https://issues.apache.org/jira/browse/SPARK-3323
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.1.0
Reporter: wangfei


When running a large application, the following situation sometimes occurs:
clicking the Tracking UI link of the running application leads to the 
Standby RM, with a message like:
"This is standby RM. Redirecting to the current active RM: some address"
But the address of the page showing this message is actually the same as that 
"some address".







[jira] [Created] (SPARK-3324) YARN module has nonstandard structure which cause compile error In IntelliJ

2014-08-30 Thread Yi Tian (JIRA)
Yi Tian created SPARK-3324:
--

 Summary: YARN module has nonstandard structure which cause compile 
error In IntelliJ
 Key: SPARK-3324
 URL: https://issues.apache.org/jira/browse/SPARK-3324
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.1.0
 Environment: Mac OS: 10.9.4
IntelliJ IDEA: 13.1.4
Scala Plugins: 0.41.2
Maven: 3.0.5
Reporter: Yi Tian
Priority: Minor


The YARN module has a nonstandard path structure:
${SPARK_HOME}
  |--yarn
     |--alpha (contains YARN API support for 0.23 and 2.0.x)
     |--stable (contains YARN API support for 2.2 and later)
     |  |--pom.xml (spark-yarn)
     |--common (common code not depending on a specific version of Hadoop)
     |--pom.xml (yarn-parent)

When we compile the yarn module with Maven, Maven imports the 'alpha' or 
'stable' submodule according to the active profile, and that submodule (e.g. 
'stable') uses a build property defined in yarn/pom.xml to add the common code 
to its source path.
As a result, IntelliJ cannot directly recognize the code in the common 
directory as a source path.

I think we should change the yarn module into a single Maven jar project and 
choose the YARN API version via a Maven profile setting in the pom.xml.

This would resolve the compile error in IntelliJ and make the yarn module 
simpler and clearer.






[jira] [Updated] (SPARK-3324) YARN module has nonstandard structure which cause compile error In IntelliJ

2014-08-30 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-3324:
--

Description: 
The YARN module has a nonstandard path structure:
{code}
${SPARK_HOME}
  |--yarn
     |--alpha (contains YARN API support for 0.23 and 2.0.x)
     |--stable (contains YARN API support for 2.2 and later)
     |  |--pom.xml (spark-yarn)
     |--common (common code not depending on a specific version of Hadoop)
     |--pom.xml (yarn-parent)
{code}
When we compile the yarn module with Maven, Maven imports the 'alpha' or 
'stable' submodule according to the active profile, and that submodule (e.g. 
'stable') uses a build property defined in yarn/pom.xml to add the common code 
to its source path.
As a result, IntelliJ cannot directly recognize the code in the common 
directory as a source path.

I think we should change the yarn module into a single Maven jar project and 
choose the YARN API version via a Maven profile setting in the pom.xml.

This would resolve the compile error in IntelliJ and make the yarn module 
simpler and clearer.

  was:
The YARN module has nonstandard path structure like:
${SPARK_HOME}
  |--yarn
 |--alpha (contains yarn api support for 0.23 and 2.0.x)
 |--stable (contains yarn api support for 2.2 and later)
 | |--pom.xml (spark-yarn)
 |--common (Common codes not depending on specific version of Hadoop)
 |--pom.xml (yarn-parent)

When we use maven to compile yarn module, maven will import 'alpha' or 'stable' 
module according to profile setting.
And the submodule like 'stable' use the build propertie defined in yarn/pom.xml 
to import common codes to sourcePath.
It will cause IntelliJ can't directly recogniz codes in common directory as 
sourcePath. 

I thought we should change the yarn module to a unified maven jar project, 
and choose the version of yarn api via maven profile setting in the pom.xml.

It will resolve the compile error in IntelliJ and make the yarn module more 
simple and clear.


 YARN module has nonstandard structure which cause compile error In IntelliJ
 ---

 Key: SPARK-3324
 URL: https://issues.apache.org/jira/browse/SPARK-3324
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.1.0
 Environment: Mac OS: 10.9.4
 IntelliJ IDEA: 13.1.4
 Scala Plugins: 0.41.2
 Maven: 3.0.5
Reporter: Yi Tian
Priority: Minor
  Labels: intellij, maven, yarn

 The YARN module has a nonstandard path structure:
 {code}
 ${SPARK_HOME}
   |--yarn
      |--alpha (contains YARN API support for 0.23 and 2.0.x)
      |--stable (contains YARN API support for 2.2 and later)
      |  |--pom.xml (spark-yarn)
      |--common (common code not depending on a specific version of Hadoop)
      |--pom.xml (yarn-parent)
 {code}
 When we compile the yarn module with Maven, Maven imports the 'alpha' or 
 'stable' submodule according to the active profile, and that submodule (e.g. 
 'stable') uses a build property defined in yarn/pom.xml to add the common 
 code to its source path.
 As a result, IntelliJ cannot directly recognize the code in the common 
 directory as a source path.
 I think we should change the yarn module into a single Maven jar project and 
 choose the YARN API version via a Maven profile setting in the pom.xml.
 This would resolve the compile error in IntelliJ and make the yarn module 
 simpler and clearer.






[jira] [Commented] (SPARK-3324) YARN module has nonstandard structure which cause compile error In IntelliJ

2014-08-30 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14116278#comment-14116278
 ] 

Sean Owen commented on SPARK-3324:
--

I agree, I've had a similar problem and just resolved it manually.

I imagine the answer is, soon we're going to delete alpha anyway and then 
this is moot.

 YARN module has nonstandard structure which cause compile error In IntelliJ
 ---

 Key: SPARK-3324
 URL: https://issues.apache.org/jira/browse/SPARK-3324
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.1.0
 Environment: Mac OS: 10.9.4
 IntelliJ IDEA: 13.1.4
 Scala Plugins: 0.41.2
 Maven: 3.0.5
Reporter: Yi Tian
Priority: Minor
  Labels: intellij, maven, yarn

 The YARN module has a nonstandard path structure:
 {code}
 ${SPARK_HOME}
   |--yarn
      |--alpha (contains YARN API support for 0.23 and 2.0.x)
      |--stable (contains YARN API support for 2.2 and later)
      |  |--pom.xml (spark-yarn)
      |--common (common code not depending on a specific version of Hadoop)
      |--pom.xml (yarn-parent)
 {code}
 When we compile the yarn module with Maven, Maven imports the 'alpha' or 
 'stable' submodule according to the active profile, and that submodule (e.g. 
 'stable') uses a build property defined in yarn/pom.xml to add the common 
 code to its source path.
 As a result, IntelliJ cannot directly recognize the code in the common 
 directory as a source path.
 I think we should change the yarn module into a single Maven jar project and 
 choose the YARN API version via a Maven profile setting in the pom.xml.
 This would resolve the compile error in IntelliJ and make the yarn module 
 simpler and clearer.






[jira] [Created] (SPARK-3325) Add a parameter to the method print in class DStream.

2014-08-30 Thread Yadong Qi (JIRA)
Yadong Qi created SPARK-3325:


 Summary: Add a parameter to the method print in class DStream.
 Key: SPARK-3325
 URL: https://issues.apache.org/jira/browse/SPARK-3325
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.2
Reporter: Yadong Qi


def print(num: Int = 10)
This lets the user control how many elements are printed.
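A hedged usage sketch of the proposal (it assumes the overload lands as 
print(num: Int); the exact signature is whatever the linked pull request 
settles on):
{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Usage sketch only: words.print(20) assumes the proposed overload exists and
// prints the first 20 elements of every batch instead of the current fixed 10.
object PrintNExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("PrintNExample").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1))
    val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))
    words.print(20)
    ssc.start()
    ssc.awaitTermination()
  }
}
{code}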






[jira] [Updated] (SPARK-3325) Add a parameter to the method print in class DStream.

2014-08-30 Thread Yadong Qi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yadong Qi updated SPARK-3325:
-

Component/s: (was: Spark Core)
 Streaming

 Add a parameter to the method print in class DStream.
 -

 Key: SPARK-3325
 URL: https://issues.apache.org/jira/browse/SPARK-3325
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.0.2
Reporter: Yadong Qi

 def print(num: Int = 10)
 This lets the user control how many elements are printed.






[jira] [Commented] (SPARK-3325) Add a parameter to the method print in class DStream.

2014-08-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14116322#comment-14116322
 ] 

Apache Spark commented on SPARK-3325:
-

User 'watermen' has created a pull request for this issue:
https://github.com/apache/spark/pull/2216

 Add a parameter to the method print in class DStream.
 -

 Key: SPARK-3325
 URL: https://issues.apache.org/jira/browse/SPARK-3325
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.0.2
Reporter: Yadong Qi

 def print(num: Int = 10)
 This lets the user control how many elements are printed.






[jira] [Created] (SPARK-3326) can't access a static variable after init in mapper

2014-08-30 Thread Gavin Zhang (JIRA)
Gavin Zhang created SPARK-3326:
--

 Summary: can't access a static variable after init in mapper
 Key: SPARK-3326
 URL: https://issues.apache.org/jira/browse/SPARK-3326
 Project: Spark
  Issue Type: Bug
 Environment: CDH5.1.0
Spark1.0.0
Reporter: Gavin Zhang


I wrote an object like:
object Foo {
  private var bar: Bar = null
  def init(bar: Bar): Unit = {
    this.bar = bar
  }
  def getSome() = {
    bar.someDef()
  }
}
In the Spark main method, I read some text from HDFS, init this object, and 
then call getSome().
I was successful with this code:
sc.textFile(args(0)).take(10).map(line => println(Foo.getSome()))
However, when I changed it to write the output to HDFS, I found that the bar 
variable in the Foo object is null:
sc.textFile(args(0)).map(line => Foo.getSome()).saveAsTextFile(args(1))
WHY?






[jira] [Commented] (SPARK-3326) can't access a static variable after init in mapper

2014-08-30 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14116327#comment-14116327
 ] 

Sean Owen commented on SPARK-3326:
--

The call to Foo.getSome() occurs remotely, on a different JVM with a different 
copy of your class. You may initialize your instance in the driver, but this 
leaves it uninitialized in the remote workers.

You can initialize this in a static block. Or you can simply reference the 
value of Foo.getSome() directly in your map function and then it is serialized 
in the closure. All that you send right now is a function that depends on what 
Foo.getSome() returns when it's called, not what it happens to return on the 
driver. Consider broadcast variables if it's large.

If that's what's going on then this is normal behavior.
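To make the suggestion concrete, here is a minimal self-contained sketch of the 
two fixes described above (the value and the file paths are placeholders 
standing in for the reporter's Foo.getSome() and program arguments):
{code}
import org.apache.spark.{SparkConf, SparkContext}

object FixSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("FixSketch").setMaster("local[2]"))

    // Evaluate the value on the driver; it is then serialized into the closure.
    val some = "value computed on the driver"   // stands in for Foo.getSome()
    sc.textFile(args(0)).map(line => some).saveAsTextFile(args(1))

    // For a large value, prefer a broadcast variable.
    val someBc = sc.broadcast(some)
    sc.textFile(args(0)).map(line => someBc.value).saveAsTextFile(args(1) + "-bc")

    sc.stop()
  }
}
{code}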

 can't access a static variable after init in mapper
 ---

 Key: SPARK-3326
 URL: https://issues.apache.org/jira/browse/SPARK-3326
 Project: Spark
  Issue Type: Bug
 Environment: CDH5.1.0
 Spark1.0.0
Reporter: Gavin Zhang

 I wrote an object like:
 object Foo {
   private var bar: Bar = null
   def init(bar: Bar): Unit = {
     this.bar = bar
   }
   def getSome() = {
     bar.someDef()
   }
 }
 In the Spark main method, I read some text from HDFS, init this object, and 
 then call getSome().
 I was successful with this code:
 sc.textFile(args(0)).take(10).map(line => println(Foo.getSome()))
 However, when I changed it to write the output to HDFS, I found that the bar 
 variable in the Foo object is null:
 sc.textFile(args(0)).map(line => Foo.getSome()).saveAsTextFile(args(1))
 WHY?






[jira] [Commented] (SPARK-3326) can't access a static variable after init in mapper

2014-08-30 Thread Gavin Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14116387#comment-14116387
 ] 

Gavin Zhang commented on SPARK-3326:


It works. Thanks!

 can't access a static variable after init in mapper
 ---

 Key: SPARK-3326
 URL: https://issues.apache.org/jira/browse/SPARK-3326
 Project: Spark
  Issue Type: Bug
 Environment: CDH5.1.0
 Spark1.0.0
Reporter: Gavin Zhang

 I wrote an object like:
 object Foo {
   private var bar: Bar = null
   def init(bar: Bar): Unit = {
     this.bar = bar
   }
   def getSome() = {
     bar.someDef()
   }
 }
 In the Spark main method, I read some text from HDFS, init this object, and 
 then call getSome().
 I was successful with this code:
 sc.textFile(args(0)).take(10).map(line => println(Foo.getSome()))
 However, when I changed it to write the output to HDFS, I found that the bar 
 variable in the Foo object is null:
 sc.textFile(args(0)).map(line => Foo.getSome()).saveAsTextFile(args(1))
 WHY?






[jira] [Updated] (SPARK-3321) Defining a class within python main script

2014-08-30 Thread Shawn Guo (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shawn Guo updated SPARK-3321:
-

Priority: Critical  (was: Blocker)

 Defining a class within python main script
 --

 Key: SPARK-3321
 URL: https://issues.apache.org/jira/browse/SPARK-3321
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.0.1
 Environment: Python version 2.6.6
 Spark version version 1.0.1
 jdk1.6.0_43
Reporter: Shawn Guo
Priority: Critical

 *leftOuterJoin(self, other, numPartitions=None)*
 Perform a left outer join of self and other.
 For each element (k, v) in self, the resulting RDD will either contain all 
 pairs (k, (v, w)) for w in other, or the pair (k, (v, None)) if no elements 
 in other have key k.
 *Background*: leftOuterJoin produces None elements in the result dataset.
 I define a new class 'Null' in the main script to replace every native Python 
 None with a 'Null' object; 'Null' overloads the [] operator.
 {code:title=Class Null|borderStyle=solid}
 class Null(object):
     def __getitem__(self, key): return None
     def __getstate__(self): pass
     def __setstate__(self, dict): pass

 def convert_to_null(x):
     return Null() if x is None else x

 X = A.leftOuterJoin(B)
 X.mapValues(lambda line: (line[0], convert_to_null(line[1])))
 {code}
 The code seems to run fine in the pyspark console; however, spark-submit fails 
 with the error messages below:
 /spark-1.0.1-bin-hadoop1/bin/spark-submit --master local[2] /tmp/python_test.py
 {noformat}
   File "/data/work/spark-1.0.1-bin-hadoop1/python/pyspark/worker.py", line 77, in main
     serializer.dump_stream(func(split_index, iterator), outfile)
   File "/data/work/spark-1.0.1-bin-hadoop1/python/pyspark/serializers.py", line 191, in dump_stream
     self.serializer.dump_stream(self._batched(iterator), stream)
   File "/data/work/spark-1.0.1-bin-hadoop1/python/pyspark/serializers.py", line 124, in dump_stream
     self._write_with_length(obj, stream)
   File "/data/work/spark-1.0.1-bin-hadoop1/python/pyspark/serializers.py", line 134, in _write_with_length
     serialized = self.dumps(obj)
   File "/data/work/spark-1.0.1-bin-hadoop1/python/pyspark/serializers.py", line 279, in dumps
     def dumps(self, obj): return cPickle.dumps(obj, 2)
 PicklingError: Can't pickle <class '__main__.Null'>: attribute lookup __main__.Null failed
 
 org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:115)
 
 org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:145)
 org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:78)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 org.apache.spark.rdd.UnionPartition.iterator(UnionRDD.scala:33)
 org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:74)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 
 org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:200)
 
 org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:175)
 
 org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:175)
 org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160)
 
 org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:174)
 Driver stacktrace:
 at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1044)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1028)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1026)
 at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at 
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1026)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
 at scala.Option.foreach(Option.scala:236)
 at 
 org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:634)
 at 
 org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1229)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
 at akka.actor.ActorCell.invoke(ActorCell.scala:456)
 

[jira] [Created] (SPARK-3327) Make broadcasted value mutable for caching useful information

2014-08-30 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-3327:
--

 Summary: Make broadcasted value mutable for caching useful 
information
 Key: SPARK-3327
 URL: https://issues.apache.org/jira/browse/SPARK-3327
 Project: Spark
  Issue Type: New Feature
Reporter: Liang-Chi Hsieh


When implementing some algorithms, it is helpful to be able to cache useful 
information for later use.

Specifically, we would like to perform operation A on each partition of the 
data, which updates some variables, and then run operation B on the same data, 
where B uses the variables updated by A.

One example is Liblinear on Spark from Dr. Lin's group; they discuss the 
problem in Section IV.D of the paper "Large-scale Logistic Regression and 
Linear Support Vector Machines Using Spark".

Currently, broadcast variables satisfy only part of this need. We can broadcast 
variables to reduce communication costs, but because broadcast variables cannot 
be modified, they do not solve the problem: we may need to collect the updated 
variables back to the master and broadcast them again before the next operation 
on the data.

I would like to add an interface to broadcast variables that makes them mutable, 
so that later data operations can use them again.
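
For reference, a minimal sketch of the collect-and-rebroadcast workaround the 
description alludes to (all names and numbers here are illustrative, not part 
of the proposal):
{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

// Operation A computes per-partition updates against the current broadcast value,
// the driver merges them, and a *new* broadcast carries the result into operation B.
object RebroadcastSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("RebroadcastSketch").setMaster("local[2]"))
    val data = sc.parallelize(1 to 1000, 4).cache()

    val model0 = sc.broadcast(0.0)
    // Operation A: per-partition updates using the current model
    val updates = data.mapPartitions(it => Iterator(it.map(_ + model0.value).sum)).collect()

    // Driver merges the updates and broadcasts the new value
    val model1 = sc.broadcast(updates.sum / updates.length)

    // Operation B: uses the value updated by operation A
    println(data.map(x => x * model1.value).sum())
    sc.stop()
  }
}
{code}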


 







[jira] [Commented] (SPARK-3327) Make broadcasted value mutable for caching useful information

2014-08-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14116532#comment-14116532
 ] 

Apache Spark commented on SPARK-3327:
-

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/2217

 Make broadcasted value mutable for caching useful information
 -

 Key: SPARK-3327
 URL: https://issues.apache.org/jira/browse/SPARK-3327
 Project: Spark
  Issue Type: New Feature
Reporter: Liang-Chi Hsieh

 When implementing some algorithms, it is helpful to be able to cache useful 
 information for later use.
 Specifically, we would like to perform operation A on each partition of the 
 data, which updates some variables, and then run operation B on the same 
 data, where B uses the variables updated by A.
 One example is Liblinear on Spark from Dr. Lin's group; they discuss the 
 problem in Section IV.D of the paper "Large-scale Logistic Regression and 
 Linear Support Vector Machines Using Spark".
 Currently, broadcast variables satisfy only part of this need. We can 
 broadcast variables to reduce communication costs, but because broadcast 
 variables cannot be modified, they do not solve the problem: we may need to 
 collect the updated variables back to the master and broadcast them again 
 before the next operation on the data.
 I would like to add an interface to broadcast variables that makes them 
 mutable, so that later data operations can use them again.
  






[jira] [Commented] (SPARK-2356) Exception: Could not locate executable null\bin\winutils.exe in the Hadoop

2014-08-30 Thread Remus Rusanu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14116536#comment-14116536
 ] 

Remus Rusanu commented on SPARK-2356:
-

HADOOP-11003 is requesting hadoop-common to reduce the severity of the error 
logged in this case. 

 Exception: Could not locate executable null\bin\winutils.exe in the Hadoop 
 ---

 Key: SPARK-2356
 URL: https://issues.apache.org/jira/browse/SPARK-2356
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Kostiantyn Kudriavtsev
Priority: Critical

 I'm trying to run some transformations on Spark. They work fine on the cluster 
 (YARN, Linux machines). However, when I try to run them on a local machine 
 (Windows 7) under a unit test, I get errors (I don't use Hadoop; I'm reading a 
 file from the local filesystem):
 {code}
 14/07/02 19:59:31 WARN NativeCodeLoader: Unable to load native-hadoop library 
 for your platform... using builtin-java classes where applicable
 14/07/02 19:59:31 ERROR Shell: Failed to locate the winutils binary in the 
 hadoop binary path
 java.io.IOException: Could not locate executable null\bin\winutils.exe in the 
 Hadoop binaries.
   at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318)
   at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333)
   at org.apache.hadoop.util.Shell.<clinit>(Shell.java:326)
   at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:76)
   at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:93)
   at org.apache.hadoop.security.Groups.<init>(Groups.java:77)
   at org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:240)
   at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:255)
   at org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:283)
   at org.apache.spark.deploy.SparkHadoopUtil.<init>(SparkHadoopUtil.scala:36)
   at org.apache.spark.deploy.SparkHadoopUtil$.<init>(SparkHadoopUtil.scala:109)
   at org.apache.spark.deploy.SparkHadoopUtil$.<clinit>(SparkHadoopUtil.scala)
   at org.apache.spark.SparkContext.<init>(SparkContext.scala:228)
   at org.apache.spark.SparkContext.<init>(SparkContext.scala:97)
 {code}
 This happens because the Hadoop configuration is initialized every time a 
 Spark context is created, regardless of whether Hadoop is required or not.
 I propose adding a flag to indicate whether the Hadoop configuration is 
 required (or allowing this configuration to be started manually).
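 As a side note (not the fix proposed above, and assuming winutils.exe has been 
 placed under C:\hadoop\bin), a common workaround for Windows test runs is to 
 tell Hadoop's Shell helper where to find it before the SparkContext is created:
 {code}
 // Workaround sketch only. Hadoop's Shell class consults the "hadoop.home.dir"
 // system property (or the HADOOP_HOME environment variable) to locate bin\winutils.exe.
 System.setProperty("hadoop.home.dir", "C:\\hadoop")
 val sc = new org.apache.spark.SparkContext("local[2]", "unit-test")
 {code}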






[jira] [Comment Edited] (SPARK-2356) Exception: Could not locate executable null\bin\winutils.exe in the Hadoop

2014-08-30 Thread Remus Rusanu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14116536#comment-14116536
 ] 

Remus Rusanu edited comment on SPARK-2356 at 8/30/14 7:50 PM:
--

HADOOP-11003 is requesting hadoop-common to reduce the severity of the error 
logged in this case. The error is raised, but getWinUtilsPath() catches it and 
logs the stack with error severity. Your code should not see the exception.


was (Author: rusanu):
HADOOP-11003 is requesting hadoop-common to reduce the severity of the error 
logged in this case. 

 Exception: Could not locate executable null\bin\winutils.exe in the Hadoop 
 ---

 Key: SPARK-2356
 URL: https://issues.apache.org/jira/browse/SPARK-2356
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Kostiantyn Kudriavtsev
Priority: Critical

 I'm trying to run some transformations on Spark. They work fine on the cluster 
 (YARN, Linux machines). However, when I try to run them on a local machine 
 (Windows 7) under a unit test, I get errors (I don't use Hadoop; I'm reading a 
 file from the local filesystem):
 {code}
 14/07/02 19:59:31 WARN NativeCodeLoader: Unable to load native-hadoop library 
 for your platform... using builtin-java classes where applicable
 14/07/02 19:59:31 ERROR Shell: Failed to locate the winutils binary in the 
 hadoop binary path
 java.io.IOException: Could not locate executable null\bin\winutils.exe in the 
 Hadoop binaries.
   at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318)
   at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333)
   at org.apache.hadoop.util.Shell.<clinit>(Shell.java:326)
   at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:76)
   at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:93)
   at org.apache.hadoop.security.Groups.<init>(Groups.java:77)
   at org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:240)
   at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:255)
   at org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:283)
   at org.apache.spark.deploy.SparkHadoopUtil.<init>(SparkHadoopUtil.scala:36)
   at org.apache.spark.deploy.SparkHadoopUtil$.<init>(SparkHadoopUtil.scala:109)
   at org.apache.spark.deploy.SparkHadoopUtil$.<clinit>(SparkHadoopUtil.scala)
   at org.apache.spark.SparkContext.<init>(SparkContext.scala:228)
   at org.apache.spark.SparkContext.<init>(SparkContext.scala:97)
 {code}
 This happens because the Hadoop configuration is initialized every time a 
 Spark context is created, regardless of whether Hadoop is required or not.
 I propose adding a flag to indicate whether the Hadoop configuration is 
 required (or allowing this configuration to be started manually).






[jira] [Resolved] (SPARK-2889) Spark creates Hadoop Configuration objects inconsistently

2014-08-30 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia resolved SPARK-2889.
--

   Resolution: Fixed
Fix Version/s: 1.2.0

 Spark creates Hadoop Configuration objects inconsistently 
 --

 Key: SPARK-2889
 URL: https://issues.apache.org/jira/browse/SPARK-2889
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Marcelo Vanzin
Assignee: Marcelo Vanzin
Priority: Minor
 Fix For: 1.2.0


 Looking through Spark code, there are 3 ways used to get Configuration 
 objects:
 - Use SparkContext.hadoopConfiguration
 - Use SparkHadoopUtil.newConfiguration
 - Call {{new Configuration()}} directly
 Only the first one supports setting hadoop configs via {{spark.hadoop.*}} 
 properties. We should probably make everybody agree about how to do things.
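 To make the difference concrete, a small self-contained sketch (the 
 spark.hadoop.* key below is a made-up example; only the first form is expected 
 to reflect such settings):
 {code}
 import org.apache.hadoop.conf.Configuration
 import org.apache.spark.{SparkConf, SparkContext}

 object HadoopConfSketch {
   def main(args: Array[String]): Unit = {
     val conf = new SparkConf()
       .setAppName("HadoopConfSketch").setMaster("local[2]")
       .set("spark.hadoop.my.custom.key", "42")      // hypothetical spark.hadoop.* setting
     val sc = new SparkContext(conf)

     val viaContext = sc.hadoopConfiguration          // picks up spark.hadoop.* properties
     val viaNew = new Configuration()                 // does not see the override

     println(viaContext.get("my.custom.key"))         // "42"
     println(viaNew.get("my.custom.key"))             // null
     sc.stop()
   }
 }
 {code}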






[jira] [Updated] (SPARK-3318) The documentation for addFiles is wrong

2014-08-30 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-3318:
-

Assignee: Holden Karau

 The documentation for addFiles is wrong
 ---

 Key: SPARK-3318
 URL: https://issues.apache.org/jira/browse/SPARK-3318
 Project: Spark
  Issue Type: Documentation
Reporter: holdenk
Assignee: Holden Karau
Priority: Trivial

 It indicates that we should use the path rather than the file name, but all of 
 the tests and the functionality work with the file name.
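 For context, a small sketch of the behaviour in question (the file path is a 
 placeholder): the file is added by path, but retrieved on executors by its 
 bare file name via SparkFiles.get.
 {code}
 import org.apache.spark.{SparkConf, SparkContext, SparkFiles}

 object AddFileSketch {
   def main(args: Array[String]): Unit = {
     val sc = new SparkContext(new SparkConf().setAppName("AddFileSketch").setMaster("local[2]"))
     sc.addFile("/tmp/lookup.txt")                       // added by path
     val paths = sc.parallelize(1 to 4).map(_ => SparkFiles.get("lookup.txt"))  // looked up by file name
     println(paths.first())
     sc.stop()
   }
 }
 {code}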






[jira] [Resolved] (SPARK-3318) The documentation for addFiles is wrong

2014-08-30 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia resolved SPARK-3318.
--

   Resolution: Fixed
Fix Version/s: 1.2.0

 The documentation for addFiles is wrong
 ---

 Key: SPARK-3318
 URL: https://issues.apache.org/jira/browse/SPARK-3318
 Project: Spark
  Issue Type: Documentation
Reporter: holdenk
Assignee: Holden Karau
Priority: Trivial
 Fix For: 1.2.0


 It indicates that we should use the path rather than the file name, but all of 
 the tests and the functionality work with the file name.






[jira] [Updated] (SPARK-2885) All-pairs similarity via DIMSUM

2014-08-30 Thread Reza Zadeh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reza Zadeh updated SPARK-2885:
--

Description: 
Build all-pairs similarity algorithm via DIMSUM. 

Given a dataset of sparse vector data, the all-pairs similarity problem is to 
find all similar vector pairs according to a similarity function such as cosine 
similarity, and a given similarity score threshold. Sometimes, this problem is 
called a “similarity join”.

The brute force approach of considering all pairs quickly breaks, since it 
scales quadratically. For example, for a million vectors, it is not feasible to 
check all roughly trillion pairs to see if they are above the similarity 
threshold. Having said that, there exist clever sampling techniques to focus 
the computational effort on those pairs that are above the similarity 
threshold, which makes the problem feasible.

DIMSUM has a single parameter (called gamma) to tradeoff computation time vs 
accuracy. Setting gamma from 1 to the largest magnitude allows tradeoff of 
computation vs accuracy from low computation to high accuracy. For a very large 
gamma, all cosine similarities are computed exactly with no sampling.

Current PR:
https://github.com/apache/spark/pull/1778

Justification for adding to MLlib:
- All-pairs similarity is missing from MLlib and has been requested several 
times, e.g. http://bit.ly/XAFGs8 and separately by Jeremy Freeman (see 
https://github.com/apache/spark/pull/1778#issuecomment-51300825)

- Algorithm is used in large-scale production at Twitter. e.g. see 
https://blog.twitter.com/2012/dimension-independent-similarity-computation-disco
 . Twitter also open-sourced their version in scalding: 
https://github.com/twitter/scalding/pull/833

- When used with the gamma parameter set high, this algorithm becomes the 
normalized gramian matrix, which is useful in RowMatrix alongside the 
computeGramianMatrix method already in RowMatrix

More details about usage at Twitter: 
https://blog.twitter.com/2014/all-pairs-similarity-via-dimsum

For correctness proof, see Theorem 4.3 in 
http://stanford.edu/~rezab/papers/dimsum.pdf
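
A hedged usage sketch of how this might be invoked, assuming the API in the 
linked PR lands roughly as RowMatrix.columnSimilarities (exact with no 
argument, DIMSUM sampling with a positive threshold):
{code}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.{SparkConf, SparkContext}

object DimsumSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("DimsumSketch").setMaster("local[2]"))
    val rows = sc.parallelize(Seq(
      Vectors.dense(1.0, 0.0, 2.0),
      Vectors.dense(0.0, 3.0, 4.0)))
    val mat = new RowMatrix(rows)
    val exact  = mat.columnSimilarities()      // brute force over all column pairs
    val approx = mat.columnSimilarities(0.1)   // DIMSUM sampling with a similarity threshold
    println(approx.entries.count())
    sc.stop()
  }
}
{code}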

  was:
Build all-pairs similarity algorithm via DIMSUM. 

Given a dataset of sparse vector data, the all-pairs similarity problem is to 
find all similar vector pairs according to a similarity function such as cosine 
similarity, and a given similarity score threshold. Sometimes, this problem is 
called a “similarity join”.

The brute force approach of considering all pairs quickly breaks, since it 
scales quadratically. For example, for a million vectors, it is not feasible to 
check all roughly trillion pairs to see if they are above the similarity 
threshold. Having said that, there exist clever sampling techniques to focus 
the computational effort on those pairs that are above the similarity 
threshold, which makes the problem feasible.

DIMSUM has a single parameter (called gamma) to tradeoff computation time vs 
accuracy. Setting gamma from 1 to the largest magnitude allows tradeoff of 
computation vs accuracy from low computation to high accuracy. For a very large 
gamma, all cosine similarities are computed exactly with no sampling.

Current PR:
https://github.com/apache/spark/pull/1778

Justification for adding to MLlib:
- All-pairs similarity is missing from MLlib and has been requested several 
times, e.g. http://bit.ly/XAFGs8 and separately by Jeremy Freeman (see 
https://github.com/apache/spark/pull/1778#issuecomment-51300825)

- Algorithm is used in large-scale production at Twitter. e.g. see 
https://blog.twitter.com/2012/dimension-independent-similarity-computation-disco
 . Twitter also open-sourced their version in scalding: 
https://github.com/twitter/scalding/pull/833

- When used with the gamma parameter set high, this algorithm becomes the 
normalized gramian matrix, which is useful in RowMatrix alongside the 
computeGramianMatrix method already in RowMatrix


 All-pairs similarity via DIMSUM
 ---

 Key: SPARK-2885
 URL: https://issues.apache.org/jira/browse/SPARK-2885
 Project: Spark
  Issue Type: New Feature
Reporter: Reza Zadeh
Assignee: Reza Zadeh

 Build all-pairs similarity algorithm via DIMSUM. 
 Given a dataset of sparse vector data, the all-pairs similarity problem is to 
 find all similar vector pairs according to a similarity function such as 
 cosine similarity, and a given similarity score threshold. Sometimes, this 
 problem is called a “similarity join”.
 The brute force approach of considering all pairs quickly breaks, since it 
 scales quadratically. For example, for a million vectors, it is not feasible 
 to check all roughly trillion pairs to see if they are above the similarity 
 threshold. Having said that, there exist clever sampling techniques to focus 
 the computational effort on those pairs that 

[jira] [Updated] (SPARK-3324) YARN module has nonstandard structure which cause compile error In IntelliJ

2014-08-30 Thread Yi Tian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Tian updated SPARK-3324:
---

Description: 
The YARN module has a nonstandard path structure:
{code}
${SPARK_HOME}
  |--yarn
     |--alpha (contains YARN API support for 0.23 and 2.0.x)
     |--stable (contains YARN API support for 2.2 and later)
     |  |--pom.xml (spark-yarn)
     |--common (common code not depending on a specific version of Hadoop)
     |--pom.xml (yarn-parent)
{code}
When we compile the yarn module with Maven, Maven imports the 'alpha' or 
'stable' submodule according to the active profile, and that submodule (e.g. 
'stable') uses a build property defined in yarn/pom.xml to add the common code 
to its source path.
As a result, IntelliJ cannot directly recognize the sources in the common 
directory as a source path.

I think we should change the yarn module into a single Maven jar project and 
specify the YARN API version via a Maven profile setting.

This would resolve the compile error in IntelliJ and make the yarn module 
simpler and clearer.

  was:
The YARN module has nonstandard path structure like:
{code}
${SPARK_HOME}
  |--yarn
 |--alpha (contains yarn api support for 0.23 and 2.0.x)
 |--stable (contains yarn api support for 2.2 and later)
 | |--pom.xml (spark-yarn)
 |--common (Common codes not depending on specific version of Hadoop)
 |--pom.xml (yarn-parent)
{code}
When we use maven to compile yarn module, maven will import 'alpha' or 'stable' 
module according to profile setting.
And the submodule like 'stable' use the build propertie defined in yarn/pom.xml 
to import common codes to sourcePath.
It will cause IntelliJ can't directly recogniz codes in common directory as 
sourcePath. 

I thought we should change the yarn module to a unified maven jar project, 
and choose the version of yarn api via maven profile setting in the pom.xml.

It will resolve the compile error in IntelliJ and make the yarn module more 
simple and clear.


 YARN module has nonstandard structure which cause compile error In IntelliJ
 ---

 Key: SPARK-3324
 URL: https://issues.apache.org/jira/browse/SPARK-3324
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.1.0
 Environment: Mac OS: 10.9.4
 IntelliJ IDEA: 13.1.4
 Scala Plugins: 0.41.2
 Maven: 3.0.5
Reporter: Yi Tian
Priority: Minor
  Labels: intellij, maven, yarn

 The YARN module has a nonstandard path structure:
 {code}
 ${SPARK_HOME}
   |--yarn
      |--alpha (contains YARN API support for 0.23 and 2.0.x)
      |--stable (contains YARN API support for 2.2 and later)
      |  |--pom.xml (spark-yarn)
      |--common (common code not depending on a specific version of Hadoop)
      |--pom.xml (yarn-parent)
 {code}
 When we compile the yarn module with Maven, Maven imports the 'alpha' or 
 'stable' submodule according to the active profile, and that submodule (e.g. 
 'stable') uses a build property defined in yarn/pom.xml to add the common 
 code to its source path.
 As a result, IntelliJ cannot directly recognize the sources in the common 
 directory as a source path.
 I think we should change the yarn module into a single Maven jar project and 
 specify the YARN API version via a Maven profile setting.
 This would resolve the compile error in IntelliJ and make the yarn module 
 simpler and clearer.






[jira] [Commented] (SPARK-2558) Mention --queue argument in YARN documentation

2014-08-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14116614#comment-14116614
 ] 

Apache Spark commented on SPARK-2558:
-

User 'kramimus' has created a pull request for this issue:
https://github.com/apache/spark/pull/2218

 Mention --queue argument in YARN documentation 
 ---

 Key: SPARK-2558
 URL: https://issues.apache.org/jira/browse/SPARK-2558
 Project: Spark
  Issue Type: Documentation
  Components: YARN
Reporter: Matei Zaharia
Priority: Trivial
  Labels: Starter

 The docs about it went away when we updated the page to spark-submit.






[jira] [Commented] (SPARK-3324) YARN module has nonstandard structure which cause compile error In IntelliJ

2014-08-30 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14116616#comment-14116616
 ] 

Patrick Wendell commented on SPARK-3324:


Hi,

I don't totally understand the proposal. But in general, the issue is that 
these YARN versions have different APIs, so you cannot have a single folder 
of source code that compiles against both YARN APIs. So if you mean the 
profile would simply influence the version of YARN we compile against, that's 
not sufficient. Maybe you have something more elegant in mind? If you could 
produce a small example of how you'd do this using profiles, I'd be happy to 
take a look at it and we can see if it's better or worse than what we have now 
:)

Our current approach does have the issue of being IDE-unfriendly.

 YARN module has nonstandard structure which cause compile error In IntelliJ
 ---

 Key: SPARK-3324
 URL: https://issues.apache.org/jira/browse/SPARK-3324
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.1.0
 Environment: Mac OS: 10.9.4
 IntelliJ IDEA: 13.1.4
 Scala Plugins: 0.41.2
 Maven: 3.0.5
Reporter: Yi Tian
Priority: Minor
  Labels: intellij, maven, yarn

 The YARN module has a nonstandard path structure:
 {code}
 ${SPARK_HOME}
   |--yarn
      |--alpha (contains YARN API support for 0.23 and 2.0.x)
      |--stable (contains YARN API support for 2.2 and later)
      |  |--pom.xml (spark-yarn)
      |--common (common code not depending on a specific version of Hadoop)
      |--pom.xml (yarn-parent)
 {code}
 When we compile the yarn module with Maven, Maven imports the 'alpha' or 
 'stable' submodule according to the active profile, and that submodule (e.g. 
 'stable') uses a build property defined in yarn/pom.xml to add the common 
 code to its source path.
 As a result, IntelliJ cannot directly recognize the sources in the common 
 directory as a source path.
 I think we should change the yarn module into a single Maven jar project and 
 specify the YARN API version via a Maven profile setting.
 This would resolve the compile error in IntelliJ and make the yarn module 
 simpler and clearer.






[jira] [Updated] (SPARK-3287) When ResourceManager High Availability is enabled, ApplicationMaster webUI is not displayed.

2014-08-30 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3287:
---

Component/s: (was: Spark Core)

 When ResourceManager High Availability is enabled, ApplicationMaster webUI is 
 not displayed.
 

 Key: SPARK-3287
 URL: https://issues.apache.org/jira/browse/SPARK-3287
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.0.2
Reporter: Benoy Antony
 Attachments: SPARK-3287.patch


 When ResourceManager High Availability is enabled, there are multiple 
 resource managers and each of them can act as a proxy.
 AmIpFilter was modified to accept multiple proxy hosts, but the Spark 
 ApplicationMaster fails to read the ResourceManager IPs properly from the 
 configuration.
 As a result, AmIpFilter is initialized with an empty set of proxy hosts, and 
 any access to the ApplicationMaster WebUI is redirected to the RM port on the 
 local host. 






[jira] [Commented] (SPARK-1239) Don't fetch all map outputs at each reducer during shuffles

2014-08-30 Thread bc Wong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14116642#comment-14116642
 ] 

bc Wong commented on SPARK-1239:


Is anyone working on this? If so, could someone please update the jira? Thanks!

 Don't fetch all map outputs at each reducer during shuffles
 ---

 Key: SPARK-1239
 URL: https://issues.apache.org/jira/browse/SPARK-1239
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Patrick Wendell
Assignee: Andrew Or
 Fix For: 1.1.0


 Instead we should modify the way we fetch map output statuses to take both a 
 mapper and a reducer - or we should just piggyback the statuses on each task. 






[jira] [Commented] (SPARK-2579) Reading from S3 returns an inconsistent number of items with Spark 0.9.1

2014-08-30 Thread Chris Fregly (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14116643#comment-14116643
 ] 

Chris Fregly commented on SPARK-2579:
-

interesting and possibly-related blog post from netflix earlier this year:  
http://techblog.netflix.com/2014/01/s3mper-consistency-in-cloud.html

 Reading from S3 returns an inconsistent number of items with Spark 0.9.1
 

 Key: SPARK-2579
 URL: https://issues.apache.org/jira/browse/SPARK-2579
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 0.9.1
Reporter: Eemil Lagerspetz
Priority: Critical
  Labels: hdfs, read, s3, skipping

 I have created a random matrix of 1M rows with 10K items on each row, 
 semicolon-separated. While reading it with Spark 0.9.1 and doing a count, I 
 consistently get less than 1M rows, and a different number every time at that 
 ( !! ). Example below:
 head -n 1 tool-generate-random-matrix*log
 == tool-generate-random-matrix-999158.log ==
 Row item counts: 999158
 == tool-generate-random-matrix.log ==
 Row item counts: 997163
 The data is split into 1000 partitions. When I download it using s3cmd sync, 
 and run the following AWK on it, I get the correct number of rows in each 
 partition (1000x1000 = 1M). What is up?
 {code:title=checkrows.sh|borderStyle=solid}
 for k in part-0*
 do
   echo $k
   awk -F ';' '
   NF != 1 {
     print "Wrong number of items:", NF
   }
   END {
     if (NR != 1000) {
       print "Wrong number of rows:", NR
     }
   }' $k
 done
 {code}
 The matrix generation and counting code is below:
 {code:title=Matrix.scala|borderStyle=solid}
 package fi.helsinki.cs.nodes.matrix
 import java.util.Random
 import org.apache.spark._
 import org.apache.spark.SparkContext._
 import scala.collection.mutable.ListBuffer
 import org.apache.spark.rdd.RDD
 import org.apache.spark.storage.StorageLevel._
 object GenerateRandomMatrix {
   def NewGeMatrix(rSeed: Int, rdd: RDD[Int], features: Int) = {
     rdd.mapPartitions(part => part.map(xarr => {
       val rdm = new Random(rSeed + xarr)
       val arr = new Array[Double](features)
       for (i <- 0 until features)
         arr(i) = rdm.nextDouble()
       new Row(xarr, arr)
     }))
   }
   case class Row(id: Int, elements: Array[Double]) {}
   def rowFromText(line: String) = {
     val idarr = line.split(" ")
     val arr = idarr(1).split(";")
     // -1 to fix saved matrix indexing error
     new Row(idarr(0).toInt - 1, arr.map(_.toDouble))
   }
   def main(args: Array[String]) {
     val master = args(0)
     val tasks = args(1).toInt
     val savePath = args(2)
     val read = args.contains("read")

     val datapoints = 100
     val features = 1
     val sc = new SparkContext(master, "RandomMatrix")
     if (read) {
       val randomMatrix: RDD[Row] = sc.textFile(savePath, tasks)
         .map(rowFromText).persist(MEMORY_AND_DISK)
       println("Row item counts: " + randomMatrix.count)
     } else {
       val rdd = sc.parallelize(0 until datapoints, tasks)
       val bcSeed = sc.broadcast(128)
       /* Generating a matrix of random Doubles */
       val randomMatrix = NewGeMatrix(bcSeed.value, rdd, features)
         .persist(MEMORY_AND_DISK)
       randomMatrix.map(row => row.id + " " + row.elements.mkString(";"))
         .saveAsTextFile(savePath)
     }

     sc.stop
   }
 }
 {code}
 I run this with:
 appassembler/bin/tool-generate-random-matrix master 1000 
 s3n://keys@path/to/data 1>matrix.log 2>matrix.err
 Reading from HDFS gives the right count and the right number of items on each 
 row. However, I had to run with the full path including the server name; just 
 /matrix does not work (it thinks I want file://):
 p=hdfs://ec2-54-188-6-77.us-west-2.compute.amazonaws.com:9000/matrix
 appassembler/bin/tool-generate-random-matrix $( cat 
 /root/spark-ec2/cluster-url ) 1000 $p read 1>readmatrix.log 2>readmatrix.err


