Re: Announcing the official Spark Job Server repo

2014-03-19 Thread Evan Chan
Andy,

Yeah, we've thought of deploying this on Marathon ourselves, but we're
not sure how much Mesos we're going to use yet. (Indeed, if you look
at bin/server_start.sh, I think I set up the PORT environment var
specifically for Marathon.) This is also why we have deploy scripts
which package everything into a .tar.gz, again for Mesos deployment.
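
For what it's worth, a minimal sketch of the idea behind that PORT handling
(the 8090 default is assumed here for illustration, not read from the repo):

    // Sketch only: bind to whatever port Marathon assigns via the PORT env var,
    // falling back to a default when running outside Marathon.
    object PortFromMarathon {
      def resolvePort(default: Int = 8090): Int =
        sys.env.get("PORT").map(_.toInt).getOrElse(default)

      def main(args: Array[String]): Unit =
        println(s"job server would bind to port ${resolvePort()}")
    }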

If you do try this, please let us know.  :)

-Evan


On Tue, Mar 18, 2014 at 3:57 PM, andy petrella andy.petre...@gmail.com wrote:
 tad! That's awesome.

 A quick question: does anyone have insights regarding deploying such
 Job Servers using Marathon on Mesos?

 I'm thinking about an architecture where Marathon would deploy and keep the Job
 Servers running alongside part of the whole set of apps deployed on it,
 sized according to the resources needed (à la Jenkins).

 Any idea is welcome.

 Back to the news, Evan + Ooyala team: Great Job again.

 andy

 On Tue, Mar 18, 2014 at 11:39 PM, Henry Saputra
 henry.sapu...@gmail.com wrote:

 W00t!

 Thanks for releasing this, Evan.

 - Henry

 On Tue, Mar 18, 2014 at 1:51 PM, Evan Chan e...@ooyala.com wrote:
  Dear Spark developers,
 
  Ooyala is happy to announce that we have pushed our official, Spark
  0.9.0 / Scala 2.10-compatible, job server as a github repo:
 
  https://github.com/ooyala/spark-jobserver
 
  Complete with unit tests, deploy scripts, and examples.
 
  The original PR (#222) on incubator-spark is now closed.
 
  Please have a look; pull requests are very welcome.
  --
  --
  Evan Chan
  Staff Engineer
  e...@ooyala.com  |




-- 
--
Evan Chan
Staff Engineer
e...@ooyala.com  |


Re: Announcing the official Spark Job Server repo

2014-03-19 Thread Evan Chan
Matei,

Maybe it's time to explore the spark-contrib idea again?   Should I
start a JIRA ticket?

-Evan


On Tue, Mar 18, 2014 at 4:04 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
 Cool, glad to see this posted! I've added a link to it at 
 https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark.

 Matei

 On Mar 18, 2014, at 1:51 PM, Evan Chan e...@ooyala.com wrote:

 Dear Spark developers,

 Ooyala is happy to announce that we have pushed our official, Spark
 0.9.0 / Scala 2.10-compatible, job server as a github repo:

 https://github.com/ooyala/spark-jobserver

 Complete with unit tests, deploy scripts, and examples.

 The original PR (#222) on incubator-spark is now closed.

 Please have a look; pull requests are very welcome.
 --
 --
 Evan Chan
 Staff Engineer
 e...@ooyala.com  |




-- 
--
Evan Chan
Staff Engineer
e...@ooyala.com  |


Re: [PySpark]: reading arbitrary Hadoop InputFormats

2014-03-19 Thread Nick Pentreath
Hi Matei


I'm afraid I haven't had enough time to focus on this as work has just been 
crazy. It's still something I want to get to a mergeable status. 




Actually, it was working fine; it was just a bit rough and needs to be updated to
HEAD.




I'll absolutely try my utmost to get something ready to merge before the window
for 1.0 closes. Perhaps we can put it in there (once I've updated and cleaned
up) as a more experimental feature? What is the view on having such relatively
untested (in production, that is) stuff in 1.0?
—
Sent from Mailbox for iPhone

On Wed, Mar 19, 2014 at 12:15 AM, Matei Zaharia matei.zaha...@gmail.com
wrote:

 Hey Nick, I’m curious, have you been doing any further development on this? 
 It would be good to get expanded InputFormat support in Spark 1.0. To start 
 with we don’t have to do SequenceFiles in particular, we can do stuff like 
 Avro (if it’s easy to read in Python) or some kind of WholeFileInputFormat.
 Matei
 On Dec 19, 2013, at 10:57 AM, Nick Pentreath nick.pentre...@gmail.com wrote:
 Hi
 
 
 I managed to find the time to put together a PR on this: 
 https://github.com/apache/incubator-spark/pull/263
 
 
 
 
 Josh has had a look over it - if anyone else with an interest could give 
 some feedback that would be great.
 
 
 
 
 As mentioned in the PR it's more of an RFC and certainly still needs a bit
 of clean-up work, and I need to add the concept of wrapper functions to
 deserialize classes that MsgPack can't handle out of the box.
 
 
 
 
 N
 —
 Sent from Mailbox for iPhone
 
 On Fri, Nov 8, 2013 at 12:20 PM, Nick Pentreath nick.pentre...@gmail.com
 wrote:
 
 Wow Josh, that looks great. I've been a bit swamped this week but as soon
 as I get a chance I'll test out the PR in more detail and port over the
 InputFormat stuff to use the new framework (including the changes you
 suggested).
 I can then look deeper into the MsgPack functionality to see if it can be
 made to work in a generic enough manner without requiring huge amounts of
 custom Templates to be written by users.
 Will feed back asap.
 N
 On Thu, Nov 7, 2013 at 5:03 AM, Josh Rosen rosenvi...@gmail.com wrote:
 I opened a pull request to add custom serializer support to PySpark:
 https://github.com/apache/incubator-spark/pull/146
 
 My pull request adds the plumbing for transferring data from Java to Python
 using formats other than Pickle.  For example, look at how textFile() uses
 MUTF8Deserializer to read strings from Java.  Hopefully this provides all
 of the functionality needed to support MsgPack.
 
 - Josh
 
 
 On Thu, Oct 31, 2013 at 11:11 AM, Josh Rosen rosenvi...@gmail.com wrote:
 
 Hi Nick,
 
 This is a nice start.  I'd prefer to keep the Java sequenceFileAsText()
 and newHadoopFileAsText() methods inside PythonRDD instead of adding them
 to JavaSparkContext, since I think these methods are unlikely to be used
 directly by Java users (you can add these methods to the PythonRDD
 companion object, which is how readRDDFromPickleFile is implemented:
 
 https://github.com/apache/incubator-spark/blob/branch-0.8/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L255
 )
 
 For MsgPack, the UnpicklingError is because the Python worker expects to
 receive its input in a pickled format.  In my prototype of custom
 serializers, I modified the PySpark worker to receive its
 serialization/deserialization function as input (
 
 https://github.com/JoshRosen/spark/blob/59b6b43916dc84fc8b83f22eb9ce13a27bc51ec0/python/pyspark/worker.py#L41
 )
 and added logic to pass the appropriate serializers based on each stage's
 input and output formats (
 
 https://github.com/JoshRosen/spark/blob/59b6b43916dc84fc8b83f22eb9ce13a27bc51ec0/python/pyspark/rdd.py#L42
 ).
 
 At some point, I'd like to port my custom serializers code to PySpark; if
 anyone's interested in helping, I'd be glad to write up some additional
 notes on how this should work.
 
 - Josh
 
 On Wed, Oct 30, 2013 at 2:25 PM, Nick Pentreath 
 nick.pentre...@gmail.com wrote:
 
 Thanks Josh, Patrick for the feedback.
 
 Based on Josh's pointers I have something working for JavaPairRDD ->
 PySpark RDD[(String, String)]. This just calls the toString method on
 each key and value as before, but without the need for a delimiter. For
 SequenceFile, it uses SequenceFileAsTextInputFormat which itself calls
 toString to convert to Text for keys and values. We then call toString
 (again) ourselves to get Strings to feed to writeAsPickle.
 
 Details here: https://gist.github.com/MLnick/7230588
 
 This also illustrates where the wrapper function API would fit in. All
 that is required is to define a T => String for key and value.
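
For context, a rough Scala-side sketch of the SequenceFileAsTextInputFormat +
toString approach described above (the object name here is illustrative, not
the actual PR code):

    import org.apache.hadoop.io.Text
    import org.apache.hadoop.mapred.SequenceFileAsTextInputFormat
    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    object SequenceFileAsText {
      // Read a SequenceFile through SequenceFileAsTextInputFormat (old mapred
      // API) and call toString so only plain Strings cross over to Python.
      def apply(sc: SparkContext, path: String): RDD[(String, String)] =
        sc.hadoopFile[Text, Text, SequenceFileAsTextInputFormat](path)
          .map { case (k, v) => (k.toString, v.toString) }
    }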
 
 I started playing around with MsgPack and can sort of get things to work in
 Scala, but am struggling with getting the raw bytes to be written properly
 in PythonRDD (I think it is treating them as pickled byte arrays when they
 are not, but when I removed the 'stripPickle' calls and amended the length
 (-6) I got UnpicklingError: 

[Exception]: Could not obtain block

2014-03-19 Thread mohit.goyal
I am getting the below error while running a Scala application with an input file
present in HDFS.

Exception in thread "main" org.apache.spark.SparkException: Job aborted:
Task 1.0:41 failed 4 times (most recent failure: Exception failure:
java.io.IOException: Could not obtain block: blk_5289257631743300391_2097
file=/data/1.txtl)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1026)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1026)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)
at scala.Option.foreach(Option.scala:236)
at
org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:619)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:207)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at
scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)


But when I simply do hadoop dfs -cat /data/1.txt, it displays the complete file
with no error.

Spark Version=0.9.0
Hadoop version=1.2.1

Any idea??





Re: ALS solve.solvePositive

2014-03-19 Thread Xiangrui Meng
Another question: do you have negative or out-of-range user or product
ids? -Xiangrui

On Tue, Mar 11, 2014 at 8:00 PM, Debasish Das debasish.da...@gmail.com wrote:
 Nope..I did not test implicit feedback yet...will get into more detailed
 debug and generate the testcase hopefully next week...
 On Mar 11, 2014 7:02 PM, Xiangrui Meng men...@gmail.com wrote:

 Hi Deb, did you use ALS with implicit feedback? -Xiangrui

 On Mon, Mar 10, 2014 at 1:17 PM, Xiangrui Meng men...@gmail.com wrote:
  Choosing lambda = 0.1 shouldn't lead to the error you got. This is
  probably a bug. Do you mind sharing a small amount of data that can
  re-produce the error? -Xiangrui
 
  On Fri, Mar 7, 2014 at 8:24 AM, Debasish Das debasish.da...@gmail.com
 wrote:
  Hi Xiangrui,
 
  I used lambda = 0.1...It is possible that 2 users ranked in movies in a
  very similar way...
 
  I agree that increasing lambda will solve the problem but you agree
 this is
  not a solution...lambda should be tuned based on sparsity / other
 criteria
  and not to make a linearly dependent hessian matrix linearly
  independent...
 
  Thanks.
  Deb
 
 
 
 
 
  On Thu, Mar 6, 2014 at 7:20 PM, Xiangrui Meng men...@gmail.com wrote:
 
  If the matrix is very ill-conditioned, then A^T A becomes numerically
  rank deficient. However, if you use a reasonably large positive
  regularization constant (lambda), A^T A + lambda I should be still
  positive definite. What was the regularization constant (lambda) you
  set? Could you test whether the error still happens when you use a
  large lambda?
 
  Best,
  Xiangrui
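
For reference, the standard argument behind Xiangrui's point about lambda above
(plain regularized least squares, nothing specific to the MLlib code): the
per-factor solve is

    \[
    \min_x \|Ax - b\|^2 + \lambda \|x\|^2
    \quad\Longrightarrow\quad
    (A^\top A + \lambda I)\, x = A^\top b ,
    \]
    and for any nonzero $x$ with $\lambda > 0$,
    \[
    x^\top (A^\top A + \lambda I)\, x = \|Ax\|^2 + \lambda \|x\|^2 > 0 ,
    \]

so the matrix handed to solvePositive stays positive definite even when
$A^\top A$ itself is numerically rank deficient.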
 



Re: Announcing the official Spark Job Server repo

2014-03-19 Thread Patrick Wendell
Evan - yep definitely open a JIRA. It would be nice to have a contrib
repo set-up for the 1.0 release.

On Tue, Mar 18, 2014 at 11:28 PM, Evan Chan e...@ooyala.com wrote:
 Matei,

 Maybe it's time to explore the spark-contrib idea again?   Should I
 start a JIRA ticket?

 -Evan


 On Tue, Mar 18, 2014 at 4:04 PM, Matei Zaharia matei.zaha...@gmail.com 
 wrote:
 Cool, glad to see this posted! I've added a link to it at 
 https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark.

 Matei

 On Mar 18, 2014, at 1:51 PM, Evan Chan e...@ooyala.com wrote:

 Dear Spark developers,

 Ooyala is happy to announce that we have pushed our official, Spark
 0.9.0 / Scala 2.10-compatible, job server as a github repo:

 https://github.com/ooyala/spark-jobserver

 Complete with unit tests, deploy scripts, and examples.

 The original PR (#222) on incubator-spark is now closed.

 Please have a look; pull requests are very welcome.
 --
 --
 Evan Chan
 Staff Engineer
 e...@ooyala.com  |




 --
 --
 Evan Chan
 Staff Engineer
 e...@ooyala.com  |


Re: Announcing the official Spark Job Server repo

2014-03-19 Thread Gerard Maas
this is cool +1


On Wed, Mar 19, 2014 at 6:54 PM, Patrick Wendell pwend...@gmail.com wrote:

 Evan - yep definitely open a JIRA. It would be nice to have a contrib
 repo set-up for the 1.0 release.

 On Tue, Mar 18, 2014 at 11:28 PM, Evan Chan e...@ooyala.com wrote:
  Matei,
 
  Maybe it's time to explore the spark-contrib idea again?   Should I
  start a JIRA ticket?
 
  -Evan
 
 
  On Tue, Mar 18, 2014 at 4:04 PM, Matei Zaharia matei.zaha...@gmail.com
 wrote:
  Cool, glad to see this posted! I've added a link to it at
 https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark.
 
  Matei
 
  On Mar 18, 2014, at 1:51 PM, Evan Chan e...@ooyala.com wrote:
 
  Dear Spark developers,
 
  Ooyala is happy to announce that we have pushed our official, Spark
  0.9.0 / Scala 2.10-compatible, job server as a github repo:
 
  https://github.com/ooyala/spark-jobserver
 
  Complete with unit tests, deploy scripts, and examples.
 
  The original PR (#222) on incubator-spark is now closed.
 
  Please have a look; pull requests are very welcome.
  --
  --
  Evan Chan
  Staff Engineer
  e...@ooyala.com  |
 
 
 
 
  --
  --
  Evan Chan
  Staff Engineer
  e...@ooyala.com  |



Re: [PySpark]: reading arbitrary Hadoop InputFormats

2014-03-19 Thread Matei Zaharia
Hey Nick, no worries if this can’t be done in time. It’s probably better to 
test it thoroughly. If you do have something partially working though, the main 
concern will be the API, i.e. whether it’s an API we want to support 
indefinitely. It would be bad to add this and then make major changes to what 
it returns. But if we’re comfortable with the API, we can mark it as 
experimental and include it. It might be better to support fewer data types at
first and then add some later, just to keep the API small.

Matei

On Mar 18, 2014, at 11:44 PM, Nick Pentreath nick.pentre...@gmail.com wrote:

 Hi Matei
 
 
 I'm afraid I haven't had enough time to focus on this as work has just been 
 crazy. It's still something I want to get to a mergeable status. 
 
 
 
 
 Actually it was working fine it was just a bit rough and needs to be updated 
 to HEAD.
 
 
 
 
 I'll absolutely try my utmost to get something ready to merge before the 
 window for 1.0 closes. Perhaps we can put it in there (once I've updated and 
 cleaned up) as a more experimental feature? What is the view on having such 
 more untested (as in production) stuff in 1.0?
 —
 Sent from Mailbox for iPhone
 
 On Wed, Mar 19, 2014 at 12:15 AM, Matei Zaharia matei.zaha...@gmail.com
 wrote:
 
 Hey Nick, I’m curious, have you been doing any further development on this? 
 It would be good to get expanded InputFormat support in Spark 1.0. To start 
 with we don’t have to do SequenceFiles in particular, we can do stuff like 
 Avro (if it’s easy to read in Python) or some kind of WholeFileInputFormat.
 Matei
 On Dec 19, 2013, at 10:57 AM, Nick Pentreath nick.pentre...@gmail.com 
 wrote:
 Hi
 
 
 I managed to find the time to put together a PR on this: 
 https://github.com/apache/incubator-spark/pull/263
 
 
 
 
 Josh has had a look over it - if anyone else with an interest could give 
 some feedback that would be great.
 
 
 
 
 As mentioned in the PR it's more of an RFC and certainly still needs a bit 
 of clean up work, and I need to add the concept of wrapper functions to 
 deserialize classes that MsgPack can't handle out the box.
 
 
 
 
 N
 —
 Sent from Mailbox for iPhone
 
 On Fri, Nov 8, 2013 at 12:20 PM, Nick Pentreath nick.pentre...@gmail.com
 wrote:
 
 Wow Josh, that looks great. I've been a bit swamped this week but as soon
 as I get a chance I'll test out the PR in more detail and port over the
 InputFormat stuff to use the new framework (including the changes you
 suggested).
 I can then look deeper into the MsgPack functionality to see if it can be
 made to work in a generic enough manner without requiring huge amounts of
 custom Templates to be written by users.
 Will feed back asap.
 N
 On Thu, Nov 7, 2013 at 5:03 AM, Josh Rosen rosenvi...@gmail.com wrote:
 I opened a pull request to add custom serializer support to PySpark:
 https://github.com/apache/incubator-spark/pull/146
 
 My pull request adds the plumbing for transferring data from Java to 
 Python
 using formats other than Pickle.  For example, look at how textFile() uses
 MUTF8Deserializer to read strings from Java.  Hopefully this provides all
 of the functionality needed to support MsgPack.
 
 - Josh
 
 
 On Thu, Oct 31, 2013 at 11:11 AM, Josh Rosen rosenvi...@gmail.com wrote:
 
 Hi Nick,
 
 This is a nice start.  I'd prefer to keep the Java sequenceFileAsText()
 and newHadoopFileAsText() methods inside PythonRDD instead of adding them
 to JavaSparkContext, since I think these methods are unlikely to be used
 directly by Java users (you can add these methods to the PythonRDD
 companion object, which is how readRDDFromPickleFile is implemented:
 
 https://github.com/apache/incubator-spark/blob/branch-0.8/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L255
 )
 
 For MsgPack, the UnpicklingError is because the Python worker expects to
 receive its input in a pickled format.  In my prototype of custom
 serializers, I modified the PySpark worker to receive its
 serialization/deserialization function as input (
 
 https://github.com/JoshRosen/spark/blob/59b6b43916dc84fc8b83f22eb9ce13a27bc51ec0/python/pyspark/worker.py#L41
 )
 and added logic to pass the appropriate serializers based on each stage's
 input and output formats (
 
 https://github.com/JoshRosen/spark/blob/59b6b43916dc84fc8b83f22eb9ce13a27bc51ec0/python/pyspark/rdd.py#L42
 ).
 
 At some point, I'd like to port my custom serializers code to PySpark; if
 anyone's interested in helping, I'd be glad to write up some additional
 notes on how this should work.
 
 - Josh
 
 On Wed, Oct 30, 2013 at 2:25 PM, Nick Pentreath 
 nick.pentre...@gmail.com wrote:
 
 Thanks Josh, Patrick for the feedback.
 
 Based on Josh's pointers I have something working for JavaPairRDD -
 PySpark RDD[(String, String)]. This just calls the toString method on
 each
 key and value as before, but without the need for a delimiter. For
 SequenceFile, it uses SequenceFileAsTextInputFormat which itself calls
 toString to convert to Text for keys and 

Re: [PySpark]: reading arbitrary Hadoop InputFormats

2014-03-19 Thread Nick Pentreath
Ok - I'll work something up and reopen a PR against the new spark mirror.


The API itself mirrors the newHadoopFile etc methods, so that should be quite 
stable once finalised.




It's the wrapper stuff, i.e. how to serialize custom classes and read them in
Python, that is the potentially tricky part.
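
Roughly, such a wrapper could look like this on the Scala side (names made up
for illustration; the real API was still being worked out at this point):

    import org.apache.spark.rdd.RDD

    object PythonConverters {
      // Sketch: user-supplied converters turn arbitrary key/value types into
      // Strings that the Python side already knows how to deserialize.
      def withConverters[K, V](rdd: RDD[(K, V)])(
          keyConv: K => String, valueConv: V => String): RDD[(String, String)] =
        rdd.map { case (k, v) => (keyConv(k), valueConv(v)) }
    }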



—
Sent from Mailbox for iPhone

On Wed, Mar 19, 2014 at 8:55 PM, Matei Zaharia matei.zaha...@gmail.com
wrote:

 Hey Nick, no worries if this can’t be done in time. It’s probably better to 
 test it thoroughly. If you do have something partially working though, the 
 main concern will be the API, i.e. whether it’s an API we want to support 
 indefinitely. It would be bad to add this and then make major changes to what 
 it returns. But if we’re comfortable with the API, we can mark it as 
 experimental and include it. It might be better to support fewer data types 
 at first and then add some just to keep the API small.
 Matei
 On Mar 18, 2014, at 11:44 PM, Nick Pentreath nick.pentre...@gmail.com wrote:
 Hi Matei
 
 
 I'm afraid I haven't had enough time to focus on this as work has just been 
 crazy. It's still something I want to get to a mergeable status. 
 
 
 
 
 Actually it was working fine it was just a bit rough and needs to be updated 
 to HEAD.
 
 
 
 
 I'll absolutely try my utmost to get something ready to merge before the 
 window for 1.0 closes. Perhaps we can put it in there (once I've updated and 
 cleaned up) as a more experimental feature? What is the view on having such 
 more untested (as in production) stuff in 1.0?
 —
 Sent from Mailbox for iPhone
 
 On Wed, Mar 19, 2014 at 12:15 AM, Matei Zaharia matei.zaha...@gmail.com
 wrote:
 
 Hey Nick, I’m curious, have you been doing any further development on this? 
 It would be good to get expanded InputFormat support in Spark 1.0. To start 
 with we don’t have to do SequenceFiles in particular, we can do stuff like 
 Avro (if it’s easy to read in Python) or some kind of WholeFileInputFormat.
 Matei
 On Dec 19, 2013, at 10:57 AM, Nick Pentreath nick.pentre...@gmail.com 
 wrote:
 Hi
 
 
 I managed to find the time to put together a PR on this: 
 https://github.com/apache/incubator-spark/pull/263
 
 
 
 
 Josh has had a look over it - if anyone else with an interest could give 
 some feedback that would be great.
 
 
 
 
 As mentioned in the PR it's more of an RFC and certainly still needs a bit 
 of clean up work, and I need to add the concept of wrapper functions to 
 deserialize classes that MsgPack can't handle out the box.
 
 
 
 
 N
 —
 Sent from Mailbox for iPhone
 
 On Fri, Nov 8, 2013 at 12:20 PM, Nick Pentreath nick.pentre...@gmail.com
 wrote:
 
 Wow Josh, that looks great. I've been a bit swamped this week but as soon
 as I get a chance I'll test out the PR in more detail and port over the
 InputFormat stuff to use the new framework (including the changes you
 suggested).
 I can then look deeper into the MsgPack functionality to see if it can be
 made to work in a generic enough manner without requiring huge amounts of
 custom Templates to be written by users.
 Will feed back asap.
 N
 On Thu, Nov 7, 2013 at 5:03 AM, Josh Rosen rosenvi...@gmail.com wrote:
 I opened a pull request to add custom serializer support to PySpark:
 https://github.com/apache/incubator-spark/pull/146
 
 My pull request adds the plumbing for transferring data from Java to 
 Python
 using formats other than Pickle.  For example, look at how textFile() 
 uses
 MUTF8Deserializer to read strings from Java.  Hopefully this provides all
 of the functionality needed to support MsgPack.
 
 - Josh
 
 
 On Thu, Oct 31, 2013 at 11:11 AM, Josh Rosen rosenvi...@gmail.com 
 wrote:
 
 Hi Nick,
 
 This is a nice start.  I'd prefer to keep the Java sequenceFileAsText()
 and newHadoopFileAsText() methods inside PythonRDD instead of adding 
 them
 to JavaSparkContext, since I think these methods are unlikely to be used
 directly by Java users (you can add these methods to the PythonRDD
 companion object, which is how readRDDFromPickleFile is implemented:
 
 https://github.com/apache/incubator-spark/blob/branch-0.8/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L255
 )
 
 For MsgPack, the UnpicklingError is because the Python worker expects to
 receive its input in a pickled format.  In my prototype of custom
 serializers, I modified the PySpark worker to receive its
 serialization/deserialization function as input (
 
 https://github.com/JoshRosen/spark/blob/59b6b43916dc84fc8b83f22eb9ce13a27bc51ec0/python/pyspark/worker.py#L41
 )
 and added logic to pass the appropriate serializers based on each 
 stage's
 input and output formats (
 
 https://github.com/JoshRosen/spark/blob/59b6b43916dc84fc8b83f22eb9ce13a27bc51ec0/python/pyspark/rdd.py#L42
 ).
 
 At some point, I'd like to port my custom serializers code to PySpark; 
 if
 anyone's interested in helping, I'd be glad to write up some additional
 notes on how this should work.
 
 - Josh
 
 On Wed, Oct 30, 2013 at 

Wrong input split mapping? I am reading a set of files from s3 and writing output to the same account in a different folder. My input split mappings seem to be wrong somehow. It appends base-maps or p

2014-03-19 Thread Usman Ghani
14/03/19 19:11:37 INFO Executor: Serialized size of result for 678 is 1423
14/03/19 19:11:37 INFO Executor: Sending result for 678 directly to driver
14/03/19 19:11:37 INFO Executor: Finished task ID 678
14/03/19 19:11:37 INFO NativeS3FileSystem: Opening key
'test_data/jws/video_logs2/video_logs2_00128G/video_view/video_views_1376612461124_1_data_1.csv'
for reading at position '134217727'
14/03/19 19:11:37 INFO NativeS3FileSystem: Opening key
'test_data/jws/video_logs2/video_logs2_00128G/video_view/video_views_1376612461124_1_data_1.csv'
for reading at position '402653183'
14/03/19 19:11:38 INFO MemoryStore: ensureFreeSpace(82814164) called
with curMem=11412250398, maxMem=35160431001
14/03/19 19:11:38 INFO MemoryStore: Block rdd_5_681 stored as bytes to
memory (size 79.0 MB, free 22.0 GB)
14/03/19 19:11:38 INFO BlockManagerMaster: Updated info of block rdd_5_681
14/03/19 19:11:38 INFO MemoryStore: ensureFreeSpace(83081354) called
with curMem=11495064562, maxMem=35160431001
14/03/19 19:11:38 INFO MemoryStore: Block rdd_5_693 stored as bytes to
memory (size 79.2 MB, free 22.0 GB)
14/03/19 19:11:38 INFO BlockManagerMaster: Updated info of block rdd_5_693
14/03/19 19:11:38 INFO Executor: Serialized size of result for 681 is 1423
14/03/19 19:11:38 INFO Executor: Sending result for 681 directly to driver
14/03/19 19:11:38 INFO Executor: Finished task ID 681
14/03/19 19:11:39 INFO CoarseGrainedExecutorBackend: Got assigned task 707
14/03/19 19:11:39 INFO Executor: Running task ID 707
14/03/19 19:11:39 INFO BlockManager: Found block broadcast_1 locally
14/03/19 19:11:39 INFO CacheManager: Partition rdd_5_685 not found, computing it
14/03/19 19:11:39 INFO NewHadoopRDD: Input split:
s3n://AKIAJ346M2WM3VKBHFJA:ezu6d3li5gu6j3panqtxmihlypliwhqme+du8...@platfora.qa/test_data/jws/video_logs2/video_logs2_00128G/video_view/video_views_1376612461124_1_data_*.csv/proton:0+0
14/03/19 19:11:39 INFO Executor: Serialized size of result for 693 is 1423
14/03/19 19:11:39 INFO Executor: Sending result for 693 directly to driver
14/03/19 19:11:39 INFO Executor: Finished task ID 693
14/03/19 19:11:39 ERROR Executor: Exception in task ID 706
java.io.IOException:
's3n://AKIAJ346M2WM3VKBHFJA:ezu6d3li5gu6j3panqtxmihlypliwhqme+du8...@platfora.qa/test_data/jws/video_logs2/video_logs2_00128G/video_view/video_views_1376612461124_1_data_*.csv/base-maps'
is a directory
at 
org.apache.hadoop.fs.s3native.NativeS3FileSystem.open(NativeS3FileSystem.java:559)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:711)
at 
org.apache.hadoop.mapreduce.lib.input.LineRecordReader.initialize(LineRecordReader.java:75)
at 
org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:96)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:84)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:48)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:71)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:34)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
at org.apache.spark.scheduler.Task.run(Task.scala:53)
at 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
at 
org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
14/03/19 19:11:39 INFO CoarseGrainedExecutorBackend: Got assigned task 708
14/03/19 19:11:39 INFO Executor: Running task ID 
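
One workaround sketch for this kind of layout (bucket and path below are
placeholders, sc is an existing SparkContext): expand the glob on the driver
and keep only plain files, so subdirectories like base-maps never end up as
input splits.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path

    // Expand the glob up front and drop directories before building the RDD.
    val pattern = new Path("s3n://my-bucket/path/to/video_views_*_data_*.csv")
    val fs = pattern.getFileSystem(new Configuration())
    val files = fs.globStatus(pattern).filterNot(_.isDir).map(_.getPath.toString)
    val logs = sc.textFile(files.mkString(","))  // textFile accepts a comma-separated list of paths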

Re: ALS solve.solvePositive

2014-03-19 Thread Debasish Das
Nope...with the cleaner dataset I am not noticing issues with dposv, and
this dataset is even bigger...20 M users and 1 M products...I don't think
anything other than Cholesky will get us the efficiency we need...

For my use case we also need to see the effectiveness of positive factors,
and I am doing variable projection as a start...

If possible, could you please point me to the PRs related to the ALS
improvements? Are they all added to master? There are at least 3 PRs
that Sean and you contributed recently related to ALS efficiency.

A JIRA or gist will definitely help a lot.

Thanks.
Deb



On Wed, Mar 19, 2014 at 10:11 AM, Xiangrui Meng men...@gmail.com wrote:

 Another question: do you have negative or out-of-range user or product
 ids or? -Xiangrui

 On Tue, Mar 11, 2014 at 8:00 PM, Debasish Das debasish.da...@gmail.com
 wrote:
  Nope..I did not test implicit feedback yet...will get into more detailed
  debug and generate the testcase hopefully next week...
  On Mar 11, 2014 7:02 PM, Xiangrui Meng men...@gmail.com wrote:
 
  Hi Deb, did you use ALS with implicit feedback? -Xiangrui
 
  On Mon, Mar 10, 2014 at 1:17 PM, Xiangrui Meng men...@gmail.com
 wrote:
   Choosing lambda = 0.1 shouldn't lead to the error you got. This is
   probably a bug. Do you mind sharing a small amount of data that can
   re-produce the error? -Xiangrui
  
   On Fri, Mar 7, 2014 at 8:24 AM, Debasish Das 
 debasish.da...@gmail.com
  wrote:
   Hi Xiangrui,
  
   I used lambda = 0.1...It is possible that 2 users ranked in movies
 in a
   very similar way...
  
   I agree that increasing lambda will solve the problem but you agree
  this is
   not a solution...lambda should be tuned based on sparsity / other
  criteria
   and not to make a linearly dependent hessian matrix linearly
   independent...
  
   Thanks.
   Deb
  
  
  
  
  
   On Thu, Mar 6, 2014 at 7:20 PM, Xiangrui Meng men...@gmail.com
 wrote:
  
   If the matrix is very ill-conditioned, then A^T A becomes
 numerically
   rank deficient. However, if you use a reasonably large positive
   regularization constant (lambda), A^T A + lambda I should be still
   positive definite. What was the regularization constant (lambda) you
   set? Could you test whether the error still happens when you use a
   large lambda?
  
   Best,
   Xiangrui
  
 



Re: Announcing the official Spark Job Server repo

2014-03-19 Thread Evan Chan
https://spark-project.atlassian.net/browse/SPARK-1283

On Wed, Mar 19, 2014 at 10:59 AM, Gerard Maas gerard.m...@gmail.com wrote:
 this is cool +1


 On Wed, Mar 19, 2014 at 6:54 PM, Patrick Wendell pwend...@gmail.com wrote:

 Evan - yep definitely open a JIRA. It would be nice to have a contrib
 repo set-up for the 1.0 release.

 On Tue, Mar 18, 2014 at 11:28 PM, Evan Chan e...@ooyala.com wrote:
  Matei,
 
  Maybe it's time to explore the spark-contrib idea again?   Should I
  start a JIRA ticket?
 
  -Evan
 
 
  On Tue, Mar 18, 2014 at 4:04 PM, Matei Zaharia matei.zaha...@gmail.com
 wrote:
  Cool, glad to see this posted! I've added a link to it at
 https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark.
 
  Matei
 
  On Mar 18, 2014, at 1:51 PM, Evan Chan e...@ooyala.com wrote:
 
  Dear Spark developers,
 
  Ooyala is happy to announce that we have pushed our official, Spark
  0.9.0 / Scala 2.10-compatible, job server as a github repo:
 
  https://github.com/ooyala/spark-jobserver
 
  Complete with unit tests, deploy scripts, and examples.
 
  The original PR (#222) on incubator-spark is now closed.
 
  Please have a look; pull requests are very welcome.
  --
  --
  Evan Chan
  Staff Engineer
  e...@ooyala.com  |
 
 
 
 
  --
  --
  Evan Chan
  Staff Engineer
  e...@ooyala.com  |




-- 
--
Evan Chan
Staff Engineer
e...@ooyala.com  |


Re: repositories for spark jars

2014-03-19 Thread Evan Chan
The alternative is for Spark not to include hadoop-client explicitly,
perhaps only as a provided dependency, and to provide a facility to insert the
hadoop-client jars of your choice at packaging time. Unfortunately,
hadoop-client pulls in a ton of other deps, so it's not as simple as
copying one extra jar into dist/jars.
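
For reference, the user-side linking model Patrick describes below, as a
minimal sbt sketch (both version strings are examples and need to match the
cluster):

    // build.sbt sketch: link against Spark, plus the hadoop-client that matches
    // the cluster's Hadoop version.
    libraryDependencies ++= Seq(
      "org.apache.spark"  %% "spark-core"    % "0.9.0-incubating",
      "org.apache.hadoop" %  "hadoop-client" % "1.2.1"
    )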

On Mon, Mar 17, 2014 at 10:58 AM, Patrick Wendell pwend...@gmail.com wrote:
 Hey Nathan,

 I don't think this would be possible because there are at least dozens
 of permutations of Hadoop versions (different vendor distros X
 different versions X YARN vs not YARN, etc) and maybe hundreds. So
 publishing new artifacts for each would be really difficult.

 What is the exact problem you ran into? Maybe we need to improve the
 documentation to make it more clear how to correctly link against
 spark/hadoop for user applications. Basically the model we have now is
 users link against Spark and then link against the hadoop-client
 relevant to their version of Hadoop.

 - Patrick

 On Mon, Mar 17, 2014 at 9:50 AM, Nathan Kronenfeld
 nkronenf...@oculusinfo.com wrote:
 After just spending a couple days fighting with a new spark installation,
 getting spark and hadoop version numbers matching everywhere, I have a
 suggestion I'd like to put out there.

 Can we put the hadoop version against which the spark jars were built into
 the version number?

 I noticed that the Cloudera maven repo has started to do this (
 https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/spark/spark-core_2.10/)
 - sadly, though, only with the cdh5.x versions, not with the 4.x versions
 for which they also have spark parcels.  But I see no signs of it in the
 central maven repo.

 Is this already done in some other repo about which I don't know, perhaps?

 I know it would save us a lot of time and grief simply to be able to point
 a project we build at the right version, and not have to rebuild and deploy
 spark manually.

 --
 Nathan Kronenfeld
 Senior Visualization Developer
 Oculus Info Inc
 2 Berkeley Street, Suite 600,
 Toronto, Ontario M5A 4J5
 Phone:  +1-416-203-3003 x 238
 Email:  nkronenf...@oculusinfo.com



-- 
--
Evan Chan
Staff Engineer
e...@ooyala.com  |


Re: ALS solve.solvePositive

2014-03-19 Thread Xiangrui Meng
They have been merged into the master branch. However, the
improvements are for implicit ALS computation. I don't think they can
speed up normal ALS computation. Could you share more details about
the variable projection?

JIRAs:

https://spark-project.atlassian.net/browse/SPARK-1266
https://spark-project.atlassian.net/browse/SPARK-1238
https://spark-project.atlassian.net/browse/MLLIB-25

Best,
Xiangrui

On Wed, Mar 19, 2014 at 3:17 PM, Debasish Das debasish.da...@gmail.com wrote:
 Nope...with the cleaner dataset I am not noticing issues with the dposv and
 this dataset is even bigger...20 M users and 1 M products...I don't think
 other than cholesky anything else will get us the efficiency we need...

 For my usecase we also need to see the effectiveness of positive factors
 and I am doing variable projection as a start..

 If possible could you please point me to the PRs related to ALS
 improvements ? Are they all added to the master ? There are at least 3 PRs
 that Sean and you contributed recently related to ALS efficiency.

 A JIRA or gist will definitely help a lot.

 Thanks.
 Deb



 On Wed, Mar 19, 2014 at 10:11 AM, Xiangrui Meng men...@gmail.com wrote:

 Another question: do you have negative or out-of-range user or product
 ids or? -Xiangrui

 On Tue, Mar 11, 2014 at 8:00 PM, Debasish Das debasish.da...@gmail.com
 wrote:
  Nope..I did not test implicit feedback yet...will get into more detailed
  debug and generate the testcase hopefully next week...
  On Mar 11, 2014 7:02 PM, Xiangrui Meng men...@gmail.com wrote:
 
  Hi Deb, did you use ALS with implicit feedback? -Xiangrui
 
  On Mon, Mar 10, 2014 at 1:17 PM, Xiangrui Meng men...@gmail.com
 wrote:
   Choosing lambda = 0.1 shouldn't lead to the error you got. This is
   probably a bug. Do you mind sharing a small amount of data that can
   re-produce the error? -Xiangrui
  
   On Fri, Mar 7, 2014 at 8:24 AM, Debasish Das 
 debasish.da...@gmail.com
  wrote:
   Hi Xiangrui,
  
   I used lambda = 0.1...It is possible that 2 users ranked in movies
 in a
   very similar way...
  
   I agree that increasing lambda will solve the problem but you agree
  this is
   not a solution...lambda should be tuned based on sparsity / other
  criteria
   and not to make a linearly dependent hessian matrix linearly
   independent...
  
   Thanks.
   Deb
  
  
  
  
  
   On Thu, Mar 6, 2014 at 7:20 PM, Xiangrui Meng men...@gmail.com
 wrote:
  
   If the matrix is very ill-conditioned, then A^T A becomes
 numerically
   rank deficient. However, if you use a reasonably large positive
   regularization constant (lambda), A^T A + lambda I should be still
   positive definite. What was the regularization constant (lambda) you
   set? Could you test whether the error still happens when you use a
   large lambda?
  
   Best,
   Xiangrui
  
 



Spark 0.9.1 release

2014-03-19 Thread Tathagata Das
 Hello everyone,

Since the release of Spark 0.9, we have received a number of important bug
fixes and we would like to make a bug-fix release of Spark 0.9.1. We are
going to cut a release candidate soon and we would love it if people test
it out. We have backported several bug fixes into the 0.9 branch and updated JIRA
accordingly:
https://spark-project.atlassian.net/browse/SPARK-1275?jql=project%20in%20(SPARK%2C%20BLINKDB%2C%20MLI%2C%20MLLIB%2C%20SHARK%2C%20STREAMING%2C%20GRAPH%2C%20TACHYON)%20AND%20fixVersion%20%3D%200.9.1%20AND%20status%20in%20(Resolved%2C%20Closed)
Please let me know if there are fixes that were not backported but that you
would like to see in 0.9.1.

Thanks!

TD


Re: Spark 0.9.1 release

2014-03-19 Thread Mridul Muralidharan
It would be great if the garbage collection PR is also committed; if not
the whole thing, then at least the part that unpersists broadcast variables
explicitly.
Currently we are running with a custom impl which does something
similar, and I would like to move to the standard distribution for that.
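
(For anyone not following that PR: the piece in question is roughly an explicit
unpersist on broadcast variables; a sketch of how that would read, assuming the
API lands as proposed:)

    // Sketch only: the explicit-cleanup API proposed in the GC PR (not in 0.9.x).
    val table = sc.broadcast(Map("a" -> 1, "b" -> 2))
    val total = sc.parallelize(Seq("a", "b", "c"))
      .map(k => table.value.getOrElse(k, 0))
      .reduce(_ + _)
    table.unpersist()  // proactively drop the broadcast blocks from the executors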


Thanks,
Mridul


On Wed, Mar 19, 2014 at 5:07 PM, Tathagata Das
tathagata.das1...@gmail.com wrote:
  Hello everyone,

 Since the release of Spark 0.9, we have received a number of important bug
 fixes and we would like to make a bug-fix release of Spark 0.9.1. We are
 going to cut a release candidate soon and we would love it if people test
 it out. We have backported several bug fixes into the 0.9 and updated JIRA
 accordinglyhttps://spark-project.atlassian.net/browse/SPARK-1275?jql=project%20in%20(SPARK%2C%20BLINKDB%2C%20MLI%2C%20MLLIB%2C%20SHARK%2C%20STREAMING%2C%20GRAPH%2C%20TACHYON)%20AND%20fixVersion%20%3D%200.9.1%20AND%20status%20in%20(Resolved%2C%20Closed).
 Please let me know if there are fixes that were not backported but you
 would like to see them in 0.9.1.

 Thanks!

 TD


Re: Spark 0.9.1 release

2014-03-19 Thread Mridul Muralidharan
If 1.0 is just around the corner, then it is fair enough to push this to
that release. Thanks for clarifying!

Regards,
Mridul

On Wed, Mar 19, 2014 at 6:12 PM, Tathagata Das
tathagata.das1...@gmail.com wrote:
 I agree that the garbage collection PR
 (https://github.com/apache/spark/pull/126) would make things very
 convenient in a lot of use cases. However, there are
 two broad reasons why it is hard for that PR to get into 0.9.1.
 1. The PR still needs some amount of work and quite a lot of testing. While
 we enable RDD and shuffle cleanup based on Java GC, its behavior in real
 workloads still needs to be understood (especially since it is tied to
 the Spark driver's garbage collection behavior).
 2. This actually changes some of the semantic behavior of Spark and should
 not be included in a bug-fix release. The PR will definitely be present in
 Spark 1.0, which is expected to be released around the end of April (not too far
 ;) ).

 TD


 On Wed, Mar 19, 2014 at 5:57 PM, Mridul Muralidharan mri...@gmail.com wrote:

 Would be great if the garbage collection PR is also committed - if not
 the whole thing, atleast the part to unpersist broadcast variables
 explicitly would be great.
 Currently we are running with a custom impl which does something
 similar, and I would like to move to standard distribution for that.


 Thanks,
 Mridul


 On Wed, Mar 19, 2014 at 5:07 PM, Tathagata Das
 tathagata.das1...@gmail.com wrote:
   Hello everyone,
 
  Since the release of Spark 0.9, we have received a number of important
 bug
  fixes and we would like to make a bug-fix release of Spark 0.9.1. We are
  going to cut a release candidate soon and we would love it if people test
  it out. We have backported several bug fixes into the 0.9 and updated
 JIRA
  accordingly
 https://spark-project.atlassian.net/browse/SPARK-1275?jql=project%20in%20(SPARK%2C%20BLINKDB%2C%20MLI%2C%20MLLIB%2C%20SHARK%2C%20STREAMING%2C%20GRAPH%2C%20TACHYON)%20AND%20fixVersion%20%3D%200.9.1%20AND%20status%20in%20(Resolved%2C%20Closed)
 .
  Please let me know if there are fixes that were not backported but you
  would like to see them in 0.9.1.
 
  Thanks!
 
  TD



How the scala style checker works?

2014-03-19 Thread Nan Zhu
Hi, all  

I’m just curious about the working mechanism of the Scala style checker.

While working on a PR, I found that the following line contains 101 chars,
violating the 100-character limit:

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L515

but the current Scala style checker passes this line. Why?

Best,  

--  
Nan Zhu




Re: How the scala style checker works?

2014-03-19 Thread Nirmal Allugari
Hey Nan

The line contains exactly 100 chars; the cursor ends up at the 101st column,
and that is what the IDE indicates.
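
A quick way to double-check the actual count, independent of the IDE's column
indicator (path assumed relative to a Spark checkout):

    // Prints the length of line 515 of DAGScheduler.scala. The style check
    // compares this number against 100; an IDE cursor placed after the last
    // character of a 100-char line reports column 101.
    val src = scala.io.Source.fromFile(
      "core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala")
    try println(src.getLines().drop(514).next().length)
    finally src.close()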

Thanks,
Nirmal Reddy.


On Thu, Mar 20, 2014 at 10:49 AM, Nan Zhu zhunanmcg...@gmail.com wrote:

 Hi, all

 I'm just curious about the working mechanism of scala style checker

 When I work on a PR, I found that the following line contains 101 chars,
 violating the 100 limitation


 https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L515

 but the current scala style checker passes this line?

 Best,

 --
 Nan Zhu