Re: Announcing the official Spark Job Server repo
Andy,

Yeah, we've thought of deploying this on Marathon ourselves, but we're not sure how much Mesos we're going to use yet. (Indeed, if you look at bin/server_start.sh, I think I set up the PORT environment var specifically for Marathon.) This is also why we have deploy scripts which package everything into a .tar.gz, again for Mesos deployment. If you do try this, please let us know. :)

-Evan

On Tue, Mar 18, 2014 at 3:57 PM, andy petrella andy.petre...@gmail.com wrote:

Ta-da! That's awesome. A quick question: does anyone have insights regarding having such JobServers deployed using Marathon on Mesos? I'm thinking about an arch where Marathon would deploy and keep the Job Servers running along with part of the whole set of apps deployed on it, regarding the resources needed (à la Jenkins). Any idea is welcome. Back to the news, Evan + Ooyala team: great job again. andy

On Tue, Mar 18, 2014 at 11:39 PM, Henry Saputra henry.sapu...@gmail.com wrote:

W00t! Thanks for releasing this, Evan. - Henry

On Tue, Mar 18, 2014 at 1:51 PM, Evan Chan e...@ooyala.com wrote:

Dear Spark developers, Ooyala is happy to announce that we have pushed our official, Spark 0.9.0 / Scala 2.10-compatible, job server as a github repo: https://github.com/ooyala/spark-jobserver Complete with unit tests, deploy scripts, and examples. The original PR (#222) on incubator-spark is now closed. Please have a look; pull requests are very welcome.

-- Evan Chan, Staff Engineer, e...@ooyala.com

-- Evan Chan, Staff Engineer, e...@ooyala.com
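[Editor's note] For anyone curious how the PORT hook mentioned above might look: Marathon sets a PORT environment variable for each task, and the server can pick it up before binding, falling back to a configured default. The minimal Scala sketch below only illustrates the mechanism; the actual wiring lives in bin/server_start.sh and the job server's config handling, and the default port here is a placeholder.

    // Sketch only: let a Marathon-assigned port override a configured default.
    object ServerPort {
      val DefaultPort = 8090 // placeholder default, not necessarily the job server's

      def resolvePort(): Int =
        sys.env.get("PORT").map(_.toInt).getOrElse(DefaultPort)

      def main(args: Array[String]): Unit =
        println(s"binding HTTP server on port ${resolvePort()}")
    }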
Re: Announcing the official Spark Job Server repo
Matei,

Maybe it's time to explore the spark-contrib idea again? Should I start a JIRA ticket?

-Evan

On Tue, Mar 18, 2014 at 4:04 PM, Matei Zaharia matei.zaha...@gmail.com wrote:

Cool, glad to see this posted! I've added a link to it at https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark. Matei

On Mar 18, 2014, at 1:51 PM, Evan Chan e...@ooyala.com wrote:

Dear Spark developers, Ooyala is happy to announce that we have pushed our official, Spark 0.9.0 / Scala 2.10-compatible, job server as a github repo: https://github.com/ooyala/spark-jobserver Complete with unit tests, deploy scripts, and examples. The original PR (#222) on incubator-spark is now closed. Please have a look; pull requests are very welcome.

-- Evan Chan, Staff Engineer, e...@ooyala.com

-- Evan Chan, Staff Engineer, e...@ooyala.com
Re: [PySpark]: reading arbitrary Hadoop InputFormats
Hi Matei,

I'm afraid I haven't had enough time to focus on this, as work has just been crazy. It's still something I want to get to a mergeable status. Actually, it was working fine; it was just a bit rough and needs to be updated to HEAD. I'll absolutely try my utmost to get something ready to merge before the window for 1.0 closes. Perhaps we can put it in there (once I've updated and cleaned up) as a more experimental feature? What is the view on having more untested (as in, not yet production-tested) stuff in 1.0?

— Sent from Mailbox for iPhone

On Wed, Mar 19, 2014 at 12:15 AM, Matei Zaharia matei.zaha...@gmail.com wrote:

Hey Nick, I'm curious, have you been doing any further development on this? It would be good to get expanded InputFormat support in Spark 1.0. To start with we don't have to do SequenceFiles in particular, we can do stuff like Avro (if it's easy to read in Python) or some kind of WholeFileInputFormat. Matei

On Dec 19, 2013, at 10:57 AM, Nick Pentreath nick.pentre...@gmail.com wrote:

Hi, I managed to find the time to put together a PR on this: https://github.com/apache/incubator-spark/pull/263 Josh has had a look over it - if anyone else with an interest could give some feedback, that would be great. As mentioned in the PR it's more of an RFC and certainly still needs a bit of clean-up work, and I need to add the concept of wrapper functions to deserialize classes that MsgPack can't handle out of the box. N

— Sent from Mailbox for iPhone

On Fri, Nov 8, 2013 at 12:20 PM, Nick Pentreath nick.pentre...@gmail.com wrote:

Wow Josh, that looks great. I've been a bit swamped this week, but as soon as I get a chance I'll test out the PR in more detail and port over the InputFormat stuff to use the new framework (including the changes you suggested). I can then look deeper into the MsgPack functionality to see if it can be made to work in a generic enough manner without requiring huge amounts of custom Templates to be written by users. Will feed back asap. N

On Thu, Nov 7, 2013 at 5:03 AM, Josh Rosen rosenvi...@gmail.com wrote:

I opened a pull request to add custom serializer support to PySpark: https://github.com/apache/incubator-spark/pull/146 My pull request adds the plumbing for transferring data from Java to Python using formats other than Pickle. For example, look at how textFile() uses MUTF8Deserializer to read strings from Java. Hopefully this provides all of the functionality needed to support MsgPack. - Josh

On Thu, Oct 31, 2013 at 11:11 AM, Josh Rosen rosenvi...@gmail.com wrote:

Hi Nick, This is a nice start. I'd prefer to keep the Java sequenceFileAsText() and newHadoopFileAsText() methods inside PythonRDD instead of adding them to JavaSparkContext, since I think these methods are unlikely to be used directly by Java users (you can add these methods to the PythonRDD companion object, which is how readRDDFromPickleFile is implemented: https://github.com/apache/incubator-spark/blob/branch-0.8/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L255 ) For MsgPack, the UnpicklingError is because the Python worker expects to receive its input in a pickled format. In my prototype of custom serializers, I modified the PySpark worker to receive its serialization/deserialization function as input ( https://github.com/JoshRosen/spark/blob/59b6b43916dc84fc8b83f22eb9ce13a27bc51ec0/python/pyspark/worker.py#L41 ) and added logic to pass the appropriate serializers based on each stage's input and output formats ( https://github.com/JoshRosen/spark/blob/59b6b43916dc84fc8b83f22eb9ce13a27bc51ec0/python/pyspark/rdd.py#L42 ). At some point, I'd like to port my custom serializers code to PySpark; if anyone's interested in helping, I'd be glad to write up some additional notes on how this should work. - Josh

On Wed, Oct 30, 2013 at 2:25 PM, Nick Pentreath nick.pentre...@gmail.com wrote:

Thanks Josh, Patrick for the feedback. Based on Josh's pointers I have something working for JavaPairRDD -> PySpark RDD[(String, String)]. This just calls the toString method on each key and value as before, but without the need for a delimiter. For SequenceFile, it uses SequenceFileAsTextInputFormat, which itself calls toString to convert to Text for keys and values. We then call toString (again) ourselves to get Strings to feed to writeAsPickle. Details here: https://gist.github.com/MLnick/7230588 This also illustrates where the wrapper function API would fit in. All that is required is to define a T => String for key and value. I started playing around with MsgPack and can sort of get things to work in Scala, but am struggling with getting the raw bytes to be written properly in PythonRDD (I think it is treating them as pickled byte arrays when they are not, but when I removed the 'stripPickle' calls and amended the length (-6) I got UnpicklingError:
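[Editor's note] To make the wrapper-function idea above concrete, here is a minimal Scala sketch of the kind of helper being discussed: a method kept off JavaSparkContext that reads a SequenceFile as text and applies caller-supplied T => String converters to keys and values before records are handed to the Python side. The object name, method name, and converter signatures are illustrative assumptions, not the actual PR's API.

    import org.apache.hadoop.io.Text
    import org.apache.hadoop.mapred.SequenceFileAsTextInputFormat
    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    // Hypothetical helper, roughly in the spirit of keeping these methods on the
    // PythonRDD companion object rather than on JavaSparkContext.
    object PythonInputFormatHelpers {
      // Read a SequenceFile as (Text, Text) and convert each side to String with
      // caller-supplied converters, defaulting to toString.
      def sequenceFileAsText(
          sc: SparkContext,
          path: String,
          keyConv: Text => String = _.toString,
          valConv: Text => String = _.toString): RDD[(String, String)] = {
        sc.hadoopFile(path, classOf[SequenceFileAsTextInputFormat], classOf[Text], classOf[Text])
          .map { case (k, v) => (keyConv(k), valConv(v)) }
      }
    }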
[Exception]: Could not obtain block
I am getting the below error while running a Scala application with an input file present in HDFS.

Exception in thread "main" org.apache.spark.SparkException: Job aborted: Task 1.0:41 failed 4 times (most recent failure: Exception failure: java.io.IOException: Could not obtain block: blk_5289257631743300391_2097 file=/data/1.txtl)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1026)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1026)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:619)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:207)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
    at akka.actor.ActorCell.invoke(ActorCell.scala:456)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
    at akka.dispatch.Mailbox.run(Mailbox.scala:219)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

But when I simply do hadoop dfs -cat /data/1.txt, it displays the complete file with no error.

Spark Version=0.9.0
Hadoop version=1.2.1

Any idea??

--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Exception-Could-not-obtain-block-tp5976.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
Re: ALS solve.solvePositive
Another question: do you have negative or out-of-range user or product ids? -Xiangrui

On Tue, Mar 11, 2014 at 8:00 PM, Debasish Das debasish.da...@gmail.com wrote:

Nope..I did not test implicit feedback yet...will get into more detailed debug and generate the testcase hopefully next week...

On Mar 11, 2014 7:02 PM, Xiangrui Meng men...@gmail.com wrote:

Hi Deb, did you use ALS with implicit feedback? -Xiangrui

On Mon, Mar 10, 2014 at 1:17 PM, Xiangrui Meng men...@gmail.com wrote:

Choosing lambda = 0.1 shouldn't lead to the error you got. This is probably a bug. Do you mind sharing a small amount of data that can reproduce the error? -Xiangrui

On Fri, Mar 7, 2014 at 8:24 AM, Debasish Das debasish.da...@gmail.com wrote:

Hi Xiangrui, I used lambda = 0.1...It is possible that 2 users ranked movies in a very similar way... I agree that increasing lambda will solve the problem, but you agree this is not a solution...lambda should be tuned based on sparsity / other criteria and not to make a linearly dependent Hessian matrix linearly independent... Thanks. Deb

On Thu, Mar 6, 2014 at 7:20 PM, Xiangrui Meng men...@gmail.com wrote:

If the matrix is very ill-conditioned, then A^T A becomes numerically rank deficient. However, if you use a reasonably large positive regularization constant (lambda), A^T A + lambda I should still be positive definite. What was the regularization constant (lambda) you set? Could you test whether the error still happens when you use a large lambda? Best, Xiangrui
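[Editor's note] A minimal standalone illustration of the failure mode discussed in this thread (not the actual MLlib ALS code path): when two users rate items almost identically, the Gram matrix A^T A is nearly rank-deficient and a Cholesky-based solve such as jblas's Solve.solvePositive can fail, while adding lambda * I to the diagonal, as ALS does, restores positive definiteness. The data and lambda value below are made up.

    import org.jblas.{DoubleMatrix, Solve}

    object RegularizedNormalEquations {
      def main(args: Array[String]): Unit = {
        // Two nearly linearly dependent rows, e.g. two users who rated items
        // almost identically; A^T A is then close to singular.
        val a = new DoubleMatrix(Array(
          Array(1.0, 2.0),
          Array(1.0, 2.0000001)))
        val b = new DoubleMatrix(Array(3.0, 3.0))

        val ata = a.transpose().mmul(a) // Gram matrix A^T A
        val atb = a.transpose().mmul(b) // right-hand side A^T b

        val lambda = 0.1
        // Ridge term: A^T A + lambda * I is strictly positive definite,
        // so the positive-definite (dposv-based) solver succeeds.
        val regularized = ata.add(DoubleMatrix.eye(2).mul(lambda))

        val x = Solve.solvePositive(regularized, atb)
        println(x)
      }
    }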
Re: Announcing the official Spark Job Server repo
Evan - yep definitely open a JIRA. It would be nice to have a contrib repo set-up for the 1.0 release. On Tue, Mar 18, 2014 at 11:28 PM, Evan Chan e...@ooyala.com wrote: Matei, Maybe it's time to explore the spark-contrib idea again? Should I start a JIRA ticket? -Evan On Tue, Mar 18, 2014 at 4:04 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Cool, glad to see this posted! I've added a link to it at https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark. Matei On Mar 18, 2014, at 1:51 PM, Evan Chan e...@ooyala.com wrote: Dear Spark developers, Ooyala is happy to announce that we have pushed our official, Spark 0.9.0 / Scala 2.10-compatible, job server as a github repo: https://github.com/ooyala/spark-jobserver Complete with unit tests, deploy scripts, and examples. The original PR (#222) on incubator-spark is now closed. Please have a look; pull requests are very welcome. -- -- Evan Chan Staff Engineer e...@ooyala.com | -- -- Evan Chan Staff Engineer e...@ooyala.com |
Re: Announcing the official Spark Job Server repo
this is cool +1 On Wed, Mar 19, 2014 at 6:54 PM, Patrick Wendell pwend...@gmail.com wrote: Evan - yep definitely open a JIRA. It would be nice to have a contrib repo set-up for the 1.0 release. On Tue, Mar 18, 2014 at 11:28 PM, Evan Chan e...@ooyala.com wrote: Matei, Maybe it's time to explore the spark-contrib idea again? Should I start a JIRA ticket? -Evan On Tue, Mar 18, 2014 at 4:04 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Cool, glad to see this posted! I've added a link to it at https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark. Matei On Mar 18, 2014, at 1:51 PM, Evan Chan e...@ooyala.com wrote: Dear Spark developers, Ooyala is happy to announce that we have pushed our official, Spark 0.9.0 / Scala 2.10-compatible, job server as a github repo: https://github.com/ooyala/spark-jobserver Complete with unit tests, deploy scripts, and examples. The original PR (#222) on incubator-spark is now closed. Please have a look; pull requests are very welcome. -- -- Evan Chan Staff Engineer e...@ooyala.com | -- -- Evan Chan Staff Engineer e...@ooyala.com |
Re: [PySpark]: reading arbitrary Hadoop InputFormats
Hey Nick, no worries if this can’t be done in time. It’s probably better to test it thoroughly. If you do have something partially working though, the main concern will be the API, i.e. whether it’s an API we want to support indefinitely. It would be bad to add this and then make major changes to what it returns. But if we’re comfortable with the API, we can mark it as experimental and include it. It might be better to support fewer data types at first and then add some just to keep the API small. Matei On Mar 18, 2014, at 11:44 PM, Nick Pentreath nick.pentre...@gmail.com wrote: Hi Matei I'm afraid I haven't had enough time to focus on this as work has just been crazy. It's still something I want to get to a mergeable status. Actually it was working fine it was just a bit rough and needs to be updated to HEAD. I'll absolutely try my utmost to get something ready to merge before the window for 1.0 closes. Perhaps we can put it in there (once I've updated and cleaned up) as a more experimental feature? What is the view on having such more untested (as in production) stuff in 1.0? — Sent from Mailbox for iPhone On Wed, Mar 19, 2014 at 12:15 AM, Matei Zaharia matei.zaha...@gmail.com wrote: Hey Nick, I’m curious, have you been doing any further development on this? It would be good to get expanded InputFormat support in Spark 1.0. To start with we don’t have to do SequenceFiles in particular, we can do stuff like Avro (if it’s easy to read in Python) or some kind of WholeFileInputFormat. Matei On Dec 19, 2013, at 10:57 AM, Nick Pentreath nick.pentre...@gmail.com wrote: Hi I managed to find the time to put together a PR on this: https://github.com/apache/incubator-spark/pull/263 Josh has had a look over it - if anyone else with an interest could give some feedback that would be great. As mentioned in the PR it's more of an RFC and certainly still needs a bit of clean up work, and I need to add the concept of wrapper functions to deserialize classes that MsgPack can't handle out the box. N — Sent from Mailbox for iPhone On Fri, Nov 8, 2013 at 12:20 PM, Nick Pentreath nick.pentre...@gmail.com wrote: Wow Josh, that looks great. I've been a bit swamped this week but as soon as I get a chance I'll test out the PR in more detail and port over the InputFormat stuff to use the new framework (including the changes you suggested). I can then look deeper into the MsgPack functionality to see if it can be made to work in a generic enough manner without requiring huge amounts of custom Templates to be written by users. Will feed back asap. N On Thu, Nov 7, 2013 at 5:03 AM, Josh Rosen rosenvi...@gmail.com wrote: I opened a pull request to add custom serializer support to PySpark: https://github.com/apache/incubator-spark/pull/146 My pull request adds the plumbing for transferring data from Java to Python using formats other than Pickle. For example, look at how textFile() uses MUTF8Deserializer to read strings from Java. Hopefully this provides all of the functionality needed to support MsgPack. - Josh On Thu, Oct 31, 2013 at 11:11 AM, Josh Rosen rosenvi...@gmail.com wrote: Hi Nick, This is a nice start. 
I'd prefer to keep the Java sequenceFileAsText() and newHadoopFileAsText() methods inside PythonRDD instead of adding them to JavaSparkContext, since I think these methods are unlikely to be used directly by Java users (you can add these methods to the PythonRDD companion object, which is how readRDDFromPickleFile is implemented: https://github.com/apache/incubator-spark/blob/branch-0.8/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L255 ) For MsgPack, the UnpicklingError is because the Python worker expects to receive its input in a pickled format. In my prototype of custom serializers, I modified the PySpark worker to receive its serialization/deserialization function as input ( https://github.com/JoshRosen/spark/blob/59b6b43916dc84fc8b83f22eb9ce13a27bc51ec0/python/pyspark/worker.py#L41 ) and added logic to pass the appropriate serializers based on each stage's input and output formats ( https://github.com/JoshRosen/spark/blob/59b6b43916dc84fc8b83f22eb9ce13a27bc51ec0/python/pyspark/rdd.py#L42 ). At some point, I'd like to port my custom serializers code to PySpark; if anyone's interested in helping, I'd be glad to write up some additional notes on how this should work. - Josh On Wed, Oct 30, 2013 at 2:25 PM, Nick Pentreath nick.pentre...@gmail.comwrote: Thanks Josh, Patrick for the feedback. Based on Josh's pointers I have something working for JavaPairRDD - PySpark RDD[(String, String)]. This just calls the toString method on each key and value as before, but without the need for a delimiter. For SequenceFile, it uses SequenceFileAsTextInputFormat which itself calls toString to convert to Text for keys and
Re: [PySpark]: reading arbitrary Hadoop InputFormats
Ok - I'll work something up and reopen a PR against the new spark mirror. The API itself mirrors the newHadoopFile etc methods, so that should be quite stable once finalised. It's the wrapper stuff of how to serialize custom classes and read them in Python that is the potential tricky part. — Sent from Mailbox for iPhone On Wed, Mar 19, 2014 at 8:55 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Hey Nick, no worries if this can’t be done in time. It’s probably better to test it thoroughly. If you do have something partially working though, the main concern will be the API, i.e. whether it’s an API we want to support indefinitely. It would be bad to add this and then make major changes to what it returns. But if we’re comfortable with the API, we can mark it as experimental and include it. It might be better to support fewer data types at first and then add some just to keep the API small. Matei On Mar 18, 2014, at 11:44 PM, Nick Pentreath nick.pentre...@gmail.com wrote: Hi Matei I'm afraid I haven't had enough time to focus on this as work has just been crazy. It's still something I want to get to a mergeable status. Actually it was working fine it was just a bit rough and needs to be updated to HEAD. I'll absolutely try my utmost to get something ready to merge before the window for 1.0 closes. Perhaps we can put it in there (once I've updated and cleaned up) as a more experimental feature? What is the view on having such more untested (as in production) stuff in 1.0? — Sent from Mailbox for iPhone On Wed, Mar 19, 2014 at 12:15 AM, Matei Zaharia matei.zaha...@gmail.com wrote: Hey Nick, I’m curious, have you been doing any further development on this? It would be good to get expanded InputFormat support in Spark 1.0. To start with we don’t have to do SequenceFiles in particular, we can do stuff like Avro (if it’s easy to read in Python) or some kind of WholeFileInputFormat. Matei On Dec 19, 2013, at 10:57 AM, Nick Pentreath nick.pentre...@gmail.com wrote: Hi I managed to find the time to put together a PR on this: https://github.com/apache/incubator-spark/pull/263 Josh has had a look over it - if anyone else with an interest could give some feedback that would be great. As mentioned in the PR it's more of an RFC and certainly still needs a bit of clean up work, and I need to add the concept of wrapper functions to deserialize classes that MsgPack can't handle out the box. N — Sent from Mailbox for iPhone On Fri, Nov 8, 2013 at 12:20 PM, Nick Pentreath nick.pentre...@gmail.com wrote: Wow Josh, that looks great. I've been a bit swamped this week but as soon as I get a chance I'll test out the PR in more detail and port over the InputFormat stuff to use the new framework (including the changes you suggested). I can then look deeper into the MsgPack functionality to see if it can be made to work in a generic enough manner without requiring huge amounts of custom Templates to be written by users. Will feed back asap. N On Thu, Nov 7, 2013 at 5:03 AM, Josh Rosen rosenvi...@gmail.com wrote: I opened a pull request to add custom serializer support to PySpark: https://github.com/apache/incubator-spark/pull/146 My pull request adds the plumbing for transferring data from Java to Python using formats other than Pickle. For example, look at how textFile() uses MUTF8Deserializer to read strings from Java. Hopefully this provides all of the functionality needed to support MsgPack. - Josh On Thu, Oct 31, 2013 at 11:11 AM, Josh Rosen rosenvi...@gmail.com wrote: Hi Nick, This is a nice start. 
I'd prefer to keep the Java sequenceFileAsText() and newHadoopFileAsText() methods inside PythonRDD instead of adding them to JavaSparkContext, since I think these methods are unlikely to be used directly by Java users (you can add these methods to the PythonRDD companion object, which is how readRDDFromPickleFile is implemented: https://github.com/apache/incubator-spark/blob/branch-0.8/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L255 ) For MsgPack, the UnpicklingError is because the Python worker expects to receive its input in a pickled format. In my prototype of custom serializers, I modified the PySpark worker to receive its serialization/deserialization function as input ( https://github.com/JoshRosen/spark/blob/59b6b43916dc84fc8b83f22eb9ce13a27bc51ec0/python/pyspark/worker.py#L41 ) and added logic to pass the appropriate serializers based on each stage's input and output formats ( https://github.com/JoshRosen/spark/blob/59b6b43916dc84fc8b83f22eb9ce13a27bc51ec0/python/pyspark/rdd.py#L42 ). At some point, I'd like to port my custom serializers code to PySpark; if anyone's interested in helping, I'd be glad to write up some additional notes on how this should work. - Josh On Wed, Oct 30, 2013 at
Wrong input split mapping?

I am reading a set of files from s3 and writing output to the same account in a different folder. My input split mappings seem to be wrong somehow. It appends base-maps or p
14/03/19 19:11:37 INFO Executor: Serialized size of result for 678 is 1423
14/03/19 19:11:37 INFO Executor: Sending result for 678 directly to driver
14/03/19 19:11:37 INFO Executor: Finished task ID 678
14/03/19 19:11:37 INFO NativeS3FileSystem: Opening key 'test_data/jws/video_logs2/video_logs2_00128G/video_view/video_views_1376612461124_1_data_1.csv' for reading at position '134217727'
14/03/19 19:11:37 INFO NativeS3FileSystem: Opening key 'test_data/jws/video_logs2/video_logs2_00128G/video_view/video_views_1376612461124_1_data_1.csv' for reading at position '402653183'
14/03/19 19:11:38 INFO MemoryStore: ensureFreeSpace(82814164) called with curMem=11412250398, maxMem=35160431001
14/03/19 19:11:38 INFO MemoryStore: Block rdd_5_681 stored as bytes to memory (size 79.0 MB, free 22.0 GB)
14/03/19 19:11:38 INFO BlockManagerMaster: Updated info of block rdd_5_681
14/03/19 19:11:38 INFO MemoryStore: ensureFreeSpace(83081354) called with curMem=11495064562, maxMem=35160431001
14/03/19 19:11:38 INFO MemoryStore: Block rdd_5_693 stored as bytes to memory (size 79.2 MB, free 22.0 GB)
14/03/19 19:11:38 INFO BlockManagerMaster: Updated info of block rdd_5_693
14/03/19 19:11:38 INFO Executor: Serialized size of result for 681 is 1423
14/03/19 19:11:38 INFO Executor: Sending result for 681 directly to driver
14/03/19 19:11:38 INFO Executor: Finished task ID 681
14/03/19 19:11:39 INFO CoarseGrainedExecutorBackend: Got assigned task 707
14/03/19 19:11:39 INFO Executor: Running task ID 707
14/03/19 19:11:39 INFO BlockManager: Found block broadcast_1 locally
14/03/19 19:11:39 INFO CacheManager: Partition rdd_5_685 not found, computing it
14/03/19 19:11:39 INFO NewHadoopRDD: Input split: s3n://AKIAJ346M2WM3VKBHFJA:ezu6d3li5gu6j3panqtxmihlypliwhqme+du8...@platfora.qa/test_data/jws/video_logs2/video_logs2_00128G/video_view/video_views_1376612461124_1_data_*.csv/proton:0+0
14/03/19 19:11:39 INFO Executor: Serialized size of result for 693 is 1423
14/03/19 19:11:39 INFO Executor: Sending result for 693 directly to driver
14/03/19 19:11:39 INFO Executor: Finished task ID 693
14/03/19 19:11:39 ERROR Executor: Exception in task ID 706
java.io.IOException: 's3n://AKIAJ346M2WM3VKBHFJA:ezu6d3li5gu6j3panqtxmihlypliwhqme+du8...@platfora.qa/test_data/jws/video_logs2/video_logs2_00128G/video_view/video_views_1376612461124_1_data_*.csv/base-maps' is a directory
    at org.apache.hadoop.fs.s3native.NativeS3FileSystem.open(NativeS3FileSystem.java:559)
    at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:711)
    at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.initialize(LineRecordReader.java:75)
    at org.apache.spark.rdd.NewHadoopRDD$$anon$1.init(NewHadoopRDD.scala:96)
    at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:84)
    at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:48)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
    at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
    at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
    at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:71)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
    at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
    at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:34)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
    at org.apache.spark.scheduler.Task.run(Task.scala:53)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
    at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
14/03/19 19:11:39 INFO CoarseGrainedExecutorBackend: Got assigned task 708
14/03/19 19:11:39 INFO Executor: Running task ID
Re: ALS solve.solvePositive
Nope...with the cleaner dataset I am not noticing issues with dposv, and this dataset is even bigger...20 M users and 1 M products...I don't think anything other than Cholesky will get us the efficiency we need...

For my use case we also need to see the effectiveness of positive factors, and I am doing variable projection as a start..

If possible could you please point me to the PRs related to ALS improvements? Are they all added to master? There are at least 3 PRs that Sean and you contributed recently related to ALS efficiency. A JIRA or gist will definitely help a lot.

Thanks.
Deb

On Wed, Mar 19, 2014 at 10:11 AM, Xiangrui Meng men...@gmail.com wrote:

Another question: do you have negative or out-of-range user or product ids? -Xiangrui

On Tue, Mar 11, 2014 at 8:00 PM, Debasish Das debasish.da...@gmail.com wrote:

Nope..I did not test implicit feedback yet...will get into more detailed debug and generate the testcase hopefully next week...

On Mar 11, 2014 7:02 PM, Xiangrui Meng men...@gmail.com wrote:

Hi Deb, did you use ALS with implicit feedback? -Xiangrui

On Mon, Mar 10, 2014 at 1:17 PM, Xiangrui Meng men...@gmail.com wrote:

Choosing lambda = 0.1 shouldn't lead to the error you got. This is probably a bug. Do you mind sharing a small amount of data that can reproduce the error? -Xiangrui

On Fri, Mar 7, 2014 at 8:24 AM, Debasish Das debasish.da...@gmail.com wrote:

Hi Xiangrui, I used lambda = 0.1...It is possible that 2 users ranked movies in a very similar way... I agree that increasing lambda will solve the problem, but you agree this is not a solution...lambda should be tuned based on sparsity / other criteria and not to make a linearly dependent Hessian matrix linearly independent... Thanks. Deb

On Thu, Mar 6, 2014 at 7:20 PM, Xiangrui Meng men...@gmail.com wrote:

If the matrix is very ill-conditioned, then A^T A becomes numerically rank deficient. However, if you use a reasonably large positive regularization constant (lambda), A^T A + lambda I should still be positive definite. What was the regularization constant (lambda) you set? Could you test whether the error still happens when you use a large lambda? Best, Xiangrui
Re: Announcing the official Spark Job Server repo
https://spark-project.atlassian.net/browse/SPARK-1283 On Wed, Mar 19, 2014 at 10:59 AM, Gerard Maas gerard.m...@gmail.com wrote: this is cool +1 On Wed, Mar 19, 2014 at 6:54 PM, Patrick Wendell pwend...@gmail.com wrote: Evan - yep definitely open a JIRA. It would be nice to have a contrib repo set-up for the 1.0 release. On Tue, Mar 18, 2014 at 11:28 PM, Evan Chan e...@ooyala.com wrote: Matei, Maybe it's time to explore the spark-contrib idea again? Should I start a JIRA ticket? -Evan On Tue, Mar 18, 2014 at 4:04 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Cool, glad to see this posted! I've added a link to it at https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark. Matei On Mar 18, 2014, at 1:51 PM, Evan Chan e...@ooyala.com wrote: Dear Spark developers, Ooyala is happy to announce that we have pushed our official, Spark 0.9.0 / Scala 2.10-compatible, job server as a github repo: https://github.com/ooyala/spark-jobserver Complete with unit tests, deploy scripts, and examples. The original PR (#222) on incubator-spark is now closed. Please have a look; pull requests are very welcome. -- -- Evan Chan Staff Engineer e...@ooyala.com | -- -- Evan Chan Staff Engineer e...@ooyala.com | -- -- Evan Chan Staff Engineer e...@ooyala.com |
Re: repositories for spark jars
The alternative is for Spark to not explicitly include hadoop-client, perhaps marking it only as "provided", and to provide a facility to insert the hadoop-client jars of your choice at packaging time. Unfortunately, hadoop-client pulls in a ton of other deps, so it's not as simple as copying one extra jar into dist/jars.

On Mon, Mar 17, 2014 at 10:58 AM, Patrick Wendell pwend...@gmail.com wrote:

Hey Nathan, I don't think this would be possible because there are at least dozens of permutations of Hadoop versions (different vendor distros X different versions X YARN vs not YARN, etc.) and maybe hundreds. So publishing new artifacts for each would be really difficult. What is the exact problem you ran into? Maybe we need to improve the documentation to make it more clear how to correctly link against Spark/Hadoop for user applications. Basically, the model we have now is that users link against Spark and then link against the hadoop-client relevant to their version of Hadoop. - Patrick

On Mon, Mar 17, 2014 at 9:50 AM, Nathan Kronenfeld nkronenf...@oculusinfo.com wrote:

After just spending a couple of days fighting with a new Spark installation, getting Spark and Hadoop version numbers matching everywhere, I have a suggestion I'd like to put out there. Can we put the Hadoop version against which the Spark jars were built into the version number? I noticed that the Cloudera maven repo has started to do this ( https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/spark/spark-core_2.10/ ) - sadly, though, only with the cdh5.x versions, not with the 4.x versions for which they also have Spark parcels. But I see no signs of it in the central maven repo. Is this already done in some other repo about which I don't know, perhaps? I know it would save us a lot of time and grief simply to be able to point a project we build at the right version, and not have to rebuild and deploy Spark manually.

--
Nathan Kronenfeld
Senior Visualization Developer
Oculus Info Inc
2 Berkeley Street, Suite 600, Toronto, Ontario M5A 4J5
Phone: +1-416-203-3003 x 238
Email: nkronenf...@oculusinfo.com

-- Evan Chan, Staff Engineer, e...@ooyala.com
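[Editor's note] To illustrate the linking model Patrick describes, here is a minimal build.sbt sketch: the application depends on Spark plus the hadoop-client artifact matching its cluster. The project name and version numbers are placeholders for whatever your cluster actually runs, not a recommendation.

    // build.sbt (sketch): link against Spark, then against the hadoop-client
    // that matches your cluster; versions below are placeholders.
    name := "my-spark-app"

    scalaVersion := "2.10.3"

    libraryDependencies ++= Seq(
      "org.apache.spark"  %% "spark-core"    % "0.9.0-incubating",
      // Pick the hadoop-client that matches the cluster you deploy to,
      // e.g. a vendor (CDH) build instead of plain 1.2.1.
      "org.apache.hadoop" %  "hadoop-client" % "1.2.1"
    )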
Re: ALS solve.solvePositive
They have been merged into the master branch. However, the improvements are for implicit ALS computation. I don't think they can speed up normal ALS computation. Could you share more details about the variable projection? JIRAs: https://spark-project.atlassian.net/browse/SPARK-1266 https://spark-project.atlassian.net/browse/SPARK-1238 https://spark-project.atlassian.net/browse/MLLIB-25 Best, Xiangrui On Wed, Mar 19, 2014 at 3:17 PM, Debasish Das debasish.da...@gmail.com wrote: Nope...with the cleaner dataset I am not noticing issues with the dposv and this dataset is even bigger...20 M users and 1 M products...I don't think other than cholesky anything else will get us the efficiency we need... For my usecase we also need to see the effectiveness of positive factors and I am doing variable projection as a start.. If possible could you please point me to the PRs related to ALS improvements ? Are they all added to the master ? There are at least 3 PRs that Sean and you contributed recently related to ALS efficiency. A JIRA or gist will definitely help a lot. Thanks. Deb On Wed, Mar 19, 2014 at 10:11 AM, Xiangrui Meng men...@gmail.com wrote: Another question: do you have negative or out-of-range user or product ids or? -Xiangrui On Tue, Mar 11, 2014 at 8:00 PM, Debasish Das debasish.da...@gmail.com wrote: Nope..I did not test implicit feedback yet...will get into more detailed debug and generate the testcase hopefully next week... On Mar 11, 2014 7:02 PM, Xiangrui Meng men...@gmail.com wrote: Hi Deb, did you use ALS with implicit feedback? -Xiangrui On Mon, Mar 10, 2014 at 1:17 PM, Xiangrui Meng men...@gmail.com wrote: Choosing lambda = 0.1 shouldn't lead to the error you got. This is probably a bug. Do you mind sharing a small amount of data that can re-produce the error? -Xiangrui On Fri, Mar 7, 2014 at 8:24 AM, Debasish Das debasish.da...@gmail.com wrote: Hi Xiangrui, I used lambda = 0.1...It is possible that 2 users ranked in movies in a very similar way... I agree that increasing lambda will solve the problem but you agree this is not a solution...lambda should be tuned based on sparsity / other criteria and not to make a linearly dependent hessian matrix linearly independent... Thanks. Deb On Thu, Mar 6, 2014 at 7:20 PM, Xiangrui Meng men...@gmail.com wrote: If the matrix is very ill-conditioned, then A^T A becomes numerically rank deficient. However, if you use a reasonably large positive regularization constant (lambda), A^T A + lambda I should be still positive definite. What was the regularization constant (lambda) you set? Could you test whether the error still happens when you use a large lambda? Best, Xiangrui
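[Editor's note] For readers trying to map this discussion onto the API: the merged improvements concern the implicit-feedback path, which in MLlib is a separate entry point from standard ALS. A quick sketch of the two calls follows; the input file, master setting, and parameter values are illustrative only.

    import org.apache.spark.SparkContext
    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    object AlsEntryPoints {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext("local[4]", "als-sketch")

        // Hypothetical input: "user,product,rating" lines.
        val ratings = sc.textFile("ratings.csv").map { line =>
          val Array(user, product, rating) = line.split(',')
          Rating(user.toInt, product.toInt, rating.toDouble)
        }

        // Standard (explicit-feedback) ALS: rank, iterations, lambda.
        val explicitModel = ALS.train(ratings, 20, 10, 0.1)

        // Implicit-feedback ALS, the path the recent PRs optimize:
        // extra alpha parameter weighting the confidence of observations.
        val implicitModel = ALS.trainImplicit(ratings, 20, 10, 0.1, 1.0)

        println(explicitModel.rank + " " + implicitModel.rank)
        sc.stop()
      }
    }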
Spark 0.9.1 release
Hello everyone,

Since the release of Spark 0.9, we have received a number of important bug fixes and we would like to make a bug-fix release of Spark 0.9.1. We are going to cut a release candidate soon and we would love it if people test it out. We have backported several bug fixes into the 0.9 branch and updated JIRA accordingly: https://spark-project.atlassian.net/browse/SPARK-1275?jql=project%20in%20(SPARK%2C%20BLINKDB%2C%20MLI%2C%20MLLIB%2C%20SHARK%2C%20STREAMING%2C%20GRAPH%2C%20TACHYON)%20AND%20fixVersion%20%3D%200.9.1%20AND%20status%20in%20(Resolved%2C%20Closed)

Please let me know if there are fixes that were not backported but you would like to see them in 0.9.1.

Thanks!

TD
Re: Spark 0.9.1 release
Would be great if the garbage collection PR is also committed - if not the whole thing, at least the part to unpersist broadcast variables explicitly would be great. Currently we are running with a custom impl which does something similar, and I would like to move to the standard distribution for that.

Thanks,
Mridul

On Wed, Mar 19, 2014 at 5:07 PM, Tathagata Das tathagata.das1...@gmail.com wrote:

Hello everyone, Since the release of Spark 0.9, we have received a number of important bug fixes and we would like to make a bug-fix release of Spark 0.9.1. We are going to cut a release candidate soon and we would love it if people test it out. We have backported several bug fixes into the 0.9 branch and updated JIRA accordingly: https://spark-project.atlassian.net/browse/SPARK-1275?jql=project%20in%20(SPARK%2C%20BLINKDB%2C%20MLI%2C%20MLLIB%2C%20SHARK%2C%20STREAMING%2C%20GRAPH%2C%20TACHYON)%20AND%20fixVersion%20%3D%200.9.1%20AND%20status%20in%20(Resolved%2C%20Closed). Please let me know if there are fixes that were not backported but you would like to see them in 0.9.1. Thanks! TD
Re: Spark 0.9.1 release
If 1.0 is just round the corner, then it is fair enough to push to that. Thanks for clarifying!

Regards,
Mridul

On Wed, Mar 19, 2014 at 6:12 PM, Tathagata Das tathagata.das1...@gmail.com wrote:

I agree that the garbage collection PR (https://github.com/apache/spark/pull/126) would make things very convenient in a lot of use cases. However, there are two broad reasons why it is hard for that PR to get into 0.9.1.

1. The PR still needs some amount of work and quite a lot of testing. While we enable RDD and shuffle cleanup based on Java GC, its behavior in real workloads still needs to be understood (especially since it is tied to the Spark driver's garbage collection behavior).

2. This actually changes some of the semantic behavior of Spark and should not be included in a bug-fix release.

The PR will definitely be present for Spark 1.0, which is expected to be released around the end of April (not too far ;) ).

TD

On Wed, Mar 19, 2014 at 5:57 PM, Mridul Muralidharan mri...@gmail.com wrote:

Would be great if the garbage collection PR is also committed - if not the whole thing, at least the part to unpersist broadcast variables explicitly would be great. Currently we are running with a custom impl which does something similar, and I would like to move to the standard distribution for that. Thanks, Mridul

On Wed, Mar 19, 2014 at 5:07 PM, Tathagata Das tathagata.das1...@gmail.com wrote:

Hello everyone, Since the release of Spark 0.9, we have received a number of important bug fixes and we would like to make a bug-fix release of Spark 0.9.1. We are going to cut a release candidate soon and we would love it if people test it out. We have backported several bug fixes into the 0.9 branch and updated JIRA accordingly: https://spark-project.atlassian.net/browse/SPARK-1275?jql=project%20in%20(SPARK%2C%20BLINKDB%2C%20MLI%2C%20MLLIB%2C%20SHARK%2C%20STREAMING%2C%20GRAPH%2C%20TACHYON)%20AND%20fixVersion%20%3D%200.9.1%20AND%20status%20in%20(Resolved%2C%20Closed). Please let me know if there are fixes that were not backported but you would like to see them in 0.9.1. Thanks! TD
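[Editor's note] For context on what "unpersist broadcast variables explicitly" looks like in user code, here is a minimal sketch using the explicit-cleanup call the PR (and eventually Spark 1.0) exposes. The exact method signature is assumed from the post-1.0 API and may well differ from the custom implementation Mridul mentions.

    import org.apache.spark.SparkContext

    object BroadcastCleanupSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext("local[2]", "broadcast-cleanup")

        // Broadcast a lookup table to all executors, use it in one job...
        val lookup = sc.broadcast((1 to 100000).map(i => i -> i.toString).toMap)
        val hits = sc.parallelize(1 to 100).map(i => lookup.value.getOrElse(i, "?")).count()
        println(s"looked up $hits keys")

        // ...then explicitly drop its blocks once no further stages need it,
        // instead of waiting for GC-driven cleanup.
        lookup.unpersist(blocking = false) // signature assumed from the Spark 1.0-era API

        sc.stop()
      }
    }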
How the scala style checker works?
Hi, all

I'm just curious about the working mechanism of the scala style checker. When I work on a PR, I found that the following line contains 101 chars, violating the 100-character limit: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L515 Yet the current scala style checker passes this line?

Best,

--
Nan Zhu
Re: How the scala style checker works?
Hey Nan,

The line contains exactly 100 chars; the cursor sits at the 101st character, and that is why the IDE indicates it that way.

Thanks,
Nirmal Reddy.

On Thu, Mar 20, 2014 at 10:49 AM, Nan Zhu zhunanmcg...@gmail.com wrote:

Hi, all I'm just curious about the working mechanism of the scala style checker. When I work on a PR, I found that the following line contains 101 chars, violating the 100-character limit: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L515 Yet the current scala style checker passes this line? Best, -- Nan Zhu
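[Editor's note] If anyone wants to settle the character count independently of what an editor's column indicator shows, a quick standalone check like the sketch below works; the file path assumes it is run from the root of a Spark checkout.

    import scala.io.Source

    // Sanity-check a line's real character count: an IDE cursor sitting at
    // column 101 just means there are 100 characters before it.
    object LineLengthCheck {
      def main(args: Array[String]): Unit = {
        // Path assumes the working directory is the root of a Spark checkout.
        val path = "core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala"
        val line = Source.fromFile(path).getLines().drop(514).next() // line 515, 1-indexed
        println(s"DAGScheduler.scala:515 has ${line.length} characters")
        // Scalastyle's line-length rule only flags lines longer than 100 characters.
      }
    }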