Re: run reduceByKey on huge data in spark

2015-06-30 Thread lisendong
hello, I'm using spark 1.4.2-SNAPSHOT and I'm running in yarn mode :-) I wonder whether spark.shuffle.memoryFraction or spark.shuffle.manager work? How do I set these parameters... On July 1, 2015, at 1:32 AM, Ted Yu yuzhih...@gmail.com wrote: Which Spark release are you using ? Are you running in standalone
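A minimal sketch of how these two Spark 1.x settings could be supplied from the driver program; the app name and values are illustrative, not tuned recommendations:

    import org.apache.spark.{SparkConf, SparkContext}

    // Both settings must be in place before the SparkContext is created.
    val conf = new SparkConf()
      .setAppName("reduceByKeyOnHugeData")           // illustrative name
      .set("spark.shuffle.manager", "sort")          // "sort" (default since 1.2) or "hash"
      .set("spark.shuffle.memoryFraction", "0.4")    // heap fraction for shuffle aggregation
    val sc = new SparkContext(conf)

The same settings can also be passed at launch time via spark-submit --conf key=value, which avoids a recompile.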

Re: how to read lz4 compressed data using fileStream of spark streaming?

2015-05-14 Thread lisendong
/LzoTextInputFormat.java the class. You can read more here: https://github.com/twitter/hadoop-lzo#maven-repository Thanks Best Regards On Thu, May 14, 2015 at 1:22 PM, lisendong lisend...@163.com wrote
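A sketch of reading compressed text through fileStream, following the hadoop-lzo suggestion above (note the reply points at LZO, not LZ4); the directory path is illustrative, and hadoop-lzo plus its native codec must be installed on the cluster:

    import com.hadoop.mapreduce.LzoTextInputFormat   // from twitter/hadoop-lzo
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(60))
    // fileStream is parameterized by key, value and InputFormat, so any
    // Hadoop InputFormat (here the LZO-aware one) can be plugged in.
    val lines = ssc
      .fileStream[LongWritable, Text, LzoTextInputFormat]("hdfs:///user/root/input")
      .map(_._2.toString)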

how to load some of the files in a dir and monitor new files in that dir in spark streaming without missing any?

2015-05-11 Thread lisendong
I have one hdfs dir, which contains many files: /user/root/1.txt /user/root/2.txt /user/root/3.txt /user/root/4.txt and there is a daemon process which adds one file per minute to this dir (e.g., 5.txt, 6.txt, 7.txt...). I want to start a spark streaming job which loads 3.txt, 4.txt and then
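A sketch of the fileStream overload that also considers files already present in the directory; newFilesOnly = false is the relevant switch, though in Spark 1.x it only picks up existing files within an internal modification-time window, so very old files may still be skipped:

    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(60))
    val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](
      "hdfs:///user/root",                            // the monitored dir
      (path: Path) => !path.getName.startsWith("."),  // ignore hidden/tmp files
      newFilesOnly = false
    ).map(_._2.toString)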

union each streaming window into a static rdd and use the static rdd periodically

2015-05-06 Thread lisendong
the pseudo code: object myApp { var myStaticRDD: RDD[Int] def main() { ... //init streaming context, and get two DStreams (streamA and streamB) from two hdfs paths //complex transformation using the two DStreams val new_stream = streamA.transformWith(streamB, (a, b, t) => {
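One way the described pattern is often written, sketched here with the post's own names (myStaticRDD, new_stream, assumed to carry Int); the checkpoint calls are an added assumption, needed to keep the ever-growing union lineage and its shuffle files bounded:

    import org.apache.spark.rdd.RDD

    sc.setCheckpointDir("hdfs:///tmp/ckpt")    // illustrative path
    var myStaticRDD: RDD[Int] = sc.emptyRDD[Int]

    new_stream.foreachRDD { batch =>
      // fold each window's data into the driver-held "static" RDD
      myStaticRDD = myStaticRDD.union(batch).cache()
      myStaticRDD.checkpoint()   // truncate the lineage (every few batches in practice)
      myStaticRDD.count()        // materialize so the checkpoint actually happens
    }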

Re: there are about 50% all-zero vectors in the als result

2015-04-02 Thread lisendong
yes! thank you very much :-) On April 2, 2015, at 7:13 PM, Sean Owen so...@cloudera.com wrote: Right, I asked because in your original message, you were looking at the initialization to a random vector. But that is the initial state, not the final state. On Thu, Apr 2, 2015 at 11:51 AM, lisendong lisend

Re: there are about 50% all-zero vectors in the als result

2015-04-02 Thread lisendong
to the initialization, not the result, right? It's possible that the resulting weight vectors are sparse although this looks surprising to me. But it is not related to the initial state, right? On Thu, Apr 2, 2015 at 10:43 AM, lisendong lisend...@163.com wrote: I
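A quick sketch of how the all-zero factors could be counted directly on the trained model, to confirm the sparsity really is in the final result rather than in the random initialization (model here is assumed to be a fitted MatrixFactorizationModel):

    // userFeatures is an RDD[(Int, Array[Double])] on the trained model
    val zeroUsers = model.userFeatures
      .filter { case (_, factors) => factors.forall(_ == 0.0) }
      .count()
    println(s"users with an all-zero factor vector: $zeroUsers")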

Re: different result from implicit ALS with explicit ALS

2015-03-31 Thread lisendong
On March 31, 2015, at 12:11 AM, Xiangrui Meng men...@gmail.com wrote: setCheckpointInterval was added in the current master and branch-1.3. Please help check whether it works. It will be included in the 1.3.1 and 1.4.0 releases. -Xiangrui On Mon, Mar 30, 2015 at 7:27 AM, lisendong lisend...@163.com
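A minimal sketch of the setting Xiangrui refers to, assuming MLlib 1.3.1+; the checkpoint dir and hyperparameter values are illustrative:

    import org.apache.spark.mllib.recommendation.ALS

    sc.setCheckpointDir("hdfs:///tmp/als-checkpoint")  // required before checkpointing
    val model = new ALS()
      .setRank(50)
      .setIterations(30)
      .setLambda(0.01)
      .setCheckpointInterval(10)   // checkpoint every 10 iterations
      .run(ratings)                // ratings: RDD[Rating]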

Re: different result from implicit ALS with explicit ALS

2015-03-31 Thread lisendong
-1.3. Please help check whether it works. It will be included in the 1.3.1 and 1.4.0 releases. -Xiangrui On Mon, Mar 30, 2015 at 7:27 AM, lisendong lisend...@163.com wrote: hi, xiangrui: I found the ALS of spark 1.3.0 forgets to do checkpoint() in explicit ALS

Re: different result from implicit ALS with explicit ALS

2015-03-31 Thread lisendong
. Is it correct? Best, Xiangrui On Tue, Mar 31, 2015 at 8:58 AM, lisendong lisend...@163.com wrote: guoqiang's method works very well … it only takes 1TB of disk now. thank you very much! On March 31, 2015, at 4:47 PM, GuoQiang Li wi...@qq.com wrote: You

Re: different result from implicit ALS with explicit ALS

2015-03-31 Thread lisendong
. It will be included in the 1.3.1 and 1.4.0 releases. -Xiangrui On Mon, Mar 30, 2015 at 7:27 AM, lisendong lisend...@163.com wrote: hi, xiangrui: I found the ALS of spark 1.3.0 forgets to do checkpoint() in explicit ALS: the code is: https://github.com/apache/spark

what are the types of tasks when running ALS iterations

2015-03-09 Thread lisendong
you see, the core of ALS 1.0.0 is the following code: there should be flatMap and groupByKey tasks when running ALS iterations, right? but when I run the ALS iterations, there are ONLY flatMap tasks... do you know why? private def updateFeatures( products: RDD[(Int,

why does my YoungGen GC take so long?

2015-03-05 Thread lisendong
I found my tasks take a long time in YoungGen GC. I set the young gen size to about 1.5G, and I wonder why it takes so long? Not all the tasks take such a long time, only about 1% of them... 180.426: [GC [PSYoungGen: 9916105K->1676785K(14256640K)] 26201020K->18690057K(53403648K), 17.3581500
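A sketch of one way to experiment with the young generation size and get detailed GC logs, passed through executor JVM options; the 2g value is purely illustrative, not a recommendation:

    import org.apache.spark.SparkConf

    // -Xmn pins the young generation size; the PrintGC flags make the
    // executor stdout show pauses like the one quoted above.
    val conf = new SparkConf()
      .set("spark.executor.extraJavaOptions",
        "-Xmn2g -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")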

Re: spark master shut down suddenly

2015-03-04 Thread lisendong
I'm sorry, but how do I look at the mesos logs? Where are they? On March 4, 2015, at 6:06 PM, Akhil Das ak...@sigmoidanalytics.com wrote: You can check in the mesos logs and see what's really happening. Thanks Best Regards On Wed, Mar 4, 2015 at 3:10 PM, lisendong lisend...@163.com

how to update als in mllib?

2015-03-04 Thread lisendong
I'm using spark 1.0.0 with cloudera, but I want to use the new ALS code which supports more features, such as rdd cache level (MEMORY_ONLY), checkpoint, and so on. What is the easiest way to use the new ALS code? I only need the mllib ALS code, so maybe I don't need to update all the spark mllib

spark master shut down suddenly

2015-03-04 Thread lisendong
15/03/04 09:26:36 INFO ClientCnxn: Client session timed out, have not heard from server in 26679ms for sessionid 0x34bbf3313a8001b, closing socket connection and attempting reconnect 15/03/04 09:26:36 INFO ConnectionStateManager: State change: SUSPENDED 15/03/04 09:26:36 INFO

Re: how to clean shuffle write each iteration

2015-03-03 Thread lisendong
in ALS, I guess each iteration's rdds are referenced by the next iteration's rdd, so all the shuffle data will not be deleted until the ALS job finishes… I guess checkpoint could solve my problem; do you know about checkpoint? On March 3, 2015, at 4:18 PM, nitin [via Apache Spark User List]
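A sketch of the checkpoint idea for a hand-rolled iterative job; initial and step are hypothetical stand-ins for the real input RDD and per-iteration transformation:

    import org.apache.spark.rdd.RDD

    sc.setCheckpointDir("hdfs:///tmp/ckpt")    // illustrative path
    var current: RDD[(Int, Double)] = initial  // initial: hypothetical input RDD
    for (i <- 1 to 30) {
      current = step(current).cache()          // step: hypothetical per-iteration transform
      if (i % 5 == 0) {
        current.checkpoint()   // cuts the lineage back to this point
        current.count()        // force materialization of the checkpoint
      }
    }
    // With the lineage cut, shuffle files from earlier iterations are no
    // longer referenced and can be cleaned up before the job finishes.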

spark.local.dir leads to Job cancelled because SparkContext was shut down

2015-03-03 Thread lisendong
As long as I set spark.local.dir to multiple disks, the job fails with the errors below (if I set spark.local.dir to only 1 dir, the job succeeds...): Exception in thread "main" org.apache.spark.SparkException: Job cancelled because SparkContext was shut down at
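For reference, a sketch of the comma-separated form spark.local.dir expects when spreading shuffle and spill files over several disks; the paths are illustrative, and every path must exist and be writable on every node:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.local.dir", "/mnt/disk1/spark,/mnt/disk2/spark,/mnt/disk3/spark")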

gc time too long when using mllib als

2015-03-03 Thread lisendong
why is the gc time so long? I'm using ALS in mllib, and the garbage collection time is too long (about 1/3 of the total time). I have tried some measures from the Spark tuning guide, and tried to set the new generation memory, but it still does not work... Tasks Task Index Task ID
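One more knob from the tuning guide that often helps here, sketched below: caching the large rating/factor RDDs in serialized form keeps far fewer live objects on the heap, which usually shortens GC pauses at some deserialization CPU cost (ratings stands for whatever big RDD is being cached):

    import org.apache.spark.storage.StorageLevel

    ratings.persist(StorageLevel.MEMORY_ONLY_SER)   // one serialized copy in memory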

how to clean shuffle write each iteration

2015-03-02 Thread lisendong
I'm using spark ALS. I set the iteration number to 30, and in each iteration the tasks produce nearly 1TB of shuffle write. To my surprise, this shuffle data is not cleaned until the whole job finishes, which means I need 30TB of disk to store the shuffle data. I think after each

different result from implicit ALS with explicit ALS

2015-02-26 Thread lisendong
I'm using ALS with spark 1.0.0, the code should be: https://github.com/apache/spark/blob/branch-1.0/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala I think the following two methods should produce the same (or nearly the same) result: MatrixFactorizationModel model =
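A sketch of the two calls presumably being compared; the rank, iterations, lambda, and alpha values are illustrative. Note the two are not expected to match exactly: train() fits the rating values themselves, while trainImplicit() reinterprets them as confidence-weighted binary preferences:

    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    // ratings: RDD[Rating]
    val explicitModel = ALS.train(ratings, 50, 10, 0.01)
    val implicitModel = ALS.trainImplicit(ratings, 50, 10, 0.01, 40.0)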

Re: different result from implicit ALS with explicit ALS

2015-02-26 Thread lisendong
a Rating for these data points. What then? Also, would you care to bring this to the user@ list? It's kind of interesting. On Thu, Feb 26, 2015 at 2:02 PM, lisendong lisend...@163.com wrote: I set the score of the '0' interaction user-item pairs to 0.0. The code is as follows: if (ifclick > 0
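A sketch reconstructing the pattern the truncated snippet describes; events and its field names are hypothetical stand-ins for the real click log:

    import org.apache.spark.mllib.recommendation.Rating

    // events: hypothetical RDD[(Int, Int, Int)] of (user, item, ifclick)
    val ratings = events.map { case (user, item, ifclick) =>
      if (ifclick > 0) Rating(user, item, 1.0)   // clicked pair
      else             Rating(user, item, 0.0)   // shown but not clicked
    }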