Re: how to handle OOMError from groupByKey

2015-09-28 Thread Alexis Gillain
…e.org/mod_mbox/spark-user/201501.mbox/%3ccaae1cqr8rd8ypebcmbjwfhm+lxh6nw4+r+uharx00psk_sh...@mail.gmail.com%3E
http://apache-spark-user-list.1001560.n3.nabble.com/Partition-sorting-by-Spark-framework-td18213.html
http://apache-spark-user-list.1001560.n3.nabble.com/Alternatives-to-groupByKey-td20293.html

And this JIRA seems relevant too: https://issues.apache.org/jira/browse/SPARK-3655

I'm using 2 GB of memory per executor, and I can't go higher than that because each executor gets a YARN container from nodes with 16 GB of RAM and 5 YARN containers allowed per node.

So I'd like to know if there's an easy way to run my logic on my full dataset in Spark.

Thanks!
-- Elango
-- Alexis GILLAIN
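For reference, the usual way around groupByKey-driven OOMs is to push the per-key reduction into the shuffle with reduceByKey or aggregateByKey, so values are combined map-side instead of materializing every value for a key in one task. A minimal sketch; the per-record aggregation here (a per-key mean) is a hypothetical stand-in for whatever the original job computes:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object GroupByKeyAlternative {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("groupByKey-alternative"))

    // Hypothetical (key, value) records standing in for the real dataset.
    val records = sc.parallelize(Seq(("a", 1.0), ("b", 2.0), ("a", 3.0)))

    // records.groupByKey() would ship every value for a key to a single task
    // and can OOM on hot keys. aggregateByKey keeps only a small running
    // state per key and combines partial results map-side before the shuffle.
    val sumCount = records.aggregateByKey((0.0, 0L))(
      (acc, v) => (acc._1 + v, acc._2 + 1L),  // fold one value into the accumulator
      (a, b)   => (a._1 + b._1, a._2 + b._2)  // merge accumulators across partitions
    )
    val meanPerKey = sumCount.mapValues { case (sum, n) => sum / n }

    meanPerKey.collect().foreach(println)
    sc.stop()
  }
}
```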

Re: Troubleshooting "Task not serializable" in Spark/Scala environments

2015-09-21 Thread Alexis Gillain
…
org.apache.spark.SparkContext.clean(SparkContext.scala:1893)
org.apache.spark.rdd.RDD$$anonfun$filter$1.apply(RDD.scala:311)
org.apache.spark.rdd.RDD$$anonfun$filter$1.apply(RDD.scala:310)
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
org.apache.spark.rdd.RDD.withScope(RDD.scala:286)
org.apache.spark.rdd.RDD.filter(RDD.scala:310)
cmd6$$user$$anonfun$3.apply(Main.scala:134)
cmd6$$user$$anonfun$3.apply(Main.scala:133)

Thanks,
Balaji
-- Alexis GILLAIN
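The stack trace shows SparkContext.clean rejecting the closure passed to filter, which typically means the lambda captures its enclosing, non-serializable object. A common fix, sketched here with hypothetical class and field names, is to copy the fields the closure actually needs into local vals so only those values are serialized:

```scala
import org.apache.spark.rdd.RDD

// Hypothetical enclosing class; imagine it also holds non-serializable members
// (a SparkContext, a connection, ...), which is what makes the task fail.
class EventFilter(threshold: Int) {

  // BAD: referring to `threshold` directly drags the whole EventFilter
  // instance (including any non-serializable members) into the closure:
  // def keepLarge(rdd: RDD[Int]): RDD[Int] = rdd.filter(x => x > threshold)

  // GOOD: copy the needed field into a local val first; only the Int is captured.
  def keepLarge(rdd: RDD[Int]): RDD[Int] = {
    val localThreshold = threshold
    rdd.filter(x => x > localThreshold)
  }
}
```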

Re: Spark wastes a lot of space (tmp data) for iterative jobs

2015-09-16 Thread Alexis Gillain
… (not memory space), GC does not run; therefore the finalize() methods for the intermediate RDDs are not triggered.

2. System.gc() is only executed on the driver, not on the workers (is that how it works?).

Any suggestions?

Kind regards,
Ali Hadian
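On point 2: System.gc() called from driver code does indeed only run in the driver JVM. A workaround sometimes suggested on this list, sketched below, is to run a throwaway job whose tasks call System.gc() on the workers; the slot count is an assumption about the cluster, and the call is only a hint to the JVM:

```scala
import org.apache.spark.SparkContext

// Best-effort GC on the executors: run one trivial task per executor slot
// and ask each task's JVM to garbage-collect, then do the same on the driver.
def gcOnWorkers(sc: SparkContext, slots: Int = 100): Unit = {
  sc.parallelize(1 to slots, slots).foreach(_ => System.gc())
  System.gc()
}
```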

Re: Spark wastes a lot of space (tmp data) for iterative jobs

2015-09-15 Thread Alexis Gillain
… the intermediate data from the previous iteration. Anyway, why does it keep the intermediate data for ALL previous iterations? How can we force Spark to clear this intermediate data *during* the execution of the job?

Kind regards,
Ali Hadian
-- Alexis GILLAIN
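One way to keep Spark from holding every iteration's intermediate data is to materialize each new RDD and then explicitly unpersist its predecessor. A minimal sketch; the `step` function is a hypothetical stand-in for the real per-iteration transformation:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Iteratively refine an RDD while keeping at most two generations cached.
def iterate(initial: RDD[Double],
            step: RDD[Double] => RDD[Double],
            iterations: Int): RDD[Double] = {
  var current = initial.persist(StorageLevel.MEMORY_AND_DISK)
  current.count() // materialize the starting point

  for (_ <- 1 to iterations) {
    val next = step(current).persist(StorageLevel.MEMORY_AND_DISK)
    next.count()                       // force evaluation before dropping the parent
    current.unpersist(blocking = true) // release the previous iteration's blocks
    current = next
  }
  current
}
```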

Re: Spark aggregateByKey Issues

2015-09-15 Thread Alexis Gillain
… the question: how do I choose a sensible number of partitions, if it does not need to equal the number of keys?

On September 15, 2015, at 3:41 PM, Alexis Gillain <alexis.gill...@googlemail.com> wrote:

Sorry, I made a typo in my previous message: you can't sortByKey(youkey,
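On the partitioning question: the partition count is about parallelism and per-task data volume, not the number of distinct keys. A common rule of thumb is a small multiple of the total executor cores. A hedged sketch; the multiplier is an assumption, not a Spark default:

```scala
import org.apache.spark.rdd.RDD

// Pick a partition count from the cluster's parallelism rather than the key count.
def choosePartitions(rdd: RDD[_], tasksPerCore: Int = 3): Int = {
  val cores = rdd.sparkContext.defaultParallelism // roughly total executor cores
  math.max(cores * tasksPerCore, 1)
}

// Usage with aggregateByKey: the partition count is independent of the number of keys.
def sumPerKey(pairs: RDD[(String, Long)]): RDD[(String, Long)] = {
  val numPartitions = choosePartitions(pairs)
  pairs.aggregateByKey(0L, numPartitions)(_ + _, _ + _)
}
```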

Re: Spark aggregateByKey Issues

2015-09-14 Thread Alexis Gillain
…? If it is the former, the combOp function does nothing!

I tried to use the second "numPartitions" parameter and pass the number of keys to it. But the number of keys is so large that all the tasks get killed.

What should I do in this case? I'm asking for advice online...

Thank you.
-- Alexis GILLAIN
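On the seqOp/combOp question: seqOp folds individual values into a per-partition accumulator, and combOp then merges the per-partition accumulators, so combOp does run whenever a key's values span more than one partition. A small sketch illustrating the split:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object AggregateByKeyOps {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("aggregateByKey-ops"))

    // Two partitions, so the key "x" has values on both of them.
    val pairs = sc.parallelize(Seq(("x", 1), ("x", 2), ("x", 3), ("y", 4)), 2)

    val maxPerKey = pairs.aggregateByKey(Int.MinValue)(
      (acc, v) => math.max(acc, v), // seqOp: fold a value into the partition-local accumulator
      (a, b)   => math.max(a, b)    // combOp: merge accumulators from different partitions
    )

    maxPerKey.collect().foreach(println) // (x,3), (y,4)
    sc.stop()
  }
}
```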

Re: Multilabel classification support

2015-09-11 Thread Alexis Gillain
… I found http://spark.apache.org/docs/latest/mllib-classification-regression.html, but it is not what I mean. Is there a way to use multilabel classification? Thanks a lot.

Best,
Yasemin
--
hiç ender hiç
-- Alexis GILLAIN
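Since MLlib at this point has no dedicated multilabel classifier, one standard workaround is binary relevance: train one independent binary classifier per label and assign every label whose classifier fires. A rough sketch under the assumption that the data is already available as a feature vector plus a set of label indices per example:

```scala
import org.apache.spark.mllib.classification.{LogisticRegressionModel, LogisticRegressionWithLBFGS}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Binary relevance: one binary logistic regression per label.
// `examples` pairs a feature vector with the set of labels (0 until numLabels) it carries.
def trainBinaryRelevance(examples: RDD[(Vector, Set[Int])],
                         numLabels: Int): Seq[LogisticRegressionModel] = {
  (0 until numLabels).map { label =>
    val binary = examples.map { case (features, labels) =>
      LabeledPoint(if (labels.contains(label)) 1.0 else 0.0, features)
    }.cache()
    val model = new LogisticRegressionWithLBFGS().run(binary)
    binary.unpersist()
    model
  }
}

// Prediction: an example gets every label whose binary classifier predicts 1.0.
def predictLabels(models: Seq[LogisticRegressionModel], features: Vector): Set[Int] =
  models.zipWithIndex.collect { case (m, i) if m.predict(features) >= 0.5 => i }.toSet
```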

Re: Multilabel classification support

2015-09-11 Thread Alexis Gillain
… Yanbo Liang <yblia...@gmail.com>:

LogisticRegression in the MLlib (not ML) package supports both multiclass and multilabel classification.

2015-09-11 16:21 GMT+08:00 Alexis Gillain <alexis.gill...@googlemail.com>:

You can try these packages for
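For the multiclass (single-label, many classes) case, MLlib's LogisticRegressionWithLBFGS exposes setNumClasses for multinomial training. A minimal sketch, assuming LabeledPoint data with class labels 0.0 through k-1:

```scala
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Multinomial logistic regression over k classes.
def trainMulticlass(training: RDD[LabeledPoint], numClasses: Int) =
  new LogisticRegressionWithLBFGS()
    .setNumClasses(numClasses)
    .run(training)
```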

Re: Memory-efficient successive calls to repartition()

2015-09-02 Thread alexis GILLAIN
… a command-line browser to look at the webui (I cannot access the server in graphical display mode); this should help me understand what's going on. I will also try the workarounds mentioned in the link. I'll keep you posted.

Again, thanks a lot!

Best,
Aurelien

Re: Memory-efficient successive calls to repartition()

2015-09-02 Thread alexis GILLAIN
… Cloudera Manager) *besides* the checkpoint files (which are regular HDFS files), and the application eventually runs out of disk space. The same is true even if I checkpoint at every iteration.

What am I doing wrong? Maybe some garbage collector setting?

Thanks a lot for the help,
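One blunt mitigation for checkpoint data piling up is to delete the superseded checkpoint directory yourself once a newer checkpoint is safely materialized. This is a manual workaround sketched under assumptions about the job, not something Spark does automatically in these versions; it does not address local shuffle/cache files, which are only removed once their RDDs are garbage-collected on the driver:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Checkpoint the current RDD, then delete the previous iteration's
// checkpoint directory so old checkpoint data does not accumulate on HDFS.
def checkpointAndClean[T](sc: SparkContext,
                          rdd: RDD[T],
                          previous: Option[String]): Option[String] = {
  rdd.checkpoint()
  rdd.count() // force the checkpoint to be written

  val fs = FileSystem.get(sc.hadoopConfiguration)
  previous.foreach(dir => fs.delete(new Path(dir), true)) // recursive delete of old data

  rdd.getCheckpointFile // remember this path so it can be deleted on the next round
}
```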

Re: MLlib Prefixspan implementation

2015-08-26 Thread alexis GILLAIN
Feynman Liang <fli...@databricks.com>:

CCing the mailing list again. It's currently not on the radar. Do you have a use case for it? I can bring it up during 1.6 roadmap planning tomorrow.

On Mon, Aug 24, 2015 at 8:28 PM, alexis GILLAIN <ila...@hotmail.com> wrote:

Hi, I just realized

Re: Memory-efficient successive calls to repartition()

2015-08-24 Thread Alexis Gillain
-- Alexis GILLAIN

Re: Memory-efficient successive calls to repartition()

2015-08-24 Thread alexis GILLAIN
Hi Aurelien,

The first code should create a new RDD in memory at each iteration (check the webui). The second code will unpersist the RDD, but that's not the main problem. I think you are having trouble due to the long lineage, as .cache() keeps track of the lineage for recovery. You should have a look at
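The truncated advice presumably points at checkpointing. A minimal sketch of periodically checkpointing inside an iterative loop so the lineage stays bounded; the checkpoint directory, the interval of 10, and the `step` transformation are assumptions:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Periodically checkpoint inside an iterative loop so the lineage (and the
// serialized task size) does not grow without bound.
def iterateWithCheckpoint(sc: SparkContext,
                          initial: RDD[Double],
                          step: RDD[Double] => RDD[Double],
                          iterations: Int): RDD[Double] = {
  sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints") // assumed path
  var current = initial
  for (i <- 1 to iterations) {
    current = step(current).cache()
    if (i % 10 == 0) {
      current.checkpoint() // truncate the lineage every 10 iterations
      current.count()      // materialize so the checkpoint is actually written
    }
  }
  current
}
```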

MLlib Prefixspan implementation

2015-08-20 Thread alexis GILLAIN
I want to use PrefixSpan, so I had a look at the code and the cited paper: "Distributed PrefixSpan Algorithm Based on MapReduce". There is a result in the paper I didn't really understand, and I couldn't find where it is used in the code. Suppose a sequence database S = {1, 2, ..., n}; a sequence
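For context, the implementation under discussion is exposed as org.apache.spark.mllib.fpm.PrefixSpan (available since Spark 1.5). A minimal usage sketch with a toy sequence database:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.fpm.PrefixSpan

object PrefixSpanExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("prefixspan-example"))

    // A sequence is an Array of itemsets; each itemset is an Array of items.
    val sequences = sc.parallelize(Seq(
      Array(Array(1, 2), Array(3)),
      Array(Array(1), Array(3, 2), Array(1, 2)),
      Array(Array(1, 2), Array(5)),
      Array(Array(6))
    ), 2).cache()

    val model = new PrefixSpan()
      .setMinSupport(0.5)      // keep patterns occurring in at least 50% of sequences
      .setMaxPatternLength(5)
      .run(sequences)

    model.freqSequences.collect().foreach { fs =>
      println(fs.sequence.map(_.mkString("[", ",", "]")).mkString("<", "", ">") + ", " + fs.freq)
    }
    sc.stop()
  }
}
```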

Re: serialization stackoverflow error during reduce on nested objects

2015-03-14 Thread alexis GILLAIN
I haven't registered my class with Kryo, but I don't think it would have such an impact on the stack size. I'm thinking of using GraphX, and I'm wondering how it serializes the graph object, as it can use Kryo as the serializer.

2015-03-14 6:22 GMT+01:00 Ted Yu <yuzhih...@gmail.com>:

Have you registered
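For reference, Kryo registration is done on the SparkConf. A minimal sketch; the case classes are placeholders for whatever nested objects are being reduced, and registration mainly shrinks the serialized size rather than the recursion depth that produces a StackOverflowError on deeply nested structures:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical classes standing in for the nested objects being reduced.
case class Node(id: Long, children: Seq[Node])
case class Payload(root: Node)

val conf = new SparkConf()
  .setAppName("kryo-registration")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[Node], classOf[Payload]))

val sc = new SparkContext(conf)
```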