Spark.Executor.Cores question

2015-10-23 Thread mkhaitman
Regarding the 'spark.executor.cores' config option in a Standalone Spark environment, I'm curious about whether there's a way to enforce the following logic: - Max cores per executor = 4 - Max executors PER application PER worker = 1 In order to force better balance across all workers, I

RE: Dataframe nested schema inference from Json without type conflicts

2015-10-23 Thread Ewan Leith
Hi all, It’s taken us a while, but one of my colleagues has made the pull request on GitHub for our proposed solution to this: https://issues.apache.org/jira/browse/SPARK-10947 https://github.com/apache/spark/pull/9249 It adds a parameter to the JSON read options to force all primitives as a

Re: Re: repartitionAndSortWithinPartitions task shuffle phase is very slow

2015-10-23 Thread 周千昊
We have not tried that yet; however, both the MR and Spark implementations were tested with the same number of partitions on the same cluster. 250635...@qq.com <250635...@qq.com> wrote on Fri, Oct 23, 2015 at 5:21 PM: > Hi, > > Not an expert on this kind of implementation. But referring to the > performance result, > >

slightly more informative error message in MLUtils.loadLibSVMFile

2015-10-23 Thread Robert Dodier
Hi, MLUtils.loadLibSVMFile verifies that indices are 1-based and increasing, and otherwise triggers an error. I'd like to suggest that the error message be a little more informative. I ran into this when loading a malformed file. Exactly what gets printed isn't too crucial, maybe you would want
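
The check described above can be illustrated with a minimal sketch in plain Python. This is not Spark's code: the function name `parse_libsvm_line`, the `line_number` parameter, and the wording of the error messages are all hypothetical, chosen only to show what a more informative message might contain (the offending index, the previous index, and the input line).

```python
def parse_libsvm_line(line, line_number=None):
    """Parse one LIBSVM-format line: '<label> <index>:<value> ...'.

    Hypothetical helper, not part of Spark. It requires indices to be
    1-based and strictly increasing, and reports the offending index
    and the original line when they are not.
    """
    tokens = line.strip().split()
    if not tokens:
        raise ValueError("empty line")
    # Optional location prefix for the error message.
    where = f"line {line_number}: " if line_number is not None else ""
    label = float(tokens[0])
    indices = []
    values = []
    for token in tokens[1:]:
        index_text, value_text = token.split(":", 1)
        index = int(index_text)
        if index < 1:
            raise ValueError(
                f"{where}indices must be one-based, found index {index} "
                f"in {line.strip()!r}"
            )
        if indices and index <= indices[-1]:
            raise ValueError(
                f"{where}indices must be in ascending order, found index "
                f"{index} after {indices[-1]} in {line.strip()!r}"
            )
        indices.append(index)
        values.append(float(value_text))
    return label, indices, values
```

Calling `parse_libsvm_line("0 2:1.0 2:3.0", line_number=7)` raises a `ValueError` that names line 7, the repeated index 2, and the raw input, which is usually enough to locate the bad record in a malformed file.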

Re: repartitionAndSortWithinPartitions task shuffle phase is very slow

2015-10-23 Thread Li Yang
Any advice on how to tune the repartitionAndSortWithinPartitions stage? Any particular metrics or parameters to look into? Basically Spark and MR shuffle the same amount of data, because we more or less copied the MR implementation into Spark. Let us know if more info is needed. On Fri, Oct 23, 2015 at 10:24