RE: defaultMinPartitions in textFile
Yes, Great. I thought it was math.max instead of math.min on that line. Thank you! From: Ye Xianjin [mailto:advance...@gmail.com] Sent: Tuesday, July 22, 2014 11:37 AM To: user@spark.apache.org Subject: Re: defaultMinPartitions in textFile well, I think you miss this line of code in SparkContext.scala line 1242-1243(master): /** Default min number of partitions for Hadoop RDDs when not given by user */ def defaultMinPartitions: Int = math.min(defaultParallelism, 2) so the defaultMinPartitions will be 2 unless the defaultParallelism is less than 2... -- Ye Xianjin Sent with Sparrow<http://www.sparrowmailapp.com/?sig> On Tuesday, July 22, 2014 at 10:18 AM, Wang, Jensen wrote: Hi, I started to use spark on yarn recently and found a problem while tuning my program. When SparkContext is initialized as sc and ready to read text file from hdfs, the textFile(path, defaultMinPartitions) method is called. I traced down the second parameter in the spark source code and finally found this: conf.getInt("spark.default.parallelism", math.max(totalCoreCount.get(), 2)) in CoarseGrainedSchedulerBackend.scala I do not specify the property “spark.default.parallelism” anywhere so the getInt will return value from the larger one between totalCoreCount and 2. When I submit the application using spark-submit and specify the parameter: --num-executors 2 --executor-cores 6, I suppose the totalCoreCount will be 2*6 = 12, so defaultMinPartitions will be 12. But when I print the value of defaultMinPartitions in my program, I still get 2 in return, How does this happen, or where do I make a mistake?
Re: defaultMinPartitions in textFile
well, I think you miss this line of code in SparkContext.scala line 1242-1243(master): /** Default min number of partitions for Hadoop RDDs when not given by user */ def defaultMinPartitions: Int = math.min(defaultParallelism, 2) so the defaultMinPartitions will be 2 unless the defaultParallelism is less than 2... -- Ye Xianjin Sent with Sparrow (http://www.sparrowmailapp.com/?sig) On Tuesday, July 22, 2014 at 10:18 AM, Wang, Jensen wrote: > Hi, > I started to use spark on yarn recently and found a problem while > tuning my program. > > When SparkContext is initialized as sc and ready to read text file from hdfs, > the textFile(path, defaultMinPartitions) method is called. > I traced down the second parameter in the spark source code and finally found > this: >conf.getInt("spark.default.parallelism", math.max(totalCoreCount.get(), > 2)) in CoarseGrainedSchedulerBackend.scala > > I do not specify the property “spark.default.parallelism” anywhere so the > getInt will return value from the larger one between totalCoreCount and 2. > > When I submit the application using spark-submit and specify the parameter: > --num-executors 2 --executor-cores 6, I suppose the totalCoreCount will be > > 2*6 = 12, so defaultMinPartitions will be 12. > > But when I print the value of defaultMinPartitions in my program, I still get > 2 in return, How does this happen, or where do I make a mistake? > > >
defaultMinPartitions in textFile
Hi, I started to use spark on yarn recently and found a problem while tuning my program. When SparkContext is initialized as sc and ready to read text file from hdfs, the textFile(path, defaultMinPartitions) method is called. I traced down the second parameter in the spark source code and finally found this: conf.getInt("spark.default.parallelism", math.max(totalCoreCount.get(), 2)) in CoarseGrainedSchedulerBackend.scala I do not specify the property "spark.default.parallelism" anywhere so the getInt will return value from the larger one between totalCoreCount and 2. When I submit the application using spark-submit and specify the parameter: --num-executors 2 --executor-cores 6, I suppose the totalCoreCount will be 2*6 = 12, so defaultMinPartitions will be 12. But when I print the value of defaultMinPartitions in my program, I still get 2 in return, How does this happen, or where do I make a mistake?