Yes, it's surprising to me as well.

I tried executing it with different configurations:

sudo -u hdfs spark-submit  --master yarn-client --class
com.mycompany.app.App --num-executors 40 --executor-memory 4g
Example-1.0-SNAPSHOT.jar hdfs://ip:8020/tmp/sparkTest/ file22.bin
parameters
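One thing I notice is that I never set --executor-cores, so each executor defaults to a single core. A variant I could try next (same class, jar and paths as above; the 10/4/8g numbers are just a guess for comparing fewer, bigger JVMs against many small ones, not a recommendation):

```shell
# Same job, but fewer executors with more cores each:
# 10 executors x 4 cores = 40 concurrent tasks, i.e. similar parallelism
# to 40 single-core executors, but far fewer JVM heaps for GC to manage.
sudo -u hdfs spark-submit \
  --master yarn-client \
  --class com.mycompany.app.App \
  --num-executors 10 \
  --executor-cores 4 \
  --executor-memory 8g \
  Example-1.0-SNAPSHOT.jar hdfs://ip:8020/tmp/sparkTest/ file22.bin parameters
```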

The command above is what I executed, varying the values of
num-executors and executor-memory.
Do you think there are too many executors for those HDDs? Could that
be the reason each executor takes more time?
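For context, my parse is just fixed-offset field extraction, roughly like the sketch below (a minimal standalone version, not the real Conversor; the offsets are the ones I describe in my earlier message further down, and every call allocates several short-lived Strings, which is my main GC suspect):

```java
import java.nio.charset.StandardCharsets;

// Minimal sketch of a fixed-offset record parser like Conversor.parse:
// bytes [0..3] are an int, [4..20] a String, [21..27] another String.
// Each call creates several short-lived String objects, which is where
// the GC pressure presumably comes from at high task concurrency.
public class FixedOffsetParser {

    static int readInt(byte[] b, int off) {
        // Big-endian 4-byte integer.
        return ((b[off] & 0xFF) << 24)
             | ((b[off + 1] & 0xFF) << 16)
             | ((b[off + 2] & 0xFF) << 8)
             |  (b[off + 3] & 0xFF);
    }

    static String readString(byte[] b, int from, int toInclusive) {
        // Decode the fixed-width field and strip trailing padding/NULs.
        return new String(b, from, toInclusive - from + 1,
                StandardCharsets.UTF_8).trim();
    }

    public static String parse(byte[] record) {
        int id = readInt(record, 0);
        String f1 = readString(record, 4, 20);
        String f2 = readString(record, 21, 27);
        return id + "," + f1 + "," + f2;
    }

    public static void main(String[] args) {
        byte[] rec = new byte[28];
        rec[3] = 42;                                     // int field = 42
        byte[] f1 = "hello".getBytes(StandardCharsets.UTF_8);
        System.arraycopy(f1, 0, rec, 4, f1.length);      // first String field
        byte[] f2 = "world".getBytes(StandardCharsets.UTF_8);
        System.arraycopy(f2, 0, rec, 21, f2.length);     // second String field
        System.out.println(parse(rec));                  // prints 42,hello,world
    }
}
```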

2015-02-06 9:36 GMT+01:00 Sandy Ryza <sandy.r...@cloudera.com>:
> That's definitely surprising to me that you would be hitting a lot of GC for
> this scenario.  Are you setting --executor-cores and --executor-memory?
> What are you setting them to?
>
> -Sandy
>
> On Thu, Feb 5, 2015 at 10:17 AM, Guillermo Ortiz <konstt2...@gmail.com>
> wrote:
>>
>> Any idea why, when I use more containers, I get a lot of tasks stalled by GC?
>>
>> 2015-02-05 8:59 GMT+01:00 Guillermo Ortiz <konstt2...@gmail.com>:
>> > I'm not caching the data. By "each iteration" I mean each 128 MB
>> > block that an executor has to process.
>> >
>> > The code is pretty simple.
>> >
>> > final Conversor c = new Conversor(null, null, null,
>> > longFields, typeFields);
>> > SparkConf conf = new SparkConf().setAppName("Simple Application");
>> > JavaSparkContext sc = new JavaSparkContext(conf);
>> > JavaRDD<byte[]> rdd = sc.binaryRecords(path, c.calculaLongBlock());
>> >
>> > JavaRDD<String> rddString = rdd.map(new Function<byte[], String>() {
>> >     @Override
>> >     public String call(byte[] arg0) throws Exception {
>> >         return c.parse(arg0).toString();
>> >     }
>> > });
>> > rddString.saveAsTextFile(url + "/output/" + System.currentTimeMillis() + "/");
>> >
>> > The parse function just takes an array of bytes and applies some
>> > transformations: bytes [0..3] become an integer, [4..20] a String,
>> > [21..27] another String, and so on.
>> >
>> > It's just test code; I'd like to understand what's happening.
>> >
>> > 2015-02-04 18:57 GMT+01:00 Sandy Ryza <sandy.r...@cloudera.com>:
>> >> Hi Guillermo,
>> >>
>> >> What exactly do you mean by "each iteration"?  Are you caching data in
>> >> memory?
>> >>
>> >> -Sandy
>> >>
>> >> On Wed, Feb 4, 2015 at 5:02 AM, Guillermo Ortiz <konstt2...@gmail.com>
>> >> wrote:
>> >>>
>> >>> I execute a job in Spark where I'm processing an 80 GB file in HDFS.
>> >>> I have 5 slaves:
>> >>> (32 cores / 256 GB / 7 physical disks) x 5
>> >>>
>> >>> I have been trying many different configurations with YARN:
>> >>> yarn.nodemanager.resource.memory-mb 196 GB
>> >>> yarn.nodemanager.resource.cpu-vcores 24
>> >>>
>> >>> I have tried to execute the job with different numbers of executors
>> >>> and different amounts of executor memory (1-4 GB).
>> >>> With 20 executors, each iteration (128 MB) takes 25s, and tasks
>> >>> never spend a really long time waiting on GC.
>> >>>
>> >>> When I execute around 60 executors, the processing time is about 45s,
>> >>> and some tasks take up to one minute because of GC.
>> >>>
>> >>> I have no idea why GC hits so hard when I execute more executors
>> >>> simultaneously.
>> >>> The other question is why it takes more time to execute each
>> >>> block. My theory is that it's because there are only 7 physical
>> >>> disks per node, and 20 processes writing at once is not the same as 5.
>> >>>
>> >>> The code is pretty simple: it's just a map function which parses a
>> >>> line and writes the output to HDFS. A lot of substrings are created
>> >>> inside the function, which could cause GC.
>> >>>
>> >>> Any theories?
>> >>>
>> >>> ---------------------------------------------------------------------
>> >>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> >>> For additional commands, e-mail: user-h...@spark.apache.org
>> >>>
>> >>
>
>
