Hi,

I have an application which uses 3 parquet files , 2 of which are large and
another one is small. These files are in hdfs and are partitioned by column
"col1". Now I create  3 data-frames one for each parquet file but I pass
col1 value so that it reads the relevant data. I always read from the hdfs
as files are too big and almost all queries uses the col1 and thus read
very small data around 10-20 MB.

Then I join these data-frames. Now the data-frames from large tables are
joined using sort-merged and then join with 3rd one using broadcast hash
join. I have 3 executors and each with 10 cores. Only application running
on cluster is mine. The overall performance is quite good. But there is
large scheduler delay especially when it reads the smaller data-frame from
hdfs and in final step when it is using the broadcast hash join.

Usually compute time is just 50% of the scheduler delay time. The size of
3rd smaller dataframe is less than 1 MB. I am not sure why there is so much
scheduler delay.

Is there a diagnostic tools which can be used to check why there is
scheduler delay or point to possible causes of scheduler delay?


Thanks

Reply via email to