Hi,

A few thoughts to add to Nicholas' apt reply.

We were loading multiple files from AWS S3 in our Spark application. When
the Spark step that loads the files runs, the driver spends significant
time resolving the exact paths of the files in S3, especially because we
specified the S3 paths as glob patterns (e.g.
s3a://bucket-name/folder1/data1/2019-05*/* , which refers to all
sub-files/folders for the month of May 2019).
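
For reference, here is a minimal sketch of the kind of load we were doing
(the bucket/folder names are placeholders and Parquet is only an assumed
format, not necessarily what you are reading):

  import org.apache.spark.sql.SparkSession

  object GlobLoadExample {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder()
        .appName("glob-load-example")
        .getOrCreate()

      // The glob pattern covers every sub-file/folder under the daily
      // folders for May 2019. The driver must expand this pattern by
      // listing objects in S3 before any task can be scheduled.
      val df = spark.read
        .parquet("s3a://bucket-name/folder1/data1/2019-05*/*")

      df.printSchema()
      spark.stop()
    }
  }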

I was able to verify this at the time by running the "iftop" Linux
command, which showed a lot of network calls to *s3.amazonaws.com* servers.

This phenomenon occurs as soon as I define the load-files transformation,
even when no save/collect action has been called in my Spark pipeline.
The Spark UI does not show any stage as running either; only once all the
network calls to AWS S3 have completed does the Spark UI report that the
call to load files finished in 2 seconds.
My Spark job "seemed to be paused" for over half an hour, depending on the
number of files. I believe this happens in the underlying AWS SDK/Azure SDK
libraries that Spark uses: they need to resolve the exact file paths in the
object store before those files can be referenced in Spark.
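
A rough way to see this from the application side (again with placeholder
paths and an assumed Parquet input) is to time the read definition
separately from the first action:

  import org.apache.spark.sql.SparkSession

  object ListingTimeExample {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder()
        .appName("listing-time-example")
        .getOrCreate()

      // Defining the DataFrame already forces the driver to list the
      // matching objects (and read file footers for schema inference),
      // so this step can take a long time even though the Spark UI shows
      // no running stage yet.
      val t0 = System.nanoTime()
      val df = spark.read.parquet("s3a://bucket-name/folder1/data1/2019-05*/*")
      println(s"Defining the read took ${(System.nanoTime() - t0) / 1e9} s")

      // The action itself is what the Spark UI reports as stages/tasks.
      val t1 = System.nanoTime()
      println(s"Row count: ${df.count()}")
      println(s"Action took ${(System.nanoTime() - t1) / 1e9} s")

      spark.stop()
    }
  }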


Since you mention you are using Azure blob files, this should explain the
behaviour where everything seems to stop. You can reduce this time by
ensuring you have a small number of large files in your blob store to read
from, rather than a large number of small files.
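
If you control how the data is written, one way to get there is a one-off
compaction job along these lines (the wasbs:// paths and the target of 32
output files are purely illustrative):

  import org.apache.spark.sql.SparkSession

  object CompactSmallFiles {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder()
        .appName("compact-small-files")
        .getOrCreate()

      // Read the many small files once...
      val df = spark.read
        .parquet("wasbs://container@account.blob.core.windows.net/input/")

      // ...and rewrite them as a small number of larger files, so that
      // later jobs have far fewer paths to list.
      df.repartition(32)
        .write
        .mode("overwrite")
        .parquet("wasbs://container@account.blob.core.windows.net/input-compacted/")

      spark.stop()
    }
  }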

Akshay Bhardwaj
+91-97111-33849


On Thu, May 23, 2019 at 11:13 PM Nicholas Hakobian <
nicholas.hakob...@rallyhealth.com> wrote:

> One potential case that can cause this is the optimizer being a little
> overzealous with determining if a table can be broadcasted or not. Have you
> checked the UI or query plan to see if any steps include a
> BroadcastHashJoin? It's possible that the optimizer thinks that it should be
> able to fit the table in memory from looking at its size on disk, but it
> actually cannot fit in memory. In this case you might want to look at
> tuning the autoBroadcastJoinThreshold.
>
> Another potential case is that, at the step where it looks like the driver is
> "hanging", it is attempting to load in a data source that is backed by a very
> large number of files. Spark maintains a cache of file paths for a data
> source to determine task splits, and we've seen the driver appear to hang
> and/or crash if you try to load in thousands (or more) of tiny files per
> partition, and you have a large number of partitions.
>
> Hope this helps.
>
> Nicholas Szandor Hakobian, Ph.D.
> Principal Data Scientist
> Rally Health
> nicholas.hakob...@rallyhealth.com
>
>
> On Thu, May 23, 2019 at 7:36 AM Ashic Mahtab <as...@live.com> wrote:
>
>> Hi,
>> We have a quite long-winded Spark application that we inherited, with many
>> stages. When we run on our spark cluster, things start off well enough.
>> Workers are busy, lots of progress made, etc. etc. However, 30 minutes into
>> processing, we see CPU usage of the workers drop drastically. At this time,
>> we also see that the driver is maxing out exactly one core (though we've
>> given it more than one), and its ram usage is creeping up. At this time,
>> there are no logs coming out on the driver. Everything seems to stop, and
>> then it suddenly starts working, and the workers start working again. The
>> driver ram doesn't go down, but flatlines. A few minutes later, the same
>> thing happens again - the world seems to stop. However, the driver soon
>> crashes with an out of memory exception.
>>
>> What could be causing this sort of behaviour on the driver? We don't have
>> any collect() or similar functions in the code. We're reading in from Azure
>> blobs, processing, and writing back to Azure blobs. Where should we start
>> in trying to get to the bottom of this? We're running Spark 2.4.1 in a
>> stand-alone cluster.
>>
>> Thanks,
>> Ashic.
>>
>
