HDFS is simply a better place for performant reads, and on top of that the
data is closer to your Spark job. The Databricks link from above shows
this: they found a 6x read throughput difference between the two.

If your HDFS is part of the same Spark cluster, then it should be an
incredibly fast read compared to reaching out to S3 for the data.

They are different types of storage solving different problems.

Something I have seen in workflows, which other people have also suggested
above, is a stage where you load data from S3 into HDFS, then move on to
your other work with it, and maybe finally persist the results outside of
HDFS.
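
As a rough sketch of that staging pattern (PySpark here; the bucket, paths,
and column name are made up for illustration, so adjust to your setup):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("s3-to-hdfs-staging").getOrCreate()

    # Stage the raw data from S3 into HDFS once
    raw = spark.read.parquet("s3a://my-bucket/input/")
    raw.write.mode("overwrite").parquet("hdfs:///staging/input/")

    # Do the heavy work against the HDFS copy (data-local reads)
    df = spark.read.parquet("hdfs:///staging/input/")
    result = df.groupBy("some_column").count()  # placeholder for your real work

    # Finally persist the results outside of HDFS, e.g. back to S3
    result.write.mode("overwrite").parquet("s3a://my-bucket/output/")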

On Fri, May 29, 2020 at 2:09 PM Bin Fan <fanbin...@gmail.com> wrote:

> Try deploying Alluxio as a caching layer on top of S3, giving Spark an
> HDFS-like interface?
> Like in this article:
>
> https://www.alluxio.io/blog/accelerate-spark-and-hive-jobs-on-aws-s3-by-10x-with-alluxio-tiered-storage/
>
>
> On Wed, May 27, 2020 at 6:52 PM Dark Crusader <
> relinquisheddra...@gmail.com> wrote:
>
>> Hi Randy,
>>
>> Yes, I'm using parquet on both S3 and hdfs.
>>
>> On Thu, 28 May, 2020, 2:38 am randy clinton, <randyclin...@gmail.com>
>> wrote:
>>
>>> Is the file Parquet on S3 or is it some other file format?
>>>
>>> In general I would assume that HDFS read/writes are more performant for
>>> spark jobs.
>>>
>>> For instance, consider how well partitioned your HDFS file is vs the S3
>>> file.
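>>>
>>> One quick way to sanity-check that (just a sketch, with placeholder
>>> paths) is to compare the partition counts Spark ends up with for each
>>> source:
>>>
>>>     from pyspark.sql import SparkSession
>>>
>>>     spark = SparkSession.builder.getOrCreate()
>>>     df_hdfs = spark.read.parquet("hdfs:///path/to/table")
>>>     df_s3 = spark.read.parquet("s3a://bucket/path/to/table")
>>>     # Fewer, well-sized partitions usually scan faster
>>>     print(df_hdfs.rdd.getNumPartitions(), df_s3.rdd.getNumPartitions())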
>>>
>>> On Wed, May 27, 2020 at 1:51 PM Dark Crusader <
>>> relinquisheddra...@gmail.com> wrote:
>>>
>>>> Hi Jörn,
>>>>
>>>> Thanks for the reply. I will try to create an easier example to
>>>> reproduce the issue.
>>>>
>>>> I will also try your suggestion to look into the UI. Can you guide me on
>>>> what I should be looking for?
>>>>
>>>> I was already using the s3a protocol to compare the times.
>>>>
>>>> My hunch is that multiple reads from S3 are required because of
>>>> improper caching of intermediate data, and maybe HDFS is doing a better
>>>> job at this. Does this make sense?
>>>>
>>>> I would also like to add that we built an extra layer on top of S3,
>>>> which might be contributing to the even slower times.
>>>>
>>>> Thanks for your help.
>>>>
>>>> On Wed, 27 May, 2020, 11:03 pm Jörn Franke, <jornfra...@gmail.com>
>>>> wrote:
>>>>
>>>>> Have you looked in the Spark UI to see why this is the case?
>>>>> Reading from S3 can take more time - it also depends on which S3 URL
>>>>> scheme you are using: s3a vs s3n vs s3.
>>>>>
>>>>> It could help to persist in-memory or on HDFS after some calculation.
>>>>> You can also initially load from S3, store on HDFS, and work from
>>>>> there.
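>>>>>
>>>>> For example (only a sketch; the paths and the filter column are
>>>>> placeholders):
>>>>>
>>>>>     from pyspark import StorageLevel
>>>>>     from pyspark.sql import SparkSession
>>>>>
>>>>>     spark = SparkSession.builder.getOrCreate()
>>>>>     df = spark.read.parquet("s3a://bucket/input/")
>>>>>
>>>>>     # Keep the intermediate result in memory, spilling to disk if needed
>>>>>     computed = df.filter("value > 0")
>>>>>     computed.persist(StorageLevel.MEMORY_AND_DISK)
>>>>>
>>>>>     # ... or materialize it on HDFS and continue from there
>>>>>     computed.write.mode("overwrite").parquet("hdfs:///tmp/computed/")
>>>>>     computed = spark.read.parquet("hdfs:///tmp/computed/")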
>>>>>
>>>>> HDFS offers data locality for the tasks, i.e. the tasks start on the
>>>>> nodes where the data is. Depending on which S3 "protocol" you are
>>>>> using, you might also pay a bigger performance penalty.
>>>>>
>>>>> Try s3a as a protocol (replace all s3n with s3a).
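>>>>>
>>>>> E.g., the only change on the read side is the URL scheme (bucket and
>>>>> path are made up here):
>>>>>
>>>>>     from pyspark.sql import SparkSession
>>>>>
>>>>>     spark = SparkSession.builder.getOrCreate()
>>>>>     # old: df = spark.read.parquet("s3n://bucket/path/")
>>>>>     # s3a comes from the hadoop-aws module, which must be on the classpath
>>>>>     df = spark.read.parquet("s3a://bucket/path/")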
>>>>>
>>>>> You can also use the s3 URL, but this requires a special bucket
>>>>> configuration (a dedicated empty bucket), and it lacks some
>>>>> interoperability with other AWS services.
>>>>>
>>>>> Nevertheless, it could also be something else in the code. Can you
>>>>> post an example reproducing the issue?
>>>>>
>>>>> > On 27.05.2020 at 18:18, Dark Crusader <
>>>>> relinquisheddra...@gmail.com> wrote:
>>>>> >
>>>>> > 
>>>>> > Hi all,
>>>>> >
>>>>> > I am reading data from HDFS in the form of Parquet files (around 3
>>>>> GB) and running an algorithm from the Spark ML library.
>>>>> >
>>>>> > If I create the same Spark dataframe by reading the data from S3, the
>>>>> same algorithm takes considerably more time.
>>>>> >
>>>>> > I don't understand why this is happening. Is this a chance
>>>>> occurrence, or are the Spark dataframes created differently?
>>>>> >
>>>>> > I don't understand how the data store would affect the algorithm's
>>>>> performance.
>>>>> >
>>>>> > Any help would be appreciated. Thanks a lot.
>>>>>
>>>>
>>>
>>> --
>>> I appreciate your time,
>>>
>>> ~Randy
>>>
>>

-- 
I appreciate your time,

~Randy
