See if this helps

"That is to say, on a per node basis, HDFS can yield 6X higher read
throughput than S3. Thus, *given that the S3 is 10x cheaper than HDFS, we
find that S3 is almost 2x better compared to HDFS on performance per
dollar."*

https://databricks.com/blog/2017/05/31/top-5-reasons-for-choosing-s3-over-hdfs.html
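
(That "almost 2x" follows from the two numbers: S3 gives roughly 1/6 the
per-node throughput at roughly 1/10 the cost, so throughput per dollar works
out to about 10/6 ≈ 1.7x in S3's favour.)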


On Wed, May 27, 2020, 9:51 PM Dark Crusader <relinquisheddra...@gmail.com>
wrote:

> Hi Randy,
>
> Yes, I'm using parquet on both S3 and HDFS.
>
> On Thu, 28 May, 2020, 2:38 am randy clinton, <randyclin...@gmail.com>
> wrote:
>
>> Is the file Parquet on S3 or is it some other file format?
>>
>> In general I would assume that HDFS read/writes are more performant for
>> spark jobs.
>>
>> For instance, consider how well partitioned your HDFS file is vs the S3
>> file.
>>
>> On Wed, May 27, 2020 at 1:51 PM Dark Crusader <
>> relinquisheddra...@gmail.com> wrote:
>>
>>> Hi Jörn,
>>>
>>> Thanks for the reply. I will try to create an easier example to reproduce
>>> the issue.
>>>
>>> I will also try your suggestion to look into the UI. Can you guide me on
>>> what I should be looking for?
>>>
>>> I was already using the s3a protocol to compare the times.
>>>
>>> My hunch is that multiple reads from S3 are required because intermediate
>>> data is not being cached properly, and maybe HDFS is doing a better job at
>>> this. Does this make sense?
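>>>
>>> The kind of thing I was planning to try is roughly this (just a sketch,
>>> the path and the prep step are placeholders for our real pipeline):
>>>
>>>   import org.apache.spark.storage.StorageLevel
>>>
>>>   val df = spark.read.parquet("s3a://some-bucket/input/")  // illustrative path
>>>   val prepped = df.repartition(200)                        // stand-in for our real prep step
>>>   prepped.persist(StorageLevel.MEMORY_AND_DISK)            // keep later stages off S3
>>>   prepped.count()                                          // force the cache to materialise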
>>>
>>> I would also like to add that we built an extra layer on top of S3, which
>>> might be contributing to the slower times.
>>>
>>> Thanks for your help.
>>>
>>> On Wed, 27 May, 2020, 11:03 pm Jörn Franke, <jornfra...@gmail.com>
>>> wrote:
>>>
>>>> Have you looked in the Spark UI to see why this is the case?
>>>> Reading from S3 can take more time - it also depends on which S3 URL
>>>> scheme you are using: s3a vs s3n vs s3.
>>>>
>>>> It could help to persist in-memory or on HDFS after some calculation.
>>>> You can also initially load from S3, store it on HDFS, and work from there.
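>>>>
>>>> Roughly something like this (paths are only examples):
>>>>
>>>>   // one-off copy from S3 to HDFS, then run everything against HDFS
>>>>   val raw = spark.read.parquet("s3a://some-bucket/input/")
>>>>   raw.write.mode("overwrite").parquet("hdfs:///staging/input/")
>>>>
>>>>   val df = spark.read.parquet("hdfs:///staging/input/")
>>>>   // ... continue the job on df ...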
>>>>
>>>> HDFS offers data locality for the tasks, i.e. the tasks start on the
>>>> nodes where the data is. Depending on which S3 "protocol" you are using,
>>>> you might also pay a bigger performance penalty.
>>>>
>>>> Try s3a as a protocol (replace all s3n with s3a).
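>>>>
>>>> For example (bucket and path are placeholders):
>>>>
>>>>   spark.read.parquet("s3a://your-bucket/path/")  // instead of "s3n://your-bucket/path/"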
>>>>
>>>> You can also use the s3 URL, but this requires a special bucket
>>>> configuration (a dedicated empty bucket), and it lacks some interoperability
>>>> with other AWS services.
>>>>
>>>> Nevertheless, it could also be something else in the code. Can you
>>>> post an example reproducing the issue?
>>>>
>>>> > Am 27.05.2020 um 18:18 schrieb Dark Crusader <
>>>> relinquisheddra...@gmail.com>:
>>>> >
>>>> > 
>>>> > Hi all,
>>>> >
>>>> > I am reading data from HDFS in the form of parquet files (around 3
>>>> GB) and running an algorithm from the Spark ML library.
>>>> >
>>>> > If I create the same spark dataframe by reading data from S3, the
>>>> same algorithm takes considerably more time.
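>>>> >
>>>> > Simplified, it looks roughly like this (KMeans and the paths are just
>>>> > placeholders, the real algorithm and schema are different):
>>>> >
>>>> >   import org.apache.spark.ml.clustering.KMeans
>>>> >
>>>> >   // assumes the parquet data already has a "features" vector column
>>>> >   val dfHdfs = spark.read.parquet("hdfs:///data/input/")
>>>> >   val dfS3   = spark.read.parquet("s3a://some-bucket/input/")
>>>> >   val model  = new KMeans().setK(10).fit(dfHdfs)  // same call is much slower on dfS3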
>>>> >
>>>> > I don't understand why this is happening. Is this a chance occurrence,
>>>> or are the Spark dataframes created differently?
>>>> >
>>>> > I don't understand how the data store would affect the algorithm
>>>> performance.
>>>> >
>>>> > Any help would be appreciated. Thanks a lot.
>>>>
>>>
>>
>> --
>> I appreciate your time,
>>
>> ~Randy
>>
>
