See if this helps: "That is to say, on a per node basis, HDFS can yield 6X higher read throughput than S3. Thus, given that the S3 is 10x cheaper than HDFS, we find that S3 is almost 2x better compared to HDFS on performance per dollar."

https://databricks.com/blog/2017/05/31/top-5-reasons-for-choosing-s3-over-hdfs.html
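For what it's worth, the "almost 2x" is just those two ratios combined; a rough back-of-the-envelope check (using the blog's numbers, nothing measured here):

    # HDFS reads ~6x faster per node, but S3 storage is ~10x cheaper.
    hdfs_read_advantage = 6.0
    s3_cost_advantage = 10.0
    print(s3_cost_advantage / hdfs_read_advantage)  # ~1.67, i.e. "almost 2x" per dollar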
On Wed, May 27, 2020, 9:51 PM Dark Crusader <relinquisheddra...@gmail.com> wrote:

> Hi Randy,
>
> Yes, I'm using parquet on both S3 and HDFS.
>
> On Thu, 28 May, 2020, 2:38 am randy clinton, <randyclin...@gmail.com> wrote:
>
>> Is the file Parquet on S3 or is it some other file format?
>>
>> In general I would assume that HDFS reads/writes are more performant for
>> Spark jobs.
>>
>> For instance, consider how well partitioned your HDFS file is vs the S3
>> file.
>>
>> On Wed, May 27, 2020 at 1:51 PM Dark Crusader <relinquisheddra...@gmail.com> wrote:
>>
>>> Hi Jörn,
>>>
>>> Thanks for the reply. I will try to create an easier example to
>>> reproduce the issue.
>>>
>>> I will also try your suggestion to look into the UI. Can you guide me on
>>> what I should be looking for?
>>>
>>> I was already using the s3a protocol to compare the times.
>>>
>>> My hunch is that multiple reads from S3 are required because of improper
>>> caching of intermediate data, and maybe HDFS is doing a better job at
>>> this. Does this make sense?
>>>
>>> I would also like to add that we built an extra layer on top of S3,
>>> which might be making the times even slower.
>>>
>>> Thanks for your help.
>>>
>>> On Wed, 27 May, 2020, 11:03 pm Jörn Franke, <jornfra...@gmail.com> wrote:
>>>
>>>> Have you looked in the Spark UI to see why this is the case?
>>>> S3 reading can take more time - it also depends on which S3 URL you
>>>> are using: s3a vs s3n vs s3.
>>>>
>>>> It could help to persist in-memory or on HDFS after some calculation.
>>>> You can also initially load from S3, store on HDFS, and work from there.
>>>>
>>>> HDFS offers data locality for the tasks, i.e. the tasks start on the
>>>> nodes where the data is. Depending on which S3 "protocol" you are
>>>> using, you might also be penalized more on performance.
>>>>
>>>> Try s3a as the protocol (replace all s3n with s3a).
>>>>
>>>> You can also use the s3 URL, but this requires a special bucket
>>>> configuration and a dedicated empty bucket, and it lacks some
>>>> interoperability with other AWS services.
>>>>
>>>> Nevertheless, it could also be something else in the code. Can you
>>>> post an example reproducing the issue?
>>>>
>>>> > On 27.05.2020, at 18:18, Dark Crusader <relinquisheddra...@gmail.com> wrote:
>>>> >
>>>> > Hi all,
>>>> >
>>>> > I am reading data from HDFS in the form of parquet files (around 3
>>>> > GB) and running an algorithm from the Spark ML library.
>>>> >
>>>> > If I create the same Spark dataframe by reading data from S3, the
>>>> > same algorithm takes considerably more time.
>>>> >
>>>> > I don't understand why this is happening. Is this a chance occurrence,
>>>> > or are the Spark dataframes created differently?
>>>> >
>>>> > I don't understand how the data store would affect the algorithm's
>>>> > performance.
>>>> >
>>>> > Any help would be appreciated. Thanks a lot.
>>
>> --
>> I appreciate your time,
>>
>> ~Randy
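A minimal PySpark sketch of the approach Jörn suggests above (read via s3a, then either persist the DataFrame or copy it to HDFS once and work from that copy); the bucket name and HDFS path below are hypothetical placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("s3-vs-hdfs-check").getOrCreate()

    # Read the parquet data through the s3a connector (placeholder bucket/path).
    df = spark.read.parquet("s3a://my-bucket/input-data/")

    # Option 1: persist so the ML algorithm's repeated passes over the data
    # do not trigger repeated S3 reads (DataFrames default to MEMORY_AND_DISK).
    df = df.persist()
    df.count()  # materialize the cache before running the algorithm

    # Option 2: copy to HDFS once and run the job against that copy,
    # which also gives the tasks data locality.
    df.write.mode("overwrite").parquet("hdfs:///tmp/input-data/")
    df_local = spark.read.parquet("hdfs:///tmp/input-data/")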