Is the file Parquet on S3, or is it some other file format? In general I would assume that HDFS reads/writes are more performant for Spark jobs.

For instance, consider how well partitioned your HDFS file is vs. the S3 file.
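A quick way to check is to compare the partitioning of the two reads in spark-shell. A rough sketch (the paths are made up, and the s3a read assumes the hadoop-aws connector is configured):

    // Compare how many input partitions each source yields before
    // blaming the ML stage itself. Paths are placeholders.
    val hdfsDf = spark.read.parquet("hdfs:///data/input.parquet")
    val s3Df = spark.read.parquet("s3a://my-bucket/data/input.parquet")

    println(s"HDFS partitions: ${hdfsDf.rdd.getNumPartitions}")
    println(s"S3 partitions: ${s3Df.rdd.getNumPartitions}")

If the partition counts differ a lot, the slower run may simply be doing more (or more skewed) work per task.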
On Wed, May 27, 2020 at 1:51 PM Dark Crusader <relinquisheddra...@gmail.com> wrote:

> Hi Jörn,
>
> Thanks for the reply. I will try to create an easier example to reproduce
> the issue.
>
> I will also try your suggestion to look into the UI. Can you guide me on
> what I should be looking for?
>
> I was already using the s3a protocol to compare the times.
>
> My hunch is that multiple reads from S3 are required because of improper
> caching of intermediate data, and maybe HDFS is doing a better job at
> this. Does this make sense?
>
> I would also like to add that we built an extra layer on top of S3, which
> might be adding to the even slower times.
>
> Thanks for your help.
>
> On Wed, 27 May, 2020, 11:03 pm Jörn Franke, <jornfra...@gmail.com> wrote:
>
>> Have you looked in the Spark UI to see why this is the case?
>> Reading from S3 can take more time - it also depends on which S3 URL you
>> are using: s3a vs s3n vs s3.
>>
>> It could help to persist in memory or on HDFS after some calculation. You
>> can also initially load from S3, store on HDFS, and work from there.
>>
>> HDFS offers data locality for the tasks, i.e. the tasks start on the
>> nodes where the data is. Depending on which S3 "protocol" you are using,
>> you might also take a bigger performance hit.
>>
>> Try s3a as the protocol (replace all s3n with s3a).
>>
>> You can also use the s3 URL, but this requires a special bucket
>> configuration and a dedicated empty bucket, and it lacks some
>> interoperability with other AWS services.
>>
>> Nevertheless, it could also be something else in the code. Can you post
>> an example reproducing the issue?
>>
>> > On 27.05.2020 at 18:18, Dark Crusader <relinquisheddra...@gmail.com> wrote:
>> >
>> > Hi all,
>> >
>> > I am reading data from HDFS in the form of Parquet files (around 3 GB)
>> > and running an algorithm from the Spark ML library.
>> >
>> > If I create the same Spark dataframe by reading the data from S3, the
>> > same algorithm takes considerably more time.
>> >
>> > I don't understand why this is happening. Is this a chance occurrence,
>> > or are the Spark dataframes created differently?
>> >
>> > I don't understand how the data store would affect the algorithm's
>> > performance.
>> >
>> > Any help would be appreciated. Thanks a lot.
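If you do end up trying Jörn's suggestion of staging the data onto HDFS and caching it, a rough sketch of that approach (the bucket and paths are placeholders, and it assumes the hadoop-aws s3a connector is on the classpath):

    import org.apache.spark.storage.StorageLevel

    // One-time copy: read from S3 via s3a and write to HDFS.
    spark.read.parquet("s3a://my-bucket/data/input.parquet")
      .write.mode("overwrite").parquet("hdfs:///staging/input.parquet")

    // Work from the HDFS copy and cache it, so an iterative ML algorithm
    // doesn't go back to the source on every pass.
    val df = spark.read.parquet("hdfs:///staging/input.parquet")
      .persist(StorageLevel.MEMORY_AND_DISK)
    df.count() // materialize the cache before running the ML stage

--
I appreciate your time,

~Randy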