Re: Spark dataframe hdfs vs s3

2020-05-30 Thread Anwar AliKhan
Optimisation of Spark applications: Apache Spark is an in-memory data processing tool widely used in companies to deal with Big Data issues. Running a Spark application in production requires user-defined resources. This article presents several Spark
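[For context, a hedged sketch of the kind of user-defined resources the article alludes to; the app name and values below are illustrative, not recommendations:

    from pyspark.sql import SparkSession

    # Typical resource knobs set when running in production; tune per workload.
    spark = (SparkSession.builder
             .appName("example-app")                    # hypothetical name
             .config("spark.executor.memory", "4g")     # memory per executor
             .config("spark.executor.cores", "2")       # cores per executor
             .config("spark.executor.instances", "10")  # number of executors
             .getOrCreate())
]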

Re: Spark dataframe hdfs vs s3

2020-05-30 Thread Dark Crusader
Thanks all for the replies. I am switching to HDFS since it seems like an easier solution. To answer some of your questions: my HDFS space is on the same nodes I use for Spark computation. From what I understand, this helps because of the data locality advantage. Which means that there is less

Re: Spark dataframe hdfs vs s3

2020-05-29 Thread Jörn Franke
Maybe some AWS network-optimized instances with higher bandwidth will improve the situation. > On 27.05.2020 at 19:51, Dark Crusader wrote: > > Hi Jörn, > > Thanks for the reply. I will try to create an easier example to reproduce the > issue. > > I will also try your suggestion to look

Re: Spark dataframe hdfs vs s3

2020-05-29 Thread randy clinton
HDFS is simply a better place to make performant reads, and on top of that the data is closer to your Spark job. The Databricks link from above shows where they find a 6x read throughput difference between the two. If your HDFS is part of the same Spark cluster then it should be an inc

Re: Spark dataframe hdfs vs s3

2020-05-29 Thread Bin Fan
Try deploying Alluxio as a caching layer on top of S3, giving Spark a similar HDFS interface? Like in this article: https://www.alluxio.io/blog/accelerate-spark-and-hive-jobs-on-aws-s3-by-10x-with-alluxio-tiered-storage/ On Wed, May 27, 2020 at 6:52 PM Dark Crusader wrote: > Hi Randy, > > Ye
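[For illustration, a minimal sketch of what that looks like from the Spark side, assuming an Alluxio cluster is already deployed with the S3 bucket mounted; the master address and path are hypothetical:

    # Spark reads through Alluxio's HDFS-compatible interface; Alluxio serves
    # cached blocks locally and fetches misses from the underlying S3 mount.
    df = spark.read.parquet("alluxio://alluxio-master:19998/mnt/s3/data.parquet")
]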

Re: Spark dataframe hdfs vs s3

2020-05-28 Thread Kanwaljit Singh
You can’t play around much if it is a streaming job. But in the case of batch jobs, sometimes teams will copy their S3 data to HDFS in prep for the next run :D From: randy clinton Date: Thursday, May 28, 2020 at 5:50 AM To: Dark Crusader Cc: Jörn Franke, user Subject: Re: Spark dataframe hdfs vs s3
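[A minimal sketch of that batch-prep pattern; the bucket and paths are hypothetical, and hadoop distcp is a common alternative for the copy step:

    # Stage the S3 input onto HDFS once, then point the actual batch job at HDFS.
    df = spark.read.parquet("s3a://my-bucket/input/")
    df.write.mode("overwrite").parquet("hdfs:///staging/input/")
]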

Re: Spark dataframe hdfs vs s3

2020-05-28 Thread randy clinton
See if this helps: "That is to say, on a per node basis, HDFS can yield 6X higher read throughput than S3. Thus, given that the S3 is 10x cheaper than HDFS, we find that S3 is almost 2x better compared to HDFS on performance per dollar." https://databricks.com/blog/2017/05/31/top-5-reasons-for-
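[Spelling out the arithmetic behind that quote:

    # Normalize S3's cost and throughput to 1; the quote puts HDFS at 6x the
    # read throughput for 10x the cost.
    hdfs_perf_per_dollar = 6 / 10  # 0.6
    s3_perf_per_dollar = 1 / 1     # 1.0
    print(s3_perf_per_dollar / hdfs_perf_per_dollar)  # ~1.67, i.e. "almost 2x"
]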

Re: Spark dataframe hdfs vs s3

2020-05-27 Thread Dark Crusader
Hi Randy, Yes, I'm using parquet on both S3 and hdfs. On Thu, 28 May, 2020, 2:38 am randy clinton, wrote: > Is the file Parquet on S3 or is it some other file format? > > In general I would assume that HDFS read/writes are more performant for > spark jobs. > > For instance, consider how well pa

Re: Spark dataframe hdfs vs s3

2020-05-27 Thread randy clinton
Is the file Parquet on S3 or is it some other file format? In general I would assume that HDFS read/writes are more performant for spark jobs. For instance, consider how well partitioned your HDFS file is vs the S3 file. On Wed, May 27, 2020 at 1:51 PM Dark Crusader wrote: > Hi Jörn, > > Thank
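[As a quick way to see how well partitioned an input is from Spark's point of view; the path is hypothetical:

    # Each input partition becomes one read task; a handful of huge partitions
    # reads far worse than many moderately sized ones.
    df = spark.read.parquet("hdfs:///data/events.parquet")
    print(df.rdd.getNumPartitions())
]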

Re: Spark dataframe hdfs vs s3

2020-05-27 Thread Dark Crusader
Hi Jörn, Thanks for the reply. I will try to create an easier example to reproduce the issue. I will also try your suggestion to look into the UI. Can you guide me on what I should be looking for? I was already using the s3a protocol to compare the times. My hunch is that multiple reads from S3 are

Re: Spark dataframe hdfs vs s3

2020-05-27 Thread Jörn Franke
Have you looked in the Spark UI to see why this is the case? Reading from S3 can take more time; it also depends on which S3 URL scheme you are using: s3a vs s3n vs s3. It could help, after some calculation, to persist in memory or on HDFS. You can also initially load from S3, store on HDFS, and work from there. HD
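[A minimal sketch of both suggestions; the bucket and paths are hypothetical:

    from pyspark import StorageLevel

    # Load from S3 once, then keep the data off S3 for subsequent work.
    df = spark.read.parquet("s3a://my-bucket/data/")

    # Option 1: persist in memory, spilling to local disk if it doesn't fit.
    df.persist(StorageLevel.MEMORY_AND_DISK)
    df.count()  # an action materializes the cache

    # Option 2: store on HDFS and work from there.
    df.write.mode("overwrite").parquet("hdfs:///data/from_s3/")
]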

Spark dataframe hdfs vs s3

2020-05-27 Thread Dark Crusader
Hi all, I am reading data from hdfs in the form of parquet files (around 3 GB) and running an algorithm from the spark ml library. If I create the same spark dataframe by reading data from S3, the same algorithm takes considerably more time. I don't understand why this is happening. Is this a ch
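[A minimal sketch of the comparison being described, with hypothetical paths; count() stands in for the ML algorithm just to force a full read of each input:

    import time
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hdfs-vs-s3").getOrCreate()

    for path in ["hdfs:///data/train.parquet", "s3a://my-bucket/train.parquet"]:
        start = time.time()
        spark.read.parquet(path).count()  # force a full scan of the input
        print(path, "took", time.time() - start, "seconds")
]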