RE: Spark performance over S3

2021-04-07 Thread Boris Litvak
Oh, Tzahi, I misread the metrics in the first reply. It's about reads indeed, not writes.

Re: Spark performance over S3

2021-04-07 Thread Tzahi File
Hi Hariharan, Thanks for your reply. In both cases we are writing the data to S3. The difference is that in the first case we read the data from S3 and in the second we read from HDFS. We are using the ListObjectsV2 API in S3A. The S3 bucket and…
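
For readers wanting to pin down the listing behavior mentioned here: a minimal sketch of the relevant S3A settings, assuming Hadoop 2.8+ with the hadoop-aws module on the classpath (the app name and bucket path are hypothetical):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("s3a-read-sketch")
      // 2 selects the ListObjectsV2 API (the default on recent Hadoop); 1 falls back to V1
      .config("spark.hadoop.fs.s3a.list.version", "2")
      // widen the S3A HTTP connection pool for many concurrent reads
      .config("spark.hadoop.fs.s3a.connection.maximum", "200")
      .getOrCreate()

    val df = spark.read.parquet("s3a://my-bucket/path/") // hypothetical path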

Re: Spark performance over S3

2021-04-07 Thread Vladimir Prus
A VPC endpoint can also make a major difference in costs. Without it, access to S3 incurs data transfer costs and NAT costs, and these can be large. On Wed, 7 Apr 2021 at 14:13, Hariharan wrote: …

Re: Spark performance over S3

2021-04-07 Thread Hariharan
Hi Tzahi, Comparing the first two cases:
- reads the parquet files from S3 and also writes to S3: it takes 22 min
- reads the parquet files from S3 and writes to its local HDFS: it takes the same amount of time (±22 min)
It looks like most of the time is being spent in reading, and the time…
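
One way to confirm that reading dominates is to time a full scan on its own; a minimal sketch, assuming an existing SparkSession named spark (the path is hypothetical):

    // Force a full scan of the input without writing anything,
    // so the elapsed time reflects the S3 read path alone.
    val t0 = System.nanoTime()
    val df = spark.read.parquet("s3a://my-bucket/input/")
    // go through the RDD so the count cannot be answered from parquet footer metadata
    val rows = df.rdd.count()
    println(s"scanned $rows rows in ${(System.nanoTime() - t0) / 1e9} s")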

RE: Spark performance over S3

2021-04-07 Thread Boris Litvak
…tion to compare this with EMRFS performance … I know it requires you to put in some work. Boris

Re: Spark performance over S3

2021-04-06 Thread Gourav Sengupta
Hi Tzahi, that is a huge cost. So that I can understand the question before answering it:
1. what is the SPARK version that you are using?
2. what is the SQL code that you are using to read and write?
There are several other questions that are pertinent, but the above will be a great starting point.

Re: SPARK PERFORMANCE TUNING

2016-09-21 Thread Mich Talebzadeh
LOL I think we should try the crystal ball to answer this question. Dr Mich Talebzadeh LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: SPARK PERFORMANCE TUNING

2016-09-21 Thread Jörn Franke
Do you mind sharing what your software does? What is the input data size? What is the Spark version and which APIs are used? How many nodes? What is the input data format? Is compression used? > On 21 Sep 2016, at 13:37, Trinadh Kaja wrote: > > Hi all, > > how to increase spark…

Re: Spark performance testing

2016-07-09 Thread Mich Talebzadeh
Hi Andrew, I suggest that you narrow down your scope for performance testing: use the same setup and make incremental changes, keeping other systematics the same. Spark itself can run in local, standalone, yarn-client and yarn-cluster modes, so you really need to target a particular setup of run…

Re: Spark performance testing

2016-07-08 Thread Andrew Ehrlich
Yea, I'm looking for any personal experiences people have had with tools like these. > On Jul 8, 2016, at 8:57 PM, charles li wrote: …

Re: Spark performance testing

2016-07-08 Thread charles li
Hi, Andrew, I've got lots of materials when asking google for "spark performance test":
- https://github.com/databricks/spark-perf
- https://spark-summit.org/2014/wp-content/uploads/2014/06/Testing-Spark-Best-Practices-Anupama-Shetty-Neil-Marshall.pdf
- …

Re: spark performance non-linear response

2015-10-07 Thread Sean Owen
OK, next question then is: if this is wall-clock time for the whole process, then, I wonder if you are just measuring the time taken by the longest single task. I'd expect the time taken by the longest straggler task to follow a distribution like this. That is, how balanced are the partitions?
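
A quick way to answer "how balanced are the partitions?" is to count records per partition; a minimal sketch, assuming an existing RDD named rdd (names are illustrative):

    // Count records in each partition, then show the most skewed ones.
    // Note: it.size consumes the iterator, which is fine for a one-off diagnostic.
    val counts = rdd.mapPartitionsWithIndex { (i, it) => Iterator((i, it.size)) }.collect()
    counts.sortBy(-_._2).take(10).foreach { case (i, n) => println(s"partition $i: $n records") }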

Re: spark performance non-linear response

2015-10-07 Thread Yadid Ayzenberg
Additional missing relevant information: I'm running a transformation, there are no shuffles occurring, and at the end I'm performing a lookup of 4 partitions on the driver. On 10/7/15 11:26 AM, Yadid Ayzenberg wrote: Hi All, I'm using spark 1.4.1 to analyze a largish data set (several…

Re: spark performance non-linear response

2015-10-07 Thread Jonathan Coveney
I've noticed this as well and am curious if there is anything more people can say. My theory is that it is just communication overhead. If you only have a couple of gigabytes (a tiny dataset), then splitting that across 50 nodes means you'll have a ton of tiny partitions all finishing very quickly,…
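
If tiny partitions are the culprit, reducing the partition count is the usual remedy; a minimal sketch, assuming an existing RDD named rdd (the target count of 8 is illustrative, not a recommendation):

    // coalesce merges existing partitions without a shuffle
    val fewer = rdd.coalesce(8)
    // repartition rebalances evenly, at the cost of a full shuffle
    val balanced = rdd.repartition(8)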

Re: spark performance - executor computing time

2015-09-17 Thread Adrian Tanase
…mailto:user@spark.apache.org> Subject: Re: spark performance - executor computing time …

Re: spark performance - executor computing time

2015-09-16 Thread Robin East
Is this repeatable? Do you always get one or two executors that are 6 times as slow? It could be that some of your tasks have more work to do (maybe you are filtering some records out?). If it's always one particular worker node, is there something about the machine configuration (e.g. CPU speed)…

RE: Spark performance

2015-07-13 Thread Mohammed Guller
…Mohammed From: Michael Segel [mailto:msegel_had...@hotmail.com] Sent: Sunday, July 12, 2015 6:59 AM To: Mohammed Guller Cc: David Mitchell; Roman Sokolov; user; Ravisankar Mani Subject: Re: Spark performance Not necessarily. It depends on the use case and what you intend to do with the data. 4-6…

Re: Spark performance

2015-07-12 Thread santoshv98
Ravi, Spark (or, for that matter, Big Data solutions like Hive) is suited for large analytical loads, where “scaling up” starts to pale in comparison to “scaling out” with regard to performance, versatility (types of data), and cost. Without going into the details of the MSSQL architecture, there…

Re: Spark performance

2015-07-11 Thread Jörn Franke
What is your business case for the move? On Fri, 10 Jul 2015 at 12:49, Ravisankar Mani rrav...@gmail.com wrote: Hi everyone, I have planned to move from MSSQL Server to Spark. I am using around 50,000 to 1 lakh (100,000) records. The Spark performance is slow when compared to MSSQL Server. What is…

Re: Spark performance

2015-07-11 Thread David Mitchell
You can certainly query over 4 TB of data with Spark. However, you will get an answer in minutes or hours, not in milliseconds or seconds. OLTP databases are used for web applications, and typically return responses in milliseconds. Analytic databases tend to operate on large data sets, and

RE: Spark performance

2015-07-11 Thread Roman Sokolov
Hello. Had the same question. What if I need to store 4-6 TB and do queries? Can't find any clue in the documentation. On 11.07.2015 03:28, Mohammed Guller moham...@glassbeam.com wrote: …

RE: Spark performance

2015-07-11 Thread Mohammed Guller
…To: Roman Sokolov Cc: Mohammed Guller; user; Ravisankar Mani Subject: Re: Spark performance …

Re: Spark performance

2015-07-11 Thread Jörn Franke
Honestly you are addressing this wrongly - you do not seem to have a business case for changing - so why do you want to switch? On Sat, 11 Jul 2015 at 3:28, Mohammed Guller moham...@glassbeam.com wrote: …

Re: Spark performance

2015-07-11 Thread Jörn Franke
On Sat, 11 Jul 2015 at 14:53, Roman Sokolov ole...@gmail.com wrote: …

RE: Spark performance

2015-07-10 Thread Mohammed Guller
Hi Ravi, First, neither Spark nor Spark SQL is a database. Both are compute engines, which need to be paired with a storage system. Second, they are designed for processing large distributed datasets. If you have only 100,000 records or even a million records, you don't need Spark. An RDBMS…

Re: Spark performance issue

2015-07-03 Thread Silvio Fiorito
It’ll help to see the code or at least understand what transformations you’re using. Also, you have 15 nodes but are not using all of them, so that means you may be losing data locality. You can see this in the Spark job UI if any tasks do not run at node-local or process-local locality. From: diplomatic Guru …
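
For what it's worth, the knob usually adjusted when tasks keep falling back to less-local levels is the locality wait; a hedged sketch (the config key is a real Spark setting, the value is illustrative):

    // How long the scheduler waits for a process-/node-local slot before
    // falling back to a less local one (the default is 3s).
    val conf = new org.apache.spark.SparkConf().set("spark.locality.wait", "10s")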

Re: Spark performance in cluster mode using yarn

2015-05-15 Thread Sachin Singh
Hi Ayan, I am asking experts about general scenarios for the given info/configuration, not specifics. The Java code is nothing more than getting a Hive context and running a select query; there is no serialization or any other complexity - it's straightforward, about 10 lines of code. Group, please suggest if you have any ideas. Regards

Re: Spark performance in cluster mode using yarn

2015-05-14 Thread ayan guha
With this information it is hard to predict. What's the performance you are getting? What's your desired performance? Maybe you can post your code and experts can suggest improvements? On 14 May 2015 15:02, sachin Singh sachin.sha...@gmail.com wrote: Hi Friends, please can someone give the…

Re: Spark Performance on Yarn

2015-04-22 Thread Ted Yu
In the master branch, overhead is now 10%. That would be 500 MB, FYI. On Apr 22, 2015, at 8:26 AM, nsalian neeleshssal...@gmail.com wrote: …
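
To make the arithmetic concrete: a sketch of the default being described, assuming the overhead formula is max(384 MB, 10% of executor memory), with a hypothetical 5g executor:

    val executorMemoryMB = 5 * 1024                                 // --executor-memory 5g
    val overheadMB = math.max(384, (0.10 * executorMemoryMB).toInt) // assumed default formula
    println(overheadMB)                                             // 512 - roughly the "500 MB" above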

Re: Spark Performance on Yarn

2015-04-22 Thread nsalian
+1 to setting executor-memory to 5g. Do check the overhead space for both the driver and the executor, as per Wilfred's suggestion. Typically, 384 MB should suffice.

Re: Spark Performance on Yarn

2015-04-22 Thread Neelesh Salian
Does it still hit the memory limit for the container? An expensive transformation? On Wed, Apr 22, 2015 at 8:45 AM, Ted Yu yuzhih...@gmail.com wrote: …

Re: Spark Performance on Yarn

2015-04-21 Thread hnahak
Try --executor-memory 5g, because you have 8 GB RAM in each machine.

Re: Spark Performance on Yarn

2015-04-20 Thread Peng Cheng
I got exactly the same problem, except that I'm running on a standalone master. Can you tell me the counterpart parameter on a standalone master for increasing the same memory overhead?

Re: Spark Performance on Yarn

2015-02-23 Thread Lee Bierman
Thanks for the suggestions. I removed the persist call from the program. Doing so, I started it with: spark-submit --class com.xxx.analytics.spark.AnalyticsJob --master yarn /tmp/analytics.jar --input_directory hdfs://ip:8020/flume/events/2015/02/ This takes all the defaults and only runs 2…

Re: Spark performance tuning

2015-02-22 Thread Akhil Das
You can simply follow these guidelines: http://spark.apache.org/docs/1.2.0/tuning.html Thanks Best Regards On Sun, Feb 22, 2015 at 1:14 AM, java8964 java8...@hotmail.com wrote: …

Re: Spark Performance on Yarn

2015-02-21 Thread Davies Liu
How many executors do you have per machine? It will be helpful if you could list all the configs. Could you also try to run it without persist? Caching can hurt more than help if you don't have enough memory. On Fri, Feb 20, 2015 at 5:18 PM, Lee Bierman leebier...@gmail.com wrote: Thanks for the…

RE: Spark performance tuning

2015-02-21 Thread java8964
Can someone share some ideas about how to tune the GC time? Thanks From: java8...@hotmail.com To: user@spark.apache.org Subject: Spark performance tuning Date: Fri, 20 Feb 2015 16:04:23 -0500 Hi, I am new to Spark, and I am trying to test Spark SQL performance vs Hive. I set up a…
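
On the GC question: a minimal sketch of making executor GC time visible so it can be tuned (the config key is a real Spark setting; the JVM flags are the ones the tuning guide of that era suggests, shown here as illustration):

    val conf = new org.apache.spark.SparkConf()
      // log GC activity in the executor output so you can see where the time goes
      .set("spark.executor.extraJavaOptions",
           "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")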

Re: Spark Performance on Yarn

2015-02-20 Thread Sean Owen
None of this really points to the problem. These indicate that workers died, but not why. I'd first go locate executor logs that reveal more about what's happening. It sounds like a harder type of failure, like a JVM crash, running out of file handles, or GC thrashing. On Fri, Feb 20, 2015 at…

Re: Spark Performance on Yarn

2015-02-20 Thread Kelvin Chu
Hi Sandy, I appreciate your clear explanation. Let me try again - it's the best way to confirm I understand. spark.executor.memory + spark.yarn.executor.memoryOverhead = the memory for which YARN will create a JVM. spark.executor.memory = the memory I can actually use in my JVM application = part of…

Re: Spark Performance on Yarn

2015-02-20 Thread Sandy Ryza
That's all correct. -Sandy On Fri, Feb 20, 2015 at 1:23 PM, Kelvin Chu 2dot7kel...@gmail.com wrote: …

Re: Spark Performance on Yarn

2015-02-20 Thread Lee Bierman
Thanks for the suggestions. I'm experimenting with different values for spark memoryOverhead and explicitly giving the executors more memory, but still have not found the golden mean to get it to finish in a proper time frame. Is my cluster massively undersized at 5 boxes with 8 GB and 2 CPUs each? Trying to…

Re: Spark Performance on Yarn

2015-02-20 Thread lbierman
A bit more context on this issue, from the container logs on the executor. Given my cluster specs above, what would be appropriate parameters to pass in: --num-executors --num-cores --executor-memory? I had tried it with --executor-memory 2500MB. 2015-02-20 06:50:09,056 WARN…

Re: Spark Performance on Yarn

2015-02-20 Thread Sandy Ryza
Are you specifying the executor memory, cores, or number of executors anywhere? If not, you won't be taking advantage of the full resources on the cluster. -Sandy On Fri, Feb 20, 2015 at 2:41 AM, Sean Owen so...@cloudera.com wrote: …
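
A minimal sketch of setting those three knobs programmatically (real config keys that mirror the spark-submit flags; the numbers are hypothetical, and spark.executor.instances applies when running on YARN):

    val conf = new org.apache.spark.SparkConf()
      .set("spark.executor.instances", "4")   // --num-executors
      .set("spark.executor.cores", "2")       // --executor-cores
      .set("spark.executor.memory", "5g")     // --executor-memory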

Re: Spark Performance on Yarn

2015-02-20 Thread Sandy Ryza
If that's the error you're hitting, the fix is to boost spark.yarn.executor.memoryOverhead, which will put some extra room in between the executor heap sizes and the amount of memory requested for them from YARN. -Sandy On Fri, Feb 20, 2015 at 9:40 AM, lbierman leebier...@gmail.com wrote: A

Re: Spark Performance on Yarn

2015-02-20 Thread Kelvin Chu
Hi Sandy, I am also doing memory tuning on YARN. Just want to confirm, is it correct to say: spark.executor.memory - spark.yarn.executor.memoryOverhead = the memory I can actually use in my JVM application? If it is not, what is the correct relationship? Any other variables or config parameters…

Re: Spark Performance on Yarn

2015-02-20 Thread Sandy Ryza
Hi Kelvin, spark.executor.memory controls the size of the executor heaps. spark.yarn.executor.memoryOverhead is the amount of memory to request from YARN beyond the heap size. This accounts for the fact that JVMs use some non-heap memory. The Spark heap is divided into…
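
Putting Sandy's relationship into a concrete sketch (real Spark-on-YARN config keys of that era; the sizes are hypothetical):

    val conf = new org.apache.spark.SparkConf()
      .set("spark.executor.memory", "4g")               // executor heap
      .set("spark.yarn.executor.memoryOverhead", "768") // MB of non-heap headroom
    // YARN is asked for roughly 4096 MB heap + 768 MB overhead = 4864 MB per executor container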

Re: Spark performance for small queries

2015-01-22 Thread Saumitra Shahapure (Vizury)
Hello, We were comparing the performance of some of our production Hive queries between Hive and Spark. We compared Hive (0.13) + Hadoop (1.2.1) against both Spark 0.9 and 1.1. We could see that the performance gains have been good in Spark. We tried a very simple query, select count(*) from T where…

Re: Spark performance optimization examples

2014-11-24 Thread Akhil Das
Here are the tuning guidelines if you haven't seen them already: http://spark.apache.org/docs/latest/tuning.html You could try the following to get it loaded:
- Use Kryo serialization: http://spark.apache.org/docs/latest/tuning.html#data-serialization
- Enable RDD compression
- Set storage level to…
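
A minimal sketch combining those suggestions (real Spark config keys and storage levels; whether they help depends on the workload):

    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.rdd.compress", "true") // compress serialized RDD partitions
    // then persist with a serialized storage level, e.g.
    // rdd.persist(StorageLevel.MEMORY_ONLY_SER)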