Re: Stress testing hdfs with Spark

2016-04-05 Thread Jan Holmberg
Yes, I realize that there's a standard way and then there's the way where the client asks 'how fast can it write the data'. That is what I'm trying to figure out. At the moment I'm far from the disks' theoretical write speed when combining all the disks together.

Re: Stress testing hdfs with Spark

2016-04-05 Thread Jan Holmberg
Yep, I used DFSIO and also TeraGen, but I would like to experiment with an ad-hoc Spark program. -jan

Re: Stress testing hdfs with Spark

2016-04-05 Thread Mich Talebzadeh
So that is throughput per second. You can try Spark Streaming saving it to HDFS and increase the throttle. The generally accepted approach is to measure service time, i.e. the average service time for IO requests in ms. Dr Mich Talebzadeh
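A minimal sketch of the Spark Streaming idea above, using the DStream API of that era; the socket source, batch interval, and HDFS path are placeholder assumptions, and any high-rate source would do. Per-batch processing times reported in the Streaming UI then approximate the service time mentioned above.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingHdfsStress {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamingHdfsStress")
    val ssc  = new StreamingContext(conf, Seconds(1))   // 1-second micro-batches

    // Placeholder source: any high-rate input (Kafka, socket, etc.) works here.
    val lines = ssc.socketTextStream("localhost", 9999)

    // Each batch is written under its own timestamped directory on HDFS.
    lines.saveAsTextFiles("hdfs:///tmp/stress/out")

    ssc.start()
    ssc.awaitTermination()
  }
}
```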

Re: Stress testing hdfs with Spark

2016-04-05 Thread Sebastian Piu
You could try using TestDFSIO for raw HDFS performance, but we found it not very relevant. Another way could be to generate a file and then read it and write it back. For some of our use cases we populated a Kafka queue on the cluster (on different disks) and used Spark Streaming to do
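A rough sketch of the "read it and write it back" approach, assuming a data set has already been generated under a placeholder HDFS path; the timing gives a crude combined read+write throughput figure.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ReadWriteBack {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ReadWriteBack"))

    val src = "hdfs:///tmp/stress/source"   // assumed path to a pre-generated data set
    val dst = "hdfs:///tmp/stress/copy"

    val start = System.nanoTime()
    sc.textFile(src).saveAsTextFile(dst)    // read the data back in and write it out again
    val secs = (System.nanoTime() - start) / 1e9
    println(f"read+write took $secs%.1f s")

    sc.stop()
  }
}
```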

Re: Stress testing hdfs with Spark

2016-04-05 Thread Jan Holmberg
I'm trying to get a rough estimate of how much data I can write within a certain time period (GB/sec). -jan

Re: Stress testing hdfs with Spark

2016-04-05 Thread Mich Talebzadeh
Hi Jan, What is the definition of a stress test here? What are the metrics? Throughput of data, latency, velocity, volume? HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Stress testing hdfs with Spark

2016-04-05 Thread Jan Holmberg
Hi, I'm trying to figure out how to write lots of data from each worker. I tried rdd.saveAsTextFile but got an OOM when generating a 1024 MB string for a worker. Increasing worker memory would mean that I should drop the number of workers. So, any idea how to write e.g. a 1 GB file from each worker?
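One way to avoid the OOM is to generate the data lazily inside each partition instead of building a single 1 GB string on the driver or in a task. A minimal sketch, with partition count, line size, and output path as placeholder assumptions:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WritePerWorker {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("WritePerWorker"))

    val numPartitions = 16              // roughly one partition per worker/executor core
    val line          = "x" * 1024      // 1 KB per line
    val linesPerPart  = 1024 * 1024     // ~1 GB per partition in total

    // mapPartitions returns a lazy Iterator, so saveAsTextFile streams the
    // lines to HDFS without ever materializing 1 GB of data in memory at once.
    sc.parallelize(0 until numPartitions, numPartitions)
      .mapPartitions(_ => Iterator.fill(linesPerPart)(line))
      .saveAsTextFile("hdfs:///tmp/stress/generated")

    sc.stop()
  }
}
```

Timing the saveAsTextFile call (e.g. with System.nanoTime around it) and dividing the total bytes written by the elapsed seconds gives the GB/sec estimate asked about earlier in the thread.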