Re: Resource under-utilization when using RocksDb state backend [SOLVED]

2017-02-16 Thread vinay patil





Re: Resource under-utilization when using RocksDb state backend [SOLVED]

2016-12-08 Thread Aljoscha Krettek
Great to hear!



Re: Resource under-utilization when using RocksDb state backend [SOLVED]

2016-12-08 Thread Cliff Resnick
It turns out that most of the time in RocksDBFoldingState was spent on
serialization/deserialization. RocksDb read/write was performing well. By
moving from Kryo to custom serialization we were able to increase
throughput dramatically. Load is now where it should be.
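The kind of win described above can be illustrated in plain Java: a hand-rolled serializer writes exactly the fields and nothing else, while a generic framework (Kryo is Flink's fallback for types it cannot analyze) pays per-record overhead for reflection and class metadata. This is a standalone sketch with a made-up `Event` type, not Flink's actual `TypeSerializer` API:

```java
import java.io.*;
import java.util.Objects;

public class CustomSerDemo {
    // A small value like the 5-7 field tuples in the thread (hypothetical layout).
    static final class Event {
        final String key; final int a; final int b; final long ts;
        Event(String key, int a, int b, long ts) {
            this.key = key; this.a = a; this.b = b; this.ts = ts;
        }
        @Override public boolean equals(Object o) {
            if (!(o instanceof Event)) return false;
            Event e = (Event) o;
            return key.equals(e.key) && a == e.a && b == e.b && ts == e.ts;
        }
        @Override public int hashCode() { return Objects.hash(key, a, b, ts); }
    }

    // Hand-rolled serialization: the fixed field order *is* the wire format,
    // so no type names or tags are written per record.
    static byte[] serialize(Event e) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        out.writeUTF(e.key);
        out.writeInt(e.a);
        out.writeInt(e.b);
        out.writeLong(e.ts);
        out.flush();
        return bos.toByteArray();
    }

    static Event deserialize(byte[] bytes) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
        return new Event(in.readUTF(), in.readInt(), in.readInt(), in.readLong());
    }

    public static void main(String[] args) throws IOException {
        Event e = new Event("user-42", 7, 11, 1480000000000L);
        byte[] wire = serialize(e);
        System.out.println("roundtrip ok: " + e.equals(deserialize(wire))
                + ", bytes: " + wire.length);
    }
}
```

In a RocksDB-backed job this code path runs on every state read and write, not just at checkpoints, which is why serializer cost dominates so easily.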



Re: Resource under-utilization when using RocksDb state backend

2016-12-05 Thread Robert Metzger
Another Flink user using RocksDB with large state on SSDs recently posted
this video on optimizing the performance of RocksDB on SSDs:
https://www.youtube.com/watch?v=pvUqbIeoPzM
That could be relevant for you.

For how long did you look at iotop? It could be that the IO access happens
in bursts, depending on how data is cached.

I'll also add Stefan Richter to the conversation; he may have some more
ideas about what we can do here.




Re: Resource under-utilization when using RocksDb state backend

2016-12-05 Thread Cliff Resnick
Hi Robert,

We're following 1.2-SNAPSHOT, using event time. I have tried "iotop" and I
usually see less than 1% IO. The most I've seen was a quick flash here or
there of something substantial (e.g. 19%, 52%), then back to nothing. I also
assumed we were disk-bound, but to use your metaphor, I'm having trouble
finding any smoke. However, I'm not very experienced in sussing out IO
issues, so perhaps there is something else I'm missing.

I'll keep investigating. If I continue to come up empty, then I guess my
next step may be to stage some independent tests directly against RocksDb.

-Cliff




Re: Resource under-utilization when using RocksDb state backend

2016-12-05 Thread Robert Metzger
Hi Cliff,

Which Flink version are you using?
Are you using event-time or processing-time windows?

I suspect that your disks are "burning" (i.e. your job is IO-bound). Can you
check with a tool like "iotop" how much disk IO Flink is producing?
Then I would compare that number with the theoretical maximum of
your SSDs (a good rough estimate is to measure it with dd).
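As a rough stand-in for the dd check, a few lines of Java can estimate sequential write bandwidth to whatever directory backs `taskmanager.tmp.dir` (the path and sizes here are arbitrary assumptions; point it at the SSD mount to get a meaningful number):

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class DiskWriteCheck {
    /** Sequentially writes totalMiB of zeros to path, deletes it, returns MiB/s. */
    static double seqWriteMiBps(Path path, int totalMiB) throws IOException {
        byte[] chunk = new byte[1 << 20]; // 1 MiB buffer of zeros
        long start = System.nanoTime();
        try (OutputStream out = Files.newOutputStream(path,
                StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING)) {
            for (int i = 0; i < totalMiB; i++) {
                out.write(chunk); // sequential append, one chunk at a time
            }
            out.flush();
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        Files.deleteIfExists(path);
        return totalMiB / seconds;
    }

    public static void main(String[] args) throws IOException {
        // In a real check, create this file on the SSD used for taskmanager.tmp.dir.
        Path p = Files.createTempFile("seq-write-check", ".bin");
        System.out.printf("sequential write: %.1f MiB/s%n", seqWriteMiBps(p, 256));
    }
}
```

Note that unlike `dd ... oflag=direct`, this goes through the OS page cache, so treat the result as an upper bound rather than raw device bandwidth.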

If you find that your disk bandwidth is saturated by Flink, you could look
into tuning the RocksDB settings so that it uses more memory for caching.

Regards,
Robert




Resource under-utilization when using RocksDb state backend

2016-12-02 Thread Cliff Resnick
In tests comparing the RocksDb state backend to the fs state backend we observe much lower
throughput, around 10x slower. While the lowered throughput is expected,
what's perplexing is that machine load is also very low with RocksDb,
typically falling to < 25% CPU with negligible IO wait (around 0.1%). Our
test instances are EC2 c3.xlarge, which have 4 virtual CPUs and 7.5G RAM,
each running a single TaskManager in YARN, with 6.5G of memory allocated per
TaskManager. The instances also have 2x40G attached SSDs, which we have
mapped to `taskmanager.tmp.dir`.

With FS state and 4 slots per TM, we easily max out, with a load
average around 5 or 6, so we actually need to throttle the slots down to
3. With RocksDb, using Flink's SSD-tuned options, we see a load
average of around 1. Also, both load and actual throughput remain more or less
constant no matter how many slots we use. The weak load is spread over all
CPUs.

Here is a sample top:

Cpu0  : 20.5%us,  0.0%sy,  0.0%ni, 79.5%id,  0.0%wa,  0.0%hi,  0.0%si,
 0.0%st
Cpu1  : 18.5%us,  0.0%sy,  0.0%ni, 81.5%id,  0.0%wa,  0.0%hi,  0.0%si,
 0.0%st
Cpu2  : 11.6%us,  0.7%sy,  0.0%ni, 87.0%id,  0.7%wa,  0.0%hi,  0.0%si,
 0.0%st
Cpu3  : 12.5%us,  0.3%sy,  0.0%ni, 86.8%id,  0.0%wa,  0.0%hi,  0.3%si,
 0.0%st

Our pipeline uses tumbling windows, each with a ValueState keyed to a
3-tuple of one string and two ints. Each ValueState comprises a small set
of tuples of around 5-7 fields each. The WindowFunction simply diffs against
the set and updates state if there is a diff.
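Shorn of the Flink API, the per-window logic described above amounts to the following plain-Java sketch (`StateHolder` stands in for a keyed `ValueState`, and all names are hypothetical; tuples are modeled as `List<Object>` so they compare element-wise):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DiffWindowSketch {
    // Stand-in for Flink's ValueState<Set<...>>: one slot, as if scoped per key.
    static final class StateHolder<T> {
        private T value;
        T value() { return value; }
        void update(T v) { value = v; }
    }

    /** Diffs the incoming window contents against stored state;
     *  writes back only when something changed. Returns whether it updated. */
    static boolean processWindow(StateHolder<Set<List<Object>>> state,
                                 Set<List<Object>> incoming) {
        Set<List<Object>> previous = state.value(); // null on first window
        if (incoming.equals(previous)) {
            return false;                           // no diff: leave state untouched
        }
        state.update(new HashSet<>(incoming));      // diff found: update state
        return true;
    }

    public static void main(String[] args) {
        StateHolder<Set<List<Object>>> state = new StateHolder<>();
        Set<List<Object>> window = new HashSet<>(Arrays.asList(
                Arrays.<Object>asList("a", 1, 2),
                Arrays.<Object>asList("b", 3, 4)));
        System.out.println(processWindow(state, window));                // first window: updates
        System.out.println(processWindow(state, new HashSet<>(window))); // same content: no update
    }
}
```

With a RocksDB backend, both the read of `previous` and the write in `update` pass through serialization, which is where the time went in this thread.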

Any ideas as to what the bottleneck is here? Any suggestions welcomed!

-Cliff