Re: OOM errors
I see. For windowed aggregations the disk space (i.e. "/tmp/kafka-streams/appname") as well as the memory consumption of RocksDB should not keep increasing forever.

One thing to note is that you are using a hopping window where a new window is created every minute, so within 20 minutes of "event time" (note this is not your processing time) you will get 20 windows, and in the worst case each update could be applied to every one of those windows. Hence, your space consumption would roughly be:

    input traffic (MB/sec) * avg. number of windows each record falls into (worst case 20) * RocksDB space amplification

We have been using 0.10.0.1 in production with 10+ aggregations and we do see memory usage climb up to 50GB for a single node, again depending on the input traffic, but it did not climb forever.

Guozhang

On Wed, Nov 30, 2016 at 4:09 AM, Jon Yeargers wrote:
> My apologies. In fact the 'aggregate' step includes this:
> 'TimeWindows.of(20 * 60 * 1000L).advanceBy(60 * 1000L)'
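To put the space formula above in concrete terms with purely illustrative numbers (not figures taken from this thread): at 2 MB/sec of input, with each record falling into up to 20 overlapping windows and an assumed RocksDB space amplification of around 3x, the worst case is roughly 2 * 20 * 3 = 120 MB of state written per second of input, which over the lifetime of the retained windows easily adds up to the tens of GB observed in this thread.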
Re: OOM errors
My apologies. In fact the 'aggregate' step includes this:
'TimeWindows.of(20 * 60 * 1000L).advanceBy(60 * 1000L)'

On Tue, Nov 29, 2016 at 9:12 PM, Guozhang Wang wrote:
> Where does the "20 minutes" come from? I thought the "aggregate" operator in your
>
>     stream->aggregate->filter->foreach
>
> topology is not a windowed aggregation, so the aggregate results will keep
> accumulating.
>
> Guozhang
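For readers following along, the windowed rolling-average aggregate being discussed presumably looks something like the sketch below (written against the 0.10.1-era DSL; the accumulator type, serdes, topic name and store name are placeholders, not taken from the thread):

    import org.apache.kafka.common.serialization.Serde;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.kstream.KStream;
    import org.apache.kafka.streams.kstream.KStreamBuilder;
    import org.apache.kafka.streams.kstream.KTable;
    import org.apache.kafka.streams.kstream.TimeWindows;
    import org.apache.kafka.streams.kstream.Windowed;

    public class RollingAverageSketch {

        // Illustrative accumulator for a rolling average; the real app's type is not shown in the thread.
        public static class Avg {
            public long sum;
            public long count;
            public double value() { return count == 0 ? 0.0 : (double) sum / count; }
        }

        static KTable<Windowed<String>, Avg> build(KStreamBuilder builder, Serde<Avg> avgSerde) {
            // Placeholder topic name and serdes.
            KStream<String, Long> input = builder.stream(Serdes.String(), Serdes.Long(), "input-topic");

            return input
                .groupByKey(Serdes.String(), Serdes.Long())
                .aggregate(
                    Avg::new,                                               // initializer
                    (key, value, agg) -> { agg.sum += value; agg.count++; return agg; },
                    TimeWindows.of(20 * 60 * 1000L).advanceBy(60 * 1000L),  // 20-minute windows hopping every minute
                    avgSerde,
                    "rolling-average-store");                               // RocksDB-backed windowed store
        }
    }

Because advanceBy() is smaller than the window size, these are overlapping (hopping) windows, which is why a single record can update up to 20 window entries in the store.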
Re: OOM errors
Where does the "20 minutes" come from? I thought the "aggregate" operator in your stream->aggregate->filter->foreach topology is not a windowed aggregation, so the aggregate results will keep accumulating. Guozhang On Tue, Nov 29, 2016 at 8:40 PM, Jon Yeargers wrote: > "keep increasing" - why? It seems (to me) that the aggregates should be 20 > minutes long. After that the memory should be released. > > Not true? > > On Tue, Nov 29, 2016 at 3:53 PM, Guozhang Wang wrote: > > > Jon, > > > > Note that in your "aggregate" function, if it is now windowed aggregate > > then the aggregation results will keep increasing in your local state > > stores unless you're pretty sure that the aggregate key space is bounded. > > This is not only related to disk space but also memory since the current > > default persistent state store, RocksDB, takes its own block cache both > on > > heap and out of heap. > > > > In addition, RocksDB has a write / space amplification. That is, if your > > estimate size of the aggregated state store takes 1GB, it will actually > > take 1 * max-amplification-factor in worst case, and similarly for block > > cache and write buffer. > > > > > > Guozhang > > > > On Tue, Nov 29, 2016 at 12:25 PM, Jon Yeargers > > > wrote: > > > > > App eventually got OOM-killed. Consumed 53G of swap space. > > > > > > Does it require a different GC? Some extra settings for the java cmd > > line? > > > > > > On Tue, Nov 29, 2016 at 12:05 PM, Jon Yeargers < > jon.yearg...@cedexis.com > > > > > > wrote: > > > > > > > I cloned/built 10.2.0-SNAPSHOT > > > > > > > > App hasn't been OOM-killed yet but it's up to 66% mem. > > > > > > > > App takes > 10 min to start now. Needless to say this is problematic. > > > > > > > > The 'kafka-streams' scratch space has consumed 37G and still > climbing. > > > > > > > > > > > > On Tue, Nov 29, 2016 at 10:48 AM, Jon Yeargers < > > jon.yearg...@cedexis.com > > > > > > > > wrote: > > > > > > > >> Does every broker need to be updated or just my client app(s)? > > > >> > > > >> On Tue, Nov 29, 2016 at 10:46 AM, Matthias J. Sax < > > > matth...@confluent.io> > > > >> wrote: > > > >> > > > >>> What version do you use? > > > >>> > > > >>> There is a memory leak in the latest version 0.10.1.0. The bug got > > > >>> already fixed in trunk and 0.10.1 branch. > > > >>> > > > >>> There is already a discussion about a 0.10.1.1 bug fix release. For > > > now, > > > >>> you could build the Kafka Streams from the sources by yourself. > > > >>> > > > >>> -Matthias > > > >>> > > > >>> > > > >>> On 11/29/16 10:30 AM, Jon Yeargers wrote: > > > >>> > My KStreams app seems to be having some memory issues. > > > >>> > > > > >>> > 1. I start it `java -Xmx8G -jar .jar` > > > >>> > > > > >>> > 2. Wait 5-10 minutes - see lots of 'org.apache.zookeeper. > > ClientCnxn > > > - > > > >>> Got > > > >>> > ping response for sessionid: 0xc58abee3e13 after 0ms' > messages > > > >>> > > > > >>> > 3. When it _finally_ starts reading values it typically goes for > a > > > >>> minute > > > >>> > or so, reads a few thousand values and then the OS kills it with > > 'Out > > > >>> of > > > >>> > memory' error. > > > >>> > > > > >>> > The topology is (essentially): > > > >>> > > > > >>> > stream->aggregate->filter->foreach > > > >>> > > > > >>> > It's reading values and creating a rolling average. > > > >>> > > > > >>> > During phase 2 (above) I see lots of IO wait and the 'scratch' > > buffer > > > >>> > (usually "/tmp/kafka-streams/appname") fills with 10s of Gb of > .. ? 
> > > (I > > > >>> had > > > >>> > to create a special scratch partition with 100Gb of space as > kafka > > > >>> would > > > >>> > fill the / partition and make the system v v unhappy) > > > >>> > > > > >>> > Have I misconfigured something? > > > >>> > > > > >>> > > > >>> > > > >> > > > > > > > > > > > > > > > -- > > -- Guozhang > > > -- -- Guozhang
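In code terms the distinction Guozhang is drawing looks roughly like this (a sketch reusing the placeholder Avg type, serde and input stream from the snippet above, plus KGroupedStream and Aggregator imports; store names are illustrative):

    // Assumes Avg, avgSerde and input from the earlier sketch.
    KGroupedStream<String, Long> grouped = input.groupByKey(Serdes.String(), Serdes.Long());
    Aggregator<String, Long, Avg> adder =
        (key, value, agg) -> { agg.sum += value; agg.count++; return agg; };

    // Non-windowed aggregate: the state store keeps one entry per key, so it grows with
    // the key space and entries are never expired.
    KTable<String, Avg> unbounded =
        grouped.aggregate(Avg::new, adder, avgSerde, "unbounded-avg-store");

    // Windowed aggregate: entries are keyed by (key, window); windows older than the
    // store's retention period are eventually dropped, so state is bounded by traffic,
    // the number of overlapping windows, and the retention rather than by the key space.
    KTable<Windowed<String>, Avg> windowed =
        grouped.aggregate(Avg::new, adder,
            TimeWindows.of(20 * 60 * 1000L).advanceBy(60 * 1000L),
            avgSerde, "windowed-avg-store");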
Re: OOM errors
"keep increasing" - why? It seems (to me) that the aggregates should be 20 minutes long. After that the memory should be released. Not true? On Tue, Nov 29, 2016 at 3:53 PM, Guozhang Wang wrote: > Jon, > > Note that in your "aggregate" function, if it is now windowed aggregate > then the aggregation results will keep increasing in your local state > stores unless you're pretty sure that the aggregate key space is bounded. > This is not only related to disk space but also memory since the current > default persistent state store, RocksDB, takes its own block cache both on > heap and out of heap. > > In addition, RocksDB has a write / space amplification. That is, if your > estimate size of the aggregated state store takes 1GB, it will actually > take 1 * max-amplification-factor in worst case, and similarly for block > cache and write buffer. > > > Guozhang > > On Tue, Nov 29, 2016 at 12:25 PM, Jon Yeargers > wrote: > > > App eventually got OOM-killed. Consumed 53G of swap space. > > > > Does it require a different GC? Some extra settings for the java cmd > line? > > > > On Tue, Nov 29, 2016 at 12:05 PM, Jon Yeargers > > > wrote: > > > > > I cloned/built 10.2.0-SNAPSHOT > > > > > > App hasn't been OOM-killed yet but it's up to 66% mem. > > > > > > App takes > 10 min to start now. Needless to say this is problematic. > > > > > > The 'kafka-streams' scratch space has consumed 37G and still climbing. > > > > > > > > > On Tue, Nov 29, 2016 at 10:48 AM, Jon Yeargers < > jon.yearg...@cedexis.com > > > > > > wrote: > > > > > >> Does every broker need to be updated or just my client app(s)? > > >> > > >> On Tue, Nov 29, 2016 at 10:46 AM, Matthias J. Sax < > > matth...@confluent.io> > > >> wrote: > > >> > > >>> What version do you use? > > >>> > > >>> There is a memory leak in the latest version 0.10.1.0. The bug got > > >>> already fixed in trunk and 0.10.1 branch. > > >>> > > >>> There is already a discussion about a 0.10.1.1 bug fix release. For > > now, > > >>> you could build the Kafka Streams from the sources by yourself. > > >>> > > >>> -Matthias > > >>> > > >>> > > >>> On 11/29/16 10:30 AM, Jon Yeargers wrote: > > >>> > My KStreams app seems to be having some memory issues. > > >>> > > > >>> > 1. I start it `java -Xmx8G -jar .jar` > > >>> > > > >>> > 2. Wait 5-10 minutes - see lots of 'org.apache.zookeeper. > ClientCnxn > > - > > >>> Got > > >>> > ping response for sessionid: 0xc58abee3e13 after 0ms' messages > > >>> > > > >>> > 3. When it _finally_ starts reading values it typically goes for a > > >>> minute > > >>> > or so, reads a few thousand values and then the OS kills it with > 'Out > > >>> of > > >>> > memory' error. > > >>> > > > >>> > The topology is (essentially): > > >>> > > > >>> > stream->aggregate->filter->foreach > > >>> > > > >>> > It's reading values and creating a rolling average. > > >>> > > > >>> > During phase 2 (above) I see lots of IO wait and the 'scratch' > buffer > > >>> > (usually "/tmp/kafka-streams/appname") fills with 10s of Gb of .. ? > > (I > > >>> had > > >>> > to create a special scratch partition with 100Gb of space as kafka > > >>> would > > >>> > fill the / partition and make the system v v unhappy) > > >>> > > > >>> > Have I misconfigured something? > > >>> > > > >>> > > >>> > > >> > > > > > > > > > -- > -- Guozhang >
Re: OOM errors
Jon,

Note that in your "aggregate" function, if it is not a windowed aggregate then the aggregation results will keep increasing in your local state stores unless you're pretty sure that the aggregate key space is bounded. This is not only related to disk space but also memory, since the current default persistent state store, RocksDB, takes its own block cache both on heap and off heap.

In addition, RocksDB has write / space amplification. That is, if the estimated size of the aggregated state store is 1GB, it will actually take 1GB * max-amplification-factor in the worst case, and similarly for the block cache and write buffer.

Guozhang

On Tue, Nov 29, 2016 at 12:25 PM, Jon Yeargers wrote:
> App eventually got OOM-killed. Consumed 53G of swap space.
>
> Does it require a different GC? Some extra settings for the java cmd line?
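One knob not mentioned in the thread: from 0.10.1 onwards the off-heap RocksDB memory Guozhang describes (block cache, write buffers) can be capped per store with a RocksDBConfigSetter. A rough sketch, with a made-up class name and purely illustrative sizes:

    import java.util.Map;
    import org.apache.kafka.streams.state.RocksDBConfigSetter;
    import org.rocksdb.BlockBasedTableConfig;
    import org.rocksdb.Options;

    // Illustrative limits only -- tune against your own traffic, and note they apply per
    // RocksDB instance (windowed stores are segmented and there is one set of stores per
    // partition/task), so the totals multiply.
    public class BoundedMemoryRocksDBConfig implements RocksDBConfigSetter {
        @Override
        public void setConfig(final String storeName, final Options options,
                              final Map<String, Object> configs) {
            BlockBasedTableConfig tableConfig = new BlockBasedTableConfig();
            tableConfig.setBlockCacheSize(16 * 1024 * 1024L);  // 16 MB block cache
            options.setTableFormatConfig(tableConfig);
            options.setWriteBufferSize(8 * 1024 * 1024L);      // 8 MB memtable
            options.setMaxWriteBufferNumber(2);                // at most 2 memtables
        }
    }

    // Registered via the Streams config, e.g.:
    //   props.put(StreamsConfig.ROCKSDB_CONFIG_SETTER_CLASS_CONFIG, BoundedMemoryRocksDBConfig.class);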
Re: OOM errors
App eventually got OOM-killed. Consumed 53G of swap space.

Does it require a different GC? Some extra settings for the java cmd line?

On Tue, Nov 29, 2016 at 12:05 PM, Jon Yeargers wrote:
> I cloned/built 10.2.0-SNAPSHOT
>
> App hasn't been OOM-killed yet but it's up to 66% mem.
>
> App takes > 10 min to start now. Needless to say this is problematic.
>
> The 'kafka-streams' scratch space has consumed 37G and still climbing.
Re: OOM errors
I cloned/built 10.2.0-SNAPSHOT

App hasn't been OOM-killed yet but it's up to 66% mem.

App takes > 10 min to start now. Needless to say this is problematic.

The 'kafka-streams' scratch space has consumed 37G and still climbing.

On Tue, Nov 29, 2016 at 10:48 AM, Jon Yeargers wrote:
> Does every broker need to be updated or just my client app(s)?
Re: OOM errors
Does every broker need to be updated or just my client app(s)?

On Tue, Nov 29, 2016 at 10:46 AM, Matthias J. Sax <matth...@confluent.io> wrote:
> What version do you use?
>
> There is a memory leak in the latest version, 0.10.1.0. The bug is already
> fixed in trunk and the 0.10.1 branch.
>
> There is already a discussion about a 0.10.1.1 bug fix release. For now,
> you could build Kafka Streams from the sources yourself.
>
> -Matthias
Re: OOM errors
What version do you use?

There is a memory leak in the latest version, 0.10.1.0. The bug is already fixed in trunk and the 0.10.1 branch.

There is already a discussion about a 0.10.1.1 bug fix release. For now, you could build Kafka Streams from the sources yourself.

-Matthias

On 11/29/16 10:30 AM, Jon Yeargers wrote:
> My KStreams app seems to be having some memory issues.
>
> 1. I start it `java -Xmx8G -jar .jar`
>
> 2. Wait 5-10 minutes - see lots of 'org.apache.zookeeper.ClientCnxn - Got
> ping response for sessionid: 0xc58abee3e13 after 0ms' messages
>
> 3. When it _finally_ starts reading values it typically goes for a minute
> or so, reads a few thousand values and then the OS kills it with an 'Out of
> memory' error.
>
> The topology is (essentially):
>
> stream->aggregate->filter->foreach
>
> It's reading values and creating a rolling average.
>
> During phase 2 (above) I see lots of IO wait and the 'scratch' buffer
> (usually "/tmp/kafka-streams/appname") fills with 10s of GB of .. ? (I had
> to create a special scratch partition with 100GB of space as Kafka would
> fill the / partition and make the system v v unhappy)
>
> Have I misconfigured something?
OOM errors
My KStreams app seems to be having some memory issues.

1. I start it `java -Xmx8G -jar .jar`

2. Wait 5-10 minutes - see lots of 'org.apache.zookeeper.ClientCnxn - Got ping response for sessionid: 0xc58abee3e13 after 0ms' messages

3. When it _finally_ starts reading values it typically goes for a minute or so, reads a few thousand values and then the OS kills it with an 'Out of memory' error.

The topology is (essentially):

stream->aggregate->filter->foreach

It's reading values and creating a rolling average.

During phase 2 (above) I see lots of IO wait and the 'scratch' buffer (usually "/tmp/kafka-streams/appname") fills with 10s of GB of .. ? (I had to create a special scratch partition with 100GB of space as Kafka would fill the / partition and make the system v v unhappy)

Have I misconfigured something?
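A note for anyone hitting the same disk issue: the 'scratch' directory is the Kafka Streams state directory, which can be pointed at a dedicated partition through the state.dir setting instead of relying on the /tmp default. A minimal sketch, with placeholder application id, broker list and path:

    import java.util.Properties;
    import org.apache.kafka.streams.StreamsConfig;

    public class StreamsProps {
        static Properties baseConfig() {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "appname");              // placeholder
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");       // placeholder
            props.put(StreamsConfig.STATE_DIR_CONFIG, "/mnt/kafka-streams-state");  // instead of the /tmp default
            return props;
        }
    }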
Re: Rolling upgrade from 0.8.2.1 to 0.9.0.1 failing with replicafetchthread OOM errors
Oh, I am talking about another memory leak: the off-heap memory leak we had experienced, which is about Direct Buffer memory. The callstack is as below.

    ReplicaFetcherThread.warn - [ReplicaFetcherThread-4-1463989770], Error in fetch
    kafka.server.ReplicaFetcherThread$FetchRequest@7f4c1657. Possible cause:
    java.lang.OutOfMemoryError: Direct buffer memory

    ERROR kafka-network-thread-75737866-PLAINTEXT-26 Processor.error - Processor got
    uncaught exception. local3.err java.lang.OutOfMemoryError: Direct buffer memory
        at java.nio.Bits.reserveMemory(Bits.java:693)

On Mon, Jul 11, 2016 at 10:54 AM, Tom Crayford wrote:
> Hi (I'm the author of that ticket):
>
> From my understanding, limiting MaxDirectMemory won't work around this memory
> leak. The leak is inside the JVM's implementation, not in normal direct
> buffers. On one of our brokers with this issue, we're seeing the JVM report
> 1.2GB of heap and 128MB of offheap memory, yet the actual process memory
> is more like 10GB.
>
> Thanks
>
> Tom Crayford
> Heroku Kafka

--
-hao
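A quick way to check how much memory the JVM itself attributes to direct buffers (the case where -XX:MaxDirectMemorySize is relevant) is the standard BufferPoolMXBean; this is generic JVM tooling rather than anything from the thread:

    import java.lang.management.BufferPoolMXBean;
    import java.lang.management.ManagementFactory;

    public class DirectBufferUsage {
        public static void main(String[] args) {
            // Prints the JVM's own accounting of the "direct" and "mapped" buffer pools.
            // If process RSS is far larger than heap plus these numbers, the leak is likely
            // elsewhere (e.g. native allocations) and MaxDirectMemorySize will not help.
            for (BufferPoolMXBean pool :
                    ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
                System.out.printf("%s: used=%d bytes, capacity=%d bytes%n",
                        pool.getName(), pool.getMemoryUsed(), pool.getTotalCapacity());
            }
        }
    }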
Re: Rolling upgrade from 0.8.2.1 to 0.9.0.1 failing with replicafetchthread OOM errors
Hi (I'm the author of that ticket):

From my understanding, limiting MaxDirectMemory won't work around this memory leak. The leak is inside the JVM's implementation, not in normal direct buffers. On one of our brokers with this issue, we're seeing the JVM report 1.2GB of heap and 128MB of offheap memory, yet the actual process memory is more like 10GB.

Thanks

Tom Crayford
Heroku Kafka
Re: Rolling upgrade from 0.8.2.1 to 0.9.0.1 failing with replicafetchthread OOM errors
Please refer to KAFKA-3933. A workaround is -XX:MaxDirectMemorySize=1024m if your callstack has direct buffer issues (effectively off-heap memory).

On Wed, May 11, 2016 at 9:50 AM, Russ Lavoie wrote:
> Good Afternoon,
>
> I am currently trying to do a rolling upgrade from Kafka 0.8.2.1 to 0.9.0.1
> and am running into a problem when starting 0.9.0.1 with the protocol
> version 0.8.2.1 set in the server.properties.
Rolling upgrade from 0.8.2.1 to 0.9.0.1 failing with replicafetchthread OOM errors
Good Afternoon,

I am currently trying to do a rolling upgrade from Kafka 0.8.2.1 to 0.9.0.1 and am running into a problem when starting 0.9.0.1 with the protocol version 0.8.2.1 set in the server.properties.

Here is my current Kafka topic setup, data retention and hardware used:

3 Zookeeper nodes
5 Broker nodes
Topics have at least 2 replicas
Topics have no more than 200 partitions
4,564 partitions across 61 topics
14 day retention
Each Kafka node has between 2.1T - 2.9T of data
Hardware is C4.2xlarge AWS instances
 - 8 Core (Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz)
 - 14G Ram
 - 4TB EBS volume (10k IOPS [never gets maxed unless I up the num.io.threads])

Here is my running broker configuration for 0.9.0.1:

[2016-05-11 11:43:58,172] INFO KafkaConfig values:
    advertised.host.name = server.domain
    metric.reporters = []
    quota.producer.default = 9223372036854775807
    offsets.topic.num.partitions = 150
    log.flush.interval.messages = 9223372036854775807
    auto.create.topics.enable = false
    controller.socket.timeout.ms = 3
    log.flush.interval.ms = 1000
    principal.builder.class = class org.apache.kafka.common.security.auth.DefaultPrincipalBuilder
    replica.socket.receive.buffer.bytes = 65536
    min.insync.replicas = 1
    replica.fetch.wait.max.ms = 500
    num.recovery.threads.per.data.dir = 1
    ssl.keystore.type = JKS
    default.replication.factor = 3
    ssl.truststore.password = null
    log.preallocate = false
    sasl.kerberos.principal.to.local.rules = [DEFAULT]
    fetch.purgatory.purge.interval.requests = 1000
    ssl.endpoint.identification.algorithm = null
    replica.socket.timeout.ms = 3
    message.max.bytes = 10485760
    num.io.threads = 8
    offsets.commit.required.acks = -1
    log.flush.offset.checkpoint.interval.ms = 6
    delete.topic.enable = true
    quota.window.size.seconds = 1
    ssl.truststore.type = JKS
    offsets.commit.timeout.ms = 5000
    quota.window.num = 11
    zookeeper.connect = zkserver:2181/kafka
    authorizer.class.name =
    num.replica.fetchers = 8
    log.retention.ms = null
    log.roll.jitter.hours = 0
    log.cleaner.enable = false
    offsets.load.buffer.size = 5242880
    log.cleaner.delete.retention.ms = 8640
    ssl.client.auth = none
    controlled.shutdown.max.retries = 3
    queued.max.requests = 500
    offsets.topic.replication.factor = 3
    log.cleaner.threads = 1
    sasl.kerberos.service.name = null
    sasl.kerberos.ticket.renew.jitter = 0.05
    socket.request.max.bytes = 104857600
    ssl.trustmanager.algorithm = PKIX
    zookeeper.session.timeout.ms = 6000
    log.retention.bytes = -1
    sasl.kerberos.min.time.before.relogin = 6
    zookeeper.set.acl = false
    connections.max.idle.ms = 60
    offsets.retention.minutes = 1440
    replica.fetch.backoff.ms = 1000
    inter.broker.protocol.version = 0.8.2.1
    log.retention.hours = 168
    num.partitions = 16
    broker.id.generation.enable = false
    listeners = null
    ssl.provider = null
    ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1]
    log.roll.ms = null
    log.flush.scheduler.interval.ms = 9223372036854775807
    ssl.cipher.suites = null
    log.index.size.max.bytes = 10485760
    ssl.keymanager.algorithm = SunX509
    security.inter.broker.protocol = PLAINTEXT
    replica.fetch.max.bytes = 104857600
    advertised.port = null
    log.cleaner.dedupe.buffer.size = 134217728
    replica.high.watermark.checkpoint.interval.ms = 5000
    log.cleaner.io.buffer.size = 524288
    sasl.kerberos.ticket.renew.window.factor = 0.8
    zookeeper.connection.timeout.ms = 6000
    controlled.shutdown.retry.backoff.ms = 5000
    log.roll.hours = 168
    log.cleanup.policy = delete
    host.name =
    log.roll.jitter.ms = null
    max.connections.per.ip = 2147483647
    offsets.topic.segment.bytes = 104857600
    background.threads = 10
    quota.consumer.default = 9223372036854775807
    request.timeout.ms = 3
    log.index.interval.bytes = 4096
    log.dir = /tmp/kafka-logs
    log.segment.bytes = 268435456
    log.cleaner.backoff.ms = 15000
    offset.metadata.max.bytes = 4096
    ssl.truststore.location = null
    group.max.session.timeout.ms = 3
    ssl.keystore.password = null
    zookeeper.sync.time.ms = 2000
    port = 9092
    log.retention.minutes = null
    log.segment.delete.delay.ms = 6
    log.dirs = /mnt/kafka/data
    controlled.shutdown.enable = true
    compression.type = producer
    max.connections.per.ip.overrides =
    sasl.kerberos.kinit.cmd = /usr/bin/kinit
    log.cleaner.io.max.bytes.per.second = 1.7976931348623157E308
    auto.leader.rebalance.enable = true
    leader.imbalance.check.interval.seconds = 300
    log.cleaner.min.cleanable.ratio = 0.5
    replica.lag.time.max.ms = 1
    num.network.threads = 8
    ssl.key.password = null
    reserved.broker.max.id = 1000
    metrics.num.samples = 2
    socket.send.buffer.bytes = 2097152
    ssl.protocol = TLS
    socket.receive.buffer.bytes = 2097152
    ssl.keystore.location = null
    replica.fetch.min.bytes = 1
    unclean.leader.election.enable = false
    group.min.session.timeout.ms = 6000
    log.cleaner.io.buffer.load.factor = 0.9
    offsets.retention.check.interval.ms = 60
    producer.purgatory.purge.interval.requests = 1000
    metrics.sample.window.ms = 3
    broker.id = 2
    offsets.topic.compression.codec = 0
    log.retention.check.interval.ms = 30
    advertised.listeners = null
    leader.imbalance.per.broker.percenta