Great, thanks for the link Mike. From what I can tell, the only time opening a segment file would be slow is in the event of an unclean shutdown, where a segment file may not have been fsync'd and Kafka needs to CRC it and rebuild its index. This should really only be a problem for the newest log segment, and only for the limited subset of topics configured with a high segment size.
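(For anyone following along: the 2,147,483,647 ceiling mentioned below is just Java's signed 32-bit int limit, and the NumberFormatException in the original post is what plain Integer.parseInt produces for a larger value. A quick sketch in plain Java that mimics the failure mode; this is an illustration, not Kafka's actual config-parsing code:)

```java
// Sketch only: mimics the failure mode, not Kafka's real config parser.
public class SegmentBytesLimit {
    // Returns the parsed value, or -1 if it does not fit in a signed 32-bit int.
    static long parseSegmentBytes(String value) {
        try {
            return Integer.parseInt(value); // throws for anything > 2,147,483,647
        } catch (NumberFormatException e) {
            // e.g. 'For input string: "8589934592"'
            return -1;
        }
    }

    public static void main(String[] args) {
        System.out.println(parseSegmentBytes("2147483647")); // largest accepted value (~2GB)
        System.out.println(parseSegmentBytes("8589934592")); // 8GB: rejected
    }
}
```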
However, needing a larger pointer (and the accompanying increase in index size) just to accommodate large topics may be reason enough to forgo large segments and accept opening 20,000 files per restart, since the common Kafka use case involves a many-small-messages workload.

Thanks for the responses.

On Wed, May 13, 2015 at 11:52 PM, Mike Axiak <m...@axiak.net> wrote:

> Jay Kreps has commented on this before:
>
> https://issues.apache.org/jira/browse/KAFKA-1670?focusedCommentId=14161185&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14161185
>
> Basically, you can always have more segment files. Having segment files
> that are too large will significantly slow down the opening of files,
> which happens whenever a broker comes online or has to recover.
>
> On Thu, May 14, 2015 at 3:10 AM, Mayuresh Gharat <gharatmayures...@gmail.com> wrote:
>
> > I suppose it is the way log management works in Kafka. I am not sure of
> > the exact reason for this. Also, the index files map a relative offset
> > (relative to the log file's base offset) to the physical position in
> > the log file. Each index entry is of the form <Int,Int>.
> >
> > Thanks,
> >
> > Mayuresh
> >
> > On Wed, May 13, 2015 at 5:57 PM, Lance Laursen <llaur...@rubiconproject.com> wrote:
> >
> > > Hey folks,
> > >
> > > Any update on this?
> > >
> > > On Thu, Apr 30, 2015 at 5:34 PM, Lance Laursen <llaur...@rubiconproject.com> wrote:
> > >
> > > > Hey all,
> > > >
> > > > I am attempting to create a topic which uses 8GB log segment sizes,
> > > > like so:
> > > >
> > > > ./kafka-topics.sh --zookeeper localhost:2181 --create --topic perftest6p2r \
> > > >   --partitions 6 --replication-factor 2 \
> > > >   --config max.message.bytes=655360 --config segment.bytes=8589934592
> > > >
> > > > And I am getting the following error:
> > > >
> > > > Error while executing topic command For input string: "8589934592"
> > > > java.lang.NumberFormatException: For input string: "8589934592"
> > > >     at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
> > > >     at java.lang.Integer.parseInt(Integer.java:583)
> > > >     ...
> > > >     ...
> > > >
> > > > Upon further testing with --alter topic, it would appear that
> > > > segment.bytes will not accept a value higher than 2,147,483,647,
> > > > which is the upper limit for a signed 32-bit int. This restricts log
> > > > segment size to an upper limit of ~2GB.
> > > >
> > > > We run Kafka on hard-drive-dense machines, each with 10Gbit uplinks.
> > > > We can set ulimits higher to deal with all the open file handles
> > > > (since Kafka keeps all log segment file handles open), but it would
> > > > be preferable to minimize that number, as well as minimize the
> > > > amount of log segment rollover experienced at high traffic (i.e. a
> > > > rollover every 1-2 seconds or so when saturating 10GbE).
> > > >
> > > > Is there a reason (performance or otherwise) that a 32-bit integer
> > > > is used rather than something larger?
> > > >
> > > > Thanks,
> > > > -Lance
> > >
> > > --
> > >
> > > Leading the Automation of Advertising
> > >
> > > LANCE LAURSEN | Systems Architect
> > >
> > > (M) 310.903.0546
> > >
> > > 12181 BLUFF CREEK DRIVE, 4TH FLOOR, PLAYA VISTA, CA 90094
> > >
> > > RUBICONPROJECT.COM <http://www.rubiconproject.com/> | @RUBICONPROJECT <http://twitter.com/rubiconproject>
> >
> > --
> > -Regards,
> > Mayuresh R. Gharat
> > (862) 250-7125
>
> --