Hi guys,

Just a quick update on some findings here.
With 1M URLs and gora.buffer.write.limit settings of 1000 and 100 respectively (on a reasonably powerful machine) I get the following results:

1000 limit
- time elapsed: 9m42s or 582s
- writes p/s: 1718

100 limit
- time elapsed: 9m33s or 573s
- writes p/s: 1745

So reducing the write buffer limit (with Cassandra) to the low value of 100 knocks 1.5ish% off execution time and increases write throughput to Cassandra by around 25 p/s... which is really what we expect from Cassandra anyway. I am happy with these results, so I'll stick to low maximum limits for buffered writes (with Cassandra) from now on.

Have a great weekend.
Lewis

On Tue, Mar 5, 2013 at 10:09 AM, Lewis John Mcgibbney <lewis.mcgibb...@gmail.com> wrote:
> Thanks for the input Roland. I share a similar use case.
> @Renato, the gora.buffer.write.limit property can be overridden within the Hadoop Configuration. AFAIK we can override it in nutch-site.xml if using Nutch or core-site.xml if using Gora over Hadoop.
> This is the way I have been tinkering.
> I was curious about what performance gains could be obtained.
>
> On Tuesday, March 5, 2013, Renato Marroquín Mogrovejo <renatoj.marroq...@gmail.com> wrote:
> > This is a very interesting topic to discuss, thank you for starting it Lewis (:
> > I think we have to think about two different application types, the ones doing real-time processing, and the ones doing batch processing. For the former, a smaller flush threshold is probably a better choice, and for the latter a value depending on the application should be used, i.e. different applications might consider "batch operations" differently.
> > Just one quick question here Lewis, is it possible to set this parameter through the configuration file? Or is it always hard-coded? I think it should be settable from outside Gora without having to recompile Gora every time we want to change it. What do you guys think?
> >
> > Renato M.
> >
> > On Mar 5, 2013 7:23 AM, "Roland" <rol...@rvh-gmbh.de> wrote:
> >> Hi Lewis,
> >>
> >> for me (Nutch use case) a lower value is better, for 3 main reasons:
> >> a) load is better distributed for the db backend
> >> b) when running the Nutch FetcherJob, towards the end of the job you don't have to wait for Gora to flush all data to the backend, because it was mostly done during the fetching
> >> c) during debugging you'll get Gora/Cassandra flushing errors much earlier
> >>
> >> I'm running with a 1k write buffer for Cassandra.
> >>
> >> --Roland
> >>
> >> On 01.03.2013 02:01, Lewis John Mcgibbney wrote:
> >>
> >> Hi,
> >> We use the above class for write operations in the Nutch InjectorJob.
> >> I am writing large URL lists to Cassandra using Gora and wonder if I can get it working better.
> >> Currently I am getting around 10000 writes per 90 seconds. Don't get me wrong, I am working from a very primitive laptop and right now I am merely attempting to push the software.
> >> What I want to know is, what is the consequence of altering the BUFFER_LIMIT_WRITE_VALUE?
> >> Currently we set a default of 10K for this limit, meaning that Gora batches flushes to reflect this value.
> >> Is a higher or lower value better? Is there any evidence of better performance by changing this value?
> >> I see it as pretty critical, so I want to understand more about this.
> >> Thanks
> >> Lewis
> >>
> >> --
> >> Lewis
> >
>
> --
> *Lewis*

--
*Lewis*
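
A note on what the write limit does mechanically, for anyone reading the thread later. The following is a toy Python model, not Gora's actual code, of a writer that flushes whenever the number of buffered records reaches the limit. The function name simulate and the record count of 12,345 are made up for illustration; only the limits (the 10K default, 1000 and 100) come from the thread.

```python
def simulate(total_records, limit):
    """Toy model of a buffered writer (hypothetical, not Gora's implementation).

    Returns (flushes triggered during the job, records still buffered at close).
    """
    if limit < 1:
        raise ValueError("limit must be at least 1")
    pending = 0   # records buffered but not yet sent to the backend
    flushes = 0   # flushes triggered by reaching the limit
    for _ in range(total_records):
        pending += 1
        if pending == limit:
            flushes += 1
            pending = 0
    return flushes, pending


if __name__ == "__main__":
    total = 12_345  # made-up record count, chosen so every limit leaves a remainder
    for limit in (10_000, 1_000, 100):
        flushes, left = simulate(total, limit)
        print(f"limit={limit}: {flushes} flushes during the job, "
              f"{left} records left for the final flush")
```

With these inputs the 10000 limit leaves 2345 records for the final flush, against 345 for a limit of 1000 and 45 for a limit of 100, which is the effect Roland describes in (b). The model says nothing about real Cassandra throughput; for that, the measured numbers at the top of this mail are the thing to go by.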