Hi Guys,

Just a quick update on some findings here

With 1M URLs and gora.buffer.write.limit set to 1000 and then to 100
(on a reasonably powerful machine), I get the following results:

1000 limit
-time elapsed: 9m42s or 582s
-writes p/s 1718

100 limit
-time elapsed: 9m33s or 573s
-writes p/s 1745

So reducing the write buffer limit (with Cassandra) to the low value of 100
knocks about 1.5% off execution time and increases write throughput to
Cassandra by around 25 writes p/s... which is really what we expect from
Cassandra anyway.
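
For anyone who wants to re-derive the figures above from the elapsed times, here is a small sketch. The class and method names are made up for illustration and are not part of Gora or Nutch; the only inputs are the 1M URLs and the two elapsed times quoted above.

```java
public class WriteThroughput {

    // Writes per second, rounded to the nearest whole number.
    static long writesPerSecond(long totalWrites, long elapsedSeconds) {
        return Math.round((double) totalWrites / elapsedSeconds);
    }

    // Percentage reduction in elapsed time when going from beforeSeconds to afterSeconds.
    static double percentFaster(long beforeSeconds, long afterSeconds) {
        return 100.0 * (beforeSeconds - afterSeconds) / beforeSeconds;
    }

    public static void main(String[] args) {
        long urls = 1_000_000L;
        System.out.println(writesPerSecond(urls, 582)); // 1718 (limit 1000)
        System.out.println(writesPerSecond(urls, 573)); // 1745 (limit 100)
        // Roughly 1.5, matching the "1.5ish%" quoted above.
        System.out.println(percentFaster(582, 573));
    }
}
```

The rounded throughput difference works out to 27 writes p/s, which is in line with the "around 25" estimate.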

I am happy with these results, so I'll stick to low maximum limits for
buffered writes (with Cassandra) from now on.
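
To make the effect of the limit concrete, here is a minimal sketch of a writer that buffers records and flushes once the buffer reaches a limit. This is NOT Gora's actual implementation; the class BufferedWriterSketch and all of its methods are hypothetical, and the "flush" just clears an in-memory list instead of talking to a datastore. It only illustrates why a lower limit leaves less work for the final flush at the end of a job.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the Gora record writer.
public class BufferedWriterSketch {
    private final int limit;                       // flush threshold (number of buffered records)
    private final List<String> pending = new ArrayList<>();
    private int flushCount = 0;

    BufferedWriterSketch(int limit) {
        this.limit = limit;
    }

    // Buffer one record and flush as soon as the limit is reached.
    void write(String record) {
        pending.add(record);
        if (pending.size() >= limit) {
            flush();
        }
    }

    // A real writer would send the batch to the backend here.
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        pending.clear();
        flushCount++;
    }

    int pendingSize() { return pending.size(); }

    int flushCount() { return flushCount; }

    public static void main(String[] args) {
        BufferedWriterSketch high = new BufferedWriterSketch(1000);
        BufferedWriterSketch low = new BufferedWriterSketch(100);
        for (int i = 0; i < 2500; i++) {
            high.write("url-" + i);
            low.write("url-" + i);
        }
        // Limit 1000: 2 flushes so far, 500 records still waiting for the final flush.
        System.out.println(high.flushCount() + " flushes, " + high.pendingSize() + " pending");
        // Limit 100: 25 flushes so far, nothing left waiting.
        System.out.println(low.flushCount() + " flushes, " + low.pendingSize() + " pending");
    }
}
```

In this toy model the number of records left for the final flush is always smaller than the limit, so a smaller limit spreads the writes more evenly over the run.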

Have a great weekend.
Lewis


On Tue, Mar 5, 2013 at 10:09 AM, Lewis John Mcgibbney <
lewis.mcgibb...@gmail.com> wrote:

> Thanks for the input Roland. I share a similar use case.
> @Renato, the gora.buffer.write.limit property can be overridden within the
> Hadoop Configuration. AFAIK we can override it in nutch-site.xml if using
> Nutch or core-site.xml if using Gora over Hadoop.
> This is the way I have been tinkering.
> I was curious about obtaining performance gains.
>
>
> On Tuesday, March 5, 2013, Renato Marroquín Mogrovejo <
> renatoj.marroq...@gmail.com> wrote:
> > This is a very interesting topic to discuss; thank you for starting
> it, Lewis (:
> > I think we have to think about two different application types, the ones
> doing real time processing, and the ones doing batch processing. For the
> former, a smaller flush-threshold is probably a better choice, and for
> the latter a value depending on the application should be used, i.e.
> different applications might consider "batch operations" differently.
> > Just one quick question here, Lewis: is it possible to set this
> parameter through the configuration file, or is it always hard-coded? I
> think it should be settable from outside Gora without having to recompile
> Gora every time we want to change it. What do you guys think?
> >
> >
> > Renato M.
> >
> > On Mar 5, 2013 7:23 AM, "Roland" <rol...@rvh-gmbh.de> wrote:
> >> Hi Lewis,
> >>
> >> for me (nutch use case) a lower value is better, because of 3 main
> reasons:
> >> a) load is better distributed for the db backend
> >> b) when running the nutch fetcherJob, towards the end of the job you
> don't have to wait for gora flushing all data to backend, because it was
> mostly done during the fetching
> >> c) during debugging you'll get gora/cassandra flushing errors much
> earlier
> >>
> >> I'm running with 1k write buffer for cassandra.
> >>
> >> --Roland
> >>
> >> On 01.03.2013 02:01, Lewis John Mcgibbney wrote:
> >>
> >> Hi,
> >> We use the above class for write operations in the Nutch InjectorJob.
> >> I am writing large URL lists to Cassandra using Gora and wonder if I
> can get it working better.
> >> Currently I am getting around 10000 writes per 90 seconds. Don't get me
> wrong, I am working from a very primitive laptop and right now I am merely
> attempting to push the software.
> >> What I want to know, is what is the consequence of altering the
> BUFFER_LIMIT_WRITE_VALUE?
> >> Currently we set a default value of 10K for the limit on this value,
> meaning that Gora batches flushes to reflect this value.
> >> Is a higher or lower value better? Is there any evidence of better
> performance by changing this value?
> >> I see it as pretty critical so I am wanting to understand more about
> this.
> >> Thanks
> >> Lewis
> >>
> >> --
> >> Lewis
> >>
> >>
> >
>
> --
> *Lewis*
>
>


-- 
*Lewis*
