Good to know we are not too bad after all ;)
What I find interesting about this is the possibility of adding new fields
to our defined records. Is this what they are doing over there, Lewis?


Renato M.


2013/6/21 Lewis John Mcgibbney <[email protected]>

> On second thoughts... a shining angel seems to have just landed on my
> shoulder.
> https://issues.apache.org/jira/browse/NUTCH-1591
>
>
> On Fri, Jun 21, 2013 at 11:51 AM, Lewis John Mcgibbney <
> [email protected]> wrote:
>
> > Hi,
> > I am coming to the realisation that there are some lingering bugs within
> > the gora-cassandra module which only come to light when we run large MR
> > jobs.
> > I have continuous crawls which use gora-cassandra 0.3 to push/query data
> > to Cassandra 1.1.2... which is what we currently support in Gora.
> > Injecting millions of URLs works fine. Don't get me wrong, I see high CPU
> > but it all works well. Same with GeneratorJob.
> > In InjectorJob we use the following static fields within the persisted
> > WebPage. I've added the data type as a trailing comment beside each
> > field.
> >
> > static {
> >   FIELDS.add(WebPage.Field.MARKERS); // map
> >   FIELDS.add(WebPage.Field.STATUS);  // int
> > }
> >
> > In GeneratorJob we add the following:
> >
> > static {
> >   FIELDS.add(WebPage.Field.FETCH_TIME); // long
> >   FIELDS.add(WebPage.Field.SCORE);      // float
> >   FIELDS.add(WebPage.Field.STATUS);     // int
> >   FIELDS.add(WebPage.Field.MARKERS);    // map
> > }
> >
> > However, in ParserJob we add the following, and I see memory usage
> > climb past 7GB and CPU go above 95% on 4 cores.
> >
> > static {
> >   FIELDS.add(WebPage.Field.STATUS);       // int
> >   FIELDS.add(WebPage.Field.CONTENT);      // bytes
> >   FIELDS.add(WebPage.Field.CONTENT_TYPE); // string
> >   FIELDS.add(WebPage.Field.SIGNATURE);    // bytes
> >   FIELDS.add(WebPage.Field.MARKERS);      // map
> >   FIELDS.add(WebPage.Field.PARSE_STATUS); // nested record
> >   FIELDS.add(WebPage.Field.OUTLINKS);     // map
> >   FIELDS.add(WebPage.Field.METADATA);     // map
> >   FIELDS.add(WebPage.Field.HEADERS);      // map
> > }
> >
> > Yes, ParserJob is much more demanding than the previous two; however,
> > there is no justification for the memory and CPU footprint I am getting.
> > It has been noted that running this stuff on HBase is fine, but on
> > Cassandra it is not.
> >
> > I wonder if anyone can comment on the above as I am very very keen to
> > address this.
> > Thanks
> > Lewis
> >
> > --
> > *Lewis*
> >
>
>
>
> --
> *Lewis*
>
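
For anyone reading this thread in the archive, a small sketch may help show which fields only ParserJob requests. It assumes nothing about Nutch or Gora internals: the `Field` enum is a hypothetical stand-in for `WebPage.Field`, limited to the names quoted in the original message, and the class and method names are made up for illustration.

```java
import java.util.EnumSet;
import java.util.Set;

class JobFieldSets {
    // Hypothetical stand-in for WebPage.Field; only the names quoted in the thread.
    enum Field {
        STATUS, MARKERS, FETCH_TIME, SCORE, CONTENT, CONTENT_TYPE,
        SIGNATURE, PARSE_STATUS, OUTLINKS, METADATA, HEADERS
    }

    // Field sets as listed in the message for each job.
    static final Set<Field> INJECTOR =
        EnumSet.of(Field.MARKERS, Field.STATUS);

    static final Set<Field> GENERATOR =
        EnumSet.of(Field.FETCH_TIME, Field.SCORE, Field.STATUS, Field.MARKERS);

    static final Set<Field> PARSER =
        EnumSet.of(Field.STATUS, Field.CONTENT, Field.CONTENT_TYPE,
                   Field.SIGNATURE, Field.MARKERS, Field.PARSE_STATUS,
                   Field.OUTLINKS, Field.METADATA, Field.HEADERS);

    // Fields requested by the parser set but by neither of the other two sets.
    static Set<Field> onlyInParser() {
        Set<Field> extra = EnumSet.copyOf(PARSER);
        extra.removeAll(INJECTOR);
        extra.removeAll(GENERATOR);
        return extra;
    }

    public static void main(String[] args) {
        System.out.println("Only in ParserJob: " + onlyInParser());
    }
}
```

The fields left over are CONTENT, CONTENT_TYPE, SIGNATURE, PARSE_STATUS, OUTLINKS, METADATA and HEADERS, i.e. the bytes, string, map and nested-record types from the listing. That is consistent with ParserJob moving far more data per row than the other two jobs, but it does not by itself explain the difference between Cassandra and HBase.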
