Good to know we are not too bad after all ;) What I find interesting about this is the possibility of adding new fields to our defined records. Is this what they are doing over there, Lewis?
Renato M.

2013/6/21 Lewis John Mcgibbney <[email protected]>

> On second thoughts... a shining angel seems to have just landed on my
> shoulder.
> https://issues.apache.org/jira/browse/NUTCH-1591
>
>
> On Fri, Jun 21, 2013 at 11:51 AM, Lewis John Mcgibbney <
> [email protected]> wrote:
>
> > Hi,
> > I am coming to the realisation that there are some lingering bugs within
> > the gora-cassandra module which only come to light when we run large MR
> > jobs.
> > I have continuous crawls which use gora-cassandra 0.3 to push/query data
> > to Cassandra 1.1.2... which is what we currently support in Gora.
> > Injecting millions of URLs works fine. Don't get me wrong, I see high CPU
> > but it all works well. Same with GeneratorJob.
> > In InjectorJob we use the following static fields within the persisted
> > WebPage. I've added the data type in brackets beside the field.
> >
> > static {
> >   FIELDS.add(WebPage.Field.MARKERS); map
> >   FIELDS.add(WebPage.Field.STATUS); int
> > }
> >
> > In GeneratorJob we add the following
> >
> > static {
> >   FIELDS.add(WebPage.Field.FETCH_TIME); long
> >   FIELDS.add(WebPage.Field.SCORE); float
> >   FIELDS.add(WebPage.Field.STATUS); int
> >   FIELDS.add(WebPage.Field.MARKERS); map
> > }
> >
> > However in ParserJob we add the following and I see my memory just sucked
> > up >7GB and also my CPU rocketing in 4 cores >95%.
> >
> > static {
> >   FIELDS.add(WebPage.Field.STATUS); int
> >   FIELDS.add(WebPage.Field.CONTENT); bytes
> >   FIELDS.add(WebPage.Field.CONTENT_TYPE); string
> >   FIELDS.add(WebPage.Field.SIGNATURE); bytes
> >   FIELDS.add(WebPage.Field.MARKERS); map
> >   FIELDS.add(WebPage.Field.PARSE_STATUS); nested record
> >   FIELDS.add(WebPage.Field.OUTLINKS); map
> >   FIELDS.add(WebPage.Field.METADATA); map
> >   FIELDS.add(WebPage.Field.HEADERS); map
> > }
> >
> > Yes ParserJob is much more challenging than the previous two however there
> > is no justification for the memory and CPU footprint I am getting. It has
> > been noted that running this stuff on HBase is fine, Cassandra is not.
> >
> > I wonder if anyone can comment on the above as I am very very keen to
> > address this.
> > Thanks
> > Lewis
> >
> > --
> > *Lewis*
> >
>
>
> --
> *Lewis*
>
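As a rough way to narrow down the quoted problem, it may help to look at which fields ParserJob requests that InjectorJob and GeneratorJob do not. The sketch below is only an illustration: the Field enum is a hypothetical stand-in for WebPage.Field containing just the fields named in this thread, and the class does not touch Nutch or Gora at all.

```java
import java.util.EnumSet;
import java.util.Set;

class FieldDiff {
    // Hypothetical stand-in for WebPage.Field; only the fields listed in the thread.
    enum Field {
        MARKERS, STATUS, FETCH_TIME, SCORE, CONTENT, CONTENT_TYPE,
        SIGNATURE, PARSE_STATUS, OUTLINKS, METADATA, HEADERS
    }

    // Field sets copied from the static blocks quoted above.
    static final Set<Field> INJECTOR = EnumSet.of(Field.MARKERS, Field.STATUS);

    static final Set<Field> GENERATOR = EnumSet.of(
            Field.FETCH_TIME, Field.SCORE, Field.STATUS, Field.MARKERS);

    static final Set<Field> PARSER = EnumSet.of(
            Field.STATUS, Field.CONTENT, Field.CONTENT_TYPE, Field.SIGNATURE,
            Field.MARKERS, Field.PARSE_STATUS, Field.OUTLINKS, Field.METADATA,
            Field.HEADERS);

    // Fields requested by ParserJob but by neither of the two jobs that run fine.
    static Set<Field> onlyInParser() {
        Set<Field> diff = EnumSet.copyOf(PARSER);
        diff.removeAll(INJECTOR);
        diff.removeAll(GENERATOR);
        return diff;
    }

    public static void main(String[] args) {
        System.out.println(onlyInParser());
    }
}
```

This leaves seven fields (CONTENT, CONTENT_TYPE, SIGNATURE, PARSE_STATUS, OUTLINKS, METADATA, HEADERS), most of them bytes, maps or the nested record. That does not prove where the memory goes, but those would seem to be the first candidates to test one at a time against Cassandra.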

