RE: Crawled data not inserting in the tables

Krishnanand, Kartik Tue, 30 Sep 2014 16:53:22 -0700

Hi, Lewis

Thank you for replying.  I apologize in advance for asking what might well be a 
stupid question.  We are using the 
Crawler/InjectorJob/GeneratorJob/FetcherJob/ParserJob source code from the 
Nutch codebase without any modifications and calling the binary directly.


@Lewis: I used the datastax library directly to query the keyspace for that 
host and port combination. I was able to execute CQL queries programmatically 
and return the result sets. Pinging the hosts returns valid packets.  My 
gora.properties is as follows:

gora.datastore.autocreateschema=true
gora.CassandraStore.autocreateschema=true
gora.cassandrastore.servers=192.161.23.161:9160
gora.cassandrastore.username=<username>
gora.cassandrastore.password=<password>

These values match the ones in gora-cassandra-mapping.xml.
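
A minimal sketch of how that agreement could be checked mechanically rather than by eye. It compares the gora.cassandrastore.servers value from gora.properties with the host attribute of each keyspace element in the mapping. The sample strings are copied from this thread; the function names (parse_properties, mapping_hosts, find_mismatches) are hypothetical helpers, not part of Gora or Nutch, and in practice the two texts would be read from wherever the configuration files live.

```python
import xml.etree.ElementTree as ET

# Sample inputs copied from this thread. In a real check these would be
# read from the gora.properties and gora-cassandra-mapping.xml files.
GORA_PROPERTIES = """\
gora.datastore.autocreateschema=true
gora.cassandrastore.servers=192.161.23.161:9160
"""

MAPPING_XML = """\
<gora-orm>
  <keyspace name="projectKeyspace" cluster="MultiTest" host="192.161.23.161:9160"/>
</gora-orm>
"""


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def mapping_hosts(xml_text):
    """Return {keyspace name: host attribute} for every <keyspace> element."""
    root = ET.fromstring(xml_text)
    return {ks.get("name"): ks.get("host") for ks in root.iter("keyspace")}


def find_mismatches(props_text, xml_text):
    """List keyspaces whose host differs from gora.cassandrastore.servers."""
    servers = parse_properties(props_text).get("gora.cassandrastore.servers")
    return [
        (name, host, servers)
        for name, host in mapping_hosts(xml_text).items()
        if host != servers
    ]


if __name__ == "__main__":
    for name, host, servers in find_mismatches(GORA_PROPERTIES, MAPPING_XML):
        print(f"keyspace {name!r}: mapping host {host!r} != properties {servers!r}")
```

The comparison is an exact string match, so a different port or stray whitespace on either side is reported as a mismatch.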

We are using Nutch 2.2.x for our purpose.



From: Lewis John Mcgibbney [mailto:lewis.mcgibb...@gmail.com]
Sent: Tuesday, September 30, 2014 8:19 AM
To: user@gora.apache.org
Cc: Nutch Users; Kothuvatiparambil, Viju; Krishnanand, Kartik
Subject: Re: Crawled data not inserting in the tables

Can you also make sure that the cluster name and the fully qualified address 
and port agree between the mapping and gora.properties?
Thanks

On Tuesday, September 30, 2014, Renato Marroquín Mogrovejo 
<renatoj.marroq...@gmail.com> wrote:
Hi Kartik,

If TTL hasn't been set or if it has been set to 0, then Gora is not using any 
TTL[1] and all your data should be persisted without any problems.
Maybe this has something to do with the URL generating/fetching process? Could 
you determine during which process the data is changing? (generate/fetch/parse)
Thanks!


Renato M.

[1] 
https://github.com/apache/gora/blob/master/gora-cassandra/src/main/java/org/apache/gora/cassandra/store/HectorUtils.java#L72

2014-09-30 10:00 GMT+02:00 Krishnanand, Kartik 
<kartik.krishnan...@bankofamerica.com>:
Hi, Talat

I am afraid that I do not understand.  We have set the “ttl” value to 0, which 
is the default value. We don’t have any portions of data that need to be 
deleted.  For now, I am using a single-node cluster, so for us the default 
gc_grace_seconds="0" would be a valid value.

Have I missed out anything? My settings are as follows. Any suggestions would 
be greatly appreciated.

<gora-orm>

    <keyspace name="projectKeyspace" cluster="MultiTest" 
host="192.161.23.161:9160" 
placement_strategy="org.apache.cassandra.locator.NetworkTopologyStrategy">
        <family name="p" />
        <family name="f"/>
        <family name="sc" type="super"/>

        <family name="mtdt" type="super"/>
        <family name="il" type="super"/>
        <family name="ol" type="super"/>
    </keyspace>

    <class keyClass="java.lang.String" name="org.apache.nutch.storage.WebPage" 
keyspace="projectKeyspace ">

        <!-- fetch fields -->
        <field name="baseUrl" family="f" qualifier="bas"/>
        <field name="status" family="f" qualifier="st"/>
        <field name="prevFetchTime" family="f" qualifier="pts"/>
        <field name="fetchTime" family="f" qualifier="ts"/>
        <field name="fetchInterval" family="f" qualifier="fi"/>
        <field name="retriesSinceFetch" family="f" qualifier="rsf"/>
        <field name="reprUrl" family="f" qualifier="rpr"/>
        <field name="content" family="f" qualifier="cnt"/>
        <field name="contentType" family="f" qualifier="typ"/>
        <field name="modifiedTime" family="f" qualifier="mod"/>
        <field name="prevModifiedTime" family="f" qualifier="pmod"/>
        <field name="batchId" family="f" qualifier="bid"/>

        <!-- parse fields -->
        <field name="title" family="p" qualifier="t"/>
        <field name="text" family="p" qualifier="c"/>
        <field name="signature" family="p" qualifier="sig"/>
        <field name="prevSignature" family="p" qualifier="psig"/>

        <!-- score fields -->
        <field name="score" family="f" qualifier="s"/>

        <!-- super columns -->
        <field name="headers" family="sc" qualifier="h"/>
        <field name="inlinks" family="sc" qualifier="il"/>
        <field name="outlinks" family="sc" qualifier="ol"/>
        <field name="metadata" family="sc" qualifier="mtdt"/>
        <field name="markers" family="sc" qualifier="mk"/>
        <field name="parseStatus" family="sc" qualifier="pas"/>
        <field name="protocolStatus" family="sc" qualifier="prs"/>
    </class>


    <class keyClass="java.lang.String" name="org.apache.nutch.storage.Host" 
keyspace="projectKeyspace ">
        <field name="metadata" family="mtdt" qualifier="mtdt"/>
        <field name="inlinks" family="il" qualifier="il"/>
        <field name="outlinks" family="ol" qualifier="ol"/>
    </class>

</gora-orm>

Thanks,

Kartik
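
A mapping like the one above can also be checked for internal consistency. The following is a rough sketch, not Gora's own validation logic: it verifies that every class refers to a declared keyspace name and that every field uses a declared column family. The comparison is an exact string match, so stray whitespace inside an attribute value is reported; whether Gora itself trims such values is not established in this thread. The function name check_mapping is hypothetical, and the sample is a trimmed copy of the mapping with attribute values kept verbatim.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the mapping from this message; attribute values verbatim.
MAPPING_XML = """\
<gora-orm>
  <keyspace name="projectKeyspace" cluster="MultiTest" host="192.161.23.161:9160">
    <family name="p"/>
    <family name="f"/>
    <family name="sc" type="super"/>
  </keyspace>
  <class keyClass="java.lang.String" name="org.apache.nutch.storage.WebPage"
         keyspace="projectKeyspace ">
    <field name="baseUrl" family="f" qualifier="bas"/>
    <field name="title" family="p" qualifier="t"/>
    <field name="headers" family="sc" qualifier="h"/>
  </class>
</gora-orm>
"""


def check_mapping(xml_text):
    """Return a list of human-readable inconsistencies found in the mapping."""
    root = ET.fromstring(xml_text)

    # Map each declared keyspace name to the set of family names it declares.
    declared = {
        ks.get("name"): {fam.get("name") for fam in ks.iter("family")}
        for ks in root.iter("keyspace")
    }
    every_family = set().union(*declared.values())

    problems = []
    for cls in root.iter("class"):
        cls_name = cls.get("name")
        ks_name = cls.get("keyspace")
        if ks_name not in declared:
            problems.append(f"class {cls_name}: keyspace {ks_name!r} is not declared")
        # Fall back to all families so one keyspace typo does not flag every field.
        families = declared.get(ks_name, every_family)
        for field in cls.iter("field"):
            if field.get("family") not in families:
                problems.append(
                    f"class {cls_name}: field {field.get('name')!r} uses "
                    f"undeclared family {field.get('family')!r}"
                )
    return problems


if __name__ == "__main__":
    for problem in check_mapping(MAPPING_XML):
        print(problem)
```

On the trimmed sample this reports one item: the WebPage class refers to the keyspace 'projectKeyspace ' (with a trailing space), which does not exactly equal the declared name.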

From: Talat Uyarer [mailto:ta...@uyarer.com]
Sent: Thursday, September 25, 2014 5:04 PM
To: user@gora.apache.org
Cc: u...@nutch.apache.org
Subject: Re: Crawled data not inserting in the tables


Hi Kartik,

The 'problem' is with your mapping settings in gora-cassandra-mapping.xml. 
Please see the documentation [0], specifically relating to the values for 
'gc_grace_seconds' and also 'ttl'. This will fix the problem.

Talat

[0] http://gora.apache.org/current/gora-cassandra.html

Hi, Gora gurus,

I am trying to crawl URLs starting with 12 seed URLs. I am using the Gora 
Cassandra mapping to store the crawled data.

I can confirm that none of the 12 URLs is being filtered out and that all are 
injected, but after running the generate, fetch and parse jobs there are only 
3 entries in “column family” f.

I am not sure what I am doing wrong. The logs have not yielded anything 
relevant. What should I be looking at?

Any advice would be greatly appreciated.

Thanks,

Kartik
________________________________
This message, and any attachments, is for the intended recipient(s) only, may 
contain information that is privileged, confidential and/or proprietary and 
subject to important terms and conditions available at 
http://www.bankofamerica.com/emaildisclaimer. If you are not the intended 
recipient, please delete this message.



--
Lewis

