Hi, Lewis,

Thank you for replying. I apologize in advance for asking what might well be a stupid question. We are using the Crawler/InjectorJob/GeneratorJob/FetcherJob/ParserJob source code from the Nutch codebase without any modifications and calling the binary directly.

@Lewis: I used the DataStax library directly to query the keyspace for that host and port combination. I was able to execute CQL queries programmatically and return the result sets. Pinging the hosts returns valid packets.

My gora.properties:

gora.datastore.autocreateschema=true
gora.CassandraStore.autocreateschema=true
gora.cassandrastore.servers=192.161.23.161:9160
gora.cassandrastore.username=<username>
gora.cassandrastore.password=<password>

These values match the data in gora-cassandra-mapping.xml. We are using Nutch 2.2.x.

From: Lewis John Mcgibbney [mailto:lewis.mcgibb...@gmail.com]
Sent: Tuesday, September 30, 2014 8:19 AM
To: user@gora.apache.org
Cc: Nutch Users; Kothuvatiparambil, Viju; Krishnanand, Kartik
Subject: Re: Crawled data not inserting in the tables

Can you also make sure that the cluster name and the fully qualified address and port agree between the mapping and gora.properties?

Thanks

On Tuesday, September 30, 2014, Renato Marroquín Mogrovejo <renatoj.marroq...@gmail.com> wrote:

Hi Kartik,

If TTL hasn't been set or if it has been set to 0, then Gora is not using any TTL [1] and all your data should be persisted without any problems.
Maybe this has something to do with the URL generating/fetching process? Could you determine during which process the data is changing (generate/fetch/parse)?

Thanks!

Renato M.

[1] https://github.com/apache/gora/blob/master/gora-cassandra/src/main/java/org/apache/gora/cassandra/store/HectorUtils.java#L72

2014-09-30 10:00 GMT+02:00 Krishnanand, Kartik <kartik.krishnan...@bankofamerica.com>:

Hi, Talat,

I am afraid that I do not understand. We have set the “ttl” value to 0, which is the default value. We don’t have any portions of data that need to be deleted. For now, I am using a single-node cluster, so for us the gc_grace_seconds="0" default value would be a valid value.
Have I missed out anything? My settings are as follows. Any suggestions would be greatly appreciated.

<gora-orm>
  <keyspace name="projectKeyspace" cluster="MultiTest" host="192.161.23.161:9160" placement_strategy="org.apache.cassandra.locator.NetworkTopologyStrategy">
    <family name="p" />
    <family name="f"/>
    <family name="sc" type="super"/>
    <family name="mtdt" type="super"/>
    <family name="il" type="super"/>
    <family name="ol" type="super"/>
  </keyspace>
  <class keyClass="java.lang.String" name="org.apache.nutch.storage.WebPage" keyspace="projectKeyspace ">
    <!-- fetch fields -->
    <field name="baseUrl" family="f" qualifier="bas"/>
    <field name="status" family="f" qualifier="st"/>
    <field name="prevFetchTime" family="f" qualifier="pts"/>
    <field name="fetchTime" family="f" qualifier="ts"/>
    <field name="fetchInterval" family="f" qualifier="fi"/>
    <field name="retriesSinceFetch" family="f" qualifier="rsf"/>
    <field name="reprUrl" family="f" qualifier="rpr"/>
    <field name="content" family="f" qualifier="cnt"/>
    <field name="contentType" family="f" qualifier="typ"/>
    <field name="modifiedTime" family="f" qualifier="mod"/>
    <field name="prevModifiedTime" family="f" qualifier="pmod"/>
    <field name="batchId" family="f" qualifier="bid"/>
    <!-- parse fields -->
    <field name="title" family="p" qualifier="t"/>
    <field name="text" family="p" qualifier="c"/>
    <field name="signature" family="p" qualifier="sig"/>
    <field name="prevSignature" family="p" qualifier="psig"/>
    <!-- score fields -->
    <field name="score" family="f" qualifier="s"/>
    <!-- super columns -->
    <field name="headers" family="sc" qualifier="h"/>
    <field name="inlinks" family="sc" qualifier="il"/>
    <field name="outlinks" family="sc" qualifier="ol"/>
    <field name="metadata" family="sc" qualifier="mtdt"/>
    <field name="markers" family="sc" qualifier="mk"/>
    <field name="parseStatus" family="sc" qualifier="pas"/>
    <field name="protocolStatus" family="sc" qualifier="prs"/>
  </class>
  <class keyClass="java.lang.String" name="org.apache.nutch.storage.Host" keyspace="projectKeyspace ">
    <field name="metadata" family="mtdt" qualifier="mtdt"/>
    <field name="inlinks" family="il" qualifier="il"/>
    <field name="outlinks" family="ol" qualifier="ol"/>
  </class>
</gora-orm>

Thanks,
Kartik

From: Talat Uyarer [mailto:ta...@uyarer.com]
Sent: Thursday, September 25, 2014 5:04 PM
To: user@gora.apache.org
Cc: u...@nutch.apache.org
Subject: Re: Crawled data not inserting in the tables

Hi Kartik,

The 'problem' is with your mapping settings in gora-cassandra-mapping.xml. Please see the documentation [0], specifically relating to the values for 'gc_grace_seconds' and also 'ttl'. This will fix the problem.

Talat

[0] http://gora.apache.org/current/gora-cassandra.html

Hi, Gora gurus,

I am trying to crawl URLs starting with 12 seed URLs. I am using the Gora Cassandra mapping to store the crawled data. I can confirm that all 12 URLs are not being filtered and are injected, but after running the generate, fetch and parse jobs, there are only 3 entries in “column family” f. I am not sure what I am doing wrong. The logs have not yielded anything relevant. What should I be looking at? Any advice would be gratefully appreciated.

Thanks,
Kartik

________________________________
This message, and any attachments, is for the intended recipient(s) only, may contain information that is privileged, confidential and/or proprietary and subject to important terms and conditions available at http://www.bankofamerica.com/emaildisclaimer. If you are not the intended recipient, please delete this message.
--
Lewis
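
Lewis's suggestion in the thread (check that the address and port agree between gora.properties and the mapping) can be scripted rather than checked by eye. The following is only a rough sketch under stated assumptions: the function names `parse_properties` and `check_consistency` are hypothetical, the comparison is an exact string match (this says nothing about how Gora itself resolves these values), and the cluster name is not compared because the posted gora.properties contains no cluster key. The sample inputs are trimmed copies of the settings quoted in the thread.

```python
import xml.etree.ElementTree as ET


def parse_properties(text):
    """Parse simple key=value lines (no escapes or line continuations)."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def check_consistency(properties_text, mapping_xml):
    """Return a list of mismatch descriptions; empty if none were found."""
    problems = []
    servers = parse_properties(properties_text).get("gora.cassandrastore.servers")
    root = ET.fromstring(mapping_xml)

    declared = set()
    for keyspace in root.findall("keyspace"):
        declared.add(keyspace.get("name"))
        # Assumption: host in the mapping should equal the servers property.
        if keyspace.get("host") != servers:
            problems.append(
                "host %r in mapping differs from servers %r in properties"
                % (keyspace.get("host"), servers)
            )

    for cls in root.findall("class"):
        # Exact match, so stray whitespace in the attribute is reported too.
        if cls.get("keyspace") not in declared:
            problems.append(
                "class %s refers to undeclared keyspace %r"
                % (cls.get("name"), cls.get("keyspace"))
            )
    return problems


# Trimmed copies of the settings quoted in this thread.
PROPERTIES = "gora.cassandrastore.servers=192.161.23.161:9160\n"
MAPPING = """
<gora-orm>
  <keyspace name="projectKeyspace" cluster="MultiTest" host="192.161.23.161:9160"/>
  <class keyClass="java.lang.String" name="org.apache.nutch.storage.WebPage" keyspace="projectKeyspace "/>
</gora-orm>
"""

if __name__ == "__main__":
    for problem in check_consistency(PROPERTIES, MAPPING):
        print(problem)
```

As the mapping appears in this thread, both `class` elements carry `keyspace="projectKeyspace "` with a trailing space, which an exact comparison like this one would report. Whether that space exists in the real file or was introduced by mail formatting cannot be determined from the thread, so treat it as something to verify, not as a diagnosis.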