RE: URL in crawldb not appearing in Solr after indexing.

Os Tyler Tue, 30 Jul 2013 17:56:13 -0700

A little progress. I edited 'crawl' and added "-adddays 2", prompting the crawl 
to include the URL in question, which had a fetch time of:
> Fetch time: Wed Jul 31 01:22:44 EDT 2013
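For anyone following along, the effect of "-adddays" can be sketched as simple arithmetic: the generate step treats the current time as shifted forward by N days, so a URL whose fetch time is still slightly in the future becomes due. The function name is_due is made up for illustration, and this is a simplification that ignores other scheduling checks Nutch may apply. The epoch values are my own conversion of the two timestamps quoted in this thread.

```shell
# is_due FETCH_EPOCH NOW_EPOCH [ADD_DAYS]
# Hypothetical helper: succeeds (exit 0) when the fetch time is not later
# than "now" shifted forward by ADD_DAYS days.
is_due() {
  fetch_epoch=$1
  now_epoch=$2
  add_days=${3:-0}
  horizon=$(( now_epoch + add_days * 86400 ))
  [ "$fetch_epoch" -le "$horizon" ]
}

# Epoch seconds (UTC) converted from the timestamps in this thread:
fetch=1375248164   # Fetch time: Wed Jul 31 01:22:44 EDT 2013
now=1375232173     # message sent: Tue, 30 Jul 2013 17:56:13 -0700

if is_due "$fetch" "$now" 0; then
  echo "due without -adddays"
else
  echo "not due without -adddays"
fi

if is_due "$fetch" "$now" 2; then
  echo "due with -adddays 2"
else
  echo "not due with -adddays 2"
fi
```

With these values the URL is about 4.4 hours short of being due, which would explain why it only reached a segment once the extra days were added.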

Now the URL I am after is in a segment (I ran readseg against the segment and 
can see the URL there):
Recno:: 3252
URL:: http://redacted.com/files/ppb/ppb_3j_002_vacation_policy.pdf

CrawlDatum::
Version: 7
Status: 33 (fetch_success)

But after I run solrindex against that specific segment, the URL is still 
not visible in the Solr search results.
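The excerpt above shows only the CrawlDatum section, so it does not say whether the segment also holds parsed data for the URL. A rough way to check, assuming a plain-text dump written by "readseg -dump" and assuming the section headers are CrawlDatum::, Content::, ParseData:: and ParseText:: (only the first appears in this thread), is sketched here. The function name sections_for_url and the paths in the usage comment are made up for illustration.

```shell
# sections_for_url DUMP_FILE URL
# Hypothetical helper: prints the section names found in the record(s) for
# URL, one per line. A section that occurs twice is printed twice.
sections_for_url() {
  awk -v url="$2" '
    /^Recno:: /                 { in_rec = 0 }         # a new record starts
    $0 == ("URL:: " url)        { in_rec = 1; next }   # the record we want
    in_rec && /^(CrawlDatum|Content|ParseData|ParseText)::$/ {
      sub(/::$/, "")                                   # drop the trailing "::"
      print
    }
  ' "$1"
}

# Example call (paths are placeholders, not taken from this thread):
#   sections_for_url /tmp/segdump/dump "http://example.com/some.pdf"
```

If ParseData and ParseText are missing from the output, the segment was fetched but not parsed, which would be one explanation for solrindex skipping the document.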

Any further thoughts or suggestions?

________________________________________
From: Os Tyler [[email protected]]
Sent: Tuesday, July 30, 2013 2:54 PM
To: [email protected]
Subject: RE: URL in crawldb not appearing in Solr after indexing.

Thanks, I appreciate your help.

None of the existing segments contains the relevant URL. Is there a way to 
ensure a specific URL makes it to a segment?

________________________________________
From: Sebastian Nagel [[email protected]]
Sent: Tuesday, July 30, 2013 2:13 PM
To: [email protected]
Subject: Re: URL in crawldb not appearing in Solr after indexing.

Hi,

the signature of the document is null in CrawlDb.
The signature is calculated when parsing the document, so:
- has parsing taken place?
- truncated content?
- parse failure?
etc.

> How do I specifically request that an entry in crawldb gets pushed to Solr?
You have to run solrindex on the segment which contains the fetched and parsed 
data.

To check whether the segment contains all required data, you can use
% bin/nutch readseg ...
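As a rough sketch of those two steps together (inspect one URL in a segment, then index that segment), something like the following could work. The arguments in the command above are elided and are left that way; every path, the Solr URL and the function name inspect_and_index below are placeholders. The argument order for solrindex is assumed from the Nutch 1.x usage message and may differ between versions, so check the usage output of your own installation first.

```shell
NUTCH=${NUTCH:-bin/nutch}   # path to the nutch script; override as needed

# inspect_and_index SOLR_URL CRAWLDB SEGMENT URL
# Hypothetical wrapper around the two commands discussed above.
inspect_and_index() {
  solr_url=$1
  crawldb=$2
  segment=$3
  url=$4
  # Print everything the segment holds for this one URL.
  "$NUTCH" readseg -get "$segment" "$url" || return 1
  # Index only this segment (argument order assumed, see note above).
  "$NUTCH" solrindex "$solr_url" "$crawldb" "$segment"
}

# Example call (all values are placeholders):
#   inspect_and_index http://localhost:8983/solr crawl/crawldb \
#       crawl/segments/20130730000000 "http://example.com/some.pdf"
```

Setting NUTCH=echo before calling the function prints the two commands instead of running them, which is a cheap way to review them first.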

Sebastian

On 07/30/2013 06:48 PM, Os Tyler wrote:
> Hello,
>
> I have successfully deployed Solr on our development environment and our 
> stage environment, but I am running into an anomaly the third time around.
>
> I have a specific URL that appears in the crawldb but is not showing up 
> when I search from the Solr interface. How do I specifically request that an 
> entry in crawldb gets pushed to Solr?
>
> I have run solrindex multiple times and it does not produce any errors. 
> readdb, parsechecker and indexchecker all return positive results for this 
> URL. The configuration on the to-be-production machine is identical to dev 
> and stage, where the URL appears correctly in Solr.
>
> /usr/local/apache-nutch/bin/nutch readdb 
> /usr/local/apache-nutch/intranet/crawldb/ -url 
> http://redacted.com/ppb/ppb_3j_002_vacation_policy.pdf
>
> URL: http://redacted.com/ppb/ppb_3j_002_vacation_policy.pdf
> Version: 7
> Status: 2 (db_fetched)
> Fetch time: Wed Jul 31 01:22:44 EDT 2013
> Modified time: Wed Dec 31 19:00:00 EST 1969
> Retries since fetch: 0
> Retry interval: 100000 seconds (1 days)
> Score: 6.5826543E-4
> Signature: null
> Metadata: Content-Type: application/pdf_pst_: success(1), lastModified=0
>
>
>
