A little progress. I edited the 'crawl' script and added "-adddays 2", prompting the crawl to include the URL in question (which had a fetch time of Wed Jul 31 01:22:44 EDT 2013).
Now the URL I am after is in a segment (I ran segread against the segment and can see the URL there):

Recno:: 3252
URL:: http://redacted.com/files/ppb/ppb_3j_002_vacation_policy.pdf

CrawlDatum::
Version: 7
Status: 33 (fetch_success)

But after I run solrindex against that specific segment, the URL is still not visible in the Solr search results.

Any further thoughts or suggestions?

________________________________________
From: Os Tyler [[email protected]]
Sent: Tuesday, July 30, 2013 2:54 PM
To: [email protected]
Subject: RE: URL in crawldb not appearing in Solr after indexing.

Thanks, I appreciate your help.

None of the existing segments contains the relevant URL. Is there a way to ensure a specific URL makes it to a segment?

________________________________________
From: Sebastian Nagel [[email protected]]
Sent: Tuesday, July 30, 2013 2:13 PM
To: [email protected]
Subject: Re: URL in crawldb not appearing in Solr after indexing.

Hi,

the signature of the document is null in CrawlDb. The signature is calculated when parsing the document, so:
- has parsing taken place?
- truncated content?
- parse failure?
etc.

> How do I specifically request that an entry in crawldb gets pushed to Solr?

You have to run solrindex on the segment which contains the fetched and parsed data. To check whether the segment contains all required data, you can use
% bin/nutch readseg ...

Sebastian

On 07/30/2013 06:48 PM, Os Tyler wrote:
> Hello,
>
> I have successfully deployed Solr on our development environment and our
> stage environment. But am running into an anomaly the third time around.
>
> I have a specific URL that appears in the crawldb, but is not showing up
> when I search from the Solr interface. How do I specifically request that an
> entry in crawldb gets pushed to Solr?
>
> I have run solrindex multiple times and it does not produce any errors.
> readdb, parsechecker and indexchecker all return positive results for this
> URL.
> Configuration is identical on the to-be-production machine as it is on
> dev and stage where it's correctly appearing in Solr.
>
> /usr/local/apache-nutch/bin/nutch readdb /usr/local/apache-nutch/intranet/crawldb/ -url http://redacted.com/ppb/ppb_3j_002_vacation_policy.pdf
>
> URL: http://redacted.com/ppb/ppb_3j_002_vacation_policy.pdf
> Version: 7
> Status: 2 (db_fetched)
> Fetch time: Wed Jul 31 01:22:44 EDT 2013
> Modified time: Wed Dec 31 19:00:00 EST 1969
> Retries since fetch: 0
> Retry interval: 100000 seconds (1 days)
> Score: 6.5826543E-4
> Signature: null
> Metadata: Content-Type: application/pdf_pst_: success(1), lastModified=0
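
Sebastian's point that solrindex needs a segment holding both fetched and parsed data can be checked mechanically. The following is a minimal sketch, not a definitive procedure: it scans a text dump produced by readseg and reports whether the record for one URL contains a parse section. The function name check_parsed is hypothetical, and the "ParseData::" label is an assumption modelled on the "Recno::", "URL::" and "CrawlDatum::" labels visible in the dump excerpt in this thread.

```shell
#!/bin/sh
# check_parsed DUMP_FILE URL   (hypothetical helper, not part of Nutch)
# Exits 0 if the dump record for URL contains a "ParseData::" section,
# and 1 otherwise.
# Assumption: each record starts with a "URL:: <url>" line and each
# section is introduced by a label line such as "CrawlDatum::".
check_parsed() {
    awk -v url="$2" '
        # A new "URL::" line starts a record; remember whether it is ours.
        $1 == "URL::" { in_rec = ($2 == url) }
        # A parse section inside our record means parsing took place.
        in_rec && $1 == "ParseData::" { found = 1 }
        END { exit (found ? 0 : 1) }
    ' "$1"
}

# Possible usage (SEGMENT and OUT are placeholders for your own paths):
#   bin/nutch readseg -dump "$SEGMENT" "$OUT" -nocontent
#   check_parsed "$OUT/dump" "http://redacted.com/files/ppb/ppb_3j_002_vacation_policy.pdf" \
#       && echo "parsed" || echo "no parse data in this segment"
```

If the check reports no parse data for a record whose status is fetch_success, the questions above apply (has parsing taken place, was the content truncated, did the parse fail), and running solrindex on that segment would not be expected to add the document.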

