Re: URL in crawldb not appearing in Solr after indexing.

Sebastian Nagel Thu, 01 Aug 2013 15:50:00 -0700

A PDF with Content-Length=282669 is not really large.

Are you sure the content is not truncated? See http.content.limit:
the default is 64 kB! Truncated PDFs are not parsed by default
(see parser.skip.truncated) because parsing would probably fail.
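
As a quick sanity check of the numbers, here is a minimal sketch. It assumes
http.content.limit is still at its 64 kB default (65536 bytes) and that a
negative value means "no limit"; the function name is made up for this
illustration and is not part of Nutch.

```python
# Assumption: http.content.limit is left at its default of 64 kB (65536 bytes).
HTTP_CONTENT_LIMIT = 65536


def is_truncated(content_length, limit=HTTP_CONTENT_LIMIT):
    """Return True if a fetch capped at `limit` bytes cannot hold the full body.

    A negative limit is treated as "no truncation at all".
    """
    if limit < 0:
        return False
    return content_length > limit


# Content-Length reported in the segment's Content metadata for the PDF.
pdf_length = 282669

print(is_truncated(pdf_length))      # True, since 282669 > 65536
print(is_truncated(pdf_length, -1))  # False, no limit configured
```

If the first check is true for your configuration, the stored content is cut
off and the parser will skip it unless parser.skip.truncated is changed.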


There should also be some parse-related data in the segment
for the given URL:
- CrawlDatum with signature status
- parse data
- parse text
If the document is not parsed it isn't indexed.
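
To check a readseg dump for the parse-related parts, something like the
following rough sketch could help. It assumes the parse data and parse text
sections are introduced by "ParseData::" and "ParseText::" headers, following
the "CrawlDatum::" and "Content::" pattern visible in the dump; the helper
names are hypothetical.

```python
import re

# Section headers in a readseg dump record. "ParseData::" and "ParseText::"
# are assumed to follow the same "Name::" convention as the other two.
SECTION_RE = re.compile(
    r"^(CrawlDatum|Content|ParseData|ParseText)::", re.MULTILINE
)


def sections_present(record):
    """Return the set of section names found in one dumped record."""
    return set(SECTION_RE.findall(record))


def missing_parse_sections(record):
    """Return the parse-related sections that the record lacks."""
    return {"ParseData", "ParseText"} - sections_present(record)


# Abbreviated record, modeled on the dump quoted in this thread.
record = """Recno:: 3252
URL:: http://redacted.com/files/ppb/ppb_3j_002_vacation_policy.pdf

CrawlDatum::
Version: 7
Status: 33 (fetch_success)

Content::
Version: -1
contentType: application/pdf
"""

print(sorted(missing_parse_sections(record)))  # ['ParseData', 'ParseText']
```

A record with both sections missing, as in this example, was fetched but
never parsed, so solrindex has nothing to send to Solr for it.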

On 08/02/2013 12:10 AM, Os Tyler wrote:
> Thanks again for your time, Sebastian.
> 
> The output relevant to this URL is quite lengthy and, since it is a PDF, 
> contains a lot of binary content. I'm pasting the output up to the binary 
> content at the bottom of this email. It looks to me like everything required 
> is there. Is there a way to debug solrindex that would help show why this 
> entry is not making it from the segment to Solr? 
> (Please see the readseg output below.)
> 
> BTW, I came up with a workaround, which was to pull the record from the stage 
> Solr instance, build an XML statement, and push it to our production 
> Solr environment using curl. I would still like to understand how to 
> diagnose why this file, which shows up in both the crawldb and the segment, 
> was not making it to Solr after running solrindex.
> 
> For reference, here is the command for manually adding to Solr, followed by 
> the output from readseg.
> 
> Adding a record to Solr index manually:
>  curl "http://<solrhost>:<port>/solr/update" -H "Content-Type: text/xml" \
>    --data-binary '<add><doc>
>      <field name="content">Full page content goes here</field>
>      <field name="digest">fc57093cf1d347d1bf94bf4950deb738</field>
>      <field name="id">http://redacted.com/files/ppb/ppb_3j_002_vacation_policy.pdf</field>
>      <field name="title">Vacation Policy</field>
>      <field name="tstamp">2013-07-23T16:13:25.431Z</field>
>      <field name="url">http://redacted.com/files/ppb/ppb_3j_002_vacation_policy.pdf</field>
>    </doc></add>'
> 
> Here is the readseg output:
> Recno:: 3252
> URL:: http://redacted.com/files/ppb/ppb_3j_002_vacation_policy.pdf
> 
> CrawlDatum::
> Version: 7
> Status: 33 (fetch_success)
> Fetch time: Tue Jul 30 18:27:26 EDT 2013
> Modified time: Wed Dec 31 19:00:00 EST 1969
> Retries since fetch: 0
> Retry interval: 100000 seconds (1 days)
> Score: 6.5826543E-4
> Signature: null
> Metadata: _ngt_: 1375222322923Content-Type: application/pdf_pst_: success(1), 
> lastModified=0
> 
> CrawlDatum::
> Version: 7
> Status: 2 (db_fetched)
> Fetch time: Wed Jul 31 01:22:44 EDT 2013
> Modified time: Wed Dec 31 19:00:00 EST 1969
> Retries since fetch: 0
> Retry interval: 100000 seconds (1 days)
> Score: 6.5826543E-4
> Signature: null
> Metadata: _ngt_: 1375222322923Content-Type: application/pdf_pst_: success(1), 
> lastModified=0
> 
> CrawlDatum::
> Version: 7
> Status: 67 (linked)
> Fetch time: Tue Jul 30 18:48:38 EDT 2013
> Modified time: Wed Dec 31 19:00:00 EST 1969
> Retries since fetch: 0
> Retry interval: 80000 seconds (0 days)
> Score: 5.418439E-4
> Signature: null
> Metadata: 
> 
> CrawlDatum::
> Version: 7
> Status: 67 (linked)
> Fetch time: Tue Jul 30 18:48:14 EDT 2013
> Modified time: Wed Dec 31 19:00:00 EST 1969
> Retries since fetch: 0
> Retry interval: 80000 seconds (0 days)
> Score: 8.344418E-6
> Signature: null
> Metadata: 
> 
> CrawlDatum::
> Version: 7
> Status: 67 (linked)
> Fetch time: Tue Jul 30 18:48:35 EDT 2013
> Modified time: Wed Dec 31 19:00:00 EST 1969
> Retries since fetch: 0
> Retry interval: 80000 seconds (0 days)
> Score: 4.298202E-5
> Signature: null
> Metadata: 
> 
> CrawlDatum::
> Version: 7
> Status: 67 (linked)
> Fetch time: Tue Jul 30 18:48:14 EDT 2013
> Modified time: Wed Dec 31 19:00:00 EST 1969
> Retries since fetch: 0
> Retry interval: 80000 seconds (0 days)
> Score: 1.7928971E-7
> Signature: null
> Metadata: 
> 
> Content::
> Version: -1
> url: http://intranet.ur.com/files/ppb/ppb_3j_002_vacation_policy.pdf
> base: http://intranet.ur.com/files/ppb/ppb_3j_002_vacation_policy.pdf
> contentType: application/pdf
> metadata: Date=Tue, 30 Jul 2013 22:27:26 GMT Content-Length=282669 
> Expires=Thu, 29 Aug 2013 22:27:26 GMT 
> Last-Modified=Wed, 24 Jul 2013 17:28:43 GMT 
> nutch.crawl.score=6.5826543E-4 _fst_=33 
> nutch.segment.name=20130730181550 Accept-Ranges=bytes Connection=close 
> Content-Type=application/pdf Server=Apache/2.2.12 (Linux/SUSE) 
> Cache-Control=public, max-age=2419200, no-transform 
> Content:
> %PDF-1.5
> (data content from here on)
> 
> 
> ________________________________________
> From: Sebastian Nagel [[email protected]]
> Sent: Thursday, August 01, 2013 1:52 PM
> To: [email protected]
> Subject: Re: URL in crawldb not appearing in Solr after indexing.
> 
>> But after I run solrindex against the specific segment,
>> the URL is still not visible in the Solr search results.
> There should be other data related to this URL in the same segment.
> What about parse data (including meta data), parsed text, and signature?
> 
> On 07/31/2013 02:55 AM, Os Tyler wrote:
>> A little progress. I edited 'crawl' and added "-adddays 2", prompting the 
>> crawl to include the URL in question, which had a fetch time of:
>>> Fetch time: Wed Jul 31 01:22:44 EDT 2013
>>
>> Now the URL I am after is in a segment (I ran readseg against the segment 
>> and can see the URL there):
>> Recno:: 3252
>> URL:: http://redacted.com/files/ppb/ppb_3j_002_vacation_policy.pdf
>>
>> CrawlDatum::
>> Version: 7
>> Status: 33 (fetch_success)
>>
>> But after I run solrindex against the specific segment, the URL is 
>> still not visible in the Solr search results.
>>
>> Any further thoughts, suggestions?
>>
>> ________________________________________
>> From: Os Tyler [[email protected]]
>> Sent: Tuesday, July 30, 2013 2:54 PM
>> To: [email protected]
>> Subject: RE: URL in crawldb not appearing in Solr after indexing.
>>
>> Thanks, I appreciate your help.
>>
>> None of the existing segments contains the relevant URL. Is there a way to 
>> ensure a specific URL makes it to a segment?
>>
>> ________________________________________
>> From: Sebastian Nagel [[email protected]]
>> Sent: Tuesday, July 30, 2013 2:13 PM
>> To: [email protected]
>> Subject: Re: URL in crawldb not appearing in Solr after indexing.
>>
>> Hi,
>>
>> the signature of the document is null in CrawlDb.
>> The signature is calculated when parsing the document, so:
>> - has parsing taken place?
>> - truncated content?
>> - parse failure?
>> etc.
>>
>>> How do I specifically request that an entry in crawldb gets pushed to Solr?
>> You have to run solrindex on the segment which contains the fetched and 
>> parsed data.
>>
>> To check whether the segment contains all required data, you can use
>> % bin/nutch readseg ...
>>
>> Sebastian
>>
>> On 07/30/2013 06:48 PM, Os Tyler wrote:
>>> Hello,
>>>
>>> I have successfully deployed Solr on our development environment and our 
>>> stage environment, but am running into an anomaly the third time around.
>>>
>>> I have a specific URL that appears in the crawldb but is not showing up 
>>> when I search from the Solr interface. How do I specifically request that 
>>> an entry in crawldb gets pushed to Solr?
>>>
>>> I have run solrindex multiple times and it does not produce any errors. 
>>> readdb, parsechecker and indexchecker all return positive results for this 
>>> URL. The configuration on the to-be-production machine is identical to that 
>>> on dev and stage, where the URL correctly appears in Solr.
>>>
>>> /usr/local/apache-nutch/bin/nutch readdb 
>>> /usr/local/apache-nutch/intranet/crawldb/ -url 
>>> http://redacted.com/ppb/ppb_3j_002_vacation_policy.pdf
>>>
>>> URL: http://redacted.com/ppb/ppb_3j_002_vacation_policy.pdf
>>> Version: 7
>>> Status: 2 (db_fetched)
>>> Fetch time: Wed Jul 31 01:22:44 EDT 2013
>>> Modified time: Wed Dec 31 19:00:00 EST 1969
>>> Retries since fetch: 0
>>> Retry interval: 100000 seconds (1 days)
>>> Score: 6.5826543E-4
>>> Signature: null
>>> Metadata: Content-Type: application/pdf_pst_: success(1), lastModified=0
>>>
>>>
>>>
>>
>
