Hi Hoss,
Thanks for the quick response.

RE point 1) I'd mistyped (sorry) the incremental URL I'm using for updates.
Essentially, every 5 minutes the system makes an HTTP call to...

http://localhost/solr/myfeed?clean=false&command=full-import&rows=5000

...which, when accessed, returns the following (showing 0 deleted):

<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">0</int>
</lst>
<lst name="initArgs">
<lst name="defaults">
<str name="config">/opt/solr/myfeed/data-config.xml</str>
</lst>
</lst>
<str name="command">full-import</str>
<str name="status">idle</str>
<str name="importResponse"/>
<lst name="statusMessages">
<str name="Total Requests made to DataSource">33</str>
<str name="Total Rows Fetched">594</str>
<str name="Total Documents Skipped">0</str>
<str name="Full Dump Started">2011-11-08 14:11:30</str>
<str name="">Indexing completed. Added/Updated: 594 documents. Deleted 0
documents.</str>
<str name="Committed">2011-11-08 14:11:31</str>
<str name="Optimized">2011-11-08 14:11:31</str>
<str name="Total Documents Processed">594</str>
<str name="Time taken ">0:0:6.492</str>
</lst>
<str name="WARNING">This response format is experimental.  It is likely to
change in the future.</str>
</response>

...but a search always returns between 550 and 600 documents. There should be
thousands, as this is parsing 30+ active feeds.

My request handler is intended to be basic:

<requestHandler name="/myfeed"
                class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">/opt/solr/myfeed/data-config.xml</str>
  </lst>
</requestHandler>

I have not customised the solrconfig.xml beyond the above.
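One thing I'm double-checking on my side: with clean=false, a document whose uniqueKey matches an existing document overwrites it rather than adding a new one. So if my uniqueKey field collides across different items (the field name below is just an illustration, not my actual schema), the total would plateau around the size of the live feeds:

```xml
<!-- schema.xml (sketch) -- if uniqueKey is the per-item link/GUID, each
     item is indexed once and re-imports of the same item overwrite it;
     a key that repeats across *different* items would cap the total. -->
<uniqueKey>link</uniqueKey>
```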

My data config is using:

<dataConfig>
        <dataSource type="HttpDataSource" />
        <document>
...

Should I be using the HttpDataSource?
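For reference, the rest of my entity definition follows the slashdot wiki example fairly closely. This is a sketch from memory with the feed URL and field xpaths as placeholders, not my exact file:

```xml
<dataConfig>
  <dataSource type="HttpDataSource" />
  <document>
    <!-- One entity per feed; XPathEntityProcessor parses the RSS XML.
         The url and xpath values here are illustrative placeholders. -->
    <entity name="myfeed"
            pk="link"
            url="http://example.com/feed.rss"
            processor="XPathEntityProcessor"
            forEach="/rss/channel/item">
      <field column="title" xpath="/rss/channel/item/title" />
      <field column="link" xpath="/rss/channel/item/link" />
      <field column="description" xpath="/rss/channel/item/description" />
      <field column="date" xpath="/rss/channel/item/pubDate" />
    </entity>
  </document>
</dataConfig>
```

(I've also read that HttpDataSource was deprecated in favour of URLDataSource, so perhaps that's worth switching too, though I believe both still work in 3.4.)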

Any other thoughts?
Regards,
Shaun


Chris Hostetter-3 wrote:
> 
> : We've successfully setup Solr 3.4.0 to parse and import multiple news 
> : RSS feeds (based on the slashdot example on 
> : http://wiki.apache.org/solr/DataImportHandler) using the HttpDataSource.
> 
> : The objective is for Solr to index ALL news items published on this feed 
> : (ever) - not just the current contents of the feed. I've read that the 
> : delta import is not supported for XML imports. I've therefore tried to 
> : use "command=full-impor&clean=false". 
> 
> 1) note your typo, should be "full-import"
> 
> : But still the number of Documents Processed seems to be stuck at a fixed 
> : number of items looking at the Stats and the 'numFound' result for a 
> : generic '*:*' search. New items are being added to the feeds all the 
> : time (and old ones dropping off).
> 
> "Documents Processed" after each full import should be whatever the number 
> of items in the current feed is -- it's the number processed in that 
> import, not the total number processed in all time.
> 
> if you specify clean=false no documents should be deleted.  I just tested 
> this using the slashdot example with Solr 3.4 and could not reproduce the 
> problem you described.  I loaded the following URL...
> 
> http://localhost:8983/solr/rss/dataimport?clean=false&command=full-import
> 
> ...then waited a while for the feed to change, and then loaded that URL 
> again.  The number of documents (returned by a *:* query) increased after 
> the second run.
> 
> 
> -Hoss
> 


--
View this message in context: 
http://lucene.472066.n3.nabble.com/Aggregated-indexing-of-updating-RSS-feeds-tp3485335p3490501.html
Sent from the Solr - User mailing list archive at Nabble.com.