Hi Hoss,

Thanks for the quick response.

RE point 1) I'd mistyped (sorry) the incremental URL I'm using for updates. Essentially, every 5 minutes the system makes an HTTP call to...
http://localhost/solr/myfeed?clean=false&command=full-import&rows=5000

...which, when accessed, returns the following, showing 0 deleted:

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">0</int>
  </lst>
  <lst name="initArgs">
    <lst name="defaults">
      <str name="config">/opt/solr/myfeed/data-config.xml</str>
    </lst>
  </lst>
  <str name="command">full-import</str>
  <str name="status">idle</str>
  <str name="importResponse"/>
  <lst name="statusMessages">
    <str name="Total Requests made to DataSource">33</str>
    <str name="Total Rows Fetched">594</str>
    <str name="Total Documents Skipped">0</str>
    <str name="Full Dump Started">2011-11-08 14:11:30</str>
    <str name="">Indexing completed. Added/Updated: 594 documents. Deleted 0 documents.</str>
    <str name="Committed">2011-11-08 14:11:31</str>
    <str name="Optimized">2011-11-08 14:11:31</str>
    <str name="Total Documents Processed">594</str>
    <str name="Time taken ">0:0:6.492</str>
  </lst>
  <str name="WARNING">This response format is experimental. It is likely to change in the future.</str>
</response>

...but a search always returns between 550 and 600 rows. There should be thousands (as this is parsing 30+ active feeds).

My request handler is intended to be basic:

<requestHandler name="/myfeed" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">/opt/solr/myfeed/data-config.xml</str>
  </lst>
</requestHandler>

I have not customised solrconfig.xml beyond the above. My data config uses:

<dataConfig>
  <dataSource type="HttpDataSource" />
  <document>
  ...

Should I be using the HttpDataSource? Any other thoughts?

Regards,
Shaun


Chris Hostetter-3 wrote:
>
> : We've successfully set up Solr 3.4.0 to parse and import multiple news
> : RSS feeds (based on the slashdot example on
> : http://wiki.apache.org/solr/DataImportHandler) using the HttpDataSource.
>
> : The objective is for Solr to index ALL news items published on this feed
> : (ever) - not just the current contents of the feed. I've read that the
> : delta import is not supported for XML imports. I've therefore tried to
> : use "command=full-impor&clean=false".
>
> 1) note your typo, should be "full-import"
>
> : But still the number of Documents Processed seems to be stuck at a fixed
> : number of items looking at the Stats and the 'numFound' result for a
> : generic '*:*' search. New items are being added to the feeds all the
> : time (and old ones dropping off).
>
> "Documents Processed" after each full import should be whatever the number
> of items in the current feed is -- it's the number processed in that
> import, not the total number processed in all time.
>
> if you specify clean=false no documents should be deleted. I just tested
> this using the slashdot example with Solr 3.4 and could not reproduce the
> problem you described. I loaded the following URL...
>
> http://localhost:8983/solr/rss/dataimport?clean=false&command=full-import
>
> ...then waited a while for the feed to change, and then loaded that URL
> again. The number of documents (returned by a *:* query) increased after
> the second run.
>
>
> -Hoss
>

--
View this message in context: http://lucene.472066.n3.nabble.com/Aggregated-indexing-of-updating-RSS-feeds-tp3485335p3490501.html
Sent from the Solr - User mailing list archive at Nabble.com.
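[Editor's note: Shaun's data-config.xml is truncated in the post above. For readers following along, a minimal slashdot-style DIH config of the kind the wiki example describes looks roughly like this; the feed URL and field names below are illustrative placeholders, not taken from the thread. Note that HttpDataSource was deprecated in favour of URLDataSource in Solr 1.4, though the old name still works as an alias.]

<dataConfig>
  <!-- URLDataSource is the current name; HttpDataSource is a deprecated alias -->
  <dataSource type="URLDataSource" />
  <document>
    <!-- feed URL and field names are placeholders for illustration -->
    <entity name="myfeed"
            pk="link"
            url="http://example.com/news.rss"
            processor="XPathEntityProcessor"
            forEach="/rss/channel/item">
      <field column="title"       xpath="/rss/channel/item/title" />
      <field column="link"        xpath="/rss/channel/item/link" />
      <field column="description" xpath="/rss/channel/item/description" />
      <field column="pubDate"     xpath="/rss/channel/item/pubDate" />
    </entity>
  </document>
</dataConfig>

With clean=false, documents sharing the same uniqueKey are updated in place rather than duplicated, so the choice of pk/uniqueKey matters for how the index grows across repeated imports.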
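[Editor's note: since the import runs every 5 minutes, the statusMessages block in the DIH response can be checked programmatically between runs. A minimal Python sketch, using a trimmed copy of the response quoted in this thread; the element and attribute names are as they appear in that output.]

```python
import xml.etree.ElementTree as ET

# Trimmed from the DIH status response quoted above.
RESPONSE = """\
<response>
  <str name="status">idle</str>
  <lst name="statusMessages">
    <str name="Total Requests made to DataSource">33</str>
    <str name="Total Rows Fetched">594</str>
    <str name="Total Documents Processed">594</str>
    <str name="">Indexing completed. Added/Updated: 594 documents. Deleted 0 documents.</str>
  </lst>
</response>
"""

def parse_dih_status(xml_text):
    """Return the statusMessages entries as a dict of name -> value."""
    root = ET.fromstring(xml_text)
    msgs = root.find("lst[@name='statusMessages']")
    return {el.get("name"): el.text for el in msgs}

status = parse_dih_status(RESPONSE)
print(status["Total Rows Fetched"])         # 594
print(status["Total Documents Processed"])  # 594
```

Comparing "Total Documents Processed" against numFound from a `*:*` query after each run is a quick way to see whether repeated imports are actually growing the index.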