Re: Aggregated indexing of updating RSS feeds

2011-11-17 Thread sbarriba
Thanks Chris.

(Bell rings)

The 'params' logging pointer was what I needed. So for reference, it's not a
good idea to use a 'wget' command directly in a crontab.
I was using:

wget http://localhost/solr/myfeed?command=full-import&rows=5000&clean=false

...but moving this into a separate shell script, wrapping the URL in quotes
and calling that resolved the issue.
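For reference, a minimal sketch of the wrapper (script name is made up, and the real wget call is commented out so the sketch runs anywhere):

```shell
#!/bin/sh
# myfeed-import.sh -- hypothetical wrapper to call from crontab instead of
# an inline wget. Quoting the URL keeps the shell from splitting it at each
# '&' (which would background the command and drop the remaining
# parameters, so Solr would only see command=full-import).
URL='http://localhost/solr/myfeed?command=full-import&rows=5000&clean=false'
echo "Fetching: $URL"
# wget -q -O /dev/null "$URL"   # the real call
```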

Thanks very much.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Aggregated-indexing-of-updating-RSS-feeds-tp3485335p3515388.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Aggregated indexing of updating RSS feeds

2011-11-17 Thread Michael Kuhlmann

On 17.11.2011 11:53, sbarriba wrote:

The 'params' logging pointer was what I needed. So for reference, it's not a
good idea to use a 'wget' command directly in a crontab.
I was using:

wget http://localhost/solr/myfeed?command=full-import&rows=5000&clean=false


:))

I think the shell treated the ampersand as an instruction to put the wget 
command into the background.


You could put the full URL into quotes, or escape the ampersand with a 
backslash. Then it should work as well.
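To make that concrete, a small sketch (both spellings yield the same literal URL; host and path are the ones from this thread):

```shell
# Unquoted, the shell parses
#   wget http://host/path?a=1&b=2
# as 'wget http://host/path?a=1 &' (backgrounded) followed by the no-op
# assignment 'b=2'. Either quoting or escaping keeps the '&' literal:
quoted='http://localhost/solr/myfeed?command=full-import&rows=5000&clean=false'
escaped=http://localhost/solr/myfeed?command=full-import\&rows=5000\&clean=false
[ "$quoted" = "$escaped" ] && echo "identical URLs"
# prints: identical URLs
```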


-Kuli


Re: Aggregated indexing of updating RSS feeds

2011-11-16 Thread sbarriba
All,
Can anyone advise how to stop the deleteAll event during a full import? 

As discussed above using clean=false with Solr 3.4 still seems to trigger a
delete of all previous imported data. I want to aggregate the results of
multiple imports.

Thanks in advance.
S

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Aggregated-indexing-of-updating-RSS-feeds-tp3485335p3512260.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Aggregated indexing of updating RSS feeds

2011-11-16 Thread Chris Hostetter

: ..but the request I'm making is..
: /solr/myfeed?command=full-import&rows=5000&clean=false
: 
: ..note the clean=false.

I see it, but i also see this in the logs you provided...

: INFO: [] webapp=/solr path=/myfeed params={command=full-import} status=0
: QTime=8

...which means someone somewhere is executing full-import w/o using 
clean=false.  

are you absolutely certain that you are executing the request you think 
you are?  can you find a request in your logs that includes clean=false?

if it's not you and your code -- it is coming from somewhere, and that's 
what's causing DIH to trigger a deleteAll...

: 10-Nov-2011 05:40:01 org.apache.solr.handler.dataimport.DataImporter
: doFullImport
: INFO: Starting Full Import
: 10-Nov-2011 05:40:01 org.apache.solr.handler.dataimport.SolrWriter
: readIndexerProperties
: INFO: Read myfeed.properties
: 10-Nov-2011 05:40:01 org.apache.solr.update.DirectUpdateHandler2 deleteAll
: INFO: [] REMOVING ALL DOCUMENTS FROM INDEX
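One quick way to audit this from the Solr side is to grep the request log for the handler path; a sketch against a made-up sample file (path and filename are hypothetical, log lines shaped like the ones quoted in this thread):

```shell
# Two sample requests: one truncated (no clean=false), one complete.
cat > /tmp/solr-sample.log <<'EOF'
INFO: [] webapp=/solr path=/myfeed params={command=full-import} status=0
INFO: [] webapp=/solr path=/myfeed params={clean=false&command=full-import&rows=5000} status=0
EOF
# Count DIH requests that arrived WITHOUT clean=false -- these are the
# ones that trigger deleteAll:
grep 'path=/myfeed' /tmp/solr-sample.log | grep -vc 'clean=false'
# prints: 1
```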



-Hoss


Re: Aggregated indexing of updating RSS feeds

2011-11-09 Thread sbarriba
All,
Can anyone advise how to stop the deleteAll event during a full import?

I'm still unable to determine why repeat full imports seem to delete old
indexes. After investigation the logs confirm this - see REMOVING ALL
DOCUMENTS FROM INDEX below.

..but the request I'm making is..
/solr/myfeed?command=full-import&rows=5000&clean=false

..note the clean=false.

All help appreciated.
Shaun


INFO: [] webapp=/solr path=/myfeed params={command=full-import} status=0
QTime=8
10-Nov-2011 05:40:01 org.apache.solr.handler.dataimport.DataImporter
doFullImport
INFO: Starting Full Import
10-Nov-2011 05:40:01 org.apache.solr.handler.dataimport.SolrWriter
readIndexerProperties
INFO: Read myfeed.properties
10-Nov-2011 05:40:01 org.apache.solr.update.DirectUpdateHandler2 deleteAll
INFO: [] REMOVING ALL DOCUMENTS FROM INDEX
10-Nov-2011 05:40:05 org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/select/
params={indent=on&start=0&q=description:one+direction&rows=10&version=2.2}
hits=0 status=0 QTime=1
10-Nov-2011 05:40:07 org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/select/
params={indent=on&start=0&q=id:*23327977*&rows=10&version=2.2} hits=0
status=0 QTime=1
10-Nov-2011 05:40:08 org.apache.solr.core.SolrDeletionPolicy onInit
INFO: SolrDeletionPolicy.onInit: commits:num=2000
   
commit{dir=/mnt/ebs1/data/index,segFN=segments_1x3,version=1319402557686,generation=2487,filenames=[_3u3.tii,
segments_1x3, _3u3.frq, _3u3.prx, _3u3.nrm, _3u3.fnm, _3u3.fdx, _3u3.tis,
_3u3.fdt]}
   
commit{dir=/mnt/ebs1/data/index,segFN=segments_1x4,version=1319402557691,generation=2488,filenames=[_3u5.nrm,
_3u5.fnm, _3u5.fdx, segments_1x4, _3u5.tis, _3u5.prx, _3u5.frq, _3u5.tii,
_3u5.fdt]}

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Aggregated-indexing-of-updating-RSS-feeds-tp3485335p3495882.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Aggregated indexing of updating RSS feeds

2011-11-08 Thread sbarriba
Hi Hoss,
Thanks for the quick response.

RE point 1) I'd mistyped (sorry) the incremental URL I'm using for updates.
Essentially, every 5 minutes the system is making an HTTP call to...

http://localhost/solr/myfeed?clean=false&command=full-import&rows=5000

..which when accessed returns the following showing 0 deleted.

<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">0</int>
</lst>
<lst name="initArgs">
<lst name="defaults">
<str name="config">/opt/solr/myfeed/data-config.xml</str>
</lst>
</lst>
<str name="command">full-import</str>
<str name="status">idle</str>
<str name="importResponse"/>
<lst name="statusMessages">
<str name="Total Requests made to DataSource">33</str>
<str name="Total Rows Fetched">594</str>
<str name="Total Documents Skipped">0</str>
<str name="Full Dump Started">2011-11-08 14:11:30</str>
<str name="">Indexing completed. Added/Updated: 594 documents. Deleted 0
documents.</str>
<str name="Committed">2011-11-08 14:11:31</str>
<str name="Optimized">2011-11-08 14:11:31</str>
<str name="Total Documents Processed">594</str>
<str name="Time taken">0:0:6.492</str>
</lst>
<str name="WARNING">This response format is experimental. It is likely to
change in the future.</str>
</response>

But a search always returns between 550 and 600 rows. There should be
thousands (as this is parsing 30+ active feeds).

My request handler is intended to be basic:

<requestHandler name="/myfeed"
    class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">/opt/solr/myfeed/data-config.xml</str>
  </lst>
</requestHandler>
I have not customised the solrconfig.xml beyond the above.

My data config is using:

<dataConfig>
  <dataSource type="HttpDataSource" />
  <document>
  ...

Should I be using the HttpDataSource?

Any other thoughts?
Regards,
Shaun


Chris Hostetter-3 wrote:
 
 : We've successfully setup Solr 3.4.0 to parse and import multiple news 
 : RSS feeds (based on the slashdot example on 
 : http://wiki.apache.org/solr/DataImportHandler) using the HttpDataSource.
 
 : The objective is for Solr to index ALL news items published on this feed 
 : (ever) - not just the current contents of the feed. I've read that the 
 : delta import is not supported for XML imports. I've therefore tried to 
 : use command=full-impor&clean=false. 
 
 1) note your typo, should be full-import
 
 : But still the number of Documents Processed seems to be stuck at a fixed 
 : number of items looking at the Stats and the 'numFound' result for a 
 : generic '*:*' search. New items are being added to the feeds all the 
 : time (and old ones dropping off).
 
 Documents Processed after each full import should be whatever the number 
 of items in the current feed is -- it's the number processed in that 
 import, no total number processed in all time.
 
 if you specify clean=false no documents should be deleted.  I just tested 
 this using the slashdot example with Solr 3.4 and could not reproduce the 
 problem you described.  I loaded the following URL...
 
 http://localhost:8983/solr/rss/dataimport?clean=false&command=full-import
 
 ...then waited a while for the feed to change, and then loaded that URL 
 again.  The number of documents (returned by a *:* query) increased after 
 the second run.
 
 
 -Hoss
 


--
View this message in context: 
http://lucene.472066.n3.nabble.com/Aggregated-indexing-of-updating-RSS-feeds-tp3485335p3490501.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Aggregated indexing of updating RSS feeds

2011-11-07 Thread Nagendra Nagarajayya

Shaun:

You should try the NRT (near real-time) support available in Solr with 
RankingAlgorithm here. You should be able to add docs in real time and 
also query them in real time. If DIH does not retain the old index, you 
may be able to convert the RSS fields to an XML format as needed by Solr 
and update the docs (make sure there is a unique id).


http://solr-ra.tgels.org/wiki/en/Near_Real_Time_Search_ver_3.x

You can download Solr 3.4.0 with RankingAlgorithm 1.3 from here:
http://solr-ra.tgels.org

Regards,

- Nagendra Nagarajayya
http://solr-ra.tgels.org
http://rankingalgorithm.tgels.org

On 11/6/2011 1:22 PM, Shaun Barriball wrote:

Hi all,

We've successfully setup Solr 3.4.0 to parse and import multiple news RSS feeds 
(based on the slashdot example on 
http://wiki.apache.org/solr/DataImportHandler) using the HttpDataSource.
The objective is for Solr to index ALL news items published on this feed (ever) - not just the current contents of the feed. I've read that the delta import is not supported for XML imports. I've therefore tried to use command=full-impor&clean=false. 


But still the number of Documents Processed seems to be stuck at a fixed number 
of items looking at the Stats and the 'numFound' result for a generic '*:*' 
search. New items are being added to the feeds all the time (and old ones 
dropping off).

Is it possible for Solr to incrementally build an index of a live RSS feed 
which is changing but retain the index of its archive?

All help appreciated.
Shaun




Re: Aggregated indexing of updating RSS feeds

2011-11-07 Thread Fred Zimmerman
Any options that do not require adding new software?

On Mon, Nov 7, 2011 at 11:11 AM, Nagendra Nagarajayya 
nnagaraja...@transaxtions.com wrote:

 Shaun:

 You should try the NRT (near real-time) support available in Solr with
 RankingAlgorithm here. You should be able to add docs in real time and
 also query them in real time. If DIH does not retain the old index, you
 may be able to convert the RSS fields to an XML format as needed by Solr
 and update the docs (make sure there is a unique id).

 http://solr-ra.tgels.org/wiki/en/Near_Real_Time_Search_ver_3.x

 You can download Solr 3.4.0 with RankingAlgorithm 1.3 from here:
 http://solr-ra.tgels.org

 Regards,

 - Nagendra Nagarajayya
 http://solr-ra.tgels.org
 http://rankingalgorithm.tgels.org


 On 11/6/2011 1:22 PM, Shaun Barriball wrote:

 Hi all,

 We've successfully setup Solr 3.4.0 to parse and import multiple news RSS
 feeds (based on the slashdot example on
 http://wiki.apache.org/solr/DataImportHandler) using the HttpDataSource.
 The objective is for Solr to index ALL news items published on this feed
 (ever) - not just the current contents of the feed. I've read that the
 delta import is not supported for XML imports. I've therefore tried to use
 command=full-impor&clean=false.
 But still the number of Documents Processed seems to be stuck at a fixed
 number of items looking at the Stats and the 'numFound' result for a
 generic '*:*' search. New items are being added to the feeds all the time
 (and old ones dropping off).

 Is it possible for Solr to incrementally build an index of a live RSS
 feed which is changing but retain the index of its archive?

 All help appreciated.
 Shaun





Re: Aggregated indexing of updating RSS feeds

2011-11-07 Thread sbarriba
Thanks Nagendra, I'll take a look.

So a question for you et al: will Solr in its default installation ALWAYS
delete content for an entity prior to doing a full import?
Can you not simply build up an index incrementally from multiple imports
(from XML)? I read elsewhere that the 'clean' parameter was intended to
control this.

Regards,
Shaun

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Aggregated-indexing-of-updating-RSS-feeds-tp3485335p3487969.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Aggregated indexing of updating RSS feeds

2011-11-07 Thread Chris Hostetter

: We've successfully setup Solr 3.4.0 to parse and import multiple news 
: RSS feeds (based on the slashdot example on 
: http://wiki.apache.org/solr/DataImportHandler) using the HttpDataSource.

: The objective is for Solr to index ALL news items published on this feed 
: (ever) - not just the current contents of the feed. I've read that the 
: delta import is not supported for XML imports. I've therefore tried to 
: use command=full-impor&clean=false. 

1) note your typo, should be full-import

: But still the number of Documents Processed seems to be stuck at a fixed 
: number of items looking at the Stats and the 'numFound' result for a 
: generic '*:*' search. New items are being added to the feeds all the 
: time (and old ones dropping off).

Documents Processed after each full import should be whatever the number 
of items in the current feed is -- it's the number processed in that 
import, not the total number processed in all time.

if you specify clean=false no documents should be deleted.  I just tested 
this using the slashdot example with Solr 3.4 and could not reproduce the 
problem you described.  I loaded the following URL...

http://localhost:8983/solr/rss/dataimport?clean=false&command=full-import

...then waited a while for the feed to change, and then loaded that URL 
again.  The number of documents (returned by a *:* query) increased after 
the second run.


-Hoss