Hello everybody,

We are using Solr to index some RSS feeds for a news aggregator application.

We are having some difficulty with the publication date of each item, because each site uses its own homemade date format. What we want is the exact amount of time between the publication date and the current time.

So we decided to use a timestamp that stores the index time for each item.
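To make the goal concrete, here is a minimal Python sketch of the elapsed-time computation we are after, assuming the date is available in the RFC 822 style that RSS normally uses. The function name `age_in_seconds` and the sample values are hypothetical and only serve as illustration.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def age_in_seconds(pub_date: str, now: datetime) -> float:
    """Return the seconds elapsed between an RFC 822 date string and `now`."""
    published = parsedate_to_datetime(pub_date)
    if published.tzinfo is None:
        # Assumption: a date without a usable zone is treated as UTC.
        published = published.replace(tzinfo=timezone.utc)
    return (now - published).total_seconds()


# Hypothetical sample values, not taken from a real feed.
now = datetime(2009, 9, 25, 15, 0, 0, tzinfo=timezone.utc)
print(age_in_seconds("Fri, 25 Sep 2009 14:58:21 +0200", now))
```

With the index-time timestamp, the same subtraction would simply use the stored timestamp instead of the parsed pubDate.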

The problem is:

   * When I do a full-import&clean=false, the index is always cleaned.
   * When I do a simple import, nothing seems to be done.
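To be explicit about the request for the first case, here is a minimal Python sketch of how such a URL can be built. The host and port are assumptions for a default local setup (the `/solr` webapp and `/dataimport` path match the logs further down), and the helper name `dataimport_url` is hypothetical.

```python
from urllib.parse import urlencode

# Assumed base URL for a default local Solr instance.
BASE = "http://localhost:8983/solr/dataimport"


def dataimport_url(command: str, **params: str) -> str:
    """Build a DataImportHandler request URL for the given command."""
    query = urlencode({"command": command, **params})
    return f"{BASE}?{query}"


full_import = dataimport_url("full-import", clean="false")
print(full_import)
```

One thing possibly worth checking: if such a URL is passed unquoted on a shell command line, the `&` ends the command there and the `clean=false` parameter never reaches Solr.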

Here is the configuration:

   * Apache Solr 1.4 Nightly 2009-09-25
   * Java version: build 1.6.0_15-b03
   * Java HotSpot Client VM: build 14.1-b02, mixed mode, sharing

=> data-config.xml

<?xml version="1.0" encoding="utf-8"?>
<dataConfig>
   <dataSource type="HttpDataSource" />
   <document>
       <entity name="flux_367"
               pk="link"
               url="http://www.capital.fr/rss2/feed/fil-bourse.xml"
               processor="XPathEntityProcessor"
               forEach="/rss/channel | /rss/channel/item"
               transformer="DateFormatTransformer, TemplateTransformer"
               onError="continue">
           <field column="source" template="368" commonField="true" />
           <field column="type" template="0" commonField="true" />
           <field column="title" xpath="/rss/channel/item/title" />
           <field column="link" xpath="/rss/channel/item/link" />
           <field column="description" xpath="/rss/channel/item/description" />
           <field column="date" xpath="/rss/channel/item/pubDate" dateTimeFormat="EEE, dd MMM yyyy HH:mm:ss z" />
       </entity>
   </document>
</dataConfig>

=> schema.xml

[...]
<fields>
  <field name="source" type="text" indexed="true" stored="true" />
  <field name="title" type="text" indexed="true" stored="true" />
  <field name="link" type="string" indexed="true" stored="true" />
  <field name="description" type="html" indexed="true" stored="true" />
  <field name="date" type="date" indexed="true" stored="true" default="NOW" />
  <field name="type" type="sint" indexed="true" stored="true" />
  <field name="all_text" type="text" indexed="true" stored="false" multiValued="true" />
  <copyField source="source" dest="all_text" />
  <copyField source="title" dest="all_text" />
  <copyField source="description" dest="all_text" />
  <copyField source="date" dest="all_text" />
  <copyField source="type" dest="all_text" />
  <!-- Here, default is used to create a "timestamp" field indicating
       When each document was indexed.
  -->
  <field name="timestamp" type="date" indexed="true" stored="true" default="NOW" multiValued="false"/>

</fields>

<uniqueKey>link</uniqueKey>

<defaultSearchField>all_text</defaultSearchField>

<solrQueryParser defaultOperator="OR"/>
[...]

- Tests:

=> command=full-import&clean=false

25-Sep-2009 14:58:21 org.apache.solr.handler.dataimport.SolrWriter readIndexerProperties
INFO: Read dataimport.properties
25-Sep-2009 14:58:21 org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/dataimport params={command=full-import} status=0 QTime=6
25-Sep-2009 14:58:21 org.apache.solr.handler.dataimport.DataImporter doFullImport
INFO: Starting Full Import
25-Sep-2009 14:58:21 org.apache.solr.handler.dataimport.SolrWriter readIndexerProperties
INFO: Read dataimport.properties
25-Sep-2009 14:58:21 org.apache.solr.update.DirectUpdateHandler2 deleteAll
INFO: [] REMOVING ALL DOCUMENTS FROM INDEX
25-Sep-2009 14:58:21 org.apache.solr.core.SolrDeletionPolicy onInit
INFO: SolrDeletionPolicy.onInit: commits:num=1
commit{dir=D:\srv\solr\index,segFN=segments_2s,version=1251453476028,generation=100,filenames=[segments_2s, _3u.cfs, _3u.cfx]
25-Sep-2009 14:58:21 org.apache.solr.core.SolrDeletionPolicy updateCommits
INFO: last commit = 1251453476028
25-Sep-2009 14:58:22 org.apache.solr.handler.dataimport.DocBuilder finish
INFO: Import completed successfully

=> command=import

25-Sep-2009 14:59:20 org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/dataimport params={command=import} status=0 QTime=0
25-Sep-2009 14:59:20 org.apache.solr.handler.dataimport.SolrWriter readIndexerProperties
INFO: Read dataimport.properties

Any ideas or suggestions?
Thank you in advance!
--

Brahim Abdesslam
Directeur des opérations

* Maecia - /Développement web/ *
Mob : +33 (0)6 82 87 31 27
Tel  : +33 (0)9 54 99 29 59
Fax : +33 (0)9 59 99 29 59

http://www.maecia.com
