Hi all,

Any ideas on how to run a reindex update process for all the imported
documents from a /dataimport query?
Appreciate your help.
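
For reference, here is a minimal sketch of the timestamp-based idea from my earlier mail (quoted below). It only reads last_index_time from the text of dataimport.properties and builds the /select URL for documents indexed since then. The function names, the core URL and the "timestamp" field name are my own placeholders. It also assumes that last_index_time is written as "yyyy-MM-dd HH:mm:ss" with escaped colons, and that the server clock is in UTC; neither is confirmed in this thread.

```python
from datetime import datetime
from urllib.parse import urlencode


def parse_last_index_time(properties_text):
    """Extract last_index_time from the text of dataimport.properties.

    Assumes the value looks like '2014-01-23 12\\:21\\:00'
    (Java properties files escape ':' with a backslash).
    """
    for line in properties_text.splitlines():
        line = line.strip()
        if line.startswith("#") or "=" not in line:
            continue  # skip comments and malformed lines
        key, value = line.split("=", 1)
        if key.strip() == "last_index_time":
            cleaned = value.strip().replace("\\:", ":")
            return datetime.strptime(cleaned, "%Y-%m-%d %H:%M:%S")
    raise ValueError("last_index_time not found")


def build_select_url(core_url, since, rows=100, start=0):
    """Build a /select URL returning the ids of documents whose
    'timestamp' field (hypothetical name) is at or after `since`.

    Assumption: `since` is already in UTC, as Solr date fields are.
    """
    stamp = since.strftime("%Y-%m-%dT%H:%M:%SZ")
    params = {
        "q": f"timestamp:[{stamp} TO *]",
        "fl": "id",
        "wt": "json",
        "rows": rows,
        "start": start,
    }
    return f"{core_url}/select?{urlencode(params)}"
```

The ids returned by that query could then be re-sent to /update with update.chain=stanbolInterceptor. Whether the range catches every imported document still depends on the onImportStart/onImportEnd question raised below.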


Thanks,
Dileepa


On Thu, Jan 23, 2014 at 12:21 PM, Dileepa Jayakody <
dileepajayak...@gmail.com> wrote:

> Hi All,
>
> I did some research on this and found some alternatives that may suit my
> use case. Please give your ideas.
>
> Can I update all documents indexed after a /dataimport query using the
> last_index_time in dataimport.properties?
> If so, can anyone please give me some pointers?
> What I currently have in mind is something like the following:
>
> 1. Store the indexing timestamp of the document as a field, e.g.:
>    <field name="timestamp" type="date" indexed="true" stored="true"
>           default="NOW" multiValued="false"/>
>
> 2. Read the last_index_time from dataimport.properties.
>
> 3. Query all document IDs indexed after the last_index_time and send them
> through the Stanbol update processor.
>
> But I have a question here:
> Does last_index_time refer to when the dataimport is started
> (onImportStart) or when it is finished (onImportEnd)?
> If it is the onImportEnd timestamp, then this solution won't work, because
> the timestamp indexed in the document field will satisfy:
> onImportStart < doc-index-timestamp < onImportEnd.
>
>
> Another alternative I can think of is to trigger an update chain via an
> EventListener configured to run after a dataimport is processed
> (onImportEnd).
> In this case, can the DIH context give the list of document IDs
> processed in the /dataimport request? If so, I can send those doc IDs with
> an /update query to run the Stanbol update process.
>
> Please give me your ideas and suggestions.
>
> Thanks,
> Dileepa
>
>
>
>
> On Wed, Jan 22, 2014 at 6:14 PM, Dileepa Jayakody <
> dileepajayak...@gmail.com> wrote:
>
>> Hi All,
>>
>> I have a requirement in Solr to send all the documents imported by a
>> /dataimport query through another update chain as a separate
>> background process.
>>
>> Currently I have configured my custom update chain in the /dataimport
>> handler itself. But since my custom update process needs to connect to an
>> external enhancement engine (Apache Stanbol) to enhance the documents with
>> some NLP fields, it has a negative impact on the /dataimport process.
>> The solution would be to have a separate update process running to enhance
>> the content of the documents imported from /dataimport.
>>
>> Currently I have configured my custom Stanbol Processor as below in my
>> /dataimport handler.
>>
>> <requestHandler name="/dataimport" class="solr.DataImportHandler">
>>   <lst name="defaults">
>>     <str name="config">data-config.xml</str>
>>     <str name="update.chain">stanbolInterceptor</str>
>>   </lst>
>> </requestHandler>
>>
>> <updateRequestProcessorChain name="stanbolInterceptor">
>>   <processor class="com.solr.stanbol.processor.StanbolContentProcessorFactory"/>
>>   <processor class="solr.RunUpdateProcessorFactory"/>
>> </updateRequestProcessorChain>
>>
>>
>> What I need now is to separate the two processes of dataimport and
>> Stanbol enhancement.
>> So this is like running a separate re-indexing process periodically over
>> the documents imported from /dataimport to add the Stanbol fields.
>>
>> The question is how to trigger my Stanbol update process on the documents
>> imported from /dataimport.
>> In Solr, to trigger an /update query we need to know the id and the fields
>> of the document to be updated. In my case I need to run all the documents
>> imported by the previous /dataimport process through a Stanbol
>> update.chain.
>>
>> Is there a way to keep track of the document IDs imported from
>> /dataimport?
>> Any advice or pointers would be really helpful.
>>
>> Thanks,
>> Dileepa
>>
>
>
