Re: How to run a subsequent update query to documents indexed from a dataimport query

Dileepa Jayakody Mon, 27 Jan 2014 03:05:51 -0800

Hi Varun and all,

Thanks for your input.


On Mon, Jan 27, 2014 at 11:29 AM, Varun Thacker
<varunthacker1...@gmail.com>wrote:

> Hi Dileepa,
>
> If I understand correctly this is what happens in your system correctly :
>
> 1. DIH Sends data to Solr
> 2. You have written a custom update processor (
> http://wiki.apache.org/solr/UpdateRequestProcessor) which the asks your
> Stanbol server for meta data, adds it to the document and then indexes it.
>
> Its the part where you query the Stanbol server and wait for the response
> which takes time and you want to reduce this.
>

Yes, this is what I'm trying to achieve. For each document I'm sending the
value of the content field to Stanbol and I process the Stanbol response to
add certain metadata fields to the document in my UpdateRequestProcessor.

>
> According to me instead of waiting for your response from the Stanbol
> server and then indexing it, You could send the required field data from
> the doc to your Stanbol server and continue. Once Stanbol as enriched the
> document, you re-index the document and update it with the meta-data.
>
> To update a document I need to invoke a /update request with the doc id
and the field to update/add. So in the method you have suggested, for each
Stanbol request I will need to process the response and create a Solr
/update query to update the document with the Stanbol enhancements.
To Stanbol I just send the value of the content to be enhanced and no
document ID is sent. How would you recommend to execute the Stanbol
request-response handling process separately?

Currently what I have done in my custom update processor is as below; I
process the Stanbol response and add NLP fields to the document in the
processAdd() method of my UpdateRequestProcessor.

public void processAdd(AddUpdateCommand cmd) throws IOException {
SolrInputDocument doc = cmd.getSolrInputDocument();
 String request = "";
for (String field : STANBOL_REQUEST_FIELDS) {
if (null != doc.getFieldValue(field)) {
request += (String) doc.getFieldValue(field) + ". ";
}
}
 try {
EnhancementResult result = stanbolPost(request, getBaseURI());
Collection<TextAnnotation> textAnnotations = result
.getTextAnnotations();
 // extracting text annotations
Set<String> personSet = new HashSet<String>();
Set<String> orgSet = new HashSet<String>();

for (TextAnnotation text : textAnnotations) {
String type = text.getType();
String language = text.getLanguage();
langSet.add(language);
String selectedText = text.getSelectedText();
if (null != type && null != selectedText) {
if (type.equalsIgnoreCase(StanbolConstants.PERSON)) {
personSet.add(selectedText);
} else if (type
.equalsIgnoreCase(StanbolConstants.ORGANIZATION)) {
orgSet.add(selectedText);
}
}
}
Collection<EntityAnnotation> entityAnnotations =
result.getEntityAnnotations();
 for (String person : personSet) {
doc.addField(NLP_PERSON, person);
}
for (String org : orgSet) {
doc.addField(NLP_ORGANIZATION, org);
}
cmd.solrDoc = doc;
super.processAdd(cmd);
} catch (Exception ex) {
ex.printStackTrace();
}
}

}

private EnhancementResult stanbolPost(String request, URI uri) {
Client client = Client.create();
WebResource webResource = client.resource(uri);
ClientResponse response = webResource.type(MediaType.TEXT_PLAIN)
.accept(new MediaType("application", "rdf+xml"))
.entity(request, MediaType.TEXT_PLAIN)
.post(ClientResponse.class);

int status = response.getStatus();
if (status != 200 && status != 201 && status != 202) {
throw new RuntimeException("Failed : HTTP error code : "
+ response.getStatus());
}
String output = response.getEntity(String.class);
// Parse the RDF model
Model model = ModelFactory.createDefaultModel();
StringReader reader = new StringReader(output);
model.read(reader, null);
return new EnhancementResult(model);

}

Thanks,
Dileepa

This method makes you re-index the document but the changes from your
> client would be visible faster.
>
> Alternately you could do the same thing at the DIH level by writing a
> customer Transformer (
> http://wiki.apache.org/solr/DataImportHandler#Writing_Custom_Transformers)
>
>
> On Mon, Jan 27, 2014 at 10:44 AM, Dileepa Jayakody <
> dileepajayak...@gmail.com> wrote:
>
> > Hi Ahmet,
> >
> >
> >
> > On Mon, Jan 27, 2014 at 3:26 AM, Ahmet Arslan <iori...@yahoo.com> wrote:
> >
> > > Hi,
> > >
> > > Here is what I understand from your Question.
> > >
> > > You have a custom update processor that runs with DIH. But it is slow.
> > You
> > > want to run that text enhancement component after DIH. How would this
> > help
> > > to speed up things?
> >
> >
> > > In this approach you will read/query/search already indexed and
> committed
> > > solr documents and run text enhancement thing on them. Probably this
> > > process will add new additional fields. And then you will update these
> > solr
> > > documents?
> > >
> > > Did I understand your use case correctly?
> > >
> >
> > Yes, that is exactly what I want to achieve.
> > I want to separate out the enhancement process from the dataimport
> process.
> > The dataimport process will be invoked by a client when new data is
> > added/updated to the mysql database. Therefore the dataimport process
> with
> > mandatory fields of the documents should be indexed as soon as possible.
> > Mandatory fields are mapped to the data table columns in the
> > data-config.xml and the normal /dataimport process doesn't take much
> time.
> > The enhancements are done in my custom processor by sending the content
> > field of the document to an external Stanbol[1] server to detect NLP
> > enhancements. Then new NLP fields are added to the document (detected
> > persons, organizations, places in the content) in the custom update
> > processor and if this is executed during the dataimport process, it
> takes a
> > lot of time.
> >
> > The NLP fields are not mandatory for the primary usage of the application
> > which is to query documents with mandatory fields. The NLP fields are
> > required for custom queries for Person, Organization entities. Therefore
> > the NLP update process should be run as a background process detached
> from
> > the primary /dataimport process. It should not slow down the existing
> > /dataimport process.
> >
> > That's why I am looking for the best way to achieve my objective. I want
> to
> > implement a way to separately update the imported documents from
> > /dataimport  to detect NLP enhancements. Currently I'm having the idea of
> > adopting a timestamp based approach to trigger a /update query to all
> > documents imported after the last_index_time in dataimport.prop and
> update
> > them with NLP fields.
> >
> > Hope my requirement is clear :). Appreciate your suggestions.
> >
> > [1] http://stanbol.apache.org/
> >
> > >
> > >
> > >
> > >
> > > On Sunday, January 26, 2014 8:43 PM, Dileepa Jayakody <
> > > dileepajayak...@gmail.com> wrote:
> > > Hi all,
> > >
> > > Any ideas on how to run a reindex update process for all the imported
> > > documents from a /dataimport query?
> > > Appreciate your help.
> > >
> > >
> > > Thanks,
> > > Dileepa
> > >
> > >
> > >
> > > On Thu, Jan 23, 2014 at 12:21 PM, Dileepa Jayakody <
> > > dileepajayak...@gmail.com> wrote:
> > >
> > > > Hi All,
> > > >
> > > > I did some research on this and found some alternatives useful to my
> > > > usecase. Please give your ideas.
> > > >
> > > > Can I update all documents indexed after a /dataimport query using
> the
> > > > last_indexed_time in dataimport.properties?
> > > > If so can anyone please give me some pointers?
> > > > What I currently have in mind is something like below;
> > > >
> > > > 1. Store the indexing timestamp of the document as a field
> > > > eg: <field name="timestamp" type="date" indexed="true" stored="true"
> > > default="NOW"
> > > > multiValued="false"/>
> > > >
> > > > 2. Read the last_index_time from the dataimport.properties
> > > >
> > > > 3. Query all document id's indexed after the last_index_time and send
> > > them
> > > > through the Stanbol update processor.
> > > >
> > > > But I have a question here;
> > > > Does the last_index_time refer to when the dataimport is
> > > > started(onImportStart) or when the dataimport is finished
> > (onImportEnd)?
> > > > If it's onImportEnd timestamp, them this solution won't work because
> > the
> > > > timestamp indexed in the document field will be : onImportStart<
> > > > doc-index-timestamp < onImportEnd.
> > > >
> > > >
> > > > Another alternative I can think of is trigger an update chain via a
> > > > EventListener configured to run after a dataimport is processed
> > > > (onImportEnd).
> > > > In this case can the context in DIH give the list of document ids
> > > > processed in the /dataimport request? If so I can send those doc ids
> > with
> > > > an /update query to run the Stanbol update process.
> > > >
> > > > Please give me your ideas and suggestions.
> > > >
> > > > Thanks,
> > > > Dileepa
> > > >
> > > >
> > > >
> > > >
> > > > On Wed, Jan 22, 2014 at 6:14 PM, Dileepa Jayakody <
> > > > dileepajayak...@gmail.com> wrote:
> > > >
> > > >> Hi All,
> > > >>
> > > >> I have a Solr requirement to send all the documents imported from a
> > > >> /dataimport query to go through another update chain as a separate
> > > >> background process.
> > > >>
> > > >> Currently I have configured my custom update chain in the
> /dataimport
> > > >> handler itself. But since my custom update process need to connect
> to
> > an
> > > >> external enhancement engine (Apache Stanbol) to enhance the
> documents
> > > with
> > > >> some NLP fields, it has a negative impact on /dataimport process.
> > > >> The solution will be to have a separate update process running to
> > > enhance
> > > >> the content of the documents imported from /dataimport.
> > > >>
> > > >> Currently I have configured my custom Stanbol Processor as below in
> my
> > > >> /dataimport handler.
> > > >>
> > > >> <requestHandler name="/dataimport" class="solr.DataImportHandler">
> > > >> <lst name="defaults">
> > > >>  <str name="config">data-config.xml</str>
> > > >> <str name="update.chain">stanbolInterceptor</str>
> > > >>  </lst>
> > > >>    </requestHandler>
> > > >>
> > > >> <updateRequestProcessorChain name="stanbolInterceptor">
> > > >>  <processor
> > > >> class="com.solr.stanbol.processor.StanbolContentProcessorFactory"/>
> > > >> <processor class="solr.RunUpdateProcessorFactory" />
> > > >>   </updateRequestProcessorChain>
> > > >>
> > > >>
> > > >> What I need now is to separate the 2 processes of dataimport and
> > > >> stanbol-enhancement.
> > > >> So this is like runing a separate re-indexing process periodically
> > over
> > > >> the documents imported from /dataimport for Stanbol fields.
> > > >>
> > > >> The question is how to trigger my Stanbol update process to the
> > > documents
> > > >> imported from /dataimport?
> > > >> In Solr to trigger /update query we need to know the id and the
> fields
> > > of
> > > >> the document to be updated. In my case I need to run all the
> documents
> > > >> imported from the previous /dataimport process through a stanbol
> > > >> update.chain.
> > > >>
> > > >> Is there a way to keep track of the documents ids imported from
> > > >> /dataimport?
> > > >> Any advice or pointers will be really helpful.
> > > >>
> > > >> Thanks,
> > > >> Dileepa
> > > >>
> > > >
> > > >
> > >
> > >
> >
>
>
>
> --
>
>
> Regards,
> Varun Thacker
> http://www.vthacker.in/
>

Re: How to run a subsequent update query to documents indexed from a dataimport query

Reply via email to