Hi Varun and all,

Thanks for your input.
On Mon, Jan 27, 2014 at 11:29 AM, Varun Thacker <varunthacker1...@gmail.com> wrote:

> Hi Dileepa,
>
> If I understand correctly, this is what happens in your system currently:
>
> 1. DIH sends data to Solr.
> 2. You have written a custom update processor
> (http://wiki.apache.org/solr/UpdateRequestProcessor) which then asks your
> Stanbol server for metadata, adds it to the document and then indexes it.
>
> It's the part where you query the Stanbol server and wait for the
> response which takes time, and you want to reduce this.

Yes, this is what I'm trying to achieve. For each document I send the value
of the content field to Stanbol, and in my UpdateRequestProcessor I process
the Stanbol response to add certain metadata fields to the document.

> According to me, instead of waiting for the response from the Stanbol
> server and then indexing, you could send the required field data from the
> doc to your Stanbol server and continue. Once Stanbol has enriched the
> document, you re-index the document and update it with the metadata.

To update a document I need to invoke an /update request with the doc id
and the field to update/add. So with the method you have suggested, for
each Stanbol request I will need to process the response and create a Solr
/update query to update the document with the Stanbol enhancements. To
Stanbol I just send the value of the content to be enhanced; no document ID
is sent. How would you recommend executing the Stanbol request-response
handling process separately?

Currently what I have done in my custom update processor is as below; I
process the Stanbol response and add NLP fields to the document in the
processAdd() method of my UpdateRequestProcessor:

public void processAdd(AddUpdateCommand cmd) throws IOException {
    SolrInputDocument doc = cmd.getSolrInputDocument();
    String request = "";
    for (String field : STANBOL_REQUEST_FIELDS) {
        if (null != doc.getFieldValue(field)) {
            request += (String) doc.getFieldValue(field) + ". ";
        }
    }
    try {
        EnhancementResult result = stanbolPost(request, getBaseURI());
        // extract text annotations from the Stanbol response
        Collection<TextAnnotation> textAnnotations = result.getTextAnnotations();
        Set<String> langSet = new HashSet<String>(); // declaration was missing
        Set<String> personSet = new HashSet<String>();
        Set<String> orgSet = new HashSet<String>();
        for (TextAnnotation text : textAnnotations) {
            String type = text.getType();
            String language = text.getLanguage();
            langSet.add(language);
            String selectedText = text.getSelectedText();
            if (null != type && null != selectedText) {
                if (type.equalsIgnoreCase(StanbolConstants.PERSON)) {
                    personSet.add(selectedText);
                } else if (type.equalsIgnoreCase(StanbolConstants.ORGANIZATION)) {
                    orgSet.add(selectedText);
                }
            }
        }
        // entity annotations are fetched but not used yet
        Collection<EntityAnnotation> entityAnnotations = result.getEntityAnnotations();
        for (String person : personSet) {
            doc.addField(NLP_PERSON, person);
        }
        for (String org : orgSet) {
            doc.addField(NLP_ORGANIZATION, org);
        }
        cmd.solrDoc = doc;
        super.processAdd(cmd);
    } catch (Exception ex) {
        ex.printStackTrace();
    }
}

private EnhancementResult stanbolPost(String request, URI uri) {
    Client client = Client.create();
    WebResource webResource = client.resource(uri);
    ClientResponse response = webResource.type(MediaType.TEXT_PLAIN)
            .accept(new MediaType("application", "rdf+xml"))
            .entity(request, MediaType.TEXT_PLAIN)
            .post(ClientResponse.class);
    int status = response.getStatus();
    if (status != 200 && status != 201 && status != 202) {
        throw new RuntimeException("Failed : HTTP error code : " + status);
    }
    String output = response.getEntity(String.class);
    // parse the RDF model returned by Stanbol
    Model model = ModelFactory.createDefaultModel();
    StringReader reader = new StringReader(output);
    model.read(reader, null);
    return new EnhancementResult(model);
}
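If I do move to the background approach you describe, my current thinking
is: queue the doc id together with the content sent to Stanbol (Stanbol
itself never sees the id), and when the response arrives, write the NLP
fields back with an atomic update keyed on that id. A minimal SolrJ sketch
of the write-back, assuming SolrJ 4.x, an <updateLog/> in solrconfig.xml
and stored fields; the URL and field names here are illustrative, not my
real ones:

import java.io.IOException;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class StanbolWriteBack {

    private final HttpSolrServer server =
            new HttpSolrServer("http://localhost:8983/solr/collection1");

    /** Sets NLP fields on an already indexed document via atomic update. */
    public void writeEnhancements(String docId, Collection<String> persons,
            Collection<String> orgs) throws SolrServerException, IOException {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", docId);

        // "set" replaces just this field; the rest of the document is untouched
        Map<String, Object> personUpdate = new HashMap<String, Object>();
        personUpdate.put("set", persons);
        doc.addField("nlp_person", personUpdate);

        Map<String, Object> orgUpdate = new HashMap<String, Object>();
        orgUpdate.put("set", orgs);
        doc.addField("nlp_organization", orgUpdate);

        server.add(doc);
        server.commit(); // or rely on autoCommit instead
    }
}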
"; } } try { EnhancementResult result = stanbolPost(request, getBaseURI()); Collection<TextAnnotation> textAnnotations = result .getTextAnnotations(); // extracting text annotations Set<String> personSet = new HashSet<String>(); Set<String> orgSet = new HashSet<String>(); for (TextAnnotation text : textAnnotations) { String type = text.getType(); String language = text.getLanguage(); langSet.add(language); String selectedText = text.getSelectedText(); if (null != type && null != selectedText) { if (type.equalsIgnoreCase(StanbolConstants.PERSON)) { personSet.add(selectedText); } else if (type .equalsIgnoreCase(StanbolConstants.ORGANIZATION)) { orgSet.add(selectedText); } } } Collection<EntityAnnotation> entityAnnotations = result.getEntityAnnotations(); for (String person : personSet) { doc.addField(NLP_PERSON, person); } for (String org : orgSet) { doc.addField(NLP_ORGANIZATION, org); } cmd.solrDoc = doc; super.processAdd(cmd); } catch (Exception ex) { ex.printStackTrace(); } } } private EnhancementResult stanbolPost(String request, URI uri) { Client client = Client.create(); WebResource webResource = client.resource(uri); ClientResponse response = webResource.type(MediaType.TEXT_PLAIN) .accept(new MediaType("application", "rdf+xml")) .entity(request, MediaType.TEXT_PLAIN) .post(ClientResponse.class); int status = response.getStatus(); if (status != 200 && status != 201 && status != 202) { throw new RuntimeException("Failed : HTTP error code : " + response.getStatus()); } String output = response.getEntity(String.class); // Parse the RDF model Model model = ModelFactory.createDefaultModel(); StringReader reader = new StringReader(output); model.read(reader, null); return new EnhancementResult(model); } Thanks, Dileepa This method makes you re-index the document but the changes from your > client would be visible faster. > > Alternately you could do the same thing at the DIH level by writing a > customer Transformer ( > http://wiki.apache.org/solr/DataImportHandler#Writing_Custom_Transformers) > > > On Mon, Jan 27, 2014 at 10:44 AM, Dileepa Jayakody < > dileepajayak...@gmail.com> wrote: > > > Hi Ahmet, > > > > > > > > On Mon, Jan 27, 2014 at 3:26 AM, Ahmet Arslan <iori...@yahoo.com> wrote: > > > > > Hi, > > > > > > Here is what I understand from your Question. > > > > > > You have a custom update processor that runs with DIH. But it is slow. > > You > > > want to run that text enhancement component after DIH. How would this > > help > > > to speed up things? > > > > > > > In this approach you will read/query/search already indexed and > committed > > > solr documents and run text enhancement thing on them. Probably this > > > process will add new additional fields. And then you will update these > > solr > > > documents? > > > > > > Did I understand your use case correctly? > > > > > > > Yes, that is exactly what I want to achieve. > > I want to separate out the enhancement process from the dataimport > process. > > The dataimport process will be invoked by a client when new data is > > added/updated to the mysql database. Therefore the dataimport process > with > > mandatory fields of the documents should be indexed as soon as possible. > > Mandatory fields are mapped to the data table columns in the > > data-config.xml and the normal /dataimport process doesn't take much > time. > > The enhancements are done in my custom processor by sending the content > > field of the document to an external Stanbol[1] server to detect NLP > > enhancements. 
> > Then new NLP fields are added to the document (detected persons,
> > organizations and places in the content) in the custom update
> > processor, and when this is executed during the dataimport process it
> > takes a lot of time.
> >
> > The NLP fields are not mandatory for the primary usage of the
> > application, which is to query documents by their mandatory fields.
> > The NLP fields are required for custom queries for Person and
> > Organization entities. Therefore the NLP update process should run as
> > a background process detached from the primary /dataimport process; it
> > should not slow down the existing /dataimport process.
> >
> > That's why I am looking for the best way to achieve my objective. I
> > want to implement a way to separately update the documents imported
> > from /dataimport with NLP enhancements. Currently I have the idea of
> > adopting a timestamp-based approach: trigger an /update query on all
> > documents imported after the last_index_time in dataimport.properties
> > and update them with NLP fields.
> >
> > Hope my requirement is clear :). Appreciate your suggestions.
> >
> > [1] http://stanbol.apache.org/
> >
> > > On Sunday, January 26, 2014 8:43 PM, Dileepa Jayakody <
> > > dileepajayak...@gmail.com> wrote:
> > >
> > > Hi all,
> > >
> > > Any ideas on how to run a reindex update process for all the
> > > documents imported from a /dataimport query? Appreciate your help.
> > >
> > > Thanks,
> > > Dileepa
> > >
> > > On Thu, Jan 23, 2014 at 12:21 PM, Dileepa Jayakody <
> > > dileepajayak...@gmail.com> wrote:
> > >
> > > > Hi All,
> > > >
> > > > I did some research on this and found some alternatives useful for
> > > > my use case. Please give your ideas.
> > > >
> > > > Can I update all documents indexed after a /dataimport query using
> > > > the last_index_time in dataimport.properties? If so, can anyone
> > > > please give me some pointers? What I currently have in mind is
> > > > something like the below:
> > > >
> > > > 1. Store the indexing timestamp of the document as a field, e.g.:
> > > > <field name="timestamp" type="date" indexed="true" stored="true"
> > > > default="NOW" multiValued="false"/>
> > > >
> > > > 2. Read the last_index_time from dataimport.properties.
> > > >
> > > > 3. Query the ids of all documents indexed after the
> > > > last_index_time and send them through the Stanbol update processor.
> > > >
> > > > But I have a question here:
> > > > Does the last_index_time refer to when the dataimport is started
> > > > (onImportStart) or when it is finished (onImportEnd)? If it's the
> > > > onImportEnd timestamp, then this solution won't work, because the
> > > > timestamp indexed in the document field will be: onImportStart <
> > > > doc-index-timestamp < onImportEnd.
> > > >
> > > > Another alternative I can think of is to trigger an update chain
> > > > via an EventListener configured to run after a dataimport is
> > > > processed (onImportEnd). In this case, can the context in DIH give
> > > > the list of document ids processed in the /dataimport request? If
> > > > so, I can send those doc ids with an /update query to run the
> > > > Stanbol update process.
> > > >
> > > > Please give me your ideas and suggestions.
> > > >
> > > > Thanks,
> > > > Dileepa
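Coming back to step 3 of the timestamp idea quoted above: the query I have
in mind would look roughly like this. A sketch only, assuming SolrJ 4.x;
the "timestamp" field is the one from step 1, the core URL is illustrative,
and lastIndexTime is the value read from dataimport.properties, formatted
as an ISO-8601 UTC date (e.g. 2014-01-27T05:14:00Z):

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class ReindexAfterImport {

    /** Returns the ids of all documents indexed after lastIndexTime. */
    public List<String> idsSince(String lastIndexTime)
            throws SolrServerException {
        HttpSolrServer server =
                new HttpSolrServer("http://localhost:8983/solr/collection1");

        // all documents whose indexing timestamp falls after the last import
        SolrQuery query =
                new SolrQuery("timestamp:[" + lastIndexTime + " TO NOW]");
        query.setFields("id");
        query.setRows(1000); // page through the results for larger imports

        QueryResponse response = server.query(query);
        List<String> ids = new ArrayList<String>();
        for (SolrDocument doc : response.getResults()) {
            ids.add((String) doc.getFieldValue("id"));
        }
        // re-post these documents through a handler whose update.chain is
        // stanbolInterceptor (see the config sketch at the end of this mail)
        return ids;
    }
}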
> > > > On Wed, Jan 22, 2014 at 6:14 PM, Dileepa Jayakody <
> > > > dileepajayak...@gmail.com> wrote:
> > > >
> > > >> Hi All,
> > > >>
> > > >> I have a Solr requirement to send all the documents imported from
> > > >> a /dataimport query through another update chain as a separate
> > > >> background process.
> > > >>
> > > >> Currently I have configured my custom update chain in the
> > > >> /dataimport handler itself. But since my custom update process
> > > >> needs to connect to an external enhancement engine (Apache
> > > >> Stanbol) to enhance the documents with some NLP fields, it has a
> > > >> negative impact on the /dataimport process. The solution will be
> > > >> to have a separate update process running to enhance the content
> > > >> of the documents imported from /dataimport.
> > > >>
> > > >> Currently I have configured my custom Stanbol processor as below
> > > >> in my /dataimport handler:
> > > >>
> > > >> <requestHandler name="/dataimport" class="solr.DataImportHandler">
> > > >>   <lst name="defaults">
> > > >>     <str name="config">data-config.xml</str>
> > > >>     <str name="update.chain">stanbolInterceptor</str>
> > > >>   </lst>
> > > >> </requestHandler>
> > > >>
> > > >> <updateRequestProcessorChain name="stanbolInterceptor">
> > > >>   <processor
> > > >>     class="com.solr.stanbol.processor.StanbolContentProcessorFactory"/>
> > > >>   <processor class="solr.RunUpdateProcessorFactory" />
> > > >> </updateRequestProcessorChain>
> > > >>
> > > >> What I need now is to separate the two processes of dataimport
> > > >> and Stanbol enhancement, so this is like running a separate
> > > >> re-indexing process periodically over the documents imported from
> > > >> /dataimport for the Stanbol fields.
> > > >>
> > > >> The question is how to trigger my Stanbol update process on the
> > > >> documents imported from /dataimport. In Solr, to trigger an
> > > >> /update query we need to know the id and the fields of the
> > > >> document to be updated. In my case I need to run all the documents
> > > >> imported by the previous /dataimport process through a Stanbol
> > > >> update.chain.
> > > >>
> > > >> Is there a way to keep track of the document ids imported from
> > > >> /dataimport? Any advice or pointers will be really helpful.
> > > >>
> > > >> Thanks,
> > > >> Dileepa
>
> --
> Regards,
> Varun Thacker
> http://www.vthacker.in/
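PS: To decouple the two processes at the config level, I'm considering
keeping /dataimport free of the chain and registering a dedicated update
handler that applies it, so the background job can simply re-post documents
there. A sketch of the solrconfig.xml change (the handler name
/update/stanbol is illustrative):

<!-- /dataimport no longer runs the Stanbol chain -->
<requestHandler name="/dataimport" class="solr.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
  </lst>
</requestHandler>

<!-- dedicated handler for the background Stanbol pass -->
<requestHandler name="/update/stanbol" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">stanbolInterceptor</str>
  </lst>
</requestHandler>

The existing stanbolInterceptor chain stays as it is; only the handler that
invokes it changes.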