Re: How to run a subsequent update query to documents indexed from a dataimport query

2014-01-27 Thread Dileepa Jayakody
Hi Varun and all,

Thanks for your input.

On Mon, Jan 27, 2014 at 11:29 AM, Varun Thacker
varunthacker1...@gmail.com wrote:

 Hi Dileepa,

 If I understand correctly, this is what happens in your system:

 1. DIH sends data to Solr
 2. You have written a custom update processor (
 http://wiki.apache.org/solr/UpdateRequestProcessor) which then asks your
 Stanbol server for metadata, adds it to the document and then indexes it.

 It's the part where you query the Stanbol server and wait for the response
 that takes time, and you want to reduce this.


Yes, this is what I'm trying to achieve. For each document I'm sending the
value of the content field to Stanbol and I process the Stanbol response to
add certain metadata fields to the document in my UpdateRequestProcessor.


 In my opinion, instead of waiting for the response from the Stanbol
 server and then indexing, you could send the required field data from
 the doc to your Stanbol server and continue. Once Stanbol has enriched the
 document, you re-index the document and update it with the metadata.

To update a document I need to invoke an /update request with the doc id
and the field to update/add. So in the method you have suggested, for each
Stanbol request I will need to process the response and create a Solr
/update query to update the document with the Stanbol enhancements.
To Stanbol I just send the value of the content field to be enhanced; no
document ID is sent. How would you recommend executing the Stanbol
request-response handling separately?
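
For reference, a minimal sketch of such a per-document update using SolrJ
atomic updates, assuming Solr 4.x; the core URL, the id, and the field values
are illustrative assumptions, and atomic updates require the updateLog to be
enabled and the fields to be stored:

import java.util.HashMap;
import java.util.Map;

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class AtomicUpdateSketch {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1"); // unique key of the already-indexed document
        // atomic "add" appends a value to a multiValued field
        // without re-sending the rest of the document
        Map<String, Object> addOp = new HashMap<String, Object>();
        addOp.put("add", "John Smith");
        doc.addField("NLP_PERSON", addOp);
        server.add(doc);
        server.commit();
        server.shutdown();
    }
}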

Currently what I have done in my custom update processor is as below; I
process the Stanbol response and add NLP fields to the document in the
processAdd() method of my UpdateRequestProcessor.

public void processAdd(AddUpdateCommand cmd) throws IOException {
    SolrInputDocument doc = cmd.getSolrInputDocument();
    String request = "";
    for (String field : STANBOL_REQUEST_FIELDS) {
        if (null != doc.getFieldValue(field)) {
            request += (String) doc.getFieldValue(field) + ". ";
        }
    }
    try {
        EnhancementResult result = stanbolPost(request, getBaseURI());
        Collection<TextAnnotation> textAnnotations = result.getTextAnnotations();
        // extracting text annotations
        Set<String> personSet = new HashSet<String>();
        Set<String> orgSet = new HashSet<String>();
        // langSet was not declared in the original snippet; declared here for completeness
        Set<String> langSet = new HashSet<String>();

        for (TextAnnotation text : textAnnotations) {
            String type = text.getType();
            String language = text.getLanguage();
            langSet.add(language);
            String selectedText = text.getSelectedText();
            if (null != type && null != selectedText) {
                if (type.equalsIgnoreCase(StanbolConstants.PERSON)) {
                    personSet.add(selectedText);
                } else if (type.equalsIgnoreCase(StanbolConstants.ORGANIZATION)) {
                    orgSet.add(selectedText);
                }
            }
        }
        Collection<EntityAnnotation> entityAnnotations = result.getEntityAnnotations();
        for (String person : personSet) {
            doc.addField(NLP_PERSON, person);
        }
        for (String org : orgSet) {
            doc.addField(NLP_ORGANIZATION, org);
        }
        cmd.solrDoc = doc;
        super.processAdd(cmd);
    } catch (Exception ex) {
        ex.printStackTrace();
    }
}

private EnhancementResult stanbolPost(String request, URI uri) {
    Client client = Client.create();
    WebResource webResource = client.resource(uri);
    ClientResponse response = webResource.type(MediaType.TEXT_PLAIN)
            .accept(new MediaType("application", "rdf+xml"))
            .entity(request, MediaType.TEXT_PLAIN)
            .post(ClientResponse.class);

    int status = response.getStatus();
    if (status != 200 && status != 201 && status != 202) {
        throw new RuntimeException("Failed : HTTP error code : "
                + response.getStatus());
    }
    String output = response.getEntity(String.class);
    // Parse the RDF model returned by Stanbol
    Model model = ModelFactory.createDefaultModel();
    StringReader reader = new StringReader(output);
    model.read(reader, null);
    return new EnhancementResult(model);
}

Thanks,
Dileepa

Re: How to run a subsequent update query to documents indexed from a dataimport query

2014-01-27 Thread Dileepa Jayakody
Hi All,

I have implemented my requirement as an EventListener which runs on the
importEnd event of the DataImportHandler.

I'm running a SolrJ-based client within my EventListener to send Stanbol
enhancement updates to the documents.
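
For illustration, a minimal sketch of this listener approach, assuming Solr
4.x DIH and SolrJ; the class name, the timestamp field, the Solr URL, and the
lastIndexTime()/enrich() helpers are assumptions, and the actual Stanbol call
would be the stanbolPost() shown earlier in the thread:

import java.util.HashMap;
import java.util.Map;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.EventListener;

// registered in data-config.xml, e.g.:
// <document onImportEnd="com.solr.stanbol.processor.StanbolPostImportListener">
public class StanbolPostImportListener implements EventListener {

    @Override
    public void onEvent(Context ctx) {
        try {
            HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");
            // select the documents written by this import run via the timestamp field
            SolrQuery query = new SolrQuery("timestamp:[" + lastIndexTime() + " TO NOW]");
            query.setFields("id", "content");
            query.setRows(500); // page through larger imports in practice
            for (SolrDocument found : server.query(query).getResults()) {
                SolrInputDocument update = new SolrInputDocument();
                update.addField("id", found.getFieldValue("id"));
                // atomic "add": append the Stanbol result without re-sending other fields
                Map<String, Object> op = new HashMap<String, Object>();
                op.put("add", enrich((String) found.getFieldValue("content")));
                update.addField("NLP_PERSON", op);
                server.add(update);
            }
            server.commit();
            server.shutdown();
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }

    private String lastIndexTime() {
        // placeholder: read last_index_time from conf/dataimport.properties
        // and convert it to Solr's ISO date format
        return "NOW-1HOUR";
    }

    private String enrich(String content) {
        // placeholder: POST the content to Stanbol and extract a person annotation
        return content;
    }
}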

Thanks,
Dileepa



Re: How to run a subsequent update query to documents indexed from a dataimport query

2014-01-26 Thread Dileepa Jayakody
Hi all,

Any ideas on how to run a reindex update process for all the imported
documents from a /dataimport query?
Appreciate your help.


Thanks,
Dileepa


Re: How to run a subsequent update query to documents indexed from a dataimport query

2014-01-26 Thread Ahmet Arslan
Hi,

Here is what I understand from your question.

You have a custom update processor that runs with DIH. But it is slow. You want 
to run that text enhancement component after DIH. How would this help to speed 
up things?

In this approach you will read/query/search already indexed and committed Solr
documents and run the text enhancement on them. Presumably this process will
add new fields. And then you will update these Solr documents?

Did I understand your use case correctly?






Re: How to run a subsequent update query to documents indexed from a dataimport query

2014-01-26 Thread Dileepa Jayakody
Hi Ahmet,



On Mon, Jan 27, 2014 at 3:26 AM, Ahmet Arslan iori...@yahoo.com wrote:

 Hi,

 Here is what I understand from your question.

 You have a custom update processor that runs with DIH. But it is slow. You
 want to run that text enhancement component after DIH. How would this help
 to speed up things?


 In this approach you will read/query/search already indexed and committed
 Solr documents and run the text enhancement on them. Presumably this
 process will add new fields. And then you will update these Solr
 documents?

 Did I understand your use case correctly?


Yes, that is exactly what I want to achieve.
I want to separate out the enhancement process from the dataimport process.
The dataimport process will be invoked by a client when new data is
added/updated in the MySQL database. Therefore the dataimport process, which
indexes the mandatory fields of the documents, should complete as soon as
possible.
Mandatory fields are mapped to the data table columns in the
data-config.xml and the normal /dataimport process doesn't take much time.
The enhancements are done in my custom processor by sending the content
field of the document to an external Stanbol[1] server to detect NLP
enhancements. New NLP fields (detected persons, organizations, and places
in the content) are then added to the document in the custom update
processor, and executing this during the dataimport process takes a lot
of time.

The NLP fields are not mandatory for the primary usage of the application,
which is to query documents with mandatory fields. The NLP fields are
required only for custom queries on Person and Organization entities. Therefore
the NLP update process should be run as a background process detached from
the primary /dataimport process. It should not slow down the existing
/dataimport process.

That's why I am looking for the best way to achieve my objective. I want to
implement a way to separately update the documents imported from
/dataimport to detect NLP enhancements. Currently I'm considering a
timestamp-based approach: trigger an /update query for all documents
imported after the last_index_time in dataimport.properties and update
them with NLP fields.

Hope my requirement is clear :). Appreciate your suggestions.

[1] http://stanbol.apache.org/





Re: How to run a subsequent update query to documents indexed from a dataimport query

2014-01-26 Thread Varun Thacker
Hi Dileepa,

If I understand correctly, this is what happens in your system:

1. DIH sends data to Solr
2. You have written a custom update processor (
http://wiki.apache.org/solr/UpdateRequestProcessor) which then asks your
Stanbol server for metadata, adds it to the document and then indexes it.

It's the part where you query the Stanbol server and wait for the response
that takes time, and you want to reduce this.

In my opinion, instead of waiting for the response from the Stanbol
server and then indexing, you could send the required field data from
the doc to your Stanbol server and continue. Once Stanbol has enriched the
document, you re-index the document and update it with the metadata.

This method makes you re-index the document but the changes from your
client would be visible faster.

Alternatively, you could do the same thing at the DIH level by writing a
custom Transformer (
http://wiki.apache.org/solr/DataImportHandler#Writing_Custom_Transformers)
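
For reference, a minimal sketch of such a custom Transformer, assuming Solr
4.x DIH; the source column name and the enrich() placeholder are assumptions,
and the entity in data-config.xml would reference it via
transformer="com.solr.stanbol.processor.StanbolTransformer":

import java.util.Map;

import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.Transformer;

public class StanbolTransformer extends Transformer {

    @Override
    public Object transformRow(Map<String, Object> row, Context context) {
        Object content = row.get("content"); // source column name is an assumption
        if (content != null) {
            // enrich the row before it becomes a Solr document
            row.put("NLP_PERSON", enrich((String) content));
        }
        return row;
    }

    private String enrich(String content) {
        // placeholder: call Stanbol, e.g. via the stanbolPost() shown earlier
        return content;
    }
}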


How to run a subsequent update query to documents indexed from a dataimport query

2014-01-22 Thread Dileepa Jayakody
Hi All,

I have a Solr requirement to send all the documents imported from a
/dataimport query through another update chain as a separate
background process.

Currently I have configured my custom update chain in the /dataimport
handler itself. But since my custom update process needs to connect to an
external enhancement engine (Apache Stanbol) to enhance the documents with
some NLP fields, it has a negative impact on the /dataimport process.
The solution would be to have a separate update process running to enhance
the content of the documents imported from /dataimport.

Currently I have configured my custom Stanbol Processor as below in my
/dataimport handler.

<requestHandler name="/dataimport" class="solr.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
    <str name="update.chain">stanbolInterceptor</str>
  </lst>
</requestHandler>

<updateRequestProcessorChain name="stanbolInterceptor">
  <processor class="com.solr.stanbol.processor.StanbolContentProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>


What I need now is to separate the two processes of dataimport and
stanbol-enhancement.
So this is like running a separate re-indexing process periodically over the
documents imported from /dataimport for Stanbol fields.

The question is how to trigger my Stanbol update process for the documents
imported from /dataimport.
In Solr, to trigger an /update query we need to know the id and the fields of
the document to be updated. In my case I need to run all the documents
imported by the previous /dataimport process through a Stanbol
update.chain.
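
For reference, the update.chain parameter can also be supplied per request,
so a separate client could push previously imported documents through the
stanbolInterceptor chain while /dataimport keeps a plain chain; a minimal
SolrJ sketch, assuming Solr 4.x (URL and id are illustrative):

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.UpdateRequest;
import org.apache.solr.common.SolrInputDocument;

public class ChainPerRequestSketch {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1"); // the document to push through the chain (plus any fields it needs)
        UpdateRequest req = new UpdateRequest();
        req.setParam("update.chain", "stanbolInterceptor"); // select the chain for this request only
        req.add(doc);
        req.process(server);
        server.commit();
        server.shutdown();
    }
}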

Is there a way to keep track of the document ids imported from
/dataimport?
Any advice or pointers will be really helpful.

Thanks,
Dileepa


Re: How to run a subsequent update query to documents indexed from a dataimport query

2014-01-22 Thread Dileepa Jayakody
Hi All,

I did some research on this and found some alternatives useful to my
use case. Please give your ideas.

Can I update all documents indexed after a /dataimport query using the
last_index_time in dataimport.properties?
If so can anyone please give me some pointers?
What I currently have in mind is something like below;

1. Store the indexing timestamp of the document as a field, e.g.:
<field name="timestamp" type="date" indexed="true" stored="true"
       default="NOW" multiValued="false"/>

2. Read the last_index_time from the dataimport.properties

3. Query the ids of all documents indexed after the last_index_time and send
them through the Stanbol update processor (a query sketch follows this list).
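
A minimal sketch of step 3 with SolrJ, assuming Solr 4.x, a hypothetical core
URL, and a last_index_time value already read from dataimport.properties and
converted to Solr's ISO date format:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrDocumentList;

public class NewlyIndexedIds {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");
        // value read from dataimport.properties (illustrative)
        String lastIndexTime = "2014-01-23T06:51:00Z";
        SolrQuery q = new SolrQuery("timestamp:[" + lastIndexTime + " TO NOW]");
        q.setFields("id");
        q.setRows(1000); // page through larger result sets in practice
        SolrDocumentList ids = server.query(q).getResults();
        System.out.println("documents to enhance: " + ids.getNumFound());
        server.shutdown();
    }
}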

But I have a question here;
Does the last_index_time refer to when the dataimport is
started (onImportStart) or when the dataimport is finished (onImportEnd)?
If it's the onImportEnd timestamp, then this solution won't work, because the
timestamp indexed in the document field will satisfy: onImportStart <
doc-index-timestamp < onImportEnd.


Another alternative I can think of is to trigger an update chain via an
EventListener configured to run after a dataimport is processed
(onImportEnd).
In this case can the context in DIH give the list of document ids processed
in the /dataimport request? If so I can send those doc ids with an /update
query to run the Stanbol update process.

Please give me your ideas and suggestions.

Thanks,
Dileepa



