RE: DIH import and postImportDeleteQuery

2011-05-25 Thread Ephraim Ofir
Search the list for my post "DIH - deleting documents, high performance
(delta) imports, and passing parameters", which shows my solution to a
similar problem.

Ephraim Ofir

-----Original Message-----
From: Alexandre Rocco [mailto:alel...@gmail.com] 
Sent: Tuesday, May 24, 2011 11:24 PM
To: solr-user@lucene.apache.org
Subject: DIH import and postImportDeleteQuery

Guys,

I am facing a situation in one of our projects where I need to perform a
cleanup to remove some documents after we perform an update via DIH.
The big issue right now is that when we call DIH with clean=false, the
postImportDeleteQuery is not executed.

My setup is currently arranged like this:
- A SQL Server stored procedure receives a parameter (specified in the
URL) and returns the records to be indexed.
- The procedure can return all the records (for a full-import) or
only the updated records (for a delta-import).
- The procedure returns both valid and deleted records, hence the need
to run a postImportDeleteQuery to remove the deleted ones.

Everything works fine when I run a full-import: I always run it with
clean=true, and the whole index is rebuilt.
When I need to do an incremental update, the records are updated correctly,
but the command to delete the other records is not executed.

I've tried several combinations, with different results:
- Running full-import with clean=false: the records are updated, but the
ones that need to be deleted stay in the index.
- Running delta-import with clean=false: the records are updated, but the
ones that need to be deleted stay in the index.
- Running delta-import with clean=true: all records are deleted from the
index, and then only the records returned by the procedure are in the
index, except the deleted ones.

I don't see any way to achieve my goal without changing the process I use
to obtain the data.
Since this is a very complex stored procedure, with tons of joins and custom
processing, I am trying everything to avoid messing with it.

Below is a copy of my data-config.xml file. I simplified it by leaving out
the full field list, since it is outside the scope of the issue:
<?xml version="1.0" encoding="UTF-8" ?>
<dataConfig>
  <dataSource type="JdbcDataSource"
              driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
              url="jdbc:sqlserver://myserver;databaseName=mydb;user=username;password=password;responseBuffering=adaptive;" />
  <document>
    <entity name="entity_one"
            pk="entityid"
            transformer="RegexTransformer"
            query="EXEC some_stored_procedure ${dataimporter.request.someid}"
            preImportDeleteQuery="status:1" postImportDeleteQuery="status:1">
      <field column="field1" name="field1" splitBy=";" />
      <field column="field2" name="field2" splitBy=";" />
      <field column="field3" name="field3" splitBy=";" />
    </entity>

    <entity name="entity_two"
            pk="entityid"
            transformer="RegexTransformer"
            query="EXEC someother_stored_procedure ${dataimporter.request.someotherid}"
            preImportDeleteQuery="status:1" postImportDeleteQuery="status:1">
      <field column="field1" name="field1" />
      <field column="field2" name="field2" />
      <field column="field3" name="field2" />
    </entity>
  </document>
</dataConfig>

Any ideas or pointers that might help on this one?

Many thanks,
Alexandre


Re: DIH import and postImportDeleteQuery

2011-05-25 Thread Alexandre Rocco
Hi Ephraim,

Thank you so much for the input.
I was able to find your thread in the archives and got your solution to
work.

In fact, using $deleteDocById and $skipDoc worked like a charm. This
feature is very useful; it's a shame it's not properly documented.
The only downside is the one you mentioned: the stats are not updated,
so if I update 13 documents and delete 2, DIH tells me that only 13
documents were processed. This is bad in my case because I check the end
result to generate an error e-mail if needed.
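
For readers following along, here is a minimal sketch of one way to set
$deleteDocById and $skipDoc from within data-config.xml, using DIH's
ScriptTransformer. It is not necessarily how the solution referenced above
does it. The function name flagDeleted is made up for illustration, and the
sketch assumes that the stored procedure returns a "status" column in which
1 marks a deleted record, and that "entityid" holds the Solr uniqueKey value.

```xml
<?xml version="1.0" encoding="UTF-8" ?>
<dataConfig>
  <dataSource type="JdbcDataSource"
              driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
              url="jdbc:sqlserver://myserver;databaseName=mydb;user=username;password=password;responseBuffering=adaptive;" />
  <script><![CDATA[
    // Hypothetical helper, not from the thread.
    // Assumption: status == 1 means the record was deleted at the source.
    function flagDeleted(row) {
      if (String(row.get('status')) == '1') {
        // Ask DIH to delete the indexed document with this uniqueKey value...
        row.put('$deleteDocById', row.get('entityid'));
        // ...and to skip indexing the row itself.
        row.put('$skipDoc', 'true');
      }
      return row;
    }
  ]]></script>
  <document>
    <!-- Transformers are applied in the order listed. -->
    <entity name="entity_one"
            pk="entityid"
            transformer="RegexTransformer,script:flagDeleted"
            query="EXEC some_stored_procedure ${dataimporter.request.someid}">
      <field column="field1" name="field1" splitBy=";" />
      <field column="field2" name="field2" splitBy=";" />
      <field column="field3" name="field3" splitBy=";" />
    </entity>
  </document>
</dataConfig>
```

With something like this in place, the import can run with clean=false and
the stored procedure itself stays untouched, since the deletion flags are
derived from columns it already returns.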

You also mentioned that if the query contains only deletion records, a
commit would not be automatically executed and it would be necessary to
commit manually.

How can I commit manually via DIH? I was not able to find any reference to
this in the documentation.
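
As a rough sketch of one option: an explicit commit can be sent to Solr's
regular XML update handler rather than through DIH. The message below would
be the body of an HTTP POST to the /update handler; the host, port and core
path are deployment-specific and are not given in this thread.

```xml
<!-- Body of an HTTP POST to the /update handler, sent with
     Content-Type: text/xml. Makes pending adds and deletes visible
     to searchers. -->
<commit/>
```

This could be issued by whatever process triggers the import, once DIH
reports that the import has finished.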

Thanks!
Alexandre


RE: DIH import and postImportDeleteQuery

2011-05-25 Thread Dyer, James
The "failure to commit" bug with $deleteDocById can be fixed by applying patch 
SOLR-2492.  This patch also partially fixes the "no updated stats" bug, in that 
it increments the count by 1 for every call to $deleteDocById and 
$deleteDocByQuery.  Note that this might result in inaccurate counts if the id 
given with $deleteDocById doesn't exist or is duplicated.  Obviously this is not 
a complete fix for stats using $deleteDocByQuery, as this command would normally 
be used to delete more than 1 doc at a time.
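
To illustrate why a per-call count cannot be exact for $deleteDocByQuery,
here is a small hypothetical sketch. The function name purgeChildren and the
"parentid" index field are invented for the example, and "status" = 1 is
again assumed to mark a deleted record. A single row issues one delete
command, but that command may remove any number of documents.

```xml
<?xml version="1.0" encoding="UTF-8" ?>
<dataConfig>
  <dataSource type="JdbcDataSource"
              driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
              url="jdbc:sqlserver://myserver;databaseName=mydb;user=username;password=password;responseBuffering=adaptive;" />
  <script><![CDATA[
    // Hypothetical helper, not from the thread or the patch.
    function purgeChildren(row) {
      if (String(row.get('status')) == '1') {
        // One command, but it matches every document whose (assumed)
        // parentid field equals this row's entityid.
        row.put('$deleteDocByQuery', 'parentid:' + row.get('entityid'));
        row.put('$skipDoc', 'true');
      }
      return row;
    }
  ]]></script>
  <document>
    <entity name="entity_one"
            pk="entityid"
            transformer="script:purgeChildren"
            query="EXEC some_stored_procedure ${dataimporter.request.someid}">
      <field column="field1" name="field1" />
    </entity>
  </document>
</dataConfig>
```

With counting per call as described above, such a row would add 1 to the
stats regardless of how many documents the query actually matched.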

The patch is for Trunk but it might work with 3.1 also.  If not, it likely only 
needs minor tweaking.  

The jira ticket is here:  https://issues.apache.org/jira/browse/SOLR-2492

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311


Re: DIH import and postImportDeleteQuery

2011-05-25 Thread Alexandre Rocco
Hi James,

Thanks for the heads up!
I am currently on version 1.4.1, so I can apply this patch and see if it
works.
I just need to assess whether it's best to apply the patch or to check in the
backend system whether only delete requests were generated and, in that case,
skip the DIH call.

Previously, I found another open issue, created by Ephraim:
https://issues.apache.org/jira/browse/SOLR-2104

It's the same issue, but it hasn't had any updates yet.

Regards,
Alexandre


RE: DIH import and postImportDeleteQuery

2011-05-25 Thread Dyer, James
Great.  I wasn't aware of the other issue.  I linked the 2 issues in JIRA so 
people can find them in the future.

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311

