Re: Regarding Transaction logging

2011-10-26 Thread Simon Willnauer
I uploaded a patch to LUCENE-3424 which implements sequence ids for
IW. Add, update, and delete return a long seqID for every operation,
and commit returns the largest committed seq id.

When writing transaction logs or a journal (whatever you wanna call it),
the biggest problem is that in a multithreaded environment operations
on the IW don't return in order, so you basically have two options:
1. build a barrier and synchronize the operations as they arrive, or
2. somehow sort the logs once they need to be applied. The first option
seems like a total waste and kills concurrency entirely. Some apps
might be able to tell if two events are independent and guarantee the
order of dependent events on top of IW (like ES does), yet for Lucene
in general this is not always true since we don't have a fixed primary
key. The second option provides nice concurrency and optimizes for the
non-failure case: unless you need to replay the logs, you can minimize
the concurrency overhead. If logs are replayed it somehow needs to be
done in two or more steps (1. re-sort the seq ids / offsets, 2. read
the entries in order based on 1.).
The biggest issue I see here is that not being able to read the logs
sequentially from disk is almost certainly a perf hit. In a real-world
system there could even be a background process that reorders /
compacts the logs.
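
To make the two-step replay concrete, here is a minimal sketch
(LogEntry and replay are made-up names, not part of the patch):

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative only: a log entry is the seq id plus the file offset of
// the payload; replay re-sorts by seq id, then reads in that order.
final class LogEntry {
  final long seqId;
  final long fileOffset;
  LogEntry(long seqId, long fileOffset) {
    this.seqId = seqId;
    this.fileOffset = fileOffset;
  }
}

final class LogReplay {
  static void replay(List<LogEntry> entries) {
    List<LogEntry> sorted = new ArrayList<>(entries); // step 1: re-sort
    sorted.sort(Comparator.comparingLong(e -> e.seqId));
    for (LogEntry e : sorted) {
      // step 2: seek(e.fileOffset) and re-apply the operation; these
      // seeks are random, which is exactly the sequential-read perf hit
    }
  }
}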

When we replicate documents to another machine with a leader per shard,
which seems to be the way Solr is going (and ES is doing too?),
sequence ids can be used to disambiguate documents with the same ID if
you keep track of the ids you indexed in your current session. For
instance, if you update doc X with seq id N but you have already seen
doc X with seq id N+1, you can simply drop it.
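
In code, that rule could look like this (illustrative names only, not
part of the patch):

import java.util.concurrent.ConcurrentHashMap;

// Keep the highest seq id seen per document ID in the current session
// and drop any replicated update that arrives with a smaller one.
final class ReplicaDedup {
  private final ConcurrentHashMap<String, Long> maxSeqSeen =
      new ConcurrentHashMap<>();

  boolean shouldApply(String docId, long seqId) {
    long winner = maxSeqSeen.merge(docId, seqId, Math::max);
    return winner == seqId; // false if we already saw doc X with N+1
  }
}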

I would be interested in feedback, especially on the transaction log ordering.

simon



Re: Regarding Transaction logging

2011-09-11 Thread Michael McCandless
I agree: we should figure out just how an app would effectively make
use of this seq ID, in order to understand if this really is gonna
work end to end.  Else we shouldn't change Lucene's core APIs.

EG: could ES remove its lock array if Lucene returned a seq ID?  How
bad is it that ES/Solr/this-new-module would have to order their
transaction log according to Lucene's seq ID?  Or maybe it would not
re-order, but rather write the seqID+document in each entry; then on
playback (but also on RT get) it'd have to re-order?

Mike McCandless

http://blog.mikemccandless.com




Re: Regarding Transaction logging

2011-09-10 Thread Simon Willnauer
On Thu, Sep 8, 2011 at 5:35 PM, Yonik Seeley yo...@lucidimagination.com wrote:
 On Thu, Sep 8, 2011 at 11:26 AM, Michael McCandless
 luc...@mikemccandless.com wrote:
 Returning a long seqID seems the least invasive change to make this
 total ordering possible?  Especially since the DWDQ already computes
 this order...

 +1
 This seems like the most powerful option.

I still wonder how we make efficient use of this. If you are ordering
the logs based on the returned sequence ids you have to effectively
delay writing to the log, since documents, i.e. their threads, come
back async and out of order. Even worse, if some thread picks up a
flush it might block for a reasonable amount of time. I am not saying
it's impossible, but before we jump on it and get into the DWPT hassle
we should at least sketch out how to make use of this feature (lemme
tell you, this is not trivial to implement and requires a fair bit of
refactoring). If somebody has thought about this I'd be happy if you
could share your ideas here!
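
Just to illustrate the delay I mean, a minimal sketch of the buffering
you'd need (assuming gapless seq ids starting at 1; none of this is
real Lucene code):

import java.util.PriorityQueue;

// Completed ops wait in a priority queue until every smaller seq id
// has arrived; only then may they be appended to the log in order.
final class ReorderBuffer {
  private final PriorityQueue<long[]> pending =
      new PriorityQueue<>((a, b) -> Long.compare(a[0], b[0])); // [seqId, payloadRef]
  private long nextToWrite = 1;

  synchronized void completed(long seqId, long payloadRef) {
    pending.add(new long[] {seqId, payloadRef});
    while (!pending.isEmpty() && pending.peek()[0] == nextToWrite) {
      long[] head = pending.poll();
      // appendToLog(head[1]); // a thread stuck in a flush stalls this loop
      nextToWrite++;
    }
  }
}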

simon




Re: Regarding Transaction logging

2011-09-09 Thread Simon Willnauer
I created LUCENE-3424 for this. But I still would like to keep the
discussion open here rather than moving this entirely to an issue.
There is more about this than only the seq. ids.

simon




Re: Regarding Transaction logging

2011-09-09 Thread Andrzej Bialecki

On 09/09/2011 11:00, Simon Willnauer wrote:

I created LUCENE-3424 for this. But I still would like to keep the
discussion open here rather than moving this entirely to an issue.
There is more about this than only the seq. ids.


I'm also concerned about the content of the transaction log. In Solr
it uses javabin-encoded UpdateCommand-s (either SolrInputDocuments or
Delete/Commit commands). Documents in the log are raw documents, i.e.
before analysis.

This may have some merits for Solr (e.g. you could imagine having
different analysis chains on the Solr slaves), but IMHO it's more of a
hassle for Lucene, because it means that the analysis has to be
repeated over and over again on all clients. If the analysis chain is
costly (e.g. NLP) then it would make sense to have an option to log
documents post-analysis, i.e. as correctly typed stored values (e.g.
string -> numeric) AND the resulting TokenStream-s. This also has the
advantage of moving us towards the "dumb IndexWriter" concept, i.e.
separating analysis from the core inverted index functionality.

So I'd argue for recording post-analysis docs in the tlog, either
exclusively or as a default option.
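
A rough sketch of what capturing the analyzed form could look like
(the record format here is invented; only the TokenStream consumption
is standard Lucene):

import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

final class PostAnalysisCapture {
  // Run the analysis chain once and record term + position increment
  // per token, so replicas can index without repeating the analysis.
  static List<String> capture(Analyzer analyzer, String field, String text)
      throws Exception {
    List<String> entry = new ArrayList<>();
    TokenStream ts = analyzer.tokenStream(field, new StringReader(text));
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    PositionIncrementAttribute posInc =
        ts.addAttribute(PositionIncrementAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      entry.add(posInc.getPositionIncrement() + ":" + term.toString());
    }
    ts.end();
    ts.close();
    return entry;
  }
}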


--
Best regards,
Andrzej Bialecki



Re: Regarding Transaction logging

2011-09-09 Thread eks dev
+1
Indeed! All possibilities are needed.

One might do wild things if it is somehow typed. For example,
dictionary compression for fields that are tokenized (not only
stored), as we already have a Term dictionary supporting ord-s:
keeping just a map Token -> ord with the transaction log...







Re: Regarding Transaction logging

2011-09-09 Thread Andrzej Bialecki

On 09/09/2011 12:07, eks dev wrote:

+1
Indeed! All possibilities are needed.

One might do wild things if it is somehow typed. For example,
dictionary compression for fields that are tokenized (not only
stored), as we already have a Term dictionary supporting ord-s:
keeping just a map Token -> ord with the transaction log...


Hmm, you mean a per-doc map? Because a global map would have to be
updated as we add new docs, which would make the writing process
non-atomic, which is the last thing you want from a transaction log :)

As per-doc compression, sure. In fact, what you describe is
essentially a single-doc mini-index, because the map is a term dict,
the token streams with ords are postings, etc.
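
Something like this per-doc structure, I mean (all names invented):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A single-doc mini-index: the term -> ord map is the per-doc term
// dict, and the token stream re-encoded as ords plays the postings.
final class PerDocMiniIndex {
  final List<String> dict = new ArrayList<>();       // ord -> term
  final Map<String, Integer> ords = new HashMap<>(); // term -> ord
  final List<Integer> tokens = new ArrayList<>();    // token stream as ords

  void add(String token) {
    Integer ord = ords.get(token);
    if (ord == null) {
      ord = dict.size();
      dict.add(token);
      ords.put(token, ord);
    }
    tokens.add(ord);
  }
}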


--
Best regards,
Andrzej Bialecki



Re: Regarding Transaction logging

2011-09-09 Thread eks dev
I didn't think it through, it was just a spontaneous reaction :)

At the moment I am using static dictionaries to at least get a grip on
the size of stored fields (escaping encoded terms).

Re: Global
Maybe the trick would be to somehow use the term dictionary, as it
must be *eventually* updated? An idea is to write the raw token stream
for atomicity and reduce it later in a compaction phase (e.g. on
Lucene commit())... no matter what we plan to do, TL compaction is
going to be needed?

It is a slightly moving-target problem (the TL chases the term
dictionary), but I am sure the benefits can be huge. A compacted TL
entry would need to have a pointer to the Term[] used to encode it,
but this is by all means doable, just a simple Term[].

It surely does not make much sense for high-cardinality fields, but if
you have something with low cardinality (indexed and stored) on a big
(100Mio) collection, this reduces space by exorbitant amounts.

I do not know, just trying to build upon the fact that we have the
term dictionary updated in any case.

This works not only for transaction logging, but also for
(analyzed) stored/indexed fields. By the way, I never looked at how
our term vectors work: do they keep a reference to the token or a
verbatim copy of the term?








Re: Regarding Transaction logging

2011-09-09 Thread Andrzej Bialecki

On 09/09/2011 13:20, eks dev wrote:

I didn't think it through, it was just a spontaneous reaction :)

At the moment I am using static dictionaries to at least get a grip on
the size of stored fields (escaping encoded terms).

Re: Global
Maybe the trick would be to somehow use the term dictionary, as it
must be *eventually* updated? An idea is to write the raw token stream
for atomicity and reduce it later in a compaction phase (e.g. on
Lucene commit())... no matter what we plan to do, TL compaction is
going to be needed?


Compaction - not sure, it would have to preserve the ordering of ops.
But some form of primitive compression - certainly: delta coding,
vints, etc., anything that can be done per doc, without the need to
use data that spans more than one record.
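
For instance, per-record delta coding with vints needs nothing outside
the record. A sketch (not Lucene's actual vint code):

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Encode an ascending sequence (e.g. ords or offsets within one
// record) as deltas, each written as a variable-length int.
final class VIntDelta {
  static byte[] encode(int[] ascending) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(bytes);
    int prev = 0;
    for (int v : ascending) {
      writeVInt(out, v - prev); // small gaps -> few bytes
      prev = v;
    }
    return bytes.toByteArray();
  }

  static void writeVInt(DataOutputStream out, int value) throws IOException {
    while ((value & ~0x7F) != 0) { // 7 payload bits per byte, high bit = more
      out.writeByte((value & 0x7F) | 0x80);
      value >>>= 7;
    }
    out.writeByte(value);
  }
}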




It is a slightly moving-target problem (the TL chases the term
dictionary), but I am sure the benefits can be huge. A compacted TL
entry would need to have a pointer to the Term[] used to encode it,
but this is by all means doable, just a simple Term[].

It surely does not make much sense for high-cardinality fields, but if
you have something with low cardinality (indexed and stored) on a big
(100Mio) collection, this reduces space by exorbitant amounts.

I do not know, just trying to build upon the fact that we have the
term dictionary updated in any case.


If the tlog has a Commit op, then you could theoretically compact all 
preceding entries ... at least their term dicts. If you compacted the 
postings, too, then you would essentially have a multi-doc index (naked 
segment), but it would not be a transaction log anymore, because the 
update ordering wouldn't be preserved (e.g. intermediate Delete ops 
would have a different effect).





This works not only for transaction logging, but also for
(analyzed) stored/indexed fields. By the way, I never looked at how
our term vectors work: do they keep a reference to the token or a
verbatim copy of the term?


It's like term dict + postings; terms are delta/front-coded like the
main term dictionary. It does not reuse terms from the main dict; I
think this representation was chosen to avoid ord renumbering when the
main term dict is updated - you would have to renumber all term
vectors on each commit...


--
Best regards,
Andrzej Bialecki



Re: Regarding Transaction logging

2011-09-09 Thread Simon Willnauer
On Fri, Sep 9, 2011 at 11:19 AM, Andrzej Bialecki a...@getopt.org wrote:
 [...]
 So I'd argue for recording post-analysis docs in the tlog, either
 exclusively or as a default option.

I am not sure if this should be the default option, but I would need
to see how this is implemented. If we can efficiently support such a
pre-analyzed document I am all for it. But I think it should be
possible to write opaque documents too: other implementations / users
of Lucene should be able to write their app-specific format as well.
simon




Regarding Transaction logging

2011-09-08 Thread Simon Willnauer
hey folks,

we already have transaction logging on the Solr side, so I should have
started this discussion earlier. However, I want to bring this up on
the list since I think this is a very valuable feature for plain
Lucene users too, and eventually it should also be available to them.
I don't think this needs to be a core feature at all, but I think we
need to provide the necessary hooks in Lucene core to make this
reliable and consistent. My concern is that with the current extension
mechanism we provide on the IndexWriter side, this feature can only be
implemented in a sub-optimal way in Solr (or basically on top of
Lucene), but lemme elaborate on this a little.

IndexWriter doesn't provide any transaction guarantees, nor does it
give any guarantees on ordering. So if you index two versions of a
document with the same delete key you can't tell which one wins unless
you prevent IW from seeing those two documents at the same time, i.e.
locking before you hit IW. This is basically what other
implementations like ElasticSearch do: ES uses locks assigned to
buckets in an array, selected based on the delete term's hash.
However, this gets a little more complex once you get to
DeleteQueries, where you can't tell which documents are affected, so
they might be misplaced in the transaction log if its order doesn't
match the order the IW sees. Under the hood IW does maintain such an
order inside the DocumentsWriterDeleteQueue, which could be utilized
to provide a total ordering that IMO should be reflected in the
transaction log.
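
For reference, the bucket-locking scheme looks roughly like this (my
simplified sketch, not ES's actual code):

import java.util.concurrent.locks.ReentrantLock;

// Operations on the same delete key hash to the same lock, so their
// order into the tlog and into IW matches; unrelated keys rarely
// contend.
final class KeyedLocks {
  private final ReentrantLock[] locks;

  KeyedLocks(int buckets) {
    locks = new ReentrantLock[buckets];
    for (int i = 0; i < buckets; i++) {
      locks[i] = new ReentrantLock();
    }
  }

  ReentrantLock lockFor(String deleteKey) {
    return locks[(deleteKey.hashCode() & 0x7fffffff) % locks.length];
  }
}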

Before I propose ways this could be implemented, I want to check
whether others think we should provide more reliable ways for users
with a need for durability and consistent recovery.

simon




Re: Regarding Transaction logging

2011-09-08 Thread Andrzej Bialecki

On 08/09/2011 11:35, Simon Willnauer wrote:

[...]

Before I propose ways this could be implemented, I want to check
whether others think we should provide more reliable ways for users
with a need for durability and consistent recovery.


I think a guarantee about total ordering is a must if you want to
replay the log on the slaves; otherwise results on the slaves will be
unpredictable. Maybe a deleteByQuery should enforce a commit first?



--
Best regards,
Andrzej Bialecki



Re: Regarding Transaction logging

2011-09-08 Thread Yonik Seeley
On Thu, Sep 8, 2011 at 5:35 AM, Simon Willnauer
simon.willna...@googlemail.com wrote:
 I don't think this needs to be a core feature at all but I think we need
 to provide the necessary hooks in Lucene core to make this reliable
 and consistent.

I've thought about it a little - it would be really helpful if a
sequence id were returned from IndexWriter operations:

public void addDocument()     -> public long addDocument()
public void deleteDocuments() -> public long deleteDocuments()
public void commit()          -> public long commit()

And of course the id returned by commit would mean that everything
less than that id would be in the commit.
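
Hypothetical usage, assuming those signatures existed (they are not
what IW returns today):

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

final class SeqIdUsage {
  static void example(IndexWriter writer, Document doc) throws Exception {
    long addSeq = writer.addDocument(doc);                      // proposed long return
    long delSeq = writer.deleteDocuments(new Term("id", "42")); // proposed long return
    long commitSeq = writer.commit();                           // proposed long return
    // every op whose seq id is less than commitSeq is in the commit
    assert addSeq < commitSeq && delSeq < commitSeq;
  }
}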

-Yonik
http://www.lucene-eurocon.com - The Lucene/Solr User Conference




Re: Regarding Transaction logging

2011-09-08 Thread Jason Rutherglen
The delete-by-query case is solved by recording the primary key / UID
of the document(s) deleted.  It's only expensive if the transaction
log implementation is not designed properly.  :)




Re: Regarding Transaction logging

2011-09-08 Thread Simon Willnauer
On Thu, Sep 8, 2011 at 4:21 PM, Jason Rutherglen
jason.rutherg...@gmail.com wrote:
 The delete-by-query case is solved by recording the primary key / UID
 of the document(s) deleted.  It's only expensive if the transaction
 log implementation is not designed properly.  :)

Phew, I don't think this is realistic. I mean, this could be a lot of
documents and looking up a lot of primary keys; plus you need to know
what the primary key is, and you somehow need to do this async. I
don't consider this an option.

simon




Re: Regarding Transaction logging

2011-09-08 Thread Jason Rutherglen
This isn't a new problem.  Databases have been around for what, 30+ years?




Re: Regarding Transaction logging

2011-09-08 Thread Simon Willnauer
On Thu, Sep 8, 2011 at 2:54 PM, Yonik Seeley yo...@lucidimagination.com wrote:
 On Thu, Sep 8, 2011 at 5:35 AM, Simon Willnauer
 simon.willna...@googlemail.com wrote:
 I don't think this needs to be a core feature at all but I think we need
 to provide the necessary hooks in Lucene core to make this reliable
 and consistent.

 I've thought about it a little - it would be really helpful if a
 sequence id were returned from IndexWriter operations:

 public void addDocument()     -> public long addDocument()
 public void deleteDocuments() -> public long deleteDocuments()
 public void commit()          -> public long commit()

 And of course the id returned by commit would mean that everything
 less than that id would be in the commit.

I actually implemented a prototype that does something similar to
this. The problem here is some impl detail we have in DWPT with
cutting over to a new DelQueue on full flush, but I think we can fix
that without too much trouble.

Another option would be an onDoc(Document, long queueGen, long seqID) /
onDelete(...) callback in DW that is passed a doc / delete. This would
actually be trivial to implement with what we have currently, yet not
ideal: I still wonder how we would resolve the order in the log, since
you might get seq. ids out of order. Another idea would be writing a
log per DWPT, which would be written in seq. id order. Those logs can
be merged sequentially on recovery and we would get concurrent writes
for free. Just an idea though.
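
Roughly, recovery would then be a k-way merge of already-sorted logs,
e.g. (LogCursor is a made-up abstraction, not existing code):

import java.util.List;
import java.util.PriorityQueue;

// Each DWPT log is written in seq id order, so recovery only needs to
// merge the heads of the per-DWPT logs by seq id.
interface LogCursor {
  long seqId();      // seq id of the current entry
  void apply();      // re-apply the current operation
  boolean advance(); // move to the next entry; false at end of log
}

final class RecoveryMerge {
  static void replay(List<LogCursor> logs) { // cursors on first entry
    PriorityQueue<LogCursor> pq =
        new PriorityQueue<>((a, b) -> Long.compare(a.seqId(), b.seqId()));
    pq.addAll(logs);
    while (!pq.isEmpty()) {
      LogCursor c = pq.poll();
      c.apply();     // global seq id order across all logs
      if (c.advance()) {
        pq.add(c);   // reads within each file stay sequential
      }
    }
  }
}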

simon





Re: Regarding Transaction logging

2011-09-08 Thread Michael McCandless
+1 for having a contrib/transactionlog that apps could use, outside of
Solr/ElasticSearch.

And it sounds like one cannot build such a thing unless one forces an
order above Lucene (like ElasticSearch does), or we make it possible
to see/control the order of ops inside IW?

Even ES's approach is limited, since it only works because ES only
deletes by ID, and not by an arbitrary Term or Query?  This way ES
only must ensure the order when the same ID is being updated; the
order across different IDs is unimportant.

Returning a long seqID seems the least invasive change to make this
total ordering possible?  Especially since the DWDQ already computes
this order...

This would presumably mean, as long as ES cut over to ordering the
entries in the transaction log according to the returned seqID, that
it could then remove the array of locks and freely allow even docs w/
the same ID to be updated at once, with IW picking which one wins?

I hope this will not somehow mean that apps (or IW) will need/want to
suddenly save arrays mapping docID (or appID) to seqID...

Mike McCandless

http://blog.mikemccandless.com




Re: Regarding Transaction logging

2011-09-08 Thread Yonik Seeley
On Thu, Sep 8, 2011 at 11:26 AM, Michael McCandless
luc...@mikemccandless.com wrote:
 Returning a long seqID seems the least invasive change to make this
 total ordering possible?  Especially since the DWDQ already computes
 this order...

+1
This seems like the most powerful option.

-Yonik
http://www.lucene-eurocon.com - The Lucene/Solr User Conference
