Re: MergePolicy Thresholds

2011-05-21 Thread Michael McCandless
Thanks Tom!

Sounds like great fun working with such massive data sets :)

Mike

http://blog.mikemccandless.com

On Fri, May 20, 2011 at 7:03 PM, Burton-West, Tom tburt...@umich.edu wrote:
 Hi Mike and Shai,



 I was able to index a few documents with the tieredMergePolicy, but I was
 hoping to build a large test index of about 700,000 documents to compare the
 performance against our previous runs.  I was hoping I would be able to
 report on my results in time for the Lucene Revolution conference.
 Unfortunately, there was a power outage at our data center last week, which
 resulted in a node failure in one of our storage nodes.  Node rebalancing
 for a cluster of 500 terabytes takes quite a while and totally messes up
 performance measurements.  (Our 6-8 terabytes of large-scale search indexes
 share storage with the repository that holds the 480+ terabytes of page
 images and metadata for the 8 million+ books.)  Hopefully I will be able to
 run the tests when I get back.



 Tom



 From: Burton-West, Tom [mailto:tburt...@umich.edu]
 Sent: Monday, May 09, 2011 4:10 PM

 To: dev@lucene.apache.org
 Subject: RE: MergePolicy Thresholds



 Thanks again Shai and Mike.



 Am in the process of downloading and building r108.  Should be able to
 build a test index sometime this week.  I’ll make some guesses on what
 parameters to use based on our previous tests.



 Tom

 From: Shai Erera [mailto:ser...@gmail.com]
 Sent: Saturday, May 07, 2011 11:33 PM
 To: dev@lucene.apache.org
 Subject: Re: MergePolicy Thresholds



 Hey Tom,

 Mike back-ported the changes to 3x, so you can try it out.

 FYI,
 Shai

 On Tue, May 3, 2011 at 9:33 PM, Burton-West, Tom tburt...@umich.edu wrote:

 Thanks Shai and Mike!

 I'll keep an eye on LUCENE-1076.

 Tom

 -Original Message-
 From: Michael McCandless [mailto:luc...@mikemccandless.com]

 Sent: Tuesday, May 03, 2011 11:15 AM
 To: dev@lucene.apache.org
 Subject: Re: MergePolicy Thresholds

 Thanks Shai!

 I'm way behind on my 3.x backports -- I'll try to do this soon.

 Mike

 http://blog.mikemccandless.com

 On Tue, May 3, 2011 at 8:10 AM, Shai Erera ser...@gmail.com wrote:
 I uploaded a patch to LUCENE-1076.

 Tom, apparently the patch I've attached before cannot be used, because
 there
 are dependencies (in earlier commits on LUCENE-1076) that need to be
 back-ported as well. So stay tuned on LUCENE-1076 for when it is safe to
 use
 this new MP.

 Shai

 On Tue, May 3, 2011 at 1:00 PM, Michael McCandless
 luc...@mikemccandless.com wrote:

 That'd be great, thanks :)

 Yes, let's iterate on the issue!  But: it should still be open, I hope
 (I didn't mean to close it yet, since it's not back ported)...

 Mike

 http://blog.mikemccandless.com

 On Tue, May 3, 2011 at 5:51 AM, Shai Erera ser...@gmail.com wrote:
  Mike, if you want, I can back-port it, as I've already started this
  when
  preparing the patch.
 
  I noticed that you added a throws IOE to IW.setInfoStream -- is it ok
  on
  3x too? It'll be a backwards change.
 
  Maybe we should iterate on the issue? I can reopen.
 
  Shai
 
  On Tue, May 3, 2011 at 12:36 PM, Michael McCandless
  luc...@mikemccandless.com wrote:
 
  Looks good Shai!
 
  Comments below too:
 
  On Tue, May 3, 2011 at 5:29 AM, Shai Erera ser...@gmail.com wrote:
   Hi
  
   I looked into porting it to 3x, and prepared the attached patch. It
   only
   contains the new TieredMP and Test, as well as the necessary changes
   to
   LuceneTestCase and IndexWriter. I guess you can start with it (even
   just
   the
   MP and IW changes) to test it on your indexes.
  
   Mike, I saw that there were many more changes, as part of
   LUCENE-1076,
   done
   to the code. In particular, this MP is now the default (on trunk),
   so
   I
   guess many changes (to tests) were needed because of that. Do you
   remember,
   if apart from the changes I've included in the patch, other
   important
   changes w.r.t. this code?
 
  The only other changes I can think of were some verbosity improvements
  to IndexWriter, to support the python script that can make a merge
  movie from an infoStream output; but that can wait for when I
  back-port to 3.x...
 
   As we won't change the default MP on 3x, I'm guessing I don't need
   to
   port
   all the changes to 3x.
 
  Right, I think.
 
  Mike
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
  For additional commands, e-mail: dev-h...@lucene.apache.org
 
 
 


RE: MergePolicy Thresholds

2011-05-20 Thread Burton-West, Tom
Hi Mike and Shai,

I was able to index a few documents with the tieredMergePolicy, but I was
hoping to build a large test index of about 700,000 documents to compare the
performance against our previous runs.  I was hoping I would be able to report
on my results in time for the Lucene Revolution conference.  Unfortunately,
there was a power outage at our data center last week, which resulted in a node
failure in one of our storage nodes.  Node rebalancing for a cluster of 500
terabytes takes quite a while and totally messes up performance measurements.
(Our 6-8 terabytes of large-scale search indexes share storage with the
repository that holds the 480+ terabytes of page images and metadata for the 8
million+ books.)  Hopefully I will be able to run the tests when I get back.

Tom

From: Burton-West, Tom [mailto:tburt...@umich.edu]
Sent: Monday, May 09, 2011 4:10 PM
To: dev@lucene.apache.org
Subject: RE: MergePolicy Thresholds

Thanks again Shai and Mike.

Am in the process of downloading and building r108.  Should be able to
build a test index sometime this week.  I'll make some guesses on what 
parameters to use based on our previous tests.

Tom
From: Shai Erera [mailto:ser...@gmail.com]
Sent: Saturday, May 07, 2011 11:33 PM
To: dev@lucene.apache.org
Subject: Re: MergePolicy Thresholds

Hey Tom,

Mike back-ported the changes to 3x, so you can try it out.

FYI,
Shai
On Tue, May 3, 2011 at 9:33 PM, Burton-West, Tom tburt...@umich.edu wrote:
Thanks Shai and Mike!

I'll keep an eye on LUCENE-1076.

Tom

-Original Message-
From: Michael McCandless [mailto:luc...@mikemccandless.com]
Sent: Tuesday, May 03, 2011 11:15 AM
To: dev@lucene.apache.org
Subject: Re: MergePolicy Thresholds
Thanks Shai!

I'm way behind on my 3.x backports -- I'll try to do this soon.

Mike

http://blog.mikemccandless.com

On Tue, May 3, 2011 at 8:10 AM, Shai Erera ser...@gmail.com wrote:
 I uploaded a patch to LUCENE-1076.

 Tom, apparently the patch I've attached before cannot be used, because there
 are dependencies (in earlier commits on LUCENE-1076) that need to be
 back-ported as well. So stay tuned on LUCENE-1076 for when it is safe to use
 this new MP.

 Shai

 On Tue, May 3, 2011 at 1:00 PM, Michael McCandless
 luc...@mikemccandless.com wrote:

 That'd be great, thanks :)

 Yes, let's iterate on the issue!  But: it should still be open, I hope
 (I didn't mean to close it yet, since it's not back ported)...

 Mike

 http://blog.mikemccandless.com

 On Tue, May 3, 2011 at 5:51 AM, Shai Erera ser...@gmail.com wrote:
  Mike, if you want, I can back-port it, as I've already started this when
  preparing the patch.
 
  I noticed that you added a throws IOE to IW.setInfoStream -- is it ok
  on
  3x too? It'll be a backwards change.
 
  Maybe we should iterate on the issue? I can reopen.
 
  Shai
 
  On Tue, May 3, 2011 at 12:36 PM, Michael McCandless
  luc...@mikemccandless.com wrote:
 
  Looks good Shai!
 
  Comments below too:
 
  On Tue, May 3, 2011 at 5:29 AM, Shai Erera ser...@gmail.com wrote:
   Hi
  
   I looked into porting it to 3x, and prepared the attached patch. It
   only
   contains the new TieredMP and Test, as well as the necessary changes
   to
   LuceneTestCase and IndexWriter. I guess you can start with it (even
   just
   the
   MP and IW changes) to test it on your indexes.
  
   Mike, I saw that there were many more changes, as part of
   LUCENE-1076,
   done
   to the code. In particular, this MP is now the default (on trunk), so
   I
   guess many changes (to tests) were needed because of that. Do you
   remember,
   if apart from the changes I've included in the patch, other important
   changes w.r.t. this code?
 
  The only other changes I can think of were some verbosity improvements
  to IndexWriter, to support the python script that can make a merge
  movie from an infoStream output; but that can wait for when I
  back-port to 3.x...
 
   As we won't change the default MP on 3x, I'm guessing I don't need to
   port
   all the changes to 3x.
 
  Right, I think.
 
  Mike
 

Re: MergePolicy Thresholds

2011-05-07 Thread Shai Erera
Hey Tom,

Mike back-ported the changes to 3x, so you can try it out.

FYI,
Shai

On Tue, May 3, 2011 at 9:33 PM, Burton-West, Tom tburt...@umich.edu wrote:

 Thanks Shai and Mike!

 I'll keep an eye on LUCENE-1076.

 Tom

 -Original Message-
 From: Michael McCandless [mailto:luc...@mikemccandless.com]
 Sent: Tuesday, May 03, 2011 11:15 AM
 To: dev@lucene.apache.org
 Subject: Re: MergePolicy Thresholds

 Thanks Shai!

 I'm way behind on my 3.x backports -- I'll try to do this soon.

 Mike

 http://blog.mikemccandless.com

 On Tue, May 3, 2011 at 8:10 AM, Shai Erera ser...@gmail.com wrote:
  I uploaded a patch to LUCENE-1076.
 
  Tom, apparently the patch I've attached before cannot be used, because
 there
  are dependencies (in earlier commits on LUCENE-1076) that need to be
  back-ported as well. So stay tuned on LUCENE-1076 for when it is safe to
 use
  this new MP.
 
  Shai
 
  On Tue, May 3, 2011 at 1:00 PM, Michael McCandless
  luc...@mikemccandless.com wrote:
 
  That'd be great, thanks :)
 
  Yes, let's iterate on the issue!  But: it should still be open, I hope
  (I didn't mean to close it yet, since it's not back ported)...
 
  Mike
 
  http://blog.mikemccandless.com
 
  On Tue, May 3, 2011 at 5:51 AM, Shai Erera ser...@gmail.com wrote:
   Mike, if you want, I can back-port it, as I've already started this
 when
   preparing the patch.
  
   I noticed that you added a throws IOE to IW.setInfoStream -- is it
 ok
   on
   3x too? It'll be a backwards change.
  
   Maybe we should iterate on the issue? I can reopen.
  
   Shai
  
   On Tue, May 3, 2011 at 12:36 PM, Michael McCandless
   luc...@mikemccandless.com wrote:
  
   Looks good Shai!
  
   Comments below too:
  
   On Tue, May 3, 2011 at 5:29 AM, Shai Erera ser...@gmail.com wrote:
Hi
   
I looked into porting it to 3x, and prepared the attached patch. It
only
contains the new TieredMP and Test, as well as the necessary
 changes
to
LuceneTestCase and IndexWriter. I guess you can start with it (even
just
the
MP and IW changes) to test it on your indexes.
   
Mike, I saw that there were many more changes, as part of
LUCENE-1076,
done
to the code. In particular, this MP is now the default (on trunk),
 so
I
guess many changes (to tests) were needed because of that. Do you
remember,
if apart from the changes I've included in the patch, other
 important
changes w.r.t. this code?
  
   The only other changes I can think of were some verbosity
 improvements
   to IndexWriter, to support the python script that can make a merge
   movie from an infoStream output; but that can wait for when I
   back-port to 3.x...
  
As we won't change the default MP on 3x, I'm guessing I don't need
 to
port
all the changes to 3x.
  
   Right, I think.
  
   Mike
  




Re: MergePolicy Thresholds

2011-05-03 Thread Shai Erera
Hi

I looked into porting it to 3x, and prepared the attached patch. It only
contains the new TieredMP and Test, as well as the necessary changes to
LuceneTestCase and IndexWriter. I guess you can start with it (even just the
MP and IW changes) to test it on your indexes.

Mike, I saw that there were many more changes, as part of LUCENE-1076, done
to the code. In particular, this MP is now the default (on trunk), so I
guess many changes (to tests) were needed because of that. Do you remember
whether, apart from the changes I've included in the patch, there were other
important changes w.r.t. this code?

As we won't change the default MP on 3x, I'm guessing I don't need to port
all the changes to 3x.

Shai

On Mon, May 2, 2011 at 9:41 PM, Burton-West, Tom tburt...@umich.edu wrote:

 Hi Shai and Mike,

 Testing the TieredMP on our large indexes has been on my todo list since I
 read Mike's blog post

 http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html
 .

 If you port it to the 3.x branch Shai, I'll be more than happy to test it
 with our very large (300GB+) indexes.  Besides being able to set the max
 merged segment size, I'm especially interested in using the
  maxSegmentsPerTier parameter.

 From Mike's blog post:
  ...maxSegmentsPerTier that lets you set the allowed width (number of
 segments) of each stair in the staircase. This is nice because it decouples
 how many segments to merge at a time from how wide the staircase can be.
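
 A toy illustration of the decoupling described in the excerpt above. This is
 only a sketch: the names segments_per_tier and max_merge_at_once are
 hypothetical stand-ins for the two knobs, and the code does not reproduce
 TieredMergePolicy's actual selection logic.

```python
# Toy model: one knob limits how wide a tier may get, a separate knob
# limits how many segments are merged in a single merge.
# Both parameter names are hypothetical and chosen for this sketch only.

def tier_needs_merge(segments_in_tier, segments_per_tier):
    """A tier is over budget once it holds more segments than allowed."""
    return segments_in_tier > segments_per_tier


def segments_to_merge(segments_in_tier, max_merge_at_once):
    """Merge at most max_merge_at_once segments, however wide the tier is."""
    return min(segments_in_tier, max_merge_at_once)


# A wide staircase (20 segments per tier) with small merges (5 at a time).
over_budget = tier_needs_merge(21, segments_per_tier=20)
merge_width = segments_to_merge(21, max_merge_at_once=5)
```

 With a single mergeFactor, both values would be forced to be the same number;
 here they can be tuned independently.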

 Tom Burton-West
 http://www.hathitrust.org/blogs/large-scale-search

 -Original Message-
 From: Michael McCandless [mailto:luc...@mikemccandless.com]
 Sent: Monday, May 02, 2011 2:19 PM
 To: dev@lucene.apache.org
 Subject: Re: MergePolicy Thresholds

 I think it should be an easy port...

 Mike

 http://blog.mikemccandless.com

 On Mon, May 2, 2011 at 2:16 PM, Shai Erera ser...@gmail.com wrote:
  Thanks Mike. I'll take a look at TieredMP. Does it depend on trunk in any
  way, or do you think it can easily be ported to 3x?
  Shai
 






tieredmp.patch
Description: Binary data


Re: MergePolicy Thresholds

2011-05-03 Thread Michael McCandless
Looks good Shai!

Comments below too:

On Tue, May 3, 2011 at 5:29 AM, Shai Erera ser...@gmail.com wrote:
 Hi

 I looked into porting it to 3x, and prepared the attached patch. It only
 contains the new TieredMP and Test, as well as the necessary changes to
 LuceneTestCase and IndexWriter. I guess you can start with it (even just the
 MP and IW changes) to test it on your indexes.

 Mike, I saw that there were many more changes, as part of LUCENE-1076, done
 to the code. In particular, this MP is now the default (on trunk), so I
 guess many changes (to tests) were needed because of that. Do you remember,
 if apart from the changes I've included in the patch, other important
 changes w.r.t. this code?

The only other changes I can think of were some verbosity improvements
to IndexWriter, to support the python script that can make a merge
movie from an infoStream output; but that can wait for when I
back-port to 3.x...

 As we won't change the default MP on 3x, I'm guessing I don't need to port
 all the changes to 3x.

Right, I think.

Mike




Re: MergePolicy Thresholds

2011-05-03 Thread Shai Erera
Mike, if you want, I can back-port it, as I've already started this when
preparing the patch.

I noticed that you added a throws IOE to IW.setInfoStream -- is it ok on
3x too? It'll be a backwards-incompatible change.

Maybe we should iterate on the issue? I can reopen.

Shai

On Tue, May 3, 2011 at 12:36 PM, Michael McCandless 
luc...@mikemccandless.com wrote:

 Looks good Shai!

 Comments below too:

 On Tue, May 3, 2011 at 5:29 AM, Shai Erera ser...@gmail.com wrote:
  Hi
 
  I looked into porting it to 3x, and prepared the attached patch. It only
  contains the new TieredMP and Test, as well as the necessary changes to
  LuceneTestCase and IndexWriter. I guess you can start with it (even just
 the
  MP and IW changes) to test it on your indexes.
 
  Mike, I saw that there were many more changes, as part of LUCENE-1076,
 done
  to the code. In particular, this MP is now the default (on trunk), so I
  guess many changes (to tests) were needed because of that. Do you
 remember,
  if apart from the changes I've included in the patch, other important
  changes w.r.t. this code?

 The only other changes I can think of were some verbosity improvements
 to IndexWriter, to support the python script that can make a merge
 movie from an infoStream output; but that can wait for when I
 back-port to 3.x...

  As we won't change the default MP on 3x, I'm guessing I don't need to
 port
  all the changes to 3x.

 Right, I think.

 Mike





Re: MergePolicy Thresholds

2011-05-03 Thread Michael McCandless
That'd be great, thanks :)

Yes, let's iterate on the issue!  But: it should still be open, I hope
(I didn't mean to close it yet, since it's not back ported)...

Mike

http://blog.mikemccandless.com

On Tue, May 3, 2011 at 5:51 AM, Shai Erera ser...@gmail.com wrote:
 Mike, if you want, I can back-port it, as I've already started this when
 preparing the patch.

 I noticed that you added a throws IOE to IW.setInfoStream -- is it ok on
 3x too? It'll be a backwards change.

 Maybe we should iterate on the issue? I can reopen.

 Shai

 On Tue, May 3, 2011 at 12:36 PM, Michael McCandless
 luc...@mikemccandless.com wrote:

 Looks good Shai!

 Comments below too:

 On Tue, May 3, 2011 at 5:29 AM, Shai Erera ser...@gmail.com wrote:
  Hi
 
  I looked into porting it to 3x, and prepared the attached patch. It only
  contains the new TieredMP and Test, as well as the necessary changes to
  LuceneTestCase and IndexWriter. I guess you can start with it (even just
  the
  MP and IW changes) to test it on your indexes.
 
  Mike, I saw that there were many more changes, as part of LUCENE-1076,
  done
  to the code. In particular, this MP is now the default (on trunk), so I
  guess many changes (to tests) were needed because of that. Do you
  remember,
  if apart from the changes I've included in the patch, other important
  changes w.r.t. this code?

 The only other changes I can think of were some verbosity improvements
 to IndexWriter, to support the python script that can make a merge
 movie from an infoStream output; but that can wait for when I
 back-port to 3.x...

  As we won't change the default MP on 3x, I'm guessing I don't need to
  port
  all the changes to 3x.

 Right, I think.

 Mike




Re: MergePolicy Thresholds

2011-05-03 Thread Michael McCandless
Thanks Shai!

I'm way behind on my 3.x backports -- I'll try to do this soon.

Mike

http://blog.mikemccandless.com

On Tue, May 3, 2011 at 8:10 AM, Shai Erera ser...@gmail.com wrote:
 I uploaded a patch to LUCENE-1076.

 Tom, apparently the patch I've attached before cannot be used, because there
 are dependencies (in earlier commits on LUCENE-1076) that need to be
 back-ported as well. So stay tuned on LUCENE-1076 for when it is safe to use
 this new MP.

 Shai

 On Tue, May 3, 2011 at 1:00 PM, Michael McCandless
 luc...@mikemccandless.com wrote:

 That'd be great, thanks :)

 Yes, let's iterate on the issue!  But: it should still be open, I hope
 (I didn't mean to close it yet, since it's not back ported)...

 Mike

 http://blog.mikemccandless.com

 On Tue, May 3, 2011 at 5:51 AM, Shai Erera ser...@gmail.com wrote:
  Mike, if you want, I can back-port it, as I've already started this when
  preparing the patch.
 
  I noticed that you added a throws IOE to IW.setInfoStream -- is it ok
  on
  3x too? It'll be a backwards change.
 
  Maybe we should iterate on the issue? I can reopen.
 
  Shai
 
  On Tue, May 3, 2011 at 12:36 PM, Michael McCandless
  luc...@mikemccandless.com wrote:
 
  Looks good Shai!
 
  Comments below too:
 
  On Tue, May 3, 2011 at 5:29 AM, Shai Erera ser...@gmail.com wrote:
   Hi
  
   I looked into porting it to 3x, and prepared the attached patch. It
   only
   contains the new TieredMP and Test, as well as the necessary changes
   to
   LuceneTestCase and IndexWriter. I guess you can start with it (even
   just
   the
   MP and IW changes) to test it on your indexes.
  
   Mike, I saw that there were many more changes, as part of
   LUCENE-1076,
   done
   to the code. In particular, this MP is now the default (on trunk), so
   I
   guess many changes (to tests) were needed because of that. Do you
   remember,
   if apart from the changes I've included in the patch, other important
   changes w.r.t. this code?
 
  The only other changes I can think of were some verbosity improvements
  to IndexWriter, to support the python script that can make a merge
  movie from an infoStream output; but that can wait for when I
  back-port to 3.x...
 
   As we won't change the default MP on 3x, I'm guessing I don't need to
   port
   all the changes to 3x.
 
  Right, I think.
 
  Mike
 



RE: MergePolicy Thresholds

2011-05-03 Thread Burton-West, Tom
Thanks Shai and Mike!

I'll keep an eye on LUCENE-1076.

Tom

-Original Message-
From: Michael McCandless [mailto:luc...@mikemccandless.com] 
Sent: Tuesday, May 03, 2011 11:15 AM
To: dev@lucene.apache.org
Subject: Re: MergePolicy Thresholds

Thanks Shai!

I'm way behind on my 3.x backports -- I'll try to do this soon.

Mike

http://blog.mikemccandless.com

On Tue, May 3, 2011 at 8:10 AM, Shai Erera ser...@gmail.com wrote:
 I uploaded a patch to LUCENE-1076.

 Tom, apparently the patch I've attached before cannot be used, because there
 are dependencies (in earlier commits on LUCENE-1076) that need to be
 back-ported as well. So stay tuned on LUCENE-1076 for when it is safe to use
 this new MP.

 Shai

 On Tue, May 3, 2011 at 1:00 PM, Michael McCandless
 luc...@mikemccandless.com wrote:

 That'd be great, thanks :)

 Yes, let's iterate on the issue!  But: it should still be open, I hope
 (I didn't mean to close it yet, since it's not back ported)...

 Mike

 http://blog.mikemccandless.com

 On Tue, May 3, 2011 at 5:51 AM, Shai Erera ser...@gmail.com wrote:
  Mike, if you want, I can back-port it, as I've already started this when
  preparing the patch.
 
  I noticed that you added a throws IOE to IW.setInfoStream -- is it ok
  on
  3x too? It'll be a backwards change.
 
  Maybe we should iterate on the issue? I can reopen.
 
  Shai
 
  On Tue, May 3, 2011 at 12:36 PM, Michael McCandless
  luc...@mikemccandless.com wrote:
 
  Looks good Shai!
 
  Comments below too:
 
  On Tue, May 3, 2011 at 5:29 AM, Shai Erera ser...@gmail.com wrote:
   Hi
  
   I looked into porting it to 3x, and prepared the attached patch. It
   only
   contains the new TieredMP and Test, as well as the necessary changes
   to
   LuceneTestCase and IndexWriter. I guess you can start with it (even
   just
   the
   MP and IW changes) to test it on your indexes.
  
   Mike, I saw that there were many more changes, as part of
   LUCENE-1076,
   done
   to the code. In particular, this MP is now the default (on trunk), so
   I
   guess many changes (to tests) were needed because of that. Do you
   remember,
   if apart from the changes I've included in the patch, other important
   changes w.r.t. this code?
 
  The only other changes I can think of were some verbosity improvements
  to IndexWriter, to support the python script that can make a merge
  movie from an infoStream output; but that can wait for when I
  back-port to 3.x...
 
   As we won't change the default MP on 3x, I'm guessing I don't need to
   port
   all the changes to 3x.
 
  Right, I think.
 
  Mike
 



Re: MergePolicy Thresholds

2011-05-02 Thread Earwin Burrfoot
Have you checked BalancedSegmentMergePolicy? It has some more knobs :)

On Mon, May 2, 2011 at 17:03, Shai Erera ser...@gmail.com wrote:
 Hi

 Today, LogMP allows you to set different thresholds for segments sizes,
 thereby allowing you to control the largest segment that will be
 considered for merge + the largest segment your index will hold (=~
 threshold * mergeFactor).

 So, if you want to end up w/ say 20GB segments, you can set
 maxMergeMB(ForOptimize) to 2GB and mergeFactor=10.

 However, this often does not achieve your desired goal -- if the index
 contains 5 and 7 GB segments, they will never be merged b/c they are
 bigger than the threshold. I am willing to spend the CPU and IO resources
 to end up w/ 20 GB segments, whether I'm merging 10 segments together or
 only 2. After I reach a 20GB segment, it can rest peacefully, at least
 until I increase the threshold.
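
 The arithmetic and the problem described above can be sketched as follows.
 This is a simplified model of a per-segment size cap, not Lucene's code; the
 function name and the exact comparison are assumptions made for illustration.

```python
# Simplified model of a per-segment eligibility threshold (not Lucene's code).
MAX_MERGE_MB = 2 * 1024   # segments larger than this are never merged (2 GB)
MERGE_FACTOR = 10         # number of segments merged at a time


def eligible_for_merge(segment_sizes_mb, max_merge_mb=MAX_MERGE_MB):
    """Return the segments small enough to be considered for merging."""
    return [size for size in segment_sizes_mb if size <= max_merge_mb]


# Rough upper bound on a merged segment: mergeFactor segments at the threshold.
largest_result_mb = MAX_MERGE_MB * MERGE_FACTOR   # 20480 MB, i.e. 20 GB

# The 5 GB and 7 GB segments are filtered out, so they can never be combined
# into a 12 GB segment even though that is well under the 20 GB target.
segments_mb = [5 * 1024, 7 * 1024, 512, 1024]
candidates = eligible_for_merge(segments_mb)
```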

 So I wonder, first, if this threshold (i.e., largest segment size you
 would like to end up with) is more natural to set than the current
 thresholds, from the application level? I.e., wouldn't it be a simpler
 threshold to set instead of doing weird calculus that depends on
 maxMergeMB(ForOptimize) and mergeFactor?

 Second, should this be an addition to LogMP, or a different
 type of MP. One that adheres to only those two factors (perhaps the
 segSize threshold should be allowed to set differently for optimize and
 regular merges). It can pick segments for merge such that it maximizes
 the result segment size (i.e., don't necessarily merge in sequential
 order), but not more than mergeFactor.
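
 One way to read the selection rule proposed above, as a minimal sketch. The
 function pick_merge and its parameters are hypothetical; it brute-forces all
 combinations, which is only practical for a handful of segments, and it is
 not how any existing MergePolicy chooses merges.

```python
from itertools import combinations


def pick_merge(sizes, max_result, merge_factor):
    """Pick 2..merge_factor segments, not necessarily adjacent, whose combined
    size is as large as possible without exceeding max_result.

    Returns a tuple of the chosen sizes, or () if no valid merge exists.
    """
    best, best_total = (), 0
    for count in range(2, min(merge_factor, len(sizes)) + 1):
        for combo in combinations(sizes, count):
            total = sum(combo)
            if best_total < total <= max_result:
                best, best_total = combo, total
    return best


# Sizes in GB. The 5 GB and 7 GB segments from the example are now mergeable.
two_segments = pick_merge([5, 7], max_result=20, merge_factor=10)

# With more segments, the pick fills the 20 GB budget as closely as it can.
chosen = pick_merge([5, 7, 9, 3, 1], max_result=20, merge_factor=10)
```

 A real policy would also need to weigh merge cost, which this sketch ignores.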

 I guess, if we think that maxResultSegmentSizeMB is more intuitive than
 the current thresholds, application-wise, then this change should go
 into LogMP. Otherwise, it feels like a different MP is needed, because
 LogMP is already complicated and another threshold would confuse things.

 What do you think of this? Am I trying to optimize too much? :)

 Shai





-- 
Kirill Zakharenko/Кирилл Захаренко
E-Mail/Jabber: ear...@gmail.com
Phone: +7 (495) 683-567-4
ICQ: 104465785




Re: MergePolicy Thresholds

2011-05-02 Thread Shai Erera
I did look at it, but I didn't find that it answers this particular need
(ending with a segment no bigger than X). Perhaps by tweaking several
parameters (e.g. maxLarge/SmallNumSegments + maxMergeSizeMB) I can achieve
something, but it's not very clear what the right combination is.

Which is related to one of the points -- is it not more intuitive for an app
to set this threshold (if it needs any thresholds), than tweaking all of
those parameters? If so, then we only need two thresholds (size +
mergeFactor), and we can reuse BalancedMP's findBalancedMerges logic
(perhaps w/ some adaptations) to derive a merge plan.

Shai

On Mon, May 2, 2011 at 4:42 PM, Earwin Burrfoot ear...@gmail.com wrote:

 Have you checked BalancedSegmentMergePolicy? It has some more knobs :)

 On Mon, May 2, 2011 at 17:03, Shai Erera ser...@gmail.com wrote:
  Hi
 
  Today, LogMP allows you to set different thresholds for segments sizes,
  thereby allowing you to control the largest segment that will be
  considered for merge + the largest segment your index will hold (=~
  threshold * mergeFactor).
 
  So, if you want to end up w/ say 20GB segments, you can set
  maxMergeMB(ForOptimize) to 2GB and mergeFactor=10.
 
  However, this often does not achieve your desired goal -- if the index
  contains 5 and 7 GB segments, they will never be merged b/c they are
  bigger than the threshold. I am willing to spend the CPU and IO resources
  to end up w/ 20 GB segments, whether I'm merging 10 segments together or
  only 2. After I reach a 20GB segment, it can rest peacefully, at least
  until I increase the threshold.
 
  So I wonder, first, if this threshold (i.e., largest segment size you
  would like to end up with) is more natural to set than the current
  thresholds, from the application level? I.e., wouldn't it be a simpler
  threshold to set instead of doing weird calculus that depends on
  maxMergeMB(ForOptimize) and mergeFactor?
 
  Second, should this be an addition to LogMP, or a different
  type of MP. One that adheres to only those two factors (perhaps the
  segSize threshold should be allowed to set differently for optimize and
  regular merges). It can pick segments for merge such that it maximizes
  the result segment size (i.e., don't necessarily merge in sequential
  order), but not more than mergeFactor.
 
  I guess, if we think that maxResultSegmentSizeMB is more intuitive than
  the current thresholds, application-wise, then this change should go
  into LogMP. Otherwise, it feels like a different MP is needed, because
  LogMP is already complicated and another threshold would confuse things.
 
  What do you think of this? Am I trying to optimize too much? :)
 
  Shai
 
 



 --
 Kirill Zakharenko/Кирилл Захаренко
 E-Mail/Jabber: ear...@gmail.com
 Phone: +7 (495) 683-567-4
 ICQ: 104465785





Re: MergePolicy Thresholds

2011-05-02 Thread Earwin Burrfoot
Dunno, I'm quite happy with numLargeSegments (you critically
misspelled it). It neatly avoids uber-merges, keeps the number of
segments at bay, and does not require recalculating thresholds when
my expected index size changes.

The problem is - each person needs his own set of knobs (or thinks he
needs them) for MergePolicy, and I can't call any of these sets
superior to others :/

2011/5/2 Shai Erera ser...@gmail.com:
 I did look at it, but I didn't find that it answers this particular need
 (ending with a segment no bigger than X). Perhaps by tweaking several
 parameters (e.g. maxLarge/SmallNumSegments + maxMergeSizeMB) I can achieve
 something, but it's not very clear what is the right combination.

 Which is related to one of the points -- is it not more intuitive for an app
 to set this threshold (if it needs any thresholds), than tweaking all of
 those parameters? If so, then we only need two thresholds (size +
 mergeFactor), and we can reuse BalancedMP's findBalancedMerges logic
 (perhaps w/ some adaptations) to derive a merge plan.

 Shai

 On Mon, May 2, 2011 at 4:42 PM, Earwin Burrfoot ear...@gmail.com wrote:

 Have you checked BalancedSegmentMergePolicy? It has some more knobs :)

 On Mon, May 2, 2011 at 17:03, Shai Erera ser...@gmail.com wrote:
  Hi
 
  Today, LogMP allows you to set different thresholds for segment sizes,
  thereby allowing you to control the largest segment that will be
  considered for merge + the largest segment your index will hold (=~
  threshold * mergeFactor).
 
  So, if you want to end up w/ say 20GB segments, you can set
  maxMergeMB(ForOptimize) to 2GB and mergeFactor=10.
 
  However, this often does not achieve your desired goal -- if the index
  contains 5 and 7 GB segments, they will never be merged b/c they are
  bigger than the threshold. I am willing to spend the CPU and IO
  resources
  to end up w/ 20 GB segments, whether I'm merging 10 segments together or
  only 2. After I reach a 20GB segment, it can rest peacefully, at least
  until I increase the threshold.
 
  So I wonder, first, if this threshold (i.e., largest segment size you
  would like to end up with) is more natural to set than the current
  thresholds,
  from the application level? I.e., wouldn't it be a simpler threshold to
  set
  instead of doing weird calculus that depends on maxMergeMB(ForOptimize)
  and mergeFactor?
 
  Second, should this be an addition to LogMP, or a different
  type of MP. One that adheres to only those two factors (perhaps the
  segSize threshold should be allowed to set differently for optimize and
  regular merges). It can pick segments for merge such that it maximizes
  the result segment size (i.e., don't necessarily merge in sequential
  order), but not more than mergeFactor.
 
  I guess, if we think that maxResultSegmentSizeMB is more intuitive than
  the current thresholds, application-wise, then this change should go
  into LogMP. Otherwise, it feels like a different MP is needed, because
  LogMP is already complicated and another threshold would confuse things.
 
  What do you think of this? Am I trying to optimize too much? :)
 
  Shai
 
 



 --
 Kirill Zakharenko/Кирилл Захаренко
 E-Mail/Jabber: ear...@gmail.com
 Phone: +7 (495) 683-567-4
 ICQ: 104465785







-- 
Kirill Zakharenko/Кирилл Захаренко
E-Mail/Jabber: ear...@gmail.com
Phone: +7 (495) 683-567-4
ICQ: 104465785




Re: MergePolicy Thresholds

2011-05-02 Thread Shai Erera

 The problem is - each person needs his own set of knobs (or thinks he
 needs them) for MergePolicy, and I can't call any of these sets
 superior to others :/


I agree. I wonder though if the knobs we give on LogMP are intuitive enough.
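
To make the LogMP knobs concrete, here is a minimal Java sketch of the arithmetic from the original post. It is a simplified model, not LogMergePolicy's code: the class and method names are made up for illustration, and it assumes only that a segment at or above maxMergeMB is skipped and that the resulting ceiling is roughly maxMergeMB * mergeFactor.

```java
// Hypothetical helper, for illustration only; not part of Lucene.
final class LogMpThresholdSketch {

    /** Rough ceiling on segment size: largest eligible segment times mergeFactor. */
    static double approxLargestSegmentMB(double maxMergeMB, int mergeFactor) {
        return maxMergeMB * mergeFactor;
    }

    /** Assumption: a segment is considered for merging only while it is below the threshold. */
    static boolean eligibleForMerge(double segmentMB, double maxMergeMB) {
        return segmentMB < maxMergeMB;
    }
}
```

With maxMergeMB = 2048 and mergeFactor = 10 the ceiling comes out at about 20 GB, yet a 5 GB or 7 GB segment is never eligible, so it can never be merged up to that ceiling.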

It neatly avoids uber-merges


I didn't see that I can define what uber-merge is, right? Can I tell it to
stop merging segments of some size? E.g., if my index grew to 100 segments,
40GB each, I don't think that merging 10 40GB segments (to create 400GB
segment) is going to speed up my search, for instance. A 40GB segment
(probably much less) is already big enough to not be touched anymore.

Will BalancedMP stop merging such segments (if all segments are of that
order of magnitude)?

Shai

On Mon, May 2, 2011 at 5:23 PM, Earwin Burrfoot ear...@gmail.com wrote:

 Dunno, I'm quite happy with numLargeSegments (you critically
 misspelled it). It neatly avoids uber-merges, keeps the number of
 segments at bay, and does not require recalculating thresholds when
 my expected index size changes.

 The problem is - each person needs his own set of knobs (or thinks he
 needs them) for MergePolicy, and I can't call any of these sets
 superior to others :/

 2011/5/2 Shai Erera ser...@gmail.com:
  I did look at it, but I didn't find that it answers this particular need
  (ending with a segment no bigger than X). Perhaps by tweaking several
  parameters (e.g. maxLarge/SmallNumSegments + maxMergeSizeMB) I can
 achieve
  something, but it's not very clear what is the right combination.
 
  Which is related to one of the points -- is it not more intuitive for an
 app
  to set this threshold (if it needs any thresholds), than tweaking all of
  those parameters? If so, then we only need two thresholds (size +
  mergeFactor), and we can reuse BalancedMP's findBalancedMerges logic
  (perhaps w/ some adaptations) to derive a merge plan.
 
  Shai
 
  On Mon, May 2, 2011 at 4:42 PM, Earwin Burrfoot ear...@gmail.com
 wrote:
 
  Have you checked BalancedSegmentMergePolicy? It has some more knobs :)
 
  On Mon, May 2, 2011 at 17:03, Shai Erera ser...@gmail.com wrote:
   Hi
  
   Today, LogMP allows you to set different thresholds for segment
 sizes,
   thereby allowing you to control the largest segment that will be
   considered for merge + the largest segment your index will hold (=~
   threshold * mergeFactor).
  
   So, if you want to end up w/ say 20GB segments, you can set
   maxMergeMB(ForOptimize) to 2GB and mergeFactor=10.
  
   However, this often does not achieve your desired goal -- if the index
   contains 5 and 7 GB segments, they will never be merged b/c they are
   bigger than the threshold. I am willing to spend the CPU and IO
   resources
   to end up w/ 20 GB segments, whether I'm merging 10 segments together
 or
   only 2. After I reach a 20GB segment, it can rest peacefully, at least
   until I increase the threshold.
  
   So I wonder, first, if this threshold (i.e., largest segment size you
   would like to end up with) is more natural to set than the current
   thresholds,
   from the application level? I.e., wouldn't it be a simpler threshold
 to
   set
   instead of doing weird calculus that depends on maxMergeMB(ForOptimize)
   and mergeFactor?
  
   Second, should this be an addition to LogMP, or a different
   type of MP. One that adheres to only those two factors (perhaps the
   segSize threshold should be allowed to set differently for optimize
 and
   regular merges). It can pick segments for merge such that it maximizes
   the result segment size (i.e., don't necessarily merge in sequential
   order), but not more than mergeFactor.
  
   I guess, if we think that maxResultSegmentSizeMB is more intuitive
 than
   the current thresholds, application-wise, then this change should go
   into LogMP. Otherwise, it feels like a different MP is needed, because
   LogMP is already complicated and another threshold would confuse
 things.
  
   What do you think of this? Am I trying to optimize too much? :)
  
   Shai
  
  
 
 
 
  --
  Kirill Zakharenko/Кирилл Захаренко
  E-Mail/Jabber: ear...@gmail.com
  Phone: +7 (495) 683-567-4
  ICQ: 104465785
 
 
 
 



 --
 Kirill Zakharenko/Кирилл Захаренко
 E-Mail/Jabber: ear...@gmail.com
 Phone: +7 (495) 683-567-4
 ICQ: 104465785





Re: MergePolicy Thresholds

2011-05-02 Thread Michael McCandless
Actually the new TieredMergePolicy (only on trunk currently but I plan
to backport for 3.2) lets you set the max merged segment size
(maxMergedSegmentMB).

It's only an estimate, but if it's set, it tries to pick a merge
reaching around that target size.
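
A rough Java sketch of what "pick a merge reaching around that target size" could look like. This is not TieredMergePolicy's actual selection logic; pickMerge and its parameters are hypothetical, and the greedy largest-first strategy is just one simple way to approach a size budget without requiring the segments to be adjacent.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, for illustration only; not part of Lucene.
final class MergePickSketch {

    /**
     * Walks the segments from largest to smallest and takes each one that
     * still fits under maxMergedMB, up to maxAtOnce segments. Segments that
     * already reach the cap on their own are left alone. Returns an empty
     * list when fewer than two segments qualify (nothing to merge).
     */
    static List<Double> pickMerge(double[] segmentSizesMB, double maxMergedMB, int maxAtOnce) {
        double[] sorted = segmentSizesMB.clone();
        Arrays.sort(sorted); // ascending; iterated in reverse below
        List<Double> picked = new ArrayList<>();
        double total = 0;
        for (int i = sorted.length - 1; i >= 0 && picked.size() < maxAtOnce; i--) {
            if (sorted[i] >= maxMergedMB) {
                continue; // already at or over the target size
            }
            if (total + sorted[i] <= maxMergedMB) {
                picked.add(sorted[i]);
                total += sorted[i];
            }
        }
        return picked.size() >= 2 ? picked : new ArrayList<>();
    }
}
```

For segments of 5000, 7000, 1000, 500 and 20000 MB with a 20000 MB target, this sketch leaves the 20000 MB segment untouched and selects the other four (13500 MB in total).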

Mike

http://blog.mikemccandless.com

On Mon, May 2, 2011 at 9:03 AM, Shai Erera ser...@gmail.com wrote:
 Hi

 Today, LogMP allows you to set different thresholds for segment sizes,
 thereby allowing you to control the largest segment that will be
 considered for merge + the largest segment your index will hold (=~
 threshold * mergeFactor).

 So, if you want to end up w/ say 20GB segments, you can set
 maxMergeMB(ForOptimize) to 2GB and mergeFactor=10.

 However, this often does not achieve your desired goal -- if the index
 contains 5 and 7 GB segments, they will never be merged b/c they are
 bigger than the threshold. I am willing to spend the CPU and IO resources
 to end up w/ 20 GB segments, whether I'm merging 10 segments together or
 only 2. After I reach a 20GB segment, it can rest peacefully, at least
 until I increase the threshold.

 So I wonder, first, if this threshold (i.e., largest segment size you
 would like to end up with) is more natural to set than the current
 thresholds,
 from the application level? I.e., wouldn't it be a simpler threshold to set
 instead of doing weird calculus that depends on maxMergeMB(ForOptimize)
 and mergeFactor?

 Second, should this be an addition to LogMP, or a different
 type of MP. One that adheres to only those two factors (perhaps the
 segSize threshold should be allowed to set differently for optimize and
 regular merges). It can pick segments for merge such that it maximizes
 the result segment size (i.e., don't necessarily merge in sequential
 order), but not more than mergeFactor.

 I guess, if we think that maxResultSegmentSizeMB is more intuitive than
 the current thresholds, application-wise, then this change should go
 into LogMP. Otherwise, it feels like a different MP is needed, because
 LogMP is already complicated and another threshold would confuse things.

 What do you think of this? Am I trying to optimize too much? :)

 Shai






Re: MergePolicy Thresholds

2011-05-02 Thread Shai Erera
Thanks Mike. I'll take a look at TieredMP. Does it depend on trunk in any
way, or do you think it can easily be ported to 3x?

Shai

On Mon, May 2, 2011 at 6:34 PM, Michael McCandless 
luc...@mikemccandless.com wrote:

 Actually the new TieredMergePolicy (only on trunk currently but I plan
 to backport for 3.2) lets you set the max merged segment size
 (maxMergedSegmentMB).

 It's only an estimate, but if it's set, it tries to pick a merge
 reaching around that target size.

 Mike

 http://blog.mikemccandless.com

 On Mon, May 2, 2011 at 9:03 AM, Shai Erera ser...@gmail.com wrote:
  Hi
 
  Today, LogMP allows you to set different thresholds for segment sizes,
  thereby allowing you to control the largest segment that will be
  considered for merge + the largest segment your index will hold (=~
  threshold * mergeFactor).
 
  So, if you want to end up w/ say 20GB segments, you can set
  maxMergeMB(ForOptimize) to 2GB and mergeFactor=10.
 
  However, this often does not achieve your desired goal -- if the index
  contains 5 and 7 GB segments, they will never be merged b/c they are
  bigger than the threshold. I am willing to spend the CPU and IO resources
  to end up w/ 20 GB segments, whether I'm merging 10 segments together or
  only 2. After I reach a 20GB segment, it can rest peacefully, at least
  until I increase the threshold.
 
  So I wonder, first, if this threshold (i.e., largest segment size you
  would like to end up with) is more natural to set than the current
  thresholds,
  from the application level? I.e., wouldn't it be a simpler threshold to
 set
  instead of doing weird calculus that depends on maxMergeMB(ForOptimize)
  and mergeFactor?
 
  Second, should this be an addition to LogMP, or a different
  type of MP. One that adheres to only those two factors (perhaps the
  segSize threshold should be allowed to set differently for optimize and
  regular merges). It can pick segments for merge such that it maximizes
  the result segment size (i.e., don't necessarily merge in sequential
  order), but not more than mergeFactor.
 
  I guess, if we think that maxResultSegmentSizeMB is more intuitive than
  the current thresholds, application-wise, then this change should go
  into LogMP. Otherwise, it feels like a different MP is needed, because
  LogMP is already complicated and another threshold would confuse things.
 
  What do you think of this? Am I trying to optimize too much? :)
 
  Shai
 
 





Re: MergePolicy Thresholds

2011-05-02 Thread Michael McCandless
I think it should be an easy port...

Mike

http://blog.mikemccandless.com

On Mon, May 2, 2011 at 2:16 PM, Shai Erera ser...@gmail.com wrote:
 Thanks Mike. I'll take a look at TieredMP. Does it depend on trunk in any
 way, or do you think it can easily be ported to 3x?
 Shai

 On Mon, May 2, 2011 at 6:34 PM, Michael McCandless
 luc...@mikemccandless.com wrote:

 Actually the new TieredMergePolicy (only on trunk currently but I plan
 to backport for 3.2) lets you set the max merged segment size
 (maxMergedSegmentMB).

 It's only an estimate, but if it's set, it tries to pick a merge
 reaching around that target size.

 Mike

 http://blog.mikemccandless.com

 On Mon, May 2, 2011 at 9:03 AM, Shai Erera ser...@gmail.com wrote:
  Hi
 
  Today, LogMP allows you to set different thresholds for segment sizes,
  thereby allowing you to control the largest segment that will be
  considered for merge + the largest segment your index will hold (=~
  threshold * mergeFactor).
 
  So, if you want to end up w/ say 20GB segments, you can set
  maxMergeMB(ForOptimize) to 2GB and mergeFactor=10.
 
  However, this often does not achieve your desired goal -- if the index
  contains 5 and 7 GB segments, they will never be merged b/c they are
  bigger than the threshold. I am willing to spend the CPU and IO
  resources
  to end up w/ 20 GB segments, whether I'm merging 10 segments together or
  only 2. After I reach a 20GB segment, it can rest peacefully, at least
  until I increase the threshold.
 
  So I wonder, first, if this threshold (i.e., largest segment size you
  would like to end up with) is more natural to set than the current
  thresholds,
  from the application level? I.e., wouldn't it be a simpler threshold to
  set
  instead of doing weird calculus that depends on maxMergeMB(ForOptimize)
  and mergeFactor?
 
  Second, should this be an addition to LogMP, or a different
  type of MP. One that adheres to only those two factors (perhaps the
  segSize threshold should be allowed to set differently for optimize and
  regular merges). It can pick segments for merge such that it maximizes
  the result segment size (i.e., don't necessarily merge in sequential
  order), but not more than mergeFactor.
 
  I guess, if we think that maxResultSegmentSizeMB is more intuitive than
  the current thresholds, application-wise, then this change should go
  into LogMP. Otherwise, it feels like a different MP is needed, because
  LogMP is already complicated and another threshold would confuse things.
 
  What do you think of this? Am I trying to optimize too much? :)
 
  Shai
 
 








RE: MergePolicy Thresholds

2011-05-02 Thread Burton-West, Tom
Hi Shai and Mike,

Testing the TieredMP on our large indexes has been on my todo list since I read
Mike's blog post
http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html.

If you port it to the 3.x branch, Shai, I'll be more than happy to test it with
our very large (300GB+) indexes.  Besides being able to set the max merged 
segment size, I'm especially interested in using the  maxSegmentsPerTier 
parameter.

From Mike's blog post:
 ...maxSegmentsPerTier that lets you set the allowed width (number of 
segments) of each stair in the staircase. This is nice because it decouples how 
many segments to merge at a time from how wide the staircase can be.

Tom Burton-West
http://www.hathitrust.org/blogs/large-scale-search

-Original Message-
From: Michael McCandless [mailto:luc...@mikemccandless.com] 
Sent: Monday, May 02, 2011 2:19 PM
To: dev@lucene.apache.org
Subject: Re: MergePolicy Thresholds

I think it should be an easy port...

Mike

http://blog.mikemccandless.com

On Mon, May 2, 2011 at 2:16 PM, Shai Erera ser...@gmail.com wrote:
 Thanks Mike. I'll take a look at TieredMP. Does it depend on trunk in any
 way, or do you think it can easily be ported to 3x?
 Shai






Re: MergePolicy Thresholds

2011-05-02 Thread Earwin Burrfoot
 The problem is - each person needs his own set of knobs (or thinks he
 needs them) for MergePolicy, and I can't call any of these sets
 superior to others :/

 I agree. I wonder though if the knobs we give on LogMP are intuitive enough.

 It neatly avoids uber-merges

 I didn't see that I can define what uber-merge is, right? Can I tell it to
 stop merging segments of some size? E.g., if my index grew to 100 segments,
 40GB each, I don't think that merging 10 40GB segments (to create 400GB
 segment) is going to speed up my search, for instance. A 40GB segment
 (probably much less) is already big enough to not be touched anymore.
No, you can't. But you can tell it to have exactly (not 'at most') N
top-tier segments and try to keep their sizes close with merges.
Whatever that size may be.
And this is exactly what I want. And defining max cap on segment size
is not what I want.

So the same set of knobs can be intuitive and meaningful for one
person, and useless for another. And you can't pick the best one.
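
The two sets of knobs trade off different quantities, which a back-of-the-envelope Java sketch can show using the 100 x 40GB example quoted above. It models only the steady-state arithmetic, not BalancedSegmentMergePolicy or any other policy's code, and the class and method names are invented for illustration.

```java
// Hypothetical helper, for illustration only; not part of Lucene.
final class KnobComparisonSketch {

    /** Size-cap view: the cap stays fixed, so the segment count grows with the index. */
    static long segmentsUnderSizeCap(double totalIndexGB, double maxSegmentGB) {
        return (long) Math.ceil(totalIndexGB / maxSegmentGB);
    }

    /** Fixed-count view: the count stays fixed, so the segment size grows with the index. */
    static double segmentSizeForFixedCount(double totalIndexGB, int numLargeSegments) {
        return totalIndexGB / numLargeSegments;
    }
}
```

For a 4000 GB index, a 40 GB cap implies about 100 segments, while a fixed count of 10 large segments implies about 400 GB each. Neither number is wrong; they answer different questions.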

 Will BalancedMP stop merging such segments (if all segments are of that
 order of magnitude)?

 Shai

 On Mon, May 2, 2011 at 5:23 PM, Earwin Burrfoot ear...@gmail.com wrote:

 Dunno, I'm quite happy with numLargeSegments (you critically
 misspelled it). It neatly avoids uber-merges, keeps the number of
 segments at bay, and does not require recalculating thresholds when
 my expected index size changes.

 The problem is - each person needs his own set of knobs (or thinks he
 needs them) for MergePolicy, and I can't call any of these sets
 superior to others :/

 2011/5/2 Shai Erera ser...@gmail.com:
  I did look at it, but I didn't find that it answers this particular need
  (ending with a segment no bigger than X). Perhaps by tweaking several
  parameters (e.g. maxLarge/SmallNumSegments + maxMergeSizeMB) I can
  achieve
  something, but it's not very clear what is the right combination.
 
  Which is related to one of the points -- is it not more intuitive for an
  app
  to set this threshold (if it needs any thresholds), than tweaking all of
  those parameters? If so, then we only need two thresholds (size +
  mergeFactor), and we can reuse BalancedMP's findBalancedMerges logic
  (perhaps w/ some adaptations) to derive a merge plan.
 
  Shai
 
  On Mon, May 2, 2011 at 4:42 PM, Earwin Burrfoot ear...@gmail.com
  wrote:
 
  Have you checked BalancedSegmentMergePolicy? It has some more knobs :)
 
  On Mon, May 2, 2011 at 17:03, Shai Erera ser...@gmail.com wrote:
   Hi
  
   Today, LogMP allows you to set different thresholds for segment
   sizes,
   thereby allowing you to control the largest segment that will be
   considered for merge + the largest segment your index will hold (=~
   threshold * mergeFactor).
  
   So, if you want to end up w/ say 20GB segments, you can set
   maxMergeMB(ForOptimize) to 2GB and mergeFactor=10.
  
   However, this often does not achieve your desired goal -- if the
   index
   contains 5 and 7 GB segments, they will never be merged b/c they are
   bigger than the threshold. I am willing to spend the CPU and IO
   resources
   to end up w/ 20 GB segments, whether I'm merging 10 segments together
   or
   only 2. After I reach a 20GB segment, it can rest peacefully, at
   least
   until I increase the threshold.
  
   So I wonder, first, if this threshold (i.e., largest segment size you
   would like to end up with) is more natural to set than the current
   thresholds,
   from the application level? I.e., wouldn't it be a simpler threshold
   to
   set
   instead of doing weird calculus that depends on
   maxMergeMB(ForOptimize)
   and mergeFactor?
  
   Second, should this be an addition to LogMP, or a different
   type of MP. One that adheres to only those two factors (perhaps the
   segSize threshold should be allowed to set differently for optimize
   and
   regular merges). It can pick segments for merge such that it
   maximizes
   the result segment size (i.e., don't necessarily merge in sequential
   order), but not more than mergeFactor.
  
   I guess, if we think that maxResultSegmentSizeMB is more intuitive
   than
   the current thresholds, application-wise, then this change should go
   into LogMP. Otherwise, it feels like a different MP is needed,
   because
   LogMP is already complicated and another threshold would confuse
   things.
  
   What do you think of this? Am I trying to optimize too much? :)
  
   Shai
  
  
 
 
 
  --
  Kirill Zakharenko/Кирилл Захаренко
  E-Mail/Jabber: ear...@gmail.com
  Phone: +7 (495) 683-567-4
  ICQ: 104465785
 
 
 
 



 --
 Kirill Zakharenko/Кирилл Захаренко
 E-Mail/Jabber: ear...@gmail.com
 Phone: +7 (495) 683-567-4
 ICQ: 104465785

 -
 To unsubscribe, e-mail: