setSegmentsPerTier >= setMaxMergeAtOnce ?

2013-06-09 Thread Boaz Leskes
Hi All,

I recently looked at the settings for the TieredMergePolicy [1] and was
puzzled by the note on the setSegmentsPerTier method indicating it should
be equal to or larger than the MaxMergeAtOnce setting, in order to not cause
too many merges.

I understood segments per tier to indicate the goal number of segments for
every segment-size tier. If a tier has more segments than that number, all
these segments will likely be merged into a single one, which will
then be part of the next tier. From that point of view, it's efficient to be
able to collapse the tier in one merge operation. However, if
MaxMergeAtOnce is smaller than the tier size, it will not be able to do
that in one merge; it will either take several merges or not produce a
segment that is close to the ideal size of the bigger tier.

Obviously that line of thought conflicts with the note in
setSegmentsPerTier's JavaDocs. Do I understand the setting/merge behavior
correctly?

Cheers,
Boaz





[1]
http://lucene.apache.org/core/4_3_0/core/org/apache/lucene/index/TieredMergePolicy.html


How to get the most frequent words for a set of documents in Lucene?

2013-06-09 Thread Gucko Gucko
Hello all,

I'm trying to cluster documents that were indexed using Lucene 4.3. The
result of the clustering algorithm is a set of clusters where each cluster
contains the most similar documents (I only store their docIDs in each
cluster). What I want is to get the most frequent words for each cluster.
So I query the Lucene index for the set of documents and then I want to get
the most frequent words for these documents. How can I do this in Lucene?
I especially want an efficient way because I'm clustering tweets in
real time.

What I was thinking about is to make a RAMDirectory and index each set of
documents in this directory and then get the statistics for each term.
However this is slow and uses a lot of memory!


Thanks in advance!


Gucko
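
One way to avoid re-indexing each cluster into a RAMDirectory is to sum
per-document term counts over the cluster's docIDs and keep the k largest
totals. The sketch below shows only that aggregation step in plain Java. It
assumes the per-document counts are already available as maps (for example,
read from term vectors if those were stored at index time). The class name,
the method names, and the sampleCounts data are hypothetical and not part of
Lucene.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ClusterTopTerms {

    /** Hypothetical sample data: docID -> (term -> frequency in that doc). */
    public static Map<Integer, Map<String, Integer>> sampleCounts() {
        Map<Integer, Map<String, Integer>> docs = new HashMap<>();
        Map<String, Integer> d1 = new HashMap<>();
        d1.put("lucene", 2);
        d1.put("merge", 1);
        Map<String, Integer> d2 = new HashMap<>();
        d2.put("lucene", 1);
        d2.put("tweet", 3);
        Map<String, Integer> d3 = new HashMap<>();
        d3.put("tweet", 1);
        docs.put(1, d1);
        docs.put(2, d2);
        docs.put(3, d3);
        return docs;
    }

    /** Sums term counts over the given docIDs and returns the k most frequent terms. */
    public static List<String> topTerms(Map<Integer, Map<String, Integer>> termCountsByDoc,
                                        List<Integer> clusterDocIds, int k) {
        Map<String, Integer> totals = new HashMap<>();
        for (int docId : clusterDocIds) {
            Map<String, Integer> counts = termCountsByDoc.get(docId);
            if (counts == null) {
                continue; // unknown docID: nothing to add
            }
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                totals.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        // Sort by descending total; break ties alphabetically so output is stable.
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(totals.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(k, entries.size()); i++) {
            result.add(entries.get(i).getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        // Cluster made of docs 1 and 2: lucene=3, tweet=3, merge=1.
        System.out.println(topTerms(sampleCounts(), Arrays.asList(1, 2), 2));
    }
}
```

For a cluster of docs 1 and 2 in the sample data, "lucene" and "tweet" both
total 3 and "merge" totals 1, so the top two terms are lucene and tweet.
Whether this is fast enough for real-time tweet clustering depends on how
cheaply the per-document counts can be obtained, which this sketch does not
address.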


Re: setSegmentsPerTier >= setMaxMergeAtOnce ?

2013-06-09 Thread Michael McCandless
The two settings let you decouple your tolerance for how many segments
are allowed to accumulate (setSegmentsPerTier), from how large a
single merge can be (setMaxMergeAtOnce).

E.g. say setSegmentsPerTier is 20 and setMaxMergeAtOnce is 10.

The 20 gives TMP a "generous" budget to allow up to 20 segments per
"log level" to accumulate, but at that point it will pick 10 of them
and merge them down at once.  At that point there are still 10
segments at that log level, which is fine, until another 10 segments
are created at that level and another merge is selected.
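
The arithmetic in this example can be checked with a toy model. The sketch
below is not TieredMergePolicy code: it tracks a single "log level", adds one
flushed segment at a time, and, as described above, merges up to
maxMergeAtOnce segments whenever the level reaches the segmentsPerTier
budget. The class and method names are hypothetical.

```java
public class MergeBudgetSketch {

    /**
     * Simulates one level of segments under the toy model.
     * Returns {number of merges performed, segments left at the level}.
     */
    public static int[] simulate(int segmentsPerTier, int maxMergeAtOnce, int flushes) {
        int atLevel = 0; // segments currently sitting at this level
        int merges = 0;  // merges selected so far
        for (int i = 0; i < flushes; i++) {
            atLevel++; // a new segment arrives at this level
            if (atLevel >= segmentsPerTier) {
                // Budget reached: merge at most maxMergeAtOnce segments away.
                // The merged result belongs to the next level and is not tracked here.
                atLevel -= Math.min(maxMergeAtOnce, atLevel);
                merges++;
            }
        }
        return new int[] {merges, atLevel};
    }

    public static void main(String[] args) {
        int[] partial = simulate(20, 10, 40);
        System.out.println("budget 20, merge 10: merges=" + partial[0]
                + " remaining=" + partial[1]);
        int[] full = simulate(20, 20, 40);
        System.out.println("budget 20, merge 20: merges=" + full[0]
                + " remaining=" + full[1]);
    }
}
```

With 40 flushed segments, the (20, 10) setting performs three 10-way merges
and leaves 10 segments at the level, while (20, 20) performs two 20-way
merges and leaves none: fewer merges, but each one is larger.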

Mike McCandless

http://blog.mikemccandless.com


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: setSegmentsPerTier >= setMaxMergeAtOnce ?

2013-06-09 Thread Boaz Leskes
Hi Mike,

Thanks for the quick answer. So if I understand correctly, collapsing tiers
in one go leads to too many big merges. The goal is then to avoid overly large
merges, which would happen if we allowed complete tiers to be collapsed in one
merge. We would rather have a tier collapsed partially (and thus more
frequently). Am I correct?

Cheers,
Boaz





Re: setSegmentsPerTier >= setMaxMergeAtOnce ?

2013-06-09 Thread Michael McCandless
Hi Boaz,

That's correct!

But what counts as "too big" a merge is an app-level decision: it requires
testing in the "real" context and depends on things like how much free
RAM the OS can dedicate to read-ahead, whether you have an SSD,
whether you throttle the merge rate (RateLimitedDirectoryWrapper), etc.


Mike McCandless

http://blog.mikemccandless.com





Build your own Lucene finite state transducer

2013-06-09 Thread Michael McCandless
For those of you curious about Lucene's finite state transducers (FSTs)...

I just built a simple web app that lets you enter input/output pairs and
see the resulting FST.

It's running here:

http://examples.mikemccandless.com/fst.py

And here's a quick blog post showing some examples/details:


http://blog.mikemccandless.com/2013/06/build-your-own-finite-state-transducer.html

Happy FST building,

Mike McCandless

http://blog.mikemccandless.com




RE: Build your own Lucene finite state transducer

2013-06-09 Thread Doug Turnbull
Awesome work Mike! Kudos!




Please add me as a wiki editor

2013-06-09 Thread Lance Norskog

I'm responsible for the OpenNLP wiki page:
https://wiki.apache.org/solr/OpenNLP

Please add me to the list of editors.