Re: FST and FieldCache?

2011-05-19 Thread Dawid Weiss
Hi David,

 but with less memory.  As I understand it, FSTs are a highly compressed
 representation of a set of Strings (among other possibilities).  The

Yep. Not only that, but this is one of the use cases. Will you be at Lucene
Revolution next week? I'll be talking about it there.

 representation of a set of Strings (among other possibilities).  The
 fieldCache would need to point to an FST entry (an arc?) using something
 small, say an integer.  Is there a way to point to an FST entry with an
 integer, and then somehow with relative efficiency construct the String from
 the arcs to get there?

Correct me if my understanding is wrong: you'd like to assign a unique
integer to each String and then retrieve it by this integer (something
like a
Map<Integer, String>)? This would be something called perfect hashing
and this can be done on top of an automaton (fairly easily). I assume
the data structure is immutable once constructed and does not change
too often, right?
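
Something like this, interface-wise? (A hypothetical sketch just to pin
down what's being asked for -- not an existing Lucene API.)

// Ords are assigned once at build time; the set of strings is
// immutable afterwards.
interface OrdinalStringSet {
  int getOrd(String s);      // perfect hash: s -> unique int in [0, size())
  String getByOrd(int ord);  // reverse lookup: ord -> s
  int size();
}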

Dawid




Re: FST and FieldCache?

2011-05-19 Thread Earwin Burrfoot
You cannot get a string out of an automaton by its ordinal without
storing additional data.
The string is stored there not as a single arc, but as a sequence of
them (basically.. err.. as a string),
so referencing one amounts to writing the string out as-is. Space
savings here come from sharing arcs between strings.

Though, it's possible to do so if you associate an additional number with
each node. (I invented some way, shared it with Mike and forgot.. good
grief :/)

Perfect hashing, on the other hand, is like a Map<String, Integer>
that accepts a predefined set of N strings and returns an int in
the 0..N-1 interval.
And it can't do the reverse lookup; by design, that's lossy
compression for all good perfect hashing algos.
So it's irrelevant here, huh?









Re: FST and FieldCache?

2011-05-19 Thread Dawid Weiss
 Though, it's possible to do if you associate an additional number with
 each node. (I invented some way, shared it with Mike and forgot.. good
 grief :/)

It doesn't need to be invented -- it's a known technique. On each arc
you store the number of strings under that arc; while traversing you
accumulate those counts -- this gives you a unique number for each
string (a perfect hash) and a way to locate a string given its number.
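
As a rough sketch in plain-trie terms (made-up names, not Lucene's FST,
where the counts would live on arc outputs instead):

import java.util.Map;
import java.util.TreeMap;

class CountingTrie {
  static class Node {
    final TreeMap<Character, Node> children = new TreeMap<>();
    int subtreeCount;   // number of strings ending at or below this node
    boolean isFinal;    // a string ends exactly here
  }
  final Node root = new Node();

  void add(String s) {
    Node n = root;
    n.subtreeCount++;
    for (char c : s.toCharArray()) {
      n = n.children.computeIfAbsent(c, k -> new Node());
      n.subtreeCount++;
    }
    n.isFinal = true;
  }

  // Perfect hash: maps a string in the set to its ordinal in [0, N),
  // in lexicographic order. Assumes s is actually in the set.
  int ord(String s) {
    Node n = root;
    int ord = 0;
    for (char c : s.toCharArray()) {
      if (n.isFinal) ord++;  // the string ending here sorts before s
      for (Map.Entry<Character, Node> e : n.children.headMap(c).entrySet()) {
        ord += e.getValue().subtreeCount;  // skip earlier siblings' subtrees
      }
      n = n.children.get(c);
    }
    return ord;
  }
}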

 And it can't do the reverse lookup, by design, that's a lossy
 compression for all good perfect hashing algos.
 So, it's irrelevant here, huh?

You can do both the way I described above. Jan Daciuk has details on
many more variants of doing that:

Jan Daciuk, Rafael C. Carrasco, Perfect Hashing with Pseudo-minimal
Bottom-up Deterministic Tree Automata, Intelligent Information Systems
XVI, Proceedings of the International IIS'08 Conference held in
Zakopane, Poland, June 16-18, 2008, Mieczysław A. Kłopotek, Adam
Przepiórkowski, Sławomir T. Wierzchoń, Krzysztof Trojanowski (eds.),
Academic Publishing House Exit, Warszawa 2008.

Dawid




Re: FST and FieldCache?

2011-05-19 Thread Michael McCandless
We should add this (lookup by value, when value is guaranteed to
monotonically increase as the key increases) to our core FST APIs?
It's generically useful in many places ;)  I'll open an issue.

EG this would also enable an FST terms index that supports
lookup-by-ord, something VariableGapTermsIndex (this is the one that
uses FST for the index) does not support today.

David, one thing to remember is trunk has already seen drastic
reductions in the RAM required to store DocTerms/Index vs 3.x
(something maybe we should backport to 3.x...).  The bytes for the
terms are now stored as shared byte[] blocks, and the ords/offsets are
stored as packed ints, so we no longer have per-String memory/pointer
overhead.  I describe the gains here:
http://blog.mikemccandless.com/2010/07/lucenes-ram-usage-for-searching.html
-- though, those gains include RAM reduction from the terms index as
well.
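
Roughly, the layout has this shape (a simplified sketch: plain int[]
where Lucene uses packed ints, and one byte[] where Lucene pages the
term blocks):

import java.nio.charset.StandardCharsets;

class PackedDocTerms {
  final byte[] termBytes;   // all terms' UTF-8 bytes, concatenated
  final int[] termOffsets;  // term ord -> start offset; length numTerms+1
  final int[] docToOrd;     // doc id -> term ord (packed ints in real Lucene)

  PackedDocTerms(byte[] termBytes, int[] termOffsets, int[] docToOrd) {
    this.termBytes = termBytes;
    this.termOffsets = termOffsets;
    this.docToOrd = docToOrd;
  }

  // Resolve a document's term; the only per-doc storage is a small int.
  String term(int docId) {
    int ord = docToOrd[docId];
    int start = termOffsets[ord];
    return new String(termBytes, start, termOffsets[ord + 1] - start,
                      StandardCharsets.UTF_8);
  }
}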

While FST w/ lookup-by-monotonic-value would work here, I would be
worried about the perf hit of that representation vs what
DocTerms/Index offers today... we should test to see.  Of course, for
certain apps that perf hit is justified, so probably we should make
this an option when populating field cache (ie, in-memory storage
option of using an FST vs using packed ints/byte[]).

Mike

http://blog.mikemccandless.com







Re: FST and FieldCache?

2011-05-19 Thread Michael McCandless
On Thu, May 19, 2011 at 6:16 AM, Dawid Weiss
dawid.we...@cs.put.poznan.pl wrote:
 We should add this (lookup by value, when value is guaranteed to
 monotonically increase as the key increases) to our core FST APIs?
 It's generically useful in many places ;)  I'll open an issue.

 The data structure itself should sort of build itself if you create
 an FST with increasing integers because the shared suffix should be
 pushed towards the root anyway, so the only thing would be to correct
 values on all outgoing arcs (they need to contain the count of leaves
 on the subtree) but then, this may be tricky if arc values are
 vcoded... I'd have to think how to do this.

I think, if we add ord as an output to the FST, then it builds
everything we need?  Ie no further data structures should be needed?
Maybe I'm confused :)

 While FST w/ lookup-by-monotonic-value would work here, I would be
 worried about the perf hit of that representation vs what

 There are actually two things:

 a) performance; you need to descend in the automaton and some
 bookkeeping to maintain the count of nodes; this adds overhead,

 b) size; the procedure for storing/calculating perfect hashes I
 described requires leaf counts on each arc and these are usually large
 integers. Even vcoded they bloat the resulting data structure.

Maybe we should iterate on the issue to get down to the specifics?  I
had thought there wouldn't be any backtracking, if the FST had stored
the ord as an output...

Mike

http://blog.mikemccandless.com




Re: FST and FieldCache?

2011-05-19 Thread Dawid Weiss
 I think, if we add ord as an output to the FST, then it builds
 everything we need?  Ie no further data structures should be needed?
 Maybe I'm confused :)

If you put the ord as an output the common part will be shifted towards the
front of the tree. This will work if you want to look up a given value
assigned to some string, but will not work if you need to look up the string
from its value. The latter case can be solved if you know which branch to
take while descending from the root, and the shared prefix alone won't give you
this information. At least I don't see how it could.

I am familiar with the basic prefix hashing procedure suggested by Daciuk
(and other authors), but maybe some progress has been made there, I don't
know... the one I know is really conceptually simple -- since each arc
encodes the number of leaves (or input sequences) in the automaton, you know
which path must lead you to your string. For example, if you have a node
like this and seek the 12th term:

0 -- 10 -- ...
  +- 10 -- ...
  +- 5 -- ..

you look at the first path: it'd give you terms 1..10. The next one
contains terms 11..20, so you add 10 to an internal counter that is carried
into further computations; descend and repeat the procedure until you find
a leaf node.
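
In sketch form, reusing the counting-trie shape from earlier in the
thread (illustrative names only -- Lucene's FST would keep the counts
on its arcs):

import java.util.Map;

class OrdLookup {
  // Reverse lookup: ordinal -> string, via the descent described above.
  // Assumes 0 <= ord < number of strings in the trie.
  static String byOrd(CountingTrie trie, int ord) {
    StringBuilder sb = new StringBuilder();
    CountingTrie.Node n = trie.root;
    while (true) {
      if (n.isFinal) {
        if (ord == 0) return sb.toString(); // the target string ends here
        ord--;                              // skip the string ending here
      }
      for (Map.Entry<Character, CountingTrie.Node> e : n.children.entrySet()) {
        int count = e.getValue().subtreeCount;
        if (ord < count) {                  // target lies inside this subtree
          sb.append(e.getKey());
          n = e.getValue();
          break;
        }
        ord -= count;                       // whole subtree sorts before target
      }
    }
  }
}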

Dawid


Re: FST and FieldCache?

2011-05-19 Thread Earwin Burrfoot
 I think, if we add ord as an output to the FST, then it builds
 everything we need?  Ie no further data structures should be needed?
 Maybe I'm confused :)

 If you put the ord as an output the common part will be shifted towards the
 front of the tree. This will work if you want to look up a given value
 assigned to some string, but will not work if you need to look up the string
 from its value. The latter case can be solved if you know which branch to
 take while descending from root and the shared prefix alone won't give you
 this information. At least I don't see how it could.

 I am familiar with the basic prefix hashing procedure suggested by Daciuk
 (and other authors), but maybe some progress has been made there, I don't
 know... the one I know is really conceptually simple -- since each arc
 encodes the number of leaves (or input sequences) in the automaton, you know
 which path must lead you to your string. For example if you have a node like
 this and seek for the 12-th term:

 0 -- 10 -- ...
   +- 10 -- ...
   +- 5 -- ..
 you look at the first path, it'd give you terms 1..10, then the next one
 contains terms 11..20 so you add 10 to an internal counter which is added to
 further computations, descend and repeat the procedure until you find a leaf
 node.

 Dawid

There's a possible speedup here. If, instead of storing the count of
all downstream leaves, you store the sum of counts for all previous
siblings, you can do a binary lookup instead of linear scan on each
node.
Taking your example:

0 -- 0 -- ...
  +- 10 -- ... We know that for the 12th term we should descend along
this edge, as it has the biggest tag less than 12.
  +- 15 -- ...

That's what I invented, and yes, it was invented by countless people before :)
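
In sketch form, assuming each node stores its arcs with those running
sums (0, 10, 15 above):

class ArcPicker {
  // Binary search for the last arc whose prefix sum is <= ord;
  // for prefixSums = {0, 10, 15} and ord = 12 this returns arc 1.
  static int pickArc(int[] prefixSums, int ord) {
    int lo = 0, hi = prefixSums.length - 1;
    while (lo < hi) {
      int mid = (lo + hi + 1) >>> 1;  // round up so lo converges on the answer
      if (prefixSums[mid] <= ord) lo = mid;
      else hi = mid - 1;
    }
    return lo;  // next: ord -= prefixSums[lo], then descend that arc
  }
}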




Re: FST and FieldCache?

2011-05-19 Thread Michael McCandless
This (storing sums) is, I think, exactly what the FST stores as
outputs on the arcs.  Ie, it bounds the range of outputs if you were
to recurse on that arc.

So, from any node, we can unambiguously determine which arc to recurse
on, when looking up by value (only if the value is strictly
monotonic).

It should be straightforward to implement, ie should not require any
additional data structure / storage in the FST.  It's a lookup-only
change, I think.

Mike

http://blog.mikemccandless.com







Re: FST and FieldCache?

2011-05-19 Thread Dawid Weiss
 That's what I invented, and yes, it was invented by countless people before
 :)


You know I didn't mean to sound rude, right? I really admire your
ability to come up with these solutions by yourself; I'm merely copying
other folks' ideas.

Anyway, the optimization you're describing is certainly possible. Lucene's
FST implementation can actually combine both approaches, because always
expanding nodes is inefficient and those already expanded will allow a
binary search (assuming the automaton structure is known to the implementation).

Another refinement of this idea creates a detached table (err.. index :) of
states to start from inside the automaton, so that you don't have to go
through the initial 2-3 states, which are more or less always large and
where even a binary search is costly.
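
Hypothetically, the table could be as dumb as this (not actual Lucene
code):

class RootDispatch {
  // For every possible first byte, the automaton state reached from the
  // root (-1 if no arc carries that label). Built once at load time; a
  // lookup then starts its descent one level down, past the big root state.
  final int[] stateAfterFirstByte = new int[256];

  int startState(byte firstLabel) {
    return stateAfterFirstByte[firstLabel & 0xFF];
  }
}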

Dawid


Re: FST and FieldCache?

2011-05-19 Thread Earwin Burrfoot
On Thu, May 19, 2011 at 16:45, Dawid Weiss dawid.we...@cs.put.poznan.pl wrote:

 That's what I invented, and yes, it was invented by countless people
 before :)
 You know I didn't mean to sound rude, right? I'm really admiring your
 ability to come up with these solutions by yourself, I'm merely copying
 other folks' ideas.
I tried to prevent another reference to Mr. Daciuk :)

 Anyway, the optimization you're describing is sure possible. Lucene's FST
 implementation can actually combine both approaches because always expanding
 nodes is inefficient and those already expanded will allow a binary search
 (assuming the automaton structure is known to the implementation).
 Another refinement of this idea creates a detached table (err.. index :) of
 states to start from inside the automaton, so that you don't have to go
 through the initial 2-3 states which are more or less always large and even
 binary search is costly there.
 Dawid

But you have to look up this err..index somehow. And that's either a
binary or a hash lookup. Where's the win?





Re: FST and FieldCache?

2011-05-19 Thread Robert Muir
2011/5/19 Michael McCandless luc...@mikemccandless.com:

 Of course, for
 certain apps that perf hit is justified, so probably we should make
 this an option when populating field cache (ie, in-memory storage
 option of using an FST vs using packed ints/byte[]).


or should we actually try to have different fieldcacheimpls?

I see all these missions to refactor the thing, which always fail.

maybe that's because we have one huge monolithic implementation.




Re: FST and FieldCache?

2011-05-19 Thread Dawid Weiss
 I tried to prevent another reference to mr. Daciuk :)

Why? He's a very nice guy :)

 But you have to lookup this err..index somehow. And that's either
 binary or hash lookup. Where's the win?

You can do a sparse O(1) index and have a slight gain from these few
large initial states. This only makes sense if you perform tons of
these lookups, really.

Mike's right -- the FST will output a structure that is ready to be
used for by-number retrieval of strings (or anything else) as long as
the numbers are strictly monotonic (and preferably contiguous). The
output will be what you're suggesting, Earwin -- accumulated sums.

Dawid




Re: FST and FieldCache?

2011-05-19 Thread Jason Rutherglen
 maybe that's because we have one huge monolithic implementation

Doesn't the DocValues branch solve this?

Also, instead of trying to implement clever ways of compressing
strings in the field cache, which probably won't bear fruit, I'd
prefer to look at [eventually] MMap'ing (using DV) the field caches to
avoid the loading and heap costs, which are significant.  I'm not sure
if we can easily MMap packed ints and the shared byte[], though it
seems fairly doable?







Re: FST and FieldCache?

2011-05-19 Thread Earwin Burrfoot
This is more about compressing strings in the TermsIndex, I think.
And the ability to use said TermsIndex directly in some cases that
required FieldCache before. (Maybe FC is still needed, but it can be
degraded to a docId->ord map, storing the actual strings in the TI.)
This yields fat space savings when we, eg, need to both look up on a
field and build facets out of it.

mmap is cool :)  What I want to see is an FST-based TermsDict that is
simply mmap'ed into memory, without building intermediate indexes like
Lucene does now.
And docvalues are orthogonal to that, no?
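
Sketch of the faceting path with such a degraded FC (term strings would
be resolved via the terms index's seek-by-ord only for the few top ords
actually displayed -- that lookup isn't shown here):

class FacetSketch {
  // Count facet values per ord over the hit set; no term bytes are
  // touched at all while counting.
  static int[] facetCounts(int[] docToOrd, int[] hitDocs, int numTerms) {
    int[] counts = new int[numTerms];
    for (int doc : hitDocs) {
      counts[docToOrd[doc]]++;
    }
    return counts;  // then resolve just the top-N ords through the TI
  }
}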









Re: FST and FieldCache?

2011-05-19 Thread Jason Rutherglen
 This is more about compressing strings in TermsIndex, I think.

Ah, because they're sorted.  I think if the string lookup cost
degrades then it's not worth it?  That's something that needs to be
tested in the MMap case as well, eg, are ByteBuffers somehow slowing
everything down by 10%?






Re: FST and FieldCache?

2011-05-19 Thread Michael McCandless
On Thu, May 19, 2011 at 9:22 AM, Jason Rutherglen
jason.rutherg...@gmail.com wrote:

 maybe that's because we have one huge monolithic implementation

 Doesn't the DocValues branch solve this?

Hopefully DocValues will replace FieldCache over time; maybe some day
we can deprecate & remove FieldCache.

But we still have work to do there, I believe; eg we don't have
comparators for all types (on the docvalues branch) yet.

 Also, instead of trying to implement clever ways of compressing
 strings in the field cache, which probably won't bear fruit, I'd
 prefer to look at [eventually] MMap'ing (using DV) the field caches to
 avoid the loading and heap costs, which are significant.  I'm not sure
 if we can easily MMap packed ints and the shared byte[], though it
 seems fairly doable?

In fact, the packed ints and the byte[] packing of terms data are very
much amenable/necessary for using MMap, far more so than the separate
objects we had before.

I agree we should make an mmap option, though I would generally
recommend against apps using mmap for these caches.  We load these
caches so that we'll have fast random access to potentially a great
many documents during collection of one query (eg for sorting).  When
you mmap them you let the OS decide when to swap stuff out, which means
you pick up potentially high query latency waiting for these pages to
swap back in.  Various other data structures in Lucene need this fast
random access (norms, del docs, terms index) and that's why we put
them in RAM.  I do agree for all else (the large postings), MMap is
great.

Of course the OS swaps out process RAM anyway, so... it's kinda moot
(unless you've fixed your OS to not do this, which I always do!).

I think a more productive area of exploration (to reduce RAM usage)
would be to make a StringFieldComparator that doesn't need full access
to all terms data, ie, operates per segment yet only does a few ord
lookups when merging the results across segments.  If "few" is small
enough we can just use the seek-by-ord from the terms dict to do
them.  This would be a huge RAM reduction because we could then sort
by string fields (eg title field) without needing all term bytes
randomly accessible.

Mike

http://blog.mikemccandless.com




Re: FST and FieldCache?

2011-05-19 Thread Jason Rutherglen
 When
 you mmap them you let the OS decide when to swap stuff out, which means
 you pick up potentially high query latency waiting for these pages to
 swap back in

Right, however if one is using, let's say, SSDs, and the query time is
less important, then MMap'ing would be fine.  Also it prevents deadly
OOMs in favor of basic 'slowness' of the query.  If there is no
performance degradation I think MMap'ing is a great option.  A common
use case, an index that's far too large for a given server, will
simply not work today, whereas with MMap'ed field caches the query
would complete, just extremely slowly.  If the user wishes to improve
performance it's easy enough to add more hardware.







Re: FST and FieldCache?

2011-05-19 Thread David Smiley (@MITRE.org)

Michael McCandless wrote:
 
 I think a more productive area of exploration (to reduce RAM usage)
 would be to make a StringFieldComparator that doesn't need full access
 to all terms data, ie, operates per segment yet only does a few ord
 lookups when merging the results across segments.  If "few" is small
 enough we can just use the seek-by-ord from the terms dict to do
 them.  This would be a huge RAM reduction because we could then sort
 by string fields (eg title field) without needing all term bytes
 randomly accessible.
 
 Mike
 

Yes!  I don't want to put all my titles into RAM just to sort documents by
them when I know Lucene has indexed the titles in sorted order on disk
already.  Of course the devil is in the details.

~ David Smiley




Re: FST and FieldCache?

2011-05-19 Thread David Smiley (@MITRE.org)
Wow, 17 replies to my email overnight! This is clearly an interesting topic
to folks.

Hi Dawid.
Sadly, I won't be at Lucene Revolution next week. That's where all the cool
kids will be; I'll be home and be square. I made it to O'Reilly Strata in
February (a great conference) and I'll be presenting at Basis's Open Source
Search Conference (government-customer focused) in mid-June.  I've used up my
conference budget for this fiscal year.

Yes, the use-case here is a unique integer reference to a String that can be
looked up fairly quickly, whereas the set of all strings is in a compressed
data structure that won't change after it's built. A bonus benefit would be
that this integer is a sortable substitute for the String.  Your observation
that this integer is a perfect hash is astute.

I wonder if Lucene could store this FST on disk for the term bytes in a
segment instead of what it's doing now? Read-time construction would be
super-fast, though for multi-segment indexes, I suppose they'd need to be
merged.

I expect that this use-case would be particularly useful for cases when you
know that the set of strings tends to have a great many prefixes in
common, such as when EdgeNGramming (applications: query completion,
hierarchical faceting, prefix/tree based geospatial indexing).

~ David Smiley






Re: FST and FieldCache?

2011-05-19 Thread Michael McCandless
On Thu, May 19, 2011 at 10:09 AM, Jason Rutherglen
jason.rutherg...@gmail.com wrote:
 When
 you mmap them you let the OS decide when to swap stuff out, which means
 you pick up potentially high query latency waiting for these pages to
 swap back in

 Right, however if one is using, let's say, SSDs, and the query time is
 less important, then MMap'ing would be fine.  Also it prevents deadly
 OOMs in favor of basic 'slowness' of the query.  If there is no
 performance degradation I think MMap'ing is a great option.  A common
 use case, an index that's far too large for a given server, will
 simply not work today, whereas with MMap'ed field caches the query
 would complete, just extremely slowly.  If the user wishes to improve
 performance it's easy enough to add more hardware.

Well, be careful: if you just don't have enough memory to accommodate
all the RAM data structures Lucene needs... you're gonna be in trouble
with mmap too.  True, you won't hit OOMEs anymore, but instead you'll
be in a swap fest and your app will be nearly unusable.

SSDs, while orders of magnitude faster than spinning magnets, are
still orders of magnitude slower than RAM.

But, yes, they obviously help substantially.  It's a one-way door...
you'll never go back once you've switched to SSDs.

And I do agree there are times when mmap is appropriate, eg if query
latency is unimportant to you, but it's not a panacea and it comes
with serious downsides.

I wish I could have the opposite of mmap from Java -- the ability to
pin the pages that hold important data structures.

Mike

http://blog.mikemccandless.com




Re: FST and FieldCache?

2011-05-19 Thread Jason Rutherglen
 And I do agree there are times when mmap is appropriate, eg if query
 latency is unimportant to you, but it's not a panacea and it comes
 with serious downsides

Do we have a benchmark of ByteBuffer vs. byte[]'s in RAM?

There are also RAM-based SSDs whose performance could be comparable
with, well, RAM.  Also, with our heap-based field caches, the first sorted
search requires that they be loaded into RAM.  Then we don't unload
them until the reader is closed?  With MMap the unloading would happen
automatically?







Re: FST and FieldCache?

2011-05-19 Thread Michael McCandless
On Thu, May 19, 2011 at 12:35 PM, Jason Rutherglen
jason.rutherg...@gmail.com wrote:
 And I do agree there are times when mmap is appropriate, eg if query
 latency is unimportant to you, but it's not a panacea and it comes
 with serious downsides

 Do we have a benchmark of ByteBuffer vs. byte[]'s in RAM?

I don't know of a straight up comparison...

 There are also RAM-based SSDs whose performance could be comparable
 with, well, RAM.

True, though it's through layers of abstraction designed originally
for serving files off of spinning magnets :)

 Also, with our heap based field caches, the first sorted
 search requires that they be loaded into RAM.  Then we don't unload
 them until the reader is closed?  With MMap the unloading would happen
 automatically?

True, but really if the app knows it won't need that FC entry for a
long time (ie, long enough to make it worth unloading/reloading) then
it should just unload it.  MMap would still have to write all those
pages to disk...

DocValues actually makes this a lot cheaper because loading DocValues
is much faster (~100X in Simon's testing) than populating
FieldCache, since FieldCache must do all the uninverting.

Mike




Re: FST and FieldCache?

2011-05-19 Thread Earwin Burrfoot
On Thu, May 19, 2011 at 20:43, Michael McCandless
luc...@mikemccandless.com wrote:
 On Thu, May 19, 2011 at 12:35 PM, Jason Rutherglen
 jason.rutherg...@gmail.com wrote:
 And I do agree there are times when mmap is appropriate, eg if query
 latency is unimportant to you, but it's not a panacea and it comes
 with serious downsides

 Do we have a benchmark of ByteBuffer vs. byte[]'s in RAM?

 I don't know of a straight up comparison...
I did compare MMapDir vs RAMDir variant a couple of years ago.
Searches slowed down a teeny-weeny little bit. GC times went down
noticeably. For me it was a big win.

Whatever Mike might say, mmap is great for latency-conscious applications : )

If someone tries to create an artificial benchmark for byte[] vs
ByteBuffer, I'd recommend going through Lucene's abstraction layer.
If you simply read/write in a loop, the JIT will optimize away bounds
checks for byte[] in some cases. That never happened with the *Buffer
family for me.
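
Eg, a toy harness of this shape (not a rigorous benchmark) keeps both
paths behind the same virtual call, the way Lucene's abstraction layer
does:

import java.nio.ByteBuffer;

abstract class ByteSource {
  abstract byte get(int i);
}

class ArraySource extends ByteSource {
  final byte[] bytes;
  ArraySource(byte[] bytes) { this.bytes = bytes; }
  byte get(int i) { return bytes[i]; }
}

class BufferSource extends ByteSource {
  final ByteBuffer buf;
  BufferSource(ByteBuffer buf) { this.buf = buf; }
  byte get(int i) { return buf.get(i); }
}

class Harness {
  // Both implementations are read through the same virtual call site,
  // so neither gets its bounds checks specialized away unfairly.
  static long sum(ByteSource src, int len) {
    long total = 0;
    for (int i = 0; i < len; i++) total += src.get(i);
    return total;
  }
}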









FST and FieldCache?

2011-05-18 Thread David Smiley (@MITRE.org)
I've been pondering how to reduce the size of FieldCache entries when there
are a large number of Strings. I'd like to facet on such a field with Solr
but with less memory.  As I understand it, FSTs are a highly compressed
representation of a set of Strings (among other possibilities).  The
fieldCache would need to point to an FST entry (an arc?) using something
small, say an integer.  Is there a way to point to an FST entry with an
integer, and then somehow with relative efficiency construct the String from
the arcs to get there?

~ David Smiley
