Re: FST and FieldCache?
Hi David,

> but with less memory. As I understand it, FSTs are a highly compressed
> representation of a set of Strings (among other possibilities).

Yep. Not only, but this is one of the use cases. Will you be at Lucene Revolution next week? I'll be talking about it there.

> The fieldCache would need to point to an FST entry (an arc?) using
> something small, say an integer. Is there a way to point to an FST entry
> with an integer, and then somehow with relative efficiency construct the
> String from the arcs to get there?

Correct me if my understanding is wrong: you'd like to assign a unique integer to each String and then retrieve it by this integer (something like a Map<Integer, String>)? This would be something called perfect hashing, and it can be done on top of an automaton (fairly easily). I assume the data structure is immutable once constructed and does not change too often, right?

Dawid

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
Re: FST and FieldCache?
You cannot get a string out of an automaton by its ordinal without storing additional data. The string is stored there not as a single arc, but as a sequence of them (basically... err... as a string), so referencing them is basically writing the string as-is. Space savings here come from sharing arcs between strings.

Though, it's possible to do if you associate an additional number with each node. (I invented some way, shared it with Mike and forgot... good grief :/)

Perfect hashing, on the other hand, is like a Map<String, Integer> that accepts a predefined set of N strings and returns an int in the 0..N-1 interval. And it can't do the reverse lookup, by design; that's a lossy compression for all good perfect hashing algos. So, it's irrelevant here, huh?

On Thu, May 19, 2011 at 08:53, David Smiley (@MITRE.org) dsmi...@mitre.org wrote:
> I've been pondering how to reduce the size of FieldCache entries when
> there are a large number of Strings. I'd like to facet on such a field
> with Solr but with less memory. As I understand it, FSTs are a highly
> compressed representation of a set of Strings (among other
> possibilities). The fieldCache would need to point to an FST entry (an
> arc?) using something small, say an integer. Is there a way to point to
> an FST entry with an integer, and then somehow with relative efficiency
> construct the String from the arcs to get there?
>
> ~ David Smiley

--
Kirill Zakharenko/Кирилл Захаренко
E-Mail/Jabber: ear...@gmail.com
Phone: +7 (495) 683-567-4
ICQ: 104465785
Re: FST and FieldCache?
> Though, it's possible to do if you associate an additional number with
> each node. (I invented some way, shared it with Mike and forgot... good
> grief :/)

It doesn't need to be invented -- it's a known technique. On each arc you store the number of strings under that arc; while traversing you accumulate -- this gives you a unique number for each string (perfect hash) and a way to locate a string given its number.

> And it can't do the reverse lookup, by design; that's a lossy compression
> for all good perfect hashing algos. So, it's irrelevant here, huh?

You can do both the way I described above. Jan Daciuk has details on many more variants of doing that:

Jan Daciuk, Rafael C. Carrasco, Perfect Hashing with Pseudo-minimal Bottom-up Deterministic Tree Automata, Intelligent Information Systems XVI, Proceedings of the International IIS'08 Conference held in Zakopane, Poland, June 16-18, 2008, Mieczysław A. Kłopotek, Adam Przepiórkowski, Sławomir T. Wierzchoń, Krzysztof Trojanowski (eds.), Academic Publishing House Exit, Warszawa 2008.

Dawid
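[Editor's sketch] The count-and-accumulate technique described above can be sketched on a plain trie standing in for the automaton (illustrative Python, not Lucene's FST code; all names here are made up):

```python
# Per-arc string counts turn a trie into a perfect hash: while descending,
# sum the counts of earlier sibling arcs (plus 1 for any shorter string
# ending at a node we pass through) to get a unique ordinal per string.

class Node:
    def __init__(self):
        self.arcs = {}          # label -> [count, child]
        self.is_final = False

def build(sorted_strings):
    root = Node()
    for s in sorted_strings:
        node = root
        for ch in s:
            if ch not in node.arcs:
                node.arcs[ch] = [0, Node()]
            node.arcs[ch][0] += 1          # one more string under this arc
            node = node.arcs[ch][1]
        node.is_final = True
    return root

def ordinal(root, s):
    """String -> ordinal in 0..N-1 (a perfect hash)."""
    node, ord_ = root, 0
    for ch in s:
        if node.is_final:
            ord_ += 1                      # a shorter string ends here; it sorts first
        for label in sorted(node.arcs):    # earlier siblings sort before us
            if label == ch:
                break
            ord_ += node.arcs[label][0]    # skip all strings under that arc
        node = node.arcs[ch][1]
    return ord_

terms = ["car", "cat", "do", "dog", "dogs"]
print([ordinal(build(terms), t) for t in terms])  # [0, 1, 2, 3, 4]
```

The same counts support the reverse (ordinal-to-string) walk, which is what makes this more useful than an ordinary lossy perfect hash.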
Re: FST and FieldCache?
We should add this (lookup by value, when the value is guaranteed to monotonically increase as the key increases) to our core FST APIs? It's generically useful in many places ;) I'll open an issue.

EG this would also enable an FST terms index that supports lookup-by-ord, something VariableGapTermsIndex (this is the one that uses an FST for the index) does not support today.

David, one thing to remember is trunk has already seen drastic reductions in the RAM required to store DocTerms/Index vs 3.x (something maybe we should backport to 3.x...). The bytes for the terms are now stored as shared byte[] blocks, and the ords/offsets are stored as packed ints, so we no longer have per-String memory pointer overhead. I describe the gains here: http://blog.mikemccandless.com/2010/07/lucenes-ram-usage-for-searching.html -- though, those gains include RAM reduction from the terms index as well.

While FST w/ lookup-by-monotonic-value would work here, I would be worried about the perf hit of that representation vs what DocTerms/Index offers today... we should test to see. Of course, for certain apps that perf hit is justified, so probably we should make this an option when populating the field cache (ie, an in-memory storage option of using an FST vs using packed ints/byte[]).

Mike
http://blog.mikemccandless.com

On Thu, May 19, 2011 at 4:43 AM, Dawid Weiss dawid.we...@cs.put.poznan.pl wrote:
> It doesn't need to be invented -- it's a known technique. On each arc you
> store the number of strings under that arc; while traversing you
> accumulate -- this gives you a unique number for each string (perfect
> hash) and a way to locate a string given its number.
> [...]
> You can do both the way I described above. Jan Daciuk has details on many
> more variants of doing that.
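[Editor's sketch] The shared byte[] blocks plus packed-int offsets layout Mike mentions can be illustrated like so (a Python sketch of the general idea, not Lucene's actual DocTermsIndex code):

```python
from array import array

# All term bytes live in one contiguous buffer; ord -> term lookup is two
# offset reads plus a slice, with no per-String object or pointer overhead.

class TermBlock:
    def __init__(self, sorted_terms):
        self.data = bytearray()
        self.offsets = array("l", [0])      # ord-indexed start offsets
        for t in sorted_terms:
            self.data.extend(t.encode("utf-8"))
            self.offsets.append(len(self.data))

    def term(self, ord_):
        start, end = self.offsets[ord_], self.offsets[ord_ + 1]
        return self.data[start:end].decode("utf-8")

tb = TermBlock(["car", "cat", "dog"])
print(tb.term(1))  # cat
```

In Java the offsets array would be packed ints (only as many bits per entry as the largest offset needs), which is where the extra savings over plain int[] come from.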
Re: FST and FieldCache?
On Thu, May 19, 2011 at 6:16 AM, Dawid Weiss dawid.we...@cs.put.poznan.pl wrote:
>> We should add this (lookup by value, when the value is guaranteed to
>> monotonically increase as the key increases) to our core FST APIs? It's
>> generically useful in many places ;) I'll open an issue.
>
> The data structure itself should sort of build itself if you create an
> FST with increasing integers, because the shared suffix should be pushed
> towards the root anyway, so the only thing would be to correct values on
> all outgoing arcs (they need to contain the count of leaves in the
> subtree); but then, this may be tricky if arc values are vcoded... I'd
> have to think how to do this.

I think, if we add ord as an output to the FST, then it builds everything we need? Ie no further data structures should be needed? Maybe I'm confused :)

>> While FST w/ lookup-by-monotonic-value would work here, I would be
>> worried about the perf hit of that representation vs what [...]
>
> There are actually two things: a) performance; you need to descend in the
> automaton plus some bookkeeping to maintain the count of nodes; this adds
> overhead, b) size; the procedure for storing/calculating perfect hashes I
> described requires leaf counts on each arc and these are usually large
> integers. Even vcoded they bloat the resulting data structure. Maybe we
> should iterate on the issue to get down to the specifics?

I had thought there wouldn't be any backtracking, if the FST had stored the ord as an output...

Mike
http://blog.mikemccandless.com
Re: FST and FieldCache?
> I think, if we add ord as an output to the FST, then it builds everything
> we need? Ie no further data structures should be needed? Maybe I'm
> confused :)

If you put the ord as an output, the common part will be shifted towards the front of the tree. This will work if you want to look up a given value assigned to some string, but it will not work if you need to look up the string from its value. The latter case can be solved if you know which branch to take while descending from the root, and the shared prefix alone won't give you this information. At least I don't see how it could.

I am familiar with the basic prefix hashing procedure suggested by Daciuk (and other authors), but maybe some progress has been made there, I don't know... the one I know is really conceptually simple -- since each arc encodes the number of leaves (or input sequences) in the automaton, you know which path must lead you to your string. For example, if you have a node like this and seek for the 12-th term:

  0 -+- 10 -- ...
     +- 10 -- ...
     +- 5  -- ...

you look at the first path: it'd give you terms 1..10; then the next one contains terms 11..20, so you add 10 to an internal counter which is added to further computations, descend and repeat the procedure until you find a leaf node.

Dawid
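[Editor's sketch] The descend-by-count walk above fits in a few lines (illustrative Python over a counted trie, not Lucene's FST; input strings must be inserted in sorted order, and n is assumed to be in range):

```python
# Reverse lookup (ordinal -> string): at each node, skip whole subtrees
# whose leaf-count ranges don't contain the target ordinal, subtracting
# what was skipped -- exactly the 12-th-term walk described above.

class Node:
    def __init__(self):
        self.final = False
        self.arcs = []            # [label, count, child], in insertion order

def insert(root, s):
    node = root
    for ch in s:
        for arc in node.arcs:
            if arc[0] == ch:
                arc[1] += 1       # one more string under this arc
                node = arc[2]
                break
        else:
            child = Node()
            node.arcs.append([ch, 1, child])
            node = child
    node.final = True

def nth(root, n):
    """0-based ordinal -> string."""
    out, node = [], root
    while True:
        if node.final:
            if n == 0:
                return "".join(out)
            n -= 1                 # the string ending here comes first
        for label, count, child in node.arcs:
            if n < count:          # target lies inside this subtree
                out.append(label)
                node = child
                break
            n -= count             # skip the whole subtree

root = Node()
for t in ["car", "cat", "do", "dog", "dogs"]:
    insert(root, t)
print([nth(root, i) for i in range(5)])  # ['car', 'cat', 'do', 'dog', 'dogs']
```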
Re: FST and FieldCache?
> [...] the one I know is really conceptually simple -- since each arc
> encodes the number of leaves (or input sequences) in the automaton, you
> know which path must lead you to your string. For example, if you have a
> node like this and seek for the 12-th term:
>
>   0 -+- 10 -- ...
>      +- 10 -- ...
>      +- 5  -- ...
>
> you look at the first path: it'd give you terms 1..10; then the next one
> contains terms 11..20, so you add 10 to an internal counter which is
> added to further computations, descend and repeat the procedure until
> you find a leaf node.

There's a possible speedup here. If, instead of storing the count of all downstream leaves, you store the sum of counts for all previous siblings, you can do a binary lookup instead of a linear scan at each node. Taking your example:

  0 -+- 0  -- ...
     +- 10 -- ...
     +- 15 -- ...

We know that for the 12-th term we should descend along the second edge, as it has the biggest tag less than 12.

That's what I invented, and yes, it was invented by countless people before :)
Re: FST and FieldCache?
This (storing sums) is, I think, exactly what the FST stores as outputs on the arcs. Ie, it bounds the range of outputs if you were to recurse on that arc. So, from any node, we can unambiguously determine which arc to recurse on, when looking up by value (only if the value is strictly monotonic).

It should be straightforward to implement, ie it should not require any additional data structure / storage in the FST. It's a lookup-only change, I think.

Mike
http://blog.mikemccandless.com

On Thu, May 19, 2011 at 8:31 AM, Earwin Burrfoot ear...@gmail.com wrote:
> There's a possible speedup here. If, instead of storing the count of all
> downstream leaves, you store the sum of counts for all previous siblings,
> you can do a binary lookup instead of a linear scan at each node.
> [...]
> That's what I invented, and yes, it was invented by countless people
> before :)
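[Editor's sketch] With the previous-siblings sums stored per arc, picking the right outgoing arc becomes a binary search; a minimal Python sketch (the numbers are the hypothetical node from the example, not real data):

```python
import bisect

# Each arc's tag is the sum of leaf counts of all *previous* siblings, so
# the arcs' tags form a sorted prefix-sum array and the arc to descend for
# a given ordinal is found with bisect instead of a linear sibling scan.

def pick_arc(prefix_sums, n):
    """Index of the arc covering 0-based ordinal n: the arc with the
    largest tag <= n."""
    return bisect.bisect_right(prefix_sums, n) - 1

sums = [0, 10, 15]         # arcs covering ordinals 0..9, 10..14, 15..
print(pick_arc(sums, 11))  # 1: the 12-th term lies under the second arc
```

After descending, you subtract the chosen arc's tag from n and repeat at the child node, same as in the linear-scan version.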
Re: FST and FieldCache?
> That's what I invented, and yes, it was invented by countless people
> before :)

You know I didn't mean to sound rude, right? I'm really admiring your ability to come up with these solutions by yourself; I'm merely copying other folks' ideas.

Anyway, the optimization you're describing is surely possible. Lucene's FST implementation can actually combine both approaches, because always expanding nodes is inefficient, and those already expanded will allow a binary search (assuming the automaton structure is known to the implementation). Another refinement of this idea creates a detached table (err... index :) of states to start from inside the automaton, so that you don't have to go through the initial 2-3 states, which are more or less always large, and even binary search is costly there.

Dawid
Re: FST and FieldCache?
On Thu, May 19, 2011 at 16:45, Dawid Weiss dawid.we...@cs.put.poznan.pl wrote:
> You know I didn't mean to sound rude, right? I'm really admiring your
> ability to come up with these solutions by yourself; I'm merely copying
> other folks' ideas.

I tried to prevent another reference to mr. Daciuk :)

> Another refinement of this idea creates a detached table (err... index :)
> of states to start from inside the automaton, so that you don't have to
> go through the initial 2-3 states, which are more or less always large,
> and even binary search is costly there.

But you have to look up this err..index somehow. And that's either a binary or hash lookup. Where's the win?
Re: FST and FieldCache?
2011/5/19 Michael McCandless luc...@mikemccandless.com:
> Of course, for certain apps that perf hit is justified, so probably we
> should make this an option when populating field cache (ie, in-memory
> storage option of using an FST vs using packed ints/byte[]).

Or should we actually try to have different FieldCache impls? I see all these missions to refactor the thing, which always fail. Maybe that's because we have one huge monolithic implementation.
Re: FST and FieldCache?
> I tried to prevent another reference to mr. Daciuk :)

Why? He's a very nice guy :)

> But you have to look up this err..index somehow. And that's either a
> binary or hash lookup. Where's the win?

You can do a sparse O(1) index and have a slight gain from these few large initial states. This only makes sense if you perform tons of these lookups, really.

Mike's right -- the FST will output a structure that is ready to be used for by-number retrieval of strings (or anything else) as long as the numbers are strictly monotonic (and preferably continuous). The output will be what you're suggesting, Earwin -- accumulated sums.

Dawid
Re: FST and FieldCache?
> maybe thats because we have one huge monolithic implementation

Doesn't the DocValues branch solve this?

Also, instead of trying to implement clever ways of compressing strings in the field cache, which probably won't bear fruit, I'd prefer to look at [eventually] MMap'ing (using DV) the field caches to avoid the loading and heap costs, which are significant. I'm not sure if we can easily MMap packed ints and the shared byte[], though it seems fairly doable?

On Thu, May 19, 2011 at 6:05 AM, Robert Muir rcm...@gmail.com wrote:
> or should we actually try to have different fieldcacheimpls? I see all
> these missions to refactor the thing, which always fail. maybe thats
> because we have one huge monolithic implementation.
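[Editor's sketch] Memory-mapping a fixed-width int array is straightforward in principle; a sketch of the idea (Python's mmap standing in for Java's MappedByteBuffer, and note that Lucene's packed ints are not byte-aligned like this):

```python
import mmap, os, struct, tempfile

# Values are read straight out of the mapped file on demand, so the OS page
# cache (not the Java/Python heap) holds the data and can evict it.

path = os.path.join(tempfile.mkdtemp(), "ords.bin")
values = [7, 42, 3, 99]
with open(path, "wb") as f:
    f.write(struct.pack("<%dI" % len(values), *values))  # 32-bit little-endian

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    get = lambda i: struct.unpack_from("<I", mm, 4 * i)[0]
    out = [get(i) for i in range(len(values))]
    print(out)  # [7, 42, 3, 99]
    mm.close()
```

Real packed ints use exactly ceil(log2(maxValue+1)) bits per entry, so the reader would need shift/mask arithmetic across byte boundaries rather than a fixed 4-byte stride.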
Re: FST and FieldCache?
This is more about compressing strings in the TermsIndex, I think. And the ability to use said TermsIndex directly in some cases that required FieldCache before. (Maybe FC is still needed, but it can be degraded to a docId->ord map, storing the actual strings in the TI.) This yields fat space savings when we, eg, need to both look up on a field and build facets out of it.

mmap is cool :) What I want to see is an FST-based TermsDict that is simply mmapped into memory, without building intermediate indexes, like Lucene does now. And docvalues are orthogonal to that, no?

On Thu, May 19, 2011 at 17:22, Jason Rutherglen jason.rutherg...@gmail.com wrote:
> Doesn't the DocValues branch solve this? Also, instead of trying to
> implement clever ways of compressing strings in the field cache, which
> probably won't bear fruit, I'd prefer to look at [eventually] MMap'ing
> (using DV) the field caches to avoid the loading and heap costs, which
> are significant. I'm not sure if we can easily MMap packed ints and the
> shared byte[], though it seems fairly doable?
> [...]
Re: FST and FieldCache?
> This is more about compressing strings in the TermsIndex, I think.

Ah, because they're sorted. I think if the string lookup cost degrades then it's not worth it? That's something that needs to be tested in the MMap case as well; eg, are ByteBuffers somehow slowing everything down by a factor of 10%?

On Thu, May 19, 2011 at 6:30 AM, Earwin Burrfoot ear...@gmail.com wrote:
> This is more about compressing strings in TermsIndex, I think. And
> ability to use said TermsIndex directly in some cases that required
> FieldCache before. (Maybe FC is still needed, but it can be degraded to
> a docId->ord map, storing actual strings in TI.) This yields fat space
> savings when we, eg, need to both lookup on a field and build facets
> out of it. mmap is cool :) What I want to see is an FST-based TermsDict
> that is simply mmapped into memory, without building intermediate
> indexes, like Lucene does now. And docvalues are orthogonal to that, no?
Re: FST and FieldCache?
On Thu, May 19, 2011 at 9:22 AM, Jason Rutherglen jason.rutherg...@gmail.com wrote:
>> maybe thats because we have one huge monolithic implementation
>
> Doesn't the DocValues branch solve this?

Hopefully DocValues will replace FieldCache over time; maybe some day we can deprecate/remove FieldCache. But we still have work to do there, I believe; eg we don't have comparators for all types (on the docvalues branch) yet.

> Also, instead of trying to implement clever ways of compressing strings
> in the field cache, which probably won't bear fruit, I'd prefer to look
> at [eventually] MMap'ing (using DV) the field caches to avoid the
> loading and heap costs, which are significant. I'm not sure if we can
> easily MMap packed ints and the shared byte[], though it seems fairly
> doable?

In fact, the packed ints and the byte[] packing of terms data are very much amenable/necessary for using MMap, far more so than the separate objects we had before.

I agree we should make an mmap option, though I would generally recommend against apps using mmap for these caches. We load these caches so that we'll have fast random access to potentially a great many documents during collection of one query (eg for sorting). When you mmap them you let the OS decide when to swap stuff out, which means you pick up potentially high query latency waiting for these pages to swap back in. Various other data structures in Lucene need this fast random access (norms, del docs, terms index) and that's why we put them in RAM. I do agree for all else (the large postings), MMap is great.

Of course the OS swaps out process RAM anyway, so... it's kinda moot (unless you've fixed your OS to not do this, which I always do!).

I think a more productive area of exploration (to reduce RAM usage) would be to make a StringFieldComparator that doesn't need full access to all terms data, ie, operates per segment yet only does a few ord lookups when merging the results across segments. If "few" is small enough we can just use the seek-by-ord from the terms dict to do them. This would be a huge RAM reduction because we could then sort by string fields (eg a title field) without needing all term bytes randomly accessible.

Mike
http://blog.mikemccandless.com
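[Editor's sketch] The per-segment-ords idea can be sketched as follows (illustrative Python; `seg_terms` plays the role of a hypothetical seek-by-ord terms dict, and the names are made up):

```python
import heapq

# Within a segment, hits are ranked purely by their int ords -- no term
# bytes touched. Only the few per-segment survivors are resolved to actual
# terms (one seek-by-ord each) when merging across segments.

def top_n_per_segment(hit_ords, n):
    """Per-segment ranking: int comparisons only."""
    return sorted(hit_ords)[:n]

def merge_segments(per_segment_tops, seg_terms, n):
    """Resolve ords to terms only for the per-segment survivors."""
    candidates = [(seg_terms[s][o], s, o)
                  for s, tops in enumerate(per_segment_tops)
                  for o in tops]
    return [t for t, _, _ in heapq.nsmallest(n, candidates)]

seg_terms = [["ant", "bee", "cow"], ["ape", "bat", "cat"]]
tops = [top_n_per_segment([2, 0, 1], 2), top_n_per_segment([1, 2, 0], 2)]
print(merge_segments(tops, seg_terms, 3))  # ['ant', 'ape', 'bat']
```

With top-n per segment and k segments, the merge needs at most n*k term lookups instead of keeping every term's bytes in RAM.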
Re: FST and FieldCache?
> When you mmap them you let the OS decide when to swap stuff out, which
> means you pick up potentially high query latency waiting for these pages
> to swap back in

Right, however if one is using, let's say, SSDs, and the query time is less important, then MMap'ing would be fine. Also it prevents deadly OOMs in favor of basic 'slowness' of the query. If there is no performance degradation I think MMap'ing is a great option. A common use case is an index that's far too large for a given server: it will simply not work today, whereas with MMap'ed field caches the query would complete, just extremely slowly. If the user wishes to improve performance it's easy enough to add more hardware.

On Thu, May 19, 2011 at 6:40 AM, Michael McCandless luc...@mikemccandless.com wrote:
> I agree we should make an mmap option, though I would generally
> recommend against apps using mmap for these caches. We load these caches
> so that we'll have fast random access to potentially a great many
> documents during collection of one query (eg for sorting).
> [...]
Re: FST and FieldCache?
Michael McCandless-2 wrote:
> I think a more productive area of exploration (to reduce RAM usage)
> would be to make a StringFieldComparator that doesn't need full access
> to all terms data, ie, operates per segment yet only does a few ord
> lookups when merging the results across segments. If "few" is small
> enough we can just use the seek-by-ord from the terms dict to do them.
> This would be a huge RAM reduction because we could then sort by string
> fields (eg a title field) without needing all term bytes randomly
> accessible.
>
> Mike

Yes! I don't want to put all my titles into RAM just to sort documents by them when I know Lucene has indexed the titles in sorted order on disk already. Of course the devil is in the details.

~ David Smiley
Author: https://www.packtpub.com/solr-1-4-enterprise-search-server/book
Re: FST and FieldCache?
Wow, 17 replies to my email overnight! This is clearly an interesting topic to folks.

Hi Dawid. Sadly, I won't be at Lucene Revolution next week. That's where all the cool kids will be; I'll be home and be square. I made it to O'Reilly Strata in February (a great conference) and I'll be presenting at Basis's Open Source Search Conference (government customer focused) in mid-June. I've used up my conference budget for this fiscal year.

Yes, the use case here is a unique integer reference to a String that can be looked up fairly quickly, whereas the set of all strings is in a compressed data structure that won't change after it's built. A bonus benefit would be that this integer is a sortable substitute for the String. Your observation that this integer is a perfect hash is astute.

I wonder if Lucene could store this FST on disk for the bytes in a segment instead of what it's doing now? Read-time construction would be super-fast, though for multi-segment indexes, I suppose they'd need to be merged.

I expect that this use case would be particularly useful when you know that the set of strings tends to have a great deal of prefixes in common, such as when EdgeNGramming (applications: query-complete, hierarchical faceting, prefix/tree based geospatial indexing).

~ David Smiley

Dawid Weiss wrote:
> Yep. Not only, but this is one of the use cases. Will you be at Lucene
> Revolution next week? I'll be talking about it there.
> [...]
> Correct me if my understanding is wrong: you'd like to assign a unique
> integer to each String and then retrieve it by this integer (something
> like a Map<Integer, String>)? This would be something called perfect
> hashing and this can be done on top of an automaton (fairly easily). I
> assume the data structure is immutable once constructed and does not
> change too often, right?
>
> Dawid
Re: FST and FieldCache?
On Thu, May 19, 2011 at 10:09 AM, Jason Rutherglen jason.rutherg...@gmail.com wrote:

When you mmap them you let the OS decide when to swap stuff out, which means you pick up potentially high query latency waiting for these pages to swap back in.

Right, however if one is using, let's say, SSDs, and the query time is less important, then MMap'ing would be fine. Also it prevents deadly OOMs in favor of basic 'slowness' of the query. If there is no performance degradation, I think MMap'ing is a great option. A common use case, an index that's far too large for a given server, simply will not work today, whereas with MMap'ed field caches the query would complete, just extremely slowly. If the user wishes to improve performance, it's easy enough to add more hardware.

Well, be careful: if you just don't have enough memory to accommodate all the RAM data structures Lucene needs... you're gonna be in trouble with mmap too. True, you won't hit OOMEs anymore, but instead you'll be in a swap fest and your app is nearly unusable. SSDs, while orders of magnitude faster than spinning magnets, are still orders of magnitude slower than RAM. But, yes, they obviously help substantially. It's a one-way door... you'll never go back once you've switched to SSDs.

And I do agree there are times when mmap is appropriate, eg if query latency is unimportant to you, but it's not a panacea and it comes with serious downsides. I wish I could have the opposite of mmap from Java -- the ability to pin the pages that hold important data structures.

Mike
http://blog.mikemccandless.com
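[The trade-off Mike describes, trading OOMEs for page faults, can be illustrated with the plain JDK. This is a hypothetical sketch using java.nio's FileChannel.map, not Lucene's MMapDirectory; the mapped buffer looks like an in-RAM array, but each access may fault a page in, and the OS evicts pages under memory pressure instead of the JVM throwing.]

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch: reading a column of ints through an OS-managed mapping.
public class MmapSketch {

    // Sum every int in the file through a read-only mapping. Each getInt()
    // may fault a page in; under memory pressure the OS pages data out
    // rather than the JVM hitting an OutOfMemoryError.
    static long mapAndSum(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            long sum = 0;
            while (map.remaining() >= Integer.BYTES) sum += map.getInt();
            return sum;
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("fc", ".bin");
        ByteBuffer data = ByteBuffer.allocate(1000 * Integer.BYTES);
        for (int i = 0; i < 1000; i++) data.putInt(i);
        Files.write(file, data.array());
        System.out.println(mapAndSum(file)); // 499500
        Files.delete(file);
    }
}
```

Note there is no standard Java way to do what Mike wishes for (pinning pages, i.e. mlock); that would require native code.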
Re: FST and FieldCache?
And I do agree there are times when mmap is appropriate, eg if query latency is unimportant to you, but it's not a panacea and it comes with serious downsides

Do we have a benchmark of ByteBuffer vs. byte[]'s in RAM? There are also RAM-based SSDs whose performance could be comparable with, well, RAM.

Also, with our heap-based field caches, the first sorted search requires that they be loaded into RAM. Then we don't unload them until the reader is closed? With MMap the unloading would happen automatically?
Re: FST and FieldCache?
On Thu, May 19, 2011 at 12:35 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote:

Do we have a benchmark of ByteBuffer vs. byte[]'s in RAM?

I don't know of a straight-up comparison...

There's also RAM based SSDs whose performance could be comparable with well, RAM.

True, though it's through layers of abstraction designed originally for serving files off of spinning magnets :)

Also, with our heap based field caches, the first sorted search requires that they be loaded into RAM. Then we don't unload them until the reader is closed? With MMap the unloading would happen automatically?

True, but really if the app knows it won't need that FC entry for a long time (ie, long enough to make it worth unloading/reloading) then it should really unload it. MMap would still have to write all those pages to disk... DocValues actually makes this a lot cheaper, because loading DocValues is much faster (~100X from Simon's testing) than populating the FieldCache, since the FieldCache must do all the uninverting.

Mike
Re: FST and FieldCache?
On Thu, May 19, 2011 at 20:43, Michael McCandless luc...@mikemccandless.com wrote:

Do we have a benchmark of ByteBuffer vs. byte[]'s in RAM?

I don't know of a straight-up comparison...

I did compare an MMapDir vs. RAMDir variant a couple of years ago. Searches slowed down a teeny-weeny little bit. GC times went down noticeably. For me it was a big win. Whatever Mike might say, mmap is great for latency-conscious applications : )

If someone tries to create an artificial benchmark for byte[] vs. ByteBuffer, I'd recommend going through Lucene's abstraction layer. If you simply read/write in a loop, the JIT will optimize away boundary checks for byte[] in some cases. This didn't ever happen with the *Buffer family for me.
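[With Earwin's warning about artificial benchmarks in mind, a minimal sketch of the two read paths might look like the code below. This is hypothetical; absolute timings from a bare loop like this are mostly meaningless without a real harness, precisely because of the bounds-check elimination mentioned above.]

```java
import java.nio.ByteBuffer;

// Hypothetical micro-benchmark sketch for the byte[] vs. ByteBuffer question.
// Caveat: a bare loop mostly measures whether the JIT eliminated bounds
// checks on the byte[] path (it generally won't for the *Buffer family);
// for a fair comparison use a harness like JMH, or drive both paths
// through Lucene's Directory/IndexInput abstraction.
public class BufferVsArray {
    static long sumArray(byte[] a) {
        long s = 0;
        for (byte b : a) s += b;            // bounds checks typically hoisted
        return s;
    }

    static long sumBuffer(ByteBuffer buf) {
        long s = 0;
        for (int i = 0; i < buf.limit(); i++) s += buf.get(i); // checked per call
        return s;
    }

    public static void main(String[] args) {
        byte[] data = new byte[1 << 20];
        for (int i = 0; i < data.length; i++) data[i] = (byte) i;
        ByteBuffer direct = ByteBuffer.allocateDirect(data.length);
        direct.put(data);

        long t0 = System.nanoTime();
        long s1 = sumArray(data);
        long t1 = System.nanoTime();
        long s2 = sumBuffer(direct);
        long t2 = System.nanoTime();
        System.out.printf("array=%dus buffer=%dus equal=%b%n",
                (t1 - t0) / 1000, (t2 - t1) / 1000, s1 == s2);
    }
}
```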
FST and FieldCache?
I've been pondering how to reduce the size of FieldCache entries when there are a large number of Strings. I'd like to facet on such a field with Solr but with less memory. As I understand it, FSTs are a highly compressed representation of a set of Strings (among other possibilities). The fieldCache would need to point to an FST entry (an arc?) using something small, say an integer. Is there a way to point to an FST entry with an integer, and then somehow with relative efficiency construct the String from the arcs to get there?

~ David Smiley
-
Author: https://www.packtpub.com/solr-1-4-enterprise-search-server/book