[jira] [Updated] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

2014-03-16 Thread Han Jiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Han Jiang updated LUCENE-3069:
--

Labels: gsoc2013  (was: gsoc2014)

> Lucene should have an entirely memory resident term dictionary
> --
>
> Key: LUCENE-3069
> URL: https://issues.apache.org/jira/browse/LUCENE-3069
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index, core/search
>Affects Versions: 4.0-ALPHA
>Reporter: Simon Willnauer
>Assignee: Han Jiang
>  Labels: gsoc2013
> Fix For: 4.7
>
> Attachments: LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, 
> LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, 
> LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, 
> LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, df-ttf-estimate.txt, 
> example.png
>
>
> FST based TermDictionary has been a great improvement yet it still uses a 
> delta codec file for scanning to terms. Some environments have enough memory 
> available to keep the entire FST based term dict in memory. We should add a 
> TermDictionary implementation that encodes all needed information for each 
> term into the FST (custom fst.Output) and builds a FST from the entire term 
> not just the delta.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

2013-09-04 Thread Han Jiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Han Jiang updated LUCENE-3069:
--

Attachment: LUCENE-3069.patch

Patch from last commit, and summary:

Previously, both of our term dictionaries were block-based:

* The BlockTerms dict breaks the term list into several blocks, as a linear 
  structure with skip points.

* The BlockTreeTerms dict uses a trie-like structure to decide how terms are 
  assigned to different blocks, and uses an FST index to optimize seek 
  performance.

However, neither of these term dictionaries holds all the term data in 
memory. In the worst case a lookup needs at least two seeks: one in the 
in-memory index, another in the file on disk. And we already have many 
complicated optimizations for this...

If a term dictionary is memory resident by design, the data structure 
becomes simpler (after all, we don't need to maintain extra file pointers 
for a second seek, and we don't have to choose a heuristic for how terms 
are clustered). This is why the two FST-based implementations are 
introduced.
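
To make the contrast concrete, here is a small Python sketch. It is illustrative only and is not Lucene code: the class names BlockDict and ResidentDict, the block size, and the metadata layout are all assumptions. The block-based variant keeps only the first term of each block in memory and has to scan a block (standing in for the disk read), while the resident variant answers from memory.

```python
import bisect

class BlockDict:
    """Toy block-based dictionary: in-memory index plus 'on-disk' blocks."""

    def __init__(self, sorted_terms, block_size=4):
        # Each block is a list of (term, metadata); pretend it lives on disk.
        self.blocks = [sorted_terms[i:i + block_size]
                       for i in range(0, len(sorted_terms), block_size)]
        # The in-memory index holds only the first term of every block.
        self.index = [block[0][0] for block in self.blocks]
        self.block_reads = 0

    def lookup(self, term):
        # Seek 1: binary search in the in-memory index.
        i = bisect.bisect_right(self.index, term) - 1
        if i < 0:
            return None
        # Seek 2: load the block and scan it linearly.
        self.block_reads += 1
        for t, meta in self.blocks[i]:
            if t == term:
                return meta
        return None

class ResidentDict:
    """Toy memory-resident dictionary: every term and its metadata in memory."""

    def __init__(self, sorted_terms):
        self.table = dict(sorted_terms)

    def lookup(self, term):
        return self.table.get(term)

# Hypothetical sorted term list with made-up metadata.
terms = [(t, {"doc_freq": n}) for n, t in enumerate(
    ["ant", "bee", "cat", "cow", "dog", "eel", "fox", "gnu", "hen"], start=1)]
```

Note that the block-based variant pays for the block read even when the term does not exist, which is the kind of cost the resident design avoids.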

Another big change in the code: since our term dictionaries were both 
block-based, the previous API was also limited. The postings writer 
collected term metadata, and the term dictionary told the postings writer 
the range of terms it should flush to a block. However, the encoding of term 
data should be decided by the term dictionary, since the postings writer 
doesn't always know how terms are structured in the term dictionary...
The previous API had some tricky code for this, e.g. PulsingPostingsWriter had
to use a term's ordinal within the block to decide how to write metadata, which 
is unnecessary.

To make the API between the term dict and the postings list more 'pluggable' 
and 'general', I refactored PostingsReaderBase/PostingsWriterBase. For example, 
the postings writer should provide some information to the term dictionary, such 
as how many metadata values are strictly monotonic, so that the term dictionary 
can optimize the delta-encoding itself. And since the term dictionary now fully
decides how metadata is written, it gains the ability to use 
intblock-based metadata encoding.
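
The division of labor described above can be sketched as follows. This is a Python toy, not the refactored Lucene API: the function names, the `num_monotonic` parameter, and the metadata columns are assumptions made for illustration. The postings writer only reports how many leading metadata values are monotonic; the term dictionary then chooses to delta-encode exactly those columns against the previous term.

```python
def encode_block(rows, num_monotonic):
    """Delta-encode the first `num_monotonic` columns against the previous row."""
    encoded, prev = [], [0] * num_monotonic
    for row in rows:
        head = [row[i] - prev[i] for i in range(num_monotonic)]
        # Non-monotonic columns are stored as they are.
        encoded.append(head + list(row[num_monotonic:]))
        prev = list(row[:num_monotonic])
    return encoded

def decode_block(encoded, num_monotonic):
    """Inverse of encode_block: accumulate deltas back into absolute values."""
    rows, prev = [], [0] * num_monotonic
    for enc in encoded:
        head = [prev[i] + enc[i] for i in range(num_monotonic)]
        rows.append(head + list(enc[num_monotonic:]))
        prev = head
    return rows

# Hypothetical per-term metadata: (doc file pointer, pos file pointer, flag).
# The first two columns increase strictly from term to term; the flag does not.
rows = [[100, 40, 1], [180, 95, 0], [260, 130, 1]]
encoded = encode_block(rows, num_monotonic=2)
```

Here `encoded` holds small deltas for the two file-pointer columns, which is what makes a compact variable-length or block encoding worthwhile.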

Now the two term dictionary implementations can easily be plugged into the 
current postings formats, for example:
* FST41 = FSTTermdict + Lucene41PostingsBaseFormat
* FSTOrd41 = FSTOrdTermdict + Lucene41PostingsBaseFormat
* FSTOrdPulsing41 = FSTOrdTermsdict + PulsingPostingsWrapper + Lucene41PostingsFormat

As for performance, as shown before, the two term dicts improve primary 
key lookup but still have overhead on wildcard queries (both term dicts 
hold only prefix information, and the term dictionary cannot work well with 
only that...). I'll try to hack on this later.

> Lucene should have an entirely memory resident term dictionary
> --
>
> Key: LUCENE-3069
> URL: https://issues.apache.org/jira/browse/LUCENE-3069
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index, core/search
>Affects Versions: 4.0-ALPHA
>Reporter: Simon Willnauer
>Assignee: Han Jiang
>  Labels: gsoc2013
> Fix For: 5.0, 4.5
>
> Attachments: df-ttf-estimate.txt, example.png, LUCENE-3069.patch, 
> LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, 
> LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, 
> LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, 
> LUCENE-3069.patch
>
>

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

2013-09-03 Thread Han Jiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Han Jiang updated LUCENE-3069:
--

Attachment: LUCENE-3069.patch

The uploaded patch should show all the changes against trunk: I added two 
different implementations of term dict, and refactored the PostingsBaseFormat 
to plug in non-block based term dicts.

I'm still working on the javadocs, and maybe we should rename that 'temp' 
package, like 'fstterms'?



> Lucene should have an entirely memory resident term dictionary
> --
>
> Key: LUCENE-3069
> URL: https://issues.apache.org/jira/browse/LUCENE-3069
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index, core/search
>Affects Versions: 4.0-ALPHA
>Reporter: Simon Willnauer
>Assignee: Han Jiang
>  Labels: gsoc2013
> Fix For: 5.0, 4.5
>
> Attachments: df-ttf-estimate.txt, example.png, LUCENE-3069.patch, 
> LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, 
> LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, 
> LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch
>
>



[jira] [Updated] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

2013-08-28 Thread Han Jiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Han Jiang updated LUCENE-3069:
--

Attachment: LUCENE-3069.patch

Patch showing the impersonation hack for the Pulsing format.

We cannot perfectly impersonate the old pulsing format yet: the old format split 
the metadata block into inlined bytes and wrapped bytes, so when the term dict 
reader reads the length of the metadata block, it is actually the length of the 
'inlined block'... And the 'wrapped block' won't be loaded for the wrapped PF.

However, introducing a new method in PostingsReaderBase doesn't seem to be a 
good way to handle this...

> Lucene should have an entirely memory resident term dictionary
> --
>
> Key: LUCENE-3069
> URL: https://issues.apache.org/jira/browse/LUCENE-3069
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index, core/search
>Affects Versions: 4.0-ALPHA
>Reporter: Simon Willnauer
>Assignee: Han Jiang
>  Labels: gsoc2013
> Fix For: 5.0, 4.5
>
> Attachments: df-ttf-estimate.txt, example.png, LUCENE-3069.patch, 
> LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, 
> LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, 
> LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch
>
>



[jira] [Updated] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

2013-08-23 Thread Han Jiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Han Jiang updated LUCENE-3069:
--

Attachment: LUCENE-3069.patch

Patch showing how the current codecs (Block/BlockTree + 
Lucene4X/Pulsing/Mock*) change under our API refactoring. 
TestBackwardsCompatibility still fails, and I'll work on the impersonation 
later.

> Lucene should have an entirely memory resident term dictionary
> --
>
> Key: LUCENE-3069
> URL: https://issues.apache.org/jira/browse/LUCENE-3069
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index, core/search
>Affects Versions: 4.0-ALPHA
>Reporter: Simon Willnauer
>Assignee: Han Jiang
>  Labels: gsoc2013
> Fix For: 5.0, 4.5
>
> Attachments: df-ttf-estimate.txt, example.png, LUCENE-3069.patch, 
> LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, 
> LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, 
> LUCENE-3069.patch, LUCENE-3069.patch
>
>



[jira] [Updated] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

2013-08-15 Thread Han Jiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Han Jiang updated LUCENE-3069:
--

Attachment: LUCENE-3069.patch

Patch, update BlockTerms dict so that it follows refactored API.

> Lucene should have an entirely memory resident term dictionary
> --
>
> Key: LUCENE-3069
> URL: https://issues.apache.org/jira/browse/LUCENE-3069
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index, core/search
>Affects Versions: 4.0-ALPHA
>Reporter: Simon Willnauer
>Assignee: Han Jiang
>  Labels: gsoc2013
> Fix For: 5.0, 4.5
>
> Attachments: df-ttf-estimate.txt, example.png, LUCENE-3069.patch, 
> LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, 
> LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, 
> LUCENE-3069.patch
>
>



[jira] [Updated] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

2013-08-13 Thread Han Jiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Han Jiang updated LUCENE-3069:
--

Attachment: LUCENE-3069.patch

Patch with a backward compatibility fix for Lucene41PBF (TempPostingsReader is 
actually a fork of Lucene41PostingsReader).

> Lucene should have an entirely memory resident term dictionary
> --
>
> Key: LUCENE-3069
> URL: https://issues.apache.org/jira/browse/LUCENE-3069
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index, core/search
>Affects Versions: 4.0-ALPHA
>Reporter: Simon Willnauer
>Assignee: Han Jiang
>  Labels: gsoc2013
> Fix For: 5.0, 4.5
>
> Attachments: df-ttf-estimate.txt, example.png, LUCENE-3069.patch, 
> LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, 
> LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch
>
>



[jira] [Updated] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

2013-08-02 Thread Han Jiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Han Jiang updated LUCENE-3069:
--

Attachment: LUCENE-3069.patch

Uploaded patch.

It is optimized for WildcardQuery, and I did a quick test on 1M wiki data:
{noformat}
    Task    QPS base  StdDev    QPS comp  StdDev    Pct diff
PKLookup      314.63  (1.5%)      314.64  (1.2%)    0.0% (  -2% -    2%)
  Fuzzy1       91.32  (3.7%)       92.50  (1.6%)    1.3% (  -3% -    6%)
 Respell      104.54  (3.9%)      106.97  (1.6%)    2.3% (  -2% -    8%)
  Fuzzy2       38.22  (4.1%)       39.16  (1.2%)    2.5% (  -2% -    8%)
Wildcard      109.56  (3.1%)      273.42  (5.0%)  149.6% ( 137% -  162%)
{noformat}

and TempFSTOrd vs. Lucene41, on 1M data:
{noformat}
    Task    QPS base  StdDev    QPS comp  StdDev    Pct diff
 Respell      134.85  (3.7%)      106.30  (0.6%)  -21.2% ( -24% -  -17%)
  Fuzzy2       47.78  (4.1%)       39.03  (0.9%)  -18.3% ( -22% -  -13%)
  Fuzzy1      112.02  (3.0%)       91.95  (0.6%)  -17.9% ( -20% -  -14%)
Wildcard      326.68  (3.5%)      273.41  (1.9%)  -16.3% ( -20% -  -11%)
PKLookup      194.61  (1.8%)      314.24  (0.7%)   61.5% (  57% -   65%)
{noformat}
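
For readers checking the tables: the figures are consistent with the "Pct diff" column being the relative change in mean QPS from base to comp. That reading is an assumption, and the helper below is only a sketch of it; how the interval in parentheses is derived is not reproduced here.

```python
def pct_diff(qps_base, qps_comp):
    """Relative change of comp over base, in percent."""
    return 100.0 * (qps_comp - qps_base) / qps_base

# Values copied from the two tables above.
wildcard_gain = round(pct_diff(109.56, 273.42), 1)   # Wildcard, first table
pklookup_gain = round(pct_diff(194.61, 314.24), 1)   # PKLookup, second table
respell_loss = round(pct_diff(134.85, 106.30), 1)    # Respell, second table
```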

But I'm not happy with it :(. The hack here is to spend another big 
block on storing the last byte of each term. So for the wildcard query ab*c, we 
have external information that tells us the ord of the nearest term matching *c. 
Knowing the ord, we can use an approach similar to getByOutput to jump to the 
next target term.

Previously, we had to walk the FST to the stop node to find out whether the 
last byte was 'c', so this optimization accounts for a big chunk of the gain.

However, I don't really like this patch :(. We have to increase the index size 
(521M => 530M), and the code becomes messy, since we always have to foresee the 
next arc on the current stack.
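
The last-byte trick can be sketched outside Lucene like this. It is a Python toy under stated assumptions: a sorted list stands in for the FST, a term's position in that list is its ord, and the function names are hypothetical. The extra byte array lets the scan jump straight to the next ord whose term ends in the wanted byte instead of walking every term under the prefix.

```python
import bisect

def build(terms):
    """Sort the terms (ord == position) and record the last byte of each one."""
    terms = sorted(terms)
    last_bytes = bytes(t[-1] for t in terms)   # the extra block: one byte per term
    return terms, last_bytes

def prefix_end(prefix):
    # Smallest byte string greater than every string starting with `prefix`.
    # Assumes the prefix does not end in 0xff, to keep the sketch short.
    return prefix[:-1] + bytes([prefix[-1] + 1])

def match_prefix_last(terms, last_bytes, prefix, last):
    """Match the pattern prefix*last (e.g. ab*c); `last` is a single byte."""
    lo = bisect.bisect_left(terms, prefix)
    hi = bisect.bisect_left(terms, prefix_end(prefix))
    hits, visited, ord_ = [], 0, lo
    while True:
        # Jump to the nearest ord in [ord_, hi) whose term ends with `last`.
        ord_ = last_bytes.find(last, ord_, hi)
        if ord_ < 0:
            break
        visited += 1
        # The term must be long enough for the prefix and the last byte
        # not to overlap.
        if len(terms[ord_]) > len(prefix):
            hits.append(terms[ord_])
        ord_ += 1
    return hits, visited

terms, last_bytes = build(
    [b"abc", b"abd", b"abxc", b"abyyc", b"ac", b"b", b"ab"])
hits, visited = match_prefix_last(terms, last_bytes, b"ab", b"c")
```

In this example five terms share the prefix ab, but only the three candidates ending in c are visited. The price is the one described above: one extra byte per term in the index.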

> Lucene should have an entirely memory resident term dictionary
> --
>
> Key: LUCENE-3069
> URL: https://issues.apache.org/jira/browse/LUCENE-3069
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index, core/search
>Affects Versions: 4.0-ALPHA
>Reporter: Simon Willnauer
>Assignee: Han Jiang
>  Labels: gsoc2013
> Fix For: 5.0, 4.5
>
> Attachments: df-ttf-estimate.txt, example.png, LUCENE-3069.patch, 
> LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, 
> LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch
>
>



[jira] [Updated] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

2013-07-30 Thread Han Jiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Han Jiang updated LUCENE-3069:
--

Attachment: LUCENE-3069.patch

Patch, revive IntersectTermsEnum in TempFSTOrd.

Mike, since we already have an intersect() impl, maybe we can still keep this? 
By the way, it is easy to migrate from TempFST to TempFSTOrd.

> Lucene should have an entirely memory resident term dictionary
> --
>
> Key: LUCENE-3069
> URL: https://issues.apache.org/jira/browse/LUCENE-3069
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index, core/search
>Affects Versions: 4.0-ALPHA
>Reporter: Simon Willnauer
>Assignee: Han Jiang
>  Labels: gsoc2013
> Fix For: 5.0, 4.5
>
> Attachments: df-ttf-estimate.txt, example.png, LUCENE-3069.patch, 
> LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, 
> LUCENE-3069.patch, LUCENE-3069.patch
>
>



[jira] [Updated] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

2013-07-30 Thread Han Jiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Han Jiang updated LUCENE-3069:
--

Attachment: LUCENE-3069.patch

> Lucene should have an entirely memory resident term dictionary
> --
>
> Key: LUCENE-3069
> URL: https://issues.apache.org/jira/browse/LUCENE-3069
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index, core/search
>Affects Versions: 4.0-ALPHA
>Reporter: Simon Willnauer
>Assignee: Han Jiang
>  Labels: gsoc2013
> Fix For: 5.0, 4.5
>
> Attachments: df-ttf-estimate.txt, example.png, LUCENE-3069.patch, 
> LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, 
> LUCENE-3069.patch
>
>



[jira] [Updated] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

2013-07-30 Thread Han Jiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Han Jiang updated LUCENE-3069:
--

Attachment: (was: LUCENE-5152.patch)

> Lucene should have an entirely memory resident term dictionary
> --
>
> Key: LUCENE-3069
> URL: https://issues.apache.org/jira/browse/LUCENE-3069
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index, core/search
>Affects Versions: 4.0-ALPHA
>Reporter: Simon Willnauer
>Assignee: Han Jiang
>  Labels: gsoc2013
> Fix For: 5.0, 4.5
>
> Attachments: df-ttf-estimate.txt, example.png, LUCENE-3069.patch, 
> LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, 
> LUCENE-3069.patch
>
>



[jira] [Updated] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

2013-07-30 Thread Han Jiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Han Jiang updated LUCENE-3069:
--

Attachment: LUCENE-5152.patch

The previous design put a lot of stress on decoding Outputs. 
This becomes a disaster for wildcard queries: for f*nd, 
we usually have to walk to the last character in the FST, then
find that it is not 'd' and the automaton doesn't accept the term.
In this case, TempFST actually iterates over all the results
of f*, decoding all the metadata for them...

So I'm trying another approach; the main idea is to load 
metadata & stats as lazily as possible. 
Here I use the FST as a term index and leave everything else 
in a single term block. The term index FST holds the relationship 
between <Term, Ord>, and in the term block we can maintain a skip list
for finding the related metadata & stats.

It is now a little similar to BTTR, and someday we could control how much
data to keep memory resident (e.g. keep stats in memory but metadata on 
disk; however, this should be another issue).
Another good part is that it naturally supports seek by ord (ah, 
actually I don't understand where it is used).
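
A rough model of this layout, in Python rather than the patch's Java: a plain dict stands in for the term-index FST, the term block stores one delta-encoded value per term, and a skip entry every few ords gives an absolute starting point. The class name, the skip interval, and the single file-pointer field are all assumptions made for the sketch.

```python
class OrdTermDict:
    SKIP_INTERVAL = 4  # hypothetical: one skip entry every 4 terms

    def __init__(self, term_to_fp):
        self.terms = sorted(term_to_fp)
        # Stand-in for the term-index FST: term -> ord.
        self.term_to_ord = {t: i for i, t in enumerate(self.terms)}
        # Term block: file pointers stored as deltas from the previous term.
        fps = [term_to_fp[t] for t in self.terms]
        self.deltas = [fp - prev for fp, prev in zip(fps, [0] + fps[:-1])]
        # Skip list: the absolute value just before each interval starts.
        self.skips = [([0] + fps)[i]
                      for i in range(0, len(fps), self.SKIP_INTERVAL)]
        self.decoded = 0   # counts how many deltas were actually decoded

    def ord_of(self, term):
        # Looking up the ord touches no metadata at all.
        return self.term_to_ord.get(term)

    def metadata(self, ord_):
        # Lazy decode: start at the nearest skip point, then add up deltas.
        start = ord_ - ord_ % self.SKIP_INTERVAL
        fp = self.skips[start // self.SKIP_INTERVAL]
        for i in range(start, ord_ + 1):
            fp += self.deltas[i]
            self.decoded += 1
        return fp

    def seek_by_ord(self, ord_):
        return self.terms[ord_]

d = OrdTermDict({"a": 10, "b": 25, "c": 31, "d": 40, "e": 58, "f": 60})
```

A wildcard scan that rejects most candidates would only call `ord_of`-style traversal and never `metadata`, so nothing is decoded for rejected terms.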

Tests pass; intersect is not implemented yet.
Perf on 1M wiki data, between non-intersect TempFST and TempFSTOrd:

{noformat}
    Task    QPS base  StdDev    QPS comp  StdDev    Pct diff
PKLookup      373.80  (0.0%)      320.30  (0.0%)  -14.3% ( -14% -  -14%)
  Fuzzy1       43.82  (0.0%)       47.10  (0.0%)    7.5% (   7% -    7%)
 Prefix3      399.62  (0.0%)      433.95  (0.0%)    8.6% (   8% -    8%)
  Fuzzy2       14.26  (0.0%)       15.95  (0.0%)   11.9% (  11% -   11%)
 Respell       40.69  (0.0%)       46.29  (0.0%)   13.8% (  13% -   13%)
Wildcard       83.44  (0.0%)       96.54  (0.0%)   15.7% (  15% -   15%)
{noformat}

The perf hit on PKLookup should be reasonable, since I haven't optimized the 
skip list yet.

I'll update intersect() later, and after that we'll cut over to 
PagedBytes & PackedLongBuffer.


> Lucene should have an entirely memory resident term dictionary
> --
>
> Key: LUCENE-3069
> URL: https://issues.apache.org/jira/browse/LUCENE-3069
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index, core/search
>Affects Versions: 4.0-ALPHA
>Reporter: Simon Willnauer
>Assignee: Han Jiang
>  Labels: gsoc2013
> Fix For: 5.0, 4.5
>
> Attachments: df-ttf-estimate.txt, example.png, LUCENE-3069.patch, 
> LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, 
> LUCENE-5152.patch
>
>



[jira] [Updated] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

2013-07-23 Thread Han Jiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Han Jiang updated LUCENE-3069:
--

Attachment: LUCENE-3069.patch

> Lucene should have an entirely memory resident term dictionary
> --
>
> Key: LUCENE-3069
> URL: https://issues.apache.org/jira/browse/LUCENE-3069
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index, core/search
>Affects Versions: 4.0-ALPHA
>Reporter: Simon Willnauer
>Assignee: Han Jiang
>  Labels: gsoc2013
> Fix For: 4.4
>
> Attachments: df-ttf-estimate.txt, example.png, LUCENE-3069.patch, 
> LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch
>
>



[jira] [Updated] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

2013-07-23 Thread Han Jiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Han Jiang updated LUCENE-3069:
--

Attachment: LUCENE-3069.patch

Uploaded patch: implemented IntersectEnum.next() & seekCeil().
Lots of nocommits, but it passes all tests.

The main idea is to run a DFS on the FST and backtrack as early as
possible (i.e. as soon as we see that a label is rejected by the automaton).

In this version there is one obvious perf overhead: I use a 
real stack here, which can be replaced by a Frame[] to reuse objects.

There are several aspects I haven't dug into deeply: 

* Currently, CompiledAutomaton provides a commonSuffixRef, but how
  can we make use of it in the FST?
* The DFS is somewhat of a 'goto' version, i.e. we could make the code 
  cleaner with a single while-loop, similar to a BFS. 
  However, since the FST doesn't always tell us how many arcs are leaving 
  the current arc, we have trouble dealing with this...
* When the FST is large enough, the next() operation will take a lot of time
  doing the linear arc read; maybe we should make use of 
  CompiledAutomaton.sortedTransition[] when there are many leaving arcs.
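
The DFS-with-early-backtracking idea can be illustrated with a self-contained Python toy. It is not the patch's IntersectEnum: a nested-dict trie stands in for the FST, a hand-written step function stands in for the compiled automaton of f*nd, and all names are hypothetical. The single while-loop with an explicit stack is one way to avoid the 'goto' style mentioned above.

```python
def build_trie(terms):
    """Nested-dict trie; the empty-string key marks a final node."""
    root = {}
    for term in terms:
        node = root
        for ch in term:
            node = node.setdefault(ch, {})
        node[""] = True
    return root

def step_f_star_nd(state, ch):
    """Hand-built DFA for f*nd: returns the next state, or None to reject."""
    if state == 0:
        return 1 if ch == "f" else None
    if ch == "n":
        return 2                      # last character seen is 'n'
    if ch == "d" and state == 2:
        return 3                      # last two characters are 'nd': accepting
    return 1                          # inside the '*' part

def intersect(trie, step, start, accept):
    """Return the terms accepted by the automaton, in sorted order."""
    results = []
    # Explicit stack of (trie node, automaton state, prefix so far).
    stack = [(trie, start, "")]
    while stack:
        node, state, prefix = stack.pop()
        if "" in node and state in accept:
            results.append(prefix)
        # Push children in reverse so they are popped in sorted label order.
        for label in sorted((k for k in node if k), reverse=True):
            nxt = step(state, label)
            if nxt is None:
                continue              # label rejected: backtrack right here
            stack.append((node[label], nxt, prefix + label))
    return results

trie = build_trie(["bind", "fin", "find", "finds", "found", "fund", "wind"])
matches = intersect(trie, step_f_star_nd, start=0, accept={3})
```

The subtrees under b and w are never entered, but every term under f is still walked to its end, which is the wildcard overhead discussed elsewhere in this thread.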


> Lucene should have an entirely memory resident term dictionary
> --
>
> Key: LUCENE-3069
> URL: https://issues.apache.org/jira/browse/LUCENE-3069
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index, core/search
>Affects Versions: 4.0-ALPHA
>Reporter: Simon Willnauer
>Assignee: Han Jiang
>  Labels: gsoc2013
> Fix For: 4.4
>
> Attachments: df-ttf-estimate.txt, example.png, LUCENE-3069.patch, 
> LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch
>
>



[jira] [Updated] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

2013-07-16 Thread Han Jiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Han Jiang updated LUCENE-3069:
--

Attachment: LUCENE-3069.patch

Patch: revert hashCode()

> Lucene should have an entirely memory resident term dictionary
> --
>
> Key: LUCENE-3069
> URL: https://issues.apache.org/jira/browse/LUCENE-3069
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index, core/search
>Affects Versions: 4.0-ALPHA
>Reporter: Simon Willnauer
>Assignee: Han Jiang
>  Labels: gsoc2013
> Fix For: 4.4
>
> Attachments: df-ttf-estimate.txt, example.png, LUCENE-3069.patch, 
> LUCENE-3069.patch, LUCENE-3069.patch
>
>
> FST based TermDictionary has been a great improvement yet it still uses a 
> delta codec file for scanning to terms. Some environments have enough memory 
> available to keep the entire FST based term dict in memory. We should add a 
> TermDictionary implementation that encodes all needed information for each 
> term into the FST (custom fst.Output) and builds a FST from the entire term 
> not just the delta.




[jira] [Updated] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

2013-07-15 Thread Han Jiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Han Jiang updated LUCENE-3069:
--

Attachment: LUCENE-3069.patch

Patch according to previous comments.

We still somewhat need the existence of
hashCode(), because NodeHash checks 
whether the frozen node has the same 
hash code as the uncompiled node (NodeHash:128).
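
As a rough illustration of why a value-based hashCode() matters for this kind 
of deduplication, here is a minimal sketch. The Output and Dedup classes are 
hypothetical stand-ins invented for this example; they are not the classes in 
the patch and not Lucene's NodeHash. The assumption is only that two outputs 
with the same content (one freshly built, one read back from a frozen form) 
must hash identically for the lookup to find the existing entry.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch; not the patch's output class and not Lucene's NodeHash. */
final class OutputHashSketch {

    /** Toy output holding sortable longs plus generic bytes. */
    static final class Output {
        final long[] longs;
        final byte[] bytes;

        Output(long[] longs, byte[] bytes) {
            this.longs = longs.clone();
            this.bytes = bytes.clone();
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof Output)) return false;
            Output other = (Output) o;
            return Arrays.equals(longs, other.longs) && Arrays.equals(bytes, other.bytes);
        }

        // Content-based hash: equal content gives an equal hash, whichever instance it is.
        @Override
        public int hashCode() {
            return 31 * Arrays.hashCode(longs) + Arrays.hashCode(bytes);
        }
    }

    /** Toy deduplication table: returns the id of an equal, already stored output. */
    static final class Dedup {
        private final Map<Output, Integer> frozen = new HashMap<>();

        int addOrGet(Output o) {
            Integer id = frozen.get(o);
            if (id == null) {
                id = frozen.size();
                frozen.put(o, id);
            }
            return id;
        }
    }

    public static void main(String[] args) {
        Dedup dedup = new Dedup();
        int first = dedup.addOrGet(new Output(new long[] {1, 2}, new byte[] {7}));
        int again = dedup.addOrGet(new Output(new long[] {1, 2}, new byte[] {7}));
        System.out.println(first == again); // equal content maps to the same id
    }
}
```

Without the hashCode() override, the second lookup would use the identity 
hash and the equal output would be stored twice.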

> Lucene should have an entirely memory resident term dictionary
> --
>
> Key: LUCENE-3069
> URL: https://issues.apache.org/jira/browse/LUCENE-3069
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index, core/search
>Affects Versions: 4.0-ALPHA
>Reporter: Simon Willnauer
>Assignee: Han Jiang
>  Labels: gsoc2013
> Fix For: 4.4
>
> Attachments: df-ttf-estimate.txt, example.png, LUCENE-3069.patch, 
> LUCENE-3069.patch
>
>
> FST based TermDictionary has been a great improvement yet it still uses a 
> delta codec file for scanning to terms. Some environments have enough memory 
> available to keep the entire FST based term dict in memory. We should add a 
> TermDictionary implementation that encodes all needed information for each 
> term into the FST (custom fst.Output) and builds a FST from the entire term 
> not just the delta.




[jira] [Updated] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

2013-07-13 Thread Han Jiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Han Jiang updated LUCENE-3069:
--

Attachment: df-ttf-estimate.txt

Uploaded detailed data for wikimediumall.

Oh, sorry, there was an error when I 
calculated the index size for the df==0 trick: 
it should be 105MB instead of 70MB.

But the real test result is still beyond 
the estimate (weird...). The df==0 trick
gains similar compression.

Index sizes are below:
{noformat}
v0:                   13195304
v1 = v0 + flag byte:  12847172
v2 = v1 + steal bit:  12770700
v3 = v1 + zero df:    12780884
{noformat}
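
To put the sizes above in relative terms, the saving of each variant against 
v0 can be computed as (v0 - vN) / v0. The small helper below is only a 
worked-arithmetic sketch; the class and method names are made up for this 
example, and the numbers are copied from the listing above.

```java
/** Worked arithmetic for the index sizes listed above; names are illustrative only. */
final class IndexSizeSavings {

    /** Percentage by which {@code variant} is smaller than {@code baseline}. */
    static double savingPercent(long baseline, long variant) {
        return 100.0 * (baseline - variant) / baseline;
    }

    public static void main(String[] args) {
        long v0 = 13195304L; // baseline
        long v1 = 12847172L; // v0 + flag byte
        long v2 = 12770700L; // v1 + steal bit
        long v3 = 12780884L; // v1 + zero df

        System.out.printf("v1 vs v0: %.2f%%%n", savingPercent(v0, v1));
        System.out.printf("v2 vs v0: %.2f%%%n", savingPercent(v0, v2));
        System.out.printf("v3 vs v0: %.2f%%%n", savingPercent(v0, v3));
    }
}
```

All three variants land within roughly 2.6% to 3.2% of the baseline, which is 
why the steal-bit and zero-df tricks look so close to each other.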

Another thing that surprised me is that, with the same code/conf, 
luceneutil creates indexes of different sizes. I tested 
the df==0 trick several times on wikimedium1m, and the 
index size varies from 514M to 522M... Does multi-threading affect
this much?


> Lucene should have an entirely memory resident term dictionary
> --
>
> Key: LUCENE-3069
> URL: https://issues.apache.org/jira/browse/LUCENE-3069
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index, core/search
>Affects Versions: 4.0-ALPHA
>Reporter: Simon Willnauer
>Assignee: Han Jiang
>  Labels: gsoc2013
> Fix For: 4.4
>
> Attachments: df-ttf-estimate.txt, example.png, LUCENE-3069.patch
>
>
> FST based TermDictionary has been a great improvement yet it still uses a 
> delta codec file for scanning to terms. Some environments have enough memory 
> available to keep the entire FST based term dict in memory. We should add a 
> TermDictionary implementation that encodes all needed information for each 
> term into the FST (custom fst.Output) and builds a FST from the entire term 
> not just the delta.




[jira] [Updated] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

2013-07-11 Thread Han Jiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Han Jiang updated LUCENE-3069:
--

Attachment: example.png
LUCENE-3069.patch

Uploaded patch; it is the main part of the changes I committed to branch3069.

The picture shows the current impl of outputs (it is taken from one field in 
wikimedium5k).

* long[] (sortable metadata)
* byte[] (unsortable, generic metadata)
* df, ttf (term stats)

A single flag byte is used to indicate whether/which fields the current output 
maintains; for PBF with a short byte[], this should be enough. Also, for 
long-tail terms, the totalTermFreq can safely be inlined into docFreq (for the 
body field in wikimedium1m, 85.8% of terms have df == ttf).
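
The flag-byte idea and the df == ttf shortcut can be sketched roughly as 
follows. This is a simplified illustration, not the code on branch3069: the 
class name, the flag constants and the fixed-width int/long layout are all 
assumptions made for the example (a real encoder would more likely use 
variable-length integers and carry further flag bits for the long[] and 
byte[] parts).

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

/** Hypothetical sketch of a flag byte guarding term stats; not the actual patch code. */
final class TermStatsSketch {
    static final int FLAG_HAS_STATS = 0x01;     // docFreq (and maybe ttf) follow the flag
    static final int FLAG_TTF_EQUALS_DF = 0x02; // totalTermFreq omitted because it equals docFreq

    /** Encodes [flag][docFreq][ttf - docFreq, only when ttf != docFreq]. */
    static byte[] encode(int docFreq, long totalTermFreq) {
        ByteBuffer buf = ByteBuffer.allocate(1 + Integer.BYTES + Long.BYTES);
        int flag = FLAG_HAS_STATS;
        if (totalTermFreq == docFreq) {
            flag |= FLAG_TTF_EQUALS_DF;
        }
        buf.put((byte) flag);
        buf.putInt(docFreq);
        if ((flag & FLAG_TTF_EQUALS_DF) == 0) {
            buf.putLong(totalTermFreq - docFreq);
        }
        // Keep only the bytes actually written.
        return Arrays.copyOf(buf.array(), buf.position());
    }

    /** Returns {docFreq, totalTermFreq}. */
    static long[] decode(byte[] encoded) {
        ByteBuffer buf = ByteBuffer.wrap(encoded);
        int flag = buf.get();
        if ((flag & FLAG_HAS_STATS) == 0) {
            throw new IllegalArgumentException("no term stats present");
        }
        int docFreq = buf.getInt();
        long totalTermFreq = docFreq;
        if ((flag & FLAG_TTF_EQUALS_DF) == 0) {
            totalTermFreq += buf.getLong();
        }
        return new long[] {docFreq, totalTermFreq};
    }

    public static void main(String[] args) {
        System.out.println(encode(3, 3).length);  // 5: flag + docFreq only
        System.out.println(encode(3, 10).length); // 13: flag + docFreq + ttf delta
    }
}
```

With most long-tail terms having df == ttf, the second field is skipped for 
the majority of entries, which is where the saving comes from.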


Since TermsEnum is totally based on FSTEnum, the performance of the term dict 
should be similar to MemoryPF. However, for PK tasks, we have to pull docsEnum 
from MMap, so this hurts.


Following is the performance comparison:

{noformat}
pure TempFST vs. Lucene41 + Memory(on idField), on wikimediumall

            Task  QPS base    StdDev  QPS comp    StdDev  Pct diff
         Respell     48.13    (4.4%)     15.38    (1.0%)   -68.0% ( -70% -  -65%)
          Fuzzy2     51.30    (5.3%)     17.47    (1.3%)   -65.9% ( -68% -  -62%)
          Fuzzy1     52.24    (4.0%)     18.50    (1.2%)   -64.6% ( -67% -  -61%)
        Wildcard      9.31    (1.7%)      6.16    (2.2%)   -33.8% ( -37% -  -30%)
         Prefix3     23.25    (1.8%)     19.00    (2.2%)   -18.3% ( -21% -  -14%)
        PKLookup    244.92    (3.6%)    225.42    (2.3%)    -8.0% ( -13% -   -2%)
         LowTerm    295.88    (5.5%)    293.27    (4.8%)    -0.9% ( -10% -    9%)
      HighPhrase     13.62    (6.5%)     13.54    (7.4%)    -0.6% ( -13% -   14%)
         MedTerm     99.51    (7.8%)     99.19    (7.7%)    -0.3% ( -14% -   16%)
       MedPhrase    154.63    (9.4%)    154.38   (10.1%)    -0.2% ( -17% -   21%)
        HighTerm     28.25   (10.7%)     28.25   (10.0%)    -0.0% ( -18% -   23%)
      OrHighHigh     16.83   (13.3%)     16.86   (13.1%)     0.2% ( -23% -   30%)
HighSloppyPhrase      9.02    (4.4%)      9.03    (4.5%)     0.2% (  -8% -    9%)
       LowPhrase      6.26    (3.4%)      6.27    (4.1%)     0.2% (  -7% -    8%)
       OrHighMed     13.73   (13.2%)     13.77   (12.8%)     0.3% ( -22% -   30%)
       OrHighLow     25.65   (13.2%)     25.73   (13.0%)     0.3% ( -22% -   30%)
 MedSloppyPhrase      6.63    (2.7%)      6.66    (2.7%)     0.5% (  -4% -    6%)
      AndHighMed     42.77    (1.8%)     43.13    (1.5%)     0.8% (  -2% -    4%)
 LowSloppyPhrase     32.68    (3.0%)     32.96    (2.8%)     0.8% (  -4% -    6%)
     AndHighHigh     22.90    (1.2%)     23.18    (0.7%)     1.2% (   0% -    3%)
     LowSpanNear     29.30    (2.0%)     29.83    (2.2%)     1.8% (  -2% -    6%)
     MedSpanNear      8.39    (2.7%)      8.56    (2.9%)     2.0% (  -3% -    7%)
          IntNRQ      3.12    (1.9%)      3.18    (6.7%)     2.1% (  -6% -   10%)
      AndHighLow    507.01    (2.4%)    522.10    (2.8%)     3.0% (  -2% -    8%)
    HighSpanNear      5.43    (1.8%)      5.60    (2.6%)     3.1% (  -1% -    7%)
{noformat}


{noformat}
pure TempFST vs. pure Lucene41, on wikimediumall

            Task  QPS base    StdDev  QPS comp    StdDev  Pct diff
         Respell     49.24    (2.7%)     15.51    (1.0%)   -68.5% ( -70% -  -66%)
          Fuzzy2     52.01    (4.8%)     17.61    (1.4%)   -66.1% ( -68% -  -63%)
          Fuzzy1     53.00    (4.0%)     18.62    (1.3%)   -64.9% ( -67% -  -62%)
        Wildcard      9.37    (1.3%)      6.15    (2.1%)   -34.4% ( -37% -  -31%)
         Prefix3     23.36    (0.8%)     18.96    (2.1%)   -18.8% ( -21% -  -16%)
       MedPhrase    155.86    (9.8%)    152.34    (9.7%)    -2.3% ( -19% -   19%)
       LowPhrase      6.33    (3.7%)      6.23    (4.0%)    -1.6% (  -8% -    6%)
      HighPhrase     13.68    (7.2%)     13.49    (6.8%)    -1.4% ( -14% -   13%)
       OrHighMed     13.78   (13.0%)     13.68   (12.7%)    -0.8% ( -23% -   28%)
HighSloppyPhrase      9.14    (5.2%)      9.07    (3.7%)    -0.7% (  -9% -    8%)
      OrHighHigh     16.87   (13.3%)     16.76   (12.9%)    -0.6% ( -23% -   29%)
       OrHighLow     25.71   (13.1%)     25.58   (12.8%)    -0.5% ( -23% -   29%)

[jira] [Updated] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

2013-05-09 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-3069:
--

Fix Version/s: (was: 4.3)
   4.4

> Lucene should have an entirely memory resident term dictionary
> --
>
> Key: LUCENE-3069
> URL: https://issues.apache.org/jira/browse/LUCENE-3069
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index, core/search
>Affects Versions: 4.0-ALPHA
>Reporter: Simon Willnauer
>Assignee: Han Jiang
>  Labels: gsoc2013
> Fix For: 4.4
>
>
> FST based TermDictionary has been a great improvement yet it still uses a 
> delta codec file for scanning to terms. Some environments have enough memory 
> available to keep the entire FST based term dict in memory. We should add a 
> TermDictionary implementation that encodes all needed information for each 
> term into the FST (custom fst.Output) and builds a FST from the entire term 
> not just the delta.




[jira] [Updated] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

2013-01-14 Thread Steve Rowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Rowe updated LUCENE-3069:
---

Fix Version/s: (was: 4.1)
   4.2

> Lucene should have an entirely memory resident term dictionary
> --
>
> Key: LUCENE-3069
> URL: https://issues.apache.org/jira/browse/LUCENE-3069
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index, core/search
>Affects Versions: 4.0-ALPHA
>Reporter: Simon Willnauer
>Assignee: Simon Willnauer
>  Labels: gsoc2012, lucene-gsoc-12
> Fix For: 4.2
>
>
> FST based TermDictionary has been a great improvement yet it still uses a 
> delta codec file for scanning to terms. Some environments have enough memory 
> available to keep the entire FST based term dict in memory. We should add a 
> TermDictionary implementation that encodes all needed information for each 
> term into the FST (custom fst.Output) and builds a FST from the entire term 
> not just the delta.




[jira] [Updated] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

2012-03-20 Thread Michael McCandless (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-3069:
---

Labels: gsoc2012 lucene-gsoc-12  (was: )

> Lucene should have an entirely memory resident term dictionary
> --
>
> Key: LUCENE-3069
> URL: https://issues.apache.org/jira/browse/LUCENE-3069
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: core/index, core/search
>Affects Versions: 4.0
>Reporter: Simon Willnauer
>Assignee: Simon Willnauer
>  Labels: gsoc2012, lucene-gsoc-12
> Fix For: 4.0
>
>
> FST based TermDictionary has been a great improvement yet it still uses a 
> delta codec file for scanning to terms. Some environments have enough memory 
> available to keep the entire FST based term dict in memory. We should add a 
> TermDictionary implementation that encodes all needed information for each 
> term into the FST (custom fst.Output) and builds a FST from the entire term 
> not just the delta.




[jira] [Updated] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

2011-05-04 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer updated LUCENE-3069:


Summary: Lucene should have an entirely memory resident term dictionary  
(was: Lucene should be able to have a entirely memory resident term dictionary)

> Lucene should have an entirely memory resident term dictionary
> --
>
> Key: LUCENE-3069
> URL: https://issues.apache.org/jira/browse/LUCENE-3069
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index, Search
>Affects Versions: 4.0
>Reporter: Simon Willnauer
>Assignee: Simon Willnauer
> Fix For: 4.0
>
>
> FST based TermDictionary has been a great improvement yet it still uses a 
> delta codec file for scanning to terms. Some environments have enough memory 
> available to keep the entire FST based term dict in memory. We should add a 
> TermDictionary implementation that encodes all needed information for each 
> term into the FST (custom fst.Output) and builds a FST from the entire term 
> not just the delta.
