[jira] Resolved: (LUCENE-1411) Enable IndexWriter to open an arbitrary commit point

2008-10-20 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-1411.


Resolution: Fixed

 Enable IndexWriter to open an arbitrary commit point
 -----------------------------------------------------

 Key: LUCENE-1411
 URL: https://issues.apache.org/jira/browse/LUCENE-1411
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.4
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1411.patch


 With a 2-phase commit involving multiple resources, each resource
 first does its prepareCommit and then if all are successful they each
 commit.  If an exception or timeout/power loss is hit in any of the
 resources during prepareCommit or commit, all of the resources must
 then rollback.
 But, because IndexWriter always opens the most recent commit, getting
 Lucene to rollback after commit() has been called is not easy, unless
 you make Lucene the last resource to commit.  A simple workaround is
 to simply remove the segments_N files of the newer commits but that's
 sort of a hassle.
 To fix this, we just need to add a ctor to IndexWriter that takes an
 IndexCommit.  We recently added this for IndexReader (LUCENE-1311) as
 well.  This ctor is definitely an expert method, and only makes
 sense if you have a custom DeletionPolicy that preserves more than
 just the most recent commit.
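For illustration, a minimal sketch of the rollback flow this enables,
assuming a keep-everything deletion policy; the exact signatures here
(listCommits, the new IndexWriter ctor) are my guesses, not necessarily
the committed API:

{code}
// Keep every commit so older commit points stay available for rollback
// (a real policy would prune old commits eventually).
class KeepAllDeletionPolicy implements IndexDeletionPolicy {
  public void onInit(List commits) {}    // delete nothing
  public void onCommit(List commits) {}  // delete nothing
}

// Later, roll back by opening IndexWriter directly on an older commit:
List commits = IndexReader.listCommits(dir);
IndexCommit older = (IndexCommit) commits.get(0);  // pick the commit to revert to
IndexWriter writer = new IndexWriter(dir, analyzer,
    new KeepAllDeletionPolicy(), IndexWriter.MaxFieldLength.UNLIMITED, older);
{code}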

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Resolved: (LUCENE-1382) Allow storing user data when IndexWriter.commit() is called

2008-10-20 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-1382.


   Resolution: Fixed
Fix Version/s: 2.9

 Allow storing user data when IndexWriter.commit() is called
 ------------------------------------------------------------

 Key: LUCENE-1382
 URL: https://issues.apache.org/jira/browse/LUCENE-1382
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1382.patch


 Spinoff from here:
 http://www.mail-archive.com/[EMAIL PROTECTED]/msg22303.html
 The idea is to allow optionally passing an opaque String commitUserData to 
 the IndexWriter.commit method.  This String would be stored in the segments_N 
 file, and would be retrievable by an IndexReader.  Applications could then 
 use this to assign meaning to each commit.
 It would be nice to get this done for 2.4, but I don't think we should hold 
 the release for it.
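For illustration, a sketch of the intended usage; the issue describes an
opaque String, though the committed API may differ (e.g. taking a Map):

{code}
// At commit time, attach application-defined metadata to the commit point;
// it is written into the segments_N file:
writer.commit("checkpoint-42");

// Later, a reader of the same index can retrieve it:
String userData = reader.getCommitUserData();
{code}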

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Created: (LUCENE-1426) Next steps towards flexible indexing

2008-10-20 Thread Michael McCandless (JIRA)
Next steps towards flexible indexing
------------------------------------

 Key: LUCENE-1426
 URL: https://issues.apache.org/jira/browse/LUCENE-1426
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9


In working on LUCENE-1410 (PFOR compression) I tried to prototype
switching the postings files to use PFOR instead of vInts for
encoding.

But it quickly became difficult.  EG we currently mux the skip data
into the .frq file, which messes up the int blocks.  We inline
payloads with positions which would also mess up the int blocks.
Skipping offsets and TermInfo offsets hardwire the file pointers of
frq & prox files, yet I need to change these to block + offset, etc.

Separately this thread also started up, on how to customize how Lucene
stores positional information in the index:

  http://www.gossamer-threads.com/lists/lucene/java-user/66264

So I decided to make a bit more progress towards flexible indexing
by first modularizing/isolating the classes that actually write the
index format.  The idea is to capture the logic of each (terms, freq,
positions/payloads) into separate interfaces and switch the flushing
of a new segment as well as writing the segment during merging to use
the same APIs.
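For a concrete picture, an invented sketch of what such writer-side
interfaces could look like (names are illustrative, not from the patch):

{code}
// One consumer per layer of the postings data:
abstract class TermsConsumer {
  // called once per unique term; returns the consumer for its docs
  abstract DocsConsumer startTerm(char[] text, int len) throws IOException;
  abstract void finishTerm() throws IOException;
}
abstract class DocsConsumer {
  // called once per document containing the term
  abstract PositionsConsumer addDoc(int docID, int freq) throws IOException;
}
abstract class PositionsConsumer {
  // called freq times per doc; payload may be null
  abstract void addPosition(int position, byte[] payload) throws IOException;
}
{code}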


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-1426) Next steps towards flexible indexing

2008-10-20 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1426:
---

Attachment: LUCENE-1426.patch

Attached patch.  I think it's ready to commit... I'll wait a few days.

This factors the writing of postings into separate Format* classes.
The approach I took is similar to what I did for DocumentsWriter,
where there is a hierarchical consumer interface (abstract class) for
each of fields, terms, docs, and positions writing.  Then there's a
corresponding set of concrete classes (the codec chain) that write
today's index format.  There is no change to the index format.

Here are the details:

  * This only applies to postings (not stored fields, term vectors,
norms, field infos)

  * Both SegmentMerger & FreqProxTermsWriter now use the same codec
API to write postings.  I think this is a big step forward: we now
have a single set of classes that ever write the postings.

  * You can't yet customize this codec chain; we can add that at some
point.  It's all package private.

  * I don't yet allow the codec to override SegmentInfo.files(); at
some point (when I first try to make a codec that uses different
files) I will add this.

I ran a quick performance test, indexing wikipedia, and found the
performance cost of this to be negligible.

The next step, which is trickier, is to modularize/genericize the
classes that read from the index, and then refactor
SegmentTerm{Enum,Docs,Positions} to use that codec API.

Then, finally, I want to make a codec that uses PFOR to encode
postings.
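An illustrative sketch (invented names) of the shared write loop -- the
calls are the same whether the postings come from RAM at flush time or
from segment readers at merge time:

{code}
DocsConsumer docs = terms.startTerm(termText, termLen);
for (int d = 0; d < docCount; d++) {
  PositionsConsumer positions = docs.addDoc(docIDs[d], freqs[d]);
  for (int p = 0; p < freqs[d]; p++)
    positions.addPosition(positionBuffer[d][p], null);  // null: no payload
}
terms.finishTerm();
{code}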

 Next steps towards flexible indexing
 ------------------------------------

 Key: LUCENE-1426
 URL: https://issues.apache.org/jira/browse/LUCENE-1426
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1426.patch


 In working on LUCENE-1410 (PFOR compression) I tried to prototype
 switching the postings files to use PFOR instead of vInts for
 encoding.
 But it quickly became difficult.  EG we currently mux the skip data
 into the .frq file, which messes up the int blocks.  We inline
 payloads with positions which would also mess up the int blocks.
 Skipping offsets and TermInfo offsets hardwire the file pointers of
 frq & prox files, yet I need to change these to block + offset, etc.
 Separately this thread also started up, on how to customize how Lucene
 stores positional information in the index:
   http://www.gossamer-threads.com/lists/lucene/java-user/66264
 So I decided to make a bit more progress towards flexible indexing
 by first modularizing/isolating the classes that actually write the
 index format.  The idea is to capture the logic of each (terms, freq,
 positions/payloads) into separate interfaces and switch the flushing
 of a new segment as well as writing the segment during merging to use
 the same APIs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: TokenStream and Token APIs

2008-10-20 Thread Grant Ingersoll


On Oct 19, 2008, at 7:08 PM, Michael Busch wrote:


Grant Ingersoll wrote:


On Oct 19, 2008, at 12:56 AM, Mark Miller wrote:


Grant Ingersoll wrote:


Bear with me, b/c I'm not sure I'm following, but looking at
https://issues.apache.org/jira/browse/LUCENE-1422, I see at least 5
different implemented Attributes.


So, let's say I add 5 more attributes and now have a total of
10 attributes. Are you saying that I then would have,  
potentially, 10 different variables that all point to the token  
as in the code snippet above where the casting takes place? Or  
would I just create a single Super attribute that folds in all  
of my new attributes, plus any other existing ones? Or, maybe,  
what I would do is create the 5 new attributes and then 1 new  
attribute that extends all 10, thus allowing me to use them  
individually, but saving me from having to do a whole ton of  
casting in my Consumer.
Potentially one consumer doing 10 things, but not likely right? I  
mean, things will stay logical as they are now, and rather than a  
super consumer doing everything, we will still have a chain of  
consumers each doing its own piece. So more likely, maybe  
something comes along every so often (another 5, over *much* time,  
say) and each time we add a Consumer that uses one or two  
TokenStream types. And then it's just an implementation detail on
whether you make a composite TokenStream - if you have added 10
new attributes and see fit to make one consumer use them all,
sure, make a composite, super type, but in my mind, the way it's
done in the example code is clearer/cleaner for a handful of
TokenStream types. And even if you do make the composite, super
type, it's likely to just be a sugar wrapper anyway - the
implementation for say, payload and positions, should probably be  
maintained in their own classes anyway.


Well, there are 5 different attributes already, all of which are  
commonly used.  Seems weird to have to cast the same var 5  
different ways.  Definitely agree that one would likely deal with  
this by wrapping, but then you end up either needing to extend your  
wrapper or add new wrappers...


Well yes, there are 5 attributes, but in neither of the core
tokenstreams and -filters that I changed in my patch did I have to  
use more than two or three of those. Currently the only attributes  
that are really used are PositionIncrementAttribute and  
PayloadAttribute. And the OffsetAttribute when TermVectors are  
turned on.


Even in the indexing chain currently we don't have a single consumer  
that needs all attributes. The FreqProxWriter needs positions and  
payloads, the TermVectorsWriter needs positions and offsets.



I have an application that uses all the attributes of a Token, or at  
least, almost all of them.  There are many uses for Lucene's analysis  
code that have nothing to do with indexing, Consumers or even Lucene.





Also, you don't have to cast the same variable multiple times. In  
the current patch you would call e. g.  
token.getAttribute(PayloadAttribute.class) and keep a reference to  
it in the consumer or filter.


IMO even calling getAttribute() 5 times or so and storing the  
references wouldn't be so bad. And if you really don't like it you  
could make a wrapper as you said. You also mentioned the  
disadvantages of the wrapper, e. g. that you would have to extend it  
to add new attributes. But then, isn't that the same disadvantage  
the current Token API has?
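(For illustration, the reference-keeping pattern being described, using
the getAttribute() API from the LUCENE-1422 patch; exact names may
differ:)

   // Cast once, up front, and keep the references:
   PayloadAttribute payloadAtt =
       (PayloadAttribute) token.getAttribute(PayloadAttribute.class);
   PositionIncrementAttribute posIncrAtt =
       (PositionIncrementAttribute) token.getAttribute(PositionIncrementAttribute.class);
   // ...then read them per token, with no further casting.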


True.  I didn't say the idea was bad, in fact I mostly like it, I was  
just saying I'd like to explore how it would work in practice and the  
main thing that struck me was all the casting or all the references.   
Since it's likely that you only deal with a Token one at a time,  
you're right, it's probably not a big deal other than the code looks  
funny, IMO.





You could even use the new API in exactly the same way as the old one.
Just create a subclass of Token that has all members you need and  
don't add any attributes.


So I think the new API adds more flexibility, and can still be used
in the same way as the old one.  I however think the
recommended best practice should be to use the new attributes, for  
reusability of consumers that only need certain attributes.


Perhaps it would be useful for Lucene to offer exactly one subclass of  
Token that we guarantee will always have all known Attributes (i.e.  
the ones Lucene provides)  available to it for casting purposes.





However, please let me know if you have any concrete recommendations  
about changing the API in LUCENE-1422.


I thought those concerns were pretty concrete...  :-)


There might be better ones than the APIs I came up with.


I think the APIs in the 2nd patch look pretty reasonable.


-Grant

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1406) new Arabic Analyzer (Apache license)

2008-10-20 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12641075#action_12641075
 ] 

Grant Ingersoll commented on LUCENE-1406:
-

Committed revision 706342.

I made some small changes to reuse Tokens, added some comments to the
stopwords list, and extended WordListLoader to accommodate this.

Thanks Robert!

 new Arabic Analyzer (Apache license)
 ------------------------------------

 Key: LUCENE-1406
 URL: https://issues.apache.org/jira/browse/LUCENE-1406
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Analysis
Reporter: Robert Muir
Assignee: Grant Ingersoll
Priority: Minor
 Attachments: LUCENE-1406.patch


 I've noticed there is no Arabic analyzer for Lucene, most likely because Tim 
 Buckwalter's morphological dictionary is GPL.
 However, it is not necessary to have a full morphological analysis engine for 
 a quality arabic search. 
 This implementation uses the light-8s algorithm presented in the 
 following paper: http://ciir.cs.umass.edu/pubfiles/ir-249.pdf
 As you can see from the paper, improvement via this method over searching 
 surface forms (as lucene currently does) is significant, with almost 100% 
 improvement in average precision.
 While I personally don't think all the choices were the best, and some easy 
 improvements are still possible, the major motivation for implementing it 
 exactly the way it is presented in the paper is that the algorithm is 
 TREC-tested, so the precision/recall improvements to lucene are already 
 documented.
 For a stopword list, I used a list present at 
 http://members.unine.ch/jacques.savoy/clef/index.html simply because the 
 creator of this list documents the data as BSD-licensed.
 This implementation (Analyzer) consists of the above-mentioned stopword list plus 
 two filters:
  ArabicNormalizationFilter: performs orthographic normalization (such as 
 hamza seated on alif, alif maksura, teh marbuta, removal of harakat, tatweel, 
 etc)
  ArabicStemFilter: performs arabic light stemming
 Both filters operate directly on termbuffer for maximum performance. There is 
 no object creation in this Analyzer.
 There are no external dependencies. I've indexed about half a billion words 
 of arabic text and tested against that.
 If there are any issues with this implementation I am willing to fix them. I 
 use lucene on a daily basis and would like to give something back. Thanks.
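A usage sketch of the chain the description implies; the tokenizer choice
and constructor details here are illustrative, not from the patch:

{code}
// stopword removal -> orthographic normalization -> light stemming
TokenStream ts = new WhitespaceTokenizer(reader);
ts = new StopFilter(ts, arabicStopSet);
ts = new ArabicNormalizationFilter(ts);
ts = new ArabicStemFilter(ts);
{code}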

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1426) Next steps towards flexible indexing

2008-10-20 Thread Paul Elschot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12641121#action_12641121
 ] 

Paul Elschot commented on LUCENE-1426:
--

bq. We inline payloads with positions which would also mess up the int blocks.

Which begs the question whether we should also allow compression of these
payloads. I think we should do that, because normally only one or two bytes
will be used as payload per position.
Thinking about this: position+payload actually looks a lot like docId+freq.
Could that be used to simplify future index formats for inverted terms?
Btw. allowing a payload to accompany the field norms would allow storing a
kind of dictionary for the position payloads. This could help to keep the
position payloads small so they would compress nicely.

bq. Both SegmentMerger & FreqProxTermsWriter now use the same codec API to 
write postings.

That is indeed a big step.

bq. It's all package private.

Good for now, making it public might actually reduce flexibility for new index 
formats.



 Next steps towards flexible indexing
 ------------------------------------

 Key: LUCENE-1426
 URL: https://issues.apache.org/jira/browse/LUCENE-1426
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1426.patch


 In working on LUCENE-1410 (PFOR compression) I tried to prototype
 switching the postings files to use PFOR instead of vInts for
 encoding.
 But it quickly became difficult.  EG we currently mux the skip data
 into the .frq file, which messes up the int blocks.  We inline
 payloads with positions which would also mess up the int blocks.
 Skipping offsets and TermInfo offsets hardwire the file pointers of
 frq & prox files, yet I need to change these to block + offset, etc.
 Separately this thread also started up, on how to customize how Lucene
 stores positional information in the index:
   http://www.gossamer-threads.com/lists/lucene/java-user/66264
 So I decided to make a bit more progress towards flexible indexing
 by first modularizing/isolating the classes that actually write the
 index format.  The idea is to capture the logic of each (terms, freq,
 positions/payloads) into separate interfaces and switch the flushing
 of a new segment as well as writing the segment during merging to use
 the same APIs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1426) Next steps towards flexible indexing

2008-10-20 Thread Paul Elschot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12641125#action_12641125
 ] 

Paul Elschot commented on LUCENE-1426:
--

bq. Skipping offsets and TermInfo offsets hardwire the file pointers of frq &
prox files, yet I need to change these to block + offset, etc.

Does the offset imply that there is also a need for random access into each
block? For such blocks PFOR patching might better be avoided. Even with
patching, random access is possible, but it is not available yet at
LUCENE-1410.


 Next steps towards flexible indexing
 ------------------------------------

 Key: LUCENE-1426
 URL: https://issues.apache.org/jira/browse/LUCENE-1426
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1426.patch


 In working on LUCENE-1410 (PFOR compression) I tried to prototype
 switching the postings files to use PFOR instead of vInts for
 encoding.
 But it quickly became difficult.  EG we currently mux the skip data
 into the .frq file, which messes up the int blocks.  We inline
 payloads with positions which would also mess up the int blocks.
 Skipping offsets and TermInfo offsets hardwire the file pointers of
 frq & prox files, yet I need to change these to block + offset, etc.
 Separately this thread also started up, on how to customize how Lucene
 stores positional information in the index:
   http://www.gossamer-threads.com/lists/lucene/java-user/66264
 So I decided to make a bit more progress towards flexible indexing
 by first modularizing/isolating the classes that actually write the
 index format.  The idea is to capture the logic of each (terms, freq,
 positions/payloads) into separate interfaces and switch the flushing
 of a new segment as well as writing the segment during merging to use
 the same APIs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1426) Next steps towards flexible indexing

2008-10-20 Thread Eks Dev (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12641128#action_12641128
 ] 

Eks Dev commented on LUCENE-1426:
-

Just a few random thoughts on this topic

- I am sure I read somewhere in these pdfs that were floating around that it 
would make sense to use VInts for very short postings and PFOR for the rest. I 
just do not remember the rationale behind it.

- During the omitTf() discussion, we came up with a cool idea: actually inline 
very short postings into the term dict instead of storing an offset. This way 
we spare one seek per term in many cases, as well as some space for storing the 
offset. I do not know if this is a problem, but it sounds reasonable. With a 
standard Zipfian distribution, a lot of postings should get inlined. Use cases 
where we have query expansion on many terms (think spell checker, synonyms ...) 
should benefit from that heavily. These postings are small but there are a lot 
of them, so it adds up... seek is deadly :)
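A sketch of the write-time decision (all names invented, just to illustrate 
the trade-off):

{code}
if (docFreq <= INLINE_THRESHOLD) {
  // tiny posting list: write it into the term dict entry itself,
  // so reading it later costs no extra seek
  termDictOut.writeVInt(docFreq);
  writePostingsInline(termDictOut);  // hypothetical helper
} else {
  // normal path: store a pointer into the .frq file
  termDictOut.writeVLong(freqPointer);
}
{code}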

I am sorry to miss the party here with PFOR, but let us hope this credit crunch 
gets over soon so that I can dedicate some time to fun things like this :)

cheers, eks 


  

 Next steps towards flexible indexing
 ------------------------------------

 Key: LUCENE-1426
 URL: https://issues.apache.org/jira/browse/LUCENE-1426
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1426.patch


 In working on LUCENE-1410 (PFOR compression) I tried to prototype
 switching the postings files to use PFOR instead of vInts for
 encoding.
 But it quickly became difficult.  EG we currently mux the skip data
 into the .frq file, which messes up the int blocks.  We inline
 payloads with positions which would also mess up the int blocks.
 Skipping offsets and TermInfo offsets hardwire the file pointers of
 frq & prox files, yet I need to change these to block + offset, etc.
 Separately this thread also started up, on how to customize how Lucene
 stores positional information in the index:
   http://www.gossamer-threads.com/lists/lucene/java-user/66264
 So I decided to make a bit more progress towards flexible indexing
 by first modularizing/isolating the classes that actually write the
 index format.  The idea is to capture the logic of each (terms, freq,
 positions/payloads) into separate interfaces and switch the flushing
 of a new segment as well as writing the segment during merging to use
 the same APIs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1426) Next steps towards flexible indexing

2008-10-20 Thread Doug Cutting (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12641132#action_12641132
 ] 

Doug Cutting commented on LUCENE-1426:
--

+1 This sounds like a great way to approach flexible indexing: incrementally.

 Next steps towards flexible indexing
 ------------------------------------

 Key: LUCENE-1426
 URL: https://issues.apache.org/jira/browse/LUCENE-1426
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1426.patch


 In working on LUCENE-1410 (PFOR compression) I tried to prototype
 switching the postings files to use PFOR instead of vInts for
 encoding.
 But it quickly became difficult.  EG we currently mux the skip data
 into the .frq file, which messes up the int blocks.  We inline
 payloads with positions which would also mess up the int blocks.
 Skipping offsets and TermInfo offsets hardwire the file pointers of
 frq & prox files, yet I need to change these to block + offset, etc.
 Separately this thread also started up, on how to customize how Lucene
 stores positional information in the index:
   http://www.gossamer-threads.com/lists/lucene/java-user/66264
 So I decided to make a bit more progress towards flexible indexing
 by first modularizing/isolating the classes that actually write the
 index format.  The idea is to capture the logic of each (terms, freq,
 positions/payloads) into separate interfaces and switch the flushing
 of a new segment as well as writing the segment during merging to use
 the same APIs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1426) Next steps towards flexible indexing

2008-10-20 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12641137#action_12641137
 ] 

Michael McCandless commented on LUCENE-1426:


bq. During the omitTf() discussion, we came up with a cool idea: actually 
inline very short postings into the term dict instead of storing an offset.

Yes, there's this issue:

  https://issues.apache.org/jira/browse/LUCENE-1278

And you had found this one:

  http://www.siam.org/proceedings/alenex/2008/alx08_01transierf.pdf

And then Doug referenced this:

  http://citeseer.ist.psu.edu/cutting90optimizations.html

I think the idea makes tons of sense (saving a seek) and one of my
goals in phase 2 (genericizing the reading of an index) is to make
pulsing a drop-in codec as an example & litmus test.  Terms iteration
may suffer, though, unless we put this in a separate file.

I also think, at the opposite end of the spectrum, it would make sense
for very common terms to use simple n-bit packing (PFOR minus the
exceptions).  For massive terms we need the fastest search we can
get, since that gates when you have to start sharding.

bq. I am sorry to miss the party here with PFOR, but let us hope this credit 
crunch gets over soon so that I can dedicate some time to fun things like 
this

Well the stock market seems to think the credit crunch is improving,
today... of course who knows what'll happen tomorrow!  Good luck :)

Also, I'd like to explore improving the terms dict indexing -- I don't
think we need to load a TermInfo instance for every indexed term into
RAM.  I think we just need the term & seek data (into the tis file),
then you seek there and skip to the TermInfo you need.  This should
save a good amount of RAM for large indices with odd terms, since each
TermInfo instance requires a pointer to it (4 or 8 bytes), an object
header (8 bytes at least), then 20 bytes for the members.
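(Rough, illustrative arithmetic: 4-8 bytes of pointer + at least 8 bytes
of object header + 20 bytes of members is ~32-36 bytes per indexed term,
so e.g. 10 million indexed terms would pin roughly 320-360 MB in RAM.)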

All these explorations should become simple drop-in codecs, once I can
finish phase 2.


 Next steps towards flexible indexing
 ------------------------------------

 Key: LUCENE-1426
 URL: https://issues.apache.org/jira/browse/LUCENE-1426
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1426.patch


 In working on LUCENE-1410 (PFOR compression) I tried to prototype
 switching the postings files to use PFOR instead of vInts for
 encoding.
 But it quickly became difficult.  EG we currently mux the skip data
 into the .frq file, which messes up the int blocks.  We inline
 payloads with positions which would also mess up the int blocks.
 Skipping offsets and TermInfo offsets hardwire the file pointers of
 frq & prox files, yet I need to change these to block + offset, etc.
 Separately this thread also started up, on how to customize how Lucene
 stores positional information in the index:
   http://www.gossamer-threads.com/lists/lucene/java-user/66264
 So I decided to make a bit more progress towards flexible indexing
 by first modularizing/isolating the classes that actually write the
 index format.  The idea is to capture the logic of each (terms, freq,
 positions/payloads) into separate interfaces and switch the flushing
 of a new segment as well as writing the segment during merging to use
 the same APIs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1426) Next steps towards flexible indexing

2008-10-20 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12641139#action_12641139
 ] 

Michael McCandless commented on LUCENE-1426:



{quote}
Does the offset imply that there is also a need for random access into each 
block?
For such blocks PFOR patching might better be avoided.
Even with patching random access is possible, but it is not available yet at 
LUCENE-1410.
{quote}

Yeah this is one of the reasons why I'm thinking for frequent terms we
may want to fall back to pure n-bit packing (which would make random
access simple).

But, for starters we could simply implement random access as load &
decode the entire block, then look at the part you want, and then
assess the cost.  While it will clearly increase the cost of queries
that do a lot of skipping (eg an AND query of N terms), it may not
matter so much since these queries should be fairly fast now.  It's
the OR of frequent-term queries that we need to improve since that
limits how big an index you can put on one box.
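(A sketch of that fallback, with invented names:)

{code}
// Random access by brute force: decode the whole block, then index into it.
int[] block = new int[BLOCK_SIZE];
pforDecoder.decode(in, block);        // hypothetical PFOR block decoder
int value = block[targetIndexInBlock];
{code}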


 Next steps towards flexible indexing
 ------------------------------------

 Key: LUCENE-1426
 URL: https://issues.apache.org/jira/browse/LUCENE-1426
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1426.patch


 In working on LUCENE-1410 (PFOR compression) I tried to prototype
 switching the postings files to use PFOR instead of vInts for
 encoding.
 But it quickly became difficult.  EG we currently mux the skip data
 into the .frq file, which messes up the int blocks.  We inline
 payloads with positions which would also mess up the int blocks.
 Skipping offsets and TermInfo offsets hardwire the file pointers of
 frq & prox files, yet I need to change these to block + offset, etc.
 Separately this thread also started up, on how to customize how Lucene
 stores positional information in the index:
   http://www.gossamer-threads.com/lists/lucene/java-user/66264
 So I decided to make a bit more progress towards flexible indexing
 by first modularizing/isolating the classes that actually write the
 index format.  The idea is to capture the logic of each (terms, freq,
 positions/payloads) into separate interfaces and switch the flushing
 of a new segment as well as writing the segment during merging to use
 the same APIs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1426) Next steps towards flexible indexing

2008-10-20 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12641140#action_12641140
 ] 

Michael McCandless commented on LUCENE-1426:



bq. Which begs the question whether we should also allow compression of these 
payloads.

I think that's interesting, but would probably be rather application-dependent.

{quote}
Btw. allowing a payload to accompany the field norms would allow storing a
kind of dictionary for the position payloads. This could help to keep the
position payloads small so they would compress nicely.
{quote}

Couldn't stored fields, once they are faster (with column-stride
fields, LUCENE-1231), solve this?


 Next steps towards flexible indexing
 ------------------------------------

 Key: LUCENE-1426
 URL: https://issues.apache.org/jira/browse/LUCENE-1426
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1426.patch


 In working on LUCENE-1410 (PFOR compression) I tried to prototype
 switching the postings files to use PFOR instead of vInts for
 encoding.
 But it quickly became difficult.  EG we currently mux the skip data
 into the .frq file, which messes up the int blocks.  We inline
 payloads with positions which would also mess up the int blocks.
 Skipping offsets and TermInfo offsets hardwire the file pointers of
 frq & prox files, yet I need to change these to block + offset, etc.
 Separately this thread also started up, on how to customize how Lucene
 stores positional information in the index:
   http://www.gossamer-threads.com/lists/lucene/java-user/66264
 So I decided to make a bit more progress towards flexible indexing
 by first modularizing/isolating the classes that actually write the
 index format.  The idea is to capture the logic of each (terms, freq,
 positions/payloads) into separate interfaces and switch the flushing
 of a new segment as well as writing the segment during merging to use
 the same APIs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1387) Add LocalLucene

2008-10-20 Thread Xibin Zeng (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12641264#action_12641264
 ] 

Xibin Zeng commented on LUCENE-1387:


Hey Guys! Where is this now? Has it been checked in yet? I am asking as I am 
currently planning a feature and wanted to know if it is realistic to take 
advantage of it now. Any update is appreciated!

 Add LocalLucene
 ---------------

 Key: LUCENE-1387
 URL: https://issues.apache.org/jira/browse/LUCENE-1387
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Reporter: Grant Ingersoll
Priority: Minor
 Attachments: spatial.zip


 Local Lucene (Geo-search) has been donated to the Lucene project, per 
 https://issues.apache.org/jira/browse/INCUBATOR-77.  This issue is to handle 
 the Lucene portion of integration.
 See http://lucene.markmail.org/message/orzro22sqdj3wows?q=LocalLucene

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-1422) New TokenStream API

2008-10-20 Thread Michael Busch (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Busch updated LUCENE-1422:
--

Attachment: lucene-1422.take3.patch

I added several things in this new patch:

 * hashCode() and equals() now incorporate the attributes
 * patch compiles against Java 1.4
 * all core tests pass with and without the new API turned
   on (via TokenStream.setUseNewAPI(true))
 * Added setToken() method to InvertedDocConsumerPerField
   and TermsHashConsumerPerField and updated the 
   implementing classes. I actually have a question here,
   because I don't know these classes very well yet. Would
   it be better to add the Token to the DocInverter.FieldInvertState?
   I also think I have to review LUCENE-1426 to make sure these
   changes are not in conflict (I think 1426 should be committed
   first?)

Outstanding:
 * dedicated junits for new APIs, even though the existing tests
   already cover a lot when setUseNewAPI(true)
 * javadocs
 * contrib streams and filters


 New TokenStream API
 -------------------

 Key: LUCENE-1422
 URL: https://issues.apache.org/jira/browse/LUCENE-1422
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Analysis
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
 Fix For: 2.9

 Attachments: lucene-1422.patch, lucene-1422.take2.patch, 
 lucene-1422.take3.patch


 This is a very early version of the new TokenStream API that 
 we started to discuss here:
 http://www.gossamer-threads.com/lists/lucene/java-dev/66227
 This implementation is a bit different from what I initially
 proposed in the thread above. I introduced a new class called
 AttributedToken, which contains the same termBuffer logic 
 from Token. In addition it has a lazily-initialized map of
 Class<? extends Attribute> -> Attribute. Attribute is also a
 new class in a new package, plus several implementations like
 PositionIncrementAttribute, PayloadAttribute, etc.
 Similar to my initial proposal is the prototypeToken() method
 which the consumer (e. g. DocumentsWriter) needs to call.
 The token is created by the tokenizer at the end of the chain
 and pushed through all filters to the end consumer. The 
 tokenizer and also all filters can add Attributes to the 
 token and can keep references to the actual types of the
 attributes that they need to read or modify. This way, when
 boolean nextToken() is called, no casting is necessary.
 I added a class called TestNewTokenStreamAPI which is not 
 really a test case yet, but has a static demo() method, which
 demonstrates how to use the new API.
 The reason to not merge Token and TokenStream into one class 
 is that we might have caching (or tee/sink) filters in the 
 chain that might want to store cloned copies of the tokens
 in a cache. I added a new class NewCachingTokenStream that
 shows how such a class could work. I also implemented a deep
 clone method in AttributedToken and a 
 copyFrom(AttributedToken) method, which is needed for the 
 caching. Both methods have to iterate over the list of 
 attributes. The Attribute subclasses itself also have a
 copyFrom(Attribute) method, which unfortunately has to down-
 cast to the actual type. I first thought that might be very
 inefficient, but it's not so bad. Well, if you add all
 Attributes to the AttributedToken that our old Token class
 had (like offsets, payload, posIncr), then the performance
 of the caching is somewhat slower (~40%). However, if you 
 add less attributes, because not all might be needed, then
 the performance is even slightly faster than with the old API.
 Also the new API is flexible enough so that someone could
 implement a custom caching filter that knows all attributes
 the token can have, then the caching should be just as 
 fast as with the old API.
 This patch is not nearly ready; there are lots of things 
 missing:
 - unit tests
 - change DocumentsWriter to use new API 
   (in backwards-compatible fashion)
 - patch is currently java 1.5; need to change before 
   committing to 2.9
 - all TokenStreams and -Filters should be changed to use 
   new API
 - javadocs incorrect or missing
 - hashcode and equals methods missing in Attributes and 
   AttributedToken
   
 I wanted to submit it already for brave people to give me 
 early feedback before I spend more time working on this.
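 For illustration, consumption in the spirit of the demo() described above
 (names are taken from this description and may differ across patch
 iterations):

{code}
AttributedToken token = stream.prototypeToken();  // consumer obtains the token
PositionIncrementAttribute posIncr = (PositionIncrementAttribute)
    token.getAttribute(PositionIncrementAttribute.class);  // cast once
while (stream.nextToken()) {
  int incr = posIncr.getPositionIncrement();      // no per-token casting
}
{code}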

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: TokenStream and Token APIs

2008-10-20 Thread Michael Busch

Grant Ingersoll wrote:


On Oct 19, 2008, at 7:08 PM, Michael Busch wrote:


Grant Ingersoll wrote:


On Oct 19, 2008, at 12:56 AM, Mark Miller wrote:


Grant Ingersoll wrote:


Bear with me, b/c I'm not sure I'm following, but looking at 
https://issues.apache.org/jira/browse/LUCENE-1422, I see at least 
5 different implemented Attributes.


So, let's say I add 5 more attributes and now have a total of 10 
attributes. Are you saying that I then would have, potentially, 10 
different variables that all point to the token as in the code 
snippet above where the casting takes place? Or would I just 
create a single Super attribute that folds in all of my new 
attributes, plus any other existing ones? Or, maybe, what I would 
do is create the 5 new attributes and then 1 new attribute that 
extends all 10, thus allowing me to use them individually, but 
saving me from having to do a whole ton of casting in my Consumer.
Potentially one consumer doing 10 things, but not likely right? I 
mean, things will stay logical as they are now, and rather than a 
super consumer doing everything, we will still have a chain of 
consumers each doing its own piece. So more likely, maybe something 
comes along every so often (another 5, over *much* time, say) and 
each time we add a Consumer that uses one or two TokenStream types. 
And then it's just an implementation detail on whether you make a 
composite TokenStream - if you have added 10 new attributes and see 
fit to make one consumer use them all, sure, make a composite, 
super type, but in my mind, the way it's done in the example code is 
clearer/cleaner for a handful of TokenStream types. And even if you 
do make the composite, super type, it's likely to just be a sugar 
wrapper anyway - the implementation for say, payload and positions, 
should probably be maintained in their own classes anyway.


Well, there are 5 different attributes already, all of which are 
commonly used.  Seems weird to have to cast the same var 5 different 
ways.  Definitely agree that one would likely deal with this by 
wrapping, but then you end up either needing to extend your wrapper 
or add new wrappers...


Well yes, there are 5 attributes, but in neither of the core 
tokenstreams and -filters that I changed in my patch did I have to 
use more than two or three of those. Currently the only attributes 
that are really used are PositionIncrementAttribute and 
PayloadAttribute. And the OffsetAttribute when TermVectors are turned 
on.


Even in the indexing chain currently we don't have a single consumer 
that needs all attributes. The FreqProxWriter needs positions and 
payloads, the TermVectorsWriter needs positions and offsets.



I have an application that uses all the attributes of a Token, or at 
least, almost all of them.  There are many uses for Lucene's analysis 
code that have nothing to do with indexing, Consumers or even Lucene.





Also, you don't have to cast the same variable multiple times. In the 
current patch you would call e. g. 
token.getAttribute(PayloadAttribute.class) and keep a reference to it 
in the consumer or filter.


IMO even calling getAttribute() 5 times or so and storing the 
references wouldn't be so bad. And if you really don't like it you 
could make a wrapper as you said. You also mentioned the 
disadvantages of the wrapper, e. g. that you would have to extend it 
to add new attributes. But then, isn't that the same disadvantage the 
current Token API has?


True.  I didn't say the idea was bad, in fact I mostly like it, I was 
just saying I'd like to explore how it would work in practice and the 
main thing that struck me was all the casting or all the references.  
Since it's likely that you only deal with a Token one at a time, 
you're right, it's probably not a big deal other than the code looks 
funny, IMO.





You could even use the new API in exactly the same way as the old one. 
Just create a subclass of Token that has all members you need and 
don't add any attributes.


So I think the new API adds more flexibility, and can still be used in 
the same way as the old one. I however think the recommended 
best practice should be to use the new attributes, for reusability of 
consumers that only need certain attributes.


Perhaps it would be useful for Lucene to offer exactly one subclass of 
Token that we guarantee will always have all known Attributes (i.e. 
the ones Lucene provides)  available to it for casting purposes.




Yeah we could do that. In fact, I did exactly this when I started 
working on this patch. I created a class called PlainToken, which had 
all the termBuffer and attributes logic, and changed Token to extend it. 
Then the new getToken() method would return an instance of PlainToken. 
My main concern with this approach is that it will make the code in the 
indexer more complicated, because it always has to check if we have a 
Token or PlainToken; if it's a Token then it has to use the get*() 
method directly, for a