[jira] Resolved: (LUCENE-1411) Enable IndexWriter to open an arbitrary commit point

2008-10-20 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-1411.


Resolution: Fixed

 Enable IndexWriter to open an arbitrary commit point
 -----------------------------------------------------

 Key: LUCENE-1411
 URL: https://issues.apache.org/jira/browse/LUCENE-1411
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.4
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1411.patch


 With a 2-phase commit involving multiple resources, each resource
 first does its prepareCommit and then if all are successful they each
 commit.  If an exception or timeout/power loss is hit in any of the
 resources during prepareCommit or commit, all of the resources must
 then rollback.
 But, because IndexWriter always opens the most recent commit, getting
 Lucene to rollback after commit() has been called is not easy, unless
 you make Lucene the last resource to commit.  A simple workaround is
 to simply remove the segments_N files of the newer commits but that's
 sort of a hassle.
 To fix this, we just need to add a ctor to IndexWriter that takes an
 IndexCommit.  We recently added this for IndexReader (LUCENE-1311) as
 well.  This ctor is definitely an expert method, and only makes
 sense if you have a custom DeletionPolicy that preserves more than
 just the most recent commit.
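For illustration, a minimal sketch of the rollback flow this enables,
assuming a keep-everything deletion policy; the exact signatures here
(listCommits, the new IndexWriter ctor) are my guesses, not necessarily
the committed API:

{code}
// Keep every commit so older commit points stay available for rollback
// (a real policy would prune old commits eventually).
class KeepAllDeletionPolicy implements IndexDeletionPolicy {
  public void onInit(List commits) {}    // delete nothing
  public void onCommit(List commits) {}  // delete nothing
}

// Later, roll back by opening IndexWriter directly on an older commit:
List commits = IndexReader.listCommits(dir);
IndexCommit older = (IndexCommit) commits.get(0);  // pick the commit to revert to
IndexWriter writer = new IndexWriter(dir, analyzer,
    new KeepAllDeletionPolicy(), IndexWriter.MaxFieldLength.UNLIMITED, older);
{code}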

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Resolved: (LUCENE-1382) Allow storing user data when IndexWriter.commit() is called

2008-10-20 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-1382.


   Resolution: Fixed
Fix Version/s: 2.9

 Allow storing user data when IndexWriter.commit() is called
 ------------------------------------------------------------

 Key: LUCENE-1382
 URL: https://issues.apache.org/jira/browse/LUCENE-1382
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1382.patch


 Spinoff from here:
 http://www.mail-archive.com/[EMAIL PROTECTED]/msg22303.html
 The idea is to allow optionally passing an opaque String commitUserData to 
 the IndexWriter.commit method.  This String would be stored in the segments_N 
 file, and would be retrievable by an IndexReader.  Applications could then 
 use this to assign meaning to each commit.
 It would be nice to get this done for 2.4, but I don't think we should hold 
 the release for it.
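For illustration, a sketch of the intended usage; the issue describes an
opaque String, though the committed API may differ (e.g. taking a Map):

{code}
// At commit time, attach application-defined metadata to the commit point;
// it is written into the segments_N file:
writer.commit("checkpoint-42");

// Later, a reader of the same index can retrieve it:
String userData = reader.getCommitUserData();
{code}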

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Created: (LUCENE-1426) Next steps towards flexible indexing

2008-10-20 Thread Michael McCandless (JIRA)
Next steps towards flexible indexing
------------------------------------

 Key: LUCENE-1426
 URL: https://issues.apache.org/jira/browse/LUCENE-1426
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9


In working on LUCENE-1410 (PFOR compression) I tried to prototype
switching the postings files to use PFOR instead of vInts for
encoding.

But it quickly became difficult.  EG we currently mux the skip data
into the .frq file, which messes up the int blocks.  We inline
payloads with positions which would also mess up the int blocks.
Skipping offsets and TermInfo offsets hardwire the file pointers of
frq & prox files, yet I need to change these to block + offset, etc.

Separately this thread also started up, on how to customize how Lucene
stores positional information in the index:

  http://www.gossamer-threads.com/lists/lucene/java-user/66264

So I decided to make a bit more progress towards flexible indexing
by first modularizing/isolating the classes that actually write the
index format.  The idea is to capture the logic of each (terms, freq,
positions/payloads) into separate interfaces and switch the flushing
of a new segment as well as writing the segment during merging to use
the same APIs.
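For a concrete picture, an invented sketch of what such writer-side
interfaces could look like (names are illustrative, not from the patch):

{code}
// One consumer per layer of the postings data:
abstract class TermsConsumer {
  // called once per unique term; returns the consumer for its docs
  abstract DocsConsumer startTerm(char[] text, int len) throws IOException;
  abstract void finishTerm() throws IOException;
}
abstract class DocsConsumer {
  // called once per document containing the term
  abstract PositionsConsumer addDoc(int docID, int freq) throws IOException;
}
abstract class PositionsConsumer {
  // called freq times per doc; payload may be null
  abstract void addPosition(int position, byte[] payload) throws IOException;
}
{code}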


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-1426) Next steps towards flexible indexing

2008-10-20 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1426:
---

Attachment: LUCENE-1426.patch

Attached patch.  I think it's ready to commit... I'll wait a few days.

This factors the writing of postings into separate Format* classes.
The approach I took is similar to what I did for DocumentsWriter,
where there is a hierarchical consumer interface (abstract class) for
each of fields, terms, docs, and positions writing.  Then there's a
corresponding set of concrete classes (the codec chain) that write
today's index format.  There is no change to the index format.

Here are the details:

  * This only applies to postings (not stored fields, term vectors,
norms, field infos)

  * Both SegmentMerger & FreqProxTermsWriter now use the same codec
API to write postings.  I think this is a big step forward: we now
have a single set of classes that ever write the postings.

  * You can't yet customize this codec chain; we can add that at some
point.  It's all package private.

  * I don't yet allow the codec to override SegmentInfo.files(); at
some point (when I first try to make a codec that uses different
files) I will add this.

I ran a quick performance test, indexing wikipedia, and found the
performance cost of this to be negligible.

The next step, which is trickier, is to modularize/genericize the
classes that read from the index, and then refactor
SegmentTerm{Enum,Docs,Positions} to use that codec API.

Then, finally, I want to make a codec that uses PFOR to encode
postings.
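An illustrative sketch (invented names) of the shared write loop -- the
calls are the same whether the postings come from RAM at flush time or
from segment readers at merge time:

{code}
DocsConsumer docs = terms.startTerm(termText, termLen);
for (int d = 0; d < docCount; d++) {
  PositionsConsumer positions = docs.addDoc(docIDs[d], freqs[d]);
  for (int p = 0; p < freqs[d]; p++)
    positions.addPosition(positionBuffer[d][p], null);  // null: no payload
}
terms.finishTerm();
{code}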

 Next steps towards flexible indexing
 ------------------------------------

 Key: LUCENE-1426
 URL: https://issues.apache.org/jira/browse/LUCENE-1426
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1426.patch


 In working on LUCENE-1410 (PFOR compression) I tried to prototype
 switching the postings files to use PFOR instead of vInts for
 encoding.
 But it quickly became difficult.  EG we currently mux the skip data
 into the .frq file, which messes up the int blocks.  We inline
 payloads with positions which would also mess up the int blocks.
 Skipping offsets and TermInfo offsets hardwire the file pointers of
 frq & prox files, yet I need to change these to block + offset, etc.
 Separately this thread also started up, on how to customize how Lucene
 stores positional information in the index:
   http://www.gossamer-threads.com/lists/lucene/java-user/66264
 So I decided to make a bit more progress towards flexible indexing
 by first modularizing/isolating the classes that actually write the
 index format.  The idea is to capture the logic of each (terms, freq,
 positions/payloads) into separate interfaces and switch the flushing
 of a new segment as well as writing the segment during merging to use
 the same APIs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: TokenStream and Token APIs

2008-10-20 Thread Grant Ingersoll


On Oct 19, 2008, at 7:08 PM, Michael Busch wrote:


Grant Ingersoll wrote:


On Oct 19, 2008, at 12:56 AM, Mark Miller wrote:


Grant Ingersoll wrote:


Bear with me, b/c I'm not sure I'm following, but looking at
https://issues.apache.org/jira/browse/LUCENE-1422, I see at least 5
different implemented Attributes.


So, let's say I add 5 more attributes and now have a total of
10 attributes. Are you saying that I then would have,  
potentially, 10 different variables that all point to the token  
as in the code snippet above where the casting takes place? Or  
would I just create a single Super attribute that folds in all  
of my new attributes, plus any other existing ones? Or, maybe,  
what I would do is create the 5 new attributes and then 1 new  
attribute that extends all 10, thus allowing me to use them  
individually, but saving me from having to do a whole ton of  
casting in my Consumer.
Potentially one consumer doing 10 things, but not likely right? I  
mean, things will stay logical as they are now, and rather than a  
super consumer doing everything, we will still have a chain of  
consumers each doing its own piece. So more likely, maybe  
something comes along every so often (another 5, over *much* time,  
say) and each time we add a Consumer that uses one or two  
TokenStream types. And then it's just an implementation detail on
whether you make a composite TokenStream - if you have added 10
new attributes and see fit to make one consumer use them all,
sure, make a composite, super type, but in my mind, the way it's
done in the example code is clearer/cleaner for a handful of
TokenStream types. And even if you do make the composite, super
type, it's likely to just be a sugar wrapper anyway - the
implementation for say, payload and positions, should probably be  
maintained in their own classes anyway.


Well, there are 5 different attributes already, all of which are  
commonly used.  Seems weird to have to cast the same var 5  
different ways.  Definitely agree that one would likely deal with  
this by wrapping, but then you end up either needing to extend your  
wrapper or add new wrappers...


Well yes, there are 5 attributes, but in neither of the core
tokenstreams and -filters that I changed in my patch did I have to  
use more than two or three of those. Currently the only attributes  
that are really used are PositionIncrementAttribute and  
PayloadAttribute. And the OffsetAttribute when TermVectors are  
turned on.


Even in the indexing chain currently we don't have a single consumer  
that needs all attributes. The FreqProxWriter needs positions and  
payloads, the TermVectorsWriter needs positions and offsets.



I have an application that uses all the attributes of a Token, or at  
least, almost all of them.  There are many uses for Lucene's analysis  
code that have nothing to do with indexing, Consumers or even Lucene.





Also, you don't have to cast the same variable multiple times. In  
the current patch you would call e. g.  
token.getAttribute(PayloadAttribute.class) and keep a reference to  
it in the consumer or filter.


IMO even calling getAttribute() 5 times or so and storing the  
references wouldn't be so bad. And if you really don't like it you  
could make a wrapper as you said. You also mentioned the  
disadvantages of the wrapper, e. g. that you would have to extend it  
to add new attributes. But then, isn't that the same disadvantage  
the current Token API has?
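(For illustration, the reference-keeping pattern being described, using
the getAttribute() API from the LUCENE-1422 patch; exact names may
differ:)

   // Cast once, up front, and keep the references:
   PayloadAttribute payloadAtt =
       (PayloadAttribute) token.getAttribute(PayloadAttribute.class);
   PositionIncrementAttribute posIncrAtt =
       (PositionIncrementAttribute) token.getAttribute(PositionIncrementAttribute.class);
   // ...then read them per token, with no further casting.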


True.  I didn't say the idea was bad, in fact I mostly like it, I was  
just saying I'd like to explore how it would work in practice and the  
main thing that struck me was all the casting or all the references.   
Since it's likely that you only deal with a Token one at a time,  
you're right, it's probably not a big deal other than the code looks  
funny, IMO.





You could even use the new API in exactly the same way as the old one.
Just create a subclass of Token that has all members you need and  
don't add any attributes.


So I think the new API adds more flexibility, and can still be used
in the same way as the old one.  I however think the
recommended best practice should be to use the new attributes, for  
reusability of consumers that only need certain attributes.


Perhaps it would be useful for Lucene to offer exactly one subclass of  
Token that we guarantee will always have all known Attributes (i.e.  
the ones Lucene provides)  available to it for casting purposes.





However, please let me know if you have any concrete recommendations  
about changing the API in LUCENE-1422.


I thought those concerns were pretty concrete...  :-)


There might be better ones than the APIs I came up with.


I think the APIs in the 2nd patch look pretty reasonable.


-Grant

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1406) new Arabic Analyzer (Apache license)

2008-10-20 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12641075#action_12641075
 ] 

Grant Ingersoll commented on LUCENE-1406:
-

Committed revision 706342.

I made some small changes to reuse Tokens, added some comments to the
stopwords list, and extended WordListLoader to accommodate this.

Thanks Robert!

 new Arabic Analyzer (Apache license)
 ------------------------------------

 Key: LUCENE-1406
 URL: https://issues.apache.org/jira/browse/LUCENE-1406
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Analysis
Reporter: Robert Muir
Assignee: Grant Ingersoll
Priority: Minor
 Attachments: LUCENE-1406.patch


 I've noticed there is no Arabic analyzer for Lucene, most likely because Tim 
 Buckwalter's morphological dictionary is GPL.
 However, it is not necessary to have a full morphological analysis engine for 
 a quality arabic search. 
 This implementation uses the light-8s algorithm presented in the 
 following paper: http://ciir.cs.umass.edu/pubfiles/ir-249.pdf
 As you can see from the paper, improvement via this method over searching 
 surface forms (as lucene currently does) is significant, with almost 100% 
 improvement in average precision.
 While I personally don't think all the choices were the best, and some easy 
 improvements are still possible, the major motivation for implementing it 
 exactly the way it is presented in the paper is that the algorithm is 
 TREC-tested, so the precision/recall improvements to lucene are already 
 documented.
 For a stopword list, I used a list present at 
 http://members.unine.ch/jacques.savoy/clef/index.html simply because the 
 creator of this list documents the data as BSD-licensed.
 This implementation (Analyzer) consists of the above-mentioned stopword list plus 
 two filters:
  ArabicNormalizationFilter: performs orthographic normalization (such as 
 hamza seated on alif, alif maksura, teh marbuta, removal of harakat, tatweel, 
 etc)
  ArabicStemFilter: performs arabic light stemming
 Both filters operate directly on termbuffer for maximum performance. There is 
 no object creation in this Analyzer.
 There are no external dependencies. I've indexed about half a billion words 
 of arabic text and tested against that.
 If there are any issues with this implementation I am willing to fix them. I 
 use lucene on a daily basis and would like to give something back. Thanks.
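A usage sketch of the chain the description implies; the tokenizer choice
and constructor details here are illustrative, not from the patch:

{code}
// stopword removal -> orthographic normalization -> light stemming
TokenStream ts = new WhitespaceTokenizer(reader);
ts = new StopFilter(ts, arabicStopSet);
ts = new ArabicNormalizationFilter(ts);
ts = new ArabicStemFilter(ts);
{code}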

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1426) Next steps towards flexible indexing

2008-10-20 Thread Paul Elschot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12641121#action_12641121
 ] 

Paul Elschot commented on LUCENE-1426:
--

bq. We inline payloads with positions which would also mess up the int blocks.

Which begs the question whether we should also allow compression of these
payloads. I think we should do that, because normally only one or two bytes
will be used as payload per position.
Thinking about this: position+payload actually looks a lot like docId+freq.
Could that be used to simplify future index formats for inverted terms?
Btw. allowing a payload to accompany the field norms would allow storing a
kind of dictionary for the position payloads. This could help to keep the
position payloads small so they would compress nicely.

bq. Both SegmentMerger & FreqProxTermsWriter now use the same codec API to 
write postings.

That is indeed a big step.

bq. It's all package private.

Good for now, making it public might actually reduce flexibility for new index 
formats.



 Next steps towards flexible indexing
 ------------------------------------

 Key: LUCENE-1426
 URL: https://issues.apache.org/jira/browse/LUCENE-1426
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1426.patch


 In working on LUCENE-1410 (PFOR compression) I tried to prototype
 switching the postings files to use PFOR instead of vInts for
 encoding.
 But it quickly became difficult.  EG we currently mux the skip data
 into the .frq file, which messes up the int blocks.  We inline
 payloads with positions which would also mess up the int blocks.
 Skipping offsets and TermInfo offsets hardwire the file pointers of
 frq & prox files, yet I need to change these to block + offset, etc.
 Separately this thread also started up, on how to customize how Lucene
 stores positional information in the index:
   http://www.gossamer-threads.com/lists/lucene/java-user/66264
 So I decided to make a bit more progress towards flexible indexing
 by first modularizing/isolating the classes that actually write the
 index format.  The idea is to capture the logic of each (terms, freq,
 positions/payloads) into separate interfaces and switch the flushing
 of a new segment as well as writing the segment during merging to use
 the same APIs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1426) Next steps towards flexible indexing

2008-10-20 Thread Paul Elschot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12641125#action_12641125
 ] 

Paul Elschot commented on LUCENE-1426:
--

bq. Skipping offsets and TermInfo offsets hardwire the file pointers of frq &
prox files, yet I need to change these to block + offset, etc.

Does the offset imply that there is also a need for random access into each
block? For such blocks PFOR patching might better be avoided. Even with
patching, random access is possible, but it is not available yet at
LUCENE-1410.


 Next steps towards flexible indexing
 ------------------------------------

 Key: LUCENE-1426
 URL: https://issues.apache.org/jira/browse/LUCENE-1426
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1426.patch


 In working on LUCENE-1410 (PFOR compression) I tried to prototype
 switching the postings files to use PFOR instead of vInts for
 encoding.
 But it quickly became difficult.  EG we currently mux the skip data
 into the .frq file, which messes up the int blocks.  We inline
 payloads with positions which would also mess up the int blocks.
 Skipping offsets and TermInfo offsets hardwire the file pointers of
 frq & prox files, yet I need to change these to block + offset, etc.
 Separately this thread also started up, on how to customize how Lucene
 stores positional information in the index:
   http://www.gossamer-threads.com/lists/lucene/java-user/66264
 So I decided to make a bit more progress towards flexible indexing
 by first modularizing/isolating the classes that actually write the
 index format.  The idea is to capture the logic of each (terms, freq,
 positions/payloads) into separate interfaces and switch the flushing
 of a new segment as well as writing the segment during merging to use
 the same APIs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1426) Next steps towards flexible indexing

2008-10-20 Thread Eks Dev (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12641128#action_12641128
 ] 

Eks Dev commented on LUCENE-1426:
-

Just a few random thoughts on this topic

- I am sure I read somewhere in these pdfs that were floating around that it 
would make sense to use VInts for very short postings and PFOR for the rest. I 
just do not remember the rationale behind it.

- During the omitTf() discussion, we came up with a cool idea: actually inline 
very short postings into the term dict instead of storing an offset. This way 
we spare one seek per term in many cases, as well as some space for storing the 
offset. I do not know if this is a problem, but it sounds reasonable. With a 
standard Zipfian distribution, a lot of postings should get inlined. Use cases 
where we have query expansion on many terms (think spell checker, synonyms ...) 
should benefit from that heavily. These postings are small but there are a lot 
of them, so it adds up... seek is deadly :)
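A sketch of the write-time decision (all names invented, just to illustrate 
the trade-off):

{code}
if (docFreq <= INLINE_THRESHOLD) {
  // tiny posting list: write it into the term dict entry itself,
  // so reading it later costs no extra seek
  termDictOut.writeVInt(docFreq);
  writePostingsInline(termDictOut);  // hypothetical helper
} else {
  // normal path: store a pointer into the .frq file
  termDictOut.writeVLong(freqPointer);
}
{code}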

I am sorry to miss the party here with PFOR, but let us hope this credit crunch 
gets over soon so that I can dedicate some time to fun things like this :)

cheers, eks 


  

 Next steps towards flexible indexing
 ------------------------------------

 Key: LUCENE-1426
 URL: https://issues.apache.org/jira/browse/LUCENE-1426
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1426.patch


 In working on LUCENE-1410 (PFOR compression) I tried to prototype
 switching the postings files to use PFOR instead of vInts for
 encoding.
 But it quickly became difficult.  EG we currently mux the skip data
 into the .frq file, which messes up the int blocks.  We inline
 payloads with positions which would also mess up the int blocks.
 Skipping offsets and TermInfo offsets hardwire the file pointers of
 frq & prox files, yet I need to change these to block + offset, etc.
 Separately this thread also started up, on how to customize how Lucene
 stores positional information in the index:
   http://www.gossamer-threads.com/lists/lucene/java-user/66264
 So I decided to make a bit more progress towards flexible indexing
 by first modularizing/isolating the classes that actually write the
 index format.  The idea is to capture the logic of each (terms, freq,
 positions/payloads) into separate interfaces and switch the flushing
 of a new segment as well as writing the segment during merging to use
 the same APIs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1426) Next steps towards flexible indexing

2008-10-20 Thread Doug Cutting (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12641132#action_12641132
 ] 

Doug Cutting commented on LUCENE-1426:
--

+1 This sounds like a great way to approach flexible indexing: incrementally.

 Next steps towards flexible indexing
 ------------------------------------

 Key: LUCENE-1426
 URL: https://issues.apache.org/jira/browse/LUCENE-1426
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1426.patch


 In working on LUCENE-1410 (PFOR compression) I tried to prototype
 switching the postings files to use PFOR instead of vInts for
 encoding.
 But it quickly became difficult.  EG we currently mux the skip data
 into the .frq file, which messes up the int blocks.  We inline
 payloads with positions which would also mess up the int blocks.
 Skipping offsets and TermInfo offsets hardwire the file pointers of
 frq & prox files, yet I need to change these to block + offset, etc.
 Separately this thread also started up, on how to customize how Lucene
 stores positional information in the index:
   http://www.gossamer-threads.com/lists/lucene/java-user/66264
 So I decided to make a bit more progress towards flexible indexing
 by first modularizing/isolating the classes that actually write the
 index format.  The idea is to capture the logic of each (terms, freq,
 positions/payloads) into separate interfaces and switch the flushing
 of a new segment as well as writing the segment during merging to use
 the same APIs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1426) Next steps towards flexible indexing

2008-10-20 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12641137#action_12641137
 ] 

Michael McCandless commented on LUCENE-1426:


bq. During the omitTf() discussion, we came up with a cool idea: actually 
inline very short postings into the term dict instead of storing an offset.

Yes, there's this issue:

  https://issues.apache.org/jira/browse/LUCENE-1278

And you had found this one:

  http://www.siam.org/proceedings/alenex/2008/alx08_01transierf.pdf

And then Doug referenced this:

  http://citeseer.ist.psu.edu/cutting90optimizations.html

I think the idea makes tons of sense (saving a seek) and one of my
goals in phase 2 (genericizing the reading of an index) is to make
pulsing a drop-in codec as an example & litmus test.  Terms iteration
may suffer, though, unless we put this in a separate file.

I also think, at the opposite end of the spectrum, it would make sense
for very common terms to use simple n-bit packing (PFOR minus the
exceptions).  For massive terms we need the fastest search we can
get, since that gates when you have to start sharding.

bq. I am sorry to miss the party here with PFOR, but let us hope this credit 
crunch gets over soon so that I can dedicate some time to fun things like 
this

Well the stock market seems to think the credit crunch is improving,
today... of course who knows what'll happen tomorrow!  Good luck :)

Also, I'd like to explore improving the terms dict indexing -- I don't
think we need to load a TermInfo instance for every indexed term into
RAM.  I think we just need the term & seek data (into the tis file),
then you seek there and skip to the TermInfo you need.  This should
save a good amount of RAM for large indices with odd terms, since each
TermInfo instance requires a pointer to it (4 or 8 bytes), an object
header (8 bytes at least), then 20 bytes for the members.
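(Rough, illustrative arithmetic: 4-8 bytes of pointer + at least 8 bytes
of object header + 20 bytes of members is ~32-36 bytes per indexed term,
so e.g. 10 million indexed terms would pin roughly 320-360 MB in RAM.)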

All these explorations should become simple drop-in codecs, once I can
finish phase 2.


 Next steps towards flexible indexing
 ------------------------------------

 Key: LUCENE-1426
 URL: https://issues.apache.org/jira/browse/LUCENE-1426
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1426.patch


 In working on LUCENE-1410 (PFOR compression) I tried to prototype
 switching the postings files to use PFOR instead of vInts for
 encoding.
 But it quickly became difficult.  EG we currently mux the skip data
 into the .frq file, which messes up the int blocks.  We inline
 payloads with positions which would also mess up the int blocks.
 Skipping offsets and TermInfo offsets hardwire the file pointers of
 frq & prox files, yet I need to change these to block + offset, etc.
 Separately this thread also started up, on how to customize how Lucene
 stores positional information in the index:
   http://www.gossamer-threads.com/lists/lucene/java-user/66264
 So I decided to make a bit more progress towards flexible indexing
 by first modularizing/isolating the classes that actually write the
 index format.  The idea is to capture the logic of each (terms, freq,
 positions/payloads) into separate interfaces and switch the flushing
 of a new segment as well as writing the segment during merging to use
 the same APIs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1426) Next steps towards flexible indexing

2008-10-20 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12641139#action_12641139
 ] 

Michael McCandless commented on LUCENE-1426:



{quote}
Does the offset imply that there is also a need for random access into each 
block?
For such blocks PFOR patching might better be avoided.
Even with patching random access is possible, but it is not available yet at 
LUCENE-1410.
{quote}

Yeah this is one of the reasons why I'm thinking for frequent terms we
may want to fall back to pure n-bit packing (which would make random
access simple).

But, for starters we could simply implement random access as load &
decode the entire block, then look at the part you want, and then
assess the cost.  While it will clearly increase the cost of queries
that do a lot of skipping (eg an AND query of N terms), it may not
matter so much since these queries should be fairly fast now.  It's
the OR of frequent-term queries that we need to improve since that
limits how big an index you can put on one box.
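(A sketch of that fallback, with invented names:)

{code}
// Random access by brute force: decode the whole block, then index into it.
int[] block = new int[BLOCK_SIZE];
pforDecoder.decode(in, block);        // hypothetical PFOR block decoder
int value = block[targetIndexInBlock];
{code}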


 Next steps towards flexible indexing
 ------------------------------------

 Key: LUCENE-1426
 URL: https://issues.apache.org/jira/browse/LUCENE-1426
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1426.patch


 In working on LUCENE-1410 (PFOR compression) I tried to prototype
 switching the postings files to use PFOR instead of vInts for
 encoding.
 But it quickly became difficult.  EG we currently mux the skip data
 into the .frq file, which messes up the int blocks.  We inline
 payloads with positions which would also mess up the int blocks.
 Skipping offsets and TermInfo offsets hardwire the file pointers of
 frq & prox files, yet I need to change these to block + offset, etc.
 Separately this thread also started up, on how to customize how Lucene
 stores positional information in the index:
   http://www.gossamer-threads.com/lists/lucene/java-user/66264
 So I decided to make a bit more progress towards flexible indexing
 by first modularizing/isolating the classes that actually write the
 index format.  The idea is to capture the logic of each (terms, freq,
 positions/payloads) into separate interfaces and switch the flushing
 of a new segment as well as writing the segment during merging to use
 the same APIs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1426) Next steps towards flexible indexing

2008-10-20 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12641140#action_12641140
 ] 

Michael McCandless commented on LUCENE-1426:



bq. Which begs the question whether we should also allow compression of these 
payloads.

I think that's interesting, but would probably be rather application-dependent.

{quote}
Btw. allowing a payload to accompany the field norms would allow storing a
kind of dictionary for the position payloads. This could help to keep the
position payloads small so they would compress nicely.
{quote}

Couldn't stored fields, once they are faster (with column-stride
fields, LUCENE-1231), solve this?


 Next steps towards flexible indexing
 ------------------------------------

 Key: LUCENE-1426
 URL: https://issues.apache.org/jira/browse/LUCENE-1426
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1426.patch


 In working on LUCENE-1410 (PFOR compression) I tried to prototype
 switching the postings files to use PFOR instead of vInts for
 encoding.
 But it quickly became difficult.  EG we currently mux the skip data
 into the .frq file, which messes up the int blocks.  We inline
 payloads with positions which would also mess up the int blocks.
 Skipping offsets and TermInfo offsets hardwire the file pointers of
 frq & prox files, yet I need to change these to block + offset, etc.
 Separately this thread also started up, on how to customize how Lucene
 stores positional information in the index:
   http://www.gossamer-threads.com/lists/lucene/java-user/66264
 So I decided to make a bit more progress towards flexible indexing
 by first modularizing/isolating the classes that actually write the
 index format.  The idea is to capture the logic of each (terms, freq,
 positions/payloads) into separate interfaces and switch the flushing
 of a new segment as well as writing the segment during merging to use
 the same APIs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1387) Add LocalLucene

2008-10-20 Thread Xibin Zeng (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12641264#action_12641264
 ] 

Xibin Zeng commented on LUCENE-1387:


Hey Guys! Where is this now? Has it been checked in yet? I am asking as I am 
currently planning a feature and wanted to know if it is realistic to take 
advantage of it now. Any update is appreciated!

 Add LocalLucene
 ---------------

 Key: LUCENE-1387
 URL: https://issues.apache.org/jira/browse/LUCENE-1387
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Reporter: Grant Ingersoll
Priority: Minor
 Attachments: spatial.zip


 Local Lucene (Geo-search) has been donated to the Lucene project, per 
 https://issues.apache.org/jira/browse/INCUBATOR-77.  This issue is to handle 
 the Lucene portion of integration.
 See http://lucene.markmail.org/message/orzro22sqdj3wows?q=LocalLucene

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-1422) New TokenStream API

2008-10-20 Thread Michael Busch (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Busch updated LUCENE-1422:
--

Attachment: lucene-1422.take3.patch

I added several things in this new patch:

 * hashCode() and equals() now incorporate the attributes
 * patch compiles against Java 1.4
 * all core tests pass with and without the new API turned
   on (via TokenStream.setUseNewAPI(true))
 * Added setToken() method to InvertedDocConsumerPerField
   and TermsHashConsumerPerField and updated the 
   implementing classes. I actually have a question here,
   because I don't know these classes very well yet. Would
   it be better to add the Token to the DocInverter.FieldInvertState?
   I also think I have to review LUCENE-1426 to make sure these
   changes are not in conflict (I think 1426 should be committed
   first?)

Outstanding:
 * dedicated junits for new APIs, even though the existing tests
   already cover a lot when setUseNewAPI(true)
 * javadocs
 * contrib streams and filters


 New TokenStream API
 -------------------

 Key: LUCENE-1422
 URL: https://issues.apache.org/jira/browse/LUCENE-1422
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Analysis
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
 Fix For: 2.9

 Attachments: lucene-1422.patch, lucene-1422.take2.patch, 
 lucene-1422.take3.patch


 This is a very early version of the new TokenStream API that 
 we started to discuss here:
 http://www.gossamer-threads.com/lists/lucene/java-dev/66227
 This implementation is a bit different from what I initially
 proposed in the thread above. I introduced a new class called
 AttributedToken, which contains the same termBuffer logic 
 from Token. In addition it has a lazily-initialized map of
 Class<? extends Attribute> -> Attribute. Attribute is also a
 new class in a new package, plus several implementations like
 PositionIncrementAttribute, PayloadAttribute, etc.
 Similar to my initial proposal is the prototypeToken() method
 which the consumer (e. g. DocumentsWriter) needs to call.
 The token is created by the tokenizer at the end of the chain
 and pushed through all filters to the end consumer. The 
 tokenizer and also all filters can add Attributes to the 
 token and can keep references to the actual types of the
 attributes that they need to read or modify. This way, when
 boolean nextToken() is called, no casting is necessary.
 I added a class called TestNewTokenStreamAPI which is not 
 really a test case yet, but has a static demo() method, which
 demonstrates how to use the new API.
 The reason to not merge Token and TokenStream into one class 
 is that we might have caching (or tee/sink) filters in the 
 chain that might want to store cloned copies of the tokens
 in a cache. I added a new class NewCachingTokenStream that
 shows how such a class could work. I also implemented a deep
 clone method in AttributedToken and a 
 copyFrom(AttributedToken) method, which is needed for the 
 caching. Both methods have to iterate over the list of 
 attributes. The Attribute subclasses itself also have a
 copyFrom(Attribute) method, which unfortunately has to down-
 cast to the actual type. I first thought that might be very
 inefficient, but it's not so bad. Well, if you add all
 Attributes to the AttributedToken that our old Token class
 had (like offsets, payload, posIncr), then the performance
 of the caching is somewhat slower (~40%). However, if you 
 add less attributes, because not all might be needed, then
 the performance is even slightly faster than with the old API.
 Also the new API is flexible enough so that someone could
 implement a custom caching filter that knows all attributes
 the token can have, then the caching should be just as 
 fast as with the old API.
 This patch is not nearly ready; there are lots of things 
 missing:
 - unit tests
 - change DocumentsWriter to use new API 
   (in backwards-compatible fashion)
 - patch is currently java 1.5; need to change before 
   committing to 2.9
 - all TokenStreams and -Filters should be changed to use 
   new API
 - javadocs incorrect or missing
 - hashcode and equals methods missing in Attributes and 
   AttributedToken
   
 I wanted to submit it already for brave people to give me 
 early feedback before I spend more time working on this.
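 For illustration, consumption in the spirit of the demo() described above
 (names are taken from this description and may differ across patch
 iterations):

{code}
AttributedToken token = stream.prototypeToken();  // consumer obtains the token
PositionIncrementAttribute posIncr = (PositionIncrementAttribute)
    token.getAttribute(PositionIncrementAttribute.class);  // cast once
while (stream.nextToken()) {
  int incr = posIncr.getPositionIncrement();      // no per-token casting
}
{code}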

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: TokenStream and Token APIs

2008-10-20 Thread Michael Busch

Grant Ingersoll wrote:


On Oct 19, 2008, at 7:08 PM, Michael Busch wrote:


Grant Ingersoll wrote:


On Oct 19, 2008, at 12:56 AM, Mark Miller wrote:


Grant Ingersoll wrote:


Bear with me, b/c I'm not sure I'm following, but looking at 
https://issues.apache.org/jira/browse/LUCENE-1422, I see at least 
5 different implemented Attributes.


So, let's say I add 5 more attributes and now have a total of 10 
attributes. Are you saying that I then would have, potentially, 10 
different variables that all point to the token as in the code 
snippet above where the casting takes place? Or would I just 
create a single Super attribute that folds in all of my new 
attributes, plus any other existing ones? Or, maybe, what I would 
do is create the 5 new attributes and then 1 new attribute that 
extends all 10, thus allowing me to use them individually, but 
saving me from having to do a whole ton of casting in my Consumer.
Potentially one consumer doing 10 things, but not likely right? I 
mean, things will stay logical as they are now, and rather than a 
super consumer doing everything, we will still have a chain of 
consumers each doing its own piece. So more likely, maybe something 
comes along every so often (another 5, over *much* time, say) and 
each time we add a Consumer that uses one or two TokenStream types. 
And then it's just an implementation detail on whether you make a 
composite TokenStream - if you have added 10 new attributes and see 
fit to make one consumer use them all, sure, make a composite, 
super type, but in my mind, the way it's done in the example code is 
clearer/cleaner for a handful of TokenStream types. And even if you 
do make the composite, super type, it's likely to just be a sugar 
wrapper anyway - the implementation for say, payload and positions, 
should probably be maintained in their own classes anyway.


Well, there are 5 different attributes already, all of which are 
commonly used.  Seems weird to have to cast the same var 5 different 
ways.  Definitely agree that one would likely deal with this by 
wrapping, but then you end up either needing to extend your wrapper 
or add new wrappers...


Well yes, there are 5 attributes, but in neither of the core 
tokenstreams and -filters that I changed in my patch did I have to 
use more than two or three of those. Currently the only attributes 
that are really used are PositionIncrementAttribute and 
PayloadAttribute. And the OffsetAttribute when TermVectors are turned 
on.


Even in the indexing chain currently we don't have a single consumer 
that needs all attributes. The FreqProxWriter needs positions and 
payloads, the TermVectorsWriter needs positions and offsets.



I have an application that uses all the attributes of a Token, or at 
least, almost all of them.  There are many uses for Lucene's analysis 
code that have nothing to do with indexing, Consumers or even Lucene.





Also, you don't have to cast the same variable multiple times. In the 
current patch you would call e. g. 
token.getAttribute(PayloadAttribute.class) and keep a reference to it 
in the consumer or filter.


IMO even calling getAttribute() 5 times or so and storing the 
references wouldn't be so bad. And if you really don't like it you 
could make a wrapper as you said. You also mentioned the 
disadvantages of the wrapper, e. g. that you would have to extend it 
to add new attributes. But then, isn't that the same disadvantage the 
current Token API has?


True.  I didn't say the idea was bad, in fact I mostly like it, I was 
just saying I'd like to explore how it would work in practice and the 
main thing that struck me was all the casting or all the references.  
Since it's likely that you only deal with a Token one at a time, 
you're right, it's probably not a big deal other than the code looks 
funny, IMO.





You could even use the new API in exactly the same way as the old one. 
Just create a subclass of Token that has all members you need and 
don't add any attributes.


So I think the new API adds more flexibility, and can still be used in 
the same way as the old one. I however think the recommended 
best practice should be to use the new attributes, for reusability of 
consumers that only need certain attributes.


Perhaps it would be useful for Lucene to offer exactly one subclass of 
Token that we guarantee will always have all known Attributes (i.e. 
the ones Lucene provides)  available to it for casting purposes.




Yeah we could do that. In fact, I did exactly this when I started 
working on this patch. I created a class called PlainToken, which had 
all the termBuffer and attributes logic, and changed Token to extend it. 
Then the new getToken() method would return an instance of PlainToken. 
My main concern with this approach is that it will make the code in the 
indexer more complicated, because it always has to check if we have a 
Token or PlainToken; if it's a Token then it has to use the get*() 
method directly, for a