Re: Proposal about Version API "relaxation"
On Wed, Apr 14, 2010 at 12:49:52AM -0400, Robert Muir wrote:

> its very unnatural for release 3.0 to be almost a no-op and for release 3.1
> to provide a new default index format and support for customizing how the
> index is stored. And now we are looking at providing flexibility in scoring
> that will hopefully redefine lucene from being a vector-space search engine
> library to something much more flexible? This is a minor release?!

I agree, but what really bothers me are the X.9 releases. 2.9 changed performance characteristics dramatically enough that it was a backwards-break in all but name for many users -- most prominently, Solr [1]. Solr's FieldCache RAM requirements doubled because of the transition to per-segment search, and 2.9's backwards compatibility layer in TokenStream was significantly slower.

In my opinion, the transition to per-segment search and new-style TokenStreams should have triggered a major version break. Had that been the case, less effort could have been spent on backwards compatibility shims and fewer API design compromises would have been necessary.

To avoid such costs in the future, and to communicate disruptions in the library to users more accurately via version numbers...

* There should not be a Lucene 3.9.
* Lucene 4.0 should do more than remove deprecations.

Marvin Humphrey

[1] Thanks to Robert and Mark Miller for reminding me via IRC just what the Solr/Lucene 2.9 problems were.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Proposal about Version API "relaxation"
On Wed, Apr 14, 2010 at 08:30:14AM -0400, Grant Ingersoll wrote:

> The thing I keep going back to is that somehow Lucene has managed for years
> (and I mean lots of years) w/o stuff like Version and all this massive back
> compatibility checking.

Non-constant global variables are an anti-pattern. Having a non-constant global determine library behavior which results in silent failure (search results degrade subtly, as opposed to e.g. an exception being thrown) is a particularly insidious anti-pattern.

In the Perl world, where modules are very heavily used thanks to CPAN, you're more likely to come across the action-at-a-distance bugs spawned by this anti-pattern. I have direct experience debugging such usage of global vars. It is extremely costly and frustrating.

For instance, there was one time when some module set the global variable $YAML::Syck::ImplicitUnicode to a true value. Whether or not that module was loaded affected how YAML::Syck's Load() function would interpret character data in completely unrelated portions of the code. As with subtly degraded search results, the result was silent failure (incorrect text stored in a database). It took many hours to hunt down what was going wrong, because the code that was causing the problem was nowhere near the code where the problem manifested. The authors of the affected code had done nothing wrong, aside from using a poorly designed module like YAML::Syck.

I am strongly opposed to using a global variable for versioning because I do not wish to impose such maddening debugging sessions on a handful of unlucky duckies who have done nothing wrong other than to choose Lucene as their search engine library.

This shouldn't be controversial. The temptations of global variables are obvious, but their flaws are well understood:

http://www.google.com/search?q=global+variables+evil

It is to be expected that the global would work most of the time. This design flaw, by nature, disproportionately afflicts a small number of users with action-at-a-distance bugs. Knowingly choosing to impose such costs on a random few is deeply unfair.

> I also am not sure whether it in the past we just missed/ignored more back
> compatibility issues or whether now we are creating more back compat. issues
> due to more rapid change.

It would be hard to search the archives to confirm my recollection, but I seem to remember back compat for Analyzers coming up every once in a while -- say, in the context of modifying StandardAnalyzer's stoplist -- and changes not being made because they would change search results.

Marvin Humphrey
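The action-at-a-distance hazard described above is easy to reproduce in miniature. A hypothetical sketch (all class names invented for illustration; this is not Lucene's actual Version API): a mutable static field in one class silently changes the behavior of code that never mentions it.

```java
// Hypothetical sketch of the action-at-a-distance bug pattern discussed
// above; the classes are invented for illustration.
class ParserConfig {
    // Non-constant global: any code anywhere in the process may flip it.
    static boolean lenient = false;
}

class Parser {
    // Behavior silently depends on global state set elsewhere.
    static int parse(String s) {
        if (ParserConfig.lenient && s.isEmpty()) {
            return 0; // silent "success" instead of an exception
        }
        return Integer.parseInt(s);
    }
}

public class GlobalStateDemo {
    public static void main(String[] args) {
        System.out.println(Parser.parse("42"));

        // Some unrelated module, loaded far away, flips the global...
        ParserConfig.lenient = true;

        // ...and now this call silently returns 0 where it used to throw.
        System.out.println(Parser.parse(""));
    }
}
```

Nothing at the second call site hints that behavior has changed; that is exactly the failure mode being objected to.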
Re: Proposal about Version API "relaxation"
On Tue, Apr 13, 2010 at 02:46:56PM -0400, Robert Muir wrote:

> Unlike other components in Lucene, Analyzers get hit the worst because any
> change is basically a break, and there's really not any other option besides
> Version to implement any backwards compatibility for them.

New class names would work, too. I only mention that for the sake of completeness, though -- it's not a suggestion.

> But things like index back compat seems kinda useless for analyzers anyway.
> If we improve them in nearly any way, you have to reindex with them to get
> the benefits.

I'm a little concerned about the issue DM Smith brought up: what happens when you have separate applications within the same JVM which have built indexes using separate versions of an Analyzer? That use case is supported under the current regime, but I'm not sure whether it would be with aggressively versioned Analyzer packages. If it's not, under what circumstances does that matter?

> I'd love to hear elaborations of any thoughts you have on how this could
> work.

Well, for Lucy, I think we may have addressed this problem with the new back compat policy we're auditioning with KS:

    KinoSearch spins off stable forks into new namespaces periodically. As of
    this release, the latest is "KinoSearch1", forked from version 0.165.
    Users who require strong backwards compatibility should use a stable
    fork.

    The main namespace, "KinoSearch", is an unstable development branch (as
    hinted at by its version number). Superficial API changes are frequent.
    Hard file format compatibility breaks which require reindexing are rare,
    as we generally try to provide continuity across multiple releases, but
    they happen every once in a while.

Essentially, we're free to break back compat within "Lucy" at any time, but we're not able to break back compat within a stable fork like "Lucy1", "Lucy2", etc. So what we'll probably do during normal development with Analyzers is just change them and note the break in the Changes file.

I doubt such a policy would be an option for Lucene, though. I think you'd have to go with separate jars for lucene-core and lucene-analyzers, possibly on independent release schedules. You'd have to bundle the broken ones with lucene-core until a major version break for bug compatibility, but the fixed ones could be distributed via lucene-analyzers concurrently.

Hmm, I suppose that doesn't work with the convention that the only difference between Lucene X.9 and Lucene Y.0 is the removal of deprecations. But if anything is crying out for a rethink in the Lucene back compat policy, IMO that's it: make major version breaks act like major version breaks and change stuff that needs changin'.

Marvin Humphrey
Re: Proposal about Version API "relaxation"
On Tue, Apr 13, 2010 at 11:17:56AM -0700, Andi Vajda wrote:

> Using global statics is flawed.

+1.

I wonder if it's possible to solve this problem for Analyzers by decoupling their distribution from the Lucene core and versioning them separately. I.e. remove MatchVersion and increment individual Analyzer version numbers instead.

This wouldn't solve the problem for good defaults elsewhere in the library. For that, I see no remedy other than more frequent major version increments.

Marvin Humphrey
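A minimal sketch of what versioning an individual Analyzer might look like, assuming an invented behaviorVersion constructor argument rather than any real Lucene API:

```java
// Sketch only: the class and its version argument are invented to
// illustrate per-Analyzer versioning, not a real Lucene interface.
class VersionedStopAnalyzer {
    private final int behaviorVersion;

    VersionedStopAnalyzer(int behaviorVersion) {
        this.behaviorVersion = behaviorVersion;
    }

    // Imagine revision 2 added "will" to the stoplist. Applications that
    // built their index against revision 1 keep constructing the analyzer
    // with 1 and see unchanged behavior -- no global state involved.
    boolean isStopWord(String token) {
        if (behaviorVersion >= 2 && token.equals("will")) {
            return true;
        }
        return token.equals("the") || token.equals("a");
    }
}
```

Because the version travels with each instance, two applications in the same JVM can use different Analyzer revisions side by side.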
Re: Changing the subject for a JIRA-issue (Was: [jira] Created: (LUCENE-2335) optimization: when sorting by field, if index has one segment and field values are not needed, do not load String[] into f
On Tue, Apr 06, 2010 at 11:26:23AM +0200, Toke Eskildsen wrote:

> The current subject and description of
> https://issues.apache.org/jira/browse/LUCENE-2335
> is obsolete due to new knowledge.
>
> Is it possible to change it? If not, what is the policy here? To open a
> new issue and close the old one?

No policy, per se. Here's my take, FWIW:

Do whatever it takes to keep the subject line of your messages in tune with the content of your messages. When they conflict, it becomes harder to follow the conversation, and our knowledge base degrades because relevant posts become harder to discover via search.

In this case, that would mean either closing this issue and opening a new one, or taking the discussion to the mailing list, where subject headers may be modified as the conversation evolves.

Marvin Humphrey
[jira] Commented: (LUCENE-2345) Make it possible to subclass SegmentReader
[ https://issues.apache.org/jira/browse/LUCENE-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850356#action_12850356 ]

Marvin Humphrey commented on LUCENE-2345:
-----------------------------------------

> Is there a ticket or wiki page that details the "plugin" architecture/design
> so i could take a look?

FWIW, KinoSearch has a complete prototype implementation of this design, based loosely on the mailing list conversations that Earwin referred to.

* SegReader and SegWriter are both composites with minimal APIs.
* All subcomponents subclass either DataWriter or DataReader.
* The Architecture class (under KinoSearch::Plan) determines which plugins get loaded.

[http://www.rectangular.com/svn/kinosearch/trunk/core/]

> Make it possible to subclass SegmentReader
> ------------------------------------------
>
>                 Key: LUCENE-2345
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2345
>             Project: Lucene - Java
>          Issue Type: Wish
>          Components: Index
>            Reporter: Tim Smith
>             Fix For: 3.1
>         Attachments: LUCENE-2345_3.0.patch, LUCENE-2345_3.0.plugins.patch
>
> I would like the ability to subclass SegmentReader for numerous reasons:
> * to capture initialization/close events
> * attach custom objects to an instance of a segment reader (caches,
>   statistics, so on and so forth)
> * override methods on segment reader as needed
>
> Currently this isn't really possible.
>
> I propose adding a SegmentReaderFactory that would allow creating custom
> subclasses of SegmentReader. The default implementation would be something
> like:
>
> {code}
> public class SegmentReaderFactory {
>   public SegmentReader get(boolean readOnly) {
>     return readOnly ? new ReadOnlySegmentReader() : new SegmentReader();
>   }
>
>   public SegmentReader reopen(SegmentReader reader, boolean readOnly) {
>     return new SegmentReader(readOnly);
>   }
> }
> {code}
>
> It would then be made possible to pass a SegmentReaderFactory to IndexWriter
> (for pooled readers) as well as to SegmentReader.get() (DirectoryReader.open,
> etc.).
>
> I could prepare a patch if others think this has merit.
>
> Obviously, this API would be "experimental/advanced/will change in future".

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
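To make the shape of the proposed extension point concrete, here is a self-contained sketch. The real SegmentReader cannot be constructed like this; the stub classes below are invented so that a factory subclass capturing open events (one of the wishes in the issue) is runnable.

```java
// Stub stand-ins for the real classes, invented for this sketch.
class SegmentReader {
    boolean readOnly() { return false; }
}

class ReadOnlySegmentReader extends SegmentReader {
    @Override boolean readOnly() { return true; }
}

// The factory proposed in the issue, in stub form.
class SegmentReaderFactory {
    SegmentReader get(boolean readOnly) {
        return readOnly ? new ReadOnlySegmentReader() : new SegmentReader();
    }
}

// A subclass capturing initialization events -- the kind of hook the
// issue asks for. Name and behavior are hypothetical.
class CountingSegmentReaderFactory extends SegmentReaderFactory {
    int opened = 0;

    @Override SegmentReader get(boolean readOnly) {
        opened++;
        return super.get(readOnly);
    }
}
```

An IndexWriter or DirectoryReader.open() accepting such a factory would then transparently hand out the custom subclass.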
Re: Baby steps towards making Lucene's scoring more flexible...
On Thu, Mar 25, 2010 at 06:24:34AM -0400, Michael McCandless wrote:

> > Maybe aggressive automatic data-reduction makes more sense in the context
> > of "flexible matching", which is more expansive than "flexible scoring"?
>
> I think so. Maybe it shouldn't be called a Similarity (which to me
> (though, carrying a heavy curse of knowledge burden...) means
> "scoring")? Matcher?

I think we can express the difference between your proposed approach for Lucene Similarity (no effect on index) and my proposed approach for Lucy Similarity (aggressive index-time data reduction) by putting Lucy's Similarity under Lucy::Index instead of Lucy::Search.

Marvin Humphrey
Re: Baby steps towards making Lucene's scoring more flexible...
...because not only can one thing's relevance to another be continuously variable (i.e. score), it can also be binary: relevant/not-relevant (i.e. match). But I don't see why "Relevance", "Matcher", or anything else would be so much better than "Similarity". I think this is your hang-up. ;)

> > I'm +0 (FWIW) on search-time Sim settability for Lucene. It's a nice
> > feature, but I don't think we've worked out all the problems yet. If we
> > can, I might switch to +1 (FWIW).
>
> What problems remain, for Lucene?

Storage, formatting, and compression of boosts.

I'm also concerned about making significant changes to the file format when you've indicated they're "for starters". IMO, file format changes ought to clear a higher bar than that. But I expect you to dissent on that point.

Marvin Humphrey
Re: #lucene IRC log [was: RE: lucene and solr trunk]
On Tue, Mar 23, 2010 at 01:30:42PM -0700, Otis Gospodnetic wrote:

> Archiving the logs feels like it would be useful, but realistically
> speaking, they would be pretty big and who has the time to read them after
> the fact?

Someone who participated in the chat, reviewing it while preparing a summary.

Marvin Humphrey
Re: Baby steps towards making Lucene's scoring more flexible...
...the user made when they spec'd MatchSimilarity. Saying that Lucy should keep the boost bytes under those circumstances is like saying that Lucene should outright ignore omitNorms() and always write boost bytes because users can't be trusted.

> > I meant that if you're writing out boost bytes, there's no sensible way to
> > execute the lossy data reduction and reduce the index size other than
> > having Sim do it.
>
> Right, Sim is the right class to do this. Heck one could even use
> boost nibbles... or, use float. This is an impl detail of the Sim
> class.

For Lucene, I think that makes sense, because the reduced form would be ephemeral. For Lucy, it's more complicated because the reduced data gets written to the index. Core Sim implementations should all use the same algorithm in order to minimize the complexity of the index file spec. However, it would be nice to offer an extension point enabling user-defined Sims to write non-standard formats.

> I think this all boils down to how important flexible scoring is --

Oh, please, Mike. Search-time settability for Similarity isn't the same thing as "flexible scoring". :(

Everybody thinks "flexible scoring" is important. Frankly, I think we're going to do a better job making "flexible scoring" available to our users because we're not going to make them fight through a thicket of jargon to get it.

> I'd like users to be able to try out different scoring at search
> time, even if it means "having to understand low level stuff" when
> setting their field types during indexing.
>
> You don't think flexible scoring is that important ("just reindex")
> and that it's not great to have users understand low level stats for
> indexing.

I'm +0 (FWIW) on search-time Sim settability for Lucene. It's a nice feature, but I don't think we've worked out all the problems yet. If we can, I might switch to +1 (FWIW). For Lucy, I'm -1 on search-time Sim settability, for a wide variety of reasons.

Whether to perform automatic data reduction based on Similarity choice, or to force the user to specify data reduction manually, is a separate issue.

Marvin Humphrey
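For concreteness, this is the kind of lossy quantization being discussed: Lucene's norms squeeze a float into eight bits with a tiny mantissa and exponent. The sketch below follows a 3-bit-mantissa/5-bit-exponent scheme in the spirit of Lucene's SmallFloat/Similarity.encodeNorm(); it illustrates the data reduction but is not presented as the exact shipping code.

```java
// Lossy "boost byte" codec sketch: 3-bit mantissa, 5-bit exponent,
// zero-exponent point of 15 (modeled on Lucene's SmallFloat).
public class BoostBytes {
    static byte encode(float f) {
        int bits = Float.floatToRawIntBits(f);
        int smallfloat = bits >> (24 - 3);            // keep sign, exponent, top 3 mantissa bits
        if (smallfloat <= ((63 - 15) << 3)) {
            return (bits <= 0) ? (byte) 0 : (byte) 1; // underflow: smallest representable values
        }
        if (smallfloat >= ((63 - 15) << 3) + 0x100) {
            return -1;                                // overflow: saturate
        }
        return (byte) (smallfloat - ((63 - 15) << 3));
    }

    static float decode(byte b) {
        if (b == 0) return 0.0f;
        int bits = (b & 0xff) << (24 - 3);
        bits += (63 - 15) << 24;                      // re-bias the exponent
        return Float.intBitsToFloat(bits);
    }
}
```

The 3-bit mantissa is exactly why nearby boosts collapse to the same byte; "boost nibbles" would simply shrink the mantissa/exponent budget further.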
Re: Baby steps towards making Lucene's scoring more flexible...
...the more important it is for them to buy optimization seminars where Lucene gurus explain all the obscure incantations to them. :)

> > You seem to be fixated on the notion of swapping in a MatchOnlySim object
> > at search time. You can't do that in KS/Lucy, because you can't modify a
> > Schema at search-time, and the per-field Similarity assignments are part
> > of the Schema. But *it doesn't matter* because you don't need a
> > MatchOnlySim to do doc-id-only postings iteration -- an
> > AllBellsAndWhistlesScoringSim can spawn a doc-id-only PostingDecoder just
> > as easily as MatchOnlySim can.
>
> I am fixated because it's a glaring example (to me) of what's wrong
> with forcing user to commit to how scoring is going to happen, at
> index time, for that field.

Haha, well that would sure suck if it didn't work! But I'm telling you it's no problem.

> And I'm still confused on how this'll work in Lucy -- if in my global
> write-once Lucy scheme I bind a field during indexing to
> AllBellsAndWhistlesScoringSim... then at search time, sure, it can
> spawn a doc-id-only PostingDecoder... so that does mean I can do
> match-only searching using that, somehow?

Of course. Lucene can't do that? No way, that can't be right! I've gotta be missing something. (Though I guess that would explain the fixation on needing a different Sim.)

Needing a special Sim for match-only seems like an absurd limitation -- I mean, the doc id data is there, and you don't need scores. You've gotta be able to fake it at least.

> (Ie I can't change the field to MatchOnlySim, but, I have some workaround
> that lets me achieve the same functionality...?)

It's not a workaround. Things just work that way. Without getting into the gory details... if you're not calculating a score, you don't need Similarity's functionality. If Lucene still needs a Sim object despite not needing its functionality, that's just an accident of the OO design, and it so happens that our "loose C" port doesn't have the same quirk.

Marvin Humphrey
Re: Baby steps towards making Lucene's scoring more flexible...
> > ...the boost bytes are baked into the index. But you can still do
> > doc-id-only posting iteration against any posting format since doc-id-only
> > is the minimum requirement for a posting list.
> >
> > So your question is predicated on the assumption that you need a
> > doc-id-only Similarity to do doc-id-only postings iteration, but that's
> > not true -- you need a doc-id-only PostingDecoder, which may be spawned by
> > any Similarity.
> >
> > Does that make sense?
>
> It sounds like... if the user had used AllBellsAndWhistlesScoringSim
> while indexing, they will still be able to use MatchOnlySim while
> searching because under-the-hood MatchOnlySim knows how to pull a
> docID only postings iterator from that field.

You seem to be fixated on the notion of swapping in a MatchOnlySim object at search time. You can't do that in KS/Lucy, because you can't modify a Schema at search-time, and the per-field Similarity assignments are part of the Schema. But *it doesn't matter* because you don't need a MatchOnlySim to do doc-id-only postings iteration -- an AllBellsAndWhistlesScoringSim can spawn a doc-id-only PostingDecoder just as easily as MatchOnlySim can.

Marvin Humphrey
[jira] Commented: (LUCENE-2316) Define clear semantics for Directory.fileLength
[ https://issues.apache.org/jira/browse/LUCENE-2316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844906#action_12844906 ]

Marvin Humphrey commented on LUCENE-2316:
-----------------------------------------

Is it really necessary to obtain the length of a file from the Directory? Lucy doesn't implement that functionality, and we haven't missed it -- we're able to get away with using the length() method on InStream and OutStream. I see that IndexInput and IndexOutput already have length() methods. Can you simply eliminate all uses of Directory.fileLength() within core and deprecate it without introducing a new method?

> Define clear semantics for Directory.fileLength
> -----------------------------------------------
>
>                 Key: LUCENE-2316
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2316
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Shai Erera
>            Priority: Minor
>             Fix For: 3.1
>
> On this thread:
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/201003.mbox/%3c126142c1003121525v24499625u1589bbef4c079...@mail.gmail.com%3e
> it was mentioned that Directory's fileLength behavior is not consistent
> between Directory implementations if the given file name does not exist.
> FSDirectory returns a 0 length while RAMDirectory throws FNFE.
>
> The problem is that the semantics of fileLength() are not defined. As
> proposed in the thread, we'll define the following semantics:
> * Returns the length of the file denoted by name if the file exists. The
>   return value may be anything between 0 and Long.MAX_VALUE.
> * Throws FileNotFoundException if the file does not exist. Note that you can
>   call dir.fileExists(name) if you are not sure whether the file exists or
>   not.
>
> For backwards compatibility we'll create a new method w/ clear semantics.
> Something like:
>
> {code}
> /**
>  * @deprecated the method will become abstract when #fileLength(name) has
>  * been removed.
>  */
> public long getFileLength(String name) throws IOException {
>   long len = fileLength(name);
>   if (len == 0 && !fileExists(name)) {
>     throw new FileNotFoundException(name);
>   }
>   return len;
> }
> {code}
>
> The first line just calls the current impl. If it throws an exception for a
> non-existing file, we're ok. The second line verifies whether a 0 length is
> for an existing file or not and throws an exception appropriately.
Re: Baby steps towards making Lucene's scoring more flexible...
...to describe how the field will be scored. Based on that info, we can customize the posting format, possibly making optimizations and omitting certain posting data.

When people ask on the user list...

    "How can I make my index smaller?"

... we can reply like so:

    "Make some fields match-only by specifying MatchSimilarity in the
    FieldType, or even better, if you don't need phrase queries, by
    specifying MinimalSimilarity. You'll be throwing away data Lucy needs
    for sophisticated queries, but your index will get smaller."

I think that response is easier to understand than a response instructing them to "enable omitNorms", and it introduces the very important Similarity class rather than the confusing, overloaded, and not-very-useful terminology, "norms".

>>>> They could use better codecs under the format-follows-Similarity model,
>>>> too. They'd just have to subclass and override the factory methods that
>>>> spawn posting encoders/decoders.
>>>
>>> Ahh, OK so that's how they'd do it.
>>>
>>> So... I think we're making a mountain out of a molehill.
>>
>> Well, I don't see it that way, because I place great value on designing
>> good public APIs, and I think it's important that we avoid forcing users
>> to know about codecs.
>
> I had thought we were bickering about whether you subclass & override
> a method (to alter the codec) (= Lucy) vs you create your own
> Codec/CodecProvider and pass that to your writer, which seems a
> minor difference.
>
> If the user is not tweaking the codec, they don't have to do anything
> with codecs (the defaults work) for either Lucy or Lucene.
>
> So the only difference is the specifics of how the codec-tweaking-user
> in fact alters the codec.

I don't think that's the only difference. What does the novice user know about "PFOR", about "pulsing", about "group varint", etc.? They aren't going to know jack. So how are you expecting them to distinguish between various Codec subclasses named after those high-falutin' concepts?

The difference is that you're forcing the novice user to learn esoteric material just to get started, while the format-follows-Sim model is trying to spare the novice yet enable the expert. Users shouldn't have to distinguish between "codecs" until they are actually ready to write their own.

As we discussed on IRC yesterday, the number of people who will be qualified to write posting codec code will still be very small, even after we finish this democratization push. It will be a big step forward if we can just get more Lucene committers to grok the inner workings of posting lists. However, there are some very useful optimizations that will be underutilized by the user base if the public API uses jargon like "omitTFAP" and "PFORCodec" that shuts out everyone except elite developers.

>> Under format-follows-Sim, it would be the Similarity object that knows all
>> supported decoding configurations for the field.
>
> I'm still hazy on how you'll know at search time which Sims are
> "congruent" with what's stored in the index, ie that downgrading to
> MatchOnlySim is allowed, but swapping to a different scoring model is
> not (because norms are committed at indexing time).

I'm not sure that e.g. TermScorer would even know what Similarity it was dealing with. It would ask for a boost-byte decoder from the sim, but it wouldn't have to know or care how the boost bytes got translated to float multipliers.

Under Lucy, you can't switch to a different weighting model at search time because the boost bytes are baked into the index. But you can still do doc-id-only posting iteration against any posting format since doc-id-only is the minimum requirement for a posting list.

So your question is predicated on the assumption that you need a doc-id-only Similarity to do doc-id-only postings iteration, but that's not true -- you need a doc-id-only PostingDecoder, which may be spawned by any Similarity.

Does that make sense?

Marvin Humphrey
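The "any Similarity can spawn a doc-id-only decoder" argument can be sketched in a few lines. All names below (PostingDecoder, Postings, AllBellsAndWhistlesScoringSim) are invented to mirror the Lucy design being described; they are not real Lucene or Lucy APIs.

```java
// Hypothetical sketch: match-only iteration never requires a special
// match-only Similarity, because doc ids are the minimum any posting
// format stores.
interface PostingDecoder {
    boolean next();
    int docId();
}

// Invented posting-list layout: doc ids plus per-doc boost bytes.
class Postings {
    final int[] docs;
    final byte[] boosts;
    Postings(int[] docs, byte[] boosts) {
        this.docs = docs;
        this.boosts = boosts;
    }
}

// Reads only the doc ids, ignoring whatever scoring data sits alongside.
class DocIdOnlyDecoder implements PostingDecoder {
    private final int[] docs;
    private int i = -1;
    DocIdOnlyDecoder(int[] docs) { this.docs = docs; }
    public boolean next() { return ++i < docs.length; }
    public int docId() { return docs[i]; }
}

// A full scoring Similarity can still hand back a doc-id-only decoder.
class AllBellsAndWhistlesScoringSim {
    PostingDecoder docIdOnlyDecoder(Postings postings) {
        return new DocIdOnlyDecoder(postings.docs);
    }
}
```

The decoder simply skips the boost bytes; no MatchOnlySim ever enters the picture.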
[jira] Commented: (LUCENE-2308) Separately specify a field's type
[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844688#action_12844688 ]

Marvin Humphrey commented on LUCENE-2308:
-----------------------------------------

> Also creating a FieldType with args like
> new FieldType(true, false, false) isn't really readable. Agreed

Another option would be a "flags" integer and bitwise constants:

{code}
FieldType type = new FieldType(analyzer, FieldType.INDEXED | FieldType.STORED);
{code}

> It would be nice if we could do something similar to IndexWriterConfig
> (LUCENE-2294), where you use incremental ctor/setters to set up the
> configuration but then once it's used ("bound" to a Field), it's
> immutable.

I bet that'll be more popular than flags, but I thought it was worth bringing it up anyway. :)

> Separately specify a field's type
> ---------------------------------
>
>                 Key: LUCENE-2308
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2308
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael McCandless
>
> This came up from discussions on IRC. I'm summarizing here...
>
> Today when you make a Field to add to a document you can set things:
> indexed or not, stored or not, analyzed or not, details like omitTfAP,
> omitNorms, index term vectors (separately controlling offsets/positions),
> etc.
>
> I think we should factor these out into a new class (FieldType?). Then you
> could re-use this FieldType instance across multiple fields. The Field
> instance would still hold the actual value.
>
> We could then do per-field analyzers by adding a setAnalyzer on the
> FieldType, instead of the separate PerFieldAnalyzerWrapper (likewise for
> per-field codecs (with flex), where we now have PerFieldCodecWrapper).
>
> This would NOT be a schema! It's just refactoring what we already specify
> today. EG it's not serialized into the index.
>
> This has been discussed before, and I know Michael Busch opened a more
> ambitious (I think?) issue. I think this is a good first baby step. We
> could consider a hierarchy of FieldType (NumericFieldType, etc.) but maybe
> hold off on that for starters...
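The "incremental setters, then immutable once bound" idea quoted above can be sketched as a freeze-on-bind object. This is a hypothetical illustration patterned on IndexWriterConfig's approach; the class and method names are invented, not the eventual Lucene API.

```java
// Hedged sketch of a settable-until-bound FieldType. Invented names.
class FieldTypeSketch {
    private boolean indexed, stored;
    private boolean frozen;

    FieldTypeSketch setIndexed(boolean v) { checkMutable(); indexed = v; return this; }
    FieldTypeSketch setStored(boolean v)  { checkMutable(); stored = v; return this; }

    // Called when the type is bound to a Field; after this, no more edits.
    void freeze() { frozen = true; }

    private void checkMutable() {
        if (frozen) {
            throw new IllegalStateException("FieldType is bound and immutable");
        }
    }

    boolean indexed() { return indexed; }
    boolean stored()  { return stored; }
}
```

Chained setters keep construction readable (unlike positional booleans), while freezing preserves the safety of sharing one type instance across many fields.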
Re: [jira] Commented: (LUCENE-2308) Separately specify a field's type
On Fri, Mar 12, 2010 at 03:01:27PM -0500, Mark Miller wrote:

> Committers are competant in different areas of the code. Even mike
> wasn't big into the search side until per segment. Commiters are
> trusted to mess with the pieces they know.

Absolutely. I wouldn't expect every committer to understand the gory details of posting formats, and I've been a little caught off guard by the blowback from what I thought was an innocuous observation.

But by the same token, I wouldn't expect our users to have sufficient expertise to understand all the variants of omit*() either. This stuff oughtta be implementation details.

> I don't see anyone even remotely suggesting that users should have to
> understand all of the implications of posting format modifications.

That's what omitTFAP() and omitNorms() do, though. And as Mike pointed out in the "baby steps" thread, omitTFAP() is often misunderstood.

Marvin Humphrey
[jira] Commented: (LUCENE-2308) Separately specify a field's type
[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844659#action_12844659 ]

Marvin Humphrey commented on LUCENE-2308:
-----------------------------------------

I'm simply suggesting that the proposed API is too hard to understand. Most users know whether their fields can be "match-only" but have no idea what TFAP is. And even advanced users will have difficulty understanding all the implications for matching and scoring when they selectively disable portions of the posting format.

I'm not a fan of omitTFAP, omitTF, omitNorms, omitPositions, or omit(flags). Something that ordinary users can grok would be used more often and more effectively.

> Separately specify a field's type
> ---------------------------------
>
>                 Key: LUCENE-2308
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2308
[jira] Commented: (LUCENE-2308) Separately specify a field's type
[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844637#action_12844637 ] Marvin Humphrey commented on LUCENE-2308: - If you disable term freq, you also have to disable positions. The "freq" tells you how many positions there are. I think it's asking an awful lot of our users to require that they understand all the implications of posting format modifications when committers have difficulty mastering all the subtleties.
[jira] Commented: (LUCENE-2308) Separately specify a field's type
[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844626#action_12844626 ] Marvin Humphrey commented on LUCENE-2308: - I think we might consider matchOnly() instead of omitNorms(). If a field is "match only", we don't need boost bytes a.k.a. "norms" because they are only used as a scoring multiplier. Haven't got a good synonym for "omitTFAP", but I'd sure like one.
Re: Baby steps towards making Lucene's scoring more flexible...
On Mon, Mar 08, 2010 at 02:10:35PM -0500, Michael McCandless wrote: > We ask it to give us a Codec. There's a conflict between the segment-wide role of the "Codec" class and its role as specifier for posting format. In some sense, you could argue that the "codec" reads/writes the entire index segment -- which includes not only postings files, but also stored fields, term vectors, etc. However, the compression algorithms after which these codecs are named have nothing to do with those other files. PFORCodec isn't relevant to stored fields. I'd argue for limiting the role of "Codec" to encoding and decoding posting files. As far as modularizing other aspects of index reading and writing, I don't think a simple factory is the way to go. I favor using a composite design pattern for SegWriter and SegReader (rather than subclassing), and an initialization phase controlled by an Architecture object. It was Earwin Burrfoot who persuaded me of the merits of a user-defined initialization phase over a user-defined factory method: <http://markmail.org/message/ukhcvp2ydfxpcg7q>. > So far my fav is still CodecProvider ;) It seems that the primary reason this object is needed is that IndexReader needs to be able to find the right decoder when it encounters an unfamiliar codec name. Since the core doesn't know about user-created codecs, it's necessary for the user to register the name => codec pairing in advance so that core can find it. If that's this object's main role, I'd suggest "CodecRegistry". > Naming is the hardest part!! For me, the hardest parts of API design are... A) Designing public abstract classes / interfaces. B) Compensating for the curse of knowledge. Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
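A name-to-decoder registry along those lines might look like the following minimal sketch. The `CodecRegistry` and `Codec` names here are illustrative only (not actual flex APIs): the user registers the name => codec pairing in advance, and the reader looks the decoder up when it encounters that name in a segment.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a codec registry: user-created codecs are
// registered by name in advance, so that an IndexReader encountering
// an unfamiliar codec name in a segment can still find its decoder.
public class CodecRegistry {
    // Placeholder for whatever the real posting-codec interface would be.
    public interface Codec {
        String getName();
    }

    private final Map<String, Codec> codecs = new HashMap<>();

    public void register(String name, Codec codec) {
        codecs.put(name, codec);
    }

    public Codec lookup(String name) {
        Codec codec = codecs.get(name);
        if (codec == null) {
            throw new IllegalArgumentException("Unknown codec: " + name);
        }
        return codec;
    }
}
```

The failure mode is then an explicit exception at open time rather than a silent misread of the postings.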
Re: Baby steps towards making Lucene's scoring more flexible...
rity object that knows all supported decoding configurations for the field. > > You don't want to use the stronger, more constrictive check, right? > > You mean single inheritance? No. Because then we hardwire the attrs > to the Codec. Standard codec should encode whatever attrs the app > hands us... I think. I might approach things the same way if Clownfish supported interface method dispatch. :) As it is, though, I'm not sure that the single inheritance requirement is an important liability. Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Multi-node stats within individual nodes (was "Baby steps...")
On Tue, Mar 09, 2010 at 01:04:19PM -0500, Michael McCandless wrote: > BM25 needs the field length in tokens. lnu.ltc needs avg(tf). These > 2 stats seem to the "common" ones (according to Robert). So I want to > start with them. OK, interesting. > > I don't know that compressing the raw materials is going to work as well as > > compressing the final product. Early quantization errors get compounded > > when > > used in later calculations. > > I would not compress for starters... How about lossless compression, then? Do you need random access into this specialized posting list? For the use cases you've described so far I don't think so, since you're just iterating it top to bottom on segment open. You could store the total length of the field in tokens and the number of unique terms as integers, compressing with vbyte, PFOR or whatever... then divide at search time to get average term frequency. That way, you also avoid committing to a float encoding, which I don't think Lucene has standardized yet. Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
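A sketch of that scheme (class and method names are illustrative, not a real Lucene API): write the two integers with the classic VInt variable-byte encoding, and derive average term frequency by division at search time, so no float encoding is ever committed to disk.

```java
import java.io.ByteArrayOutputStream;

// Sketch: encode total field length in tokens and unique term count
// as VInts; avg term frequency is derived at search time by division.
public class FieldStats {
    // Classic variable-byte (VInt) encoding: 7 bits per byte,
    // high bit set on all but the final byte.
    static void writeVInt(ByteArrayOutputStream out, int i) {
        while ((i & ~0x7F) != 0) {
            out.write((i & 0x7F) | 0x80);
            i >>>= 7;
        }
        out.write(i);
    }

    // Decode a VInt starting at pos[0]; pos[0] is advanced past it.
    static int readVInt(byte[] buf, int[] pos) {
        int b = buf[pos[0]++];
        int i = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = buf[pos[0]++];
            i |= (b & 0x7F) << shift;
        }
        return i;
    }

    // Average term frequency computed lazily from the stored integers.
    public static float avgTermFreq(int totalTokens, int uniqueTerms) {
        return (float) totalTokens / uniqueTerms;
    }
}
```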
Re: Baby steps towards making Lucene's scoring more flexible...
On Tue, Mar 09, 2010 at 05:06:08AM -0500, Michael McCandless wrote: > > For what it's worth, that's sort of the way KS used to work: > > Schema/FieldType > > information was stored entirely in source code. That's changed and now we > > serialize the whole schema including all Analyzers, but source-code-only is > > a > > viable approach. > > Hmm but KS still somehow enforced strong typing across indexing > sessions? Nope, it wasn't enforced. > You said "of course" before but... how in your proposal could one > store all stats for a given field during indexing, but then sometimes > use match-only and sometimes full-scoring when querying against that > field? The same way that Lucene knows that sometimes it needs a docs-only-enum and sometimes it needs a docs-and-positions enum. Sometimes you need scores, sometimes you don't. > >> If user switches up their codec then they'll need to ensure it also > >> stores stats required by their Sim(s). > > > > That's backwards, IMO. > > I'm still baffled. If I wanna play a movie on my 1080P monitor I'll > need to find a movie that was encoded hidef (ie, bluray not dvd). > > I mean, I don't have to. DVD content will play fine still... just > degraded quality. Heh. Consumers hate format wars. In this case, though, we're dealing with software, not DVD hardware, so upgrading is a lot easier. Under the format-follows-Similarity model, the relationship between Similarity and posting format is more akin to the relationship between a container format like Quicktime and codecs like Sorenson 3 or H.264. Tweakers will want to go in and monkey with the choice of codec within the Quicktime file, but most users will just trust us to use the latest and greatest. > > The posting format encoding should be an implementation detail. The general > > user should be expressing their intent as far as how they want the field to > > be > > scored, and the posting format should flow from that. 
> > Maybe it's that it bothers you that with this proposed changed the > user makes 2 decisions -- Codec and Sim? Yes, and it bothers me that users have to know about codecs at all, when in the vast majority of cases it doesn't matter because the default is going to be the best choice. Since compression algorithm performance depends on knowing how to exploit patterns in the data and sometimes the user will know about patterns that are opaque to us, in some circumstances they will be able to select a more appropriate codec. But that's not the common case, as it requires both unusual data and an unusually sophisticated user. What users will be able to tell us is how they want the field to be used, and we can use that information to help us optimize. For example, when a user declares that they want a field to be "match-only", we know we don't have to write boost bytes, freq or positions, saving space. > Ie user will choose PFor or Standard or Pulsing(PFor/Standard) codec, and > then separately choose Sim? > > But these are important choices. They should be separate. Why > force-bundle them? Because most of the time the user isn't going to be able to improve on the default. > > Whether we use VInt, PFOR, group varint, hand-tuned bit shifting, etc under > > the hood to implement BM25, match-only, boost-per-position or whatever > > shouldn't be the user's concern. As time goes on, we should allow ourselves > > the flexibility to use new compression techniques to write new segments. > > But w/ the proposed change Lucene users will be free to use better > codecs? They could use better codecs under the format-follows-Similarity model, too. They'd just have to subclass and override the factory methods that spawn posting encoders/decoders. > Are you worried about proper defaulting? We'll handle that > (under Version). I don't think it's necessary or desirable to handle this with Version. 
A codec improvement (say, encoding match-only fields using PFOR instead of VInts) would simply trigger an index format number increment, and new segments would be written using the latest format. > > There's no difference between calling enum.nextPosition() and > > positions.next(), is there? > > Right now it's a 2 step process when you access via attr -- first you > ask the enum to next(), then you ask each attr associated w/ that enum > for their value. OK, I think I see where the limitation arises. In Lucy/KS, we'd just access the positions value as a member variable (direct struct access) rather than invoking a method. By default, struct definitions are opaque and thus member vars are inaccessible (to encourage loose cou
Re: Multi-node stats within individual nodes (was "Baby steps...")
On Mon, Mar 08, 2010 at 02:23:47PM -0500, Michael McCandless wrote: > For a large index the stats will be stable after re-indexing only a > few more docs. Well, not if there's been huge churn on other nodes in the interim. > No... the stat is avg tf within the doc. Don't you need the *total* field length -- not just the average tf -- for the docXfield in question to perform length normalization? Or is average term frequency within the docXfield a BM25-specific precursor that you are using as an example stat? > So if I index this doc: > > a a a a b b b c c d > > The avg(tf) = average(4 3 2 1) = 2.5. > > So we'd store 2.5 for that docXfield in a fixed-width dense postings > list (like column stride fields -- every doc has a value). Like column-stride fields, but also analogous to the current "norms" -- only with 4x the space requirements. That is, unless you compress that float down to a byte, as is currently done with the norm (3 bit mantissa, 5 bit exponent). The generation of a "norm" byte involves some pretty intense lossy data-reduction. If you're going to store the pre-data-reduction raw materials, you're going to incur a space penalty unless you can eke out similar savings somewhere. The coarse quantization is justified because we only care about big differences at search-time. If two documents are judged as reasonably close to each other in relevance, the order in which they rank isn't important. It's only when docs are judged as far apart in relevance that their relative rank order matters. I don't know that compressing the raw materials is going to work as well as compressing the final product. Early quantization errors get compounded when used in later calculations. BTW, I think we should refer to these bytes as "boost bytes" rather than "norms". Their purpose is not simply to convey length normalization; they also include document boost and field boost. And the length normalization multiplier is a kind of boost... 
so "boost byte" has everything covered, and avoids the overloading of the term "norm". Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
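The "pretty intense lossy data-reduction" referenced above (3-bit mantissa, 5-bit exponent) can be made concrete with a self-contained sketch modeled on Lucene's SmallFloat float-to-byte scheme; consult the real SmallFloat for the shipping code.

```java
// Sketch of the 3-bit-mantissa / 5-bit-exponent quantization used for
// norms a.k.a. "boost bytes", modeled on Lucene's SmallFloat.
public class BoostByte {
    // Compress a float into one byte: keep the top 3 mantissa bits and
    // rebase the exponent so 5 bits suffice (zero-exponent point 15).
    public static byte floatToByte315(float f) {
        int bits = Float.floatToRawIntBits(f);
        int smallfloat = bits >> (24 - 3);
        if (smallfloat <= ((63 - 15) << 3)) {
            return (bits <= 0) ? (byte) 0 : (byte) 1; // underflow
        }
        if (smallfloat >= ((63 - 15) << 3) + 0x100) {
            return -1; // overflow: clamp to largest representable value
        }
        return (byte) (smallfloat - ((63 - 15) << 3));
    }

    // Expand the byte back into a float. Many nearby floats map to the
    // same byte -- the coarse quantization discussed above.
    public static float byte315ToFloat(byte b) {
        if (b == 0) return 0.0f;
        int bits = (b & 0xff) << (24 - 3);
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }
}
```

The coarseness is visible directly: values as far apart as 1.0 and 1.05 collapse to the same byte, which is exactly why early quantization errors compound if reused in later calculations.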
Re: Baby steps towards making Lucene's scoring more flexible...
ldn't be the user's concern. As time goes on, we should allow ourselves the flexibility to use new compression techniques to write new segments. > > Just a thought: why not make positions an attribute on a DocsEnum? > > Maybe... though I think the double method call (enum.next() then > posAttr.get()) is too much added cost. Why wouldn't it work to have the consumer extract the positions attribute from the DocsEnum during construction? There's no difference between calling enum.nextPosition() and positions.next(), is there? Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
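The construction-time extraction being proposed can be sketched as follows. All names here are hypothetical stand-ins (not flex APIs): the consumer fetches the positions attribute once, up front, and after each next() the attribute already holds the current value, so no second method call is needed per step.

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch: pull the positions attribute from the enum once
// during construction, then read its updated value after each next().
public class PositionsDemo {
    public static class PositionsAttribute {
        private int current;
        void set(int p) { current = p; }
        public int get() { return current; }
    }

    public static class MockDocsEnum {
        private final Iterator<int[]> postings; // each entry: {docID, position}
        private final PositionsAttribute posAttr = new PositionsAttribute();
        private int doc = -1;

        public MockDocsEnum(List<int[]> postings) {
            this.postings = postings.iterator();
        }

        // Consumer extracts this once, at construction time.
        public PositionsAttribute getPositionsAttribute() { return posAttr; }

        public boolean next() {
            if (!postings.hasNext()) return false;
            int[] entry = postings.next();
            doc = entry[0];
            posAttr.set(entry[1]); // attr updated in place as we advance
            return true;
        }

        public int doc() { return doc; }
    }
}
```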
Re: Baby steps towards making Lucene's scoring more flexible...
"normal" postings iterator. As to whether we expose the part-of-speech via an attribute or via a method, that's up in the air. Hmm. From a class-design perspective, it would probably be best to go with an attribute, since Lucy has only single-inheritance and no interfaces. A rigid class hierarchy is going to cause problems when you need an iterator that combines unrelated concepts like BM25 weighting and part-of-speech tagging. > In flex you'd get a "normal" DocsAndPositionsEnum, pull the POS attr > up front, and as you're next'ing your way through it, optionally look > up the POS of each position you step through, using the POS attr. Just a thought: why not make positions an attribute on a DocsEnum? Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Multi-node stats within individual nodes (was "Baby steps...")
On Sat, Mar 06, 2010 at 05:07:18AM -0500, Michael McCandless wrote: > > Fortunately, beaming field length data around is an easier problem than > > distributed IDF, because with rare exceptions, the number of fields in a > > typical index is miniscule compared to the number of terms. > > Right... so how do we control/configure when stats are fully > recomputed corpus wide hmmm. Should be fully app controllable. Hmm, at first, I don't like the sound of that. Right now, we're talking about an esoteric need for a specific plugin, BM25 similarity. The top level indexer object should be oblivious to the implementation details of plugins. However, the theme here is the need for an individual node to sync up with the distributed corpus. If you don't do that at index time, you have to do it at search time, which isn't always ideal. So I can see us building in some sort of functionality to address that more general case. It would be the flip of the MultiSearcher-comprised-of-remote-searchables situation. > > I guess you'd want to accumulate that average while building the segment... > > oh wait, ugh, deletions are going to make that really messy. :( > > > > Think about it for a sec, and see if you swing back to the desirability of > > calculation on the fly using maxDoc(), like I just did. > > I think we'd store a float (holding avg(tf) that you computed when > inverting that doc, ie, for all unique terms in the doc what's the avg > of their freqs) for every doc, in the index. Then we can regen fully > when needed right? Hmm, full regeneration would be expensive, so I'd discounted it. You'd have to iterate the entire posting list for every term, adding up freq() while skipping deleted docs. > Or maybe we store sum(tf) and #unique terms... hmm. > > Handling docs that did not have the field is a good point... but we > can assign a special value (eg 0.0, or, any negative number say) to > encode that? Where? In the full field storage? Too slow to recover. In the term dictionary? 
The term dictionary can't store nulls. You'd have to use sentinels... thus restricting the allowable content of the field?! No way. In the Lucy-style mmap'd sort cache? That would work, because we always have a "null ord", to which documents which did not supply a value for the field get assigned in the ords array. However, sort/field caches are orthogonal to this problem and we don't want to require them for an ancillary need. I suppose you could do it by iterating all posting lists for a field and flipping bits in a bit vector. The bits that are left unset correspond to docs with null values. > Deletions I think across the board will skew stats until they are > reclaimed. Yes, and unless the stats are fully regenerated when a segment with deletions gets merged away, the averages will be wrong to some degree, with the skew potentially worsening over time. Say that you have a segment with an average field length of 5 for the "tags" field, but that that average is the result of most docs having 1 tag, while a handful of docs have 100 tags. Now say you delete all of the docs with 100 tags. The recorded average for the "tags" field within the segment is now all messed up -- it should be "1", but it's "5". You have to regenerate a new, correct average when building a new segment. You can't use the existing value of "5" as a shortcut, or the consolidated segment's averages will be wrong from the get-go. That's what I was getting at earlier. However, I'd thought that we could get around the problem by fudging with maxDoc(), and I no longer believe that. I think full regeneration is the only way. Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
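The "tags" example can be made concrete with a toy recomputation (illustrative code, not Lucene's): with 96 docs holding 1 tag and 4 docs holding 101 tags, the stored average is 5.0; delete the 4 long docs and only a full pass over the live documents recovers the correct average of 1.0.

```java
// Sketch of why a stored per-segment average goes stale under deletions:
// the only safe move during a merge is full regeneration from live docs.
public class AvgFieldLength {
    public static double average(int[] fieldLengths, boolean[] deleted) {
        long sum = 0;
        int liveDocs = 0;
        for (int i = 0; i < fieldLengths.length; i++) {
            if (deleted[i]) continue; // skip deleted docs entirely
            sum += fieldLengths[i];
            liveDocs++;
        }
        return (double) sum / liveDocs;
    }
}
```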
Re: Baby steps towards making Lucene's scoring more flexible...
t being to fall back to doc-id-only and discard data when an unknown posting format is encountered, I presume. > >> > Similarity is where we decode norms right now. In my opinion, it > >> > should be the Similarity object from which we specify per-field > >> > posting formats. > >> > >> I agree. > > > > Great, I'm glad we're on the same page about that. > > Actually [sorry] I'm no longer so sure I agree! > > In flex we have a separate Codec class that's responsible for > creating the necessary readers/writers. It seems like Similarity is a > consumer of these stats, but need not know what format is used to > encode them on disk? It's true that it's possible to separate out Similarity as a consumer. However, I'm also thinking about how to make this API as easy to use as possible. One rationale behind the proposed elevation of Similarity is that I'm not a fan of the name "Codec". I think it's too generic to use for the class which specifies a posting format. "PostingCodec" is better, but might be too long. In contrast, "Similarity" is more esoteric than "Codec", and thus conveys more information. For Lucy, I'm imagining a stripped-down Similarity class compared to current Lucene. It would bear the responsibility for setting policy as to how scores are calculated (in other words, judging how "similar" a document is to the query), but what information it uses to calculate that score would be left entirely open. Methods such as tf(), idf(), encodeNorm(), etc. would move to a TF/IDF-specific subclass. Here's a sampling of possible Similarity subclasses:

* MatchSimilarity          // core
* TFIDFSimilarity          // core
* LongFieldTFIDFSimilarity // contrib
* BM25Similarity           // contrib
* PartOfSpeechSimilarity   // contrib

For Lucy, Similarity would be specified as a member of a FieldType object within a Schema. 
No subclassing would be required to spec custom posting formats:

    Schema schema = new Schema();
    FullTextType bm25Type = new FullTextType(new BM25Similarity());
    schema.specField("content", bm25Type);
    schema.specField("title", bm25Type);
    StringType matchType = new StringType(new MatchSimilarity());
    schema.specField("category", matchType);

Since the Similarity instance is settable rather than generated by a factory method, that means it will have to be serialized within the schema JSON file, just like analyzers must be. I think it's important to make choosing a posting format reasonably easy. Match-only fields should be accessible to someone learning basic index tuning and optimization techniques. Actually writing posting codecs is totally different. Not many people are going to want to do that, though we should make it easy for experts. What's the flex API for specifying a custom posting format? > > What's going to be a little tricky is that you can't have just one > > Similarity.makePostingDecoder() method. Sometimes you'll want a > > match-only decoder. Sometimes you'll want positions. Sometimes > > you'll want part-of-speech id. It's more of an interface/roles > > situation than a subclass situation. > > match-only decoder is handled on flex now by asking for the DocsEnum > and then while iterating only using the .doc() (even if underlyingly > the codec spent effort decoding freq and maybe other things). > > If you want positions you get a DocsAndPositionsEnum. Right. But what happens when you want a custom codec to use BM25 weighting *and* inline a part-of-speech ID *and* use PFOR? I think we have to supply a class object or class name when asking for the enumerator, like you do with AttributeSource.

    PostingList plist = null;
    PostingListReader pListReader = segReader.fetch(PostingListReader);
    if (pListReader != null) {
        PostingsReader pReader = pListReader.fetch(field);
        if (pReader != null) {
            plist = pReader.makePostingList(klass); // e.g. PartOfSpeechPostingList
        }
    }

Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Composing posts for both JIRA and email (was a JIRA post)
(CC to lucy-dev and general, reply-to set to general) On Thu, Mar 04, 2010 at 06:18:28AM +, Shai Erera (JIRA) wrote: > (Warning, this post is long, and is easier to read in JIRA) I consume email from many of the Lucene lists, and I hate it when people force me to read stuff via JIRA. It slows me down to have to jump to all those forum web pages. I only go to the web page if there are 5 or more posts in a row on the same issue that I need to read. For what it's worth, I've worked out a few routines that make it possible to compose messages which read well in both mediums.

* Never edit your posts unless absolutely necessary. If JIRA used diffs, things would be different, but instead it sends the whole frikkin' post twice (before and after), which makes it very difficult to see what was edited. If you must edit, append an "edited:" block at the end to describe what you changed instead of just making changes inline.

* Use Firefox and the "It's All Text" plugin, which makes it possible to edit JIRA posts using an external editor such as Vim instead of typing into a textarea. <http://trac.gerf.org/itsalltext>

* After editing, use the preview button (it's a little monitor icon to the upper right of the textarea) to make sure the post looks good in JIRA.

* Use "> " for quoting instead of JIRA's "bq." and "{quote}" since JIRA's mechanisms look so crappy in email. This is easy from Vim, because rewrapping a long line (by typing "gq" from visual mode to rewrap the current selection) that starts with "> " causes "> " to be prepended to the wrapped lines.

* Use asterisk bullet lists liberally, because they look good everywhere.

* Use asterisks for *emphasis*, because that looks good everywhere.

* If you wrap lines, use a reasonably short line length. (I use 78; Mike McCandless, who also wraps lines for his Jira posts, uses a smaller number). Otherwise you'll get nasty wrapping in narrow windows, both in email clients and web browsers. 
There are still a couple compromises that don't work out well. For email, ideally you want to set off code blocks with indenting:

    int foo = 1;
    int bar = 2;

To make code look decent in JIRA, you have to wrap that with {code} tags, which unfortunately look heinous in email. Left-justifying the tags but indenting the code seems like it would be a rotten-but-salvageable compromise, as it at least sets off the tags visually rather than making them appear as though they are part of the code fragment.

{code}
    int foo = 1;
    int bar = 2;
{code}

Unfortunately, that's going to look like this in JIRA, because of a bug that strips all leading whitespace from the first line.

|------------|
|int foo;    |
|    int bar;|
|------------|

It seems that this has been fixed by Atlassian in the Confluence wiki (<http://jira.atlassian.com/browse/CONF-4548>), but the issue remains for the JIRA installation at issues.apache.org. So for now, I manually strip indentation until the whole block is flush left.

{code}
int foo = 1;
int bar = 2;
{code}

(Gag. I vastly prefer wikis that automatically apply fixed-width styling to any indented text.) One last tip for Lucy developers (and other non-Java devs). JIRA has limited syntax highlighting support -- Java, JavaScript, ActionScript, XML and SQL only -- and defaults to assuming your code is Java. In general, you want to override that and tell JIRA to use "none".

{code:none}
int foo = 1;
int bar = 2;
{code}

Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Baby steps towards making Lucene's scoring more flexible...
On Tue, Mar 02, 2010 at 05:55:44AM -0500, Michael McCandless wrote: > The problem is, these scoring models need the avg field length (in > tokens) across the entire index, to compute the norms. > > Ie, you can't do that on writing a single segment. I don't see why not. We can just move everything you're doing on Searcher open to index time, and calculate the stats and norms before writing the segment out. At search time, the only segment with valid norms would be the last one, so we'd make sure the Searcher used those. I think the fact that Lucy always writes one segment per indexing session -- as opposed to Lucene's one segment per document -- makes a difference here. Whether burning norms to disk at index time is the most efficient setup depends on the ratio of commits to searcher-opens. In a multi-node search cluster, pre-calculating norms at index-time wouldn't work well without additional communication between nodes to gather corpus-wide stats. But I suspect the same trick that works for IDF in large corpuses would work for average field length: it will tend to be stable over time, so you can update it infrequently. > So I think it must be done during searcher init. > > The most we can do is store the aggregates (eg sum of all lengths in > this segment) in the SegmentInfo -- this saves one pass on searcher > init. Logically...

    token_counts: {
        segment: {
            title: 4,
            content: 154,
        },
        all: {
            title: 98342,
            content: 2854213
        }
    }

(Would that suffice? I don't recall the gory details of BM25.) As documents get deleted, the stats will gradually drift out of sync, just like doc freq does. However, that's mitigated if you recycle segments that exceed a threshold deletion percentage on a regular basis. > The norms array will be stored in this per-field sim instance. Interesting, but that wasn't where I was thinking of putting them. Similarity objects need to be sent over the network, don't they? At least they do in KS. 
So I think we need a local per-field PostingsReader object to hold such cached data. > > The insane loose typing of fields in Lucene is going to make it a > > little tricky to implement, though. I think you just have to > > exclude fields assigned to specific similarity implementations from > > your merge-anything-to-the-lowest-common-denominator policy and > > throw exceptions when there are conflicts rather than attempt to > > resolve them. > > Our disposition on conflict (throw exception vs silently coerce) > should just match what we do today, which is to always silently > coerce. What do you do when you have to reconcile two posting codecs like this?

* doc id, freq, position, part-of-speech identifier
* doc id, boost

Do you silently drop all information except doc id? > > Similarity is where we decode norms right now. In my opinion, it > > should be the Similarity object from which we specify per-field > > posting formats. > > I agree. Great, I'm glad we're on the same page about that. > > Similarity implementation and posting format are so closely related > > that in my opinion, it makes sense to tie them. > > This confuses me -- what is stored in these stats (each field's token > length, each field's avg tf, whatever other a codec wants to add over > time...) should be decoupled from the low level format used to store > it? I don't know about that. I don't think it's necessary to decouple them. There might be some minor code duplication, but similarity implementations don't tend to be very large, so the DRY violation doesn't bother me. What's going to be a little tricky is that you can't have just one Similarity.makePostingDecoder() method. Sometimes you'll want a match-only decoder. Sometimes you'll want positions. Sometimes you'll want part-of-speech id. It's more of an interface/roles situation than a subclass situation. > > If you're looking for small steps, my suggestion would be to focus > > on per-field Similarity support. 
> > Well that alone isn't sufficient -- the index needs to record/provide > the raw stats, and doc boosting (norms array) needs to be done using > these stats. Not sufficient, but it's probably a prerequisite. Since it's a common feature request anyway, I think it's a great place to start: http://lucene.markmail.org/message/ln2xkesici6aksbi http://lucene.markmail.org/thread/46vxibpubogtcy3g http://lucene.markmail.org/message/56bk6wrbwallyjvr https://issues.apache.org/jira/browse/LUCENE-2236 Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
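The aggregation sketched earlier in this thread -- per-segment token counts rolled up into corpus-wide totals at searcher init -- can be expressed as a short Java sketch. The class and method names here are invented for illustration; this is neither Lucene's nor Lucy's API:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the aggregation discussed in this thread: each segment stores
// the sum of token counts per field, and a corpus-wide average field
// length (the "avgdl" that BM25 needs) is derived at searcher init by
// summing the per-segment aggregates. All names are hypothetical.
public class FieldLengthStats {
    // field name -> total tokens across all added segments
    private final Map<String, Long> tokenTotals = new HashMap<>();
    private long docCount = 0;

    // Fold in one segment's stored aggregates (saving a full pass over
    // the segment's postings at searcher init).
    void addSegment(Map<String, Long> segmentTokenCounts, long segmentDocs) {
        for (Map.Entry<String, Long> e : segmentTokenCounts.entrySet()) {
            tokenTotals.merge(e.getKey(), e.getValue(), Long::sum);
        }
        docCount += segmentDocs;
    }

    // Corpus-wide average field length, e.g. for BM25 length normalization.
    double averageFieldLength(String field) {
        long total = tokenTotals.getOrDefault(field, 0L);
        return docCount == 0 ? 0.0 : (double) total / docCount;
    }
}
```

In a multi-node cluster, the same addSegment() folding could consume aggregates shipped from remote nodes; since the average tends to be stable over time, those updates can be infrequent.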
Re: Baby steps towards making Lucene's scoring more flexible...
an index segment: its fields, document count, and so on. The Segment object itself writes one file, segmeta.json; besides storing info needed by Segment itself, the "segmeta" file serves as a central repository for metadata generated by other index components -- relieving them of the burden of storing metadata themselves. As far as aggregates go, I think you want to be careful to avoid storing any kind of data that scales with segment size within a SegmentInfo. > * Change Similarity, to allow field-specific Similarity (I think we > have issue open for this already). I think, also, lengthNorm > (which is no longer invoked during indexing) would no longer be > used. Well, as you might suspect, I consider that one a gimme. KinoSearch supports per-field Similarity now. The insane loose typing of fields in Lucene is going to make it a little tricky to implement, though. I think you just have to exclude fields assigned to specific similarity implementations from your merge-anything-to-the-lowest-common-denominator policy and throw exceptions when there are conflicts rather than attempt to resolve them. > I think we'd make the class that computes norms from these per-doc > stats on IR open pluggable. Similarity is where we decode norms right now. In my opinion, it should be the Similarity object from which we specify per-field posting formats. See my reply to Robert in the BM25 thread: http://markmail.org/message/77rmrfmpatxd3p2e That way, custom scoring implementations can guarantee that they always have the posting information they need available to make their similarity judgments. Similarity also becomes a more generalized notion, with the TF/IDF-specific functionality moving into a subclass. Similarity implementation and posting format are so closely related that in my opinion, it makes sense to tie them. 
> And, someday we could make what stats are gathered/stored during indexing > pluggable but for starters I think we should simply support the field length > in tokens and avg tf per field. I would argue against making this your top priority, because I think adding half-baked features that require index-format changes is bad policy. If you're looking for small steps, my suggestion would be to focus on per-field Similarity support. Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2282) Expose IndexFileNames as public, and make use of its methods in the code
[ https://issues.apache.org/jira/browse/LUCENE-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837988#action_12837988 ] Marvin Humphrey commented on LUCENE-2282: - > As the API is now marked @lucene.internal, and it'll only be very > expert usage, I'm not as concerned as Marvin is about the risks of > even exposing this. Um, the only possible concerns I could have had were regarding public exposure of this API. If it's marked as internal, it's an implementation detail. Whether or not the dot is included in internal-use-only constant strings isn't something I'm going to waste a lot of time thinking about. ;) So now, not only do I really, really not care whether this goes in, I have no qualms about it either. Having users like Shai who are willing to recompile and regenerate to take advantage of experimental features is a big boon, as it allows us to test drive features before declaring them stable. Designing optimal APIs without usability testing is difficult to impossible. > Expose IndexFileNames as public, and make use of its methods in the code > > > Key: LUCENE-2282 > URL: https://issues.apache.org/jira/browse/LUCENE-2282 > Project: Lucene - Java > Issue Type: Improvement >Reporter: Shai Erera >Assignee: Michael McCandless > Fix For: 3.1 > > Attachments: LUCENE-2282.patch, LUCENE-2282.patch, LUCENE-2282.patch > > > IndexFileNames is useful for applications that extend Lucene, an in > particular those who extend Directory or IndexWriter. It provides useful > constants and methods to query whether a certain file is a core Lucene file > or not. In addition, IndexFileNames should be used by Lucene's code to > generate segment file names, or query whether a certain file matches a > certain extension. > I'll post the patch shortly. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. 
- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2282) Expose IndexFileNames as public, and make use of its methods in the code
[ https://issues.apache.org/jira/browse/LUCENE-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837499#action_12837499 ] Marvin Humphrey commented on LUCENE-2282: - > Any application that extends IW, or provide its own Directory > implementation, and wants to reference Lucene's file extensions properly > (i.e. not by putting its code under o.a.l.index or hardcoding ".del" as its > deletions file) will benefit from making it public. > Forgot to tag IFN as @lucene.internal ? If the class is tagged as "internal", then external applications like the one you describe above aren't supposed to use it, right? But I don't really care about whether this goes into Lucene. Go ahead, make it fully public and omit the "internal" tag. Not my problem. :) The thing is, I really don't understand what kind of thing you want to do. Are you writing your own deletions files? I'm trying to understand because the only use cases I can think of for this aren't compatible with index pluggability, which is a high priority for Lucy. * Sweep "non-core-Lucene files" to "clean up" an index. * Gather up "core-Lucene files" for export. * Audit "core-Lucene files" to determine whether the index is in a valid state. * Differentiate between "core-Lucene" and "non-core-Lucene" files when writing a compound file. Maybe there's something I haven't thought of, though. Why do you want to "reference Lucene's file extensions properly"? Once you've identified which files are "core Lucene" and which files aren't, what are you going to do with them? 
> Expose IndexFileNames as public, and make use of its methods in the code > > > Key: LUCENE-2282 > URL: https://issues.apache.org/jira/browse/LUCENE-2282 > Project: Lucene - Java > Issue Type: Improvement >Reporter: Shai Erera > Fix For: 3.1 > > Attachments: LUCENE-2282.patch, LUCENE-2282.patch > > > IndexFileNames is useful for applications that extend Lucene, an in > particular those who extend Directory or IndexWriter. It provides useful > constants and methods to query whether a certain file is a core Lucene file > or not. In addition, IndexFileNames should be used by Lucene's code to > generate segment file names, or query whether a certain file matches a > certain extension. > I'll post the patch shortly. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2282) Expose IndexFileNames as public, and make use of its methods in the code
[ https://issues.apache.org/jira/browse/LUCENE-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837446#action_12837446 ] Marvin Humphrey commented on LUCENE-2282: - It seems to me that identifying only core index files conflicts with the idea of pluggable index formats. Presumably plugins would use their own file extensions. Would these belong to the index, according to a detector based off of IndexFileNames? Presumably not, which would either limit the usefulness of such a utility, or outright encourage anti-patterns such as a sweeper that zaps files created by plugins because they aren't "core Lucene" enough. Also, are temporary files "core Lucene"? Lockfiles? Only sometimes? What are the applications that we are trying to support by exposing this API? > Expose IndexFileNames as public, and make use of its methods in the code > > > Key: LUCENE-2282 > URL: https://issues.apache.org/jira/browse/LUCENE-2282 > Project: Lucene - Java > Issue Type: Improvement >Reporter: Shai Erera > Fix For: 3.1 > > > IndexFileNames is useful for applications that extend Lucene, an in > particular those who extend Directory or IndexWriter. It provides useful > constants and methods to query whether a certain file is a core Lucene file > or not. In addition, IndexFileNames should be used by Lucene's code to > generate segment file names, or query whether a certain file matches a > certain extension. > I'll post the patch shortly. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2271) Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect results with TopScoreDocCollector
[ https://issues.apache.org/jira/browse/LUCENE-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836197#action_12836197 ] Marvin Humphrey commented on LUCENE-2271: - An awful lot of thought went into optimizing those collection algorithms. I disagree with many of the design decisions that were made, but it seems rushed to blithely revert those optimizations. FWIW, the SortCollector in KS (designed on the Lucy list last spring, would be in Lucy but some prereqs haven't gone in yet) doesn't have the problem with -Inf sentinels. It uses an array of "actions" representing sort rules to determine whether a hit is "competitive" and should be inserted into the queue; the first action is set to AUTO_ACCEPT (meaning try inserting the hit into the queue) until the queue fills up, and then again to AUTO_ACCEPT at the start of each segment. It's not necessary to fill up the queue with dummy hits beforehand. {code:none} static INLINE bool_t SI_competitive(SortCollector *self, i32_t doc_id) { u8_t *const actions = self->actions; u32_t i = 0; /* Iterate through our array of actions, returning as quickly as * possible. */ do { switch (actions[i] & ACTIONS_MASK) { case AUTO_ACCEPT: return true; case AUTO_REJECT: return false; case AUTO_TIE: break; case COMPARE_BY_SCORE: { float score = Matcher_Score(self->matcher); if (score > self->bubble_score) { self->bumped->score = score; return true; } else if (score < self->bubble_score) { return false; } } break; case COMPARE_BY_SCORE_REV: { // ... case COMPARE_BY_DOC_ID: // ... case COMPARE_BY_ORD1: { // ... } } while (++i < self->num_actions); /* If we've made it this far and we're still tied, reject the doc so that * we prefer items already in the queue. This has the effect of * implicitly breaking ties by doc num, since docs are collected in order. */ return false; } {code} > Function queries producing scores of -inf or NaN (e.g. 
1/x) return incorrect > results with TopScoreDocCollector > -- > > Key: LUCENE-2271 > URL: https://issues.apache.org/jira/browse/LUCENE-2271 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Affects Versions: 2.9 >Reporter: Uwe Schindler >Priority: Minor > Fix For: 2.9.2, 3.0.1, 3.1 > > Attachments: LUCENE-2271.patch, LUCENE-2271.patch, LUCENE-2271.patch, > LUCENE-2271.patch, LUCENE-2271.patch, TSDC.patch > > > This is a foolowup to LUCENE-2270, where a part of this problem was fixed > (boost = 0 leading to NaN scores, which is also un-intuitive), but in > general, function queries in Solr can create these invalid scores easily. In > previous version of Lucene these scores ordered correct (except NaN, which > mixes up results), but never invalid document ids are returned (like > Integer.MAX_VALUE). > The problem is: TopScoreDocCollector pre-fills the HitQueue with sentinel > ScoreDocs with a score of -inf and a doc id of Integer.MAX_VALUE. For the HQ > to work, this sentinel must be smaller than all posible values, which is not > the case: > - -inf is equal and the document is not inserted into the HQ, as not > competitive, but the HQ is not yet full, so the sentinel values keep in the > HQ and result is the Integer.MAX_VALUE docs. This problem is solveable (and > only affects the Ordered collector) by chaning the exit condition to: > {code} > if (score <= pqTop.score && pqTop.doc != Integer.MAX_VALUE) { > // Since docs are returned in-order (i.e., increasing doc Id), a document > // with equal score to pqTop.score cannot compete since HitQueue favors > // documents with lower doc Ids. Therefore reject those docs too. > return; > } > {code} > - The NaN case can be fixed in the same way, but then has another problem: > all comparisons with NaN result in false (none of these is true): x < NaN, x > > NaN, NaN == NaN. 
This leads to the fact that HQ's lessThan always returns > false, leading to unexpected ordering in the PQ and sometimes the sentinel > values do not stay at the top of the queue. A later
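The sentinel-free collection idea described in the comment above can be sketched in Java. This is a toy illustration, not KS's SortCollector or Lucene's TopScoreDocCollector: hits are auto-accepted until the queue fills, so no -Inf/Integer.MAX_VALUE dummy entries ever exist to leak into the results, and Float.compare()'s total ordering keeps -Inf scores from corrupting the queue.

```java
import java.util.PriorityQueue;

// Toy sentinel-free top-N collector. All names are invented for
// illustration.
public class TopHits {
    static class Hit {
        final int docId;
        final float score;
        Hit(int docId, float score) { this.docId = docId; this.score = score; }
    }

    private final int size;
    // Min-queue ordered by score: the weakest competitive hit sits on top.
    // Float.compare imposes a total order, so -Inf (and even NaN) scores
    // cannot break the heap invariants.
    private final PriorityQueue<Hit> queue =
        new PriorityQueue<>((a, b) -> Float.compare(a.score, b.score));

    TopHits(int size) { this.size = size; }

    void collect(int docId, float score) {
        if (queue.size() < size) {
            queue.add(new Hit(docId, score)); // "AUTO_ACCEPT" while filling
        } else if (Float.compare(score, queue.peek().score) > 0) {
            queue.poll(); // strictly better than the weakest: replace it
            queue.add(new Hit(docId, score));
        }
        // Ties are rejected, implicitly preferring lower doc ids since
        // docs are collected in order.
    }

    PriorityQueue<Hit> hits() { return queue; }
}
```

Because the queue never holds dummy entries, collecting fewer hits than `size` simply yields a smaller result set rather than Integer.MAX_VALUE doc ids.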
[jira] Commented: (LUCENE-1941) MinPayloadFunction returns 0 when only one payload is present
[ https://issues.apache.org/jira/browse/LUCENE-1941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832989#action_12832989 ] Marvin Humphrey commented on LUCENE-1941: - > off on "vacation" (scare quotes for Marvin) Have "fun"! > MinPayloadFunction returns 0 when only one payload is present > - > > Key: LUCENE-1941 > URL: https://issues.apache.org/jira/browse/LUCENE-1941 > Project: Lucene - Java > Issue Type: Bug > Components: Query/Scoring >Affects Versions: 2.9, 3.0 >Reporter: Erik Hatcher > Fix For: 2.9.2, 3.0.1, 3.1 > > Attachments: LUCENE-1941.patch, LUCENE-1941.patch > > > In some experiments with payload scoring through PayloadTermQuery, I'm seeing > 0 returned when using MinPayloadFunction. I believe there is a bug there. > No time at the moment to flesh out a unit test, but wanted to report it for > tracking. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Having a default constructor in Analyzers
DM Smith: > Imagine that each index maintains a manifest of the toolchain for the index, > which includes the version of each part of the chain. Since the index is > created all at once, this probably is the same as the version of lucene. > When the user searches the index the manifest is consulted to recreate the > toolchain. >8 snip 8< > IIRC: This is something that Marvin has implemented in Lucy. Yes. QueryParser's constructor takes a Schema argument. Furthermore, Schema definitions are fully externalized and stored as JSON with the index itself. So you can do stuff like this: IndexReader reader = IndexReader.open("/path/to/index"); QueryParser qparser = new QueryParser(reader.getSchema()); We haven't got Version for our Analyzers yet, but it's planned. I'm following this discussion with interest to see how the deployment of Version plays out with the user base. However, Lucy's approach won't work for Lucene because Lucene allows you to have fields with the same name and completely different semantics. Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2213) Small improvements to ArrayUtil.getNextSize
[ https://issues.apache.org/jira/browse/LUCENE-2213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801433#action_12801433 ] Marvin Humphrey commented on LUCENE-2213: - > if it starts getting used very often for very small arrays, the overhead > will start to matter I think in most cases usage will only occur after an inequality test, when it's known that reallocation will be occurring. In my experience, the overhead of allocation will tend to swamp this kind of calculation. {code} if (needed > capacity) { int amount = ArrayUtil.oversize(needed, RamUsageEstimator.NUM_BYTES_CHAR); buffer = new char[amount]; } {code} > Small improvements to ArrayUtil.getNextSize > --- > > Key: LUCENE-2213 > URL: https://issues.apache.org/jira/browse/LUCENE-2213 > Project: Lucene - Java > Issue Type: Improvement >Reporter: Michael McCandless >Assignee: Michael McCandless >Priority: Minor > Fix For: 3.1 > > Attachments: LUCENE-2213.patch, LUCENE-2213.patch, LUCENE-2213.patch > > > Spinoff from java-dev thread "Dynamic array reallocation algorithms" started > on Jan 12, 2010. > Here's what I did: > * Keep the +3 for small sizes > * Added 2nd arg = number of bytes per element. > * Round up to 4 or 8 byte boundary (if it's 32 or 64 bit JRE respectively) > * Still grow by 1/8th > * If 0 is passed in, return 0 back > I also had to remove some asserts in tests that were checking the actual > values returned by this method -- I don't think we should test that (it's an > impl. detail). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2213) Small improvements to ArrayUtil.getNextSize
[ https://issues.apache.org/jira/browse/LUCENE-2213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801432#action_12801432 ] Marvin Humphrey commented on LUCENE-2213: - Seems like the one permutation of "over" "allocation" and "size" you've omitted is oversize(minimum, width). (It's a style thing, but I try to use get* for accessors and avoid it elsewhere.) > Small improvements to ArrayUtil.getNextSize > --- > > Key: LUCENE-2213 > URL: https://issues.apache.org/jira/browse/LUCENE-2213 > Project: Lucene - Java > Issue Type: Improvement >Reporter: Michael McCandless >Assignee: Michael McCandless >Priority: Minor > Fix For: 3.1 > > Attachments: LUCENE-2213.patch, LUCENE-2213.patch, LUCENE-2213.patch > > > Spinoff from java-dev thread "Dynamic array reallocation algorithms" started > on Jan 12, 2010. > Here's what I did: > * Keep the +3 for small sizes > * Added 2nd arg = number of bytes per element. > * Round up to 4 or 8 byte boundary (if it's 32 or 64 bit JRE respectively) > * Still grow by 1/8th > * If 0 is passed in, return 0 back > I also had to remove some asserts in tests that were checking the actual > values returned by this method -- I don't think we should test that (it's an > impl. detail). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2213) Small improvements to ArrayUtil.getNextSize
[ https://issues.apache.org/jira/browse/LUCENE-2213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12800828#action_12800828 ] Marvin Humphrey commented on LUCENE-2213: - Algorithm looks good. The addition of the mandatory second argument works well. Nice work. Looks like there's a typo in the currently unused constant "NUM_BYTES_DOUBLT". As for the tests... Testing that optimizations like these are working properly is a pain, so I understand why you zapped 'em. Sometimes inequality or proportional testing can work in these situations: {code} assertTrue(t.termBuffer().length > t.termLength()); {code} That assertion wouldn't always hold true for this object, because sometimes the term will fill the whole array. And in a perfect world, you'd want to test that each and every array growth happens as expected -- but that's not practical. Still, in my opinion, a fragile, imperfect test in this situation is OK. > Small improvements to ArrayUtil.getNextSize > --- > > Key: LUCENE-2213 > URL: https://issues.apache.org/jira/browse/LUCENE-2213 > Project: Lucene - Java > Issue Type: Improvement >Reporter: Michael McCandless >Assignee: Michael McCandless >Priority: Minor > Fix For: 3.1 > > Attachments: LUCENE-2213.patch > > > Spinoff from java-dev thread "Dynamic array reallocation algorithms" started > on Jan 12, 2010. > Here's what I did: > * Keep the +3 for small sizes > * Added 2nd arg = number of bytes per element. > * Round up to 4 or 8 byte boundary (if it's 32 or 64 bit JRE respectively) > * Still grow by 1/8th > * If 0 is passed in, return 0 back > I also had to remove some asserts in tests that were checking the actual > values returned by this method -- I don't think we should test that (it's an > impl. detail). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. 
- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Dynamic array reallocation algorithms
On Wed, Jan 13, 2010 at 11:46:50AM -0500, Michael McCandless wrote: > If forced to pick, in general, I tend to prefer burning CPU not RAM, > because the CPU is often a one-time burn, whereas RAM ties up storage > for indefinite amounts of time. With our dependence on indexes being RAM-resident for optimum performance, I'd also favor being conservative with RAM. > I think this function should still aim to handle the smallish values, > ie, we shouldn't require every caller to have to handle the small > values themselves. Callers that want to override the small cases can > still do so... The more "helpful" the behavior of getNextSize(), the harder it is to understand what happens when you partially override it. But I guess it's not that big a deal one way or the other. There aren't that many places in Lucene where you might call getNextSize(). There are more such places in Lucy because we have to roll our own string and array classes, and we need finer-grained control over what happens there -- so maybe that explains why I'm not excited about trying to cram all that logic into a shared routine. Putting more logic into getNextSize() would be less of a problem if Lucene's implementation was less convoluted. It's only one line and one comment, but it's deceptively difficult to grok. Looks like some Perl golfer wrote it. ;) Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Dynamic array reallocation algorithms
On Wed, Jan 13, 2010 at 09:43:12AM -0500, Yonik Seeley wrote: > Yeah, something highly optimized for python in C may not be optimal for Java. It looks like that algo was tuned to address poor reallocation performance under Windows 9x. http://svn.python.org/view/python/trunk/Objects/listobject.c?r1=19445&r2=20939 * This over-allocates proportional to the list size, making room * for additional growth. The over-allocation is mild, but is * enough to give linear-time amortized behavior over a long * sequence of appends() in the presence of a poorly-performing * system realloc() (which is a reality, e.g., across all flavors * of Windows, with Win9x behavior being particularly bad -- and * we've still got address space fragmentation problems on Win9x * even with this scheme, although it requires much longer lists to * provoke them than it used to). */ That "3" used to be a "1": http://svn.python.org/view/python/trunk/Objects/listobject.c?r1=35279&r2=35280 > Seems like the right thing is highly dependent on the use case. It seems that way to me. That's why I think it's better to have a simpler routine and to force more responsibility onto the client code. > In this case, the number of byte arrays temporarily being managed can > be maxDoc (in the very worst case) so it's critical not to waste any > space. Yes, we want to make sure it's possible to ask for a specific array size and get that exact array size. (I think this is a bigger problem in Lucy than in Lucene, because we have to simulate bounded arrays with classes). Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Dynamic array reallocation algorithms
On Wed, Jan 13, 2010 at 05:48:08AM -0500, Michael McCandless wrote: > Have you notified python-dev? No, not yet. Is it kosher with the Python license to have copied-and-pasted that comment? It's not credited from what I can see. Small, but we should probably fix that. > Right, and also to strike a balance with not wasting too much > over-allocated but not-yet-used RAM (hence 1/8 growth, not 1/2 or 1). I agree, a smaller size is better. Say you start with an array which holds 800 elements but which has been over-allocated by one eighth (+100 elements = 900). Reallocating at 900 and again at 1000-something isn't that different from reallocating only once at 1000. So long as you avoid reallocating at every increment -- 801, 802, etc -- you have achieved your main goal. Both mild and aggressive over-allocation solve that problem, but aggressive over-allocation also involves significant RAM costs. Where the best balance lies depends on how bad the reallocation performance is in relation to the cost of insertion. An eighth seems pretty good. Doubling seems too high. 1.5, dunno, seems like it could work. According to this superficially credible-seeming comment at stackoverflow, that's what Ruby does: http://stackoverflow.com/questions/1100311/what-is-the-ideal-growth-rate-for-a-dynamically-allocated-array/1100449#1100449 > But, you have to do something different when targetSize < 8, else that > formula doesn't grow. That's right, it was by design. For small sizes, things get tricky. If it's a byte array, you definitely want to round up to the size of a pointer, and as we enter the era of ubiquitous 64-bit processors, rounding up to the nearest multiple of 8 seems proper. But what about arrays of objects? Would we really want every two-element array to reserve space for 8 objects? 
The way I've got this handled in a forthcoming patch to Lucy is to trust the user about the size they want in most cases and count on them to add the logic for small sizes (as was done in TermBufferImpl.java). The Grow() methods of CharBuf, ByteBuf, and VArray obey the exact amount -- if you want an overallocation, you'll invoke the over-estimator on the argument you supply to Grow(). It's just the incremental appends like VArray's Push() or CharBuf's Cat_Char() that trigger the automatic over-allocation internally. > Also, I think for smallish sizes we want faster than 12.5% growth, > because the constant-time cost of the mechanics of doing a > reallocation are at that point relatively high (wrt the cost of > copying the bytes over to the new array). But is that important if, in most of the cases where an array grows incrementally, you've already overallocated manually? Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
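The trade-off being discussed -- mild 1/8th growth burns more reallocations (CPU) while aggressive doubling strands more over-allocated RAM -- can be illustrated with a toy simulation. The growth policies below are idealized, not ArrayUtil's actual behavior:

```java
// Count how many reallocations each growth policy needs to reach one
// million elements, starting from a capacity of 8. Idealized policies
// for illustration only.
public class GrowthTradeoff {
    static int reallocsToReach(int target, double factor) {
        int cap = 8;
        int reallocs = 0;
        while (cap < target) {
            cap = Math.max(cap + 1, (int) (cap * factor)); // always make progress
            reallocs++;
        }
        return reallocs;
    }

    public static void main(String[] args) {
        int mild = reallocsToReach(1_000_000, 1.125);   // ~100 reallocations
        int aggressive = reallocsToReach(1_000_000, 2.0); // 17 reallocations
        System.out.println("1/8th growth: " + mild + " reallocs");
        System.out.println("doubling:     " + aggressive + " reallocs");
        // Doubling burns far fewer reallocations (CPU) but can strand up
        // to half the final allocation as unused RAM.
    }
}
```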
Re: Dynamic array reallocation algorithms
On Tue, Jan 12, 2010 at 10:46:29PM -0500, DM Smith wrote: > So starting at 0, the size is 0. > 0 => 0 > 0 + 1 => 4 > 4 + 1 => 8 > 8 + 1 => 16 > 16 + 1 => 25 > 25 + 1 => 35 > ... > > So I think the copied python comment is correct but not obviously correct. So those numbers are supposed to be where the transitions occur? But that's not where the jumps are. The jumps happen at... 8, 16, 24, 32, 40, 48... ... as you'd expect when adding (input >> 3), which is after all just a more obscure way of writing (input / 8). Sorry for being dense, but I still don't get it. > The addition of 3 or 6 only helps initially, after some point it is just > noise. It has the characteristic of being less aggressive with subsequent > allocations. Well, I have my doubts about whether this actually helps or not. It doesn't seem general purpose enough. For an array of bytes, the desirable behavior is clear -- you really ought to round up to at least the size of a pointer because you're never going to return a non-aligned buffer. Feeding 10 into getNextSize() and getting back 17 is weird -- you should get back either 16 or 24. I also consider it strange that if you ask for 0 you get 3. A lot of the time, if you're asking for 0 it's because the resource may never need to be allocated. And if you know that the resource actually is going to be needed, you're going to write code like this, from TermAttributeImpl.java: termBuffer = new char[ArrayUtil.getNextSize(newSize < MIN_BUFFER_SIZE ? MIN_BUFFER_SIZE : newSize)]; (MIN_BUFFER_SIZE is 10.) But whatever. "3" and "6" look more significant on first read of the code than they actually are, but they're only strange, not detrimental to performance. I'm just ticked off because I spent so much time trying to understand code which turns out to do so little. > I'm not really up on whether this is best, but it is better than the > doubling algorithm that it replaced. 
I think your suggestion that such an > algorithm might contribute to fragmented memory is interesting. I wonder if > C, perl and java have different issues regarding that. It's wherever the memory allocator lives. That could be the JVM, it could be glibc, it could be your own custom allocator. If you compile Perl with -DUSE_MY_MALLOC it uses its own allocator, otherwise it uses the system's malloc. KinoSearch actually has a dedicated allocator it uses for a very targeted purpose, and this allocator has its own strategy for avoiding fragmentation. The golden mean issue is relevant to all of those allocators. Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
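The claim about where the jumps fall can be checked mechanically. For inputs past the small-size cutoff of 9, the formula's output jumps by 2 exactly at multiples of 8, since that is where (input >> 3) increments; the formula below is copied from the thread, the class name is ours:

```java
// Locate where getNextSize()'s output jumps by more than 1, confirming
// that (past the small-size cutoff) the transitions fall on multiples
// of 8 -- not on the 0, 4, 8, 16, 25, 35... pattern the comment claims.
public class JumpPoints {
    static int nextSize(int targetSize) {
        return (targetSize >> 3) + (targetSize < 9 ? 3 : 6) + targetSize;
    }

    public static void main(String[] args) {
        StringBuilder jumps = new StringBuilder();
        for (int n = 10; n <= 48; n++) {
            if (nextSize(n) - nextSize(n - 1) > 1) {
                jumps.append(n).append(' ');
            }
        }
        System.out.println(jumps.toString().trim()); // prints "16 24 32 40 48"
    }
}
```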
Dynamic array reallocation algorithms
Greets, I've been trying to understand this comment regarding ArrayUtil.getNextSize(): * The growth pattern is: 0, 4, 8, 16, 25, 35, 46, 58, 72, 88, ... Maybe I'm missing something, but I can't see how the formula yields such a growth pattern: return (targetSize >> 3) + (targetSize < 9 ? 3 : 6) + targetSize; For input values of 9 or greater, all that formula does is multiply by 1.125 and add 6. (Values enumerated below my sig.) The comment appears to have originated with this Python commit: http://svn.python.org/view/python/trunk/Objects/listobject.c?r1=35279&r2=35280 I think it was wrong then, and it's wrong now. The primary purpose of getNextSize() is to minimize reallocations during dynamic array resizing by overallocating on certain actions. Exactly how much we overallocate by doesn't seem to matter that much. Python apparently adds an extra eighth or so. Ruby reportedly multiplies by 1.5. Theoretically, multipliers larger than the golden mean are supposed to be suboptimal because they tend to induce memory fragmentation: subsequent reallocations cannot reuse previously freed sections, because they never add up to the total required by the newly requested fragment. However, that assumes a reasonably closed memory usage pattern, and so long as the freed fragment can be reused by someone else somewhere, it won't go to waste. IMO, minimizing memory fragmentation is so dependent on the internal implementation of the system's memory allocator as to be not worth the trouble, but if we were to do it, I think the right approach is outlined in this comment documenting the intention of the Python resizing routine prior to the commit that introduced the new (broken?) algo: http://svn.python.org/view/python/trunk/Objects/listobject.c?revision=35125&view=markup /* Round up: * If n < 256, to a multiple of 8. * If n < 2048, to a multiple of 64. * If n < 16384, to a multiple of 512. * If n < 131072, to a multiple of 4096. * If n < 1048576, to a multiple of 32768. 
* If n < 8388608, to a multiple of 262144. * If n < 67108864, to a multiple of 2097152. * If n < 536870912, to a multiple of 16777216. * ... * If n < 2**(5+3*i), to a multiple of 2**(3*i). I can't really see the point of adding the small constant (6) for large values, as is done in the new algo. And if oversizing is important for small values (debatable, since there will always be lots of small memory chunks floating around in the allocation pool), then rounding up to 8 consistently makes more sense to me than the current behavior. IMO, just overallocating by some multiplier between 1.125 and 1.5 achieves our primary goal of avoiding pathological reallocation behavior, and that's enough. How about simplifying ArrayUtil.getNextSize() down to this? return targetSize + (targetSize / 8); Marvin Humphrey mar...@smokey:~ $ perl -le 'print "$_ => " . (($_ >> 3) + ($_ < 9 ? 3 : 6 ) + $_) for 0 .. 100' 0 => 3 1 => 4 2 => 5 3 => 6 4 => 7 5 => 8 6 => 9 7 => 10 8 => 12 9 => 16 10 => 17 11 => 18 12 => 19 13 => 20 14 => 21 15 => 22 16 => 24 17 => 25 18 => 26 19 => 27 20 => 28 21 => 29 22 => 30 23 => 31 24 => 33 25 => 34 26 => 35 27 => 36 28 => 37 29 => 38 30 => 39 31 => 40 32 => 42 33 => 43 34 => 44 35 => 45 36 => 46 37 => 47 38 => 48 39 => 49 40 => 51 41 => 52 42 => 53 43 => 54 44 => 55 45 => 56 46 => 57 47 => 58 48 => 60 49 => 61 50 => 62 51 => 63 52 => 64 53 => 65 54 => 66 55 => 67 56 => 69 57 => 70 58 => 71 59 => 72 60 => 73 61 => 74 62 => 75 63 => 76 64 => 78 65 => 79 66 => 80 67 => 81 68 => 82 69 => 83 70 => 84 71 => 85 72 => 87 73 => 88 74 => 89 75 => 90 76 => 91 77 => 92 78 => 93 79 => 94 80 => 96 81 => 97 82 => 98 83 => 99 84 => 100 85 => 101 86 => 102 87 => 103 88 => 105 89 => 106 90 => 107 91 => 108 92 => 109 93 => 110 94 => 111 95 => 112 96 => 114 97 => 115 98 => 116 99 => 117 100 => 118 - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
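The formula and the proposed simplification are easy to check directly. This is a standalone sketch (the class name and `main` harness are mine, not Lucene's); `getNextSize` copies the formula quoted in the message above, and the spot checks match the enumeration below the sig.

```java
public class GrowthCheck {
    // The ArrayUtil.getNextSize() formula quoted in the message above.
    static int getNextSize(int targetSize) {
        return (targetSize >> 3) + (targetSize < 9 ? 3 : 6) + targetSize;
    }

    // The proposed simplification: overallocate by roughly 1.125x,
    // dropping the small additive constant.
    static int proposedNextSize(int targetSize) {
        return targetSize + (targetSize / 8);
    }

    public static void main(String[] args) {
        // For inputs >= 9 the current formula is just n + n/8 + 6,
        // matching the enumeration below the sig: 9 => 16, 100 => 118.
        System.out.println(getNextSize(9));        // 16
        System.out.println(getNextSize(100));      // 118
        // The simplification for the same input: 100 => 112.
        System.out.println(proposedNextSize(100)); // 112
    }
}
```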
Re: Compound File Default
On Tue, Jan 12, 2010 at 11:05:13AM -0500, Grant Ingersoll wrote: > At any rate, I feel pretty safe assuming no one is running a production > system on a MBP... I don't really care whether Lucene defaults to the compound file format or not (KS does, Lucy will, and that's good enough for me), but it seems weird that you're assuming that only MacBook Pros have that default. Just for giggles, I checked my old PowerPC Mac Mini, running 10.5 -- it's got a limit of 256. But beyond that, Lucene adopted the compound file format default for a reason, right? What's changed about the environment that justifies overturning that decision? When I checked a RedHat 9 box several years ago, it was at 1024; when I checked a CentOS 5.2 box today, it was at 1024. A FreeBSD 5.3 box several years ago was at 65536. File descriptor limits don't seem to advance like hardware stats. Go ahead and change the default, but I've got a feeling you're about to relearn old lessons. Marvin Humphrey
Re: Compound File Default
On Tue, Jan 12, 2010 at 09:49:09AM -0500, Grant Ingersoll wrote: > My Mac (non-laptop) reports: > ulimit -n > 2560 > > And I know I didn't change it. Before I posted, I had a few officemates corroborate. 4 people had 256 -- three on 10.6 and me on 10.5. I think these were all MacBook Pros. The exception was our DBA, who had high numbers (thousands) on both his MBP and his desktop. Marvin Humphrey
Re: Compound File Default
On Mon, Jan 11, 2010 at 03:20:17PM -0500, Grant Ingersoll wrote: > Should we really still be defaulting to true for setUseCompoundFile? Do > people still run out of file handles? Yep. You're going to smack up against that limit pretty quick on Mac OS X: mar...@smokey:~ $ ulimit -n 256 > If so, why not have them turn it on, instead of everyone else having to turn > it off. Can you up the file descriptor limit from within a running JVM? If not, you're setting yourself up with a non-portable default. Marvin Humphrey
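As far as I know, a stock JVM can read its file descriptor limit but has no API to raise it -- raising RLIMIT_NOFILE has to happen in the shell or service manager before the JVM starts -- which is the crux of the portability question above. A small sketch using the JDK's `com.sun.management` extension bean:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

import com.sun.management.UnixOperatingSystemMXBean;

public class FdLimits {
    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof UnixOperatingSystemMXBean) {
            UnixOperatingSystemMXBean unix = (UnixOperatingSystemMXBean) os;
            // Read-only: the bean exposes no setter. A default that needs
            // more descriptors than this limit is non-portable.
            System.out.println("max fds:  " + unix.getMaxFileDescriptorCount());
            System.out.println("open fds: " + unix.getOpenFileDescriptorCount());
        } else {
            System.out.println("descriptor counts not exposed on this platform");
        }
    }
}
```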
[jira] Commented: (LUCENE-2026) Refactoring of IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12794137#action_12794137 ] Marvin Humphrey commented on LUCENE-2026: - > we can't give hints to the OS to tell it not to cache certain reads/writes > (ie segment merging), For what it's worth, we haven't really solved that problem in Lucy either. The sliding window abstraction we wrapped around mmap/MapViewOfFile largely solved the problem of running out of address space on 32-bit operating systems. However, there's currently no way to invoke madvise through Lucy's IO abstraction layer -- it's a little tricky with compound files. Linux, at least, requires that the buffer supplied to madvise be page-aligned. So, say we're starting off on a posting list, and we want to communicate to the OS that it should treat the region we're about to read as MADV_SEQUENTIAL. If the start of the postings file is in the middle of a 4k page and the file right before it is a term dictionary, we don't want to indicate that that region should be treated as sequential. I'm not sure how to solve that problem without violating the encapsulation of the compound file model. Hmm, maybe we could store metadata about the virtual files indicating usage patterns (sequential, random, etc.), since files are generally part of dedicated data structures whose usage patterns are known at index time? Or maybe we just punt on that use case and worry only about segment merging. Hmm, wouldn't the act of deleting a file (and releasing all file descriptors) tell the OS that it's free to recycle any memory pages associated with it? > Actually why can't ord & offset be one, for the string sort cache? > Ie, if you write your string data in sort order, then the offsets are > also in sort order? (I think we may have discussed this already?) 
Right, we discussed this on lucy-dev last spring: http://markmail.org/message/epc56okapbgit5lw Incidentally, some of this thread replays our exchange at the top of LUCENE-1458 from a year ago. It was fun to go back and reread that: in the interim, we've implemented segment-centric search and memory-mapped field caches and term dictionaries, both of which were first discussed back then. :) Ords are great for low cardinality fields of all kinds, but become less efficient for high cardinality primitive numeric fields. For simplicity's sake, the prototype implementation of mmap'd field caches in KS always uses ords. > You don't want to have to create Lucy's equivalent of the JMM... The more I think about making Lucy classes thread safe, the harder it seems. :( I'd like to make it possible to share a Schema across threads, for instance, but that means all its Analyzers, etc. have to be thread-safe as well, which isn't practical when you start getting into contributed subclasses. Even if we succeed in getting Folders and FileHandles thread safe, it will be hard for the user to keep track of what they can and can't do across threads. "Don't share anything" is a lot easier to understand. We reap a big benefit by making Lucy's metaclass infrastructure thread-safe. Beyond that, seems like there's a lot of pain for little gain. > Refactoring of IndexWriter > -- > > Key: LUCENE-2026 > URL: https://issues.apache.org/jira/browse/LUCENE-2026 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Michael Busch >Assignee: Michael Busch >Priority: Minor > Fix For: 3.1 > > > I've been thinking for a while about refactoring the IndexWriter into > two main components. > One could be called a SegmentWriter and as the > name says its job would be to write one particular index segment. The > default one just as today will provide methods to add documents and > flushes when its buffer is full. > Other SegmentWriter implementations would do things like e.g. 
appending or > copying external segments [what addIndexes*() currently does]. > The second component's job would it be to manage writing the segments > file and merging/deleting segments. It would know about > DeletionPolicy, MergePolicy and MergeScheduler. Ideally it would > provide hooks that allow users to manage external data structures and > keep them in sync with Lucene's data during segment merges. > API wise there are things we have to figure out, such as where the > updateDocument() method would fit in, because its deletion part > affects all segments, whereas the new document is only being added to > the new segment. > Of course these sh
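The page-alignment constraint raised in the comment above (madvise wanting page-aligned buffers, virtual files starting mid-page inside a compound file) can be handled with a bit of arithmetic: shrink the advice region to the largest page-aligned span lying entirely inside the virtual file, so the hint never spills onto a neighboring file. A sketch with hypothetical helper names, assuming 4 KiB pages for illustration:

```java
public class AdviceRegion {
    static final long PAGE = 4096; // assume 4 KiB pages for illustration

    // Round an offset up / down to a page boundary.
    static long alignUp(long off)   { return (off + PAGE - 1) & ~(PAGE - 1); }
    static long alignDown(long off) { return off & ~(PAGE - 1); }

    // Largest page-aligned span fully inside the virtual file [start, end).
    // Advice applied to this span cannot touch bytes of the adjacent files
    // packed into the same compound file. Returns null if the virtual file
    // doesn't contain even one full page.
    static long[] innerAligned(long start, long end) {
        long s = alignUp(start);
        long e = alignDown(end);
        return s < e ? new long[] { s, e } : null;
    }
}
```

The trade-off is that the first and last partial pages of the virtual file get no advice at all, which seems acceptable for the segment-merging use case.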
[jira] Commented: (LUCENE-2026) Refactoring of IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12793918#action_12793918 ] Marvin Humphrey commented on LUCENE-2026: - > Very interesting - thanks. So it also factors in how much the page > was used in the past, not just how long it's been since the page was > last used. In theory, I think that means the term dictionary will tend to be favored over the posting lists. In practice... hard to say, it would be difficult to test. :( > Even smallish indexes can see the pages swapped out? Yes, you're right -- the wait time to get at a small term dictionary isn't necessarily small. I've amended my previous post, thanks. > And of course Java pretty much forces threads-as-concurrency (JVM > startup time, hotspot compilation, are costly). Yes. Java does a lot of stuff that most operating systems can also do, but of course provides a coherent platform-independent interface. In Lucy we're going to try to go back to the OS for some of the stuff that Java likes to take over -- provided that we can develop a sane genericized interface using configuration probing and #ifdefs. It's nice that as long as the box is up our OS-as-JVM is always running, so we don't have to worry about its (quite lengthy) startup time. > Right, this is how Lucy would force warming. I think slurp-instead-of-mmap is orthogonal to warming, because we can warm file-backed RAM structures by forcing them into the IO cache, using either the cat-to-dev-null trick or something more sophisticated. The slurp-instead-of-mmap setting would cause warming as a side effect, but the main point would be to attempt to persuade the virtual memory system that certain data structures should have a higher status and not be paged out as quickly. > But, even within that CFS file, these three sub-files will not be > local? Ie you'll still have to hit three pages per "lookup" right? 
They'll be next to each other in the compound file because CompoundFileWriter orders them alphabetically. For big segments, though, you're right that they won't be right next to each other, and you could possibly incur as many as three page faults when retrieving a sort cache value. But what are the alternatives for variable width data like strings? You need the ords array anyway for efficient comparisons, so what's left are the offsets array and the character data. An array of String objects isn't going to have better locality than one solid block of memory dedicated to offsets and another solid block of memory dedicated to file data, and it's no fewer derefs even if the string object stores its character data inline -- more if it points to a separate allocation (like Lucy's CharBuf does, since it's mutable). For each sort cache value lookup, you're going to need to access two blocks of memory. * With the array of String objects, the first is the memory block dedicated to the array, and the second is the memory block dedicated to the String object itself, which contains the character data. * With the file-backed block sort cache, the first memory block is the offsets array, and the second is the character data array. I think the locality costs should be approximately the same... have I missed anything? > Write-once is good for Lucene too. Hellyeah. > And it seems like Lucy would not need anything crazy-os-specific wrt > threads? It depends on how many classes we want to make thread-safe, and it's not just the OS, it's the host. The bare minimum is simply to make Lucy thread-safe as a library. That's pretty close, because Lucy studiously avoided global variables whenever possible. The only problems that have to be addressed are the VTable_registry Hash, race conditions when creating new subclasses via dynamic VTable singletons, and refcounts on the VTable objects themselves. 
Once those issues are taken care of, you'll be able to use Lucy objects in separate threads with no problem, e.g. one Searcher per thread. However, if you want to *share* Lucy objects (other than VTables) across threads, all of a sudden we have to start thinking about "synchronized", "volatile", etc. Such constructs may not be efficient or even possible under some threading models. > Hmm I'd guess that field cache is slowish; deleted docs & norms are > very fast; terms index is somewhere in between. That jibes with my own experience. So maybe consider file-backed sort caches in Lucene, while keeping the status quo for everything else? > You're right, you'd get two readers for seg_12 in that case. By > "pool" I meant you're tapping into all the sub-readers that the > existing reader have opened - the rea
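The two-blocks-per-lookup argument in the message above can be made concrete. This is a heap-backed sketch in the spirit of the ords/offsets/character-data layout discussed in the thread -- not Lucy's or Lucene's actual code; in the real thing the three arrays would be file-backed (mmap'd) rather than allocated here:

```java
import java.nio.charset.StandardCharsets;

public class StringSortCacheSketch {
    private final int[] ords;      // doc id -> position in sort order
    private final int[] offsets;   // ord -> byte offset into charData
                                   // (length = numValues + 1, sentinel at end)
    private final byte[] charData; // concatenated UTF-8 values, in sort order

    public StringSortCacheSketch(int[] ords, int[] offsets, byte[] charData) {
        this.ords = ords;
        this.offsets = offsets;
        this.charData = charData;
    }

    // Comparisons touch only the ords array -- no string access at all.
    public int compare(int docA, int docB) {
        return Integer.compare(ords[docA], ords[docB]);
    }

    // A value lookup touches two blocks of memory: the offsets array,
    // then the character data -- the locality pattern discussed above.
    public String value(int docId) {
        int ord = ords[docId];
        return new String(charData, offsets[ord],
                offsets[ord + 1] - offsets[ord], StandardCharsets.UTF_8);
    }
}
```

For example, with values "apple" and "pear" and docs {0: "pear", 1: "apple", 2: "pear"}, the arrays would be ords = {1, 0, 1}, offsets = {0, 5, 9}, and charData = the UTF-8 bytes of "applepear".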
[jira] Issue Comment Edited: (LUCENE-2026) Refactoring of IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12793431#action_12793431 ] Marvin Humphrey edited comment on LUCENE-2026 at 12/23/09 3:54 AM: --- > I guess my confusion is what are all the other benefits of using > file-backed RAM? You can efficiently use process only concurrency > (though shared memory is technically an option for this too), and you > have wicked fast open times (but, you still must warm, just like > Lucene). Processes are Lucy's primary concurrency model. ("The OS is our JVM.") Making process-only concurrency efficient isn't optional -- it's a *core* *concern*. > What else? Oh maybe the ability to inform OS not to cache > eg the reads done when merging segments. That's one I sure wish > Lucene could use... Lightweight searchers mean architectural freedom. Create 2, 10, 100, 1000 Searchers without a second thought -- as many as you need for whatever app architecture you just dreamed up -- then destroy them just as effortlessly. Add another worker thread to your search server without having to consider the RAM requirements of a heavy searcher object. Create a command-line app to search a documentation index without worrying about daemonizing it. Etc. If your normal development pattern is a single monolithic Java process, then that freedom might not mean much to you. But with their low per-object RAM requirements and fast opens, lightweight searchers are easy to use within a lot of other development patterns. For example: lightweight searchers work well for maxing out multiple CPU cores under process-only concurrency. > In exchange you risk the OS making poor choices about what gets > swapped out (LRU policy is too simplistic... not all pages are created > equal), The Linux virtual memory system, at least, is not a pure LRU. 
It utilizes a page aging algo which prioritizes pages that have historically been accessed frequently even when they have not been accessed recently: {panel} http://sunsite.nus.edu.sg/LDP/LDP/tlk/node40.html The default action when a page is first allocated, is to give it an initial age of 3. Each time it is touched (by the memory management subsystem) its age is increased by 3 to a maximum of 20. Each time the Kernel swap daemon runs it ages pages, decrementing their age by 1. {panel} And while that system may not be ideal from our standpoint, it's still pretty good. In general, the operating system's virtual memory scheme is going to work fine as designed, for us and everyone else, and minimize memory availability wait times. When will swapping out the term dictionary be a problem? * For indexes where queries are made frequently, no problem. * For systems with plenty of RAM, no problem. * For systems that aren't very busy, no problem. * -For small indexes, no problem.- The only situation we're talking about is infrequent queries against -large- indexes on busy boxes where RAM isn't abundant. Under those circumstances, it *might* be noticeable that Lucy's term dictionary gets paged out somewhat sooner than Lucene's. But in general, if the term dictionary gets paged out, so what? Nobody was using it. Maybe nobody will make another query against that index until next week. Maybe the OS made the right decision. OK, so there's a vulnerable bubble where the query rate against -a large index- an index is neither too fast nor too slow, on busy machines where RAM isn't abundant. I don't think that bubble ought to drive major architectural decisions. Let me turn your question on its head. What does Lucene gain in return for the slow index opens and large process memory footprint of its heavy searchers? > I do love how pure the file-backed RAM approach is, but I worry that > down the road it'll result in erratic search performance in certain > app profiles. 
If necessary, there's a straightforward remedy: slurp the relevant files into RAM at object construction rather than mmap them. The rest of the code won't know the difference between malloc'd RAM and mmap'd RAM. The slurped files won't take up any more space than the analogous Lucene data structures; more likely, they'll take up less. That's the kind of setting we'd hide away in the IndexManager class rather than expose as prominent API, and it would be a hint to index components rather than an edict. > Yeah, that you need 3 files for the string sort cache is a little > spooky... that's 3X the chance of a page fault. Not when using the compound format. > But the CFS construction must also go through the filesystem (like > Lucene) right? So you still incur IO load of creating the smal
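The "slurp instead of mmap" remedy described above is easy to sketch in NIO terms: both paths hand back a read-only ByteBuffer, so code downstream genuinely can't tell malloc'd RAM from mmap'd RAM. The class name and the boolean hint are hypothetical stand-ins for whatever the IndexManager setting would look like:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class FileBacking {
    // Open a file either slurped into process RAM or memory-mapped.
    // Callers receive a read-only ByteBuffer either way and can't tell
    // the difference -- which is the point of the remedy above.
    public static ByteBuffer open(Path path, boolean slurp) throws IOException {
        try (FileChannel ch = FileChannel.open(path, StandardOpenOption.READ)) {
            if (slurp) {
                // Copy the whole file into process RAM up front, pinning it
                // against the VM system's paging decisions.
                ByteBuffer buf = ByteBuffer.allocate((int) ch.size());
                while (buf.hasRemaining() && ch.read(buf) >= 0) { }
                buf.flip();
                return buf.asReadOnlyBuffer();
            }
            // Let the VM system page the file in and out on demand.
            // (The mapping stays valid after the channel is closed.)
            return ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
        }
    }
}
```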
[jira] Commented: (LUCENE-2026) Refactoring of IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12793431#action_12793431 ] Marvin Humphrey commented on LUCENE-2026: - > I guess my confusion is what are all the other benefits of using > file-backed RAM? You can efficiently use process only concurrency > (though shared memory is technically an option for this too), and you > have wicked fast open times (but, you still must warm, just like > Lucene). Processes are Lucy's primary concurrency model. ("The OS is our JVM.") Making process-only concurrency efficient isn't optional -- it's a *core* *concern*. > What else? Oh maybe the ability to inform OS not to cache > eg the reads done when merging segments. That's one I sure wish > Lucene could use... Lightweight searchers mean architectural freedom. Create 2, 10, 100, 1000 Searchers without a second thought -- as many as you need for whatever app architecture you just dreamed up -- then destroy them just as effortlessly. Add another worker thread to your search server without having to consider the RAM requirements of a heavy searcher object. Create a command-line app to search a documentation index without worrying about daemonizing it. Etc. If your normal development pattern is a single monolithic Java process, then that freedom might not mean much to you. But with their low per-object RAM requirements and fast opens, lightweight searchers are easy to use within a lot of other development patterns. For example: lightweight searchers work well for maxing out multiple CPU cores under process-only concurrency. > In exchange you risk the OS making poor choices about what gets > swapped out (LRU policy is too simplistic... not all pages are created > equal), The Linux virtual memory system, at least, is not a pure LRU. 
It utilizes a page aging algo which prioritizes pages that have historically been accessed frequently even when they have not been accessed recently: {panel} http://sunsite.nus.edu.sg/LDP/LDP/tlk/node40.html The default action when a page is first allocated, is to give it an initial age of 3. Each time it is touched (by the memory management subsystem) its age is increased by 3 to a maximum of 20. Each time the Kernel swap daemon runs it ages pages, decrementing their age by 1. {panel} And while that system may not be ideal from our standpoint, it's still pretty good. In general, the operating system's virtual memory scheme is going to work fine as designed, for us and everyone else, and minimize memory availability wait times. When will swapping out the term dictionary be a problem? * For indexes where queries are made frequently, no problem. * For systems with plenty of RAM, no problem. * For systems that aren't very busy, no problem. * For small indexes, no problem. The only situation we're talking about is infrequent queries against large indexes on busy boxes where RAM isn't abundant. Under those circumstances, it *might* be noticeable that Lucy's term dictionary gets paged out somewhat sooner than Lucene's. But in general, if the term dictionary gets paged out, so what? Nobody was using it. Maybe nobody will make another query against that index until next week. Maybe the OS made the right decision. OK, so there's a vulnerable bubble where the query rate against a large index is neither too fast nor too slow, on busy machines where RAM isn't abundant. I don't think that bubble ought to drive major architectural decisions. Let me turn your question on its head. What does Lucene gain in return for the slow index opens and large process memory footprint of its heavy searchers? > I do love how pure the file-backed RAM approach is, but I worry that > down the road it'll result in erratic search performance in certain > app profiles. 
If necessary, there's a straightforward remedy: slurp the relevant files into RAM at object construction rather than mmap them. The rest of the code won't know the difference between malloc'd RAM and mmap'd RAM. The slurped files won't take up any more space than the analogous Lucene data structures; more likely, they'll take up less. That's the kind of setting we'd hide away in the IndexManager class rather than expose as prominent API, and it would be a hint to index components rather than an edict. > Yeah, that you need 3 files for the string sort cache is a little > spooky... that's 3X the chance of a page fault. Not when using the compound format. > But the CFS construction must also go through the filesystem (like > Lucene) right? So you still incur IO load of creating the small > files, then 2nd pass to consolidate. Yes. > I think we m
[jira] Commented: (LUCENE-2026) Refactoring of IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792939#action_12792939 ] Marvin Humphrey commented on LUCENE-2026: - > But, that's where Lucy presumably takes a perf hit. Lucene can share > these in RAM, not using the filesystem as the intermediary (eg we do > that today with deletions; norms/field cache/eventual CSF can do the > same.) Lucy must go through the filesystem to share. For a flush(), I don't think there's a significant penalty. The only extra costs Lucy will pay are the bookkeeping costs to update the file system state and to create the objects that read the index data. Those are real, but since we're skipping the fsync(), they're small. As far as the actual data, I don't see that there's a difference. Reading from memory mapped RAM isn't any slower than reading from malloc'd RAM. If we have to fsync(), there'll be a cost, but in Lucene you have to pay that same cost, too. Lucene expects to get around it with IndexWriter.getReader(). In Lucy, we'll get around it by having you call flush() and then reopen a reader somewhere, often in another process. * In both cases, the availability of fresh data is decoupled from the fsync. * In both cases, the indexing process has to be careful about dropping data on the floor before a commit() succeeds. * In both cases, it's possible to protect against unbounded corruption by rolling back to the last commit. > Mostly I was thinking performance, ie, trusting the OS to make good > decisions about what should be RAM resident, when it has limited > information... Right, for instance because we generally can't force the OS to pin term dictionaries in RAM, as discussed a while back. It's not an ideal situation, but Lucene's approach isn't bulletproof either, since Lucene's term dictionaries can get paged out too. 
We're sure not going to throw away all the advantages of mmap and go back to reading data structures into process RAM just because of that. > But, also risky is that all important data structures must be "file-flat", > though in practice that doesn't seem like an issue so far? It's a constraint. For instance, to support mmap, string sort caches currently require three "files" each: ords, offsets, and UTF-8 character data. The compound file system makes the file proliferation bearable, though. And it's actually nice in a way to have data structures as named files, strongly separated from each other and persistent. If we were willing to ditch portability, we could cast to arrays of structs in Lucy -- but so far we've just used primitives. I'd like to keep it that way, since it would be nice if the core Lucy file format was at least theoretically compatible with a pure Java implementation. But Lucy plugins could break that rule and cast to structs if desired. > The RAM resident things Lucene has - norms, deleted docs, terms index, field > cache - seem to "cast" just fine to file-flat. There are often benefits to keeping stuff "file-flat", particularly when the file-flat form is compressed. If we were to expand those sort caches to string objects, they'd take up more RAM than they do now. I think the only significant drawback is security: we can't trust memory mapped data the way we can data which has been read into process RAM and checked on the way in. For instance, we need to perform UTF-8 sanity checking each time a string sort cache value escapes the controlled environment of the cache reader. If the sort cache value was instead derived from an existing string in process RAM, we wouldn't need to check it. > If we switched to an FST for the terms index I guess that could get > tricky... Hmm, I haven't been following that. 
Too much work to keep up with those giganto patches for flex indexing, even though it's a subject I'm intimately acquainted with and deeply interested in. I plan to look it over when you're done and see if we can simplify it. :) > Wouldn't shared memory be possible for process-only concurrent models? IPC is a platform-compatibility nightmare. By restricting ourselves to communicating via the file system, we save ourselves oodles of engineering time. And on really boring, frustrating work, to boot. > Also, what popular systems/environments have this requirement (only process > level concurrency) today? Perl's threads suck. Actually all threads suck. Perl's are just worse than average -- and so many Perl binaries are compiled without them. Java threads suck less, but they still suck -- look how much engineering time you folks blow on managing that stuff. Threads are a terrible p
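The UTF-8 sanity check mentioned above maps naturally onto a strict decoder: configure it to REPORT malformed input so corrupt memory-mapped bytes raise an error instead of silently escaping as replacement characters. A sketch (the class and method names are mine, not Lucy's):

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Check {
    // Decode bytes that came from an untrusted (e.g. memory-mapped) source.
    // REPORT makes invalid UTF-8 throw rather than being replaced silently.
    public static String decodeChecked(ByteBuffer mapped)
            throws CharacterCodingException {
        CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        // duplicate() so the caller's buffer position is left untouched.
        return dec.decode(mapped.duplicate()).toString();
    }
}
```

A fresh decoder per call sidesteps CharsetDecoder's lack of thread safety; a real implementation would likely pool one per thread instead.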
[jira] Commented: (LUCENE-2026) Refactoring of IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792638#action_12792638 ] Marvin Humphrey commented on LUCENE-2026: - Yes, this is using the sort cache model worked out this spring on lucy-dev. The memory mapping happens within FSFileHandle (LUCY-83). SortWriter and SortReader haven't made it into the Lucy repository yet. > Refactoring of IndexWriter > -- > > Key: LUCENE-2026 > URL: https://issues.apache.org/jira/browse/LUCENE-2026 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Michael Busch >Assignee: Michael Busch >Priority: Minor > Fix For: 3.1 > > > I've been thinking for a while about refactoring the IndexWriter into > two main components. > One could be called a SegmentWriter and as the > name says its job would be to write one particular index segment. The > default one just as today will provide methods to add documents and > flushes when its buffer is full. > Other SegmentWriter implementations would do things like e.g. appending or > copying external segments [what addIndexes*() currently does]. > The second component's job would it be to manage writing the segments > file and merging/deleting segments. It would know about > DeletionPolicy, MergePolicy and MergeScheduler. Ideally it would > provide hooks that allow users to manage external data structures and > keep them in sync with Lucene's data during segment merges. > API wise there are things we have to figure out, such as where the > updateDocument() method would fit in, because its deletion part > affects all segments, whereas the new document is only being added to > the new segment. > Of course these should be lower level APIs for things like parallel > indexing and related use cases. That's why we should still provide > easy to use APIs like today for people who don't need to care about > per-segment ops during indexing. 
So the current IndexWriter could > probably keep most of its APIs and delegate to the new classes. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (LUCENE-2026) Refactoring of IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792625#action_12792625 ] Marvin Humphrey commented on LUCENE-2026: - > Well, autoCommit just means "periodically call commit". So, if you > decide to offer a commit() operation, then autoCommit would just wrap > that? But, I don't think autoCommit should be offered... app should > decide. Agreed, autoCommit had benefits under legacy Lucene, but wouldn't be important now. If we did add some sort of "automatic commit" feature, it would mean something else: commit every change instantly. But that's easy to implement via a wrapper, so there's no point cluttering the primary index writer class to support such a feature. > Again: NRT is not a "specialized reader". It's a normal read-only > DirectoryReader, just like you'd get from IndexReader.open, with the > only difference being that it consulted IW to find which segments to > open. Plus, it's pooled, so that if IW already has a given segment > reader open (say because deletes were applied or merges are running), > it's reused. Well, it seems to me that those two features make it special -- particularly the pooling of SegmentReaders. You can't take advantage of that outside the context of IndexWriter: > Yes, Lucene's approach must be in the same JVM. But we get important > gains from this - reusing a single reader (the pool), carrying over > merged deletions directly in RAM (and eventually field cache & norms > too - LUCENE-1785). Exactly. In my view, that's what makes that reader "special": unlike ordinary Lucene IndexReaders, this one springs into being with its caches already primed rather than in need of lazy loading. But to achieve those benefits, you have to mod the index writing process. Those modifications are not necessary under the Lucy model, because the mere act of writing the index stores our data in the system IO cache. 
> Instead, Lucy (by design) must do all sharing & access all index data > through the filesystem (a decision, I think, could be dangerous), > which will necessarily increase your reopen time. Dangerous in what sense? Going through the file system is a tradeoff, sure -- but it's pretty nice to design your low-latency search app free from any concern about whether indexing and search need to be coordinated within a single process. Furthermore, if separate processes are your primary concurrency model, going through the file system is actually mandatory to achieve best performance on a multi-core box. Lucy won't always be used with multi-threaded hosts. I actually think going through the file system is dangerous in a different sense: it puts pressure on the file format spec. The easy way to achieve IPC between writers and readers will be to dump stuff into one of the JSON files to support the killer-feature-du-jour -- such as what I'm proposing with this "fsync" key in the snapshot file. But then we wind up with a bunch of crap cluttering up our index metadata files. I'm determined that Lucy will have a more coherent file format than Lucene, but with this IPC requirement we're setting our community up to push us in the wrong direction. If we're not careful, we could end up with a file format that's an unmaintainable jumble. But you're talking performance, not complexity costs, right? > Maybe in practice that cost is small though... the OS write cache should > keep everything fresh... but you still must serialize. Anecdotally, at Eventful one of our indexes is 5 GB with 16 million records and 900 MB worth of sort cache data; opening a fresh searcher and loading all sort caches takes circa 21 ms. There's room to improve that further -- we haven't yet implemented IndexReader.reopen() -- but that was fast enough to achieve what we wanted to achieve. 
> Refactoring of IndexWriter > -- > > Key: LUCENE-2026 > URL: https://issues.apache.org/jira/browse/LUCENE-2026 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Michael Busch >Assignee: Michael Busch >Priority: Minor > Fix For: 3.1 > > > I've been thinking for a while about refactoring the IndexWriter into > two main components. > One could be called a SegmentWriter and as the > name says its job would be to write one particular index segment. The > default one just as today will provide methods to add documents and > flushes when its buffer is full. > Other SegmentWriter implementations would do things like e.g. appending
[jira] Commented: (LUCENE-2026) Refactoring of IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12791549#action_12791549 ] Marvin Humphrey commented on LUCENE-2026: - >> Wasn't that a possibility under autocommit as well? All it takes is for the >> OS to finish flushing the new snapshot file to persistent storage before it >> finishes flushing a segment data file needed by that snapshot, and for the >> power failure to squeeze in between. > > Not after LUCENE-1044... autoCommit simply called commit() at certain > opportune times (after finish big merges), which does the right thing (I > hope!). The segments file is not written until all files it references are > sync'd. FWIW, autoCommit doesn't really have a place in Lucy's one-segment-per-indexing-session model. Revisiting the LUCENE-1044 threads, one passage stood out: {panel} http://www.gossamer-threads.com/lists/lucene/java-dev/54321#54321 This is why in a db system, the only file that is sync'd is the log file - all other files can be made "in sync" from the log file - and this file is normally striped for optimum write performance. Some systems have special "log file drives" (some even solid state, or battery backed ram) to aid the performance. {panel} The fact that we have to sync all files instead of just one seems sub-optimal. Yet Lucene is not well set up to maintain a transaction log. The very act of adding a document to Lucene is inherently lossy even if all fields are stored, because doc boost is not preserved. > Also, having the app explicitly decouple these two notions keeps the > door open for future improvements. If we force absolutely all sharing > to go through the filesystem then that limits the improvements we can > make to NRT. However, Lucy has much more to gain going through the file system than Lucene does, because we don't necessarily incur JVM startup costs when launching a new process. 
The Lucene approach to NRT -- specialized reader hanging off of writer -- is constrained to a single process. The Lucy approach -- fast index opens enabled by mmap-friendly index formats -- is not.

The two approaches aren't mutually exclusive. It will be possible to augment Lucy with a specialized index reader within a single process. However, A) there seems to be a lot of disagreement about just how to integrate that reader, and B) there seem to be ways to bolt that functionality on top of the existing classes. Under those circumstances, I think it makes more sense to keep that feature external for now.

> Alternatively, you could keep the notion "flush" (an unsafe commit)
> alive? You write the segments file, but make no effort to ensure its
> durability (and also preserve the last "true" commit). Then a normal
> IR.reopen suffices...

That sounds promising. The semantics would differ from those of Lucene's flush(), which doesn't make changes visible.

We could implement this by somehow marking a "committed" snapshot and a "flushed" snapshot differently, either by adding an "fsync" property to the snapshot file that would be false after a flush() but true after a commit(), or by encoding the property within the snapshot filename. The file purger would have to ensure that all index files referenced by either the last committed snapshot or the last flushed snapshot were off limits. A rollback() would zap all changes since the last commit().

Such a scheme allows the top-level app to avoid the costs of fsync while maintaining its own transaction log -- perhaps with the optimizations suggested above (separate disk, SSD, etc).
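The purging rule described above -- everything referenced by either the last committed snapshot or the last flushed snapshot is off limits -- reduces to a set union. A minimal sketch (class and method names hypothetical, not actual Lucy code):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical file purger: a file may be zapped only if NEITHER the
// last committed snapshot NOR the last flushed (unsynced) snapshot
// references it.
public class FilePurger {
    static Set<String> purgeable(Set<String> allFiles,
                                 Set<String> lastCommitted,
                                 Set<String> lastFlushed) {
        Set<String> protectedFiles = new HashSet<>(lastCommitted);
        protectedFiles.addAll(lastFlushed);       // union of both snapshots
        Set<String> candidates = new HashSet<>(allFiles);
        candidates.removeAll(protectedFiles);     // everything else is fair game
        return candidates;
    }

    public static void main(String[] args) {
        Set<String> all = new HashSet<>(Arrays.asList("seg_1", "seg_2", "seg_3"));
        Set<String> committed = new HashSet<>(Arrays.asList("seg_1"));
        Set<String> flushed = new HashSet<>(Arrays.asList("seg_1", "seg_3"));
        System.out.println(purgeable(all, committed, flushed)); // [seg_2]
    }
}
```

A rollback() would then amount to discarding the flushed snapshot and re-running the purge against the committed one alone.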
> Refactoring of IndexWriter > -- > > Key: LUCENE-2026 > URL: https://issues.apache.org/jira/browse/LUCENE-2026 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Michael Busch >Assignee: Michael Busch >Priority: Minor > Fix For: 3.1 > > > I've been thinking for a while about refactoring the IndexWriter into > two main components. > One could be called a SegmentWriter and as the > name says its job would be to write one particular index segment. The > default one just as today will provide methods to add documents and > flushes when its buffer is full. > Other SegmentWriter implementations would do things like e.g. appending or > copying external segments [what addIndexes*() currently does]. > The second component's job would it be to manage writing the segments > file and merging/deleting segments. It would know about > DeletionPolicy, MergePolicy and Me
[jira] Commented: (LUCENE-2026) Refactoring of IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789905#action_12789905 ] Marvin Humphrey commented on LUCENE-2026:

> I think that's a poor default (trades safety for performance), unless
> Lucy eg uses a transaction log so you can concretely bound what's lost
> on crash/power loss. Or, if you go back to autocommitting I guess...

Search indexes should not be used for canonical data storage -- they should be built *on top of* canonical data storage. Guarding against power-failure-induced corruption in a database is an imperative. Guarding against power-failure-induced corruption in a search index is a feature, not an imperative.

Users have many options for dealing with the potential for such corruption. You can go back to your canonical data store and rebuild your index from scratch when it happens. In a search cluster environment, you can rsync a known-good copy from another node. Potentially, you might enable fsync-before-commit and keep your own transaction log. However, if the time it takes to rebuild or recover an index from scratch would have caused you unacceptable downtime, you can't possibly be operating in a single-point-of-failure environment where a power failure could take you down anyway -- so other recovery options are available to you.

Turning on fsync is only one step towards ensuring index integrity; other steps involve making decisions about hard drives, RAID arrays, failover strategies, network and off-site backups, etc., and are outside of our domain as library authors. We cannot meet the needs of users who need guaranteed index integrity on our own. For everybody else, what turning on fsync by default achieves is to make an exceedingly rare event rarer. That's valuable, but not essential.
My argument is that since the search indexes should not be used for canonical storage, and since fsync is not testably reliable and not sufficient on its own, it's a good engineering compromise to prioritize performance. > If we did this in Lucene, you can have unbounded corruption. It's not > just the last few minutes of updates... Wasn't that a possibility under autocommit as well? All it takes is for the OS to finish flushing the new snapshot file to persistent storage before it finishes flushing a segment data file needed by that snapshot, and for the power failure to squeeze in between. In practice, locality of reference is going to make the window very very small, since those two pieces of data will usually get written very close to each other on the persistent media. I've seen a lot more messages to our user lists over the years about data corruption caused by bugs and misconfigurations than by power failures. But really, that's as it should be. Ensuring data integrity to the degree required by a database is costly -- it requires far more rigorous testing, and far more conservative development practices. If we accept that our indexes must *never* go corrupt, it will retard innovation. Of course we should work very hard to prevent index corruption. However, I'm much more concerned about stuff like silent omission of search results due to overzealous, overly complex optimizations than I am about problems arising from power failures. When a power failure occurs, you know it -- so you get the opportunity to fsck the disk, run checkIndex(), perform data integrity reconciliation tests against canonical storage, and if anything fails, take whatever recovery actions you deem necessary. > You don't need to turn off sync for NRT - that's the whole point. It > gives you a reader without syncing the files. I suppose this is where Lucy and Lucene differ. Thanks to mmap and the near-instantaneous reader opens it has enabled, we don't need to keep a special reader alive. 
Since there's no special reader, the only way to get data to a search process is to go through a commit. But if we fsync on every commit, we'll drag down indexing responsiveness. Finishing the commit and returning control to client code as quickly as possible is a high priority for us.

Furthermore, I don't want us to have to write the code to support a near-real-time reader hanging off of IndexWriter a la Lucene. The architectural discussions have made for very interesting reading, but the design seems to be tricky to pull off, and implementation simplicity in core search code is a high priority for Lucy. It's better for Lucy to kill two birds with one stone and concentrate on making *all* index opens fast.

> Really, this is your safety tradeoff - it means you can commit less
> frequently, since the NRT reader can search the latest updates. But, your
> app has complete control over how it wants to trade safety for
> performance. So long
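The near-instantaneous opens referred to above are what memory mapping buys: mapping a file costs no up-front read I/O, and pages fault in lazily from the OS cache, so a reader over a warm file springs up almost for free. A toy illustration using Java's own mmap facility (Lucy itself does this in C via mmap(2); the file here is a stand-in for a sort-cache file):

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MmapOpen {
    public static void main(String[] args) throws IOException {
        // Stand-in for an index data file already sitting in the OS cache.
        Path f = Files.createTempFile("sortcache", ".dat");
        Files.write(f, new byte[] {1, 2, 3, 4});

        try (FileChannel ch = FileChannel.open(f, StandardOpenOption.READ)) {
            // The map() call itself reads nothing; data is paged in on access.
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            System.out.println(buf.get(0) + buf.get(3)); // 5
        }
        Files.delete(f);
    }
}
```

With a format designed around such mappings, "opening" sort caches is mostly a matter of wiring pointers into already-cached pages rather than decoding and copying data.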
[jira] Commented: (LUCENE-2126) Split up IndexInput and IndexOutput into DataInput and DataOutput
[ https://issues.apache.org/jira/browse/LUCENE-2126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789895#action_12789895 ] Marvin Humphrey commented on LUCENE-2126:

> I disagree with you here: introducing DataInput/Output makes IMO the API
> actually easier for the "normal" user to understand.
>
> I would think that most users don't implement IndexInput/Output extensions,
> but simply use the out-of-the-box Directory implementations, which provide
> IndexInput/Output impls. Also, most users probably don't even call the
> IndexInput/Output APIs directly.

I agree with everything you say in the second paragraph, but I don't see how any of that supports the assertion you make in the first paragraph.

Lucene's file system has a directory class, named "Directory", and a pair of classes which represent files, named "IndexInput" and "IndexOutput". Directories and files. Easy to understand.

All common IO systems have entities which represent data streaming to/from a file. They might be called "file handles", "file descriptors", "readers" and "writers", "streams", or whatever, but they're all basically the same thing. What this patch does is fragment the pair of classes that represent file IO... why?

What does a "normal" user do with a file?

Step 1: Open the file.
Step 2: Write data to the file.
Step 3: Close the file.

Then, later...

Step 1: Open the file.
Step 2: Read data from the file.
Step 3: Close the file.

You're saying that Lucene's file abstraction is easier to understand if you break that up?

I grokked your first rationale -- that you don't want people to be able to call close() on an IndexInput that they're essentially borrowing for a bit. OK, I think it's overkill to create an entire class to thwart something nobody was going to do anyway, but at least I understand why you might want to do that. But the idea that this strange fragmentation of the IO hierarchy makes things *easier* -- I don't get it at all.
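The two three-step lifecycles above, shown with the JDK's own stream classes standing in for Lucene's IndexOutput/IndexInput (this is plain java.io, not Lucene code) -- one entity per direction, opened, used, and closed:

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class FileLifecycle {
    public static void main(String[] args) throws IOException {
        Path f = Files.createTempFile("demo", ".bin");

        // Step 1: open. Step 2: write data. Step 3: close (via try-with-resources).
        try (DataOutputStream out = new DataOutputStream(Files.newOutputStream(f))) {
            out.writeInt(42);
        }

        // Then, later... Step 1: open. Step 2: read data. Step 3: close.
        try (DataInputStream in = new DataInputStream(Files.newInputStream(f))) {
            System.out.println(in.readInt()); // 42
        }
        Files.delete(f);
    }
}
```

One writer-side class and one reader-side class cover the whole lifecycle; that is the model the comment argues for keeping intact.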
And I certainly don't see how it's such an improvement over what exists now that it justifies a change to the public API. > Split up IndexInput and IndexOutput into DataInput and DataOutput > - > > Key: LUCENE-2126 > URL: https://issues.apache.org/jira/browse/LUCENE-2126 > Project: Lucene - Java > Issue Type: Improvement >Affects Versions: Flex Branch >Reporter: Michael Busch >Assignee: Michael Busch >Priority: Minor > Fix For: Flex Branch > > Attachments: lucene-2126.patch > > > I'd like to introduce the two new classes DataInput and DataOutput > that contain all methods from IndexInput and IndexOutput that actually > decode or encode data, such as readByte()/writeByte(), > readVInt()/writeVInt(). > Methods like getFilePointer(), seek(), close(), etc., which are not > related to data encoding, but to files as input/output source stay in > IndexInput/IndexOutput. > This patch also changes ByteSliceReader/ByteSliceWriter to extend > DataInput/DataOutput. Previously ByteSliceReader implemented the > methods that stay in IndexInput by throwing RuntimeExceptions. > See also LUCENE-2125. > All tests pass. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2026) Refactoring of IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789614#action_12789614 ] Marvin Humphrey commented on LUCENE-2026: - > I say it's better to sacrifice write guarantee. I don't grok why sync is the default, especially given how sketchy hardware drivers are about obeying fsync: {panel} But, beware: some hardware devices may in fact cache writes even during fsync, and return before the bits are actually on stable storage, to give the appearance of faster performance. {panel} IMO, it should have been an option which defaults to false, to be enabled only by users who have the expertise to ensure that fsync() is actually doing what it advertises. But what's done is done (and Lucy will probably just do something different.) With regard to Lucene NRT, though, turning sync() off would really help. If and when some sort of settings class comes about, an enableSync(boolean enabled) method seems like it would come in handy. > Refactoring of IndexWriter > -- > > Key: LUCENE-2026 > URL: https://issues.apache.org/jira/browse/LUCENE-2026 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Michael Busch >Assignee: Michael Busch >Priority: Minor > Fix For: 3.1 > > > I've been thinking for a while about refactoring the IndexWriter into > two main components. > One could be called a SegmentWriter and as the > name says its job would be to write one particular index segment. The > default one just as today will provide methods to add documents and > flushes when its buffer is full. > Other SegmentWriter implementations would do things like e.g. appending or > copying external segments [what addIndexes*() currently does]. > The second component's job would it be to manage writing the segments > file and merging/deleting segments. It would know about > DeletionPolicy, MergePolicy and MergeScheduler. 
Ideally it would > provide hooks that allow users to manage external data structures and > keep them in sync with Lucene's data during segment merges. > API wise there are things we have to figure out, such as where the > updateDocument() method would fit in, because its deletion part > affects all segments, whereas the new document is only being added to > the new segment. > Of course these should be lower level APIs for things like parallel > indexing and related use cases. That's why we should still provide > easy to use APIs like today for people who don't need to care about > per-segment ops during indexing. So the current IndexWriter could > probably keeps most of its APIs and delegate to the new classes. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2126) Split up IndexInput and IndexOutput into DataInput and DataOutput
[ https://issues.apache.org/jira/browse/LUCENE-2126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788098#action_12788098 ] Marvin Humphrey commented on LUCENE-2126:

> These methods should only be able to call the read/write methods (which this
> issue moves to DataInput/Output), but not methods like close(), seek() etc..

Ah, so that's what it is. In that case, let me vote my (non-binding) -1. I don't believe that the enforcement of such a restriction justifies the complexity cost of adding a new class to the public API.

First, adding yet another class to the hierarchy steepens the learning curve for users and contributors. If you aren't in the rarefied echelon of exceptional brilliance occupied by people named Michael who work for IBM :), the gradual accumulation of complexity in the Lucene code base matters. Inch by inch, things move out of reach.

Second, changing things now for what seems to me like a minor reason makes it harder to refactor the class hierarchy in the future when other, more important reasons are inevitably discovered.

For LUCENE-2125, I recommend two possible options.

* Do nothing and assume that the sort of advanced user who writes a posting codec won't do something incredibly stupid like call indexInput.close().
* Add a note to the docs for writing posting codecs indicating which sorts of IO methods you ought not to call.

> once we see a need to allow users to extend DataInput/Output outside of
> Lucene we can go ahead and make the additional changes that are mentioned in
> your and in my comments here.

In Lucy, there are three tiers of IO usage:

* For low-level IO, use FileHandle.
* For most applications, use InStream's encoder/decoder methods.
* For performance-critical inner-loop material (e.g.
posting decoders, SortCollector), access the raw memory-mapped IO buffer using InStream_Buf()/InStream_Advance_Buf() and use static inline functions such as NumUtil_decode_c32 (which does no bounds checking) from Lucy::Util::NumberUtils. While you can extend InStream to add a codec, that's not generally the best way to go about it, because adding a method to InStream requires that all of your users both use your InStream class and use a subclassed Folder which overrides the Folder_Open_In() factory method (analogous to Directory.openInput()). Better is to use the extension point provided by InStream_Buf()/InStream_Advance_Buf() and write a utility function which accepts an InStream as an argument. I don't expect and am not advocating that Lucene adopt the same IO hierarchy as Lucy, but I wanted to provide an example of other reasons why you might change things. (What I'd really like to see is for Lucene to come up with something *better* than the Lucy IO hierarchy.) One of the reasons Lucene has so many backwards compatibility headaches is because the public APIs are so extensive and thus constitute such an elaborate set of backwards compatibility promises. IMO, DataInput and DataOutput do not offer sufficient benefit to compensate for the increased intricacy they add to that backwards compatibility contract. > Split up IndexInput and IndexOutput into DataInput and DataOutput > - > > Key: LUCENE-2126 > URL: https://issues.apache.org/jira/browse/LUCENE-2126 > Project: Lucene - Java > Issue Type: Improvement >Affects Versions: Flex Branch >Reporter: Michael Busch >Assignee: Michael Busch >Priority: Minor > Fix For: Flex Branch > > Attachments: lucene-2126.patch > > > I'd like to introduce the two new classes DataInput and DataOutput > that contain all methods from IndexInput and IndexOutput that actually > decode or encode data, such as readByte()/writeByte(), > readVInt()/writeVInt(). 
> Methods like getFilePointer(), seek(), close(), etc., which are not > related to data encoding, but to files as input/output source stay in > IndexInput/IndexOutput. > This patch also changes ByteSliceReader/ByteSliceWriter to extend > DataInput/DataOutput. Previously ByteSliceReader implemented the > methods that stay in IndexInput by throwing RuntimeExceptions. > See also LUCENE-2125. > All tests pass. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2126) Split up IndexInput and IndexOutput into DataInput and DataOutput
[ https://issues.apache.org/jira/browse/LUCENE-2126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12787876#action_12787876 ] Marvin Humphrey commented on LUCENE-2126: - I spent a long time today trying to understand why DataInput and DataOutput are justified so that I could formulate an intelligent reply, but I had to give up. :\ Please carry on. > Split up IndexInput and IndexOutput into DataInput and DataOutput > - > > Key: LUCENE-2126 > URL: https://issues.apache.org/jira/browse/LUCENE-2126 > Project: Lucene - Java > Issue Type: Improvement >Affects Versions: Flex Branch >Reporter: Michael Busch >Assignee: Michael Busch >Priority: Minor > Fix For: Flex Branch > > Attachments: lucene-2126.patch > > > I'd like to introduce the two new classes DataInput and DataOutput > that contain all methods from IndexInput and IndexOutput that actually > decode or encode data, such as readByte()/writeByte(), > readVInt()/writeVInt(). > Methods like getFilePointer(), seek(), close(), etc., which are not > related to data encoding, but to files as input/output source stay in > IndexInput/IndexOutput. > This patch also changes ByteSliceReader/ByteSliceWriter to extend > DataInput/DataOutput. Previously ByteSliceReader implemented the > methods that stay in IndexInput by throwing RuntimeExceptions. > See also LUCENE-2125. > All tests pass. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2126) Split up IndexInput and IndexOutput into DataInput and DataOutput
[ https://issues.apache.org/jira/browse/LUCENE-2126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786959#action_12786959 ] Marvin Humphrey commented on LUCENE-2126: - FWIW, this approach is sort of the inverse of where we've gone with Lucy. In Lucy, low-level unbuffered IO operations are abstracted into FileHandle, which is either a thin wrapper around a POSIX file descriptor (e.g. FSFileHandle under unixen), or a simulation thereof (e.g. FSFileHandle under Windows, RAMFileHandle). Then there are InStream and OutStream, which would be analogous to DataInput and DataOutput, in that they have all the Lucy-specific encoding/decoding methods. However, instead of requiring that subclasses implement the low-level IO operations, InStream "has a" FileHandle and OutStream "has a" FileHandle. The advantage of breaking out FileHandle as a separate class is that if e.g. you extend InStream by adding on PFOR encoding, you automatically get the benefit for all IO implementations. I think that under the DataInput/DataOutput model, that extension technique will only be available to core devs of Lucene, no? More info: * LUCY-58 FileHandle * LUCY-63 InStream and OutStream > Split up IndexInput and IndexOutput into DataInput and DataOutput > - > > Key: LUCENE-2126 > URL: https://issues.apache.org/jira/browse/LUCENE-2126 > Project: Lucene - Java > Issue Type: Improvement >Affects Versions: Flex Branch >Reporter: Michael Busch >Assignee: Michael Busch >Priority: Minor > Fix For: Flex Branch > > Attachments: lucene-2126.patch > > > I'd like to introduce the two new classes DataInput and DataOutput > that contain all methods from IndexInput and IndexOutput that actually > decode or encode data, such as readByte()/writeByte(), > readVInt()/writeVInt(). > Methods like getFilePointer(), seek(), close(), etc., which are not > related to data encoding, but to files as input/output source stay in > IndexInput/IndexOutput. 
> This patch also changes ByteSliceReader/ByteSliceWriter to extend > DataInput/DataOutput. Previously ByteSliceReader implemented the > methods that stay in IndexInput by throwing RuntimeExceptions. > See also LUCENE-2125. > All tests pass. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
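The "has a" FileHandle arrangement described in the comment above can be transliterated to Java (all names here are hypothetical stand-ins -- Lucy's actual InStream and FileHandle are C): because InStream wraps a FileHandle rather than extending it, a decoder added to InStream automatically works against every FileHandle implementation.

```java
// Hypothetical low-level unbuffered IO abstraction, standing in for
// Lucy's FileHandle.
interface FileHandle {
    int read(); // next byte as an unsigned value
}

// One of many possible impls (file-backed, RAM-backed, ...).
class RamFileHandle implements FileHandle {
    private final byte[] data;
    private int pos;
    RamFileHandle(byte[] data) { this.data = data; }
    public int read() { return data[pos++] & 0xFF; }
}

public class InStream {
    private final FileHandle handle; // composition, not inheritance

    public InStream(FileHandle handle) { this.handle = handle; }

    // A decoding method added here is available for ALL FileHandle
    // impls for free. This one decodes Lucene/Lucy-style VInts:
    // 7 data bits per byte, high bit set means "more bytes follow".
    public int readVInt() {
        int b = handle.read();
        int value = b & 0x7F;
        int shift = 7;
        while ((b & 0x80) != 0) {
            b = handle.read();
            value |= (b & 0x7F) << shift;
            shift += 7;
        }
        return value;
    }

    public static void main(String[] args) {
        // 300 encoded as a VInt is the byte pair 0xAC, 0x02.
        InStream in = new InStream(new RamFileHandle(new byte[]{(byte) 0xAC, 0x02}));
        System.out.println(in.readVInt()); // 300
    }
}
```

Under the subclass-the-stream model, by contrast, the same codec would have to be re-implemented (or re-wired) once per IO implementation.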
Re: [jira] Resolved: (LUCENE-2119) If you pass Integer.MAX_VALUE as 2nd param to search(Query, int) you hit unexpected NegativeArraySizeException
On Sun, Dec 06, 2009 at 05:31:53PM -0500, Erick Erickson wrote:

> This may be a silly question, and I admit that I haven't looked at the code,
> but was there a good reason to +1 it in the first place or was that just
> paranoia to prevent off-by-one errors?

IIRC, this implementation of the priority queue algo leaves open slot 0 to simplify internal calculations. It was that way when I ported 1.4.3, and I doubt that's changed.

> If there *was* a valid reason, might it make sense to
> +1 min(nDocs, maxDoc())?

I think the patch is fine. It's really only needed to provide a more accurate error message in the event somebody specifies that they want Integer.MAX_VALUE elements, not realizing that they will be allocated up front rather than lazily -- they'll get an OOME rather than a NegativeArraySizeException.

Cheers,

Marvin Humphrey

- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
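The failure mode under discussion is plain integer overflow: the queue allocates nDocs + 1 slots (slot 0 left open for the heap arithmetic), and Integer.MAX_VALUE + 1 wraps around to a negative array size. A minimal demonstration:

```java
public class OverflowDemo {
    public static void main(String[] args) {
        int nDocs = Integer.MAX_VALUE;
        try {
            // nDocs + 1 silently wraps to Integer.MIN_VALUE, a negative size.
            Object[] heap = new Object[nDocs + 1];
            System.out.println(heap.length); // never reached
        } catch (NegativeArraySizeException e) {
            System.out.println("NegativeArraySizeException: " + (nDocs + 1));
        }
    }
}
```

The exception is thrown before any memory is allocated, which is why callers saw a confusing NegativeArraySizeException instead of the OOME they would otherwise hit.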
[jira] Commented: (LUCENE-1877) Use NativeFSLockFactory as default for new API (direct ctors & FSDir.open)
[ https://issues.apache.org/jira/browse/LUCENE-1877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781531#action_12781531 ] Marvin Humphrey commented on LUCENE-1877: - >> take it somewhere other than this closed issue. > > Yes, where? The java-dev list: http://markmail.org/message/ivdgmxrivs3jzhfe > Use NativeFSLockFactory as default for new API (direct ctors & FSDir.open) > -- > > Key: LUCENE-1877 > URL: https://issues.apache.org/jira/browse/LUCENE-1877 > Project: Lucene - Java > Issue Type: Improvement > Components: Javadocs >Reporter: Mark Miller >Assignee: Uwe Schindler > Fix For: 2.9 > > Attachments: LUCENE-1877.patch, LUCENE-1877.patch, LUCENE-1877.patch, > LUCENE-1877.patch > > > A user requested we add a note in IndexWriter alerting the availability of > NativeFSLockFactory (allowing you to avoid retaining locks on abnormal jvm > exit). Seems reasonable to me - we want users to be able to easily stumble > upon this class. The below code looks like a good spot to add a note - could > also improve whats there a bit - opening an IndexWriter does not necessarily > create a lock file - that would depend on the LockFactory used. > {code} Opening an IndexWriter creates a lock file for the > directory in use. Trying to open > another IndexWriter on the same directory will lead to a > {...@link LockObtainFailedException}. The {...@link > LockObtainFailedException} > is also thrown if an IndexReader on the same directory is used to delete > documents > from the index.{code} > Anyone remember why NativeFSLockFactory is not the default over > SimpleFSLockFactory? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Socket and file locks
On Sun, Nov 22, 2009 at 10:36:57AM +, Thomas Mueller (JIRA) wrote: > Thomas Mueller commented on LUCENE-1877: > > > > take it somewhere other than this closed issue. > > Yes, where? The java-dev list. > > shouldn't active code like that live in the application layer? > > Why? You can all but guarantee that polling will work at the app layer, because you can have almost full control over process priority. If the polling code is lower down and hidden away, then it worries me that a lock might be swept away by another process, and by the time the original process realizes that it doesn't hold the lock anymore, the damage could already have been done. Unless I'm missing something, it doesn't seem like a failsafe design. But this is "theoretical", I suppose: > I'm just trying to say that in theory, the thread is problematic, but in > practice it isn't. While file locking is not a problem in theory, but in > practice. Heh. :) > > What happens when the app sleeps? > > Good question! Standby / hibernate are not supported. I didn't think about > that. Is there a way to detect the wakeup? Not sure. FYI, I'm only an indirect contributor to Java Lucene. My main projects are Lucy and KinoSearch, loose ports to C. I know the problem domain intimately, but my Java skills are sketchy. > > host name and the pid > > Yes. It is not so easy to get the PID in Java, I found: > http://stackoverflow.com/questions/35842/process-id-in-java > "ManagementFactory.getRuntimeMXBean().getName()". A web search for "java process id" turns up a bazillion hits about how to hack up a PID. How annoying. This seems to me like a case of the perfect being the enemy of the good. How many machines that run Java are running operating systems that have no support for PIDs? Hasn't somebody open sourced a "GiveMeTheFrikkinPID" library yet? > What do you do if the lock was generated by another machine? Require that all machines participating in the writer pool supply a unique host ID as part of the locking API. 
Store that host ID in the lockfile and only allow machines to sweep stale files that they own. Unfortunately, that's not failsafe either, though: misconfiguration leads to index corruption rather than deadlock, when two machines that use identical host IDs sweep each other's lockfiles and write simultaneously. > I tried with using a server socket, so you need the IP address, but > unfortunately, sometimes the network is not configured correctly (but maybe > it's possible to detect that). Maybe the two machines can't access each > other over TCP/IP. This is an intriguing approach. Can it be designed to be failsafe? If the server and the client can't access each other, that's failsafe at least, because the client will simply fail to acquire the lock. But if a client is misconfigured, could it contact the wrong host, successfully open a port that coincidentally happens to be open, believe it has acquired the lock and corrupt the index? If so, could some sort of handshake prevent that? I'm also curious if we can use this approach for read locking. For that, you need a reference counting scheme -- one ref for each reader accessing the index. Is that possible under the socket model? > > hard links > > Yes, but it looks like this doesn't work always. It is theoretically possible for the link() call to return false incorrectly when the hard link has actually been created, for instance because a network problem prevents the "success" packet from getting back to the client from the server. However, this is failsafe, because the requestor will not believe that the lock has been secured and thus won't write. That process won't be able to sweep away the orphaned lock file itself, but once it exits, a graceful recovery will occur. Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
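For what it's worth, the MXBean hack discussed above fits in a few lines. This is only a sketch; the "pid@hostname" shape of the MXBean name is a de facto convention of mainstream JVMs, not a documented guarantee, and the class name here is invented:

```java
import java.lang.management.ManagementFactory;

class PidSniffer {
    // The runtime MXBean name is conventionally "pid@hostname" on
    // mainstream JVMs; parse the leading integer out of it.
    static long pidFromMxBeanName(String name) {
        return Long.parseLong(name.split("@")[0]);
    }

    public static void main(String[] args) {
        String name = ManagementFactory.getRuntimeMXBean().getName();
        System.out.println("pid = " + pidFromMxBeanName(name));
    }
}
```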
[jira] Commented: (LUCENE-1877) Use NativeFSLockFactory as default for new API (direct ctors & FSDir.open)
[ https://issues.apache.org/jira/browse/LUCENE-1877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780647#action_12780647 ] Marvin Humphrey commented on LUCENE-1877: - > http://www.h2database.com/html/advanced.html#file_locking_protocols I'm a little concerned about the suitability of the polling approach for a low-level library like Lucene -- shouldn't active code like that live in the application layer? Is it possible to exceed the polling interval for a low priority process on a system under heavy load? What happens when the app sleeps? For removing stale lock files, another technique is to incorporate the host name and the pid. So long as you can determine that the lock file belongs to your machine and that the PID is not active, you can safely zap it. The tricky bit is how you get that information into the lock file. If you try to write that info to the lock file itself after an O_EXCL open, creating a fully valid lock file is no longer an atomic operation. The approach suggested by the creat(2) man page and endorsed in the Linux NFS FAQ involves hard links: {noformat} The solution for performing atomic file locking using a lockfile is to create a unique file on the same file system (e.g., incorporating hostname and pid), use link(2) to make a link to the lockfile. If link() returns 0, the lock is successful. Otherwise, use stat(2) on the unique file to check if its link count has increased to 2, in which case the lock is also successful. {noformat} This approach should also work on Windows for NTFS file systems since Windows 2000 thanks to the CreateHardLink() function. (Samba file shares, you're out of luck.) However, I'm not sure about the state of support for hard links in Java. If you're interested in continuing this discussion, we should probably take it somewhere other than this closed issue.
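Java 7's java.nio.file does expose hard links, for what it's worth. A minimal sketch of the recipe quoted above (class and method names are mine; the stat(2) link-count fallback for lossy NFS replies is omitted):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

class LinkLock {
    // Atomic lock acquisition per the creat(2)/NFS-FAQ recipe: write a
    // uniquely named file (e.g. "hostname:pid"), then hard-link it to the
    // shared lock name. Only the link() step races; it either succeeds
    // or throws.
    static boolean acquire(Path unique, Path lockFile, String hostAndPid)
            throws IOException {
        Files.write(unique, hostAndPid.getBytes(StandardCharsets.UTF_8));
        try {
            Files.createLink(lockFile, unique);
            return true;
        } catch (FileAlreadyExistsException e) {
            return false; // someone else holds the lock
        }
    }
}
```

This assumes a filesystem that supports hard links (so not FAT or some network shares).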
> Use NativeFSLockFactory as default for new API (direct ctors & FSDir.open) > -- > > Key: LUCENE-1877 > URL: https://issues.apache.org/jira/browse/LUCENE-1877 > Project: Lucene - Java > Issue Type: Improvement > Components: Javadocs >Reporter: Mark Miller >Assignee: Uwe Schindler > Fix For: 2.9 > > Attachments: LUCENE-1877.patch, LUCENE-1877.patch, LUCENE-1877.patch, > LUCENE-1877.patch > > > A user requested we add a note in IndexWriter alerting the availability of > NativeFSLockFactory (allowing you to avoid retaining locks on abnormal jvm > exit). Seems reasonable to me - we want users to be able to easily stumble > upon this class. The below code looks like a good spot to add a note - could > also improve whats there a bit - opening an IndexWriter does not necessarily > create a lock file - that would depend on the LockFactory used. > {code} Opening an IndexWriter creates a lock file for the > directory in use. Trying to open > another IndexWriter on the same directory will lead to a > {@link LockObtainFailedException}. The {@link > LockObtainFailedException} > is also thrown if an IndexReader on the same directory is used to delete > documents > from the index.{code} > Anyone remember why NativeFSLockFactory is not the default over > SimpleFSLockFactory? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2073) Document issues involved in building your index with one jdk version and then searching/updating with another
[ https://issues.apache.org/jira/browse/LUCENE-2073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779015#action_12779015 ] Marvin Humphrey commented on LUCENE-2073: - > I am pretty sure StandardAnalyzer is ok actually now. Good news! Thanks for performing that analysis. > Document issues involved in building your index with one jdk version and then > searching/updating with another > - > > Key: LUCENE-2073 > URL: https://issues.apache.org/jira/browse/LUCENE-2073 > Project: Lucene - Java > Issue Type: Task >Reporter: Mark Miller > Attachments: LUCENE-2073.patch, LUCENE-2073.patch > > > I think this needs to go in something of a permenant spot - this isn't a one > time release type issues - its going to present over multiple release. > {quote} > If there is nothing we can do here, then we just have to do the best we can - > such as a prominent notice alerting that if you transition JVM's between > building and searching the index and you are using or doing X, things will > break. > We should put this in a spot that is always pretty visible - perhaps even a > new readme file titlted something like IndexBackwardCompatibility or > something, to which we can add other tips and gotchyas as they come up. Or > MaintainingIndicesAcrossVersions, or > FancyWhateverGetsYourAttentionAboutUpgradingStuff. Or a permanent > entry/sticky entry at the top of Changes. > {quote} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2073) Document issues involved in building your index with one jdk version and then searching/updating with another
[ https://issues.apache.org/jira/browse/LUCENE-2073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779006#action_12779006 ] Marvin Humphrey commented on LUCENE-2073: - > are you sure? StandardAnalyzer uses LowerCaseFilter, No, I'm not sure. :( I was confusing StandardAnalyzer and StandardTokenizer. I still think that there are a lot of people who don't need to reindex, because, for example, their entire corpus is limited to Latin-1 code points. Conversely, the people most likely to be affected are the people most likely to be on the lookout for this kind of thing. I think it's important to reach this group, without unduly alarming those who don't really need to reindex. Reindexing is a huge pain for some installations. > Document issues involved in building your index with one jdk version and then > searching/updating with another > - > > Key: LUCENE-2073 > URL: https://issues.apache.org/jira/browse/LUCENE-2073 > Project: Lucene - Java > Issue Type: Task >Reporter: Mark Miller > Attachments: LUCENE-2073.patch, LUCENE-2073.patch > > > I think this needs to go in something of a permenant spot - this isn't a one > time release type issues - its going to present over multiple release. > {quote} > If there is nothing we can do here, then we just have to do the best we can - > such as a prominent notice alerting that if you transition JVM's between > building and searching the index and you are using or doing X, things will > break. > We should put this in a spot that is always pretty visible - perhaps even a > new readme file titlted something like IndexBackwardCompatibility or > something, to which we can add other tips and gotchyas as they come up. Or > MaintainingIndicesAcrossVersions, or > FancyWhateverGetsYourAttentionAboutUpgradingStuff. Or a permanent > entry/sticky entry at the top of Changes. > {quote} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. 
[jira] Commented: (LUCENE-2073) Document issues involved in building your index with one jdk version and then searching/updating with another
[ https://issues.apache.org/jira/browse/LUCENE-2073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778998#action_12778998 ] Marvin Humphrey commented on LUCENE-2073: - I like this: > some parts of Lucene ... but I still think the message is a little too aggressive. There are a lot of people just using ye olde StandardAnalyzer, and they don't need to reindex. We don't need to spread our own FUD. :) Can we change it to say "Analyzers", and then refer people to the docs for their specific Analyzer? Alternatively, should that notification just contain a complete list of the affected classes? > Document issues involved in building your index with one jdk version and then > searching/updating with another > - > > Key: LUCENE-2073 > URL: https://issues.apache.org/jira/browse/LUCENE-2073 > Project: Lucene - Java > Issue Type: Task >Reporter: Mark Miller > Attachments: LUCENE-2073.patch, LUCENE-2073.patch > > > I think this needs to go in something of a permenant spot - this isn't a one > time release type issues - its going to present over multiple release. > {quote} > If there is nothing we can do here, then we just have to do the best we can - > such as a prominent notice alerting that if you transition JVM's between > building and searching the index and you are using or doing X, things will > break. > We should put this in a spot that is always pretty visible - perhaps even a > new readme file titlted something like IndexBackwardCompatibility or > something, to which we can add other tips and gotchyas as they come up. Or > MaintainingIndicesAcrossVersions, or > FancyWhateverGetsYourAttentionAboutUpgradingStuff. Or a permanent > entry/sticky entry at the top of Changes. > {quote} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2073) Document issues involved in building your index with one jdk version and then searching/updating with another
[ https://issues.apache.org/jira/browse/LUCENE-2073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778875#action_12778875 ] Marvin Humphrey commented on LUCENE-2073: - Which components are affected by this? I think just Analyzers and query parsers, yes? If that's true, my inclination would be to add a note to the javadocs for each such class. In every case, it's theoretically possible to build alternative implementations which are unaffected by upgrading the JVM. This isn't a fundamental problem with the Lucene architecture; it's an artifact of the way certain classes are implemented. Outside of the affected components, Lucene doesn't get down and dirty with Unicode properties and other fast-moving stuff -- it's just dealing in UTF-8 bytes, Java strings, etc. Those things can change (Modified UTF-8, shudder), but they move on a slower timescale. Arguably, Analyzer subclasses shouldn't be in core for reasons like this. Perhaps there could be an "ICUAnalysis" package which depends on ICU4J, so that Unicode-related index incompatibilities occur when you upgrade your Unicode library. Though most people would probably choose to use the smaller-footprint, zero-dependency "JVMAnalysis" package, where reindexing would be required after a JVM upgrade. The software certifiers wouldn't like that, and I'm not seriously advocating such a disruptive change (yet), but I just wanted to illustrate that this is a contained problem. > Document issues involved in building your index with one jdk version and then > searching/updating with another > - > > Key: LUCENE-2073 > URL: https://issues.apache.org/jira/browse/LUCENE-2073 > Project: Lucene - Java > Issue Type: Task >Reporter: Mark Miller > > I think this needs to go in something of a permenant spot - this isn't a one > time release type issues - its going to present over multiple release. 
> {quote} > If there is nothing we can do here, then we just have to do the best we can - > such as a prominent notice alerting that if you transition JVM's between > building and searching the index and you are using or doing X, things will > break. > We should put this in a spot that is always pretty visible - perhaps even a > new readme file titlted something like IndexBackwardCompatibility or > something, to which we can add other tips and gotchyas as they come up. Or > MaintainingIndicesAcrossVersions, or > FancyWhateverGetsYourAttentionAboutUpgradingStuff. Or a permanent > entry/sticky entry at the top of Changes. > {quote} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1997) Explore performance of multi-PQ vs single-PQ sorting API
[ https://issues.apache.org/jira/browse/LUCENE-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771201#action_12771201 ] Marvin Humphrey commented on LUCENE-1997: - > What kind of comparator can't pre-create a fixed ordinal list for all the > possible values? I'm sure I've seen this too, but I can't bring one to mind > right now. I think the only time the ordinal list can't be created is when the source array contains some value that can't be compared against another value -- e.g. some variant on NULL -- or when the comparison function is broken, e.g. when a < b and b < c but c > a. For current KinoSearch and future Lucy, we pre-build the ord array at index time and mmap it at search time. (Thanks to mmap, sort caches have virtually no impact on IndexReader launch time.) > Explore performance of multi-PQ vs single-PQ sorting API > > > Key: LUCENE-1997 > URL: https://issues.apache.org/jira/browse/LUCENE-1997 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Affects Versions: 2.9 >Reporter: Michael McCandless >Assignee: Michael McCandless > Attachments: LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, > LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, LUCENE-1997.patch, > LUCENE-1997.patch > > > Spinoff from recent "lucene 2.9 sorting algorithm" thread on java-dev, > where a simpler (non-segment-based) comparator API is proposed that > gathers results into multiple PQs (one per segment) and then merges > them in the end. > I started from John's multi-PQ code and worked it into > contrib/benchmark so that we could run perf tests. Then I generified > the Python script I use for running search benchmarks (in > contrib/benchmark/sortBench.py). > The script first creates indexes with 1M docs (based on > SortableSingleDocSource, and based on wikipedia, if available). 
Then > it runs various combinations: > * Index with 20 balanced segments vs index with the "normal" log > segment size > * Queries with different numbers of hits (only for wikipedia index) > * Different top N > * Different sorts (by title, for wikipedia, and by random string, > random int, and country for the random index) > For each test, 7 search rounds are run and the best QPS is kept. The > script runs singlePQ then multiPQ, and records the resulting best QPS > for each and produces table (in Jira format) as output. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
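To make the ord-array idea concrete, here is a rough sketch of how such an array could be derived at index time -- not actual KinoSearch/Lucy code, and the names are invented: sort the doc ids by field value, then assign ordinals, bumping the ordinal only when the value changes.

```java
import java.util.Arrays;
import java.util.Comparator;

class OrdBuilder {
    // values[doc] is the sort key for doc. Returns ords[doc] such that
    // comparing two ords is equivalent to comparing the underlying values,
    // so sorting hits needs only cheap int comparisons.
    static int[] buildOrds(String[] values) {
        Integer[] docs = new Integer[values.length];
        for (int i = 0; i < docs.length; i++) docs[i] = i;
        Arrays.sort(docs, Comparator.comparing((Integer d) -> values[d]));
        int[] ords = new int[values.length];
        int ord = 0;
        for (int i = 0; i < docs.length; i++) {
            // New ordinal only when the value differs from its predecessor.
            if (i > 0 && !values[docs[i]].equals(values[docs[i - 1]])) ord++;
            ords[docs[i]] = ord;
        }
        return ords;
    }
}
```

A null in the source array would blow up the comparator here, which is exactly the "value that can't be compared" case described above.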
Re: Lucene 2.9 and deprecated IR.open() methods
On Mon, Oct 05, 2009 at 08:27:20AM +0200, Uwe Schindler wrote: > Pass a Properties or Map to the ctor/open. The keys are predefined > constants. Maybe our previous idea of an IndexConfiguration class is a > subclass of HashMap with all the constants and some easy-to-use > setter methods for very often-used settings (like index dir) and some > reasonable defaults. Interesting. The design we worked out for Lucy's Segment class (prototype in KS devel branch) uses hash/array/string data to store arbitrary metadata on behalf of segment components, written as JSON to seg_NNN/segmeta.json. In that case, though, each component is responsible for generating and consuming its own data. That's different from having the user supply data via such a format. I still think you're going to want an extensible builder class. > This allows us to pass these properties to any flex indexing component > without need to modify/extend it to support the additional properties. The > flexible indexing component just defines its own property names (e.g. as > URNs, URLs, using its class name as prefix,...). But how do you determine what the flex indexing components *are*? In theory, you can pass class names and sufficient arguments to build them up via your big ball of data, but then you're essentially creating a new language, with all the headaches that entails. In KS, Indexer/IndexReader configuration is divided between three classes. * Schema: field definitions. * Architecture: Settings that never change for the life of the index. * IndexManager: Settings that can change per index/search session. Schema isn't worth discussing -- Lucy will have it, Lucene won't, end of story. Architecture and IndexManager, though, are fairly close to what's being discussed. Architecture is responsible for e.g. determining which pluggable components get registered. It's the builder class. IndexManager is where things like merging and locking policies reside. 
> Property names are always String, values any type (therefore Map). > With Java 5, integer props and so on are no "bad syntax" problem because of > autoboxing (no need to pass new Integer() or Integer.valueOf()). Argument validation gets to be a headache when you pass around complex data structures. It's doable, but messy and hard to grok. Going through dedicated methods is cleaner and safer. > Another good thing is, that implementors of e.g. XML config files like in > Solr, can simple pass all elements in config to this map. I go back and forth on this. At some point, the volume of data becomes overwhelming and it becomes easier to swap in the name of a class where most of the data can reside in nice, reliable, structured code. Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
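To illustrate the validation point: with a bag of properties, a bad value sails through and fails far from where it was introduced, while a dedicated setter fails fast at the call site. A toy sketch -- the class and setting names are invented, not proposed Lucene API:

```java
import java.util.HashMap;
import java.util.Map;

class IndexConfig {
    private int mergeFactor = 10;

    // Dedicated setter: a bad value fails immediately, at the call site.
    IndexConfig setMergeFactor(int mergeFactor) {
        if (mergeFactor < 2) {
            throw new IllegalArgumentException("mergeFactor must be >= 2");
        }
        this.mergeFactor = mergeFactor;
        return this;
    }

    int getMergeFactor() { return mergeFactor; }

    // Map-based config: the same mistake slips in silently here and only
    // surfaces (or quietly misbehaves) wherever the value is finally read.
    static Map<String, Object> asProperties(int mergeFactor) {
        Map<String, Object> props = new HashMap<>();
        props.put("mergeFactor", mergeFactor);
        return props;
    }
}
```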
Re: Lucene 2.9 and deprecated IR.open() methods
On Sun, Oct 04, 2009 at 05:53:14AM -0400, Michael McCandless wrote: > 1 Do we prevent config settings from changing after creating an > IW/IR? Any settings conveyed via a settings object ought to be final if you want pluggable index components. Otherwise, you need some nightmarish notification system to propagate settings down into your subcomponents, which may or may not be prepared to handle the value modifications. Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
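A settings object can be made final simply by freezing everything at construction, so subcomponents can capture it without any notification machinery. A hedged sketch with invented names:

```java
// Immutable settings object: once the writer is built, nothing can change
// underneath the pluggable subcomponents that hold a reference to it.
final class WriterSettings {
    private final int maxBufferedDocs;
    private final double ramBufferSizeMB;

    WriterSettings(int maxBufferedDocs, double ramBufferSizeMB) {
        this.maxBufferedDocs = maxBufferedDocs;
        this.ramBufferSizeMB = ramBufferSizeMB;
    }

    int maxBufferedDocs() { return maxBufferedDocs; }
    double ramBufferSizeMB() { return ramBufferSizeMB; }
}
```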
Re: Lucene 2.9 and deprecated IR.open() methods
On Sun, Oct 04, 2009 at 03:04:13PM -0400, Mark Miller wrote: > Earwin Burrfoot wrote: > > As I stated in my last email, there's zero difference between > > settings+static factory and builder except for syntax. Cannot > > understand what Mark, Mike are arguing about. > > > Sounds like we are arguing that we don't like the syntax then... So, implement the static factory methods as wrappers around the builder method.

public static IndexWriter open(Directory dir, Analyzer analyzer) {
    return open(new IndexManager(dir), dir, analyzer);
}

public static IndexWriter open(IndexManager manager, Directory dir, Analyzer analyzer) {
    return open(new Architecture(), manager, dir, analyzer);
}

public static IndexWriter open(Architecture arch, IndexManager manager, Directory dir, Analyzer analyzer) {
    return arch.buildIndexWriter(manager, dir, analyzer);
}

IMO, it's important not to force first-time users to grok builder classes in order to perform basic indexing or searching. Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: custom segment files
On Fri, Sep 18, 2009 at 08:14:24AM +0800, John Wang wrote: > Say you have a type of field with fixed length data per doc, e.g. a 8 bytes. > It might be good to store in a segment: > Heh. You've just described this proof of concept class: http://www.rectangular.com/kinosearch/docs/devel/KSx/Index/ByteBufDocWriter.html http://www.rectangular.com/svn/kinosearch/trunk/perl/lib/KSx/Index/ByteBufDocWriter.pm > Hopefully I am describing it clearly. Sure, I understand exactly what you mean. Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
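The fixed-length layout John describes boils down to one multiplication per lookup: the value for doc N lives at offset N * width. A standalone in-memory sketch with invented names -- not the actual ByteBufDocWriter implementation, which writes to a segment file:

```java
import java.nio.ByteBuffer;

class FixedWidthDocValues {
    // Fixed-length per-doc records: no per-doc pointers or length prefixes,
    // just a single offset calculation per lookup.
    private final ByteBuffer data;
    private final int width;

    FixedWidthDocValues(int numDocs, int width) {
        this.data = ByteBuffer.allocate(numDocs * width);
        this.width = width;
    }

    void put(int doc, byte[] value) {
        if (value.length != width) {
            throw new IllegalArgumentException("value must be exactly " + width + " bytes");
        }
        data.position(doc * width);
        data.put(value);
    }

    byte[] get(int doc) {
        byte[] out = new byte[width];
        data.position(doc * width);
        data.get(out);
        return out;
    }
}
```

On disk, the same arithmetic works against an mmap'd file, which is what makes IndexReader startup cheap.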
[jira] Commented: (LUCENE-1908) Similarity javadocs for scoring function to relate more tightly to scoring models in effect
[ https://issues.apache.org/jira/browse/LUCENE-1908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12755062#action_12755062 ] Marvin Humphrey commented on LUCENE-1908: - The rationale behind the coarseness of the norms is that since the accuracy of search engines in retrieving the documents that the user really wants is so poor, only big differences matter. (It's not just poor "recall" against a given query, but the difficulty that the user experiences in formulating a proper query to express what they're really looking for in the first place.) Doug wrote at least once about this some years back, but I haven't been able to track down the post. > Similarity javadocs for scoring function to relate more tightly to scoring > models in effect > --- > > Key: LUCENE-1908 > URL: https://issues.apache.org/jira/browse/LUCENE-1908 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Reporter: Doron Cohen >Assignee: Doron Cohen >Priority: Minor > Fix For: 2.9 > > Attachments: LUCENE-1908.patch, LUCENE-1908.patch, LUCENE-1908.patch, > LUCENE-1908.patch > > > See discussion in the related issue. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1900) Confusing Javadoc in Searchable.java
[ https://issues.apache.org/jira/browse/LUCENE-1900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12752675#action_12752675 ] Marvin Humphrey commented on LUCENE-1900: - IMO, maxDoc(), docFreq(), and docFreqs() are all expert, because they all require an understanding of the deletions mechanism to grasp their behavior. I'd vote for adding the "expert" tag to IndexReader.maxDoc() before stripping it from those. > Confusing Javadoc in Searchable.java > > > Key: LUCENE-1900 > URL: https://issues.apache.org/jira/browse/LUCENE-1900 > Project: Lucene - Java > Issue Type: Bug > Components: Search >Affects Versions: 2.9 >Reporter: Nadav Har'El >Assignee: Mark Miller >Priority: Trivial > Fix For: 2.9 > > Attachments: LUCENE-1900.patch > > > In Searchable.java, the javadoc for maxdoc() is: > /** Expert: Returns one greater than the largest possible document number. >* Called by search code to compute term weights. >* @see org.apache.lucene.index.IndexReader#maxDoc() > The qualification "expert" and the statement "called by search code to > compute term weights" is a bit confusing, It implies that maxdoc() somehow > computes weights, which is obviously not true (what it does is explained in > the other sentence). Maybe it is used as one factor of the weight, but do we > really need to mention this here? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1900) Confusing Javadoc in Searchable.java
[ https://issues.apache.org/jira/browse/LUCENE-1900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12752612#action_12752612 ] Marvin Humphrey commented on LUCENE-1900: - maxDoc() isn't just used for calculating weights. It's also used for e.g. figuring out how big your bit vector needs to be in order to accommodate the largest doc in the collection. My vote would be to just strip that extra comment about calculating term weights. > Confusing Javadoc in Searchable.java > > > Key: LUCENE-1900 > URL: https://issues.apache.org/jira/browse/LUCENE-1900 > Project: Lucene - Java > Issue Type: Bug > Components: Search >Affects Versions: 2.9 >Reporter: Nadav Har'El >Priority: Trivial > > In Searchable.java, the javadoc for maxdoc() is: > /** Expert: Returns one greater than the largest possible document number. >* Called by search code to compute term weights. >* @see org.apache.lucene.index.IndexReader#maxDoc() > The qualification "expert" and the statement "called by search code to > compute term weights" is a bit confusing, It implies that maxdoc() somehow > computes weights, which is obviously not true (what it does is explained in > the other sentence). Maybe it is used as one factor of the weight, but do we > really need to mention this here? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1896) Modify confusing javadoc for queryNorm
[ https://issues.apache.org/jira/browse/LUCENE-1896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12752591#action_12752591 ] Marvin Humphrey commented on LUCENE-1896: - > at what I am trusting is essentially no cost. Here's the snippet from TermQuery.score() where queryNorm() actually gets applied to each document's score:

{code}
float raw =                                // compute tf(f)*weight
    f < SCORE_CACHE_SIZE                   // check cache
    ? scoreCache[f]                        // cache hit
    : getSimilarity().tf(f)*weightValue;   // cache miss
{code}

At this point, queryNorm() has already been factored into weightValue (and scoreCache). It happens during setup. You can either scale weightValue by queryNorm() during setup or not -- the per-document computational cost is unaffected. > Modify confusing javadoc for queryNorm > -- > > Key: LUCENE-1896 > URL: https://issues.apache.org/jira/browse/LUCENE-1896 > Project: Lucene - Java > Issue Type: Improvement > Components: Javadocs >Reporter: Jiri Kuhn >Priority: Minor > Fix For: 2.9 > > > See http://markmail.org/message/arai6silfiktwcer > The javadoc confuses me as well. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
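The setup being referred to looks roughly like the following standalone sketch -- simplified, not the actual TermScorer source, though the sqrt-based tf() mirrors DefaultSimilarity. All queryNorm-dependent work is paid once, before any document is scored:

```java
class ScoreCacheSketch {
    static final int SCORE_CACHE_SIZE = 32;

    // Stand-in for Similarity.tf(): sqrt(freq), as in DefaultSimilarity.
    static float tf(int freq) { return (float) Math.sqrt(freq); }

    final float weightValue;  // queryNorm() is already baked in here
    final float[] scoreCache = new float[SCORE_CACHE_SIZE];

    ScoreCacheSketch(float weightValue) {
        this.weightValue = weightValue;
        // Setup cost, paid once per query, not per document.
        for (int f = 0; f < SCORE_CACHE_SIZE; f++) {
            scoreCache[f] = tf(f) * weightValue;
        }
    }

    // The per-document hot path: one compare, one lookup or one multiply.
    float raw(int f) {
        return f < SCORE_CACHE_SIZE ? scoreCache[f] : tf(f) * weightValue;
    }
}
```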
[jira] Commented: (LUCENE-1896) Modify confusing javadoc for queryNorm
[ https://issues.apache.org/jira/browse/LUCENE-1896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12752311#action_12752311 ] Marvin Humphrey commented on LUCENE-1896: - FWIW, after all that [fuss|http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200802.mbox/%3c9396e8e7-46ff-4b78-9427-13e9a7e58...@rectangular.com%3e], I wound up leaving it in. From the standpoint of ordinary users, queryNorm() is harmless or mildly beneficial. Scores are never going to be comparable across multiple queries without what _I_ normally think of as "normalization" (given my background in audio): setting the top score to 1.0, and multiplying all other scores by the same factor. Nevertheless, it's better for them to be closer together than farther apart. From the standpoint of users trying to write Query subclasses, it's a wash. On the one hand, it's not the most important method, since it doesn't affect ranking within a single query -- and zapping it would mean one less thing to think about. On the other hand, it's nice to have it in there for the sake of completeness in the implementation of cosine similarity. I eventually wound up messing with *how* the query norm gets applied to achieve my de-voodoo-fication goals. Essentially, I hid away queryNorm() so that you don't need to think about it unless you really need it. > Modify confusing javadoc for queryNorm > -- > > Key: LUCENE-1896 > URL: https://issues.apache.org/jira/browse/LUCENE-1896 > Project: Lucene - Java > Issue Type: Improvement > Components: Javadocs >Reporter: Jiri Kuhn >Priority: Minor > Fix For: 2.9 > > > See http://markmail.org/message/arai6silfiktwcer > The javadoc confuses me as well. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
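The claim that queryNorm() doesn't affect ranking within a single query is easy to check: it multiplies every hit's score by the same positive constant, so relative order is preserved. A small demonstration with invented helper names:

```java
import java.util.Arrays;

class QueryNormDemo {
    // Return doc indices sorted by descending score.
    static Integer[] ranking(float[] scores) {
        Integer[] docs = new Integer[scores.length];
        for (int i = 0; i < docs.length; i++) docs[i] = i;
        Arrays.sort(docs, (a, b) -> Float.compare(scores[b], scores[a]));
        return docs;
    }

    // Apply a query norm: every score is scaled by the same factor.
    static float[] scale(float[] scores, float queryNorm) {
        float[] out = new float[scores.length];
        for (int i = 0; i < out.length; i++) out[i] = scores[i] * queryNorm;
        return out;
    }
}
```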
[jira] Commented: (LUCENE-1877) Improve IndexWriter javadoc on locking
[ https://issues.apache.org/jira/browse/LUCENE-1877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12749363#action_12749363 ] Marvin Humphrey commented on LUCENE-1877: - > I can see how this is not ideal, but I'm not seeing how any of the > mentioned issues apply to our simple lock usage ... "Simple lock usage"?! You must have a bigger brain than me... As a matter of fact, I think you're right. Fcntl locks have two major drawbacks, and upon review I think NativeFSLockFactory avoids both of them. The first is that opening and closing a file releases all locks for the entire process. Even if you never request a lock on the second filehandle, or if you request a lock and the request is denied, closing the filehandle releases the lock on the first filehandle. But NativeFSLockFactory avoids that problem by keeping the HashSet of filepaths and ensuring that the same file is never opened twice. Furthermore, since the lockfiles are private to Lucene, you can assume that nobody else is going to open them and inadvertently spoil the lock. The second is that child processes spawned via fork() do not inherit locks from the parent process. If you assume that nobody's ever going to fork a Java process, that's not relevant. (Too bad that won't work for Lucy... we have to support fork().) I think you're probably safe with Fcntl locks on all non-shared volumes. > Improve IndexWriter javadoc on locking > -- > > Key: LUCENE-1877 > URL: https://issues.apache.org/jira/browse/LUCENE-1877 > Project: Lucene - Java > Issue Type: Improvement > Components: Javadocs >Reporter: Mark Miller >Priority: Trivial > Fix For: 2.9 > > > A user requested we add a note in IndexWriter alerting the availability of > NativeFSLockFactory (allowing you to avoid retaining locks on abnormal jvm > exit). Seems reasonable to me - we want users to be able to easily stumble > upon this class. 
The below code looks like a good spot to add a note - could > also improve what's there a bit - opening an IndexWriter does not necessarily > create a lock file - that would depend on the LockFactory used. > {code} Opening an IndexWriter creates a lock file for the > directory in use. Trying to open > another IndexWriter on the same directory will lead to a > {@link LockObtainFailedException}. The {@link > LockObtainFailedException} > is also thrown if an IndexReader on the same directory is used to delete > documents > from the index.{code} > Anyone remember why NativeFSLockFactory is not the default over > SimpleFSLockFactory? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1877) Improve IndexWriter javadoc on locking
[ https://issues.apache.org/jira/browse/LUCENE-1877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12749330#action_12749330 ] Marvin Humphrey commented on LUCENE-1877: - > Anyone remember why NativeFSLockFactory is not the default over > SimpleFSLockFactory? Wasn't it because native locking is sometimes implemented with Fcntl, and Fcntl locking blows chunks? Especially for a library rather than an application. From the BSD manpage on Fcntl: {quote} This interface follows the completely stupid semantics of System V and IEEE Std 1003.1-1988 (``POSIX.1'') that require that all locks associated with a file for a given process are removed when any file descriptor for that file is closed by that process. This semantic means that applications must be aware of any files that a subroutine library may access. For example if an application for updating the password file locks the password file database while making the update, and then calls getpwname(3) to retrieve a record, the lock will be lost because getpwname(3) opens, reads, and closes the password database. The database close will release all locks that the process has associated with the database, even if the library routine never requested a lock on the database. Another minor semantic problem with this interface is that locks are not inherited by a child process created using the fork(2) function. The flock(2) interface has much more rational last close semantics and allows locks to be inherited by child processes. Flock(2) is recommended for applications that want to ensure the integrity of their locks when using library routines or wish to pass locks to their children... {quote} The lockfile may be annoying, but at least it's guaranteed safe on all non-shared volumes when the OS implements atomic file opening. Are you folks at least able to clean up an orphaned lockfile if the PID it was created under is no longer active?
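The orphaned-lockfile cleanup asked about above could be sketched like this in modern Java (hypothetical class and method names; the Java 9+ ProcessHandle API is real). Note the usual caveat: PID reuse makes liveness checks heuristic rather than airtight.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;

// Hypothetical sketch: if a lockfile records the PID that created it,
// and no live process has that PID, the lock can be reclaimed.
public class StaleLockReaper {
    // True when no live process has the given PID.
    public static boolean isStale(long pid) {
        Optional<ProcessHandle> h = ProcessHandle.of(pid);
        return h.isEmpty() || !h.get().isAlive();
    }

    // Assumes the lockfile's contents are just the creator's PID.
    public static void reapIfStale(Path lockFile) throws IOException {
        long pid = Long.parseLong(Files.readString(lockFile).trim());
        if (isStale(pid)) {
            Files.delete(lockFile);
        }
    }

    public static void main(String[] args) {
        // Our own PID is alive, so it is never stale.
        System.out.println(isStale(ProcessHandle.current().pid()));
    }
}
```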
[jira] Commented: (LUCENE-1859) TermAttributeImpl's buffer will never "shrink" if it grows too big
[ https://issues.apache.org/jira/browse/LUCENE-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12748109#action_12748109 ] Marvin Humphrey commented on LUCENE-1859: - > I don't believe there is ever any valid argument against adding > documentation. The more that documentation grows, the harder it is to absorb. The more bells and whistles on an API, the harder it is to grok and to use effectively. The more a code base bloats, the harder it is to maintain or to evolve. > keeping average memory usage down prevents those wonderful OutOfMemory > Exceptions No, it won't. If someone is emitting large tokens regularly, it is likely that several threads will require large RAM footprints simultaneously, and an OOM will occur. That would be the common case. If someone is emitting large tokens periodically, well, this doesn't prevent the OOM, it just makes it less likely. That's not worthless, but it's not something anybody should count on when assessing required RAM usage. Keeping average memory usage down is good for the system at large. If this is implemented, that should be the justification.
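The shrink-on-next-token behavior proposed in this issue could be sketched as follows (hypothetical names, not the actual TermAttributeImpl code): the buffer grows to fit any token, but once it exceeds MAX_BUFFER_SIZE it is shrunk back the next time a token that fits arrives -- which is exactly why it lowers average but not peak usage.

```java
// Hypothetical sketch of the MAX_BUFFER_SIZE idea from LUCENE-1859.
public class ShrinkableTermBuffer {
    static final int MAX_BUFFER_SIZE = 16 * 1024;
    private char[] buffer = new char[64];

    public void setTermBuffer(char[] src, int off, int len) {
        if (len > buffer.length) {
            // Grow to fit a large token: peak usage is unaffected.
            buffer = new char[len];
        } else if (buffer.length > MAX_BUFFER_SIZE && len <= MAX_BUFFER_SIZE) {
            // Reclaim an oversized buffer once a small token arrives.
            buffer = new char[MAX_BUFFER_SIZE];
        }
        System.arraycopy(src, off, buffer, 0, len);
    }

    public int capacity() { return buffer.length; }

    public static void main(String[] args) {
        ShrinkableTermBuffer b = new ShrinkableTermBuffer();
        b.setTermBuffer(new char[100_000], 0, 100_000);
        System.out.println(b.capacity()); // grew to hold the huge token
        b.setTermBuffer(new char[]{'a'}, 0, 1);
        System.out.println(b.capacity()); // shrank back to MAX_BUFFER_SIZE
    }
}
```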
> TermAttributeImpl's buffer will never "shrink" if it grows too big > -- > > Key: LUCENE-1859 > URL: https://issues.apache.org/jira/browse/LUCENE-1859 > Project: Lucene - Java > Issue Type: Bug > Components: Analysis >Affects Versions: 2.9 >Reporter: Tim Smith >Priority: Minor > > This was also an issue with Token previously as well > If a TermAttributeImpl is populated with a very long buffer, it will never be > able to reclaim this memory > Obviously, it can be argued that Tokenizer's should never emit "large" > tokens, however it seems that the TermAttributeImpl should have a reasonable > static "MAX_BUFFER_SIZE" such that if the term buffer grows bigger than this, > it will shrink back down to this size once the next token smaller than > MAX_BUFFER_SIZE is set > I don't think i have actually encountered issues with this yet, however it > seems like if you have multiple indexing threads, you could end up with a > char[Integer.MAX_VALUE] per thread (in the very worst case scenario) > perhaps growTermBuffer should have the logic to shrink if the buffer is > currently larger than MAX_BUFFER_SIZE and it needs less than MAX_BUFFER_SIZE -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1859) TermAttributeImpl's buffer will never "shrink" if it grows too big
[ https://issues.apache.org/jira/browse/LUCENE-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12748102#action_12748102 ] Marvin Humphrey commented on LUCENE-1859: - > i fail to see the complexity of adding one method to TermAttribute: Death by a thousand cuts. This is one cut. I wouldn't even add the note to the documentation. If you emit large tokens, you have to plan for obscene peak memory usage anyway, and if you're not prepared for that, you deserve what you get. Keeping the average down doesn't help that. The only reason to do this is to keep average memory usage down for the hell of it, and if it goes in, it should be an implementation detail.
[jira] Commented: (LUCENE-1859) TermAttributeImpl's buffer will never "shrink" if it grows too big
[ https://issues.apache.org/jira/browse/LUCENE-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12748089#action_12748089 ] Marvin Humphrey commented on LUCENE-1859: - IMO, the benefit of adding these theoretical helper methods to lower average -- but not peak -- memory usage by non-core Tokenizers which are probably doing something impractical anyway... does not justify the complexity cost.
[jira] Commented: (LUCENE-1859) TermAttributeImpl's buffer will never "shrink" if it grows too big
[ https://issues.apache.org/jira/browse/LUCENE-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12748064#action_12748064 ] Marvin Humphrey commented on LUCENE-1859: - The worst-case scenario seems kind of theoretical, since there are so many reasons that huge tokens are impractical. (Is a priority of "major" justified?) If there's a significant benefit to shrinking the allocation, it's minimizing average memory usage over time. But even that assumes a nearly pathological distribution in field size -- it would have to be large for early documents, then consistently small for subsequent documents. If it's scattered, you have to plan for worst case RAM usage as an app developer, anyway. Which generally means limiting token size. I assume that, based on this report, TermAttributeImpl never gets reset or discarded/recreated over the course of an indexing session? -0 if the reallocation happens no more often than once per document. -1 if the reallocation has to be performed in an inner loop.
Re: Finishing Lucene 2.9
On Mon, Aug 24, 2009 at 10:15:20PM +0300, Shai Erera wrote: > I think it all boils down to this jar drop-in ability. Expecting jar drop-in compatibility for bugfix releases is 100% reasonable. Expecting something close to jar drop-in compatibility for minor releases seems pretty reasonable, too. Expecting jar drop-in compatibility minus deprecations at a major version change is only reasonable when that has been made the explicit public policy of the project. By making that promise, you are squandering your one opportunity to make disruptive changes. Instead, you're trying to shoehorn what ought to be disruptive changes into an artificially continuous release cycle. It's a lot of work, results in a lot of inelegant compatibility APIs, and seems not to have been successfully implemented yet for 2.9. Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Finishing Lucene 2.9
On Mon, Aug 24, 2009 at 01:46:35PM -0400, Michael McCandless wrote: > Right, that is and has been the "plan" for 2.9/3.0/3.1 for quite some time. > > We are now discussing whether to change the plan, but so far it looks > likely we'll just stick with it... It seems like breaking the promise would be disruptive now. But you have an opportunity to change the policy at 3.0, affecting 3.9 and 4.0. That's a 3.0 issue, though -- not a 2.9 issue. Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Finishing Lucene 2.9
On Mon, Aug 24, 2009 at 11:44:17AM -0400, Michael McCandless wrote: > Separately, we can think about having 3.1 be a "real" release, not > just a "fast turnaround" release. All problems flow from this "fast turnaround release" constraint. If you had the freedom to make the kind of API changes people normally expect to accompany a major version change, everything would be a lot simpler. Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1684) Add matchVersion to StandardAnalyzer
[ https://issues.apache.org/jira/browse/LUCENE-1684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718485#action_12718485 ] Marvin Humphrey commented on LUCENE-1684: - +1 This approach addresses all of my concerns about action-at-a-distance behaviors. Nice work, Mike. > Add matchVersion to StandardAnalyzer > > > Key: LUCENE-1684 > URL: https://issues.apache.org/jira/browse/LUCENE-1684 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis >Reporter: Michael McCandless >Assignee: Michael McCandless >Priority: Minor > Fix For: 2.9 > > Attachments: LUCENE-1684.patch > > > I think we should add a matchVersion arg to StandardAnalyzer. This > allows us to fix bugs (for new users) while keeping precise back > compat (for users who upgrade). > We've discussed this on java-dev, but I'd like to now make it concrete > (patch attached). I think it actually works very well, and is a > simple tool to help us carry out our back-compat policy. > I coded up an example with StandardAnalyzer: > * The ctor now takes a required arg (Version matchVersion). You > pass Version.LUCENE_CURRENT to always get latest & greatest, or eg > Version.LUCENE_24 to match 2.4's bugs/settings/behavior. > * StandardAnalyzer conditionalizes the "replace invalid acronym" and > "enable position increment in StopFilter" based on matchVersion. > * It also prevents creating zillions of ctors, over time, as we need > to change settings in the class. EG StandardAnalyzer now has 2 > settings that are version dependent, and there's at least another > 2 issues open on fixing some more of its bugs. > The migration is also very clean: we'd only add this to classes on an > "as needed" basis. On the first release that adds the arg, the > default remains back compatible with the prior release. Then, going > forward, we are free to fix issues on that class and conditionalize by > matchVersion.
> The javadoc at the top of StandardAnalyzer clearly calls out what > version specific behavior is done: > {code} > * You must specify the required {@link Version} > * compatibility when creating StandardAnalyzer: > * > *As of 2.9, StopFilter preserves position > *increments by default > *As of 2.9, Tokens incorrectly identified as acronyms > *are corrected (see LUCENE-1608: https://issues.apache.org/jira/browse/LUCENE-1068) > * > * > {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Lucene's default settings & back compatibility
On Fri, May 22, 2009 at 10:40:03PM +0400, Earwin Burrfoot wrote: > >> Custom analyzers. > > No problem. > How are they recorded in the index? Analyzers must implement dump() and load(), which convert the Analyzer to/from a JSON-izable data structure. They end up as JSON in index_dir/schema_NNN.json. Custom subclasses must be loaded by whatever app wants to read the index, naturally. > >> Intentionally different analyzers for indexing and searching. > > No problem. That only makes sense in the context of QueryParser, and the KS > > QueryParser allows you to supply an analyzer which overrides the Schema. > But well, it differs from analyzer used for indexation in one or two > options, and shares a heap of others. A constructor argument solves that problem, doesn't it? Am I missing something? > >> Using this analyzer without any index at all - like I do highlight on > >> a separate machine to minimize GC pauses, or tag docs by running a > >> heap of queries against MemoryIndex. > > No problem. Distribute a Schema subclass among several machines. > You mean read an index on one machine, create Analyzer, serialize it > and send over the wire to other machines? I hope that's either a joke > or I misunderstood you. Please. How did your Analyzer class get on the other machines? Do the same thing with your Schema subclass. > Storing a list of stopwords in the index sounds fun. Storing a fat > synonym/morphology dictionary while completely analogous, is no longer > fun. So, don't store that whole dictionary in the serialized Analyzer -- just store a version number. Make the synonym data class data. If it's reasonable to key multiple versions of the class data off of the version number constructor argument, do that. If not and an index was built with a version of the Analyzer that is no longer supported, either throw an exception or intentionally ignore the mismatch and serve screwed up search results. Your call.
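The dump()/load() scheme described above can be sketched in Java (KinoSearch itself is Perl, and every name here is illustrative, not its real API): the serialized form records only the analyzer class and a data version, not the whole synonym dictionary, and load() refuses versions it no longer supports instead of silently serving bad results.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of an analyzer that serializes a version number
// instead of its full synonym dictionary.
public class SynonymAnalyzerStub {
    static final int SUPPORTED_VERSION = 2;
    final int dataVersion;

    SynonymAnalyzerStub(int dataVersion) { this.dataVersion = dataVersion; }

    // dump(): a JSON-izable structure, small enough to live in the index.
    Map<String, Object> dump() {
        Map<String, Object> m = new HashMap<>();
        m.put("class", "SynonymAnalyzerStub");
        m.put("data_version", dataVersion); // version number, not the dictionary
        return m;
    }

    // load(): fail catastrophically, not subtly, on an unsupported version.
    static SynonymAnalyzerStub load(Map<String, Object> m) {
        int v = (Integer) m.get("data_version");
        if (v != SUPPORTED_VERSION) {
            throw new IllegalStateException("unsupported synonym data version: " + v);
        }
        return new SynonymAnalyzerStub(v);
    }

    public static void main(String[] args) {
        SynonymAnalyzerStub a = load(new SynonymAnalyzerStub(2).dump());
        System.out.println(a.dataVersion);
    }
}
```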
Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Lucene's default settings & back compatibility
On Fri, May 22, 2009 at 09:06:32PM +0400, Earwin Burrfoot wrote: > > In KinoSearch SVN trunk, satellite classes like QueryParser and Highlighter > > have to be passed a Schema, which contains all the Analyzers. Analyzers > > aren't satellite classes under this model -- they are a fixed property of a > > FullTextType field spec. Think of them as baked into an SQL field > > definition. > > > > You can create a Schema from scratch to pass to the QueryParser, but it's > > easier to just get it from the Searcher. Translating to Java... > > > > Searcher searcher = new Searcher("/path/to/index"); > > QueryParser qparser = new QueryParser(searcher.getSchema()); > > > > I don't see how that's so different from getting an analyzer actsAsVersion > > number from the index. > > > > Now, where stuff might start to get complicated is > > PerFieldAnalyzerWrapper... > > is that where the sneakiness gets overwhelming? > Some people can have setups more complex than that. > Different analyzers per field. Heh. One of the primary rationales behind Schema was to tie individual analyzers to specific fields. > Custom analyzers. No problem. > Several indexes using the same analyzer. No problem. Only necessary if the analyzer is costly or has some esoteric need for shared state. And possible via subclassing Schema or Analyzer. > Intentionally different analyzers for indexing and searching. No problem. That only makes sense in the context of QueryParser, and the KS QueryParser allows you to supply an analyzer which overrides the Schema. > Using this analyzer without any index at all - like I do highlight on > a separate machine to minimize GC pauses, or tag docs by running a > heap of queries against MemoryIndex. No problem. Distribute a Schema subclass among several machines. These are all solved problems under the per-index field semantics serialized Schema model. That's why I said it was the "theoretical solution". 
Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Lucene's default settings & back compatibility
On Fri, May 22, 2009 at 01:22:24PM -0400, Michael McCandless wrote: > > Sounds like an argument for more frequent major releases. > > Yeah. Or "rebranding" what we now call minor as major releases, by > changing our policy ;) Not sure how much of that is a jest, but I don't think that's a good idea. It violates commonly held expectations about what constitutes a "minor release". Of course, I'm not sure to what extent modified interfaces will surprise people. At least that's compile-time... but then it will make it harder for multiple apps with Lucene dependencies to coexist. > Will Lucy do scoring when sorting by field, by default? Nope. Why would we do that? The only reason you're doing it in Lucene is to preserve back compat, and Lucy doesn't have that constraint. Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Lucene's default settings & back compatibility
> I feel the opposite: I'd like new users to see improvements by > default, and users that require strict back-compat to ask for that. By "strict back-compat", do you mean "people who would like their search app to not fail silently"? ;) A "new user" who follows your advice... // haha stupid noob StandardAnalyzer analyzer = new StandardAnalyzer(Versions.LATEST); ... is going to get screwed when the default tokenization behavior changes. And it would be much worse if we follow my preference for making the arg optional without following my preference for keeping defaults intact: // haha eat it luser StandardAnalyzer analyzer = new StandardAnalyzer(); It's either make the arg mandatory when changing default behavior and recommend that new users pass a fixed argument, or make it optional but keep defaults intact between major releases. Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
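The two acceptable policies named in the message above can be sketched side by side (hypothetical class, not the real StandardAnalyzer): either the version argument is mandatory, or a no-arg constructor exists but stays pinned to the old defaults rather than silently tracking "latest".

```java
// Hypothetical sketch of the "mandatory arg vs. pinned default" choice.
public class VersionedAnalyzer {
    public enum Version { LUCENE_24, LUCENE_29 }

    final Version actsAsVersion;

    // Policy 1: mandatory arg -- callers must commit to a fixed version.
    public VersionedAnalyzer(Version v) { this.actsAsVersion = v; }

    // Policy 2: optional arg, but the default is pinned to the behavior
    // of the last major release -- never an implicit "LATEST".
    public VersionedAnalyzer() { this(Version.LUCENE_24); }

    public static void main(String[] args) {
        System.out.println(new VersionedAnalyzer().actsAsVersion);
    }
}
```

What both policies rule out is the trap in the quoted snippet: a moving target like `Versions.LATEST` baked into application code.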
Re: Lucene's default settings & back compatibility
On Fri, May 22, 2009 at 11:33:33AM -0400, Michael McCandless wrote: > when working on 3.1 if we make some great improvement, I'd like new users in > 3.1 to see the improvement by default. Sounds like an argument for more frequent major releases. But I'm not exactly one to talk. ;) > On thinking about it more... automagically storing the "actsAsVersion" > in the index, and then having IndexWriter (for example) ask the > analyzer for a tokenStream matching that version, seems a little too > sneaky. Can you elaborate? In KinoSearch SVN trunk, satellite classes like QueryParser and Highlighter have to be passed a Schema, which contains all the Analyzers. Analyzers aren't satellite classes under this model -- they are a fixed property of a FullTextType field spec. Think of them as baked into an SQL field definition. You can create a Schema from scratch to pass to the QueryParser, but it's easier to just get it from the Searcher. Translating to Java... Searcher searcher = new Searcher("/path/to/index"); QueryParser qparser = new QueryParser(searcher.getSchema()); I don't see how that's so different from getting an analyzer actsAsVersion number from the index. Now, where stuff might start to get complicated is PerFieldAnalyzerWrapper... is that where the sneakiness gets overwhelming? > I prefer the up-front "you specify actsAsVersion" when you > create the analyzer, only for analyzers that have changed across > releases. So things like WhitespaceAnalyzer would likely never need > an actsAsVersion arg. Hmm, this is kind of hard. I'd prefer that the argument remain optional, so that new users don't have to think about it. But unlike in KS/Lucy, then there's a danger of leaving it off inadvertently and getting the wrong behavior. :\ Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Lucene's default settings & back compatibility
On Fri, May 22, 2009 at 11:53:02AM -0400, Michael McCandless wrote: > 1. If we deprecate an API in the 2.1 release, we can remove it in > the next minor release (2.2). > > 2. JAR drop-in-ability is only guaranteed on point releases (2.4.1 > is a drop-in replacement to 2.4.0). When switching to a new > minor release (2.1 -> 2.2) likely you'll need to recompile. > 4. [Maybe?] Allow certain limited changes that will require source > code changes in your app on upgrading to a new minor release: > adding a new method to an interface, adding a new abstract method > to an abstract class, renaming of deprecated methods. These make sense to me. Catastrophic failure at compile time is vastly easier to deal with than subtle failure at run time. > 3. Default settings can change, but if the change is big enough (and > certainly if it will impact what's indexed or how searches find > docs/do scoring), we add a required "actsAsVersion" arg to the > ctor of the affected class. New users get the latest & greatest, > and upgraded users keep their old defaults. I still like per-class settings classes. For instance, an IndexWriterSettings class which allows you to hide away all the tweaky stuff that's cluttering up the IndexWriter API. IndexWriterSettings settings = new IndexWriterSettings("3.1"); IndexWriter writer = new IndexWriter("path/to/index", analyzer, settings); I also think that the argument should be optional rather than mandatory, and that defaults should remain stable between major releases. In other words, to take advantage of improved defaults, you need to ask for them -- but new users don't have to think about such things during the initial learning phase. This approach is reasonably close to how Architecture and IndexManager are used to hide away settings for the KS/Lucy Indexer class. Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
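The per-class settings idea above might look like this (the author's `IndexWriterSettings` is hypothetical, and the version-dependent flag here is an invented example): the version string is captured once, and the tweaky defaults are derived from it inside the settings object instead of cluttering the IndexWriter constructor list.

```java
// Hypothetical sketch of a per-class settings object whose defaults
// are conditionalized on the requested version.
public class IndexWriterSettings {
    final String actsAsVersion;
    final boolean usePerSegmentSearch; // invented example of a versioned default

    public IndexWriterSettings(String actsAsVersion) {
        this.actsAsVersion = actsAsVersion;
        // Behavior introduced in "2.9" is enabled only for that version or later.
        this.usePerSegmentSearch = actsAsVersion.compareTo("2.9") >= 0;
    }

    public static void main(String[] args) {
        System.out.println(new IndexWriterSettings("3.1").usePerSegmentSearch);
    }
}
```

(String comparison stands in for real version parsing, which a production sketch would need.)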
Re: Lucene's default settings & back compatibility
On Thu, May 21, 2009 at 05:19:43PM -0400, Michael McCandless wrote: > Marvin, which solution would you prefer? Between the two, I'd prefer settings constructor arguments, though I would be inclined to have settings classes that are specific to individual classes rather than Lucene-wide. At least that scheme gets locality right. The global actsAsVersion variable violates that principle and has the potential to saddle a small number of users who have done absolutely nothing wrong with bugs that are very, very hard to hunt down. That's unfair. As far as analyzers and token streams, the theoretical answer is making indexes self-describing via serializable schemas, as discussed on the Lucy dev list, and as implemented in KinoSearch svn trunk. With versioning metadata attached to the index, there is no longer any worry about upgrading analysis modules provided that those modules handle their own versioning correctly. For instance, in KS the Stopalizer always embeds the complete stoplist in the schema file, so even if we update the "English" stoplist, we don't get invalid search results for indexes which were created with the old stoplist. Similarly, it may not be possible to keep around multiple variants of Snowball, but at least we can fail catastrophically instead of subtly if we detect that the Snowball version has changed. Full-on schema serialization isn't feasible for Lucene, but attaching an actsAsVersion variable to an index and feeding that to your analyzers would be a decent start. Lastly, I think a major java Lucene release is justified already. Won't this discussion die down somewhat if you can get 3.0 out? If there are issues that are half done, how about rolling back whatever's in the way? Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Lucene's default settings & back compatibility
Mike McCandless: > Well this is what I love about the actsAsVersion solution. There's no > pain for our back-compat users (besides the one-time effort to set > actsAsVersion), and new users always get the best settings. When some mad-as-hell user complains to this list after spending an inordinate amount of time chasing down an action-at-a-distance bug because of this insidious and irresponsible OO design decision, I intend to follow up their email with an I-told-you-so. There's an action-at-a-distance bug in the Perl core module 'base.pm' that bedeviled people for years before I finally cornered it. Turns out it can't be fixed, but at least now we know what's happening: http://rt.cpan.org/Public/Bug/Display.html?id=28799 While this error does not occur frequently in the wild, when it does, the cost to the user is high because the debug path is obscure. I personally encountered it after failing to wrap a "use_ok" test in a BEGIN block; isolating it took me... longer than I would have liked. ;) That bug has led to 'base' having a compromised reputation among elite users because of intermittent, inexplicable flakiness. Is that what you want for Lucene? Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Lucene's default settings & back compatibility
On Wed, May 20, 2009 at 05:57:49PM +0400, Earwin Burrfoot wrote: > > What happens when two libraries loaded in the same VM have Lucene as a > > dependency and set actsAsVersion to conflicting numbers? > Exactly what happens when you call BooleanQuery.setMaxClauseCount(n) > from two libraries. > Last one wins. Yeesh, that's evil. :( It will be sweet, sweet justice if one of your own projects gets infected by the kind of action-at-a-distance bug you're so blithely unconcerned about. http://en.wikipedia.org/wiki/Action_at_a_distance_(computer_science) That was supposed to be a rhetorical question. To be clear, I consider the idea of a settable global variable determining library behavior completely unacceptable. Changing class load order somewhere in your code shouldn't do things like change search results (because Stopfilters are applied differently depending on who "won"). Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
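The "last one wins" hazard described above is easy to demonstrate in a few lines (hypothetical names): two independent libraries set the same process-wide global, and whichever loads last silently determines behavior for both -- including code that never touched the variable.

```java
// Demonstration of the settable-global anti-pattern: last writer wins.
public class GlobalConfig {
    public static int actsAsVersion = 0; // non-constant global

    static void libraryA() { actsAsVersion = 24; }
    static void libraryB() { actsAsVersion = 29; }

    public static void main(String[] args) {
        libraryA();
        libraryB();                        // last writer wins
        System.out.println(actsAsVersion); // libraryA's setting is silently gone
    }
}
```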
Re: Lucene's default settings & back compatibility
> But since 3.0 is a major release anyway, we could change the default > of actsAsVersion with each 3.x release (or just set it to 3) and > require that a users set actsAsVersion=3 (or whatever version they > are on) in order to get maximum back compatibility. What happens when two libraries loaded in the same VM have Lucene as a dependency and set actsAsVersion to conflicting numbers? Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org