Re: Global field semantics

2006-07-10 Thread Chuck Williams


Chris Hostetter wrote on 07/10/2006 12:31 PM:
> So i guess we are on the same page that this kind of thing can be done at
> the App level -- what benefits do you see moving them into the Lucene
> index level?
>   

Other than performance per David's and Marvin's ideas, the functionality
benefits of having this in the core are probably not compelling. I've
been able to hook almost everything and reference a global field model
at the application level (except for QueryParser which needs some
patches to enhance extensibility for some of these features). It just
seemed that a global field model was so useful that it might be a
beneficial extension to the core, so I was curious what others thought
about this.

Thanks for your thoughts,

Chuck



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Global field semantics

2006-07-10 Thread Chris Hostetter

: previously mentioned a very simple one:  validating fields in the query
: parser.  More interesting examples are:

This strikes me as something that can be done with an abstraction layer
above and separate from the physical index (this is in fact what Solr
does) without needing to add any hard constraints on the index itself
(other than those imposed by the abstraction layer).

:   1.  Multiple inheritance on the fields of documents that record the
: sources of each inherited value to support efficient incremental maintenance

I'm sorry, you completely lost me ... can you clarify what you mean?

:   2.  "Record-valued fields" that store facets with values (e.g., time
: and user information for who set that value).  These cannot easily be
: broken into multiple fields because the fields in question are multi-valued.
:   3.  "Join fields" that reference id's of objects stored in separate
: indices (supporting queries that reference the fields in the joined index)

Both of these cases sound like situations where what you really want is
more flexibility in the Fields/Terms that can be associated with a docId
-- in the case of your "Record-valued fields" you want what I can only
think of as "rich terms", hierarchical data that can be queried ... along
the lines of the "FlexibleIndexing" wiki page, correct? ... this doesn't
seem like it would require more concrete Field rules, but i can
certainly see how an added level of abstraction might help.

: Managing these kinds of rich semantic features in query parsing and
: indexing is greatly facilitated by a global field model.  I've built
: this into my app, and then started thinking about benefits in Lucene
: generally from such a model.

...


So i guess we are on the same page that this kind of thing can be done at
the App level -- what benefits do you see moving them into the Lucene
index level?

(I imagine it making the most sense as a contrib-ish auxiliary API that
developers can use when they don't need the full flexibility the low level
API allows ... but it sounds like you think there are functional benefits
to it being a first order concept in the Lucene API?)

: Yes.  Here is (an elaboration of) the "global model with exceptions"
: idea we reached:

If there can be exceptions then there can't be any hard constraints in the
data store, correct? ... so an implementation like this could be a higher
level API?

: >   docA.add(new Field(f, "bar", Store.YES, Index.UN_TOKENIZED));
: >   docA.add(new Field(f, "foo", Store.NO,  Index.TOKENIZED));
: >
: >   docB.add(new Field(f, "x y", Store.YES, Index.TOKENIZED));
: >   docB.add(new Field(f, "z",   Store.NO,  Index.UN_TOKENIZED));

: Hoss, do you have a use case requiring Store and Index variance like this?

Not to that extreme, but i have certainly encountered situations where
storing a single value while indexing multiple values was needed -- this
is something Solr's schema can't handle actually, and we had to work
around it by using two fields.  I've also seen situations where it would
make a lot of sense to not only do that with one doc, but to also index
a single value and store multiple values in a different doc.




-Hoss





Re: Global field semantics

2006-07-10 Thread David Balmain

On 7/11/06, Yonik Seeley <[EMAIL PROTECTED]> wrote:

On 7/10/06, David Balmain <[EMAIL PROTECTED]> wrote:
> I don't think declaring all fields up front is necessary for
> substantial optimizations. I've found that the key to some really good
> optimizations is having constant field numbers. That is, once a field
> is added to the index it is assigned a field number and it it keeps
> that field number for the life of the index.

I can sort of see how this would work when adding documents to a single index.
What about merging indices via IndexWriter.addIndexes()?  I guess
this would require keeping the current way of merging around as a
fallback?


That's right. I still need to work on this. Currently you need to spec
each index beforehand to make sure they have the same fields. But
it's just a matter of using the old merge model for adding
heterogeneous indexes.


Does this mess up opening a MultiReader on multiple indices
constructed at different times?  This is a common thing for people to
do.


Same as above. I still need to fix this. I have yet to release all these
new changes.


> This allows one
> FieldInfos object per index instead of one per segment.

So when a new segment is written, the global FieldInfos may need to be updated.
I guess this should be written after the new segment and before the
"segments" file.


That's exactly how I do it. I did consider putting it all in the
"segments" file but I decided not to. I can't remember why right now.
So I have a "segments" file and a "fields" file, the "segments" file
being written last.
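
The write order described above can be sketched roughly as follows. This is only an illustration under stated assumptions: CommitOrderSketch, the ".data" suffix and the comma-separated file contents are all hypothetical, and an in-memory map stands in for the index directory. It is not Ferret's or Lucene's actual file format.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for an index directory; not a real Ferret/Lucene class.
final class CommitOrderSketch {
    // File name -> file contents.
    final Map<String, String> files = new LinkedHashMap<>();
    // Every write, in the order it happened.
    final List<String> writeLog = new ArrayList<>();

    private void write(String name, String contents) {
        files.put(name, contents);
        writeLog.add(name);
    }

    // Commit one new segment: segment data first, then the index-wide
    // "fields" file, and the "segments" file last. A reader that only
    // trusts what "segments" lists never sees a half-written commit.
    void commit(String segmentName, List<String> allFieldNames) {
        write(segmentName + ".data", "(postings, stored fields, ...)");
        write("fields", String.join(",", allFieldNames));
        String current = files.get("segments");
        write("segments", current == null ? segmentName : current + "," + segmentName);
    }

    public static void main(String[] args) {
        CommitOrderSketch dir = new CommitOrderSketch();
        List<String> fields = new ArrayList<>();
        fields.add("title");
        fields.add("content");
        dir.commit("_1", fields);
        System.out.println(dir.writeLog);
    }
}
```

If a crash happens after the "fields" write but before the "segments" write, the old "segments" contents still describe a consistent index, which is presumably the point of writing it last.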


>  As I mentioned
> earlier this greatly optimizes the merging of term vectors and stored
> fields. The only problem I could find with this solution is that
> fields are no longer in alphabetical order in the term dictionary but
> I couldn't think of a use-case where this is necessary although I'm
> sure there probably is one.

Isn't an ordered term dictionary necessary to do lookups?


Terms are alphabetically sorted, just not the fields. So if you add a
"title" field and then a "content" field they'd have the numbers 0 and
1 respectively. Now if the title field has the terms "alpha" and
"bravo" and the "content" field has the terms "apple" and "banana"
then they'd be ordered like this:

0:alpha
0:bravo
1:apple
1:banana

instead of like this:

content:apple
content:banana
title:alpha
title:bravo

Notice the terms are correctly ordered in both but the fields aren't.
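
A minimal sketch of that ordering, assuming a hypothetical NumberedTerm class (the name and layout are invented for illustration and are not Ferret's or Lucene's actual code). Terms compare by field number first and by term text second, so each field's terms stay contiguous and sorted even though the fields themselves are in insertion order.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical term key: (field number, term text).
final class NumberedTerm {
    final int fieldNumber;  // assigned when the field is first added to the index
    final String text;

    NumberedTerm(int fieldNumber, String text) {
        this.fieldNumber = fieldNumber;
        this.text = text;
    }

    // Field number first, then term text (plain String ordering) within a field.
    static final Comparator<NumberedTerm> ORDER =
        Comparator.<NumberedTerm>comparingInt(t -> t.fieldNumber)
                  .thenComparing(t -> t.text);

    @Override
    public String toString() {
        return fieldNumber + ":" + text;
    }

    // Returns the terms as "number:text" strings in dictionary order.
    static List<String> sortedKeys(List<NumberedTerm> terms) {
        List<NumberedTerm> copy = new ArrayList<>(terms);
        copy.sort(ORDER);
        List<String> keys = new ArrayList<>();
        for (NumberedTerm t : copy) {
            keys.add(t.toString());
        }
        return keys;
    }

    public static void main(String[] args) {
        List<NumberedTerm> terms = new ArrayList<>();
        terms.add(new NumberedTerm(1, "banana"));  // "content" was added second
        terms.add(new NumberedTerm(0, "bravo"));   // "title" was added first
        terms.add(new NumberedTerm(1, "apple"));
        terms.add(new NumberedTerm(0, "alpha"));
        System.out.println(sortedKeys(terms));     // prints [0:alpha, 0:bravo, 1:apple, 1:banana]
    }
}
```

A lookup still works by binary search or index scan, as long as the query term is first mapped from field name to field number.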

Dave




Re: Global field semantics

2006-07-10 Thread Yonik Seeley

On 7/10/06, David Balmain <[EMAIL PROTECTED]> wrote:

I don't think declaring all fields up front is necessary for
substantial optimizations. I've found that the key to some really good
optimizations is having constant field numbers. That is, once a field
is added to the index it is assigned a field number and it keeps
that field number for the life of the index.


I can sort of see how this would work when adding documents to a single index.
What about merging indices via IndexWriter.addIndexes()?  I guess
this would require keeping the current way of merging around as a
fallback?

Does this mess up opening a MultiReader on multiple indices
constructed at different times?  This is a common thing for people to
do.


This allows one
FieldInfos object per index instead of one per segment.


So when a new segment is written, the global FieldInfos may need to be updated.
I guess this should be written after the new segment and before the
"segments" file.


 As I mentioned
earlier this greatly optimizes the merging of term vectors and stored
fields. The only problem I could find with this solution is that
fields are no longer in alphabetical order in the term dictionary but
I couldn't think of a use-case where this is necessary although I'm
sure there probably is one.


Isn't an ordered term dictionary necessary to do lookups?

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server




Re: Global field semantics

2006-07-10 Thread David Balmain

On 7/11/06, Chuck Williams <[EMAIL PROTECTED]> wrote:

David Balmain wrote on 07/10/2006 01:04 AM:
> The only problem I could find with this solution is that
> fields are no longer in alphabetical order in the term dictionary but
> I couldn't think of a use-case where this is necessary although I'm
> sure there probably is one.

So presumably fields are still contiguous, you keep a pointer to where
each field starts, and terms within the field remain in alphabetical order?


Actually yes, that is how I did it although I'm not sure it's the best
way now. I was hoping that by having a pointer to the start of each
field there would be some good performance gains in searching but it
turned out not to be the case. You really only save a couple of
iterations in the getIndexOffset method.

To make things easier, though, you can just leave the
TermInfosWriter/Reader almost as they are. The only difference
is that you store field numbers in the index rather than field names,
and when you compare terms while scanning the index, you compare
field numbers rather than field names.

I don't know if I've described it very well but I hope that makes sense.

Cheers,

Dave

PS. By the way, I don't know if I made this clear but the 5x speedup
I was talking about comes during indexing. The performance improvement
as far as search is concerned wasn't what I had hoped. It is a little
faster but the bottleneck really comes from reading the documents
from the index. So to alleviate that I've added lazy field loading
which seems to work well. Actually, I've set it up so that I can read
excerpts from fields without even loading the whole field so
highlighting is super fast.




Re: Global field semantics

2006-07-10 Thread Chuck Williams
Chris Hostetter wrote on 07/10/2006 02:06 AM:
> As near as i can tell, the large issue can be summarized with the following
> sentiment:
>
>   Performance gains could be realized if Field
>   properties were made fixed and homogeneous for
>   all Documents in an index.
>   

This is certainly a large issue, as David says he has achieved a 5x
performance gain.

My interest in global field semantics originally sprang from
functionality considerations, not performance considerations.  I've got
many features that require reasoning about field semantics.  I
previously mentioned a very simple one:  validating fields in the query
parser.  More interesting examples are:

  1.  Multiple inheritance on the fields of documents that record the
sources of each inherited value to support efficient incremental maintenance
  2.  "Record-valued fields" that store facets with values (e.g., time
and user information for who set that value).  These cannot easily be
broken into multiple fields because the fields in question are multi-valued.
  3.  "Join fields" that reference id's of objects stored in separate
indices (supporting queries that reference the fields in the joined index)

Managing these kinds of rich semantic features in query parsing and
indexing is greatly facilitated by a global field model.  I've built
this into my app, and then started thinking about benefits in Lucene
generally from such a model.

>   1) all Fields and their properties must be predeclared before any
>  document is ever added to the index, and any Field not declared is
>  illegal.
>   2) a Field springs into existence the first time a Document is added
>  with a value for it -- but after that all newly added Documents with
>  a value for that field must conform to the Field properites initially
>  used.
>
> (have I missed any general approaches?)
>   

Yes.  Here is (an elaboration of) the "global model with exceptions"
idea we reached:

3) There is a global field model in Lucene that contains the list of
all known fields and their "default semantics".  The class that contains
this model supports a number of implicit and explicit methods to
construct and query the model.  The model can be evolved.  The model is
used many places in Lucene, in some cases according to
application-settable properties.  E.g.:
a) Creating a Field uses the properties of the model so they
need not be specified at each construction.  A global model property
determines whether or not field properties may be overridden, and
whether or not fields may be created that are not in the model (in which
case, they are automatically added to the model).
b) The query parser has hooks that affect Query generation based
on the model properties of the field (not just for certain special query
types like Term's and RangeQuery's).  The application can easily provide
methods to implement these hooks.  This is essential for features like
2&3 above (and beneficial for 1).
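
A minimal sketch of point (a), assuming hypothetical GlobalFieldModel and FieldProps classes (neither exists in Lucene; the names, the three boolean properties and the two policy flags are invented for illustration). The model holds default semantics per field name, and two model-level switches decide whether a Field may override the defaults and whether unknown fields are added automatically.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical global field model; not part of Lucene.
final class GlobalFieldModel {
    private final Map<String, FieldProps> defaults = new HashMap<>();
    private final boolean allowOverrides;   // may a Field differ from the model?
    private final boolean allowUndeclared;  // may unknown fields be auto-added?

    GlobalFieldModel(boolean allowOverrides, boolean allowUndeclared) {
        this.allowOverrides = allowOverrides;
        this.allowUndeclared = allowUndeclared;
    }

    void declare(String name, FieldProps props) {
        defaults.put(name, props);
    }

    // Returns the default semantics for a field, or null if it is unknown.
    FieldProps defaultsFor(String name) {
        return defaults.get(name);
    }

    // Resolve the properties for a new Field. Pass null for 'requested'
    // to take the model defaults.
    FieldProps resolve(String name, FieldProps requested) {
        FieldProps known = defaults.get(name);
        if (known == null) {
            if (!allowUndeclared || requested == null) {
                throw new IllegalArgumentException("unknown field: " + name);
            }
            defaults.put(name, requested);  // first use defines the defaults
            return requested;
        }
        if (requested == null || requested.sameAs(known)) {
            return known;
        }
        if (!allowOverrides) {
            throw new IllegalArgumentException("field properties are fixed for: " + name);
        }
        return requested;  // exception to the global model, permitted by policy
    }

    public static void main(String[] args) {
        GlobalFieldModel model = new GlobalFieldModel(false, true);
        model.resolve("title", new FieldProps(true, true, true));
        System.out.println(model.resolve("title", null).stored);
    }
}

// Hypothetical per-field semantics, reduced to three flags.
final class FieldProps {
    final boolean stored;
    final boolean indexed;
    final boolean tokenized;

    FieldProps(boolean stored, boolean indexed, boolean tokenized) {
        this.stored = stored;
        this.indexed = indexed;
        this.tokenized = tokenized;
    }

    boolean sameAs(FieldProps other) {
        return stored == other.stored && indexed == other.indexed && tokenized == other.tokenized;
    }
}
```

With both flags false this behaves like approach 1 (everything predeclared); with allowUndeclared true and allowOverrides false it behaves like approach 2; with both true it is the "global model with exceptions".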

> How would something like this work?
>
>   docA.add(new Field(f, "bar", Store.YES, Index.UN_TOKENIZED)):
>   docA.add(new Field(f, "foo", Store.NO,  Index.TOKENIZED)):
>
>   docB.add(new Field(f, "x y", Store.YES, Index.TOKENIZED)):
>   docB.add(new Field(f, "z",   Store.NO,  Index.UN_TOKENIZED)):
>   

The application could determine whether or not this kind of operation
was supported according to the global enforcement properties of the
model.  If this is needed, the ability to have exceptions at the Field
level would permit it.

Hoss, do you have a use case requiring Store and Index variance like this?

The impact of this flexibility on David's 5x is another question...

Chuck





Re: Global field semantics

2006-07-10 Thread Chuck Williams
David Balmain wrote on 07/10/2006 01:04 AM:
> The only problem I could find with this solution is that
> fields are no longer in alphabetical order in the term dictionary but
> I couldn't think of a use-case where this is necessary although I'm
> sure there probably is one.

So presumably fields are still contiguous, you keep a pointer to where
each field starts, and terms within the field remain in alphabetical order?

Chuck






Re: Global field semantics

2006-07-10 Thread Chris Hostetter

: > Are there good reasons this path has not been followed?
:
: Hoss, that's your cue.

I must admit, I haven't been able to fully follow this thread, perhaps
it's just because it's late (no, that can't be it ... i started reading it
at 3:30 this afternoon and then stopped because it was making my head
hurt).  In all honesty, I probably would have skimmed the whole thing without
commenting if Marvin hadn't called me out onto the mat -- so I'll do my
best to make sense of it.

As near as i can tell, the large issue can be summarized with the following
sentiment:

Performance gains could be realized if Field
properties were made fixed and homogeneous for
all Documents in an index.

...I've left this sentiment vague, and i'll ignore the implementation
specifics since i don't understand them -- but there seem to be two high
level approaches that are involved, which are advocated to varying degrees
by varying folks...

  1) all Fields and their properties must be predeclared before any
 document is ever added to the index, and any Field not declared is
 illegal.
  2) a Field springs into existence the first time a Document is added
 with a value for it -- but after that all newly added Documents with
 a value for that field must conform to the Field properties initially
 used.

(have I missed any general approaches?)

The questions (in my mind at least) are:

  a) How much performance gain can be realized by these limitations?
  b) Would it be possible to implement these limitations in such a way
 that they are "optional" for people willing to accept the trade off?
  c) if (b) is false, then is (a) great enough to warrant changing Lucene
 anyway?  What exactly is sacrificed?


I can't speak to (a) or (b) ... but I'll throw out some examples for (c)

Regarding #1...

If Fields must be predeclared, Lucene would lose two of the biggest
advantages it has in my opinion:

 * The ability to evolve an index.  To have an extremely large index, and
to add a field to this index that is only used by "new" documents.  This
is not only useful when the nature of your data changes (TPS Reports
didn't use to have a "cover_sheet" field, and now they do) but also when
the usage of an existing field changes and you don't want to rebuild from
scratch (you've always had an indexed "cover_sheet" field, and now you want
it to be stored too .. so you change your index building code, and let it
run for a little while, and then go back and reindex the old stuff later)

 * the ability to have dynamically named fields.  At CNET we have
"attributes" for products, those attributes are defined in a database, and
the list of valid attributes is different based on the type of product.  I
don't know what they all are, and that list could change tomorrow -- and i
don't want to have to rebuild my index from scratch just because someone
decided that laptops need a new attribute called "heat dissipation factor"
(note:


Regarding #2...

This approach wouldn't necessarily conflict with the dynamically named
fields example above, but it would suffer the same "evolving index"
problems.


Last but not least is the high level issue of "homogeneous" Fields and
Field properties for all documents.  As has been pointed out, in many
cases this is not that big of a deal, because even if you want
heterogeneous documents stored in a single index, you can construct a list
of Fields which is the union of the Fields from your heterogeneous
Documents and use it -- hopefully no new requirement is added that all
Documents must have a value for all fields.  But what about complex
interactions between multi-valued, stored, indexed fields?

How would something like this work?

  docA.add(new Field(f, "bar", Store.YES, Index.UN_TOKENIZED));
  docA.add(new Field(f, "foo", Store.NO,  Index.TOKENIZED));

  docB.add(new Field(f, "x y", Store.YES, Index.TOKENIZED));
  docB.add(new Field(f, "z",   Store.NO,  Index.UN_TOKENIZED));

...both docs have two "Fields" for field name "f", both have a stored
value for f, both have some indexed terms for f, both have
some tokenized terms and one untokenized term for f ... but do these two
docs both conform to the same "Global field semantics" ?



-Hoss





Re: Global field semantics

2006-07-10 Thread David Balmain

On 7/10/06, Doug Cutting <[EMAIL PROTECTED]> wrote:

Chuck Williams wrote:
> Lucene today allows many field properties to vary at the Field level.
> E.g., the same field name might be tokenized in one Field on a Document
> while it is untokenized in another Field on the same or different
> Document.

The rationale for this design was to keep the API simple.  I think of it
like variable declarations: some languages require them and some don't.
  I opted to make Lucene fields like dynamically-typed variables.  In
part, Lucene's popularity is due to the simplicity of its API.


The irony has just struck me that most people are happy with the
"dynamically-typed" fields in Java (Lucene) but they didn't go down as
well in Ruby (Ferret).


However, in my uses of Lucene, most documents have the same fields used
in the same way, so I don't think I've ever actually taken much
advantage of this functionality.  It is nice to be able to add a field
to an index by changing the indexing code in a single place, where the
field's value is created, and not having to also change the index
initialization code.  We should try to keep such redundancies out of
user code.

Thus I would encourage any change in this direction to continue to
permit fields to be defined lazily, the first time they are added,
rather than requiring all fields to be declared up front.  Are there
substantial optimizations that are only possible if all fields are known
when the index is initialized?


I don't think declaring all fields up front is necessary for
substantial optimizations. I've found that the key to some really good
optimizations is having constant field numbers. That is, once a field
is added to the index it is assigned a field number and it keeps
that field number for the life of the index. This allows one
FieldInfos object per index instead of one per segment. As I mentioned
earlier this greatly optimizes the merging of term vectors and stored
fields. The only problem I could find with this solution is that
fields are no longer in alphabetical order in the term dictionary but
I couldn't think of a use-case where this is necessary although I'm
sure there probably is one.
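
A rough sketch of the constant-field-number idea, assuming a hypothetical FieldNumbers registry (the class and its methods are invented for illustration; this is not Ferret's code or Lucene's FieldInfos). A field gets the next free number the first time it is seen and keeps it afterwards, so fields can still be defined lazily. The compatibility check is one possible reading of "same fields": two registries can share data without renumbering if they agree on every number they both use.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical index-wide registry of field numbers.
final class FieldNumbers {
    private final Map<String, Integer> byName = new HashMap<>();
    private final List<String> byNumber = new ArrayList<>();

    // Returns the field's existing number, or assigns the next free one.
    // Once assigned, a number is never changed or reused.
    int numberFor(String fieldName) {
        Integer n = byName.get(fieldName);
        if (n == null) {
            n = byNumber.size();
            byName.put(fieldName, n);
            byNumber.add(fieldName);
        }
        return n;
    }

    String nameOf(int number) {
        return byNumber.get(number);
    }

    int size() {
        return byNumber.size();
    }

    // True if both registries map every number they have in common to the
    // same field name, i.e. data could be copied without remapping numbers.
    boolean sharesNumberingWith(FieldNumbers other) {
        int common = Math.min(size(), other.size());
        for (int i = 0; i < common; i++) {
            if (!nameOf(i).equals(other.nameOf(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        FieldNumbers numbers = new FieldNumbers();
        System.out.println(numbers.numberFor("title"));
        System.out.println(numbers.numberFor("content"));
        System.out.println(numbers.numberFor("title"));
    }
}
```

When the check fails, a merge would have to fall back to remapping field numbers segment by segment.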

Anyway, hopefully we'll be able to lead the way with some brilliant
new ideas in the Lucy project. Put our money where our mouth is, so to
speak. If only I had a little more time right now.

Cheers,
Dave




Re: Global field semantics

2006-07-10 Thread Doug Cutting

Chuck Williams wrote:
> Lucene today allows many field properties to vary at the Field level.
> E.g., the same field name might be tokenized in one Field on a Document
> while it is untokenized in another Field on the same or different
> Document.


The rationale for this design was to keep the API simple.  I think of it 
like variable declarations: some languages require them and some don't. 
 I opted to make Lucene fields like dynamically-typed variables.  In 
part, Lucene's popularity is due to the simplicity of its API.


However, in my uses of Lucene, most documents have the same fields used 
in the same way, so I don't think I've ever actually taken much 
advantage of this functionality.  It is nice to be able to add a field 
to an index by changing the indexing code in a single place, where the 
field's value is created, and not having to also change the index 
initialization code.  We should try to keep such redundancies out of 
user code.


Thus I would encourage any change in this direction to continue to 
permit fields to be defined lazily, the first time they are added, 
rather than requiring all fields to be declared up front.  Are there 
substantial optimizations that are only possible if all fields are known 
when the index is initialized?


Doug




Re: Global field semantics

2006-07-09 Thread David Balmain

On 7/10/06, Chuck Williams <[EMAIL PROTECTED]> wrote:

David Balmain wrote on 07/09/2006 06:44 PM:
> On 7/10/06, Chuck Williams <[EMAIL PROTECTED]> wrote:
>> Marvin Humphrey wrote on 07/08/2006 11:13 PM:
>> >
>> > On Jul 8, 2006, at 9:46 AM, Chuck Williams wrote:
>> >
>> >> Many things would be cleaner in Lucene if fields had a global
>> semantics,
>> >> i.e., if properties like text vs. binary, Index, Store,
>> TermVector, the
>> >> appropriate Analyzer, the assignment of Directory in
>> ParallelReader (or
>> >> ParallelWriter), etc. were a function of just the field name and the
>> >> index.
>> >
>> > In June, Dave Balmain and I discussed the issue extensively on the
>> > Ferret list.  It might have been nice to use the Lucy list, since a
>> > lot of the discussion was about Lucy, but the Lucy lists didn't exist
>> > at the time.
>> >
>> > http://rubyforge.org/pipermail/ferret-talk/2006-June/000536.html
>> >
>> I think there are a number of problems with that proposal and hope it
>> was not adopted.
>
> Hi Chuck,
>
> Actually, it was adopted and I'm quite happy with the solution. I'd be
> very interested to hear what the number of problems are, besides the
> example you've already given. Even if you never use Ferret, it can
> only help me improve my software.

Hi David,

Thanks for your reply.

I'm not aware of other problems beyond the ones I've already cited.
After thinking of these, my confidence that there were not others waned.

>
> I'll start by covering your term-vector example. By adding fixed
> index-wide field properties to Ferret I was able to obtain up to a
> huge speed improvement during indexing.

This is very interesting.  Can you say how much?


About a factor of 5. I won't compare it to Lucene's speed though
as I know that's asking for trouble. You'll be able to try it yourself
in a week or so when I finally release it.


> With the CPU time I gain in Ferret I could
> easily re-analyze large fields and build term vectors for them
> separately. It's a little more work for less common use cases like
> yours but in the end, everyone benifits in terms of performance.

Does Ferret work this way, or would that be up to the application?


Currently that would be up to the application.


>> As my earlier example showed, there is at least one
>> valid use case where storing a term vector is not an invariant property
>> of a field; specifically, when using term vectors to optimize excerpt
>> generation, it is best to store them only for fields that have long
>> values.  This is even a counter-example to Karl's proposal, since a
>> single Document may have multiple fields of the same name, some with
>> long values and others with short values; multiple fields of the same
>> name may legitimately have different TermVector settings even on a
>> single Document.
>
> I think you'll find if you look at the DocumentWriter#writePostings
> method that it's "one in, all in" in terms of storing term vectors for
> a field. That is, if you have 5 "content" fields and only one of those
> is set to store term vectors, then all of the fields will store term
> vectors.

Right you are, and clearly necessarily so since the values of the
multiple fields are implicitly concatenated (with
positionIncrementGap).  So, Lucene already limits my term vector
optimization to the Document level.  As it happens, I only use it for
large body fields, of which each of my Documents has at most one.

>
>> I haven't thought of cases where Index or Store would legitimately vary
>> across Fields or Documents, but am less convinced there aren't important
>> use cases for these as well.  Similarly, although it is important to
>> allow term vectors to be on or off at the field level, I don't see any
>> obvious need to vary the type of term vector (positions, offsets or
>> both).
>
> I think Store could definitely legitimately vary across Fields or
> Documents for the same reason your term vectors do. Perhaps you are
> indexing pages from the web and you want to cache only the smaller
> pages.

That's an interesting example, but not as compelling an objection to me
(and seemingly not to you either!).  The app could always store an empty
string without much consequence in this scenario.

>
>> There are significant benefits to global semantics, as evidenced by the
>> fact that several of us independently came to desire this.  However,
>> deciding what can be global and what cannot is more subtle.
>
> I agree. I can't se

Re: Global field semantics

2006-07-09 Thread Chuck Williams
David Balmain wrote on 07/09/2006 06:44 PM:
> On 7/10/06, Chuck Williams <[EMAIL PROTECTED]> wrote:
>> Marvin Humphrey wrote on 07/08/2006 11:13 PM:
>> >
>> > On Jul 8, 2006, at 9:46 AM, Chuck Williams wrote:
>> >
>> >> Many things would be cleaner in Lucene if fields had a global
>> semantics,
>> >> i.e., if properties like text vs. binary, Index, Store,
>> TermVector, the
>> >> appropriate Analyzer, the assignment of Directory in
>> ParallelReader (or
>> >> ParallelWriter), etc. were a function of just the field name and the
>> >> index.
>> >
>> > In June, Dave Balmain and I discussed the issue extensively on the
>> > Ferret list.  It might have been nice to use the Lucy list, since a
>> > lot of the discussion was about Lucy, but the Lucy lists didn't exist
>> > at the time.
>> >
>> > http://rubyforge.org/pipermail/ferret-talk/2006-June/000536.html
>> >
>> I think there are a number of problems with that proposal and hope it
>> was not adopted.
>
> Hi Chuck,
>
> Actually, it was adopted and I'm quite happy with the solution. I'd be
> very interested to hear what the number of problems are, besides the
> example you've already given. Even if you never use Ferret, it can
> only help me improve my software.

Hi David,

Thanks for your reply.

I'm not aware of other problems beyond the ones I've already cited. 
After thinking of these, my confidence that there were not others waned.

>
> I'll start by covering your term-vector example. By adding fixed
> index-wide field properties to Ferret I was able to obtain up to a
> huge speed improvement during indexing.

This is very interesting.  Can you say how much?

> With the CPU time I gain in Ferret I could
> easily re-analyze large fields and build term vectors for them
> separately. It's a little more work for less common use cases like
> yours but in the end, everyone benifits in terms of performance.

Does Ferret work this way, or would that be up to the application?

>
>> As my earlier example showed, there is at least one
>> valid use case where storing a term vector is not an invariant property
>> of a field; specifically, when using term vectors to optimize excerpt
>> generation, it is best to store them only for fields that have long
>> values.  This is even a counter-example to Karl's proposal, since a
>> single Document may have multiple fields of the same name, some with
>> long values and others with short values; multiple fields of the same
>> name may legitimately have different TermVector settings even on a
>> single Document.
>
> I think you'll find if you look at the DocumentWriter#writePostings
> method that it's "one in, all in" in terms of storing term vectors for
> a field. That is, if you have 5 "content" fields and only one of those
> is set to store term vectors, then all of the fields will store term
> vectors.

Right you are, and clearly necessarily so since the values of the
multiple fields are implicitly concatenated (with
positionIncrementGap).  So, Lucene already limits my term vector
optimization to the Document level.  As it happens, I only use it for
large body fields, of which each of my Documents has at most one.
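
The "one in, all in" behaviour can be modelled with a small sketch. The classes below (TermVectorPolicy, FieldInstance) are hypothetical and only model the flag; they are not Lucene's DocumentWriter, and they ignore the concatenation of values and positionIncrementGap.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of how a per-instance term vector flag collapses
// to one decision per field name within a single document.
final class TermVectorPolicy {
    // For each field name in the document, term vectors are stored if
    // ANY instance of that name asked for them.
    static Map<String, Boolean> effectiveFlags(List<FieldInstance> document) {
        Map<String, Boolean> flags = new LinkedHashMap<>();
        for (FieldInstance f : document) {
            flags.merge(f.name, f.storeTermVector, Boolean::logicalOr);
        }
        return flags;
    }

    public static void main(String[] args) {
        List<FieldInstance> doc = new ArrayList<>();
        doc.add(new FieldInstance("content", "short value", false));
        doc.add(new FieldInstance("content", "a much longer body ...", true));
        doc.add(new FieldInstance("title", "a title", false));
        System.out.println(effectiveFlags(doc));
    }
}

// Hypothetical single field instance on a document.
final class FieldInstance {
    final String name;
    final String value;
    final boolean storeTermVector;

    FieldInstance(String name, String value, boolean storeTermVector) {
        this.name = name;
        this.value = value;
        this.storeTermVector = storeTermVector;
    }
}
```

Under this model a term vector setting is effectively a property of (document, field name), not of the individual Field instance.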

>
>> I haven't thought of cases where Index or Store would legitimately vary
>> across Fields or Documents, but am less convinced there aren't important
>> use cases for these as well.  Similarly, although it is important to
>> allow term vectors to be on or off at the field level, I don't see any
>> obvious need to vary the type of term vector (positions, offsets or
>> both).
>
> I think Store could definitely legitimately vary across Fields or
> Documents for the same reason your term vectors do. Perhaps you are
> indexing pages from the web and you want to cache only the smaller
> pages.

That's an interesting example, but not as compelling an objection to me
(and seemingly not to you either!).  The app could always store an empty
string without much consequence in this scenario.

>
>> There are significant benefits to global semantics, as evidenced by the
>> fact that several of us independently came to desire this.  However,
>> deciding what can be global and what cannot is more subtle.
>
> I agree. I can't see global field semantics making it into Lucene in
> the short term. It's a rather large change, particularly if you want
> to make full use of the performance benifits it affords.

Could you summarize where these derive from?

>
>> Perhaps the best thing at the Lucene level is to 

Re: Global field semantics

2006-07-09 Thread David Balmain

On 7/10/06, Chuck Williams <[EMAIL PROTECTED]> wrote:

Marvin Humphrey wrote on 07/08/2006 11:13 PM:
>
> On Jul 8, 2006, at 9:46 AM, Chuck Williams wrote:
>
>> Many things would be cleaner in Lucene if fields had a global semantics,
>> i.e., if properties like text vs. binary, Index, Store, TermVector, the
>> appropriate Analyzer, the assignment of Directory in ParallelReader (or
>> ParallelWriter), etc. were a function of just the field name and the
>> index.
>
> In June, Dave Balmain and I discussed the issue extensively on the
> Ferret list.  It might have been nice to use the Lucy list, since a
> lot of the discussion was about Lucy, but the Lucy lists didn't exist
> at the time.
>
> http://rubyforge.org/pipermail/ferret-talk/2006-June/000536.html
>
I think there are a number of problems with that proposal and hope it
was not adopted.


Hi Chuck,

Actually, it was adopted and I'm quite happy with the solution. I'd be
very interested to hear what those problems are, besides the
example you've already given. Even if you never use Ferret, it can
only help me improve my software.

I'll start by covering your term-vector example. By adding fixed
index-wide field properties to Ferret I was able to obtain a
huge speed improvement during indexing. I believe Marvin has had
similar success using his own merge model and with fixed field
properties in KinoSearch. With the CPU time I gain in Ferret I could
easily re-analyze large fields and build term vectors for them
separately. It's a little more work for less common use cases like
yours but in the end, everyone benefits in terms of performance.


As my earlier example showed, there is at least one
valid use case where storing a term vector is not an invariant property
of a field; specifically, when using term vectors to optimize excerpt
generation, it is best to store them only for fields that have long
values.  This is even a counter-example to Karl's proposal, since a
single Document may have multiple fields of the same name, some with
long values and others with short values; multiple fields of the same
name may legitimately have different TermVector settings even on a
single Document.


I think you'll find if you look at the DocumentWriter#writePostings
method that it's "one in, all in" in terms of storing term vectors for
a field. That is, if you have 5 "content" fields and only one of those
is set to store term vectors, then all of the fields will store term
vectors.
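
The "one in, all in" behaviour described above can be pictured as a per-name flag that is OR-ed across every Field instance added under that name. What follows is a minimal sketch under that assumption, not the actual DocumentWriter code; `FieldFlags` and its methods are hypothetical names invented for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for per-field-name bookkeeping; not a Lucene class.
class FieldFlags {
    private final Map<String, Boolean> storeTermVector = new HashMap<>();

    // Record one Field instance. Once any instance of a name asks for
    // term vectors, the flag stays true for that name ("one in, all in").
    void add(String name, boolean wantsTermVector) {
        storeTermVector.merge(name, wantsTermVector, Boolean::logicalOr);
    }

    // Names that were never added report false.
    boolean storesTermVector(String name) {
        return storeTermVector.getOrDefault(name, false);
    }
}
```

With five "content" fields of which only one requests term vectors, `storesTermVector("content")` ends up true for the whole name, which is why a per-Field setting cannot be honoured within a single Document.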


As another counter-example from my own app which I'd forgotten
yesterday, an important case where the Analyzer will vary across
documents is for i18n, where different languages require different
analyzers.  Refuting again my own argument about this not being
consistent with query parsing, the language of the query is a distinct
property from the languages of various documents in the collection.  In
my app, I let the user specify the language of the query, while the
language of each Document is determined automatically.  So, analyzers
vary for both queries and documents, but independently.


Ferret doesn't record any details about analysis in the field
properties. I definitely agree with you here.


I haven't thought of cases where Index or Store would legitimately vary
across Fields or Documents, but am less convinced there aren't important
use cases for these as well.  Similarly, although it is important to
allow term vectors to be on or off at the field level, I don't see any
obvious need to vary the type of term vector (positions, offsets or both).


I think Store could definitely legitimately vary across Fields or
Documents for the same reason your term vectors do. Perhaps you are
indexing pages from the web and you want to cache only the smaller
pages.


There are significant benefits to global semantics, as evidenced by the
fact that several of us independently came to desire this.  However,
deciding what can be global and what cannot is more subtle.


I agree. I can't see global field semantics making it into Lucene in
the short term. It's a rather large change, particularly if you want
to make full use of the performance benefits it affords.


Perhaps the best thing at the Lucene level is to have a notion of
default semantics for a field name.  Whenever a Field of that name is
constructed, those semantics would be used unless the constructor
overrides them.  This would allow additional constructors on Field with
simpler signatures for the common case of invariant Field properties.
It would also allow applications to access the class that holds the
default field information for an index.  The application will know which
properties it can rely on as invariant and whether or not the set of
fields is closed.

This approach would preserve upward compatibility and provide, I
believe, most of the benefits we 

Re: Global field semantics

2006-07-09 Thread Marvin Humphrey


On Jul 9, 2006, at 11:31 AM, Chuck Williams wrote:


Marvin Humphrey wrote on 07/08/2006 11:13 PM:


On Jul 8, 2006, at 9:46 AM, Chuck Williams wrote:

Many things would be cleaner in Lucene if fields had a global semantics,
i.e., if properties like text vs. binary, Index, Store, TermVector, the
appropriate Analyzer, the assignment of Directory in ParallelReader (or
ParallelWriter), etc. were a function of just the field name and the
index.


In June, Dave Balmain and I discussed the issue extensively on the
Ferret list.  It might have been nice to use the Lucy list, since a
lot of the discussion was about Lucy, but the Lucy lists didn't exist
at the time.

http://rubyforge.org/pipermail/ferret-talk/2006-June/000536.html


I think there are a number of problems with that proposal and hope it
was not adopted.


The email which kicks off the thread is Dave's initial proposal for  
*Ferret*.  That's outside the domain of Apache Lucene.  Dave did not  
submit it as a proposal for either Lucy or Lucene.  There's an  
extended discussion which follows where a number of ideas are kicked  
around, some of them related to Lucy.  Since you respond only to that  
one email, perhaps you did not read the rest of the thread.


You asked for prior discussion, and I gave you a link to prior  
discussion.  Here is the quote from my original email, with the parts  
which you silently snipped restored:





Has this been considered before?


Robert Kirchgessner made some of the same arguments in a January  
thread.  They were compelling then, and they're compelling now.


http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200601.mbox/[EMAIL PROTECTED]


In June, Dave Balmain and I discussed the issue extensively on the  
Ferret list.  It might have been nice to use the Lucy list, since a  
lot of the discussion was about Lucy, but the Lucy lists didn't  
exist at the time.


http://rubyforge.org/pipermail/ferret-talk/2006-June/000536.html




I did not intend to submit Dave's Ferret proposal by proxy to this  
group.  I don't have time right now to defend something which was  
never meant for either Lucene or Lucy at length.  I know that Dave  
doesn't either.  I regret having provided the link.


Marvin Humphrey
Rectangular Research
http://www.rectangular.com/





Re: Global field semantics

2006-07-09 Thread Chuck Williams
Marvin Humphrey wrote on 07/08/2006 11:13 PM:
>
> On Jul 8, 2006, at 9:46 AM, Chuck Williams wrote:
>
>> Many things would be cleaner in Lucene if fields had a global semantics,
>> i.e., if properties like text vs. binary, Index, Store, TermVector, the
>> appropriate Analyzer, the assignment of Directory in ParallelReader (or
>> ParallelWriter), etc. were a function of just the field name and the
>> index.
>
> In June, Dave Balmain and I discussed the issue extensively on the
> Ferret list.  It might have been nice to use the Lucy list, since a
> lot of the discussion was about Lucy, but the Lucy lists didn't exist
> at the time.
>
> http://rubyforge.org/pipermail/ferret-talk/2006-June/000536.html
>
I think there are a number of problems with that proposal and hope it
was not adopted.  As my earlier example showed, there is at least one
valid use case where storing a term vector is not an invariant property
of a field; specifically, when using term vectors to optimize excerpt
generation, it is best to store them only for fields that have long
values.  This is even a counter-example to Karl's proposal, since a
single Document may have multiple fields of the same name, some with
long values and others with short values; multiple fields of the same
name may legitimately have different TermVector settings even on a
single Document.

As another counter-example from my own app which I'd forgotten
yesterday, an important case where the Analyzer will vary across
documents is for i18n, where different languages require different
analyzers.  Refuting again my own argument about this not being
consistent with query parsing, the language of the query is a distinct
property from the languages of various documents in the collection.  In
my app, I let the user specify the language of the query, while the
language of each Document is determined automatically.  So, analyzers
vary for both queries and documents, but independently.
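
The independence of query language and document language can be sketched as a single registry that is consulted with a different language argument at index time and at query time. This is only an illustration: `LanguageAnalyzers` is a hypothetical name, and plain functions stand in for real Lucene Analyzer instances.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Hypothetical registry: language code -> analysis function.
// A real application would map to Lucene Analyzer instances instead.
class LanguageAnalyzers {
    private final Map<String, Function<String, List<String>>> byLanguage = new HashMap<>();

    // Fallback used when no analyzer is registered for a language:
    // lowercase and split on whitespace.
    private final Function<String, List<String>> fallback =
        text -> Arrays.asList(text.toLowerCase(Locale.ROOT).split("\\s+"));

    void register(String language, Function<String, List<String>> analyzer) {
        byLanguage.put(language, analyzer);
    }

    // The same lookup serves both sides; the caller supplies the language:
    // detected per Document at index time, chosen by the user at query time.
    List<String> analyze(String language, String text) {
        return byLanguage.getOrDefault(language, fallback).apply(text);
    }
}
```

Because the language is an argument rather than a property of the field name, nothing here could be recorded as a global field property.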

I haven't thought of cases where Index or Store would legitimately vary
across Fields or Documents, but am less convinced there aren't important
use cases for these as well.  Similarly, although it is important to
allow term vectors to be on or off at the field level, I don't see any
obvious need to vary the type of term vector (positions, offsets or both).

There are significant benefits to global semantics, as evidenced by the
fact that several of us independently came to desire this.  However,
deciding what can be global and what cannot is more subtle.

Perhaps the best thing at the Lucene level is to have a notion of
default semantics for a field name.  Whenever a Field of that name is
constructed, those semantics would be used unless the constructor
overrides them.  This would allow additional constructors on Field with
simpler signatures for the common case of invariant Field properties. 
It would also allow applications to access the class that holds the
default field information for an index.  The application will know which
properties it can rely on as invariant and whether or not the set of
fields is closed.

This approach would preserve upward compatibility and provide, I
believe, most of the benefits we all seek.
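
To make the default-semantics idea concrete, here is a minimal sketch. It borrows the `IndexFieldSet` name floated at the start of this thread; `FieldSemantics`, `setDefault` and `resolve` are hypothetical, and nothing here exists in Lucene. It assumes only three boolean properties for brevity.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical bundle of per-field properties; illustration only.
class FieldSemantics {
    final boolean stored;
    final boolean indexed;
    final boolean termVector;

    FieldSemantics(boolean stored, boolean indexed, boolean termVector) {
        this.stored = stored;
        this.indexed = indexed;
        this.termVector = termVector;
    }
}

// Hypothetical per-index registry of default semantics by field name.
class IndexFieldSet {
    private final Map<String, FieldSemantics> defaults = new HashMap<>();

    void setDefault(String fieldName, FieldSemantics semantics) {
        defaults.put(fieldName, semantics);
    }

    // Simple signature: use the registered defaults for this name.
    FieldSemantics resolve(String fieldName) {
        FieldSemantics s = defaults.get(fieldName);
        if (s == null) {
            throw new IllegalArgumentException("no default semantics for field: " + fieldName);
        }
        return s;
    }

    // Full signature: an explicit override wins over the default.
    FieldSemantics resolve(String fieldName, FieldSemantics override) {
        return override != null ? override : resolve(fieldName);
    }
}
```

Throwing on an unknown name models a closed field set; an open set would fall back to some built-in default instead.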

Thoughts?

Chuck





Re: Global field semantics

2006-07-08 Thread Marvin Humphrey


On Jul 8, 2006, at 9:46 AM, Chuck Williams wrote:

Many things would be cleaner in Lucene if fields had a global semantics,
i.e., if properties like text vs. binary, Index, Store, TermVector, the
appropriate Analyzer, the assignment of Directory in ParallelReader (or
ParallelWriter), etc. were a function of just the field name and the
index.


This is the direction I would like to go.


This approach would naturally admit a class, say IndexFieldSet,
that would hold global field semantics for an index.

Lucene today allows many field properties to vary at the Field level.
E.g., the same field name might be tokenized in one Field on a Document
while it is untokenized in another Field on the same or different
Document.  Does anybody know how often this flexibility is used?  Are
there interesting use cases for which it is important?  It seems to me
this functionality is already problematic and not fully supported; e.g.,
indexing can manage tokenization-variant fields, but query parsing
cannot.  Various extensions to Lucene exacerbate this kind of problem.

Perhaps more controversially, the notion of global field semantics would
be even stronger if the set of fields is closed.  This would allow, for
example, QueryParser to validate field names.  This has a number of
benefits, including for example avoiding false-negative "no results" due
to misspelling a field name.

Has this been considered before?


Robert Kirchgessner made some of the same arguments in a January  
thread.  They were compelling then, and they're compelling now.


http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200601.mbox/%[EMAIL PROTECTED]


In June, Dave Balmain and I discussed the issue extensively on the  
Ferret list.  It might have been nice to use the Lucy list, since a  
lot of the discussion was about Lucy, but the Lucy lists didn't exist  
at the time.


http://rubyforge.org/pipermail/ferret-talk/2006-June/000536.html

Thoughts on the document storage that occurred to me after that  
discussion: maybe the fdx file should spec two numbers: a file  
pointer, and an integer which indicates the class of object stored at  
that position in the fdt file.  The registry which maps integers to  
classes could be stored in some centralized file.  Perhaps one of  
these classes -- a LazyDoc -- could specify that only a few integer  
file pointers should be read right away, deferring reading of field  
data until later.
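
A rough sketch of what such a two-number fdx entry might look like follows. The fixed-width layout (8-byte pointer followed by 4-byte class id) and the `FdxEntry` name are assumptions made for illustration; no such format exists in Lucene.

```java
import java.nio.ByteBuffer;

// Hypothetical fixed-width .fdx entry: an 8-byte pointer into the .fdt file
// plus a 4-byte id naming the class of object stored at that position.
class FdxEntry {
    static final int BYTES = Long.BYTES + Integer.BYTES;

    final long fdtPointer;
    final int classId;

    FdxEntry(long fdtPointer, int classId) {
        this.fdtPointer = fdtPointer;
        this.classId = classId;
    }

    // Append this entry at the buffer's current position.
    void writeTo(ByteBuffer out) {
        out.putLong(fdtPointer);
        out.putInt(classId);
    }

    // Entries are fixed width, so document N lives at offset N * BYTES.
    static FdxEntry read(ByteBuffer in, int docNumber) {
        int offset = docNumber * BYTES;
        return new FdxEntry(in.getLong(offset), in.getInt(offset + Long.BYTES));
    }
}
```

The class id would then be looked up in the centralized registry to decide how, and how eagerly, to decode the bytes at `fdtPointer`.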



Are there good reasons this path has not been followed?


Hoss, that's your cue.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/





Re: Global field semantics

2006-07-08 Thread Chuck Williams
karl wettin wrote on 07/08/2006 12:27 PM:
> On Sat, 2006-07-08 at 11:08 -0700, Chuck Williams wrote:
>   
>> Karl, do you have specific reasons or use cases to normalize fields at
>> Document rather than at Index? 
>> 
>
> Nothing more than that the way the API looks it implies features that
> do not exist. Boost, store, index and vectors. I've learned, but I'm
> certain lots of newbies make the same assumptions as I did.
>   

I forgot one of my own use cases!  My app uses term vectors as an
optimization for determining excerpts (aka summaries).  Term vectors
increase the index size.  For large documents, the performance benefits
of using term vectors to find excerpts are large, but for small
documents they are non-existent or negative.  So, to optimize
performance and minimize index size, I store term vectors on the
relevant fields only when their values are sufficiently large.  This is
a concrete example of using the same field name with different
Field.TermVector values on different Documents.
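
The size-based decision can be sketched as a small policy object. `TermVectorPolicy` and the threshold are hypothetical; in an application the boolean result would presumably select between Field.TermVector.NO and one of the other Field.TermVector values when the Field is constructed.

```java
// Hypothetical helper; not part of Lucene. The threshold is whatever
// length the application has measured as the break-even point.
class TermVectorPolicy {
    private final int minLength;

    TermVectorPolicy(int minLength) {
        this.minLength = minLength;
    }

    // True when the value is long enough for term vectors to pay off
    // during excerpt generation; short or missing values skip them.
    boolean shouldStoreTermVector(String value) {
        return value != null && value.length() >= minLength;
    }
}
```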

Are there any similar examples for Field.Index or Field.Store?

Chuck





Re: Global field semantics

2006-07-08 Thread karl wettin
On Sat, 2006-07-08 at 11:08 -0700, Chuck Williams wrote:
> 
> Karl, do you have specific reasons or use cases to normalize fields at
> Document rather than at Index? 

Nothing more than that the way the API looks it implies features that
do not exist. Boost, store, index and vectors. I've learned, but I'm
certain lots of newbies make the same assumptions as I did.





Re: Global field semantics

2006-07-08 Thread Chuck Williams
karl wettin wrote on 07/08/2006 10:27 AM:
> On Sat, 2006-07-08 at 09:46 -0700, Chuck Williams wrote:
>   
>> Many things would be cleaner in Lucene if fields had a global semantics,
>> 
>
>   
>> Has this been considered before?  Are there good reasons this path has
>> not been followed?
>> 
>
> I've been posting some advocacy about the current Field. Basically I
> would like to see a more normalized field setting per document (instead
> of normalizing it in the writer), and I've been talking about something
> like this:
>
> [Document]<#>--- {1..*} ->[Value]-->[Field +name +store +index +vector]
> A
> | {0..*}
> |
>  [Index]
>
>   

And what I'm after would look like this:

[Document]<#>--- {1..*} ->[Value]
 A
 | {*..1}
 |
  [Field +store +index +vector +analyzer +directory]
 A
 | {1..1}
 |
[FieldName]
 A
 | {0..*}
 |
  [Index]



The key points are to have Index be a first-class object and to have
field names uniquely specify field properties.

Karl, do you have specific reasons or use cases to normalize fields at
Document rather than at Index?

Chuck





Re: Global field semantics

2006-07-08 Thread karl wettin
On Sat, 2006-07-08 at 09:46 -0700, Chuck Williams wrote:
> Many things would be cleaner in Lucene if fields had a global semantics,

> Has this been considered before?  Are there good reasons this path has
> not been followed?

I've been posting some advocacy about the current Field. Basically I
would like to see a more normalized field setting per document (instead
of normalizing it in the writer), and I've been talking about something
like this:

[Document]<#>--- {1..*} ->[Value]-->[Field +name +store +index +vector]
A
| {0..*}
|
 [Index]

I've made lots of changes and added new features like this to my own
branch as it takes ten times the effort to fix deprecations and
backwards compatibility for these things. 

I'm so up for a Lucene 3.0 code sandbox. 





Global field semantics

2006-07-08 Thread Chuck Williams
Many things would be cleaner in Lucene if fields had a global semantics,
i.e., if properties like text vs. binary, Index, Store, TermVector, the
appropriate Analyzer, the assignment of Directory in ParallelReader (or
ParallelWriter), etc. were a function of just the field name and the
index.  This approach would naturally admit a class, say IndexFieldSet,
that would hold global field semantics for an index.

Lucene today allows many field properties to vary at the Field level. 
E.g., the same field name might be tokenized in one Field on a Document
while it is untokenized in another Field on the same or different
Document.  Does anybody know how often this flexibility is used?  Are
there interesting use cases for which it is important?  It seems to me
this functionality is already problematic and not fully supported; e.g.,
indexing can manage tokenization-variant fields, but query parsing
cannot.  Various extensions to Lucene exacerbate this kind of problem.

Perhaps more controversially, the notion of global field semantics would
be even stronger if the set of fields is closed.  This would allow, for
example, QueryParser to validate field names.  This has a number of
benefits, including for example avoiding false-negative "no results" due
to misspelling a field name.
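
As an illustration of the validation a closed field set would allow, the sketch below checks the field prefixes in a query string against a known set. `FieldNameValidator` is a hypothetical name, and the regular expression handles only simple `name:term` clauses; the real QueryParser syntax is richer (quoted phrases, escapes, ranges) and would need proper hooks rather than a regex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical application-level check; not part of Lucene.
class FieldNameValidator {
    // Matches a simple "name:" prefix. Deliberately simplistic: it would
    // also match colons inside quoted phrases.
    private static final Pattern FIELD_PREFIX = Pattern.compile("(\\w+):");

    private final Set<String> knownFields;

    FieldNameValidator(Set<String> knownFields) {
        this.knownFields = knownFields;
    }

    // Returns each unknown field name once, in order of first appearance.
    List<String> unknownFields(String query) {
        List<String> unknown = new ArrayList<>();
        Matcher m = FIELD_PREFIX.matcher(query);
        while (m.find()) {
            String name = m.group(1);
            if (!knownFields.contains(name) && !unknown.contains(name)) {
                unknown.add(name);
            }
        }
        return unknown;
    }
}
```

A query such as `title:lucene AND bdoy:index` would then be rejected with "bdoy" reported as unknown, instead of silently returning no results.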

Has this been considered before?  Are there good reasons this path has
not been followed?

Thanks for any info,

Chuck

