FieldType API

Michael McCandless (JIRA) Tue, 21 Oct 2014 02:33:07 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-6005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178179#comment-14178179
 ]


Michael McCandless commented on LUCENE-6005:
--------------------------------------------

I committed the current work-in-progress to a new branch
(https://svn.apache.org/repos/asf/lucene/dev/branches/lucene6005).

I added a new FieldTypes class (holds the optional write-once schema)
and Document2 (to replace Document eventually).

Net/net I think the approach can work well: it's a minimally intrusive
API to optionally build up the write-once schema.  You can skip the
API entirely and it will "learn" your schema by seeing which Java
types you are adding to your documents and setting sensible defaults
accordingly.  It's quite a bit simpler than the current oal.document
API: no more separate XXXField nor FieldType classes.
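
To make the "learning" idea concrete, here is a minimal, self-contained sketch. It is not the FieldTypes class on the branch: LearnedSchema, Kind, observe and sortableByDefault are hypothetical names invented for illustration. The only default taken from the issue is that numbers are sortable; treating every other kind as non-sortable is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy illustration only: a hypothetical stand-in, not the FieldTypes class on the branch.
final class LearnedSchema {
    // Hypothetical value kinds, inferred from the Java type of the added value.
    enum Kind { TEXT, BINARY, NUMBER }

    private final Map<String, Kind> kinds = new LinkedHashMap<>();

    // Remembers the kind of a field the first time a value is added for it,
    // and rejects a later value of a different kind for the same field.
    Kind observe(String field, Object value) {
        Kind seen = kindOf(value);
        Kind first = kinds.putIfAbsent(field, seen);
        if (first != null && first != seen) {
            throw new IllegalStateException(
                "field \"" + field + "\" was first added as " + first
                    + "; cannot add a " + seen + " value to it");
        }
        return seen;
    }

    // Numbers are sortable by default, as the issue description says.
    // Returning false for every other kind is an assumption of this sketch.
    boolean sortableByDefault(String field) {
        return kinds.get(field) == Kind.NUMBER;
    }

    private static Kind kindOf(Object value) {
        if (value instanceof Number) return Kind.NUMBER;
        if (value instanceof byte[]) return Kind.BINARY;
        if (value instanceof CharSequence) return Kind.TEXT;
        throw new IllegalArgumentException("unsupported value type: "
            + (value == null ? "null" : value.getClass().getName()));
    }
}
```

Calling observe("number", 17) and then observe("number", "seventeen") on the same instance makes the second call fail, which is the kind of early, readable error the schema is meant to give.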

Indexed binary tokens work, via Document2.addAtom(...) (LUCENE-5989).

You can turn on/off sorting for a field, and this "translates" to the
appropriate DV type; I want to improve this by letting you specify the
default sort order, and also [eventually] specify collator.  I plan to
similarly enable highlighting.

I also added search-time APIs, e.g. newSort, newTermQuery,
newRangeQuery.  These methods throw clear exceptions if the field name
is unknown, or it wasn't indexed with a type that "matches" that
method.
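
The search-time checks can be sketched roughly as follows. This is a toy that does not depend on Lucene: SearchTimeChecks and register are hypothetical, and newSort here returns a plain string rather than a real Sort object. It only shows the two failure cases described above, an unknown field and a field that was not indexed in a way that "matches" the method.

```java
import java.util.HashMap;
import java.util.Map;

// Toy illustration only: hypothetical names, no Lucene classes involved.
final class SearchTimeChecks {
    // Field name -> whether sorting was enabled for it at index time.
    private final Map<String, Boolean> sortEnabled = new HashMap<>();

    // Hypothetical bookkeeping call standing in for what indexing would record.
    void register(String field, boolean sortable) {
        sortEnabled.put(field, sortable);
    }

    // Returns a description of the sort; a real API would build a Sort object here.
    String newSort(String field) {
        Boolean enabled = sortEnabled.get(field);
        if (enabled == null) {
            throw new IllegalArgumentException("unknown field \"" + field + "\"");
        }
        if (!enabled) {
            throw new IllegalStateException(
                "field \"" + field + "\": sorting was not enabled at index time");
        }
        return "sort by " + field;
    }
}
```

The point of failing in newSort, rather than at search execution, is that the caller gets the field name and the reason in one message.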

There are still many issues and nocommits:

  * Analyzer is passed to FieldTypes now; I would like to remove it
    from IndexWriterConfig.  To do this, I think I need to push
    multi-valued field handling out of IndexWriter up into "user
    space"... I already removed IndexableFieldType.tokenized as a
    first step.

  * Analyzers can't be serialized, so the app will have to
    re-initialize them on startup (like they must do anyway today with
    PFAW).  Same for Similarity.

  * You can only set per-field DVF and PF.

  * I only cut over a couple of tests, but they lose randomness since
    FieldTypes provides the default IWC, vs LTC.newIWC().

  * I had to suck in a fork of KeywordTokenizer.


> Explore alternative to Document/Field/FieldType API
> ---------------------------------------------------
>
>                 Key: LUCENE-6005
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6005
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: Trunk
>
>
> Auto-prefix terms (LUCENE-5879) is blocked because it's impossible in
> Lucene today to add a simple API to use it, and I don't think we
> should commit features that only super-experts can figure out how to
> use: that's evil.
> The only realistic "workaround" for such new features is to instead
> add them directly to the various servers on top of Lucene, since they
> all already have nice schema APIs.
> I opened LUCENE-5989 to try to do at least a baby step towards making it
> easier to use auto-prefix terms, so you can easily add singleton
> binary tokens, but even that has proven controversial.
> Net/net I think we have to solve the root cause of this by fixing the
> Document/Field/FieldType API so that new index-level features can have
> a usable API, properly defaulted for the right types of fields.
> Towards that, I'm exploring a replacement for
> Document/Field/FieldType.  The idea is to expose simple methods on the
> document class (no more separate Field and FieldType classes):
> {noformat}
>     doc.addLargeText("body", "some text");
>     doc.addShortText("title", "a title");
>     doc.addAtom("id", "29jafnn");
>     doc.addBinary("bytes", new byte[7]);
>     doc.addNumber("number", 17);
> {noformat}
> And then expose a separate FieldTypes class, that you pass to ctor of
> the new document class, which lets you set all the various per-field
> settings (stored, doc values, etc.).  E.g.:
> {noformat}
>     types.enableStored("id");
> {noformat}
> FieldTypes is a write-once schema, and it throws exceptions if you try
> to make invalid changes once a given setting is already written
> (e.g. enabling norms after having disabled them).  It will (I haven't
> implemented this yet) save its state into IndexWriter's commitData, so
> it's available when you open a new IndexWriter for append and when you
> open a reader.
> It has methods to set all the per-field settings (analyzer, stored,
> term vectors, norms, index options, doc values type), and chooses
> "reasonable" defaults based on the value's type when it suddenly sees
> a new field.  For example, when you add a number, it's indexed for
> range querying and sorting (numeric doc values) by default.
> FieldTypes provides the analyzer and codec (a little messy) that you
> pass to IndexWriterConfig.  Since it's effectively a persistent
> schema, it knows all about the available fields at search time, so we
> could use it to create queries (checking if they are valid given that
> field's type).  Query parsers and highlighters could consult it.
> Default UIs (above Lucene) could use it, etc.  This is all future ... I
> think for this issue the goal should be to "just" provide a "better"
> index-time API but not yet make use of it at search time.
> So with this change, for auto-prefix terms, we could add an "enable
> range queries/filters" option, but then validate that the selected
> postings format supports such an option.
> I know this exploration will be horribly controversial, but
> realistically I don't think Lucene can move on much further if we
> can't finally address this schema problem head on.
> This is long overdue.
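
The write-once rule from the description (enabling norms after having disabled them is an error) could look roughly like this. It is a hypothetical sketch, not the code on the branch: WriteOnceNorms and its methods are invented names, and it assumes a setting counts as "written" as soon as it is set, whereas the issue ties that to what has already been written to the index.

```java
import java.util.HashMap;
import java.util.Map;

// Toy illustration only: hypothetical names, not the FieldTypes class on the branch.
final class WriteOnceNorms {
    // Field name -> current norms setting; absent means "not decided yet".
    private final Map<String, Boolean> normsEnabled = new HashMap<>();

    // Disabling is always accepted in this sketch.
    void disableNorms(String field) {
        normsEnabled.put(field, false);
    }

    // Enabling is rejected once norms were disabled for the field,
    // mirroring the example given in the issue description.
    void enableNorms(String field) {
        if (Boolean.FALSE.equals(normsEnabled.get(field))) {
            throw new IllegalStateException(
                "field \"" + field + "\": cannot enable norms after they were disabled");
        }
        normsEnabled.put(field, true);
    }

    // Returns null while the field has no norms setting yet.
    Boolean norms(String field) {
        return normsEnabled.get(field);
    }
}
```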



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

