Re: Lucene sorting case-sensitive by default?

Erick Erickson Mon, 14 Jan 2008 09:35:13 -0800

Sorry, I was confused about this for the longest time (and it shows!). You
don't actually have to store two separate fields. Field.Store.YES stores
the input exactly as is, without passing it through anything. So you
really only have to store your field. I still think of it conceptually as
two
entirely different things, but it's not.


This code:
   public static void main(String[] args) throws Exception
    {
        try {
            RAMDirectory dir = new RAMDirectory();
            IndexWriter iw = new IndexWriter(
                        dir,
                        new StandardAnalyzer(Collections.emptySet()),
                        true);

            Document doc = new Document();

            doc.add(
                    new Field(
                            "f",
                            "This is Some Mixed, case Junk($*%& With Ugly
SYmbols",
                            Field.Store.YES,
                            Field.Index.TOKENIZED));
            iw.addDocument(doc);
            iw.close();
            IndexReader ir =  IndexReader.open(dir);
            Document d = ir.document(0);
            System.out.println(d.get("f"));
        } catch (Exception e) {
            e.printStackTrace();
        }

        System.out.println("done");
    }

prints "This is Some Mixed, case Junk($*%& With Ugly SYmbols"
yet still finds the document with a search for "junk" using
StandardAnalyzer.

Sorry for the confusion!
Erick

On Jan 14, 2008 11:48 AM, Alex Wang <[EMAIL PROTECTED]> wrote:

> Thanks a lot Erik for the great tip! I do need to display all the fields
> and allow the users to sort by each field as they wish. My index is
> currently about 200 mb.
>
> Your suggestion about storing (but not index) the cased version, and
> indexing (but not store) the lower-case version is an excellent solution
> for me.
>
> Is it possible to do it in the same field or do I have to do it in 2
> separate fields? If I do it in one field, what are the Lucene
> class/methods I need to overwrite?
>
> Thanks again for your help!
>
> Alex
>
>
> This message may contain confidential and/or privileged information. If
> you are not the addressee or authorized to receive this for the
> addressee, you must not use, copy, disclose, or take any action based on
> this message or any information herein. If you have received this
> message in error, please advise the sender immediately by reply e-mail
> and delete this message. Thank you for your cooperation.
>
>
> -----Original Message-----
> From: Erick Erickson [mailto:[EMAIL PROTECTED]
> Sent: Monday, January 14, 2008 11:24 AM
> To: java-user@lucene.apache.org
> Subject: Re: Lucene sorting case-sensitive by default?
>
> Several things:
>
> 1> do you need to display all the fields? Would just storing them
> lower-case work? The only time I've needed to store fields case-
> sensitive is when I'm showing them to the user. If the user is just
> searching on them, I can store them any way I want and she'll never
> know.
>
> 2> You might very well be surprised at how little extra it takes to
> index (but not store) the lower-case version. How big is your index
> anyway? And be warned that the size increase is not linear, so
> just comparing the index sizes for, say, 10 document is misleading.
> If your index is 10M, there's no reason at all not to store twice. If
> it's
> 10G........
>
> 3> You could store (but not index) the cased version. You could
> index (but not store) the lower-case version. The total size of
> your index is (I believe) about the same as indexing AND storing
> the fields. That gives you a way to search caselessly and display
> case-sensitively.
>
> Best
> Erick
>
> On Jan 14, 2008 10:58 AM, Alex Wang <[EMAIL PROTECTED]> wrote:
>
> > Thanks everyone for your replies! Guess I did not fully understand the
> > meaning of "natural order" in the Lucene Java doc.
> >
> > To add another all-lower-case field for each sortable field in my
> index
> > is a little too much, since the app requires sorting on pretty much
> all
> > fields (over 100).
> >
> > Toke, you mentioned "Using a Collator works but does take a fair
> amount
> > of memory", can you please elaborate a little more on that. Thanks.
> >
> > Alex
> >
> > -----Original Message-----
> > From: Toke Eskildsen [mailto:[EMAIL PROTECTED]
> > Sent: Monday, January 14, 2008 3:13 AM
> > To: java-user@lucene.apache.org
> > Subject: Re: Lucene sorting case-sensitive by default?
> >
> > On Fri, 2008-01-11 at 11:40 -0500, Alex Wang wrote:
> > > Looks like Lucene is separating upper case and lower case while
> > sorting.
> >
> > As Tom points out, default sorting uses natural order. It's worth
> noting
> > that this implies that default sorting does not produce usable results
> > as soon as you use non-ASCII characters. Using a Collator works but
> does
> > take a fair amount of memory.
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> >
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>

Re: Lucene sorting case-sensitive by default?

Reply via email to