Re: Lucene does NOT use UTF-8.

2005-08-27 Thread jian chen
Hi, Ken,

Thanks for your email. You are right; I meant to propose that Lucene
switch to true UTF-8, rather than having to work around this issue by
fixing the resulting problems elsewhere.

Also, conforming to standards like UTF-8 will make the code easier for new 
developers to pick up.

Just my 2 cents.

Thanks,

Jian

On 8/27/05, Ken Krugler <[EMAIL PROTECTED]> wrote:
> 
> >On Aug 26, 2005, at 10:14 PM, jian chen wrote:
> >
> >>It seems to me that in theory, Lucene storage code could use true UTF-8 
> to
> >>store terms. Maybe it is just a legacy issue that the modified UTF-8 is
> >>used?
> 
> The use of 0xC0 0x80 to encode the U+0000 Unicode code point is an
> aspect of Java serialization of character streams. Java uses what
> they call "a modified version of UTF-8", though that's a really bad
> way to describe it. It's a different Unicode encoding, one that
> resembles UTF-8, but that's it.
> 
> >It's not a matter of a simple switch. The VInt count at the head of
> >a Lucene string is not the number of Unicode code points the string
> >contains. It's the number of Java chars necessary to contain that
> >string. Code points above the BMP require 2 java chars, since they
> >must be represented by surrogate pairs. The same code point must be
> >represented by one character in legal UTF-8.
> >
> >If Plucene counts the number of legal UTF-8 characters and assigns
> >that number as the VInt at the front of a string, when Java Lucene
> >decodes the string it will allocate an array of char which is too
> >small to hold the string.
> 
> I think Jian was proposing that Lucene switch to using a true UTF-8
> encoding, which would make things a bit cleaner. And probably easier
> than changing all references to CEUS-8 :)
> 
> And yes, given that the integer count is the number of UTF-16 code
> units required to represent the string, your code will need to do a
> bit more processing when calculating the character count, but that's
> a one-liner, right?
> 
> -- Ken
> --
> Ken Krugler
> TransPac Software, Inc.
> <http://www.transpac.com>
> +1 530-470-9200
> 
> 
>
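
A rough sketch of the counting difference Ken describes (assuming Java 5's String.codePointCount; on the 1.4-era JDKs of the time you would count surrogate pairs by hand):

    public class CodePointCountSketch {
        public static void main(String[] args) {
            // 'a', a supplementary character (U+1D50A, written as a surrogate pair), 'b'
            String s = "a\uD835\uDD0Ab";
            int utf16Units = s.length();                        // 4: what the VInt currently records
            int codePoints = s.codePointCount(0, s.length());   // 3: what a true UTF-8 count would use
            System.out.println(utf16Units + " UTF-16 units vs " + codePoints + " code points");
        }
    }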


Re: Eliminating norms ... completley

2005-10-07 Thread jian chen
Hi, Chris,

Turning off norms looks like a very interesting problem to me. I remember
that in the Lucene road map for 2.0 there is a requirement to turn off indexing
of some information, such as proximity.

Maybe optionally turning off norms could be an experiment to showcase
how to turn off proximity down the road.

Looking at the Lucene source code, it seems to me that the code could be
further improved, bringing it closer to good OO design. For example,
abstract classes could be changed to interfaces where possible, accessor
methods like getXXX() could be used instead of public member variables, etc.

My hunch is that the changes would add clarity of style to the code and
wouldn't be a real performance drawback.

Just my thoughts. For the sake of backward compatibility, these thoughts may
not be that valuable though.

Cheers,

Jian

On 10/7/05, Chris Hostetter <[EMAIL PROTECTED]> wrote:
>
>
> Yonik and I have been looking at the memory requirements of an application
> we've got. We use a lot of indexed fields, primarily so I can do a lot
> of numeric tests (using RangeFilter). When I say "a lot" I mean
> around 8,000 -- many of which are not used by all documents in the index.
>
> Now there are some basic usage changes I can make to cut this number in
> half, and some more complex biz rule changes I can make to get the number
> down some more (at the expense of flexibility) but even then we'd have
> around 1,000 -- which is still a lot more than the recommended "handful"
>
> After discussing some options, I asked the question "Remind me again why
> having lots of indexed fields makes the memory requirements jump up --
> even if only a few documents use some field?" and Yonik reminded me about
> the norm[] -- an array of bytes representing the field boost + length
> boost for each document. One of these arrays exists for every indexed
> field.
>
> So then I asked the $50,000,000 question: "Is there any way to get rid of
> this array for certain fields? ... or any way to get rid of it completely
> for every field in a specific index?"
>
> This may sound like a silly question for most IR applications where you
> want length normalization to contribute to your scores, but in this
> particular case most of these fields are only used to store a single numeric
> value. To be certain, there are some fields we have (or may add in the
> future) that could benefit from having a norms[] ... but if it had to be
> an all or nothing thing we could certainly live without them.
>
> It seems to me that, in an ideal world, deciding whether or not you wanted
> to store norms for a field would be like deciding whether you wanted to
> store TermVectors for a field. I can imagine a Field.isNormStored()
> method ... but that seems like a pretty significant change to the existing
> code base.
>
>
> Alternately, I started wondering if it would be possible to write our own
> IndexReader/IndexWriter subclasses that would ignore the norm info
> completely (with maybe an optional list of field names the logic should be
> limited to), and return nothing but fixed values for any parts of the code
> base that wanted them. Looking at SegmentReader and MultiReader this
> looked very promising (especially considering the way SegmentReader uses a
> system property to decide which actual class to use). But I was less
> enthusiastic when I started looking at IndexWriter and the DocumentWriter
> classes ... there doesn't seem to be any clean way to subclass the
> existing code base to eliminate the writing of the norms to the Directory
> (curses those final classes, and private final methods).
>
>
> So I'm curious what you guys think...
>
> 1) Regarding the root problem: are there any other things you can think
> of besides norms[] that would contribute to the memory footprint
> needed by a large number of indexed fields?
> 2) Can you think of a clean way for individual applications to eliminate
> norms (via subclassing the lucene code base - ie: no patching)?
> 3) Yonik is currently looking into what kind of patch it would take to
> optionally turn off norms (I'm not sure if he's looking at doing it
> "per field" or "per index"). Is that the kind of thing that would
> even be considered for getting committed?
>
> --
>
> ---
> "Oh, you're a tricky one." Chris M Hostetter
> -- Trisha Weir [EMAIL PROTECTED]
>
>
>
>
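
For a sense of scale, a hedged back-of-the-envelope sketch (not Lucene code; the document count below is an assumption): one norm byte is kept per document per indexed field, so if all those norms end up resident the footprint grows as fields times documents.

    public class NormMemoryEstimate {
        public static void main(String[] args) {
            long indexedFields = 8000;            // the "a lot" case from this thread
            long maxDoc = 1000 * 1000;            // illustrative assumption: 1M documents
            long bytes = indexedFields * maxDoc;  // one byte per doc per indexed field
            System.out.println(bytes / (1024 * 1024) + " MB of norms");  // ~7629 MB
        }
    }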


Re: Adding generic payloads to a Term's posting list

2005-10-10 Thread jian chen
Hi,

I have been studying the Lucene indexing code for a bit. I am not sure if I
understand the problem scope completely, but perhaps the problem could be
solved without storing extra information via TermInfosWriter?

For the example of XML document tag depth, could that be a separate field?
Because a Lucene term is a combination of (field, termText), depth could
be encoded in the field, and even though two XML tags are the same, if their
depths are different they are still treated as separate terms.

This is what I could think of so far.

Jian

On 10/10/05, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
>
> http://wiki.apache.org/jakarta-lucene/Lucene2Whiteboard
>
> See item #11 of API changes. Maybe along the lines of what you are
> interested in, although I don't know if anyone has even attempted a design
> of it. I would also like to see this, plus the ability to store info at
> higher levels in the Index, such as Field (not on a per token basis),
> Document (info about the document that spans its fields) and Index (such as
> coreference information). Alas, no time...
>
> -Grant
>
> >-Original Message-
> >From: Shane O'Sullivan [mailto:[EMAIL PROTECTED]
> >Sent: Monday, October 10, 2005 8:38 AM
> >To: java-dev@lucene.apache.org
> >Subject: Adding generic payloads to a Term's posting list
> >
> >Hi,
> >
> >To the best of my knowledge, it is not possible to add generic
> >data to a Term's posting list.
> >By this I mean info that is defined by the search engine, not
> >Lucene itself.
> >Whereas Lucene adds some data to the posting lists, such as
> >the term's position within a document, there are many other
> >useful types of information that could be attached to a term.
> >
> >Some examples would be in XML documents, to store the depth of
> >a tag in the document, or font information, such as if the
> >term appeared in a header or in the main body of text.
> >
> >Are there any plans to add such functionality to the API? If
>not, where would be the appropriate place to implement these
> >changes? I presume the TermInfosWriter and TermInfosReader
> >would have to be altered, as well as the classes which call
> >them. Could this be done without having to modify the index in
> >such a way that standard Lucene indexes couldn't read it?
> >
> >Thanks
> >
> >Shane
> >
>
>
>
>
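
A hedged sketch of the field-per-depth idea above, using the 1.4-era Field.Keyword factory; the field names are made up for illustration:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    // Sketch only: encode the XML tag depth into the field name so that the
    // same tag at different depths indexes as different terms.
    public class DepthFieldExample {
        public static void main(String[] args) {
            Document doc = new Document();
            // For <book><title>...</title></book>: "book" is at depth 1, "title" at depth 2.
            doc.add(Field.Keyword("tag_depth_1", "book"));
            doc.add(Field.Keyword("tag_depth_2", "title"));
            System.out.println(doc);
        }
    }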


skipInterval

2005-10-15 Thread jian chen
Hi, All,

I was reading some research papers regarding quick inverted index lookups.
The classical approach to skipping dictates that a skip should be positioned
every sqrt(df) document pointers.

I looked at the current Lucene implementation. The skipInterval is
hardcoded as follows in the TermInfosWriter.java class:
int skipInterval = 16;

Therefore, I have two questions:

1) Would it be a good idea and feasible to use sqrt(df) to be the
skipInterval, rather than hardcode it?

2) When merging segments, for every term, the skip table is buffered first
in the RAMOutputStream and then written to the output stream. If there are a
lot of documents for a term, this seems to consume a lot of memory, right?
If instead we use sqrt(df) as the skipInterval, the memory consumed will
be a lot less, since the number of skip entries grows only as sqrt(df)
rather than linearly.

Hope someone can shed more light on this. Thanks in advance,

Jian
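
A minimal sketch of the contrast being proposed (not Lucene's actual TermInfosWriter code; df is whatever document frequency the term has):

    public class SkipIntervalSketch {
        public static void main(String[] args) {
            int df = 1000000;                                    // illustrative document frequency
            int fixedInterval = 16;                              // the current hard-coded value
            int adaptiveInterval = Math.max(1, (int) Math.sqrt(df));
            // number of skip entries buffered while writing this term's postings
            System.out.println("fixed:    " + (df / fixedInterval) + " entries");     // grows linearly with df
            System.out.println("adaptive: " + (df / adaptiveInterval) + " entries");  // grows only as sqrt(df)
        }
    }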


Fwd: skipInterval

2005-10-15 Thread jian chen
Hi, All,

I should have sent this to this email address rather than the old jakarta one.
Sorry if double-posted.

Jian

-- Forwarded message --
From: jian chen <[EMAIL PROTECTED]>
Date: Oct 15, 2005 6:36 PM
Subject: skipInterval
To: Lucene Developers List 

Hi, All,

I was reading some research papers regarding quick inverted index lookups.
The classical approach to skipping dictates that a skip should be positioned
every sqrt(df) document pointers.

I looked at the current Lucene implementation. The skipInterval is
hardcoded as follows in the TermInfosWriter.java class:
int skipInterval = 16;

Therefore, I have two questions:

1) Would it be a good idea and feasible to use sqrt(df) to be the
skipInterval, rather than hardcode it?

2) When merging segments, for every term, the skip table is buffered first
in the RAMOutputStream and then written to the output stream. If there are a
lot of documents for a term, this seems to consume a lot of memory, right?
If instead we use sqrt(df) as the skipInterval, the memory consumed will
be a lot less, since the number of skip entries grows only as sqrt(df)
rather than linearly.

Hope someone can shed more light on this. Thanks in advance,

Jian


Re: Fwd: skipInterval

2005-10-16 Thread jian chen
Hi, Paul,

Thanks for your email. I am not sure how sqrt vs. a constant for
skipInterval will pan out for two or more required terms. That needs
some experiments, I guess.

Cheers,

Jian

On 10/16/05, Paul Elschot <[EMAIL PROTECTED]> wrote:
>
> Jian,


> -- Forwarded message --
> > From: jian chen <[EMAIL PROTECTED]>
> > Date: Oct 15, 2005 6:36 PM
> > Subject: skipInterval
> > To: Lucene Developers List 
> >
> > Hi, All,
> >
> > I was reading some research papers regarding quick inverted index
> lookups.
> > The classical approach to skipping dictates that a skip should be
> positioned
> > every sqrt(df) document pointers.
>
> The typical use of skipping info in Lucene is in ConjunctionScorer, for a
> query with two required terms. There it helps for the case when one
> term occurs much less frequently than another.
> Iirc the sqrt() is optimal for a single lookup in a single level index,
> reducing the cost of a lookup from linear in df to roughly sqrt(df).
> Does the sqrt() also apply in the case of searching for two required terms
> and returning all the documents in which they both occur?
>
> Regards,
> Paul Elschot
>
>
>
>
>


lucene inter-process locking question

2005-11-07 Thread jian chen
Hi, Lucene Developers,
 Just got a question regarding the locking mechanism in Lucene. I see that in
IndexReader, first there is synchronized(directory) to synchronize
multiple threads; then, inside, there is the statement for grabbing the
commit.lock.
 So, my question is, could the multi-thread synchronization also be done with
commit.lock? In other words, I don't understand why synchronized(directory)
is there when commit.lock could handle both intra- and inter-process locking.
 Could anyone enlighten me about it?
 Thanks so much in advance,
 Jian
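
A minimal sketch of the two-level pattern being asked about (illustrative only, not Lucene's actual IndexReader code): the synchronized block only coordinates threads inside this JVM, while the file-based commit.lock also excludes other processes.

    import java.io.IOException;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.Lock;

    public class TwoLevelLockSketch {
        public static void withCommitLock(Directory directory) throws IOException {
            synchronized (directory) {                      // intra-JVM: serializes threads
                Lock commitLock = directory.makeLock("commit.lock");
                if (!commitLock.obtain()) {                 // file-based: also serializes processes
                    throw new IOException("could not obtain commit.lock");
                }
                try {
                    // ... read or update the segments file here ...
                } finally {
                    commitLock.release();
                }
            }
        }
    }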


Fwd: lucene inter-process locking question

2005-11-07 Thread jian chen
Hi,
 I did some research and found an answer from the following url:

http://www.gossamer-threads.com/lists/lucene/java-dev/21808?search_string=synchronized%20directory;#21808
 So, now I understand that it is partly historical.
 Cheers,
 Jian

-- Forwarded message --
From: jian chen <[EMAIL PROTECTED]>
Date: Nov 7, 2005 4:17 PM
Subject: lucene inter-process locking question
To: java-dev@lucene.apache.org

 Hi, Lucene Developers,
 Just got a question regarding the locking mechanism in Lucene. I see that in
IndexReader, first there is synchronized(directory) to synchronize
multiple threads; then, inside, there is the statement for grabbing the
commit.lock.
 So, my question is, could the multi-thread synchronization also be done with
commit.lock? In other words, I don't understand why synchronized(directory)
is there when commit.lock could handle both intra- and inter-process locking.
 Could anyone enlighten me about it?
 Thanks so much in advance,
 Jian


Re: Lucene Index backboned by DB

2005-11-15 Thread jian chen
Dear All,

I have some thoughts on this issue as well.

1) It might be OK to implement retrieving field values separately for a
document. However, I think from a simplicity point of view, it might be
better to have the application code do this drudgery. Adding this feature
could complicate the nice and simple design of Lucene without much benefit.

2) The application could separate a document into several documents, for
example, one document mainly for indexing and the other documents for storing
binary values for different fields. Thus, given the relevant doc id, its
associated binary value for a particular field could be loaded very fast
with just a disk lookup (looking up the .fdx file).

This way, only the relevant field is loaded into memory rather than all of
the fields for a doc. There is no change on the Lucene side, only some more
work for the application code.

My view is that a search library (or, in general, any library) should be small
and efficient. Since it is used by lots of applications, any additional feature
could potentially impact its robustness and performance.

Any critiques or comments are welcome.

Jian

On 11/15/05, Robert Kirchgessner <[EMAIL PROTECTED]> wrote:
>
> Hi,
>
> a discussion in
>
> http://issues.apache.org/jira/browse/LUCENE-196
>
> might be of interest to you.
>
> Did you think about storing the large pieces of documents
> in a database to reduce the size of the Lucene index?
>
> I think there are good reasons for adding support for
> storing fields in separate files:
>
> 1. One could define a binary field of fixed length and store it
> in a separate file. Then load it into memory and have fast
> access to field contents.
>
> A use case might be: store a calendar date (YYYY-MM-DD)
> in three bytes, 4 bits for months, 5 bits for days and up to
> 15 bits for years. If you want to retrieve hits sorted by date
> you can load the fields file of size (3 * documents in index) bytes
> and support sorting by date without accessing the hard drive
> to read dates.
>
> 2. One could store document contents in a separate
> file and fields of small size like title and some metadata
> in the way it is stored now. It could speed up access to
> fields. It would be interesting to know whether you gain
> significant performance leaving the big chunks out, i.e.
> not storing them in index.
>
> In my opinion 1. is the most interesting case: storing some
> binary fields (dates, prices, length, any numeric metrics of
> documents) would enable *really* fast sorting of hits.
>
> Any thoughts about this?
>
> Regards,
>
> Robert
>
>
>
> We have a similar problem
>
> On Tuesday, 15 November 2005 at 23:23, Karel Tejnora wrote:
> > Hi all,
> > In our testing application we are using Lucene 1.4.3. Thank you guys for
> > the great job.
> > We have an index file around 12 GiB, one file (merged). Retrieving hits
> > takes a nicely small amount of time, but reading the stored fields takes
> > 10-100 times more. I think that is because all the fields are read.
> > I would like to try implementing the Lucene index files as tables in a
> > database with some lazy field loading. As I searched the web I found only
> > an implementation of store.Directory (bdb), but it only holds data as
> > binary streams. That technique would not be so helpful because BLOB
> > operations are not fast. On the other side I would lose some freedom
> > regarding document field variability, but I could omit a lot of the
> > skipping and many open files. Also, IndexWriter could have document/term
> > locking granularity.
> > So I think that way leads to extending IndexWriter/IndexReader and having
> > my own implementation of the index.Segment* classes. Is that the best way,
> > or am I missing something about how to achieve this?
> > If it is a bad idea, I will be happy to hear other possibilities.
> >
> > I would also like to join development of Lucene. Are there some pointers
> > on how to start?
> >
> > Thanks for reading this,
> > and sorry if I made some mistakes.
> >
> > Karel
> >
> >
>
>
>
>
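
A hedged sketch of the three-byte date packing Robert describes (4 bits for month, 5 for day, the remaining bits for year); the class and method names are made up:

    public class PackedDate {
        // year in bits 9..23, month in bits 5..8, day in bits 0..4 (24 bits total)
        static int pack(int year, int month, int day) {
            return (year << 9) | (month << 5) | day;
        }
        static byte[] toBytes(int packed) {
            return new byte[] { (byte) (packed >>> 16), (byte) (packed >>> 8), (byte) packed };
        }
        static int fromBytes(byte[] b) {
            return ((b[0] & 0xFF) << 16) | ((b[1] & 0xFF) << 8) | (b[2] & 0xFF);
        }
        public static void main(String[] args) {
            int packed = fromBytes(toBytes(pack(2005, 11, 15)));
            System.out.println("year=" + (packed >>> 9)
                    + " month=" + ((packed >>> 5) & 0xF)
                    + " day=" + (packed & 0x1F));   // prints year=2005 month=11 day=15
        }
    }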


Re: DbDirectory with Berkeley DB Java Edition

2005-12-14 Thread jian chen
Hi,

I am pretty pessimistic about any DB directory implementation for Lucene.
The nature of the Lucene index files does not really fit well into a
relational database. Therefore, performance-wise, the DB implementations
would suffer a lot. Basically, I would discourage anyone from pursuing a DB
implementation.

2 cents,

Jian

On 12/14/05, Steeley Andrew <[EMAIL PROTECTED]> wrote:
>
> Hello all,
>
> I know this question has been asked before, but I have not found a
> definitive answer.  Has anyone implemented a version of Andi Vajda's
> DbDirectory that works with the Berkeley DB Java Edition?  I am aware of
> Oscar Picasso's postings on this list on the topic, but haven't been able to
> contact Mr. Picasso or find his contribution as of yet.  Anyone else?
>
> Thanks much,
> Andy
>
>


Re: Filter

2006-03-10 Thread jian chen
Hi, All,

For the filter issue, my idea is to completely get rid of the Filter
interface. Can we not use the HitCollector and have it do the filtering
work?

I am in the process of writing a simpler engine based on the Lucene source
code. I don't mind re-inventing the wheel, as I feel frustrated with the
relationships among Query, Searcher, Scorer, etc.

I have done the initial round of my code already and it is in production and
works great. Basically, the search interfaces will be like the following:

public interface IndexSearcher
{
  public void search(HitCollector hc, int maxDocNum) throws IOException;
}

public interface HitCollector
{
  public void collect(int doc, float score);

  // the total hits that meet the search criteria,
  // could be way bigger than the actual ScoreDocs
  // we record in the HitQueue
  public int getTotalHits();

  public ScoreDoc[] getScoreDocs();

  public int getNumScoreDocs();

  // max number of ScoreDocs this hit collector could hold
  public int getCapacity();
}
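
A minimal sketch of the "filter inside the collector" idea (an illustration only, built on the two interfaces above; the BitSet stands in for whatever a Filter would have produced):

    import java.util.BitSet;

    public class FilteringHitCollector implements HitCollector {
        private final HitCollector delegate;
        private final BitSet allowedDocs;   // docs that pass the "filter"
        private int totalHits = 0;

        public FilteringHitCollector(HitCollector delegate, BitSet allowedDocs) {
            this.delegate = delegate;
            this.allowedDocs = allowedDocs;
        }

        public void collect(int doc, float score) {
            if (allowedDocs.get(doc)) {     // silently drop filtered-out documents
                totalHits++;
                delegate.collect(doc, score);
            }
        }

        public int getTotalHits()        { return totalHits; }
        public ScoreDoc[] getScoreDocs() { return delegate.getScoreDocs(); }
        public int getNumScoreDocs()     { return delegate.getNumScoreDocs(); }
        public int getCapacity()         { return delegate.getCapacity(); }
    }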

I have refactored the Scorers in Lucene to be just Searchers. Because the
scorer is the actual searcher that does the ranking, right?

I will publish my code using an open source license this year.

Cheers,

Jian Chen
Lead Developer, Seattle Lighting

On 3/10/06, eks dev <[EMAIL PROTECTED]> wrote:
>
> It looks to me like everybody agrees here, no? If yes, it
> would be really useful if somebody with commit rights
> could add 1) and 2) to the trunk (these patches
> practically already exist).
>
> It is not an invasive change and there are no problems
> with compatibility. Also, I have noticed a lot of
> people trying to "hack in" better Filter support using
> Paul's patches from JIRA.
>
> That would open a window for some smart code to get
> committed into the Lucene core.
>
> Just have a look at the filtering support in Solr:
> beautiful, but unfortunately also "hacked" just to
> overcome the BitSet on Filter.
>
>
>
>
> --- Chris Hostetter <[EMAIL PROTECTED]> wrote:
>
> >   1) commit the DocNrSkipper interface to the core code base
> >   2) add the following method declaration to the Filter class...
> >        public DocNrSkipper getSkipper(IndexReader) throws IOException
> >      ...implement this method by calling bits, and returning an instance
> >      of BitSetSortedIntList
> >   3) indicate that Filter.bits() is deprecated.
> >   4) change all existing calls to Filter.bits() in the core lucene code
> >      base to call Filter.getSkipper and do whatever iterating is
> >      necessary.
> >   5) gradually reimplement all of the concrete instances of Filter in
> >      the core lucene code base so they override the getSkipper method
> >      with something that returns a more "iterator" style DocNrSkipper,
> >      and implement their bits() method to use the DocNrSkipper from the
> >      new getSkipper() method to build up the bit set if clients call it
> >      directly.
> >   6) wait a suitable amount of time.
> >   7) remove Filter.bits() and all of the concrete implementations from
> >      the lucene core.
> >
> >
> >
> >
> > -Hoss
> >
> >
> >
>
>
>
>
>
>


Re: this == that

2006-05-01 Thread jian chen

I am wondering if interning Strings is really that critical for
performance. The biggest bottleneck is still the disk. So maybe we can use
String.equals(...) instead of ==.

Jian

On 5/1/06, DM Smith <[EMAIL PROTECTED]> wrote:


karl wettin wrote:
> The code is filled with string equality code using == rather than
> equals(). I honestly don't think it saves a single clock tick as the
> JIT takes care of it when the first line of code in the equals method
> is if (this == that) return true;
If the strings are interned then it should be a touch faster.
If the strings are not interned then I think it may be a premature
optimization.

IMHO, using intern to optimize space is a reasonable optimization, but
using == to compare such strings is error prone as it is possible that
the comparison is looking at strings that have not been interned.

Unless object identity is what is being tested or intern is an
invariant, I think it is dangerous. It is easy to forget to intern or to
propagate the pattern via cut and paste to an inappropriate context.
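
A quick standalone illustration of the pitfall described above (plain JDK behavior, nothing Lucene-specific):

    public class InternPitfall {
        public static void main(String[] args) {
            String a = "lucene";
            String b = new String("lucene");          // equal contents, different object
            System.out.println(a == b);               // false: b was never interned
            System.out.println(a.equals(b));          // true
            System.out.println(a == b.intern());      // true: intern() returns the canonical instance
        }
    }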
>
> Please correct me if I'm wrong.
>
> I can commit to do the changes to the core code if it is considered
> interesting.





storing term text internally as byte array and bytecount as prefix, etc.

2006-05-01 Thread jian chen

Hi, All,

Recently I have been following the whole discussion on storing
text/strings as standard UTF-8 and how to achieve that in Lucene.

If we are storing the term text and the field strings as UTF-8 bytes, I now
understand that it is a tricky issue because of the performance problem we
are still facing when converting back and forth between UTF-8 bytes and
Java Strings. This especially seems to be a problem for the segment merger
routine, which loads the segment term enums and converts the UTF-8 bytes
back to Strings during the merge operation.

Just a thought here, could we always represent the term text as UTF-8 bytes
internally? So Term.java will have the private member variable:

private byte[] utf8bytes;

instead of

private String text;

Plus, a Term object could be constructed either from a String or from a UTF-8
byte array.

This way, for indexing new documents, new Term(String text) is called
and utf8bytes is obtained from the input term text. For a segment term
info merge, the utf8bytes would be loaded from the Lucene index, which
would already store the term text as UTF-8 bytes. Therefore, no conversion
is needed.

I hope I explained my thoughts. Make sense?

Cheers,

Jian Chen
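
A hedged sketch of what such a class might look like (illustrative only; this is not Lucene's actual Term, and the name Utf8Term is made up):

    import java.io.UnsupportedEncodingException;

    public final class Utf8Term {
        private final String field;
        private final byte[] utf8Bytes;   // canonical representation of the term text

        // Used when indexing new documents: convert the incoming String once.
        public Utf8Term(String field, String text) {
            this.field = field;
            try {
                this.utf8Bytes = text.getBytes("UTF-8");
            } catch (UnsupportedEncodingException e) {
                throw new RuntimeException(e);   // UTF-8 is always supported
            }
        }

        // Used when merging segments: the bytes come straight from the index,
        // so no String conversion is needed at all.
        public Utf8Term(String field, byte[] utf8Bytes) {
            this.field = field;
            this.utf8Bytes = utf8Bytes;
        }

        public String field() { return field; }
        public byte[] bytes() { return utf8Bytes; }
    }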


Re: storing term text internally as byte array and bytecount as prefix, etc.

2006-05-01 Thread jian chen

Hi, Marvin,

Thanks for your quick response. I am in the camp of fearless refactoring,
even at the expense of breaking compatibility with previous releases. ;-)

Compatibility aside, I am trying to identify if changing the implementation
of Term is the right way to go for this problem.

If it is, I think it would be worthwhile rather than putting band-aid on the
existing API.

Cheers,

Jian

Changing the implementation of Term
would have a very broad impact; I'd look for other ways to go about
it first.  But I'm not an expert on SegmentMerger, as KinoSearch
doesn't use the same technique for merging.

My plan was to first submit a patch that made the change to the file
format but didn't touch SegmentMerger, then attack SegmentMerger and
also see if other developers could suggest optimizations.

However, I have an awful lot on my plate right now, and I basically
get paid to do KinoSearch-related work, but not Lucene-related work.
It's hard for me to break out the time to do the java coding,
especially since I don't have that much experience with java and I'm
slow.  I'm not sure how soon I'll be able to get back to those
bytecount patches.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



Re: storing term text internally as byte array and bytecount as prefix, etc.

2006-05-01 Thread jian chen

Hi, Chuck,

Using standard UTF-8 is very important for the Lucene index so that any
program can read the index easily, be it written in Perl, C/C++ or any new
future programming language.

It is like storing data in a database for a web application. You want to store
it in such a way that programs other than the web app can manipulate it
easily, because there will be cases where you want to mass-update or
mass-change the data, and you don't want to have to write web apps just for
doing that, right?

Cheers,

Jian


On 5/1/06, Chuck Williams <[EMAIL PROTECTED]> wrote:


Could someone summarize succinctly why it is considered a major issue
that Lucene uses the Java modified UTF-8 encoding within its index
rather than the standard UTF-8 encoding.  Is the only concern
compatibility with index formats in other Lucene variants?  The API to
the values is a String, which uses Java's char representation, so I'm
confused why the encoding in the index is so important.

One possible benefit of a standard UTF-8 index encoding would be
streaming content into and out of the index with no copying or
conversions.  This relates to the lazy field loading mechanism.

Thanks for any clarification,

Chuck


jian chen wrote on 05/01/2006 04:24 PM:
> Hi, Marvin,
>
> Thanks for your quick response. I am in the camp of fearless
refactoring,
> even at the expense of breaking compatibility with previous releases.
;-)
>
> Compatibility aside, I am trying to identify if changing the
> implementation
> of Term is the right way to go for this problem.
>
> If it is, I think it would be worthwhile rather than putting band-aid
> on the
> existing API.
>
> Cheers,
>
> Jian
>
> Changing the implementation of Term
>> would have a very broad impact; I'd look for other ways to go about
>> it first.  But I'm not an expert on SegmentMerger, as KinoSearch
>> doesn't use the same technique for merging.
>>
>> My plan was to first submit a patch that made the change to the file
>> format but didn't touch SegmentMerger, then attack SegmentMerger and
>> also see if other developers could suggest optimizations.
>>
>> However, I have an awful lot on my plate right now, and I basically
>> get paid to do KinoSearch-related work, but not Lucene-related work.
>> It's hard for me to break out the time to do the java coding,
>> especially since I don't have that much experience with java and I'm
>> slow.  I'm not sure how soon I'll be able to get back to those
>> bytecount patches.
>>
>> Marvin Humphrey
>> Rectangular Research
>> http://www.rectangular.com/
>>
>






Re: storing term text internally as byte array and bytecount as prefix, etc.

2006-05-01 Thread jian chen

Plus, as open source and open standard advocates, we don't want to be like
Micros$ft, who claims to use industry-"standard" XML as the next-generation
Word file format. However, it is very hard to write your own Word reader,
because their file format is proprietary and hard to write programs for.

Jian

On 5/1/06, jian chen <[EMAIL PROTECTED]> wrote:


Hi, Chuck,

Using standard UTF-8 is very important for Lucene index so any program
could read the Lucene index easily, be it written in perl, c/c++ or any new
future programming languages.

It is like storing data in a database for web application. You want to
store it in such a way that other programs can manipulate easily other than
only the web app program. Because there will be cases that you want to mass
update or mass change the data, and you don't want to write only web apps
for doing it, right?

Cheers,

Jian



On 5/1/06, Chuck Williams <[EMAIL PROTECTED]> wrote:
>
> Could someone summarize succinctly why it is considered a major issue
> that Lucene uses the Java modified UTF-8 encoding within its index
> rather than the standard UTF-8 encoding.  Is the only concern
> compatibility with index formats in other Lucene variants?  The API to
> the values is a String, which uses Java's char representation, so I'm
> confused why the encoding in the index is so important.
>
> One possible benefit of a standard UTF-8 index encoding would be
> streaming content into and out of the index with no copying or
> conversions.  This relates to the lazy field loading mechanism.
>
> Thanks for any clarification,
>
> Chuck
>
>
> jian chen wrote on 05/01/2006 04:24 PM:
> > Hi, Marvin,
> >
> > Thanks for your quick response. I am in the camp of fearless
> refactoring,
> > even at the expense of breaking compatibility with previous releases.
> ;-)
> >
> > Compatibility aside, I am trying to identify if changing the
> > implementation
> > of Term is the right way to go for this problem.
> >
> > If it is, I think it would be worthwhile rather than putting band-aid
> > on the
> > existing API.
> >
> > Cheers,
> >
> > Jian
> >
> > Changing the implementation of Term
> >> would have a very broad impact; I'd look for other ways to go about
> >> it first.  But I'm not an expert on SegmentMerger, as KinoSearch
> >> doesn't use the same technique for merging.
> >>
> >> My plan was to first submit a patch that made the change to the file
> >> format but didn't touch SegmentMerger, then attack SegmentMerger and
> >> also see if other developers could suggest optimizations.
> >>
> >> However, I have an awful lot on my plate right now, and I basically
> >> get paid to do KinoSearch-related work, but not Lucene-related work.
> >> It's hard for me to break out the time to do the java coding,
> >> especially since I don't have that much experience with java and I'm
> >> slow.  I'm not sure how soon I'll be able to get back to those
> >> bytecount patches.
> >>
> >> Marvin Humphrey
> >> Rectangular Research
> >> http://www.rectangular.com/
> >>
> >
>
>
>
>



Re: storing term text internally as byte array and bytecount as prefix, etc.

2006-05-02 Thread jian chen

Hi, Doug,

I totally agree with what you said. Yeah, I think it is more of a file
format issue, less of an API issue. It seems that we just need to add an
extra constructor to Term.java to take in a UTF-8 byte array.

Lucene 2.0 is going to break backward compatibility anyway, right? So
maybe this change to standard UTF-8 could be a hot item on the Lucene 2.0
list?

Cheers,

Jian Chen

On 5/2/06, Doug Cutting <[EMAIL PROTECTED]> wrote:


Chuck Williams wrote:
> For lazy fields, there would be a substantial benefit to having the
> count on a String be an encoded byte count rather than a Java char
> count, but this has the same problem.  If there is a way to beat this
> problem, then I'd start arguing for a byte count.

I think the way to beat it is to keep things as bytes as long as
possible.  For example, each term in a Query needs to be converted from
String to byte[], but after that all search computation could happen
comparing byte arrays.  (Note that lexicographic comparisons of UTF-8
encoded bytes give the same results as lexicographic comparisons of
Unicode character strings.)  And, when indexing, each Token would need
to be converted from String to byte[] just once.

The Java API can easily be made back-compatible.  The harder part would
be making the file format back-compatible.

Doug
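
A small sketch of the "keep things as bytes" comparison Doug describes (an illustration, not Lucene code): unsigned byte-wise comparison of standard UTF-8 matches Unicode code point order.

    public class Utf8Compare {
        // Lexicographic comparison of two UTF-8 byte arrays, treating bytes as unsigned.
        static int compareUtf8(byte[] a, byte[] b) {
            int len = Math.min(a.length, b.length);
            for (int i = 0; i < len; i++) {
                int x = a[i] & 0xFF;   // unsigned value
                int y = b[i] & 0xFF;
                if (x != y) return x - y;
            }
            return a.length - b.length;   // shorter prefix sorts first
        }
    }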





Re: when was the document number initially written into .frq file?

2006-05-08 Thread jian chen

It is in the DocumentWriter.java class.

Look at the writePostings(...) method.

Here are the lines:

    // add an entry to the freq file
    int f = posting.freq;
    if (f == 1)             // optimize freq=1
      freq.writeVInt(1);    // set low bit of doc num.
    else {
      freq.writeVInt(0);    // the document number
      freq.writeVInt(f);    // frequency in doc
    }

Any other question?

Jian
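
For reference, a small worked example (with made-up numbers) of the delta-plus-low-bit encoding used by appendPostings in the quoted code below:

    // doc numbers 5, 9, 12 with freqs 1, 3, 1 end up in the .frq file as:
    //   doc 5:  docCode = (5 - 0)  << 1 | 1  ->  writeVInt(11)                (freq == 1 folded into the low bit)
    //   doc 9:  docCode = (9 - 5)  << 1      ->  writeVInt(8), writeVInt(3)   (low bit clear, so freq follows)
    //   doc 12: docCode = (12 - 9) << 1 | 1  ->  writeVInt(7)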

On 5/6/06, Charlie <[EMAIL PROTECTED]> wrote:


Hello,

Would any developer please give me a hint of when the document number was
initially written into .frq file?

From: // it does not really write the doc# in writePostings()

final class DocumentWriter
  private final void writePostings(Posting[] postings, String segment)
    int postingFreq = posting.freq;
    if (postingFreq == 1)                 // optimize freq=1
      freq.writeVInt(1);                  // set low bit of doc num.
    else {
      freq.writeVInt(0);                  // the document number
      freq.writeVInt(postingFreq);        // frequency in doc
    }

// it does write the doc# in appendPostings()

final class SegmentMerger
  private final int appendPostings(SegmentMergeInfo[] smis, int n)

    int docCode = (doc - lastDoc) << 1;     // use low bit to flag freq=1
    lastDoc = doc;

    int freq = postings.freq();
    if (freq == 1) {
      freqOutput.writeVInt(docCode | 1);    // write doc & freq=1
    } else {
      freqOutput.writeVInt(docCode);        // write doc
      freqOutput.writeVInt(freq);           // write frequency in doc
    }

// but then I am further confused: in order to call
// int doc = postings.doc(); in appendPostings(),
// the doc# should already have been written.

Chicken-egg-chicken-egg ...

Should there be another place for the initial writing of the doc# ?

--
Thanks for your advice,
Charlie






Re: when was the document number initially written into .frq file?

2006-05-08 Thread jian chen

Looking at your email again:

You are confusing the initial writing of postings with the segment merging.

DocumentWriter indexes each document into its own single-document segment, so
within that segment the document number is always 0; that is why writePostings
can write the literal 0 (or fold freq=1 into the low bit) instead of a real
doc number. SegmentMerger then renumbers documents as it merges segments.

Once the doc number is written, the .frq file is not changed. The segment
merge process writes to a new .frq file.

Make sense?

Jian

On 5/8/06, jian chen <[EMAIL PROTECTED]> wrote:


It is in DocumentWriter.java class.

Look at writePostings(...) method.

Here are the lines:

    // add an entry to the freq file
    int f = posting.freq;
    if (f == 1)             // optimize freq=1
      freq.writeVInt(1);    // set low bit of doc num.
    else {
      freq.writeVInt(0);    // the document number
      freq.writeVInt(f);    // frequency in doc
    }

Any other question?

Jian


On 5/6/06, Charlie <[EMAIL PROTECTED]> wrote:
>
> Hello,
>
> Would any developer please give me a hint of when the document number
> was
> initially written into .frq file?
>
> From: // it does not really write the doc# in writePostings()
>
> final class DocumentWriter
>   private final void writePostings(Posting[] postings, String segment)
>     int postingFreq = posting.freq;
>     if (postingFreq == 1)                 // optimize freq=1
>       freq.writeVInt(1);                  // set low bit of doc num.
>     else {
>       freq.writeVInt(0);                  // the document number
>       freq.writeVInt(postingFreq);        // frequency in doc
>     }
>
> // it does write the doc# in appendPostings()
>
> final class SegmentMerger
>   private final int appendPostings(SegmentMergeInfo[] smis, int n)
>
>     int docCode = (doc - lastDoc) << 1;     // use low bit to flag freq=1
>     lastDoc = doc;
>
>     int freq = postings.freq();
>     if (freq == 1) {
>       freqOutput.writeVInt(docCode | 1);    // write doc & freq=1
>     } else {
>       freqOutput.writeVInt(docCode);        // write doc
>       freqOutput.writeVInt(freq);           // write frequency in doc
>     }
>
> // but then I am further confused: in order to call
> // int doc = postings.doc(); in appendPostings(),
> // the doc# should already have been written.
>
> Chicken-egg-chicken-egg ...
>
> Should there be another place for the initial writing of the doc# ?
>
> --
> Thanks for your advice,
> Charlie
>
>
>
>



Re: How To find which field has the search term in Hit?

2006-05-24 Thread jian chen

You can store the field values and then load them to do a
real-time comparison. Simple solution...

Jian

On 5/24/06, N <[EMAIL PROTECTED]> wrote:


Hi

I am searching on multiple fields. Is it possible to retrieve the field(s)
which contain the search terms from the documents returned as Hits?

Best
Noon





Re: How To find which field has the search term in Hit?

2006-05-29 Thread jian chen

Hi, Noon,

Sorry, I did not initially understand the detailed problem you have.

This sounds like a prefix match problem. You can create an index for each
field and then do a prefix match on these fields.

By the way, I think your question would be better served by posting to the
Lucene user list.

Cheers,

Jian
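
A hedged sketch of the "store the values and check at result time" idea, using the FirstName/LastName/Company example quoted below (field names are taken from that example; the rest is illustration only):

    import java.io.IOException;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.search.Hits;

    public class MatchedFieldCheck {
        // After running the "Mar*" query, inspect each hit's stored fields directly.
        static void printMatchedFields(Hits hits, String prefix) throws IOException {
            String[] fieldsToCheck = { "FirstName", "LastName", "Company" };
            for (int i = 0; i < hits.length(); i++) {
                Document doc = hits.doc(i);
                for (int f = 0; f < fieldsToCheck.length; f++) {
                    String value = doc.get(fieldsToCheck[f]);   // requires the field to be stored
                    if (value != null && value.startsWith(prefix)) {
                        System.out.println("hit " + i + " matched on " + fieldsToCheck[f]);
                    }
                }
            }
        }
    }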

On 5/29/06, N <[EMAIL PROTECTED]> wrote:


Thanks for the reply, but I couldn't get your point. Could you elaborate
further?
For instance we have

FirstName (= Martin), LastName (= Spaniol), Company (= Mark Co.) and we
search for "Mar*", which will be found in FirstName and Company. So how
can I retrieve the information that it is found only in the FirstName and
Company fields?

Best
Noon.

jian chen <[EMAIL PROTECTED]> wrote: You can store the field values
and then load them to do a
real-time comparison. Simple solution...

Jian

On 5/24/06, N  wrote:
>
> Hi
>
> I am searching on multiple fields. Is it possible to retrieve the field
> (s) which contains the search terms from the documents returned as Hits.
>
> Best
> Noon
>
>
>






Kudo to the wonderful Lucene search library

2006-06-02 Thread jian chen

Hi, All,

Our site, www.destinationlighting.com, went live yesterday. It is powered by
the Lucene search engine and the Velocity template engine. It will be the
best and most comprehensive online store for lighting fixtures and related
hardware.

Many thanks to the Lucene developers and the open source community.

Jian Chen
Lead Developer
www.destinationlighting.com


Re: Using Database instead of File system

2006-09-25 Thread jian chen

For a real search engine, performance is the most important factor. I think
a file-system-based index is better than storing the indexes in a database
because of the pure speed you get.

Cheers,

Jian

On 9/25/06, Simon Willnauer <[EMAIL PROTECTED]> wrote:


Have a look at the compass framework
http://www.opensymphony.com/compass/

Compass also provides a Lucene Jdbc Directory implementation, allowing
storing Lucene index within a database for both pure Lucene
applications and Compass enabled applications.

best regards simon

On 9/25/06, Reuven Ivgi <[EMAIL PROTECTED]> wrote:
> Hello,
>
> I have just started to work with Lucene
>
> Is it possible to define the index files of lucene to be on a database
> (such as MySQL), just for backup and restore purposes?
>
> Thanks in Advance
>
>
>
> Reuven Ivgi
>
>
>
>
>





Re: Beyond Lucene 2.0 Index Design

2007-01-10 Thread jian chen

Hi, Jeff,

I like the idea of impact-based scoring. However, could you elaborate more
on why we would only need to use a single field at search time?

In Lucene, indexed terms are field specific, and two terms, even if their
text is the same, are still different terms if they are in different fields.

So I think the multiple-field scenario is still needed, right? What if the
user wants to search on both subject and content for emails, for example,
and sometimes wants to search on subject only? Without multiple fields,
how would this type of task be handled?

I got lost on this; could anyone educate me?

Thanks,

Jian

On 1/9/07, Dalton, Jeffery <[EMAIL PROTECTED]> wrote:


I'm not sure we fully understand one another, but I'll try to explain
what I am thinking.

Yes, it has use after sorting.  It is used at query time for document
scoring in place of the TF and length norm components  (new scorers
would need to be created).

Using an impact based index moves most of the scoring from query time to
index time (trades query time flexibility for greatly improved query
search performance).  Because the field boosts, length norm, position
boosts, etc... are incorporated into a single document-term-score, you
can use a single field at search time.  It allows one posting list per
query term instead of the current one posting list per field per query
term (MultiFieldQueryParser wouldn't be necessary in most cases).  In
addition to having fewer posting lists to examine, you often don't need
to read to the end of long posting lists when processing with a
score-at-a-time approach (see Anh/Moffat's Pruned Query Evaluation Using
Pre-Computed Impacts, SIGIR 2006) for details on one potential
algorithm.

I'm not quite sure what you mean when you mention leaving them out and
re-calculating them at merge time.

- Jeff

> -Original Message-
> From: Marvin Humphrey [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, January 09, 2007 2:58 PM
> To: java-dev@lucene.apache.org
> Subject: Re: Beyond Lucene 2.0 Index Design
>
>
> On Jan 9, 2007, at 6:25 AM, Dalton, Jeffery wrote:
>
> > e. 
> > f. ],...[docN, freq
> > ,])
>
> Does the impact have any use after it's used to sort the postings?
> Can we leave it out of the index format and recalculate at merge-time?
>
> Marvin Humphrey
> Rectangular Research
> http://www.rectangular.com/
>
>
>
>
>





Re: Beyond Lucene 2.0 Index Design

2007-01-10 Thread jian chen

Hi, Jeff,

Also, how would phrase-based queries be handled?

For example, here are two posting lists:

TermA: X Y
TermB: Y X

I am not sure how you would return document X or Y for a search of the
phrase "TermA Term B". Which should come first?

Thanks,

Jian

On 1/9/07, Dalton, Jeffery <[EMAIL PROTECTED]> wrote:


I'm not sure we fully understand one another, but I'll try to explain
what I am thinking.

Yes, it has use after sorting.  It is used at query time for document
scoring in place of the TF and length norm components  (new scorers
would need to be created).

Using an impact based index moves most of the scoring from query time to
index time (trades query time flexibility for greatly improved query
search performance).  Because the field boosts, length norm, position
boosts, etc... are incorporated into a single document-term-score, you
can use a single field at search time.  It allows one posting list per
query term instead of the current one posting list per field per query
term (MultiFieldQueryParser wouldn't be necessary in most cases).  In
addition to having fewer posting lists to examine, you often don't need
to read to the end of long posting lists when processing with a
score-at-a-time approach (see Anh/Moffat's Pruned Query Evaluation Using
Pre-Computed Impacts, SIGIR 2006) for details on one potential
algorithm.

I'm not quite sure what you mean when you mention leaving them out and
re-calculating them at merge time.

- Jeff

> -Original Message-
> From: Marvin Humphrey [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, January 09, 2007 2:58 PM
> To: java-dev@lucene.apache.org
> Subject: Re: Beyond Lucene 2.0 Index Design
>
>
> On Jan 9, 2007, at 6:25 AM, Dalton, Jeffery wrote:
>
> > e. 
> > f. ],...[docN, freq
> > ,])
>
> Does the impact have any use after it's used to sort the postings?
> Can we leave it out of the index format and recalculate at merge-time?
>
> Marvin Humphrey
> Rectangular Research
> http://www.rectangular.com/
>
>
>
>
>





Re: Beyond Lucene 2.0 Index Design

2007-01-11 Thread jian chen

I have the same question. It seems very hard to do phrase-based queries
efficiently.

I think most search engines do phrase-based queries, or at least appear to.
So, as in Google, the query result must contain all the words the user
searched on.

It seems to me that the impact-sorted list makes sense if you are trying
to do pure vector-space-based ranking. This is from what I have read in
the research papers; they all talk about how to optimize the vector space
model using this impact-sorted list approach.

Unfortunately, the vector space model has serious drawbacks. It does not
take the inter-word relation into account, and thus could produce search
results where documents matching only some keywords rank higher than
documents matching all of them.

I have yet to see whether the impact-sorted list approach can handle this
efficiently.

Cheers,

Jian
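
A minimal sketch (an illustration, not Lucene code) of why document-ordered postings make conjunctions and phrases cheap: two doc-id lists sorted in ascending order can be intersected with a single linear merge, which is the property PhraseScorer and ConjunctionScorer rely on. With impact-sorted postings the doc ids no longer arrive in this order, so the merge below does not apply directly.

    public class SortedIntersect {
        // Both arrays must be sorted ascending; returns the count of common doc ids.
        static int intersectCount(int[] docsA, int[] docsB) {
            int i = 0, j = 0, common = 0;
            while (i < docsA.length && j < docsB.length) {
                if (docsA[i] == docsB[j])      { common++; i++; j++; }
                else if (docsA[i] < docsB[j])  { i++; }
                else                           { j++; }
            }
            return common;
        }

        public static void main(String[] args) {
            int[] termA = { 2, 5, 9, 12 };
            int[] termB = { 5, 7, 12, 30 };
            System.out.println(intersectCount(termA, termB));   // prints 2 (docs 5 and 12)
        }
    }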

On 1/11/07, Marvin Humphrey <[EMAIL PROTECTED]> wrote:



On Jan 9, 2007, at 6:25 AM, Dalton, Jeffery wrote:
> e. 
> f. ],...[docN, freq
> ,])

How do you build an efficient PhraseScorer to work with an impact-
sorted posting list?

The way PhraseScorer currently works is: find a doc that contains all
terms, then see if the terms occur consecutively in phrase order,
then determine a score.  The TermDocs objects feeding PhraseScorer
return doc_nums in  ascending order, so finding an intersection is
easy.  But if the document numbers are returned in what looks to the
PhraseScorer like random order... ??

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/







Re: NewIndexModifier - - - DeletingIndexWriter

2007-02-09 Thread jian chen

Hey guys,

Having followed the Lucene dev mailing list for some time now, I am concerned
that Lucene is slowly losing all its simplicity and becoming a complicated
mess.

I think keeping IndexReader and IndexWriter the way they worked even in 1.2
is better, no?

Software should be designed to be simple to use and maintain; that's my
concern.

Cheers,

Jian

On 2/9/07, Yonik Seeley <[EMAIL PROTECTED]> wrote:


On 2/9/07, Doug Cutting <[EMAIL PROTECTED]> wrote:
> Yonik Seeley wrote:
> > As long as you wouldn't object to a org.apache.lucene package in
Solr...
> > With the understanding of course, that the onus would be on Solr
developers
> > to keep up with any changes.
>
> I wouldn't object to that.  Would you?

Nope... Solr bundles Lucene, so if there are changes that take longer
to adapt to - so be it.  Solr doesn't need to work with every Lucene
version.

-Yonik





Re: NewIndexModifier - - - DeletingIndexWriter

2007-02-13 Thread jian chen

I totally second Robert's thought.

My concern is that to get the raw speed of Lucene, you have to get to the
basics. If we start to apply layer upon layer of code just to mask off the
internals of Lucene, it will not do any good.

An example perhaps is Windoze vs. Linux. As an end user you get all the
fancy features in Windoze, but as a software developer you get frustrated
when you cannot access the low-level stuff easily. Linux is good in this
respect.

I think the Lucene library should be designed to be simple and efficient in
order to allow tweaking for raw speed. That's the spirit of large-scale
search engines, right? Even the Google File System sacrifices some design
for raw speed, i.e., files are append-only.

Cheers,

Jian

On 2/13/07, robert engels <[EMAIL PROTECTED]> wrote:


Lucene is not a word processor. It is a development library. I think
an understanding of any development library is essential to using it
properly. Once you have even a basic understanding of the Lucene
design, it is very clear as to why deletes are performed using the
IndexReader.

If you attempt to use Lucene without understanding its proper use and
design (there are many people on this list that think it is a
database) you will probably get most things wrong.

On Feb 13, 2007, at 1:17 AM, Nadav Har'El wrote:

> On Fri, Feb 09, 2007, jian chen wrote about "Re: NewIndexModifier -
> - - DeletingIndexWriter":
>> Following the Lucene dev mailing list for sometime now, I am
>> concerned that
>> lucene is slowing losing all the simplicity and become a
>> complicated mess.
>> I think keeping IndexReader and IndexWriter the way it works in
>> 1.2 even is
>> better, no?
>> Software should be designed to be simple to use and maintain,
>> that's my
>> concern.
>
> Hi, I wonder - how do you see the original IndexReader and IndexWriter
> separation "simple to use"?
>
> Every single user of Lucene that I know, encountered very quickly
> the problem
> of how to delete documents; Many of them started to use
> IndexModifier, and
> then suddenly realized its performance makes it unusable; Many (as
> you can
> also see from examples sent to the user list once in a while) ended
> up writing
> their own complex code for buffering deletes (and similar solutions).
>
> So for users, the fact that an index "writer" cannot delete, but
> rather an
> index "reader" (!) is the one that can delete documents, wasn't
> simplicity -
> it was simply confusing, and hard to use. It meant each user needed
> to work
> hard to get around this limitation. Wouldn't it be better if Lucene
> included
> this functionality that many (if not most) users need, out of the box?
>
> --
> Nadav Har'El| Tuesday, Feb 13 2007, 25
> Shevat 5767
> IBM Haifa Research Lab
> |-
> |Just remember that if the
> world didn't
> http://nadav.harel.org.il   |suck, we would all fall off.
>
>






Re: [jira] Updated: (LUCENE-725) NovelAnalyzer - wraps your choice of Lucene Analyzer and filters out all "boilerplate" text

2007-03-20 Thread jian chen

Hi, Mark,

Your program is very helpful. I am trying to understand your code, but it
seems it would take longer to do that than to simply ask you some questions.

1) What is the sliding window used for? Is it that the Analyzer remembers
the previously seen N tokens, with N being the window size?

2) As the Analyzer parses text, is it that patterns seen before (in the
previous N-token window) are remembered, and any such pattern in the
latest N-token window is recognized?

Could you provide some more insight into how your algorithm removes
duplicate snippets of text from many documents?

Thanks and really appreciate your help.

Jian


On 3/20/07, Mark Harwood (JIRA) <[EMAIL PROTECTED]> wrote:



 [
https://issues.apache.org/jira/browse/LUCENE-725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel]

Mark Harwood updated LUCENE-725:


Attachment: NovelAnalyzer.java

Updated version can now process any number of documents and remove
"boilerplate" text tokens such as copyright notices etc.
New version automatically maintains only a sliding window of content in
which it searches for duplicate paragraphs enabling it to process unlimited
numbers of documents.

> NovelAnalyzer - wraps your choice of Lucene Analyzer and filters out all
"boilerplate" text
>
---
>
> Key: LUCENE-725
> URL: https://issues.apache.org/jira/browse/LUCENE-725
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Analysis
>Reporter: Mark Harwood
>Priority: Minor
> Attachments: NovelAnalyzer.java, NovelAnalyzer.java
>
>
> This is a class I have found to be useful for analyzing small (in the
hundreds) collections of documents and  removing any duplicate content such
as standard disclaimers or repeated text in an exchange of  emails.
> This has applications in sampling query results to identify key phrases,
improving speed-reading of results with similar content (eg email
threads/forum messages) or just removing duplicated noise from a search
index.
> To be more generally useful it needs to scale to millions of documents -
in which case an alternative implementation is required. See the notes in
the Javadocs for this class for more discussion on this

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.






Re: [jira] Updated: (LUCENE-725) NovelAnalyzer - wraps your choice of Lucene Analyzer and filters out all "boilerplate" text

2007-03-20 Thread jian chen

Also, how about this scenario?

1) The Analyzer processes 100 documents, each with a copyright notice inside.
I guess in this case the copyright notices will be removed when indexing.

2) The Analyzer processes another 50 documents, each without any copyright
notice inside.

3) Then the Analyzer runs into a document that has a copyright notice inside
again.

My question is, would the Analyzer be able to remove the copyright notice
in step 3)?

Cheers,

Jian

On 3/20/07, jian chen <[EMAIL PROTECTED]> wrote:


Hi, Mark,

Your program is very helpful. I am trying to understand your code, but it
seems it would take longer to do that than to simply ask you some questions.

1) What is the sliding window used for? Is it that the Analyzer remembers
the previously seen N tokens, with N being the window size?

2) As the Analyzer parses text, is it that patterns seen before (in the
previous N-token window) are remembered, and any such pattern in the
latest N-token window is recognized?

Could you provide some more insight into how your algorithm removes
duplicate snippets of text from many documents?

Thanks and really appreciate your help.

Jian


On 3/20/07, Mark Harwood (JIRA) <[EMAIL PROTECTED] > wrote:
>
>
>  [
> 
https://issues.apache.org/jira/browse/LUCENE-725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel]
>
> Mark Harwood updated LUCENE-725:
> 
>
> Attachment: NovelAnalyzer.java
>
> Updated version can now process any number of documents and remove
> "boilerplate" text tokens such as copyright notices etc.
> New version automatically maintains only a sliding window of content in
> which it searches for duplicate paragraphs enabling it to process unlimited
> numbers of documents.
>
> > NovelAnalyzer - wraps your choice of Lucene Analyzer and filters out
> all "boilerplate" text
> >
> 
---
> >
> > Key: LUCENE-725
> > URL: https://issues.apache.org/jira/browse/LUCENE-725
> > Project: Lucene - Java
> >  Issue Type: New Feature
> >  Components: Analysis
> >Reporter: Mark Harwood
> >Priority: Minor
> > Attachments: NovelAnalyzer.java, NovelAnalyzer.java
> >
> >
> > This is a class I have found to be useful for analyzing small (in the
> hundreds) collections of documents and  removing any duplicate content such
> as standard disclaimers or repeated text in an exchange of  emails.
> > This has applications in sampling query results to identify key
> phrases, improving speed-reading of results with similar content (eg email
> threads/forum messages) or just removing duplicated noise from a search
> index.
> > To be more generally useful it needs to scale to millions of documents
> - in which case an alternative implementation is required. See the notes in
> the Javadocs for this class for more discussion on this
>
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>



Re: [jira] Updated: (LUCENE-725) NovelAnalyzer - wraps your choice of Lucene Analyzer and filters out all "boilerplate" text

2007-03-21 Thread jian chen

Hi, Mark,

Thanks a lot for your explanation. This code is very useful; it could even
live in a separate library for text extraction.

Again, thanks for taking time to answer my question.

Jian

On 3/21/07, markharw00d <[EMAIL PROTECTED]> wrote:


The Analyzer keeps a window of (by default) the last 300 documents.
Every token created in these cached documents is stored for reference,
and as new documents arrive their token sequences are examined to see if
any of the sequences were seen before, in which case the analyzer does
not emit them as tokens. A sequence is of a definable length, but I have
found something like 10 to be a good value (passed to the constructor).

If I were indexing this mailing list, for example, all of your content copied
below would be removed automatically (because it occurred more than once
within a 300-document window).


>> My question is, would the Analyzer be able to remove the copyright
>> notice in step 3)?
In your example, "yes" - because it re-occurred within 300 documents.

>> Could you provide some more insight into how your algorithm works

There are a number of optimizations to make it run fast that make the
code trickier to read.
The basis of it is (a rough sketch follows this list):
1) a "tokens" map is maintained with a key for every unique term
2) The value for each map entry is a list of ChainedTokens, each of which
represents an occurrence of the term in a doc
3) ChainedTokens contain the current term plus a reference to the
previous term in that document.
4) The analyzer periodically (i.e. not for every token) checks the tokens
map for the current term and looks at all previous occurrences of this
term, following the sequences of ChainedTokens looking for a common
pattern.
5) As soon as a pattern looks like it is established and the analyzer is
"onto something", it switches to a mode of concentrating solely on
comparing the current sequence with a single likely previous sequence
rather than testing ALL previous sequences as in step 4). If the
repeated chain of tokens is over the desired sequence length, these
tokens are not emitted as part of the output TokenStream.
* Periodically the tokens map and ChainedToken occurrences are pruned to
avoid bloating memory. As part of this exercise "stop words" are also
automatically identified and recorded to avoid the cost of chasing all
occurrences (step 4) or recording occurrences for very common words.
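
To make the shape of steps 1) to 4) concrete, here is a rough illustrative
sketch in Java. The class and method names are made up for the sketch; the
attached NovelAnalyzer.java differs in detail.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ChainedToken {
    final String term;            // this occurrence's term text
    final ChainedToken previous;  // previous term in the same document, or null

    ChainedToken(String term, ChainedToken previous) {
        this.term = term;
        this.previous = previous;
    }

    // length of the identical trailing token sequence shared by two chains
    static int commonTail(ChainedToken a, ChainedToken b) {
        int len = 0;
        while (a != null && b != null && a.term.equals(b.term)) {
            len++;
            a = a.previous;
            b = b.previous;
        }
        return len;
    }
}

class TokenHistory {
    // steps 1) and 2): every unique term -> its occurrences in the sliding window
    private final Map<String, List<ChainedToken>> tokens =
            new HashMap<String, List<ChainedToken>>();

    // step 3): record one occurrence, chained to the previous token of the doc
    ChainedToken record(String term, ChainedToken previousInDoc) {
        ChainedToken current = new ChainedToken(term, previousInDoc);
        List<ChainedToken> occurrences = tokens.get(term);
        if (occurrences == null) {
            occurrences = new ArrayList<ChainedToken>();
            tokens.put(term, occurrences);
        }
        occurrences.add(current);
        return current;
    }

    // step 4): does an earlier occurrence share a chain of at least minSequence terms?
    boolean endsRepeatedSequence(ChainedToken current, int minSequence) {
        List<ChainedToken> occurrences = tokens.get(current.term);
        if (occurrences == null) {
            return false;
        }
        for (ChainedToken earlier : occurrences) {
            if (earlier != current
                    && ChainedToken.commonTail(earlier, current) >= minSequence) {
                return true;
            }
        }
        return false;
    }
}

A token for which endsRepeatedSequence() returns true would simply not be
emitted by the wrapping TokenStream, and the pruning in the last step keeps
the tokens map bounded to the sliding window.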

Glad you find it useful.

Cheers,
Mark


jian chen wrote:
> Also, how about this scenario.
>
> 1) The Analyzer does 100 documents, each with a copyright notice inside. I
> guess in this case, the copyright notices will be removed when indexing.
>
> 2) The Analyzer does another 50 documents, each without any copyright
> notice inside.
>
> 3) Then, the Analyzer runs into a document that has a copyright notice
> inside again.
>
> My question is, would the Analyzer be able to remove the copyright notice
> in step 3)?
>
> Cheers,
>
> Jian
>
> On 3/20/07, jian chen <[EMAIL PROTECTED]> wrote:
>>
>> Hi, Mark,
>>
>> Your program is very helpful. I am trying to understand your code, but it
>> seems it would take longer to do that than to simply ask you some
>> questions.
>>
>> 1) What is the sliding window used for? Is it that the Analyzer remembers
>> the previously seen N tokens, and N is the window size?
>>
>> 2) As the Analyzer parses text, is it that patterns seen before (in the
>> previous N-token window) are remembered, and any recurrence of such a
>> pattern in the latest N-token window is recognized?
>>
>> Could you provide some more insight into how your algorithm removes
>> duplicate snippets of text across many documents?
>>
>> Thanks and really appreciate your help.
>>
>> Jian

(LUCENE-835) An IndexReader with run-time support for synonyms

2007-03-23 Thread jian chen

Hi, Mark,

Thanks for providing this original approach for synonyms. I read through
your code and think maybe this could be extended to handle the word stemming
problem as well.

Here is my thought.

1) Before indexing, create a Map<String, ArrayList<String>> stemmedWordMap,
where the key is the stemmed word.

2) At indexing time, we still index the word as it is, but we also stem the
word (using PorterStemmer) and then insert/update the stemmedWordMap to add
the mapping: stemmedWord <=> word.

For example, "lighting" and "lighted" will both be stored in the ArrayList
under the key "light".

3) At query time, when someone searches on "lighting", we stem it to
"light", then look up the stemmedWordMap for the other forms of this word.
In this case, we find "lighted". Then, we perform the search as a synonym
search over both terms.

This way, we can combine the synonyms and the stemmed words together. The
nice part of this is that we only need to store the index with the original
words, saving disk space as well as indexing time.
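
To make this concrete, here is a minimal sketch of the map (illustrative
only; producing the stem is assumed to happen elsewhere, e.g. with the
PorterStemmer mentioned above):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class StemmedWordMap {
    private final Map<String, ArrayList<String>> stemToWords =
            new HashMap<String, ArrayList<String>>();

    // indexing time: index the original word as usual, then record stem <=> word
    void add(String stem, String word) {
        ArrayList<String> words = stemToWords.get(stem);
        if (words == null) {
            words = new ArrayList<String>();
            stemToWords.put(stem, words);
        }
        if (!words.contains(word)) {
            words.add(word);
        }
    }

    // query time: all indexed surface forms that share the query term's stem
    List<String> expansionsFor(String stem) {
        List<String> words = stemToWords.get(stem);
        return words == null ? new ArrayList<String>() : words;
    }
}

After add("light", "lighting") and add("light", "lighted"), a query for
"lighting" stems to "light", expansionsFor("light") returns both surface
forms, and the two terms can be OR'ed together (e.g. in a BooleanQuery),
possibly with the exact form boosted higher.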

However, I do have the following concerns:

1) As documents can be removed from the index, the stemmedWordMap needs to be
kept up to date somehow. Could this be done periodically by rebuilding the
stemmedWordMap?

2) Typically, people would like to see their exact match first. So the
synonym search could be enhanced to take advantage of position-level boosting
(a payload per position), so that a search for "lighting" ranks documents
containing "lighting" higher than documents containing only "lighted".

3) I am still not sure if this is the best approach in general. Does it make
sense to keep two indexes, one with the original words indexed and the other
with all words stemmed, and then run searches against both indexes?

4) How does Google perform this type of search? I guess the web search
engines have a different approach; there may be no need for a stemmer at
all.

First, the web corpus is huge: searching for "lighting" will bring up
enough results, so who cares about also bringing back results with "lighted"?

Second, the anchor texts that point to a web page of interest would contain
all the variants (synonyms and stemmed words), so they don't need to worry
about search results being incomplete. For example, search for "rectangular"
in Google (http://www.google.com/search?hl=en&q=rectangular&btnG=Search) and
the Wikipedia page comes up first. It only contains "Rectangle"; however, if
you click on the Cached link, you will see that "rectangular" is contained in
the anchor text that points to this page.

My ultimate question: if I want to build a search engine, as a general rule,
what's the best way to do it?

Mark, could you shed some light?

Thanks,

Jian

On 3/18/07, Mark Harwood (JIRA) <[EMAIL PROTECTED]> wrote:



 [
https://issues.apache.org/jira/browse/LUCENE-835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel]

Mark Harwood updated LUCENE-835:


Attachment: TestSynonymIndexReader.java

> An IndexReader with run-time support for synonyms
> -
>
> Key: LUCENE-835
> URL: https://issues.apache.org/jira/browse/LUCENE-835
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Affects Versions: 2.1
>Reporter: Mark Harwood
> Assigned To: Mark Harwood
> Attachments: Synonym.java, SynonymIndexReader.java,
SynonymSet.java, TestSynonymIndexReader.java
>
>
> These classes provide support for enabling the use of synonyms for terms
in an existing index.
> While Analyzers can be used at Query-parse time or Index-time to inject
synonyms these are not always satisfactory means of providing support for
synonyms:
> * Index-time injection of synonyms is less flexible because changing the
lists of synonyms requires an index rebuild.
> * Query-parse-time injection is awkward because special support is
required in the parser/query logic  to recognise and cater for the tokens
that appear in the same position. Additionally, any statistical analysis of
the index content via TermEnum/TermDocs etc does not consider the synonyms
unless specific code is added.
> What is perhaps more useful is a transparent wrapper for the IndexReader
that provides a synonym-ized view of the index without requiring specialised
support in the calling code. All of the TermEnum/TermDocs interfaces remain
the same but behind the scenes synonyms are being considered/applied
silently.
> The classes supplied here provide this "virtual" view of the index and
all queries or other code that examines this index using the special reader
benefit from this view without requiring specialized code. A Junit test
illustrates this code in action.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: Large scale sorting

2007-04-09 Thread jian chen

Hi, Doug,

I have been thinking about this as well lately and have some thoughts
similar to Paul's approach.

Lucene keeps norm data for each indexed field. Conceptually it is a byte
array with one byte per document for that field. At query time, I think the
norm array is loaded into memory the first time it is accessed, allowing for
efficient lookup of the norm value for each document.

Now, if we could use integers to represent the sort field values, which is
typically the case for most applications, maybe we can afford to have the
sort field values stored on disk and do a disk lookup for each matched
document? The lookup of a sort field value is then as simple as a seek to
the field's base offset plus docNo * 4.

This way, we use the same approach as constructing the norms (proper merging
for incremental indexing), but at search time we don't load the sort field
values into memory; instead, we leave them on disk.
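
To spell out that lookup, here is a rough sketch using plain java.io (not
Lucene code; it assumes the file simply stores one 4-byte int per document
in doc id order, starting at some base offset):

import java.io.IOException;
import java.io.RandomAccessFile;

class DiskSortValues {
    private final RandomAccessFile file;
    private final long baseOffset;   // where the int-per-doc array starts

    DiskSortValues(String path, long baseOffset) throws IOException {
        this.file = new RandomAccessFile(path, "r");
        this.baseOffset = baseOffset;
    }

    // the lookup described above: seek to baseOffset + docNo * 4 and read an int
    int valueFor(int docNo) throws IOException {
        file.seek(baseOffset + 4L * docNo);
        return file.readInt();
    }

    void close() throws IOException {
        file.close();
    }
}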

Will this approach be good enough?

Thanks for your feedback.

Jian


On 4/9/07, Doug Cutting <[EMAIL PROTECTED]> wrote:


Paul Smith wrote:
> Disadvantages to this approach:
> * It's a lot more I/O intensive

I think this would be prohibitive.  Queries matching more than a few
hundred documents will take several seconds to sort, since random disk
accesses are required per matching document.  Such an approach is only
practical if you can guarantee that queries match fewer than a hundred
documents, which is not generally the case, especially with large
collections.

> I'm working on the basis that it's a LOT harder/more expensive to simply
> allocate more heap size to cover the current sorting infrastructure.
> One hits memory limits faster.  Not everyone can afford 64-bit hardware
> with many Gb RAM to allocate to a heap.  It _is_ cheaper/easier to build
> a disk subsystem to tune this I/O approach, and one can still use any
> RAM as buffer cache for the memory-mapped file anyway.

In my experience, raw search time starts to climb towards one second per
query as collections grow to around 10M documents (in round figures and
with lots of assumptions).  Thus, searching on a single CPU is less
practical as collections grow substantially larger than 10M documents,
and distributed solutions are required.  So it would be convenient if
sorting is also practical for ~10M document collections on standard
hardware.  If 10M strings with 20 characters are required in memory for
efficient search, this requires 400MB.  This is a lot, but not an
unusual amount on todays machines.  However, if you have a large number
of fields, then this approach may be problematic and force you to
consider a distributed solution earlier than you might otherwise.

Doug

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Re: Large scale sorting

2007-04-09 Thread jian chen

Hi, Paul,

Thanks for your reply. Regarding your previous email about the need for a
disk-based sorting solution, I largely agree with your points. One incentive
for your approach is that we no longer need to warm up the index when the
index is huge.

In our application, we have to sync up the index pretty frequently, and the
warm-up of the index is killing us.

To address your concern about a single sort locale, what about creating a
sort field for each locale? If you have, say, 10 locales, you will have
10 sort fields, each built with the same mechanism used for constructing the
norms.

At query time, in the HitCollector, for each matched doc id you can load the
field value (an integer) through the IndexReader (here you would need to
enhance the IndexReader to be able to load the sort field values). Then you
can use that value to reject/accept the doc, or factor it into the score.
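
Here is a sketch of that per-hit test (it assumes the Lucene 2.x
HitCollector API; SortValueSource is a made-up stand-in for the enhanced
IndexReader, and the threshold check is just one example of reject/accept):

import org.apache.lucene.search.HitCollector;

interface SortValueSource {
    int valueFor(int docNo);   // e.g. a disk read at baseOffset + 4 * docNo
}

class ThresholdCollector extends HitCollector {
    private final SortValueSource values;
    private final int minValue;

    ThresholdCollector(SortValueSource values, int minValue) {
        this.values = values;
        this.minValue = minValue;
    }

    // called once per matching doc; reject docs whose sort value is too small
    public void collect(int doc, float score) {
        if (values.valueFor(doc) >= minValue) {
            // accept the doc: e.g. push (doc, score) onto a priority queue
        }
    }
}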

What do you think?

Jian



On 4/9/07, Paul Smith <[EMAIL PROTECTED]> wrote:


>
> Now, if we could use integers to represent the sort field values,
> which is
> typically the case for most applications, maybe we can afford to
> have the
> sort field values stored in the disk and do disk lookup for each
> document
> matched? The lookup of the sort field value is then as simple as a seek
> to the field's base offset plus docNo * 4.
>
> This way, we use the same approach as constructing the norms
> (proper merging
> for incremental indexing), but, at search time, we don't load the
> sort field
> values into memory, instead, just store them in disk.
>
> Will this approach be good enough?

While a nifty idea, I think this only works for a single sort
locale.  I initially came up with a similar idea: the terms are
already stored in 'sorted' order, so one might be able to use a
term's position for sorting; it's just that a term's ordering
position is different in different locales.

Paul

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Re: Large scale sorting

2007-04-09 Thread jian chen

Hi, Paul,

I think whether to warm up or not needs some benchmarking for the specific
application.

For the implementation of the sort fields, when I talk about norms in
Lucene, I mean that we could borrow the same implementation used for the
norms.

But, on a higher level, my idea is really just to create an array of
integers for each sort field. The array length is the number of docs in the
index. Each integer corresponds to a displayable string value. For example,
if you have a field of different colors, you can assign integers like this:

0 <=> white
1 <=> blue
2 <=> yellow
...

Thus, you don't need to use strings for sorting. For example, if you have
document numbers 0, 1, 2, which store the colors blue, white, yellow
respectively, the array would be:

{1, 0, 2}.

To do sorting, this array could be pre-loaded into memory (warming up the
index), or, during collecting the hits (in HitCollector), the relevant
integer values could be loaded from disk given a doc id.

If you have 10 million documents, then for one sort field you will have a
10M x 4 bytes = 40 MB array.
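
Here is a small sketch of that array (illustrative only). One detail the
color example glosses over: for the integer comparison to match the string
ordering, the ordinals have to be assigned in sorted value order, which the
sketch does.

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;

class OrdinalSortField {
    private final Map<String, Integer> valueToOrd = new HashMap<String, Integer>();
    private final int[] ordByDoc;   // one int per document: numDocs * 4 bytes

    // docValues[docNo] holds the field's string value for that doc (e.g. a color)
    OrdinalSortField(String[] docValues) {
        // assign ordinals in sorted value order so comparing ints matches
        // comparing the original strings
        int ord = 0;
        for (String value : new TreeSet<String>(Arrays.asList(docValues))) {
            valueToOrd.put(value, Integer.valueOf(ord++));
        }
        ordByDoc = new int[docValues.length];
        for (int doc = 0; doc < docValues.length; doc++) {
            ordByDoc[doc] = valueToOrd.get(docValues[doc]).intValue();
        }
    }

    // sorting two hits compares small ints instead of strings
    int compare(int docA, int docB) {
        return ordByDoc[docA] - ordByDoc[docB];
    }
}

For a locale-sensitive ordering you would sort the distinct values with a
java.text.Collator for that locale and keep one such array per locale, as
discussed earlier in this thread.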

Cheers,

Jian


On 4/9/07, Paul Smith <[EMAIL PROTECTED]> wrote:


>
> In our application, we have to sync up the index pretty frequently,
> the
> warm-up of the index is killing it.
>

Yep, it speeds up the first sort, but at the cost of making all the
others slower (maybe significantly so).  That's obviously not ideal
but could make use of sorts in larger indexes practical.

> To address your concern about single sort locale, what about
> creating a sort
> field for each sort locale? So, if you have, say, 10 locales, you
> will have
> 10 sort fields, each utilizing the mechanism of constructing the
> norms.
>

I really don't understand norms properly so I'm not sure exactly how
that would help.  I'll have to go over your original email again to
understand.

My main goal is to get some discussion going amongst the community,
which hopefully we've kicked along.


Paul

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Re: Large scale sorting

2007-04-11 Thread jian chen

I agree. This falls into the area where a technical limit is reached. Time to
modify the spec.

I thought about this issue over the last couple of days, and there is really
NO silver bullet. If the field is a multi-value field and there are not too
many distinct values, you might reduce memory usage by storing the field as
bitsets, with each bit corresponding to a distinct value.
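
Here is a sketch of that layout (illustrative only): one bit per (document,
distinct value) pair, so memory is roughly numDocs * numDistinctValues / 8
bytes, and it only pays off when the number of distinct values is small.

import java.util.BitSet;

class MultiValueFieldBits {
    private final int numDistinctValues;
    private final BitSet bits;   // bit index = docNo * numDistinctValues + valueOrd

    MultiValueFieldBits(int numDocs, int numDistinctValues) {
        this.numDistinctValues = numDistinctValues;
        this.bits = new BitSet(numDocs * numDistinctValues);
    }

    void add(int docNo, int valueOrd) {
        bits.set(docNo * numDistinctValues + valueOrd);
    }

    boolean hasValue(int docNo, int valueOrd) {
        return bits.get(docNo * numDistinctValues + valueOrd);
    }
}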

But either way, you have to load the whole thing into memory for good
performance.

Jian


On 4/10/07, Chris Hostetter <[EMAIL PROTECTED]> wrote:


: I'm wondering then if the Sorting infrastructure could be refactored
: to allow  with some sort of policy/strategy where one can choose a
: point where one is not willing to use memory for sorting, but willing

...

: To accomplish this would require a substantial change to the
: FieldSortHitQueue et al, and I realize that the use of NIO

I don't follow ... why couldn't this be implemented entirely via a new
SortComparatorSource?  (you would also need something to create your file,
but that could probably be done as a decorator or subclass of IndexWriter,
couldn't it?)

: immediately pins Lucene to Java 1.4, so I'm sure this is
: controversial.  But, if we wish Lucene to go beyond where it is now,

Java 1.5 is controversial, Lucene already has 1.4 dependencies.

: I think we need to start thinking about this particular problem
: sooner rather than later.

it depends on your timeline, Lucene's gotten pretty far with what it's
got.  Personally i'm banking on RAM getting cheaper fast enough that I
won't ever need to worry about this.

If i needed to support sorting on lots of fields with lots of different
locales, and my index was big enough that i couldn't feasibly keep all of
the FieldCaches in memory on one box, i wouldn't partition the index
across multiple boxes and merge results with a MultiSearcher ... i'd clone
the index across multiple boxes and partition the traffic based on the
field/locale it's searching on.

it's a question of cache management, if i know i have two very different
use cases for a Solr index, i partition those use cases to separate tiers
of machines to get better cache utilization, FieldCache is
just another type of cache.




-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




using FieldCache or storing quality score in lucene term index

2007-10-01 Thread jian chen
Hi,

This is probably a question for the user list. However, as it relates to
performance and also to the Lucene index format, I think it is better to ask
the gurus on this list ;-)

In my application, I have implemented a quality score for each document. For
each search performed, the relevancy score is first computed using the
Lucene scoring, and then combined with the quality score to produce the
final score for the document.

For storing the quality score, I could use the FieldCache feature and then
load the quality scores as a byte array into memory when warming up the
index. However, I pay the price for the warm-up. Alternatively, I could store
the quality score in the term index, as in:

term, <doc id, quality score>+

This way, there is no need to warm up the index. But I guess the index would
be significantly bigger, since for each term the quality score of a document
is stored again.
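
For the FieldCache variant, here is a rough sketch (it assumes the Lucene
2.x FieldCache.getBytes and HitCollector APIs; the way the two scores are
combined below is just an example):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.FieldCache;
import org.apache.lucene.search.HitCollector;

class QualityBoostCollector extends HitCollector {
    private final byte[] quality;   // one quality byte per document, loaded once

    QualityBoostCollector(IndexReader reader, String qualityField) throws IOException {
        // this is the warm-up cost: the whole array is read on first access
        this.quality = FieldCache.DEFAULT.getBytes(reader, qualityField);
    }

    public void collect(int doc, float score) {
        // example combination only: scale the relevancy score by the quality
        float combined = score * (1.0f + (quality[doc] & 0xFF) / 255.0f);
        // ... record (doc, combined) in a priority queue, etc.
    }
}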

I haven't done any testing yet to see which way is better.

But, in general, could anyone give me some advice on which way is better? I
think it could be a classic time vs. space trade-off, but I would still like
to get the opinions of you gurus.

Thanks in advance.

Jian


Re: possible segment merge improvement?

2007-10-31 Thread jian chen
Hi, Robert,

That's a brilliant idea! Thanks so much for suggesting that.

Cheers,

Jian

On 10/31/07, robert engels <[EMAIL PROTECTED]> wrote:
>
> Currently, when merging segments, every document is parsed and then
> rewritten since the field numbers may differ between the segments
> (compressed data is not uncompressed in the latest versions).
>
> It would seem that in many (if not most) Lucene uses, the fields
> stored within each document within an index are relatively static,
> probably changing for all documents added after point X, if at all.
>
> Why not check the fields dictionary for the segments being merged,
> and if the same, just copy the binary data directly?
>
> In the common case this should be a vast improvement.
>
> Anyone worked on anything like this? Am I missing something?
>
> Robert Engels
>
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>
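
A rough sketch of the check Robert describes above (illustrative only, not
the actual Lucene SegmentMerger code; the "field dictionary" is taken here
to be a segment's field names in field-number order):

import java.util.List;

class StoredFieldsMergeSketch {
    // true if every segment maps field names to the same field numbers, in
    // which case the stored-field bytes could be copied without re-parsing
    static boolean canBulkCopy(List<List<String>> fieldDictionaries) {
        List<String> first = fieldDictionaries.get(0);
        for (List<String> dict : fieldDictionaries) {
            if (!dict.equals(first)) {
                return false;   // field numbers differ: fall back to rewriting docs
            }
        }
        return true;
    }
}

When this returns true for all segments being merged, the stored-field bytes
could in principle be appended verbatim, with only the per-document index
pointers adjusted.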