Re: possible segment merge improvement?

2007-10-31 Thread jian chen
Hi, Robert, That's a brilliant idea! Thanks so much for suggesting that. Cheers, Jian On 10/31/07, robert engels <[EMAIL PROTECTED]> wrote: > > Currently, when merging segments, every document is parsed and then > rewritten since the field numbers may differ between the segments > (compressed

using FieldCache or storing quality score in lucene term index

2007-10-01 Thread jian chen
Hi, This is probably a question for the user list. However, as it relates to the performance issue, also Lucene index format, I think better to ask the gurus in this list ;-) In my application, I have implemented a quality score for each document. For each search performed, the relevancy score is

Re: Large scale sorting

2007-04-11 Thread jian chen
I agree. This falls into the area where a technical limit is reached. Time to modify the spec. I thought about this issue over the past couple of days, and there is really NO silver bullet. If the field is a multi-value field and the distinct field values are not too many, you might reduce memory usage by st

Re: Large scale sorting

2007-04-09 Thread jian chen
Hi, Paul, I think whether to warm up or not needs some benchmarking for the specific application. For the implementation of the sort fields, when I talk about norms in Lucene, I am thinking we could borrow the same implementation of the norms to do it. But, on a higher level, my idea is really just to c

Re: Large scale sorting

2007-04-09 Thread jian chen
Hi, Paul, Thanks for your reply. For your previous email about the need for disk based sorting solution, I kind of agree about your points. One incentive for your approach is that we don't need to warm-up the index anymore in case that the index is huge. In our application, we have to sync up th

Re: Large scale sorting

2007-04-09 Thread jian chen
Hi, Doug, I have been thinking about this as well lately and have some thoughts similar to Paul's approach. Lucene has the norm data for each document field. Conceptually it is a byte array with one byte for each document field. At query time, I think the norm array is loaded into memory the fir
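
A minimal Java sketch of the idea in the message above: keep one byte per document in memory, as the norms do, and order hits by looking up that byte. The class name `DocValueSortSketch`, the `sortKeys` array, and the `sortHits` method are hypothetical and are not Lucene APIs; how the array would be filled from the index is left out.

```java
import java.util.Arrays;
import java.util.Comparator;

final class DocValueSortSketch {
    // Hypothetical per-document sort keys: one byte per document, indexed by doc id,
    // mirroring the "one byte per document field" layout described for norms.
    private final byte[] sortKeys;

    DocValueSortSketch(byte[] sortKeys) {
        this.sortKeys = sortKeys;
    }

    // Returns the hit doc ids ordered by ascending sort key (unsigned byte value).
    // Arrays.sort on objects is stable, so hits with equal keys keep their input order.
    int[] sortHits(int[] hitDocIds) {
        Integer[] boxed = Arrays.stream(hitDocIds).boxed().toArray(Integer[]::new);
        Arrays.sort(boxed, Comparator.comparingInt(
                (Integer doc) -> sortKeys[doc.intValue()] & 0xFF));
        return Arrays.stream(boxed).mapToInt(Integer::intValue).toArray();
    }

    public static void main(String[] args) {
        DocValueSortSketch sorter = new DocValueSortSketch(new byte[] {5, 1, 9, 1});
        System.out.println(Arrays.toString(sorter.sortHits(new int[] {0, 1, 2, 3})));
    }
}
```

The memory cost in this sketch is one byte per document per sort field, which is why it only suits keys that can be quantized into 256 buckets.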

(LUCENE-835) An IndexReader with run-time support for synonyms

2007-03-23 Thread jian chen
Hi, Mark, Thanks for providing this original approach for synonyms. I read through your code and think maybe this could be extended to handle the word stemming problem as well. Here is my thought. 1) Before indexing, create a Map> stemmedWordMap, the key is the stemmed word. 1) At indexing, we

Re: [jira] Updated: (LUCENE-725) NovelAnalyzer - wraps your choice of Lucene Analyzer and filters out all "boilerplate" text

2007-03-21 Thread jian chen
rding occurrences for very common words. Glad you find it useful. Cheers, Mark jian chen wrote: > Also, how about this scenario. > > 1) The Analyzer does 100 documents, each with copy right notice inside. I > guess in this case, the copy right notices will be removed when indexing. >

Re: [jira] Updated: (LUCENE-725) NovelAnalyzer - wraps your choice of Lucene Analyzer and filters out all "boilerplate" text

2007-03-20 Thread jian chen
into a document that has copy right notice inside again. My question is, would the Analyzer be able to remove the copy right notice in step 3)? Cheers, Jian On 3/20/07, jian chen <[EMAIL PROTECTED]> wrote: Hi, Mark, Your program is very helpful. I am trying to understand your code but it

Re: [jira] Updated: (LUCENE-725) NovelAnalyzer - wraps your choice of Lucene Analyzer and filters out all "boilerplate" text

2007-03-20 Thread jian chen
Hi, Mark, Your program is very helpful. I am trying to understand your code but it seems it would take longer to do that than simply asking you some questions. 1) What is the sliding window used for? Is it that the Analyzer remembers the previously seen N tokens, and N is the window size? 2) As th

Re: NewIndexModifier - - - DeletingIndexWriter

2007-02-13 Thread jian chen
ple on this list that think it is a database) you will probably get most things wrong. On Feb 13, 2007, at 1:17 AM, Nadav Har'El wrote: > On Fri, Feb 09, 2007, jian chen wrote about "Re: NewIndexModifier - > - - DeletingIndexWriter": >> Following the Lucene dev mailing li

Re: NewIndexModifier - - - DeletingIndexWriter

2007-02-09 Thread jian chen
Hey guys, Following the Lucene dev mailing list for some time now, I am concerned that Lucene is slowly losing all its simplicity and becoming a complicated mess. I think keeping IndexReader and IndexWriter the way they work in 1.2 even is better, no? Software should be designed to be simple to us

Re: Beyond Lucene 2.0 Index Design

2007-01-11 Thread jian chen
I also got the same question. It seems it is very hard to efficiently do phrase based query. I think most search engines do phrase based query, or at least appear to be. So, like in google, the query result must contain all the words user searched on. It seems to me that the impacted-sorted list

Re: Beyond Lucene 2.0 Index Design

2007-01-10 Thread jian chen
Hi, Jeff, Also, how to handle the phrase based queries? For example, here are two posting lists: TermA: X Y TermB: Y X I am not sure how you would return document X or Y for a search of the phrase "TermA Term B". Which should come first? Thanks, Jian On 1/9/07, Dalton, Jeffery <[EMAIL PROTE

Re: Beyond Lucene 2.0 Index Design

2007-01-10 Thread jian chen
Hi, Jeff, I like the idea of impact based scoring. However, could you elaborate more on why we only need to use single field at search time? In Lucene, the indexed terms are field specific, and two terms, even if they are the same, are still different terms if they are of different fields. So,

Re: Using Database instead of File system

2006-09-25 Thread jian chen
For real search engine, performance is the most important factor. I think file system based system is better than storing the indexes in database because of the pure speed you will get. Cheers, Jian On 9/25/06, Simon Willnauer <[EMAIL PROTECTED]> wrote: Have a look at the compass framework ht

Kudo to the wonderful Lucene search library

2006-06-02 Thread jian chen
source community. Jian Chen Lead Developer www.destinationlighting.com

Re: How To find which field has the search term in Hit?

2006-05-29 Thread jian chen
in FirstName and Company ..so how can I retrieve this info that it is found in only FirstName and Company fields. Best Noon. jian chen <[EMAIL PROTECTED]> wrote: You can store the field values and then, load the field values to do a real-time comparison. Simple solution... Jian On 5/24/06, N

Re: How To find which field has the search term in Hit?

2006-05-24 Thread jian chen
You can store the field values and then, load the field values to do a real-time comparison. Simple solution... Jian On 5/24/06, N <[EMAIL PROTECTED]> wrote: Hi I am searching on multiple fields. Is it possible to retrieve the field(s) which contain the search terms from the documents retu

Re: when was the document number initially written into .frq file?

2006-05-08 Thread jian chen
Looking at your email again. You are confusing the initial writing of postings with the segment merging. Once the doc number is written, the .frq file is not changed. The segment merge process will write to a new .frq file. Make sense? Jian On 5/8/06, jian chen <[EMAIL PROTECTED]>

Re: when was the document number initially written into .frq file?

2006-05-08 Thread jian chen
It is in DocumentWriter.java class. Look at writePostings(...) method. Here are the lines: // add an entry to the freq file int f = posting.freq; if (f == 1) // optimize freq=1 freq.writeVInt(1); // set low bit of doc num. else { freq.writeVIn
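
A hedged Java sketch of the encoding the quoted lines belong to: in the .frq file the document delta is shifted left by one bit, and the low bit flags the common case freq == 1 so that no separate frequency value has to be written. The class `FrqPostingSketch` and its `encode`/`decode` helpers are hypothetical names written for illustration; they are not the truncated `writePostings(...)` body, which only ever deals with a single document.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class FrqPostingSketch {
    // Variable-length int: 7 data bits per byte, high bit set means more bytes follow.
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    static int readVInt(byte[] data, int[] pos) {
        int b = data[pos[0]++];
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = data[pos[0]++];
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    // Encodes (docId, freq) pairs; docIds must be in increasing order.
    static byte[] encode(int[] docIds, int[] freqs) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int lastDoc = 0;
        for (int i = 0; i < docIds.length; i++) {
            int delta = docIds[i] - lastDoc;
            lastDoc = docIds[i];
            if (freqs[i] == 1) {
                writeVInt(out, (delta << 1) | 1); // low bit set: freq is 1, nothing follows
            } else {
                writeVInt(out, delta << 1);       // low bit clear: freq follows as its own VInt
                writeVInt(out, freqs[i]);
            }
        }
        return out.toByteArray();
    }

    // Decodes back into {docId, freq} pairs.
    static List<int[]> decode(byte[] data) {
        List<int[]> postings = new ArrayList<>();
        int[] pos = {0};
        int doc = 0;
        while (pos[0] < data.length) {
            int code = readVInt(data, pos);
            doc += code >>> 1;
            int freq = (code & 1) != 0 ? 1 : readVInt(data, pos);
            postings.add(new int[] {doc, freq});
        }
        return postings;
    }

    public static void main(String[] args) {
        byte[] bytes = encode(new int[] {3, 7, 300}, new int[] {1, 4, 1});
        System.out.println(Arrays.toString(bytes));
    }
}
```

For a single document numbered 0 with freq == 1, this sketch emits the single value 1, which matches the `freq.writeVInt(1)` line quoted in the message.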

Re: storing term text internally as byte array and bytecount as prefix, etc.

2006-05-02 Thread jian chen
this change to standard UTF-8 could be a hot item on the Lucene 2.0 list? Cheers, Jian Chen On 5/2/06, Doug Cutting <[EMAIL PROTECTED]> wrote: Chuck Williams wrote: > For lazy fields, there would be a substantial benefit to having the > count on a String be an encoded byte count rather

Re: storing term text internally as byte array and bytecount as prefix, etc.

2006-05-01 Thread jian chen
d hard to write programs for. Jian On 5/1/06, jian chen <[EMAIL PROTECTED]> wrote: Hi, Chuck, Using standard UTF-8 is very important for Lucene index so any program could read the Lucene index easily, be it written in perl, c/c++ or any new future programming languages. It is like storing data
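
A small Java sketch of what this thread proposes: term text stored as standard UTF-8 bytes, prefixed with a byte count rather than a character count. The class `Utf8TermSketch` and the fixed 4-byte length prefix are assumptions made for illustration; they are not Lucene's on-disk format.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class Utf8TermSketch {
    // Writes a 4-byte big-endian byte count followed by the standard UTF-8 bytes.
    static byte[] write(String term) {
        byte[] utf8 = term.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(4 + utf8.length)
                .putInt(utf8.length)
                .put(utf8)
                .array();
    }

    // Reads the byte count first, then decodes exactly that many bytes.
    static String read(byte[] record) {
        ByteBuffer buf = ByteBuffer.wrap(record);
        int byteCount = buf.getInt();
        byte[] utf8 = new byte[byteCount];
        buf.get(utf8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] record = write("caf\u00e9");
        System.out.println(record.length + " bytes -> " + read(record));
    }
}
```

With a byte count up front, a reader in any language can skip a term without decoding it, which is the benefit mentioned elsewhere in the thread for lazy fields.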

Re: storing term text internally as byte array and bytecount as prefix, etc.

2006-05-01 Thread jian chen
nism. Thanks for any clarification, Chuck jian chen wrote on 05/01/2006 04:24 PM: > Hi, Marvin, > > Thanks for your quick response. I am in the camp of fearless refactoring, > even at the expense of breaking compatibility with previous releases. ;-) > > Compatibility aside,

Re: storing term text internally as byte array and bytecount as prefix, etc.

2006-05-01 Thread jian chen
Hi, Marvin, Thanks for your quick response. I am in the camp of fearless refactoring, even at the expense of breaking compatibility with previous releases. ;-) Compatibility aside, I am trying to identify if changing the implementation of Term is the right way to go for this problem. If it is,

storing term text internally as byte array and bytecount as prefix, etc.

2006-05-01 Thread jian chen
? Cheers, Jian Chen

Re: this == that

2006-05-01 Thread jian chen
I am wondering if interning Strings will be really that critical for performance. The biggest bottleneck is still disk. So, maybe we can use String.equals(...) instead of ==. Jian On 5/1/06, DM Smith <[EMAIL PROTECTED]> wrote: karl wettin wrote: > The code is filled with string equality code
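
For readers unfamiliar with the trade-off discussed here, a short Java sketch: `==` compares references and is only reliable when both strings have been interned, while `equals` compares contents. The class and method names are hypothetical.

```java
final class InternSketch {
    // Reference comparison: cheap, but only correct if both strings are interned.
    static boolean sameByIdentity(String a, String b) {
        return a == b;
    }

    // Content comparison: always correct, at the cost of comparing characters.
    static boolean sameByEquals(String a, String b) {
        return a.equals(b);
    }

    public static void main(String[] args) {
        String a = new String("contents");
        String b = new String("contents");
        System.out.println(sameByIdentity(a, b));
        System.out.println(sameByEquals(a, b));
        System.out.println(sameByIdentity(a.intern(), b.intern()));
    }
}
```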

Re: Filter

2006-03-10 Thread jian chen
ng an open source license this year. Cheers, Jian Chen Lead Developer, Seattle Lighting On 3/10/06, eks dev <[EMAIL PROTECTED]> wrote: > > It looks to me everybody agrees here, not? If yes, it > would be really useful if somebody with commit rights > could add 1) and 2) to t

Re: DbDirectory with Berkeley DB Java Edition

2005-12-14 Thread jian chen
Hi, I am pretty pessimistic about any DB directory implementation for Lucene. The nature of the Lucene index files does not really fit well into a relational database. Therefore, performance wise, the DB implementations would suffer a lot. Basically, I would discourage anyone on the DB implementat

Re: Lucene Index backboned by DB

2005-11-15 Thread jian chen
Dear All, I have some thoughts on this issue as well. 1) It might be OK to implement retrieving field values separately for a document. However, I think from a simplicity point of view, it might be better to have the application code do this drudgery. Adding this feature could complicate the nice

Fwd: lucene inter-process locking question

2005-11-07 Thread jian chen
Hi, I did some research and found an answer from the following url: http://www.gossamer-threads.com/lists/lucene/java-dev/21808?search_string=synchronized%20directory;#21808 So, now I understand that it is partly historical. Cheers, Jian -- Forwarded message -- From: jian

lucene inter-process locking question

2005-11-07 Thread jian chen
Hi, Lucene Developers, Just got a question regarding the locking mechanism in Lucene. I see in IndexReader, first there is synchronized(directory) to synch up multi-threads, then, inside, there is the statement for grabbing the commit.lock. So, my question is, could the multi-thread synch be also

Re: Fwd: skipInterval

2005-10-16 Thread jian chen
rwarded message -- > > From: jian chen <[EMAIL PROTECTED]> > > Date: Oct 15, 2005 6:36 PM > > Subject: skipInterval > > To: Lucene Developers List > > > > Hi, All, > > > > I was reading some research papers regarding quick inverted index > loo

Fwd: skipInterval

2005-10-15 Thread jian chen
Hi, All, I should have sent to this email address rather than the old jakarta email address. Sorry if double-posted. Jian -- Forwarded message -- From: jian chen <[EMAIL PROTECTED]> Date: Oct 15, 2005 6:36 PM Subject: skipInterval To: Lucene Developers List Hi, All,

skipInterval

2005-10-15 Thread jian chen
Hi, All, I was reading some research papers regarding quick inverted index lookups. The classical approach to skipping dictates that a skip should be positioned every sqrt(df) document pointers. I looked at the current Lucene implementation. The skipInterval is hardcoded as follows in TermInf
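
A rough Java sketch of the comparison raised in this message, under a simplified cost model for a single-level skip list: in the worst case a lookup walks every skip entry and then scans one full interval. The class `SkipIntervalSketch`, both method names, and the cost model are assumptions for illustration; the fixed interval is a parameter, not the value hardcoded in Lucene.

```java
final class SkipIntervalSketch {
    // The "classical" choice: one skip pointer every sqrt(df) document pointers.
    static int sqrtSkipInterval(int docFreq) {
        return Math.max(1, (int) Math.sqrt(docFreq));
    }

    // Worst-case steps under the simplified model:
    // all skip entries visited, plus one interval scanned sequentially.
    static int worstCaseSteps(int docFreq, int skipInterval) {
        return docFreq / skipInterval + skipInterval;
    }

    public static void main(String[] args) {
        int docFreq = 10_000;
        int interval = sqrtSkipInterval(docFreq);
        System.out.println("sqrt interval: " + interval
                + ", worst-case steps: " + worstCaseSteps(docFreq, interval));
    }
}
```

Under this model, a term with df = 10,000 gets an interval of 100 and about 200 worst-case steps; any fixed interval can be passed to `worstCaseSteps` to compare.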

Re: Adding generic payloads to a Term's posting list

2005-10-10 Thread jian chen
Hi, I have been studying the Lucene indexing code for a bit. I am not sure if I understand the problem scope completely, but, storing extra information using TermsInfoWriter may not solve the problem? For the example of XML document tag depth, could that be a separate field? Because Lucene term i

Re: Eliminating norms ... completley

2005-10-07 Thread jian chen
Hi, Chris, Turning off norms looks like a very interesting problem to me. I remember that in the Lucene Road Map for 2.0, there is a requirement to turn off indexing for some information, such as proximity. Maybe optionally turning off the norms could be an experiment to showcase how to turn off the p

Re: Lucene does NOT use UTF-8.

2005-08-27 Thread jian chen
. Just my 2 cents. Thanks, Jian On 8/27/05, Ken Krugler <[EMAIL PROTECTED]> wrote: > > >On Aug 26, 2005, at 10:14 PM, jian chen wrote: > > > >>It seems to me that in theory, Lucene storage code could use true UTF-8 > to > >>store terms. Maybe it is just