Re: lucene memory consumption

2008-05-29 Thread jian chen
Not that I can think of. But if you have any cached field data or norms arrays, those could be huge. I would be interested in hearing from others on this topic as well. Jian On 5/29/08, Alex <[EMAIL PROTECTED]> wrote: > > Hi, > other than the in memory terms (.tii), and the few kilobytes of

two copies of indexes vs. master/slave indexes

2008-05-16 Thread jian chen
I have seen two different designs for incremental index updates. 1) Have two copies of the index, A and B. The incremental updates happen on index A while index B is being used for search. Then hot swap the two indexes. Bring index B up to date and perform incremental updates thereafter. In this s
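
The two-copy design in this post can be sketched roughly as below. This is only an illustration under stated assumptions: `HotSwap` and its method names are hypothetical, it is not Lucene code, and the type parameter stands in for whatever handle an application uses for an index copy.

```java
import java.util.concurrent.atomic.AtomicReference;

/**
 * Hypothetical holder for two index copies: one serves searches,
 * the other receives incremental updates. Not part of Lucene.
 */
final class HotSwap<T> {
    // The copy currently serving searches; read without locking.
    private final AtomicReference<T> live;
    // The copy currently receiving incremental updates.
    private T updating;

    HotSwap(T searchCopy, T updateCopy) {
        this.live = new AtomicReference<>(searchCopy);
        this.updating = updateCopy;
    }

    /** Copy that searches should use right now. */
    T searchTarget() {
        return live.get();
    }

    /** Copy that incremental updates should be applied to right now. */
    synchronized T updateTarget() {
        return updating;
    }

    /**
     * Swap roles: the updated copy goes live, and the previously live
     * copy is returned so the caller can bring it up to date.
     */
    synchronized T swap() {
        T previousLive = live.getAndSet(updating);
        updating = previousLive;
        return previousLive;
    }
}
```

After `swap()`, the returned copy is the stale one; it would need the missed updates replayed before the next swap.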

simultaneous read and writes to the RAMDirectory

2008-05-16 Thread jian chen
Lucene gurus, I have a question regarding RAMDirectory usage. Can the IndexWriter keep adding documents to the index while an IndexReader is open on this RAMDirectory and searches are going on? I know that in the FSDirectory case, the IndexWriter can add documents to the index while the IndexReader rea

Re: Build vs. Buy?

2006-02-10 Thread jian chen
For reading Word documents as text, you can try AntiWord. I have written a simplified Lucene that does max-words matching. For example, if you are searching for aa, bb, cc, then the document that contains all the words (aa, bb, cc) will definitely be ranked higher than documents containing either aa, bb

Re: Urgent - File Lock in Lucene 1.2

2005-11-21 Thread jian chen
Hi, Karl, There have been quite a few discussions regarding the "too many open files" problem. From my understanding, it is due to Lucene trying to open multiple segments at the same time (while searching or merging segments), and the operating system not allowing that many open file handles. If

Re: About searching in multiple fields with one query

2005-11-13 Thread jian chen
Hi, Karl, Looking at the Lucene 1.2 source code, it looks to me that the MultiFieldQueryParser generates a BooleanQuery. Each sub-query within the BooleanQuery is for one field. The actual calculation of the score is in BooleanScorer.java, where the scores from each sub-query are accumulated. So,

Re: List of removed stop words?

2005-10-31 Thread jian chen
Hi, In case you are using StandardAnalyzer, there is a stop word list. I have used StandardAnalyzer.STOP_WORDS, which is a String[]. Cheers, Jian On 10/31/05, Rob Young <[EMAIL PROTECTED]> wrote: > > Hi, > > Is there an easy way to list stop words that were removed from a string? > I'm using th

Re: trying to boost a phrase higher than its individual words

2005-10-27 Thread jian chen
Hi, It seems what you want to achieve could be implemented using the Cover Density algorithm. I am not sure if any existing query class in the Lucene distribution does this already. But in case not, this is what I am thinking about: Make a custom query class, called CoverDensityQuery, which is mod

Re: java on 64 bits

2005-10-21 Thread jian chen
Hi, Also, I think you may try increasing the indexInterval. It is set to 128; with a larger value, the .tii files will be smaller. Since .tii files are loaded into memory as a whole, your memory usage might be smaller. However, this change might affect your search speed. So, be careful abou
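
The trade-off in this reply can be made concrete with a little arithmetic. The sketch below assumes the term index keeps roughly every indexInterval-th term of the dictionary; the class name, method name, and the term count of 10,000,000 are hypothetical and chosen only for illustration.

```java
/** Rough estimate of how many terms end up in the in-memory term index. */
final class TermIndexEstimate {

    /**
     * Number of sampled terms when every indexInterval-th term is kept,
     * rounded up so a partial last interval still counts.
     */
    static long indexedTerms(long totalTerms, int indexInterval) {
        if (totalTerms < 0 || indexInterval <= 0) {
            throw new IllegalArgumentException("totalTerms must be >= 0 and indexInterval > 0");
        }
        return (totalTerms + indexInterval - 1) / indexInterval;
    }

    public static void main(String[] args) {
        long totalTerms = 10_000_000L; // hypothetical dictionary size
        for (int interval : new int[] {128, 256, 512}) {
            // Doubling the interval roughly halves the number of in-memory entries.
            System.out.println("indexInterval=" + interval
                    + " -> about " + indexedTerms(totalTerms, interval) + " indexed terms");
        }
    }
}
```

A larger interval means fewer entries held in memory, but each lookup has to scan further through the on-disk term dictionary, which is the search-speed cost the reply warns about.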

Re: Large queries

2005-10-16 Thread jian chen
Hi, Trond, By the way, it appears to me that Lucene uses the iterator pattern a lot, like SegmentTermEnum, TermDocs, TermPositions, etc. Each iterator uses an underlying fixed-size buffer to load a chunk of data at a time. So, even if you have millions of documents, you shouldn't run into memory pro

Re: Large queries

2005-10-16 Thread jian chen
Hi, Trond, It should be no problem for Lucene to handle 6 million documents. For your query, it seems you want to do a disjunctive (or'ed) query for multiple terms, 10 terms or 1 terms for example. The worst case I can think of is, you can very easily write your own query class to handle this

Re: maximum number of documents

2005-10-12 Thread jian chen
Hi, Koji, I think you are right, the max number of documents should be Integer.MAX_VALUE. Some more points below: 1) I double-checked the Lucene documentation. The file format description says that SegSize is UInt32. I don't think this is accurate, as UInt32 is around 4 billion, but Integer.MAX_VAL
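
The gap between the two limits mentioned in this reply is easy to check. The following is a small sketch; the class and constant names are hypothetical and it says nothing about how Lucene itself stores the value.

```java
/** Compares the largest unsigned 32-bit value with the largest Java int. */
final class DocCountLimits {
    // Largest value an unsigned 32-bit field can hold: 4,294,967,295.
    static final long UINT32_MAX = 0xFFFFFFFFL;
    // Largest value a signed Java int can hold: 2,147,483,647.
    static final long JAVA_INT_MAX = Integer.MAX_VALUE;

    /** True if a count can be represented as a non-negative Java int. */
    static boolean fitsInJavaInt(long count) {
        return count >= 0 && count <= JAVA_INT_MAX;
    }

    public static void main(String[] args) {
        System.out.println("UInt32 max:        " + UINT32_MAX);
        System.out.println("Integer.MAX_VALUE: " + JAVA_INT_MAX);
        System.out.println("3 billion fits in int? " + fitsInJavaInt(3_000_000_000L));
    }
}
```

A count of three billion fits in an unsigned 32-bit field but not in a Java int, which is why a document number held in an int cannot use the full UInt32 range.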

Re: who could tell me the equation of the scoring in detail? i'm confused about two days

2005-10-04 Thread jian chen
Hi, Lucene uses a variant of the vector space model for ranking documents. You can look at 1) http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Similarity.html for the formula, 2) there is an article someone wrote about Lucene's ranking function vs. the standard VSM. However

Re: Storing HashMap as an UnIndexed Field

2005-09-20 Thread jian chen
well, certainly you can serialize into a byte stream and encode it using base64. Jian On 9/20/05, Mordo, Aviran (EXP N-NANNATEK) <[EMAIL PROTECTED]> wrote: > > I can't think of a way you can use serialization, since lucene only > works with strings. > > -Original Message- > From: Trici

storing inverted document as a field

2005-09-19 Thread jian chen
Hi, I am playing with the Lucene source code and have this somewhat stupid question, so please bear with me ;-) Basically, I want to implement a custom ranking algorithm. That is, iterating through the documents that contain all the search keywords, for each document, retrieve its inverted docum

Re: Small problem in searching

2005-09-15 Thread jian chen
Hi, I think Lucene transforms a prefix query into sub-queries for all the terms that begin with that prefix. For a "postfix" match, I think you need to do more work than relying on Lucene's query parser. You can iterate over the

Re: How do I avoid reindexing?

2005-09-10 Thread jian chen
Delete the document with this id and then add a document with the same id. Jian On 9/10/05, Filip Anselm <[EMAIL PROTECTED]> wrote: > > ...well the title says it all > > I index some documents - all with the same fields... One of the fields, > "id" is unique for the indexed documents. If i try to ind

Re: How to search between dates?

2005-09-03 Thread jian chen
Hi, Another way may be to store the date in a separate database, such as the Derby embedded database. Then do a joint query between the Lucene result and the Derby select result, and merge the two result sets into one by hand-coding a database-intersection type of operation. Just a th
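
The hand-coded intersection suggested in this reply could look roughly like the sketch below. It assumes both sides can be reduced to document ids held as strings; the class and method names are hypothetical, and neither the Lucene search nor the Derby query is shown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Intersects ranked full-text hits with ids returned by a database query. */
final class ResultIntersection {

    /**
     * Keeps only those ranked ids that also appear in the database result,
     * preserving the ranking order of the full-text hits.
     */
    static List<String> intersect(List<String> rankedHitIds, Set<String> databaseIds) {
        List<String> merged = new ArrayList<>();
        for (String id : rankedHitIds) {
            if (databaseIds.contains(id)) {
                merged.add(id);
            }
        }
        return merged;
    }
}
```

Iterating over the ranked list and probing a set keeps the relevance order from the text search while the date filter comes from the database side.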

Re: read past EOF

2005-08-27 Thread jian chen
Hi, It seems this problem only happens when the index files get really large. Could it be because Java has trouble handling very large files on Windows machines (I guess there is a max file size on Windows)? In Lucene, I think there is a maxDoc kind of parameter that you can use to specify, when th

Re: Lucene does NOT use UTF-8.

2005-08-27 Thread jian chen
Hi, It seems to me that in theory, Lucene storage code could use true UTF-8 to store terms. Maybe it is just a legacy issue that the modified UTF-8 is used? Cheers, Jian On 8/26/05, Marvin Humphrey <[EMAIL PROTECTED]> wrote: > > Greets, > > [crossposted to java-user@lucene.apache.org and [E

Re: Books about Lucene?

2005-08-26 Thread jian chen
y consider making those > changes. > > Erik > > > On Aug 26, 2005, at 3:12 PM, jian chen wrote: > > > Hi, Erik, > > > > I some time ago played with the Lucene 1.2 source code and made some > > modifications to it, trying to add my own ranking algorithm.

Re: Books about Lucene?

2005-08-26 Thread jian chen
Hi, Erik, Some time ago I played with the Lucene 1.2 source code and made some modifications to it, trying to add my own ranking algorithm. I am not sure if, license-wise, it is permissible to modify the earlier source code, or whether it is allowed to put the modified version or the description of

Re: Serialized Java Objects

2005-08-25 Thread jian chen
Hi, I don't think it does so by default. But you can certainly serialize the Java object and use Base64 to encode it into a text string; then you can store it as a field. Cheers, Jian On 8/25/05, Kevin L. Cobb <[EMAIL PROTECTED]> wrote: > I just had a thought this morning. Does Lucene have t
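
The serialize-then-encode idea in this reply can be sketched as follows. This is an illustration only: `Base64FieldCodec` is a hypothetical helper, it relies on `java.util.Base64` (available since Java 8, so newer than this thread), and the step of storing the resulting string in a Lucene field is left out.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.Base64;

/** Turns a Serializable object into a plain text string and back. */
final class Base64FieldCodec {

    /** Serializes the object and encodes the bytes as Base64 text. */
    static String encode(Serializable value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return Base64.getEncoder().encodeToString(bytes.toByteArray());
    }

    /** Decodes Base64 text and deserializes the original object. */
    static Object decode(String text) {
        byte[] raw = Base64.getDecoder().decode(text);
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(raw))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException("Class of stored object not found", e);
        }
    }
}
```

The encoded string contains only characters from the Base64 alphabet, so it can be kept as a stored, unindexed text value and decoded when the document is retrieved. Deserializing data from an untrusted source is unsafe, so this only suits values the application wrote itself.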

Re: QueryParser not thread-safe

2005-08-23 Thread jian chen
Right. My philosophy is: make it work, then make it better. Don't waste time on something when you are not sure it would cause a performance problem. Jian On 8/23/05, Paul Elschot <[EMAIL PROTECTED]> wrote: > On Tuesday 23 August 2005 19:01, Miles Barr wrote: > > On Tue, 2005-08-23 at 13

Re: MySimilarity with Lucene 1.2 ?

2005-08-18 Thread jian chen
Hi, I hacked Lucene 1.2 a little while ago and I am trying to use my own similarity algorithm. If you are interested in the changes I have made to Lucene 1.2, you can email me back at chenjian1227 at gmail.com Cheers, Jian On 8/18/05, Karl Koch <[EMAIL PROTECTED]> wrote: > Hello Lucen

Re: Integrate Lucene with Derby

2005-08-13 Thread jian chen
learning it... > > I have asked the kind people of Derby-users, and they say there is no > solution for this yet. > > I guess we can ask the people on the -developer list.... > > > On 8/13/05, jian chen <[EMAIL PROTECTED]> wrote: > > Hi, > > > > I am

Re: Integrate Lucene with Derby

2005-08-13 Thread jian chen
Hi, I am also interested in that. I haven't used Derby before, but it seems to be the Java database of choice, as it is open source and a full relational database. I plan to learn the simple usage of Derby and then think about integrating Derby with Lucene. Maybe we should post our progress for the int

Re: DOM or XML representation of a query?

2005-08-10 Thread jian chen
Well, the good practice I think is to decouple the backend from the front end as much as possible. You might have different versions of java running for each end and also, there might be code compatibility issues with different versions. Jian On 8/10/05, Andrew Boyd <[EMAIL PROTECTED]> wrote: > Q

Re: Too many open files error using tomcat and lucene

2005-07-20 Thread jian chen
Hi, Dan, I think the problem you mentioned is one that has been discussed many times on this mailing list. The bottom line is that you'd better use the compound file format to store indexes. I am not sure Lucene 1.3 has that available, but, if possible, can you upgrade to Lucene 1.4.3? Cheers,

Re: Lucene index integrity during a system crash

2005-07-16 Thread jian chen
ey are > not recorded in segments file, they are irrelevant. > > Otis > P.S. > Did you ask you locking in Lucene the other day? > > > --- jian chen <[EMAIL PROTECTED]> wrote: > > > Hi, Otis, > > > > Thanks for your email. As this is very important

Re: Lucene index integrity during a system crash

2005-07-16 Thread jian chen
ed on this list so far was > the corruption of the segments file, and even that people have been > able to manually edit with a hex editor. > > Otis > > > --- jian chen <[EMAIL PROTECTED]> wrote: > > > Hi, > > > > I know Lucene does not have trans

Lucene index integrity during a system crash

2005-07-15 Thread jian chen
Hi, I know Lucene does not have transaction support at this stage. However, I want to know what will happen if there is an operating system crash during the indexing process. Will the Lucene index get corrupted? Thanks, Jian -

Re: non-lexical comparisons

2005-07-07 Thread jian chen
Yeah, RDBMS makes sense. In this case, would it be better to simply store those in a relational database and just use Lucene to do indexing for the text? Cheers, Jian On 7/7/05, Leos Literak <[EMAIL PROTECTED]> wrote: > I know the answer, but just for curiosity: > > have you guys ever thought

Re: Retrieval model used by Lucene

2005-07-04 Thread jian chen
Well, I guess Lucene's Span query uses the Cover Density based model (proximity model). However, it is within the framework of the TF*IDF as well. Jian On 7/4/05, Dave Kor <[EMAIL PROTECTED]> wrote: > Quoting [EMAIL PROTECTED]: > > > Hi everybody, > > > > which kind of retrieval model is lucene

Re: No.of Files in Directory

2005-06-30 Thread jian chen
TED]> wrote: > Thanks Jian > > I need to retrive the original document sometimes. I did not quite understand > your second suggestion. > Can you please help me understand better, a pointer to some web resource will > also help. > > jian chen <[EMAIL PROTECTED]> wro

Re: No.of Files in Directory

2005-06-29 Thread jian chen
Hi, Depending on the operating system, there might be a hard limit on the number of files in one directory (windoze versions). Even with operating systems that don't have a hard limit, it is still better not to put too many files in one directory (linux). Typically, the file system won't be very

Re: Strategy for making short documents not bubble to the top?

2005-06-29 Thread jian chen
Hi, I would use a pure span or cover density based ranking algorithm, which does not take document length into consideration (tweaking whatever is currently in the standard Lucene distribution?). For example, searching for the keywords "beautiful house", span/cover ranking will treat a long document and

Re: Design question [too many fields?]

2005-06-29 Thread jian chen
Hi, Naimdjon, I have some suggestions as well along the lines of Mark Harwood. As an example, suppose for each hotel room there is a description, and you want the user to do free text search on the description field. You could do the following: 1) store hotel room reservation info as rows in a

question regarding the "commit.lock"

2005-06-28 Thread jian chen
Hi, I am looking at and trying to understand more about Lucene's reader/writer synchronization. Does anyone know when the commit.lock is released? I could not find it anywhere in the source code. I did see that the write.lock is released in IndexWriter.close(). Thanks, ---

when is the commit.lock released?

2005-06-28 Thread jian chen
Hi, I am looking at and trying to understand more about Lucene's reader/writer synchronization. Does anyone know when the commit.lock is released? I could not find it anywhere in the source code. I did see that the write.lock is released in IndexWriter.close(). Thanks, ---

Fwd: when is the commit.lock released?

2005-06-27 Thread jian chen
Hi, I haven't heard anything back. Probably this email got lost on the way or something. Anyway, could anyone enlighten me on this? Thanks, Jian -- Forwarded message -- From: jian chen <[EMAIL PROTECTED]> Date: Jun 26, 2005 12:59 PM Subject: when is the commit.lo

Re: Lock File exceptions

2005-06-27 Thread jian chen
Hi, Recently I looked at the locking mechanism of Lucene. If I am correct, the process for grabbing the lock file will time out by default after 10 seconds. When the process times out, it will print out the IOException. The Lucene locking mechanism is not within threads in the same JVM. It u

when is the commit.lock released?

2005-06-26 Thread jian chen
Hi, I am looking at and trying to understand more about Lucene's reader/writer synchronization. Does anyone know when the commit.lock is released? I could not find it anywhere in the source code. I did see that the write.lock is released in IndexWriter.close(). Thanks, Jian -

document ids in "cached" in Hits and index merge

2005-06-24 Thread jian chen
Hi, I have a stupid question regarding the transient nature of document ids. As I understand it, documents will obtain new doc ids during an index merge. Suppose you do a search and get the Hits object. While you iterate through the documents by id, an index merge happens. How the merge and

Re: Span query performance issue

2005-06-24 Thread jian chen
Hi, I think a Span query in general has to do more work than a simple Phrase query. A Phrase query, in its simplest form, just tries to find terms that are adjacent to each other. Meanwhile, the terms matched by a Span query are not necessarily adjacent to each other, but can have other words in between. Therefore, I

Re: Updateing Documents:

2005-06-21 Thread jian chen
Hi, You may look at this website http://www.zilverline.org Cheers, Jian On 6/21/05, Markus Atteneder <[EMAIL PROTECTED]> wrote: > I am looking for a SearchEngine for our Intranet and so i deal with Lucene. > I have read the FAQ and some Postings and i got first experiences with it > and now i

Re: how long should optimizing take

2005-06-02 Thread jian chen
Hi, optimize() merges the index segments into one single index segment. In your case, I guess the 2G index segment is quite large, if you merge it with any other small index segments, the merging process definitely will be slow. I think the performance should be ok without calling optimize(). Mor

Re: Indexing multiple languages

2005-05-31 Thread jian chen
owercasing, and such) > and separate CJK characters into separate tokens also. > > Erik > > > On May 31, 2005, at 5:49 PM, jian chen wrote: > > > Hi, > > > > Interesting topic. I thought about this as well. I wanted to index > > Chinese text with En

Re: Indexing multiple languages

2005-05-31 Thread jian chen
Hi, Interesting topic. I thought about this as well. I wanted to index Chinese text with English, i.e., I want to treat the English text inside Chinese text as English tokens rather than Chinese text tokens. Right now I think maybe I have to write a special analyzer that takes the text input, and