Hi, Robert,
That's a brilliant idea! Thanks so much for suggesting that.
Cheers,
Jian
On 10/31/07, robert engels <[EMAIL PROTECTED]> wrote:
>
> Currently, when merging segments, every document is [parsed and then
> rewritten since the field numbers may differ between the segments
> (compressed
Hi,
This is probably a question for the user list. However, as it relates to a
performance issue and also to the Lucene index format, I think it is better
to ask the gurus on this list ;-)
In my application, I have implemented a quality score for each document. For
each search performed, the relevancy score is
I agree. This falls into the area where a technical limit has been reached.
Time to modify the spec.
I thought about this issue over the past couple of days; there is really NO
silver bullet. If the field is a multi-valued field and the distinct field
values are not too many, you might reduce memory usage by st
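The suggestion is cut off, but one plausible reading is to store each
distinct value once and keep only a small per-document ordinal instead of a
String per document. A minimal sketch under that assumption (class and
method names are made up for illustration):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical ordinal-based field cache: each distinct value is stored
// once; every document holds only an int index into the values table.
public class OrdinalFieldCache {
    private final Map<String, Integer> ordinals = new HashMap<>();
    private final List<String> values = new ArrayList<>();
    private final int[] docToOrd;

    public OrdinalFieldCache(String[] perDocValues) {
        docToOrd = new int[perDocValues.length];
        for (int doc = 0; doc < perDocValues.length; doc++) {
            docToOrd[doc] = ordinals.computeIfAbsent(perDocValues[doc], v -> {
                values.add(v);          // first time we see this value
                return values.size() - 1;
            });
        }
    }

    public String value(int doc) { return values.get(docToOrd[doc]); }
    public int distinctCount()   { return values.size(); }
}
```

With few distinct values, memory drops from one String reference (plus the
String itself) per document to one int per document plus a small table.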
Hi, Paul,
I think whether to warm up or not needs some benchmarking for the specific
application.
For the implementation of the sort fields, when I talk about norms in
Lucene, I am thinking we could borrow the same implementation as the norms
to do it.
But, on a higher level, my idea is really just to c
Hi, Paul,
Thanks for your reply. For your previous email about the need for a
disk-based sorting solution, I kind of agree with your points. One incentive
for your approach is that we no longer need to warm up the index in case the
index is huge.
In our application, we have to sync up th
Hi, Doug,
I have been thinking about this as well lately and have some thoughts
similar to Paul's approach.
Lucene has the norm data for each document field. Conceptually, for each
field it is a byte array with one byte per document. At query time, I think the norm
array is loaded into memory the fir
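The lazy, per-field loading described above can be sketched like this (a
hypothetical cache, not Lucene's actual code; the on-disk read is faked with
a constant byte so the sketch stays self-contained):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Sketch of a lazy per-field norms cache: one byte per document, loaded
// the first time the field is used, then served from memory.
public class NormsCache {
    private final Map<String, byte[]> norms = new HashMap<>();
    private final int maxDoc;

    public NormsCache(int maxDoc) { this.maxDoc = maxDoc; }

    // Load the byte array for a field on first request; afterwards a
    // lookup is a plain array index.
    public byte[] getNorms(String field) {
        return norms.computeIfAbsent(field, this::readNormsFromDisk);
    }

    public byte norm(String field, int docId) {
        return getNorms(field)[docId];
    }

    // Stand-in for reading the norms file from the index directory.
    private byte[] readNormsFromDisk(String field) {
        byte[] b = new byte[maxDoc];
        Arrays.fill(b, (byte) 0x7C);
        return b;
    }
}
```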
Hi, Mark,
Thanks for providing this original approach for synonyms. I read through
your code and think maybe this could be extended to handle the word stemming
problem as well.
Here is my thought.
1) Before indexing, create a Map<String, List<String>> stemmedWordMap, where
the key is the stemmed word.
2) At indexing, we
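Step 1) could be sketched as follows. This is only an illustration of the
proposed stemmedWordMap, with a toy suffix-stripper standing in for a real
stemmer such as Porter's:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Build the proposed map: stemmed form -> list of original surface forms.
public class StemMapBuilder {
    // Toy stemmer for illustration only; a real implementation would use
    // a proper algorithm (e.g. the Porter stemmer).
    static String stem(String word) {
        if (word.endsWith("ing")) return word.substring(0, word.length() - 3);
        if (word.endsWith("s"))   return word.substring(0, word.length() - 1);
        return word;
    }

    static Map<String, List<String>> build(List<String> vocabulary) {
        Map<String, List<String>> stemmedWordMap = new HashMap<>();
        for (String w : vocabulary) {
            stemmedWordMap
                .computeIfAbsent(stem(w), k -> new ArrayList<>())
                .add(w);
        }
        return stemmedWordMap;
    }
}
```

At search time, looking up the stemmed form of a query word would then yield
all surface forms to expand the query with, much like the synonym expansion.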
rding occurrences for very common words.
Glad you find it useful.
Cheers,
Mark
jian chen wrote:
> Also, how about this scenario.
>
> 1) The Analyzer does 100 documents, each with a copyright notice inside. I
> guess in this case, the copyright notices will be removed when
> indexing.
>
into a document that has a copyright notice inside
again.
My question is, would the Analyzer be able to remove the copyright notice
in step 3)?
Cheers,
Jian
On 3/20/07, jian chen <[EMAIL PROTECTED]> wrote:
Hi, Mark,
Your program is very helpful. I am trying to understand your code but it
Hi, Mark,
Your program is very helpful. I am trying to understand your code, but it
seems it would take longer to do that than simply asking you some questions.
1) What is the sliding window used for? Is it that the Analyzer remembers
the previously seen N tokens, where N is the window size?
2) As th
ple on this list that think it is a
database) you will probably get most things wrong.
On Feb 13, 2007, at 1:17 AM, Nadav Har'El wrote:
> On Fri, Feb 09, 2007, jian chen wrote about "Re: NewIndexModifier -
> - - DeletingIndexWriter":
>> Following the Lucene dev mailing li
Hey guys,
Following the Lucene dev mailing list for some time now, I am concerned that
Lucene is slowly losing all its simplicity and becoming a complicated mess.
I think keeping IndexReader and IndexWriter the way they worked in 1.2 is
even better, no?
Software should be designed to be simple to us
I also have the same question. It seems it is very hard to do phrase-based
queries efficiently.
I think most search engines do phrase-based queries, or at least appear to.
So, as in Google, the query result must contain all the words the user
searched on.
It seems to me that the impact-sorted list
Hi, Jeff,
Also, how to handle the phrase based queries?
For example, here are two posting lists:
TermA: X Y
TermB: Y X
I am not sure how you would return document X or Y for a search on the
phrase "TermA TermB". Which should come first?
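To make the concern concrete: document-level posting lists alone cannot
answer this; a phrase match needs positions, so that some position p holds
TermA and p+1 holds TermB. A minimal sketch (postings modeled as a
hypothetical map from docId to sorted positions):

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// A doc matches the phrase "TermA TermB" only if TermA occurs at some
// position p and TermB occurs at p + 1 in the same document.
public class PhraseMatchSketch {
    static boolean phraseMatch(Map<Integer, int[]> postingsA,
                               Map<Integer, int[]> postingsB,
                               int docId) {
        int[] a = postingsA.get(docId);
        int[] b = postingsB.get(docId);
        if (a == null || b == null) return false;    // term missing from doc
        Set<Integer> bPositions = new HashSet<>();
        for (int p : b) bPositions.add(p);
        for (int p : a) {
            if (bPositions.contains(p + 1)) return true;  // adjacency found
        }
        return false;
    }
}
```

This is why the .prx data matters for phrase queries: knowing that both X
and Y contain both terms says nothing about which one contains them
adjacently and in order.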
Thanks,
Jian
On 1/9/07, Dalton, Jeffery <[EMAIL PROTE
Hi, Jeff,
I like the idea of impact-based scoring. However, could you elaborate more
on why we only need to use a single field at search time?
In Lucene, the indexed terms are field specific, and two terms, even if they
are the same, are still different terms if they are of different fields.
So,
For a real search engine, performance is the most important factor. I think a
file-system-based system is better than storing the indexes in a database
because of the pure speed you will get.
Cheers,
Jian
On 9/25/06, Simon Willnauer <[EMAIL PROTECTED]> wrote:
Have a look at the compass framework
ht
source community.
Jian Chen
Lead Developer
www.destinationlighting.com
in FirstName and Company... so how
can I retrieve this info that it is found in only the FirstName and Company
fields.
Best
Noon.
jian chen <[EMAIL PROTECTED]> wrote: You can store the field values
and then, load the field values to do a
real-time comparison. Simple solution...
Jian
On 5/24/06, N
You can store the field values and then load the field values to do a
real-time comparison. Simple solution...
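The "simple solution" could be sketched like this: after a hit comes back,
scan its stored field values and report which fields contain the query term.
The field storage here is a plain map standing in for a retrieved Document's
stored fields (names are illustrative only):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Given a hit's stored field values, return the names of the fields that
// contain the search term (case-insensitive substring check for the sketch).
public class MatchedFieldsSketch {
    static List<String> matchedFields(Map<String, String> storedFields,
                                      String term) {
        List<String> hits = new ArrayList<>();
        String needle = term.toLowerCase();
        for (Map.Entry<String, String> e : storedFields.entrySet()) {
            if (e.getValue().toLowerCase().contains(needle)) {
                hits.add(e.getKey());
            }
        }
        return hits;
    }
}
```

A real version would re-tokenize the stored value with the same Analyzer
used at index time instead of a raw substring check, but the shape is the
same.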
Jian
On 5/24/06, N <[EMAIL PROTECTED]> wrote:
Hi
I am searching on multiple fields. Is it possible to retrieve the field
(s) which contains the search terms from the documents retu
Looking at your email again.
You are confusing the initial writing of postings with the segment merging.
Once the doc number is written, the .frq file is not changed. The segment
merge process will write to a new .frq file.
Make sense?
Jian
On 5/8/06, jian chen <[EMAIL PROTECTED]>
It is in the DocumentWriter.java class.
Look at the writePostings(...) method.
Here are the lines:
// add an entry to the freq file
int f = posting.freq;
if (f == 1) // optimize freq=1
freq.writeVInt(1); // set low bit of doc num.
else {
freq.writeVIn
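For context, the encoding the excerpt refers to can be sketched in
isolation. This is a simplified stand-alone version, not the actual Lucene
source: the doc delta is shifted left one bit, the low bit is set when
freq == 1 (so the frequency is omitted), and otherwise the frequency follows
as its own VInt:

```java
import java.io.ByteArrayOutputStream;

// Stand-alone sketch of the .frq entry encoding: VInt(docDelta << 1 | 1)
// when freq == 1, else VInt(docDelta << 1) followed by VInt(freq).
public class FreqEncoderSketch {
    // Variable-length int: 7 payload bits per byte, high bit = "more bytes".
    static void writeVInt(ByteArrayOutputStream out, int i) {
        while ((i & ~0x7F) != 0) {
            out.write((i & 0x7F) | 0x80);
            i >>>= 7;
        }
        out.write(i);
    }

    static byte[] encode(int docDelta, int freq) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (freq == 1) {
            writeVInt(out, (docDelta << 1) | 1);  // low bit set: freq omitted
        } else {
            writeVInt(out, docDelta << 1);        // low bit clear
            writeVInt(out, freq);                 // frequency written separately
        }
        return out.toByteArray();
    }
}
```

The freq == 1 special case pays off because 1 is by far the most common
frequency, so most entries cost a single VInt.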
this change to standard UTF-8 could be a hot item on the Lucene 2.0 list?
Cheers,
Jian Chen
On 5/2/06, Doug Cutting <[EMAIL PROTECTED]> wrote:
Chuck Williams wrote:
> For lazy fields, there would be a substantial benefit to having the
> count on a String be an encoded byte count rather
d hard to write
programs for.
Jian
On 5/1/06, jian chen <[EMAIL PROTECTED]> wrote:
Hi, Chuck,
Using standard UTF-8 is very important for the Lucene index so that any
program could read the Lucene index easily, be it written in Perl, C/C++, or
any future programming language.
It is like storing data
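The interoperability concern can be demonstrated directly. Java's
DataOutputStream.writeUTF emits "modified UTF-8", in which U+0000 becomes
the two bytes C0 80, while standard UTF-8 encodes it as a single 0x00 byte;
a non-Java reader expecting standard UTF-8 trips over the former:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Compare Java's modified UTF-8 (writeUTF) against standard UTF-8 bytes.
public class Utf8Demo {
    static byte[] modifiedUtf8(String s) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            new DataOutputStream(bos).writeUTF(s);
            byte[] withLen = bos.toByteArray();
            // writeUTF prefixes a 2-byte length; strip it to compare payloads.
            byte[] payload = new byte[withLen.length - 2];
            System.arraycopy(withLen, 2, payload, 0, payload.length);
            return payload;
        } catch (IOException e) {
            throw new RuntimeException(e);  // cannot happen with a byte array
        }
    }

    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }
}
```

For plain ASCII the two encodings agree, which is why the difference is easy
to miss until a NUL or a supplementary character shows up in a term.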
nism.
Thanks for any clarification,
Chuck
jian chen wrote on 05/01/2006 04:24 PM:
> Hi, Marvin,
>
> Thanks for your quick response. I am in the camp of fearless refactoring,
> even at the expense of breaking compatibility with previous releases. ;-)
>
> Compatibility aside,
Hi, Marvin,
Thanks for your quick response. I am in the camp of fearless refactoring,
even at the expense of breaking compatibility with previous releases. ;-)
Compatibility aside, I am trying to identify if changing the implementation
of Term is the right way to go for this problem.
If it is,
?
Cheers,
Jian Chen
I am wondering if interning Strings will really be that critical for
performance. The biggest bottleneck is still the disk. So, maybe we can use
String.equals(...) instead of ==.
Jian
On 5/1/06, DM Smith <[EMAIL PROTECTED]> wrote:
karl wettin wrote:
> The code is filled with string equality code
ng an open source license this year.
Cheers,
Jian Chen
Lead Developer, Seattle Lighting
On 3/10/06, eks dev <[EMAIL PROTECTED]> wrote:
>
> It looks to me everybody agrees here, not? If yes, it
> would be really useful if somebody with commit rights
> could add 1) and 2) to t
Hi,
I am pretty pessimistic about any DB directory implementation for Lucene.
The nature of the Lucene index files does not really fit well into a
relational database. Therefore, performance-wise, the DB implementations
would suffer a lot. Basically, I would discourage anyone on the DB
implementat
Dear All,
I have some thoughts on this issue as well.
1) It might be OK to implement retrieving field values separately for a
document. However, I think from a simplicity point of view, it might be
better to have the application code do this drudgery. Adding this feature
could complicate the nice
Hi,
I did some research and found an answer from the following url:
http://www.gossamer-threads.com/lists/lucene/java-dev/21808?search_string=synchronized%20directory;#21808
So, now I understand that it is partly historical.
Cheers,
Jian
-- Forwarded message --
From: jian
Hi, Lucene Developers,
Just got a question regarding the locking mechanism in Lucene. I see that in
IndexReader, there is first a synchronized(directory) block to synchronize
multiple threads; then, inside, there is the statement for grabbing the
commit.lock.
So, my question is, could the multi-thread synch be also
rwarded message --
> > From: jian chen <[EMAIL PROTECTED]>
> > Date: Oct 15, 2005 6:36 PM
> > Subject: skipInterval
> > To: Lucene Developers List
> >
> > Hi, All,
> >
> > I was reading some research papers regarding quick inverted index
> loo
Hi, All,
I should have sent this to this email address rather than the old Jakarta
email address. Sorry if this is double-posted.
Jian
-- Forwarded message --
From: jian chen <[EMAIL PROTECTED]>
Date: Oct 15, 2005 6:36 PM
Subject: skipInterval
To: Lucene Developers List
Hi, All,
Hi, All,
I was reading some research papers regarding quick inverted index lookups.
The classical approach to skipping dictates that a skip should be positioned
every sqrt(df) document pointers.
I looked at the current Lucene implementation. The skipInterval is
hardcoded as follows in TermInf
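The classical sqrt(df) rule can be sketched as follows. This is an
illustrative stand-alone model of skip pointers over a postings list, not
Lucene's actual multi-level skip code:

```java
import java.util.ArrayList;
import java.util.List;

// Place a skip entry every ceil-ish sqrt(df) postings, then use the skips
// to jump past blocks whose entries are all below the seek target.
public class SkipListSketch {
    // Indices into the postings array at which skip entries sit.
    static List<Integer> skipPositions(int df) {
        int interval = Math.max(1, (int) Math.round(Math.sqrt(df)));
        List<Integer> skips = new ArrayList<>();
        for (int i = interval; i < df; i += interval) skips.add(i);
        return skips;
    }

    // Advance to the first posting >= target: consult skips first, then
    // scan linearly from the last skip that is still <= target.
    static int seek(int[] postings, List<Integer> skips, int target) {
        int start = 0;
        for (int s : skips) {
            if (postings[s] <= target) start = s;
            else break;
        }
        for (int i = start; i < postings.length; i++) {
            if (postings[i] >= target) return postings[i];
        }
        return -1;  // target beyond the last posting
    }
}
```

With sqrt(df) spacing, a seek touches O(sqrt(df)) entries instead of O(df),
which is the trade-off the papers analyze against a fixed interval.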
Hi,
I have been studying the Lucene indexing code for a bit. I am not sure if I
understand the problem scope completely, but storing extra information
using TermInfosWriter may not solve the problem, right?
For the example of XML document tag depth, could that be a separate field?
Because Lucene term i
Hi, Chris,
Turning off norms looks like a very interesting problem to me. I remember
that in the Lucene Road Map for 2.0, there is a requirement to turn off
indexing of some information, such as proximity.
Maybe optionally turning off the norms could be an experiment to showcase
how to turn off the p
.
Just my 2 cents.
Thanks,
Jian
On 8/27/05, Ken Krugler <[EMAIL PROTECTED]> wrote:
>
> >On Aug 26, 2005, at 10:14 PM, jian chen wrote:
> >
> >>It seems to me that in theory, Lucene storage code could use true UTF-8
> to
> >>store terms. Maybe it is just