Not that I can think of. But if you have any cached field data or norms arrays, those could be huge.
Would be interested in knowing from others regarding this topic as well.
Jian
On 5/29/08, Alex <[EMAIL PROTECTED]> wrote:
>
> Hi,
> other than the in memory terms (.tii), and the few kilobytes of
I have seen two different designs for incremental index updates.
1) Have two copies of the index, A and B. The incremental updates happen on the A
index while the B index is being used for search. Then, hot swap the two
indexes. Bring the B index up to date and perform incremental updates on it
thereafter. In this s
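A minimal sketch of this dual-index hot-swap design. All names here (DualIndex, swap, the toy Index class) are illustrative, not a Lucene API; in practice each Index would wrap a real Lucene Directory plus IndexReader:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Design 1 sketch: searches always read the "live" index while incremental
// updates go to the "staging" index; swap() flips the two roles.
public class DualIndex {
    // stand-in for a real index (e.g. a Lucene Directory + IndexReader)
    public static class Index {
        final List<String> docs = new ArrayList<>();
        synchronized void add(String doc) { docs.add(doc); }
        synchronized boolean contains(String term) {
            for (String d : docs) if (d.contains(term)) return true;
            return false;
        }
    }

    private final AtomicReference<Index> live = new AtomicReference<>(new Index());
    private final AtomicReference<Index> staging = new AtomicReference<>(new Index());

    // searches only ever see the live index
    public boolean search(String term) { return live.get().contains(term); }

    // incremental updates only ever touch the staging index
    public void addDocument(String doc) { staging.get().add(doc); }

    // Hot swap: staging becomes live. The old live index must then be
    // brought up to date before it receives further incremental updates.
    public synchronized void swap() {
        Index oldLive = live.get();
        live.set(staging.get());
        staging.set(oldLive);
    }
}
```

The swap itself is cheap (a reference flip); the cost of this design is that every update must eventually be applied to both copies.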
Lucene gurus,
I have a question regarding RAMDirectory usage. Can the IndexWriter keep
adding documents to the index while an IndexReader is open on this
RAMDirectory and searches are going on?
I know that in the FSDirectory case, the IndexWriter can add documents to the index
while an IndexReader rea
For reading Word documents as text, you can try Antiword.
I have written a simplified Lucene that does max-words match.
For example, if you are searching for aa, bb, cc, then a document that
contains all the words (aa, bb, cc) will definitely be ranked higher than
documents containing either aa, bb
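A minimal sketch of the max-words-match idea described above (class and method names are hypothetical): the primary ranking key is simply how many distinct query terms a document contains, so a document matching all terms always outranks one matching fewer:

```java
import java.util.List;

// Max-words-match sketch: score is the count of distinct query terms
// present in the document; a full match always outranks a partial one.
public class MaxWordsMatch {
    public static int score(List<String> docTerms, List<String> queryTerms) {
        int matched = 0;
        for (String q : queryTerms) {
            if (docTerms.contains(q)) matched++; // count each matched query term
        }
        return matched; // primary sort key; ties could be broken by tf-idf etc.
    }
}
```

A real implementation would compare term sets from the index rather than raw lists, and would only use this count as the first-level sort key.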
Hi, Karl,
There have been quite a few discussions regarding the "too many open files"
problem. From my understanding, it is due to Lucene trying to open multiple
segments at the same time (during search/merging of segments), and the
operating system won't allow opening that many file handles.
If
Hi, Karl,
Looking at the Lucene 1.2 source code, it looks to me like the
MultiFieldQueryParser generates a BooleanQuery. Each sub-query within the
BooleanQuery is for one field. The actual calculation of the scoring is
in BooleanScorer.java, where the scores from each sub-query are
accumulated.
So,
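A toy illustration of the accumulation step described above (not the real BooleanScorer code, just the shape of it): each field contributes a sub-query score for a document, and the document's total is their sum:

```java
import java.util.Map;

// Sketch: one sub-query per field; the document's score is the sum of
// the per-field sub-query scores, roughly as the post describes
// BooleanScorer accumulating its clauses.
public class MultiFieldScore {
    // perFieldScores: hypothetical per-field sub-query scores for one document
    public static float accumulate(Map<String, Float> perFieldScores) {
        float total = 0f;
        for (float s : perFieldScores.values()) {
            total += s; // accumulate each field's contribution
        }
        return total;
    }
}
```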
Hi,
In case you are using StandardAnalyzer, there is a stop word list. I have
used StandardAnalyzer.STOP_WORDS, which is a String[].
Cheers,
Jian
On 10/31/05, Rob Young <[EMAIL PROTECTED]> wrote:
>
> Hi,
>
> Is there an easy way to list stop words that were removed from a string?
> I'm using th
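A small sketch of the suggestion above: keep the stop words in a Set and report which tokens of an input string fall in it. The STOP set here is a tiny stand-in; with Lucene you would populate it from StandardAnalyzer.STOP_WORDS (a String[]):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch: list the stop words that an analyzer would remove from a string.
// STOP is a small illustrative subset, not the full Lucene stop list.
public class StopWordCheck {
    static final Set<String> STOP = new HashSet<>(Arrays.asList("a", "an", "the", "of"));

    public static List<String> removedStopWords(String text) {
        List<String> removed = new ArrayList<>();
        for (String tok : text.toLowerCase().split("\\s+")) {
            if (STOP.contains(tok)) removed.add(tok); // this token would be dropped
        }
        return removed;
    }
}
```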
Hi,
It seems what you want to achieve could be implemented using the Cover
Density algorithm. I am not sure if any existing query classes in the Lucene
distribution do this already. But in case not, this is what I am thinking
about:
Make a custom query class, called CoverDensityQuery, which is mod
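A sketch of the core computation such a CoverDensityQuery could be built on (class and method names are hypothetical, not a Lucene API): find the shortest span ("cover") of the document that contains every query term, and score the document inversely to that span's length:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Cover-density sketch: the shorter the smallest window containing all
// query terms, the higher the score.
public class CoverDensity {
    // classic sliding-window minimum cover over the document's token stream
    public static int shortestCover(List<String> docTokens, Set<String> queryTerms) {
        Map<String, Integer> need = new HashMap<>();
        int missing = queryTerms.size(), best = Integer.MAX_VALUE, left = 0;
        for (int right = 0; right < docTokens.size(); right++) {
            String t = docTokens.get(right);
            if (queryTerms.contains(t) && need.merge(t, 1, Integer::sum) == 1) missing--;
            while (missing == 0) { // shrink from the left while still a full cover
                best = Math.min(best, right - left + 1);
                String l = docTokens.get(left++);
                if (queryTerms.contains(l) && need.merge(l, -1, Integer::sum) == 0) missing++;
            }
        }
        return best; // Integer.MAX_VALUE if no full cover exists
    }

    public static double score(List<String> docTokens, Set<String> queryTerms) {
        int c = shortestCover(docTokens, queryTerms);
        return c == Integer.MAX_VALUE ? 0.0 : (double) queryTerms.size() / c;
    }
}
```

In a real query class the token positions would come from TermPositions rather than a materialized token list.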
Hi,
Also, I think you may try to increase the indexInterval. It is set to 128,
but by making it larger, the .tii files will be smaller. Since .tii files are
loaded into memory as a whole, your memory usage might be smaller.
However, this change might affect your search speed. So, be careful abou
Hi, Trond,
By the way, it appears to me that Lucene uses the iterator pattern a lot,
as in SegmentTermEnum, TermDocs, TermPositions, etc. Each iterator uses an
underlying fixed-size buffer to load a chunk of data at a time. So, even if you
have millions of documents, you shouldn't run into memory pro
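A generic sketch of that fixed-size-buffer iterator pattern (the ChunkSource interface is illustrative, not a Lucene class): the iterator refills one bounded chunk at a time from the backing store, so memory stays constant no matter how many items exist:

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Buffered-iterator sketch: memory use is bounded by chunkSize, not by
// the total number of items in the backing store.
public class BufferedIterator<T> implements Iterator<T> {
    public interface ChunkSource<T> { List<T> nextChunk(int maxSize); } // empty list = exhausted

    private final ChunkSource<T> source;
    private final int chunkSize;
    private List<T> buffer = List.of();
    private int pos = 0;

    public BufferedIterator(ChunkSource<T> source, int chunkSize) {
        this.source = source;
        this.chunkSize = chunkSize;
    }

    @Override public boolean hasNext() {
        if (pos < buffer.size()) return true;
        buffer = source.nextChunk(chunkSize); // refill one chunk at a time
        pos = 0;
        return !buffer.isEmpty();
    }

    @Override public T next() {
        if (!hasNext()) throw new NoSuchElementException();
        return buffer.get(pos++);
    }
}
```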
Hi, Trond,
It should be no problem for Lucene to handle 6 million documents.
For your query, it seems you want to do a disjunctive (or'ed) query on
multiple terms, 10 terms or 1 term for example. In the worst case I can
think of, you can very easily write your own query class to handle this
Hi, Koji,
I think you are right, the max num of documents should be Integer.MAX_VALUE.
Some more points below:
1) I double checked the Lucene documentation. The file format section
mentions that SegSize is UInt32. I don't think this is accurate, as UInt32 goes
up to around 4 billion, but Integer.MAX_VAL
Hi,
Lucene uses a variant version of vector space model for ranking the
documents.
You can look at:
1) http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Similarity.html
for the formula,
2) there is an article someone wrote about Lucene's ranking function
vs. standard VSM. However
Well, certainly you can serialize it into a byte stream and encode it using
Base64.
Jian
On 9/20/05, Mordo, Aviran (EXP N-NANNATEK) <[EMAIL PROTECTED]> wrote:
>
> I can't think of a way you can use serialization, since lucene only
> works with strings.
>
> -Original Message-
> From: Trici
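A minimal sketch of that serialize-then-Base64 round trip in plain Java (class name is illustrative): the resulting string can be stored wherever only strings are accepted, e.g. as a Lucene field value:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Base64;

// Sketch: serialize any Serializable object to bytes, then Base64-encode
// so it can be stored or transported as a plain text string.
public class Base64Serializer {
    public static String toBase64(Serializable obj) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj); // Java serialization to a byte stream
        }
        return Base64.getEncoder().encodeToString(bytes.toByteArray());
    }

    public static Object fromBase64(String s) throws IOException, ClassNotFoundException {
        byte[] bytes = Base64.getDecoder().decode(s);
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return in.readObject(); // reconstruct the original object
        }
    }
}
```

(java.util.Base64 is a later JDK addition; in the Lucene 1.x era one would have used a third-party Base64 codec instead.)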
Hi,
I am playing with the Lucene source code and have this somewhat stupid question,
so please bear with me ;-)
Basically, I want to implement a custom ranking algorithm. That is,
iterating through the documents that contain all the search keywords, and for
each document, retrieve its inverted docum
Hi,
I think Lucene expands a prefix match query into sub-queries: searching for
a prefix results in a search over all terms that begin with that prefix.
For "postfix" match, I think you need to do more work than relying on
Lucene's query parser.
You can iterate over the
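A sketch of the suffix ("postfix") idea: enumerate the index's terms yourself (in old Lucene, via something like IndexReader.terms()) and keep those ending with the suffix; the kept terms can then be OR-ed into one BooleanQuery. Here a plain list stands in for the term enumeration, and the class name is hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

// Suffix-match sketch: unlike a prefix, a suffix cannot use the sorted
// term dictionary, so every term must be scanned.
public class SuffixMatch {
    public static List<String> termsEndingWith(List<String> allTerms, String suffix) {
        List<String> hits = new ArrayList<>();
        for (String t : allTerms) {
            if (t.endsWith(suffix)) hits.add(t); // candidate for the OR query
        }
        return hits;
    }
}
```

Note this is a full scan of the term dictionary, which is why suffix matching is inherently more expensive than prefix matching.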
Delete the document with this id and then add a document with the same id.
Jian
On 9/10/05, Filip Anselm <[EMAIL PROTECTED]> wrote:
>
> ...well the title says it all
>
> I index some documents - all with the same fields... One of the fields,
> "id" is unique for the indexed documents. If i try to ind
Hi,
The other way may be to store the date in a separate database, like the
embedded Derby database. Then, do a join between the Lucene result and the
Derby select result, and merge the two result sets into one by
hand coding a database-intersection type of operation.
Just a th
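A sketch of that hand-coded intersection step (class name illustrative): given the document ids matched by the Lucene query and the ids returned by the Derby select, both sorted ascending, a single merge pass yields the ids present in both:

```java
import java.util.ArrayList;
import java.util.List;

// Sorted-merge intersection sketch: O(n + m) over two ascending id lists.
public class ResultIntersection {
    public static List<Integer> intersect(List<Integer> luceneIds, List<Integer> dbIds) {
        List<Integer> both = new ArrayList<>();
        int i = 0, j = 0;
        while (i < luceneIds.size() && j < dbIds.size()) {
            int a = luceneIds.get(i), b = dbIds.get(j);
            if (a == b) { both.add(a); i++; j++; } // id present in both results
            else if (a < b) i++;
            else j++;
        }
        return both;
    }
}
```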
Hi,
It seems this problem only happens when the index files get really large.
Could it be because Java has trouble handling very large files on Windows
machines (I guess there is a max file size on Windows)?
In Lucene, I think there is a maxDoc kind of parameter that you can use to
specify, when th
Hi,
It seems to me that, in theory, Lucene's storage code could use true UTF-8 to
store terms. Maybe it is just a legacy issue that modified UTF-8 is
used?
Cheers,
Jian
On 8/26/05, Marvin Humphrey <[EMAIL PROTECTED]> wrote:
>
> Greets,
>
> [crossposted to java-user@lucene.apache.org and [E
y consider making those
> changes.
>
> Erik
>
>
> On Aug 26, 2005, at 3:12 PM, jian chen wrote:
>
> > Hi, Erik,
> >
> > I some time ago played with the Lucene 1.2 source code and made some
> > modifications to it, trying to add my own ranking algorithm.
Hi, Erik,
Some time ago I played with the Lucene 1.2 source code and made some
modifications to it, trying to add my own ranking algorithm. I am not sure
if, license-wise, it is permissible to modify the earlier source code, and also
if it is allowed to put the modified version or the description of
Hi,
I don't think it does so by default. But you can certainly serialize
the Java object and use Base64 to encode it into a text string; then
you can store it as a field.
Cheers,
Jian
On 8/25/05, Kevin L. Cobb <[EMAIL PROTECTED]> wrote:
> I just had a thought this morning. Does Lucene have t
Right. My philosophy is: make it work, then make it better.
Don't waste time on something that you are not sure would cause a
performance problem.
Jian
On 8/23/05, Paul Elschot <[EMAIL PROTECTED]> wrote:
> On Tuesday 23 August 2005 19:01, Miles Barr wrote:
> > On Tue, 2005-08-23 at 13
Hi,
I hacked Lucene 1.2 a little while ago and I am trying to use my
own similarity algorithm. If you are interested in the changes I have
made to Lucene 1.2, you can email me back at chenjian1227 at
gmail.com
Cheers,
Jian
On 8/18/05, Karl Koch <[EMAIL PROTECTED]> wrote:
> Hello Lucen
learning it...
>
> I have asked the kind people of Derby-users, and they say there is no
> solution for this yet.
>
> I guess we can ask the people on the -developer list....
>
>
> On 8/13/05, jian chen <[EMAIL PROTECTED]> wrote:
> > Hi,
> >
> > I am
Hi,
I am also interested in that. I haven't used Derby before, but it
seems to be the Java database of choice, as it is open source and a full
relational database.
I plan to learn the simple usage of Derby and then think about
integrating Derby with Lucene.
Maybe we should post our progress for the int
Well, good practice, I think, is to decouple the back end from the
front end as much as possible. You might have different versions of
Java running on each end, and there might also be code compatibility
issues between different versions.
Jian
On 8/10/05, Andrew Boyd <[EMAIL PROTECTED]> wrote:
> Q
Hi, Dan,
I think the problem you mentioned is one that has been discussed a
lot of times on this mailing list.
The bottom line is that you'd better use the compound file format to store
indexes. I am not sure Lucene 1.3 has that available, but, if
possible, can you upgrade to Lucene 1.4.3?
Cheers,
ey are
> not recorded in segments file, they are irrelevant.
>
> Otis
> P.S.
> Did you ask you locking in Lucene the other day?
>
>
> --- jian chen <[EMAIL PROTECTED]> wrote:
>
> > Hi, Otis,
> >
> > Thanks for your email. As this is very important
ed on this list so far was
> the corruption of the segments file, and even that people have been
> able to manually edit with a hex editor.
>
> Otis
>
>
> --- jian chen <[EMAIL PROTECTED]> wrote:
>
> > Hi,
> >
> > I know Lucene does not have trans
Hi,
I know Lucene does not have transaction support at this stage.
However, I want to know what will happen if there is an operating
system crash during the indexing process. Will the Lucene index get
corrupted?
Thanks,
Jian
-
Yeah, an RDBMS makes sense. In this case, would it be better to simply
store those in a relational database and just use Lucene to do
indexing of the text?
Cheers,
Jian
On 7/7/05, Leos Literak <[EMAIL PROTECTED]> wrote:
> I know the answear, but just for curiosity:
>
> have you guys ever thought
Well, I guess Lucene's span queries use the Cover Density based model
(a proximity model). However, they operate within the TF*IDF framework
as well.
Jian
On 7/4/05, Dave Kor <[EMAIL PROTECTED]> wrote:
> Quoting [EMAIL PROTECTED]:
>
> > Hi everybody,
> >
> > which kind of retrieval model is lucene
TED]> wrote:
> Thanks Jian
>
> I need to retrive the original document sometimes. I did not quite understand
> your second suggestion.
> Can you please help me understand better, a pointer to some web resource will
> also help.
>
> jian chen <[EMAIL PROTECTED]> wro
Hi,
Depending on the operating system, there might be a hard limit on the
number of files in one directory (some Windows versions). Even on
operating systems that don't have a hard limit (Linux), it is still better
not to put too many files in one directory.
Typically, the file system won't be very
Hi,
I would use a pure span or cover density based ranking algorithm which
does not take document length into consideration (tweaking whatever is
currently in the standard Lucene distribution?).
For example, searching for the keywords "beautiful house", span/cover
ranking will treat a long document and
Hi, Naimdjon,
I have some suggestions as well along the lines of Mark Harwood.
As an example, suppose for each hotel room there is a description, and
you want the user to do free text search on the description field.
You could do the following:
1) store hotel room reservation info as rows in a
Hi,
I am looking at and trying to understand more about Lucene's
reader/writer synchronization. Does anyone know when the commit.lock
is released? I could not find it anywhere in the source code.
I did see the write.lock is released in IndexWriter.close().
Thanks,
---
Hi,
I am looking at and trying to understand more about Lucene's
reader/writer synchronization. Does anyone know when the commit.lock
is released? I could not find it anywhere in the source code.
I did see the write.lock is released in IndexWriter.close().
Thanks,
---
Hi,
I haven't heard anything back. Probably this email got lost on the way
or something.
Anyway, could anyone enlighten me on this?
Thanks,
Jian
-- Forwarded message --
From: jian chen <[EMAIL PROTECTED]>
Date: Jun 26, 2005 12:59 PM
Subject: when is the commit.lo
Hi,
Recently I looked at the locking mechanism of Lucene. If I am correct,
I think the process of grabbing the lock file will time out by
default in 10 seconds. When the process times out, it will print out
an IOException.
The Lucene locking mechanism is not for threads within the same JVM. It
u
Hi,
I am looking at and trying to understand more about Lucene's
reader/writer synchronization. Does anyone know when the commit.lock
is released? I could not find it anywhere in the source code.
I did see the write.lock is released in IndexWriter.close().
Thanks,
Jian
-
Hi,
I have a stupid question regarding the transient nature of the document ids.
As I understand it, documents will obtain new doc ids during an index
merge. Suppose you do a search and get the Hits object. While you
iterate through the documents by id, an index merge happens. How do the
merge and
Hi,
I think a Span query in general has to do more work than a simple Phrase
query. A Phrase query, in its simplest form, just tries to find all the
terms adjacent to each other. Meanwhile, a Span query's terms do not
necessarily have to be adjacent to each other; there can be other words in between.
Therefore, I
Hi,
You may look at this website
http://www.zilverline.org
Cheers,
Jian
On 6/21/05, Markus Atteneder <[EMAIL PROTECTED]> wrote:
> I am looking for a SearchEngine for our Intranet and so i deal with Lucene.
> I have read the FAQ and some Postings and i got first experiences with it
> and now i
Hi,
optimize() merges the index segments into one single index segment. In
your case, I guess the 2G index segment is quite large; if you merge
it with any other small index segments, the merging process will
definitely be slow.
I think the performance should be OK without calling optimize().
Mor
owercasing, and such)
> and separate CJK characters into separate tokens also.
>
> Erik
>
>
> On May 31, 2005, at 5:49 PM, jian chen wrote:
>
> > Hi,
> >
> > Interesting topic. I thought about this as well. I wanted to index
> > Chinese text with En
Hi,
Interesting topic. I thought about this as well. I wanted to index
Chinese text mixed with English, i.e., I want to treat the English text
inside Chinese text as English tokens rather than as Chinese text tokens.
Right now I think maybe I have to write a special analyzer that takes
the text input, and