In org.apache.lucene.demo.HTMLDocument you need to change the input stream
to use a different encoding. Replace the fis with this:
fis = new InputStreamReader(new FileInputStream(f), "UTF-16");
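Note that if fis was originally declared as a FileInputStream, its declared type also needs to become a Reader. A minimal sketch of the surrounding lines, assuming (as in the demo) the stream is handed straight to HTMLParser and that the JavaCC-generated HTMLParser also accepts a Reader:

// Hedged sketch: variable names follow the demo; "UTF-16" is only an example,
// use whatever encoding your source files actually have.
Reader fis = new InputStreamReader(new FileInputStream(f), "UTF-16");
HTMLParser parser = new HTMLParser(fis);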
-Original Message-
From: Fred Toth [mailto:[EMAIL PROTECTED]
Sent: Friday, September 24, 2004
Can anyone help me with code to get the topterms of a given field for a
query resultset?
Here is code modified from Luke to get the topterms for a field:
public TermInfo[] mostCommonTerms( String fieldName, int numberOfTerms )
{
//make sure min will get a positive number
i
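The fragment above is cut off. What follows is not the original Luke code but a minimal sketch of the same idea against the 1.4-era TermEnum API, with TermInfo standing in as a simple hypothetical holder for a term and its document frequency:

import java.io.IOException;
import java.util.*;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;

// Hypothetical holder class; Luke's real TermInfo is similar in spirit.
class TermInfo {
    Term term;
    int docFreq;
    TermInfo(Term term, int docFreq) { this.term = term; this.docFreq = docFreq; }
}

public class HighFreqTerms {
    // Walk every term of one field and keep the numTerms with the highest
    // document frequency (simple list-and-sort; Luke uses a priority queue).
    public static TermInfo[] mostCommonTerms(IndexReader reader, String field, int numTerms)
            throws IOException {
        List found = new ArrayList();
        TermEnum terms = reader.terms(new Term(field, ""));
        try {
            do {
                Term t = terms.term();
                if (t == null || !t.field().equals(field)) break;   // past the requested field
                found.add(new TermInfo(t, terms.docFreq()));
            } while (terms.next());
        } finally {
            terms.close();
        }
        Collections.sort(found, new Comparator() {
            public int compare(Object a, Object b) {
                return ((TermInfo) b).docFreq - ((TermInfo) a).docFreq;   // highest first
            }
        });
        List top = found.subList(0, Math.min(numTerms, found.size()));
        return (TermInfo[]) top.toArray(new TermInfo[top.size()]);
    }
}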
Dear Lucene Users:
What is the best way to get the most common terms for a subset of the total
documents in your index?
I know how to get the most common terms for a field for the entire index,
but what is the most efficient way to do this for a subset of documents?
Here is the code I am using t
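The code in this message is cut off above. For what it is worth, one common way to restrict the count to a query's result set (not necessarily what the poster was doing) is to aggregate term vectors over the matching documents. This sketch assumes an open IndexReader "reader", an already-parsed "query", and a field indexed with storeTermVector=true; all names are illustrative:

// Requires that the field was indexed with term vectors, e.g.
//   new Field("contents", text, false, true, true, true)
IndexSearcher searcher = new IndexSearcher(reader);
Hits hits = searcher.search(query);
Map counts = new HashMap();                       // term text -> aggregated frequency
for (int i = 0; i < hits.length(); i++) {
    TermFreqVector vector = reader.getTermFreqVector(hits.id(i), "contents");
    if (vector == null) continue;                 // no term vector stored for this doc
    String[] terms = vector.getTerms();
    int[] freqs = vector.getTermFrequencies();
    for (int j = 0; j < terms.length; j++) {
        Integer old = (Integer) counts.get(terms[j]);
        counts.put(terms[j], new Integer((old == null ? 0 : old.intValue()) + freqs[j]));
    }
}
// sort the map entries by value to get the most common terms for the subset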
A note to developers, the code checked into lucene CVS ~Aug 15th, post
1.4.1, was causing frequent index corruptions. When I reverted back to
version 1.4 I no longer am getting the corruptions.
I was unable to trace the problem to anything specific, but was using the
newer code to take advantage
I sent out an email to this list a few weeks ago about how to fix a corrupt
index. I basically edited the segments file with a hex editor removing the
entry for the missing file and decremented the total count of files from the
file count that is near the beginning of the segments file.
-Orig
-George
--- Honey George <[EMAIL PROTECTED]> wrote:
> Wallen,
> Which hex editor have you used? I am also facing a
> similar problem. I tried to use KHexEdit and it
> doesn't seem to help. I am attaching with this e
http://www.ultraedit.com/ is the best!
However, I cannot imagine how another hex editor wouldn't work.
-Original Message-
From: Honey George [mailto:[EMAIL PROTECTED]
Sent: Tuesday, August 17, 2004 10:35 AM
To: Lucene Users List
Subject: RE: Restoring a corrupt index
Wallen,
Which hex
at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:366)
at TryStuff.tryFixingLuceneIndex(TryStuff.java:60)
at TryStuff.main(TryStuff.java:49)
-Directory listing-
-rw-rw-r--  1 wallen  devs  383461     Jul 27 16:48  _1wtg.cfs
-rw-rw-r--  1 wallen  devs  754131765  Jul 27 21:12  _262q.cfs
-rw-rw-r--  1 wallen  devs  754345785  Jul 29 11:43  _4c49.cfs
-rw-rw-r--  1 wallen  devs  719608798  Jul 31 04:38  _6i6l.cfs
-rw-rw-r--1
A range query that covers the full range does the same thing.
Of course it is also inefficient with term generation: myField:[a TO z]
-Original Message-
From: Patrick Burleson [mailto:[EMAIL PROTECTED]
Sent: Friday, August 13, 2004 3:58 PM
To: Lucene Users List
Subject: Re: Finding All
The date is stored as a Long that is the number of seconds since Jan 1, 1970.
Anything before that would be negative.
-Original Message-
From: Terence Lai [mailto:[EMAIL PROTECTED]
Sent: Wednesday, August 04, 2004 6:25 PM
To: Lucene Users List
Subject: Question on the minimum value for DateFi
Are you certain that you are storing the field "contents" in your documents,
not just tokenizing...
If you use the overloaded method that takes a Reader you lose the content.
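To make the distinction concrete, a hedged illustration of the two 1.4-era Field.Text overloads (the field name is just an example):

doc.add(Field.Text("contents", contentString));         // stored + indexed + tokenized
doc.add(Field.Text("contents", new FileReader(file)));  // indexed + tokenized, NOT stored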
-Original Message-
From: Grant Ingersoll [mailto:[EMAIL PROTECTED]
Sent: Wednesday, July 28, 2004 5:35 PM
To: [EMA
I also question whether it could handle extreme volume with such good query
speed.
Has anyone done numbers with 1+ million documents?
-Original Message-
From: Daniel Naber [mailto:[EMAIL PROTECTED]
Sent: Tuesday, July 20, 2004 5:44 PM
To: Lucene Users List
Subject: Re: Lucene vs. MySQL F
It could also be that your disk space is filling up and the OS runs out of
swap room.
-Original Message-
From: Mark Florence [mailto:[EMAIL PROTECTED]
Sent: Tuesday, July 20, 2004 1:52 PM
To: Lucene Users List
Subject: Very slow IndexReader.open() performance
Hi -- We have a large index
If you know ahead of time which documents are viewable by a certain user
group, you could add a field, such as "group", and when you index the
document put in it the names of the user groups that are allowed to view that
document. Your query tool can then append, for example, "AND group:developers"
to the user's query.
Has anyone had any experience with their index getting corrupted?
Are there any tools to repair it should it get corrupted?
I have not had any problems, but was curious about how resilient this data
store seems to be.
-Will
I have 2 suggestions:
1) use Eclipse, or an IDE that references the javadoc with mouseovers
2) if you are going to create constants, consider using bit flags. Then
your constants can each have a power-of-two value, i.e.
STORED = 1
INDEXED = 2
TOKENIZED = 4
Then you can have the constructor look like:
new Fiel
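The constructor line above is cut off; purely as a hypothetical sketch of the bit-flag suggestion (this is not the actual Lucene Field API):

public static final int STORED    = 1;   // each flag is a power of two
public static final int INDEXED   = 2;
public static final int TOKENIZED = 4;

// flags could then be combined with a bitwise OR in a single int argument
new Field("title", "Some title", STORED | INDEXED | TOKENIZED);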
I do not know how to work around that.
It is indeed an interesting situation that would require more understanding
of how the analyzer (in this case NullAnalyzer) interacts with special
characters such as * and ~.
You could try using the WhitespaceAnalyzer instead of the NullAnalyzer!
The PerFieldAnalyzerWrapper is constructed with your default analyzer;
suppose this is the analyzer you use to tokenize. You then call the
addAnalyzer method for each non-tokenized/keyword field.
In the case below, url is a keyword, all other fields are tokenized:
PerFieldAnalyzerWrapper analyz
Use org.apache.lucene.analysis.PerFieldAnalyzerWrapper
Here is how I use it:
PerFieldAnalyzerWrapper analyzer =
    new org.apache.lucene.analysis.PerFieldAnalyzerWrapper(new MyAnalyzer());
analyzer.addAnalyzer("url", new NullAnalyzer());
try
Can anyone give me advice on the best way to not have your keyword fields
analyzed by QueryParser?
It seems like it would be a common problem, but I have read the FAQ and
found only this relevant thread, with no real answers:
http://issues.apache.org/eyebrowse/[EMAIL PROTECTED]he.org&msgId=12
use forward slashes / instead of \ for your path:
c:/apache/group/index
OR if c: is your main drive
/apache/group/index
-Original Message-
From: Hetan Shah [mailto:[EMAIL PROTECTED]
Sent: Monday, June 21, 2004 5:55 PM
To: [EMAIL PROTECTED]
Subject: Demo 3 on windows
Hello,
I have bee
This depends on the analyzer you use.
http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi?file=chapter.indexing&toc=faq#q13
-Original Message-
From: Lynn Li [mailto:[EMAIL PROTECTED]
Sent: Friday, June 18, 2004 5:03 PM
To: '[EMAIL PROTECTED]'
Subject: search "" and ""
When search
It sounds to me like you need a newer version of Java.
-Original Message-
From: milind honrao [mailto:[EMAIL PROTECTED]
Sent: Wednesday, June 02, 2004 5:36 PM
To: [EMAIL PROTECTED]
Subject: help needed in starting lucene
Hi,
I am just a beginner. I installed lucene according to the int
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexWriter.html#DEFAULT_MAX_FIELD_LENGTH

maxFieldLength
public int maxFieldLength
The maximum number of terms that will be indexed for a single field in a
document. This limits the amount of memory required for indexing, so that co
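A hedged example of raising that limit via the public field the javadoc above describes (path and analyzer are placeholders):

IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), true);
writer.maxFieldLength = 100000;   // default is 10,000 terms per field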
This sounds like a memory leakage situation. If you are using tomcat I
would suggest you make sure you are on a recent version, as it is known to
have some memory leaks in version 4. It doesn't make sense that repeated
queries would use more memory than the most demanding query unless objects
are
My understanding is that hard drive IO is the main bottleneck, as the
operation is mainly a file copy. So to directly answer your question, I
believe the overall file size of your indexes will linearly affect the
performance profile of your optimizations.
-Original Message-
From: Michael
Make sure you close your indexwriter.
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexWriter.html#close()
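A minimal sketch of the usual pattern (writer arguments, analyzer, and doc are placeholders):

IndexWriter writer = new IndexWriter(indexDir, analyzer, false);
try {
    writer.addDocument(doc);
} finally {
    writer.close();   // flushes buffered documents and the segments file
}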
-Original Message-
From: Steve Rajavuori [mailto:[EMAIL PROTECTED]
Sent: Friday, May 21, 2004 7:49 PM
To: '[EMAIL PROTECTED]'
Subject: Rebuild after corruption
I am not sure. See what Google gives you. I would guess you need to get a
table of entities and compare it to the Unicode character. So if you parse
the Word file you might see something like "&u12312;" (without quotes); this
corresponds to a single Unicode character, and you can use the Java API t
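Purely as an illustration of that last step (the "&u12312;" form quoted above is non-standard, so this sketch uses the usual "&#NNNN;" numeric form):

String entity = "&#12312;";                                      // numeric character reference
int codePoint = Integer.parseInt(entity.substring(2, entity.length() - 1));
char ch = (char) codePoint;                                      // the corresponding Unicode character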
I believe MS apps store non-ASCII characters as entities internally instead
of using Unicode. You can see evidence of this if you save your file as an
HTML file and look at the source. You will have to adjust your parser to
convert the Windows-1252 characters/entities to Unicode (UTF-8 or UTF-16).
Here is an example method in org.apache.lucene.demo.html HTMLParser that
uses a different buffered reader for a different encoding.
public Reader getReader() throws IOException
{
    if (pipeIn == null)
    {
        pipeInStream = new MyPip
Is it possible to append to an existing document?
Judging by my own tests and this thread, NO.
http://issues.apache.org/eyebrowse/[EMAIL PROTECTED]he.org&msgNo=3971
Wouldn't it be possible to look up an individual document (based upon a uid
of sorts), then load the Fields off of the old one, del
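A hedged sketch of that delete-and-re-add workaround with the 1.4 API, assuming every document carries a unique Field.Keyword("uid", ...) value (all names are illustrative):

IndexReader reader = IndexReader.open(indexPath);
reader.delete(new Term("uid", uid));        // remove the old version of the document
reader.close();

IndexWriter writer = new IndexWriter(indexPath, analyzer, false);
writer.addDocument(rebuiltDoc);             // re-add with the old fields plus the new ones
writer.close();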