Hi all,
I am searching for a way to ignore XML tags in the input when indexing. Is
there built-in functionality in Lucene to get this done?
I am sorry if this was discussed before. I searched but couldn't find a
clear solution.
Thanks in advance
Kalani
--
Kalani Ruwanpathirana
Department of C
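Lucene core (as of 2.3) has no built-in XML tag stripper; Solr at the time shipped an HTMLStripReader that can be borrowed, or the markup can be removed before analysis. A minimal regex-based sketch (class and method names here are illustrative, and a regex is no substitute for a real XML parser on messy input):

```java
// Sketch: strip XML/HTML tags from input text before handing it to an
// analyzer for indexing.
public class TagStripper {
    // Removes anything between '<' and '>', then collapses the leftover
    // whitespace. Fine for simple, well-formed markup; for anything messy,
    // parse with a real XML parser and index only the text nodes.
    public static String strip(String xml) {
        return xml.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(strip("<doc><title>Hello</title> <body>world</body></doc>"));
    }
}
```

For well-formed documents, running a proper XML parser and feeding only the extracted text to the IndexWriter is the more robust route.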
On Wed, Jul 23, 2008 at 7:47 PM, Jamie <[EMAIL PROTECTED]> wrote:
> Could this error be the result of the bad file descriptor close bug as
> described in
> http://256.com/gray/docs/misc/java_bad_file_descriptor_close_bug.shtml.
Hmmm, that's an interesting read.
Seems like maybe we should kill most
I think that is the best strategy at this point.
The head of 2.3 has a workaround (that so far *seems* to work around)
for that JRE bug.
Mike
Jamie wrote:
Hi
I feel like we are having to tiptoe across JRE bugs to get this to
work right. I am definitely not pointing fingers, since the
Hi
I feel like we are having to tiptoe across JRE bugs to get this to work
right. I am definitely not pointing fingers, since the issues and their
resolutions are complex but
I would appreciate some insight on the most reliable combination of JRE
6 and Lucene. I cannot downgrade the JRE to 5
Hi All,
I found something interesting
Could this error be the result of the bad file descriptor close bug as
described in
http://256.com/gray/docs/misc/java_bad_file_descriptor_close_bug.shtml.
This would definitely fit the description since this happened on JRE
1.6u3 apparently, up
Sebastin,
Lucene is just like a plain database table: it doesn't have a uniqueness
constraint, so you can have two documents with the exact same content.
What you should do is check for duplication before adding. If a
duplicate is found, delete the old Document and add the new Document.
This way
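As a sketch of the delete-then-add pattern described above: Lucene's IndexWriter.updateDocument(Term, Document), available since 2.1, performs the "delete any old copy, then add" step in one call, keyed on a unique "id" field that the application maintains itself (Lucene enforces no uniqueness; the field name and helper here are illustrative):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class UpsertExample {
    // Deletes all documents whose "id" term matches, then adds doc,
    // so the index never holds two copies for the same id.
    static void upsert(IndexWriter writer, String id, Document doc) throws Exception {
        doc.add(new Field("id", id, Field.Store.YES, Field.Index.UN_TOKENIZED));
        writer.updateDocument(new Term("id", id), doc);
    }
}
```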
On 23 Jul 2008, at 22:08, Cam Bazz wrote:
hello -
if I make a query and get the document ids and delete with the
document id -
could there be a side effect?
my index is committed periodically, but I cannot say when it is
committed.
The only thing is that the deletions will not be visible u
Hello,
i found my mistake.
i forgot to add the files path to the document.
greetings
Johannes Dorn
On 23.07.2008 at 23:14, Johannes Dorn wrote:
Hello,
I am quite new to Lucene.
I've added a search to my application by combining the getting
started application and the xml#1 contribution.
Hello,
I am quite new to Lucene.
I've added a search to my application by combining the getting started
application and the xml#1 contribution.
Here are the methods used to generate the index.
public static void generateIndex() {
try {
if (!docD
On 07/23/2008 at 5:09 PM, Steven A Rowe wrote:
> Karl Wettin's recently committed ShingleMatrixAnalyzer
Oops, "ShingleMatrixAnalyzer" -> "ShingleMatrixFilter".
Steve
Hi Ryan,
Well, at 100 million+ keywords, Lucene might be the right tool.
One thing that you might check out for the query side is Karl Wettin's recently
committed ShingleMatrixAnalyzer (not in any Lucene release yet - only on the
trunk).
The JUnit test class TestShingleMatrixFilter has an exam
hello -
if I make a query and get the document ids and delete with the document id -
could there be a side effect?
my index is committed periodically, but I cannot say when it is committed.
best regards,
-c.b.
Heh, actually I'm using Perl but I've always associated text-search
with Lucene, I'm not sure if it's the best solution or not. On the
small side there are 1.6 million keywords, on the large side there are
well over 100 million but I might find another way to break down the
searches into sm
Hi Ryan,
I'm not sure Lucene's the right tool for this job.
I have used regular expressions and ternary search trees in the past to do
similar things.
Is the set of keywords too large for an in-memory solution like these? If not,
consider using a tool like the Perl package Regex::PreSuf
You need to invert the process. Using Lucene may not be the best option... You
need to make your document a key into an index of keywords. I've done the
same thing, but not with Lucene. You need to pass through the document and for
each word (token) look it up in some index (hashtable) to find po
Everything I've read and seen about Lucene is searching for keywords in
documents; I want to do the reverse. I have a huge list of
keywords ("big boy", "red ball", "computer") and I have phrases that I
want to check to see if the keywords are in. For example, using the small
keyword list above(store in documents
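The reverse lookup described above can be done without Lucene when the keyword list fits in memory: tokenize the phrase and probe a hash set with token n-grams (shingles) up to the longest keyword's word count. A sketch with illustrative names:

```java
import java.util.*;

// Sketch of the "inverted" lookup: instead of searching documents for
// keywords, tokenize the document and probe a keyword set. Multi-word
// keywords are handled by checking token n-grams up to maxWords long.
public class KeywordMatcher {
    public static Set<String> match(Set<String> keywords, String text, int maxWords) {
        String[] tokens = text.toLowerCase().split("\\W+");
        Set<String> hits = new LinkedHashSet<String>();
        for (int i = 0; i < tokens.length; i++) {
            StringBuilder gram = new StringBuilder();
            // Grow the n-gram one token at a time and test each length.
            for (int n = 0; n < maxWords && i + n < tokens.length; n++) {
                if (n > 0) gram.append(' ');
                gram.append(tokens[i + n]);
                String candidate = gram.toString();
                if (keywords.contains(candidate)) hits.add(candidate);
            }
        }
        return hits;
    }
}
```

At 100 million+ keywords the set no longer fits comfortably in a plain HashSet, which is where the on-disk term dictionary of a Lucene index (or the shingle-based approach mentioned elsewhere in this thread) starts to pay off.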
How reliable is the version in the trunk? Is it OK for production?
On Wed, Jul 23, 2008 at 5:25 PM, Yonik Seeley <[EMAIL PROTECTED]> wrote:
> It's in the lucene trunk (current development version).
> IndexWriter.deleteDocuments(Query query)
>
> -Yonik
>
> On Wed, Jul 23, 2008 at 9:53 AM, Cam Ba
This log looks healthy.
Mike
Jamie wrote:
Hi
The index log file is attached. Many thanks in advance for your
consideration!
Jamie
Jamie wrote:
Wasn't there some index corruption issue with Java 1.6 and Lucene
2.3.2? Could this be the problem?
Jamie
Jamie wrote:
Hi Matthew
Thanks
Mindaugas Žakšauskas wrote:
On Wed, Jul 23, 2008 at 5:46 PM, Matthew Hall <[EMAIL PROTECTED]
> wrote:
<..>As for the Java 1.6 Lucene 2.3.2 index corruption issue <..>
Correct me if I'm wrong, but doesn't the particular Sun bug [1] only
manifest itself if the -Xbatch option is used? Also, the ex
On Wed, Jul 23, 2008 at 5:46 PM, Matthew Hall <[EMAIL PROTECTED]> wrote:
> <..>As for the Java 1.6 Lucene 2.3.2 index corruption issue <..>
Correct me if I'm wrong, but doesn't the particular Sun bug [1] only
manifest itself if the -Xbatch option is used? Also, the exceptions
mentioned in LUCENE-1282
org.apache.lucene.index.CheckIndex checks the index for corruption,
and (if you specify -fix) will "repair" the index by removing any
segments that have problems.
There is this issue with Java 1.6.0_0{4,5}:
https://issues.apache.org/jira/browse/LUCENE-1282
Sun is making progress on fi
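For reference, CheckIndex is run from the command line; a sketch assuming the Lucene 2.3 core jar and an index directory path (both are placeholders here):

```shell
# Inspect the index for corruption (read-only):
java -cp lucene-core-2.3.2.jar org.apache.lucene.index.CheckIndex /path/to/index

# Attempt repair: -fix drops whole segments that fail the check, and the
# documents in those segments are lost -- back up the index first.
java -cp lucene-core-2.3.2.jar org.apache.lucene.index.CheckIndex /path/to/index -fix
```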
I'm not sure which file in particular would be the one
corrupter/missing, which is why I suggested looking at the index with luke.
As for the Java 1.6 Lucene 2.3.2 index corruption issue, I'm not 100%
familiar with the details on that one, but as a quick test, you should
be able to swap to a 1
Hi
The index log file is attached. Many thanks in advance for your
consideration!
Jamie
Jamie wrote:
Wasn't there some index corruption issue with Java 1.6 and Lucene
2.3.2? Could this be the problem?
Jamie
Jamie wrote:
Hi Matthew
Thanks in advance for the suggestion.
Which file do you
Wasn't there some index corruption issue with Java 1.6 and Lucene 2.3.2?
Could this be the problem?
Jamie
Jamie wrote:
Hi Matthew
Thanks in advance for the suggestion.
Which file do you think does not exist?
This is what we have:
_15zw.cfs _19od.cfs _1a5d.cfs _1a7n.cfs _1ahf.cfs _1ahh
Hi Matthew
Thanks in advance for the suggestion.
Which file do you think does not exist?
This is what we have:
_15zw.cfs _19od.cfs _1a5d.cfs _1a7n.cfs _1ahf.cfs _1ahh.cfs
_qzl.cfs segments.gen
_1993.cfs _1a0w.cfs _1a7c.cfs _1a9m.cfs _1ahg.cfs _1ahi.cfs
segments_158j
Aside
Did you try to open the index using Luke?
Luke will be able to tell you whether or not the index is in fact
corrupted, but looking at your stack trace, it almost looks like the
file.. simply isn't there?
Matt
Jamie wrote:
Hi Everyone
I am getting the following error when executing Hi
Hi Everyone
I am getting the following error when executing Hits hits =
searchers.search(query, queryFilter, sort):
18007414-java.io.IOException: Bad file descriptor
18007455- at java.io.RandomAccessFile.seek(Native Method)
18007504- at
org.apache.lucene.store.FSDirectory$FSI
You can also use Luke after you've created your indexes to get their
exact size, and other interesting data points.
Like Ian said though, the decisions you make on a field by field basis
will make your index size vary quite a bit, so probably the best thing
you could do is simply try it out, a
I think there are too many variables to give a simple answer.
How much of your data are you storing? Indexing? Compressing?
Get a representative sample of your data and try it out.
--
Ian.
On Wed, Jul 23, 2008 at 5:00 PM, Ariel <[EMAIL PROTECTED]> wrote:
> I need to know what is the percent
I need to know the size of Lucene's index as a percentage of the
information I'm going to index. I have read some articles that say if I
index 120 GB of information the index will grow to 40 GB, which means the
percentage is 30%. Could somebody tell me how that can be proved?
Is there an
It's in the lucene trunk (current development version).
IndexWriter.deleteDocuments(Query query)
-Yonik
On Wed, Jul 23, 2008 at 9:53 AM, Cam Bazz <[EMAIL PROTECTED]> wrote:
> hello,
>
> was not there a lucene delete by query feature coming up? I remember
> something like that, but I could not fin
OK, I'm finally catching on. You have to change the demo code to
get the contents into something besides an input stream, so you
can use one of the alternate forms of the Field constructor. For
instance, you could read it all into a string and use the form:
doc.add(new Field("content", ,
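A sketch of the change described above to the demo code, assuming the platform default encoding is acceptable (names are illustrative). The Lucene call is shown as a comment since it needs the era-appropriate Field constants:

```java
import java.io.*;

// Read a file fully into a String so the String form of the Field
// constructor (which allows Field.Store.YES) can be used instead of
// the Reader form the demo uses.
public class ReadFully {
    public static String readAll(File f) throws IOException {
        StringBuilder sb = new StringBuilder();
        BufferedReader in = new BufferedReader(new FileReader(f));
        try {
            char[] buf = new char[4096];
            int n;
            while ((n = in.read(buf)) != -1) sb.append(buf, 0, n);
        } finally {
            in.close();
        }
        return sb.toString();
    }
    // With the contents in a String, the stored and tokenized form becomes:
    // doc.add(new Field("contents", readAll(f),
    //                   Field.Store.YES, Field.Index.TOKENIZED));
}
```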
hello,
was not there a lucene delete by query feature coming up? I remember
something like that, but I could not find an references.
best regards,
-c.b.
This looks like a CLASSPATH issue. You need to make sure whatever jar
contains that class is in the CLASSPATH when you run that line.
Mike
Sandeep K wrote:
I have another problem also.
While parsing the files during indexing I get the following error,
java.lang.NoClassDefFoundError: de/s
Well, yes, that's expected behavior. Lucene makes no attempt to filter
"substantially similar" documents. From Lucene's perspective, you added
it twice, you must have had a good reason.
And no, it doesn't really display the same document twice. There are two
documents in Lucene that happen to have
http://wiki.apache.org/jakarta-lucene/LuceneFAQ#head-06fafb5d19e786a50fb3dfb8821a6af9f37aa831
--
Ian.
On Wed, Jul 23, 2008 at 12:26 PM, sandyg <[EMAIL PROTECTED]> wrote:
>
> Hi ALL,
>
> Please can you help with how to overcome the exception
> org.apache.lucene.search.BooleanQuery$TooManyClauses: maxCl
Hi ALL,
Please can you help with how to overcome the exception
org.apache.lucene.search.BooleanQuery$TooManyClauses: maxClauseCount is set
to 1024?
I got this when using a prefix query like 123*
--
View this message in context:
http://www.nabble.com/BooleanQuery%24TooManyClauses%3A-maxClauseCount-
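Beyond the FAQ entry linked in the reply, a sketch of the usual workaround: raise the clause limit process-wide before running the expanding query (the chosen value is illustrative):

```java
import org.apache.lucene.search.BooleanQuery;

public class RaiseClauseLimit {
    public static void main(String[] args) {
        // A prefix query like 123* expands to one BooleanQuery clause per
        // matching term, so an index with many such terms blows past the
        // default limit of 1024. Raising it trades memory and CPU for
        // breadth; a filter-based query is the usual alternative when the
        // expansion gets very large.
        BooleanQuery.setMaxClauseCount(4096);
    }
}
```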
I have another problem also.
While parsing the files during indexing I get the following error:
java.lang.NoClassDefFoundError: de/sty/io/mimetype/MimeTypeResolver
What is this? I use Lius-1.0.
It's giving the error on the line where I have
IndexerFactory.getIndexer(file, LiusConfig obj)
Please help
OK then this should be fine. That single machine, on receiving a JMS
message, should use a single IndexWriter for making changes to the
index (ie, it should not try to open a 2nd IndexWriter while a
previous one is still working on a previous message).
Mike
Sandeep K wrote:
Thanks a
Thanks a lot Mike,
There will be only one machine which uses IndexWriter, and it's the
JMS server. This server will first create the file in the physical file
system (it's Linux)
and then index the saved file.
Michael McCandless-2 wrote:
>
>
> Sandeep K wrote:
>
>>
>> Hi all..
>> I had a questi
Sandeep K wrote:
Hi all..
I had a question related to the write locks created by Lucene.
I use Lucene 2.3.2. Will this newer version create locks while
indexing as the older ones did?
Or is there any other way that Lucene handles its operations?
It still creates write locks, which are used to en
Hi Erik,
I don't remove stop words, as I index parallel corpora used for learning
the translations between pairs of languages, so every word is important. I
even developed my own analyzer for Arabic which just removes punctuation
and special symbols and returns only Arabic text.