Hi Doron,
Thank you very much for your time and for the detailed explanations. This is
exactly what I meant and I am happy to see I understood correctly.
I am now using the Snowball analyzer, which seems to work very well.
Thanks again and good day,
Barak Hecht.
If mergeFactor is set to 2 and no optimize() is ever done on the index,
what is the impact on
1) the number of opened files during indexing
2) the number of opened files during searching
3) the search speed
4) the indexing speed
??
HC
---
I would say that, over time, the number of files will grow, and continue
growing if you never perform
an optimize(). After some very helpful advice from Erick I settled on a
mergeFactor of 30, and since I do the indexing in large batches, I perform
an optimize() only at the end of the indexi
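For intuition on why a low mergeFactor keeps the file count small: Lucene's logarithmic merge policy merges every mergeFactor same-sized segments into one larger segment, so after n flushes roughly "the sum of the digits of n in base mergeFactor" segments remain. A toy simulation of just that merge arithmetic (this is not Lucene code, only a model of the policy):

```java
import java.util.HashMap;
import java.util.Map;

public class MergeSim {
    // Toy model of a logarithmic merge policy: each flush creates a
    // size-1 segment; whenever mergeFactor equal-sized segments exist,
    // they are merged into one segment mergeFactor times as large.
    static int segmentsAfter(int flushes, int mergeFactor) {
        Map<Long, Integer> bySize = new HashMap<Long, Integer>(); // size -> count
        for (int i = 0; i < flushes; i++) {
            long size = 1L;
            while (true) {
                int n = bySize.getOrDefault(size, 0) + 1;
                if (n < mergeFactor) { bySize.put(size, n); break; }
                bySize.put(size, 0);   // merge the full level away...
                size *= mergeFactor;   // ...carrying into one bigger segment
            }
        }
        int total = 0;
        for (int c : bySize.values()) total += c;
        return total;
    }

    public static void main(String[] args) {
        // After 1000 flushes: mergeFactor 2 leaves 6 segments (the
        // binary digit sum of 1000), mergeFactor 30 leaves 14.
        System.out.println(segmentsAfter(1000, 2));  // 6
        System.out.println(segmentsAfter(1000, 30)); // 14
    }
}
```

So a small mergeFactor keeps few files open but merges (and thus re-writes data) far more often, which is the indexing-speed trade-off being asked about.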
Dawid Weiss wrote:
> You could also try splitting the document into paragraphs and use Carrot2's
> Lingo algorithm (www.carrot2.org) on a paragraph-level to extract clusters.
> Labelling routine in Lingo should extract 'key' phrases; this analysis is
> heavily frequency-based, but... you know, y
On 5/7/07, Doron Cohen <[EMAIL PROTECTED]> wrote:
With a query parser set to allowLeadingWildcard, this should do:
( +item -price:* ) ( +item +price:[0100 TO 0150] )
or, to avoid the too-many-clauses risk:
( +item -price:[MIN TO MAX]) ( +item +price:[0100 TO 0150] )
where MIN and MAX cover (at least)
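A side note on the zero-padded bounds (0100, 0150): Lucene's text range queries compare terms lexicographically, so numeric values must be indexed in a fixed width or "99" would sort after "150". A quick pure-Java check of that ordering (the pad helper is illustrative, not a Lucene API):

```java
public class PaddedRange {
    // Lucene's [A TO B] ranges compare terms as strings, so numbers
    // must be encoded in a fixed width to sort numerically.
    static String pad(int price) {
        return String.format("%04d", price);
    }

    public static void main(String[] args) {
        // Unpadded, "99" sorts after "150" lexicographically:
        System.out.println("99".compareTo("150") > 0);        // true
        // Padded, string order matches numeric order:
        System.out.println(pad(99).compareTo(pad(150)) < 0);  // true
        System.out.println("price:[" + pad(100) + " TO " + pad(150) + "]");
    }
}
```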
Hi Arsen,
I've seen another commercial one from a company called Connexor
(www.connexor.com) . It has a decent part-of-speech tagger that could be
used in keyphrase extraction with some heuristics on top of it.
-vishal.
-Original Message-
From: Mark Miller [mailto:[EMAIL PROTECTED]
S
The contrib/benchmark addition can help you characterize many of
these scenarios, especially if you write a DocMaker and QueryMaker
for your collection.
On May 8, 2007, at 5:30 AM, Stadler Hans-Christian wrote:
If mergeFactor is set to 2 and no optimize() is ever done on the
index,
what i
Last week I sent a message with a question about FuzzyQuery. I am working with
Lucene 2.1.0.
I would like to retrieve documents containing the strings
"société américaine" and "sociétés américaines"
using a fuzzy query on the string "société américain".
I created a method "getDocuments" and I
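One thing worth checking here: assuming a tokenizing analyzer, "société américaine" is indexed as two separate terms, and FuzzyQuery compares the query term against each indexed term using an edit-distance similarity (in Lucene 2.1, roughly 1 - distance / min(length), with a default threshold of 0.5). A self-contained check that both the singular and plural forms clear that threshold:

```java
public class FuzzySim {
    // Levenshtein edit distance, the measure FuzzyQuery is based on.
    static int editDistance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        return d[a.length()][b.length()];
    }

    // FuzzyQuery-style similarity: 1 - distance / min(len, len).
    static float similarity(String query, String term) {
        return 1f - (float) editDistance(query, term)
                  / Math.min(query.length(), term.length());
    }

    public static void main(String[] args) {
        // Both indexed terms clear the default 0.5 threshold:
        System.out.println(similarity("américain", "américaine"));  // ~0.89
        System.out.println(similarity("américain", "américaines")); // ~0.78
    }
}
```

So with the default minimum similarity, a query like américain~ should match both forms; if it does not, the analyzer (accent or stemming behavior) is the first thing to inspect.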
Here is a very good tool for keyphrase extraction. It is GNU-licensed and
easy to integrate with Lucene.
http://www.paynter.info/academia/Kea.php
best
jose
On 5/8/07, Bill Janssen <[EMAIL PROTECTED]> wrote:
Dawid Weiss wrote:
> You could also try splitting the document into paragraphs and use Carr
I have a use case, in which I need to select the Analyzer based on a Locale.
For example:
"nl" => DutchAnalyzer
"nl_BE" => DutchAnalyzer
"fr" => FrenchAnalyzer
"foobar" => StandardAnalyzer (fallback)
I was wondering if lucene has any sort of "AutomaticAnalyzerResolver"
class that could do this f
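A minimal sketch of such a resolver (the class and method names here are hypothetical; real code would register and return Analyzer instances rather than the name strings used to keep the sketch self-contained). The key piece is the fallback chain: exact locale, then bare language, then the default:

```java
import java.util.HashMap;
import java.util.Map;

public class AnalyzerResolver {
    // Hypothetical registry; real code would map to Analyzer instances
    // (DutchAnalyzer, FrenchAnalyzer, ...) instead of names.
    private final Map<String, String> registry = new HashMap<String, String>();
    private final String fallback;

    AnalyzerResolver(String fallback) { this.fallback = fallback; }

    void register(String locale, String analyzer) { registry.put(locale, analyzer); }

    // Try the full locale ("nl_BE"), then the bare language ("nl"),
    // then fall back to the default.
    String resolve(String locale) {
        String found = registry.get(locale);
        if (found == null && locale.indexOf('_') >= 0) {
            found = registry.get(locale.substring(0, locale.indexOf('_')));
        }
        return found != null ? found : fallback;
    }

    public static void main(String[] args) {
        AnalyzerResolver r = new AnalyzerResolver("StandardAnalyzer");
        r.register("nl", "DutchAnalyzer");
        r.register("fr", "FrenchAnalyzer");
        System.out.println(r.resolve("nl_BE"));  // DutchAnalyzer
        System.out.println(r.resolve("foobar")); // StandardAnalyzer
    }
}
```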
There is nothing canned that I know of. I'm also not sure how this
would be used. If you're using a single index, how are you going
to index, then search using these analyzers? Or is there some
other magic going on?
Consider your document with a field "text". If you index into this
field with dif
Hi,
- Original Message
From: Stadler Hans-Christian <[EMAIL PROTECTED]>
If mergeFactor is set to 2 and no optimize() is ever done on the index,
what is the impact on
1) the number of opened files during indexing
OG: it will grow a little, but frequently go down as Lucene merges segments
: There is nothing canned that I know of. I'm also not sure how this
: would be used. If you're using a single index, how are you going
: to index, then search using these analyzers? Or is there some
: other magic going on?
i suspect the use case is "shipped" software product, where you want to
h
I don't understand why I'm getting the results I'm getting.
If I search for "pandock*" I get 6 results
Np-pandock
Np-pandock-L
Np-pandock-1
Np-pandock-2
Np-pandock
Np-pandock-L1
If I search for np-pandock I get
Np-pandock
Np-pandock-L
If I search for pandock I get
Np-pandock
First question: What analyzers are you using at index and search time?
Second question: Have you tried using query.toString() to see how
the query parses? If so, you should post the results.
Third question: Have you used Luke to examine your index, to see
what's actually in there (which may surp
Mark Miller wrote:
The only commercial options that I have seen do not have a web presence
(that I know of or can find) and I don't recall the company names (only
peripherally involved).
Are we talking about Yahoo's buzz index and
Amazon's SIPs or CAPs?
I actually think the most interesting a
Here are the queries.
In my code I have:

    System.out.println("luceneQuery: " + luceneQuery);
    query = MultiFieldQueryParser.parse(luceneQuery.toString(),
        IndexerExternal.CARTABLE, IndexerExternal.getAnalyzer());
    System.out.printl
On Tuesday 08 May 2007 23:42, John Powers wrote:
> I've had problems with luke in the past not being able to read the
> files.
Just make sure you specify the directory, not the files when opening an
index with Luke. Also use the latest version (0.7).
Regards
Daniel
--
http://www.danielnaber
I am indexing documents periodically every hour. I have a scenario.
For example, when you are indexing every hour and a large document set
is present, it takes >1 hr to index the documents. Now you are
already behind indexing for the next hour. How do you design
something that is robust?
thanks.
Don't do it that way? Is this an actual or a theoretical
scenario? And do you reasonably expect it to become actual?
Otherwise, why bother?
And you've got other problems here. If you're indexing that
much data, you'll soon outgrow your disk. Unless you're
replacing most of the documents.
But assu
: For example, when you are indexing every hour and a large document set
: is present, it takes >1 hr to index the documents. Now you are
: already behind indexing for the next hour. How do you design
: something that is robust?
fundamentally, this question is really about issues in a producer/con
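One standard shape for that producer/consumer decoupling is a bounded queue between the document feed and a single indexer thread: the hourly job just enqueues, and backpressure (a full queue) throttles it instead of letting runs overlap. A minimal sketch, with the actual addDocument() call replaced by a counter so it stays self-contained:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class IndexPipeline {
    private static final String POISON = "__STOP__"; // shutdown marker

    // Feed 'docs' documents through a bounded queue to one indexer
    // thread; returns how many documents were "indexed".
    static int run(int docs) throws InterruptedException {
        final BlockingQueue<String> queue = new ArrayBlockingQueue<String>(100);
        final int[] indexed = {0};

        Thread consumer = new Thread(new Runnable() {
            public void run() {
                try {
                    String doc;
                    while (!(doc = queue.take()).equals(POISON)) {
                        indexed[0]++; // stand-in for writer.addDocument(...)
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
        consumer.start();

        // Producer: if the indexer falls behind, put() blocks rather
        // than letting unbounded hourly batches pile up in memory.
        for (int i = 0; i < docs; i++) queue.put("doc-" + i);
        queue.put(POISON);
        consumer.join(); // indexed[0] is safely visible after join()
        return indexed[0];
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run(500)); // 500
    }
}
```

With this shape, "more than an hour of documents" simply means the queue stays busy across hour boundaries; nothing is lost and nothing overlaps.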
Hello everyone,
I am getting the following exception while searching:
Lock obtain timed out: java.io.IOException: Lock obtain timed out:
[EMAIL
PROTECTED]:\WINDOWS\TEMP\lucene-22e0ad3c019e26a6e2991b0e6ed97e1c-commit.lock
I have implemented MultiSearcher only, No other methods are updating/addi
How can I get the tag word score in Lucene? Suppose that you have searched a
tag word and 3 hit documents
are now found.
1 - How could someone find the number of occurrences in each document, so it
could be used to sort the results?
Also I want to have some other policies for ranking the results. What should
I do t
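On the occurrence-count part: if term vectors are stored at indexing time, Lucene can report per-document term frequencies (via TermFreqVector in the 2.x API). As a toy illustration of just the frequency-based part of ranking (real Lucene scoring also folds in idf, length norms, and boosts), sorting documents by raw term count looks like:

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class TfRank {
    // Raw occurrences of a (lowercased, whitespace-split) term in a doc.
    static int termFreq(String doc, String term) {
        int n = 0;
        for (String tok : doc.toLowerCase().split("\\s+"))
            if (tok.equals(term)) n++;
        return n;
    }

    // Order document indexes by descending term frequency.
    static Integer[] rank(final List<String> docs, final String term) {
        Integer[] order = new Integer[docs.size()];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, new Comparator<Integer>() {
            public int compare(Integer a, Integer b) {
                return termFreq(docs.get(b), term) - termFreq(docs.get(a), term);
            }
        });
        return order;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
            "lucene search search search",
            "no match here",
            "search once");
        System.out.println(Arrays.toString(rank(docs, "search"))); // [0, 2, 1]
    }
}
```

For custom ranking policies beyond this, the usual Lucene answer is a custom Similarity or sorting on a stored field rather than post-processing hits by hand.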