Hi All,
A bit of self-promotion again :) I hope you don't find it off topic;
after all, some folks are using Carrot2 with Lucene and Solr, and Nutch has
a Carrot2-based clustering plugin.
Staszek
[EMAIL PROTECTED]
___
Hi, John
I think you are spending too much time on I/O; building the index in a
RAMDirectory first would be better. See http://wiki.apache.org/lucene-java/ImproveIndexingSpeed
kai
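Kai's suggestion can be sketched roughly as follows (a minimal sketch against the Lucene 2.x API of that era; the path, analyzer choice, and variable names are placeholders, not from the thread):

```java
// Sketch: batch documents into an in-memory index first, then merge the
// RAM index into the on-disk one in a single pass. The path and the
// analyzer are illustrative assumptions.
RAMDirectory ramDir = new RAMDirectory();
IndexWriter ramWriter = new IndexWriter(ramDir, new StandardAnalyzer(), true);
// ... ramWriter.addDocument(...) calls go here ...
ramWriter.close();

IndexWriter fsWriter = new IndexWriter(
    FSDirectory.getDirectory("/path/to/index"), new StandardAnalyzer(), false);
fsWriter.addIndexes(new Directory[] { ramDir });
fsWriter.close();
```

The wiki page linked above discusses this and related tuning options (buffer sizes, merge factors) in more detail.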
-----Original Message-----
From: Erick Erickson [mailto:[EMAIL PROTECTED]
Sent: Monday, August 13, 2007 1:57
To: java-user@lucene.apache.org
Subject: Re: In
Hi,
Lukas Vlcek wrote:
Enis,
Thanks for your time.
I gave a quick glance at Pig and it seems good (it appears to be directly based
on Hadoop, which I am starting to play with :-). It's obvious that a huge
amount of data (like user queries or access logs) should be stored in flat
files, which makes it co
To me, it looks like what you are trying to achieve is more suited to a
database, which can help you with grouping, sorting, etc. But if you
still want to do it using Lucene, you might want to post some code so
that I can go through it and see why it uses so many resources.
Ch
Hi, Rohit,
You need to create the index reader in the subdirectory where you created
the index files; Lucene's IndexReader won't find your index if you
simply move the index to a subdirectory.
Yes, if you have several index directories, you need to combine them.
But you can achieve this by u
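The suggestion above is cut off, but one common way to search several index directories together is a MultiReader (a sketch against the Lucene 2.x API; the directory paths are placeholders):

```java
// Sketch: open each index directory and search them as one logical
// index via MultiReader. Paths are illustrative assumptions.
IndexReader r1 = IndexReader.open(FSDirectory.getDirectory("/path/index1"));
IndexReader r2 = IndexReader.open(FSDirectory.getDirectory("/path/index2"));
IndexSearcher searcher =
    new IndexSearcher(new MultiReader(new IndexReader[] { r1, r2 }));
```

Alternatively, the indexes can be physically merged once with IndexWriter.addIndexes if they do not need to stay separate.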
Enis,
thanks for excellent answer!
Lukas
On 8/13/07, Enis Soztutar <[EMAIL PROTECTED]> wrote:
>
> Hi,
>
> Lukas Vlcek wrote:
> > Enis,
> >
> > Thanks for your time.
> > I gave a quick glance at Pig and it seems good (seems it is directly
> based
> > on Hadoop which I am starting to play with :-).
On 13 Aug 2007, at 12:49, Lukas Vlcek wrote:
But I am looking for a more IR-oriented application of this information.
I remember that I once read on the Lucene mailing list that somebody
suggested utilizing previously issued user queries for suggestions of
similar/other/related queries or for typo
Have you tried the very simple technique of just making an OR clause
containing all the sources for a particular query and just letting
it run? I was surprised at the speed...
But before doing *any* of that, you need to find out, and tell us, what
exactly is taking the time. Are you opening a new
There is no *lucene* limitation of a 2GB index file. I've had no trouble
with single indexes over 8G. If you're referring to this page...
http://wiki.apache.org/lucene-java/LuceneFAQ?highlight=%282gb%29
then it's talking about an *operating system* limitation. So I wouldn't
worry about this unless
Uh, because I didn't write the code? You can always contribute a patch.
On 8/13/07, Mohammad Norouzi <[EMAIL PROTECTED]> wrote:
>
> Thanks, Erick, but unfortunately NumberTools works only with the long
> primitive type. I am wondering why you didn't put in methods for double and
> float.
>
>
>
> On 8/1
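For what it's worth, a NumberTools-style sortable string for doubles can be built from the IEEE-754 bit pattern (a sketch only; this class and method are made up for illustration and are not part of Lucene's NumberTools):

```java
// Sketch: encode a double as a fixed-width hex string whose
// lexicographic order matches the numeric order of the doubles,
// analogous to what NumberTools does for longs.
public class DoubleTools {
    public static String doubleToString(double d) {
        long bits = Double.doubleToLongBits(d);
        // Negative doubles: flip all bits so bigger negatives sort lower.
        // Non-negative doubles: flip only the sign bit so they sort above
        // all negatives. The result is then compared as unsigned hex.
        bits = (bits < 0) ? ~bits : (bits ^ 0x8000000000000000L);
        return String.format("%016x", bits);
    }

    public static void main(String[] args) {
        System.out.println(doubleToString(-2.0) + " < " + doubleToString(1.5));
    }
}
```

A float version would be the same trick on Float.floatToIntBits with an 8-digit hex string.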
I would presume that (like a lot of things) there is power-law at play in the
popularity of publication sources (i.e. a small number of popular sources and a
lot of unpopular ones).
The "Zipf" plugin in Luke can be used to illustrate this distribution for the
values in your "publication source"
Here's a scenario I just ran into, though I don't know how to make
Lucene do it (or even if it can).
I have two lists; to keep things simple, let's assume (A B C D E F G) and (X Y).
I want to form a query so that when matches appear from both lists,
results rank higher than if many elements matche
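One way to approximate this with plain boolean scoring is one SHOULD sub-query per list (a sketch only; the field name "f" is an assumption, and the real lists here are just the placeholder letters from the question):

```java
// Sketch: a document matching terms from both sub-queries picks up
// score contributions from both clauses, so "one hit from each list"
// tends to outrank "many hits from one list". Field name is made up.
BooleanQuery listA = new BooleanQuery();
for (String t : new String[] { "A", "B", "C", "D", "E", "F", "G" }) {
    listA.add(new TermQuery(new Term("f", t)), BooleanClause.Occur.SHOULD);
}
BooleanQuery listB = new BooleanQuery();
for (String t : new String[] { "X", "Y" }) {
    listB.add(new TermQuery(new Term("f", t)), BooleanClause.Occur.SHOULD);
}
BooleanQuery combined = new BooleanQuery();
combined.add(listA, BooleanClause.Occur.SHOULD);
combined.add(listB, BooleanClause.Occur.SHOULD);
```

Whether the ranking is strict enough depends on coordination and term weights, which is why DisjunctionMaxQuery comes up later in this thread as another option.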
I've been experimenting with using SpanQuery to perform what is essentially
a limited type of database 'join'. Each document in the index contains 1 or
more 'rows' of meta data from another 'table'. The meta data are simple
tokens representing a column name/value pair ( e.g. color$red or
location$1
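A minimal sketch of what such a row-scoped span match might look like (the field name, the slop of 0, and the second token are illustrative assumptions; the thread's actual code is not shown):

```java
// Sketch: require both column/value tokens to match within the same
// metadata 'row' by bounding the span width. "meta" and the token
// values here are made up for illustration.
SpanQuery color = new SpanTermQuery(new Term("meta", "color$red"));
SpanQuery loc = new SpanTermQuery(new Term("meta", "location$1"));
SpanQuery sameRow = new SpanNearQuery(new SpanQuery[] { color, loc }, 0, false);
```

The effect is a limited join: documents only match when the paired values occur together in one row, not merely somewhere in the same document.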
Hoss wrote:
this would be meaningless even if it were easier...
http://wiki.apache.org/lucene-java/LuceneFAQ#head-912c1f237bb00259185353182948e5935f0c2f03
FAQ: "Can I filter by score?"
-Hoss
I've read the warnings referenced there; but still have a problem to
solve. We have "fact-based" infor
Thanks for writing this up. Do you think this is an appropriate subject
for the Wiki performance page?
Erick
On 8/13/07, Peter Keegan <[EMAIL PROTECTED]> wrote:
>
> I've been experimenting with using SpanQuery to perform what is
> essentially
> a limited type of database 'join'. Each document in
I suppose it could go under performance or HowTo/Interesting uses of
SpanQuery.
Peter
On 8/13/07, Erick Erickson <[EMAIL PROTECTED]> wrote:
>
> Thanks for writing this up. Do you think this is an appropriate subject
> for the Wiki performance page?
>
> Erick
>
> On 8/13/07, Peter Keegan <[EMAIL P
Donna,
If I understand the problem correctly, it is: given a [job
description], find [candidates] that we would not otherwise find. That
seems to be a "user-weighted similarity" problem more than a simple
search problem.
IOW:
1. Given a [job description], create a set of queries that look for
There is also a Use Cases item on the Wiki...
On Aug 13, 2007, at 3:26 PM, Peter Keegan wrote:
I suppose it could go under performance or HowTo/Interesting uses of
SpanQuery.
Peter
On 8/13/07, Erick Erickson <[EMAIL PROTECTED]> wrote:
Thanks for writing this up. Do you think this is an appr
That is wonderful to hear. (I love that I am not stressing the technology
near its limits.)
What if my concern is more in terms of having a large number of requests per
second? When should I start to be worried and start thinking about more than
low end hardware?
Thanks!
On 8/12/07, karl wettin
I don't think that's all that large, though I have only been working with
Lucene for a short while. I have two corpuses with 445834 documents (3.43M
terms) and 132217 documents (1.6M terms). I don't have trouble querying either
of these with Luke.
- Original Message
From: lucene user
On 14 Aug 2007, at 00:17, lucene user wrote:
What if my concern is more in terms of having a large number of requests
per second? When should I start to be worried and start thinking about
more than low-end hardware?
I have served one request every 10 milliseconds, 24/7, on a single
machine
A few questions on custom score queries:
[1] I need to rank matches by some combination of keyword match, popularity,
and recency of the doc. I read the docs about CustomScoreQuery and it seems to
be a reasonable fit. An alternate way of achieving my goals is to use a
custom sort. What are the trade-
On 8/13/07, Erick Erickson <[EMAIL PROTECTED]> wrote:
> Have you tried the very simple technique of just making an OR clause
> containing all the sources for a particular query and just letting
> it run? I was surprised at the speed...
I think the TermsFilter that I use does exactly that.
>
> But
I figured out the answer to [2a] - it's because by default CustomScoreQuery
does weight normalization. To disable that, one should use
customQuery.setStrict(true). Once I do this, I get the original values that
I stored during the indexing process.
Help with the other two questions ([1] and [2b])
Have a look at the DisjunctionMaxQuery class. I don't think it is
exactly what you are looking for, but it might give you some ideas on
how to proceed, as it sounds similar to what you are trying to do.
Hope this helps,
Grant
On Aug 13, 2007, at 2:20 PM, Walt Stoneburner wrote:
Here's a s
On 8/13/07, mark harwood <[EMAIL PROTECTED]> wrote:
> I would presume that (like a lot of things) there is power-law at play in the
> popularity of publication sources (i.e. a small number of popular sources and
> a lot of unpopular ones).
> The "Zipf" plugin in Luke can be used to illustrate thi
Hi all,
Lucene's documentation says that not using the compound file format greatly
increases the number of file descriptors used by indexing and by searching.
Can you please tell me what this means? Which files are opened during
indexing and searching? I know something, but it is still not very clear. I have some oth
Hello Luceners
I have started a new project and need to index PDF documents.
There are several projects around that allow extracting the content,
like pdfbox, xpdf and pjclassic.
As far as I have studied the FAQs and examples, all these
tools allow simple text extraction.
Which of these open sour