[release announcement] Carrot2 version 2.1 released

2007-08-13 Thread Stanislaw Osinski
Hi All, A bit of self-promotion again :) I hope you don't find it off topic; after all, some folks are using Carrot2 with Lucene and Solr, and Nutch has a Carrot2-based clustering plugin. Staszek [EMAIL PROTECTED]

Re: Indexing correctly?

2007-08-13 Thread Kai Hu
Hi John, I think you are spending too much time on I/O; indexing into a RAMDirectory first would be better. See http://wiki.apache.org/lucene-java/ImproveIndexingSpeed kai -----Original Message----- From: Erick Erickson [mailto:[EMAIL PROTECTED] Sent: Monday, August 13, 2007 1:57 To: java-user@lucene.apache.org Subject: Re: In
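
A minimal sketch of the pattern Kai describes, against the Lucene 2.x API (paths, field names, and the analyzer are illustrative): build a batch of documents in a RAMDirectory, then merge it into the on-disk index with a single addIndexes() call.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.RAMDirectory;

public class RamBufferedIndexing {
    public static void main(String[] args) throws Exception {
        // Index into RAM first...
        Directory ramDir = new RAMDirectory();
        IndexWriter ramWriter = new IndexWriter(ramDir, new StandardAnalyzer(), true);
        Document doc = new Document();
        doc.add(new Field("body", "some text", Field.Store.NO, Field.Index.TOKENIZED));
        ramWriter.addDocument(doc);
        ramWriter.close();

        // ...then merge the RAM segment into the disk index in one pass.
        Directory fsDir = FSDirectory.getDirectory("/path/to/index");
        IndexWriter fsWriter = new IndexWriter(fsDir, new StandardAnalyzer(), false);
        fsWriter.addIndexes(new Directory[] { ramDir });
        fsWriter.close();
    }
}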

Re: How to keep user search history and how to turn it into information?

2007-08-13 Thread Enis Soztutar
Hi, Lukas Vlcek wrote: Enis, Thanks for your time. I gave a quick glance at Pig and it seems good (seems it is directly based on Hadoop which I am starting to play with :-). It is obvious that a huge amount of data (like user queries or access logs) should be stored in flat files which makes it co

RE: High CPU usage duing index and search

2007-08-13 Thread testn
To me, it looks like what you are trying to achieve is better suited to a database, which can help you with grouping, sorting, etc. But if you still want to do it with Lucene, you might want to post some code so that I can go through it and see why it uses so many resources. Ch

Re: Index file size limitation of 2GB

2007-08-13 Thread Chris Lu
Hi Rohit, You need to create the index reader on the subdirectory where you created the index files. Lucene's IndexReader won't find your index if you simply move the index to a subdirectory. Yes, if you have several index directories, you need to combine them. But you can achieve this by u
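
A sketch of both options against the Lucene 2.x API (all paths are illustrative): either physically merge several index directories into one with addIndexes(), or search them together without merging via MultiReader.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class CombineIndexes {
    public static void main(String[] args) throws Exception {
        // Option 1: merge the sub-indexes into a single new index.
        Directory merged = FSDirectory.getDirectory("/path/to/merged");
        IndexWriter writer = new IndexWriter(merged, new StandardAnalyzer(), true);
        writer.addIndexes(new Directory[] {
            FSDirectory.getDirectory("/path/to/index/part1"),
            FSDirectory.getDirectory("/path/to/index/part2")
        });
        writer.optimize();
        writer.close();

        // Option 2: leave them in place and search across all of them.
        IndexReader[] readers = {
            IndexReader.open("/path/to/index/part1"),
            IndexReader.open("/path/to/index/part2")
        };
        IndexSearcher searcher = new IndexSearcher(new MultiReader(readers));
        searcher.close();
    }
}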

Re: How to keep user search history and how to turn it into information?

2007-08-13 Thread Lukas Vlcek
Enis, thanks for the excellent answer! Lukas On 8/13/07, Enis Soztutar <[EMAIL PROTECTED]> wrote: > > Hi, > > Lukas Vlcek wrote: > > Enis, > > > > Thanks for your time. > > I gave a quick glance at Pig and it seems good (seems it is directly > based > > on Hadoop which I am starting to play with :-).

Re: How to keep user search history and how to turn it into information?

2007-08-13 Thread karl wettin
On 13 Aug 2007, at 12:49, Lukas Vlcek wrote: But I am looking for a more IR-oriented application of this information. I remember that once I read on the Lucene mailing list that somebody suggested using previously issued user queries for suggestions of similar/other/related queries or for typo

Re: performance on filtering against thousands of different publications

2007-08-13 Thread Erick Erickson
Have you tried the very simple technique of just making an OR clause containing all the sources for a particular query and just letting it run? I was surprised at the speed... But before doing *any* of that, you need to find out, and tell us, what exactly is taking the time. Are you opening a new
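
A sketch of the approach Erick suggests (the field name and clause limit are assumptions): one SHOULD clause per allowed source, ANDed with the user's query. With thousands of sources you also need to raise BooleanQuery's default clause limit of 1024.

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class SourceOrClause {
    public static Query restrictToSources(Query userQuery, String[] allowedSources) {
        BooleanQuery.setMaxClauseCount(10000);  // default is 1024
        BooleanQuery sources = new BooleanQuery();
        for (int i = 0; i < allowedSources.length; i++) {
            sources.add(new TermQuery(new Term("source", allowedSources[i])),
                        BooleanClause.Occur.SHOULD);
        }
        BooleanQuery full = new BooleanQuery();
        full.add(userQuery, BooleanClause.Occur.MUST);
        full.add(sources, BooleanClause.Occur.MUST);
        return full;
    }
}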

Re: Index file size limitation of 2GB

2007-08-13 Thread Erick Erickson
There is no *Lucene* limitation of 2GB on index file size. I've had no trouble with single indexes over 8G. If you're referring to this page... http://wiki.apache.org/lucene-java/LuceneFAQ?highlight=%282gb%29 then it's talking about an *operating system* limitation. So I wouldn't worry about this unless

Re: Range queries in Lucene - numerical or lexicographical

2007-08-13 Thread Erick Erickson
Um, because I didn't write the code? You can always contribute a patch. On 8/13/07, Mohammad Norouzi <[EMAIL PROTECTED]> wrote: > > Thanks Erick but unfortunately NumberTools works only with long primitive > type I am wondering why you didn't put some method for double and float. > > > > On 8/1
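
A common workaround until such helpers exist (a sketch only; the scale factor is an arbitrary choice): pick a fixed precision, scale the double into a long, and let NumberTools produce the lexicographically sortable string.

import org.apache.lucene.document.NumberTools;

public class DoubleAsSortableString {
    private static final long SCALE = 10000L;  // four decimal places

    public static String encode(double value) {
        // Scaling preserves numeric order, and NumberTools.longToString
        // produces strings whose lexical order matches the long order.
        return NumberTools.longToString((long) (value * SCALE));
    }

    public static double decode(String indexed) {
        return NumberTools.stringToLong(indexed) / (double) SCALE;
    }
}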

Re: performance on filtering against thousands of different publications

2007-08-13 Thread mark harwood
I would presume that (like a lot of things) there is a power-law at play in the popularity of publication sources (i.e. a small number of popular sources and a lot of unpopular ones). The "Zipf" plugin in Luke can be used to illustrate this distribution for the values in your "publication source"

Rank based on lists.

2007-08-13 Thread Walt Stoneburner
Here's a scenario I just ran into, though I don't know how to make Lucene do it (or even if it can). I have two lists; to keep things simple let's assume (A B C D E F G) and (X Y). I want to form a query so that when matches appear from both lists, results rank higher than if many elements matche
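
One way to express this with plain BooleanQuerys (a sketch; the field name "body" and the boost value are assumptions): give each list its own SHOULD sub-query, so a document matching terms from both lists collects score contributions from both and ranks higher.

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

public class TwoListQuery {
    public static BooleanQuery build() {
        BooleanQuery listA = new BooleanQuery();
        String[] a = { "A", "B", "C", "D", "E", "F", "G" };
        for (int i = 0; i < a.length; i++) {
            listA.add(new TermQuery(new Term("body", a[i])), BooleanClause.Occur.SHOULD);
        }
        BooleanQuery listB = new BooleanQuery();
        String[] b = { "X", "Y" };
        for (int i = 0; i < b.length; i++) {
            listB.add(new TermQuery(new Term("body", b[i])), BooleanClause.Occur.SHOULD);
        }
        listB.setBoost(2.0f);  // optionally weight the shorter list more heavily

        BooleanQuery combined = new BooleanQuery();
        combined.add(listA, BooleanClause.Occur.SHOULD);
        combined.add(listB, BooleanClause.Occur.SHOULD);
        return combined;
    }
}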

SpanQuery and database join

2007-08-13 Thread Peter Keegan
I've been experimenting with using SpanQuery to perform what is essentially a limited type of database 'join'. Each document in the index contains 1 or more 'rows' of meta data from another 'table'. The meta data are simple tokens representing a column name/value pair (e.g. color$red or location$1
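
To make the idea concrete, a hypothetical sketch (the field name, token layout, and slop are assumptions about how such meta-data rows might be indexed): require that two column$value tokens fall within the same row by constraining their positions with a SpanNearQuery.

import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

public class MetaRowJoin {
    public static SpanNearQuery sameRow() {
        SpanQuery color = new SpanTermQuery(new Term("meta", "color$red"));
        SpanQuery location = new SpanTermQuery(new Term("meta", "location$1"));
        // Slop of 1 and inOrder=false: both tokens must sit in adjacent
        // positions, i.e. within the same indexed "row" of meta data.
        return new SpanNearQuery(new SpanQuery[] { color, location }, 1, false);
    }
}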

Re: How to implement cut of score ?

2007-08-13 Thread Donna L Gresh
Hoss wrote: this would be meaningless even if it were easier... http://wiki.apache.org/lucene-java/LuceneFAQ#head-912c1f237bb00259185353182948e5935f0c2f03 FAQ: "Can I filter by score?" -Hoss I've read the warnings referenced there, but I still have a problem to solve. We have "fact-based" infor

Re: SpanQuery and database join

2007-08-13 Thread Erick Erickson
Thanks for writing this up. Do you think this is an appropriate subject for the Wiki performance page? Erick On 8/13/07, Peter Keegan <[EMAIL PROTECTED]> wrote: > > I've been experimenting with using SpanQuery to perform what is > essentially > a limited type of database 'join'. Each document in

Re: SpanQuery and database join

2007-08-13 Thread Peter Keegan
I suppose it could go under performance or HowTo/Interesting uses of SpanQuery. Peter On 8/13/07, Erick Erickson <[EMAIL PROTECTED]> wrote: > > Thanks for writing this up. Do you think this is an appropriate subject > for the Wiki performance page? > > Erick > > On 8/13/07, Peter Keegan <[EMAIL P

Re: How to implement cut of score ?

2007-08-13 Thread N. Hira
Donna, If I understand the problem correctly, it is: given a [job description], find [candidates] that we would not otherwise find. That seems to be a "user-weighted similarity" problem more than a simple search problem. IOW: 1. Given a [job description], create a set of queries that look for

Re: SpanQuery and database join

2007-08-13 Thread Grant Ingersoll
There is also a Use Cases item on the Wiki... On Aug 13, 2007, at 3:26 PM, Peter Keegan wrote: I suppose it could go under performance or HowTo/Interesting uses of SpanQuery. Peter On 8/13/07, Erick Erickson <[EMAIL PROTECTED]> wrote: Thanks for writing this up. Do you think this is an appr

Re: Amount of RAM needed to support a growing lucene index?

2007-08-13 Thread lucene user
That is wonderful to hear. (I love that I am not stressing the technology near its limits.) What if my concern is more in terms of having a large number of requests per second? When should I start to be worried and start thinking about more than low end hardware? Thanks! On 8/12/07, karl wettin

Re: Amount of RAM needed to support a growing lucene index?

2007-08-13 Thread Kai_testing Middleton
I don't think that's all that large, though I have only been working with Lucene for a short while. I have two corpuses with 445834 documents (3.43M terms) and 132217 documents (1.6M terms). I don't have trouble querying either of these with Luke. - Original Message From: lucene user

Re: Amount of RAM needed to support a growing lucene index?

2007-08-13 Thread karl wettin
On 14 Aug 2007, at 00:17, lucene user wrote: What if my concern is more in terms of having a large number of requests per second? When should I start to be worried and start thinking about more than low-end hardware? I have served one request every 10 milliseconds, 24/7 on a single machine

Question on custom scoring

2007-08-13 Thread Srinivas.N.
A few questions on custom score queries: [1] I need to rank matches by some combination of keyword match, popularity and recency of the doc. I read the docs about CustomScoreQuery and it seems to be a reasonable fit. An alternate way of achieving my goals is to use a custom sort. What are the trade-
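
For question [1], the custom-sort alternative looks roughly like this (a sketch; the "popularity" and "pubDate" field names and types are assumptions, and those fields would have to be indexed untokenized): sort by relevance first, then by popularity and recency as tie-breakers.

import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;

public class RelevanceThenPopularity {
    public static Hits search(IndexSearcher searcher, Query query) throws Exception {
        Sort sort = new Sort(new SortField[] {
            SortField.FIELD_SCORE,                              // relevance first
            new SortField("popularity", SortField.INT, true),   // then popularity, descending
            new SortField("pubDate", SortField.STRING, true)    // then recency, descending
        });
        return searcher.search(query, sort);
    }
}

The trade-off: a sort imposes a strict ordering by each key in turn, whereas CustomScoreQuery blends the factors into a single score.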

Re: performance on filtering against thousands of different publications

2007-08-13 Thread Cedric Ho
On 8/13/07, Erick Erickson <[EMAIL PROTECTED]> wrote: > Have you tried the very simple technique of just making an OR clause > containing all the sources for a particular query and just letting > it run? I was surprised at the speed... I think the TermsFilter that I use does exactly that. > > But
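
For reference, a TermsFilter sketch (the class is from the contrib queries module; field and source names are illustrative). A Filter restricts the result set without contributing to scoring, and wrapping it in a CachingWrapperFilter lets a repeated publication set be reused across queries.

import org.apache.lucene.index.Term;
import org.apache.lucene.search.CachingWrapperFilter;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermsFilter;  // contrib/queries

public class PublicationFilter {
    public static Hits search(IndexSearcher searcher, Query userQuery,
                              String[] allowedSources) throws Exception {
        TermsFilter pubFilter = new TermsFilter();
        for (int i = 0; i < allowedSources.length; i++) {
            pubFilter.addTerm(new Term("source", allowedSources[i]));
        }
        // Cache the filter's bit set if the same source list is used repeatedly.
        Filter cached = new CachingWrapperFilter(pubFilter);
        return searcher.search(userQuery, cached);
    }
}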

Re: Question on custom scoring

2007-08-13 Thread Srinivas.N.
I figured out the answer to 2[a] - it's because by default CustomScoreQuery does weight normalization. To disable that, one should use customQuery.setStrict(true). Once I do this, I get the original values that I stored during the indexing process. Help with the other two questions ([1] and [2]b)
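
In context, the fix Srinivas describes looks roughly like this (a sketch; the "popularity" field and text query are made up, and the value-source classes are assumed to live in org.apache.lucene.search.function):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.function.CustomScoreQuery;
import org.apache.lucene.search.function.FieldScoreQuery;

public class PopularityBoostedQuery {
    public static Query build() {
        Query textQuery = new TermQuery(new Term("body", "lucene"));
        FieldScoreQuery popularity =
            new FieldScoreQuery("popularity", FieldScoreQuery.Type.FLOAT);
        CustomScoreQuery customQuery = new CustomScoreQuery(textQuery, popularity);
        // Without this, the value-source score is normalized along with the
        // query weights; strict mode keeps the raw field values.
        customQuery.setStrict(true);
        return customQuery;
    }
}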

Re: Rank based on lists.

2007-08-13 Thread Grant Ingersoll
Have a look at the DisjunctionMaxQuery class. I don't think it is exactly what you are looking for, but it might give you some ideas on how to proceed, as it sounds similar to what you are trying to do. Hope this helps, Grant On Aug 13, 2007, at 2:20 PM, Walt Stoneburner wrote: Here's a s
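
For comparison, a minimal DisjunctionMaxQuery sketch (field names are made up): the score of a matching document is taken from its best sub-query rather than the sum, with the tie-breaker adding a small fraction of the others.

import org.apache.lucene.index.Term;
import org.apache.lucene.search.DisjunctionMaxQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class BestFieldMatch {
    public static Query build() {
        // 0.1f = tie-breaker multiplier applied to the non-maximum sub-scores.
        DisjunctionMaxQuery dmq = new DisjunctionMaxQuery(0.1f);
        dmq.add(new TermQuery(new Term("title", "lucene")));
        dmq.add(new TermQuery(new Term("body", "lucene")));
        return dmq;
    }
}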

Re: performance on filtering against thousands of different publications

2007-08-13 Thread Cedric Ho
On 8/13/07, mark harwood <[EMAIL PROTECTED]> wrote: > I would presume that (like a lot of things) there is power-law at play in the > popularity of publication sources (i.e. a small number of popular sources and > a lot of unpopular ones). > The "Zipf" plugin in Luke can be used to illustrate thi

File decriptors

2007-08-13 Thread rohit saini
Hi all, Lucene says that if we use the compound file format it greatly increases the number of file descriptors used by indexing and by searching. Can you please tell me what that means? Which files are opened during indexing and searching? I know something, but it is still not very clear. I have some oth
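
For reference, the setting in question (a minimal sketch; the path is illustrative). The compound format packs the many per-segment files (.frq, .prx, .tis and so on) into a single .cfs file per segment, so it reduces, rather than increases, the number of file descriptors that must stay open.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class CompoundFileSetting {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), true);
        // true is the default; set to false to keep each segment as separate files.
        writer.setUseCompoundFile(true);
        writer.close();
    }
}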

Indexing PDF documents with structure information

2007-08-13 Thread Thomas Arni
Hello Luceners, I have started a new project and need to index PDF documents. There are several projects around that can extract the content, such as pdfbox, xpdf and pjclassic. As far as I can tell from the FAQs and examples, all these tools allow simple text extraction. Which of these open sour
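
As one concrete example, a PDFBox-based sketch (the field names and file path are made up, and the package was org.pdfbox in the 0.7.x releases of PDFBox, later org.apache.pdfbox): extract the plain text and feed it into a Lucene Document.

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.util.PDFTextStripper;

public class PdfToDocument {
    public static Document convert(String path) throws Exception {
        PDDocument pdf = PDDocument.load(path);
        try {
            String text = new PDFTextStripper().getText(pdf);
            Document doc = new Document();
            doc.add(new Field("contents", text, Field.Store.NO, Field.Index.TOKENIZED));
            doc.add(new Field("path", path, Field.Store.YES, Field.Index.UN_TOKENIZED));
            return doc;
        } finally {
            pdf.close();
        }
    }
}

Note that this yields plain text only; recovering document structure (headings, sections) requires extra work on top of whatever the extractor exposes.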