Re: Question concerning speed of Lucene.....

2004-08-27 Thread Paul Elschot
Oliver,

On Friday 27 August 2004 22:20, you wrote:
> Hi,
>
> I guess this is one of the most frequently asked questions on this mailing
> list, but hopefully my question is more specific, so that I can get some
> input from you.
>
> My project is to implement an agency system for newspapers. So I have to
> handle about 30 days of text and IPTC data. The latter is taken from images
> provided by the agencies. I basically get a constant stream of text
> messages from the agencies (roughly 2000 per day per agency) and images
> (roughly 1000 per day per agency). I have to deal with 4 text and 6 image
> agencies. So my daily input is 8000 text messages and 6000 images. The
> extracted documents from these text messages and images have a size of
> about 1 kB.
>
> The extraction of the data and conversion to Document objects is
> already finished, and searching using Lucene works like a charm. Brilliant
> software!
>
> But now to my questions. In order to explain what I am doing, I'd like to
> talk a little about the kind of queries and data I have to deal with.
>
> * Every message has a priority. An integer value ranging from 1 to 6.
> * Every message has a receive date.
> * Every message has an agency assigned, basically a unique string
> identifier for it.
> * Every message has some header data, that is also indexed for refined
> searches.
> * And of course the actual text included in the text message itself or
> the IPTC header of an image.
>
> Typically I have two kinds of queries.
>
> * Typical relational queries
>
> * Show every text message from a certain agency in the last X days.

Probably a good case for a date filter; see the wiki on RangeQuery, and
possibly my previous message on using a 2nd index to constrain searches.
Lucene has no facilities for primary keys, so that is up to you.

> * Show every image or text message with a higher priority than Y and
> from a certain period of time.

RangeQuery again for the priority.
One can store images in Lucene, but currently only in String format, i.e.
they'll need some conversion. There was some talk about binary objects
fairly recently, but that is still in development. I'd probably store the
images in a file system or in another db for now. OTOH, if you're willing
to help with storing binary images, lucene-dev is nearby.

> * Fulltext search

Yes :)

> * A real fulltext search over all elements using the full power of
> Lucene's query language.

Span queries are currently not supported by the query language;
you might have a look at the org.apache.lucene.search.spans package.

> It is absolutely no question anymore that the latter queries will be done
> using Lucene. But the first type of query is the thing I am thinking
> about. Can this be done efficiently with Lucene? So far we use a system

Lucene can be as fast as relational databases, provided your lower-level
Java code on IndexReader plays nicely with system resources like disk heads
and RAM.
That means using filters, sorting lookups in index order before using an
index, and possibly sorting on document number before retrieving stored
fields. Lucene's IndexSearcher for searching text queries is quite well
behaved in that respect.

> that uses a SQL database engine for storing the relevant data and is used
> in these queries. But if Lucene is fast enough with these queries too, I am
> willing to skip the SQL database altogether. But bear in mind that I will
> be indexing about 400,000 messages per month.

To easily keep the primary keys in sync between the SQL db and Lucene,
I'd start by keeping the images and the full text only in the SQL db.
Lucene optimisations (needed after adding/deleting docs) copy all data,
so it pays to keep the Lucene indexes small.

Later you might need multiple indexes, MultiSearcher, and occasionally
a merge of the indexes.

Regards,
Paul Elschot


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Question concerning speed of Lucene.....

2004-08-27 Thread Oliver Andrich
Hi,
 
I guess this is one of the most frequently asked questions on this mailing
list, but hopefully my question is more specific, so that I can get some
input from you.
 
My project is to implement an agency system for newspapers. So I have to
handle about 30 days of text and IPTC data. The latter is taken from images
provided by the agencies. I basically get a constant stream of text messages
from the agencies (roughly 2000 per day per agency) and images (roughly 1000
per day per agency). I have to deal with 4 text and 6 image agencies. So my
daily input is 8000 text messages and 6000 images. The extracted documents
from these text messages and images have a size of about 1 kB.
 
The extraction of the data and conversion to Document objects is
already finished, and searching using Lucene works like a charm. Brilliant
software!
 
But now to my questions. In order to explain what I am doing, I'd like to
talk a little about the kind of queries and data I have to deal with.

*   Every message has a priority. An integer value ranging from 1 to 6.
*   Every message has a receive date.
*   Every message has an agency assigned, basically a unique string
identifier for it.
*   Every message has some header data, that is also indexed for refined
searches.
*   And of course the actual text included in the text message itself or
the IPTC header of an image.

Typically I have two kinds of queries.

*   Typical relational queries

*   Show every text message from a certain agency in the last X days.
*   Show every image or text message with a higher priority than Y and
from a certain period of time.

*   Fulltext search

*   A real fulltext search over all elements using the full power of
Lucene's query language.

It is absolutely no question anymore that the latter queries will be done
using Lucene. But the first type of query is the thing I am thinking about.
Can this be done efficiently with Lucene? So far we use a system that
uses a SQL database engine for storing the relevant data and is used in
these queries. But if Lucene is fast enough with these queries too, I am
willing to skip the SQL database altogether. But bear in mind that I will
be indexing about 400,000 messages per month.
 
Thanks in advance for every answer. Now I will be going back to having fun
with Lucene.
 
Best regards,
Oliver Andrich
 


Re: Using 2nd Index to constrain Search

2004-08-27 Thread Paul Elschot
On Friday 27 August 2004 20:10, Mike Upshon wrote:
> Hi
>
> Just starting to evaluate Lucene and hope someone can answer this
> question.
>
> I am looking at using Lucene to index a very large database. There is a
> documents table and a few other tables that define which users can view
> which documents. My question is: is it possible to have an index of the

The normal way of doing that is to:
- make a list of all doc id's for the user.
- from this list construct a Filter for use in the full text index.
Sort the doc id's, use an IndexReader on the full text index, construct
a Term for each doc id, walk the termDocs() for the Term, and set
a bit in the filter to allow the document number for the doc id.
- keep this filter to restrict the searches for the user by
IndexSearcher.search(Query,Filter)
- rebuild the filter when the doc id's for the user change, or when
the full text index changes (a document deletion followed
by an optimize or an add can change any other document's number).
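Schematically, the steps above look like this. This is a pure-Java sketch: the map stands in for walking termDocs() on the id field, and all names are illustrative, not Lucene API:

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.Map;

public class UserFilterSketch {
    /**
     * Build a filter bitset: one bit per Lucene document number, set
     * for every doc id the user may see. 'idToDocNumber' stands in for
     * looking each id up via termDocs() on the primary-key field.
     */
    public static BitSet buildBits(int maxDoc, String[] userDocIds,
                                   Map<String, Integer> idToDocNumber) {
        // Sort the ids first so the term lookups walk the index in order.
        String[] ids = userDocIds.clone();
        Arrays.sort(ids);

        BitSet bits = new BitSet(maxDoc);
        for (String id : ids) {
            Integer docNumber = idToDocNumber.get(id);
            if (docNumber != null) {       // id may no longer exist
                bits.set(docNumber);
            }
        }
        return bits;
    }
}
```

In real Lucene code the bitset would be returned from a custom Filter's bits() method and passed to IndexSearcher.search(Query, Filter).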

Hmm, this is getting to be a FAQ.

> full text contents of the documents and another index that contains the
> document ids and the user ids, and then use the 2nd index to qualify
> the full text search over the documents table. The reason I want to do
> this is to reduce the number of documents that the full text query will
> run over.

Regards,
Paul Elschot





RE: complex search (newbie)

2004-08-27 Thread Otis Gospodnetic
This should work:

http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/IndexSearcher.html#search(org.apache.lucene.search.Query,%20org.apache.lucene.search.Filter,%20int,%20org.apache.lucene.search.Sort)
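In code, that is roughly the following. This is a sketch against the 1.4-era API: the "lastname" field must be indexed as a single untokenized term to be sortable, and searcher/query are assumed to exist already:

```java
// Sort the results by the "lastname" field instead of by score.
// Passing null for the Filter searches all documents.
Hits hits = searcher.search(query, null, new Sort("lastname"));
```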

Otis

--- Wermus Fernando <[EMAIL PROTECTED]> wrote:

> Thanks :)
> 
>   It Works :) . One more question. I need to order the hits by a
> field in contact called lastname. What do I have to add to the query?
> 
> 
> 
> 
> -----Original Message-----
> From: Bernhard Messer [mailto:[EMAIL PROTECTED]
> Sent: Thursday, August 26, 2004, 02:17 PM
> To: Lucene Users List
> Subject: Re: complex search (newbie)
> 
> hi,
> 
> in general the query parser doesn't allow queries which start with a
> wildcard. Those queries could end up with very long response times and
> block your system. This is not what you want.
> 
> I'm not sure if I understand what you want to do. I expect that you have
> a field within a Lucene document with name "type". For this field you
> can have different values like "contact", "account", etc. Now you want to
> search all documents where type is "contact". So the query to do this
> would be "type:contact"; nothing else is required.
> 
> Can you try that and give some feedback?
> 
> best regards
> Bernhard
> 
> 
> Wermus Fernando wrote:
> 
> >I am using MultiFieldQueryParser to look up some models. I have several
> >models: account, contacts, tasks, etc. The user chooses models and a
> >query string to look up. Besides fields for searching, I add some
> >conditions to the query string.
> >
> >If he puts "john" to look up and chooses contacts, I add to the query
> >string the following:
> >
> >Query string: "john and type:contact"
> >
> >But if he wants to look up any contact, MultiFieldQueryParser throws an
> >exception. In this case, the query string is the following:
> >
> >Query string: "* and type:contact"
> >
> >Am I choosing the wrong QueryParser, or is there another easy way to
> >look up several fields and at the same time any content?
> 
> 





Using 2nd Index to constrain Search

2004-08-27 Thread Mike Upshon
Hi

Just starting to evaluate Lucene and hope someone can answer this
question.

I am looking at using Lucene to index a very large database. There is a
documents table and a few other tables that define which users can view
which documents. My question is: is it possible to have an index of the
full text contents of the documents and another index that contains the
document ids and the user ids, and then use the 2nd index to qualify
the full text search over the documents table. The reason I want to do
this is to reduce the number of documents that the full text query will
run over.

E.g. if there are 50 million documents in the documents table and 20,000
users, and the full text search could be restricted by user, it would be
run over a substantially smaller set.

Thanks


Mike




RE: complex search (newbie)

2004-08-27 Thread Wermus Fernando
Thanks :)

It Works :) . One more question. I need to order the hits by a
field in contact called lastname. What do I have to add to the query?




-----Original Message-----
From: Bernhard Messer [mailto:[EMAIL PROTECTED]
Sent: Thursday, August 26, 2004, 02:17 PM
To: Lucene Users List
Subject: Re: complex search (newbie)

hi,

in general the query parser doesn't allow queries which start with a
wildcard. Those queries could end up with very long response times and
block your system. This is not what you want.

I'm not sure if I understand what you want to do. I expect that you have
a field within a Lucene document with name "type". For this field you
can have different values like "contact", "account", etc. Now you want to
search all documents where type is "contact". So the query to do this
would be "type:contact"; nothing else is required.

Can you try that and give some feedback?

best regards
Bernhard


Wermus Fernando wrote:

>I am using MultiFieldQueryParser to look up some models. I have several
>models: account, contacts, tasks, etc. The user chooses models and a
>query string to look up. Besides fields for searching, I add some
>conditions to the query string.
>
>If he puts "john" to look up and chooses contacts, I add to the query
>string the following:
>
>Query string: "john and type:contact"
>
>But if he wants to look up any contact, MultiFieldQueryParser throws an
>exception. In this case, the query string is the following:
>
>Query string: "* and type:contact"
>
>Am I choosing the wrong QueryParser, or is there another easy way to
>look up several fields and at the same time any content?





RE: Range query problem

2004-08-27 Thread Stephane James Vaucher
A description on how to search numerical fields is available on the wiki:

http://wiki.apache.org/jakarta-lucene/SearchNumericalFields
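The gist of that page: store numbers zero-padded to a fixed width, so that lexicographic order agrees with numeric order. A minimal sketch (class and method names are mine, not from the wiki):

```java
public class NumericKeys {
    // Pad a non-negative number with leading zeros to a fixed width so
    // that string comparison orders values the same way numbers sort.
    public static String pad(int value, int width) {
        StringBuilder sb = new StringBuilder();
        String digits = Integer.toString(value);
        for (int i = digits.length(); i < width; i++) {
            sb.append('0');
        }
        return sb.append(digits).toString();
    }
}
```

With four digits, a range like PERIOD:[0001 TO 0010] then behaves as expected, whereas unpadded "10" sorts before "9".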

sv

On Thu, 26 Aug 2004, Alex Kiselevski wrote:

>
> Thanks, I'll try it
>
> -Original Message-
> From: Daniel Naber [mailto:[EMAIL PROTECTED]
> Sent: Thursday, August 26, 2004 12:59 PM
> To: Lucene Users List
> Subject: Re: Range query problem
>
>
> On Thursday 26 August 2004 11:02, Alex Kiselevski wrote:
>
> > I have a strange problem with the range query "PERIOD:[1 TO 9]". It works
> > only if the second parameter is equal to or less than 9. If it's greater
> > than 9, it finds no documents.
>
> You have to store your numbers so that they will appear in the right order
> when sorted lexicographically, e.g. save 1 as 01 if you save numbers up to
> 99, or as 0001 if you save numbers up to 9999. You also have to use this
> format for searching, I think.
>
> Regards
>  Daniel
>
> --
> http://www.danielnaber.de
>
>
>





Re: speeding up queries (MySQL faster)

2004-08-27 Thread Yonik Seeley
FYI, this optimization resulted in a fantastic
performance boost!  I went from 133 queries/sec to 990
queries per sec!  I'm now more limited by socket
overhead, as I get 1700 queries/sec when I stick the
clients right in the same process as the server.

Oddly enough, the performance increased, but the CPU
utilization decreased to around 55% (in both
configurations above).  I'll have to look into that
later, but any additional performance at this point is
pure gravy.

-Yonik


--- Yonik Seeley <[EMAIL PROTECTED]> wrote:
> Doug wrote:
> > For example, Nutch automatically translates such
> > clauses into QueryFilters.
> 
> Thanks for the excellent pointer Doug!  I'll will
> definitely be implementing this optimization.








Re: Content from multiple folders in single index

2004-08-27 Thread Erik Hatcher
You should consider using the Ant <index> task in the Sandbox
(contributions/ant directory).  You'll need to write a custom document
handler implementation to handle PDFs and any other types you like.
The built-in handler does text and HTML files, but is pluggable.
The <index> task uses Ant's filesets to determine what should be
indexed, so you could simply have an excludes="include/" to exclude
that directory.

Erik
On Aug 25, 2004, at 7:00 PM, John Greenhill wrote:
Hi,

I suspect this is an easy one, but I didn't see a reference in the FAQs,
so I thought I'd ask. I have a file structure like this:

web
  - pages
  - downloads (pdf docs)
  - include

I want to index the HTML in pages and the PDFs in downloads, but not
the HTML in include, so I don't want to start my index at web. I've
modified the IndexHTML class from the demo to handle the PDFs.

What is the best way to do this? Thanks for your suggestions.

John



Re: weird lock behavior

2004-08-27 Thread Otis Gospodnetic
Yes, creating a unit test that demonstrates things like this is always
a problem, but that's probably the only way we can see what's going on.

Otis

--- [EMAIL PROTECTED] wrote:

> Otis,
> 
> no idea how I ran into it. That's why creating a unit test is a bit
> problematic.
> 
> In the Lucene code it looks like the reader retries every 1 sec, up to
> 10 times. After that it decides the index cannot possibly be locked that
> long and queries anyway. That's why the 10 sec delay.
> 
> 
> 
> 
> 
> 
> >Iouli,
> 
> >This sounds like something that should never happen.  It never
> happens
> >for me at simpy.com - and I use Lucene A LOT there.  My guess is
> that
> >the problem with the commit lock has to do with misuse of
> >IndexReader/Searcher.  Like with the memory leak issue, please try
> >isolating the problem in a unit test that we (Lucene developers) can
> >run on our machines.
> 
> >Thanks,
> >Otis
> 
> 
> 
> 
> Bernhard,
> 
> it's not a problem to unlock the index. The problem is to know when it's
> time to intercept and do it manually (i.e. it's not a normal lock that
> came from reader/writer processes, but just a phantom that got lost).
> 
> 
> 
> hi,
> 
> the IndexReader class provides some public static methods to check if
> an index is locked. If this is the case, there is also a method to
> unlock an existing index. You could do something like:
> 
> Directory dir = FSDirectory.getDirectory(indexDir, false);
> if (IndexReader.isLocked(dir)) {
>     IndexReader.unlock(dir);
> }
> dir.close();
> 
> You also should catch the possible IOException in case of an error or
> if the index can't be unlocked.
> 
> fun with it
> Bernhard
> 
> [EMAIL PROTECTED] wrote:
> 
> >Hi,
> >I experienced the following situation:
> >
> >Suddenly my query became too slow (c. 10 sec instead of c. 1 sec) and the
> >number of returned hits changed from c. 2000 to c. 1800.
> >
> >Tracing the case, I found the lock file "abc...-commit.lck". After
> >deletion of this file everything returned to normal behavior, i.e. I
> >got my 2000 hits in 1 sec.
> >
> >There were no concurrent writing or reading processes running in parallel.
> >
> >Probably the lock file was left behind by an abnormal termination (during
> >development that's ok, but it may happen in production as well).
> >My question is how to handle such a situation, i.e. find out and repair in
> >case it happens (in real life there are many concurrent processes and I
> >have no idea which lock file to kill).
> >
> >
> >
> > 
> >
> 
> 





Re: weird lock behavior

2004-08-27 Thread iouli . golovatyi
Otis,

no idea how I ran into it. That's why creating a unit test is a bit
problematic.

In the Lucene code it looks like the reader retries every 1 sec, up to
10 times. After that it decides the index cannot possibly be locked that
long and queries anyway. That's why the 10 sec delay.
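As far as I can tell, that behaviour amounts to a bounded polling loop of roughly this shape (a sketch with made-up names, not the actual Lucene source):

```java
public class LockWaitSketch {
    // Stands in for checking whether the commit lock file exists.
    public interface LockProbe {
        boolean isLocked();
    }

    /**
     * Poll the lock up to maxRetries times, sleeping pollMillis between
     * attempts, then give up and proceed anyway. With 10 retries at
     * 1000 ms each, a stale lock file costs roughly 10 seconds before
     * the query runs regardless.
     */
    public static int attemptsUntilProceed(LockProbe probe, int maxRetries,
                                           long pollMillis) {
        for (int attempt = 1; attempt <= maxRetries; attempt++) {
            if (!probe.isLocked()) {
                return attempt;            // lock cleared: proceed normally
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;                     // interrupted: stop waiting
            }
        }
        return maxRetries;                 // gave up: query anyway
    }
}
```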






>Iouli,

>This sounds like something that should never happen.  It never happens
>for me at simpy.com - and I use Lucene A LOT there.  My guess is that
>the problem with the commit lock has to do with misuse of
>IndexReader/Searcher.  Like with the memory leak issue, please try
>isolating the problem in a unit test that we (Lucene developers) can
>run on our machines.

>Thanks,
>Otis




Bernhard,

it's not a problem to unlock the index. The problem is to know when it's
time to intercept and do it manually (i.e. it's not a normal lock that
came from reader/writer processes, but just a phantom that got lost).



hi,

the IndexReader class provides some public static methods to check if
an index is locked. If this is the case, there is also a method to
unlock an existing index. You could do something like:

Directory dir = FSDirectory.getDirectory(indexDir, false);
if (IndexReader.isLocked(dir)) {
    IndexReader.unlock(dir);
}
dir.close();

You also should catch the possible IOException in case of an error or
if the index can't be unlocked.

fun with it
Bernhard

[EMAIL PROTECTED] wrote:

>Hi,
>I experienced the following situation:
>
>Suddenly my query became too slow (c. 10 sec instead of c. 1 sec) and the
>number of returned hits changed from c. 2000 to c. 1800.
>
>Tracing the case, I found the lock file "abc...-commit.lck". After
>deletion of this file everything returned to normal behavior, i.e. I
>got my 2000 hits in 1 sec.
>
>There were no concurrent writing or reading processes running in parallel.
>
>Probably the lock file was left behind by an abnormal termination (during
>development that's ok, but it may happen in production as well).
>My question is how to handle such a situation, i.e. find out and repair in
>case it happens (in real life there are many concurrent processes and I
>have no idea which lock file to kill).
>
>
>
> 
>

