Re: Scalability of Lucene indexes

2005-02-19 Thread Andy
Hi Bryan,

How big is your index?

Also what is the advantage of binding a user to a
server? 

Thanks.
Andy

--- Bryan McCormick <[EMAIL PROTECTED]> wrote:

> Hi Chris, 
> 
> I'm responsible for the webshots.com search index
> and we've had very
> good results with lucene. It currently indexes over
> 100 Million
> documents and performs 4 Million searches / day. 
> 
> We initially tested running multiple small copies
> and using a
> MultiSearcher and then merging results as compared
> to running a very
> large single index. We actually found that the
> single large instance
> performed better. To improve load handling we
> clustered multiple
> identical copies together, then session bind a user
> to particular server
> and cache the results, but each server is running a
> single index. 
> 
> Bryan McCormick
> 
> 
> On Fri, 2005-02-18 at 08:01, Chris D wrote: 
> > Hi all, 
> > 
> > I have a question about scaling lucene across a
> cluster, and good ways
> > of breaking up the work.
> > 
> > We have a very large index and searches sometimes
> take more time than
> > they're allowed. What we have been doing is during
> indexing we index
> into 256 separate indexes (depending on the
> md5sum) then distribute
> > the indexes to the search machines. So if a
> machine has 128 indexes it
> > would have to do 128 searches. I gave
> parallelMultiSearcher a try and
> > it was significantly slower than simply iterating
> through the indexes
> > one at a time.
> > 
> > Our new plan is to somehow have only one index per
> search machine and
> > a larger main index stored on the master.
> > 
> > What I'm interested to know is whether having one
> extremely large
> > index for the master then splitting the index into
> several smaller
> > indexes (if this is possible) would be better than
> having several
> > smaller indexes and merging them on the search
> machines into one
> > index.
> > 
> > I would also be interested to know how others have
> divided up search
> > work across a cluster.
> > 
> > Thanks,
> > Chris
> > 



Access Lucene from PHP or Perl

2005-02-10 Thread Andy
Greetings.

Can anyone point me to a how-to tutorial on how to
access Lucene from a web page generated by PHP or
Perl? I've been looking but couldn't find anything.
Thanks a lot.

Andy




Re: Penalty for storing unrelated field?

2005-01-28 Thread Andy Goodell
You should be fine.

On Fri, 28 Jan 2005 15:21:50 -0600, Bill Tschumy <[EMAIL PROTECTED]> wrote:
>  I just want to make sure
> that adding the unrelated field to a single doc won't cause all the
> other documents to increase their storage space. 
> --

I have lots of fields that only occur in one document, but it doesn't
faze Lucene.  Actually, when choosing an indexing solution, we chose
Lucene mostly because of its ability to index and store unlimited
kinds of metadata.

- andy g




Re: Filtering w/ Multiple Terms

2005-01-20 Thread Andy Goodell
Maybe you should try making a BooleanQuery out of the TermQuerys and
then passing that to QueryFilter.  I've never tried it, but it should
work, right?
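
Something along these lines ought to do it - an untested sketch against the
Lucene 1.x API, where the extra "acct2" term, the "contents" field and the
index path are just placeholders:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.*;

public class MultiTermFilterSketch {
    public static void main(String[] args) throws Exception {
        BooleanQuery filterQuery = new BooleanQuery();
        // required=false, prohibited=false: the clauses are OR'ed together
        filterQuery.add(new TermQuery(new Term("acct", "acct1")), false, false);
        filterQuery.add(new TermQuery(new Term("acct", "acct2")), false, false);

        Filter f = new QueryFilter(filterQuery);

        IndexSearcher searcher = new IndexSearcher("/path/to/index");
        Query userQuery = new TermQuery(new Term("contents", "lucene"));
        // only documents matching acct1 or acct2 can show up in the results
        Hits hits = searcher.search(userQuery, f);
        System.out.println("matches: " + hits.length());
        searcher.close();
    }
}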

- andy g


On Thu, 20 Jan 2005 16:02:26 -0600, Jerry Jalenak
<[EMAIL PROTECTED]> wrote:
> In looking at the examples for filtering of hits, it looks like I can only
> specify a single term; i.e.
> 
> Filter f = new QueryFilter(new TermQuery(new Term("acct",
> "acct1")));
> 
> I need to specify more than one term in my filter.  Short of using something
> like ChainFilter, how are others handling this?
> 
> Thanks!
> 
> Jerry Jalenak
> Senior Programmer / Analyst, Web Publishing
> LabOne, Inc.
> 10101 Renner Blvd.
> Lenexa, KS  66219
> (913) 577-1496
> 
> [EMAIL PROTECTED]
> 




Re: lucene integration with relational database

2005-01-18 Thread Andy Goodell
I do these kinds of queries all the time.  I found that the fastest
performance for my collections (millions of documents) came from
subclassing Filter using the set of primary keys from the database to
make the Filter, and then doing the query with the
Searcher.search(query, filter) interface.  I was previously using the
in-memory merge, but the memory requirements were crashing the JVM
when we had a lot of simultaneous users.
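
For what it's worth, the Filter subclass looks roughly like the sketch
below.  The "pk" field name and the key set are placeholders for whatever
your database query returns; this is the idea, not my production code:

import java.io.IOException;
import java.util.BitSet;
import java.util.Iterator;
import java.util.Set;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.search.Filter;

public class PrimaryKeyFilter extends Filter {
    private final Set keys;   // primary keys returned by the SQL query

    public PrimaryKeyFilter(Set keys) { this.keys = keys; }

    public BitSet bits(IndexReader reader) throws IOException {
        BitSet bits = new BitSet(reader.maxDoc());
        TermDocs termDocs = reader.termDocs();
        for (Iterator it = keys.iterator(); it.hasNext();) {
            // set the bit for every document whose "pk" field holds this key
            termDocs.seek(new Term("pk", (String) it.next()));
            while (termDocs.next()) {
                bits.set(termDocs.doc());
            }
        }
        termDocs.close();
        return bits;
    }
}

Usage is then just: Hits hits = searcher.search(query, new PrimaryKeyFilter(keysFromDb));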

- andy g


On Sat, 15 Jan 2005 23:03:00 +0530, sunil goyal <[EMAIL PROTECTED]> wrote:
> Hi all,
> 
> Thanks for the answers. I was looking for a best practice guide to do
> the same. If anyone already had had some practical experience with
> such kind of queries, it will be great to know his thoughts.
> 
> Thanks
> 
> Regards
> Sunil
> 
> 
> On Sat, 15 Jan 2005 09:00:35 -0800, jian chen <[EMAIL PROTECTED]> wrote:
> > Hi,
> >
> > Still minor additions to the steps:
> >
> > 1) do lucene query and get the hits (keyed by the database primary
> > key, for example, employee id)
> >
> > 2) do database query and get the primary keys (i.e., employee id) for
> > the result rows, ordered by primary key
> >
> > 3) for each lucene query result, look into db query result and see if
> > the primary key is there (since db query result is sorted already by
> > primary key, so, a binary search could be applied)
> >
> > if the primary key is there, store this result, else, discard it
> >
> > 4) when top k results are obtained, send back to the user.
> >
> > How does this sound?
> >
> > Cheers,
> >
> > Jian
> >
> > On Sat, 15 Jan 2005 08:36:16 -0800, jian chen <[EMAIL PROTECTED]> wrote:
> > > Hi,
> > >
> > > To further the discussion. Would the following detailed steps work:
> > >
> > > 1) do lucene query and get the hits (keyed by the database primary
> > > key, for example, employee id)
> > >
> > > 2) do database query and get the primary keys (i.e., employee id) for
> > > the result rows, ordered by primary key
> > >
> > > 3) merge the two sets of primary keys (for example, in memory two-way
> > > merge) and take the top k records
> > >
> > > 4) display the top k result rows
> > >
> > > Cheers,
> > >
> > > Jian
> > >
> > > On Sat, 15 Jan 2005 12:40:04 +, Peter Pimley <[EMAIL PROTECTED]> 
> > > wrote:
> > > > sunil goyal wrote:
> > > >
> > > > >But can i do for instance a unified query where i want to take certain
> > > > >parameters (non-textual e.g. age < 30 ) from relational databases and
> > > > >keywords from the lucene index ?
> > > > >
> > > > >
> > > > >
> > > > When I have had to do this, I've done the lucene search first, and then
> > > > manually filtered out the hits that fail on other criteria.
> > > >
> > > > I'd suggest doing that first (as it's easiest) and then seeing whether
> > > > the performance is acceptable.



Corrupted indexes

2004-10-22 Thread Andy Goodell
Recently, I've been getting a lot of corrupted Lucene indexes.  They
appear to return search results normally, but there is really no good
way to test whether information is missing.  The main problem is that
when I try to optimize, I get the following exception:

java.io.IOException: read past EOF
at 
org.apache.lucene.index.CompoundFileReader$CSInputStream.readInternal(CompoundFileReader.java:218)
at org.apache.lucene.store.InputStream.readBytes(InputStream.java:61)
at org.apache.lucene.index.SegmentReader.norms(SegmentReader.java:356)
at org.apache.lucene.index.SegmentReader.norms(SegmentReader.java:323)
at org.apache.lucene.index.SegmentMerger.mergeNorms(SegmentMerger.java:422)
at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:94)
at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:487)
at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:366)

This is preventing me from optimizing the indexes, and also scares me
that information might be missing.

Does anybody know what's going on here, and what might be wrong?

Thanks for your time,
- andy g




Re: index files version and lucene 1.4

2004-10-22 Thread Andy Goodell
I had this problem when I initially upgraded to 1.4, but Tomcat was
still searching with the old 1.3 jar.  Make sure you have fully
updated its path variables, include directories, etc.

- andy g 


On Fri, 22 Oct 2004 16:00:42 +0200, gaudinat
<[EMAIL PROTECTED]> wrote:
> Thanks,
> 
> Finally my problem seems to come from TOMCAT (5.0) and lucene 1.4
> installation.
> 
> To summarize:
> 
> Through Tomcat, with the same application (Lucene 1.4) and a 1.4 index I
> have no Hits, while I have Hits with a 1.3 index.
> Without Tomcat, with the same application (Lucene 1.4), I have Hits for
> both versions of index files, 1.3 and 1.4.
> 
> Does anyone have an idea, please?
> 
> Arno.
> 
> 
> 
> Aviran wrote:
> 
> >Lucene 1.4 changed the file format for indexes. You can access a old index
> >using lucene 1.4 but you can't access index which was created using lucene
> >1.4 with older versions.
> >I suggest you rebuild your index using lucene 1.4
> >
> >Aviran
> >http://aviran.mordos.com
> >
> >-Original Message-
> >From: arnaud gaudinat [mailto:[EMAIL PROTECTED]
> >Sent: Thursday, October 21, 2004 12:10 PM
> >To: Lucene Users List
> >Subject: index files version and lucene 1.4
> >
> >
> >Hi,
> >Certainly  a stupid question!
> >I have just upgraded to 1.4. I have succeeded in accessing my 1.3 index files
> >but not my new 1.4 index files. In fact I get no error, just no hits for the 1.4
> >index files. Also, I don't know if it's normal, but now I have just 3 files
> >for my index (.cfs, deletable and segments). However, if I use Luke with the
> >1.4 index files, it works perfectly.
> >
> >An idea?
> >
> >Regards,
> >
> >Arno.



removing duplicate Documents from Hits

2004-10-01 Thread Timm, Andy (ETW)
Hello, I've searched on previous posts on this topic but couldn't find an answer.  I 
want to query my index (which are a number of 'flattened' Oracle tables) for some 
criteria, then return Hits such that there are no Documents that duplicate a 
particular field.  In the case where table A has a one-to-many relationship to table 
B, I get one Document for each (A1-B1, A1-B2, A1-B3...).  My index needs to have each 
of these records as 'B' is a searchable field in the index.  However, after the query 
is executed, I want my resulting Hits to be unique on 'A'.  I'm only returning the 
Oracle object ID, so once I've seen it once I don't need it again.  It looks like some 
sort of custom Filter is in order.  My fix at the moment is to run the query, then 
store unique id's in a Map to build another query that will return singletons on field 
'A'.  I could skip this step if there was a way to remove documents from Hits (I 
didn't see a way).  Has anyone written a filter that does this?  Are there others 
using Lucene to mimic a relational DB?  I've got a complex SQL search that joins (mostly 
outer) some 40 tables.  Query performance is important, and the tables are relatively 
static.  I find the ID's of the objects that match the users' criteria, then go to the 
DB to instantiate them.  Any comments are appreciated.  
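
For what it's worth, the simplest in-memory version of that dedupe would
look something like the sketch below ("A" stands in for whatever field
holds the Oracle object ID; this just keeps the first hit per value rather
than building a second query):

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import org.apache.lucene.document.Document;
import org.apache.lucene.search.Hits;

public class HitDeduper {
    // walk the Hits in score order and keep only the first document
    // seen for each distinct value of the given field
    public static List uniqueOnField(Hits hits, String field) throws Exception {
        Set seen = new HashSet();
        List unique = new ArrayList();
        for (int i = 0; i < hits.length(); i++) {
            Document doc = hits.doc(i);
            String value = doc.get(field);
            if (seen.add(value)) {   // add() returns false for duplicates
                unique.add(doc);
            }
        }
        return unique;
    }
}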





Re: Problems with Lucene + BDB (Berkeley DB) integration

2004-09-20 Thread Andy Goodell
I used BDB + Lucene successfully with the Lucene 1.3 distribution,
but it broke in my application with the 1.4 distribution.  The 1.4
dist uses a different file format by default, the compound file
format (the .cfs files), so maybe that is the source of the issues.

good luck,
andy g


On Mon, 20 Sep 2004 19:36:51 -0300, Christian Rodriguez
<[EMAIL PROTECTED]> wrote:
> Hi everyone,
> 
> I am trying to use the Lucene + BDB integration from the sandbox
> (http://cvs.apache.org/viewcvs.cgi/jakarta-lucene-sandbox/contributions/db/).
> I installed C Berkeley DB 4.2.52 and I have the Lucene jar file.
> 
> I have an example program that indexes 4 small text files in a
> directory (its very similar to the IndexFiles.java in the Lucene demo,
> except that it uses BDB + Lucene). The problem I have is that
> executing the indexing program generates different results each time I
> run it. For example: If I start with an empty index, run the indexing
> program and then query the index I get the correct results; then I
> delete the index to start from scratch again, and perform the same
> sequence and I get no results. (?)
> 
> What puzzles me is the non-deterministic results... the same execution
> sequence generates two different results. I then wrote a program to
> dump the index and I found out that the list of files that end up in
> the index is different every time I index those 4 files.
> 
> For example:
> 1st run: contents of directory: _4.f2, _4.f3, _4.cfs, _4.fdx, _4.fnm,
> _4.frq, _4.prx, _4.tii, segments, deletable. (9 files)
> 2nd run: contents of directory: 0:_4.f1, _4.cfs, _4.fdt, _4.fdx,
> _4.fnm, _4.frq, _4.prx, _4.tii, _4.tis, segments, deletable. (11
> files)
> 
> Does anyone have any idea why this is happening?
> Has anyone been able to use the BDB + Lucene integration with no problems?
> 
> Id appreciate any help or pointers.
> Thanks!
> Xtian
> 



Re: Can I prevent Sort fields from influencing score?

2004-06-02 Thread Andy Goodell
I build the query myself; it's really easy.  I just use the normal query
parser with IndexReader.getFieldNames(true) and loop through all of
them to search everything at once.  You can either make a really big
BooleanQuery or make a bunch of small queries and merge the results,
depending on what kind of results you are looking for.  It's probably
not as fast as the one-big-data-field method, but speed is not an
issue yet for anything I've done, whereas code maintenance is a pain -
witness my question that started this thread.
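
Roughly, the loop looks like this (a sketch against the Lucene 1.4 API;
the analyzer choice, the OR-only clauses and the "SORTABLE" skip are just
one way to wire it up):

import java.util.Iterator;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;

public class AllFieldsQueryBuilder {
    // parse the user's query once per indexed field and OR the clauses together
    public static Query build(IndexReader reader, String userQuery) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        BooleanQuery combined = new BooleanQuery();
        for (Iterator it = reader.getFieldNames(true).iterator(); it.hasNext();) {
            String field = (String) it.next();
            if (field.indexOf("SORTABLE") != -1) {
                continue;   // skip sort-only fields so they don't influence the score
            }
            Query clause = QueryParser.parse(userQuery, field, analyzer);
            combined.add(clause, false, false);   // optional clause (OR semantics)
        }
        return combined;
    }
}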

- andy g

On Wed, 2 Jun 2004 13:43:41 -0700 , Gus Kormeier <[EMAIL PROTECTED]> wrote:
> 
> Just curious,
> Are you building your query or using a particular Query Parser?
> which one?
> 
> Are you using MultiFieldQueryParser?  I had problems with MFQP before and
> was looking for other solutions besides dumping fields into a massive
> "content" field.
> 
> TIA,
> -Gus
> 
> 
> 
> -Original Message-
> From: Andy Goodell [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, June 02, 2004 1:30 PM
> To: Lucene Users List
> Subject: Re: Can I prevent Sort fields from influencing score?
> 
> thanks that was my problem, i had code extending the search out to all
> the fields, now it only extends the search out to the fields i'm
> interested in.
> 
> - andy g
> 
> On Wed, 2 Jun 2004 14:21:24 -0500 , Tim Jones <[EMAIL PROTECTED]> wrote:
> >
> > This seems like it would be determined by how you generate your query - if
> > your query doesn't search in the sorted fields, they shouldn't affect the
> > scoring of your documents ...
> >
> >
> >
> > > -Original Message-
> > > From: Andy Goodell [mailto:[EMAIL PROTECTED]
> > > Sent: Wednesday, June 02, 2004 12:22 PM
> > > To: [EMAIL PROTECTED]
> > > Subject: Can I prevent Sort fields from influencing score?
> > >
> > >
> > > I have been using the new lucene 1.4 SortField implementation wih some
> > > custom fields added to old indexes so that the results can be sorted
> > > by them.  My problem here is that some of the String fields that I add
> > > to the index come up in the search terms, so my results in sort by
> > > score order are different.  Here's an example:
> > >
> > > I added the field AUTHOR_SORTABLE to most of the documents in the
> > > index.  But if one of the AUTHOR_SORTABLE field in a document is set
> > > to "andy", and i search for "andy", this document gets a very
> > > different score than it used to.
> > >
> > > Since my added fields aren't set in stone, I'm interested in a general
> > > solution, where all fields containing the text "SORTABLE" in the name
> > > aren't considered for matches, only for sorting.  Could I do this by
> > > overriding Similarity?  I tried doing this to set the lengthNorm() for
> > > each of my sortable fields to 0, but it hasnt worked yet.  Is there a
> > > different way to store the sortable fields that will prevent this?
> > >
> > > Any help would be greatly appreciated.
> > >
> > > - andy g



Re: Can I prevent Sort fields from influencing score?

2004-06-02 Thread Andy Goodell
Thanks, that was my problem: I had code extending the search out to all
the fields; now it only extends the search to the fields I'm
interested in.

- andy g

On Wed, 2 Jun 2004 14:21:24 -0500 , Tim Jones <[EMAIL PROTECTED]> wrote:
> 
> This seems like it would be determined by how you generate your query - if
> your query doesn't search in the sorted fields, they shouldn't affect the
> scoring of your documents ...
> 
> 
> 
> > -Original Message-
> > From: Andy Goodell [mailto:[EMAIL PROTECTED]
> > Sent: Wednesday, June 02, 2004 12:22 PM
> > To: [EMAIL PROTECTED]
> > Subject: Can I prevent Sort fields from influencing score?
> >
> >
> > I have been using the new lucene 1.4 SortField implementation wih some
> > custom fields added to old indexes so that the results can be sorted
> > by them.  My problem here is that some of the String fields that I add
> > to the index come up in the search terms, so my results in sort by
> > score order are different.  Here's an example:
> >
> > I added the field AUTHOR_SORTABLE to most of the documents in the
> > index.  But if one of the AUTHOR_SORTABLE field in a document is set
> > to "andy", and i search for "andy", this document gets a very
> > different score than it used to.
> >
> > Since my added fields aren't set in stone, I'm interested in a general
> > solution, where all fields containing the text "SORTABLE" in the name
> > aren't considered for matches, only for sorting.  Could I do this by
> > overriding Similarity?  I tried doing this to set the lengthNorm() for
> > each of my sortable fields to 0, but it hasnt worked yet.  Is there a
> > different way to store the sortable fields that will prevent this?
> >
> > Any help would be greatly appreciated.
> >
> > - andy g



Can I prevent Sort fields from influencing score?

2004-06-02 Thread Andy Goodell
I have been using the new Lucene 1.4 SortField implementation with some
custom fields added to old indexes so that the results can be sorted
by them.  My problem here is that some of the String fields that I add
to the index come up in the search terms, so my results in sort by
score order are different.  Here's an example:

I added the field AUTHOR_SORTABLE to most of the documents in the
index.  But if the AUTHOR_SORTABLE field in one of the documents is set
to "andy", and I search for "andy", this document gets a very
different score than it used to.

Since my added fields aren't set in stone, I'm interested in a general
solution, where all fields containing the text "SORTABLE" in the name
aren't considered for matches, only for sorting.  Could I do this by
overriding Similarity?  I tried doing this to set the lengthNorm() for
each of my sortable fields to 0, but it hasn't worked yet.  Is there a
different way to store the sortable fields that will prevent this?

Any help would be greatly appreciated.

- andy g




Re: 1.4 Sort API compatible with 1.3 index?

2004-06-01 Thread Andy Goodell
In my experience, the only barrier to using Sort with a 1.3 index is
that the Sort interface requires sortable fields to be indexed in a
certain way (not analyzed, indexed, and it doesn't matter if stored),
from the javadoc:

document.add (new Field ("byNumber", Integer.toString(x), false, true, false));

so unless you have fields of this prototype already in your index, you
may need to do some degree of re-indexing.  If you do already have the
fields indexed in this fashion, then you should be in good shape,
although I have only done cursory testing of this setup, since I have
migrated my setup entirely to 1.4.
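
Putting it together, something like this should work - an untested sketch
against the 1.4 API, with a made-up path and field names:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.*;

public class SortableFieldSketch {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter("/tmp/sort-index", new StandardAnalyzer(), true);
        Document doc = new Document();
        doc.add(Field.Text("contents", "some searchable text"));
        // store=false, index=true, token=false: indexed but not analyzed
        doc.add(new Field("byNumber", Integer.toString(42), false, true, false));
        writer.addDocument(doc);
        writer.close();

        IndexSearcher searcher = new IndexSearcher("/tmp/sort-index");
        Query q = new TermQuery(new Term("contents", "searchable"));
        Hits hits = searcher.search(q, new Sort(new SortField("byNumber", SortField.INT)));
        System.out.println("hits: " + hits.length());
        searcher.close();
    }
}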

- andy g

On Tue, 1 Jun 2004 07:56:27 -0700 (PDT), Greg Gershman
<[EMAIL PROTECTED]> wrote:
> 
> I looked around a bit, but couldn't find an answer to
> this question.  There doesn't seem to be any reason
> why it wouldn't, from what I can see, but I just want
> to make sure I don't have to rebuild my index to use
> the Sort functionality provided in 1.4 with an index
> build with 1.3.
> 
> Thanks!
> 
> Greg Gershman
> 
> __
> Do you Yahoo!?
> Friends.  Fun.  Try the all-new Yahoo! Messenger.
> http://messenger.yahoo.com/
> 



Re: How to handle range queries over large ranges and avoid Too Many Boolean clauses

2004-05-18 Thread Andy Goodell
In our application we had a similar problem with non-date ranges until
we realized that it wasn't so much that we were searching for the
values in the range as restricting the search to that range.  We then
used an extension of the org.apache.lucene.search.Filter class, and
our implementation got much simpler and faster.
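
A Filter along these lines would do it - a simplified sketch, not our exact
code, and it assumes the dates are indexed as sortable YYYYMMDD strings in
a field called "date", which may not match your setup.  It walks the term
dictionary between the two bounds and marks the matching documents, so no
BooleanQuery clauses are generated at all:

import java.io.IOException;
import java.util.BitSet;
import org.apache.lucene.index.*;
import org.apache.lucene.search.Filter;

public class DateRangeFilter extends Filter {
    private final String field, lower, upper;

    public DateRangeFilter(String field, String lower, String upper) {
        this.field = field;
        this.lower = lower;
        this.upper = upper;
    }

    public BitSet bits(IndexReader reader) throws IOException {
        BitSet bits = new BitSet(reader.maxDoc());
        TermEnum terms = reader.terms(new Term(field, lower));   // first term >= lower
        TermDocs termDocs = reader.termDocs();
        try {
            do {
                Term t = terms.term();
                // stop once we leave the field or pass the upper bound
                if (t == null || !t.field().equals(field) || t.text().compareTo(upper) > 0) {
                    break;
                }
                termDocs.seek(t);
                while (termDocs.next()) {
                    bits.set(termDocs.doc());
                }
            } while (terms.next());
        } finally {
            terms.close();
            termDocs.close();
        }
        return bits;
    }
}

You would then call searcher.search(query, new DateRangeFilter("date", "19790101", "19991231")).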

- andy g

On Tue, 18 May 2004 10:38:01 -0700, Claude Devarenne
<[EMAIL PROTECTED]> wrote:
> 
> Hi,
> 
> I have over 60,000 documents in my index which is slightly over a 1 GB
> in size.  The documents range from the late seventies up to now.  I
> have indexed dates as a keyword field using a string because the dates
> are in YYYYMMDD format.  When I do range queries things are OK as long
> as I don't exceed the built-in number of boolean clauses, so that's a
> range of 3 years, e.g. 1979 to 1981.  The users are not only doing
> complex queries but also want to query over long ranges, e.g. [19790101
> TO 19991231].
> 
> Given these requirements, I am thinking of doing a query without the
> date range, bring the unique ids back from the hits and then do a date
> query in the SQL database I have that contains the same data.  Another
> alternative is to do the query without the date range in Lucene and
> then sort the results within the range.  I still have to learn how to
> use the new sorting code and confessed I did not have time to look at
> it yet.
> 
> Is there a simpler, easier way to do this?
> 
> Claude
> 



Re: clean up html before indexing or add tags to ignore list

2004-05-13 Thread Andy Goodell
If you are running Linux, I recommend that before indexing with Lucene
you run the pages through the program lynx with the option -dump, which
dumps the formatted text without the tags and runs very fast in most
cases.
- andy g

On Thu, 13 May 2004 03:46:37 -0700 (PDT), Otis Gospodnetic
<[EMAIL PROTECTED]> wrote:
> 
> Clean up seems cleaner.  Just extract the textual information from HTML
> using NekoHTML or JTidy or HTMLParser (.sf.net) or some such.
> 
> You can also get fancy and preserve the 'structural' information (e.g.
> H1 text is more important that H2, which is more important than BODY,
> which is more important that DIV, etc.) and combine it with field
> boosting at index time.
> 
> Otis
> 
> 
> 
> --- Sebastian Ho <[EMAIL PROTECTED]> wrote:
> > Hi
> >
> > This is a typical web crawler, indexing and search application
> > development. I have wrote my crawler and planning to add lucene in
> > next.
> > One questions pop to my mind, in terms of performance, do i clean up
> > the
> > html removing all tags before indexing, or i add all tags into the
> > ignore list during indexing/search stage.
> >
> > Which is better?
> >
> > Thanks
> >
> > Sebastian Ho
> >



Re: Query performance on a 315 Million document index (1TB)

2004-05-07 Thread Andy Goodell
Although I've never indexed anything quite that large, I've had good
experiences with splitting the index out over a cluster.  (For
example, a set that takes about 4 seconds per complicated query on
one of our machines comes back in around a second when spread out over
6.)  I think the reason this helps is the disk I/O bound on
performance that the others have mentioned: adding another disk array
adds to the effective disk bandwidth.

good luck
- andy g

On Fri, 07 May 2004 04:47:55 +0500, Will Allen <[EMAIL PROTECTED]> wrote:
> 
> Hi,
> I am considering a project that would index 315+ million documents. I am 
> comfortable that the indexing will work well in creating an index ~800GB in size, 
> but am concerned about the query performance. (Is this a bad
> assumption?)
> 
> What are the bottlenecks of performance as an index scales?  Memory?  Cost is not 
> a concern, so what would be the shortcomings of a theoretical machine with 16GB of 
> ram, 4-16 cpus and 1-2 terabytes of space?  Would it be better to cluster machines 
> to break apart the query?
> 
> Thank you for your serious responses,
> Will Allen



Bug in Sandbox - Berkeley DB

2004-04-28 Thread Andy Goodell
IndexReader.delete(int docid) doesn't work with the Berkeley DB
implementation of org.apache.lucene.store.Directory

This error message appears when closing an IndexReader which has a deletion:
PANIC: Invalid argument

I get this stack trace:
java.io.IOException: DB_RUNRECOVERY: Fatal error, run database recovery
   at org.apache.lucene.store.db.Block.put(Block.java:128)
   at org.apache.lucene.store.db.DbOutputStream.close(DbOutputStream.java:111)
   at org.apache.lucene.util.BitVector.write(BitVector.java:155)
   at org.apache.lucene.index.SegmentReader$1.doBody(SegmentReader.java:162)
   at org.apache.lucene.store.Lock$With.run(Lock.java:148)
   at org.apache.lucene.index.SegmentReader.doClose(SegmentReader.java:157)
   at org.apache.lucene.index.IndexReader.close(IndexReader.java:422)

Help!

- andy g

code that triggers this:
// dbdir is a working DbDirectory, docid is a search result
IndexReader read = IndexReader.open(dbdir);
read.delete(docid);
read.close();




Re: Lucene MBean service for JBoss

2003-09-25 Thread Andy Scholz
Thanks Otis...

With any luck my current employer will also chip in a few bucks to help 
maintain the project (I'm working on it)...

cheers
-andy
Otis Gospodnetic wrote:

Thanks, I'm finally including this on the Contributions page.

Otis

--- Andy Scholz <[EMAIL PROTECTED]> wrote:
 

Hi All,

For those that may be interested, I have written a full text indexing
service for the JBoss application server that uses Lucene as its
engine. It allows Lucene to be used as a service rather than a
standalone app with thread pooling, access synchronization, management
etc. Index and search interfaces are accessible via JNDI and remotely
via session EJBs.

Additionally I have provided content filters for common formats like
HTML, MSWord, MSExcel, XML etc (with some help from other projects). A
simple interface also allows you to write your own filters for
different formats.

It is available under an LGPL license; source code, binaries and info
are available here:
http://ejindex.sourceforge.net

I'd love to get some feedback, so if you're interested, please let me
know your comments or suggestions ;)

regards,
Andy Scholz








Re: Lucene MBean service for JBoss

2003-07-29 Thread Andy Scholz
Hi Dan,

I'm not sure what's going on there; I've checked the moveNext and it seems OK. 
It seems that somehow the page got indexed twice, or two hits are being 
returned for some reason. I can't replicate this though, so I'd appreciate 
any more info you might be able to give me - e.g. if you comment out the 
setDocumentURL call (so that only metadata is indexed) and change the query 
to something like "title:jboss", does it still return two hits? Also there 
are some unit tests in the ejindexXX_tests.zip file you might want to run - 
this has a bunch of tests that exercise the service both locally and remotely - 
these should(!) fail if there is a problem and hopefully give more 
indication as to what the problem might be.

Thanks for your feedback!

Regards,
Andy Scholz
Hi Andy
This looks like a very useful MBean (quite a bit more developed than the 
one I was working on).

One quick query on the quickstart example though, when I run it I get the 
output twice:






Re: Spanish analyzer and Indexing StarOffice docs

2003-07-21 Thread Andy Scholz
We tried the UDK approach late last year but it was an awfully clumsy 
solution - requiring you to actually run an OO app instance as a 'server'. It 
kind of worked, but the show-stopper was that OO is so tied into the UI that 
whenever an error occurred (i.e. file not found etc.), a dialog box popped up 
and nothing else happened until it was acknowledged. In fact, AFAIK you have 
to have X Windows (or some other GUI) for it to even run at all (the 
giveaway being the splash screen that shows when you start OO via the SDK 
interface).

Apparently a new command-line option was being added to suppress the UI (but 
then how do you get error messages?), but it was only in the CVS head (as of 
late last year), and we gave up on it because it was just too messy a 
solution for our server-side needs. If you want to use it as an app on a 
workstation though, it might work fine for you.

There also was(is) a project underway to provide a filter interface (called 
x-filter I think) that provides a set of import/export filters for OO that 
would be ideal to use for text indexing purposes, but I think it will be 
quite some time before that becomes available.

I haven't looked at it in a while, but I'd stick to Peter's first 
suggestion - unzip it and read the XML.
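
A minimal sketch of that route - the body text lives in content.xml inside
the zip, and a real indexer would still need to strip the XML markup before
handing the text to Lucene:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class OpenOfficeContentReader {
    // OpenOffice/StarOffice documents are zip archives; pull content.xml out as a string
    public static String readContentXml(String path) throws IOException {
        ZipFile zip = new ZipFile(path);
        try {
            ZipEntry entry = zip.getEntry("content.xml");
            if (entry == null) {
                return null;
            }
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(zip.getInputStream(entry), "UTF-8"));
            StringBuffer sb = new StringBuffer();
            String line;
            while ((line = in.readLine()) != null) {
                sb.append(line).append('\n');
            }
            in.close();
            return sb.toString();
        } finally {
            zip.close();
        }
    }
}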

cheers
-andy


Lucene MBean service for JBoss

2003-07-20 Thread Andy Scholz
Hi All,

For those that may be interested, I have written a full text indexing 
service for the JBoss application server that uses Lucene as its engine. It 
allows Lucene to be used as a service rather than a standalone app with 
thread pooling, access synchronization, management etc. Index and search 
interfaces are accessible via JNDI and remotely via session EJBs.

Additionally I have provided content filters for common formats like HTML, 
MSWord, MSExcel, xml etc (with some help from other projects). A simple 
interface also allows you to write your own filters for different formats.

It is available under an LGPL license; source code, binaries and info 
are available here:
http://ejindex.sourceforge.net

I'd love to get some feedback, so if you're interested, please let me know 
your comments or suggestions ;)

regards,
Andy Scholz