Re: sounds like spellcheck

2005-02-09 Thread Kelvin Tan
Hey Aad, I believe http://jakarta.apache.org/lucene/docs/contributions.html has 
a link to Phonetix 
(http://www.companywebstore.de/tangentum/mirror/en/products/phonetix/index.html),
 an LGPL-licensed lib for phonetic algorithms like Soundex, Metaphone and 
DoubleMetaphone. There are Lucene adapters.

As to the suitability of the algorithms, I haven't taken a look at the Phonetix 
implementation, but if 
http://spottedtiger.tripod.com/D_Language/D_DoubleMetaPhone.html is anything to 
go by (do a search for "dutch"), then it should meet your needs, or at least 
won't be difficult to customize.
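
For reference, a phonetic analyzer is mostly just a TokenFilter that replaces
each token with its phonetic code. A minimal sketch against the Lucene
1.x-style API, assuming an encoder with a String encode(String) method (the
import below is the Jakarta Commons Codec DoubleMetaphone, not the Phonetix
class):

import java.io.IOException;
import org.apache.commons.codec.language.DoubleMetaphone;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class PhoneticFilter extends TokenFilter {
    private final DoubleMetaphone encoder = new DoubleMetaphone();

    public PhoneticFilter(TokenStream in) {
        super(in);
    }

    // replace each token with its phonetic code, so "sgool" and "school"
    // index (and search) to the same term
    public final Token next() throws IOException {
        Token t = input.next();
        if (t == null) return null;
        return new Token(encoder.encode(t.termText()),
                         t.startOffset(), t.endOffset());
    }
}

Use the same filter at both index and query time so the two sides agree on the
codes.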

Is that what you're looking for?

k

On Wed, 09 Feb 2005 13:23:57 +0100, Aad Nales wrote:
> In my Clipper days I could build an index on English words using a
> technique that was called soundex. Searching in that index resulted
> in hits of words that sounded the same. From what I remember, this
> technique only worked for English. Has it ever been generalized?
>
> What I am trying to solve is this. A customer is looking for a
> solution to spelling mistakes made by children (up to 10) when
> typing in queries. The site is Dutch. Common mistakes are 'sgool'
> when searching for 'school'. The 'normal' spellcheckers and
> suggestors typically generate a list where the 'sounds like'
> candidates' are too far away from the result. So what I am thinking
> about doing is this:
>
> 1. create a parser that takes a word and creates a soundindex entry.
>
> 2. create list of 'correctly' spelled words either based on the
> index of the website or on some kind of dictionary.
> 2a. perhaps create a n-gram index based on these words
>
> 3. accept a query, figure out that a spelling mistake has been made
> 3a. find alternatives by parsing the query and searching the 'sounds-
> like' index, and then calculate and order the results
>
> Steps 2 and 3 have been discussed at length in this forum and have
> even made it to the sandbox. What I am left with is 1.
>
> My thinking is processing a series of replacement statements that
> go like: --
> g sounds like ch if the immediate predecessor is an s. o sounds
> like oo if the immediate predecessor is a consonant --
>
> But before I take this to the next step, I am wondering if anybody
> has created or thought up alternative solutions?
>
> Cheers,
> Aad
>



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Problem searching Field.Keyword field

2005-02-08 Thread Kelvin Tan
Erik, I was thinking about the case where

category:"document management"
category:"document publishing"

and the user wants to search category:document and have both turn up. But 
that's obviously not the use-case in the situation of a drop-down, so you're 
right about this, Field.Keyword is correct here. Sorry for misleading you, Mike.
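
For the drop-down case, something along these lines (a sketch against the
Lucene 1.x API; "contents" and userInput are placeholders):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

// the drop-down value goes in as a TermQuery, untouched by any analyzer
BooleanQuery combined = new BooleanQuery();
combined.add(new TermQuery(new Term("category", "How To")), true, false);
// the free-text part of the form can still go through QueryParser
// (parse() throws ParseException)
combined.add(QueryParser.parse(userInput, "contents", new StandardAnalyzer()),
             true, false);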

k

On Tue, 8 Feb 2005 12:02:15 -0500, Erik Hatcher wrote:
> Kelvin - I respectfully disagree - could you elaborate on why this
> is not an appropriate use of Field.Keyword?
>
> If the category is "How To", Field.Text would split this (depending
> on the Analyzer) into "how" and "to".
>
> If the user is selecting a category from a drop-down, though, you
> shouldn't be using QueryParser on it, but instead aggregating a
> TermQuery("category", "How To") into a BooleanQuery with the rest
> of it.  The rest may be other API-created clauses and likely a
> piece from QueryParser.
>
> Erik
>
>
> On Feb 8, 2005, at 11:28 AM, Kelvin Tan wrote:
>
>> As I posted previously, Field.Keyword is appropriate in only
>> certain situations. For your use-case, I believe Field.Text is
>> more suitable.
>>
>> k
>>
>> On Tue, 8 Feb 2005 10:02:19 -0600, Mike Miller wrote:
>>
>>> This may or may not be correct, but I am indexing it as a
>>> keyword  because I provide a (required) radio button on the add
>>> screen for  the user to determine which category the document
>>> should be  assigned.  Then in the search, provide a dropdown
>>> that can be used  in the advanced search so that they can
>>> search only for a specific  category of documents (like HowTo,
>>> Troubleshooting, etc).
>>>
>>> -Original Message-
>>> From: Kelvin Tan [mailto:[EMAIL PROTECTED] Sent:
>>> Tuesday,  February 08, 2005 9:32 AM To: Lucene Users List
>>> Subject: RE: Problem searching Field.Keyword field
>>>
>>> Mike, is there a reason why you're indexing "category" as
>>> keyword  not text?
>>>
>>> k
>>>
>>> On Tue, 8 Feb 2005 08:26:13 -0600, Mike Miller wrote:
>>>
>>>> Thanks for the quick response.
>>>>
>>>> Sorry for my lack of understanding, but I am learning!  Won't
>>>> the query parser still handle this query?  My limited
>>>> understanding was that the search call provides the 'all' field
>>>> as the default field for query terms in the case where fields
>>>> aren't specified.  Using the current code, searches like
>>>> author:Mike and title:Lucene work fine.
>>>>
>>>> -Original Message-
>>>> From: Miles Barr [mailto:[EMAIL PROTECTED] Sent:
>>>> Tuesday, February 08, 2005 8:08 AM To: Lucene Users List
>>>> Subject: Re: Problem searching Field.Keyword field
>>>>
>>>> You're using the query parser with the standard analyser. You
>>>> should construct a term query manually instead.
>>>>
>>>>
>>>> --
>>>> Miles Barr <[EMAIL PROTECTED]> Runtime Collective
>>>> Ltd.



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Problem searching Field.Keyword field

2005-02-08 Thread Kelvin Tan
As I posted previously, Field.Keyword is appropriate in only certain 
situations. For your use-case, I believe Field.Text is more suitable.

k

On Tue, 8 Feb 2005 10:02:19 -0600, Mike Miller wrote:
> This may or may not be correct, but I am indexing it as a keyword
> because I provide a (required) radio button on the add screen for
> the user to determine which category the document should be
> assigned.  Then in the search, provide a dropdown that can be used
> in the advanced search so that they can search only for a specific
> category of documents (like HowTo, Troubleshooting, etc).
>
> -Original Message-
> From: Kelvin Tan [mailto:[EMAIL PROTECTED] Sent: Tuesday,
> February 08, 2005 9:32 AM To: Lucene Users List
> Subject: RE: Problem searching Field.Keyword field
>
> Mike, is there a reason why you're indexing "category" as keyword
> not text?
>
> k
>
> On Tue, 8 Feb 2005 08:26:13 -0600, Mike Miller wrote:
>
>> Thanks for the quick response.
>>
>> Sorry for my lack of understanding, but I am learning!  Won't the
>> query parser still handle this query?  My limited understanding
>> was that the search call provides the 'all' field as the default
>> field for query terms in the case where fields aren't specified.
>> Using the current code, searches like author:Mike and
>> title:Lucene work fine.
>>
>> -Original Message-
>> From: Miles Barr [mailto:[EMAIL PROTECTED] Sent:  
>> Tuesday, February 08, 2005 8:08 AM To: Lucene Users List Subject:
>>  Re: Problem searching Field.Keyword field
>>
>> You're using the query parser with the standard analyser. You  
>> should construct a term query manually instead.
>>
>>
>> --
>> Miles Barr <[EMAIL PROTECTED]> Runtime Collective Ltd.



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Problem searching Field.Keyword field

2005-02-08 Thread Kelvin Tan
Mike, is there a reason why you're indexing "category" as keyword not text?

k

On Tue, 8 Feb 2005 08:26:13 -0600, Mike Miller wrote:
> Thanks for the quick response.
>
> Sorry for my lack of understanding, but I am learning!  Won't the
> query parser still handle this query?  My limited understanding was
> that the search call provides the 'all' field as the default field for
> query terms in the case where fields aren't specified.  Using the
> current code, searches like author:Mike and title:Lucene work fine.
>
> -Original Message-
> From: Miles Barr [mailto:[EMAIL PROTECTED] Sent:
> Tuesday, February 08, 2005 8:08 AM To: Lucene Users List Subject:
> Re: Problem searching Field.Keyword field
>
> You're using the query parser with the standard analyser. You
> should construct a term query manually instead.
>
>
> --
> Miles Barr <[EMAIL PROTECTED]> Runtime Collective Ltd.
>



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Problem searching Field.Keyword field

2005-02-08 Thread Kelvin Tan
The Javadoc for Field.Keyword says:

Constructs a Date-valued Field that is not tokenized and is indexed, and stored 
in the index, for return with hits.

For most purposes dealing with Strings, use Field.Text, unless you have a date, 
a GUID or some other string you don't want tokenized or processed in any way. 
This basically means that Field.Keyword indexes the field as-is.
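
A quick sketch of the difference, using Erik's "How To" example:

doc.add(Field.Keyword("category", "How To")); // indexed as the single term "How To"
doc.add(Field.Text("category", "How To"));    // analyzed, e.g. into "how" and "to"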

k

On Tue, 8 Feb 2005 07:54:57 -0600, Mike Miller wrote:
> First let me say - Awesome tool!  Almost too easy to be true, but
> with that being said...
>
> Hi,  I have read several articles and postings that indicate that
> the Field.Keyword field should be searchable but it's not working
> for me, until I change to Field.Text.  Parts of the index and
> search code are included below - mostly lifted from articles, etc.,
> including Erik Hatcher's article on java.net.  I created a small
> KnowledgeBase web application that contains a category field, which
> I want to be searchable. Searching using a query string of
> category:Doc* or
> category:Documentation does not find a hit unless I change the code
> to add the category to the index as a Field.Text instead of
> Field.Keyword. The field value is out there:   I have verified this
> using the TermEnum to list the term values for field category and
> Documentation is in the list of values.
>
> The intention is to provide a 'Advanced Search' page that allows
> the user to search specific fields, like category, title and author
> instead of always using the 'all' field.
>
> What am I doing wrong???     Thanks in advance.
>
> Index code:
>
> public boolean index(ArticleFormBean article) throws IOException {
> IndexWriter writer = new IndexWriter(indexDir, new
> StandardAnalyzer(), false);
>
> Document doc = new Document();
> doc.add(Field.UnStored("content", article.getContent()));
> doc.add(Field.Text("title", article.getTitle()));
> doc.add(Field.Text("author", article.getAuthor()));
> doc.add(Field.UnIndexed("articleId",
> String.valueOf(article.getArticleId(;
> doc.add(Field.Keyword("createdDate", article.getCreateDate()));
> doc.add(Field.Keyword("modDate", article.getModDate()));
> doc.add(Field.Keyword("category", article.getCategory()));
>
> // create an 'all' field
> StringBuffer sb = new StringBuffer(4000);
> sb.append(article.getTitle()).append("
> ").append(article.getAuthor()).append(" ");
> sb.append(article.getContent()).append("
> ").append(article.getCategory());
> doc.add(Field.UnStored("all", sb.toString()));
>
> writer.addDocument(doc);
> writer.optimize();
> writer.close();
>
> return false;
> }
>
> Search code:
> File indexDir = new File("c:/dev/java/kb/index");
> Directory fsDir = FSDirectory.getDirectory(indexDir, false);
> IndexSearcher is = new IndexSearcher(fsDir);
> Query query = QueryParser.parse(q, "all", new StandardAnalyzer());
> Hits hits = is.search(query);
>
>
> Mike Miller
> JDA Software Group, Inc.
> 7501 Ester's Blvd, Suite 100
> Irving, Texas 75063



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Retrieve all documents - possible?

2005-02-07 Thread Kelvin Tan
Don't forget to test whether a document has been deleted with
reader.isDeleted(i). A minimal sketch combining that check with Bernhard's
loop below:
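
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;

IndexReader reader = IndexReader.open("C:/index"); // path is illustrative
int maxDoc = reader.maxDoc();
for (int i = 0; i < maxDoc; i++) {
    if (reader.isDeleted(i)) continue; // skip deleted slots
    Document doc = reader.document(i);
    // ... do something with doc
}
reader.close();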

On Mon, 07 Feb 2005 12:09:35 +0100, Bernhard Messer wrote:
> you could use something like:
>
> int maxDoc = reader.maxDoc();
> for (int i = 0; i < maxDoc; i++) {
> Document doc = reader.document(i);
> }
>
> Bernhard
>
>> Hi,
>>
>> is it possible to retrieve ALL documents from a Lucene index?
>> This should then actually not be a search...
>>
>> Karl
>



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: PHP-Lucene Integration

2005-02-06 Thread Kelvin Tan
How about XML-RPC/SOAP, or REST?

For REST, just have a servlet listening for HTTP GETs and responding with XML
that your PHP app can parse (for searching). For indexing, let's say you want to
index an uploaded file: construct a URL with the fields and field values, and
also pass the location of the file on the FS. Shouldn't be that difficult.
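
A minimal sketch of the search half (Lucene 1.x API; the field names
"contents" and "title" and the index path are assumptions, and real code would
XML-escape the output):

import java.io.IOException;
import java.io.PrintWriter;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class SearchServlet extends HttpServlet {
    protected void doGet(HttpServletRequest req, HttpServletResponse res)
            throws IOException {
        try {
            Query query = QueryParser.parse(req.getParameter("q"), "contents",
                                            new StandardAnalyzer());
            IndexSearcher searcher = new IndexSearcher("/path/to/index");
            Hits hits = searcher.search(query);
            res.setContentType("text/xml");
            PrintWriter out = res.getWriter();
            out.println("<?xml version=\"1.0\"?>");
            out.println("<results count=\"" + hits.length() + "\">");
            for (int i = 0; i < hits.length(); i++) {
                out.println("  <hit score=\"" + hits.score(i) + "\">"
                            + hits.doc(i).get("title") + "</hit>");
            }
            out.println("</results>");
            searcher.close();
        } catch (Exception e) {
            res.sendError(500, e.getMessage());
        }
    }
}

Your PHP side then just does an HTTP GET on /search?q=... and parses the XML.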

I'm guessing it's more desirable to have all your code in one place, which is an 
advantage of using Java in PHP. But it feels cleaner to have the Java stuff in 
one codebase and the PHP in another; it may make debugging easier. No idea how 
widely used the PHP-Java binding is.

k

On Sun, 6 Feb 2005 10:10:36 -0700, Owen Densmore wrote:
> I'm building a lucene project for a client who uses php for their
> dynamic web pages.  It would be possible to add servlets to their
> environment easily enough (they use apache) but I'd like to have
> minimal impact on their IT group.
>
> There appears to be a php java extension that lets php call back &
> forth to java classes, but I thought I'd ask here if anyone has had
> success using lucene from php.
>
> Note: I looked in the Lucene In Action search page, and yup, I
> bought the book and love it!  No examples there tho.  The list
> archives mention that using java lucene from php is the way to go,
> without saying how.  There's mention of a lucene server and a php
> interface to that.  And some similar comments.  But I'm a bit
> surprised there's not a bit more in terms of use of the official
> java extension to php.
>
> Thanks for the great package!
>
> Owen
>



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Parsing The Query: Every document that doesn't have a field containing x

2005-02-03 Thread Kelvin Tan
Alternatively, add a dummy field-value to all documents, like 
doc.add(Field.Keyword("foo", "bar"))

Waste of space, but allows you to perform negated queries.
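
A sketch of the resulting query, against the Lucene 1.x BooleanQuery API (the
kcfileupload field is from the example below):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

BooleanQuery query = new BooleanQuery();
// required clause that matches every document, thanks to the dummy field
query.add(new TermQuery(new Term("foo", "bar")), true, false);
// the actual negation
query.add(new TermQuery(new Term("kcfileupload", "jpg")), false, true);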

On Thu, 03 Feb 2005 19:19:15 +0100, Maik Schreiber wrote:
>> Negating a term must be combined with at least one nonnegated
>> term to return documents; in other words, it isn't possible to
>> use a query like NOT term to find all documents that don't
>> contain a term.
>>
>> So does that mean the above example wouldn't work?
>>
> Exactly. You cannot search for "-kcfileupload:jpg", you need at
> least one clause that actually _includes_ documents.
>
> Do you by chance have a field with known contents? If so, you could
> misuse that one and include it in your query (perhaps by doing
> range or wildcard/prefix search). If not, try IndexReader.terms()
> for building a Query yourself, then use that one for search.



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: QueryParser Help

2005-02-02 Thread Kelvin Tan
Can you provide more information? Your syntax looks OK. Is it possible that one 
of the docs has been deleted? Sounds stupid, but have you tried searching via 
a manual query, or using one of the Lucene index browsers?

Kelvin

On Wed, 2 Feb 2005 18:05:47 -0500, Luke Shannon wrote:
> Hello;
>
> Getting acquainted with query parsing.  I have a question:
>
> Query query = MultiFieldQueryParser.parse("mario",
>     new String[] { "name", "desc" },
>     new int[] { MultiFieldQueryParser.NORMAL_FIELD,
>                 MultiFieldQueryParser.NORMAL_FIELD },
>     new StandardAnalyzer());
> IndexSearcher searcher = new IndexSearcher(fsDir);
> Hits hits = searcher.search(query);
> System.out.println("Keywords : " + hits.length() + " " + query.toString());
> assertEquals(2, hits.length());
>
> This test is successful.
>
> But, I know "name" contains 2 documents, I also know "desc"
> contains one. This may be a dumb question but why does Hits not
> contain pointers to 3 results (1 from name, 2 from desc)?
>
> Thanks
>
> Luke
>
>



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: lucene docs in bulk read?

2005-02-01 Thread Kelvin Tan


On Tue, 1 Feb 2005 14:12:54 -0800, Chris Fraschetti wrote:
> Definitely a good idea on the one line idea... that could possibly
> save a good amount of time. I'm using .stringValue ... in reality,
> I hadn't ever even considered readerValue ... is there a strong
> performance difference between the two? or is it simply on the
> functionality side?

Not that I'm aware of (performance-wise). Reader fields are useful for reading in 
bulky data which doesn't make sense to load into memory as a String.

K



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: lucene docs in bulk read?

2005-02-01 Thread Kelvin Tan
Please see inline.

On Tue, 1 Feb 2005 09:27:26 -0800, Chris Fraschetti wrote:
> Well all my fields are strings when I index them. They're all very
> short strings, dates, hashes, etc. The largest field has a cap of
> 256 chars and there is only one of them, the rest are all fairly
> small.
>
> Can you explain what you meant by 'string or reader' ?

Sorry, I meant to ask if you're using String fields (field.stringValue()) or 
reader fields (field.readerValue()).

Can you elaborate on the post-processing you need to do? Have you thought about 
concatenating the fields you require into a single non-indexed field 
(Field.UnIndexed) for simple retrieval? It'll increase the size of your index, 
but should be faster to retrieve them all at one go.
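
A sketch of what I mean (the field name and separator are made up):

// index time: one stored-but-not-indexed field holding everything the
// post-processing step needs
doc.add(Field.UnIndexed("retrievable", date + "|" + hash + "|" + title));

// search time: a single stored-field read per hit instead of several
String[] parts = hits.doc(i).get("retrievable").split("\\|");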

Kelvin


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: lucene docs in bulk read?

2005-02-01 Thread Kelvin Tan
Hi Chris, are your fields string or reader? How large do your fields get?

Kelvin

On Tue, 1 Feb 2005 01:40:39 -0800, Chris Fraschetti wrote:
> Hey folks.. thanks in advance to any who respond...
>
> I do a good deal of post-search processing and the file io to read
> the fields I need becomes horribly costly and is definitely a
> problem. Is there any way to either retrieve 1. the entire doc (all
> fields that can be retrieved) and/or 2. a group of docs.. specified
> by say an array of doc ids?
>
> I've optimized to retrieve the entire list of fields instead of 1
> by 1.. and also retrieve only the minimal number of fields that I
> can.. but still my profilers show me that the lucene io to read the
> doc fields is where I spend 95% of my time. Of course this is
> obvious given the nature of how it all works.. but can anyone think
> of a better way to go about retrieving docs in bulk? Are the
> different types of fields quicker/slower than others when
> retrieving them from the index?



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Use an executable from java ...

2005-01-31 Thread Kelvin Tan
Check out http://www.javaworld.com/javaworld/jw-12-2000/jw-1229-traps.html 
which provides some pointers and code which should be helpful.
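
The two classic traps from that article, in sketch form: a single command
string doesn't get tokenized the way a shell would do it (pass a String[]
instead), and a child process can block forever if its stdout/stderr are never
drained (the article's StreamGobbler does this on separate threads; it's
inlined here for brevity):

import java.io.BufferedReader;
import java.io.InputStreamReader;

String[] cmd = { "pdftohtml", fileName, output }; // no /bin/sh, no quoting games
Process proc = Runtime.getRuntime().exec(cmd);

// drain output so the child can't block on a full pipe buffer; for chatty
// programs, read stdout and stderr on separate threads as the article shows
BufferedReader out = new BufferedReader(new InputStreamReader(proc.getInputStream()));
BufferedReader err = new BufferedReader(new InputStreamReader(proc.getErrorStream()));
String line;
while ((line = out.readLine()) != null) { /* log or discard */ }
while ((line = err.readLine()) != null) { /* log or discard */ }
int exit = proc.waitFor(); // throws InterruptedException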

Cheers,
Kelvin
http://www.supermind.org

On Mon, 31 Jan 2005 19:01:11 +0100, Bertrand VENZAL wrote:
> Hi all,
>
> I've a kind of problem executing a converting tool to modify a PDF
> to HTML under Linux. In fact, I have an executable "pdftohtml"
> which works correctly in batch mode, and when I use it
> through Java under Windows 2000 it works also, BUT it does not work at
> all on the server under Linux. I'm using the following code.
>
> scommand = "/bin/sh -c \"myCommand fileName output\" ";
>
> Runtime runtime = Runtime.getRuntime();
> Process proc = runtime.exec(scommand);
> proc.waitFor();
>
>
> I'm running my code under Linux-redhat with a classic shell. Is
> there another way to do the same thing, or maybe am I missing
> something? Any help will be greatly appreciated.
>
> Thanks
> Bertrand



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Having common word in the search

2004-08-02 Thread Kelvin Tan
I think the answer really depends on the query input source and how savvy the
users are.

If the source is a web-based form AND users only enter "basic" searches, then
lucenequeryconstructor.js in sandbox does an adequate job of building complex
queries from a simple form. Alternatively, just use javascript to modify the
query before form submission.

In any event, many people seem to miss the second parse method in
MultiFieldQueryParser:

http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/queryParser/MultiFieldQueryParser.html#parse(java.lang.String, java.lang.String[], int[], org.apache.lucene.analysis.Analyzer)

Still, the queries that can be constructed using MultiFieldQueryParser aren't
complex, as compared to lucenequeryconstructor for instance.

On Mon, 2 Aug 2004 13:18:38 +0530, lingaraju said:
> Dear  All
> Searcher searcher = new IndexSearcher("C:/index");
> Analyzer analyzer = new StandardAnalyzer();
> String line="curry asia";
> line=line+"recipe";
> String fields[] = new String[2];
> fields[0] = "title";
> fields[1] = "contents";
> Query q = MultiFieldQueryParser.parse(line,fields,analyzer);
> Hits hits1 = searcher.search(q);
> In the above code Hits will return the documents that contain
> the word
> 1)"Curry OR asia OR recipe"
> 2)"Curry OR asia AND recipe"
> 3)"Curry AND asia AND recipe"
> 4)"Curry AND asia OR recipe"
> But I want the result should be
> Like this
> 1)"Curry AND asia AND recipe"
> 2)("Curry OR asia) AND recipe"
> My question is how to give the condition
> Actually my requirement is like this
> User will enter some text in the "text box"; it may be one word, two words,
> or n words (e.g. "curry asia"),
> but when I am searching I will append the word "recipe" to the search
> string, so the search must
> contain the word "recipe".
> Finally the search should contain
> 1) "Curry AND asia AND recipe"
> 2) "(Curry OR asia) AND recipe"
> and the search should not contain
> 1) "Curry AND asia OR recipe"
> 2) "Curry OR asia OR recipe"
>
> Thanks and regards
> Raju




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: authentication support in lucene

2004-07-23 Thread Kelvin Tan

On Fri, 23 Jul 2004 10:09:25 +0100, Dave Spencer said:
>> I implemented ACL checking via Filters. Caching filters definitely helps, but
>> may not be applicable in every situation. I stored the UUID of each document
>> in the database as well as in Lucene. That way, by retrieving a list of
>> accessible documents via SQL, I can create the necessary BitSet.
>
> Maybe the only hope then is different indexes based on coarse-grained
> "roles", not fine-grained ACLs.

That really depends on how much access (low/high-level) you have to the security
subsystem.

Different indexes can be pretty expensive to maintain, and creating a new role
involves creation of a new index? *ugh*

kelvin


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: authentication support in lucene

2004-07-23 Thread Kelvin Tan
If you don't have low-level access to the framework that can retrieve a batch
list of accessible IDs, document-by-document checking of ACL will be _painful_.

I implemented ACL checking via Filters. Caching filters definitely helps, but
may not be applicable in every situation. I stored the UUID of each document in
the database as well as in Lucene. That way, by retrieving a list of accessible
documents via SQL, I can create the necessary BitSet.
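
The Filter half looks roughly like this (Lucene 1.x API; "uuid" is whatever
field you stored the UUID under, and the SQL retrieval is not shown):

import java.io.IOException;
import java.util.BitSet;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.search.Filter;

public class AclFilter extends Filter {
    private final String[] accessibleUuids; // fetched via SQL beforehand

    public AclFilter(String[] accessibleUuids) {
        this.accessibleUuids = accessibleUuids;
    }

    public BitSet bits(IndexReader reader) throws IOException {
        BitSet bits = new BitSet(reader.maxDoc());
        for (int i = 0; i < accessibleUuids.length; i++) {
            // map each accessible UUID to its Lucene doc id
            TermDocs td = reader.termDocs(new Term("uuid", accessibleUuids[i]));
            while (td.next()) {
                bits.set(td.doc());
            }
            td.close();
        }
        return bits;
    }
}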

Kelvin

On Thu, 22 Jul 2004 19:59:27 +0200, John Wang said:
> Hi:
> Maybe this has been asked before.
> Is there a plan to support ACL check on the documents in lucene?
> Say I have a customized ACL check module, e.g.:
> boolean ACLCheck(int docID,String user,String password);
> And have some sort of framework to plug in something like that.
> I was looking at the Filter class. I guess I can read the entire
> index and, for each document, feed it to the authentication module;
> if authenticated, set the docID's bit and return the BitSet instance.
> It sounds very slow for large hits. I guess I can play with caching,
> etc.
> Any other ideas?
> Thanks
> -John




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: MultifieldQueryParser.parse()

2004-07-07 Thread Kelvin Tan
Hi Sergiu,

First of all, if your application is web-based, it's not necessary to
programmatically construct the query based on user input (via
MultiFieldQueryParser); you can use luceneQueryConstructor.js in the Lucene
sandbox. You can find the documentation here:
http://cvs.apache.org/viewcvs.cgi/*checkout*/jakarta-lucene-sandbox/contributions/javascript/queryConstructor/luceneQueryConstructor.html

Secondly, if it's still necessary to programmatically construct the query, perhaps
you can consider creating an int[] of MultiFieldQueryParser.REQUIRED_FIELD flags
and using

public static Query parse(String query, String[] fields, int[] flags,
                          Analyzer analyzer)

instead.
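
Something like this (a sketch; if I read the flags right, each field's
sub-query becomes a required clause):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.MultiFieldQueryParser;
import org.apache.lucene.search.Query;

String[] fields = { "title", "description" };
int[] flags = { MultiFieldQueryParser.REQUIRED_FIELD,
                MultiFieldQueryParser.REQUIRED_FIELD };
// roughly: +(title:best title:test) +(description:best description:test)
// (parse() throws ParseException)
Query q = MultiFieldQueryParser.parse("best test", fields, flags,
                                      new StandardAnalyzer());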

Kelvin

On Tue, 06 Jul 2004 10:09:00 +0200, Sergiu Gordea said:
>
> Hi all,
> I have a question.
> I have an index with several fields and I have to create conjunctive
> queries by default.
> What I'm trying to say is that we are developing a project and we provide
> search functionality
> based on the Lucene indexer.
> From what I can see, MultiFieldQueryParser creates disjunctive queries:
> if I search for "best test" in fields {title, description} then
> MultiFieldQueryParser.parse(string, fields, analyzer)
> will create a query that means "fields contain 'best' OR fields
> contain 'test'" [1]
> but I want to create "fields contain 'best' AND fields contain 'test'" [2]
> I know I can place a + before each of these terms, but we also want to
> let the users create
> custom queries using logical operators and + -, grouping and exact phrases.
> So in this situation we have to parse the query string twice, with the
> only change being that we will add an AND operator to
> link the TERMS in the places where no operator is found.
> This seems to me to be just overhead, and I think that the best way
> would be to overload the parse function to
> MultiFieldQueryParser.parse(String queryString, String[] fields,
> Analyzer analyzer, String/int defaultOperator) [3]
> where the default operator can be "AND" or "OR",
> so that I can choose whether I want to create query [1] or query [2].
> Do we have an alternative solution, reasonably simple for this problem?
> What do you think about my suggestion of implementing the [3] method .
>
> Thanks for understanding,
> Sergiu
>


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Disappearing segments

2004-05-02 Thread Kelvin Tan

> [If] you backup the index during an indexing run you could end up with a limp
> index missing a few files, hence the missing segments. I would check for write
> and commit locks pre-backup so as to avoid that.

That's what I needed to know! Thanks!
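
i.e. something like this before kicking off the backup (a sketch, assuming a
Lucene version with IndexReader.isLocked()):

import org.apache.lucene.index.IndexReader;

if (IndexReader.isLocked(indexDir)) {
    // a writer is active (write.lock/commit.lock present); retry the backup later
} else {
    backupindex();
}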

Kelvin

> -Original Message-----
> From: Kelvin Tan [mailto:[EMAIL PROTECTED]
> Sent: Monday, May 03, 2004 6:52 AM
> To: Lucene Users List
> Subject: RE: Disappearing segments
>
>
> Thanks for responding Nader.
>
> Hmm...you've hit the nail on the head. I do have a cron job which backs up
> the
> index. It's run in a scheduled batch-index job.
>
> The logic is basically
>
> backupindex()
> try
> {
> batchindex()
> }
> catch(Exception e)
> {
> deleteindex();
> copyfrombackuptoindex()
> deletebackup();
> }
>
> I assume that the original index before backing up was complete and
> 'working'.
> I'm also deleting the index that failed, instead of just overwriting. Where
> did
> I go wrong?
>
> I'm not checking that the index isn't write-locked before backing up, but I
> don't think that's the problem (though it very well can be a separate
> problem).
>
> Kelvin
>
> On Fri, 30 Apr 2004 23:20:42 +0400, Nader Henein said:
>> Could you share your indexing code, and just to make sure: is there
>> anything running on your machine that could delete these files, like
>> a cron job that'll back up the index?
>>
>> You could go by process of elimination and shut down your server and
>> see if the files disappear, coz if the problem is contained within the
>> server you know that you can safely go on the DEBUG rampage.
>>
>> Nader
>>
>> -Original Message-
>> From: Kelvin Tan [mailto:[EMAIL PROTECTED]
>> Sent: Friday, April 30, 2004 9:15 AM
>> To: Lucene Users List
>> Subject: Re: Disappearing segments
>>
>> An update:
>>
>> Daniel Naber suggested using IndexWriter.setUseCompoundFile() to see
>> if it happens with the compound index format. Before I had a chance to
>> try it out, this happened:
>>
>> java.io.FileNotFoundException: C:\index\segments (The system cannot
>> find the file specified)
>> at java.io.RandomAccessFile.open(Native Method)
>> at java.io.RandomAccessFile.<init>(RandomAccessFile.java:200)
>> at org.apache.lucene.store.FSInputStream$Descriptor.<init>(FSDirectory.java:321)
>> at org.apache.lucene.store.FSInputStream.<init>(FSDirectory.java:329)
>> at org.apache.lucene.store.FSDirectory.openFile(FSDirectory.java:268)
>> at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:71)
>> at org.apache.lucene.index.IndexWriter$1.doBody(IndexWriter.java:154)
>> at org.apache.lucene.store.Lock$With.run(Lock.java:116)
>> at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:149)
>> at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:131)
>>
>> so even the segments file somehow got deleted. Hoping someone can shed
>> some light on this...
>>
>> Kelvin
>>
>> On Thu, 29 Apr 2004 11:45:36 +0800, Kelvin Tan said:
>>> Errr, sorry for the cross-post to lucene-dev as well, but I realized
>>> this mail really belongs on lucene-user...
>>>
>>> I've been experiencing intermittent disappearing segments which
>>> result in the following stacktrace:
>>>
>>> Caused by: java.io.FileNotFoundException: C:\index\_1ae.fnm (The
>>> system cannot find the file specified)
>>> at java.io.RandomAccessFile.open(Native Method)
>>> at java.io.RandomAccessFile.<init>(RandomAccessFile.java:200)
>>> at org.apache.lucene.store.FSInputStream$Descriptor.<init>(FSDirectory.java:321)
>>> at org.apache.lucene.store.FSInputStream.<init>(FSDirectory.java:329)
>>> at org.apache.lucene.store.FSDirectory.openFile(FSDirectory.java:268)
>>> at org.apache.lucene.index.FieldInfos.<init>(FieldInfos.java:78)
>>> at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:104)
>>> at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:95)
>>> at org.apache.lucene.index.IndexReader$1.doBody(IndexReader.java:112)
>>> at org.apache.lucene.store.Lock$With.run(Lock.java:116)
>>> at org.apache.lucene.index.IndexReader.open(IndexReader.java:103)
>>> at org.apache.lucene.index.IndexReader.open(IndexReader.java:91)
>>> at org.apache.lucene.search.IndexSearcher.<init>(IndexSearcher.java:75)
>>>
>>> The segment that disappears (_1ae.fnm) varies.
>>>
>>> I can't seem to reproduce this error consistently, so don't have a
>>> clue what might cause it, but it usually happens after the
>>> application has been running for some time. Has anyone experienced
>>> something similar, or can anyone point me in the right direction?
>>>
>>> When this occurs, I need to rebuild the entire index for it to be
>>> usable. Very troubling indeed...
>>>
>>> Kelvin

RE: Disappearing segments

2004-05-02 Thread Kelvin Tan
Thanks for responding Nader.

Hmm...you've hit the nail on the head. I do have a cron job which backs up the
index. It's run in a scheduled batch-index job.

The logic is basically

backupindex()
try
{
batchindex()
}
catch(Exception e)
{
deleteindex();
copyfrombackuptoindex()
deletebackup();
}

I assume that the original index before backing up was complete and 'working'.
I'm also deleting the index that failed, instead of just overwriting. Where did
I go wrong?

I'm not checking that the index isn't write-locked before backing up, but I
don't think that's the problem (though it very well can be a separate problem).

Kelvin

On Fri, 30 Apr 2004 23:20:42 +0400, Nader Henein said:
> Could you share your indexing code, and just to make sure: is there
> anything running on your machine that could delete these files, like a
> cron job that'll back up the index?
>
> You could go by process of elimination and shut down your server and see if
> the files disappear, coz if the problem is contained within the server you
> know that you can safely go on the DEBUG rampage.
>
> Nader
>
> -Original Message-
> From: Kelvin Tan [mailto:[EMAIL PROTECTED]
> Sent: Friday, April 30, 2004 9:15 AM
> To: Lucene Users List
> Subject: Re: Disappearing segments
>
> An update:
>
> Daniel Naber suggested using IndexWriter.setUseCompoundFile() to see if it
> happens with the compound index format. Before I had a chance to try it out,
> this happened:
>
> java.io.FileNotFoundException: C:\index\segments (The system cannot find the
> file specified)
> at java.io.RandomAccessFile.open(Native Method)
> at java.io.RandomAccessFile.<init>(RandomAccessFile.java:200)
> at org.apache.lucene.store.FSInputStream$Descriptor.<init>(FSDirectory.java:321)
> at org.apache.lucene.store.FSInputStream.<init>(FSDirectory.java:329)
> at org.apache.lucene.store.FSDirectory.openFile(FSDirectory.java:268)
> at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:71)
> at org.apache.lucene.index.IndexWriter$1.doBody(IndexWriter.java:154)
> at org.apache.lucene.store.Lock$With.run(Lock.java:116)
> at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:149)
> at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:131)
>
> so even the segments file somehow got deleted. Hoping someone can shed some
> light on this...
>
> Kelvin
>
> On Thu, 29 Apr 2004 11:45:36 +0800, Kelvin Tan said:
>> Errr, sorry for the cross-post to lucene-dev as well, but I realized
>> this mail really belongs on lucene-user...
>>
>> I've been experiencing intermittent disappearing segments which result
>> in the following stacktrace:
>>
>> Caused by: java.io.FileNotFoundException: C:\index\_1ae.fnm (The
>> system cannot find the file specified)
>> at java.io.RandomAccessFile.open(Native Method)
>> at java.io.RandomAccessFile.<init>(RandomAccessFile.java:200)
>> at org.apache.lucene.store.FSInputStream$Descriptor.<init>(FSDirectory.java:321)
>> at org.apache.lucene.store.FSInputStream.<init>(FSDirectory.java:329)
>> at org.apache.lucene.store.FSDirectory.openFile(FSDirectory.java:268)
>> at org.apache.lucene.index.FieldInfos.<init>(FieldInfos.java:78)
>> at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:104)
>> at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:95)
>> at org.apache.lucene.index.IndexReader$1.doBody(IndexReader.java:112)
>> at org.apache.lucene.store.Lock$With.run(Lock.java:116)
>> at org.apache.lucene.index.IndexReader.open(IndexReader.java:103)
>> at org.apache.lucene.index.IndexReader.open(IndexReader.java:91)
>> at org.apache.lucene.search.IndexSearcher.<init>(IndexSearcher.java:75)
>>
>> The segment that disappears (_1ae.fnm) varies.
>>
>> I can't seem to reproduce this error consistently, so don't have a
>> clue what might cause it, but it usually happens after the application
>> has been running for some time. Has anyone experienced something
>> similar, or can anyone point me
>> in the right direction?
>>
>> When this occurs, I need to rebuild the entire index for it to be
>> usable. Very troubling indeed...
>>
>> Kelvin
>>




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Disappearing segments

2004-04-29 Thread Kelvin Tan
An update:

Daniel Naber suggested using IndexWriter.setUseCompoundFile() to see if it
happens with the compound index format. Before I had a chance to try it out,
this happened:

java.io.FileNotFoundException: C:\index\segments (The system cannot find the
file specified)
at java.io.RandomAccessFile.open(Native Method)
at java.io.RandomAccessFile.<init>(RandomAccessFile.java:200)
at org.apache.lucene.store.FSInputStream$Descriptor.<init>(FSDirectory.java:321)
at org.apache.lucene.store.FSInputStream.<init>(FSDirectory.java:329)
at org.apache.lucene.store.FSDirectory.openFile(FSDirectory.java:268)
at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:71)
at org.apache.lucene.index.IndexWriter$1.doBody(IndexWriter.java:154)
at org.apache.lucene.store.Lock$With.run(Lock.java:116)
at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:149)
at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:131)

so even the segments file somehow got deleted. Hoping someone can shed some
light on this...

Kelvin

On Thu, 29 Apr 2004 11:45:36 +0800, Kelvin Tan said:
> Errr, sorry for the cross-post to lucene-dev as well, but I realized this mail
> really belongs on lucene-user...
>
> I've been experiencing intermittent disappearing segments which result in the
> following stacktrace:
>
> Caused by: java.io.FileNotFoundException: C:\index\_1ae.fnm (The system cannot
> find the file specified)
> at java.io.RandomAccessFile.open(Native Method)
> at java.io.RandomAccessFile.<init>(RandomAccessFile.java:200)
> at org.apache.lucene.store.FSInputStream$Descriptor.<init>(FSDirectory.java:321)
> at org.apache.lucene.store.FSInputStream.<init>(FSDirectory.java:329)
> at org.apache.lucene.store.FSDirectory.openFile(FSDirectory.java:268)
> at org.apache.lucene.index.FieldInfos.<init>(FieldInfos.java:78)
> at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:104)
> at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:95)
> at org.apache.lucene.index.IndexReader$1.doBody(IndexReader.java:112)
> at org.apache.lucene.store.Lock$With.run(Lock.java:116)
> at org.apache.lucene.index.IndexReader.open(IndexReader.java:103)
> at org.apache.lucene.index.IndexReader.open(IndexReader.java:91)
> at org.apache.lucene.search.IndexSearcher.<init>(IndexSearcher.java:75)
>
> The segment that disappears (_1ae.fnm) varies.
>
> I can't seem to reproduce this error consistently, so don't have a clue what
> might cause it, but it usually happens after the application has been running
> for some time. Has anyone experienced something similar, or can anyone point
> me in the right direction?
>
> When this occurs, I need to rebuild the entire index for it to be usable. Very
> troubling indeed...
>
> Kelvin
>
>




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Disappearing segments

2004-04-29 Thread Kelvin Tan
Errr, sorry for the cross-post to lucene-dev as well, but I realized this mail
really belongs on lucene-user...

I've been experiencing intermittent disappearing segments which result in the
following stacktrace:

Caused by: java.io.FileNotFoundException: C:\index\_1ae.fnm (The system cannot
find the file specified)
at java.io.RandomAccessFile.open(Native Method)
at java.io.RandomAccessFile.<init>(RandomAccessFile.java:200)
at org.apache.lucene.store.FSInputStream$Descriptor.<init>(FSDirectory.java:321)
at org.apache.lucene.store.FSInputStream.<init>(FSDirectory.java:329)
at org.apache.lucene.store.FSDirectory.openFile(FSDirectory.java:268)
at org.apache.lucene.index.FieldInfos.<init>(FieldInfos.java:78)
at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:104)
at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:95)
at org.apache.lucene.index.IndexReader$1.doBody(IndexReader.java:112)
at org.apache.lucene.store.Lock$With.run(Lock.java:116)
at org.apache.lucene.index.IndexReader.open(IndexReader.java:103)
at org.apache.lucene.index.IndexReader.open(IndexReader.java:91)
at org.apache.lucene.search.IndexSearcher.<init>(IndexSearcher.java:75)

The segment that disappears (_1ae.fnm) varies.

I can't seem to reproduce this error consistently, so don't have a clue what
might cause it, but it usually happens after the application has been running
for some time. Has anyone experienced something similar, or can anyone point me
in the right direction?

When this occurs, I need to rebuild the entire index for it to be usable. Very
troubling indeed...

Kelvin




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: status of LARM project

2004-04-27 Thread Kelvin Tan
As far as I know, LARM is defunct. I read somewhere, perhaps apocryphal, that
Clemens got a job which wasn't supportive of his continued development on LARM.
AFAIK there aren't any other active developers of LARM (at least at the time it
branched off to SF).

Otis recently posted to use Nutch instead of LARM.

Kelvin

On 28 Apr 2004 09:44:04 +0800, Sebastian Ho said:
> Hi
>
> I have looked at the LARM website and I get conflicting information:
>
> http://nagoya.apache.org/wiki/apachewiki.cgi?LuceneLARMPages
> It says that development has stopped for this project.
>
> LARM hosted on sourceforge.
> The last message was dated 2003 in the mailing list. Is it still
> supported and active?
>
> LARM hosted on apache.
> It says the project is moved to sourceforge.
>
> Any one here who is active in LARM can comment on the status?
>
> Regards
>
> Sebastian Ho
>
>




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Search in all fields

2004-03-16 Thread Kelvin Tan


On Tue, 16 Mar 2004 08:11:34 -0500, Grant Ingersoll said:
>
> You can use the MultiFieldQueryParser, which will generate a query against
> all of the fields you specify, or you could index all of your documents into
> one or two common fields and search against them.  Since you have a lot of
> fields, I would guess the latter is the better choice.
>

Don't you mean "add one or two common fields to all your documents"? Or am I
mistaken?

Anyway, I believe adding this common field constitutes a best practice, since
you definitely need such a field if one wishes to perform a date-range-only
search.
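
i.e. something like this at index time (field and variable names are
illustrative):

// a catch-all field for unqualified queries...
doc.add(Field.UnStored("all", title + " " + body + " " + keywords));
// ...and a date field so a date-range-only search has a clause to hang on to
doc.add(Field.Keyword("date", DateField.dateToString(modified)));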

Would probably be a good idea to start a best-practices page in the Wiki.

K

>
> >>> [EMAIL PROTECTED] 03/16/04 07:56AM >>>
> In the QueryParser.parse method I must give the default field.
>
> Does this mean that non-addressed queries are executed only over
> this field?
>
> The main question is:
> How can I search in all fields in all documents in the index?
> Note that I don't know the field names; there can be thousands of field
> names across all documents.
>
> Thanks in advance.
>
>




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Query validation in web app

2004-03-05 Thread Kelvin Tan
Neither! :-) I was just wondering if there were better ways to do it, that's
all. I'm a regex newbie and I found it rather difficult to validate the entire
Lucene query syntax (including escaping!) using regex.

Anyway, I'm writing unit tests for the query validator right now courtesy of
jsunit...

Kelvin

On Fri, 5 Mar 2004 13:21:55 -0800, Dror Matalon said:
> I was responding to
>
>>> How are people checking/validating queries from a web-app?
>
> So should I be embarrassed or should Kelvin be flattered :-)?
>
>
> On Fri, Mar 05, 2004 at 12:12:35PM -0800, Otis Gospodnetic wrote:
>> Funny - Kelvin Tan is the author of that code :)
>>
>> Otis
>>
>> --- Dror Matalon <[EMAIL PROTECTED]> wrote:
>>> On Fri, Mar 05, 2004 at 04:21:07PM +0800, Kelvin Tan wrote:
>>>> Lucene reacts pretty badly to non-wellformed queries, not throwing
>>>> a checked/unchecked Exception but throwing an Error. The error
>>>> message is also unintelligible to a user (non-developer).
>>>>
>>>> How are people checking/validating queries from a web-app?
>>>
>>> Look at the javascript validator in the Lucene sandbox. A quite
>>> elegant
>>> solution, unless you're opposed to using javascript.
>>>
>>>>
>>>> I have some checked-in code in sandbox that does javascript
>>>> validation, but I wonder if there's a smarter way to do query
>>>> validation..
>>>>
>>>> k
>>>>
>>>
>>> --
>>> Dror Matalon
>>> Zapatec Inc
>>> 1700 MLK Way
>>> Berkeley, CA 94709
>>> http://www.fastbuzz.com
>>> http://www.zapatec.com
>>>




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Query validation in web app

2004-03-05 Thread Kelvin Tan
On Fri, 5 Mar 2004 04:18:29 -0500, Erik Hatcher said:
> Kelvin,
>
> In what scenarios does QueryParser fail without throwing a
> ParseException?
>
> I think we should fix those cases to ensure a ParseException is thrown.
>
> Erik
>
>

Sorry, my bad. Was it ever throwing Errors? Probably not, but somehow I had the
impression it was..

Kelvin


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Query validation in web app

2004-03-05 Thread Kelvin Tan
Lucene reacts pretty badly to non-wellformed queries, not throwing a
checked/unchecked Exception but throwing an Error. The error message is also
unintelligible to a user (non-developer).

How are people checking/validating queries from a web-app?

I have some checked-in code in sandbox that does javascript validation, but I
wonder if there's a smarter way to do query validation..

k


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Best Practices for indexing in Web application

2004-03-01 Thread Kelvin Tan


On Mon, 01 Mar 2004 12:24:26 +0100, Michael Steiger said:
> 1. Concurrency of IndexWriter and IndexReader
> It seems that it is not allowed to open an IndexWriter and an
> IndexReader at the same time. But if one user is changing records in the
> database (and therefore changing documents in the Lucene index) and
> another user is querying the index, I would need to open them both.

Not necessarily so. Shouldn't you be using an IndexSearcher to query the index
instead of an IndexReader? IndexReader should be used primarily for deleting and
low-level document retrieval via Terms.

>
> 2. Optimizing the index
> This is maybe related to my first issue.
> I assume that while optimizing the index no queries are allowed. How
> often should the index be optimized?
>

Again, IndexSearcher has no problem with the index being modified (via
optimizing, or otherwise) whilst searching. However, you'll need to have some
way of refreshing your IndexSearcher when this happens; otherwise the
IndexSearcher would be obsolete.

Kelvin


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Hits not serializable?

2003-08-22 Thread Kelvin Tan
You'll want to check out Andy Scholz's excellent
http://ejindex.sourceforge.net/. It's an EJB implementation of a layer built on
top of Lucene to allow for distributed searching.

K

On Fri, 22 Aug 2003 10:08:17 +0200, Lars Hammer said:
>Has anyone experimented with using EJB's for carrying out searches?
>I'm thinking of using an EJB for carrying out the searches and
>return the hits to a JSP page, which handles displaying of the
>results.
>But Hits isn't serializable, so it cannot be used for sending across
>the network from for example JBoss to Tomcat.
>
>Does anyone have any experience with using Lucene through EJBs?
>
>Thanks in advance
>
>/Lars Hammer
>
>www.dezide.com





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Reuse IndexSearcher?

2003-08-19 Thread Kelvin Tan
Yep. What I've done is hack a little class to pool the searchers, and whenever
I update the index, I inform this manager class, and it refreshes the
searchers. Of course, you can add sugar on top of that, like specifying a TTL
or something like that.
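
Stripped of the pooling and TTL, the manager is basically this (a sketch;
names are made up, and real code should let in-flight searches finish before
closing):

import java.io.IOException;
import org.apache.lucene.search.IndexSearcher;

public class SearcherManager {
    private static IndexSearcher searcher;

    public static synchronized IndexSearcher getSearcher() throws IOException {
        if (searcher == null) {
            searcher = new IndexSearcher("/path/to/index"); // path is an assumption
        }
        return searcher;
    }

    // call this after every index update so new searches see the changes
    public static synchronized void refresh() throws IOException {
        if (searcher != null) {
            searcher.close();
            searcher = null;
        }
    }
}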

Kelvin

On Tue, 19 Aug 2003 13:18:24 -0500, Scott Ganyo said:
>Yes.  You can (and should for best performance) reuse an
>IndexSearcher
>as long as you don't need access to changes made to the index.  An
>open
>IndexSearcher won't pick up changes to the index, so if you need to
>see
>the changes, you will need to open a new searcher at that point.
>
>Scott
>
>Aviran Mordo wrote:
>
>>Can I reuse one Instance of IndexSearcher to do multiple searches
>>(in
>>multiple threads) or do I have to instantiate a new IndexSearcher
>>for
>>each search?
>>
>>
>>
>
>
>





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Max Field size.

2003-08-14 Thread Kelvin Tan
Check out

http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexWriter.html#maxFieldLength


Otis, a possible FAQ entry?
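
For the archives: the limit is IndexWriter.maxFieldLength, which caps the
number of terms indexed per field (10,000 by default); terms beyond the cap
are dropped. A sketch, assuming the 1.x API where it's a public field:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), false);
writer.maxFieldLength = 1000000; // index beyond the first 10,000 terms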

On Wed, 13 Aug 2003 18:07:14 +1000, Victor Hadianto said:
>Hi All,
>
>Is there a maximum field size in Lucene? I found that by having a
>really big
>String as a field in a document, I couldn't search the tokens that
>exist in
>the end of the document.
>
>I'm stumped, I've never read anywhere in the documentation that
>there is such
>limit in Field. Am I the only one having the problem here?
>
>
>
>Regards,
>
>Victor Hadianto
>
>





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Using javacc for QueryParser.jj

2003-07-31 Thread Kelvin Tan
Claude,

Maik's code is not updated for the current Lucene build. A couple of days ago, I
posted a link to Mark Harwood's code, which works nicely.

http://home.clara.net/markharwood/lucene/highlight.htm

On Wed, 30 Jul 2003 16:04:59 +0200, Claude Libois said:
>Hello. I wanted to use the Highlighter from Maik Schreiber but I had a
>big problem:
>In order to use it I need to change the Lucene core. That's what I did,
>but there is a file QueryParser.jj and I don't know how to compile it. I
>found that I have to use javacc but I don't know how. Can someone help
>me with that?
>Claude




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Fwd: Lucene highlighting

2003-07-29 Thread Kelvin Tan
An update of Maik's code for Lucene 1.2. Mark says

> It shouldn't be too hard to add in the fuzzy/prefix/wildcard options to my
> code once the core Lucene API exposes the necessary methods.


--- Original Message ---
From: "Mark Harwood" <[EMAIL PROTECTED]>
To:  <[EMAIL PROTECTED]>
Cc:
Sent: Tue, 29 Jul 2003 11:20:06 +0100
Subject: Lucene highlighting

>Hi Kelvin,

>
>You may be interested in some extension I did to Maik's work too.
>It can do "best snippets" highlighting from large bodies of text.
>
>See here for details
>
>http://home.clara.net/markharwood/lucene/highlight.htm
>
>
>Cheers
>Mark





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Different Analyzer for each Field

2003-07-28 Thread Kelvin Tan
Perhaps one way to do it is to have 2 separate indices for the 2 analyzers.
Then, depending on which field you wish to search, you can choose from either
index.

AFAIK, there is a one-one mapping between an index and an analyzer.

Kelvin

On Mon, 28 Jul 2003 10:32:21 +0200, Claude Libois said:
>My question is in the title: how can I use a different Analyzer for
>each field of a Document object? My problem is that if I use
>LetterTokenizer for a field which contains a String representation of a
>number, afterwards I can't delete it. Probably because this analyzer threw
>away my number. So I need to use WhitespaceTokenizer for this field but
>I would like to use LetterTokenizer for the others. Can someone help
>me?
>Thank you
>Claude Libois
>
>
>-
>To unsubscribe, e-mail: [EMAIL PROTECTED]
>For additional commands, e-mail: [EMAIL PROTECTED]





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Getting term match on Fuzzy queries.

2003-07-22 Thread Kelvin Tan
I know of one way to do it: just make MultiTermQuery's getEnum(IndexReader
reader) a public method and iterate through the terms...

Kelvin

On Tue, 22 Jul 2003 19:24:35 +1000, Victor Hadianto said:
>Hi All,
>
>I managed to get the terms match from a query for other queries but
>not Fuzzy,
>does anyone know a quick and dirty way given a Fuzzy query, retrieve
>all the
>terms used for search?
>
>Thanks,





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Lucene Turbine Service

2003-03-04 Thread Kelvin Tan
Seth,

Not really. I've included here my SearchService interface though, which you
might want to check out.

I'm not sure what your intended or current usage is, so don't know how relevant
my implementation is for you...

public interface SearchService
{
    public static final String CONFIGURATION_FILE = "search.xml";

    public static final String INCREMENTAL_INDEX_ENABLED_KEY =
        "index.incremental";

    public static final String SEARCH_RESULT_DEFAULT_NAMESPACE_KEY =
        "[EMAIL PROTECTED]";

    public SearchResults search(Query query) throws ServiceException;

    public SearchResults search(Query query, Filter filter)
        throws ServiceException;

    public SearchResults search(Query query, Filter filter,
        int from, int to) throws ServiceException;

    public SearchResults search(Query query, Filter filter,
        int from, int to, String namespace) throws ServiceException;

    public void batchIndex() throws ServiceException;

    public boolean incrementalIndexEnabled();

    public void addObjectIndex(ObjectIndexer indexer) throws Exception;

    public void updateObjectIndex(ObjectIndexer indexer) throws Exception;

    public Document deleteObjectIndex(String objectId)
        throws IOException, InterruptedException;

    public Analyzer getAnalyzer();

    public IndexWriter getIndexWriter(boolean create) throws IOException;
}

On Sun, 2 Mar 2003 15:04:17 -0500, Seth Weiner said:
>Kelvin,
>
>Have you had a chance to check in any of your search subsystem
>components?  I know it's been a while since I mentioned the issue,
>but I'd love to make some headway on a solid Turbine search
>subsystem for general consumption.
>
>Thanks, Seth
>
>-Original Message-
>From: Kelvin Tan [mailto:[EMAIL PROTECTED] Sent: Sunday,
>January 26, 2003 8:17 PM To: Lucene Users List Subject: RE: Lucene
>Turbine Service
>
>
>Seth,
>
>I had been meaning to do it for awhile, but inertia was
>overwhelming. Then I recently needed to be able to modify the
>configuration of the service at runtime, and Fulcrum didn't support
>that, so I just refactored my way out of it. :-)
>
>Why had I been wanting to do it? well, on hindsight, I think it
>never was a good candidate for a turbine service in the first place.
>the way i see it, a good candidate requires
>
>a) Lifecycle support b) Configuration c) Pluggable implementations
>
>For LuceneSearchService, a) was minimal, b) yes but not a big factor
> and c) turned out to be impractical. I had hopes of creating a
>SearchService where one could plug-in various implementations (check
> out http://www.mail-archive.com/lucene-
>[EMAIL PROTECTED]/msg01461.html) but gave up in the end.
>
>Let me see if I can cleanup the subsystem I've refactored out and
>check it in to Sandbox, then maybe we can discuss from there?
>
>KT
>
>On Tue, 21 Jan 2003 19:35:40 -0500, Seth Weiner said:
>>Thanks for the pointer!  Might I ask what the motivation for the
>>refactoring was?
>>
>>On a separate note I just took a look at the service in the
>>sandbox. It
>
>>doesn't appear to support addition of documents to the index.
>>Shouldn't
>
>>be hard to add, just seems like a strange omission.  Also, wouldn't
>>it be more efficient for the service to maintain a pool of
>>IndexSearchers rather than creating a new one for each search?  Or
>>is there a problem with holding one or more index searchers open on
>>an index?  To add/remove a document to the index, would all of the
>>IndexSearchers need to be closed, or is this safe to do?
>>
>>The simple app I'm trying to write allows for the searching of an
>>index
>
>>through a turbine webapp interface as well as the ability to upload
>>a document through the webapp to be added to the index.  If the
>>document exists in the index it's deleted and then the new version
>>is added.
>>If anyone's already done this feel free to share;)  And while
>>you're in a charitable mood, my next step is to take the Lucene
>>service and make it an XMLRPC webservice.  Thoughts and suggestions
>>on that idea are greatly appreciated.
>>
>>Thanks, Seth
>>
>>-Original Message-
>>From: Kelvin Tan [mailto:[EMAIL PROTECTED] Sent: Tuesday,
>>January 21, 2003 7:16 PM To: Lucene Users List Subject: Re: Lucene
>>Turbine Service
>>
>>
>>Yep. Look in Lucene Sandbox. Interesting you should ask, though,
>>because after about a year of using the LuceneSearchService, I've
>>recently refactored it out into a subsystem of its own...:-)
>>
>>
>>Regards

Re: How to index a Word document

2003-01-30 Thread Kelvin Tan
Check out
http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi?file=chapter.indexing&toc=faq#q12

OK. But it really doesn't say very much. :-)

Seriously, how are people on this list doing it, if at all?

Regards,
Kelvin


The book giving manifesto - http://how.to/sharethisbook


On Fri, 31 Jan 2003 09:20:12 +0530, Nellai said:
>Hi!
>
>Can anyone tell me how to include Word documents for indexing. Is
>there any parser available for that?
>
>Thanks in advance
>
>Nellai...




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Re: or

2003-01-30 Thread Kelvin Tan
My suggestion would be to modify HTMLParser to do the job. Don't think it's
very difficult. I'm unaware of any existing HTML Parsers which support that
functionality...


Regards,
Kelvin


The book giving manifesto - http://how.to/sharethisbook


On Thu, 30 Jan 2003 10:56:50 +0100, Michael Wechner said:
>Hi
>
>I am looking for an HTMLParser which skips text tagged by
>
>  or something similar. This way I could exclude for
>instance a "global navigation section" within the HTML
>
> International Business Science ...
>
>
>It seems that the current demo/HTMLParser
>(http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi?file=chapter.indexing&toc=faq#q11) is not
>capable of doing something like that.
>
>Any pointers are very welcome.
>
>Thanks a lot
>
>Michael
>
>
>
>-
>To unsubscribe, e-mail: [EMAIL PROTECTED]
>For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




RE: Lucene Turbine Service

2003-01-26 Thread Kelvin Tan
Seth,

I had been meaning to do it for awhile, but inertia was overwhelming.
Then I recently needed to be able to modify the configuration of the
service at runtime, and Fulcrum didn't support that, so I just
refactored my way out of it. :-)

Why had I been wanting to do it? well, on hindsight, I think it never
was a good candidate for a turbine service in the first place. the
way i see it, a good candidate requires

a) Lifecycle support
b) Configuration
c) Pluggable implementations

For LuceneSearchService, a) was minimal, b) yes but not a big factor
and c) turned out to be impractical. I had hopes of creating a
SearchService where one could plug-in various implementations (check
out
http://www.mail-archive.com/lucene-dev@jakarta.apache.org/msg01461.html) but gave up in the end.

Let me see if I can cleanup the subsystem I've refactored out and
check it in to Sandbox, then maybe we can discuss from there?

KT

On Tue, 21 Jan 2003 19:35:40 -0500, Seth Weiner said:
>Thanks for the pointer!  Might I ask what the motivation for the
>refactoring was?
>
>On a separate note I just took a look at the service in the sandbox.
>It doesn't appear to support addition of documents to the index.
>Shouldn't be hard to add, just seems like a strange omission.  Also,
>wouldn't it be more efficient for the service to maintain a pool of
>IndexSearchers rather than creating a new one for each search?  Or
>is there a problem with holding one or more index searchers open on
>an index?  To add/remove a document to the index, would all of the
>IndexSearchers need to be closed, or is this safe to do?
>
>The simple app I'm trying to write allows for the searching of an
>index through a turbine webapp interface as well as the ability to
>upload a document through the webapp to be added to the index.  If
>the document exists in the index it's deleted and then the new
>version is added.
>If anyone's already done this feel free to share;)  And while you're
>in a charitable mood, my next step is to take the Lucene service and
>make it an XMLRPC webservice.  Thoughts and suggestions on that idea
>are greatly appreciated.
>
>Thanks, Seth
>
>-Original Message-
>From: Kelvin Tan [mailto:[EMAIL PROTECTED]] Sent: Tuesday,
>January 21, 2003 7:16 PM To: Lucene Users List Subject: Re: Lucene
>Turbine Service
>
>
>Yep. Look in Lucene Sandbox. Interesting you should ask, though,
>because after about a year of using the LuceneSearchService, I've
>recently refactored it out into a subsystem of its own...:-)
>
>
>Regards, Kelvin
>
>
>The book giving manifesto - http://how.to/sharethisbook
>
>
>On Tue, 21 Jan 2003 18:44:33 -0500, Seth Weiner said:
>>Hi,
>>
>>I'm fairly new to Lucene and am trying to create a 'simple' Turbine
>> webapp that will use Lucene for some indexing.  Has anyone written
>>a simple Turbine/Fulcrum Lucene service for searching and indexing
>>documents?
>>
>>Thanks, Seth Weiner
>>
>>
>>--
>>To unsubscribe, e-mail:   <mailto:lucene-user-
>>[EMAIL PROTECTED]> For additional commands, e-mail:
>><mailto:lucene-user-
>>[EMAIL PROTECTED]>
>
>
>
>
>--
>To unsubscribe, e-mail: <mailto:lucene-user-
>[EMAIL PROTECTED]> For additional commands, e-mail:
><mailto:[EMAIL PROTECTED]>
>
>
>
>
>--
>To unsubscribe, e-mail:   <mailto:lucene-user-
>[EMAIL PROTECTED]> For additional commands, e-mail:
><mailto:lucene-user-
>[EMAIL PROTECTED]>




--
To unsubscribe, e-mail:   <mailto:[EMAIL PROTECTED]>
For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>




Re: Lucene Turbine Service

2003-01-26 Thread Kelvin Tan
Otis,

hahaha...I'm not sure it's all THAT much better than the one in
Sandbox, but I'll clean it up when it's ready and check it in for
people to take a look anyway.

What's with our Lucene App Framework? seems to have lost a little
steam...

KT

On Sat, 25 Jan 2003 21:13:46 -0800 (PST), Otis Gospodnetic said:
>Kelvin,
>
>--- Kelvin Tan <[EMAIL PROTECTED]> wrote:
>>Yep. Look in Lucene Sandbox. Interesting you should ask, though,
>>because after about a year of using the LuceneSearchService, I've
>>recently refactored it out into a subsystem of its own...:-)
>
>Something better than your existing fulcrum contribution in the
>Sandbox?  Something you'd want to donate?
>
>Otis
>
>
>__ Do you Yahoo!?
>Yahoo! Mail Plus - Powerful. Affordable. Sign up now.
>http://mailplus.yahoo.com
>
>--
>To unsubscribe, e-mail:   <mailto:lucene-user-
>[EMAIL PROTECTED]> For additional commands, e-mail:
><mailto:lucene-user-
>[EMAIL PROTECTED]>




--
To unsubscribe, e-mail:   <mailto:[EMAIL PROTECTED]>
For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>




Re: Lucene Turbine Service

2003-01-21 Thread Kelvin Tan
Yep. Look in Lucene Sandbox. Interesting you should ask, though,
because after about a year of using the LuceneSearchService, I've
recently refactored it out into a subsystem of its own...:-)


Regards,
Kelvin


The book giving manifesto - http://how.to/sharethisbook


On Tue, 21 Jan 2003 18:44:33 -0500, Seth Weiner said:
>Hi,
>
>I'm fairly new to Lucene and am trying to create a 'simple' Turbine
>webapp that will use Lucene for some indexing.  Has anyone written a
>simple Turbine/Fulcrum Lucene service for searching and indexing
>documents?
>
>Thanks, Seth Weiner
>
>
>--
>To unsubscribe, e-mail:   [EMAIL PROTECTED]> For additional commands, e-mail:
>[EMAIL PROTECTED]>




--
To unsubscribe, e-mail:   
For additional commands, e-mail: 




NPE in Hits.getMoreDocs

2003-01-15 Thread Kelvin Tan
Weird error, only occurs when searching. Indexing is ok.

Has anyone else encountered this?

java.lang.NullPointerException
at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:91)
at org.apache.lucene.search.Hits.<init>(Hits.java:81)
at org.apache.lucene.search.Searcher.search(Searcher.java:74)

Looking at the src, it really doesn't make very much sense, since the
only object that could be null in line 91 is the searcher object.
Still, maybe someone can make more out of it...

Regards,
Kelvin


The book giving manifesto - http://how.to/sharethisbook



--
To unsubscribe, e-mail:   
For additional commands, e-mail: 




OpenCMS

2003-01-01 Thread Kelvin Tan
FYI, I chanced upon a Lucene module for OpenCMS.
http://www.opencms.com/opencms/opencms/service/modules.html

The dist includes a PDF parser. I have not tested it yet, the source
is not available, and it's not clear what the licensing is (OpenCMS is
LGPL, but modules are not bound by that license).

Just thought people would want to know about it, and how it may be yet
another alternative for PDF parsing.

Regards,
Kelvin


The book giving manifesto - http://how.to/sharethisbook



--
To unsubscribe, e-mail:   
For additional commands, e-mail: 




Re: PDF Text extraction

2002-12-26 Thread Kelvin Tan
Try this?

InputStream input = new FileInputStream(file);
COSDocument document = parseDocument(input); // parseDocument(): your PDFBox parsing helper
PDFTextStripper stripper = new PDFTextStripper();
StringWriter output = new StringWriter();
stripper.writeText(document, output);
System.out.println(output.toString());

errmmm...the code may not be 100% correct, but you get the idea.

Regards,
Kelvin


The book giving manifesto - http://how.to/sharethisbook


On Fri, 27 Dec 2002 12:04:11 +0530, Suhas Indra said:
>Hello List
>
>I am using PDFBox to index some of the PDF documents. The parser
>works fine and I can read the summary. But the contents are
>displayed as java.io.InputStream.
>
>When I try the following:
>System.out.println(doc.getField("contents")) (where doc is the
>Document object)
>
>The result will be:
>
>Text
>
>I want to print the extracted data.
>
>Can anyone please let me know how to extract the contents?
>
>Regards
>
>Suhas
>
>
>
>--
>Robosoft Technologies - Partners in Product Development
>
>
>
>
>
>
>
>
>
>--
>To unsubscribe, e-mail:   [EMAIL PROTECTED]> For additional commands, e-mail:
>[EMAIL PROTECTED]>




--
To unsubscribe, e-mail:   
For additional commands, e-mail: 




Re: Search PDF, Excel, Word, RTF files

2002-12-18 Thread Kelvin Tan
Eric,

Please refer to the FAQ.

On Thu, 19 Dec 2002 10:21:05 +0800, Eric Chow said:
>Hello,
>
>Is it possible to search PDF, Excel, Word, RTF files in Lucene ?
>
>
>Would you please to give me a simple example?
>
>Best regards, Eric
>
>== If you know what you are doing, it is not
>called RESEARCH!
>==




--
To unsubscribe, e-mail:   
For additional commands, e-mail: 




Fwd: The LARM project

2002-12-02 Thread Kelvin Tan


--- Original Message ---
From: "Clemens Marschner" <[EMAIL PROTECTED]>
To: Lucene Developers List <[EMAIL PROTECTED]>
Cc:
Sent: Mon, 2 Dec 2002 01:13:47 +0100
Subject: The LARM project

>It's been weeks now that there were talks on this list about what the
>future of Lucene could look like. We all agreed that the core engine
>is solid and only needs little enhancements.
>
>However, we also agreed that in order to leverage Lucene to a higher
>level we will need a best-practices implementation of the most
>common cases the Lucene engine is used for.
>
>Since then, there were a lot of talks off-list among a group of
>people who mostly contributed code to the lucene-sandbox area, about
>how these parts could be brought together to form a greater whole.
>
>The result has now emerged, after weeks of discussion, in a proposal
>that we would like to discuss with the rest of you before spending
>even more weeks arguing about details. We currently regard it as a
>sandbox project.
>It consists of two parts: The Lucene Framework and the Lucene
>Retrieval Machine. The latter based upon the former, we call the
>whole package "LARM" (which is by coincidence also the name of the
>old crawler. We haven't figured out yet what the "A" stands for,
>though...).
>
>The mission we came up with reads like this:
>
>"The Lucene Retrieval Machine forms a complete and highly scalable
>search solution for end-users of the Lucene search engine: Capable
>of intelligently indexing data from various sources, preprocessing
>of source documents configurable by the end user, up to a best-
>practice implementation of online search functionality.
>
>It will be based on the Lucene Framework that provides
>implementations for data aggregation and indexing functionality
>utilizing the Lucene indexing API, while being easily extensible and
>constructible by application developers or researchers."
>
>As it turns out, much of the functionality we would regard to be an
>essential core of this search engine server (lifecycle management,
>configuration etc.) is already provided by the Apache Avalon
>project, especially by the Phoenix meta-server. Avalon comes with a
>different (and therefore a little more complicated) philosophy than most
>of us are used to, but I have already pitched a note on their
>mailing list asking for support, and received an enormous response.
>We would like to get in contact with them -
>perhaps a viable collaboration can emerge.
>
>The rest of the documents can be downloaded from CVS from
>
>jakarta-lucene-sandbox/projects/larm/docs
>
>
>Regards,
>
>Clemens Marschner
>
>(with Otis Gospodnetic, Peter Carlson, and Kelvin Tan)
>
>
>
>
>--
>To unsubscribe, e-mail:   <mailto:lucene-dev-
>[EMAIL PROTECTED]> For additional commands, e-mail:
><mailto:lucene-dev-
>[EMAIL PROTECTED]>




--
To unsubscribe, e-mail:   <mailto:[EMAIL PROTECTED]>
For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>




Re: Benchmarking Results

2002-11-28 Thread Kelvin Tan
Thanks for posting your benchmark results Hamish. A while ago, I
started collecting similar posts from a couple of folks who have been
generous enough to share their results.

Let me see if I can find those numbers again, and this time, maybe
they'll make it onto the website or FAQ, with each author's explicit
blessings of course...

Regards,
Kelvin


The book giving manifesto - http://how.to/sharethisbook


On Fri, 29 Nov 2002 14:41:30 +1300, Hamish Carpenter said:
>Hi Everyone,
>
>I've been lurking on this list for a couple of weeks now.  I thought
>I would contribute my experiences (with timings) of using lucene.
>
>The main issue we have is why does performance decrease
>significantly when searching with multiple threads?
>
>I hope this helps people starting out with lucene to compare with
>their performance.
>
>Hamish
>
>BTW: Optimizing took 3.5 minutes at 500,000 documents and 4.7
>minutes at 1,000,000 documents.  Sorry I don't have memory usage
>figures.
>
> Hardware environment 
>Dedicated machine for indexing (yes/no): yes
>CPU (Type, Speed and Quantity): Intel x86 P4 1.5Ghz
>RAM: 512 DDR
>Drive configuration (IDE, SCSI, RAID-1, RAID-5): IDE 7200rpm Raid-1
>
>Software environment 
>Java Version: 1.3.1 IBM JITC Enabled
>OS Version: Debian Linux 2.4.18-686
>Location of index directory (local/network): local
>
>Lucene indexing variables -
>Number of source documents: Random generator. Set to make 1M documents in 2x500,000 batches.
>Total filesize of source documents: > 1Gb if stored.
>Average filesize of source documents (in KB/MB): 1kb
>Source documents storage location (filesystem, DB, http,etc): fs
>File type of source documents: generated.
>Parser(s) used, if any: default
>Analyzer(s) used: default
>Number of fields per document: 11
>Type of fields: 1 date, 1 id, 9 text
>Index persistence (FSDirectory, SqlDirectory, etc): FSDirectory
>
>Time taken (in ms/s as an average of at least 3 indexing runs):
>Time taken / 1000 docs indexed: 49 seconds
>Memory consumption: unsure
>
>Notes (any special tuning/strategies):
>--
>A windows client ran a random document generator which created
>documents based on some arrays of values and an excerpt (approx 1kb)
>from a text file of the bible (King James version).
>These were submitted via a socket connection (open throughout
>indexing process).
>The index writer was not closed between index calls.
>This created a 400Mb index in 23 files (after optimization).
>
>Query details: --
>Set up a threaded class to start x number of simultaneous threads to
>search the above created index.
>
>Query: +Domain:sos +(+((Name:goo*^2.0 Name:plan*^2.0) (Teaser:goo* Teaser:plan*)
>(Details:goo* Details:plan*)) -Cancel:y) +DisplayStartDate:[mkwsw2jk0 - mq3dj1uq0]
>+EndDate:[mq3dj1uq0 - ntlxuggw0]
>
>This query counted 34000 documents and I limited the returned
>documents to 5.
>
>This is using Peter Halacsy's IndexSearcherCache, slightly modified
>to be a singleton returning cached searchers for a given directory.
>This solved an initial problem with too many files open and running
>out of linux handles for them.
>
>Threads | Avg time per query (ms)
>      1 | 1009
>      2 | 2043
>      3 | 3087
>      4 | 4045
>    ... | ...
>     10 | 10091
>
>I removed the two date range terms from the query and it made a HUGE
>difference in performance. With 4 threads the avg time dropped to
>900ms!
>
>Other query optimizations made little difference.
>
>
>
>
>--
>To unsubscribe, e-mail:   [EMAIL PROTECTED]> For additional commands, e-mail:
>[EMAIL PROTECTED]>




--
To unsubscribe, e-mail:   
For additional commands, e-mail: 




Re: Multiple field searches using AND and OR's

2002-11-13 Thread Kelvin Tan
Rob,

I believe MultiFieldQueryParser will do the job for you...
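
If you want finer control, here's a hand-rolled sketch of your example, read as "field1 = value AND (field2 = value2 OR field2 = value3)", using the old BooleanQuery.add(query, required, prohibited) signature; the field and value names are taken straight from your mail:

BooleanQuery inner = new BooleanQuery();
inner.add(new TermQuery(new Term("field2", "value2")), false, false); // optional (OR)
inner.add(new TermQuery(new Term("field2", "value3")), false, false); // optional (OR)

BooleanQuery query = new BooleanQuery();
query.add(new TermQuery(new Term("field1", "value")), true, false);   // required (AND)
query.add(inner, true, false);                                        // required (AND)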

Regards,
Kelvin


On Wed, 13 Nov 2002 08:58:36 -0500, Rob Outar said:
>Hello all,
>
>I am wondering how I would do multiple field searches of the
form:
>
>field1 = value and field2 = value2 or field2 = value3
>
>I am thinking that each one of the above would be a term query but
>how would I string them together with AND's and OR's?
>
>Any help would be appreciated.
>
>Thanks,
>
>Rob
>
>PS I found this in the FAQ, but I was wondering if there was any
>other way to do it:
>
>My documents have multiple fields, do I have to replicate a query
>for each of them ?
>Not necessarily. A simple solution is to index the documents using a
>general field that contains a concatenation of the content of all
>the searchable fields ('author', 'title', 'body' etc). This way, a
>simple query will search in entire document content.
>
>The disadvantage of this method is that you cannot boost certain
>fields relative to others. Note also the matches in longer documents
>results in lower ranking.
>
>
>
>
>--
>To unsubscribe, e-mail:   [EMAIL PROTECTED]> For additional commands, e-mail:
>[EMAIL PROTECTED]>




--
To unsubscribe, e-mail:   
For additional commands, e-mail: 




Re: Deleting fields from a Document

2002-11-06 Thread Kelvin Tan
This brings me to a related discussion:  in-memory and index Field
representations.

Does an in-memory Field guarantee access to its name and value? Say I
retrieve a Field from a Document A, and add it to a new Document B.
Before writing B to the index, I delete A. Would B still contain the
Field? If so, does it work for both String-based and Reader-based
values?

Regards,
Kelvin


On Mon, 04 Nov 2002 10:40:40 -0800, Doug Cutting said:
>Kelvin Tan wrote:
>>Document maintains a linked list of Fields. It would not be
>>difficult to delete a random Field, albeit a little inefficient.
>
>That would delete it from the in-memory representation, but, once it
>has been indexed, there is no easy way to remove a field value from
>a document other than to delete the document and re-add it.
>
>Doug
>
>
>--
>To unsubscribe, e-mail:   <mailto:lucene-user-
>[EMAIL PROTECTED]> For additional commands, e-mail:
><mailto:lucene-user-
>[EMAIL PROTECTED]>




--
To unsubscribe, e-mail:   <mailto:lucene-user-unsubscribe@;jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@;jakarta.apache.org>




Deleting fields from a Document

2002-10-30 Thread Kelvin Tan
There is currently no way to delete fields from a Document. I
wondered if this was evil, in any way, and looking at the source of
Document.java, found no evidence that it is so.

Document maintains a linked list of Fields. It would not be
difficult to delete a random Field, albeit a little inefficient.

The reason why I need to delete fields, is that my index has been
inadvertently corrupted by fields with bad values from the
application. Attempts to add in correct values for these fields don't
solve the problem because the "bad" field still exists. One possible
solution is to create a new document, enumerate through all the
fields of the old document and add the ones you want. I don't have a
huge problem with that, but I also wonder if field deletion is truly
taboo.
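
For what it's worth, a sketch of that copy-everything-but-the-bad-field approach ("badField" is a made-up name):

Document fixed = new Document();
Enumeration fields = oldDoc.fields(); // all fields of the corrupted document
while (fields.hasMoreElements())
{
    Field f = (Field) fields.nextElement();
    if (!f.name().equals("badField")) // skip the field with the bad value
        fixed.add(f);
}
// then delete the old document from the index and add `fixed` in its place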

Maybe someone can shed some light here?

Regards,
Kelvin


--
To unsubscribe, e-mail:   
For additional commands, e-mail: 




Re: Urgent

2002-10-29 Thread Kelvin Tan
Kajal,

Index the date fields using Field.Keyword, and use a DateFilter to limit
the results you need.
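
A sketch along those lines, assuming a hypothetical "modified" field and the DateFilter(String, Date, Date) constructor:

// at indexing time: store the date untokenized (encoded via DateField)
doc.add(Field.Keyword("modified", modifiedDate));

// at search time: restrict hits to the requested window
DateFilter filter = new DateFilter("modified", fromDate, toDate);
Hits hits = searcher.search(query, filter);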

Regards,
Kelvin


On Tue, 29 Oct 2002 14:33:43 +0530, [EMAIL PROTECTED] wrote:
>Hi
>
>I would like to implement a search with date range. Can anyone
>explain how
>do i go about implementing this. I have enabled search on our
>intranet site
>and want to make changes to results.jsp so as to include the date
>option as
>well.
>
>Kajal
>
>
>--
>To unsubscribe, e-mail:   [EMAIL PROTECTED]>
>For additional commands, e-mail: [EMAIL PROTECTED]>





--
To unsubscribe, e-mail:   
For additional commands, e-mail: 




Re: Bitset Filters

2002-10-28 Thread Kelvin Tan
Terry,

What you typically want to do is something along the lines of

BitSet bits = new BitSet(reader.maxDoc());
Term t = new Term(field, fieldValue);
TermDocs termDocs = reader.termDocs(t);
try
{
    while (termDocs.next())
    {
        int docNumber = termDocs.doc();
        bits.set(docNumber);
    }
}
finally
{
    if (termDocs != null) termDocs.close();
}

this searches for all documents containing the term t, then allowing these
documents to be returned (note: everything else is disallowed by default).

Regards,
Kelvin


On Mon, 28 Oct 2002 08:35:28 -0500, Terry Steichen wrote:
>The Javadocs don't say much.  But thanks anyway.
>
>Terry
>
>- Original Message -
>From: "Peter Carlson" <[EMAIL PROTECTED]>
>To: "Lucene Users List" <[EMAIL PROTECTED]>
>Sent: Monday, October 28, 2002 12:22 AM
>Subject: Re: Bitset Filters
>
>
>>Check out the java docs on the Filter class.
>>
>>http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/
>>Filter.html
>>
>>--Peter
>>
>>On Friday, October 25, 2002, at 03:08 PM, Terry Steichen wrote:
>>
>>>Peter,
>>>
>>>Could you give, or point to, a couple of examples on how to use
>>>bitset
>>>filters in the way you describe below?
>>>
>>>Regards,
>>>
>>>Terry
>>>
>>>- Original Message -
>>>From: "Peter Carlson" <[EMAIL PROTECTED]>
>>>To: "Lucene Users List" <[EMAIL PROTECTED]>
>>>Sent: Tuesday, October 22, 2002 11:26 PM
>>>Subject: Re: Need Help URGENT
>>>
>>>
I think the answer is yes.

When creating a Lucene Document you can create a field which is the URL
field. If you are not searching for words within the field, I would
probably make it a keyword field type so you don't tokenize it into
multiple Terms.

Then you can create a multi-field search:

url:www.apache.org AND lucene

Where url is the field where the URL exists and the term you want to
search for in your default field is Lucene.

To answer what I think your second question is, I will restate the
question: can Lucene support subsearching?

Well yes and no. I will answer how to accomplish this; there is also
some information in the FAQ about this.

You can just add criteria to the search, so

url:www.apache.org AND lucene AND indexing

will return the subset of information.

If you are going to do the same search over and over again, you may
also want to look at filters, which basically keep a bitset of a Lucene
search's results so you don't actually have to do the search again, just
an intersection of two bitsets.

When you get the Hits back you can get the information from whatever
field you want, including the URL field that you will create.

I hope this helps and is on the mark. If not: as to whether you can use
Lucene to accomplish the task, the answer is typically yes (the
questions then become just how much work has to be done on top of
Lucene, or whether Lucene is the right tool).

--Peter



On Tuesday, October 22, 2002, at 04:32 PM, nandkumar rayanker
wrote:

> Hi,
>
> Forther to the request already made in my previous
> mail I would like to know:
>
> - Whether I can use lucene to search the remote site
> or not?
>
> Here is what I wnt to do.
> -Install Licene and search and create search info for
> a given URL.
>
> -Search the info from search info already created .
>
> Can do this sort of things using Lucene or not?
>
> thanks and regards
> Nandkumar
>
> --- nandkumar rayanker <[EMAIL PROTECTED]>
> wrote:
>> Hi,
>>
>> I need to develop search java stand alone
>> application,
>> which takes "SearchString" and "URL/URLS"
>>
>> "SearchString": string to be searched in web
>>
>> URL/URLS" : List of URLs where string needs to
>> searched.
>> return: List of URL/URLS where "SearchString" is
>> found.
>>
>> thanks & regards
>> Nandkumar
>>
>> --
>> To unsubscribe, e-mail:
>> 
>> For additional commands, e-mail:
>> 
>>
>
>
> --
> To unsubscribe, e-mail:
> 
> For additional commands, e-mail:
> 
>
>


--
To unsubscribe, e-mail:
>>>
For additional commands, e-mail:
>>>


>>>
>>>
>>>--
>>>To unsubscribe, e-mail:
>>>
>>>For additional commands, e-mail:
>>>
>>>
>>>
>>
>>
>>--
>>To unsubscribe, e-mail:
>

Re: Is Like Search Possible ?

2002-10-06 Thread Kelvin Tan

Ravi,

The question about Lucene's case-sensitivity is a moot one: it depends on
the analyzers/filters you use. If you use StandardAnalyzer (which you don't
have to), a lowercase filter is applied, hence the lack of
case-sensitivity.
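
To sketch the point, an analyzer that tokenizes like StandardAnalyzer but skips the lowercasing (and stop-word) steps would give you case-sensitive terms; the class name here is made up:

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class CasePreservingAnalyzer extends Analyzer
{
    public TokenStream tokenStream(String fieldName, Reader reader)
    {
        // StandardAnalyzer minus LowerCaseFilter and StopFilter
        return new StandardFilter(new StandardTokenizer(reader));
    }
}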

Regards,
Kelvin


On Sun, 06 Oct 2002 23:24:45 -0500, Ravi Kothiyal wrote:
>Dear Suneetha,
>As far as I know the lucene is not case sensitive.
>
>Are you storing the Domain also as a field in the index . If yes
>than you can refine your query as "toAddress:abc* AND Domain:xyz.com"
>
>Otherwise you can refine it using fuzzy search.
>"toAddress:abc* AND toAddress:@xyz.com~"
>
>Hope this will help You
>
>Regards
>
>Ravi
>
>
>- Original Message -
>From: Suneetha Rao <[EMAIL PROTECTED]>
>Date: Sat, 05 Oct 2002 13:01:48 +0530
>To: Lucene Users List <[EMAIL PROTECTED]>
>Subject: Re: Is Like Search Possible ?
>
>
>>Dear Ravi,
>>Thanks for ur help but my problem is not solved  yet. I have
>>indexed the field ToAddress.
>>I'm able to get results if I search for  (toAddress:abc*)  it gives
>>me all mailids starting with abc
>>but I want it to search for in the domain how do I do it ??
>>Also I've found it does not return any results when I query for
>>(toAddress:Abc).
>>If Lucene is not case sensitive  why doesen't  it give me results.
>>
>>Regards,
>>Suneetha
>>
>>Ravi Kothiyal wrote:
>>
>>>Dear Suneetha,
>>>visit
>>>http://jakarta.apache.org/lucene/docs/queryparsersyntax.html
>>>for the syntax about query. But this is for basic html search. But
>>>I think if you want to search through email's ToAddress field, You
>>>have a create an index which stores the toAddress than only you
>>>can retereve the search for toAddress.
>>>
>>>Hope this will help you
>>>
>>>Regards
>>>Ravi
>>>
>>>- Original Message -
>>>From: Suneetha Rao <[EMAIL PROTECTED]>
>>>Date: Sat, 05 Oct 2002 10:16:02 +0530
>>>To: Lucene Users List <[EMAIL PROTECTED]>
>>>Subject: Is Like Search Possible ?
>>>
Hi,
I've used Lucene and indexed the whole database where I saved the
mail headers, and some files where I saved the mail contents. I would
like to do a search on email ids. I'm using a BooleanQuery to retrieve
results and am using the StandardAnalyzer.
How do I translate the SQL statement
SELECT * FROM  where TOADDRESS LIKE '%infy%';
I tried the query +(toAddress:infy*) but it does not retrieve
any results.
I basically want to retrieve all records that have the toAddress
like [EMAIL PROTECTED] Is there something wrong with the way I query?
How should I get the desired results?
Thanks in Advance

Regards,
Suneetha


--
To unsubscribe, e-mail:   
For additional commands, e-mail: 


>>>
>>>--
>>>__
>>>Sign-up for your own FREE Personalized E-mail at Mail.com
>>>http://www.mail.com/?sr=signup
>>>
>>>"Free price comparison tool gives you the best prices and cash
>>>back!"
>>>http://www.bestbuyfinder.com/download.htm
>>>
>>>--
>>>To unsubscribe, e-mail:   >>[EMAIL PROTECTED]>
>>>For additional commands, e-mail: >>[EMAIL PROTECTED]>
>>
>>
>>--
>>To unsubscribe, e-mail:   >[EMAIL PROTECTED]>
>>For additional commands, e-mail: >[EMAIL PROTECTED]>
>>
>>





--
To unsubscribe, e-mail:   
For additional commands, e-mail: 




ArrayIndexOutOfBoundsException in FastCharStream.readChar

2002-08-13 Thread Kelvin Tan

Has anyone encountered this?

See stacktrace:

java.lang.ArrayIndexOutOfBoundsException
at org.apache.lucene.analysis.standard.FastCharStream.readChar(Unknown Source)
at org.apache.lucene.analysis.standard.StandardTokenizerTokenManager.jjMoveNfa_0(Unknown Source)
at org.apache.lucene.analysis.standard.StandardTokenizerTokenManager.jjMoveStringLiteralDfa0_0(Unknown Source)
at org.apache.lucene.analysis.standard.StandardTokenizerTokenManager.getNextToken(Unknown Source)
at org.apache.lucene.analysis.standard.StandardTokenizer.jj_ntk(Unknown Source)
at org.apache.lucene.analysis.standard.StandardTokenizer.next(Unknown Source)
at org.apache.lucene.analysis.standard.StandardFilter.next(Unknown Source)
at org.apache.lucene.analysis.LowerCaseFilter.next(Unknown Source)
at org.apache.lucene.analysis.StopFilter.next(Unknown Source)
at org.apache.lucene.index.DocumentWriter.invertDocument(Unknown Source)
at org.apache.lucene.index.DocumentWriter.addDocument(Unknown Source)
at org.apache.lucene.index.IndexWriter.addDocument(Unknown Source)

Regards,
Kelvin


--
To unsubscribe, e-mail:   
For additional commands, e-mail: 




Re: CachedSearcher

2002-07-15 Thread Kelvin Tan

>FSDirectory closes files as they're GC'd, so you
>don't have to explicitly close the IndexReaders or Searchers.
>
>Doug
>

hmmm...is this documented somewhere? I go through quite a bit of trouble
just to close Searchers (because Hits become invalid when the Searcher is
closed).

If the object has a close() method with public modifier, isn't it a common
idiom that client code needs to invoke close() explicitly? If there's no
real need to call close, maybe it can be changed to protected?

Regards,
Kelvin


--
To unsubscribe, e-mail:   
For additional commands, e-mail: 




RE: PDF Text Stripper

2002-07-10 Thread Kelvin Tan

I just took a look at JPedal and I'm very impressed. Extracted some text as
XML data no problem.

Amazingly also creates thumbnails of the PDF file which is something I've
needed but couldn't find...:)

Regards,
Kelvin


On Wed, 10 Jul 2002 09:59:32 +0200, Jose Galiana wrote:
>Hi,
>
>I've used JPedal (www.jpedal.org). It's distributed under the LGPL
>license and extracts raw text, among other uses.
>
>I wrote code to extract text using the Etymon PJ library; with PDFs
>using proprietary fonts, I needed to create a cross table to translate
>Unicode to ASCII, because Distiller inserts only a subset of the
>Unicode table for each proprietary font.
>
>JPedal has no problem with those fonts and extracts all text as XML,
>suitable for use with Lucene.
>
>
>
>-Mensaje original-
>De: Ben Litchfield [mailto:[EMAIL PROTECTED]]
>Enviado el: martes, 09 de julio de 2002 16:48
>Para: [EMAIL PROTECTED]
>Asunto: PDF Text Stripper
>
>
>Hi,
>
>I have written a PDF library that can be used to strip text from PDF
>documents.  It is released under LGPL so have fun.
>
>There is one class which can be used to easily index PDF documents.
>pdfparser.searchengine.lucene.LucenePDFDocument  has a getDocument
>method which will take a PDF file and return a Lucene Document which
>you
>can add to an index.
>
>If you would like to see the quality of the text extraction you can
>run
>pdfparser.Main from the command line which will take a PDF document
>and
>write a txt file.
>
>I am looking for any input that you might have.  Please mail me if
>you
>have any bugs or feature requests.
>
>The library can be retrieved from
>http://www.csh.rit.edu/~ben/projects/pdfparser/
>
>-Ben Litchfield
>
>
>--
>To unsubscribe, e-mail:
>
>For additional commands, e-mail:
>
>
>
>
>--
>To unsubscribe, e-mail:   [EMAIL PROTECTED]>
>For additional commands, e-mail: [EMAIL PROTECTED]>
>




--
To unsubscribe, e-mail:   
For additional commands, e-mail: 




Re: Summarization tool?

2002-05-14 Thread Kelvin Tan

Gee, would I be interested in something like that...

I haven't seen anything of that sort at all, much less in Java (not to say
its inferior, but that its relatively new).

- Original Message -
From: "Nikhil G. Daddikar" <[EMAIL PROTECTED]>
To: "Lucene" <[EMAIL PROTECTED]>
Sent: Tuesday, May 14, 2002 10:31 AM
Subject: OT: Summarization tool?


> Hello,
>
> This is slightly off-topic but does anyone know of a good freeware
summarization tool i.e something that generates an abstract out
> of a text?
>
> Thanks.
>
>
> --
> To unsubscribe, e-mail:

> For additional commands, e-mail:

>
>


--
To unsubscribe, e-mail:   
For additional commands, e-mail: 




Re: PDF4J Project: Gathering Feature Requests

2002-05-06 Thread Kelvin Tan

Good point. Plus the xpdf project (AFAIK) is being actively developed. One
problem though: It's released under GPL, so any port will probably have to
adopt GPL too (unless they can be convinced to re-release it under a less
restrictive license).

- Original Message -
From: "Peter Carlson" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Tuesday, May 07, 2002 10:36 AM
Subject: Re: PDF4J Project: Gathering Feature Requests


> I was just thinking of an existing code base that you could port to pure
> java.
>
> --Peter
>
> On 5/6/02 5:26 PM, "Kelvin Tan" <[EMAIL PROTECTED]> wrote:
>
> > there are JNI hooks to the code, I doubt it would be as nice as a Java
> > library for that.
> >
> > K
>
>
> --
> To unsubscribe, e-mail:
<mailto:[EMAIL PROTECTED]>
> For additional commands, e-mail:
<mailto:[EMAIL PROTECTED]>
>
>


--
To unsubscribe, e-mail:   <mailto:[EMAIL PROTECTED]>
For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>




Re: PDF4J Project: Gathering Feature Requests

2002-05-06 Thread Kelvin Tan


- Original Message -
From: "Peter Carlson" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Tuesday, May 07, 2002 7:51 AM
Subject: Re: PDF4J Project: Gathering Feature Requests


> Have you looked at xpdf?
>
> www.foolabs.com/xpdf

From what I know of xpdf, it's not written in Java, probably C I think. Even
if there are JNI hooks to the code, I doubt it would be as nice as a Java
library for that.

K


--
To unsubscribe, e-mail:   
For additional commands, e-mail: 




Re: Performance benchmarks

2002-05-04 Thread Kelvin Tan

Great Peter. I've posted a new set of attributes based on your submission
and Otis' feedback. Let me think about the best way to consolidate these
numbers and stick them somewhere accessible for all.

- Original Message -
From: "Peter Carlson" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Friday, May 03, 2002 9:50 PM
Subject: Performance benchmarks


> Some performance numbers
>
> Java Version: 1.3_01
> OS Version: Windows 2000
> CPU (Type, Speed and Quantity): Pentium 4, 1.5 GHz, 1 CPU
> RAM: 512 MB
> Drive configuration (IDE, SCSI, RAID-1, RAID-5): IDE (single)
> Number of source documents: 103009
> Total filesize of source documents: 430MB
> Average filesize of source documents (in KB/MB): 4.3KB
> Source documents storage location (filesystem, DB, http,etc): Filesystem
> File type of source documents: xml
> Parser(s) used, if any: Standard Analyzer
> Number of Fields per document: 8
> Time taken (in ms/s as an average of at least 3 indexing runs): 8387 sec
> (139 min)
> Time taken / 1000 docs indexed: 81 sec / 1000 docs
> Notes (any special tuning/strategies):
> I convert each document to a DOM, and use xpath to get the fields.
> I perform validation on the data and make sure that it meets certain
> criteria like total size > 150 characters, and verify there are no
> duplicates using a Hashmap. Without these checks, the indexing goes faster
> (about 60 seconds/1000 docs).
>
>
> I hope this is helpful.
> --Peter
>
>
> --
> To unsubscribe, e-mail:

> For additional commands, e-mail:

>


--
To unsubscribe, e-mail:   
For additional commands, e-mail: 




Re: Indexing performance benchmarks

2002-05-04 Thread Kelvin Tan

Excellent. Otis suggested number and type of fields as well. I'd really like
to consolidate these figures and stick them somewhere on the website if it's
ok with the people who contribute. Thanks Peter for adding the stats on
hardware and others.

Here's an updated and sorted list:


Hardware environment
Dedicated machine for indexing (yes/no):
CPU (Type, Speed and Quantity):
RAM:
Drive configuration (IDE, SCSI, RAID-1, RAID-5):

Software environment
Java Version:
OS Version:
Location of index directory (local/network):

Lucene indexing variables
Number of source documents:
Total filesize of source documents:
Average filesize of source documents (in KB/MB):
Source documents storage location (filesystem, DB, http,etc):
File type of source documents:
Parser(s) used, if any:
Analyzer(s) used:
Number of fields per document:
Type of fields:
Index persistence (FSDirectory, SqlDirectory, etc):

Time taken (in ms/s as an average of at least 3 indexing runs):
Time taken / 1000 docs indexed:
Memory consumption:

Notes (any special tuning/strategies):


If you'd like to contribute these stats but wish to remain anonymous, that's
cool too. You can mail me offline or something, and your boss will never
know...:)

- Original Message -
From: "Peter Carlson" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Friday, May 03, 2002 9:21 PM
Subject: Re: Indexing performance benchmarks


> I like this idea too.
>
> It would also be good to know about system used
>
> Java Version
> OS Version
> CPU (Type, Speed and Quantity)
> RAM
> Drive configuration (IDE, SCSI, RAID-1, RAID-5)
>
>
>
> On 5/2/02 11:47 PM, "Kelvin Tan" <[EMAIL PROTECTED]> wrote:
>
> >
> > Number of source documents:
> > Total filesize of source documents:
> > Average filesize of source documents (in KB/MB):
> > Source documents storage location (filesystem, DB, http,etc):
> > File type of source documents:
> > Parser(s) used, if any:
> > Time taken (in ms/s as an average of at least 3 indexing runs):
> > Notes (any special tuning/strategies):
>
>
> --
> To unsubscribe, e-mail:
<mailto:[EMAIL PROTECTED]>
> For additional commands, e-mail:
<mailto:[EMAIL PROTECTED]>
>


--
To unsubscribe, e-mail:   <mailto:[EMAIL PROTECTED]>
For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>




Re: indexing PDF files

2002-05-04 Thread Kelvin Tan

You might want to take a look at WebSearch http://www.i2a.com/websearch/. It
has an _ok_ system going with respect to PDFs. PDFGo supports viewing of PDF
but a guy I contacted there says there's no current support for text
extraction but that he's "planning to do it".

Definitely agreed on the PJ resources bit. Doesn't really scale well in
terms of PDF file size.

If you haven't already seen the post, I once did a cursory examination of
the options for extracting text from PDF files via Java and the limitations
of the approaches.
http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg00280.html

The Etymon lib is GPL'ed, so I guess that's a nice place to start. As far as
the libs I've seen so far, most of them are really concerned with the
display and manipulation of PDF pages. Since we're looking for something
less complex (i.e. text extraction), maybe it's not so bad. I've spent a bit
of time in this area before so feel free to email me offline about this. Not
sure how much help I can be though.

- Original Message -
From: "petite_abeille" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Friday, May 03, 2002 10:57 PM
Subject: Re: indexing PDF files


> On Friday, May 3, 2002, at 03:16 PM, Moturu,Praveen wrote:
>
> > Can I assume none of the poeple on the lucene user group had
> > implemented indexing a pdf document using lucene.
>
> Who knows...?!? In any case, it's not public knowledge...
>
> >  If some one has.. Please help me by providing the solution.
>
> I use to believe in Santa Claus also... ;-)
>
> All that said, there seems to be a real demand to do something about pdf
> to text conversion (in java preferably). I'm willing to invest some time
> and brain cell to nail it down, but I'm note sure where to start...
>
> I'm aware of the PJ library, but it's really a pig as far as resources
> goes. Anything else?
>
> Any (concrete) pointer appreciated.
>
> Thanks.
>
> PA.
>
>
> --
> To unsubscribe, e-mail:

> For additional commands, e-mail:

>


--
To unsubscribe, e-mail:   
For additional commands, e-mail: 




Indexing performance benchmarks

2002-05-02 Thread Kelvin Tan

I think that it would be really useful if users can post performance
benchmarks for usage of Lucene in their app. I know its been done informally
on an ad hoc basis by various people in the past, but I'd like to propose a
standardized format:

Number of source documents:
Total filesize of source documents:
Average filesize of source documents (in KB/MB):
Source documents storage location (filesystem, DB, http,etc):
File type of source documents:
Parser(s) used, if any:
Time taken (in ms/s as an average of at least 3 indexing runs):
Notes (any special tuning/strategies):

This will really help users know what performance to expect when indexing
and should  help to raise warning flags when indexing times aren't similar
to benchmarks. Any one to start? :)

Regards,
Kelvin


--
To unsubscribe, e-mail:   
For additional commands, e-mail: 




Re: Javascript query validation

2002-04-24 Thread Kelvin Tan

FWIW, here's an updated version. The previous version ran into some issues
with handling complex queries (like (foo:("bar"))) and stuff. Added
validation to ensure quote marks are closed.

Kelvin
- Original Message -
From: "Kelvin Tan" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Wednesday, April 10, 2002 5:41 PM
Subject: Javascript query validation


> Note: this file has only been tested in IE 6.0.
>
> Frustrated with curious TokenMgrErrors and ParseExceptions in your web
> forms? (I was) Not so good at regular expressions? (I'm not)
>
> See attached for a (simple) implementation of a regex-based javascript
query
> validator. Currently, only wildcards (*), plus and minus (+, -),
parentheses
> (round brackets) and field declarers (:) are validated for. The reason's
coz
> they're the most commonly used (IMHO) and I'm lazy.
>
> If you do add to it/fix any bugs, I'd appreciate if you could drop me a
buzz
> so I can update it, or post it to the list for everyone to use.
>
> Note: I've found that in IE at least, hitting the enter button with a text
> field in focus irritatingly submits the form without giving me a chance to
> validate the query. I've had to disable the "Enter" key for the form
fields.
> Email me offline if you need help with this.
>
> Regards,
> Kelvin Tan
>
> Relevanz Pte Ltd
> http://www.relevanz.com
>
> 180B Bencoolen St.
> The Bencoolen, #04-01
> S(189648)
>
> Tel: 6238 6229
> Fax: 6337 4417
>
>






> --
> To unsubscribe, e-mail:
<mailto:[EMAIL PROTECTED]>
> For additional commands, e-mail:
<mailto:[EMAIL PROTECTED]>



luceneQueryValidator.zip
Description: Zip archive

--
To unsubscribe, e-mail:   <mailto:[EMAIL PROTECTED]>
For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>


Re: Re:_HTML_parser

2002-04-24 Thread Kelvin Tan

Otis, what's the final conclusion you've arrived at regarding the HTML
filter/parsing?

I have pretty much the same requirements as you do right now (extract text,
and obtain the title).

Kelvin

- Original Message -
From: "Otis Gospodnetic" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Monday, April 22, 2002 12:27 AM
Subject: Re:_HTML_parser


> Laura,
>
> http://marc.theaimsgroup.com/?l=lucene-user&w=2&r=1&s=Spindle&q=b
>
> Oops, it's JoBo, not MoJo :)
> http://www.matuschek.net/software/jobo/
>
> Otis
>
> --- "[EMAIL PROTECTED]" <[EMAIL PROTECTED]> wrote:
> > Hi Otis,
> >
> > thanks for your reply. I have been looking for Spindle and Mojo for 2
> > hours but I haven't found anything.
> >
> > Can you help me? Where can I find something?
> >
> > Thanks for your help and time
> >
> >
> > Laura
> >
> >
> >
> >
> > > Laura,
> > >
> > > Search the lucene-user and lucene-dev archives for things like:
> > > crawler
> > > spider
> > > spindle
> > > lucene sandbox
> > >
> > > Spindle is something you may want to look at, as is MoJo (not
> > > mentioned on lucene lists, use Google).
> > >
> > > Otis
> > >
> > > > Did someone solve the problem to spider recursively a web pages?
> > >
> > > > > >While trying to research the same thing, I found the
> > > > following...here
> > > > 's a
> > > > > >good example of link extraction.
> > > > >
> > > > > Try http://www.quiotix.com/opensource/html-parser
> > > > >
> > > > > Its easy to write a Visitor which extracts the links; should
> > take
> > > > abou
> > > > t ten
> > > > > lines of code.
> > >
> > >
> > > __
> > > Do You Yahoo!?
> > > Yahoo! Games - play chess, backgammon, pool and more
> > > http://games.yahoo.com/
> > >
> > > --
> > > To unsubscribe, e-mail:    > [EMAIL PROTECTED]>
> > > For additional commands, e-mail:  > [EMAIL PROTECTED]>
> > >
> > >
>
>
> __
> Do You Yahoo!?
> Yahoo! Games - play chess, backgammon, pool and more
> http://games.yahoo.com/
>
> --
> To unsubscribe, e-mail:

> For additional commands, e-mail:

>
>


--
To unsubscribe, e-mail:   
For additional commands, e-mail: 




Re: PDF / Word document parsers

2002-04-18 Thread Kelvin Tan

>To parse word
> document you can have a look for OpenOffice. You can start OpenOffice to
> receive a socket connection. From your Java app, you open a connection to
> OpenOffice (using OpenOffice SDK), send the word document and it will
convert
> it to text.
>

That's actually quite a novel idea. I haven't tried it, is it complicated to
communicate with OpenOffice?

Kelvin


--
To unsubscribe, e-mail:   
For additional commands, e-mail: 




Re: PDF / Word document parsers

2002-04-18 Thread Kelvin Tan

Anita,

I've experienced a moderate amount of success using Etymon for PDF parsing.
It does consume quite a lot of memory for larger PDF documents, but otherwise
it's ok. What difficulties are you facing?

For MS Word parsing, the Jakarta POI project is working something out, but
in the meanwhile I've managed to search MS Word documents by reading the
file and stripping out nonsense characters. It's a hack I think, but if I
increase the indexWriter's maxFieldLength to about a million, I can search
like 13-15MB word documents with ease.
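
For reference, the tweak I mean looks like this (maxFieldLength was a public field on IndexWriter in these versions, defaulting to 10,000 terms per field):

IndexWriter writer = new IndexWriter("index", new StandardAnalyzer(), true);
writer.maxFieldLength = 1000000; // index up to ~1M terms per field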

Kelvin
- Original Message -
From: "Anita Srinivas" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Friday, April 19, 2002 2:13 PM
Subject: PDF / Word document parsers


Hi...

I have been looking for PDF and Word document parsers.  I have tried the
contributions page on the Lucene site as suggested by a Lucene User. The
PJEtymon does not have a Windows version.  The XPDF does not do the parsing
very well.

Can someone  suggest some better Word document or PDF parsers other than the
ones I mentioned here, .

Thanks

Anita Srinivas



--
To unsubscribe, e-mail:   
For additional commands, e-mail: 




Javascript query validation

2002-04-10 Thread Kelvin Tan

Note: this file has only been tested in IE 6.0.

Frustrated with curious TokenMgrErrors and ParseExceptions in your web
forms? (I was) Not so good at regular expressions? (I'm not)

See attached for a (simple) implementation of a regex-based javascript query
validator. Currently, only wildcards (*), plus and minus (+, -), parentheses
(round brackets) and field declarers (:) are validated for. The reason's coz
they're the most commonly used (IMHO) and I'm lazy.

If you do add to it/fix any bugs, I'd appreciate if you could drop me a buzz
so I can update it, or post it to the list for everyone to use.

Note: I've found that in IE at least, hitting the enter button with a text
field in focus irritatingly submits the form without giving me a chance to
validate the query. I've had to disable the "Enter" key for the form fields.
Email me offline if you need help with this.

Regards,
Kelvin Tan

Relevanz Pte Ltd
http://www.relevanz.com

180B Bencoolen St.
The Bencoolen, #04-01
S(189648)

Tel: 6238 6229
Fax: 6337 4417




luceneQueryValidator.js
Description: Binary data

--
To unsubscribe, e-mail:   <mailto:[EMAIL PROTECTED]>
For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>


[Repost] FNFE while indexing

2002-04-04 Thread Kelvin Tan

This is a repost of a message I posted a month ago. Hope someone can pick it
up and shed some light on the mystery.

Encountering an odd FNFE during indexing...

2002-03-07 14:38:56,160 [Thread-2] ERROR
com.marketingbright.core.tasks.SearchIndexingTask - C:\index\_l.fnm (The
system cannot find the file specified)
java.io.FileNotFoundException:
C:\market\catalina\webapps\marketingbright\index\_l.fnm (The system cannot
find the file specified)
 at java.io.RandomAccessFile.open(Native Method)
 at java.io.RandomAccessFile.<init>(RandomAccessFile.java:98)
 at java.io.RandomAccessFile.<init>(RandomAccessFile.java:143)
 at org.apache.lucene.store.FSInputStream$Descriptor.<init>(Unknown Source)
 at org.apache.lucene.store.FSInputStream.<init>(Unknown Source)
 at org.apache.lucene.store.FSDirectory.openFile(Unknown Source)
 at org.apache.lucene.index.FieldInfos.<init>(Unknown Source)
 at org.apache.lucene.index.SegmentReader.<init>(Unknown Source)
 at org.apache.lucene.index.IndexWriter.mergeSegments(Unknown Source)
 at org.apache.lucene.index.IndexWriter.optimize(Unknown Source)
 at
com.marketingbright.core.service.search.SearchIndexer.index(SearchIndexer.ja
va:59)
 at
com.marketingbright.core.tasks.SearchIndexingTask.run(SearchIndexingTask.jav
a:47)
 at
com.marketingbright.core.services.schedule.ScheduledJob.execute(ScheduledJob
.java:88)
 at
com.marketingbright.core.services.schedule.WorkerThread.run(WorkerThread.jav
a:128)
 at java.lang.Thread.run(Thread.java:484)

I'm kinda puzzled, can anyone shed some light on this? Using v1.2rc4 on a
Win9x box.

http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg00969.html

Regards,
Kelvin Tan

Relevanz Pte Ltd
http://www.relevanz.com

180B Bencoolen St.
The Bencoolen, #04-01
S(189648)

Tel: 6238 6229
Fax: 6337 4417



--
To unsubscribe, e-mail:   <mailto:[EMAIL PROTECTED]>
For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>




Objects as search results

2002-04-03 Thread Kelvin Tan

Here's a topic which to my recollection (surprisingly) hasn't been brought
up: Assuming development in an object-oriented environment, it's a fair
assumption that the eventual target of searching is an object. How are
developers making this happen?

Are all fields of the objects indexed and displayed accordingly (this means
that the Document essentially takes the place of the object for search
results. bad idea IMHO)? Is there some way for the object to be
instantiated, then populated? How are these objects then displayed as search
results?

Here are some comments I have:

a) Documents shouldn't be used for displaying search results. To do so would
be inflexible and limit the type of data displayed as results to the fields
in a document. This means that if you wish to display more information, more
information has to be added to the document. This somewhat violates the
purpose of the document, I think, which is to provide an abstraction of an
atomic collection of searched/indexed fields. You may be able to get away
with it for simple applications, but I don't think it's a good idea.

Ideally, objects should be used to display the results then, since that's
what a result represents. I use Velocity, so this is easy for me. I retrieve
the objects as a collection (somehow), and stuff them into the Context for
rendering.

b) Different types of objects obviously have different types of metadata.
How can the different fields for each object be displayed, when the types of
objects to be indexed aren't fixed? (I use fields and metadata
interchangeably, so metadata is really a collection of fields of an object)

c) I use Torque, so object instantiation and population is a pretty easy
thing. I have no real solution to others, who don't have some kind of O/R
tool of sorts.
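
To make that concrete, a rough sketch of what I mean: index only a primary key alongside the searchable text, then hydrate full objects for display (the "id" field name and the Torque-style Peer lookup are illustrative):

Hits hits = searcher.search(query);
List results = new ArrayList();
for (int i = 0; i < hits.length(); i++)
{
    String id = hits.doc(i).get("id");          // stored primary key
    results.add(ProductPeer.retrieveByPK(id));  // hypothetical O/R lookup
}
// hand `results` to Velocity (or any view layer) for rendering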

I have addressed these points to my satisfaction in a current app, but they
are terribly reliant on a specific combination (Torque and Velocity). I'm
really interested to know how other developers have approached this.

Regards,
Kelvin Tan

Relevanz Pte Ltd
http://www.relevanz.com

180B Bencoolen St.
The Bencoolen, #04-01
S(189648)

Tel: 6238 6229
Fax: 6337 4417



--
To unsubscribe, e-mail:   <mailto:[EMAIL PROTECTED]>
For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>




re: Phonetic Encoders

2002-04-01 Thread Kelvin Tan

Peter,

Claus Engel was kind to contribute a couple of phonetic encoders for Lucene
a while ago. You might want to put them up in Contributions:

http://www.mail-archive.com/lucene-dev@jakarta.apache.org/msg00105.html

Regards,
Kelvin Tan

Relevanz Pte Ltd
http://www.relevanz.com

180B Bencoolen St.
The Bencoolen, #04-01
S(189648)

Tel: 6238 6229
Fax: 6337 4417



--
To unsubscribe, e-mail:   <mailto:[EMAIL PROTECTED]>
For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>




Re: Chainable Filter contribution

2002-03-27 Thread Kelvin Tan

Stephan,

I honestly don't know. There's going to be a /contrib section set up soon
though, so I think it might go in there at least.

Does it matter? :)

Regards,
Kelvin
- Original Message -
From: "Strittmatter Stephan (external)"
<[EMAIL PROTECTED]>
To: "'Lucene Users List'" <[EMAIL PROTECTED]>
Sent: Thursday, March 28, 2002 2:54 PM
Subject: RE: Chainable Filter contribution


> Hi Kelvin,
>
> I done som similar only doing XOR for my chains.
> But now your improved filter is better than my own.
> I think I will replace my own by yours.
> Will it be part of Lucene in future?
>
> Regards,
> Stephan
>
> > -Original Message-
> > From: Kelvin Tan [mailto:[EMAIL PROTECTED]]
> > Sent: Thursday, March 28, 2002 2:58 AM
> > To: Armbrust, Daniel C.
> > Cc: [EMAIL PROTECTED]
> > Subject: Re: Chainable Filter contribution
> >
> >
> > Dan,
> >
> > Totally my bad. I had since changed it but hadn't posted it
> > to the list coz
> > I didn't think anyone found it useful.
> >
> > Here's the correct version. I haven't really documented since
> > it's pretty
> > straightforward. Just holler if you need any help.
> >
> > Regards,
> > Kelvin
> > - Original Message -
> > From: "Armbrust, Daniel C." <[EMAIL PROTECTED]>
> > To: <[EMAIL PROTECTED]>
> > Sent: Thursday, March 28, 2002 5:17 AM
> > Subject: Chainable Filter contribution
> >
> >
> > > I found this in the mailing list, and I do need something like this, as I
> > > need to apply more than one filter at a time.  I'm fairly new to lucene,
> > > however, and my knowledge of BitSets is very limited.
> > >
> > > My question, if you would be so kind as to donate a minute of time to me, is
> > > how does this combine the filters?  From my naive look through it, it seems
> > > that all filter results would get discarded except for the last filter that
> > > was applied.
> > >
> > >
> > > Thanks,
> > >
> > > Dan
> > >
> > >
> > >
> > > import org.apache.lucene.index.IndexReader;
> > > import org.apache.lucene.search.Filter;
> > >
> > > import java.io.IOException;
> > > import java.util.BitSet;
> > >
> > > /**
> > >  * 
> > >  * A ChainableFilter allows multiple filters to be chained
> > >  * such that the result is the intersection of all the
> > >  * filters.
> > >  * 
> > >  * 
> > >  * Order in which filters are called depends on
> > >  * the position of the filter in the chain. It's probably
> > >  * more efficient to place the most restrictive filters
> > >  * /least computationally-intensive filters first.
> > >  * 
> > >  *
> > >  * @author <a href="mailto:[EMAIL PROTECTED]">Kelvin Tan</a>
> > >  */
> > > public class ChainableFilter extends Filter
> > > {
> > > /** The filter chain */
> > > private Filter[] chain = null;
> > >
> > > /**
> > >  * Creates a new ChainableFilter.
> > >  *
> > >  * @param chain The chain of filters.
> > >  */
> > > public ChainableFilter(Filter[] chain)
> > > {
> > > this.chain = chain;
> > > }
> > >
> > > public BitSet bits(IndexReader reader) throws IOException
> > > {
> > > BitSet result = null;
> > > for (int i = 0; i < chain.length; i++)
> > > {
> > > result = chain[i].bits(reader);
> > > }
> > > return result;
> > > }
> > > }
> > >
> > >
> >
>
> --
> To unsubscribe, e-mail:
<mailto:[EMAIL PROTECTED]>
> For additional commands, e-mail:
<mailto:[EMAIL PROTECTED]>
>
>


--
To unsubscribe, e-mail:   <mailto:[EMAIL PROTECTED]>
For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>




Re: Question on the FAQ list with filters

2002-03-27 Thread Kelvin Tan

The API provides the Filter mechanism for filtering out hits before they are
searched.

The alternative is to write your own classes to filter out documents
returned after searching.
If it's not efficient to check on every single document, or the results are
not obtained in batch, then this method is probably better.

I currently run a query through my database to return a list of documents
which a particular user is allowed to access. The Filter method, thus makes
a good deal of sense for me, since I'm able to obtain the results in batch.

From my interpretation of the FAQ, it seems that you're expected to write
your own code to perform post-search filtering and not use/subclass the
Filter class. Of course, this could be made slightly clearer...
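
For completeness, a minimal sketch of the selective-collection route
(method and set names hypothetical), assuming the database supplies the
set of ids the user may access:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

import org.apache.lucene.document.Document;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searcher;

// Run the plain search, then keep only the hits whose IDENTIFIER
// field is in the set of values this user may access.
public static List collectAllowed(Searcher searcher, Query query,
                                  Set allowedIds) throws IOException {
    List allowed = new ArrayList();
    Hits hits = searcher.search(query);
    for (int i = 0; i < hits.length(); i++) {
        Document doc = hits.doc(i);
        if (allowedIds.contains(doc.get("IDENTIFIER"))) {
            allowed.add(doc);
        }
    }
    return allowed;
}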

HTH

Regards,
Kelvin
- Original Message -
From: "Armbrust, Daniel C." <[EMAIL PROTECTED]>
To: "'Lucene Users List'" <[EMAIL PROTECTED]>
Sent: Thursday, March 28, 2002 5:52 AM
Subject: Question on the FAQ list with filters


> From the FAQ:
>
> ***
> 16. What is filtering and how is it performed ?
>
> Filtering means imposing additional restrictions on the hit list to
> eliminate hits that otherwise would be included in the search results.
> There are two ways to filter hits:
>
> * Search Query - in this approach, provide your custom filter object
> when you call the search() method. This filter will be called exactly once
> to evaluate every document that resulted in a non-zero score.
>
> * Selective Collection - in this approach you perform the regular search
> and, when you get back the hit list, collect only those that match your
> filtering criteria. In this approach, your filter is called only for hits
> that are returned by the search method, which may be only a subset of the
> non-zero matches (useful when evaluating your search filter is expensive).
>
> ***
>
> I don't see why the second way is useful.  Yes, your filter is called only
> for hits that got returned by the search method, but aren't those the same
> hits that the search() method would run through the filter?  Maybe I'm just
> not reading it closely enough.
>
> Is my assumption that it is faster to provide a filter to the search()
> method than to do a selective collection correct?
>
>
>
>
>
> --
> To unsubscribe, e-mail:

> For additional commands, e-mail:

>


--
To unsubscribe, e-mail:   
For additional commands, e-mail: 




Re: Chainable Filter contribution

2002-03-27 Thread Kelvin Tan

Dan,

Totally my bad. I had since changed it but hadn't posted it to the list coz
I didn't think anyone found it useful.

Here's the correct version. I haven't really documented since it's pretty
straightforward. Just holler if you need any help.

Regards,
Kelvin
- Original Message -
From: "Armbrust, Daniel C." <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Thursday, March 28, 2002 5:17 AM
Subject: Chainable Filter contribution


> I found this in the mailing list, and I do need something like this, as I
> need to apply more than one filter at a time.  I'm fairly new to lucene,
> however, and my knowledge of BitSets is very limited.
>
> My question, if you would be so kind as to donate a minute of time to me, is
> how does this combine the filters?  From my naive look through it, it seems
> that all filter results would get discarded except for the last filter that
> was applied.
>
>
> Thanks,
>
> Dan
>
>
>
> import org.apache.lucene.index.IndexReader;
> import org.apache.lucene.search.Filter;
>
> import java.io.IOException;
> import java.util.BitSet;
>
> /**
>  * 
>  * A ChainableFilter allows multiple filters to be chained
>  * such that the result is the intersection of all the
>  * filters.
>  * 
>  * 
>  * Order in which filters are called depends on
>  * the position of the filter in the chain. It's probably
>  * more efficient to place the most restrictive filters
>  * /least computationally-intensive filters first.
>  * 
>  *
> >  * @author <a href="mailto:[EMAIL PROTECTED]">Kelvin Tan</a>
>  */
> public class ChainableFilter extends Filter
> {
> /** The filter chain */
> private Filter[] chain = null;
>
> /**
>  * Creates a new ChainableFilter.
>  *
>  * @param chain The chain of filters.
>  */
> public ChainableFilter(Filter[] chain)
> {
> this.chain = chain;
> }
>
> public BitSet bits(IndexReader reader) throws IOException
> {
> BitSet result = null;
> for (int i = 0; i < chain.length; i++)
> {
> result = chain[i].bits(reader);
> }
> return result;
> }
> }
>
>



ChainableFilter.java
Description: Binary data
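
(The corrected attachment above survives in the archive only as binary
data. A minimal sketch of what the fix presumably looks like, following
the intersection semantics the javadoc describes: AND each filter's
BitSet into the result instead of overwriting it.)

import java.io.IOException;
import java.util.BitSet;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Filter;

public class ChainableFilter extends Filter {
    private Filter[] chain;

    public ChainableFilter(Filter[] chain) {
        this.chain = chain;
    }

    public BitSet bits(IndexReader reader) throws IOException {
        // Seed with the first filter, then intersect with the rest,
        // so only documents passing every filter keep their bit set.
        BitSet result = chain[0].bits(reader);
        for (int i = 1; i < chain.length; i++) {
            result.and(chain[i].bits(reader));
        }
        return result;
    }
}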

--
To unsubscribe, e-mail:   <mailto:[EMAIL PROTECTED]>
For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>


Re: Indexing and Duplication

2002-03-21 Thread Kelvin Tan



> I'd suggest to reconsider the use of a Hashtable to communicate
> between threads. I know a Hashtable is thread safe, but some form of queue
> is more like the thing one would expect there. Also, with a bounded queue
> a limit on memory usage is easily enforced because the feeding thread
> will wait as long as needed. For more about queues:
> http://g.oswego.edu/dl/classes/EDU/oswego/cs/dl/util/concurrent/intro.html
> The faq entry there about producer and consumer threads convinced me
> to use bounded queues after I got some out of memory crashes...

Thanks Ype, I'll definitely take your advice. Thanks for the link too...
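
For the archives, a minimal sketch of that bounded hand-off, written
against the java.util.concurrent successors of the EDU.oswego classes
linked above (class name and capacity are placeholders):

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

public class BoundedIndexQueue {
    // The bound caps memory: put() blocks once 100 docs are waiting,
    // so the feeding threads can never run ahead of the indexer.
    private final BlockingQueue<Document> queue =
            new ArrayBlockingQueue<Document>(100);

    // Called by the extracting (producer) threads.
    public void put(Document doc) throws InterruptedException {
        queue.put(doc); // blocks while the queue is full
    }

    // Run by the single indexing (consumer) thread.
    public void drainInto(IndexWriter writer) throws Exception {
        while (true) {
            Document doc = queue.take(); // blocks while the queue is empty
            writer.addDocument(doc);
        }
    }
}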

Regards,
Kelvin

>
> Have fun,
> Ype
>
> --
>
> --
> To unsubscribe, e-mail:

> For additional commands, e-mail:

>
>


--
To unsubscribe, e-mail:   
For additional commands, e-mail: 




Re: Indexing and Duplication

2002-03-21 Thread Kelvin Tan


> >
> Yes, I think if the Reader is a regular FileReader that has been opened,
> it would consume the file handle. On the other hand, if it is just a
> String or a StringReader it would consume memory equal (probably
> greater) to the size of the data. One way to fix this is to create your
> own Reader class, say DelayedReader, which does not open a file upon
> creation, but only upon the first read. That would help save the file
> handles.

That's an excellent suggestion actually, Dmitry. Thanks!
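
For the archives, a minimal sketch of such a DelayedReader (the class
name is Dmitry's; the body is one guess at it):

import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;

// Opens the underlying file only on the first read, so thousands of
// pending Documents don't each pin an open file handle.
public class DelayedReader extends Reader {
    private String path;
    private Reader in; // null until the first read

    public DelayedReader(String path) {
        this.path = path;
    }

    public int read(char[] buf, int off, int len) throws IOException {
        if (in == null) {
            in = new FileReader(path); // handle consumed here, not at construction
        }
        return in.read(buf, off, len);
    }

    public void close() throws IOException {
        if (in != null) {
            in.close();
        }
    }
}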

>
> Dmitry
>
>
> --
> To unsubscribe, e-mail:

> For additional commands, e-mail:

>


--
To unsubscribe, e-mail:   
For additional commands, e-mail: 




Re: Multiple field searching

2002-03-21 Thread Kelvin Tan


- Original Message -
From: "Otis Gospodnetic" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Friday, March 22, 2002 1:24 AM
Subject: Re: Multiple field searching


>
> --- Kelvin Tan <[EMAIL PROTECTED]> wrote:
> > hmmm...really?
> >
> > My impression was that the "AND"s are treated equivalently with "+"s
> > by the
> > parser, so they're redundant.
>
> Correct.
>
> > The "{" and "}"s aren't part of the syntax, are they?
>
> I was wondering where those came from.
> I don't think I've seen them in QueryParser.jj.

It could possibly be the documentation in
MultiFieldQueryParser...

/**
 * Parses a query which searches on the fields specified.
 *
 * If x fields are specified, this effectively constructs:
 *
 *   ({field1}:{query}) ({field2}:{query}) ({field3}:{query})...({fieldx}:{query})
 *
 * @param query Query string to parse
 * @param fields Fields to search on
 * @param analyzer Analyzer to use
 */

Kelvin

>
> Otis
>
> > - Original Message -
> > From: "Mehran Mehr" <[EMAIL PROTECTED]>
> > To: "Lucene Users List" <[EMAIL PROTECTED]>; "Kelvin
> > Tan"
> > <[EMAIL PROTECTED]>
> > Sent: Thursday, March 21, 2002 8:11 PM
> > Subject: Re: Multiple field searching
> >
> >
> > > this is the right syntax:
> > >
> > > +(keyword:{computers}) AND +(subject:{News}) AND
> > > content:xml
> > >
> > >
> > > __
> > > Do You Yahoo!?
> > > Yahoo! Movies - coverage of the 74th Academy Awards®
> > > http://movies.yahoo.com/
> > >
> > > --
> > > To unsubscribe, e-mail:
> > <mailto:[EMAIL PROTECTED]>
> > > For additional commands, e-mail:
> > <mailto:[EMAIL PROTECTED]>
> > >
> >
> >
> > --
> > To unsubscribe, e-mail:
> > <mailto:[EMAIL PROTECTED]>
> > For additional commands, e-mail:
> > <mailto:[EMAIL PROTECTED]>
> >
>
>
> __
> Do You Yahoo!?
> Yahoo! Movies - coverage of the 74th Academy Awards®
> http://movies.yahoo.com/
>
> --
> To unsubscribe, e-mail:
<mailto:[EMAIL PROTECTED]>
> For additional commands, e-mail:
<mailto:[EMAIL PROTECTED]>
>
>


--
To unsubscribe, e-mail:   <mailto:[EMAIL PROTECTED]>
For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>




Re: Multiple field searching

2002-03-21 Thread Kelvin Tan

hmmm...really?

My impression was that the "AND"s are treated equivalently with "+"s by the
parser, so they're redundant. The "{" and "}"s aren't part of the syntax,
are they?

Kelvin
- Original Message -
From: "Mehran Mehr" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>; "Kelvin Tan"
<[EMAIL PROTECTED]>
Sent: Thursday, March 21, 2002 8:11 PM
Subject: Re: Multiple field searching


> this is the right syntax:
>
> +(keyword:{computers}) AND +(subject:{News}) AND
> content:xml
>
>
> __
> Do You Yahoo!?
> Yahoo! Movies - coverage of the 74th Academy Awards®
> http://movies.yahoo.com/
>
> --
> To unsubscribe, e-mail:
<mailto:[EMAIL PROTECTED]>
> For additional commands, e-mail:
<mailto:[EMAIL PROTECTED]>
>


--
To unsubscribe, e-mail:   <mailto:[EMAIL PROTECTED]>
For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>




Re: Multiple field searching

2002-03-20 Thread Kelvin Tan


- Original Message -
From: "Otis Gospodnetic" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>; "Kelvin Tan"
<[EMAIL PROTECTED]>
Sent: Thursday, March 21, 2002 2:13 PM
Subject: Re: Multiple field searching


> Kelvin,
>
> Right now I can't imagine a situation where one would need to pass
> multiple query strings in.  I certainly don't need it anywhere where I
> use Lucene, at least not yet.  So I'd say hold off with that addition
> unless people say they really need it.  Couldn't one always just call
> the parse method multiple times with a different query string each
> time, and then combine returned Query instances?
>

Otis,

I suppose that's possible.

Actually if you need to construct complex multi-field boolean queries via a
web-based form, my recommendation is to use client-side Javascript. I posted
one such script a while back (for the life of me I don't know why it slipped
my mind). The difficulty, IMHO, is not getting QueryParser to
construct the query, but making query construction by the end-user simple
enough and idiot-proof...:)

Regards,
Kelvin


--
To unsubscribe, e-mail:   <mailto:[EMAIL PROTECTED]>
For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>




Re: Multiple field searching

2002-03-20 Thread Kelvin Tan

Tate,


> Could not see how the MultiFieldQueryParser could take in multiple
queries. ie.  SUBJECT, KEYWORD & CONTENT field
> have different values to be queried.  It only takes a single query.

You're right in that it doesn't. What I'm wondering right now is whether
that's a common enough request to warrant its inclusion in the
MultiFieldQueryParser class...

Regards,
Kelvin

>
> On Wed, 20 Mar 2002 17:58, Otis Gospodnetic wrote:
> > I'm using MultiTermQueryParser and it works for me.
> >
> > Otis
> >
> > --- Tate Jones <[EMAIL PROTECTED]> wrote:
> > > hi,
> > >
> > > I am trying to search across multiple fields using the following
> > > query
> > >
> > > +keyword:computers +subject:News content:xml
> > > or
> > > +(keyword:{computers}) +(subject:{News}) content:xml
> > >
> > > i have added the fields to the document correctly.
> > >
> > > Have also tried using the MultiFieldQueryParser without success.
> > >
> > > The only query that works is, which is not correct as they are OR's
> > > keyword:computers subject:IT content:xml
> > >
> > > Is anyone having the same problems
> > >
> > > Thanks in advance
> > > Tate
> > >
> > >
> > > --
> > > To unsubscribe, e-mail:
> > > 
> > > For additional commands, e-mail:
> > > 
> >
> > __
> > Do You Yahoo!?
> > Yahoo! Sports - live college hoops coverage
> > http://sports.yahoo.com/
>
> --
> To unsubscribe, e-mail:

> For additional commands, e-mail:

>
>


--
To unsubscribe, e-mail:   
For additional commands, e-mail: 




[OT] Extracting text from PDF via Etymon Pj

2002-03-20 Thread Kelvin Tan

I've received a couple of private mails from users on how to extract text
from PDF files using the Etymon lib. I thought I'd just post it for the
archives in case anyone's interested.

If you still need help, just holler! The references to cat are to Log4j's
Category; you can remove them without side effects if you don't use Log4j.

private String getContent(Pdf pdf, int pageNo)
{
    String content = null;
    PjStream stream = null;
    StringBuffer strbf = new StringBuffer();
    try
    {
        // Resolve the page's content stream(s); a page may hold a
        // single stream or an array of streams.
        PjPage page = (PjPage) pdf.getObject(pdf.getPage(pageNo));
        PjObject pobj = (PjObject) pdf.resolve(page.getContents());
        if (pobj instanceof PjArray)
        {
            PjArray array = (PjArray) pobj;
            Vector vArray = array.getVector();
            int size = vArray.size();
            for (int j = 0; j < size; j++)
            {
                stream = (PjStream) pdf.resolve((PjObject) vArray.get(j));
                strbf.append(getStringFromPjStream(stream));
            }
            content = strbf.toString();
        }
        else
        {
            stream = (PjStream) pobj;
            content = getStringFromPjStream(stream);
        }
    }
    catch (InvalidPdfObjectException pdfe)
    {
        cat.error("Invalid PDF Object:" + pdfe, pdfe);
    }
    catch (Exception e)
    {
        cat.error("Exception in getContent() " + e, e);
    }
    return content;
}

private String getStringFromPjStream(PjStream stream)
{
    StringBuffer strbf = new StringBuffer();
    try
    {
        int start, end = 0;
        // Inflate the stream, then collect every parenthesised run
        // "(...)" -- the literal strings the page draws.
        stream = stream.flateDecompress();
        String longString = stream.toString();
        int lastIndex = longString.lastIndexOf(")");
        while (lastIndex != -1 && end != lastIndex)
        {
            start = longString.indexOf("(", end);
            end = longString.indexOf(")", start);
            String text = longString.substring(start + 1, end);
            strbf.append(text);
        }
    }
    catch (InvalidPdfObjectException pdfe)
    {
        cat.error("InvalidObjectException:" + pdfe.getMessage(), pdfe);
    }
    return strbf.toString();
}

Good luck!

Regards,
Kelvin

Regards,
Kelvin Tan

Relevanz Pte Ltd
http://www.relevanz.com

180B Bencoolen St.
The Bencoolen, #04-01
S(189648)

Tel: 238 6229
Fax: 337 4417



--
To unsubscribe, e-mail:   <mailto:[EMAIL PROTECTED]>
For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>




Re: Multiple field searching

2002-03-19 Thread Kelvin Tan

Tate,

The correct syntax is something like +(keyword:computers) -(subject:News).

HTH.

Maybe it would be helpful to add parse(String[] query, String[] fields,
Analyzer analyzer) methods into MultiFieldQueryParser? What do you think
Otis?
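
A minimal sketch of what such an addition might look like, assuming the
rc3-era static QueryParser.parse(query, field, analyzer) and OR-ing the
per-field queries the way MultiFieldQueryParser already does:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;

// One query string per field, combined as optional clauses:
// (field1:query1) (field2:query2) ... (fieldx:queryx)
public static Query parse(String[] queries, String[] fields,
                          Analyzer analyzer) throws ParseException {
    BooleanQuery bq = new BooleanQuery();
    for (int i = 0; i < fields.length; i++) {
        Query q = QueryParser.parse(queries[i], fields[i], analyzer);
        bq.add(q, false, false); // optional, not required, not prohibited
    }
    return bq;
}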

Kelvin
- Original Message -
From: "Tate Jones" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Wednesday, March 20, 2002 3:36 PM
Subject: Multiple field searching


> hi,
>
> I am trying to search across multiple fields using the following query
>
> +keyword:computers +subject:News content:xml
> or
> +(keyword:{computers}) +(subject:{News}) content:xml
>
> i have added the fields to the document correctly.
>
> Have also tried using the MultiFieldQueryParser without success.
>
> The only query that works is, which is not correct as they are OR's
> keyword:computers subject:IT content:xml
>
> Is anyone having the same problems
>
> Thanks in advance
> Tate
>
>
> --
> To unsubscribe, e-mail:

> For additional commands, e-mail:

>


--
To unsubscribe, e-mail:   
For additional commands, e-mail: 




Re: Indexing and Duplication

2002-03-19 Thread Kelvin Tan

Ype,

- Original Message -
From: "Ype Kingma" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Tuesday, March 19, 2002 3:57 AM
Subject: Re: Indexing and Duplication


> Kelvin,
>
> >Ype,
> >
> >That would be a good solution to my problem if only I weren't performing
> >multi-threaded indexing. :(
> >The Reader obtained by any one thread may not be an accurate reflection of
> >the actual state of the index, just the state when the Reader was
> >instantiated.
>
> Why share index readers between threads?
> For searching this is fine, off course, but importing can be done
> differently.

Even if each thread has its own reader, after it has obtained the reader,
another thread may have written to the index...

>
> You might consider changing the functionality of your threads a bit:
> one or more threads for indexing and one or
> more other threads for extracting the lucene documents.
>
> You could eg. use a bounded queue of batches of lucene docs as input to
> the indexing threads. The extracting thread(s) can then put
> lucene docs in the next batch and put the batch on the queue.
>
> The only exclusive serial part would then be opening the index reader,
> deleting a batch of old docs, and closing the reader. Adding a batch of
> new docs can be done by eg. two threads while not using the reader.
>
> For incremental imports an index reader is also needed to check whether a
> document has been imported or not. Such checks might be done up front
> during a single run of the import program.
>
> In this way the index readers are used for rather short periods
> to do some batch of work, and there is no need to share them
> between threads.

Hmmm...interesting. It's a good suggestion, and I'll need to think a bit
more about it.

>
> >My current solution is that I hold a collection of documents with the key as
> >my object identifier and only write them to the writer after indexing is
>
> What's the difference between 'writing to the writer' and 'indexing'?
>

Sorry. I should've been more explicit. Indexing means the creation of
Document objects and adding fields to them, not adding them to the writer
yet.

> >done. I chose it because it saved me having to write, then delete a
> >document, etc. However, it's not so ideal because the memory consumed by
> >such an approach may be prohibitive.
>
> >What do you think?
>
> Memory usage can be limited by using a bounded queue. A single batch
> of docs on the queue can be limited by eg. the total size of the docs.
>
> I assumed you need to delete old docs while adding new ones. In case
> you don't need to delete old docs, you you might not need an
> index reader at all.

I know. My approach wasn't working with batches at all. Each indexing thread
was just adding documents to a hashtable. The main thread would then iterate
through the hashtable and add them to the writer.

This seems like a silly question, but will keeping hold of Document objects
cause me to run into "Too many files open" problems? If each document object
has a Field.Text which contains a Reader, and the Reader isn't closed till
the document is indexed, would this be an issue? Is the memory consumed by
Document objects directly proportional to the size of the object the Reader
reads?

Thanks.

Regards,
Kelvin

>
> Ype
>
>
> >Regards,
> >Kelvin
> >- Original Message -
> >From: "Ype Kingma" <[EMAIL PROTECTED]>
> >To: "Lucene Users List" <[EMAIL PROTECTED]>
> >Sent: Sunday, March 17, 2002 6:15 AM
> >Subject: Re: Indexing and Duplication
> >
> >
> > > Kelvin,
> > >
> > > >I've got a little problem with indexing that I'd like to throw to
> > > >everyone.
> > > >
> > > >My objects have a unique identifier. When indexing, before I create a new
> > > >document, I'd like to check if a document has already been created with this
> > > >identifier. If so, I'd like to retrieve the document corresponding to this
> > > >identifier, and add the fields I currently have to this document's fields
> > > >and write it. If no such document exists, then I'd create a new document,
> > > >add my fields and write it. What this really does, I guess, is ensure that a
> > > >document object represents a body of information which really belongs
> > > >together, eliminating duplication.
> > > >
> > > >With the current API, writing and retrieving is performed by the IndexWriter
> > > >and IndexReader respectively. This effectively means that in order to do the
> > > >above, I'd have to close the writer, create a new instance of the index
> > > >reader after each document has been added in order for the reader to have
> > > >the most updated version of the index (!).
> > > >
> > > >Does anyone have any suggestions how I might approach this?
> > >
> > > Avoid closing and opening too much by batching n docs at a time
> > > on the index reader and then do the things needed for the n docs on the
> > > index writer. You might have to delete docs on the reader, too.
> > >
> > > The reasons for using the reader for reading/searching/deleting
> > > and using the writer for adding have been discussed some time ago on this
> > > list.

Re: Indexing and Duplication

2002-03-17 Thread Kelvin Tan

Ype,

That would be a good solution to my problem if only I weren't performing
multi-threaded indexing. :(
The Reader obtained by any one thread may not be an accurate reflection of
the actual state of the index, just the state when the Reader was
instantiated.

My current solution is that I hold a collection of documents with the key as
my object identifier and only write them to the writer after indexing is
done. I chose it because it saved me having to write, then delete a
document, etc. However, it's not so ideal because the memory consumed by
such an approach may be prohibitive.

What do you think?

Regards,
Kelvin
- Original Message -
From: "Ype Kingma" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Sunday, March 17, 2002 6:15 AM
Subject: Re: Indexing and Duplication


> Kelvin,
>
> >I've got a little problem with indexing that I'd like to throw to
> >everyone.
> >
> >My objects have a unique identifier. When indexing, before I create a new
> >document, I'd like to check if a document has already been created with this
> >identifier. If so, I'd like to retrieve the document corresponding to this
> >identifier, and add the fields I currently have to this document's fields
> >and write it. If no such document exists, then I'd create a new document,
> >add my fields and write it. What this really does, I guess, is ensure that a
> >document object represents a body of information which really belongs
> >together, eliminating duplication.
> >
> >With the current API, writing and retrieving is performed by the IndexWriter
> >and IndexReader respectively. This effectively means that in order to do the
> >above, I'd have to close the writer, create a new instance of the index
> >reader after each document has been added in order for the reader to have
> >the most updated version of the index (!).
> >
> >Does anyone have any suggestions how I might approach this?
>
> Avoid closing and opening too much by batching n docs at a time
> on the index reader and then do the things needed for the n docs on the
> index writer. You might have to delete docs on the reader, too.
>
> The reasons for using the reader for reading/searching/deleting
> and using the writer for adding have been discussed some time ago on this
> list. I can't provide a pointer into the list archives as I don't recall
> the original subject header, sorry.
>
> Regards,
> Ype
>
> --
>
> --
> To unsubscribe, e-mail:

> For additional commands, e-mail:

>


--
To unsubscribe, e-mail:   
For additional commands, e-mail: 




Re: Need pointers on using a very small part of Lucene

2002-03-14 Thread Kelvin Tan

Robert,

>
> I just have one more question - how do I remove repeated words? Does
> anyone have a filter for doing this?
>
> For example, here's the result of one of my files being worked on:
> "todai customer.formattedmailingaddress3 dear customer.dearnam respond
> request inform productlongnam summari inform subjectssinglelin addit
> question topic question bertek product call 1-888-523-7835 ext.9877
> product inform productnam includ correspond product inform product bertek
> product electron internet site http wwwbertekcom interest bertek
> pharmaceut product appreci sincer responsiblehcp.signatur
> responsiblehcp.fullnam responsiblehcp.position.nam initiator.initi
> responsiblehcp.initi casenumb cc salesrep.fullnam enclosur
> enclosurestextbullet bodytext"
>
>
> If you look closely, you'll see the word 'question' repeated twice.

One way to do it is to write a TokenFilter, and basically construct a Set of
all token.termText values. In the next() method, check whether this
token.termText already exists in the Set. If so, ignore it. If not, add it
to the Set and carry on. Note that this may be rather memory-intensive...:)
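
For the archives, a minimal sketch of such a filter (class name
hypothetical; assumes the 1.2-era TokenFilter, where subclasses set the
protected input field directly):

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class DistinctTokenFilter extends TokenFilter {
    private Set seen = new HashSet(); // termTexts emitted so far

    public DistinctTokenFilter(TokenStream in) {
        input = in;
    }

    public Token next() throws IOException {
        // Skip any token whose termText has been seen before.
        for (Token t = input.next(); t != null; t = input.next()) {
            if (seen.add(t.termText())) { // add() returns false for duplicates
                return t;
            }
        }
        return null; // end of stream
    }
}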

HTH.

Regards,
Kelvin

>
>
> thanks,
> rob
>
>
>
> --
> To unsubscribe, e-mail:

> For additional commands, e-mail:

>


--
To unsubscribe, e-mail:   
For additional commands, e-mail: 




Indexing and Duplication

2002-03-14 Thread Kelvin Tan

I've got a little problem with indexing that I'd like to throw to everyone.

My objects have a unique identifier. When indexing, before I create a new
document, I'd like to check if a document has already been created with this
identifier. If so, I'd like to retrieve the document corresponding to this
identifier, and add the fields I currently have to this document's fields
and write it. If no such document exists, then I'd create a new document,
add my fields and write it. What this really does, I guess, is ensure that a
document object represents a body of information which really belongs
together, eliminating duplication.

With the current API, writing and retrieving is performed by the IndexWriter
and IndexReader respectively. This effectively means that in order to do the
above, I'd have to close the writer, create a new instance of the index
reader after each document has been added in order for the reader to have
the most updated version of the index (!).

Does anyone have any suggestions how I might approach this?
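
(For readers of the archive: the usual workaround is delete-then-re-add,
since a stored document cannot be modified in place. A minimal sketch,
assuming a keyword field named "identifier" and IndexReader's
delete(Term) method; merging the old document's fields into the new one
has to happen in your own code before addDocument().)

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public static void replaceDocument(String indexPath, String id,
                                   Document newDoc) throws Exception {
    // Remove any existing document(s) carrying this identifier.
    IndexReader reader = IndexReader.open(indexPath);
    reader.delete(new Term("identifier", id)); // no-op if the id is new
    reader.close();

    // Now add the (merged) replacement.
    IndexWriter writer = new IndexWriter(indexPath, new StandardAnalyzer(), false);
    writer.addDocument(newDoc);
    writer.close();
}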

Regards,
Kelvin Tan

Relevanz Pte Ltd
http://www.relevanz.com

180B Bencoolen St.
The Bencoolen, #04-01
S(189648)

Tel: 238 6229
Fax: 337 4417



--
To unsubscribe, e-mail:   <mailto:[EMAIL PROTECTED]>
For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>




Re: Maximum indexable data

2002-03-12 Thread Kelvin Tan

Ype,

>
> The 10,000 refers to the maximum nr. of terms per document.
> It's the default, and it's not hardcoded. Simply create an indexwriter
> and change this attribute before adding docs.

Ahhh, my bad. I didn't notice the maxFieldLength field. Guess I'm too used
to looking for getters/setters...:)
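
For anyone searching the archives later, a minimal sketch (path and
value are placeholders):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class RaiseFieldLimit {
    public static void main(String[] args) throws Exception {
        IndexWriter writer =
                new IndexWriter("/tmp/index", new StandardAnalyzer(), true);
        writer.maxFieldLength = 100000; // public field; 10,000 is only the default
        // ... add documents ...
        writer.close();
    }
}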

Thanks anyhow.

Regards,
Kelvin

>
> Regards,
> Ype
>
>
> >What do you think?
> >
> >Kelvin
> >
> >- Original Message -
> >From: "Otis Gospodnetic" <[EMAIL PROTECTED]>
> >To: "Lucene Users List" <[EMAIL PROTECTED]>
> >Sent: Monday, March 11, 2002 11:18 AM
> >Subject: Re: Maximum indexable data
> >
> >
> >> I haven't heard of any such limit.  There is a 'limit' of 10,000
> >> characters on a field length, but that is a limit only because that
> >> number is hard coded in the source.
> >> However, shouldn't this be very simple for you to test?
> >> Index something over and over and see if you ever hit the wall :)
> >>
> >> Otis
> >>
> >> --- Herman Chen <[EMAIL PROTECTED]> wrote:
> >> > Hi,
> >> >
> >> > Is there a limit for the amount of data indexable by a segment?
> >> > If so is there a limit for searching?  i.e. can I give MultiSearcher
> >> > several indices that are all close to the maximum size.  Thanks.
> >> >
> >> > --
> >> > Herman
> >> >
> >> >
> >>
> >>
> >> __
> >> Do You Yahoo!?
> >> Try FREE Yahoo! Mail - the world's greatest free email!
> >> http://mail.yahoo.com/
> >>
> >> --
> >> To unsubscribe, e-mail:
> >
> >> For additional commands, e-mail:
> >
> >>
> >
> >
> >--
> >To unsubscribe, e-mail:

> >For additional commands, e-mail:

>
>
> --
>
> --
> To unsubscribe, e-mail:

> For additional commands, e-mail:

>


--
To unsubscribe, e-mail:   
For additional commands, e-mail: 




Re: Maximum indexable data

2002-03-11 Thread Kelvin Tan

Actually that's something which I'm not exactly thrilled about. Why is this
10,000 value hardcoded instead of configurable? Surely it's sufficient to be
a default instead of a limit.

What do you think?

Kelvin

- Original Message -
From: "Otis Gospodnetic" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Monday, March 11, 2002 11:18 AM
Subject: Re: Maximum indexable data


> I haven't heard of any such limit.  There is a 'limit' of 10,000
> characters on a field length, but that is a limit only because that
> number is hard coded in the source.
> However, shouldn't this be very simple for you to test?
> Index something over and over and see if you ever hit the wall :)
>
> Otis
>
> --- Herman Chen <[EMAIL PROTECTED]> wrote:
> > Hi,
> >
> > Is there a limit for the amount of data indexable by a segment?
> > If so is there a limit for searching?  i.e. can I give MultiSearcher
> > several indices that are all close to the maximum size.  Thanks.
> >
> > --
> > Herman
> >
> >
>
>
> __
> Do You Yahoo!?
> Try FREE Yahoo! Mail - the world's greatest free email!
> http://mail.yahoo.com/
>
> --
> To unsubscribe, e-mail:

> For additional commands, e-mail:

>


--
To unsubscribe, e-mail:   
For additional commands, e-mail: 




[Contrib] ChainableFilter.java

2002-03-11 Thread Kelvin Tan

I'm not sure if anyone's had a need for this, but I did. It's real simple,
but why duplicate something that's already been written? :-)

Regards,
Kelvin Tan

Relevanz Pte Ltd
http://www.relevanz.com

180B Bencoolen St.
The Bencoolen, #04-01
S(189648)

Tel: 238 6229
Fax: 337 4417




ChainableFilter.java
Description: Binary data

--
To unsubscribe, e-mail:   <mailto:[EMAIL PROTECTED]>
For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>


FNFE while indexing

2002-03-06 Thread Kelvin Tan

Encountering an odd FNFE during indexing...

2002-03-07 14:38:56,160 [Thread-2] ERROR com.marketingbright.core.tasks.SearchIndexingTask - C:\index\_l.fnm (The system cannot find the file specified)
java.io.FileNotFoundException: C:\market\catalina\webapps\marketingbright\index\_l.fnm (The system cannot find the file specified)
 at java.io.RandomAccessFile.open(Native Method)
 at java.io.RandomAccessFile.<init>(RandomAccessFile.java:98)
 at java.io.RandomAccessFile.<init>(RandomAccessFile.java:143)
 at org.apache.lucene.store.FSInputStream$Descriptor.<init>(Unknown Source)
 at org.apache.lucene.store.FSInputStream.<init>(Unknown Source)
 at org.apache.lucene.store.FSDirectory.openFile(Unknown Source)
 at org.apache.lucene.index.FieldInfos.<init>(Unknown Source)
 at org.apache.lucene.index.SegmentReader.<init>(Unknown Source)
 at org.apache.lucene.index.IndexWriter.mergeSegments(Unknown Source)
 at org.apache.lucene.index.IndexWriter.optimize(Unknown Source)
 at com.marketingbright.core.service.search.SearchIndexer.index(SearchIndexer.java:59)
 at com.marketingbright.core.tasks.SearchIndexingTask.run(SearchIndexingTask.java:47)
 at com.marketingbright.core.services.schedule.ScheduledJob.execute(ScheduledJob.java:88)
 at com.marketingbright.core.services.schedule.WorkerThread.run(WorkerThread.java:128)
 at java.lang.Thread.run(Thread.java:484)

I'm kinda puzzled; can anyone shed some light on this? I'm using v1.2rc4 on
a Win9x box.

Regards,
Kelvin Tan

Relevanz Pte Ltd
http://www.relevanz.com

180B Bencoolen St.
The Bencoolen, #04-01
S(189648)

Tel: 238 6229
Fax: 337 4417



--
To unsubscribe, e-mail:   <mailto:[EMAIL PROTECTED]>
For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>




Re: indexing and searching different file formats

2002-02-14 Thread Kelvin Tan

Known limitations here:
http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg00280.html

HTH.

Regards,
Kelvin

PS: Pj library is GPL'ed. Commercial licenses go for $5,000 per 100 copies
(1 CPU per copy).

- Original Message -
From: "Kelvin Tan" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
Sent: Friday, February 15, 2002 9:09 AM
Subject: Re: indexing and searching different file formats


> Uhmmm, I can contribute something which does a pretty decent job if anyone's
> interested...
>
> Just have to clean it up a little...
>
> Regards,
> Kelvin
> - Original Message -
> From: "W. Eliot Kimber" <[EMAIL PROTECTED]>
> To: "Lucene Users List" <[EMAIL PROTECTED]>
> Sent: Friday, February 15, 2002 1:10 AM
> Subject: Re: indexing and searching different file formats
>
>
> > Andrew Libby wrote:
> >
> > > and the text needs to be retrieved for indexing.  An extreme example is
> > > a PDF which has a considerably complicated document format.
> >
> > The PJ library from www.etymon.com provides a pretty complete and
> > easy-to-use API for getting info from PDF docs. It wouldn't be too hard
> > to write a PDF indexer for Lucene using this library. The main challenge
> > would be guessing word boundaries in strings where spaces have been
> > replaced with explicit shift values by the formatter.
> >
> > Cheers,
> >
> > Eliot
> > --
> > W. Eliot Kimber, [EMAIL PROTECTED]
> > Consultant, ISOGEN International
> >
> > 1016 La Posada Dr., Suite 240
> > Austin, TX  78752 Phone: 512.656.4139
> >
> > --
> > To unsubscribe, e-mail:
> <mailto:[EMAIL PROTECTED]>
> > For additional commands, e-mail:
> <mailto:[EMAIL PROTECTED]>
> >
> >
>
>
> --
> To unsubscribe, e-mail:
<mailto:[EMAIL PROTECTED]>
> For additional commands, e-mail:
<mailto:[EMAIL PROTECTED]>
>
>
>



PdfTextExtractor.java
Description: Binary data

--
To unsubscribe, e-mail:   <mailto:[EMAIL PROTECTED]>
For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>


Re: Searching multiple fields in one Index of Documents

2002-02-14 Thread Kelvin Tan

As requested,

http://www.relevanz.com/lucene_contrib.zip

Regards,
Kelvin
- Original Message -
From: "Mark Tucker" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Friday, February 15, 2002 2:03 AM
Subject: RE: Searching multiple fields in one Index of Documents


Can you zip up those files or change the .js extension to .txt?  My mail
server strips out potentially harmful files.

Thanks,

Mark

-Original Message-
From: Kelvin Tan [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, February 13, 2002 10:32 PM
To: Lucene Users List
Subject: Re: Searching multiple fields in one Index of Documents


Peter,

As advised, re-released under APL. :) There were some changes to QueryParser
constructors in rc3, and these are reflected here as well.

FWIW, I've also attached a javascript lib and accompanying HTML which
constructs a Lucene multi-field query using a HTML form.

Regards,
Kelvin

- Original Message -
From: "Peter Carlson" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Wednesday, February 13, 2002 10:56 PM
Subject: Re: Searching multiple fields in one Index of Documents


> This is great Kelvin,
> Sorry I didn't see it before.
> I'll add it to the list of contributions.
>
> --Peter
>
> On 2/13/02 12:43 AM, "Kelvin Tan" <[EMAIL PROTECTED]> wrote:
>
> > Charles,
> >
> > See
http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg00176.html
> >
> > Regards,
> > K
> >
> > - Original Message -
> > From: "Charles Harvey" <[EMAIL PROTECTED]>
> > To: <[EMAIL PROTECTED]>
> > Sent: Tuesday, February 12, 2002 8:39 AM
> > Subject: Searching multiple fields in one Index of Documents
> >
> >
> >> I have a working installation of Lucene running against indexes created by
> >> a database query.
> >> Each Document in the Index contains fifteen or twenty fields. I am
> >> currently searching only one field (that contains concatenated database
> >> columns) because I cannot figure out how to search multiple fields. So:
> >>
> >> How can I use Lucene to search more than one field in an Index of
> > Documents?
> >>
> >> eg:
> >> field CATEGORY is(or contains) 'bar'
> >> AND
> >> field BODY contains 'foo'
> >>
> >>
> >>
> >>
> >> _
> >>
> >> "The trouble with the rat-race is that even if you win you're still a
> > rat."
> >> --Lily Tomlin
> >> _
> >> Charles Harvey
> >> Developer
> >> http://www.philly.com
> >> Wk: 215 789 6057
> >> Cell: 215 588 0851
> >>
> >>
> >> --
> >> To unsubscribe, e-mail:
> > <mailto:[EMAIL PROTECTED]>
> >> For additional commands, e-mail:
> > <mailto:[EMAIL PROTECTED]>
> >>
> >>
> >
> >
> > --
> > To unsubscribe, e-mail:
<mailto:[EMAIL PROTECTED]>
> > For additional commands, e-mail:
<mailto:[EMAIL PROTECTED]>
> >
> >
>
>
> --
> To unsubscribe, e-mail:
<mailto:[EMAIL PROTECTED]>
> For additional commands, e-mail:
<mailto:[EMAIL PROTECTED]>
>
>

--
To unsubscribe, e-mail:
<mailto:[EMAIL PROTECTED]>
For additional commands, e-mail:
<mailto:[EMAIL PROTECTED]>




--
To unsubscribe, e-mail:   <mailto:[EMAIL PROTECTED]>
For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>




Re: indexing and searching different file formats

2002-02-14 Thread Kelvin Tan

Uhmmm, I can contribute something which does a pretty decent job if anyone's
interested...

Just have to clean it up a little...

Regards,
Kelvin
- Original Message -
From: "W. Eliot Kimber" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Friday, February 15, 2002 1:10 AM
Subject: Re: indexing and searching different file formats


> Andrew Libby wrote:
>
> > and the text needs to be retrieved for indexing.  An extreme example is
> > a PDF which has a considerably complicated document format.
>
> The PJ library from www.etymon.com provides a pretty complete and
> easy-to-use API for getting info from PDF docs. It wouldn't be too hard
> to write a PDF indexer for Lucene using this library. The main challenge
> would be guessing word boundaries in strings where spaces have been
> replaced with explicit shift values by the formatter.
>
> Cheers,
>
> Eliot
> --
> W. Eliot Kimber, [EMAIL PROTECTED]
> Consultant, ISOGEN International
>
> 1016 La Posada Dr., Suite 240
> Austin, TX  78752 Phone: 512.656.4139
>
> --
> To unsubscribe, e-mail:

> For additional commands, e-mail:

>
>


--
To unsubscribe, e-mail:   
For additional commands, e-mail: 




Re: Searching multiple fields in one Index of Documents

2002-02-13 Thread Kelvin Tan

Odd, I'm positive I uploaded the HTML file with the other attachments. I've
uploaded it here anyway. http://www.relevanz.com/luceneQueryConstructor.html

HTH.

Regards,
Kelvin
- Original Message -----
From: "Kelvin Tan" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Thursday, February 14, 2002 1:32 PM
Subject: Re: Searching multiple fields in one Index of Documents


> Peter,
>
> As advised, re-released under APL. :) There were some changes to QueryParser
> constructors in rc3, and these are reflected here as well.
>
> FWIW, I've also attached a javascript lib and accompanying HTML which
> constructs a Lucene multi-field query using a HTML form.
>
> Regards,
> Kelvin
>
> - Original Message -
> From: "Peter Carlson" <[EMAIL PROTECTED]>
> To: "Lucene Users List" <[EMAIL PROTECTED]>
> Sent: Wednesday, February 13, 2002 10:56 PM
> Subject: Re: Searching multiple fields in one Index of Documents
>
>
> > This is great Kelvin,
> > Sorry I didn't see it before.
> > I'll add it to the list of contributions.
> >
> > --Peter
> >
> > On 2/13/02 12:43 AM, "Kelvin Tan" <[EMAIL PROTECTED]> wrote:
> >
> > > Charles,
> > >
> > > See
> http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg00176.html
> > >
> > > Regards,
> > > K
> > >
> > > - Original Message -
> > > From: "Charles Harvey" <[EMAIL PROTECTED]>
> > > To: <[EMAIL PROTECTED]>
> > > Sent: Tuesday, February 12, 2002 8:39 AM
> > > Subject: Searching multiple fields in one Index of Documents
> > >
> > >
> > >> I have a working installation of Lucene running against indexes
created
> by
> > >> a database query.
> > >> Each Document in the Index contains fifteen or twenty fields. I am
> > >> currently searching only one field (that contains concatenated
database
> > >> columns) because I cannot figure out how to search multiple fields.
So:
> > >>
> > >> How can I use Lucene to search more than one field in an Index of
> > > Documents?
> > >>
> > >> eg:
> > >> field CATEGORY is(or contains) 'bar'
> > >> AND
> > >> field BODY contains 'foo'
> > >>
> > >>
> > >>
> > >>
> > >> _
> > >>
> > >> "The trouble with the rat-race is that even if you win you're still a
> > > rat."
> > >> --Lily Tomlin
> > >> _
> > >> Charles Harvey
> > >> Developer
> > >> http://www.philly.com
> > >> Wk: 215 789 6057
> > >> Cell: 215 588 0851
> > >>
> > >>
> > >> --
> > >> To unsubscribe, e-mail:
> > > <mailto:[EMAIL PROTECTED]>
> > >> For additional commands, e-mail:
> > > <mailto:[EMAIL PROTECTED]>
> > >>
> > >>
> > >
> > >
> > > --
> > > To unsubscribe, e-mail:
> <mailto:[EMAIL PROTECTED]>
> > > For additional commands, e-mail:
> <mailto:[EMAIL PROTECTED]>
> > >
> > >
> >
> >
> > --
> > To unsubscribe, e-mail:
> <mailto:[EMAIL PROTECTED]>
> > For additional commands, e-mail:
> <mailto:[EMAIL PROTECTED]>
> >
> >
>






> --
> To unsubscribe, e-mail:
<mailto:[EMAIL PROTECTED]>
> For additional commands, e-mail:
<mailto:[EMAIL PROTECTED]>


--
To unsubscribe, e-mail:   <mailto:[EMAIL PROTECTED]>
For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>




Re: Searching multiple fields in one Index of Documents

2002-02-13 Thread Kelvin Tan

Peter,

As advised, re-released under APL. :) There were some changes to QueryParser
constructors in rc3, and these are reflected here as well.

FWIW, I've also attached a javascript lib and accompanying HTML which
constructs a Lucene multi-field query using a HTML form.

Regards,
Kelvin

- Original Message -
From: "Peter Carlson" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Wednesday, February 13, 2002 10:56 PM
Subject: Re: Searching multiple fields in one Index of Documents


> This is great Kelvin,
> Sorry I didn't see it before.
> I'll add it to the list of contributions.
>
> --Peter
>
> On 2/13/02 12:43 AM, "Kelvin Tan" <[EMAIL PROTECTED]> wrote:
>
> > Charles,
> >
> > See
http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg00176.html
> >
> > Regards,
> > K
> >
> > - Original Message -
> > From: "Charles Harvey" <[EMAIL PROTECTED]>
> > To: <[EMAIL PROTECTED]>
> > Sent: Tuesday, February 12, 2002 8:39 AM
> > Subject: Searching multiple fields in one Index of Documents
> >
> >
> >> I have a working installation of Lucene running against indexes created by
> >> a database query.
> >> Each Document in the Index contains fifteen or twenty fields. I am
> >> currently searching only one field (that contains concatenated database
> >> columns) because I cannot figure out how to search multiple fields. So:
> >>
> >> How can I use Lucene to search more than one field in an Index of
> > Documents?
> >>
> >> eg:
> >> field CATEGORY is(or contains) 'bar'
> >> AND
> >> field BODY contains 'foo'
> >>
> >>
> >>
> >>
> >> _
> >>
> >> "The trouble with the rat-race is that even if you win you're still a
> > rat."
> >> --Lily Tomlin
> >> _
> >> Charles Harvey
> >> Developer
> >> http://www.philly.com
> >> Wk: 215 789 6057
> >> Cell: 215 588 0851
> >>
> >>
> >> --
> >> To unsubscribe, e-mail:
> > <mailto:[EMAIL PROTECTED]>
> >> For additional commands, e-mail:
> > <mailto:[EMAIL PROTECTED]>
> >>
> >>
> >
> >
> > --
> > To unsubscribe, e-mail:
<mailto:[EMAIL PROTECTED]>
> > For additional commands, e-mail:
<mailto:[EMAIL PROTECTED]>
> >
> >
>
>
> --
> To unsubscribe, e-mail:
<mailto:[EMAIL PROTECTED]>
> For additional commands, e-mail:
<mailto:[EMAIL PROTECTED]>
>
>



MultiFieldQueryParser.java
Description: Binary data


luceneQueryConstructor.js
Description: Binary data

--
To unsubscribe, e-mail:   <mailto:[EMAIL PROTECTED]>
For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>


Re: Searching multiple fields in one Index of Documents

2002-02-13 Thread Kelvin Tan

Charles,

See http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg00176.html

Regards,
K

- Original Message -
From: "Charles Harvey" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Tuesday, February 12, 2002 8:39 AM
Subject: Searching multiple fields in one Index of Documents


> I have a working installation of Lucene running against indexes created by
> a database query.
> Each Document in the Index contains fifteen or twenty fields. I am
> currently searching only one field (that contains concatenated database
> columns) because I cannot figure out how to search multiple fields. So:
>
> How can I use Lucene to search more than one field in an Index of
Documents?
>
> eg:
> field CATEGORY is(or contains) 'bar'
> AND
> field BODY contains 'foo'
>
>
>
>
> _
>
> "The trouble with the rat-race is that even if you win you're still a
rat."
> --Lily Tomlin
> _
> Charles Harvey
> Developer
> http://www.philly.com
> Wk: 215 789 6057
> Cell: 215 588 0851
>
>
> --
> To unsubscribe, e-mail:

> For additional commands, e-mail:

>
>


--
To unsubscribe, e-mail:   
For additional commands, e-mail: 




Re: New to Lucene

2002-02-09 Thread Kelvin Tan


- Original Message -
From: Ype Kingma <[EMAIL PROTECTED]>
To: Lucene Users List <[EMAIL PROTECTED]>
Sent: Saturday, February 09, 2002 4:48 PM
Subject: Re: New to Lucene


[snip]
> >Is this possible? I see that Lucene *can* index database data, but
something
> >needs to be coded to handle this? Has anyone built any thin framework or
> >have code snippets available? Has anyone ever used Lucene to replace
> >Fulcrum?
>
> Not me, perhaps someone else.
>
>

Just posted some code on lucene-dev which may be related to what you're
looking for.

Regards,
Kelvin


--
To unsubscribe, e-mail:   
For additional commands, e-mail: 




Re: Filtering results

2002-02-06 Thread Kelvin Tan

Ype,

Thanks. :)

Got it using IndexReader.termDocs( Term('IDENTIFIER', someAllowedValue)) as you said...

I kinda got misled by the use of TermEnum in DateFilter...

Kelvin
  - Original Message - 
  From: Ype Kingma 
  To: Lucene Users List 
  Sent: Wednesday, February 06, 2002 4:19 AM
  Subject: Re: Filtering results


  Kelvin,

  >I would like to exclude certain documents from my search and naturally chose
  >to extend Filter.
  >
  >I'm not too clear on the proper usage of Terms, TermEnum and TermDocs, and
  >though DateFilter has an example of how to implement a Filter, I'm none the
  >wiser on the proper way of filtering.
  >
  >My document objects contain a field (called IDENTIFIER) which I would like
  >to retrieve the value of, in order to compare it to a list of allowed values to
  >determine if the document should be excluded. Initially, I tried to
  >implement it by iterating through all documents (reader.doc(i)), then
  >retrieving the value of IDENTIFIER (doc.get(IDENTIFIER)). It works, the
  >trouble is that I've no way of accessing the document number from within
  >IndexReader (which is needed to set the BitSet within the bits() method). It
  >seems to be available from only within TermEnum and TermDocs...
  >
  >Of course, I probably haven't performed the filtering correctly, so would
  >appreciate any help on it...

  It works so you did it correctly.
  You can also use IndexReader.termDocs( Term('IDENTIFIER', someAllowedValue))
  and enumerate the resulting TermDocs. It will get you to the doc nr you need
  to set in the BitSet.

  Good luck,
  Ype

  -- 

  --
  To unsubscribe, e-mail:   
  For additional commands, e-mail: 






Filtering results

2002-02-05 Thread Kelvin Tan

I would like to exclude certain documents from my search and naturally chose
to extend Filter.

I'm not too clear on the proper usage of Terms, TermEnum and TermDocs, and
though DateFilter has an example of how to implement a Filter, I'm none the
wiser on the proper way of filtering.

My document objects contain a field (called IDENTIFIER) which I would like
to retrieve the value of, in order to compare it to a list of allowed values to
determine if the document should be excluded. Initially, I tried to
implement it by iterating through all documents (reader.doc(i)), then
retrieving the value of IDENTIFIER (doc.get(IDENTIFIER)). It works, the
trouble is that I've no way of accessing the document number from within
IndexReader (which is needed to set the BitSet within the bits() method). It
seems to be available from only within TermEnum and TermDocs...

Of course, I probably haven't performed the filtering correctly, so would
appreciate any help on it...

Regards,
Kelvin Tan

Relevanz Pte Ltd
http://www.relevanz.com

180B Bencoolen St.
The Bencoolen, #04-01
S(189648)

Tel: 238 6229
Fax: 337 4417


--
To unsubscribe, e-mail:   <mailto:[EMAIL PROTECTED]>
For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>




Re: Indexing and Searching happening together

2002-02-01 Thread Kelvin Tan

Hmmm...have read
http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg00133.html

True (and it's great) that once an IndexReader is open, no actions on the IndexWriter 
affect it. 

However, if an IndexReader is opened _after_ indexing begins, I suppose it'll throw an 
exception? Doesn't it mean that when indexing is taking place, the search engine is 
effectively down...

I suppose then, that if I still want to have the search engine up whilst indexing 
(where indexing takes a non-trivial amount of time), I'll have to index to a temporary 
location, then copy the index files over? Alternatively, I guess it's possible to open 
an IndexReader _before_ commencement of indexing and provide this instance of 
IndexReader to clients who need to search?
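
A minimal sketch of that alternative, assuming the
IndexSearcher(IndexReader) constructor:

import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;

// Opened before indexing starts, the reader sees a stable snapshot of
// the index, so searches keep working while the writer runs in another
// thread. Open a fresh searcher after indexing to pick up the new docs.
public static IndexSearcher openSnapshotSearcher(String indexPath)
        throws IOException {
    IndexReader snapshot = IndexReader.open(indexPath);
    return new IndexSearcher(snapshot);
}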

I just realized that I've effectively rephrased my initial question (and post). 
Perhaps it's the way I phrased my question (or that I'm missing something), but I 
don't seem to have gotten my query across initially...

Regards,
Kelvin
  - Original Message - 
  From: Doug Cutting 
  To: 'Lucene Users List' 
  Sent: Friday, February 01, 2002 12:28 AM
  Subject: RE: Indexing and Searching happening together


  > From: Kelvin Tan [mailto:[EMAIL PROTECTED]]
  > 
  > In the case where indexing takes a non-trivial amount of 
  > time, what is the expected behaviour when a search is 
  > performed while indexing is still going on? 

  Once an IndexReader is open, no actions on an IndexWriter should affect it.
  Adding documents in another thread or process will not affect search results
  until a new IndexReader is opened.  Searching and indexing may proceed
  simultaneously.

  Doug

  --
  To unsubscribe, e-mail:   <mailto:[EMAIL PROTECTED]>
  For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>





Indexing and Searching happening together

2002-01-30 Thread Kelvin Tan

In the case where indexing takes a non-trivial amount of time, what is the expected 
behaviour when a search is performed while indexing is still going on? 

Would it be a good solution to index in a temporary location, then copying the index 
files over to the final location when done?

Thanks..

Regards,
Kelvin Tan

Relevanz Pte Ltd
http://www.relevanz.com

180B Bencoolen St.
The Bencoolen, #04-01
S(189648)

Tel: 238 6229
Fax: 337 4417



