Re: Indexing flat files with out .txt extension

2005-01-11 Thread Erik Hatcher
On Jan 11, 2005, at 7:28 PM, Hetan Shah wrote:
> Thanks for the pointers, I have modified the Indexer.java to index the
> files from the directory by removing the file extension check of
> (".txt"). Now I do get the index from the files.
> ...
> java org.apache.lucene.demo.SearchFiles

The problem is you're using the SearchFiles demo code, which uses
different field names than Indexer.java.  You need to be sure the
searching and indexing code agree on the field names.  Since you
borrowed Indexer.java from LIA, keep borrowing and use Searcher.java.
You can run "ant Searcher" from the LIA source code.

Be sure to really learn what's going on in that code rather than just 
accepting what it's doing - this will pay off as you continue to evolve 
your application.  Indexer.java has only 6 (effective) lines of code 
tied to Lucene's API, and similarly very few lines of Lucene-dependent 
code in Searcher.java.  All of this is demo code, and is designed to be 
adapted to your needs.

Erik
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: How do I unlock?

2005-01-11 Thread Chris Hostetter
: 1) The FAQ has been moved to the Wiki, so feel free to stick it in
: there.

yeah ... i just wanted to give people a chance to chime in with weird ways
locks are used in case i wasn't aware of something.

: 2) http://www.lucenebook.com/search?query=unlock

ah ... yes.  The really pitiful thing is, I remember looking at that
method before.

FAQed...

http://wiki.apache.org/jakarta-lucene/LuceneFAQ#head-59be30838bbb5692e605384b5f4c2f224f3dfa6f



: > 1) There should probably be a FAQ discussing:
: >    a) where lock files are typically found on various OSes
: >    b) the naming convention of lucene lock files.
: >    c) how to manually clean up lock files (and other files in the
: >       index directory) in a safe manner.
: >
: > 2) it might be a good idea to add a static utility method for cleanly
: > removing all locks (or all locks of a particular type) on an index
: > given a Directory.  Javadocs would indicate this is an "Expert" method
: > which should only be used in code designed to try and recover from
: > serious errors.


-Hoss





WordNet code updated, now with query expansion -- Re: SYNONYM + GOOGLE

2005-01-11 Thread David Spencer
Erik Hatcher wrote:
> On Jan 10, 2005, at 6:54 PM, David Spencer wrote:
>> Hi...I wrote the WordNet sandbox code - but I'm not sure if I
>> understand this thread. Are we saying that it does not work w/ the new
>> WordNet data, or that code in Erik's book is better/more up to date etc?
>
> I have not tried the sandbox with any versions past WordNet 1.6.
> Karthik shows a Java API to it, which I have not used - only your code
> that parses the prolog files.  So the book code explains exactly what is
> in the sandbox and describes WordNet 1.6 integration.  Though WordNet
> has evolved.
>
>> If needed I can update the sandbox code..
>
> It'd be awesome to have current WordNet support - I haven't looked at
> what is involved in making it so.

I verified that the code works w/ the latest WordNet (2.0), and it does 
so, no problem. The relevant data from WordNet has not changed so 
there's no need to upgrade WordNet for this package at least.

I added "query expansion" which takes in a simple query string and, for
every term, adds its synonyms. There's an optional boost parameter that
can be used to "penalize" synonyms if you want to use the heuristic that
the user probably knows the right word.
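
A minimal sketch of the idea, not the sandbox code itself: it assumes a
hypothetical in-memory synonym map instead of the WordNet index, and the
class and method names (SynonymExpansionSketch, expand) are made up for
illustration. Each query term is kept as-is and each of its synonyms is
appended with the boost as a "^" suffix.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class SynonymExpansionSketch {
    /**
     * Expands each whitespace-separated term of the query with its synonyms.
     * The original term is left unboosted; every synonym gets "^boost".
     * The synonym map is a hypothetical stand-in for a WordNet lookup.
     */
    static String expand(String query, Map<String, List<String>> synonyms, float boost) {
        StringBuilder out = new StringBuilder();
        for (String term : query.trim().split("\\s+")) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(term);
            List<String> syns = synonyms.get(term);
            if (syns == null) {
                continue; // no synonyms known for this term
            }
            for (String syn : syns) {
                out.append(' ').append(syn).append('^').append(boost);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, List<String>> synonyms = new LinkedHashMap<String, List<String>>();
        synonyms.put("big", Arrays.asList("large", "grown"));
        synonyms.put("dog", Arrays.asList("hound"));
        // prints: big large^0.9 grown^0.9 dog hound^0.9
        System.out.println(expand("big dog", synonyms, 0.9f));
    }
}
```

The resulting string would then be handed to a query parser; a boost
below 1.0 is what makes the user's own word count for more than its
synonyms.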

For example, with the synonym boost set to 0.9, the query "big dog"
expands to:

big adult^0.9 bad^0.9 bighearted^0.9 boastful^0.9 boastfully^0.9 
bounteous^0.9 bountiful^0.9 braggy^0.9 crowing^0.9 freehanded^0.9 
giving^0.9 grown^0.9 grownup^0.9 handsome^0.9 large^0.9 liberal^0.9 
magnanimous^0.9 momentous^0.9 openhanded^0.9 prominent^0.9 swelled^0.9 
vainglorious^0.9 vauntingly^0.9
 dog andiron^0.9 blackguard^0.9 bounder^0.9 cad^0.9 chase^0.9 click^0.9 
detent^0.9 dogtooth^0.9 firedog^0.9 frank^0.9 frankfurter^0.9 frump^0.9 
heel^0.9 hotdog^0.9 hound^0.9 pawl^0.9 tag^0.9 tail^0.9 track^0.9 
trail^0.9 weenie^0.9 wiener^0.9 wienerwurst^0.9

Amusingly then, documents with the terms "liberal wienerwurst" match 
"big dog"! :)

Javadoc is here:
http://www.searchmorph.com/pub/jakarta-lucene-sandbox/contributions/WordNet/build/docs/api/org/apache/lucene/wordnet/package-summary.html
The new query expansion is here:
http://www.searchmorph.com/pub/jakarta-lucene-sandbox/contributions/WordNet/build/docs/api/org/apache/lucene/wordnet/SynExpand.html
Want to try it out? This page *expands* a query and prints out the 
result (but doesn't execute it yet).
http://www.searchmorph.com/kat/synonym.jsp?syn=big

CVS tree here:
http://cvs.apache.org/viewcvs.cgi/jakarta-lucene-sandbox/contributions/WordNet/
If you just want to use a prebuilt index it's here (1MB):
http://searchmorph.com/pub/syn_index.zip
The prebuilt jar file is here:
http://www.searchmorph.com/pub/lucene-wordnet-dev.jar
Redundant weblog entry here:
http://www.searchmorph.com/weblog/index.php?id=34
Hope y'all like it and someone finds it useful,
  Dave
PS
 Oh - it may need the 1.5 dev branch of Lucene to work - I'm not 
positive, but I tried to remove deprecation warnings and doing so may 
have tied it to the latest code...

Erik


Re: Indexing flat files with out .txt extension

2005-01-11 Thread Hetan Shah
Hi Erik,

Thanks for the pointers, I have modified the Indexer.java to index the
files from the directory by removing the file extension check of
(".txt"). Now I do get the index from the files.

New situation is that when I run the FileSearch

java org.apache.lucene.demo.SearchFiles
Query: tty
Searching for: tty
3 total matching documents
0. No path nor URL for this document
1. No path nor URL for this document
2. No path nor URL for this document

I do not get the actual path from the index and using Luke I get the
three hits. Last two are from the index and not the real documents.

Any idea what is happening and how I can fix it?

Thanks.
-H

Erik Hatcher wrote:
> On Jan 10, 2005, at 7:06 PM, Hetan Shah wrote:
> 
>>Got the latest Ant and got the demo to work. I am however not sure 
>>which part in the whole source code is the indexing for different file 
>>types is done, say for example .html .txt and such?
> 
> 
> Your best bet is to dig around in the codebase.  The Indexer.java code 
> is hard-coded to only do .txt file extensions - this was on purpose as 
> the first example in the book, figuring someone using this code on 
> their C:\ drive would be relatively safe and fast to run.
> 
> There is also an example easily run from the Ant launcher to show how 
> various document types can be handled using an extensible framework.  
> Run "ant ExtensionFileHandler".  It doesn't actually index the document 
> it creates, but displays it to the console.  It would be pretty trivial 
> to pair the Indexer.java code up with the file handler framework to 
> crawl a directory tree and index any content it recognizes.
> 
> 
>>Appreciate your help. If you have any sample code would certainly 
>>appreciate that also.
> 
> 
> You got all the code already.  It should be fairly straightforward to 
> navigate the src tree, especially with the Table of Contents handy:
> 
>   http://www.lucenebook.com/toc
> 
> (incidentally, this dynamic TOC page is blending the blog content with 
> the TOC using an IndexReader to find all blog entries that refer to 
> each section - and you'll see the two, minor and cosmetic, errata 
> listed there already).
> 
>   Erik
> 
> 
> 






RE: SQL Distinct sintax in Lucen

2005-01-11 Thread Chuck Williams
If I understand what you are trying to do, you don't have a problem.
You can OR to your heart's content and Lucene will properly create the
union of the results.  I.e., there will be no duplicates.

There is built-in support for this kind of thing.  See
MultiFieldQueryParser, and for better results, consider
http://issues.apache.org/bugzilla/show_bug.cgi?id=32674.

Chuck

  > -Original Message-
  > From: Carlos Franco Robles [mailto:[EMAIL PROTECTED]
  > Sent: Tuesday, January 11, 2005 2:05 PM
  > To: lucene-user@jakarta.apache.org
  > Subject: SQL Distinct sintax in Lucen
  > 
  > Hi all.
  > 
  > I'm starting to use lucene and I wonder if it is possible to make a
  > query syntax to ask for one string which can be in two different
  > fields and filter duplicated results like with distinct in SQL syntax.
  > Something like:
  > 
  > distinct (+string OR OtherField:(+string))
  > 
  > Thanks a lot
  > 
  > 




Re: SQL Distinct sintax in Lucen

2005-01-11 Thread Daniel Naber
On Tuesday 11 January 2005 23:05, Carlos Franco Robles wrote:

> I'm starting to use lucene and I wonder if it is possible to make a
> query syntax to ask for one string which can be in two different fields
> and filter duplicated results like with distinct in SQL syntax.

Lucene only knows documents and doesn't know what "duplicate" could mean. 
The easiest thing is to iterate over the result set and do the filtering 
yourself.
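
A rough sketch of that filtering, assuming "duplicate" means two hits
sharing the same value in some key field. The Hit class here is a
hypothetical stand-in for whatever you pull out of the search results,
not a Lucene type.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class DistinctHitsSketch {
    /** Hypothetical stand-in for one search hit: a key field value plus its score. */
    static final class Hit {
        final String key;
        final float score;

        Hit(String key, float score) {
            this.key = key;
            this.score = score;
        }
    }

    /** Keeps only the first hit seen for each key, preserving the result order. */
    static List<Hit> distinctByKey(List<Hit> hits) {
        Set<String> seen = new HashSet<String>();
        List<Hit> out = new ArrayList<Hit>();
        for (Hit hit : hits) {
            // Set.add returns false when the key was already present
            if (seen.add(hit.key)) {
                out.add(hit);
            }
        }
        return out;
    }
}
```

Since results come back ordered by score, keeping the first hit per key
keeps the best-scoring one.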

regards
 Daniel

-- 
http://www.danielnaber.de




Re: How do I unlock?

2005-01-11 Thread Chris Hostetter
: What about a shutdown hook?

Interesting idea: at the moment the file is created on disk, the
FSDirectory could add a shutdown hook that checks for the existence of
the file and, if it's still there (implying that the Lock owner failed
without releasing the lock), forcibly removes it.

Of course, this assumes that lock files are never shared between processes
-- i.e., if client A is waiting on a lock that client B is holding, does the
lock that A eventually gets use the same file that B's lock was using, or
does the old lock file get deleted and a new one created?

(I don't really understand a lot of Lucene's locking code)
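
For what it's worth, the hook idea could look roughly like the sketch
below. This is not how FSDirectory works; the class and method names are
invented, and it assumes the lock file belongs to this process alone
(the open question above). The hook only runs on an orderly JVM
shutdown, so a hard kill or crash would still leave the file behind.

```java
import java.io.File;

final class LockCleanupHookSketch {
    /** Builds the cleanup task: delete the lock file if it is still present. */
    static Thread cleanupThread(final File lockFile) {
        return new Thread() {
            public void run() {
                // Still there means the owner never released the lock.
                if (lockFile.exists()) {
                    lockFile.delete();
                }
            }
        };
    }

    /** Registers the cleanup so it runs when the JVM shuts down normally. */
    static void registerCleanup(File lockFile) {
        Runtime.getRuntime().addShutdownHook(cleanupThread(lockFile));
    }
}
```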


-Hoss





Re: How do I unlock?

2005-01-11 Thread Otis Gospodnetic
Eh, that exactly :)  When I read my emails in reverse order

--- Chris Lamprecht <[EMAIL PROTECTED]> wrote:

> What about a shutdown hook?
>   
> Runtime.getRuntime().addShutdownHook(new Thread() {
>     public void run() { /* whatever */ }
> });
> 
> see also
> http://www.onjava.com/pub/a/onjava/2003/03/26/shutdownhook.html
> 
> 
> On Tue, 11 Jan 2005 13:21:42 -0800, Doug Cutting <[EMAIL PROTECTED]> wrote:
> > Joseph Ottinger wrote:
> > > As one for whom the question's come up recently, I'd say that locks need
> > > to be terminated gracefully, instead. I've noticed a number of cases where
> > > the locks get abandoned in exceptional conditions, which is almost exactly
> > > what you don't want.
> > 
> > The problem is that this is hard to do from Java.  A typical approach is
> > to put the process id in the lock file, then, if that process is dead,
> > ignore the lock file.  But Java does not let one know process ids.  Java
> > 1.4 provides a LockFile mechanism which should mostly solve this, but
> > Lucene 1.4.3 does not yet require Java 1.4 and hence cannot use that
> > feature.  Lucene 2.0 is likely to require Java 1.4 and should be able to
> > do a better job of automatically unlocking indexes when processes die.
> > 
> > Doug
> > 
> >
> > 
> >
> 
> 
> 





Re: How do I unlock?

2005-01-11 Thread Otis Gospodnetic
I didn't pay full attention to this thread, but it sounds like somebody
may be interested in RuntimeShutdownHook (or some similar name) as a
place to try to release the locks.

Otis

--- Joseph Ottinger <[EMAIL PROTECTED]> wrote:

> On Tue, 11 Jan 2005, Doug Cutting wrote:
> 
> > Joseph Ottinger wrote:
> > > As one for whom the question's come up recently, I'd say that locks need
> > > to be terminated gracefully, instead. I've noticed a number of cases where
> > > the locks get abandoned in exceptional conditions, which is almost exactly
> > > what you don't want.
> >
> > The problem is that this is hard to do from Java.  A typical approach is
> > to put the process id in the lock file, then, if that process is dead,
> > ignore the lock file.  But Java does not let one know process ids.  Java
> > 1.4 provides a LockFile mechanism which should mostly solve this, but
> > Lucene 1.4.3 does not yet require Java 1.4 and hence cannot use that
> > feature.  Lucene 2.0 is likely to require Java 1.4 and should be able to
> > do a better job of automatically unlocking indexes when processes die.
> 
> Agreed - but while there are some situations in which releasing locks is
> "difficult" (e.g., catastrophic JVM shutdown), there are others in which
> attempts could be made via finally blocks, etc.
> 
>
---
> Joseph B. Ottinger
> http://enigmastation.com
> IT Consultant   
> [EMAIL PROTECTED]
> 
> 
> 
> 





SQL Distinct sintax in Lucen

2005-01-11 Thread Carlos Franco Robles
Hi all.
 
I'm starting to use lucene and I wonder if it is possible to make a
query syntax to ask for one string which can be in two different fields
and filter duplicated results like with distinct in SQL syntax.
Something like:
 
distinct (+string OR OtherField:(+string))
 
Thanks a lot
 
 


Re: How do I unlock?

2005-01-11 Thread Chris Lamprecht
What about a shutdown hook?
  
Runtime.getRuntime().addShutdownHook(new Thread() {
    public void run() { /* whatever */ }
});

see also http://www.onjava.com/pub/a/onjava/2003/03/26/shutdownhook.html


On Tue, 11 Jan 2005 13:21:42 -0800, Doug Cutting <[EMAIL PROTECTED]> wrote:
> Joseph Ottinger wrote:
> > As one for whom the question's come up recently, I'd say that locks need
> > to be terminated gracefully, instead. I've noticed a number of cases where
> > the locks get abandoned in exceptional conditions, which is almost exactly
> > what you don't want.
> 
> The problem is that this is hard to do from Java.  A typical approach is
> to put the process id in the lock file, then, if that process is dead,
> ignore the lock file.  But Java does not let one know process ids.  Java
> 1.4 provides a LockFile mechanism which should mostly solve this, but
> Lucene 1.4.3 does not yet require Java 1.4 and hence cannot use that
> feature.  Lucene 2.0 is likely to require Java 1.4 and should be able to
> do a better job of automatically unlocking indexes when processes die.
> 
> Doug
> 
> 
>




Re: How do I unlock?

2005-01-11 Thread Joseph Ottinger
On Tue, 11 Jan 2005, Doug Cutting wrote:

> Joseph Ottinger wrote:
> > As one for whom the question's come up recently, I'd say that locks need
> > to be terminated gracefully, instead. I've noticed a number of cases where
> > the locks get abandoned in exceptional conditions, which is almost exactly
> > what you don't want.
>
> The problem is that this is hard to do from Java.  A typical approach is
> to put the process id in the lock file, then, if that process is dead,
> ignore the lock file.  But Java does not let one know process ids.  Java
> 1.4 provides a LockFile mechanism which should mostly solve this, but
> Lucene 1.4.3 does not yet require Java 1.4 and hence cannot use that
> feature.  Lucene 2.0 is likely to require Java 1.4 and should be able to
> do a better job of automatically unlocking indexes when processes die.

Agreed - but while there are some situations in which releasing locks is
"difficult" (e.g., catastrophic JVM shutdown), there are others in which
attempts could be made via finally blocks, etc.
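
A small sketch of the finally-block approach, using a plain marker file
as a hypothetical stand-in for a Lucene lock (the class and method names
are invented for illustration). The point is only that the release sits
in a finally block, so an exception in the work does not leave the lock
behind.

```java
import java.io.File;
import java.io.IOException;

final class FinallyReleaseSketch {
    /**
     * Runs the work while holding a marker-file lock and removes the
     * marker afterwards, even if the work throws.
     */
    static void withLock(File lockFile, Runnable work) throws IOException {
        // createNewFile is atomic: it returns false if the file already exists
        if (!lockFile.createNewFile()) {
            throw new IOException("Lock already held: " + lockFile);
        }
        try {
            work.run();
        } finally {
            lockFile.delete(); // released on both normal and exceptional exit
        }
    }
}
```

This does nothing for a killed JVM, which is the "difficult" case.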

---
Joseph B. Ottinger http://enigmastation.com
IT Consultant[EMAIL PROTECTED]





Re: How do I unlock?

2005-01-11 Thread Doug Cutting
Joseph Ottinger wrote:
> As one for whom the question's come up recently, I'd say that locks need
> to be terminated gracefully, instead. I've noticed a number of cases where
> the locks get abandoned in exceptional conditions, which is almost exactly
> what you don't want.

The problem is that this is hard to do from Java.  A typical approach is 
to put the process id in the lock file, then, if that process is dead, 
ignore the lock file.  But Java does not let one know process ids.  Java 
1.4 provides a LockFile mechanism which should mostly solve this, but 
Lucene 1.4.3 does not yet require Java 1.4 and hence cannot use that 
feature.  Lucene 2.0 is likely to require Java 1.4 and should be able to 
do a better job of automatically unlocking indexes when processes die.
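
The "LockFile mechanism" is presumably java.nio.channels.FileLock, which
arrived with NIO in Java 1.4. A minimal sketch of how it could be used
follows; the class and method names are invented, and this is not what
Lucene does today. Because the operating system holds the lock, it
normally goes away when the owning process dies, though the details are
platform-dependent.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;

final class NioFileLockSketch {
    /**
     * Runs the work while holding an OS-level lock on the given file.
     * Returns false, without running the work, if another process
     * already holds the lock.
     */
    static boolean runExclusively(File lockFile, Runnable work) throws IOException {
        RandomAccessFile raf = new RandomAccessFile(lockFile, "rw");
        try {
            FileChannel channel = raf.getChannel();
            FileLock lock = channel.tryLock(); // null if another process holds it
            if (lock == null) {
                return false;
            }
            try {
                work.run();
                return true;
            } finally {
                lock.release();
            }
        } finally {
            raf.close(); // closing the file also closes its channel
        }
    }
}
```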

Doug


Re: Token Characters

2005-01-11 Thread Otis Gospodnetic
The best place to look is:
./src/java/org/apache/lucene/analysis/standard/StandardTokenizer.jj

You can see it at:
http://cvs.apache.org/viewcvs.cgi/jakarta-lucene/src/java/org/apache/lucene/analysis/standard/

Otis


--- Shawn Konopinsky <[EMAIL PROTECTED]> wrote:

> Hey There,
> 
> Wondering where I can find a list of the set of characters that the
> StandardAnalyzer will tokenize on when indexing text in Lucene.
> 
> Best,
> Shawn.
> 
> 
> 
> 





Re: How do I unlock?

2005-01-11 Thread Joseph Ottinger
On Tue, 11 Jan 2005, Chris Hostetter wrote:

> 2) it might be a good idea to add a static utility method for cleanly
> removing all locks (or all locks of a particular type) on an index given a
> Directory.  Javadocs would indicate this is an "Expert" method which
> should only be used in code designed to try and recover from serious
> errors.

As one for whom the question's come up recently, I'd say that locks need
to be terminated gracefully, instead. I've noticed a number of cases where
the locks get abandoned in exceptional conditions, which is almost exactly
what you don't want.

---
Joseph B. Ottinger http://enigmastation.com
IT Consultant[EMAIL PROTECTED]





Token Characters

2005-01-11 Thread Shawn Konopinsky
Hey There,

Wondering where I can find a list of the set of characters that the
StandardAnalyzer will tokenize on when indexing text in Lucene.

Best,
Shawn.





Re: How do I unlock?

2005-01-11 Thread Otis Gospodnetic
Hello,

1) The FAQ has been moved to the Wiki, so feel free to stick it in
there.

2) http://www.lucenebook.com/search?query=unlock

Otis

--- Chris Hostetter <[EMAIL PROTECTED]> wrote:

> 
> : I'm getting
> : Lock obtain timed out.
> :
> : I was developing and forgot to close the writer.  How do I recover?  I
> : killed the program, put the close in, but it won't let me open again.
> 
> if you are using FSDirectory then a lock file was put onto your disk in
> the directory returned by...
> 
> System.getProperty("org.apache.lucene.lockdir",System.getProperty("java.io.tmpdir"));
> 
> so if you haven't defined the property "org.apache.lucene.lockdir" then
> it's wherever your JVM normally puts tmp files (on *nix it's usually in
> /var/tmp ... sometimes /tmp)
> 
> 
> This question has come up a couple of times in the last few weeks, which
> leads me to think:
> 
> 1) There should probably be a FAQ discussing:
>    a) where lock files are typically found on various OSes
>    b) the naming convention of lucene lock files.
>    c) how to manually clean up lock files (and other files in the index
>       directory) in a safe manner.
> 
> 2) it might be a good idea to add a static utility method for cleanly
> removing all locks (or all locks of a particular type) on an index given a
> Directory.  Javadocs would indicate this is an "Expert" method which
> should only be used in code designed to try and recover from serious
> errors.
> 
> 
> 
>   thoughts?
> 
> 
> 
> -Hoss
> 
> 
> 
> 





Re: How do I unlock?

2005-01-11 Thread Chris Hostetter

: I'm getting
: Lock obtain timed out.
:
: I was developing and forgot to close the writer.  How do I recover?  I
: killed the program, put the close in, but it won't let me open again.

if you are using FSDirectory then a lock file was put onto your disk in
the directory returned by...


System.getProperty("org.apache.lucene.lockdir",System.getProperty("java.io.tmpdir"));

so if you haven't defined the property "org.apache.lucene.lockdir" then
it's wherever your JVM normally puts tmp files (on *nix it's usually in
/var/tmp ... sometimes /tmp)
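
To see where that ends up on a given machine, something like the sketch
below should do. The class name is made up, and the "lucene-*.lock"
file name pattern in main() is an assumption about the naming
convention, so check what you find before deleting anything.

```java
import java.io.File;

final class LockDirSketch {
    /** Resolves the lock directory the same way as the property lookup above. */
    static File lockDir() {
        return new File(System.getProperty("org.apache.lucene.lockdir",
                System.getProperty("java.io.tmpdir")));
    }

    public static void main(String[] args) {
        File dir = lockDir();
        System.out.println("Lock directory: " + dir.getAbsolutePath());
        // Assumed naming pattern for lock files; verify against your index.
        String[] names = dir.list();
        if (names != null) {
            for (String name : names) {
                if (name.startsWith("lucene-") && name.endsWith(".lock")) {
                    System.out.println("Possible lock file: " + name);
                }
            }
        }
    }
}
```

Run it with the same -D settings as the application that created the
index, otherwise the property lookup may resolve differently.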


This question has come up a couple of times in the last few weeks, which
leads me to think:

1) There should probably be a FAQ discussing:
   a) where lock files are typically found on various OSes
   b) the naming convention of lucene lock files.
   c) how to manually clean up lock files (and other files in the index
      directory) in a safe manner.

2) it might be a good idea to add a static utility method for cleanly
removing all locks (or all locks of a particular type) on an index given a
Directory.  Javadocs would indicate this is an "Expert" method which
should only be used in code designed to try and recover from serious
errors.



thoughts?



-Hoss





How do I unlock?

2005-01-11 Thread Jim Lynch
I'm getting
Lock obtain timed out.
I was developing and forgot to close the writer.  How do I recover?  I 
killed the program, put the close in, but it won't let me open again.

Thanks,
Jim.


Looking for UK based Lucene consultant

2005-01-11 Thread Nick Burch
Hi All

My company is looking to hire someone UK-based for a few days' Lucene 
consultancy. Experience with coupling Lucene to large scale web 
spidering is a must, experience with term vectors would be a bonus.

Please contact me off-list if interested

Thanks
Nick






Re: what if the IndexReader crashes, after delete, before close.

2005-01-11 Thread Luke Shannon
Here is how I handle it.

The Indexer is a Runnable. All the members it uses are static. The run()
method calls a synchronized method called go(). This kicks off the indexing.

Before you even get to here, the method in the CMS code that created the
thread object and instantiated the index is also synchronized.

Here is the code that handles the potential lock file that may be left
behind from a Reader or Writer.

Note: I found I had to check if the index existed before checking if it was
locked. If I checked if it was locked and the index had not been created yet
I got an error.

// If we have gotten here, this is the only indexer running, so the
// index should not be locked. If it is, the lock is "stale" and must
// be released before we can continue.
try {
    // Check that the index exists first: isLocked() fails on a missing index.
    if (index.exists() && IndexReader.isLocked(indexFileLocation)) {
        Trace.ERROR("INDEX INFO: Had to clear a stale index lock");
        IndexReader.unlock(FSDirectory.getDirectory(index, false));
    }
} catch (IOException e3) {
    Trace.ERROR("INDEX ERROR: IMPORTANT. Was unable to clear a stale index lock: "
            + e3);
}

HTH

Luke

- Original Message - 
From: "Peter Veentjer - Anchor Men" <[EMAIL PROTECTED]>
To: "Lucene Users List" 
Sent: Tuesday, January 11, 2005 3:24 AM
Subject: RE: what if the IndexReader crashes, after delete, before close.




-Original Message-
From: Luke Shannon [mailto:[EMAIL PROTECTED]
Sent: Monday, January 10, 2005 15:46
To: Lucene Users List
Subject: Re: what if the IndexReader crashes, after delete, before
close.


>>One thing that will happen is the lock file
>>will get left behind. This means when you start
>>back up and try to create another Reader you will
>>get a file lock error.

I have figured out that part the hard way ;) Why can't I access my index
anymore?? Ahh.. The lock file

>>Our system is threaded and synchronized.
>>Thus when a Reader is being created I know
>>it is the only one (the Writer comes after
>>the reader has been closed). Before creating
>>it I check if the Index is locked. If it is,
>>I forcefully clear it. This prevents the above
>>problem from happening.

You can have more than 1 reader open at any time, even while a delete or
add is in progress. But you can't delete documents (IndexReader) and add
documents (IndexWriter) at the same time. If you don't have other threads
doing delete/add you won't have to synchronize anything.

And how do you synchronize on it? I applied the ReadWriteLock from
Doug Lea's concurrency library after I had built my own
synchronization brick and somebody pointed out that I was reimplementing
the ReadWriteLock. But at the moment I don't do any synchronization.

And I want to have a component that is executed when the system is started
and knows what to do if there is rubbish in the index directory. I want
that component to restore my index to a usable version (even a small
loss of information is acceptable because everything is checked once in
a while, and user-added information is going to be stored in the
database). So nothing gets lost. The index can be rebuilt.




Luke

- Original Message -
From: "Peter Veentjer - Anchor Men" <[EMAIL PROTECTED]>
To: 
Sent: Saturday, January 08, 2005 4:08 AM
Subject: what if the IndexReader crashes, after delete, before close.


What happens to the Index if the IndexReader crashes, after I have deleted
documents, and before I have called close. Are the deletes ignored? Is the
Index screwed up? Is the filesystem screwed up (if a document is deleted new
delete-files appear) so are the delete-files still there (and can these be
ignored the next time?). Can I restore the index to the previous state, just
by removing those delete-files?






SearchBlox Distributed Edition released

2005-01-11 Thread Robert Selvaraj

The SearchBlox Distributed Edition is a J2EE Search Component for the Akamai
EdgeComputing Platform, a J2EE Application Platform consisting of more than
14,000 servers in more than 1100 networks in over 70 countries. SearchBlox
Distributed Edition can handle search requests concurrently from thousands
of users at any time by distributing the handling of search requests to
thousands of servers on the EdgeComputing Platform. Each of these servers
runs an identical, standalone and special SearchBlox search application that
contains all the components required to handle the search requests. 
   
The Akamai-Certified SearchBlox Distributed Edition offers out-of-the-box
search functionality for fast and easy implementation with your websites,
applications, intranets and portals. SearchBlox uses the Lucene Search API
and incorporates integrated HTTP/HTTPS and File System crawlers, support for
various document formats including HTML, Word, PDF, PowerPoint and Excel,
support for indexing and searching content in 18 languages and customizable
search results, all controlled from a browser-based Admin Console. 

 
Main features:
==============

- On Demand Search. SearchBlox Distributed Edition can handle search
requests concurrently from thousands of users at any time. Moreover, when
there is a sudden surge in search requests, it can quickly scale to handle
the demand without increasing search times. 
 
- Security. SearchBlox Distributed Edition offers total security for
deployed search application by making it completely READONLY. This ensures
the integrity of the deployed application at all times. 

- Integration. SearchBlox Distributed Edition is seamlessly integrated with
Akamai EdgeComputing Platform. With a single click in the SearchBlox Admin
Console, the search application can be deployed on the Akamai EdgeComputing
Platform. 


For more information, please visit the website at
http://www.searchblox.com/distributed.html







Re: some thoughts about adding transactions.

2005-01-11 Thread Scott Ganyo
I didn't want to let this drop on the floor, but I haven't had the 
time to craft a response to it either.  So, just for the record I agree 
that transactions would be nice.  I think that it is important that the 
solution address change visibility and concurrent transactions within 
multiple VMs.  Also, it should be backward compatible so that 
applications can run without transactions.  So, I think that a good 
solution is probably more complex than it initially looks...

S
On Jan 8, 2005, at 6:47 AM, Peter Veentjer - Anchor Men wrote:
If have a question about transactions .
Lucene doesn`t support transactions but I find it very important and I 
think it is possible to add some kind of rollback/commit functionality 
to make sure the index doesn`t corrupt..

With lucene every segment is immutable (this is a perfect starting 
point), so after it has been created it will remain forever in a valid 
state. There are 3 ways to alter the index
1) deleting documents
2) adding documents
3) optimization

If I delete a document, a del file appears (but doesn`t alter the 
segment because it is immutable).
-if crash: the del files could be deleted to do a rollback.
-if succes: the del files finally will be used by the writer to skip 
those documents in the new segment.

If a new document is added, a new segment is created (finally).
-if succes: the new segment is created and the old segments can be 
deleted.
-if crash: the new segment (maybe it`s corrupted) can be deleted to do 
a rollback.

If the index is optimized a new segment is created based on older 
segments.
-if succes: the old segments can be deleted.
-if crash: the new segment (maybe it`s corrupted) can be deleted to do 
a rollback.

With this information, wouldn't it be fairly easy to add some kind 
of rollback/transaction functionality?

And how about those 'per index' files? Can these be corrupted? Can 
these be removed and recreated successfully? Would it be an idea to 
make copies of these files and restore them if the transaction is 
rolled back?
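
One way to picture the rollback idea above: record which files exist in the index directory before a change starts, and on failure delete whatever appeared afterwards (new segments, new .del files). The sketch below is plain Java with no Lucene calls; IndexSnapshot, begin and rollback are hypothetical names, not Lucene API. It assumes existing files are never modified in place, so it does not cover the 'per index' files (or a .del file that already existed and was rewritten); those would need to be copied and restored, as the question suggests.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper (not part of Lucene): remembers which files existed before a change. */
final class IndexSnapshot {
    private final Path indexDir;
    private final Set<String> namesBefore;

    private IndexSnapshot(Path indexDir, Set<String> namesBefore) {
        this.indexDir = indexDir;
        this.namesBefore = namesBefore;
    }

    /** Record the file names present in the index directory before a change starts. */
    static IndexSnapshot begin(Path indexDir) throws IOException {
        return new IndexSnapshot(indexDir, listNames(indexDir));
    }

    /** Roll back by deleting every file that appeared after begin(); returns how many were removed. */
    int rollback() throws IOException {
        int removed = 0;
        for (String name : listNames(indexDir)) {
            if (!namesBefore.contains(name)) {
                Files.delete(indexDir.resolve(name));
                removed++;
            }
        }
        return removed;
    }

    /** Collect the names of the regular files directly inside the directory. */
    private static Set<String> listNames(Path dir) throws IOException {
        Set<String> names = new HashSet<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path p : stream) {
                if (Files.isRegularFile(p)) {
                    names.add(p.getFileName().toString());
                }
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("index-demo");
        Files.write(dir.resolve("_1.cfs"), new byte[] {1});  // stands in for an existing segment
        IndexSnapshot snapshot = IndexSnapshot.begin(dir);
        Files.write(dir.resolve("_2.cfs"), new byte[] {2});  // stands in for a half-written segment
        System.out.println("files removed by rollback: " + snapshot.rollback());
    }
}
```

A commit in this picture is simply discarding the snapshot. As the reply above notes, a real solution would also have to deal with change visibility and with concurrent transactions in multiple VMs, which this sketch ignores.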

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Performance question

2005-01-11 Thread Jim Lynch
I would be tempted to index the text fields but not store them.  Since 
Lucene returns everything, as Otis pointed out, it's inefficient to keep 
rarely used data as content in the index.  Put the text fields in a 
database or a file tree somewhere and keep a pointer to it as a field in 
the index.  When you need the data, just retrieve it from wherever using 
the saved pointer.
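
A toy illustration of that layout, in plain Java rather than Lucene: the "index" maps terms to file paths only, and the text itself stays on disk until a hit is actually displayed. PointerIndex and its methods are hypothetical names made up for this sketch; in a real Lucene index the path would be a stored field and the text an indexed-but-not-stored field.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy stand-in for an index: terms are searchable, but the text itself is not kept. */
final class PointerIndex {
    private final Map<String, Set<Path>> postings = new HashMap<>();

    /** Index the words of a file, remembering only its path (the "pointer"). */
    void add(Path file) throws IOException {
        String text = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        for (String term : text.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(file);
            }
        }
    }

    /** Searching returns pointers only; no large text travels with the hits. */
    Set<Path> search(String term) {
        return postings.getOrDefault(term.toLowerCase(), Collections.<Path>emptySet());
    }

    /** Fetch the full text from the external store only when it is really needed. */
    static String load(Path pointer) throws IOException {
        return new String(Files.readAllBytes(pointer), StandardCharsets.UTF_8);
    }
}
```

The trade-off is one extra lookup per displayed hit in exchange for a smaller index and cheaper searches.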

Jim.
Crump, Michael wrote:
Hello,

If I have large text fields that are rarely retrieved but need to be
searched often - Is it better to create 2 indices, one for searching and
one for retrieval, or just one index and put everything in it?

Or are there other recommendations?

Regards,

Michael
 



RE: QUERYPARSIN & BOOSTING

2005-01-11 Thread Chuck Williams
Karthik,

I don't think the boost in your example does much since you are using an
AND query, i.e. all hits will have to contain both vendor:nike and
contents:shoes.  If you used an OR, then the boost would put nike
products above (non-nike) shoes, unless there was some other factor that
causes score of contents:shoes to be 10x greater than that of
vendor:nike.  It's a good idea to look at the results of explain() when
analyzing what's happening with scoring, tuning your boosts and your
Similarity.
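
To make the AND/OR point concrete, here is a toy scoring model in plain Java. It is not Lucene's Similarity formula (which also involves factors such as coord and query normalization); it only assumes that a document's score is the sum of the clauses it matches, with the vendor clause multiplied by its boost. BoostToy and the raw scores are made up for illustration.

```java
/** Toy model: score = contents clause + boost * vendor clause; 0 means the clause did not match. */
final class BoostToy {
    /** Score under an OR query: any matching clause contributes. */
    static double orScore(double contents, double vendor, double vendorBoost) {
        return contents + vendorBoost * vendor;
    }

    /** Under an AND query (+a +b) a document is a hit only if both clauses match. */
    static boolean andMatches(double contents, double vendor) {
        return contents > 0 && vendor > 0;
    }

    public static void main(String[] args) {
        double boost = 10.0;
        // {contents:shoes raw score, vendor:nike raw score}; the numbers are invented
        double[] nikeShoe = {0.5, 1.0};
        double[] otherShoe = {0.9, 0.0};
        double[] nikeShirt = {0.0, 1.0};

        // AND: only the nike shoe is a hit, so the boost has nothing to reorder.
        System.out.println("AND hit, nike shoe:  " + andMatches(nikeShoe[0], nikeShoe[1]));
        System.out.println("AND hit, other shoe: " + andMatches(otherShoe[0], otherShoe[1]));

        // OR: the boosted vendor clause lifts nike products above non-nike shoes.
        System.out.println("OR nike shoe:  " + orScore(nikeShoe[0], nikeShoe[1], boost));
        System.out.println("OR nike shirt: " + orScore(nikeShirt[0], nikeShirt[1], boost));
        System.out.println("OR other shoe: " + orScore(otherShoe[0], otherShoe[1], boost));
    }
}
```

In this model a non-nike shoe only outranks a nike product under OR if its contents score exceeds ten times the vendor score, which is the exception mentioned above. For real numbers, explain() remains the thing to look at.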

Chuck

  > -Original Message-
  > From: Nader Henein [mailto:[EMAIL PROTECTED]
  > Sent: Tuesday, January 11, 2005 12:21 AM
  > To: Lucene Users List
  > Subject: Re: QUERYPARSIN & BOOSTING
  >
  > From the text on the Lucene Jakarta Site:
  > http://jakarta.apache.org/lucene/docs/queryparsersyntax.html
  >
  > Lucene provides the relevance level of matching documents based on
  > the terms found. To boost a term use the caret, "^", symbol with a
  > boost factor (a number) at the end of the term you are searching.
  > The higher the boost factor, the more relevant the term will be.
  >
  > Boosting allows you to control the relevance of a document by
  > boosting its term. For example, if you are searching for
  >
  > jakarta apache
  >
  > and you want the term "jakarta" to be more relevant boost it using
  > the ^ symbol along with the boost factor next to the term. You
  > would type:
  >
  > jakarta^4 apache
  >
  > This will make documents with the term jakarta appear more relevant.
  > You can also boost Phrase Terms as in the example:
  >
  > "jakarta apache"^4 "jakarta lucene"
  >
  > By default, the boost factor is 1. Although the boost factor must
  > be positive, it can be less than 1 (e.g. 0.2)
  >
  > Regards.
  >
  > Nader Henein
  > 
  > 
  > Karthik N S wrote:
  >
  > >Hi Guys
  > >
  > >Apologies...
  > >
  > >This question may have been asked a million times on this forum;
  > >I need some clarifications.
  > >
  > >1) FieldType = keyword, name = vendor
  > >
  > >2) FieldType = text, name = contents
  > >
  > >Question:
  > >
  > >1) How do I construct a query which would allow hits available for
  > >the VENDOR to appear first?
  > >
  > >2) If boosting is to be applied, how?
  > >
  > >3) Is the query constructed below correct?
  > >
  > >+Contents:shoes +((vendor:nike)^10)
  > >
  > >Please advise.
  > >Thx in advance.
  > >
  > >WITH WARM REGARDS
  > >HAVE A NICE DAY
  > >[ N.S.KARTHIK]





RE: what if the IndexReader crashes, after delete, before close.

2005-01-11 Thread Peter Veentjer - Anchor Men
 

-Original Message-
From: Luke Shannon [mailto:[EMAIL PROTECTED] 
Sent: Monday, January 10, 2005 15:46
To: Lucene Users List
Subject: Re: what if the IndexReader crashes, after delete, before
close.


>>One thing that will happen is the lock file 
>>will get left behind. This means when you start 
>>back up and try to create another Reader you will 
>>get a file lock error.

I figured that part out the hard way ;) Why can't I access my index
anymore?? Ahh.. the lock file.

>>Our system is threaded and synchronized. 
>>Thus when a Reader is being created I know 
>>it is the only one (the Writer comes after
>>the reader has been closed). Before creating 
>>it I check if the Index is locked. If it is, 
>>I forcefully clear it. This prevents the above 
>>problem from happening.

You can have more than one reader open at any time, even while a delete or
add is in progress. But you can't delete documents (IndexReader) and add
documents (IndexWriter) at the same time. If you don't have other threads
doing deletes/adds, you won't have to synchronize anything.

And how do you synchronize on it? I applied the ReadWriteLock from
Doug Lea's concurrency library after I had built my own 
synchronization brick and somebody pointed out that I was reimplementing
the ReadWriteLock. But at the moment I don't do any synchronization.
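
For readers who have not used a read/write lock, a minimal sketch of the pattern follows. It uses java.util.concurrent.locks from the JDK (the standardized descendant of Doug Lea's library) and no Lucene calls; IndexGuard, search and modify are hypothetical names, and the lambdas stand in for real index operations. Note the assumption: this version also blocks searches while a modification runs, which is stricter than what is described above.

```java
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Supplier;

/** Hypothetical wrapper: the Supplier/Runnable bodies stand in for real index calls. */
final class IndexGuard {
    private final ReadWriteLock lock = new ReentrantReadWriteLock();

    /** Many searches may run at once: they share the read lock. */
    <T> T search(Supplier<T> query) {
        lock.readLock().lock();
        try {
            return query.get();
        } finally {
            lock.readLock().unlock();
        }
    }

    /** Deletes and adds are exclusive: one modifier at a time, and no searches meanwhile. */
    void modify(Runnable change) {
        lock.writeLock().lock();
        try {
            change.run();
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

If searches should be allowed to continue during deletes and adds, a single plain mutex shared by the deleting and adding code would be enough.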

And I want to have a component that is executed when the system is started
and knows what to do if there is rubbish in the index directory. I want
that component to restore my index to a usable version (even a small
loss of information is acceptable, because everything is checked once in
a while, and user-added information is going to be stored in the
database). So nothing gets lost. The index can be rebuilt.




Luke

- Original Message -
From: "Peter Veentjer - Anchor Men" <[EMAIL PROTECTED]>
To: 
Sent: Saturday, January 08, 2005 4:08 AM
Subject: what if the IndexReader crashes, after delete, before close.


What happens to the Index if the IndexReader crashes after I have deleted
documents and before I have called close? Are the deletes ignored? Is the
Index screwed up? Is the filesystem screwed up (if a document is deleted,
new delete-files appear), so are the delete-files still there (and can
these be ignored the next time)? Can I restore the index to the previous
state just by removing those delete-files?






Re: QUERYPARSIN & BOOSTING

2005-01-11 Thread Nader Henein
From the text on the Lucene Jakarta Site:
http://jakarta.apache.org/lucene/docs/queryparsersyntax.html

Lucene provides the relevance level of matching documents based on the 
terms found. To boost a term use the caret, "^", symbol with a boost 
factor (a number) at the end of the term you are searching. The higher 
the boost factor, the more relevant the term will be.

   Boosting allows you to control the relevance of a document by
   boosting its term. For example, if you are searching for


jakarta apache


   and you want the term "jakarta" to be more relevant boost it using
   the ^ symbol along with the boost factor next to the term. You would
   type:


jakarta^4 apache


   This will make documents with the term jakarta appear more relevant.
   You can also boost Phrase Terms as in the example:


"jakarta apache"^4 "jakarta lucene"


   By default, the boost factor is 1. Although the boost factor must be
   positive, it can be less than 1 (e.g. 0.2)
Regards.
Nader Henein
Karthik N S wrote:
Hi Guys

Apologies...
This question may have been asked a million times on this forum; I need some
clarifications.
1) FieldType = keyword, name = vendor
2) FieldType = text, name = contents
Question:
1) How do I construct a query which would allow hits available for the VENDOR
to appear first?
2) If boosting is to be applied, how?
3) Is the query constructed below correct?
+Contents:shoes +((vendor:nike)^10)

Please Advise.
Thx in advance.
WITH WARM REGARDS
HAVE A NICE DAY
[ N.S.KARTHIK]
