Re: Newbie questions

2005-02-14 Thread Paul Jans
Hi again,

So is SqlDirectory recommended for use in a cluster to
work around the accessibility problem, or are people
using NFS or a standalone server instead?

Thanks in advance,
PJ

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Newbie questions

2005-02-14 Thread Erik Hatcher
On Feb 14, 2005, at 2:40 PM, Paul Jans wrote:
> Hi again,
>
> So is SqlDirectory recommended for use in a cluster to
> work around the accessibility problem, or are people
> using NFS or a standalone server instead?

Neither.  As far as I know, Berkeley DB is the only viable DB
implementation currently.

NFS has notoriously had issues with Lucene and file locking.  Search 
the archives for more details on this.

Erik


Re: Newbie questions

2005-02-11 Thread Erik Hatcher
On Feb 10, 2005, at 5:00 PM, Paul Jans wrote:
> A couple of newbie questions. I've searched the
> archives and read the Javadoc but I'm still having
> trouble figuring these out.

Don't forget to get your copy of Lucene in Action too :)

> 1. What's the best way to index and handle queries
> like the following:
>
> Find me all users with (a CS degree and a GPA > 3.0)
> or (a Math degree and a GPA > 3.5).

Some suggestions:  index degree as a Keyword field.  Pad GPA, so that
all of them are the form #.# (or #.## maybe).  Numerics need to be
lexicographically ordered, and thus padded.

With the right analyzer (see the AnalysisParalysis page on the wiki)
you could use this type of query with QueryParser:

	degree:cs AND gpa:[3.0 TO 9.9]

> 2. What are the best practices for using Lucene in a
> clustered J2EE environment? A standalone index/search
> server or storing the index in the database or
> something else?

There is a LuceneRAR project that is still in its infancy here:
https://lucenerar.dev.java.net/

You can also store a Lucene index in Berkeley DB (look at the
/contrib/db area of the source code repository)

However, most projects do fine with cruder techniques such as sharing
the Lucene index on a common drive and ensuring that locking is
configured to use the common drive also.

Erik


Re: Newbie questions

2005-02-11 Thread Erik Hatcher
On Feb 11, 2005, at 1:36 PM, Erik Hatcher wrote:
>> Find me all users with (a CS degree and a GPA > 3.0)
>> or (a Math degree and a GPA > 3.5).
>
> Some suggestions:  index degree as a Keyword field.  Pad GPA, so that
> all of them are the form #.# (or #.## maybe).  Numerics need to be
> lexicographically ordered, and thus padded.
>
> With the right analyzer (see the AnalysisParalysis page on the wiki)
> you could use this type of query with QueryParser:
>
> 	degree:cs AND gpa:[3.0 TO 9.9]

oops, to be completely technically correct, use curly brackets to get >
rather than >=

	degree:cs AND gpa:{3.0 TO 9.9}

(I'll assume GPAs only go to 4.0 or 5.0 :)
Erik
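
[Editor's sketch] The padding advice above can be illustrated without Lucene at all. The snippet below is only a rough illustration, assuming GPAs below 10 and a #.## format; the class and method names (GpaPadding, pad, inExclusiveRange) are made up for this note and are not part of any Lucene API. It shows that fixed-width strings sort in numeric order and how an exclusive range like {3.0 TO 9.9} behaves under plain string comparison.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not a Lucene class: pads GPAs to a fixed width
// so that string (lexicographic) order agrees with numeric order.
final class GpaPadding {

    // Assumes 0.0 <= gpa < 10.0, so every result has the form #.##
    static String pad(double gpa) {
        return String.format(Locale.ROOT, "%.2f", gpa);
    }

    // Mimics an exclusive range {lower TO upper} with plain string comparison.
    static boolean inExclusiveRange(String term, String lower, String upper) {
        return term.compareTo(lower) > 0 && term.compareTo(upper) < 0;
    }

    public static void main(String[] args) {
        List<String> terms = new ArrayList<>();
        for (double gpa : new double[] {3.5, 2.75, 4.0, 3.0}) {
            terms.add(pad(gpa));
        }
        Collections.sort(terms); // plain string sort
        System.out.println(terms); // [2.75, 3.00, 3.50, 4.00]

        // A GPA of exactly 3.0 falls outside the exclusive range; 3.5 is inside.
        System.out.println(inExclusiveRange(pad(3.0), pad(3.0), pad(9.9))); // false
        System.out.println(inExclusiveRange(pad(3.5), pad(3.0), pad(9.9))); // true
    }
}
```

The same padded string would have to be produced at index time and at query time, otherwise the range bounds and the indexed terms would not line up.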


Re: Newbie questions

2005-02-11 Thread Paul Jans
I've already ordered Lucene in Action :)

> There is a LuceneRAR project that is still in its
> infancy here:
> https://lucenerar.dev.java.net/

I will keep an eye on that for sure.

> You can also store a Lucene index in Berkeley DB
> (look at the
> /contrib/db area of the source code repository)

We're already using Oracle, so would it be possible to
store the index there, thus giving each cluster node
easy access to it? I read about SqlDirectory in the
archives but it looks like it didn't make it into the
API and I don't see it on the contrib page.

I'm more concerned about making the index accessible
than about transactional consistency, so NFS may be
another option like you mention. I'm curious to hear
about other systems which are clustered and how others
are doing this; lessons learnt and best practices etc.

Thanks again for the help. Lucene looks like a first
class tool.

PJ



Newbie questions

2005-02-10 Thread Paul Jans
Hi,

A couple of newbie questions. I've searched the
archives and read the Javadoc but I'm still having
trouble figuring these out. 

1. What's the best way to index and handle queries
like the following: 

Find me all users with (a CS degree and a GPA > 3.0)
or (a Math degree and a GPA > 3.5).

2. What are the best practices for using Lucene in a
clustered J2EE environment? A standalone index/search
server or storing the index in the database or
something else ?

Thank you in advance,
PJ







Re: Newbie Questions: Site Scoping, Page Type Filtering/Sorting, Localization, Clustering

2004-05-31 Thread Erik Hatcher
On May 30, 2004, at 10:34 PM, Sasha Haghani wrote:
> I am a newbie to Lucene and I'm considering using it in an upcoming
> project.  I've read through the documentation but I still have a
> number of questions:

I'll do my best with some pointers below...

> 1. SEGMENTING AN INDEX & QUERIES BY SITE SCOPE
> In my use case, I have a number of logical websites backed by the same
> underlying content store.  A Document may ultimately end up belonging
> to one or more logical sites, but at a distinct URL for each.  The
> simplistic solution is to maintain indices for each logical site, but
> this will result in some unwanted duplication and the need to update
> multiple indices on shared content changes.  Other than that, can
> anyone suggest approaches for how to segment a single index to
> accommodate multiple logical sites and allow queries within a
> particular site's scope?  Are fields the solution?  How should the
> distinct per-site URLs be managed?

I don't think there is a definitive best way to do this.  Per-site
indexes are one option.  Using a site field is another.  Queries for a
particular site could be done either by using QueryFilter or by
wrapping all queries in a BooleanQuery with a required TermQuery for
the site.

Sites could share documents by simply adding multiple
Field.Keyword("site", site) fields to the documents.

> 2. LOCALIZED CONTENT
> I understand that at its core, Lucene can support content from any
> locale and character set supported by Java.  What is the best way of
> implementing Lucene to handle a content base which includes numerous
> locales?  One index per locale or should all Documents be placed in a
> single index and tagged with a locale field?  Or is there another
> approach altogether?

Again, there isn't really a best way, I don't think.  How does the
locale situation relate to the previously mentioned site separation?  A
locale field is a perfectly reasonable way to go also.  I don't know
of any other approach.

> 3. DOCUMENT URLS
> Is the URL at which the original document can be retrieved generally
> (i.e., for linking search results to the original doc) stored as a
> non-indexed, non-tokenized, stored Field in the Document?

It depends on whether you want to query for it or not.  Use
Field.Keyword if you want to be able to query for it, or
Field.UnIndexed if you want it with the attributes you specified
(stored, but neither indexed nor tokenized).

> 4. QUERY FILTERING & SORTING BY FIELD VALUE
> In my application I have a pretty typical need to distinguish between
> different document types (e.g., FAQs, Articles, Reviews, etc.) in
> order to allow the user to restrict their results to particular types
> of documents or to sort results by type.  Are fields again the
> solution for this?  Can Queries filter or sort results/hits on exact
> field values (i.e., non-tokenized field values)?

Fields are generally the solution :)  What else is there?  Documents
have Fields.  Fields are where you put metadata about documents.  A
document type makes perfect sense to put in a field.

QueryFilter or the BooleanQuery AND trick mentioned above would allow
you to narrow results down to a particular set of types.  Sorting works
on exact values, yes.  If lexicographic or numeric sorting is not
sufficient, you can write your own sorting implementation, which could
key off external information if needed.  To sort on a field, it needs
to be indexed and non-tokenized (stored is irrelevant).  There must be
only a single term for that field in a document.  Check the Javadocs
for the Sort class for more details on the sorting requirements.

> 5. DEPLOYING LUCENE IN A CLUSTERED WEB-APP ENVIRONMENT
> How is Lucene to be deployed in a clustered web-app environment?  Do
> all cluster nodes require access to a networked filesystem containing
> the index files or is there another solution?  How is concurrency
> managed when the index is being incrementally updated?

This is entirely up to you to manage.  I'm sure developers building
solutions with Lucene have employed all sorts of various architectures.

Concurrency is managed via lock files that need to be shared among apps
interacting with the index.  The short answer is only a single process
(but multiple threads sharing an IndexWriter) can index at a time.  You
would probably want to build some sort of queuing infrastructure and
have a single indexer, or index into separate indexes and merge them.

> Any answers and suggestions are much appreciated.  Thanks.

I hope this helps some.

Erik
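
[Editor's sketch] The "queuing infrastructure with a single indexer" idea from the answer to question 5 might look roughly like the following inside one JVM. This is only an illustration under stated assumptions: SingleIndexer and its methods are hypothetical names, and the ArrayList merely stands in for the one component that would own the IndexWriter. In a real cluster the queue would have to span processes, which this sketch does not attempt.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration: many producers hand documents to exactly one
// indexing thread, so index writes are never concurrent.
final class SingleIndexer {
    // A single-thread executor is a work queue plus one worker thread.
    private final ExecutorService worker = Executors.newSingleThreadExecutor();
    // Stand-in for the real index; only the worker thread writes to it.
    private final List<String> indexed = new ArrayList<>();

    // Safe to call from any thread; the work itself runs on the worker.
    void submit(String docId) {
        worker.execute(() -> indexed.add(docId));
    }

    // Stops accepting work, waits for queued documents, returns what was indexed.
    List<String> close() {
        worker.shutdown();
        try {
            if (!worker.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("indexer did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for indexer", e);
        }
        return indexed;
    }

    public static void main(String[] args) throws InterruptedException {
        SingleIndexer indexer = new SingleIndexer();
        Thread[] producers = new Thread[3];
        for (int i = 0; i < producers.length; i++) {
            final int node = i;
            producers[i] = new Thread(() -> {
                for (int d = 0; d < 2; d++) {
                    indexer.submit("node" + node + "-doc" + d);
                }
            });
            producers[i].start();
        }
        for (Thread p : producers) {
            p.join();
        }
        System.out.println(indexer.close().size()); // 6
    }
}
```

The alternative Erik mentions, indexing into separate indexes and merging them, avoids the shared queue but needs a merge step afterwards.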


Newbie Questions: Site Scoping, Page Type Filtering/Sorting, Localization, Clustering

2004-05-30 Thread Sasha Haghani
Hi there,
 
I am a newbie to Lucene and I'm considering using it in an upcoming project.
I've read through the documentation but I still have a number of questions:
 
1. SEGMENTING AN INDEX & QUERIES BY SITE SCOPE
In my use case, I have a number of logical websites backed by the same
underlying content store.  A Document may ultimately end up belonging
to one or more logical sites, but at a distinct URL for each.  The
simplistic solution is to maintain indices for each logical site, but this
will result in some unwanted duplication and the need to update multiple
indices on shared content changes.  Other than that, can anyone suggest
approaches for how to segment a single index to accommodate multiple logical
sites and allow queries within a particular site's scope?  Are fields the
solution?  How should the distinct per-site URLs be managed?
 
2. LOCALIZED CONTENT
I understand that at its core, Lucene can support content from any locale
and character set supported by Java.  What is the best way of implementing
Lucene to handle a content base which includes numerous locales?  One index
per locale or should all Documents be placed in a single index and tagged
with a locale field?  Or is there another approach altogether?
 
3. DOCUMENT URLS
Is the URL at which the original document can be retrieved generally (i.e.,
for linking search results to the original doc) stored as a non-indexed,
non-tokenized, stored Field in the Document?
 
4. QUERY FILTERING & SORTING BY FIELD VALUE
In my application I have a pretty typical need to distinguish between
different document types (e.g., FAQs, Articles, Reviews, etc.) in order to
allow the user to restrict their results to particular types of documents or
to sort results by type.  Are fields again the solution for this?  Can
Queries filter or sort results/hits on exact field values (i.e.,
non-tokenized field values)?
 
5. DEPLOYING LUCENE IN A CLUSTERED WEB-APP ENVIRONMENT
How is Lucene to be deployed in a clustered web-app environment?  Do all
cluster nodes require access to a networked filesystem containing the index
files or is there another solution?  How is concurrency managed when the
index is being incrementally updated?
 
Any answers and suggestions are much appreciated.  Thanks.
 
--Daniel


Newbie Questions

2003-08-26 Thread Mark Woon
Hi all...

I've been playing with Lucene for a couple of days now and I have a couple
of questions I'm hoping someone can help me with.  I've created a Lucene
index with data from a database that's in several different fields, and
I want to set up a web page where users can search the index.  Ideally,
all searches should be as google-like as possible.  In Lucene terms, I
guess this means the query should be fuzzy.  For example, if someone
searches for "cancer" then I'd like to get back all results with any form
of the word "cancer" in the term ("cancerous", "breast cancer", etc.).

So far, I seem to be having two problems:

1) How can I search all fields at the same time?  The QueryParser seems 
to only search one specific field.

2) How can I automatically default all searches into fuzzy mode?  I 
don't want my users to have to know that they must add a ~ at the end 
of all their terms.

Thanks,
-Mark




RE: Newbie Questions

2003-08-26 Thread Gregor Heinrich
Hi Mark,

short answers to your questions:

ad 1: MultiFieldQueryParser is what you might want: you can specify the
fields to run the query on. Alternatively, the practice of duplicating the
contents of all separate fields in question into one additional merged field
has been suggested, which enables you to use QueryParser itself.

ad 2: Depending on the Analyzer you use, the query is normalised, i.e.,
stemmed (suffixes are removed from words) and stopword-filtered (highly
frequent words are removed). Have a look at StandardAnalyzer.tokenStream(...)
to see how the different filters work. In the analysis package the 1.3rc2
Lucene distribution has a Porter stemming algorithm: PorterStemmer.

Have fun,

Gregor




RE: Newbie Questions

2003-08-26 Thread Aviran Mordo
1. You need to use MultiFieldQueryParser
2. I think you should use PorterStemFilter instead of fuzzy query
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/analysis/PorterStemFilter.html




Re: Newbie Questions

2003-08-26 Thread Erik Hatcher
On Tuesday, August 26, 2003, at 12:53 AM, Mark Woon wrote:
> 1) How can I search all fields at the same time?  The QueryParser
> seems to only search one specific field.

The common thing I've done and seen others do is glue all the fields
together into a master searchable field named something like "contents"
or "keywords" (be sure to put a space in between text so it can be
tokenized properly).
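
[Editor's sketch] As a rough illustration of the "glue the fields together" approach, the helper below joins field values with a single space so that the last word of one field and the first word of the next stay separate tokens. MergedField and merge are hypothetical names chosen for this note; the resulting string is what would be added to the document as the catch-all field.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: builds the text for a catch-all "contents" field.
final class MergedField {

    // Joins all non-empty field values with one space between them.
    static String merge(Map<String, String> fields) {
        StringBuilder merged = new StringBuilder();
        for (String value : fields.values()) {
            if (value == null || value.isEmpty()) {
                continue; // skip missing values so no double spaces appear
            }
            if (merged.length() > 0) {
                merged.append(' '); // the separator keeps adjacent words apart
            }
            merged.append(value);
        }
        return merged.toString();
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("title", "Breast cancer");
        fields.put("summary", "Screening guidelines");
        // Without the space this would contain the bogus token "cancerScreening".
        System.out.println(merge(fields)); // Breast cancer Screening guidelines
    }
}
```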

> 2) How can I automatically default all searches into fuzzy mode?  I
> don't want my users to have to know that they must add a ~ at the
> end of all their terms.

Your description of searches for "cancer" finding "cancerous" isn't
really what the fuzzy query is about.  What you're after, I think, is
more the stemming algorithms used during the analysis phase.  Have a
look at the SnowballAnalyzer in the Lucene sandbox.  There is a little
bit about it in the article I wrote for java.net:
http://today.java.net/pub/a/today/2003/07/30/LuceneIntro.html - it
definitely sounds like more work in the analysis phase is what you're
after.

	Erik
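
[Editor's sketch] To make the fuzzy-versus-stemming distinction concrete: fuzzy matching in Lucene is based on Levenshtein edit distance, so it favours terms that differ by a character or two rather than terms that share a stem. The class below is a plain textbook edit-distance routine written for this note (EditDistance is a hypothetical name, not Lucene's implementation), and it does not reproduce Lucene's similarity scoring or thresholds.

```java
// Hypothetical illustration: classic two-row Levenshtein edit distance.
final class EditDistance {

    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; // reuse the two rows
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("cancer", "dancer"));    // 1
        System.out.println(levenshtein("cancer", "cancerous")); // 3
    }
}
```

By this measure "dancer" is a closer match to "cancer" than "cancerous" is, which is why a stemming analyzer is the better fit for the behaviour Mark describes.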



RE: Newbie Questions

2003-08-26 Thread Gregor Heinrich
Hi Mark.

Sorry, it's really rc1 that is out. But if you go to the cvs server, then
you'll find the rc2-dev version.

According to the API doc, multiple calls to Document.add with the same field
name result in their text being treated as though appended for the purposes
of search.

Can you try out whether there's a difference between the cases you mention?
I don't know, but I'd be interested as well ;-).

Gregor




-Original Message-
From: Mark Woon [mailto:[EMAIL PROTECTED]
Sent: Tuesday, August 26, 2003 8:52 PM
To: Lucene Users List
Subject: Re: Newbie Questions


Gregor Heinrich wrote:

> ad 1: MultiFieldQueryParser is what you might want: you can specify the
> fields to run the query on. Alternatively, the practice of duplicating
> the contents of all separate fields in question into one additional
> merged field has been suggested, which enables you to use QueryParser
> itself.

Ah, I've been testing out something similar to the latter.  I've been
adding multiple values on the same key.  Won't this have the same
effect?  I've been assuming that if I do

doc.add(Field.Keyword("content", value1));
doc.add(Field.Keyword("content", value2));

and then search the content field for either value, I'd get a hit,
and it seems to work.  This way, I figure I'd be able to differentiate
between values that I want tokenized and values that I don't.

Is there a difference between this and building a StringBuffer
containing all the values and storing that as a single field-value?

> ad 2: Depending on the Analyzer you use, the query is normalised, i.e.,
> stemmed (remove suffixes from words) and stopword-filtered (remove highly
> frequent words). Have a look at StandardAnalyzer.tokenStream(...) to
> see how the different filters work. In the analysis package the 1.3rc2
> Lucene distribution has a Porter stemming algorithm: PorterStemmer.

There's an rc2 out?  Where??  I just checked the Lucene website and only
see rc1.


Thanks everyone for all the quick responses!

-Mark






Newbie Questions

2002-04-07 Thread Chris Withers

Hi there,

I'm new to Lucene and have what will hopefully be a couple of simple questions.

1. Can I index numbers with Lucene? If so, ints or floats or ?

2. Can I index dates with Lucene?

In either case, is there any way I can sort the results returned by a search on
these fields?
Also, can I search for only documents which have been indexed with a range in
one of these fields?

For example: I only want documents where the 'cost' field is between 1000 and
2000 and where the date of manufacture was prior to 13th June 1978.

cheers,

Chris





newbie questions

2001-10-23 Thread David Bonilla



I'm trying to use Lucene in my application but I'm really a
newbie.

1) If I want to create an index in the directory e:\Lucene,
do I just need writer = new IndexWriter("E:/Lucene", null, true); ?

2) How exactly can I create an index in a database? Can
anybody send a sample?

3) About the boolean third parameter of IndexWriter: if I write
writer = new IndexWriter("E:/Lucene", null, false); and the index
doesn't exist, is the index created anyway?
(I need this to check whether the index has already been written or not.)

Thanks a lot !!!
__
David Bonilla Fuertes
THE BIT BANG NETWORK
http://www.bit-bang.com
Profesor Waksman, 8, 6º B
28036 Madrid
SPAIN
Tel.: (+34) 914 577 747
Móvil: 656 62 83 92
Fax: (+34) 914 586 176
__