amusing interaction between advanced tokenizers and highlighter package

2004-06-18 Thread David Spencer
I've run across an amusing interaction between advanced 
Analyzers/TokenStreams and the very useful "term highlighter": 
http://cvs.apache.org/viewcvs/jakarta-lucene-sandbox/contributions/highlighter/

I have a custom Analyzer I'm using to index javadoc-generated web pages.
The Analyzer in turn uses a custom TokenStream that tries to tokenize 
Java-language identifiers more intelligently.

A naive analyzer would turn something like "SyncThreadPool" into a single 
token. Mine uses Lucene's handy ability to give a Token a position 
increment of 0 to turn it into this token stream:

Sync           (incr = 0)
Thread         (incr = 0)
Pool           (incr = 0)
SyncThreadPool (incr = 1)
[As an aside, maybe it should also pair up adjacent subtokens, so 
"SyncThread" and "ThreadPool" appear too.]

The point behind this is that someone searching for "threadpool" probably 
wants to see a match for "SyncThreadPool", even though this is the evil 
leading-prefix case. With most other Analyzers and ways of forming a query 
this match would be missed, which I think is anti-human and annoys me to 
no end.

So the analyzer/tokenizer works great, and I have a demo site about to 
come up that indexes lots of publicly available javadoc as a kind of 
resource, so you can easily find what's already been done.

The problem is as follows. In all cases I use my Analyzer to index the 
documents.
If I use my Analyzer with the Highlighter package, it doesn't look at the 
position increment of the tokens, and consequently a nonsense stream of 
matches is output. If I use a different Analyzer with the highlighter 
(say, the StandardAnalyzer), then it doesn't show the matches that really 
matched, since it never sees the "subtokens".

It might be that the fix is for the Highlighter to look at the position 
increment of tokens and, when several tokens share a position (increment 
0), pass along only the one that matches part of the query.
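
Something along these lines, perhaps -- purely a sketch of the idea, and it 
sits in front of the highlighter rather than inside it (the class name and 
the way the query terms get passed in are invented):

import java.io.IOException;
import java.util.Set;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

// Sketch only: collapse tokens that share a position (increment == 0) down
// to a single token, preferring one that matches a term from the query, so
// the stream handed to the highlighter has no stacked tokens.
public class SamePositionCollapseFilter extends TokenFilter {

    private final Set queryTerms; // lowercased query terms (invented plumbing)
    private Token lookahead;      // first token of the next position, read early

    public SamePositionCollapseFilter(TokenStream in, Set queryTerms) {
        super(in);
        this.queryTerms = queryTerms;
    }

    public Token next() throws IOException {
        Token first = (lookahead != null) ? lookahead : input.next();
        lookahead = null;
        if (first == null) {
            return null;
        }

        // Gather the group of tokens stacked on the same position.
        Token best = first;
        Token t;
        while ((t = input.next()) != null && t.getPositionIncrement() == 0) {
            if (queryTerms.contains(t.termText().toLowerCase())
                    && !queryTerms.contains(best.termText().toLowerCase())) {
                best = t; // prefer the token that actually matched the query
            }
        }
        lookahead = t; // null at end of stream

        best.setPositionIncrement(first.getPositionIncrement());
        return best;
    }
}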

Has this come up before and is the issue clear?
thx,
Dave


Re: Question on how to build a query

2004-06-18 Thread Jason St. Louis
Well, I seem to have gotten something to work.  Maybe someone could just 
comment on my approach.

I wrote my indexer so that it added each field without tokenizing it:
Field fnameField = new Field("fname", fname.toLowerCase(), true, true, false);
Field lnameField = new Field("lname", lname.toLowerCase(), true, true, false);
Field cityField = new Field("city", position.toLowerCase(), true, true, false);

By the way, if this is the case, is the indexer even using the analyzer 
that I pass to it?

Then in my search code I create the firstname query as a WildcardQuery 
if the first name is provided (adding a * to the end if it's not already 
there):

Term fnameTerm = null;
Query fnameQuery = null;
if (fnameIn.length() > 0)
{
    if (!fnameIn.endsWith("*"))
    {
        fnameIn += "*";
    }
    fnameTerm = new Term("fname", fnameIn);
    fnameQuery = new WildcardQuery(fnameTerm);
}
I then create my lastname query as either a WildcardQuery or a term 
query depending on whether it contains a *:

Term lnameTerm = new Term("lname", lnameIn);
Query lnameQuery = null;
if (lnameIn.indexOf("*") != -1)
{
    lnameQuery = new WildcardQuery(lnameTerm);
}
else
{
    lnameQuery = new TermQuery(lnameTerm);
}
Lastly, I create the city query as a TermQuery.
Then I add the three queries to a BooleanQuery, skipping the first name 
query if it is null (meaning a first name was not provided) and making 
last name and city required:

if (fnameQuery != null)
{
    overallQuery.add(fnameQuery, true, false);
}
overallQuery.add(lnameQuery, true, false);
overallQuery.add(positionQuery, true, false);
I then search my index and it appears to work. I haven't tested it 
extensively yet, though.
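
For reference, here is the whole thing in one piece as I understand my own 
steps above (a sketch, not pasted verbatim from my webapp; the class and 
variable names are mine):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.WildcardQuery;

public class PersonSearch {

    public static Hits search(String fnameIn, String lnameIn, String cityIn,
                              String indexDir) throws Exception {
        BooleanQuery overallQuery = new BooleanQuery();

        // First name: optional; turn it into a prefix wildcard if provided.
        if (fnameIn.length() > 0) {
            if (!fnameIn.endsWith("*")) {
                fnameIn += "*";
            }
            Query fnameQuery = new WildcardQuery(new Term("fname", fnameIn.toLowerCase()));
            overallQuery.add(fnameQuery, true, false); // required, not prohibited
        }

        // Last name: required; wildcard only if the user typed a *.
        Query lnameQuery;
        if (lnameIn.indexOf("*") != -1) {
            lnameQuery = new WildcardQuery(new Term("lname", lnameIn.toLowerCase()));
        } else {
            lnameQuery = new TermQuery(new Term("lname", lnameIn.toLowerCase()));
        }
        overallQuery.add(lnameQuery, true, false);

        // City: required exact match (the field was indexed untokenized, lowercased).
        overallQuery.add(new TermQuery(new Term("city", cityIn.toLowerCase())), true, false);

        // Keep the searcher open while the Hits are in use; close it when done.
        IndexSearcher searcher = new IndexSearcher(indexDir);
        return searcher.search(overallQuery);
    }
}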

Does this seem like a reasonable way to approach this problem, or am I 
missing something that's going to bite me in the you-know-what?

Thanks.
Jason
Jason St. Louis wrote:
Hi everyone.  I'm wondering if someone could help me out.
I have created an index of a database of person records where I have 
created documents with the following fields:
database primary_key (stored, not indexed)
first name (indexed)
last name (indexed)
city (indexed)

I used SimpleAnalyzer when creating the index.
I am providing a web based form to search this index.  The form has 3 
fields for first name, last name and city (city is a drop down list).

I want to take the user's input from these 3 fields and build a query 
such that:
A)last name is mandatory and can be wildcarded (I will probably make 
sure the value begins with at least one letter)
B)First name can be wildcarded (same as last name, although if it is 
left blank, I would probably just search the last_name and city and 
ignore the first name)
C)city is mandatory and must match exactly

How would I go about building this query?
Do I create a wildcard query for first name and last name, a term query 
for city and then combine them into boolean query where all 3 terms must 
be matched?  I kind of feel like I'm grasping at straws here.  I think I 
just need a jumpstart to understand how the Query API works.

Thanks.
Jason



Re: Lucene search integration with Portal Servers

2004-06-18 Thread Vladimir Yuryev
Hi!
For example:
http://www.lutece.paris.fr/en/jsp/site/Portal.jsp
Regards,
Vladimir.
On Fri, 18 Jun 2004 14:32:18 -0700
 Hetan Shah <[EMAIL PROTECTED]> wrote:
Hi All,
Has anyone tried, or does anyone have, a sample of a working integration 
of Lucene with any J2EE portal server? Also I am curious to know the best 
or most commonly used practices for linking the search results to the 
documents/files on the system.

Thanks all,
-H



RE: demo indexing problems on linux

2004-06-18 Thread Morris Mizrahi
Thanks for your response Daniel.

This is the process I have followed:
1) Run the indexer for the first time with the -create option; it adds 
all of my files to the index.
2) Run the indexer again later in the day, after some files have been 
added but none have been removed. This run removes files from the index 
that still exist on disk and adds only some of the new files.
In the end the index is missing some of the files that should be there: I 
have about 21,000 files, and about 2,000 of them are missing from the 
index.

I have been using LIMO to analyze my index.

Does anyone have any thoughts or ideas?

Thanks for any help.

 Morris



-Original Message-
From: Daniel Naber [mailto:[EMAIL PROTECTED] 
Sent: Friday, June 18, 2004 10:52 AM
To: Lucene Users List
Subject: Re: demo indexing problems on linux

On Thursday 17 June 2004 21:10, Morris Mizrahi wrote:

> When I run org.apache.lucene.demo.IndexHTML on Linux the indexer works
> fine when I am creating a new index (e.g. using the -create -index option).
> But when I run the indexer again (-index without the -create option) for
> updates it does not properly update the index.

Morris,

what exactly happens when you run the update? Does it miss files that have 
been modified? I just tried it on Linux and it works fine. Files that have 
been modified (according to their file date) are deleted and then added 
again to the index.

Regards
 Daniel

-- 
http://www.danielnaber.de




Question on how to build a query

2004-06-18 Thread Jason St. Louis
Hi everyone.  I'm wondering if someone could help me out.
I have created an index of a database of person records where I have 
created documents with the following fields:
database primary_key (stored, not indexed)
first name (indexed)
last name (indexed)
city (indexed)

I used SimpleAnalyzer when creating the index.
I am providing a web based form to search this index.  The form has 3 
fields for first name, last name and city (city is a drop down list).

I want to take the user's input from these 3 fields and build a query 
such that:
A)last name is mandatory and can be wildcarded (I will probably make 
sure the value begins with at least one letter)
B)First name can be wildcarded (same as last name, although if it is 
left blank, I would probably just search the last_name and city and 
ignore the first name)
C)city is mandatory and must match exactly

How would I go about building this query?
Do I create a wildcard query for first name and last name, a term query 
for city and then combine them into boolean query where all 3 terms must 
be matched?  I kind of feel like I'm grasping at straws here.  I think I 
just need a jumpstart to understand how the Query API works.

Thanks.
Jason



Lucene search integration with Portal Servers

2004-06-18 Thread Hetan Shah
Hi All,
Has anyone tried, or does anyone have, a sample of a working integration 
of Lucene with any J2EE portal server? Also I am curious to know the best 
or most commonly used practices for linking the search results to the 
documents/files on the system.
Thanks all,
-H



RE: search "<text>" and "</text>"

2004-06-18 Thread Lynn Li
Both StandardAnalyzer and SnowballAnalyzer remove them.

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] 
Sent: Friday, June 18, 2004 4:05 PM
To: [EMAIL PROTECTED]
Subject: RE: search "<text>" and "</text>"

This depends on the analyzer you use.

http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi?file=chapter.indexing&toc=faq#q13

-Original Message-
From: Lynn Li [mailto:[EMAIL PROTECTED]
Sent: Friday, June 18, 2004 5:03 PM
To: '[EMAIL PROTECTED]'
Subject: search "<text>" and "</text>"


When I search for "<text>" or "</text>", QueryParser parses them into 
"text". How can I make it not remove the anchor brackets and slashes?

Thank you in advance,
Lynn




RE: search "<text>" and "</text>"

2004-06-18 Thread wallen
This depends on the analyzer you use.

http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi?file=chapter.indexing&toc=faq#q13
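
For example, a quick sketch with an analyzer that leaves punctuation alone 
(WhitespaceAnalyzer here; whether that suits your situation is a separate 
question, since the field has to be indexed with a compatible analyzer for 
the query to match anything, and the "contents" field name is arbitrary):

import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class KeepBracketsExample {
    public static void main(String[] args) throws Exception {
        // WhitespaceAnalyzer splits on whitespace only, so the brackets and
        // slash survive; StandardAnalyzer would reduce the input to "text".
        Query q = QueryParser.parse("<text>", "contents", new WhitespaceAnalyzer());
        System.out.println(q); // prints contents:<text>
    }
}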

-Original Message-
From: Lynn Li [mailto:[EMAIL PROTECTED]
Sent: Friday, June 18, 2004 5:03 PM
To: '[EMAIL PROTECTED]'
Subject: search "<text>" and "</text>"


When I search for "<text>" or "</text>", QueryParser parses them into 
"text". How can I make it not remove the anchor brackets and slashes?

Thank you in advance,
Lynn




search "<text>" and "</text>"

2004-06-18 Thread Lynn Li
When I search for "<text>" or "</text>", QueryParser parses them into 
"text". How can I make it not remove the anchor brackets and slashes?

Thank you in advance,
Lynn


Re: Compound file format file size question

2004-06-18 Thread James Dunn
Otis,

Thanks for the response.

Yeah, I was copying the file to a brand-new hard drive that was formatted 
to FAT32 by default, which is probably why it couldn't handle the 13GB 
file (FAT32 caps individual files at 4GB).

I'm converting the drive to NTFS now, which should get me through 
temporarily.  In the future, though, I may break the index up into smaller 
sub-indexes so that I can distribute them across separate physical disks 
for better disk IO.
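
If I do split it up, I assume something like MultiSearcher would let me 
query the sub-indexes as if they were one (a quick sketch; the index paths 
and the "contents" field are made up):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MultiSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searchable;
import org.apache.lucene.search.TermQuery;

public class MultiIndexSearch {
    public static void main(String[] args) throws Exception {
        // One IndexSearcher per sub-index, ideally on separate physical disks.
        Searchable[] searchers = new Searchable[] {
            new IndexSearcher("/disk1/index-part1"),
            new IndexSearcher("/disk2/index-part2")
        };
        MultiSearcher searcher = new MultiSearcher(searchers);

        Query q = new TermQuery(new Term("contents", "lucene"));
        Hits hits = searcher.search(q);
        System.out.println(hits.length() + " hits");

        searcher.close();
    }
}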

Thanks for your help!

Jim
--- Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
> Hello,
> 
> --- James Dunn <[EMAIL PROTECTED]> wrote:
> > Hello all,
> > 
> > I have an index that's about 13GB on disk.  I'm using
> > 1.4 rc3 which uses the compound file format by
> > default.
> > 
> > Once I run optimize on my index, it creates one 13GB
> > .cfs file.  This isn't a problem on Linux (yet), but
> > I'm having some trouble copying the file over to my
> > Windows XP box.
> 
> What is the exact problem? The sheer size of it or
> something else?  Just curious...
> 
> > Is there some way using the compound file format to
> > set the maximum file size and have Lucene break the
> > index into multiple files once it hits that limit?
> 
> Can't be done with Lucene, but I seem to recall some
> discussion about it.  Nothing concrete, though.
> 
> > Or do I need to go back to using the non-compound
> > file format?
> 
> The total size should be (about) the same, but you
> could certainly do that, if having more smaller files
> is better for you.
> 
> Otis
> 
> > Another solution, I suppose, would be to break up my
> > index into separate smaller indexes.  This would be
> > my second choice, however.
> > 
> > Thanks a lot,
> > 
> > Jim





Re: Compound file format file size question

2004-06-18 Thread Otis Gospodnetic
Hello,

--- James Dunn <[EMAIL PROTECTED]> wrote:
> Hello all,
> 
> I have an index that's about 13GB on disk.  I'm using
> 1.4 rc3 which uses the compound file format by
> default.
> 
> Once I run optimize on my index, it creates one 13GB
> .cfs file.  This isn't a problem on Linux (yet), but
> I'm having some trouble copying the file over to my
> Windows XP box.

What is the exact problem? The sheer size of it or something else? 
Just curious...

> Is there some way using the compound file format to
> set the maximum file size and have Lucene break the
> index into multiple files once it hits that limit?

Can't be done with Lucene, but I seem to recall some discussion about
it.  Nothing concrete, though.

> Or do I need to go back to using the non-compound file
> format?

The total size should be (about) the same, but you could certainly do
that, if having more smaller files is better for you.
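
If you do go back to the multi-file format, it's just a flag on the 
IndexWriter before you (re)optimize. A quick sketch, with a made-up index 
path:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class NonCompoundRewrite {
    public static void main(String[] args) throws Exception {
        // false = open the existing index rather than creating a new one
        IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), false);
        writer.setUseCompoundFile(false); // write .fnm/.frq/.prx/... instead of one big .cfs
        writer.optimize();                // re-merge the existing segments in the new format
        writer.close();
    }
}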

Otis

> Another solution, I suppose, would be to break up my
> index into separate smaller indexes.  This would be my
> second choice, however.
> 
> Thanks a lot,
> 
> Jim





Compound file format file size question

2004-06-18 Thread James Dunn
Hello all,

I have an index that's about 13GB on disk.  I'm using
1.4 rc3 which uses the compound file format by
default.

Once I run optimize on my index, it creates one 13GB
.cfs file.  This isn't a problem on Linux (yet), but
I'm having some trouble copying the file over to my
Windows XP box.

Is there some way using the compound file format to
set the maximum file size and have Lucene break the
index into multiple files once it hits that limit?

Or do I need to go back to using the non-compound file
format?

Another solution, I suppose, would be to break up my
index into separate smaller indexes.  This would be my
second choice, however.

Thanks a lot,

Jim




Re: demo indexing problems on linux

2004-06-18 Thread Daniel Naber
On Thursday 17 June 2004 21:10, Morris Mizrahi wrote:

> When I run org.apache.lucene.demo.IndexHTML on Linux the indexer works
> fine when I am creating a new index (e.g. using -create -index option).
> But when I run the indexer again (-index without the -create option) for
> updates it does not properly update the index.

Morris,

what exactly happens when you run the update? Does it miss files that have 
been modified? I just tried it on Linux and it works fine. Files that have 
been modified (according to their file date) are deleted and then added 
again to the index.

Regards
 Daniel

-- 
http://www.danielnaber.de




Re: Selecting documents which have field x

2004-06-18 Thread Otis Gospodnetic
Hello,

The best/easiest/only way I can think of to handle this is to have
another field that serves as a flag.  You could add that field only
when your document has that optional field.
Actually, you may also be able to make use of the ability to add
multiple values to the same field.  Then you could pick some obscure
and/or reserved value for a field to serve as a marker.

Something along the lines of:

if (oh, this type of doc has field X) {
 doc.add(Field.UnStored("fieldName", "field value here"));
 doc.add(Field.UnStored("fieldName", "__flag"));
}

Then you can search for fieldName:__flag and you will find all
documents with "fieldName".

The choice of __flag may not be the best, but you can play with it and
see what works best for you.
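
To make that concrete, a small sketch (the field name and the "__flag" 
marker are just placeholders, and the marker value needs to survive 
whatever analyzer you index with, so pick something it won't strip or 
split):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class FieldFlagExample {

    // At indexing time: add the marker alongside the real value.
    public static void addOptionalField(Document doc, String value) {
        doc.add(Field.UnStored("fieldName", value));
        doc.add(Field.UnStored("fieldName", "__flag"));
    }

    // At search time: every document that got the marker has the field.
    public static Query hasFieldQuery() {
        return new TermQuery(new Term("fieldName", "__flag"));
    }
}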

Otis

--- jt oob <[EMAIL PROTECTED]> wrote:
> Is it possible, and what's the best way, to find all documents which have a
> given field? The field contents may be the empty string "".
> 
> Thanks,
> jt
> 



Selecting documents which have field x

2004-06-18 Thread jt oob
Is it possible, and what's the best way, to find all documents which have a 
given field? The field contents may be the empty string "".

Thanks,
jt




