Re: Multiple indexes

2005-03-01 Thread Erik Hatcher
It's hard to answer such a general question with anything very precise, 
so sorry if this doesn't hit the mark.  Come back with more details and 
we'll gladly assist though.

First, certainly do not copy/paste code.  Use standard reuse practices: 
perhaps the same program can build the two different indexes if passed 
different parameters, or the common code can be shared between the two 
programs as a JAR.
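
For illustration, a minimal sketch of that kind of reuse against the
Lucene 1.4-era API (the DocumentFactory callback and class names are
hypothetical, not from this thread):

import java.util.Iterator;
import java.util.List;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

// Hypothetical callback: each document type supplies its own field mapping.
interface DocumentFactory {
    Document create(Object source);
}

public class IndexBuilder {
    // One builder serves both indexes; only the target path and the
    // factory differ per document type.
    public static void build(String indexPath, List sources,
                             DocumentFactory factory) throws Exception {
        IndexWriter writer =
                new IndexWriter(indexPath, new StandardAnalyzer(), true);
        for (Iterator it = sources.iterator(); it.hasNext();) {
            writer.addDocument(factory.create(it.next()));
        }
        writer.optimize();
        writer.close();
    }
}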

What specifically are the issues you're encountering?
Erik
On Mar 1, 2005, at 8:06 PM, Ben wrote:
Hi
My site has two types of documents with different structures. I would
like to create an index for each type of document. What is the best
way to implement this?
I have been trying to implement this but found out that 90% of the
code is the same.
In the Lucene in Action book, there is a case study on jGuru; it just
mentions that they use multiple indexes. I would like to do something
like that.
Are there any resources on the Internet that I can learn from?
Thanks,
Ben


Re: Questions about GermanAnalyzer/Stemmer [auf Viren geprueft]

2005-03-01 Thread Erik Hatcher
I had to moderate both Jonathan's and Jon's messages in to the list.  
Please subscribe to the list and post to it from the address you've 
subscribed with.  I cannot always guarantee I'll catch moderation messages 
and send them through in a timely fashion.

Erik
On Mar 1, 2005, at 6:18 AM, Jonathan O'Connor wrote:
Jon,
I too found some problems with the German analyser recently. Here's 
what
may help:
1. You can try reading Joerg Caumanns' paper "A Fast and Simple 
Stemming
Algorithm for German Words". This paper describes the algorithm
implemented by GermanAnalyser.
2. I guess German nouns are all capitalized, so maybe that's why. Although 
you would want to be indexing well-written German and not emails or text
messages!
3. The German Stemmer converts umlauts into some funny form (the code is a
bit tricky, and I didn't spend any time looking at it), so maybe that's why
you can't find umlauts properly. I think the main reason for this umlaut
change is that many plurals are formed by umlauting, e.g. Haus, Haeuser
(that ae is an umlaut).

Finally, to really understand what's happening, get your hands on Luke. I
just got it last week, and it's brilliant. It shows you everything about
your indexes. You can also feed text to an Analyser and see what it makes
of it. This will show you the real reason why your umlaut search is
failing.
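
For example, a quick way to dump what an analyser emits (a sketch against
the Lucene 1.4 API; the field name and sample text are arbitrary):

import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.de.GermanAnalyzer;

public class AnalyzerDump {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new GermanAnalyzer();
        TokenStream stream = analyzer.tokenStream("content",
                new StringReader("Haus Haeuser Häuser"));
        // Print each token exactly as it would be indexed.
        for (Token token = stream.next(); token != null; token = stream.next()) {
            System.out.println("[" + token.termText() + "]");
        }
    }
}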
Ciao,
Jonathan O'Connor
XCOM Dublin


"Jon Humble" <[EMAIL PROTECTED]>
01/03/2005 09:35
Please respond to
"Lucene Users List" 
To

cc
Subject
Questions about GermanAnalyzer/Stemmer [auf Viren geprueft]


Hello,
We're using the GermanAnalyzer/Stemmer to index/search our (German)
Website.
I have a few questions:
(1) Why is the GermanAnalyzer case-sensitive? None of the other
language indexers seem to be. What does this feature add?
(2) With the German Analyzer, wildcard searches containing extended
German characters do not seem to work. So a* is fine, but ä* or ö*
always finds zero results.
(3) In a similar vein to (2), wildcard searches with escaped special
characters fail to find results. So a search for co\-operative works but
a search for co\-op* fails.

I will be grateful for any light that can be shed on these problems.
With Thanks,
Jon.
Jon Humble
BSc (hons,)
Software Engineer
eMail: [EMAIL PROTECTED]
TecSphere Ltd
Centre for Advanced Industry
Coble Dene, Royal Quays
Newcastle upon Tyne NE29 6DE
United Kingdom
Direct Dial: +44 (191) 270 31 06
Fax: +44 (191) 270 31 09
http://www.tecsphere.com





Re: Fast access to a random page of the search results.

2005-02-28 Thread Erik Hatcher
On Feb 28, 2005, at 10:39 AM, Stanislav Jordanov wrote:
> What did you do in your private investigation?
1. empirical tests with an index of nearly 75,000 docs (I am attaching 
the test source)
Only certain (.txt?) attachments are allowed to come through on the 
mailing list.

> Sorted by descending relevance (the default), or in some other way?
In some other way - sorted by some column (asc or desc - doesn't 
matter)
Using IndexSearcher(query, sort)?
 > If a search is fast enough, as you report, then you can simply start
> your access to Hits at the appropriate spot.  For the current systems
> I'm working on, this is the approach I've used - start iterating hits
> at (pageNumber - 1) * numberOfItemsPerPage.
>
> Is that approach insufficient?
I'm afraid this is not sufficient;
either I am doing something wrong,
or it is not that simple.
The following is a log from my test session.
It appears that IndexSearcher.search(...) finishes rather fast
compared to the time it takes to fetch the last document from the Hits
object.
I assume you are only accessing the documents you wish to display 
rather than all of them up to the one you need.  Also keep in mind that 
accessing a Document is when the document is pulled from the index.  If 
you have a large amount of data in a document it will take a 
corresponding amount of time to load it.  You may need to restructure 
what you store in a document to reduce the load times.  Or perhaps you 
need to investigate the (is it in the codebase already?) patch to load 
fields lazily upon demand instead.
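
A sketch of the paging approach (assuming the sfile_name field from the
log below; Hits fetches each Document only when doc(i) is called):

import org.apache.lucene.document.Document;
import org.apache.lucene.search.Hits;

public class PageFetcher {
    // Fetch only the requested page; untouched Hits entries are never
    // read from the index.
    public static void printPage(Hits hits, int pageNumber, int itemsPerPage)
            throws Exception {
        int start = (pageNumber - 1) * itemsPerPage;
        int end = Math.min(start + itemsPerPage, hits.length());
        for (int i = start; i < end; i++) {
            Document doc = hits.doc(i); // document loaded here
            System.out.println(doc.get("sfile_name"));
        }
    }
}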

Erik


The log starts here:
pa
Found 74222 document(s) that matched query 'pa'
Sorting by "sfile_name"
query executed in 16ms
Last doc accessed in 375ms
us
Found 74222 document(s) that matched query 'us'
Sorting by "sfile_name"
query executed in 31ms
Last doc accessed in 219ms
1
Found 74222 document(s) that matched query '1'
Sorting by "sfile_name"
query executed in 15ms
Last doc accessed in 235ms
5
Found 74222 document(s) that matched query '5'
Sorting by "sfile_name"
query executed in 422ms
Last doc accessed in 219ms
6
Found 72759 document(s) that matched query '6'
Sorting by "sfile_name"
query executed in 344ms
Last doc accessed in 250ms


Re: Fast access to a random page of the search results.

2005-02-28 Thread Erik Hatcher
On Feb 28, 2005, at 6:00 AM, Stanislav Jordanov wrote:
my private investigation has already left me skeptical about the outcome 
of this issue,
but I've decided to post it as a last resort.
What did you do in your private investigation?
Suppose I have an index of about 5,000,000 docs
and I am running single-term queries against it, including queries
which return, say, 1,000,000 or even more hits.

The hits are sorted by some column and I am happy with the query
execution time (i.e. the time spent in the IndexSearcher.search(...)
method).
Now comes the problem: it is a product requirement that the client is
allowed to quickly access (by scrolling) a random page of the result
set.
Put differently, the app must quickly (in less than a second) respond
to requests like: "Give me the results from No 567100 to No 567200"
(remember, the results are sorted and thus ordered).
Sorted by descending relevance (the default), or in some other way?
If a search is fast enough, as you report, then you can simply start 
your access to Hits at the appropriate spot.  For the current systems 
I'm working on, this is the approach I've used - start iterating hits 
at (pageNumber - 1) * numberOfItemsPerPage.

Is that approach insufficient?
Erik

I took a look at Lucene's internals, which only left me with the
suspicion that this is an impossible task.
Would anyone, please, prove my suspicion wrong?

Regards
Stanislav



Re: Boost doesn't works

2005-02-28 Thread Erik Hatcher
Use the IndexSearcher.explain() feature to look at how Lucene is 
calculating the score.
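
For instance (a sketch; searcher, query, and hits as in any ordinary
search):

import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class ExplainScores {
    // Print Lucene's scoring breakdown, boosts included, for each hit.
    public static void explainHits(IndexSearcher searcher, Query query,
                                   Hits hits) throws Exception {
        for (int i = 0; i < hits.length(); i++) {
            System.out.println(searcher.explain(query, hits.id(i)));
        }
    }
}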

Erik
On Feb 28, 2005, at 3:32 AM, Claude Libois wrote:
I use MultiFieldQueryParser (the search is only done on summary, title, 
and content) with a FilteredQuery.
Claude Libois
[EMAIL PROTECTED]
Technical associate - Unisys

- Original Message -
From: "Morus Walter" <[EMAIL PROTECTED]>
To: "Lucene Users List" 
Sent: Monday, February 28, 2005 9:28 AM
Subject: Re: Boost doesn't works

Claude Libois writes:
Hello. I'm using Lucene for an application and I want to boost the
title of my documents.
For that I use the setBoost method that is applied on the title field.
However, when I look with Luke (1.6) I don't see any boost on this
field, and when I do a search the score isn't changed. What's wrong?
How do you search?
I guess you cannot see a change unless you combine searches in 
different
fields, since scores are normalized.

Morus


Re: Sorting date stored in milliseconds time

2005-02-26 Thread Erik Hatcher
Just an idea off the top of my head: you could create a custom sort, 
or alternatively you could store the date as separate fields such as 
"year", "month", "day", and "time", and provide a multi-field sort.

Erik
On Feb 25, 2005, at 11:36 PM, Ben wrote:
Hi
I store my date in milliseconds; how can I do a sort on it? SortField
has INT, FLOAT and STRING. Do I need to create a new sort class to
sort the long value?
Thanks
Ben


Re: help with boolean expression

2005-02-25 Thread Erik Hatcher
On Feb 25, 2005, at 4:19 PM, Omar Didi wrote:
I have a problem understanding how Lucene would interpret this boolean 
expression: A AND B OR C.
It returns neither the same count as when I enter (A AND B) OR C nor as 
A AND (B OR C).
If anyone knows how it is interpreted, I would be thankful.
Output the toString() of the returned Query instances to see how 
QueryParser interpreted things.

Erik


Re: sorted search

2005-02-24 Thread Erik Hatcher
Sorting by String uses up lots more RAM than a numeric sort.  If you 
use a numeric (yet lexicographically orderable) date format (e.g. 
YYYYMMDD) you'll most likely see better performance.
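
For example (a sketch; the field name matches the query below, the rest
is an assumption):

import java.text.SimpleDateFormat;
import java.util.Date;

import org.apache.lucene.document.Field;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;

public class DateSorting {
    private static final SimpleDateFormat YYYYMMDD =
            new SimpleDateFormat("yyyyMMdd");

    // Index the date as an 8-digit number: orderable either way.
    public static Field modifiedField(Date date) {
        return Field.Keyword("modified", YYYYMMDD.format(date));
    }

    // Sort the same field numerically, which needs far less RAM
    // than a STRING sort.
    public static Sort byModifiedDescending() {
        return new Sort(new SortField("modified", SortField.INT, true));
    }
}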

Erik
On Feb 24, 2005, at 1:01 PM, Yura Smolsky wrote:
Hello, lucene-user.
I have index with many documents, more than 40 Mil.
Each document has DateField (It is time stamp of document)
I need the most recent results only. I use single instance of 
IndexSearcher.
When I perform sorted search on this index:
  Sort sort = new Sort();
  sort.setSort(new SortField[] { new SortField("modified",
SortField.STRING, true) });
  Hits hits =
searcher.search(QueryParser.parse("good", "content",
  new StandardAnalyzer()), sort);

then search speed is not good.
Today I tried the search without "sort by modified", but with sort by
relevance. Speed was much better!
I think that sorting by DateField is very slow. Maybe I am doing something
wrong with this kind of sorted search? Can you give me advice about
this?
Thanks.
Yura Smolsky.



Re: Lucene in the Humanities

2005-02-22 Thread Erik Hatcher
On Feb 22, 2005, at 8:50 PM, Chris Hostetter wrote:
: >>> Just curious: it would seem easier to use multiple fields for the
: >>> original case and lowercase searching. Is there any particular 
reason
: >>> you analyzed the documents to multiple indexes instead of 
multiple
: >>> fields?
: >>
: >> I considered that approach, however to expose QueryParser I'd 
have to
: >> get tricky.  If I have title_orig and title_lc fields, how would I
: >> allow freeform queries of title:something?

Why have separate fields?
Why not index the title into the "title" field twice, once with each 
term
lowercased and once with the case left alone. (Using an analyzer that
tokenizes "The Quick BrOwN fox" as "[the] [quick] [brown] [fox] [The]
[Quick] [BrOwN] [fox]")

Then at search time, depending on the value of of the checkbox, 
construct
your QueryParser using the appropriate Analyzer.
I assume you mean to stack the tokens in the same positions, so it'd be 
like this:

[the]   [quick] [brown] [fox]
[The]   [Quick] [BrOwN] [fox]
Otherwise, if you simply string it together like what you show, then 
this phrase matches "fox The Quick", which is not in the original 
document.  Though putting in a large gap would do the trick in your 
example.
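
A sketch of a filter that does that stacking (Lucene 1.4 API; it assumes
the upstream tokenizer preserves the original case):

import java.io.IOException;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class CaseStackingFilter extends TokenFilter {
    private Token pending; // original-case token waiting to be emitted

    public CaseStackingFilter(TokenStream input) {
        super(input);
    }

    public Token next() throws IOException {
        if (pending != null) {
            Token t = pending;
            pending = null;
            return t;
        }
        Token token = input.next();
        if (token == null) return null;
        String text = token.termText();
        String lower = text.toLowerCase();
        if (lower.equals(text)) return token; // already lowercase: emit once
        Token lowered = new Token(lower, token.startOffset(), token.endOffset());
        pending = new Token(text, token.startOffset(), token.endOffset());
        pending.setPositionIncrement(0); // stack on the lowercased token
        return lowered;
    }
}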

There is a fiddly issue with this technique that I'm not quite seeing 
at the moment, but I'll brainstorm on it and hopefully remember it or 
perhaps be proven wrong.

I'm Lucene-brain-dead; I just did a presentation to our local Unix 
Users Group.  I built a man page indexer/searcher with PyLucene 
(thank you, Andi!).  I had to learn Python as well, which was a good 
exercise, and I learned lots from Andi's helpful private e-mails coaching 
me through my learning curve.  Now that I've seen the beast known as 
Python, I'm yearning for a Ruby version based on GCJ/SWIG.  A local 
Ruby guru and I are planning to meet for a few hours each week and 
take a stab at it.  I'll commit whatever we do directly to a /ruby 
directory in Subversion.

Here's an example of my PyLucene output:
$ mansearch.py interface section:5
remote - remote host description file
rtadvd.conf - config file for router advertisement daemon
ipnat - IP NAT file format
groff_out - groff intermediate output format
xinetd.conf - Extended Internet Services Daemon configuration file
plist - property list format
racoon.conf - configuration file for racoon
ssh_config - OpenSSH SSH client configuration files
sudoers - list of which users may execute what
Even with custom formatting:
$ mansearch.py --format=#filename interface section:5
/usr/share/man/man5/remote.5
/usr/share/man/man5/rtadvd.conf.5
/usr/share/man/man5/ipnat.5
/usr/share/man/man5/groff_out.5
/usr/share/man/man5/xinetd.conf.5
/usr/share/man/man5/plist.5
/usr/share/man/man5/racoon.conf.5
/usr/share/man/man5/ssh_config.5
/usr/share/man/man5/sudoers.5
suitable for xargs :)
Erik

The only problem I can think of would be inflated scores for terms that
are naturally lowercased, because they would wind up getting added to the
index twice, but based on what I've seen of the data you are working
with, I imagine that if you used UPPERCASE instead of lowercase you
could drastically reduce the likelihood of any problems with that.


-Hoss


Re: More Analyzer Question

2005-02-21 Thread Erik Hatcher
The problem is your KeywordSynonymAnalyzer is not truly a "keyword" 
analyzer in that it is tokenizing the field into parts.  So Document 1 
has [test] and [mario] as tokens that come from the LowerCaseTokenizer.

Look at Lucene's svn repository under contrib/analyzers and you'll see 
a KeywordTokenizer and corresponding KeywordAnalyzer you can use.
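
The essential difference is a tokenizer that never splits. A sketch of
the idea (the contrib KeywordTokenizer does this more carefully; the
SynonymFilter from the code below could then wrap the result):

import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

public class WholeFieldAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, final Reader reader) {
        // Emit the entire field value as one token, then lowercase it.
        TokenStream oneToken = new TokenStream() {
            private boolean done = false;

            public Token next() throws IOException {
                if (done) return null;
                done = true;
                StringBuffer sb = new StringBuffer();
                char[] buffer = new char[256];
                for (int n = reader.read(buffer); n != -1;
                        n = reader.read(buffer)) {
                    sb.append(buffer, 0, n);
                }
                return new Token(sb.toString(), 0, sb.length());
            }
        };
        return new LowerCaseFilter(oneToken);
    }
}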

Erik
On Feb 18, 2005, at 5:44 PM, Luke Shannon wrote:
I have created an Analyzer that I think should just convert to lower
case and add synonyms in the index (it is at the end of the email).
case and add synonyms in the index (it is at the end of the email).

The problem is, after running it I get one more result than I was 
expecting
(Document 1 should not be there):

Running testNameCombination1, expecting: 1 result
The query: +(type:138) +(name:mario*) returned 2
Start Listing documents:
Document: 0 contains:
Name: Text
Desc: Text
Document: 1 contains:
Name: Text
Desc: Text
End Listing documents
Those same 2 documents in Luke look like this:
Document 0
Text
Text
Document 1
Text
Text
That looks correct to me. The query shouldn't match Document 1.
The analzyer used on this field is below and is applied like so:
//set the default
PerFieldAnalyzerWrapper analyzer = new PerFieldAnalyzerWrapper(new
SynonymAnalyzer(new FBSynonymEngine()));
// the analyzer for the name field (only converts to lower case and adds
// synonyms)
analyzer.addAnalyzer("name", new KeywordSynonymAnalyzer(new
FBSynonymEngine()));
Any help would be appreciated.
Thanks,
Luke
import org.apache.lucene.analysis.*;
import java.io.Reader;

public class KeywordSynonymAnalyzer extends Analyzer {
    private SynonymEngine engine;

    public KeywordSynonymAnalyzer(SynonymEngine engine) {
        this.engine = engine;
    }

    public TokenStream tokenStream(String fieldName, Reader reader) {
        return new SynonymFilter(new LowerCaseTokenizer(reader), engine);
    }
}



Luke Shannon | Software Developer
FutureBrand Toronto
207 Queen's Quay, Suite 400
Toronto, ON, M5J 1A7
416 642 7935 (office)



Re: Using the highlighter from the sandbox with a prefix query.

2005-02-21 Thread Erik Hatcher
On Feb 21, 2005, at 10:53 AM, Michael Celona wrote:
That's the only stack I get.  One thing to mention: I am using a
MultiSearcher to rewrite the queries. I tried...
query = searcher_last.rewrite( query );
query = searcher_cur.rewrite( query );
using an IndexSearcher and I don't get an error... However, I'm not able to
highlight wildcard queries.
I use Highlighter for lucenebook.com and have two indexes that I search 
with MultiSearcher.  Here's how I highlight:

IndexReader reader = readers[indexIndex];
QueryScorer scorer = new QueryScorer(query.rewrite(reader));
SimpleHTMLFormatter formatter =
new SimpleHTMLFormatter("",
"");
Highlighter highlighter = new Highlighter(formatter, scorer);
I get the appropriate IndexReader for the document being highlighted.  
You can get the index _index_ this way:
int indexIndex = searcher.subSearcher(hits.id(position));

Hope this helps.
Erik


Re: Using the highlighter from the sandbox with a prefix query.

2005-02-21 Thread Erik Hatcher
On Feb 21, 2005, at 10:20 AM, Michael Celona wrote:
I am using
query = searcher.rewrite( query );
and it is throwing java.lang.UnsupportedOperationException .
Am I able to use the searcher rewrite method like this?
What's the full stack trace?
Erik


Fwd: lucene.apache.org problems again

2005-02-21 Thread Erik Hatcher
Looks like the issue has been resolved with the lucene.apache.org DNS.
Erik
Begin forwarded message:
From: Ask Bjørn Hansen <[EMAIL PROTECTED]>
Date: February 20, 2005 9:34:50 PM EST
To: "Noel J. Bergman" <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>, "Erik Hatcher" <[EMAIL PROTECTED]>
Subject: Re: lucene.apache.org problems again
On Feb 20, 2005, at 9:16 AM, Noel J. Bergman wrote:
The bitname.com name servers haven't updated.  Surnet and Hyperreal 
have
done so.  Checking the allowed list, I see:
Fixed.  The process that does that had become stuck.  I unstuck it and 
added an anti-stuck measure.

Thanks!
  - ask
--
http://www.askbjoernhansen.com/



JavaLobby Lucene presentation

2005-02-19 Thread Erik Hatcher
I recorded a "Meet Lucene" presentation at JavaLobby.  It is a  
multimedia Flash video that shows slides with my voice recorded over  
them, spanning just over 20 minutes (you can jump to specific  
slides).  Check it out here:

	http://www.javalobby.org/members-only/eps/meet-lucene/index.html?source=archives

It's tailored as a high-level overview, and a quick one at that.  It'll  
certainly be too basic for most everyone on this list, but maybe your  
manager would enjoy it :)

It's awkward to record this type of thing and it sounds dry to me as I  
ended up having to script what I was going to say and read it rather  
than ad-lib like I would do in a face-to-face presentation.  ah's and  
um's don't work well in an audio-only track.

I'd love to hear (perhaps best through the JavaLobby forum associated  
with the presentation) feedback on it.

Erik


Re: Lucene in the Humanities

2005-02-19 Thread Erik Hatcher
On Feb 19, 2005, at 3:52 AM, Paul Elschot wrote:
By lowercasing the querytext and searching in title_lc ?
Well sure, but how about this query:
title:Something AND anotherField:someOtherValue
QueryParser, as-is, won't be able to do field-name swapping.  I could
certainly apply that technique on all the structured queries that I
build up with the API, but with QueryParser it is trickier.   I'm
definitely open for suggestions on improving how case is handled.
Overriding this (1.4.3 QueryParser.jj, line 286) might work:
protected Query getFieldQuery(String field, String queryText)
throws ParseException { ... }
It will be called by the parser for both parts of the query above, so 
one
could change the field depending on the requested type of search
and the field name in the query.
But that wouldn't work for any other type of query, e.g. 
title:somethingFuzzy~

Though now that I think more about it, a simple s/title:/title_orig:/ 
before parsing would work, and of course make the default field 
dynamic.   I need to evaluate how many fields would need to be done 
this way - it'd be several.  Thanks for the food for thought!
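
For the record, a sketch of that override (the title_orig/title_lc names
are the hypothetical ones from this thread; as noted, fuzzy, wildcard,
and range queries go through other QueryParser methods and would need the
same treatment):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class CaseSwitchingQueryParser extends QueryParser {
    private boolean caseSensitive;

    public CaseSwitchingQueryParser(String defaultField, Analyzer analyzer,
                                    boolean caseSensitive) {
        super(defaultField, analyzer);
        this.caseSensitive = caseSensitive;
    }

    // Route "title" to the field that matches the requested mode.
    protected Query getFieldQuery(String field, String queryText)
            throws ParseException {
        if ("title".equals(field)) {
            field = caseSensitive ? "title_orig" : "title_lc";
        }
        return super.getFieldQuery(field, queryText);
    }
}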

only drawback now is that I'm duplicating indexes, but that is only an
issue in how long it takes to rebuild the index from scratch 
(currently
about 20 minutes or so on a good day - when the machine isn't 
swamped).
Once the users get the hang of this, you might end up having to 
quadruple
the index, or more.
Why would that be?   They want a case sensitive/insensitive switch.  
How would it expand beyond that?

Erik


Re: Lucene in the Humanities

2005-02-18 Thread Erik Hatcher
On Feb 18, 2005, at 6:37 PM, Paul Elschot wrote:
On Friday 18 February 2005 21:55, Erik Hatcher wrote:
On Feb 18, 2005, at 3:47 PM, Paul Elschot wrote:
Erik,
Just curious: it would seem easier to use multiple fields for the
original case and lowercase searching. Is there any particular reason
you analyzed the documents to multiple indexes instead of multiple
fields?
I considered that approach, however to expose QueryParser I'd have to
get tricky.  If I have title_orig and title_lc fields, how would I
allow freeform queries of title:something?
By lowercasing the querytext and searching in title_lc ?
Well sure, but how about this query:
title:Something AND anotherField:someOtherValue
QueryParser, as-is, won't be able to do field-name swapping.  I could 
certainly apply that technique on all the structured queries that I 
build up with the API, but with QueryParser it is trickier.   I'm 
definitely open for suggestions on improving how case is handled.  The 
only drawback now is that I'm duplicating indexes, but that is only an 
issue in how long it takes to rebuild the index from scratch (currently 
about 20 minutes or so on a good day - when the machine isn't swamped).

Erik


Re: Lucene in the Humanities

2005-02-18 Thread Erik Hatcher
On Feb 18, 2005, at 3:47 PM, Paul Elschot wrote:
Erik,
Just curious: it would seem easier to use multiple fields for the
original case and lowercase searching. Is there any particular reason
you analyzed the documents to multiple indexes instead of multiple 
fields?
I considered that approach, however to expose QueryParser I'd have to 
get tricky.  If I have title_orig and title_lc fields, how would I 
allow freeform queries of title:something?

Erik
p.s. It's fun to see the types of queries folks have already tried 
since I sent this e-mail (repeated queries are possibly someone 
paging):

INFO: Query = +title:dog +archivetype:rap : hits = 3
INFO: Query = +title:dog +archivetype:rap : hits = 3
INFO: Query = +title:dog +archivetype:rap : hits = 3
INFO: Query = rosetti : hits = 3
INFO: Query = +year:[ TO 1911] +(archivetype:radheader OR 
archivetype:rap) : hits = 2182
INFO: Query = advil : hits = 0
INFO: Query = test : hits = 24
INFO: Query = td : hits = 1
INFO: Query = td : hits = 1
INFO: Query = woman : hits = 363
INFO: Query = woman : hits = 363
INFO: Query = hello : hits = 0
INFO: Query = +rosetta +archivetype:rap : hits = 0
INFO: Query = +year:[ TO 1911] +(archivetype:radheader OR 
archivetype:rap) : hits = 2182
INFO: Query = poem : hits = 316
INFO: Query = crisis : hits = 7
INFO: Query = "crisis at every moment" : hits = 1
INFO: Query = toy : hits = 41
INFO: Query = title:echer : hits = 0
INFO: Query = senori : hits = 0
INFO: Query = +dear +sirs : hits = 11
INFO: Query = title:more : hits = 0
INFO: Query = more : hits = 365
INFO: Query = title:rossetti : hits = 329
INFO: Query = +blessed +damozel : hits = 103
INFO: Query = title:test : hits = 0
INFO: Query = +test +archivetype:radheader : hits = 3
INFO: Query = "crisis at every moment" : hits = 1
INFO: Query = rome : hits = 70
INFO: Query = fdshjkfjkhkfad : hits = 0
INFO: Query = stone : hits = 153
INFO: Query = +title:shakespeare +archivetype:radheader : hits = 1
INFO: Query = title:"xx i ll" : hits = 0
INFO: Query = +dog +cat : hits = 6
INFO: Query = +year:[1280 TO 1305] +archivetype:radheader : hits = 0
INFO: Query = guru : hits = 0
INFO: Query = philosophy : hits = 14
INFO: Query = title:install : hits = 0
INFO: Query = +title:install +archivetype:radheader : hits = 0
INFO: Query = "help freeform.html" : hits = 0
INFO: Query = "help freeform.html" : hits = 0
INFO: Query = install : hits = 1
INFO: Query = life : hits = 554
INFO: Query = life : hits = 554
INFO: Query = life : hits = 554




Re: Lucene in the Humanities

2005-02-18 Thread Erik Hatcher
On Feb 18, 2005, at 3:25 PM, Luke Shannon wrote:
Nice work, Erik. I would like to spend more time playing with it, but I  
saw a few things I really liked. When a specific query turns up no 
results you prompt the client to perform a free-form search. Less savvy 
search users will benefit from this strategy.
That's merely an artifact of all searches going to the results page,  
which just shows the free-form search on it.

I also like the display of information when
you select a result. Everything is at your fingertips without clutter.
For comparison, the older site's search is here:
http://jefferson.village.virginia.edu:2020/search.html
(don't bother trying it - it's SLOOOW)
And also for comparison, here is an older look:
	

Dig that URL!  The new look and URL is here:
http://www.rossettiarchive.org/docs/1-1847.s244.raw.html
I did get this error when a name search failed to turn up results and I
clicked 'help' in the free form search row (the second row).
 Page 'help-freeform.html' not found in application namespace.
I've corrected this and it'll be corrected in my next deployment :)
So nice to have a community of testers!  Thanks.
Erik


Re: Lucene in the Humanities

2005-02-18 Thread Erik Hatcher
And before too many replies happen on this thread, I've corrected the 
spelling mistake in the subject!  :O



Re: Lius

2005-02-18 Thread Erik Hatcher
Rida,
Please add your project to the Lucene PoweredBy page on the wiki.
Also - I moderated in your messages - so please subscribe to the list 
to send to it in the future.

Erik
On Feb 17, 2005, at 5:13 PM, Rida Benjelloun wrote:
Hi,
I've just released an indexing framework based on Lucene which is named 
LIUS.
LIUS is written in Java and it adds to Lucene indexing support for many 
file formats, such as: MS Word, MS Excel, MS PowerPoint, RTF, PDF, XML, 
HTML, TXT, the OpenOffice suite, and JavaBeans. The whole indexing 
process is driven by a configuration file.
You can visit these links for more information about LIUS; documentation 
is available in English and French:
www.bibl.ulaval.ca/lius/index.en.html
www.sourceforge.net/projects/lius



Lucene in the Humanties

2005-02-18 Thread Erik Hatcher
It's about time I actually did something real with Lucene  :)
I have been working with the Applied Research in Patacriticism group at 
the University of Virginia for a few months and finally ready to 
present what I've been doing.  The primary focus of my group is working 
with the Rossetti Archive - poems, artwork, interpretations, 
collections, and so on of Dante Gabriel Rossetti.  I was initially 
brought on to build a collection and exhibit system, though I got 
detoured a bit as I got involved in applying Lucene to the archive to 
replace their existing search system.  The existing system used an old 
version of Tamino with XPath queries.  Tamino is not at fault here, at 
least not entirely, because our data is in a very complicated set of 
XML files with a lot of non-normalized and legacy metadata - getting at 
things via XPath is challenging and practically impossible in many 
cases.

My work is now presentable at
http://www.rossettiarchive.org/rose
(rose is for ROssetti SEarch)
This system is implicitly designed for academics who are delving into 
Rossetti's work, so it may not be all that interesting for most of you. 
 Have fun and send me any interesting things you discover, especially 
any issues you may encounter.

Here are some numbers to give you a sense of what is going on 
underneath... There are currently 4,983 XML files, totaling about 110MB. 
 Without getting into a lot of details of the confusing domain, there 
are basically 3 types of XML files (works, pictures, and transcripts).  
It is important that  there be case-sensitive and case-insensitive 
searches.  To accomplish that, a custom analyzer is used in two 
different modes, one applying a LowerCaseFilter, and one not with the 
same documents written to two different indexes.  There is one 
particular type of XML file that gets indexed as two different types of 
documents (a specialized summary/header type).  In this first set of 
indexes, it is basically a one-to-one mapping of XML file to Lucene 
Document (with one type being indexed twice in different ways) - all 
said there are 5539 documents in each of the two main indexes.  The 
transcript type gets sliced into another set of original case and 
lowercased indexes with each document in that index representing a 
document division (a  element in the XML).  There are 12326 
documents in each of these -level indexes.   All said, the 4 
indexes built total about 3GB in size - I'm storing several fields in 
order to hit-highlight.  Only one of these indexes is being hit at a 
time - it depends on what parameters you use when querying for which 
index is used.

Lucene brought the search times into a usable, and impressive to the 
scholars, state.  The previous search solution often timed the browser 
out!  Search results now are in the milliseconds range.

The amount of data is tiny compared to most usages of Lucene, but 
things are getting interesting in other ways.   There has been little 
tuning in terms of ranking quality so far, but this is the next area of 
work.  There is one document type that is more important than the 
others, and it is being boosted during indexing.  There is now a 
growing interest in tinkering with all the new knobs and dials that are 
now possible.  Putting in similar and more-like-this features are 
desired and will be relatively straightforward to implement.  I'm 
currently using catch-all-aggregate-field technique for a default field 
for QueryParser searching.  Using a multi-field expansion is an area 
that is desirable instead though.  So, I've got my homework to do and 
catch up on all the goodness that has been mentioned in this list 
recently regarding all of these techniques.

An area where I'd like to solicit more help from the community relates 
to something akin to personalization.  The scholars would like to be 
able to tune results based on the role (such as "art historian") that 
is searching the site.  This would involve some type of training or 
continual learning process so that someone searching feeds back 
preferences implicitly for their queries by visiting the actual 
documents that are of interest.  Now that the scholars have seen what 
is possible (I showed them the cool SearchMorph comparison page 
searching Wikipedia for "rossetti"), they want more and more!

So - here's where I'm soliciting feedback - who's doing these types of 
things in the realm of Humanties?  Where should we go from here in 
terms of researching and applying the types of features dreamed about 
here?  How would you recommend implementing these types of features?

I'd be happy to share more about what I've done under the covers.  As 
you may be able to tell, the web UI is Tapestry for the search and 
results pages (though you won't be able to tell from the URL's you'll 
see :).  The UI was designed primarily by one of our very graphical/CSS 
savvy post doc research associates, and was designed with the research 
scholar in mind.  I continue to 

Re: reuse of TokenStream

2005-02-18 Thread Erik Hatcher
I'm confused on how you're reusing a TokenStream object.  General  
Lucene usage would not involve a developer dealing with it directly.   
Could you share an example of what you're up to?

I'm not sure if this is related, but a technique I'm using is to index  
the same Document instance into two different IndexWriter instances  
(each uses a different Analyzer) - and this is working fine.
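
A sketch of what I mean (analyzer choices and paths are illustrative
only):

import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class DualIndexer {
    public static void main(String[] args) throws Exception {
        IndexWriter lowercased =
                new IndexWriter("index-lc", new SimpleAnalyzer(), true);
        IndexWriter originalCase =
                new IndexWriter("index-orig", new WhitespaceAnalyzer(), true);

        // One Document instance, two indexes: each writer runs its own
        // analysis, so no TokenStream is ever shared between them.
        Document doc = new Document();
        doc.add(Field.Text("title", "The Blessed Damozel"));
        lowercased.addDocument(doc);
        originalCase.addDocument(doc);

        lowercased.close();
        originalCase.close();
    }
}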

Erik
On Feb 17, 2005, at 6:04 AM, Harald Kirsch wrote:
Hi,
is it thread safe to reuse the same TokenStream object for several
fields of a document or does the IndexWriter try to parallelise
tokenization of the fields of a single document?
Similar question: Is it safe to reuse the same TokenStream object for
several documents if I use IndexWriter.addDocument() in a loop?  Or
does addDocument only put the work into a queue where tasks are taken
out for parallel indexing by several threads?
  Thanks,
  Harald.
--
Harald Kirsch | [EMAIL PROTECTED] | +44 (0) 1223/49-2593
BioMed Information Extraction:  
http://www.ebi.ac.uk/Rebholz-srv/whatizit



Re: ParrellelMultiSearcher Question

2005-02-17 Thread Erik Hatcher
If you close a Searcher that goes through a RemoteSearchable, you'll 
close the remote index.  I learned this by experimentation for Lucene 
in Action and added a warning there:

http://www.lucenebook.com/search?query=RemoteSearchable+close
On Feb 17, 2005, at 8:27 PM, Youngho Cho wrote:
Hello,
Is there any pointer on how to close an index, and on how the server 
deals with index updates, when using ParallelMultiSearcher with the 
built-in RemoteSearchable?
Need your help.
Thanks,
Youngho
- Original Message -
From: "Youngho Cho" <[EMAIL PROTECTED]>
To: "Lucene Users List" 
Sent: Thursday, February 17, 2005 6:29 PM
Subject: ParrellelMultiSearcher Question

Hello,
I would like to use ParallelMultiSearcher with a few RemoteSearchables.
If one of the remote servers is down,
can I close() the ParallelMultiSearcher and
make a new ParallelMultiSearcher with the other live RemoteSearchables?
Thanks.
Youngho



Re: Query Question

2005-02-17 Thread Erik Hatcher
On Feb 17, 2005, at 5:51 PM, Luke Shannon wrote:
My manager is now totally stuck about being able to query data with * 
in it.
He's gonna have to wait a bit longer; you've got a slightly tricky 
situation on your hands.

WildcardQuery(new Term("name", "*home\**"));
The \* is the problem.  WildcardQuery doesn't deal with escaping like 
you're trying.  Your query is essentially this now:

home\*
Where backslash has no special meaning at all... you're literally 
looking for all terms that start with home followed by a backslash.  
Two asterisks at the end really collapse into a single one logically.

Any theories as to why it would not match:
Document (relevant fields):
Keyword
Keyword
Is the \ escaping both * characters?
So, again, no escaping is being done here.  You're a bit stuck in this 
situation because * (and ?) are special to WildcardQuery, and it does 
no escaping.  Two options I can think of:

	- Build your own clone of WildcardQuery that does escaping - or 
perhaps change the wildcard characters to something you do not index 
and use those instead.

	- Replace asterisks in the terms indexed with some other non-wildcard 
character, then replace it on your queries as appropriate.

Erik


Re: Query Question

2005-02-17 Thread Erik Hatcher
On Feb 17, 2005, at 2:44 PM, Luke Shannon wrote:
Hello;
Why won't this query find the document below?
Query:
+(type:203) +(name:*home\**)
Is that what the query toString is?  Or is that what you handed to 
QueryParser?

Depending on your analyzer, 203 may go away.  QueryParser doesn't 
support leading asterisks, so "*home" would fail to parse.

Document (relevant fields):
Keyword
Keyword
I was hoping by escaping the * it would be treated as a string. What 
am I
doing wrong?



Re: Storing info about the index in the index

2005-02-17 Thread Erik Hatcher
On Feb 17, 2005, at 8:43 AM, Sanyi wrote:
Hi!
Is there any way to store info about the index in the index?
(You know, like in .doc files on Windows. You can store title, author, 
etc...)
I need to store the last indexed database UID in the index and maybe 
some other useful infos too.
I don't want to store them separately in the database or in another 
file because of administrative
reasons.
There is currently no feature to store additional information in the 
index like this, though you could use a special document in the index 
to do this.
You could also keep a .properties or .xml file alongside the index.
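
A sketch of the special-document idea (the "meta" and "lastUID" field
names are made up; note the marker document will also match ordinary
queries unless you exclude it):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

public class IndexMetadata {
    // Store index-level info as a single, specially marked document.
    public static void write(IndexWriter writer, String lastUid)
            throws Exception {
        Document meta = new Document();
        meta.add(Field.Keyword("meta", "true")); // marker to find it again
        meta.add(Field.Keyword("lastUID", lastUid));
        writer.addDocument(meta);
    }

    // Fetch it back by the marker term.
    public static String readLastUid(IndexSearcher searcher) throws Exception {
        Hits hits = searcher.search(new TermQuery(new Term("meta", "true")));
        return hits.length() > 0 ? hits.doc(0).get("lastUID") : null;
    }
}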

Erik


Re: big index and multi threaded IndexSearcher

2005-02-16 Thread Erik Hatcher
Are you using multiple IndexSearcher instances?  Or only one and 
sharing it across multiple threads?

If using a single shared IndexSearcher instance doesn't help, it may be 
beneficial to port your code to Java and try it there.

I'm just now getting into PyLucene myself - building a demo for a Unix 
User's Group presentation I'm giving.

Erik
On Feb 16, 2005, at 3:04 PM, Yura Smolsky wrote:
Hello.
I use PyLucene, python port of Lucene.
I have a problem using a big index (50GB) with IndexSearcher
from many threads.
I use IndexSearcher from PyLucene's PythonThread. It's really a wrapper
around a Java/libgcj thread that Python is tricked into thinking
is one of its own.
The core of problem:
When I have many threads (more than 5) I receive this exception:
  File "/usr/lib/python2.4/site-packages/PyLucene.py", line 2241, in 
search
def search(*args): return _PyLucene.Searcher_search(*args)
ValueError: java.lang.OutOfMemoryError

When I decrease the number of threads to 3 or even 1, the search works.
How can having many threads lead to this exception?
I have 2GB of memory, so with one thread the process takes
about 1200-1300MB.
Andi Vajda suggested that "There may be overhead involved in having
multiple threads against a given index."
Does anyone here have experience in handling big indexes with many
threads?
Any ideas are appreciated.
Yura Smolsky.



Re: Fieldinformation from Index

2005-02-15 Thread Erik Hatcher
On Feb 15, 2005, at 11:45 AM, Karl Koch wrote:
2) I need to know which Analyzer was used to index a field. One important
rule, as we all know, is to use the same analyzer for indexing and 
searching a field. Is this information stored in the index, or is it 
fully the responsibility of the application developer?
The analyzer is not stored in the index, nor its name.  I believe this 
was discussed in the past, though.

It's not a rule that the same analyzer be used for both indexing and 
searching, and there are cases where it makes sense to use different 
ones.  The analyzers must be compatible though.

Erik


Re: Numbers in Index

2005-02-14 Thread Erik Hatcher
On Feb 14, 2005, at 4:32 PM, Miro Max wrote:
Actually I'm using StandardAnalyzer during my indexing
process, but when I browse the index with Luke there are
also numbers inside.
Which analyzer should I use to eliminate these from my
index, or should I specify them in my stopword list?
Don't use a stop word list to remove numbers.  You could do a couple of 
things: use SimpleAnalyzer, or write a custom analyzer which uses 
the parts of StandardAnalyzer and applies a number-removal filter at 
the end.

Erik


Re: Newbie questions

2005-02-14 Thread Erik Hatcher
On Feb 14, 2005, at 2:40 PM, Paul Jans wrote:
Hi again,
So is SqlDirectory recommended for use in a cluster to
work around the accessibility problem, or are people
using NFS or a standalone server instead?
Neither.  As far as I know, Berkeley DB is the only viable DB 
implementation currently.

NFS has notoriously had issues with Lucene and file locking.  Search 
the archives for more details on this.

Erik


Re: Are wildcard searches supposed to work with fields that are saved, indexed and not tokenized?

2005-02-14 Thread Erik Hatcher
On Feb 14, 2005, at 12:40 PM, Jim Lynch wrote:
I was trying to write some documentation on how to use the tool and 
issued a search for:

contact:DENNIS MORROW
Is that literally the QueryParser string you entered?  If so, that 
parses to:

contact:DENNIS OR defaultField:MORROW
most likely.
And now I get 648 hits, but in some of them the contact doesn't even 
remotely resemble the search pattern.  For instance here are the what 
the contact fields contain for some of these hits:
Contact: GENERIC CONTACT
Contact: Andre Gardinalli
Contact: Brett Morrow  (that's especially interesting)
Contact: KEN PATTERSON

And of course there are some with Dennis' name too.
Any idea why this is happening?  I'm using the QueryParser.parse 
method.
I'm not sure you'll be able to do this with QueryParser with spaces in 
an untokenized field.  First try it with an API-created WildcardQuery 
to be sure it works the way you expect.
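
Something like this (a sketch; the index path is a placeholder):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.WildcardQuery;

public class ContactWildcardTest {
    public static void main(String[] args) throws Exception {
        IndexSearcher searcher = new IndexSearcher("/path/to/index");
        // Bypass QueryParser entirely: one term, spaces and all, since
        // the field was indexed as a single untokenized Keyword value.
        WildcardQuery query =
                new WildcardQuery(new Term("contact", "DENNIS MORROW*"));
        Hits hits = searcher.search(query);
        System.out.println(hits.length() + " hits");
        searcher.close();
    }
}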

Erik


Re: What does [] do to a query and what's up with lucene.apache.org?

2005-02-14 Thread Erik Hatcher
Jim,
The Lucene website is transitioning to the new top-level space.  I have
checked out the current site to the new lucene.apache.org area and set
up redirects from the old Jakarta URLs.  The source code, though, is
not an official part of the website.  Thanks to our conversion to  
Subversion, though, the source is browsable starting here:

http://svn.apache.org/repos/asf/lucene/java/trunk
The HTML of the website will need link adjustments to get everything  
back in shape.

The brackets are documented here:  
http://lucene.apache.org/queryparsersyntax.html
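
In short, the brackets are range-query syntax, which is why "[this is a
test]" fails to parse (it lacks the TO keyword). A sketch (the field
names are illustrative):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class BracketSyntax {
    public static void main(String[] args) throws Exception {
        // [] is an inclusive range, {} an exclusive one.
        Query q = QueryParser.parse("modified:[20040101 TO 20041231]",
                "contents", new StandardAnalyzer());
        System.out.println(q); // prints the resulting RangeQuery
    }
}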

Erik
On Feb 14, 2005, at 10:31 AM, Jim Lynch wrote:
First I'm getting a

   The requested URL could not be retrieved

While trying to retrieve the URL:
http://lucene.apache.org/src/test/org/apache/lucene/queryParser/TestQueryParser.java

The following error was encountered:
   Unable to determine IP address from host name for lucene.apache.org

Guess the system is down.
I'm getting this error:
org.apache.lucene.queryParser.ParseException: Encountered "is" at line  
1, column 15.
Was expecting:
   "]" ...
when I tried to parse the following string "[this is a test]".

I can't find any documentation that tells me what the brackets do to a  
query.  I had a user that was used to another search engine that used  
[] to do proximity or near searches and tried it on this one. Actually  
I'd like to see the documentation for what the parser does.  All that  
is mentioned in the javadoc is + - and ().  Obviously there are more  
special characters.

Thanks,
Jim.
Jim.


Re: DateFilter on UnStored field

2005-02-14 Thread Erik Hatcher
On Feb 14, 2005, at 6:27 AM, Sanyi wrote:
However, DateFilter will not work on fields indexed as "2004-11-05".
DateFilter only works on fields that were indexed using the DateField.
Well, can you post here a short example?
When I currently type "xxx.UnStored(...", can I simply type 
"xxx.DateField(..." instead?
Does it take strings like "2004-11-05"?
DateField has a utility method to return a String:
DateField.timeToString(file.lastModified())
You'd use that String to pass to Field.UnStored.
I recommend, though, that you use a different format, such as the 
YYYY-MM-DD format you're using.

One option is to use a QueryFilter instead, filtering with a
RangeQuery.
I've read somewhere that classic range filtering can easily exceed the 
maximum number of boolean
query clauses. I need to filter a very large range of dates with day 
accuracy and I don't want to
increase the max. clause count to very high values. So, I decided to 
use DateFilter which has no
such problems AFAIK.
Right!
Lucene's latest codebase (though not 1.4.x) includes RangeFilter, 
which would do the trick for you.  If you want to stick with Lucene 
1.4.x, that's fine... just grab the code for that filter and use it as 
a custom filter - it's compatible with 1.4.x.
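
A sketch of using it, assuming RangeFilter's (field, lower, upper,
includeLower, includeUpper) constructor and "yyyy-MM-dd" date strings:

import org.apache.lucene.search.Filter;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.RangeFilter;

public class DateRangeSearch {
    // RangeFilter walks the term index directly, so it never hits the
    // BooleanQuery clause limit the way an expanded RangeQuery can.
    public static Hits searchYear(IndexSearcher searcher, Query query)
            throws Exception {
        Filter byDate = new RangeFilter("date", "2004-01-01", "2004-12-31",
                true, true); // both endpoints inclusive
        return searcher.search(query, byDate);
    }
}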

How much impact does DateFilter have on search times?
It depends on whether you instantiate a new filter for each search.  
Building a filter requires scanning through the terms in the index to 
build a BitSet for the documents that fall in that range.  Filters are 
best used over multiple searches.

Erik


Re: DateFilter on UnStored field

2005-02-13 Thread Erik Hatcher
Following up on PA's reply.  Yes, DateFilter works on *indexed* values, 
so whether a field is stored or not is irrelevant.

However, DateFilter will not work on fields indexed as "2004-11-05".  
DateFilter only works on fields that were indexed using the DateField.  
One option is to use a QueryFilter instead, filtering with a 
RangeQuery.

Erik
On Feb 13, 2005, at 7:09 AM, Sanyi wrote:
Hi!
Does DateFilter work on fields indexed as UnStored?
Can I filter an UnStored field with values like "2004-11-05" ?
Regards,
Sanyi



Re: Multiple Keywords/Keyphrases fields

2005-02-12 Thread Erik Hatcher
The real question to answer is what types of queries you're planning on 
making.  Rather than look at it from indexing forward, consider it from 
searching backwards.

How will users query using those keyword phrases?
Erik
On Feb 12, 2005, at 3:08 PM, Owen Densmore wrote:
I'm getting a bit more serious about the final form of our lucene 
index.  Each document has DocNumber, Authors, Title, Abstract, and 
Keywords.  By Keywords, I mean a comma separated list, each entry 
having possibly many terms in a phrase like:
	temporal infomax, finite state automata, Markov chains,
	conditional entropy, neural information processing

I presume I should be using a field "Keywords" which have many 
"entries" or "instances" per document (one per comma separated 
phrase).  But I'm not sure the right way to handle all this.  My 
assumption is that I should analyze them individually, just as we do 
for free text (the Abstract, for example), thus in the example above 
having 5 entries of the nature
	doc.add(Field.Text("Keywords", "finite state automata"));
etc, analyzing them because these are author-supplied strings with no 
canonical form.

For guidance, I looked in the archive and found the attached email, 
but I didn't see the answer.  (I'm not concerned about the dups; I 
presume that is equivalent to a boost of some sort.)  Does this seem 
right?

Thanks once again.
Owen
From: [EMAIL PROTECTED] <[EMAIL PROTECTED]>
Subject: Multiple equal Fields?
Date: Tue, 17 Feb 2004 12:47:58 +0100
Hi!
What happens if I do this:
doc.add(Field.Text("foo", "bar"));
doc.add(Field.Text("foo", "blah"));
Is there a field "foo" with value "blah" or are there two "foo"s 
(actually not
possible) or is there one "foo" with the values "bar" and "blah"?

And what does happen in this case:
doc.add(Field.Text("foo", "bar"));
doc.add(Field.Text("foo", "bar"));
doc.add(Field.Text("foo", "bar"));
Does lucene store this only once?
Timo



Re: Multiple Fields with same name

2005-02-11 Thread Erik Hatcher
On Feb 11, 2005, at 3:51 PM, Ramon Aseniero wrote:
I have not tried it -- are there examples in the Lucene book? (I just 
bought the book and can't find anything related to my problem.)
No, this particular item is not covered in the book.  My initial 
response was a succinct way of making a point.  A lot of times it is 
worth investing in giving something a try with a little bit of code, 
and doing this with Lucene is trivial.  I don't want to discourage 
anyone from asking questions, but rather encourage us all to do a 
little tinkering to find out things for ourselves and then ask if our 
assumptions don't come out as expected.

In fact, I'd have to mock up an example to find out myself for sure, 
but my hunch is that Lucene would maintain the order as it probably 
doesn't make sense algorithmically to do anything but keep the order.

Erik

Thanks,
Ramon
-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Friday, February 11, 2005 7:34 AM
To: Lucene Users List
Subject: Re: Multiple Fields with same name
On Feb 10, 2005, at 11:48 PM, Ramon Aseniero wrote:
If I store multiple fields with the same name, for example "Author" with 
3 values "bob", "jane", "bill", once I retrieve the doc are the values in 
the same order?
Did you try it?  :)
Erik


Re: Newbie questions

2005-02-11 Thread Erik Hatcher
On Feb 11, 2005, at 1:36 PM, Erik Hatcher wrote:
Find me all users with (a CS degree and a GPA > 3.0)
or (a Math degree and a GPA > 3.5).
Some suggestions:  index degree as a Keyword field.  Pad GPA, so that 
all of them are of the form #.# (or #.## maybe).  Numerics need to be 
lexicographically ordered, and thus padded.

With the right analyzer (see the AnalysisParalysis page on the wiki) 
you could use this type of query with QueryParser:

	degree:cs AND gpa:[3.0 TO 9.9]
oops, to be completely technically correct, use curly brackets to get > 
rather than >=

degree:cs AND gpa:{3.0 TO 9.9}
(I'll assume GPA's only go to 4.0 or 5.0 :)
Erik


Re: Newbie questions

2005-02-11 Thread Erik Hatcher
On Feb 10, 2005, at 5:00 PM, Paul Jans wrote:
A couple of newbie questions. I've searched the
archives and read the Javadoc but I'm still having
trouble figuring these out.
Don't forget to get your copy of "Lucene in Action" too :)
1. What's the best way to index and handle queries
like the following:
Find me all users with (a CS degree and a GPA > 3.0)
or (a Math degree and a GPA > 3.5).
Some suggestions:  index degree as a Keyword field.  Pad GPA, so that 
all of them are of the form #.# (or #.## maybe).  Numerics need to be 
lexicographically ordered, and thus padded.

With the right analyzer (see the AnalysisParalysis page on the wiki) 
you could use this type of query with QueryParser:

degree:cs AND gpa:[3.0 TO 9.9]
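
For instance, the indexing side might look like this (a sketch; field
names match the query above, the padding format is an assumption):

import java.text.DecimalFormat;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class UserDocuments {
    private static final DecimalFormat GPA_FORMAT = new DecimalFormat("0.00");

    // Keyword fields are untokenized: the degree and the padded GPA are
    // stored verbatim, so range queries order them lexicographically.
    public static Document forUser(String degree, double gpa) {
        Document doc = new Document();
        doc.add(Field.Keyword("degree", degree.toLowerCase()));
        doc.add(Field.Keyword("gpa", GPA_FORMAT.format(gpa))); // 3.5 -> "3.50"
        return doc;
    }
}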
2. What are the best practices for using Lucene in a
clustered J2EE environment? A standalone index/search
server or storing the index in the database or
something else ?
There is a LuceneRAR project that is still in its infancy here: 
https://lucenerar.dev.java.net/

You can also store a Lucene index in Berkeley DB (look at the 
/contrib/db area of the source code repository)

However, most projects do fine with "cruder" techniques such as sharing 
the Lucene index on a common drive and ensuring that locking is 
configured to use the common drive also.

Erik


Re: Multiple Fields with same name

2005-02-11 Thread Erik Hatcher
On Feb 10, 2005, at 11:48 PM, Ramon Aseniero wrote:
If I store multiple fields with the same name, for example "Author" with 
3 values "bob", "jane", "bill", once I retrieve the doc are the values in 
the same order?
Did you try it?  :)
Erik


Re: Negative Match

2005-02-11 Thread Erik Hatcher
On Feb 11, 2005, at 9:52 AM, Luke Shannon wrote:
Hey Erik;
The problem with that approach is I get documents that don't have a
kcfileupload field. This makes sense because those documents don't match 
the prohibited clause, but it doesn't fit the requirements of the system.
Ok, so instead of using the dummy field with a single dummy value, use 
a dummy field to list the field names.  
Field.Keyword("fields","kcfileupload"), but only for the documents that 
should have it, of course.  Then use a query like (using QueryParser 
syntax, but do it with the API as you have since QueryParser doesn't 
support leading wildcards):

+fields:kcfileupload -kcfileupload:*jpg*
Again, your approach is risky with term expansion.  Get more than 1,024 
unique kcfileupload values and you'll see!

Erik
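
A sketch of that query built through the 1.4-era API rather than QueryParser (field names from the thread; note the wildcard clause still expands at search time, so the 1,024-clause caveat above still applies):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.WildcardQuery;

    public class NegativeMatchQuery {
        public static BooleanQuery build() {
            BooleanQuery query = new BooleanQuery();
            // required: only documents that declare a kcfileupload field
            query.add(new TermQuery(new Term("fields", "kcfileupload")), true, false);
            // prohibited: exclude any jpg value
            query.add(new WildcardQuery(new Term("kcfileupload", "*jpg*")), false, true);
            return query;
        }
    }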

What I like best about this approach is it doesn't require a filter. 
The
system I integrate with is presently designed to accept a query 
object. I
wasn't looking forward to having to add the possibility that queries 
might
require filters. I may have to still do this, but for now I would like 
to
try this and see how it goes.

Thanks,
Luke
- Original Message -
From: "Erik Hatcher" <[EMAIL PROTECTED]>
To: "Lucene Users List" 
Sent: Thursday, February 10, 2005 7:23 PM
Subject: Re: Negative Match

On Feb 10, 2005, at 4:06 PM, Luke Shannon wrote:
I think I found a pretty good way to do a negative match.
In this query I am looking for all the Documents that have a
kcfileupload
field with any value except for jpg.
Query negativeMatch = new WildcardQuery(new Term("kcfileupload", "*jpg*"));
BooleanQuery typeNegAll = new BooleanQuery();
Query allResults = new WildcardQuery(new Term("kcfileupload", "*"));
IndexSearcher searcher = new IndexSearcher(fsDir);
BooleanClause clause = new BooleanClause(negativeMatch, false, true);
typeNegAll.add(allResults, true, false);
typeNegAll.add(clause);
Hits hits = searcher.search(typeNegAll);

With the little testing I have done this *seems* to work. Does anyone
see a
problem with this approach?
Sure... do you realize what WildcardQuery does under the covers?  It
literally expands to a BooleanQuery for all terms that match the
pattern.  There is an adjustable built-in limit of 1,024 clauses for
BooleanQuery.  You obviously have not hit that limit ... yet!
You're better off using the advice offered on this thread
previously: create a single dummy field with a fixed value for all
documents.  Combine a TermQuery for that dummy value with a prohibited
clause like your negativeMatch above.
Erik
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Negative Match

2005-02-10 Thread Erik Hatcher
On Feb 10, 2005, at 4:06 PM, Luke Shannon wrote:
I think I found a pretty good way to do a negative match.
In this query I am looking for all the Documents that have a 
kcfileupload
field with any value except for jpg.

Query negativeMatch = new WildcardQuery(new Term("kcfileupload", "*jpg*"));
BooleanQuery typeNegAll = new BooleanQuery();
Query allResults = new WildcardQuery(new Term("kcfileupload", "*"));
IndexSearcher searcher = new IndexSearcher(fsDir);
BooleanClause clause = new BooleanClause(negativeMatch, false, true);
typeNegAll.add(allResults, true, false);
typeNegAll.add(clause);
Hits hits = searcher.search(typeNegAll);

With the little testing I have done this *seems* to work. Does anyone 
see a
problem with this approach?
Sure... do you realize what WildcardQuery does under the covers?  It 
literally expands to a BooleanQuery for all terms that match the 
pattern.  There is an adjustable built-in limit of 1,024 clauses for 
BooleanQuery.  You obviously have not hit that limit ... yet!

You're better off using the advice offered on this thread 
previously: create a single dummy field with a fixed value for all 
documents.  Combine a TermQuery for that dummy value with a prohibited 
clause like your negativeMatch above.

Erik
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: wildcards, stemming and searching

2005-02-10 Thread Erik Hatcher
How would you deal with a query like "a*z" though?
I suspect, however, that you only care about suffix queries and 
stemming those.  If that's the case, then you could subclass 
getWildcardQuery and do internal stemming (remove the trailing wildcard, 
run it through the analyzer directly there, and return a modified 
WildcardQuery instance).

With wildcard queries though, this is risky.  Prefixes won't 
necessarily stem to what the full word would stem to.

Erik
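
A rough sketch of that subclassing idea against the 1.4-era QueryParser (class name illustrative; as noted above, stemming a prefix is only approximate):

    import java.io.IOException;
    import java.io.StringReader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.queryParser.ParseException;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.WildcardQuery;

    public class StemmingQueryParser extends QueryParser {
        private Analyzer stemmingAnalyzer;

        public StemmingQueryParser(String field, Analyzer analyzer) {
            super(field, analyzer);
            this.stemmingAnalyzer = analyzer;
        }

        protected Query getWildcardQuery(String field, String termStr)
                throws ParseException {
            // only handle simple suffix-wildcard terms like "united*"
            if (termStr.endsWith("*")
                    && termStr.indexOf('*') == termStr.length() - 1) {
                String prefix = termStr.substring(0, termStr.length() - 1);
                try {
                    TokenStream stream =
                        stemmingAnalyzer.tokenStream(field, new StringReader(prefix));
                    Token token = stream.next();
                    if (token != null) {
                        // stem the prefix, then reattach the wildcard
                        return new WildcardQuery(
                            new Term(field, token.termText() + "*"));
                    }
                } catch (IOException ignored) {
                    // a StringReader should not throw
                }
            }
            return super.getWildcardQuery(field, termStr);
        }
    }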
On Feb 9, 2005, at 6:26 PM, aaz wrote:
Hi,
We are not using QueryParser and have some custom Query construction.
We have an index that indexes various documents. Each document is 
Analyzed and indexed via

StandardTokenizer() ->StandardFilter() -> LowercaseFilter() -> 
StopFilter() -> PorterStemFilter()

We also want to support wildcard queries, hence on an inbound query we 
need to deal with "*" in the value side of the comparison. We also 
need to "analyze" the value side of the query against the same 
analyzer in which the index was built with. This leads to some 
problems and would like your solution opinion.

User queries.
somefield = united*
After the analyzer hits "united*", we get back "unit". Hence we cannot 
detect that the user requested a wildcard.

Lets say we come up with some solution to "escape" the "*" char before 
the Analyzer hits it. For example

somefield = united*  -> unitedXXWILDCARDXX
After analysis this then becomes "unitedxxwildcardxx", which we can 
then turn into a WildcardQuery "united*"

The problem here is that the term "united" will never exist in the 
index due to the stemming, which did not happen properly because of our 
escape mechanism.

How can I solve this problem?

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Configurable indexing of an RDBMS, has it been done before?

2005-02-09 Thread Erik Hatcher
On Feb 9, 2005, at 4:51 AM, mark harwood wrote:
A GUI plugin for Squirrel SQL (
http://squirrel-sql.sourceforge.net/) would make a
great way of configuring the mapping.
That would be slick!
1) Should we build this mapper into Luke instead? We
would have to lift a LOT of the DB handling "smarts"
from Squirrel. Luke however is doing a lot with
Analyzer configuration which would certainly be useful
code in any mapping tool (can we lift those and use in
Squirrel?).
The dilemma with Luke is that it's not ASL'd (because of the Thinlet 
integration).  Anyone up for a Swing conversion project?  :)

It would be quite cool if Lucene had a built-in UI tool (like, or 
actually, Luke).  Luke itself is ASL'd and I believe Andrzej has said 
he'd gladly donate it to Lucene's codebase, but the Thinlet LGPL is an 
issue.

2) What should the XML for the batch-driven
configuration look like? Is it ANT tasks or a custom
framework?
Don't concern yourselves with Ant at the moment.  Anything that is 
easily callable from Java can be made into an Ant task.  In fact, the 
minimum requirements for an Ant task is a "public void execute()" 
method.  Whatever Java infrastructure you come up with, I'll gladly 
create the Ant task wrapper for it when it's ready.

3) If our mapping understands the make-up of the rdbms
and the Lucene index should we introduce a
higher-level software layer for searching which sits
over the rdbms and Lucene and abstracts them to some
extent? This layer would know where to go to retrieve
field values or construct filters ie understands
whether to retrieve "title" field for display from
database column or a Lucene "stored" field and whether
the "price< $100" search criteria is resolved by a
lucene query or an RDBMS-query to produce a Lucene
filter. It seems like currently, every DB+Lucene
integration project struggles with designing a
solution to manage this divide and handcodes the
solution.
Wow... that is getting pretty clever.  I like it!
I don't personally have a need for relational database indexing, but I 
support this effort to make a generalized mapping facility.

Erik
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: sounds like spellcheck

2005-02-09 Thread Erik Hatcher
On Feb 9, 2005, at 7:23 AM, Aad Nales wrote:
In my Clipper days I could build an index on English words using a 
technique that was called soundex. Searching in that index resulted in 
hits of words that sounded the same. From what I remember this 
technique only worked for English. Has it ever been generalized?
I do not know how Soundex/Metaphone/Double Metaphone work with 
non-English languages, but these algorithms are in Jakarta Commons 
Codec.  I used the Metaphone algorithm as a custom analyzer example in 
Lucene in Action.  You'll see it in the source code distribution under 
src/lia/analysis/codec.  I did a couple of variations, one that adds 
the metaphoned version as a token in the same position and one that 
simply replaces it in the token stream.
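
A sketch in the spirit of the replacing variation (not the book's actual code), built on Commons Codec's Metaphone:

    import java.io.IOException;
    import org.apache.commons.codec.language.Metaphone;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;

    public class MetaphoneReplacementFilter extends TokenFilter {
        private Metaphone metaphoner = new Metaphone();

        public MetaphoneReplacementFilter(TokenStream in) {
            super(in);
        }

        public Token next() throws IOException {
            Token token = input.next();
            if (token == null) {
                return null;
            }
            // swap the token text for its metaphone encoding, keeping
            // the original offsets and type
            return new Token(metaphoner.metaphone(token.termText()),
                    token.startOffset(), token.endOffset(), token.type());
        }
    }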

I even envisioned this sounds-like feature being used for children.  I 
was mulling over this idea while having lunch with my son one day last 
spring (he was 5 at the time).  I asked him how to spell "cool cat" and 
he replied "c-o-l c-a-t".  I tried it out with the metaphone algorithm 
and it matches!

http://www.lucenebook.com/search?query=cool+cat
Erik

What I am trying to solve is this. A customer is looking for a 
solution to spelling mistakes made by children (up to 10) when typing 
in queries. The site is Dutch. Common mistakes are 'sgool' when 
searching for 'school'. The 'normal' spellcheckers and suggesters 
typically generate a list where the 'sounds like' candidates are too 
far away from the result. So what I am thinking about doing is this:

1. create a parser that takes a word and creates a sound-index entry.
2. create a list of 'correctly' spelled words, either based on the index 
of the website or on some kind of dictionary.
2a. perhaps create an n-gram index based on these words

3. accept a query, figure out that a spelling mistake has been made
3a. find alternatives by parsing the query and searching the 'sounds 
like' index, then calculate and order the results

Steps 2 and 3 have been discussed at length in this forum and have 
even made it to the sandbox. What I am left with is 1.

My thinking is to process a series of replacement statements that go 
like:
--
g sounds like ch if the immediate predecessor is an s.
o sounds like oo if the immediate predecessor is a consonant
--

But before I take this to the next step, I am wondering if anybody has 
created or thought up alternative solutions?

Cheers,
Aad



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Problem searching Field.Keyword field

2005-02-09 Thread Erik Hatcher
The only caveat to your VerbatimAnalyzer is that it will still split 
strings that are over 255 characters.  CharTokenizer does that.  
Granted, though, that keyword fields probably don't make much sense to 
be that long.

As mentioned yesterday - I added the LIA KeywordAnalyzer into the 
contrib area of Subversion.  I had built one like you had also, but the 
one I contributed reads the entire input stream into a StringBuffer 
ensuring it does not get split like CharTokenizer would.

Erik
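
The contributed class isn't reproduced in this thread; a minimal sketch of the read-everything approach it takes might look like this:

    import java.io.IOException;
    import java.io.Reader;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.Tokenizer;

    // sketch only - the real KeywordAnalyzer lives in Lucene's contrib area
    public class WholeInputTokenizer extends Tokenizer {
        private boolean done = false;

        public WholeInputTokenizer(Reader input) {
            super(input);
        }

        public Token next() throws IOException {
            if (done) {
                return null;
            }
            done = true;
            // buffer the entire input so the token never splits at 255 chars
            StringBuffer text = new StringBuffer();
            char[] buffer = new char[256];
            int length;
            while ((length = input.read(buffer)) != -1) {
                text.append(buffer, 0, length);
            }
            return new Token(text.toString(), 0, text.length());
        }
    }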
On Feb 9, 2005, at 4:40 AM, Miles Barr wrote:
On Tue, 2005-02-08 at 12:19 -0500, Steven Rowe wrote:
Why is there no KeywordAnalyzer?  That is, an analyzer which doesn't
mess with its input in any way, but just returns it as-is?
I realize that under most circumstances, it would probably be more 
code
to use it than just constructing a TermQuery, but having it would
regularize query handling, and simplify new users' experience.  And 
for
the purposes of the PerFieldAnalyzerWrapper, it could be helpful.
It's fairly straightforward to write one. Here's the one I put together
for PerFieldAnalyzerWrapper situations:
package org.apache.lucene.analysis;

import java.io.Reader;

public class VerbatimAnalyzer extends Analyzer {
    public VerbatimAnalyzer() {
        super();
    }

    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream result = new VerbatimTokenizer(reader);
        return result;
    }

    /**
     * This tokenizer assumes that the entire input is just one token.
     */
    public static class VerbatimTokenizer extends CharTokenizer {
        public VerbatimTokenizer(Reader reader) {
            super(reader);
        }

        protected boolean isTokenChar(char c) {
            return true;
        }
    }
}
--
Miles Barr <[EMAIL PROTECTED]>
Runtime Collective Ltd.
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Problem searching Field.Keyword field

2005-02-08 Thread Erik Hatcher
On Feb 8, 2005, at 12:19 PM, Steven Rowe wrote:
Why is there no KeywordAnalyzer?  That is, an analyzer which doesn't  
mess with its input in any way, but just returns it as-is?

I realize that under most circumstances, it would probably be more  
code to use it than just constructing a TermQuery, but having it would  
regularize query handling, and simplify new users' experience.  And  
for the purposes of the PerFieldAnalyzerWrapper, it could be helpful.
It's long been on my TODO list.  I just adapted (changed the package  
names) the Lucene in Action KeywordAnalyzer and added it to the new  
contrib area:

	http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/analyzers/src/java/org/apache/lucene/analysis/KeywordAnalyzer.java

In the next official release of Lucene, the contrib (formerly known as  
the Sandbox) components will be packaged along with the Lucene core.   
I'm still working on this packaging build process as I migrate the  
Sandbox over to contrib.

Erik
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Problem searching Field.Keyword field

2005-02-08 Thread Erik Hatcher
Kelvin - I respectfully disagree - could you elaborate on why this is 
not an appropriate use of Field.Keyword?

If the category is "How To", Field.Text would split this (depending on 
the Analyzer) into "how" and "to".

If the user is selecting a category from a drop-down, though, you 
shouldn't be using QueryParser on it, but instead aggregating a 
TermQuery("category", "How To") into a BooleanQuery with the rest of 
it.  The rest may be other API created clauses and likely a piece from 
QueryParser.

Erik
On Feb 8, 2005, at 11:28 AM, Kelvin Tan wrote:
As I posted previously, Field.Keyword is appropriate in only certain 
situations. For your use-case, I believe Field.Text is more suitable.

k
On Tue, 8 Feb 2005 10:02:19 -0600, Mike Miller wrote:
 This may or may not be correct, but I am indexing it as a keyword
 because I provide a (required) radio button on the add screen for
 the user to determine which category the document should be
 assigned.  Then in the search, provide a dropdown that can be used
 in the advanced search so that they can search only for a specific
 category of documents (like HowTo, Troubleshooting, etc).
 -Original Message-
 From: Kelvin Tan [mailto:[EMAIL PROTECTED] Sent: Tuesday,
 February 08, 2005 9:32 AM To: Lucene Users List
 Subject: RE: Problem searching Field.Keyword field
 Mike, is there a reason why you're indexing "category" as keyword
 not text?
 k
 On Tue, 8 Feb 2005 08:26:13 -0600, Mike Miller wrote:
 Thanks for the quick response.
 Sorry for my lack of understanding, but I am learning!  Won't the
 query parser still handle this query?  My limited understanding
 was that the search call provides the 'all' field as the default
 field for query terms in the case where fields aren't specified.
 Using the current code, searches like author:Mike and
 title:Lucene work fine.
 -Original Message-
 From: Miles Barr [mailto:[EMAIL PROTECTED] Sent:  
 Tuesday, February 08, 2005 8:08 AM To: Lucene Users List Subject:
  Re: Problem searching Field.Keyword field
 You're using the query parser with the standard analyser. You  
 should construct a term query manually instead.
 --
 Miles Barr <[EMAIL PROTECTED]> Runtime Collective Ltd.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Starts With x and Ends With x Queries

2005-02-08 Thread Erik Hatcher
On Feb 8, 2005, at 10:37 AM, sergiu gordea wrote:
Hi Erik,
I'm not changing any functionality.  WildcardQuery will still support 
leading wildcard characters, QueryParser will still disallow them.  
All I'm going to change is the javadoc that makes it sound like 
WildcardQuery does not support leading wildcard characters.

Erik
From what I was reading in the mailing list, there are more Lucene 
users who would like to be able to construct suffix queries.
They are very useful for the German language, because it has many long 
composite words created by concatenation of other simple words.
This is one of the requirements of our system. Therefore I needed to 
patch Lucene to make QueryParser allow suffix queries.

Now I will need to update the Lucene library to the latest version, and I 
need to patch it again.
Do you think it will be possible in the future to have a field in 
QueryParser, boolean ALLOW_SUFFIX_QUERIES?
I have no objections to that type of switch.  Please submit a patch to 
QueryParser.jj that implements this as an option, with the default to 
disallow suffix queries, along with a test case, and I'd be happy to 
apply it.

Erik
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Does anyone have a copy of the highligher code?

2005-02-08 Thread Erik Hatcher
On Feb 8, 2005, at 9:50 AM, Jim Lynch wrote:
Our firewall prevents me from using cvs to check out anything.  Does 
anyone have a jar file or a set of class files publicly available?
The "Lucene in Action" source code - http://www.lucenebook.com - 
contains JAR files, including the Highlighter, for lots of Lucene 
add-on goodies.

Also, Lucene just converted to using Subversion, which is much more 
firewall friendly.  Try this after you have installed the svn client:

svn co http://svn.apache.org/repos/asf/lucene/java/trunk
Erik
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Highlighter: how to specify text from external source?

2005-02-08 Thread Erik Hatcher
On Feb 8, 2005, at 6:29 AM, Yura Smolsky wrote:
Hello, lucene-user.
If I do not store text fields in the index, is there a way to specify
values for Highlighter from an external source, and how?
One of the parameters passed to the highlighting method is a String to 
highlight.

Erik
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Problem searching Field.Keyword field

2005-02-08 Thread Erik Hatcher
The problem is that QueryParser analyzes all pieces of a query 
expression regardless of whether you indexed them as a Field.Keyword or 
not.  If you need to use QueryParser and still support keyword fields, 
you'll want to plug in an analyzer specific to that field using 
PerFieldAnalyzerWrapper.  You'll see this demonstrated in the "Lucene 
in Action" source code.  Here's a quick pointer to where we cover it in 
the book:

http://www.lucenebook.com/search?query=KeywordAnalyzer
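
A sketch of the wiring (1.4-era API; KeywordAnalyzer here is the contrib/LIA one, and field names are illustrative):

    import org.apache.lucene.analysis.KeywordAnalyzer;
    import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.ParseException;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Query;

    public class KeywordFieldParsing {
        public static Query parse(String userQuery) throws ParseException {
            // StandardAnalyzer everywhere except "category",
            // which is matched verbatim
            PerFieldAnalyzerWrapper analyzer =
                new PerFieldAnalyzerWrapper(new StandardAnalyzer());
            analyzer.addAnalyzer("category", new KeywordAnalyzer());
            return QueryParser.parse(userQuery, "all", analyzer);
        }
    }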
On Feb 8, 2005, at 9:26 AM, Mike Miller wrote:
Thanks for the quick response.
Sorry for my lack of understanding, but I am learning!  Won't the query
parser still handle this query?  My limited understanding was that the
search call provides the 'all' field as the default field for query terms
in the case where fields aren't specified.  Using the current code,
searches like author:Mike and title:Lucene work fine.

-Original Message-
From: Miles Barr [mailto:[EMAIL PROTECTED]
Sent: Tuesday, February 08, 2005 8:08 AM
To: Lucene Users List
Subject: Re: Problem searching Field.Keyword field
You're using the query parser with the standard analyser. You should
construct a term query manually instead.
--
Miles Barr <[EMAIL PROTECTED]> Runtime Collective Ltd.
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Configurable indexing of an RDBMS, has it been done before?

2005-02-08 Thread Erik Hatcher
I agree that it is a worthwhile contribution.
Some suggestions... allow the configuration to specify field boost 
values, and analyzer(s).  If analyzers are specified per-field, then 
wrap them automatically with a PerFieldAnalyzerWrapper.  Also, having a 
facility to aggregate fields into a "contents"-like field would be nice 
- though maybe this would be covered implicitly as part of the SQL 
mapping with one of the columns being an aggregate column.

Perhaps the configuration aspect of it (XML mapping of expressions to 
field details) could be generalized to work with an object graph as 
well as SQL result sets.  OGNL (www.ognl.org) provides expression-language 
glue, and I can see it being used for mappings - for example the "name" 
field could be mapped to "company.president.name", where "company" is 
an object (or Map) with a "president" property, and so on.

Erik

On Feb 8, 2005, at 2:42 AM, Aad Nales wrote:
If that is a general thought then I will plan for some time to put 
this in action.

Cheers,
Aad
David Spencer wrote:
Nice, very similar to what I was thinking of, where the most 
significant difference is probably just that I was thinking of a 
batch indexer, not one embedded in a web container. Probably a 
worthwhile contribution to the sandbox.



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Fwd: SearchBean?

2005-02-07 Thread Erik Hatcher
I want to double-check with the user community now that I've run this 
past the lucene-dev list.

Anyone using SearchBean from the Sandbox?  If so, please speak up and 
let me know what it offers that the sort feature does not.  If this is 
now essentially deprecated, I'd like to remove it.

Thanks,
Erik
Begin forwarded message:
From: Erik Hatcher <[EMAIL PROTECTED]>
Date: February 6, 2005 10:02:37 AM EST
To: Lucene List 
Subject: SearchBean?
Reply-To: "Lucene Developers List" 
Is the SearchBean code in the Sandbox still useful now that we have 
sorting in Lucene 1.4?  If so, what does it offer that the core does 
not provide now?

As I'm cleaning up the sandbox and migrating it to a "contrib" area, 
I'm evaluating the pieces and making sure it makes sense to keep or if 
it is no longer useful or should be reorganized in some way.

Erik
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Query Analyzer

2005-02-07 Thread Erik Hatcher
On Feb 7, 2005, at 11:29 AM, Ravi wrote:
How do I set the analyzer when I build the query in my code instead of
using a query parser?
You don't.  All terms you use for any Query subclasses you instantiate 
must match exactly the terms in the index.  If you need an analyzer to 
do this then you're responsible for doing it yourself, just as 
QueryParser does underneath.  I do this myself in my current 
application like this:

private Query createPhraseQuery(String fieldName, String string, boolean lowercase) {
    RossettiAnalyzer analyzer = new RossettiAnalyzer(lowercase);
    TokenStream stream = analyzer.tokenStream(fieldName, new StringReader(string));

    PhraseQuery pq = new PhraseQuery();
    Token token;
    try {
        while ((token = stream.next()) != null) {
            pq.add(new Term(fieldName, token.termText()));
        }
    } catch (IOException ignored) {
        // ignore - shouldn't get an IOException on a StringReader
    }

    if (pq.getTerms().length == 1) {
        // optimize a single-term phrase to a TermQuery
        return new TermQuery(pq.getTerms()[0]);
    }

    return pq;
}
Hope that helps.
Erik
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Similarity coord,lengthNorm

2005-02-07 Thread Erik Hatcher
On Feb 7, 2005, at 8:53 AM, Michael Celona wrote:
Would fixing the lengthNorm to 1 fix this problem?
Yes, it would eliminate the length of a field as a factor.
Your best bet is to set up a test harness where you can try out various 
tweaks to Similarity, but setting the length normalization factor to 
1.0 may be all you need to do, as the coord() takes care of the other 
factor you're after.

Erik
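
Such a subclass can be tiny; a sketch against the 1.4 API (remember lengthNorm is baked into the index norms at indexing time, so reindex after changing it):

    import org.apache.lucene.search.DefaultSimilarity;

    public class FlatLengthSimilarity extends DefaultSimilarity {
        // every field length gets the same weight, so short fields
        // no longer dominate the ranking
        public float lengthNorm(String fieldName, int numTerms) {
            return 1.0f;
        }
    }

Set it on both the IndexWriter used to rebuild the index and on the searcher, e.g. searcher.setSimilarity(new FlatLengthSimilarity()).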
Michael
-Original Message-
From: Michael Celona [mailto:[EMAIL PROTECTED]
Sent: Monday, February 07, 2005 8:48 AM
To: Lucene Users List
Subject: Similarity coord,lengthNorm
I have varying-length text fields which I am searching on.  I would like 
relevancy to be dictated predominantly by the number of terms in my query 
that match.  Right now I am seeing a high relevancy for a single word 
matching in a small document even though all the terms in my query don't 
match.  Does anyone have an example of a custom Similarity subclass which 
overrides the coord and lengthNorm methods?


Thanks..
Michael

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Starts With x and Ends With x Queries

2005-02-07 Thread Erik Hatcher
On Feb 7, 2005, at 2:07 AM, sergiu gordea wrote:
Hi Erik,

"In order to prevent extremely slow WildcardQueries, a Wildcard term 
must not start with one of the wildcards * or 
?."

I don't read that as saying you cannot use an initial wildcard 
character, but rather as if you use a leading wildcard character you 
risk performance issues.  I'm going to change "must" to "should".
Will this change be available in the next release of Lucene? How do you 
plan to implement this? Will this be available as an attribute of 
QueryParser?
I'm not changing any functionality.  WildcardQuery will still support 
leading wildcard characters, QueryParser will still disallow them.  All 
I'm going to change is the javadoc that makes it sound like 
WildcardQuery does not support leading wildcard characters.

Erik
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: PHP-Lucene Integration

2005-02-06 Thread Erik Hatcher
Eventually you can just do PHP within the servlet container
http://www.jcp.org/en/jsr/detail?id=223
and have your cake and eat it too!  :)
Erik
On Feb 6, 2005, at 12:10 PM, Owen Densmore wrote:
I'm building a lucene project for a client who uses php for their 
dynamic web pages.  It would be possible to add servlets to their 
environment easily enough (they use apache) but I'd like to have 
minimal impact on their IT group.

There appears to be a php java extension that lets php call back & 
forth to java classes, but I thought I'd ask here if anyone has had 
success using lucene from php.

Note: I looked in the Lucene In Action search page, and yup, I bought 
the book and love it!  No examples there tho.  The list archives 
mention that using java lucene from php is the way to go, without 
saying how.  There's mention of a lucene server and a php interface to 
that.  And some similar comments.  But I'm a bit surprised there's not 
a bit more in terms of use of the official java extension to php.

Thanks for the great package!
Owen
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Starts With x and Ends With x Queries

2005-02-06 Thread Erik Hatcher
On Feb 4, 2005, at 9:37 PM, Chris Hostetter wrote:
If you want to start doing suffix queries (i.e. all names ending with
"s", or all names ending with "Smith"), one approach would be to use
WildcardQuery, which as Erik mentioned will allow you to use a query Term
that starts with a "*", i.e....

   Query q3 = new WildcardQuery(new Term("name","*s"));
   Query q4 = new WildcardQuery(new Term("name","*Smith"));
(NOTE: Erik says you can do this, but the docs for WildcardQuery say 
you can't. I'll assume the docs are wrong and Erik is correct.)
I assume you mean this comment on WildcardQuery's javadocs:
"In order to prevent extremely slow WildcardQueries, a Wildcard term 
must not start with one of the wildcards * or 
?."

I don't read that as saying you cannot use an initial wildcard 
character, but rather as if you use a leading wildcard character you 
risk performance issues.  I'm going to change "must" to "should".  And 
yes, WildcardQuery itself supports a leading wildcard character exactly 
as you have shown.

Which leads me to my point: if you denormalize your data so that you 
store
both the Term you want, and the *reverse* of the term you want, then a
Suffix query is just a Prefix query on a reversed field -- by 
sacrificing
space, you can get all the speed efficiencies of a PrefixQuery when 
doing
a SuffixQuery...

   D1> name:"Adam Smith" rname:"htimS madA" age:13 state:CA ...
   D2> name:"Joe Bob" rname:"boB oeJ" age:42 state:WA ...
   D3> name:"John Adams" rname:"smadA nhoJ" age:35 state:NV ...
   D3> name:"Sue Smith" rname:"htimS euS" age:33 state:CA ...
   Query q1 = new PrefixQuery(new Term("name","J"));
   Query q2 = new PrefixQuery(new Term("name","Sue"));
   Query q3 = new PrefixQuery(new Term("rname","s"));
   Query q4 = new PrefixQuery(new Term("rname","htimS"));
(If anyone sees a flaw in my theory, please chime in)
This trick has been mentioned on this list before, and is a good one.  
I'll go one step further and mention another technique I found in the 
book Managing Gigabytes, making "*string*" queries drastically more 
efficient for searching (though also impacting index size).  Take the 
term "cat".  It would be indexed with all rotated variations with an 
end of word marker added:

cat$
at$c
t$ca
$cat
The query for "*at*" would be preprocessed and rotated such that the 
wildcards are collapsed at the end to search for "at*" as a 
PrefixQuery.  A wildcard in the middle of a string like "c*t" would 
become a prefix query for "t$c*".

Has anyone tried this technique with Lucene?
Erik
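
For the curious, a sketch of generating the rotated forms at index time (marker and names illustrative):

    import java.util.ArrayList;
    import java.util.List;

    public class TermRotator {
        // "cat" -> cat$, at$c, t$ca, $cat
        public static List rotations(String term) {
            String marked = term + "$";
            List result = new ArrayList();
            for (int i = 0; i < marked.length(); i++) {
                result.add(marked.substring(i) + marked.substring(0, i));
            }
            return result;
        }
    }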
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Some questions about index...

2005-02-05 Thread Erik Hatcher
On Feb 5, 2005, at 10:04 AM, Karl Koch wrote:
1) Can I store all the information of the text file, but also apply an
analyzer? E.g. I use the StopAnalyzer. After finding the document, I 
want to extract the original text from the index as well. Does this 
require that I store the information twice in two different fields (one 
indexed and one unindexed)?
You should use a single stored, tokenized, and indexed field for this 
purpose.  Be cautious of how you construct the Field object to achieve 
this.

2) I would like to extract information from the index which can found 
in a
boolean way. I know that Lucene is a VSM which provides Boolean 
operators.
This however does not change its functioning. For example, I have a 
field
with contains an ID number and I want to use the search like a database
operatation (e.g. to find the document with id=1). I can solve the 
problem
by searching with query "id:1". However, this does not ensure that I 
will
only get one result. Usually the first result is the document I want. 
But it
could happen, that this sometimes does not work.
Why wouldn't it work?  For ID-type fields, use a Field.Keyword (stored, 
indexed, but not tokenized).  Search for a specific ID using a 
TermQuery (don't use QueryParser for this, please).  If the ID values 
are unique, you'll either get zero or one result.
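
A sketch of the round trip (1.4-era API; names illustrative):

    import java.io.IOException;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;

    public class IdLookup {
        public static Document makeDoc(String id) {
            Document doc = new Document();
            doc.add(Field.Keyword("id", id)); // stored, indexed, not tokenized
            return doc;
        }

        public static Hits findById(IndexSearcher searcher, String id)
                throws IOException {
            // exact term match: no analysis, no QueryParser,
            // so "5" never matches "50"
            return searcher.search(new TermQuery(new Term("id", id)));
        }
    }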

What happens if I should get no results? I guess if I search for id=5 
and 5 did not exist, I would probably get 50, 51, ... just because they 
contain 5. Did somebody work with this and can suggest a stable solution?
No, this would not be the case, unless you're analyzing the ID field 
with some strange character-by-character analyzer or doing a wildcard 
"*5*" type query.

A good solution for these two questions would help me avoiding a 
database
which would need to replicate most the data which I already have in my
Lucene index...
You're on the right track and avoiding a database when it is overkill 
or duplicative is commendable :)

Erik
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Starts With x and Ends With x Queries

2005-02-04 Thread Erik Hatcher
It matches both because you're tokenizing the name field.  In both 
documents, the name field has a "testing" term in it (it gets 
lowercased also).  A PrefixQuery matches terms that start with the 
prefix.  Use an untokenized field type (Field.Keyword) if you want to 
keep the entire original string as-is for searching purposes - however 
you'd have issues with case-sensitivity in your example.

Also keep in mind that QueryParser only allows a trailing asterisk, 
creating a PrefixQuery.  However, if you use a WildcardQuery directly, 
you can use an asterisk as the starting character (at the risk of 
performance).

Erik
On Feb 4, 2005, at 7:50 PM, Luke Shannon wrote:
Hello;
I have these two documents:
[The two documents' field listings were garbled in the archive; both mixed Keyword and Text fields, and each had a tokenized "name" Text field containing the word "testing".]

I would like to be able to match name fields that start with testing
(specifically) and those that end with it.
I thought the below code would parse to a PrefixQuery that would satisfy 
my starting requirement (maybe I don't understand what this query is for). 
But this matches both.

Query query = QueryParser.parse("testing*", "name", new 
StandardAnalyzer());

Has anyone done this before? Any tips?
Thanks,
Luke

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Document numbers and ids

2005-02-04 Thread Erik Hatcher
On Feb 4, 2005, at 12:24 PM, Simeon Koptelov wrote:
By "renumbered", it means it squeezes out holes left by deletes.  The
actual order does not change and thus does not affect a 
Sort.INDEXORDER
sort.

Documents are stored in the index in the order that they were indexed 
-
nothing changes this order.  Document id's are not permanent if 
deletes
occur followed by an optimize.
Thanks for clarification, Erik. Could you answer one more question: 
can I
control the assignment of document numbers during indexing?
No, you cannot control Lucene's document id scheme - it is basically 
"for internal use".

Maybe I should explain why I'm asking.
I'm searching for documents, but for most (almost all) of them I don't 
really care about their content. I only want to know a particular numeric 
field from each document (the id of the document's category).
I also need to know how many docs in each category were found, so I can't 
index categories instead of docs.
The result set can be pretty big (30K) and all of it must be handled in 
an inner loop.
So I want to use HitCollector and assign intervals of ids to categories 
of documents. Following this way, there's no need to actually retrieve 
the document in the inner loop.

Am I on the right way?
You should explore the use of IndexReader.  Index your documents with 
category id field, and use the methods on IndexReader to find all 
unique categories (TermEnum).

Erik
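
A sketch of walking the unique category terms with a TermEnum (1.4-era API; field name illustrative):

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermEnum;

    public class CategoryCounts {
        public static void print(IndexReader reader) throws IOException {
            // start the enumeration at the first "category" term
            TermEnum terms = reader.terms(new Term("category", ""));
            try {
                do {
                    Term term = terms.term();
                    if (term == null || !"category".equals(term.field())) {
                        break; // ran past the category field
                    }
                    // docFreq() = number of documents carrying this category
                    System.out.println(term.text() + ": " + terms.docFreq());
                } while (terms.next());
            } finally {
                terms.close();
            }
        }
    }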
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Document numbers and ids

2005-02-04 Thread Erik Hatcher
On Feb 4, 2005, at 9:49 AM, Simeon Koptelov wrote:
The LiA says that I can use Sort.INDEXORDER when indexing order is 
relevant, and gives an example where documents' ids (obtained from 
Hits.id()) are increasing from top to bottom of the result set. Are those 
ids the same thing as document numbers?
Yes, id is the same as document number.
If they are the same, how can it be that they are preserved during the 
indexing process? LiA says that documents are renumbered when merging 
segments.
By "renumbered", it means it squeezes out holes left by deletes.  The 
actual order does not change and thus does not affect a Sort.INDEXORDER 
sort.

Documents are stored in the index in the order that they were indexed - 
nothing changes this order.  Document id's are not permanent if deletes 
occur followed by an optimize.

Erik
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Right way to make analyzer

2005-02-03 Thread Erik Hatcher
On Feb 3, 2005, at 9:26 AM, Owen Densmore wrote:
Is this the right way to make a porter analyzer using the standard 
tokenizer?  I'm not sure about the order of the filters.

Owen
class MyAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        return new PorterStemFilter(
            new StopFilter(
                new LowerCaseFilter(
                    new StandardFilter(
                        new StandardTokenizer(reader))),
                StopAnalyzer.ENGLISH_STOP_WORDS));
    }
}
Yes, that is correct.
Analysis starts with a tokenizer, and chains the output of that to the 
next filter and so on.

I strongly recommend, as you start tinkering with custom analysis, to 
use a little bit of code to see how your analyzer works on some text.  
The Lucene Intro article I wrote for java.net has some code you can 
borrow to do this, as does Lucene in Action's source code.  Also, Luke 
has this capability - which is a tool I also highly recommend.

Erik
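
A quick harness of the sort mentioned above, sketched for the 1.4-era API and using the MyAnalyzer from this thread:

    import java.io.IOException;
    import java.io.StringReader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;

    public class AnalyzerTester {
        // print each token the analyzer produces for the given text
        public static void printTokens(Analyzer analyzer, String text)
                throws IOException {
            TokenStream stream = analyzer.tokenStream("field", new StringReader(text));
            Token token;
            while ((token = stream.next()) != null) {
                System.out.print("[" + token.termText() + "] ");
            }
            System.out.println();
        }

        public static void main(String[] args) throws IOException {
            printTokens(new MyAnalyzer(), "The quick brown foxes jumped");
        }
    }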
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Subversion conversion

2005-02-03 Thread Erik Hatcher
We can work the 1.x and 2.0 lines of code however we need to.  We can 
branch (a branch or tag in Subversion is inexpensive and a constant 
time operation).  How we want to manage both versions of Lucene is open 
for discussion.  Nothing about Subversion changes how we manage this 
from how we'd do it with CVS.

Currently the 1.x and 2.x lines of code are one and the same.  Once 
they diverge in 2.0, it will depend on who steps up to maintain 1.x, but 
I suspect there will be a strong interest in keeping it alive by some; 
we would of course encourage everyone using 1.x to upgrade to 1.9 and 
remove deprecation warnings.

Erik

On Feb 3, 2005, at 4:33 AM, Miles Barr wrote:
On Wed, 2005-02-02 at 22:11 -0500, Erik Hatcher wrote:
I've seen both of these types of procedures followed on Apache
projects.  It really just depends.  Lucene's codebase is not being
modified frequently, so it is not necessary to branch and merge back.
Rather we simply develop off of the trunk (HEAD) and when we're ready
for a release we'll just do it from the trunk.  Actually  we'd most
likely tag and build from that tag just to be clean about it.
What consequences does this have for the 1.9/2.0 releases? i.e. after
2.0 the deprecated API will be removed, does this mean 1.x will no
longer be supported after 2.0?
The typical scenario being: a bug is found that affects 1.x and 2.x; it's
patched in 2.x (i.e. the trunk), but we can't patch the last 1.x release.
The other scenario being: a bug is found in the 1.x code, but the fix
cannot be applied.

--
Miles Barr <[EMAIL PROTECTED]>
Runtime Collective Ltd.
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Has anyone tried indexing xml files: DigesterXMLHandler.java file before?

2005-02-03 Thread Erik Hatcher
You're missing the Commons Digester JAR, which is in the lib directory 
of the LIA download.  Check the build.xml file for the build details of 
how the compile class path is set.  You'll likely need some other JAR's 
at runtime too.

Erik
On Feb 3, 2005, at 2:12 AM, jac jac wrote:
Hi,
I just tried to compile DigesterXMLHandler.java from the LIA code, 
which I got from the src directory.

I placed it into my own directory...
I could't seem to be able to compile DigesterXMLHandler.java:
It keeps prompting:
DigesterXMLHandler.java:9: package org.apache.commons.digester does 
not exist
import org.apache.commons.digester.Digester;
   ^
DigesterXMLHandler.java:19: cannot resolve symbol
symbol  : class Digester
location: class lia.handlingtypes.xml.DigesterXMLHandler
  private Digester dig;
  ^
DigesterXMLHandler.java:25: cannot resolve symbol
symbol  : class Digester
location: class lia.handlingtypes.xml.DigesterXMLHandler
dig = new Digester();

I have set the classpath...
May I know how we run the file in order to get my index folder?
So sorry, I really can't work out the way to run it...
Is there any documentation around?
Thanks very much!

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Subversion conversion

2005-02-02 Thread Erik Hatcher
I've seen both of these types of procedures followed on Apache 
projects.  It really just depends.  Lucene's codebase is not being 
modified frequently, so it is not necessary to branch and merge back.  
Rather we simply develop off of the trunk (HEAD) and when we're ready 
for a release we'll just do it from the trunk.  Actually  we'd most 
likely tag and build from that tag just to be clean about it.

Erik
On Feb 2, 2005, at 7:49 PM, Chakra Yadavalli wrote:
Hello ALL, It might not be the right place for it but as we are talking
about SCM, I have a quick question. First, I haven't used CVS/SVN on 
any
project. I am a ClearCase/PVCS guy. I just would like to know WHICH
CONFIGURATION MANAGEMENT PLAN DO YOU FOLLOW IN LUCENE DEVELOPMENT.

PLAN A: DEVELOP IN TRUNK AND BRANCH OFF ON RELEASE
Recently I had a discussion with a friend about developing in the TRUNK
(which is /main in ClearCase speak), which my friend claims is
done in the Apache/open source projects. The main advantage he pointed
out was that merging could be avoided if you are developing in the TRUNK.
And when there is a release, they create a new Branch (say LUCENE_1.5
branch) and label them. That branch will be used for maintenance and 
any
code deltas will be merged back to TRUNK as needed.

PLAN B: BRANCH OF BEFORE PLANNED RELEASE AND MERGE BACK TO MAIN/TRUNK
As I am from a "private workspace"/"isolated development" school of
thought promoted by ClearCase, I am used to create a branch at the
project/release initiation and develop in that branch (say /main/dev).
Similarly, we have /main/int for making changes when the project goes 
to
integration phase, and a /main/acp branch for acceptance. In this
school, the /main will always have fewer versions of files and the
difference between any two consecutive versions is the NET CHANGE of
that SCM element (either file or dir) between two releases (say LUCENE
1.4 and 1.5).

Thanks in advance for your time.
Chakra Yadavalli
http://jroller.com/page/cyblogue
-Original Message-
From: aurora [mailto:[EMAIL PROTECTED]
Sent: Wednesday, February 02, 2005 4:25 PM
To: lucene-user@jakarta.apache.org
Subject: Re: Subversion conversion
Subversion rocks!
I have just set up the Windows svn client TortoiseSVN with my favourite
file manager, Total Commander 6.5. The svn status and commands are
readily integrated with the file manager. Offline diff and revert are
two things I really like from svn.

The conversion to Subversion is complete.  The new repository is
available to users read-only at:
  http://svn.apache.org/repos/asf/lucene/java/trunk
Besides /trunk, there is also /branches and /tags.  /tags contains all
the CVS tags made so that you could grab a snapshot of a previous
version.  /trunk is analogous to CVS HEAD.  You can learn more about the
Apache repository configuration here and how to use the command-line
client to check out the repository:
  http://www.apache.org/dev/version-control.html
Learn about Subversion, including the complete O'Reilly Subversion book
in electronic form for free here:
  http://subversion.tigris.org
For committers, check out the repository using https and your Apache
username/password.
The Lucene sandbox has been integrated into our single Subversion
repository, under /java/trunk/sandbox:
  http://svn.apache.org/repos/asf/lucene/java/trunk/sandbox/
The Lucene CVS repositories have been locked for read-only.
If there are any issues with this conversion, let me know and I'll
bring them to the Apache infrastructure group.
  Erik
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

--
Visit my weblog: http://www.jroller.com/page/cyblogue
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Subversion conversion

2005-02-02 Thread Erik Hatcher
The conversion to Subversion is complete.  The new repository is 
available to users read-only at:

http://svn.apache.org/repos/asf/lucene/java/trunk
Besides /trunk, there is also /branches and /tags.  /tags contains all 
the CVS tags made so that you could grab a snapshot of a previous 
version.  /trunk is analogous to CVS HEAD.  You can learn more about 
the Apache repository configuration here and how to use the 
command-line client to check out the repository:

http://www.apache.org/dev/version-control.html
Learn about Subversion, including the complete O'Reilly Subversion book 
in electronic form for free here:

http://subversion.tigris.org
For committers, check out the repository using https and your Apache 
username/password.

The Lucene sandbox has been integrated into our single Subversion 
repository, under /java/trunk/sandbox:

http://svn.apache.org/repos/asf/lucene/java/trunk/sandbox/
The Lucene CVS repositories have been locked for read-only.
If there are any issues with this conversion, let me know and I'll 
bring them to the Apache infrastructure group.

Erik
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Compile lucene

2005-02-02 Thread Erik Hatcher
On Feb 2, 2005, at 2:26 PM, Helen Butler wrote:
Hi
I'm trying to compile Lucene but am encountering the following error on 
typing ant from the root of lucene-1.4.3:

C:\lucene-1.4.3>ant
Buildfile: build.xml
init:
compile-core:
BUILD FAILED
C:\lucene-1.4.3\build.xml:140: srcdir "C:\lucene-1.4.3\src\java" does not exist!


I've installed a jdk and ant successfully and set the following 
CLASSPATH
C:\lucene-1.4.3\lucene-demos-1.4.3.jar;C:\lucene-1.4.3\lucene-1.4.3.jar
First rule of using Ant: don't use a CLASSPATH.  It is unnecessary, not 
to mention you put JAR files in there that you appear to be trying to 
build.

Do you have the source code distribution of Lucene?  It appears not, or 
you'd have src/java available.

Erik
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: which HTML parser is better?

2005-02-02 Thread Erik Hatcher
On Feb 2, 2005, at 6:17 AM, Karl Koch wrote:
Hello,
I have been following this thread and have another question.
Is there a piece of source code (preferably very short and simple 
(KISS)) which allows me to remove all HTML tags from HTML content? HTML 
3.2 would be enough... also no frames, CSS, etc.

I do not need the HTML structure tree or any other structure, but 
need a facility to clean up HTML into its normal underlying content 
before indexing that content as a whole.

The code in the Lucene Sandbox for parsing HTML with JTidy (under 
contributions/ant) does what you ask.

Erik
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: enquiries - pls help, thanks

2005-02-02 Thread Erik Hatcher
On Feb 2, 2005, at 2:40 AM, jac jac wrote:
May I know whether Lucene currently supports indexing of xml documents?
That's a loaded question.  Lucene "supports" it by being able to index 
text, sure.  But Lucene does not include an XML parser and the facility 
to automatically turn an XML file into a Lucene document, nor would you 
want that.  For example - in my current project, I'm parsing XML 
documents, and indexing pieces of them individually as Lucene Documents 
- in fact I'm doing that in all kinds of various ways too.

The demo applications that you've tried are not designed for anything 
but a very very basic demonstration of how to use Lucene - these 
example applications were never intended to be used as-is for anything 
other than some code you could borrow and learn from to build your own 
custom solutions.

If you want a quick jump on processing XML with Lucene, try out the 
code that comes with Lucene in Action (grab it from 
www.lucenebook.com).  When you get the code, run this:

$ ant ExtensionFileHandler
Buildfile: build.xml
...
ExtensionFileHandler:
 [echo]
 [echo]   This example demonstrates the file extension document 
handler.
 [echo]   Documents with extensions .xml, .rtf, .doc, .pdf, 
.html, and .txt are
 [echo]   all handled by the framework.  The contents of the 
Lucene Document
 [echo]   built for the specified file is displayed.
 [echo]
[input] Press return to continue...

[input] File: [src/lia/handlingtypes/data/HTML.html]
src/lia/handlingtypes/data/addressbook.xml
 [echo] Running lia.handlingtypes.framework.ExtensionFileHandler...
 [java] log4j:WARN No appenders could be found for logger 
(org.apache.commons.digester.Digester.sax).
 [java] log4j:WARN Please initialize the log4j system properly.
     [java] Document<... Keyword field listing garbled in the archive ...>

BUILD SUCCESSFUL
Total time: 18 seconds
Note that I typed in the path to an XML file where it asks for [input]. 
 Now dig into the source tree and borrow what you need from 
src/lia/handlingtypes

Erik

I tried building an index to index all my directories in webapps:
via:
java org.apache.lucene.demo.IndexFiles /homedir/tomcat/webapps
then I tried using the following command to search:
java org.apache.lucene.demo.SearchFiles
and I typed in my query. I was able to see the files, which directed me 
to the path which holds my data.

However, when I do
java org.apache.lucene.demo.IndexHTML -create -index /homedir/index ..
and when I went to my website I realised it can't search for the data I 
wanted.

I want to search data within XML documents... May I know if the 
current demo version allows indexing of XML documents?

Why is it that after I do "java org.apache.lucene.demo.IndexHTML 
-create -index /homedir/index .." then the data I wanted can't be 
searched? Thanks a lot!


jac



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: REPLACE USING ANALYZERS

2005-02-02 Thread Erik Hatcher
On Feb 2, 2005, at 4:12 AM, Karthik N S wrote:
Hi Guys
Apologies.
I would like to know if there are any Analyzers out there which can give 
me the required o/p as shown below.
Sure:
string.replaceAll("~","")
:)
1)
I/p  =  "+~shoes -~nike"
O/p  =  "+shoes -nike"

2)
I/p  =  +(+"~shoes -~nike")
O/p  =  +(+"shoes -nike")

3)
I/p  =  +~shoes -~nike
O/p  =  +shoes -nike

[ Note: I am using the JavaScript tool available from the Lucene 
contributors' site to build advanced search with a synonym factor ]

Thx in advance

WITH WARM REGARDS
 HAVE A NICE DAY
[ N.S.KARTHIK]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Results

2005-02-01 Thread Erik Hatcher
On Feb 1, 2005, at 7:36 PM, Hetan Shah wrote:
Another question for the day:
How to make sure that the results shown are the only ones containing 
the keywords specified?

e.g.
the result for the query Red AND HAT AND Linux
should result in documents which have all three keywords, and not 
show documents that only have one or two keywords?
Huh?  You would never get documents returned that only had two of those 
terms given that AND'd query.

Erik
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Query Format

2005-02-01 Thread Erik Hatcher
How are you indexing your document?
If you're using QueryParser with the default operator set to OR (which 
is the default), then you've already provided the expression you need 
:)

Erik
On Feb 1, 2005, at 6:29 PM, Hetan Shah wrote:
Hello All,
What should my query look like if I want to search for all or any of the 
following keywords?

Sun Linux Red Hat Advance Server
Replies are much appreciated.
-H
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Duplicate Hits

2005-02-01 Thread Erik Hatcher
On Feb 1, 2005, at 10:51 AM, Jerry Jalenak wrote:
OK - but I'm dealing with indexing between 1.5 and 2 million documents, 
so I really don't want to 'batch' them up if I can avoid it.  And I also 
don't think I can keep an IndexReader open to the index at the same time 
I have an IndexWriter open.  I may have to try and deal with this issue 
through some sort of filter on the query side, provided it doesn't 
impact performance too much.
You can use an IndexReader and IndexWriter at the same time (the caveat 
is that you cannot delete with the IndexReader at the same time you're 
writing with an IndexWriter).  Is there no other identifying 
information, though, on the incoming documents with a date stamp?  
Identifier?  Or something unique you can go on?

Erik
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: User Rights Management in Lucene

2005-02-01 Thread Erik Hatcher
On Feb 1, 2005, at 10:01 AM, Verma Atul (extern) wrote:
Hi,
I'm new to Lucene and want to know whether Lucene has the capability of
displaying the search results based on the user's rights.

For example:
Suppose there are some resources, like:
Resource 1
Resource 2
Resource 3
Resource 4
And there are, say, 2 users, with
User 1 having access to Resource 1, Resource 2 and Resource 4; and User
2 having access to Resource 1 and Resource 3.
So when User 1 searches the database, he should get results from
Resources 1, 2 and 4, but when User 2 searches the database, he should 
get results from Resources 1 and 3.
Lucene in Action has a SecurityFilterTest example (grab the source code 
distribution).  You can see a glimpse of this here:

http://www.lucenebook.com/search?query=security
So yes, it's possible to index a username or roles alongside each 
document and apply that criteria to any search a user makes such that a 
user only gets documents allowed.  How complex this gets depends on how 
you need the permissions to work - the LIA example is rudimentary and 
simply associates an "owner" with each document and users are only 
allowed to see the documents they own.

Erik
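
In the rudimentary owner scheme, the search-side piece can be as small as this sketch (1.4-era API; field name illustrative):

    import java.io.IOException;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Filter;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.QueryFilter;
    import org.apache.lucene.search.TermQuery;

    public class OwnerRestrictedSearch {
        // restrict any query to documents owned by the given user
        public static Hits search(IndexSearcher searcher, Query query,
                String username) throws IOException {
            Filter ownerFilter =
                new QueryFilter(new TermQuery(new Term("owner", username)));
            return searcher.search(query, ownerFilter);
        }
    }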
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Duplicate Hits

2005-02-01 Thread Erik Hatcher
On Feb 1, 2005, at 9:49 AM, Jerry Jalenak wrote:
Given Erik's response of 'don't put duplicate documents in the index', 
how
can I accomplish this in the IndexWriter?
As John said - you'll have to come up with some way of knowing whether 
you should index or not.  For example, when dealing with filesystem 
files, the Ant task (in the sandbox) checks the last modified date 
and only indexes new files.

Using a unique id on your data (primary key from a DB, URL from web 
pages, etc) is generally what people use for this.

Erik
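
One sketch of such a check, using IndexReader.docFreq on the unique id (1.4-era API; field name illustrative):

    import java.io.IOException;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;

    public class DedupingAdder {
        // add the document only if no document with this id is indexed yet;
        // the reader sees the index as of when it was opened, so reopen it
        // periodically when indexing large batches
        public static void addIfAbsent(IndexReader reader, IndexWriter writer,
                String id, Document doc) throws IOException {
            if (reader.docFreq(new Term("id", id)) == 0) {
                writer.addDocument(doc);
            }
        }
    }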
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Duplicate Hits

2005-02-01 Thread Erik Hatcher
On Feb 1, 2005, at 9:01 AM, Jerry Jalenak wrote:
Is there a way to eliminate duplicate hits being returned from the 
index?
Sure, don't put duplicate documents in the index :)
Erik
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Can I sort search results by score and docID at one time?

2005-02-01 Thread Erik Hatcher
On Feb 1, 2005, at 4:21 AM, Jingkang Zhang wrote:
Lucene support sort by score or docID.Now I want to
sort search results by score and docID or by two
fields at one time, like sql
command " order by score,docID" , how can I do it?
Sorting by multiple fields (including score and document id) is 
supported.  Here's an example:

 new Sort(new SortField[]{
  new SortField("category"),
  SortField.FIELD_SCORE,
  new SortField("pubmonth", SortField.INT, true)
})
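
You hand the Sort to the searcher along with the query (the boolean on the last SortField reverses that field's order):

    Hits hits = searcher.search(query, sort);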



Re: lucene query (sql kind)

2005-01-28 Thread Erik Hatcher
Ross - I'm really perplexed by your message.  You create HTML from a 
database so that you can index it with Lucene, yet wish you could 
simply index the data in your database tied to a primary key directly, 
right?

Well, you're in luck - you already can do this!
What are you using for indexing?  It sounds like you borrowed the 
Lucene demo and have just run with that directly.
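
The direct approach needs nothing more than this sketch (the table and column names are made up):

    import java.sql.*;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class DbIndexer {
      public static void index(Connection conn, String indexDir) throws Exception {
        IndexWriter writer = new IndexWriter(indexDir, new StandardAnalyzer(), true);
        Statement stmt = conn.createStatement();
        ResultSet rs = stmt.executeQuery("SELECT id, title, body FROM articles");
        while (rs.next()) {
          Document doc = new Document();
          // primary key: stored and untokenized, comes back in the Hits
          doc.add(Field.Keyword("id", rs.getString("id")));
          // searchable text: analyzed but unstored keeps the index small
          doc.add(Field.UnStored("title", rs.getString("title")));
          doc.add(Field.UnStored("body", rs.getString("body")));
          writer.addDocument(doc);
        }
        rs.close();
        stmt.close();
        writer.optimize();
        writer.close();
      }
    }

At search time, hits.doc(i).get("id") hands back the primary key, and the database supplies everything else.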

Erik
On Jan 28, 2005, at 11:02 AM, Ross Rankin wrote:
I agree.  My site is all dynamic pages created from the database.  
Right
now, I have to have a process create dummy pages, index them with 
Lucene,
then translate the Lucene results into meaningful links.  It actually works better than it sounds; however, it could be easier.

If I could just give Lucene a query result (i.e. a list of rows) and then have Lucene send me back, say, the primary keys of the rows that match, along with the other Lucene goodness: ranking, number of hits, etc.

Could be pretty powerful and simplify the deployment for database 
driven
applications.

[Note: this opinion and $3.00 will get you a coffee at Starbucks]
Ross
-----Original Message-----
From: PA [mailto:[EMAIL PROTECTED]
Sent: Friday, January 28, 2005 6:44 AM
To: Lucene Users List
Subject: Re: lucene query (sql kind)
On Jan 28, 2005, at 12:40, sunil goyal wrote:
I want to run dynamic queries against the Lucene index.  Is there any native syntax available for Lucene so that I can query by first generating the query in, say, an XML or SQL-like format (cache this query) and then use this query over the Lucene index.
Talking of which, has anyone contemplated the possibility of a JDBC adaptor of sorts for Lucene?
Cheers
--
PA, Onnay Equitursay
http://alt.textdrive.com/


Re: Search results excerpt similar to Google

2005-01-28 Thread Erik Hatcher
On Jan 28, 2005, at 1:46 AM, Jason Polites wrote:
I think they do a proximity result based on keyword matches.  So... If 
you search for "lucene" and the document returned has this word at the 
very start and the very end of the document, then you will see the two 
sentences (sequences of words) surrounding the two keyword matches, 
one from the start of the document and one from the end.
There is a Highlighter package in the Lucene sandbox.  Highlighting 
looks like this:

http://www.lucenebook.com/search?query=highlighter
How you determine which words from the result you include in the summary is up to you.  The problem with this is that in Lucene-land you have to store the content of the document inside the index verbatim (so you can get arbitrary portions of it out).  This means your index will be larger than it really needs to be.
You do not have to store the content in the index, it just happens to 
be convenient for most situations.  Content could be stored anywhere.  
Getting the text and reanalyzing it for Highlighter is all that is 
required.  Storing in the index has some performance benefits in the 
CVS version of Lucene, as you can store term position offset 
information and avoid having to re-analyze for highlighting.
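
The re-analyze-and-highlight step is only a few lines; a sketch (the "contents" field name is assumed):

    import java.io.StringReader;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.search.highlight.Highlighter;
    import org.apache.lucene.search.highlight.QueryScorer;

    String text = hits.doc(i).get("contents"); // or fetched from wherever you store it
    Highlighter highlighter = new Highlighter(new QueryScorer(query));
    TokenStream tokens = new StandardAnalyzer()
        .tokenStream("contents", new StringReader(text));
    String excerpt = highlighter.getBestFragments(tokens, text, 3, "...");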

Erik
I usually just store the first 255 characters in the index and use 
this as a summary.  It's not as good as Google, but it seems to work 
ok.

- Original Message - From: "Ben" <[EMAIL PROTECTED]>
To: "Lucene" 
Sent: Friday, January 28, 2005 5:08 PM
Subject: Search results excerpt similar to Google

Hi
Is it hard to implement a function that displays the search results
excerpts similar to Google?
Is it just string manipulation or is there some logic behind it?  I like their excerpts.
Thanks


Re: query term frequency

2005-01-28 Thread Erik Hatcher
On Jan 27, 2005, at 10:24 PM, Jonathan Lasko wrote:
No, the number of occurrences of a term in a Query.
Nothing built-in gives you this.  You'd have to dissect the Query  
clause-by-clause and cast each clause to the proper type to pull the  
terms from them.  The Highlighter code does this.
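
A bare-bones version of that dissection (a sketch that handles only TermQuery and BooleanQuery; other Query types would need their own branches):

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    public class QueryTermCounter {
      // Counts how many times each Term occurs in the query tree.
      public static void count(Query query, Map counts) {
        if (query instanceof TermQuery) {
          Term term = ((TermQuery) query).getTerm();
          Integer n = (Integer) counts.get(term);
          counts.put(term, new Integer(n == null ? 1 : n.intValue() + 1));
        } else if (query instanceof BooleanQuery) {
          BooleanClause[] clauses = ((BooleanQuery) query).getClauses();
          for (int i = 0; i < clauses.length; i++) {
            count(clauses[i].query, counts);  // public field in the 1.4 API
          }
        }
        // PhraseQuery, WildcardQuery, etc. left as an exercise
      }
    }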

If there is a better way, I'd like to know.
Erik

Jonathan
Quoting David Spencer <[EMAIL PROTECTED]>:
Jonathan Lasko wrote:
What do I call to get the term frequencies for terms in the Query?  I
can't seem to find it in the Javadoc...
Do you mean the # of docs that have a term?

http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexReader.html#docFreq(org.apache.lucene.index.Term)
Thanks.
Jonathan


Re: LuceneReader.delete (term t) Failure ?

2005-01-27 Thread Erik Hatcher
Could you work up a self-contained RAMDirectory-using example that 
demonstrates this issue?

Erik
On Jan 27, 2005, at 9:10 PM, <[EMAIL PROTECTED]> wrote:
Erik,
I am using the keyword field
doc.add(Field.Keyword("uid", pathRelToArea));
anything else I can check on?
thanks
atul
PS: we worked together on the Darden project

From: Erik Hatcher <[EMAIL PROTECTED]>
Date: 2005/01/27 Thu PM 07:46:40 EST
To: "Lucene Users List" 
Subject: Re: LuceneReader.delete (term t) Failure ?
How did you index the "uid" field?  Field.Keyword?  If not, that may 
be
the problem in that the field was analyzed.  For a key field like 
this,
it needs to be unanalyzed/untokenized.

Erik
On Jan 27, 2005, at 6:21 PM, <[EMAIL PROTECTED]> wrote:
Hi,
I am trying to delete a document from Lucene index using:
 Term aTerm = new Term( "uid", path );
 aReader.delete( aTerm );
 aReader.close();
If the variable path="xxx/foo.txt" then I am able to delete the
document.
However, if path variable has "-" in the string, the delete method
does not work
  e.g. path="xxx-yyy/foo.txt"  // Does Not work!!
Can I get around this problem?  I cannot substitute the minus character with '.' as it has other implications.
Is this a bug?  I am using Lucene 1.4-final version.
Thanks for the help
Atul


Re: rackmount lucene/nutch - Re: google mini? who needs it when Lucene is there

2005-01-27 Thread Erik Hatcher
I've often said that there is a business to be had in packaging up 
Lucene (and now Nutch) into a cute little box with user friendly 
management software to search your intranet.  SearchBlox is already 
there (except they don't include the box).

I really hope that an application like SearchBlox/Zilverline can be 
created as part of the Lucene project itself, replacing the sad demos 
that currently ship with Lucene.  I've got so many things on my plate 
that I don't foresee myself getting to this as soon as I'd like, but I 
would most definitely support and contribute what time I could to such 
an effort.  If the web UI used Tapestry, I'd be very inclined to dig in 
hardcore to it.  Any other web UI technology would likely turn me off.  
One of these days I'll Tapestry-ify Nutch just for grins and submit it 
as a replacement for the JSPs.

And I'm even more sold on it if Mac Minis are involved!  :)
Erik
On Jan 27, 2005, at 7:16 PM, David Spencer wrote:
This reminds me, has anyone ever discussed something similar:
- rackmount server ( or for coolness factor, that mini mac)
- web i/f for config/control
- of course the server would have the following s/w:
-- web server
-- lucene / nutch
Part of the work here I think is having a decent web i/f to configure 
the thing and to customize the L&F of the search results.


jian chen wrote:
Hi,
I was searching using google and just found that there was a new
feature called "google mini". Initially I thought it was another free
service for small companies. Then I realized that it costs quite some
money ($4,995) for the hardware and software. (I guess the proprietary
software costs a whole lot more than actual hardware.)
The "nice" feature is that, you can only index up to 50,000 documents
with this price. If you need to index more, sorry, send in the
check...
It seems to me that any small biz will be ripped off if they install this google mini thing, compared to using Lucene to implement an easy-to-use search application, which could search up to whatever number of documents you can imagine.
I hope the Lucene project gets exposed more to the enterprise so that people know that they have not only cheaper but, more importantly, BETTER alternatives.
Jian


Re: LuceneReader.delete (term t) Failure ?

2005-01-27 Thread Erik Hatcher
How did you index the "uid" field?  Field.Keyword?  If not, that may be 
the problem in that the field was analyzed.  For a key field like this, 
it needs to be unanalyzed/untokenized.
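
In a nutshell (a sketch - the exact tokens depend on the analyzer):

    // Field.Text analyzes the value: "xxx-yyy/foo.txt" becomes several
    // terms (roughly "xxx", "yyy", "foo.txt"), so a delete by
    // new Term("uid", "xxx-yyy/foo.txt") matches nothing.
    doc.add(Field.Text("uid", path));

    // Field.Keyword indexes the whole value as one term, so the
    // Term-based delete finds it, hyphens and all.
    doc.add(Field.Keyword("uid", path));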

Erik
On Jan 27, 2005, at 6:21 PM, <[EMAIL PROTECTED]> wrote:
Hi,
I am trying to delete a document from Lucene index using:
 Term aTerm = new Term( "uid", path );
 aReader.delete( aTerm );
 aReader.close();
If the variable path="xxx/foo.txt" then I am able to delete the 
document.

However, if path variable has "-" in the string, the delete method 
does not work

  e.g. path="xxx-yyy/foo.txt"  // Does Not work!!
Can I get around this problem?  I cannot substitute the minus character with '.' as it has other implications.

Is this a bug?  I am using Lucene 1.4-final version.
Thanks for the help
Atul


Re: text highlighting

2005-01-26 Thread Erik Hatcher
Also, there are some examples in the Lucene in Action source code (grab it from http://www.lucenebook.com; see HighlightIt.java).

Erik
On Jan 26, 2005, at 5:52 PM, markharw00d wrote:
Michael Celona wrote:
Does any have a working example of the highlighter class found in the
sandbox?

There are several in the accompanying Junit test:
http://cvs.apache.org/viewcvs.cgi/jakarta-lucene-sandbox/contributions/highlighter/src/test/org/apache/lucene/search/highlight/

Cheers
Mark


Re: Suggestions for documentation or LIA

2005-01-26 Thread Erik Hatcher
On Jan 26, 2005, at 10:25 AM, Ian Soboroff wrote:
Erik Hatcher <[EMAIL PROTECTED]> writes:
By all means, if you have other suggestions for our site, let us know
at [EMAIL PROTECTED]
One of the things I would like to see, but which isn't either in the
Lucene site, documentation, or "Lucene in Action", is a complete
description of how the retrieval algorithm works.  That is, how the
HitCollector, Scorers, Similarity, etc all fit together.
I'm involved in a project which to some degree is looking at poking
deeply into this part of the Lucene code.  We have a nice (non-Lucene)
framework for working with a wider variety of similarity functions (beyond tf-idf), which should also be expandable to include
query expansion, relevance feedback, and the like.
I used to think that integrating it would be as simple as hacking in
Similarity, but I'm beginning to think it might need broader changes.
I could obviously hook in our whole retrieval setup by just diving for
an IndexReader and doing it all by hand, but then I would have to redo
the incremental search and possibly the rich query structure, which
would be a loss.
So anyway, I got LIA hoping for a good explanation (not a good
Explanation) on this bit, but it wasn't there.
Hacking Similarity wasn't covered in LIA for one simple reason - 
Lucene's built-in scoring mechanism really is good enough for almost 
all projects.  The book was written for developers of those projects.

Personally, I've not had to hack Similarity, though I've toyed with it 
in prototypes and am using a minor tweak (turning off length 
normalization for the "title" field) for the lucenebook.com book 
indexing.
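
That tweak is about as small as Similarity hacks get; a sketch (the "title" field name is assumed):

    import org.apache.lucene.search.DefaultSimilarity;

    public class NoTitleNormSimilarity extends DefaultSimilarity {
      // Give the "title" field a constant length norm; defer to the
      // default behavior for every other field.
      public float lengthNorm(String fieldName, int numTokens) {
        if ("title".equals(fieldName)) return 1.0f;
        return super.lengthNorm(fieldName, numTokens);
      }
    }

It gets installed via IndexWriter.setSimilarity() at indexing time and Searcher.setSimilarity() at query time.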

  There are some hints
on the Lucene site, but nothing complete.  If I muddle it out before
anything gets contributed, I'll try to write something up, but don't
expect anything too soon...
And maybe you'd contribute what you write to LIA 2nd edition :)
Erik


Re: Search on heterogenous index

2005-01-26 Thread Erik Hatcher
On Jan 26, 2005, at 5:44 AM, Simeon Koptelov wrote:
Heterogenous Documents/indices are OK - check out the second hit:
  http://www.lucenebook.com/search?query=heterogenous+different
Thanks, I'll consider buying "Lucene in Action".
Our master plan is working!  :)   Just kidding.  I have on my TODO
list to aggregate more Lucene related content (like the javadocs, 
Lucene's own documentation, perhaps a crawl of the wiki and the Lucene 
resources) into our search engine so that it becomes a richer resource 
and seems less like a marketing ploy.  Though the highlighted snippets
do have enough information to be useful in some cases, which is nice.  
I will start dedicating a few minutes a day to blog some useful 
content.

By all means, if you have other suggestions for our site, let us know 
at [EMAIL PROTECTED]

Erik


Re: multiple filters

2005-01-25 Thread Erik Hatcher
There is a ChainedFilter in the jakarta-lucene-sandbox CVS repository 
allowing you to AND/OR/XOR and more with multiple filters.  I covered
it in LIA:

http://www.lucenebook.com/search?query=ChainedFilter
And the source code you can download has some code that demonstrates it.
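
For a case like this thread's date ranges, the shape is roughly (a sketch - check the sandbox source for the exact constructor signatures):

    import java.util.Date;
    import org.apache.lucene.misc.ChainedFilter;
    import org.apache.lucene.search.DateFilter;
    import org.apache.lucene.search.Filter;

    // AND two date-range filters together instead of using RangeQuerys
    Filter modified = new DateFilter("modified_date", fromDate, toDate);
    Filter created  = new DateFilter("created_date", fromDate, toDate);
    Filter both = new ChainedFilter(new Filter[] { modified, created },
                                    ChainedFilter.AND);
    Hits hits = searcher.search(query, both);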
Erik
On Jan 25, 2005, at 6:57 PM, aaz wrote:
Hello,
Every document in my index has 2 date related fields.
created_date and modified_date stored via the DateField.dateToString()
Users want to be able to search via between-like queries such as:
(where modified_date > X AND modified_date < Y AND created_date >= X AND created_date <= Y)

Now I tried using RangeQuery's for this but quickly ran into the 
TooManyClauses exception issue. The next thing I am looking at is the 
use of DateFilters to pass in with the query at searcher.search(). 
However, the interface only supports one filter.  Is it possible to
pass multiple filters that would be needed for my example above?

thanks




Re: reading fields selectively

2005-01-25 Thread Erik Hatcher
I'm not sure what the status of the selective field access feature is, 
but from what you've written it sounds like you are not aware of the 
existing feature of fields to turn off storage.  For an "all" field, 
you probably do not want to access its actual value, so simply do not 
store it in the index.  Field.UnStored or Field.Text with a 
java.io.Reader will do the trick.  If you use unstored fields for text, 
and only have a Field.Keyword "id" field, you will minimize the size of 
your index and the Document objects obtained from Hits will be tiny.
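
Concretely (a sketch):

    Document doc = new Document();
    // the only stored field: the database primary key you want back from Hits
    doc.add(Field.Keyword("id", id));
    // indexed for searching but never stored - keeps the index and the
    // retrieved Document objects small
    doc.add(Field.UnStored("all", allText));
    // Reader-based fields are always unstored as well
    doc.add(Field.Text("contents", new FileReader(file)));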

Erik
On Jan 25, 2005, at 3:38 AM, sergiu gordea wrote:
Hi to all lucene developers,
The "read fields selectively" feature would be a very useful for me.
Do you plan to include it in the next lucene realeases?
I can patch lucene, but I will need to do it each time I upgrade my 
version,
and probably I would need to run the unit tests,  and this  is just 
duplicated effort

 I'm working on an application that uses lucene only to index 
information that we store in
the database and in external files. We perform the search with lucene 
to get the IDs of our
database records. The ID keyword field is the only one that we need to 
read from the index.
Each document may index a few txt, pdf, doc, html, ppt, or xls files, and some other database fields, so the size of the Lucene documents may be quite big.

 Writing the ID as the first field in the index, and having the possibility to read only the ID from the index, would be a great performance improvement in our case (speed and memory usage).

 Another frequently met situation is having an index with an ALL field, in order to perform searches easily, plus a few other separate fields needed to get information from the index and to apply special constraints (e.g. for extended search functionality).  Also in this case, the information from the ALL field won't be read, but Lucene will load it into memory, and the memory usage will be at least twice as big.

Thanks for understanding,
Sergiu
mark harwood wrote:
There is no API for this, but I recall somebody
talking about adding support for this a few months
back
See
http://marc.theaimsgroup.com/?l=lucene-dev&m=109485996612177&w=2
This implementation was working on a version of Lucene
before compression was introduced so things may have
changed a little.
Cheers,
Mark
	
	
		



Re: Sort Performance Problems across large dataset

2005-01-24 Thread Erik Hatcher
On Jan 24, 2005, at 7:01 PM, Peter Hollas wrote:
I am working on a publicly accessible Struts based
Well there's the problem right there :))
(just kidding)
To sort the resultset into alphabetical order, we added the species 
names as a separate keyword field, and sorted using it whilst
querying. This solution works fine, but is unacceptable since a query 
that returns thousands of results can take upwards of 30 seconds to 
sort them.
30 seconds... wow.
My question is whether it is possible to somehow return the names in 
alphabetical order without using a String SortField. My last resort 
will be to perform a monthly index rebuild, and return results by 
index order (about a day to re-index!). But ideally there might be a 
way to modify the Lucene API to incorporate a scoring system in a way 
that scores by lexical order.
What about assigning a numeric value field for each document with the 
number indicating the alphabetical ordering?  Off the top of my head, 
I'm not sure how this could be done, but perhaps some clever hashing 
algorithm could do this?  Or consider each character position one digit 
in a base 27 (or 27 to include a space) and construct a number for 
that?  (though that would be an enormous number and probably too large) 
- sorry my off-the-cuff estimating skills are not what they should be.

Certainly sorting by a numeric value is far less resource intensive 
than by String - so perhaps that is worth a try?  At the very least, 
give each document a random number and try sorting by that field (the 
value of the field can be Integer.toString()) to see how it compares 
performance-wise.
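
The mechanics of that experiment are tiny (a sketch; "nameOrder" is a made-up field whose rank you would precompute whenever you re-index):

    // indexing: the nth species name in alphabetical order gets rank n
    doc.add(Field.Keyword("nameOrder", Integer.toString(rank)));

    // searching: int sorting is far cheaper than String sorting
    Hits hits = searcher.search(query,
        new Sort(new SortField("nameOrder", SortField.INT)));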

Erik


Re: Filtering w/ Multiple Terms

2005-01-24 Thread Erik Hatcher
As Paul suggested, output the Lucene document numbers from your Hits, 
and also output which bit you're setting in your filter.  Do those sets 
overlap?

Erik
On Jan 24, 2005, at 2:13 PM, Jerry Jalenak wrote:
Paul / Erik -
I'm using the ParallelMultiSearcher to search three indexes concurrently
-
hence the three entries into AccountFilter.  If I remove the filter 
from my
query, and simply enter the query on the command line, I get two hits 
back.
In other words, I can enter this:

smith AND (account:0011)
and get hits back.  When I add the filter back in (which should take 
care of
the account:0011 part of the query), and enter only smith as my query, 
I get
0 hits.


Jerry Jalenak
Senior Programmer / Analyst, Web Publishing
LabOne, Inc.
10101 Renner Blvd.
Lenexa, KS  66219
(913) 577-1496
[EMAIL PROTECTED]

-----Original Message-----
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Monday, January 24, 2005 1:07 PM
To: Lucene Users List
Subject: Re: Filtering w/ Multiple Terms

On Jan 24, 2005, at 12:26 PM, Jerry Jalenak wrote:
I spent some time reading the Lucene in Action book this weekend
(great job,
btw)
Thanks!
public class AccountFilter extends Filter
I see where the AccountFilter is setting the corresponding
'bits', but
I end
up without any 'hits':
Entering AccountFilter...
Entering AccountFilter...
Entering AccountFilter...
Setting bit on
Setting bit on
Setting bit on
Setting bit on
Setting bit on
Leaving AccountFilter...
Leaving AccountFilter...
Leaving AccountFilter...
... Found 0 matching documents in 1000 ms
Can anyone tell me what I've done wrong?
A filter constrains which documents will be consulted during
a search,
but the Query needs to match some documents that are turned on by the
filter bits.  I'm guessing that your Query did not match any of the
documents you turned on.
Erik




Re: Filtering w/ Multiple Terms

2005-01-24 Thread Erik Hatcher
On Jan 24, 2005, at 12:26 PM, Jerry Jalenak wrote:
I spent some time reading the Lucene in Action book this weekend 
(great job,
btw)
Thanks!
public class AccountFilter extends Filter
I see where the AccountFilter is setting the corresponding 'bits', but
I end
up without any 'hits':

Entering AccountFilter...
Entering AccountFilter...
Entering AccountFilter...
Setting bit on
Setting bit on
Setting bit on
Setting bit on
Setting bit on
Leaving AccountFilter...
Leaving AccountFilter...
Leaving AccountFilter...
... Found 0 matching documents in 1000 ms
Can anyone tell me what I've done wrong?
A filter constrains which documents will be consulted during a search, 
but the Query needs to match some documents that are turned on by the 
filter bits.  I'm guessing that your Query did not match any of the 
documents you turned on.

Erik


Re: Stemming

2005-01-24 Thread Erik Hatcher
On Jan 24, 2005, at 7:24 AM, Kevin L. Cobb wrote:
Do stemming algorithms take into consideration abbreviations too?
No, they don't.  Adding abbreviations, aliases, synonyms, etc. is not
stemming.

And, the next logical question, if stemming does not take care of
abbreviations, are there any solutions that include abbreviations 
inside
or outside of Lucene?
Nothing built into Lucene does this, but the infrastructure allows it 
to be added in the form of a custom analysis step.  There are two basic 
approaches, adding aliases at indexing time, or adding them at query 
time by expanding the query.  I created some example analyzers in 
Lucene in Action (grab the source code from the site linked below) that 
demonstrate how this can be done using WordNet (and mock) synonym 
lookup.  You could extrapolate this into looking up abbreviations and 
adding them into the token stream.

http://www.lucenebook.com/search?query=synonyms
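
The skeleton of such a filter looks something like this (a sketch, not the book's code verbatim; the abbreviation map entries are hypothetical, and position increment 0 makes the expansion occupy the same slot as the original token):

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Stack;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;

    public class AbbreviationFilter extends TokenFilter {
      private Map expansions = new HashMap();  // e.g. "corp" -> "corporation"
      private Stack pending = new Stack();

      public AbbreviationFilter(TokenStream in) {
        super(in);
        expansions.put("corp", "corporation");  // hypothetical entry
      }

      public Token next() throws IOException {
        if (!pending.isEmpty()) {
          return (Token) pending.pop();
        }
        Token token = input.next();
        if (token == null) return null;
        String expansion = (String) expansions.get(token.termText());
        if (expansion != null) {
          // inject the expansion at the same position as the abbreviation
          Token injected =
              new Token(expansion, token.startOffset(), token.endOffset());
          injected.setPositionIncrement(0);
          pending.push(injected);
        }
        return token;
      }
    }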
Erik


Re: Limit and offset

2005-01-23 Thread Erik Hatcher
Random accessing a start point in Hits will probably be sufficient for 
what you want to do.  I do this for all the web applications I've built 
with Lucene and performance has been more than acceptable.
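
The paging itself is just (a sketch):

    int offset = 10, limit = 10;
    Hits hits = searcher.search(query);
    int end = Math.min(offset + limit, hits.length());
    for (int i = offset; i < end; i++) {
        Document doc = hits.doc(i);  // Hits fetches documents lazily, in batches
        // render doc...
    }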

Erik

On Jan 23, 2005, at 9:37 AM, Kristian Hellquist wrote:
Hi!
I want to retrieve a selected area of the hits I get when I search the
index, similar to an SQL clause:
SELECT foo FROM bar OFFSET 10 LIMIT 10
How should I do this and experience good performance? Or is it just so
simple that I use the method Hits.doc(int)?
Thanks!
Kristian


Re: Document 'Context' & Relation to each other

2005-01-22 Thread Erik Hatcher
On Jan 21, 2005, at 10:47 PM, Paul Smith wrote:
As a log4j developer, I've been toying with the idea of what Lucene 
could do for me, maybe as an excuse to play around with Lucene.
First off, let me thank you for your work with log4j!  I've been using 
it at lucenebook.com with the SMTPAppender (once I learned that I 
needed a custom trigger to release e-mails when I wanted, not just on 
errors) and it's been working great.

Now, I could provide a Field to the LoggingEvent Document that has a 
sequence #, and once a user has chosen an appropriate matching event, 
do another search for the documents with a Sequence # between +/- the 
context size.
My question is, is that going to be an efficient way to do this? The 
sequence # would be treated as text, wouldn't it?  Would the range 
search on an int be the most efficient way to do this?

I know from the Hits documentation that one can retrieve the Document 
ID of a matching entry.  What is the contract on this Document ID?  Is 
each Document added to the Index given an increasing number?  Can one 
search an index by Document ID?  Could one search for Document ID's 
between a range?   (Hope you can see where I'm going here).

You wouldn't even need the sequence number.  You'll certainly be adding 
the documents to the index in the proper sequence already (right?).  It 
is easy to random access documents if you know Lucene's document ids.  
Here's the pseudo-code

	- construct an IndexReader
	- open an IndexSearcher using the IndexReader
	- search, getting Hits back
	- for a hit you want to see the context, get hits.id(hit#)
	- subtract context size from the id, grab documents using 
reader.document(id)

You don't "search" for a document by id, but rather jump right to it 
with IndexReader.
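
In (roughly) compilable form, assuming insertion order matches document-id order (a sketch; contextSize is yours to choose):

    IndexReader reader = IndexReader.open("/path/to/log-index");
    IndexSearcher searcher = new IndexSearcher(reader);
    Hits hits = searcher.search(query);

    int contextSize = 5;                           // tune to taste
    int id = hits.id(0);                           // the matching event
    int from = Math.max(0, id - contextSize);
    int to = Math.min(reader.maxDoc() - 1, id + contextSize);
    for (int i = from; i <= to; i++) {
        if (!reader.isDeleted(i)) {
            Document event = reader.document(i);   // the surrounding context
            // display event...
        }
    }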

Many thanks for an excellent API, and kudos to Erik & Otis for a great 
eBook btw.
Thanks!
Erik


Re: Search Chinese in Unicode !!!

2005-01-21 Thread Erik Hatcher
On Jan 21, 2005, at 4:49 AM, Eric Chow wrote:
How do I create an index with Chinese (UTF-8 encoded) HTML and search it with Lucene?
Indexing and searching Chinese basically is no different than using 
English with Lucene.  We covered a bit about it in Lucene in Action:

http://www.lucenebook.com/search?query=chinese
And a screenshot here:
http://www.blogscene.org/erik/LuceneInAction/i18n.html
The main issues of dealing with Chinese, and of course other languages, are encoding concerns when reading in the text, in both indexing and querying, and analysis (as you can see from the screenshot).
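
The reading-in half mostly amounts to being explicit about the encoding instead of trusting the platform default (a sketch):

    import java.io.FileInputStream;
    import java.io.InputStreamReader;
    import java.io.Reader;

    // decode the file as UTF-8 explicitly; Reader-based fields are unstored
    Reader reader = new InputStreamReader(new FileInputStream(htmlFile), "UTF-8");
    doc.add(Field.Text("contents", reader));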

Lucene itself works with Unicode fine and you're free to index anything.
Erik


Re: Filtering w/ Multiple Terms

2005-01-20 Thread Erik Hatcher
On Jan 20, 2005, at 5:02 PM, Jerry Jalenak wrote:
In looking at the examples for filtering of hits, it looks like I can 
only
specify a single term; i.e.

Filter f = new QueryFilter(new TermQuery(new Term("acct",
"acct1")));
I need to specify more than one term in my filter.  Short of using something like ChainedFilter, how are others handling this?
You can make as complex of a Query as you want for QueryFilter.  If you 
want to filter on multiple terms, construct a BooleanQuery with nested 
TermQuery's, either in an AND or OR fashion.
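
For instance, to filter on several accounts OR'd together (a sketch; the two booleans in the 1.4 add() signature mean required/prohibited):

    BooleanQuery accounts = new BooleanQuery();
    // required=false, prohibited=false => an OR of the account terms
    accounts.add(new TermQuery(new Term("acct", "acct1")), false, false);
    accounts.add(new TermQuery(new Term("acct", "acct2")), false, false);
    Filter f = new QueryFilter(accounts);
    Hits hits = searcher.search(query, f);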

Erik


Re: LuceneRAR project announcement

2005-01-19 Thread Erik Hatcher
On Jan 19, 2005, at 3:30 PM, Joseph Ottinger wrote:
On Wed, 19 Jan 2005, Erik Hatcher wrote:
On Jan 19, 2005, at 2:27 PM, Joseph Ottinger wrote:
After babbling endlessly about an RDBMS directory and my lack of
success
with it, I've created a project on java.net to create a Lucene JCA
component, to allow J2EE components to interact with a Lucene 
service.
It's at https://lucenerar.dev.java.net/ currently.
Could you elaborate on some use cases?
Sure, and I'll pick the one that's been driving me along:
I have a set of J2EE servers, all of which can generate new content for
search, and all of which will be performing searches. They're on 
separate
machines. Sharing directories isn't my idea of "doing J2EE correctly."
"doing J2EE correctly" is a funny phrase.   If sharing directories 
works and gets the job done right, on time, under budget, can be 
adjusted later if needed, and has been reasonably well tested, then 
you've done it "right".  And since its in Java and not on a cell phone, 
its basically "J2EE".

Also, what about using Lucene over RMI via the built-in RemoteSearchable facility?
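
That route is nearly code-free already; a sketch (the server side needs an RMI registry running, e.g. LocateRegistry.createRegistry(1099)):

    // server: export an existing index over RMI
    Searchable local = new IndexSearcher("/path/to/index");
    RemoteSearchable remote = new RemoteSearchable(local);
    java.rmi.Naming.rebind("//localhost/Searchable", remote);

    // client: look it up and search it like any local Searchable
    Searchable s = (Searchable) java.rmi.Naming.lookup("//localhost/Searchable");
    Searcher searcher = new MultiSearcher(new Searchable[] { s });
    Hits hits = searcher.search(query);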

Therefore, I chose to represent Lucene as an enterprise service, one
communicated to via a remote service instead, so that every module can
communicate with Lucene without realising the communication layer... 
for
the most part.
And this is where I think the abstraction leaks.
The Nutch project has a very scalable "enterprise" approach to this 
type of remote service also.

 Plus, I no longer violate my purist sensibilities.
Ah, now we get to the real rationale!  :)
I'm not giving you, personally, a hard time, really ... but rather this 
purist approach, where "purist" means fitting into the acronyms under 
the J2EE umbrella.  I've been there myself, read the specs, and cringed 
when I saw file system access from a session bean, and so on.

The Hits object could CERTAINLY use optimization - callbacks into the
connector would probably be acceptable, for example.
Gotcha.  Yes, callbacks would be the right approach with this type of 
abstraction.

JCA sounds like an unnecessary abstraction around Lucene - though I'm
open to being convinced otherwise.
I'm more than happy to talk about it. If I can fulfill my needs with no
code, hey, that's great!
Would RemoteSearchable get you closer to no code?
 I just haven't been able to successfully do so
yet, and everyone to whom I've spoken who says that they HAVE 
managed...
well, they've almost invariably done so by lowering the bar a great 
deal
in order to accept what Lucene requires.
I'm definitely a skeptic when it comes to generic layers on top of 
Lucene, though there is definitely a yearning for easier management of 
the lower-level details.

I'll definitely follow your work with LuceneRAR closely and will do 
what I can to help out in this forum.  So take my feedback as 
constructive criticism, but keep up the good work!

Erik


Re: [newbie] Confused about PrefixQuery

2005-01-19 Thread Erik Hatcher
On Jan 19, 2005, at 4:12 PM, Jerry Jalenak wrote:
Thanks for the reply.  Some lists want all the info, some don't.  Just
thought
I'd try to provide as much info as possible  8-)
The info is good... I just push for simple examples :)  By simplifying, 
often the problem becomes apparent and trivial.

That being said, where do I find Luke?
Silly response, but go to Google, type in _luke lucene_ and press "I'm 
feeling lucky" :)

But, since I already have the URL handy, here it is:
http://www.getopt.org/luke/
Erik


Re: [newbie] Confused about PrefixQuery

2005-01-19 Thread Erik Hatcher
On Jan 19, 2005, at 3:16 PM, Jerry Jalenak wrote:
The text files have two control lines at the beginning of them - CC> 
and
AN>.
That's quite a complex example to ask a user list to decipher.
Simplifying the example, besides making it easier for us to understand, 
would likely shed light on the problem.

Everything (I think) indexes correctly.
To be sure, try Luke out and see what got indexed exactly.  You can 
also use Luke as an ad-hoc search tool rather than writing your own.

  When I search against
this index, though, I get some weird results, especially when using an 
'*'
at the end of my criteria.
The results you got definitely are weird given the query, and in my 
initial glance through your code I did not see the issue pop out.  Luke 
will likely shed much more light on the matter.

Erik

