date:20031125

RE: Advice needed on Indexing

2003-11-25 Thread Justyna Lubkowski

Thanks...easy, I know how to do it that way.

I've thought a bit more about why the protocol and host are not added into
the index as part of the path and I think I understand why.  I hope I
understand it correctly.

-Original Message-
From: Dror Matalon [mailto:[EMAIL PROTECTED]
Sent: Wednesday, 26 November 2003 5:24 PM
To: Lucene Users List
Subject: Re: Advice needed on Indexing

On Wed, Nov 26, 2003 at 05:12:14PM +1100, Justyna Lubkowski wrote:
> Hi,
>
> I'm new to Lucene and have just started to familiarise myself with the API.
>
> I'm trying the web demo out, running IndexHTML.
>
> I'm wondering if it is possible to include the protocol and host as part of
> the path which is put into the index.  What is the simplest way to approach
> this?
>
> At the moment my results display the url as
> "/web/htdocs/test/apple/grannysmith/"
> I need to display something like
> "http://www.mysite.com/test/apple/grannysmith"

This is more of a java/JSP than a Lucene question.
Looking at the JSP code I see that it does
 String url = doc.get("url");
You then can do
 String server = request.getServerName();
to get the name of the server you're running on.
And then just use standard string manipulations to create the path.

You can also, when you create the index, go to /web/htdocs and index
from there. That way doc.get("url") will not include /web/htdocs in the
path.

Regards,

Dror

>
> Any advice on how to move in the right direction would be appreciated.
>
> Thanks - Justyna.
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>

--
Dror Matalon
Zapatec Inc
1700 MLK Way
Berkeley, CA 94709
http://www.fastbuzz.com
http://www.zapatec.com

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Advice needed on Indexing

2003-11-25 Thread Dror Matalon

On Wed, Nov 26, 2003 at 05:12:14PM +1100, Justyna Lubkowski wrote:
> Hi,
> 
> I'm new to Lucene and have just started to familiarise myself with the API.
> 
> I'm trying the web demo out, running IndexHTML.
> 
> I'm wondering if it is possible to include the protocol and host as part of
> the path which is put into the index.  What is the simplest way to approach
> this?
> 
> At the moment my results display the url as
> "/web/htdocs/test/apple/grannysmith/"
> I need to display something like
> "http://www.mysite.com/test/apple/grannysmith"

This is more of a java/JSP than a Lucene question. 
Looking at the JSP code I see that it does
 String url = doc.get("url");
You then can do
 String server = request.getServerName();
to get the name of the server you're running on.
And then just use standard string manipulations to create the path.

You can also, when you create the index, go to /web/htdocs and index
from there. That way doc.get("url") will not include /web/htdocs in the
path.
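
A minimal sketch of the string manipulation described above, assuming the
indexed "url" field holds a filesystem path under a document root such as
/web/htdocs. The class PublicUrl, the method toPublicUrl and the docRoot
parameter are hypothetical names for illustration and are not part of the
Lucene demo; in the JSP the host would come from request.getServerName().

```java
// Hypothetical helper: turns an indexed filesystem path into a public URL.
final class PublicUrl {

    /**
     * @param scheme  e.g. "http" (in a JSP this could be request.getScheme())
     * @param host    e.g. "www.mysite.com" (request.getServerName())
     * @param docRoot filesystem prefix to strip, e.g. "/web/htdocs" (assumption)
     * @param indexed value read from the index, e.g. doc.get("url")
     */
    static String toPublicUrl(String scheme, String host, String docRoot, String indexed) {
        String path = indexed;
        // Strip the document root only when it is a whole leading path segment,
        // so "/web/htdocs2/x" is left alone.
        if (path.equals(docRoot) || path.startsWith(docRoot + "/")) {
            path = path.substring(docRoot.length());
        }
        // Make sure what remains is an absolute path.
        if (!path.startsWith("/")) {
            path = "/" + path;
        }
        return scheme + "://" + host + path;
    }

    public static void main(String[] args) {
        System.out.println(toPublicUrl("http", "www.mysite.com", "/web/htdocs",
                "/web/htdocs/test/apple/grannysmith/"));
    }
}
```

Doing the conversion at display time keeps the index independent of the
host name, so the same index can be served from a different site.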

Regards,

Dror

> 
> Any advice on how to move in the right direction would be appreciated.
> 
> Thanks - Justyna.
> 
> 
> 

-- 
Dror Matalon
Zapatec Inc 
1700 MLK Way
Berkeley, CA 94709
http://www.fastbuzz.com
http://www.zapatec.com


Advice needed on Indexing

2003-11-25 Thread Justyna Lubkowski

Hi,

I'm new to Lucene and have just started to familiarise myself with the API.

I'm trying the web demo out, running IndexHTML.

I'm wondering if it is possible to include the protocol and host as part of
the path which is put into the index.  What is the simplest way to approach
this?

At the moment my results display the url as
"/web/htdocs/test/apple/grannysmith/"
I need to display something like
"http://www.mysite.com/test/apple/grannysmith"

Any advice on how to move in the right direction would be appreciated.

Thanks - Justyna.



log4j.properties

2003-11-25 Thread Tun Lin

I have created the following "log4j.properties" and put it in my classpath, but
I still get that error. Can anyone help?

log4j.rootCategory=stdout

log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d %c - %m%n

Chinese input.

2003-11-25 Thread Tun Lin

Hi,

How do I analyse Chinese text in Lucene?

Do I use an Analyzer? If so, how do I go about using it?

RE: Lucene refresh index function (incremental indexing).

2003-11-25 Thread Tun Lin

When I integrate with PDFBox, I can no longer update, delete or rename files.
If I do any of these, I get the message: Lock obtain timed out.

Can anyone help? 

-Original Message-
From: Pleasant, Tracy [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, November 25, 2003 11:42 PM
To: Lucene Users List; [EMAIL PROTECTED]
Subject: RE: Lucene refresh index function (incremental indexing).

I was able to get PDFBox to work with my JSP webpages. 

I think you will have to in a way write your own code to do the PDF files (while
still calling the Lucene functions)

 doc = LucenePDFDocument.getDocument(file);

-Original Message-
From: Tun Lin [mailto:[EMAIL PROTECTED]
Sent: Monday, November 24, 2003 11:07 PM
To: 'Lucene Users List'
Subject: RE: Lucene refresh index function (incremental indexing).

Does it support indexing the contents of pdf files? I have found one project
called PDFBox that can be integrated with Lucene to search inside of the pdf
files. Currently, Lucene can only search for the pdf filename. I tried with
PDFBox and I got the following message when I typed the command: java
org.apache.lucene.demo.IndexHTML -create -index c:\\index .. 

log4j:WARN No appenders could be found for logger (org.pdfbox.pdfparser.PDFParser).
log4j:WARN Please initialize the log4j system properly.

Can anyone advise?

-Original Message-
From: Doug Cutting [mailto:[EMAIL PROTECTED]
Sent: Tuesday, November 25, 2003 5:01 AM
To: Lucene Users List
Subject: Re: Lucene refresh index function (incremental indexing).

Tun Lin wrote:
> These are the steps I took:
> 
> 1) I compile all the files in a particular directory using the command: 
> java org.apache.lucene.demo.IndexHTML -create -index c:\\index .. 
> , putting all the indexed files in c:\\index.
> 2) Everytime, I added an additional file in that directory. I need to 
> reindex/recompile that directory to generate the indexes again. As the
> directory gets larger, the indexing takes a longer time.
> 
> My question is how do I generate the indexes automatically everytime a
> new document is added in that directory without me recompiling everytime
> manually?

To update, try removing the '-create' from the command line.  The demo code
supports incremental updates.  It will re-scan the directory and figure out
which files have changed, what new files have appeared and which previously
existing files have been removed.
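
The kind of re-scan described above can be sketched as a comparison of two
snapshots. This is only an illustration of the idea, not the demo's actual
code: the class IndexDiff and both maps (path to last-modified time, one
taken from the index and one from the directory) are assumptions.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Illustrative only: compares two snapshots of path -> last-modified time.
final class IndexDiff {
    final Set<String> added = new TreeSet<>();
    final Set<String> changed = new TreeSet<>();
    final Set<String> removed = new TreeSet<>();

    static IndexDiff compare(Map<String, Long> indexed, Map<String, Long> onDisk) {
        IndexDiff diff = new IndexDiff();
        for (Map.Entry<String, Long> e : onDisk.entrySet()) {
            Long previous = indexed.get(e.getKey());
            if (previous == null) {
                diff.added.add(e.getKey());      // new file: add it to the index
            } else if (!previous.equals(e.getValue())) {
                diff.changed.add(e.getKey());    // modified: delete old entry, re-index
            }
        }
        for (String path : indexed.keySet()) {
            if (!onDisk.containsKey(path)) {
                diff.removed.add(path);          // gone from disk: delete from the index
            }
        }
        return diff;
    }
}
```

Only the files in the three sets need any work, which is why an incremental
run is much cheaper than rebuilding with -create.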

Doug


unexpected results from query

2003-11-25 Thread marc

Hi,

assume a field has the following text

"Adenylate kinase (mitochondrial GTP:AMP phosphotransferase) "

the following searches all return this document

AMP
&
&

can someone explain this to me..i figured that only the first query would be successful

Thanks,
Marc

bad link in mailing list archive?

2003-11-25 Thread Gerret Apelt

When I load

http://nagoya.apache.org/eyebrowse/[EMAIL PROTECTED]&msgNo=2648

there are three replies listed at the bottom of the page, one by Otis 
Gospodnetic. The subject of his reply is "Concurency in Lucene".

When I click on Otis' reply, my browser loads a post with a different 
subject; "Problems compiling the java source codes of lucene search engine".

Just a heads up -- seems like something funky's going on :)

cheers,
Gerret

new release: 1.3 RC3

2003-11-25 Thread Doug Cutting

A new Lucene release is available.

It can be downloaded from:

  http://cvs.apache.org/dist/jakarta/lucene/v1.3-rc3/

Release notes are at:

http://cvs.apache.org/viewcvs.cgi/*checkout*/jakarta-lucene/CHANGES.txt?rev=1.58

Enjoy!

Doug




Re: Lucene command line tool

2003-11-25 Thread Otis Gospodnetic

Hm, I don't know of any such tools.  It would be nice to have something
like that.  If you find such a tool, or write it yourself, let us know
about the URL, so we can include it on the Lucene site.

Otis

--- Dror Matalon <[EMAIL PROTECTED]> wrote:
> Hi,
> 
> I looked around the archives and didn't see anything about this subject.
> Is there a Lucene command line tool that lets you look at an index, run
> queries, and look at different metrics?
> 
> Something similar to Luke http://www.getopt.org/luke/, except that it'd
> run on the command line, have a history, etc.
> 
> Regards,
> 
> Dror
> 
> 
> -- 
> Dror Matalon
> Zapatec Inc 
> 1700 MLK Way
> Berkeley, CA 94709
> http://www.fastbuzz.com
> http://www.zapatec.com
> 
> 



Re: Wildcard searches and BooleanQuery$TooManyClauses

2003-11-25 Thread Otis Gospodnetic

Correct.
As for side-effect, well, things will be slower, obviously :)
Increase the limit, perform a search, and see if it's still
sufficiently fast...that's what I would do. :)

Otis

--- Dror Matalon <[EMAIL PROTECTED]> wrote:
> 
> This was raised in 
> 
> http://www.mail-archive.com/[EMAIL PROTECTED]/msg04696.html
> 
> and not really answered.
> 
> 
> If I do 
> 
> +(contents:luc* description:luc*)
> 
> Things work fine. However if I do 
> +(contents:car* description:car*)
> 
> I get the following exception
> 
> 
> org.apache.lucene.search.BooleanQuery$TooManyClauses
>
> start of original exception stack trace---
>
> org.apache.lucene.search.BooleanQuery$TooManyClauses
>   at org.apache.lucene.search.BooleanQuery.add(BooleanQuery.java:109)
>   at org.apache.lucene.search.BooleanQuery.add(BooleanQuery.java:101)
>   at org.apache.lucene.search.PrefixQuery.rewrite(PrefixQuery.java:85)
>   at org.apache.lucene.search.BooleanQuery.rewrite(BooleanQuery.java:240)
>   at org.apache.lucene.search.IndexSearcher.rewrite(IndexSearcher.java:189)
>   at org.apache.lucene.search.Query.weight(Query.java:120)
>   at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:128)
>   at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:93)
>   at org.apache.lucene.search.Hits.<init> (Hits.java:80)
>   at org.apache.lucene.search.Searcher.search(Searcher.java:71)
>   at org.apache.lucene.search.Searcher.search(Searcher.java:65)
>   at com.fastbuzz.search.Results.fetchResults(Results.java:134)
>   at com.fastbuzz.search.Results.<init> (Results.java:
>   ...
> 
> 
> My guess is that wild cards are rewritten as a bunch of BooleanQuery(s)
> so that car* is actually
> car
> cars
> carma
> ...
> 
> Since I have more than 1024 terms in my index that start with car, I get
> this exception. 
> 
> What are the performance repercussions of increasing maxClauseCount from
> 1024 to something large, say 64K?
> 
> 
> Regards,
> 
> Dror
> 
> -- 
> Dror Matalon
> Zapatec Inc 
> 1700 MLK Way
> Berkeley, CA 94709
> http://www.fastbuzz.com
> http://www.zapatec.com
> 
> 



Re: Score

2003-11-25 Thread Gerret Apelt

Pleasant, Tracy wrote:

I tried using Boost but that did absolutely nothing.

The documents I am using: 
Plain text
PDF Documents
(I have two indexes) 
 

I'm not sure what's causing your scores to be off -- unless, of course, 
your scores just look wrong to you but they're in fact just what you 
should be getting :)
One bug in my code was that for an unrelated reason, terms in one field 
would never be matched. But since other fields contained the same term, 
the document was still being reported as a hit -- with a 
lower-than-expected score. Maybe you want to double check that the 
content of each field is getting tokenized properly.. when you have a 
term t in the title field that is unique to a particular document (i.e. 
not contained in any of the other fields of that document) do you still 
get a hit on the document when searching for t? Boost factors don't help 
of course if there's no hit in the first place.

When you say you use different analyzers for different fields in your
index, how would you accomplish that? When I create the index it has a
parameter for analyzer.. unless you create different indexes , how do
you use two different ones? 
 

Use PerFieldAnalyzerWrapper:
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/analysis/PerFieldAnalyzerWrapper.html
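
To illustrate the per-field idea without depending on Lucene, here is a
stand-in sketch. None of these names are Lucene classes: PerFieldTokenizers,
lowercaseWords and wholeValue are hypothetical, and the real
PerFieldAnalyzerWrapper API is the one documented at the link above.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Illustrative stand-in for per-field analysis; these are NOT Lucene classes.
final class PerFieldTokenizers {
    private final Function<String, List<String>> fallback;
    private final Map<String, Function<String, List<String>>> byField = new HashMap<>();

    PerFieldTokenizers(Function<String, List<String>> fallback) {
        this.fallback = fallback;
    }

    // Registers a tokenizer that overrides the fallback for one field.
    void add(String field, Function<String, List<String>> tokenizer) {
        byField.put(field, tokenizer);
    }

    // Picks the field's own tokenizer, or the fallback when none is registered.
    List<String> tokenize(String field, String text) {
        return byField.getOrDefault(field, fallback).apply(text);
    }

    // Lowercases and splits on whitespace.
    static List<String> lowercaseWords(String text) {
        return Arrays.asList(text.toLowerCase(Locale.ROOT).trim().split("\\s+"));
    }

    // Keeps the whole value as a single token, e.g. for an identifier field.
    static List<String> wholeValue(String text) {
        return Arrays.asList(text);
    }
}
```

The point of the sketch is that the same field-to-tokenizer mapping has to be
applied both when indexing and when parsing queries; otherwise query terms
may never equal the terms stored for that field.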
cheers
Gerret


-Original Message-
From: Gerret Apelt [mailto:[EMAIL PROTECTED]
Sent: Monday, November 24, 2003 3:25 PM
To: Lucene Users List
Subject: Re: Score

Tracy --

it would help if you could give more detail on the types of documents, 
fields and analyzers you're using. Also what do you mean by "Multi Field
Search"? I presume you're using the MultiFieldQueryParser to have query 
terms in a user-submitted query be searched for in each field in your
index.

If I am understanding your problem, then it might be the same one I had 
a few weeks ago -- highly relevant matches would not receive a high 
ranking. (This paragraph will apply to you only if you use more than 
just one Analyzer for the set of your fields). I had six fields in my 
index, most of which were populated with a standard analyzer. I used 
self-made Analyzers for two of the fields. This turned out to be my 
problem when using MultiFieldQueryParser: I told my 
MultiFieldQueryParser instance to use only the standard analyzer. 
Instead I discovered that I needed to make use of 
org.apache.lucene.analysis.PerFieldAnalyzerWrapper and feed that to the 
MultiFieldQueryParser. Unless you do this, your problem is what's 
described here: 
http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi?file=chapter.indexing&toc=faq#q15.

Most likely, if your scoring is off, you're "doing something wrong" in 
the way you use the Lucene API -- at least, that's what I've discovered 
to be the case when my ranking is off.

If you're interested in the nitty-gritty of how scoring is done, check 
this FAQ entry:
http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi?file=chapter.search&toc=faq#q31

cheers,
Gerret
Pleasant, Tracy wrote:

 

Hi,

I'm using the Multi Field Search to search all the fields of my
documents during the search. 

When it returns results the scores are numerically low - .06, .17, etc.
I would think if I searched for "Dog" and there was a doc with "Dog" in
the title and several times in the contents of a document that it would
receive a score more like 1.0 or close to it.
Is there a way that I can tweak the score?

I tried using Boost but that did absolutely nothing.


Lucene command line tool

2003-11-25 Thread Dror Matalon

Hi,

I looked around the archives and didn't see anything about this subject.
Is there a Lucene command line tool that lets you look at an index, run
queries, and look at different metrics?

Something similar to Luke http://www.getopt.org/luke/, except that it'd
run on the command line, have a history, etc.

Regards,

Dror


-- 
Dror Matalon
Zapatec Inc 
1700 MLK Way
Berkeley, CA 94709
http://www.fastbuzz.com
http://www.zapatec.com


Re: Searching different types of words

2003-11-25 Thread Pete Lewis

Hi

I'd recommend Kstem over Porter, it performs much better on English let
alone when you get to other languages. You can get the source code for
Kstem.jar at the following website:

http://ciir.cs.umass.edu/downloads/

Pete

- Original Message - 
From: "Otis Gospodnetic" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Tuesday, November 25, 2003 5:42 PM
Subject: Re: Searching different types of words


> Yes.
> For this particular example, PorterStemFilter will do the job.
> For more complex things (e.g. a search for car returning car, auto,
> automobile, vehicle) you'll need to add thesaurus-like capability to
> your indexer.  This can be done by writing a custom Analyzer.
>
> It sounds like you have a lot of questions, but have not read much
> Lucene documentation. :)
>
> Otis
>
>
> --- "Pleasant, Tracy" <[EMAIL PROTECTED]> wrote:
> > If I search for "like" I would want the search to return documents
> > containing "like", "liked", "likes", etc.. variations of the word.
> >
> > Is there a way to tell Lucene to do this?
>
>
>
>
>



Wildcard searches and BooleanQuery$TooManyClauses

2003-11-25 Thread Dror Matalon


This was raised in 

http://www.mail-archive.com/[EMAIL PROTECTED]/msg04696.html

and not really answered.


If I do 

+(contents:luc* description:luc*)

Things work fine. However if I do 
+(contents:car* description:car*)

I get the following exception


org.apache.lucene.search.BooleanQuery$TooManyClauses

start of original exception stack trace---

org.apache.lucene.search.BooleanQuery$TooManyClauses
at org.apache.lucene.search.BooleanQuery.add(BooleanQuery.java:109)
at org.apache.lucene.search.BooleanQuery.add(BooleanQuery.java:101)
at org.apache.lucene.search.PrefixQuery.rewrite(PrefixQuery.java:85)
at org.apache.lucene.search.BooleanQuery.rewrite(BooleanQuery.java:240)
at org.apache.lucene.search.IndexSearcher.rewrite(IndexSearcher.java:189)
at org.apache.lucene.search.Query.weight(Query.java:120)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:128)
at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:93)
at org.apache.lucene.search.Hits.<init> (Hits.java:80)
at org.apache.lucene.search.Searcher.search(Searcher.java:71)
at org.apache.lucene.search.Searcher.search(Searcher.java:65)
at com.fastbuzz.search.Results.fetchResults(Results.java:134)
at com.fastbuzz.search.Results.<init> (Results.java:
  ...


My guess is that wild cards are rewritten as a bunch of BooleanQuery(s)
so that car* is actually
car
cars
carma
...

Since I have more than 1024 terms in my index that start with car, I get
this exception. 

What are the performance repercussions of increasing maxClauseCount from
1024 to something large, say 64K?
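
The guess above can be made concrete with a small stand-alone sketch. It
only mimics the fan-out of a prefix into one clause per matching term; the
class PrefixExpansion and its method are hypothetical, and the exception
type is a plain IllegalStateException rather than Lucene's TooManyClauses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Illustrative only: one clause per indexed term that starts with the prefix.
final class PrefixExpansion {
    static final int MAX_CLAUSES = 1024; // the limit mentioned in the message above

    static List<String> expand(SortedSet<String> terms, String prefix) {
        List<String> clauses = new ArrayList<>();
        // In sorted order all terms sharing a prefix are contiguous,
        // starting at the first term >= prefix.
        for (String term : terms.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break;
            }
            if (clauses.size() >= MAX_CLAUSES) {
                throw new IllegalStateException("too many clauses for prefix " + prefix);
            }
            clauses.add(term);
        }
        return clauses;
    }

    public static void main(String[] args) {
        SortedSet<String> terms = new TreeSet<>();
        terms.add("car");
        terms.add("cars");
        terms.add("carma");
        terms.add("cat");
        System.out.println(expand(terms, "car"));
    }
}
```

Under this model the cost of a prefix query grows with the number of
matching terms, so raising the limit trades memory and search time for
broader wildcards.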


Regards,

Dror

-- 
Dror Matalon
Zapatec Inc 
1700 MLK Way
Berkeley, CA 94709
http://www.fastbuzz.com
http://www.zapatec.com


Re: Collaborative Filtering API

2003-11-25 Thread Michael Giles

Yes, he was the lead Ph.D. student on the GroupLens project at Minnesota.

-Mike

At 12:18 PM 11/25/2003, you wrote:
Hello Mike,

I had a quick look over the javadoc and it looks promising, as you said. Did
Jon Herlocker work on GroupLens? I know GroupLens was quite a pioneering work
in the early days of collaborative systems...
Cheers
Ralph
> You should check out the work of Jon Herlocker at Oregon State
> (http://eecs.oregonstate.edu/iis/).  They have written a CF engine that
> has
> been on my to-do list to check out for a few months (sounds good on
> "paper").  If you get the chance to play with it, I'd be curious to hear
> your feedback.  Having a CF engine in the open source domain would be a
> great thing.
>
>



Re: Hits Highlighting

2003-11-25 Thread Dror Matalon

You can always try:

http://www.google.com/search?q=lucene+highlight&sourceid=mozilla-search&start=0&start=0&ie=utf-8&oe=utf-8

Which leads to various things including

http://www.mail-archive.com/[EMAIL PROTECTED]/msg02893.html

which includes an example at 

http://www.oscom.org/search/go?publication-id=all&queryString=Lenya&fields=all&find=Search

As far as making it look yellow rather than bold, you'll need to
customize the code to generate the right HTML.
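
As a rough sketch of what "the right HTML" could look like, the following
wraps each whole-word match in a yellow-background span instead of bold
tags. It is independent of any Lucene highlighter; the class
YellowHighlighter is hypothetical, and the input text is assumed to be
HTML-escaped already.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: marks whole-word matches with a yellow background.
final class YellowHighlighter {

    static String highlight(String escapedHtmlText, String term) {
        // Pattern.quote keeps characters in the term from being read as regex syntax.
        Pattern p = Pattern.compile("\\b" + Pattern.quote(term) + "\\b",
                Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(escapedHtmlText);
        // $0 re-inserts the matched text, preserving its original case.
        return m.replaceAll("<span style=\"background-color: yellow\">$0</span>");
    }

    public static void main(String[] args) {
        System.out.println(highlight("Lucene is fun; lucene rocks", "lucene"));
    }
}
```

A plain regex like this ignores stemming and analysis, so it will only mark
literal occurrences of the query word.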

Regards,

Dror

On Tue, Nov 25, 2003 at 12:35:36PM -0500, Pleasant, Tracy wrote:
> I have seen that one, but it doesn't include the source code, only the
> jar with classes.
> 
> I need something to actually highlight - like if you took a yellow
> marker and highlighted, not doing it in bold.
> 
>  
> 
> -Original Message-
> From: Dror Matalon [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, November 25, 2003 12:29 PM
> To: Lucene Users List
> Subject: Re: Hits Highlighting
> 
> 
> Hi,
> 
> The lucene home page has a lot of resources, including the FAQs,
> articles, javadocs and contributions. 
> 
> For instance, there's a query highlighter on the contributions page. 
> 
> 
> 
> On Tue, Nov 25, 2003 at 12:17:41PM -0500, Pleasant, Tracy wrote:
> > 
> >  Are there any  hits highlighting functions? 
> > 
> >  I have a simple one, but it gets complicated with searching multiple
> > words, having tokens, etc.
> > 
> > 
> >  
> > 
> > 
> 
> -- 
> Dror Matalon
> Zapatec Inc 
> 1700 MLK Way
> Berkeley, CA 94709
> http://www.fastbuzz.com
> http://www.zapatec.com
> 
> 

-- 
Dror Matalon
Zapatec Inc 
1700 MLK Way
Berkeley, CA 94709
http://www.fastbuzz.com
http://www.zapatec.com


Re: Search Question

2003-11-25 Thread Dror Matalon

On Tue, Nov 25, 2003 at 12:30:50PM -0500, Pleasant, Tracy wrote:
>  How come if I search for 'red_car*' it returns nothing.

Looks like lucene interprets '_' as a stop word. So 'red_car' is
actually "red car" with the double quotes. I'm not sure why it does
that.

> 
>  I am using standard analyzer, too. 
> 
> -Original Message-
> From: Dror Matalon [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, November 25, 2003 12:22 PM
> To: Lucene Users List
> Subject: Re: Search Question
> 
> 
> No, but if you use the standard analyzer searching "red*" will return
> documents with "read_car"

I meant red_car in here.

> 
> On Tue, Nov 25, 2003 at 12:00:01PM -0500, Pleasant, Tracy wrote:
> > 
> >  If I have words within a document like 
> >  
> >  red_car
> >  
> >  If I search for 'red' would it return documents containing 'red_car'?
> 
> > 
> >  
> > 
> > 
> 
> -- 
> Dror Matalon
> Zapatec Inc 
> 1700 MLK Way
> Berkeley, CA 94709
> http://www.fastbuzz.com
> http://www.zapatec.com
> 
> 

-- 
Dror Matalon
Zapatec Inc 
1700 MLK Way
Berkeley, CA 94709
http://www.fastbuzz.com
http://www.zapatec.com


RE: Hits Highlighting

2003-11-25 Thread Otis Gospodnetic

There are several highlighting solutions for Lucene out there.
I know of two that include source code, and I think at least one of
them is on the Contributions page.
There are some threads that talk about this issue, too.

Otis

--- "Pleasant, Tracy" <[EMAIL PROTECTED]> wrote:
> I have seen that one, but it doesn't include the source code, only the
> jar with classes.
> 
> I need something to actually highlight - like if you took a yellow
> marker and highlighted, not doing it in bold.
> 
>  
> 
> -Original Message-
> From: Dror Matalon [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, November 25, 2003 12:29 PM
> To: Lucene Users List
> Subject: Re: Hits Highlighting
> 
> 
> Hi,
> 
> The lucene home page has a lot of resources, including the FAQs,
> articles, javadocs and contributions. 
> 
> For instance, there's a query highlighter on the contributions page. 
> 
> 
> 
> On Tue, Nov 25, 2003 at 12:17:41PM -0500, Pleasant, Tracy wrote:
> > 
> >  Are there any  hits highlighting functions? 
> > 
> >  I have a simple one, but it gets complicated with searching multiple
> > words, having tokens, etc.
> > 
> > 
> >  
> > 
> > 
> 
> -- 
> Dror Matalon
> Zapatec Inc 
> 1700 MLK Way
> Berkeley, CA 94709
> http://www.fastbuzz.com
> http://www.zapatec.com
> 
> 



Re: Searching different types of words

2003-11-25 Thread Otis Gospodnetic

Yes.
For this particular example, PorterStemFilter will do the job.
For more complex things (e.g. a search for car returning car, auto,
automobile, vehicle) you'll need to add thesaurus-like capability to
your indexer.  This can be done by writing a custom Analyzer.
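
To show what stemming does to the example words, here is a deliberately
naive suffix stripper. It is NOT the Porter algorithm used by
PorterStemFilter, and the class NaiveStemmer is hypothetical; it only
demonstrates that "like", "liked" and "likes" can be reduced to one
shared index term.

```java
import java.util.Locale;

// Deliberately naive suffix stripping; this is NOT the Porter algorithm.
final class NaiveStemmer {

    static String stem(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        if (w.endsWith("ing") && w.length() > 5) {
            w = w.substring(0, w.length() - 3);
        } else if (w.endsWith("ed") && w.length() > 4) {
            w = w.substring(0, w.length() - 2);
        } else if (w.endsWith("s") && !w.endsWith("ss") && w.length() > 3) {
            w = w.substring(0, w.length() - 1);
        }
        // Drop a trailing "e" so that "like" and "liked" end up with the same stem.
        if (w.endsWith("e") && w.length() > 3) {
            w = w.substring(0, w.length() - 1);
        }
        return w;
    }

    public static void main(String[] args) {
        for (String w : new String[] {"like", "liked", "likes", "liking"}) {
            System.out.println(w + " -> " + stem(w));
        }
    }
}
```

Because the stem is applied to both the indexed text and the query, a search
for any one variant finds documents containing the others. The stem itself
does not have to be a real word.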

It sounds like you have a lot of questions, but have not read much
Lucene documentation. :)

Otis

--- "Pleasant, Tracy" <[EMAIL PROTECTED]> wrote:
> If I search for "like" I would want the search to return documents
> containing "like", "liked", "likes", etc.. variations of the word.
> 
> Is there a way to tell Lucene to do this? 


RE: Search Question

2003-11-25 Thread Otis Gospodnetic

Because '_' was probably removed from your input before it was indexed.

I suggest reading up on Analyzers and Tokenizers.

Otis

--- "Pleasant, Tracy" <[EMAIL PROTECTED]> wrote:
> Also searching 'red_*' returns nothing, also.
> 
> 
> 
> 
> 
> -Original Message-
> From: Dror Matalon [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, November 25, 2003 12:22 PM
> To: Lucene Users List
> Subject: Re: Search Question
> 
> 
> No, but if you use the standard analyzer searching "red*" will return
> documents with "read_car"
> 
> On Tue, Nov 25, 2003 at 12:00:01PM -0500, Pleasant, Tracy wrote:
> > 
> >  If I have words within a document like 
> >  
> >  red_car
> >  
> >  If I search for 'red' would it return documents containing 'red_car'?
> 
> > 
> >  
> > 
> > 
> 
> -- 
> Dror Matalon
> Zapatec Inc 
> 1700 MLK Way
> Berkeley, CA 94709
> http://www.fastbuzz.com
> http://www.zapatec.com
> 
> 



Re: Search Question - not returning desired results

2003-11-25 Thread Otis Gospodnetic

You have to look at Analyzers.
Figure out which one you are using and why, and see if you should be
using a different one or even write your own.
Some of the Analyzers break input on certain characters (e.g. . or _ or
...), which sounds like the problem here.
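
A quick way to reason about this is to print the tokens an analyzer would
produce. The sketch below assumes a tokenizer that splits on every
non-alphanumeric character, which is simpler than what StandardAnalyzer
really does; the class LetterSplit is hypothetical and only illustrates why
the literal string typed into a query may not exist as a term in the index.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative tokenizer: lowercases and splits on anything that is not a-z or 0-9.
final class LetterSplit {

    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String piece : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!piece.isEmpty()) {   // split() can yield a leading empty piece
                out.add(piece);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("return_results.pl"));
        System.out.println(tokens("red_car"));
    }
}
```

With this splitter the index would hold "return", "results" and "pl" but
never "return_results". A different analyzer yields a different token list,
so inspecting the actual tokens is a sensible first debugging step.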

I think Erik's java.net article about Lucene may explain some of these
things.
You could also look at Lucene's unit tests to understand Analyzers
better.

Otis

--- "Pleasant, Tracy" <[EMAIL PROTECTED]> wrote:
> 
> The documents I have indexed contain information regarding file names
> also.
> 
> For instance 'return_results.pl' or something like that may be in the
> document fields.
> 
> I am not understanding Lucene's way of searching:
> 
> 1. If I search for 'return_results', the search does not return
> anything
> 2. If I search for 'results' or 'return', the search does not return
> anything
> 3. If I search for 'results.pl', the search does return the document
> containing 'return_results.pl' 
> 4. If I search for 'results~', the search does return the document
> containing 'return_results.pl' 
> 5. If I search for 'return_results~', the search does not return
> anything
> 
> What is going on? 
> 
> I want it to return the document in all of the situations.
> 
> I also don't want to have to use '~' all the time.
> 
> 
> 
> 


RE: Hits Highlighting

2003-11-25 Thread Pleasant, Tracy

I have seen that one, but it doesn't include the source code, only the
jar with classes.

I need something to actually highlight - like if you took a yellow
marker and highlighted, not doing it in bold.

-Original Message-
From: Dror Matalon [mailto:[EMAIL PROTECTED]
Sent: Tuesday, November 25, 2003 12:29 PM
To: Lucene Users List
Subject: Re: Hits Highlighting

Hi,

The lucene home page has a lot of resources, including the FAQs,
articles, javadocs and contributions. 

For instance, there's a query highlighter on the contributions page. 

On Tue, Nov 25, 2003 at 12:17:41PM -0500, Pleasant, Tracy wrote:
> 
>  Are there any  hits highlighting functions? 
> 
>  I have a simple one, but it gets complicated with searching multiple
> words, having tokens, etc.
> 
> 
>  
> 
> 

-- 
Dror Matalon
Zapatec Inc 
1700 MLK Way
Berkeley, CA 94709
http://www.fastbuzz.com
http://www.zapatec.com


RE: Search Question

2003-11-25 Thread Pleasant, Tracy

Also searching 'red_*' returns nothing, also.





-Original Message-
From: Dror Matalon [mailto:[EMAIL PROTECTED]
Sent: Tuesday, November 25, 2003 12:22 PM
To: Lucene Users List
Subject: Re: Search Question


No, but if you use the standard analyzer, searching "red*" will return
documents with "red_car".

On Tue, Nov 25, 2003 at 12:00:01PM -0500, Pleasant, Tracy wrote:
> 
>  If I have words within a document like 
>  
>  red_car
>  
>  If I search for 'red' would it return documents containing 'red_car'?


-- 
Dror Matalon
Zapatec Inc 
1700 MLK Way
Berkeley, CA 94709
http://www.fastbuzz.com
http://www.zapatec.com


RE: Search Question

2003-11-25 Thread Pleasant, Tracy

Why does a search for 'red_car*' return nothing?

I am using the standard analyzer, too.

-Original Message-
From: Dror Matalon [mailto:[EMAIL PROTECTED]
Sent: Tuesday, November 25, 2003 12:22 PM
To: Lucene Users List
Subject: Re: Search Question


No, but if you use the standard analyzer, searching "red*" will return
documents with "red_car".

On Tue, Nov 25, 2003 at 12:00:01PM -0500, Pleasant, Tracy wrote:
> 
>  If I have words within a document like 
>  
>  red_car
>  
>  If I search for 'red' would it return documents containing 'red_car'?


-- 
Dror Matalon
Zapatec Inc 
1700 MLK Way
Berkeley, CA 94709
http://www.fastbuzz.com
http://www.zapatec.com


Re: Hits Highlighting

2003-11-25 Thread Dror Matalon

Hi,

The Lucene home page has a lot of resources, including the FAQs,
articles, javadocs and contributions.

For instance, there's a query highlighter on the contributions page.

On Tue, Nov 25, 2003 at 12:17:41PM -0500, Pleasant, Tracy wrote:
> 
>  Are there any  hits highlighting functions? 
> 
>  I have a simple one, but it gets complicated with searching multiple
> words, having tokens, etc.

-- 
Dror Matalon
Zapatec Inc 
1700 MLK Way
Berkeley, CA 94709
http://www.fastbuzz.com
http://www.zapatec.com


Re: Search Question

2003-11-25 Thread Dror Matalon

No, but if you use the standard analyzer, searching "red*" will return
documents with "red_car".
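
The difference between the two searches can be sketched in plain Java.
This is only an illustration, under the assumption (implied by the answer
above) that the analyzer leaves "red_car" in the index as a single term.
The class and method names are made up for the example and are not Lucene
APIs.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration; not part of Lucene. */
final class PrefixMatchSketch {

    /** A plain term search only matches a whole indexed term. */
    static boolean termMatch(List<String> indexedTerms, String term) {
        return indexedTerms.contains(term);
    }

    /** A prefix search ("red*") matches any indexed term starting with the prefix. */
    static boolean prefixMatch(List<String> indexedTerms, String prefix) {
        for (String indexed : indexedTerms) {
            if (indexed.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Assumption: the document was indexed with "red_car" kept as one term.
        List<String> terms = Arrays.asList("red_car");
        System.out.println(termMatch(terms, "red"));   // false
        System.out.println(prefixMatch(terms, "red")); // true
    }
}
```

If the analyzer in use splits on the underscore instead, the indexed terms
would be different and so would the results, so it is worth checking what
the analyzer actually produces.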

On Tue, Nov 25, 2003 at 12:00:01PM -0500, Pleasant, Tracy wrote:
> 
>  If I have words within a document like 
>  
>  red_car
>  
>  If I search for 'red' would it return documents containing 'red_car'? 

-- 
Dror Matalon
Zapatec Inc 
1700 MLK Way
Berkeley, CA 94709
http://www.fastbuzz.com
http://www.zapatec.com


Re: Collaborative Filtering API

2003-11-25 Thread ambiesense

Hello Mike,

I had a quick look over the javadoc and it looks promising, as you said. Did
Jon Herlocker work on GroupLens? I know GroupLens was pioneering work
in the early days of collaborative systems...

Cheers
Ralph

> You should check out the work of Jon Herlocker at Oregon State 
> (http://eecs.oregonstate.edu/iis/).  They have written a CF engine that
> has been on my to-do list to check out for a few months (sounds good on 
> "paper").  If you get the chance to play with it, I'd be curious to hear 
> your feedback.  Having a CF engine in the open source domain would be a 
> great thing.
> 
> -Mike
> 
> At 10:49 AM 11/25/2003, you wrote:
> >Hello together,
> >
> >I am asking this group because I think people here might know about this,
> >since it is a similar approach.
> >
> >Is there a Java-based API that assists developers of collaborative
> >filtering in their programs? By this I mean software that uses user ratings
> >of items and provides ways (algorithms, methods) to find users with
> >similar interests for prediction generation. Finding an API like Lucene
> >would be a dream for me, but any pointers to other APIs (also in other
> >programming languages) to see and learn from would be appreciated.
> >
> >Kind Regards

Hits Highlighting

2003-11-25 Thread Pleasant, Tracy


 Are there any  hits highlighting functions? 

 I have a simple one, but it gets complicated with searching multiple
words, having tokens, etc.


 


Search Question - not returning desired results

2003-11-25 Thread Pleasant, Tracy


The documents I have indexed also contain information about file names.

For instance, 'return_results.pl' or something like that may be in the document fields.

I don't understand how Lucene is searching:

1. If I search for 'return_results', the search does not return anything.
2. If I search for 'results' or 'return', the search does not return anything.
3. If I search for 'results.pl', the search does return the document containing
'return_results.pl'.
4. If I search for 'results~', the search does return the document containing
'return_results.pl'.
5. If I search for 'return_results~', the search does not return anything.

What is going on? 

I want it to return the document in all of the situations.

I also don't want to have to use '~' all the time.




Search Question

2003-11-25 Thread Pleasant, Tracy


 If I have words within a document like 
 
 red_car
 
 If I search for 'red' would it return documents containing 'red_car'? 

 


RE: Lucene refresh index function (incremental indexing).

2003-11-25 Thread Ben Litchfield


Logging uses log4j and can be configured.  If you are having issues with
specific PDFs then you can post a bug on the sourceforge site or mail me
the PDFs directly and I will look at them.

Ben
http://www.pdfbox.org


On Tue, 25 Nov 2003, Zhou, Oliver wrote:

> I do have other problems with PDFBox-0.6.4.  For one, it prints annoying debug
> information from the very low-level parsing process.  For another, I got an infinite
> loop while indexing PDF files, although the release notes say the infinite loop bug
> has been fixed.  Does anybody know what's going on?
>
> Thanks,
> Oliver
>
>
>
> -Original Message-
> From: Ben Litchfield [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, November 25, 2003 9:45 AM
> To: Lucene Users List
> Subject: RE: Lucene refresh index function (incremental indexing).
>
>
>
> Yes, just add the log4j configuration.  The easiest way to do that is as a
> system parameter, like this:
>
> java -Dlog4j.configuration=log4j.xml org.apache.lucene.demo.IndexHTML -create -index c:\\index ..
>
> where log4j.xml is the path to your log4j config; PDFBox has an example
> one you can use.
>
> Ben
> http://www.pdfbox.org
>
> On Tue, 25 Nov 2003, Zhou, Oliver wrote:
>
> > Lucene doesn't have a PDF parser.  In order to index PDF files you have
> > to add one yourself.  PDFBox is a good choice.  You may just ignore the
> > warning for log4j, or you can add log4j to your classpath.
> >
> > Oliver
> >
> >
> > -Original Message-
> > From: Tun Lin [mailto:[EMAIL PROTECTED]
> > Sent: Monday, November 24, 2003 10:07 PM
> > To: 'Lucene Users List'
> > Subject: RE: Lucene refresh index function (incremental indexing).
> >
> >
> > Does it support indexing the contents of PDF files? I have found one
> > project called PDFBox that can be integrated with Lucene to search inside
> > PDF files. Currently, Lucene can only search for the PDF filename. I tried
> > PDFBox and got the following message when I typed the command:
> > java org.apache.lucene.demo.IndexHTML -create -index c:\\index ..
> >
> > log4j:WARN No appenders could be found for logger
> > (org.pdfbox.pdfparser.PDFParser).
> > log4j:WARN Please initialize the log4j system properly.
> >
> > Can anyone advise?
> >
> > -Original Message-
> > From: Doug Cutting [mailto:[EMAIL PROTECTED]
> > Sent: Tuesday, November 25, 2003 5:01 AM
> > To: Lucene Users List
> > Subject: Re: Lucene refresh index function (incremental indexing).
> >
> > Tun Lin wrote:
> > > These are the steps I took:
> > >
> > > 1) I compile all the files in a particular directory using the command:
> > > java org.apache.lucene.demo.IndexHTML -create -index c:\\index ..
> > > , putting all the indexed files in c:\\index.
> > > 2) Every time I add an additional file to that directory, I need to
> > > reindex/recompile that directory to generate the indexes again. As the
> > > directory gets larger, the indexing takes longer.
> > >
> > > My question is: how do I generate the indexes automatically every time a
> > > new document is added to that directory, without recompiling manually
> > > every time?
> >
> > To update, try removing the '-create' from the command line.  The demo
> > code supports incremental updates.  It will re-scan the directory and
> > figure out which files have changed, what new files have appeared and
> > which previously existing files have been removed.
> >
> > Doug

RE: Tokenizing text custom way

2003-11-25 Thread MOYSE Gilles (Cetelem)

Hi.

You should define expressions.
To define expressions, you first have to define an expression file.
An expression file contains one expression per line.
For instance:
time_out
expert_system
...
You can use any character as the "expression link". Here, I use the
underscore (_).

Then, you have to build an expression loader. You can store expressions in
recursive HashMaps.
Such a HashMap must be built so that HashMap.get("word1") returns a HashMap and
(HashMap.get("word1")).get("word2") returns null, if you want to encode the
expression "word1_word2".
In other words, 'HashMap.get("a_word")' returns a HashMap containing all the
successors of the word 'a_word'.

So, if your expression file looks like this:
time_out
expert_system
expert_in_information

you'll have to build a loader which returns a HashMap H such that:
H.keySet() = {"time", "expert"}
((HashMap)H.get("time")).keySet() = {"out"}
((HashMap)H.get("time")).get("out") = null // null indicates the end of the expression
((HashMap)H.get("expert")).keySet() = {"system", "in"}
((HashMap)H.get("expert")).get("system") = null
((HashMap)((HashMap)H.get("expert")).get("in")).keySet() = {"information"}
((HashMap)((HashMap)H.get("expert")).get("in")).get("information") = null

These recursive HashMaps encode the following tree:
time --- out --- null
expert --- system --- null
        |- in --- information --- null

Such an expression loader may be designed this way:

// Requires: java.io.File, java.io.FileReader, java.io.LineNumberReader,
//           java.util.HashMap, java.util.StringTokenizer
// FILE_COMMENT_CHARACTER is assumed to be a String constant defined in the
// enclosing class (for example "#").
public static HashMap getExpressionMap( File wordfile ) {
    HashMap result = new HashMap();

    try
    {
        String line = null;
        LineNumberReader in = new LineNumberReader(new FileReader(wordfile));
        HashMap hashToAdd = null;

        while ((line = in.readLine()) != null)
        {
            // Skip comment lines and blank lines
            if (line.startsWith(FILE_COMMENT_CHARACTER))
                continue;

            if (line.trim().length() == 0)
                continue;

            StringTokenizer stok = new StringTokenizer(line, " \t_");
            String curTok = "";
            HashMap currentHash = result;

            // Test whether the expression contains at least 2 words
            if (stok.countTokens() < 2)
            {
                System.err.println("Warning : '" + line + "' in file '"
                    + wordfile.getAbsolutePath() + "' line " + in.getLineNumber()
                    + " is not an expression.\n\tA valid expression contains at least 2 words.");
                continue;
            }

            while (stok.hasMoreTokens())
            {
                curTok = stok.nextToken();
                // If there is a comment at the end of the line, stop here
                if (curTok.startsWith(FILE_COMMENT_CHARACTER))
                    break;
                // Intermediate words map to a sub-HashMap; the last word maps to null
                if (stok.hasMoreTokens())
                    hashToAdd = new HashMap(6);
                else
                    hashToAdd = (HashMap)null;

                if (!(currentHash.containsKey(curTok)))
                    currentHash.put(curTok, hashToAdd);

                currentHash = (HashMap)currentHash.get(curTok);
            }
        }
        in.close();
        return result;
    }
    // On error, use an empty table
    catch ( Exception e )
    {
        System.err.println("While processing '" + wordfile.getAbsolutePath()
            + "' : " + e.getMessage());
        e.printStackTrace();
        return new HashMap();
    }
}


Then, you must build a filter with 2 FIFO stacks: one is the expression
stack, the other is the default stack.
Then, you define a 'curMap' variable, initially pointing to the HashMap
returned by the ExpressionFileLoader.

When you receive a token, you check whether it is null or not.
    If it is, you check whether the default stack is empty or not.
        If it is not, you pop a token from the default stack and you
        return it.
        If it is, you return null.
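
The nested-map structure described above can be sketched in a
self-contained way, reading expressions from a list instead of a file.
This is a minimal sketch, not the loader from this message: the class and
method names are hypothetical, and it assumes that no expression is a
strict prefix of another (as in the example file).

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper, not part of Lucene: nested maps encoding multi-word expressions. */
final class ExpressionMapSketch {

    /** Builds the nested map; a null value marks the end of an expression. */
    @SuppressWarnings("unchecked")
    static Map<String, Object> build(List<String> expressions) {
        Map<String, Object> root = new HashMap<>();
        for (String expression : expressions) {
            String[] words = expression.trim().split("_");
            Map<String, Object> current = root;
            for (int i = 0; i < words.length; i++) {
                boolean last = (i == words.length - 1);
                if (!current.containsKey(words[i])) {
                    current.put(words[i], last ? null : new HashMap<String, Object>());
                }
                if (last) {
                    break;
                }
                Object next = current.get(words[i]);
                if (next == null) {
                    // Assumption violated: a shorter expression is a prefix of this one.
                    throw new IllegalArgumentException("prefix conflict: " + expression);
                }
                current = (Map<String, Object>) next;
            }
        }
        return root;
    }

    /** Returns true if the given words, in order, form one complete expression. */
    @SuppressWarnings("unchecked")
    static boolean isExpression(Map<String, Object> root, String... words) {
        Map<String, Object> current = root;
        for (int i = 0; i < words.length; i++) {
            if (current == null || !current.containsKey(words[i])) {
                return false;
            }
            Object next = current.get(words[i]);
            if (i == words.length - 1) {
                return next == null; // null marks the end of an expression
            }
            current = (Map<String, Object>) next;
        }
        return false;
    }
}
```

With the three example expressions, isExpression(h, "time", "out") is true,
while isExpression(h, "expert", "in") is false because "in" still has a
successor ("information").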

Searching different types of words

2003-11-25 Thread Pleasant, Tracy

If I search for "like" I would want the search to return documents
containing "like", "liked", "likes", etc., that is, variations of the word.

Is there a way to tell Lucene to do this? 




Re: Collaborative Filtering API

2003-11-25 Thread Michael Giles

You should check out the work of Jon Herlocker at Oregon State 
(http://eecs.oregonstate.edu/iis/).  They have written a CF engine that has 
been on my to-do list to check out for a few months (sounds good on 
"paper").  If you get the chance to play with it, I'd be curious to hear 
your feedback.  Having a CF engine in the open source domain would be a 
great thing.

-Mike

At 10:49 AM 11/25/2003, you wrote:
Hello together,

I am asking this group because I think people here might know about this,
since it is a similar approach.

Is there a Java-based API that assists developers of collaborative filtering
in their programs? By this I mean software that uses user ratings
of items and provides ways (algorithms, methods) to find users with
similar interests for prediction generation. Finding an API like Lucene
would be a dream for me, but any pointers to other APIs (also in other
programming languages) to see and learn from would be appreciated.

Kind Regards



RE: Lucene refresh index function (incremental indexing).

2003-11-25 Thread Pleasant, Tracy

I vaguely remember having a problem back when I used 0.6.2. I reverted to
0.6.1 instead. I haven't had any problems.



Re: XML support in Lucene

2003-11-25 Thread Otis Gospodnetic

Check out the Resources page on Lucene's site.  Look for a link to the 
article about Lucene and Digester.  You can also look at Lucene's
Sandbox for some ideas.

Otis

--- [EMAIL PROTECTED] wrote:
> Hello group,
> 
> does Lucene offer an effective and flexible way to treat XML files? I
> know that as soon as an InputStream is provided, Lucene can basically
> index everything (possibly after cleaning). How is it with XML files?
>
> If there is a way, is it possible to have one big XML file with many
> individual parts in it? These parts should be treated as documents and
> the repeated XML tags as fields.
>
> Here is an example:
> 
> 
>   
> Tim
> How are you? Tom
>   
>   
>  Linda
> bla bla bla
>   
> 
> 
> 
> Has somebody already developed classes which go through this XML
> file, create TWO documents with the fields "From" and "Content" and fill
> in the text between the tags? The indexing business should then be the
> same, since it is abstracted through the Document object. The same goes
> for the search process. The search process, however, could be optimised
> with structural information (i.e. only search in "Content")...
>
> Cheers,
> Ralph



RE: Eliminating duplicate result

2003-11-25 Thread Pleasant, Tracy

When you are doing two searches are you searching for two different terms?

-Original Message-
From: Dragan Jotanovic [mailto:[EMAIL PROTECTED]
Sent: Tuesday, November 25, 2003 11:14 AM
To: Lucene Users List
Subject: Eliminating duplicate result

What is the easiest way to eliminate duplicate documents if one is doing two searches
on the same index?

Has anybody done something similar?


RE: XML support in Lucene

2003-11-25 Thread Pleasant, Tracy

This may help you: 

http://www.jguru.com/faq/view.jsp?EID=1074235

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Sent: Tuesday, November 25, 2003 11:07 AM
To: [EMAIL PROTECTED]
Subject: XML support in Lucene

Hello group,

does Lucene offer an effective and flexible way to treat XML files? I
know that as soon as an InputStream is provided, Lucene can basically
index everything (possibly after cleaning). How is it with XML files?

If there is a way, is it possible to have one big XML file with many
individual parts in it? These parts should be treated as documents and the
repeated XML tags as fields.

Here is an example:

Tim
How are you? Tom

 Linda
bla bla bla

Has somebody already developed classes which go through this XML
file, create TWO documents with the fields "From" and "Content" and fill
in the text between the tags? The indexing business should then be the
same, since it is abstracted through the Document object. The same goes
for the search process. The search process, however, could be optimised
with structural information (i.e. only search in "Content")...

Cheers,
Ralph

XML support in Lucene

2003-11-25 Thread ambiesense

Hello group,

does Lucene offer an effective and flexible way to treat XML files? I know
that as soon as an InputStream is provided, Lucene can basically index
everything (possibly after cleaning). How is it with XML files?

If there is a way, is it possible to have one big XML file with many
individual parts in it? These parts should be treated as documents and the
repeated XML tags as fields.

Here is an example:


  
Tim
How are you? Tom
  
  
 Linda
bla bla bla
  



Has somebody already developed classes which go through this XML file,
create TWO documents with the fields "From" and "Content" and fill in the text
between the tags? The indexing business should then be the same, since it is
abstracted through the Document object. The same goes for the search process. The
search process, however, could be optimised with structural information (i.e.
only search in "Content")...

Cheers,
Ralph


Eliminating duplicate result

2003-11-25 Thread Dragan Jotanovic

What is the easiest way to eliminate duplicate documents if one is doing two searches
on the same index?

Has anybody done something similar?

RE: Tokenizing text custom way

2003-11-25 Thread Pleasant, Tracy

Not exactly an answer to the question, but I haven't yet used the Token
classes/functionality that comes with Lucene. Can someone give me an idea of how and
why one might use it?

-Original Message-
From: Dragan Jotanovic [mailto:[EMAIL PROTECTED]
Sent: Tuesday, November 25, 2003 6:42 AM
To: Lucene Users List
Subject: Tokenizing text custom way

Hi. I need to tokenize text while indexing, but I don't want the space to be a delimiter.
The delimiter should be my own custom character (for example, a comma). I understand that I would
probably need to implement my own analyzer, but could someone help me with where to start?
Is there any other way to do this without writing a custom analyzer?

This is what I want to achieve.
If I have some text that will be indexed like following:

man, people, time out, sun

and if I enter 'time' as a search word, I don't want to get "time out" in the results. I
need exact keyword matching. I would achieve this if I tokenized "time out" as one
token while indexing.
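
The behaviour described here can be sketched in plain Java. This is only
an illustration of comma-delimited tokenizing with exact keyword matching,
not a Lucene analyzer; the class and method names are hypothetical, and
lower-casing the tokens is an added assumption.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration; not a Lucene Analyzer or Tokenizer. */
final class CommaKeywordSketch {

    /** Splits on commas only, so "time out" stays a single keyword. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.split(",")) {
            String token = part.trim().toLowerCase(); // assumption: case-insensitive keywords
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    /** Exact keyword match: the query must equal one whole token. */
    static boolean matches(String text, String keyword) {
        return tokenize(text).contains(keyword.trim().toLowerCase());
    }
}
```

For the text "man, people, time out, sun", matches(text, "time") is false
and matches(text, "time out") is true, which is the exact-matching
behaviour asked for.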

Maybe someone has had a similar problem? If someone knows how to handle this, please help me.

Dragan Jotanovic


RE: Lucene refresh index function (incremental indexing).

2003-11-25 Thread Zhou, Oliver

I do have other problems with PDFBox-0.6.4.  For one, it prints annoying debug
information from the very low-level parsing process.  For another, I got an infinite
loop while indexing PDF files, although the release notes say the infinite loop bug
has been fixed.  Does anybody know what's going on?

Thanks,
Oliver

 


RE: Score

2003-11-25 Thread Pleasant, Tracy

Thanks for your input. 
I am using the standard analyzer for everything. I haven't created my
own analyzer yet.

The documents I am using: 

Plain text
PDF Documents

(I have two indexes) 

When I create my index:
    IndexWriter writer = new IndexWriter(index_name, new StandardAnalyzer(), true);

When I search:
    Analyzer analyzer = new StandardAnalyzer();
    query = MultiFieldQueryParser.parse(queryString, fields, analyzer);
(where queryString is the term to search for and fields is the array of
fields)

When searching, it searches one index and then the other.

When you say you use different analyzers for different fields in your
index, how would you accomplish that? When I create the index, the
IndexWriter takes a single analyzer parameter. Unless you create different
indexes, how do you use two different ones?

-Original Message-
From: Gerret Apelt [mailto:[EMAIL PROTECTED]
Sent: Monday, November 24, 2003 3:25 PM
To: Lucene Users List
Subject: Re: Score

Tracy --

It would help if you could give more detail on the types of documents,
fields and analyzers you're using. Also, what do you mean by "Multi Field
Search"? I presume you're using the MultiFieldQueryParser to have query
terms in a user-submitted query be searched for in each field in your
index.

If I am understanding your problem, then it might be the same one I had
a few weeks ago -- highly relevant matches would not receive a high
ranking. (This paragraph will apply to you only if you use more than
just one Analyzer for the set of your fields.) I had six fields in my
index, most of which were populated with a standard analyzer. I used
self-made Analyzers for two of the fields. This turned out to be my
problem when using MultiFieldQueryParser: I told my
MultiFieldQueryParser instance to use only the standard analyzer.
Instead, I discovered that I needed to make use of
org.apache.lucene.analysis.PerFieldAnalyzerWrapper and feed that to the
MultiFieldQueryParser. Unless you do this, your problem is what's
described here:
http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi?file=chapter.indexing&toc=faq#q15
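
The idea of choosing an analyzer by field name can be illustrated without
Lucene. The sketch below is plain Java with hypothetical class and method
names; it is not the implementation of PerFieldAnalyzerWrapper, and the
two stand-in "analyzers" are simplifications chosen for the example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical illustration of per-field analysis; not Lucene's implementation. */
final class PerFieldAnalysisSketch {
    private final Function<String, List<String>> defaultAnalyzer;
    private final Map<String, Function<String, List<String>>> perField = new HashMap<>();

    PerFieldAnalysisSketch(Function<String, List<String>> defaultAnalyzer) {
        this.defaultAnalyzer = defaultAnalyzer;
    }

    /** Registers a special analyzer for one field name. */
    void addAnalyzer(String field, Function<String, List<String>> analyzer) {
        perField.put(field, analyzer);
    }

    /** Uses the field's own analyzer if one was registered, else the default. */
    List<String> analyze(String field, String text) {
        return perField.getOrDefault(field, defaultAnalyzer).apply(text);
    }

    /** Stand-in for a general-purpose analyzer: lower-case, split on non-word characters. */
    static List<String> lowercaseWords(String text) {
        return Arrays.asList(text.toLowerCase().split("\\W+"));
    }

    /** Stand-in for a keyword-style analyzer: the whole value is one token. */
    static List<String> wholeValue(String text) {
        return Arrays.asList(text);
    }
}
```

The point of the design is that the same per-field mapping is used both
when indexing and when parsing queries, so a query term for a given field
is tokenized the same way as the indexed text of that field.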

Most likely, if your scoring is off, you're "doing something wrong" in 
the way you use the Lucene API -- at least, thats what I've discovered 
to be the case when my ranking is off.

If you're interested in the nitty-gritty of how scoring is done, check 
this FAQ entry:
http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi?file=chapter.search&toc=faq#q31

cheers,
Gerret

Pleasant, Tracy wrote:

>Hi,
>
>I'm using the Multi Field Search to search all the fields of my
>documents during the search. 
>
>When it returns results the scores are numerically low - .06, .17, etc.
>I would think if I searched for "Dog" and there was a doc with "Dog" in
>the title and several times in the contents of a document that it would
>receive a score more like 1.0 or close to it.
>
>Is there a way that I can tweak the score?
>
>I tried using Boost but that did absolutely nothing.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Collaborative Filtering API

2003-11-25 Thread ambiesense

Hello together,

I am asking this group because I think people here might know about this
since it is a similar approach. 

Is there a Java-based API that assists developers in adding collaborative
filtering to their programs? By this I mean software that uses users' ratings
of items and provides ways (algorithms, methods) to find users with similar
interests in order to generate predictions. Finding an API like Lucene would
be a dream for me, but any pointers to other APIs (also in other programming
languages) to look at and learn from would be appreciated.

Kind Regards


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

RE: Lucene refresh index function (incremental indexing).

2003-11-25 Thread Ben Litchfield


Yes, just add the log4j configuration.  The easiest way to do that is as a
system property, like this:

java -Dlog4j.configuration=log4j.xml org.apache.lucene.demo.IndexHTML -create -index c:\\index ..

where log4j.xml is the path to your log4j config. PDFBox has an example
one you can use.

Ben
http://www.pdfbox.org

On Tue, 25 Nov 2003, Zhou, Oliver wrote:

> Lucene doesn't have pdf parser.  In order to index pdf files you have to add
> one by your self.  PDFBox is a good choice.  You may just ignore the warning
> for log4j or you can add log4j in your classpath.
>
> Oliver
>
>
> -Original Message-
> From: Tun Lin [mailto:[EMAIL PROTECTED]
> Sent: Monday, November 24, 2003 10:07 PM
> To: 'Lucene Users List'
> Subject: RE: Lucene refresh index function (incremental indexing).
>
>
> Does it support indexing the contents of pdf files? I have found one project
> called PDFBox that can be integrated with Lucene to search inside of the pdf
> files. Currently, Lucene can only search for the pdf filename. I tried with
> PDFBox and I got the following message when I typed the command: java
> org.apache.lucene.demo.IndexHTML -create -index c:\\index ..
>
> log4j:WARN No appenders could be found for logger
> (org.pdfbox.pdfparser.PDFParser).
> log4j:WARN Please initialize the log4j system properly.
>
> Can anyone advise?
>
> -Original Message-
> From: Doug Cutting [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, November 25, 2003 5:01 AM
> To: Lucene Users List
> Subject: Re: Lucene refresh index function (incremental indexing).
>
> Tun Lin wrote:
> > These are the steps I took:
> >
> > 1) I compile all the files in a particular directory using the command:
> > java org.apache.lucene.demo.IndexHTML -create -index c:\\index ..
> > , putting all the indexed files in c:\\index.
> > 2) Everytime, I added an additional file in that directory. I need to
> > reindex/recompile that directory to generate the indexes again. As the
> > directory gets larger, the indexing takes a longer time.
> >
> > My question is how do I generate the indexes automatically everytime a
> > new document is added in that directory without me recompiling everytime
> > manually?
>
> To update, try removing the '-create' from the command line.  The demo code
> supports incremental updates.  It will re-scan the directory and figure out
> which files have changed, what new files have appeared and which previously
> existing files have been removed.
>
> Doug
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

RE: Lucene refresh index function (incremental indexing).

2003-11-25 Thread Pleasant, Tracy

I was able to get PDFBox to work with my JSP webpages. 

I think you will have to write some of your own code to handle the PDF
files (while still calling the Lucene functions), for example:

 doc = LucenePDFDocument.getDocument(file);

-Original Message-
From: Tun Lin [mailto:[EMAIL PROTECTED]
Sent: Monday, November 24, 2003 11:07 PM
To: 'Lucene Users List'
Subject: RE: Lucene refresh index function (incremental indexing).

Does it support indexing the contents of pdf files? I have found one
project called PDFBox that can be integrated with Lucene to search inside
of the pdf files. Currently, Lucene can only search for the pdf filename.
I tried with PDFBox and I got the following message when I typed the
command: java org.apache.lucene.demo.IndexHTML -create -index c:\\index ..

log4j:WARN No appenders could be found for logger
(org.pdfbox.pdfparser.PDFParser).
log4j:WARN Please initialize the log4j system properly.

Can anyone advise?

-Original Message-
From: Doug Cutting [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, November 25, 2003 5:01 AM
To: Lucene Users List
Subject: Re: Lucene refresh index function (incremental indexing).

Tun Lin wrote:
> These are the steps I took:
> 
> 1) I compile all the files in a particular directory using the command:
> java org.apache.lucene.demo.IndexHTML -create -index c:\\index ..
> , putting all the indexed files in c:\\index.
> 2) Everytime, I added an additional file in that directory. I need to
> reindex/recompile that directory to generate the indexes again. As the
> directory gets larger, the indexing takes a longer time.
> 
> My question is how do I generate the indexes automatically everytime a
> new document is added in that directory without me recompiling everytime
> manually?

To update, try removing the '-create' from the command line.  The demo
code supports incremental updates.  It will re-scan the directory and
figure out which files have changed, what new files have appeared and
which previously existing files have been removed.
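
The kind of bookkeeping described above can be sketched in plain Java.
This is a hypothetical illustration, not the demo's actual code: the
IndexDiff class and the idea of keeping a path -> last-modified map from
the previous run are assumptions made for this example.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch, not the demo's actual code: compares the
// path -> last-modified map recorded at index time with the current one.
class IndexDiff {
    final Set<String> added = new TreeSet<>();
    final Set<String> changed = new TreeSet<>();
    final Set<String> removed = new TreeSet<>();

    IndexDiff(Map<String, Long> indexed, Map<String, Long> onDisk) {
        for (Map.Entry<String, Long> e : onDisk.entrySet()) {
            Long old = indexed.get(e.getKey());
            if (old == null) {
                added.add(e.getKey());       // file was not indexed before
            } else if (!old.equals(e.getValue())) {
                changed.add(e.getKey());     // timestamp differs: re-index
            }
        }
        for (String path : indexed.keySet()) {
            if (!onDisk.containsKey(path)) {
                removed.add(path);           // file is gone: delete from index
            }
        }
    }
}
```

Only the files in added and changed would need to be (re-)indexed, and the
documents for removed files deleted, so the cost of an update depends on
what changed rather than on the size of the directory.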

Doug

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

RE: Lucene refresh index function (incremental indexing).

2003-11-25 Thread Zhou, Oliver

Lucene doesn't have a PDF parser.  In order to index PDF files you have to add
one yourself.  PDFBox is a good choice.  You may just ignore the log4j
warning, or you can add log4j to your classpath.

Oliver

-Original Message-
From: Tun Lin [mailto:[EMAIL PROTECTED]
Sent: Monday, November 24, 2003 10:07 PM
To: 'Lucene Users List'
Subject: RE: Lucene refresh index function (incremental indexing).

Does it support indexing the contents of pdf files? I have found one project
called PDFBox that can be integrated with Lucene to search inside of the pdf
files. Currently, Lucene can only search for the pdf filename. I tried with
PDFBox and I got the following message when I typed the command: java
org.apache.lucene.demo.IndexHTML -create -index c:\\index .. 

log4j:WARN No appenders could be found for logger
(org.pdfbox.pdfparser.PDFParser).
log4j:WARN Please initialize the log4j system properly.

Can anyone advise?

-Original Message-
From: Doug Cutting [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, November 25, 2003 5:01 AM
To: Lucene Users List
Subject: Re: Lucene refresh index function (incremental indexing).

Tun Lin wrote:
> These are the steps I took:
> 
> 1) I compile all the files in a particular directory using the command: 
> java org.apache.lucene.demo.IndexHTML -create -index c:\\index .. 
> , putting all the indexed files in c:\\index.
> 2) Everytime, I added an additional file in that directory. I need to 
> reindex/recompile that directory to generate the indexes again. As the 
> directory gets larger, the indexing takes a longer time.
> 
> My question is how do I generate the indexes automatically everytime a 
> new document is added in that directory without me recompiling everytime
> manually?

To update, try removing the '-create' from the command line.  The demo code
supports incremental updates.  It will re-scan the directory and figure out
which files have changed, what new files have appeared and which previously
existing files have been removed.

Doug

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: Tokenizing text custom way

2003-11-25 Thread Hackl, Rene

> Your solution isn't doing tokenizing, right?

You're absolutely right, I misunderstood.

Now, instead of return true, I'd maybe put something like

return c != ',';

and then cut off surrounding spaces like "man, people, time out,..."
--> "man" " people" " time out"
--> "man" "people" "time out"

I haven't tested this though. Keep us posted when you find 
something that works. :-)
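
The split-then-trim idea above can be tried out in plain Java without any
Lucene classes. This is a minimal sketch under that assumption: CommaTokens
and split() are hypothetical names for this example, and wiring the result
into an actual tokenizer/analyzer is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: splits a field value on commas
// and trims surrounding whitespace, so "time out" stays a single token.
class CommaTokens {
    static List<String> split(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split(",")) {
            String token = piece.trim();
            if (!token.isEmpty()) {   // skip empty pieces such as those from ",,"
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [man, people, time out, sun]
        System.out.println(split("man, people, time out, sun"));
    }
}
```

With tokens produced this way, a search for 'time' would not match the
token "time out", since the two are different terms.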

Best regards,

René

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: Tokenizing text custom way

2003-11-25 Thread Dragan Jotanovic

Hi Rene,

> I've had the same problem. On some fields, I do
> employ a "NonTokenizer" now,
> which looks similar to the other tokenizers except for:

> protected boolean isTokenChar(char c)
>  {
>return true;
>  }

> So "time out" would be one token.

This is an OK solution if I only have "time out" in a field, but I
will have dozens of words in one field of a document. As I said in my
previous message, I would have "man, people, time out, sun", all in one
field, and each of them should be searchable (I need to tokenize them
as "man" "people" "time out" "sun").

Your solution isn't doing tokenizing, right?

Dragan Jotanovic





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: Tokenizing text custom way

2003-11-25 Thread Hackl, Rene

Hi Dragan,

> and if I enter 'time' as a search word, I don't want to get "time out" in
> results. I need exact keyword matching. I would achieve this if I tokenize
> "time out" as one token while idexing.

> Maybe someone had similar problem? If someone knows how to handle this,
> please help me.

I've had the same problem. On some fields, I do employ a "NonTokenizer" now,
which looks similar to the other tokenizers except for:

protected boolean isTokenChar(char c)
{
  return true;
}

So "time out" would be one token.

HTH

René

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Tokenizing text custom way

2003-11-25 Thread Dragan Jotanovic

Hi. I need to tokenize text while indexing, but I don't want the space to be the delimiter. 
The delimiter should be a custom character (for example, a comma). I understand that I would 
probably need to implement my own analyzer, but could someone tell me where to start? 
Is there any other way to do this without writing a custom analyzer?

This is what I want to achieve.
If I have some text that will be indexed like following:

man, people, time out, sun

and if I enter 'time' as a search word, I don't want to get "time out" in the results. I 
need exact keyword matching. I could achieve this if "time out" were tokenized as one 
token while indexing.

Has anyone had a similar problem? If someone knows how to handle this, please help me.

Dragan Jotanovic

RE: Lucene version 1.3.

2003-11-25 Thread Otis Gospodnetic

1.3RC2.

Otis

--- Scott Smith <[EMAIL PROTECTED]> wrote:
> If you had to be in production in January, would you be using 1.3RC2 or
> 1.2?
> 
> -Original Message-
> From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] 
> Sent: Monday, November 24, 2003 4:03 AM
> To: Lucene Users List; [EMAIL PROTECTED]
> Subject: Re: Lucene version 1.3.
> 
> 
> Sorry, no firm date.  However, 1.3 RC2 is pretty solid, so I suggest
> you
> just use that until 1.3 final is out.
> 
> Otis
> 
> --- Tun Lin <[EMAIL PROTECTED]> wrote:
> > Hi,
> > 
> > Anyone knows when the full version of Lucene version 1.3 will be 
> > released?
> > 
> > Please advise.
> > 
> > Thanks.
> > 
> 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
