Re: MultiFieldQueryParser seems broken... Fix attached.

2004-09-08 Thread sergiu gordea
Hi Bill,
 I think more people are waiting for this patch to MultiFieldQueryParser.
 It would be nice if it were included in the next release candidate.


   All the best,
  Sergiu
Bill Janssen wrote:
René,
Thanks for your note.
I'd think that if a user specified a query "cutting lucene", with an
implicit AND and the default fields "title" and "author", they'd
expect to see a match in which both "cutting" and "lucene" appear.  That is,
(title:cutting OR author:cutting) AND (title:lucene OR author:lucene)
Instead, what they'd get from the current (broken) strategy of outer
combination used by MultiFieldQueryParser is
(title:cutting OR title:lucene) AND (author:cutting OR author:lucene)
Note that this would match even if only "lucene" occurred in the
document, as long as it occurred both in the title field and in the
author field.  Or, for that matter, it would also match "Cutting on
Cutting", by Doug Cutting :-).
 

http://issues.apache.org/eyebrowse/[EMAIL PROTECTED]&msgId=1798116
   

Yes, the approach there is similar.  I attempted to complete the
solution and provide a working replacement for MultiFieldQueryParser.
Bill
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
 




Re: maximum index size

2004-09-08 Thread Doug Cutting
Chris Fraschetti wrote:
I've seen throughout the list mentions of millions of documents.. 8
million, 20 million, etc etc.. but can lucene potentially handle
billions of documents and still efficiently search through them?
Lucene can currently handle up to 2^31 documents in a single index.  To 
a large degree this is limited by Java ints and arrays (which are 
accessed by ints).  There are also a few places where the file format 
limits things to 2^32.

On typical PC hardware, 2-3 word searches of an index with 10M 
documents, each with around 10k of text, require around 1 second, 
including index i/o time.  Performance is more-or-less linear, so that a 
100M document index might require nearly 10 seconds per search.  Thus, 
as indexes grow folks tend to distribute searches in parallel to many 
smaller indexes.  That's what Nutch and Google 
(http://www.computer.org/micro/mi2003/m2022.pdf) do.
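Doug's suggestion of distributing searches across many smaller indexes can be sketched in plain Java (illustrative names only, not a Lucene API): hash each document's unique key to pick one of N shards, so the 2^31 per-index limit and the roughly linear search cost apply only per shard, and searches can run against the shards in parallel.

```java
// Hypothetical sketch: route documents to shards by hashing their keys.
public class ShardRouter {
    private final int shardCount;

    public ShardRouter(int shardCount) {
        this.shardCount = shardCount;
    }

    /** Route a document to a shard by hashing its unique key. */
    public int shardFor(String docKey) {
        // Math.floorMod keeps the result non-negative even for negative hashes.
        return Math.floorMod(docKey.hashCode(), shardCount);
    }

    public static void main(String[] args) {
        ShardRouter router = new ShardRouter(4);
        // Every key lands deterministically on exactly one of the 4 shards.
        for (String key : new String[] {"doc-1", "doc-2", "doc-3"}) {
            System.out.println(key + " -> shard " + router.shardFor(key));
        }
    }
}
```

At query time the same N shards would each be searched (ideally concurrently) and their hit lists merged, which is the scheme the Nutch and Google papers describe.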

Doug


Re: maximum index size

2004-09-08 Thread Otis Gospodnetic
Given adequate hardware, it can.  Take a look at nutch.org.  Nutch uses
Lucene at its core.

Otis

--- Chris Fraschetti <[EMAIL PROTECTED]> wrote:

> I know the index size is very dependent on the content being index...
> 
> but running on a unix based machine w/o a filesize limit, best case
> scenario... what is the largest number of documents that can be
> indexed.
> 
> I've seen throughout the list mentions of millions of documents.. 8
> million, 20 million, etc etc.. but can lucene potentially handle
> billions of documents and still efficiently search through them?
> 





maximum index size

2004-09-08 Thread Chris Fraschetti
I know the index size is very dependent on the content being indexed...

but running on a unix based machine w/o a filesize limit, best case
scenario... what is the largest number of documents that can be
indexed.

I've seen throughout the list mentions of millions of documents.. 8
million, 20 million, etc etc.. but can lucene potentially handle
billions of documents and still efficiently search through them?




Re: indexing size

2004-09-08 Thread Dmitry Serebrennikov
Niraj Alok wrote:
Hi PA,
Thanks for the detail! Since we are using Lucene to store the data as well, I
guess I would not be able to use it.
 

By the way, I could be wrong, but I think the 35% figure you referenced
in your first e-mail actually does not include any stored fields.
The point of the 35% figure was, I think, to illustrate that the index data
structures Lucene uses for searching are efficient. But Lucene does nothing
special with stored content - no compression or anything like that. So
you end up with the full size of your stored data plus roughly 35% of the
indexed data.
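Dmitry's rule of thumb can be turned into a back-of-envelope estimate (a sketch, assuming the ~35% overhead applies to the indexed portion and stored fields are kept verbatim):

```java
public class IndexSizeEstimate {
    /** Stored fields are kept verbatim; indexed data adds roughly 35% overhead. */
    static long estimateBytes(long storedBytes, long indexedBytes) {
        return storedBytes + Math.round(0.35 * indexedBytes);
    }

    public static void main(String[] args) {
        // Example: 1 GB of text that is both stored and indexed.
        long oneGb = 1L << 30;
        System.out.println(estimateBytes(oneGb, oneGb) + " bytes total");
    }
}
```

So storing the data in Lucene roughly adds the full data size on top of the index, which is why index-only setups stay so compact.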

Cheers.
Dmitry.
Regards,
Niraj
- Original Message -
From: "petite_abeille" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Wednesday, September 01, 2004 1:14 PM
Subject: Re: indexing size
 

Hi Niraj,
On Sep 01, 2004, at 06:45, Niraj Alok wrote:
   

If I make some of them Field.Unstored, I can see from the javadocs
that it
will be indexed and tokenized but not stored. If it is not stored, how
can I
use it while searching?
 

The different types of fields don't affect how you search; that
is always the same.
Using Unstored fields simply means that you use Lucene as a pure index,
for search purposes only, not for storing any data.
Specifically, the assumption is that your original data lives somewhere
else, outside of Lucene. If this assumption holds, then you can index
everything as Unstored with the addition of one Keyword field per document.
The Keyword field holds some sort of unique identifier which allows you
to retrieve the original data if necessary (e.g. a primary key, a URI,
whatnot).
Here is an example of this approach:
(1) For indexing, check the indexValuesWithID() method:
http://cvs.sourceforge.net/viewcvs.py/zoe/ZOE/Frameworks/SZObject/SZIndex.java?view=markup
Note the addition of a Field.Keyword for each document and the use of
Field.UnStored for everything else.
(2) For fetching, check objectsWithSpecificationAndHitsInStore():
http://cvs.sourceforge.net/viewcvs.py/zoe/ZOE/Frameworks/SZObject/SZFinder.java?view=markup
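The same pattern can be modeled in a few lines of plain Java (no Lucene dependency; the names are illustrative): the "index" maps terms to unique IDs only, and the original content is fetched from an external store by that same ID.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the Unstored-plus-Keyword pattern: the index holds only IDs,
// while the original data lives in an external store keyed by that ID.
public class IdOnlyIndexSketch {
    private final Map<String, String> termToId = new HashMap<>();      // the "index"
    private final Map<String, String> externalStore = new HashMap<>(); // original data

    public void add(String id, String content) {
        externalStore.put(id, content); // stands in for a database, filesystem, etc.
        for (String term : content.toLowerCase().split("\\s+")) {
            termToId.put(term, id);     // index the term, store only the ID
        }
    }

    /** Search yields the ID; the caller fetches the original data separately. */
    public String fetch(String term) {
        String id = termToId.get(term.toLowerCase());
        return id == null ? null : externalStore.get(id);
    }

    public static void main(String[] args) {
        IdOnlyIndexSketch idx = new IdOnlyIndexSketch();
        idx.add("doc-42", "Lucene in Action");
        System.out.println(idx.fetch("lucene")); // -> Lucene in Action
    }
}
```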
HTH.
Cheers,
PA.


Re: MultiFieldQueryParser seems broken... Fix attached.

2004-09-08 Thread Bill Janssen
René,

Thanks for your note.

I'd think that if a user specified a query "cutting lucene", with an
implicit AND and the default fields "title" and "author", they'd
expect to see a match in which both "cutting" and "lucene" appear.  That is,

(title:cutting OR author:cutting) AND (title:lucene OR author:lucene)

Instead, what they'd get from the current (broken) strategy of outer
combination used by MultiFieldQueryParser is

(title:cutting OR title:lucene) AND (author:cutting OR author:lucene)

Note that this would match even if only "lucene" occurred in the
document, as long as it occurred both in the title field and in the
author field.  Or, for that matter, it would also match "Cutting on
Cutting", by Doug Cutting :-).

> http://issues.apache.org/eyebrowse/[EMAIL PROTECTED]&msgId=1798116

Yes, the approach there is similar.  I attempted to complete the
solution and provide a working replacement for MultiFieldQueryParser.

Bill




Re: IndexSearcher.close() and aborting searches in progress

2004-09-08 Thread Otis Gospodnetic
Dave,

I haven't tried this, but I think this would be messy.  Lucene needs to
keep index files open, so that when you pull a Document from Hits, it
can read this stuff from those files.  If you close index files, you
are likely to get some NPEs or some such.

I don't think you'll find a ready-to-use API for this use case in
Lucene.  Instead, my guess is that you will have to manually keep track
of your IndexSearcher's status (open/closed), and allow searches to
return results only while the status is open.
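A minimal sketch of that bookkeeping (illustrative only, not a Lucene API): wrap the searcher with a flag and discard results that complete after close() was called.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical wrapper: track open/closed status ourselves and only hand
// results back while the searcher is still considered open.
public class GuardedSearcher {
    private final AtomicBoolean open = new AtomicBoolean(true);

    public void close() {
        open.set(false); // searches completing after this point are discarded
    }

    public List<String> search(String query) {
        List<String> hits = runQuery(query); // may take a long time
        // Only return results if nobody closed us while we were searching.
        return open.get() ? hits : List.of();
    }

    private List<String> runQuery(String query) {
        return List.of("hit-1", "hit-2"); // stand-in for the real search
    }

    public static void main(String[] args) {
        GuardedSearcher s = new GuardedSearcher();
        System.out.println(s.search("lucene")); // [hit-1, hit-2]
        s.close();
        System.out.println(s.search("lucene")); // []
    }
}
```

Note this does not interrupt the running query itself; it only prevents a superseded search from delivering its results, which matches the "one search at a time" use case.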

Otis


--- David Spencer <[EMAIL PROTECTED]> wrote:

>
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/IndexSearcher.html#close()
> 
> What is the intent of IndexSearcher.close()?
> 
> I want to know how, in a web app, one can stop a search that's in 
> progress - use case is a user is limited to one search at at time,
> and 
> when one (expensive) search is running they decide it's taking too
> long 
> so they elaborate on the query and resubmit it. Goal is for the
> server 
> to stop the search that's in progress and to start a new one. I know
> how 
> to deal w/ session vars and so on in a web container - but can one
> stop 
> a search that's in progress and is that the intent of close()?
> 
> I haven't done the obvious experiment but regardless, the javadoc is 
> kinda terse so I wanted to hear from the all knowing people on the
> list.
> 
> thx,
>Dave
> 
> 



RE: Compound File Format question

2004-09-08 Thread Armbrust, Daniel C.
Ahh - two new discoveries:

You have to add a document, remove a document, and then call optimize.   Then
everything works (nearly as expected).

The version of Lucene that ships with Luke still has the broken optimize code in it
that didn't clean up after itself - so you need to download Luke, and then run it
with Lucene 1.4.1, rather than the version it ships with (which the website indicates
is 1.4 RC4).


Dan




Re: Full web search engine package using Lucene

2004-09-08 Thread Anne Y. Zhang
Thanks a lot!

Ya
- Original Message - 
From: "Bernhard Messer" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Wednesday, September 08, 2004 3:38 PM
Subject: Re: Full web search engine package using Lucene


> Anne Y. Zhang wrote:
> 
> >Thanks, David. But it seems that this is downloadable. 
> >Could you please provide me the link for download?
> >Thank you very much!
> >  
> >
> http://www.nutch.org/release/
> 
> >Ya
> >- Original Message - 
> >From: "David Spencer" <[EMAIL PROTECTED]>
> >To: "Lucene Users List" <[EMAIL PROTECTED]>
> >Sent: Wednesday, September 08, 2004 2:43 PM
> >Subject: Re: Full web search engine package using Lucene
> >
> >
> >  
> >
> >>Anne Y. Zhang wrote:
> >>
> >>
> >>
> >>>Hi, I am assistanting a professor for a IR course. 
> >>>We need to provide the student with a full-fuctioned
> >>>search engine package, and the professor prefers it
> >>>being powered by lucene. Since I am new to lucene,
> >>>can anyone provide me some information that where
> >>>can I get the package? We also want the package 
> >>>contains the crawling function. Thank you very much!
> >>>  
> >>>
> >>http://www.nutch.org/
> >>
> >>
> >>
> >>>Ya
> >>>
> >>>
> >>>
> >>
> >>
> >>
> >>
> >
> >
> >
> >  
> >
> 
> 





RE: Compound File Format question

2004-09-08 Thread Armbrust, Daniel C.
Hmm, I tried that in Luke - but it doesn't seem to take.  When I uncheck the use 
compound file check box, and then select optimize, it doesn't change anything.

I guess I should just write some code already :)

Dan
 

-Original Message-
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, September 08, 2004 2:37 PM
To: Lucene Users List
Subject: Re: Compound File Format question

Armbrust, Daniel C. wrote:

> Is it safe to change the compound file format option at any time during the life of 
> an index?
> 
> Can I build an index with it off, then turn it on, and call optimize, and have a 
> compound file formatted index?
> 
> And then later, turn it on, call optimize again, and go back the other way?

In my experience it's safe. I've been doing this in a couple of real
applications, and Luke also has an option to re-pack the index
with or without the compound format.

-- 
Best regards,
Andrzej Bialecki

-
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-
FreeBSD developer (http://www.freebsd.org)




Re: Full web search engine package using Lucene

2004-09-08 Thread Bernhard Messer
Anne Y. Zhang wrote:
Thanks, David. But it seems that this should be downloadable.
Could you please provide me with the download link?
Thank you very much!
 

http://www.nutch.org/release/
Ya
- Original Message - 
From: "David Spencer" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Wednesday, September 08, 2004 2:43 PM
Subject: Re: Full web search engine package using Lucene

 

Anne Y. Zhang wrote:
   

Hi, I am assisting a professor with an IR course.
We need to provide the students with a fully-functional
search engine package, and the professor prefers it
to be powered by Lucene. Since I am new to Lucene,
can anyone tell me where
I can get such a package? We also want the package
to include crawling functionality. Thank you very much!
 

http://www.nutch.org/
   

Ya



 




Re: Compound File Format question

2004-09-08 Thread Andrzej Bialecki
Armbrust, Daniel C. wrote:
Is it safe to change the compound file format option at any time during the life of an 
index?
Can I build an index with it off, then turn it on, and call optimize, and have a 
compound file formatted index?
And then later, turn it on, call optimize again, and go back the other way?
In my experience it's safe. I've been doing this in a couple of real
applications, and Luke also has an option to re-pack the index
with or without the compound format.

--
Best regards,
Andrzej Bialecki
-
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-
FreeBSD developer (http://www.freebsd.org)


Re: Full web search engine package using Lucene

2004-09-08 Thread Anne Y. Zhang
Thanks, David. But it seems that this should be downloadable.
Could you please provide me with the download link?
Thank you very much!

Ya
- Original Message - 
From: "David Spencer" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Wednesday, September 08, 2004 2:43 PM
Subject: Re: Full web search engine package using Lucene


> Anne Y. Zhang wrote:
> 
> > Hi, I am assistanting a professor for a IR course. 
> > We need to provide the student with a full-fuctioned
> > search engine package, and the professor prefers it
> > being powered by lucene. Since I am new to lucene,
> > can anyone provide me some information that where
> > can I get the package? We also want the package 
> > contains the crawling function. Thank you very much!
> 
> http://www.nutch.org/
> 
> > 
> > Ya
> > 
> > 
> > 



Compound File Format question

2004-09-08 Thread Armbrust, Daniel C.
Is it safe to change the compound file format option at any time during the life of an 
index?

Can I build an index with it off, then turn it on, and call optimize, and have a 
compound file formatted index?

And then later, turn it on, call optimize again, and go back the other way?

The JavaDocs don't say much of anything about it (oh - and PS - there is a copy and 
paste error in the description for the getUseCompoundFile() method)

Thanks, 

Dan





Re: Full web search engine package using Lucene

2004-09-08 Thread David Spencer
Anne Y. Zhang wrote:
Hi, I am assisting a professor with an IR course.
We need to provide the students with a fully-functional
search engine package, and the professor prefers it
to be powered by Lucene. Since I am new to Lucene,
can anyone tell me where
I can get such a package? We also want the package
to include crawling functionality. Thank you very much!
http://www.nutch.org/
Ya



Full web search engine package using Lucene

2004-09-08 Thread Anne Y. Zhang
Hi, I am assisting a professor with an IR course.
We need to provide the students with a fully-functional
search engine package, and the professor prefers it
to be powered by Lucene. Since I am new to Lucene,
can anyone tell me where
I can get such a package? We also want the package
to include crawling functionality. Thank you very much!

Ya






IndexSearcher.close() and aborting searches in progress

2004-09-08 Thread David Spencer
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/IndexSearcher.html#close()
What is the intent of IndexSearcher.close()?
I want to know how, in a web app, one can stop a search that's in 
progress - use case is a user is limited to one search at a time, and 
when one (expensive) search is running they decide it's taking too long 
so they elaborate on the query and resubmit it. Goal is for the server 
to stop the search that's in progress and to start a new one. I know how 
to deal w/ session vars and so on in a web container - but can one stop 
a search that's in progress and is that the intent of close()?

I haven't done the obvious experiment but regardless, the javadoc is
kinda terse, so I wanted to hear from the all-knowing people on the list.

thx,
  Dave


Re: where is the SnowBallAnalyzer?

2004-09-08 Thread Ernesto De Santis
It is in snowball-1.0.jar.

I sent it to you in a private email.

Bye,
Ernesto.

- Original Message - 
From: "Wermus Fernando" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Wednesday, September 08, 2004 1:12 PM
Subject: where is the SnowBallAnalyzer?


I have to look more carefully, but why isn't the SnowballAnalyzer in the
org.apache.lucene.analysis.snowball package?

I have Lucene 1.4.

I'm writing my own Spanish stemmer.
 
 
 







Re: PDF->Text Performance comparison

2004-09-08 Thread Ben Litchfield

Yes, that and a few other adjectives, but I didn't want to get carried
away.

Ben


On Wed, 8 Sep 2004, Doug Cutting wrote:

> Ben Litchfield wrote:
> > PDFBox: slow PDF text extraction for Java applications
> > http://www.pdfbox.org
>
> Shouldn't that read, "PDFBox: *free* slow PDF text extraction for Java
> applications, with Lucene integration"?
>
> Doug
>



Re: PDF->Text Performance comparison

2004-09-08 Thread Doug Cutting
Ben Litchfield wrote:
PDFBox: slow PDF text extraction for Java applications
http://www.pdfbox.org
Shouldn't that read, "PDFBox: *free* slow PDF text extraction for Java 
applications, with Lucene integration"?

Doug


where is the SnowBallAnalyzer?

2004-09-08 Thread Wermus Fernando
I have to look more carefully, but why isn't the SnowballAnalyzer in the
org.apache.lucene.analysis.snowball package?

I have Lucene 1.4.

I'm writing my own Spanish stemmer.
 
 
 


RE: -- TomCat/Lucene, filesystem

2004-09-08 Thread Will Allen
I think you might be referring to the XML files you keep in C:\Program
Files\Apache\Tomcat\conf\Catalina\localhost

I have a file with the contents (myapp.xml):







-Original Message-
From: Rupinder Singh Mazara [mailto:[EMAIL PROTECTED]
Sent: Tuesday, August 31, 2004 12:36 PM
To: Lucene Users List; [EMAIL PROTECTED]
Subject: RE: -- TomCat/Lucene, filesystem


I have a web application using Lucene via Tomcat;
you may need to set
the correct permissions in your catalina.policy file.

I use a blanket policy of

grant  {
   permission java.io.FilePermission   "/","read";
};

to allow access to Lucene.
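A note on scope: a blanket grant like the one above lets every codebase read the entire filesystem. If the index lives in one known directory, a narrower hypothetical policy entry (both paths here are illustrative, not from the original message) is usually enough:

```
grant codeBase "file:${catalina.home}/webapps/myapp/-" {
    permission java.io.FilePermission "/var/lucene/index/-", "read,write";
};
```

The trailing "/-" in a FilePermission target matches the directory and everything beneath it recursively.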


>-Original Message-
>From: J.Ph DEGLETAGNE [mailto:[EMAIL PROTECTED]
>Sent: 31 August 2004 17:12
>To: [EMAIL PROTECTED]; [EMAIL PROTECTED]
>Subject: -- TomCat/Lucene, filesystem
>
>
>Hello Somebody, 
> 
>..I beg your pardon... 
> 
>Under Windows XP / TomCat, 
> 
>How to "customize"  Webapp Lucene to access directory filesystem which are
>outside TomCat ?
>like this :
>D:\Program Files\Apache Software Foundation\Tomcat 5.0\..
>to access
>E:\Data
> 
>Thank's a lot
> 
>JPhD
>




Re: Moving from a single server to a cluster

2004-09-08 Thread Praveen Peddi
We went through the same scenario as yours. We recently made our application
clusterable, and I wrote our own version of a JDBC directory (similar to the
SQLDirectory posted by someone) with our own caching. It was great for
searching, but indexing had become a real bottleneck. So we have decided to
move back to the file system for non-clustered apps. I am still trying to
figure out the best way (whether to use a RemoteSearcher or manage multiple
indexes). I already tried multiple indexes, and we didn't really like the
solution of maintaining multiple copies. It requires more space, more
maintenance, all indexes need to be kept in sync, etc.

I will be glad if I can get the best answer for this. Did anyone try
RemoteSearchable, and how does it compare to the multiple-index solution?

 Nader: I would appreciate if you can send me the docs.

Praveen

- Original Message - 
From: "David Townsend" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
Sent: Wednesday, September 08, 2004 10:42 AM
Subject: RE: Moving from a single server to a cluster


Would it be cheeky to ask you to post the docs to the group?  It would be
interesting to read how you've tackled this.

-Original Message-
From: Nader Henein [mailto:[EMAIL PROTECTED]
Sent: 08 September 2004 13:57
To: Lucene Users List
Subject: Re: Moving from a single server to a cluster


Hey Ben,

We've been using a distributed environment with three servers and three
separate indices for the past 2 years, since the first stable Lucene
release, and it has been great. For the past two months I've
been working on a redesign for our Lucene app, and I've shared my
findings and plans with Otis, Doug and Erik; they pointed out a few
faults in my logic which you will probably come across soon enough, and
which mainly have to do with keeping your updates atomic (not too hard) and
your deletes atomic (a little more tricky). Give me a few days and I'll
send you both the early document and the newer version that deals
squarely with Lucene in a distributed environment with a high-volume index.

Regards.

Nader Henein

Ben Sinclair wrote:

>My application currently uses Lucene with an index living on the
>filesystem, and it works fine. I'm moving to a clustered environment
>soon and need to figure out how to keep my indexes together. Since the
>index is on the filesystem, each machine in the cluster will end up
>with a different index.
>
>I looked into JDBC Directory, but it's not tested under Oracle and
>doesn't seem like a very mature project.
>
>What are other people doing to solve this problem?
>
>
>




Re: Moving from a single server to a cluster

2004-09-08 Thread Nader Henein
It would be a pleasure; I just didn't want to lead anyone down the wrong path.
Give me a few days and I'll have the new version up.
Nader


Re: PDF->Text Performance comparison

2004-09-08 Thread Chas Emerick
Ben,
Wow, thanks for the plug! :-)
Truthfully, I was worried that our open-source brethren might feel 
slighted by the comparison -- that's partially why we wanted to make 
sure it was as thorough and transparent as possible so that anyone 
could review the results for themselves.  I'm glad that you're not at 
all sore.

Chas Emerick   |   [EMAIL PROTECTED]
PDFTextStream: fast PDF text extraction for Java applications
http://snowtide.com/home/PDFTextStream/
On Sep 8, 2004, at 10:41 AM, Ben Litchfield wrote:
On Wed, 8 Sep 2004, Chas Emerick wrote:
PDFTextStream: fast PDF text extraction for Java applications
http://snowtide.com/home/PDFTextStream/

For those that have not seen, snowtide.com has done a performance
comparison against several Java PDF->Text libraries, including 
Snowtide's
PDFTextStream, PDFBox, Etymon PJ and JPedal.  It appears to be fairly 
well
done.

http://snowtide.com/home/PDFTextStream/Performance
PDFBox: slow PDF text extraction for Java applications
http://www.pdfbox.org
:)
Ben



RE: Moving from a single server to a cluster

2004-09-08 Thread David Townsend
Would it be cheeky to ask you to post the docs to the group?  It would be interesting 
to read how you've tackled this.

-Original Message-
From: Nader Henein [mailto:[EMAIL PROTECTED]
Sent: 08 September 2004 13:57
To: Lucene Users List
Subject: Re: Moving from a single server to a cluster


Hey Ben,

We've been using a distributed environment with three servers and three
separate indices for the past 2 years, since the first stable Lucene
release, and it has been great. For the past two months I've
been working on a redesign for our Lucene app, and I've shared my
findings and plans with Otis, Doug and Erik; they pointed out a few
faults in my logic which you will probably come across soon enough, and
which mainly have to do with keeping your updates atomic (not too hard) and
your deletes atomic (a little more tricky). Give me a few days and I'll
send you both the early document and the newer version that deals
squarely with Lucene in a distributed environment with a high-volume index.

Regards.

Nader Henein

Ben Sinclair wrote:

>My application currently uses Lucene with an index living on the
>filesystem, and it works fine. I'm moving to a clustered environment
>soon and need to figure out how to keep my indexes together. Since the
>index is on the filesystem, each machine in the cluster will end up
>with a different index.
>
>I looked into JDBC Directory, but it's not tested under Oracle and
>doesn't seem like a very mature project.
>
>What are other people doing to solve this problem?
>
>  
>




PDF->Text Performance comparison

2004-09-08 Thread Ben Litchfield

On Wed, 8 Sep 2004, Chas Emerick wrote:
> PDFTextStream: fast PDF text extraction for Java applications
> http://snowtide.com/home/PDFTextStream/


For those that have not seen, snowtide.com has done a performance
comparison against several Java PDF->Text libraries, including Snowtide's
PDFTextStream, PDFBox, Etymon PJ and JPedal.  It appears to be fairly well
done.

http://snowtide.com/home/PDFTextStream/Performance


PDFBox: slow PDF text extraction for Java applications
http://www.pdfbox.org

:)

Ben





Re: pdf in Chinese

2004-09-08 Thread Chas Emerick
I'm not aware of any Java library that can reliably extract Chinese 
text from PDF documents.  We're planning on supporting Chinese, 
Japanese, and Korean in version 2 of PDFTextStream, but there's no 
doubt that it's a huge challenge.

Chas Emerick   |   [EMAIL PROTECTED]
PDFTextStream: fast PDF text extraction for Java applications
http://snowtide.com/home/PDFTextStream/
On Sep 8, 2004, at 5:58 AM, [EMAIL PROTECTED] wrote:
It is not about the analyzer; I need to read the text from the PDF file first.
- Original Message -
From: "Chandan Tamrakar" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Wednesday, September 08, 2004 4:15 PM
Subject: Re: pdf in Chinese

Which analyzer are you using to index Chinese PDF documents?
I think you should use the CJKAnalyzer.
- Original Message -
From: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Wednesday, September 08, 2004 11:27 AM
Subject: pdf in Chinese

Hi all,
I use PDFBox to parse PDF files into Lucene documents. When I parse Chinese
PDF files, PDFBox does not always succeed.
Does anyone have some advice?


Re: MultiFieldQueryParser seems broken... Fix attached.

2004-09-08 Thread sergiu gordea
The class is at the end of the message.
But I think that a better solution is the one suggested by René: 

http://issues.apache.org/eyebrowse/[EMAIL PROTECTED]&msgId=1798116
Wermus Fernando wrote:
Bill,
I don't receive any .java. Could you send it again?
Thanks.
-Original Message-
From: Bill Janssen [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, September 07, 2004 10:06 PM
To: Lucene Users List
CC: Ali Rouhi
Subject: MultiFieldQueryParser seems broken... Fix attached.

Hi!
I'm using Lucene for an application which has lots of fields/document,
in which the users can specify in their config files what fields they
wish to be included by default in a search.  I'd been happily using
MultiFieldQueryParser to do the searches, but the darn users started
demanding more Google-like searches; that is, they want the search
terms to be implicitly AND-ed instead of implicitly OR-ed.  No
problem, thinks I, I'll just set the "operator".
Only to find this has no effect on MultiFieldQueryParser.
Once I looked at the code, I find that MultiFieldQueryParser combines
the clauses at the wrong level -- it combines them at the outermost
level instead of the innermost level.  This means that if you have two
fields, "author" and "title", and the search string "cutting lucene",
you'll get the final query
  (title:cutting title:lucene) (author:cutting author:lucene)
If the search operator is "OR", this isn't a problem.  But if it's "AND",
you have two problems.  The first is that MultiFieldQueryParser seems
to ignore the operator entirely.  But even if it didn't, the second
problem is that the query formed would be
  +(title:cutting title:lucene) +(author:cutting author:lucene)
That is, if the word "Lucene" was in both the author field and the
title field, the match would fit.  This clearly isn't what the
searcher intended.
You can re-write MultiFieldQueryParser, as I've done in the example
code which I append here.  This little program allows you to run
either my parser (-DSearchTest.QueryParser=new) or the old parser
(-DSearchTest.QueryParser=old).  It allows you to use either OR
(-DSearchTest.QueryDefaultOperator=or) or AND
(-DSearchTest.QueryDefaultOperator=and) as the operator.  And it
allows you to pick your favorite set of default search terms
(-DSearchTest.QueryDefaultFields=author:title:body, for example).  It
takes one argument, a query string, and outputs the re-written query
after running it through the query parser.  So to evaluate the above
query:
% java -classpath /import/lucene/lucene-1.4.1.jar:. \
  -DSearchTest.QueryDefaultFields="title:author" \
  -DSearchTest.QueryDefaultOperator=AND \
  -DSearchTest.QueryParser=old \
  SearchTest "cutting lucene"
query is (title:cutting title:lucene) (author:cutting author:lucene)
%
The class NewMultiFieldQueryParser does the combination at the inner
level, using an override of "addClause", instead of the outer level.
Note that it can't cover all cases (notably PhrasePrefixQuery, because
that class has no access methods which allow one to introspect over
it, and SpanQueries, because I don't understand them well enough :-).
I post it here in advance of filing a formal bug report for early
feedback.  But it will show up in a bug report in the near future.
Running the above query with the new parser gives:
% java -classpath /import/lucene/lucene-1.4.1.jar:. \
  -DSearchTest.QueryDefaultFields="title:author" \
  -DSearchTest.QueryDefaultOperator=AND \
  -DSearchTest.QueryParser=new \
  SearchTest "cutting lucene"
query is +(title:cutting author:cutting) +(title:lucene author:lucene)
%
which I claim is what the user is expecting.
In addition, the new class uses an API more similar to QueryParser, so
that the user has less to learn when using it.  The code in it could
probably just be folded into QueryParser, in fact.
Bill
the code for SearchTest:
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.document.Document;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.WildcardQuery;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.RangeQuery;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Hits;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.queryParser.MultiFieldQueryParser;
import org.apache.lucene.queryParser.FastCharStream;
import org.apache.lucene.queryParser.TokenMgrError;
import org.apache.lucene.queryParser.ParseException;
i

Re: Moving from a single server to a cluster

2004-09-08 Thread Nader Henein
Hey Ben,
We've been using a distributed environment with three servers and three 
separate indices for the past two years, since the first stable Lucene 
release, and it has been great.  Recently, for the past two months, I've 
been working on a redesign of our Lucene app, and I've shared my 
findings and plans with Otis, Doug and Erik.  They pointed out a few 
faults in my logic, which you will probably come across soon enough; 
these mainly have to do with keeping your updates atomic (not too hard) 
and your deletes atomic (a little more tricky).  Give me a few days and 
I'll send you both the early document and the newer version that deals 
squarely with Lucene in a distributed environment with a high-volume index.

Regards.
Nader Henein
Ben Sinclair wrote:
My application currently uses Lucene with an index living on the
filesystem, and it works fine. I'm moving to a clustered environment
soon and need to figure out how to keep my indexes together. Since the
index is on the filesystem, each machine in the cluster will end up
with a different index.
I looked into JDBC Directory, but it's not tested under Oracle and
doesn't seem like a very mature project.
What are other people doing to solve this problem?
 



RE: MultiFieldQueryParser seems broken... Fix attached.

2004-09-08 Thread Wermus Fernando
Bill,
I didn't receive any .java. Could you send it again?

Thanks.

-Mensaje original-
De: Bill Janssen [mailto:[EMAIL PROTECTED] 
Enviado el: Martes, 07 de Septiembre de 2004 10:06 p.m.
Para: Lucene Users List
CC: Ali Rouhi
Asunto: MultiFieldQueryParser seems broken... Fix attached.

Hi!

I'm using Lucene for an application which has lots of fields/document,
in which the users can specify in their config files what fields they
wish to be included by default in a search.  I'd been happily using
MultiFieldQueryParser to do the searches, but the darn users started
demanding more Google-like searches; that is, they want the search
terms to be implicitly AND-ed instead of implicitly OR-ed.  No
problem, thinks I, I'll just set the "operator".

Only to find this has no effect on MultiFieldQueryParser.

Once I looked at the code, I find that MultiFieldQueryParser combines
the clauses at the wrong level -- it combines them at the outermost
level instead of the innermost level.  This means that if you have two
fields, "author" and "title", and the search string "cutting lucene",
you'll get the final query

   (title:cutting title:lucene) (author:cutting author:lucene)

If the search operator is "OR", this isn't a problem.  But if it's "AND",
you have two problems.  The first is that MultiFieldQueryParser seems
to ignore the operator entirely.  But even if it didn't, the second
problem is that the query formed would be

   +(title:cutting title:lucene) +(author:cutting author:lucene)

That is, if the word "Lucene" was in both the author field and the
title field, the match would fit.  This clearly isn't what the
searcher intended.

You can re-write MultiFieldQueryParser, as I've done in the example
code which I append here.  This little program allows you to run
either my parser (-DSearchTest.QueryParser=new) or the old parser
(-DSearchTest.QueryParser=old).  It allows you to use either OR
(-DSearchTest.QueryDefaultOperator=or) or AND
(-DSearchTest.QueryDefaultOperator=and) as the operator.  And it
allows you to pick your favorite set of default search terms
(-DSearchTest.QueryDefaultFields=author:title:body, for example).  It
takes one argument, a query string, and outputs the re-written query
after running it through the query parser.  So to evaluate the above
query:

% java -classpath /import/lucene/lucene-1.4.1.jar:. \
   -DSearchTest.QueryDefaultFields="title:author" \
   -DSearchTest.QueryDefaultOperator=AND \
   -DSearchTest.QueryParser=old \
   SearchTest "cutting lucene"
query is (title:cutting title:lucene) (author:cutting author:lucene)
%

The class NewMultiFieldQueryParser does the combination at the inner
level, using an override of "addClause", instead of the outer level.
Note that it can't cover all cases (notably PhrasePrefixQuery, because
that class has no access methods which allow one to introspect over
it, and SpanQueries, because I don't understand them well enough :-).
I post it here in advance of filing a formal bug report for early
feedback.  But it will show up in a bug report in the near future.

Running the above query with the new parser gives:

% java -classpath /import/lucene/lucene-1.4.1.jar:. \
   -DSearchTest.QueryDefaultFields="title:author" \
   -DSearchTest.QueryDefaultOperator=AND \
   -DSearchTest.QueryParser=new \
   SearchTest "cutting lucene"
query is +(title:cutting author:cutting) +(title:lucene author:lucene)
%

which I claim is what the user is expecting.

In addition, the new class uses an API more similar to QueryParser, so
that the user has less to learn when using it.  The code in it could
probably just be folded into QueryParser, in fact.

Bill
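The difference between the two combination levels can be illustrated without Lucene at all. The sketch below (plain Java; the class and method names are mine, purely for illustration, not Lucene's API) builds both query strings from the same fields and terms:

```java
import java.util.List;
import java.util.StringJoiner;

public class CombinationDemo {

    // Outer combination (the old MultiFieldQueryParser behaviour):
    // one required clause per FIELD, OR-ing all the terms inside it.
    static String outer(List<String> fields, List<String> terms) {
        StringJoiner query = new StringJoiner(" ");
        for (String f : fields) {
            StringJoiner clause = new StringJoiner(" ", "+(", ")");
            for (String t : terms) clause.add(f + ":" + t);
            query.add(clause.toString());
        }
        return query.toString();
    }

    // Inner combination (Bill's fix): one required clause per TERM,
    // OR-ing all the fields inside it.
    static String inner(List<String> fields, List<String> terms) {
        StringJoiner query = new StringJoiner(" ");
        for (String t : terms) {
            StringJoiner clause = new StringJoiner(" ", "+(", ")");
            for (String f : fields) clause.add(f + ":" + t);
            query.add(clause.toString());
        }
        return query.toString();
    }

    public static void main(String[] args) {
        List<String> fields = List.of("title", "author");
        List<String> terms  = List.of("cutting", "lucene");
        System.out.println(outer(fields, terms));
        // +(title:cutting title:lucene) +(author:cutting author:lucene)
        System.out.println(inner(fields, terms));
        // +(title:cutting author:cutting) +(title:lucene author:lucene)
    }
}
```

With fields {title, author} and terms {cutting, lucene}, outer() yields the form Bill describes as broken, and inner() the form the searcher expects.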


the code for SearchTest:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.document.Document;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.WildcardQuery;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.RangeQuery;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Hits;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.queryParser.MultiFieldQueryParser;
import org.apache.lucene.queryParser.FastCharStream;
import org.apache.lucene.queryParser.TokenMgrError;
import org.apache.lucene.queryParser.ParseException;

import java.io.File;
import java.io.StringReader;
import java.util.Date;
import java.util.HashMap;
import java.util.Iterator;
import java.util.StringTokenizer;

class S

Re: pdf in Chinese

2004-09-08 Thread Ben Litchfield

This appears to be more of a PDFBox issue than a Lucene issue; please post
an issue on the PDFBox site.

Also note that, because of certain encodings a PDF writer can use, it is
not possible to extract text from every PDF document.

Ben

On Wed, 8 Sep 2004, [EMAIL PROTECTED] wrote:

> It is not about the analyzer; I need to read the text from the PDF file first.
>
> - Original Message -
> From: "Chandan Tamrakar" <[EMAIL PROTECTED]>
> To: "Lucene Users List" <[EMAIL PROTECTED]>
> Sent: Wednesday, September 08, 2004 4:15 PM
> Subject: Re: pdf in Chinese
>
>
> > Which analyzer are you using to index Chinese PDF documents?
> > I think you should use CJKAnalyzer
> > - Original Message -
> > From: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
> > To: <[EMAIL PROTECTED]>
> > Sent: Wednesday, September 08, 2004 11:27 AM
> > Subject: pdf in Chinese
> >
> >
> > > Hi all,
> > > I use PDFBox to parse PDF files into Lucene documents.  When I parse
> > > Chinese PDF files, PDFBox does not always succeed.
> > > Does anyone have some advice?
> > >
> > >



Re: *term search

2004-09-08 Thread iouli . golovatyi

.. and here is the way to do it:
(See attached file: SUPPOR~1.RAR)



   
   
Erik Hatcher <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
cc: (bcc: Iouli Golovatyi/X/GP/Novartis)
08.09.2004 12:46
Subject: Re: *term search




On Sep 8, 2004, at 6:26 AM, sergiu gordea wrote:
> I want to discuss a little problem, lucene doesn't support *Term like
> queries.

First of all, this is untrue.  WildcardQuery itself most definitely
supports wildcards at the beginning.

> I would like to use "*schreiben".

The dilemma you've encountered is that QueryParser prevents queries
that begin with a wildcard.

> So my question is if there is a simple solution for implementing the
> functionality mentioned above.
> Maybe subclassing one class and overriding some methods will suffice.

It will require more than that in this case.  You will need to create a
custom parser that allows the grammar you'd like.  Feel free to use the
JavaCC source code to QueryParser as a basis of your customizations.

 Erik



Re: *term search

2004-09-08 Thread Morus Walter
sergiu gordea writes:
> 
> 
>  Hi all,
> 
> I want to discuss a little problem, lucene doesn't support *Term like 
> queries.
> I know that this can bring a lot of results in the memory and therefore 
> it is restricted.
> 
That's not the reason for the restriction; that's possible with a* also.
The problem is that Lucene has to check all terms to see whether they end
with "Term", which makes the performance pretty poor.
A prefix allows the search to be restricted efficiently to words with this
prefix, since the word list is ordered.
> 
>  So my question is if there is a simple solution for implementing the 
> functionality mentioned above.
Sure.
Just follow the way, wildcard query is implemented.

Actually I'm not sure if the restriction you mention is in the wildcard
query itself or only in the query parser. In the latter case, you might
just create the query yourself.

A better way for postfix queries is to create an additional search field
in which all words are reversed, and to search for mreT* on that field.

How important such an optimization is depends on the size of your index.

Morus
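Morus's reversed-field trick can be sketched outside Lucene: store each word reversed in a sorted list, and a suffix query becomes a prefix lookup on that list (binary search to the first candidate, then scan while the prefix still matches). Plain Java with hypothetical names; a real index would put the reversed tokens into a separate Lucene field instead:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SuffixIndexDemo {

    static String reverse(String s) {
        return new StringBuilder(s).reverse().toString();
    }

    // The "index": reversed forms of every word, kept sorted.
    private final List<String> reversedTerms = new ArrayList<>();

    void add(String word) {
        String rev = reverse(word);
        int pos = Collections.binarySearch(reversedTerms, rev);
        if (pos < 0) reversedTerms.add(-pos - 1, rev);  // keep sorted order
    }

    // A suffix query *suffix is a prefix query on the reversed terms:
    // binary-search to the first candidate, then scan while the prefix holds.
    List<String> endingWith(String suffix) {
        String prefix = reverse(suffix);
        List<String> hits = new ArrayList<>();
        int pos = Collections.binarySearch(reversedTerms, prefix);
        int start = pos >= 0 ? pos : -pos - 1;
        for (int i = start; i < reversedTerms.size()
                && reversedTerms.get(i).startsWith(prefix); i++) {
            hits.add(reverse(reversedTerms.get(i)));
        }
        return hits;
    }

    public static void main(String[] args) {
        SuffixIndexDemo idx = new SuffixIndexDemo();
        for (String w : new String[] {"schreiben", "beschreiben",
                "verschreiben", "lucene"}) idx.add(w);
        System.out.println(idx.endingWith("schreiben"));
        // [schreiben, beschreiben, verschreiben]
    }
}
```

The payoff is exactly what Morus describes: instead of checking every term in the index, only the terms sharing the reversed prefix are touched.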





Re: *term search

2004-09-08 Thread Erik Hatcher
On Sep 8, 2004, at 6:26 AM, sergiu gordea wrote:
> I want to discuss a little problem, lucene doesn't support *Term like
> queries.
First of all, this is untrue.  WildcardQuery itself most definitely 
supports wildcards at the beginning.

> I would like to use "*schreiben".
The dilemma you've encountered is that QueryParser prevents queries 
that begin with a wildcard.

> So my question is if there is a simple solution for implementing the
> functionality mentioned above.
> Maybe subclassing one class and overriding some methods will suffice.
It will require more than that in this case.  You will need to create a 
custom parser that allows the grammar you'd like.  Feel free to use the 
JavaCC source code to QueryParser as a basis for your customizations.

Erik


*term search

2004-09-08 Thread sergiu gordea

Hi all,
I want to discuss a little problem: Lucene doesn't support *Term-like
queries.
I know that this kind of query can bring a lot of results into memory,
and therefore it is restricted.

I think that allowing this kind of search and limiting the number of
returned results would be a more useful approach, since the German
language has a lot of words that are concatenated, or derived from
other words by adding a prefix.

I'm not a good German speaker, but I'd say that maybe half of the
German words belong to the category described above.

For example:
Himbeer, Erdbeer, Johannisbeer -- all of them are fruits from a certain
category, so it would make sense to search for "*beer".  Also, when I
know that a word ends in "beer" but I don't know the exact word,
"*beer" will help me a lot.

Also:
schreiben = to write
beschreiben = to describe
verschreiben = to prescribe
I would like to use "*schreiben".
So my question is whether there is a simple solution for implementing
the functionality mentioned above.
Maybe subclassing one class and overriding some methods will suffice.

Thanks in advance,
Sergiu



Re: pdf in Chinese

2004-09-08 Thread [EMAIL PROTECTED]
It is not about the analyzer; I need to read the text from the PDF file first.

- Original Message - 
From: "Chandan Tamrakar" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Wednesday, September 08, 2004 4:15 PM
Subject: Re: pdf in Chinese


> Which analyzer are you using to index Chinese PDF documents?
> I think you should use CJKAnalyzer
> - Original Message - 
> From: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
> To: <[EMAIL PROTECTED]>
> Sent: Wednesday, September 08, 2004 11:27 AM
> Subject: pdf in Chinese
> 
> 
> > Hi all,
> > I use PDFBox to parse PDF files into Lucene documents.  When I parse
> > Chinese PDF files, PDFBox does not always succeed.
> > Does anyone have some advice?
> >
> >



Re: Use of explain() vs search()

2004-09-08 Thread Erik Hatcher
Could you create a simple piece of code (using a RAMDirectory) that 
demonstrates this issue?

Erik
On Sep 8, 2004, at 12:35 AM, Minh Kama Yie wrote:
> Hi all,
> Sorry, I should clarify my last point.
> The search() would return no hits, but the explain() using the
> apparently invalid docId returns a value greater than 0.
>
> For what it's worth, it's performing a PhraseQuery.
> Thanks in advance,
> Minh
>
> Minh Kama Yie wrote:
>> Hi all,
>> I was wondering if anyone could tell me what the expected behaviour
>> is for calling explain() without calling search() first on a
>> particular query.  Would it effectively do a search, so that I can
>> examine the Explanation in order to check whether it matches?
>>
>> I'm currently looking at some existing code to this effect:
>> Explanation exp = searcher.explain(myQuery, docId);
>> // where docId was _not_ returned by a search on myQuery
>> if (exp.getValue() > 0.0f)
>> {
>>     // Assuming the document for docId matched the query.
>> }
>> Is the assumption wrong?
>> I ask because the result of this code is inconsistent with:
>> Hits h = searcher.search(myQuery);  // there are no hits returned.
>> Thanks in advance,
>> Minh



Re: MultiFieldQueryParser seems broken... Fix attached.

2004-09-08 Thread "René Hackl"
Hi Bill,

-
But even if it didn't, the second
problem is that the query formed would be

   +(title:cutting title:lucene) +(author:cutting author:lucene)

That is, if the word "Lucene" was in both the author field and the
title field, the match would fit.  This clearly isn't what the
searcher intended.
-
As far as my understanding of the query syntax goes, this would be
interpreted as (A OR B) AND (C OR D), which would produce the same set as
(A OR C) AND (B OR D) == +(title:cutting author:cutting) +(title:lucene
author:lucene).  But that would only be true for this special case with 2
terms and 2 fields.

I reckon there has been a discussion (and a solution :-) on how to achieve
the functionality you've been after:

http://issues.apache.org/eyebrowse/[EMAIL PROTECTED]&msgId=1798116

I'm not sure if this would be the same though.

Best regards,
René

-- 
Supergünstige DSL-Tarife + WLAN-Router für 0,- EUR*
Jetzt zu GMX wechseln und sparen http://www.gmx.net/de/go/dsl
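Even in the 2x2 case the two forms are not equivalent: a document carrying one of the terms in both fields satisfies the outer form but not the inner one (Bill's "Cutting on Cutting" example). A minimal sketch in plain Java; the class and the map-based document model are mine, purely for illustration:

```java
import java.util.Map;
import java.util.Set;

public class EquivalenceCheck {

    static boolean has(Map<String, Set<String>> doc, String field, String term) {
        return doc.getOrDefault(field, Set.of()).contains(term);
    }

    // Outer form: +(title:cutting title:lucene) +(author:cutting author:lucene)
    static boolean outer(Map<String, Set<String>> doc) {
        return (has(doc, "title", "cutting") || has(doc, "title", "lucene"))
            && (has(doc, "author", "cutting") || has(doc, "author", "lucene"));
    }

    // Inner form: +(title:cutting author:cutting) +(title:lucene author:lucene)
    static boolean inner(Map<String, Set<String>> doc) {
        return (has(doc, "title", "cutting") || has(doc, "author", "cutting"))
            && (has(doc, "title", "lucene") || has(doc, "author", "lucene"));
    }

    public static void main(String[] args) {
        // "lucene" appears in both fields; "cutting" appears in neither.
        Map<String, Set<String>> doc =
            Map.of("title", Set.of("lucene"), "author", Set.of("lucene"));
        System.out.println(outer(doc)); // true  -- the outer form matches
        System.out.println(inner(doc)); // false -- the user wanted both words
    }
}
```

The counterexample document matches the outer query even though it contains only one of the two search terms, which is exactly the complaint in this thread.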





RE: pdf in Chinese

2004-09-08 Thread Alex Kiselevski

Hi,
Can you please advise me of any solution for a Hebrew analyzer?

-Original Message-
From: Chandan Tamrakar [mailto:[EMAIL PROTECTED]
Sent: Wednesday, September 08, 2004 11:15 AM
To: Lucene Users List
Subject: Re: pdf in Chinese


Which analyzer are you using to index Chinese PDF documents?
I think you should use CJKAnalyzer
- Original Message -
From: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Wednesday, September 08, 2004 11:27 AM
Subject: pdf in Chinese


> Hi all,
> I use PDFBox to parse PDF files into Lucene documents.  When I parse
> Chinese PDF files, PDFBox does not always succeed.
> Does anyone have some advice?
>
>


The information contained in this message is proprietary of Amdocs,
protected from disclosure, and may be privileged.
The information is intended to be conveyed only to the designated recipient(s)
of the message. If the reader of this message is not the intended recipient,
you are hereby notified that any dissemination, use, distribution or copying of
this communication is strictly prohibited and may be unlawful.
If you have received this communication in error, please notify us immediately
by replying to the message and deleting it from your computer.
Thank you.




Re: pdf in Chinese

2004-09-08 Thread Chandan Tamrakar
Which analyzer are you using to index Chinese PDF documents?
I think you should use CJKAnalyzer
- Original Message - 
From: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Wednesday, September 08, 2004 11:27 AM
Subject: pdf in Chinese


> Hi all,
> I use PDFBox to parse PDF files into Lucene documents.  When I parse
> Chinese PDF files, PDFBox does not always succeed.
> Does anyone have some advice?
>
>