Re: Search Chinese in Unicode !!!

2005-01-21 Thread Eric Chow
I want that Chinese Analyzer !!


On Fri, 21 Jan 2005 17:36:17 +0100, Safarnejad, Ali (AFIS)
<[EMAIL PROTECTED]> wrote:
> I've written a Chinese Analyzer for Lucene that uses a segmenter written by
> Erik Peterson. However, as the author of the segmenter does not want his code
> released under apache open source license (although his code _is_
> opensource), I cannot place my work in the Lucene Sandbox.  This is
> unfortunate, because I believe the analyzer works quite well in indexing and
> searching chinese docs in GB2312 and UTF-8 encoding, and I like more people
> to test, use, and confirm this.  So anyone who wants it, can have it. Just
> shoot me an email.
> BTW, I also have written an arabic analyzer, which is collecting dust for
> similar reasons.
> Good luck,
> 
> Ali Safarnejad
> 
> 
> -Original Message-
> From: Eric Chow [mailto:[EMAIL PROTECTED]
> Sent: 21 January 2005 11:42
> To: Lucene Users List
> Subject: Re: Search Chinese in Unicode !!!
> 
> Search not really correct with UTF-8 !!!
> 
> The following is the search result that I used the SearchFiles in the lucene
> demo.
> 
> d:\Downloads\Softwares\Apache\Lucene\lucene-1.4.3\src>java
> org.apache.lucene.demo.SearchFiles c:\temp\myindex
> Usage: java SearchFiles 
> Query: ç
> Searching for: g  strange ??
> 3 total matching documents
> 0. ../docs/ChineseDemo.html    this file contains the ç
>   -
> 1. ../docs/luceneplan.html
>   - Jakarta Lucene - Plan for enhancements to Lucene
> 2. ../docs/api/index-all.html
>   - Index (Lucene 1.4.3 API)
> Query:
> 
> From the above result only the ChineseDemo.html includes the character that I
> want to search !
> 
> The modified code in SearchFiles.java:
> 
> BufferedReader in = new BufferedReader(new InputStreamReader(System.in,
> "UTF-8"));
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
>

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Document 'Context' & Relation to each other

2005-01-21 Thread Paul Smith
As a log4j developer, I've been toying with the idea of what Lucene 
could do for me, maybe as an excuse to play around with Lucene.

I've started creating a LoggingEvent->Document converter, and while thinking 
through how I'd like this utility to work I came across a question 
I wasn't sure about.

When scanning/searching through logging events, one is usually looking 
for a particular matching event, which Lucene finds excellently; but what 
a person usually also needs is the context around that matching logging 
event. 

With grep, one can use the "-C" argument to grep to provide 
X # of lines around the matching entry. I'd like to be able to do the 
same thing with Lucene.

Now, I could provide a Field to the LoggingEvent Document that has a 
sequence #, and once a user has chosen an appropriate matching event, do 
another search for the documents with a Sequence # between +/- the 
context size. 

My question is, is that going to be an efficient way to do this? The 
sequence # would be treated as text, wouldn't it?  Would the range 
search on an int be the most efficient way to do this?

I know from the Hits documentation that one can retrieve the Document ID 
of a matching entry.  What is the contract on this Document ID?  Is each 
Document added to the Index given an increasing number?  Can one search 
an index by Document ID?  Could one search for Document ID's between a 
range?   (Hope you can see where I'm going here).

If you have any other recommendations about "Context" searching I would 
appreciate any thoughts.
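
For reference, here is a minimal sketch of the sequence-number idea
(Lucene 1.4-era API from org.apache.lucene.document/search plus
java.text.DecimalFormat; the field name, padding width, and context size
are my own invention). Zero-padding matters because a Keyword field is
compared as text, not as a number:

    // Index time: store the event position zero-padded so that text
    // ordering matches numeric ordering.
    DecimalFormat pad = new DecimalFormat("000000000000");
    long eventSequence = 4711;                     // hypothetical value
    Document doc = new Document();
    doc.add(Field.Keyword("seq", pad.format(eventSequence)));

    // Search time: fetch the +/- 3 neighbours of the chosen hit,
    // like grep -C 3.
    long match = eventSequence;
    Query context = new RangeQuery(
            new Term("seq", pad.format(match - 3)),
            new Term("seq", pad.format(match + 3)),
            true);                                 // inclusive bounds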

Many thanks for an excellent API, and kudos to Erik & Otis for a great 
eBook btw.

regards,
Paul Smith
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Opening up one large index takes 940M of memory?

2005-01-21 Thread Chris Hostetter
: We have one large index right now... its about 60G ... When I open it
: the Java VM used 940M of memory.  The VM does nothing else besides open

Just out of curiosity, have you tried turning on the verbose gc log, and
putting in some thread sleeps after you open the reader, to see if the
memory footprint "settles down" after a little while?  You're currently
checking the memory usage immediately after opening the index, and some
of that memory may be used holding transient data that will get freed up
after some GC iterations.
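
Something along these lines (a rough sketch; run the VM with -verbose:gc)
would show whether the footprint settles:

    // Let the VM settle, nudge the collector, and re-sample memory.
    for (int i = 0; i < 5; i++) {
        System.gc();
        Thread.sleep(5000);   // main() can declare throws InterruptedException
        Runtime rt = Runtime.getRuntime();
        System.out.println("used after " + ((i + 1) * 5) + "s: "
                + (rt.totalMemory() - rt.freeMemory()));
    }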


: IndexReader ir = IndexReader.open( dir );
: System.out.println( ir.getClass() );
: long after = System.currentTimeMillis();
: System.out.println( "opening...done - duration: " +
: (after-before) );
:
: System.out.println( "totalMemory: " +
: Runtime.getRuntime().totalMemory() );
: System.out.println( "freeMemory: " +
: Runtime.getRuntime().freeMemory() );





-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Opening up one large index takes 940M of memory?

2005-01-21 Thread Kevin A. Burton
Kevin A. Burton wrote:
We have one large index right now... its about 60G ... When I open it 
the Java VM used 940M of memory.  The VM does nothing else besides 
open this index.
After thinking about it I guess 1.5% of memory per index really isn't 
THAT bad.  What would be nice is if there were a way to do this from disk 
and then use a buffer (either via the filesystem or in-VM memory) to 
access these variables.

This would be similar to the way the MySQL index cache works...
Kevin
--
Use Rojo (RSS/Atom aggregator).  Visit http://rojo.com. Ask me for an 
invite!  Also see irc.freenode.net #rojo if you want to chat.

Rojo is Hiring! - http://www.rojonetworks.com/JobsAtRojo.html
If you're interested in RSS, Weblogs, Social Networking, etc... then you 
should work for Rojo!  If you recommend someone and we hire them you'll 
get a free iPod!
   
Kevin A. Burton, Location - San Francisco, CA
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Opening up one large index takes 940M of memory?

2005-01-21 Thread Kevin A. Burton
We have one large index right now... it's about 60G ... When I open it 
the Java VM used 940M of memory.  The VM does nothing else besides open 
this index.

Here's the code:
   System.out.println( "opening..." );
   long before = System.currentTimeMillis();
   Directory dir = FSDirectory.getDirectory( 
"/var/ksa/index-1078106952160/", false );
   IndexReader ir = IndexReader.open( dir );
   System.out.println( ir.getClass() );
   long after = System.currentTimeMillis();
   System.out.println( "opening...done - duration: " + 
(after-before) );

   System.out.println( "totalMemory: " + 
Runtime.getRuntime().totalMemory() );
   System.out.println( "freeMemory: " + 
Runtime.getRuntime().freeMemory() );

Is there any way to reduce this footprint?  The index is fully 
optimized... I'm willing to take a performance hit if necessary.  Is 
this documented anywhere?

Kevin
--
Use Rojo (RSS/Atom aggregator).  Visit http://rojo.com. Ask me for an 
invite!  Also see irc.freenode.net #rojo if you want to chat.

Rojo is Hiring! - http://www.rojonetworks.com/JobsAtRojo.html
If you're interested in RSS, Weblogs, Social Networking, etc... then you 
should work for Rojo!  If you recommend someone and we hire them you'll 
get a free iPod!
   
Kevin A. Burton, Location - San Francisco, CA
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Stemming

2005-01-21 Thread Chris Lamprecht
Also if you can't wait, see page 2 of
http://www.onjava.com/pub/a/onjava/2003/01/15/lucene.html

or the LIA e-book ;)

On Fri, 21 Jan 2005 09:27:42 -0500, Kevin L. Cobb
<[EMAIL PROTECTED]> wrote:
> OK, OK ... I'll buy the book. I guess its about time since I am deeply
> and forever in love with Lucene. Might as well take the final plunge.
> 
> 
> -Original Message-
> From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
> Sent: Friday, January 21, 2005 9:12 AM
> To: Lucene Users List
> Subject: Re: Stemming
> 
> Hi Kevin,
> 
> Stemming is an optional operation and is done in the analysis step.
> Lucene comes with a Porter stemmer and a Filter that you can use in an
> Analyzer:
> 
> ./src/java/org/apache/lucene/analysis/PorterStemFilter.java
> ./src/java/org/apache/lucene/analysis/PorterStemmer.java
> 
> You can find more about it here:
> http://www.lucenebook.com/search?query=stemming
> You can also see mentions of SnowballAnalyzer in those search results,
> and you can find an adapter for SnowballAnalyzers in Lucene Sandbox.
> 
> Otis
> 
> --- "Kevin L. Cobb" <[EMAIL PROTECTED]> wrote:
> 
> > I want to understand how Lucene uses stemming but can't find any
> > documentation on the Lucene site. I'll continue to google but hope
> > that
> > this list can help narrow my search. I have several questions on the
> > subject currently but hesitate to list them here since finding a good
> > document on the subject may answer most of them.
> >
> >
> >
> > Thanks in advance for any pointers,
> >
> >
> >
> > Kevin
> >
> >
> >
> >
> >
> >
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
>

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Closed IndexWriter reuse

2005-01-21 Thread Oscar Picasso
> --- Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
> 
> > No, you can't add documents to an index once you close the IndexWriter.
> > You can re-open the IndexWriter and add more documents, of course.
> > 
> > Otis

After my previous post I made some further tests with multithreading, and it
does indeed randomly throw NullPointerExceptions and Lock exceptions when
reusing a closed IndexWriter.

My example was bad because it was based on a very simple single-threaded case.

But wouldn't it be safer if IndexWriter immediately raised an Exception when
one of its modifying methods is called after it has been closed?



__ 
Do you Yahoo!? 
Yahoo! Mail - 250MB free storage. Do more. Manage less. 
http://info.mail.yahoo.com/mail_250

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene and multiple languages

2005-01-21 Thread Ernesto De Santis
I send you the source code in a private mail.
Ernesto.
aurora wrote:
Thanks. I would like to give it a try. Is the source code available?
I'm using a Python version of Lucene so it would need to be wrapped
or ported :)

Hi Aurora
I developed a tool dealing with this multiple-languages issue. I found the
Nutch library "language-identifier" very useful. The jar has Nutch
dependencies, but I deleted all the code that was unnecessary for me.
The language-identifier I use works fine and is very simple.
For example:
LanguageIdentifier languageIdentifier = 
LanguageIdentifier.getInstance();
String userInputText = "free text";
String language = languageIdentifier.identify(userInputText);

This works for 11 languages: English, Spanish, Portuguese, Dutch, German,
French, Italian, and others.
I can send you this modified jar, but remember that this jar is from
Nutch, so mind the copyright (or copyleft :).
http://www.nutch.org/LICENSE.txt
More comments inline...
aurora wrote:
I'm trying to build a web search tool that could work for multiple
languages. I understand that Lucene ships with StandardAnalyzer plus
German and Russian analyzers, and some more in the sandbox, and that
indexing and searching should use the same analyzer.

Now let's say I have an index with documents in multiple languages,
analyzed by an assortment of analyzers. When a user enters a query, what
analyzer should be used? Should the user be asked for the language
upfront? What should one expect when the analyzer and the document don't
match? Let's say the query is parsed using StandardAnalyzer. Would it
match any documents indexed with the German analyzer at all, or would it
end up with poor results?

When this happens, in most cases you do not get matches.
Also, is there a good way to find out the languages used in a web page?
There is a 'content-language' header in HTTP and a 'lang' attribute in
HTML. It looks like people don't really use them. How can we recognize
the language?

With language identifier. :)
Even more interesting is multiple languages used in one document, let's
say half English and half French. Is there a good way to deal with those
cases?

The language identifier returns only one language. I looked into
language-identifier: it works with a score for each language and returns
the language with the greatest value.
Maybe you can modify language-identifier to return the several highest
values.
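
A sketch (my own glue code, not from the thread) of using the identified
language to pick a matching analyzer at index and query time:

    // Map ISO language codes to analyzers; fall back to StandardAnalyzer.
    Map analyzers = new HashMap();
    analyzers.put("en", new StandardAnalyzer());
    analyzers.put("de", new GermanAnalyzer());
    analyzers.put("ru", new RussianAnalyzer());

    String lang = LanguageIdentifier.getInstance().identify(userInputText);
    Analyzer analyzer = (Analyzer) analyzers.get(lang);
    if (analyzer == null) {
        analyzer = new StandardAnalyzer();  // unidentified: use the default
    }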
Bye
Ernesto.
Thanks for any guidance.
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: FOP Generated PDF and PDFBox

2005-01-21 Thread Luke Shannon
Thanks Ben. I know of no related issues now. For the time being I will be
using path. Once I get a chance I will try this on the command line as you
recommended.
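
In case it helps anyone else, the fallback looks roughly like this (it
assumes the "path" field is always populated, which may not hold for every
document):

    String url = doc.get("url");
    if (url == null) {
        url = doc.get("path");   // FOP PDFs: url missing, fall back to path
        if (url != null) {
            url = url.replace(File.separatorChar, '/');
        }
    }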

Luke

- Original Message - 
From: "Ben Litchfield" <[EMAIL PROTECTED]>
To: "Lucene Users List" 
Sent: Friday, January 21, 2005 1:05 PM
Subject: Re: FOP Generated PDF and PDFBox


>
>
> Ya, when calling LucenePDFDocument.getDocument( File ) then it should be
> the same as the path.
>
> This is the code that the class uses to set those fields.
>
> document.add( Field.UnIndexed("path", file.getPath() ) );
> document.add(Field.UnIndexed("url", file.getPath().replace(FILE_SEPARATOR,
> '/')));
>
> I have no idea why an FOP PDF would be any different than another PDF.
>
> You can also run it from the command line, this is just for debugging
> purposes like this.
>
> java org.pdfbox.searchengine.lucene.LucenePDFDocument 
>
> and it should print out the fields of the lucene Document object.  Is the
> url there and is it correct?
>
> Ben
>
> On Fri, 21 Jan 2005, Luke Shannon wrote:
>
> > That is correct. No difference with how other PDF are handled.
> >
> > I am looking at the index in Luke now. The FOP generated documents have
a
> > path but no URL? I would guess that these would be the same?
> >
> > Thanks for the speedy reply.
> >
> > Luke
> >
> >
> > - Original Message -
> > From: "Ben Litchfield" <[EMAIL PROTECTED]>
> > To: "Lucene Users List" 
> > Sent: Friday, January 21, 2005 12:34 PM
> > Subject: Re: FOP Generated PDF and PDFBox
> >
> >
> > >
> > >
> > > Are you indexing the FOP PDF's differently than other PDF documents?
> > >
> > > Can I assume that you are using PDFBox's
LucenePDFDocument.getDocument()
> > > method?
> > >
> > > Ben
> > >
> > > On Fri, 21 Jan 2005, Luke Shannon wrote:
> > >
> > > > Hello;
> > > >
> > > > Our CMS now allows users to create PDF documents (uses FOP) and than
> > search
> > > > them.
> > > >
> > > > I seem to be able to index these documents ok. But when I am
generating
> > the
> > > > results to display I get a Null Pointer Exception while trying to
use a
> > > > variable that should contain the url keyword for one of these
documents
> > in
> > > > the index:
> > > >
> > > > Document doc = hits.doc(i);
> > > > String path = doc.get("url");
> > > >
> > > > Path contains null.
> > > >
> > > > The interesting thing is this only happens with PDF that are
generate
> > with
> > > > FOP. Other PDFs are fine.
> > > >
> > > > What I find weird is shouldn't the "url" field just contain the path
of
> > the
> > > > file?
> > > >
> > > > Anyone else seen this before?
> > > >
> > > > Any ideas?
> > > >
> > > > Thanks,
> > > >
> > > > Luke
> > > >
> > > >
> > > >
> > >
> -
> > > > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > > > For additional commands, e-mail: [EMAIL PROTECTED]
> > > >
> > >
> > > -
> > > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > > For additional commands, e-mail: [EMAIL PROTECTED]
> > >
> > >
> >
> >
> >
> > -
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> >
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Search Chinese in Unicode !!!

2005-01-21 Thread jian chen
Hi,

I have done some studies in Chinese text search. The main problem is how
to separate the words, as in Chinese there is no white space between
words.

The typical commercial search engines these days use a dictionary-based
approach: look through the Chinese text and find the words that are in
the dictionary. For those characters that do not match words in the
dictionary, you could use a bi-gram based approach. Say, for characters
a b c, you could index 2 (pseudo) words: ab, bc. (See the sketch below.)

I think a pure bi-gram based approach is not good for a relatively large
Chinese text collection, as you end up with many pseudo-terms that are
not actual words.
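
A bare-bones sketch of the bi-gram step (my own illustration, plain Java):
for a run of characters that matched nothing in the dictionary, emit
overlapping two-character tokens.

    // "abc" -> "ab", "bc"; a single leftover character is emitted as-is.
    List tokens = new ArrayList();
    String run = "abc";               // placeholder for an unmatched run
    if (run.length() == 1) {
        tokens.add(run);
    } else {
        for (int i = 0; i + 2 <= run.length(); i++) {
            tokens.add(run.substring(i, i + 2));
        }
    }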

Cheers,

Jian

On Fri, 21 Jan 2005 18:55:56 +0100, Safarnejad, Ali (AFIS)
<[EMAIL PROTECTED]> wrote:
> The ChineseAnalyzer tokenizes based on some English stopwords.  The
> CJKAnalyzer is not much more sophisticated for Chinese analysis (2-byte
> tokenizing).  The analyzer I just sent you (using Erik Peterson's
> segmenter) looks up three dictionaries to segment the Chinese text, based
> on real word matches.
> 
> 
> -Original Message-
> From: news [mailto:[EMAIL PROTECTED] On Behalf Of aurora
> Sent: 21 January 2005 18:29
> To: lucene-user@jakarta.apache.org
> Subject: Re: Search Chinese in Unicode !!!
> 
> I would love to give it a try. Please email me at aurora00 at gmail.com.
> Thanks!
> 
> Also what is the opinion on the CJKAnalyzer and ChineseAnalyzer? Some
> people actually said the StandardAnalyzer works better. I wonder what's
> the pros and cons.
> 
> > I've written a Chinese Analyzer for Lucene that uses a segmenter
> > written
> > by
> > Erik Peterson. However, as the author of the segmenter does not want his
> > code
> > released under apache open source license (although his code _is_
> > opensource), I cannot place my work in the Lucene Sandbox.  This is
> > unfortunate, because I believe the analyzer works quite well in indexing
> > and
> > searching chinese docs in GB2312 and UTF-8 encoding, and I like more
> > people
> > to test, use, and confirm this.  So anyone who wants it, can have it.
> > Just
> > shoot me an email.
> > BTW, I also have written an arabic analyzer, which is collecting dust for
> > similar reasons.
> > Good luck,
> >
> > Ali Safarnejad
> >
> >
> > -Original Message-
> > From: Eric Chow [mailto:[EMAIL PROTECTED]
> > Sent: 21 January 2005 11:42
> > To: Lucene Users List
> > Subject: Re: Search Chinese in Unicode !!!
> >
> >
> > Search not really correct with UTF-8 !!!
> >
> >
> > The following is the search result that I used the SearchFiles in the
> > lucene
> > demo.
> >
> > d:\Downloads\Softwares\Apache\Lucene\lucene-1.4.3\src>java
> > org.apache.lucene.demo.SearchFiles c:\temp\myindex
> > Usage: java SearchFiles 
> > Query: ç
> > Searching for: g
> > strange ??
> > 3 total matching documents
> > 0. ../docs/ChineseDemo.htmlthis files
> > contains
> > the ç
> >-
> > 1. ../docs/luceneplan.html
> >- Jakarta Lucene - Plan for enhancements to Lucene
> > 2. ../docs/api/index-all.html
> >- Index (Lucene 1.4.3 API)
> > Query:
> >
> >
> >
> > From the above result only the ChineseDemo.html includes the character
> > that I
> > want to search !
> >
> >
> >
> >
> > The modified code in SearchFiles.java:
> >
> >
> > BufferedReader in = new BufferedReader(new InputStreamReader(System.in,
> > "UTF-8"));
> >
> > -
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> 
> --
> Using Opera's revolutionary e-mail client: http://www.opera.com/m2/
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
>

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



thanks for the URLDirectory pointer

2005-01-21 Thread Bill Janssen
lucene-user got blacklisted on SPEW, so I didn't actually get the
responses to my last question via email.  But I managed to dig them
out of the archive, and it should do what I needed.

Thanks for the pointer!

Bill

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: FOP Generated PDF and PDFBox

2005-01-21 Thread Ben Litchfield


Yes, when calling LucenePDFDocument.getDocument( File ), the url should be
the same as the path.

This is the code that the class uses to set those fields.

document.add( Field.UnIndexed("path", file.getPath() ) );
document.add(Field.UnIndexed("url", file.getPath().replace(FILE_SEPARATOR,
'/')));

I have no idea why an FOP PDF would be any different than another PDF.

You can also run it from the command line, just for debugging purposes,
like this:

java org.pdfbox.searchengine.lucene.LucenePDFDocument 

and it should print out the fields of the lucene Document object.  Is the
url there and is it correct?

Ben

On Fri, 21 Jan 2005, Luke Shannon wrote:

> That is correct. No difference with how other PDF are handled.
>
> I am looking at the index in Luke now. The FOP generated documents have a
> path but no URL? I would guess that these would be the same?
>
> Thanks for the speedy reply.
>
> Luke
>
>
> - Original Message -
> From: "Ben Litchfield" <[EMAIL PROTECTED]>
> To: "Lucene Users List" 
> Sent: Friday, January 21, 2005 12:34 PM
> Subject: Re: FOP Generated PDF and PDFBox
>
>
> >
> >
> > Are you indexing the FOP PDF's differently than other PDF documents?
> >
> > Can I assume that you are using PDFBox's LucenePDFDocument.getDocument()
> > method?
> >
> > Ben
> >
> > On Fri, 21 Jan 2005, Luke Shannon wrote:
> >
> > > Hello;
> > >
> > > Our CMS now allows users to create PDF documents (uses FOP) and than
> search
> > > them.
> > >
> > > I seem to be able to index these documents ok. But when I am generating
> the
> > > results to display I get a Null Pointer Exception while trying to use a
> > > variable that should contain the url keyword for one of these documents
> in
> > > the index:
> > >
> > > Document doc = hits.doc(i);
> > > String path = doc.get("url");
> > >
> > > Path contains null.
> > >
> > > The interesting thing is this only happens with PDF that are generate
> with
> > > FOP. Other PDFs are fine.
> > >
> > > What I find weird is shouldn't the "url" field just contain the path of
> the
> > > file?
> > >
> > > Anyone else seen this before?
> > >
> > > Any ideas?
> > >
> > > Thanks,
> > >
> > > Luke
> > >
> > >
> > >
> > > -
> > > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > > For additional commands, e-mail: [EMAIL PROTECTED]
> > >
> >
> > -
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> >
> >
>
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Search Chinese in Unicode !!!

2005-01-21 Thread Safarnejad, Ali (AFIS)
The ChineseAnalyzer tokenizes based on some English stopwords.  The
CJKAnalyzer is not much more sophisticated for Chinese analysis (2-byte
tokenizing).  The analyzer I just sent you (using Erik Peterson's
segmenter) looks up three dictionaries to segment the Chinese text, based
on real word matches.


-Original Message-
From: news [mailto:[EMAIL PROTECTED] On Behalf Of aurora
Sent: 21 January 2005 18:29
To: lucene-user@jakarta.apache.org
Subject: Re: Search Chinese in Unicode !!!


I would love to give it a try. Please email me at aurora00 at gmail.com.
Thanks!

Also what is the opinion on the CJKAnalyzer and ChineseAnalyzer? Some
people actually said the StandardAnalyzer works better. I wonder what
the pros and cons are.



> I've written a Chinese Analyzer for Lucene that uses a segmenter
> written
> by
> Erik Peterson. However, as the author of the segmenter does not want his
> code
> released under apache open source license (although his code _is_
> opensource), I cannot place my work in the Lucene Sandbox.  This is
> unfortunate, because I believe the analyzer works quite well in indexing
> and
> searching chinese docs in GB2312 and UTF-8 encoding, and I like more
> people
> to test, use, and confirm this.  So anyone who wants it, can have it.
> Just
> shoot me an email.
> BTW, I also have written an arabic analyzer, which is collecting dust for
> similar reasons.
> Good luck,
>
> Ali Safarnejad
>
>
> -Original Message-
> From: Eric Chow [mailto:[EMAIL PROTECTED]
> Sent: 21 January 2005 11:42
> To: Lucene Users List
> Subject: Re: Search Chinese in Unicode !!!
>
>
> Search not really correct with UTF-8 !!!
>
>
> The following is the search result that I used the SearchFiles in the
> lucene
> demo.
>
> d:\Downloads\Softwares\Apache\Lucene\lucene-1.4.3\src>java
> org.apache.lucene.demo.SearchFiles c:\temp\myindex
> Usage: java SearchFiles 
> Query: ç
> Searching for: g
> strange ??
> 3 total matching documents
> 0. ../docs/ChineseDemo.html    this file contains
> the ç
>    -
> 1. ../docs/luceneplan.html
>    - Jakarta Lucene - Plan for enhancements to Lucene
> 2. ../docs/api/index-all.html
>    - Index (Lucene 1.4.3 API)
> Query:
>
>
>
> From the above result only the ChineseDemo.html includes the character
> that I
> want to search !
>
>
>
>
> The modified code in SearchFiles.java:
>
>
> BufferedReader in = new BufferedReader(new InputStreamReader(System.in,
> "UTF-8"));
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]



--
Using Opera's revolutionary e-mail client: http://www.opera.com/m2/


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: Lucene and multiple languages

2005-01-21 Thread aurora
Thanks. I would like to give it a try. Is the source code available? I'm
using a Python version of Lucene so it would need to be wrapped or ported
:)

Hi Aurora
I developed a tool dealing with this multiple-languages issue. I found the
Nutch library "language-identifier" very useful. The jar has Nutch
dependencies, but I deleted all the code that was unnecessary for me.
The language-identifier I use works fine and is very simple.
For example:
LanguageIdentifier languageIdentifier = LanguageIdentifier.getInstance();
String userInputText = "free text";
String language = languageIdentifier.identify(userInputText);
This works for 11 languages: English, Spanish, Portuguese, Dutch, German,
French, Italian, and others.
I can send you this modified jar, but remember that this jar is from
Nutch, so mind the copyright (or copyleft :).
http://www.nutch.org/LICENSE.txt
More comments inline...
aurora wrote:
I'm trying to build a web search tool that could work for multiple
languages. I understand that Lucene ships with StandardAnalyzer plus
German and Russian analyzers, and some more in the sandbox, and that
indexing and searching should use the same analyzer.

Now let's say I have an index with documents in multiple languages,
analyzed by an assortment of analyzers. When a user enters a query, what
analyzer should be used? Should the user be asked for the language
upfront? What should one expect when the analyzer and the document don't
match? Let's say the query is parsed using StandardAnalyzer. Would it
match any documents indexed with the German analyzer at all, or would it
end up with poor results?

When this happens, in most cases you do not get matches.
Also, is there a good way to find out the languages used in a web page?
There is a 'content-language' header in HTTP and a 'lang' attribute in
HTML. It looks like people don't really use them. How can we recognize
the language?

With language identifier. :)
Even more interesting is multiple languages used in one document, let's
say half English and half French. Is there a good way to deal with those
cases?

The language identifier returns only one language. I looked into
language-identifier: it works with a score for each language and returns
the language with the greatest value.
Maybe you can modify language-identifier to return the several highest
values.
Bye
Ernesto.
Thanks for any guidance.
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


--
Using Opera's revolutionary e-mail client: http://www.opera.com/m2/
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: FOP Generated PDF and PDFBox

2005-01-21 Thread Luke Shannon
That is correct. No difference from how other PDFs are handled.

I am looking at the index in Luke now. The FOP-generated documents have a
path but no url. I would have guessed that these would be the same?

Thanks for the speedy reply.

Luke


- Original Message - 
From: "Ben Litchfield" <[EMAIL PROTECTED]>
To: "Lucene Users List" 
Sent: Friday, January 21, 2005 12:34 PM
Subject: Re: FOP Generated PDF and PDFBox


>
>
> Are you indexing the FOP PDF's differently than other PDF documents?
>
> Can I assume that you are using PDFBox's LucenePDFDocument.getDocument()
> method?
>
> Ben
>
> On Fri, 21 Jan 2005, Luke Shannon wrote:
>
> > Hello;
> >
> > Our CMS now allows users to create PDF documents (uses FOP) and than
search
> > them.
> >
> > I seem to be able to index these documents ok. But when I am generating
the
> > results to display I get a Null Pointer Exception while trying to use a
> > variable that should contain the url keyword for one of these documents
in
> > the index:
> >
> > Document doc = hits.doc(i);
> > String path = doc.get("url");
> >
> > Path contains null.
> >
> > The interesting thing is this only happens with PDF that are generate
with
> > FOP. Other PDFs are fine.
> >
> > What I find weird is shouldn't the "url" field just contain the path of
the
> > file?
> >
> > Anyone else seen this before?
> >
> > Any ideas?
> >
> > Thanks,
> >
> > Luke
> >
> >
> >
> > -
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> >
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: FOP Generated PDF and PDFBox

2005-01-21 Thread Ben Litchfield


Are you indexing the FOP PDF's differently than other PDF documents?

Can I assume that you are using PDFBox's LucenePDFDocument.getDocument()
method?

Ben

On Fri, 21 Jan 2005, Luke Shannon wrote:

> Hello;
>
> Our CMS now allows users to create PDF documents (uses FOP) and than search
> them.
>
> I seem to be able to index these documents ok. But when I am generating the
> results to display I get a Null Pointer Exception while trying to use a
> variable that should contain the url keyword for one of these documents in
> the index:
>
> Document doc = hits.doc(i);
> String path = doc.get("url");
>
> Path contains null.
>
> The interesting thing is this only happens with PDF that are generate with
> FOP. Other PDFs are fine.
>
> What I find weird is shouldn't the "url" field just contain the path of the
> file?
>
> Anyone else seen this before?
>
> Any ideas?
>
> Thanks,
>
> Luke
>
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Search Chinese in Unicode !!!

2005-01-21 Thread aurora
I would love to give it a try. Please email me at aurora00 at gmail.com.
Thanks!

Also, what is the opinion on the CJKAnalyzer and ChineseAnalyzer? Some
people actually said the StandardAnalyzer works better. I wonder what
the pros and cons are.


I've written a Chinese Analyzer for Lucene that uses a segmenter written  
by
Erik Peterson. However, as the author of the segmenter does not want his  
code
released under apache open source license (although his code _is_
opensource), I cannot place my work in the Lucene Sandbox.  This is
unfortunate, because I believe the analyzer works quite well in indexing  
and
searching chinese docs in GB2312 and UTF-8 encoding, and I like more  
people
to test, use, and confirm this.  So anyone who wants it, can have it.  
Just
shoot me an email.
BTW, I also have written an arabic analyzer, which is collecting dust for
similar reasons.
Good luck,

Ali Safarnejad
-Original Message-
From: Eric Chow [mailto:[EMAIL PROTECTED]
Sent: 21 January 2005 11:42
To: Lucene Users List
Subject: Re: Search Chinese in Unicode !!!
Search not really correct with UTF-8 !!!
The following is the search result that I used the SearchFiles in the  
lucene
demo.

d:\Downloads\Softwares\Apache\Lucene\lucene-1.4.3\src>java
org.apache.lucene.demo.SearchFiles c:\temp\myindex
Usage: java SearchFiles 
Query: ç
Searching for: g   
strange ??
3 total matching documents
0. ../docs/ChineseDemo.htmlthis files  
contains
the ç
   -
1. ../docs/luceneplan.html
   - Jakarta Lucene - Plan for enhancements to Lucene
2. ../docs/api/index-all.html
   - Index (Lucene 1.4.3 API)
Query:


From the above result only the ChineseDemo.html includes the character  
that I
want to search !


The modified code in SearchFiles.java:
BufferedReader in = new BufferedReader(new InputStreamReader(System.in,
"UTF-8"));
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

--
Using Opera's revolutionary e-mail client: http://www.opera.com/m2/
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


FOP Generated PDF and PDFBox

2005-01-21 Thread Luke Shannon
Hello;

Our CMS now allows users to create PDF documents (using FOP) and then
search them.

I seem to be able to index these documents ok. But when I am generating the
results to display I get a Null Pointer Exception while trying to use a
variable that should contain the url keyword for one of these documents in
the index:

Document doc = hits.doc(i);
String path = doc.get("url");

Path contains null.

The interesting thing is this only happens with PDFs that are generated
with FOP. Other PDFs are fine.

What I find weird is: shouldn't the "url" field just contain the path of
the file?

Anyone else seen this before?

Any ideas?

Thanks,

Luke



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Search Chinese in Unicode !!!

2005-01-21 Thread Otis Gospodnetic
If you are hosting the code somewhere (e.g. your site, SF, java.net,
etc.), we should link to it from one of the Lucene pages where we link to
related external tools, apps, and such.

Otis


--- "Safarnejad, Ali (AFIS)" <[EMAIL PROTECTED]> wrote:

> I've written a Chinese Analyzer for Lucene that uses a segmenter
> written by
> Erik Peterson. However, as the author of the segmenter does not want
> his code
> released under apache open source license (although his code _is_
> opensource), I cannot place my work in the Lucene Sandbox.  This is
> unfortunate, because I believe the analyzer works quite well in
> indexing and
> searching chinese docs in GB2312 and UTF-8 encoding, and I like more
> people
> to test, use, and confirm this.  So anyone who wants it, can have it.
> Just
> shoot me an email.
> BTW, I also have written an arabic analyzer, which is collecting dust
> for
> similar reasons.
> Good luck,
> 
> Ali Safarnejad
> 
> 
> -Original Message-
> From: Eric Chow [mailto:[EMAIL PROTECTED] 
> Sent: 21 January 2005 11:42
> To: Lucene Users List
> Subject: Re: Search Chinese in Unicode !!!
> 
> 
> Search not really correct with UTF-8 !!!
> 
> 
> The following is the search result that I used the SearchFiles in the
> lucene
> demo.
> 
> d:\Downloads\Softwares\Apache\Lucene\lucene-1.4.3\src>java
> org.apache.lucene.demo.SearchFiles c:\temp\myindex
> Usage: java SearchFiles 
> Query: å´
> Searching for: g 
> strange ??
> 3 total matching documents
> 0. ../docs/ChineseDemo.htmlthis files
> contains
> the å´
>-
> 1. ../docs/luceneplan.html
>- Jakarta Lucene - Plan for enhancements to Lucene
> 2. ../docs/api/index-all.html
>- Index (Lucene 1.4.3 API)
> Query: 
> 
> 
> 
> From the above result only the ChineseDemo.html includes the
> character that I
> want to search !
> 
> 
> 
> 
> The modified code in SearchFiles.java:
> 
> 
> BufferedReader in = new BufferedReader(new
> InputStreamReader(System.in,
> "UTF-8"));
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
> 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Suggestion needed for extranet search

2005-01-21 Thread Otis Gospodnetic
Free as in orange juice.

Otis

--- "Ranjan K. Baisak" <[EMAIL PROTECTED]> wrote:

> Otis,
> Thanks for your help. Is nutch a freeware tool?
> 
> regards,
> Ranjan
> --- Otis Gospodnetic <[EMAIL PROTECTED]>
> wrote:
> 
> > Hi Ranjan,
> > 
> > It sounds like you are should look at and use Nutch:
> > http://www.nutch.org
> > 
> > Otis
> > 
> > --- "Ranjan K. Baisak" <[EMAIL PROTECTED]>
> > wrote:
> > 
> > > I am planning to move to Lucene but not have much
> > > knowledge on the same. The search engine which I
> > had
> > > developed is searching some extranet URLs e.g.
> > > codeguru.com/index.html. Is is possible to get the
> > > same functionality using Lucene. So basically can
> > I
> > > make Lucene as a search engine to search
> > extranets.
> > > 
> > > regards,
> > > Ranjan
> > > 
> > > __
> > > Do You Yahoo!?
> > > Tired of spam?  Yahoo! Mail has the best spam
> > protection around 
> > > http://mail.yahoo.com 
> > > 
> > >
> >
> -
> > > To unsubscribe, e-mail:
> > [EMAIL PROTECTED]
> > > For additional commands, e-mail:
> > [EMAIL PROTECTED]
> > > 
> > > 
> > 
> > 
> >
> -
> > To unsubscribe, e-mail:
> > [EMAIL PROTECTED]
> > For additional commands, e-mail:
> > [EMAIL PROTECTED]
> > 
> > 
> 
> 
> 
>   
> __ 
> Do you Yahoo!? 
> The all-new My Yahoo! - What will yours do?
> http://my.yahoo.com 
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
> 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Search Chinese in Unicode !!!

2005-01-21 Thread Safarnejad, Ali (AFIS)
I've written a Chinese Analyzer for Lucene that uses a segmenter written by
Erik Peterson. However, as the author of the segmenter does not want his code
released under the Apache open source license (although his code _is_
open source), I cannot place my work in the Lucene Sandbox.  This is
unfortunate, because I believe the analyzer works quite well in indexing and
searching Chinese docs in GB2312 and UTF-8 encoding, and I'd like more people
to test, use, and confirm this.  So anyone who wants it can have it. Just
shoot me an email.
BTW, I have also written an Arabic analyzer, which is collecting dust for
similar reasons.
Good luck,

Ali Safarnejad


-Original Message-
From: Eric Chow [mailto:[EMAIL PROTECTED] 
Sent: 21 January 2005 11:42
To: Lucene Users List
Subject: Re: Search Chinese in Unicode !!!


Search not really correct with UTF-8 !!!


The following is the search result that I used the SearchFiles in the lucene
demo.

d:\Downloads\Softwares\Apache\Lucene\lucene-1.4.3\src>java
org.apache.lucene.demo.SearchFiles c:\temp\myindex
Usage: java SearchFiles 
Query: ç
Searching for: g  strange ??
3 total matching documents
0. ../docs/ChineseDemo.html    this file contains
the ç
   -
1. ../docs/luceneplan.html
   - Jakarta Lucene - Plan for enhancements to Lucene
2. ../docs/api/index-all.html
   - Index (Lucene 1.4.3 API)
Query: 



From the above result only the ChineseDemo.html includes the character that I
want to search !




The modified code in SearchFiles.java:


BufferedReader in = new BufferedReader(new InputStreamReader(System.in,
"UTF-8"));

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: Concurrent read and write

2005-01-21 Thread Otis Gospodnetic
Hello Ashley,

You can read/search while modifying the index, but you have to ensure
only one thread or only one process is modifying an index at any given
time.  Both IndexReader and IndexWriter can be used to modify an index.
 The former to delete Documents and the latter to add them.  You have
to ensure these two operations don't overlap.
c.f. http://www.lucenebook.com/search?query=concurrent
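
A minimal sketch of that discipline (Lucene 1.4-era API; the path and the
"id" field are hypothetical):

    // Delete pass with an IndexReader...
    IndexReader reader = IndexReader.open("/path/to/index");
    reader.delete(new Term("id", "42"));
    reader.close();                  // release the lock before writing

    // ...then the add pass with an IndexWriter; never run both at once.
    IndexWriter writer = new IndexWriter("/path/to/index",
            new StandardAnalyzer(), false);
    writer.addDocument(doc);
    writer.close();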

Otis


--- Ashley Steigerwalt <[EMAIL PROTECTED]> wrote:

> I am a little fuzzy on the thread-safeness of Lucene, or maybe just
> java.  
> From what I understand, and correct me if I'm wrong, Lucene takes
> care of 
> concurrency issues and it is ok to run a query while writing to an
> index.
> 
> My question is, does this still hold true if the reader and writer
> are being 
> executed as separate programs?  I have a cron job that will update
> the index 
> periodically.  I also have a search application on a web form.  Is
> this going 
> to cause trouble if someone runs a query while the indexer is
> updating?
> 
> Ashley
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
> 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Concurrent read and write

2005-01-21 Thread Clas Rydergren
Hi,

My limited experience shows that reading/searching in a servlet at the
"same" time as writing to the index from an application (e.g. by a
scheduled script) works very well.

The only thing that has caused me problems is applications (e.g. cron
started) writing to the index that "crash" while the write lock is in
effect. (The "crash" is in my case often caused by bad socket
programming, and has nothing to do with Lucene.) Subsequent scheduled
runs will then, of course, not be able to update the index.
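
For what it's worth, Lucene 1.4 can clear such a stale lock forcibly; a
sketch (only safe when you are sure no writer is still alive):

    Directory dir = FSDirectory.getDirectory("/path/to/index", false);
    if (IndexReader.isLocked(dir)) {
        IndexReader.unlock(dir);     // remove the leftover write lock
    }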

cheers
Clas / frisim.com


On Fri, 21 Jan 2005 09:57:22 -0500, Ashley Steigerwalt
<[EMAIL PROTECTED]> wrote:
> I am a little fuzzy on the thread-safeness of Lucene, or maybe just java.
> From what I understand, and correct me if I'm wrong, Lucene takes care of
> concurrency issues and it is ok to run a query while writing to an index.
> 
> My question is, does this still hold true if the reader and writer are being
> executed as separate programs?  I have a cron job that will update the index
> periodically.  I also have a search application on a web form.  Is this going
> to cause trouble if someone runs a query while the indexer is updating?
> 
> Ashley
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
>

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Closed IndexWriter reuse

2005-01-21 Thread Oscar Picasso
--- Otis Gospodnetic <[EMAIL PROTECTED]> wrote:

> No, you can't add documents to an index once you close the IndexWriter.
> You can re-open the IndexWriter and add more documents, of course.
> 
> Otis

That's what I expected at first, but:
1- It's a disappointment, because such a 'feature' would have made
IndexWriter management much easier. You would open an IndexWriter at
startup and reuse it for the whole life of the application, just flushing
on a regular basis using the close() method and without worrying whether
other objects are currently using the writer.

2- When you say you can't add, do you mean it's impossible, or that you
shouldn't because, for example, it could corrupt the index?
Maybe I'm wrong, but I think it's possible. Let's look at the following
code:


public static void main(String[] args) throws IOException
{
    final IndexWriter writer1 = new IndexWriter("/tmp/test-reuse",
            new StandardAnalyzer(), true);

    // First write with the writer
    Document doc = new Document();
    doc.add(new Field("name", "John", Field.Store.YES,
            Field.Index.UN_TOKENIZED));
    writer1.addDocument(doc);
    System.out.println("1  After first write, before closing the writer ---");
    Searcher searcher = new IndexSearcher("/tmp/test-reuse");
    Query query = new TermQuery(new Term("name", "John"));
    Hits hits = searcher.search(query);
    System.out.println("===> hits: " + hits.length());
    System.out.println();

    // CLOSING THE WRITER ONCE
    writer1.close();
    System.out.println("2  After first write, after closing the writer ---");
    searcher = new IndexSearcher("/tmp/test-reuse");
    hits = searcher.search(query);
    System.out.println("===> hits: " + hits.length());
    System.out.println();

    // Second write, THE WRITER HAS ALREADY BEEN CLOSED ONCE
    writer1.addDocument(doc);
    System.out.println("3  After second write, the writer has been closed once ---");
    hits = searcher.search(query);
    System.out.println("===> hits: " + hits.length());
    System.out.println();

    // Closing the writer again
    writer1.close();
    System.out.println("4  After second write, the writer has been closed twice ---");
    searcher = new IndexSearcher("/tmp/test-reuse");
    hits = searcher.search(query);
    System.out.println("===> hits: " + hits.length());
}

== Results ==
1  After first write, before closing the writer ---
===> hits: 0

2  After first write, after closing the writer ---
===> hits: 1

3  After second write, the writer has been closed once ---
===> hits: 1

4  After second write, the writer has been closed twice ---
===> hits: 2


As you can see, not only does the code above execute without complaint,
but it also gives the right results.

Thanks for your comments.



__ 
Do you Yahoo!? 
Yahoo! Mail - Easier than ever with enhanced search. Learn more.
http://info.mail.yahoo.com/mail_250

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Concurrent read and write

2005-01-21 Thread Ashley Steigerwalt
I am a little fuzzy on the thread-safeness of Lucene, or maybe just Java.  
From what I understand, and correct me if I'm wrong, Lucene takes care of 
concurrency issues and it is ok to run a query while writing to an index.

My question is, does this still hold true if the reader and writer are being 
executed as separate programs?  I have a cron job that will update the index 
periodically.  I also have a search application on a web form.  Is this going 
to cause trouble if someone runs a query while the indexer is updating?

Ashley

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Stemming

2005-01-21 Thread Kevin L. Cobb
OK, OK ... I'll buy the book. I guess it's about time, since I am deeply
and forever in love with Lucene. Might as well take the final plunge.



-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] 
Sent: Friday, January 21, 2005 9:12 AM
To: Lucene Users List
Subject: Re: Stemming

Hi Kevin,

Stemming is an optional operation and is done in the analysis step. 
Lucene comes with a Porter stemmer and a Filter that you can use in an
Analyzer:

./src/java/org/apache/lucene/analysis/PorterStemFilter.java
./src/java/org/apache/lucene/analysis/PorterStemmer.java

You can find more about it here:
http://www.lucenebook.com/search?query=stemming
You can also see mentions of SnowballAnalyzer in those search results,
and you can find an adapter for SnowballAnalyzers in Lucene Sandbox.

Otis

--- "Kevin L. Cobb" <[EMAIL PROTECTED]> wrote:

> I want to understand how Lucene uses stemming but can't find any
> documentation on the Lucene site. I'll continue to google but hope
> that
> this list can help narrow my search. I have several questions on the
> subject currently but hesitate to list them here since finding a good
> document on the subject may answer most of them. 
> 
>  
> 
> Thanks in advance for any pointers,
> 
>  
> 
> Kevin
> 
>  
> 
>  
> 
> 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Suggestion needed for extranet search

2005-01-21 Thread Ranjan K. Baisak
Otis,
Thanks for your help. Is Nutch a freeware tool?

regards,
Ranjan
--- Otis Gospodnetic <[EMAIL PROTECTED]>
wrote:

> Hi Ranjan,
> 
> It sounds like you are should look at and use Nutch:
> http://www.nutch.org
> 
> Otis
> 
> --- "Ranjan K. Baisak" <[EMAIL PROTECTED]>
> wrote:
> 
> > I am planning to move to Lucene but not have much
> > knowledge on the same. The search engine which I
> had
> > developed is searching some extranet URLs e.g.
> > codeguru.com/index.html. Is is possible to get the
> > same functionality using Lucene. So basically can
> I
> > make Lucene as a search engine to search
> extranets.
> > 
> > regards,
> > Ranjan
> > 
> > __
> > Do You Yahoo!?
> > Tired of spam?  Yahoo! Mail has the best spam
> protection around 
> > http://mail.yahoo.com 
> > 
> >
>
-
> > To unsubscribe, e-mail:
> [EMAIL PROTECTED]
> > For additional commands, e-mail:
> [EMAIL PROTECTED]
> > 
> > 
> 
> 
>
-
> To unsubscribe, e-mail:
> [EMAIL PROTECTED]
> For additional commands, e-mail:
> [EMAIL PROTECTED]
> 
> 




__ 
Do you Yahoo!? 
The all-new My Yahoo! - What will yours do?
http://my.yahoo.com 

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Search on heterogenous index

2005-01-21 Thread Simeon Koptelov
Hello all. I'm new to Lucene and am thinking about using it in my project.

I have price lists with a dynamic structure, containing wares: about 10K
price lists with 500K wares in total. Each price list has about 5 text
fields.

I'll do searches on wares. The difficult part is that I'll search across
all wares; the search is not bound to a particular price-list structure.

My question is, how should I organize my indices? Can Lucene handle data
effectively if I have one index containing Documents with different
Fields? Or should I create a separate index for each price list, with the
same Field structure across Documents?
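
For what it's worth, Lucene does not force Documents in one index to share
a field set; a tiny sketch with invented field names:

    Document d1 = new Document();
    d1.add(Field.Text("name", "steel pipe"));
    d1.add(Field.Keyword("currency", "EUR"));

    Document d2 = new Document();          // different fields, same index
    d2.add(Field.Text("name", "copper wire"));
    d2.add(Field.Text("description", "2mm, insulated"));

    writer.addDocument(d1);
    writer.addDocument(d2);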

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Suggestion needed for extranet search

2005-01-21 Thread Otis Gospodnetic
Hi Ranjan,

It sounds like you should look at and use Nutch:
http://www.nutch.org

Otis

--- "Ranjan K. Baisak" <[EMAIL PROTECTED]> wrote:

> I am planning to move to Lucene but not have much
> knowledge on the same. The search engine which I had
> developed is searching some extranet URLs e.g.
> codeguru.com/index.html. Is is possible to get the
> same functionality using Lucene. So basically can I
> make Lucene as a search engine to search extranets.
> 
> regards,
> Ranjan
> 
> __
> Do You Yahoo!?
> Tired of spam?  Yahoo! Mail has the best spam protection around 
> http://mail.yahoo.com 
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
> 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Filtering w/ Multiple Terms

2005-01-21 Thread Otis Gospodnetic
This:
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/BooleanQuery.TooManyClauses.html
?

You can control that limit via
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/BooleanQuery.html#maxClauseCount
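
A sketch of both pieces (Lucene 1.4-era API; the field and terms are
invented):

    // Raise the default 1024-clause limit if you really need to.
    BooleanQuery.setMaxClauseCount(2048);

    // Filter on several terms: nested TermQuerys OR'ed together
    // (required=false, prohibited=false).
    BooleanQuery bq = new BooleanQuery();
    bq.add(new TermQuery(new Term("acct", "acct1")), false, false);
    bq.add(new TermQuery(new Term("acct", "acct2")), false, false);
    Filter f = new QueryFilter(bq);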

Otis


--- Jerry Jalenak <[EMAIL PROTECTED]> wrote:

> OK.  But isn't there a limit on the number of BooleanQueries that can
> be
> combined with AND / OR / etc?
> 
> 
> 
> Jerry Jalenak
> Senior Programmer / Analyst, Web Publishing
> LabOne, Inc.
> 10101 Renner Blvd.
> Lenexa, KS  66219
> (913) 577-1496
> 
> [EMAIL PROTECTED]
> 
> 
> > -Original Message-
> > From: Erik Hatcher [mailto:[EMAIL PROTECTED]
> > Sent: Thursday, January 20, 2005 5:05 PM
> > To: Lucene Users List
> > Subject: Re: Filtering w/ Multiple Terms
> > 
> > 
> > 
> > On Jan 20, 2005, at 5:02 PM, Jerry Jalenak wrote:
> > 
> > > In looking at the examples for filtering of hits, it looks 
> > like I can 
> > > only
> > > specify a single term; i.e.
> > >
> > >   Filter f = new QueryFilter(new TermQuery(new Term("acct",
> > > "acct1")));
> > >
> > > I need to specify more than one term in my filter.  Short of
> using 
> > > something
> > > like ChainFilter, how are others handling this?
> > 
> > You can make as complex of a Query as you want for 
> > QueryFilter.  If you 
> > want to filter on multiple terms, construct a BooleanQuery 
> > with nested 
> > TermQuery's, either in an AND or OR fashion.
> > 
> > Erik
> > 
> > 
> >
> -
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail:
> [EMAIL PROTECTED]
> > 
> > 
> 
> This transmission (and any information attached to it) may be
> confidential and
> is intended solely for the use of the individual or entity to which
> it is
> addressed. If you are not the intended recipient or the person
> responsible for
> delivering the transmission to the intended recipient, be advised
> that you
> have received this transmission in error and that any use,
> dissemination,
> forwarding, printing, or copying of this information is strictly
> prohibited.
> If you have received this transmission in error, please immediately
> notify
> LabOne at the following email address:
> [EMAIL PROTECTED]
> 
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
> 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Stemming

2005-01-21 Thread Otis Gospodnetic
Hi Kevin,

Stemming is an optional operation and is done in the analysis step. 
Lucene comes with a Porter stemmer and a Filter that you can use in an
Analyzer:

./src/java/org/apache/lucene/analysis/PorterStemFilter.java
./src/java/org/apache/lucene/analysis/PorterStemmer.java

You can find more about it here:
http://www.lucenebook.com/search?query=stemming
You can also see mentions of SnowballAnalyzer in those search results,
and you can find an adapter for SnowballAnalyzers in Lucene Sandbox.
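
A minimal custom Analyzer using that filter might look like this (a
sketch against the 1.4-era API):

    public class PorterAnalyzer extends Analyzer {
        public TokenStream tokenStream(String fieldName, Reader reader) {
            TokenStream stream = new StandardTokenizer(reader);
            stream = new StandardFilter(stream);
            stream = new LowerCaseFilter(stream);  // Porter expects lowercase
            return new PorterStemFilter(stream);
        }
    }

Use the same analyzer at index and query time so the stems line up.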

Otis

--- "Kevin L. Cobb" <[EMAIL PROTECTED]> wrote:

> I want to understand how Lucene uses stemming but can't find any
> documentation on the Lucene site. I'll continue to google but hope
> that
> this list can help narrow my search. I have several questions on the
> subject currently but hesitate to list them here since finding a good
> document on the subject may answer most of them. 
> 
>  
> 
> Thanks in advance for any pointers,
> 
>  
> 
> Kevin
> 
>  
> 
>  
> 
> 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Stemming

2005-01-21 Thread Kevin L. Cobb
I want to understand how Lucene uses stemming but can't find any
documentation on the Lucene site. I'll continue to google but hope that
this list can help narrow my search. I have several questions on the
subject currently but hesitate to list them here since finding a good
document on the subject may answer most of them. 

 

Thanks in advance for any pointers,

 

Kevin

 

 



RE: Filtering w/ Multiple Terms

2005-01-21 Thread Jerry Jalenak
OK.  But isn't there a limit on the number of BooleanQueries that can be
combined with AND / OR / etc?



Jerry Jalenak
Senior Programmer / Analyst, Web Publishing
LabOne, Inc.
10101 Renner Blvd.
Lenexa, KS  66219
(913) 577-1496

[EMAIL PROTECTED]


> -Original Message-
> From: Erik Hatcher [mailto:[EMAIL PROTECTED]
> Sent: Thursday, January 20, 2005 5:05 PM
> To: Lucene Users List
> Subject: Re: Filtering w/ Multiple Terms
> 
> 
> 
> On Jan 20, 2005, at 5:02 PM, Jerry Jalenak wrote:
>
> > In looking at the examples for filtering of hits, it looks like I can
> > only specify a single term; i.e.
> >
> > Filter f = new QueryFilter(new TermQuery(new Term("acct", "acct1")));
> >
> > I need to specify more than one term in my filter.  Short of using
> > something like ChainFilter, how are others handling this?
>
> You can make as complex of a Query as you want for QueryFilter.  If you
> want to filter on multiple terms, construct a BooleanQuery with nested
> TermQuery's, either in an AND or OR fashion.
>
>   Erik
> 
> 
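
For illustration, a minimal sketch of that suggestion against the 1.4 API
(the field and account values are made up; pass required=true instead to
AND the clauses together):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.QueryFilter;
import org.apache.lucene.search.TermQuery;

public class AccountFilterSketch {
    public static Filter acctFilter() {
        // OR the two accounts together: neither clause is required and
        // none is prohibited (BooleanQuery.add(Query, required, prohibited)).
        BooleanQuery accounts = new BooleanQuery();
        accounts.add(new TermQuery(new Term("acct", "acct1")), false, false);
        accounts.add(new TermQuery(new Term("acct", "acct2")), false, false);
        return new QueryFilter(accounts);
    }
}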



Re: Search Chinese in Unicode !!!

2005-01-21 Thread PA
On Jan 21, 2005, at 11:42, Eric Chow wrote:
> Search not really correct with UTF-8 !!!

Lucene works just fine with any flavor of Unicode as long as _your_ 
application knows how to consistently deal with Unicode as well. 
Remember: the world is not just one Big5 pile.

As far as Analyzer goes, you may or may not be better off using 
something more tailored to your linguistic needs. That said, even the 
default Analyzer does a fairly decent job at handling non-roman 
languages. YMMV.

Cheers
--
PA
http://alt.textdrive.com/


Re: Newbie: Human Readable Stemming, Lucene Architecture, etc!

2005-01-21 Thread mark harwood
>> 1 - I'm a bit concerned that reasonable stemming (Porter/Snowball)
>> apparently produces non-word stems .. i.e. not really human readable.

It is possible to derive the human-readable form of a
stemmed term using either re-analysis of indexed
content or TermPositionVector. Either of these
techniques should give you the position data required
to discover the original form. 
The highlighter package is one example of where this
technique is used.
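
For illustration, a rough sketch of the re-analysis route (the analyzer and
the stored original text are assumed to be at hand; the token offsets are
what recover the surface form):

import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

public class StemToSurface {
    // Re-run the indexing analyzer over the original text and record which
    // surface form produced each stemmed term.
    public static Map map(Analyzer analyzer, String text) throws IOException {
        Map stemToOriginal = new HashMap();
        TokenStream ts = analyzer.tokenStream("contents", new StringReader(text));
        for (Token t = ts.next(); t != null; t = ts.next()) {
            stemToOriginal.put(t.termText(),
                               text.substring(t.startOffset(), t.endOffset()));
        }
        return stemToOriginal;
    }
}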

Cheers
Mark








Re: Search Chinese in Unicode !!!

2005-01-21 Thread Eric Chow
Search not really correct with UTF-8 !!!


The following is the search result I got using SearchFiles from the
Lucene demo.

d:\Downloads\Softwares\Apache\Lucene\lucene-1.4.3\src>java
org.apache.lucene.demo.SearchFiles c:\temp\myindex
Usage: java SearchFiles 
Query: ç
Searching for: g  strange ??
3 total matching documents
0. ../docs/ChineseDemo.html
   - this files contains the ç
1. ../docs/luceneplan.html
   - Jakarta Lucene - Plan for enhancements to Lucene
2. ../docs/api/index-all.html
   - Index (Lucene 1.4.3 API)
Query: 



From the above results, only ChineseDemo.html includes the character
that I want to search for!




The modified code in SearchFiles.java:


BufferedReader in = new BufferedReader(new
InputStreamReader(System.in, "UTF-8"));




Re: How works *

2005-01-21 Thread Miles Barr
On Fri, 2005-01-21 at 10:58 +0100, Bertrand VENZAL wrote:
> I wondered how Lucene implements the * character. I know it works, but
> when I look at the Query object it doesn't seem to appear anywhere. Does
> someone know how it is implemented?

Take a look at the PrefixQuery and WildcardQuery. 

PrefixQuery works by enumerating all terms beginning with the prefix and
then constructing a BooleanQuery from them. I assume WildcardQuery works
in a similar way.

If you have several terms or a short prefix (e.g. a*) you might need to
increase the maximum number of clauses allowed in a boolean query
because the number of terms might exceed the default (i.e. 1024).
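
For illustration, a rough sketch of raising the limit (the field name is
made up; in 1.4 blowing the limit surfaces as a BooleanQuery.TooManyClauses
exception at rewrite time):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.Query;

public class ShortPrefixSketch {
    public static Query shortPrefix() {
        // A one-letter prefix can expand to thousands of terms, so raise
        // the clause cap before the query is rewritten against the index.
        BooleanQuery.setMaxClauseCount(10000);
        return new PrefixQuery(new Term("contents", "a"));
    }
}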
 
-- 
Miles Barr <[EMAIL PROTECTED]>
Runtime Collective Ltd.




Re: Search Chinese in Unicode !!!

2005-01-21 Thread Erik Hatcher
On Jan 21, 2005, at 4:49 AM, Eric Chow wrote:
> How do I create an index from Chinese (UTF-8 encoded) HTML and search
> it with Lucene?

Indexing and searching Chinese basically is no different than using 
English with Lucene.  We covered a bit about it in Lucene in Action:

http://www.lucenebook.com/search?query=chinese
And a screenshot here:
http://www.blogscene.org/erik/LuceneInAction/i18n.html
The main issues in dealing with Chinese, and of course other languages,
are encoding concerns when reading in the text, for both indexing and
querying, and the analysis itself (as you can see from the screenshot).

Lucene itself works with Unicode fine and you're free to index anything.
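
For illustration, a rough sketch of the reading-in side (the path handling
is made up; the point is the explicit "UTF-8" instead of the platform
default encoding):

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class Utf8DocSketch {
    public static Document fromUtf8File(String path) throws IOException {
        // Decode the file as UTF-8 explicitly, then hand the Reader to
        // Lucene; Field.Text(String, Reader) tokenizes but does not store.
        Reader reader = new InputStreamReader(new FileInputStream(path), "UTF-8");
        Document doc = new Document();
        doc.add(Field.Text("contents", reader));
        return doc;
    }
}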
Erik


How works *

2005-01-21 Thread Bertrand VENZAL
Hi,

I wondered how Lucene implements the * character. I know it works, but
when I look at the Query object it doesn't seem to appear anywhere. Does
someone know how it is implemented?

thanks 

Search Chinese in Unicode !!!

2005-01-21 Thread Eric Chow
How do I create an index from Chinese (UTF-8 encoded) HTML and search
it with Lucene?




Re: Newbie: Human Readable Stemming, Lucene Architecture, etc!

2005-01-21 Thread Andrzej Bialecki
Morus Walter wrote:
> Owen Densmore writes:
>
>> 1 - I'm a bit concerned that reasonable stemming (Porter/Snowball)
>> apparently produces non-word stems .. i.e. not really human readable.
>> (Example: generate, generates, generated, generating -> generat)
>> Although in typical queries this is not important because the result of
>> the search is a document list, it *would* be important if we use the
>> stems within a graphical navigation interface.
>> So the question is: Is there a way to have the stemmer produce English
>> base forms of the words being stemmed?
>
> Rule-based stemmers such as Porter/Snowball cannot do that.
> But there are (commercial) dictionary-based tools that can, e.g. the
> Canoo lemmatizer.
> You might also have a look at Egothor's stemmers, which are word-list based.

Egothor stemmers are algorithmic; they only use word lists for training.
Stems produced by them are usually closer to lemmas than those from e.g.
Porter's stemmer, but a significant number still come out like the example
above.


--
Best regards,
Andrzej Bialecki
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com