Re: search impossible with numbers

2001-10-19 Thread Che Dong

If you want to index words that mix digits and letters, just change LowerCaseTokenizer.java 
in com/lucene/analysis/, line 62:
  if (Character.isLetter(c)) { 
=>   if (Character.isLetterOrDigit(c)) { 
Then digit-and-letter mixed words like "U2" and "fifa98" will be tokenized and indexed as 
one word.
I think this should be the default, because there are so many such words in the world now: 
"fifa98", telephone numbers, etc.
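
In later Lucene releases (the org.apache.lucene packages, where CharTokenizer is available), the same effect can be had without editing core code by subclassing CharTokenizer; a hedged sketch, where the class name AlphanumericTokenizer is illustrative and not part of Lucene:

import java.io.Reader;
import org.apache.lucene.analysis.CharTokenizer;

public class AlphanumericTokenizer extends CharTokenizer {
    public AlphanumericTokenizer(Reader in) {
        super(in);
    }

    // accept letters AND digits as token characters,
    // so "U2" and "fifa98" survive as single tokens
    protected boolean isTokenChar(char c) {
        return Character.isLetterOrDigit(c);
    }

    // lower-case each character, matching LowerCaseTokenizer's behavior
    protected char normalize(char c) {
        return Character.toLowerCase(c);
    }
}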


Che Dong

- Original Message - 
From: "eou" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Wednesday, October 17, 2001 6:04 AM
Subject: search impossible with numbers


> Hello lucene-user,
> 
> 
> > Hello
> hi,
> 
> > 
> > I downloaded your JSP web application for Lucene.
> > It works very well with my Tomcat server... congratulations!
> Thanks
> > I created an index for several documents.
> > I used queries like "text + text" yes
> >   "text - text" yes
> >   "text AND text"... yes
> >   "AUTHOR:text"... no
> >   "keywords:text"... no
> > 
> > And I would like to search with a number, for example a year like
> > 1999 :-) and I obtain no results! Why? Can you help me?
> I don't know. Maybe the tokenizer filters out numbers?
> As I was working a bit more with the JSP pages, I found
> a bug disabling the Create checkbox.
> I will post the fix in the next few days, and some simple tokenizer
> samples too.
> > 
> > Sorry, I am a beginner with Lucene.
> > Thanks a lot for your help.
> Well, I'm a beginner too. :-)
> 
> I don't mind directly answering this email, but
> there is the wonderful mailing list [EMAIL PROTECTED]
> where everybody can share and add comments, so please use this
> mailing list!
> 
> bye
>   
> 
> -- 
> Best regards,
>  eou  mailto:[EMAIL PROTECTED]
> 


Fw: [contrib]: XMLIndexer/StringFilter

2002-09-25 Thread Che Dong


> Hi All:
> I have packaged some source code I wrote earlier to extend the Lucene project. I hope it can be 
>added to the sandbox, and I would like to hear from others who are also interested in the 
>following issues:
> 
> customized search result sorting, 
> search result filtering, 
> a common XML indexing source format, 
> Asian language (Chinese, Korean, Japanese) word segmentation analyzer support,
> etc...
> 
> File list: 
> sample.xml: sample index source
> lucene.dtd: lucene index source XML data type definition
> 
> org/apache/lucene
> analysis/
> /cjk/CJKTokenizer.java: Java based tokenizer (CJK bigram)
> /standard/StandardTokenizer.java: JavaCC based tokenizer (CJK unigram)
> search/
>   /IndexSearcher.java: supports sorting by docID (descending) besides sorting by score.
>   /StringFilter.java: exact match or prefix match index filter
> demo/
> /XMLIndexer.java: indexes XML sources that map to the Lucene index format (not 
>tested).
>
> Regards
> 
> Che, Dong
> 
> Attach with README
> 
> Lucene extend package
> 
> Author: Che, Dong <[EMAIL PROTECTED]>
> $Header: /home/cvsroot/lucene_ext/README,v 1.1.1.1 2002/09/22 19:36:08 chedong Exp $
> 
> Introduction
> 
> This package contains source code extending the Lucene project for several purposes: customized 
>search result sorting, search result filtering, a common XML indexing source format, and Asian 
>language (Chinese, Korean, Japanese) word segmentation analyzer support...
> 
> File list: 
> 
> sample.xml: sample index source
> lucene.dtd: lucene index source XML data type definition
> 
> org/apache/lucene
> analysis/
> /cjk/CJKTokenizer.java: Java based tokenizer (CJK bigram)
> /standard/StandardTokenizer.java: JavaCC based tokenizer (CJK unigram)
> search/
>   /IndexSearcher.java: supports sorting by docID (descending) besides sorting by score.
>   /StringFilter.java: exact match or prefix match index filter
> demo/
> /XMLIndexer.java: indexes XML sources that map to the Lucene index format (not 
>tested).
>
> 
> INSTALL
> ===
> Required jars: lucene-version.jar, xerces.jar (only needed by XMLIndexer).
> Please make sure these two jar files are included in your CLASSPATH.
> 
> Check that the JavaCC-related configuration in build.xml fits your environment.
> build:
> ant 
> ant javadocs
> 
> TODO
> 
> 1 Bigram based word segmentation in StandardTokenizer.jj:
> I am still not familiar with JavaCC; I am trying to use getNextToken() in 
>StandardTokenizer.next() to 
> implement overlap matching: C1C2C3C4 ==> C1C2 C2C3 C3C4,
> or even C1C2 C2C3 C3C4 C4 / C1 C1C2 C2C3 C3C4
> 
> 2 More complex Lucene index source binding:
> indexType: DateIndex etc... Make one lucene.dtd (or schema) the common Lucene 
>indexing source format:
> 
>   source:  WORD   PDF   HTML   DB   other
>               \     |     |    |   /
>                 xml (lucene.dtd)
>                        |
>           XMLIndexer.build(XML InputSource)
>                        |
>                   Lucene INDEX
> 
> 3 IndexSearcher:
> The lower-level search API search() is still not able to search in docID order.
> 
> 4 Test suite for the above package:
> 
> 








lucene_ext.tar.gz
Description: GNU Zip compressed data



Re: Custom result ordering

2002-12-02 Thread Che Dong
Another simple approach: use docID instead of score as the sort field
http://www.mail-archive.com/lucene-dev@jakarta.apache.org/msg02052.html

Che, Dong
- Original Message - 
From: "Eric Jain" <[EMAIL PROTECTED]>
To: "Peter Carlson" <[EMAIL PROTECTED]>
Cc: "lucene-user" <[EMAIL PROTECTED]>
Sent: Monday, December 02, 2002 6:41 PM
Subject: Re: Custom result ordering


> > There was some work done on this and it was added to the Lucene
> > Sandbox. It's called SearchBean.
> 
> Thanks. But as mentioned I believe SearchBean requires all fields that are
> used for sorting to fit into memory?
> 
> 
> --
> Eric Jain
> 
> 
> 


Re: Analyzers for various languages

2002-12-30 Thread Che Dong
For Asian languages (Chinese, Korean, Japanese), bigram based word segmentation is an easy way to 
solve the word segmentation problem. 
Bigram based word segmentation is: C1C2C3C4 => C1C2 C2C3 C3C4 (each C# is a single CJK 
character term).
I think this would let StandardTokenizer handle multi-language mixed content: 
Chinese/English or Japanese/French mixed content. 

In CJKTokenizer (modified from StopTokenizer) I use a one-char buffer that remembers the previous 
CJK character to make the overlapping term (Ci-1 + Ci); a sketch of the idea follows below.
But in StandardTokenizer I still don't know how to make:
T1T2T3T4 => T1T2 T2T3 T3T4 (each T# is a single CJK character term).

For more articles on word segmentation for Asian languages:
http://www.google.com/search?q=chinese+word+segment+bigram
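
A hedged sketch of the one-char-buffer overlap idea described above, written as a TokenFilter over a stream of single-character CJK tokens; this is illustrative, not the actual CJKTokenizer source, and the class name BigramFilter is mine:

import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class BigramFilter extends TokenFilter {
    private Token prev = null;  // the one-char buffer

    public BigramFilter(TokenStream in) {
        super(in);
    }

    public Token next() throws IOException {
        // pair each single-character token with the one before it,
        // so C1 C2 C3 C4 comes out as C1C2 C2C3 C3C4
        for (Token cur = input.next(); cur != null; cur = input.next()) {
            Token last = prev;
            prev = cur;
            if (last != null) {
                return new Token(last.termText() + cur.termText(),
                                 last.startOffset(), cur.endOffset());
            }
        }
        prev = null;
        return null;
    }
}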

Regards

Che, Dong
- Original Message - 
From: "Eric Isakson" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Saturday, December 07, 2002 12:40 AM
Subject: Analyzers for various languages


> Hi All, 
> 
> I want to volunteer to help get language modules organized into the CVS and builds.
> 
> I've been lurking on the lists here for a couple months and working with and getting 
>familiar with Lucene. I'm investigating the use of lucene to support our help 
>system's fulltext search requirements. I have to build indices for multiple 
>languages. I just poked around the CVS archives and found only the German, Russian 
>and standard(English) analyzers in the core and nothing in the sandbox. In the list 
>archives I've found many references to folks using Lucene for several other 
>languages. I did find the CJKTokenizer, Dutch and French analyzers and have put those 
>into my tests. Is there somewhere these analyzers are organized that I might get a 
>hold of the sources for other languages to build into my toolset? There were a couple 
>mentioned that several of you appear to be using that I can't find the sources for 
>(most notably http://www.halyava.ru/do/org.apache.lucene.analysis.zip 
><http://www.halyava.ru/do/org.apache.lucene.analysis.zip>  which gives a "Cannot find 
>server" error). 
> 
> In order to meet the requirements for my product these are the languages I have to 
>support: 
> 
> Must Support 
>  
> English
> Japanese 
> Chinese 
> Korean 
> French 
> German 
> Italian 
> Polish 
> 
> Not Sure Yet 
>  
> Czech 
> Danish 
> Hebrew 
> Hungarian 
> Russian 
> Spanish 
> Swedish 
> 
> I understand the issues that were raised about putting language modules in the core 
>and then not being able to support them, but it seems they have not been put 
>anywhere. I would be willing to try and get them into a central place that people can 
>access them or help someone that is already working on that. I can't commit today to 
>being able to maintain or bugfix contributions, but should my company adopt Lucene as 
>our search engine (which seems likely at this point) I'll do what I can to contribute 
>back any fixes we make. I also have a personal interest in the project since I've 
>found Lucene quite interesting to be working with and I've enjoyed learning about 
>internationalizing java apps.
> 
> I'll volunteer to help gather and organize these somewhere if I were given committer 
>rights to the appropriate area and folks would be willing to send me their language 
>modules. 
> 
> I recall some discussion about moving language modules out of the core, but I don't 
>think any decisions were made about where to put them (perhaps this is why they 
>aren't in the CVS at all). I was thinking perhaps give each language a sandbox 
>project or create language packages in the core build that could be enabled via 
>settings in the build.properties file. Using the build.properties file could allow us 
>to create a jar for each language during the core build so folks could install just 
>the language modules they want and if a language module starts breaking due to 
>changes in the core it could easily be turned off until fixes were made to that 
>module. I can start working on a setup like this in my local source tree next week 
>using the existing language modules in the core if you all think this would be a good 
>approach. If not, does anyone have a proposal for where these belong so we can get 
>some movement on getting them committed to CVS?
> 
> Regards,
> Eric
> -- 
> Eric D. IsaksonSAS Institute Inc. 
> Application Developer  SAS Campus Drive 
> XML Technologies   Cary, NC 27513 
> (919) 531-3639 http://www.sas.com <http://www.sas.com>  
> 
> 
> 


Use a filter instead of searching Re: Error when trying to match file path

2002-12-30 Thread Che Dong
First, index the file path as a stored, untokenized field:
Field("filePath", file.getAbsolutePath(), true, true, false)

Second, construct a prefix filter for the searcher. I wrote a StringFilter.java for exact match 
and prefix match, which can be downloaded from:
http://www.chedong.com/tech/lucene_ext.tar.gz
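
A hedged sketch of both steps, using the Field.Keyword shorthand for a stored, indexed, untokenized field; the field name and helper class are illustrative:

import java.io.File;
import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.TermQuery;

public class PathSearchDemo {
    public static void addPath(Document doc, File file) {
        // stored + indexed + untokenized: the whole path is one term
        doc.add(Field.Keyword("filePath", file.getAbsolutePath()));
    }

    public static Hits exactMatch(IndexSearcher searcher, String path)
            throws IOException {
        return searcher.search(new TermQuery(new Term("filePath", path)));
    }

    public static Hits prefixMatch(IndexSearcher searcher, String prefix)
            throws IOException {
        // matches every path starting with the prefix, bypassing
        // QueryParser (which mangles '\' and ':' in Windows paths)
        return searcher.search(new PrefixQuery(new Term("filePath", prefix)));
    }
}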

Che , Dong
- Original Message - 
From: "Rob Outar" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Thursday, October 24, 2002 8:45 PM
Subject: RE: Error when trying to match file path


> Some more information, with the following:
> 
> this.query =
> QueryParser.parse(file.getAbsolutePath(),"path",this.analyzer);
> System.out.println(this.query.toString("path"));
> I got:
> F:"onesaf dev block b pair dev unittestdatafiles tools unitcomposer.xml"
> 
> So it looks like the Query Parser is stripping out all the "\", and doing
> something with the F:\, would anyone happen to know why this is happening?
> Do I need to use a different query to get the information I need?
> 
> Thanks,
> 
> Rob
> 
> -Original Message-
> From: Rob Outar [mailto:[EMAIL PROTECTED]]
> Sent: Wednesday, October 23, 2002 5:48 PM
> To: [EMAIL PROTECTED]
> Subject: Error when trying to match file path
> 
> 
> Hi all,
> 
> I am indexing the filepath with the below:
> 
>  Document doc = new Document();
> doc.add(Field.UnIndexed("path", f.getAbsolutePath()));
> 
> I then try to run the following after building the index:
> 
>  this.query =
> QueryParser.parse(file.getAbsolutePath(),"path",this.analyzer);
> Hits hits = this.searcher.search(this.query);
> 
> It returns zero hits?!?
> 
> What am I doing wrong?
> 
> Any help would be appreciated.
> 
> Thanks,
> 
> Rob
> 
> 
>


Re: Your experiences with Lucene

2002-12-30 Thread Che Dong
My experiences:
1 Cache, if the source documents don't update frequently.
2 Cache only the first 100 results. When a user reads past 100 results, search again and 
build a 200-result buffer; if they reach the end again, search once more and build a 400-result 
buffer (sketched below).

http://search.163.com uses Lucene for category search and news search, handling 10 
queries/sec with two PIII servers (1 GHz, Linux).
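
A hypothetical sketch of the doubling buffer in point 2 above; the class name and the exact re-search policy are illustrative, not WebLucene source:

import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class DoublingResultCache {
    private final IndexSearcher searcher;
    private final Query query;
    private Document[] buffer = new Document[0];

    public DoublingResultCache(IndexSearcher searcher, Query query) {
        this.searcher = searcher;
        this.query = query;
    }

    // Return result n; when a caller reads past the end of the cache,
    // search again and keep a buffer twice as large (100, 200, 400, ...).
    public Document get(int n) throws IOException {
        if (n >= buffer.length) {
            int want = 100;
            while (want <= n) want *= 2;
            Hits hits = searcher.search(query);   // re-run the search
            Document[] docs = new Document[Math.min(want, hits.length())];
            for (int i = 0; i < docs.length; i++) {
                docs[i] = hits.doc(i);
            }
            buffer = docs;
        }
        // callers asking past the last real hit simply get null
        return n < buffer.length ? buffer[n] : null;
    }
}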

Che, Dong
- Original Message - 
From: "Tim Jones" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Wednesday, October 30, 2002 4:02 AM
Subject: Your experiences with Lucene


> Hi,
>  
> I am currently starting work on a project that requires indexing and
> searching on potentially thousands, maybe tens of thousands, of text
> documents.
>  
> I'm hoping that someone has a great success story about using Lucene for
> a project that required indexing and searching of a large number of
> documents.
> Like maybe more than 10,000. I guess what I'm trying to figure out is if
> Lucene's performance will be acceptable where the number of documents is
> very large.
> I realize this is a very general question but I just need a general
> answer.
>  
> Thanks,
>  
> Tim J.
>


Re: Bitset Filters

2002-12-30 Thread Che Dong
I wrote a StringFilter for exact match and prefix match field filtering:
http://www.chedong.com/tech/lucene_ext.tar.gz
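
For the bitset-filter idea discussed in the quoted thread below, here is a minimal hedged sketch of a filter that runs a query once per index reader and caches the matching documents as a BitSet, similar in spirit to the sandbox QueryFilter; the class name and caching policy are illustrative:

import java.io.IOException;
import java.util.BitSet;
import java.util.WeakHashMap;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.HitCollector;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class CachingQueryFilter extends Filter {
    private final Query query;
    private final WeakHashMap cache = new WeakHashMap(); // reader -> BitSet

    public CachingQueryFilter(Query query) {
        this.query = query;
    }

    public BitSet bits(IndexReader reader) throws IOException {
        BitSet cached = (BitSet) cache.get(reader);
        if (cached != null) return cached;       // reuse earlier result
        final BitSet bits = new BitSet(reader.maxDoc());
        // run the repeated query once, remembering which docs matched
        new IndexSearcher(reader).search(query, new HitCollector() {
            public void collect(int doc, float score) {
                bits.set(doc);
            }
        });
        cache.put(reader, bits);
        return bits;
    }
}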

Che, Dong
- Original Message - 
From: "Terry Steichen" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Saturday, October 26, 2002 6:08 AM
Subject: Bitset Filters


> Peter,
> 
> Could you give, or point to, a couple of examples on how to use bitset
> filters in the way you describe below?
> 
> Regards,
> 
> Terry
> 
> - Original Message -
> From: "Peter Carlson" <[EMAIL PROTECTED]>
> To: "Lucene Users List" <[EMAIL PROTECTED]>
> Sent: Tuesday, October 22, 2002 11:26 PM
> Subject: Re: Need Help URGENT
> 
> 
> > I think the answer is yes.
> >
> > When creating a Lucene Document you can create a field which is the URL
> > field. If you are not searching for words within the field, I would
> > probably make it a keyword field type so you don't tokenize it into
> > multiple Terms.
> >
> > Then you can create a multi-field search.
> >
> >
> > url:www.apache.org AND lucene
> >
> > Where url is the field where the URL exists and the term you want to
> > search for in your default field is Lucene.
> >
> > To answer what I think your second question is I will restate the
> > question.
> >
> > Can Lucene support subsearching.
> > Well yes and no. I will answer how to accomplish this, there is also
> > some information in the FAQ about this.
> >
> > You can just add criteria to the search so
> >
> > url:www.apache.org AND lucene AND indexing
> >
> > This will return the subset of information.
> >
> > If you are going to do the same search over and over again, you may
> > also want to look at filters, which basically keep a bitset of a Lucene
> > search results so you don't actually have to do the search again, just
> > an intersection of two bitsets.
> >
> > When you get the Hits back you can get the information from what ever
> > field you want including the URL field that you will create.
> >
> > I hope this helps and is on the mark. If not, the answer in can you use
> > Lucene to accomplish the task the answer is typically yes (The
> > questions then become just how much work has to be done on top of
> > Lucene, or is Lucene the right tool).
> >
> > --Peter
> >
> >
> >
> > On Tuesday, October 22, 2002, at 04:32 PM, nandkumar rayanker wrote:
> >
> > > Hi,
> > >
> > > Further to the request already made in my previous
> > > mail, I would like to know:
> > >
> > > - Whether I can use lucene to search the remote site
> > > or not?
> > >
> > > Here is what I want to do:
> > > - Install Lucene and create search info for
> > > a given URL.
> > >
> > > - Search the info already created.
> > >
> > > Can I do this sort of thing using Lucene or not?
> > >
> > > thanks and regards
> > > Nandkumar
> > >
> > > --- nandkumar rayanker <[EMAIL PROTECTED]>
> > > wrote:
> > >> Hi,
> > >>
> > >> I need to develop search java stand alone
> > >> application,
> > >> which takes "SearchString" and "URL/URLS"
> > >>
> > >> "SearchString": string to be searched in web
> > >>
> > >> URL/URLS" : List of URLs where string needs to
> > >> searched.
> > >> return: List of URL/URLS where "SearchString" is
> > >> found.
> > >>
> > >> thanks & regards
> > >> Nandkumar
> > >>


Re: OutOfMemoryException while Indexing an XML file

2003-02-16 Thread Che Dong
Maybe you can use a SAX parser and index the XML source as a stream.

Here is my demo:
http://www.chedong.com/tech/lucene_ext.tar.gz

Che, Dong
- Original Message - 
From: "Tatu Saloranta" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Saturday, February 15, 2003 9:18 AM
Subject: Re: OutOfMemoryException while Indexing an XML file


> On Friday 14 February 2003 07:27, Aaron Galea wrote:
> > I had this problem when using xerces to parse xml documents. The problem I
> > think lies in the Java garbage collector. The way I solved it was to create
> 
> It's unlikely that GC is the culprit. Current ones are good at purging objects 
> that are unreachable, and only throw OutOfMem exception when they really have 
> no other choice.
> Usually it's the app that has some dangling references to objects that prevent 
> GC from collecting objects not useful any more.
> 
> However, it's good to note that Xerces (and DOM parsers in general) generally 
> use more memory than the input XML files they process; this because they 
> usually have to keep the whole document struct in memory, and there is 
> overhead on top of text segments. So it's likely to be at least 2 * input 
> file size (files usually use UTF-8 which most of the time uses 1 byte per 
> char; in memory 16-bit unicode-2 chars are used for performance), plus some 
> additional overhead for storing element structure information and all that.
> 
> And since default max. java heap size is 64 megs, big XML files can cause 
> problems.
> 
> More likely however is that references to already processed DOM trees are not 
> nulled in a loop or something like that? Especially if doing one JVM process 
> per item solves the problem.
> 
> > a shell script that invokes a java program for each xml file that adds it
> > to the index.
> 
> -+ Tatu +-
> 
> 
> 


Re: Indexing XML with Lucene

2003-02-16 Thread Che Dong
Maybe you can use a SAX parser and index the XML source as a stream.

Here is my demo:
http://www.chedong.com/tech/lucene_ext.tar.gz

Che, Dong
http://www.chedong.com

- Original Message - 
From: "Pierre Lacchini" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Friday, February 14, 2003 5:49 PM
Subject: Indexing XML with Lucene


> Hello,
> 
> I'm using Lucene, and I need to index an XML Database (Tamino).
> How can I do that? Do I have to use an XML parser such as Digester?
> 
> I'm kinda noob with Lucene, and I really need help ;)
> 
> Thx !Pierre Lacchini
> Consultant développement
> 
> PeopleWare
> 12, rue du Cimetière
> L-8413 Steinfort
> Phone : + 352 399 968 35
> http://www.peopleware.lu
> 
> 
>


PLAN: WebLucene -- Lucene Web interface, use XML as a lightweight protocol.

2003-02-19 Thread Che Dong
http://sourceforge.net/projects/weblucene/

WebLucene: Lucene Web interface, use XML as a lightweight protocol. 

Developers convert a data source (text, DB, MS Word, PDF... etc.) into a standard XML format, 
index it with the Lucene engine, and get full text search results via HTTP with XML format 
output; users can easily integrate it with JSP/ASP/PHP front ends, or use XSLT at the server 
side to transform the output.

Developers can integrate the Lucene full text search engine with existing MSSQL + ASP, 
MySQL + PHP, or Oracle + JSP based web applications.

MySQL   \
Oracle   - DB \
MSSQL   /      \
MS Word ------- ==> XML ==> (Lucene Index) ==> XML ==> JSP / ASP / PHP
PDF     -------/                                  \            / XHTML
                                                   ==XSLT==>  -  text
                                                               \ XML

         \____________________ WebLucene ____________________/

i18n issue: since Java is Unicode based, users can index data sources (XML) in different 
charsets into one Lucene index (in Unicode) and output results according to the languages the 
client browser supports.
  GBK        \                               / BIG5
  BIG5        -  UNICODE ==> Unicode output -  GB2312
  SJIS        -   (XML)         (XML)       -  SJIS
  ISO-8859-1 /                               \ ISO-8859-1


Che, Dong
http://www.chedong.com/tech/





Re: PLAN: WebLucene -- Lucene Web interface, use XML as a lightweight protocol.

2003-02-20 Thread Che Dong
Yes, I think that compared to JavaBean and SOAP based APIs, an HTTP/URI/XML-based API is much 
simpler.

I'd like to hear more opinions as this project starts.

Regards

Che, Dong
- Original Message - 
From: "Michael Wechner" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Cc: "Lucene Developers List" <[EMAIL PROTECTED]>
Sent: Thursday, February 20, 2003 6:13 PM
Subject: Re: PLAN: WebLucene -- Lucene Web interface, use XML as a lightweight 
protocol.


> That's very interesting.
> 
> I have tried something similar by integrating
> Lucene into Wyona, which is a CMS based on Cocoon,
> and I also separated Structure from Layout. You can try it out at
> 
> HTML:
> 
> 
>http://195.226.6.70:8080/wyona-cms/oscom/search-oscom/lucene?publication-id=all&queryString=Cocoon+Wyona&fields=all&find=Search
>
> 
> XML:
> 
> 
>http://195.226.6.70:8080/wyona-cms/oscom/search-oscom/lucene.xml?publication-id=all&queryString=Cocoon+Wyona&fields=all&find=Search
>
> 
> I think XooMLe also did a pretty good job:
> 
> http://www.dentedreality.com.au/xoomle/search/
> 
> Maybe we can find a way to join efforts.
> 
> Thanks
> 
> Michael
> 
> 
> Che Dong wrote:
> > http://sourceforge.net/projects/weblucene/
> > 
> > WebLucene: Lucene Web interface, use XML as a lightweight protocol. 
> > 
> > [rest of the announcement snipped; it is quoted in full in the original message above]
> > 
> > 
> 
> 
> 
> 


[PLAN]: SAXIndexer, indexing database via XML gateway

2003-06-06 Thread Che Dong
The current weblucene project includes a SAX based XML source indexer:
http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/weblucene/weblucene/webapp/WEB-INF/src/com/chedong/weblucene/index/

It can parse an XML data source like the following example (element tags reconstructed from the
field names described below; the archive stripped the original markup):

 <Record>
  <Id>39314</Id>
  <Title>title of document</Title>
  <Author>chedong</Author>
  <Content>blah blah</Content>
  <PubTime>2003-06-06</PubTime>
  <Index>Title,Content</Index>
  <Index>Author</Index>
 </Record>
 ...

I use two Index elements in each Record block to specify the field => index mapping. The 
SAXIndexer parses this XML source into Id, Title, Author, Content, and PubTime as store-only 
Lucene fields, and creates another two index fields:
one index field with Title + Content,
one index field with Author, untokenized.
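
A hedged sketch of how such a SAX handler might look: it streams the XML, buffers the character data of the current element, and adds one Lucene Document per Record. This is illustrative, not the actual SAXIndexer source, and the index-mapping step is only hinted at:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class RecordHandler extends DefaultHandler {
    private final IndexWriter writer;
    private final StringBuffer text = new StringBuffer();
    private Document doc;

    public RecordHandler(IndexWriter writer) {
        this.writer = writer;
    }

    public void startElement(String uri, String local, String qName,
                             Attributes atts) {
        text.setLength(0);                 // start collecting anew
        if ("Record".equals(qName)) {
            doc = new Document();
        }
    }

    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);    // buffer current element text
    }

    public void endElement(String uri, String local, String qName)
            throws SAXException {
        if ("Record".equals(qName)) {
            try {
                writer.addDocument(doc);   // flush per record: memory stays flat
            } catch (java.io.IOException e) {
                throw new SAXException(e);
            }
        } else if (doc != null) {
            // one field per child element; the Index-element mapping
            // (combined Title + Content field, etc.) would be applied here
            doc.add(Field.Text(qName, text.toString()));
        }
    }
}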

Recently I have noticed more and more applications providing XML interfaces very similar to RSS.
For example, you can even dump a table into XML output from phpMyAdmin, like the following:


  

[phpMyAdmin XML dump sample: one row of the MySQL user table (Host "localhost", User "root",
a run of Y privilege flags, and numeric limits); the element tags were stripped by the archive]


The SAXIndexer would be able to index a database XML dump directly if it allowed the 
field => index mapping rules to be specified by an external program,
for example: 
java IndexRunner -c field_index_mapping.conf -i http://localhost/table_dump.xml

# the config file would look like the following:
FullIndex    Title,Content 
AuthorIndex  Author  no

I hope this SAXIndexer can be added into the Lucene demos, so that end users can build a 
Lucene index from their current database applications.

Regards

Che, Dong
http://www.chedong.com/

Re: commercial websites powered by Lucene?

2003-06-05 Thread Che Dong
http://search.163.com: the China portal NetEase uses Lucene for directory search and news 
search.


Che, Dong
http://www.chedong.com

- Original Message - 
From: <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Wednesday, June 04, 2003 10:08 PM
Subject: commercial websites powered by Lucene?


> 
> 
> Hello All,
> 
> I've been trying to find examples of large commercial websites that
> use Lucene to power their search.  Having such examples would
> make Lucene an easy sell to management
> 
> Does anyone know of any good examples?  The bigger the better, and
> the more the better.
> 
> TIA,
> -John
> 
> 
> 
> 
> 

WebLucene: XML gateway for Lucene

2003-06-19 Thread Che Dong
Hi All:
Today I also read Otis's 'Parsing, indexing, and searching XML with Digester and 
Lucene'
at:  http://www-106.ibm.com/developerworks/java/library/j-lucene/

After a long delay, I have decided to release a demo of WebLucene, although it is still not 
very polished.

WebLucene: Lucene Web interface, using XML as a lightweight protocol. Developers can convert a 
data source (text, DB, MS Word, PDF... etc.) into XML format, index it with the Lucene engine, 
and get full text search results via HTTP with XML format output; users can easily integrate it 
with JSP/ASP/PHP front ends, or use XSLT at the server side to transform the output.
http://sourceforge.net/projects/weblucene/

In this application, I think the following address some of the most frequently asked questions 
on the Lucene user list:
1 Custom sorting: with docID based sorting, we can sort results according to data 
source order.
2 Internationalization: CJKTokenizer.
 XML input avoids a lot of double byte character decoding problems for applications running 
on an iso-8859-1 platform.
3 I rewrote the SAXIndexer to fit RSS-like XML source indexing.
4 Highlighting support: WebLuceneHighlighter is a token based highlighter.

TODO:
1 RSS indexing demo
2 Documents

Regards

Che, Dong
http://www.chedong.com/




Re: making XML from articles

2003-07-07 Thread Che Dong
>>// just remove characters that are invalid in XML: in PHP
>>$pattern = "/[\x0-\x8\xb-\xc\xe-\x1f]/";  // control chars except tab, LF, CR
>>$string = preg_replace($pattern, '', $string);

- Original Message - 
From: "Jagdip Singh" <[EMAIL PROTECTED]>
To: "'Lucene Users List'" <[EMAIL PROTECTED]>
Sent: Monday, July 07, 2003 7:53 AM
Subject: making XML from articles


> Hi,
> I am trying to use Lucene for searching articles (text files) and web
> pages. I am thinking of converting those articles to XML files and then
> feed to Lucene for indexing.
> I have not done much with XML before and am trying to find out whether this
> is going to be a better idea in terms of searching. 
> How can I convert text into XML?
>  
> Please suggest me if someone has faced similar situation before.
>  
> Regards, 
> Jagdip
> 

How to implement Similarity for custom sorting by field ( or by docID)?

2003-07-15 Thread Che Dong
Hi All: 
The Lucene 1.3 rc1 release includes a Similarity class for custom scoring. Is it possible to 
implement docID based (or field value based) scoring besides DefaultSimilarity? 

Thanks.

Che, Dong
http://www.chedong.com/tech/lucene.html


Re: CJK support in lucene

2003-07-17 Thread Che Dong
I think Traditional Chinese as used in HK and TW is supported, since CJK characters are 
identified by the character block CJK_UNIFIED_IDEOGRAPHS.

more:
http://sourceforge.net/projects/weblucene/

Che, Dong
- Original Message - 
From: "Eric Isakson" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Thursday, July 17, 2003 2:04 AM
Subject: FW: CJK support in lucene




-Original Message-
From: Eric Isakson 
Sent: Wednesday, July 16, 2003 2:04 PM
To: 'Avnish Midha'
Subject: RE: CJK support in lucene


I'm no linguist, so the short answer is, I'm not sure about Taiwanese. If they share 
the same character sets and a bigram indexing approach makes sense for that language 
(read the links in the CJKTokenizer source), then it would probably work.

For Latin-1 languages, it will tokenize (It is setup to deal with mixed language 
documents where some of the text might be Chinese and some might be English) but it 
will be far less efficient than the standard tokenizer supplied with the Lucene core. 
But you should run your own tests to see if that would be livable.

Eric

-Original Message-
From: Avnish Midha [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, July 16, 2003 1:50 PM
To: Eric Isakson
Cc: Lucene Users List
Subject: RE: CJK support in lucene



Eric,

Does this tokenizer also support Taiwanese & European languages (Latin-1)?

Regards,
Avnish

-Original Message-
From: Eric Isakson [mailto:[EMAIL PROTECTED]
Sent: Wednesday, July 16, 2003 10:38 AM
To: Avnish Midha
Cc: Lucene Users List
Subject: RE: CJK support in lucene


This archived message has the CJKTokenizer code attached (there are some links in the 
code to material that describes the tokenization strategy).

http://nagoya.apache.org/eyebrowse/[EMAIL PROTECTED]e.org&msgId=330905

You have to write your own analyzer that uses this tokenizer. See 
http://www.onjava.com/pub/a/onjava/2003/01/15/lucene.html for some details on how to 
write an analyzer.

here is one you could use:
package my.package;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cjk.CJKTokenizer;
import java.io.Reader;

public class CJKAnalyzer extends Analyzer {

public CJKAnalyzer() {
}

/**
 * Creates a TokenStream which tokenizes all the text in the provided Reader.
 *
 * @return  A TokenStream built from a CJKTokenizer
 */
public TokenStream tokenStream( String fieldName, Reader reader )
{
TokenStream result = new CJKTokenizer( reader );
// CJKTokenizer sometimes emits an empty token (""); I haven't been
// able to figure out why, so this StopFilter is a workaround.
result = new StopFilter(result, new String[] {""});
return result;
}
}
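
An assumed usage sketch: the same analyzer must be used when writing the index and when parsing queries, so both sides tokenize identically. It assumes the CJKAnalyzer class shown above is on the classpath; the index path and field name are illustrative:

import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class CJKUsage {
    public static void main(String[] args) throws Exception {
        CJKAnalyzer analyzer = new CJKAnalyzer();
        // index side: pass the analyzer to the writer
        IndexWriter writer = new IndexWriter("/tmp/index", analyzer, true);
        // ... add documents here ...
        writer.close();
        // query side: parse with the SAME analyzer
        Query q = QueryParser.parse("text to find", "contents", analyzer);
        System.out.println(q.toString("contents"));
    }
}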

Lastly, you have to package those things up and use them along with the core lucene 
code.

CC'ing this to Lucene User so everyone can benefit from these answers. Maybe a FAQ on 
indexing CJK languages would be a good thing to add. The existing one 
(http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi?file=chapter.indexing&toc=faq#q28) 
is somewhat light on details (so is this answer, but it is a bit more direct about dealing 
with CJK), and http://www.jguru.com/faq/view.jsp?EID=108 is useful to be aware of too.

Good luck,
Eric

-Original Message-
From: Avnish Midha [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, July 16, 2003 1:06 PM
To: Eric Isakson
Subject: CJK support in lucene



Hi Eric,

I read the description of the bug (#18933) you reported on the Apache site. I had a 
question related to this defect. In the description you mentioned that CJK support 
should be included in the core build. Is there any other way we can enable CJK 
support in the Lucene search engine? I would be grateful if you could let me 
know of any such method of enabling CJK support in the search engine.

Eagerly waiting for your reply.

Thanks & Regards,
Avnish Midha
Phone no.: +1-949-8852540







WebLucene 0.3 release:support CJK, use sax based indexing, docID based result sorting and xml format output with highlighting support.

2003-11-30 Thread Che Dong
http://sourceforge.net/projects/weblucene/

WebLucene: 
Lucene search engine XML interface, providing SAX based indexing, indexing-sequence 
based result sorting, and XML output with highlighting support. The CJKTokenizer supports 
Chinese, Japanese and Korean together with Western languages simultaneously.

The key features:
1 The bi-gram based CJK support: org/apache/lucene/analysis/cjk/CJKTokenizer

2 docID based result sorting: org/apache/lucene/search/IndexOrderSearcher

3 xml output: com/chedong/weblucene/search/DOMSearcher

4 sax based indexing: com/chedong/weblucene/index/SAXIndexer

5 token based highlighter: 
reverse StopTokenizer:
org/apache/lucene/analysis/HighlightAnalyzer.java
  HighlightFilter.java
with abstract:
com/chedong/weblucene/search/WebLuceneHighlighter

6 A simplified query parser:
google like syntax with term limit
org/apache/lucene/queryParser/SimpleQueryParser
modified from early version of Lucene :)

Regards

Che, Dong

Re: WebLucene 0.3 release:support CJK, use sax based indexing, docID based result sorting and xml format output with highlighting support.

2003-12-01 Thread Che Dong
build.properties.default 

# -
# WebLucene  BUILD  PROPERTIES
# -
jsdk_jar=/usr/local/resin/lib/jsdk23.jar

# Home directory of JavaCC
javacc.home = /usr/java/javacc/bin

# modify following on Windows
# jsdk_jar=c:\\resin\\lib\\jsdk23.jar
# javacc.home = c:\\java\\javacc\\bin


javacc.zip.dir = ${javacc.home}/lib
javacc.zip = ${javacc.zip.dir}/JavaCC.zip

Che, Dong
- Original Message - 
From: "Tun Lin" <[EMAIL PROTECTED]>
To: "'Lucene Developers List'" <[EMAIL PROTECTED]>; "'Lucene Users List'" <[EMAIL 
PROTECTED]>
Sent: Monday, December 01, 2003 11:34 AM
Subject: RE: WebLucene 0.3 release:support CJK, use sax based indexing, docID based 
result sorting and xml format output with highlighting support.


> Hi,
> 
> Do you have the install.txt for windows XP setup of the WebLucene? It seems that
> the install.txt is only for UNIX setup.
> 
> Thanks.  
> 
> -Original Message-
> From: Che Dong [mailto:[EMAIL PROTECTED] 
> Sent: Sunday, November 30, 2003 9:57 PM
> To: Lucene Developers List; Lucene Users List
> Subject: WebLucene 0.3 release:support CJK, use sax based indexing, docID based
> result sorting and xml format output with highlighting support.
> 
> http://sourceforge.net/projects/weblucene/
> 
> WebLucene: 
> Lucene search engine XML interface, provided sax based indexing, indexing
> sequence based result sorting and xml output with highlight support.The
> CJKTokenizer support Chinese Japanese and Korean with Westen language
> simultaneously.
> 
> [rest of the announcement snipped; it is quoted in full in the original message above]
> 
> 
> 
> 
> 

WebLucene 0.4 released: added full featured demo(dump data php scripts and demo data in Chinese)

2003-12-16 Thread Che Dong
http://sourceforge.net/projects/weblucene/

WebLucene: 
Lucene search engine XML interface, providing SAX based indexing, indexing-sequence 
based result sorting, and XML output with highlighting support. 

The key features:
1 The bigram based CJK support: org/apache/lucene/analysis/cjk/CJKTokenizer. The 
CJKTokenizer supports Chinese, Japanese and Korean together with Western languages simultaneously.

2 DocID based result sorting: org/apache/lucene/search/IndexOrderSearcher

3 xml output: com/chedong/weblucene/search/DOMSearcher

4 sax based indexing: com/chedong/weblucene/index/SAXIndexer

5 token based highlighter: 
reverse StopTokenizer:
org/apache/lucene/analysis/HighlightAnalyzer.java
  HighlightFilter.java
with abstract:
com/chedong/weblucene/search/WebLuceneHighlighter

6 A simplified query parser:
google like syntax with term limit
org/apache/lucene/queryParser/SimpleQueryParser
modified from early version of Lucene :)

7 Add full featured demo (including dump script and sample data) runs on: 
http://www.blochina.com/weblucene/

Regards


Che Dong
http://www.chedong.com/tech/weblucene.html


Re: WebLucene 0.4 released: added full featured demo(dump data php scripts and demo data in Chinese)

2003-12-16 Thread Che Dong
sorry, demo address is:
http://www.blogchina.com/weblucene/


Che, Dong
- Original Message - 
From: "Che Dong" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Wednesday, December 17, 2003 1:33 AM
Subject: WebLucene 0.4 released: added full featured demo(dump data php scripts and 
demo data in Chinese)


> http://sourceforge.net/projects/weblucene/
> 
> WebLucene: 
> Lucene search engine XML interface, provided sax based indexing, indexing sequence 
> based result sorting and xml output with highlight support. 
> 
> [rest of the announcement snipped; it is quoted in full in the original message above]
> 

Re: Japanese Analyzer

2004-01-30 Thread Che Dong
As far as I know, for East Asian languages (which naturally have no spaces for word 
segmentation), bigram based word segmentation is probably the best non-dictionary based solution.

Regards

Che, Dong

- Original Message - 
From: "Erik Hatcher" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Saturday, January 31, 2004 1:14 AM
Subject: Re: Japanese Analyzer


> On Jan 29, 2004, at 1:45 PM, Otis Gospodnetic wrote:
> > --- "Weir, Michael" <[EMAIL PROTECTED]> wrote:
> >> Is the CJKAnalyzer the best to use for Japanese?  If not, which is?
> >> If so,
> >> from where can I download it?
> 
> There is also a ChineseTokenizer/Analyzer in the sandbox as well.  It 
> may have value for Japanese as well?
> 
> Erik
> 
> 
> 
> 

Re: CJK Analyzer in lucene 1.3 final

2004-02-29 Thread Che Dong
here is a demo on BlogChina.com
http://www.blogchina.com/weblucene/

Che Dong

- Original Message - 
From: "Ankur Goel" <[EMAIL PROTECTED]>
To: "'Lucene Users List'" <[EMAIL PROTECTED]>
Sent: Saturday, February 28, 2004 12:11 AM
Subject: RE: CJK Analyzer in lucene 1.3 final




I tried with StandardAnalyzer but was not able to do so. Then I tried the CJKAnalyzer 
built on CJKTokenizer, but was again unsuccessful. The file whose text is to be indexed 
contains both English and Japanese characters.
Can this be a problem?

Regards
Ankur 

-Original Message-
From: ? ? [mailto:[EMAIL PROTECTED] 
Sent: Friday, February 27, 2004 7:58 PM
To: [EMAIL PROTECTED]
Subject: Re: CJK Analyzer in lucene 1.3 final

For East Asian languages, which have no spaces for word segmentation in nature, the 
StandardTokenizer is now unigram based: C1C2C3 ==> C1 C2 C3, so searching for C1C2 
and C2C1 will return the same results.

CJKTokenizer is bigram based: C1C2C3 ==> C1C2 C2C3, so a search for C2C1 will not 
falsely match a document containing C1C2C3.
Briefly: CJKTokenizer is better than StandardTokenizer for CJK, but I don't 
know how to implement bigram based tokens in StandardTokenizer.

Che Dong
http://www.chedong.com/tech/lucene.html

>From: Erik Hatcher <[EMAIL PROTECTED]>
>Reply-To: "Lucene Users List" <[EMAIL PROTECTED]>
>To: "Lucene Users List" <[EMAIL PROTECTED]>
>Subject: Re: CJK Analyzer in lucene 1.3 final
>Date: Fri, 27 Feb 2004 08:29:10 -0500
>
>On Feb 27, 2004, at 7:12 AM, Ankur Goel wrote:
>>  Hi,
>>In the lucene-1.3-final version's CHANGES.txt it is written that 
>>"Fix
>>StandardTokenizer's handling of CJK characters (Chinese, Japanese 
>>and Korean
>>ideograms)."
>>
>>Does it mean that for CJK characters we now do not need to use any 
>>separate
>>analyzer, standard analyzer will be sufficient??
>
>You tell us.  Does it work for you?
>
>An analyzer is a pretty personal decision based on your dataset, so 
>it is impossible to answer your question directly.
>
> Erik
>
>
>

_
免费下载 MSN Explorer:   http://explorer.msn.com/lccn/  





Re: dynamic summary

2004-02-29 Thread Che Dong
WebLucene includes a summary package:
http://sourceforge.net/projects/weblucene/

Che Dong
- Original Message - 
From: "umamahesh bayireddya" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Saturday, February 28, 2004 4:01 AM
Subject: dynamic summary 


> hi all
> 
> how to generate dynamic summary based on query?
> 
> Right now I am trying to get the query results, parse the HTML files returned 
> in the results, and display the sentence in which the query matched.
> 
> thanks
> mahesh
> 
> _
> Contact brides & grooms FREE! http://www.shaadi.com/ptnr.php?ptnr=hmltag 
> Only on www.shaadi.com. Register now!
> 
> 
> 
> 

Re: CJK Analyzer indexing japanese word document

2004-03-16 Thread Che Dong
Some Korean friends have told me they use it successfully for Korean, so I think it will also 
work for Japanese. Mostly the problem is locale settings.

Please check weblucene project for xml indexing samples:
http://sourceforge.net/projects/weblucene/ 

Che Dong
- Original Message - 
From: "Chandan Tamrakar" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Tuesday, March 16, 2004 4:31 PM
Subject: CJK Analyzer indexing japanese word document


> 
> I am using a CJKAnalyzer from apache sandbox , I have set the java
> file.encoding setting to SJIS
> and  i am able to index and search the japanese html page . I can see the
> index dumps as i expected , However when i index a word document containing
> japanese characters it is not indexing as expected . Do I need to change
> anything with CJKTokenizer and CJKAnalyzer classes?
> I have been able to index a word document with StandardAnalyzers.
> 
> thanks in advace
> chandan
> 
> 
> 
> 
> 

Re: CJK Analyzer indexing japanese word document

2004-03-16 Thread Che Dong
Yes, store data in Unicode internally and present it localized externally.

Chinese users can read my documents on Java Unicode processing:
http://www.chedong.com/tech/hello_unicode.html
http://www.chedong.com/tech/unicode_java.html

Che Dong
http://www.chedong.com/

- Original Message - 
From: "Scott Smith" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Wednesday, March 17, 2004 6:42 AM
Subject: RE: CJK Analyzer indexing japanese word document


I have used this analyzer with Japanese and it works fine.  In fact, I'm
currently doing English, several western European languages, traditional
and simplified Chinese and Japanese.  I throw them all in the same index
and have had no problem other than my users wanted the search limited by
language.  I solved that problem by simply adding a keyword field to the
Document which has the 2-letter language code.  I then automatically add
the term indicating the language as an additional constraint when the
user specifies the search.  
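
A hedged sketch of the language-keyword approach just described; the field name "lang" and the helper class are illustrative:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class LanguageConstraint {
    // index side: tag each document with its 2-letter language code
    public static void tagLanguage(Document doc, String code) {
        doc.add(Field.Keyword("lang", code));   // e.g. "ja", "zh", "en"
    }

    // search side: AND the user's query with the language term
    public static Query restrict(Query userQuery, String code) {
        BooleanQuery q = new BooleanQuery();
        q.add(userQuery, true, false);                             // required
        q.add(new TermQuery(new Term("lang", code)), true, false); // required
        return q;
    }
}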

You do need to be sure that the Shift-JIS gets converted to unicode
before you put it in the Document (and pass it to the analyzer).
Internally, I believe lucene wants everything in unicode (as any good
java program would). Originally, I had problems with Asian languages and
eventually determined my xml parser wasn't translating my Shift-JIS,
Big5, etc. to unicode.  Once I fixed that, life was good.

-Original Message-
From: Che Dong [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, March 16, 2004 8:31 AM
To: Lucene Users List
Subject: Re: CJK Analyzer indexing japanese word document

some Korean friends tell me they use it successfully for Korean. So I
think its also work for Japanese. mostly the problem is locale settings

Please check weblucene project for xml indexing samples:
http://sourceforge.net/projects/weblucene/ 

Che Dong
- Original Message -
From: "Chandan Tamrakar" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Tuesday, March 16, 2004 4:31 PM
Subject: CJK Analyzer indexing japanese word document


> 
> I am using a CJKAnalyzer from apache sandbox , I have set the java
> file.encoding setting to SJIS
> and  i am able to index and search the japanese html page . I can see
the
> index dumps as i expected , However when i index a word document
containing
> japanese characters it is not indexing as expected . Do I need to
change
> anything with CJKTokenizer and CJKAnalyzer classes?
> I have been able to index a word document with StandardAnalyzers.
> 
> thanks in advace
> chandan
> 
> 
> 
> 
> 




Re: CJK Analyzer indexing japanese word document

2004-03-16 Thread Che Dong
Please check Java I/O's byte stream ==> character stream conversion.
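
A minimal sketch of that conversion, answering the Shift-JIS question in the quoted thread below; InputStreamReader does the decoding from a named charset to Java's internal Unicode:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;

public class SjisReaderDemo {
    public static void main(String[] args) throws IOException {
        InputStream bytes = new FileInputStream(args[0]);
        // byte stream -> character stream: decode Shift-JIS to Unicode
        Reader chars = new InputStreamReader(bytes, "Shift_JIS");
        BufferedReader in = new BufferedReader(chars);
        for (String line; (line = in.readLine()) != null; ) {
            System.out.println(line);   // now safe to pass to an analyzer
        }
        in.close();
    }
}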

Che Dong
- Original Message - 
From: "Chandan Tamrakar" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Wednesday, March 17, 2004 12:37 PM
Subject: Re: CJK Analyzer indexing japanese word document


> Thanks, Smith. How do I convert SJIS encoding into Unicode?
> As far as I know, Java converts ASCII and Latin-1 into Unicode by default.
> Which XML parsers are you using to translate to Unicode?
> 
> - Original Message -
> From: "Scott Smith" <[EMAIL PROTECTED]>
> To: "Lucene Users List" <[EMAIL PROTECTED]>
> Sent: Wednesday, March 17, 2004 4:27 AM
> Subject: RE: CJK Analyzer indexing japanese word document
> 
> 
> > I have used this analyzer with Japanese and it works fine.  In fact, I'm
> > currently doing English, several western European languages, traditional
> > and simplified Chinese and Japanese.  I throw them all in the same index
> > and have had no problem other than my users wanted the search limited by
> > language.  I solved that problem by simply adding a keyword field to the
> > Document which has the 2-letter language code.  I then automatically add
> > the term indicating the language as an additional constraint when the
> > user specifies the search.
> >
> > You do need to be sure that the Shift-JIS gets converted to unicode
> > before you put it in the Document (and pass it to the analyzer).
> > Internally, I believe lucene wants everything in unicode (as any good
> > java program would). Originally, I had problems with Asian languages and
> > eventually determined my xml parser wasn't translating my Shift-JIS,
> > Big5, etc. to unicode.  Once I fixed that, life was good.
> >
> > -Original Message-
> > From: Che Dong [mailto:[EMAIL PROTECTED]
> > Sent: Tuesday, March 16, 2004 8:31 AM
> > To: Lucene Users List
> > Subject: Re: CJK Analyzer indexing japanese word document
> >
> > some Korean friends tell me they use it successfully for Korean. So I
> > think its also work for Japanese. mostly the problem is locale settings
> >
> > Please check weblucene project for xml indexing samples:
> > http://sourceforge.net/projects/weblucene/
> >
> > Che Dong
> > - Original Message -
> > From: "Chandan Tamrakar" <[EMAIL PROTECTED]>
> > To: <[EMAIL PROTECTED]>
> > Sent: Tuesday, March 16, 2004 4:31 PM
> > Subject: CJK Analyzer indexing japanese word document
> >
> >
> > >
> > > I am using a CJKAnalyzer from apache sandbox , I have set the java
> > > file.encoding setting to SJIS
> > > and  i am able to index and search the japanese html page . I can see
> > the
> > > index dumps as i expected , However when i index a word document
> > containing
> > > japanese characters it is not indexing as expected . Do I need to
> > change
> > > anything with CJKTokenizer and CJKAnalyzer classes?
> > > I have been able to index a word document with StandardAnalyzers.
> > >
> > > thanks in advace
> > > chandan
> > >
> > >
> > >
> > >
> > >
> >
> >
> >
> >
> 
> 
> 
> 
> 

Will CJKAnalyzer be released with Lucene 1.4?

2004-05-29 Thread Che Dong
Hi All:
I checked the org/apache/lucene/analysis/cjk/ in lucene sandbox:
http://cvs.apache.org/viewcvs.cgi/jakarta-lucene-sandbox/contributions/analyzers/src/java/org/apache/lucene/analysis/cjk/

The original version works fine at http://search.163.com http://search.soufun.com and 
www.blogchina.com/weblucene/

Regards

Che Dong
http://www.chedong.com/tech/lucene.html

Re: Will CJKAnalyzer be released with Lucene 1.4?

2004-05-29 Thread Che Dong
Hi Erik:
Is it possible to move CJKAnalyzer out of the sandbox into the jakarta-lucene package?

Regards

Che Dong
- Original Message - 
From: "Erik Hatcher" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Sunday, May 30, 2004 5:17 AM
Subject: Re: Will CJKAnalyzer be released with Lucene 1.4?


> I'm not sure I understand your question.
> 
> At this point there is no plan to "release" the sandbox components -  
> they are really there in a batteries-not-included fashion at this point  
> but its all there to freely use if you like.
> 
> I did centralize the build system in contributions area, so each piece  
> should easily build into JAR files.
> 
> Is there more that you desire?
> 
> Erik
> 
> 
> On May 29, 2004, at 3:24 PM, Che Dong wrote:
> 
> > Hi All:
> > I checked the org/apache/lucene/analysis/cjk/ in lucene sandbox:
> > http://cvs.apache.org/viewcvs.cgi/jakarta-lucene-sandbox/ 
> > contributions/analyzers/src/java/org/apache/lucene/analysis/cjk/
> >
> > The original version works fine at http://search.163.com  
> > http://search.soufun.com and www.blogchina.com/weblucene/
> >
> > Regards
> >
> > Che Dong
> > http://www.chedong.com/tech/lucene.html
> 
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
> 

Bigram co-occurrences will be the better way for word discrimination. Re: Will CJKAnalyzer be released with Lucene 1.4?

2004-05-30 Thread Che Dong
> I would be against such a move.  I think Lucene's core has too many 
> analyzers in it already, such as the German and Russian ones.  The core 
> could do without any of the concrete analyzers altogether, in my 
> opinion - but it is handy to have a few general purpose convenience 
> ones.
+1
> 
> What benefit, besides convenience, would there be in CJKAnalyzer into 
> the core?  What about the all the others in the sandbox?  If we bring 
> one in, why not all of them?
But for CJK there are no spaces for word segmentation in nature, so bigram co-occurrence 
is the better way to do word discrimination.
For example: if the term C1C2 is segmented into C1 and C2, the results will also contain 
C2C1... but in Chinese, the words C1C2 and C2C1 may have different meanings.
Compared to the unigram based tokenization implemented in StandardTokenizer, bigram 
based tokens return MUCH better results.

According to my feedback on CJKTokenizer: 
for CJK users, the bigram based CJKTokenizer is strongly recommended for better 
results.

for more:
Word Discrimination Based on Bigram Co-occurrences
... There is a match routine that detects any common segment between the target word 
and each of the ... The entries
of the matrix indicate whether a reference word and a lexicon word share at least one 
n- gram ... It also
shows the bigram match list for an unknown word generated by the feature-matching 
process ... 
www.ecse.rpi.edu/homepages/nagy/PDF_files/ ElNasan-Nagy-ICDAR01.pdf 

Segmenting Chinese in Unicode
... However, to date no in-depth analysis has been performed analyzing the 
deficiencies in segmentation
that lead to the improved performance of the simpler bigram methods. ... The 
part-of-speech of the segment
and the ... A study on integrating Chinese word segmentation and part-of-speech 
tagging. ... 
www.basistech.com/papers/chinese/iuc-16-paper.pdf 

> 
> It has been brought up to bring in the SnowballAnalyzer - as it 
> actually is general purpose and spans many languages.  I'm not really 
> for bringing that one in either.
> 
> I'm but one voice and would not veto bringing in other analyzers, I 
> just don't think there is much benefit, especially if we improve the 
> release process to incorporate the sandbox goodies into a single 
> distribution but as separate JARs.
> 
> Erik
Thank you, Erik. I hope we can have more communication on this issue with other East Asian 
language users.

Che Dong

> 
> 
> 
> 

Re: searching using the CJKAnalyzer

2004-10-12 Thread Che Dong
CJKAnalyzer does not support single-byte streams; the front end interface and the 
backend indexing process need to transform the source into a double-byte 
character stream properly before searching/indexing.

Please let me know the output of
http://www.chedong.com/tech/HelloUnicode.java
compiled with javac -encoding gb2312 and with javac -encoding iso-8859-1.
Regards
Che Dong
Daan Hoogland wrote:
Jon Schuster wrote:

I didn't need to make any changes to Entities to get Japanese searches working. Are 
you using the CJKAnalyzer when you perform the search, not only when building the 
index?

Yes, I use CJKAnalyzer all around. When searching, I translate character
entities in order to find anything. When displaying search results, I don't
see anything that looks like part of an Eastern character set; instead I see
accented Latin and mathematical symbols.

By the way, when I don't pass entities, things get really nasty:
query passed: >Î??Âââ<
 char(Î, LATIN_1_SUPPLEMENT)  char(?, LATIN_1_SUPPLEMENT) token found : 
 >Î< length: 1
 char(?, LATIN_1_SUPPLEMENT)  char(Â, LATIN_1_SUPPLEMENT)  char(â, 
LATIN_1_SUPPLEMENT) token found : >Â< length: 1
 char(â, LATIN_1_SUPPLEMENT) searching contents:"Î Â"

This was a query for two japanese characters.

-Original Message-
From: Daan Hoogland [mailto:[EMAIL PROTECTED] 
Sent: Sunday, October 10, 2004 10:48 PM
To: Lucene Users List
Subject: Re: searching using the CJKAnalyzer
Importance: Low

Che Dong wrote:


Seems to be not an analyzer problem but an HTML parser charset detection
error. Could you show me the details of the problem?

Thanks Che,
I got it working by making decode() from the demo's Entities class public. I
wrote a scanner to translate any entities in the query.
I want to translate back to entities in the results, but I'm not sure what
the criteria should be. It seems to be just binary data.
How to conclude that Â0Å4?Â0â3ÂÂ?Â0â4 means ÃÃÃÂ?
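
(A hedged sketch of the kind of entity scanner described above, decoding
decimal numeric character references such as "&#26085;" before the query
reaches the analyzer; this is illustrative code, not the demo's Entities
class:)

public class NumericEntityDecoder {
    // Replace decimal numeric character references (&#NNNNN;) with the
    // characters they denote; everything else passes through unchanged.
    public static String decode(String s) {
        StringBuffer out = new StringBuffer();
        int i = 0;
        while (i < s.length()) {
            int start = s.indexOf("&#", i);
            int end = (start < 0) ? -1 : s.indexOf(';', start + 2);
            if (start < 0 || end < 0) {
                out.append(s.substring(i)); // no more references
                break;
            }
            out.append(s.substring(i, start));
            try {
                out.append((char) Integer.parseInt(s.substring(start + 2, end)));
                i = end + 1;
            } catch (NumberFormatException e) {
                out.append("&#"); // not a numeric reference; keep it literally
                i = start + 2;
            }
        }
        return out.toString();
    }
}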



Thanks
Che Dong
Daan Hoogland wrote:
  


LS,
in http://issues.apache.org/eyebrowse/ReadMsg?listId=30&msgNo=8980
Jon Schuster explains how to get a Japanese search system working. I
followed his advice and got an index that Luke shows as what I expected it
to be.
I don't know how to enter a search so that it gets passed to the engine
properly. It works in Luke but not in WebLucene or in my own app.




WebLucene 0.5 released: with a SAX based indexing sample Re: XML Indexing

2004-10-05 Thread Che Dong
http://sourceforge.net/projects/weblucene/
Regards
Che Dong
http://www.chedong.com/tech/weblucene.html
Sumathi wrote:
  Can anyone give me a demo for indexing XML files?
  With kind regards
  _

  Sumathi P
  Junior Consultant QA
  GFT Technologies , India 

  95 , Bharathidasan Salai
  Cantonment , Trichy-620001
  TamilNadu , India 

  T +91-431-2418 398
  F +91-431-2418 698
  [EMAIL PROTECTED]
  www.gft.com 




Re: BooleanQuery - Too Many Clauses on date range.

2004-10-05 Thread Che Dong
How about using an integer-based filter instead of a datetime-based filter?
A datetime can be converted to a Unix timestamp for comparison.
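
For illustration, a minimal sketch of this idea against the Lucene 1.4-era
Filter API, assuming dates are indexed as zero-padded Unix timestamps in a
keyword field (the field name, padding width, and class name are my own):

import java.io.IOException;
import java.util.BitSet;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.search.Filter;

public class TimestampRangeFilter extends Filter {
    private final String field;
    private final String lower, upper; // zero-padded timestamps

    public TimestampRangeFilter(String field, long lowerSecs, long upperSecs) {
        this.field = field;
        this.lower = pad(lowerSecs);
        this.upper = pad(upperSecs);
    }

    // Fixed-width padding makes lexicographic order match numeric order.
    static String pad(long t) {
        StringBuffer sb = new StringBuffer(Long.toString(t));
        while (sb.length() < 12) sb.insert(0, '0');
        return sb.toString();
    }

    public BitSet bits(IndexReader reader) throws IOException {
        BitSet bits = new BitSet(reader.maxDoc());
        TermEnum terms = reader.terms(new Term(field, lower));
        TermDocs docs = reader.termDocs();
        try {
            do { // walk terms from the lower bound until past the upper bound
                Term t = terms.term();
                if (t == null || !t.field().equals(field)
                        || t.text().compareTo(upper) > 0) break;
                docs.seek(t);
                while (docs.next()) bits.set(docs.doc());
            } while (terms.next());
        } finally {
            docs.close();
            terms.close();
        }
        return bits;
    }
}

Unlike a date range expanded into a BooleanQuery, this never hits the
TooManyClauses limit, because no clauses are created at all.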

Thanks
Che Dong
http://www.chedong.com/
Chris Fraschetti wrote:
Surely some folks out there have used Lucene on a large scale and have had
to compensate for this somehow; any other solutions? Morus, thank you very
much for your input. I am looking into your solution, just putting my
feelers out there once more.
The Lucene API is very limited in its descriptions of its components; short
of digging into the code, is there a good doc somewhere out there that
explains the workings of Lucene?
On Mon, 4 Oct 2004 01:57:06 -0700, Chris Fraschetti
<[EMAIL PROTECTED]> wrote:
So before I spend a significant amount of time digging into the Lucene code,
how does your experience with Lucene shed light on my situation? Our current
index is pretty huge, and with each increase in size I've experienced a
problem like this...
Without taking up too much of your time (because obviously this is my task),
I thought I'd ask whether you'd had any experience with this boolean clause
nonsense. Of course it can be overcome, but if you know a quick hack,
awesome; otherwise, no big deal, but off to work I go :)
-Fraschetti
-- Forwarded message --
From: Morus Walter <[EMAIL PROTECTED]>
Date: Mon, 4 Oct 2004 09:01:50 +0200
Subject: Re: BooleanQuery - Too Many Clauses on date range.
To: Lucene Users List <[EMAIL PROTECTED]>, Chris
Fraschetti <[EMAIL PROTECTED]>
Chris Fraschetti writes:
So I decided to move my epoch date to the 20040608 date, which fixed my
boolean query problem with regard to my current data size (approx 600,000).
But now, as soon as I do a query like ... a*,
I get the boolean error again. Google obviously can handle this query, and
I'm pretty sure Lucene can handle it... any ideas? With or without a date
range specified, I still get the TooManyClauses error.

I tried cranking maxClauseCount up to Integer.MAX_VALUE, but Java gave me an
out-of-memory error. Is this because the boolean search tried to allocate
that many clauses by default, or because my query actually needed that many
clauses?
A boolean search allocates clauses for all tokens having the prefix or
matching the wildcard expression.

Why does it work on small indexes but not
large?
Because there are fewer tokens starting with a.

Is there any way to have the parser create as many clauses as it can and
then search with what it has, without recompiling the source?
You need to create your own versions of WildcardQuery and PrefixQuery that
take a maximum term number and ignore further clauses.
And you need a variant of the query parser that uses these queries.
This can be done, even without recompiling Lucene, but you will have to do
some programming at the level of Lucene queries.
Shouldn't be hard, since you can use the sources as a starting point.
I guess this does not exist because the Lucene developers decided to prefer
a query error rather than incomplete results.
Morus
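
For illustration, a minimal sketch of such a capped prefix expansion against
the Lucene 1.4-era API (the class name, method name, and cap parameter are
my own; this is not an existing Lucene class):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class CappedPrefixExpander {
    // Expand a prefix into at most maxTerms optional TermQuery clauses,
    // silently dropping the rest instead of throwing TooManyClauses.
    // Keep maxTerms <= BooleanQuery.getMaxClauseCount().
    public static Query expand(IndexReader reader, Term prefix, int maxTerms)
            throws IOException {
        BooleanQuery query = new BooleanQuery();
        TermEnum terms = reader.terms(prefix);
        try {
            int count = 0;
            do {
                Term t = terms.term();
                if (t == null || !t.field().equals(prefix.field())
                        || !t.text().startsWith(prefix.text())) {
                    break; // past the last term with this prefix
                }
                if (count++ >= maxTerms) break; // cap reached: ignore the rest
                query.add(new TermQuery(t), false, false); // optional clause
            } while (terms.next());
        } finally {
            terms.close();
        }
        return query;
    }
}

A query-parser subclass could then build its prefix queries through this
helper instead of constructing a plain PrefixQuery.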
--
___
Chris Fraschetti, Student CompSci System Admin
University of San Francisco
e [EMAIL PROTECTED] | http://meteora.cs.usfca.edu




Re: WebLucene 0.5 released: with a SAX based indexing sample Re: XML Indexing

2004-10-06 Thread Che Dong
You can find an INSTALL.txt in the gzipped package and a sample XML data
source within the dump/ directory; run the command-line IndexRunner to build
the index.

Good luck
Che Dong

Sumathi wrote:
  Can you please tell me where I can find complete
documentation/tutorial help regarding using this API?
  - Original Message - 
  From: "Che Dong" <[EMAIL PROTECTED]>
  To: "Lucene Users List" <[EMAIL PROTECTED]>
  Sent: Tuesday, October 05, 2004 11:20 PM
  Subject: WebLucene 0.5 released: with a SAX based indexing sample Re: XML
Indexing

  > http://sourceforge.net/projects/weblucene/
  >
  > Regards
  >
  > Che Dong
  > http://www.chedong.com/tech/weblucene.html
  >
  > Sumathi wrote:
  > >   > >   Can anyone give me a demo for indexing XML files?


Re: WebLucene 0.5 released: with a SAX based indexing sample Re: XML Indexing

2004-10-09 Thread Che Dong
Sorry, it is not fully tested with Tomcat; maybe you can try Resin
(www.caucho.com) instead.

I'll document an English search demo later.
Thanks
Che Dong
Sumathi wrote:
  Hi,
  As of now, WebLucene is working from the command line as a standalone
application (I can both index and search), but when I try it as a web
application using the Tomcat server, I get a blank page :(. Can you please
tell me what the problem could be, and also the purpose of creating the
various XSLs?
  Expecting some help from you,
  Thanks in advance!
  - Original Message - 
  From: "Che Dong" <[EMAIL PROTECTED]>
  To: "Lucene Users List" <[EMAIL PROTECTED]>
  Sent: Wednesday, October 06, 2004 8:02 PM
  Subject: Re: WebLucene 0.5 released: with a SAX based indexing sample Re:
XML Indexing

  > You can find an INSTALL.txt in the gzipped package and a sample XML data
  > source within the dump/ directory; run the command-line IndexRunner to
  > build the index.
  >
  > Good luck
  >
  > Che Dong
  >
  >
  >
  > Sumathi wrote:
  > >   Can you please tell me where I can find complete
  > > documentation/tutorial help regarding using this API?
  > >
  > >   - Original Message - 
  > >   From: "Che Dong" <[EMAIL PROTECTED]>
  > >   To: "Lucene Users List" <[EMAIL PROTECTED]>
  > >   Sent: Tuesday, October 05, 2004 11:20 PM
  > >   Subject: WebLucene 0.5 released: with a SAX based indexing sample
Re: XML
  > > Indexing
  > >
  > >
  > >   > http://sourceforge.net/projects/weblucene/
  > >   >
  > >   > Regards
  > >   >
  > >   > Che Dong
  > >   > http://www.chedong.com/tech/weblucene.html
  > >   >
  > >   > Sumathi wrote:
  > >   > >   Can anyone give me a demo for indexing XML files?