RE: org.apache.lucene.search.highlight.Highlighter

2004-05-21 Thread Karthik N S
Hi

  Please can somebody give me a simple example of
  org.apache.lucene.search.highlight.Highlighter

  I am trying to use it but have been unsuccessful.


Karthik


-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Sent: Thursday, May 20, 2004 2:08 AM
To: [EMAIL PROTECTED]
Subject: Re: org.apache.lucene.search.highlight.Highlighter


Was investigating, found some compile-time errors...

I see the code you have is taken from the example in the Javadocs.
Unfortunately that example wasn't complete because the class didn't
include the method defined in the Formatter interface. I have updated the
Javadocs to correct this oversight.

To correct your problem, either make your class implement the Formatter
interface to perform your choice of custom formatting, or remove the
"this" parameter from your call to create a new Highlighter with the
default Formatter implementation.
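
For example, the default-Formatter route looks roughly like this (a sketch
against the current sandbox API; the field name, fragment count and
separator are placeholders, and query/hits are assumed to exist already):

import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;

// no Formatter argument, so the default implementation is used
// (it wraps matched terms in <B>...</B>)
Highlighter highlighter = new Highlighter(new QueryScorer(query));

String text = hits.doc(i).get("contents"); // the stored field text
TokenStream tokens =
    new StandardAnalyzer().tokenStream("contents", new StringReader(text));
String fragment = highlighter.getBestFragments(tokens, text, 3, "...");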

Thanks for highlighting the problem with the Javadocs...

Cheers
Mark




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Searching Microsoft Word , Excel and PPT files for Japanese

2004-05-21 Thread Chandan Tamrakar
For Microsoft Word and Excel documents, use the POI APIs from Jakarta
Apache. First you need to extract the text and convert it into a suitable
encoding before you put it into Lucene for indexing.
   It worked for me.
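
For Word, that goes roughly like this (a sketch using the POI "hdf"
scratchpad class mentioned later in this thread; the file name is a
placeholder and the constructor details may differ in your POI version):

import java.io.FileInputStream;
import java.io.StringWriter;
import org.apache.poi.hdf.extractor.WordDocument;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

// extract the raw text from the .doc file with POI
WordDocument wordDoc = new WordDocument(new FileInputStream("report.doc"));
StringWriter textOut = new StringWriter();
wordDoc.writeAllText(textOut);

// then index the extracted text like any other string
Document doc = new Document();
doc.add(Field.Text("contents", textOut.toString()));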


- Original Message - 
From: Ankur Goel [EMAIL PROTECTED]
To: 'Lucene Users List' [EMAIL PROTECTED]
Sent: Thursday, May 20, 2004 10:55 PM
Subject: Searching Microsoft Word , Excel and PPT files for Japanese


 Hi,

 I am using the CJK Tokenizer for searching Japanese documents. I am able
to search Japanese documents which are text files, but I am not able to
search Microsoft Word and Excel files with content in Japanese.

 Can you tell me how I can search on Japanese content for Microsoft Word,
Excel and PPT files?

 Thanks,
 Ankur

 -Original Message-
 From: Ankur Goel [mailto:[EMAIL PROTECTED]
 Sent: Sunday, April 04, 2004 1:36 AM
 To: 'Lucene Users List'
 Subject: RE: Boolean Phrase Query question

 Thanks Erik for the solution. I have the fileName field as I have to give
the end user the facility to search on file name also. That's why I am
using Text for fileName also.

 By using true on the finalQuery.add calls, you have said that both fields
must have the word "temp" in them.  Is that what you meant?  Or did you
mean an OR type of query?

 I need an OR type of query. I mean the word can be in the filename or in
the contents of the file. But I am not able to do this. Can you tell me
how to do it?

 Regards,
 Ankur

 -Original Message-
 From: Erik Hatcher [mailto:[EMAIL PROTECTED]
 Sent: Sunday, April 04, 2004 1:27 AM
 To: Lucene Users List
 Subject: Re: Boolean Phrase Query question

 On Apr 3, 2004, at 12:13 PM, Ankur Goel wrote:
 
  Hi,
  I have to provide a functionality which provides search on both file
  name and contents of the file.
 
  For indexing I use the following code:
 
 
  org.apache.lucene.document.Document doc =
      new org.apache.lucene.document.Document();
  doc.add(Field.Keyword("fileId", "" + document.getFileId()));
  doc.add(Field.Text("fileName", fileName));
  doc.add(Field.Text("contents", new FileReader(new File(fileName))));

 I'm not sure what you plan on doing with the fileName field, but you
 probably want to use a Keyword field for it.

 And you may want to glue the file name and contents together into a single
 field to facilitate searches to span both.  (be sure to put a space in
 between if you do this)
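
Something along these lines at index time, assuming the contents are
available as a String (the combined field name "all" is invented here):

  doc.add(Field.Text("all", fileName + " " + contentsText));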

  For searching a text, say "temp", I use the following code to look both
  in the file name and the contents of the file:
 
  BooleanQuery finalQuery = new BooleanQuery();
  Query titleQuery = QueryParser.parse("temp", "fileName", analyzer);
  Query mainQuery = QueryParser.parse("temp", "contents", analyzer);

  finalQuery.add(titleQuery, true, false);
  finalQuery.add(mainQuery, true, false);
 
  Hits hits = is.search(finalQuery);

 By using true on the finalQuery.add calls, you have said that both fields
must have the word "temp" in them.  Is that what you meant?  Or did you
mean an OR type of query?
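
For reference, the OR form: with the BooleanQuery.add(query, required,
prohibited) signature, passing false for both flags makes each clause
optional, so a match in either field is enough. A sketch:

  finalQuery.add(titleQuery, false, false); // optional clause
  finalQuery.add(mainQuery, false, false);  // optional clause
  Hits hits = is.search(finalQuery);        // matches either field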

 Erik







-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Searching Microsoft Word , Excel and PPT files for Japanese

2004-05-21 Thread Ankur Goel
Thanks Chandan. I tried using POI for text extraction. I used the
WordDocument.writeAllText method but it didn't work for Japanese.
Is there any other way of extracting the Japanese text?
Regards,
Ankur 

-Original Message-
From: Chandan Tamrakar [mailto:[EMAIL PROTECTED] 
Sent: Friday, May 21, 2004 3:51 PM
To: Lucene Users List; [EMAIL PROTECTED]
Subject: Re: Searching Microsoft Word , Excel and PPT files for Japanese

For Microsoft Word and Excel documents, use the POI APIs from Jakarta
Apache. First you need to extract the text and convert it into a suitable
encoding before you put it into Lucene for indexing.
   It worked for me.




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: documentation fix for website

2004-05-21 Thread Otis Gospodnetic
Thanks for catching this.  I fixed it, and the change should show up on
the site with the next Lucene release.

Otis

--- Ryan Sonnek [EMAIL PROTECTED] wrote:
 Is this the right place to submit a problem with the website
 documentation? 
 http://jakarta.apache.org/lucene/docs/systemproperties.html lists
 mergeFactor twice with different property names.  The second
 occurrence should be updated to lockDir (the underlying href link is
 correct).
 
 Ryan
 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



AW: Problem indexing Spanish Characters

2004-05-21 Thread PEP AD Server Administrator
Hi all,
Martin was right. I just adapted the HTML demo as Wallen recommended and it
worked. Now I only have to deal with some crazy documents which are UTF-8
encoded but mixed with entities.
Does anyone know a class which can translate entities into UTF-8 or any
other encoding?

Peter MH

-Ursprüngliche Nachricht-
Von: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]

Here is an example method in org.apache.lucene.demo.html.HTMLParser that
uses a different buffered reader for a different encoding.

public Reader getReader() throws IOException
{
    if (pipeIn == null)
    {
        pipeInStream = new MyPipedInputStream();
        pipeOutStream = new PipedOutputStream(pipeInStream);
        pipeIn = new InputStreamReader(pipeInStream);
        pipeOut = new OutputStreamWriter(pipeOutStream);
        // check the first 4 bytes for the FFFE marker; if it's there
        // we know it's UTF-16 encoding
        if (useUTF16)
        {
            try
            {
                pipeIn = new BufferedReader(
                    new InputStreamReader(pipeInStream, "UTF-16"));
            }
            catch (Exception e)
            {
            }
        }
        Thread thread = new ParserThread(this);
        thread.start(); // start parsing
    }
    return pipeIn;
}

-Original Message-
From: Martin Remy [mailto:[EMAIL PROTECTED]

The tokenizers deal with unicode characters (CharStream, char), so the
problem is not there.  This problem must be solved at the point where the
bytes from your source files are turned into CharSequences/Strings, i.e. by
connecting an InputStreamReader to your FileReader (or whatever you're
using) and specifying UTF-8 (or whatever encoding is appropriate) in the
InputStreamReader constructor.  

You must either detect the encoding from HTTP headers or XML declarations
or, if you know that it's the same for all of your source files, then just
hardcode UTF-8, for example.
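
Concretely, that is a one-line change at the point where the file is
opened (the file name, charset and doc variable are placeholders):

import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.io.Reader;

// decode the file's bytes as UTF-8 rather than the platform default
Reader reader = new InputStreamReader(new FileInputStream("page.html"), "UTF-8");
doc.add(Field.Text("contents", reader));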

Martin

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



org.apache.lucene.search.highlight.Highlighter

2004-05-21 Thread Karthik N S



Hi
Please can somebody give me a simple example of
org.apache.lucene.search.highlight.Highlighter
I am trying to use it but have been unsuccessful.

Karthik


StandardTokenizer and e-mail

2004-05-21 Thread Albert Vila
Hi all,
I want to achieve the following: when indexing '[EMAIL PROTECTED]',
I want to index the '[EMAIL PROTECTED]' token, then the 'xyz' token, the
'company' token and the 'com' token.
This way, you'll be able to find the document searching for
'[EMAIL PROTECTED]', for 'xyz' only, or for 'company' only.

How can I achieve that? Do I need to write my own tokenizer?
Thanks
Albert
--
Albert Vila
Director de proyectos I+D
http://www.imente.com
902 933 242
[iMente “La información con más beneficios”]
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Query parser and minus signs

2004-05-21 Thread alex . bourne




Hi All,

I'm using Lucene on a site that has split content, with a branch containing
pages in English and a separate branch in Chinese.  Some of the Chinese
pages include some (untranslatable) English words, so when a search is
carried out in either language you can get pages from the wrong branch. To
combat this we introduced a language field into the index which contains
the standard language codes: en-UK and zh-HK.

When you parse a query, e.g. language:en\-UK, you could reasonably expect
the search to recover all pages with the language field set to en-UK (the
minus symbol should be escaped by the backslash according to the FAQ).
Unfortunately the parser seems to return "en UK" as the parsed query and
hence returns no documents.

Has anyone else had this problem, or could you suggest a workaround? I
have yet to find a solution in the mailing list archives or elsewhere.

Many thanks in advance,

Alex Bourne



_

This transmission has been issued by a member of the HSBC Group 
(HSBC) for the information of the addressee only and should not be 
reproduced and / or distributed to any other person. Each page 
attached hereto must be read in conjunction with any disclaimer which 
forms part of it. This transmission is neither an offer nor the solicitation 
of an offer to sell or purchase any investment. Its contents are based 
on information obtained from sources believed to be reliable but HSBC 
makes no representation and accepts no responsibility or liability as to 
its completeness or accuracy.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Query parser and minus signs

2004-05-21 Thread Ryan Sonnek
If you're dealing with locales, why not use Java's built-in locale syntax
(e.g. en_UK, zh_HK)?

 -Original Message-
 From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
 Sent: Friday, May 21, 2004 10:36 AM
 To: [EMAIL PROTECTED]
 Subject: Query parser and minus signs

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Memo: RE: Query parser and minus signs

2004-05-21 Thread alex . bourne




Hmm, we may have to if there is no workaround. We're not using Java
locales, but we're trying to stick to the ISO standard, which uses hyphens.




Ryan Sonnek [EMAIL PROTECTED] on 21 May 2004 16:38

Please respond to Lucene Users List [EMAIL PROTECTED]

To:Lucene Users List [EMAIL PROTECTED]
cc:
bcc:

Subject:RE: Query parser and minus signs


If you're dealing with locales, why not use Java's built-in locale syntax
(e.g. en_UK, zh_HK)?


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: StandardTokenizer and e-mail

2004-05-21 Thread Otis Gospodnetic
Si, si.
Write your own TokenFilter subclass that overrides next(), extracts
those other elements/tokens from an email address token, and uses
Token's setPositionIncrement(0) to store the extracted tokens in the
same position as the original email.
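
A sketch of such a filter (class name and details invented here; written
against the 1.x analysis API, so double-check signatures in your version):

import java.io.IOException;
import java.util.LinkedList;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class EmailSplitFilter extends TokenFilter {
    private final LinkedList pending = new LinkedList();

    public EmailSplitFilter(TokenStream input) {
        super(input);
    }

    public Token next() throws IOException {
        if (!pending.isEmpty()) {
            return (Token) pending.removeFirst();
        }
        Token t = input.next();
        if (t == null) {
            return null;
        }
        String text = t.termText();
        int at = text.indexOf('@');
        if (at > 0) {
            // queue 'xyz', 'company' and 'com' at the same position
            // as the full address via setPositionIncrement(0)
            queuePart(text.substring(0, at), t);
            String[] domainParts = text.substring(at + 1).split("\\.");
            for (int i = 0; i < domainParts.length; i++) {
                queuePart(domainParts[i], t);
            }
        }
        return t; // the full address is emitted first
    }

    private void queuePart(String part, Token orig) {
        Token p = new Token(part, orig.startOffset(), orig.endOffset());
        p.setPositionIncrement(0);
        pending.add(p);
    }
}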

Otis

--- Albert Vila [EMAIL PROTECTED] wrote:


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Query parser and minus signs

2004-05-21 Thread Peter M Cipollone

- Original Message - 
From: [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Friday, May 21, 2004 11:36 AM
Subject: Query parser and minus signs



 When you parse a query, e.g. language:en\-UK, you could reasonably expect
 the search to recover all pages with the language field set to en-UK (the
 minus symbol should be escaped by the backslash according to the FAQ).
 Unfortunately the parser seems to return "en UK" as the parsed query and
 hence returns no documents.

 Has anyone else had this problem, or could you suggest a workaround? I
 have yet to find a solution in the mailing list archives or elsewhere.

Index the standard language code as a

new Field(fieldName, code, false, true, false)

This will bypass the Analyzer at indexing time, since tokenization is set to
false.  Then when you create your queries, add a

new TermQuery(new Term(fieldName, desiredLanguageCode))

to the user query object.  This will bypass the Analyzer at query time and
give you the desired result.
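
Put together, something like this (the field name and the userQuery and
doc variables are illustrative):

import org.apache.lucene.document.Field;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

// index time: index the code verbatim (store=false, index=true, token=false)
doc.add(new Field("language", "en-UK", false, true, false));

// query time: require the untokenized language term alongside the user query
BooleanQuery combined = new BooleanQuery();
combined.add(userQuery, true, false);
combined.add(new TermQuery(new Term("language", "en-UK")), true, false);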







-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



now maybe Mozilla/IMAP URLs - Re: StandardTokenizer and e-mail

2004-05-21 Thread David Spencer
This reminds me - if you have a search engine that indexes a mail store
and you present results in a web page to a browser, you want to (of
course... well I think this is obvious) send back a URL that would cause
the user's native mail client to pull up the message.
IMAP has a URL format, and I use Mozilla on Windows to browse and read
mail; however, when I've presented IMAP URLs on a results page the IMAP
URL doesn't work - either nothing happens or the cursor changes to busy
but still no mail comes up. Has anyone come across this? This may be
more appropriate for a moz list but it's definitely a search issue.

This page mentions the problem:
http://www.mozilla.org/projects/security/known-vulnerabilities.html
A writeup on an IMAP indexer I did a while ago:
http://www.tropo.com/techno/java/lucene/imap.html

Albert Vila wrote:

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


RE: Memo: RE: Query parser and minus signs

2004-05-21 Thread David Townsend
Doesn't "en UK" as a phrase query work?

You're probably indexing it as a text field, so it's being tokenised.
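
Roughly, as a sketch (the default field and analyzer here are
placeholders):

Query q = QueryParser.parse("language:\"en UK\"", "contents", analyzer);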

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Sent: 21 May 2004 16:47
To: Lucene Users List
Subject: Memo: RE: Query parser and minus signs



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



asktog on search problems

2004-05-21 Thread David Spencer
Haven't seen this discussed here.
See 7a at the link below:
http://www.asktog.com/columns/062top10ReasonsToNotShop.html
7a talks about searching on a camera site for the Lowepro 100 AW.
He says this query works: "Lowepro 100 AW",
and this query does not work: "Lowepro 100AW".
Cross-checking with Google indeed shows that the 1st form is much more
popular; however, the 2nd form is used, and if you're a commerce site or
a site that wants to make it easier for users to find things you should
help them out.

So the discussion question is: what's the best way to handle this?
I guess the somewhat general form of this is that in a query, any term
might be split into 2 terms that are individually indexed (so "100AW" is
not indexed, but "100" and "AW" are).
In a way the flip side of this is that any 2 terms could be concatenated
to form another term that was indexed (so in another universe it might
be that passing "100 AW" is not as precise as passing "100AW" - but how's
the user to know).

In the context of Lucene, ways to handle this seem to be:
- automagically run a fuzzy query (so if a query doesn't work, transform
"Lowepro 100AW" to "Lowepro~ 100AW~")
- write a query parser that breaks apart unindexed tokens into ones that
are indexed (so "100AW" becomes "100 AW")
- write a tokenizer that inserts dummy tokens for every pair of tokens,
so the stream "Lowepro 100 AW" would also have "Lowepro100" and "100AW"
inserted, presumably via magic w/ TokenStream.next() (sketched below)

Comments on best way to handle this?
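
One rough sketch of the third option (class name invented; 1.x analysis
API; it buffers one token of lookahead and emits the concatenated pair at
the same position as the second token):

import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class PairConcatFilter extends TokenFilter {
    private Token prev;        // previous real token
    private Token pendingPair; // concatenated pair waiting to be emitted

    public PairConcatFilter(TokenStream input) {
        super(input);
    }

    public Token next() throws IOException {
        if (pendingPair != null) {
            Token p = pendingPair;
            pendingPair = null;
            return p;
        }
        Token t = input.next();
        if (t == null) {
            prev = null;
            return null;
        }
        if (prev != null) {
            // "100" followed by "AW" also yields "100AW" at this position
            pendingPair = new Token(prev.termText() + t.termText(),
                                    prev.startOffset(), t.endOffset());
            pendingPair.setPositionIncrement(0);
        }
        prev = t;
        return t;
    }
}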



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: org.apache.lucene.search.highlight.Highlighter

2004-05-21 Thread Claude Devarenne
Hi,

Here is the documentation Mark Harwood included in the original package.  I followed his directions and it worked for me.  Let me know if this doesn't do it for you.

Claude



On May 21, 2004, at 4:29 AM, Karthik N S wrote:

Hi

 Please can somebody give me a simple example of
 org.apache.lucene.search.highlight.Highlighter

 I am trying to use it but have been unsuccessful.

 

Karthik
WITH WARM REGARDS 
HAVE A NICE DAY 
[ N.S.KARTHIK] 

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: org.apache.lucene.search.highlight.Highlighter

2004-05-21 Thread Claude Devarenne
Arrgh, the attachment didn't make it. Here it goes, sorry:
// perform a standard Lucene query
searcher = new IndexSearcher(ramDir);
Analyzer analyzer = new StandardAnalyzer();
Query query = QueryParser.parse("Kenne*", FIELD_NAME, analyzer);
query = query.rewrite(reader); // necessary to expand search terms
Hits hits = searcher.search(query);

// create an instance of the highlighter with the tags used to
// surround highlighted text
QueryHighlightExtractor highlighter =
    new QueryHighlightExtractor(query, new StandardAnalyzer(), "<b>", "</b>");

for (int i = 0; i < hits.length(); i++)
{
    String text = hits.doc(i).get(FIELD_NAME);
    // call to highlight text with chosen tags
    String highlightedText = highlighter.highlightText(text);
    System.out.println(highlightedText);
}

If your documents are large you can select only the best fragments from
each document like this:

// ...as above example
int highlightFragmentSizeInBytes = 80;
int maxNumFragmentsRequired = 4;
String fragmentSeparator = "...";
for (int i = 0; i < hits.length(); i++)
{
    String text = hits.doc(i).get(FIELD_NAME);
    String highlightedText = highlighter.getBestFragments(text,
        highlightFragmentSizeInBytes, maxNumFragmentsRequired, fragmentSeparator);
    System.out.println(highlightedText);
}

On May 21, 2004, at 9:22 AM, Claude Devarenne wrote:
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: org.apache.lucene.search.highlight.Highlighter

2004-05-21 Thread markharw00d
Hi Claude, that example code you provided is out of date.

For all concerned - the highlighter code was refactored about a month ago and then 
moved into the Sandbox.

Want the latest version? - get the latest code from the sandbox CVS.
Want the latest docs? - Run javadoc on the above.

There is a basic example of highlighter use in the package-level javadocs and more 
extensive examples 
in the JUnit test that accompanies the source code.

Hope this helps clarify things.

Mark

PS Bruce, I know you were interested in providing an alternative Fragmenter
implementation 
for the highlighter that detects sentence boundaries.
You may want to look at LingPipe which has a heuristic sentence boundary detector.
( http://threattracker.com:8080/lingpipe-demo/demo.html )
I took a quick look at it but it has its own tokenizer that would be difficult to make 
work with 
the tokenstream used to identify query terms. At least the code gives some examples of 
the
heuristics involved in detecting sentence boundaries. For my own apps I find the 
standard Fragmenter
implementation suffices.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: asktog on search problems

2004-05-21 Thread Jeff Wong
I don't think the first solution will work, because the "100AW~" term must
match either "100" or "AW", which are your index terms.

Coincidentally,  I have been trying to deal with this very problem over
the past few days.  

In my situation, I'm trying to help users find things when the spacing of
their queries doesn't match the spacing in an indexed term.  Possible
errors can be divided into 2 classes.

1) User leaves out a space where there ought to be one.  Let's say the
user is trying to find "blue bird" but types in the query "bluebird",
thinking it is a single word.  Lucene won't catch this because "blue" and
"bird" are stored as single index tokens.

2) User errantly inserts a space where there shouldn't be one.  An example
would be an index where the word "blackbird" is stored but the user types
in "black bird" as a query.

What I tried to do was create an alternate tokenizer which stored the
entire string in the index in a different field, and perform a fuzzy search
on the entire string.  This is possible because I am only doing searches
on strings of less than 40 characters on average.  To take the "black
bird" example, I would store the entire string into a field which doesn't
tokenize on word boundaries.  The query, in turn, would look something
like this:

+title:black +title:bird OR fulltitle:"black bird"~

Where the tilde applies to the entire "black bird" term.  When I tested it,
it appeared to work, but was really slow for large indexes.  At about
4 entries, this query started to take 1 or 2 seconds, which was worse
than my performance requirement.

Actually, I also thought of the last 2 things you suggested and I was
about to try them out.  However, you do need to apply both of them.
Adding additional concatenated index terms addresses the problem where
users leave out spaces.  Breaking terms apart helps users match terms
in your index when they inject spaces incorrectly.

This may balloon the memory consumption of your Lucene index.  However,
you can use heuristics to avoid inserting extra terms which won't match
likely errors.  For example, you could decide that you only want to
concatenate terms that are parts of model numbers.  Or, if you are dealing
with compound words, you can choose to only concatenate terms which are
English words.  For example,  in my situation, concatenating blue bird
as an extra term is useful while doing the same with  Roy Orbison is
not since people aren't likely to neglect the space in that situation.

Hope this helps.

Jeff


On Fri, 21 May 2004, David Spencer wrote:

 In the context of Lucene, ways to handle this seem to be:
 - automagically run a fuzzy query (so if a query doesn't work, transform
 "Lowepro 100AW" to "Lowepro~ 100AW~")
 - write a query parser that breaks apart unindexed tokens into ones that
 are indexed (so "100AW" becomes "100 AW")
 - write a tokenizer that inserts dummy tokens for every pair of tokens,
 so the stream "Lowepro 100 AW" would also have "Lowepro100" and "100AW"
 inserted, presumably via magic w/ TokenStream.next()


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: StandardTokenizer and e-mail

2004-05-21 Thread Erik Hatcher
Further on this...
If you are using StandardTokenizer, the token for an e-mail address has
the type value of "<EMAIL>", which you could use to pick up
specifically in a custom TokenFilter implementation and split it how 
you like, passing through everything else.  Take a look at 
StandardFilter's source code for an example of keying off the types 
emitted by StandardTokenizer.
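
For instance, inside a filter like the one sketched earlier in this
thread, the dispatch might look like this (whether the type constant is
exactly "<EMAIL>" depends on your StandardTokenizer version):

Token t = input.next();
if (t != null && "<EMAIL>".equals(t.type())) {
    // split the address into its parts, as Otis describes
}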

Erik
On May 21, 2004, at 11:50 AM, Otis Gospodnetic wrote:

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: asktog on search problems

2004-05-21 Thread Erik Hatcher
This is not specific advice, but an idea that I think Google leverages 
to build up search corrections.  If a user searches for "100AW" and it
doesn't match, but a moment later they try something different and 
immediately get to a product page, the system can make a loose 
connection between their original search and the product they soon 
thereafter found.  Over time, the connections get stronger because 
others will do the same thing.

I think term vectors could factor into making latent connections 
somehow also.

Just postulating...
Erik

On May 21, 2004, at 12:09 PM, David Spencer wrote:

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Rebuild after corruption

2004-05-21 Thread Steve Rajavuori
I have a problem periodically where the process updating my Lucene files
terminates abnormally. When I try to open the Lucene files afterward I get
an exception indicating that files are missing. Does anyone know how I can
recover at this point, without having to rebuild the whole index from
scratch?


RE: Rebuild after corruption

2004-05-21 Thread wallen
Make sure you close your IndexWriter:

http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexWriter.html#close()
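
In code, the usual pattern is to guard the update so close() always runs
(the index path here is a placeholder):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), false);
try {
    // add or update documents here
} finally {
    writer.close(); // flushes pending documents and releases the write lock
}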

-Original Message-
From: Steve Rajavuori [mailto:[EMAIL PROTECTED]
Sent: Friday, May 21, 2004 7:49 PM
To: '[EMAIL PROTECTED]'
Subject: Rebuild after corruption



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]