date:20050203

Re: Synonyms Not Showing In The Index

2005-02-03 Thread Andrzej Bialecki

Luke Shannon wrote:
> Hello;
> It seems my Synonym analyzer is working (based on some successful queries).
> But I can't see the synonyms in the index using Luke. Is this correct?

Did you use the combined JAR to run? It contains an oldish version of
Lucene... Other than that, I'm not sure - if you can't find the reason
you could send me a small test index...

--
Best regards,
Andrzej Bialecki
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: which HTML parser is better?

2005-02-03 Thread Karl Koch

Hello Sergiu,

thank you for your help so far. I appreciate it.

I am working with Java 1.1 which does not include regular expressions.

Your turn ;-)
Karl 

> Karl Koch wrote:
>
>> I am in control of the HTML, which means it is well-formatted HTML. I use
>> only HTML files which I have transformed from XML. No external HTML (e.g.
>> the web).
>>
>> Are there any very short solutions for that?
>
> If you are using only correctly formatted HTML pages and you are in control
> of these pages, you can use a regular expression to remove the tags.
>
> Something like
> replaceAll("<[^>]*>", "");
>
> This is the idea behind the operation. If you search on Google you
> will find a more robust regular expression.
>
> Using a simple regular expression will be a very cheap solution that
> can cause you a lot of problems in the future.
>
> It's up to you to use it.
>
> Best,
> Sergiu
>
>> Karl
>>
>>> Karl Koch wrote:
>>>
>>>> Hi,
>>>>
>>>> yes, but the library you are using is quite big. I was thinking that a
>>>> 5 kB piece of code could actually do that. That SourceForge project is
>>>> doing much more than that, but I do not need it.
>>>
>>> You need just the htmlparser.jar (200k).
>>> ... you know ... the functionality is strongly correlated with the size.
>>>
>>> You can use 3 lines of code with a good regular expression to eliminate
>>> the HTML tags, but this won't give you any guarantee that the text from
>>> badly formatted HTML files will be correctly extracted...
>>>
>>> Best,
>>> Sergiu
>>>
>>>> Karl
>>>>
>>>>> Hi Karl,
>>>>>
>>>>> I already submitted a piece of code that removes the HTML tags.
>>>>> Search for my previous answer in this thread.
>>>>>
>>>>> Best,
>>>>> Sergiu
>>>>>
>>>>> Karl Koch wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I have been following this thread and have another question.
>>>>>>
>>>>>> Is there a piece of source code (preferably very short and simple
>>>>>> (KISS)) which allows one to remove all HTML tags from HTML content?
>>>>>> HTML 3.2 would be enough... also no frames, CSS, etc.
>>>>>>
>>>>>> I do not need the HTML structure tree or any other structure,
>>>>>> but need a facility to clean up HTML into its normal underlying
>>>>>> content before indexing that content as a whole.
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>>> I think that depends on what you want to do. The Lucene demo parser
>>>>>>> does simple mapping of HTML files into Lucene Documents; it does not
>>>>>>> give you a parse tree for the HTML doc. CyberNeko is an extension of
>>>>>>> Xerces (uses the same API; will likely become part of Xerces), and so
>>>>>>> maps an HTML document into a full DOM that you can manipulate easily
>>>>>>> for a wide range of purposes. I haven't used JTidy at an API level
>>>>>>> and so don't know it as well -- based on its UI, it appears to be
>>>>>>> focused primarily on HTML validation and error detection/correction.
>>>>>>>
>>>>>>> I use CyberNeko for a range of operations on HTML documents that go
>>>>>>> beyond indexing them in Lucene, and really like it. It has been
>>>>>>> robust for me so far.
>>>>>>>
>>>>>>> Chuck
>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Jingkang Zhang [mailto:[EMAIL PROTECTED]
>>>>>>>> Sent: Tuesday, February 01, 2005 1:15 AM
>>>>>>>> To: lucene-user@jakarta.apache.org
>>>>>>>> Subject: which HTML parser is better?
>>>>>>>>
>>>>>>>> Three HTML parsers (Lucene web application demo, CyberNeko HTML
>>>>>>>> Parser, JTidy) are mentioned in Lucene FAQ 1.3.27. Which is the
>>>>>>>> best? Can it filter tags that are auto-created by the MS Word
>>>>>>>> 'Save As HTML files' function?

Re: which HTML parser is better?

2005-02-03 Thread Karl Koch

Unfortunately I am faithful ;-). For practical reasons I want to do that
in a single class, or even a single method, called by another part of my Java
application. It should also run on Java 1.1 and it should be small and
simple. As I said before, I am in control of the HTML and it will be
well-formatted, because I generate it from XML using XSLT.

Karl

> If you are not married to Java:
> http://search.cpan.org/~kilinrax/HTML-Strip-1.04/Strip.pm
>
> Otis
 

Re: which HTML parser is better?

2005-02-03 Thread sergiu gordea

Karl Koch wrote:
> Hello Sergiu,
>
> thank you for your help so far. I appreciate it.
>
> I am working with Java 1.1 which does not include regular expressions.

Why are you using Java 1.1? Are you that limited in resources?
What operating system do you use?

I assume that you just need to index the HTML files, and so you need an
html2txt conversion. If an external converter is an option for you, you can
use Runtime.getRuntime().exec(...) to run a converter that extracts the
text from your HTML files and generates a .txt file. Then you can use a
reader to index the .txt file.

As I told you before, the best solution depends on your constraints
(time, effort, hardware, performance) and requirements :)

Best,
Sergiu

> Your turn ;-)
> Karl
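
The external-converter suggestion could be sketched roughly as follows. This is only an illustration: the converter name "html2text" and its "-o" option are placeholders (assumptions, not something named in this thread), and whether any such program exists on the target device is an open question. Runtime.exec(String[]) itself has been part of Java since 1.0.

```java
import java.io.IOException;

public class ExternalHtmlToText {

    /**
     * Builds the command line for a hypothetical external converter.
     * The "-o <output> <input>" argument layout is an assumption.
     */
    public static String[] buildCommand(String converter, String htmlFile, String txtFile) {
        return new String[] { converter, "-o", txtFile, htmlFile };
    }

    /**
     * Starts the command and waits for it to finish; returns its exit code.
     * A converter that prints a lot to stdout/stderr would also need its
     * output streams drained, which this sketch leaves out.
     */
    public static int run(String[] command) throws IOException, InterruptedException {
        Process p = Runtime.getRuntime().exec(command);
        return p.waitFor();
    }

    public static void main(String[] args) throws Exception {
        String[] cmd = buildCommand("html2text", "page.html", "page.txt");
        System.out.println("converter exit code: " + run(cmd));
    }
}
```

The resulting page.txt can then be opened with an ordinary FileReader and handed to the indexer.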

 


Re: Subversion conversion

2005-02-03 Thread Miles Barr

On Wed, 2005-02-02 at 22:11 -0500, Erik Hatcher wrote:
> I've seen both of these types of procedures followed on Apache
> projects.  It really just depends.  Lucene's codebase is not being
> modified frequently, so it is not necessary to branch and merge back.
> Rather we simply develop off of the trunk (HEAD) and when we're ready
> for a release we'll just do it from the trunk.  Actually we'd most
> likely tag and build from that tag just to be clean about it.

What consequences does this have for the 1.9/2.0 releases? I.e., after
2.0 the deprecated API will be removed; does this mean 1.x will no
longer be supported after 2.0?

The typical scenario is that a bug is found that affects both 1.x and 2.x;
it's patched in 2.x (i.e. the trunk) but we can't patch the last 1.x
release. The other scenario is that a bug is found in the 1.x code, but the
fix cannot be applied.


-- 
Miles Barr [EMAIL PROTECTED]
Runtime Collective Ltd.


Re: which HTML parser is better?

2005-02-03 Thread Karl Koch

I apologise in advance if some of what I write here has been said before.
The last three answers to my question have suggested pattern matching
solutions and Swing. Pattern matching was introduced in Java 1.4, and Swing
is something I cannot use since I work with Java 1.1 on a PDA.

I am wondering if somebody knows a piece of simple source code with low
requirements which runs under these tight constraints.

Thank you all,
Karl

> No one has yet mentioned using ParserDelegator and ParserCallback that
> are part of HTMLEditorKit in Swing.  I have been successfully using
> these classes to parse out the text of an HTML file.  You just need to
> extend HTMLEditorKit.ParserCallback and override the various methods
> that are called when different tags are encountered.
>
> On Feb 1, 2005, at 3:14 AM, Jingkang Zhang wrote:
>
>> Three HTML parsers (Lucene web application demo, CyberNeko HTML
>> Parser, JTidy) are mentioned in Lucene FAQ 1.3.27. Which is the
>> best? Can it filter tags that are auto-created by the MS Word
>> 'Save As HTML files' function?
>
> --
> Bill Tschumy
> Otherwise -- Austin, TX
> http://www.otherwise.com
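
For readers who do have a full J2SE runtime with Swing (so not the Java 1.1 PDA case discussed in this message), the ParserCallback approach described in the quoted text might look roughly like this. The class name SwingTextExtractor and the choice to separate text runs with a single space are illustrative assumptions; only handleText is overridden, so tags are simply ignored.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class SwingTextExtractor extends HTMLEditorKit.ParserCallback {
    private final StringBuffer text = new StringBuffer();

    // Called by the parser for each run of character data between tags.
    public void handleText(char[] data, int pos) {
        text.append(data).append(' ');
    }

    /** Parses the HTML from the reader and returns the collected text. */
    public static String extract(Reader html) throws IOException {
        SwingTextExtractor callback = new SwingTextExtractor();
        // true = ignore any charset declared inside the document
        new ParserDelegator().parse(html, callback, true);
        return callback.text.toString().trim();
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><body><h1>Title</h1><p>Some <b>bold</b> text.</p></body></html>";
        System.out.println(extract(new StringReader(html)));
    }
}
```

Other callbacks such as handleStartTag or handleSimpleTag can be overridden in the same way if tag-specific handling is needed.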
 


Re: which HTML parser is better?

2005-02-03 Thread sergiu gordea

Karl Koch wrote:
> Unfortunately I am faithful ;-). For practical reasons I want to do that
> in a single class, or even a single method, called by another part of my Java
> application. It should also run on Java 1.1 and it should be small and
> simple. As I said before, I am in control of the HTML and it will be
> well-formatted, because I generate it from XML using XSLT.

Why don't you get the data directly from the XML files?
You can use a SAX parser, ... but I think it will require Java 1.3 or at
least 1.2.2.

Best,
Sergiu

> Karl
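
A minimal sketch of the SAX idea, assuming a runtime where the JAXP classes (javax.xml.parsers, org.xml.sax) are available, which rules out plain Java 1.1. The class name XmlTextExtractor is made up for illustration. It collects all character data from the XML source and ignores the element structure entirely.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class XmlTextExtractor extends DefaultHandler {
    private final StringBuffer text = new StringBuffer();

    // SAX may deliver one text node in several chunks; append them all.
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    /** Returns the concatenated character data of the XML document. */
    public static String extract(String xml) throws Exception {
        XmlTextExtractor handler = new XmlTextExtractor();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.text.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(extract("<doc><title>Hello</title> <body>world</body></doc>"));
    }
}
```

Because SAX is event-driven, nothing like a DOM tree is built in memory, which fits the "no tag trees" requirement stated elsewhere in the thread.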
 


Re: which HTML parser is better?

2005-02-03 Thread Karl Koch

I am using Java 1.1 on a Sharp Zaurus PDA. I have very limited memory.
I do not think CPU performance is a big issue, though. But I have other
parts in my application which use quite a lot of memory, and I sometimes
run short. I am therefore not looking into solutions which build up
tag trees etc., but rather for a solution which reads a stream of HTML and
transforms it into a stream of text.

I see your point about using an external program. I am, however, not
entirely sure whether one is available. Also, it would be much simpler to
have a 3-5 kB solution in Java, perhaps encapsulated in a class, which does
the job without the need for advanced libraries which need 100-200 KB of my
internal storage.

I hope I have clarified my situation now.

Cheers,
Karl


Re: which HTML parser is better?

2005-02-03 Thread sergiu gordea

Karl Koch wrote:
> I apologise in advance if some of what I write here has been said before.
> The last three answers to my question have suggested pattern matching
> solutions and Swing. Pattern matching was introduced in Java 1.4, and Swing
> is something I cannot use since I work with Java 1.1 on a PDA.

I see.

In this case you can read your HTML file line by line and then write
something like this:

    String line;
    int startPos, endPos;
    StringBuffer text = new StringBuffer();
    while ((line = reader.readLine()) != null) {
        // the text starts after the first '>' and ends at the next '<'
        startPos = line.indexOf('>');
        endPos = line.indexOf('<', startPos + 1);
        if (startPos >= 0 && endPos > startPos)
            text.append(line.substring(startPos + 1, endPos));
    }

This is just sample code that should work if you have just one tagged
element (opening tag, text, closing tag) per line in the HTML file.
It can be a starting point for you.

Hope it helps,

Best,
Sergiu

> I am wondering if somebody knows a piece of simple source code with low
> requirements which runs under these tight constraints.
>
> Thank you all,
> Karl
 


Re: which HTML parser is better?

2005-02-03 Thread Dawid Weiss

Karl,

Two things; try experimenting with both:

1) I would try to write a lexical scanner that strips HTML tags, much
like the regular expression does. Java lexical scanner packages produce
nice pure-Java classes that seldom use any advanced API, so they should
work on Java 1.1. They are simple state machines with states encoded as
integers -- this should work like a charm, and be fast and small.

2) Write a parser yourself. Having a regular expression, it isn't that
difficult to do... :)
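
To make the state-machine idea concrete, here is a hand-written sketch (not generated by any scanner package) that uses only java.io Reader and Writer, both available since Java 1.1. The class name TagStripper and the two integer states are illustrative assumptions. It handles only plain tags: comments, entities such as &amp;, and a '>' inside an attribute value are not treated specially.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;

public class TagStripper {
    private static final int TEXT = 0; // outside any tag
    private static final int TAG = 1;  // between '<' and '>'

    /** Copies everything outside <...> from in to out, one char at a time. */
    public static void strip(Reader in, Writer out) throws IOException {
        int state = TEXT;
        int c;
        while ((c = in.read()) != -1) {
            if (state == TEXT) {
                if (c == '<') {
                    state = TAG;
                    out.write(' '); // keep words on either side of a tag apart
                } else {
                    out.write(c);
                }
            } else if (c == '>') {
                state = TEXT;
            }
        }
    }

    /** Convenience wrapper for in-memory strings. */
    public static String strip(String html) throws IOException {
        StringWriter out = new StringWriter();
        strip(new StringReader(html), out);
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(strip("<p>Hello <b>world</b></p>"));
    }
}
```

Since it works as a stream-to-stream filter, memory use stays constant regardless of document size; wrapping the input in a BufferedReader would avoid the cost of unbuffered single-character reads.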

D.

Re: which HTML parser is better? - Thread closed

2005-02-03 Thread Karl Koch

Thank you, I will do that.


Rewrite causes BooleanQuery to lose required terms

2005-02-03 Thread Nick Burch

Hi All

I'm using Lucene from CVS, and I've discovered that when rewriting a
BooleanQuery created with the old-style add(Query,boolean,boolean) method,
the rewrite causes the required flags to get lost.

Using old style (Query,boolean,boolean):
query = +contents:test* +(class:1.2 class:1.2.*)
rewritten query = (contents:tester contents:testing contents:tests) 
  (class:1.2 (class:1.2.3 class:1.2.4))

Using new style (Query,BooleanClause.Occur.MUST):
query = +contents:test* +(class:1.2 class:1.2.*)
rewritten query = +(contents:tester contents:testing contents:tests) 
  +(class:1.2 (class:1.2.3 class:1.2.4))

Attached is a simple RAMDirectory test to show this. I know that the
(Query,boolean,boolean) method is deprecated, but should it also be
broken?

Thanks
Nick
/*
 * For testing to see if there are problems with rewriting
 *
 * Should show
 *    +contents:test* +(class:1.2 class:1.2.*)
 * Goes to with old style
 *    (contents:tester contents:testing contents:tests) (class:1.2 (class:1.2.3 class:1.2.4))
 * Goes to (correctly) with new style
 *    +(contents:tester contents:testing contents:tests) +(class:1.2 (class:1.2.3 class:1.2.4))
 *
 * Nick Burch nick at torchbox dot com
 */

import org.apache.lucene.index.*;
import org.apache.lucene.store.*;
import org.apache.lucene.search.*;
import org.apache.lucene.document.*;

import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.analysis.Analyzer;

import java.io.IOException;

public class rewritingTest {
    public static void main(String args[]) throws IOException {
        // Create a ram directory with a few test entries in it
        RAMDirectory dir = new RAMDirectory();

        // Create an analyzer
        Analyzer a = new SimpleAnalyzer();

        // Open an index writer
        IndexWriter writer = new IndexWriter(dir, a, true);

        // Add some test docs
        Document doc;
        Field f;

        // Doc 1
        doc = new Document();
        f = new Field("contents", "testing", Field.Store.YES, Field.Index.TOKENIZED);
        doc.add(f);
        f = new Field("title", "testing", Field.Store.YES, Field.Index.TOKENIZED);
        doc.add(f);
        f = new Field("class", "1.2.3", Field.Store.YES, Field.Index.UN_TOKENIZED);
        doc.add(f);
        f = new Field("class", "1.2.4", Field.Store.YES, Field.Index.UN_TOKENIZED);
        doc.add(f);
        writer.addDocument(doc);

        // Doc 2
        doc = new Document();
        f = new Field("contents", "tests", Field.Store.YES, Field.Index.TOKENIZED);
        doc.add(f);
        f = new Field("title", "tests", Field.Store.YES, Field.Index.TOKENIZED);
        doc.add(f);
        f = new Field("class", "1.3.3", Field.Store.YES, Field.Index.UN_TOKENIZED);
        doc.add(f);
        f = new Field("class", "1.3.4", Field.Store.YES, Field.Index.UN_TOKENIZED);
        doc.add(f);
        writer.addDocument(doc);

        // Doc 3
        doc = new Document();
        f = new Field("contents", "tester", Field.Store.YES, Field.Index.TOKENIZED);
        doc.add(f);
        f = new Field("title", "tester", Field.Store.YES, Field.Index.TOKENIZED);
        doc.add(f);
        f = new Field("class", "1.4", Field.Store.YES, Field.Index.UN_TOKENIZED);
        doc.add(f);
        f = new Field("class", "1.5", Field.Store.YES, Field.Index.UN_TOKENIZED);
        doc.add(f);
        writer.addDocument(doc);

        // Now get a searcher
        writer.close();
        IndexReader reader = IndexReader.open(dir);
        IndexSearcher search = new IndexSearcher(reader);

        // Construct a nest of queries: one built with the old-style
        // add(Query,boolean,boolean), one with the new-style Occur argument
        BooleanQuery overallQueryOld = new BooleanQuery();
        BooleanQuery overallQueryNew = new BooleanQuery();

        // Contents query (required in both variants)
        Term contentsT = new Term("contents", "test*");
        WildcardQuery contents = new WildcardQuery(contentsT);
        overallQueryOld.add(contents, true, false);
        overallQueryNew.add(contents, BooleanClause.Occur.MUST);

        // Classification query (optional sub-clauses, required as a whole)
        BooleanQuery classQ = new BooleanQuery();
        Term justClassT = new Term("class", "1.2");
        Term classChildrenT = new Term("class", "1.2.*");
        TermQuery justClass = new TermQuery(justClassT);
        WildcardQuery classChildren = new WildcardQuery(classChildrenT);
        classQ.add(justClass, false, false);
        classQ.add(classChildren, false, false);
        overallQueryOld.add(classQ, true, false);
        overallQueryNew.add(classQ, BooleanClause.Occur.MUST);

        // Print each query before and after rewriting
        System.out.println(overallQueryOld);
        Query rewrittenOld = overallQueryOld.rewrite(reader);
        System.out.println(rewrittenOld);

        System.out.println();
        System.out.println(overallQueryNew);
        Query rewrittenNew = overallQueryNew.rewrite(reader);
        System.out.println(rewrittenNew);
    }
}

Re: Has anyone tried indexing xml files: DigesterXMLHandler.java file before?

2005-02-03 Thread Erik Hatcher

You're missing the Commons Digester JAR, which is in the lib directory 
of the LIA download.  Check the build.xml file for the details of 
how the compile class path is set.  You'll likely need some other JARs 
at runtime too.

Erik
On Feb 3, 2005, at 2:12 AM, jac jac wrote:
Hi,
I just tried to compile DigesterXMLHandler.java from the LIA code, 
which I got from the src directory.

I placed it into my own directory...
I can't seem to compile DigesterXMLHandler.java.
It keeps reporting:
DigesterXMLHandler.java:9: package org.apache.commons.digester does not exist
import org.apache.commons.digester.Digester;
   ^
DigesterXMLHandler.java:19: cannot resolve symbol
symbol  : class Digester
location: class lia.handlingtypes.xml.DigesterXMLHandler
  private Digester dig;
  ^
DigesterXMLHandler.java:25: cannot resolve symbol
symbol  : class Digester
location: class lia.handlingtypes.xml.DigesterXMLHandler
dig = new Digester();

I have set the classpath...
How do I run the file in order to get my index folder?
Sorry, I really can't work out how to run it...
Is there any documentation around?
Thanks very much!

Re: Subversion conversion

2005-02-03 Thread Erik Hatcher

We can work the 1.x and 2.0 lines of code however we need to.  We can 
branch (a branch or tag in Subversion is inexpensive and a constant 
time operation).  How we want to manage both versions of Lucene is open 
for discussion.  Nothing about Subversion changes how we manage this 
from how we'd do it with CVS.

Currently the 1.x and 2.x lines of code are one and the same.  Once 
they diverge in 2.0, maintenance of 1.x will depend on who steps up to do it. 
I suspect some will have a strong interest in keeping it alive, 
but we would of course encourage everyone using 1.x to upgrade to 1.9 and 
remove deprecation warnings.

Erik

On Feb 3, 2005, at 4:33 AM, Miles Barr wrote:
On Wed, 2005-02-02 at 22:11 -0500, Erik Hatcher wrote:
I've seen both of these types of procedures followed on Apache
projects.  It really just depends.  Lucene's codebase is not being
modified frequently, so it is not necessary to branch and merge back.
Rather we simply develop off of the trunk (HEAD) and when we're ready
for a release we'll just do it from the trunk.  Actually  we'd most
likely tag and build from that tag just to be clean about it.
What consequences does this have for the 1.9/2.0 releases? i.e. after
2.0 the deprecated API will be removed, does this mean 1.x will no
longer be supported after 2.0?
The typical scenario being a bug is found that affects 1.x and 2.x; it's
patched in 2.x (i.e. the trunk) but we can't patch the last 1.x release.
The other scenario being a bug is found in the 1.x code, but the fix cannot
be applied.

--
Miles Barr [EMAIL PROTECTED]
Runtime Collective Ltd.

Getting Search Results Pharse

2005-02-03 Thread mahaveer jain

Hi All,

I am using Lucene to index and search my app. So far
I have only been showing the file name or title, depending on my
application. We want to show the phrase that contains the
keyword searched for. 
Has anybody tried this? Can someone help me get started
with this?

Thanks
Mahaveer




RE: Getting Search Results Pharse

2005-02-03 Thread Pasha Bizhan

Hi, 

 From: mahaveer jain [mailto:[EMAIL PROTECTED] 

 I am using lucene to index and search my app. Till date I am 
 just showing file name or title based on my application. We 
 want to show, pharse that contain the keyword searched. 
 Has anybody tried this ? Can someone help me start this ?

Look at Mark Harwood's Highlighter package:
http://cvs.apache.org/viewcvs.cgi/jakarta-lucene-sandbox/contributions/highlighter/

Pasha Bizhan
http://lucenedotnet.com



RE: Subversion conversion

2005-02-03 Thread Kevin L. Cobb

We recently started using SVN for SCM; before that we were using VSS. We're
trying out approach A, branching off for each release. Development always
happens on the trunk, except when a bug is discovered that needs to be patched
in a previous version of the product. When that scenario comes up (and
it never has), the developer has to make the change on the branch for the
version that needs to be patched and then merge that change into the
other branches and the trunk.

It seems to be a cleaner approach, at least for now. Of course, for an
open source project like Lucene, I'm not sure branching is necessary at
all. If anyone has other models to use for SCM, I'd love to hear them.
Here's some ASCII art showing our model:

                                +--- branch release 1.2
                                |
---trunk---|---trunk--|--trunk--|---trunk-----
           |          |
           |          +-- branch release 1.1
           |
           + branch release 1.0 ---

Kevin Cobb


-Original Message-
From: Chakra Yadavalli [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, February 02, 2005 7:50 PM
To: Lucene Users List
Subject: Re: Subversion conversion

Hello ALL, It might not be the right place for it but as we are talking
about SCM, I have a quick question. First, I haven't used CVS/SVN on any
project. I am a ClearCase/PVCS guy. I just would like to know WHICH
CONFIGURATION MANAGEMENT PLAN DO YOU FOLLOW IN LUCENE DEVELOPMENT.

PLAN A: DEVELOP IN TRUNK AND BRANCH OFF ON RELEASE
Recently I had a discussion with a friend about developing in the TRUNK
(which is /main in ClearCase speak), which my friend claims is how it is
done in the APACHE/Open Source projects. The main advantage he pointed out
was that merging can be avoided if you are developing in the TRUNK.
When there is a release, they create a new branch (say a LUCENE_1.5
branch) and label it. That branch is used for maintenance, and any
code deltas are merged back to the TRUNK as needed.

PLAN B: BRANCH OFF BEFORE PLANNED RELEASE AND MERGE BACK TO MAIN/TRUNK
As I come from the private workspace/isolated development school of
thought promoted by ClearCase, I am used to creating a branch at
project/release initiation and developing in that branch (say /main/dev).
Similarly, we have /main/int for making changes when the project goes to
integration phase, and a /main/acp branch for acceptance. In this
school, the /main will always have fewer versions of files and the
difference between any two consecutive versions is the NET CHANGE of
that SCM element (either file or dir) between two releases (say LUCENE
1.4 and 1.5).

Thanks in advance for your time.
Chakra Yadavalli
http://jroller.com/page/cyblogue

 -Original Message-
 From: aurora [mailto:[EMAIL PROTECTED]
 Sent: Wednesday, February 02, 2005 4:25 PM
 To: lucene-user@jakarta.apache.org
 Subject: Re: Subversion conversion
 
 Subversion rocks!
 
 I have just set up the Windows svn client TortoiseSVN with my favourite
 file manager, Total Commander 6.5. The svn status and commands are readily
 integrated with the file manager. Offline diff and revert are two things I
 really like about svn.
 
  The conversion to Subversion is complete.  The new repository is
  available to users read-only at:
 
http://svn.apache.org/repos/asf/lucene/java/trunk
 
  Besides /trunk, there is also /branches and /tags.  /tags contains all
  the CVS tags made so that you could grab a snapshot of a previous
  version.  /trunk is analogous to CVS HEAD.  You can learn more about the
  Apache repository configuration here and how to use the command-line
  client to check out the repository:
 
http://www.apache.org/dev/version-control.html
 
  Learn about Subversion, including the complete O'Reilly Subversion book
  in electronic form for free here:
 
http://subversion.tigris.org
 
  For committers, check out the repository using https and your Apache
  username/password.
 
  The Lucene sandbox has been integrated into our single Subversion
  repository, under /java/trunk/sandbox:
 
http://svn.apache.org/repos/asf/lucene/java/trunk/sandbox/
 
  The Lucene CVS repositories have been locked for read-only.
 
  If there are any issues with this conversion, let me know and I'll bring
  them to the Apache infrastructure group.
 
Erik
 
 


-- 
Visit my weblog: http://www.jroller.com/page/cyblogue


Right way to make analyzer

2005-02-03 Thread Owen Densmore

Is this the right way to make a porter analyzer using the standard 
tokenizer?  I'm not sure about the order of the filters.

Owen
class MyAnalyzer extends Analyzer {
  public TokenStream tokenStream(String fieldName, Reader reader) {
    return new PorterStemFilter(
      new StopFilter(
        new LowerCaseFilter(
          new StandardFilter(
            new StandardTokenizer(reader))),
        StopAnalyzer.ENGLISH_STOP_WORDS));
  }
}


Re: Right way to make analyzer

2005-02-03 Thread Erik Hatcher

On Feb 3, 2005, at 9:26 AM, Owen Densmore wrote:
Is this the right way to make a porter analyzer using the standard 
tokenizer?  I'm not sure about the order of the filters.

Owen
class MyAnalyzer extends Analyzer {
  public TokenStream tokenStream(String fieldName, Reader reader) {
return new PorterStemFilter(
new StopFilter(
new LowerCaseFilter(
new StandardFilter(
new StandardTokenizer(reader))),
   StopAnalyzer.ENGLISH_STOP_WORDS));
  }
}
Yes, that is correct.
Analysis starts with a tokenizer, and chains the output of that to the 
next filter and so on.

I strongly recommend, as you start tinkering with custom analysis, to 
use a little bit of code to see how your analyzer works on some text.  
The Lucene Intro article I wrote for java.net has some code you can 
borrow to do this, as does Lucene in Action's source code.  Also, Luke 
has this capability - which is a tool I also highly recommend.
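
To make the chaining concrete, here is a minimal plain-Java sketch of a tokenizer followed by filters. It deliberately does not use the Lucene classes: the Tokens interface and the whitespace, lowerCase and stop helpers are hypothetical stand-ins, and the stop filter is assumed to be case-sensitive. Under that assumption it shows why lower-casing has to sit below the stop filter in the chain, as it does in MyAnalyzer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy model of a tokenizer-plus-filters chain; NOT the Lucene API. */
final class ChainSketch {
    /** Hypothetical stand-in for a token stream: returns null when exhausted. */
    interface Tokens { String next(); }

    /** Source of the chain: splits the input on whitespace. */
    static Tokens whitespace(String text) {
        String trimmed = text.trim();
        final String[] parts = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
        return new Tokens() {
            private int i = 0;
            public String next() { return i < parts.length ? parts[i++] : null; }
        };
    }

    /** Filter: lower-cases every token pulled from the wrapped stream. */
    static Tokens lowerCase(Tokens in) {
        return () -> {
            String t = in.next();
            return t == null ? null : t.toLowerCase(Locale.ROOT);
        };
    }

    /** Filter: drops tokens found in the (case-sensitive) stop set. */
    static Tokens stop(Tokens in, Set<String> stopWords) {
        return () -> {
            String t;
            while ((t = in.next()) != null && stopWords.contains(t)) {
                // skip stop words
            }
            return t;
        };
    }

    /** Pulls every token out of the chain into a list. */
    static List<String> drain(Tokens tokens) {
        List<String> out = new ArrayList<String>();
        for (String t = tokens.next(); t != null; t = tokens.next()) {
            out.add(t);
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> stops = new HashSet<String>(Arrays.asList("the", "a", "of"));
        // Lower-case first, then stop: "The" is normalised and removed.
        System.out.println(drain(stop(lowerCase(whitespace("The Quick fox")), stops)));
        // Stop first, then lower-case: "The" slips past the stop set.
        System.out.println(drain(lowerCase(stop(whitespace("The Quick fox"), stops))));
    }
}
```

The innermost constructor call runs first, which is why the nesting in MyAnalyzer reads inside-out: tokenizer, then the standard and lower-case filters, then stop words, then stemming.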

Erik

Re: which HTML parser is better?

2005-02-03 Thread aurora

For all the parser suggestions, I think there is one important attribute. Some  
parsers return data provided that the input HTML is sensible. Other parsers  
are designed to be as flexible and tolerant as they can be. If the input is  
clean and controlled, the former class is sufficient. Even a regular  
expression may be sufficient (I think that's what the original poster wants).  
If you are building a web crawler, you need something really tolerant.

I once prototyped a nice and fast parser. Later I had to abandon it  
because it failed to parse about 15% of documents (it had problems handling  
nested quotes like onclick="alert('hi')").

No one has yet mentioned using ParserDelegator and ParserCallback that  
are part of HTMLEditorKit in Swing.  I have been successfully using  
these classes to parse out the text of an HTML file.  You just need to  
extend HTMLEditorKit.ParserCallback and override the various methods  
that are called when different tags are encountered.
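
A minimal sketch of the ParserCallback approach described above, using only classes that ship with the JDK's Swing packages. The class name HtmlTextExtractor and the whitespace normalisation are illustrative choices, not part of any posted code; only handleText is overridden, so tags and attributes are simply ignored.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

/** Collects the text content of an HTML document and drops the markup. */
final class HtmlTextExtractor extends HTMLEditorKit.ParserCallback {
    private final StringBuilder text = new StringBuilder();

    /** Called by the parser for each run of character data between tags. */
    @Override
    public void handleText(char[] data, int pos) {
        if (text.length() > 0) {
            text.append(' ');
        }
        text.append(data);
    }

    /** Parses the given HTML string and returns its text with whitespace collapsed. */
    static String extract(String html) {
        HtmlTextExtractor callback = new HtmlTextExtractor();
        try {
            // The third argument tells the parser to ignore charset declarations.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return callback.text.toString().replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(extract("<html><body><p>Hello <b>world</b></p></body></html>"));
    }
}
```

Other callbacks such as handleStartTag and handleEndTag can be overridden in the same way if, for example, the title or link targets need to go into separate index fields.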

On Feb 1, 2005, at 3:14 AM, Jingkang Zhang wrote:
Three HTML parsers (the Lucene web application demo, CyberNeko HTML Parser, JTidy)
are mentioned in Lucene FAQ 1.3.27. Which is the best? Can it filter out the tags
that the MS Word 'Save As HTML' function creates automatically?


Hits and HitCollector performance

2005-02-03 Thread aurora

I am trying to do some filtering and rearrangement of search results. Two  
possibilities that come to mind are iterating through the Hits or writing a  
custom HitCollector.

All the documentation invariably warns about the performance impact of using  
HitCollector with a large result set. The scenario where Google returns tens  
of millions of documents comes to mind. But I'm thinking, wouldn't Hits  
also have to fill up an array with millions of integer ids at least? Or  
does it only return the correct length but build the result array on  
demand?

Another idea I have is to first go through the first n hits, let's say 1000,  
which I filter and rearrange. If the user ever needs results past the first  
1000, then get them from Hits.

Is there any recommended way in these situations?
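
As an illustration of the "first n hits" idea, a collector can keep only the n best-scoring documents in a small heap, so memory stays bounded no matter how many documents match. The sketch below is plain Java and does not extend Lucene's HitCollector; the class TopNSketch and its methods are hypothetical, with collect(int, float) merely shaped like a per-hit callback.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

/** Keeps the n highest-scoring documents seen so far; NOT a Lucene class. */
final class TopNSketch {
    /** A document id with its score; ordered by ascending score. */
    static final class ScoredDoc implements Comparable<ScoredDoc> {
        final int doc;
        final float score;

        ScoredDoc(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }

        public int compareTo(ScoredDoc other) {
            return Float.compare(score, other.score);
        }
    }

    private final int n;
    // Min-heap: the weakest of the current top n sits at the head.
    private final PriorityQueue<ScoredDoc> heap = new PriorityQueue<ScoredDoc>();

    TopNSketch(int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        this.n = n;
    }

    /** Called once per matching document. */
    void collect(int doc, float score) {
        if (heap.size() < n) {
            heap.add(new ScoredDoc(doc, score));
        } else if (score > heap.peek().score) {
            heap.poll();                       // evict the current weakest hit
            heap.add(new ScoredDoc(doc, score));
        }
    }

    /** Returns the retained hits, best score first. */
    List<ScoredDoc> best() {
        List<ScoredDoc> out = new ArrayList<ScoredDoc>(heap);
        Collections.sort(out, Collections.reverseOrder());
        return out;
    }
}
```

Any filtering would go at the top of collect, before the heap is touched, and rearranging then only has to deal with at most n entries.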

Re: Subversion conversion

2005-02-03 Thread John Haxby

Kevin L. Cobb wrote:
We recently started using SVN for SCM, were using VSS. We're trying out
approach A, branching off for each release. Development always develops
on the trunk, except when a bug is discovered that needs to be patched
to a previous version of the product. When that scenario comes up (and
it never has), then the developer has to make the change to the branched
version that needs to be patched and then must merge those changes into
other branches and the trunk.  

It seems to be a cleaner approach, at least for now. Of course, for an
open source project like Lucene, I'm not sure branching is necessary at
all. Anyone have any other models to use for SCM, I'd love to hear them,
 

We've tried a variety of approaches over the years, but this one seems 
to be the easiest to handle and least prone to errors.   It's nice to 
see someone else has reached the same conclusion!

jch

Lock failure recovery

2005-02-03 Thread Claes Holmerson

Hello
A commit.lock can get left behind by a process that dies in the middle of 
reading the index, for example because of an OutOfMemoryError. How can I 
handle such a leftover lock gracefully the next time the process runs? 
Checking whether there is a lock is straightforward - but how can I be sure 
that it is not just a current lock created by another thread? The only 
methods I can find to deal with the lock are IndexReader.isLocked() and 
IndexReader.unlock(). I would like to know the lock age - if it is older 
than a certain age then I can remove it. How do other people deal with 
leftover locks?
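
The lock-age check itself can be sketched with nothing but java.io.File. This is only an illustration of the idea in the question: the class StaleLockSketch and its methods are hypothetical, and where the lock file actually lives is an assumption that depends on the Lucene version and configuration. An age threshold is also only a heuristic, since a legitimately long-running operation can hold a lock for longer than the limit.

```java
import java.io.File;

/** Age-based stale-lock heuristic; the lock file location is an assumption. */
final class StaleLockSketch {
    /** Pure decision: is a lock last touched at lastModifiedMillis too old at nowMillis? */
    static boolean isStale(long lastModifiedMillis, long nowMillis, long maxAgeMillis) {
        return nowMillis - lastModifiedMillis > maxAgeMillis;
    }

    /** Applies the decision to a lock file on disk; a missing file is never stale. */
    static boolean isStale(File lockFile, long maxAgeMillis) {
        if (!lockFile.exists()) {
            return false;
        }
        return isStale(lockFile.lastModified(), System.currentTimeMillis(), maxAgeMillis);
    }

    public static void main(String[] args) {
        // Hypothetical usage: pass the path of the lock file as the first argument.
        File lock = new File(args.length > 0 ? args[0] : "commit.lock");
        long tenMinutes = 10L * 60L * 1000L;
        System.out.println(lock + " stale? " + isStale(lock, tenMinutes));
    }
}
```

If the check says the lock is stale, the actual removal would still go through IndexReader.unlock(), as in the reply that follows.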

Claes
--
Claes Holmerson
Polopoly - Cultivating the information garden
Kungsgatan 88, SE-112 27 Stockholm, SWEDEN
Direct: +46 8 506 782 59
Mobile: +46 704 47 82 59
Fax:  +46 8 506 782 51
[EMAIL PROTECTED], http://www.polopoly.com

Re: Lock failure recovery

2005-02-03 Thread Luke Shannon

The indexing process is totally synchronized in our system. Thus if an
indexing thread starts up and the index exists but is locked, I know this
to be the only indexing process running, so the lock must be from a
process that got stopped before it could finish.

So right before I begin writing to the index I have this check:

// If we have gotten here, this is the only indexing process running.
// The index should not be locked; if it is, the lock is stale
// and must be released before we can continue.
try {
    if (index.exists() && IndexReader.isLocked(indexFileLocation)) {
        Trace.ERROR("INDEX INFO: Had to clear a stale index lock");
        IndexReader.unlock(FSDirectory.getDirectory(index, false));
    }
} catch (IOException e3) {
    Trace.ERROR("INDEX ERROR: Was unable to clear a stale index lock: " + e3);
}

Luke


when indexing, java.io.FileNotFoundException

2005-02-03 Thread Chris Lu

Hi,
I am getting this exception now and then when I am indexing content.
It doesn't always happen. But when it happens, I have to delete the
index and start over again.
This is a serious problem for us.
In this email, Doug says it has something to do with win32's lack of
atomic renaming.
http://java2.5341.com/msg/1348.html
But how can I prevent this?
Chris Lu
java.io.FileNotFoundException: C:\data\indexes\customer\_temp\0\_1e.fnm
(The system cannot find the file specified)
  at java.io.RandomAccessFile.open(Native Method)
  at java.io.RandomAccessFile.<init>(RandomAccessFile.java:204)
  at org.apache.lucene.store.FSInputStream$Descriptor.<init>(FSDirectory.java:376)
  at org.apache.lucene.store.FSInputStream.<init>(FSDirectory.java:405)
  at org.apache.lucene.store.FSDirectory.openFile(FSDirectory.java:268)
  at org.apache.lucene.index.FieldInfos.<init>(FieldInfos.java:53)
  at org.apache.lucene.index.SegmentReader.initialize(SegmentReader.java:109)
  at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:94)
  at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:480)
  at org.apache.lucene.index.IndexWriter.maybeMergeSegments(IndexWriter.java:458)
  at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:310)
  at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:294)



RE: when indexing, java.io.FileNotFoundException

2005-02-03 Thread Will Allen

Increase minMergeDocs and use the compound file format when creating your 
index.

http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexWriter.html

http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexWriter.html#setUseCompoundFile(boolean)


Re: when indexing, java.io.FileNotFoundException

2005-02-03 Thread Chris Lu

Thank you for your reply.
I am already using the compound file format, and minMergeDocs is already 
increased to 50.
As I understand it and have observed, files are compounded at the end of 
indexing. The error happens during indexing, so the compound file format 
should not matter.

Chris Lu
Will Allen wrote:
Increase the minMergeDocs and use the compact file format when creating your 
index.
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexWriter.html
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexWriter.html#setUseCompoundFile(boolean)

Parsing The Query: Every document that doesn't have a field containing x

2005-02-03 Thread Luke Shannon

Hello;

I have a query that finds documents that contain fields with a specific
value.

query1 = QueryParser.parse("jpg", "kcfileupload", new StandardAnalyzer());

This works well.

I would like a query that finds all documents whose kcfileupload field
doesn't contain jpg.

The example I found in the book that seems to relate shows me how to find
documents without a specific term:

QueryParser parser = new QueryParser("contents", analyzer);
parser.setOperator(QueryParser.DEFAULT_OPERATOR_AND);

But then it says:

"Negating a term must be combined with at least one nonnegated term to return
documents; in other words, it isn't possible to use a query like NOT term to
find all documents that don't contain a term."

So does that mean the above example wouldn't work?

The API says:

"a plus (+) or a minus (-) sign, indicating that the clause is required or
prohibited respectively;"

I have been playing around with using the minus character without much luck.

Can someone point me in the right direction to figure this out?

Thanks,

Luke





Re: Rewrite causes BooleanQuery to loose required terms

2005-02-03 Thread Paul Elschot

On Thursday 03 February 2005 11:38, Nick Burch wrote:
 Hi All
 
 I'm using lucene from CVS, and I've discovered the rewriting a 
 BooleanQuery created with the old style (Query,boolean,boolean) method,
 the rewrite will cause the required parameters to get lost.
 
 Using old style (Query,boolean,boolean):
 query = +contents:test* +(class:1.2 class:1.2.*)
 rewritten query = (contents:tester contents:testing contents:tests) 
   (class:1.2 (class:1.2.3 class:1.2.4))
 
 Using new style (Query,BooleanClause.Occur.MUST):
 query = +contents:test* +(class:1.2 class:1.2.*)
 rewritten query = +(contents:tester contents:testing contents:tests) 
   +(class:1.2 (class:1.2.3 class:1.2.4))
 
 Attached is a simple RAMDirectory test to show this. I know that the 
 (Query,boolean,boolean) method is depricated, but should it also be 
 broken?

No.
Currently, the old constructor for BooleanClause does not carry the
old state forward.
The new constructor does carry the new state backward.

I'll post a fix in bugzilla later.

Thanks,
Paul Elschot.
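
To illustrate the kind of mismatch described above, here is a small plain-Java sketch of a clause class with an old-style (required, prohibited) constructor and a new-style Occur constructor. It is not the Lucene source and not the actual patch: ClauseSketch and toOccur are hypothetical names, and the design shown (the old constructor delegating to the new one) is just one way to make sure the old flags are carried forward into the new state.

```java
/** Toy clause with old-style flags and a new-style Occur; NOT the Lucene class. */
final class ClauseSketch {
    enum Occur { MUST, SHOULD, MUST_NOT }

    final String query;
    final Occur occur;

    /** New-style constructor: the occurrence is stated directly. */
    ClauseSketch(String query, Occur occur) {
        this.query = query;
        this.occur = occur;
    }

    /**
     * Old-style constructor (the deprecated form in the real API).
     * It maps the two flags onto Occur, so the old state is carried forward.
     */
    ClauseSketch(String query, boolean required, boolean prohibited) {
        this(query, toOccur(required, prohibited));
    }

    /** Translates the legacy flag pair into the new enum. */
    static Occur toOccur(boolean required, boolean prohibited) {
        if (required && prohibited) {
            throw new IllegalArgumentException("a clause cannot be both required and prohibited");
        }
        if (required) {
            return Occur.MUST;
        }
        return prohibited ? Occur.MUST_NOT : Occur.SHOULD;
    }

    /** Legacy views derived from the single source of truth. */
    boolean isRequired() { return occur == Occur.MUST; }
    boolean isProhibited() { return occur == Occur.MUST_NOT; }

    @Override
    public String toString() {
        String prefix = occur == Occur.MUST ? "+" : occur == Occur.MUST_NOT ? "-" : "";
        return prefix + query;
    }
}
```

With a single stored Occur, code that copies clauses during a rewrite cannot drop the required flag, whichever constructor originally built the clause.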



Re: Parsing The Query: Every document that doesn't have a field containing x

2005-02-03 Thread Kelvin Tan

Alternatively, add a dummy field-value to all documents, like 
doc.add(Field.Keyword("foo", "bar"))

It's a waste of space, but it allows you to perform negated queries.
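
Why the dummy field helps can be seen in a toy inverted index. The sketch below is plain Java, not Lucene: NegationSketch and its methods are hypothetical, and it models a query as one optional required term minus one prohibited term. A prohibited clause only removes documents from the candidate set, so without a positive clause there is nothing to remove from.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy inverted index showing why a pure NOT query matches nothing; NOT Lucene. */
final class NegationSketch {
    // Maps "field:value" to the ids of the documents containing it.
    private final Map<String, Set<Integer>> postings = new HashMap<String, Set<Integer>>();

    void add(int doc, String field, String value) {
        postings.computeIfAbsent(field + ":" + value, k -> new TreeSet<Integer>()).add(doc);
    }

    /**
     * Documents containing the required term, minus those containing the prohibited term.
     * Passing null for required models a query made of a negated clause only.
     */
    Set<Integer> search(String required, String prohibited) {
        Set<Integer> result = new TreeSet<Integer>();
        if (required != null) {
            result.addAll(postings.getOrDefault(required, Collections.<Integer>emptySet()));
        }
        result.removeAll(postings.getOrDefault(prohibited, Collections.<Integer>emptySet()));
        return result;
    }

    public static void main(String[] args) {
        NegationSketch index = new NegationSketch();
        String[] uploads = {"jpg", "gif", "jpg"};
        for (int doc = 0; doc < uploads.length; doc++) {
            index.add(doc, "kcfileupload", uploads[doc]);
            index.add(doc, "foo", "bar");   // the dummy field-value present in every document
        }
        System.out.println(index.search(null, "kcfileupload:jpg"));      // prints []
        System.out.println(index.search("foo:bar", "kcfileupload:jpg")); // prints [1]
    }
}
```

In query-parser terms the second search corresponds to something like +foo:bar -kcfileupload:jpg, where the dummy term supplies the set of all documents to subtract from.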

On Thu, 03 Feb 2005 19:19:15 +0100, Maik Schreiber wrote:
 Negating a term must be combined with at least one nonnegated
 term to return documents; in other words, it isn't possible to
 use a query like NOT term to find all documents that don't
 contain a term.

 So does that mean the above example wouldn't work?

 Exactly. You cannot search for -kcfileupload:jpg, you need at
 least one clause that actually _includes_ documents.

 Do you by chance have a field with known contents? If so, you could
 misuse that one and include it in your query (perhaps by doing
 range or wildcard/prefix search). If not, try IndexReader.terms()
 for building a Query yourself, then use that one for search.




Searching for doc without a field

2005-02-03 Thread Bill Tschumy

Is there any way to construct a query to locate all documents without a 
specific field?  By this I mean the Document was created without ever 
having that field added to it.
--
Bill Tschumy
Otherwise -- Austin, TX
http://www.otherwise.com


Re: Synonyms Not Showing In The Index

2005-02-03 Thread Andrzej Bialecki

Andrzej Bialecki wrote:
Luke Shannon wrote:
Hello;
It seems my Synonym analyzer is working (based on some successful 
queries).
But I can't see the synonyms in the index using Luke. Is this correct?

Did you use the combined JAR to run? It contains an oldish version of 
Lucene... Other than that, I'm not sure - if you can't find the reason 
you could send me a small test index...


Got the bug. Your index is ok, and your synonym analyzer works as 
expected. In doc #16, the field "name" has the content "luigi|mario test", 
where the tokens "luigi" and "mario" occupy the same position.

This was a deficiency with the current version of Luke, where if you 
press Reconstruct it tries to reconstruct only unstored fields, but 
shows you the stored fields verbatim (without actually checking how 
their content was tokenized, and what tokens ended up in the index).

This is fixed in the new (yet unreleased) version of Luke. This new 
version restores all fields (no matter if they are stored or only 
indexed), and then displays both the stored content, and the restored 
tokenized content. There was also a bug in GrowableStringsArray - the 
values of tokens with the same position were being overwritten instead 
of appended. This is also fixed now.
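
The overwrite-versus-append difference can be shown in a few lines of plain Java. This is not Luke's GrowableStringsArray; ReconstructSketch is a hypothetical illustration that rebuilds a field from (position, term) pairs and joins terms sharing a position with '|', in the style of the "luigi|mario test" example above.

```java
import java.util.Map;
import java.util.TreeMap;

/** Rebuilds field text from (position, term) pairs; NOT Luke's actual code. */
final class ReconstructSketch {
    /** Terms at the same position are joined with '|', positions with a space. */
    static String reconstruct(int[] positions, String[] terms) {
        if (positions.length != terms.length) {
            throw new IllegalArgumentException("positions and terms must have the same length");
        }
        Map<Integer, String> byPosition = new TreeMap<Integer, String>();
        for (int i = 0; i < terms.length; i++) {
            // merge() appends to an existing entry; a plain put() would overwrite it,
            // which is the kind of bug described above.
            byPosition.merge(positions[i], terms[i], (a, b) -> a + "|" + b);
        }
        return String.join(" ", byPosition.values());
    }

    public static void main(String[] args) {
        // A synonym filter emits "mario" at the same position as "luigi".
        System.out.println(reconstruct(new int[] {0, 0, 1},
                                       new String[] {"luigi", "mario", "test"}));
    }
}
```

Replacing merge with put in this sketch would lose "luigi" and yield "mario test", which is how synonyms can be present in the index yet missing from a reconstructed view.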

You should expect a new release within a week or two. If you can't wait, 
let me know and I'll send you the patches.

--
Best regards,
Andrzej Bialecki
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

Re: Synonyms Not Showing In The Index

2005-02-03 Thread Luke Shannon

Thanks!

I can wait for the release.

Luke


Re: Searching for doc without a field

2005-02-03 Thread Paul Elschot

On Thursday 03 February 2005 20:18, Bill Tschumy wrote:
> Is there any way to construct a query to locate all documents without a
> specific field?  By this I mean the Document was created without ever
> having that field added to it.

One way is to add an extra document field containing the field
names of all (other) indexed fields in the document.
Assuming there is always a primary key field the query is then:

+fieldnames:primarykeyfield -fieldnames:specificfield
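
A rough sketch of this idea in plain Java, without any Lucene calls. The class FieldNamesSketch, its method names, and the sample field names used with it are all made up for illustration; in a real index the value returned by fieldNamesValue would be added as a tokenized "fieldnames" field, and matches only models what the query above selects.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper; it models the extra "fieldnames" field, not Lucene itself.
class FieldNamesSketch {
    // Value for the extra field: the names of all other fields, space separated.
    static String fieldNamesValue(Map<String, String> doc) {
        return String.join(" ", new TreeSet<>(doc.keySet()));
    }

    // Models: +fieldnames:required -fieldnames:prohibited
    static boolean matches(String fieldNames, String required, String prohibited) {
        Set<String> names = new HashSet<>(Arrays.asList(fieldNames.split(" ")));
        return names.contains(required) && !names.contains(prohibited);
    }
}
```

A document that has only the primary key field matches; one that also carries the specific field does not.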

Regards,
Paul Elschot



Re: which HTML parser is better?

2005-02-03 Thread Ian Soboroff


One which we've been using can be found at:
http://www.ltg.ed.ac.uk/~richard/ftp-area/html-parser/

We absolutely need to be able to recover gracefully from malformed
HTML and/or SGML.  Most of the nicer SAX/DOM/TLA parsers out there
failed this criterion when we started our effort.  The above one is
kind of SAX-y but doesn't fall over at the sight of a real web page
;-)

Ian



Re: Parsing The Query: Every document that doesn't have a field containing x

2005-02-03 Thread Luke Shannon

Ok.

I have added the following to every document:

doc.add(Field.UnIndexed("olFaithfull", "stillHere"));

The plan is a query that says: olFaithfull = stillHere and kcfileupload != jpg.

I have been experimenting with the MultiFieldQueryParser, but it is not
working out for me. Syntactically, how is this done? Does someone have an
example of a query similar to the one I am trying?

Thanks,

Luke

- Original Message - 
From: Maik Schreiber [EMAIL PROTECTED]
To: Lucene Users List lucene-user@jakarta.apache.org
Sent: Thursday, February 03, 2005 1:19 PM
Subject: Re: Parsing The Query: Every document that doesn't have a field
containing x


>> Negating a term must be combined with at least one nonnegated term to return
>> documents; in other words, it isn't possible to use a query like NOT term to
>> find all documents that don't contain a term.
>>
>> So does that mean the above example wouldn't work?
>
> Exactly. You cannot search for -kcfileupload:jpg, you need at least one
> clause that actually _includes_ documents.
>
> Do you by chance have a field with known contents? If so, you could misuse
> that one and include it in your query (perhaps by doing range or
> wildcard/prefix search). If not, try IndexReader.terms() for building a
> Query yourself, then use that one for search.


Re: Parsing The Query: Every document that doesn't have a field containing x

2005-02-03 Thread Luke Shannon

Hello,

Still working on the same query; here is the code I am currently working
with.

I am thinking this should bring up all the documents that have
olFaithFull=stillHere and kcfileupload!=jpg (so anything else):

query1 = QueryParser.parse("jpg", "kcfileupload", new StandardAnalyzer());
query2 = QueryParser.parse("stillHere", "olFaithFull", new StandardAnalyzer());
BooleanQuery typeNegativeSearch = new BooleanQuery();
typeNegativeSearch.add(query1, false, true);  // required=false, prohibited=true
typeNegativeSearch.add(query2, true, false);  // required=true, prohibited=false

The toString() on the query is:

-kcfileupload:jpg +olFaithFull:stillhere

This looks right to me. Why the 0 results?

Thanks,

Luke
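
For reference, the required/prohibited semantics discussed in this thread can be modeled with plain sets. This is only a sketch under simplifying assumptions (one required and one prohibited clause, no optional clauses, no scoring); BooleanClauseSketch and its search method are hypothetical names, not part of Lucene.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical set-based model of +required -prohibited; not Lucene code.
class BooleanClauseSketch {
    // Each argument is the set of document ids matching that clause.
    // Pass null for "no such clause".
    static Set<Integer> search(Set<Integer> required, Set<Integer> prohibited) {
        // Without a positive clause there is nothing to select from,
        // so a purely negative query yields no documents.
        Set<Integer> result = (required == null)
                ? new TreeSet<Integer>()
                : new TreeSet<Integer>(required);
        if (prohibited != null) {
            result.removeAll(prohibited); // drop documents hit by the prohibited clause
        }
        return result;
    }
}
```

In this model a query like -kcfileupload:jpg on its own returns nothing, while combining it with a required clause that matches every document returns everything except the jpg documents, provided the required clause really does match.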


Re: Parsing The Query: Every document that doesn't have a field containing x

2005-02-03 Thread Maik Schreiber

> -kcfileupload:jpg +olFaithFull:stillhere
> This looks right to me. Why the 0 results?

Looks good to me, too. Are you sure all your documents have 
olFaithFull:stillhere and that there is at least one document with 
kcfileupload not being jpg?

--
Maik Schreiber   *   http://www.blizzy.de -- Get GMail invites here!
GPG public key: http://pgp.mit.edu:11371/pks/lookup?op=getsearch=0x1F11D713
Key fingerprint: CF19 AFCE 6E3D 5443 9599 18B5 5640 1F11 D713

Re: Parsing The Query: Every document that doesn't have a field containing x

2005-02-03 Thread Luke Shannon

Yes. There should be 119 with stillHere, and if I run a query in Luke on
kcfileupload = ppt, it returns one result. I am thinking I should at least
get this result back with: -kcfileupload:jpg +olFaithFull:stillhere?

Luke


Re: Parsing The Query: Every document that doesn't have a field containing x

2005-02-03 Thread Maik Schreiber

> Yes. There should be 119 with stillHere,

You have double-checked that, haven't you? :)

> and if I run a query in Luke on
> kcfileupload = ppt, it returns one result. I am thinking I should at least
> get this result back with: -kcfileupload:jpg +olFaithFull:stillhere?

You really should.
--
Maik Schreiber   *   http://www.blizzy.de -- Get GMail invites here!
GPG public key: http://pgp.mit.edu:11371/pks/lookup?op=getsearch=0x1F11D713
Key fingerprint: CF19 AFCE 6E3D 5443 9599 18B5 5640 1F11 D713

Re: Parsing The Query: Every document that doesn't have a field containing x

2005-02-03 Thread Luke Shannon

I did. I have run both queries in Luke.

kcfileupload:ppt

returns 1

olFaithfull:stillhere

returns 119

Luke


Numbers in the Query String

2005-02-03 Thread Hetan Shah

Hello,
How can one search for a document based on a query that has numbers 
in the query string?

e.g. query = Java 2 Platform J2EE
What do I need to do so that the numbers do not get neglected?
I am using StandardAnalyzer to index the pages and StopAnalyzer to 
search the documents. Would the use of two different analyzers cause any 
trouble for the results?

Thanks.
-H

Re: Numbers in the Query String

2005-02-03 Thread Andrzej Bialecki

Hetan Shah wrote:
> Hello,
> How can one search for a document based on a query that has numbers
> in the query string?
>
> e.g. query = Java 2 Platform J2EE
> What do I need to do so that the numbers do not get neglected?
> I am using StandardAnalyzer to index the pages and StopAnalyzer to
> search the documents. Would the use of two different analyzers cause any
> trouble for the results?

Yes. StopAnalyzer eats all numbers for breakfast. ;-) You need to use 
another analyzer, one that doesn't discard numbers.
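
To see why the digits disappear, here is a simplified model in plain Java of letter-only, lowercasing tokenization, which is the kind of tokenizer StopAnalyzer is built on. LetterTokenizerSketch is a hypothetical name, this is not Lucene's code, and stop-word removal is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of a letter-only lowercasing tokenizer; not Lucene's code.
class LetterTokenizerSketch {
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                // Any non-letter (space, digit, punctuation) ends the token
                // and is itself thrown away.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }
}
```

Feeding it Java 2 Platform J2EE gives the tokens java, platform, j and ee: the standalone 2 is gone, and J2EE is split around its digit.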

--
Best regards,
Andrzej Bialecki
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

Re: Numbers in the Query String

2005-02-03 Thread Otis Gospodnetic

Using different analyzers for indexing and searching is not
recommended.
Your numbers are not even in the index because you are using
StandardAnalyzer.  Use Luke to look at your index.

Otis



Re: Parsing The Query: Every document that doesn't have a field containing x

2005-02-03 Thread Luke Shannon

This works:

query1 = QueryParser.parse("jpg", "kcfileupload", new StandardAnalyzer());
query2 = QueryParser.parse("stillHere", "olFaithFull", new StandardAnalyzer());
BooleanQuery typeNegativeSearch = new BooleanQuery();
typeNegativeSearch.add(query1, false, false);  // optional clause
typeNegativeSearch.add(query2, false, false);  // optional clause

It returns 9 results, and in string form is: kcfileupload:jpg
olFaithFull:stillhere

But this:

query1 = QueryParser.parse("jpg", "kcfileupload", new StandardAnalyzer());
query2 = QueryParser.parse("stillHere", "olFaithFull", new StandardAnalyzer());
BooleanQuery typeNegativeSearch = new BooleanQuery();
typeNegativeSearch.add(query1, true, false);  // required clause
typeNegativeSearch.add(query2, true, false);  // required clause

returns 0 results, and in string form is: +kcfileupload:jpg
+olFaithFull:stillhere

If I do the query kcfileupload:jpg in Luke I get 9 docs, each doc containing
an olFaithFull:stillHere. Why would +kcfileupload:jpg +olFaithFull:stillhere
return no results?

Thanks,

Luke


RE: Parsing The Query: Every document that doesn't have a field containing x

2005-02-03 Thread Kauler, Leto S

First thing that jumps out is case-sensitivity.  Does your olFaithFull
field contain "stillHere" or "stillhere"?

--Leto



CONFIDENTIALITY NOTICE AND DISCLAIMER

Information in this transmission is intended only for the person(s) to whom it 
is addressed and may contain privileged and/or confidential information. If you 
are not the intended recipient, any disclosure, copying or dissemination of the 
information is unauthorised and you should delete/destroy all copies and notify 
the sender. No liability is accepted for any unauthorised use of the 
information contained in this transmission.

This disclaimer has been automatically added.


Re: Parsing The Query: Every document that doesn't have a field containing x

2005-02-03 Thread Luke Shannon

stillHere

Capital H.


RE: Parsing The Query: Every document that doesn't have a field containing x

2005-02-03 Thread Kauler, Leto S

Because you are building the query with QueryParser rather than a
TermQuery, all search terms in the query are being lowercased by
StandardAnalyzer.

So your query of olFaithFull:stillhere requires that there is an exact
index term of "stillhere" in that field.  It depends on how you built
the index (indexed and stored fields are different), but I would check on
that.  Also maybe try out TermQuery and see if that does anything for
you.
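
The mismatch can be shown with a small model in plain Java. It assumes the field value was indexed as one untokenized term exactly as written ("stillHere"), while the query side lowercases; CaseMismatchSketch and its methods are hypothetical names and only stand in for the analyzer and the term lookup.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical model of query-side lowercasing versus exact term lookup.
class CaseMismatchSketch {
    // Stand-in for query analysis that lowercases the term.
    static String analyze(String queryTerm) {
        return queryTerm.toLowerCase(Locale.ROOT);
    }

    // Term lookup is exact and case-sensitive.
    static boolean hasTerm(Set<String> indexedTerms, String term) {
        return indexedTerms.contains(term);
    }
}
```

With "stillHere" as the indexed term, looking up analyze("stillHere") fails because it searches for "stillhere". Either indexing the value in lower case or bypassing analysis and looking up the exact term (the TermQuery route) makes the lookup succeed.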



Re: Parsing The Query: Every document that doesn't have a field containing x

2005-02-03 Thread Luke Shannon

Bingo! Nice catch. That was it. Made everything lower case when I set the
field. Works great now.

Thanks!

Luke


Optimize not deleting all files

2005-02-03 Thread yahootintin . 1247688

Hi,

When I run an optimize in our production environment, old index files are
left in the directory and are not deleted.

My understanding is that an optimize will create new index files and all
existing index files should be deleted.  Is this correct?

We are running Lucene 1.4.2 on Windows.

Any help is appreciated.  Thanks!


Re: Optimize not deleting all files

2005-02-03 Thread

Your understanding is right!

Optimize builds new files, and the old existing files should then be deleted.



Re: Parsing The Query: Every document that doesn't have a field containing x

2005-02-03 Thread

I think you can use a filter to get the right result.
See the example below:

package lia.advsearching;

import junit.framework.TestCase;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.QueryFilter;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.RAMDirectory;

public class SecurityFilterTest extends TestCase {
  private RAMDirectory directory;

  protected void setUp() throws Exception {
    directory = new RAMDirectory();
    IndexWriter writer = new IndexWriter(directory,
        new WhitespaceAnalyzer(), true);

    // Elwood
    Document document = new Document();
    document.add(Field.Keyword("owner", "elwood"));
    document.add(Field.Text("keywords", "elwoods sensitive info"));
    writer.addDocument(document);

    // Jake
    document = new Document();
    document.add(Field.Keyword("owner", "jake"));
    document.add(Field.Text("keywords", "jakes sensitive info"));
    writer.addDocument(document);

    writer.close();
  }

  public void testSecurityFilter() throws Exception {
    // Without a filter, the query matches both owners' documents.
    TermQuery query = new TermQuery(new Term("keywords", "info"));

    IndexSearcher searcher = new IndexSearcher(directory);
    Hits hits = searcher.search(query);
    assertEquals("Both documents match", 2, hits.length());

    // The filter restricts results to documents owned by jake.
    QueryFilter jakeFilter = new QueryFilter(
        new TermQuery(new Term("owner", "jake")));

    hits = searcher.search(query, jakeFilter);
    assertEquals(1, hits.length());
    assertEquals("elwood is safe",
        "jakes sensitive info", hits.doc(0).get("keywords"));
  }

}


On Thu, 3 Feb 2005 13:04:50 -0500, Luke Shannon
[EMAIL PROTECTED] wrote:
> Hello;
>
> I have a query that finds documents that contain fields with a specific
> value.
>
> query1 = QueryParser.parse("jpg", "kcfileupload", new StandardAnalyzer());
>
> This works well.
>
> I would like a query that finds documents containing all kcfileupload fields
> that don't contain jpg.
>
> The example I found in the book that seems to relate shows me how to find
> documents without a specific term:
>
> QueryParser parser = new QueryParser("contents", analyzer);
> parser.setOperator(QueryParser.DEFAULT_OPERATOR_AND);
>
> But then it says:
>
> Negating a term must be combined with at least one nonnegated term to return
> documents; in other words, it isn't possible to use a query like NOT term to
> find all documents that don't contain a term.
>
> So does that mean the above example wouldn't work?
>
> The API says:
>
> a plus (+) or a minus (-) sign, indicating that the clause is required or
> prohibited respectively;
>
> I have been playing around with using the minus character without much luck.
>
> Can someone point me in the right direction to figure this out?
>
> Thanks,
>
> Luke
 

Re: Numbers in the Query String

2005-02-03 Thread

I agree with their viewpoint!


