RE: Does Escaping Really Work?

2002-11-26 Thread Spencer, Dave
My understanding is that escaping may not work (as Terry and I believe);
however, a workaround for most 'reasonable' cases is to use
WhitespaceAnalyzer when parsing a query.
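
A minimal sketch of that workaround (using the 1.2/1.3-era static
QueryParser.parse(String, String, Analyzer) API that appears later in this
digest; the field name and query string are illustrative):

import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class EscapeWorkaround
{
    public static void main(String[] args) throws Exception
    {
        String q = "path:1102/a55407.xml";

        // StandardAnalyzer re-tokenizes the term text after parsing, so a
        // slash-separated path can end up split into several terms.
        Query std = QueryParser.parse(q, "body", new StandardAnalyzer());
        System.out.println("StandardAnalyzer  : " + std.toString("body"));

        // WhitespaceAnalyzer splits only on whitespace, so the path comes
        // through as the single term the untokenized field was indexed with.
        Query ws = QueryParser.parse(q, "body", new WhitespaceAnalyzer());
        System.out.println("WhitespaceAnalyzer: " + ws.toString("body"));
    }
}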


-Original Message-
From: Terry Steichen [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, November 26, 2002 1:48 PM
To: Lucene Users List
Subject: Re: Does Escaping Really Work?


Well, pardon me for breathing, Otis.

I didn't make the connection (partly 'cause you changed the subject line).
But anyway, I don't understand your rather oblique answer - does escaping
work or not?  Are you saying that, in order for it to work (the way the
docs say it does), I need to insert this module in the chain?  Or what?

Terry

- Original Message -
From: Otis Gospodnetic [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Tuesday, November 26, 2002 3:07 PM
Subject: Re: Does Escaping Really Work?


 Didn't I just answer this last night?
 WhitespaceAnalyzer?

 Otis

 --- Terry Steichen [EMAIL PROTECTED] wrote:
  I'm confused about how to use escape characters in Lucene.  My Lucene
  configuration is 1.3-dev1 and I use the StandardAnalyzer and
  QueryParser.

  My documents have a field called 'path' with a value like
  1102/a55407-2002nov2.xml.  This field is indexed but not tokenized.
  Here are the various queries I've tried and their results:

  1) When a dash is included in the query, Lucene interprets this as a
  space. (path:1102/a55402-2002nov2.xml is interpreted as
  path:1102/a55402 -body:2002nov2.xml)

  2) When a backslash is inserted before the dash (and the query does
  *not* contain a wildcard), Lucene interprets this by inserting a space
  in lieu of the next character. ('path:1102/a55402\-2002nov2.xml' is
  interpreted as 'path:1102/a55402 2002nov2.xml' [note the space where
  the dash was])

  3) When a backslash is inserted before the dash (and the query *does*
  contain a wildcard), Lucene interprets this literally, without any
  conversion. (path:1102/55407\-2002nov* is interpreted literally.)

  4) When a backslash is inserted before the dash and immediately
  followed by a wildcard, Lucene reports an error.
  ('path:1102/a55407\-*' causes "lexical error: Encountered <EOF>
  after :")

  My overall observation is that it appears it is not possible to
  escape a dash - is this true?

  A previous post (yesterday) suggests that it is also not possible to
  escape a backslash.  If that's also true, what characters can be
  escaped?
 
 
  Regards,
 
  Terry
 
 
 
 






RE: Does Escaping Really Work?

2002-11-26 Thread Spencer, Dave
I suspect that to dig deeper we'll have to look at QueryParser.jj.

-Original Message-
From: Terry Steichen [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, November 26, 2002 3:11 PM
To: Lucene Users List
Subject: Re: Does Escaping Really Work?


Dave,

I would say you seem to be right.  But this is getting very frustrating.
Here is what the Lucene docs say:

<docs quote>
Lucene supports escaping special characters that are part of the query
syntax. The current list of special characters is

+ - && || ! ( ) { } [ ] ^ " ~ * ? : \

To escape these characters use the \ before the character. For example, to
search for (1+1):2 use the query:

 \(1\+1\)\:2

</docs quote>

Is the Lucene documentation in error?  Does it work, but only using
something other than the standard configuration?  If so, precisely what
non-standard configuration is necessary?

Why can't these questions be answered simply and clearly?

Terry


- Original Message -
From: Spencer, Dave [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Tuesday, November 26, 2002 5:02 PM
Subject: RE: Does Escaping Really Work?


My understanding is that escaping may not work (as Terry and I believe);
however, a workaround for most 'reasonable' cases is to use
WhitespaceAnalyzer when parsing a query.


-Original Message-
From: Terry Steichen [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, November 26, 2002 1:48 PM
To: Lucene Users List
Subject: Re: Does Escaping Really Work?


Well, pardon me for breathing, Otis.

I didn't make the connection (partly 'cause you changed the subject line).
But anyway, I don't understand your rather oblique answer - does escaping
work or not?  Are you saying that, in order for it to work (the way the
docs say it does), I need to insert this module in the chain?  Or what?

Terry

- Original Message -
From: Otis Gospodnetic [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Tuesday, November 26, 2002 3:07 PM
Subject: Re: Does Escaping Really Work?


 Didn't I just answer this last night?
 WhitespaceAnalyzer?

 Otis

 --- Terry Steichen [EMAIL PROTECTED] wrote:
  I'm confused about how to use escape characters in Lucene.  My Lucene
  configuration is 1.3-dev1 and I use the StandardAnalyzer and
  QueryParser.

  My documents have a field called 'path' with a value like
  1102/a55407-2002nov2.xml.  This field is indexed but not tokenized.
  Here are the various queries I've tried and their results:

  1) When a dash is included in the query, Lucene interprets this as a
  space. (path:1102/a55402-2002nov2.xml is interpreted as
  path:1102/a55402 -body:2002nov2.xml)

  2) When a backslash is inserted before the dash (and the query does
  *not* contain a wildcard), Lucene interprets this by inserting a space
  in lieu of the next character. ('path:1102/a55402\-2002nov2.xml' is
  interpreted as 'path:1102/a55402 2002nov2.xml' [note the space where
  the dash was])

  3) When a backslash is inserted before the dash (and the query *does*
  contain a wildcard), Lucene interprets this literally, without any
  conversion. (path:1102/55407\-2002nov* is interpreted literally.)

  4) When a backslash is inserted before the dash and immediately
  followed by a wildcard, Lucene reports an error.
  ('path:1102/a55407\-*' causes "lexical error: Encountered <EOF>
  after :")

  My overall observation is that it appears it is not possible to
  escape a dash - is this true?

  A previous post (yesterday) suggests that it is also not possible to
  escape a backslash.  If that's also true, what characters can be
  escaped?
 
 
  Regards,
 
  Terry
 
 
 
 






RE: Slash Problem

2002-11-25 Thread Spencer, Dave
Funny, I have more or less the same question I've been meaning to post.
I think the answer is going to be that the analyzer applies to all parts
of a query, even to untokenized fields, which to me seems wrong.

So I think if you have a query like

body:foo uri:/alpha/beta

With 'body' being tokenized and 'uri' not tokenized, I think that
the analyzer applies to /alpha/beta and breaks it into "alpha" "beta",
which is not desired...
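
A minimal sketch of that mismatch (1.2-era API, with the field names from
the example above; the exact printed form is indicative only):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class KeywordMismatch
{
    public static void main(String[] args) throws Exception
    {
        // Index side: 'uri' is added with Field.Keyword, so it lives in
        // the index as the single untokenized term "/alpha/beta".
        Document doc = new Document();
        doc.add(Field.Text("body", "foo"));
        doc.add(Field.Keyword("uri", "/alpha/beta"));

        // Query side: QueryParser still runs the analyzer over the term
        // text, so "/alpha/beta" is broken into "alpha" and "beta" -
        // terms that can never match the untokenized keyword above.
        Query q = QueryParser.parse("body:foo uri:/alpha/beta", "body",
                                    new StandardAnalyzer());
        System.out.println(q.toString("body"));
    }
}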


-Original Message-
From: Terry Steichen [mailto:[EMAIL PROTECTED]]
Sent: Monday, November 25, 2002 9:26 AM
To: Lucene Users List
Subject: Re: Slash Problem


Rob,

I presume that means that you used backslashes (in the url) rather than
forward slashes (in the path).  I had planned to test that as a workaround
and it's good to know that you've already tested that successfully.

But why is this necessary?  Why doesn't the escape ('\') allow the use of
a backslash?

Regards,

Terry

- Original Message -
From: Rob Outar [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Monday, November 25, 2002 12:01 PM
Subject: RE: Slash Problem


 I don't know if this helps, but I had the exact same problem.  I then
 stored the URI instead of the path, and was then able to search on the
 URI.

 Thanks,

 Rob


 -Original Message-
 From: Terry Steichen [mailto:[EMAIL PROTECTED]]
 Sent: Monday, November 25, 2002 11:53 AM
 To: Lucene Users Group
 Subject: Slash Problem


 I've got a Text field (tokenized, indexed, stored) called 'path' which
 contains a string in the form of '1102\A3345-12RT.XML'.  When I submit a
 query like "path:1102*" it works fine.  But, when I try to be more
 specific (such as "path:1102\a*" or "path:1102*a*") it fails.  I've tried
 escaping the backslash ("path:1102\\a*") but that also fails.

 I'm using the StandardAnalyzer and the default QueryParser.  Could anyone
 suggest what's going wrong here?

 Regards,

 Terry







RE: PDF parser

2002-11-25 Thread Spencer, Dave
I've tried all 3 of those and none have worked out for me.
Our intranet has 802 PDFs from lots of (vendor) sources and
all the pure java parsers have trouble w/ some of them.
I've since gone to pdftotext from xpdf at the link below.
True, not pure java, but it works on all platforms
w/ my doc set and I suggest people use it, esp if they
have any troubles w/ the java stuff below.

http://www.foolabs.com/xpdf/

problems: some java parsers have trouble w/ the dummy encryption
used, some parsers go into loops w/ some docs, and some parsers
crash on some docs. Yes, I've reported some of these problems to the
authors.
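
For what it's worth, a sketch of shelling out to pdftotext from Java
(assumes the binary is on the PATH; pdftotext takes the source PDF and an
output text file as arguments):

import java.io.File;

public class PdfToText
{
    // Convert one PDF to plain text by invoking xpdf's pdftotext.
    public static void convert(File pdf, File txt) throws Exception
    {
        Process p = Runtime.getRuntime().exec(new String[] {
            "pdftotext", pdf.getPath(), txt.getPath()
        });
        if (p.waitFor() != 0)
            throw new Exception("pdftotext failed on " + pdf);
    }
}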

-Original Message-
From: Borkenhagen, Michael (ofd-ko zdfin)
[mailto:[EMAIL PROTECTED]]
Sent: Friday, November 22, 2002 6:42 AM
To: 'Lucene Users List'
Subject: AW: PDF parser


There are different parsers available - every parser has different
advantages and disadvantages.
I use a combination of PDFBox http://www.pdfbox.org/ and Etymon PJ
http://www.etymon.com/pjc/, because their APIs are very simple. Both of
them parse PDF into a format of their own and provide interfaces to get at
the PDF document's contents.

Other developers on this list prefer JPedal http://www.jpedal.org/, which
parses PDF into XML and provides an XML tree with the PDF document's
contents. JPedal does the job best, but the documentation isn't very
detailed.

Micha

-Original Message-
From: Thomas Chacko [mailto:[EMAIL PROTECTED]]
Sent: Friday, 22 November 2002 15:26
To: Lucene Users List
Subject: PDF parser


What's the best parser available to extract text from PDF documents?
Expecting a reply ASAP.

Thanks in advance
Thomas Chacko






RE: Book

2002-11-25 Thread Spencer, Dave

I didn't see anyone mention my favorite text, Managing Gigabytes.
My amazon link is:
http://www.amazon.com/exec/obidos/ASIN/1558605703/tropoA



-Original Message-
From: William W [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, November 20, 2002 12:14 PM
To: [EMAIL PROTECTED]
Subject: Book



I would like to buy a book about Lucene.
Who could write it ? : )





RE: Slash Problem

2002-11-25 Thread Spencer, Dave
OK, sorry for the noise then.
If I can reproduce I'll be more precise.


-Original Message-
From: Terry Steichen [mailto:[EMAIL PROTECTED]]
Sent: Monday, November 25, 2002 12:13 PM
To: Lucene Users List
Subject: Re: Slash Problem


Dave,

My recent testing suggests that when the field is not tokenized, it is not
split as you suggest.  When I search the path field using path:1102/A* I
get precisely what I am looking for (though I discovered the lowercase
mechanism isn't applied to this field and the query is case-sensitive -
note the uppercase 'A' above).

Regards,

Terry

- Original Message -
From: Spencer, Dave [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Monday, November 25, 2002 2:58 PM
Subject: RE: Slash Problem


Funny, I have more or less the same question I've been meaning to post.
I think the answer is going to be that the analyzer applies to all parts
of a query, even to untokenized fields, which to me seems wrong.

So I think if you have a query like

body:foo uri:/alpha/beta

With 'body' being tokenized and 'uri' not tokenized, I think that
the analyzer applies to /alpha/beta and breaks it into "alpha" "beta",
which is not desired...


-Original Message-
From: Terry Steichen [mailto:[EMAIL PROTECTED]]
Sent: Monday, November 25, 2002 9:26 AM
To: Lucene Users List
Subject: Re: Slash Problem


Rob,

I presume that means that you used backslashes (in the url) rather than
forward slashes (in the path).  I had planned to test that as a workaround
and it's good to know that you've already tested that successfully.

But why is this necessary?  Why doesn't the escape ('\') allow the use of
a backslash?

Regards,

Terry

- Original Message -
From: Rob Outar [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Monday, November 25, 2002 12:01 PM
Subject: RE: Slash Problem


 I don't know if this helps, but I had the exact same problem.  I then
 stored the URI instead of the path, and was then able to search on the
 URI.

 Thanks,

 Rob


 -Original Message-
 From: Terry Steichen [mailto:[EMAIL PROTECTED]]
 Sent: Monday, November 25, 2002 11:53 AM
 To: Lucene Users Group
 Subject: Slash Problem


 I've got a Text field (tokenized, indexed, stored) called 'path' which
 contains a string in the form of '1102\A3345-12RT.XML'.  When I submit a
 query like "path:1102*" it works fine.  But, when I try to be more
 specific (such as "path:1102\a*" or "path:1102*a*") it fails.  I've tried
 escaping the backslash ("path:1102\\a*") but that also fails.

 I'm using the StandardAnalyzer and the default QueryParser.  Could anyone
 suggest what's going wrong here?

 Regards,

 Terry







test case - RE: Slash Problem

2002-11-25 Thread Spencer, Dave

I'm sure there's something that I'm missing here.
Let's say we have an index of a web site with 2 fields,
body and url.
Body is formed via Field.Text(..., Reader) and the url field by
Field.Keyword(); thus the URL is not tokenized but is searchable.

I use StandardAnalyzer and I want to find
the Document with a matching URL, and I want
to use QueryParser to parse the users queries.

I'm using v1.2.

It seems that, if I'm correct, one design problem is that the Analyzer
does not have a reference to an index, so it doesn't know
if a field has been tokenized. It probably should not tokenize
queries against an untokenized field. AFAIK, queries against
untokenized fields are always tokenized and there is no way to tell
the QueryParser not to tokenize a field.

I have attached a test program that shows the behavior and
sample output.
The From: lines are user queries.
The To: lines are the result of calling QueryParser and then
Query.toString().

The 3rd and 4th From/To lines below are the key ones.
The goal is to enter a query like url:http://www.tropo.com/
or url:"http://www.tropo.com/" and not tokenize the
'http://www.tropo.com/'.
I tried backslashes too, to no avail (url:http\://www.tropo.com/).


==
C:\proj\tropo_java>java com.tropo.lucene.KeywordProblem
From: foo
To  : foo

From: body:foo
To  : body:foo

From: url:http://www.tropo.com/      <-- first attempt
To  : http                           <-- first problem, ok, we gotta quote

From: url:"http://www.tropo.com/"    <-- second attempt
To  : "http www.tropo.com"           <-- second problem, colon and slashes missing



==
package com.tropo.lucene;

import java.io.*;
import java.util.*;

import org.apache.lucene.analysis.*;
import org.apache.lucene.analysis.standard.*;
import org.apache.lucene.search.*;
import org.apache.lucene.queryParser.*;

public class KeywordProblem
{
    /**
     * Parses a few sample queries against the 'url' default field and
     * prints how QueryParser rewrites each one.
     */
    public static void main(String[] args)
        throws Throwable
    {
        String body = "body";
        String url = "url";

        String[] lines = new String[] {
            "foo",
            "body:foo",
            "url:http://www.tropo.com/",
            "url:\"http://www.tropo.com/\""
        };

        Analyzer a = new StandardAnalyzer();
        for ( int i = 0; i < lines.length; i++)
        {
            Query query = QueryParser.parse( lines[i], url, a);
            o.println( "From: " + lines[i]);
            o.println( "To  : " + query.toString( url));
            o.println();
        }
    }
    private static PrintStream o = System.out;
}




-Original Message-
From: Terry Steichen [mailto:[EMAIL PROTECTED]]
Sent: Monday, November 25, 2002 12:13 PM
To: Lucene Users List
Subject: Re: Slash Problem


Dave,

My recent testing suggests that when the field is not tokenized, it is not
split as you suggest.  When I search the path field using path:1102/A* I
get precisely what I am looking for (though I discovered the lowercase
mechanism isn't applied to this field and the query is case-sensitive -
note the uppercase 'A' above).

Regards,

Terry

- Original Message -
From: Spencer, Dave [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Monday, November 25, 2002 2:58 PM
Subject: RE: Slash Problem


Funny, I have more or less the same question I've been meaning to post.
I think the answer is going to be that the analyzer applies to all parts
of a query, even to untokenized fields, which to me seems wrong.

So I think if you have a query like

body:foo uri:/alpha/beta

With 'body' being tokenized and 'uri' not tokenized, I think that
the analyzer applies to /alpha/beta and breaks it into "alpha" "beta",
which is not desired...


-Original Message-
From: Terry Steichen [mailto:[EMAIL PROTECTED]]
Sent: Monday, November 25, 2002 9:26 AM
To: Lucene Users List
Subject: Re: Slash Problem


Rob,

I presume that means that you used backslashes (in the url) rather than
forward slashes (in the path).  I had planned to test that as a workaround
and it's good to know that you've already tested that successfully.

But why is this necessary?  Why doesn't the escape ('\') allow the use of
a backslash?

Regards,

Terry

- Original Message -
From: Rob Outar [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Monday, November 25, 2002 12:01 PM
Subject: RE: Slash Problem


  I don't know if this helps, but I had the exact same problem.  I then
  stored the URI instead of the path, and was then able to search on the
  URI.

 Thanks,

 Rob


 -Original Message-
 From: Terry Steichen [mailto:[EMAIL PROTECTED]]
 Sent: Monday

RE: test case - RE: Slash Problem

2002-11-25 Thread Spencer, Dave
Good point, though I thought the rule was that you were supposed
to use the same Analyzer on your query as you built the
index with.

Of course I suspect that this will break down if the
Field.Keyword text has spaces in it.

But: it gets past all reasonable uri/url/filename cases so thanks.


-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]]
Sent: Monday, November 25, 2002 7:23 PM
To: Lucene Users List
Subject: Re: test case - RE: Slash Problem


Maybe there is a good reason for using WhitespaceAnalyzer in
TestQueryParser.java :).  Try it.

public void testEscaped() throws Exception {
    Analyzer a = new WhitespaceAnalyzer();
    assertQueryEquals("\\[brackets", a, "\\[brackets");
    assertQueryEquals("\\[brackets", null, "brackets");
    assertQueryEquals("\\\\", a, "\\\\");
    assertQueryEquals("\\+blah", a, "\\+blah");
    assertQueryEquals("\\(blah", a, "\\(blah");
}

Otis

--- Spencer, Dave [EMAIL PROTECTED] wrote:
 
 I'm sure there's something that I'm missing here.
 Let's say we have an index of a web site with 2 fields,
 body, and url.
 Body is formed via Field.Text(...,Reader) and the url field by 
 Field.Keyword(), thus the URL is not tokenized but is searchable.
 
 I use StandardAnalyzer and I want to find
 the Document with a matching URL, and I want
 to use QueryParser to parse the users queries.
 
 I'm using v1.2.
 
 It seems that, if I'm correct, one design problem is that the Analyzer
 does not have a reference to an index, so it doesn't know
 if a field has been tokenized. It probably should not tokenize
 queries against an untokenized field. AFAIK, queries against
 untokenized fields are always tokenized and there is no way to tell
 the QueryParser not to tokenize a field.
 
 I have attached a test program that shows the behavior and
 sample output.
 The From: lines are user queries.
 The To: lines are the result of calling QueryParser and then
 Query.toString().
 
 The 3rd and 4th From/To lines below are the key ones.
 The goal is to enter a query like url:http://www.tropo.com/
 or url:"http://www.tropo.com/" and not tokenize the
 'http://www.tropo.com/'.
 I tried backslashes too, to no avail (url:http\://www.tropo.com/).
 


 ==
 C:\proj\tropo_java>java com.tropo.lucene.KeywordProblem
 From: foo
 To  : foo

 From: body:foo
 To  : body:foo

 From: url:http://www.tropo.com/      <-- first attempt
 To  : http                           <-- first problem, ok, we gotta quote

 From: url:"http://www.tropo.com/"    <-- second attempt
 To  : "http www.tropo.com"           <-- second problem, colon and slashes missing
 
 


 ==
 package com.tropo.lucene;

 import java.io.*;
 import java.util.*;

 import org.apache.lucene.analysis.*;
 import org.apache.lucene.analysis.standard.*;
 import org.apache.lucene.search.*;
 import org.apache.lucene.queryParser.*;

 public class KeywordProblem
 {
     /**
      * Parses a few sample queries against the 'url' default field and
      * prints how QueryParser rewrites each one.
      */
     public static void main(String[] args)
         throws Throwable
     {
         String body = "body";
         String url = "url";

         String[] lines = new String[] {
             "foo",
             "body:foo",
             "url:http://www.tropo.com/",
             "url:\"http://www.tropo.com/\""
         };

         Analyzer a = new StandardAnalyzer();
         for ( int i = 0; i < lines.length; i++)
         {
             Query query = QueryParser.parse( lines[i], url, a);
             o.println( "From: " + lines[i]);
             o.println( "To  : " + query.toString( url));
             o.println();
         }
     }
     private static PrintStream o = System.out;
 }
 
 
 
 
 -Original Message-
 From: Terry Steichen [mailto:[EMAIL PROTECTED]]
 Sent: Monday, November 25, 2002 12:13 PM
 To: Lucene Users List
 Subject: Re: Slash Problem
 
 
 Dave,
 
 My recent testing suggests that when the field is not tokenized, it is
 not split as you suggest.  When I search the path field using
 path:1102/A* I get precisely what I am looking for (though I discovered
 the lowercase mechanism isn't applied to this field and the query is
 case-sensitive - note the uppercase 'A' above).
 
 Regards,
 
 Terry
 
 - Original Message -
 From: Spencer, Dave [EMAIL PROTECTED]
 To: Lucene Users List [EMAIL PROTECTED]
 Sent: Monday, November 25, 2002 2:58 PM
 Subject: RE: Slash Problem
 
 
 Funny, I have more or less the same question I've been meaning to post.
 I think the answer is going to be that the analyzer applies to all parts
 of a query, even to untokenized fields, which to me seems wrong.

 So I think if you have a query like

 body:foo uri:/alpha/beta

 With 'body

RE: How to get all field names

2002-11-12 Thread Spencer, Dave
This fragment (from a JSP page..) should dump the
fields for an index in alphabetical order - this
is not precisely what you're asking however -this is all
the fields used in an *index*, not a document, but
anyway maybe this helps:


IndexReader r = IndexReader.open( indexName);
TermEnum te = r.terms();
Set s = new TreeSet();
while ( te.next())
{
Term t = te.term();
s.add( t.field());
}
te.close();
r.close();

o.println( "These are all the fields in the index and they can be searched on...<p>");
Iterator it = s.iterator();
while ( it.hasNext())
{
    o.println( it.next() + "<br>");
}

-Original Message-
From: Christoph Kiehl [mailto:kiehl;subshell.com]
Sent: Tuesday, November 12, 2002 1:11 AM
To: [EMAIL PROTECTED]
Subject: How to get all field names


Hi,

I was wondering if there is a possibility to get a list of all field names
that have ever been used to index a document?  This way I could filter out
some special fields, like identity and such, and do a search over the
remaining ones.  That would give me total freedom to choose any document
structure and have all fields searched.  Is this possible?  Or does anyone
of you have a better way of achieving that?

Regards
Christoph






the code - RE: Indexing synonyms

2002-11-12 Thread Spencer, Dave
 terms in the index but I am pressured on time. Something more
sophisticated would be to expand terms depending on the word sense, but
this requires the expensive process of building word sense disambiguation.
This would solve the problem mentioned by Joshua, like 'minute' (time
period) and 'minute' (very small). However, this is no easy task, and time
consuming!!! Perhaps in my case doing a query expansion is the best idea
and will solve all the hassle, but I am still thinking which way to go.

Regarding the question how things will be stored in the index it is as
you say Otis:
Document1:
   word: word1
 word1synonym1
 word1synonym2
 word1synonym3
But not sure whether I understood your question.

regards
Aaron
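
A sketch of the term-expansion filter being discussed - overriding next()
as Aaron describes further down - assuming the 1.2-era TokenFilter/Token
API; lookupSynonyms() is a hypothetical stand-in:

import java.io.IOException;
import java.util.LinkedList;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class SynonymFilter extends TokenFilter
{
    private LinkedList pending = new LinkedList();

    public SynonymFilter(TokenStream in)
    {
        input = in;
    }

    public Token next() throws IOException
    {
        // Drain queued synonyms before advancing the underlying stream.
        if (!pending.isEmpty())
            return (Token) pending.removeFirst();

        Token t = input.next();
        if (t == null)
            return null;

        String[] syns = lookupSynonyms(t.termText()); // hypothetical lookup
        for (int i = 0; i < syns.length; i++)
        {
            Token syn = new Token(syns[i], t.startOffset(), t.endOffset());
            // Zero increment stacks the synonym on the original term's
            // position, per the setPositionIncrement() note quoted below.
            syn.setPositionIncrement(0);
            pending.addLast(syn);
        }
        return t;
    }

    // Stand-in: a real implementation would consult WordNet or similar.
    private String[] lookupSynonyms(String term)
    {
        return new String[0];
    }
}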



- Original Message -
From: Otis Gospodnetic [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Monday, November 11, 2002 8:22 PM
Subject: RE: Indexing synonyms


 I always thought that WordNet was not accessible to the general public.
 Wrong?

 Also, I'm curious - what would you use for storing synonyms? Are you 
 considering using a 'static', read-only Lucene index maybe? An index 
 that makes use of setPosition(0) calls to store synonyms like this, 
 for instance:

 Document1:
   word: word1
 word1synonym1
 word1synonym2
 word1synonym3

 ...

 DocumentN:
   word: wordN
   wordNsynonym1
   wordNsynonym2
   wordNsynonym3


 Unless I am missing something, and if a synonym database is available,
 this would be pretty easy to implement, no?

 Otis





 --- Spencer, Dave [EMAIL PROTECTED] wrote:
  Re reducing the set of question/answer pairs to consider below - I
  would expect that using synonyms, either in the index or in the
  reformed query, would (annoyingly) increase the number of potential
  matches, or is there something I'm missing?

  Interesting that this topic just came up, as I wanted to experiment
  w/ the same thing. My first stab at a public domain synonym list,
  the moby list, didn't seem to have synonyms however. I believe
  another poster mentioned WordNet so I'll try that.

  I'd really like it if it was possible to automatically determine
  synonyms - maybe something similar to Latent Semantic Analysis - but
  such things seem kinda hard to code up...
 
 
  -Original Message-
  From: Aaron Galea [mailto:agale;nextgen.net.mt]
  Sent: Sunday, November 10, 2002 4:18 PM
  To: Lucene Users List; [EMAIL PROTECTED]
  Subject: Re: Indexing synonyms
 
 
  Thanks for all your replies,
 
  Well I will start off with an idea of what I am trying to achieve. I
  am building a question-answering system and one of its modules is an
  FAQ Module.
  Since the QA system is concerned with education, users can
  concentrate
  their
  question on a particular subject reducing the set of question/answer
  pair to
  consider. Since there is this hierarchical indexing the index files
  are
  not
  that big so I can store synonyms for each word in a question in the
  index.
  Query expansion will solve the problem and eliminate the need to store
  synonyms in the index, but this will slow things as there is no depth
  limit to consider for term expansion. It is not my intention to build
  something
  similar to the FAQFinder system but I want to further reduce the
  subset
  of
  questions to consider on which a question reformulation algorithm
  would
  be
  applied. Therefore the idea is get an faq file dealing with one
  education
  subject, index all of its questions and expand each term in the
  question.
  Using lucene I will retrieve the questions that are likely to be
  similar
  to
  a user question, select say the top 5 and apply a query
reformulation
  algorithm. If this succeeds fine and I return the answer to user,
  otherwise
  submit the question to an answer extraction module. The most
  important
  thing
  is speed so putting term expansion in the index hopefully should
  improve
  things. Obviously problems arise with this method as there is no
word
  sense
  disambiguation but the query reformulation algorithm will solve
this.
  However it is slow so I must reduce the number of questions it is
  applied
  on. It is a tradeoff!!!
 
  Well I managed to solve this by overriding the next() method and 
  when it gets to an EOS I start returning the new expanded terms that

  I accumulated
  in a list.
 
  Thanks everyone for your reply
 
  Aaron
 
  NB : And yep I am a Malteser Otis ! :)
 
 
  - Original Message -
  From: Alex Murzaku [EMAIL PROTECTED]
  To: 'Lucene Users List' [EMAIL PROTECTED]
  Sent: Monday, November 11, 2002 12:17 AM
  Subject: RE: Indexing synonyms
 
 
   You could also do something with org.apache.lucene.analyzer.Token
  which
   includes the following self-explanatory note:
  
 /** Set the position increment.  This determines the position of
  this
   token
  * relative to the previous Token in a {@link TokenStream}, used
  in
   phrase

RE: Indexing synonyms

2002-11-11 Thread Spencer, Dave
Re reducing the set of question/answer pairs to consider below - I would
expect that using synonyms, either in the index or in the reformed query,
would (annoyingly) increase the number of potential matches, or is there
something I'm missing?

Interesting that this topic just came up, as I wanted to experiment
w/ the same thing. My first stab at a public domain synonym
list, the moby list, didn't seem to have synonyms however.
I believe another poster mentioned WordNet so I'll try that.

I'd really like it if it was possible to automatically determine
synonyms - maybe something similar to Latent Semantic Analysis - but
such things seem kinda hard to code up...


-Original Message-
From: Aaron Galea [mailto:agale;nextgen.net.mt]
Sent: Sunday, November 10, 2002 4:18 PM
To: Lucene Users List; [EMAIL PROTECTED]
Subject: Re: Indexing synonyms


Thanks for all your replies,

Well I will start off with an idea of what I am trying to achieve. I am
building a question-answering system and one of its modules is an FAQ
Module. Since the QA system is concerned with education, users can
concentrate their question on a particular subject, reducing the set of
question/answer pairs to consider. Since there is this hierarchical
indexing, the index files are not that big, so I can store synonyms for
each word in a question in the index. Query expansion would solve the
problem and eliminate the need to store synonyms in the index, but this
will slow things down as there is no depth limit to consider for term
expansion. It is not my intention to build something similar to the
FAQFinder system, but I want to further reduce the subset of questions
to consider on which a question reformulation algorithm would be applied.
Therefore the idea is to get an FAQ file dealing with one education
subject, index all of its questions and expand each term in the question.
Using Lucene I will retrieve the questions that are likely to be similar
to a user question, select say the top 5 and apply a query reformulation
algorithm. If this succeeds, fine, and I return the answer to the user;
otherwise I submit the question to an answer extraction module. The most
important thing is speed, so putting term expansion in the index
hopefully should improve things. Obviously problems arise with this
method as there is no word sense disambiguation, but the query
reformulation algorithm will solve this. However it is slow, so I must
reduce the number of questions it is applied on. It is a tradeoff!!!

Well I managed to solve this by overriding the next() method, and when it
gets to an EOS I start returning the new expanded terms that I
accumulated in a list.

Thanks everyone for your reply

Aaron

NB : And yep I am a Malteser Otis ! :)


- Original Message -
From: Alex Murzaku [EMAIL PROTECTED]
To: 'Lucene Users List' [EMAIL PROTECTED]
Sent: Monday, November 11, 2002 12:17 AM
Subject: RE: Indexing synonyms


 You could also do something with org.apache.lucene.analysis.Token, which
 includes the following self-explanatory note:

   /** Set the position increment.  This determines the position of this
    * token relative to the previous Token in a {@link TokenStream}, used
    * in phrase searching.
    *
    * <p>The default value is one.
    *
    * <p>Some common uses for this are:<ul>
    *
    * <li>Set it to zero to put multiple terms in the same position.
    * This is useful if, e.g., a word has multiple stems.  Searches for
    * phrases including either stem will match.  In this case, all but
    * the first stem's increment should be set to zero: the increment of
    * the first instance should be one.  Repeating a token with an
    * increment of zero can also be used to boost the scores of matches
    * on that token.
    *
    * <li>Set it to values greater than one to inhibit exact phrase
    * matches.  If, for example, one does not want phrases to match
    * across removed stop words, then one could build a stop word filter
    * that removes stop words and also sets the increment to the number
    * of stop words removed before each non-stop word.  Then exact phrase
    * queries will only match when the terms occur with no intervening
    * stop words.
    *
    * </ul>
    * @see TermPositions
    */
   public void setPositionIncrement(int positionIncrement) {
     if (positionIncrement < 0)
       throw new IllegalArgumentException
         ("Increment must be positive: " + positionIncrement);
     this.positionIncrement = positionIncrement;
   }


 --
 Alex Murzaku
 ___
  alex(at)lissus.com  http://www.lissus.com

 -Original Message-
 From: Otis Gospodnetic [mailto:otis_gospodnetic;yahoo.com]
 Sent: Sunday, November 10, 2002 1:30 PM
 To: Lucene Users List
 Subject: Re: Indexing synonyms


 .mt?  Malta?  That's rare! :)

 A person called Clemens Marschner just submitted diffs for query
 rewriting to the lucene-dev list 1-2 weeks ago.  The diffs are not in CVS
 yet, and they are a bit old now because the code they were 

RE: Indexing Db Table -- Better way request

2002-11-08 Thread Spencer, Dave
We have a number of internal systems here (content mgmt, bug db, support
email, CRM), all of which are PHP/MySQL combos - and in all cases Lucene
is used for the indexing, and we have never seen any reason to go to XML
as an intermediate step. We've been at this for 6 months or so.
Only hassle is that if the group that's doing the PHP/MySQL tweaks the
schema, they have to remember to modify the Lucene indexer so that, say,
it picks up the new columns - but there's no way around this unless you
want to be very generic, in which case XML still doesn't give you
anything, since you could just as well use JDBC meta-data to get all
columns...


-Original Message-
From: Michael Caughey [mailto:michael;caughey.com]
Sent: Friday, November 08, 2002 4:21 PM
To: Spencer, Dave; Lucene Users List
Subject: Re: Indexing Db Table -- Better way request


Converting straight to a Document seemed to me the best answer as I
started to investigate.  Somewhere along the line I thought I remembered
seeing a suggestion that it was for some reason better to convert to XML
and then add it as an XML document.  I'd rather not have the hassle of
creating and then later parsing the XML.  I could not find the reference
again.  This in part was what I was hoping to hear.

Thanks,
Michael
- Original Message -
From: Spencer, Dave [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Sent: Friday, November 08, 2002 6:59 PM
Subject: RE: Indexing Db Table -- Better way request


One small comment: what's the point of converting a row to XML?
What I think you want to do is convert a row to a Document and then
pass that off to IndexWriter.
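
A minimal sketch of that row-to-Document path (1.2-era Lucene API plus
plain JDBC; the table and column names follow the schema quoted below, and
the Connection/IndexWriter setup is assumed):

import java.sql.*;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class ItemIndexer
{
    // Turn each row of the 'items' table directly into a Lucene Document.
    public static void indexAll(Connection con, IndexWriter writer)
        throws Exception
    {
        Statement stmt = con.createStatement();
        ResultSet rs = stmt.executeQuery(
            "SELECT ItemId, Name, Description FROM items");
        while (rs.next())
        {
            Document doc = new Document();
            // stored + indexed, untokenized: acts as the primary key
            doc.add(Field.Keyword("id", rs.getString("ItemId")));
            // indexed + tokenized: the searchable text
            doc.add(Field.Text("name", rs.getString("Name")));
            doc.add(Field.Text("description", rs.getString("Description")));
            writer.addDocument(doc);
        }
        rs.close();
        stmt.close();
    }
}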

-Original Message-
From: Caughey, Michael [mailto:mcaughey;trigon.com]
Sent: Friday, November 08, 2002 2:22 PM
To: '[EMAIL PROTECTED]'
Cc: '[EMAIL PROTECTED]'
Subject: Indexing Db Table -- Better way request


Hello,

I'm new to Lucene and this group; if it is improper to send such a message
to this group, I apologize.  I tried to do a reasonable amount of up-front
research before coming here.

I'm about to undertake a piece of my project where I've decided that
Lucene will be of use.  I have been researching, over the past two weeks,
ways to accomplish this.  I know I'll use an IndexWriter to write the
index to a file, but I'm having difficulty settling on how to process the
data to be indexed.

What I have is a table in a MySQL database called items.  I want to be
able to search on a couple of fields and have it return the ID:
Fields:
=
Name VARCHAR (80)
Description TEXT
Location VARCHAR (80)
Qty int
ExpireDate Long MMDD
Category int
ListingPrice FLOAT(9,2)
Supplier int

Return
=
ItemId int


On startup of the application every row in the database will be read.
After that I need to keep the table and the index in sync.  Data in the
columns can change, and rows can be added and removed.  I have a central
entity controller which is responsible for all access to that table.

I figured one approach which would work would be, on startup, to read each
row, build an XML document and submit it to the IndexWriter.
As inserts, deletes and updates occurred I could modify both Lucene and
the database.

Seems simple enough, and may be the only way to handle it.  Before I did
it I wanted to make sure that there wasn't a better way.
Are there documents which can automatically read the table and build a
document?  Should I read the row and just build fields and construct a
document?

Does anyone see any problems with storing it in memory versus writing it
to a file?  Or, should I say, at what point would you consider writing it
to a file - would you base that on total document size?  I feel that a
file index will most likely be just fine.

Thanks in advance for any suggestions.






Michael Caughey










RE: stopwords

2002-10-18 Thread Spencer, Dave
Suggestion - include a ref to
org.apache.lucene.analysis.StopFilter.makeStopTable(),
which some users of this stop word list will use.

Also maybe you want to put in a ref to SMART - this may
be the official download site: ftp://ftp.cs.cornell.edu/pub/smart/
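
For reference, a sketch of wiring such a list through the 1.2-era
StopFilter API (the three-word list and the analyzer composition are
illustrative only):

import java.io.Reader;
import java.util.Hashtable;

import org.apache.lucene.analysis.*;

public class SmartStopAnalyzer extends Analyzer
{
    private static final String[] SMART_STOP_WORDS = { "a", "able", "about" };

    // Build the lookup table once; makeStopTable() hashes the array.
    private static final Hashtable STOP_TABLE =
        StopFilter.makeStopTable(SMART_STOP_WORDS);

    public TokenStream tokenStream(String fieldName, Reader reader)
    {
        // lowercase first so stop-word matching is case-insensitive
        return new StopFilter(new LowerCaseTokenizer(reader), STOP_TABLE);
    }
}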



-Original Message-
From: Otis Gospodnetic [mailto:otis_gospodnetic;yahoo.com]
Sent: Thursday, October 17, 2002 11:31 AM
To: Lucene Users List
Cc: John Caron
Subject: Re: stopwords


Thanks.  I may stick this in the Lucene CVS repository somewhere.

Otis

--- John Caron [EMAIL PROTECTED] wrote:
 I am just starting to use Lucene, and it is very impressive! I hope to
 try Dmitri's new term vectors when he gets them in, in order to do
 vector model research, in particular LSA. I will port my existing code
 to use the Lucene framework, and make it available when it is ready.

 I am appending a longer list of stop words, mostly from SMART, in case
 these are useful to anyone.

 Thanks again!
 
 private static String smart[] =  {
a,
able,
about,
above,
according,
accordingly,
across,
actually,
after,
afterwards,
again,
against,
all,
allow,
allows,
almost,
alone,
along,
already,
also,
although,
always,
am,
among,
amongst,
an,
and,
another,
any,
anybody,
anyhow,
anyone,
anything,
anyway,
anyways,
anywhere,
apart,
appear,
appreciate,
appropriate,
are,
around,
as,
aside,
ask,
asking,
associated,
at,
available,
away,
awfully,
b,
be,
became,
because,
become,
becomes,
becoming,
been,
before,
beforehand,
behind,
being,
believe,
below,
beside,
besides,
best,
better,
between,
beyond,
both,
brief,
but,
by,
c,
came,
can,
cannot,
cant,
cause,
causes,
certain,
certainly,
changes,
clearly,
co,
com,
come,
comes,
concerning,
consequently,
consider,
considering,
contain,
containing,
contains,
corresponding,
could,
course,
currently,
d,
definitely,
described,
despite,
did,
different,
do,
does,
doing,
done,
down,
downwards,
during,
e,
each,
edu,
eg,
eight,
either,
else,
elsewhere,
enough,
entirely,
especially,
et,
etc,
even,
ever,
every,
everybody,
everyone,
everything,
everywhere,
ex,
exactly,
example,
except,
f,
far,
few,
fifth,
first,
five,
followed,
following,
follows,
for,
former,
formerly,
forth,
four,
from,
further,
furthermore,
g,
get,
gets,
getting,
given,
gives,
go,
goes,
going,
gone,
got,
gotten,
greetings,
h,
had,
happens,
hardly,
has,
have,
having,
he,
hello,
help,
hence,
her,
here,
hereafter,
hereby,
herein,
hereupon,
hers,
herself,
hi,
him,
himself,
his,
hither,
hopefully,
how,
howbeit,
however,
i,
ie,
if,
ignored,
immediate,
in,
inasmuch,
inc,
indeed,
indicate,
indicated,
indicates,
inner,
insofar,
instead,
into,
inward,
is,
it,
its,
itself,
j,
just,
k,
keep,
keeps,
kept,
know,
knows,
known,
l,
last,
lately,
later,
latter,
latterly,
least,
less,
lest,
let,
like,
liked,
likely,
little,
look,
looking,
looks,
ltd,
m,
mainly,
many,
may,
maybe,
me,
mean,
meanwhile,
merely,
might,
more,
moreover,
most,
mostly,
much,
must,
my,
myself,
n,
name,
namely,
nd,
near,
nearly,
necessary,
need,
needs,
neither,
never,
nevertheless,
new,
next,
nine,
no,
nobody,
non,
none,
noone,
nor,
normally,
not,
nothing,
novel,
now,
nowhere,
o,
obviously,
of,
off,
often,
oh,
ok,
okay,
old,
on,
once,
one,
ones,
only,
onto,
or,
other,
others,
otherwise,
ought,
our,
ours,
ourselves,
out,
outside,
over,
overall,
own,
p,
particular,
particularly,
per,
perhaps,
placed,
please,
plus,
possible,
presumably,
probably,
provides,
q,
que,
quite,
qv,
r,
rather,
rd,
re,
really,
reasonably,
regarding,
regardless,
regards,
relatively,
respectively,
right,
s,
said,
same,
saw,
say,
saying,
says,
second,

RE: Using Pooled IndexSearchers?

2002-10-18 Thread Spencer, Dave
I/O buffering would certainly be handled by the OS, but in theory the
application can do its own buffering - and in a sense RAMDirectory is an
extreme example of this.
Having an app w/ an adjustable buffer pool gives you more options for
tuning.

-Original Message-
From: Jonathan Pace [mailto:jmpace;fedex.com]
Sent: Thursday, October 17, 2002 10:54 AM
To: Lucene Users List
Subject: RE: Using Pooled IndexSearchers?


The index is only a gig, but of course, optimizing will increase that size
substantially.  At the rate our index grows, it would be better to keep it
in a disk array.

I assume that I/O buffering would be handled by the underlying OS,
wouldn't it?

-jon


-Original Message-
From: Spencer, Dave [mailto:dave;lumos.com]
Sent: Thursday, October 17, 2002 11:45 AM
To: Lucene Users List
Subject: RE: Using Pooled IndexSearchers?


One idea - have you tried searching with a RAMDirectory instead of an
FSDirectory?
If your index fits into memory then this could be a win.
Some notes & code here:

http://www.tropo.com/techno/java/lucene/rammer.html

Note: I know some people have huge indexes that can't fit
into RAM...but I'm sure I've read that Google uses solid state (ram)
disks
in their search farm. Can't find the article however that says this.
Might have been an interview w/ E. Schmidt.

Also: does Lucene have any buffer control in the API?
In theory shouldn't IndexSearcher, or FSDirectory, have control
over buffering of disk blocks?
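
One way to do that (a sketch against the 1.2-era API; assumes the whole
on-disk index fits in RAM, and the index path is illustrative):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.RAMDirectory;

public class RamSearch
{
    public static Searcher open(String path) throws Exception
    {
        // Copy the on-disk index into memory by merging it into a fresh
        // RAM-backed index, then search against the RAMDirectory.
        Directory disk = FSDirectory.getDirectory(path, false);
        RAMDirectory ram = new RAMDirectory();
        IndexWriter writer = new IndexWriter(ram, new StandardAnalyzer(), true);
        writer.addIndexes(new Directory[] { disk });
        writer.close();
        return new IndexSearcher(ram);
    }
}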


-Original Message-
From: Jonathan Pace [mailto:jmpace;fedex.com]
Sent: Thursday, October 17, 2002 8:08 AM
To: Lucene Users List
Subject: Using Pooled IndexSearchers?


Just a question for the group.  Is anyone using, or has anyone
benchmarked, a pooled IndexSearcher setup?  (Especially the Jakarta
Commons POOL implementations.)  I am looking to increase concurrent
search performance because quite a few of our users use DateFiltering,
which dramatically increases search times.

Is it worth the effort?

Thank you in advance.



Jonathan M Pace
Sr Programmer/Analyst
Corporate Portal Development
FedEx Services
60 FedEx Pkwy
1st Floor Horiz
901-263-4744
[EMAIL PROTECTED]






RE: Performance with 5 Millions indexed items

2002-09-10 Thread Spencer, Dave

I have a 1GHz P4 w/ 512MB of RAM and prob a standard 7200 RPM disk.
Running w/ JDK1.4.

I have indexed the content from dmoz.org [maybe I should donate this as a
kind of example] and the index size is 1GB and it has 3.2M docs in it. I
think it takes around 4 hours to produce the index.

Briefly, for one quick test, a fuzzy 2-word search takes 10x as long as
the same search unfuzzy.

Searching for: title:kasparov
35 total matching documents after 1232(ms)

Searching for: title:kasparov title:chess
1046 total matching documents after 1272(ms)

Searching for: title:kasparov~ title:chess~
18965 total matching documents after 11276(ms)

As an aside, you can get the dmoz.org content here:
http://dmoz.org/rdf.html
I indexed content.rdf.u8.gz.
It is invalid XML(!) and I couldn't get several SAX parsers to work, so I
had to use Electric XML.



-Original Message-
From: Mader, Volker [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, September 10, 2002 12:00 AM
To: [EMAIL PROTECTED]
Subject: Performance with 5 Millions indexed items


Hi,

I've got a question about performance with bigger indexes. We used
IndexWriter with GermanAnalyzer to index data with the following fields:

Field1: ID (a long value)
Field2: Description (a free text)
Field3: Groups (a list of up to 10 long values encoded in a single
string)
Field4: Classes (a list of up to 10 long values encoded in a single
string)

Documents are created with the 4 fields and then added to the IndexWriter.
Afterwards the index is optimized.

Searching for a word in the Description field using
IndexSearcher(GermanAnalyzer) with FuzzyQuery leads to search times of up
to 30 seconds on a Pentium 4 1.4GHz.
Also the retrieval with hits.doc(..) is very slow.

Any ideas?

Volker





RE: Question Deleting/Reindexing Files

2002-03-20 Thread Spencer, Dave

[1] There's no update, so delete and then add is what you want.
[2] I have had the same problems w/ using an IndexWriter and IndexReader
at the same time and getting a locking problem when deleting. I think I
sent mail to the list w/ a test case a week ago [disclaimer: this is not
a complaint!] and I think the issue is still open. Maybe I should turn
this into a bug report? I know fixing bugs is encouraged, but I don't
have enough context about the right solution, or how the locking
apparently changed to foul this up, though I did look thru things.
My workaround was to write new entries to a new index and then run
a separate merge utility that first does a delete pass, and then reopens
and does adds, based on a primary key (the URL of each doc in my case).
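
A sketch of that delete-then-add pattern (1.2-era API; the 'url' field is
the primary key and the paths are illustrative). The point is to close the
IndexReader before opening the IndexWriter, so the two never hold the
write lock at the same time:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class Updater
{
    public static void update(String indexPath, String url, Document freshDoc)
        throws Exception
    {
        // Pass 1: delete any stale copy, keyed by the untokenized url field.
        IndexReader reader = IndexReader.open(indexPath);
        reader.delete(new Term("url", url));
        reader.close(); // releases the lock before we open a writer

        // Pass 2: add the fresh document.
        IndexWriter writer =
            new IndexWriter(indexPath, new StandardAnalyzer(), false);
        writer.addDocument(freshDoc);
        writer.close();
    }
}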


-Original Message-
From: Joe Hajek [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, March 20, 2002 12:28 AM
To: [EMAIL PROTECTED]
Subject: Question Deleting/Reindexing Files


Hi,

I am using Lucene for indexing a relatively large article-based system
where articles change from time to time, so I have to reindex them.
Reindexing had the effect that a query would return the hit for a file
multiple times (according to the number of updates).

The only solution to that problem I found was to delete the file to be
updated before indexing it again. Is there another possibility?

As the system is large I am collecting the articles that have to be
updated together, open a writer and add the documents to the index. This
solution worked fine for me using rc1; in rc4 it seems that it is not
possible anymore to delete a file from an index while the index is
opened for writing.

Do you know any solutions to that problem?

Thanks a lot in advance

regards joe





RE: getting relative path after searching

2002-03-18 Thread Spencer, Dave

I think this is a JSP question, not a Lucene question,
and the answer is application.getResource(...)
or application.getRealPath().

http://www.jspinsider.com/reference/jsp/jspapplication.html
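
A sketch of the idea, inside a JSP (hypothetical variable names; assumes
getRealPath("/") returns the web-app root that prefixes the stored path):

String root = application.getRealPath("/");
// e.g. root     = D:/tomcat/webapps/Root/
//      fullPath = D:/tomcat/webapps/Root/Office/Office/xyz.doc
String relative = fullPath.substring(root.length() - 1);
// -> /Office/Office/xyz.doc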

-Original Message-
From: Parag Dharmadhikari [mailto:[EMAIL PROTECTED]]
Sent: Monday, March 18, 2002 6:08 AM
To: Lucene Users List
Subject: getting relative path after searching


Hi all,

When searching is done it gives you the full path of the searched
document, for instance D:/tomcat/webapps/Root/Office/Office/xyz.doc.
Now if I want only the relative path, like /Office/Office/xyz.doc,
instead of the full path, then how should I proceed?

regards
parag






RE: Search result ordering question

2002-03-12 Thread Spencer, Dave

Is this question still pending?
Well, I haven't tried it, but DateFilter might be what you're looking for:

http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/DateFilter.html

You could also add a field that's a kind of enumeration indicating how
recent the doc is.
You add a field "when" with a value of "day", "week", "month", or "year",
to indicate if it is a day old, a week old, etc.
Then you query using a boost:

   when:day^2.0 when:week^1.8 when:month^1.6 when:year^1.4

and priority will be given to newer docs.
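
A sketch of the DateFilter route (1.2-era API; assumes documents were
indexed with a 'date' keyword field produced via DateField, and the index
path and default field are illustrative):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.*;

public class RecentSearch
{
    public static Hits lastWeek(String indexPath, String userQuery)
        throws Exception
    {
        long now = System.currentTimeMillis();
        long weekAgo = now - 7L * 24 * 60 * 60 * 1000;

        Searcher searcher = new IndexSearcher(indexPath);
        Query q = QueryParser.parse(userQuery, "body", new StandardAnalyzer());
        // Keep only hits whose "date" field falls inside the window.
        return searcher.search(q, new DateFilter("date", weekAgo, now));
    }
}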





-Original Message-
From: Kent Vilhelmsen [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, March 12, 2002 12:00 PM
To: [EMAIL PROTECTED]
Subject: Search result ordering question



I've been using Lucene a bit, and find it very flexible and fast.

However, I need to order search results by date (or, equivalently,
document id); I've looked a bit into (re)writing a collect method without
any luck. I'm not programming Java too much, so I'm not getting anywhere
with the (few) hints I've seen regarding date-sorted result sets.

Does anyone have a quick solution/example to give?

thanks,
Kent Vilhelmsen









RE: Deleting documents

2002-03-12 Thread Spencer, Dave

I think I've come across the same problem.
If you have an indexer that adds docs and also deletes docs as it goes
(use case: it's updating old docs or adding new ones), it seems that you
always get an exception like this thrown from IndexReader.delete().

java.io.IOException: Index locked for write:
Lock@C:\tmp\luc\locktest\write.lock

I had code similar to the code below, and then modified it
to explicitly use the same Directory, to no avail.
Approx code:

Directory dir = FSDirectory.getDirectory( indexName, create);
IndexWriter writer = new IndexWriter( dir, ..., create);
IndexReader reader = IndexReader.open( dir);
// now calls to writer.addDocument() work
// if you call reader.delete(int) it fails

I've attached the full src below, though it's a bit messy w/ trace
statements.
Should work fine as an isolation test case.
Uses Windows dir names, sorry to Unix folk.

This fails against rc4 and also the latest build (0312).

I'm positive a few months ago this stuff worked fine.

If this is indeed a bug then I think the IndexReader and IndexWriter
should know they're sharing a Directory, whereas now they don't seem to.

As a side note, I've always found it strange that IndexReader is used to
delete entries. "Reader" to me means read-only, thus I would have
expected IndexWriter to be the thing that is used to add/delete
documents.




-Original Message-
From: Aruna Raghavan [mailto:[EMAIL PROTECTED]]
Sent: Friday, March 08, 2002 10:40 AM
To: 'Lucene Users List'
Subject: Deleting documents


Hi,
Is there anything wrong with the following code?
  try {
      m_lock.write(); // obtain a write lock on a RWLock
      IndexReader indexReader = IndexReader.open(mypath);
      IndexSearcher indexSearcher = new IndexSearcher(mypath);
      // use the searcher to search for documents to be deleted
      // use the reader to do the deletes.
      indexReader.close();
  }
  catch(Throwable e)
  {
      e.printStackTrace();
  }
  finally
  {
      m_lock.unlock();
  }

Sometimes I am getting the following exception:
java.io.IOException: Index locked for write:
Lock@D:\RevealCS\Search\Data\reports\write.lock
at org.apache.lucene.index.IndexReader.delete(Unknown Source)
at org.apache.lucene.index.IndexReader.delete(Unknown Source)
at
revsearch.RevSearch$DeleteWatcherThread.checkAction(RevSearch.java:1455)
at revsearch.RevSearch$WatcherThread.run(RevSearch.java:250)

This exception was not happening every time the code was run; it was
intermittent.

I suspect it is because I am using indexSearcher and indexWriter to open
the myPath dir. I changed it such that indexSearcher uses the indexReader
in the constructor.

I am hoping that someone can shed some light on what went wrong, thanks.
Aruna.







LockTest.java
Description: LockTest.java
