RE: Time to index documents

2004-08-25 Thread Stephane James Vaucher
Hetan,

If you are using a corpus with multiple editors, I suggest that you 
use a cleaner like tidy as there might be weird stuff appearing in the 
html.

sv

On Thu, 26 Aug 2004, Karthik N S wrote:

> Hi Hetan
> 
> 
> That is the major problem with the non-standardized tags in the HTML documents
> you are indexing; they add lag to the indexing process.
> 
> 
> If you can, tweak the HTMLParser.jj file under '/demo/html' in lucene.zip
> [you need some knowledge of JavaCC for this].
> 
> 
> 
> Karthik
> 
> -Original Message-
> From: Hetan Shah [mailto:[EMAIL PROTECTED]
> Sent: Thursday, August 26, 2004 3:01 AM
> To: Lucene Users List
> Subject: Time to index documents
> 
> 
> Hello all,
> 
> Is there a way to reduce the indexing time taken when the indexer is
> indexing about 30,000 + files. It is roughly taking around 6-7 hours to
> do this. I am using IndexHTML class to create the index out of HTML files.
> 
> Another issue that I see is every once in a while I get the following
> output on the screen.
> 
> adding ../31/1104852.html
> Parse Aborted: Encountered "\"" at line 7, column 1.
> Was expecting one of:
>   ...
>  "=" ...
>   ...
> 
> Any suggestions on preventing this from happening?
> 
> Thanks in advance.
> -H
> 
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
> 



RE: Time to index documents

2004-08-25 Thread Karthik N S
Hi Hetan


That is the major problem with the non-standardized tags in the HTML documents
you are indexing; they add lag to the indexing process.


If you can, tweak the HTMLParser.jj file under '/demo/html' in lucene.zip
[you need some knowledge of JavaCC for this].



Karthik

-Original Message-
From: Hetan Shah [mailto:[EMAIL PROTECTED]
Sent: Thursday, August 26, 2004 3:01 AM
To: Lucene Users List
Subject: Time to index documents


Hello all,

Is there a way to reduce the indexing time taken when the indexer is
indexing about 30,000 + files. It is roughly taking around 6-7 hours to
do this. I am using IndexHTML class to create the index out of HTML files.

Another issue that I see is every once in a while I get the following
output on the screen.

adding ../31/1104852.html
Parse Aborted: Encountered "\"" at line 7, column 1.
Was expecting one of:
  ...
 "=" ...
  ...

Any suggestions on preventing this from happening?

Thanks in advance.
-H





Content from multiple folders in single index

2004-08-25 Thread John Greenhill
Hi,

I suspect this is an easy one but I didn't see a reference in the FAQ's
so I thought I'd ask. I have a file structure like this:

web
  - pages
  - downloads (pdf docs)
  - include

I want to index the html in pages and the pdf's in downloads, but not
the html in include, so I don't want to start my index at web. I've
modified the IndexHTML in demo to do the pdf's. 

What is the best way to do this? Thanks for your suggestions.

John
 


Re: Time to index documents

2004-08-25 Thread Stephane James Vaucher
JGuru explanation: 
http://www.jguru.com/faq/view.jsp?EID=1074228

I have no sample code for neko, I think nutch uses it though. For tidy, 
you can look at ant in the sandbox:

http://cvs.apache.org/viewcvs.cgi/jakarta-lucene-sandbox/contributions/ant/src/main/org/apache/lucene/ant/HtmlDocument.java?rev=1.3&view=markup

HTH,
sv

On Wed, 25 Aug 2004, Hetan Shah wrote:

> Do you have any pointers for sample code for them?
> Would highly appreciate it.
> Thanks.
> -H
> 
> Stephane James Vaucher wrote:
> 
> > I don't think that the demo parser is meant as a production 
> > system component. You can look at Tidy or NekoHtml. They cleanup your html 
> > and are probably optimised.
> > 
> > sv
> > 
> > On Wed, 25 Aug 2004, Hetan Shah wrote:
> > 
> > 
> >>Hello all,
> >>
> >>Is there a way to reduce the indexing time taken when the indexer is 
> >>indexing about 30,000 + files. It is roughly taking around 6-7 hours to 
> >>do this. I am using IndexHTML class to create the index out of HTML files.
> >>
> >>Another issue that I see is every once in a while I get the following 
> >>output on the screen.
> >>
> >>adding ../31/1104852.html
> >>Parse Aborted: Encountered "\"" at line 7, column 1.
> >>Was expecting one of:
> >>  ...
> >> "=" ...
> >>  ...
> >>
> >>Any suggestions on preventing this from happening?
> >>
> >>Thanks in advance.
> >>-H
> >>
> >>



Re: Time to index documents

2004-08-25 Thread Hetan Shah
Do you have any pointers for sample code for them?
Would highly appreciate it.
Thanks.
-H
Stephane James Vaucher wrote:

> I don't think that the demo parser is meant as a production
> system component. You can look at Tidy or NekoHtml. They cleanup your html
> and are probably optimised.
>
> sv
>
> On Wed, 25 Aug 2004, Hetan Shah wrote:
>
> > Hello all,
> >
> > Is there a way to reduce the indexing time taken when the indexer is
> > indexing about 30,000 + files. It is roughly taking around 6-7 hours to
> > do this. I am using IndexHTML class to create the index out of HTML files.
> >
> > Another issue that I see is every once in a while I get the following
> > output on the screen.
> >
> > adding ../31/1104852.html
> > Parse Aborted: Encountered "\"" at line 7, column 1.
> > Was expecting one of:
> >  ...
> > "=" ...
> >  ...
> >
> > Any suggestions on preventing this from happening?
> >
> > Thanks in advance.
> > -H


Re: Time to index documents

2004-08-25 Thread Stephane James Vaucher
I don't think that the demo parser is meant as a production
system component. You can look at Tidy or NekoHtml. They clean up your html
and are probably optimised.

sv

On Wed, 25 Aug 2004, Hetan Shah wrote:

> Hello all,
> 
> Is there a way to reduce the indexing time taken when the indexer is 
> indexing about 30,000 + files. It is roughly taking around 6-7 hours to 
> do this. I am using IndexHTML class to create the index out of HTML files.
> 
> Another issue that I see is every once in a while I get the following 
> output on the screen.
> 
> adding ../31/1104852.html
> Parse Aborted: Encountered "\"" at line 7, column 1.
> Was expecting one of:
>   ...
>  "=" ...
>   ...
> 
> Any suggestions on preventing this from happening?
> 
> Thanks in advance.
> -H
> 
> 



Time to index documents

2004-08-25 Thread Hetan Shah
Hello all,
Is there a way to reduce the indexing time when the indexer is
processing 30,000+ files? It currently takes roughly 6-7 hours.
I am using the IndexHTML class to create the index from HTML files.

Another issue is that every once in a while I get the following
output on the screen:

adding ../31/1104852.html
Parse Aborted: Encountered "\"" at line 7, column 1.
Was expecting one of:
 ...
"=" ...
 ...
Any suggestions on preventing this from happening?
Thanks in advance.
-H


Re: How not to show results with the same score?

2004-08-25 Thread Paul Elschot
On Wednesday 25 August 2004 12:21, B. Grimm [Eastbeam GmbH] wrote:
> hi there,
>
> i browsed through the list and had some different searches but i do not
> find, what i'm looking for.
>
> i got an index which is generated by a bot, collecting websites. there
> are sites like www.domain.de/article/1 and www.domain.de/article/1?page=1
> these different urls have the same content and when u search for a word,
> matching, both are returned, which is correct.
>
> they have exactly the same score because of their content and so on, so
> i would like to know if it's possible "to group by" (mysql, of course)
> the returned score, so that only the first match is collected into
> "Hits" and all following matches with the same score are ignored.
>
> it would be great if anyone has an idea how to do that.

You can implement your own HitCollector and pass it to IndexSearcher.search().
Have a look at the javadocs of the org.apache.lucene.search package,
it's quite straightforward. The PriorityQueue from the
util package is useful to collect results. For every distinct score you could
store an int[] of document numbers in there while collecting the hits.
Basically you'll end up implementing your own Hits class.
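
As a rough illustration of the collector idea above (not Paul's code), the
sketch below keeps only the first document seen for each distinct score. It is
plain Java with no Lucene dependency: DistinctScoreCollector is a hypothetical
name, and its collect(int doc, float score) method only mirrors the shape of
HitCollector.collect, so in a real application the same logic would sit inside
a HitCollector subclass passed to IndexSearcher.search(). The generics assume
a current JDK rather than the Java 1.4 of this thread.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper: keeps the first document seen for each distinct score. */
final class DistinctScoreCollector {
    private final Set<Float> seenScores = new HashSet<>();
    private final List<Integer> docs = new ArrayList<>();

    /** Same parameter shape as Lucene's HitCollector.collect(int, float). */
    void collect(int doc, float score) {
        // Set.add returns false when this exact score was already recorded,
        // so later documents with the same score are dropped.
        if (seenScores.add(score)) {
            docs.add(doc);
        }
    }

    /** Document numbers kept so far, in the order they were collected. */
    List<Integer> docs() {
        return docs;
    }
}
```

Hits typically reach a collector in document-number order rather than score
order, so "first" here means the lowest document number. Equal scores also do
not guarantee equal content, which is one reason the crawler-side merging
described below is the more robust fix.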

For URL's that have the same content, it's better
to store multiple URL's for the same document. However, this
merging is normally done by a crawler because the same contents
means the same outgoing URL's. Crawlers also keep track
of multiple host names resolving to the same IP address.

In case you need to crawl and index an intranet or more, have a look
at Nutch.

Regards,
Paul Elschot







Re: Advanced timestamp usage (or global value storage)

2004-08-25 Thread Avi Drissman
On Aug 25, 2004, at 12:25 PM, Grant Ingersoll wrote:

> I may be confused, as I understand it you said you were interested in
> the last document indexed,

Yes, I see what you meant. I'm sorry.

That's actually an interesting option. Is getting the timestamp of the
last document indexed a good enough solution or must I find the latest
timestamp of all indexed documents? I'd have to ponder that for a
while.

Avi
--
Avi 'rlwimi' Drissman
[EMAIL PROTECTED]
Argh! This darn mail server is trunca


Re: Advanced timestamp usage (or global value storage)

2004-08-25 Thread Grant Ingersoll
Avi,

I may be confused; as I understand it, you said you were interested in
the last document indexed, and Bernhard's code does that.  Lucene adds
documents sequentially, so counting backwards from maxDoc() should
get you the last indexed document pretty quickly.  If all documents were
deleted, this would go through all documents; otherwise, it is
going to find it pretty quickly.  It doesn't have to traverse
all of the documents, it just has to find the "first" document that is
not deleted (since we are starting at the end of the list and going
backward).

>>> [EMAIL PROTECTED] 8/25/2004 12:01:50 PM >>>
On Aug 25, 2004, at 11:57 AM, Grant Ingersoll wrote:

> You are right, in the worst case, this would be linear,

No, in _all_ cases this would be linear.

> I would bet, that on average,
> arguably nearly all cases, you would go through very few iterations
> before finding the doc you are interested in

Then you don't understand what I'm trying to do. I'm trying to find the
document with the biggest value for the field. That would involve
checking the field's value in every document to ensure this.

Avi

-- 
Avi 'rlwimi' Drissman
[EMAIL PROTECTED] 
Argh! This darn mail server is trunca





Re: Advanced timestamp usage (or global value storage)

2004-08-25 Thread Avi Drissman
On Aug 25, 2004, at 11:57 AM, Grant Ingersoll wrote:

> You are right, in the worst case, this would be linear,

No, in _all_ cases this would be linear.

> I would bet, that on average,
> arguably nearly all cases, you would go through very few iterations
> before finding the doc you are interested in

Then you don't understand what I'm trying to do. I'm trying to find the
document with the biggest value for the field. That would involve
checking the field's value in every document to ensure this.

Avi
--
Avi 'rlwimi' Drissman
[EMAIL PROTECTED]
Argh! This darn mail server is trunca


Re: Advanced timestamp usage (or global value storage)

2004-08-25 Thread Grant Ingersoll


>>> [EMAIL PROTECTED] 8/25/2004 11:50:01 AM >>>
On Aug 25, 2004, at 11:39 AM, Bernhard Messer wrote:

> If you already store the date time when the doc was index, you could
> use the following trick to get the last document added to the index:
>
>while (--maxDoc > 0) {

Yes, but that's a linear search :(

>>>
You are right, in the worst case, this would be linear, but that would
require you to delete a lot of documents.  I would bet, that on average,
arguably nearly all cases, you would go through very few iterations
before finding the doc you are interested in




Re: Advanced timestamp usage (or global value storage)

2004-08-25 Thread Otis Gospodnetic
The more documents match, the slower the search; how long your
particular search would take I cannot tell, though - you should just
test it out and see.

I never needed to use the trick with a flag field in all documents, but
I know others do it.

Otis

--- Avi Drissman <[EMAIL PROTECTED]> wrote:

> On Aug 25, 2004, at 11:39 AM, Bernhard Messer wrote:
> 
> > If you already store the date time when the doc was index, you could
> > use the following trick to get the last document added to the index:
> >
> >while (--maxDoc > 0) {
> 
> Yes, but that's a linear search :(
> 
> On Aug 25, 2004, at 11:25 AM, Otis Gospodnetic wrote:
> 
> > What if all Documents in your index contained some flag field + an 'add
> > date' field.  Then you could make a query such as: flag:1 and sort it
> > by 'add date' field, taking only the very first hit as the most
> > recently added Document.
> 
> That's a very clever approach. I'm currently using Lucene 1.3, so I
> hadn't thought about using the new sorting abilities. I'd need to move
> to 1.4, of course.
> 
> A question, though: how efficient is it to make a query that matches
> all documents and then sort it? I'm looking for something as small as I
> can; after all, storing the last date in a file separate from the index
> is O(1)...
> 
> Thanks!
> 
> Avi
> 
> -- 
> Avi 'rlwimi' Drissman
> [EMAIL PROTECTED]
> Argh! This darn mail server is trunca
> 
> 



Re: Advanced timestamp usage (or global value storage)

2004-08-25 Thread Avi Drissman
On Aug 25, 2004, at 11:39 AM, Bernhard Messer wrote:

> If you already store the date time when the doc was index, you could
> use the following trick to get the last document added to the index:
>
>    while (--maxDoc > 0) {

Yes, but that's a linear search :(

On Aug 25, 2004, at 11:25 AM, Otis Gospodnetic wrote:

> What if all Documents in your index contained some flag field + an 'add
> date' field.  Then you could make a query such as: flag:1 and sort it
> by 'add date' field, taking only the very first hit as the most
> recently added Document.

That's a very clever approach. I'm currently using Lucene 1.3, so I
hadn't thought about using the new sorting abilities. I'd need to move
to 1.4, of course.

A question, though: how efficient is it to make a query that matches 
all documents and then sort it? I'm looking for something as small as I 
can; after all, storing the last date in a file separate from the index 
is O(1)...

Thanks!
Avi
--
Avi 'rlwimi' Drissman
[EMAIL PROTECTED]
Argh! This darn mail server is trunca


Introduction to Lucene [was Re: worddoucments search]

2004-08-25 Thread Steven Rowe
A collection of links to introductory level Lucene articles (including 
one in simplified Chinese and one in Turkish) is available on the 
Lucene Wiki at:

http://wiki.apache.org/jakarta-lucene/IntroductionToLucene

Steve

Otis Gospodnetic wrote:

> that part you have to do yourself.  It is easy, just create a new
> Document, create an appropriate Field, give it a name and the string
> value you got with textmining.org library, then add the Field to your
> Document, and then add the Document to the index with IndexWriter.
> Look at one of the articles about Lucene to get started.  I wrote one
> called something like Introduction to Text Indexing with Lucene.  You
> probably want to read that one to get going.
>
> Otis
>
> --- Santosh <[EMAIL PROTECTED]> wrote:
>
> > I have gone through textmining.org, and I am able to extract text in
> > string format. But how can I get it in Lucene Document format?
> >
> > - Original Message -
> > From: "Otis Gospodnetic" <[EMAIL PROTECTED]>
> > To: "Lucene Users List" <[EMAIL PROTECTED]>
> > Sent: Tuesday, August 24, 2004 11:54 PM
> > Subject: Re: worddoucments search
> >
> > > As I just answered in a separate email to Ryan - we used
> > > textmining.org library, too, as an example of something that is easier
> > > to use than POI.  It's been a while since I wrote that chapter, so it
> > > slipped my mind when I replied.  Yes, use textmining.org first, you'll
> > > be able to include it in your code in 2 minutes.  Good stuff.
> > >
> > > Otis



Re: Advanced timestamp usage (or global value storage)

2004-08-25 Thread Bernhard Messer
Avi,
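I would prefer the second approach. If you already store the date and time
when the doc was indexed, you could use the following trick to get the
last document added to the index:

   // Lucene 1.4 API: org.apache.lucene.index.IndexReader,
   // org.apache.lucene.document.Document
   IndexReader ir = IndexReader.open("/tmp/testindex");

   // Walk backwards from the highest document number; the first
   // document that is not deleted is the one added last.
   int maxDoc = ir.maxDoc();
   while (--maxDoc >= 0) {   // >= 0 so that document 0 is checked too
     if (!ir.isDeleted(maxDoc)) {
       Document doc = ir.document(maxDoc);
       System.out.println(doc.get("indexDate"));
       break;
     }
   }
   ir.close();

What do you think about this implementation? No extra properties, nothing
to worry about. All the information is within your index.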

regards
Bernhard
Avi Drissman wrote:
> I've used Lucene for a long time, but only in the most basic way. I
> have a custom analyzer and a slightly hacked query parser, but in
> general it's the basic add document/remove document/query documents
> cycle.
>
> In my system, I'm indexing a store of external documents, maintaining
> an index for full-text querying. However, I might be turned off when
> documents are added, and then when I'm restarted, I'm going to need to
> determine the timestamp of the last document added to the index so
> that I can pick up where I left off.
>
> There are three approaches to doing this, two using Lucene. I don't
> know how I would do the two Lucene approaches, or even if they're
> possible.
>
> 1. Just keep a file in parallel with the index, reading and writing
> the timestamp of the last indexed document in it. I know how to do
> this, but I don't like the idea of keeping a separate file.
>
> 2. Drop a timestamp onto each document as it's indexed. I've attached
> timestamp fields to documents in the past so that I could do range
> queries on them. However, I don't know how to do a query like "the
> document with the latest timestamp" or even if that's possible.
>
> 3. Create a dummy document (with some unique field identifier so you
> could quickly query for it) with a field "last timestamp". This is a
> "global value storage" approach, as you could just store any field
> with any value on it. But I'd be updating this timestamp field a lot,
> which means that every time I updated the index I'd have to remove
> this special document and reindex it. Is there any way to update the
> value of a field in a document directly in the index without removing
> and adding it again to the index? The field I'd want to update would
> just be stored, not indexed or tokenized.
>
> Thanks for your help in guiding my exploration into the capabilities
> of Lucene.
>
> Avi



Re: Advanced timestamp usage (or global value storage)

2004-08-25 Thread Otis Gospodnetic
What if all Documents in your index contained some flag field + an 'add
date' field.  Then you could make a query such as: flag:1 and sort it
by 'add date' field, taking only the very first hit as the most
recently added Document.

Otis

--- Avi Drissman <[EMAIL PROTECTED]> wrote:

> I've used Lucene for a long time, but only in the most basic way. I
> have a custom analyzer and a slightly hacked query parser, but in
> general it's the basic add document/remove document/query documents
> cycle.
> 
> In my system, I'm indexing a store of external documents, maintaining
> an index for full-text querying. However, I might be turned off when
> documents are added, and then when I'm restarted, I'm going to need to
> determine the timestamp of the last document added to the index so that
> I can pick up where I left off.
> 
> There are three approaches to doing this, two using Lucene. I don't
> know how I would do the two Lucene approaches, or even if they're
> possible.
> 
> 1. Just keep a file in parallel with the index, reading and writing the
> timestamp of the last indexed document in it. I know how to do this,
> but I don't like the idea of keeping a separate file.
> 
> 2. Drop a timestamp onto each document as it's indexed. I've attached
> timestamp fields to documents in the past so that I could do range
> queries on them. However, I don't know how to do a query like "the
> document with the latest timestamp" or even if that's possible.
> 
> 3. Create a dummy document (with some unique field identifier so you
> could quickly query for it) with a field "last timestamp". This is a
> "global value storage" approach, as you could just store any field with
> any value on it. But I'd be updating this timestamp field a lot, which
> means that every time I updated the index I'd have to remove this
> special document and reindex it. Is there any way to update the value
> of a field in a document directly in the index without removing and
> adding it again to the index? The field I'd want to update would just
> be stored, not indexed or tokenized.
> 
> Thanks for your help in guiding my exploration into the capabilities of
> Lucene.
> 
> Avi
> 
> -- 
> Avi 'rlwimi' Drissman
> [EMAIL PROTECTED]
> Argh! This darn mail server is trunca
> 
> 



Re: How to implement KWIC (KeyWord In Context) display

2004-08-25 Thread yinjin
Hi, Otis,

Thank you very much. I'll try it.

Best,
Ying
- Original Message - 
From: "Otis Gospodnetic" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Tuesday, August 24, 2004 5:55 PM
Subject: Re: How to implement KWIC (KeyWord In Context) display


> Hello Ying,
> 
> Take a look at Lucene Highlighter in Lucene Sandbox:
> http://jakarta.apache.org/lucene/docs/lucene-sandbox/
> 
> Otis
> 
> --- yinjin <[EMAIL PROTECTED]> wrote:
> 
> > Hello all,
> > 
> > Does anyone know how to implement KWIC display using Lucene? I'd like
> > to display the result similar to google search.
> > 
> > Thanks for any help,
> > Ying
> > 
> > 
> 
> 



Re: Lucene Search Applet

2004-08-25 Thread Simon mcIlwaine
Hi Jon,

I modified the three files exactly the way you said, using a separate
declaration and static initializer block, but for IndexWriter I had to change
4 of the variables because they were final. Then I updated the Lucene JAR
file with the three files in the appropriate directory. But I'm still
getting the error: java.security.AccessControlException: access denied
(java.util.PropertyPermission user.dir read). What am I doing wrong? I was
unable to download the files attached to your last mail. Is it
possible you could send them to my work address: [EMAIL PROTECTED]

Many Thanks

Simon


- Original Message - 
From: "Jon Schuster" <[EMAIL PROTECTED]>
To: "'Lucene Users List'" <[EMAIL PROTECTED]>
Sent: Monday, August 23, 2004 6:25 PM
Subject: RE: Lucene Search Applet


> Hi all,
>
> The changes I made to get past the System.getProperty issues are essentially
> the same in the three files org.apache.lucene.index.IndexWriter,
> org.apache.lucene.store.FSDirectory, and
> org.apache.lucene.search.BooleanQuery.
>
> Change the static initializations from a form like this:
>
>   public static long WRITE_LOCK_TIMEOUT =
>     Integer.parseInt(System.getProperty("org.apache.lucene.writeLockTimeout",
>                                         "1000"));
>
> to a separate declaration and static initializer block like this:
>
>   public static long WRITE_LOCK_TIMEOUT;
>   static
>   {
>     try
>     {
>       WRITE_LOCK_TIMEOUT =
>         Integer.parseInt(System.getProperty("org.apache.lucene.writeLockTimeout",
>                                             "1000"));
>     }
>     catch ( Exception e )
>     {
>       // Property lookup failed (e.g. denied by the applet sandbox);
>       // fall back to the default value.
>       WRITE_LOCK_TIMEOUT = 1000;
>     }
>   }
>
> As before, the variables are initialized when the class is loaded, but if
> the System.getProperty fails, the variable still gets initialized to its
> default value in the catch block.
>
> You can use a separate static block for each variable, or put them all into
> a single static block. You could also add a setter for each variable if you
> want the ability to set the value separately from the class init.
>
> In the FSDirectory class, the variables DISABLE_LOCKS and LOCK_DIR are
> marked final, which I had to remove to do the initialization as described.
>
> I've also attached the three modified files if you want to just copy and
> paste.
>
> --Jon
>
> -Original Message-
> From: Simon mcIlwaine [mailto:[EMAIL PROTECTED]
> Sent: Monday, August 23, 2004 7:37 AM
> To: Lucene Users List
> Subject: Re: Lucene Search Applet
>
> Hi,
>
> Just used the RODirectory and I'm now getting the following error:
> java.security.AccessControlException: access denied
> (java.util.PropertyPermission user.dir read) I'm reckoning that this is what
> Jon was on about with System.getProperty() within certain files because im
> using an applet. Is this correct and if so can someone show me one of the
> hacked files so that I know what I need to modify.
>
> Many Thanks
>
> Simon
> .
> - Original Message -
> From: "Simon mcIlwaine" <[EMAIL PROTECTED]>
> To: "Lucene Users List" <[EMAIL PROTECTED]>
> Sent: Monday, August 23, 2004 3:12 PM
> Subject: Re: Lucene Search Applet
>
> > Hi Stephane,
> >
> > A bit of a stupid question but how do you mean set the system property
> > disableLuceneLocks=true? Can I do it from a call from FSDirectory API or do
> > I have to actually hack the code? Also if I do use RODirectory how do I go
> > about using it? Do I have to update the Lucene JAR archive file with
> > RODirectory class included as I tried using it and its not recognising the
> > class?
> >
> > Many Thanks
> >
> > Simon
> >
> > - Original Message -
> > From: "Stephane James Vaucher" <[EMAIL PROTECTED]>
> > To: "Lucene Users List" <[EMAIL PROTECTED]>
> > Sent: Monday, August 23, 2004 2:22 PM
> > Subject: Re: Lucene Search Applet
> >
> >
> > > Hi Simon,
> > >
> > > Does this work? From FSDirectory api:
> > >
> > > If the system property 'disableLuceneLocks' has the String value of
> > > "true", lock creation will be disabled.
> > >
> > > Otherwise, I think there was a Read-Only Directory hack:
> > >
> > > http://www.mail-archive.com/[EMAIL PROTECTED]/msg05148.html
> > >
> > > HTH,
> > > sv
> > >
> > > On Mon, 23 Aug 2004, Simon mcIlwaine wrote:
> > >
> > > > Thanks Jon that works by putting the jar file in the archive attribute. Now
> > > > im getting the disablelock error cause of the unsigned applet. Do I just
> > > > comment out the code anywhere where System.getProperty() appears in the
> > > > files that you specified and then update the JAR Archive?? Is it possible
> > > > you could show me one of the hacked files so that I know what I'm modifying?
> > > > Does anyone else know if there is another way of doing this without having
> > > > to hack the source code?
> > > >
> > > > Many thanks.
> > > >
> > > > Simon
> > > >
> > > > - Original Message -
> > > > From: "Jon Schuster" <[EMAIL PROTECTED]>
> > > > To: "Lucene Users List" <[EMAIL PROTECTED]>
> > > > Sent: Saturday, August 21, 2004 2:08 AM
> > > > Subject: R
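
The static-initializer fallback that Jon describes earlier in this thread can
also be written as a small helper. This is only a sketch under assumptions:
SafeProperty and readLong are hypothetical names, not part of Lucene, and the
helper simply returns the default when the sandbox denies the property read
or the value is not a number. One side effect of routing the read through a
helper is that the field can stay final.

```java
/** Hypothetical helper for reading system properties inside a sandbox. */
final class SafeProperty {
    /** Reads a long system property, falling back to defaultValue on failure. */
    static long readLong(String name, long defaultValue) {
        try {
            String raw = System.getProperty(name);
            return raw == null ? defaultValue : Long.parseLong(raw.trim());
        } catch (SecurityException | NumberFormatException e) {
            // SecurityException: the sandbox denied the read (the applet case).
            // NumberFormatException: the property is set but is not a number.
            return defaultValue;
        }
    }

    // Assigned once at class load, as in the thread, but through the helper,
    // so the field does not have to lose its final modifier.
    static final long WRITE_LOCK_TIMEOUT =
            readLong("org.apache.lucene.writeLockTimeout", 1000L);
}
```

The property name and the default of 1000 are taken from the snippet quoted
above; everything else is illustrative.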

Re: Advanced timestamp usage (or global value storage)

2004-08-25 Thread Claes Holmerson
Avi Drissman wrote:
> I've used Lucene for a long time, but only in the most basic way. I
> have a custom analyzer and a slightly hacked query parser, but in
> general it's the basic add document/remove document/query documents
> cycle.
>
> In my system, I'm indexing a store of external documents, maintaining
> an index for full-text querying. However, I might be turned off when
> documents are added, and then when I'm restarted, I'm going to need to
> determine the timestamp of the last document added to the index so
> that I can pick up where I left off.
>
> There are three approaches to doing this, two using Lucene. I don't
> know how I would do the two Lucene approaches, or even if they're
> possible.
>
> 1. Just keep a file in parallel with the index, reading and writing
> the timestamp of the last indexed document in it. I know how to do
> this, but I don't like the idea of keeping a separate file.

This is similar to the way I chose (I used a property file for this, and
stored certain data within it, in the index directory). I didn't like
the idea at first either, but later I thought - why not? It is the
simplest way. As long as the file name is not used by Lucene, I thought
it should be safe.

Claes
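
The property-file approach described above could look roughly like the sketch
below. It is not Claes's actual code: the class name IndexState, the key
"lastIndexed", and the choice to return a fallback value when the file is
missing are all assumptions made for illustration. It uses only
java.util.Properties and assumes a current JDK.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical helper that keeps the last indexed timestamp in a property file. */
final class IndexState {
    private static final String KEY = "lastIndexed"; // assumed key name

    /** Writes the timestamp to stateFile, replacing any previous content. */
    static void save(File stateFile, long timestamp) {
        Properties props = new Properties();
        props.setProperty(KEY, Long.toString(timestamp));
        try (OutputStream out = new FileOutputStream(stateFile)) {
            props.store(out, "indexer state");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns the stored timestamp, or fallback if the file or value is missing. */
    static long load(File stateFile, long fallback) {
        if (!stateFile.isFile()) {
            return fallback;
        }
        Properties props = new Properties();
        try (InputStream in = new FileInputStream(stateFile)) {
            props.load(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String value = props.getProperty(KEY);
        if (value == null) {
            return fallback;
        }
        try {
            return Long.parseLong(value.trim());
        } catch (NumberFormatException e) {
            return fallback; // value present but not a number
        }
    }
}
```

If the file lives in the index directory, as described above, its name has to
be one that Lucene itself never uses.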


Advanced timestamp usage (or global value storage)

2004-08-25 Thread Avi Drissman
I've used Lucene for a long time, but only in the most basic way. I 
have a custom analyzer and a slightly hacked query parser, but in 
general it's the basic add document/remove document/query documents 
cycle.

In my system, I'm indexing a store of external documents, maintaining 
an index for full-text querying. However, I might be turned off when 
documents are added, and then when I'm restarted, I'm going to need to 
determine the timestamp of the last document added to the index so that 
I can pick up where I left off.

There are three approaches to doing this, two using Lucene. I don't 
know how I would do the two Lucene approaches, or even if they're 
possible.

1. Just keep a file in parallel with the index, reading and writing the 
timestamp of the last indexed document in it. I know how to do this, 
but I don't like the idea of keeping a separate file.

2. Drop a timestamp onto each document as it's indexed. I've attached 
timestamp fields to documents in the past so that I could do range 
queries on them. However, I don't know how to do a query like "the 
document with the latest timestamp" or even if that's possible.

3. Create a dummy document (with some unique field identifier so you 
could quickly query for it) with a field "last timestamp". This is a 
"global value storage" approach, as you could just store any field with 
any value on it. But I'd be updating this timestamp field a lot, which 
means that every time I updated the index I'd have to remove this 
special document and reindex it. Is there any way to update the value 
of a field in a document directly in the index without removing and 
adding it again to the index? The field I'd want to update would just 
be stored, not indexed or tokenized.

Thanks for your help in guiding my exploration into the capabilities of 
Lucene.

Avi
--
Avi 'rlwimi' Drissman
[EMAIL PROTECTED]
Argh! This darn mail server is trunca


lucene 1.4 in maven repository

2004-08-25 Thread Zilverline info
Hi,
Can anyone tell me why there is no lucene 1.4 jar in the maven 
repository @ http://www.ibiblio.org/maven/lucene/jars/ ? Who makes them 
available? It would be very convenient to be able to get the latest 
version from there (or anywhere else)

regards,
 Michael Franken


Re: Lock handling

2004-08-25 Thread Otis Gospodnetic
My suggestion was referring to a timestamp that could be obtained via
java.io.File, not something provided by Lucene.
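
As a rough sketch of that idea (the class name StaleLockCheck and the age threshold are assumptions for illustration, not anything Lucene provides), the age of the lock file can be read with File.lastModified():

```java
import java.io.File;

/** Decides whether a lock file looks abandoned, based only on its modification time. */
final class StaleLockCheck {
    /**
     * Returns true if the lock file exists and was last modified more than
     * maxAgeMillis before nowMillis. A missing file is never considered stale.
     */
    static boolean isStale(File lockFile, long maxAgeMillis, long nowMillis) {
        long modified = lockFile.lastModified(); // 0 if the file does not exist
        return modified != 0L && nowMillis - modified > maxAgeMillis;
    }
}
```

Only if isStale(...) returns true would the application go on to force the unlock with IndexReader.unlock(...). A threshold that is too short risks unlocking an index that a slow but live writer is still using, so it has to be chosen well above the longest expected indexing run.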

Otis

--- Claes Holmerson <[EMAIL PROTECTED]> wrote:

> Yes, looking at the time of the lock was an idea I had but I could not
> find anything like a time stamp. Am I missing something obvious here?
>
> Claes
>
> Otis Gospodnetic wrote:
>
> > Hello,
> >
> > If you use Lucene incorrectly (e.g. 2 IndexWriters writing to the same
> > index), you will see this error.  Lucene has no way of telling whether
> > the lock file was left over from a previous process, or whether it's a
> > valid lock file because another process is currently indexing documents
> > or some such.
> > You could try adding some logic to your app, though.  For instance, you
> > can look at lock's timestamp, and using IndexReader.unlock(...) method
> > to forcefully unlock the index.
> >
> > Otis
> >
> > --- Claes Holmerson <[EMAIL PROTECTED]> wrote:
> >
> > > Hello,
> > >
> > > I am interested to hear how people handle locked indexes, for example
> > > when catching an IOException like below.
> > >
> > > java.io.IOException: Lock obtain timed out:
> > > Lock@/tmp/lucene-0b978f2c0aa12e8dcdbd5b0df491bfc4-write.lock
> > > at org.apache.lucene.store.Lock.obtain(Lock.java:58)
> > > at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:223)
> > > at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:213)
> > >
> > > As far as I can tell, there is no good way to tell whether the lock is
> > > only temporary (working as it should), or if it was created by a process
> > > that later died, and therefore can not remove it. How can I detect the
> > > latter case, and how should I best handle it?
> > >
> > > Thanks,
> > > Claes
>
> -- 
> Claes Holmerson
> Polopoly - Cultivating the information garden
> Kungsgatan 88, SE-112 27 Stockholm, SWEDEN
> Direct: +46 8 506 782 59
> Mobile: +46 704 47 82 59
> Fax:  +46 8 506 782 51
> [EMAIL PROTECTED], http://www.polopoly.com





Re: worddoucments search

2004-08-25 Thread Chandan Tamrakar
Santosh,
please read the Lucene API documentation.

Once you can extract a string from a Word document using the textmining
APIs, try writing it to a temporary file and indexing that.

If you are able to index PDFs and normal files, what trouble would you face
indexing a string extracted from Word documents? Please also read or search
the previous postings; they should help you understand Lucene better.


- Original Message - 
From: "Karthik N S" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Wednesday, August 25, 2004 4:21 PM
Subject: RE: worddoucments search


> Hi
>
>   Santosh
>
>   Please .
>
>   If u have Downloded the Lucene (zip )bundel , First try to read the
> docs/index.html  which is in the bundel,
>   if  u are still in trouble, then  approach the Form for Help  [ Un
> necessarily  asking silly Questions will be ignored ]
>
>
> Karthik
>
>
>
>
> -Original Message-
> From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, August 25, 2004 3:01 PM
> To: Lucene Users List
> Subject: Re: worddoucments search
>
>
> that part you have to do yourself.  It is easy, just create a new
> Document, create an appropriate Field, give it a name and the string
> value you got with textmining.org library, then add the Field to your
> Document, and then add the Document to the index with IndexWriter.
>
> Look at one of the articles about Lucene to get started.  I wrote one
> called something like Introduction to Text Indexing with Lucene.  You
> probably want to read that one to get going.
>
> Otis
>
> --- Santosh <[EMAIL PROTECTED]> wrote:
>
> > I have gon through textmining.org, I am able to extract text in
> > string
> > format. but how can I get it as
> > lucene document format
> > - Original Message -
> > From: "Otis Gospodnetic" <[EMAIL PROTECTED]>
> > To: "Lucene Users List" <[EMAIL PROTECTED]>
> > Sent: Tuesday, August 24, 2004 11:54 PM
> > Subject: Re: worddoucments search
> >
> >
> >  As I just answered in a separate email to Ryan - we used
> > textmining.org library, too, as an example of something that is easier
> > to use than POI.  It's been a while since I wrote that chapter, so it
> > slipped my mind when I replied.  Yes, use textmining.org first, you'll
> > be able to include it in your code in 2 minutes.  Good stuff.
> >
> >  Otis
> >






Re: what is wrong with query

2004-08-25 Thread Erik Hatcher
That is correct... fuzzy searches are only on a per-term basis.
If what you meant, though, was a proximity search on a phrase ("full" near
"name"), you have to add an explicit slop factor, like "full name"~5

Erik

On Aug 25, 2004, at 2:19 AM, Stephane James Vaucher wrote:

> From: http://jakarta.apache.org/lucene/docs/queryparsersyntax.html
>
> Fuzzy Searches
>
> Lucene supports fuzzy searches based on the Levenshtein Distance, or
> Edit Distance algorithm. To do a fuzzy search use the tilde, "~", symbol
> at the end of a Single word Term.
>
> I haven't used fuzzy searches, but it seems to indicate that it can only
> be used with single word terms. The query parser might have been written
> to support that (the output indicates that as well).
>
> HTH,
> sv
>
> On Wed, 25 Aug 2004, Alex Kiselevski wrote:
>
> > I use QueryParser
> > And I got an exception :
> > org.apache.lucene.queryParser.ParseException: Encountered "~" at line 1,
> > column 44.
> > Was expecting one of:
> >  ...
> >  ...
> >  ...
> > "+" ...
> > "-" ...
> > "(" ...
> > ")" ...
> > "^" ...
> >  ...
> >  ...
> >  ...
> >  ...
> >  ...
> > "[" ...
> > "{" ...
> >  ...
> >
> > at org.apache.lucene.queryParser.QueryParser.generateParseException(QueryParser.java:1045)
> > at org.apache.lucene.queryParser.QueryParser.jj_consume_token(QueryParser.java:925)
> > at org.apache.lucene.queryParser.QueryParser.Clause(QueryParser.java:562)
> > at org.apache.lucene.queryParser.QueryParser.Query(QueryParser.java:500)
> > at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:108)
> > at com.stp.corr.cv.search.CVSearcher.getMatchedResults(CVSearcher.java:89)
> > at com.stp.test.CVTest.main(CVTest.java:223)
> >
> > -Original Message-
> > From: Stephane James Vaucher [mailto:[EMAIL PROTECTED]
> > Sent: Wednesday, August 25, 2004 10:07 AM
> > To: Lucene Users List
> > Subject: Re: what is wrong with query
> >
> > You'll have to give us more information than that...
> >
> > What is the problem you are seeing? I'll assume that you get no results.
> >
> > Tell us of the structure of your documents and how you index every
> > field.
> >
> > Concerning your syntax, if you are using the distributed query parser,
> > you don't need the + before name, nor the + before university as they
> > will be added by the parser.
> >
> > sv
> >
> > On Wed, 25 Aug 2004, Alex Kiselevski wrote:
> >
> > > Hi, pls,
> > > Tell me what is wrong with query:
> > > author:( +name AND "full name"~) AND book:( +university)
> > >
> > > Alex Kiselevsky
> > >  Speech Technology  Tel:972-9-776-43-46
> > > R&D, Amdocs - Israel  Mobile: 972-53-63 50 38
> > > mailto:[EMAIL PROTECTED]
> > >
> > > The information contained in this message is proprietary of Amdocs,
> > > protected from disclosure, and may be privileged. The information is
> > > intended to be conveyed only to the designated recipient(s) of the
> > > message. If the reader of this message is not the intended recipient,
> > > you are hereby notified that any dissemination, use, distribution or
> > > copying of this communication is strictly prohibited and may be
> > > unlawful. If you have received this communication in error, please
> > > notify us immediately by replying to the message and deleting it from
> > > your computer. Thank you.



Hebrew Analyzer

2004-08-25 Thread Alex Kiselevski

Hi, has anybody heard of a Hebrew Analyzer?

Alex Kiselevsky
 Speech Technology  Tel:972-9-776-43-46
R&D, Amdocs - IsraelMobile: 972-53-63 50 38
mailto:[EMAIL PROTECTED]





RE: worddoucments search

2004-08-25 Thread Karthik N S
Hi Santosh,

Please: if you have downloaded the Lucene (zip) bundle, first try to read
docs/index.html, which is in the bundle. If you are still in trouble, then
approach the forum for help. [Unnecessarily asking silly questions will be
ignored.]


Karthik




-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
Sent: Wednesday, August 25, 2004 3:01 PM
To: Lucene Users List
Subject: Re: worddoucments search


that part you have to do yourself.  It is easy, just create a new
Document, create an appropriate Field, give it a name and the string
value you got with textmining.org library, then add the Field to your
Document, and then add the Document to the index with IndexWriter.

Look at one of the articles about Lucene to get started.  I wrote one
called something like Introduction to Text Indexing with Lucene.  You
probably want to read that one to get going.

Otis

--- Santosh <[EMAIL PROTECTED]> wrote:

> I have gon through textmining.org, I am able to extract text in
> string
> format. but how can I get it as
> lucene document format
> - Original Message -
> From: "Otis Gospodnetic" <[EMAIL PROTECTED]>
> To: "Lucene Users List" <[EMAIL PROTECTED]>
> Sent: Tuesday, August 24, 2004 11:54 PM
> Subject: Re: worddoucments search
>
>
>  As I just answered in a separate email to Ryan - we used
> textmining.org library, too, as an example of something that is easier
> to use than POI.  It's been a while since I wrote that chapter, so it
> slipped my mind when I replied.  Yes, use textmining.org first, you'll
> be able to include it in your code in 2 minutes.  Good stuff.
>
>  Otis
>





How not to show results with the same score?

2004-08-25 Thread B. Grimm [Eastbeam GmbH]
Hi there,
I browsed through the list and tried a few different searches, but I did not 
find what I'm looking for.

I have an index generated by a bot that collects websites. There 
are pages like www.domain.de/article/1 and www.domain.de/article/1?page=1.
These different URLs have the same content, and when you search for a 
matching word, both are returned, which is correct.

They have exactly the same score because of their content and so on, so 
I would like to know if it's possible to "group by" (as in MySQL, of course) 
the returned score, so that only the first match is collected into 
"Hits" and all following matches with the same score are ignored.

It would be great if anyone has an idea how to do that.
Thanks and have a nice day.
bastian
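
One possible way to do this on the application side, sketched under assumptions: the results have already been read out of "Hits" into a list sorted by descending score, and ScoreDedup and Result are hypothetical names used only for this example. The method keeps the first result of each run of identical scores and skips the rest.

```java
import java.util.ArrayList;
import java.util.List;

/** Drops results whose score equals the score of the previously kept result. */
final class ScoreDedup {
    /** A search result reduced to the two values this sketch needs. */
    static final class Result {
        final String url;
        final float score;

        Result(String url, float score) {
            this.url = url;
            this.score = score;
        }
    }

    /** Keeps the first result of every run of equal scores; the input must be sorted by score. */
    static List<Result> dropEqualScores(List<Result> sorted) {
        List<Result> kept = new ArrayList<>();
        for (Result r : sorted) {
            // Compare only with the last kept result, since equal scores are adjacent.
            if (kept.isEmpty() || kept.get(kept.size() - 1).score != r.score) {
                kept.add(r);
            }
        }
        return kept;
    }
}
```

Equal scores do not prove equal content, so unrelated pages that happen to score the same would also be dropped; comparing a normalized URL or a hash of the page text may be a safer key.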


Re: Lock handling

2004-08-25 Thread Claes Holmerson
Yes, looking at the time of the lock was an idea I had but I could not
find anything like a time stamp. Am I missing something obvious here?

Claes

Otis Gospodnetic wrote:

> Hello,
>
> If you use Lucene incorrectly (e.g. 2 IndexWriters writing to the same
> index), you will see this error.  Lucene has no way of telling whether
> the lock file was left over from a previous process, or whether it's a
> valid lock file because another process is currently indexing documents
> or some such.
> You could try adding some logic to your app, though.  For instance, you
> can look at lock's timestamp, and using IndexReader.unlock(...) method
> to forcefully unlock the index.
>
> Otis
>
> --- Claes Holmerson <[EMAIL PROTECTED]> wrote:
>
> > Hello,
> >
> > I am interested to hear how people handle locked indexes, for example
> > when catching an IOException like below.
> >
> > java.io.IOException: Lock obtain timed out:
> > Lock@/tmp/lucene-0b978f2c0aa12e8dcdbd5b0df491bfc4-write.lock
> >    at org.apache.lucene.store.Lock.obtain(Lock.java:58)
> >    at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:223)
> >    at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:213)
> >
> > As far as I can tell, there is no good way to tell whether the lock is
> > only temporary (working as it should), or if it was created by a process
> > that later died, and therefore can not remove it. How can I detect the
> > latter case, and how should I best handle it?
> >
> > Thanks,
> > Claes

--
Claes Holmerson
Polopoly - Cultivating the information garden
Kungsgatan 88, SE-112 27 Stockholm, SWEDEN
Direct: +46 8 506 782 59
Mobile: +46 704 47 82 59
Fax:  +46 8 506 782 51
[EMAIL PROTECTED], http://www.polopoly.com



Re: worddoucments search

2004-08-25 Thread Otis Gospodnetic
that part you have to do yourself.  It is easy, just create a new
Document, create an appropriate Field, give it a name and the string
value you got with textmining.org library, then add the Field to your
Document, and then add the Document to the index with IndexWriter.

Look at one of the articles about Lucene to get started.  I wrote one
called something like Introduction to Text Indexing with Lucene.  You
probably want to read that one to get going.

Otis

--- Santosh <[EMAIL PROTECTED]> wrote:

> I have gon through textmining.org, I am able to extract text in
> string
> format. but how can I get it as
> lucene document format
> - Original Message -
> From: "Otis Gospodnetic" <[EMAIL PROTECTED]>
> To: "Lucene Users List" <[EMAIL PROTECTED]>
> Sent: Tuesday, August 24, 2004 11:54 PM
> Subject: Re: worddoucments search
> 
> 
>  As I just answered in a separate email to Ryan - we used
> textmining.org library, too, as an example of something that is easier
> to use than POI.  It's been a while since I wrote that chapter, so it
> slipped my mind when I replied.  Yes, use textmining.org first, you'll
> be able to include it in your code in 2 minutes.  Good stuff.
> 
>  Otis
> 





Re: Lucene Search Applet

2004-08-25 Thread Simon mcIlwaine
Hi Jon,

Where do I go to get the attached files?

Many Thanks

Simon

- Original Message - 
From: "Jon Schuster" <[EMAIL PROTECTED]>
To: "'Lucene Users List'" <[EMAIL PROTECTED]>
Sent: Monday, August 23, 2004 6:25 PM
Subject: RE: Lucene Search Applet


> Hi all,
>
> The changes I made to get past the System.getProperty issues are
> essentially the same in the three files
> org.apache.lucene.index.IndexWriter,
> org.apache.lucene.store.FSDirectory, and
> org.apache.lucene.search.BooleanQuery.
>
> Change the static initializations from a form like this:
>
>   public static long WRITE_LOCK_TIMEOUT =
>     Integer.parseInt(System.getProperty("org.apache.lucene.writeLockTimeout",
>                                         "1000"));
>
> to a separate declaration and static initializer block like this:
>
>   public static long WRITE_LOCK_TIMEOUT;
>   static
>   {
>     try
>     {
>       WRITE_LOCK_TIMEOUT =
>         Integer.parseInt(System.getProperty("org.apache.lucene.writeLockTimeout",
>                                             "1000"));
>     }
>     catch ( Exception e )
>     {
>       // Reading the property failed (e.g. denied in an unsigned applet):
>       // fall back to the default value.
>       WRITE_LOCK_TIMEOUT = 1000;
>     }
>   }
>
> As before, the variables are initialized when the class is loaded, but if
> the System.getProperty fails, the variable still gets initialized to its
> default value in the catch block.
>
> You can use a separate static block for each variable, or put them all
> into a single static block. You could also add a setter for each variable
> if you want the ability to set the value separately from the class init.
>
> In the FSDirectory class, the variables DISABLE_LOCKS and LOCK_DIR are
> marked final, which I had to remove to do the initialization as described.
>
> I've also attached the three modified files if you want to just copy and
> paste.
>
> --Jon
>
> -Original Message-
> From: Simon mcIlwaine [mailto:[EMAIL PROTECTED]
> Sent: Monday, August 23, 2004 7:37 AM
> To: Lucene Users List
> Subject: Re: Lucene Search Applet
>
> Hi,
>
> Just used the RODirectory and I'm now getting the following error:
> java.security.AccessControlException: access denied
> (java.util.PropertyPermission user.dir read) I'm reckoning that this is
what
> Jon was on about with System.getProperty() within certain files because im
> using an applet. Is this correct and if so can someone show me one of the
> hacked files so that I know what I need to modify.
>
> Many Thanks
>
> Simon
> .
> - Original Message -
> From: "Simon mcIlwaine" <[EMAIL PROTECTED]>
> To: "Lucene Users List" <[EMAIL PROTECTED]>
> Sent: Monday, August 23, 2004 3:12 PM
> Subject: Re: Lucene Search Applet
>
> > Hi Stephane,
> >
> > A bit of a stupid question but how do you mean set the system property
> > disableLuceneLocks=true? Can I do it from a call from FSDirectory API or
> do
> > I have to actually hack the code? Also if I do use RODirectory how do I
go
> > about using it? Do I have to update the Lucene JAR archive file with
> > RODirectory class included as I tried using it and its not recognising
the
> > class?
> >
> > Many Thanks
> >
> > Simon
> >
> > - Original Message -
> > From: "Stephane James Vaucher" <[EMAIL PROTECTED]>
> > To: "Lucene Users List" <[EMAIL PROTECTED]>
> > Sent: Monday, August 23, 2004 2:22 PM
> > Subject: Re: Lucene Search Applet
> >
> >
> > > Hi Simon,
> > >
> > > Does this work? From FSDirectory api:
> > >
> > > If the system property 'disableLuceneLocks' has the String value of
> > > "true", lock creation will be disabled.
> > >
> > > Otherwise, I think there was a Read-Only Directory hack:
> > >
> > >
http://www.mail-archive.com/[EMAIL PROTECTED]/msg05148.html
> > >
> > > HTH,
> > > sv
> > >
> > > On Mon, 23 Aug 2004, Simon mcIlwaine wrote:
> > >
> > > > Thanks Jon that works by putting the jar file in the archive
> attribute.
> > Now
> > > > im getting the disablelock error cause of the unsigned applet. Do I
> just
> > > > comment out the code anywhere where System.getProperty() appears in
> the
> > > > files that you specified and then update the JAR Archive?? Is it
> > possible
> > > > you could show me one of the hacked files so that I know what I'm
> > modifying?
> > > > Does anyone else know if there is another way of doing this without
> > having
> > > > to hack the source code?
> > > >
> > > > Many thanks.
> > > >
> > > > Simon
> > > >
> > > > - Original Message -
> > > > From: "Jon Schuster" <[EMAIL PROTECTED]>
> > > > To: "Lucene Users List" <[EMAIL PROTECTED]>
> > > > Sent: Saturday, August 21, 2004 2:08 AM
> > > > Subject: Re: Lucene Search Applet
> > > >
> > > >
> > > > > I have Lucene working in an applet and I've seen this problem only
> > when
> > > > > the jar file really was not available (typo in the jar name),
which
> is
> > > > > what you'd expect. It's possible that the classpath for your
> > > > > application is not the same as the classpath for the applet;
perhaps
> > > > > they're using different VMs or JREs from different locations.
> > > > >
> > > > > Try referencing the Lucene jar file in the archive attribute of
the
> > > > > apple

Re: worddoucments search

2004-08-25 Thread Santosh
I have gone through textmining.org, and I am able to extract text in string
format. But how can I get it into
Lucene document format?
- Original Message -
From: "Otis Gospodnetic" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Tuesday, August 24, 2004 11:54 PM
Subject: Re: worddoucments search


 As I just answered in a separate email to Ryan - we used textmining.org library, too, 
as an example of something that is easier to use than POI.  It's been a while since I 
wrote that chapter, so it slipped my mind when I replied.  Yes, use textmining.org 
first, you'll be able to include it in your code in 2 minutes.  Good stuff.

 Otis








Re: Lock handling

2004-08-25 Thread Otis Gospodnetic
Hello,

If you use Lucene incorrectly (e.g. 2 IndexWriters writing to the same
index), you will see this error.  Lucene has no way of telling whether
the lock file was left over from a previous process, or whether it's a
valid lock file because another process is currently indexing documents
or some such.
You could try adding some logic to your app, though.  For instance, you
can look at the lock's timestamp and use the IndexReader.unlock(...) method
to forcefully unlock the index.

Otis

--- Claes Holmerson <[EMAIL PROTECTED]> wrote:

> Hello,
> 
> I am interested to hear how people handle locked indexes, for example
> when catching an IOException like below.
> 
> java.io.IOException: Lock obtain timed out:
> Lock@/tmp/lucene-0b978f2c0aa12e8dcdbd5b0df491bfc4-write.lock
> at org.apache.lucene.store.Lock.obtain(Lock.java:58)
> at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:223)
> at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:213)
> 
> As far as I can tell, there is no good way to tell whether the lock is
> only temporary (working as it should), or if it was created by a process
> that later died, and therefore can not remove it. How can I detect the
> latter case, and how should I best handle it?
> 
> Thanks,
> Claes
> 
> 
> 
> 





RE: what is wrong with query

2004-08-25 Thread Stephane James Vaucher
From: http://jakarta.apache.org/lucene/docs/queryparsersyntax.html

Fuzzy Searches

Lucene supports fuzzy searches based on the Levenshtein Distance, or
Edit Distance algorithm. To do a fuzzy search use the tilde, "~", symbol
at the end of a Single word Term.

I haven't used fuzzy searches, but it seems to indicate that it can only
be used with single word terms. The query parser might have been written
to support that (the output indicates that as well).

HTH,
sv

On Wed, 25 Aug 2004, Alex Kiselevski wrote:

>
> I use QueryParser
> And I got an exception :
> org.apache.lucene.queryParser.ParseException: Encountered "~" at line 1,
> column 44.
> Was expecting one of:
>  ...
>  ...
>  ...
> "+" ...
> "-" ...
> "(" ...
> ")" ...
> "^" ...
>  ...
>  ...
>  ...
>  ...
>  ...
> "[" ...
> "{" ...
>  ...
>
> at org.apache.lucene.queryParser.QueryParser.generateParseException(QueryParser.java:1045)
> at org.apache.lucene.queryParser.QueryParser.jj_consume_token(QueryParser.java:925)
> at org.apache.lucene.queryParser.QueryParser.Clause(QueryParser.java:562)
> at org.apache.lucene.queryParser.QueryParser.Query(QueryParser.java:500)
> at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:108)
> at com.stp.corr.cv.search.CVSearcher.getMatchedResults(CVSearcher.java:89)
> at com.stp.test.CVTest.main(CVTest.java:223)
>
> -Original Message-
> From: Stephane James Vaucher [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, August 25, 2004 10:07 AM
> To: Lucene Users List
> Subject: Re: what is wrong with query
>
>
> You'll have to give us more information than that...
>
> What is the problem you are seeing? I'll assume that you get no results.
>
> Tell us of the structure of your documents and how you index every
> field.
>
> Concerning your syntax, if you are using the distributed query parser,
> you don't need the + before name, nor the + before university as they
> will be added by the parser.
>
> sv
>
> On Wed, 25 Aug 2004, Alex Kiselevski wrote:
>
> >
> > Hi, pls,
> > Tell me what is wrong with query:
> > author:( +name AND "full name"~) AND book:( +university)
> >
> >
> > Alex Kiselevsky
> >  Speech Technology  Tel:972-9-776-43-46
> > R&D, Amdocs - IsraelMobile: 972-53-63 50 38
> > mailto:[EMAIL PROTECTED]
> >
> >
> >
> >
>
>
>





RE: what is wrong with query

2004-08-25 Thread Alex Kiselevski

I use QueryParser
And I got an exception :
org.apache.lucene.queryParser.ParseException: Encountered "~" at line 1,
column 44.
Was expecting one of:
 ...
 ...
 ...
"+" ...
"-" ...
"(" ...
")" ...
"^" ...
 ...
 ...
 ...
 ...
 ...
"[" ...
"{" ...
 ...

at org.apache.lucene.queryParser.QueryParser.generateParseException(QueryParser.java:1045)
at org.apache.lucene.queryParser.QueryParser.jj_consume_token(QueryParser.java:925)
at org.apache.lucene.queryParser.QueryParser.Clause(QueryParser.java:562)
at org.apache.lucene.queryParser.QueryParser.Query(QueryParser.java:500)
at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:108)
at com.stp.corr.cv.search.CVSearcher.getMatchedResults(CVSearcher.java:89)
at com.stp.test.CVTest.main(CVTest.java:223)

-Original Message-
From: Stephane James Vaucher [mailto:[EMAIL PROTECTED]
Sent: Wednesday, August 25, 2004 10:07 AM
To: Lucene Users List
Subject: Re: what is wrong with query


You'll have to give us more information than that...

What is the problem you are seeing? I'll assume that you get no results.

Tell us of the structure of your documents and how you index every
field.

Concerning your syntax, if you are using the distributed query parser,
you don't need the + before name, nor the + before university as they
will be added by the parser.

sv

On Wed, 25 Aug 2004, Alex Kiselevski wrote:

>
> Hi, pls,
> Tell me what is wrong with query:
> author:( +name AND "full name"~) AND book:( +university)
>
>
> Alex Kiselevsky
>  Speech TechnologyTel:972-9-776-43-46
> R&D, Amdocs - Israel  Mobile: 972-53-63 50 38
> mailto:[EMAIL PROTECTED]
>
>
>
>





Re: what is wrong with query

2004-08-25 Thread Stephane James Vaucher
You'll have to give us more information than that...

What is the problem you are seeing? I'll assume that you get no results.

Tell us of the structure of your documents and how you index every field.

Concerning your syntax, if you are using the distributed query parser, you
don't need the + before name, nor the + before university as they will be
added by the parser.

sv

On Wed, 25 Aug 2004, Alex Kiselevski wrote:

>
> Hi, pls,
> Tell me what is wrong with query:
> author:( +name AND "full name"~) AND book:( +university)
>
>
> Alex Kiselevsky
>  Speech TechnologyTel:972-9-776-43-46
> R&D, Amdocs - Israel  Mobile: 972-53-63 50 38
> mailto:[EMAIL PROTECTED]
>
>
>
>

