Re: index: how to store binary data or objects ?

2004-02-10 Thread Andrzej Bialecki
Dror Matalon wrote:

On Tue, Feb 10, 2004 at 03:59:50AM +0100, [EMAIL PROTECTED] wrote:

Hi Lucene Users!

Searching the documentation, API and this mailinglist results in:
"no way to store objects or binary data in an UnIndexed
org.apache.lucene.document.Field to attach it to the index directly"
Is there a way to do this? What would you suggest to do?


1. Store the binary data in files and store the path in Lucene. There are
scalability issues here when you handle more than a few hundred
thousand objects.
Just a comment: for ext2fs and BSD FFS (dunno about NT) scalability 
issues with this approach can be partially addressed by building a tree 
of subdirectories, instead of using just one. I.e. a file named 
"myThesis.pdf" would go into /m/y/t/myThesis.pdf. This way the time 
needed to list the files in a given directory is reduced (both unixes 
can already cache the inode numbers for name/inode lookup, so there is 
no significant time increase to look up a longer path).
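
As a rough illustration of the subdirectory layout described above, here is a
minimal Java sketch. The class name ShardedPath, the method shardedPath and the
depth parameter are made up for this example and are not part of Lucene or of
any tool mentioned in this thread. It assumes the leading letters or digits of
the file name are a good enough key.

```java
import java.io.File;
import java.util.Locale;

/** Sketch: map a file name to a nested directory based on its leading characters. */
final class ShardedPath {

    /**
     * Returns root/c1/c2/.../fileName, using up to `depth` leading letters
     * or digits of the name (lower-cased) as directory levels.
     */
    static File shardedPath(File root, String fileName, int depth) {
        File dir = root;
        String lower = fileName.toLowerCase(Locale.ROOT);
        int levels = 0;
        for (int i = 0; i < lower.length() && levels < depth; i++) {
            char c = lower.charAt(i);
            if (Character.isLetterOrDigit(c)) {   // skip dots, spaces, etc.
                dir = new File(dir, String.valueOf(c));
                levels++;
            }
        }
        return new File(dir, fileName);
    }

    public static void main(String[] args) {
        // "myThesis.pdf" lands under the m, y and t subdirectories of /data
        System.out.println(shardedPath(new File("/data"), "myThesis.pdf", 3));
    }
}
```

Names that share a long common prefix will still cluster in the same
directory, so this only spreads files evenly when the leading characters vary.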

FreeBSD also has a special kind of filesystem, which uses inodes in a 
flat space (no directories). It was specifically designed for storing 
large numbers of files efficiently. Recent versions of Java on FreeBSD 
(1.4.2) seem to be very stable and performing well, so that could also 
be an option.

After all, a filesystem _is_ a kind of very specialized database... ;-)

--
Best regards,
Andrzej Bialecki
-
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-
FreeBSD developer (http://www.freebsd.org)
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Another Newbie question--FSDirectory

2004-02-10 Thread Otis Gospodnetic
You should probably always try to use Directory, rather than a String or
FSDirectory.
Directory is the most abstract 'index type and location entity', and
using it smartly allows you to change your index type and location more
easily, should you ever choose to do that.

Otis

--- Scott Smith <[EMAIL PROTECTED]> wrote:
> I was creating the IndexSearcher using a standard String containing
> the
> Lucene index directory pathname.  I noticed that "everyone" seems to
> create a FSDirectory and use that to create the searcher.  However,
> no
> one seems to use this for the IndexWriter.  Can someone tell me what
> the
> advantage of using the FSDirectory is over just specifying the index
> directory in a String? why it would be used for searching, but not
> for
> indexing?  Anything else relevant to FSDirectory?  Is it mostly
> convention and the real reason for its existence is when you use the
> RAMDirectory or CompoundFileDirectory (or you don't want some method
> to
> care what flavor directory it is looking at)?
> 
> Scott



Re: Index advice...

2004-02-10 Thread Otis Gospodnetic
Without seeing more information/code, I can't tell which part of your
system slows down with time, but I can tell you that Lucene's 'add'
does not slow over time (i.e. as the index gets larger).  Therefore, I
would look elsewhere for causes of the slowdown.
The easiest thing to do is add logging to suspicious portions of the
code.  That will narrow the scope of the code you need to analyze.
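
One way to do this without a profiler is to time the parsing step and the
index-add step separately. The sketch below uses only the JDK. PhaseTimer is a
hypothetical helper written for this example, not a Lucene or PDFBox class.

```java
import java.util.concurrent.Callable;

/** Sketch: accumulate wall-clock time spent in one phase of a loop. */
final class PhaseTimer {
    private long totalNanos;
    private int calls;

    /** Runs the step, records how long it took, and returns its result. */
    <T> T time(Callable<T> step) throws Exception {
        long start = System.nanoTime();
        try {
            return step.call();
        } finally {
            totalNanos += System.nanoTime() - start;
            calls++;
        }
    }

    /** Average milliseconds per call so far (0 if nothing was timed). */
    double averageMillis() {
        return calls == 0 ? 0.0 : totalNanos / 1e6 / calls;
    }

    int calls() {
        return calls;
    }
}
```

In the add loop one would keep two timers, wrap the PDF parsing call in one
and writer.addDocument(doc) in the other, and log both averages every few
hundred files. Whichever average climbs is the part to look at.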

Otis


--- [EMAIL PROTECTED] wrote:
> Hey Lucene-users,
> 
> I'm setting up a Lucene index on 5G of PDF files (full-text search). 
> I've 
> been really happy with Lucene so far but I'm curious what tips and
> strategies 
> I can use to optimize my performance at this large size.
> 
> So far I am using pretty much all of the defaults (I'm new to
> Lucene).
> 
> I am using PDFBox to add the documents to the index.
> I can usually add about 800 or so PDF files and then the add loop:
> 
> for (int i = 0; i < fileNames.length; i++) {
>   // parse the PDF into a Lucene Document, then add it to the index
>   Document doc = IndexFile.index(baseDirectory + documentRoot + fileNames[i]);
>   writer.addDocument(doc);
> }
> 
> 
> really starts to slow down.  Doesn't seem to be memory related.
> Thoughts anyone?
> 
> Thanks in advance,
> CK Hill



Re: Ordering by a field value

2004-02-10 Thread Otis Gospodnetic
There were some recent contributions that should make this possible and
simple to do.
The code should be added to Lucene CVS repository in the next week or
so.

Otis

--- Gabe <[EMAIL PROTECTED]> wrote:
> 
> Hi,
> 
> I was wondering whether it was possible to sort search
> results by the order of the String value of a stored
> or unstored field. How would one implement this?
> 
> Thanks,
> Gabe



Re: index: how to store binary data or objects ?

2004-02-10 Thread petite_abeille
On Feb 10, 2004, at 03:59, [EMAIL PROTECTED] wrote:

Is there a way to do this?
Lucene deals with text. You could always serialize your objects in a 
byte array, hex encode them or something, and store that in an 
appropriate field.
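
A minimal sketch of the "serialize and hex encode" idea, using only the JDK.
HexSerializer and its method names are invented for this example. The
resulting string would then go into a stored text field, which is not shown
here. Note that hex encoding doubles the size of the data.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

/** Sketch: turn a Serializable object into a hex string and back. */
final class HexSerializer {

    static String toHex(Serializable obj) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : bytes.toByteArray()) {
            hex.append(Character.forDigit((b >> 4) & 0xF, 16));  // high nibble
            hex.append(Character.forDigit(b & 0xF, 16));         // low nibble
        }
        return hex.toString();
    }

    static Object fromHex(String hex) throws IOException, ClassNotFoundException {
        byte[] data = new byte[hex.length() / 2];
        for (int i = 0; i < data.length; i++) {
            data[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        }
    }
}
```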

What would you suggest to do?
Don't store your objects in Lucene :)

As others have pointed out, you will be much better off storing your 
object somewhere else (files, db, btree, whatever) and only use Lucene 
to store a reference to those objects.

For a concrete example of this approach, take a look at ZOE [1] source 
code [2].

It uses Lucene for, er, indexing and JDBM [3] to store the 
corresponding object's binaries.

Cheers,

PA.

[1] http://zoe.nu/itstories/story.php?data=stories&num=16&sec=1
[2] http://zoe.nu/misc/Workspace20031122.tgz
[3] http://jdbm.sourceforge.net/


Re: Index advice...

2004-02-10 Thread Leo Galambos
Otis Gospodnetic wrote:

Without seeing more information/code, I can't tell which part of your
system slows down with time, but I can tell you that Lucene's 'add'
does not slow over time (i.e. as the index gets larger).  Therefore, I
would look elsewhere for causes of the slowdown.
 

Otis, can you point me to some evidence that the time of an "insert"
operation does not depend on the index size, please? I think the amortized
time of "insert" is O(log(docsIndexed/mergeFac)), so I do not see how it
could be O(1).

Thank you.
Leo
AFAIK the slowdown with PDF files may be caused by the PDF parser (I have
already encountered this with PDFBox).

The easiest thing to do is add logging to suspicious portions of the
code.  That will narrow the scope of the code you need to analyze.
Otis



Re: index: how to store binary data or objects ?

2004-02-10 Thread petite_abeille
On Feb 10, 2004, at 09:32, Andrzej Bialecki wrote:

Just a comment: for ext2fs and BSD FFS (dunno about NT) scalability 
issues with this approach can be partially addressed by building a 
tree of subdirectories, instead of using just one. I.e. a file named 
"myThesis.pdf" would go into /m/y/t/myThesis.pdf. This way the time 
needed to list the files in a given directory is reduced (both unixes 
can already cache the inode numbers for name/inode lookup, so there is 
no significant time increase to look up a longer path).
Yes. But you have to watch out for the overall path length limit. An 
alternative strategy is to hash your keys and use the hash as the 
directory path. This is what some browsers do to store their cache.
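
A small sketch of the hashing variant, assuming SHA-1 as the hash and two
directory levels taken from the first two bytes of the digest. Both choices,
as well as the HashedPath name, are assumptions made for this example. The
path length below the root is fixed no matter how long the key is.

```java
import java.io.File;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Sketch: derive a fixed-length storage path from an arbitrary key. */
final class HashedPath {

    static File hashedPath(File root, String key) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-1")
                    .digest(key.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            // first two bytes become directory levels, the rest is the file name
            File level1 = new File(root, hex.substring(0, 2));
            File level2 = new File(level1, hex.substring(2, 4));
            return new File(level2, hex.substring(4));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }
}
```

The original key is no longer visible in the path, so it has to be kept
somewhere else (for instance in the Lucene document) if it is needed later.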

FreeBSD also has a special kind of filesystem, which uses inodes in a 
flat space (no directories). It was specifically designed for storing 
large numbers of files efficiently. Recent versions of Java on FreeBSD 
(1.4.2) seem to be very stable and performing well, so that could also 
be an option.
Yes. The file system can be used to simulate a fairly reasonable 
database. The problem then is not so much the time it takes to look up 
those files, but rather opening, reading, and closing them.

After all, a filesystem _is_ a kind of very specialized database... ;-)
This is true and works quite well to a certain extent.

But it suffers from one major flaw in my experience: you run out of 
file descriptors very quickly. And let's face it, it's also quite slow 
:)

Cheers,

PA.



Re: Index advice...

2004-02-10 Thread Otis Gospodnetic

--- Leo Galambos <[EMAIL PROTECTED]> wrote:
> Otis Gospodnetic wrote:
> 
> > Without seeing more information/code, I can't tell which part of your
> > system slows down with time, but I can tell you that Lucene's 'add'
> > does not slow over time (i.e. as the index gets larger).  Therefore, I
> > would look elsewhere for causes of the slowdown.
> 
> Otis, can you point me to some proofs that time of "insert" operation
> does not depend on the index size, please? Amortized time of "insert"
> is O(log(docsIndexed/mergeFac)), I think.

This would imply that Lucene gets slower as it adds more documents to
the index.  Have you observed this behaviour?  I haven't.

> Thus I do not know how it could be O(1).

~ O(1) is what I have observed through experiments with indexing of
several million documents.

Otis




Re: Index advice...

2004-02-10 Thread Scott ganyo
I have.  While document.add() itself doesn't increase over time, the 
merge does.  Ways of partially overcoming this include increasing the 
mergeFactor (but this will increase the number of file handles used), 
or building blocks of the index in memory and then merging them to 
disk.  This has been discussed before, so you should be able to find 
additional information on this fairly easily.

Scott

On Feb 10, 2004, at 7:55 AM, Otis Gospodnetic wrote:

This would imply that Lucene gets slower as it adds more documents to
the index.  Have you observed this behaviour?  I haven't.


Re: Index advice...

2004-02-10 Thread petite_abeille
On Feb 10, 2004, at 14:03, Scott ganyo wrote:

I have.  While document.add() itself doesn't increase over time, the 
merge does.  Ways of partially overcoming this include increasing the 
mergeFactor (but this will increase the number of file handles used), 
or building blocks of the index in memory and then merging them to 
disk.  This has been discussed before, so you should be able to find 
additional information on this fairly easily.
This is what I noticed also: adding documents by itself is a fairly 
benign operation, but anything that triggers an index merge in one form 
or another is a killer as an index grows in size.

So, overall, adding more documents does slow down the indexing.

At least this is the impression I get. But I would love to be proven 
wrong on this :)

Cheers,

PA.



Re: Index advice...

2004-02-10 Thread Leo Galambos
Otis Gospodnetic wrote:

Thus I do not know how it could be O(1).

~ O(1) is what I have observed through experiments with indexing of
several million documents.

What exactly did you measure? Just the time of the insert operation 
(incl. merge(), of course)? Was it a test on real documents?

THX
Leo


Re: index: how to store binary data or objects ?

2004-02-10 Thread Markus Brosch
> 1. Store the binary data in files and store the path in Lucene. There are
> scalability issues here when you handle more than a few hundred
> thousand objects.

> 2. Store the binary data in a database and store a unique id in Lucene.
> This will scale better but binary data fetching from the db might be
> slow.

Thank you all for your comments!
In general I understand your suggestions - mostly because of scaling issues.

My application will deal with "small" data sets. The problem is that I want
to index the content (String) of some objects, and then refer back to the
object once I have found it by a keyword or whatever.  So, would a simple map
or tree do?

Another problem is that my objects can change their content and must be
"reindexed". Is it possible to remove the index entry for a single object and
build a new one without reindexing everything?

Thank you for help!
Best regards, Markus






Re: index: how to store binary data or objects ?

2004-02-10 Thread petite_abeille
On Feb 10, 2004, at 14:53, Markus Brosch wrote:

My application will deal with "small" data sets. The problem is, that I want
to index the content (String) of some objects. I want to refer to that
object once I found this by a keyword or whatever.  So, using a simple map or
tree?
Something along these lines:

- When indexing your object, you create one Lucene document for it and 
store its unique identifier as a keyword along side whatever you want 
to index.

- When retrieving your documents, you can use this keyword to reference 
your object.
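
For a small data set, the "somewhere else" that holds the objects can be as
simple as an in-memory map keyed by that identifier. The sketch below shows
only that side. ObjectRegistry is a made-up class, the identifiers are random
UUIDs purely for illustration, and the Lucene document that would carry the
same identifier as a keyword is not shown.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

/** Sketch: keep objects outside the index, addressed by a unique identifier. */
final class ObjectRegistry<T> {
    private final Map<String, T> byId = new ConcurrentHashMap<String, T>();

    /** Stores the object and returns the identifier to put in the index. */
    String register(T obj) {
        String id = UUID.randomUUID().toString();
        byId.put(id, obj);
        return id;
    }

    /** Resolves an identifier read back from a search hit; null if unknown. */
    T lookup(String id) {
        return byId.get(id);
    }
}
```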

Another problem is, that my objects can change their content and must be
"reindexed". Is it possible to remove the single index for that object and
build a new one without reindexing all?
Yes.

Cheers,

PA.



Re: Index advice...

2004-02-10 Thread Otis Gospodnetic

--- Leo Galambos <[EMAIL PROTECTED]> wrote:
> Otis Gospodnetic wrote:
> 
> > > Thus I do not know how it could be O(1).
> >
> > ~ O(1) is what I have observed through experiments with indexing of
> > several million documents.
> 
> What exactly did you measure? Just the time of the insert operation 
> (incl. merge(), of course)? Was it a test on real documents?

I didn't really measure anything, I only observed this, as my focus was
something else, not performance measurements.
It is true that every time an insert/add triggers a merge operation,
things will slow down, but from what I recall (and this was about 1
year ago), the overall performance was steady as the index grew.

Documents were artificially created from random dictionary words.
Their size was variable, but not by a lot.

Otis





RE: Lucene 1.3 final -- Lock/Segment permission errors

2004-02-10 Thread Clay, Brian
Thanks.

It turns out that we have to edit either the java.policy or the
[application].policy file to grant access to the directory that we intend
to use. Unfortunately it is not as dynamic as one would hope.

For those who might have the same problem, here is the line that is added to
the .policy file:

permission java.io.FilePermission "[directory]", "read,write,delete";

Bri

-----Original Message-----
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] 
Sent: Monday, February 09, 2004 3:52 PM
To: Lucene Users List
Subject: Re: Lucene 1.3 final -- Lock/Segment permission errors


I would first look at the exact command line that is used to start the app
server.  Could it be that it includes something like
-Djava.io.tmpdir=some-directory-here ? Lucene uses the java.io.tmpdir System
property to determine the location/directory to use for lock files.  Maybe
this app server uses some directory with insufficient permissions.


Otis

--- "Clay, Brian" <[EMAIL PROTECTED]> wrote:
> We are in the process of migrating an application from WAS 5.1 to Sun 
> Application Server 8 (for technical reasons version 7 is incompatible 
> with our application). Under WAS Lucene works great; however, when
> building an index under Sun we get a java.security.AccessControlException
> when Lucene attempts to delete any segment file. The same error is thrown
> any time the application attempts to delete the write.lock file. It seems
> that the permissions are fine for read/write; however, delete is illegal,
> though the documents created are deleted without error.
> 
> Has anyone gotten Lucene to work on Sun Application Server 8 or 
> encountered this problem?
> 
> Any help would be great.
> 
> Thanks,
> 
> Brian Clay
> 



RE: Index advice...

2004-02-10 Thread Chong, Herb
The merges start taking longer and longer on my systems.

Herb

-----Original Message-----
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
Sent: Tuesday, February 10, 2004 9:18 AM
To: Lucene Users List
Subject: Re: Index advice...

It is true that every time an insert/add triggers a merge operation,
things will slow down, but from what I recall (and this was about 1
year ago), the overall performance was steady as the index grew.




Re: Ordering by a field value

2004-02-10 Thread Gabe

Thanks Otis.

What will the names of the relevant files be and will
I be able to use 1.3 final still (simply integrating
the contributions into my own code) or would I have to
go with the latest code from CVS?

Thanks again,
Gabe

--- Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
> There were some recent contributions that should make this possible and
> simple to do.
> The code should be added to Lucene CVS repository in the next week or
> so.
> 
> Otis
> 


Re: Ordering by a field value

2004-02-10 Thread Otis Gospodnetic
I suggest you check either the patches link on the Lucene site or the
lucene-dev mailing list archives for details.
You should be able to drop the new classes into an unzipped Lucene 1.3
final JAR and re-jar everything.

Otis


--- Gabe <[EMAIL PROTECTED]> wrote:
> 
> Thanks Otis.
> 
> What will the names of the relevant files be and will
> I be able to use 1.3 final still (simply integrating
> the contributions into my own code) or would I have to
> go with the latest code from CVS?
> 
> Thanks again,
> Gabe
> 


Re: Index advice...

2004-02-10 Thread Leo Galambos
Otis Gospodnetic wrote:

I didn't really measure anything, I only observed this, as my focus was
something else, not performance measurements.
It is true that every time an insert/add triggers a merge operation,
things will slow down, but from what I recall (and this was about 1
year ago), the overall performance was steady as the index grew.
 

Try the same test with mergeFactor=2, you will see the difference.
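
To see why mergeFactor matters, here is a toy model, not Lucene's actual merge
code. It assumes every added document starts as a one-document segment and
that whenever mergeFactor segments of the same size exist they are merged into
one, rewriting every document in them. In-memory buffering is ignored, and
MergeCostModel is a name invented for this example.

```java
/** Toy model of logarithmic merging; counts how many documents get rewritten. */
final class MergeCostModel {

    /** Total documents rewritten by merges while adding numDocs documents one at a time. */
    static long docsCopied(int numDocs, int mergeFactor) {
        if (mergeFactor < 2) {
            throw new IllegalArgumentException("mergeFactor must be at least 2");
        }
        // segmentsAtLevel[k] = number of segments currently holding mergeFactor^k docs
        int[] segmentsAtLevel = new int[64];
        long copied = 0;
        for (int d = 0; d < numDocs; d++) {
            segmentsAtLevel[0]++;
            long size = 1;
            // cascade: a full level is merged into one segment on the next level
            for (int level = 0; segmentsAtLevel[level] == mergeFactor; level++) {
                segmentsAtLevel[level] = 0;
                size *= mergeFactor;
                copied += size;          // every document in the merged segment is rewritten
                segmentsAtLevel[level + 1]++;
            }
        }
        return copied;
    }
}
```

Under these assumptions, adding 1000 documents with mergeFactor=10 rewrites
3000 documents in total (three per document), while adding 1024 documents with
mergeFactor=2 rewrites 10240 (ten per document). The per-document cost grows
with the logarithm of the index size, and a larger mergeFactor flattens that
growth, which may be why it is hard to notice with the default setting.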

Leo



RE: Another Newbie question--FSDirectory

2004-02-10 Thread Scott Smith
I agree that is sensible.  Does FSDirectory provide anything more than a
wrapper around OS directory structures?

-----Original Message-----
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, February 10, 2004 1:42 AM
To: Lucene Users List
Subject: Re: Another Newbie question--FSDirectory


You should probably always try to use Directory, and not String nor
FSDirectory. Directory is the most abstract 'index type and location
entity', and using it smartly allows you to change your index type and
location more easily, should you ever choose to do that.

Otis




The First Parameter of the IndexWriter

2004-02-10 Thread Caroline Jen
I am constructing a web site.  I am learning
Lucene so that I can use it to search the database.  I
started by reading the "Introduction In Text Indexing
with Jakarta Apache Lucene" at
http://www.onjava.com/pub/a/onjava/2003/01/15/lucene.html

and in the example given, it looks like I have to
specify a directory for the first parameter of the
IndexWriter (see below).

   String indexDir = System.getProperty("java.io.tmpdir", "tmp")
       + System.getProperty("file.separator") + "index-1";

   Analyzer analyzer = new StandardAnalyzer();
   boolean createFlag = true;

   IndexWriter writer = new IndexWriter(indexDir, analyzer, createFlag);

I have a record created and stored in a table in my
database whenever a user submits his/her inputs.  And
I want to index that record.  What should be the
indexDir in my case?  Should I follow the above
example and use "java.io.tmpdir"?  I sort of doubt it.
 Please advise.




Re: The First Parameter of the IndexWriter

2004-02-10 Thread Stephane James Vaucher
You should probably take a look at the javadoc:
http://jakarta.apache.org/lucene/docs/api/index.html

As for where to store the index, you'll want to put it somewhere where all
potential users can access it, as well as where there is enough space for
your index. In a nutshell, you need to think of:
- amount of storage required
- permissions (e.g. if you need to access it from an app server with
security restrictions)
- access, on a shared HD or not
- deployment, if for a product, then it should be included in your
installation strategy, so you might use c:/Program Files/.../MyApp/index,
or /usr/local/MyApp/index.

On Windows, I personally use my D drive, in a path corresponding
to d://index.

HTH,
sv



commit.lock file

2004-02-10 Thread Supun Edirisinghe
Hi everybody, I'm new to the mail list.

I'm also new to using Lucene.

We use lucene to index some of our pages.

Sometimes (for a reason unknown to us) a commit.lock file is left behind and 
searches using the index don't work.

What are some of the causes for this commit.lock file to persist?

I've read in the FAQ that it is written so that access to the segments 
is synchronized correctly.

What are some good strategies to make this file go away? Would it 
be a good idea to have a program check the timestamp on 
that file and delete it if it has been there for a long time?
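
The timestamp check itself is easy to write with the JDK alone, as in the
sketch below. StaleLockCheck and the age threshold are invented for this
example. Whether it is safe to delete the file afterwards is a separate
question: that only holds if no reader or writer is still using the index,
which this code cannot tell.

```java
import java.io.File;

/** Sketch: decide whether a lock file has been lying around for too long. */
final class StaleLockCheck {

    /**
     * True if the file exists and was last modified more than
     * maxAgeMillis before nowMillis.
     */
    static boolean isStale(File lockFile, long maxAgeMillis, long nowMillis) {
        return lockFile.exists()
                && nowMillis - lockFile.lastModified() > maxAgeMillis;
    }
}
```

A caller would pass System.currentTimeMillis() as nowMillis and a threshold
well above the longest indexing run it expects.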

all comments are welcome.

thanks



how to "re-index"

2004-02-10 Thread Markus Brosch
> When retrieving your documents, you can use this keyword to reference 
> your object.
> 
> > Another problem is, that my objects can change their content and must 
> > be "reindexed". Is it possible to remove the single index for that
> > object and build a new one without reindexing all?
> 
> Yes.

Thank you for your answers!

However, I have problems with "reindexing". 
First, I index all my object contents. Then some of these objects can change
and need to be re-indexed. 

I did it with IndexWriter(Dir, Analyzer, FALSE). With the boolean value
"false" the new document will be added to the index, but the old document still
remains in the index :-/ 

Any suggestions? THANK YOU!




Re: how to "re-index"

2004-02-10 Thread Markus Brosch
> However, I have problems with "reindexing". 
> First, I index all my object contents. Then some of these objects can
> change
> and need to be re-indexed. 
> 
> I did it with IndexWriter(Dir, Analyzer, FALSE). With the boolean value
> "false" the new document will be added to the index, but the old document
> still remains in the index :-/ 

Sorry for the second mail, but maybe I should say that I am looking for an
UPDATE of the index! What I am doing at the moment is adding (see above) and
deleting with IndexReader ...

Thanks ;-)
