Re: rc4 and FileNotFoundException: an update

2002-04-27 Thread Steven J. Owens


On Fri, Apr 26, 2002 at 07:05:23PM +0200, petite_abeille wrote:
 I guess it's really not my day...
 [...]
 Well, it's pretty ugly. Whatever I'm doing with Lucene in the previous
 package (com.lucene) is magnified manyfold in rc4. After processing a
 paltry 16 objects I got:
 
 SZFinder.findObjectsWithSpecificationInStore: 
 java.io.FileNotFoundException: _2.f14 (Too many open files)

 Sounds like a pretty nasty situation.  

 One suggestion I have for you is that Doug is usually very
helpful with problems like this IF you can first narrow down what is
happening to the point that you can post a clear, specific, isolated
test that consistently causes the problem to happen.  This makes sense
- any effort to solve the problem will first involve isolating the
bug, and that's a task you're best suited for, since you know your
system best.

 So maybe your best approach would be to take a copy of your
system as above, and start gradually stripping out stuff, testing
between each run, until you have most of the application-specific
stuff removed, but the problem still recurs consistently.  Then post
your code and ask if some of the more Lucene-knowledgeable people can
take a look.

 Re: index integrity, I agree that it would be really, really nice
to have some sort of sanity check.  I have yet to actually get into
the internals of the index, but I'd guess that there must be some sort
of at least superficial check, maybe some sort of format check.  

 If I was going to kludge something together, the first approach
I'd take would be to just open the index and roll through all of the
Documents in it, accessing all of the fields (or maybe just a few main
fields per Document).  I'm not sure what I'd *do* with the field
values (printing them out to the screen might take a while), other
than perhaps checking for nulls.  But I suspect that if the code gets
through that without causing an exception or getting null values,
then at least the index's internal format is intact.  Maybe the test
code could save the number of lucene Document objects in the index in
between checks (and, of course, update this number when you add or
remove documents), and make sure it still has the right number of
documents.
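
 The kludge described above can be sketched roughly as follows.  This
is a minimal sketch, not Lucene code: IndexView is a hypothetical
stand-in interface invented for illustration, modelled on the
IndexReader methods maxDoc(), isDeleted(int) and document(int).  The
IndexSanityCheck class, the map-of-fields representation and the
expected-count parameter are likewise assumptions, and the adapter
from a real IndexReader to IndexView is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical minimal view of an index; NOT a Lucene type.
interface IndexView {
    int maxDoc();                 // one more than the largest document number
    boolean isDeleted(int n);     // true if slot n holds a deleted document
    Map<String, String> document(int n) throws Exception; // fields of document n
}

final class IndexSanityCheck {
    /** Walks every live document; an empty result means the check passed. */
    static List<String> check(IndexView index, int expectedLiveDocs,
                              String... requiredFields) {
        List<String> problems = new ArrayList<>();
        int live = 0;
        for (int n = 0; n < index.maxDoc(); n++) {
            if (index.isDeleted(n)) {
                continue;         // deleted slots are not counted or read
            }
            live++;
            try {
                Map<String, String> doc = index.document(n);
                if (doc == null) {
                    problems.add("doc " + n + ": null document");
                    continue;
                }
                for (String field : requiredFields) {
                    if (doc.get(field) == null) {
                        problems.add("doc " + n + ": missing field " + field);
                    }
                }
            } catch (Exception e) {
                // Any failure while reading a document is reported, not rethrown.
                problems.add("doc " + n + ": " + e);
            }
        }
        if (live != expectedLiveDocs) {
            problems.add("expected " + expectedLiveDocs
                    + " live documents, found " + live);
        }
        return problems;
    }
}
```

 The expected count is the number the application saved the last time
it added or removed documents.  A pass only shows that every document
could be read back; it says nothing about whether the postings are
correct.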

 As for repairing an index, I think that's working sort of against
the grain of Lucene.  In your case, it sounds like rebuilding the
index is important, because you're using Lucene as a data store.  I
have some similar issues myself in some things I want to build (I end
up wanting both a data store and a search index; ultimately I've ended
up choosing to have a separate data store for the extra data).  But
Lucene is a search index, meant to be used more in a cache-like style,
so there's an underlying assumption that the original data is always
around to reindex.  Thus, repairing an index is less important, since
it is assumed you can always rebuild it.  

 I don't know much of the theories behind data store systems.  It
occurs to me that using Lucene as a data store, you'll always be
working against the grain, always swimming upstream.  Maybe it'd be a
better idea to figure out some way to use Lucene as the indexing
technology in a data store, the way traditional RDBMSes use indexes,
for speeding access.  

 Or possibly you should look at Xindice (http://xml.apache.org/xindice/)
which is an XML database.  You might find it easier to adapt that to your
needs.  I'm kind of curious as to how fast Xindice's XPath execution is, and
what their indexing is based on - there might be a use for Lucene there.

Steven J. Owens
[EMAIL PROTECTED]

--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]




Re: rc4 and FileNotFoundException: an update

2002-04-27 Thread petite_abeille

Hi Steven,

 Sounds like a pretty nasty situation.

It is...

 This makes sense - any effort to solve the problem will first involve
 isolating the bug, and that's a task you're best suited for, since you
 know your system best.

Ok... From what I understand, this situation arises depending on my
usage pattern of Lucene. For example, if I use it in batch mode (e.g.,
through some tools to stress test my app by loading a couple of million
objects), everything works perfectly fine. However, when running my
app in a more interactive mode (e.g., with user interaction, object
indexing, writing and searching at the same time) I run into this
exception very quickly. The problem seems to have something to do with
Searchers and/or how I'm using them. I need to investigate in that
direction... Also, what is the magic formula for keeping
RandomAccessFile usage in Lucene to a strict minimum? Is
IndexWriter.mergeFactor the only parameter I can play with, or am I
missing some other configuration that might help?
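
If the interactive path opens a new searcher for each request and those
searchers are not closed promptly, each one can keep its own set of
index files open, which would be one way to reach the per-process
descriptor limit. That is an assumption about the cause, not a
diagnosis. A minimal sketch of sharing one open searcher across
concurrent queries with reference counting might look like this.
SharedResource and its method names are hypothetical, and the class is
written against java.io.Closeable rather than any Lucene type.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Reference-counted holder so that many queries share one open resource. */
final class SharedResource<T extends Closeable> {
    private final T resource;
    private int refs = 1;             // the holder itself owns one reference
    private boolean retired = false;

    SharedResource(T resource) {
        this.resource = resource;
    }

    /** Borrow the resource for one query; pair every acquire() with release(). */
    synchronized T acquire() {
        if (retired || refs == 0) {
            throw new IllegalStateException("resource retired");
        }
        refs++;
        return resource;
    }

    /** Give back one reference; the resource is closed when none remain. */
    synchronized void release() {
        if (refs == 0) {
            throw new IllegalStateException("release without acquire");
        }
        if (--refs == 0) {
            try {
                resource.close();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    /** Drop the holder's own reference; calling it again has no effect. */
    synchronized void retire() {
        if (retired) {
            return;
        }
        retired = true;
        release();
    }
}
```

Each query would call acquire() before searching and release() in a
finally block. After the index changes, the application would open a
fresh searcher, wrap it in a new holder, and call retire() on the old
one, so the old files are closed as soon as the last in-flight query
finishes.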

 Then post your code and ask if some of the more lucene-knowledgeable 
 can take a look.

Unfortunately, it's not that straightforward, as I'm using Lucene as part
of some sort of custom-built OODBMS and this behavior seems to be usage
related... You can check the app at http://homepage.mac.com/zoe_info/ if
that helps.

  Re: index integrity, I agree that it would be really, really nice to 
 have some sort of sanity check.

I'm not familiar with Lucene internals, but is it conceivable to have 
some sort of checksum per document and/or index that will help to 
identify corrupted data?
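
One way such a per-document checksum could look, purely as a sketch:
compute a CRC32 over the object's field names and values in the
external store, keep it alongside the document (for instance in a
keyword field, which is an assumption and not something Lucene does by
itself), and recompute it later to compare. DocChecksum and the
map-of-fields representation are hypothetical, and values are assumed
to be non-null.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.CRC32;

final class DocChecksum {
    /** CRC32 over all fields, visited in sorted name order so the result is stable. */
    static long of(Map<String, String> fields) {
        CRC32 crc = new CRC32();
        for (Map.Entry<String, String> e : new TreeMap<>(fields).entrySet()) {
            update(crc, e.getKey());
            update(crc, e.getValue());
        }
        return crc.getValue();
    }

    // Each string is prefixed with its byte length so that field
    // boundaries contribute to the checksum.
    private static void update(CRC32 crc, String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        int len = bytes.length;
        crc.update(len >>> 24);   // update(int) uses only the low 8 bits
        crc.update(len >>> 16);
        crc.update(len >>> 8);
        crc.update(len);
        crc.update(bytes, 0, bytes.length);
    }
}
```

A CRC only detects accidental damage to the data it was computed from.
It cannot tell whether the index postings themselves are intact, and it
offers no way to repair anything.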

 As for repairing an index, I think that's working sort of against the 
 grain of Lucene.

:-(

 In your case, it sounds like rebuilding the index is important, because 
 you're using Lucene as a data store.

Well, not exactly. I'm just using Lucene to index my data store (with a
bunch of Field.Keyword and Field.UnStored). The actual object storage is
handled externally to Lucene. However, I need a consistent index as I'm
using it as part of my object tree.

 Maybe it'd be a better idea to figure out some way to use Lucene as the
 indexing technology in a data store, the way traditional RDBMSes use
 indexes, for speeding access.

I agree. It's how I'm using it more or less. Nevertheless, for the sake 
of reliability, I need to have some level of confidence that the 
underlying indexes are sane... And a way to correct the problem if 
they are not. In my case, I will happily trade speed for reliability as 
I cannot afford to have inconsistent indexes. A corrupted index is of
no use to me.

 Or possibly you should look at Xindice (http://xml.apache.org/xindice/) 
 which is an XML database.

I'm familiar with Xindice and other related toolboxes. However, I have
some peculiar requirements, so I decided to custom-build my own
persistence layer. Works fine so far. Just this very annoying exception.
Also, this situation seems to arise on UNIX systems only, as I have never
heard anybody complain about it on any Windows-type platform... Very odd
in any case...

Thanks for your help in any case.

PA

