Hello Erick,

I guess I'm the one who should ask for pardon - certainly not you! It seems your first guess could already be the correct one. Disk space IS kind of short, and I believe it could have run out; since Solr performs a rollback after the failure, I didn't notice (besides the fact that this is one of our server machines, but apparently the wrong mount point...).

I'm not yet absolutely sure of this, but it would explain a lot, and it really looks like it. So thank you for this maybe-not-so-obvious hint :)

But you also mentioned the merging strategy. I left everything at the defaults that come with the Solr download in this regard. Could it be that such a large index needs different treatment? Could you point me to a Wiki page or something where I can get a few tips?
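
For reference, if I remember correctly the relevant part of the stock solrconfig.xml I'm using looks roughly like this (just a sketch from memory - the exact elements and values may differ between Solr versions):

<indexDefaults>
  <!-- number of segments allowed before a merge is triggered -->
  <mergeFactor>10</mergeFactor>
  <!-- RAM buffer that fills up before documents are flushed to a new segment -->
  <ramBufferSizeMB>32</ramBufferSizeMB>
</indexDefaults>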

Thanks a lot. I will try building the index on a partition with enough space; perhaps that will already do it.

Best regards,

    Erik

On 16.11.2010 14:19, Erick Erickson wrote:
Several questions. Pardon me if they're obvious, but I've spent faaaar
too much of my life overlooking the obvious...

1>  Is it possible you're running out of disk? 40-50G could suck up
a lot of disk, especially when merging. You may need that much again
free when a merge occurs.
2>  Speaking of merging, what are your merge settings? How are you
triggering merges? See <mergeFactor> and associated settings in solrconfig.xml.
3>  You might get some insight by removing the Solr indexing part: can
you spin through your parsing from beginning to end? That would
eliminate your questions about whether your XML parsing is the
problem.


40-50G is a large index, but it's certainly within Solr's capability,
so you're not hitting any built-in limits.

My first guess would be that you're running out of disk, at least
that's the first thing I'd check next...

Best
Erick

On Tue, Nov 16, 2010 at 3:33 AM, Erik Fäßler <erik.faess...@uni-jena.de> wrote:

  Hey all,

I'm trying to create a Solr index for the 2010 Medline baseline (
www.pubmed.gov, over 18 million XML documents). My goal is to be able to
retrieve single XML documents by their ID. Each document comes with a unique
ID, the PubMedID. So my schema (important portions) looks like this:

<field name="pmid" type="string" indexed="true" stored="true"
required="true" />
<field name="date" type="tdate" indexed="true" stored="true"/>
<field name="xml" type="text" indexed="true" stored="true"/>

<uniqueKey>pmid</uniqueKey>
<defaultSearchField>pmid</defaultSearchField>

pmid holds the ID, date holds the creation date; xml holds the whole XML
document (mostly below 5 kB). I used the DataImporter to do this. I had to
write some classes (DataSource, EntityProcessor, DateFormatter) myself, so
theoretically the error could lie there.
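
Just to illustrate what I mean by retrieving single documents: in the end I only want to do simple ID lookups, roughly like this (the port and request handler are just the ones from the example setup, and the PubMedID is made up):

http://localhost:8983/solr/select?q=pmid:12345&fl=pmid,xml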

What happens is that indexing looks just fine at the beginning. Memory
usage stays well below the maximum (max of 20g, usage below 5g, most of the
time around 3g). It goes on in this manner for several hours until it suddenly
stops. I tried this a few times with minor tweaks, none of which made any
difference. The last time such a crash occurred, over 16.5 million documents
had already been indexed (argh, so close...). It never stops at the same
document, and indexing the documents where the error occurred on their own
just runs fine. Index size on disk was between 40g and 50g the last time I had
a look.

This is the log from beginning to end:

(I decided to just attach the log for the sake of readability ;) ).

As you can see, Solr's error message is not quite complete. There are no
closing brackets. The document is cut in half in this message, and not even
the error message itself is complete: the 'D' of
(D)ataImporter.runCmd(DataImporter.java:389), right after the document text,
is missing.

I have one thought concerning this: I get the input documents as an
InputStream, which I read buffer-wise (at most 1000 bytes per read() call). I
need to deliver the documents in one large byte array to the XML parser I
use (VTD XML).
But I don't get the individual small XML documents on their own; I always get
one larger XML blob with exactly 30,000 of these documents. I use a self-written
EntityProcessor to extract the single documents from the larger blob. These
blobs have a size of about 50 to 150 MB. So what I do is read these large
blobs in 1000-byte steps and store each byte array in an ArrayList<byte[]>.
Afterwards, I create the final byte[] and use System.arraycopy to copy the
chunks from the ArrayList into that byte[].
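
For completeness, here is roughly what that reading code does - a simplified sketch from memory, not the actual EntityProcessor code, and the class and method names are made up:

import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

public class BlobReader {

    // Reads a whole blob from the stream in steps of at most 1000 bytes
    // and joins the chunks into one byte[] for the XML parser.
    public static byte[] readBlob(InputStream in) throws IOException {
        List<byte[]> chunks = new ArrayList<byte[]>();
        byte[] buffer = new byte[1000];
        int total = 0;
        int read;
        while ((read = in.read(buffer)) != -1) {
            // copy only the bytes actually returned; read() may give fewer than 1000
            byte[] chunk = new byte[read];
            System.arraycopy(buffer, 0, chunk, 0, read);
            chunks.add(chunk);
            total += read;
        }
        // assemble the final array that is handed to the parser
        byte[] result = new byte[total];
        int offset = 0;
        for (byte[] chunk : chunks) {
            System.arraycopy(chunk, 0, result, offset, chunk.length);
            offset += chunk.length;
        }
        return result;
    }
}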
I tested this and it looks fine to me. And as I said, indexing the
documents where the error occurred just works fine (that is, indexing the
whole blob containing the single document). I only mention this because it
kind of looks like there is this cut in the document, and the missing 'D'
reminds me of char-encoding errors. But I don't know for sure; opening the
error log in vi doesn't show any broken characters (the last time I had such
problems, vi could identify the characters in question while other editors
just wouldn't show them).

Further ideas from my side: Is the index too big? I think I read somewhere
that a large index would be something around 10 million documents, and I aim
to approximately double this number. But would this cause such an error? In
the end: what exactly IS the error?

Sorry for the wall of text; I'm just trying to describe the problem in as much
detail as possible. Thanks a lot for reading, and I appreciate any ideas! :)

Best regards,

    Erik

