I am using the Solr cloud branch on 6 machines.  I first load PubMed into 
HBase, and then push the fields I care about to Solr.  Indexing from HBase to 
Solr takes about 18 minutes.  Loading into HBase takes a little longer (2 
hours?), but it only happens once, so I haven't spent much time trying to 
optimize.

This gives me the flexibility of a Solr search as well as full document 
retrieval (and additional processing) from HBase.
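
In outline, that push step is just an HBase scan plus SolrJ adds. Below is a 
minimal sketch, assuming the SolrJ HTTP client (CommonsHttpSolrServer) and the 
standard HBase client; the table name, column family, qualifiers, Solr URL and 
field names are all made-up examples, not the actual setup:

// Hypothetical sketch: scan an HBase table and push selected columns to Solr.
// Table, column family, qualifiers and Solr field names are assumptions.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class HBaseToSolr {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "pubmed");
        SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");

        byte[] fam = Bytes.toBytes("doc");
        Scan scan = new Scan();
        scan.addColumn(fam, Bytes.toBytes("title"));
        scan.addColumn(fam, Bytes.toBytes("abstract"));

        ResultScanner scanner = table.getScanner(scan);
        int count = 0;
        for (Result row : scanner) {
            // Assumes every row carries both columns.
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("pmid", Bytes.toString(row.getRow()));
            doc.addField("title", Bytes.toString(row.getValue(fam, Bytes.toBytes("title"))));
            doc.addField("abstract", Bytes.toString(row.getValue(fam, Bytes.toBytes("abstract"))));
            solr.add(doc);
            if (++count % 10000 == 0) {
                solr.commit();   // commit in batches, not per document
            }
        }
        scanner.close();
        solr.commit();
        table.close();
    }
}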

Dave

-----Original Message-----
From: Erik Fäßler [mailto:erik.faess...@uni-jena.de] 
Sent: Tuesday, November 16, 2010 9:16 AM
To: solr-user@lucene.apache.org
Subject: Re: DIH full-import failure, no real error message

  Thank you very much, I will have a read of your links.

The full-text red flag is exactly why I'm testing this with 
Solr. As Dennis said before, I could also use a database, as long 
as I don't need sophisticated query capabilities. To be honest, I don't 
know the performance gap between a Lucene index and a database in such a 
case. I guess I will have to test it.
This is meant as a substitute for holding every single file on disk. 
But I need the complete file contents because it's not clear which 
information will be required in the future. And we don't want to 
re-index every time we add a new field (not yet, that is ;)).

Best regards,

     Erik

On 16.11.2010 16:27, Erick Erickson wrote:
> The key is that Solr handles merges by copying, and only after
> the copy is complete does it delete the old index. So you'll need
> at least 2x your final index size before you start, especially if you
> optimize...
>
> Here's a handy matrix of what you need in your index depending
> upon what you want to do:
> http://search.lucidimagination.com/search/out?u=http://wiki.apache.org/solr/FieldOptionsByUseCase
>
> Leaving out what you don't use will help by shrinking your index.
>
> The thing that jumps out is that you're storing your entire XML document
> as well as indexing it. Are you expecting to return the document
> to the user? Storing the entire document is a red flag; you
> probably don't want to do this. If you need to return the entire
> document at some point, one strategy is to index whatever you need
> to search and store whatever you need to fetch the document from
> an external store. You can index the values of selected tags as fields in
> your documents. That would also give you far more flexibility
> when searching.
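>
> Roughly, that search-then-fetch pattern looks like the sketch below - SolrJ
> only, and the query, field names and external fetch are all placeholders
> rather than anything from your actual setup:
>
> import org.apache.solr.client.solrj.SolrQuery;
> import org.apache.solr.client.solrj.SolrServer;
> import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
> import org.apache.solr.client.solrj.response.QueryResponse;
> import org.apache.solr.common.SolrDocument;
>
> public class SearchThenFetch {
>     public static void main(String[] args) throws Exception {
>         SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
>
>         // Search only on indexed fields; ask Solr for nothing but the stored key.
>         SolrQuery query = new SolrQuery("title:mitochondria");
>         query.setFields("pmid");
>         query.setRows(10);
>
>         QueryResponse response = solr.query(query);
>         for (SolrDocument hit : response.getResults()) {
>             String pmid = (String) hit.getFieldValue("pmid");
>             // Fetch the full XML from wherever the complete documents live
>             // (HBase, filesystem, a database, ...).
>             String xml = fetchFromStore(pmid);
>             System.out.println(pmid + " -> " + xml.length() + " chars");
>         }
>     }
>
>     // Placeholder for the external store lookup - not a real API.
>     private static String fetchFromStore(String pmid) {
>         return "<MedlineCitation/>";
>     }
> }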
>
> Best
> Erick
>
>
>
>
> On Tue, Nov 16, 2010 at 9:48 AM, Erik Fäßler <erik.faess...@uni-jena.de> wrote:
>
>>   Hello Erick,
>>
>> I guess I'm the one who should be asking for pardon - certainly not you! It
>> seems your first guess could already be the correct one. Disk space IS kind of
>> short, and I believe it could have run out; since Solr performs a rollback
>> after the failure, I didn't notice (besides the fact that this is one of our
>> server machines, but apparently the wrong mount point...).
>>
>> I'm not yet absolutely sure of this, but it would explain a lot, and it really
>> looks like it. So thank you for this maybe-not-so-obvious hint :)
>>
>> But you also mentioned the merging strategy. I left everything at the
>> defaults that come with the Solr download as far as these things are concerned.
>> Could it be that such a large index needs different treatment? Could you
>> point me to a wiki page or something where I can get a few tips?
>>
>> Thanks a lot, I will try building the index on a partition with enough
>> space, perhaps that will already do it.
>>
>> Best regards,
>>
>>     Erik
>>
>> On 16.11.2010 14:19, Erick Erickson wrote:
>>
>>   Several questions. Pardon me if they're obvious, but I've spent faaaar
>>> too much of my life overlooking the obvious...
>>>
>>> 1>   Is it possible you're running out of disk? 40-50G could suck up
>>> a lot of disk, especially when merging. You may need that much again
>>> free when a merge occurs.
>>> 2>   Speaking of merging, what are your merge settings? How are you
>>> triggering merges? See <mergeFactor> and the associated settings in solrconfig.xml.
>>> 3>   You might get some insight by removing the Solr indexing part: can
>>> you spin through your parsing from beginning to end? That would
>>> eliminate your questions about whether your XML parsing is the
>>> problem.
>>>
>>>
>>> 40-50G is a large index, but it's certainly within Solr's capability,
>>> so you're not hitting any built-in limits.
>>>
>>> My first guess would be that you're running out of disk, at least
>>> that's the first thing I'd check next...
>>>
>>> Best
>>> Erick
>>>
>>> On Tue, Nov 16, 2010 at 3:33 AM, Erik Fäßler <erik.faess...@uni-jena.de> wrote:
>>>>
>>>>   Hey all,
>>>> I'm trying to create a Solr index for the 2010 Medline baseline (
>>>> www.pubmed.gov, over 18 million XML documents). My goal is to be able to
>>>> retrieve single XML documents by their ID. Each document comes with a
>>>> unique ID, the PubMedID. So my schema (important portions) looks like this:
>>>>
>>>> <field name="pmid" type="string" indexed="true" stored="true"
>>>> required="true" />
>>>> <field name="date" type="tdate" indexed="true" stored="true"/>
>>>> <field name="xml" type="text" indexed="true" stored="true"/>
>>>>
>>>> <uniqueKey>pmid</uniqueKey>
>>>> <defaultSearchField>pmid</defaultSearchField>
>>>>
>>>> pmid holds the ID, date holds the creation date; xml holds the whole XML
>>>> document (mostly below 5 KB). I used the DataImporter to do this. I had to
>>>> write some classes (DataSource, EntityProcessor, DateFormatter) myself, so
>>>> theoretically, the error could lie there.
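>>>>
>>>> The intended lookup by ID is then just a query on pmid returning the stored
>>>> xml field - roughly like this simplified SolrJ sketch, where the core URL and
>>>> the example ID are made up:
>>>>
>>>> import org.apache.solr.client.solrj.SolrQuery;
>>>> import org.apache.solr.client.solrj.SolrServer;
>>>> import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
>>>> import org.apache.solr.client.solrj.response.QueryResponse;
>>>>
>>>> public class FetchByPmid {
>>>>     public static void main(String[] args) throws Exception {
>>>>         SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
>>>>
>>>>         SolrQuery query = new SolrQuery("pmid:12345678");   // any PubMedID
>>>>         query.setFields("xml");
>>>>
>>>>         QueryResponse response = solr.query(query);
>>>>         if (!response.getResults().isEmpty()) {
>>>>             String xml = (String) response.getResults().get(0).getFieldValue("xml");
>>>>             System.out.println(xml);   // the stored document, ready for further processing
>>>>         }
>>>>     }
>>>> }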
>>>>
>>>> What happens is that indexing looks just fine at the beginning. Memory
>>>> usage stays well below the maximum (max of 20g, usage below 5g, most of the
>>>> time around 3g). It goes on for several hours in this manner until it
>>>> suddenly stops. I tried this a few times with minor tweaks, none of which
>>>> made any difference. The last time such a crash occurred, over 16.5 million
>>>> documents had already been indexed (argh, so close...). It never stops at the
>>>> same document, and re-indexing the documents where the error occurred just
>>>> runs fine. Index size on disk was between 40g and 50g the last time I had
>>>> a look.
>>>>
>>>> This is the log from beginning to end:
>>>>
>>>> (I decided to just attach the log for the sake of readability ;) ).
>>>>
>>>> As you can see, Solr's error message is not quite complete. There are no
>>>> closing brackets. The document is cut in half in this message, and not even
>>>> the error message itself is complete: the 'D' of
>>>> (D)ataImporter.runCmd(DataImporter.java:389) right after the document text
>>>> is missing.
>>>>
>>>> I have one thought concerning this: I get the input documents as an
>>>> InputStream which I read buffer-wise (at most 1000 bytes per read() call). I
>>>> need to deliver the documents in one large byte array to the XML parser I
>>>> use (VTD-XML).
>>>> However, I don't get the individual small XML documents directly but always
>>>> one larger XML blob with exactly 30,000 of these documents. I use a
>>>> self-written EntityProcessor to extract the single documents from the larger
>>>> blob. These blobs have a size of about 50 to 150 MB. So what I do is read
>>>> these large blobs in 1000-byte steps and store each byte array in an
>>>> ArrayList<byte[]>. Afterwards, I create the final byte[] and do
>>>> System.arraycopy from the ArrayList into that byte[].
>>>> I tested this and it looks fine to me. And as I said, indexing the
>>>> documents where the error occurred just works fine (that is, indexing the
>>>> whole blob containing the single document). I only mention this because it
>>>> kind of looks like there is a cut in the document, and the missing 'D'
>>>> reminds me of char-encoding errors. But I don't know for sure; opening the
>>>> error log in vi doesn't show any broken characters (the last time I had such
>>>> problems, vi could identify the characters in question, while other editors
>>>> just wouldn't show them).
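>>>>
>>>> In outline, that read-and-concatenate step is (a simplified sketch, not the
>>>> actual EntityProcessor code):
>>>>
>>>> import java.io.IOException;
>>>> import java.io.InputStream;
>>>> import java.util.ArrayList;
>>>> import java.util.List;
>>>>
>>>> // Read the whole blob in 1000-byte chunks, then glue the chunks into the
>>>> // single byte[] that is handed to the VTD-XML parser.
>>>> static byte[] readWholeBlob(InputStream in) throws IOException {
>>>>     List<byte[]> chunks = new ArrayList<byte[]>();
>>>>     byte[] buf = new byte[1000];
>>>>     int total = 0;
>>>>     int n;
>>>>     while ((n = in.read(buf)) != -1) {
>>>>         byte[] chunk = new byte[n];   // keep only the bytes actually read
>>>>         System.arraycopy(buf, 0, chunk, 0, n);
>>>>         chunks.add(chunk);
>>>>         total += n;
>>>>     }
>>>>     byte[] whole = new byte[total];
>>>>     int pos = 0;
>>>>     for (byte[] chunk : chunks) {
>>>>         System.arraycopy(chunk, 0, whole, pos, chunk.length);
>>>>         pos += chunk.length;
>>>>     }
>>>>     return whole;
>>>> }
>>>>
>>>> (java.io.ByteArrayOutputStream would do the same job with less code, but the
>>>> resulting byte[] is the same.)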
>>>>
>>>> Further ideas from my side: Is the index too big? I think I read somewhere
>>>> that a large index is something around 10 million documents, and I aim to
>>>> approximately double that number. But would this cause such an error? And in
>>>> the end: what exactly IS the error?
>>>>
>>>> Sorry for the wall of text; I'm just trying to describe the problem in as
>>>> much detail as possible. Thanks a lot for reading, and I appreciate any
>>>> ideas! :)
>>>>
>>>> Best regards,
>>>>
>>>>     Erik
>>>>
>>>>

