I am using the SolrCloud branch on 6 machines. I first load PubMed into HBase and then push the fields I care about to Solr. Indexing from HBase to Solr takes about 18 minutes. Loading into HBase takes a little longer (2 hours?), but it only happens once, so I haven't spent much time trying to optimize it.

This gives me the flexibility of a Solr search as well as full document retrieval (and additional processing) from HBase.

Dave
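[Editor's aside: a minimal sketch of the kind of HBase-to-Solr pass Dave describes, using the HBase client and SolrJ APIs of that era. The table name, column family, and Solr field names are illustrative assumptions, not his actual code, and the exact HBase client API varies by version.]

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class HBaseToSolr {
    public static void main(String[] args) throws Exception {
        // Scan the HBase table holding the full records
        // ("medline" and "doc:title" are made-up names).
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "medline");
        Scan scan = new Scan();
        scan.addColumn(Bytes.toBytes("doc"), Bytes.toBytes("title"));

        SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();

        ResultScanner scanner = table.getScanner(scan);
        for (Result row : scanner) {
            SolrInputDocument doc = new SolrInputDocument();
            // The row key is the PMID; index only the fields searched on.
            doc.addField("pmid", Bytes.toString(row.getRow()));
            doc.addField("title", Bytes.toString(
                    row.getValue(Bytes.toBytes("doc"), Bytes.toBytes("title"))));
            batch.add(doc);
            if (batch.size() == 1000) { // push in batches, commit once at the end
                solr.add(batch);
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            solr.add(batch);
        }
        solr.commit();
        scanner.close();
        table.close();
    }
}

Batching the adds and committing once at the end keeps memory flat and avoids per-document commit overhead; a real job would presumably pull every field that needs to be searchable, not just a title.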
-----Original Message-----
From: Erik Fäßler [mailto:erik.faess...@uni-jena.de]
Sent: Tuesday, November 16, 2010 9:16 AM
To: solr-user@lucene.apache.org
Subject: Re: DIH full-import failure, no real error message

Thank you very much, I will have a read through your links. The full-text red
flag is exactly why I'm testing this with Solr. As Dennis said before, I could
also use a database as long as I don't need sophisticated query capabilities.
To be honest, I don't know the performance gap between a Lucene index and a
database in such a case; I guess I will have to test it.

This is meant as a substitute for holding every single file on disc. But I
need the whole file's information because it's not clear which information
will be required in the future, and we don't want to re-index every time we
add a new field (not yet, that is ;)).

Best regards,
Erik

Am 16.11.2010 16:27, schrieb Erick Erickson:
> The key is that Solr handles merges by copying, and only after the copy is
> complete does it delete the old index. So you'll need at least 2x your
> final index size before you start, especially if you optimize...
>
> Here's a handy matrix of what you need in your index depending upon what
> you want to do:
> http://search.lucidimagination.com/search/out?u=http://wiki.apache.org/solr/FieldOptionsByUseCase
>
> Leaving out what you don't use will help by shrinking your index.
>
> The thing that jumps out is that you're storing your entire XML document as
> well as indexing it. Are you expecting to return the document to the user?
> Storing the entire document is a red flag; you probably don't want to do
> this. If you need to return the entire document at some point, one strategy
> is to index whatever you need to search, plus what you need to fetch the
> document from an external store. You can index the values of selected tags
> as fields in your documents. That would also give you far more flexibility
> when searching.
>
> Best
> Erick
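[Editor's aside: a sketch of the schema that Erick's strategy might lead to. The title and abstract fields are illustrative assumptions, not from Erik's actual schema; search fields are indexed but not stored, and the stored pmid doubles as the lookup key into an external document store, so the xml field disappears.]

<field name="pmid" type="string" indexed="true" stored="true" required="true" />
<field name="date" type="tdate" indexed="true" stored="true"/>
<!-- index selected tags for searching, without storing them -->
<field name="title" type="text" indexed="true" stored="false"/>
<field name="abstract" type="text" indexed="true" stored="false"/>
<!-- no "xml" field: the stored pmid is the key for fetching the full
     document from the external store (e.g. a database or HBase row key) -->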
> On Tue, Nov 16, 2010 at 9:48 AM, Erik Fäßler <erik.faess...@uni-jena.de> wrote:
>
>> Hello Erick,
>>
>> I guess I'm the one asking for pardon - but surely not you! It seems your
>> first guess could already be the correct one. Disc space IS kind of short
>> and I believe it could have run out; since Solr performs a rollback after
>> the failure, I didn't notice (besides the fact that this is one of our
>> server machines, but apparently the wrong mount point...).
>>
>> I'm not yet absolutely sure of this, but it would explain a lot and it
>> really looks like it. So thank you for this maybe not so obvious hint :)
>>
>> But you also mentioned the merging strategy. I left everything at the
>> defaults that come with the Solr download concerning these things. Could
>> it be that such a large index needs different treatment? Could you point
>> me to a wiki page or something where I can get a few tips?
>>
>> Thanks a lot, I will try building the index on a partition with enough
>> space; perhaps that will already do it.
>>
>> Best regards,
>>
>> Erik
>>
>> Am 16.11.2010 14:19, schrieb Erick Erickson:
>>
>>> Several questions. Pardon me if they're obvious, but I've spent faaaar
>>> too much of my life overlooking the obvious...
>>>
>>> 1> Is it possible you're running out of disk? 40-50G could suck up a lot
>>> of disk, especially when merging. You may need that much again free when
>>> a merge occurs.
>>> 2> Speaking of merging, what are your merge settings? How are you
>>> triggering merges? See <mergeFactor> and associated settings in
>>> solrconfig.xml.
>>> 3> You might get some insight by removing the Solr indexing part: can you
>>> spin through your parsing from beginning to end? That would eliminate
>>> your questions about whether your XML parsing is the problem.
>>>
>>> 40-50G is a large index, but it's certainly within Solr's capability, so
>>> you're not hitting any built-in limits.
>>>
>>> My first guess would be that you're running out of disk; at least that's
>>> the first thing I'd check next...
>>>
>>> Best
>>> Erick
>>>
>>> On Tue, Nov 16, 2010 at 3:33 AM, Erik Fäßler <erik.faess...@uni-jena.de> wrote:
>>>
>>>> Hey all,
>>>>
>>>> I'm trying to create a Solr index for the 2010 Medline baseline
>>>> (www.pubmed.gov, over 18 million XML documents). My goal is to be able
>>>> to retrieve single XML documents by their ID. Each document comes with
>>>> a unique ID, the PubMedID. So my schema (important portions) looks like
>>>> this:
>>>>
>>>> <field name="pmid" type="string" indexed="true" stored="true" required="true" />
>>>> <field name="date" type="tdate" indexed="true" stored="true"/>
>>>> <field name="xml" type="text" indexed="true" stored="true"/>
>>>>
>>>> <uniqueKey>pmid</uniqueKey>
>>>> <defaultSearchField>pmid</defaultSearchField>
>>>>
>>>> pmid holds the ID, date holds the creation date; xml holds the whole
>>>> XML document (mostly below 5kb). I used the DataImporter to do this. I
>>>> had to write some classes (DataSource, EntityProcessor, DateFormatter)
>>>> myself, so theoretically the error could lie there.
>>>>
>>>> What happens is that indexing looks just fine at the beginning. Memory
>>>> usage stays well below the maximum (max of 20g, usage below 5g, most of
>>>> the time around 3g). It goes on for several hours in this manner until
>>>> it suddenly stops. I tried this a few times with minor tweaks, none of
>>>> which made any difference. The last time such a crash occurred, over
>>>> 16.5 million documents had already been indexed (argh, so close...). It
>>>> never stops at the same document, and indexing the documents where the
>>>> error occurred just runs fine on its own. Index size on disc was
>>>> between 40g and 50g the last time I had a look.
>>>>
>>>> This is the log from beginning to end:
>>>>
>>>> (I decided to just attach the log for the sake of readability ;) ).
>>>>
>>>> As you can see, Solr's error message is not quite complete. There are
>>>> no closing brackets. The document is cut in half in this message, and
>>>> not even the error message itself is complete: the 'D' of
>>>> (D)ataImporter.runCmd(DataImporter.java:389) right after the document
>>>> text is missing.
>>>>
>>>> I have one thought concerning this: I get the input documents as an
>>>> InputStream which I read buffer-wise (at most 1000 bytes per read()
>>>> call). I need to deliver the documents in one large byte array to the
>>>> XML parser I use (VTD-XML). But I don't get the individual small XML
>>>> documents; I always get one larger XML blob with exactly 30,000 of
>>>> these documents. I use a self-written EntityProcessor to extract the
>>>> single documents from the larger blob. These blobs have a size of about
>>>> 50 to 150mb. So what I do is read these large blobs in 1000-byte steps
>>>> and store each byte array in an ArrayList<byte[]>. Afterwards, I create
>>>> the final byte[] and do System.arraycopy from the ArrayList into the
>>>> byte[].
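[Editor's aside: a minimal sketch of the accumulation scheme Erik describes, assuming a plain InputStream. One thing worth checking in code like this is that each read() contributes only the bytes it actually returned; keeping the full 1000-byte buffer after a short read would splice stale bytes into the reassembled blob.]

import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

public class BlobReader {
    /** Reads the whole stream into one byte[], 1000 bytes at a time. */
    static byte[] readFully(InputStream in) throws IOException {
        List<byte[]> chunks = new ArrayList<byte[]>();
        int total = 0;
        byte[] buf = new byte[1000];
        int n;
        while ((n = in.read(buf)) != -1) {
            // Copy only the n bytes actually read; read() may return fewer
            // than the buffer size at any time, not just at end of stream.
            byte[] chunk = new byte[n];
            System.arraycopy(buf, 0, chunk, 0, n);
            chunks.add(chunk);
            total += n;
        }
        // Stitch the chunks into the final array for the XML parser.
        byte[] blob = new byte[total];
        int pos = 0;
        for (byte[] chunk : chunks) {
            System.arraycopy(chunk, 0, blob, pos, chunk.length);
            pos += chunk.length;
        }
        return blob;
    }
}

A java.io.ByteArrayOutputStream does the same bookkeeping internally and would shrink this to a few lines, which also makes it easier to rule the reading code out as the source of the truncated document.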
>>>> I tested this and it looks fine to me. And as I said, indexing the
>>>> documents where the error occurred just works fine on its own (that is,
>>>> indexing the whole blob containing the single document). I only mention
>>>> this because it kind of looks like there is this cut in the document,
>>>> and the missing 'D' reminds me of char-encoding errors. But I don't
>>>> know for sure; opening the error log in vi doesn't show any broken
>>>> characters (the last time I had such problems, vi could identify the
>>>> characters in question while other editors just wouldn't show them).
>>>>
>>>> Further ideas from my side: is the index too big? I think I read
>>>> somewhere that a large index would be around 10 million documents, and
>>>> I aim to approximately double that number. But would that cause such an
>>>> error? And in the end: what exactly IS the error?
>>>>
>>>> Sorry for the wall of text; I'm just trying to describe the problem in
>>>> as much detail as possible. Thanks a lot for reading, and I appreciate
>>>> any ideas! :)
>>>>
>>>> Best regards,
>>>>
>>>> Erik
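[Editor's aside: since the working theory upthread is that the index partition filled up mid-import, and a merge or optimize can transiently need roughly twice the index size, a quick headroom check along these lines can confirm it before the next multi-hour run; the path is illustrative.]

import java.io.File;

public class DiskCheck {
    public static void main(String[] args) {
        // Path to the Solr data directory (illustrative).
        File indexDir = new File("/path/to/solr/data/index");
        long freeGb = indexDir.getUsableSpace() / (1024L * 1024L * 1024L);
        System.out.println("Usable space on index volume: " + freeGb + " GB");
    }
}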