On Nov 7, 2011, at 12:06pm, Chris Hostetter wrote:

> 
> : I see that https://issues.apache.org/jira/browse/SOLR-653 removed this 
> : support from SolrJ, because it was deemed too dangerous for mere 
> : mortals.
> 
> I believe the concern was that the "novice level" API was very in your 
> face about asking if you wanted to "overwrite" and made it too easy to 
> hurt yourself.
> 
> It should still be fairly trivial to specify overwrite=false in a SolrJ 
> request -- just not using the convenience methods.  Something like...
> 
>       UpdateRequest req = new UpdateRequest();
>       req.add(myBigCollectionOfDocuments);
>       req.setParam(UpdateParams.OVERWRITE, "false");
>       req.process(mySolrServer);

That seemed to work, thanks for the suggestion, though what I used was (in
case anybody else reads this):

   req.setParam(UpdateParams.OVERWRITE, Boolean.toString(false));

I'll need to run some tests to check performance improvements.

> : For Hadoop-based workflows, it's straightforward to ensure that the 
> : unique key field is really unique, thus if the performance gain is 
> : significant, I might look into figuring out some way (with a trigger 
> : lock) of re-enabling this support in SolrJ.
> 
> it's not just an issue of knowing that the key is unique -- it's an issue 
> of being certain that your index does not contain any documents with the 
> same key as a document you are about to add.  If you are generating a 
> completely new Solr index from data that you are certain is unique -- then 
> you will probably see some perf gains.  But if you are adding to an 
> existing index, I would avoid it. 

For Hadoop workflows, the output is always fresh (unless you do some 
interesting helicopter stunts).

So yes, by default the index is always being rebuilt from scratch.

And thus as long as the primary key is being used as the reduce-phase key, it's 
easy to ensure uniqueness in the index.
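
To make the reduce-phase point concrete, here is a minimal plain-Java sketch (no Hadoop or Solr dependencies) of why keying the reduce on the primary key yields exactly one document per key. The names UniqueKeyReduce, Doc, groupByKey and reduceToUniqueDocs are made up for illustration, and keeping the last record seen for each key is an arbitrary assumption; a real job would do the equivalent inside its reducer.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class UniqueKeyReduce {

    /** Hypothetical record type: a primary key plus a single field value. */
    public static final class Doc {
        public final String id;
        public final String body;

        public Doc(String id, String body) {
            this.id = id;
            this.body = body;
        }
    }

    /** Stands in for the shuffle: all records sharing a key end up in one group. */
    public static Map<String, List<Doc>> groupByKey(List<Doc> input) {
        Map<String, List<Doc>> groups = new LinkedHashMap<>();
        for (Doc d : input) {
            groups.computeIfAbsent(d.id, k -> new ArrayList<>()).add(d);
        }
        return groups;
    }

    /** Stands in for the reduce: emit one document per key (assumed policy: last one wins). */
    public static List<Doc> reduceToUniqueDocs(List<Doc> input) {
        List<Doc> out = new ArrayList<>();
        for (List<Doc> group : groupByKey(input).values()) {
            out.add(group.get(group.size() - 1));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Doc> input = List.of(
                new Doc("a", "first"),
                new Doc("b", "only"),
                new Doc("a", "second"));
        // Each key appears once in the output, so adding with overwrite=false
        // to a freshly built index cannot create duplicate keys.
        for (Doc d : reduceToUniqueDocs(input)) {
            System.out.println(d.id + " -> " + d.body);
        }
    }
}
```

The guarantee only covers documents produced by a single job writing to an empty index; it says nothing about keys already present in an existing index, which is the case Chris warns about above.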

Thanks again,

-- Ken

--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr



