Re: Generating Index offline and loading into solrcloud

2020-01-30 Thread vchauras
Hey Sameer, I tried using the tool on the Hadoop master node (AWS EMR) like this:
hadoop jar cloudera-search-1.0.0-cdh5.2.0-jar-with-dependencies.jar \
  org.apache.solr.hadoop.MapReduceIndexerTool \
  -D 'mapred.child.java.opts=-Xmx500m' \
  --log4j ~/log4j.properties \
  --morphline-file

Re: Generating Index offline and loading into solrcloud

2015-11-19 Thread KNitin
Great. Thanks!
On Thu, Nov 19, 2015 at 11:24 AM, Sameer Maggon wrote:
> If you are trying to create a large index and want speedups there, you
> could use the MapReduceTool -
> https://github.com/cloudera/search/tree/cdh5-1.0.0_5.2.1/search-mr. At a
> high level, it

Generating Index offline and loading into solrcloud

2015-11-19 Thread KNitin
Hi, I was wondering if there are existing tools that will generate a Solr index offline (in SolrCloud mode) that can later be loaded into SolrCloud, before I decide to implement my own. I found some tools that do only Solr-based index loading (non-ZK mode). Is there one with ZK mode enabled?

Re: Generating Index offline and loading into solrcloud

2015-11-19 Thread KNitin
Ah, got it. Another generic question: is there much of a difference between generating files in MapReduce and loading them into SolrCloud vs. using the Solr NRT API? Has anyone run any tests of that sort?
Thanks a ton,
Nitin
On Thu, Nov 19, 2015 at 3:00 PM, Erick Erickson

Re: Generating Index offline and loading into solrcloud

2015-11-19 Thread Sameer Maggon
If you are trying to create a large index and want speedups there, you could use the MapReduceTool - https://github.com/cloudera/search/tree/cdh5-1.0.0_5.2.1/search-mr. At a high level, it takes your files (csv, json, etc) as input and can create either a single or a sharded index that you can either

Re: Generating Index offline and loading into solrcloud

2015-11-19 Thread Erick Erickson
Note two things:
1> this is running on Hadoop
2> it is part of the standard Solr release as MapReduceIndexerTool, look in the contribs...
If you're trying to do this yourself, you must be very careful to index docs to the correct shard then merge the correct shards. MRIT does this all
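To illustrate the point above about indexing each document to the correct shard, here is a minimal sketch of hash-range routing. It is an assumption-laden toy, not what MRIT or SolrCloud actually runs: the function names are hypothetical, the shard ranges are simply equal slices of a 32-bit space, and CRC32 is a stand-in hash rather than the one Solr uses, so the shard assignments it produces would not match a real cluster.

```python
import zlib


def shard_ranges(num_shards):
    """Split the unsigned 32-bit hash space into contiguous, near-equal ranges."""
    space = 2 ** 32
    step = space // num_shards
    ranges = []
    for i in range(num_shards):
        lo = i * step
        # The last shard absorbs any remainder so the whole space is covered.
        hi = space - 1 if i == num_shards - 1 else (i + 1) * step - 1
        ranges.append((lo, hi))
    return ranges


def route(doc_id, ranges):
    """Return the index of the shard whose range contains the hash of doc_id.

    CRC32 is a placeholder hash chosen for illustration only.
    """
    h = zlib.crc32(doc_id.encode("utf-8")) & 0xFFFFFFFF
    for shard, (lo, hi) in enumerate(ranges):
        if lo <= h <= hi:
            return shard
    raise ValueError("hash %d not covered by any shard range" % h)


def partition(docs, num_shards):
    """Group documents by target shard before building one index per shard."""
    ranges = shard_ranges(num_shards)
    buckets = {i: [] for i in range(num_shards)}
    for doc in docs:
        buckets[route(doc["id"], ranges)].append(doc)
    return buckets


if __name__ == "__main__":
    docs = [{"id": "doc-%d" % n} for n in range(10)]
    for shard, members in partition(docs, 3).items():
        print(shard, [d["id"] for d in members])
```

The property that matters is that the offline indexer and the live cluster agree on the mapping from document id to shard. If they disagree, a later update for the same id can land on a different shard and leave two copies of the document in the collection.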

Re: Generating Index offline and loading into solrcloud

2015-11-19 Thread KNitin
Thanks, Erick. Looks like MRIT uses an embedded Solr instance running per mapper/reducer and uses that to index documents. Is that the recommended model? Can we use raw Lucene libraries to generate the index and then load it into SolrCloud? (Barring the complexities of indexing into the right shard and merging

Re: Generating Index offline and loading into solrcloud

2015-11-19 Thread Erick Erickson
Sure, you can use Lucene to create indexes for shards if (and only if) you deal with the routing issues. About updates: I'm not talking about atomic updates at all. The usual model for Solr is if you have a unique key defined, new versions of documents replace old versions of documents based

Re: Generating Index offline and loading into solrcloud

2015-11-19 Thread Erick Erickson
Apples/Oranges question: They're different beasts. The NRT stuff (spark-solr for example, Cloudera's Flume sink as well, custom SolrJ clients, whatever) is constrained by the number of Solr servers you have running, more specifically the number of shards. When you're feeding docs fast enough that