Dear Erick,
Could you please name the problems that SolrCloud cannot tackle alone? Maybe
I need SolrCloud + Hadoop and am not aware of that yet.
Regards.
On Thu, Aug 7, 2014 at 7:37 PM, Erick Erickson <erickerick...@gmail.com> wrote:

> If SolrCloud meets your needs, without Hadoop, then there's no real
> reason to introduce the added complexity.
>
> There are a bunch of problems that do _not_ work well with SolrCloud
> over non-Hadoop file systems. For those problems, the combination of
> SolrCloud and Hadoop makes tackling them possible.
>
> Best,
> Erick
>
>
> On Thu, Aug 7, 2014 at 3:55 AM, Ali Nazemian <alinazem...@gmail.com>
> wrote:
>
> > Thank you very much. But why should we go for distributed Solr with
> > Hadoop? There is already SolrCloud, which is pretty applicable in the
> > case of a big index. Is there any advantage to sending indexes over
> > MapReduce that SolrCloud cannot provide?
> > Regards.
> >
> >
> > On Wed, Aug 6, 2014 at 9:09 PM, Erick Erickson <erickerick...@gmail.com>
> > wrote:
> >
> > > bq: Are you aware of Cloudera Search? I know they provide an
> > > integrated Hadoop ecosystem.
> > >
> > > What Cloudera Search does via the MapReduceIndexerTool (MRIT) is
> > > create N sub-indexes for each shard in the M/R paradigm via
> > > EmbeddedSolrServer. Eventually, these sub-indexes for each shard are
> > > merged (perhaps through some number of levels) in the reduce phase,
> > > and maybe merged into a live Solr instance (--go-live). You'll note
> > > that this tool requires the address of the ZK ensemble, from which
> > > it can get the network topology, configuration files, all that rot.
> > > If you don't use the --go-live option, the output is still a Solr
> > > index; it's just that the index for each shard is left in a specific
> > > directory on HDFS. Being on HDFS allows this kind of M/R paradigm
> > > for massively parallel indexing operations, and perhaps massively
> > > complex analysis.
> > >
> > > Nowhere is there any low-level non-Solr manipulation of the indexes.
> > > The Flume fork just writes directly to the Solr nodes. It knows
> > > about the ZooKeeper ensemble and the collection too, and
> > > communicates via SolrJ, I'm pretty sure.
> > >
> > > As far as integrating with HDFS, you're right, HA is part of the
> > > package. As far as using the Solr indexes for analysis, well, you
> > > can write anything you want to use the Solr indexes from anywhere in
> > > the M/R world and have them available from anywhere in the cluster.
> > > There's no real need to even have Solr running; you could use the
> > > output from MRIT and access the sub-shards with the
> > > EmbeddedSolrServer if you wanted, leaving out all the pesky servlet
> > > container stuff.
> > >
> > > bq: So why we go for HDFS in the case of analysis if we want to use
> > > SolrJ for this purpose? What is the point?
> > >
> > > Scale and data access, in a nutshell. In the HDFS world, you can
> > > scale pretty linearly with the number of nodes you can rack
> > > together.
> > >
> > > Frankly, though, if your data set is small enough to fit on a single
> > > machine _and_ you can get through your analysis in a reasonable time
> > > (reasonable here is up to you), then HDFS is probably not worth the
> > > hassle. But in the big data world, where we're talking petabyte
> > > scale, having HDFS as the underpinning opens up possibilities for
> > > working on data that were difficult/impossible with Solr previously.
> > >
> > > Best,
> > > Erick
> > >
> > >
> > > On Tue, Aug 5, 2014 at 9:37 PM, Ali Nazemian <alinazem...@gmail.com>
> > > wrote:
> > >
> > > > Dear Erick,
> > > > I remember that some time ago, somebody asked what the point of
> > > > modifying Solr to use HDFS for storing indexes was. As far as I
> > > > remember, somebody told him integrating Solr with HDFS has two
> > > > advantages: 1) having Hadoop replication and HA,
> > > > 2) using indexes and Solr documents for other purposes such as
> > > > analysis. So why should we go for HDFS in the case of analysis if
> > > > we want to use SolrJ for this purpose? What is the point?
> > > > Regards.
> > > >
> > > >
> > > > On Wed, Aug 6, 2014 at 8:59 AM, Ali Nazemian <alinazem...@gmail.com>
> > > > wrote:
> > > >
> > > > > Dear Erick,
> > > > > Hi,
> > > > > Thank you for your reply. Yeah, I am aware that SolrJ is my last
> > > > > option. I was thinking about raw I/O operations, so according to
> > > > > your reply that is probably not applicable. What about the Lily
> > > > > project that Michael mentioned? Is that considered SolrJ too?
> > > > > Are you aware of Cloudera Search? I know they provide an
> > > > > integrated Hadoop ecosystem. Do you know what their suggestion
> > > > > is?
> > > > > Best regards.
> > > > >
> > > > >
> > > > > On Wed, Aug 6, 2014 at 12:28 AM, Erick Erickson
> > > > > <erickerick...@gmail.com> wrote:
> > > > >
> > > > >> What you haven't told us is what you mean by "modify the
> > > > >> index outside Solr". SolrJ? Using raw Lucene? Trying to modify
> > > > >> things by writing your own codec? Standard Java I/O operations?
> > > > >> Other?
> > > > >>
> > > > >> You could use SolrJ to connect to an existing Solr server and
> > > > >> both read and modify at will from your M/R jobs. But if you're
> > > > >> thinking of trying to write/modify the segment files by raw I/O
> > > > >> operations, good luck! I'm 99.99% certain that's going to cause
> > > > >> you endless grief.
> > > > >>
> > > > >> Best,
> > > > >> Erick
> > > > >>
> > > > >>
> > > > >> On Tue, Aug 5, 2014 at 9:55 AM, Ali Nazemian
> > > > >> <alinazem...@gmail.com> wrote:
> > > > >>
> > > > >> > Actually I am going to do some analysis on the Solr data
> > > > >> > using MapReduce.
> > > > >> > For this purpose it might be necessary to change some part of
> > > > >> > the data or add new fields from outside Solr.
> > > > >> >
> > > > >> >
> > > > >> > On Tue, Aug 5, 2014 at 5:51 PM, Shawn Heisey <s...@elyograg.org>
> > > > >> > wrote:
> > > > >> >
> > > > >> > > On 8/5/2014 7:04 AM, Ali Nazemian wrote:
> > > > >> > > > I changed Solr 4.9 to write the index and data on HDFS.
> > > > >> > > > Now I am going to connect to that data from outside of
> > > > >> > > > Solr to change some of the values. Could somebody please
> > > > >> > > > tell me how that is possible? Suppose I am using HBase
> > > > >> > > > over HDFS to do these changes.
> > > > >> > >
> > > > >> > > I don't know how you could safely modify the index without
> > > > >> > > a Lucene application or another instance of Solr, but if
> > > > >> > > you do manage to modify the index, simply reloading the
> > > > >> > > core or restarting Solr should cause it to pick up the
> > > > >> > > changes. Either you would need to make sure that Solr never
> > > > >> > > modifies the index, or you would need some way of
> > > > >> > > coordinating updates so that Solr and the other application
> > > > >> > > would never try to modify the index at the same time.
> > > > >> > >
> > > > >> > > Thanks,
> > > > >> > > Shawn
> > > > >> >
> > > > >> > --
> > > > >> > A.Nazemian
> > > > >
> > > > > --
> > > > > A.Nazemian
> > > >
> > > > --
> > > > A.Nazemian
> >
> > --
> > A.Nazemian

--
A.Nazemian
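The MRIT workflow Erick describes above, building per-shard sub-indexes on
HDFS and optionally merging them into a live cluster with --go-live, is
driven from the command line roughly like this. This is a sketch only: the
jar location, morphline file, ZooKeeper addresses, collection name, and
HDFS paths are placeholders for your environment.

```shell
# Build per-shard Solr indexes on HDFS via MapReduce, then merge them
# into the running SolrCloud collection (--go-live). All hosts and
# paths below are environment-specific placeholders.
hadoop jar /path/to/solr-map-reduce-*.jar \
    org.apache.solr.hadoop.MapReduceIndexerTool \
    --zk-host zk1:2181,zk2:2181,zk3:2181/solr \
    --collection collection1 \
    --morphline-file morphline.conf \
    --output-dir hdfs://namenode:8020/user/solr/outdir \
    --go-live \
    hdfs://namenode:8020/user/solr/indir
```

Without --go-live, the per-shard indexes are simply left under
--output-dir on HDFS, where they can be opened later (for example with
EmbeddedSolrServer, as Erick notes).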
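Erick's suggestion, connecting through SolrJ from an M/R job rather than
touching the segment files, might look roughly like the sketch below. It is
written against the Solr 4.x SolrJ API (CloudSolrServer, later renamed
CloudSolrClient); the ZooKeeper addresses, collection, document id, and
field name are all placeholders, and atomic ("set") updates assume the
update log is enabled in solrconfig.xml.

```java
import java.util.Collections;

import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class ExternalFieldUpdate {
    public static void main(String[] args) throws Exception {
        // Placeholder ZK ensemble; SolrJ discovers the cluster topology
        // and collection layout from ZooKeeper, as Erick describes.
        CloudSolrServer solr = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
        solr.setDefaultCollection("collection1");

        // Atomic update: set a new analysis field on an existing document
        // without resending its other stored fields.
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-42");
        doc.addField("analysis_score_f", Collections.singletonMap("set", 0.87));

        solr.add(doc);
        solr.commit();
        solr.shutdown();
    }
}
```

Because the update goes through Solr itself, the index is never modified
externally, which avoids the locking and core-reload concerns Shawn raises.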