Thank you very much. But why should we go for Solr distributed with Hadoop?
There is already SolrCloud, which works well in the case of a big index. Is
there any advantage to building indexes over MapReduce that SolrCloud cannot
provide?
Regards.


On Wed, Aug 6, 2014 at 9:09 PM, Erick Erickson <erickerick...@gmail.com>
wrote:

> bq: Are you aware of Cloudera Search? I know they provide an integrated
> Hadoop ecosystem.
>
> What Cloudera Search does via the MapReduceIndexerTool (MRIT) is create N
> sub-indexes for each shard in the M/R paradigm via EmbeddedSolrServer.
> Eventually these sub-indexes are merged (perhaps through some number of
> levels) in the reduce phase, and maybe merged into a live Solr instance
> (--go-live). You'll note that this tool requires the address of the ZK
> ensemble, from which it can get the network topology, configuration files,
> all that rot. If you don't use the --go-live option, the output is still a
> Solr index; it's just that the index for each shard is left in a specific
> directory on HDFS. Being on HDFS allows this kind of M/R paradigm for
> massively parallel indexing operations, and perhaps massively complex
> analysis.
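>
> For the curious: the --go-live step is, if I remember right, essentially a
> CoreAdmin MERGEINDEXES call against each live shard. A rough SolrJ 4.x
> sketch of the same idea (the host, target core name, and HDFS path are all
> made up for illustration):
>
> import org.apache.solr.client.solrj.impl.HttpSolrServer;
> import org.apache.solr.client.solrj.request.CoreAdminRequest;
>
> public class GoLiveSketch {
>   public static void main(String[] args) throws Exception {
>     // Hypothetical host, target core, and MRIT output path on HDFS.
>     HttpSolrServer server = new HttpSolrServer("http://solr1:8983/solr");
>     CoreAdminRequest.mergeIndexes(
>         "collection1_shard1_replica1",   // target core
>         new String[] { "hdfs://nn:8020/mrit-out/results/part-00000/data/index" },
>         new String[0],                   // no source cores, merging by directory
>         server);
>     server.shutdown();
>   }
> }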
>
> Nowhere is there any low-level non-Solr manipulation of the indexes.
>
> The Flume fork just writes directly to the Solr nodes. It knows about the
> ZooKeeper ensemble and the collection too, and communicates via SolrJ, I'm
> pretty sure.
>
> As far as integrating with HDFS, you're right, HA is part of the package.
> As far as using the Solr indexes for analysis, well, you can write anything
> you want to use the Solr indexes from anywhere in the M/R world and have
> them available from anywhere in the cluster. There's no real need to even
> have Solr running; you could use the output from MRIT and access the
> sub-shards with EmbeddedSolrServer if you wanted, leaving out all the pesky
> servlet container stuff.
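>
> Just to make that concrete, here's a minimal SolrJ 4.x sketch of opening
> one of those sub-shards with EmbeddedSolrServer (the solr home and core
> name are placeholders; the core's dataDir would have to point at the MRIT
> output directory for the shard):
>
> import org.apache.solr.client.solrj.SolrQuery;
> import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
> import org.apache.solr.core.CoreContainer;
>
> public class ReadSubShard {
>   public static void main(String[] args) throws Exception {
>     // Hypothetical solr home containing a core whose dataDir points at
>     // the MRIT output for one shard.
>     CoreContainer container = new CoreContainer("/path/to/solrhome");
>     container.load();
>     EmbeddedSolrServer server = new EmbeddedSolrServer(container, "collection1");
>     try {
>       long hits = server.query(new SolrQuery("*:*")).getResults().getNumFound();
>       System.out.println("docs in sub-shard: " + hits);
>     } finally {
>       server.shutdown();  // also shuts down the core container
>     }
>   }
> }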
>
> bq: So why would we go for HDFS in the case of analysis if we want to use
> SolrJ for this purpose? What is the point?
> What is the point?
>
> Scale and data access, in a nutshell. In the HDFS world, you can scale
> pretty linearly with the number of nodes you can rack together.
>
> Frankly, though, if your data set is small enough to fit on a single
> machine _and_ you can get through your analysis in a reasonable time
> (reasonable here is up to you), then HDFS is probably not worth the hassle.
> But in the big data world, where we're talking petabyte scale, having HDFS
> as the underpinning opens up possibilities for working on data that were
> difficult/impossible with Solr previously.
>
> Best,
> Erick
>
>
>
> On Tue, Aug 5, 2014 at 9:37 PM, Ali Nazemian <alinazem...@gmail.com>
> wrote:
>
> > Dear Erick,
> > I remember that some time ago somebody asked what the point was of
> > modifying Solr to use HDFS for storing indexes. As far as I remember, the
> > answer was that integrating Solr with HDFS has two advantages: 1) getting
> > Hadoop replication and HA, and 2) using the indexes and Solr documents
> > for other purposes such as analysis. So why would we go for HDFS in the
> > case of analysis if we want to use SolrJ for this purpose? What is the
> > point?
> > Regards.
> >
> >
> > On Wed, Aug 6, 2014 at 8:59 AM, Ali Nazemian <alinazem...@gmail.com>
> > wrote:
> >
> > > Dear Erick,
> > > Hi,
> > > Thank you for your reply. Yes, I am aware that SolrJ is my last
> > > option; I was thinking about raw I/O operations, so according to your
> > > reply that is probably not feasible. What about the Lily project that
> > > Michael mentioned? Does that use SolrJ too? Are you aware of Cloudera
> > > Search? I know they provide an integrated Hadoop ecosystem. Do you
> > > know what their suggestion is?
> > > Best regards.
> > >
> > >
> > >
> > > On Wed, Aug 6, 2014 at 12:28 AM, Erick Erickson <
> erickerick...@gmail.com
> > >
> > > wrote:
> > >
> > >> What you haven't told us is what you mean by "modify the
> > >> index outside Solr". SolrJ? Using raw Lucene? Trying to modify
> > >> things by writing your own codec? Standard Java I/O operations?
> > >> Other?
> > >>
> > >> You could use SolrJ to connect to an existing Solr server and
> > >> both read and modify at will from your M/R jobs. But if you're
> > >> thinking of trying to write/modify the segment files by raw I/O
> > >> operations, good luck! I'm 99.99% certain that's going to cause
> > >> you endless grief.
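> > >>
> > >> For example (a sketch only; the ZK address, collection, and field
> > >> names are hypothetical, using the 4.x CloudSolrServer API), a map or
> > >> reduce task could push atomic updates along these lines:
> > >>
> > >> import java.util.Collections;
> > >> import org.apache.solr.client.solrj.impl.CloudSolrServer;
> > >> import org.apache.solr.common.SolrInputDocument;
> > >>
> > >> public class UpdateFromMR {
> > >>   public static void main(String[] args) throws Exception {
> > >>     CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181");
> > >>     server.setDefaultCollection("collection1");
> > >>     // Atomic update: overwrite one field on an existing document.
> > >>     SolrInputDocument doc = new SolrInputDocument();
> > >>     doc.addField("id", "42");
> > >>     doc.addField("category", Collections.singletonMap("set", "news"));
> > >>     server.add(doc);
> > >>     server.commit();
> > >>     server.shutdown();
> > >>   }
> > >> }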
> > >>
> > >> Best,
> > >> Erick
> > >>
> > >>
> > >> On Tue, Aug 5, 2014 at 9:55 AM, Ali Nazemian <alinazem...@gmail.com>
> > >> wrote:
> > >>
> > >> > Actually, I am going to do some analysis on the Solr data using
> > >> > MapReduce. For this purpose it might be necessary to change some
> > >> > parts of the data or add new fields from outside Solr.
> > >> >
> > >> >
> > >> > On Tue, Aug 5, 2014 at 5:51 PM, Shawn Heisey <s...@elyograg.org>
> > wrote:
> > >> >
> > >> > > On 8/5/2014 7:04 AM, Ali Nazemian wrote:
> > >> > > > I changed Solr 4.9 to write its index and data on HDFS. Now I am
> > >> > > > going to connect to that data from outside Solr to change some of
> > >> > > > the values. Could somebody please tell me how that is possible?
> > >> > > > Suppose I am using HBase over HDFS to make these changes.
> > >> > >
> > >> > > I don't know how you could safely modify the index without a Lucene
> > >> > > application or another instance of Solr, but if you do manage to
> > >> > > modify the index, simply reloading the core or restarting Solr
> > >> > > should cause it to pick up the changes. Either you would need to
> > >> > > make sure that Solr never modifies the index, or you would need
> > >> > > some way of coordinating updates so that Solr and the other
> > >> > > application would never try to modify the index at the same time.
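> > >> > >
> > >> > > For what it's worth, the reload can be scripted too. A minimal
> > >> > > sketch via SolrJ's CoreAdminRequest (host and core name are
> > >> > > placeholders; this is equivalent to hitting
> > >> > > /admin/cores?action=RELOAD&core=collection1 over HTTP):
> > >> > >
> > >> > > import org.apache.solr.client.solrj.impl.HttpSolrServer;
> > >> > > import org.apache.solr.client.solrj.request.CoreAdminRequest;
> > >> > >
> > >> > > public class ReloadCore {
> > >> > >   public static void main(String[] args) throws Exception {
> > >> > >     HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");
> > >> > >     CoreAdminRequest.reloadCore("collection1", server);
> > >> > >     server.shutdown();
> > >> > >   }
> > >> > > }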
> > >> > >
> > >> > > Thanks,
> > >> > > Shawn
> > >> > >
> > >> > >
> > >> >
> > >> >
> > >> > --
> > >> > A.Nazemian
> > >> >
> > >>
> > >
> > >
> > >
> > > --
> > > A.Nazemian
> > >
> >
> >
> >
> > --
> > A.Nazemian
> >
>



-- 
A.Nazemian
