Dear Erick,
Could you please name the problems that SolrCloud cannot tackle alone? Maybe
I need SolrCloud + Hadoop and am not aware of that yet.
Regards.
On Thu, Aug 7, 2014 at 7:37 PM, Erick Erickson <erickerick...@gmail.com> wrote:

> If SolrCloud meets your needs, without Hadoop, then there's no real
> reason to introduce the added complexity.
>
> There are a bunch of problems that do _not_ work well with SolrCloud
> over non-Hadoop file systems. For those problems, the combination of
> SolrCloud and Hadoop makes tackling them possible.
>
> Best,
> Erick
>
>
> On Thu, Aug 7, 2014 at 3:55 AM, Ali Nazemian <alinazem...@gmail.com>
> wrote:
>
> > Thank you very much. But why should we go for distributed Solr with
> > Hadoop? There is already SolrCloud, which is pretty applicable in the
> > case of a big index. Is there any advantage to sending indexes over
> > MapReduce that SolrCloud cannot provide?
> > Regards.
> >
> >
> > On Wed, Aug 6, 2014 at 9:09 PM, Erick Erickson <erickerick...@gmail.com>
> > wrote:
> >
> > > bq: Are you aware of Cloudera Search? I know they provide an
> > > integrated Hadoop ecosystem.
> > >
> > > What Cloudera Search does via the MapReduceIndexerTool (MRIT) is
> > > create N sub-indexes for each shard in the M/R paradigm via
> > > EmbeddedSolrServer. Eventually, these sub-indexes for each shard are
> > > merged (perhaps through some number of levels) in the reduce phase,
> > > and maybe merged into a live Solr instance (--go-live). You'll note
> > > that this tool requires the address of the ZK ensemble, from which
> > > it can get the network topology, configuration files, all that rot.
> > > If you don't use the --go-live option, the output is still a Solr
> > > index; it's just that the index for each shard is left in a specific
> > > directory on HDFS. Being on HDFS allows this kind of M/R paradigm
> > > for massively parallel indexing operations, and perhaps massively
> > > complex analysis.
> > >
> > > Nowhere is there any low-level non-Solr manipulation of the indexes.
> > > The Flume fork just writes directly to the Solr nodes. It knows
> > > about the ZooKeeper ensemble and the collection too, and
> > > communicates via SolrJ, I'm pretty sure.
> > >
> > > As far as integrating with HDFS, you're right, HA is part of the
> > > package. As far as using the Solr indexes for analysis, well, you
> > > can write anything you want to use the Solr indexes from anywhere in
> > > the M/R world and have them available from anywhere in the cluster.
> > > There's no real need to even have Solr running; you could use the
> > > output from MRIT and access the sub-shards with the
> > > EmbeddedSolrServer if you wanted, leaving out all the pesky servlet
> > > container stuff.
> > >
> > > bq: So why we go for HDFS in the case of analysis if we want to use
> > > SolrJ for this purpose? What is the point?
> > >
> > > Scale and data access, in a nutshell. In the HDFS world, you can
> > > scale pretty linearly with the number of nodes you can rack
> > > together.
> > >
> > > Frankly, though, if your data set is small enough to fit on a single
> > > machine _and_ you can get through your analysis in a reasonable time
> > > (reasonable here is up to you), then HDFS is probably not worth the
> > > hassle. But in the big data world, where we're talking petabyte
> > > scale, having HDFS as the underpinning opens up possibilities for
> > > working on data that were difficult/impossible with Solr previously.
> > >
> > > Best,
> > > Erick
> > >
> > >
> > > On Tue, Aug 5, 2014 at 9:37 PM, Ali Nazemian <alinazem...@gmail.com>
> > > wrote:
> > >
> > > > Dear Erick,
> > > > I remember that some time ago, somebody asked what the point of
> > > > modifying Solr to use HDFS for storing indexes was. As far as I
> > > > remember, somebody told him integrating Solr with HDFS has two
> > > > advantages: 1) having Hadoop replication and HA,
> > > > 2) using indexes and Solr documents for other purposes such as
> > > > analysis. So why should we go for HDFS in the case of analysis if
> > > > we want to use SolrJ for this purpose? What is the point?
> > > > Regards.
> > > >
> > > >
> > > > On Wed, Aug 6, 2014 at 8:59 AM, Ali Nazemian <alinazem...@gmail.com>
> > > > wrote:
> > > >
> > > > > Dear Erick,
> > > > > Hi,
> > > > > Thank you for your reply. Yeah, I am aware that SolrJ is my last
> > > > > option. I was thinking about raw I/O operations, so according to
> > > > > your reply that is probably not applicable. What about the Lily
> > > > > project that Michael mentioned? Is that considered SolrJ too?
> > > > > Are you aware of Cloudera Search? I know they provide an
> > > > > integrated Hadoop ecosystem. Do you know what their suggestion
> > > > > is?
> > > > > Best regards.
> > > > >
> > > > >
> > > > > On Wed, Aug 6, 2014 at 12:28 AM, Erick Erickson
> > > > > <erickerick...@gmail.com> wrote:
> > > > >
> > > > >> What you haven't told us is what you mean by "modify the
> > > > >> index outside Solr". SolrJ? Using raw Lucene? Trying to modify
> > > > >> things by writing your own codec? Standard Java I/O operations?
> > > > >> Other?
> > > > >>
> > > > >> You could use SolrJ to connect to an existing Solr server and
> > > > >> both read and modify at will from your M/R jobs. But if you're
> > > > >> thinking of trying to write/modify the segment files by raw I/O
> > > > >> operations, good luck! I'm 99.99% certain that's going to cause
> > > > >> you endless grief.
> > > > >>
> > > > >> Best,
> > > > >> Erick
> > > > >>
> > > > >>
> > > > >> On Tue, Aug 5, 2014 at 9:55 AM, Ali Nazemian
> > > > >> <alinazem...@gmail.com> wrote:
> > > > >>
> > > > >> > Actually I am going to do some analysis on the Solr data
> > > > >> > using MapReduce.
> > > > >> > For this purpose it might be necessary to change some part of
> > > > >> > the data or add new fields from outside Solr.
> > > > >> >
> > > > >> >
> > > > >> > On Tue, Aug 5, 2014 at 5:51 PM, Shawn Heisey <s...@elyograg.org>
> > > > >> > wrote:
> > > > >> >
> > > > >> > > On 8/5/2014 7:04 AM, Ali Nazemian wrote:
> > > > >> > > > I changed Solr 4.9 to write the index and data on HDFS.
> > > > >> > > > Now I am going to connect to that data from outside of
> > > > >> > > > Solr to change some of the values. Could somebody please
> > > > >> > > > tell me how that is possible? Suppose I am using HBase
> > > > >> > > > over HDFS to do these changes.
> > > > >> > >
> > > > >> > > I don't know how you could safely modify the index without
> > > > >> > > a Lucene application or another instance of Solr, but if
> > > > >> > > you do manage to modify the index, simply reloading the
> > > > >> > > core or restarting Solr should cause it to pick up the
> > > > >> > > changes. Either you would need to make sure that Solr never
> > > > >> > > modifies the index, or you would need some way of
> > > > >> > > coordinating updates so that Solr and the other application
> > > > >> > > would never try to modify the index at the same time.
> > > > >> > >
> > > > >> > > Thanks,
> > > > >> > > Shawn
> > > > >> >
> > > > >> > --
> > > > >> > A.Nazemian
> > > > >
> > > > > --
> > > > > A.Nazemian
> > > >
> > > > --
> > > > A.Nazemian
> >
> > --
> > A.Nazemian

--
A.Nazemian
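The MRIT workflow Erick describes above, building per-shard sub-indexes on
HDFS and optionally merging them into a live cluster with --go-live, is
driven from the command line roughly like this. This is a sketch only: the
jar location, morphline file, ZooKeeper addresses, collection name, and
HDFS paths are placeholders for your environment.

```shell
# Build per-shard Solr indexes on HDFS via MapReduce, then merge them
# into the running SolrCloud collection (--go-live). All hosts and
# paths below are environment-specific placeholders.
hadoop jar /path/to/solr-map-reduce-*.jar \
    org.apache.solr.hadoop.MapReduceIndexerTool \
    --zk-host zk1:2181,zk2:2181,zk3:2181/solr \
    --collection collection1 \
    --morphline-file morphline.conf \
    --output-dir hdfs://namenode:8020/user/solr/outdir \
    --go-live \
    hdfs://namenode:8020/user/solr/indir
```

Without --go-live, the per-shard indexes are simply left under
--output-dir on HDFS, where they can be opened later (for example with
EmbeddedSolrServer, as Erick notes).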
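Erick's suggestion, connecting through SolrJ from an M/R job rather than
touching the segment files, might look roughly like the sketch below. It is
written against the Solr 4.x SolrJ API (CloudSolrServer, later renamed
CloudSolrClient); the ZooKeeper addresses, collection, document id, and
field name are all placeholders, and atomic ("set") updates assume the
update log is enabled in solrconfig.xml.

```java
import java.util.Collections;

import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class ExternalFieldUpdate {
    public static void main(String[] args) throws Exception {
        // Placeholder ZK ensemble; SolrJ discovers the cluster topology
        // and collection layout from ZooKeeper, as Erick describes.
        CloudSolrServer solr = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
        solr.setDefaultCollection("collection1");

        // Atomic update: set a new analysis field on an existing document
        // without resending its other stored fields.
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-42");
        doc.addField("analysis_score_f", Collections.singletonMap("set", 0.87));

        solr.add(doc);
        solr.commit();
        solr.shutdown();
    }
}
```

Because the update goes through Solr itself, the index is never modified
externally, which avoids the locking and core-reload concerns Shawn raises.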