RE: Best practice to index a large crawl through Solr?

Markus Jelsma Mon, 22 Oct 2012 13:40:59 -0700

Hi - Hadoop can write more records per second than Solr can analyze and store,
especially with multiple reducers (which show up as concurrent indexing threads
in Solr). SolrCloud is notoriously slow when it comes to indexing compared to a
stand-alone setup. However, this should not be a problem at all, as you're only
dealing with a few million records. Trying to tie HBase to Solr as a backend is
not a good idea at all. The best and fastest storage for Solr is a disk with
MMapDirectory enabled (the default in recent versions) and plenty of RAM. Keep
in mind that Solr keeps several parts of the index in memory, and others too if
it can, and it is very efficient at doing that.


With only a few million records it's easy and fast enough to run Hadoop locally
(or in pseudo-distributed mode if you can) and have a single Solr node running.
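
For illustration only, here is a minimal Python sketch of what feeding a single
Solr node in batches looks like. Nutch's own Solr indexing job normally does
this for you, so this is not how Nutch is implemented. The URL, core name,
batch size and the helper names (batches, build_update_request, index_all) are
all assumptions made up for the example; the only Solr behaviour relied on is
the JSON update handler at /update.

```python
import json
from itertools import islice
from typing import Dict, Iterable, Iterator, List
from urllib import request

Doc = Dict[str, object]


def batches(docs: Iterable[Doc], size: int) -> Iterator[List[Doc]]:
    """Yield lists of at most `size` documents, so that Solr receives a few
    large requests instead of one request per record."""
    it = iter(docs)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def build_update_request(solr_url: str, payload: object) -> request.Request:
    """Build (but do not send) a JSON POST to the core's /update handler.
    `solr_url` is a hypothetical core URL, e.g.
    http://localhost:8983/solr/collection1"""
    return request.Request(
        solr_url.rstrip("/") + "/update",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def index_all(solr_url: str, docs: Iterable[Doc], batch_size: int = 1000) -> int:
    """Send every batch, then commit once at the end. Committing after
    every batch would slow indexing down considerably."""
    sent = 0
    for batch in batches(docs, batch_size):
        with request.urlopen(build_update_request(solr_url, batch)) as resp:
            resp.read()
        sent += len(batch)
    # A JSON body of {"commit": {}} asks Solr to commit what was sent.
    with request.urlopen(build_update_request(solr_url, {"commit": {}})) as resp:
        resp.read()
    return sent
```

Calling index_all("http://localhost:8983/solr/collection1", docs) would need a
running Solr at that (assumed) address. The point is only that a single node
fed with reasonably sized batches and one final commit is usually enough for a
crawl of this size.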
 
-----Original message-----
> From:Thilina Gunarathne <cset...@gmail.com>
> Sent: Mon 22-Oct-2012 22:35
> To: user@nutch.apache.org
> Subject: Re: Best practice to index a large crawl through Solr?
> 
> Hi Alex,
> Thanks again for the information.
> 
> My current requirement is to implement a simple search application for
> a publication. Our current data sizes probably would not exceed the amount
> of records you mentioned and for now, we should be fine with a single Solr
> instance. I'm going to check out the SolrCloud for our future needs.
> 
> >Hm, so you are thinking Nutch -> HBase -> Solr -> HBase, that does
> >sound pretty crazy.
> I agree :).. Unfortunately (or maybe luckily) I do not have much time to
> invest in this and I'll probably have to rely on the existing tools, rather
> than trying to reinvent the wheel :)..
> 
> thanks,
> Thilina
> 
> 
> On Mon, Oct 22, 2012 at 4:00 PM, Alejandro Caceres <
> acace...@hyperiongray.com> wrote:
> 
> > No problem. Wrt your first question, Solr would actually be storing
> > this data locally. Solr sharding actually uses its own mechanism
> > called SolrCloud. I'd recommend checking it out here:
> > http://wiki.apache.org/solr/SolrCloud, it seems cool though I have not
> > used it myself.
> >
> > Hm, so you are thinking Nutch -> HBase -> Solr -> HBase, that does
> > sound pretty crazy. You can most definitely find a more efficient way
> > to do this, either by going to HBase directly from the start (I
> > wouldn't do so personally) or just using Solr. It might be good to
> > know what kind of application you are looking to build and asking more
> > specifically.
> >
> > Alex
> >
> > On Mon, Oct 22, 2012 at 3:48 PM, Thilina Gunarathne <cset...@gmail.com>
> > wrote:
> > > Hi Alex,
> > > Thanks for the very fast response :)..
> > >
> > >> It sort of depends on your purpose and the amount of data. I currently
> > >> have a single Solr instance (~1GB of memory, 2 processors on the
> > >> server) serving almost ~3,700,000 records from Nutch and it's still
> > >> working great for me. If you have around that I'd say a single Solr
> > >> instance is OK, depending on if you are planning on making your data
> > >> publicly available or not.
> > >>
> > > This is very useful information. In this case, would the Solr instance be
> > > retrieving and storing all the data locally or is it still using the Nutch
> > > data store to retrieve the actual content while serving the queries?
> > >
> > >
> > >> If you're creating something larger of some sort, Solr 4.0, which
> > >> supports sharding natively would be a great option (I think it's still
> > >> in Beta, but if you're feeling brave...). This is especially true if
> > >> you are creating a search engine of some sort, or would like easily
> > >> searchable data.
> > >>
> > > That's interesting. I'll check that out. By any chance, do you know whether
> > > the Solr sharding is using HDFS to store the data or is it using its
> > > own infrastructure?
> > >
> > >
> > >> I would imagine doing this directly from HBase would not be a great
> > >> option, as Nutch is storing the data in the format that is convenient
> > >> for Nutch itself to use, and not so much in a format that it is
> > >> friendly for you to reuse for your own purposes.
> > >>
> > > I was actually thinking of a scenario where we would use Solr to index the
> > > data and store the resulting index in HBase, then use HBase directly to
> > > perform simple index lookups. Please pardon my lack of knowledge of Nutch
> > > and Solr, if the above sounds ludicrous :)..
> > >
> > > thanks,
> > > Thilina
> > >
> > >
> > >> IMO your best bet is going to try out Solr 4.0.
> > >>
> > >> Alex
> > >>
> > >> On Mon, Oct 22, 2012 at 3:03 PM, Thilina Gunarathne <cset...@gmail.com>
> > >> wrote:
> > >> > Dear All,
> > >> > What would be the best practice to index a large crawl using Solr? The
> > >> > crawl is performed on a multi-node Hadoop cluster using HBase as the back
> > >> > end. Would Solr become a bottleneck if we use just a single Solr instance?
> > >> > Is it possible to store the indexed data in HBase and to serve it from
> > >> > HBase itself?
> > >> >
> > >> > thanks a lot,
> > >> > Thilina
> > >> >
> > >> > --
> > >> > https://www.cs.indiana.edu/~tgunarat/
> > >> > http://www.linkedin.com/in/thilina
> > >> > http://thilina.gunarathne.org
> > >>
> > >>
> > >>
> > >> --
> > >> ___
> > >>
> > >> Alejandro Caceres
> > >> Hyperion Gray, LLC
> > >> Owner/CTO
> > >>
> > >
> > >
> > >
> > > --
> > > https://www.cs.indiana.edu/~tgunarat/
> > > http://www.linkedin.com/in/thilina
> > > http://thilina.gunarathne.org
> >
> >
> >
> > --
> > ___
> >
> > Alejandro Caceres
> > Hyperion Gray, LLC
> > Owner/CTO
> >
> 
> 
> 
> -- 
> https://www.cs.indiana.edu/~tgunarat/
> http://www.linkedin.com/in/thilina
> http://thilina.gunarathne.org
>
