Check this: http://wiki.apache.org/solr/DataImportHandler#Example:_Indexing_wikipedia
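For reference, the data-config.xml on that wiki page looks roughly like the sketch below: a FileDataSource streaming the MediaWiki XML dump through XPathEntityProcessor. The dump path and field names here are illustrative; check the wiki page for the exact version.

```xml
<dataConfig>
  <!-- Read the wikipedia XML dump straight from disk; no crawl needed. -->
  <dataSource type="FileDataSource" encoding="UTF-8" />
  <document>
    <!-- stream="true" so the multi-GB dump is not loaded into memory. -->
    <entity name="page"
            processor="XPathEntityProcessor"
            stream="true"
            forEach="/mediawiki/page/"
            url="/data/enwiki-latest-pages-articles.xml">
      <field column="id"        xpath="/mediawiki/page/id" />
      <field column="title"     xpath="/mediawiki/page/title" />
      <field column="text"      xpath="/mediawiki/page/revision/text" />
      <field column="timestamp" xpath="/mediawiki/page/revision/timestamp" />
    </entity>
  </document>
</dataConfig>
```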
On Monday 10 October 2011 16:41:27 Fred Zimmerman wrote:
> So let me make sure I understand: what this guy did is that he made an XML
> file from his local backup of wikipedia, but he didn't crawl it? Maybe I
> don't need to crawl it either, since the XML file can include the "id"
> field, which is where Solr keeps URLs, right?
>
> What I want to be able to do is submit a search to Solr, get back an
> answer set as a file using wt=csv, use a shell script to wget the
> documents in the answer set, and then process them in various ways. I
> already have this working on test data; I just need to be able to include
> the wiki data in the search results so the shell script can go get them
> too.
>
> On Mon, Oct 10, 2011 at 10:32 AM, Markus Jelsma
> <[email protected]> wrote:
> > That's something different. Indexing to Solr from a local backup of
> > wikipedia is much, much quicker, as you don't have to go through the
> > whole crawldb and push all data to a reducer and finally to Solr.
> >
> > On Monday 10 October 2011 16:28:02 Fred Zimmerman wrote:
> > > OK, that sounds good. Tell me about the indexing. I came across an
> > > article where someone had indexed about 10% of a wikipedia clone:
> > >
> > > http://h3x.no/2011/05/10/guide-solr-performance-tuning
> > >
> > > With a much bigger machine and a *lot* of tuning he was able to
> > > reduce the time required from 168 min to 16 min for the 600,000
> > > records.
> > >
> > > Fred
> > >
> > > On Mon, Oct 10, 2011 at 10:15 AM, Markus Jelsma
> > > <[email protected]> wrote:
> > > > Hi,
> > > >
> > > > Based on our experience I would recommend running Nutch on a Hadoop
> > > > pseudo-cluster with a bit more memory and at least 4 CPU cores.
> > > > Fetch and parse of those URLs won't be a problem, but updating the
> > > > crawldb and generating fetch lists is going to be a problem.
> > > >
> > > > Are you also indexing? Then that will also be a very costly
> > > > process.
> > > >
> > > > Cheers
> > > >
> > > > On Saturday 08 October 2011 19:29:49 Fred Zimmerman wrote:
> > > > > Hi,
> > > > >
> > > > > I am looking for advice on how to configure Nutch (and Solr) to
> > > > > crawl a private Wikipedia mirror.
> > > > >
> > > > > - It is my mirror on an intranet, so I do not need to be polite
> > > > >   to myself.
> > > > > - I need to complete this 11 million page crawl as fast as I
> > > > >   reasonably can.
> > > > > - Both crawler and mirror are 1.7GB machines dedicated to this
> > > > >   task.
> > > > > - I only need to crawl internal links (not external).
> > > > > - Eventually I will need to update the crawl, but a monthly
> > > > >   update will be sufficient.
> > > > >
> > > > > Any advice (and sample config files) would be much appreciated!
> > > > >
> > > > > Fred

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
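The wt=csv workflow Fred describes above can be sketched in a few lines of shell. The Solr core URL, the query, and the assumption that the "id" field holds each document's URL are all illustrative, not taken from his setup.

```shell
#!/bin/sh
# Sketch: query Solr for an answer set as CSV, then wget every document
# in it. The URL, query, and field names below are assumptions.

# Strip the CSV header row, leaving one document URL per line.
extract_urls() {
    tail -n +2 "$1"
}

fetch_answer_set() {
    solr_url=$1   # e.g. http://localhost:8983/solr/select
    query=$2      # e.g. title:wikipedia

    # 1. Ask Solr for the answer set as CSV; fl=id keeps only the id column.
    curl -s "$solr_url?q=$query&wt=csv&fl=id&rows=1000" -o results.csv

    # 2. Fetch every document in the answer set for later processing.
    extract_urls results.csv | while read -r url; do
        wget -q -P fetched/ "$url"
    done
}
```

Keeping the CSV on disk between the two steps means the wget stage can be re-run or post-processed without hitting Solr again.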

