Check this: http://wiki.apache.org/solr/DataImportHandler#Example:_Indexing_wikipedia
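For reference, the data-config.xml on that wiki page looks roughly like the sketch below: a FileDataSource streaming the MediaWiki XML dump through XPathEntityProcessor. The dump path and field names here are illustrative; check the wiki page for the exact version.

```xml
<dataConfig>
  <!-- Read the wikipedia XML dump straight from disk; no crawl needed. -->
  <dataSource type="FileDataSource" encoding="UTF-8" />
  <document>
    <!-- stream="true" so the multi-GB dump is not loaded into memory. -->
    <entity name="page"
            processor="XPathEntityProcessor"
            stream="true"
            forEach="/mediawiki/page/"
            url="/data/enwiki-latest-pages-articles.xml">
      <field column="id"        xpath="/mediawiki/page/id" />
      <field column="title"     xpath="/mediawiki/page/title" />
      <field column="text"      xpath="/mediawiki/page/revision/text" />
      <field column="timestamp" xpath="/mediawiki/page/revision/timestamp" />
    </entity>
  </document>
</dataConfig>
```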
On Monday 10 October 2011 16:41:27 Fred Zimmerman wrote:
> So let me make sure I understand: what this guy did is that he made an XML
> file from his local backup of wikipedia, but he didn't crawl it? Maybe I
> don't need to crawl it either, since the XML file can include the "id"
> field, which is where Solr keeps URLs, right?
>
> What I want to be able to do is submit a search to Solr, get back an
> answer set as a file using wt=csv, use a shell script to wget the
> documents in the answer set, and then process them in various ways. I
> already have this working on test data; I just need to be able to include
> the wiki data in the search results so the shell script can go get them
> too.
>
> On Mon, Oct 10, 2011 at 10:32 AM, Markus Jelsma
> <[email protected]> wrote:
> > That's something different. Indexing to Solr from a local backup of
> > wikipedia is much, much quicker, as you don't have to go through the
> > whole crawldb and push all data to a reducer and finally to Solr.
> >
> > On Monday 10 October 2011 16:28:02 Fred Zimmerman wrote:
> > > OK, that sounds good. Tell me about the indexing. I came across an
> > > article where someone had indexed about 10% of a wikipedia clone:
> > >
> > > http://h3x.no/2011/05/10/guide-solr-performance-tuning
> > >
> > > With a much bigger machine and a *lot* of tuning he was able to
> > > reduce the time required from 168 min to 16 min for the 600,000
> > > records.
> > >
> > > Fred
> > >
> > > On Mon, Oct 10, 2011 at 10:15 AM, Markus Jelsma
> > > <[email protected]> wrote:
> > > > Hi,
> > > >
> > > > Based on our experience I would recommend running Nutch on a Hadoop
> > > > pseudo-cluster with a bit more memory and at least 4 CPU cores.
> > > > Fetch and parse of those URLs won't be a problem, but updating the
> > > > crawldb and generating fetch lists is going to be a problem.
> > > >
> > > > Are you also indexing? Then that will also be a very costly
> > > > process.
> > > >
> > > > Cheers
> > > >
> > > > On Saturday 08 October 2011 19:29:49 Fred Zimmerman wrote:
> > > > > Hi,
> > > > >
> > > > > I am looking for advice on how to configure Nutch (and Solr) to
> > > > > crawl a private Wikipedia mirror.
> > > > >
> > > > > - It is my mirror on an intranet, so I do not need to be polite
> > > > >   to myself.
> > > > > - I need to complete this 11 million page crawl as fast as I
> > > > >   reasonably can.
> > > > > - Both crawler and mirror are 1.7GB machines dedicated to this
> > > > >   task.
> > > > > - I only need to crawl internal links (not external).
> > > > > - Eventually I will need to update the crawl, but a monthly
> > > > >   update will be sufficient.
> > > > >
> > > > > Any advice (and sample config files) would be much appreciated!
> > > > >
> > > > > Fred

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
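The wt=csv workflow Fred describes above can be sketched in a few lines of shell. The Solr core URL, the query, and the assumption that the "id" field holds each document's URL are all illustrative, not taken from his setup.

```shell
#!/bin/sh
# Sketch: query Solr for an answer set as CSV, then wget every document
# in it. The URL, query, and field names below are assumptions.

# Strip the CSV header row, leaving one document URL per line.
extract_urls() {
    tail -n +2 "$1"
}

fetch_answer_set() {
    solr_url=$1   # e.g. http://localhost:8983/solr/select
    query=$2      # e.g. title:wikipedia

    # 1. Ask Solr for the answer set as CSV; fl=id keeps only the id column.
    curl -s "$solr_url?q=$query&wt=csv&fl=id&rows=1000" -o results.csv

    # 2. Fetch every document in the answer set for later processing.
    extract_urls results.csv | while read -r url; do
        wget -q -P fetched/ "$url"
    done
}
```

Keeping the CSV on disk between the two steps means the wget stage can be re-run or post-processed without hitting Solr again.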

