Hi Furkan,
I'm not able to get it working. Maybe I misunderstood your email.
Simplifying, let's assume my website has the following structure: http://www.xyz.com/book/1 contains links towards

http://www.xyz.com/book/2
http://www.xyz.com/book/3

The /2 and /3 pages also contain some outlinks, so the site map is the following:

- http://www.xyz.com/book/1
  - http://www.xyz.com/book/2
    - http://www.xyz.com/book/5
    - http://www.xyz.com/book/6
  - http://www.xyz.com/book/3
    - http://www.xyz.com/book/7
    - http://www.xyz.com/book/8

I put http://www.xyz.com/book/1 in the seed file and the following line in regex-urlfilter.txt (the only uncommented line):

+^http://www.xyz.com/book/([0-9]*\.)

Running

bin/crawl -i -D solr.server.url=http://localhost:8983/solr/woozlee urls/few/captain-gazza.txt TestCrawl x

produces the following output:

Injector: Total number of urls rejected by filters: 0
Injector: Total number of urls after normalization: 1
...
Indexing 1 documents
Indexer: number of documents indexed, deleted, or skipped:
Indexer: 1 indexed (add/update)
Indexer: finished at 2016-04-16 07:54:35, elapsed: 00:00:04
Cleaning up index if possible
/home/solr/apache-nutch-1.11/bin/nutch clean -Dsolr.server.url=http://localhost:8983/solr/woozlee TestCrawl/crawldb
Sat Apr 16 07:54:39 CEST 2016 : Iteration 2 of 5
Generating a new segment
/home/solr/apache-nutch-1.11/bin/nutch generate -D mapreduce.job.reduces=2 -D mapred.child.java.opts=-Xmx1000m -D mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D mapreduce.map.output.compress=true TestCrawl/crawldb TestCrawl/segments -topN 50000 -numFetchers 1 -noFilter
Generator: starting at 2016-04-16 07:54:40
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: false
Generator: normalizing: true
Generator: topN: 50000
Generator: 0 records selected for fetching, exiting ...
Generate returned 1 (no new segments created)
Escaping loop: no more URLs to fetch now

Whatever *x* is, the cycle completes shortly and indexes only the URL in the seed list (i.e. I have one record indexed in Solr).
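For what it's worth, this is the filter shape I would have expected to need (just a sketch, assuming every details page URL ends in a purely numeric id and that nothing else has to pass the filter):

# regex-urlfilter.txt (sketch, not my current file)
# accept the book details pages only
+^http://www\.xyz\.com/book/[0-9]+$
# explicitly reject everything else
-.

As far as I understand, the rules are evaluated in order and the first matching one wins, and the escaped dots keep the literal '.' in the host name from matching arbitrary characters.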
Again, many thanks for your help.

Best,
Andrea

On Fri, Apr 15, 2016 at 9:23 PM, Andrea Gazzarini <gxs...@gmail.com> wrote:

> Hi Furkan,
> many thanks, I'm going to try and I'll let you know.
>
> For the first question, I'm not sure about the overall size but we're
> talking about 2 million (growing) pages; in general nothing that can be
> easily handled with a from-scratch, custom solution.
>
> I was wondering if, from a functional perspective, Nutch is a good fit for
> "automating" the periodic indexing (in Solr, this is my ultimate goal) of
> that website. If that works, the same mechanism will be used for other
> websites as well.
>
> Best,
> Andrea
> On 15 Apr 2016 18:16, "Furkan KAMACI" <furkankam...@gmail.com> wrote:
>
>> Hi Andrea,
>>
>> The Regex URL Filter works like this:
>>
>> This rule accepts anything else:
>>
>> +.
>>
>> Let's assume that you want to crawl Nutch's website. If you wished to
>> limit the crawl to the nutch.apache.org domain, then the definition
>> should be:
>>
>> +^http://([a-z0-9]*\.)*nutch.apache.org/
>>
>> So, if your "more like this" section has this pattern:
>>
>> http://www.xyz.com/book/{book_id}
>>
>> then your definition should be:
>>
>> +^http://www.xyz.com/book/([0-9]*\.)*
>>
>> For your first question, you should tell us the approximate size of the
>> data you will crawl, etc., and whether you have any other needs.
>>
>> Kind Regards,
>> Furkan KAMACI
>>
>>
>> On Fri, Apr 15, 2016 at 4:17 PM, Andrea Gazzarini <gxs...@gmail.com>
>> wrote:
>>
>> > Hi guys,
>> > just playing as a Nutch newbie in a simple (at least I think) use case:
>> >
>> > I have a website (e.g. http://www.xyz.com) that allows searching for
>> > books. Here, as in any straight search website, I have two kinds of pages:
>> >
>> > * a page that shows search results (depending on the user-entered
>> >   search terms)
>> > * a details page about a given book. Each details page is a permalink
>> >   which follows a given naming convention (e.g.
>> >   http://www.xyz.com/book/{book id})
>> >
>> > The details page has something like a "more like this" section that
>> > contains permalinks to other (similar) books.
>> > Now, my requirement is to index in Solr *all* the details pages of such
>> > a website.
>> >
>> > If Nutch is a suitable tool for doing that (and this is actually the
>> > first question), could you please give me some hints about how to
>> > configure it?
>> >
>> > Specifically, I tried putting a seed file with just one entry
>> >
>> > http://www.xyx.com/book/1
>> >
>> > and then I configured my regex-urlfilter.txt
>> >
>> > +^http://www.xyx.com/book
>> >
>> > But it indexes only the /1 page. I imagined that the "more like this"
>> > section of the /1 page would act as a set of outlinks for getting further
>> > details pages (where in turn there are further MLT sections, and so on).
>> >
>> > Best,
>> > Andrea
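P.S. While waiting for your reply I will try to narrow the problem down with the two checks below (a sketch, using the same TestCrawl directory and Nutch 1.11 install shown in the log above):

# how many URLs ended up in the crawldb after the first iteration
/home/solr/apache-nutch-1.11/bin/nutch readdb TestCrawl/crawldb -stats

# fetch and parse the seed page and print the outlinks Nutch extracts from it
/home/solr/apache-nutch-1.11/bin/nutch parsechecker http://www.xyz.com/book/1

If parsechecker does not list /book/2 and /book/3 among the outlinks of /book/1, I guess the problem is in the parsing/filtering of the "more like this" links rather than in the crawl script itself.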