I tried running with one thread, but I still get the same results. Any hints on how to make Nutch aware of session cookies?
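
To make clear what I am after, here is a rough sketch, outside Nutch, of the behaviour I need: one client, one cookie jar, and the pages fetched strictly in order. It is plain Python standard library, not Nutch code or a Nutch API. The helper names (build_session_opener, page_urls, crawl_in_sequence, fetch_with) are made up for illustration, and stopping at the first empty body is my assumption about how the site signals a broken session.

```python
import http.cookiejar
import urllib.request


def build_session_opener():
    """Return (opener, jar); the opener stores and resends cookies."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    return opener, jar


def page_urls(base_url, last_page, param="page_number"):
    """List the pagination URLs in the order they must be visited."""
    return [f"{base_url}?{param}={n}" for n in range(1, last_page + 1)]


def fetch_with(opener, timeout=30):
    """Wrap an opener as a simple url -> text function."""
    def fetch(url):
        with opener.open(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")
    return fetch


def crawl_in_sequence(fetch, urls):
    """Fetch urls one at a time, in order, with the same fetch function.

    Stops at the first empty body (assumed sign of a lost session).
    """
    pages = []
    for url in urls:
        body = fetch(url)
        if not body.strip():
            break
        pages.append((url, body))
    return pages
```

It would be called as crawl_in_sequence(fetch_with(opener), page_urls("http://www.myhost.com/", 40)), with opener taken from build_session_opener(). Because every request goes through the same opener, any session cookie set by page 1 is sent again for pages 2 to 40.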

---
Thanks/Regards,
Parvez

On Wed, Sep 9, 2009 at 12:51 AM, <[email protected]> wrote:

> how many threads are you running at?
>
> nutch doesn't know about sessions;
>
> you might have to do something like fetching one thread at a time but
> that's slow.
>
> or maybe make nutch aware of session cookies.
>
> > I am crawling at depth 40 as there are 40 pages in the pagination.
> >
> > It works fine for the first 6 pages and after that it goes to the
> > 7th page, but it looks like it's a different session and hence the
> > pagination won't work.
> >
> > I mean if you directly hit page 7, using the URL, the pagination
> > won't work and will return an empty set.
> >
> > But if you go in sequence in the same session the pagination works.
> >
> > ---
> > Thanks/Regards,
> > Parvez
> >
> > On Wed, Sep 9, 2009 at 12:15 AM, <[email protected]> wrote:
> >
> >> could be tricky from what i've seen;
> >>
> >> there are limits on how many times you can hit one host/ip;
> >>
> >> also what depth you are crawling at may come into play in your case
> >> (which is probably what you want to look at in this case).
> >>
> >> > Any hint on how to increase the session time of the Nutch crawl
> >> > thread? I tried crawling with one thread, still no luck.
> >> >
> >> > ----
> >> > Thanks/Regards,
> >> > Parvez
> >> >
> >> > On Tue, Sep 8, 2009 at 4:02 PM, Mohamed Parvez <[email protected]> wrote:
> >> >
> >> >> I have paginated pages, which will only work if they are crawled
> >> >> in a given sequence, and in the same session.
> >> >>
> >> >> For example first URL is
> >> >>
> >> >> http://www.myhost.com/?page_number=1
> >> >> http://www.myhost.com/?page_number=2
> >> >> http://www.myhost.com/?page_number=3
> >> >>
> >> >> The first page has link to second page.
> >> >> Second page has link to first and second page.
> >> >> Third page has link to third and second page.
> >> >> So on...
> >> >>
> >> >> Nutch is able to crawl the first 6 pages, but beyond that it is
> >> >> not able to crawl or is getting an empty result.
> >> >>
> >> >> If I manually click through the pagination, in a browser, I can
> >> >> reach the end with no problem.
> >> >>
> >> >> Is the Nutch crawl session timing out? How do we increase it?
> >> >>
> >> >> I tried crawling with one thread but still the same result.
> >> >>
> >> >> Any suggestions?
> >> >>
> >> >> ---
> >> >> Thanks/Regards,
> >> >> Parvez
