I don't know; look around the httpclient code. But you probably want to make sure it's a client session issue first.
I could be wrong.

> I tried running with one thread, still the same results.
> Any hint on how we can make Nutch aware of session cookies?
>
> ---
> Thanks/Regards,
> Parvez
>
>
> On Wed, Sep 9, 2009 at 12:51 AM, <[email protected]> wrote:
>
>> How many threads are you running at?
>>
>> Nutch doesn't know about sessions;
>>
>> you might have to do something like fetching one thread at a time, but
>> that's slow.
>>
>> Or maybe make Nutch aware of session cookies.
>>
>> > I am crawling at depth 40 as there are 40 pages in the pagination.
>> >
>> > It works fine for the first 6 pages, and after that it goes to the 7th
>> > page, but it looks like it's a different session and hence the
>> > pagination won't work.
>> >
>> > I mean, if you directly hit page 7 using the URL, the pagination won't
>> > work and will return an empty set.
>> >
>> > But if you go in sequence in the same session, the pagination works.
>> >
>> >
>> > ---
>> > Thanks/Regards,
>> > Parvez
>> >
>> >
>> > On Wed, Sep 9, 2009 at 12:15 AM, <[email protected]> wrote:
>> >
>> >> Could be tricky from what I've seen;
>> >>
>> >> there are limits on how many times you can hit one host/IP;
>> >>
>> >> also, what depth you are crawling at may come into play in your case
>> >> (which is probably what you want to look at in this case).
>> >>
>> >>
>> >> > Any hint on how to increase the session time of the Nutch crawl thread?
>> >> > I tried crawling with one thread, still no luck.
>> >> >
>> >> > ----
>> >> > Thanks/Regards,
>> >> > Parvez
>> >> >
>> >> >
>> >> >
>> >> > On Tue, Sep 8, 2009 at 4:02 PM, Mohamed Parvez <[email protected]> wrote:
>> >> >
>> >> >> I have a set of paginated pages, which will only work if they are
>> >> >> crawled in a given sequence, and in the same session.
>> >> >>
>> >> >> For example, the first URLs are
>> >> >>
>> >> >> http://www.myhost.com/?page_number=1
>> >> >> http://www.myhost.com/?page_number=2
>> >> >> http://www.myhost.com/?page_number=3
>> >> >>
>> >> >> The first page has a link to the second page.
>> >> >> The second page has a link to the first and second page.
>> >> >> The third page has a link to the third and second page.
>> >> >> So on...
>> >> >>
>> >> >> Nutch is able to crawl the first 6 pages, but beyond that it is not
>> >> >> able to crawl or is getting an empty result.
>> >> >>
>> >> >> If I manually click through the pagination in a browser, I can reach
>> >> >> the end with no problem.
>> >> >>
>> >> >> Is the Nutch crawl session timing out? How do we increase it?
>> >> >>
>> >> >> I tried crawling with one thread but still get the same result.
>> >> >>
>> >> >> Any suggestions?
>> >> >>
>> >> >> ---
>> >> >> Thanks/Regards,
>> >> >> Parvez
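
One way to check whether this really is a client session issue, before changing anything in Nutch, is a small standalone script outside the crawler. The sketch below is not Nutch code and does not use any Nutch API. It assumes the site tracks the session with cookies (which the thread suspects but has not confirmed), it reuses the example URL pattern from the original post, and all function names are made up for illustration. It walks the pages in order on one cookie-aware opener, then requests the last page directly with a fresh opener, so the two response sizes can be compared.

```python
import http.cookiejar
import urllib.request

# URL pattern taken from the example URLs in the thread (placeholder host).
URL_TEMPLATE = "http://www.myhost.com/?page_number={}"


def make_session_opener():
    """Return (opener, jar). The opener stores cookies and resends them."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def body_size(opener, url, timeout=30):
    """Fetch one URL and return the size of the response body in bytes."""
    with opener.open(url, timeout=timeout) as resp:
        return len(resp.read())


def walk_in_session(opener, last_page, template=URL_TEMPLATE):
    """Fetch pages 1..last_page in order on one opener (one cookie session)."""
    return [body_size(opener, template.format(n)) for n in range(1, last_page + 1)]


def compare(last_page, template=URL_TEMPLATE, opener_factory=make_session_opener):
    """Compare the last page fetched in sequence with a direct, cookie-less hit.

    opener_factory is a hypothetical hook so a different opener can be
    swapped in; by default each call creates a new, empty cookie jar.
    """
    session, jar = opener_factory()
    sizes = walk_in_session(session, last_page, template)

    fresh, _ = opener_factory()  # new opener, no cookies from the walk above
    direct = body_size(fresh, template.format(last_page))

    return {
        "cookies": [cookie.name for cookie in jar],  # cookies set during the walk
        "in_session": sizes[-1],                     # last page, reached in order
        "direct": direct,                            # last page, hit directly
    }
```

Calling `compare(7)` against the real host would show which cookies the site set and whether page 7 comes back noticeably smaller when hit directly. If the in-sequence fetch works here but fails in Nutch, that points at cookie handling in the fetcher; if it also fails here after page 6, the cause is probably something other than cookies.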
