I was trying to fetch one specific URL with a ? symbol, and Nutch was refusing 
to fetch it. But if I crawl the domain itself, Nutch does fetch links with the 
? symbol. I also noticed that Nutch did not fetch all files on this domain, 
yet if I point it directly at an unfetched file's URL, it fetches it. I used 
this command: "bin/nutch crawl urls -dir crawl -depth 6". If I specify 
-topN 50, Nutch does not fetch my files at all.
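
For reference, here is roughly what the relevant part of my 
conf/crawl-urlfilter.txt looks like after my edits (example.com is only a 
placeholder for my real domain):

    # accept hosts in MY.DOMAIN.NAME
    +^http://([a-z0-9]*\.)*example.com/

    # skip URLs containing certain characters as probable queries, etc.
    # -[?*!@=]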

So, my question is: how do I make Nutch fetch all files under a given domain?


Thanks.
A.


 

-----Original Message-----
From: [email protected]
To: [email protected]
Sent: Mon, 2 Mar 2009 3:36 pm
Subject: Re: urls with ? and & symbols

Hello,

I have one specific domain. I tested further, and it looks like Nutch fetches 
this domain's other links but not the ones with ?. It also fetches URLs with 
the ? symbol on other domains without problems.

 
How can I check whether robots.txt on this domain blocks these specific links 
from being fetched?
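
As far as I can tell, the way to check is to fetch the file directly and look 
for a matching Disallow rule; for example (with example.com again standing in 
for my domain):

    $ curl http://example.com/robots.txt

    User-agent: *
    Disallow: /some/path

As I understand it, a Disallow line whose prefix matches a URL's path keeps 
that URL from being fetched; some crawlers also honor wildcard rules like 
"Disallow: /*?", though I don't know whether Nutch 0.9 does.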

Thanks.
A.


 

-----Original Message-----
From: Bartosz Gadzimski <[email protected]>
To: [email protected]
Sent: Sun, 1 Mar 2009 11:13 am
Subject: Re: urls with ? and & symbols

[email protected] pisze:?

>  Hello,?

>?

> I use nutch-0.9 and try to index urls with ? and & symbols. I have commented 
this line? -[...@=] in conf/crawl-urlfilter.txt, conf/automaton-urlfilter and 
conf/regex-urlfilter.txt files.?

> However nutch still ignores these urls.?

>?

> Does anyone know how this can be fixed??

>?

> Thanks in advance.?

> A.?

>?

>?

>  
>?

>?

>?

>?

>   
Hi,

If you commented out those lines it should be fine. That part is correct, so 
the problem is somewhere else.

You must give us more information, like:

- does your Nutch crawl and index "normal" URLs (without ? and &)?
- are you crawling domains that are NOT blocked in crawl-urlfilter? (you can 
test this directly; see the check below the list)
- does robots.txt on this domain block your URLs?
- are you talking about one specific domain or many different ones?
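
One quick way to test the URL filters directly, if your Nutch build includes 
the URLFilterChecker class (newer releases do; I am not sure 0.9 ships it), 
is to pipe a URL through it and see whether the combined filters accept it:

    $ echo "http://example.com/page?id=1" | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined

The checker echoes the URL back prefixed with "+" if it is accepted and "-" 
if it is rejected (example.com is only a placeholder here).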

Thanks,
Bartosz
