The general answer is: it depends. It is usually "polite" to identify your 
robot to the website so the webmaster knows what is accessing the site; this is 
why Google and many other search engines (big and small) use a distinctive name 
for their crawlers/bots. That said, the first site you mention works fine in a 
quick parsechecker run that I executed:

➜  local  bin/nutch parsechecker http://www.neimanmarcus.com/Dressing-for-the-Dark-New-This-Week/prod176400153_cat46660760__/p.prod
fetching: http://www.neimanmarcus.com/Dressing-for-the-Dark-New-This-Week/prod176400153_cat46660760__/p.prod
parsing: http://www.neimanmarcus.com/Dressing-for-the-Dark-New-This-Week/prod176400153_cat46660760__/p.prod
contentType: text/html
signature: 8e90c6d581f27c36828d433f746e4d7a
---------
Url
---------------

http://www.neimanmarcus.com/Dressing-for-the-Dark-New-This-Week/prod176400153_cat46660760__/p.prod
---------
ParseData
---------

Version: 5
Status: success(1,0)
Title: "Dressing for the Dark"
Outlinks: 151
  outlink: toUrl: http://www.neimanmarcus.com/cssbundle/1468949595/bundles/product_rwd.css anchor:
  outlink: toUrl: http://www.neimanmarcus.com/category/templates/css/r_rBrand.css anchor:
  outlink: toUrl: http://www.neimanmarcus.com/category/templates/css/r_rProduct.css anchor:
  outlink: toUrl: http://www.neimanmarcus.com/jsbundle/2144966094/bundles/general_rwd.js anchor:
...

(trimmed due to length)

As for the second one, I wasn't able to test it; the provided URL blocks access 
from my IP/country:

This request is blocked by the SonicWALL Gateway Geo IP Service.
Country Name:Cuba. 

From your description of this website, it looks like an error in the website's 
programming. I'm assuming the logic is: if your User Agent is not X, Y or Z, 
then serve the mobile version. This could be worth reporting.
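
To make the guess above concrete, here is a minimal Python sketch of that kind 
of server-side logic. It is not the site's real code: the function name, the 
token list and the return values are all made up for illustration.

```python
# Hypothetical sketch: NOT the site's real code, only the behaviour guessed at above.
KNOWN_DESKTOP_TOKENS = ("Chrome", "Firefox", "MSIE")  # stand-ins for "X, Y or Z"


def choose_version(user_agent):
    """Return "desktop" for recognised agents and "mobile" for everything else."""
    ua = user_agent or ""
    if any(token in ua for token in KNOWN_DESKTOP_TOKENS):
        return "desktop"
    # Fallback branch: every unrecognised agent, including a crawler, lands here.
    return "mobile"


chrome_ua = ("Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 "
             "(KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36")
print(choose_version(chrome_ua))         # desktop
print(choose_version("MyCrawler/1.0"))   # mobile
```

With logic like this, a crawler that announces its own name always falls into 
the fallback branch, which would explain getting the mobile page.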

Trying to fool the website into treating your bot as a regular user by tweaking 
the user agent could work for now, but it could also draw the webmaster's 
attention and get your access blocked; this depends a lot on the webmaster :). 
In your particular case, though, it could be your only solution, provided the 
webmaster doesn't have a problem with the increase in traffic.
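
For reference, the only thing that changes on the wire in either case is one 
request header. The sketch below is plain Python, not Nutch code (in Nutch the 
value comes from the http.agent.name property in nutch-site.xml); the bot name 
and the example.com URLs are placeholders I made up.

```python
from urllib.request import Request


def build_request(url, user_agent):
    """Build (but do not send) a GET request carrying the given User-Agent."""
    return Request(url, headers={"User-Agent": user_agent})


# A distinctive, "polite" crawler identity (placeholder values).
polite = build_request("http://example.com/",
                       "MyCrawler/1.0 (+http://example.com/bot.html)")

# The browser-like string from the question.
browser_like = build_request(
    "http://example.com/",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36")

# urllib stores header names capitalised, so the key is "User-agent".
print(polite.get_header("User-agent"))
print(browser_like.get_header("User-agent"))
```

Nothing is sent here; the two requests differ only in that header, which is 
all the server has to go on when it decides which version to serve.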

Regards,

----- Original Message -----
From: "Meraj A. Khan" <mera...@gmail.com>
To: user@nutch.apache.org
Sent: Saturday, February 28, 2015 12:09:47 AM
Subject: [MASSMAIL]Re: [MASSMAIL]How to make Nutch 1.7 request mimic a browser?

Hi Jorge,

Yes, I was exploring changing the http.agent.name property value for
cases where the sites either serve the mobile version or outright deny
the request if no agent is specified.

For example, the following URL will give a Request Rejected response if
the User-Agent is not specified.

http://www.neimanmarcus.com/Dressing-for-the-Dark-New-This-Week/prod176400153_cat46660760__/p.prod

And the following URL will serve a mobile version.

http://www.techforless.com/cgi-bin/tech4less/60PN5000.

So is it good practice to set http.agent.name to something
like the value below, to mimic a Chrome browser?

Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko)
Chrome/41.0.2228.0 Safari/537.36

On Fri, Feb 27, 2015 at 3:21 PM, Jorge Luis Betancourt González
<jlbetanco...@uci.cu> wrote:
> Hi Meraj,
>
> Can you provide an example URL and explain exactly what you're after? If the 
> page you're trying to fetch has a lot of javascript/ajax, keep in mind that 
> browsers do a lot of work with the downloaded page. For instance, when you 
> enter a page, the HTML is downloaded, and the referenced CSS files are also 
> fetched and applied to the HTML (along with inline styles, etc.). Any 
> referenced javascript is also downloaded and executed on top of the loaded DOM 
> (as are inline script tags). The same applies to fonts, etc. The browser 
> "knows" how to deal with all these resources, and the CSS is applied depending 
> on which browser you're using. The Nutch crawler only knows about the 
> downloaded HTML (similar to what you see when you view the source code of an 
> HTML webpage); it doesn't know what a CSS style is. Basically, the crawler is 
> only interested in the links and the textual/binary content of the webpage, so 
> when a page is fetched by Nutch, the HTML is downloaded but the other 
> resources (fonts, styles, javascript) are not applied to the fetched page.
>
> Tweaking the http.agent.name property in nutch-site.xml will only help with 
> those sites that change their response based on the user agent (one version 
> for mobile and a different one for desktop browsers). This approach is being 
> replaced by responsive design, where the user agent is not important for how 
> the page is rendered.
>
> In the current trunk of the upcoming 1.10 version, a plugin has been merged 
> that could address this. Basically, this plugin uses Selenium to render the 
> page and then feeds Nutch the resulting HTML, meaning that ajax/javascript 
> interactions will be present in the content that Nutch will parse in the next 
> stage.
>
> Also we need more information about your use case or what you're trying to 
> accomplish.
>
> Hope it helps,
>
> Regards,
>
> ----- Original Message -----
> From: "Meraj A. Khan" <mera...@gmail.com>
> To: user@nutch.apache.org
> Sent: Friday, February 27, 2015 12:47:06 AM
> Subject: [MASSMAIL]How to make Nutch 1.7 request mimic a browser?
>
> In some instances the content that is downloaded in the Fetch phase from an
> HTTP URL is not what you would get if you were to make the request
> from a well-known browser such as Google Chrome; that is
> because the server is expecting a user agent value that represents a
> browser.
>
> There is an http.agent.name property in nutch-site.xml. Is it the
> property that should be used to set the user agent so that the server
> responds to a Nutch GET request the same way it would to a request
> from a browser? Or is there another configurable property?
>
> For example the user agent value for a Chrome browser is below.
>
> Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko)
> Chrome/41.0.2228.0 Safari/537.36
>
>
> Thanks.
