[jira] [Updated] (NUTCH-3001) protocol-selenium requires Content-Type header

Tim Allison (Jira) Wed, 13 Sep 2023 07:02:15 -0700


     [ 
https://issues.apache.org/jira/browse/NUTCH-3001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Tim Allison updated NUTCH-3001:
-------------------------------
    Description: 
It looks like the selenium protocol requires that there be a content-type 
header. 

The logic seems to be: If the content type is html or xhtml, use selenium, 
otherwise just grab the bytes.  

However, with the current logic, if the content-type is null, nothing is 
pulled.  

My guess is that the logic should be: if the content type is not null and 
is html or xhtml, use selenium; otherwise, grab the bytes.

Right?

{noformat}
      String contentType = getHeader(Response.CONTENT_TYPE);

      // handle with Selenium only if content type in HTML or XHTML
      if (contentType != null) {
         if (contentType.contains("text/html")
            || contentType.contains("application/xhtml")) {
               readPlainContent(url);
         } else {
...
{noformat}
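
For illustration, here is a minimal standalone sketch of the proposed decision, assuming the only inputs are the header value and the two substrings checked in the snippet above. The class name, the FetchPath enum, and choosePath are hypothetical and are not part of Nutch; the sketch only shows that a null Content-Type falls through to the plain-bytes path instead of being skipped.

```java
class ContentTypeRouting {

  /** Hypothetical labels for the two fetch paths; not Nutch identifiers. */
  enum FetchPath { SELENIUM, PLAIN_BYTES }

  /**
   * Chooses Selenium only when a Content-Type is present and looks like
   * HTML or XHTML. A missing header, or any other type, gets plain bytes.
   */
  static FetchPath choosePath(String contentType) {
    if (contentType != null
        && (contentType.contains("text/html")
            || contentType.contains("application/xhtml"))) {
      return FetchPath.SELENIUM;
    }
    // Assumption: with no header we cannot tell the type, so fetch raw bytes.
    return FetchPath.PLAIN_BYTES;
  }

  public static void main(String[] args) {
    System.out.println(choosePath("text/html; charset=UTF-8")); // SELENIUM
    System.out.println(choosePath("application/pdf"));          // PLAIN_BYTES
    System.out.println(choosePath(null));                       // PLAIN_BYTES
  }
}
```

Folding the null check into the same condition leaves a single fallback branch, so a response without the header can no longer end up with nothing pulled.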

  was:
It looks like the selenium protocol requires that there be a content-type 
header. 

The logic seems to be: If the content type is html or xhtml, use selenium, 
otherwise just grab the bytes.  

However, with the current logic, if the content-type is null, nothing is 
pulled.  

My guess is that the logic should be: if the content type is not null and 
is html or xhtml, use selenium; otherwise, grab the bytes.

Right?

{noformat}
      String contentType = getHeader(Response.CONTENT_TYPE);

      // handle with Selenium only if content type in HTML or XHTML
      if (contentType != null) {
{noformat}


> protocol-selenium requires Content-Type header 
> -----------------------------------------------
>
>                 Key: NUTCH-3001
>                 URL: https://issues.apache.org/jira/browse/NUTCH-3001
>             Project: Nutch
>          Issue Type: Bug
>            Reporter: Tim Allison
>            Priority: Major
>
> It looks like the selenium protocol requires that there be a content-type 
> header. 
> The logic seems to be: If the content type is html or xhtml, use selenium, 
> otherwise just grab the bytes.  
> However, with the current logic, if the content-type is null, nothing is 
> pulled.  
> My guess is that the logic should be: if the content type is not null and 
> is html or xhtml, use selenium; otherwise, grab the bytes.
> Right?
> {noformat}
>       String contentType = getHeader(Response.CONTENT_TYPE);
>       // handle with Selenium only if content type in HTML or XHTML
>       if (contentType != null) {
>          if (contentType.contains("text/html")
>             || contentType.contains("application/xhtml")) {
>                readPlainContent(url);
>          } else {
> ...
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
