Inquiries on potential improvements

2023-01-20 Thread Kamil Mroczek
Hello,

I have a few improvements to Nutch and would like feedback on whether this
community thinks I should submit them to the main branch. Once my first PR
is approved I can start adding these. Some of them may not be good ideas,
so I am happy to hear that feedback too.

1. json-indexer: indexes documents in JSON Lines format

2. selenium extracts the html tag instead of the body tag (sample commit):
I needed this to extract the title of the page, since that often lives in
the head tag. I am hesitant about this change because it could have wider
effects.

3. Add ability to extract meta tags with a "property" attribute (sample commit).

4. Allow selenium to handle gzip content (sample commit):
This is a port of the code from HTMLUnit that does the same thing. I needed
this to process RSS feeds properly.

5. Treat RSS feeds as normal webpages by adding links to next segment fetch
(sample commit)
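
As an illustration of item 4, here is a minimal sketch of transparent gzip handling. It is written in Python for brevity and is not the code from the sample commit or from HTMLUnit. The function name maybe_gunzip and the choice to detect gzip by its two-byte magic number (rather than by a Content-Encoding header) are assumptions made for this example.

```python
import gzip

# The gzip format (RFC 1952) starts every member with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def maybe_gunzip(body: bytes) -> bytes:
    """Return the decompressed body if it looks like gzip, else the body unchanged.

    Hypothetical helper: detection by magic number is an assumption of this
    sketch, not necessarily how the actual patch decides.
    """
    if body[:2] == GZIP_MAGIC:
        return gzip.decompress(body)
    return body
```

With a helper like this, a compressed RSS feed and an uncompressed one would both reach the parser as plain bytes.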


Kamil
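
For readers unfamiliar with the format named in item 1: JSON Lines stores one JSON object per line. The sketch below, in Python, only illustrates the output shape; the function name to_json_lines and the example fields (url, title) are hypothetical and are not taken from the proposed json-indexer.

```python
import json
from typing import Iterable, Mapping


def to_json_lines(docs: Iterable[Mapping[str, object]]) -> str:
    """Serialize each document as one JSON object on its own line.

    json.dumps escapes newlines inside strings, so every document
    occupies exactly one line of output.
    """
    return "".join(
        json.dumps(doc, ensure_ascii=False, sort_keys=True) + "\n"
        for doc in docs
    )
```

Because each line is an independent JSON value, such a file can be appended to and processed line by line without loading it whole.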


[jira] [Updated] (NUTCH-2980) Upgrade Selenium Java to 4.7.2

2023-01-19 Thread Kamil Mroczek (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kamil Mroczek updated NUTCH-2980:
-
Description: 
The Selenium version is quite old and had some issues processing a website. Once I 
switched to the latest version I was able to scrape that website. It is good to keep 
it up to date, since we were already one major release behind.

Upgrading Selenium Java from 3.141.59 to 4.7.2 and Selenium HTMLUnit from 
2.35.1 to 4.7.0.

  was:
The Selenium version is quite old and had some issues processing a website. Once I 
switched to the latest version I was able to scrape that website. It is good to keep 
it up to date, since we were already one major release behind.

Upgrading Selenium Java from 3.141.59 to 4.7.2 and Selenium HTMLUnit from 
2.35.1 to 4.7.0.

 

We were running a 


> Upgrade Selenium Java to 4.7.2
> --
>
> Key: NUTCH-2980
> URL: https://issues.apache.org/jira/browse/NUTCH-2980
> Project: Nutch
> Issue Type: Improvement
> Components: plugin, protocol
> Affects Versions: 1.19
> Reporter: Kamil Mroczek
> Priority: Major
> Fix For: 1.20
>
>
> The Selenium version is quite old and had some issues processing a website. Once 
> I switched to the latest version I was able to scrape that website. It is good to 
> keep it up to date, since we were already one major release behind.
> Upgrading Selenium Java from 3.141.59 to 4.7.2 and Selenium HTMLUnit from 
> 2.35.1 to 4.7.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (NUTCH-2980) Upgrade Selenium Java to 4.7.2

2023-01-19 Thread Kamil Mroczek (Jira)
Kamil Mroczek created NUTCH-2980:


 Summary: Upgrade Selenium Java to 4.7.2
 Key: NUTCH-2980
 URL: https://issues.apache.org/jira/browse/NUTCH-2980
 Project: Nutch
 Issue Type: Improvement
 Components: plugin, protocol
 Affects Versions: 1.19
 Reporter: Kamil Mroczek
 Fix For: 1.20


The Selenium version is quite old and had some issues processing a website. Once I 
switched to the latest version I was able to scrape that website. It is good to keep 
it up to date, since we were already one major release behind.

Upgrading Selenium Java from 3.141.59 to 4.7.2 and Selenium HTMLUnit from 
2.35.1 to 4.7.0.

 

We were running a 





Re: Upgrading Selenium

2023-01-18 Thread Kamil Mroczek
Thanks Markus. Let me submit the upgrade first to get my first commit in
and then go from there. That optimization of reducing the number of HTTP
requests will be useful, so I will look into that.

On Tue, Jan 17, 2023 at 1:56 PM Markus Jelsma wrote:

> Hello Kamil,
>
> Yes, the plugin needs some upgrading indeed. We use a modern version of it
> elsewhere and it works really well, at least better than HtmlUnit.
>
> Besides that, the plugin also needs some overhaul. It currently first
> downloads the URL with HttpClient, and then, depending on MIME-type, it may
> or may not forward the URL to Selenium so it can be downloaded again.
>
> There is a lot of code in the plugin that should be removed. I would also
> opt for merging the lib-selenium plugin with the protocol-selenium plugin.
> There is no obvious need for having it separated.
>
> These can be, of course, separate tasks.
>
> Regards,
> Markus
>
> On Tue, Jan 17, 2023 at 17:49, Kamil Mroczek wrote:
>
>> Hello,
>>
>> I am sending a message to inquire whether I should submit a patch which
>> updates selenium to the latest version. Although it is a major version
>> upgrade to the library, very few code changes were needed to update.
>>
>> For a preview of the changes I made you can look here
>> <https://github.com/Elio-Earth/nutch/commit/9960f14bce0f0d6cebc406556a298a7c8c2e6b9f>.
>> Although not used in the code anymore (it was commented out), PhantomJS
>> support has been removed from Selenium in the latest version. The commit
>> also removes Opera since it was commented out but I can leave that in if
>> needed. The build and tests pass. I have been using the Chrome driver
>> successfully with it and would just need to run a quick test with Firefox
>> to make sure it works too.
>>
>> I have only been using Nutch for about a month but have spent quite a bit
>> of time looking over different parts of the code to understand how to
>> configure it and change it.
>>
>> Kamil
>>
>
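
The double download Markus describes can be sketched as follows. This is a Python illustration of the control flow only, not the plugin's actual Java code. The names fetch_current, http_get and browser_get are hypothetical, and the set of MIME types that get forwarded to the browser is an assumption made for the example.

```python
from typing import Callable, Tuple

# Assumption for this sketch: only HTML-like responses go to the browser.
BROWSER_MIME_TYPES = {"text/html", "application/xhtml+xml"}


def fetch_current(
    url: str,
    http_get: Callable[[str], Tuple[str, bytes]],
    browser_get: Callable[[str], bytes],
) -> bytes:
    """Flow as described in the thread: plain HTTP fetch first, then,
    depending on the MIME type, a second fetch of the same URL in the browser.
    """
    mime_type, body = http_get(url)   # first request
    if mime_type in BROWSER_MIME_TYPES:
        return browser_get(url)       # second request for the same URL
    return body                       # non-HTML content is used as fetched
```

In this flow every HTML page costs two requests, which is the overhead the proposed overhaul would remove.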


Upgrading Selenium

2023-01-17 Thread Kamil Mroczek
Hello,

I am sending a message to inquire whether I should submit a patch which
updates selenium to the latest version. Although it is a major version
upgrade to the library, very few code changes were needed to update.

For a preview of the changes I made you can look here
<https://github.com/Elio-Earth/nutch/commit/9960f14bce0f0d6cebc406556a298a7c8c2e6b9f>.
Although not used in the code anymore (it was commented out), PhantomJS
support has been removed from Selenium in the latest version. The commit
also removes Opera since it was commented out but I can leave that in if
needed. The build and tests pass. I have been using the Chrome driver
successfully with it and would just need to run a quick test with Firefox
to make sure it works too.

I have only been using Nutch for about a month but have spent quite a bit
of time looking over different parts of the code to understand how to
configure it and change it.

Kamil