Re: Inquiries on potential improvements

2023-01-21 Thread Sebastian Nagel

Hi Kamil,

> 1. json-indexer: indexes documents in json lines format

Sounds good. There's already an indexer-csv (works only in local mode).
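
For reference, JSON Lines means one complete JSON object per line, with any newline inside a value escaped. Below is a rough sketch of the serialization step only. It is not a Nutch index writer plugin: the names `JsonLinesSketch` and `toJsonLine` are hypothetical, and it assumes flat documents whose field values are all strings.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

final class JsonLinesSketch {

  // Escape the characters that JSON does not allow unescaped inside a string.
  static String escape(String s) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      switch (c) {
        case '"': sb.append("\\\""); break;
        case '\\': sb.append("\\\\"); break;
        case '\n': sb.append("\\n"); break;
        case '\r': sb.append("\\r"); break;
        case '\t': sb.append("\\t"); break;
        default:
          if (c < 0x20) {
            // Remaining control characters become a four-digit hex escape.
            sb.append(String.format("\\u%04x", (int) c));
          } else {
            sb.append(c);
          }
      }
    }
    return sb.toString();
  }

  // Serialize one document as a single line (assumption: string fields only).
  static String toJsonLine(Map<String, String> doc) {
    StringJoiner line = new StringJoiner(",", "{", "}");
    for (Map.Entry<String, String> field : doc.entrySet()) {
      line.add("\"" + escape(field.getKey()) + "\":\""
          + escape(field.getValue()) + "\"");
    }
    return line.toString();
  }

  public static void main(String[] args) {
    Map<String, String> doc = new LinkedHashMap<>();
    doc.put("id", "https://example.com/");
    doc.put("title", "Example page");
    System.out.println(toJsonLine(doc));
  }
}
```

A real plugin would more likely use a JSON library than hand-written escaping; the point here is only that each document ends up on exactly one line.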

> 2. selenium extracts the html tag vs the body tag

Definitely makes sense.

> I am hesitant about this change because it could have bigger effects.

If in doubt, we could add a new method and make it configurable whether the
inner or outer HTML is returned.


> 3. Add ability to extract meta tags with "property" attribute

+1

(This should also be applied to parse-tika: there's some duplicated code
between parse-html and parse-tika.)
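
To make the idea concrete, here is a minimal sketch of collecting `<meta>` tags keyed by either the `name` or the `property` attribute. It is not the actual parse-html or parse-tika code: `MetaTagSketch` and `extractMetaTags` are hypothetical names, and it assumes well-formed XHTML input so that the JDK's XML parser can be used.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class MetaTagSketch {

  // Returns meta tag contents keyed by lower-cased "name" or "property".
  static Map<String, String> extractMetaTags(String xhtml) {
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      // Refuse DOCTYPE declarations so no external DTD is ever fetched.
      factory.setFeature(
          "http://apache.org/xml/features/disallow-doctype-decl", true);
      Document doc = factory.newDocumentBuilder()
          .parse(new InputSource(new StringReader(xhtml)));

      Map<String, String> tags = new LinkedHashMap<>();
      NodeList metas = doc.getElementsByTagName("meta");
      for (int i = 0; i < metas.getLength(); i++) {
        Element meta = (Element) metas.item(i);
        // getAttribute returns "" when the attribute is absent.
        String key = meta.getAttribute("name");
        if (key.isEmpty()) {
          key = meta.getAttribute("property");
        }
        if (!key.isEmpty() && meta.hasAttribute("content")) {
          tags.put(key.toLowerCase(Locale.ROOT), meta.getAttribute("content"));
        }
      }
      return tags;
    } catch (Exception e) {
      throw new IllegalArgumentException("Input is not well-formed XHTML", e);
    }
  }

  public static void main(String[] args) {
    String page = "<html><head>"
        + "<meta name=\"description\" content=\"A page\"/>"
        + "<meta property=\"og:title\" content=\"Hello\"/>"
        + "</head><body/></html>";
    System.out.println(extractMetaTags(page));
  }
}
```

Falling back from `name` to `property` keeps existing behaviour for tags that already have a `name`, which is one possible way to limit side effects.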


> 4. Allow selenium to handle gzip content

Definitely. However, I wonder whether this couldn't be delegated to one of the 
other HTTP protocol plugins. But I agree this might be tricky:

- calling a plugin from another one
- maybe better done the way discussed on the user list:
  send a HEAD request and decide which way to go based on the Content-Type
  response header
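
The two ideas above can be sketched roughly as follows: un-gzip the content only when it starts with the gzip magic bytes, and pick the fetch path from the Content-Type value that a HEAD request would return. This is a sketch under assumptions, not the protocol-selenium or HTMLUnit code. All names (`GzipSketch`, `gunzipIfNeeded`, `shouldRenderInBrowser`) are hypothetical, and both the list of browser-rendered types and the handling of a missing header are assumptions.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class GzipSketch {

  // A gzip stream starts with the two magic bytes 0x1f 0x8b.
  static boolean looksGzipped(byte[] data) {
    return data.length >= 2
        && (data[0] & 0xff) == 0x1f
        && (data[1] & 0xff) == 0x8b;
  }

  // Decompress gzip content; return anything else unchanged.
  static byte[] gunzipIfNeeded(byte[] data) {
    if (!looksGzipped(data)) {
      return data;
    }
    try (GZIPInputStream in =
        new GZIPInputStream(new ByteArrayInputStream(data))) {
      return in.readAllBytes();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Decide from a Content-Type header value whether to use the browser.
  static boolean shouldRenderInBrowser(String contentType) {
    if (contentType == null) {
      return true; // assumption: treat a missing header as HTML
    }
    String type =
        contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
    return type.equals("text/html") || type.equals("application/xhtml+xml");
  }

  // Helper used only to build gzip test data for the demo.
  static byte[] gzip(byte[] data) {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
      out.write(data);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return buffer.toByteArray();
  }

  public static void main(String[] args) {
    byte[] packed = gzip("<rss/>".getBytes(StandardCharsets.UTF_8));
    System.out.println(
        new String(gunzipIfNeeded(packed), StandardCharsets.UTF_8));
    System.out.println(shouldRenderInBrowser("application/rss+xml"));
  }
}
```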

> 5. Treat RSS feeds as normal webpages by adding links to next segment fetch

This is actually done if you let parse-tika parse the feeds. The "feed" plugin 
is very special in this respect. It takes every feed item as one document and 
forwards it to the index. This is a different use case. I'm open to 
discussion, however.
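
To illustrate the "feed as a normal webpage" use case, the sketch below pulls the item links out of an RSS document so they could be treated as outlinks for the next fetch. It does not reproduce what parse-tika or the "feed" plugin do internally: `FeedLinksSketch` and `extractItemLinks` are hypothetical names, and it assumes an RSS 2.0 layout in which each `<item>` carries a `<link>` element.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class FeedLinksSketch {

  // Collect the first <link> of every <item>; the channel link is skipped.
  static List<String> extractItemLinks(String rssXml) {
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      // Refuse DOCTYPE declarations so no external DTD is ever fetched.
      factory.setFeature(
          "http://apache.org/xml/features/disallow-doctype-decl", true);
      Document doc = factory.newDocumentBuilder()
          .parse(new InputSource(new StringReader(rssXml)));

      List<String> links = new ArrayList<>();
      NodeList items = doc.getElementsByTagName("item");
      for (int i = 0; i < items.getLength(); i++) {
        Element item = (Element) items.item(i);
        NodeList itemLinks = item.getElementsByTagName("link");
        if (itemLinks.getLength() > 0) {
          String url = itemLinks.item(0).getTextContent().trim();
          if (!url.isEmpty()) {
            links.add(url);
          }
        }
      }
      return links;
    } catch (Exception e) {
      throw new IllegalArgumentException("Input is not well-formed RSS", e);
    }
  }

  public static void main(String[] args) {
    String feed = "<rss version=\"2.0\"><channel>"
        + "<title>Example</title><link>https://example.com/</link>"
        + "<item><title>A</title><link>https://example.com/a</link></item>"
        + "<item><title>B</title><link>https://example.com/b</link></item>"
        + "</channel></rss>";
    for (String url : extractItemLinks(feed)) {
      System.out.println(url);
    }
  }
}
```

Here the feed itself is not indexed item by item; only the links are carried forward, which is the difference from the "feed" plugin described above.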


Thanks for your contributions!

Best,
Sebastian


Inquiries on potential improvements

2023-01-20 Thread Kamil Mroczek
Hello,

I have a few improvements to Nutch and would like feedback on whether this
community thinks I should submit them to the main branch. Once I get my first
PR approved I can start to add these. Some of these might not be good ideas,
so I am happy to hear that feedback as well.

1. json-indexer: indexes documents in json lines format

2. selenium extracts the html tag vs the body tag (sample commit):
I needed this to extract the title of the page since that often lives in
the head tag. I am hesitant about this change because it could have bigger
effects.

3. Add ability to extract meta tags with "property" attribute (sample commit).

4. Allow selenium to handle gzip content (sample commit):
This is a port of the code from HTMLUnit that does the same thing. I needed
this to process RSS feeds properly.

5. Treat RSS feeds as normal webpages by adding links to next segment fetch
(sample commit)


Kamil