[jira] [Commented] (NUTCH-585) [PARSE-HTML plugin] Block certain parts of HTML code from being indexed

2024-04-30 Thread Joe Gilvary (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842526#comment-17842526
 ] 

Joe Gilvary commented on NUTCH-585:
---

[~dbeckstrom] I'm not sure which patch you were asking about. I used the source 
for the new 1.20 release and applied the patch that [~ad-...@gmx.at] posted, 
after editing its line numbers for the update to src/plugin/build.xml. It 
built cleanly and seems to work exactly as advertised in my tests with 
indexchecker.

> [PARSE-HTML plugin] Block certain parts of HTML code from being indexed
> ---
>
> Key: NUTCH-585
> URL: https://issues.apache.org/jira/browse/NUTCH-585
> Project: Nutch
> Issue Type: Improvement
> Components: HTML, parse-filter, parser, plugin
> Affects Versions: 0.9.0
> Environment: All operating systems
> Reporter: Andrea Spinelli
> Assignee: Sebastian Nagel
> Priority: Major
> Fix For: 1.21
>
> Attachments: blacklist_whitelist_plugin.patch, 
> nutch-585-excludeNodes.patch, nutch-585-jostens-excludeDIVs.patch
>
>
> We are using nutch to index our own web sites; we would like not to index 
> certain parts of our pages, because we know they are not relevant (for 
> instance, there are several links to change the background color) and 
> generate spurious matches.
> We have modified the plugin so that it ignores HTML code between certain HTML 
> comments, like
> 
> ... ignored part ...
> 
> We feel this might be useful to someone else, perhaps with the comment 
> strings factored out as settings in the configuration files (say 
> parser.html.ignore.start and parser.html.ignore.stop in nutch-site.xml).
> We are almost ready to contribute our code snippet. We look forward to any 
> expression of interest, or to an explanation of why what we are doing is 
> plain wrong!
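
A minimal sketch of the idea described above, in Java since Nutch is written in Java. It is not the code from any of the attached patches. The marker strings and the class and method names are hypothetical, and the property names parser.html.ignore.start and parser.html.ignore.stop are only the ones proposed in the description. The sketch works on the raw HTML string; a real parse filter would more likely work on the parsed document.

```java
import java.util.regex.Pattern;

/** Sketch only: drops everything between two marker comments before indexing. */
final class IgnoredSections {

    // Hypothetical marker strings. In the proposal they would be read from
    // configuration (parser.html.ignore.start / parser.html.ignore.stop).
    static final String START = "<!-- ignore-start -->";
    static final String STOP = "<!-- ignore-stop -->";

    /** Removes every start...stop span, markers included. */
    static String strip(String html, String start, String stop) {
        // Pattern.quote keeps marker characters from being read as regex syntax.
        // DOTALL lets an ignored span cross line breaks, and the reluctant ".*?"
        // ends each span at the nearest stop marker.
        Pattern span = Pattern.compile(
                Pattern.quote(start) + ".*?" + Pattern.quote(stop), Pattern.DOTALL);
        return span.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String page = "<p>Article text</p>"
                + START + "<a href=\"?bg=red\">red background</a>" + STOP
                + "<p>More text</p>";
        // prints: <p>Article text</p><p>More text</p>
        System.out.println(strip(page, START, STOP));
    }
}
```

A start marker with no matching stop marker is left untouched in this sketch, so a forgotten closing comment does not silently drop the rest of the page.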



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (NUTCH-3054) Address deprecation of Node16 for all GitHub Actions

2024-04-30 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842426#comment-17842426
 ] 

Hudson commented on NUTCH-3054:
---

SUCCESS: Integrated in Jenkins build Nutch » Nutch-trunk #160 (See 
[https://ci-builds.apache.org/job/Nutch/job/Nutch-trunk/160/])
NUTCH-3054 Address deprecation of Node16 for all GitHub Actions (#817) (github: 
[https://github.com/apache/nutch/commit/7ac3ce28e065fb5160f96ce7bce1ec840f87d0dc])
* (edit) .github/workflows/master-build.yml


> Address deprecation of Node16 for all GitHub Actions
> 
>
> Key: NUTCH-3054
> URL: https://issues.apache.org/jira/browse/NUTCH-3054
> Project: Nutch
> Issue Type: Task
> Components: ci/cd
> Affects Versions: 1.20
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Minor
> Fix For: 1.21
>
>
> See 
> [https://github.blog/changelog/2023-09-22-github-actions-transitioning-from-node-16-to-node-20/]
> We need to upgrade the setup-java action in  
> [https://github.com/apache/nutch/blob/master/.github/workflows/master-build.yml]
>  
> Patch coming up





[jira] [Closed] (NUTCH-3054) Address deprecation of Node16 for all GitHub Actions

2024-04-30 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed NUTCH-3054.
---

> Address deprecation of Node16 for all GitHub Actions
> 
>
> Key: NUTCH-3054
> URL: https://issues.apache.org/jira/browse/NUTCH-3054
> Project: Nutch
> Issue Type: Task
> Components: ci/cd
> Affects Versions: 1.20
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Minor
> Fix For: 1.21
>
>
> See 
> [https://github.blog/changelog/2023-09-22-github-actions-transitioning-from-node-16-to-node-20/]
> We need to upgrade the setup-java action in  
> [https://github.com/apache/nutch/blob/master/.github/workflows/master-build.yml]
>  
> Patch coming up





[jira] [Resolved] (NUTCH-3054) Address deprecation of Node16 for all GitHub Actions

2024-04-30 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-3054.
-
Resolution: Fixed

> Address deprecation of Node16 for all GitHub Actions
> 
>
> Key: NUTCH-3054
> URL: https://issues.apache.org/jira/browse/NUTCH-3054
> Project: Nutch
> Issue Type: Task
> Components: ci/cd
> Affects Versions: 1.20
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Minor
> Fix For: 1.21
>
>
> See 
> [https://github.blog/changelog/2023-09-22-github-actions-transitioning-from-node-16-to-node-20/]
> We need to upgrade the setup-java action in  
> [https://github.com/apache/nutch/blob/master/.github/workflows/master-build.yml]
>  
> Patch coming up





[jira] [Commented] (NUTCH-3054) Address deprecation of Node16 for all GitHub Actions

2024-04-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842410#comment-17842410
 ] 

ASF GitHub Bot commented on NUTCH-3054:
---

lewismc merged PR #817:
URL: https://github.com/apache/nutch/pull/817




> Address deprecation of Node16 for all GitHub Actions
> 
>
> Key: NUTCH-3054
> URL: https://issues.apache.org/jira/browse/NUTCH-3054
> Project: Nutch
> Issue Type: Task
> Components: ci/cd
> Affects Versions: 1.20
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Minor
> Fix For: 1.21
>
>
> See 
> [https://github.blog/changelog/2023-09-22-github-actions-transitioning-from-node-16-to-node-20/]
> We need to upgrade the setup-java action in  
> [https://github.com/apache/nutch/blob/master/.github/workflows/master-build.yml]
>  
> Patch coming up





Re: [PR] NUTCH-3054 Address deprecation of Node16 for all GitHub Actions [nutch]

2024-04-30 Thread via GitHub


lewismc merged PR #817:
URL: https://github.com/apache/nutch/pull/817


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@nutch.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (NUTCH-3028) WARCExported to support filtering by JEXL

2024-04-30 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842384#comment-17842384
 ] 

Markus Jelsma commented on NUTCH-3028:
--

Ok, the Content object is now also available in the evaluation. I added an 
example of it to the description above.

 

> WARCExported to support filtering by JEXL
> -
>
> Key: NUTCH-3028
> URL: https://issues.apache.org/jira/browse/NUTCH-3028
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 1.19
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Minor
> Fix For: 1.21
>
> Attachments: NUTCH-3028-1.patch, NUTCH-3028-2.patch, NUTCH-3028.patch
>
>
> Filtering segment data to WARC is now possible using JEXL expressions. In the 
> next example, all records with SOME_KEY=SOME_VALUE in their parseData 
> metadata are exported to WARC.
> -expr 'parseData.getParseMeta().get("SOME_KEY").equals("SOME_VALUE")'
> or
> -expr 'content.getMetadata().get("SOME_KEY").equals("SOME_VALUE")'
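
The expressions above call ordinary Java methods on the record's objects. As a rough plain-Java sketch of the same check, with a java.util.Map standing in for the Nutch metadata object (the Map, the class name and the method name are assumptions made for illustration, not Nutch API):

```java
import java.util.Map;

/** Sketch only: the key/value test from the -expr examples, in plain Java. */
final class MetadataMatch {

    /**
     * Same test as get(key).equals(value), written constant-first so that a
     * record without the key yields false instead of dereferencing null.
     */
    static boolean matches(Map<String, String> meta, String key, String value) {
        return value.equals(meta.get(key));
    }

    public static void main(String[] args) {
        Map<String, String> meta = Map.of("SOME_KEY", "SOME_VALUE");
        System.out.println(matches(meta, "SOME_KEY", "SOME_VALUE"));  // prints: true
        System.out.println(matches(meta, "OTHER_KEY", "SOME_VALUE")); // prints: false
    }
}
```

In the form used in the description, get("SOME_KEY") comes first, so a record without that key leaves equals to be called on null. How the JEXL engine treats that case depends on its configuration, which this thread does not state.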





[jira] [Updated] (NUTCH-3028) WARCExported to support filtering by JEXL

2024-04-30 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-3028:
-
Attachment: NUTCH-3028-2.patch

> WARCExported to support filtering by JEXL
> -
>
> Key: NUTCH-3028
> URL: https://issues.apache.org/jira/browse/NUTCH-3028
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 1.19
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Minor
> Fix For: 1.21
>
> Attachments: NUTCH-3028-1.patch, NUTCH-3028-2.patch, NUTCH-3028.patch
>
>
> Filtering segment data to WARC is now possible using JEXL expressions. In the 
> next example, all records with SOME_KEY=SOME_VALUE in their parseData 
> metadata are exported to WARC.
> -expr 'parseData.getParseMeta().get("SOME_KEY").equals("SOME_VALUE")'
> or
> -expr 'content.getMetadata().get("SOME_KEY").equals("SOME_VALUE")'





[jira] [Updated] (NUTCH-3028) WARCExported to support filtering by JEXL

2024-04-30 Thread Markus Jelsma (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-3028:
-
Description: 
Filtering segment data to WARC is now possible using JEXL expressions. In the 
next example, all records with SOME_KEY=SOME_VALUE in their parseData metadata 
are exported to WARC.

-expr 'parseData.getParseMeta().get("SOME_KEY").equals("SOME_VALUE")'

or

-expr 'content.getMetadata().get("SOME_KEY").equals("SOME_VALUE")'

  was:
Filtering segment data to WARC is now possible using JEXL expressions. In the 
next example, all records with SOME_KEY=SOME_VALUE in their parseData metadata 
are exported to WARC.

-expr 'parseData.getParseMeta().get("SOME_KEY").equals("SOME_VALUE")'


> WARCExported to support filtering by JEXL
> -
>
> Key: NUTCH-3028
> URL: https://issues.apache.org/jira/browse/NUTCH-3028
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 1.19
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Minor
> Fix For: 1.21
>
> Attachments: NUTCH-3028-1.patch, NUTCH-3028.patch
>
>
> Filtering segment data to WARC is now possible using JEXL expressions. In the 
> next example, all records with SOME_KEY=SOME_VALUE in their parseData 
> metadata are exported to WARC.
> -expr 'parseData.getParseMeta().get("SOME_KEY").equals("SOME_VALUE")'
> or
> -expr 'content.getMetadata().get("SOME_KEY").equals("SOME_VALUE")'





[jira] [Commented] (NUTCH-3055) README: fix Github "hub" commands

2024-04-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842308#comment-17842308
 ] 

ASF GitHub Bot commented on NUTCH-3055:
---

sebastian-nagel opened a new pull request, #818:
URL: https://github.com/apache/nutch/pull/818

   (no comment)




> README: fix Github "hub" commands
> -
>
> Key: NUTCH-3055
> URL: https://issues.apache.org/jira/browse/NUTCH-3055
> Project: Nutch
> Issue Type: Bug
> Components: documentation
> Affects Versions: 1.20
> Reporter: Sebastian Nagel
> Assignee: Sebastian Nagel
> Priority: Trivial
> Fix For: 1.21
>
>
> The [README.md|https://github.com/apache/nutch/blob/master/README.md] 
> contains [Github hub|https://hub.github.com/] commands but with "git" as 
> command (executable) name, maybe an alias or some other magic. However, if 
> hub isn't installed, these commands fail with {{git: 'pull-request' is not a 
> git command. See 'git --help'.}} or similar.
> We should use the command "hub" instead.





[jira] [Created] (NUTCH-3055) README: fix Github "hub" commands

2024-04-30 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-3055:
--

Summary: README: fix Github "hub" commands
Key: NUTCH-3055
URL: https://issues.apache.org/jira/browse/NUTCH-3055
Project: Nutch
Issue Type: Bug
Components: documentation
Affects Versions: 1.20
Reporter: Sebastian Nagel
Assignee: Sebastian Nagel
Fix For: 1.21


The [README.md|https://github.com/apache/nutch/blob/master/README.md] contains 
[Github hub|https://hub.github.com/] commands but with "git" as command 
(executable) name, maybe an alias or some other magic. However, if hub isn't 
installed, these commands fail with {{git: 'pull-request' is not a git command. 
See 'git --help'.}} or similar.

We should use the command "hub" instead.





[jira] [Commented] (NUTCH-3028) WARCExported to support filtering by JEXL

2024-04-30 Thread Sebastian Nagel (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842291#comment-17842291
 ] 

Sebastian Nagel commented on NUTCH-3028:


+1 lgtm.

One question: if there is no parseData, the JEXL expression is not evaluated. 
Since WARC files may include only the raw HTML plus fetch/capture metadata, 
successfully parsing a document is not a requirement to archive it in a WARC 
file. Might be useful to have the JEXL filtering also available for unparsed 
docs.
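
One way the suggestion above could look, as a hedged Java sketch rather than a description of the attached patches: when parse metadata is missing, substitute an empty map so the expression is still evaluated and can match on content metadata alone. The class name, the method name and the use of java.util.Map and BiPredicate in place of Nutch and JEXL types are all assumptions made for illustration.

```java
import java.util.Map;
import java.util.function.BiPredicate;

/** Sketch only: evaluates a filter expression even for unparsed documents. */
final class ExportFilter {

    /**
     * Returns true if the record should be exported.
     * parseMeta may be null when the document was never parsed.
     */
    static boolean shouldExport(
            Map<String, String> parseMeta,
            Map<String, String> contentMeta,
            BiPredicate<Map<String, String>, Map<String, String>> expr) {
        // An empty map stands in for absent parse metadata, so expressions that
        // only look at content metadata still get a chance to match.
        Map<String, String> safeParseMeta =
                (parseMeta == null) ? Map.<String, String>of() : parseMeta;
        return expr.test(safeParseMeta, contentMeta);
    }

    public static void main(String[] args) {
        Map<String, String> contentMeta = Map.of("SOME_KEY", "SOME_VALUE");
        // Unparsed document (parseMeta == null), expression on content metadata.
        System.out.println(shouldExport(null, contentMeta,
                (parse, content) -> "SOME_VALUE".equals(content.get("SOME_KEY")))); // prints: true
    }
}
```

With this policy an expression that refers only to parse metadata simply does not match an unparsed document, while one that refers to content metadata can still select it.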

> WARCExported to support filtering by JEXL
> -
>
> Key: NUTCH-3028
> URL: https://issues.apache.org/jira/browse/NUTCH-3028
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 1.19
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Minor
> Fix For: 1.21
>
> Attachments: NUTCH-3028-1.patch, NUTCH-3028.patch
>
>
> Filtering segment data to WARC is now possible using JEXL expressions. In the 
> next example, all records with SOME_KEY=SOME_VALUE in their parseData 
> metadata are exported to WARC.
> -expr 'parseData.getParseMeta().get("SOME_KEY").equals("SOME_VALUE")'





[jira] [Commented] (NUTCH-3045) Upgrade from Java 11 to 17

2024-04-30 Thread Sebastian Nagel (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842284#comment-17842284
 ] 

Sebastian Nagel commented on NUTCH-3045:


See also NUTCH-2987. Until HADOOP-17177 / HADOOP-18887 are done, we might be 
forced to maintain JDK 11 runtime compatibility, so that Nutch runs on recent 
Hadoop versions and distributions. I fully agree that Java 17 offers some nice 
syntax improvements, though. :)

> Upgrade from Java 11 to 17
> --
>
> Key: NUTCH-3045
> URL: https://issues.apache.org/jira/browse/NUTCH-3045
> Project: Nutch
> Issue Type: Task
> Components: build, ci/cd
> Reporter: Lewis John McGibbney
> Priority: Critical
> Fix For: 1.21
>
>
> This parent issue will track and organize work pertaining to upgrading Nutch 
> to JDK 17.
> Premier support for Oracle JDK 11 ended 7 months ago (30 Sep 2023).


