[jira] [Commented] (NUTCH-585) [PARSE-HTML plugin] Block certain parts of HTML code from being indexed
[ https://issues.apache.org/jira/browse/NUTCH-585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842526#comment-17842526 ]

Joe Gilvary commented on NUTCH-585:
---

[~dbeckstrom] I'm not sure which patch you were asking about. I took the source for the new 1.20 release and applied the patch that [~ad-...@gmx.at] posted, after editing the patch's line numbers for its update to src/plugin/build.xml. It built cleanly and seems to work exactly as advertised in my tests with indexchecker.

> [PARSE-HTML plugin] Block certain parts of HTML code from being indexed
> ---
>
> Key: NUTCH-585
> URL: https://issues.apache.org/jira/browse/NUTCH-585
> Project: Nutch
> Issue Type: Improvement
> Components: HTML, parse-filter, parser, plugin
> Affects Versions: 0.9.0
> Environment: All operating systems
> Reporter: Andrea Spinelli
> Assignee: Sebastian Nagel
> Priority: Major
> Fix For: 1.21
>
> Attachments: blacklist_whitelist_plugin.patch, nutch-585-excludeNodes.patch, nutch-585-jostens-excludeDIVs.patch
>
>
> We are using nutch to index our own web sites; we would like not to index certain parts of our pages, because we know they are not relevant (for instance, there are several links to change the background color) and generate spurious matches.
> We have modified the plugin so that it ignores HTML code between certain HTML comments, like
>
> ... ignored part ...
>
> We feel this might be useful to someone else, maybe factoring out the comment strings as constants in the configuration files (say parser.html.ignore.start and parser.html.ignore.stop in nutch-site.xml).
> We are almost ready to contribute our code snippet. Looking forward to any expression of interest - or to an explanation of why what we are doing is plain wrong!

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
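The idea described in the issue (dropping everything between a start marker and a stop marker before indexing) can be sketched in a few lines of plain Java. This is only an illustration, not the code from any of the attached patches: the class name, the method name and the two marker strings are hypothetical, and a real plugin would presumably read the markers from the proposed parser.html.ignore.start / parser.html.ignore.stop properties rather than hard-code them.

```java
import java.util.regex.Pattern;

class IgnoreRegionSketch {
    // Hypothetical marker strings, standing in for the values the issue
    // proposes to configure via parser.html.ignore.start / .stop.
    static final String START = "<!-- ignore-start -->";
    static final String STOP = "<!-- ignore-stop -->";

    /** Removes every region from a start marker through the next stop marker, markers included. */
    static String stripIgnored(String html, String start, String stop) {
        // Quote the markers so they are matched literally; DOTALL lets the
        // non-greedy ".*?" span line breaks inside an ignored region.
        Pattern region = Pattern.compile(
                Pattern.quote(start) + ".*?" + Pattern.quote(stop), Pattern.DOTALL);
        return region.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<p>keep</p>" + START
                + "<a href=\"#\">change background</a>" + STOP
                + "<p>also keep</p>";
        // prints: <p>keep</p><p>also keep</p>
        System.out.println(stripIgnored(html, START, STOP));
    }
}
```

In this sketch a start marker without a matching stop marker is left untouched, so a forgotten closing comment cannot silently hide the rest of a page; whether the actual patches behave that way is not stated in the thread.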
[jira] [Commented] (NUTCH-3054) Address deprecation of Node16 for all GitHub Actions
[ https://issues.apache.org/jira/browse/NUTCH-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842426#comment-17842426 ]

Hudson commented on NUTCH-3054:
---

SUCCESS: Integrated in Jenkins build Nutch » Nutch-trunk #160 (See [https://ci-builds.apache.org/job/Nutch/job/Nutch-trunk/160/])
NUTCH-3054 Address deprecation of Node16 for all GitHub Actions (#817) (github: [https://github.com/apache/nutch/commit/7ac3ce28e065fb5160f96ce7bce1ec840f87d0dc])
* (edit) .github/workflows/master-build.yml

> Address deprecation of Node16 for all GitHub Actions
> ---
>
> Key: NUTCH-3054
> URL: https://issues.apache.org/jira/browse/NUTCH-3054
> Project: Nutch
> Issue Type: Task
> Components: ci/cd
> Affects Versions: 1.20
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Minor
> Fix For: 1.21
>
>
> See [https://github.blog/changelog/2023-09-22-github-actions-transitioning-from-node-16-to-node-20/]
> We need to upgrade the setup-java action in [https://github.com/apache/nutch/blob/master/.github/workflows/master-build.yml]
>
> Patch coming up
[jira] [Closed] (NUTCH-3054) Address deprecation of Node16 for all GitHub Actions
[ https://issues.apache.org/jira/browse/NUTCH-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney closed NUTCH-3054.
---
[jira] [Resolved] (NUTCH-3054) Address deprecation of Node16 for all GitHub Actions
[ https://issues.apache.org/jira/browse/NUTCH-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney resolved NUTCH-3054.
---
Resolution: Fixed
[jira] [Commented] (NUTCH-3054) Address deprecation of Node16 for all GitHub Actions
[ https://issues.apache.org/jira/browse/NUTCH-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842410#comment-17842410 ]

ASF GitHub Bot commented on NUTCH-3054:
---

lewismc merged PR #817:
URL: https://github.com/apache/nutch/pull/817
Re: [PR] NUTCH-3054 Address deprecation of Node16 for all GitHub Actions [nutch]
lewismc merged PR #817: URL: https://github.com/apache/nutch/pull/817 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@nutch.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (NUTCH-3028) WARCExported to support filtering by JEXL
[ https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842384#comment-17842384 ]

Markus Jelsma commented on NUTCH-3028:
---

Ok, the Content object is now also available in the evaluation. I added an example of it to the description above.

> WARCExported to support filtering by JEXL
> ---
>
> Key: NUTCH-3028
> URL: https://issues.apache.org/jira/browse/NUTCH-3028
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 1.19
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Minor
> Fix For: 1.21
>
> Attachments: NUTCH-3028-1.patch, NUTCH-3028-2.patch, NUTCH-3028.patch
>
>
> Filtering segment data to WARC is now possible using JEXL expressions. In the next example, all records with SOME_KEY=SOME_VALUE in their parseData metadata are exported to WARC.
> -expr 'parseData.getParseMeta().get("SOME_KEY").equals("SOME_VALUE")'
> or
> -expr 'content.getMetadata().get("SOME_KEY").equals("SOME_VALUE")'
[jira] [Updated] (NUTCH-3028) WARCExported to support filtering by JEXL
[ https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-3028:
---
Attachment: NUTCH-3028-2.patch
[jira] [Updated] (NUTCH-3028) WARCExported to support filtering by JEXL
[ https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-3028:
---
Description:
Filtering segment data to WARC is now possible using JEXL expressions. In the next example, all records with SOME_KEY=SOME_VALUE in their parseData metadata are exported to WARC.

-expr 'parseData.getParseMeta().get("SOME_KEY").equals("SOME_VALUE")'

or

-expr 'content.getMetadata().get("SOME_KEY").equals("SOME_VALUE")'

was:
Filtering segment data to WARC is now possible using JEXL expressions. In the next example, all records with SOME_KEY=SOME_VALUE in their parseData metadata are exported to WARC.

-expr 'parseData.getParseMeta().get("SOME_KEY").equals("SOME_VALUE")'

> WARCExported to support filtering by JEXL
> ---
>
> Key: NUTCH-3028
> URL: https://issues.apache.org/jira/browse/NUTCH-3028
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 1.19
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Minor
> Fix For: 1.21
>
> Attachments: NUTCH-3028-1.patch, NUTCH-3028.patch
>
>
> Filtering segment data to WARC is now possible using JEXL expressions. In the next example, all records with SOME_KEY=SOME_VALUE in their parseData metadata are exported to WARC.
> -expr 'parseData.getParseMeta().get("SOME_KEY").equals("SOME_VALUE")'
> or
> -expr 'content.getMetadata().get("SOME_KEY").equals("SOME_VALUE")'
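As a rough illustration of what the two -expr examples test, here is a plain-Java analogue that keeps only records whose metadata maps SOME_KEY to SOME_VALUE. It deliberately uses neither JEXL nor Nutch's ParseData/Content classes: the metadata is modelled as an ordinary Map, and the class and method names are hypothetical. The null-safe comparison (a record without the key is simply not selected) is an assumption of this sketch; the issue does not say how the exporter treats a missing key.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.stream.Collectors;

class MetadataFilterSketch {
    /** Null-safe counterpart of metadata.get(key).equals(value). */
    static Predicate<Map<String, String>> hasMeta(String key, String value) {
        // Calling equals on the expected value avoids a NullPointerException
        // when the key is absent or the record has no metadata at all.
        return meta -> meta != null && value.equals(meta.get(key));
    }

    /** Returns only the records that satisfy the filter, in their original order. */
    static List<Map<String, String>> select(List<Map<String, String>> records,
                                            Predicate<Map<String, String>> filter) {
        return records.stream().filter(filter).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Map<String, String>> records = List.of(
                Map.of("SOME_KEY", "SOME_VALUE"),   // matches
                Map.of("SOME_KEY", "other"),        // key present, wrong value
                Map.of());                          // key missing
        // prints: 1
        System.out.println(select(records, hasMeta("SOME_KEY", "SOME_VALUE")).size());
    }
}
```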
[jira] [Commented] (NUTCH-3055) README: fix Github "hub" commands
[ https://issues.apache.org/jira/browse/NUTCH-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842308#comment-17842308 ]

ASF GitHub Bot commented on NUTCH-3055:
---

sebastian-nagel opened a new pull request, #818:
URL: https://github.com/apache/nutch/pull/818

(no comment)

> README: fix Github "hub" commands
> ---
>
> Key: NUTCH-3055
> URL: https://issues.apache.org/jira/browse/NUTCH-3055
> Project: Nutch
> Issue Type: Bug
> Components: documentation
> Affects Versions: 1.20
> Reporter: Sebastian Nagel
> Assignee: Sebastian Nagel
> Priority: Trivial
> Fix For: 1.21
>
>
> The [README.md|https://github.com/apache/nutch/blob/master/README.md] contains [Github hub|https://hub.github.com/] commands but with "git" as command (executable) name, maybe an alias or some other magic. However, if hub isn't installed, these commands fail with {{git: 'pull-request' is not a git command. See 'git --help'.}} or similar.
> We should use the command "hub" instead.
[jira] [Created] (NUTCH-3055) README: fix Github "hub" commands
Sebastian Nagel created NUTCH-3055:
---

Summary: README: fix Github "hub" commands
Key: NUTCH-3055
URL: https://issues.apache.org/jira/browse/NUTCH-3055
Project: Nutch
Issue Type: Bug
Components: documentation
Affects Versions: 1.20
Reporter: Sebastian Nagel
Assignee: Sebastian Nagel
Fix For: 1.21

The [README.md|https://github.com/apache/nutch/blob/master/README.md] contains [Github hub|https://hub.github.com/] commands but with "git" as command (executable) name, maybe an alias or some other magic. However, if hub isn't installed, these commands fail with {{git: 'pull-request' is not a git command. See 'git --help'.}} or similar.

We should use the command "hub" instead.
[jira] [Commented] (NUTCH-3028) WARCExported to support filtering by JEXL
[ https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842291#comment-17842291 ]

Sebastian Nagel commented on NUTCH-3028:
---

+1 lgtm.

One question: if there is no parseData, the JEXL expression is not evaluated. Since WARC files may include only the raw HTML plus fetch/capture metadata, successfully parsing a document is not a requirement for archiving it in a WARC file. It might be useful to have the JEXL filtering available for unparsed docs as well.

> WARCExported to support filtering by JEXL
> ---
>
> Key: NUTCH-3028
> URL: https://issues.apache.org/jira/browse/NUTCH-3028
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 1.19
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Minor
> Fix For: 1.21
>
> Attachments: NUTCH-3028-1.patch, NUTCH-3028.patch
>
>
> Filtering segment data to WARC is now possible using JEXL expressions. In the next example, all records with SOME_KEY=SOME_VALUE in their parseData metadata are exported to WARC.
> -expr 'parseData.getParseMeta().get("SOME_KEY").equals("SOME_VALUE")'
[jira] [Commented] (NUTCH-3045) Upgrade from Java 11 to 17
[ https://issues.apache.org/jira/browse/NUTCH-3045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842284#comment-17842284 ]

Sebastian Nagel commented on NUTCH-3045:
---

See also NUTCH-2987.

Until HADOOP-17177 / HADOOP-18887 are done, we might be forced to maintain JDK 11 runtime compatibility, so that Nutch runs on recent Hadoop versions and distributions.

I fully agree that Java 17 offers some nice syntax improvements, though. :)

> Upgrade from Java 11 to 17
> ---
>
> Key: NUTCH-3045
> URL: https://issues.apache.org/jira/browse/NUTCH-3045
> Project: Nutch
> Issue Type: Task
> Components: build, ci/cd
> Reporter: Lewis John McGibbney
> Priority: Critical
> Fix For: 1.21
>
>
> This parent issue will track and organize work pertaining to upgrading Nutch to JDK 17.
> Premier support for Oracle JDK 11 ended 7 months ago (30 Sep 2023).