Re: [VOTE] Apache Nutch 1.20 Release

2024-04-11 Thread Lewis John McGibbney
Hi Seb,

On 2024/04/11 13:30:53 Sebastian Nagel wrote:
> 
> https://github.com/sebastian-nagel/nutch-test-single-node-cluster/

I think we should make this into an integration test suite and run it as part 
of CI. I’ve been meaning and wanting to do this for the __longest__ time…!

> 
> One note about the CHANGES.md: it's now a mixture of HTML and plain text.
> It does not use the potential of markdown, e.g. sections / headlines for
> the releases to make the change log navigable via a table of contents.
> The embedded HTML makes it less readable if viewed in a text editor.
> The rendering on GitHub [5] is acceptable with only minor glitches,
> mostly the placement of multiple lines in a single paragraph:
>   https://github.com/apache/nutch/blob/branch-1.20/CHANGES.md
> We also have a change log on Jira:
>   https://s.apache.org/ovjf3
> That's why I wouldn't call the CHANGES.md a "blocker". We should
> update the formatting after the release to make it again easily
> readable in source code and improve the document structure utilizing
> the markdown markup.

Excellent suggestion. I was focusing on including the hyperlinks and clearly 
compromised other change log benefits. I will address this after the release. 
Thank you


[jira] [Commented] (NUTCH-3040) Upgrade to Hadoop 3.4.0

2024-04-11 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17836191#comment-17836191
 ] 

Tim Allison commented on NUTCH-3040:


:cry-sob: This is great news!

> Upgrade to Hadoop 3.4.0
> ---
>
>              Key: NUTCH-3040
>              URL: https://issues.apache.org/jira/browse/NUTCH-3040
>          Project: Nutch
>       Issue Type: Improvement
>       Components: build
> Affects Versions: 1.20
>         Reporter: Sebastian Nagel
>         Priority: Major
>          Fix For: 1.21
>
>
> [Hadoop 3.4.0|https://hadoop.apache.org/release/3.4.0.html] has been released.
> Many dependencies are upgraded, including commons-io 2.14.0 which would have 
> saved us a lot of work in NUTCH-2959.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (NUTCH-3040) Upgrade to Hadoop 3.4.0

2024-04-11 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-3040:
--

         Summary: Upgrade to Hadoop 3.4.0
             Key: NUTCH-3040
             URL: https://issues.apache.org/jira/browse/NUTCH-3040
         Project: Nutch
      Issue Type: Improvement
      Components: build
Affects Versions: 1.20
        Reporter: Sebastian Nagel
         Fix For: 1.21


[Hadoop 3.4.0|https://hadoop.apache.org/release/3.4.0.html] has been released.

Many dependencies are upgraded, including commons-io 2.14.0 which would have 
saved us a lot of work in NUTCH-2959.





Re: [VOTE] Apache Nutch 1.20 Release

2024-04-11 Thread Sebastian Nagel

Hi Lewis,

here's my +1

 * signatures of release packages are valid
 * build from the source package successful, unit tests pass
 * tested a few Nutch tools in the binary package (local mode)
 * ran a sample crawl and tested many Nutch tools on a single-node cluster
   running Hadoop 3.4.0, see
   https://github.com/sebastian-nagel/nutch-test-single-node-cluster/

One note about the CHANGES.md: it's now a mixture of HTML and plain text.
It does not use the potential of markdown, e.g. sections / headlines for
the releases to make the change log navigable via a table of contents.
The embedded HTML makes it less readable if viewed in a text editor.
The rendering on GitHub [5] is acceptable with only minor glitches,
mostly the placement of multiple lines in a single paragraph:
  https://github.com/apache/nutch/blob/branch-1.20/CHANGES.md
We also have a change log on Jira:
  https://s.apache.org/ovjf3
That's why I wouldn't call the CHANGES.md a "blocker". We should
update the formatting after the release to make it again easily
readable in source code and improve the document structure utilizing
the markdown markup.

~Sebastian

On 4/9/24 23:28, lewis john mcgibbney wrote:

Hi Folks,

A first candidate for the Nutch 1.20 release is available at [0] where 
accompanying SHA512 and ASC signatures can also be found.

Information on verifying releases can be found at [1].
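
For reviewers who prefer to script the checksum part of that verification, the following is a minimal Java sketch, not the procedure documented at [1]. It assumes the expected digest has already been read from the accompanying SHA512 file as a hex string; the class name `Sha512CheckSketch` and its helper methods are made up for illustration. Checking the ASC signatures is a separate step (typically done with GnuPG) and is not covered here.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;

// Illustrative helper, not part of Nutch: compares a file's SHA-512 digest
// with an expected hex string.
class Sha512CheckSketch {

    private static MessageDigest newDigest() {
        try {
            return MessageDigest.getInstance("SHA-512");
        } catch (NoSuchAlgorithmException e) {
            // SHA-512 is expected to be available on any standard JVM.
            throw new IllegalStateException(e);
        }
    }

    // Lower-case hex encoding, two characters per byte.
    private static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    // Digest of an in-memory byte array.
    static String sha512Hex(byte[] data) {
        return toHex(newDigest().digest(data));
    }

    // Digest of a file, streamed in chunks so large archives fit in memory.
    static String sha512Hex(Path file) throws IOException {
        MessageDigest digest = newDigest();
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                digest.update(buffer, 0, read);
            }
        }
        return toHex(digest.digest());
    }

    // Ignores whitespace and letter case in the expected value.
    static boolean matches(String actualHex, String expectedHex) {
        String normalized = expectedHex.replaceAll("\\s+", "").toLowerCase(Locale.ROOT);
        return normalized.equals(actualHex.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) throws IOException {
        // args[0]: path to a downloaded archive, args[1]: expected hex digest
        String actual = sha512Hex(Paths.get(args[0]));
        System.out.println(matches(actual, args[1]) ? "checksum OK" : "checksum MISMATCH");
    }
}
```

The comparison strips whitespace and ignores case because checksum files are not always written as one compact lower-case string; that normalization is a convenience of this sketch, not a requirement of the release process.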

The release candidate comprises .zip and .tar.gz archives of the sources at [2] 
and complementary binary distributions. In addition, a staged maven repository 
is available at [3].


The Nutch 1.20 release report is available at [4].

Please vote on releasing this package as Apache Nutch 1.20. The vote is open for 
at least the next 72 hours and passes if a majority of at least three +1 Nutch 
PMC votes are cast.


[ ] +1 Release this package as Apache Nutch 1.20.

[ ] -1 Do not release this package because…

Cheers,
lewismc
P.S. Here is my +1.

[0] https://dist.apache.org/repos/dist/dev/nutch/1.20 

[1] http://nutch.apache.org/downloads.html#verify 

[2] https://github.com/apache/nutch/tree/release-1.20 

[3] https://repository.apache.org/content/repositories/orgapachenutch-1021/ 


[4] https://s.apache.org/ovjf3 

--
http://home.apache.org/~lewismc/ 
http://people.apache.org/keys/committer/lewismc 



[jira] [Commented] (NUTCH-3039) Failure to handle ftp:// URLs

2024-04-11 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17836133#comment-17836133
 ] 

Markus Jelsma commented on NUTCH-3039:
--

Thanks for spotting that!

> Failure to handle ftp:// URLs
> -
>
>              Key: NUTCH-3039
>              URL: https://issues.apache.org/jira/browse/NUTCH-3039
>          Project: Nutch
>       Issue Type: Bug
>       Components: plugin, protocol
> Affects Versions: 1.19
>         Reporter: Sebastian Nagel
>         Assignee: Sebastian Nagel
>         Priority: Major
>          Fix For: 1.21
>
>
> Nutch fails to handle ftp:// URLs:
> - URLNormalizerBasic returns the empty string because creating the URL 
> instance fails with a MalformedURLException:
>   {noformat}
> echo "ftp://ftp.example.com/path/file.txt" \
>   | nutch normalizerchecker -stdin -normalizer urlnormalizer-basic{noformat}
> - fetching a ftp:// URL with the protocol-ftp plugin enabled also fails due 
> to a MalformedURLException:
>   {noformat}
> bin/nutch parsechecker -Dplugin.includes='protocol-ftp|parse-tika' \
>    "ftp://ftp.example.com/path/file.txt"
> ...
> Exception in thread "main" org.apache.nutch.protocol.ProtocolNotFound: java.net.MalformedURLException
>     at org.apache.nutch.protocol.ProtocolFactory.getProtocol(ProtocolFactory.java:113)
> ...{noformat}
> The issue is caused by NUTCH-2429:
> - we do not provide a dedicated URL stream handler for ftp URLs
> - but also do not pass ftp:// URLs to the standard JVM handler





[jira] [Commented] (NUTCH-3039) Failure to handle ftp:// URLs

2024-04-11 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17836126#comment-17836126
 ] 

ASF GitHub Bot commented on NUTCH-3039:
---

sebastian-nagel opened a new pull request, #812:
URL: https://github.com/apache/nutch/pull/812

   Pass ftp:// URLs to the standard JVM URLStreamHandler




> Failure to handle ftp:// URLs
> -
>
>              Key: NUTCH-3039
>              URL: https://issues.apache.org/jira/browse/NUTCH-3039
>          Project: Nutch
>       Issue Type: Bug
>       Components: plugin, protocol
> Affects Versions: 1.19
>         Reporter: Sebastian Nagel
>         Priority: Major
>          Fix For: 1.21
>
>
> Nutch fails to handle ftp:// URLs:
> - URLNormalizerBasic returns the empty string because creating the URL 
> instance fails with a MalformedURLException:
>   {noformat}
> echo "ftp://ftp.example.com/path/file.txt" \
>   | nutch normalizerchecker -stdin -normalizer urlnormalizer-basic{noformat}
> - fetching a ftp:// URL with the protocol-ftp plugin enabled also fails due 
> to a MalformedURLException:
>   {noformat}
> bin/nutch parsechecker -Dplugin.includes='protocol-ftp|parse-tika' \
>    "ftp://ftp.example.com/path/file.txt"
> ...
> Exception in thread "main" org.apache.nutch.protocol.ProtocolNotFound: java.net.MalformedURLException
>     at org.apache.nutch.protocol.ProtocolFactory.getProtocol(ProtocolFactory.java:113)
> ...{noformat}
> The issue is caused by NUTCH-2429:
> - we do not provide a dedicated URL stream handler for ftp URLs
> - but also do not pass ftp:// URLs to the standard JVM handler





[jira] [Assigned] (NUTCH-3039) Failure to handle ftp:// URLs

2024-04-11 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel reassigned NUTCH-3039:
--

Assignee: Sebastian Nagel

> Failure to handle ftp:// URLs
> -
>
>              Key: NUTCH-3039
>              URL: https://issues.apache.org/jira/browse/NUTCH-3039
>          Project: Nutch
>       Issue Type: Bug
>       Components: plugin, protocol
> Affects Versions: 1.19
>         Reporter: Sebastian Nagel
>         Assignee: Sebastian Nagel
>         Priority: Major
>          Fix For: 1.21
>
>
> Nutch fails to handle ftp:// URLs:
> - URLNormalizerBasic returns the empty string because creating the URL 
> instance fails with a MalformedURLException:
>   {noformat}
> echo "ftp://ftp.example.com/path/file.txt" \
>   | nutch normalizerchecker -stdin -normalizer urlnormalizer-basic{noformat}
> - fetching a ftp:// URL with the protocol-ftp plugin enabled also fails due 
> to a MalformedURLException:
>   {noformat}
> bin/nutch parsechecker -Dplugin.includes='protocol-ftp|parse-tika' \
>    "ftp://ftp.example.com/path/file.txt"
> ...
> Exception in thread "main" org.apache.nutch.protocol.ProtocolNotFound: java.net.MalformedURLException
>     at org.apache.nutch.protocol.ProtocolFactory.getProtocol(ProtocolFactory.java:113)
> ...{noformat}
> The issue is caused by NUTCH-2429:
> - we do not provide a dedicated URL stream handler for ftp URLs
> - but also do not pass ftp:// URLs to the standard JVM handler





[PR] NUTCH-3039 Failure to handle ftp:// URLs [nutch]

2024-04-11 Thread via GitHub


sebastian-nagel opened a new pull request, #812:
URL: https://github.com/apache/nutch/pull/812

   Pass ftp:// URLs to the standard JVM URLStreamHandler


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@nutch.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (NUTCH-3039) Failure to handle ftp:// URLs

2024-04-11 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-3039:
--

         Summary: Failure to handle ftp:// URLs
             Key: NUTCH-3039
             URL: https://issues.apache.org/jira/browse/NUTCH-3039
         Project: Nutch
      Issue Type: Bug
      Components: plugin, protocol
Affects Versions: 1.19
        Reporter: Sebastian Nagel
         Fix For: 1.21


Nutch fails to handle ftp:// URLs:
- URLNormalizerBasic returns the empty string because creating the URL instance 
fails with a MalformedURLException:
  {noformat}
echo "ftp://ftp.example.com/path/file.txt" \
  | nutch normalizerchecker -stdin -normalizer urlnormalizer-basic{noformat}
- fetching a ftp:// URL with the protocol-ftp plugin enabled also fails due to 
a MalformedURLException:
  {noformat}
bin/nutch parsechecker -Dplugin.includes='protocol-ftp|parse-tika' \
   "ftp://ftp.example.com/path/file.txt"
...
Exception in thread "main" org.apache.nutch.protocol.ProtocolNotFound: java.net.MalformedURLException
    at org.apache.nutch.protocol.ProtocolFactory.getProtocol(ProtocolFactory.java:113)
...{noformat}


The issue is caused by NUTCH-2429:
- we do not provide a dedicated URL stream handler for ftp URLs
- but also do not pass ftp:// URLs to the standard JVM handler
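
The two points above can be illustrated with a small Java sketch. It is not the Nutch code and not the change proposed for this issue; `FtpFallbackSketch`, `SketchFactory`, the protocol list and the `foo` protocol are assumptions made up for the example. It relies only on the documented `java.net.URLStreamHandlerFactory` behaviour: when a factory returns `null` for a protocol, the JVM goes on to look for its own built-in handler.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;
import java.net.URLStreamHandlerFactory;
import java.util.Set;

// Illustrative only: a factory that hands standard protocols, including ftp,
// back to the JVM instead of claiming them itself.
class FtpFallbackSketch {

    // Assumed list of protocols left to the JVM's built-in handlers.
    static final Set<String> JVM_PROTOCOLS = Set.of("http", "https", "file", "jar", "ftp");

    // Placeholder handler for a protocol the application claims for itself.
    static class PlaceholderHandler extends URLStreamHandler {
        @Override
        protected URLConnection openConnection(URL url) throws IOException {
            throw new IOException("connections are not implemented in this sketch: " + url);
        }
    }

    static class SketchFactory implements URLStreamHandlerFactory {
        @Override
        public URLStreamHandler createURLStreamHandler(String protocol) {
            if (JVM_PROTOCOLS.contains(protocol)) {
                // null tells the JVM to fall back to its own handler lookup.
                return null;
            }
            if ("foo".equals(protocol)) {
                // Hypothetical application-specific protocol.
                return new PlaceholderHandler();
            }
            // Unknown protocol: the JVM decides, and may reject the URL
            // with a MalformedURLException.
            return null;
        }
    }

    public static void main(String[] args) throws Exception {
        // A factory can be installed only once per JVM.
        URL.setURLStreamHandlerFactory(new SketchFactory());
        URL url = URI.create("ftp://ftp.example.com/path/file.txt").toURL();
        System.out.println(url.getProtocol() + " -> " + url.getHost() + url.getPath());
    }
}
```

With `ftp` in the pass-through list, creating the URL no longer depends on the factory supplying a handler of its own, which is the situation the second point describes.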


