[jira] [Work started] (NUTCH-3032) Indexing plugin as an adapter for end user's own POJO instances

2024-03-30 Thread Joe Gilvary (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-3032 started by Joe Gilvary.
--
> Indexing plugin as an adapter for end user's own POJO instances
> ---
>
> Key: NUTCH-3032
> URL: https://issues.apache.org/jira/browse/NUTCH-3032
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Reporter: Joe Gilvary
>Assignee: Joe Gilvary
>Priority: Major
>  Labels: indexing
> Fix For: 1.20
>
> Attachments: NUTCH-3032.patch
>
>
> It could be helpful to let end users manipulate information at indexing time 
> with their own code without the need for writing their own indexing plugin. I 
> mentioned this on the dev mailing list 
> (https://www.mail-archive.com/dev@nutch.apache.org/msg31190.html) with some 
> description of my work in progress.
> One potential use is to address some of the same concerns that NUTCH-585 
> discusses regarding an alternative approach to picking and choosing which 
> content to index, but this approach would allow making index time decisions, 
> rather than setting the configuration for all content at the start of the 
> indexing run.
>  
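
A rough sketch of the idea described above, assuming an adapter that is given a user class name and a method name and calls that method by reflection to compute a value at index time. Every name here (`PojoAdapterSketch`, `UrlDepth`, `resolve`, the `urlDepth` field) is hypothetical and is not taken from the attached patch or from the Nutch API.

```java
import java.lang.reflect.Constructor;
import java.lang.reflect.Method;
import java.util.HashMap;
import java.util.Map;

// Hypothetical user POJO: no Nutch dependencies, only a String[] constructor
// and a public no-argument method whose result should end up in the index.
class UrlDepth {
  private final String url;

  public UrlDepth(String[] args) {
    this.url = args[0];
  }

  // Counts the non-empty path segments of the URL.
  public String depth() {
    String path = url.replaceFirst("^[a-z]+://[^/]*", "");
    int count = 0;
    for (String segment : path.split("/")) {
      if (!segment.isEmpty()) {
        count++;
      }
    }
    return Integer.toString(count);
  }
}

// Hypothetical adapter: instantiates the user's class and invokes the
// configured method, so the user never has to write an indexing plugin.
final class PojoAdapterSketch {
  static Object resolve(String className, String methodName, String[] ctorArgs)
      throws ReflectiveOperationException {
    Class<?> clazz = Class.forName(className);
    Constructor<?> ctor = clazz.getDeclaredConstructor(String[].class);
    // The cast keeps the array from being spread as varargs.
    Object pojo = ctor.newInstance((Object) ctorArgs);
    Method method = clazz.getDeclaredMethod(methodName);
    return method.invoke(pojo);
  }

  public static void main(String[] args) throws Exception {
    // A plain map stands in for the document being indexed (an assumption).
    Map<String, Object> doc = new HashMap<>();
    String url = "http://example.org/docs/guide/intro.html";
    doc.put("urlDepth", resolve(UrlDepth.class.getName(), "depth", new String[] {url}));
    System.out.println(doc);
  }
}
```

Because the value is computed per document, the decision about what to index can depend on the document itself rather than on configuration fixed at the start of the run.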



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (NUTCH-3032) Indexing plugin as an adapter for end user's own POJO instances

2024-03-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17832510#comment-17832510
 ] 

ASF GitHub Bot commented on NUTCH-3032:
---

CatChullain commented on PR #810:
URL: https://github.com/apache/nutch/pull/810#issuecomment-2028497765

   Thanks again, @lewismc. 
   
   I did add those INFO messages, but I found an extra call to setIndexedConf 
from setConf that the filter() method handles more cleanly, so I removed that,  
too.
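
The comment does not say what `setIndexedConf` does, so the following is only one plausible reading: if settings are resolved inside `filter()` on first use, an extra eager call from `setConf` is redundant. The class is a hypothetical stand-in, not the plugin's code; `Properties` replaces Hadoop's `Configuration`, and the key `arbitrary.field.name` and the helper `ensureSettings` are invented for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

final class LazyConfFilterSketch {
  private Properties conf;
  private String fieldName;     // resolved lazily on the first filter() call
  private int resolveCount = 0; // only here to make the behaviour observable

  // Stores the configuration; deliberately does no other work.
  void setConf(Properties conf) {
    this.conf = conf;
    this.fieldName = null; // force re-resolution after a configuration change
  }

  // Hypothetical helper: reads the settings once and caches them.
  private void ensureSettings() {
    if (fieldName == null) {
      fieldName = conf.getProperty("arbitrary.field.name", "custom");
      resolveCount++;
    }
  }

  // A plain map stands in for the document being indexed (an assumption).
  Map<String, String> filter(Map<String, String> doc, String value) {
    ensureSettings();
    doc.put(fieldName, value);
    return doc;
  }

  int getResolveCount() {
    return resolveCount;
  }

  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty("arbitrary.field.name", "urlDepth");
    LazyConfFilterSketch filter = new LazyConfFilterSketch();
    filter.setConf(props);
    Map<String, String> doc = filter.filter(new HashMap<>(), "3");
    filter.filter(doc, "4");
    System.out.println(doc + " resolved " + filter.getResolveCount() + " time(s)");
  }
}
```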







Re: [PR] NUTCH-3032 Code for an ArbitraryIndexingFilter to index values resolved by user POJO code at index time [nutch]

2024-03-30 Thread via GitHub


CatChullain commented on PR #810:
URL: https://github.com/apache/nutch/pull/810#issuecomment-2028497765

   Thanks again, @lewismc. 
   
   I did add those INFO messages, but I found an extra call to setIndexedConf 
from setConf that the filter() method handles more cleanly, so I removed that,  
too.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@nutch.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (NUTCH-3036) Upgrade org.seleniumhq.selenium:selenium-java dependency in lib-selenium

2024-03-30 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17832486#comment-17832486
 ] 

Hudson commented on NUTCH-3036:
---

SUCCESS: Integrated in Jenkins build Nutch » Nutch-trunk #155 (See 
[https://ci-builds.apache.org/job/Nutch/job/Nutch-trunk/155/])
NUTCH-3036 Upgrade org.seleniumhq.selenium:selenium-java dependency i… (#807) 
(github: 
[https://github.com/apache/nutch/commit/1563396d952393462fffab1f686e9ffd5d006cf6])
* (edit) src/plugin/lib-selenium/README.md
* (edit) src/plugin/lib-selenium/howto_upgrade_selenium.md
* (edit) src/plugin/lib-selenium/plugin.xml
* (edit) 
src/plugin/lib-htmlunit/src/java/org/apache/nutch/protocol/htmlunit/HtmlUnitWebDriver.java
* (edit) README.md
* (edit) 
src/plugin/protocol-selenium/src/java/org/apache/nutch/protocol/selenium/Http.java
* (edit) src/plugin/protocol-interactiveselenium/README.md
* (edit) src/plugin/protocol-selenium/README.md
* (edit) 
src/plugin/lib-selenium/src/java/org/apache/nutch/protocol/selenium/HttpWebClient.java
* (edit) 
src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java
* (edit) src/plugin/lib-selenium/ivy.xml


> Upgrade org.seleniumhq.selenium:selenium-java dependency in lib-selenium
> 
>
> Key: NUTCH-3036
> URL: https://issues.apache.org/jira/browse/NUTCH-3036
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin, selenium
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.20
>
>
> lib-selenium currently packages org.seleniumhq.selenium:selenium-java 
> *v4.7.2* but *v4.18.1* is available on Maven Central.
> This ticket will upgrade the java dependency and validate that both 
> protocol-selenium and protocol-interactiveselenium work as expected in local 
> mode and via selenium grid.





[jira] [Commented] (NUTCH-3035) Update license and notice file for release of 1.20

2024-03-30 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17832485#comment-17832485
 ] 

Hudson commented on NUTCH-3035:
---

SUCCESS: Integrated in Jenkins build Nutch » Nutch-trunk #155 (See 
[https://ci-builds.apache.org/job/Nutch/job/Nutch-trunk/155/])
NUTCH-3035 Update license and notice file for release of 1.20 (#808) (github: 
[https://github.com/apache/nutch/commit/5a95bc6537f86270bd0c798357d57220b316a7cb])
* (edit) NOTICE-binary
* (edit) LICENSE-binary


> Update license and notice file for release of 1.20 
> ---
>
> Key: NUTCH-3035
> URL: https://issues.apache.org/jira/browse/NUTCH-3035
> Project: Nutch
>  Issue Type: Bug
>  Components: documentation
>Affects Versions: 1.20
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.20
>
>
> Close to the release of 1.20 the license and notice files should be updated 
> to contain all (third-party) licenses of all dependencies. Cf. NUTCH-2290 and 
> NUTCH-2981.





[jira] [Commented] (NUTCH-3032) Indexing plugin as an adapter for end user's own POJO instances

2024-03-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17832479#comment-17832479
 ] 

ASF GitHub Bot commented on NUTCH-3032:
---

lewismc commented on PR #810:
URL: https://github.com/apache/nutch/pull/810#issuecomment-2028343327

   Hi @CatChullain, I associated this Jira ticket with the 1.20 release and made 
you assignee.
   We will get it merged soon and roll the release.






Re: [PR] NUTCH-3032 Code for an ArbitraryIndexingFilter to index values resolved by user POJO code at index time [nutch]

2024-03-30 Thread via GitHub


lewismc commented on PR #810:
URL: https://github.com/apache/nutch/pull/810#issuecomment-2028343327

   Hi @CatChullain, I associated this Jira ticket with the 1.20 release and made 
you assignee.
   We will get it merged soon and roll the release.





[jira] [Updated] (NUTCH-3032) Indexing plugin as an adapter for end user's own POJO instances

2024-03-30 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-3032:

Fix Version/s: 1.20



[jira] [Assigned] (NUTCH-3032) Indexing plugin as an adapter for end user's own POJO instances

2024-03-30 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney reassigned NUTCH-3032:
---

Assignee: Joe Gilvary

> Indexing plugin as an adapter for end user's own POJO instances
> ---
>
> Key: NUTCH-3032
> URL: https://issues.apache.org/jira/browse/NUTCH-3032
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Reporter: Joe Gilvary
>Assignee: Joe Gilvary
>Priority: Major
>  Labels: indexing
> Attachments: NUTCH-3032.patch
>
>
> It could be helpful to let end users manipulate information at indexing time 
> with their own code without the need for writing their own indexing plugin. I 
> mentioned this on the dev mailing list 
> (https://www.mail-archive.com/dev@nutch.apache.org/msg31190.html) with some 
> description of my work in progress.
> One potential use is to address some of the same concerns that NUTCH-585 
> discusses regarding an alternative approach to picking and choosing which 
> content to index, but this approach would allow making index time decisions, 
> rather than setting the configuration for all content at the start of the 
> indexing run.
>  





[jira] [Work stopped] (NUTCH-2856) Implement a protocol-smb plugin based on hierynomus/smbj

2024-03-30 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-2856 stopped by Lewis John McGibbney.
---
> Implement a protocol-smb plugin based on hierynomus/smbj
> 
>
> Key: NUTCH-2856
> URL: https://issues.apache.org/jira/browse/NUTCH-2856
> Project: Nutch
>  Issue Type: New Feature
>  Components: external, plugin, protocol
>Reporter: Hiran Chaudhuri
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.20
>
>
> The plugin protocol-smb advertised on 
> [https://cwiki.apache.org/confluence/display/NUTCH/PluginCentral] actually 
> refers to the JCIFS library. According to this library's homepage 
> [https://www.jcifs.org/]:
> _If you're looking for the latest and greatest open source Java SMB library, 
> this is not it. JCIFS has been in maintenance-mode-only for several years and 
> although what it does support works fine (SMB1, NTLMv2, midlc, MSRPC and 
> various utility classes), jCIFS does not support the newer SMB2/3 variants of 
> the SMB protocol which is slowly becoming required (Windows 10 requires 
> SMB2/3). JCIFS only supports SMB1 but Microsoft has deprecated SMB1 in their 
> products. *So if SMB1 is disabled on your network, JCIFS' file related 
> operations will NOT work.*_
> Looking at 
> [https://en.wikipedia.org/wiki/Server_Message_Block#SMB_/_CIFS_/_SMB1]:
> _Microsoft added SMB1 to the Windows Server 2012 R2 deprecation list in June 
> 2013. Windows Server 2016 and some versions of Windows 10 Fall Creators 
> Update do not have SMB1 installed by default._
> In conclusion, the chances that the SMB1 protocol is installed and/or 
> configured are shrinking fast, so some migration towards 
> SMB2/3 is required. Luckily, the JCIFS homepage lists alternatives:
>  * [jcifs-codelibs|https://github.com/codelibs/jcifs]
>  * [jcifs-ng|https://github.com/AgNO3/jcifs-ng]
>  * [smbj|https://github.com/hierynomus/smbj]





[jira] [Work stopped] (NUTCH-2887) Migrate to JUnit 5 Jupiter

2024-03-30 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-2887 stopped by Lewis John McGibbney.
---
> Migrate to JUnit 5 Jupiter
> --
>
> Key: NUTCH-2887
> URL: https://issues.apache.org/jira/browse/NUTCH-2887
> Project: Nutch
>  Issue Type: Improvement
>  Components: test
> Environment: Migrate 
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.20
>
>
> This effort is a bit of a beast. See the [JUnit migration 
> tips|https://junit.org/junit5/docs/current/user-guide/#migrating-from-junit4-tips]
>  for general guidance. A general grep for junit in src produces the following
> {code:bash}
> ./test/nutch-site.xml
> ./test/org/apache/nutch/tools/TestCommonCrawlDataDumper.java
> ./test/org/apache/nutch/net/TestURLNormalizers.java
> ./test/org/apache/nutch/net/protocols/TestHttpDateFormat.java
> ./test/org/apache/nutch/net/TestURLFilters.java
> ./test/org/apache/nutch/util/TestStringUtil.java
> ./test/org/apache/nutch/util/TestSuffixStringMatcher.java
> ./test/org/apache/nutch/util/TestEncodingDetector.java
> ./test/org/apache/nutch/util/TestMimeUtil.java
> ./test/org/apache/nutch/util/TestPrefixStringMatcher.java
> ./test/org/apache/nutch/util/DumpFileUtilTest.java
> ./test/org/apache/nutch/util/TestNodeWalker.java
> ./test/org/apache/nutch/util/WritableTestUtils.java
> ./test/org/apache/nutch/util/TestTableUtil.java
> ./test/org/apache/nutch/util/TestURLUtil.java
> ./test/org/apache/nutch/util/TestGZIPUtils.java
> ./test/org/apache/nutch/parse/TestParseText.java
> ./test/org/apache/nutch/parse/TestOutlinks.java
> ./test/org/apache/nutch/parse/TestParseData.java
> ./test/org/apache/nutch/parse/TestOutlinkExtractor.java
> ./test/org/apache/nutch/parse/TestParserFactory.java
> ./test/org/apache/nutch/segment/TestSegmentMerger.java
> ./test/org/apache/nutch/segment/TestSegmentMergerCrawlDatums.java
> ./test/org/apache/nutch/plugin/TestPluginSystem.java
> ./test/org/apache/nutch/fetcher/TestFetcher.java
> ./test/org/apache/nutch/protocol/TestProtocolFactory.java
> ./test/org/apache/nutch/protocol/TestContent.java
> ./test/org/apache/nutch/protocol/AbstractHttpProtocolPluginTest.java
> ./test/org/apache/nutch/crawl/TestCrawlDbFilter.java
> ./test/org/apache/nutch/crawl/TestTextProfileSignature.java
> ./test/org/apache/nutch/crawl/TestCrawlDbStates.java
> ./test/org/apache/nutch/crawl/TestGenerator.java
> ./test/org/apache/nutch/crawl/TestAdaptiveFetchSchedule.java
> ./test/org/apache/nutch/crawl/TODOTestCrawlDbStates.java
> ./test/org/apache/nutch/crawl/TestSignatureFactory.java
> ./test/org/apache/nutch/crawl/ContinuousCrawlTestUtil.java
> ./test/org/apache/nutch/crawl/TestInjector.java
> ./test/org/apache/nutch/crawl/TestLinkDbMerger.java
> ./test/org/apache/nutch/crawl/TestCrawlDbMerger.java
> ./test/org/apache/nutch/service/TestNutchServer.java
> ./test/org/apache/nutch/metadata/TestMetadata.java
> ./test/org/apache/nutch/metadata/TestSpellCheckedMetadata.java
> ./test/org/apache/nutch/indexer/TestIndexingFilters.java
> ./test/org/apache/nutch/indexer/TestIndexerMapReduce.java
> ./bin/nutch
> ./plugin/scoring-orphan/src/test/org/apache/nutch/scoring/orphan/TestOrphanScoringFilter.java
> ./plugin/index-basic/src/test/org/apache/nutch/indexer/basic/TestBasicIndexingFilter.java
> ./plugin/urlfilter-domaindenylist/build.xml
> ./plugin/urlfilter-domaindenylist/src/test/org/apache/nutch/urlfilter/domaindenylist/TestDomainDenylistURLFilter.java
> ./plugin/protocol-imaps/plugin.xml
> ./plugin/protocol-imaps/ivy.xml
> ./plugin/protocol-imaps/lib/junit-4.13.jar
> ./plugin/protocol-imaps/lib/greenmail-junit4-1.6.0.jar
> ./plugin/protocol-imaps/lib/greenmail-1.6.0.jar
> ./plugin/protocol-imaps/src/test/org/apache/nutch/protocol/imaps/TestImaps.java
> ./plugin/protocol-file/build.xml
> ./plugin/protocol-file/src/test/org/apache/nutch/protocol/file/TestProtocolFile.java
> ./plugin/urlnormalizer-regex/build.xml
> ./plugin/urlnormalizer-regex/src/test/org/apache/nutch/net/urlnormalizer/regex/TestRegexURLNormalizer.java
> ./plugin/build-plugin.xml
> ./plugin/creativecommons/src/test/org/creativecommons/nutch/TestCCParseFilter.java
> ./plugin/urlnormalizer-basic/src/test/org/apache/nutch/net/urlnormalizer/basic/TestBasicURLNormalizer.java
> ./plugin/urlnormalizer-protocol/build.xml
> ./plugin/urlnormalizer-protocol/src/test/org/apache/nutch/net/urlnormalizer/protocol/TestProtocolURLNormalizer.java
> ./plugin/urlfilter-prefix/src/test/org/apache/nutch/urlfilter/prefix/TestPrefixURLFilter.java
> ./plugin/urlfilter-suffix/src/test/org/apache/nutch/urlfilter/suffix/TestSuffixURLFilter.java
> ./plugin/index-more/src/test/org/apache/nutch/indexer/more/TestMoreIndexingFilter.java
> 

[jira] [Closed] (NUTCH-2832) Create tutorial on sending Nutch logs to Elasticsearch

2024-03-30 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed NUTCH-2832.
---

> Create tutorial on sending Nutch logs to Elasticsearch
> --
>
> Key: NUTCH-2832
> URL: https://issues.apache.org/jira/browse/NUTCH-2832
> Project: Nutch
>  Issue Type: New Feature
>  Components: configuration, deployment
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.20
>
>
> A while back I used to use [Chukwa|https://chukwa.apache.org/] for log 
> aggregation and analysis. Chukwa is now retired. 
> I did a bit of research into directly logging Log4j2 into Elasticsearch and came 
> across 
> [log4j2-elasticsearch|https://github.com/rfoltyns/log4j2-elasticsearch] which 
> looks pretty simple.
> I'm going to have a crack at implementing this functionality as a 
> configuration option. 





[jira] [Resolved] (NUTCH-2832) Create tutorial on sending Nutch logs to Elasticsearch

2024-03-30 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-2832.
-
Resolution: Won't Fix

Given the license changes regarding the concerned backend, I have no interest 
in implementing this anymore. 



[jira] [Resolved] (NUTCH-3036) Upgrade org.seleniumhq.selenium:selenium-java dependency in lib-selenium

2024-03-30 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-3036.
-
Resolution: Fixed



[jira] [Closed] (NUTCH-3036) Upgrade org.seleniumhq.selenium:selenium-java dependency in lib-selenium

2024-03-30 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed NUTCH-3036.
---



[jira] [Commented] (NUTCH-3036) Upgrade org.seleniumhq.selenium:selenium-java dependency in lib-selenium

2024-03-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17832474#comment-17832474
 ] 

ASF GitHub Bot commented on NUTCH-3036:
---

lewismc merged PR #807:
URL: https://github.com/apache/nutch/pull/807






[jira] [Commented] (NUTCH-3035) Update license and notice file for release of 1.20

2024-03-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17832473#comment-17832473
 ] 

ASF GitHub Bot commented on NUTCH-3035:
---

lewismc merged PR #808:
URL: https://github.com/apache/nutch/pull/808






Re: [PR] NUTCH-3035 Update license and notice file for release of 1.20 [nutch]

2024-03-30 Thread via GitHub


lewismc merged PR #808:
URL: https://github.com/apache/nutch/pull/808





[jira] [Closed] (NUTCH-3035) Update license and notice file for release of 1.20

2024-03-30 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed NUTCH-3035.
---



[jira] [Resolved] (NUTCH-3035) Update license and notice file for release of 1.20

2024-03-30 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-3035.
-
Resolution: Fixed



Re: [PR] NUTCH-3036 Upgrade org.seleniumhq.selenium:selenium-java dependency i… [nutch]

2024-03-30 Thread via GitHub


lewismc merged PR #807:
URL: https://github.com/apache/nutch/pull/807





[jira] [Resolved] (NUTCH-3037) Upgrade org.apache.kafka:kafka_2.12: to v3.7.0

2024-03-30 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-3037.
-
Resolution: Fixed

> Upgrade org.apache.kafka:kafka_2.12: to v3.7.0
> --
>
> Key: NUTCH-3037
> URL: https://issues.apache.org/jira/browse/NUTCH-3037
> Project: Nutch
>  Issue Type: Task
>  Components: indexer-kafka
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.20
>
>
> We depend on v1.1.0, which is quite a bit behind the current v3.7.0 artifact. 
> I therefore propose to upgrade.
> I will also state that a _*kafka_2.13*_ artifact exists. This would demand 
> that the underlying Scala version be also upgraded... but I think this should 
> be addressed in a separate ticket.





[jira] [Closed] (NUTCH-3037) Upgrade org.apache.kafka:kafka_2.12: to v3.7.0

2024-03-30 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed NUTCH-3037.
---



Re: [PR] NUTCH-3037 Upgrade org.apache.kafka:kafka_2.12: to v3.7.0 [nutch]

2024-03-30 Thread via GitHub


lewismc merged PR #809:
URL: https://github.com/apache/nutch/pull/809





[jira] [Commented] (NUTCH-3037) Upgrade org.apache.kafka:kafka_2.12: to v3.7.0

2024-03-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17832472#comment-17832472
 ] 

ASF GitHub Bot commented on NUTCH-3037:
---

lewismc merged PR #809:
URL: https://github.com/apache/nutch/pull/809









[jira] [Commented] (NUTCH-3032) Indexing plugin as an adapter for end user's own POJO instances

2024-03-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17832471#comment-17832471
 ] 

ASF GitHub Bot commented on NUTCH-3032:
---

lewismc commented on PR #810:
URL: https://github.com/apache/nutch/pull/810#issuecomment-2028304406

   @CatChullain thanks for your patience whilst we work this one  
   
   > … I wonder where might be good spots for INFO level messages
   
   The reason I suggested revising the log level from `INFO` to `DEBUG` is
   that any logging needs to make sense in the context of the entire log. Said
   another way, plugin logging needs to complement the core crawler tasks.

   That being said, if you want to include `INFO` for the following scenarios,
   then please go ahead:
   * recording the count value, and
   * indicating when overwrite is true

   Your rationale is sound.
   
   




> Indexing plugin as an adapter for end user's own POJO instances
> ---
>
> Key: NUTCH-3032
> URL: https://issues.apache.org/jira/browse/NUTCH-3032
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Reporter: Joe Gilvary
>Priority: Major
>  Labels: indexing
> Attachments: NUTCH-3032.patch
>
>
> It could be helpful to let end users manipulate information at indexing time 
> with their own code without the need for writing their own indexing plugin. I 
> mentioned this on the dev mailing list 
> (https://www.mail-archive.com/dev@nutch.apache.org/msg31190.html) with some 
> description of my work in progress.
> One potential use is to address some of the same concerns that NUTCH-585 
> discusses regarding an alternative approach to picking and choosing which 
> content to index, but this approach would allow making index time decisions, 
> rather than setting the configuration for all content at the start of the 
> indexing run.
>  





Re: [PR] NUTCH-3032 Code for an ArbitraryIndexingFilter to index values resolved by user POJO code at index time [nutch]

2024-03-30 Thread via GitHub


lewismc commented on PR #810:
URL: https://github.com/apache/nutch/pull/810#issuecomment-2028304406

   @CatChullain thanks for your patience whilst we work this one  
   
   > … I wonder where might be good spots for INFO level messages
   
   The reason I suggested revising the log level from `INFO` to `DEBUG` is
   that any logging needs to make sense in the context of the entire log. Said
   another way, plugin logging needs to complement the core crawler tasks.

   That being said, if you want to include `INFO` for the following scenarios,
   then please go ahead:
   * recording the count value, and
   * indicating when overwrite is true

   Your rationale is sound.
   
   





[jira] [Commented] (NUTCH-3032) Indexing plugin as an adapter for end user's own POJO instances

2024-03-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17832463#comment-17832463
 ] 

ASF GitHub Bot commented on NUTCH-3032:
---

CatChullain commented on PR #810:
URL: https://github.com/apache/nutch/pull/810#issuecomment-2028122038

   Thanks, Lewis! I moved all four to DEBUG, but I wonder where might be good
   spots for INFO level messages. I'm thinking of the operator or tech who
   doesn't dig into code and has an issue in the config. During my own dev and
   test work, I sometimes forgot to increment index.arbitrary.function.count,
   and the plugin ignored the later fields. Just outputting that count value,
   and maybe something when overwrite is true, might help alert someone that
   the config is not what they believed.

   Does either of those (or something else) seem worthwhile, or does it make
   more sense to let people use it and see what issues they raise?
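
The failure mode described above (a stale `index.arbitrary.function.count` silently dropping later fields) can be illustrated with a minimal sketch. This is not the plugin's actual code: it assumes the plugin reads fields by index from 0 up to the count, it uses `java.util.Properties` and `java.util.logging` in place of Hadoop's `Configuration` and the SLF4J logging Nutch uses, and the per-field key names, the default count of 0, and the field names `popularity` and `advisory` are hypothetical. Only `index.arbitrary.function.count` comes from the thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import java.util.logging.Logger;

/** Illustrative sketch only; not the ArbitraryIndexingFilter implementation. */
final class ArbitraryConfigSketch {
  private static final Logger LOG = Logger.getLogger("ArbitraryConfigSketch");

  // Property name taken from the thread.
  static final String COUNT_KEY = "index.arbitrary.function.count";
  // Hypothetical per-field keys, assumed here for illustration.
  static final String FIELD_PREFIX = "index.arbitrary.fieldName.";
  static final String OVERWRITE_PREFIX = "index.arbitrary.overwrite.";

  /** Returns the field names that would be processed: indexes 0..count-1 only. */
  static List<String> configuredFields(Properties conf) {
    // Assumed default of 0 when the count is not set.
    int count = Integer.parseInt(conf.getProperty(COUNT_KEY, "0"));
    // INFO: lets an operator spot a stale count without reading the code.
    LOG.info(COUNT_KEY + " = " + count);

    List<String> fields = new ArrayList<>();
    for (int i = 0; i < count; i++) {
      String name = conf.getProperty(FIELD_PREFIX + i);
      if (name == null) {
        // FINE is the java.util.logging level closest to DEBUG.
        LOG.fine("no field name configured at index " + i);
        continue;
      }
      boolean overwrite =
          Boolean.parseBoolean(conf.getProperty(OVERWRITE_PREFIX + i, "false"));
      if (overwrite) {
        // INFO: overwriting an existing field is worth surfacing.
        LOG.info("overwrite is true for field " + name);
      }
      LOG.fine("configured field " + i + ": " + name);
      fields.add(name);
    }
    return fields;
  }

  public static void main(String[] args) {
    Properties conf = new Properties();
    // The count was not incremented after a second field was added.
    conf.setProperty(COUNT_KEY, "1");
    conf.setProperty(FIELD_PREFIX + "0", "popularity");
    conf.setProperty(FIELD_PREFIX + "1", "advisory");
    System.out.println(configuredFields(conf)); // prints [popularity]
  }
}
```

With the count left at 1, the second field never reaches the loop, so nothing at DEBUG level inside the loop would reveal the problem. Logging the count itself at INFO is what makes the mismatch visible.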









Re: [PR] NUTCH-3032 Code for an ArbitraryIndexingFilter to index values resolved by user POJO code at index time [nutch]

2024-03-30 Thread via GitHub


CatChullain commented on PR #810:
URL: https://github.com/apache/nutch/pull/810#issuecomment-2028122038

   Thanks, Lewis! I moved all four to DEBUG, but I wonder where might be good
   spots for INFO level messages. I'm thinking of the operator or tech who
   doesn't dig into code and has an issue in the config. During my own dev and
   test work, I sometimes forgot to increment index.arbitrary.function.count,
   and the plugin ignored the later fields. Just outputting that count value,
   and maybe something when overwrite is true, might help alert someone that
   the config is not what they believed.

   Does either of those (or something else) seem worthwhile, or does it make
   more sense to let people use it and see what issues they raise?

