[jira] [Commented] (NUTCH-3057) Arbitrary indexer "leaks" previous value into a field processed after an exception
[ https://issues.apache.org/jira/browse/NUTCH-3057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17847521#comment-17847521 ]

Joe Gilvary commented on NUTCH-3057:

Happy Saturday, [~lewi...@apache.org],

I worked on the plugin and this fix with some Raspberry Pi hosts at home but, of course, found the error at work. I didn't see it until I was running the 1.20 release in a pre-prod system. I set up individual POJOs for a few fields and added a typo in nutch-site.xml. As soon as I saw the exception during indexing and what made it into Solr, I knew what was wrong. A D'oh! moment indeed.

Please let me know if there's anything else I need to do, process-wise, to have this correct for the next distro.

> Arbitrary indexer "leaks" previous value into a field processed after an exception
> --
>
> Key: NUTCH-3057
> URL: https://issues.apache.org/jira/browse/NUTCH-3057
> Project: Nutch
> Issue Type: Bug
> Components: indexer
> Affects Versions: 1.20
> Reporter: Joe Gilvary
> Priority: Major

--
This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (NUTCH-3057) Arbitrary indexer "leaks" previous value into a field processed after an exception
[ https://issues.apache.org/jira/browse/NUTCH-3057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17847453#comment-17847453 ]

Joe Gilvary commented on NUTCH-3057:

The arbitrary indexer plug-in can add multiple new fields to a doc by appending numeric suffixes to the config values for each field. If an exception interferes with setting one field's value and another field is configured to be processed after it, the plug-in can insert the wrong value into that subsequent field.

> Arbitrary indexer "leaks" previous value into a field processed after an exception
> --
>
> Key: NUTCH-3057
> URL: https://issues.apache.org/jira/browse/NUTCH-3057
> Project: Nutch
> Issue Type: Bug
> Components: indexer
> Affects Versions: 1.20
> Reporter: Joe Gilvary
> Priority: Major
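The kind of leak described in this comment can be illustrated with a small, self-contained Java sketch. It is not the plug-in's actual code: the class LeakSketch, the FieldConfig holder, and the field names are all hypothetical, and the sketch only assumes that the bug has the general shape of a value carried across loop iterations and not reset when one iteration throws.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

class LeakSketch {

    /** Hypothetical stand-in for one configured field: its name and the user code that computes its value. */
    static final class FieldConfig {
        final String name;
        final Supplier<String> producer;

        FieldConfig(String name, Supplier<String> producer) {
            this.name = name;
            this.producer = producer;
        }
    }

    /** Leaky pattern: value is declared outside the loop, so it survives a failed iteration. */
    static Map<String, String> indexLeaky(List<FieldConfig> configs) {
        Map<String, String> doc = new LinkedHashMap<>();
        String value = null;
        for (FieldConfig cfg : configs) {
            try {
                value = cfg.producer.get();
            } catch (RuntimeException e) {
                // Swallowed: value still holds the previous field's result.
            }
            if (value != null) {
                doc.put(cfg.name, value);
            }
        }
        return doc;
    }

    /** Safer pattern: value is scoped to one iteration, so a failure leaves only that field unset. */
    static Map<String, String> indexSafe(List<FieldConfig> configs) {
        Map<String, String> doc = new LinkedHashMap<>();
        for (FieldConfig cfg : configs) {
            String value = null;
            try {
                value = cfg.producer.get();
            } catch (RuntimeException e) {
                // Swallowed: value stays null for this field only.
            }
            if (value != null) {
                doc.put(cfg.name, value);
            }
        }
        return doc;
    }

    /** Three hypothetical fields; the second one fails, as a config typo might cause. */
    static List<FieldConfig> sampleConfigs() {
        return Arrays.asList(
            new FieldConfig("first", () -> "alpha"),
            new FieldConfig("second", () -> { throw new IllegalStateException("typo in config"); }),
            new FieldConfig("third", () -> "gamma"));
    }

    public static void main(String[] args) {
        System.out.println(indexLeaky(sampleConfigs()));
        System.out.println(indexSafe(sampleConfigs()));
    }
}
```

With these made-up inputs, the leaky version prints {first=alpha, second=alpha, third=gamma}, so the field processed when the exception occurs picks up the earlier field's value, while the safe version prints {first=alpha, third=gamma}.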
[jira] [Created] (NUTCH-3057) Arbitrary indexer "leaks" previous value into a field processed after an exception
Joe Gilvary created NUTCH-3057: -- Summary: Arbitrary indexer "leaks" previous value into a field processed after an exception Key: NUTCH-3057 URL: https://issues.apache.org/jira/browse/NUTCH-3057 Project: Nutch Issue Type: Bug Components: indexer Affects Versions: 1.20 Reporter: Joe Gilvary -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (NUTCH-585) [PARSE-HTML plugin] Block certain parts of HTML code from being indexed
[ https://issues.apache.org/jira/browse/NUTCH-585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842526#comment-17842526 ]

Joe Gilvary commented on NUTCH-585:
---
[~dbeckstrom] I'm not sure which patch you were asking about. I used the source for the new 1.20 release and applied the patch that [~ad-...@gmx.at] posted, after editing the line numbers for the update to src/plugin/build.xml. It built cleanly and seems to work exactly as advertised in my tests with indexchecker.

> [PARSE-HTML plugin] Block certain parts of HTML code from being indexed
> ---
>
> Key: NUTCH-585
> URL: https://issues.apache.org/jira/browse/NUTCH-585
> Project: Nutch
> Issue Type: Improvement
> Components: HTML, parse-filter, parser, plugin
> Affects Versions: 0.9.0
> Environment: All operating systems
> Reporter: Andrea Spinelli
> Assignee: Sebastian Nagel
> Priority: Major
> Fix For: 1.21
>
> Attachments: blacklist_whitelist_plugin.patch,
> nutch-585-excludeNodes.patch, nutch-585-jostens-excludeDIVs.patch
>
>
> We are using nutch to index our own web sites; we would like not to index
> certain parts of our pages, because we know they are not relevant (for
> instance, there are several links to change the background color) and
> generate spurious matches.
> We have modified the plugin so that it ignores HTML code between certain HTML
> comments, like
>
> ... ignored part ...
>
> We feel this might be useful to someone else, maybe factoring out the comment
> strings as constants in the configuration files (say parser.html.ignore.start
> and parser.html.ignore.stop in nutch-site.xml).
> We are almost ready to contribute our code snippet. Looking forward to any
> expression of interest - or to an explanation of why what we are doing is
> plain wrong!
[jira] [Resolved] (NUTCH-3032) Indexing plugin as an adapter for end user's own POJO instances
[ https://issues.apache.org/jira/browse/NUTCH-3032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joe Gilvary resolved NUTCH-3032. Resolution: Fixed I believe this meets all the goals in the discussions now. > Indexing plugin as an adapter for end user's own POJO instances > --- > > Key: NUTCH-3032 > URL: https://issues.apache.org/jira/browse/NUTCH-3032 > Project: Nutch > Issue Type: Improvement > Components: indexer >Reporter: Joe Gilvary >Assignee: Joe Gilvary >Priority: Major > Labels: indexing > Fix For: 1.20 > > Attachments: NUTCH-3032.patch > > > It could be helpful to let end users manipulate information at indexing time > with their own code without the need for writing their own indexing plugin. I > mentioned this on the dev mailing list > (https://www.mail-archive.com/dev@nutch.apache.org/msg31190.html) with some > description of my work in progress. > One potential use is to address some of the same concerns that NUTCH-585 > discusses regarding an alternative approach to picking and choosing which > content to index, but this approach would allow making index time decisions, > rather than setting the configuration for all content at the start of the > indexing run. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Work started] (NUTCH-3032) Indexing plugin as an adapter for end user's own POJO instances
[ https://issues.apache.org/jira/browse/NUTCH-3032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on NUTCH-3032 started by Joe Gilvary. -- > Indexing plugin as an adapter for end user's own POJO instances > --- > > Key: NUTCH-3032 > URL: https://issues.apache.org/jira/browse/NUTCH-3032 > Project: Nutch > Issue Type: Improvement > Components: indexer >Reporter: Joe Gilvary >Assignee: Joe Gilvary >Priority: Major > Labels: indexing > Fix For: 1.20 > > Attachments: NUTCH-3032.patch > > > It could be helpful to let end users manipulate information at indexing time > with their own code without the need for writing their own indexing plugin. I > mentioned this on the dev mailing list > (https://www.mail-archive.com/dev@nutch.apache.org/msg31190.html) with some > description of my work in progress. > One potential use is to address some of the same concerns that NUTCH-585 > discusses regarding an alternative approach to picking and choosing which > content to index, but this approach would allow making index time decisions, > rather than setting the configuration for all content at the start of the > indexing run. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (NUTCH-3032) Indexing plugin as an adapter for end user's own POJO instances
[ https://issues.apache.org/jira/browse/NUTCH-3032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joe Gilvary updated NUTCH-3032: --- Patch Info: Patch Available > Indexing plugin as an adapter for end user's own POJO instances > --- > > Key: NUTCH-3032 > URL: https://issues.apache.org/jira/browse/NUTCH-3032 > Project: Nutch > Issue Type: Improvement > Components: indexer >Reporter: Joe Gilvary >Priority: Major > Labels: indexing > Attachments: NUTCH-3032.patch > > > It could be helpful to let end users manipulate information at indexing time > with their own code without the need for writing their own indexing plugin. I > mentioned this on the dev mailing list > (https://www.mail-archive.com/dev@nutch.apache.org/msg31190.html) with some > description of my work in progress. > One potential use is to address some of the same concerns that NUTCH-585 > discusses regarding an alternative approach to picking and choosing which > content to index, but this approach would allow making index time decisions, > rather than setting the configuration for all content at the start of the > indexing run. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (NUTCH-3032) Indexing plugin as an adapter for end user's own POJO instances
[ https://issues.apache.org/jira/browse/NUTCH-3032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17825873#comment-17825873 ] Joe Gilvary edited comment on NUTCH-3032 at 3/14/24 11:05 PM: -- -Done!- Updated the patch file 2024-03-14 because it had an extraneous file from the tests that wasn't actually used in the tests I included. was (Author: JIRAUSER304553): Done! > Indexing plugin as an adapter for end user's own POJO instances > --- > > Key: NUTCH-3032 > URL: https://issues.apache.org/jira/browse/NUTCH-3032 > Project: Nutch > Issue Type: Improvement > Components: indexer >Reporter: Joe Gilvary >Priority: Major > Labels: indexing > Attachments: NUTCH-3032.patch > > > It could be helpful to let end users manipulate information at indexing time > with their own code without the need for writing their own indexing plugin. I > mentioned this on the dev mailing list > (https://www.mail-archive.com/dev@nutch.apache.org/msg31190.html) with some > description of my work in progress. > One potential use is to address some of the same concerns that NUTCH-585 > discusses regarding an alternative approach to picking and choosing which > content to index, but this approach would allow making index time decisions, > rather than setting the configuration for all content at the start of the > indexing run. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (NUTCH-3032) Indexing plugin as an adapter for end user's own POJO instances
[ https://issues.apache.org/jira/browse/NUTCH-3032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joe Gilvary updated NUTCH-3032: --- Attachment: NUTCH-3032.patch > Indexing plugin as an adapter for end user's own POJO instances > --- > > Key: NUTCH-3032 > URL: https://issues.apache.org/jira/browse/NUTCH-3032 > Project: Nutch > Issue Type: Improvement > Components: indexer >Reporter: Joe Gilvary >Priority: Major > Labels: indexing > Attachments: NUTCH-3032.patch > > > It could be helpful to let end users manipulate information at indexing time > with their own code without the need for writing their own indexing plugin. I > mentioned this on the dev mailing list > (https://www.mail-archive.com/dev@nutch.apache.org/msg31190.html) with some > description of my work in progress. > One potential use is to address some of the same concerns that NUTCH-585 > discusses regarding an alternative approach to picking and choosing which > content to index, but this approach would allow making index time decisions, > rather than setting the configuration for all content at the start of the > indexing run. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (NUTCH-3032) Indexing plugin as an adapter for end user's own POJO instances
[ https://issues.apache.org/jira/browse/NUTCH-3032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joe Gilvary updated NUTCH-3032: --- Attachment: (was: NUTCH-3032.patch) > Indexing plugin as an adapter for end user's own POJO instances > --- > > Key: NUTCH-3032 > URL: https://issues.apache.org/jira/browse/NUTCH-3032 > Project: Nutch > Issue Type: Improvement > Components: indexer >Reporter: Joe Gilvary >Priority: Major > Labels: indexing > Attachments: NUTCH-3032.patch > > > It could be helpful to let end users manipulate information at indexing time > with their own code without the need for writing their own indexing plugin. I > mentioned this on the dev mailing list > (https://www.mail-archive.com/dev@nutch.apache.org/msg31190.html) with some > description of my work in progress. > One potential use is to address some of the same concerns that NUTCH-585 > discusses regarding an alternative approach to picking and choosing which > content to index, but this approach would allow making index time decisions, > rather than setting the configuration for all content at the start of the > indexing run. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (NUTCH-3032) Indexing plugin as an adapter for end user's own POJO instances
[ https://issues.apache.org/jira/browse/NUTCH-3032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17825873#comment-17825873 ] Joe Gilvary commented on NUTCH-3032: Done! > Indexing plugin as an adapter for end user's own POJO instances > --- > > Key: NUTCH-3032 > URL: https://issues.apache.org/jira/browse/NUTCH-3032 > Project: Nutch > Issue Type: Improvement > Components: indexer >Reporter: Joe Gilvary >Priority: Major > Labels: indexing > Attachments: NUTCH-3032.patch > > > It could be helpful to let end users manipulate information at indexing time > with their own code without the need for writing their own indexing plugin. I > mentioned this on the dev mailing list > (https://www.mail-archive.com/dev@nutch.apache.org/msg31190.html) with some > description of my work in progress. > One potential use is to address some of the same concerns that NUTCH-585 > discusses regarding an alternative approach to picking and choosing which > content to index, but this approach would allow making index time decisions, > rather than setting the configuration for all content at the start of the > indexing run. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (NUTCH-3032) Indexing plugin as an adapter for end user's own POJO instances
[ https://issues.apache.org/jira/browse/NUTCH-3032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joe Gilvary updated NUTCH-3032: --- Attachment: NUTCH-3032.patch > Indexing plugin as an adapter for end user's own POJO instances > --- > > Key: NUTCH-3032 > URL: https://issues.apache.org/jira/browse/NUTCH-3032 > Project: Nutch > Issue Type: Improvement > Components: indexer >Reporter: Joe Gilvary >Priority: Major > Labels: indexing > Attachments: NUTCH-3032.patch > > > It could be helpful to let end users manipulate information at indexing time > with their own code without the need for writing their own indexing plugin. I > mentioned this on the dev mailing list > (https://www.mail-archive.com/dev@nutch.apache.org/msg31190.html) with some > description of my work in progress. > One potential use is to address some of the same concerns that NUTCH-585 > discusses regarding an alternative approach to picking and choosing which > content to index, but this approach would allow making index time decisions, > rather than setting the configuration for all content at the start of the > indexing run. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (NUTCH-3032) Indexing plugin as an adapter for end user's own POJO instances
[ https://issues.apache.org/jira/browse/NUTCH-3032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17825855#comment-17825855 ] Joe Gilvary edited comment on NUTCH-3032 at 3/12/24 11:06 PM: -- I have the code cleaned up and a few Junit tests. When I follow the instructions at https://github.com/apache/nutch/tree/master for contributing, git tells me it doesn't recognize 'fork' ('is not a git command'). Before I do something gittish that will be difficult to remedy, I figured I'd ask for advice. :) Do I just push now, or is there some other version of fork I should be using? was (Author: JIRAUSER304553): I have the code cleaned up and a few Junit tests. When I follow the instructions at https://github.com/apache/nutch/tree/master for contributing, git tells me it doesn't recognize 'fork' is not a git command. Before I do something gittish that will be difficult to remedy, I figured I'd ask for advice. :) Do I just push now, or is there some other version of fork I should be using? > Indexing plugin as an adapter for end user's own POJO instances > --- > > Key: NUTCH-3032 > URL: https://issues.apache.org/jira/browse/NUTCH-3032 > Project: Nutch > Issue Type: Improvement > Components: indexer >Reporter: Joe Gilvary >Priority: Major > Labels: indexing > > It could be helpful to let end users manipulate information at indexing time > with their own code without the need for writing their own indexing plugin. I > mentioned this on the dev mailing list > (https://www.mail-archive.com/dev@nutch.apache.org/msg31190.html) with some > description of my work in progress. > One potential use is to address some of the same concerns that NUTCH-585 > discusses regarding an alternative approach to picking and choosing which > content to index, but this approach would allow making index time decisions, > rather than setting the configuration for all content at the start of the > indexing run. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (NUTCH-3032) Indexing plugin as an adapter for end user's own POJO instances
[ https://issues.apache.org/jira/browse/NUTCH-3032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17825855#comment-17825855 ] Joe Gilvary commented on NUTCH-3032: I have the code cleaned up and a few Junit tests. When I follow the instructions at https://github.com/apache/nutch/tree/master for contributing, git tells me it doesn't recognize 'fork' is not a git command. Before I do something gittish that will be difficult to remedy, I figured I'd ask for advice. :) Do I just push now, or is there some other version of fork I should be using? > Indexing plugin as an adapter for end user's own POJO instances > --- > > Key: NUTCH-3032 > URL: https://issues.apache.org/jira/browse/NUTCH-3032 > Project: Nutch > Issue Type: Improvement > Components: indexer >Reporter: Joe Gilvary >Priority: Major > Labels: indexing > > It could be helpful to let end users manipulate information at indexing time > with their own code without the need for writing their own indexing plugin. I > mentioned this on the dev mailing list > (https://www.mail-archive.com/dev@nutch.apache.org/msg31190.html) with some > description of my work in progress. > One potential use is to address some of the same concerns that NUTCH-585 > discusses regarding an alternative approach to picking and choosing which > content to index, but this approach would allow making index time decisions, > rather than setting the configuration for all content at the start of the > indexing run. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (NUTCH-3032) Indexing plugin as an adapter for end user's own POJO instances
Joe Gilvary created NUTCH-3032: -- Summary: Indexing plugin as an adapter for end user's own POJO instances Key: NUTCH-3032 URL: https://issues.apache.org/jira/browse/NUTCH-3032 Project: Nutch Issue Type: Improvement Components: indexer Reporter: Joe Gilvary It could be helpful to let end users manipulate information at indexing time with their own code without the need for writing their own indexing plugin. I mentioned this on the dev mailing list (https://www.mail-archive.com/dev@nutch.apache.org/msg31190.html) with some description of my work in progress. One potential use is to address some of the same concerns that NUTCH-585 discusses regarding an alternative approach to picking and choosing which content to index, but this approach would allow making index time decisions, rather than setting the configuration for all content at the start of the indexing run. -- This message was sent by Atlassian Jira (v8.20.10#820010)
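The idea in the description above, letting an indexing step call an end user's own class without the user writing a plugin, can be sketched with plain Java reflection. This is a minimal illustration under stated assumptions, not the code in NUTCH-3032.patch: the class PojoAdapterSketch, the sample POJO WordCounter, and the helper invokePojo are hypothetical names, and the sketch assumes the user's class has a public constructor taking one String and a public no-argument method.

```java
import java.lang.reflect.Method;

class PojoAdapterSketch {

    /** Hypothetical end-user POJO: it knows nothing about Nutch and just computes a value. */
    public static class WordCounter {
        private final String text;

        public WordCounter(String text) {
            this.text = text;
        }

        public String count() {
            String trimmed = text.trim();
            return trimmed.isEmpty() ? "0" : String.valueOf(trimmed.split("\\s+").length);
        }
    }

    /**
     * Instantiates the named class with one String constructor argument and
     * invokes the named public no-argument method on the new instance.
     * Reflection failures are rethrown as an unchecked exception.
     */
    static Object invokePojo(String className, String constructorArg, String methodName) {
        try {
            Class<?> clazz = Class.forName(className);
            Object instance = clazz.getConstructor(String.class).newInstance(constructorArg);
            Method method = clazz.getMethod(methodName);
            return method.invoke(instance);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("could not invoke " + className + "." + methodName, e);
        }
    }

    public static void main(String[] args) {
        // In a real setup the class and method names would come from configuration; here they are hard-coded.
        Object value = invokePojo(WordCounter.class.getName(), "crawl parse index", "count");
        System.out.println("wordCount=" + value);
    }
}
```

Because the class and method are looked up by name at run time, the choice of what to compute can be made per document at indexing time, which is the contrast with NUTCH-585 that the description draws.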
[jira] [Commented] (NUTCH-2900) Integrate Nutch with Kerberized Solr Cloud
[ https://issues.apache.org/jira/browse/NUTCH-2900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17515351#comment-17515351 ]

Joe Gilvary commented on NUTCH-2900:

I see a similar error at injection when Solr uses the MultiAuth plugin, even though one of the schemes is "basic" with the solr.BasicAuthPlugin class. Can this issue cover JWT and MultiAuth for the Solr indexer, or should they be filed as distinct issues?

> Integrate Nutch with Kerberized Solr Cloud
> --
>
> Key: NUTCH-2900
> URL: https://issues.apache.org/jira/browse/NUTCH-2900
> Project: Nutch
> Issue Type: New Feature
> Components: indexer
> Affects Versions: 1.18
> Reporter: Geng Hong
> Priority: Major
>
> Currently, we are unable to integrate Nutch with a Solr Cloud that has
> Kerberos authentication enabled. The following error message appears:
>
> WARN auth.HttpAuthenticator - NEGOTIATE authentication error: No valid
> credentials provided (Mechanism level: No valid credentials provided
> (Mechanism level: Failed to find any Kerberos tgt))
>
> Error 401 Authentication required
>
> HTTP ERROR 401
> Problem accessing /solr/admin/collections. Reason:
> Authentication required

--
This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (NUTCH-2823) IllegalStateException in IndexWriters.describe() when validating url param for SolrIndexer
[ https://issues.apache.org/jira/browse/NUTCH-2823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joe Gilvary updated NUTCH-2823:
---
Description:
The string validation for IndexWriters.describe() fails when the value in index-writers.xml is too long. I encountered the exception when using three comma-separated URL values in a config that worked for Nutch 1.15. The schema doesn't allow multiple values, but the documentation says a comma-separated list works. Indexing ran without the exception when I changed to use only one host's URL (Solr Cloud). Sebastian duplicated the error with a long string value for the param, so it's not directly due to the comma-separated values.

While googling I found this thread in the archives where Markus encountered it going from 1.15 to 1.16:
mail-archives.apache.org/mod_mbox/nutch-user/201910.mbox/<05eda22b-14b2-309f-3bc7-d6d85c218...@googlemail.com>

I also found a change in 1.16 that might be relevant: NUTCH-2602
https://issues.apache.org/jira/browse/NUTCH-2602

My stack trace:

java.lang.Exception: java.lang.IllegalStateException: text width is less than 1, was <-26>
    at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:492)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:559)
Caused by: java.lang.IllegalStateException: text width is less than 1, was <-26>
    at org.apache.commons.lang3.Validate.validState(Validate.java:829)
    at de.vandermeer.skb.interfaces.transformers.textformat.Text_To_FormattedText.transform(Text_To_FormattedText.java:215)
    at de.vandermeer.asciitable.AT_Renderer.renderAsCollection(AT_Renderer.java:250)
    at de.vandermeer.asciitable.AT_Renderer.render(AT_Renderer.java:128)
    at de.vandermeer.asciitable.AsciiTable.render(AsciiTable.java:191)
    at org.apache.nutch.indexer.IndexWriters.describe(IndexWriters.java:326)
    at org.apache.nutch.indexer.IndexerOutputFormat.getRecordWriter(IndexerOutputFormat.java:45)
    at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.<init>(ReduceTask.java:542)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:615)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:390)
    at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:347)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Thanks,
Joe

was:
The string validation for IndexWriters.describe() fails when the value in index-writers.xml is too long. I encountered the exception when using three comma-separated URL values in a config that worked for Nutch 1.15. The schema doesn't allow multiple values, but the documentation says a comma-separated list works. Indexing ran without the exception when I changed to use only one host's URL (Solr Cloud). Sebastian duplicated the error with a long string value for the param, so it's not directly due to the comma-separated values.

While googling I found this thread in the archives where Marcus encountered it going from 1.15 to 1.16:
mail-archives.apache.org/mod_mbox/nutch-user/201910.mbox/<05eda22b-14b2-309f-3bc7-d6d85c218...@googlemail.com>

I also found a change in 1.16 that might be relevant: NUTCH-2602
https://issues.apache.org/jira/browse/NUTCH-2602

My stack trace:

java.lang.Exception: java.lang.IllegalStateException: text width is less than 1, was <-26>
    at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:492)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:559)
Caused by: java.lang.IllegalStateException: text width is less than 1, was <-26>
    at org.apache.commons.lang3.Validate.validState(Validate.java:829)
    at de.vandermeer.skb.interfaces.transformers.textformat.Text_To_FormattedText.transform(Text_To_FormattedText.java:215)
    at de.vandermeer.asciitable.AT_Renderer.renderAsCollection(AT_Renderer.java:250)
    at de.vandermeer.asciitable.AT_Renderer.render(AT_Renderer.java:128)
    at de.vandermeer.asciitable.AsciiTable.render(AsciiTable.java:191)
    at org.apache.nutch.indexer.IndexWriters.describe(IndexWriters.java:326)
    at org.apache.nutch.indexer.IndexerOutputFormat.getRecordWriter(IndexerOutputFormat.java:45)
    at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.<init>(ReduceTask.java:542)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:615)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:390)
    at
[jira] [Created] (NUTCH-2823) IllegalStateException in IndexWriters.describe() when validating url param for SolrIndexer
Joe Gilvary created NUTCH-2823:
--
Summary: IllegalStateException in IndexWriters.describe() when validating url param for SolrIndexer
Key: NUTCH-2823
URL: https://issues.apache.org/jira/browse/NUTCH-2823
Project: Nutch
Issue Type: Bug
Components: indexer, plugin
Affects Versions: 1.17, 1.16
Reporter: Joe Gilvary

The string validation for IndexWriters.describe() fails when the value in index-writers.xml is too long. I encountered the exception when using three comma-separated URL values in a config that worked for Nutch 1.15. The schema doesn't allow multiple values, but the documentation says a comma-separated list works. Indexing ran without the exception when I changed to use only one host's URL (Solr Cloud). Sebastian duplicated the error with a long string value for the param, so it's not directly due to the comma-separated values.

While googling I found this thread in the archives where Marcus encountered it going from 1.15 to 1.16:
mail-archives.apache.org/mod_mbox/nutch-user/201910.mbox/<05eda22b-14b2-309f-3bc7-d6d85c218...@googlemail.com>

I also found a change in 1.16 that might be relevant: NUTCH-2602
https://issues.apache.org/jira/browse/NUTCH-2602

My stack trace:

java.lang.Exception: java.lang.IllegalStateException: text width is less than 1, was <-26>
    at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:492)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:559)
Caused by: java.lang.IllegalStateException: text width is less than 1, was <-26>
    at org.apache.commons.lang3.Validate.validState(Validate.java:829)
    at de.vandermeer.skb.interfaces.transformers.textformat.Text_To_FormattedText.transform(Text_To_FormattedText.java:215)
    at de.vandermeer.asciitable.AT_Renderer.renderAsCollection(AT_Renderer.java:250)
    at de.vandermeer.asciitable.AT_Renderer.render(AT_Renderer.java:128)
    at de.vandermeer.asciitable.AsciiTable.render(AsciiTable.java:191)
    at org.apache.nutch.indexer.IndexWriters.describe(IndexWriters.java:326)
    at org.apache.nutch.indexer.IndexerOutputFormat.getRecordWriter(IndexerOutputFormat.java:45)
    at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.<init>(ReduceTask.java:542)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:615)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:390)
    at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:347)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Thanks,
Joe

--
This message was sent by Atlassian Jira (v8.3.4#803005)
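One plausible reading of the "text width is less than 1, was <-26>" message is that a fixed table width leaves a negative budget for a column once a very long value has been accounted for. The sketch below only illustrates that arithmetic; it is an assumption, not the asciitable or Nutch code. The class TextWidthSketch, both helper methods, and the numbers (a 100-character table, 6 border characters, a 120-character value) are made up for the example.

```java
class TextWidthSketch {

    /**
     * Hypothetical width budget: what is left for the last column after the
     * border characters and the other columns are subtracted from a fixed total.
     */
    static int remainingWidth(int totalWidth, int borderChars, int... otherColumnWidths) {
        int remaining = totalWidth - borderChars;
        for (int w : otherColumnWidths) {
            remaining -= w;
        }
        return remaining;
    }

    /** Mimics a state check that rejects a width below 1, using the message format seen in the stack trace. */
    static void requirePositiveWidth(int width) {
        if (width < 1) {
            throw new IllegalStateException("text width is less than 1, was <" + width + ">");
        }
    }

    public static void main(String[] args) {
        // Made-up numbers: 100 - 6 - 120 leaves a budget of -26 characters.
        int width = remainingWidth(100, 6, 120);
        try {
            requirePositiveWidth(width);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

If the real cause has this shape, it would be consistent with the report that a single long string triggers the error just as well as a comma-separated list does.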