[jira] [Commented] (NUTCH-3057) Arbitrary indexer "leaks" previous value into a field processed after an exception

2024-05-18 Thread Joe Gilvary (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17847521#comment-17847521
 ] 

Joe Gilvary commented on NUTCH-3057:


Happy Saturday, [~lewi...@apache.org],

I worked on the plugin and this fix with some Raspberry Pi hosts at home, but 
of course, found the error at work. I didn't see it until I was running with 
the 1.20 release in a pre-prod system. I set up individual POJOs for a few 
fields and added a typo in nutch-site.xml. As soon as I saw the exception 
during indexing and what made it into Solr, I knew what was wrong. A D'oh! 
moment indeed.

Let me know, please, if there's anything else I need to do, process-wise, to 
have this correct for the next distro.

> Arbitrary indexer "leaks" previous value into a field processed after an 
> exception
> --
>
> Key: NUTCH-3057
> URL: https://issues.apache.org/jira/browse/NUTCH-3057
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: 1.20
>Reporter: Joe Gilvary
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (NUTCH-3057) Arbitrary indexer "leaks" previous value into a field processed after an exception

2024-05-17 Thread Joe Gilvary (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17847453#comment-17847453
 ] 

Joe Gilvary commented on NUTCH-3057:


The arbitrary indexer plug-in can add multiple new fields to a doc by appending 
numeric suffixes to the config values for each. If an exception interferes with 
setting a value and there's a config for a subsequent field to process, the 
plug-in can insert the wrong value for that subsequently configured field.
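
As an illustration only (this is not the plug-in's code), one shape this kind of leak can take is a loop that keeps the computed value in a variable declared outside the loop, so that a swallowed exception leaves an earlier field's value in place. The class, method, and field names below are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical stand-in for a per-field indexing loop; not the actual Nutch plug-in code.
final class FieldLeakSketch {

  // Leaky shape: 'value' lives outside the loop and survives a failed iteration.
  static Map<String, String> leaky(Map<String, Supplier<String>> fields) {
    Map<String, String> doc = new LinkedHashMap<>();
    String value = null;
    for (Map.Entry<String, Supplier<String>> field : fields.entrySet()) {
      try {
        value = field.getValue().get();
      } catch (RuntimeException e) {
        // Exception swallowed: 'value' still holds the previous field's result.
      }
      if (value != null) {
        doc.put(field.getKey(), value);
      }
    }
    return doc;
  }

  // Fixed shape: a fresh 'value' per field, and a failed field is skipped.
  static Map<String, String> fixed(Map<String, Supplier<String>> fields) {
    Map<String, String> doc = new LinkedHashMap<>();
    for (Map.Entry<String, Supplier<String>> field : fields.entrySet()) {
      String value = null;
      try {
        value = field.getValue().get();
      } catch (RuntimeException e) {
        continue; // nothing is written for this field
      }
      if (value != null) {
        doc.put(field.getKey(), value);
      }
    }
    return doc;
  }

  public static void main(String[] args) {
    Map<String, Supplier<String>> fields = new LinkedHashMap<>();
    fields.put("a", () -> "A");
    fields.put("b", () -> { throw new IllegalStateException("simulated config typo"); });
    fields.put("c", () -> "C");
    System.out.println(leaky(fields)); // {a=A, b=A, c=C}
    System.out.println(fixed(fields)); // {a=A, c=C}
  }
}
```

With fields a, b, and c, where computing b throws, the leaky version stores a's value under b, while the fixed version simply leaves b out.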

> Arbitrary indexer "leaks" previous value into a field processed after an 
> exception
> --
>
> Key: NUTCH-3057
> URL: https://issues.apache.org/jira/browse/NUTCH-3057
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: 1.20
>Reporter: Joe Gilvary
>Priority: Major
>






[jira] [Created] (NUTCH-3057) Arbitrary indexer "leaks" previous value into a field processed after an exception

2024-05-17 Thread Joe Gilvary (Jira)
Joe Gilvary created NUTCH-3057:
--

 Summary: Arbitrary indexer "leaks" previous value into a field 
processed after an exception
 Key: NUTCH-3057
 URL: https://issues.apache.org/jira/browse/NUTCH-3057
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.20
Reporter: Joe Gilvary








[jira] [Commented] (NUTCH-585) [PARSE-HTML plugin] Block certain parts of HTML code from being indexed

2024-04-30 Thread Joe Gilvary (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842526#comment-17842526
 ] 

Joe Gilvary commented on NUTCH-585:
---

[~dbeckstrom] I'm not sure which patch you were asking about. I used the source 
for the new 1.20 release and applied the patch that [~ad-...@gmx.at] posted 
after an edit to the line numbers for the update to src/plugin/build.xml. It 
built cleanly and seems to work exactly as advertised in my tests with 
indexchecker.

> [PARSE-HTML plugin] Block certain parts of HTML code from being indexed
> ---
>
> Key: NUTCH-585
> URL: https://issues.apache.org/jira/browse/NUTCH-585
> Project: Nutch
>  Issue Type: Improvement
>  Components: HTML, parse-filter, parser, plugin
>Affects Versions: 0.9.0
> Environment: All operating systems
>Reporter: Andrea Spinelli
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.21
>
> Attachments: blacklist_whitelist_plugin.patch, 
> nutch-585-excludeNodes.patch, nutch-585-jostens-excludeDIVs.patch
>
>
> We are using nutch to index our own web sites; we would like not to index 
> certain parts of our pages, because we know they are not relevant (for 
> instance, there are several links to change the background color) and 
> generate spurious matches.
> We have modified the plugin so that it ignores HTML code between certain HTML 
> comments, like
> 
> ... ignored part ...
> 
> We feel this might be useful to someone else, maybe factorizing the comment 
> strings as constants in the configuration files (say parser.html.ignore.start 
> and parser.html.ignore.stop in nutch-site.xml).
> We are almost ready to contribute our code snippet. Looking forward to any 
> expression of interest - or to an explanation of why what we are doing is 
> plain wrong!
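
A minimal sketch of the idea in the quoted description, assuming the start and stop markers are literal strings read from configuration. The marker comments used in the example are placeholders, since the originals did not survive in the quoted text, and the class and method names are hypothetical rather than taken from any of the attached patches.

```java
// Hypothetical helper: drops everything between a start marker and a stop marker.
final class IgnoreBlockSketch {

  static String stripIgnored(String html, String start, String stop) {
    StringBuilder out = new StringBuilder();
    int pos = 0;
    while (true) {
      int s = html.indexOf(start, pos);
      if (s < 0) {
        out.append(html, pos, html.length()); // no more ignored blocks
        break;
      }
      out.append(html, pos, s); // keep the text before the start marker
      int e = html.indexOf(stop, s + start.length());
      if (e < 0) {
        break; // unterminated block: drop the rest of the page
      }
      pos = e + stop.length();
    }
    return out.toString();
  }

  public static void main(String[] args) {
    // Placeholder markers; the real values would come from configuration.
    String start = "<!-- ignore-start -->";
    String stop = "<!-- ignore-stop -->";
    String page = "<p>keep</p>" + start + "<a>skip</a>" + stop + "<p>also keep</p>";
    System.out.println(stripIgnored(page, start, stop)); // <p>keep</p><p>also keep</p>
  }
}
```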





[jira] [Resolved] (NUTCH-3032) Indexing plugin as an adapter for end user's own POJO instances

2024-03-31 Thread Joe Gilvary (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joe Gilvary resolved NUTCH-3032.

Resolution: Fixed

I believe this meets all the goals in the discussions now.

> Indexing plugin as an adapter for end user's own POJO instances
> ---
>
> Key: NUTCH-3032
> URL: https://issues.apache.org/jira/browse/NUTCH-3032
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Reporter: Joe Gilvary
>Assignee: Joe Gilvary
>Priority: Major
>  Labels: indexing
> Fix For: 1.20
>
> Attachments: NUTCH-3032.patch
>
>
> It could be helpful to let end users manipulate information at indexing time 
> with their own code without the need for writing their own indexing plugin. I 
> mentioned this on the dev mailing list 
> (https://www.mail-archive.com/dev@nutch.apache.org/msg31190.html) with some 
> description of my work in progress.
> One potential use is to address some of the same concerns that NUTCH-585 
> discusses regarding an alternative approach to picking and choosing which 
> content to index, but this approach would allow making index time decisions, 
> rather than setting the configuration for all content at the start of the 
> indexing run.
>  





[jira] [Work started] (NUTCH-3032) Indexing plugin as an adapter for end user's own POJO instances

2024-03-30 Thread Joe Gilvary (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-3032 started by Joe Gilvary.
--
> Indexing plugin as an adapter for end user's own POJO instances
> ---
>
> Key: NUTCH-3032
> URL: https://issues.apache.org/jira/browse/NUTCH-3032
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Reporter: Joe Gilvary
>Assignee: Joe Gilvary
>Priority: Major
>  Labels: indexing
> Fix For: 1.20
>
> Attachments: NUTCH-3032.patch





[jira] [Updated] (NUTCH-3032) Indexing plugin as an adapter for end user's own POJO instances

2024-03-14 Thread Joe Gilvary (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joe Gilvary updated NUTCH-3032:
---
Patch Info: Patch Available

> Indexing plugin as an adapter for end user's own POJO instances
> ---
>
> Key: NUTCH-3032
> URL: https://issues.apache.org/jira/browse/NUTCH-3032
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Reporter: Joe Gilvary
>Priority: Major
>  Labels: indexing
> Attachments: NUTCH-3032.patch





[jira] [Comment Edited] (NUTCH-3032) Indexing plugin as an adapter for end user's own POJO instances

2024-03-14 Thread Joe Gilvary (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17825873#comment-17825873
 ] 

Joe Gilvary edited comment on NUTCH-3032 at 3/14/24 11:05 PM:
--

-Done!-

Updated the patch file 2024-03-14 because it had an extraneous file from the 
tests that wasn't actually used in the tests I included.


was (Author: JIRAUSER304553):
Done!

> Indexing plugin as an adapter for end user's own POJO instances
> ---
>
> Key: NUTCH-3032
> URL: https://issues.apache.org/jira/browse/NUTCH-3032
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Reporter: Joe Gilvary
>Priority: Major
>  Labels: indexing
> Attachments: NUTCH-3032.patch





[jira] [Updated] (NUTCH-3032) Indexing plugin as an adapter for end user's own POJO instances

2024-03-14 Thread Joe Gilvary (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joe Gilvary updated NUTCH-3032:
---
Attachment: NUTCH-3032.patch

> Indexing plugin as an adapter for end user's own POJO instances
> ---
>
> Key: NUTCH-3032
> URL: https://issues.apache.org/jira/browse/NUTCH-3032
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Reporter: Joe Gilvary
>Priority: Major
>  Labels: indexing
> Attachments: NUTCH-3032.patch





[jira] [Updated] (NUTCH-3032) Indexing plugin as an adapter for end user's own POJO instances

2024-03-14 Thread Joe Gilvary (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joe Gilvary updated NUTCH-3032:
---
Attachment: (was: NUTCH-3032.patch)

> Indexing plugin as an adapter for end user's own POJO instances
> ---
>
> Key: NUTCH-3032
> URL: https://issues.apache.org/jira/browse/NUTCH-3032
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Reporter: Joe Gilvary
>Priority: Major
>  Labels: indexing
> Attachments: NUTCH-3032.patch





[jira] [Commented] (NUTCH-3032) Indexing plugin as an adapter for end user's own POJO instances

2024-03-12 Thread Joe Gilvary (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17825873#comment-17825873
 ] 

Joe Gilvary commented on NUTCH-3032:


Done!

> Indexing plugin as an adapter for end user's own POJO instances
> ---
>
> Key: NUTCH-3032
> URL: https://issues.apache.org/jira/browse/NUTCH-3032
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Reporter: Joe Gilvary
>Priority: Major
>  Labels: indexing
> Attachments: NUTCH-3032.patch





[jira] [Updated] (NUTCH-3032) Indexing plugin as an adapter for end user's own POJO instances

2024-03-12 Thread Joe Gilvary (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joe Gilvary updated NUTCH-3032:
---
Attachment: NUTCH-3032.patch

> Indexing plugin as an adapter for end user's own POJO instances
> ---
>
> Key: NUTCH-3032
> URL: https://issues.apache.org/jira/browse/NUTCH-3032
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Reporter: Joe Gilvary
>Priority: Major
>  Labels: indexing
> Attachments: NUTCH-3032.patch





[jira] [Comment Edited] (NUTCH-3032) Indexing plugin as an adapter for end user's own POJO instances

2024-03-12 Thread Joe Gilvary (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17825855#comment-17825855
 ] 

Joe Gilvary edited comment on NUTCH-3032 at 3/12/24 11:06 PM:
--

I have the code cleaned up and a few Junit tests. When I follow the 
instructions at https://github.com/apache/nutch/tree/master for contributing, 
git tells me it doesn't recognize 'fork' ('is not a git command'). Before I do 
something gittish that will be difficult to remedy, I figured I'd ask for 
advice. :) Do I just push now, or is there some other version of fork I should 
be using?


was (Author: JIRAUSER304553):
I have the code cleaned up and a few Junit tests. When I follow the 
instructions at https://github.com/apache/nutch/tree/master for contributing, 
git tells me it doesn't recognize 'fork' is not a git command. Before I do 
something gittish that will be difficult to remedy, I figured I'd ask for 
advice. :) Do I just push now, or is there some other version of fork I should 
be using?

> Indexing plugin as an adapter for end user's own POJO instances
> ---
>
> Key: NUTCH-3032
> URL: https://issues.apache.org/jira/browse/NUTCH-3032
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Reporter: Joe Gilvary
>Priority: Major
>  Labels: indexing





[jira] [Commented] (NUTCH-3032) Indexing plugin as an adapter for end user's own POJO instances

2024-03-12 Thread Joe Gilvary (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17825855#comment-17825855
 ] 

Joe Gilvary commented on NUTCH-3032:


I have the code cleaned up and a few Junit tests. When I follow the 
instructions at https://github.com/apache/nutch/tree/master for contributing, 
git tells me it doesn't recognize 'fork' ('is not a git command'). Before I do 
something gittish that will be difficult to remedy, I figured I'd ask for 
advice. :) Do I just push now, or is there some other version of fork I should 
be using?

> Indexing plugin as an adapter for end user's own POJO instances
> ---
>
> Key: NUTCH-3032
> URL: https://issues.apache.org/jira/browse/NUTCH-3032
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Reporter: Joe Gilvary
>Priority: Major
>  Labels: indexing





[jira] [Created] (NUTCH-3032) Indexing plugin as an adapter for end user's own POJO instances

2024-03-10 Thread Joe Gilvary (Jira)
Joe Gilvary created NUTCH-3032:
--

 Summary: Indexing plugin as an adapter for end user's own POJO 
instances
 Key: NUTCH-3032
 URL: https://issues.apache.org/jira/browse/NUTCH-3032
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Reporter: Joe Gilvary


It could be helpful to let end users manipulate information at indexing time 
with their own code without the need for writing their own indexing plugin. I 
mentioned this on the dev mailing list 
(https://www.mail-archive.com/dev@nutch.apache.org/msg31190.html) with some 
description of my work in progress.

One potential use is to address some of the same concerns that NUTCH-585 
discusses regarding an alternative approach to picking and choosing which 
content to index, but this approach would allow making index time decisions, 
rather than setting the configuration for all content at the start of the 
indexing run.
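
To make the idea concrete, here is a rough sketch of the adapter pattern described above. It assumes the plugin instantiates a user-named class reflectively and calls a user-named no-argument method whose return value becomes the field value. The class names, the method name, and the single String constructor argument are all hypothetical and are not taken from the patch.

```java
import java.lang.reflect.Method;
import java.util.Locale;

// Hypothetical adapter: builds a user POJO by class name and calls one of its methods.
final class PojoAdapterSketch {

  static Object fieldValue(String className, String methodName, String arg) {
    try {
      Class<?> cls = Class.forName(className);
      Object pojo = cls.getDeclaredConstructor(String.class).newInstance(arg);
      Method method = cls.getDeclaredMethod(methodName);
      return method.invoke(pojo);
    } catch (ReflectiveOperationException e) {
      // A misspelled class or method name in the configuration would surface here.
      throw new IllegalStateException("Cannot compute field value via " + className, e);
    }
  }

  public static void main(String[] args) {
    Object value = fieldValue(UpperCaseTitle.class.getName(), "compute", "nutch");
    System.out.println(value); // NUTCH
  }
}

// Hypothetical user class: plain Java with no Nutch dependencies.
final class UpperCaseTitle {
  private final String title;

  public UpperCaseTitle(String title) {
    this.title = title;
  }

  public String compute() {
    return title.toUpperCase(Locale.ROOT);
  }
}
```

The point of the design is that the user class stays an ordinary POJO, and only the adapter needs to know about class and method names from the configuration.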

 





[jira] [Commented] (NUTCH-2900) Integrate Nutch with Kerberized Solr Cloud

2022-03-31 Thread Joe Gilvary (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17515351#comment-17515351
 ] 

Joe Gilvary commented on NUTCH-2900:


I see a similar error at injection when Solr uses the MultiAuth plugin even 
though one of the schemes is "basic" with the solr.BasicAuthPlugin class. 

 

Can this issue cover JWT and MultiAuth for the Solr indexer, or should they go 
as distinct issues?

> Integrate Nutch with Kerberized Solr Cloud
> --
>
> Key: NUTCH-2900
> URL: https://issues.apache.org/jira/browse/NUTCH-2900
> Project: Nutch
>  Issue Type: New Feature
>  Components: indexer
>Affects Versions: 1.18
>Reporter: Geng Hong
>Priority: Major
>
> Currently, we are unable to integrate Nutch with a Solr Cloud that has 
> Kerberos authentication enabled. The following error message appears:
>
> WARN auth.HttpAuthenticator - NEGOTIATE authentication error: No valid 
> credentials provided (Mechanism level: No valid credentials provided 
> (Mechanism level: Failed to find any Kerberos tgt))
>
> Error 401 Authentication required
>
> HTTP ERROR 401
> Problem accessing /solr/admin/collections. Reason:
>     Authentication required
>



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (NUTCH-2823) IllegalStateException in IndexWriters.describe() when validating url param for SolrIndexer

2020-08-13 Thread Joe Gilvary (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joe Gilvary updated NUTCH-2823:
---
Description: 
The string validation for the IndexWriters.describe() fails when the value in 
index-writers.xml is too long.

I encountered the exception when using three comma-separated URL values in a 
config that worked for Nutch 1.15. The schema doesn't allow multiple values, but 
the documentation says a comma-separated list works.

Indexing ran without the exception when I changed to use only one host's URL 
(Solr Cloud). Sebastian duplicated the error with a long string value for the 
param, so it's not directly due to the comma separated values.
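
The report does not say how this should be fixed. One defensive option, sketched here purely as an assumption, is to shorten long parameter values before they are handed to a fixed-width table renderer. The class and method names are hypothetical and are not part of Nutch or the asciitable library.

```java
// Hypothetical helper: shortens a value so it fits a fixed display width.
final class DescribeWidthSketch {

  static String abbreviate(String value, int maxWidth) {
    if (maxWidth < 4) {
      throw new IllegalArgumentException("maxWidth must be at least 4");
    }
    if (value == null || value.length() <= maxWidth) {
      return value; // short enough already
    }
    // Keep the head of the value and mark the cut with three dots.
    return value.substring(0, maxWidth - 3) + "...";
  }

  public static void main(String[] args) {
    // Example hosts are placeholders, not taken from the report.
    String urls = "http://solr1.example.org:8983/solr/nutch,"
        + "http://solr2.example.org:8983/solr/nutch,"
        + "http://solr3.example.org:8983/solr/nutch";
    System.out.println(abbreviate(urls, 40));
  }
}
```

This only changes what is displayed; the full value would still be used for the actual connection.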

While googling I found this thread in the archives where Markus encountered it 
going from 1.15 to 1.16:

mail-archives.apache.org/mod_mbox/nutch-user/201910.mbox/<05eda22b-14b2-309f-3bc7-d6d85c218...@googlemail.com>

I also found a change in 1.16 that might be relevant: NUTCH-2602
 https://issues.apache.org/jira/browse/NUTCH-2602

My stack trace:

java.lang.Exception: java.lang.IllegalStateException: text width is less than 1, was <-26>
    at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:492)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:559)
Caused by: java.lang.IllegalStateException: text width is less than 1, was <-26>
    at org.apache.commons.lang3.Validate.validState(Validate.java:829)
    at de.vandermeer.skb.interfaces.transformers.textformat.Text_To_FormattedText.transform(Text_To_FormattedText.java:215)
    at de.vandermeer.asciitable.AT_Renderer.renderAsCollection(AT_Renderer.java:250)
    at de.vandermeer.asciitable.AT_Renderer.render(AT_Renderer.java:128)
    at de.vandermeer.asciitable.AsciiTable.render(AsciiTable.java:191)
    at org.apache.nutch.indexer.IndexWriters.describe(IndexWriters.java:326)
    at org.apache.nutch.indexer.IndexerOutputFormat.getRecordWriter(IndexerOutputFormat.java:45)
    at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.<init>(ReduceTask.java:542)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:615)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:390)
    at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:347)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

 

 Thanks,

 Joe

  was:
The string validation for the IndexWriters.describe() fails when the value in 
index-writers.xml is too long.

I encountered the exception when using three comma-separated URL values in a 
config that worked for Nutch 1.15. The schema doesn't allow multiple values, but 
the documentation says a comma-separated list works.

Indexing ran without the exception when I changed to use only one host's URL 
(Solr Cloud). Sebastian duplicated the error with a long string value for the 
param, so it's not directly due to the comma separated values.

While googling I found this thread in the archives where Marcus encountered it 
going from 1.15 to 1.16:

mail-archives.apache.org/mod_mbox/nutch-user/201910.mbox/<05eda22b-14b2-309f-3bc7-d6d85c218...@googlemail.com>

I also found a change in 1.16 that might be relevant: NUTCH-2602
https://issues.apache.org/jira/browse/NUTCH-2602

My stack trace:

java.lang.Exception: java.lang.IllegalStateException: text width is less than 1, was <-26>
    at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:492)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:559)
Caused by: java.lang.IllegalStateException: text width is less than 1, was <-26>
    at org.apache.commons.lang3.Validate.validState(Validate.java:829)
    at de.vandermeer.skb.interfaces.transformers.textformat.Text_To_FormattedText.transform(Text_To_FormattedText.java:215)
    at de.vandermeer.asciitable.AT_Renderer.renderAsCollection(AT_Renderer.java:250)
    at de.vandermeer.asciitable.AT_Renderer.render(AT_Renderer.java:128)
    at de.vandermeer.asciitable.AsciiTable.render(AsciiTable.java:191)
    at org.apache.nutch.indexer.IndexWriters.describe(IndexWriters.java:326)
    at org.apache.nutch.indexer.IndexerOutputFormat.getRecordWriter(IndexerOutputFormat.java:45)
    at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.<init>(ReduceTask.java:542)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:615)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:390)
    at

[jira] [Created] (NUTCH-2823) IllegalStateException in IndexWriters.describe() when validating url param for SolrIndexer

2020-08-13 Thread Joe Gilvary (Jira)
Joe Gilvary created NUTCH-2823:
--

 Summary: IllegalStateException in IndexWriters.describe() when 
validating url param for SolrIndexer
 Key: NUTCH-2823
 URL: https://issues.apache.org/jira/browse/NUTCH-2823
 Project: Nutch
  Issue Type: Bug
  Components: indexer, plugin
Affects Versions: 1.17, 1.16
Reporter: Joe Gilvary


The string validation for the IndexWriters.describe() fails when the value in 
index-writers.xml is too long.

I encountered the exception when using three comma-separated URL values in a 
config that worked for Nutch 1.15. The schema doesn't allow multiple values, but 
the documentation says a comma-separated list works.

Indexing ran without the exception when I changed to use only one host's URL 
(Solr Cloud). Sebastian duplicated the error with a long string value for the 
param, so it's not directly due to the comma separated values.

While googling I found this thread in the archives where Marcus encountered it 
going from 1.15 to 1.16:

mail-archives.apache.org/mod_mbox/nutch-user/201910.mbox/<05eda22b-14b2-309f-3bc7-d6d85c218...@googlemail.com>

I also found a change in 1.16 that might be relevant: NUTCH-2602
https://issues.apache.org/jira/browse/NUTCH-2602

My stack trace:

java.lang.Exception: java.lang.IllegalStateException: text width is less than 1, was <-26>
    at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:492)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:559)
Caused by: java.lang.IllegalStateException: text width is less than 1, was <-26>
    at org.apache.commons.lang3.Validate.validState(Validate.java:829)
    at de.vandermeer.skb.interfaces.transformers.textformat.Text_To_FormattedText.transform(Text_To_FormattedText.java:215)
    at de.vandermeer.asciitable.AT_Renderer.renderAsCollection(AT_Renderer.java:250)
    at de.vandermeer.asciitable.AT_Renderer.render(AT_Renderer.java:128)
    at de.vandermeer.asciitable.AsciiTable.render(AsciiTable.java:191)
    at org.apache.nutch.indexer.IndexWriters.describe(IndexWriters.java:326)
    at org.apache.nutch.indexer.IndexerOutputFormat.getRecordWriter(IndexerOutputFormat.java:45)
    at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.<init>(ReduceTask.java:542)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:615)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:390)
    at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:347)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

 

 Thanks,

 Joe



--
This message was sent by Atlassian Jira
(v8.3.4#803005)