Re: Help posting question

2024-04-25 Thread Sebastian Nagel

Hi Sheham,

the nutch-site.xml configures

  <property>
    <name>mapreduce.task.timeout</name>
    <value>1800</value>
  </property>

1800 milliseconds (1.8 seconds) is a very short timeout. The default is 600
seconds, i.e. 10 minutes, see [1]. Since the fetcher needs to finish before the
task timeout applies, fetcher threads that are not fast enough and are still
running at the end get killed.


I would suggest keeping the property "mapreduce.task.timeout" at its default
value.
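For illustration, a minimal sketch of what the nutch-site.xml entry could look
like if you prefer to state the value explicitly instead of just removing the
override (600000 milliseconds is the Hadoop default):

  <property>
    <name>mapreduce.task.timeout</name>
    <value>600000</value>
    <!-- 600000 ms = 10 minutes; deleting the override from nutch-site.xml
         has the same effect -->
  </property>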

Best,
Sebastian

[1] 
https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml#mapreduce.task.timeout


On 4/24/24 16:38, Lewis John McGibbney wrote:

Hi Sheham,

On 2024/04/20 08:47:41 Sheham Izat wrote:


The Fetcher job was aborted, does that still mean that it went through the
entire list of seed urls?


Yes it processed the entire generated segment but the fetcher…

* hung on https://disneyland.disney.go.com/, https://api.onlyoffice.com/,  
https://www.adu.com/ and https://www.lowes.com/
* was denied by robots.txt for https://sourceforge.net/, 
https://onsclothing.com/, https://kinto-usa.com/, https://twitter.com/, 
https://www.linkedin.com/, etc.
* encountered problems processing some robots.txt files for 
https://twitter.com/, https://www.trustradius.com/
There may be some other issues encountered by the fetcher.

This is not at all uncommon. The fetcher completed successfully after 7
seconds. You can proceed with your crawl.



I will go through the mailing list questions.


If you need more assistance please let us know. You will find plenty of 
pointers on this mailing list archive though.

lewismc


Re: [VOTE] Apache Nutch 1.20 Release

2024-04-11 Thread Sebastian Nagel

Hi Lewis,

here's my +1

 * signatures of release packages are valid
 * build from the source package successful, unit tests pass
 * tested a few Nutch tools in the binary package (local mode)
 * ran a sample crawl and tested many Nutch tools on a single-node cluster
   running Hadoop 3.4.0, see
   https://github.com/sebastian-nagel/nutch-test-single-node-cluster/

One note about the CHANGES.md: it's now a mixture of HTML and plain text.
It does not use the potential of markdown, e.g. sections / headlines for
the releases to make the change log navigable via a table of contents.
The embedded HTML makes it less readable if viewed in a text editor.
The rendering on Github [5] is acceptable with only minor glitches,
mostly the placement of multiple lines in a single paragraph:
  https://github.com/apache/nutch/blob/branch-1.20/CHANGES.md
We also have a change log on Jira:
  https://s.apache.org/ovjf3
That's why I wouldn't call the CHANGES.md a "blocker". We should
update the formatting after the release so that it is again easily
readable as plain text and improve the document structure using
Markdown markup.

~Sebastian

On 4/9/24 23:28, lewis john mcgibbney wrote:

Hi Folks,

A first candidate for the Nutch 1.20 release is available at [0] where 
accompanying SHA512 and ASC signatures can also be found.

Information on verifying releases can be found at [1].

The release candidate comprises a .zip and tar.gz archive of the sources at [2] 
and complementary binary distributions. In addition, a staged maven repository 
is available at [3].


The Nutch 1.20 release report is available at [4].

Please vote on releasing this package as Apache Nutch 1.20. The vote is open for 
at least the next 72 hours and passes if a majority of at least three +1 Nutch 
PMC votes are cast.


[ ] +1 Release this package as Apache Nutch X.XX.

[ ] -1 Do not release this package because…

Cheers,
lewismc
P.S. Here is my +1.

[0] https://dist.apache.org/repos/dist/dev/nutch/1.20
[1] http://nutch.apache.org/downloads.html#verify
[2] https://github.com/apache/nutch/tree/release-1.20
[3] https://repository.apache.org/content/repositories/orgapachenutch-1021/

[4] https://s.apache.org/ovjf3

--
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc


Re: truncation, parsing and indexing?

2023-10-23 Thread Sebastian Nagel

Hi Tim,

>> I'm using the okhttp protocol, because I don't think the http protocol
>> stores truncation information.

However, protocol-http could mark truncations as well. Please also open an
issue for this and the other protocol plugins.



>> Should I open a ticket to have ParseSegment also check for okhttp's header (
>> http.content.truncated=true)?

Yes, please.

> One work around to ignore parse exceptions (at least in the Tika parser):
> https://github.com/tballison/nutch/tree/ignore-parse-exception

One potential improvement: we could still parse MIME types which are parseable
even when truncated, most importantly HTML pages.


>> If I understand correctly, ParseSegment is checking for truncation, but it
>> requires a Content-Length header to work. In my case, there is no
>> Content-Length header, so it assumes the file is not truncated.

With chunked Transfer-Encoding, there is usually no Content-Length header.

Even if there is a Content-Length header: it indicates the compressed length 
with HTTP Content-Encoding "gzip", "deflate" or "brotli".
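For illustration, the response headers in such a case typically look like this
(a generic example, not taken from an actual crawl):

  HTTP/1.1 200 OK
  Content-Type: application/pdf
  Content-Encoding: gzip
  Transfer-Encoding: chunked

There is no Content-Length header, and the uncompressed length is only known
after the chunked, compressed stream has been read completely.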



>> Is there a way to index files even if they are truncated or if there is a
>> parse exception?
>>
>> If indexing is a bridge too far, what's the most efficient way to dump a
>> list of urls that are truncated and/or had a parse exception?

Let me think about it...


Best,
Sebastian


On 10/18/23 18:16, Tim Allison wrote:

One work around to ignore parse exceptions (at least in the Tika parser): 
https://github.com/tballison/nutch/tree/ignore-parse-exception

Proposed fix for truncation checking:
https://github.com/tballison/nutch/tree/okhttp-truncated

On 2023/10/18 14:28:45 Tim Allison wrote:

I'm trying to configure Nutch to index pages/files that are truncated (in
addition to the successful non-truncated files).

I'm using the okhttp protocol, because I don't think the http protocol
stores truncation information.

I'm using parse-tika, and the "parser.skip.truncated" is set to
default=true.

The particular PDF that I'm experimenting with is returned chunked with gz
compression.  There is no length header in the response.

For this PDF, okhttp correctly marks it as truncated, but then the file is
sent to parsetika, which throws a parse exception. The file is then not
sent to the index.

If I understand correctly, ParseSegment is checking for truncation, but it
requires a Content-Length header to work. In my case, there is no
Content-Length header, so it assumes the file is not truncated.

Should I open a ticket to have ParseSegment also check for okhttp's header (
http.content.truncated=true)?

Is there a way to index files even if they are truncated or if there is a
parse exception?

If indexing is a bridge too far, what's the most efficient way to dump a
list of urls that are truncated and/or had a parse exception?

Thank you!

Best,

   Tim



Re: Exclude HTML elements from Crawl

2023-09-23 Thread Sebastian Nagel

Hi Michael,

> I wonder if there is not already a built-in option to exclude HTML
> elements (like a div with a given id or class or other elements like header).

No, there isn't one so far.


> I know https://issues.apache.org/jira/browse/NUTCH-585

> I also do not understand why this little patch has not already been added to
> Nutch? Are there drawbacks?

Well, good question. Don't know. I'll have a look...


Maybe one comment: I definitely agree that it would be very useful to have some
configurable method to clean up the HTML-to-text extract from undesired content 
(headers, footers, etc.) - ideally, it should be possible to use the full 
expressive power of CSS for that.



Thanks for the suggestion and for reminding us! Nutch is a community project and
any contribution is welcome and appreciated!


Best,
Sebastian


On 9/21/23 15:46, Fritsch, Michael wrote:

Hello,

I use Nutch 1.18 to crawl our documentation with the parse-html plugin. Each 
page has elements like TOCs which should not be included.


I know https://issues.apache.org/jira/browse/NUTCH-585 and included one of the
patches.


However, I wonder if there is not already a built-in option to exclude HTML
elements (like a div with a given id or class or other elements like header).


I also do not understand why this little patch has not already been added to 
Nutch? Are there drawbacks?


Regards,

Michael

Dr. Michael Fritsch
Technical Editor


*Elevate Experience. Drive Impact.*


E-Mail: michael.frit...@coremedia.com 

Phone: +49 (0) 40 325 587 0
*www.coremedia.com* 






CoreMedia GmbH

Rödingsmarkt 9, 20459 Hamburg, Germany

Managing Director: Sören Stamer

Commercial Register: Amtsgericht Hamburg, HRB 162480





Re: Change log file directory

2023-08-07 Thread Sebastian Nagel

Hi,

yes, this is possible by pointing the environment variable
NUTCH_LOG_DIR to a different folder.
The default is: $NUTCH_HOME/logs/

See also the script bin/nutch which is called by bin/crawl:
 https://github.com/apache/nutch/blob/master/src/bin/nutch#L30
(it's also possible to change the log file name)
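A minimal sketch, assuming the crawl is started via bin/crawl (the log folder
and file name are taken from the question below, the seed directory and number
of rounds are placeholders):

  # redirect the Nutch log before starting the crawl
  export NUTCH_LOG_DIR=/root/mylog
  export NUTCH_LOGFILE=nutchlog.txt
  bin/crawl -i -s urls/ crawl/ 2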

Best,
Sebastian

On 8/2/23 09:58, Raj Chidara wrote:

Hello
   Currently, whenever we run bin/crawl, the generated log is stored in the logs
folder of Nutch. Can we change this behavior so that I can store the log file in
my own folder, say /root/mylog/nutchlog.txt? Please suggest.




Thanks and Regards

Raj Chidara


Re: Maximum header limit (1000) exceeded

2023-07-26 Thread Sebastian Nagel

Hi Steve,

> copy and pasted an email thread together and there are a few
> weird characters in it.

Ok. That explains the error.


> there is a way to tell nutch
> to choose some other parser.

Yes, that's possible. In the conf/ folder there is a file
parse-plugins.xml - if you add the following lines

  <mimeType name="message/rfc822">
    <plugin id="parse-html" />
  </mimeType>

files of MIME type message/rfc822 are parsed using the
HTML parser.


> there are a few weird characters in it

It might be that the parse-html parser also chokes on that content.


Another option could be to manipulate the tika-mimetypes.xml to
override the MIME detection and forward those files to some
custom MIME type. But that might not be that easy.


Best,
Sebastian


On 7/26/23 18:08, Steve Cohen wrote:

Thanks for the reply.

I can't share the file but it isn't in eml format. It looks like someone
copy and pasted an email thread together and there are a few
weird characters in it. I have no problem using less to view it. I am
wondering why it is parsing it as email and if there is a way to tell nutch
to choose some other parser. I have over 500 of the errors so I don't want
to skip them.

Thanks,
Steve Cohen

On Wed, Jul 26, 2023 at 10:36 AM Sebastian Nagel
 wrote:


Hi Steve,

  >

file:/RMS/sha256/a0/ec/b0/a0/e0/ef/80/74/a0ecb0a0e0ef80747871563e2060b028c3abd330cb644ef7ee86fa9b133cbc67

what does the file contain? An .eml file (following RFC822)?
Would it be possible to share this file or at least a chunk large
enough to reproduce the issue?

The error message might indicate that there are too many headers
- 1000 is the limit for the max. header count, see [1].
But then it's hardly an email message but rather some other file format
erroneously detected as email.

In doubt, if parsing this file is mandatory, you could also post
the error on the Tika user mailing list, see [2].

Best,
Sebastian

[1]

https://james.apache.org/mime4j/apidocs/org/apache/james/mime4j/stream/MimeConfig.html#setMaxHeaderCount(int)
[2] https://tika.apache.org/mail-lists.html

On 7/24/23 16:43, Steve Cohen wrote:

Hello,

I am running nutch 1.19 and I am getting the following error:

2023-07-21 14:55:38,013 ERROR o.a.n.p.t.TikaParser [parse-0] Error

parsing



file:/RMS/sha256/a0/ec/b0/a0/e0/ef/80/74/a0ecb0a0e0ef80747871563e2060b028c3abd330cb644ef7ee86fa9b133cbc67

org.apache.tika.exception.TikaException: Failed to parse an email message
  at
org.apache.tika.parser.mail.RFC822Parser.parse(RFC822Parser.java:110)

~[?:?]

  at
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:289)
~[tika-core-2.3.0.jar:2.3.0]
  at
org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:151)

~[?:?]

  at
org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:90)

~[?:?]

  at

org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:34)

~[apache-nutch-1.19.jar:?]
  at

org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:23)

~[apache-nutch-1.19.jar:?]
  at java.util.concurrent.FutureTask.run(FutureTask.java:264)

~[?:?]

  at


java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)

~[?:?]
  at


java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)

~[?:?]
  at java.lang.Thread.run(Thread.java:829) ~[?:?]
Caused by: org.apache.james.mime4j.io.MaxHeaderLimitException: Maximum
header limit (1000) exceeded
  at
org.apache.james.mime4j.stream.MimeEntity.nextField(MimeEntity.java:254)
~[?:?]
  at
org.apache.james.mime4j.stream.MimeEntity.advance(MimeEntity.java:296)
~[?:?]
  at


org.apache.james.mime4j.stream.MimeTokenStream.next(MimeTokenStream.java:374)

~[?:?]
  at


org.apache.james.mime4j.parser.MimeStreamParser.parse(MimeStreamParser.java:176)

~[?:?]
  at
org.apache.tika.parser.mail.RFC822Parser.parse(RFC822Parser.java:98)

~[?:?]



Is there a way to increase the header limit in nutch-site.xml or

elsewhere?

I looked through the nutch-defaults.xml and didn't see the property but
maybe I missed it?

Thanks,
Steve Cohen







Re: Maximum header limit (1000) exceeded

2023-07-26 Thread Sebastian Nagel

Hi Steve,

> 
file:/RMS/sha256/a0/ec/b0/a0/e0/ef/80/74/a0ecb0a0e0ef80747871563e2060b028c3abd330cb644ef7ee86fa9b133cbc67


what does the file contain? An .eml file (following RFC822)?
Would it be possible to share this file or at least a chunk large
enough to reproduce the issue?

The error message might indicate that there are too many headers
- 1000 is the limit for the max. header count, see [1].
But then it's hardly an email message but rather some other file format
erroneously detected as email.

In doubt, if parsing this file is mandatory, you could also post
the error on the Tika user mailing list, see [2].

Best,
Sebastian

[1] 
https://james.apache.org/mime4j/apidocs/org/apache/james/mime4j/stream/MimeConfig.html#setMaxHeaderCount(int)

[2] https://tika.apache.org/mail-lists.html

On 7/24/23 16:43, Steve Cohen wrote:

Hello,

I am running nutch 1.19 and I am getting the following error:

2023-07-21 14:55:38,013 ERROR o.a.n.p.t.TikaParser [parse-0] Error parsing
file:/RMS/sha256/a0/ec/b0/a0/e0/ef/80/74/a0ecb0a0e0ef80747871563e2060b028c3abd330cb644ef7ee86fa9b133cbc67
org.apache.tika.exception.TikaException: Failed to parse an email message
 at
org.apache.tika.parser.mail.RFC822Parser.parse(RFC822Parser.java:110) ~[?:?]
 at
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:289)
~[tika-core-2.3.0.jar:2.3.0]
 at
org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:151) ~[?:?]
 at
org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:90) ~[?:?]
 at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:34)
~[apache-nutch-1.19.jar:?]
 at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:23)
~[apache-nutch-1.19.jar:?]
 at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
 at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
~[?:?]
 at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
~[?:?]
 at java.lang.Thread.run(Thread.java:829) ~[?:?]
Caused by: org.apache.james.mime4j.io.MaxHeaderLimitException: Maximum
header limit (1000) exceeded
 at
org.apache.james.mime4j.stream.MimeEntity.nextField(MimeEntity.java:254)
~[?:?]
 at
org.apache.james.mime4j.stream.MimeEntity.advance(MimeEntity.java:296)
~[?:?]
 at
org.apache.james.mime4j.stream.MimeTokenStream.next(MimeTokenStream.java:374)
~[?:?]
 at
org.apache.james.mime4j.parser.MimeStreamParser.parse(MimeStreamParser.java:176)
~[?:?]
 at
org.apache.tika.parser.mail.RFC822Parser.parse(RFC822Parser.java:98) ~[?:?]


Is there a way to increase the header limit in nutch-site.xml or elsewhere?
I looked through the nutch-defaults.xml and didn't see the property but
maybe I missed it?

Thanks,
Steve Cohen



[ANNOUNCE] New Nutch committer and PMC - Tim Allison

2023-07-20 Thread Sebastian Nagel

Dear all,

It is my pleasure to announce that Tim Allison has joined us
as a committer and member of the Nutch PMC.

You may already know Tim as a maintainer of and contributor to
Apache Tika. So, it was great to see contributions to the
Nutch source code from an experienced developer who is also
active in a related Apache project. Among other contributions
Tim recently implemented the indexer-opensearch plugin.

Thank you, Tim Allison, and congratulations on your new role
in the Apache Nutch community! And welcome on board!

Sebastian
(on behalf of the Nutch PMC)


Re: Nutch 1.19 Getting Error: 'boolean org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(java.lang.String, int)'

2023-05-15 Thread Sebastian Nagel

Hi Eric,

unfortunately, on Windows you also need to download and install winutils.exe and 
hadoop.dll,

see
  https://github.com/cdarlint/winutils and

https://stackoverflow.com/questions/41851066/exception-in-thread-main-java-lang-unsatisfiedlinkerror-org-apache-hadoop-io

The installation of Hadoop is not mandatory - the Nutch binary package
already includes Hadoop jar files.

Alternatively, you may prefer to run Nutch on Linux - no additional 
installations required.
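A rough sketch of the usual setup on Windows (the install path is an
assumption, adjust it to your system):

  rem winutils.exe and hadoop.dll downloaded into C:\hadoop\bin
  set HADOOP_HOME=C:\hadoop
  set PATH=%PATH%;%HADOOP_HOME%\bin

Afterwards, open a new terminal and run the Nutch command again.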


Best,
Sebastian

On 5/15/23 04:07, Eric Valencia wrote:

Hello everyone,

So, I set up Nutch 1.19, Solr 8.11.2, and hadoop 3.3.5, to the best of my
knowledge.

After, I went into the nutch directory and ran this command:
*bin/nutch generate crawl/crawldb crawl/segments*

Then, I got an error:
*Exception in thread "main" java.lang.UnsatisfiedLinkError: 'boolean
org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(java.lang.String,
int)'*

Does anyone know how to solve this problem?

Below is the full output:
$ bin/nutch generate crawl/crawldb crawl/segments
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in
[jar:file:/C:/Users/User/Desktop/wiki/a/ApacheNutch/apache-nutch-1.19/lib/log4j-slf4j-impl-2.18.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in
[jar:file:/C:/Users/User/Desktop/wiki/a/ApacheNutch/apache-nutch-1.19/lib/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
explanation.
SLF4J: Actual binding is of type
[org.apache.logging.slf4j.Log4jLoggerFactory]
2023-05-14 19:01:16,433 INFO o.a.n.p.PluginManifestParser [main] Plugins:
looking in:
C:\Users\User\Desktop\wiki\a\ApacheNutch\apache-nutch-1.19\plugins
2023-05-14 19:01:16,558 INFO o.a.n.p.PluginRepository [main] Plugin
Auto-activation mode: [true]
2023-05-14 19:01:16,558 INFO o.a.n.p.PluginRepository [main] Registered
Plugins:
2023-05-14 19:01:16,558 INFO o.a.n.p.PluginRepository [main]Regex URL
Filter (urlfilter-regex)
2023-05-14 19:01:16,558 INFO o.a.n.p.PluginRepository [main]Html Parse
Plug-in (parse-html)
2023-05-14 19:01:16,558 INFO o.a.n.p.PluginRepository [main]HTTP
Framework (lib-http)
2023-05-14 19:01:16,559 INFO o.a.n.p.PluginRepository [main]the nutch
core extension points (nutch-extensionpoints)
2023-05-14 19:01:16,559 INFO o.a.n.p.PluginRepository [main]Basic
Indexing Filter (index-basic)
2023-05-14 19:01:16,559 INFO o.a.n.p.PluginRepository [main]Anchor
Indexing Filter (index-anchor)
2023-05-14 19:01:16,559 INFO o.a.n.p.PluginRepository [main]Tika Parser
Plug-in (parse-tika)
2023-05-14 19:01:16,559 INFO o.a.n.p.PluginRepository [main]Basic URL
Normalizer (urlnormalizer-basic)
2023-05-14 19:01:16,559 INFO o.a.n.p.PluginRepository [main]Regex URL
Filter Framework (lib-regex-filter)
2023-05-14 19:01:16,559 INFO o.a.n.p.PluginRepository [main]Regex URL
Normalizer (urlnormalizer-regex)
2023-05-14 19:01:16,559 INFO o.a.n.p.PluginRepository [main]URL
Validator (urlfilter-validator)
2023-05-14 19:01:16,559 INFO o.a.n.p.PluginRepository [main]CyberNeko
HTML Parser (lib-nekohtml)
2023-05-14 19:01:16,559 INFO o.a.n.p.PluginRepository [main]OPIC
Scoring Plug-in (scoring-opic)
2023-05-14 19:01:16,559 INFO o.a.n.p.PluginRepository [main]
  Pass-through URL Normalizer (urlnormalizer-pass)
2023-05-14 19:01:16,559 INFO o.a.n.p.PluginRepository [main]Http
Protocol Plug-in (protocol-http)
2023-05-14 19:01:16,559 INFO o.a.n.p.PluginRepository [main]
  SolrIndexWriter (indexer-solr)
2023-05-14 19:01:16,559 INFO o.a.n.p.PluginRepository [main] Registered
Extension-Points:
2023-05-14 19:01:16,559 INFO o.a.n.p.PluginRepository [main] (Nutch
Content Parser)
2023-05-14 19:01:16,560 INFO o.a.n.p.PluginRepository [main] (Nutch URL
Filter)
2023-05-14 19:01:16,560 INFO o.a.n.p.PluginRepository [main] (HTML
Parse Filter)
2023-05-14 19:01:16,560 INFO o.a.n.p.PluginRepository [main] (Nutch
Scoring)
2023-05-14 19:01:16,560 INFO o.a.n.p.PluginRepository [main] (Nutch URL
Normalizer)
2023-05-14 19:01:16,560 INFO o.a.n.p.PluginRepository [main] (Nutch
Publisher)
2023-05-14 19:01:16,560 INFO o.a.n.p.PluginRepository [main] (Nutch
Exchange)
2023-05-14 19:01:16,560 INFO o.a.n.p.PluginRepository [main] (Nutch
Protocol)
2023-05-14 19:01:16,560 INFO o.a.n.p.PluginRepository [main] (Nutch URL
Ignore Exemption Filter)
2023-05-14 19:01:16,560 INFO o.a.n.p.PluginRepository [main] (Nutch
Index Writer)
2023-05-14 19:01:16,560 INFO o.a.n.p.PluginRepository [main] (Nutch
Segment Merge Filter)
2023-05-14 19:01:16,560 INFO o.a.n.p.PluginRepository [main] (Nutch
Indexing Filter)
2023-05-14 19:01:16,969 INFO o.a.n.c.Generator [main] Generator: starting
at 2023-05-14 19:01:16
2023-05-14 19:01:16,969 INFO o.a.n.c.Generator [main] Generator: Selecting
best-scoring urls due for fetch.
2023-05-14 19:01:16,969 INFO o.a.n.c.Generator [main] Generator: filtering:
true
2023-05-14 

Re: Merging CrawlDBs

2023-02-02 Thread Sebastian Nagel

Hi Kamil,

> I was wondering if this script is advisable to use?

I haven't tried the script itself but some of the underlying commands
- mergedb, etc.

> merge command ($nutch_dir/nutch merge $index_dir $new_indexes)

Of course, some of the commands are obsolete. A long time ago, Nutch
used Lucene index shards directly. Now the management of indexes
(including merging of shards) is delegated to Solr or Elasticsearch.


> I plan to use it for crawls of non-overlapping urls.

... just a few thoughts about this particular use case:

Why do you want to merge the data structures?

- if they're disjoint there is no need for it
- all operations (CrawlDb: generate, update, etc.)
  are much faster on smaller structures

If required: most of the Nutch jobs can read multiple segments or CrawlDbs.
However, it might be that the command-line tool expects only a single
CrawlDb or segment.
- we could extend the command-line params
- or just copy the sequence files into one single path
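If you do decide to merge the CrawlDbs, the still-supported way is the mergedb
command mentioned above; a minimal sketch (the paths are just examples):

  # merge two disjoint CrawlDbs into a new one
  bin/nutch mergedb crawl/crawldb_merged crawl_a/crawldb crawl_b/crawldb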

~Sebastian

On 2/2/23 01:54, Kamil Mroczek wrote:

Hi,

I am testing how merging crawls works and found this script
https://cwiki.apache.org/confluence/display/NUTCH/MergeCrawl.

I was wondering if this script is advisable to use? I plan to use it for
crawls of non-overlapping urls.

I am wary of using it since it is located under "Archive & Legacy" on the
wiki. But after running some tests it seems to function correctly. I only
had to remove the merge command ($nutch_dir/nutch merge $index_dir
$new_indexes) since that is not a command anymore.

I am not necessarily looking for a list of potential issues (if the list is
long), just trying to understand why it might be under the archive.

Kamil



Re: Unsubscribe from Users list

2023-01-25 Thread Sebastian Nagel

Hi,

please send a mail to

   user-unsubscr...@nutch.apache.org

See
   https://nutch.apache.org/community/mailing-lists/

Thanks!

Best,
Sebastian

On 1/25/23 14:53, Steven Zhu wrote:

Please unsubscribe me from the users list.

Steven

On Tue, Jan 24, 2023 at 10:27 PM Ankit gupta 
wrote:


Hello,

Please unsubscribe me from the users list.

Ankit Gupta

On Wed, Jan 25, 2023 at 4:45 AM Timeka Cobb  wrote:


Unsubscribe me from the list as well...Thank you!!

Timeka Cobb

On Tue, Jan 24, 2023, 6:14 PM Andrés Rincón Pacheco 
wrote:


Hello,

Please unsubscribe me from the users list.

Thanks.

Regards,

Andrés Rincón









Re: "Unparseable date" build issue with ANT on AWS EMR

2023-01-17 Thread Sebastian Nagel

Hi Kamil,

after some trials I came up with a different solution for the issue with the
"unparseable date", see


  https://github.com/apache/nutch/pull/752

The solution providing a pattern reproducibly fails in certain locales, see
the comments in

  https://issues.apache.org/jira/browse/NUTCH-2974

Just in case you want to try it.

~Sebastian

On 11/21/22 10:36, Sebastian Nagel wrote:

Hi Kamil,

thanks for trying and finding a solution! I've opened a JIRA issue to track the
problem: https://issues.apache.org/jira/browse/NUTCH-2974


Thanks!

Sebastian

On 11/19/22 18:37, Kamil Mroczek wrote:

I've been able to work around this issue by adding "pattern" to the touch tag
on line 101 in build.xml like so:



On Fri, Nov 18, 2022 at 12:32 PM Kamil Mroczek  wrote:


Hello,

When I run the "ant runtime" command I am getting:

/home/hadoop/apache-nutch/build.xml:101: Unparseable date: "01/25/1971
2:00 pm"

I've tried different date formats to no avail. There was a similar issue
that was fixed in version 1.19, NUTCH-2512
<https://issues.apache.org/jira/browse/NUTCH-2512>. I am using Nutch
1.19. I am using Java 11. This is running on the AWS EMR master node using
a vanilla AMI running AWS Linux 2.0.20221004.0. Some more debugging info
below.

Kamil
=
[hadoop@ip-172-31-25-62 apache-nutch]$ java -version
openjdk version "11.0.16.1" 2022-08-12 LTS
OpenJDK Runtime Environment Corretto-11.0.16.9.1 (build 11.0.16.1+9-LTS)
OpenJDK 64-Bit Server VM Corretto-11.0.16.9.1 (build 11.0.16.1+9-LTS,
mixed mode)

[hadoop@ip-172-31-25-62 apache-nutch]$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

--- Ant diagnostics report ---
Apache Ant(TM) version 1.9.2 compiled on November 13 2017

---
  Implementation Version
---
core tasks : 1.9.2 in file:/usr/share/java/ant/ant.jar

---
  ANT PROPERTIES
---
ant.version: Apache Ant(TM) version 1.9.2 compiled on November 13 2017
ant.java.version: 1.8
Is this the Apache Harmony VM? no
Is this the Kaffe VM? no
Is this gij/gcj? no
ant.core.lib: /usr/share/java/ant/ant.jar
ant.home: /usr/share/ant

---
  ANT_HOME/lib jar listing
---
ant.home: /usr/share/ant
ant-bootstrap.jar (20919 bytes)
ant-launcher.jar (19038 bytes)
ant.jar (1998416 bytes)

---
  USER_HOME/.ant/lib jar listing
---
user.home: /home/hadoop
No such directory.

---
  Tasks availability
---
junitreport : Not Available (the implementation class is not present)
sshsession : Not Available (the implementation class is not present)
sshexec : Not Available (the implementation class is not present)
telnet : Not Available (the implementation class is not present)
scp : Not Available (the implementation class is not present)
antlr : Not Available (the implementation class is not present)
netrexxc : Not Available (the implementation class is not present)
ftp : Not Available (the implementation class is not present)
rexec : Not Available (the implementation class is not present)
sound : Not Available (the implementation class is not present)
image : Not Available (the implementation class is not present)
junit : Not Available (the implementation class is not present)
jdepend : Not Available (the implementation class is not present)
splash : Not Available (the implementation class is not present)
A task being missing/unavailable should only matter if you are trying to
use it

---
  org.apache.env.Which diagnostics
---
Not available.
Download it at http://xml.apache.org/commons/

---
  XML Parser information
---
XML Parser : org.apache.xerces.jaxp.SAXParserImpl
XML Parser Location: file:/usr/share/java/xerces-j2.jar
Namespace-aware parser : org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser
Namespace-aware parser Location: file:/usr/share/java/xerces-j2.jar

---
  XSLT Processor information
---
XSLT Processor :
com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl
XSLT Processor Location: unknown

-

Re: Configuration Nutch in cluster mode

2023-01-17 Thread Sebastian Nagel

Hi Mike,

the Nutch configuration files are included in the job file found in
runtime/deploy after the build. This means you need to compile Nutch yourself
if it is used in "distributed" mode.

For exercising, you can first work in "pseudo-distributed" mode, i.e.
on a single-node Hadoop cluster. All commands are the same as in fully
distributed mode.
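A minimal sketch of the build-and-run cycle in that mode (the seed directory,
crawl directory and number of rounds are just examples; both paths refer to
HDFS here):

  # rebuild the job file after changing anything under conf/
  ant runtime
  # run the crawl from the deploy runtime; it submits the job file
  # to the (single-node) Hadoop cluster
  cd runtime/deploy
  bin/crawl -i -s urls crawl 2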


If it helps, I prepared some setup scripts to run Nutch in pseudo-distributed 
mode:
  https://github.com/sebastian-nagel/nutch-test-single-node-cluster

Best,
Sebastian

On 1/15/23 04:26, Mike wrote:

I will now try to configure the bot URL etc. before building,
but how and where do I configure things between the crawls, e.g. the number of
pages per host?

where do I configure nutch in cluster mode?

thx, mike



Re: Nutch/Hadoop Cluster

2023-01-17 Thread Sebastian Nagel

Hi Mike,

> It can be tedious to set up for the first time, and there are many components.

In case you prefer Linux packages, I can recommend Apache Bigtop, see
   https://bigtop.apache.org/
and for the list of package repositories
   https://downloads.apache.org/bigtop/stable/repos/

~Sebastian

On 1/15/23 01:06, Markus Jelsma wrote:

Hello Mike,


would it pay off for me to put a hadoop cluster on top of the 3 servers.


Yes, for as many reasons as Hadoop exists for. It can be tedious to set up
for the first time, and there are many components. But at least you have
three servers, which is roughly the minimum required by ZooKeeper, which you
will also need.

Ideally you would have some additional VMs to run the controlling Hadoop
programs and perhaps the Hadoop client nodes on. The workers can run on
bare metal.


1.) a server would not be integrated directly into the crawl process as a

master.

What do you mean? Can you elaborate?


2.) can I run multiple crawl jobs on one server?


Yes! Just have separate instances of Nutch home dirs on your Hadoop client
nodes, each having their own configuration.

Regards,
Markus

Op za 14 jan. 2023 om 18:42 schreef Mike :


Hi!

I am now crawling the internet in local mode in parallel with up to 10
instances on 3 computers. would it pay off for me to put a hadoop cluster
on top of the 3 servers.

1.) a server would not be integrated directly into the crawl process as a
master.
2.) can I run multiple crawl jobs on one server?

Thanks





Re: CSV indexer file data overwriting

2022-11-24 Thread Sebastian Nagel

Hi Paul,

> the indexer was writing the
> documents info in the file (nutch.csv) twice,

Yes, I see. And now I know what I've overlooked:

 .../bin/nutch index -Dmapreduce.job.reduces=2

You need to run the CSV indexer with only a single reducer.
In order to do so, please pass the option
  --num-tasks 1
to the script bin/crawl.

Alternatively, you could change
  NUM_TASKS=2
in bin/crawl to
  NUM_TASKS=1
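For example (the seed directory, crawl directory and number of rounds are
placeholders):

  # run the whole crawl with a single reducer so that the CSV indexer
  # writes a single output file
  bin/crawl --num-tasks 1 -i -s urls/ crawl/ 2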

This is related to why, for now, you can't run the CSV indexer
in (pseudo-)distributed mode, see my previous note:

> A final note: the CSV indexer only works in local mode, it does not yet
> work in distributed mode (on a real Hadoop cluster). It was initially
> intended for debugging, not for a larger production setup.

The issue is described here:
  https://issues.apache.org/jira/browse/NUTCH-2793

It's a tough one because a solution requires a change of the IndexWriter
interface. Index writers are plugins and do not know from which reducer
task they are run and to which path on a distributed or parallelized system
they have to write. On Hadoop, writing the output is done in two steps:
write to a local file and then "commit" the output to the final location on the
distributed file system.


But yes, we should have another look at this issue, which has been stalled for
quite some time - also because it's now clear that you might run into issues
even in local mode.

Thanks for reporting the issue! If you can, please also comment on the Jira 
issue!

Best,
Sebastian





Re: CSV indexer file data overwriting

2022-11-23 Thread Sebastian Nagel

Hi Paul,

as far as I can see the indexer is run only once and now indexes 26 documents:

org.apache.nutch.indexer.IndexingJob 2022-11-22 06:32:57,164 INFO 
o.a.n.i.IndexingJob [main] Indexer: 26  indexed (add/update)


The logs also indicate that both segments are indexed at once:

org.apache.nutch.indexer.IndexerMapReduce 2022-11-22 06:32:51,811 INFO 
o.a.n.i.IndexerMapReduce [main] IndexerMapReduces: adding segment: 
file:/home/paulesco/Downloads/apache-nutch-1.19/crawl/segments/20221122062645
org.apache.nutch.indexer.IndexerMapReduce 2022-11-22 06:32:51,814 INFO 
o.a.n.i.IndexerMapReduce [main] IndexerMapReduces: adding segment: 
file:/home/paulesco/Downloads/apache-nutch-1.19/crawl/segments/20221122062728



Best,
Sebastian



Re: "Unparseable date" build issue with ANT on AWS EMR

2022-11-21 Thread Sebastian Nagel

Hi Kamil,

thanks for trying and finding a solution! I've opened a JIRA issue to track the
problem: https://issues.apache.org/jira/browse/NUTCH-2974


Thanks!

Sebastian

On 11/19/22 18:37, Kamil Mroczek wrote:

I've been able to work around this issue by adding "pattern" to the touch tag
on line 101 in build.xml like so:
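A plausible form of that change, assuming the touch task at that line sets the
timestamp via a datetime attribute (the pattern value and the nested fileset
are assumptions for illustration, not the literal lines from build.xml):

  <!-- give <touch> an explicit date pattern so that parsing
       "01/25/1971 2:00 pm" no longer depends on the JVM default locale -->
  <touch datetime="01/25/1971 2:00 pm" pattern="MM/dd/yyyy h:mm a">
    <fileset dir="${build.classes}"/>
  </touch>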



On Fri, Nov 18, 2022 at 12:32 PM Kamil Mroczek  wrote:


Hello,

When I run the "ant runtime" command I am getting:

/home/hadoop/apache-nutch/build.xml:101: Unparseable date: "01/25/1971
2:00 pm"

I've tried different date formats to no avail. There was a similar issue
that was fixed in version 1.19, NUTCH-2512
. I am using Nutch
1.19. I am using Java 11. This is running on the AWS EMR master node using
a vanilla AMI running AWS Linux 2.0.20221004.0. Some more debugging info
below.

Kamil
=
[hadoop@ip-172-31-25-62 apache-nutch]$ java -version
openjdk version "11.0.16.1" 2022-08-12 LTS
OpenJDK Runtime Environment Corretto-11.0.16.9.1 (build 11.0.16.1+9-LTS)
OpenJDK 64-Bit Server VM Corretto-11.0.16.9.1 (build 11.0.16.1+9-LTS,
mixed mode)

[hadoop@ip-172-31-25-62 apache-nutch]$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

--- Ant diagnostics report ---
Apache Ant(TM) version 1.9.2 compiled on November 13 2017

---
  Implementation Version
---
core tasks : 1.9.2 in file:/usr/share/java/ant/ant.jar

---
  ANT PROPERTIES
---
ant.version: Apache Ant(TM) version 1.9.2 compiled on November 13 2017
ant.java.version: 1.8
Is this the Apache Harmony VM? no
Is this the Kaffe VM? no
Is this gij/gcj? no
ant.core.lib: /usr/share/java/ant/ant.jar
ant.home: /usr/share/ant

---
  ANT_HOME/lib jar listing
---
ant.home: /usr/share/ant
ant-bootstrap.jar (20919 bytes)
ant-launcher.jar (19038 bytes)
ant.jar (1998416 bytes)

---
  USER_HOME/.ant/lib jar listing
---
user.home: /home/hadoop
No such directory.

---
  Tasks availability
---
junitreport : Not Available (the implementation class is not present)
sshsession : Not Available (the implementation class is not present)
sshexec : Not Available (the implementation class is not present)
telnet : Not Available (the implementation class is not present)
scp : Not Available (the implementation class is not present)
antlr : Not Available (the implementation class is not present)
netrexxc : Not Available (the implementation class is not present)
ftp : Not Available (the implementation class is not present)
rexec : Not Available (the implementation class is not present)
sound : Not Available (the implementation class is not present)
image : Not Available (the implementation class is not present)
junit : Not Available (the implementation class is not present)
jdepend : Not Available (the implementation class is not present)
splash : Not Available (the implementation class is not present)
A task being missing/unavailable should only matter if you are trying to
use it

---
  org.apache.env.Which diagnostics
---
Not available.
Download it at http://xml.apache.org/commons/

---
  XML Parser information
---
XML Parser : org.apache.xerces.jaxp.SAXParserImpl
XML Parser Location: file:/usr/share/java/xerces-j2.jar
Namespace-aware parser : org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser
Namespace-aware parser Location: file:/usr/share/java/xerces-j2.jar

---
  XSLT Processor information
---
XSLT Processor :
com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl
XSLT Processor Location: unknown

---
  System properties
---
java.runtime.name : OpenJDK Runtime Environment
java.vm.version : 11.0.16.1+9-LTS
sun.boot.library.path : /usr/lib/jvm/java-11-amazon-corretto.x86_64/lib
ant.library.dir : /usr/share/ant/lib
java.vm.vendor : Amazon.com Inc.
java.vendor.url : https://aws.amazon.com/corretto/
path.separator : :
java.vm.name : OpenJDK 64-Bit Server VM
sun.os.patch.level : unknown
user.country : US
sun.java.launcher : SUN_STANDARD
java.vm.specification.name : Java Virtual Machine Specification
user.dir : 

[DISCUSS] Bug reporting - enabling Github issues?

2022-11-21 Thread Sebastian Nagel

Hi everybody,

because of a growing number of spam account creations, public sign-ups to the
Apache JIRA have been disabled.

In order to allow users to report bugs, we have two options:
1 either users let us know about the issue on the mailing list and one of the
  Nutch PMC creates a user account using https://selfserve.apache.org/
2 or we enable Github issues for the Nutch repositories

Option 1 may hinder some users from reporting a bug because it takes some steps
and time until the JIRA account is created.

The two options are not mutually exclusive:
- because of existing issues in JIRA and because we create the release notes
  using JIRA, we are likely to stick with JIRA for some time. The release notes
  also require that Github issues are duplicated on JIRA.
- on the other hand, it's not always the case that contributors use Github
  and have registered there

The recommendation of the infra team is:
  We suggest projects consider using GitHub Issues for customer-facing
  questions/bug reports/etc., while maintaining development issues on Jira.

If there are no objections I'll enable Github issues (see [1]) and add the
pointers to the Nutch site and README, and update the pull-request template.

Best,
Sebastian

[1] 
https://cwiki.apache.org/confluence/display/INFRA/Git+-+.asf.yaml+features#Git.asf.yamlfeatures-Repositoryfeatures


Re: CSV indexer file data overwriting

2022-11-21 Thread Sebastian Nagel

Hi Paul,

yes, the CSV indexer removes the CSV output before it starts a new one.
The problem here is that the indexer is run twice in a loop.

Possible work-arounds - assumed you're using the script bin/crawl:

1 after each indexing command in the loop, move the CSV output so that
  it does not get deleted later:

  mv nutch.csv nutch-$(date +%Y%m%d%H%M%S).csv

2 run the index step after the loop. Instead of passing a single segment,
  you need to index all segments in the segments/ folder. Just replace
.../segments/$SEGMENT
  with
-dir .../segments/
  Work-around 2 has the advantage that the index is a single file.
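A sketch of such an invocation after the loop (the paths follow the layout
created by bin/crawl and are just examples):

  # index all segments at once, with a single reducer
  bin/nutch index -Dmapreduce.job.reduces=1 crawl/crawldb -dir crawl/segments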


For the long term we might add the option to include a unique component
in the CSV output file name (e.g. a timestamp), or add work-around 2 to the
crawl script. Let us know if you need such a solution for the development
branch.

A final note: the CSV indexer only works in local mode, it does not yet
work in distributed mode (on a real Hadoop cluster). It was initially
intended for debugging, not for a larger production setup.

Best,
Sebastian


On 11/18/22 15:16, Paul Escobar wrote:

I'm using the CSV indexer to write Nutch data, but in the nutch.csv file I find
only the last thirteen lines; it seems like the indexer is overwriting the
file. I've read the Nutch CSV indexer documentation but haven't found any
configuration related to this situation. Could someone help me to get all
the lines extracted by the parser? This is the log output and the
index-writes.xml configuration:


org.apache.nutch.plugin.PluginManifestParser 2022-11-18 07:48:02,323 INFO
o.a.n.p.PluginManifestParser [main] Plugins: looking in:
/home/paulesco/Downloads/apache-nutch-1.19/plugins
org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:02,753 INFO
o.a.n.p.PluginRepository [main] Plugin Auto-activation mode: [true]
org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:02,754 INFO
o.a.n.p.PluginRepository [main] Registered Plugins:
org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:02,755 INFO
o.a.n.p.PluginRepository [main] Regex URL Filter (urlfilter-regex)
org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:02,755 INFO
o.a.n.p.PluginRepository [main] Html Parse Plug-in (parse-html)
org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:02,755 INFO
o.a.n.p.PluginRepository [main] HTTP Framework (lib-http)
org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:02,756 INFO
o.a.n.p.PluginRepository [main] the nutch core extension points
(nutch-extensionpoints)
org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:02,756 INFO
o.a.n.p.PluginRepository [main] Basic Indexing Filter (index-basic)
org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:02,757 INFO
o.a.n.p.PluginRepository [main] Anchor Indexing Filter (index-anchor)
org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:02,757 INFO
o.a.n.p.PluginRepository [main] Tika Parser Plug-in (parse-tika)
org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:02,758 INFO
o.a.n.p.PluginRepository [main] Extractor based XML/HTML Parser/Indexing
Filter (extractor)
org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:02,758 INFO
o.a.n.p.PluginRepository [main] Basic URL Normalizer (urlnormalizer-basic)
org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:02,759 INFO
o.a.n.p.PluginRepository [main] Regex URL Filter Framework
(lib-regex-filter)
org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:02,760 INFO
o.a.n.p.PluginRepository [main] Regex URL Normalizer (urlnormalizer-regex)
org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:02,760 INFO
o.a.n.p.PluginRepository [main] CyberNeko HTML Parser (lib-nekohtml)
org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:02,761 INFO
o.a.n.p.PluginRepository [main] URL Validator (urlfilter-validator)
org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:02,761 INFO
o.a.n.p.PluginRepository [main] OPIC Scoring Plug-in (scoring-opic)
org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:02,762 INFO
o.a.n.p.PluginRepository [main] Pass-through URL Normalizer
(urlnormalizer-pass)
org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:02,762 INFO
o.a.n.p.PluginRepository [main] Http Protocol Plug-in (protocol-http)
org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:02,763 INFO
o.a.n.p.PluginRepository [main] CSVIndexWriter (indexer-csv)
org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:02,763 INFO
o.a.n.p.PluginRepository [main] Registered Extension-Points:
org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:02,764 INFO
o.a.n.p.PluginRepository [main] (Nutch Content Parser)
org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:02,764 INFO
o.a.n.p.PluginRepository [main] (Nutch URL Filter)
org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:02,765 INFO
o.a.n.p.PluginRepository [main] (HTML Parse Filter)
org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:02,765 INFO
o.a.n.p.PluginRepository [main] (Nutch Scoring)

Re: Incomplete TLD List

2022-11-08 Thread Sebastian Nagel

Hi Mike, hi Markus,

there's also
  https://issues.apache.org/jira/browse/NUTCH-1806
which would make it much easier to keep up-to-date with the public suffix list.

Respectively: because crawler-commons loads the public suffix list
(for historic reasons named "effective_tld_names.dat") from the class path,
it would be quite easy to update the list by simply placing it in the
Nutch conf folder.
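A sketch of what that could look like, assuming conf/ is on the class path as
in a default Nutch setup (not yet an officially documented recipe):

  # fetch the current public suffix list and drop it into conf/ under
  # the file name crawler-commons looks for
  curl -L -o conf/effective_tld_names.dat \
    https://publicsuffix.org/list/public_suffix_list.dat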

@Mike: please let us know whether this is an option (for the long term). You
may also upvote the Jira issue. Thanks!


Best,
Sebastian

On 11/8/22 11:45, Markus Jelsma wrote:

Hello Mike,

You can try adding the TLD to conf/domain-suffixes.xml and see if it works.

Regards,
Markus

Op di 8 nov. 2022 om 11:16 schreef Mike :


Hi!
Some of the new TLDs are wrongly indexed by Nutch, is it possible to extend
the TLD list?

 "url":"https://about.google/intl/en_FR/how-our-business-works/;,
 "tstamp":"2022-11-06T17:22:14.808Z",
 "domain":"google",
 "digest":"3b9a23d42f200392d12a697bbb8d4d87",


Thanks

Mike





[ANNOUNCE] Apache Nutch 1.19 Release

2022-09-08 Thread Sebastian Nagel
The Apache Nutch team is pleased to announce the release of
Apache Nutch v1.19.

Nutch is a well matured, production ready Web crawler. Nutch 1.x enables
fine grained configuration, relying on Apache Hadoop™ data structures.

Source and binary distributions are available for download from the
Apache Nutch download site:
   https://nutch.apache.org/downloads.html

Please verify signatures using the KEYS file available at the above
location when downloading the release.

This release includes more than 80 bug fixes and improvements, the full
list of changes can be seen in the release report
  https://s.apache.org/lf6li
Please also check the changelog for breaking changes:
  https://apache.org/dist/nutch/1.19/CHANGES.txt

Important changes are:
- Nutch builds on JDK 11
- protocol plugins can provide a custom URL stream handler to support
  custom URL schemes, eg. smb://
and notable dependency upgrades include:
  Hadoop 3.3.4
  Solr 8.11.2
  Tika 2.3.0

Thanks to everyone who contributed to this release!




[RESULT] was [VOTE] Release Apache Nutch 1.19 RC#1

2022-09-06 Thread Sebastian Nagel
Hi Folks,

thanks to everyone who was able to review the release candidate!

72 hours have definitely passed, please see below for vote results.

[4] +1 Release this package as Apache Nutch 1.19
   Markus Jelsma *
   BlackIce *
   Jorge Betancourt *
   Sebastian Nagel *

[0] -1 Do not release this package because ...

* Nutch PMC

The VOTE passes with 4 binding votes from Nutch PMC members.

I'll continue to publish the release packages and announce the release.

Thanks to everyone who contributed to Nutch and the 1.19 release.

Sebastian


On 8/22/22 17:30, Sebastian Nagel wrote:
> Hi Folks,
> 
> A first candidate for the Nutch 1.19 release is available at:
> 
>https://dist.apache.org/repos/dist/dev/nutch/1.19/
> 
> The release candidate is a zip and tar.gz archive of the binary and sources 
> in:
>https://github.com/apache/nutch/tree/release-1.19
> 
> In addition, a staged maven repository is available here:
>https://repository.apache.org/content/repositories/orgapachenutch-1020
> 
> We addressed 87 issues:
>https://s.apache.org/lf6li
> 
> 
> Please vote on releasing this package as Apache Nutch 1.19.
> The vote is open for the next 72 hours and passes if a majority
> of at least three +1 Nutch PMC votes are cast.
> 
> [ ] +1 Release this package as Apache Nutch 1.19.
> [ ] -1 Do not release this package because…
> 
> Cheers,
> Sebastian
> (On behalf of the Nutch PMC)
> 
> P.S.
> Here is my +1.
> - tested most of Nutch tools and run a test crawl on a single-node cluster
>   running Hadoop 3.3.4, see
>   https://github.com/sebastian-nagel/nutch-test-single-node-cluster/)


Re: Nutch 1.19 schema.xml

2022-09-04 Thread Sebastian Nagel
Hi Mike,

I think there shouldn't be any issues upgrading the new schema.xml into the Solr
core holding the index filled from Nutch. Maybe with two exceptions:
- index-geoip is used (then some field definitions may change)
- when an older Solr version is used (eg. not yet supporting
  solr.LatLonPointSpatialField)

If in doubt, I'd run a test first to be sure that the production system
isn't broken.

Best,
Sebastian

On 9/4/22 18:08, Mike wrote:
> Hello Sebastian!
> 
> Thanks for your answer!
> Is it possible to simply update the schema.xml file without re-indexing?
> 
> Thanks
> Mike
> 
> Am Fr., 2. Sept. 2022 um 13:25 Uhr schrieb Sebastian Nagel
> :
> 
>> Hi Mike,
>>
>> the Nutch/Solr schema.xml will be updated with the release of 1.19
>> (expected
>> soon, a vote about RC#1 is ongoing):
>>  [NUTCH-2955] - replace deprecated/removed field type solr.LatLonType
>>  [NUTCH-2957] - add fall-back field definitions for unknown index fields
>>  [NUTCH-2956] - typos in field names filled by index-geoip
>>
>> See the commits on the schema.xml
>>
>> https://github.com/apache/nutch/commits/master/src/plugin/indexer-solr/schema.xml
>>
>> Best,
>> Sebastian
>>
>>
>> On 8/31/22 14:02, Mike wrote:
>>> Hello!
>>>
>>>
>>> Will the schema.xml stay the same in Nutch 1.19?
>>>
>>> thanks!
>>>
>>> mike
>>>
>>
> 


Re: [VOTE] Release Apache Nutch 1.19 RC#1

2022-09-02 Thread Sebastian Nagel
Hi Markus,

thanks!

Could you share the files in

  .ivy2/cache/org.apache.httpcomponents/httpasyncclient/

and maybe also the logs of a Nutch build starting with an empty ~/.ivy2/cache?
I'll have a look and compare it with what I find on my system - maybe use a new
thread on user@ or a Jira issue. I plan to close the vote over the weekend,
so let's keep this thread for the release vote alone.
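Something along these lines would do (a sketch):

  # start from an empty Ivy cache and capture the build log
  rm -rf ~/.ivy2/cache
  ant clean runtime | tee nutch-build.log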

Best,
Sebastian

On 8/29/22 14:17, Markus Jelsma wrote:
> Hello Sebastian,
> 
> No, the JAR isn't present. Multiple JARs are missing, probably because they
> are loaded after httpasyncclient. I checked the previously emptied Ivy
> cache. The Ivy files are there, but the JAR is missing there too.
> 
> markus@midas:~$ ls .ivy2/cache/org.apache.httpcomponents/httpasyncclient/
> ivy-4.1.4.xml  ivy-4.1.4.xml.original  ivydata-4.1.4.properties
> 
> I manually downloaded the JAR from [1] and added it to the jars/ directory
> in the Ivy cache. It still cannot find the JAR, perhaps the Ivy cache needs
> some more things than just adding the JAR manually.
> 
> The odd thing is, that i got the URL below FROM the ivydata-4.1.4.properties
> file in the cache.
> 
> Since Ralf can compile it without problems, it seems to be an issue on my
> machine only. So Nutch seems fine, therefore +1.
> 
> Regards,
> Markus
> 
> [1]
> https://repo1.maven.org/maven2/org/apache/httpcomponents/httpasyncclient/4.1.4/
> 
> 
> Op zo 28 aug. 2022 om 12:05 schreef Sebastian Nagel
> :
> 
>> Hi Ralf,
>>
>>> It fetches it parses
>>
>> So a +1 ?
>>
>> Best,
>> Sebastian
>>
>> On 8/25/22 05:22, BlackIce wrote:
>>> nevermind I made a typo...
>>>
>>> It fetches it parses
>>>
>>> On Thu, Aug 25, 2022 at 3:42 AM BlackIce  wrote:
>>>>
>>>> so far... it doesn't select anything when creating segments:
>>>> 0 records selected for fetching, exiting
>>>>
>>>> On Wed, Aug 24, 2022 at 3:02 PM BlackIce  wrote:
>>>>>
>>>>> I have been able to compile under OpenJDK 11
>>>>> Have not done anything further so far
>>>>> I'm gonna try to get to it this evening
>>>>>
>>>>> Greetz
>>>>> Ralf
>>>>>
>>>>> On Wed, Aug 24, 2022 at 1:29 PM Markus Jelsma
>>>>>  wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Everything seems fine, the crawler seems fine when trying the binary
>>>>>> distribution. The source won't work because this computer still cannot
>>>>>> compile it. Clearing the local Ivy cache did not do much. This is the
>> known
>>>>>> compiler error with the elastic-indexer plugin:
>>>>>> compile:
>>>>>> [echo] Compiling plugin: indexer-elastic
>>>>>>[javac] Compiling 3 source files to
>>>>>> /home/markus/temp/apache-nutch-1.19/build/indexer-elastic/classes
>>>>>>[javac]
>>>>>>
>> /home/markus/temp/apache-nutch-1.19/src/plugin/indexer-elastic/src/java/org/apache/nutch/indexwriter/elastic/ElasticIndexWriter.java:39:
>>>>>> error: package org.apache.http.impl.nio.client does not exist
>>>>>>[javac] import
>> org.apache.http.impl.nio.client.HttpAsyncClientBuilder;
>>>>>>[javac]   ^
>>>>>>[javac] 1 error
>>>>>>
>>>>>>
>>>>>> The binary distribution works fine though. I do see a lot of new
>> messages
>>>>>> when fetching:
>>>>>> 2022-08-24 13:21:15,867 INFO o.a.n.n.URLExemptionFilters
>> [LocalJobRunner
>>>>>> Map Task Executor #0] Found 0 extensions at
>>>>>> point:'org.apache.nutch.net.URLExemptionFilter'
>>>>>>
>>>>>> This is also new at start of each task:
>>>>>> SLF4J: Class path contains multiple SLF4J bindings.
>>>>>> SLF4J: Found binding in
>>>>>>
>> [jar:file:/home/markus/temp/apache-nutch-1.19/lib/log4j-slf4j-impl-2.18.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>>>>>>
>>>>>> SLF4J: Found binding in
>>>>>>
>> [jar:file:/home/markus/temp/apache-nutch-1.19/lib/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>>>>>>
>>>>>> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
>>>>>> explanation.
>>>>>> SLF4

Re: Nutch 1.19 schema.xml

2022-09-02 Thread Sebastian Nagel
Hi Mike,

the Nutch/Solr schema.xml will be updated with the release of 1.19 (expected
soon, a vote about RC#1 is ongoing):
 [NUTCH-2955] - replace deprecated/removed field type solr.LatLonType
 [NUTCH-2957] - add fall-back field definitions for unknown index fields
 [NUTCH-2956] - typos in field names filled by index-geoip

See the commits on the schema.xml
  
https://github.com/apache/nutch/commits/master/src/plugin/indexer-solr/schema.xml

Best,
Sebastian


On 8/31/22 14:02, Mike wrote:
> Hello!
> 
> 
> Will the schema.xml stay the same in Nutch 1.19?
> 
> thanks!
> 
> mike
> 


Re: [VOTE] Release Apache Nutch 1.19 RC#1

2022-08-28 Thread Sebastian Nagel
Hi Ralf,

> It fetches it parses

So a +1 ?

Best,
Sebastian

On 8/25/22 05:22, BlackIce wrote:
> nevermind I made a typo...
> 
> It fetches it parses
> 
> On Thu, Aug 25, 2022 at 3:42 AM BlackIce  wrote:
>>
>> so far... it doesn't select anything when creating segments:
>> 0 records selected for fetching, exiting
>>
>> On Wed, Aug 24, 2022 at 3:02 PM BlackIce  wrote:
>>>
>>> I have been able to compile under OpenJDK 11
>>> Have not done anything further so far
>>> I'm gonna try to get to it this evening
>>>
>>> Greetz
>>> Ralf
>>>
>>> On Wed, Aug 24, 2022 at 1:29 PM Markus Jelsma
>>>  wrote:
>>>>
>>>> Hi,
>>>>
>>>> Everything seems fine, the crawler seems fine when trying the binary
>>>> distribution. The source won't work because this computer still cannot
>>>> compile it. Clearing the local Ivy cache did not do much. This is the known
>>>> compiler error with the elastic-indexer plugin:
>>>> compile:
>>>> [echo] Compiling plugin: indexer-elastic
>>>>[javac] Compiling 3 source files to
>>>> /home/markus/temp/apache-nutch-1.19/build/indexer-elastic/classes
>>>>[javac]
>>>> /home/markus/temp/apache-nutch-1.19/src/plugin/indexer-elastic/src/java/org/apache/nutch/indexwriter/elastic/ElasticIndexWriter.java:39:
>>>> error: package org.apache.http.impl.nio.client does not exist
>>>>[javac] import org.apache.http.impl.nio.client.HttpAsyncClientBuilder;
>>>>[javac]   ^
>>>>[javac] 1 error
>>>>
>>>>
>>>> The binary distribution works fine though. I do see a lot of new messages
>>>> when fetching:
>>>> 2022-08-24 13:21:15,867 INFO o.a.n.n.URLExemptionFilters [LocalJobRunner
>>>> Map Task Executor #0] Found 0 extensions at
>>>> point:'org.apache.nutch.net.URLExemptionFilter'
>>>>
>>>> This is also new at start of each task:
>>>> SLF4J: Class path contains multiple SLF4J bindings.
>>>> SLF4J: Found binding in
>>>> [jar:file:/home/markus/temp/apache-nutch-1.19/lib/log4j-slf4j-impl-2.18.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>>>>
>>>> SLF4J: Found binding in
>>>> [jar:file:/home/markus/temp/apache-nutch-1.19/lib/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>>>>
>>>> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
>>>> explanation.
>>>> SLF4J: Actual binding is of type
>>>> [org.apache.logging.slf4j.Log4jLoggerFactory]
>>>>
>>>> And this one at the end of fetcher:
>>>> log4j:WARN No appenders could be found for logger
>>>> (org.apache.commons.httpclient.params.DefaultHttpParams).
>>>> log4j:WARN Please initialize the log4j system properly.
>>>> log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for
>>>> more info.
>>>>
>>>> I am worried about the indexer-elastic plugin, maybe others have that
>>>> problem too? Otherwise everything seems fine.
>>>>
>>>> Markus
>>>>
>>>> Op ma 22 aug. 2022 om 17:30 schreef Sebastian Nagel :
>>>>
>>>>> Hi Folks,
>>>>>
>>>>> A first candidate for the Nutch 1.19 release is available at:
>>>>>
>>>>>https://dist.apache.org/repos/dist/dev/nutch/1.19/
>>>>>
>>>>> The release candidate is a zip and tar.gz archive of the binary and
>>>>> sources in:
>>>>>https://github.com/apache/nutch/tree/release-1.19
>>>>>
>>>>> In addition, a staged maven repository is available here:
>>>>>https://repository.apache.org/content/repositories/orgapachenutch-1020
>>>>>
>>>>> We addressed 87 issues:
>>>>>https://s.apache.org/lf6li
>>>>>
>>>>>
>>>>> Please vote on releasing this package as Apache Nutch 1.19.
>>>>> The vote is open for the next 72 hours and passes if a majority
>>>>> of at least three +1 Nutch PMC votes are cast.
>>>>>
>>>>> [ ] +1 Release this package as Apache Nutch 1.19.
>>>>> [ ] -1 Do not release this package because…
>>>>>
>>>>> Cheers,
>>>>> Sebastian
>>>>> (On behalf of the Nutch PMC)
>>>>>
>>>>> P.S.
>>>>> Here is my +1.
>>>>> - tested most of Nutch tools and run a test crawl on a single-node cluster
>>>>>   running Hadoop 3.3.4, see
>>>>>   https://github.com/sebastian-nagel/nutch-test-single-node-cluster/)
>>>>>


Re: [VOTE] Release Apache Nutch 1.19 RC#1

2022-08-28 Thread Sebastian Nagel
Hi Markus,

thanks!  What's your (final) decision?


>[javac] import org.apache.http.impl.nio.client.HttpAsyncClientBuilder;

During the build, the class should be provided in
  build/plugins/indexer-elastic/httpasyncclient-4.1.4.jar
Could you verify whether this jar is present and whether it contains the class
file? See also:
  
https://repo1.maven.org/maven2/org/apache/httpcomponents/httpasyncclient/4.1.4/
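
A quick way to check (a sketch; paths are relative to the unpacked source package):

  # Does the resolved jar contain the missing class?
  unzip -l build/plugins/indexer-elastic/httpasyncclient-4.1.4.jar \
    | grep 'org/apache/http/impl/nio/client/HttpAsyncClientBuilder'
  # If the jar is absent, check what Ivy actually resolved:
  find ~/.ivy2/cache -name 'httpasyncclient*.jar'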

> I am worried about the indexer-elastic plugin, maybe others have that
> problem too? Otherwise everything seems fine.

In order to fix it, we need to make the error reproducible and figure out
what the reason is.


Regarding the logging: we switched to log4j 2.x (NUTCH-2915) while Hadoop now
uses reload4j (HADOOP-18088 [1]). The logging configuration should be improved
to avoid the warnings in local mode. In distributed mode, the logging
configuration of the provided Hadoop takes over.
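
Not an official fix, but one possible local-mode workaround (a sketch, assuming the
log4j 2.x binding is the one to keep) is to remove the second SLF4J binding from the
binary package:

  # List the competing SLF4J backends shipped in lib/
  ls apache-nutch-1.19/lib/ | grep -E 'slf4j|reload4j'
  # Removing the reload4j *binding* (not reload4j itself) silences the
  # "multiple SLF4J bindings" warning in local mode.
  rm apache-nutch-1.19/lib/slf4j-reload4j-*.jar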


Best,
Sebastian

[1] https://issues.apache.org/jira/browse/HADOOP-18088


On 8/24/22 13:28, Markus Jelsma wrote:
> Hi,
> 
> Everything seems fine, the crawler seems fine when trying the binary
> distribution. The source won't work because this computer still cannot
> compile it. Clearing the local Ivy cache did not do much. This is the known
> compiler error with the elastic-indexer plugin:
> compile:
> [echo] Compiling plugin: indexer-elastic
>[javac] Compiling 3 source files to
> /home/markus/temp/apache-nutch-1.19/build/indexer-elastic/classes
>[javac]
> /home/markus/temp/apache-nutch-1.19/src/plugin/indexer-elastic/src/java/org/apache/nutch/indexwriter/elastic/ElasticIndexWriter.java:39:
> error: package org.apache.http.impl.nio.client does not exist
>[javac] import org.apache.http.impl.nio.client.HttpAsyncClientBuilder;
>[javac]   ^
>[javac] 1 error
> 
> 
> The binary distribution works fine though. I do see a lot of new messages
> when fetching:
> 2022-08-24 13:21:15,867 INFO o.a.n.n.URLExemptionFilters [LocalJobRunner
> Map Task Executor #0] Found 0 extensions at
> point:'org.apache.nutch.net.URLExemptionFilter'
> 
> This is also new at start of each task:
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in
> [jar:file:/home/markus/temp/apache-nutch-1.19/lib/log4j-slf4j-impl-2.18.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> 
> SLF4J: Found binding in
> [jar:file:/home/markus/temp/apache-nutch-1.19/lib/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> 
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
> explanation.
> SLF4J: Actual binding is of type
> [org.apache.logging.slf4j.Log4jLoggerFactory]
> 
> And this one at the end of fetcher:
> log4j:WARN No appenders could be found for logger
> (org.apache.commons.httpclient.params.DefaultHttpParams).
> log4j:WARN Please initialize the log4j system properly.
> log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for
> more info.
> 
> I am worried about the indexer-elastic plugin, maybe others have that
> problem too? Otherwise everything seems fine.
> 
> Markus
> 
> Op ma 22 aug. 2022 om 17:30 schreef Sebastian Nagel :
> 
>> Hi Folks,
>>
>> A first candidate for the Nutch 1.19 release is available at:
>>
>>https://dist.apache.org/repos/dist/dev/nutch/1.19/
>>
>> The release candidate is a zip and tar.gz archive of the binary and
>> sources in:
>>https://github.com/apache/nutch/tree/release-1.19
>>
>> In addition, a staged maven repository is available here:
>>https://repository.apache.org/content/repositories/orgapachenutch-1020
>>
>> We addressed 87 issues:
>>https://s.apache.org/lf6li
>>
>>
>> Please vote on releasing this package as Apache Nutch 1.19.
>> The vote is open for the next 72 hours and passes if a majority
>> of at least three +1 Nutch PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Nutch 1.19.
>> [ ] -1 Do not release this package because…
>>
>> Cheers,
>> Sebastian
>> (On behalf of the Nutch PMC)
>>
>> P.S.
>> Here is my +1.
>> - tested most of Nutch tools and run a test crawl on a single-node cluster
>>   running Hadoop 3.3.4, see
>>   https://github.com/sebastian-nagel/nutch-test-single-node-cluster/)
>>
> 


[VOTE] Release Apache Nutch 1.19 RC#1

2022-08-22 Thread Sebastian Nagel
Hi Folks,

A first candidate for the Nutch 1.19 release is available at:

   https://dist.apache.org/repos/dist/dev/nutch/1.19/

The release candidate is a zip and tar.gz archive of the binary and sources in:
   https://github.com/apache/nutch/tree/release-1.19

In addition, a staged maven repository is available here:
   https://repository.apache.org/content/repositories/orgapachenutch-1020

We addressed 87 issues:
   https://s.apache.org/lf6li


Please vote on releasing this package as Apache Nutch 1.19.
The vote is open for the next 72 hours and passes if a majority
of at least three +1 Nutch PMC votes are cast.

[ ] +1 Release this package as Apache Nutch 1.19.
[ ] -1 Do not release this package because…

Cheers,
Sebastian
(On behalf of the Nutch PMC)

P.S.
Here is my +1.
- tested most of the Nutch tools and ran a test crawl on a single-node cluster
  running Hadoop 3.3.4, see
  https://github.com/sebastian-nagel/nutch-test-single-node-cluster/)


Re: [DISCUSS] Release 1.19 ?

2022-08-10 Thread Sebastian Nagel
Hi Markus,

> i'll submit a patch to upgrade to the current 2.4.1.

Great!

I'll work through the open issues, try to get patches
or open PRs merged. Also decided to update the license
and notice files (there are a couple of open issues).

Best,
Sebastian



On 8/9/22 15:25, Markus Jelsma wrote:
> Sounds good!
> 
> I see we're still at Tika 2.3.0, i'll submit a patch to upgrade to the
> current 2.4.1.
> 
> Thanks!
> Markus
> 
> Op di 9 aug. 2022 om 09:11 schreef Sebastian Nagel :
> 
>> Hi all,
>>
>> more than 60 issues are done for Nutch 1.19
>>
>>   https://issues.apache.org/jira/projects/NUTCH/versions/12349580
>>
>> including
>>  - important dependency upgrades
>>- Hadoop 3.3.3
>>- Any23 2.7
>>- Tika 2.3.0
>>  - plugin-specific URL stream handlers (NUTCH-2429)
>>  - migration
>>- from Java/JDK 8 to 11
>>- from Log4j 1 to Log4j 2
>>
>> ... and various other fixes and improvements.
>>
>> The last release (1.18) happened in January 2021, so it's definitely high
>> time
>> to release 1.19. As usual, we'll check all remaining issues whether they
>> should
>> be fixed now or can be done in a later release.
>>
>> I would be ready to push a release candidate during the next two weeks and
>> will start to work through the remaining issues and also check for
>> dependency
>> upgrades required to address potential vulnerabilities. Please, comment on
>> issues you want to get fixed already in 1.19! Reviews of open pull
>> requests and
>> patches are also welcome!
>>
>> Thanks,
>> Sebastian
>>
> 


[DISCUSS] Release 1.19 ?

2022-08-09 Thread Sebastian Nagel
Hi all,

more than 60 issues are done for Nutch 1.19

  https://issues.apache.org/jira/projects/NUTCH/versions/12349580

including
 - important dependency upgrades
   - Hadoop 3.3.3
   - Any23 2.7
   - Tika 2.3.0
 - plugin-specific URL stream handlers (NUTCH-2429)
 - migration
   - from Java/JDK 8 to 11
   - from Log4j 1 to Log4j 2

... and various other fixes and improvements.

The last release (1.18) happened in January 2021, so it's definitely high time
to release 1.19. As usual, we'll check all remaining issues to decide whether they
should be fixed now or can be done in a later release.

I would be ready to push a release candidate during the next two weeks and
will start to work through the remaining issues and also check for dependency
upgrades required to address potential vulnerabilities. Please, comment on
issues you want to get fixed already in 1.19! Reviews of open pull requests and
patches are also welcome!

Thanks,
Sebastian


Re: Unable to create core Caused by: solr.LatLonType

2022-08-06 Thread Sebastian Nagel
Fyi, the issue is tracked on

  https://issues.apache.org/jira/browse/NUTCH-2955

~Sebastian

On 7/14/22 12:54, Sebastian Nagel wrote:
> Hi Mike,
> 
> if you do not use the plugin index-geoip, you could simply delete the line
> 
> subFieldSuffix="_coordinate"/>
> 
> 
> Otherwise, after the deprecation and the removal of the LatLonType class [1],
> it should be:
> 
>   
> 
> But I haven't verified whether indexing with index-geoip enabled and the
> retrieval works.
> 
> 
> In any case, please open a Jira issue on
>https://issues.apache.org/jira/projects/NUTCH
> Thanks!
> 
> 
> Best,
> Sebastian
> 
> [1]
> https://solr.apache.org/docs/8_11_2/solr-core/org/apache/solr/schema/LatLonType.html
> 
> On 7/12/22 17:26, Mike wrote:
>> Hello!
>>
>> Is Nutch 1.18 compatible with Solr 9.0? I get an error when creating a core
>> with the Nutch schema.xml file:
>>
>>   # sudo -u solr /opt/solr/bin/solr create -c core01 -d
>> /opt/solr/server/solr/configsets/core01/conf/
>>
>> ERROR: Error CREATEing SolrCore 'core01': Unable to create core [core01]
>> Caused by: solr.LatLonType
>>
>> Thanks
>>
>> Mike
>>


Re: Question about Nutch plugins

2022-07-24 Thread Sebastian Nagel
Hi Rastko,

the description isn't really correct now as NUTCH_HOME is supposed to point to
the runtime

- if the binary package is used: this is the base folder of the package,
  eg. apache-nutch-1.18/

- if Nutch is built from the source, you usually point NUTCH_HOME to
  runtime/local/ - the directory tree below this folder looks pretty much
  the same as the binary package

Older versions of Nutch didn't have this separation of source and runtime.


> I use nutch by just unzipping apache-nutch-1.17-bin.tar.gz)?

If you want to build your own plugin, I'd recommend starting from
the Nutch source package, or even from the current master by cloning the
Nutch git repository.
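
A minimal sketch of that route (the plugin name is just an example):

  git clone https://github.com/apache/nutch.git
  cd nutch
  ant runtime                     # builds the core plus all plugins into runtime/local/
  cd src/plugin/urlfilter-regex   # or the directory of your own plugin
  ant                             # builds a single plugin against the source tree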


As always for a community project: feel free to improve the tutorial, obviously
it might be out of date.


Best,
Sebastian


On 7/23/22 13:28, Rastko.pavlovic wrote:
> Hi all,
> 
> I've been trying to implement this tutorial 
> https://cwiki.apache.org/confluence/display/nutch/WritingPluginExample on 
> Nutch 1.17. In several places, the tutorial refers to $NUTCH_HOME/src/plugin. 
> However, in my $NUTCH_HOME I only have a "plugin" directory and no src. If I 
> try building the plugin in a sub directory of the "plugin" directory with 
> ant, I get a problem where build.xml complains that it can't find 
> build-plugin.xml.
> 
> Does anyone maybe know what am I doing wrong (in case it helps, I use nutch 
> by just unzipping apache-nutch-1.17-bin.tar.gz)?
> 
> Many thanks in advance.
> 
> Best regards,
> Rastko
> 


Re: Problem with Nutch <-> Eclipse

2022-07-19 Thread Sebastian Nagel
Hi Bob,

could you share which instructions you followed and at what point the error happens:
during import, project build, or running/debugging?

The usual way is

1. to write the Eclipse project configuration, run

   ant eclipse

2. import the written project configuration into Eclipse


Building or running/debugging Nutch in Eclipse is possible, although it requires
some work to get everything right.


Best,
Sebastian


On 7/15/22 23:00, Robert Scavilla wrote:
> Hello Kind People, I am trying to set up Nutch with eclipse. I am following
> the instructions and have an issue that I have not been able to resolve
> yet. I have the error: *"package org.w3c.dom is accessible from more than
> one module"*
> 
> There are several modules that get this same error message. The project
> compiles from the command line without error. It is not clear to me how to
> resolve this and I hope you can help.
> 
> Thank you!
> ...bob
> 


Re: Unable to create core Caused by: solr.LatLonType

2022-07-14 Thread Sebastian Nagel
Hi Mike,

if you do not use the plugin index-geoip, you could simply delete the line

  


Otherwise, after the deprecation and the removal of the LatLonType class [1],
it should be:

  

But I haven't verified whether indexing with index-geoip enabled and the
subsequent retrieval still work.
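
For the first option (index-geoip not used), a hedged sketch using the paths from
the original report:

  # Remove the fieldType line that references the removed solr.LatLonType class
  sed -i '/solr\.LatLonType/d' /opt/solr/server/solr/configsets/core01/conf/schema.xml
  # then retry creating the core
  sudo -u solr /opt/solr/bin/solr create -c core01 \
    -d /opt/solr/server/solr/configsets/core01/conf/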


In any case, please open a Jira issue on
   https://issues.apache.org/jira/projects/NUTCH
Thanks!


Best,
Sebastian

[1]
https://solr.apache.org/docs/8_11_2/solr-core/org/apache/solr/schema/LatLonType.html

On 7/12/22 17:26, Mike wrote:
> Hello!
> 
> Is Nutch 1.18 compatible with Solr 9.0? I get an error when creating a core
> with the Nutch schema.xml file:
> 
>   # sudo -u solr /opt/solr/bin/solr create -c core01 -d
> /opt/solr/server/solr/configsets/core01/conf/
> 
> ERROR: Error CREATEing SolrCore 'core01': Unable to create core [core01]
> Caused by: solr.LatLonType
> 
> Thanks
> 
> Mike
> 


Re: Does Nutch work with Hadoop Versions greater than 3.1.3?

2022-06-13 Thread Sebastian Nagel
Hi Michael,

Nutch (1.18, and trunk/master) should work together with more recent Hadoop
versions.

At Common Crawl we use a modified Nutch version based on the recent trunk
running on Hadoop 3.2.2 (soon 3.2.3) and Java 11, even on a mixed Hadoop cluster
with x64 and arm64 AWS EC2 instances.

But I'm sure there are more possible combinations.

One important note: in trunk/master there is a still unsolved regression caused by
the newly introduced plugin-based URL stream handlers, see NUTCH-2936 and
NUTCH-2949. Until these are resolved, you need to revert the corresponding commits
in order to run Nutch (built from trunk/master) in distributed mode.
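
A hedged sketch of that work-around (the commit IDs depend on your checkout and are
placeholders here):

  # Find the commits that introduced the plugin-based URL stream handlers
  git log --oneline --grep='NUTCH-2429'
  # Revert them (repeat for each commit listed above), then rebuild
  git revert --no-edit <commit-id>
  ant clean runtime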

Best,
Sebastian

On 6/13/22 01:37, Michael Coffey wrote:
> Do current 1.x versions of Nutch (1.18, and trunk/master) work with versions 
> of Hadoop greater than 3.1.3? I ask because Hadoop 3.1.3 is from October 
> 2019, and there are many newer versions available. For example, 3.1.4 came 
> out in 2020, and there are 3.2.x and 3.3.x versions that came out this year.
> 
> I don’t care about newer features in Hadoop, I just have general concerns 
> about stability and security. I am working on reviving an old project and 
> would like to put together the best possible infrastructure for the future.
> 
> 


Re: FW: After update from 1.11 to 1.13 form login does not work

2022-05-10 Thread Sebastian Nagel
Hi Michael,

the only differences in the protocol-httpclient plugin between Nutch 1.11 and
1.13 are
- NUTCH-2280 [1], which allows configuring the cookie policy
- NUTCH-2355 [2], which allows setting an explicit cookie for a request URL

Could this be related?

Are there any useful hints about the reason in the log messages if you set
  log4j.logger.org.apache.nutch.protocol.httpclient=TRACE
in the log4j.properties?
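
For example (a sketch; the URL is a placeholder for one of the pages behind the login):

  echo 'log4j.logger.org.apache.nutch.protocol.httpclient=TRACE' >> conf/log4j.properties
  bin/nutch parsechecker -Dplugin.includes='protocol-httpclient|parse-tika' \
    -followRedirects https://support.example.com/protected/page
  less logs/hadoop.log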

Best,
Sebastian

[1] https://issues.apache.org/jira/browse/NUTCH-2280
[2] https://issues.apache.org/jira/browse/NUTCH-2355

On 5/9/22 15:30, Fritsch, Michael wrote:
> Hello,
> 
> I used nutch 1.11 to crawl pages behind a login page.
> The http-auth configuration looked like this:
> 
> ---
> 
> 
>   
> loginUrl=loginURL
>loginFormId="loginForm"
>loginRedirect="true">
> 
> value="username"/>
> value="password"/>
> 
> 
> 
>   
> 
> 
> 
> Everything worked fine. Then I updated to 1.13 (I also tried 1.18) and 
> changed the configuration as described in the http-auth.xml file:
> 
> -
> 
> 
>   
> loginUrl=loginURL
>loginFormId="loginForm"
>loginRedirect="true">
> 
> value="username"/>
> value="password"/>
> 
> 
> 
> 
> 
> 
>   BROWSER_COMPATIBILITY
> 
>   
> 
> 
> 
> ---
> 
> Now, the login does not work anymore. After some redirects, it gives an HTTP 
> 403 response. I tried all loginCookie policy entries, but nothing worked.
> The login is to a Zendesk support system with Atlassian Crowd as a login 
> provider. Has anything changed between 1.11 and 1.13? Is something more strict 
> than before?
> 
> 
> I found a very similar question in this mailing list 
> (https://www.mail-archive.com/user@nutch.apache.org/msg15746.html) from 
> 2017, which has no solutions.
> 
> I would appreciate any help!
> 
> Best regards
> 
> Michael
> 
> 
> Dr. Michael Fritsch
> Technical Editor
> 
> T: +49.40.325587.214
> E: michael.frit...@coremedia.com
> 
> CoreMedia GmbH - Be iconic
> Ludwig-Erhard-Str. 18
> 20459 Hamburg, Germany
> www.coremedia.com
> 
> Managing Directory: Sören Stamer
> Commercial Register: Amtsgericht Hamburg, HR B 162480
> --
> Stay up to date and follow us on 
> LinkedIn or 
> Twitter
> 
> 


Re: Nutch not crawling all URLs

2022-01-13 Thread Sebastian Nagel
Hi Roseline,

> Does it work at all with Chrome?

Yes.

> It seems you need to have some form of GUI to run it?

You need graphics libraries but not necessarily a graphical system.
Normally, you run the browser in headless mode without a graphical
device (monitor) attached.

> Is there some documentation or tutorial on this?

The README is probably the best documentation:
  src/plugin/protocol-selenium/README.md
  https://github.com/apache/nutch/tree/master/src/plugin/protocol-selenium

After installing chromium and the Selenium chromedriver, you can test whether it
works by running:

bin/nutch parsechecker \
  -Dplugin.includes='protocol-selenium|parse-tika' \
  -Dselenium.grid.binary=/path/to/selenium/chromedriver  \
  -Dselenium.driver=chrome \
  -Dselenium.enable.headless=true \
  -followRedirects -dumpText  URL


Caveat: because browsers are updated frequently, you may need to use a recent
driver version and possibly also upgrade the Selenium dependencies in Nutch.
Let us know if you need help here.


> My use case is Text mining  and Machine Learning classification. I'm indexing
> into Solr and then transferring the indexed data to MongoDB for further
> processing.

Well, that's not an untypical use case for Nutch. And it's a long pipeline:
fetching, HTML parsing, extracting content fields, indexing. Nutch is able to
perform all steps. But I'd agree that browser-based crawling isn't that easy
to set up with Nutch.

Best,
Sebastian

On 1/12/22 17:53, Roseline Antai wrote:
> Hi Sebastian,
> 
> Thank you. I did enjoy the holiday. Hope you did too. 
> 
> I have had a look at the protocol-selenium plugin, but it was a bit difficult 
> to understand. It appears it only works with Firefox. Does it work at all 
> with Chrome? I was also not sure of what values to set for the properties. It 
> seems you need to have some form of GUI to run it?
> 
> Is there some documentation or tutorial on this? My guess is that some of the 
> pages might not be crawling because of JavaScript. I might be wrong, but 
> would want to test that.
> 
> I think would be quite good for my use case because I am trying to implement 
> broad crawling. 
> 
> My use case is Text mining  and Machine Learning classification. I'm indexing 
> into Solr and then transferring the indexed data to MongoDB for further 
> processing.
> 
> Kind regards,
> Roseline
> 
> 
> 
> 
> 
> -Original Message-
> From: Sebastian Nagel  
> Sent: 12 January 2022 16:12
> To: user@nutch.apache.org
> Subject: Re: Nutch not crawling all URLs
> 
> Hi Roseline,
> 
>> the mail below went to my junk folder and I didn't see it.
> 
> No problem. I hope you nevertheless enjoyed the holidays.
> And sorry for any delays but I want to emphasize that Nutch is a community 
> project and in doubt it might take a few days until somebody finds the time 
> to respond.
> 
>> Could you confirm if you received all the urls I sent?
> 
> I've tried a view URLs you sent but not all of them. And to figure out the 
> reason why a site isn't crawled may take some time.
> 
>> Another question I have about Nutch is if it has problems with 
>> crawling javascript pages?
> 
> By default Nutch does not execute Javascript.
> 
> There is a protocol plugin (protocol-selenium) to fetch pages with a web 
> browser between Nutch and the crawled sites. This way Javascript pages can be 
> crawled for the price of some overhead in setting up the crawler and network 
> traffic to fetch the page dependencies (CSS, Javascript, images).
> 
>> I would ideally love to make the crawler work for my URLs than start 
>> checking for other crawlers and waste all the work so far.
> 
> Well, Nutch is for sure a good crawler. But as always: there are many other 
> crawlers which might be better adapted to a specific use case.
> 
> What's your use case? Indexing into Solr or Elasticsearch?
> Text mining? Archiving content?
> 
> Best,
> Sebastian
> 
> On 1/12/22 12:13, Roseline Antai wrote:
>> Hi Sebastian,
>>
>> For some reason, the mail below went to my junk folder and I didn't see it.
>>
>> The notco page - https://notco.com/ was not indexed, no. When I enabled redirects, I was able to get a few 
>> pages, but they don't seem valid.
>>
>> Could you confirm if you received all the urls I sent

Re: Nutch not crawling all URLs

2022-01-12 Thread Sebastian Nagel
Hi Roseline,

> the mail below went to my junk folder and I didn't see it.

No problem. I hope you nevertheless enjoyed the holidays.
And sorry for any delays, but I want to emphasize that Nutch is
a community project, and it might sometimes take a few days
until somebody finds the time to respond.

> Could you confirm if you received all the urls I sent?

I've tried a few of the URLs you sent but not all of them. And figuring out the
reason why a site isn't crawled may take some time.

> Another question I have about Nutch is if it has problems with crawling
> javascript pages?

By default Nutch does not execute Javascript.

There is a protocol plugin (protocol-selenium) to fetch pages with a web
browser between Nutch and the crawled sites. This way Javascript pages
can be crawled for the price of some overhead in setting up the crawler and
network traffic to fetch the page dependencies (CSS, Javascript, images).

> I would ideally love to make the crawler work for my URLs than start checking
> for other crawlers and waste all the work so far.

Well, Nutch is for sure a good crawler. But as always: there are many
other crawlers which might be better adapted to a specific use case.

What's your use case? Indexing into Solr or Elasticsearch?
Text mining? Archiving content?

Best,
Sebastian

On 1/12/22 12:13, Roseline Antai wrote:
> Hi Sebastian,
> 
> For some reason, the mail below went to my junk folder and I didn't see it.
> 
> The notco page - https://notco.com/  was not indexed, no. When I enabled 
> redirects, I was able to get a few pages, but they don't seem valid.
> 
> Could you confirm if you received all the urls I sent?
> 
> Another question I have about Nutch is if it has problems with crawling 
> javascript pages?
> 
> I would ideally love to make the crawler work for my URLs than start checking 
> for other crawlers and waste all the work so far.
> 
> Just adding again, this is what my nutch-site.xml looks like:
> 
> <?xml version="1.0"?>
> <configuration>
> 
> <property>
>  <name>http.agent.name</name>
>  <value>Nutch Crawler</value>
> </property>
> <property>
>  <name>http.agent.email</name>
>  <value>datalake.ng at gmail d</value>
> </property>
> <property>
>  <name>db.ignore.internal.links</name>
>  <value>false</value>
> </property>
> <property>
>  <name>db.ignore.external.links</name>
>  <value>true</value>
> </property>
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|language-identifier</value>
> </property>
> <property>
>  <name>parser.skip.truncated</name>
>  <value>false</value>
>  <description>Boolean value for whether we should skip parsing for 
>  truncated documents. By default this
>  property is activated due to extremely high levels of CPU which 
>  parsing can sometimes take.
>  </description>
> </property>
> <property>
>   <name>db.max.outlinks.per.page</name>
>   <value>-1</value>
>   <description>The maximum number of outlinks that we'll process for a page.
>   If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
>   will be processed for a page; otherwise, all outlinks will be processed.
>   </description>
> </property>
> <property>
>   <name>http.content.limit</name>
>   <value>-1</value>
>   <description>The length limit for downloaded content using the http://
>   protocol, in bytes. If this value is nonnegative (>=0), content longer
>   than it will be truncated; otherwise, no truncation at all. Do not
>   confuse this setting with the file.content.limit setting.
>   </description>
> </property>
> <property>
>   <name>db.ignore.external.links.mode</name>
>   <value>byHost</value>
> </property>
> <property>
>   <name>db.injector.overwrite</name>
>   <value>true</value>
> </property>
> <property>
>   <name>http.timeout</name>
>   <value>5</value>
>   <description>The default network timeout, in milliseconds.</description>
> </property>
> </configuration>
> 
> Regards,
> Roseline
> 
> -Original Message-
> From: Sebastian Nagel  
> Sent: 13 December 2021 17:35
> To: user@nutch.apache.org
> Subject: Re: Nutch not crawling all URLs
> 
> 
> Hi Roseline,
> 
>> 5,36405,0,http://www.notco.com
> 
> What is the status for https://notco.com/ which is the final redirect
> target?
> Is the target page indexed?
> 
> ~Sebastian
> 


Re: Nutch not crawling all URLs

2021-12-13 Thread Sebastian Nagel
Hi Ayhan,

you mean?
https://stackoverflow.com/questions/69352136/nutch-does-not-crawl-sites-that-allows-all-crawler-by-robots-txt

Sebastian

On 12/13/21 20:59, Ayhan Koyun wrote:
> Hi,
> 
> as I wrote before, it seems that I am not the only one who can not crawl all 
> the seed.txt url's. I couldn't
> find a solution really. I collected 450 domains and approximately 200 nutch 
> will or can not crawl. I want to
> know why this happens, is there a solution to force crawling sites?
> 
> It would be great to get a satisfying answer, to know why this happens and 
> maybe how to solve it.
> 
> Thanks in advance
> 
> Ayhan
> 
> 


Re: Nutch not crawling all URLs

2021-12-13 Thread Sebastian Nagel
Hi Roseline,

> 5,36405,0,http://www.notco.com

What is the status for https://notco.com/ which is the final redirect
target?
Is the target page indexed?

~Sebastian


Re: Nutch not crawling all URLs

2021-12-13 Thread Sebastian Nagel
Hi,

(looping back to user@nutch - sorry, pressed the wrong reply button)

> Some URLs were denied by robots.txt,
> while a few failed with: Http code=403

Those are two ways to signal that these pages shouldn't be crawled;
HTTP 403 means "Forbidden".

> 3. I looked in CrawlDB and most URLs are in there, but were not
> crawled, so this is something that I find very confusing.

The CrawlDb also contains URLs which failed for various reasons.
That's important in order to avoid that 404s, 403s, etc. are retried
again and again.
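
To see what the CrawlDb recorded for a particular URL (a sketch; the crawl directory
and URL are placeholders):

  bin/nutch readdb crawl/crawldb -url https://www.example.com/
  # or dump the whole CrawlDb and inspect the status and metadata fields
  bin/nutch readdb crawl/crawldb -dump crawldb-dump -format csv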

> I also ran some of the URLs that were not crawled through this -
>  bin/nutch parsechecker -followRedirects -checkRobotsTxt https://myUrl
>
> Some of the URLs that failed were parsed successfully, so I'm really
> confused as to why there are no results for them.
>

The "HTTP 403 Forbidden" could be from a "anti-bot protection" software.
If you run parsechecker at a different time or from a different machine,
and not repeatedly or too often it may succeed.

Best,
Sebastian

On 12/13/21 17:48, Roseline Antai wrote:
> Hi Sebastian,
> 
> Thank you for your reply.
> 
> 1. All URLs were injected, so 20 in total. None was rejected.
> 
> 2. I've had a look at the log files and I can see that some of the URLs could 
> not be fetched because the robot.txt file could not be found. Would this be a 
> reason for why the fetch failed? Is there a way to go around it?
> 
> Some URLs were denied by robots.txt, while a few failed with: Http code=403 
> 
> 3. I looked in CrawlDB and most URLs are in there, but were not crawled, so 
> this is something that I find very confusing.
> 
> I also ran some of the URLs that were not crawled through this - bin/nutch 
> parsechecker -followRedirects -checkRobotsTxt https://myUrl
> 
> Some of the URLs that failed were parsed successfully, so I'm really confused 
> as to why there are no results for them.
> 
> Do you have any suggestions on what I should try?
> 
> Dr Roseline Antai
> Research Fellow
> Hunter Centre for Entrepreneurship
> Strathclyde Business School
> University of Strathclyde, Glasgow, UK
> 
> 
> The University of Strathclyde is a charitable body, registered in Scotland, 
> number SC015263.
> 
> 
> -Original Message-
> From: Sebastian Nagel  
> Sent: 13 December 2021 12:19
> To: Roseline Antai 
> Subject: Re: Nutch not crawling all URLs
> 
> 
> Hi Roseline,
> 
>> For instance, when I inject 20 URLs, only 9 are fetched.
> 
> Are there any log messages about the 11 unfetched URLs in the log files.  Try 
> to look for a file "hadoop.log"
> (usually in $NUTCH_HOME/logs/) and look
>  1. how many URLs have been injected.
> There should be a log message
>  ... Total new urls injected: ...
>  2. If all 20 URLs are injected, there should be log
> messages about these URLs from the fetcher:
>  FetcherThread ... fetching ...
> If the fetch fails, there might be a message about
> this.
>  3. Look into the CrawlDb for the missing URLs.
>   bin/nutch readdb .../crawldb -url 
> or
>   bin/nutch readdb .../crawldb -dump ...
> You get the command-line options by calling
>   bin/nutch readdb
> without any arguments
> 
> Alternatively, verify fetching and parsing the URLs by
>   bin/nutch parsechecker -followRedirects -checkRobotsTxt https://myUrl
> 
> 
>> 
>> db.ignore.external.links
>> true
>> 
> 
> Eventually, you want to follow redirects anyway? See
> 
> 
>   db.ignore.also.redirects
>   true
>   If true, the fetcher checks redirects the same way as
>   links when ignoring internal or external links. Set to false to
>   follow redirects despite the values for db.ignore.external.links and
>   db.ignore.internal.links.
>   
> 
> 
> Best,
> Sebastian
> 
> 
> On 12/13/21 13:02, Roseline Antai wrote:
>> Hi,
>>
>>
>>
>> I am working with Apache nutch 1.18 and Solr. I have set up the system 
>> successfully, but I’m now having the problem that Nutch is refusing to 
>> crawl all the URLs. I am now at a loss as to what I should do to 
>> correct this problem. It fetches about half of the URLs in the seed.txt file.
>>
>>
>>
>> For instance, when I inject 20 URLs, only 9 are fetched. I have made a 
>> number of changes based on the suggestions I saw on the Nutch forum, 
>> as well as on Stack overflow, but nothing seems to work.
>>
>>
>>
>> This is what my nutch-site.xml file looks like:
>>
>>
>>
>>
>>
>>

Re: Error When Connecting Elasticsearch with HTTPS Connection

2021-11-18 Thread Sebastian Nagel
Hi Shi Wei,

fyi: a fix for NUTCH-2903 is ready
  https://github.com/apache/nutch/pull/703

Sebastian


On 11/16/21 13:54, Sebastian Nagel wrote:
> Hi Shi Wei,
> 
> looks like you're the first trying to connect to ES from Nutch over
> HTTPS.  HTTP is used as default scheme and there is no way to configure
> the Elasticsearch index writer to use HTTPS.
> 
> Please open a Jira issue. It's a trivial fix.
> 
> 
> For a quick fix: in the Nutch source package (or git tree) edit the file
> 
> 
> src/plugin/indexer-elastic/src/java/org/apache/nutch/indexwriter/elastic/ElasticIndexWriter.java
> 
> and change the line
> 
> hostsList[i++] = new HttpHost(host, port);
> into
> hostsList[i++] = new HttpHost(host, port, "https");
> 
> After that you need to build nutch by running
> 
>ant runtime
> 
> 
> See
> 
> https://www.javadoc.io/static/org.apache.httpcomponents/httpcore/4.3.2/org/apache/http/HttpHost.html#HttpHost(java.lang.String,%20int,%20java.lang.String)
> 
> Best,
> Sebastian
> 
> On 11/16/21 09:49, sw.l...@quandatics.com wrote:
>> Hi there,
>>
>> We encountered an issue when connecting to our Elasticsearch with HTTPS
>> connection as follow:
>>
>> 2021-11-16 16:37:22,034 DEBUG client.RestClient - request [POST
>> http://192.168.0.105:9200/_bulk?timeout=1m] failed
>> org.apache.http.ConnectionClosedException: Connection is closed
>>     at
>> org.apache.http.nio.protocol.HttpAsyncRequestExecutor.endOfInput(HttpAsyncRequestExecutor.java:356)
>>
>>     at
>> org.apache.http.impl.nio.client.InternalRequestExecutor.endOfInput(InternalRequestExecutor.java:132)
>>
>>     at
>> org.apache.http.impl.nio.DefaultNHttpClientConnection.consumeInput(DefaultNHttpClientConnection.java:261)
>>
>>     at
>> org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:81)
>>
>>     at
>> org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:39)
>>
>>     at
>> org.apache.http.impl.nio.reactor.AbstractIODispatch.inputReady(AbstractIODispatch.java:114)
>>
>>     at
>> org.apache.http.impl.nio.reactor.BaseIOReactor.readable(BaseIOReactor.java:162)
>>
>>     at
>> org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvent(AbstractIOReactor.java:337)
>>
>>     at
>> org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvents(AbstractIOReactor.java:315)
>>
>>     at
>> org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:276)
>>
>>     at
>> org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104)
>>
>>     at
>> org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:591)
>>
>>     at java.lang.Thread.run(Thread.java:748)
>>
>> Looks like the HTTP connection is established instead of HTTPS. Could
>> you advise if there are any required parameters that we missed out in
>> the index-writers.xml? Attached are the log file and index-writers.xml.
>>
>> Thanks in advance!
>>
>>
>> Best Regards,
>> Shi Wei


Re: javax.net.ssl.SSLHandshakeException Error when Executing Nutch with Selenium Plugin

2021-11-18 Thread Sebastian Nagel
The issue is now tracked in
  https://issues.apache.org/jira/browse/NUTCH-2907

On 10/28/21 15:31, Sebastian Nagel wrote:
> Hi Shi Wei,
> 
> sorry, but it looks like the Selenium protocol plugin has never been
> used with a proxy over https. There are two points which need (at a
> first glance) a rework:
> 
> 1. the protocol tries to establish a TLS/SSL connection to the proxy if
> the URL to be crawled is a https:// URL. There might be some proxies
> which can do this, but the proxies I'm aware of expect a HTTP CONNECT
> [1] for HTTPS proxying.
> 
> 2. probably also the browser / driver needs to be configured to
> use the same proxy. Afaics, this isn't done but is a requirement
> if the proxy is required for accessing web content. However, it
> might be possible by setting environment variables.
> 
> Sorry again. Feel free to open a Jira issue to get this fixed.
> 
> Best,
> Sebastian
> 
> [1] https://en.wikipedia.org/wiki/HTTP_tunnel#HTTP_CONNECT_method
> 
> 
> On 10/28/21 11:45, sw.l...@quandatics.com wrote:
>> Hi there,
>>
>>  
>>
>> Good day!
>>
>>  
>>
>> We would like to crawl the web data by executing the Nutch with Selenium
>> plugin with the following command:
>>
>>  
>>
>> $ nutch plugin protocol-selenium org.apache.nutch.protocol.selenium.Http
>> https://cwiki.apache.org/confluence/display/NUTCH/NutchTutorial
>>
>>  
>>
>> However, it failed with the following error message:
>>
>>  
>>
>> 2021-10-26 19:07:53,961 INFO  selenium.Http - http.proxy.host = xxx.xx.xx.xx
>>
>> 2021-10-26 19:07:53,962 INFO  selenium.Http - http.proxy.port = 
>>
>> 2021-10-26 19:07:53,962 INFO  selenium.Http - http.proxy.exception.list =
>> true
>>
>> 2021-10-26 19:07:53,962 INFO  selenium.Http - http.timeout = 1
>>
>> 2021-10-26 19:07:53,962 INFO  selenium.Http - http.content.limit = 1048576
>>
>> 2021-10-26 19:07:53,962 INFO  selenium.Http - http.agent = Apache Nutch
>> Test/Nutch-1.18
>>
>> 2021-10-26 19:07:53,962 INFO  selenium.Http - http.accept.language =
>> en-us,en-gb,en;q=0.7,*;q=0.3
>>
>> 2021-10-26 19:07:53,962 INFO  selenium.Http - http.accept =
>> text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
>>
>> 2021-10-26 19:07:53,962 INFO  selenium.Http - http.enable.cookie.header =
>> true
>>
>> 2021-10-26 19:07:54,114 ERROR selenium.Http - Failed to get protocol output
>>
>> javax.net.ssl.SSLHandshakeException: Remote host closed connection during
>> handshake
>>
>> at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:994)
>>
>> at sun.security.ssl.SSL
>>
>>  
>>
>> FYI, we have tried the following approaches but the issues persisted.
>>
>>  
>>
>> 1. Set the http.tls.certificates.check to false
>>
>> 2. Import the website's certificates to our java truststores
>>
>> 3. Our Nutch is configured with proxy
>>
>>  
>>
>> Kindly advise. Thanks in advance!
>>
>>  
>>
>>  
>>
>> Best Regards,
>>
>> Shi Wei
>>
>>  
>>
>>


Re: encrypt password of the index-writer.xml

2021-11-17 Thread Sebastian Nagel
Hi Shi Wei,

(looping back to user@nutch - sorry, should have replied to the list)

First, the masking of sensitive strings is tracked in
   https://issues.apache.org/jira/browse/NUTCH-2905

Second, to disable the logging:

The logging class is IndexerOutputFormat, so you need to add

  log4j.logger.org.apache.nutch.indexer.IndexerOutputFormat=WARN

or, for Nutch 1.19 and the current master, edit the file conf/log4j2.xml
and add the corresponding logger to the list of loggers:
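
(The exact snippet was stripped from the archived message; the following is a hedged
sketch assuming standard Log4j 2 syntax.)

  # Assumed Log4j 2 logger line, to be placed inside the <Loggers> element:
  #   <Logger name="org.apache.nutch.indexer.IndexerOutputFormat" level="WARN"/>
  grep -n '</Loggers>' conf/log4j2.xml   # add the line just above this closing tag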


  


Best,
Sebastian

On 11/17/21 14:07, sw.l...@quandatics.com wrote:
> Hi Sebastian,
>  
> Thanks for your reply.
>  
> According to the statement, "You could set the log level for the class
> logging the password from INFO to WARN.", may we know which
> class/parameter that we should set to only restrict the Elasticsearch
> indexer logs to WARN level? This is because we have tried to set the
> following in the log4j.properties but it doesn't help.
>  
> log4j.logger.org.apache.nutch.indexwriter.elastic.ElasticIndexWriter=WARN,cmdstdout
> log4j.logger.org.apache.nutch.indexwriter.elastic.ElasticUtils=WARN,cmdstdout
>  
>  
> Best Regards,
> Shi Wei
>  
> On 2021-11-15 21:26, Sebastian Nagel wrote:
>> Hi Shi Wei,
>>
>>> hide the password value in hadoop.log file table ?
>>
>> You could set the log level for the class logging the password from INFO
>> to WARN. Then the index writer configuration isn't logged anymore.
>> As said, this is a work-around not a final solution which should,
>> of course, mask passwords when logging.
>>
>>> We also ran into an issue where an https connection could not be
>>> established with elasticsearch
>>
>> If the problem persists could you start a separate thread?
>>
>> Thanks,
>> Sebastian
>>
>> On 11/12/21 10:57, sw.l...@quandatics.com
>> <mailto:sw.l...@quandatics.com> wrote:
>>> Hi, Sebastian
>>>
>>> Thanks for your suggestion, may I know if there is a way to hide the
>>> password value in hadoop.log file table ?
>>> We also ran into an issue where an https connection could not be
>>> established with elasticsearch. Do you have any suggestions to solve
>>> this problem?
>>> Thank
>>>
>>>
>>>
>>> Best Regards,
>>>  Shi Wei
>>>
>>> -Original Message-
>>> From: Sebastian Nagel >> <mailto:wastl.na...@googlemail.com.INVALID>>
>>> Sent: Friday, 12 November, 2021 1:20 AM
>>> To: user@nutch.apache.org <mailto:user@nutch.apache.org>
>>> Subject: Re: encrypt password of the index-writer.xml
>>>
>>> Hi Shi Wei,
>>>
>>> there is a way, although definitely not the recommended one.
>>> Sorry, and it took me a little bit to proof it.
>>>
>>> Do you know about external XML entities or XXE attacks?
>>>
>>> 1. On top of the index-writers.xml you add an entity declaration:
>>>
>>> 
>>> >>   
>>> ]>
>>>
>>>
>>> 2. it's used later in the index writer spec:
>>>
>>>   >>   class="org.apache.nutch.indexwriter.solr.SolrIndexWriter">
>>> 
>>>   ...
>>>   
>>> 
>>>
>>> 3. you add your credentials snippet to the file /path/to/credentials.txt
>>>
>>>  >> value="SECRET"/>
>>>
>>> 4. and voila:
>>>
>>> $> bin/nutch index crawldb segment
>>> ...
>>> ├┼─┼─┤
>>> │username    │The username of Solr server. │username │
>>> ├┼─┼─┤
>>> │password    │The password of Solr server. │SECRET   │
>>> └┴─┴─┘
>>>
>>>
>>> Note: this is an dirty hack but not a security issue: with access to
>>> the index-writers.xml you can write anything into it.  But there is
>>> no guarantee that this hack will continue to work in the future.
>>>
>>> Would you please be so kind to open a Jira issue to add real support
>>> for passwords in the index-writers.xml
>>>
>>> Best,
>>> Sebastian
>>>
>>>
>>>
>>> On 11/10/21 11:16, sw.l...@quandatics.com
>>> <mailto:sw.l...@quandatics.com> wrote:
>>>> Hi ,
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> We have tried the variable expansion method on the index-writers.xml,
>>>> it doesn't work. Could you advise if there are any alternative ways to
>>>> encrypt the password in the index-writers.xml file?
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Best Regards,
>>>>
>>>> Shi Wei
>>>>
>>>>
>>>>
>>>>


Re: Error When Connecting Elasticsearch with HTTPS Connection

2021-11-16 Thread Sebastian Nagel
Hi Shi Wei,

looks like you're the first trying to connect to ES from Nutch over
HTTPS.  HTTP is used as default scheme and there is no way to configure
the Elasticsearch index writer to use HTTPS.

Please open a Jira issue. It's a trivial fix.


For a quick fix: in the Nutch source package (or git tree) edit the file


src/plugin/indexer-elastic/src/java/org/apache/nutch/indexwriter/elastic/ElasticIndexWriter.java

and change the line

hostsList[i++] = new HttpHost(host, port);
into
hostsList[i++] = new HttpHost(host, port, "https");

After that you need to build nutch by running

   ant runtime


See

https://www.javadoc.io/static/org.apache.httpcomponents/httpcore/4.3.2/org/apache/http/HttpHost.html#HttpHost(java.lang.String,%20int,%20java.lang.String)

Best,
Sebastian

On 11/16/21 09:49, sw.l...@quandatics.com wrote:
> Hi there,
> 
> We encountered an issue when connecting to our Elasticsearch with HTTPS
> connection as follow:
> 
> 2021-11-16 16:37:22,034 DEBUG client.RestClient - request [POST
> http://192.168.0.105:9200/_bulk?timeout=1m] failed
> org.apache.http.ConnectionClosedException: Connection is closed
>     at
> org.apache.http.nio.protocol.HttpAsyncRequestExecutor.endOfInput(HttpAsyncRequestExecutor.java:356)
> 
>     at
> org.apache.http.impl.nio.client.InternalRequestExecutor.endOfInput(InternalRequestExecutor.java:132)
> 
>     at
> org.apache.http.impl.nio.DefaultNHttpClientConnection.consumeInput(DefaultNHttpClientConnection.java:261)
> 
>     at
> org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:81)
> 
>     at
> org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:39)
> 
>     at
> org.apache.http.impl.nio.reactor.AbstractIODispatch.inputReady(AbstractIODispatch.java:114)
> 
>     at
> org.apache.http.impl.nio.reactor.BaseIOReactor.readable(BaseIOReactor.java:162)
> 
>     at
> org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvent(AbstractIOReactor.java:337)
> 
>     at
> org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvents(AbstractIOReactor.java:315)
> 
>     at
> org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:276)
> 
>     at
> org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104)
> 
>     at
> org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:591)
> 
>     at java.lang.Thread.run(Thread.java:748)
> 
> Looks like the HTTP connection is established instead of HTTPS. Could
> you advise if there are any required parameters that we missed out in
> the index-writers.xml? Attached are the log file and index-writers.xml.
> 
> Thanks in advance!
> 
> 
> Best Regards,
> Shi Wei


Re: JEXL unable to handle "if" statements?

2021-11-11 Thread Sebastian Nagel
Hi Max,

fyi, the Jira issue is created:
  https://issues.apache.org/jira/browse/NUTCH-2902
(to make sure that this is not forgotten)

Thanks,
Sebastian


On 10/11/21 18:11, Sebastian Nagel wrote:
> Hi Max,
> 
>> I was able to fix this by switching from JexlExpression to JexlScript. I
>> have a small patch that I'm happy to contribute!
> 
> Yes, that would be great!  Please open also a Jira issue so that the
> problem shows up in the Changelog.
> 
> Thanks!
> 
> Best,
> Sebastian
> 
> On 10/11/21 6:34 AM, Max Ockner wrote:
>> According to the commons-jexl change logs, a breaking change was released
>> in 3.0:
>>
>> "Syntactically enforce that expressions do not contain statements:
>> POTENTIAL EXPRESSION BREAK! (ie an expression is not a script and can NOT
>> use 'if','for'... and blocks)"
>>
>> I was able to fix this by switching from JexlExpression to JexlScript. I
>> have a small patch that I'm happy to contribute!
>>
>>
>>
>> On Sun, Oct 10, 2021 at 3:30 PM Max Ockner  wrote:
>>
>>> Hello,
>>>
>>> I'm trying to use JEXL expressions similar to the ones described here
>>> https://issues.apache.org/jira/browse/NUTCH-2368.
>>>
>>> I consistently get an error parsing my "if" statement.
>>>
>>> I can reproduce with a simpler expression:
>>> -Dgenerate.max.count.expr='if (true) {return 2} else {return 1}'
>>>
>>> I'm running 1.19 on java 11 (also tried with java 8).
>>>
>>> Has anyone else seen this problem?
>>>
>>> Thanks,
>>> Ma
>>>
>>
> 


Re: encrypt password of the index-writer.xml

2021-11-11 Thread Sebastian Nagel
Hi Shi Wei,

there is a way, although definitely not the recommended one.
Sorry, and it took me a little bit to prove it.

Do you know about external XML entities or XXE attacks?

1. On top of the index-writers.xml you add an entity declaration:



]>


2. it's used later in the index writer spec:

  

  ...
  


3. you add your credentials snippet to the file /path/to/credentials.txt




4. and voila:

$> bin/nutch index crawldb segment
...
├┼─┼─┤
│username│The username of Solr server. │username │
├┼─┼─┤
│password│The password of Solr server. │SECRET   │
└┴─┴─┘


Note: this is a dirty hack but not a security issue: with access to the
index-writers.xml you can write anything into it.  But there is no
guarantee that this hack will continue to work in the future.

Would you please be so kind as to open a Jira issue to add real support
for passwords in the index-writers.xml?

Best,
Sebastian



On 11/10/21 11:16, sw.l...@quandatics.com wrote:
> Hi ,
> 
>  
> 
>  
> 
> We have tried the variable expansion method on the index-writers.xml, it
> doesn't work. Could you advise if there are any alternative ways to encrypt
> the password in the index-writers.xml file?
> 
>  
> 
>  
> 
> Best Regards,
> 
> Shi Wei
> 
>  
> 
> 


Re: javax.net.ssl.SSLHandshakeException Error when Executing Nutch with Selenium Plugin

2021-10-28 Thread Sebastian Nagel
Hi Shi Wei,

sorry, but it looks like the Selenium protocol plugin has never been
used with a proxy over https. There are two points which need (at a
first glance) a rework:

1. the protocol tries to establish a TLS/SSL connection to the proxy if
the URL to be crawled is a https:// URL. There might be some proxies
which can do this, but the proxies I'm aware of expect a HTTP CONNECT
[1] for HTTPS proxying.

2. probably also the browser / driver needs to be configured to
use the same proxy. Afaics, this isn't done but is a requirement
if the proxy is required for accessing web content. However, it
might be possible by setting environment variables.

Sorry again. Feel free to open a Jira issue to get this fixed.

Best,
Sebastian

[1] https://en.wikipedia.org/wiki/HTTP_tunnel#HTTP_CONNECT_method


On 10/28/21 11:45, sw.l...@quandatics.com wrote:
> Hi there,
> 
>  
> 
> Good day!
> 
>  
> 
> We would like to crawl the web data by executing the Nutch with Selenium
> plugin with the following command:
> 
>  
> 
> $ nutch plugin protocol-selenium org.apache.nutch.protocol.selenium.Http
> https://cwiki.apache.org/confluence/display/NUTCH/NutchTutorial
> 
>  
> 
> However, it failed with the following error message:
> 
>  
> 
> 2021-10-26 19:07:53,961 INFO  selenium.Http - http.proxy.host = xxx.xx.xx.xx
> 
> 2021-10-26 19:07:53,962 INFO  selenium.Http - http.proxy.port = 
> 
> 2021-10-26 19:07:53,962 INFO  selenium.Http - http.proxy.exception.list =
> true
> 
> 2021-10-26 19:07:53,962 INFO  selenium.Http - http.timeout = 1
> 
> 2021-10-26 19:07:53,962 INFO  selenium.Http - http.content.limit = 1048576
> 
> 2021-10-26 19:07:53,962 INFO  selenium.Http - http.agent = Apache Nutch
> Test/Nutch-1.18
> 
> 2021-10-26 19:07:53,962 INFO  selenium.Http - http.accept.language =
> en-us,en-gb,en;q=0.7,*;q=0.3
> 
> 2021-10-26 19:07:53,962 INFO  selenium.Http - http.accept =
> text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
> 
> 2021-10-26 19:07:53,962 INFO  selenium.Http - http.enable.cookie.header =
> true
> 
> 2021-10-26 19:07:54,114 ERROR selenium.Http - Failed to get protocol output
> 
> javax.net.ssl.SSLHandshakeException: Remote host closed connection during
> handshake
> 
> at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:994)
> 
> at sun.security.ssl.SSL
> 
>  
> 
> FYI, we have tried the following approaches but the issues persisted.
> 
>  
> 
> 1. Set the http.tls.certificates.check to false
> 
> 2. Import the website's certificates to our java truststores
> 
> 3. Our Nutch is configured with proxy
> 
>  
> 
> Kindly advise. Thanks in advance!
> 
>  
> 
>  
> 
> Best Regards,
> 
> Shi Wei
> 
>  
> 
> 


Re: Encrypt or Mask the password

2021-10-25 Thread Sebastian Nagel
Hi Shi Wei,

I'm not aware of any work-arounds.

Best,
Sebastian

Am Mo., 25. Okt. 2021 um 11:59 Uhr schrieb :

> Hi Sebastian,
>
> In case you have missed out the previous email, is there any possible
> workaround to integrate Nutch with kerberized Solr cloud?  e.g. via the
> HTTP Authentication Scheme
>
> Your sincerely,
> Shi Wei
>
> -Original Message-
> From: Sebastian Nagel 
> Sent: Monday, 25 October, 2021 5:31 PM
> To: user@nutch.apache.org
> Subject: Re: Encrypt or Mask the password
>
> Hi Shi Wei,
>
> for the nutch-site.xml it's possible to use Java properties and/or
> environment variables, see section "Variable expansion" in
>
>
> https://hadoop.apache.org/docs/r3.3.1/api/org/apache/hadoop/conf/Configuration.html
>
> In case you're asking about index-writers.xml - variable expansion
> (likely) does not work.
> Note: I didn't try it. But it's a scheme specific to Nutch and not a
> Hadoop configuration file and I cannot remember that any expansion
> mechanism is implemented when the index-writers.xml is read.
>
> Anyway, the better way would be to rely primarily on a credential
> provider, see
>
>
> https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/CredentialProviderAPI.html
>
>
> https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/conf/Configuration.html#getPassword-java.lang.String-
>
> If you have time, please open a Jira issue (or multiple ones) to get a
> more safe way to hold credentials implemented.
>
> Thanks,
> Sebastian
>
> Am Mo., 25. Okt. 2021 um 03:21 Uhr schrieb :
>
> > Hi,
> >
> >
> >
> > May I know if there is a way to encrypt or mask the password specified
> > in nutch-site.xml?
> >
> >
> >
> > Your sincerely,
> >
> > Shi Wei
> >
> >
>
>


Re: Encrypt or Mask the password

2021-10-25 Thread Sebastian Nagel
Hi Shi Wei,

for the nutch-site.xml it's possible to use Java properties and/or
environment variables,
see section "Variable expansion" in

https://hadoop.apache.org/docs/r3.3.1/api/org/apache/hadoop/conf/Configuration.html

In case you're asking about index-writers.xml - variable expansion (likely)
does not work.
Note: I didn't try it. But it's a scheme specific to Nutch and not a Hadoop
configuration file
and I cannot remember that any expansion mechanism is implemented when the
index-writers.xml
is read.

Anyway, the better way would be to rely primarily on a credential provider,
see

https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/CredentialProviderAPI.html

https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/conf/Configuration.html#getPassword-java.lang.String-
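
Nutch would still need code changes to read such credentials via
Configuration.getPassword(), but creating the credential store itself is
straightforward (the alias and provider path below are examples, not Nutch defaults):

  hadoop credential create solr.auth.password \
    -provider jceks://file/home/nutch/nutch-credentials.jceks
  # and point the job at the provider, e.g.
  #   -Dhadoop.security.credential.provider.path=jceks://file/home/nutch/nutch-credentials.jceks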

If you have time, please open a Jira issue (or multiple ones) to get a safer
way to hold credentials implemented.

Thanks,
Sebastian

Am Mo., 25. Okt. 2021 um 03:21 Uhr schrieb :

> Hi,
>
>
>
> May I know if there is a way to encrypt or mask the password specified in
> nutch-site.xml?
>
>
>
> Your sincerely,
>
> Shi Wei
>
>


Re: Cant integrate the kerberos enabled solr cloud with nutch

2021-10-22 Thread Sebastian Nagel

Hi Shi Wei,

could you also share the index writer configuration (conf/index-writers.xml)?

The default is unauthenticated access to Solr, see the snippet below.
The file httpclient-auth.xml is not relevant for the Solr indexer, it's
used if a crawled web site requires authentication in order to fetch
the content via the plugin protocol-httpclient.

Best,
Sebastian

  

  
  http://localhost:8983/solr/nutch"/>
  
  
  
  
  
  


On 10/22/21 10:10 AM, sw.l...@quandatics.com wrote:

Hi,

We have encountered a problem: we can't integrate the Kerberos-enabled Solr Cloud 
with Nutch.

When executing the command "nutch index crawl/crawldb/ -linkdb crawl/linkdb/ $s1 -filter -normalize", it fails with "HTTP ERROR 401: Problem 
accessing /solr/admin/collections. Reason: Authentication required", but we are able to curl it with the keytab.


Version of Nutch :1.18

Your Sincerely,

Shi Wei





Re: JEXL unable to handle "if" statements?

2021-10-11 Thread Sebastian Nagel

Hi Max,

> I was able to fix this by switching from JexlExpression to JexlScript. I
> have a small patch that I'm happy to contribute!

Yes, that would be great!  Please open also a Jira issue so that the
problem shows up in the Changelog.

Thanks!

Best,
Sebastian

On 10/11/21 6:34 AM, Max Ockner wrote:

According to the commons-jexl change logs, a breaking change was released
in 3.0:

"Syntactically enforce that expressions do not contain statements:
POTENTIAL EXPRESSION BREAK! (ie an expression is not a script and can NOT
use 'if','for'... and blocks)"

I was able to fix this by switching from JexlExpression to JexlScript. I
have a small patch that I'm happy to contribute!



On Sun, Oct 10, 2021 at 3:30 PM Max Ockner  wrote:


Hello,

I'm trying to use JEXL expressions similar to the ones described here
https://issues.apache.org/jira/browse/NUTCH-2368.

I consistently get an error parsing my "if" statement.

I can reproduce with a simpler expression:
-Dgenerate.max.count.expr='if (true) {return 2} else {return 1}'

I'm running 1.19 on java 11 (also tried with java 8).

Has anyone else seen this problem?

Thanks,
Ma







Re: OkHttp NoClassDefFoundError: okhttp3/Authenticator

2021-07-24 Thread Sebastian Nagel

Hi Markus,

the okhttp protocol plugin should work out of the box,
and we use it in production (currently on Hadoop 3.2.2).

I remember that I had once an issue with the Hadoop library
having okhttp as a dependency which then caused a conflict.
It was solved by adding an exclusion rule to the Hadoop
dependency in ivy/ivy.xml:


   ...
   

(the exclusion rule should be now)
   

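For illustration only - the exact coordinates depend on which Hadoop artifact
pulls in okhttp and need to be checked - such an exclusion could look roughly
like this in ivy/ivy.xml:

  <dependency org="org.apache.hadoop" name="hadoop-hdfs-client" rev="3.2.2" conf="*->default">
    <!-- assumed coordinates: okhttp 3.x is published under org "com.squareup.okhttp3" -->
    <exclude org="com.squareup.okhttp3" module="okhttp"/>
  </dependency>
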
The okhttp library is quite popular which makes conflicts
more probable.

Best,
Sebastian

On 7/23/21 5:25 PM, Markus Jelsma wrote:

Hello,

With a 1.18 checkout i am trying the okhttp plugin. I couldn't get it to
work on 1.15 due to another NoClassDefFoundError, and now with 1.18, it
still doesn't work and throws another NoClassDefFoundError.

java.lang.NoClassDefFoundError: okhttp3/Authenticator
 at java.base/java.lang.Class.getDeclaredConstructors0(Native Method)
 at
java.base/java.lang.Class.privateGetDeclaredConstructors(Class.java:3137)
 at java.base/java.lang.Class.getConstructor0(Class.java:3342)
 at java.base/java.lang.Class.getConstructor(Class.java:2151)
 at
org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:164)
 at
org.apache.nutch.protocol.ProtocolFactory.getProtocolInstanceByExtension(ProtocolFactory.java:177)
 at
org.apache.nutch.protocol.ProtocolFactory.getProtocol(ProtocolFactory.java:146)
 at
org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:308)
Caused by: java.lang.ClassNotFoundException: okhttp3.Authenticator
 at
java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:471)
 at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:589)
 at
org.apache.nutch.plugin.PluginClassLoader.loadClass(PluginClassLoader.java:104)
 at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
 ... 8 more

Any ideas what's going on?

Thanks,
Markus





Re: Running Nutch on Hadoop with S3 filesystem; 'URLNormlizer not found'

2021-07-15 Thread Sebastian Nagel

Hi Clark,

thanks for summarizing this discussion and sharing the final configuration!

Good to know that it's possible to run Nutch on Hadoop using S3A without
using HDFS (no namenode/datanodes running).

Best,
Sebastian


Re: Running Nutch on Hadoop with S3 filesystem; 'URLNormlizer not found'

2021-06-15 Thread Sebastian Nagel

> The local file system? Or hdfs:// or even s3:// resp. s3a://?

Also important: the value of "mapreduce.job.dir" - it's usually
on hdfs:// and I'm not sure whether the plugin loader is able to
read from other filesystems. At least, I haven't tried.


On 6/15/21 10:53 AM, Sebastian Nagel wrote:

Hi Clark,

sorry, I should read your mail until the end - you mentioned that
you downgraded Nutch to run with JDK 8.

Could you share to which filesystem does NUTCH_HOME point?
The local file system? Or hdfs:// or even s3:// resp. s3a://?

Best,
Sebastian


On 6/15/21 10:24 AM, Clark Benham wrote:

Hi,


I am trying to run Nutch-1.19 on hadoop-3.2.1 with an S3
backend/filesystem; however I get an error ‘URLNormalizer class not found’.
I have edited nutch-site.xml so this plugin should be included:



   plugin.includes


protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|mimetype-filter|urlnormalizer|urlnormalizer-basic|.*|nutch-extensionpoints 






  and then built on both nodes (I only have 2 machines).  I’ve successfully
run Nutch locally and in distributed mode using HDFS, and I’ve run a
mapreduce job with S3 as hadoop’s file system.


I thought it was possible nutch is not reading nutch-site.xml because I
resolve an error by setting the config through the cli, despite this
duplicating nutch-site.xml.

The command:

`hadoop jar $NUTCH_HOME/runtime/deploy/apache-nutch-1.19-SNAPSHOT.job
org.apache.nutch.fetcher.Fetcher
crawl/crawldb crawl/segments`

throws

`java.lang.IllegalArgumentException: Fetcher: No agents listed in '
http.agent.name' property`

while if I pass a value in for http.agent.name with
`-Dhttp.agent.name=myScrapper`,
(making the command `hadoop jar
$NUTCH_HOME/runtime/deploy/apache-nutch-1.19-SNAPSHOT.job
org.apache.nutch.fetcher.Fetcher
-Dhttp.agent.name=clark crawl/crawldb crawl/segments`),  I get an error
about there being no input path, which makes sense as I haven’t been able
to generate any segments.


  However this method of setting nutch config’s doesn’t work for injecting
URLs; eg:

`hadoop jar $NUTCH_HOME/runtime/deploy/apache-nutch-1.19-SNAPSHOT.job
org.apache.nutch.crawl.Injector
-Dplugin.includes=".*" crawl/crawldb urls`

fails with the same “URLNormalizer” not found.


I tried copying the plugin dir to S3 and setting
plugin.folders to be a path on S3 without success. (I expect
the plugin to be bundled with the .job so this step should be unnecessary)


The full stack trace for `hadoop jar
$NUTCH_HOME/runtime/deploy/apache-nutch-1.19-SNAPSHOT.job
org.apache.nutch.crawl.Injector
crawl/crawldb urls`:

SLF4J: Class path contains multiple SLF4J bindings.

SLF4J: Found binding in
[jar:file:/home/hdoop/hadoop-3.2.1/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: Found binding in
[jar:file:/home/hdoop/apache-nutch-1.18/lib/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
explanation.

SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]

#Took out multiple Info messages

2021-06-15 07:06:07,842 INFO mapreduce.Job: Task Id :
attempt_1623740678244_0001_m_01_0, Status : FAILED

Error: java.lang.RuntimeException: x point
org.apache.nutch.net.URLNormalizer not found.

at org.apache.nutch.net.URLNormalizers.<init>(URLNormalizers.java:145)

at org.apache.nutch.crawl.Injector$InjectMapper.setup(Injector.java:139)

at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)

at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:799)

at org.apache.hadoop.mapred.MapTask.run(MapTask.java:347)

at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:174)

at java.security.AccessController.doPrivileged(Native Method)

at javax.security.auth.Subject.doAs(Subject.java:422)

at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)

at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:168)


#This error repeats 6 times total, 3 times for each node


2021-06-15 07:06:26,035 INFO mapreduce.Job:  map 100% reduce 100%

2021-06-15 07:06:29,067 INFO mapreduce.Job: Job job_1623740678244_0001
failed with state FAILED due to: Task failed
task_1623740678244_0001_m_01

Job failed as tasks failed. failedMaps:1 failedReduces:0 killedMaps:0
killedReduces: 0


2021-06-15 07:06:29,190 INFO mapreduce.Job: Counters: 14

Job Counters

Failed map tasks=7

Killed map tasks=1

Killed reduce tasks=1

Launched map tasks=8

Other local map tasks=6

Rack-local map tasks=2

Total time spent by all maps in occupied slots (ms)=63196

Total time spent by all reduces in occupied slots (ms)=0

Total time spent by all map tasks (ms)=31598

Total vcore-milliseconds taken by all map tasks=31598

Total megabyte-milliseconds taken by all map tasks=8089088

Map-Reduce Framework

CPU time spent (ms)=0

Physical memory (bytes) snapshot=0

Re: Running Nutch on Hadoop with S3 filesystem; 'URLNormlizer not found'

2021-06-15 Thread Sebastian Nagel

Hi Clark,

sorry, I should read your mail until the end - you mentioned that
you downgraded Nutch to run with JDK 8.

Could you share to which filesystem does NUTCH_HOME point?
The local file system? Or hdfs:// or even s3:// resp. s3a://?

Best,
Sebastian



Re: Running Nutch on Hadoop with S3 filesystem; 'URLNormlizer not found'

2021-06-15 Thread Sebastian Nagel

Hi Clark,

the class URLNormalizer is not in a plugin - it's part of Nutch core and defines the interface for URL normalizer plugins. Looks like 
there's something wrong fundamentally, not only with the plugins.


> I am trying to run Nutch-1.19 on hadoop-3.2.1 with an S3

Are you aware that Nutch 1.19 will require JDK 11? - and the recent Nutch 
snapshots already do,
see NUTCH-2857. Hadoop 3.2.1 does not support JDK 11, you'd need to use 3.3.0. Is a plain vanilla Hadoop used, or a specific Hadoop 
distribution (eg. Cloudera, Amazon EMR)?


Note: the normal way to run Nutch is:
  $NUTCH_HOME/runtime/deploy/bin/nutch  ...
But in the end it will also call "hadoop jar apache-nutch-xyz.job ..."

Best,
Sebastian


Re: Apache Nutch help request for a school project :)

2021-06-07 Thread Sebastian Nagel

Hi Gorkem,

I haven't verified it by trying - but it may be that given your configuration
the Solr instance isn't reachable via
  http://localhost:8983/solr/nutch
Inside the Docker network, host names are the same as container names, that is
  http://solr:8983/solr/nutch
might work. Cf. the docker-compose networking documentation:
  https://docs.docker.com/compose/networking/

In your docker-compose.yaml there is:

services:
  solr:
    container_name: solr
    image: 'solr:8.5.2'
    ports:
      - '8983:8983'
    ...
  nutch:
    container_name: nutch
    ...
    command: '/root/nutch/bin/crawl -i -D solr.server.url=http://localhost:8983/solr/nutch -s urls crawl 1'

Please try to fix the host name in the Solr URL: use "solr" instead of "localhost".
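
For example, the crawl command above would then read (a sketch, assuming the
container name "solr" from the compose file):

  command: '/root/nutch/bin/crawl -i -D solr.server.url=http://solr:8983/solr/nutch -s urls crawl 1'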

Important: you need to configure the Solr URL in the file 
conf/index-writers.xml unless you're using
Nutch 1.14 or below. See
  
https://cwiki.apache.org/confluence/display/NUTCH/NutchTutorial#NutchTutorial-SetupSolrforsearch

In any case it's important to be able to read the logs (stdout/stderr and the 
hadoop.log)! I know this
isn't trivial when using docker-compose but it will save you a lot of time when 
searching for errors.
If you need help here, please let us know. Best start a separate thread in the 
Nutch user mailing list.

Best,
Sebastian

On 6/7/21 3:18 PM, lewis john mcgibbney wrote:

I’ll have a look today. You can always use the mailing list as well. Feel
free to post your questions there and we will help you out :)

On Sun, Jun 6, 2021 at 12:43 gokmen.yontem 
wrote:


Hi Lewis,
Sorry to bother you. I've been trying to configure Apache Nutch for
almost 10 days now and I'm about to give up. I saw that you are
contributing to this project and I thought maybe you can help me.
This is how desperate I am :)

Here's my repo if you have time:
https://github.com/gorkemyontem/nutch/blob/main/docker-compose.yml
I'm trying to use docker images so there isn't much on the repo/

This is my current error:

nutch| Indexer: java.lang.RuntimeException: Indexing job did not
succeed, job status:FAILED, reason: NA
nutch|  at
org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:150)
nutch|  at
org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:291)
nutch|  at
org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
nutch|  at
org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:300)


People say that schema.xml could be wrong, but I'm using the most up to
date one from here

https://github.com/apache/nutch/blob/master/src/plugin/indexer-solr/schema.xml


Many many thanks!
Best wishes,
Gorkem





Re: Recommendation for free and production-ready Hadoop setup to run Nutch

2021-06-04 Thread Sebastian Nagel

Hi Lewis, hi Markus,

> snappy compression, which is a massive improvement for large data shuffling 
jobs

Yes, I can confirm this. Also: it's worth considering zstd for all data kept for
longer. We use it for a 25-billion CrawlDB: it's almost as fast (both 
compression
and decompression) as snappy and you get a compression ratio which is not far 
away
from bzip2 (which is very slow).
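
To illustrate (a sketch only - adapt codec choices and values to your setup):
intermediate map output could use snappy while job output is written with
zstd, e.g. in mapred-site.xml:

  <property>
    <name>mapreduce.map.output.compress</name>
    <value>true</value>
  </property>
  <property>
    <name>mapreduce.map.output.compress.codec</name>
    <value>org.apache.hadoop.io.compress.SnappyCodec</value>
  </property>
  <property>
    <name>mapreduce.output.fileoutputformat.compress</name>
    <value>true</value>
  </property>
  <property>
    <name>mapreduce.output.fileoutputformat.compress.codec</name>
    <value>org.apache.hadoop.io.compress.ZStandardCodec</value>
  </property>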

> worker/task nodes were run on spot instances

we do the same. However, the EMR is priced at 25% of the on-demand EC2
instance price. As spot prices are usually 50-70% of the on-demand price,
the EMR costs would add a non-trivial part.

> backup logic

Yes. We checkpoint the output of every step to S3 unless it runs less than
one hour.

> using Terraform or AWS CloudFormation

We use a shell script to bootstrap the instances.
Some parts have been added/rewritten using CloudFormation (the VPC setup).
The long-term plan is to use more templates and also bake a base machine
image to speed up the bootstrapping.

> ARM support

ARM instances often (including spot instances) offer a better price/CPU ratio.
We've already switched to ARM for a couple of services/tasks - the efforts
are minimal: choose the right base image, installing and running Java or Python
workflows does not change. But I haven't tried Hadoop yet.

Thanks for sharing your experiences, I'll keep you updated about our decisions
and progress!

Best,
Sebastian


Re: DuplexWeb-Google - GoogleBot Crawler For Duplex / Google Assistant

2021-06-04 Thread Sebastian Nagel

Thanks! Interesting that the duplexweb bot ignores the wildcard user agent 
rules by default.

On 6/3/21 11:44 PM, lewis john mcgibbney wrote:

Some interesting content for a short read :)

https://www.seroundtable.com/duplexweb-google-bot-31522.html?utm_source=search_engine_roundtable_campaign=ser_newsletter_2021-06-03_medium=email





Recommendation for free and production-ready Hadoop setup to run Nutch

2021-06-01 Thread Sebastian Nagel

Hi,

does anybody have a recommendation for a free and production-ready Hadoop setup?

- HDFS + YARN
- run Nutch but also other MapReduce and Spark-on-Yarn jobs
- with native library support: libhadoop.so and compression
  libs (bzip2, zstd, snappy)
- must run on AWS EC2 instances and read/write to S3
- including smaller ones (2 vCPUs, 16 GiB RAM)
- ideally,
  - Hadoop 3.3.0
  - Java 11 and
  - support to run on ARM machines

So far, Common Crawl uses Cloudera CDH but with no free updates
anymore we consider either to switch to Amazon EMR, a Cloudera
subscription or to use vanilla Hadoop (esp. since only HDFS and YARN
are required).

A dockerized setup is also an option (at least, for development and
testing). So far, I've looked on [1] - the upgrade to Hadoop 3.3.0
was straight-forward [2]. But native library support is still missing.

Thanks,
Sebastian

[1] https://github.com/big-data-europe/docker-hadoop
[2] 
https://github.com/sebastian-nagel/docker-hadoop/tree/2.0.0-hadoop3.3.0-java11


Re: Adding html field to NutchDocument

2021-06-01 Thread Sebastian Nagel

Hi Kieran,

thanks for the feedback!

> I didn't realise that it is intended for users to edit the bin/crawl file.

Maybe we should add a comment to encourage users to adapt the shell scripts
to their needs.  Almost 10 years ago, the Java "Crawl" class was replaced
by the scripts because a shell script is easy to modify and deploy, see
  https://issues.apache.org/jira/browse/NUTCH-1087

Best,
Sebastian


On 6/1/21 2:37 PM, Kieran Munday wrote:

Hi Sebastian,

Thank you for your response. It was a great help.
I didn't realise that it is intended for users to edit the bin/crawl file.
Although looking at it now it's clear.

This makes it easier for me to access the html content within my plugin,
thanks again

On Fri, May 28, 2021 at 8:36 PM Sebastian Nagel
 wrote:


Hi Kieran,

see the command-line options

  -addBinaryContent
index raw/binary content in field `binaryContent`
  -base64
 use Base64 encoding for binary content

of the Nutch index job [1]. Note that the content may indeed be
binary, e.g. for PDF documents but also for HTML pages which use
an encoding other than UTF-8.

Best,
Sebastian

[1]
https://wiki.apache.org/confluence/pages/viewpage.action?pageId=122916842


On 5/28/21 5:28 PM, Kieran Munday wrote:

Hi users@,

I am new to Nutch (v.1.17) and my current project requires the indexing of
the html of crawled pages. It also requires fields that can be derived from
the raw html such as image count, and charset.

I have looked on StackOverflow for how to achieve this and most people from
my understanding seem to be recommending processing the segments to extract
the html and modify the documents post-crawl. This doesn't fit my use case
as I need to calculate these fields at crawl time before they are indexed
into Elasticsearch.

The other recommendations I have seen mention creating a plugin to override
the parse-html plugin. However, I have found rather limited documentation
on how to do this correctly and am not sure on how to return from the
plugin in a way that the field propagates into the NutchDocument which will
be processed in the Indexers' write method.

Do any of you have any advice or links to documentation that explains how
to modify what gets set in the NutchDocument?

Thank you in advance










Re: Adding html field to NutchDocument

2021-05-28 Thread Sebastian Nagel

Hi Kieran,

see the command-line options

-addBinaryContent
   index raw/binary content in field `binaryContent`
-base64
   use Base64 encoding for binary content

of the Nutch index job [1]. Note that the content may indeed be
binary, e.g. for PDF documents but also for HTML pages which use
an encoding other than UTF-8.

Best,
Sebastian

[1] https://wiki.apache.org/confluence/pages/viewpage.action?pageId=122916842
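
A usage sketch (crawldb/linkdb paths and the segment name are placeholders):

  bin/nutch index crawl/crawldb -linkdb crawl/linkdb \
      crawl/segments/20210528123456 -addBinaryContent -base64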


On 5/28/21 5:28 PM, Kieran Munday wrote:

Hi users@,

I am new to Nutch (v.1.17) and my current project requires the indexing of
the html of crawled pages. It also requires fields that can be derived from
the raw html such as image count, and charset.

I have looked on StackOverflow for how to achieve this and most people from
my understanding seem to be recommending processing the segments to extract
the html and modify the documents post-crawl. This doesn't fit my use case
as I need to calculate these fields at crawl time before they are indexed
into Elasticsearch.

The other recommendations I have seen mention creating a plugin to override
the parse-html plugin. However, I have found rather limited documentation
on how to do this correctly and am not sure on how to return from the
plugin in a way that the field propagates into the NutchDocument which will
be processed in the Indexers' write method.

Do any of you have any advice or links to documentation that explains how
to modify what gets set in the NutchDocument?

Thank you in advance





Re: Crawling same domain URL's

2021-05-11 Thread Sebastian Nagel

Hi Prateek,

alternatively, you could modify the URLPartitioner [1], so that during the 
"generate" step
the URLs of a specific host or domain are distributed over more partitions. One 
partition
is the fetch list of one fetcher map task.  At Common Crawl we partition by 
domain and made
the number of partitions configurable to assign more fetcher tasks to certain 
super-domains,
e.g. wordpress.com or blogspot.com, see [2].

Best,
Sebastian

[1] 
https://github.com/apache/nutch/blob/6c02da053d8ce65e0283a144ab59586e563608b8/src/java/org/apache/nutch/crawl/URLPartitioner.java#L75
[2] 
https://github.com/commoncrawl/nutch/blob/98a137910aa30dcb4fa1acd720fb4a4b7d9c520f/src/java/org/apache/nutch/crawl/URLPartitioner.java#L131 
(used by Generator2)
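
For illustration, a minimal and hypothetical Hadoop partitioner that spreads
one host over a configurable number of partitions (this is not the Nutch or
Common Crawl implementation, and the property name is made up):

  import java.net.MalformedURLException;
  import java.net.URL;
  import org.apache.hadoop.conf.Configurable;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.io.Writable;
  import org.apache.hadoop.mapreduce.Partitioner;

  public class HostSpreadPartitioner extends Partitioner<Text, Writable>
      implements Configurable {

    private Configuration conf;
    private int partitionsPerHost = 1;

    @Override
    public void setConf(Configuration conf) {
      this.conf = conf;
      // assumed property name, for this sketch only
      partitionsPerHost = conf.getInt("partition.url.partitions.per.host", 1);
    }

    @Override
    public Configuration getConf() {
      return conf;
    }

    @Override
    public int getPartition(Text key, Writable value, int numReduceTasks) {
      String url = key.toString();
      String host;
      try {
        host = new URL(url).getHost(); // simplification: host instead of registered domain
      } catch (MalformedURLException e) {
        host = url;
      }
      // salt the host hash with a per-URL component so one host maps to
      // up to partitionsPerHost different fetch lists
      int salt = (url.hashCode() & Integer.MAX_VALUE) % Math.max(1, partitionsPerHost);
      int hash = host.hashCode() * 31 + salt;
      return (hash & Integer.MAX_VALUE) % numReduceTasks;
    }
  }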



On 5/11/21 3:07 PM, Markus Jelsma wrote:

Hello Prateek,

You are right, it is limited by the number of CPU cores and how many
threads it can handle, but you can still process a million records per day
if you have a few cores. If you parse as a separate step, it can run even
faster.

Indeed, it won't work if you need to process 10 million recors of the same
host every day. If you want to use Hadoop for this, you can opt for a
custom YARN application [1]. We have done that too for some of our
distributed tools, it works very nice.

Regards,
Markus

[1]
https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/WritingYarnApplications.html

Op di 11 mei 2021 om 14:54 schreef prateek :


Hi Markus,

Depending upon the core of the machine, I can only increase the number of
threads upto a limit. After that performance degradation will come into the
picture.
So running a single mapper will still be a bottleneck in this case. I am
looking for options to distribute the same domain URLs across various
mappers. Not sure if that's even possible with Nutch or not.

Regards
Prateek

On Tue, May 11, 2021 at 11:58 AM Markus Jelsma 


wrote:


Hello Prateet,

If you want to fetch stuff from the same host/domain as fast as you want,
increase the number of threads, and the number of threads per queue. Then
decrease all the fetch delays.

Regards,
Markus

Op di 11 mei 2021 om 12:48 schreef prateek :


Hi Lewis,

As mentioned earlier, it does not matter how many mappers I assign to

fetch

tasks. Since all the URLs are of the same domain, everything will be
assigned to the same mapper and all other mappers will have no task to
execute. So I am looking for ways I can crawl the same domain URLs

quickly.


Regards
Prateek

On Mon, May 10, 2021 at 1:02 AM Lewis John McGibbney <

lewi...@apache.org


wrote:


Hi Prateek,
mapred.map.tasks -->mapreduce.job.maps
mapred.reduce.tasks  -->mapreduce.job.reduces
You should be able to override in these in nutch-site.xml then

publish

to

your Hadoop cluster.
lewismc

On 2021/05/07 15:18:38, prateek  wrote:

Hi,

I am trying to crawl URLs belonging to the same domain (around

140k)

and

because of the fact that all the same domain URLs go to the same

mapper,

only one mapper is used for fetching. All others are just a waste

of

resources. These are the configurations I have tried till now but

it's

still very slow.

Attempt 1 -
 fetcher.threads.fetch : 10
 fetcher.server.delay : 1
 fetcher.threads.per.queue : 1,
 fetcher.server.min.delay : 0.0

Attempt 2 -
 fetcher.threads.fetch : 10
 fetcher.server.delay : 1
 fetcher.threads.per.queue : 3,
 fetcher.server.min.delay : 0.5

Is there a way to distribute the same domain URLs across all the
fetcher.threads.fetch? I understand that in this case crawl delay

cannot

be

reinforced across different mappers but for my use case it's ok to

crawl

aggressively. So any suggestions?

Regards
Prateek















Re: Redirection behavior

2021-05-06 Thread Sebastian Nagel

Hi Prateek,

(sorry, I pressed the wrong reply button, so redirecting the discussion back to 
user@nutch)


> I am not sure what I am missing.

Well, URL filters?  Robots.txt?  Don't know...


> I am currently using Nutch 1.16

Just to make sure this isn't the cause: there was a bug (NUTCH-2550 [1]) which 
caused Fetcher
not to follow redirects. But it was fixed already in Nutch 1.15.

I've retried using Nutch 1.16:
- using -Dplugin.includes='protocol-okhttp|parse-html'
   FetcherThread 43 fetching http://wikipedia.com/ (queue crawl delay=3000ms)
   FetcherThread 43 fetching https://wikipedia.com/ (queue crawl delay=3000ms)
   FetcherThread 43 fetching https://www.wikipedia.org/ (queue crawl 
delay=3000ms)

Note: there might be an issue using protocol-http 
(-Dplugin.includes='protocol-http|parse-html')
together with Nutch 1.16:
   FetcherThread 43 fetching https://wikipedia.com/ (queue crawl delay=3000ms)
   FetcherThread 43 fetching https://wikipedia.com/ (queue crawl delay=3000ms)
   Couldn't get robots.txt for https://wikipedia.com/: 
java.net.SocketException: Socket is closed
   FetcherThread 43 fetching https://www.wikipedia.org/ (queue crawl 
delay=3000ms)
   FetcherThread 43 fetching https://www.wikipedia.org/ (queue crawl 
delay=3000ms)
   Couldn't get robots.txt for https://www.wikipedia.org/: 
java.net.SocketException: Socket is closed
   Failed to get protocol output java.net.SocketException: Socket is closed
at 
sun.security.ssl.SSLSocketImpl.getOutputStream(SSLSocketImpl.java:1109)
at 
org.apache.nutch.protocol.http.HttpResponse.<init>(HttpResponse.java:162)
at org.apache.nutch.protocol.http.Http.getResponse(Http.java:63)
at 
org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:375)
at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:343)
   FetcherThread 43 fetch of https://www.wikipedia.org/ failed with: 
java.net.SocketException: Socket is closed

But it's not reproducible using Nutch master / 1.18 - as it relates to 
HTTPS/SSL it's likely fixed by NUTCH-2794 [2].

In any case, could you try to reproduce the problem using Nutch 1.18 ?

Best,
Sebastian

[1] https://issues.apache.org/jira/browse/NUTCH-2550
[2] https://issues.apache.org/jira/browse/NUTCH-2794


On 5/6/21 11:54 AM, prateek wrote:

Thanks for your reply Sebastian.

I am using http.redirect.max=5 for my setup.
In the seed URL, I am only passing http://wikipedia.com/ and https://zyfro.com/. CrawlDatum
and ParseData shared in my earlier email are from the http://wikipedia.com/ url.

I don't see the other redirected URLs in the logs or segments. Here is my log -

2021-05-05 17:35:23,854 INFO [main] org.apache.nutch.fetcher.FetcherThread: FetcherThread 1 Using queue mode : byHost
2021-05-05 17:35:23,854 INFO [main] org.apache.nutch.fetcher.Fetcher: Fetcher: throughput threshold: -1
2021-05-05 17:35:23,854 INFO [main] org.apache.nutch.fetcher.Fetcher: Fetcher: throughput threshold retries: 5
2021-05-05 17:35:23,855 INFO [main] org.apache.nutch.fetcher.FetcherThread: FetcherThread 50 fetching http://wikipedia.com/ (queue crawl delay=1000ms)

2021-05-05 17:35:29,095 INFO [main] org.apache.nutch.fetcher.FetcherThread: FetcherThread 50 fetching https://zyfro.com/ (queue crawl delay=1000ms)

2021-05-05 17:35:29,095 INFO [main] com.linkedin.nutchplugin.http.Http: fetching https://zyfro.com/robots.txt
2021-05-05 17:35:29,862 INFO [main] org.apache.nutch.fetcher.Fetcher: -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=1

2021-05-05 17:35:30,189 INFO [main] com.linkedin.nutchplugin.http.Http: fetching https://zyfro.com/
2021-05-05 17:35:30,786 INFO [main] org.apache.nutch.fetcher.FetcherThread: FetcherThread 50 has no more work available

I am not sure what I am missing.

Regards
Prateek


On Thu, May 6, 2021 at 10:21 AM Sebastian Nagel mailto:wastl.na...@googlemail.com>> wrote:

Hi Prateek,

could you share information about all pages/URLs in the redirect chain?

http://wikipedia.com/
https://wikipedia.com/
https://www.wikipedia.org/

If I'm not wrong, the shown CrawlDatum and ParseData stem from
https://www.wikipedia.org/ with _http_status_code_=200.
So, looks like the redirects have been followed.

Note: all 3 URLs should have records in the segment and the CrawlDb.

I've also verified that the above redirect chain is followed by Fetcher
with the following settings (passed on the command-line via -D) using
Nutch master (1.18):
   -Dhttp.redirect.max=3
   -Ddb.ignore.external.links=true
   -Ddb.ignore.externa

Re: Writing Nutch data in Parquet format

2021-05-05 Thread Sebastian Nagel

Hi Lewis,

> 2) post-processing (Nutch) Hadoop sequence data by converting it to Parquet 
format?

Yes, but not directly - it's a multi-step process. The outcome:
  
https://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/

This Parquet index is optimized by sorting the rows by a special form of the 
URL [1] which
- drops the protocol or scheme
- reverses the host name and
- puts it in front of the remaining URL parts (path and query)
- with some additional normalization of path and query (eg. sorting of query 
params)

One example:
  https://example.com/path/search?q=foo&l=en
  com,example)/path/search?l=en&q=foo

The SURT URL is similar to the URL format used by Nutch2
  com.example/https/path/search?q=foo&l=en
to address rows in the WebPage table [2]. This format is inspired by the 
BigTable
paper [3].  The point is that URLs of the same domain end up next to each other, cf. [4].


Ok, back to the question: both 1) and 2) are trivial if you do not care about
writing optimal Parquet files: just define a schema following the methods 
implementing
the Writable interface. Parquet is easier to feed into various data processing 
systems
because it integrates the schema. The Sequence file format requires that the
Writable formats are provided - although Spark and other big data tools support
Sequence files this requirement is sometimes a blocker, also because Nutch
does not ship a small "nutch-formats" jar.

Nevertheless, the price for Parquet is slower writing - which is ok for 
write-once-read-many
use cases. But the typical use case for Nutch is "write-once-read-twice":
- segment: read for CrawlDb update and indexing
- CrawlDb: read during update then replace, in some cycles read for 
deduplication, statistics, etc.


Lewis, I'd be really interested what your particular use case is?

Also because at Common Crawl we plan to provide more data in the Parquet 
format: page metadata,
links and text dumps. Storing URLs and web page metadata efficiently was part of 
the motivation
for Dremel [5] which again inspired Parquet [6].


Best,
Sebastian


[1] https://github.com/internetarchive/surt
[2] 
https://cwiki.apache.org/confluence/display/NUTCH/Nutch2Crawling#Nutch2Crawling-Introduction
[3] 
https://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf
[4] https://cloud.google.com/bigtable/docs/schema-design#domain-names
[5] https://research.google/pubs/pub36632/
[6] 
https://blog.twitter.com/engineering/en_us/a/2013/dremel-made-simple-with-parquet.html


On 5/4/21 11:14 PM, Lewis John McGibbney wrote:

Hi user@,
Has anyone experimented/accomplished either
1) writing Nutch data directly as Parquet format, or
2) post-processing (Nutch) Hadoop sequence data by converting it to Parquet 
format?
Thank you
lewismc





Re: googled for ever and still can't figure it out

2021-03-15 Thread Sebastian Nagel

Hi Andrew,

> if this flag is used *--sitemaps-from-hostdb always*

Do the crawled hosts announce the sitemap in their robots.txt?
If not, do the sitemap URLs follow the pattern
  http://example.com/sitemap.xml ?

See https://cwiki.apache.org/confluence/display/NUTCH/SitemapFeature

If this is not the case, it's required to put the URLs pointing
to the sitemaps into a separate list and call bin/crawl with the
option `-sm `.
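
For example (directory names are placeholders), such a call could look
roughly like:

  bin/crawl -i -D solr.server.url=http://localhost:8983/solr/nutch \
      -s urls/ -sm sitemap-urls/ crawl/ 10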

> nutch-default.xml set the interval to 2 seconds from default 30 days.

Ok, for one day or even a few hours. But why "2 seconds"?


> I also don't understand why the crawldb is automatically deleted

The crawldb isn't removed but updated after each cycle by
- moving the previous version from "current/" to "old/"
- placing the updated version in "current/"


In doubt, and because bugs are always possible could you share the logs
from the SitemapProcessor ?


Best,
Sebastian

On 3/13/21 6:33 PM, Andrew MacKay wrote:

Hi

hoping for some help to get sitemaps.xml working
using this command to crawl  (nutch 1.18)

NUTCH_HOME/bin/crawl -i -D solr.server.url=http://localhost:8983/solr/nutch
--sitemaps-from-hostdb always -s $NUTCH_HOME/urls/ $NUTCH_HOME/Crawl 10

if this flag is used --sitemaps-from-hostdb always
this error occurs

Generator: number of items rejected during selection:
Generator:    201  SCHEDULE_REJECTED
Generator: 0 records selected for fetching, exiting ...

without this flag present it crawls the site without issue, and

in nutch-default.xml the interval is set to 2 seconds from the default 30 days:

  <property>
    <name>db.fetch.interval.default</name>
    <value>2</value>
  </property>

I also don't understand why the crawldb is automatically deleted after each
crawl so I cannot run any commands about URLs that are not crawled.

Any help





Re: Extract all image and video links from a web page

2021-01-27 Thread Sebastian Nagel

Hi Prateek,

are there any URL filters which filter away image links?

You can verify this using the URL filter checker:

 echo "https://example.com/image.jpg; \
   | bin/nutch filterchecker -stdin

The default rules in conf/regex-urlfilter.txt exclude common
image suffixes. Note that there can be more URL filters activated
in the property plugin.includes.
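
For reference, the stock suffix rule in conf/regex-urlfilter.txt looks
roughly like this (abbreviated - check your local copy for the exact list):

  # skip image and other suffixes irrelevant for parsing
  -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$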

Best,
Sebastian

On 1/26/21 3:14 PM, prateek wrote:

Hi Lewis,

Thanks for your suggestion.

I looked at the class fetching outlinks and saw that "img" is already part
of that -
https://github.com/apache/nutch/blob/680df6ba1dc68ad5ede5fca743304593d4d5b0a3/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java#L90.
So I am confused as to why I don't see any images in outlinks.
I have double checked that the property parser.html.outlinks.ignore_tags is
also not set. So ideally images should be part of outlinks already. But
when I run "bin/nutch readseg" to see the segments data, I don't see any
images being captured. Any Idea what am I missing?

If there is a way I can get all images in outlinks, then maybe I don't even
need a plugin for that.

Regards
Prateek

On Wed, Jan 20, 2021 at 5:37 PM Lewis John McGibbney 
wrote:


Hi Prateek,

On 2021/01/19 15:58:29, prateek  wrote:

Is the only other option is to
override HtmlParseFilter and add a new plugin?


Yes I think it is.



Also regarding separate objects, what i meant is if i store the image

links

in Outlink, then those links will also be stored in DB (because all

outlink

are stored for next crawl of depth > 1). I don't want to store those in
crawldb and just output in some other object within the record. I hope

this

makes sense


I understand. Seeing as you cannot upgrade then yes I think you need to
implement a new plugin to capture the outlinks as a new field in the
NutchDocument. You should also look into using the
'parser.html.outlinks.ignore_tags' configuration setting. You can specify
which tags are filtered.

lewismc







Re: NUTCH-2353

2020-12-06 Thread Sebastian Nagel
Hi,

no, NUTCH-2353 is still open, see
  https://issues.apache.org/jira/projects/NUTCH/issues/NUTCH-2353
The implementation caused a regression, so it was reverted.

Best,
Sebastian


On 12/6/20 7:03 AM, Von Kursor wrote:
> Hello
> 
> Has this API enhancement been implemented under 1.17 ?
> 
> I was under the impression it had while looking at Jira, but I am
> seeing otherwise while trying to use it:
> 
> 2020-12-05 16:51:17,893 WARN phase.PhaseInterceptorChain - Interceptor
> for {http://resources.service.nutch.apache.org/}JobResource has thrown
> exception, unwinding now
> javax.ws.rs.InternalServerErrorException: HTTP 500 Internal Server Error
> at 
> org.apache.cxf.jaxrs.utils.SpecExceptions.toInternalServerErrorException(SpecExceptions.java:79)
> at 
> org.apache.cxf.jaxrs.utils.ExceptionUtils.toInternalServerErrorException(ExceptionUtils.java:111)
> at 
> org.apache.cxf.jaxrs.interceptor.JAXRSInInterceptor.convertExceptionToResponseIfPossible(JAXRSInInterceptor.java:225)
> at 
> org.apache.cxf.jaxrs.interceptor.JAXRSInInterceptor.handleMessage(JAXRSInInterceptor.java:87)
> at 
> org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:308)
> at 
> org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
> at 
> org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:267)
> at 
> org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:247)
> at 
> org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:79)
> at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
> at 
> org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257)
> at 
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1347)
> at 
> org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:205)
> at 
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1249)
> at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
> at 
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:220)
> at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
> at org.eclipse.jetty.server.Server.handle(Server.java:505)
> at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:370)
> at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:267)
> at 
> org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:305)
> at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)
> at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:117)
> at 
> org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:333)
> at 
> org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:310)
> at 
> org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:168)
> at 
> org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:126)
> at 
> org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:366)
> at 
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:781)
> at 
> org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:917)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException:
> Unrecognized field "metadata" (class
> org.apache.nutch.service.model.request.SeedUrl), not marked as
> ignorable (3 known properties: "id", "seedList", "url"])
> at [Source: (org.apache.cxf.transport.http.AbstractHTTPDestination$1);
> line: 9, column: 27] (through reference chain:
> org.apache.nutch.service.model.request.SeedList["seedUrls"]->java.util.ArrayList[0]->org.apache.nutch.service.model.request.SeedUrl["metadata"])
> Is there any workaround ? I would really like to see this feature working.
> 
> Thank you,
> 



Re: Nutch 2.4 with selenium

2020-10-10 Thread Sebastian Nagel
Hi,

> Nutch 2.4 with selenium

Nutch 2.4 does not include any plugin to use Selenium. In addition, 2.4 is for
now the last release on the 2.x branch, which is not
maintained anymore. You should use 1.x (1.17 is the
most recent release).

> standalone nutch crawling with selenium.

For 1.x there's a good README how to setup protocol-selenium:
  
https://github.com/apache/nutch/blob/master/src/plugin/protocol-selenium/README.md

In general, the tutorial is the recommended way to start
  https://cwiki.apache.org/confluence/display/NUTCH/NutchTutorial
Please try to get it running first without Selenium, it's important to 
understand first
how Nutch works before you start with the clearly more complex Selenium-based 
crawling.

Best,
Sebastian

On 10/7/20 2:49 PM, Gajalakshmi G wrote:
> Hi,
> 
> Thanks for the response, the 'conf/regex-urlfilter.txt' file was available 
> inside the current working directory.
> 
> Please guide me or share me useful links on standalone  nutch crawling with 
> selenium.
> 
> 
> 
> Thanks & Regards,
> 
> Gajalakshmi.G
> 
> Assistant Consultant
> 
> Tata Consultancy Services
> Mailto: 
> gajalakshm...@tcs.com
> 
> 
> From: Shashanka Balakuntala 
> Sent: Wednesday, October 7, 2020 5:49 PM
> To: user@nutch.apache.org 
> Subject: Re: Nutch 2.4 with selenium
> 
> "External email. Open with Caution"
> 
> Hi Gajalakshmi,
> 
> The NPE can be thrown because of the file not found on the disk. So in the
> working directory/current directory check if you have the file
> conf/regex-urlfilter.txt
> 
> 
> *Regards*
>   Shashanka Balakuntala Srinivasa
> 
> 
> 
> On Wed, Oct 7, 2020 at 2:09 PM Gajalakshmi G 
> wrote:
> 
>> Hi all,
>>
>> I am trying to crawl dynamic webpage using Nutch 2.4 with Selenium 3.6.0
>> with Firefox version 79. I am getting the below error in injector job
>> itself.
>>
>> java.lang.Exception: java.lang.NullPointerException
>> at
>> org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
>> at
>> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)
>> Caused by: java.lang.NullPointerException
>> at java.io.Reader.<init>(Reader.java:78)
>> at java.io.BufferedReader.<init>(BufferedReader.java:101)
>> at java.io.BufferedReader.<init>(BufferedReader.java:116)
>> at
>> org.apache.nutch.urlfilter.api.RegexURLFilterBase.readRules(RegexURLFilterBase.java:199)
>> at
>> org.apache.nutch.urlfilter.api.RegexURLFilterBase.setConf(RegexURLFilterBase.java:171)
>> at
>> org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:163)
>> at org.apache.nutch.net.URLFilters.<init>(URLFilters.java:62)
>> at
>> org.apache.nutch.crawl.InjectorJob$UrlMapper.setup(InjectorJob.java:113)
>> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
>> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
>> at
>> org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243)
>> at
>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>> at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>> at java.lang.Thread.run(Thread.java:748)
>>
>> Please guide me on resolving this issue.
>>
>>
>>
>> Thanks & Regards,
>>
>> Gajalakshmi.G
>>
>> Assistant Consultant
>>
>> Tata Consultancy Services
>> Mailto: gajalakshm...@tcs.com<
>> https://mail.tcs.com/owa/redir.aspx?C=15cf4bf65eff4bdab465e0a2dd682f11=mailto%3agajalakshmi.g%40tcs.com
>>>
>>
>>
> 



Re: Unable to get search result using Javascript client..

2020-10-01 Thread Sebastian Nagel
Hi,

this question is better asked on the Solr user mailing list
as Nutch people are not necessarily familiar with Solr on a deep level.

Please also share more details - which JavaScript client, the error message,
the log messages of the Solr server at this time. This helps to trace the
error down and detect the reason.

Best,
Sebastian


On 9/28/20 1:29 PM, SUNIL KUMAR DASH wrote:
> Dear All, I am unable to connect and get the search result from my solr 
> server (7.3.1 with nutch 1.15 ) from a simple JavaScript client..
> responseText showing blank. But on the standalone solr server I am able to 
> search the keyword and results are displayed ( core name=nutch
> )   My code goes like this. What could be the problem..Thanks in advance..
> Solr Search
> 
> SOLR SEARCH [ FROM JAVASCRIPT ]
> 
> 
> 
> 
> 
> Regards Sunil Kumar
> 
> 



Re: Regarding Nutch Hadoop Cluster Setup in Deploy Mode

2020-09-08 Thread Sebastian Nagel
Hi Dimanshu,

Nutch is a community project. If you can, please take the time, be part of the 
community
and improve the documentation. Unlike for the source code, the barrier for the 
wiki is low:
anybody can and *is welcome* to register and update the Nutch Wiki. As a 100% 
volunteer project
we rely on contributions from the community including our users.

Thanks,
Sebastian

On 9/4/20 9:17 PM, Dimanshu Parihar wrote:
> Thanks Sebastian,
> This helps a lot. I got the point. They should change the documentation. A 
> lot of people gets confused because of that.
> 
> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for Windows 10
> 
> From: Sebastian Nagel<mailto:wastl.na...@googlemail.com.INVALID>
> Sent: Tuesday, August 11, 2020 4:56 PM
> To: user@nutch.apache.org<mailto:user@nutch.apache.org>
> Subject: Re: Regarding Nutch Hadoop Cluster Setup in Deploy Mode
> 
> Hi,
> 
> Nutch does not include a search component anymore. These steps are obsolete.
> 
> All you need is to setup your Hadoop cluster, then run
>$NUTCH_HOME/runtime/deploy/bin/nutch ...
> (instead of .../runtime/local/bin/nutch ...)
> 
> Alternatively, you could launch a Nutch tool, eg. Injector
> the following way:
> 
> hadoop jar $NUTCH_HOME/runtime/deploy/apache-nutch-1.15-SNAPSHOT.job \
>org.apache.nutch.crawl.Injector ...
> 
> Best,
> Sebastian
> 
> 
> On 8/10/20 11:31 AM, Dimanshu Parihar wrote:
>>
>>
>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for Windows 10
>> Hello Sir,
>> I have been using Nutch 1.17 in local mode and now I wanted to shift from 
>> local mode to deploy mode. For this, I tried the Apache Nutch Hadoop cluster 
>> setup link but I am stuck at the below given point :
>>
>> Problem :
>>
>> First copy the files from the nutch build to the deploy directory using 
>> something like the following command:
>>
>> cp -R /path/to/build/* /nutch/search
>>
>> Then make sure that all of the shell scripts are in unix format and are 
>> executable.
>>
>> dos2unix /nutch/search/bin/*.sh /nutch/search/bin/hadoop 
>> /nutch/search/bin/nutch
>>
>> chmod 700 /nutch/search/bin/*.sh /nutch/search/bin/hadoop 
>> /nutch/search/bin/nutch
>>
>> dos2unix /nutch/search/config/*.sh
>>
>> chmod 700 /nutch/search/config/*.sh
>> Issue :
>> The issue is I ran ant command in nutch folder and runtime folder is created 
>> and a build folder is created. I copied the build/* files to search folder 
>> that I created in nutch folder itself. But after running these dos2unix 
>> commands, it says no bin/Hadoop and bin/nutch files found here which is 
>> obvious because my build folder didn’t had these files.
>> So can you please clarify these statements that how can I follow these steps?
>> I have only 1 user where I am setting all 3 hadoop, solr and nutch which is 
>> not root user.
>>
> 
> 



Re: Unable to index on Hadoop 3.2.0 with 1.16

2020-08-12 Thread Sebastian Nagel
Hi Joe,

> I eliminated it when I updated the index-writers.xml for the solr_indexer_1
> to use only a single URL.

Thanks for the hint. I'm able to reproduce the error by adding an overlong URL
to the "url" parameter of the Solr writer in index-writers.xml.


Could you open an issue to fix this on
https://issues.apache.org/jira/projects/NUTCH ?

Thanks!

Best,
Sebastian


On 8/12/20 5:35 PM, Gilvary, Joseph wrote:
> Hi,
> 
> I wasn't on the list when this discussion happened, so I hope this will 
> thread correctly in archives. I linked to the archive below and tried to 
> include enough here to ensure searchers can find it if this won't thread.
> 
> I was getting an error with Nutch 1.17.  I never used 1.16, but upgraded from 
> 1.15 recently.
> 
> java.lang.Exception: java.lang.IllegalStateException: text width is less than 
> 1, was <-26>
> at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:492)
> at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:559)
> Caused by: java.lang.IllegalStateException: text width is less than 1, was 
> <-26>
> at org.apache.commons.lang3.Validate.validState(Validate.java:829)
> at 
> de.vandermeer.skb.interfaces.transformers.textformat.Text_To_FormattedText.transform(Text_To_FormattedText.java:215)
> at 
> de.vandermeer.asciitable.AT_Renderer.renderAsCollection(AT_Renderer.java:250)
> at de.vandermeer.asciitable.AT_Renderer.render(AT_Renderer.java:128)
> at de.vandermeer.asciitable.AsciiTable.render(AsciiTable.java:191)
> at 
> org.apache.nutch.indexer.IndexWriters.describe(IndexWriters.java:326)
> at 
> org.apache.nutch.indexer.IndexerOutputFormat.getRecordWriter(IndexerOutputFormat.java:45)
> at 
> org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.(ReduceTask.java:542)
> at 
> org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:615)
> at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:390)
> at 
> org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:347)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> 
> This looks like the error that Markus Jelsma described in the earlier 
> discussion, though the invalid text width in my case was -26. I eliminated it 
> when I updated the index-writers.xml for the solr_indexer_1 to use only a 
> single URL. I don't know where the -26 comes from or the -41 Marcus was 
> getting, but the fact that they were different values told me that the issue 
> would be in the site-specific difference in our configs.
> 
> Adding the link in the archive were I found the earlier discussion:
> http://mail-archives.apache.org/mod_mbox/nutch-user/201910.mbox/%3c05eda22b-14b2-309f-3bc7-d6d85c218...@googlemail.com%3E
> 
> Adding the only potentially relevant Jira link I found while searching:
> https://issues.apache.org/jira/browse/NUTCH-2602
> 
> It seems potentially relevant because Marcus started getting the error after 
> migrating to 1.16 & I started getting it when I went from 1.15 to 1.17.
> 
> Thanks. Stay safe, stay healthy,
> 
> Joe
> 



Re: Regarding Nutch Hadoop Cluster Setup in Deploy Mode

2020-08-11 Thread Sebastian Nagel
Hi,

Nutch does not include a search component anymore. These steps are obsolete.

All you need is to set up your Hadoop cluster, then run
   $NUTCH_HOME/runtime/deploy/bin/nutch ...
(instead of .../runtime/local/bin/nutch ...)

Alternatively, you could launch a Nutch tool, e.g. the Injector,
in the following way:

hadoop jar $NUTCH_HOME/runtime/deploy/apache-nutch-1.15-SNAPSHOT.job \
   org.apache.nutch.crawl.Injector ...

Best,
Sebastian


On 8/10/20 11:31 AM, Dimanshu Parihar wrote:
> 
> 
> Sent from Mail for Windows 10
> Hello Sir,
> I have been using Nutch 1.17 in local mode and now I wanted to shift from 
> local mode to deploy mode. For this, I tried the Apache Nutch Hadoop cluster 
> setup link but I am stuck at the below given point :
> 
> Problem :
> 
> First copy the files from the nutch build to the deploy directory using 
> something like the following command:
> 
> cp -R /path/to/build/* /nutch/search
> 
> Then make sure that all of the shell scripts are in unix format and are 
> executable.
> 
> dos2unix /nutch/search/bin/*.sh /nutch/search/bin/hadoop 
> /nutch/search/bin/nutch
> 
> chmod 700 /nutch/search/bin/*.sh /nutch/search/bin/hadoop 
> /nutch/search/bin/nutch
> 
> dos2unix /nutch/search/config/*.sh
> 
> chmod 700 /nutch/search/config/*.sh
> Issue :
> The issue is I ran ant command in nutch folder and runtime folder is created 
> and a build folder is created. I copied the build/* files to search folder 
> that I created in nutch folder itself. But after running these dos2unix 
> commands, it says no bin/Hadoop and bin/nutch files found here which is 
> obvious because my build folder didn’t have these files.
> So can you please clarify these statements that how can I follow these steps?
> I have only 1 user where I am setting all 3 hadoop, solr and nutch which is 
> not root user.
> 



[ANNOUNCE] New Nutch committer and PMC - Shashanka Balakuntala Srinivasa

2020-07-28 Thread Sebastian Nagel
Dear all,

it is my pleasure to announce that Shashanka Balakuntala Srinivasa has joined us
as a committer and member of the Nutch PMC. Shashanka Balakuntala has recently worked
on a long list of Nutch issues and improvements.

Thanks, Shashanka Balakuntala, and congratulations on your new role within the 
Apache Nutch community! And thanks for your contributions and
efforts so far, hope to see more!

Welcome on board!

Sebastian
(on behalf of the Nutch PMC)


Re: Apache Nutch 1.16 Fetcher reducers?

2020-07-27 Thread Sebastian Nagel
> might have to create my own custom FetcherOutputFormat to allow out of
> order writes. I will check how I can do that.

Just replace the MapFile.Writer with a SequenceFile.Writer.
This may require further changes elsewhere, though.

> I have also concluded this discussion here -
> https://stackoverflow.com/questions/63003881/apache-nutch-1-16-fetcher-reducers/.

Thanks for updating the discussion there!

On 7/22/20 4:09 PM, prateek sachdeva wrote:
> ctly Thanks a lot Sebastian. Yes, after checking the logs i saw "key out of
> order exception" and realized that MapFile expects entries to be in order
> and MapFile is used in FetcherOutputFormat while writing data to HDFS. I
> might have to create my own custom FetcherOutputFormat to allow out of
> order writes. I will check how I can do that.
> 
> I will also try to merge parsing and avro conversion to fetch Job directly
> so see if there are some improvements.
> 
> I have also concluded this discussion here -
> https://stackoverflow.com/questions/63003881/apache-nutch-1-16-fetcher-reducers/.
> So if you want to add something here, please feel free to do so.
> 
> Regards
> Prateek
> 
> On Tue, Jul 21, 2020 at 7:50 PM Sebastian Nagel
>  wrote:
> 
>> Hi Prateek,
>>
>>> if I do 0 reducers in
>>> the Fetch phase, I am not getting all the urls in output that I seeded in
>>> input. Looks like only a few of them made it to the final output.
>>
>> There should be error messages in the task logs caused by output not sorted
>> by URL (used as key in map files).
>>
>>
>>>> Final clarification - If I do fetcher.store.content=true and
>>>> fetcher.parse=true, I don't need that Parse Job in my workflow and
>> parsing
>>>> will be done as part of fetcher flow only?
>>
>> Yes, parsing is then done in the fetcher and the parse output is written to
>> crawl_parse, parse_text and parse_data.
>>
>> Best,
>> Sebastian
>>
>> On 7/21/20 3:42 PM, prateek sachdeva wrote:
>>> Correcting my statement below. I just realized that if I do 0 reducers in
>>> the Fetch phase, I am not getting all the urls in output that I seeded in
>>> input. Looks like only a few of them made it to the final output.
>>> So something is not working as expected if we use 0 reducers in the Fetch
>>> phase.
>>>
>>> Regards
>>> Prateek
>>>
>>> On Tue, Jul 21, 2020 at 2:13 PM prateek sachdeva 
>>> wrote:
>>>
>>>> Makes complete sense. Agreed that 0 reducers in apache nutch fetcher
>> won't
>>>> make sense because of tooling that's built around it.
>>>> Answering your questions - No, we have not made any changes to
>>>> FetcherOutputFormat. Infact, the whole fetcher and parse job is the
>> same as
>>>> that of apache nutch 1.16(Fetcher.java and ParseSegment.java). We have
>>>> built wrappers around these classes to run using Azkaban (
>>>> https://azkaban.github.io/). And still it works if I assign 0 reducers
>> in
>>>> the Fetch phase.
>>>>
>>>> Final clarification - If I do fetcher.store.content=true and
>>>> fetcher.parse=true, I don't need that Parse Job in my workflow and
>> parsing
>>>> will be done as part of fetcher flow only?
>>>> Also, I agree with your point that if I modify FetcherOutputFormat to
>>>> include avro conversion step, I might get rid of that as well. This will
>>>> save some time for sure since Fetcher will be directly creating the
>> final
>>>> avro format that I need. So the only question remains is that if I do
>>>> fetcher.parse=true, can I get rid of parse Job as a separate step
>>>> completely.
>>>>
>>>> Regards
>>>> Prateek
>>>>
>>>> On Tue, Jul 21, 2020 at 1:26 PM Sebastian Nagel
>>>>  wrote:
>>>>
>>>>> Hi Prateek,
>>>>>
>>>>> (regarding 1.)
>>>>>
>>>>> It's also possible to combine fetcher.store.content=true and
>>>>> fetcher.parse=true.
>>>>> You might save some time unless the fetch job is CPU-bound - it usually
>>>>> is limited by network and RAM for buffering content.
>>>>>
>>>>>> which code are you referring to?
>>>>>
>>>>> Maybe it isn't "a lot". The SegmentReader is assuming map files, and
>>>>> there are probably
>>>>> some more tools which also do.  If nothing is used in your workflow,

Re: Apache Nutch 1.16 Fetcher reducers?

2020-07-21 Thread Sebastian Nagel
Hi Prateek,

> if I do 0 reducers in
> the Fetch phase, I am not getting all the urls in output that I seeded in
> input. Looks like only a few of them made it to the final output.

There should be error messages in the task logs caused by output not sorted
by URL (used as key in map files).


>> Final clarification - If I do fetcher.store.content=true and
>> fetcher.parse=true, I don't need that Parse Job in my workflow and parsing
>> will be done as part of fetcher flow only?

Yes, parsing is then done in the fetcher and the parse output is written to
crawl_parse, parse_text and parse_data.

Best,
Sebastian

On 7/21/20 3:42 PM, prateek sachdeva wrote:
> Correcting my statement below. I just realized that if I do 0 reducers in
> the Fetch phase, I am not getting all the urls in output that I seeded in
> input. Looks like only a few of them made it to the final output.
> So something is not working as expected if we use 0 reducers in the Fetch
> phase.
> 
> Regards
> Prateek
> 
> On Tue, Jul 21, 2020 at 2:13 PM prateek sachdeva 
> wrote:
> 
>> Makes complete sense. Agreed that 0 reducers in apache nutch fetcher won't
>> make sense because of tooling that's built around it.
>> Answering your questions - No, we have not made any changes to
>> FetcherOutputFormat. Infact, the whole fetcher and parse job is the same as
>> that of apache nutch 1.16(Fetcher.java and ParseSegment.java). We have
>> built wrappers around these classes to run using Azkaban (
>> https://azkaban.github.io/). And still it works if I assign 0 reducers in
>> the Fetch phase.
>>
>> Final clarification - If I do fetcher.store.content=true and
>> fetcher.parse=true, I don't need that Parse Job in my workflow and parsing
>> will be done as part of fetcher flow only?
>> Also, I agree with your point that if I modify FetcherOutputFormat to
>> include avro conversion step, I might get rid of that as well. This will
>> save some time for sure since Fetcher will be directly creating the final
>> avro format that I need. So the only question remains is that if I do
>> fetcher.parse=true, can I get rid of parse Job as a separate step
>> completely.
>>
>> Regards
>> Prateek
>>
>> On Tue, Jul 21, 2020 at 1:26 PM Sebastian Nagel
>>  wrote:
>>
>>> Hi Prateek,
>>>
>>> (regarding 1.)
>>>
>>> It's also possible to combine fetcher.store.content=true and
>>> fetcher.parse=true.
>>> You might save some time unless the fetch job is CPU-bound - it usually
>>> is limited by network and RAM for buffering content.
>>>
>>>> which code are you referring to?
>>>
>>> Maybe it isn't "a lot". The SegmentReader is assuming map files, and
>>> there are probably
>>> some more tools which also do.  If nothing is used in your workflow,
>>> that's fine.
>>> But if a fetcher without the reduce step should become the default for
>>> Nutch, we'd
>>> need to take care for all tools and also ensure backward-compatibility.
>>>
>>>
>>>> FYI- I tried running with 0 reducers
>>>
>>> I assume you've also adapted FetcherOutputFormat ?
>>>
>>> Btw., you could think about inlining the "avroConversion" (or parts of
>>> it) into FetcherOutputFormat which also could remove the need to
>>> store the content.
>>>
>>> Best,
>>> Sebastian
>>>
>>>
>>> On 7/21/20 11:28 AM, prateek sachdeva wrote:
>>>> Hi Sebastian,
>>>>
>>>> Thanks for your reply. Couple of questions -
>>>>
>>>> 1. We have customized apache nutch jobs a bit like this. We have a
>>> separate parse job (ParseSegment.java) after fetch job (Fetcher.java). So
>>>> as suggested above, if I use fetcher.store.content=false, I am assuming
>>> the "content" folder will not be created and hence our parse job
>>>> won't work because it takes the content folder as an input file. Also,
>>> we have added an additional step "avroConversion" which takes input
>>>> as "parse_data", "parse_text", "content" and "crawl_fetch" and converts
>>> into a specific avro schema defined by us. So I think, I will end up
>>>> breaking a lot of things if I add fetcher.store.content=false and do
>>> parsing in the fetch phase only (fetcher.parse=true)
>>>>
>>>> image.png
>>>>
>>>> 2. In your earlier email, you said "a lot of code accessing the
>>> segments stil

Re: Apache Nutch 1.16 Fetcher reducers?

2020-07-21 Thread Sebastian Nagel
Hi Prateek,

(regarding 1.)

It's also possible to combine fetcher.store.content=true and fetcher.parse=true.
You might save some time unless the fetch job is CPU-bound - it usually is 
limited by network and RAM for buffering content.

> which code are you referring to?

Maybe it isn't "a lot". The SegmentReader is assuming map files, and there are 
probably
some more tools which also do.  If nothing is used in your workflow, that's 
fine.
But if a fetcher without the reduce step should become the default for Nutch, 
we'd
need to take care for all tools and also ensure backward-compatibility.


> FYI- I tried running with 0 reducers

I assume you've also adapted FetcherOutputFormat ?

Btw., you could think about inlining the "avroConversion" (or parts of it) into 
FetcherOutputFormat which also could remove the need to
store the content.

Best,
Sebastian


On 7/21/20 11:28 AM, prateek sachdeva wrote:
> Hi Sebastian,
> 
> Thanks for your reply. Couple of questions -
> 
> 1. We have customized apache nutch jobs a bit like this. We have a separate 
> parse job (ParseSegment.java) after fetch job (Fetcher.java). So
> as suggested above, if I use fetcher.store.content=false, I am assuming the 
> "content" folder will not be created and hence our parse job
> won't work because it takes the content folder as an input file. Also, we 
> have added an additional step "avroConversion" which takes input
> as "parse_data", "parse_text", "content" and "crawl_fetch" and converts into 
> a specific avro schema defined by us. So I think, I will end up
> breaking a lot of things if I add fetcher.store.content=false and do parsing 
> in the fetch phase only (fetcher.parse=true)
> 
> image.png
> 
> 2. In your earlier email, you said "a lot of code accessing the segments 
> still assumes map files", which code are you referring to? In my
> use case above, we are not sending the crawled output to any indexers. In the 
> avro conversion step, we just convert data into avro schema
> and dump to HDFS. Do you think we still need reducers in the fetch phase? 
> FYI- I tried running with 0 reducers and don't see any impact as
> such.
> 
> Appreciate your help.
> 
> Regards
> Prateek
> 
> On Tue, Jul 21, 2020 at 9:06 AM Sebastian Nagel 
>  wrote:
> 
> Hi Prateek,
> 
> you're right there is no specific reducer used but without a reduce step
> the segment data isn't (re)partitioned and the data isn't sorted.
> This was a strong requirement once Nutch was a complete search engine
> and the "content" subdir of a segment was used as page cache.
> Getting the content from a segment is fast if the segment is partitioned
> in a predictable way (hash partitioning) and map files are used.
> 
> Well, this isn't a strong requirement anymore, since Nutch uses Solr,
> Elasticsearch or other index services. But a lot of code accessing
> the segments still assumes map files. Removing the reduce step from
> the fetcher would also mean a lot of work in code and tools accessing
> the segments, esp. to ensure backward compatibility.
> 
> Have you tried to run the fetcher with
>  fetcher.parse=true
>  fetcher.store.content=false ?
> This will save a lot of time and without the need to write the large
> raw content the reduce phase should be fast, only a small fraction
> (5-10%) of the fetcher map phase.
> 
> Best,
> Sebastian
> 
> 
> On 7/20/20 11:38 PM, prateek sachdeva wrote:
> > Hi Guys,
> >
> > As per Apache Nutch 1.16 Fetcher class implementation here -
> > 
> https://github.com/apache/nutch/blob/branch-1.16/src/java/org/apache/nutch/fetcher/Fetcher.java,
> > this is a map only job. I don't see any reducer set in the Job. So my
> > question is why not set job.setNumreduceTasks(0) and save the time by
> > outputting directly to HDFS.
> >
> > Regards
> > Prateek
> >
> 



Re: Apache Nutch 1.16 Fetcher reducers?

2020-07-21 Thread Sebastian Nagel
Hi Prateek,

you're right there is no specific reducer used but without a reduce step
the segment data isn't (re)partitioned and the data isn't sorted.
This was a strong requirement once Nutch was a complete search engine
and the "content" subdir of a segment was used as page cache.
Getting the content from a segment is fast if the segment is partitioned
in a predictable way (hash partitioning) and map files are used.

Well, this isn't a strong requirement anymore, since Nutch uses Solr,
Elasticsearch or other index services. But a lot of code accessing
the segments still assumes map files. Removing the reduce step from
the fetcher would also mean a lot of work in code and tools accessing
the segments, esp. to ensure backward compatibility.

Have you tried to run the fetcher with
 fetcher.parse=true
 fetcher.store.content=false ?
This will save a lot of time and without the need to write the large
raw content the reduce phase should be fast, only a small fraction
(5-10%) of the fetcher map phase.
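
As a sketch, both properties can be set in nutch-site.xml or passed on the command
line when running the fetch step manually (segment path and thread count below are
placeholders):

  bin/nutch fetch -Dfetcher.parse=true -Dfetcher.store.content=false \
    crawl/segments/<segment> -threads 50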

Best,
Sebastian


On 7/20/20 11:38 PM, prateek sachdeva wrote:
> Hi Guys,
> 
> As per Apache Nutch 1.16 Fetcher class implementation here -
> https://github.com/apache/nutch/blob/branch-1.16/src/java/org/apache/nutch/fetcher/Fetcher.java,
> this is a map only job. I don't see any reducer set in the Job. So my
> question is why not set job.setNumreduceTasks(0) and save the time by
> outputting directly to HDFS.
> 
> Regards
> Prateek
> 



[ANNOUNCE] Apache Nutch 1.17 Release

2020-07-02 Thread Sebastian Nagel
The Apache Nutch team is pleased to announce the release of
Apache Nutch v1.17.

Nutch is a well matured, production ready Web crawler. Nutch 1.x enables
fine grained configuration, relying on Apache Hadoop™ data structures.

Source and binary distributions are available for download from the
Apache Nutch download site:
   https://nutch.apache.org/downloads.html

Please verify signatures using the KEYS file available at the above
location when downloading the release.

This release includes more than 60 bug fixes and improvements, the full
list of changes can be seen in the release report
  https://s.apache.org/ovhry
Please also check the changelog for breaking changes:
  https://apache.org/dist/nutch/1.17/CHANGES.txt

Thanks to everyone who contributed to this release!


[RESULT] was [VOTE] Release Apache Nutch 1.17 RC#1

2020-07-01 Thread Sebastian Nagel
Hi Folks,

thanks to everyone who was able to review the release candidate!

72 hours have passed, please see below for vote results.

[4] +1 Release this package as Apache Nutch 1.17
   Markus Jelsma *
   Furkan Kamaci *
   Shashanka Balakuntala Srinivasa
   Sebastian Nagel *

[0] -1 Do not release this package because ...

* Nutch PMC

The VOTE passes with 3 binding votes from Nutch PMC members.

I'll continue and publish the release packages. Tomorrow, after the
packages have been propagated to all mirrors, I'll send the announcement.

Thanks to everyone who has contributed to Nutch and the 1.17 release.

Sebastian


On 6/30/20 11:47 AM, Markus Jelsma wrote:
> Hello,
> 
> +1 from me too!
> 
> Thanks,
> Markus
>  
> -Original message-
>> From:Furkan KAMACI 
>> Sent: Saturday 20th June 2020 18:15
>> To: d...@nutch.apache.org
>> Subject: Re: [VOTE] Release Apache Nutch 1.17 RC#1
>>
>> Hi,
>>
>> +1 from me (binding).
>>
>> I checked:
>>
>> - LICENSE and NOTICE are fine 
>> - No unexpected binary files 
>> - Checked PGP signatures
>> - Checked Checksums
>> - Code compiles and tests successfully run
>>
>> PS: You can point KEYS at release vote email:
>> https://downloads.apache.org/nutch/KEYS
>>
>> Kind Regards, 
>> Furkan KAMACI
>>
>> On Fri, Jun 19, 2020 at 11:04 AM Shashanka Balakuntala <shbalakunt...@gmail.com> wrote:
>> Hi Sebastian, 
>>
>> +1 
>> (NON-PMC Vote) from my side
>>
>> The build succeeds, tests pass and I have tested the indexing with solr and 
>> elastic and works for some crawls i have done. 
>> Regards 
>>   Shashanka Balakuntala Srinivasa 
>>        
>>
>>
>> On Thu, Jun 18, 2020 at 3:54 PM BlackIce <blackice...@gmail.com> wrote:
>> Hi , 
>> Gonna take it for a spin later 
>> Greetz 
>> RRK 
>>
>> On Thu, Jun 18, 2020, 12:23 Sebastian Nagel <sna...@apache.org> wrote:
>> Hi Folks,
>>
>> A first candidate for the Nutch 1.17 release is available at:
>>
>>    https://dist.apache.org/repos/dist/dev/nutch/1.17/
>>
>> The release candidate is a zip and tar.gz archive of the binary and sources in:
>>    https://github.com/apache/nutch/tree/release-1.17
>>
>> In addition, a staged maven repository is available here:
>>    https://repository.apache.org/content/repositories/orgapachenutch-1018/
>>
>> We addressed 61 issues:
>>    https://s.apache.org/ovhry
>>
>> Please vote on releasing this package as Apache Nutch 1.17.
>> The vote is open for the next 72 hours and passes if a majority of at
>> least three +1 Nutch PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Nutch 1.17.
>> [ ] -1 Do not release this package because…
>>
>> Cheers,
>> Sebastian
>> (On behalf of the Nutch PMC)
>>
>> P.S. Here is my +1.
> 



Re: protocol-interactiveselenium Custom Handler

2020-06-25 Thread Sebastian Nagel
Hi Craig,

In case you're building Nutch from the git repo or from the source package,
the easiest way is to put the file NewCustomHandler.java into
  src/plugin/protocol-interactiveselenium/src/java/.../handlers/
and run
  ant runtime
to compile and package Nutch, including your custom handler.

Using a jar isn't as simple, mostly because of the classpath encapsulation
of Nutch plugins.

1. add you jar as a dependency to
src/plugin/protocol-interactiveselenium/ivy.xml

2. register the file name of the jar in
src/plugin/protocol-interactiveselenium/plugin.xml
   as an additional <library> element inside the <runtime> section (see the sketch below)

3. build Nutch, see above


Of course, ivy must be able to pick the jar from one of
the repositories listed in
  ivy/ivysettings.xml

But it's possible to add your local Maven repo/cache as an additional resolver
in ivy/ivysettings.xml.
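
A minimal sketch of the three pieces, assuming the handler classes are packaged in a
jar named my-selenium-handlers-1.0.jar (organisation, names and paths are illustrative):

  <!-- src/plugin/protocol-interactiveselenium/ivy.xml -->
  <dependency org="com.example" name="my-selenium-handlers" rev="1.0" conf="*->default"/>

  <!-- src/plugin/protocol-interactiveselenium/plugin.xml, inside the <runtime> element -->
  <library name="my-selenium-handlers-1.0.jar"/>

  <!-- ivy/ivysettings.xml, an additional resolver referenced from the existing chain -->
  <ibiblio name="local-maven" root="file:${user.home}/.m2/repository/" m2compatible="true"/>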
> An example of a custom handler that someone has written would be great.

There are some handler implementations in
  src/plugin/protocol-interactiveselenium/src/java/.../handlers/
I've never made use of them, but they look "custom", at least
at first glance, because one file name includes a typo. :)
If you have time please open a Jira issue at
  https://issues.apache.org/jira/projects/NUTCH
to fix the naming.

Thanks,
Sebastian


On 6/25/20 1:18 AM, Craig Tataryn wrote:
> Hello, I would like to create my own Custom Handler for
> protocol-interactiveselenium.
> 
> In reading the code [1] I see that when setting the config:
> 
> 
>   interactiveselenium.handlers
>   NewCustomHandler,DefaultHandler
>   
> 
> 
> the "NewCustomerHandler" would be loaded from the classpath assuming it was
> called: 
> org.apache.nutch.protocol.interactiveselenium.handlers.NewCustomerHandler.
> However, my question is: how do I get Nutch to incorporate my new .jar file
> containing the NewCustomerHandler?
> 
> I've written protocol and indexer plugins before, however this seems a bit
> different. An example of a custom handler that someone has written would be
> great.
> 
> Thanks,
> 
> Craig.
> 
> [1] -
> https://github.com/apache/nutch/blob/ea862f45b83177b41aebad9c18b900936d43a19a/src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol/interactiveselenium/HttpResponse.java#L364
> 



[VOTE] Release Apache Nutch 1.17 RC#1

2020-06-18 Thread Sebastian Nagel
Hi Folks,

A first candidate for the Nutch 1.17 release is available at:

   https://dist.apache.org/repos/dist/dev/nutch/1.17/

The release candidate is a zip and tar.gz archive of the binary and sources in:
   https://github.com/apache/nutch/tree/release-1.17

In addition, a staged maven repository is available here:
   https://repository.apache.org/content/repositories/orgapachenutch-1018/

We addressed 61 issues:
   https://s.apache.org/ovhry


Please vote on releasing this package as Apache Nutch 1.17.
The vote is open for the next 72 hours and passes if a majority of at
least three +1 Nutch PMC votes are cast.

[ ] +1 Release this package as Apache Nutch 1.17.
[ ] -1 Do not release this package because…

Cheers,
Sebastian
(On behalf of the Nutch PMC)

P.S. Here is my +1.


Preparing to release 1.17

2020-06-16 Thread Sebastian Nagel
Hi,

the list of open issues for 1.17 became short, and I will move some of the 
remaining
issues to 1.18 to get the way free and prepare the first release candidate in 
the
next two days.

If there are urgent fixes (including a PR / patch). Let me know!

Thanks,
Sebastian




Re: Nutch 1.17 download available?

2020-06-08 Thread Sebastian Nagel
Hi Jim,

Nutch 1.17 should land soon but there are a couple of issue to be fixed before 
the release.

Best,
Sebastian


On 6/8/20 12:11 AM, Lewis John McGibbney wrote:
> Hi Jim,
> Response below
> 
> On 2020/06/06 14:23:24, Jim Anderson  wrote: 
>>
>> I cannot find a download for Nutch 1.17. Is Nutch 1.17 available for
>> download? If so, can someone please give me a pointer.
>>
> 
> Nutch 1.17 is current master branch e.g. in development, meaning that there 
> is no official release as of yet. 
> 
> The most recent version of Nutch is 1.16 which you can download from the 
> downloads page http://nutch.apache.org/downloads.html
> 
> Heads up here, all official releases are automatically archived at 
> archive.apache.org. For example, the Nutch releases are available at 
> http://archive.apache.org/dist/nutch/
> 
> Thanks. Any more questions please let us know :)
> 
> lewismc 
> 



Re: [Non-DoD Source] Re: [DISCUSS] Release 1.17 ? (UNCLASSIFIED)

2020-04-23 Thread Sebastian Nagel
Hi Kris,

please follow the instructions given on
  https://nutch.apache.org/mailing_lists.html

Best,
Sebastian


On 4/23/20 1:48 PM, Musshorn, Kris T CTR USARMY SEC (USA) wrote:
> CLASSIFICATION: UNCLASSIFIED
> 
> Can you drop me from the masiling list?
> 
> Thanks,
> Kris T Musshorn CTR CECOM SharePoint Team
> 443-861-8614
> APG Bldg 6002 D5101/108
> I am currently teleworking and can be reached at 860 670 9494
> 
> -Original Message-
> From: lewis john mcgibbney [mailto:lewi...@apache.org] 
> Sent: Thursday, April 23, 2020 4:21 AM
> To: user-dig...@nutch.apache.org
> Cc: user@nutch.apache.org
> Subject: [Non-DoD Source] Re: [DISCUSS] Release 1.17 ?
> 
> All active links contained in this email were disabled.  Please verify the 
> identity of the sender, and confirm the authenticity of all links contained 
> within the message prior to copying and pasting the address to a Web browser. 
>  
> 
> 
> 
> 
> 
> 
> Hi Seb,
> Go for it. I’ll happily review.
> Excellent work folks... really excellent work.
> lewismc
> 
> On Wed, Apr 22, 2020 at 23:27  wrote:
> 
>>
>> user Digest 23 Apr 2020 06:27:46 - Issue 3055
>>
>> Topics (messages 34517 through 34517)
>>
>> [DISCUSS] Release 1.17 ?
>> 34517 by: Sebastian Nagel
>>
>> Administrivia:
>>
>> -
>> To post to the list, e-mail: user@nutch.apache.org To unsubscribe, 
>> e-mail: user-digest-unsubscr...@nutch.apache.org
>> For additional commands, e-mail: user-digest-h...@nutch.apache.org
>>
>> --
>>
>>
>>
>>
>> -- Forwarded message --
>> From: Sebastian Nagel 
>> To: d...@nutch.apache.org, user@nutch.apache.org
>> Cc:
>> Bcc:
>> Date: Thu, 23 Apr 2020 08:27:39 +0200
>> Subject: [DISCUSS] Release 1.17 ?
>> Hi all,
>>
>> 30 issues are done now
>>   
>> Caution-https://issues.apache.org/jira/browse/NUTCH/fixforversion/1234
>> 6090
>>
>> including a number of important dependency upgrades:
>> - Hadoop 3.1 (NUTCH-2777)
>> - Elasticsearch 7.3.0 REST client (NUTCH-2739) Thanks to Shashanka 
>> Balakuntala Srinivasa for both!
>>
>> Dependency upgrades to be included (but still open right now):
>> - Tika 1.24.1
>> - Solr 8.5.1
>>
>> The last release (1.16) was in October, so it's definitely not too 
>> early to release 1.17.  As usual, we'll check all remaining issues 
>> whether they should be fixed now or can be done later in 1.18.
>>
>> I would be ready to push a release candidate during the next weeks and 
>> have already started to work through the remaining issues. Please, 
>> comment on issues you want to get fixed already in 1.17!
>>
>> Thanks,
>> Sebastian
>>
>> --
> Caution-http://home.apache.org/~lewismc/
> Caution-http://people.apache.org/keys/committer/lewismc
> 
> 
> CLASSIFICATION: UNCLASSIFIED
> 



[DISCUSS] Release 1.17 ?

2020-04-23 Thread Sebastian Nagel
Hi all,

30 issues are done now
  https://issues.apache.org/jira/browse/NUTCH/fixforversion/12346090

including a number of important dependency upgrades:
- Hadoop 3.1 (NUTCH-2777)
- Elasticsearch 7.3.0 REST client (NUTCH-2739)
Thanks to Shashanka Balakuntala Srinivasa for both!

Dependency upgrades to be included (but still open right now):
- Tika 1.24.1
- Solr 8.5.1

The last release (1.16) was in October, so it's definitely not too early to
release 1.17.  As usual, we'll check all remaining issues whether they should
be fixed now or can be done later in 1.18.

I would be ready to push a release candidate during the next weeks and have
already started to work through the remaining issues. Please, comment on
issues you want to get fixed already in 1.17!

Thanks,
Sebastian


Re: finding broken links with nutch 1.14

2020-03-03 Thread Sebastian Nagel
Hi Robert,

404s are recorded in the CrawlDb after the tool "updatedb" is called.
Could you share the commands you're running? Please also have a look into the 
log files (esp. the
hadoop.log) - all fetches are logged and
also whether fetches have failed. If you cannot find a log message
for the broken links, it might be that the URLs are filtered. In this
case, please also share the configuration (if different from the default).
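
As a sketch, once the CrawlDb has been updated from the fetched segment, the broken
links can be dumped by status (paths and segment name are placeholders; the -status
filter selects entries marked as gone):

  bin/nutch updatedb crawl/crawldb crawl/segments/<segment>
  bin/nutch readdb crawl/crawldb -dump dump_gone -status db_gone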

Best,
Sebastian

On 3/2/20 11:11 PM, Robert Scavilla wrote:
> Nutch 1.14:
> I am looking at the FetcherThread code. The 404 url does get flagged with
> a ProtocolStatus.NOTFOUND, but the broken link never gets to the crawldb.
> It does however got into the linkdb. Please tell me how I can collect these
> 404 urls.
> 
> Any help would be appreciated,
> .,..bob
> 
>case ProtocolStatus.NOTFOUND:
> case ProtocolStatus.GONE: // gone
> case ProtocolStatus.ACCESS_DENIED:
> case ProtocolStatus.ROBOTS_DENIED:
>   output(fit.url, fit.datum, null, status,
>   CrawlDatum.STATUS_FETCH_GONE); // broken link is
> getting here
>   break;
> 
> On Fri, Feb 28, 2020 at 12:06 PM Robert Scavilla 
> wrote:
> 
>> Hi again, and thank you in advance for your kind help.
>>
>> I'm using Nutch 1.14
>>
>> I'm trying to use nutch to find broken links (404s) on a site. I
>> followed the instructions:
>> bin/nutch readdb /crawldb/ -dump myDump
>>
>> but the dump only shows 200 and 301 status. There is no sign of any broken
>> link. When enter just 1 broken link in the seed file the crawldb is empty.
>>
>> Please advise how I can inspect broken links with nutch1.14
>>
>> Thank you!
>> ...bob
>>
> 



Re: Extracting XMP metadata from PDF for indexing Nutch 1.15

2020-01-15 Thread Sebastian Nagel
Hi Joseph,

sorry for the late reply. Anyway: the patch for NUTCH-2525
fixes your problem. See also my comments in
   https://issues.apache.org/jira/browse/NUTCH-2525

Thanks,
Sebastian


On 1/2/20 2:55 PM, Gilvary, Joseph wrote:
> Happy New Year, Sebastian,
> 
> Thank you. That looks promising. Hope you enjoy the holiday!
> 
>  Joe 
> 
> -Original Message-----
> From: Sebastian Nagel  
> Sent: Thursday, January 2, 2020 7:42 AM
> To: user@nutch.apache.org
> Subject: Re: Extracting XMP metadata from PDF for indexing Nutch 1.15
> 
> Hi Joseph,
> 
> this could be related to
>
> https://issues.apache.org/jira/browse/NUTCH-2525
> caused by not-all-lowercase meta keys.
> 
> I'm happy to check whether the attached patch fixes your problem when I'm 
> back from holidays in a few days.
> 
> Best,
> Sebastian
> 
> On 12/31/19 5:43 PM, Gilvary, Joseph wrote:
>> Thanks, Markus,
>>
>> Those are the tools I've been using to debug because it's quicker than 
>> reindexing even a test collection in Solr. So parsechecker shows that these 
>> fields are in the parse metadata, but I can't figure out how to get them 
>> into the index. The pdf:docinfo:fields will index as pdf_docinfo_fields, but 
>> the other namespaces using ':' aren't making it through and I'm at a loss.
>>
>> Nutch schema.xml:
>>
>> <field name="xmpTPg_NPages" type="int" indexed="true" stored="true"/>
>>
>> nutch-site.xml:
>>
>> <property>
>>   <name>index.parse.md</name>
>>   <value>description,keywords,dcterms.created,dcterms.modified,dcterms.subject,pdf:docinfo:created,pdf:docinfo:modified,pdf:docinfo:title,xmp:CreatorTool,xmpTPg:NPages</value>
>> </property>
>>
>>
>> Parsechecker sees the values for the xmp stuff:
>>
>> Parse Metadata: date=2011-04-27T18:36:58Z pdf:PDFVersion=1.4 
>> pdf:docinfo:title=Test File xmp:CreatorTool=PScript5.dll Version 5.2.2 
>> access_permission:blah_blah_blah xmpTPg:NPages=23 
>> access_permission:can_modify=true pdf:docinfo:producer=Acrobat 
>> Distiller 7.0.5 (Windows) pdf:docinfo:created=2011-04-27T18:33:06Z
>>
>>
>> Indexchecker doesn't:
>>
>> fetching: http://127.0.01/test.pdf
>> robots.txt whitelist not configured.
>> parsing: http://127.0.01/test.pdf
>> pdf:docinfo:title : Test File
>> tstamp :Tue Dec 31 11:23:28 EST 2019
>> pdf:docinfo:modified :  2011-04-27T18:36:58Z
>> pdf:docinfo:created :   2011-04-27T18:33:06Z
>>
>>
>> The Dublin Core values don't use colon ':' but dot '.' and they show up 
>> fine. There are embedded spaces in some of the xmp values, but the 
>> pdf:docinfo:title has that, too, it shows up in the indexchecker output. I'm 
>> wondering if there's anything special about the pdf:docinfo that isn't 
>> generalized or is somehow configurable for generalization to other 
>> namespaces. 
>>
>>  Thanks,
>>
>>  Joe
>>
>> -Original Message-
>> From: Markus Jelsma 
>> Sent: Tuesday, December 31, 2019 8:30 AM
>> To: user@nutch.apache.org
>> Subject: RE: Extracting XMP metadata from PDF for indexing Nutch 1.15
>>
>> Hello Joseph,
>>
>>> Is there more documentation on having Nutch get what Tika sees into what 
>>> Solr will see?
>>
>> No, but i believe you would want to checkout the parsechecker and 
>> indexchecker tools. These tools display what Tika sees and what will be sent 
>> to Solr.
>>
>> Regards,
>> Markus
>>  
>> -Original message-
>>> From:Gilvary, Joseph 
>>> Sent: Tuesday 31st December 2019 14:19
>>> To: user@nutch.apache.org
>>> Subject: Extracting XMP metadata from PDF for indexing Nutch 1.15
>>>
>>> Happy New Year,
>>>
>>> I've searched the archives and the web as best I can, tinkered with 
>>> nutch-site.xml and schema.xml, but I can't get XMP metadata that's in the 
>>> parse metadata into the Solr (7.6) index.
>>>
>>> I want to index stuff like:
>>>
>>> xmp:CreatorTool=PScript5.dll Version 5.2.2
>>> xmpTPg:NPages=23
>>>
>>> I get the pdf:docinfo:created, pdf:docinfo:modified, etc. fine, but 
>>> swapping out ':' for '_' isn't working for the xmp stuff.
>>>
>>> Is there more documentation on having Nutch get what Tika sees into what 
>>> Solr will see?
>>>
>>> Any help appreciated.
>>>
>>> Thanks,
>>>
>>> Joe
>>>
> 



Re: Extracting XMP metadata from PDF for indexing Nutch 1.15

2020-01-02 Thread Sebastian Nagel
Hi Joseph,

this could be related to
   https://issues.apache.org/jira/browse/NUTCH-2525
caused by not-all-lowercase meta keys.

I'm happy to check whether the attached patch fixes your problem
when I'm back from holidays in a few days.

Best,
Sebastian

On 12/31/19 5:43 PM, Gilvary, Joseph wrote:
> Thanks, Markus,
> 
> Those are the tools I've been using to debug because it's quicker than 
> reindexing even a test collection in Solr. So parsechecker shows that these 
> fields are in the parse metadata, but I can't figure out how to get them into 
> the index. The pdf:docinfo:fields will index as pdf_docinfo_fields, but the 
> other namespaces using ':' aren't making it through and I'm at a loss.
> 
> Nutch schema.xml:
> 
> <field name="xmpTPg_NPages" type="int" indexed="true" stored="true"/>
> 
> nutch-site.xml:
> 
> <property>
>   <name>index.parse.md</name>
>   <value>description,keywords,dcterms.created,dcterms.modified,dcterms.subject,pdf:docinfo:created,pdf:docinfo:modified,pdf:docinfo:title,xmp:CreatorTool,xmpTPg:NPages</value>
> </property>
> 
> 
> Parsechecker sees the values for the xmp stuff:
> 
> Parse Metadata: date=2011-04-27T18:36:58Z pdf:PDFVersion=1.4 
> pdf:docinfo:title=Test File xmp:CreatorTool=PScript5.dll Version 5.2.2 
> access_permission:blah_blah_blah xmpTPg:NPages=23 
> access_permission:can_modify=true pdf:docinfo:producer=Acrobat Distiller 
> 7.0.5 (Windows) pdf:docinfo:created=2011-04-27T18:33:06Z
> 
> 
> Indexchecker doesn't:
> 
> fetching: http://127.0.01/test.pdf
> robots.txt whitelist not configured.
> parsing: http://127.0.01/test.pdf
> pdf:docinfo:title : Test File
> tstamp :Tue Dec 31 11:23:28 EST 2019
> pdf:docinfo:modified :  2011-04-27T18:36:58Z
> pdf:docinfo:created :   2011-04-27T18:33:06Z
> 
> 
> The Dublin Core values don't use colon ':' but dot '.' and they show up fine. 
> There are embedded spaces in some of the xmp values, but the 
> pdf:docinfo:title has that, too, it shows up in the indexchecker output. I'm 
> wondering if there's anything special about the pdf:docinfo that isn't 
> generalized or is somehow configurable for generalization to other 
> namespaces. 
> 
>  Thanks,
> 
>  Joe
> 
> -Original Message-
> From: Markus Jelsma  
> Sent: Tuesday, December 31, 2019 8:30 AM
> To: user@nutch.apache.org
> Subject: RE: Extracting XMP metadata from PDF for indexing Nutch 1.15 
> 
> Hello Joseph,
> 
>> Is there more documentation on having Nutch get what Tika sees into what 
>> Solr will see?
> 
> No, but i believe you would want to checkout the parsechecker and 
> indexchecker tools. These tools display what Tika sees and what will be sent 
> to Solr.
> 
> Regards,
> Markus
>  
> -Original message-
>> From:Gilvary, Joseph 
>> Sent: Tuesday 31st December 2019 14:19
>> To: user@nutch.apache.org
>> Subject: Extracting XMP metadata from PDF for indexing Nutch 1.15 
>>
>> Happy New Year,
>>
>> I've searched the archives and the web as best I can, tinkered with 
>> nutch-site.xml and schema.xml, but I can't get XMP metadata that's in the 
>> parse metadata into the Solr (7.6) index.
>>
>> I want to index stuff like:
>>
>> xmp:CreatorTool=PScript5.dll Version 5.2.2
>> xmpTPg:NPages=23
>>
>> I get the pdf:docinfo:created, pdf:docinfo:modified, etc. fine, but swapping 
>> out ':' for '_' isn't working for the xmp stuff.
>>
>> Is there more documentation on having Nutch get what Tika sees into what 
>> Solr will see?
>>
>> Any help appreciated.
>>
>> Thanks,
>>
>> Joe
>>



Re: Fwd: Crawling 3 websites from one nutch

2019-12-27 Thread Sebastian Nagel
Hi,

the test compares names of the "host" and the registered domain:
  doc.getFieldValue('host')=='urgenthomework.com'

The host name is "www.urgenthomework.com". You can test it via:

  $> bin/nutch indexchecker https://www.urgenthomework.com/
  fetching: https://www.urgenthomework.com/
  ...
  host :  www.urgenthomework.com
  ...
  title : Homework Help for College, University and School Students
  ...
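
So the exchange expression needs the full host name. A sketch of the relevant fragment
of exchanges.xml, keeping the surrounding <exchanges> and <exchange> elements as in
your existing file (writer id taken from your configuration):

  <writers>
    <writer id="indexer_solr_1" />
  </writers>
  <params>
    <param name="expr" value="doc.getFieldValue('host')=='www.urgenthomework.com'" />
  </params>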

Best,
Sebastian


On 12/26/19 11:29 AM, Zara Parst wrote:
> Hi, Is it possible to crawl three different website like
> 
> 1. https://www.urgenthomework.com/
> 2. https://www.myassignmenthelp.net/
> 3. https://www.assignmenthelp.net/
> 
> in single nutch configuration and then send the respective index pages to
> corresponding cores [ uah, mah , yah]  in solr.  I tried to achieve it by
> exchange and writer id.  Please look below for my confirgurations
> 
> -exchange.xml-
> 
> 
> 
> 
> 
> 
> 
> *   id="indexer_solr_1" />   value="doc.getFieldValue('host')=='urgenthomework.com
> '" />  *
> 
> 
> 
> 
> 
> 
> 
> 
> *   id="indexer_solr_2" />   value="doc.getFieldValue('host')=='myassignmenthelp.net
> '" />  *
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> *id="indexer_solr_3" />   value="doc.getFieldValue('host')=='assignmenthelp.net
> '" />  *
> 
> 
> 
> -index.writers.xml
> 
>   class="org.apache.nutch.indexwriter.solr.SolrIndexWriter">
> 
>   
>   http://localhost:8983/solr/uah; />
>   
>   
>   
>   
>   
>   
> 
> 
>   
> 
>   
>   
>   
> 
> 
> 
> 
>   
> 
>   
> 
> 
>class="org.apache.nutch.indexwriter.solr.SolrIndexWriter">
> 
>   
>   http://localhost:8983/solr/mah; />
>   
>   
>   
>   
>   
>   
> 
> 
>   
>   
>   
>   
> 
> 
> 
>   
> 
>   
> 
> 
> 
>class="org.apache.nutch.indexwriter.solr.SolrIndexWriter">
> 
>   
>   http://localhost:8983/solr/yah; />
>   
>   
>   
>   
>   
>   
> 
> 
>   
>   
>   
>   
> 
> 
> 
>   
> 
>   
> 
> ---
> 
> But it is not pushing data into corresponding cores; rather it is sending
> data into one core from different domains. Please do let me know. I am sure
> there has to be a way to achieve it. I didn't try with subcollection.xml. Do you
> think I can achieve it using subcollection?
> 



Re: Fetch failed with protocol status: gone(11)

2019-12-17 Thread Sebastian Nagel
Hi Bob,

> I am not seeing the http status codes though??

Sorry, yes you're right. The headers are recorded but parsechecker
does not print them if fetching fails.

The server responds with a "400 Bad request" if the user-agent string
contains "nutch", reproducible by:
  wget --header 'User-Agent: nutch' -d https://www.avalonpontoons.com/
  ...
  ---response begin---
  HTTP/1.1 400 Bad Request
  ...

You could set the user-agent string:

 bin/nutch parsechecker \
   -Dhttp.agent.name=somethingelse \
   -Dhttp.agent.version='' ...

and this site should work. It is recommended to send a meaningful user-agent string.
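
To set this permanently instead of per command, a sketch of the corresponding
properties in conf/nutch-site.xml (the agent name below is only an example):

  <property>
    <name>http.agent.name</name>
    <value>MyCompanyBot</value>
  </property>
  <property>
    <name>http.agent.version</name>
    <value></value>
  </property>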

Best,
Sebastian


On 12/17/19 9:43 PM, Robert Scavilla wrote:
> Thank you Sebastian. I added the run-time parameters and the output is
> identical. I am not seeing the http status codes though??
> 
> The log file shows:
> 
> 2019-12-17 15:37:36,602 INFO  parse.ParserChecker - fetching:
> https://www.avalonpontoons.com/
> 2019-12-17 15:37:36,872 INFO  protocol.RobotRulesParser - robots.txt
> whitelist not configured.
> 2019-12-17 15:37:36,872 INFO  http.Http - http.proxy.host = null
> 2019-12-17 15:37:36,872 INFO  http.Http - http.proxy.port = 8080
> 2019-12-17 15:37:36,873 INFO  http.Http - http.proxy.exception.list = false
> 2019-12-17 15:37:36,873 INFO  http.Http - http.timeout = 1
> 2019-12-17 15:37:36,873 INFO  http.Http - http.content.limit = -1
> 2019-12-17 15:37:36,873 INFO  http.Http - http.agent = FFDevBot/Nutch-1.14 (
> fourfront.us)
> 2019-12-17 15:37:36,873 INFO  http.Http - http.accept.language =
> en-us,en-gb,en;q=0.7,*;q=0.3
> 2019-12-17 15:37:36,873 INFO  http.Http - http.accept =
> text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
> 2019-12-17 15:37:36,873 INFO  http.Http - http.enable.cookie.header = true
> 
> the command line shows:
>> $NUTCHl/bin/nutch parsechecker -Dstore.http.headers=true
> -Dstore.http.request=true https://www.avalonpontoons.com/
> fetching: https://www.avalonpontoons.com/
> robots.txt whitelist not configured.
> Fetch failed with protocol status: gone(11), lastModified=0:
> https://www.avalonpontoons.com/
> 
> 
> On Tue, Dec 17, 2019 at 11:53 AM Sebastian Nagel
>  wrote:
> 
>> Hi Bob,
>>
>> the relevant Javadoc comment stands before the declaration of a variable
>> (here a constant):
>>   /** Resource is gone. */
>>   public static final int GONE = 11;
>>
>> More detailed, GONE results from one of the following HTTP status codes:
>>  400 Bad request
>>  401 Unauthorized
>>  410 Gone   (*forever* gone, opposed to 404 Not Found)
>> See
>> src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java
>>
>> My guess would be that "www.sitename.com" requires authentication.
>>
>> Just repeat the request as
>>  bin/nutch parsechecker \
>> -Dstore.http.headers=true \
>> -Dstore.http.request=true \
>> ... 
>>
>> (I guess you're already using parsechecker or indexchecker)
>> This will show the HTTP headers where you'll find the exact HTTP status
>> code.
>>
>> Best,
>> Sebastian
>>
>>
>>
>> On 12/17/19 4:36 PM, Robert Scavilla wrote:
>>> Hi again, and thank in advance for your kind help.
>>>
>>> Nutch 1.14
>>>
>>> I am getting the following error message when crawling a site:
>>> *Fetch failed with protocol status: gone(11), lastModified=0:
>>> https://www.sitename.com <https://www.sitename.com>*
>>>
>>> The only documentation I can find says:
>>>
>>>> public static final int GONE = 11;
>>>> /** Resource has moved permanently. New url should be found in args. */
>>>>
>>> I'm not sure what this means. When I load the page in my browser it shows
>>> status codes 200 or 304 for all resources.
>>>
>>> The problem only exists on a single site - other sites crawl fine.
>>>
>>> I saved a page from the site locally and that page fetches successfully.
>>>
>>> Can you please steer my in the right direction. Many Thanks,
>>> ...bob
>>>
>>
>>
> 



Re: Fetch failed with protocol status: gone(11)

2019-12-17 Thread Sebastian Nagel
Hi Bob,

the relevant Javadoc comment appears before the declaration of a variable (here
a constant):
  /** Resource is gone. */
  public static final int GONE = 11;

More detailed, GONE results from one of the following HTTP status codes:
 400 Bad request
 401 Unauthorized
 410 Gone   (*forever* gone, opposed to 404 Not Found)
See 
src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java

My guess would be that "www.sitename.com" requires authentication.

Just repeat the request as
 bin/nutch parsechecker \
-Dstore.http.headers=true \
-Dstore.http.request=true \
... 

(I guess you're already using parsechecker or indexchecker)
This will show the HTTP headers where you'll find the exact HTTP status code.

Best,
Sebastian



On 12/17/19 4:36 PM, Robert Scavilla wrote:
> Hi again, and thank in advance for your kind help.
> 
> Nutch 1.14
> 
> I am getting the following error message when crawling a site:
> *Fetch failed with protocol status: gone(11), lastModified=0:
> https://www.sitename.com *
> 
> The only documentation I can find says:
> 
>> public static final int GONE = 11;
>> /** Resource has moved permanently. New url should be found in args. */
>>
> I'm not sure what this means. When I load the page in my browser it shows
> status codes 200 or 304 for all resources.
> 
> The problem only exists on a single site - other sites crawl fine.
> 
> I saved a page from the site locally and that page fetches successfully.
> 
> Can you please steer my in the right direction. Many Thanks,
> ...bob
> 



Re: Map reducer filtering too many sites during generation in Nutch 2.4

2019-11-14 Thread Sebastian Nagel
Hi Makkara,

> but I believe that this is the fault of the reducer
> Map input records=22048
> Map output records=4

The items are skipped in the mapper.

> Is this a known problem of Nutch 2.4, or have I just misconfigured
> something?

It could be the configuration or a bug in the storage layer causing not all items
of the web table to be sent to the mapper.

Please also note that we expect that 2.4 is the last release on the 2.X series. 
We've decided to
freeze the development on the 2.X branch for now, as no committer is actively 
working on it. Nutch
1.x is actively maintained.

Best,
Sebastian

On 11/13/19 3:42 PM, Makkara Mestari wrote:
> 
> 
> Hello
>  
> I have injected about 1300 domains to the seed list.
>  
> First two fetches work nicely, but after that, the crawler will only select 
> urls from a few domains, leaving all other urls permanently with the status 1 
> (unfetched), which number in tens of thousands. Currently the generator only 
> generates the same 4 urls every time, that are unreachable pages.
>  
> Im not sure, but I believe that this is the fault of the reducer, here is a 
> sample of output during generation phase with setting -topN  5
>  
> 
> 2019-11-13 13:22:29,186 INFO  mapreduce.Job - Job job_local1940214525_0001 
> completed successfully
> 2019-11-13 13:22:29,210 INFO  mapreduce.Job - Counters: 34
>     File System Counters
>     FILE: Number of bytes read=1313864
>     FILE: Number of bytes written=1904695
>     FILE: Number of read operations=0
>     FILE: Number of large read operations=0
>     FILE: Number of write operations=0
>     Map-Reduce Framework
>     Map input records=22048
>     Map output records=4
>     Map output bytes=584
>     Map output materialized bytes=599
>     Input split bytes=953
>     Combine input records=0
>     Combine output records=0
>     Reduce input groups=4
>     Reduce shuffle bytes=599
>     Reduce input records=4
>     Reduce output records=4
>     Spilled Records=8
>     Shuffled Maps =1
>     Failed Shuffles=0
>     Merged Map outputs=1
>     GC time elapsed (ms)=22
>     CPU time spent (ms)=0
>     Physical memory (bytes) snapshot=0
>     Virtual memory (bytes) snapshot=0
>     Total committed heap usage (bytes)=902823936
>     Generator
>     GENERATE_MARK=4
>     Shuffle Errors
>     BAD_ID=0
>     CONNECTION=0
>     IO_ERROR=0
>     WRONG_LENGTH=0
>     WRONG_MAP=0
>     WRONG_REDUCE=0
>     File Input Format Counters
>     Bytes Read=0
>     File Output Format Counters
>     Bytes Written=0
> 2019-11-13 13:22:29,238 INFO  crawl.GeneratorJob - GeneratorJob: finished at 
> 2019-11-13 13:22:29, time elapsed: 00:00:04
> 2019-11-13 13:22:29,238 INFO  crawl.GeneratorJob - GeneratorJob: generated 
> batch id: 1573651344-1856402192 containing 4 URLs
>  
> If I try resetting the crawldb, and injecting only one of the domains, then I 
> can crawl it compleately fine, this problem of never fetched pages only 
> arises if I try to work with a moderate amount of domains at the time (1300 
> in this case).
>  
> Is this a known problem of Nutch 2.4, or have I just misconfigured something?
>  
> -Makkara
> 



Re: Metadata not indexed after migrating to Nutch 2.4

2019-11-11 Thread Sebastian Nagel
Hi Anton,

after a short look into MetadataIndexer:
- it does not request any fields from the webpage,
  see getFields() method
- this is a bug (but it was already present in 2.3.1)
- it could be worked around by activating another
  plugin which requests the METADATA field/column,
  e.g. language-identifier/LanguageIndexingFilter

That's one possible explanation.
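
As a sketch, that means appending language-identifier to the plugin.includes value in
nutch-site.xml; keep whatever plugins (including your indexer plugin) you already list,
the value below is only illustrative:

  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor|metadata)|language-identifier|urlnormalizer-(pass|regex|basic)</value>
  </property>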

Please note that it is unlikely that there will be further
releases on the 2.x series of Nutch, see the release announcement
for more details.

Best,
Sebastian


On 11/11/19 12:44 PM, Anton Skarp wrote:
> Hi,
> 
> After migrating from nutch 2.3.1 to 2.4 I have not been able to conf nutch to 
> index metadata to elasticsearch. Indexchecker gets the metadata correctly 
> though.
> I have tried both hbase version 0.9.8-hadoop2 and also with mongodb. Both 
> contained the wanted metadata.
> 
> I have done some debugging and the problem seems to be that MetadataIndexer 
> filter methods parameter page does not even contain the metadata.
> 
> There are no exceptions/errors outputted by nutch or elasticsearch.
> 
> Any ideas on what is the problem and how I should approach fixing it.
> 
> 
> Regards. Anton
> 



Re: Best and economical way of setting hadoop cluster for distributed crawling

2019-11-01 Thread Sebastian Nagel
Hi Sachin,

> What I have observed is that it usually fetches, parses and indexes
> 1800 web pages.

This means 10 pages per minute.

How are the 1800 pages distributed over hosts?

The default delay between successive fetches to the same host is
5 seconds. If all pages belong to the same host, the crawler is
waiting 50 sec. every minute and the fetching is done in the remaining
10 sec.

If you have the explicit permission to access the host(s) aggressively, you can 
decrease the delay
(fetcher.server.delay) or even fetch in parallel from the same host 
(fetcher.threads.per.queue).
Otherwise, please keep the delay as is and be patient and polite! You also risk
getting blocked by the web admin.
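
For example, in nutch-site.xml (values are illustrative and should only be used with
the site owner's permission):

  <property>
    <name>fetcher.server.delay</name>
    <value>1.0</value>
  </property>
  <property>
    <name>fetcher.threads.per.queue</name>
    <value>2</value>
  </property>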

> What I have understood here is that in local mode there is only one
> thread doing the fetch?

No. The number of parallel threads used in bin/crawl is 50.
 --num-threads <threads>
Number of threads for fetching / sitemap processing [default: 50]

I can only second Markus: local mode is sufficient unless you're crawling
- significantly more than 10M+ URLs
- from 1000+ domains

With less domains/hosts there's nothing to distribute because all
URLs of one domain/host are processed in one fetcher task to ensure
politeness.

Best,
Sebastian

On 11/1/19 6:53 AM, Sachin Mittal wrote:
> Hi,
> I understood the point.
> I would also like to run nutch on my local machine.
> 
> So far I am running in standalone mode with default crawl script where
> fetch time limit is 180 minutes.
> What I have observed is that it usually fetches, parses and indexes 1800
> web pages.
> I am basically fetching the entire page and fetch process is one that takes
> maximum time.
> 
> I have a i7 processor with 16GB of RAM.
> 
> How can I increase the throughput here?
> What I have understood here is that in local mode there is only one thread
> doing the fetch?
> 
> I guess I would need multiple threads running in parallel.
> Would running nutch in pseudo distributed mode and answer here?
> It will then run multiple fetchers and I can increase my throughput.
> 
> Please let me know.
> 
> Thanks
> Sachin
> 
> 
> 
> 
> 
> 
> On Thu, Oct 31, 2019 at 2:40 AM Markus Jelsma 
> wrote:
> 
>> Hello Sachin,
>>
>> Nutch can run on Amazon AWS without trouble, and probably on any Hadoop
>> based provider. This is the most expensive option you have.
>>
>> Cheaper would be to rent some servers and install Hadoop yourself, getting
>> it up and running by hand on some servers will take the better part of a
>> day.
>>
>> The cheapest and easiest, and in almost all cases the best option, is not
>> to run Nutch on Hadoop and stay local. A local Nutch can easily handle a
>> couple of million URLs. So unless you want to crawl many different domains
>> and expect 10M+ URLs, stay local.
>>
>> When we first started our business almost a decade ago we rented VPSs
>> first and then physical machines. This ran fine for some years but when we
>> had the option to make some good investments, we bought our own hardware
>> and have been scaling up the cluster ever since. And with the previous and
>> most recent AMD based servers processing power became increasingly cheaper.
>>
>> If you need to scale up for long term, getting your own hardware is indeed
>> the best option.
>>
>> Regards,
>> Markus
>>
>>
>> -Original message-
>>> From:Sachin Mittal 
>>> Sent: Tuesday 22nd October 2019 15:59
>>> To: user@nutch.apache.org
>>> Subject: Best and economical way of setting hadoop cluster for
>> distributed crawling
>>>
>>> Hi,
>>> I have been running nutch in local mode and so far I am able to have a
>> good
>>> understanding on how it all works.
>>>
>>> I wanted to start with distributed crawling using some public cloud
>>> provider.
>>>
>>> I just wanted to know if fellow users have any experience in setting up
>>> nutch for distributed crawling.
>>>
>>> From nutch wiki I have some idea on what hardware requirements should be.
>>>
>>> I just wanted to know which of the public cloud providers (IaaS or PaaS)
>>> are good to setup hadoop clusters on. Basically ones on which it is easy
>> to
>>> setup/manage the cluster and ones which are easy on budget.
>>>
>>> Please let me know if you folks have any insights based on your
>> experiences.
>>>
>>> Thanks and Regards
>>> Sachin
>>>
>>
> 



Re: what happens to older segments

2019-10-22 Thread Sebastian Nagel

Hi Sachin,

> does mergesegs by default update the
> crawldb once it merges all the segments?

No it does not. That's already evident from the command-line help
(no CrawlDb passed as parameter):

$> bin/nutch mergesegs
SegmentMerger output_dir (-dir segments | seg1 seg2 ...) [-filter]
...
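
For illustration, merging all segments under crawl/segments into one new
segment could look like this (directory names are only examples):

  $> bin/nutch mergesegs crawl/MERGEDsegments -dir crawl/segments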

> Or do we have to call the updatedb command on the merged segment to
> update the crawldb so that it has all the information for next
> cycle.

One segment usually holds the fetch list and content from one cycle. The command updatedb should be
called every cycle (for the latest segment); the script bin/crawl does this. There is no need to call
updatedb again with the merged segment.
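
For illustration, updating the CrawlDb from the most recent segment could
look like this (the segment name is only an example):

  $> bin/nutch updatedb crawl/crawldb crawl/segments/20191022123456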

Best,
Sebastian

On 10/22/19 11:43 AM, Sachin Mittal wrote:

Ok.
Understood.

I had one question though: does mergesegs by default update the
crawldb once it merges all the segments?
Or do we have to call the updatedb command on the merged segment to update
the crawldb so that it has all the information for next cycle.

Thanks
Sachin


On Tue, Oct 22, 2019 at 1:32 PM Sebastian Nagel wrote:


Hi Sachin,

  > I want to know once a new segment is generated is there any use of
  > previous segments and can they be deleted?

As soon as a segment is indexed and the CrawlDb is updated from this
segment, you may delete it. But keeping older segments allows
- reindexing in case something went wrong with the index
- debugging: check the HTML of a page

When segments are merged, only the most recent record of each URL is kept -
this saves storage space but requires running the mergesegs tool.

  > Also when we then start the fresh crawl cycle how do we instruct
  > nutch to use this new merged segment, or it automatically picks up
  > the newest segment as starting point?

The CrawlDb contains all necessary information for the next cycle.
It's mandatory to update the CrawlDb (command "updatedb") for each
segment, which transfers the fetch status information (fetch time, HTTP
status, signature, etc.) from the segment to the CrawlDb.

Best,
Sebastian

On 10/22/19 6:59 AM, Sachin Mittal wrote:

Hi,
I have been crawling using nutch.
What I have understood is that for each crawl cycle it creates a segment
and for the next crawl cycle it uses the outlinks from previous segment to
generate and fetch next set of urls to crawl. Then it creates a new segment
with those urls.

I want to know once a new segment is generated is there any use of previous
segments and can they be deleted?

I also see a command line tool mergesegs
<https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=122916832>.
Does it make sense to use this to merge old segments into a new segment
before deleting old segments?

Also when we then start the fresh crawl cycle how do we instruct nutch to
use this new merged segment, or it automatically picks up the newest
segment as starting point?

Thanks
Sachin










Re: what happens to older segments

2019-10-22 Thread Sebastian Nagel

Hi Sachin,

> I want to know once a new segment is generated is there any use of
> previous segments and can they be deleted?

As soon as a segment is indexed and the CrawlDb is updated from this
segment, you may delete it. But keeping older segments allows
- reindexing in case something went wrong with the index
- debugging: check the HTML of a page
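
For the debugging case, a single record can be looked up in a kept segment
with the readseg tool, for example (segment name and URL are only examples):

  $> bin/nutch readseg -get crawl/segments/20191022123456 http://example.com/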

When segments are merged, only the most recent record of each URL is kept - this saves storage space
but requires running the mergesegs tool.


> Also when we then start the fresh crawl cycle how do we instruct
> nutch to use this new merged segment, or it automatically picks up
> the newest segment as starting point?

The CrawlDb contains all necessary information for the next cycle.
It's mandatory to update the CrawlDb (command "updatedb") for each
segment, which transfers the fetch status information (fetch time, HTTP status, signature, etc.) from
the segment to the CrawlDb.


Best,
Sebastian

On 10/22/19 6:59 AM, Sachin Mittal wrote:

Hi,
I have been crawling using nutch.
What I have understood is that for each crawl cycle it creates a segment
and for the next crawl cycle it uses the outlinks from previous segment to
generate and fetch next set of urls to crawl. Then it creates a new segment
with those urls.

I want to know once a new segment is generated is there any use of previous
segments and can they be deleted?

I also see a command line tool mergesegs
<https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=122916832>.
Does it make sense to use this to merge old segments into a new segment
before deleting old segments?

Also when we then start the fresh crawl cycle how do we instruct nutch to
use this new merged segment, or it automatically picks up the newest
segment as starting point?

Thanks
Sachin





Re: Unable to index on Hadoop 3.2.0 with 1.16

2019-10-22 Thread Sebastian Nagel

Hi Markus,

any updates on this? Just to make sure the issue gets resolved.

Thanks,
Sebastian

On 14.10.19 17:08, Markus Jelsma wrote:

Hello,

We're upgrading our stuff to 1.16 and got a peculiar problem when we started 
indexing:

2019-10-14 13:50:30,586 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception 
running child : java.lang.IllegalStateException: text width is less than 1, was 
<-41>
at org.apache.commons.lang3.Validate.validState(Validate.java:829)
at 
de.vandermeer.skb.interfaces.transformers.textformat.Text_To_FormattedText.transform(Text_To_FormattedText.java:215)
at 
de.vandermeer.asciitable.AT_Renderer.renderAsCollection(AT_Renderer.java:250)
at de.vandermeer.asciitable.AT_Renderer.render(AT_Renderer.java:128)
at de.vandermeer.asciitable.AsciiTable.render(AsciiTable.java:191)
at org.apache.nutch.indexer.IndexWriters.describe(IndexWriters.java:326)
at 
org.apache.nutch.indexer.IndexerOutputFormat.getRecordWriter(IndexerOutputFormat.java:45)
at 
org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.<init>(ReduceTask.java:542)
at 
org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:615)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:390)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:174)
at java.base/java.security.AccessController.doPrivileged(Native Method)
at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:168)

The only IndexWriter we use is SolrIndexer, and locally everything is just fine. 


Any thoughts?

Thanks,
Markus





Re: Crawl Command Question

2019-10-19 Thread Sebastian Nagel
Hi Dave,

> the crawl script without the -i parameter, does that mean the crawl will
> run and complete without updating SOLR?

Yes.

> Then I'll use solrindex to push the crawled content into
> SOLR later, when I'm ready.

Better to call "index": the command "solrindex" is deprecated;
in fact, it just calls IndexingJob, same as "index".

Of course, you need to pass all unindexed segments to the
index command or call "index" iteratively.
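
For illustration, a deferred indexing run over all segments could look
roughly like this (paths are examples; an index writer, e.g. for Solr, must
be configured separately, in index-writers.xml on recent Nutch versions):

  $> bin/crawl -s urls/ crawl/ 2
  $> bin/nutch index crawl/crawldb -linkdb crawl/linkdb -dir crawl/segments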

Best,
Sebastian

On 19.10.19 23:05, Dave Beckstrom wrote:
> Hi Everyone,
> 
> Reading the help for the nutch crawl script, I have a question.  If I run
> the crawl script without the -i parameter, does that mean the crawl will
> run and complete without updating SOLR?  I need to crawl pages without
> updating SOLR.  Then I'll use solrindex to push the crawled content into
> SOLR later, when I'm ready.
> 
> 
> 
> Usage: crawl [-i|--index] [-D "key=value"] [-s <Seed Dir>] <Crawl Dir> <Num Rounds>
> -i|--index      Indexes crawl results into a configured indexer
> -D...           A Java property to pass to Nutch calls
> -s <Seed Dir>   Directory in which to look for a seeds file
> <Crawl Dir>     Directory where the crawl/link/segments dirs are saved
> <Num Rounds>    The number of rounds to run this crawl for
>  Example: bin/crawl -i -s urls/ TestCrawl/  2
> 


