Nutch 1.19/Hadoop compatible

2023-03-07 Thread Mike
Hello! Is nutch 1.19 compatible with Hadoop 3.3.4? Thanks! mike

Re: Configuration Nutch in cluster mode

2023-01-17 Thread Mike
Hello Sebastian! I have now installed Hadoop; unfortunately there are problems. I will make a post. Thanks, Mike On Tue, Jan 17, 2023 at 09:49, Sebastian Nagel wrote: > Hi Mike, > > the Nutch configuration files are included in the job file found in > runtime/deploy

Configuration Nutch in cluster mode

2023-01-14 Thread Mike
I will now try to configure the bot URL etc. before building, but how and where do I configure settings between crawls, e.g. the number of pages per host? Where do I configure Nutch in cluster mode? Thx, mike

Nutch/Hadoop Cluster

2023-01-14 Thread Mike
Hi! I am now crawling the internet in local mode, in parallel, with up to 10 instances on 3 computers. Would it pay off for me to put a Hadoop cluster on top of the 3 servers? 1.) a server would not be integrated directly into the crawl process as a master. 2.) can I run multiple crawl jobs on one

Incomplete TLD List

2022-11-08 Thread Mike
", "digest":"3b9a23d42f200392d12a697bbb8d4d87", Thanks Mike

Re: How should the headings plugin be configured?

2022-10-31 Thread Mike
8 > h1 :Apache Nutch™ > id :https://nutch.apache.org/ > > Can you check your configuration? Is a plugin name misspelled? Is the > headings plugin active during fetch/parse? Is the index-metadata plugin > active? > > Regards, > Markus > > > On Mon, Oct 31, 2022 at 14:

Re: How should the headings plugin be configured?

2022-10-31 Thread Mike
Hello Markus! Thank you for taking care of my problem! I removed the metatag.h# from index.parse.md, but the Nutch indexchecker still does not show me the fields. On Mon, Oct 31, 2022 at 12:56, Markus Jelsma <markus.jel...@openindex.io> wrote: > Hello Mike, > > Please rem

Re: How should the headings plugin be configured?

2022-10-31 Thread Mike
e.g. for 'description' or 'keywords' provided that these values are generated by a parser (see parse-metatags plugin) The Nutch parsechecker shows me the fields but the indexchecker doesn't. On Mon, Oct 31, 2022 at 04:51, Mike wrote: > Hello! > > I've tried everythin

How should the headings plugin be configured?

2022-10-30 Thread Mike
Hello! I've tried everything and set everything up and get the nutch headings plugin working: nutch-site.xml protocol-okhttp protocol-okhttp|...|parse-(html|tika|text|metatags)|index-(basic|anchor|more|metadata)|...|headings|nutch-extensionpoints schema.xml index-writers.xml

Nutch/Hadoop: Error (FreeGenerator job did not succeed)

2022-10-14 Thread Mike
op that I can't find? Thanks Mike

Re: Nutch 1.19 schema.xml

2022-09-04 Thread Mike
Hello Sebastian! Thanks for your answer! Is it possible to simply update the schema.xml file without re-indexing? Thanks, Mike On Fri, Sep 2, 2022 at 13:25, Sebastian Nagel wrote: > Hi Mike, > > the Nutch/Solr schema.xml will be updated with the release of 1.19 > (exp

Nutch 1.19 schema.xml

2022-08-31 Thread Mike
Hello! Will the schema.xml stay the same in Nutch 1.19? thanks! mike

Unable to create core Caused by: solr.LatLonType

2022-07-12 Thread Mike
] Caused by: solr.LatLonType Thanks Mike

can nutch output xml?

2012-10-24 Thread Mike Whitman
to individual files. Ideally nutch would output these files so I wouldn't need to have solr, Luke, and some tool I need to write in the content processing chain. KISS right? Any thoughts on how to do this in the simplest way? thanks, Mike

Re: crawling forum pages

2012-10-09 Thread Mike Baranczak
How high did you set the depth? And why do you think it can't go any higher? On Oct 9, 2012, at 5:15 AM, Jiang Fung Wong wrote: Hi All, I am setting up nutch to crawl forum pages and index the posts in the content pages (threads). I face a problem: nutch could not discover all content

DataFileAvroStore vs. AvroStore

2012-10-08 Thread Mike Baranczak
What's the difference between those two data stores? I've read the javadocs, and I'm still confused. -MB

SOLR Indexing issue, possibly due to NUTCH-1084?

2012-08-07 Thread Mike Pountney
. Mike

Re: Custom HtmlParseFilter configurations

2011-02-02 Thread Mike Baranczak
: Hi Mike et al., Yes, adding plugin.xml made it work. However, the outstanding question even now is this: even though my plugin.includes lists a lot of plugin names, why is it that I just see JSParser and my own custom parser in the HTMLParseFilters? The following is my plugin.includes

CrawlDatum.getFetchTime()

2011-02-01 Thread Mike Baranczak
From the javadocs for CrawlDatum.getFetchTime() (Nutch 1.1): Returns either the time of the last fetch, or the next fetch time, depending on whether Fetcher or CrawlDbReducer set the time. So is there any way to determine which of these two conditions is true, using just the information in

Re: Custom HtmlParseFilter configurations

2011-02-01 Thread Mike Baranczak
Yes, you do have to make a config file for your plugin to be seen by Nutch. If you built Nutch from source, you should have the directory build/plugins. That's where the compiled plugins are. The names of the directories under there are the names that get included in 'plugin.includes'. Take a

Reply: Re: Few questions from a newbie

2011-01-26 Thread Mike Zuehlke
. Regards Mike From: Arjun Kumar Reddy charjunkumar.re...@iiitb.net To: user@nutch.apache.org Date: 26.01.2011 15:43 Subject: Re: Few questions from a newbie I am developing an application based on twitter feeds...so 90% of the URLs will be short URLs. So, it is difficult for me

Issues with certain URLs not being fetched.

2010-10-12 Thread Mike Pountney
Signature: null Metadata: Thanks, Mike

Re: Issues with certain URLs not being fetched.

2010-10-12 Thread Mike Pountney
[mailto:sonalgoy...@gmail.com] Sent: 12 October 2010 11:17 To: user@nutch.apache.org Subject: Re: Issues with certain URLs not being fetched. Mike, the fetch will be based on the score of the url. Higher scoring urls are selected first. Thanks and Regards, Sonal Sonal Goyal | Founder and CEO

Re: Junk Links

2010-09-23 Thread Mike Baranczak
Take a look at the URLNormalizer plugins. On Sep 23, 2010, at 4:03 AM, Yavuz Selim YILMAZ wrote: Another question; I have this kind of URLs; .aaa/ .aaa .bbb/ .ccc .ddd/ .ddd There are duplicates like that. What I'm trying to explain is, some of them is
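
To make the duplicate problem in the quoted question concrete, here is a minimal Python sketch of the kind of rewrite a URL normalizer performs. It is not Nutch code or Nutch configuration: the function names `strip_trailing_slash` and `dedupe` and the example.com URLs are hypothetical, and the sketch assumes that a path with and without a trailing slash refers to the same page, which is not true on every site.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_trailing_slash(url: str) -> str:
    """Return the URL with trailing slashes removed from its path.

    The root path "/" is kept as-is so that "http://host/" stays valid.
    """
    parts = urlsplit(url)
    path = parts.path
    if len(path) > 1 and path.endswith("/"):
        # rstrip can empty a path such as "//", so fall back to the root.
        path = path.rstrip("/") or "/"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


def dedupe(urls):
    """Normalize each URL and keep the first occurrence of each result."""
    seen = set()
    unique = []
    for url in urls:
        key = strip_trailing_slash(url)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


if __name__ == "__main__":
    # Hypothetical URLs shaped like the ".aaa/ .aaa" pairs in the question.
    for u in dedupe([
        "http://example.com/aaa/",
        "http://example.com/aaa",
        "http://example.com/bbb/",
    ]):
        print(u)
```

Whether collapsing trailing slashes is safe depends on the sites being crawled, so a rule like this is best checked against a sample of real URLs first.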

Re: java.net.UnknownHostException and Timeout during Fetching?

2010-09-19 Thread Mike Baranczak
Reducing the number of threads might help, but 10 threads total doesn't seem like that much to begin with. I think a better solution would be to run your own private DNS server (preferably on the same machine as Nutch, or at least on the same local network). -MB On Sep 19, 2010, at 10:08

Re: how to skip invalid outlinks

2010-09-12 Thread Mike Baranczak
I had the same problem, and a lot of the bad links did seem to come from faulty JavaScript parsing. Jeff's suggestion is probably the best you can do for now. The long-term solution would be to fix the JavaScript parser plugin. -MB On Sep 11, 2010, at 3:09 PM, Jeff Zhou wrote: there is no

Which parsers to use with Nutch 1.1?

2010-09-08 Thread Mike Baranczak
The impression that I got from reading the mailing lists is that the developers are slowly moving to deprecate all the parser plugins in favor of Tika - but that this process is not quite finished in the 1.1 release, and that the Tika plugin is still a little wonky. Is this correct? -MB --

Dynamically changing the URL retry interval

2010-09-03 Thread Mike Pountney
I'd like to refetch pages that I know change frequently more often. Does anyone know of a way to set a lower retry interval on a set of pages matched by a regex? Thanks in advance, Mike
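
The request above can be pictured as an ordered table of regular expressions mapped to intervals, where the first matching pattern wins. The following Python sketch only illustrates that idea; it is not a Nutch API or setting, and the patterns, the interval values, the default, and the name `interval_for` are all hypothetical.

```python
import re

# Hypothetical default: refetch every 30 days unless a rule says otherwise.
DEFAULT_INTERVAL_SECONDS = 30 * 24 * 3600

# Hypothetical rules, checked in order; the first match decides the interval.
RULES = [
    (re.compile(r"^https?://news\.example\.com/"), 3600),  # hourly
    (re.compile(r"/feed(/|$)"), 6 * 3600),                 # every six hours
]


def interval_for(url, rules=RULES, default=DEFAULT_INTERVAL_SECONDS):
    """Return the interval in seconds for the first rule matching the URL."""
    for pattern, seconds in rules:
        if pattern.search(url):
            return seconds
    return default


if __name__ == "__main__":
    for u in ("http://news.example.com/today",
              "http://example.org/blog/feed",
              "http://example.org/about"):
        print(u, interval_for(u))
```

Keeping the rules in an ordered list makes precedence explicit: a more specific pattern placed earlier overrides a broader one that follows it.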