Hello!
Is Nutch 1.19 compatible with Hadoop 3.3.4?
Thanks!
mike
Hello Sebastian!
I have now installed Hadoop, but unfortunately there are problems.
I will make a post.
Thanks
Mike
On Tue, 17 Jan 2023 at 09:49, Sebastian Nagel wrote:
> Hi Mike,
>
> the Nutch configuration files are included in the job file found in
> runtime/deploy
I will now try to configure the bot URL etc. before building,
but how and where do I configure settings between the crawls, e.g. the
number of pages per host?
And where do I configure Nutch for cluster mode?
thx, mike
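For the per-host limit asked about above, a minimal nutch-site.xml sketch
(the property names come from the stock Nutch configuration; the values here
are only illustrative):

  <property>
    <name>generate.max.count</name>
    <value>100</value>
    <description>Maximum number of URLs per host (or domain) put into one
    generated fetch list; -1 means no limit.</description>
  </property>
  <property>
    <name>generate.count.mode</name>
    <value>host</value>
    <description>Whether generate.max.count is counted per host or per
    domain.</description>
  </property>

Since the configuration files are packed into the job file, these properties
would go into conf/nutch-site.xml before building, so that they end up in the
job file under runtime/deploy.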
Hi!
I am now crawling the internet in local mode, in parallel, with up to 10
instances on 3 computers. Would it pay off for me to put a Hadoop cluster
on top of the 3 servers?
1.) One server would not be integrated directly into the crawl process,
acting as a master.
2.) Can I run multiple crawl jobs on one
"digest":"3b9a23d42f200392d12a697bbb8d4d87",
Thanks
Mike
> h1 :Apache Nutch™
> id :https://nutch.apache.org/
>
> Can you check your configuration? Is a plugin name misspelled? Is the
> headings plugin active during fetch/parse? Is the index-metadata plugin
> active?
>
> Regards,
> Markus
>
>
> On Mon 31 Oct 2022 at 14:
Hello Markus!
Thank you for taking care of my problem!
I removed the metatag.h# entries from index.parse.md, but the nutch
indexchecker still does not show me the fields.
On Mon, 31 Oct 2022 at 12:56, Markus Jelsma <markus.jel...@openindex.io> wrote:
> Hello Mike,
>
> Please rem
e.g. for 'description' or 'keywords' provided that these
values are generated
by a parser (see parse-metatags plugin)
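For the 'description'/'keywords' case above, a minimal sketch of the two
properties involved (assuming the stock parse-metatags and index-metadata
plugins; parse-metatags stores its values under a metatag. prefix):

  <property>
    <name>metatags.names</name>
    <value>description,keywords</value>
    <!-- which HTML meta tags the parse-metatags plugin extracts -->
  </property>
  <property>
    <name>index.parse.md</name>
    <value>metatag.description,metatag.keywords</value>
    <!-- parse metadata keys the index-metadata plugin turns into index fields -->
  </property>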
The Nutch parsechecker shows me the fields but the indexchecker doesn't.
On Mon, 31 Oct 2022 at 04:51, Mike wrote:
> Hello!
>
> I've tried everythin
Hello!
I've tried everything and set everything up to get the Nutch headings
plugin working:
nutch-site.xml
protocol-okhttp
protocol-okhttp|...|parse-(html|tika|text|metatags)|index-(basic|anchor|more|metadata)|...|headings|nutch-extensionpoints
schema.xml
index-writers.xml
op that I can't find?
Thanks
Mike
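For comparison, a sketch of the two properties the headings setup above
appears to rely on (assuming the headings parse filter writes its results
into the parse metadata under the tag names listed in the 'headings'
property):

  <property>
    <name>headings</name>
    <value>h1,h2</value>
    <!-- heading tags the headings parse filter extracts -->
  </property>
  <property>
    <name>index.parse.md</name>
    <value>h1,h2</value>
    <!-- expose those parse metadata keys via the index-metadata plugin -->
  </property>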
Hello Sebastian!
Thanks for your answer!
Is it possible to simply update the schema.xml file without re-indexing?
Thanks
Mike
On Fri, 2 Sep 2022 at 13:25, Sebastian Nagel wrote:
> Hi Mike,
>
> the Nutch/Solr schema.xml will be updated with the release of 1.19
> (exp
Hello!
Will the schema.xml stay the same in Nutch 1.19?
thanks!
mike
]
Caused by: solr.LatLonType
Thanks
Mike
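If the "Caused by: solr.LatLonType" above means the Solr server no longer
ships that field type, one possible (unverified) fix is to point the
corresponding fieldType in schema.xml at the current spatial class; the type
name "location" below is only a placeholder:

  <fieldType name="location" class="solr.LatLonPointSpatialField" docValues="true"/>

Whether this applies depends on the Solr version and on which schema.xml
field actually references solr.LatLonType.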
to individual
files.
Ideally Nutch would output these files so I wouldn't need to have Solr,
Luke, and some tool I need to write in the content processing chain.
KISS, right?
Any thoughts on how to do this in the simplest way?
thanks,
Mike
How high did you set the depth? And why do you think it can't go any higher?
On Oct 9, 2012, at 5:15 AM, Jiang Fung Wong wrote:
Hi All,
I am setting up Nutch to crawl forum pages and index the posts in the
content pages (threads). I face a problem: Nutch could not discover
all content
What's the difference between those two data stores? I've read the javadocs,
and I'm still confused.
-MB
Mike
Hi Mike et al.,
Yes, adding the plugin.xml made it work.
However, the outstanding question even now is: even though my
plugin.includes lists a lot of plugin names, why do I see only JSParser
and my own custom parser among the HTMLParseFilters?
The following is my plugin.includes:
From the javadocs for CrawlDatum.getFetchTime() (Nutch 1.1):
Returns either the time of the last fetch, or the next fetch time, depending
on whether Fetcher or CrawlDbReducer set the time.
So is there any way to determine which of these two conditions is true, using
just the information in
Yes, you do have to make a config file for your plugin to be seen by Nutch.
If you built Nutch from source, you should have the directory build/plugins.
That's where the compiled plugins are. The names of the directories under there
are the names that get included in 'plugin.includes'. Take a
Regards
Mike
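For the plugin.xml file mentioned above, a minimal descriptor sketch for a
custom HtmlParseFilter (the ids, names, jar and class are placeholders):

  <?xml version="1.0" encoding="UTF-8"?>
  <plugin id="parse-myfilter" name="My Parse Filter" version="1.0.0"
          provider-name="example.com">
    <runtime>
      <!-- jar built for the plugin, placed in its plugin directory -->
      <library name="parse-myfilter.jar">
        <export name="*"/>
      </library>
    </runtime>
    <requires>
      <import plugin="nutch-extensionpoints"/>
    </requires>
    <!-- register the class at the HtmlParseFilter extension point -->
    <extension id="com.example.parse.myfilter" name="My Parse Filter"
               point="org.apache.nutch.parse.HtmlParseFilter">
      <implementation id="MyParseFilter" class="com.example.parse.MyParseFilter"/>
    </extension>
  </plugin>

The plugin's directory name under build/plugins ("parse-myfilter" here) is
what goes into plugin.includes.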
From: Arjun Kumar Reddy charjunkumar.re...@iiitb.net
To: user@nutch.apache.org
Date: 26.01.2011 15:43
Subject: Re: Few questions from a newbie
I am developing an application based on Twitter feeds... so 90% of the
URLs will be short URLs.
So, it is difficult for me
Signature: null
Metadata:
Thanks,
Mike
[mailto:sonalgoy...@gmail.com]
Sent: 12 October 2010 11:17
To: user@nutch.apache.org
Subject: Re: Issues with certain URLs not being fetched.
Mike, the fetch will be based on the score of the URL. Higher-scoring URLs
are selected first.
Thanks and Regards,
Sonal
Sonal Goyal | Founder and CEO
Take a look at the URLNormalizer plugins.
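If it is the trailing-slash duplicates shown in the quoted question below,
the urlnormalizer-regex plugin can fold them together; a sketch of a rule
for conf/regex-normalize.xml (the pattern is illustrative and may need
tightening for real URLs):

  <regex>
    <!-- drop a trailing slash so both variants normalize to the same URL -->
    <pattern>/$</pattern>
    <substitution></substitution>
  </regex>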
On Sep 23, 2010, at 4:03 AM, Yavuz Selim YILMAZ wrote:
Another question:
I have this kind of URLs:
.aaa/
.aaa
.bbb/
.ccc
.ddd/
.ddd
There are duplicates like that.
What I'm trying to explain is, some of them is
Reducing the number of threads might help, but 10 threads total doesn't seem
like that much to begin with. I think a better solution would be to run your
own private DNS server (preferably on the same machine as Nutch, or at least on
the same local network).
-MB
On Sep 19, 2010, at 10:08
I had the same problem, and a lot of the bad links did seem to come from faulty
JavaScript parsing. Jeff's suggestion is probably the best you can do for now.
The long-term solution would be to fix the JavaScript parser plugin.
-MB
On Sep 11, 2010, at 3:09 PM, Jeff Zhou wrote:
there is no
The impression that I got from reading the mailing lists is that the
developers are slowly moving to deprecate all the parser plugins in favor of
Tika - but that this process is not quite finished in the 1.1 release, and
that the Tika plugin is still a little wonky. Is this correct?
-MB
I'd like to refetch more often the pages that I know change frequently.
Does anyone know of a way to set a lower retry interval on a set of pages
matched by a regex?
Thanks in advance,
Mike
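There doesn't seem to be a stock per-regex interval setting, but one hedged
alternative is the adaptive fetch schedule, which shortens the interval for
pages that are observed to change; a nutch-site.xml sketch (property names
from the stock configuration, values illustrative):

  <property>
    <name>db.fetch.schedule.class</name>
    <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
  </property>
  <property>
    <name>db.fetch.schedule.adaptive.min_interval</name>
    <value>3600</value>
    <!-- lower bound (seconds) the interval can shrink to for fast-changing pages -->
  </property>
  <property>
    <name>db.fetch.interval.default</name>
    <value>2592000</value>
    <!-- starting interval (seconds), here 30 days -->
  </property>

This adapts per page rather than matching a regex, so it is only a partial
answer to the question above.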