Hello Steve,
Having those spaces normalized/encoded is expected behaviour with
urlnormalizer-basic active. I would recommend keeping it this way and having
all URLs in Solr properly encoded. Having spaces in Solr IDs is also not
recommended, as it can lead to unexpected behaviour.
If you really
Hello,
You can skip certain types of documents based on their file extension
using the urlfilter-suffix plugin. It only filters known suffixes. Filtering
based on content type is not possible, because determining the content type
requires fetching and parsing the documents.
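As a hedged sketch (the suffixes are examples, and the mode flags in the file's header comments should be checked), entries in conf/suffix-urlfilter.txt look like:

```text
# skip documents with these file suffixes
.zip
.gz
.pdf
.jpg
```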
You can skip specific content types when
> On Mon, 30 Jan 2023 18:35:06 +0530 Markus Jelsma wrote:
>
> Yes, remove the other protocol-* plugins from the configuration. With all
> three active it is not deterministic which one is going to do the
> work.
>
> On Mon 30 Jan 2023 at 12:50
Hello,
Please check the logs for more information.
Regards,
Markus
On Mon 24 Jul 2023 at 19:05, Raj Chidara wrote:
> Hi
>
> Nutch 1.19 compiled with ant without any errors and when running
> Injector, getting an error that
>
>
>
> 19:20:25.055 [main] ERROR org.apache.nutch.crawl.Injector -
Hello Mike,
> Is nutch 1.19 compatible with Hadoop 3.3.4?
Yes!
Regards,
Markus
On Tue 7 Mar 2023 at 17:37, Mike wrote:
> Hello!
>
> Is nutch 1.19 compatible with Hadoop 3.3.4?
>
>
> Thanks!
>
> mike
>
Hello Joe,
> Now I'd like to capture and index the count of forward slash characters
'/'
It seems you are trying to do that with the subcollection plugin; I don't
think that is going to work with it.
Instead, I would suggest writing a simple index plugin that does the
counting, and adds the
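The counting itself is trivial; a minimal, self-contained sketch of the logic such an index plugin would wrap (the SlashCounter class name is hypothetical, not part of Nutch):

```java
public class SlashCounter {
    // Count '/' characters in a URL string; an index plugin could add
    // this value as an integer field on the document it is filtering.
    public static long countSlashes(String url) {
        return url.chars().filter(c -> c == '/').count();
    }

    public static void main(String[] args) {
        System.out.println(countSlashes("https://example.com/a/b/")); // prints 5
    }
}
```

Inside an actual indexing filter you would compute this value for the page's URL and add it as a field in the plugin's filter() method.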
Are there any additional steps to be
> followed for installation of selenium. Please suggest
>
>
> Thanks and Regards
>
> Raj Chidara
>
> - Original Message -
> From: Markus Jelsma (markus.jel...@openindex.io)
> Date: 30-01-2023 16:26
> To: user@nutch.apac
Hello Raj,
I think the same question about the same site was asked here some time ago.
Anyway, this site loads its content via Javascript. You will need a
protocol plugin that supports it, either protocol-htmlunit, or
protocol-selenium, instead of protocol-http or any other.
Change the
Hello Mike,
> would it pay off for me to put a hadoop cluster on top of the 3 servers.
Yes, for all the reasons Hadoop exists. It can be tedious to set up
for the first time, and there are many components. But at least you have
three servers, which is more or less required by Zookeeper, that
Hello Raj,
This site loads its content via Javascript, so you need a protocol plugin
that supports it. HtmlUnit does not seem to work with this site, but
Selenium does. Please change your protocol plugin accordingly in your
plugin.includes configuration directive.
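A hedged sketch of the nutch-site.xml change; the rest of the plugin list below mirrors a common default configuration and may differ from yours, the point is swapping the protocol plugin for protocol-selenium:

```xml
<property>
  <name>plugin.includes</name>
  <value>protocol-selenium|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```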
I tested it with our own parser
Hello Paul,
> I tried to comment on this jira issue, but I don't have access,
unfortunately I don't know how to do it.
Due to too much spam, it is no longer possible to create an account for
yourself, but we can do that for you if you wish.
Regards,
Markus
On Thu 24 Nov 2022 at 22:46,
Hello,
The German site is crawlable, but it does produce awful URLs with some
;jsessionid=<> attached to them. The Chinese site is all Javascript; it
requires the HtmlUnit or Selenium protocol plugin for it to work at all. No
guarantee that it will.
Regards,
Markus
On Wed 23 Nov 2022 at 11:07,
Hello Mike,
You can try adding the TLD to conf/domain-suffixes.xml and see if it works.
Regards,
Markus
On Tue 8 Nov 2022 at 11:16, Mike wrote:
> Hi!
> Some of the new TLDs are wrongly indexed by Nutch, is it possible to extend
> the TLD list?
>
>
>
> Thank you for taking care of my problem!
>
> I removed the metatag.h# from index.parse.md but nutch indexchecker still
> does not show me the fields.
>
> On Mon 31 Oct 2022 at 12:56, Markus Jelsma <markus.jel...@openindex.io> wrote:
>
> > Hello Mike,
Hello Mike,
Please remove the metatag.* prefix in the index.parse.md config and I think
you should be fine.
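A hedged sketch of the resulting nutch-site.xml entry, assuming the h1/h2 heading fields discussed in this thread:

```xml
<property>
  <name>index.parse.md</name>
  <value>h1,h2</value>
</property>
```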
Regards,
Markus
On Mon 31 Oct 2022 at 12:32, Mike wrote:
> Yes, sorry, I also forgot to post this setting:
>
>
>index.parse.md
>
>
>
>
Hello Mike,
I think it should be working just fine with it enabled in
plugin.includes. You can check Nutch's parser output by using:
$ bin/nutch parsechecker
You should see one or more h# output fields present. You can then use the
index-metadata plugin to map the parser output fields to the
Hello,
You cannot just run Nutch's JAR like that on Hadoop, you need the large
.job file instead. If you build Nutch from source, you will get a
runtime/deploy directory. Upload its contents to a Hadoop client and run
Nutch commands using bin/nutch ... You will then automatically use the
large
> >>> I have been able to compile under OpenJDK 11
> >>> Have not done anything further so far
> >>> I'm gonna try to get to it this evening
> >>>
> >>> Greetz
> >>> Ralf
> >>>
> >>> On Wed, Aug 24, 2022 a
Hi,
The crawler works fine when trying the binary
distribution. The source won't work because this computer still cannot
compile it; clearing the local Ivy cache did not do much. This is the known
compiler error with the elastic-indexer plugin:
compile:
[echo] Compiling
Sounds good!
I see we're still at Tika 2.3.0; I'll submit a patch to upgrade to the
current 2.4.1.
Thanks!
Markus
On Tue 9 Aug 2022 at 09:11, Sebastian Nagel wrote:
> Hi all,
>
> more than 60 issues are done for Nutch 1.19
>
> https://issues.apache.org/jira/projects/NUTCH/versions/12349580
To add to Sebastian, it runs very well on Hadoop 3.3.x too. Actually, I
have never had a Hadoop version that could not run Nutch out of the box and
without issues.
On Mon 13 Jun 2022 at 11:54, Sebastian Nagel wrote:
> Hi Michael,
>
> Nutch (1.18, and trunk/master) should work together with
Hello,
With a 1.18 checkout I am trying the okhttp plugin. I couldn't get it to
work on 1.15 due to a NoClassDefFoundError, and now with 1.18, it
still doesn't work and throws another NoClassDefFoundError.
java.lang.NoClassDefFoundError: okhttp3/Authenticator
at
Hello Sebastian,
We have always used vanilla Apache Hadoop on our own physical servers that
are running on the latest Debian, which also runs on ARM. It will run HDFS
and YARN and any other custom job you can think of. It has snappy
compression, which is a massive improvement for large data
ll be a bottleneck in this case. I am
> looking for options to distribute the same domain URLs across various
> mappers. Not sure if that's even possible with Nutch or not.
>
> Regards
> Prateek
>
> On Tue, May 11, 2021 at 11:58 AM Markus Jelsma wrote:
>
> >
Hello Prateek,
If you want to fetch from the same host/domain faster, increase the number
of threads, and the number of threads per queue. Then decrease the fetch
delays.
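A hedged sketch of the relevant nutch-site.xml overrides; the values are examples to tune, not recommendations:

```xml
<property>
  <name>fetcher.threads.fetch</name>
  <value>50</value>
</property>
<property>
  <name>fetcher.threads.per.queue</name>
  <value>5</value>
</property>
<property>
  <name>fetcher.server.delay</name>
  <value>1.0</value>
</property>
<property>
  <name>fetcher.server.min.delay</name>
  <value>0.0</value>
</property>
```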
Regards,
Markus
On Tue 11 May 2021 at 12:48, prateek wrote:
> Hi Lewis,
>
> As mentioned earlier,
Hello Abhay,
You only need to keep or merge old segments if you 'quickly' need to
reindex the data, and are unable to start with a fresh crawl. If you
frequently recrawl all URLs, e.g. every month, then segments older than a
month can safely be removed.
You can also do daily and monthly merges, like
Hello Abhay,
You can configure a protocol plugin per host using the
host-protocol-mapping.txt configuration file. Its usage is:
or protocol:
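The usage example above was lost in the archive; as a hedged sketch (hostnames hypothetical), each line of host-protocol-mapping.txt maps a host to a protocol plugin id:

```text
www.example.com	protocol-htmlunit
static.example.com	protocol-http
```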
Regards,
Markus
On Fri 2 Apr 2021 at 15:18, Abhay Ratnaparkhi <abhay.ratnapar...@gmail.com> wrote:
> Hello,
>
> I would like to know how to
at 08:49, Hany NASR wrote:
> Hello Markus,
>
> I added the property in nutch-site.xml with no luck.
>
> The documents still exist in Solr; any advice?
>
> Regards,
> Hany
>
> From: Markus Jelsma
> Sent: Monday, March 8, 2021 3:40 PM
> To: user@nutch.apache.o
Hello Hany,
You need to tell the indexer to delete those records. This will help:
<property>
  <name>indexer.delete</name>
  <value>true</value>
</property>
Regards,
Markus
On Mon 8 Mar 2021 at 15:31, Hany NASR wrote:
> Hi All,
>
> I'm using Nutch 1.15, and found out that permanent redirect pages (301)
> are still indexed and not
Thanks Sebastian!
-Original message-
> From:Sebastian Nagel
> Sent: Thursday 2nd July 2020 16:42
> To: user@nutch.apache.org
> Cc: d...@nutch.apache.org; annou...@apache.org
> Subject: [ANNOUNCE] Apache Nutch 1.17 Release
>
> The Apache Nutch team is pleased to announce the release
Hello Marcel,
You can use your /etc/hosts file for that purpose, assuming you are on Linux.
Regards,
Markus
-Original message-
> From:Marcel Haazen
> Sent: Tuesday 14th April 2020 12:12
> To: user@nutch.apache.org
> Subject: Resolve by IP
>
> Hi,
> I'm trying to crawl a specific
Hello Joseph,
> Is there more documentation on having Nutch get what Tika sees into what Solr
> will see?
No, but I believe you want to check out the parsechecker and indexchecker
tools. These tools display what Tika sees and what will be sent to Solr.
Regards,
Markus
-Original
nd I can increase my throughput.
>
> Please let me know.
>
> Thanks
> Sachin
>
>
>
>
>
>
> On Thu, Oct 31, 2019 at 2:40 AM Markus Jelsma
> wrote:
>
> > Hello Sachin,
> >
> > Nutch can run on Amazon AWS without trouble, and probably on any Hadoop
> >
score quantile 0.99:2.542474209925616E-4
> > min score: 3.0443254217971116E-5
> > avg score: 7.001118352666182E-4
> > max score: 1.3120110034942627
> > status 2 (db_fetched): 39150
> > status 3 (db_gone): 13
> > status 4 (db_redir_temp):
Hello Sachin,
Nutch can run on Amazon AWS without trouble, and probably on any Hadoop based
provider. This is the most expensive option you have.
Cheaper would be to rent some servers and install Hadoop yourself; getting it
up and running by hand on some servers will take the better part of a
Hello Dave,
First you should check the CrawlDB using readdb -stats. My bet is that your set
contains some redirects, gone (404), or transient errors. The numbers for
fetched and notModified added together should be about the same as the number
of documents indexed.
Regards,
Markus
Hello Sachin,
Once a URL gets filtered, by any plugin, it is rejected entirely.
If you want specific queries to pass the regex-urlfilter, you must let them
pass explicitly above this -[?*!@=] line, e.g. +passThisQuery=
Use bin/nutch filterchecker -stdIn for quick testing.
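As a sketch, the ordering in conf/regex-urlfilter.txt would be (the parameter name is taken from the example above):

```text
# allow this specific query parameter through
+passThisQuery=
# reject any URL containing these characters
-[?*!@=]
```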
Regards,
Markus
Hello,
We're upgrading our stuff to 1.16 and got a peculiar problem when we started
indexing:
2019-10-14 13:50:30,586 WARN [main] org.apache.hadoop.mapred.YarnChild:
Exception running child : java.lang.IllegalStateException: text width is less
than 1, was <-41>
at
Thanks Sebastian!
-Original message-
> From:Sebastian Nagel
> Sent: Friday 11th October 2019 17:03
> To: user@nutch.apache.org
> Cc: d...@nutch.apache.org; annou...@apache.org
> Subject: [ANNOUNCE] Apache Nutch 1.16 Release
>
> Hi folks!
>
> The Apache Nutch [0] Project Management
Hello Dave,
If you have just one specific page you do not want Nutch to index, or Solr to
show, you can either create a custom IndexingFilter that returns null
(rejecting it) for the specified URL, or add an additional filterQuery to Solr,
fq=-id:, filtering the specific URL from the results.
Hello Sebastian,
All tests pass nicely and I can easily run a crawl.
+1
Thanks,
Markus
By the way, what does this mean:
2019-10-03 12:48:49,696 INFO crawl.Generator - Generator: number of items
rejected during selection:
2019-10-03 12:48:49,698 INFO crawl.Generator - Generator: 1
Hello,
It doesn't say much except failure, no reason given. You might want to set
logging to TRACE; the authenticator logs at that level. You could also check
whether there are server-side messages.
Regards,
Markus
-Original message-
> From:Larry.Santello
> Sent: Thursday 25th April 2019
Hello Hany,
For Boilerpipe you can only select which extractor it should use. By default it
uses ArticleExtractor, which is the best choice in most cases. However, if
content is more spread out into separate blocks, CanolaExtractor could be a
better choice.
Regards,
Markus
-Original
Hello Alexis, see inline.
Regards,
Markus
-Original message-
> From:IZaBEE_Keeper
> Sent: Wednesday 20th March 2019 1:28
> To: user@nutch.apache.org
> Subject: RE: Limiting Results From Single Domain
>
> Markus Jelsma-2 wrote
> > Hello Alexis,
> >
>
Hello Suraj,
You can safely increase the number of reducers for UpdateHostDB to as high as
you like.
Regards,
Markus
-Original message-
> From:Suraj Singh
> Sent: Monday 18th March 2019 11:41
> To: user@nutch.apache.org
> Subject: Increasing the number of reducer in UpdateHostDB
>
>
Hello - 1500 MB is a lot indeed, but 3500 PDF pages is even more. You have no
choice: either skip large files, or increase memory.
Regards,
Markus
-Original message-
> From:hany.n...@hsbc.com.INVALID
> Sent: Thursday 14th March 2019 10:44
> To: user@nutch.apache.org
> Subject:
Hello Suraj,
That should be no problem. Duplicates are grouped by their signature, this
means you can have as many reducers as you would like.
Regards,
Markus
-Original message-
> From:Suraj Singh
> Sent: Wednesday 20th February 2019 12:56
> To: user@nutch.apache.org
> Subject:
Hello Tom,
To get a parse metadata field indexed, you need the index-metadata plugin. Use
the index.parse.md parameter to define the fields you want to have indexed. Use
indexchecker to test.
Regards,
Markus
-Original message-
> From:Tom Potter
> Sent: Wednesday 13th February
Hello Suraj,
You can safely run the LinkDB merger with as many reducers as you like.
Regards,
Markus
-Original message-
> From:Suraj Singh
> Sent: Tuesday 18th December 2018 15:39
> To: user@nutch.apache.org
> Subject: Multiple Reducers for Linkdb
>
> Hello,
>
> Can we run
Nutch crawl interruption
>
> I think in the case that you interrupt the fetcher, you'll have the problem
> that URLs that where scheduled to be fetched on the interrupted cycle will
> never be fetched (because of NUTCH-1842).
>
> Yossi.
>
> > -Original Message
Hello Hany,
That depends. If you interrupt the fetcher, the segment being fetched can be
thrown away. But if you interrupt updatedb, you can remove the temp directory
and must get rid of the lock file. The latter is also true if you interrupt the
generator.
Regards,
Markus
-Original
Hello Nicholas,
Your IP might be blocked, or the firewall just drops the connection due to your
User-Agent name. We have no problems fetching this host.
Regards,
Markus
-Original message-
> From:Nicholas Roberts
> Sent: Wednesday 14th November 2018 7:58
> To: user@nutch.apache.org
Hello Hany,
Using parse-tika as your HTML parser, you can enable Boilerpipe (see
nutch-default).
Regards,
Markus
-Original message-
> From:hany.n...@hsbc.com
> Sent: Wednesday 14th November 2018 15:53
> To: user@nutch.apache.org
> Subject: Block certain parts of HTML code from
Hello Ashish,
You might want to check out Apache ManifoldCF.
Regards.
Markus
http://manifoldcf.apache.org/
-Original message-
> From:Ashish Saini
> Sent: Monday 29th October 2018 18:56
> To: user@nutch.apache.org
> Subject: Getting Nutch To Crawl Sharepoint Online
>
> We are
Hello Hany,
There are a few, mine included, mentioned on the Nutch support wiki page [1].
Regards,
Markus
[1] https://wiki.apache.org/nutch/Support
-Original message-
> From:hany.n...@hsbc.com
> Sent: Friday 12th October 2018 9:25
> To: user@nutch.apache.org
> Subject: Apache
Hi Amarnatha,
-^.+(?:modal|exit).*\.html
This will work for all examples given.
You can test Java regexes online [1]. If each input returns true for
lookingAt, Nutch's regex filter will filter the URLs.
Regards,
Markus
[1] https://www.regexplanet.com/advanced/java/index.html
-Original
Hi Benjamin,
If you do not specifically require Nutch 2.x, I would strongly suggest going
with Nutch 1.x. It doesn't have the added hassle of a DB and DB layer, is much
more mature, and gets the most commits of the two.
Regards,
Markus
-Original message-
> From:Benjamin Vachon
>
Hello Rustam,
You can use urlnormalizer-slash for this task.
Regards,
Markus
-Original message-
> From:Rustam
> Sent: Wednesday 29th August 2018 10:30
> To: user@nutch.apache.org
> Subject: Nutch Maven support for plugins
>
> It seems Nutch is available in Maven, but without its
en.wikipedia.org/wiki/Internet_media_type (queue crawl delay=5000ms)
>
> That could be because of NUTCH-2623 (to be fixed in 1.16).
>
> If you have more examples, let me know. Otherwise, let's re-test if NUTCH-2623
> is fixed and the logging is improved. Could you open an issue fo
(or in general the status of
> a fetch).
> It would double the logged lines but would help to understand what the
> fetcher is doing,
> esp. regarding robots denied and redirects.
>
> Best,
> Sebastian
>
>
> On 08/01/2018 11:59 AM, Markus Jelsma wrote:
> >
All tests pass, crawler runs fine so far, +1 for 1.15!
Regards,
Markus
-Original message-
> From:Sebastian Nagel
> Sent: Thursday 26th July 2018 17:05
> To: user@nutch.apache.org
> Cc: d...@nutch.apache.org
> Subject: [VOTE] Release Apache Nutch 1.15 RC#1
>
> Hi Folks,
>
> A first
Hello,
Yossi's suggestion is excellent if your case is to crawl everything once, and
never again. However, if you need to crawl future articles as well, and have to
deal with mutations, then let the crawler run continuously without regard for
depth.
The latter is the usual case, because after
assumption that the document is valid and
> > conformant.
> >
> > Yossi.
> >
> >> -Original Message-
> >> From: Markus Jelsma
> >> Sent: 25 May 2018 23:45
> >> To: User
> >> Subject: Sitemap URL's concaten
n from the assumption that the document is valid and conformant.
>
> Yossi.
>
> > -Original Message-
> > From: Markus Jelsma
> > Sent: 25 May 2018 23:45
> > To: User
> > Subject: Sitemap URL's concatenated, causing status 14 not found
> >
Hello,
We have a sitemap.xml pointing to further sitemaps. The XML seems fine, but
Nutch thinks those two sitemap URLs are actually one, consisting of both
concatenated.
Here is https://www.saxion.nl/sitemap.xml
Hi,
Here are examples using Maven:
https://github.com/ATLANTBH/nutch-plugins/tree/master/nutch-plugins
Regards,
Markus
-Original message-
> From:Yash Thenuan Thenuan
> Sent: Monday 7th May 2018 11:51
> To: user@nutch.apache.org
> Subject: Re: Having plugin as
ath.random. For example, I can extract
> records above a specific score with "score>1.0". But the random thing doesn't
> work even though I have tried various thresholds.
>
> On Tuesday, May 1, 2018, 2:00:48 PM PDT, Markus Jelsma
> <markus.jel...@openindex.io> wr
Hello Michael,
I would think this should work as well. But since you mention .99 works fine,
did you try .1 as well to get ~10% output? It seems the expressions themselves
do work at some level, and since this is a Jexl-specific thing, you might want
to try the Jexl list as well. I could not find
Hello Chip,
I have no clue where the three-hour limit could come from. Please take a
further look at the last few minutes of the logs.
The only thing I can think of is that a webserver would block you after some
number of requests per time window; that would be visible in the logs. It is clear
That doesn't appear to be the case; the fetcher's time bomb nicely logs when it
has reached its limit, and it usually runs for longer than the two seconds we
see here.
What can you find in the logs? There must be some error beyond hung
threads, usually something with a hanging parser or GC
Hello Shiva,
Yes, that is possible, but it (ours) is not a foolproof solution.
We got our first hub classifier years ago in the form of a simple ParseFilter
backed by an SVM. The model was built solely on the HTML of positive and
negative examples, with very few features, so it was extremely
About URL normalizers, you can use:
urlnormalizer-host to normalize between www and non-www hosts, and
urlnormalizer-slash to normalize per-host trailing or non-trailing slashes.
There are no committed tools that automate this, but if your set of sites is
limited, it is easy to manage by hand.
ot db.max.outlinks.per.page, db.max.anchor.length. Copy paste error...
>
> > -Original Message-
> > From: Markus Jelsma <markus.jel...@openindex.io>
> > Sent: 12 March 2018 14:01
> > To: user@nutch.apache.org
> > Subject: RE: UrlRegexFilter is getting des
scripts/apache-nutch-1.14/src/java/org/apache/nutch/fetcher/FetcherThread.java:205:
maxOutlinksPerPage = conf.getInt("db.max.outlinks.per.page", 100);
scripts/apache-nutch-1.14/src/java/org/apache/nutch/parse/ParseOutputFormat.java:118:
int maxOutlinksPerPage =
Hello - see inline.
Regards,
Markus
-Original message-
> From:Semyon Semyonov
> Sent: Monday 12th March 2018 11:47
> To: user@nutch.apache.org
> Subject: UrlRegexFilter is getting destroyed for unrealistically long links
>
> Dear all,
>
lvalen...@gmail.com>
> Sent: Wednesday 7th March 2018 21:51
> To: user@nutch.apache.org
> Subject: Re: Need Tutorial on Nutch
>
> How about using nutch with a headless browser like CasperJS? Will this
> work? Have any of you tried this?
>
> On Tue, Mar 6, 2018 at
Hi,
I've got metadata, containing a capital in the field name. But index-metadata
lowercases its field names:
parseFieldnames.put(metatag.toLowerCase(Locale.ROOT), metatag);
This means index-metadata is useless if your metadata fields contain uppercase
characters. Was this done for a reason?
Hi,
Yes, you are going to need code, and a lot more than just that, probably
including dropping the 'every two hours' requirement.
For your case you need either site-specific price extraction, which is easy but
a lot of work for 500+ sites, or a more complicated generic algorithm,
of OpenIndex, and perhaps should
> be removed now that the code is part of Nutch, or is there a reason this
> normalizer must not be used with UpdateHostDb?
>
> Yossi.
>
> > -Original Message-
> > From: Markus Jelsma <markus.jel...@openindex.io>
>
Hi,
The reason is simple: we (the company) needed this information based on
hostname, so we made a HostDB. I don't see any downside to supporting a domain
mode. Adding support for it through hostdb.url.mode seems like a good idea.
Regards,
Markus
-Original message-
> From:Yossi Tamari
Hi,
If you want to stay clear of all 2.x caveats, use Nutch 1.x. If you want the
most stable and feature-rich version, use 1.x. If you want to limit the number
of moving parts (Gora as a DB abstraction, running and operating a separate DB
server), use 1.x. If you do not intend to crawl tens of millions
Checked and confirmed: even the Dutch digraph IJ is folded properly, as is the
upper-case dotless Turkish i, and the Spanish example you provided is folded
properly.
Correction for German (before Nagel corrects me): ö and ü are not normalized by
the ICU folder according to German rules. Their
Hi,
My guess is you haven't reindexed after changing the filter configuration,
which is required for index-time filters.
Regarding your fieldType, you can drop the lowercase and ASCII folding filters
and just keep the ICU folder; it will work for pretty much any character set.
It will normalize
<user-digest-h...@nutch.apache.org> wrote:
>
> >
> > From: Markus Jelsma <markus.jel...@openindex.io>
> > To: User <user@nutch.apache.org>
> > Cc:
> > Bcc:
> > Date: Wed, 17 Jan 2018 10:51:49 +
> > Subject: SitemapProcessor des
ought NUTCH-2442 forward.
> Time to review the patch of NUTCH-2466!
>
> On 01/17/2018 01:53 PM, Markus Jelsma wrote:
> > Ah thanks!
> >
> > I knew you'd fixed some of these, now i know my patch of NUTCH-2466
> > silently removes your commit!
> >
),
> only checking for exceptions isn't enough!
>
> Sebastian
>
> On 01/17/2018 11:51 AM, Markus Jelsma wrote:
> > Hello,
> >
> > We noticed some abnormalities in our crawl cycle caused by a sudden
> > reduction of our CrawlDB's size. The SitemapProcesso
Hello,
We noticed some abnormalities in our crawl cycle caused by a sudden reduction
of our CrawlDB's size. The SitemapProcessor ran, failed (timed out, see below)
and left us with a decimated CrawlDB.
This is odd because of:
} catch (Exception e) {
if (fs.exists(tempCrawlDb))
Thanks Sebastian!
-Original message-
> From:Sebastian Nagel
> Sent: Monday 25th December 2017 18:38
> To: user@nutch.apache.org; annou...@apache.org
> Subject: [ANNOUNCE] Apache Nutch 1.14 Release
>
> Dear Nutch users,
>
> the Apache Nutch [0] Project
nolaExtractor. I don't know
> what the differences are, but I bet ArticleExtractor (the default algorithm )
> inserts the Title.
>
>
>
>
> From: Markus Jelsma <markus.jel...@openindex.io>
> To: "user@nutch.apache.org" <us
> boilerpipe
>
> Which text extraction algorithm to use. Valid values are: boilerpipe or
> none.
>
>
>
>
> tika.extractor.boilerpipe.algorithm
> ArticleExtractor
>
> Which Boilerpipe algorithm to use. Valid values
You could do that, but you would need to fiddle around in TikaParser.java.
Using TeeContentHandler you can add both the normal ContentHandler, and the
Boilerpipe version.
-Original message-
> From:Michael Coffey
> Sent: Wednesday 15th November 2017 20:34
Hello Rushikesh - why is Boilerpipe not working for you? Are you having trouble
getting it configured (it is really just setting a boolean value), or does it
work, but not to your satisfaction?
The Bayan solution should work, theoretically, but only with a lot of tedious
manual per-site
cc list
-Original message-
> From:Markus Jelsma
> Sent: Wednesday 8th November 2017 0:15
> To: user@nutch.apache.org
> Subject: RE: Nutch(plugins) and R
>
> Hello - there are no responses, and i don't know what R is, but you are
> interested in HTML parsing,
Hello - there are no responses, and I don't know what R is, but you are
interested in HTML parsing, specifically topic detection, so here are my
thoughts.
We have done topic detection in our custom HTML parser, but in Nutch speak we
would do it in a ParseFilter implementation. Get the
er sends
> Content-Type: text/html; charset=utf-8
>
> Sebastian
>
> On 11/01/2017 07:06 PM, Markus Jelsma wrote:
> > Any ideas?
> >
> > Thanks!
> >
> >
> >
> > -Original message-
> >> From:Markus Jelsma <marku
Hi - Nutch has a parser for RSS and ATOM on-board:
https://nutch.apache.org/apidocs/apidocs-1.13/org/apache/nutch/parse/feed/FeedParser.html
You must configure it in your plugin.includes to use it.
Regards,
Markus
-Original message-
> From:Ankit Goel
>
gt;
> > For 1.17, the simplest solution, I think, is to allow users to configure
> > extending the detection limit via our @Field config methods, that is, via
> > tika-config.xml.
> >
> > To confirm, Nutch will allow users to specify a tika-config file? Will
>
> -----Original Message-
> From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> Sent: Tuesday, October 31, 2017 5:47 AM
> To: u...@tika.apache.org
> Subject: RE: Incorrect encoding detected
>
> Hello Timothy - what would be your preferred solution? Increase detection