REST uses the old method invocation which sets overwrite and update to false,
which is wrong.
https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/Injector.java#L514
Please open a ticket.
M.
-Original message-
> From:Sujan Suppala
> Sent: Friday 14th October
s and Regards,
> Shubham Gupta
>
> On Friday 14 October 2016 04:11 PM, Markus Jelsma wrote:
> > Check the logs, this only tells you that it failed, not why.
> > M.
> >
> >
> >
> > -Original message-
> >> From:shubham.gupta
> &g
Check the logs, this only tells you that it failed, not why.
M.
-Original message-
> From:shubham.gupta
> Sent: Friday 14th October 2016 12:15
> To: user@nutch.apache.org
> Subject: Injector and Generator Job Failing
>
> Hey
>
> Whenever i run the nutch application, only the injector
se comment if you see any issues with this approach.
>
> Thanks
> Sujan
>
> -Original Message-
> From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> Sent: Thursday, October 06, 2016 7:32 PM
> To: user@nutch.apache.org
> Subject: RE: nutch 1.12 How can I forc
Hi,
Just use the latest 1.12 if you have a choice. The archives are usually not
very useful.
The precompiled versions are identical to compiling the source yourself, with
the ant command.
Markus
-Original message-
> From:WebDawg
> Sent: Thursday 6th October 2016 15:10
> To: user@
Hi
You can use -adddays N in the generator job to fool it, or just use a lower
interval. Or, use the freegen tool to immediately crawl a set of URL's.
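For illustration, something like this (the crawl/crawldb, crawl/segments and
urls/ paths are placeholders, not from the original mail):
# bin/nutch generate crawl/crawldb crawl/segments -adddays 30
# bin/nutch freegen urls/ crawl/segments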
Markus
-Original message-
> From:Sujan Suppala
> Sent: Thursday 6th October 2016 15:56
> To: user@nutch.apache.org
> Subject: nutch
Nutch 2.3.1?
> Regards,
> Vladimir.
>
>
> -Original Message-
> From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> Sent: October-05-16 2:53 PM
> To: user@nutch.apache.org
> Subject: RE: Nutch scalability
>
> > Where I wrote YJK, I of course meant CJK instead.
>
>
Hello - it might be the case that over time, due to additional URL filters, the
CrawlDB loses URL's (which can go 404) that are never deleted from the index
and stay there forever.
If you really hate 404's, I'd just never delete the index, but keep a low
fetch interval, and have Nutch delete
Hello - you can try debug logging. Or get the whole list of URL's in a flat
file, and pipe it to bin/nutch org.apache.nutch.net.URLFilterChecker
-allCombined. URL's prefixed with a plus (+) are passed, those with a minus (-) are filtered out.
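For example, with urls.txt as a placeholder for your flat file of URL's:
# cat urls.txt | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined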
-Original message-
> From:shubham.gupta
> Sent: Wednesda
Where I wrote YJK, I of course meant CJK instead.
-Original message-
> From:Markus Jelsma
> Sent: Wednesday 5th October 2016 20:34
> To: user@nutch.apache.org
> Subject: RE: Nutch scalability
>
> Hello Vladimir - answers inline.
> Markus
>
> -Original message-
> > From:Vlad
t; If I should upgrade to the latest version of Solr (6.2.1) is it advisable to
> upgrade my current version (1.9) of nutch? If so, should I upgrade to the
> latest version of nutch (1.12)?
>
> Jackie
>
> -Original Message-
> From: Markus Jelsma [mailto:markus.jel
Hello - see inline.
Markus
-Original message-
> From:WebDawg
> Sent: Wednesday 5th October 2016 15:51
> To: user@nutch.apache.org
> Subject: Nutch and SOLR integration
>
> I am new to Solr and Nutch.
>
> I was working through the tutorials and managed to get everything
> going up to ma
Hello Vladimir - answers inline.
Markus
-Original message-
> From:Vladimir Loubenski
> Sent: Wednesday 5th October 2016 20:09
> To: user@nutch.apache.org
> Subject: Nutch scalability
>
> Hi,
> I have Nutch 2.3.1 installation with MongoDB.
>
> I want to understand what scalability opti
I think, as of 1.12, there is a parameter to disable the robots check, but I am
not sure. Check nutch-default.xml, it might be there.
M.
-Original message-
> From:Nestor
> Sent: Wednesday 5th October 2016 0:05
> To: user@nutch.apache.org
> Subject: Re: crawling a subfolder
>
> OK, Thanks fo
ssues.
Cheers
Markus
-Original message-
> From:Comcast
> Sent: Tuesday 4th October 2016 20:32
> To: user@nutch.apache.org
> Subject: Re: parsing issue - content and title fields combined
>
> I was not complaining
>
> Sent from my iPhone
>
> > On
t;
> this is slated for fix in v1.13.
> Great.
> K
>
> - Original Message -
>
> From: "Markus Jelsma"
> To: user@nutch.apache.org
> Sent: Tuesday, October 4, 2016 12:34:33 PM
> Subject: RE: parsing issue - content and title fields combined
the results have diff number of fields
>
> Maybe because I am trying to just crawl a subfolder mysite.com/subfolder and
> I am having problems configuring it to do this and is going and crawling
> other pages from the parent directory.
>
> Thanks!
>
>
>
> On
To my knowledge, there is no such thing, and it would probably never work
generically. If you want to prevent sections from being extracted, Nutch has
support for Boilerpipe, an open source extractor. It has major drawbacks, but
can work fine in some cases.
M.
-Original message-
>
Hi - this is a known and open issue, but it has a patch:
https://issues.apache.org/jira/browse/NUTCH-1749
-Original message-
> From:KRIS MUSSHORN
> Sent: Tuesday 4th October 2016 16:53
> To: user@nutch.apache.org
> Subject: parsing issue - content and title fields combined
>
> Nutch
Hello - this is not Solr's maximum for a field at all. But it is Java's
maximum for String. Just don't use the string field type when indexing.
Markus
-Original message-
> From:KRIS MUSSHORN
> Sent: Friday 30th September 2016 17:54
> To: user@nutch.apache.org
> Subject: Re: control order of operatio
Well, probably because you, or some other process, indexes different documents
into the Solr index. The first doesn't come from Nutch, the second does.
Markus
-Original message-
> From:Nestor
> Sent: Tuesday 4th October 2016 2:07
> To: user@nutch.apache.org
> Subject: why the results have diff number
java/org/apache/nutch/parse/tika/TikaParser.java#L117
>
> > -Ursprüngliche Nachricht-
> > Von: Markus Jelsma [mailto:markus.jel...@openindex.io]
> > Gesendet: Freitag, 30. September 2016 14:15
> > An: user@nutch.apache.org
> > Betreff: RE: Tika removes tags
Hello - Tika does some HTML mapping under the hood, but it is configurable.
Tell Tika to use the IdentityMapper. I am not sure anymore which param you
need, check out TikaParser.java, it is somewhere near the bottom.
Markus
-Original message-
> From:Felix von Zadow
> Sent: Friday 3
> at com.ctc.wstx.sr.BasicStreamReader.handleEOF(BasicStreamReader.java:2134)
> at
> com.ctc.wstx.sr.BasicStreamReader.nextFromProlog(BasicStreamReader.java:2040)
> at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1069)
> at org.apache.solr.handler.loader.XMLLoad
Yes, plain Hadoop map or sequence files on local storage or HDFS.
M.
-Original message-
> From:v0id null
> Sent: Thursday 8th September 2016 16:03
> To: user@nutch.apache.org
> Subject: Segment/CrawlDB in Nutch 1.x, how is it stored?
>
> I haven't realy been able to find this informa
t; huge size of logs. This leads to the failure of datanode and the job
> fails. And, if the logs are deleted periodically then the fetch phase
> takes a lot of time and it is uncertain that whether it will complete or
> not.
>
> Shubham Gupta
>
> On Wednesday 24 August 2016 05:2
t I posted
> previously
>
> -----Original Message-
> From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> Sent: Tuesday, September 6, 2016 3:02 PM
> To: user@nutch.apache.org
> Subject: RE: indexing metatags with Nutch 1.12
>
> Well, so we did add https to protocol-http&
s with Nutch 1.12
>
> Markus,
> I'm not sure how to answer your question.
> here are 2 xml files for your consideration.
>
> Kris
>
> ---
> From: "Markus Jelsma"
> To: user@nutch.apache.org
> Sent: Tuesday, September 6, 2016 2:30:39 PM
&g
Well, this is certainly not an indexing metatags problem. You need to use
protocol-httpclient for https, or configure protocol-http's plugin.xml to
support https. That's identical to protocol-httpclient's plugin.xml.
On the other hand, when we added support for https to protocol-http, did we
fo
Yes, use the LinkDB via the invertlinks command.
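A short sketch, with placeholder paths and an example URL:
# bin/nutch invertlinks crawl/linkdb -dir crawl/segments
# bin/nutch readlinkdb crawl/linkdb -url http://www.example.com/page.html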
Markus
-Original message-
> From:Manish Verma
> Sent: Friday 26th August 2016 23:17
> To: user@nutch.apache.org
> Subject: Pull All URL List
>
> Hi,
>
> Using nutch 1.12 is there any way to get urls referring to given url ? Also
> can w
datanode and I have to delete those logs for smooth
> functioning of nutch.
>
> Also, I am unclear as to which parameter should be changed in the
> log4j.properties to reduce this size.
>
> Shubham Gupta
>
> On 08/24/2016 05:20 PM, Markus Jelsma wrote:
> > If
Hello - see inline.
Markus
-Original message-
> From:Ajmal Rahman
> Sent: Tuesday 16th August 2016 15:55
> To: user@nutch.apache.org
> Subject: Query on Single Crawl script to Crawl website (Nutch) and Index
> results (Solr)
>
> Dear Team,
>
> I have a query. I'm not sure if this is t
If it is Nutch logging, change its level in conf/log4j.properties. It can also
be Hadoop logging.
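For example, in conf/log4j.properties (a sketch; the logger names below are
just common candidates, pick whatever is flooding your logs):
log4j.logger.org.apache.nutch=INFO
log4j.logger.org.apache.hadoop=WARN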
M.
-Original message-
> From:shubham.gupta
> Sent: Tuesday 23rd August 2016 8:15
> To: user@nutch.apache.org
> Subject: Application creating huge amount of logs : Nutch 2.3.1 + Hadoop 2.7.1
Hello Jacquelyn,
This is very odd:
> Unexpected EOF in prolog
> at [row,col {unknown-source}]: [1,0]
We fixed this problem a long time ago. It was a problem of non-unicode
codepoints in the data sent to Solr. The Solr indexing plugin strips them all,
and to my knowledge, there are no other
b_fetched* and
> https url has status *db_duplicate* so it deleted the duplicate.
>
> Second time it just deleted https version but did not add the http version
> which seems wrong, it should behave same way as it did first time.
>
> Thanks Mark
>
>
> On Mon, Aug 8, 2016
this is also the same
> > count which is shown as stats when indexing job completes (Indexer: 31125
> > indexed (add/update)
> >
> > What is index-dummy and how to use this ?
> >
> >
> > On Mon, Aug 8, 2016 at 1:16 PM, Markus Jelsma
> > wrote:
> &g
Hello - are you sure you are observing docCount and not maxDoc? I don't
remember having seen this kind of behaviour in the past years.
If it is docCount, then I'd recommend using the index-dummy backend twice and
diffing the results so you can see which documents are emitted, or not,
between index
>
> Network Bandwidth dedicated to Nutch: 2 Mbps
>
> Please help.
>
> Shubham Gupta
>
> On 07/29/2016 05:03 PM, Markus Jelsma wrote:
> > Hello Shubham,
> >
> > You can always eliminate the parse step by enabling the fetcher.parse
> > parameter. I
l change to https
>
> Markus so to crawl https and http urls successfully we just need to switch to
> a newer version of Nutch I.e. Higher than Nutch 1.10?
>
>
>
> On 8/5/16, 12:47 PM, "Markus Jelsma" wrote:
>
> >Hello - see inline.
> >Mar
Hello - see inline.
Markus
-Original message-
> From:Arora, Madhvi
> Sent: Friday 5th August 2016 18:03
> To: user@nutch.apache.org
> Subject: Protocol change to https
>
> Hi,
>
> We are using Nutch 1.10 and Solr 5. We have around 10 different web sites
> that are crawled regularly.
Yes, just keep Nutch running all the time with a refetch interval of your
choosing; it defaults to 30 days. With the -deleteGone switch when indexing you
will be fine.
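A sketch of such an indexing run, with placeholder paths and assuming
solr.server.url is already set in nutch-site.xml:
# bin/nutch index crawl/crawldb -linkdb crawl/linkdb -dir crawl/segments -deleteGone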
M.
-Original message-
> From:Musshorn, Kris T CTR USARMY RDECOM ARL (US)
>
> Sent: Wednesday 3rd August 2016 19:11
> To: user
Hello, can you check the logs? There may be a problem with some libraries as
someone recently noticed as well.
Markus
-Original message-
> From:Musshorn, Kris T CTR USARMY RDECOM ARL (US)
>
> Sent: Wednesday 27th July 2016 15:52
> To: user@nutch.apache.org
> Subject: progress (UNCLA
Hello Shubham,
You can always eliminate the parse step by enabling the fetcher.parse
parameter. It used to be bad advice, but it is only very rarely a problem;
hanging fetchers can still terminate themselves in a proper manner. I am not
sure about 2.x, but I think you can use this parameter there.
Max
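In nutch-site.xml the parameter looks roughly like this (a sketch; verify it is
honoured by your 2.x version):
<property>
  <name>fetcher.parse</name>
  <value>true</value>
</property>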
Hello Manish,
Usually Hadoop regulates the mapper count via splits; in only very few cases
do you want to control it yourself. It certainly does not increase indexing
speed, because the reducers perform the indexing. You can control the reducer
count, but I don't think you should, because Solr or Elastic can easily
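For illustration, the reducer count of an indexing run can be set with the
standard Hadoop property (a sketch with placeholder paths, assuming Hadoop 2.x):
# bin/nutch index -D mapreduce.job.reduces=4 crawl/crawldb -dir crawl/segments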
Hello Alexandre!
Nutch is happy to index to 6.1 as long as client libraries and API's are
properly implemented. If Nutch doesn't work well with 6.x by default, it
should. There have been cases where indexing to the cloud didn't work, but as a
backup, Solr still allows indexing to a single nod
Here's a list of companies offering support for Apache Nutch including my own:
https://wiki.apache.org/nutch/Support
-Original message-
> From:Musshorn, Kris T CTR USARMY RDECOM ARL (US)
>
> Sent: Thursday 21st July 2016 17:56
> To: user@nutch.apache.org
> Subject: help with integrat
t; if you could please send me some links or...
>
> Bests
> Shakiba Davari
>
>
> On Thu, Jul 21, 2016 at 7:10 PM, Markus Jelsma
> wrote:
>
> > Hello shakiba - the best solution, and solving a lot of additional
> > problems, is to make an indexing backend plugi
Hello shakiba - the best solution, which also solves a lot of additional
problems, is to make an indexing backend plugin specifically for your indexing
service. The coding involved is quite straightforward, except for any nuances
your indexing backend might have.
Dumping files and reprocessing them for
ote:
>
> > Fantastic, thanks Markus
> >
> > On Wed, Jul 20, 2016 at 5:30 PM Markus Jelsma
> > wrote:
> >
> >> Hi Harry,
> >>
> >> The generator has Jexl support, check [1] for fields. Metadata is as-is.
> >>
> >> It
Hi Harry,
The generator has Jexl support, check [1] for fields. Metadata is as-is.
It's very simple:
# bin/nutch generate -expr "status == db_unfetched"
Cheers
[1]
https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/CrawlDatum.java#L524
-Original message-
>
Hi Jamal - don't use managed schema with Solr 6.0 and/or 6.1. Just copy over
the schema Nutch provides and you are good to go.
Markus
-Original message-
> From:Jamal, Sarfaraz
> Sent: Friday 15th July 2016 15:47
> To: user@nutch.apache.org
> Subject: Newbie Nutch/Solr Question(s)
>
Hi Otis - it seems nobody has. Did you have a chance to try it? If you're
annoyed with reading lots of data each time, try compression first; it's a
major benefit for no cost.
M.
-Original message-
> From:Otis Gospodnetić
> Sent: Friday 4th March 2016 15:51
> To: user@nutch.apache.or
Hello Mark - why? Although it is possible to do so, it makes no sense. Gone
records are not reindexed; they are ignored, or with the correct flags even
removed from the index.
In any case, in Nutch 1.x the CrawlDB is read (optionally in trunk, I believe)
and the number
Manish - to get a count for db_gone, Nutch readdb needs to scan the whole
CrawlDB to get the stats; collecting the additional information is no overhead,
just a convenience. You get all the counts or none; there is no other way, and
for good reason.
M.
-Original message-
> From:Manish Ver
Jigal!
If they are not reindexed, they are probably not being refetched at all, for
some awkward reason. Are you using adaptive scheduling? In your case, you don't
need it; you force a refetch every day. And it might just be your problem, as
adaptive scheduling has this weird property.
There are s
utch.util.LockUtil.createLockFile(LockUtil.java:58)
> at org.apache.nutch.crawl.Injector.inject(Injector.java:357)
> at org.apache.nutch.crawl.Injector.run(Injector.java:467)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> at org.apache.nutch.crawl
o turn purge on all time.
> Can we specify something in run time to delete these from crawldb(some script
> or runtime argument).
>
> Regards,
> MV
>
> > On Jul 12, 2016, at 1:48 AM, Markus Jelsma
> > wrote:
> >
> > Hi - what do you mean by control?
Hi, there are some Windows API calls in there that I will never understand.
Are you working with some kind of symlinks, or whatever they are called in
Windows? There must be something with Nutch/Hadoop getting access to your
disk. Check permissions, disk space and whatever you can thi
Hi - what do you mean by control? In any case, you can turn it on once and
purge db_gone, then turn it off again, right?
Markus
-Original message-
> From:Manish Verma
> Sent: Tuesday 12th July 2016 8:08
> To: user@nutch.apache.org
> Subject: Delete db_gone from crawdb
>
> Hi,
>
> W
Hello Jamal - yes, it does!
M
-Original message-
> From:Jamal, Sarfaraz
> Sent: Monday 11th July 2016 18:50
> To: Nutch help
> Subject: Does Nutch work with JRE8?
>
> Hi Guys,
>
> I see in the documentation that it is designed to work with jre7 -
>
> Does it also work for JRE8?
>
Hello Megha - upgrade to 1.12 and try again.
Markus
-Original message-
> From:Megha Bhandari
> Sent: Friday 8th July 2016 16:28
> To: user@nutch.apache.org
> Subject: Nutch 1.11 | Ignoring content header and footer content while
> parsing HTML
>
> Hi
>
> Read a couple of threads th
Hello - what memory is not getting released by what process? Crawls 'slowing
down' usually happens because more and more records are being fetched.
I have never seen Nutch actually leak memory in the JVM heap, and since the
process' memory is largely dictated by the max heap size (default
Hello Manish!
1. Not really, except with custom intervention. The protocol-htmlunit plugin
won't work either because it follows redirects at all levels, so it would index
the same content twice. Unfortunately, it must follow redirects to work because
assets such as JS and CSS must follow redirects as
> //030: b800 104e 2d2c b800 0f2d b0 //
> // Stackmap Table://
> //append_frame(@47,Object[#143])/
>
>
> What should I do, please help.
> regards
>
> On 04/07/16 16:37, Markus Jelsma wrote:
> > Hello - there is no Boilerpipe support fo
Hello - there is no Boilerpipe support for 2.x.
Markus
-Original message-
> From:Nana Pandiawan
> Sent: Monday 4th July 2016 6:16
> To: user@nutch.apache.org
> Subject: Re: Remove Header from content
>
> Hi Markus Jelsma,
>
> If Boilerpipe support for Apache
Hello Jose Marcio!
You mean there is absolutely no self-repeating pattern anywhere in the URL? If
not, you are in trouble! Nutch URL filters don't operate in the context of the
page the URL is located on, nor do they operate on groups of URL's.
The easiest approach is to limit the URL length to 512
Hello Jose-Marcio - Hadoop parameters can also be specified in
nutch-(default|site).xml. They behave identically to command line -D parameter
switches.
Markus
-Original message-
> From:Jose-Marcio Martins da Cruz
> Sent: Tuesday 28th June 2016 17:05
> To: user@nutch.apache.org
> Subject:
Manish - Nutch has no support for it, but you could write a custom
ContentHandler for parse-tika that supports it. But since this is related to
your text extraction question, I'd recommend fixing the Boilerpipe issue.
Markus
-Original message-
> From:Manish Verma
> Sent: Wednesday 29th
t;
>
>
> <property>
>   <name>tika.extractor.boilerpipe.algorithm</name>
>   <value>CanolaExtractor</value>
>   <description>Which Boilerpipe algorithm to use. Valid values are: DefaultExtractor,
>   ArticleExtractor or CanolaExtractor.</description>
> </property>
>
>
>
> Am I missing something here ?
>
>
> Regards,
>
Manish - you're in luck. Nutch 1.12 was released and has Boilerpipe support.
Check:
https://issues.apache.org/jira/browse/NUTCH-961
Markus
-Original message-
> From:Manish Verma
> Sent: Tuesday 28th June 2016 23:46
> To: user@nutch.apache.org
> Subject: Remove Header from content
>
Hello Manish - purge404's does not delete stuff from the index. Documents will
always be recrawled if there are hyperlinks to them.
Markus
-Original message-
> From:Manish Verma
> Sent: Wednesday 22nd June 2016 23:53
> To: user@nutch.apache.org
> Subject: Purging 404 Docs
>
> Hi Tea
Hi James - use the -noFilter and -noNormalize switches and you'll get your
first massive performance improvement.
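For illustration, with placeholder paths (depending on the Nutch version the
second switch may be spelled -noNorm or -noNormalize, check the generate usage):
# bin/nutch generate crawl/crawldb crawl/segments -noFilter -noNorm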
M.
-Original message-
> From:James Mardell
> Sent: Wednesday 22nd June 2016 17:18
> To: user@nutch.apache.org
> Subject: Nutch generate slowdown
>
> We currently run a
You can use custom regex files etc., but no config. I recommend just having
separate Nutch instances and working directories. We also separate all our
customers.
Markus
-Original message-
> From:Jose-Marcio Martins da Cruz
> Sent: Tuesday 21st June 2016 11:50
> To: user@nutch.apache
Yes indeed. Turn it off if you are doing incremental crawls. Turn it on if you
are going to perform one single crawl and need OPIC scoring, for whatever
reason I cannot think of.
Markus
-Original message-
> From:Jigal van Hemert | alterNET internet BV
> Sent: Wednesday 22nd June 2
seed page
>
> Hi Markus,
>
> Thanks for your answer.
>
> 2016-06-21 13:59 GMT+02:00 Markus Jelsma :
>
> >
> > Have you set this parameter to 182? Probably not but anyway.
> >
> >
> > db.max.outlinks.per.page
> > 100
> >
>
Hello,
With Nutch 1.12 you can write a custom indexing filter that just removes that
field from the NutchDocument, which is quite easy. You can also not store and
not index that field in Solr, basically ignoring it. Or you can just not query
or boost on it.
Markus
-Original message-
> From:Megha
nt.java:184)
> ... 19 more
> 2016-06-21 13:22:00,803 ERROR indexer.IndexingJob - Indexer:
> java.io.IOException: Job failed!
> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
> at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:145)
> at org.apache.nutch
Yes, this happens if you use recent Solr versions with the managed schema; it
apparently treats text as string types. There's a ticket to change that to
TextField, though.
Markus
-Original message-
> From:Sebastian Nagel
> Sent: Tuesday 21st June 2016 23:15
> To: user@nutch.apache.org
> Subject
Hello Abdul,
Nutch will, by default, not recrawl a page until its fetch interval has passed. Check:
<property>
  <name>db.fetch.interval.default</name>
  <value>2592000</value>
  <description>The default number of seconds between re-fetches of a page (30
  days).</description>
</property>
Markus
-Original message-
> From:Abdul Munim
> Sent: Sunday 19th June 2016 21:3
Hello Jigal,
Have you set this parameter to 182? Probably not, but anyway:
<property>
  <name>db.max.outlinks.per.page</name>
  <value>100</value>
  <description>The maximum number of outlinks that we'll process for a page.
  If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
  will be processed for a page; otherwise, al
Hello Shakiba - please check Nutch' logs. The error is reported there.
Markus
-Original message-
> From:shakiba davari
> Sent: Thursday 16th June 2016 23:04
> To: user@nutch.apache.org
> Subject: Re: Indexing nutch crawled data in “Bluemix” solr
>
> Thanks so much Lewis. It really h
Hello Abdul - please check the logs, the real errors are reported there.
Markus
-Original message-
> From:Abdul Munim
> Sent: Sunday 19th June 2016 21:29
> To: user@nutch.apache.org
> Subject: nutch clean in crawl script throwing error
>
> Hi folks,
>
>
> I've setup Nutch 1.12 and
To those who upgrade,
The release announcement is missing some additional upgrade notes. If you use
the db.ignore.internal|external.links parameters, read the points below.
Regards,
Markus
-
Fellow committers, Nutch 1.12 contains a breaking change N
puting LinkRank on all segments before indexing, there's no reason I need
> to create a LinkDB, correct?
>
> Joe
>
> -Original Message-
> From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> Sent: Tuesday, June 14, 2016 06:52
> To: user@nutch.apac
Joseph - LinkRank and LinkDB are not related to each other. LinkRank scores the
WebGraph; the LinkDB is created with invertlinks.
In any case, consider enabling Hadoop sequence file compression. It greatly
reduces CrawlDB size and increases throughput. The CrawlDB is very suitable for
compressio
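A sketch of what enabling it for a single updatedb run could look like; these
are the standard Hadoop 2.x output compression properties, normally set once in
nutch-site.xml or mapred-site.xml, and the paths are placeholders:
# bin/nutch updatedb -D mapreduce.output.fileoutputformat.compress=true \
    -D mapreduce.output.fileoutputformat.compress.type=BLOCK \
    -D mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.DefaultCodec \
    crawl/crawldb crawl/segments/20160614000000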
Hi - that is a curious case indeed, as Nutch adheres to robots.txt. Can they
provide you with a reason for marking your Nutch as impolite?
Markus
-Original message-
> From:Mattmann, Chris A (3980)
> Sent: Wednesday 25th May 2016 0:26
> To: user@nutch.apache.org
> Subject: Re: Robots.txt
Hello - I don't think so. But in case you are using Solr, you could use
solrmapping.xml on Nutch's side, or of course a simple copyField in Solr's schema.
Markus
-Original message-
> From:Jigal van Hemert | alterNET internet BV
> Sent: Friday 20th May 2016 9:34
> To: user
> Subject: hea
Welcome to you too, Karanjeet. Thanks for the good work on the HtmlUnit plugin.
Cheers,
Markus
-Original message-
> From:Karanjeet Singh
> Sent: Monday 23rd May 2016 19:52
> To: d...@nutch.apache.org; user@nutch.apache.org
> Subject: Re: [ANNOUNCE] New Nutch committer and PMC - Karanjeet Singh
>
Welcome Thamme Gowda!
Cheers,
Markus
-Original message-
> From:Thamme Gowda
> Sent: Monday 23rd May 2016 0:56
> To: d...@nutch.apache.org; user@nutch.apache.org
> Subject: Re: [ANNOUNCE] New Nutch committer and PMC - Thamme Gowda N.
>
> Hi Sebastian,
> thanks for the invitation an
Hello, Nutch does not have any support for this kind of thing. But it should be
possible to test on screen width and similar basic things with the new
parse-htmlunit plugin. Link density looks less obvious, but font size and the
presence of Flash are easier.
Markus
-Original message-
> F
Hello - this is a Tika question and I am not sure this is possible, but it
might just be. Please go to the Tika mailing list and ask them.
Markus
-Original message-
> From:A Laxmi
> Sent: Wednesday 18th May 2016 19:46
> To: user@nutch.apache.org
> Subject: Nutch crawl line breaks
>
Hello Joseph, see inline.
Regards,
Markus
-Original message-
> From:Joseph Naegele
> Sent: Monday 16th May 2016 20:40
> To: user@nutch.apache.org
> Subject: pros/cons of many nodes
>
> Hi folks,
>
>
>
> Would anyone be willing to share a few pros/cons of using many nodes vs. 1
> ve
Hello!
No, search was stripped from Nutch long ago, and for good reason. But indeed,
now you need a crawler, a search engine and a frontend! I think the easiest way
is to use Solr; it is not that hard with the Nutch-provided schema.xml. You can
then use Solr's Velocity response writer [1] to build
;
> Thanks!
>
> -Original Message-
> From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> Sent: Monday, May 09, 2016 5:33 AM
> To: user@nutch.apache.org
> Subject: RE: startUp/shutDown methods for plugins
>
> Hello Joseph, you can initialize things in the p
Hello AL - Lewis is about to cut an RC. I expect Nutch 1.12 to be released in
four weeks or so.
Markus
-Original message-
> From:A Laxmi
> Sent: Tuesday 10th May 2016 19:04
> To: user@nutch.apache.org
> Subject: Release date for Nutch 1.12?
>
> Hi,
>
> Any expected release date for Nu
I checked the code. It will extract and parse all documents in the zip file and
concatenate all extracted text.
Markus
-Original message-
> From:Markus Jelsma
> Sent: Monday 9th May 2016 11:37
> To: user@nutch.apache.org
> Subject: RE: Nutch 1.x crawl Zip file URLs
>
> Content of s
Content of size 17027128 was truncated to
This means your http.size or whatever limit is too low. Increase the setting
and try again. By the way, I am not sure how the indexing behaviour will be; I
don't think it will handle multiple files just like that.
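If the limit in play is the protocol plugin's content limit (an assumption,
check which protocol plugin you use), raising it in nutch-site.xml would look
roughly like this, with 33554432 (32 MB) as an arbitrary value above the
truncated content size:
<property>
  <name>http.content.limit</name>
  <value>33554432</value>
</property>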
-Original message-
> From:A Laxm
Hello Joseph, you can initialize things in the plugin's setConf() method, but
there is no close(). Why do you want to clear resources? The plugin's lifetime
is very short for most mappers and reducers, and Hadoop will kill those JVM's
anyway.
Markus
-Original message-
> From:Joseph
Hi - you should use the index command. Solrindex has been removed as a separate
command. You can now index to various indexing backends.
Markus
-Original message-
> From:Chaushu, Shani
> Sent: Monday 2nd May 2016 14:47
> To: user@nutch.apache.org
> Subject: crawl with nutch 1.11
>
> H
Hi - this is the output of the Nutch client code, but you need the actual
mapper or reducer logs to find out what is really going on.
M.
-Original message-
> From:tkg_cangkul
> Sent: Sunday 24th April 2016 18:29
> To: user@nutch.apache.org
> Subject: Indexer Failed on Nutch 1.11 d