RE: nutch 1.12 INJECT REST call not honoring db.injector.overwrite

2016-10-14 Thread Markus Jelsma
REST uses the old method invocation which sets overwrite and update to false, which is wrong. https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/Injector.java#L514 Please open a ticket. M. -Original message- > From:Sujan Suppala > Sent: Friday 14th October

RE: Injector and Generator Job Failing

2016-10-14 Thread Markus Jelsma
Thanks and Regards, > Shubham Gupta > > On Friday 14 October 2016 04:11 PM, Markus Jelsma wrote: > > Check the logs, this only tells you that it failed, not why. > > M. > > > > > > > > -Original message- > >> From:shubham.gupta >

RE: Injector and Generator Job Failing

2016-10-14 Thread Markus Jelsma
Check the logs, this only tells you that it failed, not why. M. -Original message- > From:shubham.gupta > Sent: Friday 14th October 2016 12:15 > To: user@nutch.apache.org > Subject: Injector and Generator Job Failing > > Hey > > Whenever I run the nutch application, only the injector

RE: nutch 1.12 How can I force a URL to get re-indexed

2016-10-07 Thread Markus Jelsma
Please comment if you see any issues with this approach. > > Thanks > Sujan > > -Original Message- > From: Markus Jelsma [mailto:markus.jel...@openindex.io] > Sent: Thursday, October 06, 2016 7:32 PM > To: user@nutch.apache.org > Subject: RE: nutch 1.12 How can I forc

RE: 2 Locations and Common Build Practices

2016-10-06 Thread Markus Jelsma
Hi, Just use the latest 1.12 if you have a choice. The archives are usually not very useful. The precompiled versions are identical to compiling the source yourself, with the ant command. Markus -Original message- > From:WebDawg > Sent: Thursday 6th October 2016 15:10 > To: user@

RE: nutch 1.12 How can I force a URL to get re-indexed

2016-10-06 Thread Markus Jelsma
Hi - you can use -adddays N in the generator job to fool it, or just use a lower interval. Or, use the freegen tool to immediately crawl a set of URL's. Markus -Original message- > From:Sujan Suppala > Sent: Thursday 6th October 2016 15:56 > To: user@nutch.apache.org > Subject: nutch
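A sketch of both options, assuming the conventional crawl/crawldb and crawl/segments layout and the 30-day default interval:

  # pretend 31 extra days have passed so due-for-refetch URL's are selected
  bin/nutch generate crawl/crawldb crawl/segments -adddays 31
  # or fetch a fixed list immediately, bypassing the schedule entirely
  bin/nutch freegen urls/ crawl/segments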

RE: Nutch scalability

2016-10-05 Thread Markus Jelsma
Nutch 2.3.1? > Regards, > Vladimir. > > > -Original Message- > From: Markus Jelsma [mailto:markus.jel...@openindex.io] > Sent: October-05-16 2:53 PM > To: user@nutch.apache.org > Subject: RE: Nutch scalability > > Where I wrote YJK, I of course meant CJK instead. > >

RE: 404 removal not working and title mysteriously appearing in content

2016-10-05 Thread Markus Jelsma
Hello - it might be the case that over time, due to additional URL filters, the CrawlDB loses URL's (which can go 404), but they are never deleted from the index, and stay there forever. If you really hate 404's, I'd just never delete the index, but keep a low fetch interval, and have Nutch delete

RE: 90% of URL rejected by filtering (Nutch 2.3.1)

2016-10-05 Thread Markus Jelsma
Hello - you can try debug logging. Or get the whole list of URL's in a flat file, and pipe it to bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined. URL's prefixed with a plus (+) are passed; URL's with a minus (-) are filtered. -Original message- > From:shubham.gupta > Sent: Wednesda
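A sketch of the suggested pipe, assuming the URL list sits in a flat file named urls.txt (the file name is illustrative):

  cat urls.txt | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined
  # output: each URL prefixed with + (passed all filters) or - (rejected)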

RE: Nutch scalability

2016-10-05 Thread Markus Jelsma
Where I wrote YJK, I of course meant CJK instead. -Original message- > From:Markus Jelsma > Sent: Wednesday 5th October 2016 20:34 > To: user@nutch.apache.org > Subject: RE: Nutch scalability > > Hello Vladimir - answers inline. > Markus > > -Original message- > > From:Vlad

RE: Error while attempting to add documents to Solr

2016-10-05 Thread Markus Jelsma
> If I should upgrade to the latest version of Solr (6.2.1) is it advisable to > upgrade my current version (1.9) of nutch?  If so, should I upgrade to the > latest version of nutch (1.12)?  > > Jackie > > -Original Message- > From: Markus Jelsma [mailto:markus.jel

RE: Nutch and SOLR integration

2016-10-05 Thread Markus Jelsma
Hello - see inline. Markus -Original message- > From:WebDawg > Sent: Wednesday 5th October 2016 15:51 > To: user@nutch.apache.org > Subject: Nutch and SOLR integration > > I am new to Solr and Nutch. > > I was working through the tutorials and managed to get everything > going up to ma

RE: Nutch scalability

2016-10-05 Thread Markus Jelsma
Hello Vladimir - answers inline. Markus -Original message- > From:Vladimir Loubenski > Sent: Wednesday 5th October 2016 20:09 > To: user@nutch.apache.org > Subject: Nutch scalability > > Hi, > I have Nutch 2.3.1 installation with MongoDB. > > I want to understand what scalability opti

RE: crawling a subfolder

2016-10-04 Thread Markus Jelsma
I think, as of 1.12, there is a parameter to disable the robots check, but I am not sure. Check nutch-default, it might be there. M. -Original message- > From:Nestor > Sent: Wednesday 5th October 2016 0:05 > To: user@nutch.apache.org > Subject: Re: crawling a subfolder > > OK, Thanks fo

RE: parsing issue - content and title fields combined

2016-10-04 Thread Markus Jelsma
issues. Cheers Markus -Original message- > From:Comcast > Sent: Tuesday 4th October 2016 20:32 > To: user@nutch.apache.org > Subject: Re: parsing issue - content and title fields combined > > I was not complaining > > Sent from my iPhone > > > On

RE: parsing issue - content and title fields combined

2016-10-04 Thread Markus Jelsma
> > this is slated for fix in v1.13. > Great. > K > > - Original Message - > > From: "Markus Jelsma" > To: user@nutch.apache.org > Sent: Tuesday, October 4, 2016 12:34:33 PM > Subject: RE: parsing issue - content and title fields combined

RE: why the results have diff number of fields

2016-10-04 Thread Markus Jelsma
the results have diff number of fields > > Maybe because I am trying to just crawl a subfolder mysite.com/subfolder and > I am having problems configuring it to do this, and it is going and crawling > other pages from the parent directory. > > Thanks! > > > > On

RE: control order of operations

2016-10-04 Thread Markus Jelsma
To my knowledge, there is no such thing, and it would probably never work generically in any way. If you want to prevent sections from being extracted, Nutch has support for Boilerpipe, an open source extractor. It has major drawbacks, but can work fine in some cases. M. -Original message- >

RE: parsing issue - content and title fields combined

2016-10-04 Thread Markus Jelsma
Hi - this is a known and open issue, but it has a patch: https://issues.apache.org/jira/browse/NUTCH-1749 -Original message- > From:KRIS MUSSHORN > Sent: Tuesday 4th October 2016 16:53 > To: user@nutch.apache.org > Subject: parsing issue - content and title fields combined > > Nutch

RE: control order of operations

2016-10-04 Thread Markus Jelsma
Hello - this is not Solr's maximum for a field at all. But it is Java's maximum for String. Just don't use the string type when indexing. Markus -Original message- > From:KRIS MUSSHORN > Sent: Friday 30th September 2016 17:54 > To: user@nutch.apache.org > Subject: Re: control order of operatio

RE: why the results have diff number of fields

2016-10-04 Thread Markus Jelsma
Well, probably because you or something indexes different stuff to the Solr index. The first doesn't come from Nutch, the second does. Markus -Original message- > From:Nestor > Sent: Tuesday 4th October 2016 2:07 > To: user@nutch.apache.org > Subject: why the results have diff number

RE: Tika removes tags which I'd prefer to keep.

2016-09-30 Thread Markus Jelsma
java/org/apache/nutch/parse/tika/TikaParser.java#L117 > > > -Original message- > > From: Markus Jelsma [mailto:markus.jel...@openindex.io] > > Sent: Friday, 30 September 2016 14:15 > > To: user@nutch.apache.org > > Subject: RE: Tika removes tags

RE: Tika removes tags which I'd prefer to keep.

2016-09-30 Thread Markus Jelsma
Hello - Tika does some HTML mapping under the hood, but it is configurable. Tell Tika to use the IdentityMapper. I am not sure anymore which param you need, check out TikaParser.java, it is somewhere near the bottom. Markus -Original message- > From:Felix von Zadow > Sent: Friday 3

RE: Error while attempting to add documents to Solr

2016-09-21 Thread Markus Jelsma
> at com.ctc.wstx.sr.BasicStreamReader.handleEOF(BasicStreamReader.java:2134) > at > com.ctc.wstx.sr.BasicStreamReader.nextFromProlog(BasicStreamReader.java:2040) > at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1069) > at org.apache.solr.handler.loader.XMLLoad

RE: Segment/CrawlDB in Nutch 1.x, how is it stored?

2016-09-08 Thread Markus Jelsma
Yes, plain Hadoop map or sequence files on local storage or HDFS. M. -Original message- > From:v0id null > Sent: Thursday 8th September 2016 16:03 > To: user@nutch.apache.org > Subject: Segment/CrawlDB in Nutch 1.x, how is it stored? > > I haven't really been able to find this informa

RE: Application creating huge amount of logs : Nutch 2.3.1 + Hadoop 2.7.1

2016-09-06 Thread Markus Jelsma
> huge size of logs. This leads to the failure of the datanode and the job > fails. And, if the logs are deleted periodically then the fetch phase > takes a lot of time and it is uncertain whether it will complete or > not. > > Shubham Gupta > > On Wednesday 24 August 2016 05:2

RE: indexing metatags with Nutch 1.12

2016-09-06 Thread Markus Jelsma
t I posted > previously > > -----Original Message- > From: Markus Jelsma [mailto:markus.jel...@openindex.io] > Sent: Tuesday, September 6, 2016 3:02 PM > To: user@nutch.apache.org > Subject: RE: indexing metatags with Nutch 1.12 > > Well, so we did add https to protocol-http&

RE: indexing metatags with Nutch 1.12

2016-09-06 Thread Markus Jelsma
s with Nutch 1.12 > > Markus, > I'm not sure how to answer your question. > Here are 2 xml files for your consideration. > > Kris > > --- > From: "Markus Jelsma" > To: user@nutch.apache.org > Sent: Tuesday, September 6, 2016 2:30:39 PM >

RE: indexing metatags with Nutch 1.12

2016-09-06 Thread Markus Jelsma
Well, this is certainly not an indexing metatags problem. You need to use protocol-httpclient for https, or configure protocol-http's plugin.xml to support https. That's identical to protocol-httpclient's plugin.xml. On the other hand, when we added support for https to protocol-http, did we fo

RE: Pull All URL List

2016-08-31 Thread Markus Jelsma
Yes, use the LinkDB via the invertlinks command. Markus -Original message- > From:Manish Verma > Sent: Friday 26th August 2016 23:17 > To: user@nutch.apache.org > Subject: Pull All URL List > > Hi, > > Using nutch 1.12 is there any way to get urls referring to given url ? Also > can w
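A sketch of the two steps, assuming the usual crawl/ directory layout:

  # build the LinkDB by inverting the link graph of the fetched segments
  bin/nutch invertlinks crawl/linkdb -dir crawl/segments
  # then list the inlinks (referring URL's) of a given URL
  bin/nutch readlinkdb crawl/linkdb -url http://example.com/some/page.html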

RE: Application creating huge amount of logs : Nutch 2.3.1 + Hadoop 2.7.1

2016-08-25 Thread Markus Jelsma
datanode and I have to delete those logs for smooth > functioning of nutch. > > Also, I am unclear as to which parameter should be changed in the > log4j.properties to reduce this size. > > Shubham Gupta > > On 08/24/2016 05:20 PM, Markus Jelsma wrote: > > If

RE: Query on Single Crawl script to Crawl website (Nutch) and Index results (Solr)

2016-08-24 Thread Markus Jelsma
Hello - see inline. Markus -Original message- > From:Ajmal Rahman > Sent: Tuesday 16th August 2016 15:55 > To: user@nutch.apache.org > Subject: Query on Single Crawl script to Crawl website (Nutch) and Index > results (Solr) > > Dear Team, > > I have a query. I'm not sure if this is t

RE: Application creating huge amount of logs : Nutch 2.3.1 + Hadoop 2.7.1

2016-08-24 Thread Markus Jelsma
If it is Nutch logging, change its level in conf/log4j.properties. It can also be Hadoop logging. M. -Original message- > From:shubham.gupta > Sent: Tuesday 23rd August 2016 8:15 > To: user@nutch.apache.org > Subject: Application creating huge amount of logs : Nutch 2.3.1 + Hadoop 2.7.1
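A sketch of the kind of change meant here, in conf/log4j.properties; the logger category and level are illustrative, pick whichever category actually floods your logs:

  # keep only warnings and errors from the fetcher
  log4j.logger.org.apache.nutch.fetcher.Fetcher=WARN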

RE: Error while attempting to add documents to Solr

2016-08-12 Thread Markus Jelsma
Hello Jacquelyn, This is very odd: > Unexpected EOF in prolog > at [row,col {unknown-source}]: [1,0] We fixed this problem a long time ago. It was a problem of non-Unicode codepoints in the data sent to Solr. The Solr indexing plugin strips them all, and to my knowledge, there are no other

RE: Indexing Same CrawlDB Result In Different Indexed Doc Count

2016-08-10 Thread Markus Jelsma
db_fetched* and > https url has status *db_duplicate* so it deleted the duplicate. > > The second time it just deleted the https version but did not add the http version, > which seems wrong; it should behave the same way as it did the first time. > > Thanks Mark > > > On Mon, Aug 8, 2016

RE: Indexing Same CrawlDB Result In Different Indexed Doc Count

2016-08-08 Thread Markus Jelsma
this is also the same > > count which is shown as stats when indexing job completes (Indexer: 31125 > > indexed (add/update) > > > > What is index-dummy and how to use this ? > > > > > > On Mon, Aug 8, 2016 at 1:16 PM, Markus Jelsma > > wrote: >

RE: Indexing Same CrawlDB Result In Different Indexed Doc Count

2016-08-08 Thread Markus Jelsma
Hello - are you sure you are observing docCount and not maxDoc? I don't remember having seen this kind of behaviour in the past years. If it is docCount, then I'd recommend using the index-dummy backend twice and diffing their results so you can see which documents are emitted, or not, between index

RE: Nutch is taking very long time to complete crawl job :Nutch 2.3.1 + hadoop 2.7.1 + Yarn

2016-08-05 Thread Markus Jelsma
> > Network Bandwidth dedicated to Nutch: 2 Mbps > > Please help. > > Shubham Gupta > > On 07/29/2016 05:03 PM, Markus Jelsma wrote: > > Hello Shubham, > > > > You can always eliminate the parse step by enabling the fetcher.parse > > parameter. I

RE: Protocol change to https

2016-08-05 Thread Markus Jelsma
Protocol change to https > > Markus, so to crawl https and http urls successfully we just need to switch to > a newer version of Nutch, i.e. higher than Nutch 1.10? > > > > On 8/5/16, 12:47 PM, "Markus Jelsma" wrote: > > >Hello - see inline. > >Mar

RE: Protocol change to https

2016-08-05 Thread Markus Jelsma
Hello - see inline. Markus -Original message- > From:Arora, Madhvi > Sent: Friday 5th August 2016 18:03 > To: user@nutch.apache.org > Subject: Protocol change to https > > Hi, > > We are using Nutch 1.10 and Solr 5. We have around 10 different web sites > that are crawled regularly.

RE: functional question... (UNCLASSIFIED)

2016-08-03 Thread Markus Jelsma
Yes, just keep Nutch running all the time with a refetch interval you choose; it defaults to 30 days. With the -deleteGone switch when indexing you will be fine. M. -Original message- > From:Musshorn, Kris T CTR USARMY RDECOM ARL (US) > > Sent: Wednesday 3rd August 2016 19:11 > To: user
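A sketch of an indexing run with that switch, assuming the standard crawl/ layout and the 1.x index command:

  bin/nutch index crawl/crawldb -linkdb crawl/linkdb crawl/segments/* -deleteGone
  # -deleteGone removes documents for pages that have since gone 404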

RE: progress (UNCLASSIFIED)

2016-07-29 Thread Markus Jelsma
Hello, can you check the logs? There may be a problem with some libraries as someone recently noticed as well. Markus -Original message- > From:Musshorn, Kris T CTR USARMY RDECOM ARL (US) > > Sent: Wednesday 27th July 2016 15:52 > To: user@nutch.apache.org > Subject: progress (UNCLA

RE: Nutch is taking very long time to complete crawl job :Nutch 2.3.1 + hadoop 2.7.1 +yarn

2016-07-29 Thread Markus Jelsma
Hello Shubham, You can always eliminate the parse step by enabling the fetcher.parse parameter. It used to be bad advice but it is only very rarely a problem; hanging fetchers can still terminate themselves in a proper manner. I am not sure about 2.x but I think you can use this parameter. Max
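For reference, a sketch of enabling it in conf/nutch-site.xml:

  <property>
    <name>fetcher.parse</name>
    <value>true</value>
    <!-- parse while fetching, eliminating the separate parse step -->
  </property>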

RE: Indexing Mapper Count

2016-07-29 Thread Markus Jelsma
Hello Manish, Usually Hadoop regulates the mapper count via splits; only in very few cases do you want to control it yourself. It certainly does not increase indexing speed because the reducers perform the indexing, which you can control, but I don't think you should because Solr or Elastic can easily

RE: Reviewing Solr+Nutch tutorial: which version of Solr?

2016-07-29 Thread Markus Jelsma
Hello Alexandre! Nutch is happy to index to 6.1 as long as client libraries and API's are properly implemented. If, by default, Nutch doesn't work well with 6.x, it should. There have been cases where indexing to the cloud didn't work, but as a backup, Solr still allows indexing to a single nod

RE: help with integration (UNCLASSIFIED)

2016-07-27 Thread Markus Jelsma
Here's a list of companies offering support for Apache Nutch including my own: https://wiki.apache.org/nutch/Support -Original message- > From:Musshorn, Kris T CTR USARMY RDECOM ARL (US) > > Sent: Thursday 21st July 2016 17:56 > To: user@nutch.apache.org > Subject: help with integrat

RE: mapping files created by: nutch dump to the URL from which each file has been dumped.

2016-07-27 Thread Markus Jelsma
> if you could please send me some links or... > > Bests > Shakiba Davari > > > On Thu, Jul 21, 2016 at 7:10 PM, Markus Jelsma > wrote: > > > Hello shakiba - the best solution, and solving a lot of additional > > problems, is to make an indexing backend plugi

RE: mapping files created by: nutch dump to the URL from which each file has been dumped.

2016-07-21 Thread Markus Jelsma
Hello Shakiba - the best solution, one that also solves a lot of additional problems, is to make an indexing backend plugin specifically for your indexing service. The coding involved is quite straightforward, except for any nuances your indexing backend might have. Dumping files and reprocessing them for

RE: Generate segment of only unfetched urls

2016-07-21 Thread Markus Jelsma
wrote: > > > Fantastic, thanks Markus > > > > On Wed, Jul 20, 2016 at 5:30 PM Markus Jelsma > > wrote: > > > >> Hi Harry, > >> > >> The generator has Jexl support, check [1] for fields. Metadata is as-is. > >> > >> It

RE: Generate segment of only unfetched urls

2016-07-20 Thread Markus Jelsma
Hi Harry, The generator has Jexl support, check [1] for fields. Metadata is as-is. It's very simple: # bin/nutch generate -expr "status == db_unfetched" Cheers [1] https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/CrawlDatum.java#L524 -Original message- >

RE: Newbie Nutch/Solr Question(s)

2016-07-18 Thread Markus Jelsma
Hi Jamal - don't use the managed schema with Solr 6.0 and/or 6.1. Just copy over the schema Nutch provides and you are good to go. Markus -Original message- > From:Jamal, Sarfaraz > Sent: Friday 15th July 2016 15:47 > To: user@nutch.apache.org > Subject: Newbie Nutch/Solr Question(s) >
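A sketch of that swap, with illustrative paths for a Solr 6 core named nutch:

  cp $NUTCH_HOME/conf/schema.xml $SOLR_HOME/server/solr/nutch/conf/schema.xml
  # also switch solrconfig.xml to ClassicIndexSchemaFactory and remove the
  # managed-schema file so Solr reads the classic schema.xml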

RE: Nutch with Alluxio?

2016-07-13 Thread Markus Jelsma
Hi Otis - it seems nobody has. Did you have a chance to try it? If you're annoyed with reading lots of data each time, try compression first; it's a major benefit at no cost. M. -Original message- > From:Otis Gospodnetić > Sent: Friday 4th March 2016 15:51 > To: user@nutch.apache.or

RE: Nutch db_gone

2016-07-13 Thread Markus Jelsma
Hello Mark - why? Although it is possible to do so, it makes no sense. Gone records are not reindexed; they are ignored, or with the correct flags even removed from the index. In any case, in Nutch 1.x the CrawlDB is read (optionally in trunk, I believe) and the number

RE: readdb get db_gone count

2016-07-13 Thread Markus Jelsma
Manish - to get a count for db_gone, Nutch readdb needs to scan the whole CrawlDB to get stats; collecting additional information is no overhead, just a convenience. You get all the counts, or none; there is no other way, and for good reason. M. -Original message- > From:Manish Ver
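For reference, the one-pass stats call (crawldb path illustrative):

  bin/nutch readdb crawl/crawldb -stats
  # prints totals per status, db_gone included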

RE: Indexed URLs not re-indexed

2016-07-13 Thread Markus Jelsma
Jigal! If they are not reindexed, they are probably not being refetched at all, for some awkward reason. Are you using adaptive scheduling? In your case, you don't need it; you force a refetch every day. And it might just be your problem, as adaptive scheduling has this weird property. There are s

RE: Running into an Issue

2016-07-13 Thread Markus Jelsma
at org.apache.nutch.util.LockUtil.createLockFile(LockUtil.java:58) > at org.apache.nutch.crawl.Injector.inject(Injector.java:357) > at org.apache.nutch.crawl.Injector.run(Injector.java:467) > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) > at org.apache.nutch.crawl

RE: Delete db_gone from crawdb

2016-07-12 Thread Markus Jelsma
to turn purge on all the time. > Can we specify something at run time to delete these from the crawldb (some script > or runtime argument)? > > Regards, > MV > > > On Jul 12, 2016, at 1:48 AM, Markus Jelsma > > wrote: > > > > Hi - what do you mean by control?

RE: Running into an Issue

2016-07-12 Thread Markus Jelsma
Hi - there are some Windows API calls in there that I will never understand. Are there some kinds of symlinks you are working with, or whatever they are called in Windows? There must be something with Nutch/Hadoop getting access to your disk. Check permissions, disk space and whatever you can thi

RE: Delete db_gone from crawdb

2016-07-12 Thread Markus Jelsma
Hi - what do you mean by control? In any case, you can turn it on once and purge db_gone, then turn it off again, right? Markus -Original message- > From:Manish Verma > Sent: Tuesday 12th July 2016 8:08 > To: user@nutch.apache.org > Subject: Delete db_gone from crawdb > > Hi, > > W
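A sketch of that toggle, assuming the db.update.purge.404 property in conf/nutch-site.xml:

  <property>
    <name>db.update.purge.404</name>
    <value>true</value>
    <!-- the next updatedb run purges db_gone records; set back to false afterwards -->
  </property>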

RE: Does Nutch work with JRE8?

2016-07-11 Thread Markus Jelsma
Hello Jamal - yes, it does! M -Original message- > From:Jamal, Sarfaraz > Sent: Monday 11th July 2016 18:50 > To: Nutch help > Subject: Does Nutch work with JRE8? > > Hi Guys, > > I see in the documentation that it is designed to work with jre7 - > > Does it also work for JRE8? >

RE: Nutch 1.11 | Ignoring content header and footer content while parsing HTML

2016-07-08 Thread Markus Jelsma
Hello Megha - upgrade to 1.12 and try again. Markus -Original message- > From:Megha Bhandari > Sent: Friday 8th July 2016 16:28 > To: user@nutch.apache.org > Subject: Nutch 1.11 | Ignoring content header and footer content while > parsing HTML > > Hi > > Read a couple of threads th

RE: Nutch 1.11 | memory leak?

2016-07-07 Thread Markus Jelsma
Hello - what memory is not getting released, by what process? Crawls usually 'slow down' because more and more records are being fetched. I have never seen Nutch actually leak memory in the JVM heap, and since the process' memory is largely dictated by the max heap size (default

RE: Nutch Redirect Skip Indexing Orignal Url

2016-07-05 Thread Markus Jelsma
Hello Manish! 1. Not really, except with custom intervention. The protocol-htmlunit plugin won't work either because it follows redirects at all levels, so it would index the same content twice. Unfortunately, it must follow redirects to work, because assets such as JS and CSS must follow redirects as

RE: Remove Header from content

2016-07-05 Thread Markus Jelsma
> //030: b800 104e 2d2c b800 0f2d b0 // > // Stackmap Table:// > //append_frame(@47,Object[#143])/ > > > What should I do, please help. > regards > > On 04/07/16 16:37, Markus Jelsma wrote: > > Hello - there is no Boilerpipe support fo

RE: Remove Header from content

2016-07-04 Thread Markus Jelsma
Hello - there is no Boilerpipe support for 2.x. Markus -Original message- > From:Nana Pandiawan > Sent: Monday 4th July 2016 6:16 > To: user@nutch.apache.org > Subject: Re: Remove Header from content > > Hi Markus Jelsma, > > If Boilerpipe support for Apache

RE: Regular expressions in regex-urlfilter.txt

2016-07-01 Thread Markus Jelsma
Hello Jose Marcio! You mean there is absolutely no self-repeating pattern anywhere in the URL? If there is none, you are in trouble! Nutch URL filters don't operate in the context of the page the URL is located on, nor do they operate on groups of URL's. The easiest approach is to limit the URL length to 512
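A sketch of that length cap as a rule in regex-urlfilter.txt, using the 512 threshold suggested above:

  # reject any URL longer than 512 characters
  -^.{513,}$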

RE: Some Java parameters defined inside bin/crawl 1.12

2016-06-29 Thread Markus Jelsma
Hello Jose-Marcio - Hadoop parameters can also be specified in nutch-(default|site).xml. They behave identically to command line -D parameter switches. Markus -Original message- > From:Jose-Marcio Martins da Cruz > Sent: Tuesday 28th June 2016 17:05 > To: user@nutch.apache.org > Subject:
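For illustration, these two forms are meant to be equivalent (the property picked here is an example):

  bin/nutch updatedb -Dmapreduce.job.reduces=4 crawl/crawldb crawl/segments/*

  <property>
    <name>mapreduce.job.reduces</name>
    <value>4</value>
  </property>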

RE: Does Nutch 1 Honor googleoff tags

2016-06-29 Thread Markus Jelsma
Manish - Nutch has no support for it, but you could write a custom ContentHandler for parse-tika that supports it. But since this is related to your text extraction question, I'd recommend fixing the Boilerpipe issue. Markus -Original message- > From:Manish Verma > Sent: Wednesday 29th

RE: Remove Header from content

2016-06-29 Thread Markus Jelsma
t; > > > tika.extractor.boilerpipe.algorithm > CanolaExtractor > > Which Boilerpipe algorithm to use. Valid values are: DefaultExtractor, > ArticleExtractor > or CanolaExtractor. > > > > Am I missing something here ? > > > Regards, >

RE: Remove Header from content

2016-06-29 Thread Markus Jelsma
Manish - you're in luck. Nutch 1.12 was released and has Boilerpipe support. Check: https://issues.apache.org/jira/browse/NUTCH-961 Markus -Original message- > From:Manish Verma > Sent: Tuesday 28th June 2016 23:46 > To: user@nutch.apache.org > Subject: Remove Header from content >
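The relevant properties as they would appear in conf/nutch-site.xml; the algorithm value is illustrative, the valid choices are the ones quoted in the thread above:

  <property>
    <name>tika.extractor</name>
    <value>boilerpipe</value>
  </property>
  <property>
    <name>tika.extractor.boilerpipe.algorithm</name>
    <value>ArticleExtractor</value>
  </property>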

RE: Purging 404 Docs

2016-06-23 Thread Markus Jelsma
Hello Manish - purging 404's does not delete stuff from the index. Documents will always be recrawled if there are hyperlinks to them. Markus -Original message- > From:Manish Verma > Sent: Wednesday 22nd June 2016 23:53 > To: user@nutch.apache.org > Subject: Purging 404 Docs > > Hi Tea

RE: Nutch generate slowdown

2016-06-22 Thread Markus Jelsma
Hi James - use the -noFilter and -noNormalize switches and you'll get your first massive performance improvement. M. -Original message- > From:James Mardell > Sent: Wednesday 22nd June 2016 17:18 > To: user@nutch.apache.org > Subject: Nutch generate slowdown > > We currently run a
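A sketch of the suggested call; note the normalization switch is spelled -noNorm in recent 1.x sources, so check bin/nutch generate's usage output for your version:

  bin/nutch generate crawl/crawldb crawl/segments -topN 50000 -noFilter -noNorm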

RE: nutch 1.12 - different options for each crawldb

2016-06-22 Thread Markus Jelsma
You can use custom regex files etc, but no config. I recommend to just have separate Nutch instances and working directories. We also separate all our customers. Markus -Original message- > From:Jose-Marcio Martins da Cruz > Sent: Tuesday 21st June 2016 11:50 > To: user@nutch.apache

RE: Nutch 1.11 | scoring-opic plugin | influence on solr document score

2016-06-22 Thread Markus Jelsma
Yes indeed. Turn it off if you are doing incremental crawls. Turn it on if you are going to perform one single crawl and need OPIC scoring, for whatever reason I cannot think of. Markus -Original message- > From:Jigal van Hemert | alterNET internet BV > Sent: Wednesday 22nd June 2
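Turning it off is a plugin.includes edit in conf/nutch-site.xml; a sketch with scoring-opic left out (the surrounding list is an illustrative variant of the stock default):

  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|urlnormalizer-(pass|regex|basic)</value>
  </property>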

RE: Number of crawled links from seed page

2016-06-22 Thread Markus Jelsma
seed page > > Hi Markus, > > Thanks for your answer. > > 2016-06-21 13:59 GMT+02:00 Markus Jelsma : > > > > > Have you set this parameter to 182? Probably not but anyway. > > > > > > db.max.outlinks.per.page > > 100 > > >

RE: Nutch 1.11 | scoring-opic plugin | influence on solr document score

2016-06-22 Thread Markus Jelsma
Hello, With Nutch 1.12 you can write a custom indexing filter that just removes that field from the NutchDocument, quite easy. You can also not store and not index that field in Solr, basically ignoring it. Or you can just not query or boost on it. Markus -Original message- > From:Megha
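A minimal sketch of such a filter; the class name and the assumption that the OPIC-derived value surfaces as a field named "boost" are illustrative. The plugin still needs the usual plugin.xml registration and a plugin.includes entry:

  // DropBoostFilter.java - strips one field from documents before indexing
  package org.example.nutch;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.io.Text;
  import org.apache.nutch.crawl.CrawlDatum;
  import org.apache.nutch.crawl.Inlinks;
  import org.apache.nutch.indexer.IndexingException;
  import org.apache.nutch.indexer.IndexingFilter;
  import org.apache.nutch.indexer.NutchDocument;
  import org.apache.nutch.parse.Parse;

  public class DropBoostFilter implements IndexingFilter {
    private Configuration conf;

    @Override
    public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
        CrawlDatum datum, Inlinks inlinks) throws IndexingException {
      doc.removeField("boost"); // a no-op if the field is absent
      return doc;
    }

    @Override
    public void setConf(Configuration conf) { this.conf = conf; }

    @Override
    public Configuration getConf() { return conf; }
  }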

RE: Indexing nutch crawled data in “Bluemix” solr

2016-06-22 Thread Markus Jelsma
nt.java:184) > ... 19 more > 2016-06-21 13:22:00,803 ERROR indexer.IndexingJob - Indexer: > java.io.IOException: Job failed! > at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836) > at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:145) > at org.apache.nutch

RE: immense term,Correcting analyzer

2016-06-22 Thread Markus Jelsma
Yes, this happens if you use recent Solr's with managed schema, it apparently treats text as string types. There's a ticket to change that to TextField though. Markus -Original message- > From:Sebastian Nagel > Sent: Tuesday 21st June 2016 23:15 > To: user@nutch.apache.org > Subject

RE: Reindex Nutch periodically using cron job

2016-06-21 Thread Markus Jelsma
Hello Abdul, Nutch will, by default, not recrawl until some interval has passed. Check: db.fetch.interval.default 2592000 - The default number of seconds between re-fetches of a page (30 days). Markus -Original message- > From:Abdul Munim > Sent: Sunday 19th June 2016 21:3
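For illustration, lowering it in conf/nutch-site.xml (the one-day value is an example):

  <property>
    <name>db.fetch.interval.default</name>
    <value>86400</value>
    <!-- seconds; the default 2592000 equals 30 days -->
  </property>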

RE: Number of crawled links from seed page

2016-06-21 Thread Markus Jelsma
Hello Jigal, Have you set this parameter to 182? Probably not but anyway. db.max.outlinks.per.page 100 The maximum number of outlinks that we'll process for a page. If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks will be processed for a page; otherwise, al
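For illustration, lifting the cap in conf/nutch-site.xml:

  <property>
    <name>db.max.outlinks.per.page</name>
    <value>-1</value>
    <!-- a negative value means all outlinks are processed -->
  </property>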

RE: Indexing nutch crawled data in “Bluemix” solr

2016-06-21 Thread Markus Jelsma
Hello Shakiba - please check Nutch's logs. The error is reported there. Markus -Original message- > From:shakiba davari > Sent: Thursday 16th June 2016 23:04 > To: user@nutch.apache.org > Subject: Re: Indexing nutch crawled data in “Bluemix” solr > > Thanks so much Lewis. It really h

RE: nutch clean in crawl script throwing error

2016-06-21 Thread Markus Jelsma
Hello Abdul - please check the logs, the real errors are reported there. Markus -Original message- > From:Abdul Munim > Sent: Sunday 19th June 2016 21:29 > To: user@nutch.apache.org > Subject: nutch clean in crawl script throwing error > > Hi folks, > > > I've setup Nutch 1.12 and

RE: [ANNOUNCE] Apache Nutch 1.12 Release

2016-06-21 Thread Markus Jelsma
To those who upgrade, The release announcement is missing some additional upgrade notes.  If you use the db.ignore.internal|external.links parameters, read the points below. Regards, Markus - Fellow committers, Nutch 1.12 contains a breaking change N

RE: improving distributed indexing performance

2016-06-14 Thread Markus Jelsma
Computing LinkRank on all segments before indexing, there's no reason I need > to create a LinkDB, correct? > > Joe > > -Original Message- > From: Markus Jelsma [mailto:markus.jel...@openindex.io] > Sent: Tuesday, June 14, 2016 06:52 > To: user@nutch.apac

RE: improving distributed indexing performance

2016-06-14 Thread Markus Jelsma
Joseph - LinkRank and the LinkDB are not related to each other. LinkRank scores the WebGraph; the LinkDB is created with invertlinks. In any case, consider enabling Hadoop sequence file compression. It greatly reduces CrawlDB size and increases throughput. The CrawlDB is very suitable for compressio
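A sketch of enabling block-compressed sequence file output via standard Hadoop properties (codec choice illustrative):

  <property>
    <name>mapreduce.output.fileoutputformat.compress</name>
    <value>true</value>
  </property>
  <property>
    <name>mapreduce.output.fileoutputformat.compress.type</name>
    <value>BLOCK</value>
  </property>
  <property>
    <name>mapreduce.output.fileoutputformat.compress.codec</name>
    <value>org.apache.hadoop.io.compress.DefaultCodec</value>
  </property>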

RE: Robots.txt

2016-05-25 Thread Markus Jelsma
Hi - that is a curious case indeed, as Nutch adheres to robots.txt. Can they provide you with a reason for marking your Nutch as impolite? Markus -Original message- > From:Mattmann, Chris A (3980) > Sent: Wednesday 25th May 2016 0:26 > To: user@nutch.apache.org > Subject: Re: Robots.txt

RE: headings plug-in target field

2016-05-24 Thread Markus Jelsma
Hello - I don't think so. But in case you are using Solr, you could use solrindex-mapping.xml on Nutch's side, or of course a simple copyField in Solr's schema. Markus -Original message- > From:Jigal van Hemert | alterNET internet BV > Sent: Friday 20th May 2016 9:34 > To: user > Subject: hea

RE: [ANNOUNCE] New Nutch committer and PMC - Karanjeet Singh

2016-05-24 Thread Markus Jelsma
Welcome too, Karanjeet. Thanks for the good work on the HtmlUnit plugin. Cheers, Markus -Original message- > From:Karanjeet Singh > Sent: Monday 23rd May 2016 19:52 > To: d...@nutch.apache.org; user@nutch.apache.org > Subject: Re: [ANNOUNCE] New Nutch committer and PMC - Karanjeet Singh >

RE: [ANNOUNCE] New Nutch committer and PMC - Thamme Gowda N.

2016-05-24 Thread Markus Jelsma
Welcome Thamme Gowda! Cheers, Markus -Original message- > From:Thamme Gowda > Sent: Monday 23rd May 2016 0:56 > To: d...@nutch.apache.org; user@nutch.apache.org > Subject: Re: [ANNOUNCE] New Nutch committer and PMC - Thamme Gowda N. > > Hi Sebastian, >  thanks for the invitation an

RE: Scoring mobile-friendliness

2016-05-24 Thread Markus Jelsma
Hello, Nutch does not have any support for this kind of thing. But it should be possible to test for screen width and similarly basic things with the new parse-htmlunit plugin. Link density looks less obvious, but font size and the presence of Flash are easier. Markus -Original message- > F

RE: Nutch crawl line breaks

2016-05-18 Thread Markus Jelsma
Hello - this is a Tika question and I am not sure this is possible, but it might just be. Please go to the Tika mailing list and ask them. Markus -Original message- > From:A Laxmi > Sent: Wednesday 18th May 2016 19:46 > To: user@nutch.apache.org > Subject: Nutch crawl line breaks >

RE: pros/cons of many nodes

2016-05-17 Thread Markus Jelsma
Hello Joseph, see inline. Regards, Markus -Original message- > From:Joseph Naegele > Sent: Monday 16th May 2016 20:40 > To: user@nutch.apache.org > Subject: pros/cons of many nodes > > Hi folks, > > > > Would anyone be willing to share a few pros/cons of using many nodes vs. 1 > ve

RE: WebSearch response similar to Google

2016-05-11 Thread Markus Jelsma
Hello! No, search was stripped from Nutch long ago, and for good reason. But indeed, now you need a crawler, a search engine and a frontend! I think the easiest way is to use Solr; it is not that hard with the Nutch-provided schema.xml. You can then use Solr's Velocity response writer [1] to build

RE: startUp/shutDown methods for plugins

2016-05-11 Thread Markus Jelsma
> Thanks! > > -Original Message- > From: Markus Jelsma [mailto:markus.jel...@openindex.io] > Sent: Monday, May 09, 2016 5:33 AM > To: user@nutch.apache.org > Subject: RE: startUp/shutDown methods for plugins > > Hello Joseph, you can initialize things in the p

RE: Release date for Nutch 1.12?

2016-05-11 Thread Markus Jelsma
Hello AL - Lewis is about to cut an RC. I expect Nutch 1.12 to be released in four weeks or so. Markus -Original message- > From:A Laxmi > Sent: Tuesday 10th May 2016 19:04 > To: user@nutch.apache.org > Subject: Release date for Nutch 1.12? > > Hi, > > Any expected release date for Nu

RE: Nutch 1.x crawl Zip file URLs

2016-05-09 Thread Markus Jelsma
I checked the code. It will extract and parse all documents in the zip file and concatenate all extracted text. Markus -Original message- > From:Markus Jelsma > Sent: Monday 9th May 2016 11:37 > To: user@nutch.apache.org > Subject: RE: Nutch 1.x crawl Zip file URLs > > Content of s

RE: Nutch 1.x crawl Zip file URLs

2016-05-09 Thread Markus Jelsma
"Content of size 17027128 was truncated to" - this means your http.size or whatever limit is too low. Increase the setting and try again. By the way, I am not sure how indexing behaviour will be; I don't think it will handle multiple files just like that. -Original message- > From:A Laxm
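For illustration, assuming the limit in play is the usual HTTP content cap, raised in conf/nutch-site.xml so the 17027128-byte file fits:

  <property>
    <name>http.content.limit</name>
    <value>33554432</value>
    <!-- 32 MB; the shipped default is 65536 bytes -->
  </property>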

RE: startUp/shutDown methods for plugins

2016-05-09 Thread Markus Jelsma
Hello Joseph, you can initialize things in the plugin's setConf() method, but there is no close(). Why do you want to clear resources? The plugin's lifetime is very short for most mappers and reducers, and Hadoop will kill those JVM's anyway. Markus -Original message- > From:Joseph

RE: crawl with nutch 1.11

2016-05-03 Thread Markus Jelsma
Hi - you should use the index command. Solrindex has been removed as a separate command. You can now index to various indexing backends. Markus -Original message- > From:Chaushu, Shani > Sent: Monday 2nd May 2016 14:47 > To: user@nutch.apache.org > Subject: crawl with nutch 1.11 > > H
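A sketch of the replacement command, assuming the Solr backend and illustrative paths:

  bin/nutch index -Dsolr.server.url=http://localhost:8983/solr/nutch \
    crawl/crawldb -linkdb crawl/linkdb crawl/segments/*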

RE: Indexer Failed on Nutch 1.11 deploy mode

2016-04-26 Thread Markus Jelsma
Hi - this is the output of the Nutch client code, but you need the actual mapper or reducer logs to know what is really going on. M. -Original message- > From:tkg_cangkul > Sent: Sunday 24th April 2016 18:29 > To: user@nutch.apache.org > Subject: Indexer Failed on Nutch 1.11 d
