Re: nutch adds %20 in urls instead of spaces

2024-01-09 Thread Markus Jelsma
Hello Steve, Having those spaces normalized/encoded is expected behaviour with urlnormalizer-basic active. I would recommend keeping it this way and having all URLs in Solr properly encoded. Having spaces in Solr IDs is also not recommended, as it can lead to unexpected behaviour. If you really
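Plain Java shows the same behaviour the normalizer enforces. This is an illustrative sketch only: the class and method names are made up, and `java.net.URI` is used as a stand-in for Nutch's own normalization code.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper: percent-encodes illegal characters (such as spaces)
// the way a URL normalizer would, using java.net.URI's quoting constructor.
public class EncodeSpace {
    static String normalize(String host, String path) {
        try {
            return new URI("https", host, path, null).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // A space in the path comes out as %20, matching what ends up in Solr.
        System.out.println(normalize("example.org", "/my page.html"));
    }
}
```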

Re: Nutch - Restriction by content type

2023-11-16 Thread Markus Jelsma
Hello, You can skip certain types of documents based on their file extension using urlfilter-suffix. It only filters known suffixes. Filtering based on content type is not possible, because knowing the content type requires fetching and parsing the documents. You can skip specific content types when
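For reference, a minimal `suffix-urlfilter.txt` lists one suffix per line; the suffixes below are illustrative examples, and the exact directive syntax is documented in the header comments of the file shipped with Nutch:

```
# reject URLs ending in these suffixes
.zip
.exe
.pdf
```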

Re: Re[2]: Siet is not crawling

2023-08-13 Thread Markus Jelsma
t; > > > On Mon, 30 Jan 2023 18:35:06 +0530 *Markus Jelsma > >* wrote --- > > Yes, remove the other protocol-* plugins from the configuration. With all > three active it is not always determined which one is going to do the > work. > > Op ma 30 jan. 2023 om 12:50

Re: Nutch Exception

2023-07-24 Thread Markus Jelsma
Hello, Please check the logs for more information. Regards, Markus Op ma 24 jul 2023 om 19:05 schreef Raj Chidara : > Hi > > Nutch 1.19 compiled with ant without any errors and when running > Injector, getting an error that > > > > 19:20:25.055 [main] ERROR org.apache.nutch.crawl.Injector -

Re: Nutch 1.19/Hadoop compatible

2023-03-07 Thread Markus Jelsma
Hello Mike, > Is nutch 1.19 compatible with Hadoop 3.3.4? Yes! Regards, Markus Op di 7 mrt 2023 om 17:37 schreef Mike : > Hello! > > Is nutch 1.19 compatible with Hadoop 3.3.4? > > > Thanks! > > mike >

Re: Capture and index match count on regex

2023-02-26 Thread Markus Jelsma
Hello Joe, > Now I'd like to capture and index the count of forward slash characters '/' It seems you are trying to do that with the subcollection plugin; I don't think that is going to work. Instead, I would suggest writing a simple index plugin that does the counting, and adds the
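The counting itself is trivial; a real implementation would live in an IndexingFilter, but the core logic can be sketched in plain Java (the class and method names here are made up for illustration):

```java
// Hypothetical sketch of the counting an index plugin could perform;
// a real plugin would implement org.apache.nutch.indexer.IndexingFilter
// and add the count as a field on the NutchDocument.
public class SlashCount {
    static long countSlashes(String url) {
        return url.chars().filter(c -> c == '/').count();
    }

    public static void main(String[] args) {
        // "https://example.org/a/b/c" contains five '/' characters in total.
        System.out.println(countSlashes("https://example.org/a/b/c"));
    }
}
```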

Re: Re[2]: Siet is not crawling

2023-01-30 Thread Markus Jelsma
Are there any additional steps to be > followed for installation of selenium. Please suggest > > > Thanks and Regards > > Raj Chidara > > - Original Message - > From: Markus Jelsma (markus.jel...@openindex.io) > Date: 30-01-2023 16:26 > To: user@nutch.apac

Re: Siet is not crawling

2023-01-30 Thread Markus Jelsma
Hello Raj, I think the same question about the same site was asked here some time ago. Anyway, this site loads its content via Javascript. You will need a protocol plugin that supports it, either protocol-htmlunit, or protocol-selenium, instead of protocol-http or any other. Change the

Re: Nutch/Hadoop Cluster

2023-01-14 Thread Markus Jelsma
Hello Mike, > would it pay off for me to put a hadoop cluster on top of the 3 servers. Yes, for as many reasons as Hadoop exists for. It can be tedious to set up for the first time, and there are many components. But at least you have three servers, which is kind of required by Zookeeper, that

Re: Not able to crawl ich

2022-12-17 Thread Markus Jelsma
Hello Raj, This site loads its content via Javascript, so you need a protocol plugin that supports it. HtmlUnit does not seem to work with this site, but Selenium does. Please change your protocol plugin accordingly in your plugin.includes configuration directive. I tested it with our own parser

Re: CSV indexer file data overwriting

2022-11-25 Thread Markus Jelsma
jects must take > advantage of any little contribution no matter the way. > > Best, > > El vie, 25 nov 2022 a las 7:21, Markus Jelsma ( >) > escribió: > > > Hello Paul, > > > > > I tried to comment on this jira issue, but I don't have access, > > unf

Re: CSV indexer file data overwriting

2022-11-25 Thread Markus Jelsma
Hello Paul, > I tried to comment on this jira issue, but I don't have access, unfortunately I don't know how to do it. Due to too much spam, it is no longer possible to create an account for yourself, but we can do that for you if you wish. Regards, Markus Op do 24 nov. 2022 om 22:46 schreef

Re: Few websites not crawling

2022-11-23 Thread Markus Jelsma
Hello, The German site is crawlable, but it does produce awful URLs with some ;jsessionid=<> attached to them. The Chinese site is all Javascript; it requires the HtmlUnit or Selenium protocol plugin for it to work at all. No guarantee that it will. Regards, Markus Op wo 23 nov. 2022 om 11:07 schreef

Re: Incomplete TLD List

2022-11-08 Thread Markus Jelsma
Hello Mike, You can try adding the TLD to conf/domain-suffixes.xml and see if it works. Regards, Markus Op di 8 nov. 2022 om 11:16 schreef Mike : > Hi! > Some of the new TLDs are wrongly indexed by Nutch, is it possible to extend > the TLD list? > >

Re: How should the headings plugin be configured?

2022-10-31 Thread Markus Jelsma
kus! > > Thank you for taking care of my problem! > > I removed the metatag.h# from index.parse.md but nutch indexchecker still does not > show me the fields. > > Am Mo., 31. Okt. 2022 um 12:56 Uhr schrieb Markus Jelsma < > markus.jel...@openindex.io>: > > > Hello Mike,

Re: How should the headings plugin be configured?

2022-10-31 Thread Markus Jelsma
Hello Mike, Please remove the metatag.* prefix in the index.parse.md config and i think you should be fine. Regards, Markus Op ma 31 okt. 2022 om 12:32 schreef Mike : > Yes, sorry, I also forgot to post this setting: > > >index.parse.md > > > >

Re: How should the headings plugin be configured?

2022-10-31 Thread Markus Jelsma
Hello Mike, I think it should be working just fine with it enabled in plugin.includes. You can check Nutch's parser output by using: $ bin/nutch parsechecker You should see one or more h# output fields present. You can then use the index-metadata plugin to map the parser output fields to the
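A typical invocation looks like the following; the URL is a placeholder, and -dumpText prints the extracted text along with the parse metadata fields:

```
$ bin/nutch parsechecker -dumpText https://example.org/some-page.html
```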

Re: Nutch/Hadoop: Error (FreeGenerator job did not succeed)

2022-10-14 Thread Markus Jelsma
Hello, You cannot just run Nutch's JAR like that on Hadoop, you need the large .job file instead. If you build Nutch from source, you will get a runtime/deploy directory. Upload its contents to a Hadoop client and run Nutch commands using bin/nutch ... You will then automatically use the large

Re: [VOTE] Release Apache Nutch 1.19 RC#1

2022-08-29 Thread Markus Jelsma
> >>> I have been able to compile under OpenJDK 11 > >>> Have not done anything further so far > >>> I'm gonna try to get to it this evening > >>> > >>> Greetz > >>> Ralf > >>> > >>> On Wed, Aug 24, 2022 a

Re: [VOTE] Release Apache Nutch 1.19 RC#1

2022-08-24 Thread Markus Jelsma
Hi, Everything seems fine, the crawler seems fine when trying the binary distribution. The source won't work because this computer still cannot compile it. Clearing the local Ivy cache did not do much. This is the known compiler error with the elastic-indexer plugin: compile: [echo] Compiling

Re: [DISCUSS] Release 1.19 ?

2022-08-09 Thread Markus Jelsma
Sounds good! I see we're still at Tika 2.3.0; I'll submit a patch to upgrade to the current 2.4.1. Thanks! Markus Op di 9 aug. 2022 om 09:11 schreef Sebastian Nagel : > Hi all, > > more than 60 issues are done for Nutch 1.19 > > https://issues.apache.org/jira/projects/NUTCH/versions/12349580

Re: Does Nutch work with Hadoop Versions greater than 3.1.3?

2022-06-13 Thread Markus Jelsma
To add to Sebastian, it runs very well on Hadoop 3.3.x too. Actually, I have never had any Hadoop version that could not run Nutch out of the box and without issues. Op ma 13 jun. 2022 om 11:54 schreef Sebastian Nagel : > Hi Michael, > > Nutch (1.18, and trunk/master) should work together with

OkHttp NoClassDefFoundError: okhttp3/Authenticator

2021-07-23 Thread Markus Jelsma
Hello, With a 1.18 checkout I am trying the okhttp plugin. I couldn't get it to work on 1.15 due to a NoClassDefFoundError, and now with 1.18 it still doesn't work and throws another NoClassDefFoundError. java.lang.NoClassDefFoundError: okhttp3/Authenticator at

Re: Recommendation for free and production-ready Hadoop setup to run Nutch

2021-06-01 Thread Markus Jelsma
Hello Sebastian, We have always used vanilla Apache Hadoop on our own physical servers that are running on the latest Debian, which also runs on ARM. It will run HDFS and YARN and any other custom job you can think of. It has snappy compression, which is a massive improvement for large data

Re: Crawling same domain URL's

2021-05-11 Thread Markus Jelsma
ll be a bottleneck in this case. I am > looking for options to distribute the same domain URLs across various > mappers. Not sure if that's even possible with Nutch or not. > > Regards > Prateek > > On Tue, May 11, 2021 at 11:58 AM Markus Jelsma > > wrote: > > >

Re: Crawling same domain URL's

2021-05-11 Thread Markus Jelsma
Hello Prateek, If you want to fetch content from the same host/domain as fast as you want, increase the number of threads and the number of threads per queue, then decrease the fetch delays. Regards, Markus Op di 11 mei 2021 om 12:48 schreef prateek : > Hi Lewis, > > As mentioned earlier,
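In nutch-site.xml that roughly translates to the following properties; the values are illustrative, not recommendations:

```xml
<!-- Illustrative values only; tune for your own crawl and be polite. -->
<property>
  <name>fetcher.threads.fetch</name>
  <value>50</value>
</property>
<property>
  <name>fetcher.threads.per.queue</name>
  <value>5</value>
</property>
<property>
  <name>fetcher.server.delay</name>
  <value>1.0</value>
</property>
```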

Re: Nutch getting rid of older segments

2021-04-07 Thread Markus Jelsma
Hello Abhay, You only need to keep or merge old segments if you 'quickly' need to reindex the data and are unable to start with a fresh crawl. If you frequently recrawl all URLs, e.g. every month, then segments older than a month can safely be removed. You can also do daily and monthly merges, like

Re: Nutch Configure multiple fetch plugins

2021-04-02 Thread Markus Jelsma
Hello Abhay, You can configure a protocol plugin per host using the host-protocol-mapping.txt configuration file. Its usage is: or protocol: Regards, Markus Op vr 2 apr. 2021 om 15:18 schreef Abhay Ratnaparkhi < abhay.ratnapar...@gmail.com>: > Hello, > > I would like to know how to

Re: EXTERNAL: Re: 301 perm redirect pages are still in Solr

2021-03-09 Thread Markus Jelsma
om 08:49 schreef Hany NASR : > Hello Markus, > > I added the property in nutch-site.xml with no luck. > > The documents still exist in Solr; any advice? > > Regards, > Hany > > From: Markus Jelsma > Sent: Monday, March 8, 2021 3:40 PM > To: user@nutch.apache.o

Re: 301 perm redirect pages are still in Solr

2021-03-08 Thread Markus Jelsma
Hello Hany, You need to tell the indexer to delete those records; setting indexer.delete to true will help. Regards, Markus Op ma 8 mrt. 2021 om 15:31 schreef Hany NASR : > Hi All, > > I'm using Nutch 1.15, and figured out that permanent redirect pages (301) > are still indexed and not
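In nutch-site.xml the property looks like this (the description text is paraphrased):

```xml
<property>
  <name>indexer.delete</name>
  <value>true</value>
  <description>Send deletes for gone and redirected documents to the index.</description>
</property>
```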

RE: [ANNOUNCE] Apache Nutch 1.17 Release

2020-07-02 Thread Markus Jelsma
Thanks Sebastian! -Original message- > From:Sebastian Nagel > Sent: Thursday 2nd July 2020 16:42 > To: user@nutch.apache.org > Cc: d...@nutch.apache.org; annou...@apache.org > Subject: [ANNOUNCE] Apache Nutch 1.17 Release > > The Apache Nutch team is pleased to announce the release

RE: Resolve by IP

2020-04-14 Thread Markus Jelsma
Hello Marcel, You can use your /etc/hosts file for that purpose, assuming you are on Linux. Regards, Markus -Original message- > From:Marcel Haazen > Sent: Tuesday 14th April 2020 12:12 > To: user@nutch.apache.org > Subject: Resolve by IP > > Hi, > I'm trying to crawl a specific
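For example, a single line in /etc/hosts maps the hostname to the IP you want Nutch to fetch from (the IP and hostname below are placeholders):

```
# /etc/hosts
203.0.113.10    www.example.org
```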

RE: Extracting XMP metadata from PDF for indexing Nutch 1.15

2019-12-31 Thread Markus Jelsma
Hello Joseph, > Is there more documentation on having Nutch get what Tika sees into what Solr > will see? No, but I believe you would want to check out the parsechecker and indexchecker tools. These tools display what Tika sees and what will be sent to Solr. Regards, Markus -Original

RE: Best and economical way of setting hadoop cluster for distributed crawling

2019-11-01 Thread Markus Jelsma
nd I can increase my throughput. > > Please let me know. > > Thanks > Sachin > > > > > > > On Thu, Oct 31, 2019 at 2:40 AM Markus Jelsma > wrote: > > > Hello Sachin, > > > > Nutch can run on Amazon AWS without trouble, and probably on any Hadoop > >

RE: Nutch not crawling all pages

2019-10-30 Thread Markus Jelsma
score quantile 0.99:2.542474209925616E-4 > > min score: 3.0443254217971116E-5 > > avg score: 7.001118352666182E-4 > > max score: 1.3120110034942627 > > status 2 (db_fetched): 39150 > > status 3 (db_gone): 13 > > status 4 (db_redir_temp):

RE: Best and economical way of setting hadoop cluster for distributed crawling

2019-10-30 Thread Markus Jelsma
Hello Sachin, Nutch can run on Amazon AWS without trouble, and probably on any Hadoop based provider. This is the most expensive option you have. Cheaper would be to rent some servers and install Hadoop yourself, getting it up and running by hand on some servers will take the better part of a

RE: Nutch not crawling all pages

2019-10-30 Thread Markus Jelsma
Hello Dave, First you should check the CrawlDB using readdb -stats. My bet is that your set contains some redirects, gone (404) pages, or transient errors. The numbers for fetched and notModified added up should be about the same as the number of documents indexed. Regards, Markus
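The stats are printed with the readdb tool; the crawldb path below is a placeholder:

```
$ bin/nutch readdb crawl/crawldb -stats
```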

RE: Adding specfic query parameters to nutch url filters

2019-10-21 Thread Markus Jelsma
Hello Sachin, Once a URL gets filtered, by any plugin, it is rejected entirely. If you want specific queries to pass the regex-urlfilter, you must let them pass explicitly above this -[?*!@=] line, e.g. +passThisQuery= Use bin/nutch filterchecker -stdIn for quick testing. Regards, Markus
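In regex-urlfilter.txt the allow rule has to come before the generic reject line; the parameter name below is the placeholder from the example above:

```
# allow this specific query parameter
+passThisQuery=
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
```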

Unable to index on Hadoop 3.2.0 with 1.16

2019-10-14 Thread Markus Jelsma
Hello, We're upgrading our stuff to 1.16 and got a peculiar problem when we started indexing: 2019-10-14 13:50:30,586 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.lang.IllegalStateException: text width is less than 1, was <-41> at

RE: [ANNOUNCE] Apache Nutch 1.16 Release

2019-10-14 Thread Markus Jelsma
Thanks Sebastian! -Original message- > From:Sebastian Nagel > Sent: Friday 11th October 2019 17:03 > To: user@nutch.apache.org > Cc: d...@nutch.apache.org; annou...@apache.org > Subject: [ANNOUNCE] Apache Nutch 1.16 Release > > Hi folks! > > The Apache Nutch [0] Project Management

RE: Excluding individual pages?

2019-10-10 Thread Markus Jelsma
Hello Dave, If you have just one specific page you do not want Nutch to index, or Solr to show, you can either create a custom IndexingFilter that returns null (rejecting it) for the specified URL, or add an additional filterQuery to Solr, fq=-id:, filtering the specific URL from the results.
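On the Solr side the filter query could look like this (the URL is a placeholder; the leading minus negates the clause, excluding that document):

```
fq=-id:"https://example.org/page-to-hide.html"
```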

RE: [VOTE] Release Apache Nutch 1.16 RC#1

2019-10-03 Thread Markus Jelsma
Hello Sebastian, All tests pass nicely and i can easily run a crawl. +1 Thanks, Markus By the way, what does this mean: 2019-10-03 12:48:49,696 INFO  crawl.Generator - Generator: number of items rejected during selection: 2019-10-03 12:48:49,698 INFO  crawl.Generator - Generator:  1 

RE: Nutch NTLM to IIS 8.5 - issues!

2019-04-25 Thread Markus Jelsma
Hello, It doesn't say much except failure, no reason. You might want to set debugging to TRACE, the authenticator logs on that level. You could also check if there are server side messages. Regards, Markus -Original message- > From:Larry.Santello > Sent: Thursday 25th April 2019

RE: Boilerpipe algorithm is not working as expected

2019-03-20 Thread Markus Jelsma
Hello Hany, For Boilerpipe you can only select which extractor it should use. By default it uses ArticleExtractor, which is the best choice in most cases. However, if content is more spread out into separate blocks, CanolaExtractor could be a better choice. Regards, Markus -Original

RE: Limiting Results From Single Domain

2019-03-20 Thread Markus Jelsma
Hello Alexis, see inline. Regards, Markus -Original message- > From:IZaBEE_Keeper > Sent: Wednesday 20th March 2019 1:28 > To: user@nutch.apache.org > Subject: RE: Limiting Results From Single Domain > > Markus Jelsma-2 wrote > > Hello Alexis, > > >

RE: OutOfMemoryError: GC overhead limit exceeded

2019-03-18 Thread Markus Jelsma
y (HOST) ul. > > Kapelanka 42A, 30-347 Kraków, Poland > > __ > > > > Tie line: 7148 7689 4698 > > External: +48 123 42 0698 > > Mobile: +48 723 680 278 > > E-mail: hany.n...@hsbc.com > >

RE: Increasing the number of reducer in UpdateHostDB

2019-03-18 Thread Markus Jelsma
Hello Suraj, You can safely increase the number of reducers for UpdateHostDB to as high as you like. Regards, Markus -Original message- > From:Suraj Singh > Sent: Monday 18th March 2019 11:41 > To: user@nutch.apache.org > Subject: Increasing the number of reducer in UpdateHostDB > >

RE: OutOfMemoryError: GC overhead limit exceeded

2019-03-14 Thread Markus Jelsma
Hello - 1500 MB is a lot indeed, but 3500 PDF pages is even more. You have no choice, either skip large files, or increase memory. Regards, Markus -Original message- > From:hany.n...@hsbc.com.INVALID > Sent: Thursday 14th March 2019 10:44 > To: user@nutch.apache.org > Subject:

RE: Increasing the number of reducer in Deduplication

2019-02-20 Thread Markus Jelsma
Hello Suraj, That should be no problem. Duplicates are grouped by their signature, this means you can have as many reducers as you would like. Regards, Markus -Original message- > From:Suraj Singh > Sent: Wednesday 20th February 2019 12:56 > To: user@nutch.apache.org > Subject:

RE: Difficulty getting data from Nutch parse data into Solr document

2019-02-13 Thread Markus Jelsma
Hello Tom, To get parse metadata field indexed, you need the indexer-metadata plugin. Use the index.parse.md parameter to define the fields you want to have indexed. Use indexchecker to test. Regards, Markus -Original message- > From:Tom Potter > Sent: Wednesday 13th February

RE: Multiple Reducers for Linkdb

2018-12-18 Thread Markus Jelsma
Hello Suraj, You can safely run the LinkDB merger with as many reducers as you like. Regards, Markus -Original message- > From:Suraj Singh > Sent: Tuesday 18th December 2018 15:39 > To: user@nutch.apache.org > Subject: Multiple Reducers for Linkdb > > Hello, > > Can we run

RE: RE: unexpected Nutch crawl interruption

2018-11-19 Thread Markus Jelsma
Nutch crawl interruption > > I think in the case that you interrupt the fetcher, you'll have the problem > that URLs that where scheduled to be fetched on the interrupted cycle will > never be fetched (because of NUTCH-1842). > > Yossi. > > > -Original Message

RE: RE: unexpected Nutch crawl interruption

2018-11-19 Thread Markus Jelsma
Hello Hany, That depends. If you interrupt the fetcher, the segment being fetched can be thrown away. But if you interrupt updatedb, you can remove the temp directory and must get rid of the lock file. The latter is also true if you interrupt the generator. Regards, Markus -Original

RE: Wordpress.com hosted sites fail org.apache.commons.httpclient.NoHttpResponseException

2018-11-14 Thread Markus Jelsma
Hello Nicholas, Your IP might be blocked, or the firewall just drops the connection due to your User-Agent name. We have no problems fetching this host. Regards, Markus -Original message- > From:Nicholas Roberts > Sent: Wednesday 14th November 2018 7:58 > To: user@nutch.apache.org

RE: Block certain parts of HTML code from being indexed

2018-11-14 Thread Markus Jelsma
Hello Hany, Using parse-tika as your HTML parser, you can enable Boilerpipe (see nutch-default). Regards, Markus -Original message- > From:hany.n...@hsbc.com > Sent: Wednesday 14th November 2018 15:53 > To: user@nutch.apache.org > Subject: Block certain parts of HTML code from

RE: Getting Nutch To Crawl Sharepoint Online

2018-10-29 Thread Markus Jelsma
Hello Ashish, You might want to check out Apache ManifoldCF. Regards. Markus http://manifoldcf.apache.org/ -Original message- > From:Ashish Saini > Sent: Monday 29th October 2018 18:56 > To: user@nutch.apache.org > Subject: Getting Nutch To Crawl Sharepoint Online > > We are

RE: Apache Nutch commercial support

2018-10-12 Thread Markus Jelsma
Hello Hany, There are a few, mine included, mentioned on the Nutch support wiki page [1]. Regards, Markus [1] https://wiki.apache.org/nutch/Support -Original message- > From:hany.n...@hsbc.com > Sent: Friday 12th October 2018 9:25 > To: user@nutch.apache.org > Subject: Apache

RE: Regex to block some patterns

2018-10-03 Thread Markus Jelsma
Hi Amarnatha, -^.+(?:modal|exit).*\.html will work for all examples given. You can test regexes really well online [1]. If an input returns true for lookingAt, Nutch's regex filter will filter those URLs. Regards, Markus [1] https://www.regexplanet.com/advanced/java/index.html -Original
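The same check can be reproduced in plain Java; this sketch (with made-up example URLs) uses Matcher.lookingAt as the email describes:

```java
import java.util.regex.Pattern;

// Sketch of the suggested rule, minus the leading '-' that marks it as
// a reject rule in regex-urlfilter.txt.
public class RuleCheck {
    static final Pattern RULE = Pattern.compile("^.+(?:modal|exit).*\\.html");

    static boolean ruleMatches(String url) {
        return RULE.matcher(url).lookingAt();
    }

    public static void main(String[] args) {
        System.out.println(ruleMatches("https://example.org/popup-modal.html")); // true  -> rejected
        System.out.println(ruleMatches("https://example.org/index.html"));       // false -> passes
    }
}
```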

RE: Nutch 2.x HBase alternatives

2018-10-03 Thread Markus Jelsma
Hi Benjamin, If you do not specifically require Nutch 2.x, I would strongly suggest going with Nutch 1.x. It doesn't have the added hassle of a DB and DB layer, is much more mature, and gets the most commits of the two. Regards, Markus -Original message- > From:Benjamin Vachon >

RE: Nutch Maven support for plugins

2018-08-29 Thread Markus Jelsma
Hello Rustam, You can use urlnormalizer-slash for this task. Regards, Markus -Original message- > From:Rustam > Sent: Wednesday 29th August 2018 10:30 > To: user@nutch.apache.org > Subject: Nutch Maven support for plugins > > It seems Nutch is available in Maven, but without its

RE: [VOTE] Release Apache Nutch 1.15 RC#1

2018-08-01 Thread Markus Jelsma
en.wikipedia.org/wiki/Internet_media_type (queue crawl delay=5000ms) > > That could be because of NUTCH-2623 (to be fixed in 1.16). > > If you have more examples, let me know. Otherwise, let's re-test if NUTCH-2623 > is fixed and the logging is improved. Could you open an issue fo

RE: [VOTE] Release Apache Nutch 1.15 RC#1

2018-08-01 Thread Markus Jelsma
(or in general the status of > a fetch). > It would double the logged lines but would help to understand what the > fetcher is doing, > esp. regarding robots denied and redirects. > > Best, > Sebastian > > > On 08/01/2018 11:59 AM, Markus Jelsma wrote: > >

RE: [VOTE] Release Apache Nutch 1.15 RC#1

2018-08-01 Thread Markus Jelsma
All tests pass, the crawler runs fine so far, +1 for 1.15! Regards, Markus -Original message- > From:Sebastian Nagel > Sent: Thursday 26th July 2018 17:05 > To: user@nutch.apache.org > Cc: d...@nutch.apache.org > Subject: [VOTE] Release Apache Nutch 1.15 RC#1 > > Hi Folks, > > A first

RE: Issues while crawling pagination

2018-07-28 Thread Markus Jelsma
Hello, Yossi's suggestion is excellent if your case is crawl everything once, and never again. However, if you need to crawl future articles as well, and have to deal with mutations, then let the crawler run continuously without regard for depth. The latter is the usual case, because after

RE: Sitemap URL's concatenated, causing status 14 not found

2018-06-06 Thread Markus Jelsma
assumption that the document is valid and > > conformant. > > > > Yossi. > > > >> -Original Message- > >> From: Markus Jelsma > >> Sent: 25 May 2018 23:45 > >> To: User > >> Subject: Sitemap URL's concaten

RE: Sitemap URL's concatenated, causing status 14 not found

2018-05-29 Thread Markus Jelsma
n from the assumption that the document is valid and conformant. > > Yossi. > > > -Original Message- > > From: Markus Jelsma > > Sent: 25 May 2018 23:45 > > To: User > > Subject: Sitemap URL's concatenated, causing status 14 not found > > &

Sitemap URL's concatenated, causing status 14 not found

2018-05-25 Thread Markus Jelsma
Hello, We have a sitemap.xml pointing to further sitemaps. The XML seems fine, but Nutch thinks those two sitemap URLs are actually one, consisting of both concatenated. Here is https://www.saxion.nl/sitemap.xml http://www.sitemaps.org/schemas/sitemap/0.9;>

RE: Having plugin as a separate project

2018-05-07 Thread Markus Jelsma
Hi, Here are examples using Maven: https://github.com/ATLANTBH/nutch-plugins/tree/master/nutch-plugins Regards, Markus -Original message- > From:Yash Thenuan Thenuan > Sent: Monday 7th May 2018 11:51 > To: user@nutch.apache.org > Subject: Re: Having plugin as

RE: RE: random sampling of crawlDb urls

2018-05-01 Thread Markus Jelsma
ath.random. For example, I can extract > records above a specific score with "score>1.0". But the random thing doesn't > work even though I have tried various thresholds. > > On Tuesday, May 1, 2018, 2:00:48 PM PDT, Markus Jelsma > <markus.jel...@openindex.io> wr

RE: random sampling of crawlDb urls

2018-05-01 Thread Markus Jelsma
Hello Michael, I would think this should work as well. But since you mention .99 works fine, did you try .1 as well to get ~10% output? It seems the expressions themselves do work at some level, and since this is a Jexl-specific thing, you might want to try the Jexl list as well. I could not find

RE: Nutch fetching times out at 3 hours, not sure why.

2018-04-17 Thread Markus Jelsma
Hello Chip, I have no clue where the three hour limit could come from. Please take a further look in the last few minutes of the logs. The only thing i can think of is that a webserver would block you after some amount of requests/time window, that would be visible in the logs. It is clear

RE: Issues related to Hung threads when crawling more than 15K articles

2018-04-04 Thread Markus Jelsma
That doesn't appear to be the case, fetcher's time bomb nicely logs when it reached its limit, it also usually runs for longer than two seconds which we see here. What can you find in the logs? There must be some error beyond having hung threads. Usually something with a hanging parser or GC

RE: Is there any way to block the hubpages while crawling

2018-03-20 Thread Markus Jelsma
Hello Shiva, Yes, that is possible, but it (ours) is not a foolproof solution. We got our first hub classifier years ago in the form of a simple ParseFilter backed by an SVM. The model was built solely on the HTML of positive and negative examples, with very few features, so it was extremely

RE: Reg: URL Near Duplicate Issues with same content

2018-03-15 Thread Markus Jelsma
About URL Normalizers, you can use: urlnormalizer-host to normalize between www- and non-www hosts, and urlnormalizer-slash to normalize per host trailing or non-trailing slashes. There are no committed tools that automate this, but if your set of sites is limited, it is easy to manage by hand.

RE: UrlRegexFilter is getting destroyed for unrealistically long links

2018-03-12 Thread Markus Jelsma
ot db.max.outlinks.per.page, db.max.anchor.length. Copy paste error... > > > -Original Message- > > From: Markus Jelsma <markus.jel...@openindex.io> > > Sent: 12 March 2018 14:01 > > To: user@nutch.apache.org > > Subject: RE: UrlRegexFilter is getting des

RE: UrlRegexFilter is getting destroyed for unrealistically long links

2018-03-12 Thread Markus Jelsma
scripts/apache-nutch-1.14/src/java/org/apache/nutch/fetcher/FetcherThread.java:205: maxOutlinksPerPage = conf.getInt("db.max.outlinks.per.page", 100); scripts/apache-nutch-1.14/src/java/org/apache/nutch/parse/ParseOutputFormat.java:118: int maxOutlinksPerPage =

RE: UrlRegexFilter is getting destroyed for unrealistically long links

2018-03-12 Thread Markus Jelsma
Hello - see inline. Regards, Markus -Original message- > From:Semyon Semyonov > Sent: Monday 12th March 2018 11:47 > To: user@nutch.apache.org > Subject: UrlRegexFilter is getting destroyed for unrealistically long links > > Dear all, >

RE: Need Tutorial on Nutch

2018-03-07 Thread Markus Jelsma
lvalen...@gmail.com> > Sent: Wednesday 7th March 2018 21:51 > To: user@nutch.apache.org > Subject: Re: Need Tutorial on Nutch > > How about using nutch with a headless browser like CasperJS? Will this > work? Have any of you tried this? > > On Tue, Mar 6, 2018 at

index-metadata, lowercasing field names?

2018-03-07 Thread Markus Jelsma
Hi, I've got metadata, containing a capital in the field name. But index-metadata lowercases its field names: parseFieldnames.put(metatag.toLowerCase(Locale.ROOT), metatag); This means index-metadata is useless if your metadata fields contain uppercase characters. Was this done for a reason?

RE: Need Tutorial on Nutch

2018-03-06 Thread Markus Jelsma
Hi, Yes you are going to need code, and a lot more than just that, probably including dropping the 'every two hour' requirement. For your case you need either site-specific price extraction, which is easy but a lot of work for 500+ sites. Or you need a more complicated generic algorithm,

RE: Why doesn't hostdb support byDomain mode?

2018-03-05 Thread Markus Jelsma
of OpenIndex, and perhaps should > be removed now that the code is part of Nutch, or is there a reason this > normalizer must not be used with UpdateHostDb? > > Yossi. > > > -Original Message- > > From: Markus Jelsma <markus.jel...@openindex.io> >

RE: Why doesn't hostdb support byDomain mode?

2018-03-05 Thread Markus Jelsma
Hi, The reason is simple: we (company) needed this information based on hostname, so we made a hostdb. I don't see any downside to supporting a domain mode. Adding support for it through hostdb.url.mode seems like a good idea. Regards, Markus -Original message- > From:Yossi Tamari

RE: Nutch pointed to Cassandra, yet, asks for Hadoop

2018-02-23 Thread Markus Jelsma
Hi, If you want to stay clear of all 2.x caveats, use Nutch 1.x. If you want the most stable and feature rich version, use 1.x. If you want to limit the number of wheels (Gora as DB abstraction, running and operate a separate DB server), use 1.x. If you do not intend to crawl tens of millions

RE: Search with Accent and without accent Character

2018-02-13 Thread Markus Jelsma
Checked and confirmed, even Dutch digraph IJ is folded properly, as well as the upper case dotless Turkish i and the Spanish example you provided is folded properly. Correction for German (before Nagel corrects me), ö and ü are not normalized by ICU folder according to German rules. Their

RE: Search with Accent and without accent Character

2018-02-13 Thread Markus Jelsma
Hi, My guess is you haven't reindexed after changing the filter configuration, which is required for index-time filters. Regarding your fieldType, you can drop the lowercase and ASCII folding filters and just keep the ICU folding filter; it will work for pretty much any character set. It will normalize
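A minimal fieldType along those lines could look like the following; it assumes the ICU analysis module (analysis-extras) is on Solr's classpath, and the type name is a placeholder:

```xml
<fieldType name="text_folded" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- ICUFoldingFilter covers lowercasing plus accent/character folding -->
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>
```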

RE: SitemapProcessor destroyed our CrawlDB

2018-01-17 Thread Markus Jelsma
<user-digest-h...@nutch.apache.org> wrote: > > > > > From: Markus Jelsma <markus.jel...@openindex.io> > > To: User <user@nutch.apache.org> > > Cc: > > Bcc: > > Date: Wed, 17 Jan 2018 10:51:49 + > > Subject: SitemapProcessor des

RE: SitemapProcessor destroyed our CrawlDB

2018-01-17 Thread Markus Jelsma
ought NUTCH-2442 forward. > Time to review the patch of NUTCH-2466! > > On 01/17/2018 01:53 PM, Markus Jelsma wrote: > > Ah thanks! > > > > I knew you'd fixed some of these, now i know my patch of NUTCH-2466 > > silently removes your commit! > >

RE: SitemapProcessor destroyed our CrawlDB

2018-01-17 Thread Markus Jelsma
), > only checking for exceptions isn't enough! > > Sebastian > > On 01/17/2018 11:51 AM, Markus Jelsma wrote: > > Hello, > > > > We noticed some abnormalities in our crawl cycle caused by a sudden > > reduction of our CrawlDB's size. The SitemapProcesso

SitemapProcessor destroyed our CrawlDB

2018-01-17 Thread Markus Jelsma
Hello, We noticed some abnormalities in our crawl cycle caused by a sudden reduction of our CrawlDB's size. The SitemapProcessor ran, failed (timed out, see below) and left us with a decimated CrawlDB. This is odd because of:     } catch (Exception e) {   if (fs.exists(tempCrawlDb))   

RE: [ANNOUNCE] Apache Nutch 1.14 Release

2017-12-25 Thread Markus Jelsma
Thanks Sebastian! -Original message- > From:Sebastian Nagel > Sent: Monday 25th December 2017 18:38 > To: user@nutch.apache.org; annou...@apache.org > Subject: [ANNOUNCE] Apache Nutch 1.14 Release > > Dear Nutch users, > > the Apache Nutch [0] Project

RE: [MASSMAIL]RE: Removing header,Footer and left menus while crawling

2017-11-15 Thread Markus Jelsma
nolaExtractor. I don't know > what the differences are, but I bet ArticleExtractor (the default algorithm ) > inserts the Title. > > > > > From: Markus Jelsma <markus.jel...@openindex.io> > To: "user@nutch.apache.org" <us

RE: [MASSMAIL]RE: Removing header,Footer and left menus while crawling

2017-11-15 Thread Markus Jelsma
>   boilerpipe >   >   Which text extraction algorithm to use. Valid values are: boilerpipe or > none. >   > > > >   tika.extractor.boilerpipe.algorithm >   ArticleExtractor >   >   Which Boilerpipe algorithm to use. Valid values

RE: [MASSMAIL]RE: Removing header,Footer and left menus while crawling

2017-11-15 Thread Markus Jelsma
You could do that, but you would need to fiddle around in TikaParser.java. Using TeeContentHandler you can add both the normal ContentHandler, and the Boilerpipe version. -Original message- > From:Michael Coffey > Sent: Wednesday 15th November 2017 20:34

RE: Removing header,Footer and left menus while crawling

2017-11-14 Thread Markus Jelsma
Hello Rushikesh - why is Boilerpipe not working for you? Are you having trouble getting it configured - it is really just setting a boolean value. Or does it work, but not to your satisfaction? The Bayan solution should work, theoretically, but only with a lot of tedious manual per-site

FW: Nutch(plugins) and R

2017-11-07 Thread Markus Jelsma
cc list -Original message- > From:Markus Jelsma > Sent: Wednesday 8th November 2017 0:15 > To: user@nutch.apache.org > Subject: RE: Nutch(plugins) and R > > Hello - there are no responses, and i don't know what R is, but you are > interested in HTML parsing,

RE: Nutch(plugins) and R

2017-11-07 Thread Markus Jelsma
Hello - there are no responses, and I don't know what R is, but you are interested in HTML parsing, specifically topic detection, so here are my thoughts. We have done topic detection in our custom HTML parser, but in Nutch speak we would do it in a ParseFilter implementation. Get the

RE: Incorrect encoding detected

2017-11-02 Thread Markus Jelsma
er sends > Content-Type: text/html; charset=utf-8 > > Sebastian > > On 11/01/2017 07:06 PM, Markus Jelsma wrote: > > Any ideas? > > > > Thanks! > > > > > > > > -Original message- > >> From:Markus Jelsma <marku

RE: sitemap and xml crawl

2017-11-02 Thread Markus Jelsma
Hi - Nutch has a parser for RSS and ATOM on-board: https://nutch.apache.org/apidocs/apidocs-1.13/org/apache/nutch/parse/feed/FeedParser.html You must configure it in your plugin.includes to use it. Regards, Markus -Original message- > From:Ankit Goel >

RE: Incorrect encoding detected

2017-11-01 Thread Markus Jelsma
gt; > > For 1.17, the simplest solution, I think, is to allow users to configure > > extending the detection limit via our @Field config methods, that is, via > > tika-config.xml. > > > > To confirm, Nutch will allow users to specify a tika-config file? Will

FW: Incorrect encoding detected

2017-10-31 Thread Markus Jelsma
> > -----Original Message- > From: Markus Jelsma [mailto:markus.jel...@openindex.io] > Sent: Tuesday, October 31, 2017 5:47 AM > To: u...@tika.apache.org > Subject: RE: Incorrect encoding detected > > Hello Timothy - what would be your preferred solution? Increase detection
