Parsing and URL filter plugins that depend on URL patterns.

2017-10-19 Thread Semyon Semyonov
Dear all, I want to adjust Nutch for crawling only one big text-based website and therefore to develop plugins/set up settings for the best crawling performance. Precisely, there is a website that has 3 categories: A, B, C. The URLs are therefore website/A/itemN, website/B/articleN, web

Ways to limit pages per host: generate.max.count, hostdb, scoring-depth

2017-10-23 Thread Semyon Semyonov
Hi, I'm looking for the best way to restrict the number of pages crawled per host. I have a list of hosts to crawl, let's say M hosts, and I would like to limit crawling on each host to MaxPages. External links are turned off for the crawling process. My own proposal can be found at 3)   1
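A per-host cap of this kind is usually configured through `generate.max.count` together with `generate.count.mode`; a minimal `nutch-site.xml` sketch (values are illustrative):

```xml
<!-- Sketch: cap the number of URLs selected per host in each generate cycle.
     Note this limits URLs per generated segment, not the lifetime total
     crawled from a host. -->
<property>
  <name>generate.count.mode</name>
  <value>host</value>
</property>
<property>
  <name>generate.max.count</name>
  <value>150</value>
</property>
```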

Re: RE: Ways to limit pages per host: generate.max.count, hostdb, scoring-depth

2017-10-23 Thread Semyon Semyonov
Thanks for the suggestion. Could you explain how I can use it in the crawling process? Should I call generate with a specific parameter? It is not really clear from the issue. I use Nutch 1.13.   Sent: Monday, October 23, 2017 at 3:57 PM From: "Markus Jelsma" To: "user@nutch.apache.org" Subje

Re: RE: Ways to limit pages per host: generate.max.count, hostdb, scoring-depth

2017-11-03 Thread Semyon Semyonov
I managed to apply the issue, but I had to make a small modification to the code (it didn't work for the Nutch REST API; a patch is attached to the issue). I used the patch with the following settings:  generate.max.count.expr if(fetched > 120) {return new("java.lang.Double", 0);} else {return conf.get

Nutch(plugins) and R

2017-11-03 Thread Semyon Semyonov
Hello, I'm looking for a way to use R in Nutch, particularly in the HTML parser, but usage in the other parts can be interesting as well. For each parsed document I would like to run a script and provide the results back to the system, e.g. topic detection of the document.   NB I'm not looking for a way

Re: RE: Nutch(plugins) and R

2017-11-08 Thread Semyon Semyonov
Thanks for the suggestion, nice to get some insight about topic detection. Probably, Mallet is the most efficient way for the specific algorithm, but the biggest advantage of R is its huge coverage of mathematical/data science/machine learning algorithms (it is worth noting how easy it is to dev

Not valid URLs in Crawldb through crawlcomplete

2017-11-28 Thread Semyon Semyonov
Hello all, I have launched a crawling process for 100 websites with external links set to true. After several hours, I ran the crawlcomplete command with mode equal to host. The crawlcomplete output file contains (apart from the proper host names) the following lines: 1#Are there any place

Re: Not valid URLs in Crawldb through crawlcomplete

2017-11-29 Thread Semyon Semyonov
lly, a URL "http://#Are there any places to eat onsite during the show?" should not make it into the CrawlDb. Best, Sebastian
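As a sanity check outside Nutch, strings like the one above can be rejected with plain `java.net.URI` before injection. A minimal sketch (class and method names are my own, not Nutch API):

```java
import java.net.URI;
import java.net.URISyntaxException;

public class UrlSanity {
    // Accept only syntactically valid absolute URLs that have a real host part.
    static boolean isValid(String url) {
        try {
            URI u = new URI(url);
            return u.isAbsolute() && u.getHost() != null;
        } catch (URISyntaxException e) {
            return false; // spaces, stray '#', etc. end up here
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("http://www.example.com/page"));                 // true
        System.out.println(isValid("http://#Are there any places to eat onsite?")); // false
    }
}
```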

Re: crawlcomplete

2017-12-14 Thread Semyon Semyonov
A third question could be: 1) We now have HostDb, which stores all statistics per host, and you can read/write to the database. Does it make sense to have both for reporting?   Sent: Monday, December 04, 2017 at 7:47 PM From: "Yossi Tamari" To: user@nutch.apache.org Subject: crawlcomplete Hi, I

Usage previous stage HostDb data for generate(fetched deltas)

2017-12-14 Thread Semyon Semyonov
Dear all, I plan to improve the HostDb functionality to have a DB_FETCHED delta for the generate stage. Let's say for each website we have a generate condition: while number of fetched < 150. The problem is that for some websites that condition will (almost) never be satisfied, because of their structure. Fo

Fw: Usage previous stage HostDb data for generate(fetched deltas)

2017-12-15 Thread Semyon Semyonov
I have created an issue for this functionality: https://issues.apache.org/jira/browse/NUTCH-2481     Sent: Thursday, December 14, 2017 at 2:07 PM From: "Semyon Semyonov" To: "user@nutch.apache.org" Subject: Usage previous stage HostDb data for generate(fetched deltas)

Re: RE: Usage previous stage HostDb data for generate(fetched deltas)

2017-12-16 Thread Semyon Semyonov
Hi Yossi, What you say makes sense if you run Nutch in the "whole Internet crawling" mode. In other words, you don't specify the set of hosts you want to crawl, but crawl up to infinity. Our case is different. We crawl specific hosts for each country (around 20). For each host we set up

Re: Usage previous stage HostDb data for generate(fetched deltas)

2018-01-19 Thread Semyon Semyonov
generate.max.count.expr if(fetched > 70 && FetchedDelta < 5 ) {return new("java.lang.Double", 0);} else {return conf.getDouble("generate.max.count", -1);} The commit should be tested though. So, feel free to test/modify.    Sent: Thursday, December 14, 2017 at 2
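Formatted for readability, the expression quoted above reads as follows (this is the thread's own snippet reflowed, not a tested configuration):

```
generate.max.count.expr =
  if (fetched > 70 && FetchedDelta < 5) {
      return new("java.lang.Double", 0);
  } else {
      return conf.getDouble("generate.max.count", -1);
  }
```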

Internal links appear to be external in Parse. Improvement of the crawling quality

2018-02-20 Thread Semyon Semyonov
Dear All, I'm trying to increase the quality of the crawling. A part of my database has DB_FETCHED = 1. Example: http://www.wincs.be/ in the seed list. The root of the problem is in Parse/ParseOutputFormat.java, lines 364-374: Nutch considers one of the links (http://wincs.be/lakindustrie.html) as extern

Re: RE: Internal links appear to be external in Parse. Improvement of the crawling quality

2018-02-21 Thread Semyon Semyonov
Hi Semyon, Wouldn't setting db.ignore.external.links.mode=byDomain solve your wincs.be issue? As far as I can see the protocol (HTTP/HTTPS) does not play any part in the decision if this is the same domain. Yossi.

Re: Internal links appear to be external in Parse. Improvement of the crawling quality

2018-02-21 Thread Semyon Semyonov
/wiki/InternetDomainNameExplained] On 02/21/2018 10:44 AM, Semyon Semyonov wrote: Thanks Yossi, Markus, I have an issue with the db.ignore.external.links.mode=byDomain solution. I crawl specific hosts only, therefore I have a finite number of hosts to crawl. Let

Re: Internal links appear to be external in Parse. Improvement of the crawling quality

2018-03-06 Thread Semyon Semyonov
docs-1.14/org/apache/nutch/urlfilter/ignoreexempt/ExemptionUrlFilter.html [2] https://tools.ietf.org/html/rfc1738#section-3.1 On 02/21/2018 01:52 PM, Semyon Semyonov wrote: Hi Sebastian, If I - modify the method URLUtil.getDomainNam

Re: Need Tutorial on Nutch

2018-03-06 Thread Semyon Semyonov
Here is an unpleasant truth: there is no up-to-date tutorial for Nutch. To make it even more interesting, sometimes the tutorial can contradict the real behavior of Nutch, because of recently introduced features/bugs. If you find such cases, please try to fix and contribute to the project. Welcome t

UrlRegexFilter is getting destroyed for unrealistically long links

2018-03-12 Thread Semyon Semyonov
Dear all, there is an issue with UrlRegexFilter and parsing. On average, parsing takes about 1 millisecond, but sometimes websites have crazy links that destroy the parsing (it takes 3+ hours and breaks the next steps of the crawl). For example, below you can see a shortened logged versio
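A common workaround (not Nutch's built-in behavior; the class and threshold here are hypothetical) is to reject oversized URLs cheaply before any regex runs, since regex backtracking on enormous URLs is what stalls the job:

```java
import java.util.regex.Pattern;

public class LengthGuardFilter {
    private static final int MAX_URL_LENGTH = 512;             // illustrative threshold
    private static final Pattern RULE = Pattern.compile("^https?://");

    // Return the URL if it passes, or null to drop it (the usual URL-filter convention).
    static String filter(String url) {
        if (url == null || url.length() > MAX_URL_LENGTH) {
            return null; // drop without ever touching the regex engine
        }
        return RULE.matcher(url).find() ? url : null;
    }

    public static void main(String[] args) {
        System.out.println(filter("http://example.com/ok") != null);  // true

        StringBuilder huge = new StringBuilder("http://example.com/");
        for (int i = 0; i < 1000; i++) huge.append("aaaaaaaaaa");
        System.out.println(filter(huge.toString()) == null);          // true: dropped by length
    }
}
```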

Re: UrlRegexFilter is getting destroyed for unrealistically long links

2018-03-12 Thread Semyon Semyonov
/java/org/apache/nutch/parse/ParseOutputFormat.java:118: int maxOutlinksPerPage = job.getInt("db.max.outlinks.per.page", 100);

Re: UrlRegexFilter is getting destroyed for unrealistically long links

2018-03-12 Thread Semyon Semyonov
ink length in parsing, but if that is not what it's supposed to do, I guess it needs to be renamed (to match the code), moved to a different section of the properties file, and perhaps better documented. In that case, you'll need to use Markus' solution, and basi

Re: RE: Dependency between plugins

2018-03-14 Thread Semyon Semyonov
As a side note, I had to implement my own parser with extra functionality; a simple copy/paste of the HtmlParser code did the job. Inheriting instead of copy-pasting may be a bad idea: the HTML parser is a concrete, non-abstract class, therefore the inheritance will not be so sm

Re: RE: Reg: URL Near Duplicate Issues with same content

2018-03-15 Thread Semyon Semyonov
In addition, if you crawl a fixed set of URLs with external links = false, case 1 is solved. For example, if you inject http://www.samacharplus.com/ only 1) will be crawled; 2) will be ignored (because of external links = false). 1)http://www.samacharplus.com/~samachar/index.php/en/worlds/11-in
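The behavior described relies on these properties; a minimal `nutch-site.xml` sketch (values illustrative):

```xml
<!-- Sketch: restrict the crawl to the injected hosts only. -->
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
</property>
<property>
  <name>db.ignore.external.links.mode</name>
  <value>byHost</value> <!-- or byDomain, as discussed in the wincs.be thread -->
</property>
```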

Re: Internal links appear to be external in Parse. Improvement of the crawling quality

2018-03-16 Thread Semyon Semyonov
ers accepts/exempts the URL. Please open an issue to change it. Thanks, Sebastian On 03/06/2018 10:28 AM, Semyon Semyonov wrote: I have proposed a solution for this problem: https://issues.apache.org/jira/browse/NUTCH-2522. The other question is how the voting mechanism

Re: Internal links appear to be external in Parse. Improvement of the crawling quality

2018-03-20 Thread Semyon Semyonov
il.getHost(BidirectionalUrlExemptionFilter.tranform(key.toString()));   Sent: Friday, March 16, 2018 at 7:20 PM From: "Semyon Semyonov" To: user@nutch.apache.org Subject: Re: Internal links appear to be external in Parse. Improvement of the crawling quality Hi again, Another issue has appeared

Re: [MASSMAIL][ANNOUNCE] New Nutch committer and PMC -

2018-06-27 Thread Semyon Semyonov
Hi Roannel, Congratulations and good luck!   Semyon.   Sent: Wednesday, June 27, 2018 at 3:42 AM From: "Roannel Fernández Hernández" To: user@nutch.apache.org Subject: Re: [MASSMAIL][ANNOUNCE] New Nutch committer and PMC - Hi Folks Thank you very much for allowing me to be part of this project.

Re: RE: Apache Nutch commercial support

2018-10-12 Thread Semyon Semyonov
I will add that, as for all open source projects, most of the top committers either work for a company that actively uses it at large scale and/or support it as contractors. But if you plan to do so, I recommend reviewing the quality of this person's code/issues, because obviously qu

Re: Wordpress.com hosted sites fail org.apache.commons.httpclient.NoHttpResponseException

2018-11-14 Thread Semyon Semyonov
Hi Nicholas,   I have the same problem with https://www.graydon.nl/ and it doesn't look like a WordPress website. Semyon   Sent: Wednesday, November 14, 2018 at 7:49 AM From: "Nicholas Roberts" To: user@nutch.apache.org Subject: Wordpress.com hosted sites fail org.apache.commons.httpclient.NoHtt

Re: Wordpress.com hosted sites fail org.apache.commons.httpclient.NoHttpResponseException

2018-11-14 Thread Semyon Semyonov
sites fail org.apache.commons.httpclient.NoHttpResponseException You can try checking robots.txt for these websites On Wed, 14 Nov 2018, 16:00 Yash Thenuan Thenuan: Most probably the problem is that these websites allow only some specific crawlers in their robots.txt file. On Wed, 14 Nov 2018, 15:56 Semyon Semyo

Quality problems of crawling. Parsing(Missing attribute name), fetching(empty body) and javascript.

2018-11-14 Thread Semyon Semyonov
Hi everyone, We are testing the quality of our crawl for one of our domain countries against another public crawling tool (http://tools.seochat.com/tools/online-crawl-google-sitemap-generator/#sthash.ZgdDhqwy.dpbs). All the webpages are tested via both the crawl script and the parsechecker tool for

Re: Quality problems of crawling. Parsing(Missing attribute name), fetching(empty body) and javascript.

2018-11-15 Thread Semyon Semyonov
ine. Very amazing considering the fact that it is THE core part of any parser.   Sent: Wednesday, November 14, 2018 at 3:32 PM From: "Semyon Semyonov" To: user@nutch.apache.org Subject: Quality problems of crawling. Parsing(Missing attribute name), fetching(empty body) and javascript. H

Re: Quality problems of crawling. Parsing(Missing attribute name), fetching(empty body) and javascript.

2018-11-15 Thread Semyon Semyonov
th our dedicated staff of experienced professionals and our logistics centre at Amsterdam Schiphol International Airport, we are well positioned to anticipate and react swiftly to the dynamic requirements of our customers. Amphar B.V.   On 11/15/18 1:30 PM, Semyon Semyonov wrote: > Ok, wit

Re: Quality problems of crawling. Parsing(Missing attribute name), fetching(empty body) and javascript.

2018-11-15 Thread Semyon Semyonov
for the long term, parse-html should be either actively maintained or skipped. Best, Sebastian On 11/15/18 2:39 PM, Semyon Semyonov wrote: Hi Sebastian, Thanks for the detailed response. I will try to migrate to Tika. Are there any reasons to keep the default

Re: Block certain parts of HTML code from being indexed

2018-11-16 Thread Semyon Semyonov
Hi Hany,   There is another (dirty) solution: you can modify the content during parsing if you don't need it at all. It is probably not how you should do it, but you can if you really want to.   For example, modify/delete Node values in  src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMCo
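The idea of dropping unwanted nodes before indexing can be sketched with plain `org.w3c.dom` (a standalone illustration, not the actual Nutch DOMContentUtils code):

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class StripNodes {
    // Parse the markup, remove every element with the given tag,
    // and return how many such elements remain (0 if removal worked).
    static int removeTag(String xml, String tag) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
        NodeList nodes = doc.getElementsByTagName(tag);
        // Iterate backwards: the NodeList is live and shrinks on removal.
        for (int i = nodes.getLength() - 1; i >= 0; i--) {
            Node n = nodes.item(i);
            n.getParentNode().removeChild(n);
        }
        return doc.getElementsByTagName(tag).getLength();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><p>keep</p><footer>drop</footer></body></html>";
        System.out.println(removeTag(page, "footer")); // prints 0
    }
}
```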

Re: update seed list when nutch is running

2018-11-17 Thread Semyon Semyonov
Hi Srini, What do you mean exactly? You can run Nutch with the crawl script, with nutch commands, or as a server application. For 2 and 3 you have full control over the injection phase: just inject more URLs into the CrawlDb. Semyon   Sent: Friday, November 16, 2018 at 8:23 PM From: "Srinivasan Ramaswamy

Re: unexpected Nutch crawl interruption

2018-11-19 Thread Semyon Semyonov
Hi Hany,     If you open the script code you will find this line:   # main loop : rounds of generate - fetch - parse - update for ((a=1; ; a++)) with a number of break conditions. On each iteration it runs n independent map jobs. If it breaks, it stops. You should finish the loop either with manua
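The loop structure the crawl script uses can be reduced to this shape (a simplified sketch with the round count capped for illustration; the real script derives its break conditions from the generate/fetch results):

```shell
#!/usr/bin/env bash
# Simplified shape of the crawl script's main loop.
LIMIT=3   # the real script may run until a break condition fires
for ((a=1; a<=LIMIT; a++)); do
  echo "round $a: generate -> fetch -> parse -> updatedb"
  # in the real script: bin/nutch generate/fetch/parse/updatedb here,
  # with 'break' when generate produces no new segment
done
echo "loop finished"
```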

Re: RE: unexpected Nutch crawl interruption

2018-11-19 Thread Semyon Semyonov

Re: Quality problems of crawling. Parsing(Missing attribute name), fetching(empty body) and javascript.

2018-11-19 Thread Semyon Semyonov
# Logging Threshold log4j.threshold=ALL I receive [Error] :23:70: Missing attribute name. [Error] :24:68: Missing attribute name. [Error] :25:108: Missing attribute name. etc... Are these errors important? Sent: Thursday, November 15, 2018 at 3:33 PM From: "Semyon Semyonov" To: u

Re: Ignore external links but allow redirections to external websites

2018-11-26 Thread Semyon Semyonov
Hi Patricia, I wish I had a generic solution for this problem, but I managed to fix the http://www.abc.com -> http://abc.com problem with an extension of the URL exemption filter for both directions (www.abc.com -> abc.com and abc.com -> www.abc.com).  https://jira.apache.org/jira/browse/NUTC
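The exemption that filter implements can be illustrated by a host comparison that treats a leading `www.` as optional (a standalone sketch, not the NUTCH-2522 code itself):

```java
public class WwwExemption {
    // Strip a single leading "www." so both spellings compare equal.
    static String canonicalHost(String host) {
        return host.startsWith("www.") ? host.substring(4) : host;
    }

    // True if the two hosts differ at most by a leading "www.".
    static boolean sameSite(String a, String b) {
        return canonicalHost(a).equals(canonicalHost(b));
    }

    public static void main(String[] args) {
        System.out.println(sameSite("www.abc.com", "abc.com")); // true
        System.out.println(sameSite("abc.com", "xyz.com"));     // false
    }
}
```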

Re: Ignore external links but allow redirections to external websites

2018-11-26 Thread Semyon Semyonov
There is one more thing: you can do it outside of Nutch (we do). You can create a program that validates the seed-list URLs and saves redirects as an input for Nutch.   Sent: Monday, November 26, 2018 at 2:43 PM From: "Semyon Semyonov" To: user@nutch.apache.org Subject: Re: Ignor