RE: RE: unexpected Nutch crawl interruption

2018-11-19 Thread Yossi Tamari
I think in the case that you interrupt the fetcher, you'll have the problem 
that URLs that were scheduled to be fetched in the interrupted cycle will 
never be fetched (because of NUTCH-1842).

Yossi.
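
In practice, acting on the advice quoted below usually comes down to a couple of 
filesystem operations. A rough sketch (the paths, the segment timestamp and the 
random temp-directory name are only illustrative; use plain rm -r instead of 
hadoop fs if you run in local mode):

  # after an interrupted updatedb or generate: remove the lock file and the
  # leftover temporary directory inside the crawldb
  hadoop fs -rm    crawl/crawldb/.locked
  hadoop fs -rm -r crawl/crawldb/1804289383

  # after an interrupted fetch: the partial segment can simply be discarded;
  # its URLs will be generated again in a later round
  hadoop fs -rm -r crawl/segments/20181119123456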

> -Original Message-
> From: Markus Jelsma 
> Sent: 19 November 2018 14:52
> To: user@nutch.apache.org
> Subject: RE: RE: unexpected Nutch crawl interruption
> 
> Hello Hany,
> 
> That depends. If you interrupt the fetcher, the segment being fetched can be
> thrown away. But if you interrupt updatedb, you can remove the temp directory
> and must get rid of the lock file. The latter is also true if you interrupt 
> the
> generator.
> 
> Regards,
> Markus
> 
> 
> 
> -Original message-
> > From:hany.n...@hsbc.com 
> > Sent: Monday 19th November 2018 13:30
> > To: user@nutch.apache.org
> > Subject: RE: RE: unexpected Nutch crawl interruption
> >
> > This means there is no such thing as a corrupted db, by any means?
> >
> >
> > Kind regards,
> > Hany Shehata
> >
> >
> > -Original Message-
> > From: Semyon Semyonov [mailto:semyon.semyo...@mail.com]
> > Sent: Monday, November 19, 2018 12:59 PM
> > To: user@nutch.apache.org
> > Subject: Re: RE: unexpected Nutch crawl interruption
> >
> > From the most recent updated crawldb.
> >
> >
> > Sent: Monday, November 19, 2018 at 12:35 PM
> > From: hany.n...@hsbc.com
> > To: "user@nutch.apache.org" 
> > Subject: RE: unexpected Nutch crawl interruption
> >
> > Hello Semyon,
> >
> > Does it mean that if I re-run the crawl command, it will continue from where
> > it was stopped in the previous run?
> >
> > Kind regards,
> > Hany Shehata
> >
> >
> > -Original Message-
> > From: Semyon Semyonov [mailto:semyon.semyo...@mail.com]
> > Sent: Monday, November 19, 2018 12:06 PM
> > To: user@nutch.apache.org
> > Subject: Re: unexpected Nutch crawl interruption
> >
> > Hi Hany,
> >
> > If you open the script code you will reach that line:
> >
> > # main loop : rounds of generate - fetch - parse - update
> > for ((a=1; ; a++))
> >
> > with a number of break conditions.
> >
> > For each iteration it calls n independent map jobs.
> > If it breaks, it stops.
> > You should either finish the loop with manual nutch commands, or start a new
> > run of the crawl script using the crawldb from the last iteration.
> > Semyon.
> >
> >
> >
> > Sent: Monday, November 19, 2018 at 11:41 AM
> > From: hany.n...@hsbc.com
> > To: "user@nutch.apache.org" 
> > Subject: unexpected Nutch crawl interruption
> >
> > Hello,
> >
> > What will happen if the bin/crawl command is forcibly stopped for any
> > reason, e.g. a server restart?
> >
> > Kind regards,
> > Hany Shehata

RE: Block certain parts of HTML code from being indexed

2018-11-14 Thread Yossi Tamari
Hi Hany,

The Tika parser supports Boilerpipe for header and footer removal, but I don't 
know how well it works.
You can test it online at https://boilerpipe-web.appspot.com/
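
If you want to try it, the switch is in the Tika parser configuration. From memory 
the relevant nutch-site.xml properties are the two below, but please verify the 
exact names and allowed values against your nutch-default.xml:

  <property>
    <name>tika.extractor</name>
    <value>boilerpipe</value>
  </property>
  <property>
    <name>tika.extractor.boilerpipe.algorithm</name>
    <value>ArticleExtractor</value>
  </property>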


> -Original Message-
> From: hany.n...@hsbc.com 
> Sent: 14 November 2018 16:53
> To: user@nutch.apache.org
> Subject: Block certain parts of HTML code from being indexed
> 
> Hello All,
> 
> I am using Nutch 1.15, and wondering if there is a feature for blocking 
> certain
> parts of HTML code from being indexed (header & footer).
> 
> Kind regards,
> Hany Shehata



RE: index-replace: variable substitution?

2018-10-12 Thread Yossi Tamari
Hi Ryan,

 

From looking at the code of index-replace, it uses Java's Matcher.replaceAll, 
so $1 (for example) should work.
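
Hedging a bit since I have not tested it, a single rule with a capturing group 
along these lines might replace the per-host definitions (the host-label character 
class and the lazy prefix are my own assumptions, so check the value against the 
index-replace syntax documented in nutch-default.xml):

  <property>
    <name>index.replace.regexp</name>
    <value>
      urlmatch=.*\.mydomain\.ca.*
      url:site=/.*?\W([a-z0-9-]+)\.mydomain\.ca.*/$1/
    </value>
  </property>

If that works, any new host or subdomain would be covered without touching the 
configuration again.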

 

Yossi. 

 

> -Original Message-
> From: Ryan Suarez 
> Sent: 13 October 2018 01:38
> To: user@nutch.apache.org
> Subject: index-replace: variable substitution?
> 
> Greetings,
> 
> I'm using binaries of nutch v1.15 with solr v7.3.1, and index-replace to copy a
> substring of the 'url' field to a new 'site' field.  Here is the definition in my
> nutch-site.xml:
> 
> <property>
>   <name>index.replace.regexp</name>
>   <value>
>     urlmatch=.*www.mydomain.ca.*
>     url:site=/.*www.mydomain.ca.*/www/
> 
>     urlmatch=.*foo.mydomain.ca.*
>     url:site=/.*foo.mydomain.ca.*/foo/
> 
>     urlmatch=.*bar.mydomain.ca.*
>     url:site=/.*bar.mydomain.ca.*/bar/
>   </value>
> </property>
> 
> This works as expected.  I am given the following site values for the given url
> values:
> 
> url: https://www.mydomain.ca/test/path -> site: www
> url: http://foo.mydomain.ca/some/other/path -> site: foo
> url: https://bar.mydomain.ca/another/example -> site: foo
> 
> However, it means I have to have a definition for every host or subdomain I am
> crawling (ie. www, foo, bar).  Can I use variable substitution in index-replace or
> is there another way for me to do this automatically?
> 
> regards,
> Ryan



RE: Nutch 1.15: Solr indexing issue

2018-10-11 Thread Yossi Tamari
I'm using 1.15, but not with Solr. However, the configuration of IndexWriters 
changed in 1.15; you may want to read 
https://wiki.apache.org/nutch/IndexWriters#Solr_indexer_properties.

Yossi.
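
For what it's worth, in 1.15 the Solr URL is read from conf/index-writers.xml 
rather than from solr.server.url alone, which would explain the behaviour 
described below. Going from memory (so compare against the index-writers.xml 
shipped with 1.15), the relevant part looks roughly like:

  <writer id="indexer_solr_1" class="org.apache.nutch.indexwriter.solr.SolrIndexWriter">
    <parameters>
      <param name="type" value="http"/>
      <param name="url" value="http://localhost:8983/solr/website"/>
      <!-- remaining parameters and the mapping section as in the default file -->
    </parameters>
  </writer>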

> -Original Message-
> From: hany.n...@hsbc.com 
> Sent: 11 October 2018 10:20
> To: user@nutch.apache.org
> Subject: Nutch 1.15: Solr indexing issue
> 
> Hi All,
> 
> Anyone is using Nutch 1.15?
> 
> I am trying to index my crawled urls into Solr but it is indexing only for
> http://localhost:8983/solr/nutch. Is it hard coded somewhere in the code?
> 
> When I created a "nutch" core, my urls were indexed into it, ignoring my
> solr.server.url property.
> 
> My crawl command is:
> 
> sudo bin/crawl -i -D solr.server.url=http://localhost:8983/solr/website -s 
> urls
> /home/hany.nasr/apache-nutch-1.15/crawl 1
> 
> Kind regards,
> Hany Shehata



RE: IndexWriter interface in 1.15

2018-09-06 Thread Yossi Tamari
Hi Lewis,

First of all I must say that I can't reproduce my claim regarding 
getConf/setConf. I was getting a compilation error for their @Override, but not 
anymore, and it's being called, so I'm not sure what happened.
Open() changing its signature is still a breaking change. I can't roll a new 
release, because I'm not a maintainer. I'm also not sure if it's justified, 
because I don't know how many people implement an IndexWriter.
I would still suggest that it be added as a breaking change in master.

Yossi.

> -Original Message-
> From: lewis john mcgibbney 
> Sent: 06 September 2018 19:00
> To: user@nutch.apache.org
> Subject: Re: IndexWriter interface in 1.15
> 
> Hi Yossi,
> 
> REASON: Upgrade of MapReduce API from legacy to 'new'. This was a breaking
> change for sure and a HUGE patch. We did not however factor in the non-
> breaking aspects of the upgrade... so it has not all been plain sailing.
> PROPOSED SOLUTION: I tend to agree with you that this should be added as a
> breaking change to the current master CHANGES.txt and should be consulted
> when people pull a new release. We cannot add this to the release artifacts
> however. We would need to roll a new release (1.15.1). If you feel that this 
> is
> enough of a reason to roll a new release (which I do not) then please go ahead
> and do so.
> 
> This is a lesson learned and I can honestly say that it was the result of us 
> trying
> to make the upgrade as clean as possible without leaving too much of the
> deprecated MR API still around. Maybe this could have however been phased
> out across several releases...
> 
> Lewis
> 
> On Tue, Sep 4, 2018 at 8:53 AM  wrote:
> 
> >
> > -- Forwarded message --
> > From: Yossi Tamari 
> > To: 
> > Cc:
> > Bcc:
> > Date: Tue, 4 Sep 2018 18:52:54 +0300
> > Subject: IndexWriter interface in 1.15
> >
> > Hi,
> >
> >
> >
> > I missed it at the time, but I just realized (the hard way) that the
> > IndexWriter interface was changed in 1.15 in ways that are not backward
> > compatible.
> >
> > That means that any custom IndexWriter implementation will no longer
> > compile, and probably will not run either.
> >
> > I think this was a mistake (maybe a new interface should have been created,
> > and the old one deprecated and supported for now, or just the old methods
> > deprecated without change, and the new methods provided with a default
> > implementation), but it's too late now.
> >
> > I still think this is something that should be highlighted in the release
> > note for 1.15 (meaning at the top, as "breaking changes").
> >
> > The main changes I encountered:
> >
> > 1.  setConf and getConf were removed from the interface (without
> > deprecation).
> > 2.  open was deprecated (that's fine), and its signature was changed
> > (from JobConf to Configuration), which means it is technically a completely
> > different function, and there is no point in the deprecation.
> >
> >
> >
> > Yossi.
> >
> >
> 
> --
> http://home.apache.org/~lewismc/
> http://people.apache.org/keys/committer/lewismc



IndexWriter interface in 1.15

2018-09-04 Thread Yossi Tamari
Hi,

 

I missed it at the time, but I just realized (the hard way) that the
IndexWriter interface was changed in 1.15 in ways that are not backward
compatible.

That means that any custom IndexWriter implementation will no longer
compile, and probably will not run either.

I think this was a mistake (maybe a new interface should have been created,
and the old one deprecated and supported for now, or just the old methods
deprecated without change, and the new methods provided with a default
implementation), but it's too late now. 

I still think this is something that should be highlighted in the release
note for 1.15 (meaning at the top, as "breaking changes").

The main changes I encountered:

1.  setConf and getConf were removed from the interface (without
deprecation).
2.  open was deprecated (that's fine), and its signature was changed
(from JobConf to Configuration), which means it is technically a completely
different function, and there is no point in the deprecation.

 

Yossi.



RE: Issues while crawling pagination

2018-07-28 Thread Yossi Tamari
Hi Shiva,

Having looked at the specific site, I have to amend my recommended max-depth 
from 1 to 2, since I assume you want to fetch the stories themselves, not just 
the hubpages.

If you want to crawl continuously, as Markus suggested, I still think you 
should keep the depth at 2, but define the first hubpage(s) to have a very high 
priority and very short recrawl delay. This is because stories are always added 
on the first page, and then get pushed back. I suspect that if you don't limit 
depth, and especially if you don't limit yourself to the domain, you will find 
yourself crawling the whole internet eventually. If you do limit to the domain, 
that won't be a problem, but unless you give special treatment to the first 
page(s), you will be continuously recrawling hundreds of thousands of static 
pages.

Yossi.

> -Original Message-
> From: Markus Jelsma 
> Sent: 29 July 2018 00:53
> To: user@nutch.apache.org
> Subject: RE: Issues while crawling pagination
> 
> Hello,
> 
> Yossi's suggestion is excellent if your case is crawl everything once, and 
> never
> again. However, if you need to crawl future articles as well, and have to deal
> with mutations, then let the crawler run continuously without regard for 
> depth.
> 
> The latter is the usual case, because after all, if you got this task a few 
> months
> ago you wouldn't need to go to a depth of 497342 right?
> 
> Regards,
> Markus
> 
> 
> 
> 
> -Original message-
> > From:Yossi Tamari 
> > Sent: Saturday 28th July 2018 23:09
> > To: user@nutch.apache.org; shivakarthik...@gmail.com;
> > nu...@lucene.apache.org
> > Subject: RE: Issues while crawling pagination
> >
> > Hi Shiva,
> >
> > My suggestion would be to programmatically generate a seeds file containing
> these 497342 URLs (since you know them in advance), and then use a very low
> max-depth (probably 1), and a high number of iterations, since only a small
> number will be fetched in each iteration, unless you set a very low 
> crawl-delay.
> > (Mathematically, If you fetch 1 URL per second from this domain, fetching
> 497342 URLs will take 138 hours).
> >
> > Yossi.
> >
> > > -Original Message-
> > > From: ShivaKarthik S 
> > > Sent: 28 July 2018 23:20
> > > To: nu...@lucene.apache.org; user@nutch.apache.org
> > > Subject: Reg: Issues while crawling pagination
> > >
> > >  Hi
> > >
> > > Can you help me in figuring out the issue while crawling a hub page
> > > having pagination. Problem what i am facing is what depth to give
> > > and how to handle pagination.
> > > I have a hubpage which has a pagination of more than 4.95L.
> > > e.g. https://www.jagran.com/latest-news-page497342.html (where 497342 is
> > > the number of pages under the hubpage latest-news)
> > >
> > >
> > > --
> > > Thanks and Regards
> > > Shiva
> >
> >



RE: Issues while crawling pagination

2018-07-28 Thread Yossi Tamari
Hi Shiva,

My suggestion would be to programmatically generate a seeds file containing 
these 497342 URLs (since you know them in advance), and then use a very low 
max-depth (probably 1), and a high number of iterations, since only a small 
number will be fetched in each iteration, unless you set a very low crawl-delay.
(Mathematically, If you fetch 1 URL per second from this domain, fetching 
497342 URLs will take 138 hours).

Yossi.
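
Generating such a seeds file is a one-liner; a sketch, assuming the pages really 
do follow the latest-news-pageN.html pattern for every N:

  # write the 497342 hub page URLs into the seeds file
  for i in $(seq 1 497342); do
    echo "https://www.jagran.com/latest-news-page${i}.html"
  done > urls/seed.txt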

> -Original Message-
> From: ShivaKarthik S 
> Sent: 28 July 2018 23:20
> To: nu...@lucene.apache.org; user@nutch.apache.org
> Subject: Reg: Issues while crawling pagination
> 
>  Hi
> 
> Can you help me in figuring out the issue while crawling a hub page having
> pagination. Problem what i am facing is what depth to give and how to handle
> pagination.
> I have a hubpage which has a pagination of more than 4.95L.
> e.g. https://www.jagran.com/latest-news-page497342.html (where 497342 is the
> number of pages under the hubpage latest-news)
> 
> 
> --
> Thanks and Regards
> Shiva



RE: [MASSMAIL]RE: Events out-of-the-box

2018-07-05 Thread Yossi Tamari
Hi Roannel,

I am not using, and was not even aware of Nutch's ability to emit events. I 
just read https://issues.apache.org/jira/browse/NUTCH-2132?attachmentOrder=desc 
where they basically had the same discussion.
If the general capability already exists, it seems like a good idea to add more 
functionality to it. I also agree with Sebastian's comments there: the default 
behaviour should not change, and there shouldn't be a performance cost, 
especially to those not using this feature (which in this case should be easy, 
it's just on step begin/end).

BTW, from an architectural point of view, I see these, and the ones in the 
original ticket, as Logging Events, not Integration Events, and as such I think 
the approach should have been to use structured log messages instead of 
RabbitMQ. I'm not sure how much support log4j has for structured log messages, 
but I know that newer log libraries have good support for sending the same 
message as text to a normal log and as a structured message to another appender.

Yossi.

> -Original Message-
> From: Roannel Fernández Hernández 
> Sent: 05 July 2018 04:05
> To: user@nutch.apache.org
> Subject: Re: [MASSMAIL]RE: Events out-of-the-box
> 
> Hi Yossi
> 
> Thanks for your answer. I've been testing your idea of the appender, but I think
> it is too hard to get the counters from the context this way. I truly believe Nutch
> should provide some main events out-of-the-box using the included publisher
> component. Do you agree with me?
> 
> Regards
> 
> - Original Message -
> > From: "Yossi Tamari" 
> > To: user@nutch.apache.org
> > Sent: Friday, 29 June 2018 2:09:52
> > Subject: [MASSMAIL]RE: Events out-of-the-box
> >
> > This is not something I actually did, but you should be able to
> > achieve this by adding a log4j appender for RabbitMQ (such as
> > https://github.com/plant42/rabbitmq-log4j-appender), and configuring
> > the relevant loggers and filters to send only the logging events you
> > need to that appender.
> > BTW, if you just want "fetching started/fetching ended" style
> > messages, you can simply add it to the crawl script, no need to touch the 
> > Java
> code.
> >
> > > -Original Message-
> > > From: Roannel Fernández Hernández 
> > > Sent: 29 June 2018 06:24
> > > To: user@nutch.apache.org
> > > Subject: Events out-of-the-box
> > >
> > >
> > >
> > > Hi folks,
> > >
> > >
> > >
> > >
> > > I'm using Nutch 1.14 and I have to send notifications to a RabbitMQ
> > > queue when every step starts and ends. So, my question is: Do I
> > > have to change the code to achieve this or is there an easier way?
> > > How can I do this?
> > >
> > >
> > >
> > >
> > > If the code should be changed, I think it is a good idea to provide
> > > out-of-the-box events for each step. We can even pass the counters
> > > from the context to each event. I don't know, it's just an idea.
> > >
> > >
> > >
> > >
> > > What do you think guys?
> > >
> > >
> > >
> > >
> > > Regards
> > >



RE: Events out-of-the-box

2018-06-28 Thread Yossi Tamari
This is not something I actually did, but you should be able to achieve this by 
adding a log4j appender for RabbitMQ (such as 
https://github.com/plant42/rabbitmq-log4j-appender), and configuring the 
relevant loggers and filters to send only the logging events you need to that 
appender.
BTW, if you just want "fetching started/fetching ended" style messages, you can 
simply add it to the crawl script, no need to touch the Java code.
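
As an illustration of that second option, a couple of lines around a step in 
bin/crawl could publish directly to RabbitMQ, for example with the rabbitmqadmin 
tool (a sketch; it assumes rabbitmqadmin and the management plugin are available 
on the crawl host, and the routing key and payload format are made up):

  # before the fetch step
  rabbitmqadmin publish exchange=amq.default routing_key=nutch.events \
    payload='{"step":"fetch","state":"started"}'

  # ... existing fetch invocation, unchanged ...

  # after the fetch step
  rabbitmqadmin publish exchange=amq.default routing_key=nutch.events \
    payload='{"step":"fetch","state":"finished"}'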

> -Original Message-
> From: Roannel Fernández Hernández 
> Sent: 29 June 2018 06:24
> To: user@nutch.apache.org
> Subject: Events out-of-the-box
> 
> 
> 
> Hi folks,
> 
> 
> 
> 
> I'm using Nutch 1.14 and I have to send notifications to a RabbitMQ queue when
> every step starts and ends. So, my question is: Do I have to change the 
> code to
> achieve this or is there an easier way? How can I do this?
> 
> 
> 
> 
> If the code should be changed, I think it is a good idea to provide out-of-the-box events
> for each step. We can even pass the counters from the context to each event. I
> don't know, it's just an idea.
> 
> 
> 
> 
> What do you think guys?
> 
> 
> 
> 
> Regards
> 



RE: Sitemap URL's concatenated, causing status 14 not found

2018-05-25 Thread Yossi Tamari
Hi Markus,

I don’t believe this is a valid sitemapindex. Each <sitemap> element should 
include exactly one <loc>.
See also https://www.sitemaps.org/protocol.html#index and 
https://www.sitemaps.org/schemas/sitemap/0.9/siteindex.xsd.
I agree that this is not the ideal error behaviour, but I guess the code 
was written on the assumption that the document is valid and conformant.

Yossi.

> -Original Message-
> From: Markus Jelsma 
> Sent: 25 May 2018 23:45
> To: User 
> Subject: Sitemap URL's concatenated, causing status 14 not found
> 
> Hello,
> 
> We have a sitemap.xml pointing to further sitemaps. The XML seems fine, but
> Nutch thinks those two sitemap URL's are actually one consisting of both
> concatenated.
> 
> Here is https://www.saxion.nl/sitemap.xml
> 
> 
> <ns2:sitemapindex xmlns:ns2="http://www.sitemaps.org/schemas/sitemap/0.9">
>   <ns2:sitemap>
>     <ns2:loc>https://www.saxion.nl/opleidingen-sitemap.xml</ns2:loc>
>     <ns2:loc>https://www.saxion.nl/content-sitemap.xml</ns2:loc>
>   </ns2:sitemap>
> </ns2:sitemapindex>
> 
> This seems fine, but Nutch attempts, and obviously fails to load:
> 
> 2018-05-25 16:27:50,515 ERROR [Thread-30]
> org.apache.nutch.util.SitemapProcessor: Error while fetching the sitemap.
> Status code: 14 for https://www.saxion.nl/opleidingen-
> sitemap.xmlhttps://www.saxion.nl/content-sitemap.xml
> 
> What is going on here? Why does Nutch, or CC's sitemap util behave like this?
> 
> Thanks,
> Markus



RE: Problems starting crawl from sitemaps

2018-05-24 Thread Yossi Tamari
Hi Chris,

In order to inject sitemaps, you should use the "nutch sitemap" command. After 
you inject those sitemaps to the crawl DB, you can proceed as normal with the 
crawl command, without the -s parameter.
The error you are seeing may be because you have http.content.limit defined. 
The default value would cause any document to be truncated after 65536 bytes. 
For sitemaps, you should set it to a much larger number, or -1.

 Yossi.
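
For reference, the invocation looks roughly like the following (the directory 
names are just examples, and the exact options should be checked against the 
usage output of bin/nutch sitemap):

  # inject the sitemap URLs listed under urls/sitemaps into the crawldb
  bin/nutch sitemap crawl/crawldb -sitemapUrls urls/sitemaps -threads 8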

> -Original Message-
> From: Chris Gray 
> Sent: 23 May 2018 20:39
> To: user@nutch.apache.org
> Subject: Problems starting crawl from sitemaps
> 
> I've been using nutch for a few years to do conventional link-to-link crawls 
> of
> our local websites, but I would like to switch to doing crawls based on
> sitemaps.  So far I've had no luck doing this.
> 
> I'm not sure I've configured this correctly and the documentation I've found 
> has
> left me guessing at many things.  Why aren't the pages listed in a sitemap
> being fetched and indexed?
> 
> I've installed Nutch 1.14 and Solr 6.6.0.  My urls/seeds.txt file contains 
> only the
> URLs for the 4 sitemaps I'm interested in.  After running:
> 
> bin/crawl -i -D "solr.server.url=http://localhost:8983/solr/nutch"; -s urls 
> crawl 5
> 
> the crawl ends after 3 of 5 iterations and only 3 documents are in the
> index:  3 of the seeds.
> 
> I do get error messages that 3 sitemap files that contain  elements 
> are
> malformed, for example:
> 
> 2018-05-23 08:57:24,564 ERROR tika.TikaParser - Error parsing
> https://uwaterloo.ca/library/sitemap.xml
> Caused by: org.xml.sax.SAXParseException; lineNumber: 420; columnNumber:
> 122; XML document structures must start and end within the same entity.
> 
> But I can't find anything wrong with the sitemaps and other validators say
> they're OK and the location pointed to (line 420, column 122) is in the 
> middle of
> the name of a directory in a URL.
> 
> Is there good documentation or a tutorial on using Nutch with sitemaps?
> 




RE: random sampling of crawlDb urls

2018-05-01 Thread Yossi Tamari
Hi Michael,

If you are using 1.14, there is a parameter -sample that allows you to request 
a random sample. See https://issues.apache.org/jira/browse/NUTCH-2463.

Yossi.
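
A sketch of the invocation (the fraction and paths are examples, and the option 
name should be confirmed against the usage output of readdb in your build):

  bin/nutch readdb crawl/crawldb -dump crawl/sample_dump -format crawldb -sample 0.01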

> -Original Message-
> From: Michael Coffey 
> Sent: 01 May 2018 23:47
> To: User 
> Subject: random sampling of crawlDb urls
> 
> I want to extract a random sample of URLS from my big crawldb. I think I 
> should
> be able to do this using readdb -dump with a Jexl expression, but I haven't 
> been
> able to get it to work.
> 
> I have tried several variations of the following command.
> $NUTCH_HOME/runtime/deploy/bin/nutch readdb /crawls/pop2/data/crawldb -
> dump /crawls/pop2/data/crawldb/pruned/current -format crawldb -expr
> "((Math.random())>=0.1)"
> 
> 
> Typically, it produces zero records. I know the expression is getting through 
> to
> the CrawlDbReader (without quotes) because I get this message:
> 18/05/01 13:22:48 INFO crawl.CrawlDbReader: CrawlDb db: expr:
> ((Math.random())>=0.1)
> 
> Even when I use the expression "((Math.random())>=0.0)" I get zero output
> records.
> 
> If I use the expression "((Math.random())>=.99)" it lets all records pass 
> through
> to the output. I guess it has something to do with the lack of leading zero 
> on the
> numeric constant.
> 
> Does anyone know a good way to extract a random sample of records from a
> crawlDb?



RE: No internet connection in Nutch crawler: Proxy configuration -PAC file

2018-04-23 Thread Yossi Tamari
To add to what Lewis said, PAC files are mostly used by browsers, not so much 
by servers (like Nutch). It is possible your IT department has another proxy 
configuration that you can use in a server.
Keep in mind that a PAC file is just a JavaScript function that translates a 
URL to proxy information, so if the logic is simple and the file is static, it 
may be enough for you to look at the contents of the file, and extract some 
static proxy definition that will work for all URLs.
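
If the PAC logic boils down to a single proxy for the sites you crawl, the static 
configuration in nutch-site.xml would look something like this (host and port are 
placeholders):

  <property>
    <name>http.proxy.host</name>
    <value>proxy.example.com</value>
  </property>
  <property>
    <name>http.proxy.port</name>
    <value>8080</value>
  </property>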

> -Original Message-
> From: lewis john mcgibbney 
> Sent: 23 April 2018 18:04
> To: user@nutch.apache.org
> Subject: Re: No internet connection in Nutch crawler: Proxy configuration -PAC
> file
> 
> Hi Patricia,
> I've never used a proxy auto-config (PAC) method for proxying anything before.
> The PAC is defined as "...Proxy auto-configuration (PAC): Specify the URL for 
> a
> PAC file with a JavaScript function that determines the appropriate proxy for
> each URL. This method is more suitable for laptop users who need several
> different proxy configurations, or complex corporate setups with many 
> different
> proxies."
> Right now, the public guidance for using Nutch with a proxy goes as far as the
> following tutorial https://wiki.apache.org/nutch/SetupProxyForNutch
> Right now, Nutch does not support the reading of PAC files... I think you 
> would
> need to add this functionality.
> Lewis
> 
> On Sun, Apr 22, 2018 at 10:31 AM,  wrote:
> 
> >
> > From: Patricia Helmich 
> > To: "user@nutch.apache.org" 
> > Cc:
> > Bcc:
> > Date: Fri, 20 Apr 2018 10:31:42 +
> > Subject: No internet connection in Nutch crawler: Proxy configuration
> > -PAC file Hi,
> >
> > I am using Nutch and it used to work fine. Now, some internet
> > configurations changed and I have to use a proxy. In my browser, I
> > specify the proxy by providing a PAC file to the option "Automatic
> > proxy configuration URL". I was searching for a similar option in
> > Nutch in the conf/nutch-default.xml file. I do find some proxy options
> > (http.proxy.host, http.proxy.port, http.proxy.username,
> > http.proxy.password,
> > http.proxy.realm) but none seems to be the one I am searching for.
> >
> > So, my question is: where can I specify the PAC file in the Nutch
> > configurations for the proxy?
> >
> > Thanks for your help,
> >
> > Patricia
> >
> >
> 
> 
> --
> http://home.apache.org/~lewismc/
> http://people.apache.org/keys/committer/lewismc



RE: Issues related to Hung threads when crawling more than 15K articles

2018-04-04 Thread Yossi Tamari
I believe this is normal behaviour. The fetch timeout which you have defined 
(fetcher.timelimit.mins) has passed, and the fetcher is exiting. In this case 
one of the fetcher threads is still waiting for a response from a specific URL. 
This is not a problem, and any URLs which were not fetched because of the 
timeout will be "generated" again in a future segment.
You do want to try to match the fetcher timeout and the generated segment size, 
but you can never be 100% successful, and that's not a problem.

Yossi.
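
As a rough illustration of matching the two (the numbers are arbitrary): cap the 
fetch window in nutch-site.xml, e.g.

  <property>
    <name>fetcher.timelimit.mins</name>
    <value>180</value>
  </property>

and keep the generated segment small enough to finish inside that window, for 
example via the generate step's -topN option or the generate.max.count property.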

> -Original Message-
> From: ShivaKarthik S 
> Sent: 04 April 2018 12:32
> To: user@nutch.apache.org
> Cc: Sebastian Nagel 
> Subject: Reg: Issues related to Hung threads when crawling more than 15K
> articles
> 
> Hi,
> 
> I am crawling 25K+ articles at a time (in single depth), but after 
> crawling (using
> nutch-1.11) certain amount of articles am getting error related to Hung 
> threads
> and the process gets killed. Can some one suggest me a solution to resolve 
> this?
> 
> *Error am getting is as follows*
> 
> Fetcher: throughput threshold: -1
> Fetcher: throughput threshold retries: 5 -activeThreads=10, spinWaiting=9,
> fetchQueues.totalSize=2,
> fetchQueues.getQueueCount=1
> Aborting with 10 hung threads.
> Thread #0 hung while processing
> https://24.kg/sport/29754_kyirgyizstantsyi_vyiigrali_dva_boya_na_litsenzionno
> m_turnire_po_boksu_v_kitae/
> Thread #1 hung while processing null
> Thread #2 hung while processing null
> Thread #3 hung while processing null
> Thread #4 hung while processing null
> Thread #5 hung while processing null
> Thread #6 hung while processing null
> Thread #7 hung while processing null
> Thread #8 hung while processing null
> Thread #9 hung while processing null
> Fetcher: finished at 2018-04-04 14:23:45, elapsed: 00:00:02
> 
> --
> Thanks and Regards
> Shiva



RE: RE: Dependency between plugins

2018-03-15 Thread Yossi Tamari
If you look at the code of the HTML parser, you'll see that the parameter is 
passed the variable "root", the same variable that is passed to the methods 
that extract the outlinks, the title, and the text. So it simply can’t be null. 
It may be an issue with what toString is printing for this element (for example 
it may be printing the name of the root element, and it happens to not have a 
name).
Again, I strongly recommend debugging, so you can see the real value there.

> -Original Message-
> From: Yash Thenuan Thenuan 
> Sent: 15 March 2018 10:26
> To: user@nutch.apache.org
> Subject: RE: RE: Dependency between plugins
> 
> Yes  I am using Html parser and yes the document is getting parsed but
> document fragment is printing null.
> 
> On 15 Mar 2018 13:52, "Yossi Tamari"  wrote:
> 
> > Is your parser the HTML parser? I can say from experience that the
> > document is passed.
> > I really recommend debugging in local mode rather than using sysout.
> >
> > > -Original Message-
> > > From: Yash Thenuan Thenuan 
> > > Sent: 15 March 2018 10:13
> > > To: user@nutch.apache.org
> > > Subject: RE: RE: Dependency between plugins
> > >
> > > I tried printing the contents of document fragment in
> > > parsefilter-regex
> > by writing
> > > System.out.println(doc) but its printing null!! And document is
> > > getting
> > parsed!!
> > >
> > > On 15 Mar 2018 13:15, "Yossi Tamari"  wrote:
> > >
> > > > Parse filters receive a DocumentFragment as their fourth parameter.
> > > >
> > > > > -Original Message-
> > > > > From: Yash Thenuan Thenuan 
> > > > > Sent: 15 March 2018 08:50
> > > > > To: user@nutch.apache.org
> > > > > Subject: Re: RE: Dependency between plugins
> > > > >
> > > > > Hi Jorge and Yossi,
> > > > > The reason why I am trying to do it is exactly what yossi said
> > > > > "removing
> > > > nutch
> > > > > overhead", I didn't thought that it would be that complicated,
> > > > > All I am
> > > > trying is to
> > > > > call the existing parsers from my own parser, but I am not able
> > > > > to do it
> > > > correctly,
> > > > > may be chain approach is a better idea to do that but *do parse
> > > > > filter
> > > > receives
> > > > > any DOM object?* as a parameter so by accessing that I can
> > > > > extract the
> > > > data I
> > > > > want??
> > > > >
> > > > >
> > > > > On Wed, Mar 14, 2018 at 7:36 PM, Yossi Tamari
> > > > > 
> > > > > wrote:
> > > > >
> > > > > > There is no built-in mechanism for this. However, are you sure
> > > > > > you really want a parser for each website, rather than a
> > > > > > parse-filter for each website (which will take the results of
> > > > > > the HTML parser and apply some domain specific customizations)?
> > > > > > In both cases you can use a dispatcher approach, which your
> > > > > > custom parser is, or a chain approach (every parser that is
> > > > > > not intended for this domain returns null, or each
> > > > > > parse-filter that is not intended for this domain returns the 
> > > > > > ParseResult
> that it received).
> > > > > > The advantage of the chain approach is that each new website
> > > > > > parser is a first-class, reusable Nutch object. The advantage
> > > > > > of the dispatcher approach is that you don't need to deal with
> > > > > > a lot of the Nutch overhead, but it is more monolithic (You
> > > > > > can end up with one huge plugin that needs to be constantly
> > > > > > modified whenever one of the
> > > > websites is
> > > > > modified).
> > > > > >
> > > > > > > -Original Message-
> > > > > > > From: Yash Thenuan Thenuan 
> > > > > > > Sent: 14 March 2018 15:28
> > > > > > > To: user@nutch.apache.org
> > > > > > > Subject: Re: RE: Dependency between plugins
> > > > > > >
> > > > > > > Is there a way in nutch by which we can use different parser
> > > >

RE: RE: Dependency between plugins

2018-03-15 Thread Yossi Tamari
Is your parser the HTML parser? I can say from experience that the document is 
passed.
I really recommend debugging in local mode rather than using sysout.

> -Original Message-
> From: Yash Thenuan Thenuan 
> Sent: 15 March 2018 10:13
> To: user@nutch.apache.org
> Subject: RE: RE: Dependency between plugins
> 
> I tried printing the contents of document fragment in parsefilter-regex by 
> writing
> System.out.println(doc) but it's printing null!! And the document is getting 
> parsed!!
> 
> On 15 Mar 2018 13:15, "Yossi Tamari"  wrote:
> 
> > Parse filters receive a DocumentFragment as their fourth parameter.
> >
> > > -Original Message-
> > > From: Yash Thenuan Thenuan 
> > > Sent: 15 March 2018 08:50
> > > To: user@nutch.apache.org
> > > Subject: Re: RE: Dependency between plugins
> > >
> > > Hi Jorge and Yossi,
> > > The reason why I am trying to do it is exactly what yossi said
> > > "removing
> > nutch
> > > overhead", I didn't thought that it would be that complicated, All I
> > > am
> > trying is to
> > > call the existing parsers from my own parser, but I am not able to
> > > do it
> > correctly,
> > > may be chain approach is a better idea to do that but *do parse
> > > filter
> > receives
> > > any DOM object?* as a parameter so by accessing that I can extract
> > > the
> > data I
> > > want??
> > >
> > >
> > > On Wed, Mar 14, 2018 at 7:36 PM, Yossi Tamari
> > > 
> > > wrote:
> > >
> > > > There is no built-in mechanism for this. However, are you sure you
> > > > really want a parser for each website, rather than a parse-filter
> > > > for each website (which will take the results of the HTML parser
> > > > and apply some domain specific customizations)?
> > > > In both cases you can use a dispatcher approach, which your custom
> > > > parser is, or a chain approach (every parser that is not intended
> > > > for this domain returns null, or each parse-filter that is not
> > > > intended for this domain returns the ParseResult that it received).
> > > > The advantage of the chain approach is that each new website
> > > > parser is a first-class, reusable Nutch object. The advantage of
> > > > the dispatcher approach is that you don't need to deal with a lot
> > > > of the Nutch overhead, but it is more monolithic (You can end up
> > > > with one huge plugin that needs to be constantly modified whenever
> > > > one of the
> > websites is
> > > modified).
> > > >
> > > > > -Original Message-
> > > > > From: Yash Thenuan Thenuan 
> > > > > Sent: 14 March 2018 15:28
> > > > > To: user@nutch.apache.org
> > > > > Subject: Re: RE: Dependency between plugins
> > > > >
> > > > > Is there a way in nutch by which we can use different parser for
> > > > different
> > > > > websites?
> > > > > I am trying to do this by writing a custom parser which will
> > > > > call
> > > > different parsers
> > > > > for different websites?
> > > > >
> > > > > On 14 Mar 2018 14:19, "Semyon Semyonov"
> > > 
> > > > > wrote:
> > > > >
> > > > > > As a side note,
> > > > > >
> > > > > > I had to implement my own parser with extra functionality,
> > > > > > simple copy/past of the code of HTMLparser did the job.
> > > > > >
> > > > > > If you want to inherit instead of copy paste it can be a bad
> > > > > > idea at
> > > > all.
> > > > > > HTML parser is a concrete non abstract class, therefore the
> > > > > > inheritance will not be so smooth as in case of contract
> > > > > > implementations(the plugins are contracts, ie interfaces) and
> > > > > > can
> > > > easily break
> > > > > some OOP rules.
> > > > > >
> > > > > >
> > > > > > Sent: Wednesday, March 14, 2018 at 9:18 AM
> > > > > > From: "Yossi Tamari" 
> > > > > > To: user@nutch.apache.org
> > > > > > Subject: RE: Dependency between plugins One suggestion I can
> > > > > > make is to ensure that the html-p

RE: RE: Dependency between plugins

2018-03-15 Thread Yossi Tamari
Parse filters receive a DocumentFragment as their fourth parameter.

> -Original Message-
> From: Yash Thenuan Thenuan 
> Sent: 15 March 2018 08:50
> To: user@nutch.apache.org
> Subject: Re: RE: Dependency between plugins
> 
> Hi Jorge and Yossi,
> The reason why I am trying to do it is exactly what yossi said "removing nutch
> overhead", I didn't thought that it would be that complicated, All I am 
> trying is to
> call the existing parsers from my own parser, but I am not able to do it 
> correctly,
> may be chain approach is a better idea to do that but *do parse filter 
> receives
> any DOM object?* as a parameter so by accessing that I can extract the data I
> want??
> 
> 
> On Wed, Mar 14, 2018 at 7:36 PM, Yossi Tamari 
> wrote:
> 
> > There is no built-in mechanism for this. However, are you sure you
> > really want a parser for each website, rather than a parse-filter for
> > each website (which will take the results of the HTML parser and apply
> > some domain specific customizations)?
> > In both cases you can use a dispatcher approach, which your custom
> > parser is, or a chain approach (every parser that is not intended for
> > this domain returns null, or each parse-filter that is not intended
> > for this domain returns the ParseResult that it received).
> > The advantage of the chain approach is that each new website parser is
> > a first-class, reusable Nutch object. The advantage of the dispatcher
> > approach is that you don't need to deal with a lot of the Nutch
> > overhead, but it is more monolithic (You can end up with one huge
> > plugin that needs to be constantly modified whenever one of the websites is
> modified).
> >
> > > -Original Message-
> > > From: Yash Thenuan Thenuan 
> > > Sent: 14 March 2018 15:28
> > > To: user@nutch.apache.org
> > > Subject: Re: RE: Dependency between plugins
> > >
> > > Is there a way in nutch by which we can use different parser for
> > different
> > > websites?
> > > I am trying to do this by writing a custom parser which will call
> > different parsers
> > > for different websites?
> > >
> > > On 14 Mar 2018 14:19, "Semyon Semyonov"
> 
> > > wrote:
> > >
> > > > As a side note,
> > > >
> > > > I had to implement my own parser with extra functionality, simple
> > > > copy/past of the code of HTMLparser did the job.
> > > >
> > > > If you want to inherit instead of copy paste it can be a bad idea
> > > > at
> > all.
> > > > HTML parser is a concrete non abstract class, therefore the
> > > > inheritance will not be so smooth as in case of contract
> > > > implementations(the plugins are contracts, ie interfaces) and can
> > easily break
> > > some OOP rules.
> > > >
> > > >
> > > > Sent: Wednesday, March 14, 2018 at 9:18 AM
> > > > From: "Yossi Tamari" 
> > > > To: user@nutch.apache.org
> > > > Subject: RE: Dependency between plugins One suggestion I can make
> > > > is to ensure that the html-parse plugin is built before your
> > > > plugin (since you are including the jars that are generated in its 
> > > > build).
> > > >
> > > > > -Original Message-
> > > > > From: Yash Thenuan Thenuan 
> > > > > Sent: 14 March 2018 09:55
> > > > > To: user@nutch.apache.org
> > > > > Subject: Re: Dependency between plugins
> > > > >
> > > > > Hi,
> > > > > It didn't worked in ant runtime.
> > > > > I included "import org.apache.nutch.parse.html;" in my custom
> > > > > parser
> > > > code.
> > > > > but it is throwing errror while i am doing ant runtime.
> > > > >
> > > > > [javac]
> > > > > /Users/yasht/Downloads/apache-nutch-1.14/src/plugin/parse-
> > > > >
> custom/src/java/org/apache/nutch/parse/custom/CustomParser.java:41:
> > > > > error: cannot find symbol
> > > > >
> > > > > [javac] import org.apache.nutch.parse.html;
> > > > >
> > > > > [javac] ^
> > > > >
> > > > > [javac] symbol: class html
> > > > >
> > > > > [javac] location: package org.apache.nutch.parse
> > > > >
> > > > >
> > > > > below are

RE: RE: Dependency between plugins

2018-03-14 Thread Yossi Tamari
There is no built-in mechanism for this. However, are you sure you really want 
a parser for each website, rather than a parse-filter for each website (which 
will take the results of the HTML parser and apply some domain specific 
customizations)?
In both cases you can use a dispatcher approach, which your custom parser is, 
or a chain approach (every parser that is not intended for this domain returns 
null, or each parse-filter that is not intended for this domain returns the 
ParseResult that it received).
The advantage of the chain approach is that each new website parser is a 
first-class, reusable Nutch object. The advantage of the dispatcher approach is 
that you don't need to deal with a lot of the Nutch overhead, but it is more 
monolithic (You can end up with one huge plugin that needs to be constantly 
modified whenever one of the websites is modified). 
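
To make the chain approach concrete, a parse-filter skeleton might look roughly 
like the sketch below (the class and package names are made up, and the plugin.xml 
wiring is omitted):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.nutch.parse.HTMLMetaTags;
  import org.apache.nutch.parse.HtmlParseFilter;
  import org.apache.nutch.parse.ParseResult;
  import org.apache.nutch.protocol.Content;
  import org.w3c.dom.DocumentFragment;

  // Acts only on pages from one site, passes everything else through unchanged.
  public class ExampleSiteParseFilter implements HtmlParseFilter {

    private Configuration conf;

    @Override
    public ParseResult filter(Content content, ParseResult parseResult,
        HTMLMetaTags metaTags, DocumentFragment doc) {
      if (!content.getUrl().contains("example.com")) {
        return parseResult;   // not our site: return the result untouched
      }
      // site-specific extraction goes here, e.g. walk `doc` and add fields to
      // parseResult.get(content.getUrl()).getData().getParseMeta()
      return parseResult;
    }

    @Override
    public void setConf(Configuration conf) { this.conf = conf; }

    @Override
    public Configuration getConf() { return conf; }
  }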

> -Original Message-
> From: Yash Thenuan Thenuan 
> Sent: 14 March 2018 15:28
> To: user@nutch.apache.org
> Subject: Re: RE: Dependency between plugins
> 
> Is there a way in nutch by which we can use different parser for different
> websites?
> I am trying to do this by writing a custom parser which will call different 
> parsers
> for different websites?
> 
> On 14 Mar 2018 14:19, "Semyon Semyonov" 
> wrote:
> 
> > As a side note,
> >
> > I had to implement my own parser with extra functionality, simple
> > copy/past of the code of HTMLparser did the job.
> >
> > If you want to inherit instead of copy paste it can be a bad idea at all.
> > HTML parser is a concrete non abstract class, therefore the
> > inheritance will not be so smooth as in case of contract
> > implementations(the plugins are contracts, ie interfaces) and can easily 
> > break
> some OOP rules.
> >
> >
> > Sent: Wednesday, March 14, 2018 at 9:18 AM
> > From: "Yossi Tamari" 
> > To: user@nutch.apache.org
> > Subject: RE: Dependency between plugins One suggestion I can make is
> > to ensure that the html-parse plugin is built before your plugin
> > (since you are including the jars that are generated in its build).
> >
> > > -Original Message-
> > > From: Yash Thenuan Thenuan 
> > > Sent: 14 March 2018 09:55
> > > To: user@nutch.apache.org
> > > Subject: Re: Dependency between plugins
> > >
> > > Hi,
> > > It didn't worked in ant runtime.
> > > I included "import org.apache.nutch.parse.html;" in my custom parser
> > code.
> > > but it is throwing errror while i am doing ant runtime.
> > >
> > > [javac]
> > > /Users/yasht/Downloads/apache-nutch-1.14/src/plugin/parse-
> > > custom/src/java/org/apache/nutch/parse/custom/CustomParser.java:41:
> > > error: cannot find symbol
> > >
> > > [javac] import org.apache.nutch.parse.html;
> > >
> > > [javac] ^
> > >
> > > [javac] symbol: class html
> > >
> > > [javac] location: package org.apache.nutch.parse
> > >
> > >
> > > below are the xml files of my parser
> > >
> > >
> > > My ivy.xml
> > >
> > >
> > > 
> > >
> > > 
> > >
> > > 
> > >
> > > http://nutch.apache.org"/>
> > >
> > > 
> > >
> > > Apache Nutch
> > >
> > > 
> > >
> > > 
> > >
> > >
> > > 
> > >
> > > 
> > >
> > > 
> > >
> > >
> > > 
> > >
> > > 
> > >
> > > 
> > >
> > > 
> > >
> > > 
> > >
> > > build.xml
> > >
> > > 
> > >
> > > 
> > >
> > >  
> > > 
> > > 
> > >
> > >
> > > 
> > > 
> > >   
> > >
> > >  
> > >   > > target="deploy" inheritall="false" dir="../nutch-extensionpoints"/>
> > > 
> > >
> > > 
> > >
> > > plugin.xml
> > >
> > >  > > id="parse-custom"
> > > name="Custom Parse Plug-in"
> > > version="1.0.0"
> > > provider-name="nutch.org">
> > >
> > > 
> > > 
> > > 
> > > 
> > > 
> > >
> > > 
> > > 
> > >> > id="org.apache.nutch.parse.custom"
> > > name="CustomParse"
> > > point="org.apache.nu

RE: Dependency between plugins

2018-03-14 Thread Yossi Tamari
One suggestion I can make is to ensure that the html-parse plugin is built 
before your plugin (since you are including the jars that are generated in its 
build). 
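
For what it's worth, the usual pattern for that in a plugin's build.xml is a 
deps-jar target pointing at the plugin you depend on, something along these lines 
(untested, modelled on the existing plugins' build files):

  <target name="deps-jar">
    <ant target="jar" inheritall="false" dir="../parse-html"/>
  </target>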

> -Original Message-
> From: Yash Thenuan Thenuan 
> Sent: 14 March 2018 09:55
> To: user@nutch.apache.org
> Subject: Re: Dependency between plugins
> 
> Hi,
> It didn't worked in ant runtime.
> I included  "import org.apache.nutch.parse.html;" in my custom parser code.
> but it is throwing errror while i am doing ant runtime.
> 
> [javac]
> /Users/yasht/Downloads/apache-nutch-1.14/src/plugin/parse-
> custom/src/java/org/apache/nutch/parse/custom/CustomParser.java:41:
> error: cannot find symbol
> 
> [javac] import org.apache.nutch.parse.html;
> 
> [javac]  ^
> 
> [javac]   symbol:   class html
> 
> [javac]   location: package org.apache.nutch.parse
> 
> 
> below are the xml files of my parser
> 
> 
> My ivy.xml
> 
> 
> 
> 
>   
> 
> 
> 
> http://nutch.apache.org"/>
> 
> 
> 
> Apache Nutch
> 
> 
> 
>   
> 
> 
>   
> 
> 
> 
>   
> 
> 
>   
> 
> 
> 
> 
> 
>   
> 
> 
> 
> build.xml
> 
> 
> 
>   
> 
>   
>   
> 
>   
> 
> 
>   
> 
> 
> 
> 
> 
>   
>   
> 
> 
>   
> 
> 
> 
> plugin.xml
> 
> id="parse-custom"
>name="Custom Parse Plug-in"
>version="1.0.0"
>provider-name="nutch.org">
> 
>
>   
>  
>   
>
> 
>
>   
>   
> 
>   name="CustomParse"
>   point="org.apache.nutch.parse.Parser">
> 
>  class="org.apache.nutch.parse.custom.CustomParser">
>  value="text/html|application/xhtml+xml"/>
> 
>   
> 
>
> 
> 
> 
> 
> 
> 
> On Wed, Mar 14, 2018 at 1:02 PM, Yossi Tamari 
> wrote:
> 
> > Hi Yash,
> >
> > I don't know how to do it, I never tried, but if I had to it would be
> > a trial and error thing
> >
> > If you want to increase the chances that someone will answer your
> > question, I suggest you provide as much information as possible:
> > Where did it not work? In "ant runtime", or when running in Hadoop?
> > What was the error message?
> > What is the content of your build.xml, plugin.xml, and ivy.xml?
> > Is parse-html configured in your plugin-includes?
> >
> > If it's a problem during execution, I would suggest looking at or
> > debugging the code of PluginClassLoader.
> >
> >
> > > -Original Message-
> > > From: Yash Thenuan Thenuan 
> > > Sent: 14 March 2018 08:34
> > > To: user@nutch.apache.org
> > > Subject: Re: Dependency between plugins
> > >
> > > Anybody please help me out regarding this.
> > >
> > > On Tue, Mar 13, 2018 at 6:51 PM, Yash Thenuan Thenuan <
> > > rit2014...@iiita.ac.in> wrote:
> > >
> > > > I am trying to import Htmlparser in my custom parser.
> > > > I did it in the same way by which Htmlparser imports lib-nekohtml
> > > > but it didn't worked.
> > > > Can anybody please tell me how to do it?
> > > >
> >
> >



RE: Dependency between plugins

2018-03-14 Thread Yossi Tamari
Hi Yash,

I don't know how to do it, I never tried, but if I had to, it would be a trial 
and error thing.

If you want to increase the chances that someone will answer your question, I 
suggest you provide as much information as possible:
Where did it not work? In "ant runtime", or when running in Hadoop? What was 
the error message?
What is the content of your build.xml, plugin.xml, and ivy.xml?
Is parse-html configured in your plugin-includes?

If it's a problem during execution, I would suggest looking at or debugging the 
code of PluginClassLoader.


> -Original Message-
> From: Yash Thenuan Thenuan 
> Sent: 14 March 2018 08:34
> To: user@nutch.apache.org
> Subject: Re: Dependency between plugins
> 
> Anybody please help me out regarding this.
> 
> On Tue, Mar 13, 2018 at 6:51 PM, Yash Thenuan Thenuan <
> rit2014...@iiita.ac.in> wrote:
> 
> > I am trying to import Htmlparser in my custom parser.
> > I did it in the same way by which Htmlparser imports lib-nekohtml but
> > it didn't worked.
> > Can anybody please tell me how to do it?
> >



RE: UrlRegexFilter is getting destroyed for unrealistically long links

2018-03-12 Thread Yossi Tamari
I think the first one should also be handled by reopening NUTCH-2220, which 
specifically mentions renaming db.max.anchor.length. The problem is that it 
seems like I am not able to reopen a closed/resolved issue. Sorry...

> -Original Message-
> From: Sebastian Nagel 
> Sent: 12 March 2018 17:39
> To: user@nutch.apache.org
> Subject: Re: UrlRegexFilter is getting destroyed for unrealistically long 
> links
> 
> >> I was about renaming db.max.anchor.length -> linkdb.max.anchor.length
> > OK, agreed, but it should also be moved to the LinkDB section in nutch-
> default.xml.
> 
> Yes, of course, plus make the description more explicit.
> Could you open a Jira issue for this?
> 
> > It should apply to outlinks received from the parser, not to injected URLs, 
> > for
> example.
> 
> Maybe it's ok not to apply it to seed URLs but what about URLs from sitemaps
> and ev. redirects?
> But agreed, you always could also add a rule to regex-urlfilter.txt if 
> required. But
> it should be made clear that only outlinks are checked for length.
> Could you reopen NUTCH-1106 to address this?
> 
> 
> Thanks!
> 
> 
> On 03/12/2018 03:27 PM, Yossi Tamari wrote:
> >> Which property, db.max.outlinks.per.page or db.max.anchor.length?
> > db.max.anchor.length, I already said that when I wrote
> "db.max.outlinks.per.page" it was a copy/paste error.
> >
> >> I was about renaming db.max.anchor.length -> linkdb.max.anchor.length
> > OK, agreed, but it should also be moved to the LinkDB section in nutch-
> default.xml.
> >
> >> Regarding a property to limit the URL length as discussed in NUTCH-1106:
> >> - it should be applied before URL normalizers
> > Agreed, but it seems to me the most natural place to add it is where
> db.max.outlinks.per.page is applied, around line 257 in ParseOutputFormat. It
> should apply to outlinks received from the parser, not to injected URLs, for
> example. The only other place I can think of where this may be needed is after
> redirect.
> > This is pretty much the same as what Semyon suggests, whether we push it
> down into the filterNormalize method or do it before calling it.
> >
> > Yossi.
> >
> >> -Original Message-
> >> From: Sebastian Nagel 
> >> Sent: 12 March 2018 15:57
> >> To: user@nutch.apache.org
> >> Subject: Re: UrlRegexFilter is getting destroyed for unrealistically
> >> long links
> >>
> >> Hi Semyon, Yossi, Markus,
> >>
> >>> what db.max.anchor.length was supposed to do
> >>
> >> it's applied to "anchor texts" (https://en.wikipedia.org/wiki/Anchor_text)
> >>   anchor text
> >> Can we agree to use the term "anchor" in this meaning?
> >> At least, that's how it is used in the class Outlink and hopefully
> >> throughout Nutch.
> >>
> >>> Personally, I still think the property should be used to limit
> >>> outlink length in parsing,
> >>
> >> Which property, db.max.outlinks.per.page or db.max.anchor.length?
> >>
> >> I was about renaming
> >>   db.max.anchor.length -> linkdb.max.anchor.length This property was
> >> forgotten when making the naming more consistent in
> >>   [NUTCH-2220] - Rename db.* options used only by the linkdb to
> >> linkdb.*
> >>
> >> Regarding a property to limit the URL length as discussed in NUTCH-1106:
> >> - it should be applied before URL normalizers
> >>   (that would be the main advantage over adding a regex filter rule)
> >> - but probably for all tools / places where URLs are filtered
> >>   (ugly because there are many of them)
> >> - one option would be to rethink the pipeline of URL normalizers and 
> >> filters
> >>   as Julien did it for Storm-crawler [1].
> >> - a pragmatic solution to keep the code changes limited:
> >>   do the length check twice at the beginning of
> >>URLNormalizers.normalize(...)
> >>   and
> >>URLFilters.filter(...)
> >>   (it's not guaranteed that normalizers are always called)
> >> - the minimal solution: add a default rule to regex-urlfilter.txt.template
> >>   to limit the length to 512 (or 1024/2048) characters
> >>
> >>
> >> Best,
> >> Sebastian
> >>
> >> [1]
> >> https://github.com/DigitalPebble/storm-
> >> crawler/blob/master/archetype/src/main/resources/archetype-
> >> resources/src/main/resources/urlfilters.json
> >>
> >>
&

RE: UrlRegexFilter is getting destroyed for unrealistically long links

2018-03-12 Thread Yossi Tamari
> Which property, db.max.outlinks.per.page or db.max.anchor.length?
db.max.anchor.length, I already said that when I wrote 
"db.max.outlinks.per.page" it was a copy/paste error.

> I was about renaming db.max.anchor.length -> linkdb.max.anchor.length
OK, agreed, but it should also be moved to the LinkDB section in 
nutch-default.xml.

> Regarding a property to limit the URL length as discussed in NUTCH-1106:
> - it should be applied before URL normalizers
Agreed, but it seems to me the most natural place to add it is where 
db.max.outlinks.per.page is applied, around line 257 in ParseOutputFormat. It 
should apply to outlinks received from the parser, not to injected URLs, for 
example. The only other place I can think of where this may be needed is after 
redirect.
This is pretty much the same as what Semyon suggests, whether we push it down 
into the filterNormalize method or do it before calling it.

Yossi.
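
For anyone hitting this in the meantime, the minimal workaround discussed in this 
thread is a length rule near the top of regex-urlfilter.txt, e.g. (512 is an 
arbitrary cutoff):

  # reject overly long URLs before any of the more expensive rules run
  -.{513,}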

> -Original Message-
> From: Sebastian Nagel 
> Sent: 12 March 2018 15:57
> To: user@nutch.apache.org
> Subject: Re: UrlRegexFilter is getting destroyed for unrealistically long 
> links
> 
> Hi Semyon, Yossi, Markus,
> 
> > what db.max.anchor.length was supposed to do
> 
> it's applied to "anchor texts" (https://en.wikipedia.org/wiki/Anchor_text)
>   anchor text
> Can we agree to use the term "anchor" in this meaning?
> At least, that's how it is used in the class Outlink and hopefully throughout
> Nutch.
> 
> > Personally, I still think the property should be used to limit outlink
> > length in parsing,
> 
> Which property, db.max.outlinks.per.page or db.max.anchor.length?
> 
> I was about renaming
>   db.max.anchor.length -> linkdb.max.anchor.length This property was forgotten
> when making the naming more consistent in
>   [NUTCH-2220] - Rename db.* options used only by the linkdb to linkdb.*
> 
> Regarding a property to limit the URL length as discussed in NUTCH-1106:
> - it should be applied before URL normalizers
>   (that would be the main advantage over adding a regex filter rule)
> - but probably for all tools / places where URLs are filtered
>   (ugly because there are many of them)
> - one option would be to rethink the pipeline of URL normalizers and filters
>   as Julien did it for Storm-crawler [1].
> - a pragmatic solution to keep the code changes limited:
>   do the length check twice at the beginning of
>URLNormalizers.normalize(...)
>   and
>URLFilters.filter(...)
>   (it's not guaranteed that normalizers are always called)
> - the minimal solution: add a default rule to regex-urlfilter.txt.template
>   to limit the length to 512 (or 1024/2048) characters
> 
> 
> Best,
> Sebastian
> 
> [1]
> https://github.com/DigitalPebble/storm-
> crawler/blob/master/archetype/src/main/resources/archetype-
> resources/src/main/resources/urlfilters.json
> 
> 
> 
> On 03/12/2018 02:02 PM, Yossi Tamari wrote:
> > The other properties in this section actually affect parsing (e.g.
> db.max.outlinks.per.page). I was under the impression that this is what
> db.max.anchor.length was supposed to do, and actually increased its value.
> Turns out this is one of the many things in Nutch that are not intuitive (or 
> in this
> case, does nothing at all).
> > One of the reasons I thought so is that very long links can be used as an 
> > attack
> on crawlers.
> > Personally, I still think the property should be used to limit outlink 
> > length in
> parsing, but if that is not what it's supposed to do, I guess it needs to be
> renamed (to match the code), moved to a different section of the properties
> file, and perhaps better documented. In that case, you'll need to use Markus'
> solution, and basically everybody should use Markus' first rule...
> >
> >> -Original Message-
> >> From: Semyon Semyonov 
> >> Sent: 12 March 2018 14:51
> >> To: user@nutch.apache.org
> >> Subject: Re: UrlRegexFilter is getting destroyed for unrealistically
> >> long links
> >>
> >> So, which is the conclusion?
> >>
> >> Should it be solved in regex file or through this property?
> >>
> >> Though, how the property of crawldb/linkdb suppose to prevent this
> >> problem in Parse?
> >>
> >> Sent: Monday, March 12, 2018 at 1:42 PM
> >> From: "Edward Capriolo" 
> >> To: "user@nutch.apache.org" 
> >> Subject: Re: UrlRegexFilter is getting destroyed for unrealistically
> >> long links Some regular expressions (those with backtracing) can be
> >> very expensive for lomg strings

RE: UrlRegexFilter is getting destroyed for unrealistically long links

2018-03-12 Thread Yossi Tamari
The other properties in this section actually affect parsing (e.g.
db.max.outlinks.per.page). I was under the impression that this is what
db.max.anchor.length was supposed to do, and I actually increased its value.
It turns out this is one of the many things in Nutch that are not intuitive
(or, in this case, does nothing at all).
One of the reasons I thought so is that very long links can be used as an 
attack on crawlers.
Personally, I still think the property should be used to limit outlink length 
in parsing, but if that is not what it's supposed to do, I guess it needs to be 
renamed (to match the code), moved to a different section of the properties 
file, and perhaps better documented. In that case, you'll need to use Markus' 
solution, and basically everybody should use Markus' first rule...
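
For reference, such a rule in regex-urlfilter.txt would look roughly like this
(an illustration of the idea, not necessarily Markus' exact rule):

    # reject any URL longer than 512 characters (pick a limit that fits your crawl)
    -^.{513,}$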

> -Original Message-
> From: Semyon Semyonov 
> Sent: 12 March 2018 14:51
> To: user@nutch.apache.org
> Subject: Re: UrlRegexFilter is getting destroyed for unrealistically long 
> links
> 
> So, which is the conclusion?
> 
> Should it be solved in regex file or through this property?
> 
> Though, how the property of crawldb/linkdb suppose to prevent this problem in
> Parse?
> 
> Sent: Monday, March 12, 2018 at 1:42 PM
> From: "Edward Capriolo" 
> To: "user@nutch.apache.org" 
> Subject: Re: UrlRegexFilter is getting destroyed for unrealistically long 
> links
> Some regular expressions (those with backtracing) can be very expensive for
> lomg strings
> 
> https://regular-expressions.mobi/catastrophic.html?wlr=1
> 
> Maybe that is your issue.
> 
> On Monday, March 12, 2018, Sebastian Nagel 
> wrote:
> 
> > Good catch. It should be renamed to be consistent with other
> > properties, right?
> >
> > On 03/12/2018 01:10 PM, Yossi Tamari wrote:
> > > Perhaps, however it starts with db, not linkdb (like the other
> > > linkdb
> > properties), it is in the CrawlDB part of nutch-default.xml, and
> > LinkDB code uses the property name linkdb.max.anchor.length.
> > >
> > >> -Original Message-
> > >> From: Markus Jelsma 
> > >> Sent: 12 March 2018 14:05
> > >> To: user@nutch.apache.org
> > >> Subject: RE: UrlRegexFilter is getting destroyed for
> > >> unrealistically
> > long links
> > >>
> > >> That is for the LinkDB.
> > >>
> > >>
> > >>
> > >> -Original message-
> > >>> From:Yossi Tamari 
> > >>> Sent: Monday 12th March 2018 13:02
> > >>> To: user@nutch.apache.org
> > >>> Subject: RE: UrlRegexFilter is getting destroyed for
> > >>> unrealistically long links
> > >>>
> > >>> Sorry, not db.max.outlinks.per.page, db.max.anchor.length. Copy
> > >>> paste
> > >> error...
> > >>>
> > >>>> -Original Message-
> > >>>> From: Markus Jelsma 
> > >>>> Sent: 12 March 2018 14:01
> > >>>> To: user@nutch.apache.org
> > >>>> Subject: RE: UrlRegexFilter is getting destroyed for
> > >>>> unrealistically long links
> > >>>>
> > >>>> scripts/apache-nutch-
> > >>>> 1.14/src/java/org/apache/nutch/fetcher/FetcherThread.java:205:
> > >>>> maxOutlinksPerPage = conf.getInt("db.max.outlinks.per.page",
> > >>>> 100);
> > >>>> scripts/apache-nutch-
> > >>>> 1.14/src/java/org/apache/nutch/parse/ParseOutputFormat.java:118:
> > int
> > >>>> maxOutlinksPerPage = job.getInt("db.max.outlinks.per.page", 100);
> > >>>>
> > >>>>
> > >>>>
> > >>>>
> > >>>> -Original message-
> > >>>>> From:Yossi Tamari 
> > >>>>> Sent: Monday 12th March 2018 12:56
> > >>>>> To: user@nutch.apache.org
> > >>>>> Subject: RE: UrlRegexFilter is getting destroyed for
> > >>>>> unrealistically long links
> > >>>>>
> > >>>>> Nutch.default contains a property db.max.outlinks.per.page,
> > >>>>> which I think is
> > >>>> supposed to prevent these cases. However, I just searched the
> > >>>> code and couldn't find where it is used. Bug?
> > >>>>>
> > >>>>>> -Original Message-
> > >>>>>&g

RE: UrlRegexFilter is getting destroyed for unrealistically long links

2018-03-12 Thread Yossi Tamari
Perhaps; however, it starts with db, not linkdb (as the other linkdb
properties do), it is in the CrawlDB part of nutch-default.xml, and the LinkDB
code uses the property name linkdb.max.anchor.length.

> -Original Message-
> From: Markus Jelsma 
> Sent: 12 March 2018 14:05
> To: user@nutch.apache.org
> Subject: RE: UrlRegexFilter is getting destroyed for unrealistically long 
> links
> 
> That is for the LinkDB.
> 
> 
> 
> -Original message-
> > From:Yossi Tamari 
> > Sent: Monday 12th March 2018 13:02
> > To: user@nutch.apache.org
> > Subject: RE: UrlRegexFilter is getting destroyed for unrealistically
> > long links
> >
> > Sorry, not db.max.outlinks.per.page, db.max.anchor.length. Copy paste
> error...
> >
> > > -Original Message-
> > > From: Markus Jelsma 
> > > Sent: 12 March 2018 14:01
> > > To: user@nutch.apache.org
> > > Subject: RE: UrlRegexFilter is getting destroyed for unrealistically
> > > long links
> > >
> > > scripts/apache-nutch-
> > > 1.14/src/java/org/apache/nutch/fetcher/FetcherThread.java:205:
> > > maxOutlinksPerPage = conf.getInt("db.max.outlinks.per.page", 100);
> > > scripts/apache-nutch-
> > > 1.14/src/java/org/apache/nutch/parse/ParseOutputFormat.java:118:int
> > > maxOutlinksPerPage = job.getInt("db.max.outlinks.per.page", 100);
> > >
> > >
> > >
> > >
> > > -Original message-
> > > > From:Yossi Tamari 
> > > > Sent: Monday 12th March 2018 12:56
> > > > To: user@nutch.apache.org
> > > > Subject: RE: UrlRegexFilter is getting destroyed for
> > > > unrealistically long links
> > > >
> > > > Nutch.default contains a property db.max.outlinks.per.page, which
> > > > I think is
> > > supposed to prevent these cases. However, I just searched the code
> > > and couldn't find where it is used. Bug?
> > > >
> > > > > -Original Message-
> > > > > From: Semyon Semyonov 
> > > > > Sent: 12 March 2018 12:47
> > > > > To: usernutch.apache.org 
> > > > > Subject: UrlRegexFilter is getting destroyed for unrealistically
> > > > > long links
> > > > >
> > > > > Dear all,
> > > > >
> > > > > There is an issue with UrlRegexFilter and parsing. In average,
> > > > > parsing takes about 1 millisecond, but sometimes the websites
> > > > > have the crazy links that destroy the parsing(takes 3+ hours and
> > > > > destroy the next
> > > steps of the crawling).
> > > > > For example, below you can see shortened logged version of url
> > > > > with encoded image, the real lenght of the link is 532572 characters.
> > > > >
> > > > > Any idea what should I do with such behavior?  Should I modify
> > > > > the plugin to reject links with lenght > MAX or use more comlex
> > > > > logic/check extra configuration?
> > > > > 2018-03-10 23:39:52,082 INFO [main]
> > > > > org.apache.nutch.parse.ParseOutputFormat:
> > > > > ParseOutputFormat.Write.filterNormalize 4.3. Before filteing and
> > > > > normalization
> > > > > 2018-03-10 23:39:52,178 INFO [main]
> > > > > org.apache.nutch.urlfilter.api.RegexURLFilterBase:
> > > > > ParseOutputFormat.Write.filterNormalize 4.3.1. In regex url
> > > > > filter for url
> > > > >
> > >
> :https://www.sintgoedele.be/image/png%3Bbase64%2CiVBORw0KGgoNS
> > > > >
> > >
> UhEUgAAAXgAAAF2CAIAAADr9RSBAAAgAElEQVR4nFSaZ7RdZbnvd5LdVpt9vmX2
> > > > >
> > >
> Xtdcve6191q79%2Bz0hCSQhBIg0os0pSigiKLYrhw9FmDgUS/HcqzHggU956iIiiBkr7
> > > > >
> > >
> X2TkIRPUhACSG0fT/Ec8e9Y7yf5nzGO5/8/%2B9z3jGHG/Pk0890VlpHzm6unp0pd1
> > > > >
> > >
> efuqpJ9ud5eV2u905vZaPHFk9cnR1eXm53Vlut0%2BvdqfbbneW2%2B12p9PudNu
> > > > > dbnu50253lju... [532572 characters]
> > > > > 2018-03-11 03:56:26,118 INFO [main]
> > > > > org.apache.nutch.parse.ParseOutputFormat:
> > > > > ParseOutputFormat.Write.filterNormalize 4.4. After filteing and
> > > > > normalization
> > > > >
> > > > > Semyon.
> > > >
> > > >
> >
> >



RE: UrlRegexFilter is getting destroyed for unrealistically long links

2018-03-12 Thread Yossi Tamari
Sorry, not db.max.outlinks.per.page but db.max.anchor.length. Copy/paste error...

> -Original Message-
> From: Markus Jelsma 
> Sent: 12 March 2018 14:01
> To: user@nutch.apache.org
> Subject: RE: UrlRegexFilter is getting destroyed for unrealistically long 
> links
> 
> scripts/apache-nutch-
> 1.14/src/java/org/apache/nutch/fetcher/FetcherThread.java:205:
> maxOutlinksPerPage = conf.getInt("db.max.outlinks.per.page", 100);
> scripts/apache-nutch-
> 1.14/src/java/org/apache/nutch/parse/ParseOutputFormat.java:118:int
> maxOutlinksPerPage = job.getInt("db.max.outlinks.per.page", 100);
> 
> 
> 
> 
> -Original message-
> > From:Yossi Tamari 
> > Sent: Monday 12th March 2018 12:56
> > To: user@nutch.apache.org
> > Subject: RE: UrlRegexFilter is getting destroyed for unrealistically
> > long links
> >
> > Nutch.default contains a property db.max.outlinks.per.page, which I think is
> supposed to prevent these cases. However, I just searched the code and 
> couldn't
> find where it is used. Bug?
> >
> > > -Original Message-
> > > From: Semyon Semyonov 
> > > Sent: 12 March 2018 12:47
> > > To: usernutch.apache.org 
> > > Subject: UrlRegexFilter is getting destroyed for unrealistically
> > > long links
> > >
> > > Dear all,
> > >
> > > There is an issue with UrlRegexFilter and parsing. In average,
> > > parsing takes about 1 millisecond, but sometimes the websites have
> > > the crazy links that destroy the parsing(takes 3+ hours and destroy the 
> > > next
> steps of the crawling).
> > > For example, below you can see shortened logged version of url with
> > > encoded image, the real lenght of the link is 532572 characters.
> > >
> > > Any idea what should I do with such behavior?  Should I modify the
> > > plugin to reject links with lenght > MAX or use more comlex
> > > logic/check extra configuration?
> > > 2018-03-10 23:39:52,082 INFO [main]
> > > org.apache.nutch.parse.ParseOutputFormat:
> > > ParseOutputFormat.Write.filterNormalize 4.3. Before filteing and
> > > normalization
> > > 2018-03-10 23:39:52,178 INFO [main]
> > > org.apache.nutch.urlfilter.api.RegexURLFilterBase:
> > > ParseOutputFormat.Write.filterNormalize 4.3.1. In regex url filter
> > > for url
> > >
> :https://www.sintgoedele.be/image/png%3Bbase64%2CiVBORw0KGgoNS
> > >
> UhEUgAAAXgAAAF2CAIAAADr9RSBAAAgAElEQVR4nFSaZ7RdZbnvd5LdVpt9vmX2
> > >
> Xtdcve6191q79%2Bz0hCSQhBIg0os0pSigiKLYrhw9FmDgUS/HcqzHggU956iIiiBkr7
> > >
> X2TkIRPUhACSG0fT/Ec8e9Y7yf5nzGO5/8/%2B9z3jGHG/Pk0890VlpHzm6unp0pd1
> > >
> efuqpJ9ud5eV2u905vZaPHFk9cnR1eXm53Vlut0%2BvdqfbbneW2%2B12p9PudNu
> > > dbnu50253lju... [532572 characters]
> > > 2018-03-11 03:56:26,118 INFO [main]
> > > org.apache.nutch.parse.ParseOutputFormat:
> > > ParseOutputFormat.Write.filterNormalize 4.4. After filteing and
> > > normalization
> > >
> > > Semyon.
> >
> >



RE: UrlRegexFilter is getting destroyed for unrealistically long links

2018-03-12 Thread Yossi Tamari
nutch-default.xml contains a property db.max.outlinks.per.page, which I think
is supposed to prevent these cases. However, I just searched the code and
couldn't find where it is used. Bug?

> -Original Message-
> From: Semyon Semyonov 
> Sent: 12 March 2018 12:47
> To: usernutch.apache.org 
> Subject: UrlRegexFilter is getting destroyed for unrealistically long links
> 
> Dear all,
> 
> There is an issue with UrlRegexFilter and parsing. In average, parsing takes
> about 1 millisecond, but sometimes the websites have the crazy links that
> destroy the parsing(takes 3+ hours and destroy the next steps of the 
> crawling).
> For example, below you can see shortened logged version of url with encoded
> image, the real lenght of the link is 532572 characters.
> 
> Any idea what should I do with such behavior?  Should I modify the plugin to
> reject links with lenght > MAX or use more comlex logic/check extra
> configuration?
> 2018-03-10 23:39:52,082 INFO [main]
> org.apache.nutch.parse.ParseOutputFormat:
> ParseOutputFormat.Write.filterNormalize 4.3. Before filteing and normalization
> 2018-03-10 23:39:52,178 INFO [main]
> org.apache.nutch.urlfilter.api.RegexURLFilterBase:
> ParseOutputFormat.Write.filterNormalize 4.3.1. In regex url filter for url
> :https://www.sintgoedele.be/image/png%3Bbase64%2CiVBORw0KGgoNS
> UhEUgAAAXgAAAF2CAIAAADr9RSBAAAgAElEQVR4nFSaZ7RdZbnvd5LdVpt9vmX2
> Xtdcve6191q79%2Bz0hCSQhBIg0os0pSigiKLYrhw9FmDgUS/HcqzHggU956iIiiBkr7
> X2TkIRPUhACSG0fT/Ec8e9Y7yf5nzGO5/8/%2B9z3jGHG/Pk0890VlpHzm6unp0pd1
> efuqpJ9ud5eV2u905vZaPHFk9cnR1eXm53Vlut0%2BvdqfbbneW2%2B12p9PudNu
> dbnu50253lju... [532572 characters]
> 2018-03-11 03:56:26,118 INFO [main]
> org.apache.nutch.parse.ParseOutputFormat:
> ParseOutputFormat.Write.filterNormalize 4.4. After filteing and normalization
> 
> Semyon.



RE: Regarding Internal Links

2018-03-07 Thread Yossi Tamari
1. Go to https://issues.apache.org/jira/projects/NUTCH
2. Click "Log-In" (upper right corner). Create a user if needed and log in.
3. Click "Create" (in the top banner).
4. Fill in the fields. They are mostly self-explanatory, and those that you
don't understand can probably be ignored. The important thing is to provide as
much relevant information as possible - in this case, what your Parse Plugin
does and the error that happens in the Index phase (these go in the
Description field). Provide the same log as you provided here, either in the
Description field as well (using the formatting options) or as an attachment.
5. Click "Create" at the bottom of the dialog, and you're done!

> -Original Message-
> From: Yash Thenuan Thenuan 
> Sent: 07 March 2018 12:51
> To: user@nutch.apache.org
> Subject: RE: Regarding Internal Links
> 
> Yossi I tried with both the original url and the newer one but it didn't 
> worked!!
> However for now I disabled the scoring opic as suggested by Sebastian and it
> worked for now.
> And I will open a jira issue but I am new to open source world so can you 
> please
> help me regarding this?
> Thanks a lot yossi and Sebastian.
> 
> On 7 Mar 2018 16:11, "Yossi Tamari"  wrote:
> 
> Yas, just to be sure, you are using the original URL (the one that was in the
> ParseResult passed as parameter to the filter) in the ParseResult constructor,
> right?
> 
> > -Original Message-
> > From: Sebastian Nagel 
> > Sent: 07 March 2018 12:36
> > To: user@nutch.apache.org
> > Subject: Re: Regarding Internal Links
> >
> > Hi,
> >
> > that needs to be fixed. It's because there is no CrawlDb entry for the
> partial
> > documents. May also be happen after NUTCH-2456. Could you open a Jira
> issue
> > to address the problem? Thanks!
> >
> > As a quick work-around:
> > - either disable scoring-opic while indexing
> > - or check dbDatum for null in scoring-opic indexerScore(...)
> >
> > Thanks,
> > Sebastian
> >
> > On 03/07/2018 11:13 AM, Yash Thenuan Thenuan wrote:
> > > Thanks Yossi, I am now able to parse the data successfully but I am
> > > getting Error at the time of indexing.
> > > Below are the hadoop logs for indexing.
> > >
> > > ElasticRestIndexWriter
> > > elastic.rest.host : hostname
> > > elastic.rest.port : port
> > > elastic.rest.index : elastic index command
> > > elastic.rest.max.bulk.docs
> > > : elastic bulk index doc counts. (default 250)
> > > elastic.rest.max.bulk.size : elastic bulk index length. (default
> > > 2500500
> > > ~2.5MB)
> > >
> > >
> > > 2018-03-07 15:41:52,327 INFO  indexer.IndexerMapReduce -
> > IndexerMapReduce:
> > > crawldb: crawl/crawldb
> > > 2018-03-07 15:41:52,327 INFO  indexer.IndexerMapReduce -
> > IndexerMapReduce:
> > > linkdb: crawl/linkdb
> > > 2018-03-07 15:41:52,327 INFO  indexer.IndexerMapReduce -
> > IndexerMapReduces:
> > > adding segment: crawl/segments/20180307130959
> > > 2018-03-07 15:41:53,677 INFO  anchor.AnchorIndexingFilter - Anchor
> > > deduplication is: off
> > > 2018-03-07 15:41:54,861 INFO  indexer.IndexWriters - Adding
> > > org.apache.nutch.indexwriter.elasticrest.ElasticRestIndexWriter
> > > 2018-03-07 15:41:55,168 INFO  client.AbstractJestClient - Setting
> > > server pool to a list of 1 servers: [http://localhost:9200]
> > > 2018-03-07 15:41:55,170 INFO  client.JestClientFactory - Using multi
> > > thread/connection supporting pooling connection manager
> > > 2018-03-07 15:41:55,238 INFO  client.JestClientFactory - Using
> > > default GSON instance
> > > 2018-03-07 15:41:55,238 INFO  client.JestClientFactory - Node
> > > Discovery disabled...
> > > 2018-03-07 15:41:55,238 INFO  client.JestClientFactory - Idle
> > > connection reaping disabled...
> > > 2018-03-07 15:41:55,282 INFO  elasticrest.ElasticRestIndexWriter -
> > > Processing remaining requests [docs = 1, length = 210402, total docs
> > > = 1]
> > > 2018-03-07 15:41:55,361 INFO  elasticrest.ElasticRestIndexWriter -
> > > Processing to finalize last execute
> > > 2018-03-07 15:41:55,458 INFO  elasticrest.ElasticRestIndexWriter -
> > > Previous took in ms 175, including wait 97
> > > 2018-03-07 15:41:55,468 WARN  mapred.LocalJobRunner -
> > > job_local1561152089_0001
> > > java.lang.Exception: java.lang.NullPointerException at
> > > org.apache.hadoop.mapred.LocalJobRunner$Job.runTask

RE: Regarding Internal Links

2018-03-07 Thread Yossi Tamari
Yas, just to be sure, you are using the original URL (the one that was in the 
ParseResult passed as parameter to the filter) in the ParseResult constructor, 
right?

> -Original Message-
> From: Sebastian Nagel 
> Sent: 07 March 2018 12:36
> To: user@nutch.apache.org
> Subject: Re: Regarding Internal Links
> 
> Hi,
> 
> that needs to be fixed. It's because there is no CrawlDb entry for the partial
> documents. May also be happen after NUTCH-2456. Could you open a Jira issue
> to address the problem? Thanks!
> 
> As a quick work-around:
> - either disable scoring-opic while indexing
> - or check dbDatum for null in scoring-opic indexerScore(...)
> 
> Thanks,
> Sebastian
> 
> On 03/07/2018 11:13 AM, Yash Thenuan Thenuan wrote:
> > Thanks Yossi, I am now able to parse the data successfully but I am
> > getting Error at the time of indexing.
> > Below are the hadoop logs for indexing.
> >
> > ElasticRestIndexWriter
> > elastic.rest.host : hostname
> > elastic.rest.port : port
> > elastic.rest.index : elastic index command elastic.rest.max.bulk.docs
> > : elastic bulk index doc counts. (default 250)
> > elastic.rest.max.bulk.size : elastic bulk index length. (default
> > 2500500
> > ~2.5MB)
> >
> >
> > 2018-03-07 15:41:52,327 INFO  indexer.IndexerMapReduce -
> IndexerMapReduce:
> > crawldb: crawl/crawldb
> > 2018-03-07 15:41:52,327 INFO  indexer.IndexerMapReduce -
> IndexerMapReduce:
> > linkdb: crawl/linkdb
> > 2018-03-07 15:41:52,327 INFO  indexer.IndexerMapReduce -
> IndexerMapReduces:
> > adding segment: crawl/segments/20180307130959
> > 2018-03-07 15:41:53,677 INFO  anchor.AnchorIndexingFilter - Anchor
> > deduplication is: off
> > 2018-03-07 15:41:54,861 INFO  indexer.IndexWriters - Adding
> > org.apache.nutch.indexwriter.elasticrest.ElasticRestIndexWriter
> > 2018-03-07 15:41:55,168 INFO  client.AbstractJestClient - Setting
> > server pool to a list of 1 servers: [http://localhost:9200]
> > 2018-03-07 15:41:55,170 INFO  client.JestClientFactory - Using multi
> > thread/connection supporting pooling connection manager
> > 2018-03-07 15:41:55,238 INFO  client.JestClientFactory - Using default
> > GSON instance
> > 2018-03-07 15:41:55,238 INFO  client.JestClientFactory - Node
> > Discovery disabled...
> > 2018-03-07 15:41:55,238 INFO  client.JestClientFactory - Idle
> > connection reaping disabled...
> > 2018-03-07 15:41:55,282 INFO  elasticrest.ElasticRestIndexWriter -
> > Processing remaining requests [docs = 1, length = 210402, total docs =
> > 1]
> > 2018-03-07 15:41:55,361 INFO  elasticrest.ElasticRestIndexWriter -
> > Processing to finalize last execute
> > 2018-03-07 15:41:55,458 INFO  elasticrest.ElasticRestIndexWriter -
> > Previous took in ms 175, including wait 97
> > 2018-03-07 15:41:55,468 WARN  mapred.LocalJobRunner -
> > job_local1561152089_0001
> > java.lang.Exception: java.lang.NullPointerException at
> > org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.ja
> > va:462) at
> > org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:52
> > 9) Caused by: java.lang.NullPointerException at
> > org.apache.nutch.scoring.opic.OPICScoringFilter.indexerScore(OPICScori
> > ngFilter.java:171)
> > at
> > org.apache.nutch.scoring.ScoringFilters.indexerScore(ScoringFilters.ja
> > va:120)
> > at
> > org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java
> > :296)
> > at
> > org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java
> > :57) at
> > org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:444)
> > at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
> > at
> >
> org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(Loc
> > alJobRunner.java:319) at
> > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511
> > ) at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> > at
> > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.j
> > ava:1149)
> > at
> > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.
> > java:624) at java.lang.Thread.run(Thread.java:748)
> > 2018-03-07 15:41:55,510 ERROR indexer.IndexingJob - Indexer:
> > java.io.IOException: Job failed!
> > at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:873)
> > at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:147)
> > at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:230)
> > at org.apache.hadoop.util.ToolR

RE: Regarding Internal Links

2018-03-06 Thread Yossi Tamari
Regarding the configuration parameter, your Parse Filter should expose a 
setConf method that receives a conf parameter. Keep that as a member variable 
and pass it where necessary.
Regarding parsestatus, contentmeta and parsemeta, you're going to have to look 
at them yourself (probably in a debugger), but as a baseline, you can probably 
just use the values in the inbound ParseResult (of the whole document).
More specifically, parsestatus is an indication of whether parsing was
successful; unless your parsing may fail even when the whole-document parsing
succeeded, you don't need to change it. contentmeta is all the information
that was gathered about the page before parsing, so again you probably just
want to keep it. Finally, parsemeta is the metadata that was gathered during
parsing and may be useful for indexing, so passing the metadata from the
original ParseResult makes sense, or you can just use the constructor that
does not require it if you don't care about the metadata.
This should all be easier to understand if you look at what the HTML Parser 
does with each of these fields.
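
A minimal sketch of the setConf part (the filter method and the Nutch-specific
types are left out; this just shows the usual Hadoop Configurable pattern, not
a complete plugin):

    import org.apache.hadoop.conf.Configurable;
    import org.apache.hadoop.conf.Configuration;

    public class MyParseFilter implements Configurable {

      private Configuration conf;   // keep the configuration as a member variable

      @Override
      public void setConf(Configuration conf) {
        this.conf = conf;           // called when the plugin is set up
      }

      @Override
      public Configuration getConf() {
        return conf;
      }

      // The actual filter method would then use this.conf, e.g. to read
      // properties or to pass the configuration to the outlink extractor.
    }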

> -Original Message-
> From: Yash Thenuan Thenuan 
> Sent: 06 March 2018 20:17
> To: user@nutch.apache.org
> Subject: RE: Regarding Internal Links
> 
> I am able to get parsetext data structure.
> But having trouble with parseData as it's constructor is asking for 
> parsestatus,
> outlinks, contentmeta and parsemeta.
> Outlinks I can get from outlinkExtractor but what about other parameters?
> And again getoutlinks is asking for configuration and i don't know, from 
> where I
> can get it?
> 
> On 6 Mar 2018 18:32, "Yossi Tamari"  wrote:
> 
> > You should go over each segment, and for each one produce a ParseText
> > and a ParseData. This is basically what the HTML Parser does for the
> > whole document, which is why I suggested you should dive into its code.
> > A ParseText is basically just a String containing the actual content
> > of the segment (after stripping the HTML tags). This is usually the
> > document you want to index.
> > The ParseData structure is a little more complex, but the main things
> > it contains are the title of this segment, and the outlinks from the
> > segment (for further crawling). Take a look at the code of both
> > classes and it should be relatively clear.
> > Finally, you need to build one ParseResult object, with the original
> > URL, and for each of the ParseText/ParseData pairs, call the put
> > method, with the internal URL of the segment as the key.
> >
> > > -Original Message-
> > > From: Yash Thenuan Thenuan 
> > > Sent: 06 March 2018 14:45
> > > To: user@nutch.apache.org
> > > Subject: RE: Regarding Internal Links
> > >
> > > > I am able to get the content corresponding to each Internal link
> > > > by writing a parse filter plugin. Now  I am  not getting how to
> > > > proceed further. How can I parse them as separate document and
> > > > what should my ParseResult filter return??
> >
> >



RE: Regarding Internal Links

2018-03-06 Thread Yossi Tamari
You should go over each segment, and for each one produce a ParseText and a 
ParseData. This is basically what the HTML Parser does for the whole document, 
which is why I suggested you should dive into its code.
A ParseText is basically just a String containing the actual content of the 
segment (after stripping the HTML tags). This is usually the document you want 
to index.
The ParseData structure is a little more complex, but the main things it 
contains are the title of this segment, and the outlinks from the segment (for 
further crawling). Take a look at the code of both classes and it should be 
relatively clear.
Finally, you need to build one ParseResult object, with the original URL, and 
for each of the ParseText/ParseData pairs, call the put method, with the 
internal URL of the segment as the key.  
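
Conceptually (leaving the real Nutch types aside for a moment), what you build
is just a map keyed by the per-segment URL. A toy sketch, where SegmentDoc is
a made-up stand-in for the ParseText/ParseData pair:

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class SegmentMapSketch {

      // Hypothetical stand-in for a ParseText/ParseData pair, not a Nutch class.
      record SegmentDoc(String text, String title) {}

      public static void main(String[] args) {
        String originalUrl = "https://wiki.apache.org/nutch/NutchTutorial";

        // One entry per section, keyed by the original URL plus the anchor
        // (the second anchor is made up for the example).
        Map<String, SegmentDoc> parseResult = new LinkedHashMap<>();
        parseResult.put(originalUrl + "#Table_of_Contents",
            new SegmentDoc("text of the table of contents section", "Table of Contents"));
        parseResult.put(originalUrl + "#Requirements",
            new SegmentDoc("text of the requirements section", "Requirements"));

        parseResult.forEach((url, doc) -> System.out.println(url + " -> " + doc.title()));
      }
    }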

> -Original Message-
> From: Yash Thenuan Thenuan 
> Sent: 06 March 2018 14:45
> To: user@nutch.apache.org
> Subject: RE: Regarding Internal Links
> 
> > I am able to get the content corresponding to each Internal link by
> > writing a parse filter plugin. Now  I am  not getting how to proceed
> > further. How can I parse them as separate document and what should
> > my ParseResult filter return??



RE: Why doesn't hostdb support byDomain mode?

2018-03-05 Thread Yossi Tamari
Hi Sebastian,

Yes, right now I only care about the statistics (basically using HostDB as an 
improved CrawlCompletionStats). For this reason, and since the number of 
problematic domains I have right now is small, urlnormalizer-host is good 
enough for me.
Aggregating over HostDB per domain as a parameter to ReadHostDb would also 
solve my problem, as you suggest. There is even a comment in the code there 
that suggests someone already had a similar idea.
To be honest, I don't know which solution is best, and I have a usable
work-around, so I don't feel the need to implement a solution right now, unless 
someone pushes me to 😊
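
A rough sketch of what aggregating host-level counts by domain could look like
(the domain extraction below is deliberately naive and the numbers are made
up; real code should use the public suffix list):

    import java.util.HashMap;
    import java.util.Map;

    public class DomainAggregationSketch {

      // Naive "registered domain" guess: the last two labels of the host name.
      static String domainOf(String host) {
        String[] labels = host.split("\\.");
        if (labels.length <= 2) {
          return host;
        }
        return labels[labels.length - 2] + "." + labels[labels.length - 1];
      }

      public static void main(String[] args) {
        // host -> fetched-page count, as a HostDb-style statistic
        Map<String, Long> fetchedPerHost = Map.of(
            "i1.photobucket.com", 120L,
            "i2.photobucket.com", 95L,
            "www.example.org", 40L);

        Map<String, Long> fetchedPerDomain = new HashMap<>();
        fetchedPerHost.forEach((host, count) ->
            fetchedPerDomain.merge(domainOf(host), count, Long::sum));

        // e.g. {photobucket.com=215, example.org=40} (iteration order may vary)
        System.out.println(fetchedPerDomain);
      }
    }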

Yossi.

> -Original Message-
> From: Sebastian Nagel 
> Sent: 05 March 2018 16:07
> To: user@nutch.apache.org
> Subject: Re: Why doesn't hostdb support byDomain mode?
> 
> Hi Yossi,
> 
> please don't take it as a vote against your proposal.
> It could be also solved by documenting what's not working with the HostDb
> containing domains.
> 
> Are you only about the statistics or also about using the HostDb for 
> Generator?
> For the former use case, a solution could be also to aggregate the counts by
> domain. Usually, the HostDb is orders of magnitude smaller than the CrawlDb,
> so this should be considerably fast.
> 
> Best,
> Sebastian
> 
> On 03/05/2018 02:03 PM, Yossi Tamari wrote:
> > Thanks, I will submit a patch for this. Since this allows me to solve my 
> > specific
> issue, and since Sebastian raised some questions regarding byDomain, I will 
> not
> proceed with that currently.
> >
> >> -Original Message-
> >> From: Markus Jelsma 
> >> Sent: 05 March 2018 14:41
> >> To: user@nutch.apache.org
> >> Subject: RE: Why doesn't hostdb support byDomain mode?
> >>
> >> Ah, well, that is a good one! I took me a while to figure it out, but
> >> having the check there is an error. We had added the same check in an
> >> earlier different Nutch job where the database itself could remove
> >> itself just by the rules it emitted and host normalized enabled.
> >>
> >> I simply reused the job setup code and forgot to remove that check.
> >> You can safely remove that check in HostDB.
> >>
> >> Regards,
> >> Markus
> >>
> >>
> >> -Original message-
> >>> From:Yossi Tamari 
> >>> Sent: Monday 5th March 2018 11:30
> >>> To: user@nutch.apache.org
> >>> Subject: RE: Why doesn't hostdb support byDomain mode?
> >>>
> >>> Thanks Markus, I will open a ticket and submit a patch.
> >>> One follow up question: UpdateHostDb checks and throws an exception
> >>> if
> >> urlnormalizer-host (which can be used to mitigate the problem I
> >> mentioned) is enabled. Is that also an internal decision of
> >> OpenIndex, and perhaps should be removed now that the code is part of
> >> Nutch, or is there a reason this normalizer must not be used with
> UpdateHostDb?
> >>>
> >>>   Yossi.
> >>>
> >>>> -Original Message-
> >>>> From: Markus Jelsma 
> >>>> Sent: 05 March 2018 12:22
> >>>> To: user@nutch.apache.org
> >>>> Subject: RE: Why doesn't hostdb support byDomain mode?
> >>>>
> >>>> Hi,
> >>>>
> >>>> The reason is simple, we (company) needed this information based on
> >>>> hostname, so we made a hostdb. I don't see any downside for
> >>>> supporting a domain mode. Adding support for it through
> >>>> hostdb.url.mode seems like a good idea.
> >>>>
> >>>> Regards,
> >>>> Markus
> >>>>
> >>>> -Original message-
> >>>>> From:Yossi Tamari 
> >>>>> Sent: Sunday 4th March 2018 12:01
> >>>>> To: user@nutch.apache.org
> >>>>> Subject: Why doesn't hostdb support byDomain mode?
> >>>>>
> >>>>> Hi,
> >>>>>
> >>>>>
> >>>>>
> >>>>> Is there a reason that hostdb provides per-host data even when the
> >>>>> generate/fetch are working by domain? This generates misleading
> >>>>> statistics for servers that load-balance by redirecting to nodes (e.g.
> >>>> photobucket).
> >>>>>
> >>>>> If this is just an oversight, I can contribute a patch, but I'm
> >>>>> not sure if I should use partition.url.mode, generate.count.mode,
> >>>>> one of the other similar properties, or create one more such
> >>>>> property
> >>>> hostdb.url.mode.
> >>>>>
> >>>>>
> >>>>>
> >>>>> Yossi.
> >>>>>
> >>>>>
> >>>
> >>>
> >




RE: Why doesn't hostdb support byDomain mode?

2018-03-05 Thread Yossi Tamari
Thanks, I will submit a patch for this. Since this allows me to solve my 
specific issue, and since Sebastian raised some questions regarding byDomain, I 
will not proceed with that currently.

> -Original Message-
> From: Markus Jelsma 
> Sent: 05 March 2018 14:41
> To: user@nutch.apache.org
> Subject: RE: Why doesn't hostdb support byDomain mode?
> 
> Ah, well, that is a good one! I took me a while to figure it out, but having 
> the
> check there is an error. We had added the same check in an earlier different
> Nutch job where the database itself could remove itself just by the rules it
> emitted and host normalized enabled.
> 
> I simply reused the job setup code and forgot to remove that check. You can
> safely remove that check in HostDB.
> 
> Regards,
> Markus
> 
> 
> -Original message-
> > From:Yossi Tamari 
> > Sent: Monday 5th March 2018 11:30
> > To: user@nutch.apache.org
> > Subject: RE: Why doesn't hostdb support byDomain mode?
> >
> > Thanks Markus, I will open a ticket and submit a patch.
> > One follow up question: UpdateHostDb checks and throws an exception if
> urlnormalizer-host (which can be used to mitigate the problem I mentioned) is
> enabled. Is that also an internal decision of OpenIndex, and perhaps should be
> removed now that the code is part of Nutch, or is there a reason this 
> normalizer
> must not be used with UpdateHostDb?
> >
> > Yossi.
> >
> > > -Original Message-
> > > From: Markus Jelsma 
> > > Sent: 05 March 2018 12:22
> > > To: user@nutch.apache.org
> > > Subject: RE: Why doesn't hostdb support byDomain mode?
> > >
> > > Hi,
> > >
> > > The reason is simple, we (company) needed this information based on
> > > hostname, so we made a hostdb. I don't see any downside for
> > > supporting a domain mode. Adding support for it through
> > > hostdb.url.mode seems like a good idea.
> > >
> > > Regards,
> > > Markus
> > >
> > > -Original message-
> > > > From:Yossi Tamari 
> > > > Sent: Sunday 4th March 2018 12:01
> > > > To: user@nutch.apache.org
> > > > Subject: Why doesn't hostdb support byDomain mode?
> > > >
> > > > Hi,
> > > >
> > > >
> > > >
> > > > Is there a reason that hostdb provides per-host data even when the
> > > > generate/fetch are working by domain? This generates misleading
> > > > statistics for servers that load-balance by redirecting to nodes (e.g.
> > > photobucket).
> > > >
> > > > If this is just an oversight, I can contribute a patch, but I'm
> > > > not sure if I should use partition.url.mode, generate.count.mode,
> > > > one of the other similar properties, or create one more such
> > > > property
> > > hostdb.url.mode.
> > > >
> > > >
> > > >
> > > > Yossi.
> > > >
> > > >
> >
> >



RE: Regarding Internal Links

2018-03-05 Thread Yossi Tamari
You will need to write an HTML Parser Filter plugin. It receives the DOM of
the document as a parameter; you will have to scan it, isolate the relevant
sections, and then extract the content of those sections (probably copying
code from the HTML parser). Your filter returns a ParseResult, which is really
a Map from the URL (with the anchor, in your case) to a Parse object, which
the HTML parser creates. You can have as many of these as you want, as long as
the URLs are different.

This is going to require that you dive into the code of the HTML Parser or
the Tika Parser; there is no way around this.
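
To make the first step a bit more concrete, here is a small self-contained
sketch of walking a DOM and collecting the text under each heading. It uses
plain JDK XML parsing on well-formed markup and only illustrates the idea; it
is not the HTML parser's own DOM handling:

    import java.io.StringReader;
    import java.util.LinkedHashMap;
    import java.util.Map;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Element;
    import org.w3c.dom.Node;
    import org.w3c.dom.NodeList;
    import org.xml.sax.InputSource;

    public class SectionSplitSketch {

      public static void main(String[] args) throws Exception {
        // Well-formed markup just for the demo; in a parse filter the DOM is
        // handed to you by Nutch.
        String page = "<body>"
            + "<h2 id=\"intro\">Intro</h2><p>First section text.</p>"
            + "<h2 id=\"usage\">Usage</h2><p>Second section text.</p>"
            + "</body>";

        NodeList children = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(page)))
            .getDocumentElement().getChildNodes();

        Map<String, StringBuilder> sections = new LinkedHashMap<>();
        String current = null;
        for (int i = 0; i < children.getLength(); i++) {
          Node child = children.item(i);
          if (child instanceof Element && "h2".equals(child.getNodeName())) {
            current = "#" + ((Element) child).getAttribute("id"); // section key = anchor
            sections.put(current, new StringBuilder());
          } else if (current != null) {
            sections.get(current).append(child.getTextContent()).append(' ');
          }
        }

        // Each entry would become one ParseText/ParseData pair keyed by url + anchor.
        sections.forEach((anchor, text) ->
            System.out.println(anchor + " -> " + text.toString().trim()));
      }
    }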

> -Original Message-
> From: Yash Thenuan Thenuan 
> Sent: 05 March 2018 13:59
> To: user@nutch.apache.org
> Subject: Re: Regarding Internal Links
> 
> Please help me out regarding this.
> It's urgent.
> 
> On 5 Mar 2018 15:41, "Yash Thenuan Thenuan"  wrote:
> 
> > How can I achieve this in nutch 1.x?
> >
> > On 1 Mar 2018 22:30, "Sebastian Nagel" 
> wrote:
> >
> >> Hi,
> >>
> >> Yes, that's possible but only for Nutch 1.x:
> >> a ParseResult [1] may contain multiple ParseData objects each
> >> accessible by a separate URL.
> >> This feature is not available for 2.x [2].
> >>
> >> It's used by the feed parser plugin to add a single entry for every
> >> feed item.  Afaik, that's not supported out of the box for sections
> >> of a page (e.g., split by anchors or h1/h2/h3). You would need to
> >> write a parse-filter plugin to achieve this.
> >>
> >> I've once used it to index parts of a page identified by XPath
> >> expressions.
> >>
> >> Best,
> >> Sebastian
> >>
> >> [1] https://nutch.apache.org/apidocs/apidocs-1.14/org/apache/
> >> nutch/parse/ParseResult.html
> >> [2] https://nutch.apache.org/apidocs/apidocs-2.3.1/org/apache/
> >> nutch/parse/Parse.html
> >>
> >>
> >> On 03/01/2018 08:02 AM, Yash Thenuan Thenuan wrote:
> >> > Hi there,
> >> > For example we have a url
> >> > https://wiki.apache.org/nutch/NutchTutorial#Table_of_Contents
> >> > here #table_of _contents is a internal link.
> >> > I want to separate the contents of the page on the basis of
> >> > internal
> >> links.
> >> > Is this possible in nutch??
> >> > I want to index the contents of each internal link separately.
> >> >
> >>
> >>



RE: Why doesn't hostdb support byDomain mode?

2018-03-05 Thread Yossi Tamari
Hi Sebastian,

So do you think this fix should be avoided? I wouldn't want to add something 
that will cause problems for users down the line, but, frankly, I can think of 
examples of domains that intend their robots.txt to apply across servers and 
protocols (crawl-delay), but I can't think of any that mean the opposite, 
standards aside.

Yossi.

> -Original Message-
> From: Sebastian Nagel 
> Sent: 05 March 2018 12:50
> To: user@nutch.apache.org
> Subject: Re: Why doesn't hostdb support byDomain mode?
> 
> Hi Yossi, hi Markus,
> 
> we should keep on the radar that some features will not work properly with a
> domain-level hostdb:
> - DNS checks in UpdateHostDbReducer
> - SitemapProcessor tries to find sitemaps announced in the host's robots.txt
> (there may be more)
> 
> > Adding support for it through hostdb.url.mode seems like a good idea.
> 
> Yes! We already have three of them:
>   generate.count.mode
>   partition.url.mode
>   fetcher.queue.mode
> Better to keep it also as a separate property for the HostDb.
> In fact you may even set them to different values if you know what you do.
> 
> Btw., the fact that robots.txt is per host (and also protocol/port) also 
> affects the
> fetcher in domain mode: the robots.txt may define a custom crawl-delay, with
> multiple hosts per domain there is no guarantee that it is used. Also one 
> large
> delay could be used accidentally for the entire domain.
> 
> Sebastian
> 
> On 03/05/2018 11:21 AM, Markus Jelsma wrote:
> > Hi,
> >
> > The reason is simple, we (company) needed this information based on
> hostname, so we made a hostdb. I don't see any downside for supporting a
> domain mode. Adding support for it through hostdb.url.mode seems like a good
> idea.
> >
> > Regards,
> > Markus
> >
> > -Original message-
> >> From:Yossi Tamari 
> >> Sent: Sunday 4th March 2018 12:01
> >> To: user@nutch.apache.org
> >> Subject: Why doesn't hostdb support byDomain mode?
> >>
> >> Hi,
> >>
> >>
> >>
> >> Is there a reason that hostdb provides per-host data even when the
> >> generate/fetch are working by domain? This generates misleading
> >> statistics for servers that load-balance by redirecting to nodes (e.g.
> photobucket).
> >>
> >> If this is just an oversight, I can contribute a patch, but I'm not
> >> sure if I should use partition.url.mode, generate.count.mode, one of
> >> the other similar properties, or create one more such property
> hostdb.url.mode.
> >>
> >>
> >>
> >> Yossi.
> >>
> >>




RE: Why doesn't hostdb support byDomain mode?

2018-03-05 Thread Yossi Tamari
Thanks Markus, I will open a ticket and submit a patch.
One follow up question: UpdateHostDb checks and throws an exception if 
urlnormalizer-host (which can be used to mitigate the problem I mentioned) is 
enabled. Is that also an internal decision of OpenIndex, and perhaps should be 
removed now that the code is part of Nutch, or is there a reason this 
normalizer must not be used with UpdateHostDb?

Yossi.

> -Original Message-
> From: Markus Jelsma 
> Sent: 05 March 2018 12:22
> To: user@nutch.apache.org
> Subject: RE: Why doesn't hostdb support byDomain mode?
> 
> Hi,
> 
> The reason is simple, we (company) needed this information based on
> hostname, so we made a hostdb. I don't see any downside for supporting a
> domain mode. Adding support for it through hostdb.url.mode seems like a good
> idea.
> 
> Regards,
> Markus
> 
> -Original message-
> > From:Yossi Tamari 
> > Sent: Sunday 4th March 2018 12:01
> > To: user@nutch.apache.org
> > Subject: Why doesn't hostdb support byDomain mode?
> >
> > Hi,
> >
> >
> >
> > Is there a reason that hostdb provides per-host data even when the
> > generate/fetch are working by domain? This generates misleading
> > statistics for servers that load-balance by redirecting to nodes (e.g.
> photobucket).
> >
> > If this is just an oversight, I can contribute a patch, but I'm not
> > sure if I should use partition.url.mode, generate.count.mode, one of
> > the other similar properties, or create one more such property
> hostdb.url.mode.
> >
> >
> >
> > Yossi.
> >
> >



Why doesn't hostdb support byDomain mode?

2018-03-04 Thread Yossi Tamari
Hi,

 

Is there a reason that hostdb provides per-host data even when the
generate/fetch are working by domain? This generates misleading statistics
for servers that load-balance by redirecting to nodes (e.g. photobucket).

If this is just an oversight, I can contribute a patch, but I'm not sure if
I should use partition.url.mode, generate.count.mode, one of the other
similar properties, or create one more such property hostdb.url.mode.

 

Yossi.



RE: Regarding Indexing to elasticsearch

2018-02-28 Thread Yossi Tamari
Sorry, I just realized that you're using Nutch 2.x and I'm answering for Nutch 
1.x. I'm afraid I can't help you.

> -Original Message-
> From: Yash Thenuan Thenuan [mailto:rit2014...@iiita.ac.in]
> Sent: 28 February 2018 14:20
> To: user@nutch.apache.org
> Subject: RE: Regarding Indexing to elasticsearch
> 
> IndexingJob ( | -all |-reindex) [-crawlId ] This is the output of
> nutch index i have already configured the nutch-site.xml.
> 
> On 28 Feb 2018 17:41, "Yossi Tamari"  wrote:
> 
> > I suggest you run "nutch index", take a look at the returned help
> > message, and continue from there.
> > Broadly, first of all you need to configure your elasticsearch
> > environment in nutch-site.xml, and then you need to run nutch index
> > with the location of your CrawlDB and either the segment you want to
> > index or the directory that contains all the segments you want to index.
> >
> > > -Original Message-
> > > From: Yash Thenuan Thenuan [mailto:rit2014...@iiita.ac.in]
> > > Sent: 28 February 2018 14:06
> > > To: user@nutch.apache.org
> > > Subject: RE: Regarding Indexing to elasticsearch
> > >
> > > All I want  is to index my parsed data to elasticsearch.
> > >
> > >
> > > On 28 Feb 2018 17:34, "Yossi Tamari"  wrote:
> > >
> > > Hi Yash,
> > >
> > > The nutch index command does not have a -all flag, so I'm not sure
> > > what
> > you're
> > > trying to achieve here.
> > >
> > > Yossi.
> > >
> > > > -Original Message-
> > > > From: Yash Thenuan Thenuan [mailto:rit2014...@iiita.ac.in]
> > > > Sent: 28 February 2018 13:55
> > > > To: user@nutch.apache.org
> > > > Subject: Regarding Indexing to elasticsearch
> > > >
> > > > Can somebody please tell me what happens when we hit the bin/nutc
> > > > index
> > > -all
> > > > command.
> > > > Because I can't figure out why the write function inside the
> > > elastic-indexer is not
> > > > getting executed.
> >
> >



RE: Regarding Indexing to elasticsearch

2018-02-28 Thread Yossi Tamari
I suggest you run "nutch index", take a look at the returned help message, and 
continue from there. 
Broadly, first of all you need to configure your elasticsearch environment in 
nutch-site.xml, and then you need to run nutch index with the location of your 
CrawlDB and either the segment you want to index or the directory that contains 
all the segments you want to index.

> -Original Message-
> From: Yash Thenuan Thenuan [mailto:rit2014...@iiita.ac.in]
> Sent: 28 February 2018 14:06
> To: user@nutch.apache.org
> Subject: RE: Regarding Indexing to elasticsearch
> 
> All I want  is to index my parsed data to elasticsearch.
> 
> 
> On 28 Feb 2018 17:34, "Yossi Tamari"  wrote:
> 
> Hi Yash,
> 
> The nutch index command does not have a -all flag, so I'm not sure what you're
> trying to achieve here.
> 
> Yossi.
> 
> > -Original Message-
> > From: Yash Thenuan Thenuan [mailto:rit2014...@iiita.ac.in]
> > Sent: 28 February 2018 13:55
> > To: user@nutch.apache.org
> > Subject: Regarding Indexing to elasticsearch
> >
> > Can somebody please tell me what happens when we hit the bin/nutc
> > index
> -all
> > command.
> > Because I can't figure out why the write function inside the
> elastic-indexer is not
> > getting executed.



RE: Regarding Indexing to elasticsearch

2018-02-28 Thread Yossi Tamari
Hi Yash,

The nutch index command does not have a -all flag, so I'm not sure what you're 
trying to achieve here.

Yossi.

> -Original Message-
> From: Yash Thenuan Thenuan [mailto:rit2014...@iiita.ac.in]
> Sent: 28 February 2018 13:55
> To: user@nutch.apache.org
> Subject: Regarding Indexing to elasticsearch
> 
> Can somebody please tell me what happens when we hit the bin/nutc index -all
> command.
> Because I can't figure out why the write function inside the elastic-indexer 
> is not
> getting executed.



RE: Nutch pointed to Cassandra, yet, asks for Hadoop

2018-02-23 Thread Yossi Tamari
I use Nutch 1.X, so I can't really answer your question. However, the point of 
Nutch 2.X is to replace HDFS with other storage options. MR is still required.


> -Original Message-
> From: Kaliyug Antagonist [mailto:kaliyugantagon...@gmail.com]
> Sent: 23 February 2018 23:49
> To: user@nutch.apache.org
> Subject: RE: Nutch pointed to Cassandra, yet, asks for Hadoop
> 
> So what's the whole point of supporting Cassandra or other databases(via
> Gora) if Hadoop(HDFS & MR)both are essential? What exactly Cassandra would
> be doing ?
> 
> On 23 Feb 2018 22:41, "Yossi Tamari"  wrote:
> 
> > 1 is not true.
> > 2 is true, if we ignore the second part 😊
> > Hadoop is made of two parts: distributed storage (HDFS) and a
> > Map/Reduce framework. Nutch is essentially a collection of Map/Reduce
> > tasks. It relies on Hadoop to distribute these tasks to all
> > participating servers. So if you run in local mode, you can only use
> > one server. If you have a single-node Hadoop, Nutch will be able to
> > fully utilize the server, but it will still be limited to crawling
> > from one machine, which is only sufficient for small/slow crawls.
> >
> > > -Original Message-
> > > From: Kaliyug Antagonist [mailto:kaliyugantagon...@gmail.com]
> > > Sent: 23 February 2018 23:16
> > > To: user@nutch.apache.org
> > > Subject: RE: Nutch pointed to Cassandra, yet, asks for Hadoop
> > >
> > > Ohh. I'm a bit confused. What of the following is true in the 'deploy'
> > mode:
> > > 1. Data cannot be stored in Cassandra, HBase is the only way.
> > > 2. Data will be stored in Cassandra but you need a (maybe, just a
> > > single node)Hadoop cluster anyway which won't be storing any data
> > > but is there
> > just to
> > > make Nutch happy.
> > >
> > > On 23 Feb 2018 22:08, "Yossi Tamari"  wrote:
> > >
> > > > Hi Kaliyug,
> > > >
> > > > Nutch 2 still requires Hadoop to run, it just allows you to store
> > > > data somewhere other than HDFS.
> > > > The only way to run Nutch without Hadoop is local mode, which is
> > > > only recommended for testing. To do that, run ./runtime/local/bin/crawl.
> > > >
> > > > Yossi.
> > > >
> > > > > -Original Message-
> > > > > From: Kaliyug Antagonist [mailto:kaliyugantagon...@gmail.com]
> > > > > Sent: 23 February 2018 20:26
> > > > > To: user@nutch.apache.org
> > > > > Subject: Nutch pointed to Cassandra, yet, asks for Hadoop
> > > > >
> > > > > Windows 10 Nutch 2.3.1 Cassandra 3.11.1
> > > > >
> > > > > I have extracted and built Nutch under the Cygwin's home directory.
> > > > >
> > > > > I believe that the Cassandra server is working:
> > > > >
> > > > > INFO  [main] 2018-02-23 16:20:41,077 StorageService.java:1442 -
> > > > > JOINING: Finish joining ring
> > > > > INFO  [main] 2018-02-23 16:20:41,820
> > > > > SecondaryIndexManager.java:509
> > > > > - Executing pre-join tasks for: CFS(Keyspace='test',
> > > > > ColumnFamily='test')
> > > > > INFO  [main] 2018-02-23 16:20:42,161 StorageService.java:2268 -
> > > > > Node
> > > > > localhost/127.0.0.1 state jump to NORMAL INFO  [main] 2018-02-23
> > > > > 16:20:43,049 NativeTransportService.java:75 - Netty using Java
> > > > > NIO event
> > > > loop
> > > > > INFO  [main] 2018-02-23 16:20:43,358 Server.java:155 - Using
> > > > > Netty
> > > > > Version: [netty-buffer=netty-buffer-4.0.44.Final.452812a,
> > > > > netty-codec=netty-codec-4.0.44.Final.452812a,
> > > > > netty-codec-haproxy=netty-codec-haproxy-4.0.44.Final.452812a,
> > > > > netty-codec-http=netty-codec-http-4.0.44.Final.452812a,
> > > > > netty-codec-socks=netty-codec-socks-4.0.44.Final.452812a,
> > > > > netty-common=netty-common-4.0.44.Final.452812a,
> > > > > netty-handler=netty-handler-4.0.44.Final.452812a,
> > > > > netty-tcnative=netty-tcnative-1.1.33.Fork26.142ecbb,
> > > > > netty-transport=netty-transport-4.0.44.Final.452812a,
> > > > > netty-transport-native-epoll=netty-transport-native-epoll-
> > > > 4.0.44.Final.452812a,
> > > > > netty-transport-rxtx=netty-transpo

RE: Nutch pointed to Cassandra, yet, asks for Hadoop

2018-02-23 Thread Yossi Tamari
1 is not true.
2 is true, if we ignore the second part 😊
Hadoop is made of two parts: distributed storage (HDFS) and a Map/Reduce 
framework. Nutch is essentially a collection of Map/Reduce tasks. It relies on 
Hadoop to distribute these tasks to all participating servers. So if you run in 
local mode, you can only use one server. If you have a single-node Hadoop, 
Nutch will be able to fully utilize the server, but it will still be limited to 
crawling from one machine, which is only sufficient for small/slow crawls.

> -Original Message-
> From: Kaliyug Antagonist [mailto:kaliyugantagon...@gmail.com]
> Sent: 23 February 2018 23:16
> To: user@nutch.apache.org
> Subject: RE: Nutch pointed to Cassandra, yet, asks for Hadoop
> 
> Ohh. I'm a bit confused. What of the following is true in the 'deploy' mode:
> 1. Data cannot be stored in Cassandra, HBase is the only way.
> 2. Data will be stored in Cassandra but you need a (maybe, just a single
> node)Hadoop cluster anyway which won't be storing any data but is there just 
> to
> make Nutch happy.
> 
> On 23 Feb 2018 22:08, "Yossi Tamari"  wrote:
> 
> > Hi Kaliyug,
> >
> > Nutch 2 still requires Hadoop to run, it just allows you to store data
> > somewhere other than HDFS.
> > The only way to run Nutch without Hadoop is local mode, which is only
> > recommended for testing. To do that, run ./runtime/local/bin/crawl.
> >
> > Yossi.
> >
> > > -Original Message-
> > > From: Kaliyug Antagonist [mailto:kaliyugantagon...@gmail.com]
> > > Sent: 23 February 2018 20:26
> > > To: user@nutch.apache.org
> > > Subject: Nutch pointed to Cassandra, yet, asks for Hadoop
> > >
> > > Windows 10 Nutch 2.3.1 Cassandra 3.11.1
> > >
> > > I have extracted and built Nutch under the Cygwin's home directory.
> > >
> > > I believe that the Cassandra server is working:
> > >
> > > INFO  [main] 2018-02-23 16:20:41,077 StorageService.java:1442 -
> > > JOINING: Finish joining ring
> > > INFO  [main] 2018-02-23 16:20:41,820 SecondaryIndexManager.java:509
> > > - Executing pre-join tasks for: CFS(Keyspace='test',
> > > ColumnFamily='test')
> > > INFO  [main] 2018-02-23 16:20:42,161 StorageService.java:2268 - Node
> > > localhost/127.0.0.1 state jump to NORMAL INFO  [main] 2018-02-23
> > > 16:20:43,049 NativeTransportService.java:75 - Netty using Java NIO
> > > event
> > loop
> > > INFO  [main] 2018-02-23 16:20:43,358 Server.java:155 - Using Netty
> > > Version: [netty-buffer=netty-buffer-4.0.44.Final.452812a,
> > > netty-codec=netty-codec-4.0.44.Final.452812a,
> > > netty-codec-haproxy=netty-codec-haproxy-4.0.44.Final.452812a,
> > > netty-codec-http=netty-codec-http-4.0.44.Final.452812a,
> > > netty-codec-socks=netty-codec-socks-4.0.44.Final.452812a,
> > > netty-common=netty-common-4.0.44.Final.452812a,
> > > netty-handler=netty-handler-4.0.44.Final.452812a,
> > > netty-tcnative=netty-tcnative-1.1.33.Fork26.142ecbb,
> > > netty-transport=netty-transport-4.0.44.Final.452812a,
> > > netty-transport-native-epoll=netty-transport-native-epoll-
> > 4.0.44.Final.452812a,
> > > netty-transport-rxtx=netty-transport-rxtx-4.0.44.Final.452812a,
> > > netty-transport-sctp=netty-transport-sctp-4.0.44.Final.452812a,
> > > netty-transport-udt=netty-transport-udt-4.0.44.Final.452812a]
> > > INFO  [main] 2018-02-23 16:20:43,359 Server.java:156 - Starting
> > listening for
> > > CQL clients on localhost/127.0.0.1:9042 (unencrypted)...
> > > INFO  [main] 2018-02-23 16:20:43,941 CassandraDaemon.java:527 - Not
> > > starting RPC server as requested. Use JMX
> > > (StorageService->startRPCServer()) or nodetool (enablethrift) to
> > > start
> > it
> > >
> > > I did the following check:
> > >
> > > apache-cassandra-3.11.1\bin>nodetool status
> > > Datacenter: datacenter1
> > > 
> > > Status=Up/Down
> > > |/ State=Normal/Leaving/Joining/Moving
> > > --  AddressLoad   Tokens   Owns (effective)  Host ID
> > > Rack
> > > UN  127.0.0.1  273.97 KiB  256  100.0%
> > > dab932f2-d138-4a1a-acd4-f63cbb16d224  rack1
> > >
> > > csql connects
> > >
> > > apache-cassandra-3.11.1\bin>cqlsh
> > >
> > > WARNING: console codepage must be set to cp65001 to support utf-8
> > encoding
> > > on Windows platforms.
> > &

RE: Nutch pointed to Cassandra, yet, asks for Hadoop

2018-02-23 Thread Yossi Tamari
Hi Kaliyug,

Nutch 2 still requires Hadoop to run, it just allows you to store data 
somewhere other than HDFS.
The only way to run Nutch without Hadoop is local mode, which is only 
recommended for testing. To do that, run ./runtime/local/bin/crawl.

Yossi.

> -Original Message-
> From: Kaliyug Antagonist [mailto:kaliyugantagon...@gmail.com]
> Sent: 23 February 2018 20:26
> To: user@nutch.apache.org
> Subject: Nutch pointed to Cassandra, yet, asks for Hadoop
> 
> Windows 10 Nutch 2.3.1 Cassandra 3.11.1
> 
> I have extracted and built Nutch under the Cygwin's home directory.
> 
> I believe that the Cassandra server is working:
> 
> INFO  [main] 2018-02-23 16:20:41,077 StorageService.java:1442 -
> JOINING: Finish joining ring
> INFO  [main] 2018-02-23 16:20:41,820 SecondaryIndexManager.java:509 -
> Executing pre-join tasks for: CFS(Keyspace='test',
> ColumnFamily='test')
> INFO  [main] 2018-02-23 16:20:42,161 StorageService.java:2268 - Node
> localhost/127.0.0.1 state jump to NORMAL INFO  [main] 2018-02-23
> 16:20:43,049 NativeTransportService.java:75 - Netty using Java NIO event loop
> INFO  [main] 2018-02-23 16:20:43,358 Server.java:155 - Using Netty
> Version: [netty-buffer=netty-buffer-4.0.44.Final.452812a,
> netty-codec=netty-codec-4.0.44.Final.452812a,
> netty-codec-haproxy=netty-codec-haproxy-4.0.44.Final.452812a,
> netty-codec-http=netty-codec-http-4.0.44.Final.452812a,
> netty-codec-socks=netty-codec-socks-4.0.44.Final.452812a,
> netty-common=netty-common-4.0.44.Final.452812a,
> netty-handler=netty-handler-4.0.44.Final.452812a,
> netty-tcnative=netty-tcnative-1.1.33.Fork26.142ecbb,
> netty-transport=netty-transport-4.0.44.Final.452812a,
> netty-transport-native-epoll=netty-transport-native-epoll-4.0.44.Final.452812a,
> netty-transport-rxtx=netty-transport-rxtx-4.0.44.Final.452812a,
> netty-transport-sctp=netty-transport-sctp-4.0.44.Final.452812a,
> netty-transport-udt=netty-transport-udt-4.0.44.Final.452812a]
> INFO  [main] 2018-02-23 16:20:43,359 Server.java:156 - Starting listening for
> CQL clients on localhost/127.0.0.1:9042 (unencrypted)...
> INFO  [main] 2018-02-23 16:20:43,941 CassandraDaemon.java:527 - Not
> starting RPC server as requested. Use JMX
> (StorageService->startRPCServer()) or nodetool (enablethrift) to start it
> 
> I did the following check:
> 
> apache-cassandra-3.11.1\bin>nodetool status
> Datacenter: datacenter1
> 
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  AddressLoad   Tokens   Owns (effective)  Host ID
> Rack
> UN  127.0.0.1  273.97 KiB  256  100.0%
> dab932f2-d138-4a1a-acd4-f63cbb16d224  rack1
> 
> csql connects
> 
> apache-cassandra-3.11.1\bin>cqlsh
> 
> WARNING: console codepage must be set to cp65001 to support utf-8 encoding
> on Windows platforms.
> If you experience encoding problems, change your console codepage with 'chcp
> 65001' before starting cqlsh.
> 
> Connected to Test Cluster at 127.0.0.1:9042.
> [cqlsh 5.0.1 | Cassandra 3.11.1 | CQL spec 3.4.4 | Native protocol v4] Use 
> HELP
> for help.
> WARNING: pyreadline dependency missing.  Install to enable tab completion.
> cqlsh> describe keyspaces
> 
> system_schema  system_auth  system  system_distributed  test  system_traces
> 
> I followed the tutorial 'Setting up NUTCH 2.x with CASSANDRA
> ' and added the respective
> entries in the properties and the xml files.
> 
> I go to the Cygwin prompt and attempt to crawl. Instead of using Cassandra, it
> asks for Hadoop (HBase, probably):
> 
> /home/apache-nutch-2.3.1
> $ ./runtime/deploy/bin/crawl urls/ crawl/ 1 No SOLRURL specified. Skipping
> indexing.
> which: no hadoop in () Can't find Hadoop
> executable. Add HADOOP_HOME/bin to the path or run in local mode.
> 



RE: Internal links appear to be external in Parse. Improvement of the crawling quality

2018-02-20 Thread Yossi Tamari
Hi Semyon,

Wouldn't setting db.ignore.external.links.mode=byDomain solve your wincs.be 
issue?
As far as I can see, the protocol (HTTP/HTTPS) does not play any part in
deciding whether this is the same domain.
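
For reference, a minimal nutch-site.xml sketch of that setting (property names as in nutch-default.xml; note that db.ignore.external.links itself also needs to be true for the mode to take effect):

    <property>
      <name>db.ignore.external.links</name>
      <value>true</value>
    </property>
    <property>
      <name>db.ignore.external.links.mode</name>
      <value>byDomain</value>
    </property>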

Yossi.

> -Original Message-
> From: Semyon Semyonov [mailto:semyon.semyo...@mail.com]
> Sent: 20 February 2018 20:43
> To: usernutch.apache.org 
> Subject: Internal links appear to be external in Parse. Improvement of the
> crawling quality
> 
> Dear All,
> 
> I'm trying to increase quality of the crawling. A part of my database has
> DB_FETCHED = 1.
> 
> Example, http://www.wincs.be/ in seed list.
> 
> The root of the problem is in Parse/ParseOutputFormat.java line 364 - 374
> 
> Nutch considers one of the links (http://wincs.be/lakindustrie.html) external
> and therefore rejects it.
> 
> 
> If I insert http://wincs.be in seed file, everything works fine.
> 
> Do you think this is good behavior? I mean, formally these are indeed two
> different domains, but from the user's perspective they are exactly the same.
> 
> And if this is the default behavior, how can I fix it for my case? The same
> question applies to similar switches such as http -> https.
> 
> Thanks.
> 
> Semyon.



RE: Bayan Group Extractor plugin for Nutch-Spanish Accent Character Issue

2018-01-26 Thread Yossi Tamari
Hi Rushikesh,

I don't have any experience with this specific plugin, but I have run across 
similar problems, with 2 possible reasons:
1. It is possible that this specific site does not properly declare what 
encoding it is using, and the browser guesses the correct one.
2. You may have run across https://issues.apache.org/jira/browse/NUTCH-1807. I 
solved a similar problem by setting the environment variable LC_ALL to 
en_US.UTF-8 for all Hadoop processes (more specifically, adding `export 
LC_ALL=en_US.UTF-8` in ~hadoop/.bashrc on all Hadoop machines solved the 
problem for me).

Yossi.

> -Original Message-
> From: Rushi [mailto:rushikeshmod...@gmail.com]
> Sent: 25 January 2018 16:32
> To: user@nutch.apache.org; Mark Vega 
> Subject: Bayan Group Extractor plugin for Nutch-Spanish Accent Character Issue
> 
> Hello Everyone,
> I am having an issue while crawling a Spanish website; some of the accent
> characters are not converting properly.
> Here is an example: Infección (wrong one) should be Infección (correct).
> 
> Note: this is with the *Bayan Group Extractor plugin.* Is there any change that I
> need to make so the characters convert correctly?
> 
> --
> Regards
> Rushikesh M
> .Net Developer



RE: Usage previous stage HostDb data for generate(fetched deltas)

2017-12-15 Thread Yossi Tamari
Hi Semyon,

Maybe I'm missing the point, but I don't see why you would want to do this.
On one hand, if there is only 1 URL per cycle, why not fetch it? The cost is 
negligible.
On the other hand, imagine this scenario: You find the first link to some host 
from another host, and you crawl it. But it happens to be some "leaf" document 
that has no links (or maybe it has a homepage link only), so your delta 
condition is not satisfied. Later you find another link to this host from 
another host, this time to the homepage, where you can find all the "good" 
links, but you will not crawl it, because your delta condition is still not 
satisfied.
What am I missing?

Yossi.

> -Original Message-
> From: Semyon Semyonov [mailto:semyon.semyo...@mail.com]
> Sent: 14 December 2017 15:08
> To: usernutch.apache.org 
> Subject: Usage previous stage HostDb data for generate(fetched deltas)
> 
> Dear all,
> 
> I plan to improve hostdb functionality to have a DB_FETCHED delta for generate
> stage.
> 
> Let's say for each website we have a generate condition of: generate while number of
> fetched < 150.
> The problem is that for some websites that condition will (almost) never be
> satisfied,
> because of the site's structure.
> 
> For example
> 1) Round1. 1 page
> 2) Round2. 10 pages
> 3) Round3. 80 pages
> 4) Round 4. 1 page
> 5) Round 5. 1 page
> ...etc.
> 
> I would like to add a delta condition for fetched that describes the speed of
> the
> process. Let's say: generate while number of fetched < 150 && delta_fetched > 1.
> In this case the process should stop at round 5 with a total number of
> fetched equal to 92.
> 
> To make it I plan to modify updatehostdb function and add delta variable in
> hostdatum for fetched.
> 
> Do you think it is a good idea to make it in such a way?
> 
> Semyon.



RE: readseg dump and non-ASCII characters

2017-12-14 Thread Yossi Tamari
Hi Michael,

Not directly answering this question, but keep in mind that as mentioned in the 
issue Sebastian referenced, there are many more places in Nutch that have the 
same problem, so setting LC_ALL is probably a good idea in general (until that 
issue is fixed...).
If you're worried about other applications, I believe passing 
`-DLC_ALL=en_US.utf8` as a parameter to all Nutch jobs should also work.

Yossi.


> -Original Message-
> From: Michael Coffey [mailto:mcof...@yahoo.com.INVALID]
> Sent: 14 December 2017 20:30
> To: user@nutch.apache.org
> Subject: Re: readseg dump and non-ASCII characters
> 
> Not sure it's practical to go around to all the hadoop machines and change 
> their
> default encoding settings. Not sure it wouldn't break something else!
> 
> I'm wondering if there's a simple fix I could make to the source code to make
> nutch.segment.SegmentReader use utf-8 as a default when reading the segment
> data.
> 
> 
> 
> In SegmentReader.java, the only obvious file-reading code I see is in this 
> append
> function.
>   private int append(FileSystem fs, Configuration conf, Path src,
>       PrintWriter writer, int currentRecordNumber) throws IOException {
>     BufferedReader reader = new BufferedReader(new InputStreamReader(
>         fs.open(src)));
>     try {
>       String line = reader.readLine();
>       while (line != null) {
>         if (line.startsWith("Recno:: ")) {
>           line = "Recno:: " + currentRecordNumber++;
>         }
>         writer.println(line);
>         line = reader.readLine();
>       }
>       return currentRecordNumber;
>     } finally {
>       reader.close();
>     }
>   }
> 
> 
> SegmentReader has three different lines that create an OutputStreamWriter.
> Two of those explicitly use "UTF-8", but the one that creates a PrintWriter
> implicitly uses default encoding.
> 
> If I insert a "UTF-8" arg into the InputStreamReader and OutputStreamWriter
> constructors, should that work? Is it likely to break something else?
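
A sketch of the change being asked about (illustration only; outStream stands in for whatever stream SegmentReader actually wraps, and NUTCH-1807 tracks the proper fix):

    // reader side: decode the dumped segment data explicitly as UTF-8
    BufferedReader reader = new BufferedReader(
        new InputStreamReader(fs.open(src), java.nio.charset.StandardCharsets.UTF_8));

    // writer side: the PrintWriter should wrap a UTF-8 OutputStreamWriter as well
    PrintWriter writer = new PrintWriter(
        new OutputStreamWriter(outStream, java.nio.charset.StandardCharsets.UTF_8));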
> 
> 
> 
> 
> 
> 
> 
> 
> 
> From: Sebastian Nagel 
> To: user@nutch.apache.org
> Sent: Wednesday, November 15, 2017 5:18 AM
> Subject: Re: readseg dump and non-ASCII characters
> 
> 
> 
> Hi Michael,
> 
> from the arguments I guess you're interested in the raw/binary HTML content,
> right?
> After a closer look I have no simple answer:
> 
> 1. HTML has no fixed encoding - it could be anything; pageA may have a
> different
> encoding than pageB.
> 
> 2. That's different for parsed text: it's a Java String internally
> 
> 3. "readseg dump" converts all data to a Java String using the default 
> platform
> encoding. On Linux having these locales installed you may get different 
> results
> for:
>LC_ALL=en_US.utf8  ./bin/nutch readseg -dump
>LC_ALL=en_US   ./bin/nutch readseg -dump
>LC_ALL=ru_RU   ./bin/nutch readseg -dump
> If in doubt, try setting UTF-8 as your platform encoding. Most pages nowadays
> are UTF-8.
> Btw., this behavior isn't ideal; it should be fixed as part of NUTCH-1807.
> 
> 4. A more reliable solution would require detecting the HTML encoding (the
> code
> is available
> in Nutch) and then converting the byte[] content using the right encoding.
> 
> Best,
> Sebastian
> 
> 
> 
> 
> On 11/15/2017 02:20 AM, Michael Coffey wrote:
> > Greetings Nutchlings,
> > I have been using readseg-dump successfully to retrieve content crawled by
> nutch, but I have one significant problem: many non-ASCII characters appear as
> '???' in the dumped text file. This happens fairly frequently in the 
> headlines of
> news sites that I crawl, for things like quotes, apostrophes, and dashes.
> > Am I doing something wrong, or is this a known bug? I use a python utf8
> decoder, so it would be nice if everything were UTF8.
> > Here is the command that I use to dump each segment (using nutch 1.12):
> > bin/nutch readseg -dump segPath destPath -noparse -noparsedata -noparsetext -nogenerate
> > It is so close to working perfectly!
> >



RE: purging low-scoring urls

2017-12-04 Thread Yossi Tamari
Forgot to say: a urlfilter can't do that, since its input is just the URL, 
without any metadata such as the score.

> -Original Message-
> From: Yossi Tamari [mailto:yossi.tam...@pipl.com]
> Sent: 04 December 2017 21:01
> To: user@nutch.apache.org; 'Michael Coffey' 
> Subject: RE: purging low-scoring urls
> 
> Hi Michael,
> 
> I think one way you can do it is using `readdb  -dump new_crawldb -
> format crawldb -expr "score>0.03" `.
> You would then need to use hdfs commands to replace the existing
> /current with newcrawl_db.
> Of course, I strongly recommend backing up the current crawldb before
> replacing it...
> 
>   Yossi.
> 
> > -Original Message-
> > From: Michael Coffey [mailto:mcof...@yahoo.com.INVALID]
> > Sent: 04 December 2017 20:38
> > To: User 
> > Subject: purging low-scoring urls
> >
> > Is it possible to purge low-scoring urls from the crawldb? My news crawl has
> > many thousands of zero-scoring urls and also many thousands of urls with
> > scores less than 0.03. These urls will never be fetched because they will 
> > never
> > make it into the generator's topN by score. So, all they do is make the 
> > process
> > slower.
> >
> > It seems like something an urlfilter could do, but I have not found any
> > documentation for any urlfilter that does it.




RE: purging low-scoring urls

2017-12-04 Thread Yossi Tamari
Hi Michael,

I think one way you can do it is using `readdb  -dump new_crawldb 
-format crawldb -expr "score>0.03" `.
You would then need to use hdfs commands to replace the existing 
/current with newcrawl_db.
Of course, I strongly recommend backing up the current crawldb before replacing 
it...
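
Spelled out with a hypothetical crawldb path (crawl/crawldb here is just a stand-in), the sequence might look like:

    bin/nutch readdb crawl/crawldb -dump new_crawldb -format crawldb -expr "score>0.03"
    hadoop fs -mv crawl/crawldb/current crawl/crawldb/current.bak
    hadoop fs -mv new_crawldb crawl/crawldb/current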

Yossi. 

> -Original Message-
> From: Michael Coffey [mailto:mcof...@yahoo.com.INVALID]
> Sent: 04 December 2017 20:38
> To: User 
> Subject: purging low-scoring urls
> 
> Is it possible to purge low-scoring urls from the crawldb? My news crawl has
> many thousands of zero-scoring urls and also many thousands of urls with
> scores less than 0.03. These urls will never be fetched because they will 
> never
> make it into the generator's topN by score. So, all they do is make the 
> process
> slower.
> 
> It seems like something an urlfilter could do, but I have not found any
> documentation for any urlfilter that does it.



crawlcomplete

2017-12-04 Thread Yossi Tamari
Hi,

 

I'm trying to understand some of the design decisions behind the
crawlcomplete tool. I find the concept itself very useful, but there are a
couple of behaviors that I don't understand:

1.  URLs that resulted in redirect (even permanent) are counted as
unfetched. That means that if I had a crawl with only one URL, and that URL
returned a redirect, which was fetched successfully, I would see 1 FETCHED
and 1 UNFETCHED in crawlcomplete, and there is no inherent way for me to
know that, really, my crawl is 100% complete. My expectation would be for
URLs that resulted in redirection to not be counted (as they have been
replaced by new URLs), or to be counted in a separate group (which can then
be ignored).
2.  URLs that are db_gone are also counted as unfetched. It seems to me
these URLs were "successfully" crawled. It's the reality of the web that
pages disappear over time, and knowing that this happened is useful. These
URLs do not need to be crawled again, so they should not be counted as
unfetched. I can see why counting them as FETCHED would be confusing, so
maybe the names of the groups should be changed (COMPLETE and INCOMPLETE)
or a new group (GONE) added.

 

Are there good reasons for the current behavior? 



   Yossi.



RE: General question on dealing with file types

2017-11-25 Thread Yossi Tamari
Hi Sol,

Note that you do not need to use a regular expression to filter by file suffix;
the suffix-urlfilter plugin does that.
Obviously, if the URL does not contain the file type, you have to fetch it 
anyway, to get the mime-type. If there is no parser for this file type, it will 
not be parsed and indexed anyway. If there is a parser and you want to disable 
it, I think you can do it in parse-plugins.xml (remove the * rule, and map only 
the mime-types you do want).
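
As an illustration of the parse-plugins.xml idea, a trimmed-down mapping might look like this (a sketch only - the real file in conf/ also has an <aliases> section that must stay consistent):

    <mimeType name="text/html">
      <plugin id="parse-html" />
    </mimeType>
    <mimeType name="application/pdf">
      <plugin id="parse-tika" />
    </mimeType>
    <!-- no <mimeType name="*"> catch-all, so unlisted types are simply not parsed -->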

Yossi.

> -Original Message-
> From: Sol Lederman [mailto:sol.leder...@gmail.com]
> Sent: 25 November 2017 18:57
> To: user@nutch.apache.org
> Subject: General question on dealing with file types
> 
> Like most of you I imagine, I want to capture and index file types from a
> particular set of types. I want to index HTML but I may or may not want to 
> index
> cgi-bin or PDFs. It seems that there are two general approaches for selecting
> what to include and exclude and neither seems ideal.
> 
> 1. I can include files I care about based on the URL matching a reg ex. So, I 
> can
> have a list: html, HTML, pdf, PDF, etc. and filter out URLs that don't match 
> the
> pattern.
> 
> 2. I can exclude files I don't want. I can exclude files with reg exes that 
> match
> /cgi-bin/, .ico, .doc, etc and keep everything else.
> 
> The problem with the first approach is that lots of HTML files don't end in 
> .html.
> Often there is no file name. The home page of a site may just be 
> http://foo.bar.
> So, the first approach will miss lots of HTML files.
> 
> The second approach is ok until I forget a file pattern that I really want to
> exclude.
> 
> I'm wondering if using the MIME type in conjunction with the first approach
> would work well. So, accept URLs with MIME type text/html, accept URLs that
> match some URL patterns I want to include and exclude the rest.
> 
> I can, I suppose, use approach #2 and not worry since files that don't have 
> text
> won't produce any searchable text in the index. I'm not too worried about
> having some junk in the index as I'm not crawling a huge number of pages.
> 
> Thoughts? What do folks generally do?
> 
> Thanks.
> 
> Sol



RE: sitemap and xml crawl

2017-11-02 Thread Yossi Tamari
Hi Ankit,

So I guess you want to remove the parser that is configured by default (since 
you don't need to parse HTML at all), add the RSS parser that Markus suggested, 
and then you probably need to add either a custom parser for the second XML 
format, or an indexing filter, or both. This would depend on exactly what you 
are trying to achieve at the end of the crawl.
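
Concretely, most of that comes down to the plugin.includes property in nutch-site.xml, for example (an abbreviated sketch; the plugin id for the RSS/Atom parser is assumed to be "feed" here, so check the id attribute in the plugin.xml of the parser Markus pointed to):

    <property>
      <name>plugin.includes</name>
      <value>protocol-http|urlfilter-regex|feed|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    </property>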

Yossi.

> -Original Message-
> From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> Sent: 02 November 2017 11:29
> To: user@nutch.apache.org
> Subject: RE: sitemap and xml crawl
> 
> Hi - Nutch has a parser for RSS and ATOM on-board:
> https://nutch.apache.org/apidocs/apidocs-
> 1.13/org/apache/nutch/parse/feed/FeedParser.html
> 
> You must configure it in your plugin.includes to use it.
> 
> Regards,
> Markus
> 
> 
> 
> -Original message-
> > From:Ankit Goel 
> > Sent: Thursday 2nd November 2017 10:11
> > To: user@nutch.apache.org
> > Subject: Re: sitemap and xml crawl
> >
> > Hi Yossi,
> > I have 2 kinds of rss links which are domain.com/rss/feed.xml
> <http://domain.com/rss/feed.xml> links. One is the standard rss feed that we
> see, which becomes the starting point for crawling further as we can pull 
> links
> from it.
> >
> >
> > [RSS feed XML example; the tags were stripped by the mail archive - each item
> > carried an article url and a date, followed by further items ...]
> >
> > The other one also includes the content within the xml itself, so it 
> > doesn’t need
> further crawling.
> > I have standalone xml parsers in java that I can use directly, but 
> > obviously,
> crawling is an important part, because it documents all the links traversed 
> so far.
> >
> > What would you advice?
> >
> > Regards,
> > Ankit Goel
> >
> > > On 02-Nov-2017, at 2:04 PM, Yossi Tamari  wrote:
> > >
> > > Hi Ankit,
> > >
> > > If you are looking for a Sitemap parser, I would suggest moving to
> > > 1.14 (trunk). I've been using it, and it is probably in better shape than 
> > > 1.13.
> > > If you need to parse your own format, the answer depends on the
> > > details. Do you need to crawl pages in this format where each page
> > > contains links in XML that you need to crawl? Or is this more like
> > > Sitemap where the XML is just the  initial starting point?
> > > In the second case, maybe just write something outside of Nutch that
> > > will parse the XML and produce a seed file?
> > > In the first case, the link you sent is not relevant. You need to
> > > implement a
> > > http://nutch.apache.org/apidocs/apidocs-1.13/org/apache/nutch/parse/
> > > Parser.h tml. I haven't done that myself. My suggestion is that you
> > > take a look at the built-in parser at
> > > https://github.com/apache/nutch/blob/master/src/plugin/parse-html/sr
> > > c/java/o rg/apache/nutch/parse/html/HtmlParser.java. Google found
> > > this article on developing a custom parser, which might be a good
> > > starting point:
> > > http://www.treselle.com/blog/apache-nutch-with-custom-parser/.
> > >
> > >   Yossi.
> > >
> > >
> > >> -Original Message-
> > >> From: Ankit Goel [mailto:ankitgoel2...@gmail.com]
> > >> Sent: 02 November 2017 10:24
> > >> To: user@nutch.apache.org
> > >> Subject: Re: sitemap and xml crawl
> > >>
> > >> Hi Yossi,
> > >> So I need to make a custom parser. Where do I start? I found this
> > >> link https://wiki.apache.org/nutch/HowToMakeCustomSearch
> > >> <https://wiki.apache.org/nutch/HowToMakeCustomSearch>. Is this the
> > >> right place, or should I be looking at creating a plugin page. Any
> > >> advice would
> > > be
> > >> helpful.
> > >>
> > >> Thank you,
> > >> Ankit Goel
> > >>
> > >>> On 02-Nov-2017, at 1:14 PM, Yossi Tamari 
> wrote:
> > >>>
> > >>> Hi Ankit,
> > >>>
> > >>> According to this:
> > >>> https://issues.apache.org/jira/browse/NUTCH-1465,
> > >>> sitemap is a 1.14 feature.
> > >>> I just checked, and the command indeed exists in 1.14. I did not
> > >>> test that it works.
> > >>>
> > >>> In general, Nutch supports crawling anything, but you might need
> > >>> to write your own parser for custom protocols.
> > >>>
> > >>> Yossi.
> > >>>
> > >>>> -Original Message-
> > >>>> From: Ankit Goel [mailto:ankitgoel2...@gmail.com]
> > >>>> Sent: 01 November 2017 18:55
> > >>>> To: user@nutch.apache.org
> > >>>> Subject: sitemap and xml crawl
> > >>>>
> > >>>> Hi,
> > >>>> I need to crawl a xml feed, which includes url, title and content
> > >>>> of the
> > >>> articles on
> > >>>> site.
> > >>>>
> > >>>> The documentation on the site says that bin/nutch sitemap exists,
> > >>>> but on
> > >>> my
> > >>>> nutch 1.13 sitemap is not a command in bin/nutch. So does nutch
> > >>>> support crawling sitemaps? Or xml links.
> > >>>>
> > >>>> Regards,
> > >>>> Ankit Goel
> > >>>
> > >>>
> > >
> > >
> >
> >



RE: sitemap and xml crawl

2017-11-02 Thread Yossi Tamari
Hi Ankit,

If you are looking for a Sitemap parser, I would suggest moving to 1.14
(trunk). I've been using it, and it is probably in better shape than 1.13.
If you need to parse your own format, the answer depends on the details. Do
you need to crawl pages in this format where each page contains links in XML
that you need to crawl? Or is this more like Sitemap where the XML is just
the  initial starting point? 
In the second case, maybe just write something outside of Nutch that will
parse the XML and produce a seed file?
In the first case, the link you sent is not relevant. You need to implement
a
http://nutch.apache.org/apidocs/apidocs-1.13/org/apache/nutch/parse/Parser.h
tml. I haven't done that myself. My suggestion is that you take a look at
the built-in parser at
https://github.com/apache/nutch/blob/master/src/plugin/parse-html/src/java/o
rg/apache/nutch/parse/html/HtmlParser.java. Google found this article on
developing a custom parser, which might be a good starting point:
http://www.treselle.com/blog/apache-nutch-with-custom-parser/.
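
For orientation, a bare-bones Parser implementation might look roughly like the sketch below (class and package names are made up, and the exact ParseData/ParseResult constructors are best checked against the HtmlParser source linked above):

    package org.example.parse.myxml;                       // hypothetical package

    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.parse.Outlink;
    import org.apache.nutch.parse.ParseData;
    import org.apache.nutch.parse.ParseImpl;
    import org.apache.nutch.parse.ParseResult;
    import org.apache.nutch.parse.ParseStatus;
    import org.apache.nutch.parse.Parser;
    import org.apache.nutch.protocol.Content;

    public class MyXmlParser implements Parser {
      private Configuration conf;

      @Override
      public ParseResult getParse(Content content) {
        // content.getContent() is the raw fetched byte[] for content.getUrl()
        String xml = new String(content.getContent(),
            java.nio.charset.StandardCharsets.UTF_8);
        // ... extract title, text and outlinks from the XML here ...
        String title = "";
        String text = xml;
        Outlink[] outlinks = new Outlink[0];
        ParseData parseData = new ParseData(new ParseStatus(ParseStatus.SUCCESS),
            title, outlinks, content.getMetadata());
        return ParseResult.createParseResult(content.getUrl(),
            new ParseImpl(text, parseData));
      }

      @Override
      public void setConf(Configuration conf) { this.conf = conf; }

      @Override
      public Configuration getConf() { return conf; }
    }

The class then has to be wired up as a plugin (plugin.xml, build.xml, ivy.xml under src/plugin) and added to plugin.includes, as the tutorial above describes.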

Yossi.


> -Original Message-
> From: Ankit Goel [mailto:ankitgoel2...@gmail.com]
> Sent: 02 November 2017 10:24
> To: user@nutch.apache.org
> Subject: Re: sitemap and xml crawl
> 
> Hi Yossi,
> So I need to make a custom parser. Where do I start? I found this link
> https://wiki.apache.org/nutch/HowToMakeCustomSearch
> <https://wiki.apache.org/nutch/HowToMakeCustomSearch>. Is this the right
> place, or should I be looking at creating a plugin page. Any advice would
be
> helpful.
> 
> Thank you,
> Ankit Goel
> 
> > On 02-Nov-2017, at 1:14 PM, Yossi Tamari  wrote:
> >
> > Hi Ankit,
> >
> > According to this: https://issues.apache.org/jira/browse/NUTCH-1465,
> > sitemap is a 1.14 feature.
> > I just checked, and the command indeed exists in 1.14. I did not test
> > that it works.
> >
> > In general, Nutch supports crawling anything, but you might need to
> > write your own parser for custom protocols.
> >
> > Yossi.
> >
> >> -Original Message-
> >> From: Ankit Goel [mailto:ankitgoel2...@gmail.com]
> >> Sent: 01 November 2017 18:55
> >> To: user@nutch.apache.org
> >> Subject: sitemap and xml crawl
> >>
> >> Hi,
> >> I need to crawl a xml feed, which includes url, title and content of
> >> the
> > articles on
> >> site.
> >>
> >> The documentation on the site says that bin/nutch sitemap exists, but
> >> on
> > my
> >> nutch 1.13 sitemap is not a command in bin/nutch. So does nutch
> >> support crawling sitemaps? Or xml links.
> >>
> >> Regards,
> >> Ankit Goel
> >
> >




RE: sitemap and xml crawl

2017-11-02 Thread Yossi Tamari
Hi Ankit,

According to this: https://issues.apache.org/jira/browse/NUTCH-1465, sitemap
is a 1.14 feature.
I just checked, and the command indeed exists in 1.14. I did not test that
it works.

In general, Nutch supports crawling anything, but you might need to write
your own parser for custom protocols.

Yossi.

> -Original Message-
> From: Ankit Goel [mailto:ankitgoel2...@gmail.com]
> Sent: 01 November 2017 18:55
> To: user@nutch.apache.org
> Subject: sitemap and xml crawl
> 
> Hi,
> I need to crawl a xml feed, which includes url, title and content of the
articles on
> site.
> 
> The documentation on the site says that bin/nutch sitemap exists, but on
my
> nutch 1.13 sitemap is not a command in bin/nutch. So does nutch support
> crawling sitemaps? Or xml links.
> 
> Regards,
> Ankit Goel




RE: Usage of Tika LanguageIdentifier in language-identifier plugin

2017-10-24 Thread Yossi Tamari
Hi Markus,

Can you please explain what you mean by "our parser"? I'm pretty 
sure the language-identifier plugin is not using Optimaize.

Thanks,
Yossi.

> -Original Message-
> From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> Sent: 24 October 2017 15:25
> To: user@nutch.apache.org
> Subject: RE: Usage of Tika LanguageIdentifier in language-identifier plugin
> 
> Hello,
> 
> Not sure what the problem is but, buried deep in our parser, we also use
> Optimaize, previously lang-detect. We load models once, inside a static block,
> and create a new Detector instance for every record we parse. This is very
> fast.
> 
> Regards,
> Markus
> 
> -Original message-
> > From:Sebastian Nagel 
> > Sent: Tuesday 24th October 2017 14:11
> > To: user@nutch.apache.org
> > Subject: Re: Usage of Tika LanguageIdentifier in language-identifier
> > plugin
> >
> > Hi Yossi,
> >
> > > does not separate the Detector object, which contains the model and
> > > should be reused, from the text writer object, which should be request
> specific.
> >
> > But shouldn't a call of reset() make it ready for re-use (the Detector 
> > object
> including the writer)?
> >
> > But I agree that a reentrant function may be easier to integrate. Nutch
> > plugins also need to be thread-safe, esp. parsers and parse filters if
> > running in
> > a multi-threaded parsing fetcher.
> > Without a reentrant function and without a 100% stateless detector,
> > the only way is to use a ThreadLocal instance of the detector. At first
> > glance,
> > the optimaize detector seems to be stateless.
> >
> > > I chose optimaize mainly because Tika did. Using langid instead
> > > should be very simple, but the fact that the project has not seen a
> > > single commit in the last 4 years, and the usage numbers are also quite 
> > > low,
> gives me pause...
> >
> > Of course, maintenance or community around a project is an important
> > factor. CLD2 is also not really maintained, plus the models are fixed, no 
> > code
> available to retrain them.
> >
> > > what I have done locally
> >
> > In any case, would be great if you would open an issue on Jira and a pull
> request on github.
> > Which way to go may be discussed further.
> >
> > Thanks,
> > Sebastian
> >
> >
> > On 10/24/2017 01:05 PM, Yossi Tamari wrote:
> > > Why not LanguageDetector: The API does not separate the Detector object,
> which contains the model and should be reused, from the text writer object,
> which should be request specific. The same API Object instance contains
> references to both. In code terms, both loadModels() and addText() are non-
> static members of LanguageDetector.
> > >
> > > Developing another language-identifier-optimaize is basically what I have
> done locally, but it seems to me having both in the Nutch repository would 
> just
> be confusing for users. 99% of the code would also be duplicated (the relevant
> code is about 5 lines).
> > >
> > > I chose optimaize mainly because Tika did. Using langid instead should be
> very simple, but the fact that the project has not seen a single commit in 
> the last
> 4 years, and the usage numbers are also quite low, gives me pause...
> > >
> > >
> > >> -Original Message-
> > >> From: Sebastian Nagel [mailto:wastl.na...@googlemail.com]
> > >> Sent: 24 October 2017 13:18
> > >> To: user@nutch.apache.org
> > >> Subject: Re: Usage of Tika LanguageIdentifier in
> > >> language-identifier plugin
> > >>
> > >> Hi Yossi,
> > >>
> > >> sorry while fast-reading I've thought it's about the old 
> > >> LanguageIdentifier.
> > >>
> > >>> it is not possible to initialize the detector in setConf and then
> > >>> reuse it
> > >>
> > >> Could explain why? The API/interface should allow to get an
> > >> instance and call
> > >> loadModels() or not?
> > >>
> > >>>>> For my needs, I have modified the plugin to use
> > >>>>> com.optimaize.langdetect.LanguageDetector directly, which is
> > >>>>> what
> > >>
> > >> Of course, that's also possible. Or just add a plugin
> > >> language-identifier- optimaize.
> > >>
> > >> Btw., I recently had a look on various open source language
> > >> identi

RE: Usage of Tika LanguageIdentifier in language-identifier plugin

2017-10-24 Thread Yossi Tamari
Why not LanguageDetector: The API does not separate the Detector object, which 
contains the model and should be reused, from the text writer object, which 
should be request specific. The same API Object instance contains references to 
both. In code terms, both loadModels() and addText() are non-static members of 
LanguageDetector.

Developing another language-identifier-optimaize is basically what I have done 
locally, but it seems to me having both in the Nutch repository would just be 
confusing for users. 99% of the code would also be duplicated (the relevant 
code is about 5 lines).
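
Roughly, the relevant lines look like this (a sketch of the Optimaize API as used outside Nutch, not the actual plugin code; the detector is built once and then shared, which is the whole point of the change):

    import java.io.IOException;
    import java.util.List;

    import com.google.common.base.Optional;
    import com.optimaize.langdetect.LanguageDetector;
    import com.optimaize.langdetect.LanguageDetectorBuilder;
    import com.optimaize.langdetect.i18n.LdLocale;
    import com.optimaize.langdetect.ngram.NgramExtractors;
    import com.optimaize.langdetect.profiles.LanguageProfile;
    import com.optimaize.langdetect.profiles.LanguageProfileReader;

    class OptimaizeLangId {                    // hypothetical helper, not Nutch code
      // Built once and reused for every document; detect() itself keeps no state.
      private static final LanguageDetector DETECTOR = build();

      private static LanguageDetector build() {
        try {
          List<LanguageProfile> profiles = new LanguageProfileReader().readAllBuiltIn();
          return LanguageDetectorBuilder.create(NgramExtractors.standard())
              .withProfiles(profiles).build();
        } catch (IOException e) {
          throw new RuntimeException("failed to load language profiles", e);
        }
      }

      static String detect(String text) {
        Optional<LdLocale> lang = DETECTOR.detect(text);
        return lang.isPresent() ? lang.get().getLanguage() : "unknown";
      }
    }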

I chose optimaize mainly because Tika did. Using langid instead should be very 
simple, but the fact that the project has not seen a single commit in the last 
4 years, and the usage numbers are also quite low, gives me pause...


> -Original Message-
> From: Sebastian Nagel [mailto:wastl.na...@googlemail.com]
> Sent: 24 October 2017 13:18
> To: user@nutch.apache.org
> Subject: Re: Usage of Tika LanguageIdentifier in language-identifier plugin
> 
> Hi Yossi,
> 
> sorry while fast-reading I've thought it's about the old LanguageIdentifier.
> 
> > it is not possible to initialize the detector in setConf and then reuse it
> 
> Could explain why? The API/interface should allow to get an instance and call
> loadModels() or not?
> 
> >>> For my needs, I have modified the plugin to use
> >>> com.optimaize.langdetect.LanguageDetector directly, which is what
> 
> Of course, that's also possible. Or just add a plugin language-identifier-
> optimaize.
> 
> Btw., I recently had a look at various open source language identifier
> implementations and would prefer
> langid (a port from Python/C) because it's faster and has better precision:
>   https://github.com/carrotsearch/langid-java.git
>   https://github.com/saffsd/langid.c.git
>   https://github.com/saffsd/langid.py.git
> Of course, CLD2 (https://github.com/CLD2Owners/cld2.git) is unbeaten (but it's
> C++).
> 
> Thanks,
> Sebastian
> 
> On 10/24/2017 11:46 AM, Yossi Tamari wrote:
> > Hi Sebastian,
> >
> > Please reread the second paragraph of my email 😊.
> > In short, it is not possible to initialize the detector in setConf and then 
> > reuse it,
> and initializing it per call would be extremely slow.
> >
> > Yossi.
> >
> >
> >> -Original Message-
> >> From: Sebastian Nagel [mailto:wastl.na...@googlemail.com]
> >> Sent: 24 October 2017 12:41
> >> To: user@nutch.apache.org
> >> Subject: Re: Usage of Tika LanguageIdentifier in language-identifier plugin
> >>
> >> Hi Yossi,
> >>
> >> why not port it to use
> >>
> >>
> http://tika.apache.org/1.16/api/org/apache/tika/language/detect/LanguageDe
> >> tector.html
> >>
> >> The upgrade to Tika 1.16 is already in progress (NUTCH-2439).
> >>
> >> Sebastian
> >>
> >> On 10/24/2017 11:26 AM, Yossi Tamari wrote:
> >>> Hi
> >>>
> >>>
> >>>
> >>> The language-identifier plugin uses
> >>> org.apache.tika.language.LanguageIdentifier for extracting the
> >>> language from the document text. There are two issues with that:
> >>>
> >>> 1.LanguageIdentifier is deprecated in Tika.
> >>> 2.It does not support CJK language (and I suspect a lot of other
> >>> languages -
> >>> https://wiki.apache.org/nutch/LanguageIdentifierPlugin#Implemented_Lan
> >>> guages _and_their_ISO_636_Codes), and it doesn't even fail gracefully
> >>> with them - in my experience Chinese was recognized as Italian.
> >>>
> >>>
> >>>
> >>> Since in Tika LanguageIdentifier was superseded by
> >>> org.apache.tika.language.detect.LanguageDetector, it seems obvious to
> >>> make that change in the plugin as well. However, because the design of
> >>> LanguageDetector is terrible, it makes the implementation not
> >>> reentrant, meaning the full language model would have to be reloaded
> >>> on each call to the detector.
> >>>
> >>>
> >>>
> >>> For my needs, I have modified the plugin to use
> >>> com.optimaize.langdetect.LanguageDetector directly, which is what
> >>> Tika's LanguageDetector uses internally (at least by default). My
> >>> question is whether that is a change that should be made to the official
> plugin.
> >>>
> >>>
> >>>
> >>> Thanks,
> >>>
> >>>Yossi.
> >>>
> >>>
> >
> >




RE: Usage of Tika LanguageIdentifier in language-identifier plugin

2017-10-24 Thread Yossi Tamari
Hi Sebastian,

Please reread the second paragraph of my email 😊.
In short, it is not possible to initialize the detector in setConf and then 
reuse it, and initializing it per call would be extremely slow.

Yossi.


> -Original Message-
> From: Sebastian Nagel [mailto:wastl.na...@googlemail.com]
> Sent: 24 October 2017 12:41
> To: user@nutch.apache.org
> Subject: Re: Usage of Tika LanguageIdentifier in language-identifier plugin
> 
> Hi Yossi,
> 
> why not port it to use
> 
> http://tika.apache.org/1.16/api/org/apache/tika/language/detect/LanguageDe
> tector.html
> 
> The upgrade to Tika 1.16 is already in progress (NUTCH-2439).
> 
> Sebastian
> 
> On 10/24/2017 11:26 AM, Yossi Tamari wrote:
> > Hi
> >
> >
> >
> > The language-identifier plugin uses
> > org.apache.tika.language.LanguageIdentifier for extracting the
> > language from the document text. There are two issues with that:
> >
> > 1.  LanguageIdentifier is deprecated in Tika.
> > 2.  It does not support CJK language (and I suspect a lot of other
> > languages -
> > https://wiki.apache.org/nutch/LanguageIdentifierPlugin#Implemented_Lan
> > guages _and_their_ISO_636_Codes), and it doesn't even fail gracefully
> > with them - in my experience Chinese was recognized as Italian.
> >
> >
> >
> > Since in Tika LanguageIdentifier was superseded by
> > org.apache.tika.language.detect.LanguageDetector, it seems obvious to
> > make that change in the plugin as well. However, because the design of
> > LanguageDetector is terrible, it makes the implementation not
> > reentrant, meaning the full language model would have to be reloaded
> > on each call to the detector.
> >
> >
> >
> > For my needs, I have modified the plugin to use
> > com.optimaize.langdetect.LanguageDetector directly, which is what
> > Tika's LanguageDetector uses internally (at least by default). My
> > question is whether that is a change that should be made to the official 
> > plugin.
> >
> >
> >
> > Thanks,
> >
> >Yossi.
> >
> >




Usage of Tika LanguageIdentifier in language-identifier plugin

2017-10-24 Thread Yossi Tamari
Hi

 

The language-identifier plugin uses
org.apache.tika.language.LanguageIdentifier for extracting the language from
the document text. There are two issues with that:

1.  LanguageIdentifier is deprecated in Tika.
2.  It does not support CJK language (and I suspect a lot of other
languages -
https://wiki.apache.org/nutch/LanguageIdentifierPlugin#Implemented_Languages
_and_their_ISO_636_Codes), and it doesn't even fail gracefully with them -
in my experience Chinese was recognized as Italian.

 

Since in Tika LanguageIdentifier was superseded by
org.apache.tika.language.detect.LanguageDetector, it seems obvious to make
that change in the plugin as well. However, because the design of
LanguageDetector is terrible, it makes the implementation not reentrant,
meaning the full language model would have to be reloaded on each call to
the detector.

 

For my needs, I have modified the plugin to use
com.optimaize.langdetect.LanguageDetector directly, which is what Tika's
LanguageDetector uses internally (at least by default). My question is
whether that is a change that should be made to the official plugin. 

 

Thanks,

   Yossi.



Sending an empty http.agent.version

2017-10-23 Thread Yossi Tamari
Hi,

 

http.agent.version defaults in nutch-default.xml to Nutch-1.14-SNAPSHOT
(depending on the version of course).

If I want to override it to not send a version as part of the user-agent,
there is nothing I can do in nutch-site.xml, since putting an empty string
there causes the default to be taken, and putting any value there causes a
slash to be appended to the http.agent.name.

As far as I can see, the only way to override it is to remove the value in
nutch-default.xml, which is probably not the "correct" way, considering it
contains a comment saying "Do not modify this file directly".

 

This was asked previously in
https://www.mail-archive.com/user@nutch.apache.org/msg15341.html, but
without a helpful answer.

 

I would be willing to push a fix where setting the string to "null" would
cause it to be ignored, if the maintainers are on board.

 

   Yossi.



RE: [EXT] Re: Nutch Plugin Lifecycle broken due to lazy loading?

2017-09-22 Thread Yossi Tamari
Fork from https://github.com/apache/nutch.

-Original Message-
From: Hiran CHAUDHURI [mailto:hiran.chaudh...@amadeus.com] 
Sent: 22 September 2017 12:27
To: user@nutch.apache.org
Subject: RE: [EXT] Re: Nutch Plugin Lifecycle broken due to lazy loading?

>Hi Hiran,
>
>Your code call setURLStreamHandlerFactory, the documentation for which says 
>"This method can be called at most once in a given Java Virtual Machine". 
>Isn't this going to be a problem? 
>https://docs.oracle.com/javase/8/docs/api/java/net/URL.html#setURLStreamHandlerFactory-java.net.URLStreamHandlerFactory-

I thought of falling back to the already-installed URLStreamHandlerFactory so 
they can chain up. However there is no method to read the current value, so 
that attempt died.
When debugging the procedure I found out that the URLStreamHandlerFactory was 
null during normal application runs, and it was the same on invoking Nutch. So 
for the time being I do not see a problem here. It could arise if a single 
plugin would set the factory - but then it is wiser to do this on application 
level (nutch) than in any plugin.
To come back to your question: I believe that by making use of that feature we would 
reduce the risk of plugin developers getting creative. Therefore I rate it a 
general improvement.

>Additionally, does this URLStreamHandlerFactory successfully load the standard 
>handlers (HTTP, HTTPS...)? I would expect it to fail on these.

Yes, it fails on these and returns null. Which triggers the JVM to just 
continue as if there were no URLStreamHandlerFactory installed. So no harm done 
for the well-known protocols if not overridden by plugins.
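
In code terms, the behaviour described above is a factory along these lines (a generic sketch, not the actual patch; the plugin lookup is a placeholder):

    import java.net.URLStreamHandler;
    import java.net.URLStreamHandlerFactory;

    class PluginUrlStreamHandlerFactory implements URLStreamHandlerFactory {
      @Override
      public URLStreamHandler createURLStreamHandler(String protocol) {
        // Returning null tells the JVM to continue with its built-in handler
        // resolution, so http/https/file keep working even with a factory installed.
        return lookupHandlerFromPlugins(protocol);
      }

      private URLStreamHandler lookupHandlerFromPlugins(String protocol) {
        return null; // sketch: consult the plugin repository here
      }
    }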

>To be able to create a pull request, your repository needs to be a fork of the 
>original repository, which does not seem to be the case here.

I thought to have forked from gitbox.apache.org but then something may be 
broken. Do you have an idea how I could fix this?

Hiran



RE: [EXT] Re: Nutch Plugin Lifecycle broken due to lazy loading?

2017-09-22 Thread Yossi Tamari
Hi Hiran,

Your code call setURLStreamHandlerFactory, the documentation for which says 
"This method can be called at most once in a given Java Virtual Machine". Isn't 
this going to be a problem? 
https://docs.oracle.com/javase/8/docs/api/java/net/URL.html#setURLStreamHandlerFactory-java.net.URLStreamHandlerFactory-
Additionally, does this URLStreamHandlerFactory successfully load the standard 
handlers (HTTP, HTTPS...)? I would expect it to fail on these.

To be able to create a pull request, your repository needs to be a fork of the 
original repository, which does not seem to be the case here.

Yossi.

-Original Message-
From: Hiran CHAUDHURI [mailto:hiran.chaudh...@amadeus.com] 
Sent: 22 September 2017 11:54
To: user@nutch.apache.org
Subject: RE: [EXT] Re: Nutch Plugin Lifecycle broken due to lazy loading?

Hello all.

This time following up on my own post...

>>> When you look at the protocol-smb hook it comes with this static 
>>> hook, but as it is never executed does not help.
>>
>>Yes, it has to be called.
>
>So when would Nutch call this static hook? In practice this does not happen 
>before the plugin is required, but then it is too late as the 
>MalformedURLException is thrown already.
>And this approach cannot cover the classpath issue.

It seems Nutch would never call this static hook. That is why I patched the 
PluginRepository class.

>>> - create a tutorial to add some arbitrary protocol (e.g. the 
>>> foo://bar/baz url)
>>> - modify the protocol-smb plugin to make use of the smbclient binary.
>>>
>>> I'd be willing to do the latter but would like to see a less clumsy 
>>> behaviour for plugins.
>>
>>Great! Nutch could not exist without voluntary work. Thanks!
>>
>>Sorry, that integration will not be that easy. The problem was indeed 
>>already known since long and should have been better tested, see also [1] and 
>>[2] - the class org.apache.nutch.protocol.sftp.Handler (a dummy handler) has 
>>been lost, you'll find it in the zip file attached to NUTCH-714.
>>
>>However, encapsulation and lazy instantiation I would not call "clumsy 
>>behavior", it's useful for heavy-weight plugins (e.g., parse-tika which 
>>brings 50 MB dependencies).
>
>Both concepts, encapsulation and lazy instantiation are great. What I call 
>clumsy is that the encapsulation does not work. Look at it from a user 
>perspective of the protocol-smb plugin.
>It comes as a (set of) jars, together with an XML descriptor. This could be 
>nicely wrapped in a zip file and thus is one artifact that can easily be 
>versioned and distributed.
>
>But as soon as I want to install it, I have to
>1 - put the artifact into the plugins directory
>2 - modify Nutch configuration files to allow smb:// urls plus include 
>the plugin to the loaded list
>3 - extract jcifs.jar and place it on the system classpath
>4 - run nutch with the correct system property
>
>While items 1 and 2 can be understood easily and maybe one day come with a 
>nice management interface, items 3 and 4 require knowledge about the internals 
>of the plugin. 
>Where did the encapsulation go? This is where I'd like to improve, and I have 
>an idea how that could be established. Need to test it though.

I have a solution that makes steps 3 and 4 obsolete.

>I would need the first to test modifications to the plugin system.
>Then with the second I would create a smb plugin that would suffer 
>other limitations than the LGPL. ;-)

So here is the solution to the first step - the modified plugin system. It is 
available here, however I am not sure how to create the pull request...
https://github.com/HiranChaudhuri/nutch/commit/dc9cbeb3da7ca021e2cce322482d2eaa1ec15b28

Next will be one example plugin and the mentioned protocol-smb.

Hiran



RE: [EXT] Re: Nutch Plugin Lifecycle broken due to lazy loading?

2017-09-20 Thread Yossi Tamari
Hi Hiran,

I recently needed the documents you requested myself, and the two below were 
the most helpful. Keep in mind that like most Nutch documentation, they are not 
totally up to date, so you need to be a bit flexible.
The most important difference for me was getting the source from GitHub rather 
than SVN.

https://wiki.apache.org/nutch/RunNutchInEclipse
https://florianhartl.com/nutch-plugin-tutorial.html



-Original Message-
From: Hiran CHAUDHURI [mailto:hiran.chaudh...@amadeus.com] 
Sent: 20 September 2017 09:50
To: user@nutch.apache.org
Subject: RE: [EXT] Re: Nutch Plugin Lifecycle broken due to lazy loading?

>> When you look at the protocol-smb hook it comes with this static 
>> hook, but as it is never executed does not help.
>
>Yes, it has to be called.

So when would Nutch call this static hook? In practice this does not happen 
before the plugin is required, but then it is too late as the 
MalformedURLException is thrown already.
>And this approach cannot cover the classpath issue.

>> - create a tutorial to add some arbitrary protocol (e.g. the  
>> foo://bar/baz url)
>> - modify the protocol-smb plugin to make use of the smbclient binary.
>>
>> I'd be willing to do the latter but would like to see a less clumsy 
>> behaviour for plugins.
>
>Great! Nutch could not exist without voluntary work. Thanks!
>
>Sorry, that integration will not be that easy. The problem was indeed already 
>known since long and should have been better tested, see also [1] and [2] - 
>the class >org.apache.nutch.protocol.sftp.Handler (a dummy handler) has been 
>lost, you'll find it in the zip file attached to NUTCH-714.
>
>However, encapsulation and lazy instantiation I would not call "clumsy 
>behavior", it's useful for heavy-weight plugins (e.g., parse-tika which brings 
>50 MB dependencies).

Both concepts, encapsulation and lazy instantiation are great. What I call 
clumsy is that the encapsulation does not work. Look at it from a user 
perspective of the protocol-smb plugin.
It comes as a (set of) jars, together with an XML descriptor. This could be 
nicely wrapped in a zip file and thus is one artifact that can easily be 
versioned and distributed.

But as soon as I want to install it, I have to
1 - put the artifact into the plugins directory
2 - modify Nutch configuration files to allow smb:// urls plus include the 
plugin to the loaded list
3 - extract jcifs.jar and place it on the system classpath
4 - run nutch with the correct system property

While items 1 and 2 can be understood easily and maybe one day come with a nice 
management interface, items 3 and 4 require knowledge about the internals of 
the plugin. Where did the encapsulation go? This is where I'd like to improve, 
and I have an idea how that could be established. Need to test it though.

>Thanks, looking forward how you get it solved, Sebastian

It seems I may need some support to go further. Maybe as you help me two 
documents could arise:
- Building nutch from source
- Developing a (protocol) plugin

I would need the first to test modifications to the plugin system.
Then with the second I would create a smb plugin that would suffer other 
limitations than the LGPL. ;-)

Hiran



RE: Exchange documents in indexing job

2017-08-23 Thread Yossi Tamari
I don't see a good way to do it in configuration, but it should be very easy to 
override the write method in the two plugins to have it check the mime type and 
decide whether to call super.write or not.
(One terrible way to do it with configuration only would be to configure only 
one of the indexers and use mimetype-filter to filter the matching type, and 
then reconfigure for the other indexer and change mimetype-filter.txt to the 
other mime type and index again...)

-Original Message-
From: Roannel Fernández Hernández [mailto:roan...@uci.cu] 
Sent: 23 August 2017 18:05
To: user@nutch.apache.org
Subject: Exchange documents in indexing job

Hi folks: 

Is there some way in Nutch to send documents to a particular index writer 
according to particular values of fields? 

Let me explain myself better. I have a document with a field called "mimetype" and I 
want to send to Solr only the documents with value "text/plain" for this field 
and send to RabbitMQ the documents with value "text/html". How can I do that? 

Regards 

La @universidad_uci es Fidel. Los jóvenes no fallaremos.
#HastaSiempreComandante
#HastalaVictoriaSiempre



RE: After Parse extension point

2017-07-27 Thread Yossi Tamari
Hi Zoltan,

I think what you want is a HtmlParseFilter - 
https://nutch.apache.org/apidocs/apidocs-1.13/org/apache/nutch/parse/HtmlParseFilter.html.
I recommend you read https://florianhartl.com/nutch-plugin-tutorial.html, and 
take a look at one of the included HtmlParseFilters, e.g. parsefilter-regex.
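
A bare skeleton of such a filter looks roughly like this (hypothetical class name; the exact interface signature is best checked against the parsefilter-regex source):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.parse.HTMLMetaTags;
    import org.apache.nutch.parse.HtmlParseFilter;
    import org.apache.nutch.parse.ParseResult;
    import org.apache.nutch.protocol.Content;
    import org.w3c.dom.DocumentFragment;

    // Runs after the parser and before the parse output is written,
    // which is the hook point being asked about.
    public class MyParseFilter implements HtmlParseFilter {
      private Configuration conf;

      @Override
      public ParseResult filter(Content content, ParseResult parseResult,
          HTMLMetaTags metaTags, DocumentFragment doc) {
        // inspect or modify parseResult here (e.g. add parse metadata)
        return parseResult;
      }

      @Override
      public void setConf(Configuration conf) { this.conf = conf; }

      @Override
      public Configuration getConf() { return conf; }
    }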

If you have more specific questions, I may be able to help.

Yossi.

-Original Message-
From: Zoltán Zvara [mailto:zoltan.zv...@gmail.com] 
Sent: 26 July 2017 20:18
To: user@nutch.apache.org
Subject: After Parse extension point

Dear Community,

I am looking for the extension point which executes after parse and before update.
Moreover, I would be happy to read further on how extension points are built up 
(and in which order). My first impression of Nutch is that it is highly 
under-documented, or that the existing documentation is outdated. I would be pleased to 
look into the details of how the plugin system works, and further into how extension points are 
controlled and run by Nutch.

Best,
Zoltán



RE: nutch 1.x tutorial with solr 6.6.0

2017-07-12 Thread Yossi Tamari
Hi Pau,

I think the tutorial is still not fully up-to-date:
If you haven't, you should update the solr.* properties in nutch-site.xml (and 
run `ant runtime` again to update the runtime).
Then the command for the tutorial should be:
bin/nutch index crawl/crawldb/ -linkdb crawl/linkdb/ -dir crawl/segments/ 
-filter -normalize -deleteGone
The -dir parameter should save you the need to run `index` for each segment. 
I'm not sure if you need the final 3 parameters, depends on your use case.
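
For example, with a Solr core named "nutch" (the core name is just an assumption here), the relevant nutch-site.xml entry would be along the lines of:

    <property>
      <name>solr.server.url</name>
      <value>http://localhost:8983/solr/nutch</value>
    </property>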

-Original Message-
From: Pau Paches [mailto:sp.exstream.t...@gmail.com] 
Sent: 12 July 2017 23:48
To: user@nutch.apache.org
Subject: Re: nutch 1.x tutorial with solr 6.6.0

Hi Lewis et al.,
I have followed the new tutorial.
In step Step-by-Step: Indexing into Apache Solr

the command
bin/nutch index http://localhost:8983/solr crawl/crawldb/ -linkdb crawl/linkdb/ 
crawl/segments/20131108063838/ -filter -normalize -deleteGone

should be run for each segment directory (there are 3), I guess but for the 
first segment it fails:
Indexer: java.io.IOException: No FileSystem for scheme: http
at 
org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2644)
at 
org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2651)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:92)
at 
org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2687)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2669)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:371)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at 
org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:258)
at 
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
at 
org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:45)
at 
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
at 
org.apache.hadoop.mapreduce.JobSubmitter.writeOldSplits(JobSubmitter.java:329)
at 
org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:320)
at 
org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:575)
at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:570)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at 
org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:570)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:561)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:862)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:147)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:230)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:239)

thanks,
pau

On 7/12/17, Pau Paches  wrote:
> Hi Lewis,
> Just trying the tutorial again. Doing the third round, it's taking 
> much longer than the other two.
>
> What's this schema for?
> Does the version of Nutch that we run have to have this new schema for 
> compatibility with Solr 6.6.0?
> Or can we use Nutch 1.13?
> thanks,
> pau
>
> On 7/12/17, lewis john mcgibbney  wrote:
>> Hi Folks,
>> I just updated the tutorial below, if you find any discrepancies 
>> please let me know.
>>
>> https://wiki.apache.org/nutch/NutchTutorial
>>
>> Also, I have made available a new schema.xml which is compatible with 
>> Solr
>> 6.6.0 at
>>
>> https://issues.apache.org/jira/browse/NUTCH-2400
>>
>> Please scope it out and let me know what happens.
>> Thank you
>> Lewis
>>
>> On Wed, Jul 12, 2017 at 6:58 AM, 
>> wrote:
>>
>>>
>>> From: Pau Paches [mailto:sp.exstream.t...@gmail.com]
>>> Sent: Tuesday, July 11, 2017 2:50 PM
>>> To: user@nutch.apache.org
>>> Subject: [EXTERNAL] Re: nutch 1.x tutorial with solr 6.6.0
>>>
>>> Hi Rashmi,
>>> I have followed your suggestions.
>>> Now I'm seeing a different error.
>>> bin/nutch solrindex http://127.0.0.1:8983/solr crawl/crawld -linkdb 
>>> crawl/linkdb crawl/segments The input path at segments is not a 
>>> segment...
>>> skipping
>>> Indexer: starting at 2017-07-11 20:45:56
>>> Indexer: deleting gone documents: false
>>
>>
>> ...

RE: nutch 1.x tutorial with solr 6.6.0

2017-07-11 Thread Yossi Tamari
I struggled with this as well. Eventually I moved to ElasticSearch, which is 
much easier.

What I did manage to find out, is that in newer versions of SOLR you need to 
use ZooKeeper to update the conf file. see https://stackoverflow.com/a/43351358.

-Original Message-
From: Pau Paches [mailto:sp.exstream.t...@gmail.com] 
Sent: 11 July 2017 13:29
To: user@nutch.apache.org
Subject: Re: nutch 1.x tutorial with solr 6.6.0

Hi,
I just crawl a single URL so no whole web crawling.
So I do option 2, fetching, invertlinks successfully. This is just Nutch 1.x 
Then I do Indexing into Apache Solr so go to section Setup Solr for search.
First thing that does not work:
cd ${APACHE_SOLR_HOME}/example
java -jar start.jar
No start.jar at the specified location, but no problem you start Solr
6.6.0 with bin/solr start.
Then the tutorial says:
Backup the original Solr example schema.xml:
mv ${APACHE_SOLR_HOME}/example/solr/collection1/conf/schema.xml
${APACHE_SOLR_HOME}/example/solr/collection1/conf/schema.xml.org

But in current Solr, 6.6.0, there is no schema.xml file. In the whole 
distribution. What should I do here?
if I go directly to run the Solr Index command from ${NUTCH_RUNTIME_HOME}:
bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb 
crawl/linkdb crawl/segments/ which may not make sense since I have skipped some 
steps, it crashes:
The input path at segments is not a segment... skipping
Indexer: java.lang.RuntimeException: Missing elastic.cluster and elastic.host. 
At least one of them should be set in nutch-site.xml ElasticIndexWriter
elastic.cluster : elastic prefix cluster
elastic.host : hostname
elastic.port : port

Clearly there is some missing configuration in nutch-site.xml; apart from 
setting http.agent.name in nutch-site.xml (mentioned), other fields need to be 
set up. The segments message above is also troubling.

If you follow the steps (if they worked), should we run bin/nutch solrindex 
http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/ 
(this is the last step in Integrate Solr with Nutch) and then

bin/nutch index http://localhost:8983/solr crawl/crawldb/ -linkdb crawl/linkdb/ 
crawl/segments/20131108063838/ -filter -normalize -deleteGone (this is one of 
the steps of Using Individual Commands for Whole-Web Crawling, which in fact 
is also the section to read if you are only crawling a single URL)?

This is what I found by following the tutorial at 
https://wiki.apache.org/nutch/NutchTutorial

On 7/9/17, lewis john mcgibbney  wrote:
> Hi Pau,
>
> On Sat, Jul 8, 2017 at 6:52 AM,  wrote:
>
>> From: Pau Paches 
>> To: user@nutch.apache.org
>> Cc:
>> Bcc:
>> Date: Sat, 8 Jul 2017 15:52:46 +0200
>> Subject: nutch 1.x tutorial with solr 6.6.0 Hi, I have run the Nutch 
>> 1.x Tutorial with Solr 6.6.0.
>> Many things do not work,
>
>
> What does not work? Can you elaborate?
>
>
>> there is a mismatch between the assumed Solr
>> version and the current Solr version.
>>
>
> We support Solr as an indexing backend in the broadest sense possible. We
> do not aim to support the latest and greatest Solr version available. If
> you are interested in upgrading to a particular version, it would be
> excellent if you could open a JIRA issue and provide a pull request.
>
>
>> I have seen some messages about the same problem for Solr 4.x
>> Is this the right path to go or should I move to Nutch 2.x?
>
>
> If you are new to Nutch, I would highly advise that you stick with 1.X
>
>
>> Does it
>> make sense to use Solr 6.6 with Nutch 1.x?
>
>
> Yes... you _may_ have a few configuration options to tweak but there have
> been no backwards incompatibility issues so I see no reason for anything to
> be broken.
>
>
>> If yes, I'm willing to
>> amend the tutorial if someone helps.
>>
>>
> What is broken? Can you elaborate?
>



RE: Nutch 1.13 parsing links but ignoring them?

2017-06-29 Thread Yossi Tamari
I figured it out myself. The problem was with db.max.outlinks.per.page
having a default value of 100.
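
For anyone who hits the same thing: the limit can be lifted with an override in 
nutch-site.xml. The sketch below assumes the behaviour described in 
nutch-default.xml, where a negative value keeps all outlinks:

  <property>
    <name>db.max.outlinks.per.page</name>
    <!-- Default is 100; a negative value keeps every outlink found on a page. -->
    <value>-1</value>
  </property>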

 

From: Yossi Tamari [mailto:yossi.tam...@pipl.com] 
Sent: 26 June 2017 19:26
To: user@nutch.apache.org
Subject: Nutch 1.13 parsing links but ignoring them?

 

I'm seeing many cases where ParserChecker finds outlinks in a document, but
when running crawl on this document they do not appear in the crawl DB at
all (and are not indexed). 

My URL filters are trivial as far as I can tell, and the missing links are
not special in any way that I can see.

For example:

/bin/nutch parsechecker -dumpText "http://corporate.exxonmobil.com/"

finds, among others, the URLs https://energyfactor.exxonmobil.com/ and
http://corporate.exxonmobil.com/en/investors/corporate-governance.

However, when running

bin/crawl  urls_yossi yossi 2

with only http://corporate.exxonmobil.com/ in urls_yossi, and then dumping
yossi/crawldb (using `nutch readdb`), the two above URLs are not found.

When finished, the crawldb contains 786 entries, which is far below topN.

 

Any idea what could be causing these URLs to be ignored?



IllegalStateException in CleaningJob on ElasticSearch 2.3.3

2017-05-16 Thread Yossi Tamari
Hi,

 

When running 'crawl -i', I get the following exception in the second
iteration, during the CleaningJob:

 

Cleaning up index if possible
/data/apache-nutch-1.13/runtime/deploy/bin/nutch clean crawl-inbar/crawldb
17/05/16 05:40:32 INFO indexer.CleaningJob: CleaningJob: starting at 2017-05-16 05:40:32
17/05/16 05:40:33 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
17/05/16 05:40:33 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
17/05/16 05:40:34 INFO mapred.FileInputFormat: Total input paths to process : 1
17/05/16 05:40:34 INFO mapreduce.JobSubmitter: number of splits:2
17/05/16 05:40:34 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1493910246747_0030
17/05/16 05:40:34 INFO impl.YarnClientImpl: Submitted application application_1493910246747_0030
17/05/16 05:40:34 INFO mapreduce.Job: The url to track the job: http://crawler001.pipl.com:8088/proxy/application_1493910246747_0030/
17/05/16 05:40:34 INFO mapreduce.Job: Running job: job_1493910246747_0030
17/05/16 05:40:43 INFO mapreduce.Job: Job job_1493910246747_0030 running in uber mode : false
17/05/16 05:40:43 INFO mapreduce.Job:  map 0% reduce 0%
17/05/16 05:40:48 INFO mapreduce.Job:  map 50% reduce 0%
17/05/16 05:40:52 INFO mapreduce.Job:  map 100% reduce 0%
17/05/16 05:40:53 INFO mapreduce.Job: Task Id : attempt_1493910246747_0030_r_00_0, Status : FAILED
Error: java.lang.IllegalStateException: bulk process already closed
    at org.elasticsearch.action.bulk.BulkProcessor.ensureOpen(BulkProcessor.java:278)
    at org.elasticsearch.action.bulk.BulkProcessor.flush(BulkProcessor.java:329)
    at org.apache.nutch.indexwriter.elastic.ElasticIndexWriter.commit(ElasticIndexWriter.java:200)
    at org.apache.nutch.indexer.IndexWriters.commit(IndexWriters.java:127)
    at org.apache.nutch.indexer.CleaningJob$DeleterReducer.close(CleaningJob.java:125)
    at org.apache.hadoop.io.IOUtils.cleanup(IOUtils.java:244)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:459)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)

 

This happens in all the reduce tasks for this job. In the first iteration
the CleaningJob finished successfully.

Any ideas what may be causing this?

 

Thanks,

   Yossi.

 



RE: Wrong FS exception in Fetcher

2017-05-03 Thread Yossi Tamari
Hi,

 

Setting the MapReduce framework to YARN solved this issue.
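
For reference, the property involved is Hadoop's mapreduce.framework.name; a 
minimal mapred-site.xml sketch of that setting (everything else in the Hadoop 
configuration stays as it is):

  <?xml version="1.0"?>
  <configuration>
    <!-- Submit MapReduce jobs to YARN instead of the local job runner. -->
    <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
    </property>
  </configuration>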

 

Yossi.

 

From: Yossi Tamari [mailto:yossi.tam...@pipl.com] 
Sent: 30 April 2017 17:04
To: user@nutch.apache.org
Subject: Wrong FS exception in Fetcher

 

Hi,

 

I'm trying to run Nutch 1.13 on Hadoop 2.8.0 in pseudo-distributed mode.

Running the command:

deploy/bin/crawl urls crawl 2

The Injector and Generator run successfully, but in the Fetcher I get the
following error:

17/04/30 08:43:48 ERROR fetcher.Fetcher: Fetcher: java.lang.IllegalArgumentException: Wrong FS: hdfs://localhost:9000/user/root/crawl/segments/20170430084337/crawl_fetch, expected: file:///
    at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:665)
    at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:86)
    at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:630)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:861)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:625)
    at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:435)
    at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1436)
    at org.apache.nutch.fetcher.FetcherOutputFormat.checkOutputSpecs(FetcherOutputFormat.java:55)
    at org.apache.hadoop.mapreduce.JobSubmitter.checkSpecs(JobSubmitter.java:270)
    at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:141)
    at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1341)
    at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1338)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1807)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:1338)
    at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:575)
    at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:570)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1807)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:570)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:561)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:870)
    at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:486)
    at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:521)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
    at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:495)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.util.RunJar.run(RunJar.java:234)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:148)

Error running:
  /data/apache-nutch-1.13/runtime/deploy/bin/nutch fetch -D mapreduce.job.reduces=2
    -D mapred.child.java.opts=-Xmx1000m -D mapreduce.reduce.speculative=false
    -D mapreduce.map.speculative=false -D mapreduce.map.output.compress=true
    -D fetcher.timelimit.mins=180 crawl/segments/20170430084337 -noParsing -threads 50

Failed with exit value 255.

 

 

Any ideas how to fix this?

 

Thanks,

   Yossi.



RE: Wrong FS exception in Fetcher

2017-05-02 Thread Yossi Tamari
Hi,

Issue created: https://issues.apache.org/jira/browse/NUTCH-2383.

Thanks,
Yossi.


-Original Message-
From: Sebastian Nagel [mailto:wastl.na...@googlemail.com] 
Sent: 02 May 2017 16:08
To: user@nutch.apache.org
Subject: Re: Wrong FS exception in Fetcher

Hi Yossi,

> that 1.13 requires Hadoop 2.7.2 specifically.

That's not a hard requirement. Usually you have to use the Hadoop version of 
your running Hadoop
cluster. Mostly this causes no problems, but if there are problems it's a good 
strategy to try
this first.

Thanks for the detailed log. All steps are called the same way. The method
checkOutputSpecs(FileSystem, JobConf) is first called in the Fetcher.
It probably needs debugging to find out why a local file system is assumed
for the output path here.

Please, open an issue on
  https://issues.apache.org/jira/browse/NUTCH

Thanks,
Sebastian

On 05/02/2017 01:21 PM, Yossi Tamari wrote:
> Thanks Sebastian,
> 
> The output with set -x is below. I'm new to Nutch and was not aware that 1.13 
> requires Hadoop 2.7.2 specifically. While I see it now in pom.xml, it may be 
> a good idea to document it in the download page and provide a download link 
> (since the Hadoop releases page contains 2.7.3 but not 2.7.2). I will try to 
> install 2.7.2 and retest tomorrow.
> 
> root@crawler001:/data/apache-nutch-1.13/runtime/deploy/bin# ./crawl urls 
> crawl 2
> Injecting seed URLs
> /data/apache-nutch-1.13/runtime/deploy/bin/nutch inject crawl/crawldb urls
> + cygwin=false
> + case "`uname`" in
> ++ uname
> + THIS=/data/apache-nutch-1.13/runtime/deploy/bin/nutch
> + '[' -h /data/apache-nutch-1.13/runtime/deploy/bin/nutch ']'
> + '[' 3 = 0 ']'
> + COMMAND=inject
> + shift
> ++ dirname /data/apache-nutch-1.13/runtime/deploy/bin/nutch
> + THIS_DIR=/data/apache-nutch-1.13/runtime/deploy/bin
> ++ cd /data/apache-nutch-1.13/runtime/deploy/bin/..
> ++ pwd
> + NUTCH_HOME=/data/apache-nutch-1.13/runtime/deploy
> + '[' '' '!=' '' ']'
> + '[' /usr/lib/jvm/java-8-oracle/jre/ = '' ']'
> + local=true
> + '[' -f /data/apache-nutch-1.13/runtime/deploy/apache-nutch-1.13.job ']'
> + local=false
> + for f in '"$NUTCH_HOME"/*nutch*.job'
> + NUTCH_JOB=/data/apache-nutch-1.13/runtime/deploy/apache-nutch-1.13.job
> + false
> + JAVA=/usr/lib/jvm/java-8-oracle/jre//bin/java
> + JAVA_HEAP_MAX=-Xmx1000m
> + '[' '' '!=' '' ']'
> + CLASSPATH=/data/apache-nutch-1.13/runtime/deploy/conf
> + 
> CLASSPATH=/data/apache-nutch-1.13/runtime/deploy/conf:/usr/lib/jvm/java-8-oracle/jre//lib/tools.jar
> + IFS=
> + false
> + false
> + JAVA_LIBRARY_PATH=
> + '[' -d /data/apache-nutch-1.13/runtime/deploy/lib/native ']'
> + '[' false = true -a X '!=' X ']'
> + unset IFS
> + '[' '' = '' ']'
> + NUTCH_LOG_DIR=/data/apache-nutch-1.13/runtime/deploy/logs
> + '[' '' = '' ']'
> + NUTCH_LOGFILE=hadoop.log
> + false
> + NUTCH_OPTS=($NUTCH_OPTS -Dhadoop.log.dir="$NUTCH_LOG_DIR")
> + NUTCH_OPTS=("${NUTCH_OPTS[@]}" -Dhadoop.log.file="$NUTCH_LOGFILE")
> + '[' x '!=' x ']'
> + '[' inject = crawl ']'
> + '[' inject = inject ']'
> + CLASS=org.apache.nutch.crawl.Injector
> + EXEC_CALL=(hadoop jar "$NUTCH_JOB")
> + false
> ++ which hadoop
> ++ wc -l
> + '[' 1 -eq 0 ']'
> + exec hadoop jar 
> /data/apache-nutch-1.13/runtime/deploy/apache-nutch-1.13.job 
> org.apache.nutch.crawl.Injector crawl/crawldb urls
> 17/05/02 06:00:24 INFO crawl.Injector: Injector: starting at 2017-05-02 
> 06:00:24
> 17/05/02 06:00:24 INFO crawl.Injector: Injector: crawlDb: crawl/crawldb
> 17/05/02 06:00:24 INFO crawl.Injector: Injector: urlDir: urls
> 17/05/02 06:00:24 INFO crawl.Injector: Injector: Converting injected urls to 
> crawl db entries.
> 17/05/02 06:00:25 INFO Configuration.deprecation: session.id is deprecated. 
> Instead, use dfs.metrics.session-id
> 17/05/02 06:00:25 INFO jvm.JvmMetrics: Initializing JVM Metrics with 
> processName=JobTracker, sessionId=
> 17/05/02 06:00:26 INFO input.FileInputFormat: Total input files to process : 1
> 17/05/02 06:00:26 INFO input.FileInputFormat: Total input files to process : 1
> 17/05/02 06:00:26 INFO mapreduce.JobSubmitter: number of splits:2
> 17/05/02 06:00:26 INFO mapreduce.JobSubmitter: Submitting tokens for job: 
> job_local307378419_0001
> 17/05/02 06:00:26 INFO mapreduce.Job: The url

RE: Wrong FS exception in Fetcher

2017-05-02 Thread Yossi Tamari
Thanks Sebastian,

The output with set -x is below. I'm new to Nutch and was not aware that 1.13 
requires Hadoop 2.7.2 specifically. While I see it now in pom.xml, it may be a 
good idea to document it in the download page and provide a download link 
(since the Hadoop releases page contains 2.7.3 but not 2.7.2). I will try to 
install 2.7.2 and retest tomorrow.

root@crawler001:/data/apache-nutch-1.13/runtime/deploy/bin# ./crawl urls crawl 2
Injecting seed URLs
/data/apache-nutch-1.13/runtime/deploy/bin/nutch inject crawl/crawldb urls
+ cygwin=false
+ case "`uname`" in
++ uname
+ THIS=/data/apache-nutch-1.13/runtime/deploy/bin/nutch
+ '[' -h /data/apache-nutch-1.13/runtime/deploy/bin/nutch ']'
+ '[' 3 = 0 ']'
+ COMMAND=inject
+ shift
++ dirname /data/apache-nutch-1.13/runtime/deploy/bin/nutch
+ THIS_DIR=/data/apache-nutch-1.13/runtime/deploy/bin
++ cd /data/apache-nutch-1.13/runtime/deploy/bin/..
++ pwd
+ NUTCH_HOME=/data/apache-nutch-1.13/runtime/deploy
+ '[' '' '!=' '' ']'
+ '[' /usr/lib/jvm/java-8-oracle/jre/ = '' ']'
+ local=true
+ '[' -f /data/apache-nutch-1.13/runtime/deploy/apache-nutch-1.13.job ']'
+ local=false
+ for f in '"$NUTCH_HOME"/*nutch*.job'
+ NUTCH_JOB=/data/apache-nutch-1.13/runtime/deploy/apache-nutch-1.13.job
+ false
+ JAVA=/usr/lib/jvm/java-8-oracle/jre//bin/java
+ JAVA_HEAP_MAX=-Xmx1000m
+ '[' '' '!=' '' ']'
+ CLASSPATH=/data/apache-nutch-1.13/runtime/deploy/conf
+ 
CLASSPATH=/data/apache-nutch-1.13/runtime/deploy/conf:/usr/lib/jvm/java-8-oracle/jre//lib/tools.jar
+ IFS=
+ false
+ false
+ JAVA_LIBRARY_PATH=
+ '[' -d /data/apache-nutch-1.13/runtime/deploy/lib/native ']'
+ '[' false = true -a X '!=' X ']'
+ unset IFS
+ '[' '' = '' ']'
+ NUTCH_LOG_DIR=/data/apache-nutch-1.13/runtime/deploy/logs
+ '[' '' = '' ']'
+ NUTCH_LOGFILE=hadoop.log
+ false
+ NUTCH_OPTS=($NUTCH_OPTS -Dhadoop.log.dir="$NUTCH_LOG_DIR")
+ NUTCH_OPTS=("${NUTCH_OPTS[@]}" -Dhadoop.log.file="$NUTCH_LOGFILE")
+ '[' x '!=' x ']'
+ '[' inject = crawl ']'
+ '[' inject = inject ']'
+ CLASS=org.apache.nutch.crawl.Injector
+ EXEC_CALL=(hadoop jar "$NUTCH_JOB")
+ false
++ which hadoop
++ wc -l
+ '[' 1 -eq 0 ']'
+ exec hadoop jar /data/apache-nutch-1.13/runtime/deploy/apache-nutch-1.13.job 
org.apache.nutch.crawl.Injector crawl/crawldb urls
17/05/02 06:00:24 INFO crawl.Injector: Injector: starting at 2017-05-02 06:00:24
17/05/02 06:00:24 INFO crawl.Injector: Injector: crawlDb: crawl/crawldb
17/05/02 06:00:24 INFO crawl.Injector: Injector: urlDir: urls
17/05/02 06:00:24 INFO crawl.Injector: Injector: Converting injected urls to 
crawl db entries.
17/05/02 06:00:25 INFO Configuration.deprecation: session.id is deprecated. 
Instead, use dfs.metrics.session-id
17/05/02 06:00:25 INFO jvm.JvmMetrics: Initializing JVM Metrics with 
processName=JobTracker, sessionId=
17/05/02 06:00:26 INFO input.FileInputFormat: Total input files to process : 1
17/05/02 06:00:26 INFO input.FileInputFormat: Total input files to process : 1
17/05/02 06:00:26 INFO mapreduce.JobSubmitter: number of splits:2
17/05/02 06:00:26 INFO mapreduce.JobSubmitter: Submitting tokens for job: 
job_local307378419_0001
17/05/02 06:00:26 INFO mapreduce.Job: The url to track the job: 
http://localhost:8080/
17/05/02 06:00:26 INFO mapreduce.Job: Running job: job_local307378419_0001
17/05/02 06:00:26 INFO mapred.LocalJobRunner: OutputCommitter set in config null
17/05/02 06:00:26 INFO output.FileOutputCommitter: File Output Committer 
Algorithm version is 1
17/05/02 06:00:26 INFO output.FileOutputCommitter: FileOutputCommitter skip 
cleanup _temporary folders under output directory:false, ignore cleanup 
failures: false
17/05/02 06:00:26 INFO mapred.LocalJobRunner: OutputCommitter is 
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
17/05/02 06:00:26 INFO mapred.LocalJobRunner: Waiting for map tasks
17/05/02 06:00:26 INFO mapred.LocalJobRunner: Starting task: 
attempt_local307378419_0001_m_00_0
17/05/02 06:00:26 INFO output.FileOutputCommitter: File Output Committer 
Algorithm version is 1
17/05/02 06:00:26 INFO output.FileOutputCommitter: FileOutputCommitter skip 
cleanup _temporary folders under output directory:false, ignore cleanup 
failures: false
17/05/02 06:00:26 INFO mapred.Task:  Using ResourceCalculatorProcessTree : [ ]
17/05/02 06:00:26 INFO mapred.MapTask: Processing split: 
hdfs://localhost:9000/user/root/crawl/crawldb/current/part-r-0/data:0+148
17/05/02 06:00:26 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
17/05/02 06:00:26 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
17/05/02 06:00:26 INFO mapred.MapTask: soft limit at 83886080
17/05/02 06:00:26 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
17/05/02 06:00:26 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
17/05/02 06:00:26 INFO mapred.MapTask: Map output collector class = 
org.apache.hadoop.mapred.MapTask$MapOutputBuffer
17/05/02 06:00:26 INFO plugin.PluginRepository: Plugins: looking in: 
/tmp/hadoop-unjar333276722181778867/classes/plugins
17/05/02 06:0
