RE: EXTERNAL: Re: Re: 301 perm redirect pages are still in Solr

2021-03-10 Thread Hany NASR
Hello Markus.

Before running the commands I dumped the crawldb and confirmed that the 
document status is 5 (db_redir_perm). I then ran both commands with the same 
result, but the 301 documents still exist in Solr:


1.  sudo bin/nutch clean crawl/crawldb/

2.  sudo bin/nutch solrclean crawl/crawldb/


No exchange was configured. The documents will be routed to all index writers.
SolrIndexer: deleting 1000/1000 documents
SolrIndexer: deleting 1000/2000 documents
SolrIndexer: deleting 1000/3000 documents
SolrIndexer: deleting 1000/4000 documents
SolrIndexer: deleting 270/4270 documents

Did I miss anything here?

Regards,
Hany

From: Markus Jelsma 
Sent: Tuesday, March 9, 2021 11:19 AM
To: user@nutch.apache.org
Subject: EXTERNAL: Re: Re: 301 perm redirect pages are still in Solr

Hello Hany,

Sure, check these commands:

  solrclean   remove HTTP 301 and 404 documents from solr - DEPRECATED,
              use the clean command instead
  clean       remove HTTP 301 and 404 documents and duplicates from
              indexing backends configured via plugins
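
For reference, a minimal sketch of how the pieces fit together, assuming a
local crawl layout and a Solr writer configured in index-writers.xml (the
paths are illustrative):

  <!-- conf/nutch-site.xml: allow the indexer to send deletions -->
  <property>
    <name>indexer.delete</name>
    <value>true</value>
  </property>

  # then run the clean job against the crawldb
  % bin/nutch clean crawl/crawldb/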

Regards,
Markus

On Tue, 9 Mar 2021 at 08:49, Hany NASR <hany.n...@hsbc.com.invalid> wrote:

> Hello Markus,
>
> I added the property in nutch-site.xml with no luck.
>
> The documents still exist in Solr; any advice?
>
> Regards,
> Hany
>
> From: Markus Jelsma <markus.jel...@openindex.io>
> Sent: Monday, March 8, 2021 3:40 PM
> To: user@nutch.apache.org
> Subject: EXTERNAL: Re: 301 perm redirect pages are still in Solr
>
> Hello Hany,
>
> You need to tell the indexer to delete those records. This will help:
>
> <property>
>   <name>indexer.delete</name>
>   <value>true</value>
> </property>
> Regards,
> Markus
>
> On Mon, 8 Mar 2021 at 15:31, Hany NASR <hany.n...@hsbc.com.invalid> wrote:
>
> > Hi All,
> >
> > I'm using Nutch 1.15, and found that permanent redirect pages (301)
> > are still indexed and not removed from Solr.
> >
> > When I exported the crawlDB I found the page status: 5 (db_redir_perm).
> >
> > How can I keep the Solr index up to date and make Nutch clean these pages
> > automatically?
> >
> > Regards,
> > Hany


RE: EXTERNAL: Re: 301 perm redirect pages are still in Solr

2021-03-08 Thread Hany NASR
Hello Markus,

I added the property in nutch-site.xml with no luck.

The documents still exist in Solr; any advice?

Regards,
Hany

From: Markus Jelsma 
Sent: Monday, March 8, 2021 3:40 PM
To: user@nutch.apache.org
Subject: EXTERNAL: Re: 301 perm redirect pages are still in Solr

Hello Hany,

You need to tell the indexer to delete those records. This will help:

 <property>
   <name>indexer.delete</name>
   <value>true</value>
 </property>

Regards,
Markus
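
A hedged sketch of the indexing step that goes with this property (the
crawl/segment paths are illustrative); the -deleteGone switch is an
alternative way to ask the index job to delete gone and redirected pages:

  % bin/nutch index crawl/crawldb/ -linkdb crawl/linkdb/ \
      -dir crawl/segments/ -deleteGone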

On Mon, 8 Mar 2021 at 15:31, Hany NASR <hany.n...@hsbc.com.invalid> wrote:

> Hi All,
>
> I'm using Nutch 1.15, and found that permanent redirect pages (301)
> are still indexed and not removed from Solr.
>
> When I exported the crawlDB I found the page status: 5 (db_redir_perm).
>
> How can I keep the Solr index up to date and make Nutch clean these pages
> automatically?
>
> Regards,
> Hany


301 perm redirect pages are still in Solr

2021-03-08 Thread Hany NASR
Hi All,

I'm using Nutch 1.15, and found that permanent redirect pages (301) are 
still indexed and not removed from Solr.

When I exported the crawlDB I found the page status: 5 (db_redir_perm).

How can I keep the Solr index up to date and make Nutch clean these pages 
automatically?

Regards,
Hany



RE: Meta tags are duplicated

2019-04-01 Thread hany . nasr
Yes. It is working fine with 1.15


Kind regards, 
Hany Shehata
Enterprise Engineer
Green Six Sigma Certified
Solutions Architect, Marketing and Communications IT 
Corporate Functions | HSBC Operations, Services and Technology (HOST)
ul. Kapelanka 42A, 30-347 Kraków, Poland
__ 

Tie line: 7148 7689 4698 
External: +48 123 42 0698 
Mobile: +48 723 680 278 
E-mail: hany.n...@hsbc.com 
__ 
Protect our environment - please only print this if you have to!


-Original Message-
From: Sadiki Latty [mailto:sla...@uottawa.ca] 
Sent: 29 March 2019 14:11
To: user@nutch.apache.org
Subject: RE: Meta tags are duplicated

It seems to have worked for Hany (using 1.15). I successfully used this patch 
on 1.13 and 1.14.

Sadiki Latty
Web Developer/ Développeur Web
Technologies de l'information / Information Technology Université d'Ottawa | 
University of Ottawa
1 Nicholas (801)
613-562-5800 ext. 7512


-Original Message-
From: IZaBEE_Keeper [mailto:ale...@dvynedesign.com] 
Sent: March 28, 2019 7:52 PM
To: user@nutch.apache.org
Subject: RE: Meta tags are duplicated

Did this resolve the multivalued field issue?

I implemented this a while back but it did not resolve the multivalued field 
issue.

I'd really like to have the keywords as a regular field..  :)



-
Bee Keeper at IZaBEE.com
--
Sent from: http://lucene.472066.n3.nabble.com/Nutch-User-f603147.html




RE: Meta tags are duplicated

2019-03-27 Thread hany . nasr
Thank you Sadiki.

The patch is working as expected.

Kind regards, 
Hany Shehata
Enterprise Engineer
Green Six Sigma Certified
Solutions Architect, Marketing and Communications IT 
Corporate Functions | HSBC Operations, Services and Technology (HOST)
ul. Kapelanka 42A, 30-347 Kraków, Poland
__ 

Tie line: 7148 7689 4698 
External: +48 123 42 0698 
Mobile: +48 723 680 278 
E-mail: hany.n...@hsbc.com 
__ 
Protect our environment - please only print this if you have to!

-Original Message-
From: Sadiki Latty [mailto:sla...@uottawa.ca] 
Sent: 26 March 2019 12:05
To: user@nutch.apache.org
Subject: RE: Meta tags are duplicated

Hey,

This is caused by the combined use of the Tika plugin and the MetatagParser. I am 
currently using this patch to resolve the issue:

https://issues.apache.org/jira/browse/NUTCH-1559
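
For context, a hedged sketch of the parse-metatags / index-metadata
configuration this applies to (the tag names are illustrative):

  <property>
    <name>metatags.names</name>
    <value>description,keywords</value>
  </property>
  <property>
    <name>index.parse.md</name>
    <value>metatag.description,metatag.keywords</value>
  </property>

with parse-metatags and index-metadata included in plugin.includes. The
duplication appears when both the HTML/Tika parser and the metatag parser
emit the same values, which is what the patch above addresses.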

Cheers,

Sadiki Latty
Web Developer/ Développeur Web
Technologies de l’information / Information Technology Université d'Ottawa | 
University of Ottawa
1 Nicholas (801)
613-562-5800 ext. 7512


-Original Message-
From: hany.n...@hsbc.com.INVALID [mailto:hany.n...@hsbc.com.INVALID] 
Sent: March 26, 2019 4:53 AM
To: user@nutch.apache.org
Subject: Meta tags are duplicated

Hello

I'm using Nutch 1.15 and parsing/indexing meta tags using parse-metatags plugin.

Values are always come duplicated and forced me to change Solr fields to 
multivalue.

Example:  

Moreover, I ran indexchecker and can see the duplication as well.

Any advice how to remove this duplication?

Kind regards,
Hany Shehata
Enterprise Engineer
Green Six Sigma Certified
Solutions Architect, Marketing and Communications IT Corporate Functions | HSBC 
Operations, Services and Technology (HOST) ul. Kapelanka 42A, 30-347 Kraków, 
Poland __

Tie line: 7148 7689 4698
External: +48 123 42 0698
Mobile: +48 723 680 278
E-mail: hany.n...@hsbc.com
__
Protect our environment - please only print this if you have to!



-
SAVE PAPER - THINK BEFORE YOU PRINT!

This E-mail is confidential.  

It may also be legally privileged. If you are not the addressee you may not 
copy, forward, disclose or use any part of it. If you have received this 
message in error, please delete it and all copies from your system and notify 
the sender immediately by return E-mail.

Internet communications cannot be guaranteed to be timely secure, error or 
virus-free.
The sender does not accept liability for any errors or omissions.


***
This message originated from the Internet. Its originator
may or may not be who they claim to be and the information
contained in the message and any attachments may or may
not be accurate.


 


-
SAVE PAPER - THINK BEFORE YOU PRINT!

This E-mail is confidential.  

It may also be legally privileged. If you are not the addressee you may not 
copy,
forward, disclose or use any part of it. If you have received this message in 
error,
please delete it and all copies from your system and notify the sender 
immediately by
return E-mail.

Internet communications cannot be guaranteed to be timely secure, error or 
virus-free.
The sender does not accept liability for any errors or omissions.


Meta tags are duplicated

2019-03-26 Thread hany . nasr
Hello

I'm using Nutch 1.15 and parsing/indexing meta tags using parse-metatags plugin.

Values always come duplicated, which forced me to change the Solr fields to 
multivalued.

Example:  

Moreover, I ran the indexchecker and can see the duplication there as well.

Any advice on how to remove this duplication?
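
A hedged sketch of an indexchecker call that reproduces the duplication on a
single URL (plugin list, property values and URL are illustrative):

  % bin/nutch indexchecker \
     -Dplugin.includes='protocol-http|parse-(html|tika|metatags)|index-(basic|metadata)' \
     -Dmetatags.names='description,keywords' \
     -Dindex.parse.md='metatag.description,metatag.keywords' \
     'http://localhost/'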

Kind regards,
Hany Shehata
Enterprise Engineer
Green Six Sigma Certified
Solutions Architect, Marketing and Communications IT
Corporate Functions | HSBC Operations, Services and Technology (HOST)
ul. Kapelanka 42A, 30-347 Kraków, Poland
__

Tie line: 7148 7689 4698
External: +48 123 42 0698
Mobile: +48 723 680 278
E-mail: hany.n...@hsbc.com
__
Protect our environment - please only print this if you have to!





Boilerpipe algorithm is not working as expected

2019-03-19 Thread hany . nasr
Hello,

I am using the Boilerpipe algorithm in Nutch; however, I noticed that the extracted 
content is only about 5% of the page; the main page content is removed.

How does Boilerpipe work, and based on which criteria does it decide whether to 
remove a section or not?
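
For context, a hedged sketch of the Boilerpipe-related settings in
nutch-site.xml this refers to (values are illustrative; the ArticleExtractor
variant is tuned for article-like pages, which is why it can drop most of a
page that does not look like an article):

  <property>
    <name>tika.extractor</name>
    <value>boilerpipe</value>
  </property>
  <property>
    <name>tika.extractor.boilerpipe.algorithm</name>
    <value>ArticleExtractor</value>
  </property>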

Kind regards,
Hany Shehata
Enterprise Engineer
Green Six Sigma Certified
Solutions Architect, Marketing and Communications IT
Corporate Functions | HSBC Operations, Services and Technology (HOST)
ul. Kapelanka 42A, 30-347 Kraków, Poland
__

Tie line: 7148 7689 4698
External: +48 123 42 0698
Mobile: +48 723 680 278
E-mail: hany.n...@hsbc.com
__
Protect our environment - please only print this if you have to!





RE: OutOfMemoryError: GC overhead limit exceeded

2019-03-18 Thread hany . nasr
Hi,

JIRA Ticket is created: https://issues.apache.org/jira/browse/NUTCH-2703

I'm able to crawl the website and these huge PDFs with a 500MB JVM heap without 
Boilerpipe.

Enabling Boilerpipe forced me to increase the JVM heap to 8500MB.

Hope this bug can be fixed in Nutch 1.16.

Kind regards, 
Hany Shehata
Enterprise Engineer
Green Six Sigma Certified
Solutions Architect, Marketing and Communications IT 
Corporate Functions | HSBC Operations, Services and Technology (HOST)
ul. Kapelanka 42A, 30-347 Kraków, Poland
__ 

Tie line: 7148 7689 4698 
External: +48 123 42 0698 
Mobile: +48 723 680 278 
E-mail: hany.n...@hsbc.com 
__ 
Protect our environment - please only print this if you have to!


-Original Message-
From: hany.n...@hsbc.com.INVALID [mailto:hany.n...@hsbc.com.INVALID] 
Sent: 18 March 2019 12:21
To: user@nutch.apache.org
Subject: RE: OutOfMemoryError: GC overhead limit exceeded

Hello Markus,

I am able to parse these pdfs without increasing the heap. If tika extractor is 
none.

I did increase the heap with Boilerpipe enabled and didn't work by giving me 
"failed(2,200): org.apache.nutch.parse.ParseException: Unable to successfully 
parse content", then OOM.

Kind regards,
Hany Shehata
Enterprise Engineer
Green Six Sigma Certified
Solutions Architect, Marketing and Communications IT Corporate Functions | HSBC 
Operations, Services and Technology (HOST) ul. Kapelanka 42A, 30-347 Kraków, 
Poland __ 

Tie line: 7148 7689 4698
External: +48 123 42 0698
Mobile: +48 723 680 278
E-mail: hany.n...@hsbc.com
__
Protect our environment - please only print this if you have to!


-Original Message-
From: Markus Jelsma [mailto:markus.jel...@openindex.io]
Sent: 18 March 2019 12:12
To: user@nutch.apache.org
Subject: RE: OutOfMemoryError: GC overhead limit exceeded

Hello Hany,

If you deal with large PDF files, and you get an OOM with this stack trace, it 
is highly unlikely due to Boilerpipe being active. Boilerpipe does not run 
before PDFBox is finished so you should really increase the heap.

Of course, to answer the question, Boilerpipe should not run for non-(X)HTML 
pages anyway, so you can open a ticket. But the resources saved by such a 
change would be minimal at best.

Regards,
Markus
 
-Original message-
> From:hany.n...@hsbc.com.INVALID 
> Sent: Monday 18th March 2019 11:49
> To: user@nutch.apache.org
> Subject: RE: OutOfMemoryError: GC overhead limit exceeded
> 
> Hi,
> 
> I found the root cause and it is not related to JVM Heap Size.
> 
> The problem of parsing these pdfs happen when I enable the tika extractor to 
> be boilerpipe.
> 
> Boilerpipe article extractor is working perfectly with other pdfs and pages; 
> when I disable it, Tika is able to parse and index these pdfs.
> 
> Any suggestion/help?
> 
> Kind regards,
> Hany Shehata
> Enterprise Engineer
> Green Six Sigma Certified
> Solutions Architect, Marketing and Communications IT Corporate 
> Functions | HSBC Operations, Services and Technology (HOST) ul.
> Kapelanka 42A, 30-347 Kraków, Poland
> __
> 
> Tie line: 7148 7689 4698
> External: +48 123 42 0698
> Mobile: +48 723 680 278
> E-mail: hany.n...@hsbc.com
> __
> Protect our environment - please only print this if you have to!
> 
> 
> -Original Message-
> From: Sebastian Nagel [mailto:wastl.na...@googlemail.com.INVALID]
> Sent: 14 March 2019 13:06
> To: user@nutch.apache.org
> Subject: Re: OutOfMemoryError: GC overhead limit exceeded
> 
> Hi,
> 
> if running in local mode, it's better passed via ENV to bin/nutch, cf.
> 
> # Environment Variables
> #
> #   NUTCH_JAVA_HOME The java implementation to use.  Overrides JAVA_HOME.
> #
> #   NUTCH_HEAPSIZE  The maximum amount of heap to use, in MB.
> #   Default is 1000.
> #
> #   NUTCH_OPTS  Extra Java runtime options.
> #   Multiple options must be separated by white space.
> 
> In distributed mode, please read the Hadoop docs about mapper/reducer memory 
> and Java heap space.
> 
> Best,
> Sebastian
> 
> On 3/14/19 12:16 PM, hany.n...@hsbc.com.INVALID wrote:
> > I'm changing the mapred.child.java.opts=-Xmx1500m in crawl bash file.
> > 
> > Is it correct?, should I change anywhere else?
> > 
> > 
> > Kind regards,
> > Hany Shehata
> > Enterprise Engineer
> > Green Six Sigma Certified
> > Solutions Architect, Marketing and Communications IT Corporate 
> > Functions | HSBC Operations, Services and Technology (HOST) ul.
> > Kapelanka 42A, 30-347 Kraków, Poland 
> > __
> > 
> > Tie line: 7148 7689 4698
> > External: +48 123 42 0698
> > 

RE: OutOfMemoryError: GC overhead limit exceeded

2019-03-18 Thread hany . nasr
Hello Markus,

I am able to parse these PDFs without increasing the heap if the Tika extractor is 
set to none.

I did increase the heap with Boilerpipe enabled, but it didn't work: it gave me 
"failed(2,200): org.apache.nutch.parse.ParseException: Unable to successfully 
parse content", then OOM.

Kind regards, 
Hany Shehata
Enterprise Engineer
Green Six Sigma Certified
Solutions Architect, Marketing and Communications IT 
Corporate Functions | HSBC Operations, Services and Technology (HOST)
ul. Kapelanka 42A, 30-347 Kraków, Poland
__ 

Tie line: 7148 7689 4698 
External: +48 123 42 0698 
Mobile: +48 723 680 278 
E-mail: hany.n...@hsbc.com 
__ 
Protect our environment - please only print this if you have to!


-Original Message-
From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
Sent: 18 March 2019 12:12
To: user@nutch.apache.org
Subject: RE: OutOfMemoryError: GC overhead limit exceeded

Hello Hany,

If you deal with large PDF files, and you get an OOM with this stack trace, it 
is highly unlikely due to Boilerpipe being active. Boilerpipe does not run 
before PDFBox is finished so you should really increase the heap.

Of course, to answer the question, Boilerpipe should not run for non-(X)HTML 
pages anyway, so you can open a ticket. But the resources saved by such a 
change would be minimal at best.

Regards,
Markus
 
-Original message-
> From:hany.n...@hsbc.com.INVALID 
> Sent: Monday 18th March 2019 11:49
> To: user@nutch.apache.org
> Subject: RE: OutOfMemoryError: GC overhead limit exceeded
> 
> Hi,
> 
> I found the root cause and it is not related to JVM Heap Size.
> 
> The problem of parsing these pdfs happen when I enable the tika extractor to 
> be boilerpipe.
> 
> Boilerpipe article extractor is working perfectly with other pdfs and pages; 
> when I disable it, Tika is able to parse and index these pdfs.
> 
> Any suggestion/help?
> 
> Kind regards,
> Hany Shehata
> Enterprise Engineer
> Green Six Sigma Certified
> Solutions Architect, Marketing and Communications IT Corporate 
> Functions | HSBC Operations, Services and Technology (HOST) ul. 
> Kapelanka 42A, 30-347 Kraków, Poland 
> __
> 
> Tie line: 7148 7689 4698
> External: +48 123 42 0698
> Mobile: +48 723 680 278
> E-mail: hany.n...@hsbc.com
> __
> Protect our environment - please only print this if you have to!
> 
> 
> -Original Message-
> From: Sebastian Nagel [mailto:wastl.na...@googlemail.com.INVALID]
> Sent: 14 March 2019 13:06
> To: user@nutch.apache.org
> Subject: Re: OutOfMemoryError: GC overhead limit exceeded
> 
> Hi,
> 
> if running in local mode, it's better passed via ENV to bin/nutch, cf.
> 
> # Environment Variables
> #
> #   NUTCH_JAVA_HOME The java implementation to use.  Overrides JAVA_HOME.
> #
> #   NUTCH_HEAPSIZE  The maximum amount of heap to use, in MB.
> #   Default is 1000.
> #
> #   NUTCH_OPTS  Extra Java runtime options.
> #   Multiple options must be separated by white space.
> 
> In distributed mode, please read the Hadoop docs about mapper/reducer memory 
> and Java heap space.
> 
> Best,
> Sebastian
> 
> On 3/14/19 12:16 PM, hany.n...@hsbc.com.INVALID wrote:
> > I'm changing the mapred.child.java.opts=-Xmx1500m in crawl bash file.
> > 
> > Is it correct?, should I change anywhere else?
> > 
> > 
> > Kind regards,
> > Hany Shehata
> > Enterprise Engineer
> > Green Six Sigma Certified
> > Solutions Architect, Marketing and Communications IT Corporate 
> > Functions | HSBC Operations, Services and Technology (HOST) ul.
> > Kapelanka 42A, 30-347 Kraków, Poland 
> > __
> > 
> > Tie line: 7148 7689 4698
> > External: +48 123 42 0698
> > Mobile: +48 723 680 278
> > E-mail: hany.n...@hsbc.com
> > __
> > Protect our environment - please only print this if you have to!
> > 
> > 
> > -Original Message-
> > From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> > Sent: 14 March 2019 10:59
> > To: user@nutch.apache.org
> > Subject: RE: OutOfMemoryError: GC overhead limit exceeded
> > 
> > Hello - 1500 MB is a lot indeed, but 3500 PDF pages is even more. You have 
> > no choice, either skip large files, or increase memory.
> > 
> > Regards,
> > Markus
> > 
> >  
> >  
> > -Original message-
> >> From:hany.n...@hsbc.com.INVALID 
> >> Sent: Thursday 14th March 2019 10:44
> >> To: user@nutch.apache.org
> >> Subject: OutOfMemoryError: GC overhead limit exceeded
> >>
> >> Hello,
> >>
> >> I'm facing OutOfMemoryError: GC overhead limit exceeded exception while 
> >> trying to parse pdfs that includes 3500 pages.
> >>
> >> I increased the JVM RAM to 1500MB; however, I'm still 

RE: OutOfMemoryError: GC overhead limit exceeded

2019-03-18 Thread hany . nasr
Hi,

Is there any workaround for now to exclude PDFs from Boilerpipe processing?


Kind regards, 
Hany Shehata
Enterprise Engineer
Green Six Sigma Certified
Solutions Architect, Marketing and Communications IT 
Corporate Functions | HSBC Operations, Services and Technology (HOST)
ul. Kapelanka 42A, 30-347 Kraków, Poland
__ 

Tie line: 7148 7689 4698 
External: +48 123 42 0698 
Mobile: +48 723 680 278 
E-mail: hany.n...@hsbc.com 
__ 
Protect our environment - please only print this if you have to!


-Original Message-
From: Sebastian Nagel [mailto:wastl.na...@googlemail.com.INVALID] 
Sent: 18 March 2019 12:01
To: user@nutch.apache.org
Subject: Re: OutOfMemoryError: GC overhead limit exceeded

Hi,

good point.

Maybe we should implement a limit on the usage of boilerpipe:
- either by MIME type (only HTML types)
  I doubt that boilerpipe has been implemented for any formats except HTML
- or by document size (or size of the DOM tree)

Please open a Jira issue to implement this.
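
A hypothetical sketch of the MIME-type guard described above -- not actual
Nutch code, just the shape of the check:

  // Only hand content to Boilerpipe when it is (X)HTML; everything else
  // falls back to plain Tika text extraction.
  public class BoilerpipeGuard {
    static boolean useBoilerpipe(String extractor, String mimeType) {
      if (!"boilerpipe".equals(extractor)) {
        return false; // extractor disabled or set to "none"
      }
      return "text/html".equals(mimeType)
          || "application/xhtml+xml".equals(mimeType);
    }
  }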

But you may also ask on the Tika user mailing list about the problem first.

Best,
Sebastian


On 3/18/19 11:49 AM, hany.n...@hsbc.com.INVALID wrote:
> Hi,
> 
> I found the root cause and it is not related to JVM Heap Size.
> 
> The problem of parsing these pdfs happen when I enable the tika extractor to 
> be boilerpipe.
> 
> Boilerpipe article extractor is working perfectly with other pdfs and pages; 
> when I disable it, Tika is able to parse and index these pdfs.
> 
> Any suggestion/help?
> 
> Kind regards,
> Hany Shehata
> Enterprise Engineer
> Green Six Sigma Certified
> Solutions Architect, Marketing and Communications IT Corporate 
> Functions | HSBC Operations, Services and Technology (HOST) ul. 
> Kapelanka 42A, 30-347 Kraków, Poland 
> __
> 
> Tie line: 7148 7689 4698
> External: +48 123 42 0698
> Mobile: +48 723 680 278
> E-mail: hany.n...@hsbc.com
> __
> Protect our environment - please only print this if you have to!
> 
> 
> -Original Message-
> From: Sebastian Nagel [mailto:wastl.na...@googlemail.com.INVALID]
> Sent: 14 March 2019 13:06
> To: user@nutch.apache.org
> Subject: Re: OutOfMemoryError: GC overhead limit exceeded
> 
> Hi,
> 
> if running in local mode, it's better passed via ENV to bin/nutch, cf.
> 
> # Environment Variables
> #
> #   NUTCH_JAVA_HOME The java implementation to use.  Overrides JAVA_HOME.
> #
> #   NUTCH_HEAPSIZE  The maximum amount of heap to use, in MB.
> #   Default is 1000.
> #
> #   NUTCH_OPTS  Extra Java runtime options.
> #   Multiple options must be separated by white space.
> 
> In distributed mode, please read the Hadoop docs about mapper/reducer memory 
> and Java heap space.
> 
> Best,
> Sebastian
> 
> On 3/14/19 12:16 PM, hany.n...@hsbc.com.INVALID wrote:
>> I'm changing the mapred.child.java.opts=-Xmx1500m in crawl bash file.
>>
>> Is it correct?, should I change anywhere else?
>>
>>
>> Kind regards,
>> Hany Shehata
>> Enterprise Engineer
>> Green Six Sigma Certified
>> Solutions Architect, Marketing and Communications IT Corporate 
>> Functions | HSBC Operations, Services and Technology (HOST) ul.
>> Kapelanka 42A, 30-347 Kraków, Poland 
>> __
>>
>> Tie line: 7148 7689 4698
>> External: +48 123 42 0698
>> Mobile: +48 723 680 278
>> E-mail: hany.n...@hsbc.com
>> __
>> Protect our environment - please only print this if you have to!
>>
>>
>> -Original Message-
>> From: Markus Jelsma [mailto:markus.jel...@openindex.io]
>> Sent: 14 March 2019 10:59
>> To: user@nutch.apache.org
>> Subject: RE: OutOfMemoryError: GC overhead limit exceeded
>>
>> Hello - 1500 MB is a lot indeed, but 3500 PDF pages is even more. You have 
>> no choice, either skip large files, or increase memory.
>>
>> Regards,
>> Markus
>>
>>  
>>  
>> -Original message-
>>> From:hany.n...@hsbc.com.INVALID 
>>> Sent: Thursday 14th March 2019 10:44
>>> To: user@nutch.apache.org
>>> Subject: OutOfMemoryError: GC overhead limit exceeded
>>>
>>> Hello,
>>>
>>> I'm facing OutOfMemoryError: GC overhead limit exceeded exception while 
>>> trying to parse pdfs that includes 3500 pages.
>>>
>>> I increased the JVM RAM to 1500MB; however, I'm still facing the 
>>> same problem
>>>
>>> Please advise
>>>
>>> 2019-03-08 05:31:55,269 WARN  parse.ParseUtil - Error parsing
>>> http://domain/-/media/files/attachments/common/voting_disclosure_201
>>> 4 _ q2.pdf with org.apache.nutch.parse.tika.TikaParser
>>> java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: GC 
>>> overhead limit exceeded
>>> at 
>>> 

RE: OutOfMemoryError: GC overhead limit exceeded

2019-03-18 Thread hany . nasr
Hi,

I found the root cause and it is not related to the JVM heap size.

The problem parsing these PDFs happens when I set the Tika extractor to 
boilerpipe.

The Boilerpipe article extractor works perfectly with other PDFs and pages; 
when I disable it, Tika is able to parse and index these PDFs.

Any suggestion/help?

Kind regards, 
Hany Shehata
Enterprise Engineer
Green Six Sigma Certified
Solutions Architect, Marketing and Communications IT 
Corporate Functions | HSBC Operations, Services and Technology (HOST)
ul. Kapelanka 42A, 30-347 Kraków, Poland
__ 

Tie line: 7148 7689 4698 
External: +48 123 42 0698 
Mobile: +48 723 680 278 
E-mail: hany.n...@hsbc.com 
__ 
Protect our environment - please only print this if you have to!


-Original Message-
From: Sebastian Nagel [mailto:wastl.na...@googlemail.com.INVALID] 
Sent: 14 March 2019 13:06
To: user@nutch.apache.org
Subject: Re: OutOfMemoryError: GC overhead limit exceeded

Hi,

if running in local mode, it's better passed via ENV to bin/nutch, cf.

# Environment Variables
#
#   NUTCH_JAVA_HOME The java implementation to use.  Overrides JAVA_HOME.
#
#   NUTCH_HEAPSIZE  The maximum amount of heap to use, in MB.
#   Default is 1000.
#
#   NUTCH_OPTS  Extra Java runtime options.
#   Multiple options must be separated by white space.

In distributed mode, please read the Hadoop docs about mapper/reducer memory 
and Java heap space.

Best,
Sebastian

On 3/14/19 12:16 PM, hany.n...@hsbc.com.INVALID wrote:
> I'm changing the mapred.child.java.opts=-Xmx1500m in crawl bash file.
> 
> Is it correct?, should I change anywhere else?
> 
> 
> Kind regards,
> Hany Shehata
> Enterprise Engineer
> Green Six Sigma Certified
> Solutions Architect, Marketing and Communications IT Corporate 
> Functions | HSBC Operations, Services and Technology (HOST) ul. 
> Kapelanka 42A, 30-347 Kraków, Poland 
> __
> 
> Tie line: 7148 7689 4698
> External: +48 123 42 0698
> Mobile: +48 723 680 278
> E-mail: hany.n...@hsbc.com
> __
> Protect our environment - please only print this if you have to!
> 
> 
> -Original Message-
> From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> Sent: 14 March 2019 10:59
> To: user@nutch.apache.org
> Subject: RE: OutOfMemoryError: GC overhead limit exceeded
> 
> Hello - 1500 MB is a lot indeed, but 3500 PDF pages is even more. You have no 
> choice, either skip large files, or increase memory.
> 
> Regards,
> Markus
> 
>  
>  
> -Original message-
>> From:hany.n...@hsbc.com.INVALID 
>> Sent: Thursday 14th March 2019 10:44
>> To: user@nutch.apache.org
>> Subject: OutOfMemoryError: GC overhead limit exceeded
>>
>> Hello,
>>
>> I'm facing OutOfMemoryError: GC overhead limit exceeded exception while 
>> trying to parse pdfs that includes 3500 pages.
>>
>> I increased the JVM RAM to 1500MB; however, I'm still facing the same 
>> problem
>>
>> Please advise
>>
>> 2019-03-08 05:31:55,269 WARN  parse.ParseUtil - Error parsing 
>> http://domain/-/media/files/attachments/common/voting_disclosure_2014
>> _ q2.pdf with org.apache.nutch.parse.tika.TikaParser
>> java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: GC 
>> overhead limit exceeded
>> at 
>> java.util.concurrent.FutureTask.report(FutureTask.java:122)
>> at java.util.concurrent.FutureTask.get(FutureTask.java:206)
>> at 
>> org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:188)
>> at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:92)
>> at 
>> org.apache.nutch.parse.ParseSegment$ParseSegmentMapper.map(ParseSegment.java:127)
>> at 
>> org.apache.nutch.parse.ParseSegment$ParseSegmentMapper.map(ParseSegment.java:78)
>> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
>> at 
>> org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
>> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
>> at 
>> org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243)
>> at 
>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>> at 
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>> at 
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>> at java.lang.Thread.run(Thread.java:748)
>> Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
>> at 
>> 

RE: OutOfMemoryError: GC overhead limit exceeded

2019-03-14 Thread hany . nasr
I'm changing mapred.child.java.opts=-Xmx1500m in the crawl bash file.

Is that correct? Should I change it anywhere else?


Kind regards, 
Hany Shehata
Enterprise Engineer
Green Six Sigma Certified
Solutions Architect, Marketing and Communications IT 
Corporate Functions | HSBC Operations, Services and Technology (HOST)
ul. Kapelanka 42A, 30-347 Kraków, Poland
__ 

Tie line: 7148 7689 4698 
External: +48 123 42 0698 
Mobile: +48 723 680 278 
E-mail: hany.n...@hsbc.com 
__ 
Protect our environment - please only print this if you have to!


-Original Message-
From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
Sent: 14 March 2019 10:59
To: user@nutch.apache.org
Subject: RE: OutOfMemoryError: GC overhead limit exceeded

Hello - 1500 MB is a lot indeed, but 3500 PDF pages is even more. You have no 
choice, either skip large files, or increase memory.

Regards,
Markus

 
 
-Original message-
> From:hany.n...@hsbc.com.INVALID 
> Sent: Thursday 14th March 2019 10:44
> To: user@nutch.apache.org
> Subject: OutOfMemoryError: GC overhead limit exceeded
> 
> Hello,
> 
> I'm facing OutOfMemoryError: GC overhead limit exceeded exception while 
> trying to parse pdfs that includes 3500 pages.
> 
> I increased the JVM RAM to 1500MB; however, I'm still facing the same 
> problem
> 
> Please advise
> 
> 2019-03-08 05:31:55,269 WARN  parse.ParseUtil - Error parsing 
> http://domain/-/media/files/attachments/common/voting_disclosure_2014_
> q2.pdf with org.apache.nutch.parse.tika.TikaParser
> java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: GC 
> overhead limit exceeded
> at java.util.concurrent.FutureTask.report(FutureTask.java:122)
> at java.util.concurrent.FutureTask.get(FutureTask.java:206)
> at 
> org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:188)
> at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:92)
> at 
> org.apache.nutch.parse.ParseSegment$ParseSegmentMapper.map(ParseSegment.java:127)
> at 
> org.apache.nutch.parse.ParseSegment$ParseSegmentMapper.map(ParseSegment.java:78)
> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
> at 
> org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
> at 
> org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
> at 
> org.apache.pdfbox.text.PDFTextStripper.writePage(PDFTextStripper.java:564)
> at 
> org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:392)
> at 
> org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:147)
> at 
> org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
> at 
> org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
> at 
> org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117)
> at 
> org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:171)
> at 
> org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:138)
> at 
> org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:79)
> at 
> org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
> at 
> org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
> 
> Kind regards,
> Hany Shehata
> Enterprise Engineer
> Green Six Sigma Certified
> Solutions Architect, Marketing and Communications IT Corporate 
> Functions | HSBC Operations, Services and Technology (HOST) ul. 
> Kapelanka 42A, 30-347 Kraków, Poland 
> __
> 
> Tie line: 7148 7689 4698
> External: +48 123 42 0698
> Mobile: +48 723 680 278
> E-mail: hany.n...@hsbc.com
> __
> Protect our environment - please only print this if you have to!
> 
> 
> 

RE: Nutch and HTTP headers

2019-03-14 Thread hany . nasr
Thank you so much.

I'm able to index the HTTP headers.

I can't imagine my life without this group :)

Kind regards, 
Hany Shehata
Enterprise Engineer
Green Six Sigma Certified
Solutions Architect, Marketing and Communications IT 
Corporate Functions | HSBC Operations, Services and Technology (HOST)
ul. Kapelanka 42A, 30-347 Kraków, Poland
__ 

Tie line: 7148 7689 4698 
External: +48 123 42 0698 
Mobile: +48 723 680 278 
E-mail: hany.n...@hsbc.com 
__ 
Protect our environment - please only print this if you have to!

-Original Message-
From: Sebastian Nagel [mailto:wastl.na...@googlemail.com.INVALID] 
Sent: 13 March 2019 18:41
To: user@nutch.apache.org
Subject: Re: Nutch and HTTP headers

Hi,

> How can I index this value on Solr?

 1. add the field "_response.headers_" to the Solr schema, see
  http://localhost:8983/solr/#/nutch/schema

 2. set the property store.http.headers = true

 3. you can test it sending a single document using the indexchecker:

   % bin/nutch indexchecker \
  -Dplugin.includes='protocol-okhttp|parse-html|index-metadata|indexer-solr' \
  -Dstore.http.headers=true \
  -Dindex.content.md=_response.headers_ \
  -DdoIndex=true \
  'http://localhost/'
   fetching: http://localhost/
   ...
   Indexing 1/1 documents
   Deleting 0 documents

 4. Solr should contain the document including the header

    "response":{"numFound":1,"start":0,"docs":[
      {
        "digest":"3526531ccd6c6a1d2340574a305a18f8",
        "id":"http://localhost/",
        "_response.headers_":"HTTP/1.1 200 OK\r\nDate: Wed, 13 Mar 2019 17:29:49 ..."


> What is the difference between protocol-okhttp and protocol-http?

There are few differences, see NUTCH-2576.

For historic reasons (NUTCH-2213) protocol-http does not always keep the 
original HTTP header while protocol-okhttp does.  I think we can remove this 
restriction, feel free to open a Jira issue for this.

Best,
Sebastian



On 3/13/19 9:21 AM, hany.n...@hsbc.com.INVALID wrote:
> Thank you Sebastian.
> 
> I'm able to get the HTTP headers as you explained below.
> 
> How can I index this value on Solr?
> What is the difference between protocol-okhttp and protocol-http?
> 
> Kind regards,
> Hany Shehata
> Enterprise Engineer
> Green Six Sigma Certified
> Solutions Architect, Marketing and Communications IT Corporate 
> Functions | HSBC Operations, Services and Technology (HOST) ul. 
> Kapelanka 42A, 30-347 Kraków, Poland 
> __
> 
> Tie line: 7148 7689 4698
> External: +48 123 42 0698
> Mobile: +48 723 680 278
> E-mail: hany.n...@hsbc.com
> __
> Protect our environment - please only print this if you have to!
> 
> 
> -Original Message-
> From: Sebastian Nagel [mailto:wastl.na...@googlemail.com.INVALID]
> Sent: 11 March 2019 17:06
> To: user@nutch.apache.org
> Subject: Re: Nutch and HTTP headers
> 
> Hi,
> 
>> Can Nutch index custom HTTP headers?
> 
> Nutch stores the HTTP response headers if the property `store.http.headers` 
> is true.  The headers are saved as string concatenated by `\r\n` under the 
> key `_response.headers_` in the content metadata.
> 
> You can send the entire HTTP headers to the indexer using the plugin 
> index-metadata and adding `_response.headers_` to `index.content.md`.  It 
> will add a field `_response.headers_` to the index:
> 
>  % bin/nutch indexchecker \
> -Dplugin.includes='protocol-okhttp|parse-html|index-metadata' \
> -Dstore.http.headers=true \
> -Dindex.content.md=_response.headers_ \
>'http://localhost/'
>  fetching: http://localhost/
>  ...
>  _response.headers_ :HTTP/1.1 200 OK
>  Date: Mon, 11 Mar 2019 16:03:41 GMT
>  Server: Apache/2.4.29 (Ubuntu)
>  Last-Modified: ...
> 
> But there is no standard way to pick single headers and send them to the 
> indexer as arbitrary fields.
> 
> Best,
> Sebastian
> 
> 
> On 3/11/19 4:21 PM, hany.n...@hsbc.com.INVALID wrote:
>> Hello,
>>
>> Can Nutch index custom HTTP headers?
>>
>> Kind regards,
>> Hany Shehata
>> Enterprise Engineer
>> Green Six Sigma Certified
>> Solutions Architect, Marketing and Communications IT Corporate 
>> Functions | HSBC Operations, Services and Technology (HOST) ul.
>> Kapelanka 42A, 30-347 Kraków, Poland 
>> __
>>
>> Tie line: 7148 7689 4698
>> External: +48 123 42 0698
>> Mobile: +48 723 680 278
>> E-mail: hany.n...@hsbc.com
>> __
>> Protect our environment - please only print this if you have to!
>>
>>
>>

OutOfMemoryError: GC overhead limit exceeded

2019-03-14 Thread hany . nasr
Hello,

I'm facing an OutOfMemoryError: GC overhead limit exceeded exception while trying 
to parse PDFs that include 3,500 pages.

I increased the JVM heap to 1500MB; however, I'm still facing the same problem.

Please advise.

2019-03-08 05:31:55,269 WARN  parse.ParseUtil - Error parsing 
http://domain/-/media/files/attachments/common/voting_disclosure_2014_q2.pdf 
with org.apache.nutch.parse.tika.TikaParser
java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: GC 
overhead limit exceeded
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:206)
at 
org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:188)
at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:92)
at 
org.apache.nutch.parse.ParseSegment$ParseSegmentMapper.map(ParseSegment.java:127)
at 
org.apache.nutch.parse.ParseSegment$ParseSegmentMapper.map(ParseSegment.java:78)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
at 
org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at 
org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
at 
org.apache.pdfbox.text.PDFTextStripper.writePage(PDFTextStripper.java:564)
at 
org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:392)
at 
org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:147)
at 
org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
at 
org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
at 
org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117)
at 
org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:171)
at 
org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:138)
at 
org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:79)
at 
org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
at 
org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)

Kind regards,
Hany Shehata
Enterprise Engineer
Green Six Sigma Certified
Solutions Architect, Marketing and Communications IT
Corporate Functions | HSBC Operations, Services and Technology (HOST)
ul. Kapelanka 42A, 30-347 Kraków, Poland
__

Tie line: 7148 7689 4698
External: +48 123 42 0698
Mobile: +48 723 680 278
E-mail: hany.n...@hsbc.com
__
Protect our environment - please only print this if you have to!





Nutch and HTTP headers

2019-03-11 Thread hany . nasr
Hello,

Can Nutch index custom HTTP headers?

Kind regards,
Hany Shehata
Enterprise Engineer
Green Six Sigma Certified
Solutions Architect, Marketing and Communications IT
Corporate Functions | HSBC Operations, Services and Technology (HOST)
ul. Kapelanka 42A, 30-347 Kraków, Poland
__

Tie line: 7148 7689 4698
External: +48 123 42 0698
Mobile: +48 723 680 278
E-mail: hany.n...@hsbc.com
__
Protect our environment - please only print this if you have to!





Fetcher intervals

2019-02-01 Thread hany . nasr
Hello,

We're crawling websites of different sizes (small, medium and big) and would like 
to ask, based on your experience:

What are the best values for the fetcher intervals?


-  db.fetch.interval.default

-  db.fetch.interval.max

Bear in mind that:

1.   I am running the crawl every 6 hours to make sure that the latest 
published content is crawled.

2.   In case of unexpected website responses like HTTP code 500, Nutch 
will only re-fetch these links after the interval max number of seconds has 
been exceeded.
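
For reference, a hedged sketch of how these intervals are typically set in
nutch-site.xml (the values below are purely illustrative, not
recommendations):

  <property>
    <name>db.fetch.interval.default</name>
    <value>86400</value><!-- re-fetch successfully fetched pages after 1 day -->
  </property>
  <property>
    <name>db.fetch.interval.max</name>
    <value>604800</value><!-- never postpone a re-fetch longer than 7 days -->
  </property>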

Kind regards,
Hany Shehata
Enterprise Engineer
Green Six Sigma Certified
Solutions Architect, Marketing and Communications IT
Corporate Functions | HSBC Operations, Services and Technology (HOST)
ul. Kapelanka 42A, 30-347 Kraków, Poland
__

Tie line: 7148 7689 4698
External: +48 123 42 0698
Mobile: +48 723 680 278
E-mail: hany.n...@hsbc.com
__
Protect our environment - please only print this if you have to!





RE: nutch 1.15 index multiple cores with solr 7.5

2018-12-21 Thread hany . nasr
Same issue here. What did you do with the URL regex & normalization rules? These 
configurations might differ from one site to another.


Kind regards, 
Hany Shehata
Enterprise Engineer
Green Six Sigma Certified
Solutions Architect, Marketing and Communications IT 
Corporate Functions | HSBC Operations, Services and Technology (HOST)
ul. Kapelanka 42A, 30-347 Kraków, Poland
__ 

Tie line: 7148 7689 4698 
External: +48 123 42 0698 
Mobile: +48 723 680 278 
E-mail: hany.n...@hsbc.com 
__ 
Protect our environment - please only print this if you have to!

-Original Message-
From: Lucas Reyes [mailto:tintanca...@gmail.com] 
Sent: 20 December 2018 22:39
To: user@nutch.apache.org
Subject: nutch 1.15 index multiple cores with solr 7.5

I'm using nutch 1.15 and solr 7.5 with *the need to index multiple cores*.
I have created a separate crawldb and linkdb for each core, and then updated 
index-writers.xml with multiple solr writers (each writer_id matching the 
corresponding core's name). Also, param name="url" points to each solr core, 
but since there's no place to pass a param indicating the writer id nor the 
solr core, the bin/nutch index command indexes a specific crawldb against all 
cores. Of course, I need to only index crawldb1 to core1, and so on.

Any suggestion on resolving this?

Thanks in advance.




Nutch fetch job failed

2018-12-11 Thread hany . nasr
Hello,

I crawled my website and forgot to whitelist my Nutch server IP, so as expected 
Nutch got a 403 and didn't fetch the URL.

After I whitelisted the server, Nutch is not able to re-crawl the URL and still 
sees it as 403.

The same happened when the website was down and the crawlDB had 500 for all 
links; it refuses to re-fetch those links.

What should I do? My workaround is to delete the crawl folder and re-crawl 
again, but is this the right way? If so, it is really not good, as I also need 
to clean the Solr core to make sure that Nutch and Solr stay in sync.

Kind regards,
Hany Shehata
Enterprise Engineer
Green Six Sigma Certified
Solutions Architect, Marketing and Communications IT
Corporate Functions | HSBC Operations, Services and Technology (HOST)
ul. Kapelanka 42A, 30-347 Kraków, Poland
__

Tie line: 7148 7689 4698
External: +48 123 42 0698
Mobile: +48 723 680 278
E-mail: hany.n...@hsbc.com
__
Protect our environment - please only print this if you have to!





RE: mapred.child.java.opts

2018-12-10 Thread hany . nasr
Thank you. It is really very helpful.

Kind regards, 
Hany Shehata
Enterprise Engineer
Green Six Sigma Certified
Solutions Architect, Marketing and Communications IT 
Corporate Functions | HSBC Operations, Services and Technology (HOST)
ul. Kapelanka 42A, 30-347 Kraków, Poland
__ 

Tie line: 7148 7689 4698 
External: +48 123 42 0698 
Mobile: +48 723 680 278 
E-mail: hany.n...@hsbc.com 
__ 
Protect our environment - please only print this if you have to!


-Original Message-
From: Sebastian Nagel [mailto:wastl.na...@googlemail.com.INVALID] 
Sent: 10 December 2018 10:37
To: user@nutch.apache.org
Subject: Re: mapred.child.java.opts

Hi,

> The 1000MB is static value, will the crawl bash script respect NUTCH_HEAPSIZE?

Yes, in local mode it will respect the value of the environment variable 
NUTCH_HEAPSIZE.
Respectively, the script $NUTCH_HOME/bin/nutch called by bin/crawl will respect 
it.

> How can I set NUTCH_HEAPSIZE?

It's an environment variable. How to set it might depend on the shell you're 
using.
E.g., for the bash shell:
  % export NUTCH_HEAPSIZE=2048
  % bin/crawl ...

Best,
Sebastian


On 12/7/18 4:05 PM, hany.n...@hsbc.com wrote:
> Thank you Sebastian.
> 
> I am using standalone Nutch and using crawl command. Didn't install 
> separate Hadoop cluster
> 
> The 1000MB is static value, will the crawl bash script respect NUTCH_HEAPSIZE?
> How can I set NUTCH_HEAPSIZE?
> 
> Kind regards,
> Hany Shehata
> Enterprise Engineer
> Green Six Sigma Certified
> Solutions Architect, Marketing and Communications IT Corporate 
> Functions | HSBC Operations, Services and Technology (HOST) ul. 
> Kapelanka 42A, 30-347 Kraków, Poland 
> __
> 
> Tie line: 7148 7689 4698
> External: +48 123 42 0698
> Mobile: +48 723 680 278
> E-mail: hany.n...@hsbc.com
> __
> Protect our environment - please only print this if you have to!
> 
> -Original Message-
> From: Sebastian Nagel [mailto:wastl.na...@googlemail.com.INVALID]
> Sent: 07 December 2018 15:44
> To: user@nutch.apache.org
> Subject: Re: mapred.child.java.opts
> 
> Hi,
> 
> yes, of course, the comments just one line above even encourages you to do so:
> 
> # note that some of the options listed here could be set in the # 
> corresponding hadoop site xml param file
> 
> For most use cases this value is ok. Only if you're using a parsing fetcher 
> with many threads you may need more Java heap memory. Note that this setting 
> only applies to a (pseudo-)distributed mode (running on Hadoop). In locale 
> mode you can set the Java heap size via the environment variable 
> NUTCH_HEAPSIZE.
> 
> 
>> What will be the impact?
> 
> That depends mostly on your Hadoop cluster setup. Afaik, the properties 
> mapreduce.map.java.opts resp. mapreduce.reduce.java.opts will override 
> mapred.child.java.opts on Hadoop 2.x, so on a recent configured Hadoop 
> cluster there is usually zero impact.
> 
> There is also a Jira issue open to make the heap memory configurable 
> in distributed mode, see
> https://issues.apache.org/jira/browse/NUTCH-2501
> 
> 
> Best,
> Sebastian
> 
> On 12/7/18 3:08 PM, hany.n...@hsbc.com wrote:
>> Hello,
>>
>> While checking the Nutch (1.15) crawl bash file, I noticed at line 
>> 211 that 1000MB is statically set for java - > 
>> mapred.child.java.opts=-Xmx1000m
>>
>> Any idea why?, Can I change it?, What will be the impact?
>> Kind regards,
>> Hany Shehata
>> Enterprise Engineer
>> Green Six Sigma Certified
>> Solutions Architect, Marketing and Communications IT Corporate 
>> Functions | HSBC Operations, Services and Technology (HOST) ul.
>> Kapelanka 42A, 30-347 Kraków, Poland 
>> __
>>
>> Tie line: 7148 7689 4698
>> External: +48 123 42 0698
>> Mobile: +48 723 680 278
>> E-mail: hany.n...@hsbc.com
>> __
>> Protect our environment - please only print this if you have to!
>>
>>



RE: mapred.child.java.opts

2018-12-07 Thread hany . nasr
Thank you Sebastian.

I am using standalone Nutch with the crawl command; I didn't install a separate 
Hadoop cluster.

The 1000MB is a static value; will the crawl bash script respect NUTCH_HEAPSIZE?
How can I set NUTCH_HEAPSIZE?

Kind regards, 
Hany Shehata
Enterprise Engineer
Green Six Sigma Certified
Solutions Architect, Marketing and Communications IT 
Corporate Functions | HSBC Operations, Services and Technology (HOST)
ul. Kapelanka 42A, 30-347 Kraków, Poland
__ 

Tie line: 7148 7689 4698 
External: +48 123 42 0698 
Mobile: +48 723 680 278 
E-mail: hany.n...@hsbc.com 
__ 
Protect our environment - please only print this if you have to!

-Original Message-
From: Sebastian Nagel [mailto:wastl.na...@googlemail.com.INVALID] 
Sent: 07 December 2018 15:44
To: user@nutch.apache.org
Subject: Re: mapred.child.java.opts

Hi,

yes, of course, the comments just one line above even encourages you to do so:

# note that some of the options listed here could be set in the
# corresponding hadoop site xml param file

For most use cases this value is OK. Only if you're using a parsing fetcher 
with many threads might you need more Java heap memory. Note that this setting 
only applies in (pseudo-)distributed mode (running on Hadoop). In local mode 
you can set the Java heap size via the environment variable NUTCH_HEAPSIZE.
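
For example, a minimal sketch for local mode (the value is in megabytes and is 
picked up by bin/nutch, which the crawl script calls; the seed dir, crawl dir 
and number of rounds below are just example arguments):

  # give local Nutch jobs 4 GB of heap (value in MB)
  export NUTCH_HEAPSIZE=4096
  bin/crawl -i -s urls crawl 2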


> What will be the impact?

That depends mostly on your Hadoop cluster setup. Afaik, the properties 
mapreduce.map.java.opts and mapreduce.reduce.java.opts override 
mapred.child.java.opts on Hadoop 2.x, so on a recently configured Hadoop cluster 
there is usually zero impact.
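
For example, on the Hadoop side these are typically set in mapred-site.xml 
(the 2 GB values below are only an illustration):

  <property>
    <name>mapreduce.map.java.opts</name>
    <value>-Xmx2048m</value>
  </property>
  <property>
    <name>mapreduce.reduce.java.opts</name>
    <value>-Xmx2048m</value>
  </property>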

There is also a Jira issue open to make the heap memory configurable in 
distributed mode, see
https://issues.apache.org/jira/browse/NUTCH-2501


Best,
Sebastian

On 12/7/18 3:08 PM, hany.n...@hsbc.com wrote:
> Hello,
> 
> While checking the Nutch (1.15) crawl bash file, I noticed at line 211 
> that 1000MB is statically set for Java: 
> mapred.child.java.opts=-Xmx1000m
> 
> Any idea why? Can I change it? What will be the impact?
> Kind regards,
> Hany Shehata
> Enterprise Engineer
> Green Six Sigma Certified
> Solutions Architect, Marketing and Communications IT Corporate 
> Functions | HSBC Operations, Services and Technology (HOST) ul. 
> Kapelanka 42A, 30-347 Kraków, Poland 
> __
> 
> Tie line: 7148 7689 4698
> External: +48 123 42 0698
> Mobile: +48 723 680 278
> E-mail: hany.n...@hsbc.com
> __
> Protect our environment - please only print this if you have to!
> 
> 
> 


mapred.child.java.opts

2018-12-07 Thread hany . nasr
Hello,

While checking the Nutch (1.15) crawl bash file, I noticed at line 211 that 
1000MB is statically set for Java: mapred.child.java.opts=-Xmx1000m

Any idea why? Can I change it? What will be the impact?
Kind regards,
Hany Shehata
Enterprise Engineer
Green Six Sigma Certified
Solutions Architect, Marketing and Communications IT
Corporate Functions | HSBC Operations, Services and Technology (HOST)
ul. Kapelanka 42A, 30-347 Kraków, Poland
__

Tie line: 7148 7689 4698
External: +48 123 42 0698
Mobile: +48 723 680 278
E-mail: hany.n...@hsbc.com
__
Protect our environment - please only print this if you have to!





RE: RE: unexpected Nutch crawl interruption

2018-11-19 Thread hany . nasr
Does this mean there is no such thing as a corrupted crawldb at all?


Kind regards, 
Hany Shehata
Solutions Architect, Marketing and Communications IT 
Corporate Functions | HSBC Operations, Services and Technology (HOST)
ul. Kapelanka 42A, 30-347 Kraków, Poland
__ 

Tie line: 7148 7689 4698 
External: +48 123 42 0698 
Mobile: +48 723 680 278 
E-mail: hany.n...@hsbc.com 
__ 
Protect our environment - please only print this if you have to!


-Original Message-
From: Semyon Semyonov [mailto:semyon.semyo...@mail.com] 
Sent: Monday, November 19, 2018 12:59 PM
To: user@nutch.apache.org
Subject: Re: RE: unexpected Nutch crawl interruption

From the most recently updated crawldb.
 

Sent: Monday, November 19, 2018 at 12:35 PM
From: hany.n...@hsbc.com
To: "user@nutch.apache.org" 
Subject: RE: unexpected Nutch crawl interruption

Hello Semyon,

Does that mean that if I re-run the crawl command it will continue from where 
it stopped in the previous run?

Kind regards,
Hany Shehata
Solutions Architect, Marketing and Communications IT Corporate Functions | HSBC 
Operations, Services and Technology (HOST) ul. Kapelanka 42A, 30-347 Kraków, 
Poland __ 

Tie line: 7148 7689 4698
External: +48 123 42 0698
Mobile: +48 723 680 278
E-mail: hany.n...@hsbc.com
__
Protect our environment - please only print this if you have to!


-Original Message-
From: Semyon Semyonov [mailto:semyon.semyo...@mail.com]
Sent: Monday, November 19, 2018 12:06 PM
To: user@nutch.apache.org
Subject: Re: unexpected Nutch crawl interruption

Hi Hany,  
 
If you open the script code you will reach this line:

# main loop : rounds of generate - fetch - parse - update
for ((a=1; ; a++))

with a number of break conditions.

For each iteration it calls a number of independent MapReduce jobs.
If it breaks, it stops.
You should either finish the interrupted round with manual nutch commands, or 
start a new run of the crawl script using the crawldb from the last iteration.
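
For example, assuming the interrupted run used ./urls as the seed directory and 
./crawl as the crawl directory, the two options look roughly like this:

  # Option 1: start a fresh run of the crawl script against the same crawldb
  bin/crawl -i -s urls crawl 2

  # Option 2: finish the interrupted round by hand, running whichever steps
  # are still missing for the last generated segment
  segment=$(ls -d crawl/segments/* | tail -1)
  bin/nutch fetch "$segment"
  bin/nutch parse "$segment"
  bin/nutch updatedb crawl/crawldb "$segment"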
Semyon.
 
 

Sent: Monday, November 19, 2018 at 11:41 AM
From: hany.n...@hsbc.com
To: "user@nutch.apache.org" 
Subject: unexpected Nutch crawl interruption

Hello,

What will happen if the bin/crawl command is forcibly stopped for any reason, 
e.g. a server restart?

Kind regards,
Hany Shehata
Solutions Architect, Marketing and Communications IT Corporate Functions | HSBC 
Operations, Services and Technology (HOST) ul. Kapelanka 42A, 30-347 Kraków, 
Poland __

Tie line: 7148 7689 4698
External: +48 123 42 0698
Mobile: +48 723 680 278
E-mail: hany.n...@hsbc.com
__
Protect our environment - please only print this if you have to!




RE: unexpected Nutch crawl interruption

2018-11-19 Thread hany . nasr
Hello Semyon,

Does that mean that if I re-run the crawl command it will continue from where 
it stopped in the previous run?

Kind regards, 
Hany Shehata
Solutions Architect, Marketing and Communications IT 
Corporate Functions | HSBC Operations, Services and Technology (HOST)
ul. Kapelanka 42A, 30-347 Kraków, Poland
__ 

Tie line: 7148 7689 4698 
External: +48 123 42 0698 
Mobile: +48 723 680 278 
E-mail: hany.n...@hsbc.com 
__ 
Protect our environment - please only print this if you have to!


-Original Message-
From: Semyon Semyonov [mailto:semyon.semyo...@mail.com] 
Sent: Monday, November 19, 2018 12:06 PM
To: user@nutch.apache.org
Subject: Re: unexpected Nutch crawl interruption

Hi Hany,  
 
If you open the script code you will reach this line:

# main loop : rounds of generate - fetch - parse - update
for ((a=1; ; a++))

with a number of break conditions.

For each iteration it calls a number of independent MapReduce jobs.
If it breaks, it stops.
You should either finish the interrupted round with manual nutch commands, or 
start a new run of the crawl script using the crawldb from the last iteration.
Semyon.
 
 

Sent: Monday, November 19, 2018 at 11:41 AM
From: hany.n...@hsbc.com
To: "user@nutch.apache.org" 
Subject: unexpected Nutch crawl interruption

Hello,

What will happen if the bin/crawl command is forcibly stopped for any reason, 
e.g. a server restart?

Kind regards,
Hany Shehata
Solutions Architect, Marketing and Communications IT Corporate Functions | HSBC 
Operations, Services and Technology (HOST) ul. Kapelanka 42A, 30-347 Kraków, 
Poland __

Tie line: 7148 7689 4698
External: +48 123 42 0698
Mobile: +48 723 680 278
E-mail: hany.n...@hsbc.com
__
Protect our environment - please only print this if you have to!





unexpected Nutch crawl interruption

2018-11-19 Thread hany . nasr
Hello,

What will happen if the bin/crawl command is forcibly stopped for any reason, 
e.g. a server restart?

Kind regards,
Hany Shehata
Solutions Architect, Marketing and Communications IT
Corporate Functions | HSBC Operations, Services and Technology (HOST)
ul. Kapelanka 42A, 30-347 Kraków, Poland
__

Tie line: 7148 7689 4698
External: +48 123 42 0698
Mobile: +48 723 680 278
E-mail: hany.n...@hsbc.com
__
Protect our environment - please only print this if you have to!





RE: Block certain parts of HTML code from being indexed

2018-11-16 Thread hany . nasr
Has anyone faced this requirement before?

Kind regards, 
Hany Shehata
Solutions Architect, Marketing and Communications IT 
Corporate Functions | HSBC Operations, Services and Technology (HOST)
ul. Kapelanka 42A, 30-347 Kraków, Poland
__ 

Tie line: 7148 7689 4698 
External: +48 123 42 0698 
Mobile: +48 723 680 278 
E-mail: hany.n...@hsbc.com 
__ 
Protect our environment - please only print this if you have to!


-Original Message-
From: Hany NASR 
Sent: Thursday, November 15, 2018 4:18 PM
To: user@nutch.apache.org
Subject: RE: Block certain parts of HTML code from being indexed

Hello Markus,

What if I want to remove a specific component or page section?

Kind regards,
Hany Shehata
Solutions Architect, Marketing and Communications IT Corporate Functions | HSBC 
Operations, Services and Technology (HOST) ul. Kapelanka 42A, 30-347 Kraków, 
Poland __ 

Tie line: 7148 7689 4698
External: +48 123 42 0698
Mobile: +48 723 680 278
E-mail: hany.n...@hsbc.com
__
Protect our environment - please only print this if you have to!

-Original Message-
From: Markus Jelsma [mailto:markus.jel...@openindex.io]
Sent: Wednesday, November 14, 2018 4:11 PM
To: user@nutch.apache.org
Subject: RE: Block certain parts of HTML code from being indexed

Hello Hany,

Using parse-tika as your HTML parser, you can enable Boilerpipe (see 
nutch-default).
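
For example, something along these lines in nutch-site.xml (property names as 
documented in nutch-default.xml; the ArticleExtractor choice is just an 
example), with parse-tika handling HTML in plugin.includes:

  <property>
    <name>tika.extractor</name>
    <value>boilerpipe</value>
  </property>
  <property>
    <name>tika.extractor.boilerpipe.algorithm</name>
    <value>ArticleExtractor</value>
  </property>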

Regards,
Markus

 
 
-Original message-
> From:hany.n...@hsbc.com 
> Sent: Wednesday 14th November 2018 15:53
> To: user@nutch.apache.org
> Subject: Block certain parts of HTML code from being indexed
> 
> Hello All,
> 
> I am using Nutch 1.15, and wondering if there is a feature for blocking 
> certain parts of HTML code from being indexed (header & footer).
> 
> Kind regards,
> Hany Shehata
> Solutions Architect, Marketing and Communications IT Corporate 
> Functions | HSBC Operations, Services and Technology (HOST) ul.
> Kapelanka 42A, 30-347 Kraków, Poland
> __
> 
> Tie line: 7148 7689 4698
> External: +48 123 42 0698
> Mobile: +48 723 680 278
> E-mail: hany.n...@hsbc.com<mailto:hany.n...@hsbc.com>
> __
> Protect our environment - please only print this if you have to!
> 
> 
> 


RE: Block certain parts of HTML code from being indexed

2018-11-15 Thread hany . nasr
Hello Markus,

What if I want to remove a specific component or page section?

Kind regards, 
Hany Shehata
Solutions Architect, Marketing and Communications IT 
Corporate Functions | HSBC Operations, Services and Technology (HOST)
ul. Kapelanka 42A, 30-347 Kraków, Poland
__ 

Tie line: 7148 7689 4698 
External: +48 123 42 0698 
Mobile: +48 723 680 278 
E-mail: hany.n...@hsbc.com 
__ 
Protect our environment - please only print this if you have to!

-Original Message-
From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
Sent: Wednesday, November 14, 2018 4:11 PM
To: user@nutch.apache.org
Subject: RE: Block certain parts of HTML code from being indexed

Hello Hany,

Using parse-tika as your HTML parser, you can enable Boilerpipe (see 
nutch-default).

Regards,
Markus

 
 
-Original message-
> From:hany.n...@hsbc.com 
> Sent: Wednesday 14th November 2018 15:53
> To: user@nutch.apache.org
> Subject: Block certain parts of HTML code from being indexed
> 
> Hello All,
> 
> I am using Nutch 1.15, and wondering if there is a feature for blocking 
> certain parts of HTML code from being indexed (header & footer).
> 
> Kind regards,
> Hany Shehata
> Solutions Architect, Marketing and Communications IT Corporate 
> Functions | HSBC Operations, Services and Technology (HOST) ul. 
> Kapelanka 42A, 30-347 Kraków, Poland 
> __
> 
> Tie line: 7148 7689 4698
> External: +48 123 42 0698
> Mobile: +48 723 680 278
> E-mail: hany.n...@hsbc.com
> __
> Protect our environment - please only print this if you have to!
> 
> 
> 


Block certain parts of HTML code from being indexed

2018-11-14 Thread hany . nasr
Hello All,

I am using Nutch 1.15 and am wondering if there is a feature for blocking 
certain parts of the HTML code (header & footer) from being indexed.

Kind regards,
Hany Shehata
Solutions Architect, Marketing and Communications IT
Corporate Functions | HSBC Operations, Services and Technology (HOST)
ul. Kapelanka 42A, 30-347 Kraków, Poland
__

Tie line: 7148 7689 4698
External: +48 123 42 0698
Mobile: +48 723 680 278
E-mail: hany.n...@hsbc.com
__
Protect our environment - please only print this if you have to!





Apache Nutch commercial support

2018-10-12 Thread hany . nasr
Hello,

You know big companies; they are always looking for something commercial :(

Do you know if there is any commercial support for Apache Nutch and Solr, or 
any external providers who offer it?

Kind regards,
Hany Shehata
Solutions Architect, Marketing and Communications IT
Corporate Functions | HSBC Operations, Services and Technology (HOST)
ul. Kapelanka 42A, 30-347 Kraków, Poland
__

Tie line: 7148 7689 4698
External: +48 123 42 0698
Mobile: +48 723 680 278
E-mail: hany.n...@hsbc.com
__
Protect our environment - please only print this if you have to!





RE: Nutch 1.15: Solr indexing issue

2018-10-11 Thread hany . nasr
Thank you so much.

They changed it dramatically.
It no longer accepts solr.server.url, nor the old Solr mapping XML file.
Everything is now under conf/index-writers.xml.
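
For example, the Solr URL now goes into the writer's parameters there; a 
trimmed sketch (parameter names as in the file shipped with 1.15, values are 
just examples):

  <writer id="indexer_solr_1" class="org.apache.nutch.indexwriter.solr.SolrIndexWriter">
    <parameters>
      <param name="type" value="http"/>
      <param name="url" value="http://localhost:8983/solr/website"/>
      <param name="commitSize" value="1000"/>
      <!-- remaining parameters as in the default file -->
    </parameters>
    <mapping>
      <!-- copy / rename / remove rules as in the default file -->
    </mapping>
  </writer>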

Kind regards, 
Hany Shehata
Solutions Architect, Marketing and Communications IT 
Corporate Functions | HSBC Operations, Services and Technology (HOST)
ul. Kapelanka 42A, 30-347 Kraków, Poland
__ 

Tie line: 7148 7689 4698 
External: +48 123 42 0698 
Mobile: +48 723 680 278 
E-mail: hany.n...@hsbc.com 
__ 
Protect our environment - please only print this if you have to!


-Original Message-
From: Yossi Tamari [mailto:yossi.tam...@pipl.com] 
Sent: Thursday, October 11, 2018 9:33 AM
To: user@nutch.apache.org
Subject: RE: Nutch 1.15: Solr indexing issue

I'm using 1.15, but not with Solr. However, the configuration of IndexWriters 
changed in 1.15; you may want to read 
https://wiki.apache.org/nutch/IndexWriters#Solr_indexer_properties.

Yossi.

> -Original Message-
> From: hany.n...@hsbc.com 
> Sent: 11 October 2018 10:20
> To: user@nutch.apache.org
> Subject: Nutch 1.15: Solr indexing issue
> 
> Hi All,
> 
> Is anyone using Nutch 1.15?
> 
> I am trying to index my crawled URLs into Solr, but they are only indexed 
> into http://localhost:8983/solr/nutch. Is that hard-coded somewhere in the code?
> 
> When I created a core named nutch, my URLs were indexed into it and my 
> solr.server.url property was ignored.
> 
> My crawl command is:
> 
> sudo bin/crawl -i -D 
> solr.server.url=http://localhost:8983/solr/website -s urls 
> /home/hany.nasr/apache-nutch-1.15/crawl 1
> 
> Kind regards,
> Hany Shehata
> Solutions Architect, Marketing and Communications IT Corporate 
> Functions | HSBC Operations, Services and Technology (HOST) ul. 
> Kapelanka 42A, 30-347 Kraków, Poland 
> _
> _
> 
> Tie line: 7148 7689 4698
> External: +48 123 42 0698
> Mobile: +48 723 680 278
> E-mail: hany.n...@hsbc.com
> _
> _
> Protect our environment - please only print this if you have to!
> 
> 
> 


Nutch 1.15: Solr indexing issue

2018-10-11 Thread hany . nasr
Hi All,

Is anyone using Nutch 1.15?

I am trying to index my crawled URLs into Solr, but they are only indexed into 
http://localhost:8983/solr/nutch. Is that hard-coded somewhere in the code?

When I created a core named nutch, my URLs were indexed into it and my 
solr.server.url property was ignored.

My crawl command is:

sudo bin/crawl -i -D solr.server.url=http://localhost:8983/solr/website -s urls 
/home/hany.nasr/apache-nutch-1.15/crawl 1

Kind regards,
Hany Shehata
Solutions Architect, Marketing and Communications IT
Corporate Functions | HSBC Operations, Services and Technology (HOST)
ul. Kapelanka 42A, 30-347 Kraków, Poland
__

Tie line: 7148 7689 4698
External: +48 123 42 0698
Mobile: +48 723 680 278
E-mail: hany.n...@hsbc.com
__
Protect our environment - please only print this if you have to!


