Nutch 1.12 with custom metadata

2017-03-28 Thread Chaushu, Shani
Hi, I'm trying to run a crawl with Nutch 1.12, and the seed file contains URLs in this form (like the example in the code comments): http://www.nutch.org/ \t key=value. When I try to crawl, the log shows an error for the invalid URL http://www.nutch.org/%20\t%20key=value - the tab and key=value custom metat
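The `%20` in the logged URL suggests literal spaces made it into the seed line around the tab. A minimal sketch of writing a seed entry with a real tab and no surrounding spaces (the URL and metadata key are illustrative assumptions):

```shell
# Write one seed line as: URL <TAB> key=value, with printf emitting a literal
# tab character and no stray spaces (assumption: spaces around the tab are
# what produces the %20 in the "invalid URL" error).
printf 'http://www.nutch.org/\tkey=value\n' > seed.txt

# cat -A renders the tab as ^I so you can verify the line visually
cat -A seed.txt
```

If an editor converted the tab to spaces, `cat -A` will show ordinary spaces instead of `^I`.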

optimize configuration

2016-05-26 Thread Chaushu, Shani
Hi, I'm running Nutch 1.9 on Hadoop and YARN, 3 nodes. Is there a guide with optimized configuration anywhere, so that Nutch runs in the most efficient way? These are my current nutch-site settings: http.redirect.max 5 The maximum number of redirects the fetcher will follow when trying to fetc
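There is no single "optimal" nutch-site.xml, but throughput tuning usually revolves around a few fetcher properties. A minimal sketch of such a file, written via heredoc; the values are illustrative assumptions, not recommendations:

```shell
# Sketch of a nutch-site.xml with properties commonly tuned for crawl
# throughput. Values below are assumptions to adapt per cluster.
cat > nutch-site-sketch.xml <<'EOF'
<configuration>
  <property>
    <name>http.redirect.max</name>
    <value>5</value> <!-- redirects the fetcher follows, as in the question -->
  </property>
  <property>
    <name>fetcher.threads.fetch</name>
    <value>50</value> <!-- fetch threads per task; more parallelism -->
  </property>
  <property>
    <name>fetcher.server.delay</name>
    <value>1.0</value> <!-- politeness delay per host, in seconds -->
  </property>
</configuration>
EOF

grep -c '<property>' nutch-site-sketch.xml
```

Raising thread counts trades politeness for speed, so the delay property usually needs adjusting in tandem.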

RE: [MASSMAIL]crawl with nutch 1.11

2016-05-04 Thread Chaushu, Shani
Great thanks! -Original Message- From: Jorge Luis Betancourt González [mailto:jlbetanco...@uci.cu] Sent: Tuesday, May 03, 2016 17:53 To: user@nutch.apache.org Subject: Re: [MASSMAIL]crawl with nutch 1.11 Actually, executing bin/crawl shows this: -i|--index Indexes crawl results int

crawl with nutch 1.11

2016-05-02 Thread Chaushu, Shani
Hi, I want to upgrade from Nutch 1.9 to Nutch 1.11. I saw that in the bin/crawl script there is no solrindex step. Do I need to run a Solr index command separately after the whole crawl is complete? Is there another way to run the whole process in one command? Thanks, Shani --
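As the reply in this thread notes, the Nutch 1.11 bin/crawl script has an `-i`/`--index` flag that folds indexing into the crawl loop. A sketch of the invocation, shown as a string so it runs anywhere; the seed directory, crawl directory, round count, and Solr URL are all assumptions:

```shell
# One-command crawl + index with Nutch 1.11 (paths and Solr URL are
# placeholder assumptions; -D passes the Solr endpoint to the indexer).
CRAWL_CMD='bin/crawl -i -D solr.server.url=http://localhost:8983/solr/nutch urls/ crawl/ 2'
echo "$CRAWL_CMD"
```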

don't crawl links in header

2016-03-22 Thread Chaushu, Shani
Hi, Sometimes in the header of a page there are tags that link to pages of source code that aren't interesting, for example http:///somexmlsettingsdata?type=xml. This link doesn't have an xml suffix so I can't filter it out, but I want Nutch to take only links from the body and not from t
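If there is no parser option to skip links found in `<head>`, one workaround is to reject such URLs by pattern in conf/regex-urlfilter.txt, matching on the query string instead of a suffix. The rule and URL below are illustrative assumptions, and the grep only simulates how a leading `-` rule rejects a matching URL:

```shell
# Hypothetical extra rule for conf/regex-urlfilter.txt: a line starting with
# '-' rejects any URL matching its regex.
cat > regex-urlfilter-extra.txt <<'EOF'
# reject settings/feed endpoints served as XML
-[?&]type=xml
EOF

# Simulate the filter decision on an example URL:
url='http://example.com/somexmlsettingsdata?type=xml'
if echo "$url" | grep -qE '[?&]type=xml'; then echo REJECT; else echo ACCEPT; fi
```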

nutch 1.10 vs. 1.9

2015-09-06 Thread Chaushu, Shani
Hi, I crawled with Nutch 1.9 on Hadoop for some months. I want to upgrade to Nutch 1.10. I tried to run the same commands (except for the solrurl, because I saw it was removed from the crawl file) and got this error: ERROR crawl.Injector: Injector: java.lang.IllegalArgumentException: Wrong FS: hd

crawl limited amount

2015-08-02 Thread Chaushu, Shani
Hi, I crawl ~80k URLs in the seed, and notice that after depth 2 the size of each segment is limited to ~200k. It's not possible that all depths (segments) are the same size. I changed the sizeFetchlist to be: `expr $numSlaves \* 1000` But I saw there is a comment: "250K per task?" Wha
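The bin/crawl script caps each generate/fetch round at `sizeFetchlist` URLs, computed from the number of slaves, which is why every segment tops out at the same size. A sketch of the arithmetic being changed here, with `numSlaves=3` assumed:

```shell
# sizeFetchlist = numSlaves * per-task constant; bin/crawl passes this as
# -topN to the generator, capping each segment at that many URLs.
numSlaves=3
sizeFetchlist=$(expr $numSlaves \* 1000)
echo "$sizeFetchlist"
```

Raising the per-task constant (the stock script uses a much larger multiplier, per the "per task?" comment) lifts the per-round cap.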

Nutch doesn't crawl all seed

2015-07-18 Thread Chaushu, Shani
Hi, I use Nutch 1.9. I have ~80k links that I want to crawl. When I crawl them all together it crawls only ~30k; when I crawl parts of the seed separately rather than all at once, it crawls a lot more. db.max.outlinks.per.page is set to -1. Is there any parameter that maybe restricts the number of pages the n
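Besides db.max.outlinks.per.page, generation-time limits can also cap how much of a large seed actually gets fetched per round. A sketch of two properties worth checking in nutch-site.xml; the values shown are illustrative assumptions:

```shell
# Properties that can silently cap fetched pages (values are assumptions):
cat > generate-limits-sketch.xml <<'EOF'
<configuration>
  <property>
    <name>generate.max.count</name>
    <value>-1</value> <!-- max URLs per host/domain per segment; -1 = no limit -->
  </property>
  <property>
    <name>db.ignore.external.links</name>
    <value>false</value> <!-- true drops off-site outlinks entirely -->
  </property>
</configuration>
EOF

grep -c '<name>' generate-limits-sketch.xml
```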

Parent URL

2015-07-02 Thread Chaushu, Shani
Hi, I'm using Nutch 1.9 with Solr 4.10. Is there any way to see in Solr, for each page, the parent/root page it came from? Thanks, Shani - Intel Electronics Ltd. This e-mail and any attachments may contain confidential material

RE: Nutch 2.X vs. 1.X

2015-05-31 Thread Chaushu, Shani
Thanks ! -Original Message- From: Lewis John Mcgibbney [mailto:lewis.mcgibb...@gmail.com] Sent: Sunday, May 31, 2015 21:56 To: user@nutch.apache.org Subject: Re: Nutch 2.X vs. 1.X Hi Chaushu, On Sun, May 31, 2015 at 12:30 AM, wrote: > > I'm using Nutch 1.9 with Solr 4.10 > I wanted to

Nutch 2.X vs. 1.X

2015-05-31 Thread Chaushu, Shani
Hi, I'm using Nutch 1.9 with Solr 4.10. I wanted to ask what the advantages of Nutch 2 vs. Nutch 1 are, and whether, if I use Solr, there is a reason to use Nutch 2. (I understand that the difference is that Nutch 2 uses NoSQL - but if I use Solr, I can access the data from there..) Thanks a lot, S

crawling page main domain

2015-05-09 Thread Chaushu, Shani
Hi, I crawl lists of domains and index them into Solr. I wanted to know if there is any way to see the source domain of every URL in Solr. For example, if a page at depth 1 is www.x.com, it will have lots of URLs related to it, www.x.com/aaa, www.x.com/b
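When every URL stays on its seed's site (e.g. with external links ignored), the "source domain" is just the host part of the URL, which Nutch's basic indexing filter typically exposes as a `host` field in Solr. A minimal sed sketch for deriving it outside of Nutch; the URL is an example:

```shell
# Extract the host component from a URL (a stand-in for the per-document
# "source domain"; the URL below is an example, not from a real crawl).
url='http://www.x.com/aaa'
host=$(echo "$url" | sed -E 's#^[a-z]+://([^/]+).*#\1#')
echo "$host"
```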

crawl into the same folder twice

2015-04-29 Thread Chaushu, Shani
Hi, I wanted to ask what happens if I crawl different seeds into the same folder twice. I saw that the second time, even if I give a different seed, it crawls the previous seed as well. If I want to crawl the same pages a few times, should I do it into the same folder so duplications get merged? Or it

nutch-selenium on nutch 1.9

2015-03-17 Thread Chaushu, Shani
Hi, I'm trying to run the nutch-selenium plugin on Nutch 1.9, for parsing pages that require JavaScript to be enabled. The ant runtime fails with several errors such as: error: cannot find symbol; error: package WebPage does not exist; error: method does not override or implement a method from a supert

RE: Nutch doesn't crawl relative pages

2015-02-04 Thread Chaushu, Shani
ww.example.com/$ #+. On Wed, Feb 4, 2015 at 10:10 PM, Chaushu, Shani wrote: > Hi, > I'm using Nutch 1.9. > I'm trying to crawl a page, and it doesn't crawl more than the first page. > The page links are relative links such as href="contact.html" > Maybe there

Nutch doesn't crawl relative pages

2015-02-04 Thread Chaushu, Shani
Hi, I'm using Nutch 1.9. I'm trying to crawl a page, and it doesn't crawl more than the first page. The page links are relative links such as href="contact.html". Maybe there is a parameter that I'm missing that prevents crawling relative pages? I changed the parameter db.ignore.internal.lin

RE: Nutch running time

2015-01-03 Thread Chaushu, Shani
Shani, What is your Nutch version and which Hadoop version are you using? I was able to get this running using Nutch 1.7 on Hadoop YARN, for which I needed to make minor tweaks in the code. On Fri, Jan 2, 2015 at 12:37 PM, Chaushu, Shani wrote: > I'm running nutch distributed, on 3 nodes..

RE: URL filer

2015-01-03 Thread Chaushu, Shani
yes -Original Message- From: Shadi Saleh [mailto:propat...@gmail.com] Sent: Saturday, January 03, 2015 00:23 To: user Subject: URL filer Hello, I edited the file regex-urlfilter.txt while nutch was crawling, should I execute "ant runtime" again and call the crawler everytime I edit t

RE: Nutch running time

2015-01-02 Thread Chaushu, Shani
as a Map Reduce job/application on Hadoop; there is a lot of info on the Wiki to make it run in distributed mode, but if you can live with the pseudo-distributed/local mode for the 20K pages that you need to fetch, it would save you a lot of work. On Thu, Jan 1, 2015 at 8:32 AM, Chaushu,

RE: Nutch running time

2015-01-01 Thread Chaushu, Shani
seems kind of slow for 20k links; how many map and reduce tasks have you configured for each one of the phases in a Nutch crawl? On Jan 1, 2015 6:00 AM, "Chaushu, Shani" wrote: > > > Hi all, > I wanted to know how long nutch should run. > I change the configurations,

Nutch running time

2015-01-01 Thread Chaushu, Shani
Hi all, I wanted to know how long Nutch should run. I changed the configurations and ran distributed - one master node and 3 slaves - and it ran on 20k links for about a day (depth 15). Is that normal? Or should it take less? These are my configurations: db.ignore.external.link

RE: crawls but fails at solr indexing

2014-12-29 Thread Chaushu, Shani
.SolrIndexerJob.indexSolr(SolrIndexerJob.java:61) at org.apache.nutch.indexer.solr.SolrIndexerJob.run(SolrIndexerJob.java:76) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.nutch.indexer.solr.SolrIndexerJob.main(SolrIndexerJob.java:85) On 29 Decemb

RE: crawls but fails at solr indexing

2014-12-29 Thread Chaushu, Shani
present I think the only change is I changed solr's schema.xml "uniquekey" tag to 'url' instead of 'id'. Do you mean the schema.xml in nutch or solr? Can you tell me definitively which schema.xml to change and what changes to make? On 29 December 2014 at 1

RE: crawls but fails at solr indexing

2014-12-29 Thread Chaushu, Shani
Did you change the schema.xml? -Original Message- From: threegara...@gmail.com [mailto:threegara...@gmail.com] On Behalf Of Kevin Porter Sent: Monday, December 29, 2014 14:10 To: user@nutch.apache.org Subject: crawls but fails at solr indexing Hi, I'm new to nutch/solr (although I under

RE: Nutch stopped after 5 segments

2014-12-28 Thread Chaushu, Shani
But it works perfectly when I run it on a small number of links. I run the nutch crawl command, which should handle the whole process, so I thought it may be related to a configuration I can't find. When I ran on 2K links it stopped after 7 iterations, on 10K it stopped after 5 iterations, and on 1 pag

Nutch stopped after 5 segments

2014-12-28 Thread Chaushu, Shani
Hi all, I ran Nutch on distributed Hadoop (1 master node and 3 workers). It ran over ~10k links for 8 hours with numberOfRounds=30, and it stopped after the 5th segment. I know there is more than depth 5, because when I ran on just one of the URLs it created many more segments. The last segment folder con

remove backslash \n

2014-12-23 Thread Chaushu, Shani
Hi, Is it possible for Nutch to keep the newline characters (\n) from the site text? For now the parsing removes all the tags and newlines; I couldn't find any information about it. Thanks.

Nutch 1.9 on CDH

2014-12-18 Thread Chaushu, Shani
Hi, I want to install and run Nutch 1.9 on CDH5. Is there any guide about installing Nutch on Cloudera? I only saw guides about installing Nutch on Hadoop, but CDH is an existing Hadoop cluster and I don't know how to change the Nutch configuration to run on CDH. Thanks, Shani -