Crawldb

2016-06-13 Thread BlackIce
I would like to "groom" the crawldb. My guess is that it should be easy to build upon the function that removes 404 statuses and duplicates. But where do I find these? Thank you

Problem integrating nutch 1.11 and solr 5.5.1 or 6.0.1

2016-06-13 Thread Jose-Marcio Martins da Cruz
Hello, Sorry if this is a dumb question; I can't find good information. I've installed Nutch 1.11 and Solr 6.0.1 on a computer. As I'm new to Nutch/Solr, I've followed the old tutorial at https://wiki.apache.org/nutch/NutchTutorial Both Nutch and Solr work fine on their own. Trying to integrate

Re: Problem integrating nutch 1.11 and solr 5.5.1 or 6.0.1

2016-06-13 Thread BlackIce
Solr 6 uses a different directory structure now. Follow the Solr tutorials on how to create a core; they will tell you where the core's directory is created. Inside that directory there should be a directory called /conf; that's where the schema goes. It's also a good idea to read as much as possible on solr, nu

Re: Problem integrating nutch 1.11 and solr 5.5.1 or 6.0.1

2016-06-13 Thread Jose-Marcio Martins da Cruz
Hi, Thanks Blackice. Can you suggest one or two good books on Solr and/or Nutch which aren't outdated? Thanks On 06/13/2016 03:16 PM, BlackIce wrote: Solr 6 uses a different directory structure now, follow solr tutorials on how to create a core, it will tell you where it creates

Re: Problem integrating nutch 1.11 and solr 5.5.1 or 6.0.1

2016-06-13 Thread BlackIce
The only book I've read on the topic was written by some guys from this group; it was 3 years ago and I don't remember the name. Most of what I have learned came from trial and error, this group, and groups related to other technologies used, like Solr, HBase, ZooKeeper, Hadoop, etc. Google is your friend rule of

Re: Problem integrating nutch 1.11 and solr 5.5.1 or 6.0.1

2016-06-13 Thread BlackIce
Also, I've learned a lot from the writings of Erik Hatcher from Lucidworks. Probably the number 1 authority in the world on everything related to Solr. On 13/6/2016 15:53, "Jose-Marcio Martins da Cruz" < jose-marcio.mart...@mines-paristech.fr> wrote: > > Hi, > > Thanks Blackice. > > Can you

Re: Webpage in HBase alternative name

2016-06-13 Thread Joseph Obernberger
I see that the gora-hbase-mapping.xml has the table name in it, and the nutch-site/nutch-default xml files have storage.schema.webpage. I've tried changing both, but nutch still uses 'webpage' as the table within HBase. In StorageUtils.java I see: String s

improving distributed indexing performance

2016-06-13 Thread Joseph Naegele
Hi folks, I'm in the process of indexing a large number of docs using Nutch 1.11 and the indexer-elastic plugin. I've observed slow indexing performance and narrowed it down to the map phase and first part of the reduce phase taking 80% of the total runtime per segment. Here are some statistics:

Re: improving distributed indexing performance

2016-06-13 Thread Sebastian Nagel
Hi Joseph, you're right, the mapper does not do much; all potentially heavy computations in the index or scoring filters are run in the reduce step. > https://gist.github.com/naegelejd/249120387a3d6e4e96bef2ac2edcb284 There are 5 billion records passed through the map step: Map input records=5115

RE: improving distributed indexing performance

2016-06-13 Thread Joseph Naegele
Sebastian, Thanks! That explains a lot. We're computing LinkRank and I don't specify a LinkDB to the indexer. Our CrawlDB is very large however, so yes I'm very interested in NUTCH-2184. I'm planning to finish helping with https://github.com/apache/nutch/pull/95. 1. A related question: Is it p

Nutch 2.3.1 with MongoDB not generating any URLs

2016-06-13 Thread Jean Vence
I have installed Nutch 2.3.1 with MongoDB and successfully crawled thousands of pages with it. But suddenly, the Nutch 2.3.1 Generator is not generating any URLs. Seed list URLs are accepted (InjectorJob: total number of urls injected after normalization and filtering: 3) and ./bin/nutch parsechecker

Newbie Question, hadoop error?

2016-06-13 Thread Jamal, Sarfaraz
Hi Guys, I am attempting to run nutch using cygwin, and I am having the following problem: Ps. I added Hadoop-core to the lib folder already - I appreciate any insight or comment you guys may have - $ bin/crawl -i urls/ TestCrawl/ 2 Injecting seed URLs /cygdrive/c/apache-nutch-1.11/bin/nutch i

Re: improving distributed indexing performance

2016-06-13 Thread Sebastian Nagel
> 1. A related question: Is it possible to index without a crawldb but still > use LinkRank scores? ElasticSearch and Solr nowadays support field-level updates. Given that the link rank calculation is computationally expensive and is not run anew after a segment was fetched, it may be better to
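For readers unfamiliar with the field-level updates mentioned above, here is a minimal sketch of what a Solr atomic update body could look like when only a score field is rewritten. The field name `linkrank_score`, the use of the URL as the `id`, and the helper names are assumptions made for illustration; they are not taken from the thread or from Nutch. Solr atomic updates also generally expect the update log to be enabled and the other fields to be stored or to have docValues.

```python
import json


def build_atomic_update(doc_id, field, value):
    """Build one Solr atomic-update document that sets a single field.

    The {"set": value} modifier tells Solr to replace only this field
    and leave the rest of the stored document untouched.
    """
    return {"id": doc_id, field: {"set": value}}


def build_update_body(scores, field="linkrank_score"):
    """Serialize a {doc_id: score} mapping into a JSON update body.

    `field` is a hypothetical field name; use whatever the schema defines.
    Entries are sorted by id only to make the output deterministic.
    """
    docs = [build_atomic_update(doc_id, field, score)
            for doc_id, score in sorted(scores.items())]
    return json.dumps(docs)


if __name__ == "__main__":
    # Hypothetical scores keyed by document id (here: the page URL).
    body = build_update_body({"http://example.org/": 0.75})
    print(body)
```

The resulting JSON would be sent as the body of a POST to the core's `/update` handler with a JSON content type; the sketch stops short of sending it so that it runs without a Solr instance.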