I would like to "groom" the crawldb. My guess is that it should be
easy to build upon the function that removes 404 statuses and
duplicates. But where do I find these?
Thank you
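For reference, a minimal sketch of CrawlDb grooming with stock Nutch 1.x commands. The path `crawl/crawldb` is an assumption; substitute your own. Note that `dedup` marks duplicates in the CrawlDb rather than deleting them outright:

```shell
# Inspect status counts; db_gone corresponds to 404/permanently-gone pages
bin/nutch readdb crawl/crawldb -stats

# Mark duplicate entries (same content signature) as db_duplicate (Nutch 1.8+)
bin/nutch dedup crawl/crawldb
```

After deduplication, `bin/nutch clean` can be used to purge the corresponding documents from the index.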
Hello,
Sorry if this is a dumb question. I can't find good information on this.
I've installed Nutch 1.11 and Solr 6.0.1 on a computer.
As I'm new to Nutch/Solr, I've followed the old tutorial at
https://wiki.apache.org/nutch/NutchTutorial
Both Nutch and Solr work fine on their own. Trying to integrate
Solr 6 uses a different directory structure now. Follow the Solr tutorials on
how to create a core; they will tell you where it creates the cores
directory. Inside that directory should be a directory called /conf; that's
where the schema goes. It's also a good idea to read as much as possible on
Solr, nu
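The core-creation steps described above can be sketched with the `bin/solr` tool that ships with Solr 6 (the core name `nutch` is an assumption):

```shell
# Create a core; Solr prints the directory it was created in
bin/solr create -c nutch

# In standalone mode the schema lives in the core's conf directory, e.g.:
ls server/solr/nutch/conf
```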
Hi,
Thanks Blackice.
Can you suggest one or two good books on Solr and/or Nutch which aren't
outdated?
Thanks
On 06/13/2016 03:16 PM, BlackIce wrote:
Solr 6 uses a different directory structure now. Follow the Solr tutorials on
how to create a core; they will tell you where it creates
The only book I've read on the topic was written by some guys from this
group; it was 3 years ago and I don't remember the name. Most of what I have
learned was from trial and error, this group, and groups related to other
technologies used, like Solr, HBase, ZooKeeper, Hadoop, etc. Google is your friend
rule of
Also, I've learned a lot from the writings of Erik Hatcher from Lucidworks.
Probably the number 1 authority in the world on everything related to
Solr.
On 13/6/2016 15:53, "Jose-Marcio Martins da Cruz" <
jose-marcio.mart...@mines-paristech.fr> wrote:
>
> Hi,
>
> Thanks Blackice.
>
> Can you
I see that the gora-hbase-mapping.xml has the table name in it, and the
nutch-site/nutch-default xml files have storage.schema.webpage. I've tried
changing both, but nutch still uses 'webpage' as the table within HBase.
In StorageUtils.java I see:
String s
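A hedged sketch of the two settings being changed (values illustrative, not a tested config). One possible explanation for the override being ignored: in Nutch 2.x, a crawl id, if set via `storage.crawl.id` or `-crawlId`, is prepended to the schema name (giving e.g. `myid_webpage`), and the table name in `gora-hbase-mapping.xml` must match the resulting name:

```xml
<!-- nutch-site.xml: illustrative override of the default table name -->
<property>
  <name>storage.schema.webpage</name>
  <value>my_webpage</value>
</property>
```

with a matching mapping entry, e.g. `<table name="my_webpage">`, in `gora-hbase-mapping.xml`.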
Hi folks,
I'm in the process of indexing a large number of docs using Nutch 1.11 and
the indexer-elastic plugin. I've observed slow indexing performance and
narrowed it down to the map phase and first part of the reduce phase taking
80% of the total runtime per segment. Here are some statistics:
Hi Joseph,
you're right, the mapper does not do much; all potentially
heavy computations in the index or scoring filters are run
in the reduce step.
> https://gist.github.com/naegelejd/249120387a3d6e4e96bef2ac2edcb284
There are 5 billion records passed through the map step:
Map input records=5115
Sebastian,
Thanks! That explains a lot. We're computing LinkRank and I don't specify a
LinkDB to the indexer. Our CrawlDB is very large however, so yes I'm very
interested in NUTCH-2184. I'm planning to finish helping with
https://github.com/apache/nutch/pull/95.
1. A related question: Is it p
I have installed and successfully web-crawled thousands of pages using
Nutch 2.3.1 with MongoDB.
But suddenly, the Nutch 2.3.1 Generator is not generating any URLs. Seed
list URLs are accepted (InjectorJob: total number of urls injected
after normalization and filtering: 3) and
./bin/nutch parsechecker
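When the Generator produces nothing even though injection succeeds, the usual suspects are URL filters/normalizers and fetch scheduling (already-fetched pages not yet due again). A hedged sketch of diagnostic checks with Nutch 2.x commands; the crawl id `myCrawl` and URL are placeholders:

```shell
# Verify a seed URL passes the fetch/parse chain
bin/nutch parsechecker http://example.com/

# Try generating with filters and normalizers bypassed to isolate the cause
bin/nutch generate -topN 1000 -crawlId myCrawl -noFilter -noNorm
```

If the `-noFilter` run generates URLs, the problem is in `regex-urlfilter.txt` or the normalizers rather than in the Generator itself.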
Hi Guys,
I am attempting to run Nutch using Cygwin, and I am having the following
problem:
PS: I added hadoop-core to the lib folder already -
I appreciate any insight or comment you guys may have -
$ bin/crawl -i urls/ TestCrawl/ 2
Injecting seed URLs
/cygdrive/c/apache-nutch-1.11/bin/nutch i
> 1. A related question: Is it possible to index without a crawldb but still
> use LinkRank scores?
ElasticSearch and Solr nowadays support field-level updates. Given that the
link rank
calculation is computationally expensive and is not run anew after a segment
was fetched,
it may be better to
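The field-level update being discussed might look like this against Elasticsearch's update API (index name `nutch`, type `doc`, field `linkrank`, and the document id are all made up; the endpoint shape is for the ES versions current at the time):

```shell
# Partially update only the link-rank field of an already-indexed document,
# leaving the rest of the document untouched
curl -XPOST 'http://localhost:9200/nutch/doc/some-doc-id/_update' \
  -d '{"doc": {"linkrank": 0.42}}'
```

Solr's equivalent is an atomic update, sending `{"linkrank": {"set": 0.42}}` for the document in question.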