RE: nutch 1.x tutorial with solr 6.6.0

Yossi Tamari Tue, 11 Jul 2017 03:58:53 -0700

I struggled with this as well. Eventually I moved to ElasticSearch, which is 
much easier.

What I did manage to find out is that in newer versions of Solr you need to 
use ZooKeeper to update the conf file. See https://stackoverflow.com/a/43351358.
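
For reference, a minimal sketch of what pushing an edited config through
ZooKeeper might look like with `bin/solr zk upconfig`. This only applies when
Solr runs in SolrCloud mode. The install path `/opt/solr`, the config name
`nutch`, the configset directory, and the ZooKeeper address `localhost:9983`
are all assumptions, not values taken from this thread. By default the script
only prints the command it would run.

```shell
#!/bin/sh
# Sketch: upload an edited configset to ZooKeeper (SolrCloud mode only).
# Every value below is a placeholder (assumption); adjust to your install.
SOLR_DIR="${APACHE_SOLR_HOME:-/opt/solr}"
CONF_NAME="${CONF_NAME:-nutch}"
CONFIGSET_DIR="${CONFIGSET_DIR:-$SOLR_DIR/server/solr/configsets/nutch}"
ZK_HOST="${ZK_HOST:-localhost:9983}"

# Build the upconfig command line as a single string.
upconfig_cmd() {
  printf '%s zk upconfig -n %s -d %s -z %s\n' \
    "$SOLR_DIR/bin/solr" "$CONF_NAME" "$CONFIGSET_DIR" "$ZK_HOST"
}

if [ "${DRY_RUN:-1}" = "1" ]; then
  # Default: only show what would be executed.
  upconfig_cmd
else
  "$SOLR_DIR/bin/solr" zk upconfig -n "$CONF_NAME" -d "$CONFIGSET_DIR" -z "$ZK_HOST"
fi
```

Set DRY_RUN=0 to actually run the upload once the placeholders match your
setup.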

-----Original Message-----
From: Pau Paches [mailto:sp.exstream.t...@gmail.com] 
Sent: 11 July 2017 13:29
To: user@nutch.apache.org
Subject: Re: nutch 1.x tutorial with solr 6.6.0

Hi,
I am only crawling a single URL, so no whole-web crawling.
So I do option 2; fetching and invertlinks run successfully. This is just 
Nutch 1.x.
Then I do "Indexing into Apache Solr", so I go to the section "Setup Solr for 
search".
The first thing that does not work:
cd ${APACHE_SOLR_HOME}/example
java -jar start.jar
There is no start.jar at the specified location, but that is no problem: you 
can start Solr 6.6.0 with bin/solr start.
Then the tutorial says:
Backup the original Solr example schema.xml:
mv ${APACHE_SOLR_HOME}/example/solr/collection1/conf/schema.xml \
   ${APACHE_SOLR_HOME}/example/solr/collection1/conf/schema.xml.org

But in current Solr, 6.6.0, there is no schema.xml file anywhere in the whole 
distribution. What should I do here?
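
One way to check which schema file a given Solr distribution ships is to
search for both names: the default configsets in Solr 6.x use a file called
`managed-schema` rather than `schema.xml`. A minimal sketch, assuming
`APACHE_SOLR_HOME` points at the unpacked distribution (the helper name
`find_schema_files` is made up for illustration):

```shell
#!/bin/sh
# List schema files (either naming convention) under a Solr install.
# Prints nothing if neither file name exists below the given directory.
find_schema_files() {
  find "$1" -type f \( -name 'schema.xml' -o -name 'managed-schema' \) | sort
}

# Falls back to the current directory if APACHE_SOLR_HOME is not set.
find_schema_files "${APACHE_SOLR_HOME:-.}"
```
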
If I go directly to running the Solr index command from ${NUTCH_RUNTIME_HOME}:
bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb \
   -linkdb crawl/linkdb crawl/segments/
(which may not make sense, since I have skipped some steps), it crashes:
The input path at segments is not a segment... skipping
Indexer: java.lang.RuntimeException: Missing elastic.cluster and elastic.host. 
At least one of them should be set in nutch-site.xml ElasticIndexWriter
        elastic.cluster : elastic prefix cluster
        elastic.host : hostname
        elastic.port : port
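
The ElasticIndexWriter message suggests that the Elasticsearch indexer plugin
is the one being loaded, which is controlled by the `plugin.includes`
property. A small sketch for checking what a config file sets there; the
helper name `plugin_includes` is made up, and the lookup assumes the usual
`<name>` followed by `<value>` layout. If the property is not overridden in
nutch-site.xml, the value from nutch-default.xml applies instead.

```shell
#!/bin/sh
# Print the <value> of plugin.includes from a Nutch config file, if set there.
plugin_includes() {
  # Flatten the file so the match also works when <name> and <value>
  # sit on separate lines.
  tr -d '\n' < "$1" |
    sed -n 's:.*<name>plugin\.includes</name>[^<]*<value>\([^<]*\)</value>.*:\1:p'
}

conf="${NUTCH_RUNTIME_HOME:-.}/conf/nutch-site.xml"
if [ -f "$conf" ]; then
  plugin_includes "$conf"
fi
```

If the printed value names indexer-elastic rather than indexer-solr, that
would explain why the Solr URL on the command line is ignored.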

Clearly some configuration is missing in nutch-site.xml: apart from 
http.agent.name (which the tutorial mentions), other fields need to be set 
up. The segments message above is also troubling.
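
The "not a segment" message most likely appears because crawl/segments/ is
the parent directory rather than a single timestamped segment. A minimal
sketch that picks the newest segment and prints an index command in the same
form as the one quoted later in this message; the `SEGMENTS_DIR` variable
and the `latest_segment` helper are made up for illustration:

```shell
#!/bin/sh
# Print the newest segment directory under a segments parent, or nothing.
# Segment names are timestamps (e.g. 20131108063838), so lexical order
# is also chronological order.
latest_segment() {
  ls -d "$1"/2* 2>/dev/null | tail -1
}

seg=$(latest_segment "${SEGMENTS_DIR:-crawl/segments}")
if [ -n "$seg" ]; then
  # Only print the command; run it by hand once it looks right.
  echo "bin/nutch index http://localhost:8983/solr crawl/crawldb/ -linkdb crawl/linkdb/ $seg -filter -normalize -deleteGone"
else
  echo "no segment found" >&2
fi
```
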

If we follow the steps (assuming they worked), should we run

bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb \
   -linkdb crawl/linkdb crawl/segments/

(this is the last step in "Integrate Solr with Nutch") and then

bin/nutch index http://localhost:8983/solr crawl/crawldb/ -linkdb crawl/linkdb/ \
   crawl/segments/20131108063838/ -filter -normalize -deleteGone

(this is one of the steps of "Using Individual Commands for Whole-Web 
Crawling", which in fact is also the section to read if you are only crawling 
a single URL)?

This is what I found by following the tutorial at 
https://wiki.apache.org/nutch/NutchTutorial

On 7/9/17, lewis john mcgibbney <lewi...@apache.org> wrote:
> Hi Pau,
>
> On Sat, Jul 8, 2017 at 6:52 AM, <user-digest-h...@nutch.apache.org> wrote:
>
>> From: Pau Paches <sp.exstream.t...@gmail.com>
>> To: user@nutch.apache.org
>> Cc:
>> Bcc:
>> Date: Sat, 8 Jul 2017 15:52:46 +0200
>> Subject: nutch 1.x tutorial with solr 6.6.0
>>
>> Hi, I have run the Nutch 1.x Tutorial with Solr 6.6.0.
>> Many things do not work,
>
>
> What does not work? Can you elaborate?
>
>
>> there is a mismatch between the assumed Solr
>> version and the current Solr version.
>>
>
> We support Solr as an indexing backend in the broadest sense possible. We
> do not aim to support the latest and greatest Solr version available. If
> you are interested in upgrading to a particular version, if you could open
> a JIRA issue and provide a pull request it would be excellent.
>
>
>> I have seen some messages about the same problem for Solr 4.x
>> Is this the right path to go or should I move to Nutch 2.x?
>
>
> If you are new to Nutch, I would highly advise that you stick with 1.X
>
>
>> Does it
>> make sense to use Solr 6.6 with Nutch 1.x?
>
>
> Yes... you _may_ have a few configuration options to tweak but there have
> been no backwards incompatibility issues so I see no reason for anything to
> be broken.
>
>
>> If yes, I'm willing to
>> amend the tutorial if someone helps.
>>
>>
> What is broken? Can you elaborate?
>
