Thanks so much, Lewis. It really helped me; at least now I know that there
is a way to make it work.
I used the command as you suggested:

bin/nutch index -D solr.server.url="https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/CLUSTER-ID/solr/admin/collections"
-D solr.auth=true -D solr.auth.username="USERNAME" -D
solr.auth.password="PASS" Crawl/crawldb -linkdb Crawl/linkdb
Crawl/segments/2016*

and now the result is:

Indexing 153 documents
Indexing 153 documents
Indexer: java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
        at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:145)
        at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:222)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:231)
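
In case it matters: as far as I understand, the full exception behind "Job
failed!" should also be written to logs/hadoop.log, so I am checking there
with something like:

tail -n 100 logs/hadoop.log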


I guess it has something to do with the solr.server.url address, maybe the
end of it. I have changed it in different ways, e.g.
"https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/CLUSTER-ID/solr/example_collection/update"
(since that is the endpoint used for feeding JSON files to the Bluemix
Solr), but no luck so far.
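
To rule out the URL and the credentials themselves, I am also testing them
with a plain curl call outside of Nutch (the action=LIST parameter is just
my guess at a harmless request against the Collections API):

curl -u "USERNAME":"PASS" \
  "https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/CLUSTER-ID/solr/admin/collections?action=LIST&wt=json"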

Any idea what's happening now?


Shakiba Davari <https://ca.linkedin.com/pub/shakiba-davari/84/417/b57>


On Tue, Jun 14, 2016 at 4:58 PM, Lewis John Mcgibbney <
lewis.mcgibb...@gmail.com> wrote:

> Hi Shakiba,
>
> On Sat, Jun 11, 2016 at 1:48 PM, <user-digest-h...@nutch.apache.org>
> wrote:
>
> > From: shakiba davari <davari.shak...@gmail.com>
> > To: user@nutch.apache.org
> > Cc:
> > Date: Thu, 9 Jun 2016 13:11:43 -0400
> > Subject: Indexing nutch crawled data in “Bluemix” solr
> > <http://stackoverflow.com/questions/37731716/indexing-nutch-crawled-data-in-bluemix-solr>
> >
> > I'm trying to index the Nutch-crawled data with Bluemix Solr and I cannot
> > find any way to do it. My main question is: can anybody help me do so?
> > What should I do to send the result of my Nutch crawl to my Bluemix Solr?
> >
> > For the crawling I used Nutch 1.11, and here is part of what I have done
> > so far and the problems I faced. I thought there might be two possible
> > solutions:
> >
> >    1. By nutch command:
> >
> > “NUTCH_PATH/bin/nutch index crawl/crawldb -linkdb crawl/linkdb crawl/
> > -Dsolr.server.url="OURSOLRURL"”
> >
> > I can index the Nutch-crawled data into OURSOLR. However, I found some
> > problems with that.
> >
> > a. Though it sounds really odd, it could not accept the URL. I could
> > handle it by using the URL-encoded form instead.
> >
> > b. Since I have to connect with a specific username and password, Nutch
> > could not connect to my Solr. Considering this:
> >
> >  Active IndexWriters :
> >  SolrIndexWriter
> >     solr.server.type : Type of SolrServer to communicate with (default
> > 'http' however options include 'cloud', 'lb' and 'concurrent')
> >     solr.server.url : URL of the Solr instance (mandatory)
> >     solr.zookeeper.url : URL of the Zookeeper URL (mandatory if
> > 'cloud' value for solr.server.type)
> >     solr.loadbalance.urls : Comma-separated string of Solr server
> > strings to be used (madatory if 'lb' value for solr.server.type)
> >     solr.mapping.file : name of the mapping file for fields (default
> > solrindex-mapping.xml)
> >     solr.commit.size : buffer size when sending to Solr (default 1000)
> >     solr.auth : use authentication (default false)
> >     solr.auth.username : username for authentication
> >     solr.auth.password : password for authentication
> >
> > in the command line output, I tried to manage this problem by adding the
> > authentication parameters solr.auth=true solr.auth.username="SOLR-UserName"
> > solr.auth.password="Pass" to the command.
> >
> > So up to now I've got to the point of using this command:
> >
> > "bin/nutch index crawl/crawldb -linkdb crawl/linkdb crawl/segments/2016*
> > solr.server.url="https%3A%2F%2Fgateway.watsonplatform.net%2Fretrieve-and-rank%2Fapi%2Fv1%2Fsolr_clusters%2FCLUSTER-ID%2Fsolr%2Fadmin%2Fcollections"
> > solr.auth=true solr.auth.username="USERNAME" solr.auth.password="PASS"".
> >
> > But for some reason that I haven't figured out yet, the command treats
> > the authentication parameters as crawled-data directories and does not
> > work. So I guess this is not the right way to set the "Active
> > IndexWriters" options; can anyone tell me how I can do it?
> >
>
> Please enter the command line parameters IN FRONT of the Tool arguments,
> e.g. bin/nutch index -D solr.server.url="https%3A%2F%2Fgateway.watsonplatform.net%2Fretrieve-and-rank%2Fapi%2Fv1%2Fsolr_clusters%2FCLUSTER-ID%2Fsolr%2Fadmin%2Fcollections"
> -D solr.auth=true -D solr.auth.username="USERNAME" -D
> solr.auth.password="PASS" crawl/crawldb -linkdb crawl/linkdb
> crawl/segments/
>
>
> >
> >    2. By curl command:
> >
> > "curl -X POST -H "Content-Type: application/json" -u
> > "BLUEMIXSOLR-USERNAME":"BLUEMIXSOLR-PASS"
> > "https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/CLUSTERS-ID/solr/example_collection/update"
> > --data-binary @{/path_to_file}/FILE.json"
> >
> > I thought maybe I could feed the JSON files created by this command:
> >
> > bin/nutch commoncrawldump -outputDir finalcrawlResult/ -segment
> > crawl/segments -gzip -extension json -SimpleDateFormat -epochFilename
> > -jsonArray -reverseKey
> >
> > but there are some problems here.
> >
> > a. This command produces so many files in complicated paths that it would
> > take a very long time to POST all of them manually; I guess for big
> > crawls it may even be impossible. Is there any way to POST all the files
> > in a directory and its subdirectories at once with just one command?
> >
>
> Unfortunately, right now AFAIK you cannot prevent the tool from creating
> the directory hell. You might be better off using the FileDumper tool
> instead:
> ./bin/nutch dump
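>
> If you do end up POSTing the dump files one by one, a loop along these
> lines might save you the manual work (untested sketch; it assumes the
> files end in .json and reuses the collection URL and credentials from
> your curl example):
>
> find finalcrawlResult/ -type f -name '*.json' | while read f; do
>   curl -X POST -H "Content-Type: application/json" \
>     -u "BLUEMIXSOLR-USERNAME":"BLUEMIXSOLR-PASS" \
>     "https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/CLUSTERS-ID/solr/example_collection/update" \
>     --data-binary @"$f"
> done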
>
>
> >
> > b. There are some weird characters ("ÙÙ÷y œ") at the start of the JSON
> > files created by commoncrawldump.
> >
>
> The data is encoded as CBOR. This is why those bytes exist.
>
>
> >
> > c. I removed those weird characters and tried to POST just one of these
> > files, but here is the result:
> >
> > {"responseHeader":{"status":400,"QTime":23},"error":{"metadata":["error-class","org.apache.solr.common.SolrException","root-error-class","org.apache.solr.common.SolrException"],"msg":"Unknown command 'url' at [9]","code":400}}
> >
> >
> No, it just means that you are not using the index tool correctly and
> that possibly your input data is not in the correct format.
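> For comparison, the plain /update handler expects Solr's own JSON update
> format, e.g. an array of documents; a minimal sketch (the field names
> below are just placeholders, not necessarily what your schema uses) would
> look like:
>
> curl -X POST -H "Content-Type: application/json" \
>   -u "BLUEMIXSOLR-USERNAME":"BLUEMIXSOLR-PASS" \
>   "https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/CLUSTERS-ID/solr/example_collection/update?commit=true" \
>   --data-binary '[{"id": "http://example.com/", "title": "Example page", "content": "Example body text"}]'
>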
> Hope this helps.
> Lewis
>
