Hi Shakiba,

On Sat, Jun 11, 2016 at 1:48 PM, <user-digest-h...@nutch.apache.org> wrote:

> From: shakiba davari <davari.shak...@gmail.com>
> To: user@nutch.apache.org
> Cc:
> Date: Thu, 9 Jun 2016 13:11:43 -0400
> Subject: Indexing nutch crawled data in “Bluemix” solr
> <http://stackoverflow.com/questions/37731716/indexing-nutch-crawled-data-in-bluemix-solr>
>
> I'm trying to index the Nutch-crawled data with Bluemix Solr and I cannot
> find any way to do it. My main question is: is there anybody who can help
> me do this? What should I do to send the results of my Nutch crawl to my
> Bluemix Solr?
>
> For the crawling I used Nutch 1.11. Here is part of what I have done so far
> and the problems I faced. I thought there might be two possible solutions:
>
>    1. By nutch command:
>
> NUTCH_PATH/bin/nutch index crawl/crawldb -linkdb crawl/linkdb crawl/
> -Dsolr.server.url="OURSOLRURL"
>
> I can index the Nutch-crawled data into OURSOLR. However, I found some
> problems with it:
>
> a. Though it sounds really odd, it would not accept the URL directly. I
> could work around this by using the URL-encoded form instead.
>
> b. Since I have to connect with a specific username and password, Nutch
> could not connect to my Solr. Considering this:
>
>  Active IndexWriters :
>  SolrIndexWriter
>     solr.server.type : Type of SolrServer to communicate with (default
> 'http' however options include 'cloud', 'lb' and 'concurrent')
>     solr.server.url : URL of the Solr instance (mandatory)
>     solr.zookeeper.url : URL of the Zookeeper ensemble (mandatory if
> 'cloud' value for solr.server.type)
>     solr.loadbalance.urls : Comma-separated string of Solr server
> strings to be used (mandatory if 'lb' value for solr.server.type)
>     solr.mapping.file : name of the mapping file for fields (default
> solrindex-mapping.xml)
>     solr.commit.size : buffer size when sending to Solr (default 1000)
>     solr.auth : use authentication (default false)
>     solr.auth.username : username for authentication
>     solr.auth.password : password for authentication
>
> in the command-line output, I tried to handle this problem by adding the
> authentication parameters solr.auth=true solr.auth.username="SOLR-UserName"
> solr.auth.password="Pass" to the command.
>
> So up to now I have got to the point of using this command:
>
> bin/nutch index crawl/crawldb -linkdb crawl/linkdb crawl/segments/2016*
> solr.server.url="https%3A%2F%2Fgateway.watsonplatform.net%2Fretrieve-and-rank%2Fapi%2Fv1%2Fsolr_clusters%2FCLUSTER-ID%2Fsolr%2Fadmin%2Fcollections"
> solr.auth=true solr.auth.username="USERNAME" solr.auth.password="PASS"
>
> But for some reason that I have not figured out yet, the command treats the
> authentication parameters as crawled-data directories and does not work. So
> I guess this is not the right way to activate the IndexWriters. Can anyone
> tell me how I can do it?
>

Please enter the command-line parameters IN FRONT of the Tool arguments,
e.g.

bin/nutch index \
  -D solr.server.url="https%3A%2F%2Fgateway.watsonplatform.net%2Fretrieve-and-rank%2Fapi%2Fv1%2Fsolr_clusters%2FCLUSTER-ID%2Fsolr%2Fadmin%2Fcollections" \
  -D solr.auth=true -D solr.auth.username="USERNAME" -D solr.auth.password="PASS" \
  crawl/crawldb -linkdb crawl/linkdb crawl/segments/
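As a quick sanity check before running the indexer, you can also verify the
URL and credentials with plain curl; a rough sketch, using the decoded form
of the cluster URL (CLUSTER-ID, USERNAME and PASS are the same placeholders
as above):

# List the collections in the cluster; note curl takes the normal,
# un-encoded URL
curl -u "USERNAME:PASS" \
  "https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/CLUSTER-ID/solr/admin/collections?action=LIST&wt=json"

If this returns a JSON list of collections, the credentials are fine and any
remaining failure is on the Nutch side.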


>
>    2. By curl command:
>
> curl -X POST -H "Content-Type: application/json" -u
> "BLUEMIXSOLR-USERNAME":"BLUEMIXSOLR-PASS"
> "https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/CLUSTERS-ID/solr/example_collection/update"
> --data-binary @{/path_to_file}/FILE.json
>
> I thought maybe I could feed it the JSON files created by this command:
>
> bin/nutch commoncrawldump -outputDir finalcrawlResult/ -segment
> crawl/segments -gzip -extension json -SimpleDateFormat -epochFilename
> -jsonArray -reverseKey
>
> but there are some problems here.
>
> a. This command produces a great many files in complicated paths, and it
> would take a long time to POST all of them manually. I guess for big crawls
> it may even be impossible. Is there any way to POST all the files in a
> directory and its subdirectories at once, with just one command?
>

Unfortunately, right now AFAIK you cannot prevent the tool from creating
the directory hell. You might be better off using the FileDumper tool
instead:

./bin/nutch dump
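If you do stay with the commoncrawldump output, a shell one-liner can at
least POST every file under the output directory in one go; a rough sketch,
assuming the dumped files have been gunzipped and contain JSON that Solr can
accept (URL and credentials are the placeholders from your curl command):

# Walk the whole directory tree and POST each .json file to the collection
find finalcrawlResult/ -type f -name '*.json' -print0 |
  xargs -0 -I{} curl -X POST -H "Content-Type: application/json" \
    -u "BLUEMIXSOLR-USERNAME:BLUEMIXSOLR-PASS" \
    "https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/CLUSTERS-ID/solr/example_collection/update" \
    --data-binary @{}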


>
> b. There is a weird string "ÙÙ÷y œ" at the start of the JSON files created
> by commoncrawldump.
>

The data is encoded as CBOR. This is why those bytes exist.
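You can confirm this by looking at the first bytes of one of the files:
0xD9 0xD9 0xF7 is CBOR's "self-described CBOR" tag, which is exactly what
renders as "ÙÙ÷" in a text editor. For example (FILE.json stands for any of
the dumped files):

# Dump the first 16 bytes; expect the output to start with: d9d9 f7
xxd -l 16 FILE.json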


>
> c. I removed the weird leading string and tried to POST just one of these
> files, but here is the result:
>
>
>  
> {"responseHeader":{"status":400,"QTime":23},"error":{"metadata":["error-class","org.apache.solr.common.SolrException","root-error-class","org.apache.solr.common.SolrException"],"msg":"Unknown
> command 'url' at [9]","code":400}}
>
>
No, it just means that you are not using the index tool correctly and that
possibly your input data is not in the correct format.
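For what it is worth, Solr's /update handler parses the keys of a top-level
JSON object as commands ("add", "commit", "delete", ...), which is why a
document starting with a "url" field yields "Unknown command 'url'". Plain
documents have to be wrapped in a JSON array instead; a minimal sketch with
made-up field names (they must exist in your collection's schema):

# POST one hand-written document and commit it
curl -X POST -H "Content-Type: application/json" \
  -u "BLUEMIXSOLR-USERNAME:BLUEMIXSOLR-PASS" \
  "https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/CLUSTERS-ID/solr/example_collection/update?commit=true" \
  --data-binary '[{"id":"http://example.com/","title":"Example page"}]'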
Hope this helps.

Lewis
