Hi Shakiba,

On Sat, Jun 11, 2016 at 1:48 PM, <user-digest-h...@nutch.apache.org> wrote:
> From: shakiba davari <davari.shak...@gmail.com>
> To: user@nutch.apache.org
> Date: Thu, 9 Jun 2016 13:11:43 -0400
> Subject: Indexing nutch crawled data in "Bluemix" solr
> <http://stackoverflow.com/questions/37731716/indexing-nutch-crawled-data-in-bluemix-solr>
>
> I'm trying to index the Nutch-crawled data with Bluemix Solr and I cannot
> find any way to do it. My main question is: can anybody help me do so?
> What should I do to send the results of my Nutch crawl to my Bluemix Solr?
>
> For the crawling I used Nutch 1.11. Here is part of what I have done so
> far and the problems I have faced. I thought there might be two possible
> solutions:
>
> 1. By nutch command:
>
>    NUTCH_PATH/bin/nutch index crawl/crawldb -linkdb crawl/linkdb crawl/ -Dsolr.server.url="OURSOLRURL"
>
> I can index the Nutch-crawled data with OURSOLR. However, I found some
> problems with that:
>
> a. Though it sounds really odd, it would not accept the URL. I could work
> around that by using the URL-encoded form instead.
>
> b. Since I have to connect with a specific username and password, Nutch
> could not connect to my Solr. Considering this in the command line output:
>
>    Active IndexWriters :
>    SolrIndexWriter
>        solr.server.type : Type of SolrServer to communicate with (default 'http' however options include 'cloud', 'lb' and 'concurrent')
>        solr.server.url : URL of the Solr instance (mandatory)
>        solr.zookeeper.url : URL of the Zookeeper URL (mandatory if 'cloud' value for solr.server.type)
>        solr.loadbalance.urls : Comma-separated string of Solr server strings to be used (mandatory if 'lb' value for solr.server.type)
>        solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
>        solr.commit.size : buffer size when sending to Solr (default 1000)
>        solr.auth : use authentication (default false)
>        solr.auth.username : username for authentication
>        solr.auth.password : password for authentication
>
> I tried to handle the problem by adding the authentication parameters
> solr.auth=true solr.auth.username="SOLR-UserName" solr.auth.password="Pass"
> to the command.
>
> So up to now I have got to the point of using this command:
>
>    bin/nutch index crawl/crawldb -linkdb crawl/linkdb crawl/segments/2016* solr.server.url="https%3A%2F%2Fgateway.watsonplatform.net%2Fretrieve-and-rank%2Fapi%2Fv1%2Fsolr_clusters%2FCLUSTER-ID%2Fsolr%2Fadmin%2Fcollections" solr.auth=true solr.auth.username="USERNAME" solr.auth.password="PASS"
>
> But for some reason I have not figured out yet, the command treats the
> authentication parameters as crawled-data directories and does not work.
> So I guess this is not the right way to activate the IndexWriters. Can
> anyone tell me how?

Please enter the command line parameters IN FRONT of the tool arguments, e.g.

  bin/nutch index -D solr.server.url="https%3A%2F%2Fgateway.watsonplatform.net%2Fretrieve-and-rank%2Fapi%2Fv1%2Fsolr_clusters%2FCLUSTER-ID%2Fsolr%2Fadmin%2Fcollections" -D solr.auth=true -D solr.auth.username="USERNAME" -D solr.auth.password="PASS" crawl/crawldb -linkdb crawl/linkdb crawl/segments/
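As an aside, you can sanity-check the cluster URL and the credentials independently of Nutch before re-running the indexer. A minimal curl sketch, assuming CLUSTER-ID, USERNAME and PASS stand in for your own Retrieve and Rank values:

  # Lists the collections in the Solr cluster. A JSON response (rather
  # than a 401) confirms the URL and credentials are accepted.
  curl -u "USERNAME":"PASS" "https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/CLUSTER-ID/solr/admin/collections?action=LIST&wt=json"

The quotes around the URL matter, since the shell would otherwise interpret the & character.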
> 2. By curl command:
>
>    curl -X POST -H "Content-Type: application/json" -u "BLUEMIXSOLR-USERNAME":"BLUEMIXSOLR-PASS" "https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/CLUSTERS-ID/solr/example_collection/update" --data-binary @{/path_to_file}/FILE.json
>
> I thought maybe I could feed it the JSON files created by this command:
>
>    bin/nutch commoncrawldump -outputDir finalcrawlResult/ -segment crawl/segments -gzip -extension json -SimpleDateFormat -epochFilename -jsonArray -reverseKey
>
> but there are some problems here:
>
> a. This command produces so many files in such complicated paths that
> POSTing all of them manually would take a very long time; for big crawls
> it may even be impossible. Is there any way to POST all the files in a
> directory and its subdirectories at once, with just one command?

Unfortunately, right now, AFAIK you cannot prevent the tool from creating the directory hell. You might be better off using the FileDumper tool instead:

  ./bin/nutch dump

(See the P.S. below for a one-command way to POST a whole directory tree.)

> b. There is a weird name "ÙÙ÷y œ" at the start of the JSON files created
> by commoncrawldump.

The data is encoded as CBOR. This is why those bytes exist.

> c. I removed the weird name and tried to POST just one of these files,
> but here is the result:
>
>    {"responseHeader":{"status":400,"QTime":23},"error":{"metadata":["error-class","org.apache.solr.common.SolrException","root-error-class","org.apache.solr.common.SolrException"],"msg":"Unknown command 'url' at [9]","code":400}}

It just means that you are not using the index tool correctly and that your input data is possibly not in the correct format (see the P.S. for the document format Solr expects).

Hope this helps.

Lewis
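P.S. Two sketches that may help with (a) and (c) above. Treat them as untested starting points; CLUSTER-ID, COLLECTION, USERNAME and PASS are placeholders for your own values.

For (a), GNU find can hand every .json file under a directory tree to curl in a single command, substituting each file's path for the {} token. This assumes each file has already been converted into a format Solr accepts:

  # POST every .json file below finalcrawlResult/ (including all
  # subdirectories) to the collection's update handler, then issue
  # one commit at the end.
  find finalcrawlResult/ -type f -name '*.json' -exec curl -X POST \
      -H "Content-Type: application/json" -u "USERNAME":"PASS" \
      "https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/CLUSTER-ID/solr/COLLECTION/update" \
      --data-binary @{} \;
  curl -u "USERNAME":"PASS" "https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/CLUSTER-ID/solr/COLLECTION/update?commit=true"

For (c), Solr's JSON update handler accepts either its own command syntax (add/delete/commit) or a bare array of documents whose fields match the collection's schema, e.g. (the field names here are purely illustrative):

  [
    {"id": "http://example.com/", "title": "Example", "content": "some text"},
    {"id": "http://example.com/about", "title": "About", "content": "more text"}
  ]

The "Unknown command 'url'" error suggests the POSTed file is a top-level JSON object whose first key is "url", which Solr tries (and fails) to interpret as an update command, so the commoncrawldump output would need to be reshaped into the form above before POSTing.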