Re: Getting started with Solr
OK, got it, works now. Maybe you can advise on something more general? I'm trying to use Solr to analyze HTML data retrieved with Nutch. I want to crawl a list of webpages built according to a certain template and analyze certain fields in their HTML (each identified by a span class and consisting of a number), then output the results as CSV: a list of each website's domain together with the sum of the numbers in all the specified fields. How should I set up the flow? Should I configure Nutch to pull only the relevant fields from each page, then use Solr to add up the integers in those fields and output to CSV? Or should I use Nutch to pull in everything from the relevant page and then use Solr to strip out the relevant fields and process them as above? Can I do the processing strictly in Solr, using what's described at https://cwiki.apache.org/confluence/display/solr/Indexing+and+Basic+Data+Operations, or should I use PHP through Solarium or something along those lines? Your advice would be appreciated; I don't want to reinvent the wheel. (One possible pure-Solr approach is sketched at the end of this message.)

Sincerely,

Baruch Kogan
Marketing Manager, Seller Panda
http://sellerpanda.com
+972(58)441-3829
baruch.kogan at Skype

On Sun, Mar 1, 2015 at 9:17 AM, Baruch Kogan <bar...@sellerpanda.com> wrote:

Thanks for bearing with me. I start Solr with `bin/solr start -e cloud` with 2 nodes. Then I get this:

Welcome to the SolrCloud example!

This interactive session will help you launch a SolrCloud cluster on your local workstation.
To begin, how many Solr nodes would you like to run in your local cluster? (specify 1-4 nodes) [2]
Ok, let's start up 2 Solr nodes for your example SolrCloud cluster.
Please enter the port for node1 [8983]
8983
Please enter the port for node2 [7574]
7574
Cloning Solr home directory /home/ubuntu/crawler/solr/example/cloud/node1 into /home/ubuntu/crawler/solr/example/cloud/node2
Starting up SolrCloud node1 on port 8983 using command:
solr start -cloud -s example/cloud/node1/solr -p 8983

I then go to http://localhost:8983/solr/admin/cores and get the following:

This XML file does not appear to have any style information associated with it.
The document tree is shown below.

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">2</int>
  </lst>
  <lst name="initFailures"/>
  <lst name="status">
    <lst name="testCollection_shard1_replica1">
      <str name="name">testCollection_shard1_replica1</str>
      <str name="instanceDir">/home/ubuntu/crawler/solr/example/cloud/node1/solr/testCollection_shard1_replica1/</str>
      <str name="dataDir">/home/ubuntu/crawler/solr/example/cloud/node1/solr/testCollection_shard1_replica1/data/</str>
      <str name="config">solrconfig.xml</str>
      <str name="schema">schema.xml</str>
      <date name="startTime">2015-03-01T06:59:12.296Z</date>
      <long name="uptime">46380</long>
      <lst name="index">
        <int name="numDocs">0</int>
        <int name="maxDoc">0</int>
        <int name="deletedDocs">0</int>
        <long name="indexHeapUsageBytes">0</long>
        <long name="version">1</long>
        <int name="segmentCount">0</int>
        <bool name="current">true</bool>
        <bool name="hasDeletions">false</bool>
        <str name="directory">org.apache.lucene.store.NRTCachingDirectory:NRTCachingDirectory(MMapDirectory@/home/ubuntu/crawler/solr/example/cloud/node1/solr/testCollection_shard1_replica1/data/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@2a4f8f8b; maxCacheMB=48.0 maxMergeSizeMB=4.0)</str>
        <lst name="userData"/>
        <long name="sizeInBytes">71</long>
        <str name="size">71 bytes</str>
      </lst>
    </lst>
    <lst name="testCollection_shard1_replica2">
      <str name="name">testCollection_shard1_replica2</str>
      <str name="instanceDir">/home/ubuntu/crawler/solr/example/cloud/node1/solr/testCollection_shard1_replica2/</str>
      <str name="dataDir">/home/ubuntu/crawler/solr/example/cloud/node1/solr/testCollection_shard1_replica2/data/</str>
      <str name="config">solrconfig.xml</str>
      <str name="schema">schema.xml</str>
      <date name="startTime">2015-03-01T06:59:12.751Z</date>
      <long name="uptime">45926</long>
      <lst name="index">
        <int name="numDocs">0</int>
        <int name="maxDoc">0</int>
        <int name="deletedDocs">0</int>
        <long name="indexHeapUsageBytes">0</long>
        <long name="version">1</long>
        <int name="segmentCount">0</int>
        <bool name="current">true</bool>
        <bool name="hasDeletions">false</bool>
        <str name="directory">org.apache.lucene.store.NRTCachingDirectory:NRTCachingDirectory(MMapDirectory@/home/ubuntu/crawler/solr/example/cloud/node1/solr/testCollection_shard1_replica2/data/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@2a4f8f8b; maxCacheMB=48.0 maxMergeSizeMB=4.0)</str>
        <lst name="userData"/>
        <long name="sizeInBytes">71</long>
        <str name="size">71 bytes</str>
      </lst>
    </lst>
    <lst name="testCollection_shard2_replica1">
      <str name="name">testCollection_shard2_replica1</str>
      <str name="instanceDir">/home/ubuntu/crawler/solr/example/cloud/node1/solr/testCollection_shard2_replica1/</str>
      <str name="dataDir">/home/ubuntu/crawler/solr/example/cloud/node1/solr/testCollection_shard2_replica1/data/</str>
      <str name="config">solrconfig.xml</str>
      <str name="schema">schema.xml</str>
      <date name="startTime">2015-03-01T06:59:12.596Z</date>
      <long name="uptime">46081</long>
      <lst name="index">
        <int name="numDocs">0</int>
        <int name="maxDoc">0</int>
        <int name="deletedDocs">0</int>
        <long name="indexHeapUsageBytes">0</long>
        <long name="version">1 [...]
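(On the pure-Solr approach raised at the top of this message: most of it can indeed happen inside Solr. Assume each crawled page is indexed as one document with a string field for the site's domain and an integer field holding the number extracted from the span; the field names domain and span_value below are hypothetical, since stock Nutch produces neither, so they would have to be populated by a Nutch indexing plugin or your own extraction step. Under that assumption, Solr's StatsComponent can compute the per-domain sums by itself:

    # Sum span_value over all documents, broken down per domain.
    # rows=0 suppresses the document list so only the stats come back.
    curl "http://localhost:8983/solr/testCollection/select?q=*:*&rows=0&stats=true&stats.field=span_value&stats.facet=domain&wt=json"

One caveat: the wt=csv response writer formats document lists, not stats output, so turning the per-domain sums into the final domain,sum CSV would still take a few lines of scripting, in Solarium or anything else, over the JSON response.)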
Integrating Solr with Nutch
Hi guys, I'm working through the tutorial at http://wiki.apache.org/nutch/NutchTutorial#A6._Integrate_Solr_with_Nutch. I've run a crawl on a list of webpages. Now I'm trying to index them into Solr. Solr is installed, runs fine, indexes .json, .xml, whatever, and returns queries. I've edited the Nutch schema as per the instructions. Now I hit a wall:

- Save the file and restart Solr under ${APACHE_SOLR_HOME}/example: java -jar start.jar

On my install (the latest Solr), there is no such file, but there is a solr.sh file in bin/ which I can start. So I pasted it into solr/example/ and ran it from there. Solr cranks over. Now I need to:

- run the Solr Index command from ${NUTCH_RUNTIME_HOME}: bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/

and I get this:

ubuntu@ubuntu-VirtualBox:~/crawler/nutch$ bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/
Indexer: starting at 2015-03-01 19:51:09
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
Active IndexWriters :
SOLRIndexWriter
    solr.server.url : URL of the SOLR instance (mandatory)
    solr.commit.size : buffer size when sending to SOLR (default 1000)
    solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
    solr.auth : use authentication (default false)
    solr.auth.username : username for authentication
    solr.auth.password : password for authentication
Indexer: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/ubuntu/crawler/nutch/crawl/segments/crawl_fetch
Input path does not exist: file:/home/ubuntu/crawler/nutch/crawl/segments/crawl_parse
Input path does not exist: file:/home/ubuntu/crawler/nutch/crawl/segments/parse_data
Input path does not exist: file:/home/ubuntu/crawler/nutch/crawl/segments/parse_text
Input path does not exist: file:/home/ubuntu/crawler/nutch/crawl/crawldb/current
Input path does not exist: file:/home/ubuntu/crawler/nutch/crawl/linkdb/current
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:197)
    at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:40)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
    at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1081)
    at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1073)
    at org.apache.hadoop.mapred.JobClient.access$700(JobClient.java:179)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:983)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:114)
    at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)

What am I doing wrong?
Sincerely,

Baruch Kogan
Marketing Manager, Seller Panda
http://sellerpanda.com
+972(58)441-3829
baruch.kogan at Skype
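(One tentative reading of that error: solrindex is treating crawl/segments/ itself as a single segment and looking for crawl_fetch, parse_data, etc. directly inside it, whereas Nutch writes each segment into a timestamped subdirectory; the missing crawl/crawldb/current also suggests the crawl step may not have written where expected. A sketch of the two usual invocations, with a hypothetical timestamp:

    # See which segments the crawl actually produced:
    ls crawl/segments/
    # e.g. 20150301123456   (hypothetical)

    # Index one named segment:
    bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb \
      -linkdb crawl/linkdb crawl/segments/20150301123456

    # Or index every segment under the parent directory:
    bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb \
      -linkdb crawl/linkdb -dir crawl/segments

If ls shows no segments at all, the crawl itself needs rerunning before indexing.)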
Re: Getting started with Solr
[...] lockFactory=org.apache.lucene.store.NativeFSLockFactory@2a4f8f8b; maxCacheMB=48.0 maxMergeSizeMB=4.0)</str>
<lst name="userData"/>
<long name="sizeInBytes">71</long>
<str name="size">71 bytes</str>
</lst>
</lst>
</lst>
</response>

I do not seem to have a gettingstarted collection.

Sincerely,

Baruch Kogan
Marketing Manager, Seller Panda
http://sellerpanda.com
+972(58)441-3829
baruch.kogan at Skype

On Fri, Feb 27, 2015 at 12:00 AM, Erik Hatcher <erik.hatc...@gmail.com> wrote:

I’m sorry, I’m not following exactly. Somehow you no longer have a gettingstarted collection, but it is not clear how that happened. Could you post the exact script steps you used that got you this error? What collections/cores does the Solr admin show you have? What are the results of http://localhost:8983/solr/admin/cores ?

—
Erik Hatcher, Senior Solutions Architect
http://www.lucidworks.com

On Feb 26, 2015, at 9:58 AM, Baruch Kogan <bar...@sellerpanda.com> wrote:

Oh, I see. I used the start -e cloud command, then ran through a setup with one core and default options for the rest, then tried to post the JSON example again, and got another error:

ubuntu@ubuntu-VirtualBox:~/crawler/solr$ bin/post -c gettingstarted example/exampledocs/*.json
/usr/lib/jvm/java-7-oracle/bin/java -classpath /home/ubuntu/crawler/solr/dist/solr-core-5.0.0.jar -Dauto=yes -Dc=gettingstarted -Ddata=files org.apache.solr.util.SimplePostTool example/exampledocs/books.json
SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/gettingstarted/update...
Entering auto mode. File endings considered are xml,json,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
POSTing file books.json (application/json) to [base]
SimplePostTool: WARNING: Solr returned an error #404 (Not Found) for url: http://localhost:8983/solr/gettingstarted/update
SimplePostTool: WARNING: Response:
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"/>
<title>Error 404 Not Found</title>
</head>
<body><h2>HTTP ERROR 404</h2>
<p>Problem accessing /solr/gettingstarted/update. Reason:
<pre>    Not Found</pre></p>
<hr /><i><small>Powered by Jetty://</small></i><br/>

Sincerely,

Baruch Kogan
Marketing Manager, Seller Panda
http://sellerpanda.com
+972(58)441-3829
baruch.kogan at Skype

On Thu, Feb 26, 2015 at 4:07 PM, Erik Hatcher <erik.hatc...@gmail.com> wrote:

How did you start Solr? If you started with `bin/solr start -e cloud` you’ll have a gettingstarted collection created automatically; otherwise you’ll need to create it yourself with `bin/solr create -c gettingstarted`.

—
Erik Hatcher, Senior Solutions Architect
http://www.lucidworks.com

On Feb 26, 2015, at 4:53 AM, Baruch Kogan <bar...@sellerpanda.com> wrote:

Hi, I've just installed Solr (I'll be controlling it with Solarium and using it to search Nutch crawl data). I'm working through the getting-started tutorial described here: https://cwiki.apache.org/confluence/display/solr/Running+Solr

When I try to run $ bin/post -c gettingstarted example/exampledocs/*.json, I get a bunch of errors about there not being a gettingstarted folder in /solr/. Is this normal? Should I create one?

Sincerely,

Baruch Kogan
Marketing Manager, Seller Panda
http://sellerpanda.com
+972(58)441-3829
baruch.kogan at Skype
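(Both of Erik’s checks can be run from the shell. The cores endpoint is the one whose output opens this message; in cloud mode, the Collections API lists collections cluster-wide:

    # Core-level status on this node:
    curl "http://localhost:8983/solr/admin/cores?action=STATUS&wt=json"

    # Collections known to the SolrCloud cluster:
    curl "http://localhost:8983/solr/admin/collections?action=LIST&wt=json"

The core status above shows only testCollection shards, which matches the symptom: there is no gettingstarted collection left to post to.)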
Re: Getting started with Solr
Oh, I see. I used the start -e cloud command, then ran through a setup with one core and default options for the rest, then tried to post the JSON example again, and got another error:

ubuntu@ubuntu-VirtualBox:~/crawler/solr$ bin/post -c gettingstarted example/exampledocs/*.json
/usr/lib/jvm/java-7-oracle/bin/java -classpath /home/ubuntu/crawler/solr/dist/solr-core-5.0.0.jar -Dauto=yes -Dc=gettingstarted -Ddata=files org.apache.solr.util.SimplePostTool example/exampledocs/books.json
SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/gettingstarted/update...
Entering auto mode. File endings considered are xml,json,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
POSTing file books.json (application/json) to [base]
SimplePostTool: WARNING: Solr returned an error #404 (Not Found) for url: http://localhost:8983/solr/gettingstarted/update
SimplePostTool: WARNING: Response:
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"/>
<title>Error 404 Not Found</title>
</head>
<body><h2>HTTP ERROR 404</h2>
<p>Problem accessing /solr/gettingstarted/update. Reason:
<pre>    Not Found</pre></p>
<hr /><i><small>Powered by Jetty://</small></i><br/>

Sincerely,

Baruch Kogan
Marketing Manager, Seller Panda
http://sellerpanda.com
+972(58)441-3829
baruch.kogan at Skype

On Thu, Feb 26, 2015 at 4:07 PM, Erik Hatcher <erik.hatc...@gmail.com> wrote:

How did you start Solr? If you started with `bin/solr start -e cloud` you’ll have a gettingstarted collection created automatically; otherwise you’ll need to create it yourself with `bin/solr create -c gettingstarted`.

—
Erik Hatcher, Senior Solutions Architect
http://www.lucidworks.com

On Feb 26, 2015, at 4:53 AM, Baruch Kogan <bar...@sellerpanda.com> wrote:

Hi, I've just installed Solr (I'll be controlling it with Solarium and using it to search Nutch crawl data). I'm working through the getting-started tutorial described here: https://cwiki.apache.org/confluence/display/solr/Running+Solr

When I try to run $ bin/post -c gettingstarted example/exampledocs/*.json, I get a bunch of errors about there not being a gettingstarted folder in /solr/. Is this normal? Should I create one?

Sincerely,

Baruch Kogan
Marketing Manager, Seller Panda
http://sellerpanda.com
+972(58)441-3829
baruch.kogan at Skype
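(To put Erik’s suggestion into a runnable sequence, using the tutorial’s own collection name and paths:

    # Create the collection that the tutorial's post command expects:
    bin/solr create -c gettingstarted

    # Re-run the post; /solr/gettingstarted/update should now resolve:
    bin/post -c gettingstarted example/exampledocs/*.json

In Solr 5, bin/solr create detects the mode it is running in: it creates a core in standalone mode and a collection in SolrCloud mode.)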
Getting started with Solr
Hi, I've just installed Solr (I'll be controlling it with Solarium and using it to search Nutch crawl data). I'm working through the getting-started tutorial described here: https://cwiki.apache.org/confluence/display/solr/Running+Solr

When I try to run $ bin/post -c gettingstarted example/exampledocs/*.json, I get a bunch of errors about there not being a gettingstarted folder in /solr/. Is this normal? Should I create one?

Sincerely,

Baruch Kogan
Marketing Manager, Seller Panda
http://sellerpanda.com
+972(58)441-3829
baruch.kogan at Skype