Re: Getting started with Solr

2015-03-01 Thread Baruch Kogan
OK, got it, works now.

Maybe you can advise on something more general?

I'm trying to use Solr to analyze HTML data retrieved with Nutch. I want to
crawl a list of webpages built from a common template and analyze certain
fields in their HTML (each identified by a span class and containing a
number), then output the results as CSV: a list of each website's domain and
the sum of the numbers in all the specified fields.
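
To make the extraction step concrete, here's a rough sketch of the kind of parsing I mean, using Python's standard html.parser (the span class `item-count` is just a made-up example; the real class name would come from the template):

```python
from html.parser import HTMLParser

class SpanNumberSummer(HTMLParser):
    """Sums the numeric contents of every <span> with a given class.

    Sketch only: assumes the target spans are not nested inside each other.
    """

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.in_target = False
        self.total = 0

    def handle_starttag(self, tag, attrs):
        # Track whether we just entered a span with the target class.
        classes = dict(attrs).get("class", "").split()
        self.in_target = (tag == "span" and self.target_class in classes)

    def handle_data(self, data):
        if self.in_target:
            try:
                self.total += int(data.strip())
            except ValueError:
                pass  # span held something non-numeric; skip it

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_target = False

def sum_span_numbers(html, target_class):
    parser = SpanNumberSummer(target_class)
    parser.feed(html)
    return parser.total

# Example: two matching spans and one unrelated span.
page = ('<div><span class="item-count">3</span>'
        '<span class="other">9</span>'
        '<span class="item-count">4</span></div>')
print(sum_span_numbers(page, "item-count"))  # 7
```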

How should I set up the flow? Should I configure Nutch to only pull the
relevant fields from each page, then use Solr to add the integers in those
fields and output to a csv? Or should I use Nutch to pull in everything
from the relevant page and then use Solr to strip out the relevant fields
and process them as above? Can I do the processing strictly in Solr, using
the stuff found here
https://cwiki.apache.org/confluence/display/solr/Indexing+and+Basic+Data+Operations,
or should I use PHP through Solarium or something along those lines?
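
For the "strictly in Solr" option, what I'm imagining is something like the JSON Facet API (available in newer Solr releases, if I understand correctly): a terms facet on the domain field with a sum aggregation over the numeric field, then flattening the buckets to CSV. The field names `domain_s` and `value_i` here are hypothetical:

```python
import csv
import io

def build_sum_by_domain_request(domain_field, number_field):
    """JSON Facet API request body: for each domain, sum the numeric field."""
    return {
        "query": "*:*",
        "limit": 0,  # we only want the facets, not the documents
        "facet": {
            "domains": {
                "type": "terms",
                "field": domain_field,
                "limit": -1,  # all domains
                "facet": {"total": f"sum({number_field})"},
            }
        },
    }

def facet_response_to_csv(response):
    """Flatten Solr's facet buckets into 'domain,total' CSV text."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["domain", "total"])
    for bucket in response["facets"]["domains"]["buckets"]:
        writer.writerow([bucket["val"], bucket["total"]])
    return out.getvalue()

# The request body would be POSTed to http://localhost:8983/solr/<collection>/query;
# below is a mocked response of the shape Solr's facet module returns.
mock = {
    "facets": {
        "domains": {
            "buckets": [
                {"val": "example.com", "count": 2, "total": 7.0},
                {"val": "example.org", "count": 1, "total": 4.0},
            ]
        }
    }
}
print(facet_response_to_csv(mock))
```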

Your advice would be appreciated; I don't want to reinvent the wheel.

Sincerely,

Baruch Kogan
Marketing Manager
Seller Panda http://sellerpanda.com
+972(58)441-3829
baruch.kogan at Skype

On Sun, Mar 1, 2015 at 9:17 AM, Baruch Kogan bar...@sellerpanda.com wrote:

 Thanks for bearing with me.

 I start Solr with `bin/solr start -e cloud` with 2 nodes. Then I get this:

 Welcome to the SolrCloud example!


 This interactive session will help you launch a SolrCloud cluster on your
 local workstation.

 To begin, how many Solr nodes would you like to run in your local
 cluster? (specify 1-4 nodes) [2]
 Ok, let's start up 2 Solr nodes for your example SolrCloud cluster.

 Please enter the port for node1 [8983]
 8983
 Please enter the port for node2 [7574]
 7574
 Cloning Solr home directory /home/ubuntu/crawler/solr/example/cloud/node1
 into /home/ubuntu/crawler/solr/example/cloud/node2

 Starting up SolrCloud node1 on port 8983 using command:

 solr start -cloud -s example/cloud/node1/solr -p 8983

 I then go to http://localhost:8983/solr/admin/cores and get the following:


 This XML file does not appear to have any style information associated
 with it. The document tree is shown below.

 <response>
   <lst name="responseHeader">
     <int name="status">0</int>
     <int name="QTime">2</int>
   </lst>
   <lst name="initFailures"/>
   <lst name="status">
     <lst name="testCollection_shard1_replica1">
       <str name="name">testCollection_shard1_replica1</str>
       <str name="instanceDir">/home/ubuntu/crawler/solr/example/cloud/node1/solr/testCollection_shard1_replica1/</str>
       <str name="dataDir">/home/ubuntu/crawler/solr/example/cloud/node1/solr/testCollection_shard1_replica1/data/</str>
       <str name="config">solrconfig.xml</str>
       <str name="schema">schema.xml</str>
       <date name="startTime">2015-03-01T06:59:12.296Z</date>
       <long name="uptime">46380</long>
       <lst name="index">
         <int name="numDocs">0</int>
         <int name="maxDoc">0</int>
         <int name="deletedDocs">0</int>
         <long name="indexHeapUsageBytes">0</long>
         <long name="version">1</long>
         <int name="segmentCount">0</int>
         <bool name="current">true</bool>
         <bool name="hasDeletions">false</bool>
         <str name="directory">org.apache.lucene.store.NRTCachingDirectory:NRTCachingDirectory(MMapDirectory@/home/ubuntu/crawler/solr/example/cloud/node1/solr/testCollection_shard1_replica1/data/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@2a4f8f8b; maxCacheMB=48.0 maxMergeSizeMB=4.0)</str>
         <lst name="userData"/>
         <long name="sizeInBytes">71</long>
         <str name="size">71 bytes</str>
       </lst>
     </lst>
     <lst name="testCollection_shard1_replica2">
       <str name="name">testCollection_shard1_replica2</str>
       <str name="instanceDir">/home/ubuntu/crawler/solr/example/cloud/node1/solr/testCollection_shard1_replica2/</str>
       <str name="dataDir">/home/ubuntu/crawler/solr/example/cloud/node1/solr/testCollection_shard1_replica2/data/</str>
       <str name="config">solrconfig.xml</str>
       <str name="schema">schema.xml</str>
       <date name="startTime">2015-03-01T06:59:12.751Z</date>
       <long name="uptime">45926</long>
       <lst name="index">
         <int name="numDocs">0</int>
         <int name="maxDoc">0</int>
         <int name="deletedDocs">0</int>
         <long name="indexHeapUsageBytes">0</long>
         <long name="version">1</long>
         <int name="segmentCount">0</int>
         <bool name="current">true</bool>
         <bool name="hasDeletions">false</bool>
         <str name="directory">org.apache.lucene.store.NRTCachingDirectory:NRTCachingDirectory(MMapDirectory@/home/ubuntu/crawler/solr/example/cloud/node1/solr/testCollection_shard1_replica2/data/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@2a4f8f8b; maxCacheMB=48.0 maxMergeSizeMB=4.0)</str>
         <lst name="userData"/>
         <long name="sizeInBytes">71</long>
         <str name="size">71 bytes</str>
       </lst>
     </lst>
     <lst name="testCollection_shard2_replica1">
       <str name="name">testCollection_shard2_replica1</str>
       <str name="instanceDir">/home/ubuntu/crawler/solr/example/cloud/node1/solr/testCollection_shard2_replica1/</str>
       <str name="dataDir">/home/ubuntu/crawler/solr/example/cloud/node1/solr/testCollection_shard2_replica1/data/</str>
       <str name="config">solrconfig.xml</str>
       <str name="schema">schema.xml</str>
       <date name="startTime">2015-03-01T06:59:12.596Z</date>
       <long name="uptime">46081</long>
       <lst name="index">
         <int name="numDocs">0</int>
         <int name="maxDoc">0</int>
         <int name="deletedDocs">0</int>
         <long name="indexHeapUsageBytes">0</long>
         <long name="version">1

Integrating Solr with Nutch

2015-03-01 Thread Baruch Kogan
Hi, guys,

I'm working through the tutorial here
http://wiki.apache.org/nutch/NutchTutorial#A6._Integrate_Solr_with_Nutch.
I've run a crawl on a list of webpages. Now I'm trying to index them into
Solr. Solr is installed, runs fine, indexes .json, .xml, whatever, and answers
queries. I've edited the Nutch schema as per the instructions. Now I hit a wall:

   - Save the file and restart Solr under ${APACHE_SOLR_HOME}/example:

     java -jar start.jar


On my install (the latest Solr), there is no such file, but there is a
solr.sh file in bin/ which I can run. So I copied it into solr/example/
and ran it from there. Solr cranks over. Now I need to:


   - run the Solr Index command from ${NUTCH_RUNTIME_HOME}:

     bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/


and I get this:

ubuntu@ubuntu-VirtualBox:~/crawler/nutch$ bin/nutch solrindex
http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/
Indexer: starting at 2015-03-01 19:51:09
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
Active IndexWriters :
SOLRIndexWriter
	solr.server.url : URL of the SOLR instance (mandatory)
	solr.commit.size : buffer size when sending to SOLR (default 1000)
	solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
	solr.auth : use authentication (default false)
	solr.auth.username : username for authentication
	solr.auth.password : password for authentication


Indexer: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/ubuntu/crawler/nutch/crawl/segments/crawl_fetch
Input path does not exist: file:/home/ubuntu/crawler/nutch/crawl/segments/crawl_parse
Input path does not exist: file:/home/ubuntu/crawler/nutch/crawl/segments/parse_data
Input path does not exist: file:/home/ubuntu/crawler/nutch/crawl/segments/parse_text
Input path does not exist: file:/home/ubuntu/crawler/nutch/crawl/crawldb/current
Input path does not exist: file:/home/ubuntu/crawler/nutch/crawl/linkdb/current
	at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:197)
	at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:40)
	at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
	at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1081)
	at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1073)
	at org.apache.hadoop.mapred.JobClient.access$700(JobClient.java:179)
	at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:983)
	at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
	at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
	at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)
	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)
	at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:114)
	at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
	at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)
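
The paths in the trace make me suspect the indexer is treating crawl/segments/ itself as a segment, while the real segments are the timestamped subdirectories Nutch creates inside it. If that's right, I'd guess one of these variants is what's wanted (the -dir form assuming the Nutch 1.x solrindex syntax supports it):

```shell
# Pass each timestamped segment explicitly via shell globbing:
bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb \
    -linkdb crawl/linkdb crawl/segments/*

# ...or point at the parent directory of the segments with -dir:
bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb \
    -linkdb crawl/linkdb -dir crawl/segments
```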

What am I doing wrong?

Sincerely,

Baruch Kogan
Marketing Manager
Seller Panda http://sellerpanda.com
+972(58)441-3829
baruch.kogan at Skype


Re: Getting started with Solr

2015-02-28 Thread Baruch Kogan
lockFactory=org.apache.lucene.store.NativeFSLockFactory@2a4f8f8b; maxCacheMB=48.0 maxMergeSizeMB=4.0)</str>
<lst name="userData"/>
<long name="sizeInBytes">71</long>
<str name="size">71 bytes</str>
</lst></lst></lst></response>

I do not seem to have a gettingstarted collection.

Sincerely,

Baruch Kogan
Marketing Manager
Seller Panda http://sellerpanda.com
+972(58)441-3829
baruch.kogan at Skype

On Fri, Feb 27, 2015 at 12:00 AM, Erik Hatcher erik.hatc...@gmail.com
wrote:

 I’m sorry, I’m not following exactly.

 Somehow you no longer have a gettingstarted collection, but it is not
 clear how that happened.

 Could you post the exact script steps you used that got you this error?

 What collections/cores does the Solr admin show you have? What are the
 results of http://localhost:8983/solr/admin/cores ?

 —
 Erik Hatcher, Senior Solutions Architect
 http://www.lucidworks.com/




  On Feb 26, 2015, at 9:58 AM, Baruch Kogan bar...@sellerpanda.com
 wrote:
 
  Oh, I see. I used the start -e cloud command, then ran through a setup
 with
  one core and default options for the rest, then tried to post the json
  example again, and got another error:
  ubuntu@ubuntu-VirtualBox:~/crawler/solr$ bin/post -c gettingstarted
  example/exampledocs/*.json
  /usr/lib/jvm/java-7-oracle/bin/java -classpath
  /home/ubuntu/crawler/solr/dist/solr-core-5.0.0.jar -Dauto=yes
  -Dc=gettingstarted -Ddata=files org.apache.solr.util.SimplePostTool
  example/exampledocs/books.json
  SimplePostTool version 5.0.0
  Posting files to [base] url
  http://localhost:8983/solr/gettingstarted/update...
  Entering auto mode. File endings considered are
 
 xml,json,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
  POSTing file books.json (application/json) to [base]
  SimplePostTool: WARNING: Solr returned an error #404 (Not Found) for url:
  http://localhost:8983/solr/gettingstarted/update
  SimplePostTool: WARNING: Response: <html>
  <head>
  <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"/>
  <title>Error 404 Not Found</title>
  </head>
  <body><h2>HTTP ERROR 404</h2>
  <p>Problem accessing /solr/gettingstarted/update. Reason:
  <pre>    Not Found</pre></p><hr /><i><small>Powered by
  Jetty://</small></i><br/>
 
  Sincerely,
 
  Baruch Kogan
  Marketing Manager
  Seller Panda http://sellerpanda.com
  +972(58)441-3829
  baruch.kogan at Skype
 
  On Thu, Feb 26, 2015 at 4:07 PM, Erik Hatcher erik.hatc...@gmail.com
  wrote:
 
  How did you start Solr?   If you started with `bin/solr start -e cloud`
  you’ll have a gettingstarted collection created automatically, otherwise
  you’ll need to create it yourself with `bin/solr create -c
 gettingstarted`
 
 
  —
  Erik Hatcher, Senior Solutions Architect
   http://www.lucidworks.com/
 
 
 
 
  On Feb 26, 2015, at 4:53 AM, Baruch Kogan bar...@sellerpanda.com
  wrote:
 
  Hi, I've just installed Solr (I'll be controlling it with Solarium and using
  it to search Nutch crawl results.)  I'm working through the starting tutorials
  described here:
  https://cwiki.apache.org/confluence/display/solr/Running+Solr
 
  When I try to run $ bin/post -c gettingstarted
  example/exampledocs/*.json,
  I get a bunch of errors having to do
  with there not being a gettingstarted folder in /solr/. Is this normal?
  Should I create one?
 
  Sincerely,
 
  Baruch Kogan
  Marketing Manager
  Seller Panda http://sellerpanda.com
  +972(58)441-3829
  baruch.kogan at Skype
 
 




Re: Getting started with Solr

2015-02-26 Thread Baruch Kogan
Oh, I see. I used the start -e cloud command, then ran through a setup with
one core and default options for the rest, then tried to post the json
example again, and got another error:
ubuntu@ubuntu-VirtualBox:~/crawler/solr$ bin/post -c gettingstarted
example/exampledocs/*.json
/usr/lib/jvm/java-7-oracle/bin/java -classpath
/home/ubuntu/crawler/solr/dist/solr-core-5.0.0.jar -Dauto=yes
-Dc=gettingstarted -Ddata=files org.apache.solr.util.SimplePostTool
example/exampledocs/books.json
SimplePostTool version 5.0.0
Posting files to [base] url
http://localhost:8983/solr/gettingstarted/update...
Entering auto mode. File endings considered are
xml,json,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
POSTing file books.json (application/json) to [base]
SimplePostTool: WARNING: Solr returned an error #404 (Not Found) for url:
http://localhost:8983/solr/gettingstarted/update
SimplePostTool: WARNING: Response: <html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"/>
<title>Error 404 Not Found</title>
</head>
<body><h2>HTTP ERROR 404</h2>
<p>Problem accessing /solr/gettingstarted/update. Reason:
<pre>    Not Found</pre></p><hr /><i><small>Powered by
Jetty://</small></i><br/>

Sincerely,

Baruch Kogan
Marketing Manager
Seller Panda http://sellerpanda.com
+972(58)441-3829
baruch.kogan at Skype

On Thu, Feb 26, 2015 at 4:07 PM, Erik Hatcher erik.hatc...@gmail.com
wrote:

 How did you start Solr?   If you started with `bin/solr start -e cloud`
 you’ll have a gettingstarted collection created automatically, otherwise
 you’ll need to create it yourself with `bin/solr create -c gettingstarted`


 —
 Erik Hatcher, Senior Solutions Architect
 http://www.lucidworks.com/




  On Feb 26, 2015, at 4:53 AM, Baruch Kogan bar...@sellerpanda.com
 wrote:
 
   Hi, I've just installed Solr (I'll be controlling it with Solarium and using
   it to search Nutch crawl results.)  I'm working through the starting tutorials
  described here:
  https://cwiki.apache.org/confluence/display/solr/Running+Solr
 
  When I try to run $ bin/post -c gettingstarted
 example/exampledocs/*.json,
  I get a bunch of errors having to do
  with there not being a gettingstarted folder in /solr/. Is this normal?
  Should I create one?
 
  Sincerely,
 
  Baruch Kogan
  Marketing Manager
  Seller Panda http://sellerpanda.com
  +972(58)441-3829
  baruch.kogan at Skype




Getting started with Solr

2015-02-26 Thread Baruch Kogan
Hi, I've just installed Solr (I'll be controlling it with Solarium and using
it to search Nutch crawl results.)  I'm working through the starting tutorials
described here:
https://cwiki.apache.org/confluence/display/solr/Running+Solr

When I try to run $ bin/post -c gettingstarted example/exampledocs/*.json,
I get a bunch of errors having to do
with there not being a gettingstarted folder in /solr/. Is this normal?
Should I create one?

Sincerely,

Baruch Kogan
Marketing Manager
Seller Panda http://sellerpanda.com
+972(58)441-3829
baruch.kogan at Skype