Title: [Fwd: RE: Weekly Report]
I guess I have to read up on segments. I don't know what they are yet.

Looking at the Mail Archive of this List ( http://www.mail-archive.com/[email protected]/msg01607.html ), I found:

./bin/nutch readseg -dump crawl/segments/XXX/ dump_folder -nofetch -nogenerate -noparse -noparsedata -noparsetext

That command dumps all the HTML source code into one file, which is a good start. However, what I actually want is a new field in the schema that contains the source code, like:

<htmlsource></htmlsource>

Is that even possible?

Best,
Stephan
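
The XXX in the readseg command above stands for a segment name. A minimal sketch for filling it in, assuming the segments live under crawl/segments and are named by creation timestamp (so the lexicographically last name is the newest). The helper name latest_segment and the SEGMENTS_DIR variable are made up for illustration. The script only prints the command, it does not run Nutch:

```shell
#!/bin/sh
# Sketch: pick the newest segment and print a readseg dump command for it.
# SEGMENTS_DIR and latest_segment are illustrative names (assumptions).

latest_segment() {
    # $1: directory that holds the segment directories.
    # Segment names are timestamps, so a plain sort puts the newest last.
    [ -d "$1" ] || return 0
    ls -1 "$1" | sort | tail -n 1
}

SEGMENTS_DIR="${SEGMENTS_DIR:-crawl/segments}"
seg=$(latest_segment "$SEGMENTS_DIR")

if [ -n "$seg" ]; then
    # Dry run: print the command instead of executing it.
    echo "./bin/nutch readseg -dump $SEGMENTS_DIR/$seg dump_folder -nofetch -nogenerate -noparse -noparsedata -noparsetext"
else
    echo "no segments found under $SEGMENTS_DIR"
fi
```

This only helps with the dump. Getting the raw HTML into its own index field would still need something on the Nutch side to populate that field.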

On 09.05.2012 18:05, Lewis John Mcgibbney wrote:
Which segments are you trying to generate from? Do you maybe need to include them individually? or use a wildcard?

bin/nutch generate crawldb crawldb/segments/*
bin/nutch generate crawldb crawldb/segments/segmentNo

?

On Wed, May 9, 2012 at 3:33 PM, Stephan Kristyn <[email protected]> wrote:
Ok now at the heading "Step-by-Step: Fetching" I get

-bash-4.1$ bin/nutch generate crawldb crawldb/segments
Generator: starting at 2012-05-09 14:32:44
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/kristyns/apache-nutch-1.4-bin/runtime/local/crawldb/current
        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
        at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
        at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
        at org.apache.nutch.crawl.Generator.generate(Generator.java:538)
        at org.apache.nutch.crawl.Generator.run(Generator.java:704)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.crawl.Generator.main(Generator.java:660)

Strange...
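
The exception above says the generator could not find crawldb/current. A minimal sketch of a check to run before generate, assuming the crawldb is created by bin/nutch inject as in the Nutch tutorial. The paths crawl/crawldb, crawl/segments and urls are placeholders, not taken from this thread, and crawldb_ready is a made-up helper name:

```shell
#!/bin/sh
# Sketch: verify that a crawldb exists before running "bin/nutch generate".
# CRAWLDB, SEGMENTS and the "urls" seed directory are assumed paths.

crawldb_ready() {
    # $1: crawldb directory. The generator reads from "$1/current",
    # which is the path named in the InvalidInputException.
    [ -d "$1/current" ]
}

CRAWLDB="${CRAWLDB:-crawl/crawldb}"
SEGMENTS="${SEGMENTS:-crawl/segments}"

if crawldb_ready "$CRAWLDB"; then
    echo "crawldb found; next: bin/nutch generate $CRAWLDB $SEGMENTS"
else
    echo "no $CRAWLDB/current; first run: bin/nutch inject $CRAWLDB urls"
fi
```

Note that generate takes the crawldb first and the segments directory second. The segments directory is where the new segment is written, so it does not have to exist beforehand.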

On 09.05.2012 16:04, Stephan Kristyn wrote:
Hi, it seems like I forgot to fetch the crawled URLs, as mentioned in the tutorial:

http://wiki.apache.org/nutch/NutchTutorial

I'll let you know if and how that worked out for me.

On 09.05.2012 14:28, Stephan Kristyn wrote:
This is the query that the Solr interface generates when I enter "test" and hit the search button:
http://myDomain:8983/solr/select/?q=test&version=2.2&start=0&rows=10&indent=on

Maybe this is a question better suited for the Solr ML?

From: Lewis John Mcgibbney [mailto:[email protected]]
Sent: Wednesday, 9 May 2012 13:34
To: [email protected]
Subject: Re: HTTP ERROR 400

Are you attempting to index to Solr, or does this simply happen when you start your Solr server?
On Wed, May 9, 2012 at 12:21 PM, Stephan Kristyn <[email protected]> wrote:
I copied over the schema and everything else in conf from nutch.

$ cp apache-nutch-1.4-bin/runtime/local/conf/* apache-solr-3.6.0/example/solr/conf/
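
A more conservative variant of the copy above, as a sketch only: it copies just schema.xml (the file the Nutch tutorial asks for) and keeps a backup of the schema that Solr shipped with. The function name install_schema and the NUTCH_CONF and SOLR_CONF variables are made up for illustration. Their defaults mirror the paths in the cp command:

```shell
#!/bin/sh
# Sketch: copy only Nutch's schema.xml into Solr's conf directory and keep
# a backup of the original. install_schema, NUTCH_CONF and SOLR_CONF are
# illustrative names (assumptions), not part of Nutch or Solr.

install_schema() {
    # $1: Nutch conf directory, $2: Solr conf directory
    src="$1/schema.xml"
    dst="$2/schema.xml"
    [ -f "$src" ] || { echo "missing $src" >&2; return 1; }
    # Back up the existing schema once, so the change can be undone.
    if [ -f "$dst" ] && [ ! -f "$dst.orig" ]; then
        cp "$dst" "$dst.orig"
    fi
    cp "$src" "$dst"
}

NUTCH_CONF="${NUTCH_CONF:-apache-nutch-1.4-bin/runtime/local/conf}"
SOLR_CONF="${SOLR_CONF:-apache-solr-3.6.0/example/solr/conf}"

if [ -d "$NUTCH_CONF" ] && [ -d "$SOLR_CONF" ]; then
    install_schema "$NUTCH_CONF" "$SOLR_CONF"
else
    echo "set NUTCH_CONF and SOLR_CONF to existing directories"
fi
```

Solr has to be restarted after the schema changes for the new field definitions to take effect.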




On 09.05.2012 12:32, Lewis John Mcgibbney wrote:
Which schema are you using with your Solr server?

On Wed, May 9, 2012 at 11:17 AM, Stephan Kristyn <[email protected]> wrote:

Also, entering

java -jar post.jar *.xml

on RHEL6, I get:

INFO: [] webapp=/solr path=/update params={} status=400 QTime=42
SimplePostTool: FATAL: Solr returned an error #400 ERROR: [doc=GB18030TEST] unknown field 'name'

Thanks,
Stephan
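
The error above names a field ('name') that the active schema apparently does not define. A minimal sketch for checking which fields a schema.xml declares, assuming each field element sits on a single line. The helper name schema_has_field and the SCHEMA variable are made up for illustration. The default path mirrors the Solr directory mentioned in this thread:

```shell
#!/bin/sh
# Sketch: check whether schema.xml declares a given field.
# schema_has_field and SCHEMA are illustrative names (assumptions).

schema_has_field() {
    # $1: path to schema.xml, $2: field name
    # Matches <field ... name="X" ...> on one line. It deliberately does
    # not match <fieldType ...> or <dynamicField ...> elements.
    grep -Eq "<field[[:space:]][^>]*name=\"$2\"" "$1"
}

SCHEMA="${SCHEMA:-apache-solr-3.6.0/example/solr/conf/schema.xml}"

if [ -f "$SCHEMA" ]; then
    for f in name text; do
        if schema_has_field "$SCHEMA" "$f"; then
            echo "field '$f' is defined"
        else
            echo "field '$f' is NOT defined"
        fi
    done
else
    echo "schema not found: $SCHEMA"
fi
```

If a field is reported as not defined, documents or queries that use it will presumably be rejected with a 400 like the one above.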





On 09.05.2012 12:11, Stephan Kristyn wrote:
Hi,

after installing Nutch and Solr I get a

    HTTP ERROR 400

    Problem accessing /solr/select/. Reason:

        undefined field text

    Powered by Jetty://

Any ideas how to fix this?

Thanks,
Stephan

--

stephan
kristyn
partner operations manager

"The Internet? Is that thing still around?" - Homer Simpson

[email protected]
direct +49 (0)89 231 97 207    mobile +49 (0) 162 28899 02

yahoo! deutschland gmbh theresienhoehe 12, munich, 80339, germany
phone (408) 349 3300    fax (408) 349 3301





--
Lewis
