Re: PluginRuntimeException ClassNotFound for ParseFilter plugin in Nutch 2.2 ?

2013-06-12 Thread Tony Mullins
Hi Tejas,

I am following this example: https://github.com/veggen/nutch-element-selector.
I have now tried it, without any changes, against a fresh copy of the
Nutch 2.2 source.

Attached is my patch (change set) against the fresh Nutch 2.2 source.
Kindly review it and please let me know if I am missing something.

Thanks,
Tony


On Thu, Jun 13, 2013 at 11:19 AM, Tejas Patil wrote:

> Weird. I would like to have a quick peek into your changes. Maybe you are
> doing something wrong which is hard to predict and figure out by asking
> bunch of questions to you over email. Can you attach a patch file of your
> changes ? Please remove the fluff from it and only keep the bare essential
> things in the patch. Also, if you are working for some company, make sure
> that you attaching some code here should not be against your organisational
> policy.
>
> Thanks,
> Tejas
>
> > On Wed, Jun 12, 2013 at 11:03 PM, Tony Mullins wrote:
>
> > I have done this all. Created my plugin's ivy.xml , plugin.xml ,
> build,xml
> > . Added the entry in nutch-site.xml and src>plugin>build.xml.
> > But I am still getting "PluginRuntimeException:
> > java.lang.ClassNotFoundException"
> >
> >
> > Is there any other configuration that I am missing or its Nutch 2.2
> issues
> > ?
> >
> > Thanks,
> > Tony.
> >
> >
> > > On Thu, Jun 13, 2013 at 1:09 AM, Tejas Patil wrote:
> >
> > > Here is the relevant wiki page:
> > > http://wiki.apache.org/nutch/WritingPluginExample
> > >
> > > Although its old, I think that it will help.
> > >
> > >
> > > On Wed, Jun 12, 2013 at 1:01 PM, Sebastian Nagel <
> > > wastl.na...@googlemail.com
> > > > wrote:
> > >
> > > > Hi Tony,
> > > >
> > > > you have to "register" your plugin in
> > > >  src/plugin/build.xml
> > > >
> > > > Does your
> > > >  src/plugin/myplugin/plugin.xml
> > > > properly propagate jar file,
> > > > extension point and implementing class?
> > > >
> > > > And, finally, you have to add your plugin
> > > > to the property plugin.includes in nutch-site.xml
> > > >
> > > > Cheers,
> > > > Sebastian
> > > >
> > > > On 06/12/2013 07:48 PM, Tony Mullins wrote:
> > > > > Hi,
> > > > >
> > > > > I am trying simple ParseFilter plugin in Nutch 2.2. And I can build
> > it
> > > > and
> > > > > also the src>plugin>build.xml successfully. But its .jar file is
> not
> > > > being
> > > > > created in my runtime>local>plugins>myplugin directory.
> > > > >
> > > > > And on running
> > > > > "bin/nutch parsechecker http://www.google.nl";
> > > > >  I get this error " java.lang.RuntimeException:
> > > > > org.apache.nutch.plugin.PluginRuntimeException:
> > > > > java.lang.ClassNotFoundException:
> > > > > com.xyz.nutch.selector.HtmlElementSelectorFilter"
> > > > >
> > > > > If I go to MyNutch2.2Source/build/myplugin , I can see plugin's jar
> > > with
> > > > > test & classes directory created there. If I copy .jar  from here
> and
> > > > paste
> > > > > it to my runtime>local>plugins>myplugin directory with plugin.xml
> > file
> > > > then
> > > > > too I get the same exception of class not found.
> > > > >
> > > > > I have not made any changes in src>plugin>build-plugin.xml.
> > > > >
> > > > > Could you please guide me that what is I am doing wrong here ?
> > > > >
> > > > > Thanks,
> > > > > Tony
> > > > >
> > > >
> > > >
> > >
> >
>
Index: conf/gora.properties
===
--- conf/gora.properties(revision 1492208)
+++ conf/gora.properties(working copy)
@@ -20,10 +20,10 @@
 # Default SqlStore properties #
 ###
 
-gora.sqlstore.jdbc.driver=org.hsqldb.jdbc.JDBCDriver
-gora.sqlstore.jdbc.url=jdbc:hsqldb:hsql://localhost/nutchtest
-gora.sqlstore.jdbc.user=sa
-gora.sqlstore.jdbc.password=
+# gora.sqlstore.jdbc.driver=org.hsqldb.jdbc.JDBCDriver
+# gora.sqlstore.jdbc.url=jdbc:hsqldb:hsql://localhost/nutchtest
+# gora.sqlstore.jdbc.user=sa
+# gora.sqlstore.jdbc.password=
 
 
 # Default AvroStore properties #
@@ -60,7 +60,8 @@
 # CassandraStore properties #
 #
 
-# gora.cassandrastore.servers=localhost:9160
+ gora.cassandrastore.servers=localhost:9160
+ gora.datastore.default=org.apache.gora.cassandra.store.CassandraStore
 
 ###
 # MemStore properties #
Index: conf/nutch-default.xml
===
--- conf/nutch-default.xml  (revision 1492208)
+++ conf/nutch-default.xml  (working copy)
@@ -60,7 +60,7 @@
 
 
   <name>http.agent.name</name>
-  <value></value>
+  <value>MyIYCrawler</value>
   <description>HTTP 'User-Agent' request header. MUST NOT be empty - 
   please set this to a single word uniquely related to your organization.
 
@@ -79,7 +79,7 @@
 
 
   <name>http.robots.agents</name>
-  <value>*</value>
+  <value>MyIYCrawler</value>
   <description>The agent strings we'll look for in robots.txt files,
   comma-separated, in decreasing order of precedence. You should
   put the value of http.agent.name as the first agent name, and keep the
@@ -823,7 +823,7 @@
 
 
   <name>plugin.folders</name>
-  <value>plug

tstamp and date field -- future dates???

2013-06-12 Thread James Sullivan
Is anybody else having an issue with future dates showing up in tstamp and
date fields in Solr with the more recent 2.x builds?

This is a known issue
(NUTCH-1475), and
it seemed to have been fixed for a while without extra patching, but it has
resurfaced some time in the last six months or so. Not a big deal, as the
workaround is simple, but I would be interested to know if there were changes
in the fetchtime/tstamp area in that time frame. I am running a non-stock
Nutch (using AdaptiveFetchSchedule instead of the default, etc.), so the
problem may well just be with my particular configuration.


Re: PluginRuntimeException ClassNotFound for ParseFilter plugin in Nutch 2.2 ?

2013-06-12 Thread Tejas Patil
Weird. I would like to have a quick peek at your changes. Maybe you are
doing something wrong which is hard to predict and figure out by asking a
bunch of questions over email. Can you attach a patch file of your
changes? Please remove the fluff from it and keep only the bare essentials
in the patch. Also, if you are working for a company, make sure
that attaching code here is not against your organisational
policy.

Thanks,
Tejas

On Wed, Jun 12, 2013 at 11:03 PM, Tony Mullins wrote:

> I have done this all. Created my plugin's ivy.xml , plugin.xml , build,xml
> . Added the entry in nutch-site.xml and src>plugin>build.xml.
> But I am still getting "PluginRuntimeException:
> java.lang.ClassNotFoundException"
>
>
> Is there any other configuration that I am missing or its Nutch 2.2 issues
> ?
>
> Thanks,
> Tony.
>
>
> On Thu, Jun 13, 2013 at 1:09 AM, Tejas Patil wrote:
>
> > Here is the relevant wiki page:
> > http://wiki.apache.org/nutch/WritingPluginExample
> >
> > Although its old, I think that it will help.
> >
> >
> > On Wed, Jun 12, 2013 at 1:01 PM, Sebastian Nagel <
> > wastl.na...@googlemail.com
> > > wrote:
> >
> > > Hi Tony,
> > >
> > > you have to "register" your plugin in
> > >  src/plugin/build.xml
> > >
> > > Does your
> > >  src/plugin/myplugin/plugin.xml
> > > properly propagate jar file,
> > > extension point and implementing class?
> > >
> > > And, finally, you have to add your plugin
> > > to the property plugin.includes in nutch-site.xml
> > >
> > > Cheers,
> > > Sebastian
> > >
> > > On 06/12/2013 07:48 PM, Tony Mullins wrote:
> > > > Hi,
> > > >
> > > > I am trying simple ParseFilter plugin in Nutch 2.2. And I can build
> it
> > > and
> > > > also the src>plugin>build.xml successfully. But its .jar file is not
> > > being
> > > > created in my runtime>local>plugins>myplugin directory.
> > > >
> > > > And on running
> > > > "bin/nutch parsechecker http://www.google.nl";
> > > >  I get this error " java.lang.RuntimeException:
> > > > org.apache.nutch.plugin.PluginRuntimeException:
> > > > java.lang.ClassNotFoundException:
> > > > com.xyz.nutch.selector.HtmlElementSelectorFilter"
> > > >
> > > > If I go to MyNutch2.2Source/build/myplugin , I can see plugin's jar
> > with
> > > > test & classes directory created there. If I copy .jar  from here and
> > > paste
> > > > it to my runtime>local>plugins>myplugin directory with plugin.xml
> file
> > > then
> > > > too I get the same exception of class not found.
> > > >
> > > > I have not made any changes in src>plugin>build-plugin.xml.
> > > >
> > > > Could you please guide me that what is I am doing wrong here ?
> > > >
> > > > Thanks,
> > > > Tony
> > > >
> > >
> > >
> >
>


Re: PluginRuntimeException ClassNotFound for ParseFilter plugin in Nutch 2.2 ?

2013-06-12 Thread Tony Mullins
I have done all of this: created my plugin's ivy.xml, plugin.xml and
build.xml, and added the entry in nutch-site.xml and src/plugin/build.xml.
But I am still getting "PluginRuntimeException:
java.lang.ClassNotFoundException".


Is there any other configuration that I am missing, or is this a Nutch 2.2 issue?

Thanks,
Tony.


On Thu, Jun 13, 2013 at 1:09 AM, Tejas Patil wrote:

> Here is the relevant wiki page:
> http://wiki.apache.org/nutch/WritingPluginExample
>
> Although its old, I think that it will help.
>
>
> On Wed, Jun 12, 2013 at 1:01 PM, Sebastian Nagel <
> wastl.na...@googlemail.com
> > wrote:
>
> > Hi Tony,
> >
> > you have to "register" your plugin in
> >  src/plugin/build.xml
> >
> > Does your
> >  src/plugin/myplugin/plugin.xml
> > properly propagate jar file,
> > extension point and implementing class?
> >
> > And, finally, you have to add your plugin
> > to the property plugin.includes in nutch-site.xml
> >
> > Cheers,
> > Sebastian
> >
> > On 06/12/2013 07:48 PM, Tony Mullins wrote:
> > > Hi,
> > >
> > > I am trying simple ParseFilter plugin in Nutch 2.2. And I can build it
> > and
> > > also the src>plugin>build.xml successfully. But its .jar file is not
> > being
> > > created in my runtime>local>plugins>myplugin directory.
> > >
> > > And on running
> > > "bin/nutch parsechecker http://www.google.nl";
> > >  I get this error " java.lang.RuntimeException:
> > > org.apache.nutch.plugin.PluginRuntimeException:
> > > java.lang.ClassNotFoundException:
> > > com.xyz.nutch.selector.HtmlElementSelectorFilter"
> > >
> > > If I go to MyNutch2.2Source/build/myplugin , I can see plugin's jar
> with
> > > test & classes directory created there. If I copy .jar  from here and
> > paste
> > > it to my runtime>local>plugins>myplugin directory with plugin.xml file
> > then
> > > too I get the same exception of class not found.
> > >
> > > I have not made any changes in src>plugin>build-plugin.xml.
> > >
> > > Could you please guide me that what is I am doing wrong here ?
> > >
> > > Thanks,
> > > Tony
> > >
> >
> >
>


Re: Installing Nutch1.6 on Windows7

2013-06-12 Thread Lewis John Mcgibbney
Hi Andrea,
Please describe the problem here. There is no detail about what is
actually going wrong.
Thanks

On Wednesday, June 12, 2013, Andrea Lanzoni  wrote:
> Hi everyone, I am a newcomer to Nutch and Solr and, after studying the
literature available on the web, I tried to install them on _Windows 7_.
>
> I have not been able to follow the few instructions on the Apache wiki,
nor could I find a guide updated to Nutch 1.6, only ones for older versions.
> I tried following guides for older versions on the web but never succeeded
with the installation, often because of differences between what the guide
said and what I saw on screen.
>
> I followed the steps by installing:
> - Tomcat
> - Java JDK 7
> - Cygwin, Nutch 1.6 and Solr 4
>
> Everything went apparently smoothly and I copied Nutch and Solr into:
> C:\cygwin\home\apache-nutch-1.6-bin
> and
> C:\cygwin\home\solr-4.2.0\solr-4.2.0
>
> The two folders jdk1.7.0_21 and jre7 are within the Java folder in the
Programs directory.
>
> I apologize for my dumbness, but I couldn't find how to manage it. If
somebody has a clear and detailed step-by-step guide to follow for
installing Nutch 1.6 and Solr 4, I would be very grateful.
> Thanks in advance.
> Andrea Lanzoni
>

-- 
*Lewis*


Re: PluginRuntimeException ClassNotFound for ParseFilter plugin in Nutch 2.2 ?

2013-06-12 Thread Tejas Patil
Here is the relevant wiki page:
http://wiki.apache.org/nutch/WritingPluginExample

Although it's old, I think it will help.


On Wed, Jun 12, 2013 at 1:01 PM, Sebastian Nagel  wrote:

> Hi Tony,
>
> you have to "register" your plugin in
>  src/plugin/build.xml
>
> Does your
>  src/plugin/myplugin/plugin.xml
> properly propagate jar file,
> extension point and implementing class?
>
> And, finally, you have to add your plugin
> to the property plugin.includes in nutch-site.xml
>
> Cheers,
> Sebastian
>
> On 06/12/2013 07:48 PM, Tony Mullins wrote:
> > Hi,
> >
> > I am trying simple ParseFilter plugin in Nutch 2.2. And I can build it
> and
> > also the src>plugin>build.xml successfully. But its .jar file is not
> being
> > created in my runtime>local>plugins>myplugin directory.
> >
> > And on running
> > "bin/nutch parsechecker http://www.google.nl";
> >  I get this error " java.lang.RuntimeException:
> > org.apache.nutch.plugin.PluginRuntimeException:
> > java.lang.ClassNotFoundException:
> > com.xyz.nutch.selector.HtmlElementSelectorFilter"
> >
> > If I go to MyNutch2.2Source/build/myplugin , I can see plugin's jar with
> > test & classes directory created there. If I copy .jar  from here and
> paste
> > it to my runtime>local>plugins>myplugin directory with plugin.xml file
> then
> > too I get the same exception of class not found.
> >
> > I have not made any changes in src>plugin>build-plugin.xml.
> >
> > Could you please guide me that what is I am doing wrong here ?
> >
> > Thanks,
> > Tony
> >
>
>


Re: IndexWriter Plugin Workflow

2013-06-12 Thread AC Nutch
Ah that makes a lot of sense! I will go ahead and open a Jira issue. Thanks
for the reply!

Alex


On Wed, Jun 12, 2013 at 3:50 PM, Sebastian Nagel  wrote:

> Hi,
>
> > I'm writing a custom IndexWriter and I had some questions on the
> execution
> > workflow.
> Have a look at NUTCH-1527 and NUTCH-1541.
>
> >
> > I notice that when I run my index writer plugin the following happens:
> >
> > - the describe String is printed
> > - the .open method is called once
> > - the .write method is called for every NutchDocument
> > - the .close method is called
> > - the .open method is called
> with argument "name" = "commit"
> > - the .commit method is called
> > - the .close method is called again
> >
> > This in most cases seems fine, however I'm not totally clear on what the
> > .update or the .delete methods would be used. What is the "expected" use
> > for these?
> Intuitively, update resp. delete documents which are already in the index
> Delete is used, e.g., to be sure that 404 documents are definitely removed
> from a Solr index.
> Update is actually not used. It may be useful for index end-points which
> support field-level updates to update only some fields (e.g. score/boost
> and anchor texts which depend on many documents and are permanently
> changing).
>
> But you are definitively right. The interface o.a.n.indexer.IndexWriter
> should provide good documentation for all required methods. Feel free
> to open a jira.
>
> > As a possibly related question, is it possible to change the workflow of
> > the plugin (without editing Nutch source beyond the plugin)?
>
> Hardly. You have some control what is done by the command-line options
> -noCommit
> and -deleteGone. See o.a.n.indexer.IndexingJob.run(), also shown by
>  % bin/nutch index
>
> Bye,
> Sebastian
>


Re: Suffix URLFilter not working

2013-06-12 Thread Sebastian Nagel
Hi Peter,

please do not hijack threads.

Seed URLs must be fully specified including protocol, e.g.:
 http://nutch.apache.org/
but not
 apache.org

Sebastian
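
The "no protocol" failure in the quoted log below can be reproduced directly
with java.net.URL, which is what Nutch's InjectorJob ultimately uses (via
TableUtil.reverseUrl, per the stack trace). A minimal self-contained sketch;
the sample seed lines are illustrative:

```java
import java.net.MalformedURLException;
import java.net.URL;

public class SeedLineCheck {
    // Returns null if the line parses as an absolute URL, otherwise the
    // MalformedURLException message, which is what shows up in hadoop.log.
    static String check(String line) {
        try {
            new URL(line.trim());
            return null;
        } catch (MalformedURLException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // A fully specified seed URL parses fine.
        System.out.println(check("http://nutch.apache.org/"));
        // A bare host, or an empty line (e.g. a stray blank line in
        // seed.txt), fails with "no protocol: ...".
        System.out.println(check("apache.org"));
        System.out.println(check(""));
    }
}
```

Note that an empty or whitespace-only line in the seed file produces exactly
the bare "no protocol:" seen in Peter's log, since there is no URL text to echo.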

On 06/12/2013 05:08 PM, Peter Gaines wrote:
> I have installed version 2.2 of Nutch on a CentOS machine and am using the 
> following command:
> 
> ./bin/crawl urls testcrawl "solrfolder" 2
> 
> I have attempted to use the default filter configuration and also explicitly 
> specified urlfilter-regex
> in the nutch-default.xml (without modifying the default regex filters).
> 
> However if fails each time and I can see the exception below in the 
> hadoop.log.
> 
> As you can see it looks like it has not picked up anything from the seed.txt 
> in the urls folder
> (as MalformedURLException error usually prints the url).
> This file has 1 entry with the protocol specified e.g. http://www.google.com
> 
> Can anyone shed any light on this?
> 
> Regards,
> Peter.
> 
> 2013-06-12 17:00:47,857 INFO  crawl.InjectorJob - InjectorJob: starting at 
> 2013-06-12 17:00:47
> 2013-06-12 17:00:47,858 INFO  crawl.InjectorJob - InjectorJob: Injecting 
> urlDir: urls
> 2013-06-12 17:00:48,140 INFO  crawl.InjectorJob - InjectorJob: Using class
> org.apache.gora.memory.store.MemStore as the Gora storage class.
> 2013-06-12 17:00:48,158 WARN  util.NativeCodeLoader - Unable to load 
> native-hadoop library for your
> platform... using builtin-java classes where applicable
> 2013-06-12 17:00:48,206 WARN  snappy.LoadSnappy - Snappy native library not 
> loaded
> 2013-06-12 17:00:48,344 INFO  mapreduce.GoraRecordWriter - 
> gora.buffer.write.limit = 1
> 2013-06-12 17:00:48,403 INFO  regex.RegexURLNormalizer - can't find rules for 
> scope 'inject', using
> default
> 2013-06-12 17:00:48,407 WARN  mapred.FileOutputCommitter - Output path is 
> null in cleanup
> 2013-06-12 17:00:48,407 WARN  mapred.LocalJobRunner - job_local_0001
> java.net.MalformedURLException: no protocol:
>at java.net.URL.<init>(URL.java:585)
>at java.net.URL.<init>(URL.java:482)
>at java.net.URL.<init>(URL.java:431)
>at org.apache.nutch.util.TableUtil.reverseUrl(TableUtil.java:44)
>at 
> org.apache.nutch.crawl.InjectorJob$UrlMapper.map(InjectorJob.java:162)
>at 
> org.apache.nutch.crawl.InjectorJob$UrlMapper.map(InjectorJob.java:88)
>at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
> 2013-06-12 17:00:49,300 ERROR crawl.InjectorJob - InjectorJob: 
> java.lang.RuntimeException: job
> failed: name=[testcrawl]inject urls, jobid=job_local_0001
>at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:54)
>at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:233)
>at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:251)
>at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:273)
>at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:282)
> 
> 



Re: PluginRuntimeException ClassNotFound for ParseFilter plugin in Nutch 2.2 ?

2013-06-12 Thread Sebastian Nagel
Hi Tony,

you have to "register" your plugin in
 src/plugin/build.xml

Does your
 src/plugin/myplugin/plugin.xml
properly declare the jar file,
extension point and implementing class?

And, finally, you have to add your plugin
to the property plugin.includes in nutch-site.xml

Cheers,
Sebastian
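
To make the three registration points concrete, here is an illustrative
sketch. The plugin id, extension-point and class names are adapted from this
thread and from typical Nutch plugin descriptors; treat every name here as an
assumption and verify it against your own checkout:

```xml
<!-- 1) src/plugin/build.xml: add the plugin to the deploy target -->
<ant dir="myplugin" target="deploy"/>

<!-- 2) src/plugin/myplugin/plugin.xml: declare jar, extension point, class.
     In 2.x the parse extension point is ParseFilter (HtmlParseFilter in 1.x). -->
<plugin id="myplugin" name="My Plugin" version="1.0.0" provider-name="com.xyz">
  <runtime>
    <library name="myplugin.jar">
      <export name="*"/>
    </library>
  </runtime>
  <extension id="com.xyz.nutch.selector"
             name="HTML element selector"
             point="org.apache.nutch.parse.ParseFilter">
    <implementation id="HtmlElementSelectorFilter"
                    class="com.xyz.nutch.selector.HtmlElementSelectorFilter"/>
  </extension>
</plugin>

<!-- 3) conf/nutch-site.xml: include the plugin id in plugin.includes -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|myplugin</value>
</property>
```

A ClassNotFoundException at runtime usually means one of these three pieces is
missing or inconsistent (e.g. the `library name` does not match the built jar,
or the plugin id is absent from plugin.includes).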

On 06/12/2013 07:48 PM, Tony Mullins wrote:
> Hi,
> 
> I am trying simple ParseFilter plugin in Nutch 2.2. And I can build it and
> also the src>plugin>build.xml successfully. But its .jar file is not being
> created in my runtime>local>plugins>myplugin directory.
> 
> And on running
> "bin/nutch parsechecker http://www.google.nl";
>  I get this error " java.lang.RuntimeException:
> org.apache.nutch.plugin.PluginRuntimeException:
> java.lang.ClassNotFoundException:
> com.xyz.nutch.selector.HtmlElementSelectorFilter"
> 
> If I go to MyNutch2.2Source/build/myplugin , I can see plugin's jar with
> test & classes directory created there. If I copy .jar  from here and paste
> it to my runtime>local>plugins>myplugin directory with plugin.xml file then
> too I get the same exception of class not found.
> 
> I have not made any changes in src>plugin>build-plugin.xml.
> 
> Could you please guide me that what is I am doing wrong here ?
> 
> Thanks,
> Tony
> 



Re: IndexWriter Plugin Workflow

2013-06-12 Thread Sebastian Nagel
Hi,

> I'm writing a custom IndexWriter and I had some questions on the execution
> workflow.
Have a look at NUTCH-1527 and NUTCH-1541.

> 
> I notice that when I run my index writer plugin the following happens:
> 
> - the describe String is printed
> - the .open method is called once
> - the .write method is called for every NutchDocument
> - the .close method is called
> - the .open method is called
with argument "name" = "commit"
> - the .commit method is called
> - the .close method is called again
> 
> This in most cases seems fine, however I'm not totally clear on what the
> .update or the .delete methods would be used. What is the "expected" use
> for these?
Intuitively, they update resp. delete documents which are already in the index.
Delete is used, e.g., to make sure that 404 documents are definitely removed
from a Solr index.
Update is actually not used. It may be useful for index end-points which
support field-level updates, to update only some fields (e.g. score/boost
and anchor texts, which depend on many documents and are permanently changing).

But you are definitely right: the interface o.a.n.indexer.IndexWriter
should provide good documentation for all required methods. Feel free
to open a Jira issue.

> As a possibly related question, is it possible to change the workflow of
> the plugin (without editing Nutch source beyond the plugin)?

Hardly. You have some control over what is done via the command-line options -noCommit
and -deleteGone. See o.a.n.indexer.IndexingJob.run(), also shown by
 % bin/nutch index

Bye,
Sebastian


PluginRuntimeException ClassNotFound for ParseFilter plugin in Nutch 2.2 ?

2013-06-12 Thread Tony Mullins
Hi,

I am trying a simple ParseFilter plugin in Nutch 2.2. I can build it, and
src/plugin/build.xml, successfully, but its .jar file is not being
created in my runtime/local/plugins/myplugin directory.

And on running
"bin/nutch parsechecker http://www.google.nl"
I get this error: "java.lang.RuntimeException:
org.apache.nutch.plugin.PluginRuntimeException:
java.lang.ClassNotFoundException:
com.xyz.nutch.selector.HtmlElementSelectorFilter"

If I go to MyNutch2.2Source/build/myplugin, I can see the plugin's jar with
the test & classes directories created there. If I copy the .jar from here
and paste it into my runtime/local/plugins/myplugin directory along with the
plugin.xml file, I still get the same ClassNotFound exception.

I have not made any changes in src/plugin/build-plugin.xml.

Could you please tell me what I am doing wrong here?

Thanks,
Tony
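
For later readers, the usual way to get a plugin jar into
runtime/local/plugins is to rebuild the runtime from the source root rather
than copying jars by hand. An illustrative transcript (plugin directory name
is an assumption):

```
# Rebuild the whole runtime; registered plugins are deployed into
# runtime/local/plugins/<plugin-id>/
ant runtime

# Check that the jar and plugin.xml were actually deployed:
ls runtime/local/plugins/myplugin/
```

If the plugin directory is missing here, the plugin is most likely not
registered in src/plugin/build.xml, so the deploy step never runs for it.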


IndexWriter Plugin Workflow

2013-06-12 Thread AC Nutch
Hi,

I'm writing a custom IndexWriter and I had some questions on the execution
workflow.

I notice that when I run my index writer plugin the following happens:

- the describe String is printed
- the .open method is called once
- the .write method is called for every NutchDocument
- the .close method is called
- the .open method is called
- the .commit method is called
- the .close method is called again

This in most cases seems fine; however, I'm not totally clear on what the
.update or .delete methods would be used for. What is the "expected" use
for these?

As a possibly related question, is it possible to change the workflow of
the plugin (without editing Nutch source beyond the plugin)?

Thanks,

Alex
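
For what it's worth, the observed sequence can be written down as a
self-contained mock. This is not Nutch's real o.a.n.indexer.IndexWriter
interface (signatures are simplified, and the first open's name argument is an
assumption); it only records the call order described above:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for the writer lifecycle, NOT Nutch's real interface.
interface LifecycleWriter {
    void open(String name);
    void write(String doc);
    void commit();
    void close();
}

public class IndexLifecycleDemo {
    static final List<String> calls = new ArrayList<>();

    static class RecordingWriter implements LifecycleWriter {
        public void open(String name) { calls.add("open:" + name); }
        public void write(String doc) { calls.add("write:" + doc); }
        public void commit()          { calls.add("commit"); }
        public void close()           { calls.add("close"); }
    }

    public static void main(String[] args) {
        LifecycleWriter w = new RecordingWriter();
        // Phase 1: the indexing job writes every NutchDocument.
        w.open("index");
        w.write("doc1");
        w.write("doc2");
        w.close();
        // Phase 2: a second open/commit/close pass, with name = "commit",
        // as observed in the email above.
        w.open("commit");
        w.commit();
        w.close();
        System.out.println(String.join(",", calls));
    }
}
```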


Re: Suffix URLFilter not working

2013-06-12 Thread Peter Gaines
I have installed version 2.2 of Nutch on a CentOS machine and am using the 
following command:


./bin/crawl urls testcrawl "solrfolder" 2

I have attempted to use the default filter configuration and also to 
explicitly specify urlfilter-regex

in nutch-default.xml (without modifying the default regex filters).

However, it fails each time and I can see the exception below in the 
hadoop.log.


As you can see, it looks like it has not picked up anything from the seed.txt 
in the urls folder

(as the MalformedURLException error usually prints the url).
This file has one entry, with the protocol specified, e.g. http://www.google.com

Can anyone shed any light on this?

Regards,
Peter.

2013-06-12 17:00:47,857 INFO  crawl.InjectorJob - InjectorJob: starting at 
2013-06-12 17:00:47
2013-06-12 17:00:47,858 INFO  crawl.InjectorJob - InjectorJob: Injecting 
urlDir: urls
2013-06-12 17:00:48,140 INFO  crawl.InjectorJob - InjectorJob: Using class 
org.apache.gora.memory.store.MemStore as the Gora storage class.
2013-06-12 17:00:48,158 WARN  util.NativeCodeLoader - Unable to load 
native-hadoop library for your platform... using builtin-java classes where 
applicable
2013-06-12 17:00:48,206 WARN  snappy.LoadSnappy - Snappy native library not 
loaded
2013-06-12 17:00:48,344 INFO  mapreduce.GoraRecordWriter - 
gora.buffer.write.limit = 1
2013-06-12 17:00:48,403 INFO  regex.RegexURLNormalizer - can't find rules 
for scope 'inject', using default
2013-06-12 17:00:48,407 WARN  mapred.FileOutputCommitter - Output path is 
null in cleanup

2013-06-12 17:00:48,407 WARN  mapred.LocalJobRunner - job_local_0001
java.net.MalformedURLException: no protocol:
   at java.net.URL.<init>(URL.java:585)
   at java.net.URL.<init>(URL.java:482)
   at java.net.URL.<init>(URL.java:431)
   at org.apache.nutch.util.TableUtil.reverseUrl(TableUtil.java:44)
   at 
org.apache.nutch.crawl.InjectorJob$UrlMapper.map(InjectorJob.java:162)
   at 
org.apache.nutch.crawl.InjectorJob$UrlMapper.map(InjectorJob.java:88)

   at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
   at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
   at 
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
2013-06-12 17:00:49,300 ERROR crawl.InjectorJob - InjectorJob: 
java.lang.RuntimeException: job failed: name=[testcrawl]inject urls, 
jobid=job_local_0001
   at 
org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:54)

   at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:233)
   at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:251)
   at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:273)
   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
   at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:282)




Re: Suffix URLFilter not working

2013-06-12 Thread Bai Shen
Turns out it was because I had a copy of the default file sitting in the
directory I was calling nutch from.

Once I removed that it correctly found my copy in the conf directory.
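
For later readers, a sketch of what a reject-mode urlfilter-suffix.txt can
look like. The suffix list is purely illustrative, and the exact mode/flag
syntax should be checked against the comments in the file shipped with Nutch:

```
# conf/urlfilter-suffix.txt (illustrative): in the default mode,
# URLs ending in any listed suffix are rejected.
.exe
.zip
.gz
.jpg
```

As this thread shows, also make sure no stray copy of the file in the working
directory shadows the one in conf/.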


On Wed, Jun 12, 2013 at 9:29 AM, Bai Shen  wrote:

> Doh!  I really should just read the code of things before posting.
>
> I ran the URLFilterChecker and passed it in a url that the SuffixFilter
> should flag and it still passed it.  However, if I change the url to end in
> a format that is in the default config file, it rejects the url.
>
> So it looks like the problem is that it's not loading the altered config
> file from my conf directory.  Not sure why since the regex filter correctly
> finds it's config file.
>
>
> On Wed, Jun 12, 2013 at 8:34 AM, Markus Jelsma wrote:
>> We happily use that filter just as it is shipped with Nutch. Just
>> enabling it in plugin.includes works for us. To ease testing you can use
>> the bin/nutch org.apache.nutch.net.URLFilterChecker to test filters.
>>
>>
>> -Original message-
>> > From:Bai Shen 
>> > Sent: Wed 12-Jun-2013 14:32
>> > To: user@nutch.apache.org
>> > Subject: Suffix URLFilter not working
>> >
>> > I'm dealing with a lot of file types that I don't want to index.  I was
>> > originally using the regex filter to exclude them but it was getting
>> out of
>> > hand.
>> >
>> > I changed my plugin includes from
>> >
>> > urlfilter-regex
>> >
>> > to
>> >
>> > urlfilter-(regex|suffix)
>> >
>> > I've tried using both the default urlfilter-suffix.txt file via adding
>> the
>> > extensions I don't want and making my own file that starts with + and
>> > includes the extensions I do want.
>> >
>> > Neither of these approaches seem to work.  I continue to get urls added
>> to
>> > the database which continue extensions I don't want.  Even adding a
>> > urlfilter.order section to my nutch-site.xml doesn't work.
>> >
>> > I don't see any obvious bugs in the code, so I'm a bit stumped.  Any
>> > suggestions for what else to look at?
>> >
>> > Thanks.
>> >
>>
>
>


Re: Suffix URLFilter not working

2013-06-12 Thread Bai Shen
Doh!  I really should just read the code of things before posting.

I ran the URLFilterChecker and passed it in a url that the SuffixFilter
should flag and it still passed it.  However, if I change the url to end in
a format that is in the default config file, it rejects the url.

So it looks like the problem is that it's not loading the altered config
file from my conf directory. Not sure why, since the regex filter correctly
finds its config file.


On Wed, Jun 12, 2013 at 8:34 AM, Markus Jelsma
wrote:

> We happily use that filter just as it is shipped with Nutch. Just enabling
> it in plugin.includes works for us. To ease testing you can use the
> bin/nutch org.apache.nutch.net.URLFilterChecker to test filters.
>
>
> -Original message-
> > From:Bai Shen 
> > Sent: Wed 12-Jun-2013 14:32
> > To: user@nutch.apache.org
> > Subject: Suffix URLFilter not working
> >
> > I'm dealing with a lot of file types that I don't want to index.  I was
> > originally using the regex filter to exclude them but it was getting out
> of
> > hand.
> >
> > I changed my plugin includes from
> >
> > urlfilter-regex
> >
> > to
> >
> > urlfilter-(regex|suffix)
> >
> > I've tried using both the default urlfilter-suffix.txt file via adding
> the
> > extensions I don't want and making my own file that starts with + and
> > includes the extensions I do want.
> >
> > Neither of these approaches seem to work.  I continue to get urls added
> to
> > the database which continue extensions I don't want.  Even adding a
> > urlfilter.order section to my nutch-site.xml doesn't work.
> >
> > I don't see any obvious bugs in the code, so I'm a bit stumped.  Any
> > suggestions for what else to look at?
> >
> > Thanks.
> >
>


Re: Suffix URLFilter not working

2013-06-12 Thread Bai Shen
I figured as much, which is why I'm not sure why it's not working for me.

I ran bin/nutch org.apache.nutch.net.URLFilterChecker
http://myserver/myurl and it's been thirty minutes with no results.

Is there something I should run before running that?

Thanks.
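
One likely explanation for the hang (an assumption; check URLFilterChecker in
your source tree): the checker reads URLs from standard input rather than
taking them as command-line arguments, so invoked with only a URL argument it
sits waiting for input forever. Something like the following should return
immediately; the -allCombined flag and the sample URL are illustrative:

```
echo "http://myserver/some/file.zip" | \
  bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined
```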


On Wed, Jun 12, 2013 at 8:34 AM, Markus Jelsma
wrote:

> We happily use that filter just as it is shipped with Nutch. Just enabling
> it in plugin.includes works for us. To ease testing you can use the
> bin/nutch org.apache.nutch.net.URLFilterChecker to test filters.
>
>
> -Original message-
> > From:Bai Shen 
> > Sent: Wed 12-Jun-2013 14:32
> > To: user@nutch.apache.org
> > Subject: Suffix URLFilter not working
> >
> > I'm dealing with a lot of file types that I don't want to index.  I was
> > originally using the regex filter to exclude them but it was getting out
> of
> > hand.
> >
> > I changed my plugin includes from
> >
> > urlfilter-regex
> >
> > to
> >
> > urlfilter-(regex|suffix)
> >
> > I've tried using both the default urlfilter-suffix.txt file via adding
> the
> > extensions I don't want and making my own file that starts with + and
> > includes the extensions I do want.
> >
> > Neither of these approaches seem to work.  I continue to get urls added
> to
> > the database which continue extensions I don't want.  Even adding a
> > urlfilter.order section to my nutch-site.xml doesn't work.
> >
> > I don't see any obvious bugs in the code, so I'm a bit stumped.  Any
> > suggestions for what else to look at?
> >
> > Thanks.
> >
>


Re: HTMLParseFilter equivalent in Nutch 2.2 ???

2013-06-12 Thread Tony Mullins
Thanks guys for the quick response.
If you could point me to any working example of a ParseFilter and/or
IndexFilter, that would be great.

Regards,
Tony


On Wed, Jun 12, 2013 at 5:46 PM, Julien Nioche <
lists.digitalpeb...@gmail.com> wrote:

> They are called ParseFilters in 2.x :
> http://nutch.apache.org/apidocs-2.2/org/apache/nutch/parse/ParseFilter.html
> as they are not limited to processing HTML documents since Tika generates
> SAX events for other mimetypes
>
> J.
>
>
> On 12 June 2013 13:37, Tony Mullins  wrote:
>
> > Hi,
> >
> > If I go to http://wiki.apache.org/nutch/AboutPlugins, it shows that
> > HTMLParseFilter is the extension point for adding custom metadata to HTML,
> > and its 'filter' method's signature is 'public ParseResult filter(Content
> > content, ParseResult parseResult, HTMLMetaTags metaTags, DocumentFragment
> > doc)', but that is in the 1.4 API doc.
> >
> > I am on Nutch 2.2 and there is no class named HTMLParseFilter in the v2.2
> > API doc: http://nutch.apache.org/apidocs-2.2/allclasses-noframe.html.
> >
> > So please tell me which class to use in the v2.2 API to add my custom rule
> > to extract some data from an HTML page (is it ParseFilter?) and add it to
> > the HTML metadata, so that later I can add it to my Solr index using an
> > index filter plugin.
> >
> >
> > Thanks,
> > Tony.
> >
>
>
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
>


Re: HTMLParseFilter equivalent in Nutch 2.2 ???

2013-06-12 Thread Julien Nioche
They are called ParseFilters in 2.x :
http://nutch.apache.org/apidocs-2.2/org/apache/nutch/parse/ParseFilter.html
as they are not limited to processing HTML documents since Tika generates
SAX events for other mimetypes

J.


On 12 June 2013 13:37, Tony Mullins  wrote:

> Hi,
>
> If I go to http://wiki.apache.org/nutch/AboutPlugins, it shows that
> HTMLParseFilter is the extension point for adding custom metadata to HTML,
> and its 'filter' method's signature is 'public ParseResult filter(Content
> content, ParseResult parseResult, HTMLMetaTags metaTags, DocumentFragment
> doc)', but that is in the 1.4 API doc.
>
> I am on Nutch 2.2 and there is no class named HTMLParseFilter in the v2.2
> API doc: http://nutch.apache.org/apidocs-2.2/allclasses-noframe.html.
>
> So please tell me which class to use in the v2.2 API to add my custom rule
> to extract some data from an HTML page (is it ParseFilter?) and add it to
> the HTML metadata, so that later I can add it to my Solr index using an
> index filter plugin.
>
>
> Thanks,
> Tony.
>



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


RE: HTMLParseFilter equivalent in Nutch 2.2 ???

2013-06-12 Thread Markus Jelsma
I think that for Nutch 2.x, HTMLParseFilter was renamed to ParseFilter. This is
not true for 1.x; see NUTCH-1482.

 https://issues.apache.org/jira/browse/NUTCH-1482

 
 
-Original message-
> From:Tony Mullins 
> Sent: Wed 12-Jun-2013 14:37
> To: user@nutch.apache.org
> Subject: HTMLParseFilter equivalent in Nutch 2.2 ???
> 
> Hi,
> 
> If I go to http://wiki.apache.org/nutch/AboutPlugins, it shows that
> HTMLParseFilter is the extension point for adding custom metadata to HTML,
> and its 'filter' method's signature is 'public ParseResult filter(Content
> content, ParseResult parseResult, HTMLMetaTags metaTags, DocumentFragment
> doc)', but that is in the 1.4 API doc.
> 
> I am on Nutch 2.2 and there is no class named HTMLParseFilter in the v2.2
> API doc: http://nutch.apache.org/apidocs-2.2/allclasses-noframe.html.
> 
> So please tell me which class to use in the v2.2 API to add my custom rule
> to extract some data from an HTML page (is it ParseFilter?) and add it to
> the HTML metadata, so that later I can add it to my Solr index using an
> index filter plugin.
> 
> 
> Thanks,
> Tony.
> 


Installing Nutch1.6 on Windows7

2013-06-12 Thread Andrea Lanzoni
Hi everyone, I am a newcomer to Nutch and Solr and, after studying the 
literature available on the web, I tried to install them on Windows 7.


I have not been able to follow the few instructions on the Apache wiki, 
nor could I find a guide updated for Nutch 1.6, only guides for older 
versions. I tried following those older guides but never succeeded in 
the installation, often because of differences between what the guide 
said and what I saw on screen.


I followed the steps by installing:
- Tomcat
- Java JDK 7
- Cygwin, Nutch 1.6 and Solr 4

Everything went apparently smoothly and I copied Nutch and Solr into:
C:\cygwin\home\apache-nutch-1.6-bin
and
C:\cygwin\home\solr-4.2.0\solr-4.2.0

The two folders jdk1.7.0_21 and jre7 are within the Java folder in the 
Programs directory.


I apologize for my ignorance, but I couldn't figure out how to manage it. If 
somebody has a clear and detailed step-by-step guide to follow for 
installing Nutch 1.6 and Solr 4, I would be very grateful.

Thanks in advance.
Andrea Lanzoni


HTMLParseFilter equivalent in Nutch 2.2 ???

2013-06-12 Thread Tony Mullins
Hi,

If I go to http://wiki.apache.org/nutch/AboutPlugins, it shows that
HTMLParseFilter is the extension point for adding custom metadata to HTML,
and its 'filter' method's signature is 'public ParseResult filter(Content
content, ParseResult parseResult, HTMLMetaTags metaTags, DocumentFragment
doc)', but that is in the 1.4 API doc.

I am on Nutch 2.2 and there is no class named HTMLParseFilter in the v2.2
API doc: http://nutch.apache.org/apidocs-2.2/allclasses-noframe.html.

So please tell me which class to use in the v2.2 API to add my custom rule
to extract some data from an HTML page (is it ParseFilter?) and add it to
the HTML metadata, so that later I can add it to my Solr index using an
index filter plugin.


Thanks,
Tony.


RE: Suffix URLFilter not working

2013-06-12 Thread Markus Jelsma
We happily use that filter just as it is shipped with Nutch. Just enabling it 
in plugin.includes works for us. To ease testing you can use the bin/nutch 
org.apache.nutch.net.URLFilterChecker to test filters.
 
 
-Original message-
> From:Bai Shen 
> Sent: Wed 12-Jun-2013 14:32
> To: user@nutch.apache.org
> Subject: Suffix URLFilter not working
> 
> I'm dealing with a lot of file types that I don't want to index.  I was
> originally using the regex filter to exclude them but it was getting out of
> hand.
> 
> I changed my plugin includes from
> 
> urlfilter-regex
> 
> to
> 
> urlfilter-(regex|suffix)
> 
> I've tried using both the default urlfilter-suffix.txt file via adding the
> extensions I don't want and making my own file that starts with + and
> includes the extensions I do want.
> 
> Neither of these approaches seems to work.  I continue to get URLs added to
> the database which contain extensions I don't want.  Even adding a
> urlfilter.order section to my nutch-site.xml doesn't work.
> 
> I don't see any obvious bugs in the code, so I'm a bit stumped.  Any
> suggestions for what else to look at?
> 
> Thanks.
> 


Re: Suffix URLFilter not working

2013-06-12 Thread Bai Shen
Sorry.  I forgot to mention that I'm running a 2.x release checked out a few
weeks ago.


On Wed, Jun 12, 2013 at 8:31 AM, Bai Shen  wrote:

> I'm dealing with a lot of file types that I don't want to index.  I was
> originally using the regex filter to exclude them but it was getting out of
> hand.
>
> I changed my plugin includes from
>
> urlfilter-regex
>
> to
>
> urlfilter-(regex|suffix)
>
> I've tried using both the default urlfilter-suffix.txt file via adding the
> extensions I don't want and making my own file that starts with + and
> includes the extensions I do want.
>
> Neither of these approaches seems to work.  I continue to get URLs added to
> the database which contain extensions I don't want.  Even adding a
> urlfilter.order section to my nutch-site.xml doesn't work.
>
> I don't see any obvious bugs in the code, so I'm a bit stumped.  Any
> suggestions for what else to look at?
>
> Thanks.
>


Suffix URLFilter not working

2013-06-12 Thread Bai Shen
I'm dealing with a lot of file types that I don't want to index.  I was
originally using the regex filter to exclude them but it was getting out of
hand.

I changed my plugin includes from

urlfilter-regex

to

urlfilter-(regex|suffix)

I've tried using both the default urlfilter-suffix.txt file via adding the
extensions I don't want and making my own file that starts with + and
includes the extensions I do want.

Neither of these approaches seems to work.  I continue to get URLs added to
the database which contain extensions I don't want.  Even adding a
urlfilter.order section to my nutch-site.xml doesn't work.

I don't see any obvious bugs in the code, so I'm a bit stumped.  Any
suggestions for what else to look at?

Thanks.
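For anyone debugging this: the semantics described above — listed suffixes rejected by default, and a leading '+' line flipping the file into accept-only-listed mode — can be modelled roughly in Python. This is an illustrative model of the behaviour described in this thread, not Nutch's actual SuffixURLFilter code:

```python
def parse_suffix_rules(text):
    """Parse a urlfilter-suffix style file as described in this thread:
    listed suffixes are rejected by default; a bare '+' line flips the
    filter to accept-only-listed mode. (Illustrative model, not Nutch code.)"""
    allow_mode = False
    suffixes = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        if line == "+":
            allow_mode = True
        elif line == "-":
            allow_mode = False
        else:
            suffixes.append(line.lower())
    return allow_mode, suffixes

def accept(url, allow_mode, suffixes):
    """Return True if the URL passes the filter."""
    matched = any(url.lower().endswith(s) for s in suffixes)
    return matched if allow_mode else not matched
```

Checking a few URLs against both file styles this way is a quick sanity test of the file itself before blaming the plugin wiring.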


Re: Running Nutch standalone (without Solr)

2013-06-12 Thread H. Coskun Gunduz

Hi Peter,

Yes, it's possible.

You'll need a data store (my personal recommendation is HBase).

Depending on the Nutch version you use, you can follow these tutorials:

Nutch 1.x: http://wiki.apache.org/nutch/NutchTutorial
Nutch 2: http://wiki.apache.org/nutch/Nutch2Tutorial

Happy crawling.

coskun...


On 06/12/2013 02:56 PM, Peter Gaines wrote:

Hi There,

Is it possible to run Nutch as a standalone crawler without integration with 
Solr?
I need to do this in order to do a performance comparison of its raw crawling 
functionality.

It seems like it may be possible using the bin/nutch crawl command but this is 
now deprecated.
Is there any way to do this using bin/crawl without commenting out chunks of it?

Regards,
Peter




RE: Running Nutch standalone (without Solr)

2013-06-12 Thread Markus Jelsma
Hi,

Sure, you don't need to index the data and can use the individual commands or 
the new bin/crawl script.

Cheers
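For the record, the "individual commands" route in 1.x looks roughly like the following; the crawl/ paths and the segment-selection line are placeholder assumptions to adapt to your layout:

```shell
bin/nutch inject crawl/crawldb urls
bin/nutch generate crawl/crawldb crawl/segments
SEGMENT=$(ls -d crawl/segments/* | tail -1)   # most recent segment
bin/nutch fetch "$SEGMENT"
bin/nutch parse "$SEGMENT"
bin/nutch updatedb crawl/crawldb "$SEGMENT"
# Stop here: simply never invoke solrindex, and nothing is sent to Solr.
```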

 
 
-Original message-
> From:Peter Gaines 
> Sent: Wed 12-Jun-2013 13:57
> To: user@nutch.apache.org
> Subject: Running Nutch standalone (without Solr)
> 
> Hi There,
> 
> Is it possible to run Nutch as a standalone crawler without integration with 
> Solr?
> I need to do this in order to do a performance comparison of its raw 
> crawling functionality.
> 
> It seems like it may be possible using the bin/nutch crawl command but this 
> is now deprecated.
> Is there any way to do this using bin/crawl without commenting out chunks of 
> it?
> 
> Regards,
> Peter
> 
> 


Running Nutch standalone (without Solr)

2013-06-12 Thread Peter Gaines
Hi There,

Is it possible to run Nutch as a standalone crawler without integration with 
Solr?
I need to do this in order to do a performance comparison of its raw crawling 
functionality.

It seems like it may be possible using the bin/nutch crawl command but this is 
now deprecated.
Is there any way to do this using bin/crawl without commenting out chunks of it?

Regards,
Peter



Re: Data Extraction from 100+ different sites...

2013-06-12 Thread Julien Nioche
What I usually do in cases like these is to propagate an identifier from
the seeds and use that in the HTMLParsers to determine whether they should
process a page. See the url-meta plugin for the config to propagate a
metadatum from the seeds. This way you don't need to act based on URL
patterns, but you do need one HTMLParser per key. Since you have loads of
different extraction rules, you might be better off following the other
suggestions in this thread.

Julien
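To make the "file per host/domain with XPath expressions" idea from this thread concrete, here is a minimal Python sketch of the lookup side. The hosts and XPath strings are invented for illustration, and in practice the table would be loaded from the per-host files rather than hard-coded:

```python
from urllib.parse import urlparse

# Hypothetical rule table: one XPath expression per host. Keeping this in
# data files (one per host/domain) means adding a new site never requires
# recompiling a switch statement.
XPATH_RULES = {
    "reviews.example.com": "//div[@class='review-body']",
    "movies.example.org": "//section[@id='synopsis']",
}

def rule_for(url):
    """Look up the extraction rule for a URL's host; None means
    'no rule defined, skip this page'."""
    return XPATH_RULES.get(urlparse(url).netloc)
```

A parse filter would call the equivalent of rule_for() on the page's URL and apply the returned expression, with no per-site branching in code.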


On 11 June 2013 18:45, Tony Mullins  wrote:

> I have to crawl the sub-links of these sites as well, and I have to identify
> the pattern of these sub-links' HTML layout and extract my required data.
> One example could be a movie review site: every page of this site would
> (ideally) have the same HTML layout describing a particular movie, and I
> have to extract the info from that page.
>
>
> And for this requirement I am relying on Nutch + an HtmlParse plugin!
>
>
>
> On Tue, Jun 11, 2013 at 10:34 PM, AC Nutch  wrote:
>
> > I'm a bit confused on where the requirement to *crawl* these sites comes
> > into it? From what you're saying it looks like you're just talking about
> > parsing the contents of a list of sites that you're trying to extract
> > data from. In which case there's not much of a use case for Nutch... or
> > am I confused?
> >
> >
> > On Tue, Jun 11, 2013 at 1:26 PM, Tony Mullins  > >wrote:
> >
> > > Yes, all the web pages will have a different HTML structure/layout, and
> > > I would have to identify/define an XPath expression for each one of them.
> > >
> > > But I am trying to come up with a generic output format for these XPath
> > > expressions, so whatever the XPath expression is, I want to have the
> > > result in, let's say, Field A, Field B, Field C. In some cases some of
> > > these fields could be blank as well. That way I could map them to my
> > > Solr schema properly.
> > >
> > > In this regard I was hoping to get some help or guidance from your past
> > > experiences ...
> > >
> > > Thanks,
> > > Tony
> > >
> > >
> > > On Tue, Jun 11, 2013 at 10:09 PM, AC Nutch  wrote:
> > >
> > > > Hi Tony,
> > > >
> > > > So if I understand correctly, you have 100+ web pages, each with a
> > > > totally different format, that you're trying to extract
> > > > separate/unrelated pieces of information from. If there's no
> > > > connection between any of the web pages or any of the pieces of
> > > > information that you're trying to extract, then it's pretty much
> > > > unavoidable to have to provide separate identifiers and cases for
> > > > finding each one. Markus' suggestion, I believe, is to just have a
> > > > "dictionary" file with URL as the key and the XPath expression for
> > > > the info that you want as the value. No matter what crawling/parsing
> > > > platform you're using, a solution of that sort is pretty much
> > > > unavoidable with the assumptions given.
> > > >
> > > > That being said, is there any common form that the data you're trying
> > > > to extract from these pages follows? Is there a regex that could match
> > > > it or anything else that might identify it in a common way?
> > > >
> > > > Alex
> > > >
> > > >
> > > > On Tue, Jun 11, 2013 at 12:59 PM, Tony Mullins <
> > tonymullins...@gmail.com
> > > > >wrote:
> > > >
> > > > > Hi Markus,
> > > > >
> > > > > I couldn't understand how I can avoid switch cases with your
> > > > > suggested idea.
> > > > >
> > > > > I would have one plugin which will implement HtmlParseFilter, and I
> > > > > would have to check the current URL by calling content.getUrl().
> > > > > This will all be happening in the same class, so I would have to add
> > > > > switch cases... I could add the XPath expression for each site in a
> > > > > separate file, but to decide which file to read I would still have
> > > > > to put that logic in a switch case...
> > > > >
> > > > > Please correct me if I am getting this all wrong!
> > > > >
> > > > > And I think this is a common requirement for web crawling solutions,
> > > > > to get custom data from a page... so aren't there any such Nutch
> > > > > plugins available on the web?
> > > > >
> > > > > Thanks,
> > > > > Tony.
> > > > >
> > > > >
> > > > > On Tue, Jun 11, 2013 at 7:35 PM, Markus Jelsma
> > > > > wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > Yes, you should write a plugin that has a parse filter and an
> > > > > > indexing filter. To ease maintenance you would want to have a file
> > > > > > per host/domain containing XPath expressions, far easier than
> > > > > > switch statements that need to be recompiled. The indexing filter
> > > > > > would then index the field values extracted by your parse filter.
> > > > > >
> > > > > > Cheers,
> > > > > > Markus
> > > > > >
> > > > > > -Original message-
> > > > > > > From:Tony Mullins 
> > > > > >