Fetch command returns immediately

2010-12-05 Thread Alexis
Hi,

The fetch command returns immediately without downloading any URLs, at
least in my experience. Could somebody else try fetching some URLs to
confirm whether or not I am mistaken?

I use the following process to run the command:
$ export NUTCH_ROOT=./nutch
$ svn co http://svn.apache.org/repos/asf/nutch/trunk/ $NUTCH_ROOT
$ ant
$ export NUTCH_HOME=$NUTCH_ROOT/runtime/local

Then a little bit of configuration: set the http.agent.name and
http.robots.agents properties in $NUTCH_HOME/conf/nutch-default.xml,
and configure Gora in $NUTCH_HOME/conf/gora.properties.
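Those two properties are plain Hadoop-style XML entries. A minimal sketch of what the override might look like (the values are placeholders; note that local overrides usually go in conf/nutch-site.xml rather than being edited into nutch-default.xml):

```xml
<!-- Sketch of the two properties mentioned above; values are placeholders -->
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>MyTestCrawler</value>
  </property>
  <property>
    <name>http.robots.agents</name>
    <value>MyTestCrawler,*</value>
  </property>
</configuration>
```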

Finally:
$ $NUTCH_HOME/bin/nutch inject seeds
InjectorJob: starting
InjectorJob: urlDir: seeds
InjectorJob: finished
$ $NUTCH_HOME/bin/nutch generate
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: true
GeneratorJob: done
GeneratorJob: generated batch id: 1291539079-2006862361
$ $NUTCH_HOME/bin/nutch fetch 1291539079-2006862361
FetcherJob: starting
FetcherJob : timelimit set for : -1
FetcherJob: threads: 10
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob: batchId: 1291539079-2006862361
FetcherJob: done
$

Nothing gets fetched.

Here is a relatively immediate fix:

Index: src/java/org/apache/nutch/fetcher/FetcherJob.java
===
--- src/java/org/apache/nutch/fetcher/FetcherJob.java   (revision 1042291)
+++ src/java/org/apache/nutch/fetcher/FetcherJob.java   (working copy)
@@ -174,6 +174,7 @@
     } else {
       currentJob.setNumReduceTasks(numTasks);
     }
+    currentJob.waitForCompletion(true);
     ToolUtil.recordJobStatus(null, currentJob, results);
     return results;
   }
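The underlying issue is that the Hadoop Job API distinguishes between submitting a job (asynchronous) and waiting for it (blocking); without the waitForCompletion(true) call, FetcherJob returns before any map task has run. A plain-Java stand-in of the pattern (FakeJob and its members are invented for illustration and are not the Hadoop classes):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;

// Illustrative stand-in for the Hadoop Job API: submit() returns
// immediately, waitForCompletion() blocks until the background work is
// done. FetcherJob returned right after submission, so the process
// exited before any URL was fetched.
public class JobWaitSketch {
    static class FakeJob {
        final List<String> fetched = new ArrayList<>();
        final CountDownLatch go = new CountDownLatch(1);
        final Thread worker = new Thread(() -> {
            try { go.await(); } catch (InterruptedException e) { return; }
            synchronized (fetched) { fetched.add("http://example.com/"); }
        });
        void submit() { worker.start(); }            // asynchronous, like Job.submit()
        void waitForCompletion() throws InterruptedException {
            go.countDown();
            worker.join();                            // blocking, like Job.waitForCompletion(true)
        }
        int fetchedCount() { synchronized (fetched) { return fetched.size(); } }
    }

    public static void main(String[] args) throws Exception {
        FakeJob job = new FakeJob();
        job.submit();
        System.out.println("right after submit: " + job.fetchedCount()); // 0
        job.waitForCompletion();
        System.out.println("after waiting: " + job.fetchedCount());      // 1
    }
}
```

In the real code the fix is exactly the added line in the patch: block on the job before recording its status and returning.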


Alexis


Re: Is Nutch 2.0 in good enough shape to test?

2010-12-18 Thread Alexis
> I've spent some time working on this as well. I've just put together a
>> blog entry addressing the issues I ran into. See
>> http://techvineyard.blogspot.com/2010/12/build-nutch-20.html
>>
>
> This is a great howto for Nutch 2.0. Feel free to link to it from the Wiki,
> this could be useful to others.

A link has been added on the Nutch wiki frontpage in the Nutch 2.0 section. Thanks!
I also added to the blog a small paragraph that shows how to run a
Nutch unit test from Eclipse.

> I don't remember seeing any of the issues you mentioned in the Nutch JIRA.
> If you think something is a bug, why not report it? The same applies to
> the fixes you suggested for GORA.

I've created a new issue in the Jira Gora section:
https://issues.apache.org/jira/browse/GORA-20


>
>>
>> In a nutshell, I changed three pieces in the Gora and Nutch code:
>> - flush the datastore regularly in the Hadoop RecordWriter (in
>> GoraOutputFormat)
>> - wait for Hadoop job completion in the Fetcher job
>> - ensure that the content length limit is not being exceeded in
>> protocol-http plugin (only for MySQL datastore)
>>
>
> the content length limit issue can also be fixed by modifying the gora
> schema for the MySQL backend. It would make sense to allow larger values by
> default. Could you please open a JIRA for this?

I commented on https://issues.apache.org/jira/browse/NUTCH-899 which
is the same problem. I tried to come up with a JUnit test but it is
still rather imperfect (I want to use
org.apache.nutch.util.CrawlTestUtil.getServer for it). The whole patch
is here:
https://issues.apache.org/jira/secure/attachment/12466548/httpContentLimit.patch

Alexis


Re: Is Nutch 2.0 in good enough shape to test?

2011-01-01 Thread Alexis
Hi,

First off, thanks for your feedback. It lets me know which sections
need more information so that I can update the tutorial accordingly.

> I'm trying to run the main method in org.apache.nutch.crawl.Crawler. Figured
> it would work pretty much the same as org.apache.nutch.crawl.Crawl in Nutch
> 1.2
I tested the crawl command from the bin/nutch script, which runs the
underlying org.apache.nutch.crawl.Crawler class.


> Does that work for you? Could you try and parse a few HTML files with
> parse-html?
See http://techvineyard.blogspot.com/2010/12/build-nutch-20.html#crawl
for all the details of the test. It worked for me after I patched a
few things. The patches are described throughout the blog entry and in
the new NUTCH-950 issue which, among others, reopens NUTCH-899.

Hope this helps.

Alexis.


Re: Welcome Alexis Detreglode as a Nutch Committer

2011-02-15 Thread Alexis
Dear Nutch users & developers,


Thank you for the warm welcome. I guess I'm now part of the family. I
hope it will grow exponentially with the new version.


My Nutch story started in 2007 but only lasted a few months. I
resumed it recently, in November 2010, through an exchange of comments
on Julien's blog about whether to use Nutch 1.2 or Nutch 2 (trunk) for
my personal project. The new design he suggested shifts radically from
a full-fledged solution for search applications to a minimalistic
project that does no indexing, storing, or parsing, but just crawling.
Delegating all the subsidiary tasks to more specialized projects
should allow the Nutch community to focus on its core activity:
downloading pages from the web automatically, as fast as possible,
preparing the data for analysis, while still respecting the web
standards regarding robots.


I take advantage of this announcement to urge all users, new and more
familiar alike, to migrate their crawls to this 2.0 version, even
though it is still very much an alpha. It works, provided you apply a
few patches here and there. Help will be very much appreciated,
especially in helping kickstart Gora, an embryonic project for data
access in Map/Reduce. IMHO, the high-priority items on the road map
are:
- Set up an Ivy configuration to build the first Gora release.
Currently the Nutch build fails because of the missing Gora dependency
in the Maven repository.
- Port the http-protocol plugin, which fetches content from the web,
to HttpComponents' httpcore-nio in order to leverage non-blocking I/O.
- Design and improve the Gora & Nutch unit tests.


Don't hesitate to share your own impressions of the new design, the
road map, and potential improvements. If you wish to participate,
please refer to the Nutch 2.0 section in the wiki. There are many ways
to contribute: send a message to the mailing list, create an issue on
JIRA (with or without a patch attached), update the wiki...


Give it a shot!

Alexis
http://techvineyard.blogspot.com


On Tue, Feb 15, 2011 at 6:00 PM, Markus Jelsma
 wrote:
> Great!
>
> On Tuesday 15 February 2011 17:49:40 Mattmann, Chris A (388J) wrote:
>> Hi Folks,
>>
>> A while back I nominated Alexis Detreglode for Nutch committership and PMC
>> membership. The VOTE tallies in Nutch PMC-ville have occurred and I'm happy
>> to announce that Alexis is now a Nutch committer!
>>
>> Alexis, feel free to say a little bit about yourself, and, welcome aboard!
>>
>> Cheers,
>> Chris
>>
>> ++
>> Chris Mattmann, Ph.D.
>> Senior Computer Scientist
>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> Office: 171-266B, Mailstop: 171-246
>> Email: chris.a.mattm...@nasa.gov
>> WWW:   http://sunset.usc.edu/~mattmann/
>> ++
>> Adjunct Assistant Professor, Computer Science Department
>> University of Southern California, Los Angeles, CA 90089 USA
>> ++
>
> --
> Markus Jelsma - CTO - Openindex
> http://www.linkedin.com/in/markus17
> 050-8536620 / 06-50258350
>


Re: Nutch 2 and Cassandra

2011-08-01 Thread Alexis
Hi, libthrift is a dependency of cassandra-thrift, as listed here:
http://mvnrepository.com/artifact/org.apache.cassandra/cassandra-thrift/0.8.1

During the Nutch build, you have to manually tweak the Ivy
configuration depending on your choice of Gora store, in this case
Cassandra. Basically you need to add all the dependencies listed here:
http://svn.apache.org/viewvc/incubator/gora/trunk/gora-cassandra/ivy/ivy.xml?view=markup

Let's try to add to $NUTCH_HOME/ivy/ivy.xml the following dependencies
and then let's rebuild Nutch (see attached patch):
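[The dependency elements themselves were stripped by the mail archiver, since they look like HTML tags. For illustration only, Ivy dependency declarations for the two modules named in this thread would have this shape (revisions taken from the versions mentioned in the thread; these lines are a sketch, not the original message text):]

```xml
<!-- Illustrative shape of Ivy dependency lines; not the original message text -->
<dependency org="org.apache.cassandra" name="cassandra-thrift" rev="0.8.1"/>
<dependency org="org.apache.thrift" name="libthrift" rev="0.6.1"/>
```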







$ ant clean
$ ant

In your case libthrift should now be downloaded by Ivy and then
bundled into the nutch-2.0-dev.job file. I'm not sure how
apache-cassandra and hector got included in your classpath...

Somehow we need to resolve as well:



I don't think the following 2 jars are in the default Maven repository,
so they won't be downloaded; that's why they were commented out in the
Gora Cassandra Ivy config (gora/trunk/gora-cassandra/ivy/ivy.xml).


Since the hector jar is not found, in my case I get:
~/java/workspace/Nutch/trunk/runtime/deploy$ bin/nutch inject
~/java/workspace/Nutch/seeds
11/08/01 14:18:42 INFO crawl.InjectorJob: InjectorJob: starting
11/08/01 14:18:42 INFO crawl.InjectorJob: InjectorJob: urlDir:
/home/alex/java/workspace/Nutch/seeds
11/08/01 14:18:42 INFO security.Groups: Group mapping
impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping;
cacheTimeout=30
11/08/01 14:18:42 INFO jvm.JvmMetrics: Initializing JVM Metrics with
processName=JobTracker, sessionId=
11/08/01 14:18:42 ERROR crawl.InjectorJob: InjectorJob:
org.apache.gora.util.GoraException:
java.lang.reflect.InvocationTargetException
at 
org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:110)
at 
org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:93)
at 
org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:59)
at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:243)
at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:268)
at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:282)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:292)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:192)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at 
org.apache.gora.util.ReflectionUtils.newInstance(ReflectionUtils.java:76)
at 
org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:102)
... 12 more
Caused by: java.lang.NoClassDefFoundError: me/prettyprint/hector/api/Serializer
at 
org.apache.gora.cassandra.store.CassandraStore.<init>(CassandraStore.java:60)
... 18 more
Caused by: java.lang.ClassNotFoundException:
me.prettyprint.hector.api.Serializer
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
... 19 more
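Since a NoClassDefFoundError like the one above usually means the class never made it into the .job file, it can help to inspect the bundle directly: a .job file is an ordinary zip, with dependency jars typically nested under lib/. A small sketch (the class and file names below come from this thread; the nested-jar layout is the usual Hadoop job-file convention):

```java
import java.io.*;
import java.util.zip.*;

// Sketch: check whether the class named by a ClassNotFoundException is
// actually bundled in a Hadoop .job file, either as a top-level entry
// or inside a nested jar.
public class JobJarCheck {
    static boolean hasClass(File jobFile, String className) throws IOException {
        String entry = className.replace('.', '/') + ".class";
        try (ZipFile zf = new ZipFile(jobFile)) {
            if (zf.getEntry(entry) != null) return true;
            // scan nested jars (e.g. lib/*.jar on the task classpath)
            java.util.Enumeration<? extends ZipEntry> en = zf.entries();
            while (en.hasMoreElements()) {
                ZipEntry e = en.nextElement();
                if (!e.getName().endsWith(".jar")) continue;
                try (ZipInputStream in = new ZipInputStream(zf.getInputStream(e))) {
                    for (ZipEntry inner; (inner = in.getNextEntry()) != null; )
                        if (inner.getName().equals(entry)) return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        // demo with a synthetic job file containing the hector Serializer class
        File job = File.createTempFile("nutch-2.0-dev", ".job");
        try (ZipOutputStream out = new ZipOutputStream(new FileOutputStream(job))) {
            out.putNextEntry(new ZipEntry("me/prettyprint/hector/api/Serializer.class"));
            out.closeEntry();
        }
        System.out.println(hasClass(job, "me.prettyprint.hector.api.Serializer")); // true
        System.out.println(hasClass(job, "me.prettyprint.hector.api.Missing"));    // false
    }
}
```

Pointing it at runtime/deploy/nutch-2.0-dev.job with the me.prettyprint.hector.api.Serializer class from the stack trace would confirm whether the hector jar was bundled.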




On Mon, Aug 1, 2011 at 11:59 AM, Tom Davidson  wrote:
> Hi All,
>
>
>
> I am kind of at my wit’s end here, so I am hoping someone here can help.  I
> am trying to use Nutch2 and Cassandra and I have been successful using the
> runtime/local build. I am using the Cloudera CDH3 on CentOs 5 and I do not
> want to contaminate by hadoop install by dropping in a bunch of Nutch jars,
> etc. So I am trying to use the nutch-2-dev.job jar. When I try to use the
> nutch2-dev.job jar, I get the error below.  I have double and triple checked
> the classpath and the included jars and the only jar that contains
> FieldValueMetaData is the libthrift-0.6.1.jar which has the method that is
> claimed to be missing. Any ideas?
>
>
>
> Thanks,
>
> Tom
>
>
>
>
>
>
>
>
>
> [tdavidson@nadevsan06 ~]$ bin/nutch inject urls
>
> /opt/jdk1.6.0_21/bin/java -Dproc_jar -Xmx1000m
> -Dhadoop.log.dir=/usr/lib/hadoop-0.20/logs -Dhadoop.log.file=hadoop.log
> -Dhadoop.home.dir=/usr/lib/hadoop-0.

Re: Nutch 2 and Cassandra

2011-08-01 Thread Alexis
d.JobClient: Map output records=3
11/08/01 15:17:52 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics
with processName=JobTracker, sessionId= - already initialized
11/08/01 15:17:52 INFO crawl.InjectorJob: InjectorJob: finished



This is what was added to ivy/ivy.xml:

+   
+   
+   
+   
+   
+   
+   
+   



On Mon, Aug 1, 2011 at 2:55 PM, Tom Davidson  wrote:
> I did something similar to below to add the Cassandra dependencies. Note that 
> I am getting NoSuchMethodErrors not ClassNotFoundExceptions. Can you add the 
> hector jars to your nutch job jar and see what you get? I think I am one step 
> ahead of you. BTW, I just added this line to get the hector dependency:
>
>         conf="*->default"/>
>
> -Original Message-
> From: Alexis [mailto:alexis.detregl...@gmail.com]
> Sent: Monday, August 01, 2011 2:28 PM
> To: dev@nutch.apache.org
> Subject: Re: Nutch 2 and Cassandra
>
> Hi, libthrift is a dependency of cassandra-thrift, as listed here:
> http://mvnrepository.com/artifact/org.apache.cassandra/cassandra-thrift/0.8.1
>
> During Nutch build, you have to manually tweak the Ivy configuration 
> depending on your choice of the Gora store, in this case Cassandra.
> Basically you need to add all the dependencies listed there:
> http://svn.apache.org/viewvc/incubator/gora/trunk/gora-cassandra/ivy/ivy.xml?view=markup
>
> Let's try to add to $NUTCH_HOME/ivy/ivy.xml the following dependencies and 
> then let's rebuild Nutch (see attached patch):
>         rev="0.2-incubating" conf="*->compile"/>
>         rev="0.8.1"/>
>         conf="*->*,!javadoc,!sources"/>
>         name="high-scale-lib" rev="1.1.2" conf="*->*,!javadoc,!sources"/>
>         rev="1.0" conf="*->*,!javadoc,!sources"/>
>         conf="*->*,!javadoc,!sources"/>
>
> $ ant clean
> $ ant
>
> In your case libthrift should now be downloaded by Ivy and then bundled into 
> the nutch-2.0-dev.job file. I'm not sure how apache-cassandra and hector got 
> included in your classpath...
>
> Somehow we need to resolve as well:
>         rev="0.8.1"/>
>        
>
> I don't think the following 2 jars are in the default maven repository so 
> they won't be downloaded, that's why they were commented in the Gora 
> Cassandra Ivy config (gora/trunk/gora-cassandra/ivy/ivy.xml)
>
>
> Since hector jar is not found in my case I get:
> ~/java/workspace/Nutch/trunk/runtime/deploy$ bin/nutch inject 
> ~/java/workspace/Nutch/seeds
> 11/08/01 14:18:42 INFO crawl.InjectorJob: InjectorJob: starting
> 11/08/01 14:18:42 INFO crawl.InjectorJob: InjectorJob: urlDir:
> /home/alex/java/workspace/Nutch/seeds
> 11/08/01 14:18:42 INFO security.Groups: Group mapping 
> impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping;
> cacheTimeout=30
> 11/08/01 14:18:42 INFO jvm.JvmMetrics: Initializing JVM Metrics with 
> processName=JobTracker, sessionId=
> 11/08/01 14:18:42 ERROR crawl.InjectorJob: InjectorJob:
> org.apache.gora.util.GoraException:
> java.lang.reflect.InvocationTargetException
>        at 
> org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:110)
>        at 
> org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:93)
>        at 
> org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:59)
>        at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:243)
>        at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:268)
>        at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:282)
>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
>        at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:292)
>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>        at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>        at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:597)
>        at org.apache.hadoop.util.RunJar.main(RunJar.java:192)
> Caused by: java.lang.reflect.InvocationTargetException
>        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
>        at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
>        at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
>        at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>        at 
> org.apache.gora.util.Refl

Re: InvocationTargetException with Nutch 2.0 Gora 0.2 and Cassandra 0.8.4

2011-08-30 Thread Alexis
Hi Tom,

I'm having the same issue.
The two missing jars in nutch-2.0-dev.job, cassandra-all-0.8.0.jar
and hector-core-0.8.0-1.jar, were manually uploaded into the
gora-cassandra/lib-ext SVN directory for the Gora build to work,
because for some reason they were not downloaded through Maven...



On Tue, Aug 30, 2011 at 3:30 AM, lewis john mcgibbney
 wrote:
> Hi Tom,
>
> Well this is strange...
>
> No versions of hector in Nutch 2.0/runtime/deploy/nutch-2.0-dev.job or
> /local/lib, however Gora 0.2 uses it as a dependency as per
> /gora-cassandra/lib/hector-core-0.8.0-1.jar
>
> I'm going to take some time later and try various debug combinations within
> eclipse to get to the bottom of this one.
>
> On Mon, Aug 29, 2011 at 10:00 PM, Tom Davidson 
> wrote:
>>
>> I had similar classpath issues. Are there any versions of Hector in your
>> classpath (in your Hadoop lib folder?) that are not the same as in your
>> nutch deployment jar?
>>
>>
>>
>> From: lewis john mcgibbney [mailto:lewis.mcgibb...@gmail.com]
>> Sent: Monday, August 29, 2011 1:57 PM
>> To: dev@nutch.apache.org
>> Subject: InvocationTargetException with Nutch 2.0 Gora 0.2 and Cassandra
>> 0.8.4
>>
>>
>>
>> Hi,
>>
>> I believe the following error can be attributed to the java compiler
>> finding (or not finding) more than one version of
>> me.prettyprint.hector.api.Serializer. Has anyone experienced this whilst
>> getting the above (or similar) setup configured and running?
>>
>> lewis@lewis-01:~/ASF/trunk/runtime/local$ bin/nutch inject crawldb urls
>> InjectorJob: starting
>> InjectorJob: urlDir: crawldb
>> InjectorJob: org.apache.gora.util.GoraException:
>> java.lang.reflect.InvocationTargetException
>>     at
>> org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:110)
>>     at
>> org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:93)
>>     at
>> org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:59)
>>     at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:243)
>>     at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:268)
>>     at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:282)
>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>     at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:292)
>> Caused by: java.lang.reflect.InvocationTargetException
>>     at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
>> Method)
>>     at
>> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
>>     at
>> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
>>     at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>>     at
>> org.apache.gora.util.ReflectionUtils.newInstance(ReflectionUtils.java:76)
>>     at
>> org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:103)
>>     ... 7 more
>> Caused by: java.lang.NoClassDefFoundError:
>> me/prettyprint/hector/api/Serializer
>>     at
>> org.apache.gora.cassandra.store.CassandraStore.<init>(CassandraStore.java:60)
>>     ... 13 more
>> Caused by: java.lang.ClassNotFoundException:
>> me.prettyprint.hector.api.Serializer
>>     at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
>>     at java.security.AccessController.doPrivileged(Native Method)
>>     at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
>>     at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
>>     at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
>>     at java.lang.ClassLoader.loadClass(ClassLoader.java:248)
>>     ... 14 more
>> --
>> Lewis
>
>
> --
> Lewis
>
>


Re: [VOTE] Move 2.0 out of trunk

2011-09-19 Thread Alexis
My vote is thumbs down: -1

I am only involved in Nutch 2.0, and that would put it on the back burner...

Please read these articles if you struggle with using Nutch 2.0, and give
feedback so that we can improve the doc/code/architecture.

Nutch 2.0 (trunk)
http://techvineyard.blogspot.com/2010/12/build-nutch-20.html

Gora
http://techvineyard.blogspot.com/2011/02/gora-orm-framework-for-hadoop-jobs.html

I'm glad to hear that there are at least 2 people in the community who
do business in their field and proudly use a Nutch-based crawler
together with Cassandra to store the data through Gora. That would not
have been possible with the Nutch 1.x version.

Maybe this has been widely discussed already. IMHO, crawl segments are
hard to maintain and easily lost. If that is what you want, HDFS is
what you are looking for. Even Yahoo has given up and is now using
Microsoft's updated crawl information to implement search. They use
HBase which is, by the way, Nutch 2.0 compatible.

Take a look:
http://developer.yahoo.com/events/hadoopsummit2011/agenda.html#22 (sorry,
I don't think any video of the summit is available yet, not sure why)

Alexis


On Mon, Sep 19, 2011 at 1:05 AM, Julien Nioche <
lists.digitalpeb...@gmail.com> wrote:

Here is my vote :
>
>  +1 : Shelve 2.0 and move 1.4 to trunk
>
> Julien
>
>
> On 18 September 2011 10:21, Julien Nioche 
> wrote:
>
>> Hi,
>>
>> Following the discussions [1] on the dev-list about the future of Nutch
>> 2.0, I would like to call for a vote on moving Nutch 2.0 from the trunk to a
>> separate branch, promote 1.4 to trunk and consider 2.0 as unmaintained. The
>> arguments for / against can be found in the thread I mentioned.
>>
>> The vote is open for the next 72 hours.
>>
>> [ ] +1 : Shelve 2.0 and move 1.4 to trunk
>> [] 0 : No opinion
>> [] -1 : Bad idea.  Please give justification.
>>
>> Thanks
>>
>> Julien
>>
>> [1]
>> http://www.mail-archive.com/gora-dev@incubator.apache.org/msg00483.html<http://mail-archives.apache.org/mod_mbox/nutch-dev/201109.mbox/%3cca+-fm0tj2kvuco0wwkxbj6hsamxx5819ujv7lco2vo2kd2z...@mail.gmail.com%3E>
>>
>> --
>> *
>> *Open Source Solutions for Text Engineering
>>
>> http://digitalpebble.blogspot.com/
>> http://www.digitalpebble.com
>>
>
>
>
> --
> *
> *Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
>


Re: Choosing an efficient family configuration for GORA HBase

2011-10-01 Thread Alexis
Dear Ferdy,

This mapping is user defined. It specifies where the Avro fields
required by the Nutch jobs are stored in HBase.

You can tweak the schema according to this kind of consideration by
editing the config file.

The content field, for instance, is populated by the Fetcher job
(writes), which downloads the web page, and is read by the Parser job,
which extracts the links and the metadata.

For example, these are the fields that might need to be grouped in the
same column family (but they are not) because they are all required
for the parse step, from
http://svn.apache.org/viewvc/nutch/branches/nutchgora/src/java/org/apache/nutch/parse/ParserJob.java?view=markup

static {
  FIELDS.add(WebPage.Field.STATUS);
  FIELDS.add(WebPage.Field.CONTENT);
  FIELDS.add(WebPage.Field.CONTENT_TYPE);
  FIELDS.add(WebPage.Field.SIGNATURE);
  FIELDS.add(WebPage.Field.MARKERS);
  FIELDS.add(WebPage.Field.PARSE_STATUS);
  FIELDS.add(WebPage.Field.OUTLINKS);
  FIELDS.add(WebPage.Field.METADATA);
}


It looks tricky. I've heard that, on the contrary, people usually
don't use more than 3 column families, to avoid slowing down the scans
as you mentioned. Not sure though. If you manage to optimize the
config with big improvements in the processing times, don't hesitate
to edit the wiki page...



On Fri, Sep 30, 2011 at 5:57 AM, Ferdy Galema  wrote:
> Hi,
>
> About the example GORA HBase mapping at:
> http://wiki.apache.org/nutch/GORA_HBase
>
> Are there any current developments on improving the configuration for the
> column mappings? For example, at first glance it seems that it would be more
> efficient to put the fairly big column 'content' in a completely separate
> family. This way, doing scans over the smaller columns that do not need the
> 'content' column run much faster because the scan will completely skip
> 'content' on the regionserver level. (All columns in each family are stored
> in the same file per region.)
>
> Any thoughts on this?
>
> Ferdy.
>


[jira] Commented: (NUTCH-873) Ivy configuration settings don't include Gora

2010-11-05 Thread Alexis (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12928788#action_12928788
 ] 

Alexis commented on NUTCH-873:
--

It did not work as seamlessly for me. The gora build created a 
~/.ivy2/local/org.gora directory.

Just hack this setting to build the nutch trunk right now, while 
waiting for Gora to be properly transitioned to Apache?

[nutch]$ svn diff ivy/ivysettings.xml
Index: ivy/ivysettings.xml
===
--- ivy/ivysettings.xml (revision 1031723)
+++ ivy/ivysettings.xml (working copy)
@@ -83,7 +83,7 @@
 rather than look for them online.
 -->
 
-
+
 
 
   

> Ivy configuration settings don't include Gora
> -
>
> Key: NUTCH-873
> URL: https://issues.apache.org/jira/browse/NUTCH-873
> Project: Nutch
>  Issue Type: Bug
>  Components: build
> Environment: Nutch trunk (formerly Nutchbase)
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
>Priority: Blocker
> Fix For: 2.0
>
>
> The Nutch 2.0 trunk now requires Gora, and even though it's not available in 
> any repository, we should still configure Ivy to depend on it so that the 
> build will work provided you follow the Gora instructions here:
> http://github.com/enis/gora
> I've fixed it locally and will commit an update shortly that takes care of 
> it. In order to compile Nutch trunk now (before we get Gora into a repo), 
> here are the steps (copied from http://github.com/enis/gora):
> {noformat}
> $ git clone git://github.com/enis/gora.git
> $ cd gora 
> $ ant
> {noformat}
> This will install Gora into your local Ivy repo. Then from there on out, just 
> update your Ivy resolver (or alternatively just the Nutch build post this 
> issue being resolved) and you're good.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.




[jira] Issue Comment Edited: (NUTCH-873) Ivy configuration settings don't include Gora

2010-11-05 Thread Alexis (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12928788#action_12928788
 ] 

Alexis edited comment on NUTCH-873 at 11/5/10 3:52 PM:
---

It did not work as seamlessly for me. The gora build created a 
~/.ivy2/local/org.gora directory, not with the org.apache.gora namespace.

I guess you need to move away from Github and go to Apache:
{noformat} 
$ svn co http://svn.apache.org/repos/asf/incubator/gora/trunk gora
$ cd gora
$ ant
{noformat} 

  was (Author: alexis779):
It did not work as seamlessly for me. The gora build created a 
~/.ivy2/local/org.gora directory, not with the org.apache.gora namespace.

I guess you need to move away from Github and go to Apache:
$ svn co http://svn.apache.org/repos/asf/incubator/gora/trunk gora
$ cd gora
$ ant
  
> Ivy configuration settings don't include Gora
> -
>
> Key: NUTCH-873
> URL: https://issues.apache.org/jira/browse/NUTCH-873
> Project: Nutch
>  Issue Type: Bug
>  Components: build
> Environment: Nutch trunk (formerly Nutchbase)
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
>Priority: Blocker
> Fix For: 2.0
>
>
> The Nutch 2.0 trunk now requires Gora, and even though it's not available in 
> any repository, we should still configure Ivy to depend on it so that the 
> build will work provided you follow the Gora instructions here:
> http://github.com/enis/gora
> I've fixed it locally and will commit an update shortly that takes care of 
> it. In order to compile Nutch trunk now (before we get Gora into a repo), 
> here are the steps (copied from http://github.com/enis/gora):
> {noformat}
> $ git clone git://github.com/enis/gora.git
> $ cd gora 
> $ ant
> {noformat}
> This will install Gora into your local Ivy repo. Then from there on out, just 
> update your Ivy resolver (or alternatively just the Nutch build post this 
> issue being resolved) and you're good.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-880) REST API for Nutch

2010-11-05 Thread Alexis (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12928896#action_12928896
 ] 

Alexis commented on NUTCH-880:
--

This revision introduced a bug in the nutch inject command. It now throws a 
NullPointerException.

Please take a look at:
http://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/crawl/InjectorJob.java?annotate=1028235&pathrev=1028235

Make sure the first element in the array is not null:

{noformat}
Index: src/java/org/apache/nutch/crawl/InjectorJob.java
===
--- src/java/org/apache/nutch/crawl/InjectorJob.java(revision 1031881)
+++ src/java/org/apache/nutch/crawl/InjectorJob.java(working copy)
@@ -242,6 +242,7 @@
 job.setReducerClass(Reducer.class);
 job.setNumReduceTasks(0);
 job.waitForCompletion(true);
+jobs[0] = job;

 job = new NutchJob(getConf(), "inject-p2 " + args[0]);
 StorageUtils.initMapperJob(job, FIELDS, String.class,
{noformat}


> REST API for Nutch
> --
>
> Key: NUTCH-880
> URL: https://issues.apache.org/jira/browse/NUTCH-880
> Project: Nutch
>  Issue Type: New Feature
>Affects Versions: 2.0
>Reporter: Andrzej Bialecki 
>Assignee: Andrzej Bialecki 
> Fix For: 2.0
>
> Attachments: API-2.patch, API.patch
>
>
> This issue is for discussing a REST-style API for accessing Nutch.
> Here's an initial idea:
> * I propose to use org.restlet for handling requests and returning 
> JSON/XML/whatever responses.
> * hook up all regular tools so that they can be driven via this API. This 
> would have to be an async API, since all Nutch operations take a long time to 
> execute. It follows then that we need to be able also to list running 
> operations, retrieve their current status, and possibly 
> abort/cancel/stop/suspend/resume/...? This also means that we would have to 
> potentially create & manage many threads in a servlet - AFAIK this is frowned 
> upon by J2EE purists...
> * package this in a webapp (that includes all deps, essentially nutch.job 
> content), with the restlet servlet as an entry point.
> Open issues:
> * how to implement the reading of crawl results via this API
> * should we manage only crawls that use a single configuration per webapp, or 
> should we have a notion of crawl contexts (sets of crawl configs) with CRUD 
> ops on them? this would be nice, because it would allow managing of several 
> different crawls, with different configs, in a single webapp - but it 
> complicates the implementation a lot.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-899) java.sql.BatchUpdateException: Data truncation: Data too long for column 'content' at row 1

2010-12-10 Thread Alexis (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12970336#action_12970336
 ] 

Alexis commented on NUTCH-899:
--

I ran into the exact same issue with MySQL. The blob column type can only 
store a string whose length is strictly less than 2^16 = 65536, i.e. at most 65535 bytes.
See http://dev.mysql.com/doc/refman/5.0/en/storage-requirements.html

I believe you just need to decrement http.content.limit from 65536 to 65535 in 
conf/nutch-default.xml...
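To make the off-by-one concrete, here is a hypothetical sketch (not code from Nutch or Gora; the class and method names are made up for illustration) of why a content limit of 65536 overflows a MySQL BLOB by exactly one byte:

```java
// Hypothetical sketch: a MySQL BLOB stores at most 2^16 - 1 bytes,
// so content capped at 65536 bytes can exceed the column by one byte.
public class BlobLimitCheck {

    // Maximum payload of a MySQL BLOB column.
    static final int MYSQL_BLOB_MAX = (1 << 16) - 1; // 65535

    // Would a fetched document of this size fit in the column?
    static boolean fitsInBlob(int contentLength) {
        return contentLength <= MYSQL_BLOB_MAX;
    }

    public static void main(String[] args) {
        System.out.println(fitsInBlob(65535)); // true
        System.out.println(fitsInBlob(65536)); // false: one byte too long
    }
}
```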


> java.sql.BatchUpdateException: Data truncation: Data too long for column 
> 'content' at row 1
> ---
>
> Key: NUTCH-899
> URL: https://issues.apache.org/jira/browse/NUTCH-899
> Project: Nutch
>  Issue Type: Bug
>  Components: storage
>Affects Versions: 2.0
> Environment: ubuntu 10.04
> JVM : 1.6.0_20
> nutch 2.0 (trunk)
> Mysql/HBase (0.20.6) / Hadoop(0.20.2) pseudo-distributed 
>Reporter: Faruk Berksöz
>Priority: Minor
>
> When I try to fetch a web page (e.g. 
> http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html ) with the MySQL storage 
> definition,
> I am seeing the following error in my hadoop logs (no error with HBase):
> java.io.IOException: java.sql.BatchUpdateException: Data truncation: Data too 
> long for column 'content' at row 1
> at org.gora.sql.store.SqlStore.flush(SqlStore.java:316)
> at org.gora.sql.store.SqlStore.close(SqlStore.java:163)
> at org.gora.mapreduce.GoraOutputFormat$1.close(GoraOutputFormat.java:72)
> at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:567)
> at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
> at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
> The type of the column 'content' is BLOB.
> It may be important for the next developments of Gora.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-899) java.sql.BatchUpdateException: Data truncation: Data too long for column 'content' at row 1

2010-12-18 Thread Alexis (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexis updated NUTCH-899:
-

Attachment: httpContentLimit.patch

We stick with the default gora schema for the MySQL backend, which says 
"bytes" in the Avro definition, which is translated into "blob" in MySQL. From 
src/gora/webpage.avsc:
{"name": "WebPage",
 "type": "record",
 "namespace": "org.apache.nutch.storage",
 "fields": [
{"name": "content", "type": "bytes"},
   ]
}


There is a potential bug in protocol-http. The http.content.limit value might be 
exceeded a little bit, hence the error saying that the value is too big for the 
MySQL blob column type, even though we explicitly force http.content.limit to 
the 65535 max size.

I tried to come up with a unit test for this, which is rather imperfect. Please 
see it in the attached patch. It changes http.content.limit from 65536 to 65535 
when fetching a url whose body content is big enough. The first test should see 
the error, the second should not.

Ideally we want to generate the content with a local server for the unit test 
instead of using a random internet url. That remains to be implemented in the 
test.

> java.sql.BatchUpdateException: Data truncation: Data too long for column 
> 'content' at row 1
> ---
>
> Key: NUTCH-899
> URL: https://issues.apache.org/jira/browse/NUTCH-899
> Project: Nutch
>  Issue Type: Bug
>  Components: storage
>Affects Versions: 2.0
> Environment: ubuntu 10.04
> JVM : 1.6.0_20
> nutch 2.0 (trunk)
> Mysql/HBase (0.20.6) / Hadoop(0.20.2) pseudo-distributed 
>Reporter: Faruk Berksöz
>Priority: Minor
> Attachments: httpContentLimit.patch
>
>
> When I try to fetch a web page (e.g. 
> http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html ) with the MySQL storage 
> definition,
> I am seeing the following error in my hadoop logs (no error with HBase):
> java.io.IOException: java.sql.BatchUpdateException: Data truncation: Data too 
> long for column 'content' at row 1
> at org.gora.sql.store.SqlStore.flush(SqlStore.java:316)
> at org.gora.sql.store.SqlStore.close(SqlStore.java:163)
> at org.gora.mapreduce.GoraOutputFormat$1.close(GoraOutputFormat.java:72)
> at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:567)
> at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
> at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
> The type of the column 'content' is BLOB.
> It may be important for the next developments of Gora.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (NUTCH-950) Content-Length limit, URL filter and few minor issues

2011-01-01 Thread Alexis (JIRA)
Content-Length limit, URL filter and few minor issues
-

 Key: NUTCH-950
 URL: https://issues.apache.org/jira/browse/NUTCH-950
 Project: Nutch
  Issue Type: Bug
Affects Versions: 2.0
Reporter: Alexis


1. crawl command (nutch1.patch)

The class was renamed to Crawler but the references to it were not updated.


2. URL filter (nutch2.patch)

This avoids an NPE on bogus urls whose host does not have a suffix.


3. Content-Length limit (nutch3.patch)

This is related to NUTCH-899.
The patch prevents the entire flush operation on the Gora datastore from 
crashing when the MySQL blob limit is exceeded by a few bytes. Both protocol-http 
and protocol-httpclient plugins were problematic.


4. Ivy configuration (nutch4.patch)
- Change xercesImpl and restlet versions. These 2 version changes are required. 
The first one currently makes a JUnit test crash, the second one is missing in 
default Maven repository.

- Add gora-hbase and zookeeper, which is an HBase dependency. Add the MySQL 
connector. These jars are necessary to run Gora with HBase or MySQL datastores. 
(more a suggestion than a requirement here)

- Add com.jcraft/jsch, which is a protocol-sftp plugin dependency. 


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-950) Content-Length limit, URL filter and few minor issues

2011-01-01 Thread Alexis (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexis updated NUTCH-950:
-

Attachment: nutch4.patch

> Content-Length limit, URL filter and few minor issues
> -
>
> Key: NUTCH-950
> URL: https://issues.apache.org/jira/browse/NUTCH-950
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 2.0
>    Reporter: Alexis
> Attachments: nutch1.patch, nutch2.patch, nutch3.patch, nutch4.patch
>
>
> 1. crawl command (nutch1.patch)
> The class was renamed to Crawler but the references to it were not updated.
> 2. URL filter (nutch2.patch)
> This avoids an NPE on bogus urls whose host does not have a suffix.
> 3. Content-Length limit (nutch3.patch)
> This is related to NUTCH-899.
> The patch prevents the entire flush operation on the Gora datastore from 
> crashing when the MySQL blob limit is exceeded by a few bytes. Both protocol-http 
> and protocol-httpclient plugins were problematic.
> 4. Ivy configuration (nutch4.patch)
> - Change xercesImpl and restlet versions. These 2 version changes are 
> required. The first one currently makes a JUnit test crash, the second one is 
> missing in default Maven repository.
> - Add gora-hbase and zookeeper, which is an HBase dependency. Add the MySQL 
> connector. These jars are necessary to run Gora with HBase or MySQL 
> datastores. (more a suggestion than a requirement here)
> - Add com.jcraft/jsch, which is a protocol-sftp plugin dependency. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-950) Content-Length limit, URL filter and few minor issues

2011-01-01 Thread Alexis (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexis updated NUTCH-950:
-

Attachment: nutch3.patch
nutch2.patch
nutch1.patch

> Content-Length limit, URL filter and few minor issues
> -
>
> Key: NUTCH-950
> URL: https://issues.apache.org/jira/browse/NUTCH-950
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 2.0
>    Reporter: Alexis
> Attachments: nutch1.patch, nutch2.patch, nutch3.patch, nutch4.patch
>
>
> 1. crawl command (nutch1.patch)
> The class was renamed to Crawler but the references to it were not updated.
> 2. URL filter (nutch2.patch)
> This avoids an NPE on bogus urls whose host does not have a suffix.
> 3. Content-Length limit (nutch3.patch)
> This is related to NUTCH-899.
> The patch prevents the entire flush operation on the Gora datastore from 
> crashing when the MySQL blob limit is exceeded by a few bytes. Both protocol-http 
> and protocol-httpclient plugins were problematic.
> 4. Ivy configuration (nutch4.patch)
> - Change xercesImpl and restlet versions. These 2 version changes are 
> required. The first one currently makes a JUnit test crash, the second one is 
> missing in default Maven repository.
> - Add gora-hbase and zookeeper, which is an HBase dependency. Add the MySQL 
> connector. These jars are necessary to run Gora with HBase or MySQL 
> datastores. (more a suggestion than a requirement here)
> - Add com.jcraft/jsch, which is a protocol-sftp plugin dependency. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (NUTCH-955) Ivy configuration

2011-01-10 Thread Alexis (JIRA)
Ivy configuration
-

 Key: NUTCH-955
 URL: https://issues.apache.org/jira/browse/NUTCH-955
 Project: Nutch
  Issue Type: Improvement
  Components: build
Affects Versions: 2.0
Reporter: Alexis


As mentioned in NUTCH-950, we can slightly improve the Ivy configuration to 
help setup the Gora backend more easily.
If the user does not want to stick with the default HSQL database, 
alternatives exist, such as MySQL and HBase.

org.restlet and xercesImpl versions should be changed as well.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-955) Ivy configuration

2011-01-10 Thread Alexis (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexis updated NUTCH-955:
-

Attachment: ivy.patch

In the patch, the required dependencies for MySQL and HBase are included in the 
Ivy config, but commented out. It's up to the user to choose his own backend to 
store the data.

The following 3 points are minor issues, but the fixes make Nutch play more 
nicely under Eclipse:

- The call to "nutch.root" property set in build.xml for ant should be replaced 
in src/plugin/protocol-sftp/ivy.xml by the built-in "basedir" ivy property.
- The 2.0.1 version of restlet dependency does not exist in the maven 
repository, so you want to manually change it to 2.0.0.
- The xerces (XML parser) implementation needs to be upgraded from 2.6.2 to 
2.9.1, otherwise you'll see exceptions while running a JUnit test.

> Ivy configuration
> -
>
> Key: NUTCH-955
> URL: https://issues.apache.org/jira/browse/NUTCH-955
> Project: Nutch
>  Issue Type: Improvement
>  Components: build
>Affects Versions: 2.0
>Reporter: Alexis
> Attachments: ivy.patch
>
>
> As mentioned in NUTCH-950, we can slightly improve the Ivy configuration to 
> help setup the Gora backend more easily.
> If the user does not want to stick with the default HSQL database, 
> alternatives exist, such as MySQL and HBase.
> org.restlet and xercesImpl versions should be changed as well.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Issue Comment Edited: (NUTCH-955) Ivy configuration

2011-01-10 Thread Alexis (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979525#action_12979525
 ] 

Alexis edited comment on NUTCH-955 at 1/10/11 5:27 AM:
---

In the patch, the required dependencies for MySQL and HBase are included in the 
Ivy config, but commented out as suggested in Julien's comment. It's up to the 
user to choose his own backend to store the data.

The following 3 points are minor issues, but the fixes make Nutch play more 
nicely under Eclipse:

- The call to "nutch.root" property set in build.xml for ant should be replaced 
in src/plugin/protocol-sftp/ivy.xml by the built-in "basedir" ivy property.
- The 2.0.1 version of restlet dependency does not exist in the maven 
repository, so you want to manually change it to 2.0.0.
- The xerces (XML parser) implementation needs to be upgraded from 2.6.2 to 
2.9.1, otherwise you'll see exceptions while running a JUnit test.

  was (Author: alexis779):
In the patch, the required dependencies for MySQL and HBase are included in 
the Ivy config, but commented out. It's up to the user to choose his own backend 
to store the data.

The following 3 points are minor issues, but the fixes make Nutch play more 
nicely under Eclipse:

- The call to "nutch.root" property set in build.xml for ant should be replaced 
in src/plugin/protocol-sftp/ivy.xml by the built-in "basedir" ivy property.
- The 2.0.1 version of restlet dependency does not exist in the maven 
repository, so you want to manually change it to 2.0.0.
- The xerces (XML parser) implementation needs to be upgraded from 2.6.2 to 
2.9.1, otherwise you'll see exceptions while running a JUnit test.
  
> Ivy configuration
> -
>
> Key: NUTCH-955
> URL: https://issues.apache.org/jira/browse/NUTCH-955
> Project: Nutch
>  Issue Type: Improvement
>  Components: build
>Affects Versions: 2.0
>Reporter: Alexis
> Attachments: ivy.patch
>
>
> As mentioned in NUTCH-950, we can slightly improve the Ivy configuration to 
> help setup the Gora backend more easily.
> If the user does not want to stick with the default HSQL database, 
> alternatives exist, such as MySQL and HBase.
> org.restlet and xercesImpl versions should be changed as well.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (NUTCH-950) Content-Length limit, URL filter and few minor issues

2011-01-10 Thread Alexis (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexis resolved NUTCH-950.
--

   Resolution: Fixed
Fix Version/s: 2.0

Sorry I missed the Ivy configuration file in the plugin directory.

See NUTCH-955 for the new Ivy issue.

> Content-Length limit, URL filter and few minor issues
> -
>
> Key: NUTCH-950
> URL: https://issues.apache.org/jira/browse/NUTCH-950
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 2.0
>    Reporter: Alexis
> Fix For: 2.0
>
> Attachments: nutch1.patch, nutch2.patch, nutch3.patch, nutch4.patch
>
>
> 1. crawl command (nutch1.patch)
> The class was renamed to Crawler but the references to it were not updated.
> 2. URL filter (nutch2.patch)
> This avoids an NPE on bogus urls whose host does not have a suffix.
> 3. Content-Length limit (nutch3.patch)
> This is related to NUTCH-899.
> The patch prevents the entire flush operation on the Gora datastore from 
> crashing when the MySQL blob limit is exceeded by a few bytes. Both protocol-http 
> and protocol-httpclient plugins were problematic.
> 4. Ivy configuration (nutch4.patch)
> - Change xercesImpl and restlet versions. These 2 version changes are 
> required. The first one currently makes a JUnit test crash, the second one is 
> missing in default Maven repository.
> - Add gora-hbase and zookeeper, which is an HBase dependency. Add the MySQL 
> connector. These jars are necessary to run Gora with HBase or MySQL 
> datastores. (more a suggestion than a requirement here)
> - Add com.jcraft/jsch, which is a protocol-sftp plugin dependency. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (NUTCH-956) soldindex issues

2011-01-13 Thread Alexis (JIRA)
soldindex issues


 Key: NUTCH-956
 URL: https://issues.apache.org/jira/browse/NUTCH-956
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 2.0
Reporter: Alexis


I ran into a few caveats with the solrindex command while trying to index documents.
Please refer to 
http://techvineyard.blogspot.com/2010/12/build-nutch-20.html#solrindex, which 
describes my tests.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-956) soldindex issues

2011-01-13 Thread Alexis (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexis updated NUTCH-956:
-

Attachment: solr.patch

Here are the changes:

- Avoid multiple values for the id field. (NUTCH-819)
- Allow multiple values for the tag field. Add a tld (Top Level Domain) field.
- Get the content-type from the WebPage object's member. Otherwise, you will see 
NullPointerExceptions.
- Compare strings with equals. That's pretty random, but it avoids some 
surprises.

> soldindex issues
> 
>
> Key: NUTCH-956
> URL: https://issues.apache.org/jira/browse/NUTCH-956
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: 2.0
>Reporter: Alexis
> Attachments: solr.patch
>
>
> I ran into a few caveats with the solrindex command while trying to index documents.
> Please refer to 
> http://techvineyard.blogspot.com/2010/12/build-nutch-20.html#solrindex, which 
> describes my tests.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-956) solrindex issues

2011-01-13 Thread Alexis (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexis updated NUTCH-956:
-

Summary: solrindex issues  (was: soldindex issues)

> solrindex issues
> 
>
> Key: NUTCH-956
> URL: https://issues.apache.org/jira/browse/NUTCH-956
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: 2.0
>Reporter: Alexis
> Attachments: solr.patch
>
>
> I ran into a few caveats with the solrindex command while trying to index documents.
> Please refer to 
> http://techvineyard.blogspot.com/2010/12/build-nutch-20.html#solrindex, which 
> describes my tests.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-955) Ivy configuration

2011-01-18 Thread Alexis (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12983125#action_12983125
 ] 

Alexis commented on NUTCH-955:
--

Sorry, please disregard the first bullet about nutch.root in the previous comment 
and in the patch. This would break the build: the basedir variable holds the 
plugin's base directory ("Nutch2.0/src/plugin/protocol-sftp"). I get an error in 
the build saying ivy/ivy-configurations.xml is not found with this patch.

I need to figure out how to load this nutch.root variable in the Ivy plugin in 
Eclipse.

> Ivy configuration
> -
>
> Key: NUTCH-955
> URL: https://issues.apache.org/jira/browse/NUTCH-955
> Project: Nutch
>  Issue Type: Improvement
>  Components: build
>    Affects Versions: 2.0
>Reporter: Alexis
> Attachments: ivy.patch
>
>
> As mentioned in NUTCH-950, we can slightly improve the Ivy configuration to 
> help setup the Gora backend more easily.
> If the user does not want to stick with default HSQL database, other 
> alternatives exist, such as MySQL and HBase.
> org.restlet and xercesImpl versions should be changed as well.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (NUTCH-965) Parsing takes up 100% CPU

2011-02-08 Thread Alexis (JIRA)
Parsing takes up 100% CPU
-

 Key: NUTCH-965
 URL: https://issues.apache.org/jira/browse/NUTCH-965
 Project: Nutch
  Issue Type: Improvement
  Components: parser
Reporter: Alexis


The issue you're likely to run into when parsing truncated FLV files is 
described here:
http://www.mail-archive.com/user@nutch.apache.org/msg01880.html

The parser library gets stuck in an infinite loop when it encounters corrupted 
data, caused for example by truncating big binary files at fetch time.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Updated: (NUTCH-965) Parsing takes up 100% CPU

2011-02-08 Thread Alexis (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexis updated NUTCH-965:
-

Attachment: parserJob.patch

In the parser mapper, compare the Content-Length header to the size of the 
content buffer to see if they match.

If this HTTP header is available and the file was truncated, skip the parsing 
step so that the parser does not get stuck in an infinite loop taking up all the 
CPU resources.
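As a rough sketch of that check (hypothetical class and method names; the actual change lives in the attached parserJob.patch):

```java
// Hypothetical sketch of the truncation check described above: compare
// the declared Content-Length against the bytes actually stored.
public class TruncationCheck {

    /**
     * Returns true when the stored content is shorter than the declared
     * Content-Length, i.e. the fetch was truncated and parsing should
     * be skipped.
     */
    static boolean isTruncated(String contentLengthHeader, byte[] content) {
        if (contentLengthHeader == null) {
            return false; // no header available: assume the content is complete
        }
        try {
            long declared = Long.parseLong(contentLengthHeader.trim());
            return content.length < declared;
        } catch (NumberFormatException e) {
            return false; // unparsable header: do not skip parsing
        }
    }

    public static void main(String[] args) {
        byte[] partial = new byte[63980];
        System.out.println(isTruncated("4527822", partial)); // truncated FLV: skip
        System.out.println(isTruncated("63980", partial));   // complete: parse
    }
}
```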


Before, in the logs, we would see:

{noformat}2011-02-07 14:03:34,693 WARN  parse.ParseUtil - TIMEOUT parsing 
http://downtownjoes.com/botb1.flv with 
org.apache.nutch.parse.tika.TikaParser@8c0162
2011-02-07 14:03:34,693 WARN  parse.ParseUtil - Unable to successfully parse 
content http://downtownjoes.com/botb1.flv of type video/x-flv
2011-02-07 14:04:04,725 WARN  parse.ParseUtil - TIMEOUT parsing 
http://downtownjoes.com/dtj.flv with 
org.apache.nutch.parse.tika.TikaParser@8c0162
2011-02-07 14:04:04,725 WARN  parse.ParseUtil - Unable to successfully parse 
content http://downtownjoes.com/dtj.flv of type video/x-flv
2011-02-07 14:04:34,772 WARN  parse.ParseUtil - TIMEOUT parsing 
http://downtownjoes.com/botb2.flv with 
org.apache.nutch.parse.tika.TikaParser@8c0162
2011-02-07 14:04:34,772 WARN  parse.ParseUtil - Unable to successfully parse 
content http://downtownjoes.com/botb2.flv of type video/x-flv
{noformat} 

After:

{noformat}2011-02-08 09:06:54,482 INFO  parse.ParserJob - 
http://downtownjoes.com/botb1.flv skipped. Content of size 4527822 was 
truncated to 63980
2011-02-08 09:06:54,482 INFO  parse.ParserJob - http://downtownjoes.com/dtj.flv 
skipped. Content of size 2692082 was truncated to 63980
2011-02-08 09:06:54,482 INFO  parse.ParserJob - 
http://downtownjoes.com/botb2.flv skipped. Content of size 35496213 was 
truncated to 61058
{noformat} 




> Parsing takes up 100% CPU
> -
>
> Key: NUTCH-965
> URL: https://issues.apache.org/jira/browse/NUTCH-965
> Project: Nutch
>  Issue Type: Improvement
>  Components: parser
>    Reporter: Alexis
> Attachments: parserJob.patch
>
>
> The issue you're likely to run into when parsing truncated FLV files is 
> described here:
> http://www.mail-archive.com/user@nutch.apache.org/msg01880.html
> The parser library gets stuck in an infinite loop when it encounters corrupted 
> data, caused for example by truncating big binary files at fetch time.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Updated: (NUTCH-965) Skip parsing for truncated documents

2011-02-10 Thread Alexis (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexis updated NUTCH-965:
-

Summary: Skip parsing for truncated documents  (was: Parsing takes up 100% 
CPU)

> Skip parsing for truncated documents
> 
>
> Key: NUTCH-965
> URL: https://issues.apache.org/jira/browse/NUTCH-965
> Project: Nutch
>  Issue Type: Improvement
>  Components: parser
>    Reporter: Alexis
> Attachments: parserJob.patch
>
>
> The issue you're likely to run into when parsing truncated FLV files is 
> described here:
> http://www.mail-archive.com/user@nutch.apache.org/msg01880.html
> The parser library gets stuck in an infinite loop when it encounters corrupted 
> data, caused for example by truncating big binary files at fetch time.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (NUTCH-956) solrindex issues

2011-07-12 Thread Alexis (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13064148#comment-13064148
 ] 

Alexis commented on NUTCH-956:
--

I do get the NPE when indexing this url

http://www.truveo.com/ (Content-Type header is "Content-Type: text/html; 
charset=utf-8")

without the 
src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java
 patch.

{code}
java.lang.NullPointerException
at 
org.apache.nutch.indexer.more.MoreIndexingFilter.addType(MoreIndexingFilter.java:204)
at 
org.apache.nutch.indexer.more.MoreIndexingFilter.filter(MoreIndexingFilter.java:78)
at 
org.apache.nutch.indexer.IndexingFilters.filter(IndexingFilters.java:107)
at 
org.apache.nutch.indexer.IndexerReducer.reduce(IndexerReducer.java:73)
at org.apache.nutch.indexer.IndexerReducer.reduce(IndexerReducer.java:1)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
at 
org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:566)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
at 
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
{code}
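A minimal sketch of the kind of null guard that avoids this crash (hypothetical helper, not the actual MoreIndexingFilter code from the patch):

```java
// Hypothetical sketch: guard against a missing content type before
// splitting it into primary type and sub type, as an indexing filter does.
public class ContentTypeGuard {

    static String primaryType(String contentType) {
        if (contentType == null) {
            return null; // skip instead of throwing a NullPointerException
        }
        // Drop any charset parameter, e.g. "text/html; charset=utf-8".
        int semi = contentType.indexOf(';');
        if (semi >= 0) {
            contentType = contentType.substring(0, semi);
        }
        // Keep only the part before the '/' separator.
        int slash = contentType.indexOf('/');
        return slash >= 0 ? contentType.substring(0, slash) : contentType.trim();
    }

    public static void main(String[] args) {
        System.out.println(primaryType("text/html; charset=utf-8")); // text
        System.out.println(primaryType(null)); // null, no NPE
    }
}
```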


See attached patch "solr.patch2".
If you have time, can you please go ahead and run the entire test suite as well:

1 InjectorJob
2 GeneratorJob
3 FetcherJob
4 ParserJob
5 DbUpdaterJob
6 SolrIndexerJob
(Finally check the index with 
http://localhost:8983/solr/select/?q=video&indent=on in the browser)

at least on this seed url:
- http://www.truveo.com/


Regarding the String comparison in Java, I believe people usually call 
String.equals instead of the == operator, which only compares object references.
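A minimal illustration of that point (general Java behavior, not code from the patch):

```java
// Why String.equals, not ==: the operator compares object identity,
// so two equal strings built at runtime can compare unequal.
public class StringCompare {
    public static void main(String[] args) {
        String a = "video";
        String b = new String("video"); // distinct object, same characters
        System.out.println(a == b);      // false: different references
        System.out.println(a.equals(b)); // true: same character content
    }
}
```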

> solrindex issues
> 
>
> Key: NUTCH-956
> URL: https://issues.apache.org/jira/browse/NUTCH-956
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: 2.0
>Reporter: Alexis
> Fix For: 1.4, 2.0
>
> Attachments: solr.patch
>
>
> I ran into a few caveats with the solrindex command while trying to index documents.
> Please refer to 
> http://techvineyard.blogspot.com/2010/12/build-nutch-20.html#solrindex, which 
> describes my tests.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (NUTCH-956) solrindex issues

2011-07-12 Thread Alexis (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexis updated NUTCH-956:
-

Attachment: solr.patch2

- NPE related to content-type field
- tld field in Solr schema
- string comparison in Java

> solrindex issues
> 
>
> Key: NUTCH-956
> URL: https://issues.apache.org/jira/browse/NUTCH-956
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: 2.0
>Reporter: Alexis
> Fix For: 1.4, 2.0
>
> Attachments: solr.patch, solr.patch2
>
>
> I ran into a few caveats with the solrindex command while trying to index documents.
> Please refer to 
> http://techvineyard.blogspot.com/2010/12/build-nutch-20.html#solrindex, which 
> describes my tests.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira