Hi Tom,
I have been using Nutch 1.x for the last 9 months or so and it works well
for large scale crawls up to around a billion pages. However, the inherent
lack of random access in HDFS really starts to become a burden on our hadoop
cluster when going through the whole
build for the stable branches).
The real issue behind all this is what we should do with Nutch 2.0. What
follows is only my opinion and I would love to hear what others have to say
on this subject.
Since we (actually mostly Dogacan) wrote 2.0 and delegated the storage to
Gora, the latter
to adoption for devs. This being said, Gora is a
fundamental component for Nutch 2.0, and once you get to grips with the
config and the flexibility it offers, you are presented with an
excellent setup for Nutch 2.0. I understand people's concerns and why they
would wish to hardwire
Hi,
Just for information purposes, I committed our DOAP which can now be found
under trunk svn. I have been informed by site-dev@ that the system they use
does not support more than one DOAP file; however, I thought it best to keep
it in svn for the time being. If at some point in the future Nutch 2.0
becomes the de facto Nutch release then no-one will need to recreate one.
Thanks
--
*Lewis*
--
*
*Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com
the project. I would really like to push to get this going as per [1] as I
have been trying to get various documentation updated over the last while.
This would be a reasonable milestone which would carve the way for a fully
documented Nutch 2.0 (and branch 1.4) ;0)
Would it be possible for me
and it
doesn't bother anybody that it fails all the time (and that there
isn't a nightly build for the stable branches).
The real issue behind all this is what we should do with Nutch 2.0. What
follows is only my opinion and I would love to hear what others have to say
on this subject.
Since we (actually
to it but it does not seem to be used much and
there is virtually nothing happening on it in terms of development. More
worryingly, the people who initially contributed to it are not very active
on the project (such is life, new jobs, different projects, etc...)
anymore. As for Nutch 2.0, it hasn't made any
from svn and after compiling checked all jar
files in runtime/deploy/nutch-2.0-dev.job and /runtime/local/lib.
All jar files in both libraries are identical and versions are consistent
therefore I propose we close this issue as fixed. Perhaps someone committed a
change and didn't realise
. ant report only throws a lot of
{code}
[ivy:resolve] unknown resolver maven2
{code}
messages.
different versions of the same library in nutch-2.0-dev.job and local\lib
directory
Key: NUTCH
...@gmail.com]
Sent: Tuesday, August 09, 2011 8:31 AM
To: dev@nutch.apache.org
Cc: gora-...@incubator.apache.org
Subject: Re: Future of Nutch 2.0 [Was: Unresolved dependencies
org.apache.gora#gora-hbase;0.1: not found in Nutch trunk]
Julien,
On Tue, Aug 9, 2011 at 10:10 AM, Julien Nioche
lists.digitalpeb
the last while.
This would be a reasonable milestone which would carve the way for a fully
documented Nutch 2.0 (and branch 1.4) ;0)
Would it be possible for me to invoke a small conversation on this topic to
gather thoughts as it seems this issue has been forgotten about again.
Thank you
[1] https
Hi Lewis,
Currently the slightly (in places) dated roadmap can be found here [1]. I
was wondering if we could give this an overhaul/update, as it would give a
more robust overview of where trunk is going. Most of the points you make
are still in development, however some have been achieved and
to release this year moving
forward it is essential that this is seen to.
N.B. I moved the old Nutch 2.0 roadmap to the legacy and archive section of
the wiki in an attempt to disambiguate data and future intentions.
Thanks
[1] http://wiki.apache.org/nutch/Nutch2Roadmap
--
*Lewis*
Could someone give me step-by-step instructions on how to build Nutch
2.0 from the trunk and run it? I tried to follow this
(http://techvineyard.blogspot.com/2010/12/build-nutch-20.html), but
failed to do so as described here
(http://lucene.472066.n3.nabble.com/TestFetcher-hangs-td3091057.html).
I've spent some time working on this as well. I've just put together a
blog entry addressing the issues I ran into. See
http://techvineyard.blogspot.com/2010/12/build-nutch-20.html
This is a great howto for Nutch 2.0. Feel free to link to it from the Wiki,
this could be useful to others
(switching to devs)
On 12/17/10 10:18 AM, Alexis wrote:
Hi,
I've spent some time working on this as well. I've just put together a
blog entry addressing the issues I ran into. See
http://techvineyard.blogspot.com/2010/12/build-nutch-20.html
In a nutchsell, I changed three pieces in Gora and
: 1.0
TOTAL urls: 2894
status 0 (null):2894
avg score: 1.0
-Original Message-
From: Andrzej Bialecki [mailto:a...@getopt.org]
Sent: Thursday, December 16, 2010 11:36 PM
To: u...@nutch.apache.org
Subject: Re: Does Nutch 2.0 in good enough shape to test?
On 12/17/10
Hi guys,
I've summarized the steps to follow for having GORA+Hbase with Nutch 2.0 on
http://wiki.apache.org/nutch/GORA_HBase
Feel free to amend and improve as you see fit.
Please bear in mind that Nutch 2.0 is at a very early stage and is far from
being bug-proof, see in particular [1].
HTH
a
issue to track this down.
Cheers,
Enis
On Wed, Sep 8, 2010 at 1:53 PM, Julien Nioche lists.digitalpeb...@gmail.com
wrote:
Hi guys,
I've summarized the steps to follow for having GORA+Hbase with Nutch 2.0 on
http://wiki.apache.org/nutch/GORA_HBase
Feel free to amend and improve as you see
or nothing?
environments : ubuntu 10.04
JVM : 1.6.0_20
nutch 2.0 (trunk)
Mysql/HBase (0.20.6) / Hadoop(0.20.2) pseudo-distributed
Best regards,
Faruk Berksöz
.
Should I file this in nutch-jira or github/gora or nothing?
environments : ubuntu 10.04
JVM : 1.6.0_20
nutch 2.0 (trunk)
Mysql/HBase (0.20.6) / Hadoop(0.20.2) pseudo-distributed
Yes, please create a JIRA issue. Thanks!
--
Best regards,
Andrzej Bialecki
:408)
at
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
The type of the column 'content' is BLOB.
It may be important for the next developments of Gora.
Should I file this in nutch-jira or github/gora or nothing?
environments : ubuntu 10.04
JVM : 1.6.0_20
nutch
Hey All,
I have set up the latest version of Nutch from trunk and am running into a few
issues with HBase and injecting URLs. When I run the command
runtime/local/bin/nutch inject runtime/local/seed/
I get
InjectorJob: java.lang.RuntimeException: Could not create datastore
at
Hi David,
I haven't used the Hbase backend with GORA for quite some time but from what
I can remember you'll need the following things :
* conf/hbase-site.xml = this should correspond to your local configuration
* conf/gora-hbase-mapping.xml = see below
* conf/gora.properties = don't think there
instead of maintaining ours. WDYT?
Make sure all plugins in src/plugin are compatible with Nutch 2.0 and Gora
--
Key: NUTCH-874
URL: https://issues.apache.org/jira/browse/NUTCH-874
Port Webgraph to Nutch 2.0
--
Key: NUTCH-875
URL: https://issues.apache.org/jira/browse/NUTCH-875
Project: Nutch
Issue Type: New Feature
Components: linkdb
Affects Versions: 2.1
Reporter
that, I think we're good!
Cheers,
Chris
Make sure all plugins in src/plugin are compatible with Nutch 2.0 and Gora
--
Key: NUTCH-874
URL: https://issues.apache.org/jira/browse/NUTCH-874
Make sure all plugins in src/plugin are compatible with Nutch 2.0 and Gora
--
Key: NUTCH-874
URL: https://issues.apache.org/jira/browse/NUTCH-874
Project: Nutch
Issue
the dependencies managed by
Ivy. This will create a file build/org.apache.nutch-Nutch-test.html with all
the details
different versions of the same library in nutch-2.0-dev.job and local\lib
directory
Key: NUTCH-849
URL: https://issues.apache.org/jira/browse/NUTCH-849
Project
segments to the webtable. The drawbacks being that there would be a dual
storage GORA / HDFS and we'd need to keep the legacy Nutch Writable
objects.
The fetcher code is already ported in nutchbase not to use the plain files.
I doubt there would be many users who want to jump to Nutch 2.0
Nutch 2.0 webapp
Key: NUTCH-841
URL: https://issues.apache.org/jira/browse/NUTCH-841
Project: Nutch
Issue Type: Improvement
Components: web gui
Environment: Nutch 2.0
Reporter: Chris
are left with an Apache
Nutchbase branch that needs to incrementally be merged into the Nutch 2.0
trunk, which I agree with Andrzej, and Julien, is the most important part.
So, either way works fine with me, so long as we are left with an Apache
Nutchbase branch that can be merged incrementally
Hi
Can you please tell me where I can download Nutch 2.0?
--
Raghavendra Keshava Neelekani
On 2010-06-29 11:17, Raghavendra Neelekani wrote:
Hi
Can you please tell me where I can download Nutch 2.0?
Nutch 2.0 is in the planning and early development phase, so it can't be
downloaded yet. We hope to produce a working Nutch 2.0 some time in Q4 2010.
--
Best regards,
Andrzej
Hi,
On Tue, Jun 29, 2010 at 11:49, Julien Nioche
lists.digitalpeb...@gmail.com wrote:
Thanks Chris,
I already shared my thoughts on this yesterday, but I still fail to see the
advantage of keeping the details of the recent github nutchbase commits
(some of them being just upgrades to the
wholesale, either way, we are left with an Apache
Nutchbase branch that needs to incrementally be merged into the Nutch 2.0
trunk, which I agree with Andrzej, and Julien, is the most important part.
So, either way works fine with me, so long as we are left with an Apache
Nutchbase branch that can be merged
branch or we blow away the Apache Nutchbase branch and then import the
Github Nutchbase branch wholesale, either way, we are left with an Apache
Nutchbase branch that needs to incrementally be merged into the Nutch 2.0
trunk, which I agree with Andrzej, and Julien, is the most important part
On 2010-06-28 07:49, Sami Siren wrote:
One aspect that has not been discussed yet is the legal aspect.
According to
http://incubator.apache.org/ip-clearance/index.html there is a formal
process for integrating external development efforts that have
happened outside of Apache. Should we be
On 06/28/2010 10:10 AM, Andrzej Bialecki wrote:
On 2010-06-28 07:49, Sami Siren wrote:
One aspect that has not been discussed yet is the legal aspect.
According to http://incubator.apache.org/ip-clearance/index.html
there is a formal process for integrating external development
efforts that
issues there
(h) Nutch 2.0 trunk is fixed, and brought up to speed and old code is
removed. All unit tests should pass regression where it makes sense.
(i) Nutch documentation is brought up to date on wiki and checked into
SVN
(j) We roll a 2.0 release
+1
I'd be happy to do
it up to snuff.
(e) roll the version # in nutch trunk to 2.0-dev
(f) all issues in JIRA should be updated to reflect 2.0-dev fixes where
it makes sense
(g) a 2.1 version is added to mark anything that we don't want in 2.0
and we file post 2.0 issues there
(h) Nutch 2.0 trunk
is added to mark anything that we don't want in 2.0
and we file post 2.0 issues there
(h) Nutch 2.0 trunk is fixed, and brought up to speed and old code is
removed. All unit tests should pass regression where it makes sense.
(i) Nutch documentation is brought up to date on wiki and checked
Okey dokey guys, (c), (e) and (g) are done.
Julien, Doğacan, your turn on (a) and (d) and then we can all work on (e) and
(f)...
Cheers,
Chris
On 6/28/10 12:55 PM, Doğacan Güney doga...@gmail.com wrote:
On Mon, Jun 28, 2010 at 20:23, Andrzej Bialecki a...@getopt.org wrote:
On 2010-06-28