Re: [VOTE] Apache Nutch 1.1 Release Candidate #3

2010-05-11 Thread Julien Nioche
Hi Chris, -1 : I have just reported https://issues.apache.org/jira/browse/NUTCH-818 and will commit the change to SVN shortly. There is also an issue with the schema.xml for SOLR which does not play well with solrindex-mapping.xml. Will report that shortly. Julien -- DigitalPebble Ltd http://www.d

[Travel Assistance] - Applications Open for ApacheCon NA 2010

2010-05-17 Thread Julien Nioche
A message from the ApacheCon organizers, sorry for cross-posting. The Travel Assistance Committee is now taking in applications for those wanting to attend ApacheCon North America (NA) 2010, which is taking place between the 1st and 5th November in Atlanta.

Re: Crawling - File Error 404 when fetching file with an hexadecimal character in the file name.

2010-05-18 Thread Julien Nioche
Hi Michela, I tried the following command on a dummy file > > bin/nutch plugin protocol-file org.apache.nutch.protocol.file.File > file:/tmp/A.M._%28album%29_8a09.html > and got the expected results : Content-Type: text/html > Content-Length: 47067 > Last-Modified: Tue, 18 May 2010

Re: [VOTE] Apache Nutch 1.1 Release Candidate #3

2010-06-02 Thread Julien Nioche
Hi guys, Shall we push a new RC with the latest changes or deliver it directly as 1.1? J. -- DigitalPebble Ltd http://www.digitalpebble.com On 9 May 2010 01:16, Mattmann, Chris A (388J) wrote: > Hi Folks, > > I have posted an updated candidate for the Apache Nutch 1.1 release. The > source c

Re: What are the ParseStatus major codes?

2010-06-14 Thread Julien Nioche
Alex, This issue has been fixed in https://issues.apache.org/jira/browse/NUTCH-818 and should be part of the latest RC ( http://people.apache.org/~mattmann/apache-nutch-1.1 ) HTH Julien -- DigitalPebble Ltd Open Source Solutions for Te

Re: [VOTE] Apache Nutch 1.1 Release Candidate #4

2010-06-14 Thread Julien Nioche
+1 from me thought I had already done it - sorry J. On 14 June 2010 16:30, Mattmann, Chris A (388J) < chris.a.mattm...@jpl.nasa.gov> wrote: > Hey Nutch PMC’ers: > > *nudge* > > We currently have 2 PMC binding +1's on this VOTE: > > Chris Mattmann > Doğacan Güney > > Would be great to wrap up th

Re: svnpubsub for the Tika web site

2010-06-24 Thread Julien Nioche
What about doing the same for Nutch? Any reason not to? J. On 21 June 2010 15:46, Mattmann, Chris A (388J) < chris.a.mattm...@jpl.nasa.gov> wrote: > +1million > > Been wishing for this for a while! :) > > Cheers, > Chris > > > > On 6/21/10 3:02 AM, "Jukka Zitting" wrote: > > Hi, > > The PDFBox

Re: svnpubsub for the Tika web site

2010-06-24 Thread Julien Nioche
art. > Done - see https://issues.apache.org/jira/browse/NUTCH-834 and https://issues.apache.org/jira/browse/INFRA-2822 I have not moved the stuff from nutch/trunk to nutch/site though, not clear whether this was a prerequisite or not Thanks J. > > > > On 6/24/10 4:17 AM, "Julien Nioch

Re: [jira] Created: (NUTCH-834) Separate the Nutch web site from trunk

2010-06-26 Thread Julien Nioche
Hi Alex, First thanks for all the work you are putting into improving the documentation. > I've spent a little while today trying to improve the Forrest Nutch > website a little. Is that still a useful task or shall I hold off > until later? > still useful to do it now > > Are you changing ho

Re: Nutch 2.0

2010-06-27 Thread Julien Nioche
Hi, a) is GORA ASL licensed? > it is. see http://github.com/enis/gora/blob/master/LICENSE.txt > b) what's the maintenance plan for GORA? Will it continue to live in > Github? Will you guys propose it into the Apache Incubator as an ASF > project? > I confirm that the plan is definitely to mo

Re: Nutch 2.0

2010-06-27 Thread Julien Nioche
Hi guys, >>> (a) svn copy NutchBase from GitHub to the nutchbase branch in > >>> http://svn.apache.org/repos/asf/nutch/branches/nutchbase bringing the > ASF > >>> branch up to date. > >> > >> this seems like an unnecessary step. There has been an enormous amount > of > >> changes between the

Re: Nutch 2.0

2010-06-28 Thread Julien Nioche
Hi, (a) deleting svn:nutchbase > (b) svn:importing Git Nutchbase. > (c) branch current 1.2-trunk as 1.2-branch > (d) iteratively apply patches from new svn:nutchbase to trunk to bring > it up to snuff. > (e) roll the version # in nutch trunk to 2.0-dev > (f) all issues in

Re: Nutch 2.0

2010-06-28 Thread Julien Nioche
nd the 1.1 changes are already there so I suppose the info about the contributors is not vital. Or am I missing something? > > Makes sense? > > On Mon, Jun 28, 2010 at 16:45, Julien Nioche < > lists.digitalpeb...@gmail.com> wrote: > >> Hi, >> >> (a)

Re: Nutch 2.0

2010-06-29 Thread Julien Nioche
Thanks Chris, I already shared my thoughts on this yesterday, but I still fail to see the advantage of keeping the details of the recent github nutchbase commits (some of them being just upgrades to the recent changes in 1.1) in svn nutchbase knowing that the point is actually to do incremental ch

Re: Nutch 2.0

2010-06-29 Thread Julien Nioche
2010/6/29 Doğacan Güney > Hi, > > On Tue, Jun 29, 2010 at 11:49, Julien Nioche < > lists.digitalpeb...@gmail.com> wrote: > >> Thanks Chris, >> >> I already shared my thoughts on this yesterday, but I still fail to see >> the advantage of keeping the d

Update svn nutchbase - Nutch 2.0

2010-06-29 Thread Julien Nioche
Dogacan has produced a patch for svn nutchbase that brings it to the level of github. See https://issues.apache.org/jira/browse/NUTCH-650 The patch has been marked as 'licensed for inclusion in ASF work' and works fine. Any objections to this patch being committed? Thanks Dogacan for producing it

Re: Update svn nutchbase - Nutch 2.0

2010-06-30 Thread Julien Nioche
h IVY, deletion of old plugins, etc... Thanks J. On 29 June 2010 21:27, Dennis Kubes wrote: > +1 on this > > > On 06/29/2010 08:57 AM, Julien Nioche wrote: > > Dogacan has produced a patch for svn nutchbase that brings it to the level > of github. See https://issues.apache.org/j

Re: [Nutchbase] WebPage class is a generated code?

2010-07-02 Thread Julien Nioche
> > (This question is mostly to Dogacan & Enis, but I encourage anyone familiar > with the code to join the threads with [Nutchbase] - the sooner the better > ;) ). > > I'm looking at src/gora/webpage.avsc and WebPage.java & friends... > presumably the java code was autogenerated from avsc using Go

Nutch 2.0 : Design issue

2010-07-02 Thread Julien Nioche
Hi guys, You've probably seen that there has been some progress on 2.0 lately. We've updated the nutchbase svn branch with the latest developments done on Dogacan's Github i.e. using GORA as a storage layer. One of the main issues [1] I raised after using nutchbase was that : NutchBase currently

Re: Nutch 2.0 : Design issue

2010-07-02 Thread Julien Nioche
On 2 July 2010 12:22, Andrzej Bialecki wrote: > On 2010-07-02 12:42, Julien Nioche wrote: > >> Hi guys, >> >> You've probably seen that there has been some progress on 2.0 lately. >> We've >> updated the nutchbase svn branch with the latest devel

Re: Classifying pages on Nutch: plugins?

2010-07-06 Thread Julien Nioche
Hi Cesar, This can definitely be done using a custom parse plugin and an indexing plugin. We did something like this some time ago to classify adult pages using our text classification API ( http://code.google.com/p/textclassification/) which is based on SVM. Out of interest, what categories are y

Re: Parse-tika ignores too much data...

2010-07-07 Thread Julien Nioche
Ken, See https://issues.apache.org/jira/browse/TIKA-457 for a description of one of the cases found by Andrzej. There seems to be something very wrong with the way is handled, we also saw cases where it was twice in the output. J. On 7 July 2010 17:41, Ken Krugler wrote: > Hi Andrzej, > > I've

Re: Parse-tika ignores too much data...

2010-07-07 Thread Julien Nioche
Hi Ken, Thank you for your comments and analysis. We should probably modify the HTMLHandler so that it does not discard a frameset because of the bodylevel being equal to 0. I suggested earlier on the Tika list having a mechanism for specifying a custom handler via the Context, that would give us

Re: Classifying pages on Nutch: plugins?

2010-07-08 Thread Julien Nioche
Daniel, Your message is not relevant for this mailing list. If you have questions about the TC API use http://groups.google.com/group/digitalpebble instead. Thanks On 8 July 2010 01:56, dgimenes wrote: > > Julien, > > I'm in Luan's project too. > > I'd like to know if you have examples of the

Re: Nutch with classification

2010-07-08 Thread Julien Nioche
Hi, > >- Using a modified version of DmozParser to initialize crawlDB with the >URLs AND the top classification. >- Change the crawler to fetch the pages with and include two fields on >the webDB: > - One field for the classification (get from DMOZ file or classified >

Re: Build failed in Hudson: Nutch-trunk #1202

2010-07-09 Thread Julien Nioche
> BUILD SUCCESSFUL > Total time: 24 minutes 31 seconds > Publishing Javadoc > Archiving artifacts > ERROR: No artifacts found that match the file pattern > "trunk/build/*.tar.gz". Configuration error? > ERROR: 'trunk/build/*.tar.gz' doesn't match anything: 'trunk' exists but > not 'trunk/build/*.ta

Re: Merging in nutchbase

2010-07-10 Thread Julien Nioche
I agree with Andrzej that the SQL backend has to be checked and tested on nutchbase before we can start porting it to the trunk. Moreover I have raised an important design issue on the list recently (table per fetchround) which needs some changes to Gora first and must be discussed, implemented and

Re: Merging in nutchbase

2010-07-10 Thread Julien Nioche
Hi Doğacan, Thanks for the update. > While I agree with the "table per fetch" issue, I would like to postpone it > until after the merge. This issue is tricky for a couple of reasons. For > example, AFAIK, cassandra's latest released version > does not support live schema updates so you can not

Re: Merging in nutchbase

2010-07-12 Thread Julien Nioche
Hi guys, We'll probably find minor improvements / bugfixes for 1.2 as we port things from NutchBase to trunk so I'd suggest we wait a bit before releasing it. J. On 12 July 2010 14:52, Mattmann, Chris A (388J) < chris.a.mattm...@jpl.nasa.gov> wrote: > Hi Alex, > > I was thinking of making a 1.

Re: I LOVE Ivy!

2010-07-14 Thread Julien Nioche
Well, the credits should go mostly to Enis Soztutar who did the Ivy work in Nutchbase. It's quite neat, isn't it? On 14 July 2010 19:44, Mattmann, Chris A (388J) < chris.a.mattm...@jpl.nasa.gov> wrote: > OK guys, I just had to throw _major_ kudos to Julien and anyone else > involved in the Ivy in

Re: svn commit: r965815 - in /nutch/branches/nutchbase/src: java/org/apache/nutch/parse/ParseStatus.java java/org/apache/nutch/parse/ParseText.java test/org/apache/nutch/parse/TestParseText.java

2010-07-20 Thread Julien Nioche
I have made some changes to the nutchbase branch indeed. Mostly porting missing plugins to the new API but also retrofitting recent modifications from the trunk to nutchbase in order to facilitate the later merge from nutchbase to trunk. I also removed some old Nutch objects which were not a

Re: svn commit: r965815 - in /nutch/branches/nutchbase/src: java/org/apache/nutch/parse/ParseStatus.java java/org/apache/nutch/parse/ParseText.java test/org/apache/nutch/parse/TestParseText.java

2010-07-20 Thread Julien Nioche
> > > Now that you mention upgrade solutions from 1.x to 2.0 I suggest that we > open > > a JIRA to discuss this. IMHO we probably don't want to keep the 'old' > code in > > src/java when we merge but could have the code for the conversion > utilities > > and the Nutch 1.x jars in the contrib/ di

Re: svn commit: r965815 - in /nutch/branches/nutchbase/src: java/org/apache/nutch/parse/ParseStatus.java java/org/apache/nutch/parse/ParseText.java test/org/apache/nutch/parse/TestParseText.java

2010-07-20 Thread Julien Nioche
Thanks for your comments Chris > > > However we still need to address the issue raised by Dogacan i.e shall we > > provide tools to convert from 1.x structures to 2.0 and if so how shall > we > > organise it. Again - some things have been removed from NutchBase for the > sake > > of clarity but sinc

Re: Nutchbase merge strategy

2010-07-23 Thread Julien Nioche
> > Before doing so, >> let's: >> >> 1. tag current trunk as >> http://svn.apache.org/repos/asf/nutch/branches/branch-1.3 (EOL'ed won't >> be >> worked on, but nice to save). This way someone doesn't have to remember >> the >> Nutchbase rev # before the Nutchbase branch lands in the trunk. >> >> T

Re: Nutchbase merge strategy

2010-07-23 Thread Julien Nioche
On 23 July 2010 10:20, Julien Nioche wrote: > > >> Before doing so, >>> let's: >>> >>> 1. tag current trunk as >>> http://svn.apache.org/repos/asf/nutch/branches/branch-1.3 (EOL'ed won't >>> be >>> worked on, but n

Re: Build failed in Hudson: Nutch-trunk #1213

2010-07-27 Thread Julien Nioche
does anyone have any idea on how to configure the build on Hudson? Shall I open a JIRA on Infra? Jul nightly: > > BUILD SUCCESSFUL > Total time: 9 minutes 16 seconds > Publishing Javadoc > Archiving artifacts > ERROR: No artifacts found that match the file pattern > "trunk/build/*.tar.gz". Confi

Re: Build failed in Hudson: Nutch-trunk #1213

2010-07-27 Thread Julien Nioche
isting PMC member) and request Hudson Zones karma from @infra. > I’d be happy to be this guy since I do the RM’ing a lot, but it might be > nice to have someone else do it in case I get hit by a bus :) > > Cheers, > Chris > > > > On 7/26/10 10:24 PM, "Julien Nio

Re: [VOTE] Apache Nutch 1.2 Release Candidate #1

2010-08-09 Thread Julien Nioche
I reopened https://issues.apache.org/jira/browse/NUTCH-870. It would be good to fix it before releasing 1.2 On 9 August 2010 14:44, Andrzej Bialecki wrote: > On 2010-08-08 03:04, Mattmann, Chris A (388J) wrote: > >> Hi Folks, >> >> I have posted a release candidate for the Apache Nutch 1.2 relea

Re: [VOTE] Apache Nutch 1.2 Release Candidate #1

2010-08-09 Thread Julien Nioche
e to just create a new issue in JIRA and then > link your issue to the issue that you wanted to reopen. It’s just as easy > and doesn’t cause the out of sync problem. > OK, makes sense > > Cheers, > Chris > > > > On 8/9/10 7:45 AM, "Julien Nioche" wrote:

Re: Documentation options Nutch 2.0

2010-08-12 Thread Julien Nioche
Hi Alex, Andrzej opened a JIRA yesterday to discuss the various options for the documentation. Could you please paste your comments there as well? Thanks Julien On 12 August 2010 10:41, Alex McLintock wrote: > Hi Folks, > > I've been wondering about what techniques we can use to provide h

Re: When a crawl goes bad...

2010-08-16 Thread Julien Nioche
It's probably more an issue with DNS resolution than robots.txt. Even if you respect the robots.txt instructions you can still have N host or even domain names pointing to a single server. This can be avoided in Nutch by setting 'partition.url.mode' and 'fetcher.queue.mode' to 'byIP'. On 16 Augus
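As a minimal sketch of the override described above: the two property names and the value 'byIP' come from the message, while placing them in conf/nutch-site.xml and the helper `render_properties` are assumptions made for illustration. The helper only prints Hadoop-style property elements and does not touch any Nutch installation.

```python
from xml.sax.saxutils import escape

# The two property names and the value 'byIP' are taken from the message.
BY_IP_OVERRIDES = {
    "partition.url.mode": "byIP",
    "fetcher.queue.mode": "byIP",
}


def render_properties(overrides):
    """Render name/value pairs as Hadoop-style <property> elements.

    Hypothetical helper: it only builds the XML text that would be pasted
    into a site configuration file.
    """
    lines = ["<configuration>"]
    for name, value in overrides.items():
        lines.append("  <property>")
        lines.append(f"    <name>{escape(name)}</name>")
        lines.append(f"    <value>{escape(value)}</value>")
        lines.append("  </property>")
    lines.append("</configuration>")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_properties(BY_IP_OVERRIDES))
```

Both properties are set together here because the message names them as a pair; whether a given crawl needs both is something to check against the Nutch version in use.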

Re: Nutch 2.0 Help

2010-09-02 Thread Julien Nioche
Hi David, I haven't used the Hbase backend with GORA for quite some time but from what I can remember you'll need the following things : * conf/hbase-site.xml => this should correspond to your local configuration * conf/gora-hbase-mapping.xml => see below * conf/gora.properties => don't think the

Re: nutch 2.0 (trunk)

2010-09-07 Thread Julien Nioche
Hi Faruk, You can either set a lower value for the parameter http.content.limit or modify the mapping and set which should work for mysql. See the discussion on http://github.com/enis/gora/issues/closed#issue/48 HTH Julien -- Open Source Solutions for Text Engineering http://digitalpeb

Re: Nutch 2.0 Help

2010-09-08 Thread Julien Nioche
Hi guys, I've summarized the steps to follow for having GORA+Hbase with Nutch 2.0 on http://wiki.apache.org/nutch/GORA_HBase Feel free to amend and improve as you see fit. Please bear in mind that Nutch 2.0 is at a very early stage and is far from being bug-proof, see in particular [1]. HTH Ju

Re: Does Nutch 2.0 in good enough shape to test?

2010-12-17 Thread Julien Nioche
Alexis, I've spent some time working on this as well. I've just put together a > blog entry addressing the issues I ran into. See > http://techvineyard.blogspot.com/2010/12/build-nutch-20.html > This is a great howto for Nutch 2.0. Feel free to link to it from the Wiki, this could be useful to ot

Re: Release planning

2011-01-04 Thread Julien Nioche
+1 from me. I've committed today a bunch of patches which were in 1.2 but not in 1.3 (just one last one to do) but haven't compared with 2.0 Having a release based on 1.3 would be great as it would be a nice transition towards 2.0 (delegate indexing/search, dependency management with Ivy, separati

Backport to 1.3 (was: Release planning)

2011-01-05 Thread Julien Nioche
Any thoughts on this? Julien On 4 January 2011 21:44, Julien Nioche wrote: > +1 from me. I've committed today a bunch of patches which were in 1.2 but > not in 1.3 (just one last one to do) but haven't compared with 2.0 > > Having a release based on 1.3 would be great as i

Re: parsing a simple text node

2011-02-08 Thread Julien Nioche
Hi Jun, Which version of Nutch are you using and which parser? parse-html or parse-tika? julien On 8 February 2011 08:16, Jun Yang wrote: > Hi there, > > i am working on a plugin to fetch some structured information (e.g., > product price) in web pages, and I had some problem parsing the follo

Re: Nutch Parser annoyingly faulty

2011-03-04 Thread Julien Nioche
Hi Jurgen, > Since I wrote this email - which I thought got ignored by the > Nutch developers - Thanks for reporting the problem Jurgen, and sorry that you felt you were being ignored. The few active developers Nutch has contribute during their spare time, the reason why you did not get any com

Re: Build failed in Jenkins: Nutch-trunk #1433

2011-03-22 Thread Julien Nioche
On 22 March 2011 04:15, Kirby Bohling wrote: > Is there some reason this is allowed to continue to build if nobody is > going to actually get it to build successfully? I am assuming this > has something to do with the Ivy resolution of the Gora library that > isn't publicly available. > Correct

http://wiki.apache.org/nutch/Tutorial%20on%20incremental%20crawling

2011-03-27 Thread Julien Nioche
Gabriele, I think it is a good idea to have a script like this however your proposal could be improved. It currently works only on a single machine and uses commands such as mv, ls etc... which won't work on a pseudo or fully distributed cluster. You should use the 'hadoop fs' commands instead. I
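The point about replacing local commands with 'hadoop fs' can be sketched as follows. This is an illustration under stated assumptions, not the script from the wiki page: the helper names and the paths are hypothetical, and the default is a dry run that only prints the commands.

```python
import subprocess

# Local commands a single-machine script might use, mapped to the
# 'hadoop fs' subcommands that play the same role.
HADOOP_FS_EQUIVALENT = {
    "ls": "-ls",
    "mv": "-mv",
    "rm": "-rm",
    "mkdir": "-mkdir",
}


def hadoop_fs(local_cmd, *paths):
    """Build the argv for the 'hadoop fs' counterpart of a local command."""
    return ["hadoop", "fs", HADOOP_FS_EQUIVALENT[local_cmd], *paths]


def run(argv, dry_run=True):
    """Print the command; execute it only when dry_run is False."""
    print(" ".join(argv))
    if not dry_run:
        subprocess.check_call(argv)


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    run(hadoop_fs("ls", "crawl/segments"))
    run(hadoop_fs("mv", "crawl/segments/old", "crawl/archive/old"))
```

Because the same 'hadoop fs' calls resolve against whatever file system Hadoop is configured for, one script of this shape would cover local, pseudo-distributed and fully distributed setups.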

Re: http://wiki.apache.org/nutch/Tutorial%20on%20incremental%20crawling

2011-03-27 Thread Julien Nioche
> I think it is a good idea to have a script like this however your proposal >> could be improved. It currently works only on a single machine and uses >> commands such as mv, ls etc... which won't work on a pseudo or fully >> distributed cluster. You should use the 'hadoop fs' commands instead. >

Re: http://wiki.apache.org/nutch/Tutorial%20on%20incremental%20crawling

2011-03-28 Thread Julien Nioche
Hi Gabriele >> you don't need to have 2 *and *3. The hadoop commands will work on the >> local fs in a completely transparent way, it all depends on the way hadoop >> is configured. It isolates the way data are stored (local or distrib) from >> the client code i.e Nutch. By adding a separate scri

Re: All solr* commands fail in 1.3

2011-04-08 Thread Julien Nioche
See http://www.slf4j.org/faq.html#IllegalAccessError This error is caused by the static initializer of the LoggerFactory class > attempting to directly access the SINGLETON field of > org.slf4j.impl.StaticLoggerBinder. While this was allowed in SLF4J 1.5.5 > and earlier, in 1.5.6 and later the SING

Re: GORA dependency and build failures

2011-04-08 Thread Julien Nioche
Yep. 0.1 has been released and the artifacts should be available soon On Friday, 8 April 2011, Otis Gospodnetic wrote: > Hi, > > Just curious - is the plan to wait for the GORA 0.1 release to get published > somewhere (not familiar with Ivy, so I'm not sure where things need to get > published),

Re: Nutch' pom.xml

2011-04-12 Thread Julien Nioche
Someone suggested that we use an ant task to generate the pom from the Ivy files. This would be a far cleaner option than having to keep this bl***d pom.xml file in sync all the time On 12 April 2011 15:11, Markus Jelsma wrote: > Hi guys, > > I found out that pom.xml lists older dependency ver

Re: Nutch' pom.xml

2011-04-12 Thread Julien Nioche
/use/makepom.html) and remove the pom.xml from SVN? Is there anything in that pom.xml that wouldn't be generated by makepom? J. On 12 April 2011 15:24, Julien Nioche wrote: > Someone suggested that we used an ant task to generate the pom from the Ivy > files. This would be far a clea

Re: chinese token overlap bug in org.apache.nutch.summary.basic.BasicSummarizer.getSummary

2011-04-13 Thread Julien Nioche
Hi, Nutch has moved away from handling the indexing and search itself and now delegates that to SOLR as of versions 1.3 and 2.0 (both forthcoming). The issue you described won't be fixed as this part of the code has been removed. Users are encouraged to start using 1.3 and use SOLR for the indexin

Re: Nutch 1.3 release

2011-04-14 Thread Julien Nioche
There has been a large number of substantial changes with 1.3 (search delegated to SOLR, separation between local and distributed runtimes, ) and we'll need to reflect this in the documentation the site and the wiki. The good news is that a lot of this will be relevant for 2.0 as well. BTW tha

Re: Nutch 1.3 release

2011-04-14 Thread Julien Nioche
ring http://digitalpebble.blogspot.com/ http://www.digitalpebble.com On 14 April 2011 08:55, Julien Nioche wrote: > There has been a large number of substantial changes with 1.3 (search > delegated to SOLR, separation between local and distributed runtimes, ) > and we'll need to reflect this in the do

Re: [VOTE] Apache Nutch 1.3 Release Candidate #1

2011-04-24 Thread Julien Nioche
Hi Chris, Thanks for the RC. I think we should fix the 2 issues below. https://issues.apache.org/jira/browse/NUTCH-985 : bug with lastModifiedDate https://issues.apache.org/jira/browse/NUTCH-983 : port SOLRJ to 3.1 I expect many users would use the latest version of SOLR so we might as well upd

Re: Precopy http.agent properties to nutch-site

2011-04-26 Thread Julien Nioche
Hi Markus Any param overridden by the users should be in nutch-site.xml, not just http.agent, so why make an exception for it? Moreover that will not necessarily prevent people from using nutch-default.xml Maybe we could set nutch-default to readonly? Could be changed by the user but this might n

Re: SolrDedup doesn't commit

2011-04-27 Thread Julien Nioche
Hi Markus We might as well do it properly and commit in the same way as index and clean do. Thanks for all your excellent work BTW Julien On 27 April 2011 15:16, Markus Jelsma wrote: > Hi, > > Title says it all. The job doesn't send a commit while index and clean do. > The > question is wheth

Re: 1.3 RC2?

2011-04-30 Thread Julien Nioche
Hi Chris, I don't think we have finished with the dates and update of SOLR to 3.1 yet. I'll also try to do NUTCH-888 in the next couple of days. Thanks Julien On 30 April 2011 05:20, Mattmann, Chris A (388J) < chris.a.mattm...@jpl.nasa.gov> wrote:

Re: svn commit: r1099483 - in /nutch/branches/branch-1.3: ./ conf/ src/plugin/ src/plugin/parse-rss/ src/plugin/parse-tika/ src/plugin/parse-tika/sample/ src/plugin/parse-tika/src/test/org/apache/nutc

2011-05-04 Thread Julien Nioche
;-) On 4 May 2011 16:26, Mattmann, Chris A (388J) wrote: > Awww, sniffbye parse-rss! > > On May 4, 2011, at 11:20 AM, > wrote: > > > Author: jnioche > > Date: Wed May 4 15:20:00 2011 > > New Revision: 1099483 > > > > URL: http://svn.apache.org/viewvc?rev=1099483&view=rev > > Log: > > NUTC

Re: Update schema to get solrdedup working again

2011-05-05 Thread Julien Nioche
Hi Markus, Sorry for the late reply. Definitely +1 to change to Date in the schema, it is the right thing to do and it's also the right time to do it Thanks Julien On 28 April 2011 12:43, Markus Jelsma wrote: > Hi devs, > > The Solr schema must be updated as well to get dedup to work in 1.3.

Re: Usefulness of cache field

2011-05-08 Thread Julien Nioche
Would need to check in the code but I think that this field is used for storing the value of the meta tags cache-control. Since we don't do caching anymore since delegating to SOLR, this is not really useful but could be again in the future. Let's leave it as is for now and document what the field cor

Re: Return value of jobs

2011-05-09 Thread Julien Nioche
Hi Markus, > Currently the various Nutch jobs return 0 or -1 resp. indicating success or > failure. It would be convenient to have certain jobs return the number of > processed items instead of zero to make it a lot easier for shell scripts > to > fetch useful statistics. > > What would be an arg

Re: found a nutch bug

2011-05-09 Thread Julien Nioche
Hi Could you please open a JIRA with a description of the problem and attach a patch generated against the branch-1.3 with 'svn diff'? Thanks 2011/5/9 ldk_5370 > hi, > > I found a bug about class org.apache.nutch.protocol.http.HttpResponse, > HttpResponse can not get all html content for som

Re: Update schema to get solrdedup working again

2011-05-11 Thread Julien Nioche
ybe create a new issue for 1.4 and rely on Date objects everywhere then > format it properly in the SOLRWriter. We could of course do the latter now, > but since I have no time to do it in the short time and don't want to twist > your arm I'll let you decide > > > >

Collecting Nutch use cases for talk @BerlinBuzzwords

2011-05-16 Thread Julien Nioche
Hi, The title says it all. I'm searching for interesting use cases for my Nutch talk at Berlin. Do you use Nutch in an interesting way or on a particularly large scale? If you think your use case could be a good illustration of what Nutch does, please get in touch and I'll happily include it in my

Re: 1.3 RC2?

2011-05-21 Thread Julien Nioche
Hey Guys, > > WDYT? Ready for RC2 on 1.3? Got some free time tonight and in the releasing > mood :-) > > Cheers, > Chris > > On Apr 30, 2011, at 9:41 AM, Julien Nioche wrote: > > > Hi Chris, > > > > I don't think we have finished with the dates and update

Re: 1.3 RC2?

2011-05-24 Thread Julien Nioche
.3? Got some free time tonight and in the > > > releasing mood :-) > > > > > > Cheers, > > > Chris > > > > > > On Apr 30, 2011, at 9:41 AM, Julien Nioche wrote: > > > > Hi Chris, > > > > > > > > I don't thi

Re: Nutch bug - assumption of HDFS in CrawlDb.java even if using other file systems like S3

2011-05-25 Thread Julien Nioche
Viksit, Please check if this has already been reported on the JIRA and if not open a new issue (for 2.0) Thanks Julien On 25 May 2011 19:02, Viksit Gaur wrote: > [Cross posting since this might be more relevant here.] > > -- > > Hi all, > > Trying to run nutch on Elastic Mapreduce, I ran into

Re: [RESULT] [VOTE] Apache Nutch 1.3 Release Candidate #3

2011-06-08 Thread Julien Nioche
> > +1 Nutch PMC > > Chris Mattmann > Markus Jelsma > Julien Nioche > Lewis John McGibbney > > I'll go ahead and push the release to the mirrors and release the Maven > repo to Central and then

Re: 'Other Resources' section of wiki

2011-06-09 Thread Julien Nioche
Hi Lewis, Thanks for volunteering on this. I'd create a subpage ('archives') and put all the how-tos and tutorials for the versions pre-1.3 linked from there. We've also : Presentation / Academic Articles / link to Video from Doug + various other things that could also be linked to from a subpage

new branch 1.4 and possible features

2011-06-10 Thread Julien Nioche
Guys, I added a new label 1.4 on the JIRA. Shall we create a new branch 1.4 on SVN from the existing 1.3? I agree that it is a pain to have to maintain 1.x AND trunk in parallel but my feeling is that 2.0 needs more work before being completely reliable and in the meantime we might want to add new

Re: Please remove me from the mailing list

2011-06-12 Thread Julien Nioche
http://nutch.apache.org/mailing_lists.html -> dev-unsubscr...@nutch.apache.org On 12 June 2011 14:33, Tolga Soyata wrote: > Please remove me from the mailing list -- Open Source Solutions for Text Engineering http://digitalpebble.blogspot.com/ http://www.digitalpebble.com

Re: Bug-fix for Nutch 1.3 with solrdedup

2011-06-13 Thread Julien Nioche
Hi, Please open a new issue on https://issues.apache.org/jira/browse/NUTCH Thanks Julien On 13 June 2011 04:20, Yavinty wrote: > Hello, > > I have a bug-fix for Nutch 1.3 (solrdedup throwing > NullPointerException), where do I submit it? > > Thanks. > -- Open Source Solutions for Text

Re: new branch 1.4 and possible features

2011-06-13 Thread Julien Nioche
Guys, I've created a new branch for 1.4 on https://svn.apache.org/repos/asf/nutch/branches/branch-1.4 Thanks Jul On 10 June 2011 12:11, Markus Jelsma wrote: > > > Guys, > > > > I added a new label 1.4 on the JIRA. Shall we create a new branch 1.4 on > > SVN from the existing 1.3? I agree

Re: new branch 1.4 and possible features

2011-06-13 Thread Julien Nioche
Hi, [...] > > Yes indeed. I see that Gora is still in incubation and I have not been > using trunk for sometime as it has been broken due to Gora dependencies? I > think this suggestion is the only sensible way to continue. As I have not > been using trunk, what is the current situation with thi

[ANNOUNCEMENT] Lewis John Mc Gibbney is a Nutch committer and PMC member

2011-06-29 Thread Julien Nioche
Hi, A while back the NUTCH PMC nominated Lewis John Mc Gibbney for Nutch committership and PMC membership. The VOTE tallies in Nutch PMC-ville have occurred and I'm happy to announce that Lewis is now a Nutch committer! Lewis, feel free to say a little bit about yourself, and, welcome aboard! Ju

Re: Create separate issues for 2.0?

2011-06-30 Thread Julien Nioche
> I'd be happy to roll 1.4 whenever we're ready. > There are quite a few things that we've discussed for 1.4 (e.g. make indexing backend pluggable, delegate code to crawler-commons) so it is a bit premature to talk about releasing 1.4 just now. End of 2011 would be a good deadline, with a release r

Re: Nutch 2.0 roadmap

2011-07-04 Thread Julien Nioche
Hi Lewis, Currently the slightly (in places) dated roadmap can be found here [1], I > was wondering if we could give this an overhaul/update as it would give a > more robust overview of where trunk is going. Most of the points you make > are still in development, however some have been achieved a

Re: Rebuilding site

2011-07-07 Thread Julien Nioche
Hi Lewis, > As I am back home I propose to rebuild the site to link the current > tutorial link to the new 1.3 tutorial on the wiki. I would also like to > formally make my first commit by adding my name to the list of committers > before I progress with other bits and pieces. > Good idea! See

Re: [Nutch Wiki] Update of "NutchTutorial" by JulienNioche

2011-07-12 Thread Julien Nioche
http://nutch.apache.org/mailing_lists.html > Hey, > > please delete my E-Mail address from your mailing list or whatever. I > receive more than 50 mails every day. > > Bye > > -- > Marcel Schubert > Auszubildener > TU Clausthal E-Mail: schub...@rz.tu-clausthal.de > Rechenzentr

Re: Real-time Solr integration

2011-07-12 Thread Julien Nioche
Hi Matthew, This is usually achieved by writing a script containing the individual Nutch commands (as opposed to calling 'nutch crawl') and index at the end of a generate-fetch-parse-update-linkdb sequence. You don't need any plugins for that HTH Julien On 12 July 2011 13:35, Matthew Painter w
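The sequence described above, one generate-fetch-parse-update-linkdb round followed by indexing, can be sketched as a list of commands. This is a hedged illustration, not an official script: the function name, the paths and the Solr URL are hypothetical, and the argument order is an assumption based on the Nutch 1.3 command-line usage, so it should be checked against the output of each `bin/nutch` command.

```python
def crawl_cycle(crawldb, linkdb, segments_dir, segment, solr_url,
                nutch="bin/nutch"):
    """Return the commands for one crawl round, with indexing as last step.

    Assumption: 'segment' is the directory that 'generate' creates under
    segments_dir. A real script would only learn its name after the first
    command has run; it is passed in here to keep the sketch simple.
    """
    return [
        [nutch, "generate", crawldb, segments_dir],
        [nutch, "fetch", segment],
        [nutch, "parse", segment],
        [nutch, "updatedb", crawldb, segment],
        [nutch, "invertlinks", linkdb, segment],
        # Indexing comes last, once the crawldb and linkdb are up to date.
        [nutch, "solrindex", solr_url, crawldb, linkdb, segment],
    ]


if __name__ == "__main__":
    # Hypothetical paths and URL, for illustration only.
    for argv in crawl_cycle("crawl/crawldb", "crawl/linkdb",
                            "crawl/segments", "crawl/segments/20110712",
                            "http://localhost:8983/solr/"):
        print(" ".join(argv))
```

Printing the commands instead of running them keeps the sketch safe to try; each list could be handed to `subprocess.check_call` once the arguments have been verified.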

Re: Real-time Solr integration

2011-07-14 Thread Julien Nioche
this issue (however I have not used his > scripts > > > extensively). They might be of interest for a look. Try the link below > > > > > > > http://wiki.apache.org/nutch/Whole-Web%20Crawling%20incremental%20script > > > > > > On Tue, Jul 12, 2011 at 2:15 PM

Re: Normalize and filter hyperlinks during parse

2011-07-14 Thread Julien Nioche
Are you sure we don't already filter and normalize at the end of the parse? (not in front of code - sorry can't check) On 14 July 2011 16:37, Markus Jelsma wrote: > Hi, > > If we filter and normalize hyperlinks in the parse job, we wouldn't have to > filter and normalize during all other jobs

Re: HTTPS support

2011-07-14 Thread Julien Nioche
http://www.google.co.uk/search?q=nutch+mailing+list -> 1st result On 14 July 2011 16:50, Zanzico Gioele wrote: > how can i be deleted from this mailing list pls ? > > tks > ciao > gioele > > Gioele Zanzico > Senior Web Analyst > Vitec Group Imaging & Staging Division > Direct Line: +39 0424

Re: Normalize and filter hyperlinks during parse

2011-07-15 Thread Julien Nioche
dified filter again to allow .nl pages. I > updated the db and it worked. Now i have two urls. > not clear. Was there only one outlink in that seed? Did the filtering work or not? > > More thoughts? :) > > On Thursday 14 July 2011 18:31:07 Julien Nioche wrote: > > Are you

Re: Real-time Solr integration

2011-07-15 Thread Julien Nioche
> > > On Thursday 14 July 2011 15:03:34 Julien Nioche wrote: > > Have been thinking about this again. We could make so that the indexer > does > > not necessarily require a linkDB : some people are not particularly > > interested in getting the anchors. At the mome

Re: Real-time Solr integration

2011-07-15 Thread Julien Nioche
Will take care of this one later : https://issues.apache.org/jira/browse/NUTCH-1054 On 15 July 2011 12:46, Markus Jelsma wrote: > > > On Friday 15 July 2011 11:07:36 Julien Nioche wrote: > > > On Thursday 14 July 2011 15:03:34 Julien Nioche wrote: > > > > Have been

Re: adding details to mvn.template?

2011-07-17 Thread Julien Nioche
Please excuse (and correct) my ignorance, but I need to clear this one up so > I understand correctly. The purpose the mvn.template file serves is so we > can specify exactly who can commit a Nutch maven pom. The pom in turn > specifies the build dirs e.g. source dir as well as test dir. Then final

Re: Automaton improvements

2011-07-25 Thread Julien Nioche
Hi Kirby, Thanks for sharing this. It is definitely relevant for Nutch and I am sure that there would be quite a few people interested in giving it a try. Let's hope that this patch gets into the original library or that the Lucene people ship it in a separate jar, in the meantime your patch would

Re: .BAT file for running nutch in Windows (no cygwin)

2011-07-25 Thread Julien Nioche
Hi Radim yes please open a JIRA with a description of what you've done + attach the script Thanks Julien 2011/7/23 Radim Kolar > I ported shell start-up script to standard windows .BAT file (tested in > Windows XP). > > Where can i upload it? I need help with testing nutch under native window

Re: Automaton improvements

2011-07-25 Thread Julien Nioche
uld just make Nutch require Lucene as a dependency -- this > would provide more stable updates. > > Dawid > > > On Mon, Jul 25, 2011 at 10:35 AM, Julien Nioche < > lists.digitalpeb...@gmail.com> wrote: > >> Hi Kirby, >> >> Thanks for sharing this. It is

Re: (NUTCH-1071) Crawldb update to total counts per status

2011-07-29 Thread Julien Nioche
431595 > status 6 (db_notmodified): 118696 > > Thanks > > > Crawldb update to total counts per status > > - > > > > Key: NUTCH-1071 > > URL: https://issues.apache.org/jira/browse/NUTCH-1071 > >

Re: (NUTCH-1071) Crawldb update to total counts per status

2011-07-29 Thread Julien Nioche
Markus, Have just committed a change to CrawlDBReducer (rev 1152254) see line 155 -> reporter.getCounter("CrawlDB status", CrawlDatum.getStatusName(old.getStatus())).increment(1); was using the wrong object :-( Would you mind giving it a try? Thanks Julien

Re: Possible use of your bot as a hacking tool

2011-07-30 Thread Julien Nioche
Ardath, Nutch is an open source project not a service or commercial entity and as such we don't run crawls and can't be held responsible for the way people use it. Judging by the content of your logs, this specific crawl is being carried out by a university in Korea. You should try and get in tou
