nutchpy

2015-01-09 Thread Lewis John Mcgibbney
Hi Folks, Just wanted to make folk aware of some work Continuum Analytics have been doing on bringing Nutch to the Python community. https://github.com/ContinuumIO/nutchpy Continuum are the folks behind most of the scientific Python stuff you've ever used. If you've used Python before, then

Re: nutchpy

2015-01-09 Thread Mattmann, Chris A (3980)
Yep, it's awesome work funded by the DARPA Memex project and our team. Cc'ing Andy Terrel for awareness. Thanks, Lewis! On Jan 9, 2015, at 6:04 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi Folks, Just wanted to make folk aware of some work Continuum

Re: nutchpy

2015-01-09 Thread Shadi Saleh
Thanks, I got this error while installing: [INFO] Scanning for projects... [INFO] Building seqreader-app 1.0-SNAPSHOT

Fwd: nutchpy

2015-01-09 Thread Mattmann, Chris A (3980)
Begin forwarded message: From: Shadi Saleh propat...@gmail.com Date: January 9, 2015 at 6:28:11 PM PST To: user user@nutch.apache.org Subject: Re: nutchpy Reply-To: user@nutch.apache.org

Re: nutchpy

2015-01-09 Thread Shadi Saleh
Dear all, I added the following to the file nutchpy/seqreader-app/pom.xml: <plugin> <groupId>org.apache.maven.plugins</groupId> <artifactId>maven-compiler-plugin</artifactId> <configuration> <compilerVersion>1.5</compilerVersion> <source>1.5</source>

[VOTE] Release Apache Nutch 2.3

2015-01-09 Thread Lewis John Mcgibbney
Hi user@ and dev@, This thread is a VOTE for releasing Apache Nutch 2.3. Quite incredibly, we addressed 143 issues as per the release report http://s.apache.org/nutch_2.3 The release candidate comprises the following components. * A staging repository [0] containing various Maven artifacts * A

RE: Problem with time out on QueueFeeder

2015-01-09 Thread Markus Jelsma
Do you have enough memory? 50 threads, PDFs, and an older Tika version will get you in trouble. That PDFBox version eats memory! Try upgrading to the latest PDFBox; you can drop in the jars manually and reference them in Tika's plugin.xml. M -Original message- From:Paul Rogers

Re: Problem with time out on QueueFeeder

2015-01-09 Thread Paul Rogers
Thanks Markus, I will try that and see if it fixes things. The server has 24GB of memory but only about 1GB free without the Nutch process running!! Are the PDFBox files in Tika 1.6 (PDFBox 1.8.6) likely to have fixed this, or should I go for 1.8.8 on the PDFBox site? Thanks again P On 9

Problem with time out on QueueFeeder

2015-01-09 Thread Paul Rogers
Hi Guys, I am using Nutch 1.8 to fetch PDF documents from an HTTP server. The jobs have been running OK until recently, when I started getting the following error: -activeThreads=50, spinWaiting=50, fetchQueues.totalSize=2500 fetching
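
For readers following this thread, a status line like the one quoted above can be checked mechanically. The following is only a minimal sketch: the field names come from the quoted log line, while the function names and the "stalled" heuristic (every active thread spin-waiting while items remain queued) are assumptions made for illustration, not anything defined by Nutch.

```python
import re

# Pattern for the status line quoted in the message above.
STATUS_RE = re.compile(
    r"-activeThreads=(\d+),\s*spinWaiting=(\d+),\s*fetchQueues\.totalSize=(\d+)"
)


def parse_status(line):
    """Return the three counters as ints, or None if the line does not match."""
    match = STATUS_RE.search(line)
    if match is None:
        return None
    active, spinning, queued = (int(group) for group in match.groups())
    return {"active": active, "spinning": spinning, "queued": queued}


def looks_stalled(status):
    """Hypothetical heuristic: all active threads are waiting while work is queued."""
    return (
        status["active"] > 0
        and status["spinning"] == status["active"]
        and status["queued"] > 0
    )


line = "-activeThreads=50, spinWaiting=50, fetchQueues.totalSize=2500"
status = parse_status(line)
print(status, looks_stalled(status))
```

Running this over successive log lines would show whether the fetcher stays in that state or recovers; a single matching line on its own does not prove a hang.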

Re: Problem with time out on QueueFeeder

2015-01-09 Thread Paul Rogers
Hi Markus, Rebooting the server frees up 23GB of memory. I have installed PDFBox 1.8.8 and am running the fetch again. Will update you on results. Thanks P On 9 January 2015 at 14:11, Paul Rogers paul.roge...@gmail.com wrote: Thanks Markus I will try that and see if it fixes things. The