Re: [ANNOUNCE] New Nutch committer and PMC - Tim Allison

2023-07-20 Thread Julien Nioche
What a fantastic addition to the Nutch team! Congrats to Tim On Thu, 20 Jul 2023 at 10:20, Sebastian Nagel wrote: > Dear all, > > It is my pleasure to announce that Tim Allison has joined us > as a committer and member of the Nutch PMC. > > You may already know Tim as a maintainer of and contrib

Re: Nutch Plugins Source Control

2017-04-07 Thread Julien Nioche
Hi Ben On 7 April 2017 at 15:10, Ben Vachon wrote: > Hi Isroudi, > > I am not working with an install of Nutch, I'm just working with the jar I > got via maven, and it doesn't have any of the plugins. > > I could build the plugins into the project myself, but to do that I would > need to downloa

Re: [VOTE] Release Apache Nutch 1.13 RC#1

2017-03-29 Thread Julien Nioche
Hi Lewis +1 compiled from source and ran a small crawl in local mode. All good! Thanks Julien On 29 March 2017 at 05:20, lewis john mcgibbney wrote: > Hi Folks, > > A first candidate for the Nutch 1.13 release is available at: > > https://dist.apache.org/repos/dist/dev/nutch/1.13/ > > The r

Re: Need help installing scoring-depth plugin

2017-01-31 Thread Julien Nioche
You don't need to install scoring-depth. It's just that the term 'depth' in the old crawl class has been replaced by 'rounds', which is more accurate. The equivalent of the command you used to call should be *bin/crawl phfaws crawl **1 * The value for topN needs setting in the crawl script, see si

Re: Setting different depths for different urls in seed.txt

2017-01-18 Thread Julien Nioche
Yes, use the scoring-depth plugin and set _maxdepth_=X in the seeds file HTH Julien On 18 January 2017 at 10:40, Manav Bagai wrote: > Is it possible to set different depths for different urls in seed.txt. For > example. there are two url 'A' and 'B' in seed.txt, is it possible that > crawler

Re: General question about subdomains

2017-01-11 Thread Julien Nioche
Hi Joe, Do these subdomains point to the same IP address? Did they blacklist your server i.e. can you connect to these domains from the crawl server using a different tool like curl? Not a silver bullet but a way of preventing this is to group by IP or domain (fetcher.queue.mode and partition.url

Re: Nutch 1.x on hadoop

2016-11-03 Thread Julien Nioche
erminal) as it goes along. How can I watch what > it's doing when it runs under hadoop? I have clicked around a little bit in > the hadoop monitoring web app, but haven't found it yet. > > > From: Julien Nioche > To: "user@nutch.apache.org" ; Michael Coffey

Re: Nutch 1.x on hadoop

2016-11-02 Thread Julien Nioche
Michael, See http://digitalpebble.blogspot.co.uk/2015/09/index-web-with-aws-cloudsearch.html for a relatively recent step-by-step tutorial for Nutch 1.x Julien On 2 November 2016 at 16:10, Michael Coffey wrote: > I'm having trouble trying to get Nutch 1.12 to run on hadoop 2.7.3. > I get a c

Re: Trouble fetch PDFs to pass to Tika (I think)

2016-10-17 Thread Julien Nioche
Hi Tom You haven't modified the value for the config below by any chance? http.robots.403.allow true Some servers return HTTP status 403 (Forbidden) if /robots.txt doesn't exist. This should probably mean that we are allowed to crawl the site nonetheless. If this is set to false, then such sites

Re: Nutch 2.x for large-scale crawls

2016-06-20 Thread Julien Nioche
Hi Joseph, I have been meaning to update the benchmarks for a while but haven't found the time to do so. I will probably add StormCrawler to the mix next time. One thing that helped with the performance when I was running very large crawls with Nutch 1.x was to generate multiple segments in one go, fetch and

Re: [VOTE] Release Apache Nutch 1.12

2016-06-15 Thread Julien Nioche
+1 Thanks Lewis and team! On 15 June 2016 at 06:14, lewis john mcgibbney wrote: > Hi Folks, > > A first candidate for the Nutch 1.12 release is available at: > > https://dist.apache.org/repos/dist/dev/nutch/1.12/ > > The release candidate is a zip and tar archive of the sources tag available >

Re: Nutch WARC export problems

2016-04-27 Thread Julien Nioche
18 April 2016 at 23:25, Davíð Steinn Geirsson wrote: > Hi Julien, > > Julien Nioche wrote: > > Hi David > > > > the resulting file contains no matching request records, or even a > > > warcinfo record for that matter. > > > > > >

Re: Nutch WARC export problems

2016-04-14 Thread Julien Nioche
Hi David the resulting file contains no matching request records, or even a > warcinfo record for that matter. It wouldn't be too difficult to add at least the request records to WARCExporter - please open a JIRA + contributions are welcome as always. I'm willing to move to nutch v2.x if it m

Re: Configuration of very specific requirements

2016-04-06 Thread Julien Nioche
Hi Jigal, You can do this by activating the scoring-depth plugin and setting scoring.depth.max to 1 in nutch-site.xml For the scheduling simply set db.fetch.interval.default 86400 in nutch-site.xml Filtering URLs from being indexed based on the content could be done by writing a custom Inde

Re: [MASSMAIL]Extract Contact Information - Custom Parser

2016-02-12 Thread Julien Nioche
essor, Computer Science Department > University of Southern California, Los Angeles, CA 90089 USA > ++++++ > > > > > > -Original Message- > From: Julien Nioche > Reply-To: "user@nutch.apac

Re: [MASSMAIL]Extract Contact Information - Custom Parser

2016-02-10 Thread Julien Nioche
See SO => http://stackoverflow.com/questions/35299744/nutch-parser-plugin-collect-contact-information There seem to be more and more people sending their questions to both the ML and SO. Am wondering whether we should set up a redirect so that any question asked there lands automatically on the use

Re: Tools to import WARC file into Nutch segments?

2015-12-16 Thread Julien Nioche
Hi Tien Short answer : not yet. BTW the WARCExporter is more scalable than the CCDataDumper. As mentioned in [NUTCH-2102 ] we could add an importer in the package org/apache/nutch/tools/warc. Julien On 16 December 2015 at 07:22, Nguyen Manh Tien

Re: [MASSMAIL]Crawling focused only over seed file

2015-11-27 Thread Julien Nioche
db.ignore.external.links is for filtering the outlinks and keeping the ones from the same host (and now domain https://issues.apache.org/jira/browse/NUTCH-2069). The one you probably want is db.update.additions.allowed true If true, updatedb will add newly discovered URLs, if false only a

Re: Manipulate queues

2015-11-26 Thread Julien Nioche
You could also pass it a high score during the injection without having to write a custom filter and rely on metadata see http://wiki.apache.org/nutch/bin/nutch%20inject e.g. http://www.thissimplycannotwait.com *nutch.score=1* The trouble with this is that it might have an impact on

Re: Crawling subdomains, but not external links

2015-11-18 Thread Julien Nioche
Hi Gaspar Have a look at https://issues.apache.org/jira/browse/NUTCH-2069, this should allow you to restrict the crawl to the domain and not just the hostname. Hasn't been committed yet as Seb had suggested some improvements. HTH Julien On 18 November 2015 at 21:10, Gaspar Pizarro wrote: > Hi

Re: Populating outlinks with CrawlDatum Metadata

2015-11-04 Thread Julien Nioche
Hi Lewis Can't you achieve this already with the url-meta plugin and urlmeta.tags config? Julien On 3 November 2015 at 22:09, Lewis John Mcgibbney wrote: > Hi Folks, > The above has been discussed a few times with the following thread [0] > being probably most helpful. > Is anyone else wo

Re: Webcast : Apache Nutch on EMR

2015-09-26 Thread Julien Nioche
Hi Lewis > > > Whats your thoughts about making this part of the scrolling banner on the > homepage? > a bit OTT I think. I need to dig up my Wiki credentials and add the video + blog entry on the documentation page. > I think it is great. > Thanks mate Julien -- *Open Source Solutions fo

Webcast : Apache Nutch on EMR

2015-09-23 Thread Julien Nioche
Hi again, I have uploaded a webcast explaining how to run Nutch on AWS Elastic Map Reduce https://www.youtube.com/watch?v=v9zjcTjjjyU Please excuse the sound quality, hesitations and stuttering. I hope you find it useful nonetheless. Julien -- *Open Source Solutions for Text Engineering* h

Tutorial : Index the web with AWS CloudSearch

2015-09-23 Thread Julien Nioche
Hi everyone, Just to let you know that we've just published a new tutorial on how to use Nutch (and StormCrawler) to crawl and index documents into AWS CloudSearch. This is related to the recent addition of NUTCH-1517 in the trunk codebase. The t

Fwd: Job Opening at Common Crawl - Crawl Engineer / Data Scientist

2015-09-18 Thread Julien Nioche
Nutch people, Just in case you missed the announcement below. As you probably know CC use Nutch for their crawls, this is a fantastic opportunity to put your Nutch skills to great use! Julien -- Forwarded message -- From: Sara Crouse Date: 17 September 2015 at 22:51 Subject: Job

Re: [ANNOUNCE] New Nutch committer and PMC - Asitang Mishra

2015-09-10 Thread Julien Nioche
Congratulations Asitang and welcome! Julien On 9 September 2015 at 23:01, Sebastian Nagel wrote: > Dear all, > > on behalf of the Nutch PMC it is my pleasure to announce > that Asitang Mishra has joined the Nutch team as committer > and PMC member. Asitang, please feel free to introduce > yours

Re: Issue when fetching with multiple threads

2015-09-03 Thread Julien Nioche
Hi Alex You can use the segment reader to check the binary content and data extracted from the parse (`./nutch readseg ...`). This should at least give you some insights into where things might have gone wrong. HTH Julien On 3 September 2015 at 16:13, Alex Wang wrote: > Hi, > > We are using N

Re: Parent URL

2015-07-02 Thread Julien Nioche
Hi Shani Tracking the seed URL which led to a given page is easy : you can add a custom metadata to the seeds being the seed URL itself e.g. *http://www.guardian.co.uk seed=http://www.guardian.co.uk * then specify 'seed' as a value for the co
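A minimal sketch of how such a seeds file could be produced, assuming the usual injector format of a URL followed by tab-separated key=value pairs. The helper names `seed_line` and `seeds_with_origin` are hypothetical, and Python is used purely for illustration.

```python
def seed_line(url, **metadata):
    """Build one seed-file line: the URL, then tab-separated key=value pairs."""
    fields = [url] + [f"{key}={value}" for key, value in metadata.items()]
    return "\t".join(fields)


def seeds_with_origin(urls):
    """Tag every seed with itself under the 'seed' key so its origin can be tracked."""
    return [seed_line(url, seed=url) for url in urls]


if __name__ == "__main__":
    # Write the result to the seeds file that is passed to the inject step.
    for line in seeds_with_origin(["http://www.guardian.co.uk"]):
        print(line)
```

The same helper would also cover other per-seed metadata mentioned in this list, such as a score or a maximum depth, by passing different keyword arguments.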

crawler-commons 0.6 released

2015-06-11 Thread Julien Nioche
[Apologies for cross posting] crawler-commons 0.6 is released We are glad to announce the 0.6 release of Crawler Commons. See the CHANGES.txt file included with the release for a full list of details. We suggest

Re: [MASSMAIL]Re: about boost field extremely high

2015-05-20 Thread Julien Nioche
and search plugins. In order to use HTTPS please > enable > protocol-httpclient, but be aware of possible intermittent problems with > the > underlying commons-httpclient library. > > > > > > - Mensaje original - > De: "Julien Nioche" > Para

Re: about boost field extremely high

2015-05-20 Thread Julien Nioche
Hi Eyeris The boost value is simply the output of what the ScoringFilters give for a document. Are you using OPIC? Julien On 20 May 2015 at 19:32, Eyeris RodrIguez Rueda wrote: > Hi all. > Im using nutch 1.9 in local mode and solr 4.10 with half million of > documents. > An adaptive fetch sche

Re: Using Elasticsearch, Getting LUCENE_36 errors

2015-05-06 Thread Julien Nioche
Hi Scott EMR instances come with Lucene jars which might conflict with the ones used by Nutch. One (brutal) option is to simply remove the ones preinstalled on the slave nodes but it should also be possible to configure Hadoop so that it uses user jars prior to the system ones. Julien On 5 May 20

Re: [ANNOUNCE] New Nutch committer and PMC - Mo Omer

2015-03-23 Thread Julien Nioche
Welcome Mo! On 22 March 2015 at 19:31, Markus Jelsma wrote: > Welcome Mohammad! > > -Original message- > From: Mohammed Omer > Sent: Sunday 22nd March 2015 18:55 > To: user@nutch.apache.org > Cc: d...@nutch.apache.org > Subject: Re: [ANNOUNCE] New Nutch committer and PMC - Mo Omer > > He

Re: Scheduling multiple possibly parallel nutch crawls based on different configurations?

2015-03-16 Thread Julien Nioche
Hi guys, Running different Nutch crawls on the same cluster is of course doable but generally not very optimal. Assuming that you have one 'logical' crawl per hostname for instance you'd end up with N instances of Fetcher all running at the same time but using only a single thread and using one Ma

Re: [ANNOUNCE] New Nutch committer and PMC - Jorge Luis Betancourt Gonzalez

2015-02-19 Thread Julien Nioche
Congratulations and welcome Jorge! Great to have you with us Julien On 19 February 2015 at 17:20, Sebastian Nagel wrote: > Dear all, > > on behalf of the Nutch PMC it is my pleasure to announce that > Jorge Luis Betancourt Gonzalez has been voted in as committer > and member of the Nutch PMC. J

Re: Nutch with amazon cloudsearch

2015-01-12 Thread Julien Nioche
Hi See https://issues.apache.org/jira/browse/NUTCH-1517. I used it for one of my clients but they found it quite expensive and went for a hosted SOLR service instead. Performance was not an issue. HTH Julien On 12 January 2015 at 13:52, Adil Ishaque Abbasi wrote: > Hello all, > > Has anyone

Re: nutch on amazon emr

2015-01-01 Thread Julien Nioche
Hi Adil Why don't you simply SSH to the master node, install Nutch there and run the crawl script in runtime/deploy? You can then monitor your crawl in the usual way using the MapReduce UI. HTH Julien On 1 January 2015 at 17:03, Adil Ishaque Abbasi wrote: > I tried to run it through custom ja

Re: Reduce phase in Fetcher taking excessive time to finish.

2014-10-31 Thread Julien Nioche
e need to > consider any data loss(URLs) in this scenario ? > no, why? J. > > > > > > On Thu, Oct 30, 2014 at 6:22 AM, Julien Nioche < > lists.digitalpeb...@gmail.com> wrote: > > > Hi Meraj > > > > You can control the # of URLs per segment

Re: Reduce phase in Fetcher taking excessive time to finish.

2014-10-30 Thread Julien Nioche
> > Is it possible to set an upper limit on the max number of URLs per > fetch map task, along with the collective topN for the whole Fetch phase ? > > Thanks, > Meraj. > > On Sat, Oct 18, 2014 at 2:28 AM, Julien Nioche < > lists.digitalpeb...@gmail.com> wrote: >

Re: Generate multiple segments in Generate phase and have multiple Fetch map tasks in parallel.

2014-10-30 Thread Julien Nioche
Thanks for sharing this Meraj. It's already proving useful to other users. On 25 September 2014 17:04, Meraj A. Khan wrote: > Just wanted to update and let everyone know that this issue with single map > task for fetch was occurring because Generator.java had logic around MRV1 > property *mapred

Re: Reduce phase in Fetcher taking excessive time to finish.

2014-10-17 Thread Julien Nioche
129) > at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:376) > at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at >

Re: Reduce phase in Fetcher taking excessive time to finish.

2014-10-16 Thread Julien Nioche
Hi Meraj You could call jstack on the Java process a couple of times to see what it is busy doing, that will be a simple way of checking that this is indeed the source of the problem. See https://issues.apache.org/jira/browse/NUTCH-1314 for a possible solution J. On 16 October 2014 06:08, Mer

Re: Nutch vs Lucidworks Fusion

2014-10-06 Thread Julien Nioche
Thanks for the explanations Andrzej and Grant! Great to hear that you are using stuff from crawler-commons. Julien On 6 October 2014 14:47, Andrzej Białecki wrote: > > On 03 Oct 2014, at 12:44, Julien Nioche > wrote: > > > Attaching Andrzej to this thread. As most of you kn

Re: Nutch vs Lucidworks Fusion

2014-10-03 Thread Julien Nioche
Attaching Andrzej to this thread. As most of you know Andrzej was the Nutch PMC chair prior to me and a huge contributor to Nutch over the years. He also works for Lucid. Andrzej : would you mind telling us a bit about LW's crawler and why you went for Aperture? Am I right in thinking that this has

Re: Running multiple fetch map tasks on a Hadoop Cluster.

2014-09-19 Thread Julien Nioche
The fetching operates segment by segment and won't fetch more than one at the same time. You can get the generation step to build multiple segments in one go but you'd need to modify the script so that the fetching step is called as many times as you have segments + you'd probably need to add some

Re: Plugin loading and NUTCH-609

2014-09-15 Thread Julien Nioche
Hi Edoardo, See my comments below On 12 September 2014 11:11, Edoardo Causarano wrote: > Hi all, > > I'm completely lost, can anyone help me out here? > > I have this job.jar which contains all Nutch code, dependencies and > plugins. I don't understand how I keep getting this error: > > 2014-09

Re: Filtering bad urls in 1.7

2014-09-11 Thread Julien Nioche
Hi Myriam, You'll need to write a custom URLFilter for that, see https://wiki.apache.org/nutch/PluginCentral for pages related to plugins and how to write them. Julien On 10 September 2014 20:04, myriam abramson wrote: > Hello! > > Sorry for a newbie question. How do I filter bad urls using th

Nutch FAQ

2014-09-01 Thread Julien Nioche
Hi guys, Our FAQ page [http://wiki.apache.org/nutch/FAQ] needs a bit of an update. Some of the items on it are now irrelevant (search and analysing) or too philosophical to be really useful, the layout and formatting are awful and it simply does not serve its purpose which is to summarise the most fre

Re: [RELEASE] Apache Nutch 1.9

2014-09-01 Thread Julien Nioche
> Instrument Software and Science Data Systems Section (398) > > NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA > > Office: 168-519, Mailstop: 168-527 > > Email: chris.a.mattm...@nasa.gov > > WWW: http://sunset.usc.edu/~mattmann/ > > ++ > > Adjunct Asso

Re: Nutch 1.7 fetch happening in a single map task.

2014-08-29 Thread Julien Nioche
elow? > > bin/hadoop bin/crawl > > > On Fri, Aug 29, 2014 at 10:01 AM, Julien Nioche < > lists.digitalpeb...@gmail.com> wrote: > > > As the name runtime/deploy suggest - it is used exactly for that purpose > > ;-) Just make sure HADOOP_HOME/bin is added to th

Re: Nutch 1.7 fetch happening in a single map task.

2014-08-29 Thread Julien Nioche
http://sched.co/1pbE15n) where we'll cover things like these On 29 August 2014 14:30, S.L wrote: > Thanks, can this be used on a hadoop cluster? > > Sent from my HTC > > - Reply message ----- > From: "Julien Nioche" > To: "user@nutch.apache.org"

Re: Nutch 1.7 fetch happening in a single map task.

2014-08-29 Thread Julien Nioche
ch > does not fetch all the urls no matter what depth or topN i give. > > I am submitting the Nutch job jar which seems to be using the Crawl.java > class, how do I use the Crawl script on a Hadoop cluster, are there any > pointers you can share? > > Thanks. > On Aug 29, 2014

Re: Nutch Confusion

2014-08-29 Thread Julien Nioche
Hi Iqbal, Am doing a POC to help decide if we should be using Nutch 1.9 or 2.2.1 > version. > > We would be indexing our crawled data in ElasticSearch 1.x version. > > I know the 2.2.1 version provides OTB support for Elastic 0.x version but > to use 2.x I need to change the code (ElasticWriter.ja

Re: Nutch 1.7 fetch happening in a single map task.

2014-08-29 Thread Julien Nioche
Hi Meraj, The generator will place all the URLs in a single segment if they all belong to the same host, for politeness reasons. Otherwise it will use whichever value is passed with the -numFetchers parameter in the generation step. Why don't you use the crawl script in /bin instead of tinkering wi

Re: [RELEASE] Apache Nutch 1.9

2014-08-29 Thread Julien Nioche
Hi Lewis, A few comments below. I use Nutch 2.x as it enables me to do analytics over the data I am > crawling. This is my justification for trying to maintain an further the > development on that branch over the last while. > Just out of interest, what sort of analytics do you do and why is it

Re: [RELEASE] Apache Nutch 1.9

2014-08-26 Thread Julien Nioche
Hi Mo, Sorry for the late reply. 2.x hasn't made much progress lately and 2.3 has still not been released as there are open issues with it (see JIRA). The trunk and 2.x branch live quite separate lives although there are improvements added to trunk that are not in 2.x. Most active contributors (me

Re: bin/crawl : incorrect handling of nutch errors?

2014-08-23 Thread Julien Nioche
Hi Mathieu, It is a bug indeed. As Feng suggested, please open an issue on https://issues.apache.org/jira/browse/NUTCH and attach a patch if you can. Thanks Julien On 20 August 2014 02:59, feng lu wrote: > yes, I think this is a bug for bin/craw

Re: Use nutch as a distributed monitoring solution, any idea?

2014-08-18 Thread Julien Nioche
Hi Howard (and Sebb), You could do it with Nutch but due to the batch nature of MapReduce it is not a natural fit e.g. no guarantee that the previous batch operation will be finished in time for the next one. There could be ways around this but the whole thing would get rather convoluted and diffi

Re: How to recrawl changing the seed.txt list

2014-08-13 Thread Julien Nioche
Hi, Yes, that should be fine. The only thing I would do differently would be : 2. Change the list in regex-urlfilter.txt (add +^http://www.rlp.de/ i.e. > for every url) allow any URLs instead of specifying all the hostnames one by one but set the following property to true in nutch-site.xml :

Re: [VOTE] Apache Nutch 1.9 Release Candidate #1

2014-08-13 Thread Julien Nioche
Hi, +1 to release. Compilation and tests run fine. Signatures look good. Thanks Lewis! Julien On 13 August 2014 06:32, Lewis John Mcgibbney wrote: > VOTE'ing will be open for 'at-least' 72 hours to allow people enough time > to cast their VOTE's. > Thanks > Lewis > > > On Tue, Aug 12, 2014 a

Re: java.lang.NullPointerException at org.apache.xerces.parsers.AbstractDOMParser.characters(Unknown Source)

2014-08-13 Thread Julien Nioche
Hi Steve, I tried with Nutch 1.9 RC1 and am not getting this exception. => ./nutch parsechecker -D http.agent.name=tralala http://www.my-ebenefits.com/PenguinRandomHouse/ Probably something that we fixed since 1.5.1 which is rather outdated. Why don't you give 1.9 a try instead? Julien On 12

Re: [New Nutch Plugin] Delegate fetching to Selenium/Firefox for those jobs where you neeeeed javascript parsing

2014-07-31 Thread Julien Nioche
Hi, Just to add to what Seb said below : *> (from https://github.com/momer/nutch-selenium-grid-plugin#nutch-selenium )> C) Not have to wait another 2 years for Nutch to patch in either the Ajax crawler> hashbang workaround

Re: New Nutch Plugin] Delegate fetching to Selenium/Firefox for those jobs where you neeeeed javascript parsing

2014-07-31 Thread Julien Nioche
Hi Mo, Great to hear about the plugin and the tutorial you are planning to write. Why don't you add a link to your plugin from https://wiki.apache.org/nutch/PluginCentral? IMHO plugins don't necessarily need to live in the Nutch codebase and can happily be maintained at an external location e.g.

Re: Segment already parsed!

2014-07-22 Thread Julien Nioche
This also answers your other question about memory exceptions while fetching : if you are parsing at the same time then you'll need more memory. On 22 July 2014 14:40, Adam Estrada wrote: > Sebastian, > > Thanks so much for the quick response. You were right. I read > somewhere that changing

Re: Nutch returns empty result set for some websites

2014-07-21 Thread Julien Nioche
From your log: 2014-07-19 10:41:58,279 ERROR fetcher.FetcherJob - Unexpected error for https://www.google.com/finance org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=https Replace protocol-http with protocol-httpclient in your nutch-site.xml or use the code from the http

Re: Nutch Regular Expression Testing

2014-07-21 Thread Julien Nioche
Hi The + character needs escaping, use - \+ in the filter (see http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html) There is a tool for testing the URLFilters in Nutch already, just do ./nutch org.apache.nutch.net.URLFilterChecker -allCombined from runtime/local/bin HTH Juli
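As a small illustration of why the escape matters, the sketch below uses Python's `re` module rather than `java.util.regex`; for this particular case the two behave the same way. The URL is a made-up example.

```python
import re

# Unescaped, '+' is a quantifier: "c+" means "one or more c".
unescaped = re.compile(r"c+")
# Escaped, '\+' matches a literal plus sign after the c.
escaped = re.compile(r"c\+")

url = "http://example.com/search?q=c++"  # hypothetical URL containing plus signs

print(bool(unescaped.search("abc")))  # the quantifier form matches a plain letter c
print(bool(escaped.search("abc")))    # the escaped form needs a real '+' character
print(bool(escaped.search(url)))      # and finds one in the URL
```

A filter rule written without the backslash would therefore match far more URLs than intended.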

Re: Unable to fetch content

2014-07-17 Thread Julien Nioche
gets tried within the same fetch step (i.e same round). HTH Julien > > Thanks, > Vijay > > On Jul 17, 2014, at 4:42 PM, Julien Nioche > wrote: > > > Hi, > > > > The crawl command is deprecated, use the crawl script instead and give > it a > >

Re: Ignoring errors in crawl

2014-07-17 Thread Julien Nioche
Hi Adam, Your problem is the OutOfMemoryError, not the read timeouts. Having timeouts won't crash the Fetcher. How much memory do you give Nutch? J. On 17 July 2014 18:40, Adam Estrada wrote: > Julien and Markus, > > The logs report that a couple of threads hung while processing certain > U

Re: Nutch 1.8 and Zero Boost

2014-07-17 Thread Julien Nioche
Hi Michael, Maybe look in the crawldb for such documents to see if they have something in common? I can't think of a particular reason why this would happen, it's definitely worth investigating. Thanks Julien On 17 July 2014 18:15, Michael Carlson wrote: > I’m using Nutch 1.8 to crawl my sit

Re: Unable to fetch content

2014-07-17 Thread Julien Nioche
Hi, The crawl command is deprecated, use the crawl script instead and give it a number of rounds > 1 so that it has a chance to fetch the redirection J. On 17 July 2014 21:10, Vijay Chakilam wrote: > Hi, > > I am trying to crawl the page at: " > http://0-search.proquest.com.alpha2.latrobe.edu

Re: Ignoring errors in crawl

2014-07-17 Thread Julien Nioche
Is it just slower or do these URLs properly crash Nutch? Can you tell us more about the crashes you are getting, e.g. logs etc..? On 17 July 2014 15:06, Adam Estrada wrote: > All, > > I am coming across a few pages that are not responsive at all which is > causing Nutch to #failwhale before fin

Re: [VOTE] Remove pom.xml from source

2014-07-16 Thread Julien Nioche
ld be more urgent? > > Thanks > > Simon > > > On Tue, Jul 15, 2014 at 6:36 PM, Julien Nioche < > lists.digitalpeb...@gmail.com> wrote: > > > Hi, > > > > One of the frequent issues on the mailing list / JIRA is that users can > be > > led to th

Re: [DISCUSS] [VOTE] Remove pom.xml from source

2014-07-15 Thread Julien Nioche
nerates the dependencies, and not e.g., the developer > list, etc. So, we need the pom.xml as the template that has that stuff, until someone cooks up a XSL combining solution with that original template > and then what ant deploy spits out, no? > > Cheers, > Chris > > > &

Re: Nutch Integration with hbase 94.x and hadoop 2.2

2014-07-15 Thread Julien Nioche
> > @julien > i just started with latest version, > [big sigh] it used to be called NutchGora which was probably a better name for it. People (reasonably) expect a 2.x version to be better than the 1.x one and the de-facto version to go for. 2.x is not as stable as 1.x, it lacks some of the 1.x

Re: Nutch Integration with hbase 94.x and hadoop 2.2

2014-07-15 Thread Julien Nioche
Hi On 15 July 2014 11:31, yeshwanth kumar wrote: > hi , > > i am using hbase 0.94.10 on top of hadoop 2.2. > > now i need to crawl the websites and store the results in hbase. > i saw that nutch doesn't have integration with gora 0.4 and higher versions > of hbase. > Use the 2.x branch instead

[VOTE] Remove pom.xml from source

2014-07-15 Thread Julien Nioche
Hi, One of the frequent issues on the mailing list / JIRA is that users can be led to think that Nutch is built with Maven as they can see what looks like a perfectly valid pom.xml at the root of the project. It becomes clearer when reading the WIKI or FAQ that ANT should be used instead but it is

Re: Nutch-New outlinks removes old valid outlinks

2014-07-12 Thread Julien Nioche
Hi Looks like yet another bug with Nutch 2.x. Could you open a JIRA and tag the issue for 2.3? In the meantime I'd advise you to use Nutch 1.x which is more reliable, has more features and is also an awful lot faster. Julien On 11 July 2014 10:01, mesenthil1 < senthilkumar.arumu...@viacomcontra

Re: Prevent parsing of office documents and PDFs

2014-07-11 Thread Julien Nioche
> to parse them. The conversion to indexable text takes place somewhere else, > no need for Nutch to sweat on it. > > Harald. > > > > On 11.07.2014 15:27, Julien Nioche wrote: > >> You don't need to modify parse-plugins.xml, just remove parse-tika >> from plugin.i

Re: Prevent parsing of office documents and PDFs

2014-07-11 Thread Julien Nioche
You don't need to modify parse-plugins.xml, just remove parse-tika from plugin.includes. Your problem here is that you have an open office document in the segment and no parser to deal with it. why don't you add a regular expression to URL filters to remove all URLs ending in .pdf, .docx, .doc ? T
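The suffix-based filtering suggested here can be sketched as follows. This is a Python model of a first-match-wins rule list in the style of regex-urlfilter.txt, not Nutch code; the rule list, the case-insensitive flag and the `accept` helper are assumptions made for illustration.

```python
import re

# Rules in the style of regex-urlfilter.txt: a sign, then a regular expression.
RULES = [
    ("-", re.compile(r"\.(pdf|docx?)$", re.IGNORECASE)),  # drop PDFs and Word documents
    ("+", re.compile(r".")),                              # accept everything else
]


def accept(url):
    """Return True if the first rule whose pattern is found in the URL has a '+' sign."""
    for sign, pattern in RULES:
        if pattern.search(url):
            return sign == "+"
    return False  # no rule matched: reject


if __name__ == "__main__":
    for candidate in ("http://example.com/index.html", "http://example.com/report.pdf"):
        print(candidate, accept(candidate))
```

Note that a pattern anchored on the end of the URL will not catch documents served with a query string after the extension; those would need a looser expression.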

Re: Nutch local: large crawls, extremely slow, small solr index

2014-07-10 Thread Julien Nioche
Hi Craig See comments below, will also comment on your other mail separately : On 9 July 2014 20:58, Craig Leinoff wrote: > Hello, > > I have a handful of questions about Nutch, and it's unclear whether it's > considered "impolite" to combine them all into one. As a result, I'm going > to start

Re: Nutch local: large crawls, extremely slow, small solr index

2014-07-10 Thread Julien Nioche
Hi again Craig, There is a deduplicator in Nutch but it won't prevent you from crawling these URLs infinitely. One option would be to change the URLFilters / Normalisers so that they deal with the repetition of two elements in the path. How do you run your crawl BTW? Do you use the crawl script?
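One way to picture such a filter is the sketch below, which flags URLs whose path repeats the same element too often. It is a simplified approximation in Python, not a Nutch URLFilter: it counts individual path segments rather than repeated pairs, and the function name and the `max_repeats` threshold are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


def has_repeated_segments(url, max_repeats=2):
    """Return True if any path segment occurs more than max_repeats times."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    counts = Counter(segments)
    return any(n > max_repeats for n in counts.values())


if __name__ == "__main__":
    looping = "http://example.com/a/b/a/b/a/b/page.html"  # made-up crawler trap
    normal = "http://example.com/a/b/page.html"
    print(has_repeated_segments(looping), has_repeated_segments(normal))
```

URLs for which the function returns True would be rejected before they reach the crawldb, which stops the loop from growing on each round.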

Re: Nutch 1.7: No content fetched

2014-07-09 Thread Julien Nioche
The clue is in : Metadata: _ngt_: 1404918941993_pst_: robots_denied(18), lastModified=0 The server you are hitting prevents robots, see http://79657.70194.14886.graphicspotting.com/robots.txt The parsechecker does not check for robots.txt whereas the normal crawl operations do. Julien On 9 J

Re: Duplicate HTML Metadata When Parsed with Tika

2014-07-09 Thread Julien Nioche
rsing with Tika I get back > duplicate metadata. Do you have any other thoughts? > > Best, > Jonathan > > > On Wed, Jul 9, 2014 at 4:11 AM, Julien Nioche < > lists.digitalpeb...@gmail.com > > wrote: > > > Hi Jonathan > > > > You shouldn't need t

Re: Duplicate HTML Metadata When Parsed with Tika

2014-07-09 Thread Julien Nioche
Hi Jonathan You shouldn't need to modify parse-plugins.xml to parse HTML docs with Tika : just remove parse-html from plugin.includes from nutch-site.xml. Could you please try that instead and see if that fixes your problem? Thanks Julien On 8 July 2014 19:41, Jonathan Cooper-Ellis wrote: >

Re: Nearing a 1.9 release?

2014-07-07 Thread Julien Nioche
ution%20%3D%20Unresolved%20ORDER%20BY%20updated%20DESC> and change their fix version back to 1.9 if you think they should be included in the next release. Thanks Julien On 29 June 2014 10:20, Julien Nioche wrote: > Hi guys, > > We've done loads of good work on the trunk s

Nearing a 1.9 release?

2014-06-29 Thread Julien Nioche
Hi guys, We've done loads of good work on the trunk since the last release, in particular : - NUTCH-1736 - NUTCH-1647 - NUTCH-1793 whi

Re: Crawl-Delay in robots.txt and fetcher.threads.per.queue config property.

2014-06-26 Thread Julien Nioche
> > If I set fetcher.threads.per.queue property to more than 1 , I believe the > behavior would be to have those many number of threads per host from Nutch, > in that case would Nutch still respect the Crawl-Delay directive in > robots.txt and not crawl at a faster pace that what is specified in >

Re: updatedb deletes all metadata except _csh_

2014-06-17 Thread Julien Nioche
Any Nutch-2 users or committers to help Alex on this one?

Travel assistance for ApacheCon EU, Budapest November 17-21 2014

2014-06-11 Thread Julien Nioche
The Travel Assistance Committee (TAC) is happy to announce that we now accept applications for ApacheCon Europe 2014, 17-21 November in Budapest, Hungary Applications are welcome from individuals within the Apache community at-large, users, developers, educators, students, Committers, and Members,

Re: Exception 'Missing elastic.cluster' with correct elasticsearch config

2014-06-11 Thread Julien Nioche
Hi Jake This has been fixed in trunk. see https://github.com/apache/nutch/commit/026b2ff414bcf166de4bfeabef57f0202375ea38#diff-68fe6210481889b1947da1fe7d7ed0afL254 and https://issues.apache.org/jira/browse/NUTCH-1745 Thanks Julien On 11 June 2014 16:37, Jake Dodd wrote: > Hi all, > > The fo

Re: Nutch use a Browser or phantomjs as fetcher

2014-06-10 Thread Julien Nioche
Hi Patrick You could look at the protocol-http plugin as an example. Julien On 10 June 2014 10:22, Patrick Kirsch wrote: > Hey, > > On 06/10/2014 10:52 AM, Julien Nioche wrote: > >> Hi >> >> You can do that as a custom protocol implementation. The fetcher code

Re: Nutch use a Browser or phantomjs as fetcher

2014-06-10 Thread Julien Nioche
Hi You can do that as a custom protocol implementation. The fetcher code would stay the same but the byte content returned for a given URL would be produced by phantomjs or whichever Selenium backend you'd like to use. HTH Julien On 7 June 2014 11:35, remi tassing wrote: > I'm currently looking a

Re: Problem with crawling macys robots.txt

2014-06-04 Thread Julien Nioche
That's why we have fetcher.max.crawl.delay : if a ridiculously large value is set, at least you won't be slowed down too much. See https://github.com/apache/nutch/blob/trunk/conf/nutch-default.xml#L693 On 4 June 2014 05:10, S.L wrote: > Out of curiosity , what if one needs to set the rules of
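The effect of that cap can be sketched roughly as below. This is a Python model based on my reading of the property description in nutch-default.xml (a Crawl-Delay above the cap causes the page to be skipped, and -1 disables the cap), not the fetcher's actual code; the function name and the default of 30 seconds are assumptions.

```python
def effective_delay(robots_delay, max_crawl_delay=30):
    """Return the delay in seconds to honour, or None if the page should be skipped."""
    if max_crawl_delay >= 0 and robots_delay > max_crawl_delay:
        return None  # the requested delay is unreasonably large: skip rather than wait
    return robots_delay


if __name__ == "__main__":
    for delay in (5, 3600):
        print(delay, effective_delay(delay))
```

With a negative cap the function always returns the robots.txt value, which corresponds to waiting however long the site asks.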

Re: Error while trying to index with elasticsearch on hadoop

2014-05-30 Thread Julien Nioche
ens Jahnke wrote: > Hi, > > On Fri, 30 May 2014 13:34:23 +0100 > Julien Nioche wrote: > > JN> The cluster name is not the same thing as the index name. It's > JN> elasticsearch by default. Are you saying that it works when you > specify it > JN> on the co

Re: Error while trying to index with elasticsearch on hadoop

2014-05-30 Thread Julien Nioche
Hi The cluster name is not the same thing as the index name. It's elasticsearch by default. Are you saying that it works when you specify it on the command line but not in nutch-site.xml? J. On 30 May 2014 10:29, Jens Jahnke wrote: > Hi Julien, > > On Fri, 30 May 2014 10:04:23

Re: Error while trying to index with elasticsearch on hadoop

2014-05-30 Thread Julien Nioche
Hi again Before you open an issue : could you please try specifying the cluster name -D elastic.cluster=elasticsearch when indexing? For some reason it seems to have solved the issue in my case Thanks J. On 30 May 2014 09:22, Julien Nioche wrote: > Hi Jens > > I have been able to

Re: Error while trying to index with elasticsearch on hadoop

2014-05-30 Thread Julien Nioche
On Wed, 28 May 2014 15:32:28 +0100 > Julien Nioche wrote: > > JN> Ok, so you are running it on Hadoop 2 then. > > Sorry, I forgot to mention that. > > JN> [...] > JN> That file names.txt lives in the elasticsearch jar. The explanation > that > JN> comes

Re: Reading from Hbase

2014-05-29 Thread Julien Nioche
Murali > > > > On Thu, May 29, 2014 at 12:49 AM, Julien Nioche < > lists.digitalpeb...@gmail.com> wrote: > > > Hi Murali > > > > Why not using the GORA API to read from HBase? > > > > Julien > > > > > > On 28 May 2014 23:18, Murali

Re: Reading from Hbase

2014-05-29 Thread Julien Nioche
Just a thought. Alternatively, if you wanted to keep things simpler, you could use Nutch 1.x and write a custom IndexWriter to send the data into the RDBMS of your choice. The cleaning of the data could be done with Indexing Filters. On 28 May 2014 23:18, Murali Parth wrote: > Hello, >

Re: Reading from Hbase

2014-05-29 Thread Julien Nioche
Hi Murali Why not use the GORA API to read from HBase? Julien On 28 May 2014 23:18, Murali Parth wrote: > Hello, > We are trying to use Nutch in our project. This is my first > project with Nutch and Hbase. > > I was able to make Nutch write to Hbase. When I go into the hbase shell
