Re: Build failed in Jenkins: Nutch-trunk #2064

2013-01-03 Thread Tejas Patil
Hi Markus, The tests are passing on my machine (verified twice). Maybe the junit tests' logs can give some clue. Can you send the following directory from the machine where hudson is setup ? /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/test Thanks, Tejas

Re: Failing Nightly Builds

2013-01-07 Thread Tejas Patil
Hi Lewis, These test cases pass on my machine (I guess on yours too). Had it been related to the Hadoop API, the tests would fail everywhere. What is different about the setup where the nightly builds are executed ? Thanks, Tejas Patil On Mon, Jan 7, 2013 at 3:24 PM, Lewis John Mcgi

Re: Failing Nightly Builds

2013-01-07 Thread Tejas Patil
.com/Nutch-Crawling-error-td612107.html [1] : http://www.thegeekstuff.com/2012/02/hadoop-standalone-installation/ [2] : http://www.mail-archive.com/user@cassandra.apache.org/msg16668.html Thanks, Tejas Patil On Mon, Jan 7, 2013 at 7:44 PM, Lewis John Mcgibbney < lewis.mcgibb...@gmail.com> wr

Re: Which is the main branch of nutch 2 ?

2013-01-11 Thread Tejas Patil
Hi, Nutch 2.1 release source is at https://svn.apache.org/repos/asf/nutch/branches/2.1/ There are couple of changes (mainly bug fixes) done on top of 2.1 but not yet released officially ... kinda "work in progress". https://svn.apache.org/repos/asf/nutch/branches/2.x/ Thanks, Tejas

Re: [ANNOUNCE] New Nutch committer and PMC : Tejas Patil

2013-01-15 Thread Tejas Patil
on nutch 1.x and did several modifications and additions tailored to the client requirements. There are a lot of things in Nutch remaining to learn. I will try my best to improve Nutch and help people with their questions over user group. Thanks, Tejas Patil http://www.linkedin.com/in/tejaspatil1

Re: [CALL FOR TESTING] NUTCH-1047 Pluggable indexing backends

2013-01-18 Thread Tejas Patil
@Julien: +1 for the feature. I am in for verifying it. Thanks, Tejas Patil On Fri, Jan 18, 2013 at 12:14 PM, Lewis John Mcgibbney < lewis.mcgibb...@gmail.com> wrote: > I'm just setting up my new Solr server so I will also definitely check the > newest patch out Julien as I

review board

2013-01-25 Thread Tejas Patil
" command. I am using svn, version 1.7.5. The patch was for nutch trunk. For uploading, I obtained the base directory from "svn info" command. Meanwhile I am googling for this issue, it would be great if someone can point out the problem here. [0] : https://issues.apache.org/jira/br

Re: Configuration improvements to GeneratorJob

2013-02-20 Thread Tejas Patil
> > -- > *Lewis* > [1] : http://lucene.472066.n3.nabble.com/generate-max-count-was-not-affected-td4031013.html [2] : https://issues.apache.org/jira/browse/NUTCH-1409 Thanks, Tejas Patil

Re: Configuration improvements to GeneratorJob

2013-02-20 Thread Tejas Patil
ck" is implemented in GeneratorJob [0] at lines 176-183. [0] : http://svn.apache.org/viewvc/nutch/branches/2.x/src/java/org/apache/nutch/crawl/GeneratorJob.java?view=markup > > Thanks, > lufeng > > > > On Thu, Feb 21, 2013 at 10:26 AM, Tejas Patil wrote: > >> Hey Le

Re: Configuration improvements to GeneratorJob

2013-02-20 Thread Tejas Patil
few words or maybe accidentally they got deleted. Correction in bold: "There might be some reason behind removing it *and we must look into it* before adding it back". > > Thanks > lufeng > > On Thu, Feb 21, 2013 at 12:14 PM, Tejas Patil wrote: > >> Hi

Re: Configuration improvements to GeneratorJob

2013-02-24 Thread Tejas Patil
seem to do nothing meaningful. 3. generate.min.score : remove ? 4. generate.filter, generate.normalise, generate.topN : there is no problem in keeping them. We can even remove them. 5. GENERATOR_COUNT_VALUE_IP : ?? thanks, Tejas Patil On Wed, Feb 20, 2013 at 9:44 PM, Tejas Patil wrote: > Hi Lufen

Re: dev Digest 25 Feb 2013 02:27:44 -0000 Issue 1555

2013-02-26 Thread Tejas Patil
Hi Lewis, I am not sure about what needs to be done for #3 and #5. So I left it as an open question. Once we reach a common understanding, I will open a jira for this. Thanks, Tejas Patil On Mon, Feb 25, 2013 at 1:17 PM, Lewis John Mcgibbney < lewis.mcgibb...@gmail.com> wrote: >

Re: [ANNOUNCEMENT] Welcome Kiran Chitturi as Apache Nutch PMC and Committer

2013-03-09 Thread Tejas Patil
Welcome aboard Kiran :) On Sat, Mar 9, 2013 at 12:56 PM, lewis john mcgibbney wrote: > Hi All, > > Over the last while we have been aware of Kiran's ongoing contribution to > the Nutch community. > It is with great pleasure that we invite Kiran to join the Nutch PMC and > also take up Committer

wiki update for nutch commands

2013-03-20 Thread Tejas Patil
Hi Kiran, The command line arguments to the fetch command shown on wiki page [2] don't seem to be in sync with what is implemented in [0] and [1]. For 1.x [0] Usage: Fetcher [-threads n] For 2.x [1] Usage: FetcherJob ( | -all) [-crawlId ] [-threads N] [-resume] [-numTasks N] On wiki page [2]

Re: wiki update for nutch commands

2013-03-20 Thread Tejas Patil
line argument page, the table. >> >> The command line arguments page is quite important for users as you have >> mentioned and I am up for keeping the pages separate for 1.x and 2.x. >> >> >> On Wed, Mar 20, 2013 at 2:52 PM, Tejas Patil >> wrote: >

Re: wiki update on Nutch Tutorial with crawl script

2013-03-20 Thread Tejas Patil
Phew.. was about to ignore this one as it was hidden among lot of other auto generated emails for wiki updates !! On Wed, Mar 20, 2013 at 12:01 PM, kiran chitturi wrote: > Hi! > > I want to update the Nutch tutorials in the wiki with the crawl script > (./bin/crawl). The presence of the crawl com

Re: Nutch

2013-04-06 Thread Tejas Patil
On Sat, Apr 6, 2013 at 9:58 AM, Parin Jogani wrote: > Hi, > Is there any way to perform a urlfilter from level 1-5 and a different one > from 5 onwards. I need to extract pdf files which will be only after a > given level (just to experiment). > You can run 2 crawls over the same crawldb using di

Re: Error when running Nutch, please help

2013-04-23 Thread Tejas Patil
The crawl script picks up the name of the segment created after the generate phase of nutch using some shell code:

    if [ $mode = "local" ]; then
      SEGMENT=`ls -l $CRAWL_PATH/segments/ | sed -e "s/ /\\n/g" | egrep 20[0-9]+ | sort -n | tail -n 1`
    else
      SEGMENT=`hadoop fs -ls $CRAWL_PATH/s
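The local-mode branch of the shell snippet above lists the segments directory, keeps the timestamp-named entries, and takes the numerically largest one. A minimal Python sketch of that same selection follows. It assumes segment directories are named as digit-only timestamps beginning with 20; the function name `latest_segment` is hypothetical, and the anchored pattern is deliberately stricter than the script's `egrep 20[0-9]+`.

```python
import re
from pathlib import Path

# Assumption: segment directories are named by timestamp, digits only, starting with "20".
SEGMENT_NAME = re.compile(r"^20[0-9]+$")


def latest_segment(segments_dir):
    """Return the name of the newest segment directory, or None if there is none."""
    names = [
        p.name
        for p in Path(segments_dir).iterdir()
        if p.is_dir() and SEGMENT_NAME.match(p.name)
    ]
    # Compare numerically, mirroring `sort -n | tail -n 1` in the shell version.
    return max(names, key=int, default=None)
```

Returning None for an empty directory makes the "generate produced no segment" case explicit, which the shell pipeline would otherwise report as an empty string.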

Re: tickets for nutch beginners

2013-04-27 Thread Tejas Patil
Hi Mike, There were a few jiras for some plugins not having junits. You can go through the corresponding plugin code, understand its working and then write a junit for it. That would give you some idea about the nutch code. Below are those jira links: https://issues.apache.org/jira/browse/NUTCH-1116

Nutch wiki down ?

2013-05-01 Thread Tejas Patil
Hi all, I have done a few jira checkins this week and wanted to verify if that would need any wiki updates. For the past 2-3 days, I have not been able to access nutch wiki pages [1]. It says: *wiki.apache.org is undergoing maintenance.* *We should be back online at UTC* *Infrabot on twitter contains more inf

Re: Nutch wiki down ?

2013-05-01 Thread Tejas Patil
t checked and Solr wiki is also down > [1]. > > > [1] - https://wiki.apache.org/solr > > > On Thu, May 2, 2013 at 1:08 AM, Tejas Patil wrote: > >> Hi all, >> >> I have done few jira checkins this week and wanted to verify if that >> would need any w

Re: Nutch wiki down ?

2013-05-03 Thread Tejas Patil
It's up now :) On Wed, May 1, 2013 at 10:27 PM, Tejas Patil wrote: > I thought that the downtime might be for a few hours or so .. but this one > is in #days !! Maybe they are up to something cool and need more time :) > Would wait for a few more days. > > > On Wed, May 1, 2013

Re: Jenkins build is back to normal : Nutch-nutchgora #595

2013-05-07 Thread Tejas Patil
Hey Lewis, It seems that you fixed some settings and the build passes now. Thanks !! On Tue, May 7, 2013 at 4:20 PM, Apache Jenkins Server < jenk...@builds.apache.org> wrote: > See > >

Re: [DISCUSS] Apache Nutch 2.2 Release Candidate

2013-05-08 Thread Tejas Patil
Hi Lewis, The Apache Gora 0.3 release is indeed good news :) If we don't get ample time to test it, then we should include it in the next release. Unless it's a "stopper" or "critical" issue, there is no haste to add it. It's a +1 from me to get 2.2 pushed. Thanks, Tejas On Wed, May 8, 2013 at 8:20 PM,

Re: Request for Backup Mentor(s) for GSoCq

2013-05-16 Thread Tejas Patil
Hi Lewis, I have no problem with helping on Nutch front but being honest I have no idea about Giraph. Would that be a problem ? If not then I am willing to sign in as a backup mentor. Thanks, Tejas On Thu, May 16, 2013 at 10:19 AM, Lewis John Mcgibbney < lewis.mcgibb...@gmail.com> wrote: > Hi A

[request] ACK for being backup mentor

2013-05-18 Thread Tejas Patil
Thanks, Tejas Patil http://www.linkedin.com/in/tejaspatil1

Re: GSOC 2013 project: Apache-Wicket based Nutch webapp

2013-05-19 Thread Tejas Patil
that your work would be a great contribution to Nutch. Looking forward to see this feature in next release cycle. Thanks, Tejas Patil On Sun, May 19, 2013 at 12:30 PM, Ivan Vershinin wrote: > Hello! > I am student from Estonia (Tartu University). I want to participate in > GSoC

Re: GSOC 2013 project: Apache-Wicket based Nutch webapp

2013-05-19 Thread Tejas Patil
: http://wiki.apache.org/nutch/NutchAdministrationUserInterface On Sun, May 19, 2013 at 2:23 PM, Tejas Patil wrote: > This will help for getting an idea about what is needed: > http://wiki.apache.org/nutch/NutchAdministrationUserInterface > > Rest API in nutch: (the jira comments and t

Re: Issues in 2.x with Patches for review

2013-05-20 Thread Tejas Patil
I have been looking at open jiras in past few weeks and getting a couple of them to trunk. What are the estimated/planned dates for new releases ? Thanks, Tejas Patil On Wed, Apr 24, 2013 at 6:13 PM, Lewis John Mcgibbney < lewis.mcgibb...@gmail.com> wrote: > Hi All, > I made the

unable to build 2.x

2013-05-21 Thread Tejas Patil
Hi nutch-dev, I took a *fresh* checkout of 2.x and tried to build it (ant clean runtime). I get a lot of compilation errors. At first when I saw that on the terminal, I said to my laptop : "Are you kidding me ?". I retried it 2 more times and still the same thing happens. I am checking the reason

Re: unable to build 2.x

2013-05-21 Thread Tejas Patil
Previously, without un-commenting the gora backend dependency in ivy.xml, the code could be built. Now, if it's not specified, there are compilation errors and the build fails. Specifying the backend is mandatory now. On Tue, May 21, 2013 at 11:04 PM, Tejas Patil wrote: > Hi nutch-dev, > >

Re: [VOTE] Apache Nutch 2.2 Release Candidate

2013-06-02 Thread Tejas Patil
+1 from my side too :) PS: Am I only one who is getting the mail today after 2 days ? Something seems wrong with the mail servers. On Fri, May 31, 2013 at 4:17 PM, lewis john mcgibbney wrote: > Good Friday Everyone, > > Glad to get to a stage where we can VOTE on the release of the Apache > Nut

Re: [DISCUSS] Nutch 1.7 ready for release?

2013-06-08 Thread Tejas Patil
+1. We should go for it On Sat, Jun 8, 2013 at 4:51 PM, Mattmann, Chris A (398J) < chris.a.mattm...@jpl.nasa.gov> wrote: > Go get em' > > ++ > Chris Mattmann, Ph.D. > Senior Computer Scientist > NASA Jet Propulsion Laboratory Pasade

Re: [ANNOUNCE] Apache Nutch 2.2 Released

2013-06-08 Thread Tejas Patil
Kewl !! Thanks Lewis :) I have circulated this news to big data group of my company and DB fellows @ UCI and my geek friends. Also posted over Bigdata and Hadoop pages over Facebook. On Sat, Jun 8, 2013 at 2:53 PM, lewis john mcgibbney wrote: > Good Afternoon Everyone, > > The Apache Nutch PMC a

Fwd: Nutch Compilation Error with Eclipse

2013-06-10 Thread Tejas Patil
?usp=sharing -- Forwarded message -- From: Tejas Patil Date: Mon, Jun 10, 2013 at 2:58 PM Subject: Re: Nutch Compilation Error with Eclipse To: "u...@nutch.apache.org" I have created a google doc [0] with several snapshots describing how to setup nutch 2.x + eclips

Re: Fwd: Nutch Compilation Error with Eclipse

2013-06-11 Thread Tejas Patil
". > > Cheers, > Sebastian > > On 06/11/2013 01:30 AM, Tejas Patil wrote: > > Hi @nutch-dev, > > > > I want to put out this [0] tutorial over Nutch wiki. > > > > 1. Do you see anything wrong in it or any improvements ? > > 2. Where do I uplo

Re: right place to put wiki images

2013-06-11 Thread Tejas Patil
As per suggestion by Seb, I have corrected wiki at several places. The images over Admin UI Proposal are lost as they were hosted somewhere else and the site is down now :( http://wiki.apache.org/nutch/NutchAdministrationUserInterface On Tue, Jun 11, 2013 at 11:14 AM, Tejas Patil wrote

Re: [VOTE] Apache Nutch 1.7 Release Candidate

2013-06-22 Thread Tejas Patil
+1 from me On Fri, Jun 21, 2013 at 11:51 PM, Julien Nioche < lists.digitalpeb...@gmail.com> wrote: > Apologies. I had seen only that the status had changed to closed but > removing the fix version definitely did the trick. > > +1 for releasing > > Thanks Lewis > > > On 21 June 2013 21:46, Lewis

Re: [VOTE] Apache Nutch 2.2.1 RC#1

2013-06-27 Thread Tejas Patil
+1 from me too On Thu, Jun 27, 2013 at 12:00 PM, Markus Jelsma wrote: > Looks fine Lewis! +1 > > -Original message- > From: Lewis John Mcgibbney > Sent: Thursday 27th June 2013 20:00 > To: dev@nutch.apache.org; u...@nutch.apache.org > Subject: [VOTE] Apache Nutch 2.2.1 RC#1 > > Hi, > > I

Re: Updating the documentation for crawl via 2.x

2013-06-30 Thread Tejas Patil
I think that the wiki page was made with the intention that users knew about 1.x and would now be switching to 2.x. So it had only the gora and datastore setup steps. I agree with you that it should contain a complete set of steps. *@dev:* Unless there is any objection or better suggestion, I would g

Re: Adding nutch stage

2013-07-01 Thread Tejas Patil
On Mon, Jul 1, 2013 at 5:31 AM, Ahmet Emre Aladağ wrote: > Hi, > > I'd like to add a new stage called "updatescore" after "updatedb" to Nutch > 2.1. > > I tried two ways for this: > 1) public class ScoreUpdaterJob extends NutchTool implements Tool; > > Nutch requires me to define the InputFormat,

Nutch with YARN (aka Hadoop 2.0)

2013-12-08 Thread Tejas Patil
Has anyone tried out running Nutch over YARN ? If so, were there any performance gains with the same ? Thanks, Tejas

Re: Nutch with YARN (aka Hadoop 2.0)

2013-12-09 Thread Tejas Patil
ully ported to the new mapreduce API which > is a prerequisite for running it on Hadoop 2. > I can't think of a reason why that the performance would be any different > with Yarn. > > Julien > > > On 9 December 2013 06:42, Tejas Patil wrote: > >> Has anyone t

Re: Step Through Nutch 1.7 Inside Eclipse Missing Argument

2013-12-22 Thread Tejas Patil
You are asking the right question at the right place. The example shown in the tutorial was for Nutch 2.X series. The 1.X Injector needs an extra param as input which is the location of the crawldb to inject the urls into. (For first time, it would create a new one on the location in the command).

Re: Step Through Nutch 1.7 Inside Eclipse Missing Argument

2013-12-23 Thread Tejas Patil
Wiki so your tutorial is both consistent for Nutch 2.X and 1.X.. > > (thought I should contribute back when I got the help from the community.) > > Thanks, > /usr/bin > > > > On Sun, Dec 22, 2013 at 10:44 PM, Tejas Patil wrote: > >> You are asking the right questio

Re: Nutch Several Segment Folders Containing Duplicate Key/URLs

2013-12-24 Thread Tejas Patil
You ran 3 rounds of nutch crawl ("-depth 3") and those 3 folders are 3 segments created for each round of crawl. About the 520 URLs, I don't see any obvious reason for that happening. You should see few of the new urls that were added, what were their parent url and then run a small crawl using tho

Re: Nutch Crawl a Specific List Of URLs (150K)

2013-12-29 Thread Tejas Patil
Hi Bin Wang, >> nohup bin/nutch crawl urls -dir result -depth 1 -topN 20 & You were creating a new crawldb or reusing some old one ? Were you running this on a cluster or in local mode ? Was there any failure due to which the fetch round got aborted ? (see logs for this). I would like to rep

Re: use to parse big Nutch/Content file

2014-01-02 Thread Tejas Patil
Here is what I would do: If you are running a crawl, let it run with the default parser. Write a nutch plugin with your customized parse implementation to evaluate your parse logic. Now get some real segments (with a subset of those million pages) and run only the 'bin/nutch parse' command and see how

Re: How Map Reduce code in Nutch run in local mode vs distributed mode?

2014-01-02 Thread Tejas Patil
The config 'fs.default.name' of core-site.xml is what makes this happen. Its default value is "file:///" which corresponds to local mode of Hadoop. In local mode Hadoop looks for paths on the local file system. In distributed mode of Hadoop, 'fs.default.name' would be "hdfs://IP_OF_NAMENODE/" and i
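The rule described above comes down to the URI scheme of 'fs.default.name'. The following Python sketch only illustrates that rule and is not how Hadoop itself resolves file systems; the function name `filesystem_mode` and the "unknown" fallback are assumptions made for the example.

```python
from urllib.parse import urlparse


def filesystem_mode(fs_default_name="file:///"):
    """Classify an 'fs.default.name' value by its URI scheme.

    The default argument mirrors the default value quoted in the message above.
    """
    scheme = urlparse(fs_default_name).scheme
    if scheme == "file":
        return "local"        # paths are looked up on the local file system
    if scheme == "hdfs":
        return "distributed"  # paths are looked up on HDFS via the namenode
    return "unknown"          # other schemes are outside this sketch
```

Calling `filesystem_mode()` with no argument models a setup with no core-site.xml override, which is the case the follow-up question in this thread asks about.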

Re: How Map Reduce code in Nutch run in local mode vs distributed mode?

2014-01-03 Thread Tejas Patil
here is no core-site.xml for > Nutch. > Isn't it? then it is default as local ? > > /usr/bin > > > > > On Thu, Jan 2, 2014 at 10:02 PM, Tejas Patil wrote: > >> The config 'fs.default.name' of core-site.xml is what makes this happen. >> Its

Re: Independent Map Reduce to parse Nutch content (Cont.)

2014-01-03 Thread Tejas Patil
Hi Bin Wang, I would suggest you to NOT use eclipse and run your code over command line. Use logger statements and see the logs for full stack traces of the failure. In my personal experience, logs are the best way to debug hadoop code compared to Eclipse debugger. Thanks, Tejas On Fri, Jan 3, 2

Inject operation: can't it be done in a single map-reduce job ?

2014-01-04 Thread Tejas Patil
Hi nutch-dev, I am looking at Injector code in trunk and I see that currently we are launching two map-reduce jobs for the same: 1. sort job: get the urls from seeds file, emit CrawlDatum objects. 2. merge job: read CrawlDatum objects from both crawldb and output of sort job. Merge and emit final
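The two steps listed above can be modelled in a few lines of Python to make the data flow concrete. This is a simplified, in-memory sketch and not the Injector implementation: plain dicts stand in for CrawlDatum objects, the names `sort_job` and `merge_job` are hypothetical, and both the skipping of blank and '#' lines and the rule that an existing crawldb entry wins over a newly injected one are assumptions made for the example.

```python
def sort_job(seed_lines):
    """Step 1: read seed URLs and emit one record per URL (stand-in for CrawlDatum)."""
    injected = {}
    for line in seed_lines:
        url = line.strip()
        # Assumption: blank lines and '#' comment lines in the seed file are skipped.
        if url and not url.startswith("#"):
            injected[url] = {"status": "injected"}
    return injected


def merge_job(crawldb, injected):
    """Step 2: merge the injected records with the existing crawldb records."""
    merged = dict(injected)
    # Assumption: a URL already present in the crawldb keeps its existing record.
    merged.update(crawldb)
    return merged
```

Written this way, the second step only needs the first step's output keyed by URL, which is the observation behind asking whether both could run as one job.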

Re: Independent Map Reduce to parse Nutch content (Cont.)

2014-01-04 Thread Tejas Patil
rk locally... > Based on my understanding, I can see Nutch constantly uses Hadoop API > without hadoop pre-installed.. why can't my code work.. > > Well, any hint or directional guidance will be appreciated, many thanks! > > /usr/bin > > > > > On Sat, Jan 4, 2

Re: Inject operation: can't it be done in a single map-reduce job ?

2014-01-06 Thread Tejas Patil
Thanks Lewis and Markus. @Lewis: I don't have a dedicated cluster (I am currently neither a student nor working anywhere) so I would be running in the pseudo distributed mode on my laptop. I don't think that it would be a perfect setup to get some stats. Does ASF have any cluster which could be used ? T

Renovating "Nutch Hadoop Tutorial" wiki page

2014-01-21 Thread Tejas Patil
Hi nutch-dev, I was looking at [0] and realized that with the massive number of Hadoop setup tutorials out there on the internet, we need not repeat the same on the nutch wiki page and can instead assume that the user has already done the Hadoop setup. For convenience, we could direct users to the Hadoop wiki page wh

Request for reviewing HostDb and Sitemap features

2014-01-21 Thread Tejas Patil
Hi, Is anyone interested in reviewing or trying out the patch for these new features ? I have recently updated [0] and [1] and would like to hear back comments on the same. [0] : https://issues.apache.org/jira/browse/NUTCH-1325 [1] : https://issues.apache.org/jira/browse/NUTCH-1465 Thanks, Tejas

Re: Renovating "Nutch Hadoop Tutorial" wiki page

2014-01-22 Thread Tejas Patil
escribed. +1 to remove the old >> nutchhadooptutorial page >> >> J. >> >> >> On 21 January 2014 17:44, Tejas Patil wrote: >> >>> Hi nutch-dev, >>> >>> I was looking at [0] and realized that with the massive number of Hadoop >

Re: Renovating "Nutch Hadoop Tutorial" wiki page

2014-01-22 Thread Tejas Patil
Moved the old nutchhadooptutorial page from Nutch wiki "Front page" to "Archive and Legacy". ~tejas On Wed, Jan 22, 2014 at 5:09 PM, Tejas Patil wrote: > Thanks *Julien* for pointing me to new "NutchHadoopSingleNodeTutorial" > wiki page [0]. I would soon rem

Right was to run crawl script in deploy mode

2014-01-22 Thread Tejas Patil
Hi nutch-dev, I was assuming that the commands to run the bin/crawl script in both local and deploy mode are the same. ie. from $NUTCH_HOME/runtime/local (or runtime/deploy), use > bin/crawl It turns out that in deploy mode, this does not obtain the segment location from HDFS and runs into p

Re: Renovating "Nutch Hadoop Tutorial" wiki page

2014-01-23 Thread Tejas Patil
, then just share the document in text format and I would add it to nutch wiki. ~tejas > > > > > On Wed, Jan 22, 2014 at 1:53 PM, Julien Nioche < > lists.digitalpeb...@gmail.com> wrote: > >> Thanks Tejas! >> >> >> On 22 January 2014 11:51, Tejas Pat

Re: Right way to run crawl script in deploy mode

2014-01-23 Thread Tejas Patil
Correction: the subject of this message should have read: "Right way to run crawl script in deploy mode" ~tejas On Wed, Jan 22, 2014 at 7:56 PM, Tejas Patil wrote: > Hi nutch-dev, > > I was assuming that the commands to run the bin/crawl script in both local > and deploy m

Fwd: [jira] [Commented] (NUTCH-356) Plugin repository cache can lead to memory leak

2014-01-24 Thread Tejas Patil
Hi Lewis, I won't be surprised if any user out there gets a cake for this jira ;) .. just like someone did over a MySQL bug ( http://www.youtube.com/watch?v=oAiVsbXVP6k) Cheers !! -- Forwarded message -- From: Lewis John McGibbney (JIRA) Date: Fri, Jan 24, 2014 at 7:01 PM Subject

Re: [DISCUSS] Release Trunk

2014-02-12 Thread Tejas Patil
Just saw the commits since 1.7 release. Apart from trivial bug fixes, we have some significant patches since 1.7. +1 for new release. I would be happy to volunteer / help. Thanks, Tejas On Wed, Feb 12, 2014 at 7:33 AM, Julien Nioche < lists.digitalpeb...@gmail.com> wrote: > Hi guys, > > At lea

Re: [DISCUSS] Release Trunk

2014-02-13 Thread Tejas Patil
Thanks Lewis. G+ hangout sounds cool. Is this wiki page complete and updated to start off ? http://wiki.apache.org/nutch/Release_HOWTO Thanks, Tejas On Thu, Feb 13, 2014 at 12:23 AM, Lewis John Mcgibbney < lewis.mcgibb...@gmail.com> wrote: > Hi Folks, > @Tejasp > > On Thu, Feb 13, 2014 at 6:30

[jira] [Updated] (NUTCH-1284) Add site fetcher.max.crawl.delay as log output by default.

2012-12-22 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil updated NUTCH-1284: --- Attachment: NUTCH-1284.patch Patch for the fix > Add site fetcher.max.crawl.de

[jira] [Commented] (NUTCH-1284) Add site fetcher.max.crawl.delay as log output by default.

2012-12-22 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13538725#comment-13538725 ] Tejas Patil commented on NUTCH-1284: I searched for the relevant mail thread[0

[jira] [Comment Edited] (NUTCH-1284) Add site fetcher.max.crawl.delay as log output by default.

2012-12-22 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13538725#comment-13538725 ] Tejas Patil edited comment on NUTCH-1284 at 12/22/12 10:5

[jira] [Updated] (NUTCH-1118) JUnit test for index-basic

2012-12-22 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil updated NUTCH-1118: --- Attachment: NUTCH-1118.patch Wrote a test case which checks following: 1. basic searchable fields

[jira] [Updated] (NUTCH-1119) JUnit test for index-static

2012-12-23 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil updated NUTCH-1119: --- Attachment: NUTCH-1119.patch Wrote a test case which checks following: 1. static data fields are

[jira] [Updated] (NUTCH-1224) Migrate FreeGenerator to MapReduce API

2012-12-29 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil updated NUTCH-1224: --- Attachment: NUTCH-1224.1.patch First attempt. Only remaining question is: Should I create a separate

[jira] [Updated] (NUTCH-1127) JUnit test for urlfilter-validator

2012-12-29 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil updated NUTCH-1127: --- Attachment: NUTCH-1127.patch Wrote test case capturing few scenarios. Attached the patch. Please let

[jira] [Commented] (NUTCH-1494) RSS feed plugin seems broken

2013-01-03 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13542798#comment-13542798 ] Tejas Patil commented on NUTCH-1494: I was working on [NUTCH-1053|h

[jira] [Commented] (NUTCH-1053) Parsing of RSS feeds fails

2013-01-03 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13542805#comment-13542805 ] Tejas Patil commented on NUTCH-1053: The exception seen by Lewis wrt command line

[jira] [Updated] (NUTCH-1274) Fix [cast] javac warnings

2013-01-03 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil updated NUTCH-1274: --- Attachment: NUTCH-1274-trunk.patch NUTCH-1274-2.x.patch PFA the patches for trunk

[jira] [Created] (NUTCH-1513) Support Robots.txt for Ftp urls

2013-01-04 Thread Tejas Patil (JIRA)
Tejas Patil created NUTCH-1513: -- Summary: Support Robots.txt for Ftp urls Key: NUTCH-1513 URL: https://issues.apache.org/jira/browse/NUTCH-1513 Project: Nutch Issue Type: Improvement

[jira] [Commented] (NUTCH-1513) Support Robots.txt for Ftp urls

2013-01-04 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13543720#comment-13543720 ] Tejas Patil commented on NUTCH-1513: For this to be supported, I have 2 approa

[jira] [Created] (NUTCH-1514) Phase out the deprecated configuration properties (if possible)

2013-01-06 Thread Tejas Patil (JIRA)
Tejas Patil created NUTCH-1514: -- Summary: Phase out the deprecated configuration properties (if possible) Key: NUTCH-1514 URL: https://issues.apache.org/jira/browse/NUTCH-1514 Project: Nutch

[jira] [Updated] (NUTCH-1514) Phase out the deprecated configuration properties (if possible)

2013-01-06 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil updated NUTCH-1514: --- Attachment: NUTCH-1514.patch Attached the patch for changes in nutch trunk. Please let me know your

[jira] [Updated] (NUTCH-1514) Phase out the deprecated configuration properties (if possible)

2013-01-06 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil updated NUTCH-1514: --- Attachment: NUTCH-1514-v2.patch Thanks Sebastian !! I removed those references in nutch-default.xml

[jira] [Updated] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons

2013-01-07 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil updated NUTCH-1031: --- Attachment: NUTCH-1031.v1.patch The changes are done. Please let me know your comments. One issue

[jira] [Commented] (NUTCH-1513) Support Robots.txt for Ftp urls

2013-01-07 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13545691#comment-13545691 ] Tejas Patil commented on NUTCH-1513: Hi Lewis, Thanks for your suggestion. I t

[jira] [Commented] (NUTCH-1494) RSS feed plugin seems broken

2013-01-07 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13546454#comment-13546454 ] Tejas Patil commented on NUTCH-1494: Hi Lewis, I could not run nutch with

[jira] [Updated] (NUTCH-1494) RSS feed plugin seems broken

2013-01-07 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil updated NUTCH-1494: --- Attachment: NUTCH-1494.3.patch @Lewis: it worked :) I have attached the patch. Please let me know

[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons

2013-01-07 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13546639#comment-13546639 ] Tejas Patil commented on NUTCH-1031: The current nutch robots parsing logic is

[jira] [Commented] (NUTCH-1274) Fix [cast] javac warnings

2013-01-11 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13551809#comment-13551809 ] Tejas Patil commented on NUTCH-1274: Hi Lewis, I will do those changes. You

[jira] [Commented] (NUTCH-1274) Fix [cast] javac warnings

2013-01-11 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13551815#comment-13551815 ] Tejas Patil commented on NUTCH-1274: Hi Lewis, I took a fresh checkout of trunk

[jira] [Updated] (NUTCH-1274) Fix [cast] javac warnings

2013-01-11 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil updated NUTCH-1274: --- Attachment: NUTCH-1274-2.x.v2.patch NUTCH-1274-trunk.v2.patch Hi Lewis, PFA the

[jira] [Updated] (NUTCH-1284) Add site fetcher.max.crawl.delay as log output by default.

2013-01-12 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil updated NUTCH-1284: --- Assignee: Tejas Patil > Add site fetcher.max.crawl.delay as log output by defa

[jira] [Commented] (NUTCH-1284) Add site fetcher.max.crawl.delay as log output by default.

2013-01-12 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13551857#comment-13551857 ] Tejas Patil commented on NUTCH-1284: Can anyone kindly review the p

[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons

2013-01-18 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13557930#comment-13557930 ] Tejas Patil commented on NUTCH-1031: After waiting for more than a week, I think

[jira] [Updated] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons

2013-01-20 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil updated NUTCH-1031: --- Attachment: CC.robots.multiple.agents.patch I looked at the source code of CC to understand how it

[jira] [Assigned] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons

2013-01-20 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil reassigned NUTCH-1031: -- Assignee: Tejas Patil (was: Julien Nioche) > Delegate parsing of robots.txt to craw

[jira] [Assigned] (NUTCH-1513) Support Robots.txt for Ftp urls

2013-01-20 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil reassigned NUTCH-1513: -- Assignee: Tejas Patil > Support Robots.txt for Ftp u

[jira] [Updated] (NUTCH-1284) Add site fetcher.max.crawl.delay as log output by default.

2013-01-20 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil updated NUTCH-1284: --- Attachment: NUTCH-1284-trunk.v1.patch Hi Lewis, If I recall correctly, we want the crawl delay for

[jira] [Assigned] (NUTCH-1042) Fetcher.max.crawl.delay property not taken into account correctly when set to -1

2013-01-20 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil reassigned NUTCH-1042: -- Assignee: Tejas Patil > Fetcher.max.crawl.delay property not taken into account correc

[jira] [Commented] (NUTCH-1042) Fetcher.max.crawl.delay property not taken into account correctly when set to -1

2013-01-20 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13558225#comment-13558225 ] Tejas Patil commented on NUTCH-1042: The patch for [NUTCH-1284|h

[jira] [Commented] (NUTCH-1329) parser not extract outlinks to external web sites

2013-01-20 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13558228#comment-13558228 ] Tejas Patil commented on NUTCH-1329: I am not able to reproduce this bug with

[jira] [Commented] (NUTCH-1042) Fetcher.max.crawl.delay property not taken into account correctly when set to -1

2013-01-20 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13558321#comment-13558321 ] Tejas Patil commented on NUTCH-1042: linked with NUTCH-

[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons

2013-01-20 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13558349#comment-13558349 ] Tejas Patil commented on NUTCH-1031: Hi Ken, Thanks for reviewing the patch. I

[jira] [Commented] (NUTCH-1223) Migrate WebGraph to MapReduce API

2013-01-20 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13558579#comment-13558579 ] Tejas Patil commented on NUTCH-1223: Hi Lufeng, One suggestion: There are lo
