[jira] [Commented] (NUTCH-1457) Nutch2 Refactor the update process so that fetched items are only processed once

2013-07-30 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13723956#comment-13723956 ] Ferdy Galema commented on NUTCH-1457: - Hi, Thanks for submitting the patch. It s

[jira] [Commented] (NUTCH-1457) Nutch2 Refactor the update process so that fetched items are only processed once

2013-07-17 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13711529#comment-13711529 ] Ferdy Galema commented on NUTCH-1457: - Ok cool. Like Lewis said it would be bes

[jira] [Commented] (NUTCH-1457) Nutch2 Refactor the update process so that fetched items are only processed once

2013-07-10 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13704505#comment-13704505 ] Ferdy Galema commented on NUTCH-1457: - That seems like a nice solution, alth

[jira] [Commented] (NUTCH-1457) Nutch2 Refactor the update process so that fetched items are only processed once

2013-07-05 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13700978#comment-13700978 ] Ferdy Galema commented on NUTCH-1457: - That should work. Can't think of a r

[jira] [Commented] (NUTCH-1508) Port limit crawler to defined depth to 2.x

2013-01-07 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13545766#comment-13545766 ] Ferdy Galema commented on NUTCH-1508: - NUTCH-1431 (aka 'distance'

[jira] [Comment Edited] (NUTCH-1508) Port limit crawler to defined depth to 2.x

2013-01-07 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13545748#comment-13545748 ] Ferdy Galema edited comment on NUTCH-1508 at 1/7/13 10:1

[jira] [Commented] (NUTCH-1508) Port limit crawler to defined depth to 2.x

2013-01-07 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13545748#comment-13545748 ] Ferdy Galema commented on NUTCH-1508: - Hi, Is this related to? h

[jira] [Commented] (NUTCH-1495) -normalize and -filter for updatedb command in nutch 2.x

2012-11-20 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13500974#comment-13500974 ] Ferdy Galema commented on NUTCH-1495: - Fair enough. I understand the reasonin

[jira] [Commented] (NUTCH-1495) -normalize and -filter for updatedb command in nutch 2.x

2012-11-19 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13500895#comment-13500895 ] Ferdy Galema commented on NUTCH-1495: - Hi, Nice one! I took a glance at your p

[jira] [Commented] (NUTCH-1370) Expose exact number of urls injected @runtime

2012-11-09 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13493885#comment-13493885 ] Ferdy Galema commented on NUTCH-1370: - Hi, I checked the patch, it seems you

[jira] [Commented] (NUTCH-1489) elasticindex should report the indexed documents like solrindex does

2012-11-09 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13493829#comment-13493829 ] Ferdy Galema commented on NUTCH-1489: - Agree with Lewis, it seems there is alr

[jira] [Commented] (NUTCH-1484) TableUtil unreverseURL fails on file:// URLs

2012-11-09 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13493820#comment-13493820 ] Ferdy Galema commented on NUTCH-1484: - Hi, I checked the patch (attached in N

[jira] [Commented] (NUTCH-1457) Nutch2 Refactor the update process so that fetched items are only processed once

2012-11-08 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13493565#comment-13493565 ] Ferdy Galema commented on NUTCH-1457: - There is a limited description of the Nu

[jira] [Commented] (NUTCH-1457) Nutch2 Refactor the update process so that fetched items are only processed once

2012-11-06 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13491289#comment-13491289 ] Ferdy Galema commented on NUTCH-1457: - Hi, Not really because with a partial up

[jira] [Commented] (NUTCH-1457) Nutch2 Refactor the update process so that fetched items are only processed once

2012-10-08 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13471480#comment-13471480 ] Ferdy Galema commented on NUTCH-1457: - Included effort is resolving the conflic

[jira] [Resolved] (NUTCH-1468) Redirects that are external links not adhering to db.ignore.external.links

2012-09-17 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema resolved NUTCH-1468. - Resolution: Fixed Fix Version/s: 2.1 Committed @ Nutch2.x ref 1386526 Thanks for the

Re: Nutch 2.1 Release???

2012-09-17 Thread Ferdy Galema
2.1 sounds good! On Sun, Sep 16, 2012 at 12:14 AM, Lewis John Mcgibbney < lewis.mcgibb...@gmail.com> wrote: > Hi, > > On Sat, Sep 15, 2012 at 10:38 PM, Markus Jelsma > wrote: > > Trunk has some unresolved issues that are eligible for 1.6. Someone here > can create a 1.7 version in Jira? Then we

[jira] [Commented] (NUTCH-872) Change the default fetcher.parse to FALSE

2012-09-10 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13451910#comment-13451910 ] Ferdy Galema commented on NUTCH-872: That IS really weird. Not sure why it doe

[jira] [Commented] (NUTCH-1468) Redirects that are external links not adhering to db.ignore.external.links

2012-09-10 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13451897#comment-13451897 ] Ferdy Galema commented on NUTCH-1468: - A nice catch indeed. Looks fine. I'

[jira] [Closed] (NUTCH-1456) Updater not setting batchId in markers correctly.

2012-09-07 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema closed NUTCH-1456. --- Resolution: Fixed Tested the patch and it works. Thanks Alexander. Committed @ Nutch2.x ref 1382037

[jira] [Commented] (NUTCH-1459) Remove dead code (phase2) from InjectorJob

2012-09-07 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13450514#comment-13450514 ] Ferdy Galema commented on NUTCH-1459: - Ok. (If it is still not right, just let me

[jira] [Commented] (NUTCH-1459) Remove dead code (phase2) from InjectorJob

2012-09-07 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13450488#comment-13450488 ] Ferdy Galema commented on NUTCH-1459: - Hi, Do you mean "Committed @ Nutc

[jira] [Commented] (NUTCH-872) Change the default fetcher.parse to FALSE

2012-09-07 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13450461#comment-13450461 ] Ferdy Galema commented on NUTCH-872: Christian, I ran a testcrawl with Nutch2.x br

[jira] [Closed] (NUTCH-1459) Remove dead code (phase2) from InjectorJob

2012-09-07 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema closed NUTCH-1459. --- Resolution: Fixed Committed. > Remove dead code (phase2) from Injector

[jira] [Updated] (NUTCH-1459) Remove dead code (phase2) from InjectorJob

2012-09-07 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema updated NUTCH-1459: Attachment: nutch-1459.txt > Remove dead code (phase2) from Injector

[jira] [Commented] (NUTCH-1461) Problem with TableUtil

2012-08-31 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13446518#comment-13446518 ] Ferdy Galema commented on NUTCH-1461: - Added comment in NUTCH-

[jira] [Commented] (NUTCH-1448) Redirected urls should be handled more cleanly (more like an outlink url)

2012-08-31 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13446515#comment-13446515 ] Ferdy Galema commented on NUTCH-1448: - Yes it does show up as an outlink. About

[jira] [Commented] (NUTCH-872) Change the default fetcher.parse to FALSE

2012-08-31 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13446511#comment-13446511 ] Ferdy Galema commented on NUTCH-872: Yes that is correct. >

[jira] [Closed] (NUTCH-1431) Introduce link 'distance' and add configurable max distance in the generator

2012-08-31 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema closed NUTCH-1431. --- Resolution: Fixed committed > Introduce link 'distance' and add con

[jira] [Updated] (NUTCH-1456) Updater not setting batchId in markers correctly.

2012-08-31 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema updated NUTCH-1456: Fix Version/s: 2.1 > Updater not setting batchId in markers correc

[jira] [Closed] (NUTCH-1448) Redirected urls should be handled more cleanly (more like an outlink url)

2012-08-31 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema closed NUTCH-1448. --- Resolution: Fixed Committed. > Redirected urls should be handled more cleanly (m

[jira] [Closed] (NUTCH-1463) Elasticsearch indexer should wait and check response for last flush

2012-08-31 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema closed NUTCH-1463. --- Resolution: Fixed committed. > Elasticsearch indexer should wait and check respo

[jira] [Updated] (NUTCH-1463) Elasticsearch indexer should wait and check response for last flush

2012-08-31 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema updated NUTCH-1463: Attachment: nutch-1463.patch > Elasticsearch indexer should wait and check response for l

[jira] [Created] (NUTCH-1463) Elasticsearch indexer should wait and check response for last flush

2012-08-31 Thread Ferdy Galema (JIRA)
Ferdy Galema created NUTCH-1463: --- Summary: Elasticsearch indexer should wait and check response for last flush Key: NUTCH-1463 URL: https://issues.apache.org/jira/browse/NUTCH-1463 Project: Nutch

[jira] [Closed] (NUTCH-1462) Elasticsearch not indexing when type==null in NutchDocument metadata

2012-08-31 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema closed NUTCH-1462. --- Resolution: Fixed committed > Elasticsearch not indexing when type==null

[jira] [Commented] (NUTCH-1445) Add ElasticIndexerJob that indexes to elasticsearch

2012-08-31 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13445878#comment-13445878 ] Ferdy Galema commented on NUTCH-1445: - Created NUTCH-1462 for a fix. For a quick

[jira] [Updated] (NUTCH-1462) Elasticsearch not indexing when type==null in NutchDocument metadata

2012-08-31 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema updated NUTCH-1462: Attachment: nutch-1462.patch > Elasticsearch not indexing when type==null in NutchDocum

[jira] [Created] (NUTCH-1462) Elasticsearch not indexing when type==null in NutchDocument metadata

2012-08-31 Thread Ferdy Galema (JIRA)
Ferdy Galema created NUTCH-1462: --- Summary: Elasticsearch not indexing when type==null in NutchDocument metadata Key: NUTCH-1462 URL: https://issues.apache.org/jira/browse/NUTCH-1462 Project: Nutch

[jira] [Commented] (NUTCH-1445) Add ElasticIndexerJob that indexes to elasticsearch

2012-08-31 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13445871#comment-13445871 ] Ferdy Galema commented on NUTCH-1445: - Ah I got it now. It's definitely a

[jira] [Commented] (NUTCH-1445) Add ElasticIndexerJob that indexes to elasticsearch

2012-08-31 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13445850#comment-13445850 ] Ferdy Galema commented on NUTCH-1445: - ("feature requests" should be &q

[jira] [Commented] (NUTCH-1445) Add ElasticIndexerJob that indexes to elasticsearch

2012-08-31 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13445849#comment-13445849 ] Ferdy Galema commented on NUTCH-1445: - Hi Matt, Sure we can resolve your issue

[jira] [Updated] (NUTCH-1448) Redirected urls should be handled more cleanly (more like an outlink url)

2012-08-28 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema updated NUTCH-1448: Attachment: nutch-1448.txt Thank you for your interest, Christian. This issue should indeed prevent

[jira] [Created] (NUTCH-1459) Remove dead code (phase2) from InjectorJob

2012-08-17 Thread Ferdy Galema (JIRA)
Ferdy Galema created NUTCH-1459: --- Summary: Remove dead code (phase2) from InjectorJob Key: NUTCH-1459 URL: https://issues.apache.org/jira/browse/NUTCH-1459 Project: Nutch Issue Type

Re: A FetchSchedule bug makes fetch time becoming more and more big

2012-08-15 Thread Ferdy Galema
Hi, Yeah, this is something I noticed too a while ago. Although it does not directly break crawling, it is not a nice implementation. Notice that the Generator tries to correct for a fetch time too far off in the future (in the AbstractFetchSchedule shouldFetch method). As a matter o

[jira] [Created] (NUTCH-1457) Nutch2 Refactor the update process so that fetched items are only processed once

2012-08-15 Thread Ferdy Galema (JIRA)
Ferdy Galema created NUTCH-1457: --- Summary: Nutch2 Refactor the update process so that fetched items are only processed once Key: NUTCH-1457 URL: https://issues.apache.org/jira/browse/NUTCH-1457 Project

Re: DbUpdateReducer could not mark it's batchid

2012-08-15 Thread Ferdy Galema
Hi, This bug was already reported a few posts ago on the mailing list, but thanks anyway for reporting it. I have created an issue to keep track of it: https://issues.apache.org/jira/browse/NUTCH-1456 Ferdy. On Wed, Aug 15, 2012 at 1:59 PM, lin weijian wrote: > Hi, > i find a bug in nu

[jira] [Created] (NUTCH-1456) Updater not setting batchId in markers correctly.

2012-08-15 Thread Ferdy Galema (JIRA)
Ferdy Galema created NUTCH-1456: --- Summary: Updater not setting batchId in markers correctly. Key: NUTCH-1456 URL: https://issues.apache.org/jira/browse/NUTCH-1456 Project: Nutch Issue Type

[jira] [Commented] (NUTCH-1434) Indexer to delete robots noIndex

2012-08-15 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13435000#comment-13435000 ] Ferdy Galema commented on NUTCH-1434: - +1 for removing commandline args and u

[jira] [Created] (NUTCH-1452) hadoop.job.history.user.location in nutch-default making job history useless

2012-08-14 Thread Ferdy Galema (JIRA)
Ferdy Galema created NUTCH-1452: --- Summary: hadoop.job.history.user.location in nutch-default making job history useless Key: NUTCH-1452 URL: https://issues.apache.org/jira/browse/NUTCH-1452 Project

Re: hadoop.job.history.user.location in nutch-default with CDH rendering job history useless

2012-08-14 Thread Ferdy Galema
FYI I've created a Jira for followup discussion. https://issues.apache.org/jira/browse/NUTCH-1452 On Tue, Aug 7, 2012 at 11:21 AM, Ferdy Galema wrote: > Hi, > > There still is a property in nutch-default > 'hadoop.job.history.user.location' that redirects the creation

[jira] [Closed] (NUTCH-1365) Fix crawlId functionalilty by making using of new gora configuration

2012-08-14 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema closed NUTCH-1365. --- Resolution: Fixed committed > Fix crawlId functionalilty by making using of

[jira] [Closed] (NUTCH-1442) indexingfilter.order is property is misread in code

2012-08-14 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema closed NUTCH-1442. --- > indexingfilter.order is property is misread in c

[jira] [Commented] (NUTCH-1442) indexingfilter.order is property is misread in code

2012-08-14 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13433955#comment-13433955 ] Ferdy Galema commented on NUTCH-1442: - Thanks. Looks fine. Assertions should

[jira] [Updated] (NUTCH-1448) Redirected urls should be handled more cleanly (more like an outlink url)

2012-08-13 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema updated NUTCH-1448: Description: This is specifically for Nutch2.x. Handling a redirect URL like an outlink is much

[jira] [Created] (NUTCH-1448) Redirected urls should be handled more cleanly (more like an outlink url)

2012-08-13 Thread Ferdy Galema (JIRA)
Ferdy Galema created NUTCH-1448: --- Summary: Redirected urls should be handled more cleanly (more like an outlink url) Key: NUTCH-1448 URL: https://issues.apache.org/jira/browse/NUTCH-1448 Project: Nutch

Re: Happy 10th Birthday Nutch!

2012-08-09 Thread Ferdy Galema
Cheers! On Thu, Aug 9, 2012 at 9:56 AM, Julien Nioche wrote: > Doug Cutting on twitter : > https://twitter.com/cutting/status/233415059798372353 > > *RT @StefanGroschupf: Happy 10th birthday#Nutch! Registered at sourceforce > august 2002. Turned out to be quite a game changer. #Hadoop > * > Happ

hadoop.job.history.user.location in nutch-default with CDH rendering job history useless

2012-08-07 Thread Ferdy Galema
Hi, There still is a property in nutch-default 'hadoop.job.history.user.location' that redirects the creation of history files from job output locations to a custom location. I noticed that the current value does not work well with CDH, because ${hadoop.log.dir} is not defined. This actually cause

[jira] [Commented] (NUTCH-1444) Indexing should not create temporary files (do not extend from FileOutputFormat)

2012-08-07 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13429986#comment-13429986 ] Ferdy Galema commented on NUTCH-1444: - Just to add: The following exception is f

[jira] [Created] (NUTCH-1446) Port NUTCH-1444 to trunk (Indexing should not create temporary files)

2012-08-06 Thread Ferdy Galema (JIRA)
Ferdy Galema created NUTCH-1446: --- Summary: Port NUTCH-1444 to trunk (Indexing should not create temporary files) Key: NUTCH-1446 URL: https://issues.apache.org/jira/browse/NUTCH-1446 Project: Nutch

[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2012-08-06 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13429104#comment-13429104 ] Ferdy Galema commented on NUTCH-1047: - Ah yes I think that is what we should aim

[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2012-08-06 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13429095#comment-13429095 ] Ferdy Galema commented on NUTCH-1047: - I did not mean to confuse people by u

[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2012-08-06 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13429093#comment-13429093 ] Ferdy Galema commented on NUTCH-1047: - Changing NutchIndexWriter into an endp

[jira] [Commented] (NUTCH-1445) Add ElasticIndexerJob that indexes to elasticsearch

2012-08-06 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13429067#comment-13429067 ] Ferdy Galema commented on NUTCH-1445: - Hi Julien, Agreed to wait a while be

[jira] [Closed] (NUTCH-1445) Add ElasticIndexerJob that indexes to elasticsearch

2012-08-03 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema closed NUTCH-1445. --- Resolution: Fixed > Add ElasticIndexerJob that indexes to elasticsea

[jira] [Updated] (NUTCH-1445) Add ElasticIndexerJob that indexes to elasticsearch

2012-08-03 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema updated NUTCH-1445: Attachment: NUTCH-1445-addPropsToConfig.patch Final addition that adds the properties to nutch

[jira] [Commented] (NUTCH-1365) Fix crawlId functionalilty by making using of new gora configuration

2012-08-02 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13427226#comment-13427226 ] Ferdy Galema commented on NUTCH-1365: - Nutch should be updated to Gora

[jira] [Updated] (NUTCH-1445) Add ElasticIndexerJob that indexes to elasticsearch

2012-08-01 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema updated NUTCH-1445: Attachment: NUTCH-1445-addToNutchScript.patch Added and committed patch that adds command to Nutch

[jira] [Commented] (NUTCH-1445) Add ElasticIndexerJob that indexes to elasticsearch

2012-08-01 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13426660#comment-13426660 ] Ferdy Galema commented on NUTCH-1445: - committed in Nutch2 &

[jira] [Closed] (NUTCH-1444) Indexing should not create temporary files (do not extend from FileOutputFormat)

2012-08-01 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema closed NUTCH-1444. --- Resolution: Fixed committed > Indexing should not create temporary files (do

[jira] [Updated] (NUTCH-1445) Add ElasticIndexerJob that indexes to elasticsearch

2012-08-01 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema updated NUTCH-1445: Attachment: NUTCH-1445.patch > Add ElasticIndexerJob that indexes to elasticsea

[jira] [Updated] (NUTCH-1444) Indexing should not create temporary files (do not extend from FileOutputFormat)

2012-08-01 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema updated NUTCH-1444: Attachment: NUTCH-1444.patch > Indexing should not create temporary files (do not extend f

[jira] [Created] (NUTCH-1445) Add ElasticIndexerJob that indexes to elasticsearch

2012-08-01 Thread Ferdy Galema (JIRA)
Ferdy Galema created NUTCH-1445: --- Summary: Add ElasticIndexerJob that indexes to elasticsearch Key: NUTCH-1445 URL: https://issues.apache.org/jira/browse/NUTCH-1445 Project: Nutch Issue Type

[jira] [Created] (NUTCH-1444) Indexing should not create temporary files (do not extend from FileOutputFormat)

2012-08-01 Thread Ferdy Galema (JIRA)
Ferdy Galema created NUTCH-1444: --- Summary: Indexing should not create temporary files (do not extend from FileOutputFormat) Key: NUTCH-1444 URL: https://issues.apache.org/jira/browse/NUTCH-1444 Project

[jira] [Updated] (NUTCH-1441) AnchorIndexingFilter should use plain HashSet

2012-07-30 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema updated NUTCH-1441: Attachment: NUTCH-1441-trunk.patch Patch for trunk. It would be great if you could apply and test

[jira] [Reopened] (NUTCH-1441) AnchorIndexingFilter should use plain HashSet

2012-07-30 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema reopened NUTCH-1441: - > AnchorIndexingFilter should use plain Hash

[jira] [Updated] (NUTCH-1441) AnchorIndexingFilter should use plain HashSet

2012-07-30 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema updated NUTCH-1441: Patch Info: Patch Available Fix Version/s: 1.6 > AnchorIndexingFilter should use pl

[jira] [Created] (NUTCH-1442) indexingfilter.order is property is misread in code

2012-07-30 Thread Ferdy Galema (JIRA)
Ferdy Galema created NUTCH-1442: --- Summary: indexingfilter.order is property is misread in code Key: NUTCH-1442 URL: https://issues.apache.org/jira/browse/NUTCH-1442 Project: Nutch Issue Type

[jira] [Closed] (NUTCH-1441) AnchorIndexingFilter should use plain HashSet

2012-07-30 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema closed NUTCH-1441. --- Resolution: Fixed committed > AnchorIndexingFilter should use plain Hash

[jira] [Updated] (NUTCH-1441) AnchorIndexingFilter should use plain HashSet

2012-07-30 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema updated NUTCH-1441: Attachment: NUTCH-1441.patch > AnchorIndexingFilter should use plain Hash

[jira] [Created] (NUTCH-1441) AnchorIndexingFilter should use plain HashSet

2012-07-30 Thread Ferdy Galema (JIRA)
Ferdy Galema created NUTCH-1441: --- Summary: AnchorIndexingFilter should use plain HashSet Key: NUTCH-1441 URL: https://issues.apache.org/jira/browse/NUTCH-1441 Project: Nutch Issue Type: Bug

[jira] [Closed] (NUTCH-1438) ParserJob support for option -reparse

2012-07-26 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema closed NUTCH-1438. --- Resolution: Fixed committed > ParserJob support for option -repa

[jira] [Created] (NUTCH-1438) ParserJob support for option -reparse

2012-07-26 Thread Ferdy Galema (JIRA)
Ferdy Galema created NUTCH-1438: --- Summary: ParserJob support for option -reparse Key: NUTCH-1438 URL: https://issues.apache.org/jira/browse/NUTCH-1438 Project: Nutch Issue Type: New Feature

[jira] [Updated] (NUTCH-1438) ParserJob support for option -reparse

2012-07-26 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema updated NUTCH-1438: Attachment: NUTCH-1438.patch > ParserJob support for option -repa

[jira] [Updated] (NUTCH-1365) Fix crawlId functionalilty by making using of new gora configuration

2012-07-25 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema updated NUTCH-1365: Attachment: NUTCH-1365-v4.patch new patch fixes crawlId functionality for HostInjectorJob too

[jira] [Closed] (NUTCH-1437) HostInjectorJob to accept lines with or without protocol

2012-07-25 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema closed NUTCH-1437. --- Resolution: Fixed reopening/closing to set correct resolve status (FIXED

[jira] [Closed] (NUTCH-1437) HostInjectorJob to accept lines with or without protocol

2012-07-25 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema closed NUTCH-1437. --- Resolution: Cannot Reproduce committed > HostInjectorJob to accept lines with

[jira] [Reopened] (NUTCH-1437) HostInjectorJob to accept lines with or without protocol

2012-07-25 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema reopened NUTCH-1437: - > HostInjectorJob to accept lines with or without proto

[jira] [Updated] (NUTCH-1437) HostInjectorJob to accept lines with or without protocol

2012-07-25 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema updated NUTCH-1437: Attachment: NUTCH-1437.patch > HostInjectorJob to accept lines with or without proto

[jira] [Created] (NUTCH-1437) HostInjectorJob to accept lines with or without protocol

2012-07-25 Thread Ferdy Galema (JIRA)
Ferdy Galema created NUTCH-1437: --- Summary: HostInjectorJob to accept lines with or without protocol Key: NUTCH-1437 URL: https://issues.apache.org/jira/browse/NUTCH-1437 Project: Nutch Issue

[jira] [Updated] (NUTCH-1365) Fix crawlId functionalilty by making using of new gora configuration

2012-07-20 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema updated NUTCH-1365: Attachment: NUTCH-1365-v3.patch Small improvement of the patch by showing the crawlId name in the

[jira] [Updated] (NUTCH-1365) Fix crawlId functionalilty by making using of new gora configuration

2012-07-19 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema updated NUTCH-1365: Attachment: NUTCH-1365-v2.patch Updated patch for new version of GORA-150. >

[jira] [Created] (NUTCH-1432) property storage.schema does not work anymore, should be storage.schema.webpage and storage.schema.host

2012-07-19 Thread Ferdy Galema (JIRA)
Ferdy Galema created NUTCH-1432: --- Summary: property storage.schema does not work anymore, should be storage.schema.webpage and storage.schema.host Key: NUTCH-1432 URL: https://issues.apache.org/jira/browse/NUTCH

[jira] [Updated] (NUTCH-1365) Fix crawlId functionalilty by making using of new gora configuration

2012-07-18 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema updated NUTCH-1365: Attachment: NUTCH-1365.patch The updated patch. (Because of the splitting up of the corresponding

[jira] [Commented] (NUTCH-1365) Fix crawlId functionalilty by making using of new gora configuration

2012-07-18 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13417120#comment-13417120 ] Ferdy Galema commented on NUTCH-1365: - When we update Gora to 0.3, we can commit

[jira] [Updated] (NUTCH-1365) Fix crawlId functionalilty by making using of new gora configuration

2012-07-18 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema updated NUTCH-1365: Attachment: (was: NUTCH-1365.patch) > Fix crawlId functionalilty by making using of

[jira] [Commented] (NUTCH-1431) Introduce link 'distance' and add configurable max distance in the generator

2012-07-18 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13417005#comment-13417005 ] Ferdy Galema commented on NUTCH-1431: - It is a way to keep the size of a crawl wi

[jira] [Updated] (NUTCH-1431) Introduce link 'distance' and add configurable max distance in the generator

2012-07-18 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema updated NUTCH-1431: Attachment: NUTCH-1431.patch > Introduce link 'distance' and add configurable ma

[jira] [Created] (NUTCH-1431) Introduce link 'distance' and add configurable max distance in the generator

2012-07-18 Thread Ferdy Galema (JIRA)
Ferdy Galema created NUTCH-1431: --- Summary: Introduce link 'distance' and add configurable max distance in the generator Key: NUTCH-1431 URL: https://issues.apache.org/jira/browse/NUTCH-1431

[jira] [Commented] (NUTCH-1360) Suport the storing of IP address connected to when web crawling

2012-07-10 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13410884#comment-13410884 ] Ferdy Galema commented on NUTCH-1360: - Thanks! Keep up the good

[jira] [Closed] (NUTCH-1428) GeneratorMapper should not initialize filters/normalizers when they are disabled

2012-07-10 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema closed NUTCH-1428. --- Resolution: Fixed committed. > GeneratorMapper should not initialize filt
