[jira] [Commented] (NUTCH-2046) The crawl script should be able to skip an initial injection.

2017-04-18 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15973288#comment-15973288
 ] 

Hudson commented on NUTCH-2046:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3427 (See 
[https://builds.apache.org/job/Nutch-trunk/3427/])
fix for NUTCH-2046 contributed by jnioche (julien: 
[https://github.com/apache/nutch/commit/7b0103fe62c9b0e479bb03e7b9575522adcf68b8])
* (edit) src/bin/crawl


> The crawl script should be able to skip an initial injection.
> -
>
>                 Key: NUTCH-2046
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2046
>             Project: Nutch
>          Issue Type: Improvement
>          Components: crawldb, injector
>    Affects Versions: 1.10
>            Reporter: Luis Lopez
>            Assignee: Julien Nioche
>              Labels: crawl, injection
>             Fix For: 1.14
>
>         Attachments: crawl.patch
>
>
> When our crawl gets really big a new injection takes considerable time as it 
> updates crawldb, the crawl script should be able to skip the injection and go 
> directly to the generate call.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (NUTCH-2046) The crawl script should be able to skip an initial injection.

2017-04-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15973183#comment-15973183
 ] 

ASF GitHub Bot commented on NUTCH-2046:
---

jnioche closed pull request #161: Fix for NUTCH-2046 contributed by jnioche
URL: https://github.com/apache/nutch/pull/161
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> The crawl script should be able to skip an initial injection.
> -
>
>                 Key: NUTCH-2046
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2046
>             Project: Nutch
>          Issue Type: Improvement
>          Components: crawldb, injector
>    Affects Versions: 1.10
>            Reporter: Luis Lopez
>            Assignee: Lewis John McGibbney
>              Labels: crawl, injection
>             Fix For: 1.14
>
>         Attachments: crawl.patch
>
>
> When our crawl gets really big a new injection takes considerable time as it 
> updates crawldb, the crawl script should be able to skip the injection and go 
> directly to the generate call.





[jira] [Commented] (NUTCH-2046) The crawl script should be able to skip an initial injection.

2017-04-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15973184#comment-15973184
 ] 

ASF GitHub Bot commented on NUTCH-2046:
---

jnioche commented on issue #161: Fix for NUTCH-2046 contributed by jnioche
URL: https://github.com/apache/nutch/pull/161#issuecomment-294931048
 
 
   Thanks for the reviews and the nudge. I will add a comment on CHANGES in a 
separate commit.
 



[jira] [Commented] (NUTCH-2046) The crawl script should be able to skip an initial injection.

2017-04-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15969916#comment-15969916
 ] 

ASF GitHub Bot commented on NUTCH-2046:
---

lewismc commented on issue #161: Fix for NUTCH-2046 contributed by jnioche
URL: https://github.com/apache/nutch/pull/161#issuecomment-294287322
 
 
   @jnioche 
 



[jira] [Commented] (NUTCH-2046) The crawl script should be able to skip an initial injection.

2017-04-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15960412#comment-15960412
 ] 

ASF GitHub Bot commented on NUTCH-2046:
---

sebastian-nagel commented on issue #161: Fix for NUTCH-2046 contributed by 
jnioche
URL: https://github.com/apache/nutch/pull/161#issuecomment-292463192
 
 
   +1 from my side, but we could add a note to CHANGES.txt (as part of a 
section about API-breaking changes). Users need to update scripts calling 
bin/crawl.
 



[jira] [Commented] (NUTCH-2046) The crawl script should be able to skip an initial injection.

2017-04-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15960235#comment-15960235
 ] 

ASF GitHub Bot commented on NUTCH-2046:
---

lewismc commented on issue #161: Fix for NUTCH-2046 contributed by jnioche
URL: https://github.com/apache/nutch/pull/161#issuecomment-292426269
 
 
   +1 for me @jnioche 
   @sebastian-nagel you had made a comment over on NUTCH-2046, are you happy 
with this approach? I've tested it locally and like it.
 



[jira] [Commented] (NUTCH-2046) The crawl script should be able to skip an initial injection.

2016-12-13 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15745073#comment-15745073
 ] 

Sebastian Nagel commented on NUTCH-2046:


A statement in change log and release notes that the behavior has changed 
should be sufficient.
In the long term, an optional argument is cleaner than a required 
position-dependent argument which can take a magic form.
But it's more important to agree on one solution and get it done finally: my +1

> The crawl script should be able to skip an initial injection.
> -
>
>                 Key: NUTCH-2046
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2046
>             Project: Nutch
>          Issue Type: Improvement
>          Components: crawldb, injector
>    Affects Versions: 1.10
>            Reporter: Luis Lopez
>            Assignee: Lewis John McGibbney
>              Labels: crawl, injection
>             Fix For: 1.13
>
>         Attachments: crawl.patch
>
>
> When our crawl gets really big a new injection takes considerable time as it 
> updates crawldb, the crawl script should be able to skip the injection and go 
> directly to the generate call.





[jira] [Commented] (NUTCH-2046) The crawl script should be able to skip an initial injection.

2016-12-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15744889#comment-15744889
 ] 

ASF GitHub Bot commented on NUTCH-2046:
---

GitHub user jnioche opened a pull request:

https://github.com/apache/nutch/pull/161

Fix for NUTCH-2046 contributed by jnioche

This makes the seed argument optional and is an alternative to the solution 
proposed in [https://issues.apache.org/jira/browse/NUTCH-2046]. The latter is 
also acceptable and has the advantage of not breaking compatibility. 
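
As a rough illustration of what an optional seed argument could look like in a bash wrapper, here is a minimal sketch. It is not the code from this pull request: the function names `parse_crawl_args` and `plan_first_steps`, the variable names, and the usage string are all hypothetical, and the steps are only echoed rather than executed.

```shell
#!/usr/bin/env bash
# Hypothetical sketch, NOT the actual src/bin/crawl code.
# parse_crawl_args sets SEEDDIR (empty when -s is absent), CRAWL_PATH and LIMIT.
parse_crawl_args() {
  SEEDDIR=""
  while [ $# -gt 0 ]; do
    case "$1" in
      -s)
        # -s must be followed by the seed directory
        [ $# -ge 2 ] || { echo "Missing argument for -s" >&2; return 1; }
        SEEDDIR="$2"
        shift 2
        ;;
      *)
        # first non-option argument: stop option parsing
        break
        ;;
    esac
  done
  # exactly two positional arguments must remain
  [ $# -eq 2 ] || { echo "Usage: crawl [-s <seed_dir>] <crawl_dir> <num_rounds>" >&2; return 1; }
  CRAWL_PATH="$1"
  LIMIT="$2"
}

# Prints the first steps that would run; inject appears only if a seed dir was given.
plan_first_steps() {
  if [ -n "$SEEDDIR" ]; then
    echo "inject $SEEDDIR $CRAWL_PATH/crawldb"
  fi
  echo "generate $CRAWL_PATH/crawldb $CRAWL_PATH/segments"
}

# With -s: an inject step is planned before generate.
parse_crawl_args -s urls crawl 2 && plan_first_steps
# Without -s: injection is skipped and only generate is planned.
parse_crawl_args crawl 2 && plan_first_steps
```

Because the seed directory is no longer a required positional argument in this sketch, existing callers that pass it positionally would have to be updated, which is the compatibility break mentioned above.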

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jnioche/nutch NUTCH-2046

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/nutch/pull/161.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #161


commit 7b0103fe62c9b0e479bb03e7b9575522adcf68b8
Author: Julien Nioche 
Date:   2016-12-13T11:03:08Z

fix for NUTCH-2046 contributed by jnioche






[jira] [Commented] (NUTCH-2046) The crawl script should be able to skip an initial injection.

2016-02-11 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15142863#comment-15142863
 ] 

Julien Nioche commented on NUTCH-2046:
--

I agree with the objective, but I'd rather have a consistent approach and deal 
with that in the same way as we do for indexing, i.e. [-s seedPath]. It 
shouldn't be difficult to do.

> The crawl script should be able to skip an initial injection.
> -
>
>                 Key: NUTCH-2046
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2046
>             Project: Nutch
>          Issue Type: Improvement
>          Components: crawldb, injector
>    Affects Versions: 1.10
>            Reporter: Luis Lopez
>            Assignee: Lewis John McGibbney
>              Labels: crawl, injection
>             Fix For: 1.12
>
>         Attachments: crawl.patch
>
>
> When our crawl gets really big a new injection takes considerable time as it 
> updates crawldb, the crawl script should be able to skip the injection and go 
> directly to the generate call.





[jira] [Commented] (NUTCH-2046) The crawl script should be able to skip an initial injection.

2016-02-11 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15142858#comment-15142858
 ] 

Chris A. Mattmann commented on NUTCH-2046:
--

+1 from me



[jira] [Commented] (NUTCH-2046) The crawl script should be able to skip an initial injection.

2016-02-11 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15142855#comment-15142855
 ] 

Lewis John McGibbney commented on NUTCH-2046:
-

I am +1 to committing this. Any comments [~jnioche]?



[jira] [Commented] (NUTCH-2046) The crawl script should be able to skip an initial injection.

2015-06-24 Thread Luis Lopez (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14599933#comment-14599933
 ] 

Luis Lopez commented on NUTCH-2046:
---

I used just -skipInject instead of the actual path because it's simpler. 
Also, for these cases it's usually a negative parameter, isn't it? Like 
-noFilter, -noParsing, etc.
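
The idea of passing -skipInject in place of the seed path can be sketched roughly as follows. The attached crawl.patch is not reproduced in this thread, so this is only an assumption about how such a check might look; `SKIP_MARKER` and `should_inject` are hypothetical names.

```shell
#!/usr/bin/env bash
# Hypothetical sketch; this is not the content of crawl.patch.
# The seed argument stays positional, but a magic value disables injection.
SKIP_MARKER="-skipInject"

# Succeeds (exit status 0) unless the seed argument is the skip marker.
should_inject() {
  [ "$1" != "$SKIP_MARKER" ]
}

for seed_arg in urls "-skipInject"; do
  if should_inject "$seed_arg"; then
    echo "would inject seeds from: $seed_arg"
  else
    echo "skipping injection, going straight to generate"
  fi
done
```

This keeps the argument positions unchanged, so existing invocations keep working, at the cost of giving one argument value a special meaning.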

> The crawl script should be able to skip an initial injection.
> -
>
>                 Key: NUTCH-2046
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2046
>             Project: Nutch
>          Issue Type: Improvement
>          Components: crawldb, injector
>    Affects Versions: 1.10
>            Reporter: Luis Lopez
>              Labels: crawl, injection
>             Fix For: 1.11
>
>         Attachments: crawl.patch
>
>
> When our crawl gets really big a new injection takes considerable time as it 
> updates crawldb, the crawl script should be able to skip the injection and go 
> directly to the generate call.





[jira] [Commented] (NUTCH-2046) The crawl script should be able to skip an initial injection.

2015-06-24 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14599840#comment-14599840
 ] 

Julien Nioche commented on NUTCH-2046:
--

Re the script: what about a positive parameter instead of a negative one (like 
we do for the indexing with -i)? We could have -s followed by the path to the seed.

> The crawl script should be able to skip an initial injection.
> -
>
>                 Key: NUTCH-2046
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2046
>             Project: Nutch
>          Issue Type: Improvement
>          Components: crawldb, injector
>    Affects Versions: 1.10
>            Reporter: Luis Lopez
>              Labels: crawl, injection
>             Fix For: 1.11
>
>
> When our crawl gets really big a new injection takes considerable time as it 
> updates crawldb, the crawl script should be able to skip the injection and go 
> directly to the generate call.





[jira] [Commented] (NUTCH-2046) The crawl script should be able to skip an initial injection.

2015-06-24 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14599750#comment-14599750
 ] 

Lewis John McGibbney commented on NUTCH-2046:
-

Hi [~betolink], this is a nice issue. I think that we could easily have a 
[-skipInject] flag to the crawl script. 
Are you able to provide a patch?
