[jira] [Commented] (NUTCH-1296) nutchgora fetcher does not show correct 'threads' and 'resuming' properties

2012-03-02 Thread Hudson (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13221508#comment-13221508
 ] 

Hudson commented on NUTCH-1296:
---

Integrated in Nutch-nutchgora #181 (See 
[https://builds.apache.org/job/Nutch-nutchgora/181/])
NUTCH-1296 nutchgora fetcher does not show correct 'threads' and 'resuming' 
properties (Revision 1296203)

 Result = SUCCESS
ferdy : 
Files : 
* /nutch/branches/nutchgora/CHANGES.txt
* /nutch/branches/nutchgora/src/java/org/apache/nutch/fetcher/FetcherJob.java


> nutchgora fetcher does not show correct 'threads' and 'resuming' properties
> ---
>
> Key: NUTCH-1296
> URL: https://issues.apache.org/jira/browse/NUTCH-1296
> Project: Nutch
>  Issue Type: Bug
>Reporter: Ferdy Galema
>Priority: Trivial
> Fix For: nutchgora
>
>
> The nutchgora FetcherJob logs the 'threads' and 'resuming' properties just 
> before fetching, but they are read from the config. (This ignores the fact that 
> they are also specified as parameters; these parameters are only set on the 
> config later.)
> A trivial fix will follow right away.
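
The ordering problem in the description above can be sketched as follows. This is only an illustration under stated assumptions, not the actual FetcherJob code: a plain `Map` stands in for the job configuration, and the key name `threads` and the default value `10` are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

public class FetcherLogSketch {
    // Hypothetical key; a plain Map stands in for the job configuration.
    static final String THREADS_KEY = "threads";

    // Buggy order: the value is read from the config before the
    // command-line parameter has been copied into it.
    static String logBeforeSet(Map<String, String> conf, int threadsParam) {
        String line = "threads=" + conf.getOrDefault(THREADS_KEY, "10");
        conf.put(THREADS_KEY, Integer.toString(threadsParam));
        return line;
    }

    // Fixed order: apply the parameter first, then log what will be used.
    static String logAfterSet(Map<String, String> conf, int threadsParam) {
        conf.put(THREADS_KEY, Integer.toString(threadsParam));
        return "threads=" + conf.get(THREADS_KEY);
    }

    public static void main(String[] args) {
        System.out.println(logBeforeSet(new HashMap<>(), 50)); // threads=10
        System.out.println(logAfterSet(new HashMap<>(), 50));  // threads=50
    }
}
```

With the first ordering the log line shows the stale default even though the job goes on to use the parameter value.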

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (NUTCH-1295) nutchgora restlet dependencies failing when remote repos is down

2012-03-02 Thread Hudson (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13221509#comment-13221509
 ] 

Hudson commented on NUTCH-1295:
---

Integrated in Nutch-nutchgora #181 (See 
[https://builds.apache.org/job/Nutch-nutchgora/181/])
NUTCH-1295 nutchgora restlet dependencies failing when remote repos is down 
(Revision 1296114)

 Result = SUCCESS
ferdy : 
Files : 
* /nutch/branches/nutchgora/CHANGES.txt
* /nutch/branches/nutchgora/ivy/ivysettings.xml


> nutchgora restlet dependencies failing when remote repos is down
> 
>
> Key: NUTCH-1295
> URL: https://issues.apache.org/jira/browse/NUTCH-1295
> Project: Nutch
>  Issue Type: Bug
>Reporter: Ferdy Galema
> Attachments: NUTCH-1295.patch
>
>
> Currently the head of nutchgora cannot be built when running "ant clean 
> runtime". This is because the restlet dependencies cannot be found, even 
> though there are local restlet copies in the ivy2 cache dir. Did we not 
> have this problem before?
> Anyway, I found a solution. Basically, I renamed the resolver so that its name 
> differs from the chain name. This way the restlet dependencies are read from 
> the local cache when the remote one is not available. See patch for details.





[jira] [Commented] (NUTCH-1292) Better exception logging and debugging during fetch.

2012-03-02 Thread Hudson (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13221506#comment-13221506
 ] 

Hudson commented on NUTCH-1292:
---

Integrated in Nutch-nutchgora #181 (See 
[https://builds.apache.org/job/Nutch-nutchgora/181/])
NUTCH-1292 Better exception logging and debugging during fetch. (Revision 
1296239)

 Result = SUCCESS
ferdy : 
Files : 
* /nutch/branches/nutchgora/CHANGES.txt
* /nutch/branches/nutchgora/src/java/org/apache/nutch/fetcher/FetchEntry.java
* /nutch/branches/nutchgora/src/java/org/apache/nutch/fetcher/FetcherReducer.java
* /nutch/branches/nutchgora/src/java/org/apache/nutch/scoring/ScoreDatum.java


> Better exception logging and debugging during fetch.
> 
>
> Key: NUTCH-1292
> URL: https://issues.apache.org/jira/browse/NUTCH-1292
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Ferdy Galema
>Priority: Trivial
> Fix For: nutchgora
>
>






[jira] [Commented] (NUTCH-1263) FetcherJob must put 'fetchTime' on input

2012-03-02 Thread Hudson (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13221507#comment-13221507
 ] 

Hudson commented on NUTCH-1263:
---

Integrated in Nutch-nutchgora #181 (See 
[https://builds.apache.org/job/Nutch-nutchgora/181/])
NUTCH-1263 FetcherJob must put 'fetchTime' on input (Revision 1296236)

 Result = SUCCESS
ferdy : 
Files : 
* /nutch/branches/nutchgora/CHANGES.txt
* /nutch/branches/nutchgora/src/java/org/apache/nutch/fetcher/FetcherJob.java


> FetcherJob must put 'fetchTime' on input
> 
>
> Key: NUTCH-1263
> URL: https://issues.apache.org/jira/browse/NUTCH-1263
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: nutchgora
>Reporter: Ferdy Galema
>Priority: Minor
> Fix For: nutchgora
>
> Attachments: NUTCH-1263.patch
>
>
> The reducer of the fetcher reads the field fetchTime, but does not include it 
> on the input. A trivial patch fixes this.
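
A rough sketch of why a field that is read but not requested on the input causes trouble. It assumes a store that only loads the fields a job asks for; the `load` helper and the field names other than `fetchTime` are hypothetical and are not taken from the patch.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class FieldProjectionSketch {
    // Returns a copy of the row restricted to the requested fields,
    // mimicking a store that only loads what the job asked for.
    static Map<String, Object> load(Map<String, Object> row, Set<String> fields) {
        Map<String, Object> loaded = new HashMap<>();
        for (String field : fields) {
            if (row.containsKey(field)) {
                loaded.put(field, row.get(field));
            }
        }
        return loaded;
    }

    public static void main(String[] args) {
        Map<String, Object> row = new HashMap<>();
        row.put("baseUrl", "http://example.com/");
        row.put("fetchTime", 1330646400000L);

        // fetchTime was not requested, so the reducer would see null here.
        Set<String> without = new HashSet<>(Arrays.asList("baseUrl"));
        System.out.println(load(row, without).get("fetchTime"));

        // Once fetchTime is part of the requested input fields it is available.
        Set<String> with = new HashSet<>(Arrays.asList("baseUrl", "fetchTime"));
        System.out.println(load(row, with).get("fetchTime"));
    }
}
```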





[jira] [Updated] (NUTCH-475) Adaptive crawl delay

2012-03-02 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-475:
---

Attachment: NUTCH-475.patch

Updated patch which brings this issue up to date with Dogacan's comments. 
None of Todd's work was ever uploaded; however, I think we should work towards 
an implementation as Enis suggested. I suppose we can try/test this 
implementation, as I have not done so yet.

> Adaptive crawl delay
> 
>
> Key: NUTCH-475
> URL: https://issues.apache.org/jira/browse/NUTCH-475
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Reporter: Doğacan Güney
> Attachments: NUTCH-475.patch, adaptive-delay_draft.patch
>
>
> The current fetcher implementation waits a default interval before making another 
> request to the same server (if crawl-delay is not specified in robots.txt). 
> IMHO, an adaptive implementation would be better. If the server is under 
> little load and can serve requests quickly, then the fetcher can ask for more pages 
> in a given interval. Similarly, if the server is suffering from heavy load, 
> the fetcher can slow down (w.r.t. that host), easing the load on the server.
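
One possible shape for the adaptive behaviour described above, as a minimal sketch. It is not the approach taken in either attached patch: the halve/double rule, the bounds, and the `targetMs` response-time threshold are all assumptions made for illustration.

```java
public class AdaptiveDelaySketch {
    // Hypothetical bounds so the delay never collapses to zero or grows unbounded.
    static final long MIN_DELAY_MS = 1000;
    static final long MAX_DELAY_MS = 30000;

    // Shrink the per-host delay when the host answered quickly, grow it when slow.
    static long nextDelay(long currentDelayMs, long responseTimeMs, long targetMs) {
        long next = responseTimeMs <= targetMs
                ? currentDelayMs / 2   // server looks lightly loaded
                : currentDelayMs * 2;  // server looks busy, back off
        return Math.max(MIN_DELAY_MS, Math.min(MAX_DELAY_MS, next));
    }

    public static void main(String[] args) {
        System.out.println(nextDelay(5000, 200, 1000));   // 2500
        System.out.println(nextDelay(5000, 3000, 1000));  // 10000
        System.out.println(nextDelay(20000, 5000, 1000)); // 30000 (clamped)
    }
}
```

A crawl-delay given in robots.txt would presumably still take precedence over any value computed this way.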





[jira] [Commented] (NUTCH-1253) Incompatible neko and xerces versions

2012-03-02 Thread Ferdy Galema (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13220984#comment-13220984
 ] 

Ferdy Galema commented on NUTCH-1253:
-

I'll give this one a go..

> Incompatible neko and xerces versions
> -
>
> Key: NUTCH-1253
> URL: https://issues.apache.org/jira/browse/NUTCH-1253
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.4
> Environment: Ubuntu 10.04
>Reporter: Dennis Spathis
> Attachments: NUTCH-1253-nutchgora.patch, NUTCH-1253.patch
>
>
> The Nutch 1.4 distribution includes
>  - nekohtml-0.9.5.jar (under .../runtime/local/plugins/lib-nekohtml)
>  - xercesImpl-2.9.1.jar (under .../runtime/local/lib)
> These two JARs appear to be incompatible versions. When the HtmlParser 
> (configured to use neko) is invoked during a local-mode crawl, the parse 
> fails due to an AbstractMethodError. (Note: To see the AbstractMethodError, 
> rebuild the HtmlParser plugin and add a
> catch(Throwable) clause in the getParse method to log the stacktrace.)
> I found that substituting a later, compatible version of nekohtml (1.9.11)
> fixes the problem.
> Curiously, and in support of the above, the nekohtml plugin.xml file in
> Nutch 1.4 contains the following:
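> <plugin
>    id="lib-nekohtml"
>    name="CyberNeko HTML Parser"
>    version="1.9.11"
>    provider-name="org.cyberneko">
>    <runtime>
>       <library name="nekohtml-0.9.5.jar">
>          <export name="*"/>
>       </library>
>    </runtime>
> </plugin>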
> Note the conflicting version numbers (version tag is "1.9.11" but the
> specified library is "nekohtml-0.9.5.jar").
> Was the 0.9.5 version included by mistake? Was the intention rather to
> include 1.9.11?
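
The debugging tip in the description (adding a `catch(Throwable)` clause) matters because `AbstractMethodError` is an `Error`, so a `catch (Exception e)` block never sees it. A small sketch of the idea, with the failure simulated; `safeParse` and the `Function` stand-in for the parser are hypothetical and are not HtmlParser code.

```java
import java.util.function.Function;

public class ThrowableLoggingSketch {
    // catch (Exception e) would miss Errors such as AbstractMethodError,
    // so Throwable is caught, logged, and turned into a failed result.
    static String safeParse(Function<String, String> parser, String html) {
        try {
            return parser.apply(html);
        } catch (Throwable t) {
            t.printStackTrace(); // makes the otherwise hidden stack trace visible
            return "FAILED: " + t.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        // Simulates the linkage failure caused by incompatible jar versions.
        Function<String, String> broken = html -> {
            throw new AbstractMethodError("simulated jar mismatch");
        };
        System.out.println(safeParse(broken, "<html></html>"));
    }
}
```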





[jira] [Closed] (NUTCH-1292) Better exception logging and debugging during fetch.

2012-03-02 Thread Ferdy Galema (Closed) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferdy Galema closed NUTCH-1292.
---

Resolution: Fixed

committed

> Better exception logging and debugging during fetch.
> 
>
> Key: NUTCH-1292
> URL: https://issues.apache.org/jira/browse/NUTCH-1292
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Ferdy Galema
>Priority: Trivial
> Fix For: nutchgora
>
>






[jira] [Closed] (NUTCH-1263) FetcherJob must put 'fetchTime' on input

2012-03-02 Thread Ferdy Galema (Closed) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferdy Galema closed NUTCH-1263.
---

   Resolution: Fixed
Fix Version/s: nutchgora

This one slipped under the radar.

Committed.

> FetcherJob must put 'fetchTime' on input
> 
>
> Key: NUTCH-1263
> URL: https://issues.apache.org/jira/browse/NUTCH-1263
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: nutchgora
>Reporter: Ferdy Galema
>Priority: Minor
> Fix For: nutchgora
>
> Attachments: NUTCH-1263.patch
>
>
> The reducer of the fetcher reads the field fetchTime, but does not include it 
> on the input. A trivial patch fixes this.





Re: Nutch with Letor

2012-03-02 Thread Lewis John Mcgibbney
Also, please ship this discussion to user@, as it seems to be more relevant
there.

Thanks

On Fri, Mar 2, 2012 at 2:13 PM, Lewis John Mcgibbney <
lewis.mcgibb...@gmail.com> wrote:

> Hi,
>
> Would be great if you could provide some links to the dataset, exactly
> what it is etc.
>
> Thank you
>
>
> On Fri, Mar 2, 2012 at 1:19 PM, varunpandeyengg wrote:
>
>> Hey Guys,
>>
>> I am new to Nutch. I am part of an IR research team & need to create a
>> setup wherein I need to crawl Microsoft's LETOR Dataset with Nutch. After
>> googling for a while, I didn't get any tutorial or help. Could anyone
>> guide me on this?
>>
>> I am using Nutch 1.4 on Ubuntu 11.10 & Eclipse 3.7.
>>
>> Till now I am able to crawl public network from my Nutch setup integrated
>> with Eclipse...
>>
>> Thanks in advance.
>>
>> -
>> Varun
>>
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/Nutch-with-Letor-tp3793432p3793432.html
>> Sent from the Nutch - Dev mailing list archive at Nabble.com.
>>
>
>
>
> --
> *Lewis*
>
>


-- 
*Lewis*


Re: Nutch with Letor

2012-03-02 Thread Lewis John Mcgibbney
Hi,

Would be great if you could provide some links to the dataset, exactly what
it is etc.

Thank you

On Fri, Mar 2, 2012 at 1:19 PM, varunpandeyengg
wrote:

> Hey Guys,
>
> I am new to Nutch. I am part of an IR research team & need to create a setup
> wherein I need to crawl Microsoft's LETOR Dataset with Nutch. After
> googling for a while, I didn't get any tutorial or help. Could anyone guide
> me on this?
>
> I am using Nutch 1.4 on Ubuntu 11.10 & Eclipse 3.7.
>
> Till now I am able to crawl public network from my Nutch setup integrated
> with Eclipse...
>
> Thanks in advance.
>
> -
> Varun
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Nutch-with-Letor-tp3793432p3793432.html
> Sent from the Nutch - Dev mailing list archive at Nabble.com.
>



-- 
*Lewis*


[jira] [Closed] (NUTCH-1296) nutchgora fetcher does not show correct 'threads' and 'resuming' properties

2012-03-02 Thread Ferdy Galema (Closed) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferdy Galema closed NUTCH-1296.
---

Resolution: Fixed

committed

> nutchgora fetcher does not show correct 'threads' and 'resuming' properties
> ---
>
> Key: NUTCH-1296
> URL: https://issues.apache.org/jira/browse/NUTCH-1296
> Project: Nutch
>  Issue Type: Bug
>Reporter: Ferdy Galema
>Priority: Trivial
> Fix For: nutchgora
>
>
> The nutchgora FetcherJob logs the 'threads' and 'resuming' properties just 
> before fetching, but they are read from the config. (This ignores the fact that 
> they are also specified as parameters; these parameters are only set on the 
> config later.)
> A trivial fix will follow right away.





[jira] [Created] (NUTCH-1296) nutchgora fetcher does not show correct 'threads' and 'resuming' properties

2012-03-02 Thread Ferdy Galema (Created) (JIRA)
nutchgora fetcher does not show correct 'threads' and 'resuming' properties
---

 Key: NUTCH-1296
 URL: https://issues.apache.org/jira/browse/NUTCH-1296
 Project: Nutch
  Issue Type: Bug
Reporter: Ferdy Galema
Priority: Trivial
 Fix For: nutchgora


The nutchgora FetcherJob logs the 'threads' and 'resuming' properties just 
before fetching, but they are read from the config. (This ignores the fact that 
they are also specified as parameters; these parameters are only set on the 
config later.)

A trivial fix will follow right away.





Nutch with Letor

2012-03-02 Thread varunpandeyengg
Hey Guys,

I am new to Nutch. I am part of an IR research team & need to create a setup
wherein I need to crawl Microsoft's LETOR Dataset with Nutch. After
googling for a while, I didn't get any tutorial or help. Could anyone guide
me on this?

I am using Nutch 1.4 on Ubuntu 11.10 & Eclipse 3.7.

Till now I am able to crawl public network from my Nutch setup integrated
with Eclipse...

Thanks in advance.

-
Varun

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Nutch-with-Letor-tp3793432p3793432.html
Sent from the Nutch - Dev mailing list archive at Nabble.com.


Re: Drawing an analogy between AdaptiveFetchSchedule and AdaptiveCrawlDelay

2012-03-02 Thread Lewis John Mcgibbney
Hi Andrzej,

On Fri, Mar 2, 2012 at 12:37 PM, Andrzej Bialecki wrote:

> Fetcher2 is the current Fetcher. The original Fetcher was temporarily
> renamed OldFetcher and then removed.
>

So it looks like this 'might' be more straightforward to implement than I
originally thought. When I get a bit of time I would like to dive into it.

Thanks


[jira] [Updated] (NUTCH-1273) Fix [deprecation] javac warnings

2012-03-02 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1273:


Attachment: NUTCH-1273-v2-trunk.patch

This patch goes some way towards addressing the issues described on the user or dev 
list. I'm having some problems with Exceptions, and to be honest I'm not really sure about 
the new API construction. I opted to switch the MimeUtil#autoResolveContentType 
code to use a mimetype String, as opposed to either of the following:
* Switch the code to use MediaType rather than MimeType, and call
 DefaultDetector directly (rather than using the Tika facade class)
* If we get back a String (not null) for the mimetype, create a MimeType
 object for it.

In all honesty, if the method I have used is not suitable, then I think the 
latter of the above alternatives would be better, simply because we are not 
currently calling MediaType anywhere; I've been trying to keep things 
consistent while working on this one.

If someone could have a look it would be greatly appreciated. Thanks 

> Fix [deprecation] javac warnings
> 
>
> Key: NUTCH-1273
> URL: https://issues.apache.org/jira/browse/NUTCH-1273
> Project: Nutch
>  Issue Type: Sub-task
>  Components: build
>Affects Versions: nutchgora, 1.5
>Reporter: Lewis John McGibbney
>Priority: Minor
> Fix For: nutchgora, 1.5
>
> Attachments: NUTCH-1273-nutchgora.patch, NUTCH-1273-trunk.patch, 
> NUTCH-1273-v2-trunk.patch
>
>
> As part of this task, these warnings should be resolved, however this 
> particular strand of warnings can either be resolved by adding
> {code}
> @SuppressWarnings("deprecation")
> {code}
> or by actually upgrading our class usage to rely upon non-deprecated classes. 
> Which option is more appropriate for the project?
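
To make the two options in the description concrete, here is a small sketch using a JDK method rather than any Nutch class: `URLEncoder.encode(String)` is deprecated in favour of the overload that takes an encoding name. The class and method names in the sketch are made up for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class DeprecationSketch {
    // Option 1: keep the deprecated call and silence the warning locally.
    @SuppressWarnings("deprecation")
    static String encodeSuppressed(String s) {
        return URLEncoder.encode(s); // deprecated: relies on the platform default encoding
    }

    // Option 2: move to the non-deprecated overload; nothing to suppress.
    static String encodeUpgraded(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(encodeSuppressed("a b")); // a+b
        System.out.println(encodeUpgraded("a b"));   // a+b
    }
}
```

Suppressing is the smaller change, but it leaves the code depending on behaviour the library authors have flagged; upgrading removes the warning at its source.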





Re: Drawing an analogy between AdaptiveFetchSchedule and AdaptiveCrawlDelay

2012-03-02 Thread Andrzej Bialecki

On 02/03/2012 12:45, Lewis John Mcgibbney wrote:

> Hi Guys,
>
> As there were some comments on the user list, I recently got digging
> with http redirects then stumbled across NUTCH-1042. Although these are
> individual issues e.g. redirects and crawl delays, I think they are
> certainly linked, however what is interesting is that users 'usually'
> don't consider them to be interlinked as such and therefore struggle to
> debug how and why either the redirect or the crawl delay pages are not
> being fetched.
>
> Doing some more digging I found the now rather old and tatty NUTCH-475,
> which obviously got me thinking about how we maintain the
> AdaptiveFetchSchedule for custom refetching. Now I begin to start
> thinking about the following
>
> - Regardless of whether we implement an AdaptiveCrawlDelay, NUTCH-1042
> still needs fixed as this is obviously becoming a bit of a pain for some
> users.

Yes.

> - Can someone shine some light on what happened to Fetcher2.java that
> Dogacan refers to? I was only ever accustomed to OldFetcher and Fetcher :0)


Fetcher2 is the current Fetcher. The original Fetcher was temporarily 
renamed OldFetcher and then removed.


--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Drawing an analogy between AdaptiveFetchSchedule and AdaptiveCrawlDelay

2012-03-02 Thread Lewis John Mcgibbney
Hi Guys,

As there were some comments on the user list, I recently started digging into
http redirects and then stumbled across NUTCH-1042. Although these are
individual issues (redirects and crawl delays), I think they are
certainly linked. What is interesting is that users usually don't
consider them to be interlinked and therefore struggle to debug how
and why either the redirect or the crawl delay pages are not being fetched.

Doing some more digging I found the now rather old and tatty NUTCH-475,
which obviously got me thinking about how we maintain the
AdaptiveFetchSchedule for custom refetching. Now I am starting to think
about the following:

- Regardless of whether we implement an AdaptiveCrawlDelay, NUTCH-1042
still needs to be fixed, as this is obviously becoming a bit of a pain for some
users.
- Can someone shed some light on what happened to Fetcher2.java that
Dogacan refers to? I was only ever accustomed to OldFetcher and Fetcher :0)
- For you guys managing/running/maintaining your own (and possibly
clients') web servers, what are your views on maintaining your own
AdaptiveCrawlDelay? Pros and cons (apart from the obvious)?

I can't really think of anything else at the moment!

Thanks

Lewis

-- 
*Lewis*


[jira] [Closed] (NUTCH-1295) nutchgora restlet dependencies failing when remote repos is down

2012-03-02 Thread Ferdy Galema (Closed) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferdy Galema closed NUTCH-1295.
---

Resolution: Fixed

committed

> nutchgora restlet dependencies failing when remote repos is down
> 
>
> Key: NUTCH-1295
> URL: https://issues.apache.org/jira/browse/NUTCH-1295
> Project: Nutch
>  Issue Type: Bug
>Reporter: Ferdy Galema
> Attachments: NUTCH-1295.patch
>
>
> Currently the head of nutchgora cannot be built when running "ant clean 
> runtime". This is because the restlet dependencies cannot be found, even 
> though there are local restlet copies in the ivy2 cache dir. Did we not 
> have this problem before?
> Anyway, I found a solution. Basically, I renamed the resolver so that its name 
> differs from the chain name. This way the restlet dependencies are read from 
> the local cache when the remote one is not available. See patch for details.





[jira] [Updated] (NUTCH-1295) nutchgora restlet dependencies failing when remote repos is down

2012-03-02 Thread Ferdy Galema (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferdy Galema updated NUTCH-1295:


Attachment: NUTCH-1295.patch

> nutchgora restlet dependencies failing when remote repos is down
> 
>
> Key: NUTCH-1295
> URL: https://issues.apache.org/jira/browse/NUTCH-1295
> Project: Nutch
>  Issue Type: Bug
>Reporter: Ferdy Galema
> Attachments: NUTCH-1295.patch
>
>
> Currently the head of nutchgora cannot be built when running "ant clean 
> runtime". This is because the restlet dependencies cannot be found, even 
> though there are local restlet copies in the ivy2 cache dir. Did we not 
> have this problem before?
> Anyway, I found a solution. Basically, I renamed the resolver so that its name 
> differs from the chain name. This way the restlet dependencies are read from 
> the local cache when the remote one is not available. See patch for details.





[jira] [Created] (NUTCH-1295) nutchgora restlet dependencies failing when remote repos is down

2012-03-02 Thread Ferdy Galema (Created) (JIRA)
nutchgora restlet dependencies failing when remote repos is down


 Key: NUTCH-1295
 URL: https://issues.apache.org/jira/browse/NUTCH-1295
 Project: Nutch
  Issue Type: Bug
Reporter: Ferdy Galema


Currently the head of nutchgora cannot be built when running "ant clean 
runtime". This is because the restlet dependencies cannot be found, even 
though there are local restlet copies in the ivy2 cache dir. Did we not 
have this problem before?

Anyway, I found a solution. Basically, I renamed the resolver so that its name 
differs from the chain name. This way the restlet dependencies are read from the 
local cache when the remote one is not available. See patch for details.





[jira] [Issue Comment Edited] (NUTCH-1024) Dynamically set fetchInterval by MIME-type

2012-03-02 Thread Markus Jelsma (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13220787#comment-13220787
 ] 

Markus Jelsma edited comment on NUTCH-1024 at 3/2/12 9:05 AM:
--

New patch for trunk! This also includes a change to the injector where injected 
fetchInterval is added to CrawlDatum MD. In AdaptiveFetchSchedule this injected 
interval overrides anything else. This is useful for sites where you want to 
use AdaptiveFetchSchedule but still want the generator to select an injected 
homepage every N hours.

  was (Author: markus17):
New patch for trunk! This also includes a change to the injector where 
injected fetchInterval is added to CrawlDatum MD. In AdaptiveFetchSchedule this 
injected interval overrides anything else.
  
> Dynamically set fetchInterval by MIME-type
> --
>
> Key: NUTCH-1024
> URL: https://issues.apache.org/jira/browse/NUTCH-1024
> Project: Nutch
>  Issue Type: New Feature
>  Components: generator
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.5
>
> Attachments: AdaptiveFetchSchedule.patch, 
> MimeAdaptiveFetchSchedule.java, NUTCH-1024-1.5-1.patch, Nutch.patch, 
> adaptive-mimetypes.txt
>
>
> Add facility to configure default or fixed fetchInterval values by MIME-type. 
> This is useful for conserving resources for files that are known to change 
> frequently or never and everything in between.
> * simple key\tvalue\n configuration file
> * only set fetchInterval for new documents
> * keep max fetchInterval fixed by current config
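
The three bullets above could be sketched roughly as follows. This is an illustration only, not the contents of the attached MimeAdaptiveFetchSchedule.java: the class and method names are hypothetical, the interval unit is assumed to be seconds, and malformed lines are simply skipped.

```java
import java.util.HashMap;
import java.util.Map;

public class MimeIntervalSketch {
    // Parses "mime-type<TAB>interval" lines, one entry per line.
    static Map<String, Integer> parse(String text) {
        Map<String, Integer> table = new HashMap<>();
        for (String line : text.split("\n")) {
            String[] parts = line.split("\t");
            if (parts.length != 2) {
                continue; // skip blank or malformed lines
            }
            table.put(parts[0].trim(), Integer.parseInt(parts[1].trim()));
        }
        return table;
    }

    // Intended to be called for new documents only; the configured maximum still caps the result.
    static int initialInterval(Map<String, Integer> table, String mimeType,
                               int defaultInterval, int maxInterval) {
        int wanted = table.getOrDefault(mimeType, defaultInterval);
        return Math.min(wanted, maxInterval);
    }

    public static void main(String[] args) {
        Map<String, Integer> table = parse("text/html\t3600\napplication/pdf\t2592000\n");
        System.out.println(initialInterval(table, "text/html", 86400, 604800));       // 3600
        System.out.println(initialInterval(table, "application/pdf", 86400, 604800)); // 604800
        System.out.println(initialInterval(table, "image/png", 86400, 604800));       // 86400
    }
}
```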





[jira] [Updated] (NUTCH-1024) Dynamically set fetchInterval by MIME-type

2012-03-02 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1024:
-

Attachment: NUTCH-1024-1.5-1.patch

New patch for trunk! This also includes a change to the injector where injected 
fetchInterval is added to CrawlDatum MD. In AdaptiveFetchSchedule this injected 
interval overrides anything else.
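
The precedence described in this comment might look something like the sketch below. The metadata key name and the use of a plain `Map` in place of the CrawlDatum metadata are assumptions; the real key used by the patch is not shown in this thread.

```java
import java.util.HashMap;
import java.util.Map;

public class InjectedIntervalSketch {
    // Hypothetical metadata key written by the injector.
    static final String FIXED_INTERVAL_KEY = "fixedFetchInterval";

    // An injected interval in the metadata wins over the adaptively computed one.
    static int effectiveInterval(Map<String, String> metadata, int adaptiveInterval) {
        String injected = metadata.get(FIXED_INTERVAL_KEY);
        return injected != null ? Integer.parseInt(injected) : adaptiveInterval;
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new HashMap<>();
        System.out.println(effectiveInterval(metadata, 7200)); // 7200, nothing injected
        metadata.put(FIXED_INTERVAL_KEY, "3600");
        System.out.println(effectiveInterval(metadata, 7200)); // 3600, injected value wins
    }
}
```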

> Dynamically set fetchInterval by MIME-type
> --
>
> Key: NUTCH-1024
> URL: https://issues.apache.org/jira/browse/NUTCH-1024
> Project: Nutch
>  Issue Type: New Feature
>  Components: generator
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.5
>
> Attachments: AdaptiveFetchSchedule.patch, 
> MimeAdaptiveFetchSchedule.java, NUTCH-1024-1.5-1.patch, Nutch.patch, 
> adaptive-mimetypes.txt
>
>
> Add facility to configure default or fixed fetchInterval values by MIME-type. 
> This is useful for conserving resources for files that are known to change 
> frequently or never and everything in between.
> * simple key\tvalue\n configuration file
> * only set fetchInterval for new documents
> * keep max fetchInterval fixed by current config
