[jira] [Created] (NUTCH-2176) log4j.properties is a mess

2015-11-26 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-2176:


 Summary: log4j.properties is a mess
 Key: NUTCH-2176
 URL: https://issues.apache.org/jira/browse/NUTCH-2176
 Project: Nutch
  Issue Type: Bug
Reporter: Markus Jelsma
Assignee: Markus Jelsma


Properties file:
- missing DeduplicationJob
- still has CrawldbScanner
- still has reverted HostDB stuff
- is not sorted



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-2176) log4j.properties is a mess

2015-11-26 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2176:
-
Attachment: NUTCH-2176.patch

Patch for trunk resolving the above-mentioned points. Anything else to add?






[jira] [Updated] (NUTCH-2176) log4j.properties is a mess

2015-11-26 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2176:
-
Affects Version/s: 1.10
 Priority: Trivial  (was: Major)
Fix Version/s: 1.11






[jira] [Resolved] (NUTCH-2158) Upgrade to Tika 1.11

2015-11-26 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2158.

Resolution: Fixed

Thanks! Committed to trunk, r1716573.

> Upgrade to Tika 1.11
> 
>
> Key: NUTCH-2158
> URL: https://issues.apache.org/jira/browse/NUTCH-2158
> Project: Nutch
>  Issue Type: Task
>  Components: parser
>Reporter: Chris A. Mattmann
>Assignee: Julien Nioche
> Fix For: 1.11
>
> Attachments: NUTCH-2158-test-protocol-http.patch, NUTCH-2158.patch
>
>
> Upgrade parse-tika to 1.11 release for Tika.





[jira] [Created] (NUTCH-2177) Generator produces only one partition even in distributed mode

2015-11-26 Thread Julien Nioche (JIRA)
Julien Nioche created NUTCH-2177:


 Summary: Generator produces only one partition even in distributed mode
 Key: NUTCH-2177
 URL: https://issues.apache.org/jira/browse/NUTCH-2177
 Project: Nutch
  Issue Type: Bug
  Components: generator
Reporter: Julien Nioche
Priority: Blocker
 Fix For: 1.11


See 
[https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/crawl/Generator.java#L542]

'mapred.job.tracker' is deprecated and has been replaced by 
'mapreduce.jobtracker.address'. However, when running Nutch on EMR, 
mapreduce.jobtracker.address has 'local' as its value. As a result, we 
generate a single partition, i.e. a single map task does all the fetching 
later on (which defeats the purpose of having a distributed crawler).

We should probably detect whether we are running on YARN instead, see 
[http://stackoverflow.com/questions/29680155/why-there-is-a-mapreduce-jobtracker-address-configuration-on-yarn]
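
A minimal sketch of the kind of check suggested above, not the actual Generator code. It assumes the decision can be based on the Hadoop property 'mapreduce.framework.name' (documented values: local, classic, yarn) rather than on the jobtracker address. A plain Map stands in for the Hadoop Configuration so the example is self-contained; the class and method names are hypothetical, and the 'yarn' value shown for EMR is an assumption, not something reported in this issue.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch: pick the partition count from the execution framework, not the jobtracker address. */
class PartitionCountSketch {

    // Real Hadoop 2 property; its documented default is "local".
    static final String FRAMEWORK_KEY = "mapreduce.framework.name";

    /** Hypothetical helper: true only when the job runs in the local job runner. */
    static boolean isLocalMode(Map<String, String> conf) {
        return "local".equals(conf.getOrDefault(FRAMEWORK_KEY, "local"));
    }

    /** Hypothetical helper: force a single partition only in local mode. */
    static int numPartitions(Map<String, String> conf, int requested) {
        return isLocalMode(conf) ? 1 : requested;
    }

    public static void main(String[] args) {
        Map<String, String> cluster = new HashMap<>();
        cluster.put("mapreduce.jobtracker.address", "local"); // value reported on EMR
        cluster.put(FRAMEWORK_KEY, "yarn");                    // assumed value on a YARN cluster
        System.out.println(numPartitions(cluster, 8));         // prints 8

        Map<String, String> local = new HashMap<>();           // nothing set: defaults to local
        System.out.println(numPartitions(local, 8));           // prints 1
    }
}
```

With this check, a stale 'local' value in mapreduce.jobtracker.address no longer affects the number of partitions.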






[jira] [Commented] (NUTCH-2177) Generator produces only one partition even in distributed mode

2015-11-26 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15029015#comment-15029015
 ] 

Markus Jelsma commented on NUTCH-2177:
--

There seems to be no value for mapred.job.tracker on our own YARN 2.7.1, 
which is probably why we are not suffering from this issue.






[jira] [Commented] (NUTCH-2177) Generator produces only one partition even in distributed mode

2015-11-26 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15029017#comment-15029017
 ] 

Markus Jelsma commented on NUTCH-2177:
--

+1 for this issue being a blocker.






[jira] [Commented] (NUTCH-2177) Generator produces only one partition even in distributed mode

2015-11-26 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15029037#comment-15029037
 ] 

Julien Nioche commented on NUTCH-2177:
--

I am on Hadoop version 2.4.0-amzn-7; it is not clear which version of YARN 
it comes with.

mapred.job.tracker does not appear in the conf at all, but 
mapreduce.jobtracker.address does and has 'local' as its value.

I can see:

15/11/26 15:58:13 INFO Configuration.deprecation: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
15/11/26 15:58:13 INFO crawl.Generator: Generator: jobtracker is 'local', generating exactly one partition.

So my guess is that the Configuration maps the old key (mapred.job.tracker) 
to the new one (mapreduce.jobtracker.address).

@markus17 what do you get for mapreduce.jobtracker.address?
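
A minimal sketch of the behaviour guessed at above, assuming a lookup of a deprecated key is redirected to its replacement. It is an illustration only, not Hadoop's implementation: a plain Map stands in for the Configuration, and the class and method names are hypothetical. The only key pair used is the one named in the log line.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch: reading a deprecated key returns the value stored under the new key. */
class DeprecatedKeySketch {

    // Old key -> new key, as named in the deprecation log message.
    static final Map<String, String> DEPRECATED = new HashMap<>();
    static {
        DEPRECATED.put("mapred.job.tracker", "mapreduce.jobtracker.address");
    }

    /** Hypothetical lookup: resolve the key through the deprecation table first. */
    static String get(Map<String, String> props, String key) {
        String resolved = DEPRECATED.getOrDefault(key, key);
        return props.get(resolved);
    }

    public static void main(String[] args) {
        Map<String, String> props = new HashMap<>();
        props.put("mapreduce.jobtracker.address", "local"); // only the new key is set

        // The old key is absent from props, yet the lookup still yields "local".
        System.out.println(get(props, "mapred.job.tracker")); // prints local
    }
}
```

Under this assumption, code that reads mapred.job.tracker would see 'local' even though only the new key appears in the conf, which matches the Generator log line.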







[jira] [Commented] (NUTCH-2177) Generator produces only one partition even in distributed mode

2015-11-26 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15029054#comment-15029054
 ] 

Markus Jelsma commented on NUTCH-2177:
--

On standard Apache Hadoop YARN 2.7.1 running in high-availability mode 
(stand-by ResourceManager), we see the correct current value for 
mapreduce.jobtracker.address.

The log emits what I would expect:
15/11/26 16:11:16 INFO Configuration.deprecation: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address



