Re: [ANNOUNCEMENT] Apache Nutch 1.8 Release

2014-03-17 Thread Markus Jelsma
Thanks Lewis!

Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote:

Good Evening,

The Apache Nutch PMC are pleased to announce the immediate release of Apache 
Nutch v1.8. 

Apache Nutch is a highly extensible and scalable open source web crawler 
software project. Stemming from Apache Lucene, the project has diversified and 
now comprises two codebases:

Nutch 1.x: A mature, production-ready crawler. 1.x enables fine-grained 
configuration and relies on Apache Hadoop data structures, which are well 
suited to batch processing.

Nutch 2.x: An emerging alternative taking direct inspiration from 1.x, but 
which differs in one key area: storage is abstracted away from any specific 
underlying data store by using Apache Gora for handling object-to-persistent 
mappings. This means we can implement an extremely flexible model/stack for 
storing everything (fetch time, status, content, parsed text, outlinks, 
inlinks, etc.) in a number of NoSQL storage solutions.

We advise all current users and developers of the 1.X series to upgrade to this 
release. In addition to library upgrades to Crawler Commons 0.3 and Apache 
Tika 1.4, this release provides over 30 bug fixes as well as 18 improvements. 
Please see the list of changes for a full breakdown, or see the release report. 
As usual in the 1.X series, this release is made available both as source and 
binary. Additionally, developers can find Maven artifacts within Maven Central. 
The release is available here. 

Thank you
Lewis
(On behalf of the Nutch PMC)

-- 
Lewis 


[jira] [Created] (NUTCH-1738) Expose number of URLs generated per batch in GeneratorJob

2014-03-17 Thread Lewis John McGibbney (JIRA)
Lewis John McGibbney created NUTCH-1738:
---

 Summary: Expose number of URLs generated per batch in GeneratorJob
 Key: NUTCH-1738
 URL: https://issues.apache.org/jira/browse/NUTCH-1738
 Project: Nutch
  Issue Type: Bug
  Components: generator
Affects Versions: 2.2.1
Reporter: Lewis John McGibbney
 Fix For: 2.3


GeneratorJob contains one trivial line of logging:
{code:title=GeneratorJob.java|borderStyle=solid}
LOG.info("GeneratorJob: generated batch id: " + batchId);
{code}
I propose to improve this logging by exposing how many URLs are contained 
within the generated batch. Something like:
{code:title=GeneratorJob.java|borderStyle=solid}
LOG.info("GeneratorJob: generated batch id: " + batchId + " containing " + 
$numOfURLs + " URLs");
{code}
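
A minimal, self-contained sketch of the proposed log message follows. It is not Nutch code: the class name GeneratorLogSketch, the helper batchMessage, the sample batch id, and the in-memory URL list are all assumptions made for illustration, and java.util.logging stands in for whatever logger GeneratorJob actually uses. In a real job the count would have to come from the job itself (for example a counter), which is not shown here.

```java
import java.util.Arrays;
import java.util.List;
import java.util.logging.Logger;

// Hypothetical stand-in for the logging call in GeneratorJob; not Nutch code.
final class GeneratorLogSketch {
    private static final Logger LOG =
        Logger.getLogger(GeneratorLogSketch.class.getName());

    // Builds the proposed message: the batch id plus the number of URLs in the batch.
    static String batchMessage(String batchId, long numOfUrls) {
        return "GeneratorJob: generated batch id: " + batchId
            + " containing " + numOfUrls + " URLs";
    }

    public static void main(String[] args) {
        // Assumed inputs: a made-up batch id and a small list of generated URLs.
        String batchId = "example-batch-id";
        List<String> generated = Arrays.asList(
            "http://example.org/a",
            "http://example.org/b",
            "http://example.org/c");

        LOG.info(batchMessage(batchId, generated.size()));
    }
}
```

Keeping the message construction in one helper makes the wording easy to check without running a full generate step.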






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (NUTCH-1738) Expose number of URLs generated per batch in GeneratorJob

2014-03-17 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13937642#comment-13937642
 ] 

Lewis John McGibbney commented on NUTCH-1738:
-

This concept could also be ported to 1.X, as AFAIK we do not know the number of 
URLs generated explicitly but rely upon a restrictive value being set for the 
generate.max.count property in nutch-default.xml. It is of course advised to 
generate smaller, more frequent fetchlists*; however, the logging is still 
valuable as it indicates how many URLs _should/could_ have been fetched per 
round. 
*Please note I am referring to fetchlists and batch IDs as equivalent entities 
here.



[jira] [Assigned] (NUTCH-1738) Expose number of URLs generated per batch in GeneratorJob

2014-03-17 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney reassigned NUTCH-1738:
---

Assignee: Lewis John McGibbney



Re: How do I customize Nutch to cater to existing SOLR schema

2014-03-17 Thread tripiy
Hi Lajos,

Appreciate your help in providing the patch, which would definitely improve the
usability of the product.

For now we have resolved the unique field issue using the following changes
to solrindex-mapping.xml:
<field dest="_uniqueid" source="url"/>
<copyField source="url" dest="_uniqueid"/>

For the other Nutch fields we have added the corresponding fields in the SOLR
schema, which are copied into the respective target CMS schema fields.
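
As a rough illustration of what a source-to-destination field mapping like the one above does, here is a minimal sketch. It is not Nutch or Solr code: the class FieldMappingSketch and the method applyMapping are hypothetical, documents are reduced to simple string maps, and the sketch assumes a mapping entry simply stores the source field's value under the destination name. The copyField behaviour is not modelled.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of a source-to-destination field mapping; not Nutch code.
final class FieldMappingSketch {

    // Returns a copy of doc in which each mapped source field is stored under its
    // destination name; fields without a mapping keep their original name.
    static Map<String, String> applyMapping(Map<String, String> doc,
                                            Map<String, String> mapping) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : doc.entrySet()) {
            String destName = mapping.getOrDefault(e.getKey(), e.getKey());
            out.put(destName, e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        // Mirrors the mapping of source "url" to dest "_uniqueid".
        Map<String, String> mapping = new LinkedHashMap<>();
        mapping.put("url", "_uniqueid");

        // Made-up crawled document with two fields.
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("url", "http://example.org/page");
        doc.put("title", "Example page");

        System.out.println(applyMapping(doc, mapping));
    }
}
```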

Will wait for your patch to get a more robust and flexible solution.

Thanks




--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-do-I-customize-Nutch-to-cater-to-existing-SOLR-schema-tp4123062p4124742.html
Sent from the Nutch - Dev mailing list archive at Nabble.com.


[jira] [Resolved] (NUTCH-1671) indexchecker to add digest field

2014-03-17 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-1671.


Resolution: Fixed

Committed to trunk r1578616 and 2.x r1578620.

 indexchecker to add digest field
 

 Key: NUTCH-1671
 URL: https://issues.apache.org/jira/browse/NUTCH-1671
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.7, 2.2.1
Reporter: Sebastian Nagel
Priority: Trivial
 Fix For: 2.3, 1.9

 Attachments: NUTCH-1671-2x.patch, NUTCH-1671-trunk.patch


 IndexingFiltersChecker does not add the digest field as done by 
 IndexerMapReduce. The digest/signature could also be used by indexing filters, 
 which may then fail.
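
 A minimal sketch of the idea, under stated assumptions: the document is reduced to a string map, the field name "digest" is taken from the description above, and the digest is assumed to be an MD5 hex string of the raw content. The class DigestFieldSketch and its methods are hypothetical and are not the code committed for this issue; the signature actually used by Nutch may differ.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of adding a digest field before filters run; not Nutch code.
final class DigestFieldSketch {

    // Computes the MD5 hash of the content and renders it as lowercase hex.
    static String md5Hex(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] hash = md.digest(content);
            StringBuilder sb = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                sb.append(String.format("%02x", b & 0xff));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Returns a copy of the document with a "digest" field added, so that any
    // later step that reads the digest finds it present.
    static Map<String, String> withDigest(Map<String, String> doc, byte[] content) {
        Map<String, String> out = new LinkedHashMap<>(doc);
        out.put("digest", md5Hex(content));
        return out;
    }

    public static void main(String[] args) {
        // Made-up document and content for illustration.
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("url", "http://example.org/page");
        byte[] content = "example content".getBytes(StandardCharsets.UTF_8);

        System.out.println(withDigest(doc, content));
    }
}
```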





[jira] [Commented] (NUTCH-1671) indexchecker to add digest field

2014-03-17 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13938653#comment-13938653
 ] 

Hudson commented on NUTCH-1671:
---

SUCCESS: Integrated in Nutch-nutchgora #957 (See 
[https://builds.apache.org/job/Nutch-nutchgora/957/])
NUTCH-1671 indexchecker to add digest field (snagel: 
http://svn.apache.org/viewvc/nutch/branches/2.x/?view=rev&rev=1578620)
* /nutch/branches/2.x/CHANGES.txt
* /nutch/branches/2.x/src/java/org/apache/nutch/indexer/IndexingFiltersChecker.java




[jira] [Commented] (NUTCH-1671) indexchecker to add digest field

2014-03-17 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13938662#comment-13938662
 ] 

Hudson commented on NUTCH-1671:
---

SUCCESS: Integrated in Nutch-trunk #2568 (See 
[https://builds.apache.org/job/Nutch-trunk/2568/])
NUTCH-1671 indexchecker to add digest field (snagel: 
http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1578616)
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/indexer/IndexingFiltersChecker.java




[GitHub] nutch pull request: Patch for fixing coding bug

2014-03-17 Thread ysc
Github user ysc closed the pull request at:

https://github.com/apache/nutch/pull/2


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---