Jenkins build is back to normal : Nutch-trunk #1671

2011-11-21 Thread Apache Jenkins Server
See 




[VOTE] Apache Nutch 1.4 release rc #2

2011-11-21 Thread Mattmann, Chris A (388J)
Hi Folks,

A second release candidate for the Nutch 1.4 release is available at:

  http://people.apache.org/~mattmann/apache-nutch-1.4/rc2/

The release candidate consists of zip and tar.gz archives of the sources in:

  http://svn.apache.org/repos/asf/nutch/tags/release-1.4/

along with a binary build suitable for deployment.

A staged Maven repository is available here:

https://repository.apache.org/content/repositories/orgapachenutch-161/

I tried to address Julien's comments about -bin package unpackaging.

Please vote on releasing this package as Apache Nutch 1.4.
The vote is open for the next 72 hours and passes if a majority of at
least three +1 Nutch PMC votes are cast.

  [ ] +1 Release this package as Apache Nutch 1.4
  [ ] -1 Do not release this package because...

Thanks!

Cheers,
Chris

P.S. Here's my +1.

++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++



Re: Dependency Injection

2011-11-21 Thread Mattmann, Chris A (388J)
Hey PJ,

You aren't being an ass at all. You're asking an important question, and 
something I've been interested in for a while.
Here are some relevant threads to take a look at:

http://wiki.apache.org/nutch/Nutch2Architecture
http://www.mail-archive.com/nutch-developers@lists.sourceforge.net/msg12688.html
http://www.slideshare.net/chrismattmann/lessons-learned-in-the-development-of-a-webscale-search-engine-nutch2-and-beyond
https://issues.apache.org/jira/browse/NUTCH-609
http://osdir.com/ml/user.nutch.apache/2011-07/msg00080.html
http://5341.com/list/48/349985.html

If you're interested in contributing to Apache Nutch, check out this great 
guide written by Dennis Kubes:

wiki.apache.org/nutch/Becoming_A_Nutch_Developer

Until now there hasn't been much interest in replacing the plugin system, since 
it "worked" and we didn't get many complaints. That lack of interest turned 
into the perception that a DI framework wouldn't be welcome. 
On the contrary: if you figure out how to get a DI framework working with the 
existing plugin system, I can personally say I'd dedicate some of my time to 
helping you shepherd it in, and I think the rest of the committers would be 
on board.

Thanks for your interest. If you have any more questions, please ask!

Cheers,
Chris


On Nov 21, 2011, at 1:14 PM, PJ Herring wrote:

> Hey,
> 
> So I am admittedly a noob with Nutch, but have spent some time digging 
> through the source code. I am just curious if anyone has talked about, in 
> future developments of Nutch, replacing the whole way we register plugins? I 
> ask because I am using Nutch on a project with Maven. At the moment I have to 
> copy a whole bunch of JAR files with their plugin.xml files into a certain 
> directory on build, which is fine, but is kind of breaking the whole Maven 
> paradigm. It would be nice to have some sort of Maven repository where 
> plugins lived, and then wire up which plugins I want to use using some kind 
> of DI framework, like Spring or Guice. Then instead of writing XML Plugin 
> Descriptor Files, every plugin could write a class extending PluginDescriptor 
> and register itself with the PluginRepo, or something of the sort.
> 
> Also, I have never contributed to an open source project, so if I am being an 
> ass I don't mean to be. Just would love to help make a great tool better in 
> any way.
> 
> Best,
> PJ Herring
> 
> 
> 
> 





Dependency Injection

2011-11-21 Thread PJ Herring
Hey,

So I am admittedly a noob with Nutch, but have spent some time digging through 
the source code. I am just curious if anyone has talked about, in future 
developments of Nutch, replacing the whole way we register plugins? I ask 
because I am using Nutch on a project with Maven. At the moment I have to copy 
a whole bunch of JAR files with their plugin.xml files into a certain directory 
on build, which is fine, but is kind of breaking the whole Maven paradigm. It 
would be nice to have some sort of Maven repository where plugins lived, and 
then wire up which plugins I want to use using some kind of DI framework, like 
Spring or Guice. Then instead of writing XML Plugin Descriptor Files, every 
plugin could write a class extending PluginDescriptor and register itself 
with the PluginRepo, or something of the sort.
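
The registration idea described above can be sketched in plain Java without committing to Spring or Guice. This is only an illustration under stated assumptions: the `PluginDescriptor` and `PluginRepo` classes below are hypothetical stand-ins named after the wording of the proposal, not the classes in Nutch's own plugin package, and the plugin id and extension point are made up. The `wire()` method plays the role that a DI container's module or bean configuration would play.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PluginRegistrySketch {

    /** Hypothetical base class: a plugin describes itself in code instead of plugin.xml. */
    static abstract class PluginDescriptor {
        abstract String id();
        abstract String extensionPoint();
    }

    /** Hypothetical registry that descriptors are registered with at wiring time. */
    static class PluginRepo {
        private final Map<String, PluginDescriptor> plugins = new LinkedHashMap<>();

        void register(PluginDescriptor descriptor) {
            // Reject duplicates so two JARs cannot silently claim the same id.
            if (plugins.containsKey(descriptor.id())) {
                throw new IllegalStateException("Duplicate plugin id: " + descriptor.id());
            }
            plugins.put(descriptor.id(), descriptor);
        }

        PluginDescriptor lookup(String id) {
            return plugins.get(id);
        }

        int size() {
            return plugins.size();
        }
    }

    /** Example plugin; the id and extension point names are invented for this sketch. */
    static class ExampleUrlFilter extends PluginDescriptor {
        @Override
        String id() {
            return "example-urlfilter";
        }

        @Override
        String extensionPoint() {
            return "url-filter";
        }
    }

    /** Stand-in for the wiring a DI framework would perform. */
    static PluginRepo wire() {
        PluginRepo repo = new PluginRepo();
        repo.register(new ExampleUrlFilter());
        return repo;
    }

    public static void main(String[] args) {
        PluginRepo repo = wire();
        System.out.println(repo.lookup("example-urlfilter").extensionPoint());
    }
}
```

With this shape, choosing which plugins to enable becomes a matter of which descriptors are on the classpath and registered, rather than which files are copied into a directory at build time.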

Also, I have never contributed to an open source project, so if I am being an 
ass I don't mean to be. Just would love to help make a great tool better in any 
way.

Best,
PJ Herring






Re: Signature == null ?

2011-11-21 Thread Markus Jelsma
I can't dump the DB right now since it's far too large for a single node, but 
from the log output I can see that the records without a signature were not 
parsable with Tika: RSS feeds, bad PDFs, or parses that timed out.


> > On 15/11/2011 20:33, Markus Jelsma wrote:
> > > It's back again! Last try if someone has a pointer for this.
> > > Cheers
> > > 
> > >> After some DB updates, they're gone! Anyone recognizes this
> > >> phenomenon?
> > >> 
> > >> On Tuesday 08 November 2011 11:22:48 Markus Jelsma wrote:
> > >>> On Tuesday 08 November 2011 11:15:37 Markus Jelsma wrote:
> > >>>> Hi guys,
> > >>>> 
> > >>>> I've a M/R job selecting only DB_FETCHED and DB_NOTMODIFIED records
> > >>>> and their signatures. I had to add a sanity check on signature to
> > >>>> avoid an NPE. I had the assumption any record with such DB_ status
> > >>>> has to have a signature, right?
> > >>>> 
> > >>>> Why does roughly 0.0001625% of my records exist without a signature?
> > >>> 
> > >>> Now with correct metrics:
> > >>> Why does roughly 0.84% of my records exist without a signature?
> > 
> > This could be somehow related to pages that come from redirects so that
> > when they are fetched they are accounted for under different urls, which
> > in turn may confuse the update code in CrawlDbReducer... Do you notice
> > any pattern to these pages? What's their origin?
> 
> Ah, this seems like a useful pointer. I'll add debug lines to identify the
> bad records and check them with a CrawlDB dump.
> 
> Can't use the reader since it seems to stumble over records with meta data.
> 
> Will report back here and maybe with a new ticket.
> 
> Thanks
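
The sanity check discussed in this thread can be sketched as follows. This is a minimal illustration, not the actual M/R job: `Record` and `Status` are hypothetical stand-ins for CrawlDb records and their statuses, and the URLs are invented. It selects only fetched or not-modified records and collects the ones whose signature is null, which is the kind of debug output needed to identify the bad records.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SignatureCheckSketch {

    /** Hypothetical status codes standing in for the CrawlDb statuses named in the thread. */
    enum Status { DB_UNFETCHED, DB_FETCHED, DB_NOTMODIFIED, DB_GONE }

    /** Hypothetical stand-in for a CrawlDb record; not Nutch's CrawlDatum. */
    static class Record {
        final String url;
        final Status status;
        final byte[] signature; // may be null, e.g. when the parse failed or timed out

        Record(String url, Status status, byte[] signature) {
            this.url = url;
            this.status = status;
            this.signature = signature;
        }
    }

    /** Returns the URLs of fetched or not-modified records that lack a signature. */
    static List<String> missingSignatures(List<Record> records) {
        List<String> bad = new ArrayList<>();
        for (Record r : records) {
            boolean selected = r.status == Status.DB_FETCHED
                    || r.status == Status.DB_NOTMODIFIED;
            // Guard against null instead of dereferencing the signature.
            if (selected && r.signature == null) {
                bad.add(r.url);
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        List<Record> records = Arrays.asList(
                new Record("http://example.org/a", Status.DB_FETCHED, new byte[] {1, 2}),
                new Record("http://example.org/feed.rss", Status.DB_FETCHED, null),
                new Record("http://example.org/b", Status.DB_UNFETCHED, null));
        System.out.println(missingSignatures(records));
    }
}
```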


[Nutch Wiki] Trivial Update of "RedirectHandling" by LewisJohnMcgibbney

2011-11-21 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "RedirectHandling" page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/RedirectHandling?action=diff&rev1=1&rev2=2

  = Redirect handling in Nutch =
  This page is in construction but when completed will provide a comprehensive 
overview of redirect handling in Apache Nutch.
  
+ To begin with, we really want to define what HTTP URL redirects are, what 
types of problems they present for crawlers, and finally what Nutch does to 
address some of the areas. By the end of this tutorial, we should have 
addressed the complex and rather confusing area of redirects. For a whirlwind 
tour of this page please see the Table of Contents.
+ 
+ <>
+ 
+  
+ 


[Nutch Wiki] Trivial Update of "NutchResources" by LewisJohnMcgibbney

2011-11-21 Thread Apache Wiki

The "NutchResources" page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/NutchResources?action=diff&rev1=1&rev2=2

  
  
   * [[http://nutch.sourceforge.net/blog/cutting.html|Doug's Weblog]] -- He's 
the one who originally wrote Lucene and Nutch.
-  * [[http://wiki.media-style.com/display/nutchDocu/Home|Stefan's Nutch 
Documentation]]
   * [[Search_Theory]] Search Theory & White Papers
   * [[http://blog.foofactory.fi/|FooFactory]] Nutch and Hadoop related posts
   * [[http://www.interadvertising.co.uk/blog/nutch_logos|Larger / better 
quality Nutch logos]] Re-created Nutch logos available in GIF, PNG & EPS in 
resolutions up to 1200 x 449


[Nutch Wiki] Trivial Update of "NutchHadoopTutorial" by LewisJohnMcgibbney

2011-11-21 Thread Apache Wiki

The "NutchHadoopTutorial" page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/NutchHadoopTutorial?action=diff&rev1=35&rev2=36

  
  This document does not go into the Nutch or Hadoop architecture, resources 
relating to these topics can be found [[FrontPage#Nutch Development|here]]. It 
only tells how to get the systems up and running. There are also relevant 
resources at the end of this tutorial if you want to know more about the 
architecture of Nutch and Hadoop.
  
- '''N.B.''' Prerequsites for this tutorial are both the [[NutchTutorial|Nutch 
Tutorial]] and the [[http://hadoop.apache.org/common/docs/stable/|Hadoop 
Tutorial]]. 
+ '''N.B.''' Prerequsites for this tutorial are both the [[NutchTutorial|Nutch 
Tutorial]] and the [[http://hadoop.apache.org/common/docs/stable/|Hadoop 
Tutorial]]. It will also be of great benefit to have a look at the 
[[http://wiki.apache.org/hadoop/|Hadoop Wiki]]
  <>
  
  === Assumptions ===


[Nutch Wiki] Trivial Update of "RedirectHandling" by LewisJohnMcgibbney

2011-11-21 Thread Apache Wiki

The "RedirectHandling" page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/RedirectHandling

New page:
= Redirect handling in Nutch =
This page is in construction but when completed will provide a comprehensive 
overview of redirect handling in Apache Nutch.


[Nutch Wiki] Trivial Update of "InternalDocumentation" by LewisJohnMcgibbney

2011-11-21 Thread Apache Wiki

The "InternalDocumentation" page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/InternalDocumentation?action=diff&rev1=9&rev2=10

* NutchDistributedFileSystem
   * DissectingTheNutchCrawler by MattKangas /!\ :TODO: /!\ This tutorial 
requires substantial updating to reflect current Nutch components and 
functionality
   * State diagram of a page in Nutch (CrawlDatumStates)
+  * RedirectHandling - A page providing a comprehensive overview of how Nutch 
handles redirects. /!\ :TODO: /!\ This tutorial is in construction
  


Re: Enabling Nutch wiki override of ACLs for Attachments

2011-11-21 Thread Lewis John Mcgibbney
I don't think this is possible. The settings can be configured either so that
anyone can edit but not upload attachments, or so that ONLY an AdminGroup or
ContributersGroup can add material. This requires someone to maintain the
respective configuration files in our wiki instance... which is not a huge
deal.

The attachment blocking was introduced because some projects were
experiencing high levels of spam. If this has not been the case with Nutch,
then for the time being we can simply remove this restriction and implement
the above restriction if/when spam occurs.

Any thoughts?

Examples of material which has been blocked are

http://wiki.apache.org/nutch/CrawlDatumStates?action=AttachFile&do=view&target=CrawlDatum.uxf
http://wiki.apache.org/nutch/Evaluations?action=AttachFile&do=view&target=OSU_Queries.pdf


On Mon, Nov 21, 2011 at 3:46 PM, Markus Jelsma
wrote:

> Spam happens once in a while. Can uploading of attachments be restricted to
> committers?
>
> On Monday 21 November 2011 16:40:11 Lewis John Mcgibbney wrote:
> > Hi Guys,
> >
> > There has been some discussion recently about broken links to attachments
> > on the Nutch wiki. The reason for this can be seen here [1].
> >
> > I am not aware of the Nutch wiki suffering from Spam attacks, however
> this
> > is not to say that it might not happen. Therefore is it worth re-enabling
> > this feature as per the comments in the link below?
> >
> > Thanks
> >
> > [1] http://wiki.apache.org/general/OurWikiFarm#Attachments
>
> --
> Markus Jelsma - CTO - Openindex
> http://www.linkedin.com/in/markus17
> 050-8536620 / 06-50258350
>



-- 
*Lewis*


Re: Enabling Nutch wiki override of ACLs for Attachments

2011-11-21 Thread Markus Jelsma
Spam happens once in a while. Can uploading of attachments be restricted to 
committers?

On Monday 21 November 2011 16:40:11 Lewis John Mcgibbney wrote:
> Hi Guys,
> 
> There has been some discussion recently about broken links to attachments
> on the Nutch wiki. The reason for this can be seen here [1].
> 
> I am not aware of the Nutch wiki suffering from Spam attacks, however this
> is not to say that it might not happen. Therefore is it worth re-enabling
> this feature as per the comments in the link below?
> 
> Thanks
> 
> [1] http://wiki.apache.org/general/OurWikiFarm#Attachments

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


[Nutch Wiki] Trivial Update of "CrawlDatumStates" by MarkusJelsma

2011-11-21 Thread Apache Wiki

The "CrawlDatumStates" page has been changed by MarkusJelsma:
http://wiki.apache.org/nutch/CrawlDatumStates?action=diff&rev1=4&rev2=5

Comment:
added scoreupdater

   *Injector - to populate CrawlDb with new URLs 
   *Generator - to generate new fetchlists, and optionally mark those URLs in 
CrawlDb as "being in the process of fetching" 
   *CrawlDb update - to update the CrawlDb with new knowledge about the already 
known URLs (already in CrawlDb) as well as add new URLs discovered from page 
outlinks.
+  *[[http://wiki.apache.org/nutch/NewScoring#ScoreUpdater|ScoreUpdater]] 
updates the CrawlDB with LinkRank calculated URL scores.
  
  Below is a state diagram of CrawlDatum, which is a class that holds this 
state in CrawlDb.
  


Enabling Nutch wiki override of ACLs for Attachments

2011-11-21 Thread Lewis John Mcgibbney
Hi Guys,

There has been some discussion recently about broken links to attachments
on the Nutch wiki. The reason for this can be seen here [1].

I am not aware of the Nutch wiki suffering from Spam attacks, however this
is not to say that it might not happen. Therefore is it worth re-enabling
this feature as per the comments in the link below?

Thanks

[1] http://wiki.apache.org/general/OurWikiFarm#Attachments

-- 
*Lewis*


[Nutch Wiki] Trivial Update of "CrawlDatumStates" by LewisJohnMcgibbney

2011-11-21 Thread Apache Wiki

The "CrawlDatumStates" page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/CrawlDatumStates?action=diff&rev1=3&rev2=4

  
  Nutch 1.x maintains state of pages in CrawlDb, which is updated by various 
tools:
  
-  * Injector - to populate CrawlDb with new URLs 
+  *Injector - to populate CrawlDb with new URLs 
-  * Generator - to generate new fetchlists, and optionally mark those URLs in 
CrawlDb as "being in the process of fetching" 
+  *Generator - to generate new fetchlists, and optionally mark those URLs in 
CrawlDb as "being in the process of fetching" 
-  * CrawlDb update - to update the CrawlDb with new knowledge about the 
already known URLs (already in CrawlDb) as well as add new URLs discovered from 
page outlinks.
+  *CrawlDb update - to update the CrawlDb with new knowledge about the already 
known URLs (already in CrawlDb) as well as add new URLs discovered from page 
outlinks.
  
  Below is a state diagram of CrawlDatum, which is a class that holds this 
state in CrawlDb.
  


[Nutch Wiki] Trivial Update of "CrawlDatumStates" by LewisJohnMcgibbney

2011-11-21 Thread Apache Wiki

The "CrawlDatumStates" page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/CrawlDatumStates?action=diff&rev1=2&rev2=3

  
  Nutch 1.x maintains state of pages in CrawlDb, which is updated by various 
tools:
  
- * Injector - to populate CrawlDb with new URLs * Generator - to generate new 
fetchlists, and optionally mark those URLs in CrawlDb as "being in the process 
of fetching" * CrawlDb update - to update the CrawlDb with new knowledge about 
the already known URLs (already in CrawlDb) as well as add new URLs discovered 
from page outlinks.
+  * Injector - to populate CrawlDb with new URLs 
+  * Generator - to generate new fetchlists, and optionally mark those URLs in 
CrawlDb as "being in the process of fetching" 
+  * CrawlDb update - to update the CrawlDb with new knowledge about the 
already known URLs (already in CrawlDb) as well as add new URLs discovered from 
page outlinks.
  
  Below is a state diagram of CrawlDatum, which is a class that holds this 
state in CrawlDb.
  
@@ -25, +27 @@

  
  If there was a temporary problem in fetching (e.g. exception or time out) 
then this URL is left as "unfetched" but its retry counter is incremented. If 
this counter reaches a limit (default is 3) the page is marked as "gone". Pages 
that are "gone" are not considered for fetching by Generator for a long time, 
which is the maxFetchInterval (e.g. 180 days) - the reason for keeping them is 
that even gone pages may re-appear after a while, and also we want to avoid 
re-discovering them and giving them a status of "unfetched".
  
- Other possible states after fetching are "truly gone" ;) (e.g. forbidden by 
robots.txt or unauthorized), which get the same treatment as described above - 
that is after a long period of time we check again their status, which may have 
changed.
+ Other possible states after fetching are "truly gone" ;) (e.g. forbidden by 
robots.txt or unauthorized), which get the same treatment as described above - 
that is after a long period of time we check again their status, which ma
  
- In case of "success" we mark this URL as "fetched". This URL is not eligible 
for re-fetching until after fetchInterval, at which point it's considered 
outdated and in need of re-fetching (i.e. the same as "unfetched").
- 
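
The state handling described in the diff above can be sketched as a small state machine. This is an illustration under stated assumptions, not Nutch's CrawlDatum code: the class, enum, and method names are hypothetical, the retry limit of 3 and the idea of separate fetch intervals come from the wiki text, and resetting the retry counter on success is an assumption of this sketch.

```java
public class FetchStateSketch {

    /** Simplified page states taken from the wiki text. */
    enum State { UNFETCHED, FETCHED, GONE }

    /** Default retry limit mentioned on the wiki page. */
    static final int RETRY_LIMIT = 3;

    /** Hypothetical holder for a page's state; not Nutch's CrawlDatum. */
    static class Page {
        State state = State.UNFETCHED;
        int retries = 0;
    }

    /** Temporary problem (exception, timeout): bump the counter, mark gone at the limit. */
    static void onTemporaryFailure(Page p) {
        p.retries++;
        if (p.retries >= RETRY_LIMIT) {
            p.state = State.GONE;
        }
    }

    /** Successful fetch; clearing the counter is an assumption of this sketch. */
    static void onSuccess(Page p) {
        p.state = State.FETCHED;
        p.retries = 0;
    }

    /** Decides whether a generator-like tool should consider the page for fetching. */
    static boolean dueForFetch(Page p, long daysSinceLastAttempt,
                               long fetchIntervalDays, long maxFetchIntervalDays) {
        switch (p.state) {
            case UNFETCHED:
                return true;
            case FETCHED:
                // Outdated after fetchInterval, i.e. treated like "unfetched" again.
                return daysSinceLastAttempt >= fetchIntervalDays;
            case GONE:
                // Gone pages are re-checked only after the much longer maxFetchInterval.
                return daysSinceLastAttempt >= maxFetchIntervalDays;
            default:
                return false;
        }
    }

    public static void main(String[] args) {
        Page p = new Page();
        for (int i = 0; i < RETRY_LIMIT; i++) {
            onTemporaryFailure(p);
        }
        System.out.println(p.state + " due=" + dueForFetch(p, 10, 30, 180));
    }
}
```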


[Nutch Wiki] Trivial Update of "InternalDocumentation" by LewisJohnMcgibbney

2011-11-21 Thread Apache Wiki

The "InternalDocumentation" page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/InternalDocumentation?action=diff&rev1=8&rev2=9

* [[WebDB]]
* [[DistributedWebDB]]
* NutchDistributedFileSystem
-  * AboutPlugins
   * DissectingTheNutchCrawler by MattKangas /!\ :TODO: /!\ This tutorial 
requires substantial updating to reflect current Nutch components and 
functionality
   * State diagram of a page in Nutch (CrawlDatumStates)
  


[Nutch Wiki] Trivial Update of "NutchMavenSupport" by LewisJohnMcgibbney

2011-11-21 Thread Apache Wiki

The "NutchMavenSupport" page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/NutchMavenSupport?action=diff&rev1=1&rev2=2

  Starting with Nutch 1.3 and with Nutch 2.0, you can now use Nutch as a 
[[http://maven.apache.org/|Maven]] dependency. Just include the below block of 
code in your Maven pom.xml.
+ 
+ In addition to this, please see 
[[https://repository.apache.org/index.html#nexus-search;classname~Nutch|here]] 
for ALL Nutch artifacts on the Sonatype Nexus Maven Repository.
  
  {{{
  


[jira] [Commented] (NUTCH-1207) ParserChecker to output signature

2011-11-21 Thread Hudson (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13154190#comment-13154190
 ] 

Hudson commented on NUTCH-1207:
---

Integrated in nutch-trunk-maven #33 (See 
[https://builds.apache.org/job/nutch-trunk-maven/33/])
NUTCH-1207 ParserChecker to output signature

markus : 
http://svn.apache.org/viewvc/nutch/trunk/viewvc/?view=rev&root=&revision=1204492
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/parse/ParserChecker.java


> ParserChecker to output signature
> -
>
> Key: NUTCH-1207
> URL: https://issues.apache.org/jira/browse/NUTCH-1207
> Project: Nutch
> Issue Type: Improvement
> Components: parser
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Trivial
> Fix For: 1.5
>
> Attachments: NUTCH-1207-1.5-1.patch
>
>
> ParserChecker should calculate and display the signature. Makes debugging a 
> bit easier.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (NUTCH-1104) Port issues from trunk NutchGora branch

2011-11-21 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1104:
-

Description: 
Umbrella issue for tracking issues that should be ported from 1.x trunk to the 
NutchGora branch. Please mark ported issues by modifying this description.

NOT YET PORTED:

* NUTCH-987 Support HTTP auth for Solr communication
* NUTCH-1028 Log parser keys
* NUTCH-1036 Solr jobs should increment counters in Reporter
* NUTCH-1057 Make fetcher thread time out configurable
* NUTCH-1067 Configure minimum throughput for fetcher
* NUTCH-1101 Options to purge db_gone records in updatedb
* NUTCH-1102 Fetcher, rely on fetcher.parse directive only
* NUTCH-1105 MaxContentLength option for index-basic
* NUTCH-940 Static field plugin
* NUTCH-1094 create comprehensive documentation for Nutch 2.0 trunk
* NUTCH-1207 ParserChecker to output signature
* NUTCH-1090 InvertLinks should inform when ignoring internal links
* NUTCH-1174 Outlinks are not properly normalized
* NUTCH-1203 ParseSegment to show number of milliseconds per parse
* NUTCH-1173 DomainStats doesn't count db_not_modified
* NUTCH-1155 Host/domain limit in generator is generate.max.count+1
* NUTCH-1061 Migrate MoreIndexingFilter from Apache ORO to java.util.regex
* NUTCH-1142 Normalization and filtering in WebGraph
* NUTCH-1153 LinkRank not to log all keys and not to write Hadoop _SUCCESS file
* NUTCH-1195 Add Solr 4x (trunk) example schema
* NUTCH-1141 Configurable Fetcher queue depth

PORTED:
* No issues yet


NOT GOING TO BE PORTED:
* No issues, explain why it should not be ported



  was:
Umbrella issue for tracking issues that should be ported from 1.x trunk to the 
NutchGora branch. Please mark ported issues by modifying this description.

NOT YET PORTED:

* NUTCH-987 Support HTTP auth for Solr communication
* NUTCH-1028 Log parser keys
* NUTCH-1036 Solr jobs should increment counters in Reporter
* NUTCH-1057 Make fetcher thread time out configurable
* NUTCH-1067 Configure minimum throughput for fetcher
* NUTCH-1101 Options to purge db_gone records in updatedb
* NUTCH-1102 Fetcher, rely on fetcher.parse directive only
* NUTCH-1105 MaxContentLength option for index-basic
* NUTCH-940 Static field plugin
* NUTCH-1094 create comprehensive documentation for Nutch 2.0 trunk


PORTED:
* No issues yet


NOT GOING TO BE PORTED:
* No issues, explain why it should not be ported




> Port issues from trunk NutchGora branch
> ---
>
> Key: NUTCH-1104
> URL: https://issues.apache.org/jira/browse/NUTCH-1104
> Project: Nutch
> Issue Type: Task
> Affects Versions: nutchgora
> Reporter: Markus Jelsma
> Fix For: nutchgora
>
>
> Umbrella issue for tracking issues that should be ported from 1.x trunk to 
> the NutchGora branch. Please mark ported issues by modifying this description.
> NOT YET PORTED:
> * NUTCH-987 Support HTTP auth for Solr communication
> * NUTCH-1028 Log parser keys
> * NUTCH-1036 Solr jobs should increment counters in Reporter
> * NUTCH-1057 Make fetcher thread time out configurable
> * NUTCH-1067 Configure minimum throughput for fetcher
> * NUTCH-1101 Options to purge db_gone records in updatedb
> * NUTCH-1102 Fetcher, rely on fetcher.parse directive only
> * NUTCH-1105 MaxContentLength option for index-basic
> * NUTCH-940 Static field plugin
> * NUTCH-1094 create comprehensive documentation for Nutch 2.0 trunk
> * NUTCH-1207 ParserChecker to output signature
> * NUTCH-1090 InvertLinks should inform when ignoring internal links
> * NUTCH-1174 Outlinks are not properly normalized
> * NUTCH-1203 ParseSegment to show number of milliseconds per parse
> * NUTCH-1173 DomainStats doesn't count db_not_modified
> * NUTCH-1155 Host/domain limit in generator is generate.max.count+1
> * NUTCH-1061 Migrate MoreIndexingFilter from Apache ORO to java.util.regex
> * NUTCH-1142 Normalization and filtering in WebGraph
> * NUTCH-1153 LinkRank not to log all keys and not to write Hadoop _SUCCESS 
> file
> * NUTCH-1195 Add Solr 4x (trunk) example schema
> * NUTCH-1141 Configurable Fetcher queue depth
> PORTED:
> * No issues yet
> NOT GOING TO BE PORTED:
> * No issues, explain why it should not be ported





[jira] [Commented] (NUTCH-1206) tika parser of nutch 1.3 is failing to prcess pdfs

2011-11-21 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13154180#comment-13154180
 ] 

Markus Jelsma commented on NUTCH-1206:
--

Have you tried the Nutch trunk or the most recent Tika as suggested? 

> tika parser of nutch 1.3 is failing to prcess pdfs
> --
>
> Key: NUTCH-1206
> URL: https://issues.apache.org/jira/browse/NUTCH-1206
> Project: Nutch
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.3
> Environment: Solaris/Linux/Windows
> Reporter: dibyendu ghosh
>
> Please refer to this message: 
> http://www.mail-archive.com/user%40nutch.apache.org/msg04315.html. Old 
> parse-pdf parser seems to be able to parse old pdfs (checked with nutch 1.2) 
> though it is not able to parse acrobat 9.0 version of pdfs. nutch 1.3 does 
> not have parse-pdf plugin and it is not able to parse even older pdfs.
> my code (TestParse.java):
> 
> bash-2.00$ cat TestParse.java
> import java.io.File;
> import java.io.FileInputStream;
> import java.io.FileOutputStream;
> import java.io.PrintStream;
> import java.util.Iterator;
> import java.util.Map;
> import java.util.Map.Entry;
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.io.Text;
> import org.apache.nutch.metadata.Metadata;
> import org.apache.nutch.parse.ParseResult;
> import org.apache.nutch.parse.Parse;
> import org.apache.nutch.parse.ParseStatus;
> import org.apache.nutch.parse.ParseUtil;
> import org.apache.nutch.parse.ParseData;
> import org.apache.nutch.protocol.Content;
> import org.apache.nutch.util.NutchConfiguration;
>
> public class TestParse {
>     private static Configuration conf = NutchConfiguration.create();
>
>     public TestParse() {
>     }
>
>     public static void main(String[] args) {
>         String filename = args[0];
>         convert(filename);
>     }
>
>     // Converts the given file to abc.html; returns the new name, or null on failure.
>     public static String convert(String fileName) {
>         String newName = "abc.html";
>         try {
>             System.out.println("Converting " + fileName + " to html.");
>             if (convertToHtml(fileName, newName))
>                 return newName;
>         } catch (Exception e) {
>             (new File(newName)).delete();
>             System.out.println("General exception " + e.getMessage());
>         }
>         return null;
>     }
>
>     private static boolean convertToHtml(String fileName, String newName)
>             throws Exception {
>         // Read the file
>         FileInputStream in = new FileInputStream(fileName);
>         byte[] buf = new byte[in.available()];
>         in.read(buf);
>         in.close();
>         // Parse the file
>         Content content = new Content("file:" + fileName, "file:" + fileName,
>                 buf, "", new Metadata(), conf);
>         ParseResult parseResult = new ParseUtil(conf).parse(content);
>         parseResult.filter();
>         if (parseResult.isEmpty()) {
>             System.out.println("All parsing attempts failed");
>             return false;
>         }
>         Iterator<Map.Entry<Text, Parse>> iterator = parseResult.iterator();
>         if (iterator == null) {
>             System.out.println("Cannot iterate over successful parse results");
>             return false;
>         }
>         Parse parse = null;
>         ParseData parseData = null;
>         while (iterator.hasNext()) {
>             parse = parseResult.get((Text) iterator.next().getKey());
>             parseData = parse.getData();
>             ParseStatus status = parseData.getStatus();
>             // If Parse failed then bail
>             if (!ParseStatus.STATUS_SUCCESS.equals(status)) {
>                 System.out.println("Could not parse " + fileName + ". "
>                         + status.getMessage());
>                 return false;
>             }
>         }
>         // Start writing to newName
>         FileOutputStream fout = new FileOutputStream(newName);
>         PrintStream out = new PrintStream(fout, true, "UTF-8");
>         // Start Document
>         out.println("<html>");
>         // Start Header
>         out.println("<head>");
>         // Write Title
>         String title = parseData.getTitle();
>         if (title != null && title.trim().length() > 0) {
>             out.println("<title>" + parseData.getTitle() + "</title>");
>         }
>         // Write out Meta tags
>         Metadata metaData = parseData.getContentMeta();
>         String[] names = metaData.names();
>         for (String name : names) {
>             String[] subvalues = metaData.getValues(name);
>             // Start from an empty string; starting from null would prepend "null"
>             String values = "";
>             for (String subvalue : subvalues) {
>                 values += subvalue;
>             }
>             if (values.length() > 0)
> out.printf("\n",
>   

[jira] [Resolved] (NUTCH-1207) ParserChecker to output signature

2011-11-21 Thread Markus Jelsma (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma resolved NUTCH-1207.
--

Resolution: Fixed

Committed for 1.5 in rev. 1204492.

> ParserChecker to output signature
> -
>
> Key: NUTCH-1207
> URL: https://issues.apache.org/jira/browse/NUTCH-1207
> Project: Nutch
> Issue Type: Improvement
> Components: parser
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Trivial
> Fix For: 1.5
>
> Attachments: NUTCH-1207-1.5-1.patch
>
>
> ParserChecker should calculate and display the signature. Makes debugging a 
> bit easier.





[jira] [Created] (NUTCH-1207) ParserChecker to output signature

2011-11-21 Thread Markus Jelsma (Created) (JIRA)
ParserChecker to output signature
-

Key: NUTCH-1207
URL: https://issues.apache.org/jira/browse/NUTCH-1207
Project: Nutch
Issue Type: Improvement
Components: parser
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Trivial
Fix For: 1.5
Attachments: NUTCH-1207-1.5-1.patch

ParserChecker should calculate and display the signature. Makes debugging a bit 
easier.
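
What "calculate and display the signature" could look like is sketched below. The attached patch is not reproduced in this message, so this is not the committed change: it assumes, purely for illustration, that the signature is an MD5 digest over the raw content bytes, and `SignatureSketch` and `md5Hex` are hypothetical names. Only standard JDK classes are used.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class SignatureSketch {

    /** Returns the MD5 digest of the content as a lowercase hex string. */
    static String md5Hex(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(content);
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                // Two hex digits per byte, zero-padded.
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required on every Java platform, so this should not happen.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        byte[] content = "example page content".getBytes(StandardCharsets.UTF_8);
        System.out.println("signature: " + md5Hex(content));
    }
}
```

Printing the value next to the parse output makes it easy to compare two fetches of the same page and see whether the content is considered changed.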





[jira] [Updated] (NUTCH-1207) ParserChecker to output signature

2011-11-21 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1207:
-

Attachment: NUTCH-1207-1.5-1.patch

> ParserChecker to output signature
> -
>
> Key: NUTCH-1207
> URL: https://issues.apache.org/jira/browse/NUTCH-1207
> Project: Nutch
> Issue Type: Improvement
> Components: parser
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Trivial
> Fix For: 1.5
>
> Attachments: NUTCH-1207-1.5-1.patch
>
>
> ParserChecker should calculate and display the signature. Makes debugging a 
> bit easier.
