[jira] Assigned: (NUTCH-817) parse-(html) does follow links of full html page, parse-(tika) does not follow any links and stops at level 1

2010-05-02 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche reassigned NUTCH-817:
---

Assignee: Julien Nioche

> parse-(html) does follow links of full html page, parse-(tika) does not 
> follow any links and stops at level 1
> 
>
> Key: NUTCH-817
> URL: https://issues.apache.org/jira/browse/NUTCH-817
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.1
> Environment: Suse linux 11.1, java version "1.6.0_13"
>Reporter: matthew a. grisius
>Assignee: Julien Nioche
> Attachments: sample-javadoc.html
>
>
> Submitted per Julien Nioche. I did not see where to attach a file so I pasted 
> it here. BTW: the Tika command line returns an empty html body for this file.
> [javadoc-generated frameset page (frameset DTD: 
> http://www.w3.org/TR/html4/frameset.dtd); most HTML markup was stripped in 
> the archive, the recoverable content follows]
>
> Title: Matrix Application Development Kit
>
> targetPage = "" + window.location.search;
> if (targetPage != "" && targetPage != "undefined")
>     targetPage = targetPage.substring(1);
> function loadFrames() {
>     if (targetPage != "" && targetPage != "undefined")
>         top.classFrame.location = top.targetPage;
> }
>
> Frame Alert
>
> This document is designed to be viewed using the frames feature. If you see 
> this message, you are using a non-frame-capable web client.
>
> Link to Non-frame version.
>

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-710) Support for rel="canonical" attribute

2010-04-21 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859286#action_12859286
 ] 

Julien Nioche commented on NUTCH-710:
-

As suggested previously, we could either treat canonicals as redirections or 
handle them during deduplication. Neither is a satisfactory solution.

Redirection: we want to index the document if/when the target of the canonical 
is not available for indexing. We also want to follow the outlinks.
Dedup: we could modify the *DeleteDuplicates code, but canonicals are more 
complex due to the fact that we need to follow redirections.

We probably need a third approach: prefilter by going through the crawldb and 
detect URLs whose canonical target is already indexed or ready to be indexed. 
We need to follow up to X levels of redirection, e.g. doc A is marked as having 
canonical representation doc B, doc B redirects to doc C, etc. If the end of 
the redirection chain exists and is valid, then mark A as a duplicate of C 
(intermediate redirs will not get indexed anyway).

As we don't know if it has been indexed yet, we would give it a special marker 
(e.g. status_duplicate) in the crawlDB. Then:
-> if the indexer comes across such an entry: skip it
-> make it so that *deleteDuplicates can take a list of URLs with 
status_duplicate as an additional source of input, OR have a custom resource 
that deletes such entries from SOLR or Lucene indices

The implementation would be as follows :

Go through all redirections and generate all redirection chains e.g.

A -> B
B -> C
D -> C

where C is an indexable document (i.e. it has been fetched and parsed - it may 
already have been indexed).

This will yield

A -> C
B -> C
D -> C

but also

C -> C

Once we have all possible redirections: go through the crawlDB in search of 
canonicals. If the target of a canonical is the source of a valid alias (e.g. A 
- B - C - D), mark it as 'status:duplicate'.
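The chain-resolution step above can be sketched as follows. This is a 
hypothetical illustration, not Nutch code: it follows redirects up to a fixed 
number of hops and treats over-long or cyclic chains as unresolved.

```java
import java.util.HashMap;
import java.util.Map;

public class RedirChains {

    // "up to X levels of redirection"; the limit is an assumption here
    static final int MAX_HOPS = 5;

    // redirects: an entry A -> B means A redirects to B.
    // Returns the final target of the chain starting at url,
    // or null if the chain is too long or cyclic.
    static String resolve(String url, Map<String, String> redirects) {
        String current = url;
        for (int hop = 0; hop < MAX_HOPS; hop++) {
            String next = redirects.get(current);
            if (next == null) {
                return current; // end of chain: an indexable document
            }
            current = next;
        }
        return null; // unresolved: do not mark anything as duplicate
    }

    public static void main(String[] args) {
        Map<String, String> redirects = new HashMap<>();
        redirects.put("A", "B");
        redirects.put("B", "C");
        redirects.put("D", "C");

        // A, B and D all resolve to C; C resolves to itself (C -> C)
        System.out.println(resolve("A", redirects)); // C
        System.out.println(resolve("C", redirects)); // C
    }
}
```

A URL whose canonical target resolves to an indexed document would then get 
the status_duplicate marker in the crawlDB.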

This design implies generating quite a few intermediate structures + scanning 
the whole crawlDB twice (once for the aliases, then for the canonicals) + 
rewriting the whole crawlDB to mark some of the entries as duplicates.

This would be much easier to do when we have Nutch2/HBase: we could simply 
follow the redirs from the initial URL having a canonical tag instead of 
generating these intermediate structures. We could then modify the entries one 
by one instead of regenerating the whole crawlDB.

WDYT?



> Support for rel="canonical" attribute
> -
>
> Key: NUTCH-710
> URL: https://issues.apache.org/jira/browse/NUTCH-710
> Project: Nutch
>  Issue Type: New Feature
>Affects Versions: 1.1
>Reporter: Frank McCown
>Priority: Minor
>
> There is a new rel="canonical" attribute which is
> now being supported by Google, Yahoo, and Live:
> http://googlewebmastercentral.blogspot.com/2009/02/specify-your-canonical.html
> Adding support for this attribute value will potentially reduce the number of 
> URLs crawled and indexed and reduce duplicate page content.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: [VOTE 2] Board resolution for Nutch as TLP

2010-04-13 Thread Julien Nioche
+1.  Request the Board make Nutch a TLP

On 12 April 2010 12:08, Andrzej Bialecki  wrote:

> Hi,
>
> Take two, after s/crawling/search/ ...
>
> Following the discussion, below is the text of the proposed Board
> Resolution to vote upon.
>
> [] +1.  Request the Board make Nutch a TLP
> [] +0.  I don't feel strongly about it, but I'm okay with this.
> [] -1.  No, don't request the Board make Nutch a TLP, and here are my
>  reasons...
>
> This is a majority count vote (i.e. no vetoes). The vote is open for 72
> hours.
>
> Here's my +1.
>
> ===
> X. Establish the Apache Nutch Project
>
> WHEREAS, the Board of Directors deems it to be in the best
> interests of the Foundation and consistent with the
> Foundation's purpose to establish a Project Management
> Committee charged with the creation and maintenance of
> open-source software related to a large-scale web search
> platform for distribution at no charge to the public.
>
> NOW, THEREFORE, BE IT RESOLVED, that a Project Management
> Committee (PMC), to be known as the "Apache Nutch Project",
> be and hereby is established pursuant to Bylaws of the
> Foundation; and be it further
>
> RESOLVED, that the Apache Nutch Project be and hereby is
> responsible for the creation and maintenance of software
> related to a large-scale web search platform; and be it further
>
> RESOLVED, that the office of "Vice President, Apache Nutch" be
> and hereby is created, the person holding such office to
> serve at the direction of the Board of Directors as the chair
> of the Apache Nutch Project, and to have primary responsibility
> for management of the projects within the scope of
> responsibility of the Apache Nutch Project; and be it further
>
> RESOLVED, that the persons listed immediately below be and
> hereby are appointed to serve as the initial members of the
> Apache Nutch Project:
>
>• Andrzej Bialecki 
>• Otis Gospodnetic 
>• Dogacan Guney 
>• Dennis Kubes 
>• Chris Mattmann 
>• Julien Nioche 
>• Sami Siren 
>
> RESOLVED, that the Apache Nutch Project be and hereby
> is tasked with the migration and rationalization of the Apache
> Lucene Nutch sub-project; and be it further
>
> RESOLVED, that all responsibilities pertaining to the Apache
> Lucene Nutch sub-project encumbered upon the
> Apache Lucene Project are hereafter discharged.
>
> NOW, THEREFORE, BE IT FURTHER RESOLVED, that Andrzej Bialecki
> be appointed to the office of Vice President, Apache Nutch, to
> serve in accordance with and subject to the direction of the
> Board of Directors and the Bylaws of the Foundation until
> death, resignation, retirement, removal or disqualification,
> or until a successor is appointed.
> ===
>
>
> --
> Best regards,
> Andrzej Bialecki <><
>  ___. ___ ___ ___ _ _   __
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>
>
>


-- 
DigitalPebble Ltd
http://www.digitalpebble.com


[jira] Commented: (NUTCH-808) Evaluate ORM Frameworks which support non-relational column-oriented datastores and RDBMs

2010-04-13 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12856349#action_12856349
 ] 

Julien Nioche commented on NUTCH-808:
-

Hi Enis,

{quote}
On the other hand, current implementation is ...
{quote}

What do you mean by current implementation? NutchBase?

My gut feeling would be to write a custom framework instead of relying on 
DataNucleus, and to use AVRO if possible. I really think that HBase support is 
urgently needed, but I am less convinced that we need MySQL in the very short 
term.

I know that Cascading has various Tap/Sink implementations including JDBC, 
HBase, but also SimpleDB. Maybe it would be worth having a look at how they do 
it?
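The "custom framework" suggested above would essentially be a thin datastore 
abstraction. A minimal sketch under that assumption follows; all names here 
(DataStore, MemStore, WebPage) are hypothetical and are not NutchBase code. 
Real backends (HBase, Cassandra, JDBC) would each implement the interface.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Minimal store contract: persist POJOs by key.
interface DataStore<K, T> {
    void put(K key, T value);
    Optional<T> get(K key);
}

// In-memory implementation, standing in for a real backend.
class MemStore<K, T> implements DataStore<K, T> {
    private final Map<K, T> rows = new HashMap<>();
    public void put(K key, T value) { rows.put(key, value); }
    public Optional<T> get(K key) { return Optional.ofNullable(rows.get(key)); }
}

// A plain POJO, per the "Using POJOs" requirement of the issue.
class WebPage {
    String url;
    int status;
    WebPage(String url, int status) { this.url = url; this.status = status; }
}

public class StoreDemo {
    public static void main(String[] args) {
        DataStore<String, WebPage> store = new MemStore<>();
        store.put("http://example.com/", new WebPage("http://example.com/", 200));
        System.out.println(store.get("http://example.com/").get().status); // 200
    }
}
```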

> Evaluate ORM Frameworks which support non-relational column-oriented 
> datastores and RDBMs 
> --
>
> Key: NUTCH-808
> URL: https://issues.apache.org/jira/browse/NUTCH-808
> Project: Nutch
>  Issue Type: Task
>Reporter: Enis Soztutar
>Assignee: Enis Soztutar
> Fix For: 2.0
>
>
> We have an ORM layer in the NutchBase branch, which uses Avro Specific 
> Compiler to compile class definitions given in JSON. Before moving on with 
> this, we might benefit from evaluating other frameworks, whether they suit 
> our needs. 
> We want at least the following capabilities:
> - Using POJOs 
> - Able to persist objects to at least HBase, Cassandra, and RDBMs 
> - Able to efficiently serialize objects as task outputs from Hadoop jobs
> - Allow native queries, along with standard queries 
> Any comments, suggestions for other frameworks are welcome.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Updated: (NUTCH-650) Hbase Integration

2010-04-07 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-650:


Affects Version/s: (was: 1.0.0)
Fix Version/s: 2.0

> Hbase Integration
> -
>
> Key: NUTCH-650
> URL: https://issues.apache.org/jira/browse/NUTCH-650
> Project: Nutch
>  Issue Type: New Feature
>Reporter: Doğacan Güney
>Assignee: Doğacan Güney
> Fix For: 2.0
>
> Attachments: hbase-integration_v1.patch, hbase_v2.patch, 
> malformedurl.patch, meta.patch, meta2.patch, nofollow-hbase.patch, 
> NUTCH-650.patch, nutch-habase.patch, searching.diff, slash.patch
>
>
> This issue will track nutch/hbase integration

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-808) Evaluate ORM Frameworks which support non-relational column-oriented datastores and RDBMs

2010-04-07 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-808:


Fix Version/s: 2.0

> Evaluate ORM Frameworks which support non-relational column-oriented 
> datastores and RDBMs 
> --
>
> Key: NUTCH-808
> URL: https://issues.apache.org/jira/browse/NUTCH-808
> Project: Nutch
>  Issue Type: Task
>Reporter: Enis Soztutar
>Assignee: Enis Soztutar
> Fix For: 2.0
>
>
> We have an ORM layer in the NutchBase branch, which uses Avro Specific 
> Compiler to compile class definitions given in JSON. Before moving on with 
> this, we might benefit from evaluating other frameworks, whether they suit 
> our needs. 
> We want at least the following capabilities:
> - Using POJOs 
> - Able to persist objects to at least HBase, Cassandra, and RDBMs 
> - Able to efficiently serialize objects as task outputs from Hadoop jobs
> - Allow native queries, along with standard queries 
> Any comments, suggestions for other frameworks are welcome.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Nutch 2.0 roadmap

2010-04-07 Thread Julien Nioche
Hi,

I'm not sure what is the status of the nutchbase - it's missed a lot of
> fixes and changes in trunk since it's been last touched ...
>

Yes, maybe we should start the 2.0 branch from 1.1 instead.
Dogacan - what do you think?

BTW I see there is now a 2.0 label under JIRA, thanks to whoever added it


> Also, the goal of the crawler-commons project is to provide APIs and
> implementations of stuff that is needed for every open source crawler
> project, like: robots handling, url filtering and url normalization, URL
> state management, perhaps deduplication. We should coordinate our
> efforts, and share code freely so that other projects (bixo, heritrix,
> droids) may contribute to this shared pool of functionality, much like
> Tika does for the common need of parsing complex formats.
>

definitely

 +1 - we may still keep a thin abstract layer to allow other
> indexing/search backends, but the current mess of indexing/query filters
> and competing indexing frameworks (lucene, fields, solr) should go away.
> We should go directly from DOM to a NutchDocument, and stop there.
>


I think that separating the parsing filters from the indexing filters has its 
merits, e.g. combining the metadata generated by two or more different parsing 
filters into a single field in the NutchDocument, keeping only a subset of the 
available information, etc.


> >
> > I suppose that http://wiki.apache.org/nutch/Nutch2Architecture needs an
> > update?
>

I have created a new page to serve as a basis for discussion:
http://wiki.apache.org/nutch/Nutch2Roadmap

julien
-- 
DigitalPebble Ltd
http://www.digitalpebble.com


Nutch 2.0 roadmap

2010-04-06 Thread Julien Nioche
Hi guys,

I gather that we'll jump straight to 2.0 after 1.1 and that 2.0 will be
based on what is currently referred to as NutchBase. Shall we create a
branch for 2.0 in the Nutch SVN repository and a corresponding label in
JIRA so that we can file issues / feature requests on 2.0? Do you think that
the current NutchBase could be used as a basis for the 2.0 branch?

Talking about features, what else would we add apart from:

* support for HBase: via ORM or not (see NUTCH-808)
* plugin cleanup: Tika only for parsing - get rid of everything else?
* remove index / search and delegate to SOLR
* new functionalities, e.g. sitemap support, canonical tag, etc.

I suppose that http://wiki.apache.org/nutch/Nutch2Architecture needs an
update?

I look forward to hearing your thoughts on this

Julien
-- 
DigitalPebble Ltd
http://www.digitalpebble.com


release of 1.1?

2010-04-06 Thread Julien Nioche
Chris,

Just to let you know that I have committed
https://issues.apache.org/jira/browse/NUTCH-810 which was the last open
issue before the release of 1.1

Thanks

Julien
-- 
DigitalPebble Ltd
http://www.digitalpebble.com


[jira] Closed: (NUTCH-810) Upgrade to Tika 0.7

2010-04-06 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche closed NUTCH-810.
---

Resolution: Fixed

Committed in rev 931098.

http://issues.apache.org/jira/browse/TIKA-317 changed the way the TikaConfig is 
created, as it no longer relies on a tika-config.xml file. Our custom 
TikaConfig has been modified to reflect these changes.

This was the last remaining issue marked for 1.1 



> Upgrade to Tika 0.7
> ---
>
> Key: NUTCH-810
> URL: https://issues.apache.org/jira/browse/NUTCH-810
> Project: Nutch
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.0.0
>    Reporter: Julien Nioche
>    Assignee: Julien Nioche
> Fix For: 1.1
>
>
> Upgrading to Tika 0.7 before the 1.1 release.
> The TikaConfig mechanism has changed and does not rely on a default XML 
> config file anymore. I am working on it.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-789) Improvements to Tika parser

2010-04-06 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-789:


  Component/s: (was: fetcher)
   parser
Fix Version/s: (was: 1.1)

I have created a separate issue for the upgrade to Tika 0.7 and moved this one 
out of 1.1.

> Improvements to Tika parser
> ---
>
> Key: NUTCH-789
> URL: https://issues.apache.org/jira/browse/NUTCH-789
> Project: Nutch
>  Issue Type: Improvement
>  Components: parser
> Environment: reported by Sami, in NUTCH-766
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
>Priority: Minor
> Attachments: NutchTikaConfig.java, TikaParser.java
>
>
> As reported by Sami in NUTCH-766, Sami has a few improvements he made to the 
> Tika parser. We'll track that progress here.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (NUTCH-810) Upgrade to Tika 0.7

2010-04-06 Thread Julien Nioche (JIRA)
Upgrade to Tika 0.7
---

 Key: NUTCH-810
 URL: https://issues.apache.org/jira/browse/NUTCH-810
 Project: Nutch
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.0.0
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.1


Upgrading to Tika 0.7 before the 1.1 release.

The TikaConfig mechanism has changed and does not rely on a default XML config 
file anymore. I am working on it.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-809) Parse-metatags plugin

2010-04-05 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-809:


Description: 
h2. Parse-metatags plugin

The parse-metatags plugin consists of an HTMLParserFilter which takes as a 
parameter a list of metatag names, with '*' as the default value. The values 
are separated by ';'.

In order to extract the values of the metatags description and keywords, you 
must specify in nutch-site.xml:

{code:xml}
<property>
  <name>metatags.names</name>
  <value>description;keywords</value>
</property>
{code}

The MetatagIndexer uses the output of the parsing above to create two fields 
'keywords' and 'description'. Note that keywords is multivalued.

The query-basic plugin is used to include these fields in the search, e.g. in 
nutch-site.xml:

{code:xml}
<property>
  <name>query.basic.description.boost</name>
  <value>2.0</value>
</property>

<property>
  <name>query.basic.keywords.boost</name>
  <value>2.0</value>
</property>
{code}


This code has been developed by DigitalPebble Ltd and offered to the community 
by ANT.com
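How the plugin interprets this configuration can be sketched as follows. This 
is a hypothetical illustration, not the actual NUTCH-809 patch: split the 
metatags.names value on ';' and keep only the matching metatags, with '*' 
keeping everything.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MetatagFilter {

    // namesParam is the metatags.names value, e.g. "description;keywords".
    // pageMetatags maps metatag names found in the page to their content.
    static Map<String, String> filter(String namesParam,
                                      Map<String, String> pageMetatags) {
        List<String> wanted = Arrays.asList(namesParam.split(";"));
        Map<String, String> kept = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : pageMetatags.entrySet()) {
            // '*' is the default value and means "keep every metatag"
            if (wanted.contains("*") || wanted.contains(e.getKey())) {
                kept.put(e.getKey(), e.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, String> page = new LinkedHashMap<>();
        page.put("description", "A sample page");
        page.put("keywords", "nutch,crawler");
        page.put("robots", "noindex");

        // Only description and keywords survive the filter
        System.out.println(filter("description;keywords", page).keySet());
    }
}
```

The kept values would then be handed to the indexing side (the MetatagIndexer 
in the plugin's case) to become document fields.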



  was:
h2. Parse-metatags plugin

The parse-metatags plugin consists of an HTMLParserFilter which takes as a 
parameter a list of metatag names, with '*' as the default value. The values 
are separated by ';'.

In order to extract the values of the metatags description and keywords, you 
must specify in nutch-site.xml:

{code:xml}
<property>
  <name>metatags.names</name>
  <value>description;keywords</value>
</property>
{code}

The MetatagIndexer uses the output of the parsing above to create two fields 
'keywords' and 'description'. Note that keywords is multivalued.
The MetaTagsQueryFilter makes it possible to include the fields above in the 
Nutch queries.

This code has been developed by DigitalPebble Ltd and offered to the community 
by ANT.com




> Parse-metatags plugin
> -
>
> Key: NUTCH-809
> URL: https://issues.apache.org/jira/browse/NUTCH-809
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Reporter: Julien Nioche
>Assignee: Julien Nioche
> Attachments: NUTCH-809.patch
>
>
> h2. Parse-metatags plugin
> The parse-metatags plugin consists of an HTMLParserFilter which takes as a 
> parameter a list of metatag names, with '*' as the default value. The values 
> are separated by ';'.
> In order to extract the values of the metatags description and keywords, you 
> must specify in nutch-site.xml:
> {code:xml}
> <property>
>   <name>metatags.names</name>
>   <value>description;keywords</value>
> </property>
> {code}
> The MetatagIndexer uses the output of the parsing above to create two fields 
> 'keywords' and 'description'. Note that keywords is multivalued.
> The query-basic plugin is used to include these fields in the search, e.g. in 
> nutch-site.xml:
> {code:xml}
> <property>
>   <name>query.basic.description.boost</name>
>   <value>2.0</value>
> </property>
> <property>
>   <name>query.basic.keywords.boost</name>
>   <value>2.0</value>
> </property>
> {code}
> This code has been developed by DigitalPebble Ltd and offered to the 
> community by ANT.com

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-789) Improvements to Tika parser

2010-04-04 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12853251#action_12853251
 ] 

Julien Nioche commented on NUTCH-789:
-

I will upgrade as soon as 0.7 is available from 
http://repo1.maven.org/maven2/org/apache/tika/ - which is not the case yet.
I will leave this issue open but unmark it for 1.1.

> Improvements to Tika parser
> ---
>
> Key: NUTCH-789
> URL: https://issues.apache.org/jira/browse/NUTCH-789
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
> Environment: reported by Sami, in NUTCH-766
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
>Priority: Minor
> Fix For: 1.1
>
> Attachments: NutchTikaConfig.java, TikaParser.java
>
>
> As reported by Sami in NUTCH-766, Sami has a few improvements he made to the 
> Tika parser. We'll track that progress here.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-809) Parse-metatags plugin

2010-04-02 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-809:


Description: 
h2. Parse-metatags plugin

The parse-metatags plugin consists of an HTMLParserFilter which takes as a 
parameter a list of metatag names, with '*' as the default value. The values 
are separated by ';'.

In order to extract the values of the metatags description and keywords, you 
must specify in nutch-site.xml:

{code:xml}
<property>
  <name>metatags.names</name>
  <value>description;keywords</value>
</property>
{code}

The MetatagIndexer uses the output of the parsing above to create two fields 
'keywords' and 'description'. Note that keywords is multivalued.
The MetaTagsQueryFilter makes it possible to include the fields above in the 
Nutch queries.

This code has been developed by DigitalPebble Ltd and offered to the community 
by ANT.com



  was:
h2. Parse-metatags plugin

*NOTE: THIS PLUGIN DOES NOT WORK WITH THE CURRENT VERSION OF PARSE-TIKA (see 
[TIKA-379]).* 

To use the legacy HTML parser, specify in parse-plugins.xml:

{code:xml}
<mimeType name="text/html">
  <plugin id="parse-html" />
</mimeType>
{code}

The parse-metatags plugin consists of an HTMLParserFilter which takes as a 
parameter a list of metatag names, with '*' as the default value. The values 
are separated by ';'.

In order to extract the values of the metatags description and keywords, you 
must specify in nutch-site.xml:

{code:xml}
<property>
  <name>metatags.names</name>
  <value>description;keywords</value>
</property>
{code}

The MetatagIndexer uses the output of the parsing above to create two fields 
'keywords' and 'description'. Note that keywords is multivalued.
The MetaTagsQueryFilter makes it possible to include the fields above in the 
Nutch queries.

This code has been developed by DigitalPebble Ltd and offered to the community 
by ANT.com




> Parse-metatags plugin
> -
>
> Key: NUTCH-809
> URL: https://issues.apache.org/jira/browse/NUTCH-809
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Reporter: Julien Nioche
>Assignee: Julien Nioche
> Attachments: NUTCH-809.patch
>
>
> h2. Parse-metatags plugin
> The parse-metatags plugin consists of an HTMLParserFilter which takes as a 
> parameter a list of metatag names, with '*' as the default value. The values 
> are separated by ';'.
> In order to extract the values of the metatags description and keywords, you 
> must specify in nutch-site.xml:
> {code:xml}
> <property>
>   <name>metatags.names</name>
>   <value>description;keywords</value>
> </property>
> {code}
> The MetatagIndexer uses the output of the parsing above to create two fields 
> 'keywords' and 'description'. Note that keywords is multivalued.
> The MetaTagsQueryFilter makes it possible to include the fields above in the 
> Nutch queries.
> This code has been developed by DigitalPebble Ltd and offered to the 
> community by ANT.com

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-809) Parse-metatags plugin

2010-04-02 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-809:


Attachment: NUTCH-809.patch

Modified version of the plugin which is compatible with parse-tika

> Parse-metatags plugin
> -
>
> Key: NUTCH-809
> URL: https://issues.apache.org/jira/browse/NUTCH-809
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>    Reporter: Julien Nioche
>    Assignee: Julien Nioche
> Attachments: NUTCH-809.patch
>
>
> h2. Parse-metatags plugin
> *NOTE: THIS PLUGIN DOES NOT WORK WITH THE CURRENT VERSION OF PARSE-TIKA (see 
> [TIKA-379]).* 
> To use the legacy HTML parser, specify in parse-plugins.xml:
> {code:xml}
> <mimeType name="text/html">
>   <plugin id="parse-html" />
> </mimeType>
> {code}
> The parse-metatags plugin consists of an HTMLParserFilter which takes as a 
> parameter a list of metatag names, with '*' as the default value. The values 
> are separated by ';'.
> In order to extract the values of the metatags description and keywords, you 
> must specify in nutch-site.xml:
> {code:xml}
> <property>
>   <name>metatags.names</name>
>   <value>description;keywords</value>
> </property>
> {code}
> The MetatagIndexer uses the output of the parsing above to create two fields 
> 'keywords' and 'description'. Note that keywords is multivalued.
> The MetaTagsQueryFilter makes it possible to include the fields above in the 
> Nutch queries.
> This code has been developed by DigitalPebble Ltd and offered to the 
> community by ANT.com

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-809) Parse-metatags plugin

2010-04-02 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-809:


Attachment: (was: NUTCH-809.patch)

> Parse-metatags plugin
> -
>
> Key: NUTCH-809
> URL: https://issues.apache.org/jira/browse/NUTCH-809
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>    Reporter: Julien Nioche
>    Assignee: Julien Nioche
>
> h2. Parse-metatags plugin
> *NOTE: THIS PLUGIN DOES NOT WORK WITH THE CURRENT VERSION OF PARSE-TIKA (see 
> [TIKA-379]).* 
> To use the legacy HTML parser, specify in parse-plugins.xml:
> {code:xml}
> <mimeType name="text/html">
>   <plugin id="parse-html" />
> </mimeType>
> {code}
> The parse-metatags plugin consists of an HTMLParserFilter which takes as a 
> parameter a list of metatag names, with '*' as the default value. The values 
> are separated by ';'.
> In order to extract the values of the metatags description and keywords, you 
> must specify in nutch-site.xml:
> {code:xml}
> <property>
>   <name>metatags.names</name>
>   <value>description;keywords</value>
> </property>
> {code}
> The MetatagIndexer uses the output of the parsing above to create two fields 
> 'keywords' and 'description'. Note that keywords is multivalued.
> The MetaTagsQueryFilter makes it possible to include the fields above in the 
> Nutch queries.
> This code has been developed by DigitalPebble Ltd and offered to the 
> community by ANT.com

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: [VOTE] Apache Tika 0.7 Release Candidate #1

2010-04-02 Thread Julien Nioche
Hi Chris,

>
> * Once Tika 0.7 is out the door, I will move forward on pushing out a Nutch
> 1.1 RC (after we upgrade Nutch to use Tika 0.7 -- Julien, help? :) ). That
> OK, Nutchers?
>
>
Great. I'll definitely give 0.7 a try and make sure it works in Nutch.

Julien

-- 
DigitalPebble Ltd
http://www.digitalpebble.com


>
>
> On 4/1/10 11:41 PM, "Jukka Zitting"  wrote:
>
> > Hi,
> >
> > On Wed, Mar 31, 2010 at 10:01 PM, Mattmann, Chris A (388J)
> >  wrote:
> >> Please vote on releasing these packages as Apache Tika 0.7.
> >
> > +1 Thanks!
> >
> > Some minor notes:
> > * It would be good to have also a SHA1 checksum for the release archive.
> > * Perhaps we should start offering also the tika-app jar as a direct
> > download from l.a.o/tika/download.html?
> >
> >> The vote is open for the next 72 hours.
> >
> > It looks like people.apache.org is not accessible at the moment (I
> > downloaded the release candidate yesterday), so it might be a good
> > idea to extend the vote period over the Easter holidays.
> >
> > BR,
> >
> > Jukka Zitting
> >
>
>
> ++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: chris.mattm...@jpl.nasa.gov
> WWW:   http://sunset.usc.edu/~mattmann/
> ++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++
>
>
>


[jira] Updated: (NUTCH-809) Parse-metatags plugin

2010-04-02 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-809:


Attachment: NUTCH-809.patch

> Parse-metatags plugin
> -
>
> Key: NUTCH-809
> URL: https://issues.apache.org/jira/browse/NUTCH-809
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>    Reporter: Julien Nioche
>    Assignee: Julien Nioche
> Attachments: NUTCH-809.patch
>
>
> h2. Parse-metatags plugin
> *NOTE: THIS PLUGIN DOES NOT WORK WITH THE CURRENT VERSION OF PARSE-TIKA (see 
> [TIKA-379]).* 
> To use the legacy HTML parser, specify in parse-plugins.xml:
> {code:xml}
> <mimeType name="text/html">
>   <plugin id="parse-html" />
> </mimeType>
> {code}
> The parse-metatags plugin consists of an HTMLParserFilter which takes as a 
> parameter a list of metatag names, with '*' as the default value. The values 
> are separated by ';'.
> In order to extract the values of the metatags description and keywords, you 
> must specify in nutch-site.xml:
> {code:xml}
> <property>
>   <name>metatags.names</name>
>   <value>description;keywords</value>
> </property>
> {code}
> The MetatagIndexer uses the output of the parsing above to create two fields 
> 'keywords' and 'description'. Note that keywords is multivalued.
> The MetaTagsQueryFilter makes it possible to include the fields above in the 
> Nutch queries.
> This code has been developed by DigitalPebble Ltd and offered to the 
> community by ANT.com

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (NUTCH-809) Parse-metatags plugin

2010-04-02 Thread Julien Nioche (JIRA)
Parse-metatags plugin
-

 Key: NUTCH-809
 URL: https://issues.apache.org/jira/browse/NUTCH-809
 Project: Nutch
  Issue Type: New Feature
  Components: parser
Reporter: Julien Nioche
Assignee: Julien Nioche
 Attachments: NUTCH-809.patch

h2. Parse-metatags plugin

*NOTE: THIS PLUGIN DOES NOT WORK WITH THE CURRENT VERSION OF PARSE-TIKA (see 
[TIKA-379]).* 

To use the legacy HTML parser specify in parse-plugins.xml

{code:xml}

  

{code}

The parse-metatags plugin consists of a HTMLParserFilter which takes as 
parameter a list of metatag names with '*' as default value. The values are 
separated by ';'.

In order to extract the values of the metatags description and keywords, you 
must specify in nutch-site.xml

{code:xml}

  metatags.names
  description;keywords

{code}

The MetatagIndexer uses the output of the parsing above to create two fields 
'keywords' and 'description'. Note that keywords is multivalued.
The MetaTagsQueryFilter allows including the fields above in Nutch queries.

This code has been developed by DigitalPebble Ltd and offered to the community 
by ANT.com
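To make the behaviour described above concrete, here is a rough, standalone sketch of what the plugin does conceptually: collect the `<meta>` tags whose names appear in a `metatags.names`-style list (`*` meaning all). This is plain Java with a naive regex over the markup; the class and method names are made up for illustration and are not the actual Nutch HTMLParserFilter API.

```java
import java.util.*;
import java.util.regex.*;

public class MetaTagSketch {
    // Naive matcher for <meta name="..." content="..."> tags.
    private static final Pattern META = Pattern.compile(
        "<meta\\s+name=[\"']([^\"']+)[\"']\\s+content=[\"']([^\"']*)[\"']",
        Pattern.CASE_INSENSITIVE);

    /** Collect meta tags whose name is in the ';'-separated list ('*' = all). */
    public static Map<String, List<String>> extract(String html, String configured) {
        Set<String> wanted =
            new HashSet<>(Arrays.asList(configured.toLowerCase().split(";")));
        Map<String, List<String>> out = new HashMap<>();
        Matcher m = META.matcher(html);
        while (m.find()) {
            String name = m.group(1).toLowerCase();
            if (wanted.contains("*") || wanted.contains(name)) {
                // a name can occur several times, hence the list of values
                out.computeIfAbsent(name, k -> new ArrayList<>()).add(m.group(2));
            }
        }
        return out;
    }
}
```

With `metatags.names` set to `description;keywords`, repeated keywords tags accumulate into one multivalued field, which mirrors the note above that 'keywords' is multivalued.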






[jira] Commented: (NUTCH-794) Language Identification must use check the parse metadata for language values

2010-03-31 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12852095#action_12852095
 ] 

Julien Nioche commented on NUTCH-794:
-

The issue has not been fixed in Tika. Will refile post 1.1 as you suggested. 
Can we update to Tika 0.7 before finalising 1.1?

> Language Identification must use check the parse metadata for language values 
> --
>
> Key: NUTCH-794
> URL: https://issues.apache.org/jira/browse/NUTCH-794
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Reporter: Julien Nioche
>    Assignee: Julien Nioche
> Fix For: 1.1
>
> Attachments: NUTCH-794.patch
>
>
> The following HTML document : 
> <html lang="fi"><head><title>document 1 title</title></head><body>jotain 
> suomeksi</body></html>
> is rendered as the following xhtml by Tika : 
> <html xmlns="http://www.w3.org/1999/xhtml"><head><title>document 1 
> title</title></head><body>jotain suomeksi</body></html>
> with the lang attribute getting lost.  The lang is not stored in the metadata 
> either.
> I will open an issue on Tika and modify TestHTMLLanguageParser so that the 
> tests don't break anymore 




[jira] Updated: (NUTCH-706) Url regex normalizer

2010-03-31 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-706:


Fix Version/s: (was: 1.1)

Both variants of the substitution rule above break existing tests. More work 
will be needed to get a pattern which covers the case described by Meghna *and* 
is compatible with the existing test cases.
Moving it to post-1.1

> Url regex normalizer
> 
>
> Key: NUTCH-706
> URL: https://issues.apache.org/jira/browse/NUTCH-706
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.0.0
>Reporter: Meghna Kukreja
>Priority: Minor
>
> Hey,
> I encountered the following problem while trying to crawl a site using
> nutch-trunk. In the file regex-normalize.xml, the following regex is
> used to remove session ids:
> ([;_]?((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid)=.*?)(\?|&|#|$).
> This pattern also transforms a url such as
> "&newsId=2000484784794&newsLang=en" into "&new&newsLang=en" (since it
> matches 'sId' in 'newsId'), which is incorrect, and hence the url does not
> get fetched. This expression needs to be changed to prevent this.
> Thanks,
> Meghna
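The false match is easy to reproduce in isolation. The pattern below is copied from the report; the `$4` replacement (keeping only the trailing delimiter) is my assumption of how the normalizer rule is applied, so treat this as a sketch of the failure mode rather than the exact regex-normalize.xml rule.

```java
import java.util.regex.Pattern;

public class SessionIdRegexDemo {
    // Session-id pattern quoted from regex-normalize.xml in the report above.
    static final Pattern RULE = Pattern.compile(
        "([;_]?((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid)=.*?)(\\?|&|#|$)");

    public static void main(String[] args) {
        String url = "http://example.com/news?a=1&newsId=2000484784794&newsLang=en";
        // 'sId=' inside 'newsId=' matches the unanchored 'sid' alternative,
        // so the whole parameter is stripped and the url is mangled.
        System.out.println(RULE.matcher(url).replaceAll("$4"));
        // -> http://example.com/news?a=1&new&newsLang=en
    }
}
```

A genuine session id such as `;jsessionid=ABC` is still removed correctly; the bug is only the false positive on parameter names that happen to end in `sId`.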




[jira] Commented: (NUTCH-570) Improvement of URL Ordering in Generator.java

2010-03-30 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851545#action_12851545
 ] 

Julien Nioche commented on NUTCH-570:
-

{quote}Julien, want to take this?{quote}

Not particularly. I am busy on short-term issues for 1.1, so feel free to take 
it if you have a particular interest in this. 
I would be curious to see some figures on the improvements from this patch; my 
impression is that NUTCH-776 would be quicker to implement and maintain and 
might possibly give similar gains. 

> Improvement of URL Ordering in Generator.java
> -
>
> Key: NUTCH-570
> URL: https://issues.apache.org/jira/browse/NUTCH-570
> Project: Nutch
>  Issue Type: Improvement
>  Components: generator
>Reporter: Ned Rockson
>Assignee: Otis Gospodnetic
>Priority: Minor
> Attachments: GeneratorDiff.out, GeneratorDiff_v1.out
>
>
> [Copied directly from my email to nutch-dev list]
> Recently I switched to Fetcher2 over Fetcher for larger whole web fetches 
> (50-100M at a time).  I found that the URLs generated are not optimal because 
> they are simply randomized by a hash comparator.  In one crawl on 24 machines 
> it took about 3 days to crawl 30M URLs.  In comparison with old benchmarks I 
> had set with regular Fetcher.java this was at least 3 fold more time.
> Anyway, I realized that the best situation for ordering can be approached by 
> randomization, but in order to get optimal ordering, urls from the same host 
> should be as far apart in the list as possible.  So I wrote a series of 2 
> map/reduces to optimize the ordering and for a list of 25M documents it takes 
> about 10 minutes on our cluster.  Right now I have it in its own class, but I 
> figured it can go in Generator.java and just add a flag in nutch-default.xml 
> determining if the user wants to use it.
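The idea in the report (URLs from the same host placed as far apart in the fetchlist as possible) can be approximated outside MapReduce with a simple round-robin interleave across hosts. This is only an illustrative sketch of the ordering goal, not the attached GeneratorDiff patch, and it does not guarantee separation when one host dominates the list.

```java
import java.net.URI;
import java.util.*;

public class HostSpread {
    /** Interleave URLs round-robin by host so same-host entries are spread out. */
    public static List<String> spread(List<String> urls) {
        // group URLs by host, preserving input order within each host
        Map<String, Deque<String>> byHost = new LinkedHashMap<>();
        for (String u : urls) {
            String host = URI.create(u).getHost();
            byHost.computeIfAbsent(host, h -> new ArrayDeque<>()).add(u);
        }
        // largest host groups first, so they start being drained earliest
        List<Deque<String>> groups = new ArrayList<>(byHost.values());
        groups.sort((a, b) -> b.size() - a.size());
        // one pick per host per round: consecutive picks come from different hosts
        List<String> out = new ArrayList<>();
        while (!groups.isEmpty()) {
            Iterator<Deque<String>> it = groups.iterator();
            while (it.hasNext()) {
                Deque<String> g = it.next();
                out.add(g.poll());
                if (g.isEmpty()) it.remove();
            }
        }
        return out;
    }
}
```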




[jira] Commented: (NUTCH-789) Improvements to Tika parser

2010-03-30 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851316#action_12851316
 ] 

Julien Nioche commented on NUTCH-789:
-

Shall we postpone the work on this issue to after 1.1?

> Improvements to Tika parser
> ---
>
> Key: NUTCH-789
> URL: https://issues.apache.org/jira/browse/NUTCH-789
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
> Environment: reported by Sami, in NUTCH-766
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
>Priority: Minor
> Fix For: 1.1
>
> Attachments: NutchTikaConfig.java, TikaParser.java
>
>
> As reported by Sami in NUTCH-766, Sami has a few improvements he made to the 
> Tika parser. We'll track that progress here.




[jira] Updated: (NUTCH-714) Need a SFTP and SCP Protocol Handler

2010-03-30 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-714:


Affects Version/s: (was: 0.9.0)
   1.0.0
Fix Version/s: (was: 0.8.2)

Changing fix version to 'unknown'

> Need a SFTP and SCP Protocol Handler
> 
>
> Key: NUTCH-714
> URL: https://issues.apache.org/jira/browse/NUTCH-714
> Project: Nutch
>  Issue Type: New Feature
>  Components: fetcher
>Affects Versions: 1.0.0
>Reporter: Sanjoy Ghosh
>Assignee: Chris A. Mattmann
> Attachments: protocol-sftp.zip
>
>
> An SFTP and SCP Protocol handler is needed to fetch intranet content on an 
> SFTP or SCP server.




[jira] Closed: (NUTCH-785) Fetcher : copy metadata from origin URL when redirecting + call scfilters.initialScore on newly created URL

2010-03-30 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche closed NUTCH-785.
---

Resolution: Fixed

Committed revision 929039

Thanks Andrzej for reviewing it

> Fetcher : copy metadata from origin URL when redirecting + call 
> scfilters.initialScore on newly created URL
> ---
>
> Key: NUTCH-785
> URL: https://issues.apache.org/jira/browse/NUTCH-785
> Project: Nutch
>  Issue Type: Bug
>    Reporter: Julien Nioche
>    Assignee: Julien Nioche
> Fix For: 1.1
>
> Attachments: NUTCH-785.patch
>
>
> When following the redirections, the Fetcher does not copy the metadata from 
> the original URL to the new one or calls the method scfilters.initialScore




[jira] Resolved: (NUTCH-779) Mechanism for passing metadata from parse to crawldb

2010-03-30 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche resolved NUTCH-779.
-

   Resolution: Fixed
Fix Version/s: 1.1

Committed revision 929038.

Thanks Andrzej for your feedback

> Mechanism for passing metadata from parse to crawldb
> 
>
> Key: NUTCH-779
> URL: https://issues.apache.org/jira/browse/NUTCH-779
> Project: Nutch
>  Issue Type: New Feature
>    Reporter: Julien Nioche
>    Assignee: Julien Nioche
> Fix For: 1.1
>
> Attachments: NUTCH-779, NUTCH-779-v2.patch
>
>
> The patch attached allows passing parse metadata to the corresponding entry 
> of the crawldb.  
> Comments are welcome




[jira] Commented: (NUTCH-779) Mechanism for passing metadata from parse to crawldb

2010-03-29 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850915#action_12850915
 ] 

Julien Nioche commented on NUTCH-779:
-

Could anyone please review this issue? I would like to commit it in time for 
the 1.1 release

> Mechanism for passing metadata from parse to crawldb
> 
>
> Key: NUTCH-779
> URL: https://issues.apache.org/jira/browse/NUTCH-779
> Project: Nutch
>  Issue Type: New Feature
>    Reporter: Julien Nioche
>Assignee: Julien Nioche
> Attachments: NUTCH-779, NUTCH-779-v2.patch
>
>
> The patch attached allows passing parse metadata to the corresponding entry 
> of the crawldb.  
> Comments are welcome




[jira] Commented: (NUTCH-785) Fetcher : copy metadata from origin URL when redirecting + call scfilters.initialScore on newly created URL

2010-03-29 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850912#action_12850912
 ] 

Julien Nioche commented on NUTCH-785:
-

Could anyone please review this issue? I would like to commit it in time for 
the 1.1 release

> Fetcher : copy metadata from origin URL when redirecting + call 
> scfilters.initialScore on newly created URL
> ---
>
> Key: NUTCH-785
> URL: https://issues.apache.org/jira/browse/NUTCH-785
> Project: Nutch
>  Issue Type: Bug
>Reporter: Julien Nioche
>    Assignee: Julien Nioche
> Fix For: 1.1
>
> Attachments: NUTCH-785.patch
>
>
> When following the redirections, the Fetcher does not copy the metadata from 
> the original URL to the new one or calls the method scfilters.initialScore




[jira] Updated: (NUTCH-783) IndexerChecker Utilty

2010-03-29 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-783:


Fix Version/s: (was: 1.1)

Removed tag 1.1
Will rename to IndexingPluginsChecker later

> IndexerChecker Utilty
> -
>
> Key: NUTCH-783
> URL: https://issues.apache.org/jira/browse/NUTCH-783
> Project: Nutch
>  Issue Type: New Feature
>  Components: indexer
>    Reporter: Julien Nioche
>    Assignee: Julien Nioche
> Attachments: NUTCH-783.patch
>
>
> This patch contains a new utility which allows checking the configuration of 
> the indexing filters. The IndexerChecker reads and parses a URL, runs the 
> indexing filters on it, and displays the fields obtained along with the first 
> 100 characters of each value.
> Can be used e.g. ./nutch org.apache.nutch.indexer.IndexerChecker 
> http://www.lemonde.fr/




[jira] Created: (NUTCH-806) Merge CrawlDBScanner with CrawlDBReader

2010-03-29 Thread Julien Nioche (JIRA)
Merge CrawlDBScanner with CrawlDBReader
---

 Key: NUTCH-806
 URL: https://issues.apache.org/jira/browse/NUTCH-806
 Project: Nutch
  Issue Type: Improvement
Reporter: Julien Nioche
Assignee: Julien Nioche


The CrawlDBScanner [NUTCH-784] should be merged with the CrawlDBReader. Will do 
that after the 1.1 release 




[jira] Updated: (NUTCH-784) CrawlDBScanner

2010-03-29 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-784:


Fix Version/s: 1.1

> CrawlDBScanner 
> ---
>
> Key: NUTCH-784
> URL: https://issues.apache.org/jira/browse/NUTCH-784
> Project: Nutch
>  Issue Type: New Feature
>    Reporter: Julien Nioche
>    Assignee: Julien Nioche
> Fix For: 1.1
>
> Attachments: NUTCH-784.patch
>
>
> The patch file contains a utility which dumps all the entries matching a 
> regular expression on their URL. The dump mechanism of the crawldb reader is 
> not very useful on large crawldbs, as the output can be extremely large, and 
> the -url function can't help if we don't know which url we want to look at.
> The CrawlDBScanner can either generate a text representation of the 
> CrawlDatum-s or binary objects which can then be used as a new CrawlDB. 
> Usage: CrawlDBScanner <crawldb> <output> <regex> [-s <status>] [-text]
> regex: regular expression on the crawldb key
> -s status : constraint on the status of the crawldb entries e.g. db_fetched, 
> db_unfetched
> -text : if this parameter is used, the output will be of TextOutputFormat; 
> otherwise it generates a 'normal' crawldb with the MapFileOutputFormat
> for instance the command below : 
> ./nutch com.ant.CrawlDBScanner crawl/crawldb /tmp/amazon-dump .+amazon.com.* 
> -s db_fetched -text
> will generate a text file /tmp/amazon-dump containing all the entries of the 
> crawldb matching the regexp  .+amazon.com.* and having a status of db_fetched




[jira] Closed: (NUTCH-784) CrawlDBScanner

2010-03-29 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche closed NUTCH-784.
---

Resolution: Fixed

Committed revision 928746

> CrawlDBScanner 
> ---
>
> Key: NUTCH-784
> URL: https://issues.apache.org/jira/browse/NUTCH-784
> Project: Nutch
>  Issue Type: New Feature
>    Reporter: Julien Nioche
>    Assignee: Julien Nioche
> Fix For: 1.1
>
> Attachments: NUTCH-784.patch
>
>
> The patch file contains a utility which dumps all the entries matching a 
> regular expression on their URL. The dump mechanism of the crawldb reader is 
> not very useful on large crawldbs, as the output can be extremely large, and 
> the -url function can't help if we don't know which url we want to look at.
> The CrawlDBScanner can either generate a text representation of the 
> CrawlDatum-s or binary objects which can then be used as a new CrawlDB. 
> Usage: CrawlDBScanner <crawldb> <output> <regex> [-s <status>] [-text]
> regex: regular expression on the crawldb key
> -s status : constraint on the status of the crawldb entries e.g. db_fetched, 
> db_unfetched
> -text : if this parameter is used, the output will be of TextOutputFormat; 
> otherwise it generates a 'normal' crawldb with the MapFileOutputFormat
> for instance the command below : 
> ./nutch com.ant.CrawlDBScanner crawl/crawldb /tmp/amazon-dump .+amazon.com.* 
> -s db_fetched -text
> will generate a text file /tmp/amazon-dump containing all the entries of the 
> crawldb matching the regexp  .+amazon.com.* and having a status of db_fetched




Re: Will Nutch move to HBase 0.20

2010-03-26 Thread Julien Nioche
That's the plan. Check the list archive for 'nutchbase' to find related
discussions, in particular http://issues.apache.org/jira/browse/NUTCH-650

Hard to tell when this will happen as it depends on the community testing
the nutchbase branch, contributing patches etc... The best thing to do to
make it happen would be to give nutchbase a try and see if you can
contribute to it

Best,

Julien
-- 
DigitalPebble Ltd
http://www.digitalpebble.com

On 24 March 2010 22:36, work only  wrote:

> Seeing that HBase 0.20 claims "Random access performance on par with open source
> relational databases such as MySQL", when will Nutch move to HBase for its
> databases?
>
>
>
>
>


[jira] Updated: (NUTCH-776) Configurable queue depth

2010-03-23 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-776:


Fix Version/s: (was: 1.1)

Moving this issue post 1.1
Needs a patch file, a description of the param in nutch-default.xml and, more 
importantly, some experimentation to see how it impacts the performance of 
fetching.

> Configurable queue depth
> 
>
> Key: NUTCH-776
> URL: https://issues.apache.org/jira/browse/NUTCH-776
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Affects Versions: 1.1
>Reporter: MilleBii
>Priority: Minor
>
> I propose that we create a configurable item for the queuedepth in 
> Fetcher.java instead of the hard-coded value of 50.
> key name : fetcher.queues.depth
> Default value : remains 50 (of course)




[jira] Closed: (NUTCH-762) Alternative Generator which can generate several segments in one parse of the crawlDB

2010-03-22 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche closed NUTCH-762.
---

Resolution: Fixed

Committed revision 926155

Have reverted the prefix for params to 'generate.' + added description of 
crawl.gen.delay on nutch-default + added warning when user specified 
generate.max.per.host.by.ip + param generate.max.per.host is now supported

Thanks Andrzej for reviewing it 

> Alternative Generator which can generate several segments in one parse of the 
> crawlDB
> -
>
> Key: NUTCH-762
> URL: https://issues.apache.org/jira/browse/NUTCH-762
> Project: Nutch
>  Issue Type: New Feature
>  Components: generator
>Affects Versions: 1.0.0
>Reporter: Julien Nioche
>Assignee: Julien Nioche
> Fix For: 1.1
>
> Attachments: NUTCH-762-v2.patch, NUTCH-762-v3.patch
>
>
> When using Nutch on a large scale (e.g. billions of URLs), the operations 
> related to the crawlDB (generate - update) tend to take the biggest part of 
> the time. One solution is to limit such operations to a minimum by generating 
> several fetchlists in one parse of the crawlDB then update the Db only once 
> on several segments. The existing Generator allows several successive runs by 
> generating a copy of the crawlDB and marking the URLs to be fetched. In 
> practice this approach does not work well as we need to read the whole 
> crawlDB as many times as we generate a segment.
> The patch attached contains an implementation of a MultiGenerator  which can 
> generate several fetchlists by reading the crawlDB only once. The 
> MultiGenerator differs from the Generator in other aspects: 
> * can filter the URLs by score
> * normalisation is optional
> * IP resolution is done ONLY on the entries which have been selected for  
> fetching (during the partitioning). Running the IP resolution on the whole 
> crawlDb is too slow to be usable on a large scale
> * can max the number of URLs per host or domain (but not by IP)
> * can choose to partition by host, domain or IP
> Typically the same unit (e.g. domain) would be used for maxing the URLs and 
> for partitioning; however as we can't count the max number of URLs by IP 
> another unit must be chosen while partitioning by IP. 
> We found that using a filter on the score can dramatically improve the 
> performance as this reduces the amount of data being sent to the reducers.
> The MultiGenerator is called via : nutch 
> org.apache.nutch.crawl.MultiGenerator ...
> with the following options :
> MultiGenerator <crawldb> <segments_dir> [-force] [-topN N] [-numFetchers 
> numFetchers] [-adddays numDays] [-noFilter] [-noNorm] [-maxNumSegments num]
> where most parameters are similar to the default Generator - apart from : 
> -noNorm (explicit)
> -topN : max number of URLs per segment
> -maxNumSegments : the actual number of segments generated can be less than 
> the max value if, e.g., not enough URLs are available for fetching and they 
> fit in fewer segments
> Please give it a try and let me know what you think of it
> Julien Nioche
> http://www.digitalpebble.com
>  
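Partitioning by host (so that all URLs of one host land in the same fetchlist, where politeness limits can then be enforced) boils down to a stable hash of the host modulo the number of fetchlists. The class below is an illustrative sketch of that idea, not the partitioner shipped in the patch; the same scheme works for domain or IP by swapping the key.

```java
import java.net.URI;

public class HostPartitionSketch {
    /** Map a URL to one of numFetchers fetchlists by hashing its host. */
    public static int getPartition(String url, int numFetchers) {
        String host = URI.create(url).getHost().toLowerCase();
        // mask the sign bit so the modulo result is always in [0, numFetchers)
        return (host.hashCode() & Integer.MAX_VALUE) % numFetchers;
    }
}
```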




[jira] Commented: (NUTCH-762) Alternative Generator which can generate several segments in one parse of the crawlDB

2010-03-22 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12848140#action_12848140
 ] 

Julien Nioche commented on NUTCH-762:
-

The change of prefix also reflected that we now use 2 different parameters to 
specify how to count the URLs (host or domain) and the max number of URLs.  We 
can of course maintain the old parameters as well for the sake of 
compatibility, except that _generate.max.per.host.by.ip_ won't be of much use 
anymore as we don't count per IP.

Have just noticed  that 'crawl.gen.delay' is not documented in 
nutch-default.xml, and does not seem to be used outside the Generator. What is 
it supposed to be used for? 

> Alternative Generator which can generate several segments in one parse of the 
> crawlDB
> -
>
> Key: NUTCH-762
> URL: https://issues.apache.org/jira/browse/NUTCH-762
> Project: Nutch
>  Issue Type: New Feature
>  Components: generator
>    Affects Versions: 1.0.0
>        Reporter: Julien Nioche
>Assignee: Julien Nioche
> Fix For: 1.1
>
> Attachments: NUTCH-762-v2.patch, NUTCH-762-v3.patch
>
>
> When using Nutch on a large scale (e.g. billions of URLs), the operations 
> related to the crawlDB (generate - update) tend to take the biggest part of 
> the time. One solution is to limit such operations to a minimum by generating 
> several fetchlists in one parse of the crawlDB then update the Db only once 
> on several segments. The existing Generator allows several successive runs by 
> generating a copy of the crawlDB and marking the URLs to be fetched. In 
> practice this approach does not work well as we need to read the whole 
> crawlDB as many times as we generate a segment.
> The patch attached contains an implementation of a MultiGenerator  which can 
> generate several fetchlists by reading the crawlDB only once. The 
> MultiGenerator differs from the Generator in other aspects: 
> * can filter the URLs by score
> * normalisation is optional
> * IP resolution is done ONLY on the entries which have been selected for  
> fetching (during the partitioning). Running the IP resolution on the whole 
> crawlDb is too slow to be usable on a large scale
> * can max the number of URLs per host or domain (but not by IP)
> * can choose to partition by host, domain or IP
> Typically the same unit (e.g. domain) would be used for maxing the URLs and 
> for partitioning; however as we can't count the max number of URLs by IP 
> another unit must be chosen while partitioning by IP. 
> We found that using a filter on the score can dramatically improve the 
> performance as this reduces the amount of data being sent to the reducers.
> The MultiGenerator is called via : nutch 
> org.apache.nutch.crawl.MultiGenerator ...
> with the following options :
> MultiGenerator <crawldb> <segments_dir> [-force] [-topN N] [-numFetchers 
> numFetchers] [-adddays numDays] [-noFilter] [-noNorm] [-maxNumSegments num]
> where most parameters are similar to the default Generator - apart from : 
> -noNorm (explicit)
> -topN : max number of URLs per segment
> -maxNumSegments : the actual number of segments generated can be less than 
> the max value if, e.g., not enough URLs are available for fetching and they 
> fit in fewer segments
> Please give it a try and let me know what you think of it
> Julien Nioche
> http://www.digitalpebble.com
>  




[jira] Commented: (NUTCH-762) Alternative Generator which can generate several segments in one parse of the crawlDB

2010-03-22 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12848095#action_12848095
 ] 

Julien Nioche commented on NUTCH-762:
-

{quote}
I just noticed that the new Generator uses different config property names 
("generator." vs. "generate."), and the older versions are now marked with 
"(Deprecated)". However, this doesn't reflect the reality - properties with old 
names are simply ignored now, whereas "deprecated" implies that they should 
still work
{quote}

They will still work if we keep the old Generator as OldGenerator - which is 
what we assume in the patch. If we decide to get shot of the OldGenerator then 
yes, they should not be marked  with "(Deprecated)"

{quote}
For back-compat reason I think they should still work - the current (admittedly 
awkward) prefix is good enough, and I think that changing it in a minor release 
would create confusion. I suggest reverting to the old names where appropriate, 
and add new properties with the same prefix, i.e. "generate.".
{quote}

the original assumption was that we'd keep both this version of the generator 
and the old one in which case we could have used a different prefix for the 
properties. If we want to *replace* the old generator altogether - which I 
think would be a good option - then indeed we should discuss whether or not to 
align on the old prefix. 

I don't have strong feelings on whether or not to modify the prefix in a minor 
release.  




> Alternative Generator which can generate several segments in one parse of the 
> crawlDB
> -
>
> Key: NUTCH-762
> URL: https://issues.apache.org/jira/browse/NUTCH-762
> Project: Nutch
>  Issue Type: New Feature
>  Components: generator
>Affects Versions: 1.0.0
>Reporter: Julien Nioche
>Assignee: Julien Nioche
> Fix For: 1.1
>
> Attachments: NUTCH-762-v2.patch, NUTCH-762-v3.patch
>
>
> When using Nutch on a large scale (e.g. billions of URLs), the operations 
> related to the crawlDB (generate - update) tend to take the biggest part of 
> the time. One solution is to limit such operations to a minimum by generating 
> several fetchlists in one parse of the crawlDB then update the Db only once 
> on several segments. The existing Generator allows several successive runs by 
> generating a copy of the crawlDB and marking the URLs to be fetched. In 
> practice this approach does not work well as we need to read the whole 
> crawlDB as many times as we generate a segment.
> The patch attached contains an implementation of a MultiGenerator  which can 
> generate several fetchlists by reading the crawlDB only once. The 
> MultiGenerator differs from the Generator in other aspects: 
> * can filter the URLs by score
> * normalisation is optional
> * IP resolution is done ONLY on the entries which have been selected for  
> fetching (during the partitioning). Running the IP resolution on the whole 
> crawlDb is too slow to be usable on a large scale
> * can max the number of URLs per host or domain (but not by IP)
> * can choose to partition by host, domain or IP
> Typically the same unit (e.g. domain) would be used for maxing the URLs and 
> for partitioning; however as we can't count the max number of URLs by IP 
> another unit must be chosen while partitioning by IP. 
> We found that using a filter on the score can dramatically improve the 
> performance as this reduces the amount of data being sent to the reducers.
> The MultiGenerator is called via : nutch 
> org.apache.nutch.crawl.MultiGenerator ...
> with the following options :
> MultiGenerator <crawldb> <segments_dir> [-force] [-topN N] [-numFetchers 
> numFetchers] [-adddays numDays] [-noFilter] [-noNorm] [-maxNumSegments num]
> where most parameters are similar to the default Generator - apart from : 
> -noNorm (explicit)
> -topN : max number of URLs per segment
> -maxNumSegments : the actual number of segments generated can be less than 
> the max value if, e.g., not enough URLs are available for fetching and they 
> fit in fewer segments
> Please give it a try and let me know what you think of it
> Julien Nioche
> http://www.digitalpebble.com
>  




[jira] Updated: (NUTCH-762) Alternative Generator which can generate several segments in one parse of the crawlDB

2010-03-22 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-762:


Attachment: NUTCH-762-v3.patch

new patch which reintroduces the 'generator.update.crawldb' functionality 

> Alternative Generator which can generate several segments in one parse of the 
> crawlDB
> -
>
> Key: NUTCH-762
> URL: https://issues.apache.org/jira/browse/NUTCH-762
> Project: Nutch
>  Issue Type: New Feature
>  Components: generator
>Affects Versions: 1.0.0
>Reporter: Julien Nioche
>Assignee: Julien Nioche
> Fix For: 1.1
>
> Attachments: NUTCH-762-v2.patch, NUTCH-762-v3.patch
>
>
> When using Nutch on a large scale (e.g. billions of URLs), the operations 
> related to the crawlDB (generate - update) tend to take the biggest part of 
> the time. One solution is to limit such operations to a minimum by generating 
> several fetchlists in one parse of the crawlDB then update the Db only once 
> on several segments. The existing Generator allows several successive runs by 
> generating a copy of the crawlDB and marking the URLs to be fetched. In 
> practice this approach does not work well as we need to read the whole 
> crawlDB as many times as we generate a segment.
> The patch attached contains an implementation of a MultiGenerator  which can 
> generate several fetchlists by reading the crawlDB only once. The 
> MultiGenerator differs from the Generator in other aspects: 
> * can filter the URLs by score
> * normalisation is optional
> * IP resolution is done ONLY on the entries which have been selected for  
> fetching (during the partitioning). Running the IP resolution on the whole 
> crawlDb is too slow to be usable on a large scale
> * can max the number of URLs per host or domain (but not by IP)
> * can choose to partition by host, domain or IP
> Typically the same unit (e.g. domain) would be used for maxing the URLs and 
> for partitioning; however as we can't count the max number of URLs by IP 
> another unit must be chosen while partitioning by IP. 
> We found that using a filter on the score can dramatically improve the 
> performance as this reduces the amount of data being sent to the reducers.
> The MultiGenerator is called via : nutch 
> org.apache.nutch.crawl.MultiGenerator ...
> with the following options :
> MultiGenerator   [-force] [-topN N] [-numFetchers 
> numFetchers] [-adddays numDays] [-noFilter] [-noNorm] [-maxNumSegments num]
> where most parameters are similar to the default Generator - apart from : 
> -noNorm (explicit)
> -topN : max number of URLs per segment
> -maxNumSegments : the actual number of segments generated could be less than 
> the max value selected, e.g. if not enough URLs are available for fetching and they 
> fit in fewer segments
> Please give it a try and let me know what you think of it
> Julien Nioche
> http://www.digitalpebble.com
>  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-762) Alternative Generator which can generate several segments in one parse of the crawlDB

2010-03-22 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-762:


Fix Version/s: 1.1

> Alternative Generator which can generate several segments in one parse of the 
> crawlDB
> -
>
> Key: NUTCH-762
> URL: https://issues.apache.org/jira/browse/NUTCH-762
> Project: Nutch
>  Issue Type: New Feature
>  Components: generator
>Affects Versions: 1.0.0
>    Reporter: Julien Nioche
>Assignee: Julien Nioche
> Fix For: 1.1
>
> Attachments: NUTCH-762-v2.patch
>
>
> When using Nutch on a large scale (e.g. billions of URLs), the operations 
> related to the crawlDB (generate - update) tend to take the biggest part of 
> the time. One solution is to limit such operations to a minimum by generating 
> several fetchlists in one parse of the crawlDB then update the Db only once 
> on several segments. The existing Generator allows several successive runs by 
> generating a copy of the crawlDB and marking the URLs to be fetched. In 
> practice this approach does not work well as we need to read the whole 
> crawlDB as many times as we generate a segment.
> The patch attached contains an implementation of a MultiGenerator  which can 
> generate several fetchlists by reading the crawlDB only once. The 
> MultiGenerator differs from the Generator in other aspects: 
> * can filter the URLs by score
> * normalisation is optional
> * IP resolution is done ONLY on the entries which have been selected for  
> fetching (during the partitioning). Running the IP resolution on the whole 
> crawlDb is too slow to be usable on a large scale
> * can max the number of URLs per host or domain (but not by IP)
> * can choose to partition by host, domain or IP
> Typically the same unit (e.g. domain) would be used for maxing the URLs and 
> for partitioning; however as we can't count the max number of URLs by IP 
> another unit must be chosen while partitioning by IP. 
> We found that using a filter on the score can dramatically improve the 
> performance as this reduces the amount of data being sent to the reducers.
> The MultiGenerator is called via : nutch 
> org.apache.nutch.crawl.MultiGenerator ...
> with the following options :
> MultiGenerator   [-force] [-topN N] [-numFetchers 
> numFetchers] [-adddays numDays] [-noFilter] [-noNorm] [-maxNumSegments num]
> where most parameters are similar to the default Generator - apart from : 
> -noNorm (explicit)
> -topN : max number of URLs per segment
> -maxNumSegments : the actual number of segments generated could be less than 
> the max value selected, e.g. if not enough URLs are available for fetching and they 
> fit in fewer segments
> Please give it a try and let me know what you think of it
> Julien Nioche
> http://www.digitalpebble.com
>  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Closed: (NUTCH-740) Configuration option to override default language for fetched pages.

2010-03-22 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche closed NUTCH-740.
---

Resolution: Fixed
  Assignee: Julien Nioche

Committed in rev 926003
Thanks Marcin for contributing this patch

> Configuration option to override default language for fetched pages.
> 
>
> Key: NUTCH-740
> URL: https://issues.apache.org/jira/browse/NUTCH-740
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Affects Versions: 1.0.0
>Reporter: Marcin Okraszewski
>    Assignee: Julien Nioche
>Priority: Minor
> Fix For: 1.1
>
> Attachments: AcceptLanguage.patch, 
> AcceptLanguage_trunk_2009-06-09.patch, NUTCH-740.patch
>
>
> By default "Accept-Language" HTTP request header is set to English. 
> Unfortunately this value is hard coded and seems there is no way to override 
> it. As a result you may index English version of pages even though you would 
> prefer it in different language. 
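The idea can be sketched in a few lines: read the header value from configuration with the English default as a fallback, instead of hard-coding it. This is a toy illustration only; the property name `http.accept.language` and the default value below are assumptions, not necessarily what the committed patch uses.

```java
import java.util.Properties;

public class AcceptLanguageDemo {
    // Hypothetical sketch: resolve the Accept-Language header value from a
    // configuration object, falling back to an English default when unset.
    static String acceptLanguage(Properties conf) {
        return conf.getProperty("http.accept.language", "en-us,en-gb,en;q=0.7,*;q=0.3");
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        System.out.println(acceptLanguage(conf));            // falls back to the default
        conf.setProperty("http.accept.language", "de-de,de;q=0.8");
        System.out.println(acceptLanguage(conf));            // overridden value wins
    }
}
```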

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-740) Configuration option to override default language for fetched pages.

2010-03-19 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-740:


Attachment: NUTCH-740.patch

Slightly modified version of the patch with modifs for protocol-http.
will commit shortly

> Configuration option to override default language for fetched pages.
> 
>
> Key: NUTCH-740
> URL: https://issues.apache.org/jira/browse/NUTCH-740
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Affects Versions: 1.0.0
>Reporter: Marcin Okraszewski
>Priority: Minor
> Fix For: 1.1
>
> Attachments: AcceptLanguage.patch, 
> AcceptLanguage_trunk_2009-06-09.patch, NUTCH-740.patch
>
>
> By default "Accept-Language" HTTP request header is set to English. 
> Unfortunately this value is hard coded and seems there is no way to override 
> it. As a result you may index English version of pages even though you would 
> prefer it in different language. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-762) Alternative Generator which can generate several segments in one parse of the crawlDB

2010-03-18 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846930#action_12846930
 ] 

Julien Nioche commented on NUTCH-762:
-

Yes, I came across that situation too on a large crawl where a single machine 
was used to host a whole range of unrelated domain names (needless to say the 
host of the domains was not very pleased). We can now handle such cases simply 
by partitioning by IP (and counting by domain).

I will have a look at reintroducing *generate.update.crawldb* tomorrow.



 

> Alternative Generator which can generate several segments in one parse of the 
> crawlDB
> -
>
> Key: NUTCH-762
> URL: https://issues.apache.org/jira/browse/NUTCH-762
> Project: Nutch
>  Issue Type: New Feature
>  Components: generator
>Affects Versions: 1.0.0
>Reporter: Julien Nioche
>Assignee: Julien Nioche
> Attachments: NUTCH-762-v2.patch
>
>
> When using Nutch on a large scale (e.g. billions of URLs), the operations 
> related to the crawlDB (generate - update) tend to take the biggest part of 
> the time. One solution is to limit such operations to a minimum by generating 
> several fetchlists in one parse of the crawlDB then update the Db only once 
> on several segments. The existing Generator allows several successive runs by 
> generating a copy of the crawlDB and marking the URLs to be fetched. In 
> practice this approach does not work well as we need to read the whole 
> crawlDB as many times as we generate a segment.
> The patch attached contains an implementation of a MultiGenerator  which can 
> generate several fetchlists by reading the crawlDB only once. The 
> MultiGenerator differs from the Generator in other aspects: 
> * can filter the URLs by score
> * normalisation is optional
> * IP resolution is done ONLY on the entries which have been selected for  
> fetching (during the partitioning). Running the IP resolution on the whole 
> crawlDb is too slow to be usable on a large scale
> * can max the number of URLs per host or domain (but not by IP)
> * can choose to partition by host, domain or IP
> Typically the same unit (e.g. domain) would be used for maxing the URLs and 
> for partitioning; however as we can't count the max number of URLs by IP 
> another unit must be chosen while partitioning by IP. 
> We found that using a filter on the score can dramatically improve the 
> performance as this reduces the amount of data being sent to the reducers.
> The MultiGenerator is called via : nutch 
> org.apache.nutch.crawl.MultiGenerator ...
> with the following options :
> MultiGenerator   [-force] [-topN N] [-numFetchers 
> numFetchers] [-adddays numDays] [-noFilter] [-noNorm] [-maxNumSegments num]
> where most parameters are similar to the default Generator - apart from : 
> -noNorm (explicit)
> -topN : max number of URLs per segment
> -maxNumSegments : the actual number of segments generated could be less than 
> the max value selected, e.g. if not enough URLs are available for fetching and they 
> fit in fewer segments
> Please give it a try and let me know what you think of it
> Julien Nioche
> http://www.digitalpebble.com
>  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-762) Alternative Generator which can generate several segments in one parse of the crawlDB

2010-03-18 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846910#action_12846910
 ] 

Julien Nioche commented on NUTCH-762:
-

OK, there was indeed an assumption that the generator would not need to be 
called again before an update. I am happy to add back generate.update.crawldb. 

Note that this version of the Generator also differs from the original version 
in that 

{quote}
*IP resolution is done ONLY on the entries which have been selected for 
fetching (during the partitioning). Running the IP resolution on the whole 
crawlDb is too slow to be usable on a large scale
*can max the number of URLs per host or domain (but not by IP)
{quote}

We could allow more flexibility by counting per IP, again at the expense of 
performance. Not sure it is very useful in practice though. Since the way we 
count the URLs is now decoupled from the way we partition them, we can have a 
hybrid approach, e.g. count per domain THEN partition by IP. 

Any thoughts on whether or not we should reintroduce the counting per IP?
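As a toy illustration of decoupling the counting unit from the partitioning unit: cap URLs per *domain*, but route each selected URL to a reducer partition derived from its *IP*. Everything below (class name, the pre-resolved IP strings, the data shape) is hypothetical scaffolding, not code from the patch.

```java
import java.util.*;

public class CountThenPartitionDemo {
    // Count per domain, partition per IP. Each entry is {url, domain, ip};
    // in the real Generator the IP would be resolved only for selected entries.
    static Map<Integer, List<String>> generate(List<String[]> entries,
                                               int maxPerDomain, int numPartitions) {
        Map<String, Integer> perDomain = new HashMap<>();
        Map<Integer, List<String>> partitions = new HashMap<>();
        for (String[] e : entries) {
            String url = e[0], domain = e[1], ip = e[2];
            int seen = perDomain.getOrDefault(domain, 0);
            if (seen >= maxPerDomain) continue;        // counting unit: domain
            perDomain.put(domain, seen + 1);
            int part = Math.abs(ip.hashCode()) % numPartitions; // partition unit: IP
            partitions.computeIfAbsent(part, k -> new ArrayList<>()).add(url);
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<String[]> entries = Arrays.asList(
            new String[]{"http://a.com/1", "a.com", "10.0.0.1"},
            new String[]{"http://a.com/2", "a.com", "10.0.0.1"},
            new String[]{"http://a.com/3", "a.com", "10.0.0.1"},
            new String[]{"http://b.org/1", "b.org", "10.0.0.2"});
        // With maxPerDomain=2, the third a.com URL is dropped by the domain count
        // even though the partitions themselves are keyed by IP.
        System.out.println(generate(entries, 2, 4));
    }
}
```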

> Alternative Generator which can generate several segments in one parse of the 
> crawlDB
> -
>
> Key: NUTCH-762
> URL: https://issues.apache.org/jira/browse/NUTCH-762
> Project: Nutch
>  Issue Type: New Feature
>  Components: generator
>Affects Versions: 1.0.0
>Reporter: Julien Nioche
>Assignee: Julien Nioche
> Attachments: NUTCH-762-v2.patch
>
>
> When using Nutch on a large scale (e.g. billions of URLs), the operations 
> related to the crawlDB (generate - update) tend to take the biggest part of 
> the time. One solution is to limit such operations to a minimum by generating 
> several fetchlists in one parse of the crawlDB then update the Db only once 
> on several segments. The existing Generator allows several successive runs by 
> generating a copy of the crawlDB and marking the URLs to be fetched. In 
> practice this approach does not work well as we need to read the whole 
> crawlDB as many times as we generate a segment.
> The patch attached contains an implementation of a MultiGenerator  which can 
> generate several fetchlists by reading the crawlDB only once. The 
> MultiGenerator differs from the Generator in other aspects: 
> * can filter the URLs by score
> * normalisation is optional
> * IP resolution is done ONLY on the entries which have been selected for  
> fetching (during the partitioning). Running the IP resolution on the whole 
> crawlDb is too slow to be usable on a large scale
> * can max the number of URLs per host or domain (but not by IP)
> * can choose to partition by host, domain or IP
> Typically the same unit (e.g. domain) would be used for maxing the URLs and 
> for partitioning; however as we can't count the max number of URLs by IP 
> another unit must be chosen while partitioning by IP. 
> We found that using a filter on the score can dramatically improve the 
> performance as this reduces the amount of data being sent to the reducers.
> The MultiGenerator is called via : nutch 
> org.apache.nutch.crawl.MultiGenerator ...
> with the following options :
> MultiGenerator   [-force] [-topN N] [-numFetchers 
> numFetchers] [-adddays numDays] [-noFilter] [-noNorm] [-maxNumSegments num]
> where most parameters are similar to the default Generator - apart from : 
> -noNorm (explicit)
> -topN : max number of URLs per segment
> -maxNumSegments : the actual number of segments generated could be less than 
> the max value selected, e.g. if not enough URLs are available for fetching and they 
> fit in fewer segments
> Please give it a try and let me know what you think of it
> Julien Nioche
> http://www.digitalpebble.com
>  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-762) Alternative Generator which can generate several segments in one parse of the crawlDB

2010-03-16 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846141#action_12846141
 ] 

Julien Nioche commented on NUTCH-762:
-

If I am not mistaken the point of having  _generate.update.crawldb_ was to 
mark the URLs put in a fetchlist in order to be able to do another round of 
generation. This is not necessary now as we can generate several segments 
without writing a new crawldb.
Am I missing something?  

> Alternative Generator which can generate several segments in one parse of the 
> crawlDB
> -
>
> Key: NUTCH-762
> URL: https://issues.apache.org/jira/browse/NUTCH-762
> Project: Nutch
>  Issue Type: New Feature
>  Components: generator
>Affects Versions: 1.0.0
>Reporter: Julien Nioche
>Assignee: Julien Nioche
> Attachments: NUTCH-762-v2.patch
>
>
> When using Nutch on a large scale (e.g. billions of URLs), the operations 
> related to the crawlDB (generate - update) tend to take the biggest part of 
> the time. One solution is to limit such operations to a minimum by generating 
> several fetchlists in one parse of the crawlDB then update the Db only once 
> on several segments. The existing Generator allows several successive runs by 
> generating a copy of the crawlDB and marking the URLs to be fetched. In 
> practice this approach does not work well as we need to read the whole 
> crawlDB as many times as we generate a segment.
> The patch attached contains an implementation of a MultiGenerator  which can 
> generate several fetchlists by reading the crawlDB only once. The 
> MultiGenerator differs from the Generator in other aspects: 
> * can filter the URLs by score
> * normalisation is optional
> * IP resolution is done ONLY on the entries which have been selected for  
> fetching (during the partitioning). Running the IP resolution on the whole 
> crawlDb is too slow to be usable on a large scale
> * can max the number of URLs per host or domain (but not by IP)
> * can choose to partition by host, domain or IP
> Typically the same unit (e.g. domain) would be used for maxing the URLs and 
> for partitioning; however as we can't count the max number of URLs by IP 
> another unit must be chosen while partitioning by IP. 
> We found that using a filter on the score can dramatically improve the 
> performance as this reduces the amount of data being sent to the reducers.
> The MultiGenerator is called via : nutch 
> org.apache.nutch.crawl.MultiGenerator ...
> with the following options :
> MultiGenerator   [-force] [-topN N] [-numFetchers 
> numFetchers] [-adddays numDays] [-noFilter] [-noNorm] [-maxNumSegments num]
> where most parameters are similar to the default Generator - apart from : 
> -noNorm (explicit)
> -topN : max number of URLs per segment
> -maxNumSegments : the actual number of segments generated could be less than 
> the max value selected, e.g. if not enough URLs are available for fetching and they 
> fit in fewer segments
> Please give it a try and let me know what you think of it
> Julien Nioche
> http://www.digitalpebble.com
>  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-740) Configuration option to override default language for fetched pages.

2010-03-16 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845886#action_12845886
 ] 

Julien Nioche commented on NUTCH-740:
-

A nice contribution but should not this be applied to the *protocol-http* 
plugin as well e.g. in HttpResponse?

> Configuration option to override default language for fetched pages.
> 
>
> Key: NUTCH-740
> URL: https://issues.apache.org/jira/browse/NUTCH-740
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Affects Versions: 1.0.0
>Reporter: Marcin Okraszewski
>Assignee: Otis Gospodnetic
>Priority: Minor
> Fix For: 1.1
>
> Attachments: AcceptLanguage.patch, 
> AcceptLanguage_trunk_2009-06-09.patch
>
>
> By default "Accept-Language" HTTP request header is set to English. 
> Unfortunately this value is hard coded and seems there is no way to override 
> it. As a result you may index English version of pages even though you would 
> prefer it in different language. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-469) changes to geoPosition plugin to make it work on nutch 0.9

2010-03-16 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-469:


Fix Version/s: (was: 1.1)

There have been no changes to this issue since February 2009 and it won't be 
included in 1.1
Marking it as 'fix version : unknown' 

> changes to geoPosition plugin to make it work on nutch 0.9
> --
>
> Key: NUTCH-469
> URL: https://issues.apache.org/jira/browse/NUTCH-469
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer, searcher
>Affects Versions: 0.9.0
>Reporter: Mike Schwartz
> Attachments: geoPosition-0.5.tgz, geoPosition0.6_cdiff.zip, 
> NUTCH-469-2007-05-09.txt.gz
>
>
> I have modified the geoPosition plugin 
> (http://wiki.apache.org/nutch/GeoPosition) code to work with nutch 0.9.  (The 
> code was built originally using nutch 0.7.)  I'd like to contribute my 
> changes back to the nutch project.  I already communicated with the code's 
> author (Matthias Jaekle), and he agrees with my mods.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (NUTCH-692) AlreadyBeingCreatedException with Hadoop 0.19

2010-03-15 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche resolved NUTCH-692.
-

   Resolution: Cannot Reproduce
Fix Version/s: 1.1

I cannot reproduce the issue since we moved to Hadoop 0.20, which is good 
news

> AlreadyBeingCreatedException with Hadoop 0.19
> -
>
> Key: NUTCH-692
> URL: https://issues.apache.org/jira/browse/NUTCH-692
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.0.0
>    Reporter: Julien Nioche
>    Assignee: Julien Nioche
> Fix For: 1.1
>
> Attachments: NUTCH-692.patch
>
>
> I have been using the SVN version of Nutch on an EC2 cluster and got some 
> AlreadyBeingCreatedException during the reduce phase of a parse. For some 
> reason one of my tasks crashed and then I ran into this 
> AlreadyBeingCreatedException when other nodes tried to pick it up.
> There was recently a discussion on the Hadoop user list on similar issues 
> with Hadoop 0.19 (see 
> http://markmail.org/search/after+upgrade+to+0%2E19%2E0). I have not tried 
> using 0.18.2 yet but will do if the problems persist with 0.19
> I was wondering whether anyone else had experienced the same problem. Do you 
> think 0.19 is stable enough to use it for Nutch 1.0?
> I will be running a crawl on a super large cluster in the next couple of 
> weeks and I will confirm this issue  
> J.  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-710) Support for rel="canonical" attribute

2010-03-15 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-710:


Fix Version/s: (was: 1.1)

Great idea. It won't be included in 1.1 though, so moving to *fix : unknown*


> Support for rel="canonical" attribute
> -
>
> Key: NUTCH-710
> URL: https://issues.apache.org/jira/browse/NUTCH-710
> Project: Nutch
>  Issue Type: New Feature
>Affects Versions: 1.1
>Reporter: Frank McCown
>Priority: Minor
>
> There is a the new rel="canonical" attribute which is
> now being supported by Google, Yahoo, and Live:
> http://googlewebmastercentral.blogspot.com/2009/02/specify-your-canonical.html
> Adding support for this attribute value will potentially reduce the number of 
> URLs crawled and indexed and reduce duplicate page content.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (NUTCH-801) Remove RTF and MP3 parse plugins

2010-03-11 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche resolved NUTCH-801.
-

Resolution: Fixed

Committed revision 921840.


> Remove RTF and MP3 parse plugins
> 
>
> Key: NUTCH-801
> URL: https://issues.apache.org/jira/browse/NUTCH-801
> Project: Nutch
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.0.0
>    Reporter: Julien Nioche
> Fix For: 1.1
>
>
> *Parse-rtf* and *parse-mp3* are not built by default  due to licensing 
> issues. Since we now have *parse-tika* to handle these formats I would be in 
> favour of removing these 2 plugins altogether to keep things nice and simple. 
> The other plugins will probably be phased out only after the release of 1.1  
> when parse-tika will have been tested a lot more.
> Any reasons not to?
> Julien

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (NUTCH-798) Upgrade to SOLR1.4

2010-03-11 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche resolved NUTCH-798.
-

Resolution: Fixed

Updated SOLRJ's dependencies at the same time : 

Deleting   lib/apache-solr-common-1.3.0.jar
Adding  (bin)  lib/apache-solr-core-1.4.0.jar
Deleting   lib/apache-solr-solrj-1.3.0.jar
Adding  (bin)  lib/apache-solr-solrj-1.4.0.jar
Deleting   lib/commons-httpclient-3.0.1.jar
Adding  (bin)  lib/commons-httpclient-3.1.jar
Adding  (bin)  lib/commons-io-1.4.jar
Adding  (bin)  lib/geronimo-stax-api_1.0_spec-1.0.1.jar
Adding  (bin)  lib/jcl-over-slf4j-1.5.5.jar
Deleting   lib/slf4j-api-1.4.3.jar
Adding  (bin)  lib/slf4j-api-1.5.5.jar
Adding  (bin)  lib/wstx-asl-3.2.7.jar

Committed revision 921831

> Upgrade to SOLR1.4
> --
>
> Key: NUTCH-798
> URL: https://issues.apache.org/jira/browse/NUTCH-798
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>    Reporter: Julien Nioche
> Fix For: 1.1
>
>
> in particular SOLR1.4 has a StreamingUpdateSolrServer which would simplify 
> the way we buffer the docs before sending them to the SOLR instance 
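To see what StreamingUpdateSolrServer would make unnecessary, here is a toy sketch of the manual batching pattern it replaces: collect documents in a buffer and flush a batch once it fills up. There are no real SolrJ calls here; the class, the `String` stand-in for a document, and the batch size are all illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;

public class BufferedIndexerDemo {
    // Hypothetical sketch of manual document buffering: sentBatches records
    // what a real indexer would hand to the SOLR server on each flush.
    static final int BATCH_SIZE = 3;
    final List<String> buffer = new ArrayList<>();
    final List<List<String>> sentBatches = new ArrayList<>();

    void add(String doc) {
        buffer.add(doc);
        if (buffer.size() >= BATCH_SIZE) flush();
    }

    void flush() {
        if (buffer.isEmpty()) return;
        sentBatches.add(new ArrayList<>(buffer)); // stand-in for sending the batch
        buffer.clear();
    }

    public static void main(String[] args) {
        BufferedIndexerDemo idx = new BufferedIndexerDemo();
        for (int i = 1; i <= 7; i++) idx.add("doc" + i);
        idx.flush(); // final partial batch of one document
        System.out.println(idx.sentBatches.size() + " batches sent");
    }
}
```

A streaming client handles this buffering and background sending internally, which is the simplification the issue is after.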

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (NUTCH-801) Remove RTF and MP3 parse plugins

2010-03-10 Thread Julien Nioche (JIRA)
Remove RTF and MP3 parse plugins


 Key: NUTCH-801
 URL: https://issues.apache.org/jira/browse/NUTCH-801
 Project: Nutch
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.0.0
Reporter: Julien Nioche
 Fix For: 1.1


*Parse-rtf* and *parse-mp3* are not built by default  due to licensing issues. 
Since we now have *parse-tika* to handle these formats I would be in favour of 
removing these 2 plugins altogether to keep things nice and simple. The other 
plugins will probably be phased out only after the release of 1.1  when 
parse-tika will have been tested a lot more.

Any reasons not to?

Julien



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: 1.1 release?

2010-03-09 Thread Julien Nioche
Hi Chris,

Excellent idea! There have been quite a few changes since 1.0 and it's
probably the right time to have a new release.
Not really a blocker, but https://issues.apache.org/jira/browse/NUTCH-762
would be nice to have in 1.1; it just needs a bit of reviewing / testing I
suppose. Otherwise this can wait until after 1.1

Julien

-- 
DigitalPebble Ltd
http://www.digitalpebble.com

On 9 March 2010 17:09, Mattmann, Chris A (388J) <
chris.a.mattm...@jpl.nasa.gov> wrote:

> Hey Guys,
>
> I have some extra time this weekend and early next week. Want me to be the
> RM and push out a 1.1 release? Any blockers? I'm happy to do it just let me
> know.
>
> Cheers,
> Chris
>
>
> ++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: chris.mattm...@jpl.nasa.gov
> WWW:   http://sunset.usc.edu/~mattmann/
> ++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++
>
>
>


[jira] Updated: (NUTCH-762) Alternative Generator which can generate several segments in one parse of the crawlDB

2010-03-06 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-762:


Attachment: NUTCH-762-v2.patch

Improved version of the patch : 

- fixed a few minor bugs
- renamed Generator into OldGenerator
- renamed MultiGenerator into Generator
- fixed test classes to use new Generator
- documented parameters in nutch-default.xml
- added names of segments to the LOG to facilitate integration in scripts
- replaced PartitionUrlByHost with the more generic URLPartitioner

I decided to keep the old version for the time being but we might as well get 
rid of it altogether. The new version is now used in the Crawl class. 

Would be nice if people could give it a good try before we put it in 1.1

Thanks

Julien 

> Alternative Generator which can generate several segments in one parse of the 
> crawlDB
> -
>
> Key: NUTCH-762
> URL: https://issues.apache.org/jira/browse/NUTCH-762
> Project: Nutch
>  Issue Type: New Feature
>  Components: generator
>Affects Versions: 1.0.0
>    Reporter: Julien Nioche
>Assignee: Julien Nioche
> Attachments: NUTCH-762-v2.patch
>
>
> When using Nutch on a large scale (e.g. billions of URLs), the operations 
> related to the crawlDB (generate - update) tend to take the biggest part of 
> the time. One solution is to limit such operations to a minimum by generating 
> several fetchlists in one parse of the crawlDB then update the Db only once 
> on several segments. The existing Generator allows several successive runs by 
> generating a copy of the crawlDB and marking the URLs to be fetched. In 
> practice this approach does not work well as we need to read the whole 
> crawlDB as many times as we generate a segment.
> The patch attached contains an implementation of a MultiGenerator  which can 
> generate several fetchlists by reading the crawlDB only once. The 
> MultiGenerator differs from the Generator in other aspects: 
> * can filter the URLs by score
> * normalisation is optional
> * IP resolution is done ONLY on the entries which have been selected for  
> fetching (during the partitioning). Running the IP resolution on the whole 
> crawlDb is too slow to be usable on a large scale
> * can max the number of URLs per host or domain (but not by IP)
> * can choose to partition by host, domain or IP
> Typically the same unit (e.g. domain) would be used for maxing the URLs and 
> for partitioning; however as we can't count the max number of URLs by IP 
> another unit must be chosen while partitioning by IP. 
> We found that using a filter on the score can dramatically improve the 
> performance as this reduces the amount of data being sent to the reducers.
> The MultiGenerator is called via : nutch 
> org.apache.nutch.crawl.MultiGenerator ...
> with the following options :
> MultiGenerator   [-force] [-topN N] [-numFetchers 
> numFetchers] [-adddays numDays] [-noFilter] [-noNorm] [-maxNumSegments num]
> where most parameters are similar to the default Generator - apart from : 
> -noNorm (explicit)
> -topN : max number of URLs per segment
> -maxNumSegments : the actual number of segments generated could be less than 
> the max value selected, e.g. if not enough URLs are available for fetching and they 
> fit in fewer segments
> Please give it a try and let me know what you think of it
> Julien Nioche
> http://www.digitalpebble.com
>  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-762) Alternative Generator which can generate several segments in one parse of the crawlDB

2010-03-06 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-762:


Attachment: (was: NUTCH-762-MultiGenerator.patch)

> Alternative Generator which can generate several segments in one parse of the 
> crawlDB
> -
>
> Key: NUTCH-762
> URL: https://issues.apache.org/jira/browse/NUTCH-762
> Project: Nutch
>  Issue Type: New Feature
>  Components: generator
>Affects Versions: 1.0.0
>    Reporter: Julien Nioche
>Assignee: Julien Nioche
> Attachments: NUTCH-762-v2.patch
>
>
> When using Nutch on a large scale (e.g. billions of URLs), the operations 
> related to the crawlDB (generate - update) tend to take the biggest part of 
> the time. One solution is to limit such operations to a minimum by generating 
> several fetchlists in one parse of the crawlDB then update the Db only once 
> on several segments. The existing Generator allows several successive runs by 
> generating a copy of the crawlDB and marking the URLs to be fetched. In 
> practice this approach does not work well as we need to read the whole 
> crawlDB as many times as we generate a segment.
> The patch attached contains an implementation of a MultiGenerator  which can 
> generate several fetchlists by reading the crawlDB only once. The 
> MultiGenerator differs from the Generator in other aspects: 
> * can filter the URLs by score
> * normalisation is optional
> * IP resolution is done ONLY on the entries which have been selected for  
> fetching (during the partitioning). Running the IP resolution on the whole 
> crawlDb is too slow to be usable on a large scale
> * can max the number of URLs per host or domain (but not by IP)
> * can choose to partition by host, domain or IP
> Typically the same unit (e.g. domain) would be used for maxing the URLs and 
> for partitioning; however as we can't count the max number of URLs by IP 
> another unit must be chosen while partitioning by IP. 
> We found that using a filter on the score can dramatically improve the 
> performance as this reduces the amount of data being sent to the reducers.
> The MultiGenerator is called via : nutch 
> org.apache.nutch.crawl.MultiGenerator ...
> with the following options :
> MultiGenerator   [-force] [-topN N] [-numFetchers 
> numFetchers] [-adddays numDays] [-noFilter] [-noNorm] [-maxNumSegments num]
> where most parameters are similar to the default Generator - apart from : 
> -noNorm (explicit)
> -topN : max number of URLs per segment
> -maxNumSegments : the actual number of segments generated could be less than 
> the max value selected, e.g. if not enough URLs are available for fetching and 
> they fit in fewer segments
> Please give it a try and let me know what you think of it
> Julien Nioche
> http://www.digitalpebble.com
>  
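The per-host/domain capping described above can be sketched roughly as follows. This is an illustrative snippet only: the class and method names (DomainCapper, accept) are invented and this is not the actual MultiGenerator code.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: shows the idea of capping the number of URLs
// emitted per domain (or host) while filling a segment. Entries left out
// simply stay in the crawlDB for a later generate run.
public class DomainCapper {
    private final int maxPerDomain;
    private final Map<String, Integer> counts = new HashMap<String, Integer>();

    public DomainCapper(int maxPerDomain) {
        this.maxPerDomain = maxPerDomain;
    }

    // Returns true and counts the URL if the domain is still under quota.
    public boolean accept(String domain) {
        int seen = counts.containsKey(domain) ? counts.get(domain) : 0;
        if (seen >= maxPerDomain) {
            return false; // quota exhausted for this domain
        }
        counts.put(domain, seen + 1);
        return true;
    }
}
```

Since partitioning is done by the same unit (host or domain), all entries for one domain reach the same reducer, so a simple in-memory count per reducer is enough.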

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Closed: (NUTCH-799) SOLRIndexer to commit once all reducers have finished

2010-03-05 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche closed NUTCH-799.
---

Resolution: Fixed
  Assignee: Julien Nioche

Thanks for your feedback Andrzej

Committed revision 919358.

> SOLRIndexer to commit once all reducers have finished
> -
>
> Key: NUTCH-799
> URL: https://issues.apache.org/jira/browse/NUTCH-799
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>    Reporter: Julien Nioche
>    Assignee: Julien Nioche
> Fix For: 1.1
>
> Attachments: NUTCH-799.patch
>
>
> What about doing only one SOLR commit after the MR job has finished in 
> SOLRIndexer instead of doing that at the end of every Reducer? 
> I ran into timeout exceptions in some of my reducers and I suspect that this 
> was due to the fact that other reducers had already finished and called 
> commit. 




[jira] Closed: (NUTCH-782) Ability to order htmlparsefilters

2010-03-01 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche closed NUTCH-782.
---

Resolution: Fixed

Committed revision 917557

> Ability to order htmlparsefilters
> -
>
> Key: NUTCH-782
> URL: https://issues.apache.org/jira/browse/NUTCH-782
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>    Reporter: Julien Nioche
>    Assignee: Julien Nioche
> Fix For: 1.1
>
> Attachments: NUTCH-782.patch
>
>
> Patch which adds a new parameter 'htmlparsefilter.order' which specifies the 
> order in which HTMLParse filters are applied. HTMLParse filter ordering MAY 
> have an impact on end result, as some filters could rely on the metadata 
> generated by a previous filter.




[jira] Updated: (NUTCH-799) SOLRIndexer to commit once all reducers have finished

2010-03-01 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-799:


Attachment: NUTCH-799.patch

> SOLRIndexer to commit once all reducers have finished
> -
>
> Key: NUTCH-799
> URL: https://issues.apache.org/jira/browse/NUTCH-799
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>    Reporter: Julien Nioche
> Fix For: 1.1
>
> Attachments: NUTCH-799.patch
>
>
> What about doing only one SOLR commit after the MR job has finished in 
> SOLRIndexer instead of doing that at the end of every Reducer? 
> I ran into timeout exceptions in some of my reducers and I suspect that this 
> was due to the fact that other reducers had already finished and called 
> commit. 




[jira] Created: (NUTCH-799) SOLRIndexer to commit once all reducers have finished

2010-03-01 Thread Julien Nioche (JIRA)
SOLRIndexer to commit once all reducers have finished
-

 Key: NUTCH-799
 URL: https://issues.apache.org/jira/browse/NUTCH-799
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Reporter: Julien Nioche
 Fix For: 1.1


What about doing only one SOLR commit after the MR job has finished in 
SOLRIndexer instead of doing that at the end of every Reducer? 
I ran into timeout exceptions in some of my reducers and I suspect that this 
was due to the fact that other reducers had already finished and called commit. 
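The proposed change can be illustrated with a tiny self-contained sketch. The SolrLike interface below is an invented stand-in, not the real SolrJ or Hadoop API: reducers only add documents, and a single commit is issued by the driver once the whole job has finished.

```java
// Invented stand-in types for illustration; not the real SolrJ/Hadoop APIs.
public class CommitOnce {
    interface SolrLike {
        void add(String doc);
        void commit();
    }

    static class CountingSolr implements SolrLike {
        int commits = 0;
        public void add(String doc) { /* buffer the document */ }
        public void commit() { commits++; }
    }

    // Before the change: each of N reducers committed on close, so N commits.
    // After the change: the driver commits once after the job completes.
    static void runJob(SolrLike solr, int numReducers) {
        for (int r = 0; r < numReducers; r++) {
            solr.add("docs-from-reducer-" + r); // reducers only add/buffer
        }
        solr.commit(); // single commit issued by the driver
    }

    public static void main(String[] args) {
        CountingSolr solr = new CountingSolr();
        runJob(solr, 8);
        System.out.println(solr.commits); // prints 1
    }
}
```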




[jira] Created: (NUTCH-798) Upgrade to SOLR1.4

2010-02-26 Thread Julien Nioche (JIRA)
Upgrade to SOLR1.4
--

 Key: NUTCH-798
 URL: https://issues.apache.org/jira/browse/NUTCH-798
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Reporter: Julien Nioche
 Fix For: 1.1


in particular SOLR1.4 has a StreamingUpdateSolrServer which would simplify the 
way we buffer the docs before sending them to the SOLR instance 




[jira] Commented: (NUTCH-719) fetchQueues.totalSize incorrect in Fetcher2

2010-02-23 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837147#action_12837147
 ] 

Julien Nioche commented on NUTCH-719:
-

the other addFetchItem method of FetchItemQueues uses the synchronized one 
internally, so there is no need for synchronization there; the one in 
FetchItemQueue is not necessary either, as it uses a synchronized collection 
internally. 
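The point about synchronized collections can be shown with a minimal sketch (invented names, not the actual Fetcher2 classes): a method that merely delegates to a synchronized wrapper needs no extra locking, while a compound check-then-act sequence still does.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Minimal sketch, not the Fetcher2 code: a queue backed by a synchronized
// collection does not need 'synchronized' on simple delegating methods,
// but compound check-then-act sequences still require an explicit lock.
public class QueueSync {
    private final List<String> queue =
        Collections.synchronizedList(new ArrayList<String>());

    // Safe without 'synchronized': add() on the wrapper is already atomic.
    public void addFetchItem(String url) {
        queue.add(url);
    }

    // isEmpty() + remove() must happen atomically together, so an explicit
    // lock on the wrapper object is still needed here.
    public String getFetchItem() {
        synchronized (queue) {
            return queue.isEmpty() ? null : queue.remove(0);
        }
    }
}
```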

> fetchQueues.totalSize incorrect in Fetcher2
> ---
>
> Key: NUTCH-719
> URL: https://issues.apache.org/jira/browse/NUTCH-719
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 1.0.0
>    Reporter: Julien Nioche
>Assignee: Julien Nioche
> Fix For: 1.1
>
>
> I had a look at the logs generated by Fetcher2 and found cases where there 
> were no active fetchQueues but fetchQueues.totalSize was != 0
> fetcher.Fetcher2 - -activeThreads=200, spinWaiting=200, 
> fetchQueues.totalSize=1, fetchQueues=0
> since the code relies on fetchQueues.totalSize to determine whether the work 
> is finished or not, the task is blocked until the abort mechanism kicks in
> 2009-03-12 09:27:38,977 WARN  fetcher.Fetcher2 - Aborting with 200 hung 
> threads.
> could that be a synchronisation issue? any ideas?




[jira] Closed: (NUTCH-719) fetchQueues.totalSize incorrect in Fetcher2

2010-02-19 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche closed NUTCH-719.
---


> fetchQueues.totalSize incorrect in Fetcher2
> ---
>
> Key: NUTCH-719
> URL: https://issues.apache.org/jira/browse/NUTCH-719
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 1.0.0
>    Reporter: Julien Nioche
>    Assignee: Julien Nioche
> Fix For: 1.1
>
>
> I had a look at the logs generated by Fetcher2 and found cases where there 
> were no active fetchQueues but fetchQueues.totalSize was != 0
> fetcher.Fetcher2 - -activeThreads=200, spinWaiting=200, 
> fetchQueues.totalSize=1, fetchQueues=0
> since the code relies on fetchQueues.totalSize to determine whether the work 
> is finished or not, the task is blocked until the abort mechanism kicks in
> 2009-03-12 09:27:38,977 WARN  fetcher.Fetcher2 - Aborting with 200 hung 
> threads.
> could that be a synchronisation issue? any ideas?




[jira] Resolved: (NUTCH-719) fetchQueues.totalSize incorrect in Fetcher2

2010-02-19 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche resolved NUTCH-719.
-

   Resolution: Fixed
Fix Version/s: 1.1

Committed revision 911905.
Thanks to S. Dennis for investigating the issue + R. Schwab for testing it 

> fetchQueues.totalSize incorrect in Fetcher2
> ---
>
> Key: NUTCH-719
> URL: https://issues.apache.org/jira/browse/NUTCH-719
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 1.0.0
>    Reporter: Julien Nioche
>    Assignee: Julien Nioche
> Fix For: 1.1
>
>
> I had a look at the logs generated by Fetcher2 and found cases where there 
> were no active fetchQueues but fetchQueues.totalSize was != 0
> fetcher.Fetcher2 - -activeThreads=200, spinWaiting=200, 
> fetchQueues.totalSize=1, fetchQueues=0
> since the code relies on fetchQueues.totalSize to determine whether the work 
> is finished or not, the task is blocked until the abort mechanism kicks in
> 2009-03-12 09:27:38,977 WARN  fetcher.Fetcher2 - Aborting with 200 hung 
> threads.
> could that be a synchronisation issue? any ideas?




[jira] Resolved: (NUTCH-644) RTF parser doesn't compile anymore

2010-02-18 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche resolved NUTCH-644.
-

Resolution: Fixed

RTF parsing is now handled by the TikaPlugin (NUTCH-766) which solves the issue 
of licensing.

> RTF parser doesn't compile anymore
> --
>
> Key: NUTCH-644
> URL: https://issues.apache.org/jira/browse/NUTCH-644
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 1.0.0
>Reporter: Guillaume Smet
> Attachments: NUTCH-644_v2.patch, NUTCH-644_v3.patch, 
> RTFParseFactory.java-compilation_issues.diff
>
>
> Due to API changes, the RTF parser (which is not compiled by default due to a 
> licensing problem) doesn't compile anymore.
> The build.xml script doesn't work anymore either, as 
> http://www.cobase.cs.ucla.edu/pub/javacc/rtf_parser_src.jar doesn't exist 
> anymore (404). I didn't fix build.xml, as I don't know where we want to get 
> the jar file from; I only fixed the compilation issues.
> Regards,
> -- 
> Guillaume




[jira] Resolved: (NUTCH-705) parse-rtf plugin

2010-02-18 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche resolved NUTCH-705.
-

Resolution: Fixed

RTF parsing is now handled by the TikaPlugin (NUTCH-766). Please open an issue 
on Tika if the original problem with non-ascii chars still occurs

> parse-rtf plugin
> 
>
> Key: NUTCH-705
> URL: https://issues.apache.org/jira/browse/NUTCH-705
> Project: Nutch
>  Issue Type: New Feature
>  Components: fetcher
>Affects Versions: 1.0.0
>Reporter: Dmitry Lihachev
>Priority: Minor
> Fix For: 1.1
>
> Attachments: NUTCH-705.patch
>
>
> Demoting this issue and moving to 1.1 - current patch is not suitable due to 
> LGPL licensed parts.




[jira] Updated: (NUTCH-750) HtmlParser plugin - page title extraction

2010-02-16 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-750:


Component/s: parser

> HtmlParser plugin - page title extraction
> -
>
> Key: NUTCH-750
> URL: https://issues.apache.org/jira/browse/NUTCH-750
> Project: Nutch
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.0.0
>Reporter: Alexey Torochkov
>Priority: Minor
> Fix For: 1.1
>
> Attachments: SkipBody.patch
>
>
> A small improvement: try to extract the <title> tag in the body if it doesn't 
> exist in the head.
> In the current version DOMContentUtils just skips everything after <body> in 
> the getTitle() method.
> The attached patch allows this behavior to be changed (by default it changes 
> nothing) and can cope with webmasters' mistakes




[jira] Updated: (NUTCH-782) Ability to order htmlparsefilters

2010-02-16 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-782:


Component/s: parser

> Ability to order htmlparsefilters
> -
>
> Key: NUTCH-782
> URL: https://issues.apache.org/jira/browse/NUTCH-782
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>    Reporter: Julien Nioche
>    Assignee: Julien Nioche
> Fix For: 1.1
>
> Attachments: NUTCH-782.patch
>
>
> Patch which adds a new parameter 'htmlparsefilter.order' which specifies the 
> order in which HTMLParse filters are applied. HTMLParse filter ordering MAY 
> have an impact on end result, as some filters could rely on the metadata 
> generated by a previous filter.




[jira] Updated: (NUTCH-794) Language Identification must check the parse metadata for language values

2010-02-16 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-794:


Component/s: parser

> Language Identification must check the parse metadata for language values 
> --
>
> Key: NUTCH-794
> URL: https://issues.apache.org/jira/browse/NUTCH-794
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>    Reporter: Julien Nioche
>    Assignee: Julien Nioche
> Fix For: 1.1
>
> Attachments: NUTCH-794.patch
>
>
> The following HTML document : 
> <html lang="fi"><head><title>document 1 title</title></head><body>jotain 
> suomeksi</body></html>
> is rendered as the following xhtml by Tika : 
> <html xmlns="http://www.w3.org/1999/xhtml"><head><title>document 1 
> title</title></head><body>jotain suomeksi</body></html>
> with the lang attribute getting lost.  The lang is not stored in the metadata 
> either.
> I will open an issue on Tika and modify TestHTMLLanguageParser so that the 
> tests don't break anymore 




[jira] Work started: (NUTCH-794) Language Identification must check the parse metadata for language values

2010-02-16 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-794 started by Julien Nioche.

> Language Identification must check the parse metadata for language values 
> --
>
> Key: NUTCH-794
> URL: https://issues.apache.org/jira/browse/NUTCH-794
> Project: Nutch
>  Issue Type: Bug
>    Reporter: Julien Nioche
>    Assignee: Julien Nioche
> Fix For: 1.1
>
> Attachments: NUTCH-794.patch
>
>
> The following HTML document : 
> <html lang="fi"><head><title>document 1 title</title></head><body>jotain 
> suomeksi</body></html>
> is rendered as the following xhtml by Tika : 
> <html xmlns="http://www.w3.org/1999/xhtml"><head><title>document 1 
> title</title></head><body>jotain suomeksi</body></html>
> with the lang attribute getting lost.  The lang is not stored in the metadata 
> either.
> I will open an issue on Tika and modify TestHTMLLanguageParser so that the 
> tests don't break anymore 




[jira] Updated: (NUTCH-794) Language Identification must check the parse metadata for language values

2010-02-16 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-794:


Summary: Language Identification must check the parse metadata for 
language values   (was: Tika parser does not identify lang attributes on html tag)

> Language Identification must check the parse metadata for language values 
> --
>
> Key: NUTCH-794
> URL: https://issues.apache.org/jira/browse/NUTCH-794
> Project: Nutch
>  Issue Type: Bug
>    Reporter: Julien Nioche
>    Assignee: Julien Nioche
> Fix For: 1.1
>
> Attachments: NUTCH-794.patch
>
>
> The following HTML document : 
> <html lang="fi"><head><title>document 1 title</title></head><body>jotain 
> suomeksi</body></html>
> is rendered as the following xhtml by Tika : 
> <html xmlns="http://www.w3.org/1999/xhtml"><head><title>document 1 
> title</title></head><body>jotain suomeksi</body></html>
> with the lang attribute getting lost.  The lang is not stored in the metadata 
> either.
> I will open an issue on Tika and modify TestHTMLLanguageParser so that the 
> tests don't break anymore 




[jira] Commented: (NUTCH-794) Language Identification must check the parse metadata for language values

2010-02-16 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12834147#action_12834147
 ] 

Julien Nioche commented on NUTCH-794:
-

Committed patch in revision 910454

Waiting for issue to be fixed in Tika before closing this issue

> Language Identification must check the parse metadata for language values 
> --
>
> Key: NUTCH-794
> URL: https://issues.apache.org/jira/browse/NUTCH-794
> Project: Nutch
>  Issue Type: Bug
>    Reporter: Julien Nioche
>Assignee: Julien Nioche
> Fix For: 1.1
>
> Attachments: NUTCH-794.patch
>
>
> The following HTML document : 
> <html lang="fi"><head><title>document 1 title</title></head><body>jotain 
> suomeksi</body></html>
> is rendered as the following xhtml by Tika : 
> <html xmlns="http://www.w3.org/1999/xhtml"><head><title>document 1 
> title</title></head><body>jotain suomeksi</body></html>
> with the lang attribute getting lost.  The lang is not stored in the metadata 
> either.
> I will open an issue on Tika and modify TestHTMLLanguageParser so that the 
> tests don't break anymore 




[jira] Updated: (NUTCH-794) Tika parser does not identify lang attributes on html tag

2010-02-16 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-794:


Attachment: NUTCH-794.patch

> Tika parser does not identify lang attributes on html tag
> -
>
> Key: NUTCH-794
> URL: https://issues.apache.org/jira/browse/NUTCH-794
> Project: Nutch
>  Issue Type: Bug
>    Reporter: Julien Nioche
>    Assignee: Julien Nioche
> Fix For: 1.1
>
> Attachments: NUTCH-794.patch
>
>
> The following HTML document : 
> <html lang="fi"><head><title>document 1 title</title></head><body>jotain 
> suomeksi</body></html>
> is rendered as the following xhtml by Tika : 
> <html xmlns="http://www.w3.org/1999/xhtml"><head><title>document 1 
> title</title></head><body>jotain suomeksi</body></html>
> with the lang attribute getting lost.  The lang is not stored in the metadata 
> either.
> I will open an issue on Tika and modify TestHTMLLanguageParser so that the 
> tests don't break anymore 




[jira] Commented: (NUTCH-794) Tika parser does not identify lang attributes on html tag

2010-02-16 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12834143#action_12834143
 ] 

Julien Nioche commented on NUTCH-794:
-

Apart from the attributes on the html tag being lost (see above), there is also 
an issue with the fact that Tika does not put the lang attributes in its XHTML 
representation but stores them in the metadata instead. 
I will shortly release a patch to address that in the class HTMLLanguageParser
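The idea of checking the parse metadata first can be sketched as below. The names (LanguageResolver, the "Content-Language" key) are assumptions for illustration, not the actual HTMLLanguageParser code: prefer a language value the parser already stored in the metadata, and only then fall back to evidence found in the document itself.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch under assumed names, not the actual HTMLLanguageParser:
// a language stored by the parser in the metadata wins; otherwise fall
// back to whatever lang attribute was found in the document (may be null).
public class LanguageResolver {
    public static String resolve(Map<String, String> parseMeta, String docLang) {
        String lang = parseMeta.get("Content-Language");
        if (lang != null && lang.length() > 0) {
            return lang; // language supplied via the parse metadata
        }
        return docLang; // fallback, null if no lang attribute was present
    }

    public static void main(String[] args) {
        Map<String, String> meta = new HashMap<String, String>();
        meta.put("Content-Language", "fi");
        System.out.println(resolve(meta, null)); // prints fi
    }
}
```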

> Tika parser does not identify lang attributes on html tag
> -
>
> Key: NUTCH-794
> URL: https://issues.apache.org/jira/browse/NUTCH-794
> Project: Nutch
>  Issue Type: Bug
>    Reporter: Julien Nioche
>Assignee: Julien Nioche
> Fix For: 1.1
>
>
> The following HTML document : 
> <html lang="fi"><head><title>document 1 title</title></head><body>jotain 
> suomeksi</body></html>
> is rendered as the following xhtml by Tika : 
> <html xmlns="http://www.w3.org/1999/xhtml"><head><title>document 1 
> title</title></head><body>jotain suomeksi</body></html>
> with the lang attribute getting lost.  The lang is not stored in the metadata 
> either.
> I will open an issue on Tika and modify TestHTMLLanguageParser so that the 
> tests don't break anymore 




[jira] Updated: (NUTCH-794) Tika parser does not identify lang attributes on html tag

2010-02-16 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-794:


Description: 
The following HTML document : 

<html lang="fi"><head><title>document 1 title</title></head><body>jotain suomeksi</body></html>

is rendered as the following xhtml by Tika : 

<html xmlns="http://www.w3.org/1999/xhtml"><head><title>document 1 
title</title></head><body>jotain suomeksi</body></html>

with the lang attribute getting lost.  The lang is not stored in the metadata 
either.

I will open an issue on Tika and modify TestHTMLLanguageParser so that the 
tests don't break anymore 

  was:
The following HTML document : 

<html lang="fi"><head><title>document 1 title</title></head><body>jotain suomeksi</body></html>

is rendered as the following xhtml by Tika : 

<html xmlns="http://www.w3.org/1999/xhtml"><head><title>document 1 
title</title></head><body>jotain suomeksi</body></html>

with the lang attribute getting lost. 

I will open an issue on Tika and modify TestHTMLLanguageParser so that the 
tests don't break anymore 

Summary: Tika parser does not identify lang attributes on html tag  (was: 
Tika parser does not keep attributes on html tag)

> Tika parser does not identify lang attributes on html tag
> -
>
> Key: NUTCH-794
> URL: https://issues.apache.org/jira/browse/NUTCH-794
> Project: Nutch
>      Issue Type: Bug
>    Reporter: Julien Nioche
>Assignee: Julien Nioche
> Fix For: 1.1
>
>
> The following HTML document : 
> <html lang="fi"><head><title>document 1 title</title></head><body>jotain 
> suomeksi</body></html>
> is rendered as the following xhtml by Tika : 
> <html xmlns="http://www.w3.org/1999/xhtml"><head><title>document 1 
> title</title></head><body>jotain suomeksi</body></html>
> with the lang attribute getting lost.  The lang is not stored in the metadata 
> either.
> I will open an issue on Tika and modify TestHTMLLanguageParser so that the 
> tests don't break anymore 




[jira] Created: (NUTCH-794) Tika parser does not keep attributes on html tag

2010-02-16 Thread Julien Nioche (JIRA)
Tika parser does not keep attributes on html tag


 Key: NUTCH-794
 URL: https://issues.apache.org/jira/browse/NUTCH-794
 Project: Nutch
  Issue Type: Bug
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.1


The following HTML document : 

<html lang="fi"><head><title>document 1 title</title></head><body>jotain suomeksi</body></html>

is rendered as the following xhtml by Tika : 

<html xmlns="http://www.w3.org/1999/xhtml"><head><title>document 1 
title</title></head><body>jotain suomeksi</body></html>

with the lang attribute getting lost. 

I will open an issue on Tika and modify TestHTMLLanguageParser so that the 
tests don't break anymore 




[jira] Closed: (NUTCH-766) Tika parser

2010-02-15 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche closed NUTCH-766.
---


Have added small improvement in revision 910187 (Prioritise default Tika parser 
when discovering plugins matching mime-type).
Thanks to Chris for testing and committing it + Andrzej and Sami for their 
comments and suggestions

> Tika parser
> ---
>
> Key: NUTCH-766
> URL: https://issues.apache.org/jira/browse/NUTCH-766
> Project: Nutch
>  Issue Type: New Feature
>    Reporter: Julien Nioche
>Assignee: Chris A. Mattmann
> Fix For: 1.1
>
> Attachments: NUTCH-766-v3.patch, NUTCH-766.v2, NutchTikaConfig.java, 
> sample.tar.gz, TikaParser.java
>
>
> Tika handles a lot of different formats under the bonnet and exposes them 
> nicely via SAX events. What is described here is a tika-parser plugin which 
> delegates the parsing mechanism to Tika but can still coexist with the 
> existing parsing plugins which is useful for formats partially handled by 
> Tika (or not at all). Some of the elements below have already been discussed 
> on the mailing lists. Note that this is work in progress, your feedback is 
> welcome.
> Tika is already used by Nutch for its MimeType implementations. Tika comes as 
> different jar files (core and parsers), in the work described here we decided 
> to put the libs in 2 different places
> NUTCH_HOME/lib : tika-core.jar
> NUTCH_HOME/tika-plugin/lib : tika-parsers.jar
> Since Tika is used by the core only for its mimetype functionality, we only 
> need to put tika-core at the main lib level, whereas the tika plugin obviously 
> needs tika-parsers.jar + all the jars used internally by Tika
> Due to limitations in the way Tika loads its classes, we had to duplicate the 
> TikaConfig class in the tika-plugin. This might be fixed in the future in 
> Tika itself or avoided by refactoring the mimetype part of Nutch using 
> extension points.
> Unlike most other parsers, Tika handles more than one Mime-type which is why 
> we are using "*" as its mimetype value in the plugin descriptor and have 
> modified ParserFactory.java so that it considers the tika parser as 
> potentially suitable for all mime-types. In practice this means that the 
> associations between a mime type and a parser plugin as defined in 
> parse-plugins.xml are useful only for the cases where we want to handle a 
> mime type with a different parser than Tika. 
> The general approach I chose was to convert the SAX events returned by the 
> Tika parsers into DOM objects and reuse the utilities that come with the 
> current HTML parser, i.e. link detection and metatag handling; this also means 
> that we can use the HTMLParseFilters in exactly the same way. The main 
> difference though is that HTMLParseFilters are not limited to HTML documents 
> anymore as the XHTML tags returned by Tika can correspond to a different 
> format for the original document. There is a duplication of code with the 
> html-plugin which will be resolved by either a) getting rid of the 
> html-plugin altogether or b) exporting its jar and make the tika parser 
> depend on it.
> The following libraries are required in the lib/ directory of the tika-parser 
> : 
>   
>   
>   
>   
>   
>   
>   
>   
>   
>   
>   
>   
>   
>   
>   
>   
>   
>   
>   
>   
>   
>   
> There is a small test suite which needs to be improved. We will need to have 
> a look at each individual format and check that it is covered by Tika and if 
> so to the same extent; the Wiki is probably the right place for this. The 
> language identifier (which is a HTMLParseFilter) seemed to work fine.
>  
> Again, your comments are welcome. Please bear in mind that this is just a 
> first step. 
> Julien
> http://www.digitalpebble.com
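The SAX-to-DOM conversion described above can be sketched with the standard JAXP identity transformer. This is a rough illustration, not the actual tika-plugin code; a real parser (e.g. Tika) would fire the SAX events instead of the hand-rolled ones below.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import org.w3c.dom.Document;
import org.xml.sax.helpers.AttributesImpl;

// Sketch: funnel SAX events through an identity TransformerHandler into a
// DOM tree, which DOM-based utilities (link extraction, metatag handling)
// can then walk. Not the actual TikaParser implementation.
public class SaxToDom {
    public static Document build() throws Exception {
        SAXTransformerFactory stf =
            (SAXTransformerFactory) SAXTransformerFactory.newInstance();
        TransformerHandler handler = stf.newTransformerHandler();
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder().newDocument();
        handler.setResult(new DOMResult(doc));

        // Hand-rolled SAX events standing in for a parser's output.
        handler.startDocument();
        handler.startElement("", "html", "html", new AttributesImpl());
        handler.endElement("", "html", "html");
        handler.endDocument();
        return doc;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(build().getDocumentElement().getTagName()); // prints html
    }
}
```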




[jira] Commented: (NUTCH-766) Tika parser

2010-02-11 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832583#action_12832583
 ] 

Julien Nioche commented on NUTCH-766:
-

@Chris : did you do 

ant -f src/plugin/parse-tika/build-ivy.xml 

between 5 and 6? This is required in order to populate the lib directory 
automatically

> Tika parser
> ---
>
> Key: NUTCH-766
> URL: https://issues.apache.org/jira/browse/NUTCH-766
> Project: Nutch
>  Issue Type: New Feature
>    Reporter: Julien Nioche
>Assignee: Chris A. Mattmann
> Fix For: 1.1
>
> Attachments: NUTCH-766-v3.patch, NUTCH-766.v2, NutchTikaConfig.java, 
> sample.tar.gz, TikaParser.java
>
>
> Tika handles a lot of different formats under the bonnet and exposes them 
> nicely via SAX events. What is described here is a tika-parser plugin which 
> delegates the parsing mechanism to Tika but can still coexist with the 
> existing parsing plugins which is useful for formats partially handled by 
> Tika (or not at all). Some of the elements below have already been discussed 
> on the mailing lists. Note that this is work in progress, your feedback is 
> welcome.
> Tika is already used by Nutch for its MimeType implementations. Tika comes as 
> different jar files (core and parsers), in the work described here we decided 
> to put the libs in 2 different places
> NUTCH_HOME/lib : tika-core.jar
> NUTCH_HOME/tika-plugin/lib : tika-parsers.jar
> Since Tika is used by the core only for its mimetype functionality, we only 
> need to put tika-core at the main lib level, whereas the tika plugin obviously 
> needs tika-parsers.jar + all the jars used internally by Tika
> Due to limitations in the way Tika loads its classes, we had to duplicate the 
> TikaConfig class in the tika-plugin. This might be fixed in the future in 
> Tika itself or avoided by refactoring the mimetype part of Nutch using 
> extension points.
> Unlike most other parsers, Tika handles more than one Mime-type which is why 
> we are using "*" as its mimetype value in the plugin descriptor and have 
> modified ParserFactory.java so that it considers the tika parser as 
> potentially suitable for all mime-types. In practice this means that the 
> associations between a mime type and a parser plugin as defined in 
> parse-plugins.xml are useful only for the cases where we want to handle a 
> mime type with a different parser than Tika. 
> The general approach I chose was to convert the SAX events returned by the 
> Tika parsers into DOM objects and reuse the utilities that come with the 
> current HTML parser, i.e. link detection and metatag handling; this also means 
> that we can use the HTMLParseFilters in exactly the same way. The main 
> difference though is that HTMLParseFilters are not limited to HTML documents 
> anymore as the XHTML tags returned by Tika can correspond to a different 
> format for the original document. There is a duplication of code with the 
> html-plugin which will be resolved by either a) getting rid of the 
> html-plugin altogether or b) exporting its jar and make the tika parser 
> depend on it.
> The following libraries are required in the lib/ directory of the tika-parser 
> : 
>   
>   
>   
>   
>   
>   
>   
>   
>   
>   
>   
>   
>   
>   
>   
>   
>   
>   
>   
>   
>   
>   
> There is a small test suite which needs to be improved. We will need to have 
> a look at each individual format and check that it is covered by Tika and if 
> so to the same extent; the Wiki is probably the right place for this. The 
> language identifier (which is a HTMLParseFilter) seemed to work fine.
>  
> Again, your comments are welcome. Please bear in mind that this is just a 
> first step. 
> Julien
> http://www.digitalpebble.com




[jira] Issue Comment Edited: (NUTCH-766) Tika parser

2010-02-11 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832564#action_12832564
 ] 

Julien Nioche edited comment on NUTCH-766 at 2/11/10 5:22 PM:
--

I had a closer look at the HTML parsing issue. What happens is that the 
association between the mime-type and the parser implementation is not 
explicitly set in parse-plugins.xml, so the ParserFactory goes through all the 
plugins and gets the ones with a matching mimetype (or * for Tika). The Tika 
parser takes no precedence over the default HTML parser, and the latter comes 
first in the list and is used for parsing.

Of course that does not happen if parse-html is not specified in 
plugin.includes or if an explicit mapping is set in parse-plugins.xml. I don't 
think we want to have to specify explicitly that Tika should be used in all 
the mappings; the mappings should be reserved for cases where a different 
parser must be used instead of Tika.

What we could do though is, in the cases where no explicit mapping is set 
for a mimetype, put Tika (or any parser marked as supporting any mimetype) 
first in the list of discovered parsers, so it would remain the default 
choice unless an explicit mapping is set (even if another plugin is loaded and 
can handle the type).

Makes sense?

The ParserFactory section of the patch v3 can be replaced by :  

Index: src/java/org/apache/nutch/parse/ParserFactory.java
===
--- src/java/org/apache/nutch/parse/ParserFactory.java  (revision 909059)
+++ src/java/org/apache/nutch/parse/ParserFactory.java  (working copy)
@@ -348,11 +348,23 @@
 contentType)) {
   extList.add(extensions[i]);
 }
+else if ("*".equals(extensions[i].getAttribute("contentType"))){
+  // default plugins get the priority
+  extList.add(0, extensions[i]);
+}
   }
   
   if (extList.size() > 0) {
 if (LOG.isInfoEnabled()) {
-  LOG.info("The parsing plugins: " + extList +
+  StringBuffer extensionsIDs = new StringBuffer("[");
+  boolean isFirst = true;
+  for (Extension ext : extList){
+ if (!isFirst) extensionsIDs.append(" - ");
+ else isFirst=false;
+ extensionsIDs.append(ext.getId());
+  }
+ extensionsIDs.append("]");
+  LOG.info("The parsing plugins: " + extensionsIDs.toString() +
" are enabled via the plugin.includes system " +
"property, and all claim to support the content type " +
contentType + ", but they are not mapped to it  in the " +
@@ -369,7 +381,7 @@
 
   private boolean match(Extension extension, String id, String type) {
 return ((id.equals(extension.getId())) &&
-(type.equals(extension.getAttribute("contentType")) ||
+(type.equals(extension.getAttribute("contentType")) || 
extension.getAttribute("contentType").equals("*") ||
  type.equals(DEFAULT_PLUGIN)));
   }
   



  was (Author: jnioche):
I had a closer look at the HTML parsing issue. What happens is that the 
association between the mime-type and the parser implementation is not 
explicitly set in parse-plugins.xml, so the ParserFactory goes through all the 
plugins and gets the ones with a matching mimetype (or * for Tika). The Tika 
parser takes no precedence over the default HTML parser, and the latter comes 
first in the list and is used for parsing.

Of course that does not happen if parse-html is not specified in 
plugin.includes or if an explicit mapping is set in parse-plugins.xml. I don't 
think we want to have to specify explicitly that Tika should be used in all 
the mappings; the mappings should be reserved for cases where a different 
parser must be used instead of Tika.

What we could do though is, in the cases where no explicit mapping is set 
for a mimetype, put Tika (or any parser marked as supporting any mimetype) 
first in the list of discovered parsers, so it would remain the default 
choice unless an explicit mapping is set (even if another plugin is loaded and 
can handle the type).

Makes sense?


  
> Tika parser
> ---
>
> Key: NUTCH-766
> URL: https://issues.apache.org/jira/browse/NUTCH-766
> Project: Nutch
>  Issue Type: New Feature
>Reporter: Julien Nioche
>Assignee: Chris A. Mattmann
> Fix For: 1.1
>
> Attachments: NUTCH-766-v3.patch, NUTCH-766.v2, NutchTikaConfig.java, 
> sample.tar.gz, TikaParser.java
>
>
> Tika handles a lot of different formats under the bonnet and exposes them 
> nicely 

[jira] Commented: (NUTCH-766) Tika parser

2010-02-11 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832564#action_12832564
 ] 

Julien Nioche commented on NUTCH-766:
-

I had a closer look at the HTML parsing issue. What happens is that the 
association between the mime-type and the parser implementation is not 
explicitly set in parse-plugins.xml, so the ParserFactory goes through all the 
plugins and gets the ones with a matching mimetype (or * for Tika). The Tika 
parser takes no precedence over the default HTML parser, and the latter comes 
first in the list and is used for parsing.

Of course that does not happen if parse-html is not specified in 
plugin.includes or if an explicit mapping is set in parse-plugins.xml. I don't 
think we want to have to specify explicitly that Tika should be used in all 
the mappings; the mappings should be reserved for cases where a different 
parser must be used instead of Tika.

What we could do though is, in the cases where no explicit mapping is set 
for a mimetype, put Tika (or any parser marked as supporting any mimetype) 
first in the list of discovered parsers, so it would remain the default 
choice unless an explicit mapping is set (even if another plugin is loaded and 
can handle the type).

Makes sense?
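The ordering rule described above can be sketched with a self-contained toy. The method name `order` and the pluginId/contentType pairs below are illustrative assumptions, simplified from the actual ParserFactory code:

```java
import java.util.ArrayList;
import java.util.List;

public class ParserOrdering {
    // Toy version of the proposed behaviour: a plugin matching the exact
    // content type keeps its discovery order, while a plugin registered for
    // "*" (e.g. parse-tika) is put first so it becomes the default choice
    // when no explicit mapping exists in parse-plugins.xml.
    static List<String> order(List<String[]> extensions, String contentType) {
        List<String> extList = new ArrayList<>();
        for (String[] ext : extensions) {       // ext = {pluginId, contentType}
            if (contentType.equals(ext[1])) {
                extList.add(ext[0]);
            } else if ("*".equals(ext[1])) {
                extList.add(0, ext[0]);         // default plugin gets priority
            }
        }
        return extList;
    }

    public static void main(String[] args) {
        List<String[]> discovered = new ArrayList<>();
        discovered.add(new String[]{"parse-html", "text/html"});
        discovered.add(new String[]{"parse-tika", "*"});
        // parse-tika ends up first even though parse-html was discovered first
        System.out.println(order(discovered, "text/html"));
    }
}
```

With an explicit mapping for text/html in parse-plugins.xml the factory never reaches this fallback, which is what keeps the behaviour backward compatible.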



> Tika parser
> ---
>
> Key: NUTCH-766
> URL: https://issues.apache.org/jira/browse/NUTCH-766
> Project: Nutch
>  Issue Type: New Feature
>    Reporter: Julien Nioche
>Assignee: Chris A. Mattmann
> Fix For: 1.1
>
> Attachments: NUTCH-766-v3.patch, NUTCH-766.v2, NutchTikaConfig.java, 
> sample.tar.gz, TikaParser.java
>
>
> Tika handles a lot of different formats under the bonnet and exposes them 
> nicely via SAX events. What is described here is a tika-parser plugin which 
> delegates the parsing mechanism to Tika but can still coexist with the 
> existing parsing plugins which is useful for formats partially handled by 
> Tika (or not at all). Some of the elements below have already been discussed 
> on the mailing lists. Note that this is work in progress, your feedback is 
> welcome.
> Tika is already used by Nutch for its MimeType implementations. Tika comes as 
> different jar files (core and parsers), in the work described here we decided 
> to put the libs in 2 different places
> NUTCH_HOME/lib : tika-core.jar
> NUTCH_HOME/tika-plugin/lib : tika-parsers.jar
> Since Tika is used by the core only for its MimeType functionality, we only 
> need to put tika-core at the main lib level, whereas the tika plugin obviously 
> needs the tika-parsers.jar + all the jars used internally by Tika
> Due to limitations in the way Tika loads its classes, we had to duplicate the 
> TikaConfig class in the tika-plugin. This might be fixed in the future in 
> Tika itself or avoided by refactoring the mimetype part of Nutch using 
> extension points.
> Unlike most other parsers, Tika handles more than one Mime-type which is why 
> we are using "*" as its mimetype value in the plugin descriptor and have 
> modified ParserFactory.java so that it considers the tika parser as 
> potentially suitable for all mime-types. In practice this means that the 
> associations between a mime type and a parser plugin as defined in 
> parse-plugins.xml are useful only for the cases where we want to handle a 
> mime type with a different parser than Tika. 
> The general approach I chose was to convert the SAX events returned by the 
> Tika parsers into DOM objects and reuse the utilities that come with the 
> current HTML parser, i.e. link detection and metatag handling; this also 
> means that we can use the HTMLParseFilters in exactly the same way. The main 
> difference though is that HTMLParseFilters are not limited to HTML documents 
> anymore as the XHTML tags returned by Tika can correspond to a different 
> format for the original document. There is a duplication of code with the 
> html-plugin which will be resolved by either a) getting rid of the 
> html-plugin altogether or b) exporting its jar and making the tika parser 
> depend on it.
> The following libraries are required in the lib/ directory of the tika-parser 
> : 
>   
>   
>   
>   
>   
>   
>   
>   
>   
>   
>   
>   
>   
>   
>   
>   
>   
>   
>   
>   
>   
>   
> There is a small test suite which needs to be improved. We will need to have 
> a look at each individual format and check that it is covered by Tika and if 
> so to the same extent; the Wiki is probably the right place for this. The 
> language identifier (which is a H

[jira] Commented: (NUTCH-766) Tika parser

2010-02-11 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832454#action_12832454
 ] 

Julien Nioche commented on NUTCH-766:
-

@Chris : I just did a fresh checkout from svn, applied the patch v3, unzipped 
sample.tar.gz into the directory parse-tika and ran the test just as you did, 
but could not reproduce the problem. Could there be a difference between your 
version and the trunk?

@Sami :  

{quote} was there a reason not to use AutoDetect parser?  {quote} 
I suppose we could, as long as we give it a clue about the MimeType obtained 
from the Content. As you pointed out, there could be a duplication with the 
detection done by Mime-Util. I suppose one way to do it would be to add a new 
version of the method getParse(Content content, MimeType type). That's an 
interesting point.

{quote} Also was there a reason not to parse html with tika? {quote} 
It is supposed to do so, if it does not then it's a bug which needs urgent 
fixing.

Regarding parsing package formats, I think the plan is that Tika will handle 
that in the future but we could try to do that now if we find a relatively 
clean mechanism for doing so. BTW, could you please send a diff rather than the 
full code of the class you posted earlier; that would make the comparison much 
easier.




> Tika parser
> ---
>
> Key: NUTCH-766
> URL: https://issues.apache.org/jira/browse/NUTCH-766
> Project: Nutch
>  Issue Type: New Feature
>Reporter: Julien Nioche
>Assignee: Chris A. Mattmann
> Fix For: 1.1
>
> Attachments: NUTCH-766-v3.patch, NUTCH-766.v2, NutchTikaConfig.java, 
> sample.tar.gz, TikaParser.java
>
>
> Tika handles a lot of different formats under the bonnet and exposes them 
> nicely via SAX events. What is described here is a tika-parser plugin which 
> delegates the parsing mechanism to Tika but can still coexist with the 
> existing parsing plugins which is useful for formats partially handled by 
> Tika (or not at all). Some of the elements below have already been discussed 
> on the mailing lists. Note that this is work in progress, your feedback is 
> welcome.
> Tika is already used by Nutch for its MimeType implementations. Tika comes as 
> different jar files (core and parsers), in the work described here we decided 
> to put the libs in 2 different places
> NUTCH_HOME/lib : tika-core.jar
> NUTCH_HOME/tika-plugin/lib : tika-parsers.jar
> Since Tika is used by the core only for its MimeType functionality, we only 
> need to put tika-core at the main lib level, whereas the tika plugin obviously 
> needs the tika-parsers.jar + all the jars used internally by Tika
> Due to limitations in the way Tika loads its classes, we had to duplicate the 
> TikaConfig class in the tika-plugin. This might be fixed in the future in 
> Tika itself or avoided by refactoring the mimetype part of Nutch using 
> extension points.
> Unlike most other parsers, Tika handles more than one Mime-type which is why 
> we are using "*" as its mimetype value in the plugin descriptor and have 
> modified ParserFactory.java so that it considers the tika parser as 
> potentially suitable for all mime-types. In practice this means that the 
> associations between a mime type and a parser plugin as defined in 
> parse-plugins.xml are useful only for the cases where we want to handle a 
> mime type with a different parser than Tika. 
> The general approach I chose was to convert the SAX events returned by the 
> Tika parsers into DOM objects and reuse the utilities that come with the 
> current HTML parser, i.e. link detection and metatag handling; this also 
> means that we can use the HTMLParseFilters in exactly the same way. The main 
> difference though is that HTMLParseFilters are not limited to HTML documents 
> anymore as the XHTML tags returned by Tika can correspond to a different 
> format for the original document. There is a duplication of code with the 
> html-plugin which will be resolved by either a) getting rid of the 
> html-plugin altogether or b) exporting its jar and making the tika parser 
> depend on it.
> The following libraries are required in the lib/ directory of the tika-parser 
> : 
>   
>   
>   
>   
>   
>   
>   
>   
>   
>   
>   
>   
>   
>   
>   
>   
>   
>   
>   
>   
>   
>   
> There is a small test suite which needs to be improved. We will need to have 
> a look at each individual format and check that it is covered by Tika and if 
> so to the same extent; the Wiki is probably t

[jira] Updated: (NUTCH-787) Upgrade Lucene to 3.0.0.

2010-02-10 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-787:


Fix Version/s: 1.1

> Upgrade Lucene to 3.0.0.
> 
>
> Key: NUTCH-787
> URL: https://issues.apache.org/jira/browse/NUTCH-787
> Project: Nutch
>  Issue Type: Task
>  Components: build
>Reporter: Dawid Weiss
>Priority: Trivial
> Fix For: 1.1
>
> Attachments: NUTCH-787.patch
>
>


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Spill failed

2010-02-10 Thread Julien Nioche
The explanation can be found in the stack trace you sent:
"java.io.IOException: error=12, Cannot allocate memory"

Small instances on EC2 do not give you enough memory. From the
configuration below, the slaves will use up to 1300M for the datanode
and tasktracker; if you add to that the memory used by the OS and of
course the tasks themselves, it is not surprising that you used the
1.7G you had. Things get worse if you parse at the same time as you
fetch, as this tends to take some RAM.

From my experience, EC2 large instances are more appropriate for a Nutch cluster
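The arithmetic behind this can be made explicit. The OS allowance below is a rough assumption; the other figures come from the configuration quoted in the original message:

```java
public class MemoryBudget {
    public static void main(String[] args) {
        int instanceMb = 1700; // EC2 small instance RAM
        int daemonsMb = 1300;  // HADOOP_HEAPSIZE: datanode + tasktracker heaps
        int childMb = 950;     // mapred.child.java.opts: one task JVM (-Xmx950m)
        int osMb = 200;        // rough allowance for the OS (assumption)

        // A single concurrent map task already overshoots the instance:
        int needed = daemonsMb + childMb + osMb;
        System.out.println(needed + "MB needed vs " + instanceMb + "MB available");
        // error=12 (ENOMEM) then surfaces when Hadoop forks "bash" for
        // DF.getAvailable() and the kernel cannot allocate the child process.
    }
}
```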

PS: nutch-user would be a more appropriate list for this type of message

J.

-- 
DigitalPebble Ltd
http://www.digitalpebble.com



On 10 February 2010 08:41, Santiago Pérez  wrote:
>
> Hej
>
> I am running Nutch in a cluster with 1 master and 6 slaves in Amazon (with
> the same instances for all of them with 1.7GB RAM memory)
>
> My configuration is the following:
>
> HADOOP_HEAPSIZE=1300
> HADOOP_NAMENODE_OPTS=-Xmx400m
> HADOOP_SECONDARYNAMENODE_OPTS=-Xmx400m
> HADOOP_JOBTRACKER_OPTS=-Xmx400m
> dfs.replication=3
> mapred.map.tasks=6
> mapred.reduce.tasks=6
> mapred.child.java.opts=-Xmx950m
>
> But in the second depth fetch, I got the following errors in some instances
> (while the other ones seems they fetched correctly) :
>
> 2010-02-10 03:18:31,185 FATAL fetcher.Fetcher - java.io.IOException: Spill
> failed
> 2010-02-10 03:18:31,185 FATAL fetcher.Fetcher - at
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:822)
> 2010-02-10 03:18:31,185 FATAL fetcher.Fetcher - at
> org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:466)
> 2010-02-10 03:18:31,185 FATAL fetcher.Fetcher - at
> org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:907)
> 2010-02-10 03:18:31,185 FATAL fetcher.Fetcher - at
> org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:670)
> 2010-02-10 03:18:31,185 FATAL fetcher.Fetcher - Caused by:
> java.io.IOException: Cannot run program "bash": java.io.IOException:
> error=12, Cannot allocate memory
> 2010-02-10 03:18:31,185 FATAL fetcher.Fetcher - at
> java.lang.ProcessBuilder.start(ProcessBuilder.java:459)
> 2010-02-10 03:18:31,185 FATAL fetcher.Fetcher - at
> org.apache.hadoop.util.Shell.runCommand(Shell.java:149)
> 2010-02-10 03:18:31,185 FATAL fetcher.Fetcher - at
> org.apache.hadoop.util.Shell.run(Shell.java:134)
> 2010-02-10 03:18:31,185 FATAL fetcher.Fetcher - at
> org.apache.hadoop.fs.DF.getAvailable(DF.java:73)
> 2010-02-10 03:18:31,185 FATAL fetcher.Fetcher - at
> org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:329)
> 2010-02-10 03:18:31,185 FATAL fetcher.Fetcher - at
> org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
> 2010-02-10 03:18:31,185 FATAL fetcher.Fetcher - at
> org.apache.hadoop.mapred.MapOutputFile.getSpillFileForWrite(MapOutputFile.java:107)
> 2010-02-10 03:18:31,185 FATAL fetcher.Fetcher - at
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1183)
> 2010-02-10 03:18:31,185 FATAL fetcher.Fetcher - at
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:648)
> 2010-02-10 03:18:31,185 FATAL fetcher.Fetcher - at
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1135)
> 2010-02-10 03:18:31,185 FATAL fetcher.Fetcher - Caused by:
> java.io.IOException: java.io.IOException: error=12, Cannot allocate memory
> 2010-02-10 03:18:31,185 FATAL fetcher.Fetcher - at
> java.lang.UNIXProcess.(UNIXProcess.java:148)
> 2010-02-10 03:18:31,185 FATAL fetcher.Fetcher - at
> java.lang.ProcessImpl.start(ProcessImpl.java:65)
> 2010-02-10 03:18:31,185 FATAL fetcher.Fetcher - at
> java.lang.ProcessBuilder.start(ProcessBuilder.java:452)
> 2010-02-10 03:18:31,185 FATAL fetcher.Fetcher - ... 9 more
> .
> .
> .
> .
> .
> 2010-02-10 03:18:31,463 WARN  mapred.TaskTracker - Error running child
> java.io.IOException: Spill failed
>        at
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1085)
>        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:359)
>        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
>        at org.apache.hadoop.mapred.Child.main(Child.java:170)
> Caused by: java.io.IOException: Cannot run program "bash":
> java.io.IOException: error=12, Cannot allocate memory
>        at java.lang.ProcessBuilder.start(ProcessBuilder.java:459)
>        at org.apache.hadoop.util.Shell.runCommand(Shell.java:149)
>        at org.apache.hadoop.util.Shell.run(Shell.java:134)
>        at org.apache.hadoop.fs.DF.getAvailable(DF.java:73)
>        at
> org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:329)
>        at
> org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
>        at
> org.apache.hadoop.mapred.MapOutputFile.getSpillFileForWrite(MapOutputFile.java:107)

[jira] Closed: (NUTCH-786) Better list of suffix domains

2010-02-05 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche closed NUTCH-786.
---

Resolution: Fixed

Committed revision 906907

> Better list of suffix domains
> -
>
> Key: NUTCH-786
> URL: https://issues.apache.org/jira/browse/NUTCH-786
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.0.0
>    Reporter: Julien Nioche
>    Assignee: Julien Nioche
> Fix For: 1.1
>
> Attachments: NUTCH-786.patch
>
>
> Small improvement to the content of domain-suffixes.xml : added compound TLD 
> for .ar, .co, .id, .il, .mx, .nz and .za

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-786) Better list of suffix domains

2010-02-05 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-786:


Attachment: NUTCH-786.patch

Small improvement to the content of domain-suffixes.xml : added compound TLD 
for .ar, .co, .id, .il, .mx, .nz and .za

> Better list of suffix domains
> -
>
> Key: NUTCH-786
> URL: https://issues.apache.org/jira/browse/NUTCH-786
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.0.0
>    Reporter: Julien Nioche
>    Assignee: Julien Nioche
> Fix For: 1.1
>
> Attachments: NUTCH-786.patch
>
>
> Small improvement to the content of domain-suffixes.xml : added compound TLD 
> for .ar, .co, .id, .il, .mx, .nz and .za

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (NUTCH-786) Better list of suffix domains

2010-02-05 Thread Julien Nioche (JIRA)
Better list of suffix domains
-

 Key: NUTCH-786
 URL: https://issues.apache.org/jira/browse/NUTCH-786
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.0.0
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.1


Small improvement to the content of domain-suffixes.xml : added compound TLD 
for .ar, .co, .id, .il, .mx, .nz and .za

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-781) Update Tika to v0.6 for the MimeType detection

2010-02-02 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12828548#action_12828548
 ] 

Julien Nioche commented on NUTCH-781:
-

> did you forget to update conf/tika-mimetypes.xml ?
indeed - well spotted, thanks

> Related question: do we actually need our own version on the tika config 
> anymore? I saw there were some old issues that were fixed in the custom 
> version but i would quess those changes, if important, have already made 
> their way into Tika?
The version we had was the same as the one provided by Tika 0.4, so I suppose we 
could safely rely on the Tika defaults. MimeUtil currently requires 
tika-mimetypes.xml to be available in the classpath, but we could modify that 
so that it uses the default version from the tika jar if nothing can be found 
in conf. Let's put that in a separate JIRA issue if we really want it; in the 
meantime I'll commit the v0.6 of tika-mimetypes.xml
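A minimal sketch of that fallback (an assumption about how it could look, not the actual Nutch or Tika code): prefer a tika-mimetypes.xml found on the classpath, i.e. coming from conf/, and only fall back to the copy bundled in the tika jar when it is absent.

```java
import java.io.InputStream;

public class MimeTypesSource {
    // Hypothetical resolver: "conf" means a user-provided tika-mimetypes.xml
    // was found on the classpath, "tika-default" means we would fall back to
    // the definitions shipped inside the tika jar.
    static String resolve() {
        InputStream conf = MimeTypesSource.class.getClassLoader()
                .getResourceAsStream("tika-mimetypes.xml");
        return (conf != null) ? "conf" : "tika-default";
    }

    public static void main(String[] args) {
        System.out.println(resolve());
    }
}
```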

J.


> Update Tika to v0.6 for the MimeType detection
> ---
>
> Key: NUTCH-781
> URL: https://issues.apache.org/jira/browse/NUTCH-781
> Project: Nutch
>      Issue Type: Improvement
>    Reporter: Julien Nioche
>Assignee: Julien Nioche
> Fix For: 1.1
>
>
> [from announcement]
> Apache Tika, a subproject of Apache Lucene, is a toolkit for detecting and
> extracting metadata and structured text content from various documents using
> existing parser libraries.
> Apache Tika 0.6 contains a number of improvements and bug fixes. Details can
> be found in the changes file:
> http://www.apache.org/dist/lucene/tika/CHANGES-0.6.txt

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-785) Fetcher : copy metadata from origin URL when redirecting + call scfilters.initialScore on newly created URL

2010-02-01 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-785:


Attachment: NUTCH-785.patch

> Fetcher : copy metadata from origin URL when redirecting + call 
> scfilters.initialScore on newly created URL
> ---
>
> Key: NUTCH-785
> URL: https://issues.apache.org/jira/browse/NUTCH-785
> Project: Nutch
>  Issue Type: Bug
>    Reporter: Julien Nioche
>    Assignee: Julien Nioche
> Fix For: 1.1
>
> Attachments: NUTCH-785.patch
>
>
> When following redirections, the Fetcher does not copy the metadata from 
> the original URL to the new one, nor does it call the method 
> scfilters.initialScore

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (NUTCH-785) Fetcher : copy metadata from origin URL when redirecting + call scfilters.initialScore on newly created URL

2010-02-01 Thread Julien Nioche (JIRA)
Fetcher : copy metadata from origin URL when redirecting + call 
scfilters.initialScore on newly created URL
---

 Key: NUTCH-785
 URL: https://issues.apache.org/jira/browse/NUTCH-785
 Project: Nutch
  Issue Type: Bug
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.1


When following redirections, the Fetcher does not copy the metadata from 
the original URL to the new one, nor does it call the method scfilters.initialScore

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-784) CrawlDBScanner

2010-02-01 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-784:


Attachment: NUTCH-784.patch

> CrawlDBScanner 
> ---
>
> Key: NUTCH-784
> URL: https://issues.apache.org/jira/browse/NUTCH-784
> Project: Nutch
>  Issue Type: New Feature
>    Reporter: Julien Nioche
>    Assignee: Julien Nioche
> Attachments: NUTCH-784.patch
>
>
> The patch file contains a utility which dumps all the entries matching a 
> regular expression on their URL. The dump mechanism of the crawldb reader is 
> not very useful on large crawldbs as the output can be extremely large and 
> the -url function can't help if we don't know what url we want to have a 
> look at.
> The CrawlDBScanner can either generate a text representation of the 
> CrawlDatum-s or binary objects which can then be used as a new CrawlDB. 
> Usage: CrawlDBScanner <crawldb> <output> <regex> [-s <status>] [-text]
> regex: regular expression on the crawldb key
> -s status : constraint on the status of the crawldb entries e.g. db_fetched, 
> db_unfetched
> -text : if this parameter is used, the output will be of TextOutputFormat; 
> otherwise it generates a 'normal' crawldb with the MapFileOutputFormat
> for instance the command below : 
> ./nutch com.ant.CrawlDBScanner crawl/crawldb /tmp/amazon-dump .+amazon.com.* 
> -s db_fetched -text
> will generate a text file /tmp/amazon-dump containing all the entries of the 
> crawldb matching the regexp  .+amazon.com.* and having a status of db_fetched

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (NUTCH-784) CrawlDBScanner

2010-02-01 Thread Julien Nioche (JIRA)
CrawlDBScanner 
---

 Key: NUTCH-784
 URL: https://issues.apache.org/jira/browse/NUTCH-784
 Project: Nutch
  Issue Type: New Feature
Reporter: Julien Nioche
Assignee: Julien Nioche
 Attachments: NUTCH-784.patch

The patch file contains a utility which dumps all the entries matching a 
regular expression on their URL. The dump mechanism of the crawldb reader is 
not very useful on large crawldbs as the output can be extremely large and the 
-url function can't help if we don't know what url we want to have a look at.

The CrawlDBScanner can either generate a text representation of the 
CrawlDatum-s or binary objects which can then be used as a new CrawlDB. 

Usage: CrawlDBScanner <crawldb> <output> <regex> [-s <status>] [-text]

regex: regular expression on the crawldb key
-s status : constraint on the status of the crawldb entries e.g. db_fetched, 
db_unfetched
-text : if this parameter is used, the output will be of TextOutputFormat; 
otherwise it generates a 'normal' crawldb with the MapFileOutputFormat

for instance the command below : 
./nutch com.ant.CrawlDBScanner crawl/crawldb /tmp/amazon-dump .+amazon.com.* -s 
db_fetched -text

will generate a text file /tmp/amazon-dump containing all the entries of the 
crawldb matching the regexp  .+amazon.com.* and having a status of db_fetched
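As a side note, the key regex is matched against the whole URL, and the unescaped dots in `.+amazon.com.*` match any character, not just a literal dot. A quick check (the URLs below are made up for illustration):

```java
import java.util.regex.Pattern;

public class ScanFilterDemo {
    public static void main(String[] args) {
        // Regex from the example above, applied to crawldb keys (URLs)
        Pattern p = Pattern.compile(".+amazon.com.*");
        System.out.println(p.matcher("http://www.amazon.com/dp/123").matches()); // true
        System.out.println(p.matcher("http://www.example.org/").matches());     // false
        // Caveat: the unescaped dot also lets "amazonXcom" through
        System.out.println(p.matcher("http://www.amazonXcom/").matches());      // true
    }
}
```

Escaping the dots (`.+amazon\.com.*`) would make the filter strict about the host name.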




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-779) Mechanism for passing metadata from parse to crawldb

2010-02-01 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-779:


Attachment: NUTCH-779-v2.patch

Improved version of the patch. Followed AB's recommendations and renamed 
STATUS_PARSE_META, added a description for the param 'db.parsemeta.to.crawldb' 
in nutch-default.xml, and fixed an issue with IndexerMapReduce

> Mechanism for passing metadata from parse to crawldb
> 
>
> Key: NUTCH-779
> URL: https://issues.apache.org/jira/browse/NUTCH-779
> Project: Nutch
>  Issue Type: New Feature
>Reporter: Julien Nioche
>Assignee: Julien Nioche
> Attachments: NUTCH-779, NUTCH-779-v2.patch
>
>
> The patch attached makes it possible to pass parse metadata to the 
> corresponding entry of the crawldb.  
> Comments are welcome

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (NUTCH-779) Mechanism for passing metadata from parse to crawldb

2010-02-01 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche reassigned NUTCH-779:
---

Assignee: Julien Nioche

> Mechanism for passing metadata from parse to crawldb
> 
>
> Key: NUTCH-779
> URL: https://issues.apache.org/jira/browse/NUTCH-779
> Project: Nutch
>  Issue Type: New Feature
>    Reporter: Julien Nioche
>    Assignee: Julien Nioche
> Attachments: NUTCH-779
>
>
> The patch attached makes it possible to pass parse metadata to the 
> corresponding entry of the crawldb.  
> Comments are welcome

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-783) IndexerChecker Utility

2010-02-01 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-783:


Attachment: NUTCH-783.patch

> IndexerChecker Utility
> -
>
> Key: NUTCH-783
> URL: https://issues.apache.org/jira/browse/NUTCH-783
> Project: Nutch
>  Issue Type: New Feature
>  Components: indexer
>    Reporter: Julien Nioche
> Fix For: 1.1
>
> Attachments: NUTCH-783.patch
>
>
> This patch contains a new utility which makes it possible to check the 
> configuration of the indexing filters. The IndexerChecker reads and parses a 
> URL, runs the indexing filters on it, and displays the fields obtained and 
> the first 100 characters of their value.
> Can be used e.g. ./nutch org.apache.nutch.indexer.IndexerChecker 
> http://www.lemonde.fr/

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (NUTCH-783) IndexerChecker Utility

2010-02-01 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche reassigned NUTCH-783:
---

Assignee: Julien Nioche

> IndexerChecker Utility
> -
>
> Key: NUTCH-783
> URL: https://issues.apache.org/jira/browse/NUTCH-783
> Project: Nutch
>  Issue Type: New Feature
>  Components: indexer
>    Reporter: Julien Nioche
>    Assignee: Julien Nioche
> Fix For: 1.1
>
> Attachments: NUTCH-783.patch
>
>
> This patch contains a new utility which makes it possible to check the 
> configuration of the indexing filters. The IndexerChecker reads and parses a 
> URL, runs the indexing filters on it, and displays the fields obtained and 
> the first 100 characters of their value.
> Can be used e.g. ./nutch org.apache.nutch.indexer.IndexerChecker 
> http://www.lemonde.fr/

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (NUTCH-783) IndexerChecker Utility

2010-02-01 Thread Julien Nioche (JIRA)
IndexerChecker Utilty
-

 Key: NUTCH-783
 URL: https://issues.apache.org/jira/browse/NUTCH-783
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Reporter: Julien Nioche
 Fix For: 1.1


This patch contains a new utility which allows checking the configuration of 
the indexing filters. The IndexerChecker reads and parses a URL and runs the 
indexers on it, displaying the fields obtained and the first 
100 characters of their values.

It can be used e.g.: ./nutch org.apache.nutch.indexer.IndexerChecker 
http://www.lemonde.fr/






[jira] Updated: (NUTCH-782) Ability to order htmlparsefilters

2010-02-01 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-782:


Attachment: NUTCH-782.patch

> Ability to order htmlparsefilters
> -
>
> Key: NUTCH-782
> URL: https://issues.apache.org/jira/browse/NUTCH-782
> Project: Nutch
>  Issue Type: New Feature
>    Reporter: Julien Nioche
>    Assignee: Julien Nioche
> Fix For: 1.1
>
> Attachments: NUTCH-782.patch
>
>
> A patch which adds a new parameter, 'htmlparsefilter.order', specifying the 
> order in which HTMLParse filters are applied. HTMLParse filter ordering MAY 
> have an impact on the end result, as some filters could rely on the metadata 
> generated by a previous filter.




[jira] Created: (NUTCH-782) Ability to order htmlparsefilters

2010-02-01 Thread Julien Nioche (JIRA)
Ability to order htmlparsefilters
-

 Key: NUTCH-782
 URL: https://issues.apache.org/jira/browse/NUTCH-782
 Project: Nutch
  Issue Type: New Feature
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.1
 Attachments: NUTCH-782.patch

A patch which adds a new parameter, 'htmlparsefilter.order', specifying the 
order in which HTMLParse filters are applied. HTMLParse filter ordering MAY 
have an impact on the end result, as some filters could rely on the metadata 
generated by a previous filter.
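As an illustration only (the two filter ids below are hypothetical placeholders, not part of the patch), such a property could be set in conf/nutch-site.xml along these lines:

```xml
<!-- Hypothetical nutch-site.xml entry; the filter ids are placeholders -->
<property>
  <name>htmlparsefilter.order</name>
  <value>org.example.parse.FirstFilter org.example.parse.SecondFilter</value>
  <description>Order in which HTMLParse filters are applied. A filter that
  relies on metadata produced by another filter should be listed after it.
  </description>
</property>
```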






[jira] Updated: (NUTCH-766) Tika parser

2010-02-01 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-766:


Attachment: (was: Nutch-766.ParserFactory.patch)

> Tika parser
> ---
>
> Key: NUTCH-766
> URL: https://issues.apache.org/jira/browse/NUTCH-766
> Project: Nutch
>  Issue Type: New Feature
>    Reporter: Julien Nioche
>Assignee: Chris A. Mattmann
> Fix For: 1.1
>
> Attachments: NUTCH-766-v3.patch, NUTCH-766.v2, sample.tar.gz
>
>
> Tika handles a lot of different formats under the bonnet and exposes them 
> nicely via SAX events. What is described here is a tika-parser plugin which 
> delegates the parsing mechanism to Tika but can still coexist with the 
> existing parsing plugins, which is useful for formats partially handled by 
> Tika (or not handled at all). Some of the elements below have already been 
> discussed on the mailing lists. Note that this is work in progress; your 
> feedback is welcome.
> Tika is already used by Nutch for its MimeType implementations. Tika comes as 
> different jar files (core and parsers); in the work described here we decided 
> to put the libs in two different places:
> NUTCH_HOME/lib : tika-core.jar
> NUTCH_HOME/tika-plugin/lib : tika-parsers.jar
> Since Tika is used by the core only for its MimeType functionality, we only 
> need to put tika-core at the main lib level, whereas the tika plugin 
> obviously needs tika-parsers.jar plus all the jars used internally by Tika.
> Due to limitations in the way Tika loads its classes, we had to duplicate the 
> TikaConfig class in the tika-plugin. This might be fixed in the future in 
> Tika itself or avoided by refactoring the mimetype part of Nutch using 
> extension points.
> Unlike most other parsers, Tika handles more than one mime type, which is why 
> we are using "*" as its mimetype value in the plugin descriptor and have 
> modified ParserFactory.java so that it considers the tika parser as 
> potentially suitable for all mime types. In practice this means that the 
> associations between a mime type and a parser plugin as defined in 
> parse-plugins.xml are useful only for the cases where we want to handle a 
> mime type with a different parser than Tika. 
> The general approach I chose was to convert the SAX events returned by the 
> Tika parsers into DOM objects and reuse the utilities that come with the 
> current HTML parser, i.e. link detection and metatag handling; this also 
> means that we can use the HTMLParseFilters in exactly the same way. The main 
> difference, though, is that HTMLParseFilters are no longer limited to HTML 
> documents, as the XHTML tags returned by Tika can correspond to a different 
> format for the original document. There is a duplication of code with the 
> html-plugin, which will be resolved by either a) getting rid of the 
> html-plugin altogether or b) exporting its jar and making the tika parser 
> depend on it.
> The following libraries are required in the lib/ directory of the tika-parser 
> : 
>   
>   
>   
>   
>   
>   
>   
>   
>   
>   
>   
>   
>   
>   
>   
>   
>   
>   
>   
>   
>   
>   
> There is a small test suite which needs to be improved. We will need to have 
> a look at each individual format and check that it is covered by Tika and, if 
> so, to the same extent; the Wiki is probably the right place for this. The 
> language identifier (which is an HTMLParseFilter) seemed to work fine.
>  
> Again, your comments are welcome. Please bear in mind that this is just a 
> first step. 
> Julien
> http://www.digitalpebble.com
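To illustrate the point about parse-plugins.xml made above: with the tika parser matching "*", an explicit mapping is only needed when a mime type should go to a different plugin. A sketch of such an override (example mapping, not part of the patch):

```xml
<!-- Sketch of a parse-plugins.xml override: keep parse-html for text/html
     even though the tika parser matches all mime types via "*" -->
<parse-plugins>
  <mimeType name="text/html">
    <plugin id="parse-html" />
  </mimeType>
</parse-plugins>
```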
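As a rough sketch of the SAX-to-DOM bridging described above, using only the JDK (this is illustrative code, not the actual tika-plugin implementation):

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Minimal SAX handler that rebuilds a DOM tree from the event stream,
// in the spirit of feeding Tika's XHTML SAX events to DOM-based utilities.
public class SaxToDom extends DefaultHandler {
    private final Document doc;
    private Node current; // node we are currently appending into

    public SaxToDom() throws Exception {
        doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        current = doc;
    }

    public Document getDocument() { return doc; }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        Element e = doc.createElement(qName);
        for (int i = 0; i < atts.getLength(); i++)
            e.setAttribute(atts.getQName(i), atts.getValue(i));
        current.appendChild(e);
        current = e; // descend into the new element
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        current = current.getParentNode(); // ascend on close tag
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        current.appendChild(doc.createTextNode(new String(ch, start, length)));
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body><a href=\"http://example.com/\">link</a></body></html>";
        SaxToDom handler = new SaxToDom();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new ByteArrayInputStream(xhtml.getBytes("UTF-8")), handler);
        System.out.println(handler.getDocument().getDocumentElement().getNodeName());
    }
}
```

Once the DOM is built, existing DOM-walking utilities (link extraction, metatag handling) can be reused unchanged, which is the point of the approach.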




[jira] Updated: (NUTCH-766) Tika parser

2010-02-01 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-766:


Attachment: (was: NUTCH-766.tika.patch)

> Tika parser
> ---
>
> Key: NUTCH-766
> URL: https://issues.apache.org/jira/browse/NUTCH-766
> Project: Nutch
>  Issue Type: New Feature
>    Reporter: Julien Nioche
>Assignee: Chris A. Mattmann
> Fix For: 1.1
>
> Attachments: NUTCH-766-v3.patch, NUTCH-766.v2, sample.tar.gz
>
>




[jira] Updated: (NUTCH-766) Tika parser

2010-02-01 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-766:


Attachment: NUTCH-766-v3.patch

Updated version of the plugin: uses Tika 0.6

> Tika parser
> ---
>
> Key: NUTCH-766
> URL: https://issues.apache.org/jira/browse/NUTCH-766
> Project: Nutch
>  Issue Type: New Feature
>    Reporter: Julien Nioche
>Assignee: Chris A. Mattmann
> Fix For: 1.1
>
> Attachments: NUTCH-766-v3.patch, NUTCH-766.v2, sample.tar.gz
>
>



