首页 reverted to revision 5 on Nutch Wiki

2010-06-30 Thread Apache Wiki
Dear wiki user,

You have subscribed to a wiki page "Nutch Wiki" for change notification.

The page 首页 has been reverted to revision 5 by Upayavira.
The comment on this change is: spam.
http://wiki.apache.org/nutch/%E9%A6%96%E9%A1%B5?action=diff&rev1=6&rev2=7

--

-  Today, they spent a whole day at home. Day do not know what to do, holding 
hands to hear the phone will quickly pick up the zero-sound mind, not looking 
at the number you want, tears down subsequently. Hunhunee, like a zombie. Do 
not know why all of a sudden it became like this day, nearly a month, I lost 
too much faith, too much happiness. Always thought he was always standing 
there, so never thought of the loss of pain! You are in front of him is a 
transparent body, he knows what you're thinking, know what you like, love what, 
know what you need. He makes you feel very close, and with him you will feel 
relaxed and happy. No worries, and he will feel very safe with you. And 
suddenly this person does not belong to you, and suddenly it seems likely 
collapse. Someone to replace your position, but why is this time, why now? Why? 
Asked myself a thousand times, and asked him? But I did not get a satisfactory 
result to their own, do not know will not collapse, did not know who will be 
this fall ... ... has been telling him he might be happy to not be afraid of 
him that is subject to certain should not have to endure what he ah! At this 
moment I really want him to hold me, for I want love and affection. I do not 
treasure, or he really fear, and retreat? Do not really want to own that are 
good, will he come back, but I can not, I can not use this method, even if he 
did return, nor is my old bear, and alter the nature of the feelings I do not. 
He said that I had grown in his heart, may now, in this time, he felt pain in 
my mind you, this can not be words of pain, this pain Heartbreakers! There? 
Even a little bit, I am afraid that this time the tender is being submerged in 
another village in the doing part of their dreams! Insomnia every night take 
me, how do I do in order to allow time to stop this torture people? Well you 
will come back, Cubs, still thinking about Stupid me, that belongs to your 
simple-minded, how can you bear to make her so miserable for you? This cry for 
you, my conscience gone? No matter how, simple-minded, or wish you happiness. 
All bad, are not happy retribution upon it in the simple-minded, somebody had 
this pain, then let some of the pain is more pain, until you feel no pain ... 
...[[http://www.highheeled-shoes.com]]  wholesale High Heels shoes
+ ## Please edit system and help pages ONLY in the moinmaster wiki! For more
+ ## information, please see MoinMaster:MoinPagesEditorGroup.
+ ##master-page:FrontPage
+ ##master-date:2004-11-21 15:15:01
+ #format wiki
+ #language zh
+ #pragma section-numbers off
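+ = Wiki Link Name Wiki =
+ You may want to start with these links:
  
+  * [[最新改动]] (Recent Changes): who changed what recently (I am editing)
+  * [[维基沙盘演练]] (Wiki Sandbox): feel free to edit here as a warm-up exercise
+  * [[查找网页]] (Find Page): search and browse this site in several ways
+  * [[语法参考]] (Syntax Reference): a quick reference for wiki syntax
+  * [[站点导航]] (Site Navigation): an overview of this site's content
+ 
+ What is this wiki about?
+ 
+ Test
+ 
+ == How to use this site ==
+ A wiki is a collaborative website: anyone can take part in building, editing and maintaining the site and in sharing its content:
+ 
+  * Click '''<>''' in the header or footer of any page to edit that page freely.
+  * Creating a link could not be simpler: you can use a phrase of joined words, each starting with a capital letter and not separated by spaces (for example WikiSandBox), or use {{{["quoted words in brackets"]}}}. Links in Simplified Chinese can use the latter form, for example {{{["维基沙盘演练"]}}}.
+  * The search box in the header of each page can be used for a page-title search or a full-text search.
+  * Newcomers can read [[帮助-新手入门]] (Help for Beginners); for detailed help, see [[帮助目录]] (Help Contents).
+ 
+ For more information about [[维基网]] (Wiki Web), see [[维基好坏说]] (On the Pros and Cons of Wikis) and MoinMoin:WikiNature (in English). See also MoinMoin:WikiWikiWebFaq (in English).
+ 
+ This wiki runs on [[简体中文MoinMoin]] (Simplified Chinese MoinMoin), the Simplified Chinese version of MoinMoin.
+ 
+ English version of this page: FrontPage
+ 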


[jira] Commented: (NUTCH-834) Separate the Nutch web site from trunk

2010-06-30 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884177#action_12884177
 ] 

Hudson commented on NUTCH-834:
--

Integrated in Nutch-trunk #1194 (See 
[http://hudson.zones.apache.org/hudson/job/Nutch-trunk/1194/])
(NUTCH-834) Separate the Nutch web site from trunk


> Separate the Nutch web site from trunk
> --
>
> Key: NUTCH-834
> URL: https://issues.apache.org/jira/browse/NUTCH-834
> Project: Nutch
>  Issue Type: Task
>  Components: documentation
>Affects Versions: 1.1
>Reporter: Julien Nioche
>Assignee: Julien Nioche
> Fix For: 2.0
>
>
> As discussed on dev@, it would be useful to move the -PDFBox- Nutch web site 
> sources from .../asf/nutch/trunk to .../asf/nutch/site and to use the 
> svnpubsub mechanism for instant deployment of site changes.
> The related issue for infra is 
> https://issues.apache.org/jira/browse/INFRA-2822
> See also https://issues.apache.org/jira/browse/PDFBOX-623

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[Nutch Wiki] Update of "首页" by tuanzhang

2010-06-30 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "首页" page has been changed by tuanzhang.
http://wiki.apache.org/nutch/%E9%A6%96%E9%A1%B5?action=diff&rev1=5&rev2=6

--

+  Today, they spent a whole day at home. Day do not know what to do, holding 
hands to hear the phone will quickly pick up the zero-sound mind, not looking 
at the number you want, tears down subsequently. Hunhunee, like a zombie. Do 
not know why all of a sudden it became like this day, nearly a month, I lost 
too much faith, too much happiness. Always thought he was always standing 
there, so never thought of the loss of pain! You are in front of him is a 
transparent body, he knows what you're thinking, know what you like, love what, 
know what you need. He makes you feel very close, and with him you will feel 
relaxed and happy. No worries, and he will feel very safe with you. And 
suddenly this person does not belong to you, and suddenly it seems likely 
collapse. Someone to replace your position, but why is this time, why now? Why? 
Asked myself a thousand times, and asked him? But I did not get a satisfactory 
result to their own, do not know will not collapse, did not know who will be 
this fall ... ... has been telling him he might be happy to not be afraid of 
him that is subject to certain should not have to endure what he ah! At this 
moment I really want him to hold me, for I want love and affection. I do not 
treasure, or he really fear, and retreat? Do not really want to own that are 
good, will he come back, but I can not, I can not use this method, even if he 
did return, nor is my old bear, and alter the nature of the feelings I do not. 
He said that I had grown in his heart, may now, in this time, he felt pain in 
my mind you, this can not be words of pain, this pain Heartbreakers! There? 
Even a little bit, I am afraid that this time the tender is being submerged in 
another village in the doing part of their dreams! Insomnia every night take 
me, how do I do in order to allow time to stop this torture people? Well you 
will come back, Cubs, still thinking about Stupid me, that belongs to your 
simple-minded, how can you bear to make her so miserable for you? This cry for 
you, my conscience gone? No matter how, simple-minded, or wish you happiness. 
All bad, are not happy retribution upon it in the simple-minded, somebody had 
this pain, then let some of the pain is more pain, until you feel no pain ... 
...[[http://www.highheeled-shoes.com]]  wholesale High Heels shoes
- ## Please edit system and help pages ONLY in the moinmaster wiki! For more
- ## information, please see MoinMaster:MoinPagesEditorGroup.
- ##master-page:FrontPage
- ##master-date:2004-11-21 15:15:01
- #format wiki
- #language zh
- #pragma section-numbers off
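- = Wiki Link Name Wiki =
- You may want to start with these links:
  
-  * [[最新改动]] (Recent Changes): who changed what recently (I am editing)
-  * [[维基沙盘演练]] (Wiki Sandbox): feel free to edit here as a warm-up exercise
-  * [[查找网页]] (Find Page): search and browse this site in several ways
-  * [[语法参考]] (Syntax Reference): a quick reference for wiki syntax
-  * [[站点导航]] (Site Navigation): an overview of this site's content
- 
- What is this wiki about?
- 
- Test
- 
- == How to use this site ==
- A wiki is a collaborative website: anyone can take part in building, editing and maintaining the site and in sharing its content:
- 
-  * Click '''<>''' in the header or footer of any page to edit that page freely.
-  * Creating a link could not be simpler: you can use a phrase of joined words, each starting with a capital letter and not separated by spaces (for example WikiSandBox), or use {{{["quoted words in brackets"]}}}. Links in Simplified Chinese can use the latter form, for example {{{["维基沙盘演练"]}}}.
-  * The search box in the header of each page can be used for a page-title search or a full-text search.
-  * Newcomers can read [[帮助-新手入门]] (Help for Beginners); for detailed help, see [[帮助目录]] (Help Contents).
- 
- For more information about [[维基网]] (Wiki Web), see [[维基好坏说]] (On the Pros and Cons of Wikis) and MoinMoin:WikiNature (in English). See also MoinMoin:WikiWikiWebFaq (in English).
- 
- This wiki runs on [[简体中文MoinMoin]] (Simplified Chinese MoinMoin), the Simplified Chinese version of MoinMoin.
- 
- English version of this page: FrontPage
- 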


[jira] Assigned: (NUTCH-838) Add timing information to all Tool classes

2010-06-30 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann reassigned NUTCH-838:
---

Assignee: Chris A. Mattmann

> Add timing information to all Tool classes
> --
>
> Key: NUTCH-838
> URL: https://issues.apache.org/jira/browse/NUTCH-838
> Project: Nutch
>  Issue Type: New Feature
>  Components: fetcher, generator, indexer, linkdb, parser
>Affects Versions: 1.1
> Environment: JDK 1.6, Linux & Windows
>Reporter: Jeroen van Vianen
>Assignee: Chris A. Mattmann
> Fix For: 2.0
>
> Attachments: timings.patch
>
>
> Am happily trying to crawl a few hundred URLs incrementally. Performance is 
> degrading suddenly after the index reaches approximately 25000 URLs.
> At first each inject (generate, fetch, parse, updatedb) * 3, invertlinks, 
> solrindex, solrdedup batch takes approximately half an hour with topN 500, 
> but elapsed times now increase to 00h45m,  01h15m, 01h30m with every batch. 
> As I'm uncertain which of the phases takes so much time I decided to add 
> start and finish times to all classes that implement Tool so I at least have a 
> feeling and can review them in a log file.
> Am using pretty old hardware, but I am planning to recrawl these URLs on a 
> regular basis and if every iteration is going to take more and more time, 
> index updates will be few and far between :-(
> I added timing information to *all* Tool classes for consistency whereas 
> there are only 10 or so Tools that are really interesting.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (NUTCH-838) Add timing information to all Tool classes

2010-06-30 Thread Jeroen van Vianen (JIRA)
Add timing information to all Tool classes
--

 Key: NUTCH-838
 URL: https://issues.apache.org/jira/browse/NUTCH-838
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher, generator, indexer, linkdb, parser
Affects Versions: 1.1
 Environment: JDK 1.6, Linux & Windows
Reporter: Jeroen van Vianen
 Fix For: 2.0


Am happily trying to crawl a few hundred URLs incrementally. Performance is 
degrading suddenly after the index reaches approximately 25000 URLs.

At first each inject (generate, fetch, parse, updatedb) * 3, invertlinks, 
solrindex, solrdedup batch takes approximately half an hour with topN 500, but 
elapsed times now increase to 00h45m,  01h15m, 01h30m with every batch. As I'm 
uncertain which of the phases takes so much time I decided to add start and 
finish times to all classes that implement Tool so I at least have a feeling and 
can review them in a log file.

Am using pretty old hardware, but I am planning to recrawl these URLs on a 
regular basis and if every iteration is going to take more and more time, index 
updates will be few and far between :-(

I added timing information to *all* Tool classes for consistency whereas there 
are only 10 or so Tools that are really interesting.
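
As a rough illustration of the approach described above (not the contents of timings.patch, which targets Nutch's Java Tool classes), logging a start time, a finish time and the elapsed time around each phase can be sketched in Python. The `timed` helper, the phase names and the `00h45m`-style formatter are hypothetical and only mirror the wording of this report.

```python
import time
from contextlib import contextmanager
from datetime import datetime


def format_elapsed(seconds):
    """Format a duration the way the report writes it, e.g. 2700 s -> '00h45m'."""
    minutes = int(seconds) // 60
    return f"{minutes // 60:02d}h{minutes % 60:02d}m"


@contextmanager
def timed(phase, log=print):
    """Log start and finish times for one phase (hypothetical helper)."""
    started = time.monotonic()
    log(f"{phase}: starting at {datetime.now():%Y-%m-%d %H:%M:%S}")
    try:
        yield
    finally:
        elapsed = time.monotonic() - started
        log(f"{phase}: finished at {datetime.now():%Y-%m-%d %H:%M:%S}, "
            f"elapsed: {format_elapsed(elapsed)}")


if __name__ == "__main__":
    # Placeholder phases; a real batch would invoke the crawl steps here.
    for phase in ("generate", "fetch", "parse", "updatedb"):
        with timed(phase):
            time.sleep(0.01)
```

Because the finish line is written in a `finally` clause, a phase that fails still leaves its elapsed time in the log, which is usually what one wants when hunting for the slow step.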

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-838) Add timing information to all Tool classes

2010-06-30 Thread Jeroen van Vianen (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeroen van Vianen updated NUTCH-838:


Attachment: timings.patch

Here's the patch to add timings to all Tool classes.

Additionally, it removes some @Override where they were used incorrectly and 
adds the ability to use '#' to mark a line as a comment while injecting new URLs
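
The '#' comment handling mentioned above could look roughly like the following Python sketch. It is not taken from timings.patch; the function name `read_seed_urls` and the choice to also skip blank lines and to ignore whitespace before '#' are assumptions made for illustration.

```python
def read_seed_urls(lines):
    """Return the URLs from seed-list lines, skipping '#' comments.

    Assumption: blank lines are skipped too, and whitespace before '#'
    is ignored. The actual injector may behave differently.
    """
    urls = []
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        urls.append(stripped)
    return urls


if __name__ == "__main__":
    seeds = [
        "# news sites",
        "http://example.com/",
        "",
        "  # temporarily disabled",
        "http://example.org/index.html",
    ]
    print(read_seed_urls(seeds))
```

Only a '#' at the start of a line marks a comment here, so a URL that contains a fragment such as `page.html#section` is kept intact.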

> Add timing information to all Tool classes
> --
>
> Key: NUTCH-838
> URL: https://issues.apache.org/jira/browse/NUTCH-838
> Project: Nutch
>  Issue Type: New Feature
>  Components: fetcher, generator, indexer, linkdb, parser
>Affects Versions: 1.1
> Environment: JDK 1.6, Linux & Windows
>Reporter: Jeroen van Vianen
> Fix For: 2.0
>
> Attachments: timings.patch
>
>
> Am happily trying to crawl a few hundred URLs incrementally. Performance is 
> degrading suddenly after the index reaches approximately 25000 URLs.
> At first each inject (generate, fetch, parse, updatedb) * 3, invertlinks, 
> solrindex, solrdedup batch takes approximately half an hour with topN 500, 
> but elapsed times now increase to 00h45m,  01h15m, 01h30m with every batch. 
> As I'm uncertain which of the phases takes so much time I decided to add 
> start and finish times to all classes that implement Tool so I at least have a 
> feeling and can review them in a log file.
> Am using pretty old hardware, but I am planning to recrawl these URLs on a 
> regular basis and if every iteration is going to take more and more time, 
> index updates will be few and far between :-(
> I added timing information to *all* Tool classes for consistency whereas 
> there are only 10 or so Tools that are really interesting.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[Nutch Wiki] Trivial Update of "RunningNutchAndSolr" by SeanOConnor

2010-06-30 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "RunningNutchAndSolr" page has been changed by SeanOConnor.
The comment on this change is: change "export SEGMENT..." line from showing 
backtick entity, to just backtick.
http://wiki.apache.org/nutch/RunningNutchAndSolr?action=diff&rev1=38&rev2=39

--

  The above command will generate a new segment directory under crawl/segments 
that at this point contains files that store the url(s) to be fetched. In the 
following commands we need the latest segment dir as parameter so we’ll store 
it in an environment variable. 
  
  {{{
- export SEGMENT=crawl/segments/`ls -tr crawl/segments|tail -1`
+ export SEGMENT=crawl/segments/`ls -tr crawl/segments|tail -1`
  echo $SEGMENT
  }}}
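
For readers scripting this step outside the shell, the same selection (the most recently modified entry under crawl/segments, which is what `ls -tr ... | tail -1` picks) might be sketched in Python as follows. The helper name `latest_segment` is hypothetical, and unlike the shell line the sketch considers only subdirectories.

```python
import os


def latest_segment(segments_dir="crawl/segments"):
    """Return the path of the most recently modified subdirectory."""
    candidates = [e for e in os.scandir(segments_dir) if e.is_dir()]
    if not candidates:
        raise FileNotFoundError(f"no segments found under {segments_dir}")
    newest = max(candidates, key=lambda e: e.stat().st_mtime)
    return newest.path


if __name__ == "__main__":
    if os.path.isdir("crawl/segments"):
        # Mirror `export SEGMENT=...` for child processes started from here.
        os.environ["SEGMENT"] = latest_segment()
        print(os.environ["SEGMENT"])
```

Setting `os.environ` affects only this Python process and the processes it starts; it does not change the calling shell, so the `export` line above is still needed for interactive use.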
  


[jira] Updated: (NUTCH-836) Remove deprecated parse plugins

2010-06-30 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-836:


Attachment: NUTCH-836-2.patch

New patch which fixes the issues mentioned earlier. 

*languageidentifier* and *parse-zip* : the dependency was only in the plugin 
descriptor, and the code works fine with Tika used as the default plugin
*creative-commons* : had a hard-coded dependency (fixed). Using Tika returns 
slightly different results - see adapted test code.
 
Also fixed the TestParserFactory.

All tests OK. 
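
A descriptor-level dependency like the one mentioned above can be spotted mechanically. The Python sketch below is not part of the patch; it assumes a plugin descriptor shaped like `<plugin><requires><import plugin="..."/></requires></plugin>`, and the helper name `stale_imports` and the example descriptor are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Plugins the issue description lists as removed.
REMOVED_PLUGINS = {
    "parse-html", "parse-msexcel", "parse-mspowerpoint", "parse-msword",
    "parse-pdf", "parse-oo", "parse-text", "lib-jakarta-poi", "lib-parsems",
}


def stale_imports(descriptor_xml, removed=REMOVED_PLUGINS):
    """Return imported plugin ids in a descriptor that name removed plugins.

    Assumes a descriptor shaped like
    <plugin id="..."><requires><import plugin="..."/></requires></plugin>.
    """
    root = ET.fromstring(descriptor_xml)
    imported = [imp.get("plugin") for imp in root.iter("import")]
    return sorted(p for p in imported if p in removed)


if __name__ == "__main__":
    # Illustrative descriptor only, not copied from the Nutch sources.
    example = """
    <plugin id="parse-zip" name="Zip Parser">
      <requires>
        <import plugin="nutch-extensionpoints"/>
        <import plugin="parse-text"/>
      </requires>
    </plugin>
    """
    print(stale_imports(example))
```

Running such a check over every descriptor before deleting a plugin would flag leftover imports early, although it would not catch the hard-coded kind of dependency found in creative-commons.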

> Remove deprecated parse plugins
> ---
>
> Key: NUTCH-836
> URL: https://issues.apache.org/jira/browse/NUTCH-836
> Project: Nutch
>  Issue Type: Task
>  Components: parser
>Affects Versions: 1.1
>Reporter: Julien Nioche
>Assignee: Julien Nioche
> Fix For: 2.0
>
> Attachments: NUTCH-836-2.patch
>
>
> Some of the parser plugins in 1.1 are covered by the parse-tika plugin. These 
> plugins have been kept in 1.1 but should be removed from 2.0 where we'll rely 
> on parse-tika almost exclusively. Some existing plugins might be kept when 
> there is no equivalent in Tika (to be discussed). The following plugins are 
> removed : 
> * parse-html
> * parse-msexcel
> * parse-mspowerpoint
> * parse-msword
> * parse-pdf
> * parse-oo
> * parse-text
> * lib-jakarta-poi
> * lib-parsems
> The patch does not (yet) remove :
> * parse-ext
> * parse-js
> * parse-rss
> * parse-swf
> * parse-zip
> * feed
> Please review the patch and vote for its inclusion in the trunk.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-836) Remove deprecated parse plugins

2010-06-30 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-836:


Attachment: (was: NUTCH-836.patch)

> Remove deprecated parse plugins
> ---
>
> Key: NUTCH-836
> URL: https://issues.apache.org/jira/browse/NUTCH-836
> Project: Nutch
>  Issue Type: Task
>  Components: parser
>Affects Versions: 1.1
>Reporter: Julien Nioche
>Assignee: Julien Nioche
> Fix For: 2.0
>
> Attachments: NUTCH-836-2.patch
>
>
> Some of the parser plugins in 1.1 are covered by the parse-tika plugin. These 
> plugins have been kept in 1.1 but should be removed from 2.0 where we'll rely 
> on parse-tika almost exclusively. Some existing plugins might be kept when 
> there is no equivalent in Tika (to be discussed). The following plugins are 
> removed : 
> * parse-html
> * parse-msexcel
> * parse-mspowerpoint
> * parse-msword
> * parse-pdf
> * parse-oo
> * parse-text
> * lib-jakarta-poi
> * lib-parsems
> The patch does not (yet) remove :
> * parse-ext
> * parse-js
> * parse-rss
> * parse-swf
> * parse-zip
> * feed
> Please review the patch and vote for its inclusion in the trunk.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (NUTCH-837) Remove search servers and Lucene dependencies

2010-06-30 Thread Julien Nioche (JIRA)
Remove search servers and Lucene dependencies 
--

 Key: NUTCH-837
 URL: https://issues.apache.org/jira/browse/NUTCH-837
 Project: Nutch
  Issue Type: Task
  Components: searcher, web gui
Affects Versions: 1.1
Reporter: Julien Nioche
 Fix For: 2.0


One of the main aspects of 2.0 is the delegation of the indexing and search to 
external resources like SOLR. We can simplify the code a lot by getting rid of 
the : 
* search servers
* indexing and analysis with Lucene
* search side functionalities : ontologies / clustering etc...
In the short term only SOLR / SOLRCloud will be supported but the plan would be 
to add other systems as well. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[Nutch Wiki] Update of "Nutch2Roadmap" by JulienNioche

2010-06-30 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "Nutch2Roadmap" page has been changed by JulienNioche.
http://wiki.apache.org/nutch/Nutch2Roadmap?action=diff&rev1=2&rev2=3

--

  * robots handling, url filtering and url normalization, URL state 
management, perhaps deduplication. We should coordinate our efforts, and share 
code freely so that other projects (bixo, heritrix,droids) may contribute to 
this shared pool of functionality, much like Tika does for the common need of 
parsing complex formats.
* Remove index / search and delegate to SOLR
  * we may still keep a thin abstract layer to allow other indexing/search 
backends (ElasticSearch?), but the current mess of indexing/query filters and 
competing indexing frameworks (lucene, fields, solr) should go away. We should 
go directly from DOM to a NutchDocument, and stop there.
+   * Rewrite SOLR deduplication : do everything using the webtable and avoid 
retrieving content from SOLR 
* Various new functionalities 
  * e.g. sitemap support, canonical tag, better handling of redirects, 
detecting duplicated sites, detection of spam cliques, tools to manage the 
webgraph, etc.
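
The deduplication item above could, as a sketch, work purely on per-URL records held in the webtable: group rows by content signature and mark all but one row per group for deletion, without fetching anything back from SOLR. The record fields (`url`, `signature`, `score`) and the keep-the-highest-score rule are assumptions for illustration, not the design agreed for Nutch 2.0.

```python
from collections import defaultdict


def find_duplicates(rows):
    """Return URLs to delete so one row per signature remains.

    rows: iterable of dicts with 'url', 'signature' and 'score' keys
    (hypothetical fields standing in for webtable columns).
    Keeps the highest-scoring row per signature; ties keep the
    lexicographically smallest URL so the result is deterministic.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row["signature"]].append(row)

    to_delete = []
    for group in groups.values():
        # Best row first: highest score, then smallest URL.
        ranked = sorted(group, key=lambda r: (-r["score"], r["url"]))
        to_delete.extend(r["url"] for r in ranked[1:])
    return sorted(to_delete)
```

The returned list would then be sent to the index as delete requests, so the only traffic to the search backend is the deletions themselves.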
  


[Nutch Wiki] Update of "Support" by JulienNioche

2010-06-30 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "Support" page has been changed by JulienNioche.
http://wiki.apache.org/nutch/Support?action=diff&rev1=50&rev2=51

--

* [[http://www.30digits.com/|30 Digits]] - Implementation, consulting, 
support, and value-add components (i.e. spiders, UI, security) for Nutch, 
Lucene and Solr.  Based in Germany (Deutschland) with customers across Europe 
and North America. 
* [[http://www.sigram.com|Andrzej Bialecki]] 
* CNLP  http://www.cnlp.org/tech/lucene.asp
-   * [[http://www.digitalpebble.com/|DigitalPebble Ltd.]] . Norwich, UK.
+   * [[http://www.digitalpebble.com/|DigitalPebble Ltd.]] . Bristol, UK.
* [[http://www.doculibre.com/|Doculibre Inc.]] Open source and information 
management consulting. (Lucene, Nutch, Hadoop, Solr, Lius etc.) 
* [[http://www.dsen.nl|Thomas Delnoij (DSEN) - Java | J2EE | Agile 
Development & Consultancy]]
* eventax GmbH 


[jira] Updated: (NUTCH-836) Remove deprecated parse plugins

2010-06-30 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-836:


Description: 
Some of the parser plugins in 1.1 are covered by the parse-tika plugin. These 
plugins have been kept in 1.1 but should be removed from 2.0 where we'll rely 
on parse-tika almost exclusively. Some existing plugins might be kept when 
there is no equivalent in Tika (to be discussed). The following plugins are 
removed : 
* parse-html
* parse-msexcel
* parse-mspowerpoint
* parse-msword
* parse-pdf
* parse-oo
* parse-text
* lib-jakarta-poi
* lib-parsems

The patch does not (yet) remove :
* parse-ext
* parse-js
* parse-rss
* parse-swf
* parse-zip
* feed

Please review the patch and vote for its inclusion in the trunk.




  was:
Some of the parser plugins in 1.1 are covered by the parse-tika plugin. These 
plugins have been kept in 1.1 but should be removed from 2.0 where we'll rely 
on parse-tika almost exclusively. Some existing plugins might be kept when 
there is no equivalent in Tika (to be discussed). The following plugins are 
removed : 
* parse-html
* parse-msexcel
* parse-mspowerpoint
* parse-msword
* parse-pdf
* parse-oo
* parse-text
* lib-jakarta-poi
* lib-parsems

The patch does not (yet) remove :

* parse-js
* parse-rss
* parse-swf
* parse-zip
* feed

Please review the patch and vote for its inclusion in the trunk.





> Remove deprecated parse plugins
> ---
>
> Key: NUTCH-836
> URL: https://issues.apache.org/jira/browse/NUTCH-836
> Project: Nutch
>  Issue Type: Task
>  Components: parser
>Affects Versions: 1.1
>Reporter: Julien Nioche
>Assignee: Julien Nioche
> Fix For: 2.0
>
> Attachments: NUTCH-836.patch
>
>
> Some of the parser plugins in 1.1 are covered by the parse-tika plugin. These 
> plugins have been kept in 1.1 but should be removed from 2.0 where we'll rely 
> on parse-tika almost exclusively. Some existing plugins might be kept when 
> there is no equivalent in Tika (to be discussed). The following plugins are 
> removed : 
> * parse-html
> * parse-msexcel
> * parse-mspowerpoint
> * parse-msword
> * parse-pdf
> * parse-oo
> * parse-text
> * lib-jakarta-poi
> * lib-parsems
> The patch does not (yet) remove :
> * parse-ext
> * parse-js
> * parse-rss
> * parse-swf
> * parse-zip
> * feed
> Please review the patch and vote for its inclusion in the trunk.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-836) Remove deprecated parse plugins

2010-06-30 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883891#action_12883891
 ] 

Julien Nioche commented on NUTCH-836:
-

Actually creative-commons + languageidentifier currently have a dependency on 
parse-html and parse-zip has one on parse-text in their build script.
The tests for the Fetcher and ParserFactory also fail without parse-html and 
parse-text. 

I will modify the patch to prevent these issues

> Remove deprecated parse plugins
> ---
>
> Key: NUTCH-836
> URL: https://issues.apache.org/jira/browse/NUTCH-836
> Project: Nutch
>  Issue Type: Task
>  Components: parser
>Affects Versions: 1.1
>Reporter: Julien Nioche
>Assignee: Julien Nioche
> Fix For: 2.0
>
> Attachments: NUTCH-836.patch
>
>
> Some of the parser plugins in 1.1 are covered by the parse-tika plugin. These 
> plugins have been kept in 1.1 but should be removed from 2.0 where we'll rely 
> on parse-tika almost exclusively. Some existing plugins might be kept when 
> there is no equivalent in Tika (to be discussed). The following plugins are 
> removed : 
> * parse-html
> * parse-msexcel
> * parse-mspowerpoint
> * parse-msword
> * parse-pdf
> * parse-oo
> * parse-text
> * lib-jakarta-poi
> * lib-parsems
> The patch does not (yet) remove :
> * parse-js
> * parse-rss
> * parse-swf
> * parse-zip
> * feed
> Please review the patch and vote for its inclusion in the trunk.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (NUTCH-836) Remove deprecated parse plugins

2010-06-30 Thread Julien Nioche (JIRA)
Remove deprecated parse plugins
---

 Key: NUTCH-836
 URL: https://issues.apache.org/jira/browse/NUTCH-836
 Project: Nutch
  Issue Type: Task
  Components: parser
Affects Versions: 1.1
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 2.0
 Attachments: NUTCH-836.patch

Some of the parser plugins in 1.1 are covered by the parse-tika plugin. These 
plugins have been kept in 1.1 but should be removed from 2.0 where we'll rely 
on parse-tika almost exclusively. Some existing plugins might be kept when 
there is no equivalent in Tika (to be discussed). The following plugins are 
removed : 
* parse-html
* parse-msexcel
* parse-mspowerpoint
* parse-msword
* parse-pdf
* parse-oo
* parse-text
* lib-jakarta-poi
* lib-parsems

The patch does not (yet) remove :

* parse-js
* parse-rss
* parse-swf
* parse-zip
* feed

Please review the patch and vote for its inclusion in the trunk.




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-836) Remove deprecated parse plugins

2010-06-30 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-836:


Attachment: NUTCH-836.patch

> Remove deprecated parse plugins
> ---
>
> Key: NUTCH-836
> URL: https://issues.apache.org/jira/browse/NUTCH-836
> Project: Nutch
>  Issue Type: Task
>  Components: parser
>Affects Versions: 1.1
>Reporter: Julien Nioche
>Assignee: Julien Nioche
> Fix For: 2.0
>
> Attachments: NUTCH-836.patch
>
>
> Some of the parser plugins in 1.1 are covered by the parse-tika plugin. These 
> plugins have been kept in 1.1 but should be removed from 2.0 where we'll rely 
> on parse-tika almost exclusively. Some existing plugins might be kept when 
> there is no equivalent in Tika (to be discussed). The following plugins are 
> removed : 
> * parse-html
> * parse-msexcel
> * parse-mspowerpoint
> * parse-msword
> * parse-pdf
> * parse-oo
> * parse-text
> * lib-jakarta-poi
> * lib-parsems
> The patch does not (yet) remove :
> * parse-js
> * parse-rss
> * parse-swf
> * parse-zip
> * feed
> Please review the patch and vote for its inclusion in the trunk.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-650) Hbase Integration

2010-06-30 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883880#action_12883880
 ] 

Julien Nioche commented on NUTCH-650:
-

The patch has been committed with revision # 959259. The content of 
https://svn.apache.org/repos/asf/nutch/branches/nutchbase is now the same as 
github.

> Hbase Integration
> -
>
> Key: NUTCH-650
> URL: https://issues.apache.org/jira/browse/NUTCH-650
> Project: Nutch
>  Issue Type: New Feature
>Reporter: Doğacan Güney
>Assignee: Doğacan Güney
> Fix For: 2.0
>
> Attachments: hbase-integration_v1.patch, hbase_v2.patch, 
> latest-nutchbase-vs-original-branch-point.patch, 
> latest-nutchbase-vs-svn-nutchbase.patch, malformedurl.patch, meta.patch, 
> meta2.patch, nb-design.txt, nb-installusage.txt, nofollow-hbase.patch, 
> NUTCH-650.patch, nutch-habase.patch, searching.diff, slash.patch
>
>
> This issue will track nutch/hbase integration

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Update svn nutchbase - Nutch 2.0

2010-06-30 Thread Julien Nioche
Hi,

The patch has been committed with revision # 959259. The content of
https://svn.apache.org/repos/asf/nutch/branches/nutchbase is now the same as
github.

I'll start filing JIRAs to progressively transfer the stuff to the trunk,
starting with the easiest bits e.g. dependencies with IVY, deletion of old
plugins, etc...

Thanks

J.

On 29 June 2010 21:27, Dennis Kubes wrote:

>  +1 on this
>
>
> On 06/29/2010 08:57 AM, Julien Nioche wrote:
>
> Dogacan has produced a patch for svn nutchbase that brings it to the level
> of github. See https://issues.apache.org/jira/browse/NUTCH-650
> The patch has been marked as 'licensed for inclusion in ASF work' and works
> fine.
>
> Any objections to this patch being committed?
>
> Thanks Dogacan for producing it BTW
>
>
> On 29 June 2010 14:14, Mattmann, Chris A (388J) <
> chris.a.mattm...@jpl.nasa.gov> wrote:
>
>> Hey Guys,
>>
>> On 6/29/10 2:30 AM, "Andrzej Bialecki" wrote:
>>
>> >> I am probably missing an important point here, but if so I would
>> >> appreciate if someone (Dogacan?) could explain why we should not stick
>> >> to the original plan
>> >> (a) clear the existing svn nutchbase
>> >> (b) generate a large patch with the code from github and JIRA it
>> >> (c) commit the changes to svn nutchbase
>> >> then get on with the interesting bits.
>>
>> Like I said, whether we merge the Github Nutchbase into the Apache
>> Nutchbase branch or we blow away the Apache Nutchbase branch and then
>> import the Github Nutchbase branch wholesale, either way, we are left with
>> an Apache Nutchbase branch that needs to incrementally be merged into the
>> Nutch 2.0 trunk, which I agree with Andrzej, and Julien, is the most
>> important part.
>>
>> So, either way works fine with me, so long as we are left with an Apache
>> Nutchbase branch that can be merged incrementally with the Apache Nutch
>> 2.0 trunk. I'm just not going to be the one doing that first part (Github
>> transfer), so I didn't want to push one way or another.
>>
>> Once the Apache Nutchbase branch is ready, can we identify a set of 5-10
>> JIRA patches that we can use to track how to bring the Apache Nutchbase
>> branch into the Apache Nutch 2.0 trunk? At that point, I'll likely be of
>> use again :) Until then, Julien, Dogacan, I think the floor is yours.
>>
>> Cheers,
>> Chris
>>
>> ++
>> Chris Mattmann, Ph.D.
>> Senior Computer Scientist
>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> Office: 171-266B, Mailstop: 171-246
>> Email: chris.mattm...@jpl.nasa.gov
>> WWW:   http://sunset.usc.edu/~mattmann/
>> ++
>> Adjunct Assistant Professor, Computer Science Department
>> University of Southern California, Los Angeles, CA 90089 USA
>> ++
>>
>>
>>
>
>
> --
> DigitalPebble Ltd
>
> Open Source Solutions for Text Engineering
> http://www.digitalpebble.com
>
>


-- 
DigitalPebble Ltd

Open Source Solutions for Text Engineering
http://www.digitalpebble.com


[jira] Closed: (NUTCH-834) Separate the Nutch web site from trunk

2010-06-30 Thread Julien Nioche (JIRA)

[ https://issues.apache.org/jira/browse/NUTCH-834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche closed NUTCH-834.
---

Resolution: Fixed

Committed revision 959228.

Thanks, Chris, for your comments and help with this.

> Separate the Nutch web site from trunk
> --
>
> Key: NUTCH-834
> URL: https://issues.apache.org/jira/browse/NUTCH-834
> Project: Nutch
> Issue Type: Task
> Components: documentation
> Affects Versions: 1.1
> Reporter: Julien Nioche
> Assignee: Julien Nioche
> Fix For: 2.0
>
>
> As discussed on dev@, it would be useful to move the Nutch web site
> sources from .../asf/nutch/trunk to .../asf/nutch/site and to use the
> svnpubsub mechanism for instant deployment of site changes.
> The related issue for infra is
> https://issues.apache.org/jira/browse/INFRA-2822
> See also https://issues.apache.org/jira/browse/PDFBOX-623

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-834) Separate the Nutch web site from trunk

2010-06-30 Thread Julien Nioche (JIRA)

[ https://issues.apache.org/jira/browse/NUTCH-834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883841#action_12883841 ]

Julien Nioche commented on NUTCH-834:
-

The content of the site is now taken from
http://svn.apache.org/repos/asf/nutch/site/publish/ (see INFRA-2822). I've
tested it with a small change and it works fine.
I've updated http://wiki.apache.org/nutch/Website_Update_HOWTO accordingly and
will now remove the src/site and site directories from the trunk.




[Nutch Wiki] Update of "Website_Update_HOWTO" by JulienNioche

2010-06-30 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "Website_Update_HOWTO" page has been changed by JulienNioche.
http://wiki.apache.org/nutch/Website_Update_HOWTO?action=diff&rev1=8&rev2=9

--

  
  == How to modify the docs ==
  
-   1. Checkout nutch code.
+   1. Checkout nutch site code (https://svn.apache.org/repos/asf/nutch/site)
-   1. Go to {{{src/site}}}. It is the root of Nutch Website Forrest project.
+   1. Go to {{{forrest}}}. It is the root of Nutch Website Forrest project.
-   1. Run {{{forrest}}} to build current version of documentation. If the build was successful it means Forrest was correctly installed and generated site is in {{{src/site/build/site}}} directory. 
+   1. Run {{{forrest}}} to build current version of documentation. If the build was successful it means Forrest was correctly installed and generated site is in {{{forrest/build/site}}} directory. 
-   1. Modify files in {{{src/site/src}}} (mainly in {{{src/site/src/documentation/content/xdocs}}}). Run {{{forrest}}} in {{{/src/site}}} and review the changes after build.
+   1. Modify files in {{{forrest/src}}} (mainly in {{{forrest/src/documentation/content/xdocs}}}). Run {{{forrest}}} in {{{forrest}}} and review the changes after build.
  
  If you aren't a committer for this project, you now need to follow the instructions in HowToContribute to get your changes applied to the site.  You'll specifically want to read the sections on "Creating a patch" and "Proposing your work".  If you are a committer, it's time to deploy the site.
  
  == How to deploy the site ==
  
-   1. When you are finally happy with your changes copy files from {{{src/site/build/site}}} directory to {{{site}}} and commit them to SVN.
+   1. When you are finally happy with your changes copy files from {{{forrest/build/site}}} directory to {{{publish}}} and commit them to SVN.
+   1. The modifications should be visible on the website within a few minutes.
-   1. {{{ssh people.apache.org}}}
-   1. {{{cd /www/lucene.apache.org/nutch}}}
- 1. Before doing svn up make sure your umask is set to 002 (if you're using bash check that ~/.profile contains umask 002), this allows other committers to update the site too
-   1. {{{svn update}}}
-   1. Wait a few hours.  The website is synchronized from people.apache.org.
- 
  
  == Skins Note ==