[Nutch Wiki] Update of FrontPage by ysc

2013-03-27 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The FrontPage page has been changed by ysc:
http://wiki.apache.org/nutch/FrontPage?action=diff&rev1=260&rev2=261

Comment:
add some video resources

   * [[http://www.covert.io/post/18414889381/accumulo-nutch-and-gora|Accumulo, 
Nutch, and Gora]] - A step-by-step tutorial
  
   Other Tutorial(s) 
+  * [[http://user.qzone.qq.com/281032878/blog/1364233492|Chinese Video 
Tutorial: Nutch Relevant Framework]] - The first free video for Nutch in China.
+  * [[http://user.qzone.qq.com/281032878/blog/1362131478|Chinese Setting Up 
And Use Tutorial]] - The best guide of how to setting up and use nutch relevant 
framework in China.
+ 
   * [[http://hadoop.apache.org/common/docs/stable/|Hadoop Tutorial]] Nutch 
being based Hadoop, it helps to have a better understanding of Hadoop.
   * [[NutchHadoopTutorial|Nutch Hadoop Tutorial]] - How to setup and run Nutch 
in deploy mode over a Hadoop cluster. 
   * RunNutchInEclipse - How to configure, build, crawl and debug Nutch within 
Eclipse
   * [[IntranetDocumentSearch|Intranet Document Search]] - Index and search 
Microsoft Office, PDF etc. documents in a file system hierarchy with a Solr 
backend.
   * 
[[http://pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/|Recrawling 
with Nutch]] - How to re-crawl with Nutch. 
   * 
[[https://github.com/evolvingweb/ajax-solr/wiki/Tutorial%3A-Nutch|Ajax-Solr 
Tutorial: Nutch]] - Quick and easy guide to getting a nice UI on top of your 
Nutch crawl data. 
+ 
  
  === Configuration ===
   * OverviewDeploymentConfigs /!\ :This full page requires a complete update 
to reflect recent Nutch releases: /!\


[jira] [Commented] (NUTCH-1538) tuning of loaded fields during fetcherJob start-up

2013-03-27 Thread Roland von Herget (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13615063#comment-13615063
 ] 

Roland von Herget commented on NUTCH-1538:
--

Hi lufeng,

After reading a bit more of the Nutch code, the question arises whether it is 
really necessary to load any of these ParserJob.FIELDS.
Shouldn't the fetcher set up all fields (all of fit.page) necessary for the 
parser during the fetch?
I think I will give this a try here.


 tuning of loaded fields during fetcherJob start-up
 --

 Key: NUTCH-1538
 URL: https://issues.apache.org/jira/browse/NUTCH-1538
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 2.1
 Environment: nutch 2.1 / cassandra 1.2.1 / gora-cassandra 0.2 / 
 gora-core 0.2.1 
 running fetch with parse=true
Reporter: Roland von Herget

 The main problem is that Nutch loads nearly every row & column from the DB during 
 startup of a fetcherJob when fetcher.parse=true.
 A parserJob needs e.g. the CONTENT field from the DB in order to parse.
 The fetcherJob adds all fields of the parserJob to its needed fields if 
 running with fetcher.parse=true. [FetcherJob.getFields()]
 If the nutch configuration saves all fetched data to DB 
 (fetcher.store.content=true) you'll end up loading GBs of unused content 
 during fetcherJob start-up.
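
The effect described above can be modelled in a few lines. This is a minimal sketch, not the Nutch source: field names are plain strings, and the two field lists as well as the class and method names are hypothetical stand-ins chosen only to show how enabling parsing pulls the CONTENT field into the start-up load.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.Set;

// Simplified model, NOT the actual Nutch code: the field lists are hypothetical.
final class FetcherFieldsSketch {

  // Hypothetical fields the fetcher itself needs at start-up.
  static final Collection<String> FETCHER_FIELDS =
      Arrays.asList("MARKERS", "REPR_URL", "FETCH_TIME");

  // Hypothetical fields a parser needs; CONTENT is the expensive one.
  static final Collection<String> PARSER_FIELDS =
      Arrays.asList("STATUS", "CONTENT", "CONTENT_TYPE");

  // Mirrors the reported behaviour: with parsing enabled, every parser
  // field is added to the set that is loaded up front.
  static Set<String> fieldsToLoad(boolean parseDuringFetch) {
    Set<String> fields = new LinkedHashSet<>(FETCHER_FIELDS);
    if (parseDuringFetch) {
      fields.addAll(PARSER_FIELDS);
    }
    return fields;
  }

  public static void main(String[] args) {
    System.out.println("fetcher.parse=false: " + fieldsToLoad(false));
    System.out.println("fetcher.parse=true:  " + fieldsToLoad(true));
  }
}
```

In this model the stored content is only requested when parsing is enabled, which is the case the report says becomes costly once fetcher.store.content=true has filled the DB.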

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[Nutch Wiki] Trivial Update of EstelaDom by EstelaDom

2013-03-27 Thread Apache Wiki

The EstelaDom page has been changed by EstelaDom:
http://wiki.apache.org/nutch/EstelaDom

New page:
Hello, Ok Nothing to write about myself.
Great to be a member of this website.

My page - [[http://GamesActual.com/|http://GamesActual.com/]]


[Nutch Wiki] Update of FrontPage by ysc

2013-03-27 Thread Apache Wiki

The FrontPage page has been changed by ysc:
http://wiki.apache.org/nutch/FrontPage?action=diff&rev1=261&rev2=262

  
   Other Tutorial(s) 
   * [[http://user.qzone.qq.com/281032878/blog/1364233492|Chinese Video 
Tutorial: Nutch Relevant Framework]] - The first free video for Nutch in China.
-  * [[http://user.qzone.qq.com/281032878/blog/1362131478|Chinese Setting Up 
And Use Tutorial]] - The best guide of how to setting up and use nutch relevant 
framework in China.
+  * [[http://user.qzone.qq.com/281032878/blog/1362131478|Chinese Setting Up 
And Use Tutorial: Nutch Relevant Framework]] - The best guide of how to setting 
up and use nutch relevant framework in China.
  
   * [[http://hadoop.apache.org/common/docs/stable/|Hadoop Tutorial]] Nutch 
being based Hadoop, it helps to have a better understanding of Hadoop.
   * [[NutchHadoopTutorial|Nutch Hadoop Tutorial]] - How to setup and run Nutch 
in deploy mode over a Hadoop cluster. 


[Nutch Wiki] Update of FrontPage by ysc

2013-03-27 Thread Apache Wiki

The FrontPage page has been changed by ysc:
http://wiki.apache.org/nutch/FrontPage?action=diff&rev1=262&rev2=263

   * [[http://www.covert.io/post/18414889381/accumulo-nutch-and-gora|Accumulo, 
Nutch, and Gora]] - A step-by-step tutorial
  
   Other Tutorial(s) 
-  * [[http://user.qzone.qq.com/281032878/blog/1364233492|Chinese Video 
Tutorial: Nutch Relevant Framework]] - The first free video for Nutch in China.
+  * [[http://user.qzone.qq.com/281032878/blog/1364233492|Chinese video 
tutorial]] - The first free video for Nutch in China.
+  * [[http://user.qzone.qq.com/281032878/blog/1362131478|Chinese installing 
and using instruction]] - The best guidance in installing and using  Nutch in 
China.
-  * [[http://user.qzone.qq.com/281032878/blog/1362131478|Chinese Setting Up 
And Use Tutorial: Nutch Relevant Framework]] - The best guide of how to setting 
up and use nutch relevant framework in China.
- 
   * [[http://hadoop.apache.org/common/docs/stable/|Hadoop Tutorial]] Nutch 
being based Hadoop, it helps to have a better understanding of Hadoop.
   * [[NutchHadoopTutorial|Nutch Hadoop Tutorial]] - How to setup and run Nutch 
in deploy mode over a Hadoop cluster. 
   * RunNutchInEclipse - How to configure, build, crawl and debug Nutch within 
Eclipse


[Nutch Wiki] Update of FrontPage by ysc

2013-03-27 Thread Apache Wiki

The FrontPage page has been changed by ysc:
http://wiki.apache.org/nutch/FrontPage?action=diff&rev1=263&rev2=264

   * [[http://www.covert.io/post/18414889381/accumulo-nutch-and-gora|Accumulo, 
Nutch, and Gora]] - A step-by-step tutorial
  
   Other Tutorial(s) 
-  * [[http://user.qzone.qq.com/281032878/blog/1364233492|Chinese video 
tutorial]] - The first free video for Nutch in China.
+  * [[http://user.qzone.qq.com/281032878/blog/1364233492|Chinese Video 
Tutorial]] - The first free video for Nutch in China.
-  * [[http://user.qzone.qq.com/281032878/blog/1362131478|Chinese installing 
and using instruction]] - The best guidance in installing and using  Nutch in 
China.
+  * [[http://user.qzone.qq.com/281032878/blog/1362131478|Chinese Setting up 
and Using Instruction]] - The best guidance in setting up and using  Nutch in 
China.
   * [[http://hadoop.apache.org/common/docs/stable/|Hadoop Tutorial]] Nutch 
being based Hadoop, it helps to have a better understanding of Hadoop.
   * [[NutchHadoopTutorial|Nutch Hadoop Tutorial]] - How to setup and run Nutch 
in deploy mode over a Hadoop cluster. 
   * RunNutchInEclipse - How to configure, build, crawl and debug Nutch within 
Eclipse


[Nutch Wiki] Update of FrontPage by ysc

2013-03-27 Thread Apache Wiki

The FrontPage page has been changed by ysc:
http://wiki.apache.org/nutch/FrontPage?action=diff&rev1=264&rev2=265

   * [[http://www.covert.io/post/18414889381/accumulo-nutch-and-gora|Accumulo, 
Nutch, and Gora]] - A step-by-step tutorial
  
   Other Tutorial(s) 
-  * [[http://user.qzone.qq.com/281032878/blog/1364233492|Chinese Video 
Tutorial]] - The first free video for Nutch in China.
+  * [[http://user.qzone.qq.com/281032878/blog/1364233492|ChineseVideo 
Tutorial]] - The first free video for Nutch in China.
-  * [[http://user.qzone.qq.com/281032878/blog/1362131478|Chinese Setting up 
and Using Instruction]] - The best guidance in setting up and using  Nutch in 
China.
+  * [[http://user.qzone.qq.com/281032878/blog/1362131478|Chinese installing 
and using instruction]] - The best guidance in installing and using  Nutch in 
China.
   * [[http://hadoop.apache.org/common/docs/stable/|Hadoop Tutorial]] Nutch 
being based Hadoop, it helps to have a better understanding of Hadoop.
   * [[NutchHadoopTutorial|Nutch Hadoop Tutorial]] - How to setup and run Nutch 
in deploy mode over a Hadoop cluster. 
   * RunNutchInEclipse - How to configure, build, crawl and debug Nutch within 
Eclipse


[jira] [Updated] (NUTCH-1547) BasicIndexingFilter - Problem to index full title

2013-03-27 Thread lufeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lufeng updated NUTCH-1547:
--

Attachment: NUTCH-1547-2x.patch

add patch to Nutch 2.x

 BasicIndexingFilter - Problem to index full title
 -

 Key: NUTCH-1547
 URL: https://issues.apache.org/jira/browse/NUTCH-1547
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.6, 2.1
Reporter: Gustavo Rauber
Assignee: lufeng
Priority: Minor
 Fix For: 1.7, 2.2

 Attachments: NUTCH-1547-2x.patch, NUTCH-1547.patch

   Original Estimate: 1h
  Remaining Estimate: 1h

 I have faced this issue when trying to index the entire title, just like the 
 content, configuring its value on nutch-default.xml to -1 
 (indexer.max.title.length). I think the behavior should be the same as the 
 content.
 If you would like to fix it, just replace the line number 90:
 if (title.length() > MAX_TITLE_LENGTH) {  // truncate title if needed
 by this one:
 if (MAX_TITLE_LENGTH > -1 && title.length() > MAX_TITLE_LENGTH) {  // truncate title if needed
 Stack Trace:
 java.lang.StringIndexOutOfBoundsException: String index out of range: -1
   at java.lang.String.substring(String.java:1937)
   at org.apache.nutch.indexer.basic.BasicIndexingFilter.filter(BasicIndexingFilter.java:91)
   at org.apache.nutch.indexer.IndexingFilters.filter(IndexingFilters.java:109)
   at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:272)
   at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:53)
   at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:519)
   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:420)
   at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:260)
 Cheers.
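
The guard proposed in the report can be tried in isolation. The following is a small stand-alone sketch, not the BasicIndexingFilter code itself: the class and method names are hypothetical, and the maxTitleLength parameter stands in for the value read from indexer.max.title.length, with -1 taken to mean "no limit" as the report suggests.

```java
// Stand-alone sketch of the proposed guard; not the actual Nutch class.
final class TitleTruncationSketch {

  // maxTitleLength stands in for indexer.max.title.length; -1 disables truncation.
  static String truncateTitle(String title, int maxTitleLength) {
    if (maxTitleLength > -1 && title.length() > maxTitleLength) {
      return title.substring(0, maxTitleLength); // truncate title if needed
    }
    return title;
  }

  public static void main(String[] args) {
    System.out.println(truncateTitle("Apache Nutch", 6));   // limited to 6 characters
    System.out.println(truncateTitle("Apache Nutch", -1));  // returned unchanged
  }
}
```

Without the first half of the condition, a limit of -1 reaches substring(0, -1), which is the StringIndexOutOfBoundsException shown in the stack trace.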



[jira] [Commented] (NUTCH-1389) parsechecker and indexchecker to report truncated content

2013-03-27 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13615360#comment-13615360
 ] 

lufeng commented on NUTCH-1389:
---

+1 Sebastian

 parsechecker and indexchecker to report truncated content
 -

 Key: NUTCH-1389
 URL: https://issues.apache.org/jira/browse/NUTCH-1389
 Project: Nutch
  Issue Type: Improvement
  Components: indexer, parser
Affects Versions: nutchgora, 1.5
Reporter: Sebastian Nagel
Priority: Minor
 Fix For: 1.7, 2.2

 Attachments: NUTCH-1389-2x.patch, NUTCH-1389-trunk.patch


 ParserChecker and IndexingFiltersChecker should report when a document is 
 truncated due to {http,file,ftp}.content.limit.
 Truncated content may cause text and metadata extraction to fail for PDF and 
 other binary document formats.
 A hint that truncation (and not a broken plugin) is the possible reason would 
 be useful.
 See NUTCH-965 and {{ParseSegment.isTruncated(content)}}.
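
One plausible way to produce such a hint is to compare the length announced in the Content-Length header with the number of bytes actually stored. The class below is a self-contained sketch under that assumption; its class and method names are hypothetical, and it neither uses nor reproduces ParseSegment.isTruncated.

```java
import java.util.Collections;
import java.util.Map;

// Illustrative only: a stand-alone truncation check, not Nutch code.
final class TruncationHintSketch {

  // True when the Content-Length header announces more bytes than were stored.
  static boolean looksTruncated(Map<String, String> headers, byte[] content) {
    String declared = headers.get("Content-Length");
    if (declared == null) {
      return false; // nothing to compare against
    }
    try {
      return content.length < Long.parseLong(declared.trim());
    } catch (NumberFormatException e) {
      return false; // unparsable header: make no claim
    }
  }

  public static void main(String[] args) {
    Map<String, String> headers = Collections.singletonMap("Content-Length", "2048");
    byte[] stored = new byte[1024]; // e.g. cut off by a content limit
    if (looksTruncated(headers, stored)) {
      System.out.println("Content is probably truncated; parsing may fail for this reason.");
    }
  }
}
```

A checker could print a message like the one in main before running the parser, so that a failed PDF parse is not mistaken for a broken plugin.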



[jira] [Commented] (NUTCH-1545) capture batchId and remove references to segments in 2.x crawl script.

2013-03-27 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13615422#comment-13615422
 ] 

lufeng commented on NUTCH-1545:
---

Yes, the concept of crawldb is not used in 2.x, and capturing the batchId 
returned by generate is also a TODO item in the bin/crawl script. I will fix 
these later. Thanks, Lewis.

 capture batchId and remove references to segments in 2.x crawl script.
 --

 Key: NUTCH-1545
 URL: https://issues.apache.org/jira/browse/NUTCH-1545
 Project: Nutch
  Issue Type: Task
Affects Versions: 2.1
Reporter: Lewis John McGibbney
Priority: Minor
 Fix For: 2.2

 Attachments: NUTCH-1545.patch


 The concept of segment is replaced by batchId in 2.x
 I'm currently getting rid of segments references in 2.x
 This issue was flagged up and separate from NUTCH-1532 which I am working on.



[jira] [Assigned] (NUTCH-1545) capture batchId and remove references to segments in 2.x crawl script.

2013-03-27 Thread lufeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lufeng reassigned NUTCH-1545:
-

Assignee: lufeng

 capture batchId and remove references to segments in 2.x crawl script.
 --

 Key: NUTCH-1545
 URL: https://issues.apache.org/jira/browse/NUTCH-1545
 Project: Nutch
  Issue Type: Task
Affects Versions: 2.1
Reporter: Lewis John McGibbney
Assignee: lufeng
Priority: Minor
 Fix For: 2.2

 Attachments: NUTCH-1545.patch


 The concept of segment is replaced by batchId in 2.x
 I'm currently getting rid of segments references in 2.x
 This issue was flagged up and separate from NUTCH-1532 which I am working on.



[jira] [Commented] (NUTCH-1547) BasicIndexingFilter - Problem to index full title

2013-03-27 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13615434#comment-13615434
 ] 

Lewis John McGibbney commented on NUTCH-1547:
-

+1

 BasicIndexingFilter - Problem to index full title
 -

 Key: NUTCH-1547
 URL: https://issues.apache.org/jira/browse/NUTCH-1547
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.6, 2.1
Reporter: Gustavo Rauber
Assignee: lufeng
Priority: Minor
 Fix For: 1.7, 2.2

 Attachments: NUTCH-1547-2x.patch, NUTCH-1547.patch

   Original Estimate: 1h
  Remaining Estimate: 1h

 I have faced this issue when trying to index the entire title, just like the 
 content, configuring its value on nutch-default.xml to -1 
 (indexer.max.title.length). I think the behavior should be the same as the 
 content.
 If you would like to fix it, just replace the line number 90:
 if (title.length() > MAX_TITLE_LENGTH) {  // truncate title if needed
 by this one:
 if (MAX_TITLE_LENGTH > -1 && title.length() > MAX_TITLE_LENGTH) {  // truncate title if needed
 Stack Trace:
 java.lang.StringIndexOutOfBoundsException: String index out of range: -1
   at java.lang.String.substring(String.java:1937)
   at org.apache.nutch.indexer.basic.BasicIndexingFilter.filter(BasicIndexingFilter.java:91)
   at org.apache.nutch.indexer.IndexingFilters.filter(IndexingFilters.java:109)
   at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:272)
   at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:53)
   at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:519)
   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:420)
   at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:260)
 Cheers.



[Nutch Wiki] Update of FrontPage by kiranchitturi

2013-03-27 Thread Apache Wiki

The FrontPage page has been changed by kiranchitturi:
http://wiki.apache.org/nutch/FrontPage?action=diff&rev1=265&rev2=266

   * [[http://www.covert.io/post/18414889381/accumulo-nutch-and-gora|Accumulo, 
Nutch, and Gora]] - A step-by-step tutorial
  
   Other Tutorial(s) 
-  * [[http://user.qzone.qq.com/281032878/blog/1364233492|ChineseVideo 
Tutorial]] - The first free video for Nutch in China.
-  * [[http://user.qzone.qq.com/281032878/blog/1362131478|Chinese installing 
and using instruction]] - The best guidance in installing and using  Nutch in 
China.
   * [[http://hadoop.apache.org/common/docs/stable/|Hadoop Tutorial]] Nutch 
being based Hadoop, it helps to have a better understanding of Hadoop.
   * [[NutchHadoopTutorial|Nutch Hadoop Tutorial]] - How to setup and run Nutch 
in deploy mode over a Hadoop cluster. 
   * RunNutchInEclipse - How to configure, build, crawl and debug Nutch within 
Eclipse


[Nutch Wiki] Trivial Update of Kirby6738 by Kirby6738

2013-03-27 Thread Apache Wiki

The Kirby6738 page has been changed by Kirby6738:
http://wiki.apache.org/nutch/Kirby6738

New page:
Got nothing to tell about me really.
Finally a member of apache.org.
I really hope I am useful in one way here.

Feel free to surf to my blog [[http://www.realtimesync.com|real time file sync]]


[jira] [Updated] (NUTCH-1501) Harmonize behavior of parsechecker and indexchecker

2013-03-27 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1501:


Attachment: NUTCH-1501-trunk.patch
NUTCH-1501-2.x.patch

Patches for trunk and 2.x branches which make an effort to harmonize behaviour. 
The 2.x ParserChecker now reports more or less identical information to stdout, 
with the exception of ParseData (which I think it attempts to simulate with 
Metadata); there is work to be done here.
AFAIK, the recent NUTCH-1389 should address harmonization between the 2.x and trunk 
IndexChecker.
I also added some Javadoc which I hope will help the user to see what the tool 
is doing.

 Harmonize behavior of parsechecker and indexchecker
 ---

 Key: NUTCH-1501
 URL: https://issues.apache.org/jira/browse/NUTCH-1501
 Project: Nutch
  Issue Type: Improvement
  Components: indexer, parser
Reporter: Sebastian Nagel
Priority: Minor
 Fix For: 2.2

 Attachments: NUTCH-1501-2.x.patch, NUTCH-1501-trunk.patch


 Behaviour of ParserChecker and IndexingFiltersChecker has diverged between 
 trunk and 2.x
 - missing in 2.x: NUTCH-1320, NUTCH-1207
 - open issue to be also applied to 2.x: NUTCH-1419, NUTCH-1389



[jira] [Resolved] (NUTCH-1038) Port IndexingFiltersChecker to 2.0

2013-03-27 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-1038.
-

Resolution: Fixed

I would like to resolve the issue as it states that it blocks NUTCH-1501 which 
we are now working on... as this has been resolved it is not the case anymore.
[~markus17] please reopen if you are not happy. Thanks for reporting and to Seb 
for the patch :) 

 Port IndexingFiltersChecker to 2.0
 --

 Key: NUTCH-1038
 URL: https://issues.apache.org/jira/browse/NUTCH-1038
 Project: Nutch
  Issue Type: New Feature
Affects Versions: nutchgora
Reporter: Markus Jelsma
 Fix For: 2.2

 Attachments: NUTCH-1038.patch, NUTCH-1038v2.patch






[jira] [Assigned] (NUTCH-1501) Harmonize behavior of parsechecker and indexchecker

2013-03-27 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney reassigned NUTCH-1501:
---

Assignee: Lewis John McGibbney

 Harmonize behavior of parsechecker and indexchecker
 ---

 Key: NUTCH-1501
 URL: https://issues.apache.org/jira/browse/NUTCH-1501
 Project: Nutch
  Issue Type: Improvement
  Components: indexer, parser
Reporter: Sebastian Nagel
Assignee: Lewis John McGibbney
Priority: Minor
 Fix For: 2.2

 Attachments: NUTCH-1501-2.x.patch, NUTCH-1501-trunk.patch


 Behaviour of ParserChecker and IndexingFiltersChecker has diverged between 
 trunk and 2.x
 - missing in 2.x: NUTCH-1320, NUTCH-1207
 - open issue to be also applied to 2.x: NUTCH-1419, NUTCH-1389



[jira] [Updated] (NUTCH-1389) parsechecker and indexchecker to report truncated content

2013-03-27 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-1389:
---

Assignee: Sebastian Nagel

 parsechecker and indexchecker to report truncated content
 -

 Key: NUTCH-1389
 URL: https://issues.apache.org/jira/browse/NUTCH-1389
 Project: Nutch
  Issue Type: Improvement
  Components: indexer, parser
Affects Versions: nutchgora, 1.5
Reporter: Sebastian Nagel
Assignee: Sebastian Nagel
Priority: Minor
 Fix For: 1.7, 2.2

 Attachments: NUTCH-1389-2x.patch, NUTCH-1389-trunk.patch


 ParserChecker and IndexingFiltersChecker should report when a document is 
 truncated due to {http,file,ftp}.content.limit.
 Truncated content may cause text and metadata extraction to fail for PDF and 
 other binary document formats.
 A hint that truncation (and not a broken plugin) is the possible reason would 
 be useful.
 See NUTCH-965 and {{ParseSegment.isTruncated(content)}}.



[jira] [Resolved] (NUTCH-1389) parsechecker and indexchecker to report truncated content

2013-03-27 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-1389.


Resolution: Fixed

committed to trunk (r1461854) and 2.x (r1461857)

 parsechecker and indexchecker to report truncated content
 -

 Key: NUTCH-1389
 URL: https://issues.apache.org/jira/browse/NUTCH-1389
 Project: Nutch
  Issue Type: Improvement
  Components: indexer, parser
Affects Versions: nutchgora, 1.5
Reporter: Sebastian Nagel
Assignee: Sebastian Nagel
Priority: Minor
 Fix For: 1.7, 2.2

 Attachments: NUTCH-1389-2x.patch, NUTCH-1389-trunk.patch


 ParserChecker and IndexingFiltersChecker should report when a document is 
 truncated due to {http,file,ftp}.content.limit.
 Truncated content may cause text and metadata extraction to fail for PDF and 
 other binary document formats.
 A hint that truncation (and not a broken plugin) is the possible reason would 
 be useful.
 See NUTCH-965 and {{ParseSegment.isTruncated(content)}}.



[jira] [Updated] (NUTCH-1501) Harmonize behavior of parsechecker and indexchecker

2013-03-27 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-1501:
---

Attachment: NUTCH-1501-2.x-v2.patch
NUTCH-1501-trunk-v2.patch

Great, Lewis!
Attached revised patches:
- fixed 2.x patch (broken by commit of NUTCH-1389)
- merge NUTCH-1320 into 2.x
- minor changes to reduce the number of differences between both branches: 
replace System.exit by return, System.err.println by LOG.error, etc.
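
The last item above can be illustrated with a minimal sketch. This is not code
from the attached patches: the class name CheckerSketch, the run method and the
usage string are hypothetical, and java.util.logging stands in for the logger
the Nutch tools use so that the example has no dependencies. It only shows the
general pattern of replacing System.exit with a return code and
System.err.println with a logger call.

```java
import java.util.logging.Logger;

// Hypothetical stand-in for a command-line checker tool; not taken from the patches.
final class CheckerSketch {

    // Assumption: java.util.logging replaces the project's own logger to keep this self-contained.
    private static final Logger LOG = Logger.getLogger(CheckerSketch.class.getName());

    /** Returns 0 on success and -1 on a usage error; it never terminates the JVM. */
    static int run(String[] args) {
        if (args.length < 1) {
            // Before: System.err.println("Usage: ..."); System.exit(-1);
            LOG.severe("Usage: CheckerSketch <url>");
            return -1;
        }
        LOG.info("checking " + args[0]);
        return 0;
    }

    public static void main(String[] args) {
        // The caller decides what to do with the status instead of the tool exiting by itself.
        int status = run(args);
        System.out.println("status: " + status);
    }
}
```

Returning a status code instead of exiting keeps the method callable from
tests and from other tools, which is one plausible reason for making the two
branches behave the same way here.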

 Harmonize behavior of parsechecker and indexchecker
 ---

 Key: NUTCH-1501
 URL: https://issues.apache.org/jira/browse/NUTCH-1501
 Project: Nutch
  Issue Type: Improvement
  Components: indexer, parser
Reporter: Sebastian Nagel
Assignee: Lewis John McGibbney
Priority: Minor
 Fix For: 2.2

 Attachments: NUTCH-1501-2.x.patch, NUTCH-1501-2.x-v2.patch, 
 NUTCH-1501-trunk.patch, NUTCH-1501-trunk-v2.patch


 Behaviour of ParserChecker and IndexingFiltersChecker has diverged between 
 trunk and 2.x
 - missing in 2.x: NUTCH-1320, NUTCH-1207
 - open issue to be also applied to 2.x: NUTCH-1419, NUTCH-1389

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[Nutch Wiki] Trivial Update of Hassan390 by Hassan390

2013-03-27 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The Hassan390 page has been changed by Hassan390:
http://wiki.apache.org/nutch/Hassan390

New page:
There is nothing to say about myself really.
Great to be a part of this community.
I just wish to be useful in some way here.

My homepage: [[http://www.starcraft2heartoftheswarm.com/|starcraft 2 heart of the swarm]]


[Nutch Wiki] Trivial Update of Mamie5339 by Mamie5339

2013-03-27 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The Mamie5339 page has been changed by Mamie5339:
http://wiki.apache.org/nutch/Mamie5339

New page:
My name: Sheri Pendleton
Age: 35
Country: Great Britain
Home town: Stagden Cross
Post code: CM1 2YS
Street: 89 Argyll Street

Feel free to visit my homepage: [[http://www.syncback4all.com|synchronize backup software]]


[Nutch Wiki] Trivial Update of the_right_way_to_sync_files_to_a_variety_of_computers. by Mamie5339

2013-03-27 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The the_right_way_to_sync_files_to_a_variety_of_computers. page has been 
changed by Mamie5339:
http://wiki.apache.org/nutch/the_right_way_to_sync_files_to_a_variety_of_computers.

New page:
Perhaps you have an external hard drive that you use as a mirror disk for
emergency backup, and you want to make sure that the files it holds are all up
to date. One of the biggest problems that comes with owning more than one
computer is figuring out how to sync files and directories between computers.
You can of course do regular transfers by hand. This is a real pain, however,
and forgetting even once can get frustrating. There is an easy way to sync
files between the hard disk and a USB flash drive or other devices. Here is
how: SyncBack4all can sync files between two computers or between a computer
and a removable (external) device such as a thumb (flash) drive. The sync
software can automatically handle changes in drive letters (since removable
devices are often plugged in with a different drive letter) and detect
conflicts (such as when a file is deleted on one device but has been modified
on the other), allowing you to decide how to proceed manually.

If you are looking for more information regarding
[[http://www.syncback4all.com|synchronize backup software]], check out
http://www.syncback4all.com


[Nutch Wiki] Trivial Update of StephanVa by StephanVa

2013-03-27 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The StephanVa page has been changed by StephanVa:
http://wiki.apache.org/nutch/StephanVa

New page:
Wiley is the name he likes to be called, and he fully digs that name.
Dispatching has been his day job for a while and he is doing pretty well
financially. He has been in Illinois for a while. What he really likes to do
is magic, but he has been taking on new things recently.
See what's new on his website here: https://estudiantes.
gfc.edu.co/FannyHarr


[Nutch Wiki] Trivial Update of SamualJbu by SamualJbu

2013-03-27 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The SamualJbu page has been changed by SamualJbu:
http://wiki.apache.org/nutch/SamualJbu

New page:
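He writes information, flash in Australia, so far, because frequently.
Historically, Aussie people regarded sheepskin sneakers as an existing
necessity. Use of Louis Vuitton wallets, the finest sheepskin, and treating
both the fleece and the skin facets. Maintain.
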
Take a look at my site; [[http://www.gucchisaifu.com/|Click That Link]]


Important : Bunch of Spam Created under Nutch Wiki!!

2013-03-27 Thread Binoy d
I am quite surprised by the notifications I am getting for new pages
on the Nutch Wiki.
Example:
http://wiki.apache.org/nutch/KarlPuent

I see at least 25-35 emails regarding such notifications.

All of the links I got are rooted under http://wiki.apache.org/nutch/


Is someone looking into this? If needed, I can gladly forward the emails to
the person cleaning it up, as I am not sure if everyone has access to
delete the pages.

Regards,
b

-- Forwarded message --
From: Apache Wiki wikidi...@apache.org
Date: Wed, Mar 27, 2013 at 9:32 PM
Subject: [Nutch Wiki] Trivial Update of EdwinaBro by EdwinaBro
To: Apache Wiki wikidi...@apache.org


Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for
change notification.

The EdwinaBro page has been changed by EdwinaBro:
http://wiki.apache.org/nutch/EdwinaBro

New page:
I am 24 years old and my name is Edwina Brownlee. I live in Corjolens
(Switzerland).

Take a look at my web-site ... [[http://modform.org/SolomonKr|Continue]]


Re: Important : Bunch of Spam Created under Nutch Wiki!!

2013-03-27 Thread kiran chitturi
Thank you Binoy for reporting.

We have been monitoring the pages and deleting them when we get time, but
there are more coming up. Today, I saw a spam edit on the home
page of the Nutch wiki. It has inserted spam links under tutorials.

We need to find a permanent solution to this. I wonder if any other
list-servs are facing the same issue.


On Thu, Mar 28, 2013 at 12:49 AM, Binoy d binoy...@gmail.com wrote:

 I am quite surprised by the notifications I am getting for new pages
 on the Nutch Wiki.
 Example:
 http://wiki.apache.org/nutch/KarlPuent

 I see at least 25-35 emails regarding such notifications.

 All of the links I got are rooted under http://wiki.apache.org/nutch/


 Is someone looking into this? If needed, I can gladly forward the emails to
 the person cleaning it up, as I am not sure if everyone has access to
 delete the pages.

 Regards,
 b

 -- Forwarded message --
 From: Apache Wiki wikidi...@apache.org
 Date: Wed, Mar 27, 2013 at 9:32 PM
 Subject: [Nutch Wiki] Trivial Update of EdwinaBro by EdwinaBro
 To: Apache Wiki wikidi...@apache.org


 Dear Wiki user,

 You have subscribed to a wiki page or wiki category on Nutch Wiki for
 change notification.

 The EdwinaBro page has been changed by EdwinaBro:
 http://wiki.apache.org/nutch/EdwinaBro

 New page:
 I am 24 years old and my name is Edwina Brownlee. I live in Corjolens
 (Switzerland).

 Take a look at my web-site ... [[http://modform.org/SolomonKr|Continue]]




-- 
Kiran Chitturi

http://www.linkedin.com/in/kiranchitturi


Re: Important : Bunch of Spam Created under Nutch Wiki!!

2013-03-27 Thread Ken Krugler

On Mar 27, 2013, at 6:54pm, kiran chitturi wrote:

 Thank you Binoy for reporting.
 
 We have been monitoring the pages and deleting them when we get time, but
 there are more coming up. Today, I saw a spam edit on the home page
 of the Nutch wiki. It has inserted spam links under tutorials.
 
 We need to find a permanent solution to this. I wonder if any other 
 list-servs are facing the same issue.

Yes - Solr recently had to lock down editing on their wiki:

 The wiki at http://wiki.apache.org/solr/ has come under attack by spammers 
 more frequently of late, so the PMC has decided to lock it down in an attempt 
 to reduce the work involved in tracking and removing spam.
 
 From now on, only people who appear on 
 http://wiki.apache.org/solr/ContributorsGroup will be able to 
 create/modify/delete wiki pages.
 
 Please request either on the solr-u...@lucene.apache.org or on 
 d...@lucene.apache.org to have your wiki username added to the 
 ContributorsGroup page - this is a one-time step.

So I think you need to make a request to Infra to lock down the wiki, then add 
people (generally in response to explicit requests) to the ContributorsGroup 
page.

-- Ken


 
 
 On Thu, Mar 28, 2013 at 12:49 AM, Binoy d binoy...@gmail.com wrote:
 I am quite surprised by the notifications I am getting for new pages
 on the Nutch Wiki.
 Example:
 http://wiki.apache.org/nutch/KarlPuent
 
 I see at least 25-35 emails regarding such notifications.
 
 All of the links I got are rooted under http://wiki.apache.org/nutch/
 
 
 Is someone looking into this? If needed, I can gladly forward the emails to
 the person cleaning it up, as I am not sure if everyone has access to delete
 the pages.
 
 Regards,
 b
 
 -- Forwarded message --
 From: Apache Wiki wikidi...@apache.org
 Date: Wed, Mar 27, 2013 at 9:32 PM
 Subject: [Nutch Wiki] Trivial Update of EdwinaBro by EdwinaBro
 To: Apache Wiki wikidi...@apache.org
 
 
 Dear Wiki user,
 
 You have subscribed to a wiki page or wiki category on Nutch Wiki for 
 change notification.
 
 The EdwinaBro page has been changed by EdwinaBro:
 http://wiki.apache.org/nutch/EdwinaBro
 
 New page:
 I am 24 years old and my name is Edwina Brownlee. I live in Corjolens
 (Switzerland).
 
 Take a look at my web-site ... [[http://modform.org/SolomonKr|Continue]]
 
 
 
 
 -- 
 Kiran Chitturi
 
 
 
 

--
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr