[Nutch Wiki] Update of FrontPage by ysc
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification.

The FrontPage page has been changed by ysc:
http://wiki.apache.org/nutch/FrontPage?action=diff&rev1=260&rev2=261

Comment:
add some video resources

  * [[http://www.covert.io/post/18414889381/accumulo-nutch-and-gora|Accumulo, Nutch, and Gora]] - A step-by-step tutorial
  Other Tutorial(s)
+ * [[http://user.qzone.qq.com/281032878/blog/1364233492|Chinese Video Tutorial: Nutch Relevant Framework]] - The first free video for Nutch in China.
+ * [[http://user.qzone.qq.com/281032878/blog/1362131478|Chinese Setting Up And Use Tutorial]] - The best guide of how to setting up and use nutch relevant framework in China.
+ * [[http://hadoop.apache.org/common/docs/stable/|Hadoop Tutorial]] Nutch being based Hadoop, it helps to have a better understanding of Hadoop.
  * [[NutchHadoopTutorial|Nutch Hadoop Tutorial]] - How to setup and run Nutch in deploy mode over a Hadoop cluster.
  * RunNutchInEclipse - How to configure, build, crawl and debug Nutch within Eclipse
  * [[IntranetDocumentSearch|Intranet Document Search]] - Index and search Microsoft Office, PDF etc. documents in a file system hierarchy with a Solr backend.
  * [[http://pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/|Recrawling with Nutch]] - How to re-crawl with Nutch.
  * [[https://github.com/evolvingweb/ajax-solr/wiki/Tutorial%3A-Nutch|Ajax-Solr Tutorial: Nutch]] - Quick and easy guide to getting a nice UI on top of your Nutch crawl data.
+ === Configuration ===
  * OverviewDeploymentConfigs /!\ :This full page requires a complete update to reflect recent Nutch releases: /!\
[jira] [Commented] (NUTCH-1538) tuning of loaded fields during fetcherJob start-up
[ https://issues.apache.org/jira/browse/NUTCH-1538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13615063#comment-13615063 ]

Roland von Herget commented on NUTCH-1538:
------------------------------------------

Hi lufeng,

after reading a bit more of the Nutch code, the question arises whether it is really necessary to load any of the ParserJob.FIELDS. Shouldn't the fetcher set up all fields (all of fit.page) that the parser needs during the fetch?
I think I will give this a try here.

tuning of loaded fields during fetcherJob start-up
--------------------------------------------------

Key: NUTCH-1538
URL: https://issues.apache.org/jira/browse/NUTCH-1538
Project: Nutch
Issue Type: Improvement
Components: fetcher
Affects Versions: 2.1
Environment: nutch 2.1 / cassandra 1.2.1 / gora-cassandra 0.2 / gora-core 0.2.1, running fetch with parse=true
Reporter: Roland von Herget

The main problem is that Nutch loads nearly every row column from the DB during start-up of a fetcherJob when fetcher.parse=true.
A parserJob needs, for example, the CONTENT field from the DB in order to parse. The fetcherJob adds all fields of the parserJob to its own needed fields when running with fetcher.parse=true [FetcherJob.getFields()].
If the Nutch configuration saves all fetched data to the DB (fetcher.store.content=true), you end up loading GBs of unused content during fetcherJob start-up.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
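The field-loading behaviour discussed in NUTCH-1538 can be illustrated with a small, self-contained sketch. It does not use Nutch classes: the `PageField` enum, the two field sets and both method names are hypothetical stand-ins chosen for illustration, and the actual contents of `FetcherJob.getFields()` and `ParserJob.FIELDS` may differ. The sketch only shows why taking the union of fetcher and parser fields pulls in the large CONTENT column, and what the idea floated in the comment would change.

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical stand-in for stored page columns; not Nutch's real field enum. */
enum PageField { STATUS, FETCH_TIME, BASE_URL, CONTENT, CONTENT_TYPE }

class FetchFieldsSketch {
    /** Columns the fetcher itself reads at start-up (illustrative assumption). */
    static final EnumSet<PageField> FETCHER_FIELDS =
        EnumSet.of(PageField.STATUS, PageField.FETCH_TIME, PageField.BASE_URL);

    /** Columns a stand-alone parse job reads, including the potentially huge CONTENT. */
    static final EnumSet<PageField> PARSER_FIELDS =
        EnumSet.of(PageField.STATUS, PageField.BASE_URL, PageField.CONTENT, PageField.CONTENT_TYPE);

    /** Behaviour described in the issue: parsing while fetching adds every parser column. */
    static Set<PageField> currentFields(boolean parseWhileFetching) {
        EnumSet<PageField> fields = EnumSet.copyOf(FETCHER_FIELDS);
        if (parseWhileFetching) {
            fields.addAll(PARSER_FIELDS);
        }
        return fields;
    }

    /** Idea from the comment: the fetch itself produces the parser's inputs, so load none of them. */
    static Set<PageField> proposedFields() {
        return EnumSet.copyOf(FETCHER_FIELDS);
    }
}
```

Under these assumptions, `currentFields(true)` contains CONTENT while `proposedFields()` does not, which is the difference that would avoid reading stored content at start-up.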
[Nutch Wiki] Update of FrontPage by ysc
The FrontPage page has been changed by ysc:
http://wiki.apache.org/nutch/FrontPage?action=diff&rev1=261&rev2=262

  Other Tutorial(s)
  * [[http://user.qzone.qq.com/281032878/blog/1364233492|Chinese Video Tutorial: Nutch Relevant Framework]] - The first free video for Nutch in China.
- * [[http://user.qzone.qq.com/281032878/blog/1362131478|Chinese Setting Up And Use Tutorial]] - The best guide of how to setting up and use nutch relevant framework in China.
+ * [[http://user.qzone.qq.com/281032878/blog/1362131478|Chinese Setting Up And Use Tutorial: Nutch Relevant Framework]] - The best guide of how to setting up and use nutch relevant framework in China.
  * [[http://hadoop.apache.org/common/docs/stable/|Hadoop Tutorial]] Nutch being based Hadoop, it helps to have a better understanding of Hadoop.
  * [[NutchHadoopTutorial|Nutch Hadoop Tutorial]] - How to setup and run Nutch in deploy mode over a Hadoop cluster.
[Nutch Wiki] Update of FrontPage by ysc
The FrontPage page has been changed by ysc:
http://wiki.apache.org/nutch/FrontPage?action=diff&rev1=262&rev2=263

  * [[http://www.covert.io/post/18414889381/accumulo-nutch-and-gora|Accumulo, Nutch, and Gora]] - A step-by-step tutorial
  Other Tutorial(s)
- * [[http://user.qzone.qq.com/281032878/blog/1364233492|Chinese Video Tutorial: Nutch Relevant Framework]] - The first free video for Nutch in China.
+ * [[http://user.qzone.qq.com/281032878/blog/1364233492|Chinese video tutorial]] - The first free video for Nutch in China.
+ * [[http://user.qzone.qq.com/281032878/blog/1362131478|Chinese installing and using instruction]] - The best guidance in installing and using Nutch in China.
- * [[http://user.qzone.qq.com/281032878/blog/1362131478|Chinese Setting Up And Use Tutorial: Nutch Relevant Framework]] - The best guide of how to setting up and use nutch relevant framework in China.
- * [[http://hadoop.apache.org/common/docs/stable/|Hadoop Tutorial]] Nutch being based Hadoop, it helps to have a better understanding of Hadoop.
  * [[NutchHadoopTutorial|Nutch Hadoop Tutorial]] - How to setup and run Nutch in deploy mode over a Hadoop cluster.
  * RunNutchInEclipse - How to configure, build, crawl and debug Nutch within Eclipse
[Nutch Wiki] Update of FrontPage by ysc
The FrontPage page has been changed by ysc:
http://wiki.apache.org/nutch/FrontPage?action=diff&rev1=263&rev2=264

  * [[http://www.covert.io/post/18414889381/accumulo-nutch-and-gora|Accumulo, Nutch, and Gora]] - A step-by-step tutorial
  Other Tutorial(s)
- * [[http://user.qzone.qq.com/281032878/blog/1364233492|Chinese video tutorial]] - The first free video for Nutch in China.
+ * [[http://user.qzone.qq.com/281032878/blog/1364233492|Chinese Video Tutorial]] - The first free video for Nutch in China.
- * [[http://user.qzone.qq.com/281032878/blog/1362131478|Chinese installing and using instruction]] - The best guidance in installing and using Nutch in China.
+ * [[http://user.qzone.qq.com/281032878/blog/1362131478|Chinese Setting up and Using Instruction]] - The best guidance in setting up and using Nutch in China.
  * [[http://hadoop.apache.org/common/docs/stable/|Hadoop Tutorial]] Nutch being based Hadoop, it helps to have a better understanding of Hadoop.
  * [[NutchHadoopTutorial|Nutch Hadoop Tutorial]] - How to setup and run Nutch in deploy mode over a Hadoop cluster.
  * RunNutchInEclipse - How to configure, build, crawl and debug Nutch within Eclipse
[Nutch Wiki] Update of FrontPage by ysc
The FrontPage page has been changed by ysc:
http://wiki.apache.org/nutch/FrontPage?action=diff&rev1=264&rev2=265

  * [[http://www.covert.io/post/18414889381/accumulo-nutch-and-gora|Accumulo, Nutch, and Gora]] - A step-by-step tutorial
  Other Tutorial(s)
- * [[http://user.qzone.qq.com/281032878/blog/1364233492|Chinese Video Tutorial]] - The first free video for Nutch in China.
+ * [[http://user.qzone.qq.com/281032878/blog/1364233492|ChineseVideo Tutorial]] - The first free video for Nutch in China.
- * [[http://user.qzone.qq.com/281032878/blog/1362131478|Chinese Setting up and Using Instruction]] - The best guidance in setting up and using Nutch in China.
+ * [[http://user.qzone.qq.com/281032878/blog/1362131478|Chinese installing and using instruction]] - The best guidance in installing and using Nutch in China.
  * [[http://hadoop.apache.org/common/docs/stable/|Hadoop Tutorial]] Nutch being based Hadoop, it helps to have a better understanding of Hadoop.
  * [[NutchHadoopTutorial|Nutch Hadoop Tutorial]] - How to setup and run Nutch in deploy mode over a Hadoop cluster.
  * RunNutchInEclipse - How to configure, build, crawl and debug Nutch within Eclipse
[jira] [Updated] (NUTCH-1547) BasicIndexingFilter - Problem to index full title
[ https://issues.apache.org/jira/browse/NUTCH-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

lufeng updated NUTCH-1547:
--------------------------

Attachment: NUTCH-1547-2x.patch

Added a patch for Nutch 2.x.

BasicIndexingFilter - Problem to index full title
-------------------------------------------------

Key: NUTCH-1547
URL: https://issues.apache.org/jira/browse/NUTCH-1547
Project: Nutch
Issue Type: Bug
Components: indexer
Affects Versions: 1.6, 2.1
Reporter: Gustavo Rauber
Assignee: lufeng
Priority: Minor
Fix For: 1.7, 2.2
Attachments: NUTCH-1547-2x.patch, NUTCH-1547.patch
Original Estimate: 1h
Remaining Estimate: 1h

I ran into this issue when trying to index the entire title, as is possible for the content, by setting indexer.max.title.length to -1 in nutch-default.xml. I think the behavior should be the same as for the content.

If you would like to fix it, just replace line number 90:

    if (title.length() > MAX_TITLE_LENGTH) { // truncate title if needed

by this one:

    if (MAX_TITLE_LENGTH > -1 && title.length() > MAX_TITLE_LENGTH) { // truncate title if needed

Stack Trace:

java.lang.StringIndexOutOfBoundsException: String index out of range: -1
    at java.lang.String.substring(String.java:1937)
    at org.apache.nutch.indexer.basic.BasicIndexingFilter.filter(BasicIndexingFilter.java:91)
    at org.apache.nutch.indexer.IndexingFilters.filter(IndexingFilters.java:109)
    at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:272)
    at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:53)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:519)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:420)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:260)

Cheers.
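The guard proposed in the NUTCH-1547 report can be tried in isolation. The following is a minimal sketch, not the code of BasicIndexingFilter or of the attached patches: the class name `TitleTruncationSketch` and the method `truncateTitle` are hypothetical, and the limit is passed as a parameter rather than read from indexer.max.title.length. It shows that treating a negative limit as "no limit" avoids calling `substring` with -1, which is what the stack trace in the report points at.

```java
class TitleTruncationSketch {
    /**
     * Returns the title cut to at most maxTitleLength characters.
     * A negative limit (for example -1) is treated as "do not truncate".
     */
    static String truncateTitle(String title, int maxTitleLength) {
        // Without the "> -1" check, a limit of -1 would reach substring(0, -1)
        // and throw StringIndexOutOfBoundsException.
        if (maxTitleLength > -1 && title.length() > maxTitleLength) {
            return title.substring(0, maxTitleLength);
        }
        return title;
    }
}
```

With this sketch, `truncateTitle("Apache Nutch", 6)` returns "Apache", while `truncateTitle("Apache Nutch", -1)` returns the title unchanged.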
[jira] [Commented] (NUTCH-1389) parsechecker and indexchecker to report truncated content
[ https://issues.apache.org/jira/browse/NUTCH-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13615360#comment-13615360 ]

lufeng commented on NUTCH-1389:
-------------------------------

+1 Sebastian

parsechecker and indexchecker to report truncated content
---------------------------------------------------------

Key: NUTCH-1389
URL: https://issues.apache.org/jira/browse/NUTCH-1389
Project: Nutch
Issue Type: Improvement
Components: indexer, parser
Affects Versions: nutchgora, 1.5
Reporter: Sebastian Nagel
Priority: Minor
Fix For: 1.7, 2.2
Attachments: NUTCH-1389-2x.patch, NUTCH-1389-trunk.patch

ParserChecker and IndexingFiltersChecker should report when a document is truncated due to {http,file,ftp}.content.limit. Truncated content may cause text and metadata extraction to fail for PDF and other binary document formats. A hint that truncation (and not a broken plugin) is the possible reason would be useful. See NUTCH-965 and {{ParseSegment.isTruncated(content)}}.
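One plausible way to detect the truncation described in NUTCH-1389 is to compare the length declared by the server with the number of bytes actually stored. The sketch below assumes exactly that heuristic; it is not taken from `ParseSegment.isTruncated(content)`, whose implementation may differ, and the class and method names are hypothetical. It also ignores complications such as compressed transfer encodings.

```java
class TruncationCheckSketch {
    /**
     * Returns true when the declared Content-Length is larger than the number
     * of bytes stored. A missing or unparsable header gives no evidence either
     * way, so the method returns false in those cases.
     */
    static boolean looksTruncated(String contentLengthHeader, byte[] content) {
        if (contentLengthHeader == null) {
            return false;
        }
        try {
            long declared = Long.parseLong(contentLengthHeader.trim());
            return declared > content.length;
        } catch (NumberFormatException e) {
            return false;
        }
    }
}
```

A checker tool could print a hint whenever `looksTruncated` returns true, so that a failed PDF parse is attributed to a content limit rather than to a broken plugin.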
[jira] [Commented] (NUTCH-1545) capture batchId and remove references to segments in 2.x crawl script.
[ https://issues.apache.org/jira/browse/NUTCH-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13615422#comment-13615422 ]

lufeng commented on NUTCH-1545:
-------------------------------

Yes, the concept of a crawldb is not used in 2.x, and capturing the batchId returned by generate is also a TODO in the bin/crawl script. I will fix these later. Thanks, Lewis.

capture batchId and remove references to segments in 2.x crawl script.
----------------------------------------------------------------------

Key: NUTCH-1545
URL: https://issues.apache.org/jira/browse/NUTCH-1545
Project: Nutch
Issue Type: Task
Affects Versions: 2.1
Reporter: Lewis John McGibbney
Priority: Minor
Fix For: 2.2
Attachments: NUTCH-1545.patch

The concept of segment is replaced by batchId in 2.x.
I'm currently getting rid of segment references in 2.x.
This issue was flagged up and is separate from NUTCH-1532, which I am working on.
[jira] [Assigned] (NUTCH-1545) capture batchId and remove references to segments in 2.x crawl script.
[ https://issues.apache.org/jira/browse/NUTCH-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

lufeng reassigned NUTCH-1545:
-----------------------------

Assignee: lufeng

capture batchId and remove references to segments in 2.x crawl script.
----------------------------------------------------------------------

Key: NUTCH-1545
URL: https://issues.apache.org/jira/browse/NUTCH-1545
Project: Nutch
Issue Type: Task
Affects Versions: 2.1
Reporter: Lewis John McGibbney
Assignee: lufeng
Priority: Minor
Fix For: 2.2
Attachments: NUTCH-1545.patch

The concept of segment is replaced by batchId in 2.x
I'm currently getting rid of segments references in 2.x
This issue was flagged up and separate from NUTCH-1532 which I am working on.
[jira] [Commented] (NUTCH-1547) BasicIndexingFilter - Problem to index full title
[ https://issues.apache.org/jira/browse/NUTCH-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13615434#comment-13615434 ]

Lewis John McGibbney commented on NUTCH-1547:
---------------------------------------------

+1

BasicIndexingFilter - Problem to index full title
-------------------------------------------------

Key: NUTCH-1547
URL: https://issues.apache.org/jira/browse/NUTCH-1547
Project: Nutch
Issue Type: Bug
Components: indexer
Affects Versions: 1.6, 2.1
Reporter: Gustavo Rauber
Assignee: lufeng
Priority: Minor
Fix For: 1.7, 2.2
Attachments: NUTCH-1547-2x.patch, NUTCH-1547.patch
Original Estimate: 1h
Remaining Estimate: 1h

I ran into this issue when trying to index the entire title, as is possible for the content, by setting indexer.max.title.length to -1 in nutch-default.xml. I think the behavior should be the same as for the content.

If you would like to fix it, just replace line number 90:

    if (title.length() > MAX_TITLE_LENGTH) { // truncate title if needed

by this one:

    if (MAX_TITLE_LENGTH > -1 && title.length() > MAX_TITLE_LENGTH) { // truncate title if needed

Stack Trace:

java.lang.StringIndexOutOfBoundsException: String index out of range: -1
    at java.lang.String.substring(String.java:1937)
    at org.apache.nutch.indexer.basic.BasicIndexingFilter.filter(BasicIndexingFilter.java:91)
    at org.apache.nutch.indexer.IndexingFilters.filter(IndexingFilters.java:109)
    at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:272)
    at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:53)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:519)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:420)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:260)

Cheers.
[Nutch Wiki] Update of FrontPage by kiranchitturi
The FrontPage page has been changed by kiranchitturi:
http://wiki.apache.org/nutch/FrontPage?action=diff&rev1=265&rev2=266

  * [[http://www.covert.io/post/18414889381/accumulo-nutch-and-gora|Accumulo, Nutch, and Gora]] - A step-by-step tutorial
  Other Tutorial(s)
- * [[http://user.qzone.qq.com/281032878/blog/1364233492|ChineseVideo Tutorial]] - The first free video for Nutch in China.
- * [[http://user.qzone.qq.com/281032878/blog/1362131478|Chinese installing and using instruction]] - The best guidance in installing and using Nutch in China.
  * [[http://hadoop.apache.org/common/docs/stable/|Hadoop Tutorial]] Nutch being based Hadoop, it helps to have a better understanding of Hadoop.
  * [[NutchHadoopTutorial|Nutch Hadoop Tutorial]] - How to setup and run Nutch in deploy mode over a Hadoop cluster.
  * RunNutchInEclipse - How to configure, build, crawl and debug Nutch within Eclipse
[jira] [Updated] (NUTCH-1501) Harmonize behavior of parsechecker and indexchecker
[ https://issues.apache.org/jira/browse/NUTCH-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-1501:
----------------------------------------

Attachment: NUTCH-1501-trunk.patch
            NUTCH-1501-2.x.patch

Patches for the trunk and 2.x branches which make an effort to harmonize behaviour. The 2.x ParserChecker now reports more or less identical information to stdout, with the exception of ParseData (which I think it attempts to simulate with Metadata); there is work to be done here. AFAIK, the recent NUTCH-1389 should address harmonization between the 2.x and trunk IndexChecker. I also added some Javadoc which I hope will help the user to see what the tool is doing.

Harmonize behavior of parsechecker and indexchecker
---------------------------------------------------

Key: NUTCH-1501
URL: https://issues.apache.org/jira/browse/NUTCH-1501
Project: Nutch
Issue Type: Improvement
Components: indexer, parser
Reporter: Sebastian Nagel
Priority: Minor
Fix For: 2.2
Attachments: NUTCH-1501-2.x.patch, NUTCH-1501-trunk.patch

Behaviour of ParserChecker and IndexingFiltersChecker has diverged between trunk and 2.x
- missing in 2.x: NUTCH-1320, NUTCH-1207
- open issue to be also applied to 2.x: NUTCH-1419, NUTCH-1389
[jira] [Resolved] (NUTCH-1038) Port IndexingFiltersChecker to 2.0
[ https://issues.apache.org/jira/browse/NUTCH-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney resolved NUTCH-1038.
-----------------------------------------

Resolution: Fixed

I would like to resolve this issue, as it is marked as blocking NUTCH-1501, which we are now working on. Since this has been resolved, that is no longer the case.
[~markus17] please reopen if you are not happy.
Thanks for reporting, and thanks to Seb for the patch :)

Port IndexingFiltersChecker to 2.0
----------------------------------

Key: NUTCH-1038
URL: https://issues.apache.org/jira/browse/NUTCH-1038
Project: Nutch
Issue Type: New Feature
Affects Versions: nutchgora
Reporter: Markus Jelsma
Fix For: 2.2
Attachments: NUTCH-1038.patch, NUTCH-1038v2.patch
[jira] [Assigned] (NUTCH-1501) Harmonize behavior of parsechecker and indexchecker
[ https://issues.apache.org/jira/browse/NUTCH-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney reassigned NUTCH-1501:
-------------------------------------------

Assignee: Lewis John McGibbney

Harmonize behavior of parsechecker and indexchecker
---------------------------------------------------

Key: NUTCH-1501
URL: https://issues.apache.org/jira/browse/NUTCH-1501
Project: Nutch
Issue Type: Improvement
Components: indexer, parser
Reporter: Sebastian Nagel
Assignee: Lewis John McGibbney
Priority: Minor
Fix For: 2.2
Attachments: NUTCH-1501-2.x.patch, NUTCH-1501-trunk.patch

Behaviour of ParserChecker and IndexingFiltersChecker has diverged between trunk and 2.x
- missing in 2.x: NUTCH-1320, NUTCH-1207
- open issue to be also applied to 2.x: NUTCH-1419, NUTCH-1389
[jira] [Updated] (NUTCH-1389) parsechecker and indexchecker to report truncated content
[ https://issues.apache.org/jira/browse/NUTCH-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-1389:
-----------------------------------

Assignee: Sebastian Nagel

parsechecker and indexchecker to report truncated content
---------------------------------------------------------

Key: NUTCH-1389
URL: https://issues.apache.org/jira/browse/NUTCH-1389
Project: Nutch
Issue Type: Improvement
Components: indexer, parser
Affects Versions: nutchgora, 1.5
Reporter: Sebastian Nagel
Assignee: Sebastian Nagel
Priority: Minor
Fix For: 1.7, 2.2
Attachments: NUTCH-1389-2x.patch, NUTCH-1389-trunk.patch

ParserChecker and IndexingFiltersChecker should report when a document is truncated due to {http,file,ftp}.content.limit. Truncated content may cause text and metadata extraction to fail for PDF and other binary document formats. A hint that truncation (and not a broken plugin) is the possible reason would be useful. See NUTCH-965 and {{ParseSegment.isTruncated(content)}}.
[jira] [Resolved] (NUTCH-1389) parsechecker and indexchecker to report truncated content
[ https://issues.apache.org/jira/browse/NUTCH-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel resolved NUTCH-1389.
------------------------------------

Resolution: Fixed

committed to trunk (r1461854) and 2.x (r1461857)

parsechecker and indexchecker to report truncated content
---------------------------------------------------------

Key: NUTCH-1389
URL: https://issues.apache.org/jira/browse/NUTCH-1389
Project: Nutch
Issue Type: Improvement
Components: indexer, parser
Affects Versions: nutchgora, 1.5
Reporter: Sebastian Nagel
Assignee: Sebastian Nagel
Priority: Minor
Fix For: 1.7, 2.2
Attachments: NUTCH-1389-2x.patch, NUTCH-1389-trunk.patch

ParserChecker and IndexingFiltersChecker should report when a document is truncated due to {http,file,ftp}.content.limit. Truncated content may cause text and metadata extraction to fail for PDF and other binary document formats. A hint that truncation (and not a broken plugin) is the possible reason would be useful. See NUTCH-965 and {{ParseSegment.isTruncated(content)}}.
[jira] [Updated] (NUTCH-1501) Harmonize behavior of parsechecker and indexchecker
[ https://issues.apache.org/jira/browse/NUTCH-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-1501:
-----------------------------------

Attachment: NUTCH-1501-2.x-v2.patch
            NUTCH-1501-trunk-v2.patch

Great, Lewis!
Attached revised patches:
- fixed 2.x patch (broken by commit of NUTCH-1389)
- merge NUTCH-1320 into 2.x
- minor changes to reduce the number of differences between both branches: replace System.exit by return, System.err.println by LOG.error, etc.

Harmonize behavior of parsechecker and indexchecker
---------------------------------------------------

Key: NUTCH-1501
URL: https://issues.apache.org/jira/browse/NUTCH-1501
Project: Nutch
Issue Type: Improvement
Components: indexer, parser
Reporter: Sebastian Nagel
Assignee: Lewis John McGibbney
Priority: Minor
Fix For: 2.2
Attachments: NUTCH-1501-2.x.patch, NUTCH-1501-2.x-v2.patch, NUTCH-1501-trunk.patch, NUTCH-1501-trunk-v2.patch

Behaviour of ParserChecker and IndexingFiltersChecker has diverged between trunk and 2.x
- missing in 2.x: NUTCH-1320, NUTCH-1207
- open issue to be also applied to 2.x: NUTCH-1419, NUTCH-1389
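The clean-up mentioned in the NUTCH-1501 comment (returning a status instead of calling System.exit, and logging instead of printing to System.err) follows a common pattern for command-line tools. The sketch below illustrates that pattern only; it is not the patched ParserChecker. The class name is hypothetical, and `java.util.logging` is swapped in so the example has no dependencies, which means `LOG.severe` stands in for the `LOG.error` call named in the comment.

```java
import java.util.logging.Logger;

class CheckerToolSketch {
    private static final Logger LOG = Logger.getLogger(CheckerToolSketch.class.getName());

    /**
     * Runs the tool and returns an exit status instead of terminating the JVM,
     * so that callers and unit tests can inspect the result.
     */
    static int run(String[] args) {
        if (args.length < 1) {
            // Logged rather than written to System.err.
            LOG.severe("Usage: CheckerToolSketch <url>");
            return -1;
        }
        LOG.info("checking " + args[0]);
        return 0;
    }

    // A main method would be the only place that exits the JVM, for example:
    //   public static void main(String[] args) { System.exit(run(args)); }
}
```

Keeping the exit call at the outermost entry point makes the same `run` method usable from tests or from another tool without ending the process.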
[Nutch Wiki] Trivial Update of Hassan390 by Hassan390
The Hassan390 page has been changed by Hassan390: http://wiki.apache.org/nutch/Hassan390 New page: There is nothing to say about myself really. Great to be a part of this community. I just hope I can be useful in some way here. My homepage: [[http://www.starcraft2heartoftheswarm.com/|starcraft 2 heart of the swarm]]
[Nutch Wiki] Trivial Update of Mamie5339 by Mamie5339
The Mamie5339 page has been changed by Mamie5339: http://wiki.apache.org/nutch/Mamie5339 New page: My name: Sheri Pendleton; Age: 35; Country: Great Britain; Home town: Stagden Cross; Post code: CM1 2YS; Street: 89 Argyll Street. Feel free to visit my homepage: [[http://www.syncback4all.com|synchronize backup software]]
[Nutch Wiki] Trivial Update of the_right_way_to_sync_files_to_a_variety_of_computers. by Mamie5339
The the_right_way_to_sync_files_to_a_variety_of_computers. page has been changed by Mamie5339: http://wiki.apache.org/nutch/the_right_way_to_sync_files_to_a_variety_of_computers. New page: Perhaps you have an external hard drive that you use as a mirror disk for emergency backup, and you want to make sure that the files it holds are all up to date. One of the biggest problems that comes with owning more than one computer is figuring out how to sync files and directories between them. You can of course do regular manual transfers. This is a real pain, however, and forgetting even once can be frustrating. There is an easy way to sync files between a hard disk and a USB flash drive or other devices. Here is how: SyncBack4all can sync files between two computers or between a computer and a removable (external) device such as a thumb (flash) drive. The sync software can automatically handle changes in drive letters (since removable devices are often plugged in with a different drive letter) and detect conflicts (for example, when a file is deleted on one device but has been modified on the other), allowing you to decide manually how to proceed. If you are looking for more information regarding [[http://www.syncback4all.com|synchronize backup software]], check out http://www.syncback4all.com
[Nutch Wiki] Trivial Update of StephanVa by StephanVa
The StephanVa page has been changed by StephanVa: http://wiki.apache.org/nutch/StephanVa New page: Wiley is the name he likes to be called, and he really digs that name. Dispatching has been his day job for a while and he is doing pretty well financially. For a while he has been in Illinois. What he really likes to do is magic, but he has been taking on new things lately. See what's new on his website here: https://estudiantes.gfc.edu.co/FannyHarr
[Nutch Wiki] Trivial Update of SamualJbu by SamualJbu
The SamualJbu page has been changed by SamualJbu: http://wiki.apache.org/nutch/SamualJbu New page: He has been writing news flashes about Australia very frequently so far. Historically, Aussie people have regarded sheepskin sneakers as a necessity. Louis Vuitton wallets use the finest sheepskin and treat both the fleece and the skin side. Maintain. Take a look at my site: [[http://www.gucchisaifu.com/|Click That Link]]
Important : Bunch of Spam Created under Nutch Wiki!!
I am quite surprised looking at the notifications I am getting for new pages on the Nutch Wiki. Example: http://wiki.apache.org/nutch/KarlPuent I see at least 25-35 emails regarding such notifications. All of the links I got are rooted under http://wiki.apache.org/nutch/ Is someone looking into this? If needed, I can gladly forward the emails to the person cleaning it up, as I am not sure if everyone has access to delete the pages. Regards, b -- Forwarded message -- From: Apache Wiki wikidi...@apache.org Date: Wed, Mar 27, 2013 at 9:32 PM Subject: [Nutch Wiki] Trivial Update of EdwinaBro by EdwinaBro To: Apache Wiki wikidi...@apache.org Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The EdwinaBro page has been changed by EdwinaBro: http://wiki.apache.org/nutch/EdwinaBro New page: I am 24 years old and my name is Edwina Brownlee. I live in Corjolens (Switzerland). Take a look at my web-site ... [[http://modform.org/SolomonKr|Continue]]
Re: Important : Bunch of Spam Created under Nutch Wiki!!
Thank you, Binoy, for reporting. We have been monitoring the pages and deleting them when we get time, but more keep coming up. Today I saw a spam edit on the home page of the Nutch wiki; it inserted spam links under the tutorials. We need to find a permanent solution to this. I wonder if any other list-servs are facing the same issue. On Thu, Mar 28, 2013 at 12:49 AM, Binoy d binoy...@gmail.com wrote: I am quite surprised looking at the notifications I am getting for new pages on the Nutch Wiki. Example: http://wiki.apache.org/nutch/KarlPuent I see at least 25-35 emails regarding such notifications. All of the links I got are rooted under http://wiki.apache.org/nutch/ Is someone looking into this? If needed, I can gladly forward the emails to the person cleaning it up, as I am not sure if everyone has access to delete the pages. Regards, b -- Kiran Chitturi http://www.linkedin.com/in/kiranchitturi
Re: Important : Bunch of Spam Created under Nutch Wiki!!
On Mar 27, 2013, at 6:54pm, kiran chitturi wrote: Thank you, Binoy, for reporting. We have been monitoring the pages and deleting them when we get time, but more keep coming up. Today I saw a spam edit on the home page of the Nutch wiki; it inserted spam links under the tutorials. We need to find a permanent solution to this. I wonder if any other list-servs are facing the same issue. Yes - Solr recently had to lock down editing on their wiki: The wiki at http://wiki.apache.org/solr/ has come under attack by spammers more frequently of late, so the PMC has decided to lock it down in an attempt to reduce the work involved in tracking and removing spam. From now on, only people who appear on http://wiki.apache.org/solr/ContributorsGroup will be able to create/modify/delete wiki pages. Please request either on the solr-u...@lucene.apache.org or on d...@lucene.apache.org to have your wiki username added to the ContributorsGroup page - this is a one-time step. So I think you need to make a request to Infra to lock down the wiki, then add people (generally in response to explicit requests) to the ContributorsGroup page. -- Ken -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr