Re: [Nutch-dev] Patch Available status?

2006-08-30 Thread Uroš Gruber
Chris Mattmann wrote: > Hi Doug and Andrzej, > > +1. I think that workflow makes a lot of sense. Currently users in the > nutch-developers group can close and resolve issues. In the Hadoop workflow, > would this continue to be the case? > > +1 Regards, Uros > Cheers, > Chris > > > > On 8/30

Re: [Nutch-dev] Patch Available status?

2006-08-30 Thread Chris Mattmann
Hi Doug and Andrzej, +1. I think that workflow makes a lot of sense. Currently users in the nutch-developers group can close and resolve issues. In the Hadoop workflow, would this continue to be the case? Cheers, Chris On 8/30/06 3:14 PM, "Andrzej Bialecki" <[EMAIL PROTECTED]> wrote: > Do

[Nutch-dev] [jira] Closed: (NUTCH-143) Improper error numbers returned on exit

2006-08-30 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-143?page=all ] Andrzej Bialecki closed NUTCH-143. --- Resolution: Fixed Fixed in rev. 438670, with modifications. > Improper error numbers returned on exit > --- > >

[Nutch-dev] [jira] Closed: (NUTCH-242) Add optional -urlFiltering to updatedb

2006-08-30 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-242?page=all ] Andrzej Bialecki closed NUTCH-242. --- Resolution: Fixed Fixed in rev. 438670, with modifications. > Add optional -urlFiltering to updatedb > -- > >

Re: [Nutch-dev] Patch Available status?

2006-08-30 Thread Andrzej Bialecki
Doug Cutting wrote: > Sami Siren wrote: >> I am not able to do it either, or then I just don't know how, can >> Doug help us here? > > This requires a change to the project's workflow. I'd be happy to > move Nutch to use the workflow we use for Hadoop, which supports > "Patch Available". > > T

Re: [Nutch-dev] Patch Available status?

2006-08-30 Thread Doug Cutting
Sami Siren wrote: > I am not able to do it either, or then I just don't know how, can Doug > help us here? This requires a change to the project's workflow. I'd be happy to move Nutch to use the workflow we use for Hadoop, which supports "Patch Available". This workflow has one other non-def

[Nutch-dev] fetcher status missing in log file

2006-08-30 Thread AJ Chen
I'm using nutch-0.9-dev from svn. hadoop.log has records from fetching, except for the status line. Is there a setting required to print the fetch status line? The status is set in Fetcher.java via report.setStatus(string), but where does the report object print the status? Thanks, -- AJ Chen http://web

Re: [Nutch-dev] books (and articles) about search engine algorithms

2006-08-30 Thread Thomas Delnoij
I found "Mining the Web: Discovering Knowledge from Hypertext Data" by Soumen Chakrabarti a useful reference. http://www.amazon.com/gp/product/1558607544/103-9548474-1631829?v=glance&n=283155 Rgrds, Thomas On 8/29/06, Andrzej Bialecki <[EMAIL PROTECTED]> wrote: > Mladen Adamovic wrote: > > Hi!

Re: [Nutch-dev] Use CrawlDb as a metadata Db?

2006-08-30 Thread HUYLEBROECK Jeremy RD-ILAB-SSF
I think at the parser plugin level, you can't get back to the original CrawlDatum. The parsers get only the Content. What I did was put data from the CrawlDb into the Content MetaData at fetch time. Then the Parser gets this Metadata and can put it in the Parse object as needed. If you do fetch

Re: [Nutch-dev] get CrawlDatum

2006-08-30 Thread HUYLEBROECK Jeremy RD-ILAB-SSF
My current solution is a modified Fetcher that puts info into the Parse Metadata in the output method. Then this info can be used during parsing and so on. As Andrzej said, I also had to create my own OutputFormat. -Original Message- From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]

[Nutch-dev] Should URL normalization iterate?

2006-08-30 Thread Doug Cook
Hi, I've run across a few patterns in URLs where applying a normalization puts the URL in a form matching another normalization pattern (or even the same one). But that pattern won't get executed because the patterns are applied only once. Should normalization iterate until no patterns match (wi
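
The question above can be illustrated with a minimal sketch of fixed-point normalization. Everything in it is an assumption made for illustration: the class name, the two regex rules, and the pass limit are hypothetical and are not taken from Nutch's normalizer or its rule files. It only shows the difference between applying each rule once and repeating the rule set until the URL stops changing.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

public class IterativeNormalizer {
    // Hypothetical rules for illustration only; not Nutch's actual rule set.
    private static final Map<Pattern, String> RULES = new LinkedHashMap<>();
    static {
        // Collapse "/./" path segments.
        RULES.put(Pattern.compile("/\\./"), "/");
        // Remove a "segment/../" pair.
        RULES.put(Pattern.compile("/[^/.]+/\\.\\./"), "/");
    }

    // Assumed safety cap so a badly written rule set cannot loop forever.
    public static final int MAX_PASSES = 10;

    // Apply every rule exactly once, in order (the single-pass behaviour).
    public static String applyOnce(String url) {
        String result = url;
        for (Map.Entry<Pattern, String> rule : RULES.entrySet()) {
            result = rule.getKey().matcher(result).replaceAll(rule.getValue());
        }
        return result;
    }

    // Repeat the whole rule set until nothing changes or the cap is reached.
    public static String normalize(String url) {
        String current = url;
        for (int pass = 0; pass < MAX_PASSES; pass++) {
            String next = applyOnce(current);
            if (next.equals(current)) {
                return current; // fixed point: no rule matched
            }
            current = next;
        }
        return current; // cap reached; return the best effort so far
    }

    public static void main(String[] args) {
        String url = "http://example.com/a/b/../../c";
        System.out.println(applyOnce(url)); // http://example.com/a/../c
        System.out.println(normalize(url)); // http://example.com/c
    }
}
```

With these sample rules, a single pass leaves a `/a/../` segment behind because the replacement itself creates a new match; the iterated version resolves it on the second pass and stops on the third, when nothing changes.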

[Nutch-dev] [jira] Commented: (NUTCH-356) Plugin repository cache can lead to memory leak

2006-08-30 Thread Enis Soztutar (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-356?page=comments#action_12431548 ] Enis Soztutar commented on NUTCH-356: - I observed strange behaviour, when one of the plug-ins could not be included. For example the plugin system fails to load

Re: [Nutch-dev] Fetch error

2006-08-30 Thread anton
The previous error was from the tasktracker log. In the jobtracker log I now see the following error: 06/08/30 01:04:07 INFO mapred.TaskInProgress: Error from task_0001_r_00_1: java.lang.AbstractMethodError: org.apache.nutch.fetcher.FetcherOutputFormat.getRecordWriter(Lorg/apache/hadoop/fs/FileSystem;Lorg/ap

Re: [Nutch-dev] get CrawlDatum

2006-08-30 Thread Uroš Gruber
Andrzej Bialecki wrote: Uroš Gruber wrote: ParseData.metadata sounds nice, but I think I'm lost again :) If I understand code flow the best place would be in Fetcher [262] but i'm not sure that datum holds info of url being fetched On the input to the fetcher you get a URL and a CrawlDatum (o

Re: [Nutch-dev] get CrawlDatum

2006-08-30 Thread Andrzej Bialecki
Uroš Gruber wrote: > ParseData.metadata sounds nice, but I think I'm lost again :) > If I understand code flow the best place would be in Fetcher [262] > > but i'm not sure that datum holds info of url being fetched On the input to the fetcher you get a URL and a CrawlDatum (originally coming fro

Re: [Nutch-dev] get CrawlDatum

2006-08-30 Thread Uroš Gruber
Andrzej Bialecki wrote: > Uroš Gruber wrote: >> Hi, >> >> Could someone point me how to get CrawlDatum data from key url in >> ParseOutputFormat.write [83]. >> I would like to add data to link urls but this data depend on data of >> url being crawled. > > You can't, because that instance of Crawl

[Nutch-dev] Fetch error

2006-08-30 Thread anton
I updated Hadoop, but now I get the following error on the fetch step (reduce): 06/08/29 08:31:20 INFO mapred.TaskTracker: task_0003_r_00_3 0.3334% reduce > copy (6 of 6 at 11.77 MB/s) 06/08/29 08:31:20 WARN /: /getMapOutput.jsp?map=task_0003_m_02_0&reduce=1: java.lang.IllegalStateException

Re: [Nutch-dev] Use CrawlDb as a metadata Db?

2006-08-30 Thread Enis Soztutar
HUYLEBROECK Jeremy RD-ILAB-SSF wrote: > If I am not wrong, segments generated by Generator are some sort of > CrawlDatum. > I am putting metadata in the CrawlDb (I keep information that never > change) and I think they are copied to the segments by the Generator. > > But now I want to access those

Re: [Nutch-dev] get CrawlDatum

2006-08-30 Thread Andrzej Bialecki
Uroš Gruber wrote: > Hi, > > Could someone point me how to get CrawlDatum data from key url in > ParseOutputFormat.write [83]. > I would like to add data to link urls but this data depend on data of > url being crawled. You can't, because that instance of CrawlDatum is not available at this pla

[Nutch-dev] get CrawlDatum

2006-08-30 Thread Uroš Gruber
Hi, Could someone point me to how to get the CrawlDatum data for the key URL in ParseOutputFormat.write [83]? I would like to add data to link URLs, but this data depends on the data of the URL being crawled. I hope I was clear enough about my problem. Regards, Uros --