date:20150208

hbase content of injectorjob

2015-02-08 Thread jinhong lu

Hi all, I ran the InjectorJob and injected the URL into HBase. The content of see.txt is: http://money.163.com/ The job finished successfully and I found this in HBase: hbase(main):029:0* scan 'webpage' ROW COLUMN+CELL

hbase content of injectorjob

2015-02-08 Thread jinhong lu

Hi all, I ran the InjectorJob and injected the URL into HBase. The content of see.txt is: http://money.163.com/ The job finished successfully and I found this in HBase: hbase(main):029:0* scan 'webpage' ROW COLUMN+CELL

Re: Need to crawl the site that requires flash to be enabled

2015-02-08 Thread Lewis John Mcgibbney

Hi Kartik and Alexis, On Fri, Feb 6, 2015 at 5:19 AM, user-digest-h...@nutch.apache.org wrote: The site you're trying to crawl is a Flash website. Unfortunately, that will be a problem for Nutch. Nutch doesn't render the page, it only fetches it. It won't load Flash, CSS or JS that are included

hbase content of nutch

2015-02-08 Thread lu_jin_hong（陆锦洪）

Hi all, I ran the InjectorJob and injected the URL into HBase. The content of see.txt is: http://money.163.com/ The job finished successfully and I found this in HBase: hbase(main):029:0* scan 'webpage' ROW COLUMN+CELL

Re: hbase content of nutch

2015-02-08 Thread Alfonso Nishikawa

Hi, Lujinhong, They are not error codes; they are values serialized by Gora. The HBase shell shows escaped byte values when a byte is not printable in the console. To dump the values, you have to use 'nutch readdb'. Take a look here: https://wiki.apache.org/nutch/CommandLineOptions
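
To illustrate the point about escaped bytes, here is a minimal Python sketch that turns a cell value as printed by the HBase shell back into raw bytes. It assumes the shell prints printable ASCII as-is and every other byte as `\xHH`, and that the cell holds a 4-byte big-endian integer. The sample value and the function name `decode_shell_bytes` are hypothetical, chosen only for illustration; 'nutch readdb' remains the proper way to dump the records.

```python
import re
import struct

# Matches one shell-style escape such as \x8D.
_ESCAPE = re.compile(r"\\x([0-9A-Fa-f]{2})")


def decode_shell_bytes(text: str) -> bytes:
    """Convert HBase-shell-style escaped text back into raw bytes.

    Assumption: unescaped characters are plain ASCII and every
    non-printable byte appears as a \\xHH escape.
    """
    out = bytearray()
    pos = 0
    for match in _ESCAPE.finditer(text):
        # Copy the literal characters preceding the escape.
        out += text[pos:match.start()].encode("ascii")
        # Append the byte named by the two hex digits.
        out.append(int(match.group(1), 16))
        pos = match.end()
    out += text[pos:].encode("ascii")
    return bytes(out)


# Hypothetical cell value, written the way the shell would print it.
raw = decode_shell_bytes(r"\x00'\x8D\x00")
# Interpret the four bytes as a big-endian signed integer.
value = struct.unpack(">i", raw)[0]
print(raw, value)
```

The printable apostrophe in the middle of the sample is an ordinary byte (0x27), which is why such values can look like garbage or error codes at first glance.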

Re: Nutch project

2015-02-08 Thread Nibal Sawaya

Hello Reza, besides the resources that Sebastian pointed you to, I would recommend checking the unit tests, as you will learn a lot about what each part of Nutch does and how it is supposed to work. Best Regards On Sat, Feb 7, 2015 at 1:38 AM, Sebastian Nagel wastl.na...@googlemail.com

Re: hbase content of the injectorjob

2015-02-08 Thread jinhong lu

On Saturday, February 7, 2015 9:34 PM, jinhong lu lujinh...@yahoo.com wrote: Hi all, I ran the InjectorJob and injected the URL into HBase. The content of see.txt is: http://money.163.com/ The job finished successfully and I found this in HBase:

hbase content of injectorjob

2015-02-08 Thread jinhong lu

Hi all, I ran the InjectorJob and injected the URL into HBase. The content of see.txt is: http://money.163.com/ The job finished successfully and I found this in HBase: hbase(main):029:0* scan 'webpage' ROW COLUMN+CELL

hbase content of nutch

2015-02-08 Thread lu_jin_hong（陆锦洪）

Hi all, I ran the InjectorJob and injected the URL into HBase. The content of see.txt is: http://money.163.com/ The job finished successfully and I found this in HBase: hbase(main):029:0* scan 'webpage' ROW COLUMN+CELL

hbase content of the injectorjob

2015-02-08 Thread lujinhong

Hi all, I ran the InjectorJob and injected the URL into HBase. The content of see.txt is: http://money.163.com/ The job finished successfully and I found this in HBase: hbase(main):029:0* scan 'webpage' ROW COLUMN+CELL

hbase content of the injectorjob

2015-02-08 Thread lujinhong

Hi all, I ran the InjectorJob and injected the URL into HBase. The content of see.txt is: http://money.163.com/ The job finished successfully and I found this in HBase: hbase(main):029:0* scan 'webpage' ROW COLUMN+CELL

RE: how to crawl image first on every round of nutch?

2015-02-08 Thread Markus Jelsma

Implement a ScoringFilter, specifically the generate something method(), and emit a high float for image MIME types. -Original message- From: Eyeris RodrIguez Rueda eru...@uci.cu Sent: Friday 6th February 2015 19:54 To: user@nutch.apache.org Subject: how to crawl image first on every

Re: Nutch project

2015-02-08 Thread Sebastian Nagel

Dear Reza Nazarpour, Nutch is an open, community-driven project. That's why I am forwarding this communication to the Nutch mailing list (user@nutch.apache.org). which is a brilliant piece of work. On behalf of all contributors and volunteers: thank you very much! without a thorough document

hbase content of the injectorjob

2015-02-08 Thread jinhong lu

Hi all, I ran the InjectorJob and injected the URL into HBase. The content of see.txt is: http://money.163.com/ The job finished successfully and I found this in HBase: hbase(main):029:0* scan 'webpage' ROW COLUMN+CELL

Re: InvertLinks Performance Nutch 1.6

2015-02-08 Thread Sebastian Nagel

Yeah, impressive. The defaults are not really optimal for production crawls, where it's unlikely that URL filter / normalization rules get changed somewhere in between the steps of a running crawl. Ideally, URLs should be filtered / normalized only when new URLs are added to the CrawlDb: - seeds -

Re: Newbie

2015-02-08 Thread Mattmann, Chris A (3980)

Thanks Trevor. Moving user-owner@n.a.o to BCC since I think you meant to ask this on the user@n.a.o list. I think the best bet is to check out the Nutch wiki with several tutorials and other info on how to get started. Also we would welcome you to join the dev and user lists (by sending blank

Re: unsubscribe

2015-02-08 Thread Mattmann, Chris A (3980)

Please send an email to dev-unsubscr...@nutch.apache.org and user-unsubscr...@nutch.apache.org and follow the instructions from there. [moved dev@nutch.a.o and user@nutch.a.o to BCC] ++ Chris Mattmann, Ph.D. Chief Architect

How to crawl specific pages of a website

2015-02-08 Thread Phong Nguyen

Hi all, I want to crawl all posts of a blog, except the home, category, and tag pages of https://thinkarchitect.wordpress.com. For example: https://thinkarchitect.wordpress.com/2015/02/06/difficult-to-work-with-sometimes-2/ So I added the following rule in regex-urlfilter.txt
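
The rule itself is cut off in this listing, so the following is only a hedged Python sketch of how such a rule could be checked before it goes into regex-urlfilter.txt. It assumes that the filter reads rules top to bottom, that the first matching pattern decides (`+` accepts, `-` rejects), and that patterns are searched anywhere in the URL. The pattern, the `RULES` list, and the `accepts` helper are hypothetical and are not the poster's actual rule.

```python
import re

# Hypothetical rule set: accept dated post URLs (/YYYY/MM/DD/slug/),
# reject everything else.
RULES = [
    ("+", r"^https://thinkarchitect\.wordpress\.com/\d{4}/\d{2}/\d{2}/[^/]+/$"),
    ("-", r"."),
]


def accepts(url, rules=RULES):
    """Return True if the first rule whose pattern matches is a '+' rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    # No rule matched: treat the URL as rejected.
    return False


post = "https://thinkarchitect.wordpress.com/2015/02/06/difficult-to-work-with-sometimes-2/"
home = "https://thinkarchitect.wordpress.com/"
print(accepts(post), accepts(home))
```

A rule set this strict would also reject the home page itself, so the seed list would need to contain post URLs directly, or the rules would have to let through the pages that link to the posts.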


18 matches
