hbase content of injectorjob

2015-02-08 Thread jinhong lu
Hi all, I ran the InjectorJob to inject the URL into HBase. The content of see.txt is: http://money.163.com/ The job finished successfully, and I found this in HBase: hbase(main):029:0* scan 'webpage' ROW COLUMN+CELL

Re: Need to crawl the site that requires flash to be enabled

2015-02-08 Thread Lewis John Mcgibbney
Hi Kartik and Alexis, On Fri, Feb 6, 2015 at 5:19 AM, user-digest-h...@nutch.apache.org wrote: The site you're trying to crawl is a Flash website. Unfortunately that will be a problem for Nutch. Nutch doesn't render the page, only fetches it. It won't load Flash, CSS or JS that are included

hbase content of nutch

2015-02-08 Thread lu_jin_hong(陆锦洪)
Hi all, I ran the InjectorJob to inject the URL into HBase. The content of see.txt is: http://money.163.com/ The job finished successfully, and I found this in HBase: hbase(main):029:0* scan 'webpage' ROW COLUMN+CELL

Re: hbase content of nutch

2015-02-08 Thread Alfonso Nishikawa
Hi, Lujinhong, They are not error codes; they are values serialized by Gora. The HBase shell shows escaped byte values when a byte is not printable in the console. To dump the values you have to use 'nutch readdb'. Take a look here: https://wiki.apache.org/nutch/CommandLineOptions
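
The escaping described in the reply above can be illustrated with a short Python sketch. It is not part of Nutch, Gora, or HBase: the function names and the sample value are hypothetical, and it assumes that the shell prints each non-printable byte as `\xHH` and that the cell in question happens to hold an 8-byte big-endian integer. Real cells written by Gora may use other encodings, so 'nutch readdb' remains the reliable way to read them.

```python
import re

# Matches one shell-style escape such as \x00 or \x1F.
_ESCAPE = re.compile(r"\\x([0-9A-Fa-f]{2})")


def decode_hbase_escaped(text: str) -> bytes:
    """Turn a shell-style escaped string back into the raw bytes it stands for."""
    out = bytearray()
    pos = 0
    for match in _ESCAPE.finditer(text):
        # Printable characters between escapes are copied through unchanged.
        out += text[pos:match.start()].encode("latin-1")
        out.append(int(match.group(1), 16))
        pos = match.end()
    out += text[pos:].encode("latin-1")
    return bytes(out)


def as_big_endian_long(raw: bytes) -> int:
    """Interpret 8 raw bytes as a signed big-endian integer (an assumption about the cell)."""
    if len(raw) != 8:
        raise ValueError("expected exactly 8 bytes, got %d" % len(raw))
    return int.from_bytes(raw, "big", signed=True)


if __name__ == "__main__":
    # Hypothetical cell value as it might appear in a scan: seven zero bytes, then '*'.
    shown = r"\x00\x00\x00\x00\x00\x00\x00*"
    print(as_big_endian_long(decode_hbase_escaped(shown)))
```

The point of the sketch is only that the odd-looking characters in a scan are ordinary binary data rendered for a text console, not error codes.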

Re: Nutch project

2015-02-08 Thread Nibal Sawaya
Hello Reza, besides the resources that Sebastian pointed you to, I would recommend checking the unit tests, as you will learn a lot about what each part of Nutch does and how it is supposed to work. Best Regards On Sat, Feb 7, 2015 at 1:38 AM, Sebastian Nagel wastl.na...@googlemail.com

Re: hbase content of the injectorjob

2015-02-08 Thread jinhong lu
On Saturday, February 7, 2015 9:34 PM, jinhong lu lujinh...@yahoo.com wrote: Hi all, I ran the InjectorJob to inject the URL into HBase. The content of see.txt is: http://money.163.com/ The job finished successfully, and I found this in HBase:

hbase content of the injectorjob

2015-02-08 Thread lujinhong
Hi all, I ran the InjectorJob to inject the URL into HBase. The content of see.txt is: http://money.163.com/ The job finished successfully, and I found this in HBase: hbase(main):029:0* scan 'webpage' ROW COLUMN+CELL

RE: how to crawl image first on every round of nutch?

2015-02-08 Thread Markus Jelsma
Implement a ScoringFilter, specifically its generator sort-value method (generatorSortValue() in Nutch 1.x), and emit a high float for image MIME types. -Original message- From:Eyeris RodrIguez Rueda eru...@uci.cu Sent: Friday 6th February 2015 19:54 To: user@nutch.apache.org Subject: how to crawl image first on every

Re: Nutch project

2015-02-08 Thread Sebastian Nagel
Dear Reza Nazarpour, Nutch is an open, community-driven project. That's why I loop this communication forward to the Nutch mailing list (user@nutch.apache.org). which is a brilliant piece of work. On behalf of all contributors and volunteers: thank you very much! without a thorough document

hbase content of the injectorjob

2015-02-08 Thread jinhong lu
Hi all, I ran the InjectorJob to inject the URL into HBase. The content of see.txt is: http://money.163.com/ The job finished successfully, and I found this in HBase: hbase(main):029:0* scan 'webpage' ROW COLUMN+CELL

Re: InvertLinks Performance Nutch 1.6

2015-02-08 Thread Sebastian Nagel
Yeah, impressive. The defaults are not really optimal for production crawls, where it's unlikely that URL filter / normalization rules get changed somewhere in between the steps of a running crawl. Ideally, URLs should be filtered / normalized only if new URLs are added to CrawlDb: - seeds -

Re: Newbie

2015-02-08 Thread Mattmann, Chris A (3980)
Thanks Trevor. Moving user-owner@n.a.o to BCC since I think you meant to ask this on the user@n.a.o list. I think the best bet is to check out the Nutch wiki with several tutorials and other info on how to get started. Also we would welcome you to join the dev and user lists (by sending blank

Re: unsubscribe

2015-02-08 Thread Mattmann, Chris A (3980)
Please send an email to dev-unsubscr...@nutch.apache.org and user-unsubscr...@nutch.apache.org and follow the instructions from there. [moved dev@nutch.a.o and user@nutch.a.o to BCC] ++ Chris Mattmann, Ph.D. Chief Architect

How to crawl specific pages of a website

2015-02-08 Thread Phong Nguyen
Hi all, I want to crawl all posts of a blog, except the home, category, and tag pages, at https://thinkarchitect.wordpress.com. For example: https://thinkarchitect.wordpress.com/2015/02/06/difficult-to-work-with-sometimes-2/ So I added the following rule in regex-urlfilter.txt