hbase content of nutch

2015-02-08 Thread lu_jin_hong(陆锦洪)
Hi all, I ran the InjectorJob to inject the URL into HBase. The content of see.txt is: http://money.163.com/ The job finished successfully and I found this in HBase: hbase(main):029:0* scan 'webpage' ROW COLUMN+CELL
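For reference, the inject step that produces rows in the 'webpage' table can be sketched like this (a minimal sketch for a Nutch 2.x + HBase setup; the seed URL and table name come from the thread, the `urls/seed.txt` path is an assumption):

```shell
# Create a seed list with the URL from the thread.
mkdir -p urls
echo "http://money.163.com/" > urls/seed.txt

# The following need a running Nutch 2.x + HBase setup, so they are
# shown as comments rather than executed here:
#   bin/nutch inject urls      # runs the InjectorJob
#   hbase shell                # then: scan 'webpage'
# A row keyed by the reversed host (e.g. com.163.money:http/) should appear.
```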

hbase content of injectorjob

2015-02-08 Thread jinhong lu
Hi all, I ran the InjectorJob to inject the URL into HBase. The content of see.txt is: http://money.163.com/ The job finished successfully and I found this in HBase: hbase(main):029:0* scan 'webpage' ROW COLUMN+CELL

Suggestion: move README.txt to README.md

2015-02-08 Thread Mattmann, Chris A (3980)
Hi Guys, In Tika and OODT we’ve moved our README from a .txt file to a .md file so all the links and other content are resolved on GitHub. I’d like to do the same with Nutch. Any objections? If not, I’ll do this in the next 24-48 hours. Cheers, Chris

Re: [jira] [Commented] (NUTCH-1937) Error: Could not find or load main class bin.crawl

2015-02-08 Thread nishant jani
Hi Chris, This is the output of my *pwd*: /home/nishant/programming_dump/assignment1/nutch/runtime/local. I have set the NUTCH_HOME, NUTCH_JAVA_HOME and JAVA_HOME variables. Also, I have been *issuing the command*: bin/nutch bin/crawl urls, where urls is a valid directory in the local

hbase content of the injectorjob

2015-02-08 Thread lujinhong
Hi all, I ran the InjectorJob to inject the URL into HBase. The content of see.txt is: http://money.163.com/ The job finished successfully and I found this in HBase: hbase(main):029:0* scan 'webpage' ROW COLUMN+CELL

[no subject]

2015-02-08 Thread Siddharth Mahendra Dasani

Isnt the All-in-one Crawl Deprecated?

2015-02-08 Thread nishant jani
Hello All, I have been following the Nutch tutorial on http://wiki.apache.org/nutch/NutchTutorial which has the following command to be executed: bin/nutch bin/crawl urls -dir crawl -depth 3 -topN 5, which throws: Error: Could not find or load main class bin.crawl. A quick glance through the

Re: hbase content of the injectorjob

2015-02-08 Thread jinhong lu
On Saturday, February 7, 2015 9:34 PM, jinhong lu lujinh...@yahoo.com wrote: Hi all, I ran the InjectorJob to inject the URL into HBase. The content of see.txt is: http://money.163.com/ The job finished successfully and I found this in HBase:

[no subject]

2015-02-08 Thread Ching Chiu

[no subject]

2015-02-08 Thread Subodh Sah

[no subject]

2015-02-08 Thread Avi Sanadhya

Google Summer of Code Program

2015-02-08 Thread Owen A Lin
Hi, I saw this post on the Opportunities page through NYU and I was curious what the summer program entails? I really want to get involved and improve my programming skills during the summer. Here is my resume. Thanks, Owen

unsubscribe

2015-02-08 Thread Arthur Cinader
unsubscribe

Re: GSoC 2015

2015-02-08 Thread Mattmann, Chris A (3980)
I am very much for figuring out how to do a Nutch + Spark - +1! ++ Chris Mattmann, Ph.D. Chief Architect Instrument Software and Science Data Systems Section (398) NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office:

unsubscribe

2015-02-08 Thread Arthur Cinader
unsubscribe

Fetch queue size, Multiple seed URLs and Maximum Depth

2015-02-08 Thread Preetam Pradeepkumar Shingavi
Hi, I have configured Nutch with a seed URL in local/url/seed.txt with just 1 URL to test (depth=2): https://www.aoncadis.org/home.htm DOUBTS: 1. Fetch queue size: watching the logs the first time while Nutch crawls, it shows (see below) *fetchQueues.totalSize=29*, which changes to something

Re: Fetch queue size, Multiple seed URLs and Maximum Depth

2015-02-08 Thread Mattmann, Chris A (3980)
Hi Preetam, -Original Message- From: Preetam Pradeepkumar Shingavi shing...@usc.edu Reply-To: dev@nutch.apache.org dev@nutch.apache.org Date: Sunday, February 8, 2015 at 10:18 AM To: dev@nutch.apache.org dev@nutch.apache.org, Chris Mattmann mattm...@usc.edu Subject: Fetch queue size,

Re: Google Summer of Code Program

2015-02-08 Thread Lewis John Mcgibbney
Hi Owen, On Sun, Feb 8, 2015 at 10:59 AM, dev-digest-h...@nutch.apache.org wrote: I saw this post on the Opportunities page through NYU and I was curious what the summer program entails? I really want to get involved and improve my programming skills during the summer. I strongly advise

Re: Isnt the All-in-one Crawl Deprecated?

2015-02-08 Thread Mattmann, Chris A (3980)
Hi Nishant, You are entirely correct. The all in one crawl script is now ./bin/crawl If you have time to update the tutorials we would welcome it! Cheers, Chris ++ Chris Mattmann, Ph.D. Chief Architect Instrument Software and

Re: [jira] [Commented] (NUTCH-1937) Error: Could not find or load main class bin.crawl

2015-02-08 Thread Mattmann, Chris A (3980)
Thank you. The issue is that we have a ./bin/crawl script now. Even before, you wouldn’t run: bin/nutch bin/crawl urls It would have been something like: ./bin/nutch crawlDir seedDir .. etc If you run the commands without any params they should tell you the positional args. HTH! Cheers,
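The point about positional args can be illustrated with a small shell sketch. The usage string and argument names below are assumptions modeled on this thread, not the script's exact output; run bin/crawl with no arguments to see the authoritative usage:

```shell
# Hypothetical mirror of how an all-in-one crawl script validates
# its positional args (names assumed; not the real bin/crawl).
crawl_usage() {
  if [ "$#" -lt 3 ]; then
    echo "Usage: crawl <seedDir> <crawlDir> <numberOfRounds>"
    return 1
  fi
  echo "crawling: seeds=$1 out=$2 rounds=$3"
}

crawl_usage urls crawl 3      # valid invocation
crawl_usage || true           # no args: prints the usage line
```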

Re: Google Summer of Code Program

2015-02-08 Thread Mattmann, Chris A (3980)
Thanks Owen, appreciate you sending this in. The way it works is: 1. Basically propose or find an existing project in the NUTCH issue tracker: https://issues.apache.org/jira/browse/NUTCH and search for labels gsoc2015. 2. Work with a mentor on Nutch’s PMC to identify and have a project ready.

Re: unsubscribe

2015-02-08 Thread Mattmann, Chris A (3980)
Please send an email to dev-unsubscr...@nutch.apache.org and user-unsubscr...@nutch.apache.org and follow the instructions from there. [moved dev@nutch.a.o and user@nutch.a.o to BCC] ++ Chris Mattmann, Ph.D. Chief Architect

[jira] [Commented] (NUTCH-1938) Unable to load realm info from SCDynamicStore

2015-02-08 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14311496#comment-14311496 ] Lewis John McGibbney commented on NUTCH-1938: - Additionally support for JDK

[jira] [Resolved] (NUTCH-1938) Unable to load realm info from SCDynamicStore

2015-02-08 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-1938. - Resolution: Not a Problem As of NUTCH-1920 we do not support JDK 1.6 Unable to

Re: Fetch queue size, Multiple seed URLs and Maximum Depth

2015-02-08 Thread Preetam Pradeepkumar Shingavi
Comments inline . Thanks, Preetam On Sun, Feb 8, 2015 at 10:56 AM, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote: Hi Preetam, -Original Message- From: Preetam Pradeepkumar Shingavi shing...@usc.edu Reply-To: dev@nutch.apache.org dev@nutch.apache.org Date: Sunday,

[GitHub] nutch pull request: Update README.txt

2015-02-08 Thread chrismattmann
GitHub user chrismattmann opened a pull request: https://github.com/apache/nutch/pull/6 Update README.txt add in instructions for contributing via Github and Hub You can merge this pull request into a Git repository by running: $ git pull https://github.com/chrismattmann/nutch

unsubscribe

2015-02-08 Thread Gioele Zanzico
unsubscribe Sent from my iPhone On 8 Feb 2015, at 13:59, Arthur Cinader acina...@gmail.com wrote: unsubscribe

[GitHub] nutch pull request: Update README.txt

2015-02-08 Thread asfgit
Github user asfgit closed the pull request at: https://github.com/apache/nutch/pull/6

[jira] [Commented] (NUTCH-1938) Unable to load realm info from SCDynamicStore

2015-02-08 Thread Chris A. Mattmann (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14311418#comment-14311418 ] Chris A. Mattmann commented on NUTCH-1938: -- Thanks for your issue and your

[jira] [Assigned] (NUTCH-1938) Unable to load realm info from SCDynamicStore

2015-02-08 Thread Chris A. Mattmann (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann reassigned NUTCH-1938: Assignee: Chris A. Mattmann Unable to load realm info from SCDynamicStore

[jira] [Updated] (NUTCH-1938) Unable to load realm info from SCDynamicStore

2015-02-08 Thread Chris A. Mattmann (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated NUTCH-1938: - Summary: Unable to load realm info from SCDynamicStore (was: Error When Running Nutch)

[jira] [Work started] (NUTCH-1938) Unable to load realm info from SCDynamicStore

2015-02-08 Thread Chris A. Mattmann (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on NUTCH-1938 started by Chris A. Mattmann. Unable to load realm info from SCDynamicStore

[Nutch Wiki] Trivial Update of GoogleSummerOfCode by LewisJohnMcgibbney

2015-02-08 Thread Apache Wiki
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The GoogleSummerOfCode page has been changed by LewisJohnMcgibbney: https://wiki.apache.org/nutch/GoogleSummerOfCode?action=diff&rev1=9&rev2=10 = Welcome to the Nutch Google Summer of

Re: Fetch queue size, Multiple seed URLs and Maximum Depth

2015-02-08 Thread Mattmann, Chris A (3980)
Thanks Preetam: [..snip..] Why would you want to? Preetam : Just was curious to manually handle this if possible. I was anticipating that once the db has been fetched and CrawlDB has all the URLs crawled data to depth 2, the next run should not crawl the same URLs again. Is it that URLs
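For context on why already-crawled URLs are not refetched each round: the all-in-one script repeats roughly these per-round steps (Nutch 1.x command names; SEGMENT is a placeholder for the timestamped directory that generate creates, and the crawl/ layout is an assumption). The generate step skips URLs whose fetch interval has not yet expired. The commands are echoed here rather than executed, since they need a live crawl:

```shell
# One crawl round, spelled out. Echoed only -- running them requires
# a working Nutch 1.x install and a seeded crawldb.
STEPS="bin/nutch generate crawl/crawldb crawl/segments -topN 5
bin/nutch fetch crawl/segments/SEGMENT
bin/nutch parse crawl/segments/SEGMENT
bin/nutch updatedb crawl/crawldb crawl/segments/SEGMENT"
echo "$STEPS"
```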

[Nutch Wiki] Trivial Update of GoogleSummerOfCode by LewisJohnMcgibbney

2015-02-08 Thread Apache Wiki
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The GoogleSummerOfCode page has been changed by LewisJohnMcgibbney: https://wiki.apache.org/nutch/GoogleSummerOfCode?action=diff&rev1=8&rev2=9 = Description = The Nutch PMC

Re: Fetch queue size, Multiple seed URLs and Maximum Depth

2015-02-08 Thread Preetam Pradeepkumar Shingavi
Cool. Thanks Regards, Preetam On Sun, Feb 8, 2015 at 11:16 AM, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote: Thanks Preetam: [..snip..] Why would you want to? Preetam : Just was curious to manually handle this if possible. I was anticipating that once the db has

[Nutch Wiki] New attachment added to page GoogleSummerOfCode

2015-02-08 Thread Apache Wiki
Dear Wiki user, You have subscribed to a wiki page GoogleSummerOfCode for change notification. An attachment has been added to that page by LewisJohnMcgibbney. The following detailed information is available: Attachment name: gsoc2015.png Attachment size: 115321 Attachment link:

572:Crawl statistics for each repository ?

2015-02-08 Thread Jaydeep Bagrecha
Is there a way to crawl all 3 repositories together and get statistics for each one individually? OR Do we have to crawl each repository separately and get its statistics from corresponding crawldb? Thanks, Jaydeep

Re: Google Summer of Code Program

2015-02-08 Thread Owen A Lin
Hi Chris and Lewis, Thank you so much for your emails and guidance. I am very serious about Nutch and Hadoop and am working on submitting a proposal right away. Hopefully, the project is selected but only time will tell. Again, thank you so much and I look forward to the project. Thanks, Owen

Re: 572:Crawl statistics for each repository ?

2015-02-08 Thread Mattmann, Chris A (3980)
Hi Jaydeep, Please qualify what this question is about - I know what it’s about but you have provided very little detail for anyone else on this list to discern it. The short answer is no: crawldb stats are per crawl. Cheers, Chris

Re: 572:Crawl statistics for each repository ?

2015-02-08 Thread Jaydeep Bagrecha
Thanks. P.S. The question was: Given M repositories (with M corresponding seed-list URLs), find crawl statistics (number of fetched/unfetched URLs, etc.) for each repo separately. So, is there a way to crawl all M repos together (e.g. include the domain names of all M in the regex-urlfilter.txt file) and get
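To make the per-repository route concrete: keeping one crawl directory (and therefore one crawldb) per repository and reading each with readdb is the usual approach. A sketch under assumptions (repository names and the crawl-* directory layout are hypothetical; `bin/nutch readdb <crawldb> -stats` is the Nutch 1.x stats command):

```shell
# One crawl directory per repository, each with its own crawldb.
# The stats command is echoed, not executed, since it needs real crawl data.
for repo in repoA repoB repoC; do
  echo "bin/nutch readdb crawl-$repo/crawldb -stats"
done
```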