Fixing dev@nutch list address

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Principal Data Scientist, Engineering Administrative Office (3010)
Manager, Open Source Projects Formulation and Development Office (8212)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 180-503E, Mailstop: 180-503
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Director, Information Retrieval and Data Science Group (IRDS)
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

On 12/5/16, 9:32 PM, "Mattmann, Chris A (3010)" <chris.a.mattm...@jpl.nasa.gov> wrote:

Hi Jyoti,

I need a lot more detail than “it didn’t work”. What didn’t work about it? Do you have a log file? What site were you trying to crawl? What command did you use? Where is your Nutch config? Were you running in distributed or local mode?

Onto Selenium – have you tried it, or are you simply reading the docs and concluding it’s old? What have you done? What have you tried?

I need a LOT more detail before I (and I’m guessing anyone else on these lists) can help.

Cheers,
Chris

From: jyoti aditya <jyotiaditya...@gmail.com>
Date: Monday, December 5, 2016 at 9:29 PM
To: "Mattmann, Chris A (3010)" <chris.a.mattm...@jpl.nasa.gov>
Cc: "u...@nutch.apache.org" <u...@nutch.apache.org>, "d...@nutch.apatche.org" <d...@nutch.apatche.org>
Subject: Re: Impolite crawling using NUTCH

Hi Chris/Team,

Whitelisting the domain name didn't work. And when I was trying to configure Selenium, I found it needs a headless browser to be integrated with. The documentation for the selenium-protocol plugin looks old: firefox-11 is no longer supported as a headless browser with Selenium. So please help me out in configuring the Selenium plugin.
I am not yet sure what result it will fetch me after configuring the above.

With Regards,
Jyoti Aditya

On Tue, Dec 6, 2016 at 12:00 AM, Mattmann, Chris A (3010) <chris.a.mattm...@jpl.nasa.gov> wrote:

Hi Jyoti,

Again, please keep dev@nutch.a.o CC’ed, and also you may consider looking at this page:

https://wiki.apache.org/nutch/AdvancedAjaxInteraction

Cheers,
Chris

From: jyoti aditya <jyotiaditya...@gmail.com>
Date: Monday, December 5, 2016 at 1:42 AM
To: Chris Mattmann <mattm...@apache.org>
Subject: Re: Impolite crawling using NUTCH

Hi Chris,

The whitelist didn't work. And I was trying to configure Selenium with Nutch, but I am not sure what result that will give. Also, it looks very clumsy to configure Selenium with Firefox.

Regards,
Jyoti Aditya

On Fri, Dec 2, 2016 at 8:43 PM, Chris Mattmann <mattm...@apache.org> wrote:

Hmm, I’m a little confused here. You were first trying to use white list robots.txt, and now you are talking about Selenium.

1. Did the white list work?
2. Are you now asking how to use Nutch and Selenium?

Cheers,
Chris

From: jyoti aditya <jyotiaditya...@gmail.com>
Date: Thursday, December 1, 2016 at 10:26 PM
To: "Mattmann, Chris A (3010)" <chris.a.mattm...@jpl.nasa.gov>
Subject: Re: Impolite crawling using NUTCH

Hi Chris,

Thanks for the response. I made the changes you mentioned above, but I am still not able to get all the content from a web page. Can you please tell me whether I need to add a Selenium plugin to crawl the dynamic content on a web page?

I have a concern that these kinds of wiki pages are not directly accessible. There is no way we can reach these kinds of useful pages. So please do the needful regarding this.

With Regards,
Jyoti Aditya

On Tue, Nov 29, 2016 at 7:29 PM, Mattmann, Chris A (3010) <chris.a.mattm...@jpl.nasa.gov> wrote:

There is a robots.txt whitelist. You can find documentation here:

https://wiki.apache.org/nutch/WhiteListRobots

On 11/29/16, 8:57 AM, "Tom Chiverton" <t...@extravision.com> wrote:

Sure, you can remove the check from the code and recompile.

Under what circumstances would you need to ignore robots.txt?
Would something like allowing access by particular IPs or user agents be an alternative?

Tom

On 29/11/16 04:07, jyoti aditya wrote:
> Hi team,
>
> Can we use NUTCH to do impolite crawling?
> Or is there any way by which we can disobey robots.txt?
>
> With Regards
> Jyoti Aditya

--
With Regards
Jyoti Aditya