Re: Open Graph metadata?
Hi Ralf,

Do you mean the Open Graph Protocol [0] markup here? If so, and if it is present in the page, then it is already parsed out and stored within Parse [1] and can be accessed via Parse.getData(). Please use the ParserChecker to double-check this and, if necessary, post an example here so that I can be corrected if I am wrong.

Thanks
Lewis

[0] http://ogp.me/
[1] http://nutch.apache.org/apidocs/apidocs-1.12/index.html?org/apache/nutch/parse/Parse.html

On Sun, Sep 18, 2016 at 7:41 PM, wrote:
> From: BlackIce
> To: user@nutch.apache.org
> Date: Sun, 18 Sep 2016 17:39:56 +0200
> Subject: Open Graph metadata?
>
> Can we now use Open graph metadata, if so how?
>
> Thnx
>
> Ralf

--
http://home.apache.org/~lewismc/
@hectorMcSpector
http://www.linkedin.com/in/lmcgibbney
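For readers unfamiliar with the markup in question: Open Graph properties are ordinary `<meta property="og:..." content="...">` tags in the page head. The sketch below is not Nutch code and does not use the Nutch Parse API. It is a small standalone Python illustration, using only the standard library, of what such tags look like and how they might be collected. The class name `OpenGraphCollector`, the function `extract_open_graph` and the sample HTML are made up for this example.

```python
from html.parser import HTMLParser


class OpenGraphCollector(HTMLParser):
    """Collect <meta property="og:*" content="..."> pairs from an HTML page."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        name = attr_map.get("property") or ""
        content = attr_map.get("content")
        # Keep only Open Graph properties that actually carry a value.
        if name.startswith("og:") and content is not None:
            self.properties[name] = content


def extract_open_graph(html):
    """Return a dict of Open Graph property names to their content values."""
    collector = OpenGraphCollector()
    collector.feed(html)
    return collector.properties


# Hypothetical sample page, not taken from the thread.
SAMPLE_HTML = """
<html><head>
<meta property="og:title" content="Example page" />
<meta property="og:type" content="website" />
<meta name="description" content="not an Open Graph tag" />
</head><body></body></html>
"""

if __name__ == "__main__":
    print(extract_open_graph(SAMPLE_HTML))
```

Whether and under which keys Nutch exposes these values in its parse metadata is best confirmed with the parse checker, as suggested in the reply above.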
Re: Nutch in production
Can I have a link to this?

Regards,
Sachin Shaju
sachi...@mstack.com
+919539887554

On Thu, Sep 29, 2016 at 11:13 PM, Mattmann, Chris A (3980) <chris.a.mattm...@jpl.nasa.gov> wrote:
> Yep also check out the work that Sujen Shah just merged (also on my team
> at JPL and USC) where you can publish events to an ActiveMQ queue from
> Nutch crawling. That should allow all sorts of production dashboards and
> analytics.

--
The information contained in this electronic message and any attachments to this message are intended for the exclusive use of the addressee(s) and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you should not disseminate, distribute or copy this e-mail. Please notify the sender immediately and destroy all copies of this message and any attachments. WARNING: Computer viruses can be transmitted via email. The recipient should check this email and any attachments for the presence of viruses. The company accepts no liability for any damage caused by any virus transmitted by this email. www.mStack.com
Re: Nutch in production
Thank you guys for your replies. I will look into the suggestions you gave. But I have one more query: how can I trigger Nutch from a queue system in a distributed environment? Is the REST API a real option in distributed mode, or will I have to go for command-line invocation of Nutch?

Thanks and Regards,
Sachin Shaju
sachi...@mstack.com
+919539887554

On Thu, Sep 29, 2016 at 11:11 PM, Karanjeet Singh wrote:
> Hi Sachin,
>
> Just a suggestion here - you can use Apache Kafka to generate and catch
> events which are mapped to incoming crawl requests, crawl status and much
> more.
>
> I have created a prototype for production queue [0] which runs on top of a
> supercomputer (TACC Wrangler) and integrated it with Kafka. Please have a
> look and let me know if you have any questions.
>
> [0]: https://github.com/karanjeets/PCF-Nutch-on-Wrangler
>
> P.S. - There can be many solutions to this. I am just giving one. :)
>
> Regards,
> Karanjeet Singh
> http://irds.usc.edu
RE: Arch 1.9.2 is available
You are welcome.

> -----Original Message-----
> From: lewis john mcgibbney [mailto:lewi...@apache.org]
> Sent: Friday, 30 September 2016 2:22 AM
> To: user@nutch.apache.org
> Subject: Re: Arch 1.9.2 is available
>
> Cool... thanks for posting.
Re: Nutch in production
Yep also check out the work that Sujen Shah just merged (also on my team at JPL and USC) where you can publish events to an ActiveMQ queue from Nutch crawling. That should allow all sorts of production dashboards and analytics.

++
Chris Mattmann, Ph.D.
Chief Architect, Instrument Software and Science Data Systems Section (398)
Manager, Open Source Projects Formulation and Development Office (8212)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++
Director, Information Retrieval and Data Science Group (IRDS)
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++

On 9/29/16, 10:41 AM, "Karanjeet Singh" wrote:

    Hi Sachin,

    Just a suggestion here - you can use Apache Kafka to generate and catch events which are mapped to incoming crawl requests, crawl status and much more.

    I have created a prototype for production queue [0] which runs on top of a supercomputer (TACC Wrangler) and integrated it with Kafka. Please have a look and let me know if you have any questions.

    [0]: https://github.com/karanjeets/PCF-Nutch-on-Wrangler

    P.S. - There can be many solutions to this. I am just giving one. :)

    Regards,
    Karanjeet Singh
    http://irds.usc.edu
Re: Nutch in production
Hi Sachin,

Just a suggestion here - you can use Apache Kafka to generate and catch events which are mapped to incoming crawl requests, crawl status and much more.

I have created a prototype for production queue [0] which runs on top of a supercomputer (TACC Wrangler) and integrated it with Kafka. Please have a look and let me know if you have any questions.

[0]: https://github.com/karanjeets/PCF-Nutch-on-Wrangler

P.S. - There can be many solutions to this. I am just giving one. :)

Regards,
Karanjeet Singh
http://irds.usc.edu

On Thu, Sep 29, 2016 at 1:33 AM, Sachin Shaju wrote:
> Hi,
> I was experimenting some crawl cycles with nutch and would like to setup
> a distributed crawl environment. But I wonder how can I trigger nutch for
> incoming crawl requests in a production system. I read about nutch REST
> api. Is that the real option that I have ? Or can I run nutch as a
> continuously running distributed server by any other option ?
>
> My preferred nutch version is nutch 1.12.
>
> Regards,
> Sachin Shaju
>
> sachi...@mstack.com
> +919539887554
Re: Arch 1.9.2 is available
Cool... thanks for posting.

On Wed, Sep 28, 2016 at 1:36 AM, wrote:
> user Digest 28 Sep 2016 08:36:56 - Issue 2648
>
> Topics (messages 32792 through 32792)
>
> Arch 1.9.2 is available
> 32792 by: Arkadi.Kosmynin.csiro.au
>
> -- Forwarded message --
> Date: Tue, 27 Sep 2016 07:00:18 +
> Subject: Arch 1.9.2 is available
>
> Hello,
>
> I am announcing release of Arch 1.9.2, based on Nutch 1.9.
>
> Arch is a free, open source extension of Nutch designed for indexing and
> searching of intranets. Many features have been added that make this task
> easier and deliver high precision search results.
>
> For details and downloads, please see Arch home page:
>
> http://www.atnf.csiro.au/computing/software/arch/
>
> You may know that Google Search Appliance is being discontinued. See, for
> example, http://fortune.com/2016/02/04/google-ends-search-appliance/. If
> you need a replacement, you may want to try Arch. It is at least comparable
> to GSA in terms of search quality. See more in this article:
>
> http://www.atnf.csiro.au/computing/software/arch/ArchVsGSA.pdf
>
> Regards,
>
> Arkadi Kosmynin

--
http://home.apache.org/~lewismc/
@hectorMcSpector
http://www.linkedin.com/in/lmcgibbney
Custom options in nutch crawl script
I was trying to pass custom options to the *bin/crawl* script and ran into an issue. I passed a custom config property to ignore external outlinks in my crawl command, like this:

*bin/crawl -i -D elastic.index=test -D db.ignore.external.links=true urls/ CrawlTest/ 3*

But this does not work. When I set the same property in nutch-site.xml instead, it works.

Then I tried passing a custom config as a Java option to bin/crawl, to index data into a specific Elasticsearch index other than the one given in nutch-site.xml. To my surprise, that works. The command I used:

*bin/crawl -i -D elastic.index=test urls/ CrawlTest/ 3*

So I would like to know why my first command didn't work. Am I missing anything? Please help.

Regards,
Sachin Shaju
sachi...@mstack.com
+919539887554
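The workaround mentioned in the message (setting the property in nutch-site.xml rather than on the command line) amounts to adding one `<property>` entry to the configuration file. As a small illustration, the Python sketch below builds such an entry with the standard library. The property name and value come from the message. The helper names `make_property` and `build_configuration` are hypothetical, and the surrounding `<configuration>` element is the usual Hadoop-style layout, assumed here rather than taken from the thread. It does not explain why the `-D` form was ignored for this property.

```python
import xml.etree.ElementTree as ET


def make_property(name, value):
    """Build one Hadoop-style <property> element (hypothetical helper)."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = value
    return prop


def build_configuration(settings):
    """Wrap the given name/value pairs in a <configuration> root element."""
    root = ET.Element("configuration")
    for name, value in settings.items():
        root.append(make_property(name, value))
    return root


if __name__ == "__main__":
    config = build_configuration({"db.ignore.external.links": "true"})
    # Print the fragment that would be merged into nutch-site.xml.
    print(ET.tostring(config, encoding="unicode"))
```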
Nutch in production
Hi,

I have been experimenting with some crawl cycles in Nutch and would like to set up a distributed crawl environment. But I wonder how I can trigger Nutch for incoming crawl requests in a production system. I read about the Nutch REST API. Is that the real option that I have? Or can I run Nutch as a continuously running distributed server in some other way?

My preferred Nutch version is 1.12.

Regards,
Sachin Shaju
sachi...@mstack.com
+919539887554
How to run nutch server on distributed environment
Hi,

I have tested running Nutch in server mode by starting it *locally* with the bin/nutch startserver command. Now I wonder whether I can start Nutch in *server mode* on top of a Hadoop cluster (in a distributed environment) and submit crawl requests to the server using the Nutch REST API. Please help.

Regards,
Sachin Shaju
sachi...@mstack.com
+919539887554
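To make the question concrete, here is a rough Python sketch of what submitting a crawl job to a running Nutch server over HTTP might look like. Everything specific in it is an assumption and not something stated in the thread: the host and port (`localhost:8081`), the `/job/create` path, the JSON field names and the `url_dir` argument are recalled from the Nutch 1.x REST documentation and should be checked against the version in use. The function only prepares the request and sends nothing. Whether such a request behaves as hoped when the server sits on top of a Hadoop cluster is exactly the open question above, and this sketch does not answer it.

```python
import json
import urllib.request

# Assumed server address; adjust to wherever the server was started.
NUTCH_SERVER = "http://localhost:8081"


def build_job_request(crawl_id, job_type, args, conf_id="default"):
    """Prepare (but do not send) a POST request describing one crawl job."""
    payload = {
        "crawlId": crawl_id,
        "type": job_type,
        "confId": conf_id,
        "args": args,
    }
    return urllib.request.Request(
        url=NUTCH_SERVER + "/job/create",  # assumed endpoint path
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # "crawl01" and "urls/" are placeholder values for illustration.
    request = build_job_request("crawl01", "INJECT", {"url_dir": "urls/"})
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
    # To actually submit it against a running server:
    # with urllib.request.urlopen(request) as response:
    #     print(response.read().decode("utf-8"))
```

A queue consumer (Kafka, ActiveMQ or similar) could call a function like this for each incoming crawl request, which is one way to connect the queue-based suggestions in the earlier replies with the REST API.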