Re: Open Graph metadata?

2016-09-29 Thread lewis john mcgibbney
Hi Ralf,
Do you mean the Open Graph Protocol [0] markup here?
If so, then if it is present within  then it is already parsed
out and stored within Parse [1] and can be accessed via Parse.getData().
Please use the ParserChecker to double check this and if necessary post an
example here so that I can be corrected if I am wrong.
Thanks
Lewis

[0] http://ogp.me/
[1] http://nutch.apache.org/apidocs/apidocs-1.12/index.html?org/apache/nutch/parse/Parse.html
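
For reference, here is a minimal sketch of what Open Graph markup looks like and how it reduces to key/value pairs. It is plain Python using only the standard library, not Nutch code; the names OpenGraphParser and extract_open_graph and the sample page are made up for illustration. Knowing which og: properties a page actually carries can help when checking the ParserChecker output.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collect <meta property="og:..." content="..."> pairs from an HTML page."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased from HTMLParser.
        if tag != "meta":
            return
        attr_map = dict(attrs)
        prop = attr_map.get("property") or ""
        content = attr_map.get("content")
        if prop.startswith("og:") and content is not None:
            # Keep the first value seen for each property.
            self.properties.setdefault(prop, content)


def extract_open_graph(html):
    """Return a dict of og:* properties found in the given HTML string."""
    parser = OpenGraphParser()
    parser.feed(html)
    parser.close()
    return parser.properties


# Hypothetical sample page, for illustration only.
SAMPLE_PAGE = """<html><head>
<title>The Rock (1996)</title>
<meta property="og:title" content="The Rock" />
<meta property="og:type" content="video.movie" />
<meta name="description" content="not Open Graph" />
</head><body></body></html>"""

if __name__ == "__main__":
    for name, value in extract_open_graph(SAMPLE_PAGE).items():
        print(name, "=", value)
```

Only meta tags whose property attribute starts with og: are kept; the ordinary description tag in the sample is ignored.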

On Sun, Sep 18, 2016 at 7:41 PM,  wrote:

>
> From: BlackIce 
> To: user@nutch.apache.org
> Cc:
> Date: Sun, 18 Sep 2016 17:39:56 +0200
> Subject: Open Graph metadata?
> Can we now use Open graph metadata, if so how?
>
>
> Thnx
>
>
> Ralf
>
>


-- 
http://home.apache.org/~lewismc/
@hectorMcSpector
http://www.linkedin.com/in/lmcgibbney


Re: Nutch in production

2016-09-29 Thread Sachin Shaju
Can I have a link to this?

Regards,
Sachin Shaju

sachi...@mstack.com
+919539887554

On Thu, Sep 29, 2016 at 11:13 PM, Mattmann, Chris A (3980) <
chris.a.mattm...@jpl.nasa.gov> wrote:

> Yep also check out the work that Sujen Shah just merged (also on my team
> at JPL and
> USC) where you can publish events to an ActiveMQ queue from Nutch
> crawling. That
> should allow all sorts of production dashboards and analytics.
>
> ++
> Chris Mattmann, Ph.D.
> Chief Architect, Instrument Software and Science Data Systems Section (398)
> Manager, Open Source Projects Formulation and Development Office (8212)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattm...@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++
> Director, Information Retrieval and Data Science Group (IRDS)
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> WWW: http://irds.usc.edu/
> ++

-- 
 

The information contained in this electronic message and any attachments to 
this message are intended for the exclusive use of the addressee(s) and may 
contain proprietary, confidential or privileged information. If you are not 
the intended recipient, you should not disseminate, distribute or copy this 
e-mail. Please notify the sender immediately and destroy all copies of this 
message and any attachments.

WARNING: Computer viruses can be transmitted via email. The recipient 
should check this email and any attachments for the presence of viruses. 
The company accepts no liability for any damage caused by any virus 
transmitted by this email.

www.mStack.com


Re: Nutch in production

2016-09-29 Thread Sachin Shaju
Thank you all for your replies. I will look into the suggestions you gave.
But I have one more question. How can I trigger Nutch from a queue system in a
distributed environment? Is the REST API a real option in distributed mode,
or will I have to go for a command-line invocation of Nutch?

Thanks and Regards,
Sachin Shaju

sachi...@mstack.com
+919539887554

On Thu, Sep 29, 2016 at 11:11 PM, Karanjeet Singh  wrote:

> Hi Sachin,
>
> Just a suggestion here - you can use Apache Kafka to generate and catch
> events which are mapped to incoming crawl requests, crawl status and much
> more.
>
> I have created a prototype for production queue [0] which runs on top of a
> supercomputer (TACC Wrangler) and integrated it with Kafka. Please have a
> look and let me know if you have any questions.
>
> [0]: https://github.com/karanjeets/PCF-Nutch-on-Wrangler
>
> P.S. - There can be many solutions to this. I am just giving one.  :)
>
> Regards,
> Karanjeet Singh
> http://irds.usc.edu



RE: Arch 1.9.2 is available

2016-09-29 Thread Arkadi.Kosmynin
You are welcome.

> -----Original Message-----
> From: lewis john mcgibbney [mailto:lewi...@apache.org]
> Sent: Friday, 30 September 2016 2:22 AM
> To: user@nutch.apache.org
> Subject: Re: Arch 1.9.2 is available
> 
> Cool... thanks for posting.


Re: Nutch in production

2016-09-29 Thread Mattmann, Chris A (3980)
Yep also check out the work that Sujen Shah just merged (also on my team at JPL
and USC) where you can publish events to an ActiveMQ queue from Nutch crawling.
That should allow all sorts of production dashboards and analytics.

++
Chris Mattmann, Ph.D.
Chief Architect, Instrument Software and Science Data Systems Section (398)
Manager, Open Source Projects Formulation and Development Office (8212)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Director, Information Retrieval and Data Science Group (IRDS)
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++
 

On 9/29/16, 10:41 AM, "Karanjeet Singh"  wrote:

Hi Sachin,

Just a suggestion here - you can use Apache Kafka to generate and catch
events which are mapped to incoming crawl requests, crawl status and much
more.

I have created a prototype for production queue [0] which runs on top of a
supercomputer (TACC Wrangler) and integrated it with Kafka. Please have a
look and let me know if you have any questions.

[0]: https://github.com/karanjeets/PCF-Nutch-on-Wrangler

P.S. - There can be many solutions to this. I am just giving one.  :)

Regards,
Karanjeet Singh
http://irds.usc.edu




Re: Nutch in production

2016-09-29 Thread Karanjeet Singh
Hi Sachin,

Just a suggestion here - you can use Apache Kafka to generate and catch
events which are mapped to incoming crawl requests, crawl status and much
more.

I have created a prototype for production queue [0] which runs on top of a
supercomputer (TACC Wrangler) and integrated it with Kafka. Please have a
look and let me know if you have any questions.

[0]: https://github.com/karanjeets/PCF-Nutch-on-Wrangler

P.S. - There can be many solutions to this. I am just giving one.  :)

Regards,
Karanjeet Singh
http://irds.usc.edu

On Thu, Sep 29, 2016 at 1:33 AM, Sachin Shaju  wrote:

> Hi,
>I was experimenting some crawl cycles with nutch and would like to setup
> a distributed crawl environment. But I wonder how can I trigger nutch for
> incoming crawl requests in a production system. I read about nutch REST
> api. Is that the real option that I have ? Or can I run nutch as a
> continuously running distributed server by any other option ?
>
>  My preferred nutch version is nutch 1.12.
>
> Regards,
> Sachin Shaju
>
> sachi...@mstack.com
> +919539887554


Re: Arch 1.9.2 is available

2016-09-29 Thread lewis john mcgibbney
Cool... thanks for posting.

On Wed, Sep 28, 2016 at 1:36 AM,  wrote:

>
> user Digest 28 Sep 2016 08:36:56 - Issue 2648
>
> Topics (messages 32792 through 32792)
>
> Arch 1.9.2 is available
> 32792 by: Arkadi.Kosmynin.csiro.au
>
>
>
>
> -- Forwarded message --
> From: 
> To: 
> Cc:
> Date: Tue, 27 Sep 2016 07:00:18 +
> Subject: Arch 1.9.2 is available
> Hello,
>
> I am announcing release of Arch 1.9.2, based on Nutch 1.9.
>
> Arch is a free, open source extension of Nutch designed for indexing and
> searching of intranets. Many features have been added that make this task
> easier and deliver high precision search results.
>
> For details and downloads, please see Arch home page:
>
> http://www.atnf.csiro.au/computing/software/arch/
>
> You may know that Google Search Appliance is being discontinued. See, for
> example, http://fortune.com/2016/02/04/google-ends-search-appliance/. If
> you need a replacement, you may want to try Arch. It is at least comparable
> to GSA in terms of search quality. See more in this article:
>
> http://www.atnf.csiro.au/computing/software/arch/ArchVsGSA.pdf
>
> Regards,
>
> Arkadi Kosmynin
>
>
>


-- 
http://home.apache.org/~lewismc/
@hectorMcSpector
http://www.linkedin.com/in/lmcgibbney


Custom options in nutch crawl script

2016-09-29 Thread Sachin Shaju
I was trying to pass custom options to the *bin/crawl* script and ran into
an issue. I passed a custom property in my crawl command to ignore external
outlinks, like this:

*bin/crawl -i -D elastic.index=test -D db.ignore.external.links=true urls/ CrawlTest/ 3*

But this does not work. When I set the same property in nutch-site.xml, it
works.

Then I tried a custom property to index data into a specific Elasticsearch
index, other than the one given in nutch-site.xml, again as a Java option
(-D) to bin/crawl. To my surprise, that works.
The command I used:

*bin/crawl -i -D elastic.index=test urls/ CrawlTest/ 3*

So I would like to know why my first command didn't work. Am I missing
anything? Please help.
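
For anyone trying to reproduce this, the two invocations differ only in the extra -D pair. The sketch below builds both argument lists so they can be run and compared side by side. build_crawl_command is a hypothetical helper, not part of Nutch; it only mirrors the argument order of the two commands above and says nothing about how bin/crawl handles the properties internally.

```python
import shlex


def build_crawl_command(seed_dir, crawl_dir, rounds, properties=None, index=True):
    """Return a bin/crawl argument list in the order used in this message.

    Hypothetical helper for illustration; it only assembles arguments.
    """
    args = ["bin/crawl"]
    if index:
        args.append("-i")
    for name, value in (properties or {}).items():
        # Each property becomes its own "-D name=value" pair.
        args += ["-D", f"{name}={value}"]
    args += [seed_dir, crawl_dir, str(rounds)]
    return args


# The invocation reported as working.
working = build_crawl_command("urls/", "CrawlTest/", 3, {"elastic.index": "test"})

# The invocation reported as not taking effect.
failing = build_crawl_command(
    "urls/", "CrawlTest/", 3,
    {"elastic.index": "test", "db.ignore.external.links": "true"},
)

if __name__ == "__main__":
    print(shlex.join(working))
    print(shlex.join(failing))
```

Either list could be handed to subprocess.run from the Nutch runtime directory to launch the crawl; that step is left out of the sketch.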

Regards,
Sachin Shaju

sachi...@mstack.com
+919539887554



Nutch in production

2016-09-29 Thread Sachin Shaju
Hi,
   I have been experimenting with some crawl cycles in Nutch and would like to
set up a distributed crawl environment. But I wonder how I can trigger Nutch for
incoming crawl requests in a production system. I read about the Nutch REST
API. Is that the real option that I have? Or can I run Nutch as a
continuously running distributed server some other way?

My preferred Nutch version is 1.12.

Regards,
Sachin Shaju

sachi...@mstack.com
+919539887554



How to run nutch server on distributed environment

2016-09-29 Thread Sachin Shaju
Hi,

I have tested running Nutch in server mode by starting it *locally* with the
bin/nutch startserver command. Now I wonder whether I can start Nutch in
*server mode* on top of a Hadoop cluster (in a distributed environment) and
submit crawl requests to the server using the Nutch REST API?
Please help.

Regards,
Sachin Shaju

sachi...@mstack.com
+919539887554
