Re: How to read metadata/content of an URL in URLFilter?

2015-02-22 Thread Renxia Wang
I log the instance id and get the result:

2015-02-22 21:42:15,972 INFO  exactdup.ExactDupURLFilter - URlFilter ID:
423250256
2015-02-22 21:42:24,782 INFO  exactdup.ExactDupURLFilter - URlFilter ID:
828433560
2015-02-22 21:42:24,795 INFO  exactdup.ExactDupURLFilter - URlFilter ID:
828433560
2015-02-22 21:42:24,804 INFO  exactdup.ExactDupURLFilter - URlFilter ID:
828433560
...
2015-02-22 21:42:25,039 INFO  exactdup.ExactDupURLFilter - URlFilter ID:
828433560
2015-02-22 21:42:25,041 INFO  exactdup.ExactDupURLFilter - URlFilter ID:
828433560
2015-02-22 21:42:28,282 INFO  exactdup.ExactDupURLFilter - URlFilter ID:
1240209240
2015-02-22 21:42:28,292 INFO  exactdup.ExactDupURLFilter - URlFilter ID:
1240209240
...
2015-02-22 21:42:28,487 INFO  exactdup.ExactDupURLFilter - URlFilter ID:
1240209240
2015-02-22 21:42:28,489 INFO  exactdup.ExactDupURLFilter - URlFilter ID:
1240209240
2015-02-22 21:42:43,984 INFO  exactdup.ExactDupURLFilter - URlFilter ID:
1818924295
2015-02-22 21:42:44,090 INFO  exactdup.ExactDupURLFilter - URlFilter ID:
1818924295
...
2015-02-22 21:42:53,404 INFO  exactdup.ExactDupURLFilter - URlFilter ID:
1818924295
2015-02-22 21:44:08,533 INFO  exactdup.ExactDupURLFilter - URlFilter ID:
908006650
2015-02-22 21:44:08,544 INFO  exactdup.ExactDupURLFilter - URlFilter ID:
908006650
...
2015-02-22 21:44:10,418 INFO  exactdup.ExactDupURLFilter - URlFilter ID:
908006650
2015-02-22 21:44:10,420 INFO  exactdup.ExactDupURLFilter - URlFilter ID:
908006650
2015-02-22 21:44:14,467 INFO  exactdup.ExactDupURLFilter - URlFilter ID:
619451848
2015-02-22 21:44:14,478 INFO  exactdup.ExactDupURLFilter - URlFilter ID:
619451848
...
2015-02-22 21:44:15,643 INFO  exactdup.ExactDupURLFilter - URlFilter ID:
619451848
2015-02-22 21:44:15,644 INFO  exactdup.ExactDupURLFilter - URlFilter ID:
619451848
2015-02-22 21:44:26,189 INFO  exactdup.ExactDupURLFilter - URlFilter ID:
1343455839
2015-02-22 21:44:28,501 INFO  exactdup.ExactDupURLFilter - URlFilter ID:
1343455839
...
2015-02-22 21:45:29,707 INFO  exactdup.ExactDupURLFilter - URlFilter ID:
1343455839

As the URL filters are called in the injector and the CrawlDb update, I grep:
➜  local git:(trunk) ✗ grep 'Injector: starting at' logs/hadoop.log
2015-02-22 21:42:14,896 INFO  crawl.Injector - Injector: starting at 2015-02-22 21:42:14

This means URLFilter ID 423250256 is the one created in the injector.

➜  local git:(trunk) ✗ grep 'CrawlDb update: starting at' logs/hadoop.log
2015-02-22 21:42:25,951 INFO  crawl.CrawlDb - CrawlDb update: starting at
2015-02-22 21:42:25
2015-02-22 21:44:11,208 INFO  crawl.CrawlDb - CrawlDb update: starting at
2015-02-22 21:44:11

Here is the confusing part: there are six unique URLFilter IDs after the
injector, while there are only two CrawlDb updates.
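
A minimal sketch of how such instance IDs can be logged (the Nutch 1.x
URLFilter interface and SLF4J logging are assumed; the class name and log
message mirror the output above, the rest is illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.net.URLFilter;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class ExactDupURLFilter implements URLFilter {
  private static final Logger LOG =
      LoggerFactory.getLogger(ExactDupURLFilter.class);
  private Configuration conf;

  @Override
  public String filter(String urlString) {
    // System.identityHashCode tells JVM instances apart regardless of any
    // equals/hashCode overrides, so a repeated ID means a reused instance.
    LOG.info("URLFilter ID: " + System.identityHashCode(this));
    return urlString; // accept every URL; we only want the instance id
  }

  @Override
  public void setConf(Configuration conf) { this.conf = conf; }

  @Override
  public Configuration getConf() { return conf; }
}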

On Sun, Feb 22, 2015 at 9:24 PM, Mattmann, Chris A (3980) <
chris.a.mattm...@jpl.nasa.gov> wrote:

> Cool, good test. I thought the Nutch plugin system cached instances
> of plugins - I am not sure if it creates a new one each time. Are you
> sure you don’t have the same URLFilter instance, it’s just called on
> different datasets and thus produces different counts?
>
> Either way, you should simply proceed with the filters in whatever
> form they are working (cached or not).
>
> ++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattm...@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++
>
>
>
>
>
>
> -Original Message-
> From: Renxia Wang 
> Reply-To: "dev@nutch.apache.org" 
> Date: Sunday, February 22, 2015 at 9:16 PM
> To: "dev@nutch.apache.org" 
> Subject: Re: How to read metadata/content of an URL in URLFilter?
>
> >I just added a counter in my URLFilter, and it proves that the URLFilter
> >instances in each fetching cycle are different.
> >
> >
> >Sample logs:
> >2015-02-22 21:07:10,636 INFO  exactdup.ExactDupURLFilter - Processed 69
> >links
> >2015-02-22 21:07:10,638 INFO  exactdup.ExactDupURLFilter - Processed 70
> >links
> >2015-02-22 21:07:10,640 INFO  exactdup.ExactDupURLFilter - Processed 71
> >links
> >2015-02-22 21:07:10,641 INFO  exactdup.ExactDupURLFilter - Processed 72
> >links
> >2015-02-22 21:07:10,643 INFO  exactdup.ExactDupURLFilter - Processed 73
> >links
> >2015-02-22 21:07:10,645 INFO  exactdup.ExactDupURLFilter - Processed 74
> >links
> >2015-02-22 21:07:10,647 INFO  exactdup.ExactDupURLFilter - Processed 75
> >links
> >2015-02-22 21:07:10,649 INFO  exactdup.ExactDupURLFilter - Processed 76
> >links
> >2015-02-22 21:07:10,650 INFO  exactdup.ExactDupURLFilter - Processed 77
> >links
> >2015-0

Subscribe to the mailing list

2015-02-22 Thread Chetan Vazirabadkar
Thanks,
Chetan Vazirabadkar.


Re: How to read metadata/content of an URL in URLFilter?

2015-02-22 Thread Mattmann, Chris A (3980)
Cool, good test. I thought the Nutch plugin system cached instances
of plugins - I am not sure if it creates a new one each time. Are you
sure you don’t have the same URLFilter instance, it’s just called on
different datasets and thus produces different counts?

Either way, you should simply proceed with the filters in whatever
form they are working (cached or not).

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++






-Original Message-
From: Renxia Wang 
Reply-To: "dev@nutch.apache.org" 
Date: Sunday, February 22, 2015 at 9:16 PM
To: "dev@nutch.apache.org" 
Subject: Re: How to read metadata/content of an URL in URLFilter?

>I just added a counter in my URLFilter, and it proves that the URLFilter
>instances in each fetching cycle are different.
>
>
>Sample logs:
>2015-02-22 21:07:10,636 INFO  exactdup.ExactDupURLFilter - Processed 69
>links
>2015-02-22 21:07:10,638 INFO  exactdup.ExactDupURLFilter - Processed 70
>links
>2015-02-22 21:07:10,640 INFO  exactdup.ExactDupURLFilter - Processed 71
>links
>2015-02-22 21:07:10,641 INFO  exactdup.ExactDupURLFilter - Processed 72
>links
>2015-02-22 21:07:10,643 INFO  exactdup.ExactDupURLFilter - Processed 73
>links
>2015-02-22 21:07:10,645 INFO  exactdup.ExactDupURLFilter - Processed 74
>links
>2015-02-22 21:07:10,647 INFO  exactdup.ExactDupURLFilter - Processed 75
>links
>2015-02-22 21:07:10,649 INFO  exactdup.ExactDupURLFilter - Processed 76
>links
>2015-02-22 21:07:10,650 INFO  exactdup.ExactDupURLFilter - Processed 77
>links
>2015-02-22 21:07:13,835 INFO  exactdup.ExactDupURLFilter - Processed 1
>links
>2015-02-22 21:07:13,850 INFO  exactdup.ExactDupURLFilter - Processed 2
>links
>2015-02-22 21:07:13,865 INFO  exactdup.ExactDupURLFilter - Processed 3
>links
>2015-02-22 21:07:13,878 INFO  exactdup.ExactDupURLFilter - Processed 4
>links
>2015-02-22 21:07:13,889 INFO  exactdup.ExactDupURLFilter - Processed 5
>links
>2015-02-22 21:07:13,899 INFO  exactdup.ExactDupURLFilter - Processed 6
>links
>
>
>
>Not sure if it is configurable?
>
>
>
>
>On Sun, Feb 22, 2015 at 8:56 PM, Mattmann, Chris A (3980)
> wrote:
>
>That’s one way - for sure - but what I was implying is that
>you can train (read: feed data into) your model (read: algorithm)
>using previously crawled information. So, no I wasn’t implying
>machine learning.
>
>++
>Chris Mattmann, Ph.D.
>Chief Architect
>Instrument Software and Science Data Systems Section (398)
>NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>Office: 168-519, Mailstop: 168-527
>Email: chris.a.mattm...@nasa.gov
>WWW:  http://sunset.usc.edu/~mattmann/
>++
>Adjunct Associate Professor, Computer Science Department
>University of Southern California, Los Angeles, CA 90089 USA
>++
>
>
>
>
>
>
>-Original Message-
>From: Renxia Wang 
>Reply-To: "dev@nutch.apache.org" 
>Date: Sunday, February 22, 2015 at 8:47 PM
>To: "dev@nutch.apache.org" 
>Subject: Re: How to read metadata/content of an URL in URLFilter?
>
>>Hi Prof Mattmann,
>>
>>
>>You are saying "train" and "model"; are we expected to use machine
>>learning algorithms to train a model for duplicate detection?
>>
>>
>>Thanks,
>>
>>
>>Renxia
>>
>>
>>On Sun, Feb 22, 2015 at 8:39 PM, Mattmann, Chris A (3980)
>> wrote:
>>
>>There is nothing stating in your assignment that you can’t
>>use *previously* crawled data to train your model - you
>>should have at least 2 full sets of this.
>>
>>Cheers,
>>Chris
>>
>>
>>++
>>Chris Mattmann, Ph.D.
>>Chief Architect
>>Instrument Software and Science Data Systems Section (398)
>>NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>Office: 168-519, Mailstop: 168-527
>>Email: chris.a.mattm...@nasa.gov
>>WWW:  http://sunset.usc.edu/~mattmann/
>>++
>>Adjunct Associate Professor, Computer Science Department
>>University of Southern California, Los Angeles, CA 90089 USA
>>++
>>
>>
>>
>>
>>
>>
>>-Original Message-
>>From: Majisha Parambath 
>>Reply-To: "dev@nutch.apache.org" 
>>Date: Sunday, February 22, 2015 at 8:30 PM
>>To: dev 
>>Subject: Re: How to read metadata/content of an URL in URLFilter?
>>
>>>
>>>
>>>
>>>My understanding is th

Re: How to read metadata/content of an URL in URLFilter?

2015-02-22 Thread Renxia Wang
I just added a counter in my URLFilter, and it proves that the URLFilter
instances in each fetching cycle are different.

Sample logs:
2015-02-22 21:07:10,636 INFO  exactdup.ExactDupURLFilter - Processed 69
links
2015-02-22 21:07:10,638 INFO  exactdup.ExactDupURLFilter - Processed 70
links
2015-02-22 21:07:10,640 INFO  exactdup.ExactDupURLFilter - Processed 71
links
2015-02-22 21:07:10,641 INFO  exactdup.ExactDupURLFilter - Processed 72
links
2015-02-22 21:07:10,643 INFO  exactdup.ExactDupURLFilter - Processed 73
links
2015-02-22 21:07:10,645 INFO  exactdup.ExactDupURLFilter - Processed 74
links
2015-02-22 21:07:10,647 INFO  exactdup.ExactDupURLFilter - Processed 75
links
2015-02-22 21:07:10,649 INFO  exactdup.ExactDupURLFilter - Processed 76
links
2015-02-22 21:07:10,650 INFO  exactdup.ExactDupURLFilter - Processed 77
links
2015-02-22 21:07:13,835 INFO  exactdup.ExactDupURLFilter - Processed 1 links
2015-02-22 21:07:13,850 INFO  exactdup.ExactDupURLFilter - Processed 2 links
2015-02-22 21:07:13,865 INFO  exactdup.ExactDupURLFilter - Processed 3 links
2015-02-22 21:07:13,878 INFO  exactdup.ExactDupURLFilter - Processed 4 links
2015-02-22 21:07:13,889 INFO  exactdup.ExactDupURLFilter - Processed 5 links
2015-02-22 21:07:13,899 INFO  exactdup.ExactDupURLFilter - Processed 6 links

Not sure if it is configurable?
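
The counter itself can be a trivial fragment inside the filter method (a
sketch; the field name is illustrative, and an AtomicInteger keeps the count
safe if the filter is invoked from several threads):

import java.util.concurrent.atomic.AtomicInteger;

private final AtomicInteger processed = new AtomicInteger();

@Override
public String filter(String urlString) {
  // A fresh URLFilter instance restarts the count at "Processed 1 links",
  // as seen in the log sample above.
  LOG.info("Processed " + processed.incrementAndGet() + " links");
  return urlString;
}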


On Sun, Feb 22, 2015 at 8:56 PM, Mattmann, Chris A (3980) <
chris.a.mattm...@jpl.nasa.gov> wrote:

> That’s one way - for sure - but what I was implying is that
> you can train (read: feed data into) your model (read: algorithm)
> using previously crawled information. So, no I wasn’t implying
> machine learning.
>
> ++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattm...@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++
>
>
>
>
>
>
> -Original Message-
> From: Renxia Wang 
> Reply-To: "dev@nutch.apache.org" 
> Date: Sunday, February 22, 2015 at 8:47 PM
> To: "dev@nutch.apache.org" 
> Subject: Re: How to read metadata/content of an URL in URLFilter?
>
> >Hi Prof Mattmann,
> >
> >
> >You are saying "train" and "model"; are we expected to use machine
> >learning algorithms to train a model for duplicate detection?
> >
> >
> >Thanks,
> >
> >
> >Renxia
> >
> >
> >On Sun, Feb 22, 2015 at 8:39 PM, Mattmann, Chris A (3980)
> > wrote:
> >
> >There is nothing stating in your assignment that you can’t
> >use *previously* crawled data to train your model - you
> >should have at least 2 full sets of this.
> >
> >Cheers,
> >Chris
> >
> >
> >++
> >Chris Mattmann, Ph.D.
> >Chief Architect
> >Instrument Software and Science Data Systems Section (398)
> >NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> >Office: 168-519, Mailstop: 168-527
> >Email: chris.a.mattm...@nasa.gov
> >WWW:  http://sunset.usc.edu/~mattmann/
> >++
> >Adjunct Associate Professor, Computer Science Department
> >University of Southern California, Los Angeles, CA 90089 USA
> >++
> >
> >
> >
> >
> >
> >
> >-Original Message-
> >From: Majisha Parambath 
> >Reply-To: "dev@nutch.apache.org" 
> >Date: Sunday, February 22, 2015 at 8:30 PM
> >To: dev 
> >Subject: Re: How to read metadata/content of an URL in URLFilter?
> >
> >>
> >>
> >>
> >>My understanding is that the LinkDB or CrawlDB will contain the results
> >>of previously fetched and parsed pages.
> >>
> >>However, if we want to get the contents of a URL/page in the URL filtering
> >>stage (which is not yet fetched), is there any util in Nutch that we can
> >>use to fetch the contents of the page?
> >>
> >>
> >>Thanks and regards,
> >>Majisha Namath Parambath
> >>Graduate Student, M.S in Computer Science
> >>Viterbi School of Engineering
> >>University of Southern California, Los Angeles
> >>
> >>
> >>
> >>
> >>
> >>
> >>On Sun, Feb 22, 2015 at 4:53 PM, Mattmann, Chris A (3980)
> >> wrote:
> >>
> >>In the constructor of your URLFilter, why not consider passing
> >>in a NutchConfiguration object, and then reading the path to e.g,
> >>the LinkDb from the config. Then have a private member variable
> >>for the LinkDbReader (maybe static initialized for efficiency)
> >>and use that in your interface method.
> >>
> >>Cheers,
> >>Chris
> >>
> >>++
> >>Chris Mattmann, Ph.D.
> >>Chief Architect
> >>Instrument Software and Science Data Systems S

Re: How to read metadata/content of an URL in URLFilter?

2015-02-22 Thread Mattmann, Chris A (3980)
That’s one way - for sure - but what I was implying is that
you can train (read: feed data into) your model (read: algorithm)
using previously crawled information. So, no I wasn’t implying
machine learning. 

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++






-Original Message-
From: Renxia Wang 
Reply-To: "dev@nutch.apache.org" 
Date: Sunday, February 22, 2015 at 8:47 PM
To: "dev@nutch.apache.org" 
Subject: Re: How to read metadata/content of an URL in URLFilter?

>Hi Prof Mattmann,
>
>
>You are saying "train" and "model"; are we expected to use machine
>learning algorithms to train a model for duplicate detection?
>
>
>Thanks,
>
>
>Renxia
>
>
>On Sun, Feb 22, 2015 at 8:39 PM, Mattmann, Chris A (3980)
> wrote:
>
>There is nothing stating in your assignment that you can’t
>use *previously* crawled data to train your model - you
>should have at least 2 full sets of this.
>
>Cheers,
>Chris
>
>
>++
>Chris Mattmann, Ph.D.
>Chief Architect
>Instrument Software and Science Data Systems Section (398)
>NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>Office: 168-519, Mailstop: 168-527
>Email: chris.a.mattm...@nasa.gov
>WWW:  http://sunset.usc.edu/~mattmann/
>++
>Adjunct Associate Professor, Computer Science Department
>University of Southern California, Los Angeles, CA 90089 USA
>++
>
>
>
>
>
>
>-Original Message-
>From: Majisha Parambath 
>Reply-To: "dev@nutch.apache.org" 
>Date: Sunday, February 22, 2015 at 8:30 PM
>To: dev 
>Subject: Re: How to read metadata/content of an URL in URLFilter?
>
>>
>>
>>
>>My understanding is that the LinkDB or CrawlDB will contain the results
>>of previously fetched and parsed pages.
>>
>>However, if we want to get the contents of a URL/page in the URL filtering
>>stage (which is not yet fetched), is there any util in Nutch that we can
>>use to fetch the contents of the page?
>>
>>
>>Thanks and regards,
>>Majisha Namath Parambath
>>Graduate Student, M.S in Computer Science
>>Viterbi School of Engineering
>>University of Southern California, Los Angeles
>>
>>
>>
>>
>>
>>
>>On Sun, Feb 22, 2015 at 4:53 PM, Mattmann, Chris A (3980)
>> wrote:
>>
>>In the constructor of your URLFilter, why not consider passing
>>in a NutchConfiguration object, and then reading the path to e.g,
>>the LinkDb from the config. Then have a private member variable
>>for the LinkDbReader (maybe static initialized for efficiency)
>>and use that in your interface method.
>>
>>Cheers,
>>Chris
>>
>>++
>>Chris Mattmann, Ph.D.
>>Chief Architect
>>Instrument Software and Science Data Systems Section (398)
>>NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>Office: 168-519, Mailstop: 168-527
>>Email: chris.a.mattm...@nasa.gov
>>WWW:  http://sunset.usc.edu/~mattmann/
>>++
>>Adjunct Associate Professor, Computer Science Department
>>University of Southern California, Los Angeles, CA 90089 USA
>>++
>>
>>
>>
>>
>>
>>
>>-Original Message-
>>From: Renxia Wang 
>>Reply-To: "dev@nutch.apache.org" 
>>Date: Sunday, February 22, 2015 at 3:36 PM
>>To: "dev@nutch.apache.org" 
>>Subject: How to read metadata/content of an URL in URLFilter?
>>
>>>
>>>
>>>
>>>Hi
>>>
>>>
>>>I want to develop a URLFilter which takes a URL and its metadata, or
>>>even the fetched content, then uses some duplicate detection algorithms
>>>to determine if it is a duplicate of any URL in the batch. However, the
>>>only parameter passed into the URLFilter is the URL. Is it possible to
>>>get the data I want for that input URL in the URLFilter?
>>>
>>>
>>>Thanks,
>>>
>>>
>>>Zhique
>>
>>
>>
>>
>>
>>
>>
>>
>>
>
>
>
>
>
>
>
>



Re: How to read metadata/content of an URL in URLFilter?

2015-02-22 Thread Renxia Wang
Hi Prof Mattmann,

You are saying "train" and "model"; are we expected to use machine learning
algorithms to train a model for duplicate detection?

Thanks,

Renxia

On Sun, Feb 22, 2015 at 8:39 PM, Mattmann, Chris A (3980) <
chris.a.mattm...@jpl.nasa.gov> wrote:

> There is nothing stating in your assignment that you can’t
> use *previously* crawled data to train your model - you
> should have at least 2 full sets of this.
>
> Cheers,
> Chris
>
>
> ++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattm...@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++
>
>
>
>
>
>
> -Original Message-
> From: Majisha Parambath 
> Reply-To: "dev@nutch.apache.org" 
> Date: Sunday, February 22, 2015 at 8:30 PM
> To: dev 
> Subject: Re: How to read metadata/content of an URL in URLFilter?
>
> >
> >
> >
> >My understanding is that the LinkDB or CrawlDB will contain the results
> >of previously fetched and parsed pages.
> >
> >However, if we want to get the contents of a URL/page in the URL filtering
> >stage (which is not yet fetched), is there any util in Nutch that we can
> >use to fetch the contents of the page?
> >
> >
> >Thanks and regards,
> >Majisha Namath Parambath
> >Graduate Student, M.S in Computer Science
> >Viterbi School of Engineering
> >University of Southern California, Los Angeles
> >
> >
> >
> >
> >
> >
> >On Sun, Feb 22, 2015 at 4:53 PM, Mattmann, Chris A (3980)
> > wrote:
> >
> >In the constructor of your URLFilter, why not consider passing
> >in a NutchConfiguration object, and then reading the path to e.g,
> >the LinkDb from the config. Then have a private member variable
> >for the LinkDbReader (maybe static initialized for efficiency)
> >and use that in your interface method.
> >
> >Cheers,
> >Chris
> >
> >++
> >Chris Mattmann, Ph.D.
> >Chief Architect
> >Instrument Software and Science Data Systems Section (398)
> >NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> >Office: 168-519, Mailstop: 168-527
> >Email: chris.a.mattm...@nasa.gov
> >WWW:  http://sunset.usc.edu/~mattmann/
> >++
> >Adjunct Associate Professor, Computer Science Department
> >University of Southern California, Los Angeles, CA 90089 USA
> >++
> >
> >
> >
> >
> >
> >
> >-Original Message-
> >From: Renxia Wang 
> >Reply-To: "dev@nutch.apache.org" 
> >Date: Sunday, February 22, 2015 at 3:36 PM
> >To: "dev@nutch.apache.org" 
> >Subject: How to read metadata/content of an URL in URLFilter?
> >
> >>
> >>
> >>
> >>Hi
> >>
> >>
> >>I want to develop a URLFilter which takes a URL and its metadata, or
> >>even the fetched content, then uses some duplicate detection algorithms to
> >>determine if it is a duplicate of any URL in the batch. However, the only
> >>parameter passed into the URLFilter is the URL. Is it possible to get the
> >>data I want for that input URL in the URLFilter?
> >>
> >>
> >>Thanks,
> >>
> >>
> >>Zhique
> >
> >
> >
> >
> >
> >
> >
> >
> >
>
>


Re: How to read metadata/content of an URL in URLFilter?

2015-02-22 Thread Renxia Wang
Hi Majisha,

From the comments in the URLFilter interface source code, the URL filter is
called in the injector and the CrawlDb updater, which means you do have the
crawled data for the URL you are processing in the filter.
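
For context, the URLFilter extension point itself is minimal; paraphrased
from the Nutch 1.x source (exact superinterfaces vary by version):

public interface URLFilter extends Configurable {
  String X_POINT_ID = URLFilter.class.getName();

  /** Return the URL (possibly modified) to accept it, or null to reject it. */
  String filter(String urlString);
}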
You may want to take a look at this article, which illustrates the workflow
of Nutch, although it is for Nutch 1.4:
http://www.atlantbh.com/apache-nutch-overview/

Thanks,

Renxia

On Sun, Feb 22, 2015 at 8:30 PM, Majisha Parambath  wrote:

>
>
> My understanding is that the LinkDB or CrawlDB will contain the results of
> previously fetched and parsed pages.
> However, if we want to get the contents of a URL/page in the URL filtering
> stage (*which is not yet fetched*), is there any util in Nutch that we
> can use to fetch the contents of the page?
>
> Thanks and regards,
> *Majisha Namath Parambath*
> *Graduate Student, M.S in Computer Science*
> *Viterbi School of Engineering*
> *University of Southern California, Los Angeles*
>
> On Sun, Feb 22, 2015 at 4:53 PM, Mattmann, Chris A (3980) <
> chris.a.mattm...@jpl.nasa.gov> wrote:
>
>> In the constructor of your URLFilter, why not consider passing
>> in a NutchConfiguration object, and then reading the path to e.g,
>> the LinkDb from the config. Then have a private member variable
>> for the LinkDbReader (maybe static initialized for efficiency)
>> and use that in your interface method.
>>
>> Cheers,
>> Chris
>>
>> ++
>> Chris Mattmann, Ph.D.
>> Chief Architect
>> Instrument Software and Science Data Systems Section (398)
>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> Office: 168-519, Mailstop: 168-527
>> Email: chris.a.mattm...@nasa.gov
>> WWW:  http://sunset.usc.edu/~mattmann/
>> ++
>> Adjunct Associate Professor, Computer Science Department
>> University of Southern California, Los Angeles, CA 90089 USA
>> ++
>>
>>
>>
>>
>>
>>
>> -Original Message-
>> From: Renxia Wang 
>> Reply-To: "dev@nutch.apache.org" 
>> Date: Sunday, February 22, 2015 at 3:36 PM
>> To: "dev@nutch.apache.org" 
>> Subject: How to read metadata/content of an URL in URLFilter?
>>
>> >
>> >
>> >
>> >Hi
>> >
>> >
>> >I want to develop a URLFilter which takes a URL and its metadata, or
>> >even the fetched content, then uses some duplicate detection algorithms to
>> >determine if it is a duplicate of any URL in the batch. However, the only
>> >parameter passed into the URLFilter is the URL. Is it possible to get the
>> >data I want for that input URL in the URLFilter?
>> >
>> >
>> >Thanks,
>> >
>> >
>> >Zhique
>>
>>
>


Re: How to read metadata/content of an URL in URLFilter?

2015-02-22 Thread Mattmann, Chris A (3980)
There is nothing stating in your assignment that you can’t
use *previously* crawled data to train your model - you
should have at least 2 full sets of this.

Cheers,
Chris


++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++






-Original Message-
From: Majisha Parambath 
Reply-To: "dev@nutch.apache.org" 
Date: Sunday, February 22, 2015 at 8:30 PM
To: dev 
Subject: Re: How to read metadata/content of an URL in URLFilter?

>
>
>
>My understanding is that the LinkDB or CrawlDB will contain the results
>of previously fetched and parsed pages.
>
>However, if we want to get the contents of a URL/page in the URL filtering
>stage (which is not yet fetched), is there any util in Nutch that we can
>use to fetch the contents of the page?
>
>
>Thanks and regards,
>Majisha Namath Parambath
>Graduate Student, M.S in Computer Science
>Viterbi School of Engineering
>University of Southern California, Los Angeles
>
>
>
>
>
>
>On Sun, Feb 22, 2015 at 4:53 PM, Mattmann, Chris A (3980)
> wrote:
>
>In the constructor of your URLFilter, why not consider passing
>in a NutchConfiguration object, and then reading the path to e.g,
>the LinkDb from the config. Then have a private member variable
>for the LinkDbReader (maybe static initialized for efficiency)
>and use that in your interface method.
>
>Cheers,
>Chris
>
>++
>Chris Mattmann, Ph.D.
>Chief Architect
>Instrument Software and Science Data Systems Section (398)
>NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>Office: 168-519, Mailstop: 168-527
>Email: chris.a.mattm...@nasa.gov
>WWW:  http://sunset.usc.edu/~mattmann/
>++
>Adjunct Associate Professor, Computer Science Department
>University of Southern California, Los Angeles, CA 90089 USA
>++
>
>
>
>
>
>
>-Original Message-
>From: Renxia Wang 
>Reply-To: "dev@nutch.apache.org" 
>Date: Sunday, February 22, 2015 at 3:36 PM
>To: "dev@nutch.apache.org" 
>Subject: How to read metadata/content of an URL in URLFilter?
>
>>
>>
>>
>>Hi
>>
>>
>>I want to develop a URLFilter which takes a URL and its metadata, or
>>even the fetched content, then uses some duplicate detection algorithms to
>>determine if it is a duplicate of any URL in the batch. However, the only
>>parameter passed into the URLFilter is the URL. Is it possible to get the
>>data I want for that input URL in the URLFilter?
>>
>>
>>Thanks,
>>
>>
>>Zhique
>
>
>
>
>
>
>
>
>



Re: How to read metadata/content of an URL in URLFilter?

2015-02-22 Thread Majisha Parambath
My understanding is that the LinkDB or CrawlDB will contain the results of
previously fetched and parsed pages.
However, if we want to get the contents of a URL/page in the URL filtering
stage (*which is not yet fetched*), is there any util in Nutch that we
can use to fetch the contents of the page?

Thanks and regards,
*Majisha Namath Parambath*
*Graduate Student, M.S in Computer Science*
*Viterbi School of Engineering*
*University of Southern California, Los Angeles*

On Sun, Feb 22, 2015 at 4:53 PM, Mattmann, Chris A (3980) <
chris.a.mattm...@jpl.nasa.gov> wrote:

> In the constructor of your URLFilter, why not consider passing
> in a NutchConfiguration object, and then reading the path to e.g,
> the LinkDb from the config. Then have a private member variable
> for the LinkDbReader (maybe static initialized for efficiency)
> and use that in your interface method.
>
> Cheers,
> Chris
>
> ++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattm...@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++
>
>
>
>
>
>
> -Original Message-
> From: Renxia Wang 
> Reply-To: "dev@nutch.apache.org" 
> Date: Sunday, February 22, 2015 at 3:36 PM
> To: "dev@nutch.apache.org" 
> Subject: How to read metadata/content of an URL in URLFilter?
>
> >
> >
> >
> >Hi
> >
> >
> >I want to develop a URLFilter which takes a URL and its metadata, or
> >even the fetched content, then uses some duplicate detection algorithms to
> >determine if it is a duplicate of any URL in the batch. However, the only
> >parameter passed into the URLFilter is the URL. Is it possible to get the
> >data I want for that input URL in the URLFilter?
> >
> >
> >Thanks,
> >
> >
> >Zhique
>
>


[jira] [Commented] (NUTCH-1946) Upgrade to Gora 0.6

2015-02-22 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332506#comment-14332506
 ] 

Lewis John McGibbney commented on NUTCH-1946:
-

Right now I am bumping into the following issue!
{code}
Testsuite: org.apache.nutch.fetcher.TestFetcher
Tests run: 2, Failures: 0, Errors: 2, Skipped: 1, Time elapsed: 0.594 sec
- Standard Output ---
2015-02-22 18:49:28,141 WARN  util.NativeCodeLoader (NativeCodeLoader.java:(62)) - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
-  ---

Testcase: testAgentNameCheck took 0.575 sec
Caused an ERROR
Not implemented by the DistributedFileSystem FileSystem implementation
java.lang.UnsupportedOperationException: Not implemented by the DistributedFileSystem FileSystem implementation
    at org.apache.hadoop.fs.FileSystem.getScheme(FileSystem.java:214)
    at org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2559)
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2569)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2586)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2625)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2607)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:167)
    at org.apache.nutch.util.AbstractNutchTest.setUp(AbstractNutchTest.java:42)
    at org.apache.nutch.fetcher.TestFetcher.setUp(TestFetcher.java:54)

Caused an ERROR
null
java.lang.NullPointerException
    at org.apache.nutch.fetcher.TestFetcher.tearDown(TestFetcher.java:64)

Testcase: testFetch took 0 sec
SKIPPED: Temporarily diable until NUTCH-1572 is addressed.
{code}

> Upgrade to Gora 0.6
> ---
>
> Key: NUTCH-1946
> URL: https://issues.apache.org/jira/browse/NUTCH-1946
> Project: Nutch
>  Issue Type: Improvement
>  Components: storage
>Affects Versions: 2.3.1
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 2.3.1
>
>
> Apache Gora was released recently.
> We should upgrade before pushing Nutch 2.3.1 as it will come in very handy 
> for the new Docker containers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-1946) Upgrade to Gora 0.6

2015-02-22 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332505#comment-14332505
 ] 

Lewis John McGibbney commented on NUTCH-1946:
-

[Ongoing discussion on Gora shims 
support|http://www.mail-archive.com/dev%40gora.apache.org/msg05752.html]
I do not particularly want to upgrade to Hadoop 2.5.2 in this issue as well as 
Gora 0.6. So ideally we can leverage the Hadoop 1.2.1, Nutch 2.3.1, and Gora 0.6 
combination.

> Upgrade to Gora 0.6
> ---
>
> Key: NUTCH-1946
> URL: https://issues.apache.org/jira/browse/NUTCH-1946
> Project: Nutch
>  Issue Type: Improvement
>  Components: storage
>Affects Versions: 2.3.1
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 2.3.1
>
>
> Apache Gora was released recently.
> We should upgrade before pushing Nutch 2.3.1 as it will come in very handy 
> for the new Docker containers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-1928) Indexing filter of documents by the MIME type

2015-02-22 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332504#comment-14332504
 ] 

Hudson commented on NUTCH-1928:
---

SUCCESS: Integrated in Nutch-trunk #2986 (See 
[https://builds.apache.org/job/Nutch-trunk/2986/])
NUTCH-1928 Indexing filter of documents by the MIME type (jorgelbg: 
http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1661600)
* /nutch/trunk/build.xml
* /nutch/trunk/conf/nutch-default.xml
* /nutch/trunk/default.properties
* /nutch/trunk/src/plugin/build.xml
* /nutch/trunk/src/plugin/mimetype-filter
* /nutch/trunk/src/plugin/mimetype-filter/build.xml
* /nutch/trunk/src/plugin/mimetype-filter/ivy.xml
* /nutch/trunk/src/plugin/mimetype-filter/plugin.xml
* /nutch/trunk/src/plugin/mimetype-filter/sample
* /nutch/trunk/src/plugin/mimetype-filter/sample/allow-images.txt
* /nutch/trunk/src/plugin/mimetype-filter/sample/block-html.txt
* /nutch/trunk/src/plugin/mimetype-filter/src
* /nutch/trunk/src/plugin/mimetype-filter/src/java
* /nutch/trunk/src/plugin/mimetype-filter/src/java/org
* /nutch/trunk/src/plugin/mimetype-filter/src/java/org/apache
* /nutch/trunk/src/plugin/mimetype-filter/src/java/org/apache/nutch
* /nutch/trunk/src/plugin/mimetype-filter/src/java/org/apache/nutch/indexer
* 
/nutch/trunk/src/plugin/mimetype-filter/src/java/org/apache/nutch/indexer/filter
* 
/nutch/trunk/src/plugin/mimetype-filter/src/java/org/apache/nutch/indexer/filter/MimeTypeIndexingFilter.java
* /nutch/trunk/src/plugin/mimetype-filter/src/test
* /nutch/trunk/src/plugin/mimetype-filter/src/test/org
* /nutch/trunk/src/plugin/mimetype-filter/src/test/org/apache
* /nutch/trunk/src/plugin/mimetype-filter/src/test/org/apache/nutch
* /nutch/trunk/src/plugin/mimetype-filter/src/test/org/apache/nutch/indexer
* 
/nutch/trunk/src/plugin/mimetype-filter/src/test/org/apache/nutch/indexer/filter
* 
/nutch/trunk/src/plugin/mimetype-filter/src/test/org/apache/nutch/indexer/filter/MimeTypeIndexingFilterTest.java


> Indexing filter of documents by the MIME type
> -
>
> Key: NUTCH-1928
> URL: https://issues.apache.org/jira/browse/NUTCH-1928
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer, plugin
>Reporter: Jorge Luis Betancourt Gonzalez
>Assignee: Jorge Luis Betancourt Gonzalez
>  Labels: filter, mime-type, plugin
> Fix For: 1.10
>
> Attachments: NUTCH-1928v4.patch, NUTCH-1928v5.patch, 
> NUTCH-1928v6.patch, mimetype-patch-v3.patch
>
>
> This allows filtering of the indexed documents by the MIME type of the 
> crawled content. Basically this will allow you to restrict the MIME types of 
> the contents that will be stored in the Solr/Elasticsearch index without the 
> need to restrict the crawling/parsing process, so there is no need to use the 
> URLFilter plugin family. Also this addresses one particular corner case where 
> certain URLs don't have any format to filter on, such as some RSS feeds 
> (http://www.awesomesite.com/feed), which would otherwise end up in your index 
> mixed with all your HTML content.
> A configuration file can be specified in the {{mimetype.filter.file}} property 
> in {{nutch-site.xml}}. This file uses the same format as the 
> {{urlfilter-suffix}} plugin. If no {{mimetype.filter.file}} key is found, an 
> {{allow all}} policy is used instead, so all your crawled documents will be 
> indexed.
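
For illustration, enabling the filter might look like this in {{nutch-site.xml}} (values are examples only; {{mimetype-filter}} also has to be added to {{plugin.includes}}):
{code}
<property>
  <name>mimetype.filter.file</name>
  <!-- e.g. one of the sample rule files shipped with the plugin -->
  <value>allow-images.txt</value>
</property>
{code}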



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-1928) Indexing filter of documents by the MIME type

2015-02-22 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332502#comment-14332502
 ] 

Lewis John McGibbney commented on NUTCH-1928:
-

Fantastic [~jorgelbg]
If you could do us a HUGE favour and resolve the issue stating the 
commit revision when closing off an issue, that would be ideal.
Happy 1st commit to Nutch :)

> Indexing filter of documents by the MIME type
> -
>
> Key: NUTCH-1928
> URL: https://issues.apache.org/jira/browse/NUTCH-1928
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer, plugin
>Reporter: Jorge Luis Betancourt Gonzalez
>Assignee: Jorge Luis Betancourt Gonzalez
>  Labels: filter, mime-type, plugin
> Fix For: 1.10
>
> Attachments: NUTCH-1928v4.patch, NUTCH-1928v5.patch, 
> NUTCH-1928v6.patch, mimetype-patch-v3.patch
>
>
> This allows filtering of the indexed documents by the MIME type of the 
> crawled content. Basically this will allow you to restrict the MIME types of 
> the contents that will be stored in the Solr/Elasticsearch index without the 
> need to restrict the crawling/parsing process, so there is no need to use the 
> URLFilter plugin family. Also this addresses one particular corner case where 
> certain URLs don't have any format to filter on, such as some RSS feeds 
> (http://www.awesomesite.com/feed), which would otherwise end up in your index 
> mixed with all your HTML content.
> A configuration file can be specified in the {{mimetype.filter.file}} property 
> in {{nutch-site.xml}}. This file uses the same format as the 
> {{urlfilter-suffix}} plugin. If no {{mimetype.filter.file}} key is found, an 
> {{allow all}} policy is used instead, so all your crawled documents will be 
> indexed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-1928) Indexing filter of documents by the MIME type

2015-02-22 Thread Jorge Luis Betancourt Gonzalez (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332500#comment-14332500
 ] 

Jorge Luis Betancourt Gonzalez commented on NUTCH-1928:
---

[~lewismc] Committed successfully ;) I've also updated the JIRA with the last 
patch to keep it in sync with the committed version. I fixed a problem when 
running {{ant test}} for the whole project (which takes ~12 mins on my laptop).

> Indexing filter of documents by the MIME type
> -
>
> Key: NUTCH-1928
> URL: https://issues.apache.org/jira/browse/NUTCH-1928
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer, plugin
>Reporter: Jorge Luis Betancourt Gonzalez
>Assignee: Jorge Luis Betancourt Gonzalez
>  Labels: filter, mime-type, plugin
> Fix For: 1.10
>
> Attachments: NUTCH-1928v4.patch, NUTCH-1928v5.patch, 
> NUTCH-1928v6.patch, mimetype-patch-v3.patch
>
>
> This allows filtering of the indexed documents by the MIME type of the 
> crawled content. Basically this will allow you to restrict the MIME types of 
> the contents that will be stored in the Solr/Elasticsearch index without the 
> need to restrict the crawling/parsing process, so there is no need to use the 
> URLFilter plugin family. Also this addresses one particular corner case where 
> certain URLs don't have any format to filter on, such as some RSS feeds 
> (http://www.awesomesite.com/feed), which would otherwise end up in your index 
> mixed with all your HTML content.
> A configuration file can be specified in the {{mimetype.filter.file}} property 
> in {{nutch-site.xml}}. This file uses the same format as the 
> {{urlfilter-suffix}} plugin. If no {{mimetype.filter.file}} key is found, an 
> {{allow all}} policy is used instead, so all your crawled documents will be 
> indexed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-1928) Indexing filter of documents by the MIME type

2015-02-22 Thread Jorge Luis Betancourt Gonzalez (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Luis Betancourt Gonzalez updated NUTCH-1928:
--
Attachment: NUTCH-1928v6.patch

> Indexing filter of documents by the MIME type
> -
>
> Key: NUTCH-1928
> URL: https://issues.apache.org/jira/browse/NUTCH-1928
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer, plugin
>Reporter: Jorge Luis Betancourt Gonzalez
>Assignee: Jorge Luis Betancourt Gonzalez
>  Labels: filter, mime-type, plugin
> Fix For: 1.10
>
> Attachments: NUTCH-1928v4.patch, NUTCH-1928v5.patch, 
> NUTCH-1928v6.patch, mimetype-patch-v3.patch
>
>
> This allows filtering of the indexed documents by the MIME type of the 
> crawled content. Basically this will allow you to restrict the MIME types of 
> the contents that will be stored in the Solr/Elasticsearch index without the 
> need to restrict the crawling/parsing process, so there is no need to use the 
> URLFilter plugin family. Also this addresses one particular corner case where 
> certain URLs don't have any format to filter on, such as some RSS feeds 
> (http://www.awesomesite.com/feed), which would otherwise end up in your index 
> mixed with all your HTML content.
> A configuration file can be specified in the {{mimetype.filter.file}} property 
> in {{nutch-site.xml}}. This file uses the same format as the 
> {{urlfilter-suffix}} plugin. If no {{mimetype.filter.file}} key is found, an 
> {{allow all}} policy is used instead, so all your crawled documents will be 
> indexed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Work started] (NUTCH-1946) Upgrade to Gora 0.6

2015-02-22 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-1946 started by Lewis John McGibbney.
---
> Upgrade to Gora 0.6
> ---
>
> Key: NUTCH-1946
> URL: https://issues.apache.org/jira/browse/NUTCH-1946
> Project: Nutch
>  Issue Type: Improvement
>  Components: storage
>Affects Versions: 2.3.1
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 2.3.1
>
>
> Apache Gora was released recently.
> We should upgrade before pushing Nutch 2.3.1 as it will come in very handy 
> for the new Docker containers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: How to read metadata/content of an URL in URLFilter?

2015-02-22 Thread Mattmann, Chris A (3980)
I believe the Plugin system caches plugins, but you will need
to confirm (haven’t looked in a long time).


++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++






-Original Message-
From: Renxia Wang 
Reply-To: "dev@nutch.apache.org" 
Date: Sunday, February 22, 2015 at 6:37 PM
To: "dev@nutch.apache.org" 
Subject: Re: How to read metadata/content of an URL in URLFilter?

>
>
>
>Is there only one instance of a plugin for all fetch cycles? I am
>assuming that when the job is started, a plugin instance is initialized
>and used in every fetching cycle. Is that correct?
>
>On Sunday, February 22, 2015, Mattmann, Chris A (3980)
> wrote:
>
>In the constructor of your URLFilter, why not consider passing
>in a NutchConfiguration object, and then reading the path to e.g,
>the LinkDb from the config. Then have a private member variable
>for the LinkDbReader (maybe static initialized for efficiency)
>and use that in your interface method.
>
>Cheers,
>Chris
>
>++
>Chris Mattmann, Ph.D.
>Chief Architect
>Instrument Software and Science Data Systems Section (398)
>NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>Office: 168-519, Mailstop: 168-527
>Email: 
>chris.a.mattm...@nasa.gov 
>WWW:  http://sunset.usc.edu/~mattmann/
>++
>Adjunct Associate Professor, Computer Science Department
>University of Southern California, Los Angeles, CA 90089 USA
>++
>
>
>
>
>
>
>-Original Message-
>From: Renxia Wang >
>Reply-To: "dev@nutch.apache.org " >
>Date: Sunday, February 22, 2015 at 3:36 PM
>To: "dev@nutch.apache.org " >
>Subject: How to read metadata/content of an URL in URLFilter?
>
>>
>>
>>
>>Hi
>>
>>
>>I want to develop a URLFilter which takes a URL and its metadata, or
>>even the fetched content, then uses some duplicate detection algorithms to
>>determine if it is a duplicate of any URL in the batch. However, the only
>>parameter passed into the URLFilter is the URL. Is it possible to get the
>>data I want for that input URL in the URLFilter?
>>
>>
>>Thanks,
>>
>>
>>Zhique
>
>
>



Re: Vagrant Crushed When using Nutch-Selenium

2015-02-22 Thread Mattmann, Chris A (3980)
Going to implement more configuration in the plugin, but
based on the student emails I think your advice helped :)

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++






-Original Message-
From: Mo Omer 
Date: Sunday, February 22, 2015 at 5:45 PM
To: Chris Mattmann 
Cc: "dev@nutch.apache.org" 
Subject: Re: Vagrant Crushed When using Nutch-Selenium

>No problem! How'd it work out?
>
>Mo
>
>This message was drafted on a tiny touch screen; please forgive brevity &
>tpyos
>
>> On Feb 22, 2015, at 6:19 PM, "Mattmann, Chris A (3980)"
>> wrote:
>> 
>> Thanks Mo, great advice.
>> 
>> ++
>> Chris Mattmann, Ph.D.
>> Chief Architect
>> Instrument Software and Science Data Systems Section (398)
>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> Office: 168-519, Mailstop: 168-527
>> Email: chris.a.mattm...@nasa.gov
>> WWW:  http://sunset.usc.edu/~mattmann/
>> ++
>> Adjunct Associate Professor, Computer Science Department
>> University of Southern California, Los Angeles, CA 90089 USA
>> ++
>> 
>> 
>> 
>> 
>> 
>> 
>> -Original Message-
>> From: Jiaxin Ye 
>> Reply-To: "dev@nutch.apache.org" 
>> Date: Tuesday, February 17, 2015 at 2:49 PM
>> To: Mohammed Omer 
>> Cc: "dev@nutch.apache.org" 
>> Subject: Re: Vagrant Crushed When using Nutch-Selenium
>> 
>>> 
>>> 
>>> 
>>> Thank you so much!! I am going to try it out tonight.
>>> 
>>> On Tuesday, February 17, 2015, Mohammed Omer 
>>> wrote:
>>> 
>>> Jiaxin, 
>>> 
>>> 
>>> Each page takes about 3 seconds to crawl due to this piece of code - we
>>> allow selenium 3 seconds to grab the page [0]. Due to what I was
>>> crawling, I didn't want to wait for a specific element/class/id to show
>>> up. However, you can change it up if you want.
>>> Selenium documentation [1] has more info on Ex/Implicit waiting.
>>> 
>>> 
>>> Again, it's not the most efficient way to crawl; but, if you need JS to
>>> render, it's a backwards way that ensures it happens. Selenium Grid has
>>> the benefit of being able to handle more throughput, but at the end of
>>> the day we're waiting for a browser to
>>> go out and fetch the url.
>>> 
>>> 
>>> I've suggested that most items be configurable when merged into trunk
>>> [2], but I'll make a specific call-out to the wait time.
>>> 
>>> 
>>> Due to the way Selenium standalone works, it's wayy less efficient
>>> than a 'Grid' set-up (hub + nodes) [3], which is why I switched to that
>>> set-up. 
>>> 
>>> 
>>> Wish I could help out more, but 30 threads might be too much. 5
>>>threads,
>>> at a total fetch/parse time of 4 seconds per url, would still
>>> theoretically churn out > 100k urls per day. There are multiple tweaks
>>> that could be made to optimize for your system,
>>> I'd start with reducing thread count, as you might be saturating your
>>> system [4].
>>> 
>>> 
>>> Sorry I can't be of more help!
>>> 
>>> 
>>> Thank you,
>>> 
>>> 
>>> Mo
>>> 
>>> 
>>> [0]: 
>>>https://github.com/momer/nutch-selenium/blob/master/lib-selenium/src/java/org/apache/nutch/protocol/selenium/HttpWebClient.java#L48-L49
>>> [1]: 
>>> http://docs.seleniumhq.org/docs/04_webdriver_advanced.jsp
>>> 
>>> [2]: https://issues.apache.org/jira/browse/NUTCH-1933
>>> [3]: https://code.google.com/p/selenium/wiki/Grid2
>>> [4]: http://stackoverflow.com/a/4895271
>>> 
>>> 
>>> On Mon, Feb 16, 2015 at 2:13 AM, Jiaxin Ye
>>> >
>>> wrote:
>>> 
>>> I am using fetcher.threads.per.queue = 30 by the way.
>>> 
>>> On Mon, Feb 16, 2015 at 12:08 AM, Jiaxin Ye
>>> >
>>> wrote:
>>> 
>>> Hi Mo,
>>> 
>>> 
>>> I have a problem about the selenium plugin on mac. I think I
>>>successfully
>>> set it up on mac but I have a question about the performance.
>>> I am using a Mac with Intel Core i5 processor and 8GB ram, but I found
>>> that each url fetched takes about 1 second to open and close
>>> the Firefox window. Is that a normal speed, or is anything wrong? And is
>>> it possible to install the Selenium Grid plugin on Mac? I will cry if you
>>> ask me to change machine now..
>>> 
>>> 
>>> Best,
>>> Jiaxin
>>> 
>>> 
>>> On Fri, Feb 13, 2015 

Re: How to read metadata/content of an URL in URLFilter?

2015-02-22 Thread Renxia Wang
Is there only one instance of a plugin for all fetch cycles? I am assuming
that when the job is started, a plugin instance is initialized and used in
every fetching cycle. Is that correct?

On Sunday, February 22, 2015, Mattmann, Chris A (3980) <
chris.a.mattm...@jpl.nasa.gov> wrote:

> In the constructor of your URLFilter, why not consider passing
> in a NutchConfiguration object, and then reading the path to e.g,
> the LinkDb from the config. Then have a private member variable
> for the LinkDbReader (maybe static initialized for efficiency)
> and use that in your interface method.
>
> Cheers,
> Chris
>
> ++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattm...@nasa.gov 
> WWW:  http://sunset.usc.edu/~mattmann/
> ++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++
>
>
>
>
>
>
> -Original Message-
> From: Renxia Wang >
> Reply-To: "dev@nutch.apache.org "  >
> Date: Sunday, February 22, 2015 at 3:36 PM
> To: "dev@nutch.apache.org "  >
> Subject: How to read metadata/content of an URL in URLFilter?
>
> >
> >
> >
> >Hi
> >
> >
> >I want to develop an UrlFIlter which takes an url, takes its metadata or
> >even the fetched content, then use some duplicate detection algorithms to
> >determine if it is a duplicate of any url in bitch. However, the only
> >parameter passed into the Urlfilter
> > is the url, is it possible to get the data I want of that input url in
> >Urlfilter?
> >
> >
> >Thanks,
> >
> >
> >Zhique
>
>


Re:

2015-02-22 Thread Majisha Parambath
Hey Jiaxin,

My understanding is that the suffix_urlfilter will not come into the
picture unless it is part of the plugin.includes property of the
Nutch configuration. By default only the regex_urlfilter is integrated into
Nutch, and we need to set the MIME types to skip/not skip in
regex_urlfilter.txt.

Please correct me if my understanding is wrong.

Thanks and regards,
*Majisha Namath Parambath*
*Graduate Student, M.S in Computer Science*
*Viterbi School of Engineering*
*University of Southern California, Los Angeles*

On Sun, Feb 15, 2015 at 11:34 PM, Jiaxin Ye  wrote:

> Hi Swati,
>
> I am also a student in Prof Mattmann's class. I think the politeness
> depends on the crawl-delay to the same server. Usually in the robots.txt
> the crawl-delay will be set to 5 to 15 seconds. It's true that setting
> fetcher.threads.per.queue to be bigger than 1 will cause the Crawl-Delay
> value from robots.txt to be ignored, but you can set the
> fetcher.server.delay to 5 to 15 seconds to rebalance the timing of
> successive requests.
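
(As a concrete illustration, those knobs are ordinary properties in
nutch-site.xml; the values below are examples only:)

<property>
  <name>fetcher.threads.per.queue</name>
  <value>2</value> <!-- anything above 1 ignores the robots.txt Crawl-Delay -->
</property>
<property>
  <name>fetcher.server.delay</name>
  <value>5.0</value> <!-- seconds between successive requests to one server -->
</property>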
>
> I also think we should change the content in suffix_urlfilter as well, as
> our task is to collect as much data as we can from the three websites.
>
> Jiaxin
>
> On Sun, Feb 15, 2015 at 10:48 PM, Swati Kothari  wrote:
>
>> Hi,
>> We are working on a project under Professor Chris Mattmann as part of
>> Information Retrieval course.
>> We are trying to edit different properties to change politeness and do
>> url filtering.
>>
>> We are trying more than 1 thread, which makes it impolite, but we are not
>> sure how impolite it should be made for better results.
>> Also, URL filtering blocks almost all image, audio, and video formats in
>> suffix_urlfilter.xml; should that be tampered with or not?
>>
>
>


Re: Vagrant Crushed When using Nutch-Selenium

2015-02-22 Thread Mo Omer
No problem! How'd it work out?

Mo

This message was drafted on a tiny touch screen; please forgive brevity & tpyos

> On Feb 22, 2015, at 6:19 PM, "Mattmann, Chris A (3980)" 
>  wrote:
> 
> Thanks Mo, great advice.
> 
> ++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattm...@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++
> 
> 
> 
> 
> 
> 
> -Original Message-
> From: Jiaxin Ye 
> Reply-To: "dev@nutch.apache.org" 
> Date: Tuesday, February 17, 2015 at 2:49 PM
> To: Mohammed Omer 
> Cc: "dev@nutch.apache.org" 
> Subject: Re: Vagrant Crushed When using Nutch-Selenium
> 
>> 
>> 
>> 
>> Thank you so much!! I am going to try it out tonight.
>> 
>> On Tuesday, February 17, 2015, Mohammed Omer 
>> wrote:
>> 
>> Jiaxin, 
>> 
>> 
>> Each page takes about 3 seconds to crawl due to this piece of code - we
>> allow selenium 3 seconds to grab the page [0]. Due to what I was
>> crawling, I didn't want to wait for a specific element/class/id to show
>> up. However, you can change it up if you want.
>> Selenium documentation [1] has more info on Ex/Implicit waiting.
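
(For reference, the two waiting styles look roughly like this with the Java
Selenium 2.x bindings of that era; the locator is made up:)

import java.util.concurrent.TimeUnit;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

WebDriver driver = new FirefoxDriver();
// Implicit wait: every findElement call polls for up to 3 seconds.
driver.manage().timeouts().implicitlyWait(3, TimeUnit.SECONDS);
// Explicit wait: block until a specific element appears (locator illustrative).
new WebDriverWait(driver, 10)
    .until(ExpectedConditions.presenceOfElementLocated(By.id("content")));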
>> 
>> 
>> Again, it's not the most efficient way to crawl; but, if you need JS to
>> render, it's a backwards way that ensures it happens. Selenium Grid has
>> the benefit of being able to handle more throughput, but at the end of
>> the day we're waiting for a browser to
>> go out and fetch the url.
>> 
>> 
>> I've suggested that most items be configurable when merged into trunk
>> [2], but I'll make a specific call-out to the wait time.
>> 
>> 
>> Due to the way Selenium standalone works, it's wayy less efficient
>> than a 'Grid' set-up (hub + nodes) [3], which is why I switched to that
>> set-up. 
>> 
>> 
>> Wish I could help out more, but 30 threads might be too much. 5 threads,
>> at a total fetch/parse time of 4 seconds per url, would still
>> theoretically churn out > 100k urls per day. There are multiple tweaks
>> that could be made to optimize for your system,
>> I'd start with reducing thread count, as you might be saturating your
>> system [4].
>> 
>> 
>> Sorry I can't be of more help!
>> 
>> 
>> Thank you,
>> 
>> 
>> Mo
>> 
>> 
>> [0]: 
>> https://github.com/momer/nutch-selenium/blob/master/lib-selenium/src/java/org/apache/nutch/protocol/selenium/HttpWebClient.java#L48-L49
>> [1]: 
>> http://docs.seleniumhq.org/docs/04_webdriver_advanced.jsp
>> 
>> [2]: https://issues.apache.org/jira/browse/NUTCH-1933
>> [3]: https://code.google.com/p/selenium/wiki/Grid2
>> [4]: http://stackoverflow.com/a/4895271
>> 
>> 
>> On Mon, Feb 16, 2015 at 2:13 AM, Jiaxin Ye
>> >
>> wrote:
>> 
>> I am using fetcher.threads.per.queue = 30 by the way.
>> 
>> On Mon, Feb 16, 2015 at 12:08 AM, Jiaxin Ye
>> >
>> wrote:
>> 
>> Hi Mo,
>> 
>> 
>> I have a problem about the selenium plugin on mac. I think I successfully
>> set it up on mac but I have a question about the performance.
>> I am using a Mac with Intel Core i5 processor and 8GB ram, but I found
>> that each url fetched takes about 1 second to open and close
>> the Firefox window. Is that a normal speed, or is anything wrong? And is it
>> possible to install the Selenium Grid plugin on Mac? I will cry if you
>> ask me to change machine now..
>> 
>> 
>> Best,
>> Jiaxin
>> 
>> 
>> On Fri, Feb 13, 2015 at 2:09 PM, Mohammed Omer
>> > > wrote:
>> 
>> No worries man, glad everything works! Glad, since I was having hostname
>> issues with nutch/hbase just now as I quickly tried to get it
>> working/fixed for ya, ha.
>> 
>> Mo
>> 
>> 
>> On Fri, Feb 13, 2015 at 2:57 PM, Shuo Li
>> > wrote:
>> 
>> Hey guys,
>> 
>> 
>> After changing my RAM to 2GB, everything works fine. My bad. Thanks for
>> your help.
>> 
>> 
>> Regards,
>> Shuo Li
>> 
>> 
>> On Fri, Feb 13, 2015 at 11:34 AM, Mattmann, Chris A (3980)
>> > > wrote:
>> 
>> Thank you Mo. I sincerely appreciate your guidance and contribution.
>> 
>> I will work to get your nutch selenium grid plugin contributed
>> to work with Nutch 1.x.
>> 
>> Cheers,
>> Chris
>> 
>> 
>> ++
>> Chris Mattmann, Ph.D.
>> Chief Architect
>> Instrument Software and Science Data Systems Section (398)
>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> Office: 168-519, Mailstop: 168-527
>> Email: 
>> chris.a.mattm...@nasa.gov
>> 
>> WWW:  http://sunset.usc.edu/~mattmann/

Re: How to read metadata/content of an URL in URLFilter?

2015-02-22 Thread Renxia Wang
Thanks. That's what I was trying to figure out, but I didn't know which class
to use to get the path to the data files. Thanks for pointing it out.

On Sunday, February 22, 2015, Mattmann, Chris A (3980) <
chris.a.mattm...@jpl.nasa.gov> wrote:

> In the constructor of your URLFilter, why not consider passing
> in a NutchConfiguration object, and then reading the path to, e.g.,
> the LinkDb from the config. Then have a private member variable
> for the LinkDbReader (maybe statically initialized for efficiency)
> and use that in your interface method.
>
> Cheers,
> Chris
>
> ++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattm...@nasa.gov 
> WWW:  http://sunset.usc.edu/~mattmann/
> ++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++
>
>
>
>
>
>
> -Original Message-
> From: Renxia Wang >
> Reply-To: "dev@nutch.apache.org "  >
> Date: Sunday, February 22, 2015 at 3:36 PM
> To: "dev@nutch.apache.org "  >
> Subject: How to read metadata/content of an URL in URLFilter?
>
> >
> >
> >
> >Hi
> >
> >
> >I want to develop a URLFilter which takes a URL and its metadata or
> >even the fetched content, then uses some duplicate detection algorithms to
> >determine if it is a duplicate of any URL in the batch. However, the only
> >parameter passed into the URLFilter is the URL. Is it possible to get the
> >data I want for that input URL in the URLFilter?
> >
> >
> >Thanks,
> >
> >
> >Zhique
>
>


Re: How to read metadata/content of an URL in URLFilter?

2015-02-22 Thread Mattmann, Chris A (3980)
In the constructor of your URLFilter, why not consider passing
in a NutchConfiguration object, and then reading the path to, e.g.,
the LinkDb from the config. Then have a private member variable
for the LinkDbReader (maybe statically initialized for efficiency)
and use that in your interface method.
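
A minimal sketch of this idea, assuming the Nutch 1.x LinkDbReader API; the class name and the "urlfilter.linkdb.path" property are invented for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.crawl.LinkDbReader;
import org.apache.nutch.net.URLFilter;

public class LinkDbAwareURLFilter implements URLFilter {

  private Configuration conf;
  private LinkDbReader linkDbReader; // shared by all calls on this instance

  public String filter(String urlString) {
    try {
      Inlinks inlinks = reader().getInlinks(new Text(urlString));
      // ... duplicate-detection logic using inlinks/metadata goes here ...
      return urlString;   // accept the URL
    } catch (Exception e) {
      return urlString;   // fail open on errors
    }
  }

  private synchronized LinkDbReader reader() throws Exception {
    if (linkDbReader == null) {
      // "urlfilter.linkdb.path" is a property we would define ourselves.
      Path linkDb = new Path(conf.get("urlfilter.linkdb.path", "crawl/linkdb"));
      // Sketch only; assumes LinkDbReader(Configuration, Path) from Nutch 1.x.
      linkDbReader = new LinkDbReader(conf, linkDb);
    }
    return linkDbReader;
  }

  public void setConf(Configuration conf) { this.conf = conf; }

  public Configuration getConf() { return conf; }
}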

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++






-Original Message-
From: Renxia Wang 
Reply-To: "dev@nutch.apache.org" 
Date: Sunday, February 22, 2015 at 3:36 PM
To: "dev@nutch.apache.org" 
Subject: How to read metadata/content of an URL in URLFilter?

>
>
>
>Hi 
>
>
>I want to develop a URLFilter which takes a URL and its metadata or
>even the fetched content, then uses some duplicate detection algorithms to
>determine if it is a duplicate of any URL in the batch. However, the only
>parameter passed into the URLFilter is the URL. Is it possible to get the
>data I want for that input URL in the URLFilter?
>
>
>Thanks, 
>
>
>Zhique



Re: How to read metadata/content of an URL in URLFilter?

2015-02-22 Thread Renxia Wang
Yes, I tried that out, but that one has only the URL as input. The problem is
how to get the data for that URL locally.

On Sunday, February 22, 2015, Nagarjun Pola  wrote:

> I have just started looking along those lines and found that the interface
> URLFilter has a method named "filter", and I think this is our point of
> interest.
> Maybe you should look at how to use this method in your plugin.
>
>
>
>
> On Sun, Feb 22, 2015 at 4:41 PM, Jiaxin Ye  > wrote:
>
>> You are absolutely right! I am just throwing out ideas :) If you are looking
>> at local data, org.apache.nutch.segment.SegmentReader may be helpful, I
>> guess, as all the parsed data contents are located there.
>>
>> On Sun, Feb 22, 2015 at 4:33 PM, Renxia Wang > > wrote:
>>
>>> Thank you for your suggestion. I will take a look at that. There is a
>>> URLUtil class in nutch's source code, but I just wonder whether that one will
>>> send a request to the URL again to get the data. Since the url's metadata
>>> has already been downloaded, it is better if we can get the data locally.
>>>
>>>
>>> On Sunday, February 22, 2015, Jiaxin Ye >> > wrote:
>>>
 Hey,

 I haven't started working on the deduplication yet, but if I were you I
 would use the Tika library to retrieve the MIME type and metadata. The code is
 presented in the Tika book. Why not try that out? :)

 Best,
 Jiaxin

 On Sunday, February 22, 2015, Renxia Wang  wrote:

> Hi
>
> I want to develop a URLFilter which takes a URL and its metadata or
> even the fetched content, then uses some duplicate detection algorithms to
> determine if it is a duplicate of any URL in the batch. However, the only
> parameter passed into the URLFilter is the URL. Is it possible to get the
> data I want for that input URL in the URLFilter?
>
> Thanks,
>
> Zhique
>

>>
>


Re: How to read metadata/content of an URL in URLFilter?

2015-02-22 Thread Renxia Wang
Thanks. I will take a look at that.

On Sunday, February 22, 2015, Jiaxin Ye  wrote:

> You are absolutely right! I am just throwing ideas :) If you are looking
> at local data, org.apache.nutch.segment.SegmentReader may be helpful I
> guess. As all data contents parsed are located there.
>
> On Sun, Feb 22, 2015 at 4:33 PM, Renxia Wang  > wrote:
>
>> Thank you for your suggestion. I will take a look at that. There is a
>> URLUtil class in nutch's source code, but I just wonder whether that one will
>> send a request to the URL again to get the data. Since the url's metadata
>> has already been downloaded, it is better if we can get the data locally.
>>
>>
>> On Sunday, February 22, 2015, Jiaxin Ye > > wrote:
>>
>>> Hey,
>>>
>>> I haven't started working on the deduplication yet, but if I were you I
>>> would use the Tika library to retrieve the MIME type and metadata. The code is
>>> presented in the Tika book. Why not try that out? :)
>>>
>>> Best,
>>> Jiaxin
>>>
>>> On Sunday, February 22, 2015, Renxia Wang  wrote:
>>>
 Hi

 I want to develop a URLFilter which takes a URL and its metadata or
 even the fetched content, then uses some duplicate detection algorithms to
 determine if it is a duplicate of any URL in the batch. However, the only
 parameter passed into the URLFilter is the URL. Is it possible to get the
 data I want for that input URL in the URLFilter?

 Thanks,

 Zhique

>>>
>


Re: linkdb/current/part-00000/data does not exist

2015-02-22 Thread Shuo Li
I was using ./bin/crawl and not incremental crawling at that time. This
file appears after I start crawling *.gif, *.jpg, *.mov, etc. I will
provide more information if I can reproduce this error.

Thanks =)

On Sun, Feb 22, 2015 at 4:47 PM, Mattmann, Chris A (3980) <
chris.a.mattm...@jpl.nasa.gov> wrote:

> What command are you using to crawl? Are you using bin/crawl, and/or
> doing incremental crawling?
>
> Cheers,
> Chris
>
>
> ++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattm...@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++
>
>
>
>
>
>
> -Original Message-
> From: Shuo Li 
> Reply-To: "dev@nutch.apache.org" 
> Date: Friday, February 20, 2015 at 3:26 PM
> To: "dev@nutch.apache.org" 
> Subject: linkdb/current/part-00000/data does not exist
>
> >Hi,
> >
> >
> >I'm trying to crawl NSF ACADIS with nutch-selenium. I met a problem
> >where linkdb/current/part-00000/data does not exist. I checked my
> >directory and my files during crawling, and it appears this file
> >sometimes exists and sometimes disappears. This is quite weird and strange.
> >
> >
> >Another problem is when we crawl NSIDC ADE, it will give us a 403
> >forbidden error. Does this mean NSIDC ADE is blocking us?
> >
> >
> >The log of first error is in the bottom of this email. Any help would be
> >appreciated.
> >
> >
> >Regards,
> >Shuo Li
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >LinkDb: merging with existing linkdb: nsfacadis3Crawl/linkdb
> >LinkDb: java.io.FileNotFoundException: File
> >file:/vagrant/nutch/runtime/local/nsfacadis3Crawl/linkdb/current/part-00000/data does not exist.
> >at
> >org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.j
> >ava:402)
> >at
> >org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:
> >255)
> >at
> >org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileIn
> >putFormat.java:47)
> >at
> >org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:20
> >8)
> >at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1081)
> >at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1073)
> >at org.apache.hadoop.mapred.JobClient.access$700(JobClient.java:179)
> >at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:983)
> >at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
> >at java.security.AccessController.doPrivileged(Native Method)
> >at javax.security.auth.Subject.doAs(Subject.java:415)
> >at
> >org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.
> >java:1190)
> >at
> >org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
> >at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)
> >at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)
> >at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:208)
> >at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:316)
> >at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:276)
> >
>
>


[no subject]

2015-02-22 Thread Ankit Singhaniya



[no subject]

2015-02-22 Thread Ankit Singhaniya



[no subject]

2015-02-22 Thread Akhil Ramachandran



Re: Nutchpy crawled statistics

2015-02-22 Thread Mattmann, Chris A (3980)
Exactly, Mohammad, thank you.

Cheers,
Chris


++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++






-Original Message-
From: Mohammad Al-Mohsin 
Reply-To: "dev@nutch.apache.org" 
Date: Friday, February 20, 2015 at 9:24 PM
To: "dev@nutch.apache.org" 
Subject: Re: Nutchpy crawled statistics

>Hi Pranshu,
>
>
>I assume you're talking about
>CS-572  class assignment at
>USC.
>
>
>I think the stats provided by bin/nutch for the crawldb are sufficient
>(Dr. Mattmann correct me if I'm wrong, please).
>
>
>However, you need to write a script/program to extract the MIME types you
>encountered. You can do this natively with Java or, if you prefer Python
>like me, you can use
>nutchpy.
>
>
>
>Best regards,
>Mohammad Al-Mohsin
>
>
>On Fri, Feb 20, 2015 at 8:45 PM, Pranshu Kumar
> wrote:
>
>
>I just wanted to know how we can get the crawl statistics. Is it just
>using the command-line options of nutch, or do we need to write a script
>to generate the stats using nutchpy?
>
>
>
>
>
>
>
>
>
>
>
>
>



Re: Problem installing Selenium on Ubuntu with Nutch trunk 1.10

2015-02-22 Thread Mattmann, Chris A (3980)
You are using the Github version of the patch which only works
with Nutch2 - you need to use NUTCH-1933.

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++






-Original Message-
From: Yash Sangani 
Reply-To: "dev@nutch.apache.org" 
Date: Friday, February 20, 2015 at 12:36 AM
To: "dev@nutch.apache.org" 
Subject: Problem installing Selenium on Ubuntu with Nutch trunk 1.10

>Hi,
>I have checked out the latest Nutch trunk(1.10) which has tika 1.7.
>I followed all the steps mentioned on the Selenium git hub page and also
>applied the patch.
>Yet when I am trying to build the Nutch, I get the following errors.
>
>
>[javac] 
>/home/yash/Desktop/572/nutch3/nutch/src/plugin/protocol-selenium/src/java/
>org/apache/nutch/protocol/selenium/Http.java:14: error: package
>org.apache.nutch.storage does not exist
>[javac] import org.apache.nutch.storage.WebPage;
>[javac]^
>[javac] 
>/home/yash/Desktop/572/nutch3/nutch/src/plugin/protocol-selenium/src/java/
>org/apache/nutch/protocol/selenium/Http.java:15: error: package
>org.apache.nutch.storage.WebPage does not exist
>[javac] import org.apache.nutch.storage.WebPage.Field;
>[javac]^
>[javac] 
>/home/yash/Desktop/572/nutch3/nutch/src/plugin/protocol-selenium/src/java/
>org/apache/nutch/protocol/selenium/Http.java:26: error: package WebPage
>does not exist
>[javac]   private static final Collection<WebPage.Field> FIELDS = new
>HashSet<WebPage.Field>();
>[javac]  ^
>[javac] 
>/home/yash/Desktop/572/nutch3/nutch/src/plugin/protocol-selenium/src/java/
>org/apache/nutch/protocol/selenium/Http.java:49: error: cannot find symbol
>[javac] protected Response getResponse(URL url, WebPage page,
>boolean redirect)
>[javac] ^
>[javac]   symbol:   class WebPage
>[javac]   location: class Http
>[javac] 
>/home/yash/Desktop/572/nutch3/nutch/src/plugin/protocol-selenium/src/java/
>org/apache/nutch/protocol/selenium/Http.java:55: error: package WebPage
>does not exist
>[javac]   public Collection<WebPage.Field> getFields() {
>[javac]^
>[javac] 
>/home/yash/Desktop/572/nutch3/nutch/src/plugin/protocol-selenium/src/java/
>org/apache/nutch/protocol/selenium/HttpResponse.java:16: error: package
>org.apache.nutch.storage does not exist
>[javac] import org.apache.nutch.storage.WebPage;
>[javac]^
>[javac] 
>/home/yash/Desktop/572/nutch3/nutch/src/plugin/protocol-selenium/src/java/
>org/apache/nutch/protocol/selenium/HttpResponse.java:47: error: cannot
>find symbol
>[javac] public HttpResponse(Http http, URL url, WebPage page,
>Configuration conf) throws ProtocolException, IOException {
>[javac] ^
>[javac]   symbol:   class WebPage
>[javac]   location: class HttpResponse
>[javac] 
>/home/yash/Desktop/572/nutch3/nutch/src/plugin/protocol-selenium/src/java/
>org/apache/nutch/protocol/selenium/Http.java:26: error: package WebPage
>does not exist
>[javac]   private static final Collection<WebPage.Field> FIELDS = new
>HashSet<WebPage.Field>();
>[javac]   
>   ^
>[javac] 
>/home/yash/Desktop/572/nutch3/nutch/src/plugin/protocol-selenium/src/java/
>org/apache/nutch/protocol/selenium/Http.java:29: error: package WebPage
>does not exist
>[javac] FIELDS.add(WebPage.Field.MODIFIED_TIME);
>[javac]   ^
>[javac] 
>/home/yash/Desktop/572/nutch3/nutch/src/plugin/protocol-selenium/src/java/
>org/apache/nutch/protocol/selenium/Http.java:30: error: package WebPage
>does not exist
>[javac] FIELDS.add(WebPage.Field.HEADERS);
>[javac]   ^
>[javac] 
>/home/yash/Desktop/572/nutch3/nutch/src/plugin/protocol-selenium/src/java/
>org/apache/nutch/protocol/selenium/Http.java:54: error: method does not
>override or implement a method from a supertype
>[javac]   @Override
>[javac]   ^
>[javac] 11 errors
>
>
>Any help on how to resolve this issue will be appreciated.
>I know this error was raised in a previous email, but there were no
>solutions stated, except for checking out the nutch trunk again.
>
>
>
>
>Thanks, 
>Yash Sangani
>



Re: Problem installing Selenium on Ubuntu with Nutch trunk 1.10

2015-02-22 Thread Mattmann, Chris A (3980)
Hi Nikunj,

Please see this:

https://en.wikipedia.org/wiki/Patch_(Unix)


Cheers,
Chris


++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++






-Original Message-
From: Nikunj Gala 
Reply-To: "dev@nutch.apache.org" 
Date: Saturday, February 21, 2015 at 12:20 PM
To: "dev@nutch.apache.org" 
Subject: Re: Problem installing Selenium on Ubuntu with Nutch trunk 1.10

>Being completely new to patch files, I didn't know how patch files work,
>but after looking at the patch file, the new ivy.xml, ivy.xml.rej, and
>ivy.xml.orig, I could understand that
>
>
>the selenium dependencies were not added to the ivy.xml file by the patch. I
>added them manually and could build the Nutch 1.10 trunk with the Tika 1.7
>dependency and Selenium.
>
>This build runs perfectly fine.
>



Re: linkdb/current/part-00000/data does not exist

2015-02-22 Thread Mattmann, Chris A (3980)
What command are you using to crawl? Are you using bin/crawl, and/or
doing incremental crawling?

Cheers,
Chris


++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++






-Original Message-
From: Shuo Li 
Reply-To: "dev@nutch.apache.org" 
Date: Friday, February 20, 2015 at 3:26 PM
To: "dev@nutch.apache.org" 
> Subject: linkdb/current/part-00000/data does not exist

>Hi,
>
>
>I'm trying to crawl NSF ACADIS with nutch-selenium. I met a problem
>where linkdb/current/part-00000/data does not exist. I checked my
>directory and my files during crawling, and it appears this file
>sometimes exists and sometimes disappears. This is quite weird and strange.
>
>
>Another problem is when we crawl NSIDC ADE, it will give us a 403
>forbidden error. Does this mean NSIDC ADE is blocking us?
>
>
>The log of first error is in the bottom of this email. Any help would be
>appreciated.
>
>
>Regards,
>Shuo Li
>
>
>
>
>
>
>
>
>
>
>LinkDb: merging with existing linkdb: nsfacadis3Crawl/linkdb
>LinkDb: java.io.FileNotFoundException: File
>file:/vagrant/nutch/runtime/local/nsfacadis3Crawl/linkdb/current/part-00000/data does not exist.
>at 
>org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.j
>ava:402)
>at 
>org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:
>255)
>at 
>org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileIn
>putFormat.java:47)
>at 
>org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:20
>8)
>at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1081)
>at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1073)
>at org.apache.hadoop.mapred.JobClient.access$700(JobClient.java:179)
>at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:983)
>at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
>at java.security.AccessController.doPrivileged(Native Method)
>at javax.security.auth.Subject.doAs(Subject.java:415)
>at 
>org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.
>java:1190)
>at 
>org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
>at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)
>at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)
>at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:208)
>at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:316)
>at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:276)
>



Re: Nutch-Selenium Plugin Truncates Binary Data

2015-02-22 Thread Mattmann, Chris A (3980)
I think this is fantastic Mohammad!

Can you update the patch on NUTCH-1933 with this improvement,
so we can get it into the sources?

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++






-Original Message-
From: Mohammad Al-Mohsin 
Reply-To: "dev@nutch.apache.org" 
Date: Saturday, February 21, 2015 at 6:03 AM
To: "dev@nutch.apache.org" 
Cc: Mohammad Al-Mohsin 
Subject: Nutch-Selenium Plugin Truncates Binary Data

>I am using 
>nutch-selenium  plugin and I
>also have 
>Tesseract  installed for parsing
>text off images.
>
>
>While crawling with Nutch & selenium, I noticed that binary data (e.g.
>images, pdf) are always truncated and thus skip/fail parsing. Here is a
>sample of the log:
>Content of size 800750 was truncated to 368. Content is truncated, parse
>may fail!
>
>When I turn selenium off, parsing works fine and the content is not
>truncated.
>
>
>I found that nutch-selenium gets the html body of whatever Firefox
>displays. So even though you're fetching an image, selenium will just
>give you the image html tag instead of the image itself.
>e.g. an <img> tag.
>
>
>To get around this, I modified the selenium plugin to handle the fetch only
>if the Content-Type header starts with 'text', i.e. to catch 'text/html'.
>Otherwise, if the content is not textual, it just returns the content as
>protocol-httpclient does.
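
A rough sketch of that branch, under stated assumptions: the helper readRawBytes() and the exact HttpWebClient.getHtmlPage() signature are stand-ins, not the actual plugin code:

// Illustrative sketch only; readRawBytes() and the getHtmlPage() signature
// are assumptions, not the real nutch-selenium code.
private byte[] selectContent(String url, String contentType,
                             InputStream in, int contentLength)
    throws IOException {
  if (contentType != null && contentType.toLowerCase().startsWith("text")) {
    // Textual response: let Firefox render it and take the DOM as bytes.
    return HttpWebClient.getHtmlPage(url).getBytes("UTF-8");
  }
  // Binary response (image, pdf, ...): return the raw body untouched,
  // as protocol-httpclient would, so Tika/Tesseract can parse it later.
  return readRawBytes(in, contentLength); // hypothetical helper
}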
>
>
>Now, I am getting binary data properly parsed and also getting selenium
>to handle page rendering with javascript.
>
>
>Is this the proper way to tackle this? What do you think?
>
>
>
>
>Best regards,
>Mohammad Al-Mohsin
>
>



Re: How to read metadata/content of an URL in URLFilter?

2015-02-22 Thread Nagarjun Pola
I have just started looking along those lines and found that the interface
URLFilter has a method named "filter", and I think this is our point of
interest.
Maybe you should look at how to use this method in your plugin.
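
For reference, the Nutch 1.x interface looks roughly like this (simplified sketch; the real interface also extends the plugin and configuration interfaces):

public interface URLFilter {
  /**
   * Returns the URL (possibly modified) to accept it,
   * or null to reject it.
   */
  String filter(String urlString);
}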




On Sun, Feb 22, 2015 at 4:41 PM, Jiaxin Ye  wrote:

> You are absolutely right! I am just throwing out ideas :) If you are looking
> at local data, org.apache.nutch.segment.SegmentReader may be helpful, I
> guess, as all the parsed data contents are located there.
>
> On Sun, Feb 22, 2015 at 4:33 PM, Renxia Wang  wrote:
>
>> Thank you for your suggestion. I will take a look at that. There is a
>> URLUtil class in nutch's source code, but I just wonder whether that one will
>> send a request to the URL again to get the data. Since the url's metadata
>> has already been downloaded, it is better if we can get the data locally.
>>
>>
>> On Sunday, February 22, 2015, Jiaxin Ye  wrote:
>>
>>> Hey,
>>>
>>> I haven't started working on the deduplication yet, but if I were you I
>>> would use the Tika library to retrieve the MIME type and metadata. The code is
>>> presented in the Tika book. Why not try that out? :)
>>>
>>> Best,
>>> Jiaxin
>>>
>>> On Sunday, February 22, 2015, Renxia Wang  wrote:
>>>
 Hi

 I want to develop a URLFilter which takes a URL and its metadata or
 even the fetched content, then uses some duplicate detection algorithms
 to determine if it is a duplicate of any URL in the batch. However, the only
 parameter passed into the URLFilter is the URL. Is it possible to get the
 data I want for that input URL in the URLFilter?

 Thanks,

 Zhique

>>>
>


Re: [MASSMAIL] Re: [ANNOUNCE] New Nutch committer and PMC - Jorge Luis Betancourt Gonzalez

2015-02-22 Thread Yusniel Hidalgo Delgado
Congratulations Jorge!

Yusniel Hidalgo Delgado 
Semantic Web Research Group 
University of Informatics Sciences 
http://gws-uci.blogspot.com/ 
Havana, Cuba 

- Original Message -
> From: "Julien Nioche" 
> To: dev@nutch.apache.org, u...@nutch.apache.org
> Sent: Thursday, 19 February 2015 14:44:06
> Asunto: [MASSMAIL] Re: [ANNOUNCE] New Nutch committer and PMC - Jorge Luis 
> Betancourt Gonzalez
> 
> Congratulations and welcome Jorge! Great to have you with us
> 
> Julien
> 
> On 19 February 2015 at 17:20, Sebastian Nagel 
> wrote:
> 
> > Dear all,
> >
> > on behalf of the Nutch PMC it is my pleasure to announce that
> > Jorge Luis Betancourt Gonzalez has been voted in as committer
> > and member of the Nutch PMC. Jorge, would you mind telling us
> > about yourself, what you've done so far with Nutch, which areas
> > you think you'd like to get involved, etc...
> >
> > Congratulations and welcome on board!
> >
> > Regards,
> > Sebastian
> >
> 
> 
> 
> --
> 
> Open Source Solutions for Text Engineering
> 
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
> 


Re: How to read metadata/content of an URL in URLFilter?

2015-02-22 Thread Jiaxin Ye
You are absolutely right! I am just throwing out ideas :) If you are looking at
local data, org.apache.nutch.segment.SegmentReader may be helpful, I guess,
as all the parsed data contents are located there.
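
For example, a rough sketch of pulling the fetched bytes for one URL out of a segment's content directory with plain Hadoop SequenceFile APIs (the segment path and URL are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.util.NutchConfiguration;

public class SegmentContentSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    // Placeholder segment path; real segment names are timestamps.
    Path data = new Path("crawl/segments/20150222000000/content/part-00000/data");
    SequenceFile.Reader reader =
        new SequenceFile.Reader(FileSystem.get(conf), data, conf);
    Text key = new Text();          // the URL
    Content value = new Content();  // the fetched content record
    while (reader.next(key, value)) {
      if ("http://example.com/".equals(key.toString())) {
        byte[] raw = value.getContent(); // the fetched bytes
        System.out.println("found " + raw.length + " bytes");
      }
    }
    reader.close();
  }
}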

On Sun, Feb 22, 2015 at 4:33 PM, Renxia Wang  wrote:

> Thank you for your suggestion. I will take a look at that. There is a
> URLUtil class in nutch's source code, but I just wonder whether that one will
> send a request to the URL again to get the data. Since the url's metadata
> has already been downloaded, it is better if we can get the data locally.
>
>
> On Sunday, February 22, 2015, Jiaxin Ye  wrote:
>
>> Hey,
>>
>> I haven't started working on the deduplication yet, but if I were you I
>> would use the Tika library to retrieve the MIME type and metadata. The code is
>> presented in the Tika book. Why not try that out? :)
>>
>> Best,
>> Jiaxin
>>
>> On Sunday, February 22, 2015, Renxia Wang  wrote:
>>
>>> Hi
>>>
>>> I want to develop a URLFilter which takes a URL and its metadata or
>>> even the fetched content, then uses some duplicate detection algorithms to
>>> determine if it is a duplicate of any URL in the batch. However, the only
>>> parameter passed into the URLFilter is the URL. Is it possible to get the
>>> data I want for that input URL in the URLFilter?
>>>
>>> Thanks,
>>>
>>> Zhique
>>>
>>


Re: Tesseract OCR and GDAL in Tika plugin for Nutch?

2015-02-22 Thread Mattmann, Chris A (3980)
You need to install 1.8-SNAPSHOT version of Tika in your assignment.
Please read the assignment instructions again.

http://sunset.usc.edu/classes/cs572_2015/

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++






-Original Message-
From: Nikunj Gala 
Reply-To: "dev@nutch.apache.org" 
Date: Wednesday, February 18, 2015 at 2:24 PM
To: "dev@nutch.apache.org" 
Subject: Tesseract OCR and GDAL in Tika plugin for Nutch?

>The current source of Nutch uses Tika 1.7 as per repository in github.
>(https://github.com/apache/nutch/commit/3e2e688bd097727f457f1aa882c74a128f
>0a53da)
>
>As per Apache Tika 1.7 webpage, Tika 1.7 includes GDAL and Tesseract OCR
>(installation required).
>But the Nutch source does not have GDAL and Tesseract OCR in parse-tika
>plugin. 
>
>
>How to include GDAL and Tesseract OCR sources in Tika plugin for Nutch?
>



Re: How to read metadata/content of an URL in URLFilter?

2015-02-22 Thread Renxia Wang
Thank you for your suggestion. I will take a look at that. There is a
URLUtil class in nutch's source code, but I just wonder whether that one will
send a request to the URL again to get the data. Since the url's metadata
has already been downloaded, it is better if we can get the data locally.

On Sunday, February 22, 2015, Jiaxin Ye  wrote:

> Hey,
>
> I haven't started working on the deduplication yet, but if I were you I
> would use the Tika library to retrieve the MIME type and metadata. The code is
> presented in the Tika book. Why not try that out? :)
>
> Best,
> Jiaxin
>
> On Sunday, February 22, 2015, Renxia Wang  > wrote:
>
>> Hi
>>
>> I want to develop a URLFilter which takes a URL and its metadata or
>> even the fetched content, then uses some duplicate detection algorithms to
>> determine if it is a duplicate of any URL in the batch. However, the only
>> parameter passed into the URLFilter is the URL. Is it possible to get the
>> data I want for that input URL in the URLFilter?
>>
>> Thanks,
>>
>> Zhique
>>
>


[Nutch Wiki] Update of "AdvancedAjaxInteraction" by ChrisMattmann

2015-02-22 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "AdvancedAjaxInteraction" page has been changed by ChrisMattmann:
https://wiki.apache.org/nutch/AdvancedAjaxInteraction?action=diff&rev1=2&rev2=3

Comment:
- add links for install Selenium Grid 2 on a Mac and link to the binary.

   * [[https://issues.apache.org/jira/browse/NUTCH-1933|NUTCH-1933]]
   * [[https://github.com/momer/nutch-selenium|momer/nutch-selenium]] - This 
plugin allows you to fetch javascript pages using Selenium, while relying on 
the rest of the awesome Nutch stack! (ported to issue NUTCH-1933)
   * 
[[https://github.com/momer/nutch-selenium-grid-plugin|momer/nutch-selenium-grid-plugin]]
 - This plugin allows you to fetch javascript pages using an existing Selenium 
Hub/Node set-up, while relying on the rest of the awesome Nutch stack! 
- 
+  * [[Install Selenium Grid 2 on 
Mac|http://grid.selenium.googlecode.com/git-history/24150d2e97090b8b439bcc6a396911fb53200749/src/main/webapp/step_by_step_installation_instructions_for_osx.html]]
 - Installation instructions for Selenium Grid 2 on a Mac (needed for the 
[[https://github.com/momer/nutch-selenium-grid-plugin|momer/nutch-selenium-grid-plugin]]).
+  * [[Selenium Grid 
Binary|http://grid.selenium.googlecode.com/git-history/00eae2a86d81c4ef8da355b0a8b916a9095a5cd9/src/main/webapp/download.html]]
 - latest version of Selenium Grid (Ver 1.0.8).
  
  == Related Articles ==
   * 
[[http://soryy.com/blog/2014/ajax-javascript-enabled-parsing-apache-nutch-selenium/|AJAX/JavaScript
 Enabled Parsing with Apache Nutch and Selenium]]


Re: Vagrant Crushed When using Nutch-Selenium

2015-02-22 Thread Mattmann, Chris A (3980)
Thanks Mo, great advice.

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++






-Original Message-
From: Jiaxin Ye 
Reply-To: "dev@nutch.apache.org" 
Date: Tuesday, February 17, 2015 at 2:49 PM
To: Mohammed Omer 
Cc: "dev@nutch.apache.org" 
Subject: Re: Vagrant Crushed When using Nutch-Selenium

>
>
>
>Thank you so much!! I am going to try it out tonight.
>
>On Tuesday, February 17, 2015, Mohammed Omer 
>wrote:
>
>Jiaxin, 
>
>
>Each page takes about 3 seconds to crawl due to this piece of code - we
>allow selenium 3 seconds to grab the page [0]. Due to what I was
>crawling, I didn't want to wait for a specific element/class/id to show
>up. However, you can change it up if you want.
> Selenium documentation [1] has more info on Ex/Implicit waiting.
>
>
>Again, it's not the most efficient way to crawl; but, if you need JS to
>render, it's a backwards way that ensures it happens. Selenium Grid has
>the benefit of being able to handle more throughput, but at the end of
>the day we're waiting for a browser to
> go out and fetch the url.
>
>
>I've suggested that most items be configurable when merged into trunk
>[2], but I'll make a specific call-out to the wait time.
>
>
>Due to the way Selenium standalone works, it's wayy less efficient
>than a 'Grid' set-up (hub + nodes) [3], which is why I switched to that
>set-up. 
>
>
>Wish I could help out more, but 30 threads might be too much. 5 threads,
>at a total fetch/parse time of 4 seconds per url, would still
>theoretically churn out > 100k urls per day. There are multiple tweaks
>that could be made to optimize for your system,
> I'd start with reducing thread count, as you might be saturating your
>system [4].
>
>
>Sorry I can't be of more help!
>
>
>Thank you,
>
>
>Mo
>
>
>[0]: 
>https://github.com/momer/nutch-selenium/blob/master/lib-selenium/src/java/
>org/apache/nutch/protocol/selenium/HttpWebClient.java#L48-L49
>[1]: 
>http://docs.seleniumhq.org/docs/04_webdriver_advanced.jsp
>
>[2]: https://issues.apache.org/jira/browse/NUTCH-1933
>[3]: https://code.google.com/p/selenium/wiki/Grid2
>[4]: http://stackoverflow.com/a/4895271
>
>
>On Mon, Feb 16, 2015 at 2:13 AM, Jiaxin Ye
>>
>wrote:
>
>I am using fetcher.threads.per.queue = 30 by the way.
>
>On Mon, Feb 16, 2015 at 12:08 AM, Jiaxin Ye
>>
>wrote:
>
>Hi Mo,
>
>
>I have a problem about the selenium plugin on mac. I think I successfully
>set it up on mac but I have a question about the performance.
>I am using a Mac with Intel Core i5 processor and 8GB ram, but I found
>that each url fetched takes about 1 second to open and close
>the firefox window. Is that a normal speed, or is anything wrong? And is it
>possible to install selenium grid plugin on Mac? I will cry if you
>ask me to change machine now..
>
>
>Best,
>Jiaxin
>
>
>On Fri, Feb 13, 2015 at 2:09 PM, Mohammed Omer
>> wrote:
>
>No worries man, glad everything works! Glad, since I was having hostname
>issues with nutch/hbase just now as I quickly tried to get it
>working/fixed for ya, ha.
>
>Mo
>
>
>On Fri, Feb 13, 2015 at 2:57 PM, Shuo Li
>> wrote:
>
>Hey guys,
>
>
>After changing my RAM to 2GB, everything works fine. My bad. Thanks for
>your help.
>
>
>Regards,
>Shuo Li
>
>
>On Fri, Feb 13, 2015 at 11:34 AM, Mattmann, Chris A (3980)
>> wrote:
>
>Thank you Mo. I sincerely appreciate your guidance and contribution.
>
>I will work to get your nutch selenium grid plugin contributed
>to work with Nutch 1.x.
>
>Cheers,
>Chris
>
>
>++
>Chris Mattmann, Ph.D.
>Chief Architect
>Instrument Software and Science Data Systems Section (398)
>NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>Office: 168-519, Mailstop: 168-527
>Email: 
>chris.a.mattm...@nasa.gov
>
>WWW:  http://sunset.usc.edu/~mattmann/
>++
>Adjunct Associate Professor, Computer Science Department
>University of Southern California, Los Angeles, CA 90089 USA
>++
>
>
>
>
>
>
>-Original Message-
>From: Mo Omer >
>Date: Friday, February 13, 2015 at 11:10 AM
>To: Chris Mattmann >
>Cc: "dev@nutch.apache.org
>"
>>
>Subject: Re: Vagrant Crushed When using Nutch-Selenium
>
>>Hey all,

Re: How to read metadata/content of an URL in URLFilter?

2015-02-22 Thread Jiaxin Ye
Hey,

I haven't started working on the deduplication yet, but if I were you I
would use the Tika library to retrieve the MIME type and metadata. The code is
presented in the Tika book. Why not try that out? :)
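
A minimal sketch with Tika's facade and AutoDetectParser (the input file path is a placeholder):

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.tika.Tika;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class TikaSketch {
  public static void main(String[] args) throws Exception {
    File f = new File("/tmp/fetched-page.bin"); // placeholder input

    // MIME type detection via the Tika facade.
    String mime = new Tika().detect(f);
    System.out.println("MIME type: " + mime);

    // Metadata extraction via AutoDetectParser.
    Metadata metadata = new Metadata();
    try (InputStream in = new FileInputStream(f)) {
      new AutoDetectParser().parse(in, new BodyContentHandler(-1), metadata);
    }
    for (String name : metadata.names()) {
      System.out.println(name + " = " + metadata.get(name));
    }
  }
}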

Best,
Jiaxin

On Sunday, February 22, 2015, Renxia Wang  wrote:

> Hi
>
> I want to develop a URLFilter which takes a URL and its metadata or
> even the fetched content, then uses some duplicate detection algorithms to
> determine if it is a duplicate of any URL in the batch. However, the only
> parameter passed into the URLFilter is the URL. Is it possible to get the
> data I want for that input URL in the URLFilter?
>
> Thanks,
>
> Zhique
>


[jira] [Commented] (NUTCH-1933) nutch-selenium plugin

2015-02-22 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332441#comment-14332441
 ] 

Chris A. Mattmann commented on NUTCH-1933:
--

Thank you [~almohsin], I will update the patch accordingly. [~momer] good point, 
let me think about this.

> nutch-selenium plugin
> -
>
> Key: NUTCH-1933
> URL: https://issues.apache.org/jira/browse/NUTCH-1933
> Project: Nutch
>  Issue Type: New Feature
>  Components: protocol
>Reporter: Mo Omer
>Assignee: Lewis John McGibbney
> Fix For: 1.10
>
> Attachments: NUTCH-selenium-trunk.patch
>
>
> I updated the plugin [nutch-selenium|https://github.com/momer/nutch-selenium] 
> plugin to run against trunk.
> I feel that there is a good bit of work to be done here; however, early testing 
> on my system indicates that it works. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Nutch-Selenium Error

2015-02-22 Thread Mattmann, Chris A (3980)
Good to hear!

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++






-Original Message-
From: Mohammad Al-Mohsin 
Reply-To: "dev@nutch.apache.org" 
Date: Monday, February 16, 2015 at 7:56 PM
To: "dev@nutch.apache.org" 
Subject: Re: Nutch-Selenium Error

>FYI, the issue was resolved by deleting 'runtime' directory and then
>recompiling Nutch.
>
>
>cd nutch/trunk
>rm -r runtime
>ant runtime
>
>
>
>
>
>Best regards,
>Mohammad Al-Mohsin
>
>
>On Mon, Feb 16, 2015 at 2:56 AM, Mohammad Al-Mohsin
> wrote:
>
>Here is the error stack:
>
>
>2015-02-16 01:32:29,699 ERROR selenium.Http - Failed to get protocol output
>java.lang.NoClassDefFoundError: Could not initialize class
>org.apache.http.impl.conn.ManagedHttpClientConnectionFactory
>at org.apache.http.impl.conn.PoolingHttpClientConnectionManager$InternalConnectionFactory.<init>(PoolingHttpClientConnectionManager.java:493)
>at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.<init>(PoolingHttpClientConnectionManager.java:149)
>at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.<init>(PoolingHttpClientConnectionManager.java:138)
>at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.<init>(PoolingHttpClientConnectionManager.java:114)
>at org.openqa.selenium.remote.internal.HttpClientFactory.getClientConnectionManager(HttpClientFactory.java:68)
>at org.openqa.selenium.remote.internal.HttpClientFactory.<init>(HttpClientFactory.java:54)
>at org.openqa.selenium.remote.HttpCommandExecutor.<init>(HttpCommandExecutor.java:98)
>at org.openqa.selenium.remote.HttpCommandExecutor.<init>(HttpCommandExecutor.java:81)
>at org.openqa.selenium.firefox.internal.NewProfileExtensionConnection.start(NewProfileExtensionConnection.java:93)
>at org.openqa.selenium.firefox.FirefoxDriver.startClient(FirefoxDriver.java:246)
>at org.openqa.selenium.remote.RemoteWebDriver.<init>(RemoteWebDriver.java:114)
>at org.openqa.selenium.firefox.FirefoxDriver.<init>(FirefoxDriver.java:191)
>at org.openqa.selenium.firefox.FirefoxDriver.<init>(FirefoxDriver.java:186)
>at org.openqa.selenium.firefox.FirefoxDriver.<init>(FirefoxDriver.java:182)
>at org.openqa.selenium.firefox.FirefoxDriver.<init>(FirefoxDriver.java:95)
>at org.apache.nutch.protocol.selenium.HttpWebClient.getHtmlPage(HttpWebClient.java:53)
>at org.apache.nutch.protocol.selenium.HttpResponse.readPlainContent(HttpResponse.java:199)
>at org.apache.nutch.protocol.selenium.HttpResponse.<init>(HttpResponse.java:161)
>at org.apache.nutch.protocol.selenium.Http.getResponse(Http.java:56)
>at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:206)
>at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:758)
>
>
>Best regards,
>Mohammad Al-Mohsin
>
>
>On Mon, Feb 16, 2015 at 1:57 AM, Mohammad Al-Mohsin
> wrote:
>
>Hi,
>
>
>I'm trying to use Nutch-Selenium plugin with Nutch 1.10 trunk on Mac
>Yosemite.
>
>
>I applied the patch from
>NUTCH-1933 , installed
>X11, and included protocol-selenium plugin in nutch-site config file.
>
>
>Now when I start crawling, at the first fetch, I see that Firefox is
>opened and closed immediately and I get this error in the console:
>fetch of 
>http://www.mywebsite.com  failed with:
>java.lang.NoClassDefFoundError: Could not initialize class
>org.apache.http.impl.conn.ManagedHttpClientConnectionFactory
>
>
>
>
>Any idea how to fix this error?
>
>Best regards,
>Mohammad Al-Mohsin
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>



Re:

2015-02-22 Thread Mattmann, Chris A (3980)
Exactly, Jiaxin, great answer.

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++






-Original Message-
From: Jiaxin Ye 
Reply-To: "dev@nutch.apache.org" 
Date: Sunday, February 15, 2015 at 11:34 PM
To: "dev@nutch.apache.org" 
Subject: Re:

>Hi Swati,
>
>
>I am also a student in Prof. Mattmann's class. I think the politeness
>depends on the crawl-delay to the same server. Usually in robots.txt
>the crawl-delay will be set to 5 to 15 seconds. It's true that setting
>fetcher.threads.per.queue to be bigger
> than 1 will cause the Crawl-Delay value from robots.txt to be ignored,
>but you can set fetcher.server.delay to 5 to 15 seconds to
>rebalance the time between successive requests.
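
For instance, a nutch-site.xml fragment along those lines (values are examples only, not recommendations):

<!-- Illustrative nutch-site.xml fragment; values are examples only. -->
<property>
  <name>fetcher.threads.per.queue</name>
  <value>2</value>
</property>
<property>
  <name>fetcher.server.delay</name>
  <value>5.0</value>
</property>
<property>
  <!-- With more than one thread per queue, Nutch consults the min delay. -->
  <name>fetcher.server.min.delay</name>
  <value>5.0</value>
</property>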
>
>
>I also think we should change the content in suffix_urlfilter as well,
>as our task is to collect as much data as we can from the three websites.
>
>
>Jiaxin
>
>
>On Sun, Feb 15, 2015 at 10:48 PM, Swati Kothari
> wrote:
>
>Hi,
>We are working on a project under Professor Chris Mattmann as part of
>Information Retrieval course.
>We are trying to edit different properties to change politeness and do
>url filtering.
>
>
>We are trying more than 1 thread, which makes it impolite, but we are not
>sure how impolite it should be made for better results.
>Also, url filtering blocks almost all image, audio, and video formats in
>suffix_urlfilter.xml; should that be tampered with or not?
>
>
>
>
>
>



Re: Nutch-Selenium Error

2015-02-22 Thread Mattmann, Chris A (3980)
Hi Mohammad, did you get this fixed?

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++






-Original Message-
From: Mohammad Al-Mohsin 
Reply-To: "dev@nutch.apache.org" 
Date: Monday, February 16, 2015 at 1:57 AM
To: "dev@nutch.apache.org" 
Subject: Nutch-Selenium Error

>Hi,
>
>
>I'm trying to use Nutch-Selenium plugin with Nutch 1.10 trunk on Mac
>Yosemite.
>
>
>I applied the patch from
>NUTCH-1933 , installed
>X11, and included protocol-selenium plugin in nutch-site config file.
>
>
>Now when I start crawling, at the first fetch, I see that Firefox is
>opened and closed immediately and I get this error in the console:
>fetch of 
>http://www.mywebsite.com  failed with:
>java.lang.NoClassDefFoundError: Could not initialize class
>org.apache.http.impl.conn.ManagedHttpClientConnectionFactory
>
>
>
>
>Any idea how to fix this error?
>
>Best regards,
>Mohammad Al-Mohsin
>
>



How to read metadata/content of an URL in URLFilter?

2015-02-22 Thread Renxia Wang
Hi

I want to develop a URLFilter which takes a URL and its metadata or
even the fetched content, then uses some duplicate detection algorithms to
determine if it is a duplicate of any URL in the batch. However, the only
parameter passed into the URLFilter is the URL. Is it possible to get the
data I want for that input URL in the URLFilter?

Thanks,

Zhique


[jira] [Commented] (NUTCH-1944) Add raw content to indexes

2015-02-22 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332417#comment-14332417
 ] 

Sebastian Nagel commented on NUTCH-1944:


This issue duplicates NUTCH-1785 but this solution via an IndexingFilter plugin 
is only for 2.x. On 1.x an indexing filter cannot request the raw content from 
segments, which is addressed in NUTCH-1785 by implementing the functionality in 
the indexer core. However, an IndexingFilter seems to be the simpler and more 
modular solution.

Conversion from raw content is implicitly done relying on the system's locale 
(cf. NUTCH-1807). The encoding used to represent the HTML as a string should be 
predictable, as [discussed in 
NUTCH-1785|https://issues.apache.org/jira/browse/NUTCH-1785?focusedCommentId=14011649&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-140116499].

> Add raw content to indexes
> --
>
> Key: NUTCH-1944
> URL: https://issues.apache.org/jira/browse/NUTCH-1944
> Project: Nutch
>  Issue Type: New Feature
>  Components: indexer, plugin
>Reporter: Lewis John McGibbney
> Fix For: 2.4
>
>
> The issue is described very well here
> https://github.com/Meabed/nutch2-index-html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (NUTCH-1946) Upgrade to Gora 0.6

2015-02-22 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney reassigned NUTCH-1946:
---

Assignee: Lewis John McGibbney

> Upgrade to Gora 0.6
> ---
>
> Key: NUTCH-1946
> URL: https://issues.apache.org/jira/browse/NUTCH-1946
> Project: Nutch
>  Issue Type: Improvement
>  Components: storage
>Affects Versions: 2.3.1
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 2.3.1
>
>
> Apache Gora was released recently.
> We should upgrade before pushing Nutch 2.3.1 as it will come in very handy 
> for the new Docker containers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (NUTCH-1947) Overhaul o.a.n.parse.OutlinkExtractor.java

2015-02-22 Thread Lewis John McGibbney (JIRA)
Lewis John McGibbney created NUTCH-1947:
---

 Summary: Overhaul o.a.n.parse.OutlinkExtractor.java 
 Key: NUTCH-1947
 URL: https://issues.apache.org/jira/browse/NUTCH-1947
 Project: Nutch
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.9, 2.3
Reporter: Lewis John McGibbney
 Fix For: 2.4, 1.10


Right now in both trunk and 2.X, the 
[OutlinkExtractor.java|https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/parse/OutlinkExtractor.java]
 class needs a bit of TLC. It is referencing JDK1.5 in a few places, there are 
misleading URL entries and it boasts some interesting @Deprecation methods 
which we could ideally remove.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-1709) Generated classes o.a.n.storage.Host and o.a.n.storage.ProtocolStatus contain methods not defined in source .avsc

2015-02-22 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1709:

Fix Version/s: (was: 2.4)
   2.3.1

> Generated classes o.a.n.storage.Host and o.a.n.storage.ProtocolStatus contain 
> methods not defined in source .avsc
> -
>
> Key: NUTCH-1709
> URL: https://issues.apache.org/jira/browse/NUTCH-1709
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 2.3.1
>
> Attachments: NUTCH-1709.patch
>
>
> When using the GoraCompiler currently packaged with gora-core-0.4-SNAPSHOT, 
> the following methods are removed from o.a.n.storage.Host or 
> o.a.n.storage.ProtocolStatus
> {code:title=Host.java|borderStyle=solid}
>   public boolean contains(String key) {
> return metadata.containsKey(new Utf8(key));
>   }
>   
>   public String getValue(String key, String defaultValue) {
> if (!contains(key)) return defaultValue;
> return Bytes.toString(metadata.get(new Utf8(key)));
>   }
>   
>   public int getInt(String key, int defaultValue) {
> if (!contains(key)) return defaultValue;
> return Integer.parseInt(getValue(key,null));
>   }
>   public long getLong(String key, long defaultValue) {
> if (!contains(key)) return defaultValue;
> return Long.parseLong(getValue(key,null));
>   }
> {code}
> {code:title=ProtocolStatus.java|borderStyle=solid}
>   /**
>* A convenience method which returns a successful {@link ProtocolStatus}.
>* @return the {@link ProtocolStatus} value for 200 (success).
>*/
>   public boolean isSuccess() {
> return code == ProtocolStatusUtils.SUCCESS; 
>   }
> {code}
> This results in compilation errors... I am not sure if it is good practice 
> for non-default methods to be contained within generated Persistent classes. 
> This is certainly the case with newer versions of Avro when using the Java 
> API.
> compile-core:
> [javac] Compiling 104 source files to 
> /home/mary/Downloads/apache/2.x/build/classes
> [javac] warning: [options] bootstrap class path not set in conjunction 
> with -source 1.6
> [javac] 
> /home/mary/Downloads/apache/2.x/src/java/org/apache/nutch/fetcher/FetcherReducer.java:345:
>  error: cannot find symbol
> [javac]host.getInt("q_mt", 
> maxThreads),
> [javac]^
> [javac]   symbol:   method getInt(String,int)
> [javac]   location: variable host of type Host
> [javac] 
> /home/mary/Downloads/apache/2.x/src/java/org/apache/nutch/fetcher/FetcherReducer.java:346:
>  error: cannot find symbol
> [javac]host.getLong("q_cd", 
> crawlDelay),
> [javac]^
> [javac]   symbol:   method getLong(String,long)
> [javac]   location: variable host of type Host
> [javac] 
> /home/mary/Downloads/apache/2.x/src/java/org/apache/nutch/fetcher/FetcherReducer.java:347:
>  error: cannot find symbol
> [javac]host.getLong("q_mcd", 
> minCrawlDelay));
> [javac]^
> [javac]   symbol:   method getLong(String,long)
> [javac]   location: variable host of type Host
> [javac] 
> /home/mary/Downloads/apache/2.x/src/java/org/apache/nutch/parse/ParserChecker.java:114:
>  error: cannot find symbol
> [javac] if(!protocolOutput.getStatus().isSuccess()) {
> [javac]   ^
> [javac]   symbol:   method isSuccess()
> [javac]   location: class ProtocolStatus
> [javac] Note: 
> /home/mary/Downloads/apache/2.x/src/java/org/apache/nutch/storage/Host.java 
> uses unchecked or unsafe operations.
> [javac] Note: Recompile with -Xlint:unchecked for details.
> [javac] 4 errors
> [javac] 1 warning
> I think it would be a good idea to find another home for such methods as it 
> will undoubtedly avoid problems when we do Gora upgrades in the future.
> Right now I don't have a suggestion but will work on a solution nonetheless.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-1923) Nutch + Cassandra Docker

2015-02-22 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1923:

Fix Version/s: (was: 2.4)
   2.3.1

> Nutch + Cassandra Docker
> 
>
> Key: NUTCH-1923
> URL: https://issues.apache.org/jira/browse/NUTCH-1923
> Project: Nutch
>  Issue Type: Sub-task
>  Components: build
>Reporter: Lewis John McGibbney
> Fix For: 2.3.1
>
>
>  Apache Nutch With Cassandra With Elasticsearch and Hadoop on Docker
> https://github.com/Meabed/nutch-cassandra-docker



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-1925) Upgrade Tika to version 1.7

2015-02-22 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332384#comment-14332384
 ] 

Sebastian Nagel commented on NUTCH-1925:


Great to see again successful Jenkins builds. Thanks!

> Upgrade Tika to version 1.7
> ---
>
> Key: NUTCH-1925
> URL: https://issues.apache.org/jira/browse/NUTCH-1925
> Project: Nutch
>  Issue Type: Improvement
>  Components: build
>Reporter: Tyler Palsulich
>Assignee: Markus Jelsma
>Priority: Blocker
> Fix For: 1.10, 2.3.1
>
> Attachments: NUTCH-1925-2x.patch, NUTCH-1925.palsulich.p2.patch, 
> NUTCH-1925.palsulich.p2.v2.patch, NUTCH-1925.palsulich.patch, 
> NUTCH-1925.palsulich.v2.patch
>
>
> Hi Folks. Nutch currently uses version 1.6 of Tika. There were no significant 
> API changes between 1.6 and 1.7. So, this should be a one line update.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-1944) Add raw content to indexes

2015-02-22 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1944:

Fix Version/s: (was: 2.3.1)
   2.4

> Add raw content to indexes
> --
>
> Key: NUTCH-1944
> URL: https://issues.apache.org/jira/browse/NUTCH-1944
> Project: Nutch
>  Issue Type: New Feature
>  Components: indexer, plugin
>Reporter: Lewis John McGibbney
> Fix For: 2.4
>
>
> The issue is described very well here
> https://github.com/Meabed/nutch2-index-html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-840) Port tests from parse-html to parse-tika

2015-02-22 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-840:
---
Attachment: NUTCH-840-2.x.patch

Patch for 2.X.
There currently appears to be a discrepancy in the detection of Outlinks. We 
are detecting more than the test expects:

{code}
  1 Testsuite: org.apache.nutch.parse.tika.TestDOMContentUtils
  2 Tests run: 3, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.496 sec
  3
  4 Testcase: testGetTitle took 0.331 sec
  5 Testcase: testGetText took 0.069 sec
  6 Testcase: testGetOutlinks took 0.08 sec
  7 FAILED
  8 got wrong number of outlinks (expecting 3, got 5)
  9 answer:
 10 toUrl: http://www.nutch.org/ anchor: home
 11 toUrl: http://www.nutch.org/docs/1 anchor: 1
 12 toUrl: http://www.nutch.org/docs/2 anchor: 2
 13
 14 got:
 15 toUrl: http://www.nutch.org/ anchor: home
 16 toUrl: http://www.nutch.org/ anchor:
 17 toUrl: http://www.nutch.org/docs/1 anchor: 1
 18 toUrl: http://www.nutch.org/docs/1 anchor:
 19 toUrl: http://www.nutch.org/docs/2 anchor: 2
 20
 21
 22 junit.framework.AssertionFailedError: got wrong number of outlinks 
(expecting 3, got 5)
 23 answer:
 24 toUrl: http://www.nutch.org/ anchor: home
 25 toUrl: http://www.nutch.org/docs/1 anchor: 1
 26 toUrl: http://www.nutch.org/docs/2 anchor: 2
 27
 28 got:
 29 toUrl: http://www.nutch.org/ anchor: home
 30 toUrl: http://www.nutch.org/ anchor:
 31 toUrl: http://www.nutch.org/docs/1 anchor: 1
 32 toUrl: http://www.nutch.org/docs/1 anchor:
 33 toUrl: http://www.nutch.org/docs/2 anchor: 2
 34
 35
 36 at 
org.apache.nutch.parse.tika.TestDOMContentUtils.compareOutlinks(TestDOMContentUtils.ja
va:315)
 37 at 
org.apache.nutch.parse.tika.TestDOMContentUtils.testGetOutlinks(TestDOMContentUtils.ja
va:296)
{code}

> Port tests from parse-html to parse-tika
> 
>
> Key: NUTCH-840
> URL: https://issues.apache.org/jira/browse/NUTCH-840
> Project: Nutch
>  Issue Type: Task
>  Components: parser
>Affects Versions: 1.1, 1.6
>Reporter: Julien Nioche
> Fix For: 2.4
>
> Attachments: NUTCH-840-2.x.patch, NUTCH-840-trunk.patch, 
> NUTCH-840.patch, NUTCH-840.patch, NUTCH-840v2.patch
>
>
> We don't have tests for HTML in parse-tika so I'll copy them from the old 
> parse-html plugin



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[Nutch Wiki] New attachment added to page RunNutchInEclipse

2015-02-22 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page "RunNutchInEclipse" for change notification. 
An attachment has been added to that page by SebastianNagel. Following detailed 
information is available:

Attachment name: nutch_eclipse_javadoc_loc.png
Attachment size: 70201
Attachment link: 
https://wiki.apache.org/nutch/RunNutchInEclipse?action=AttachFile&do=get&target=nutch_eclipse_javadoc_loc.png
Page link: https://wiki.apache.org/nutch/RunNutchInEclipse


[Nutch Wiki] Update of "RunNutchInEclipse" by SebastianNagel

2015-02-22 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "RunNutchInEclipse" page has been changed by SebastianNagel:
https://wiki.apache.org/nutch/RunNutchInEclipse?action=diff&rev1=50&rev2=51

Comment:
add section how to make Eclipse display Javadocs of dependent libs, including 
IvyDE 

  OutlinkExtractor : getOutlinks() : line 84
  }}}
  
- === Remote Debugging in Eclipse (NOT VERIFIED) ===
+ === Remote Debugging in Eclipse ===
   1. create a new Debug Configuration as 
[[http://help.eclipse.org/juno/index.jsp?topic=%2Forg.eclipse.jdt.doc.user%2Ftasks%2Ftask-remotejava_launch_config.htm|Remote
 Java Application]] and remember the port (here: 37649)
   1. launch nutch from command-line but add options to use the 
[[http://docs.oracle.com/javase/6/docs/technotes/guides/jpda/architecture.html#jdwp|Java
 Debugger JDWP Agent Library]], e.g. from bash:
  {{{
@@ -173, +173 @@

  
  }}}
  
+ 
+ == Display Javadoc for Dependent Libraries ==
+ 
+ Eclipse is able to show Javadocs immediately, not only for Nutch classes but 
also for dependent libraries. While Eclipse takes the Javadocs of Nutch classes 
directly from the source files, this is not the case for dependent 
[[http://ant.apache.org/ivy/|Ivy]]-managed libraries. There are two ways to 
tell Eclipse where to find the Javadocs of dependent libs: (1) add the 
Javadoc URL to a jar file, or (2) use the IvyDE Eclipse plugin. Note that both 
ways will modify the file {{{.classpath}}}. Because the {{{ant eclipse}}} 
target will overwrite the {{{.classpath}}} file, you should make a backup 
beforehand and merge the changes made via Eclipse back afterwards.
+ 
+ === Connect a Library to the Javadoc URL ===
+ 
+ The simplest way to connect a jar library with its Javadocs is to add the 
Javadoc URL manually in the classpath editor, see screenshot.
+ 
+ {{attachment:nutch_eclipse_javadoc_loc.png}}
+ 
+ === IvyDE ===
+ 
+ The Nutch build system delegates the management of library dependencies to 
[[http://ant.apache.org/ivy/|Apache Ivy]]. There is an Eclipse plugin, 
[[http://ant.apache.org/ivy/ivyde/|IvyDE]], to integrate Ivy's dependency 
management. It is well-documented, including a description of 
[[http://ant.apache.org/ivy/ivyde/history/latest-milestone/cpc/create.html|how 
to add the managed libraries to the Eclipse project]]. The main Ivy file is 
{{{ivy/ivy.xml}}}, but note that every plugin has its own {{{ivy.xml}}}. If 
working on a specific plugin, it is a good idea to also add its {{{ivy.xml}}}. 
It is possible to use IvyDE in addition to the libraries placed by {{{ant 
eclipse}}} in {{{.classpath}}}.
+ 
+ The repository hosting a library often also provides packages containing 
javadoc and sources. E.g., the JUnit repository
+ [[https://repo1.maven.org/maven2/junit/junit/4.11/]] provides the following 
files:
+ {{{
+ junit-4.11-javadoc.jar   14-Nov-2012 19:21   379344
+ junit-4.11-sources.jar   14-Nov-2012 19:21   151329
+ junit-4.11.jar           14-Nov-2012 19:21   245039
+ junit-4.11.pom           14-Nov-2012 19:21     2344
+ }}}
+ IvyDE is then able to also fetch javadoc and source packages (if provided) 
and show them in Eclipse. Again, there is an excellent description of how this 
can be enabled in the 
[[http://ant.apache.org/ivy/ivyde/history/latest-milestone/preferences.html#mapping|Source/Javadoc
 Mapping]] section of the Ivy preferences. Note that the Ivy cache (usually 
{{{~/.ivy2/cache/}}}) must be cleaned before 
[[http://ant.apache.org/ivy/ivyde/history/latest-milestone/cpc/resolve.html|Ivy 
Resolve]] is called from Eclipse.
+ 
+ 
  == Troubleshooting ==
  
  === eclipse: Cannot create project content in workspace ===
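
As a footnote to the new "Connect a Library to the Javadoc URL" section above:
attaching a Javadoc URL via the classpath editor is stored as a
{{{javadoc_location}}} attribute on the library's {{{.classpath}}} entry,
roughly like the following sketch (jar path and URL are illustrative
placeholders):

{{{
<classpathentry kind="lib" path="build/lib/junit-4.11.jar">
  <attributes>
    <attribute name="javadoc_location" value="http://junit.org/javadoc/4.11/"/>
  </attributes>
</classpathentry>
}}}

This is also why the section advises backing up {{{.classpath}}}: a later
{{{ant eclipse}}} run regenerates the file without these attributes.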


[jira] [Commented] (NUTCH-1925) Upgrade Tika to version 1.7

2015-02-22 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332350#comment-14332350
 ] 

Hudson commented on NUTCH-1925:
---

SUCCESS: Integrated in Nutch-nutchgora #1347 (See 
[https://builds.apache.org/job/Nutch-nutchgora/1347/])
NUTCH-1925 Upgrade Tika to version 1.7 (lewismc: 
http://svn.apache.org/viewvc/nutch/branches/2.x/?view=rev&rev=1661539)
* /nutch/branches/2.x/CHANGES.txt
* 
/nutch/branches/2.x/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaConfig.java
* 
/nutch/branches/2.x/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaParser.java
* 
/nutch/branches/2.x/src/plugin/parse-tika/src/test/org/apache/nutch/parse/tika/DOMContentUtilsTest.java


> Upgrade Tika to version 1.7
> ---
>
> Key: NUTCH-1925
> URL: https://issues.apache.org/jira/browse/NUTCH-1925
> Project: Nutch
>  Issue Type: Improvement
>  Components: build
>Reporter: Tyler Palsulich
>Assignee: Markus Jelsma
>Priority: Blocker
> Fix For: 1.10, 2.3.1
>
> Attachments: NUTCH-1925-2x.patch, NUTCH-1925.palsulich.p2.patch, 
> NUTCH-1925.palsulich.p2.v2.patch, NUTCH-1925.palsulich.patch, 
> NUTCH-1925.palsulich.v2.patch
>
>
> Hi Folks. Nutch currently uses version 1.6 of Tika. There were no significant
> API changes between 1.6 and 1.7, so this should be a one-line update.
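
For reference, the bump itself is essentially the Tika revision in the
parse-tika plugin's ivy.xml. A sketch of the relevant dependency line, following
the usual Ivy pattern (the actual attributes in the file may differ):

{code}
<dependency org="org.apache.tika" name="tika-parsers" rev="1.7" conf="*->default"/>
{code}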





Jenkins build is back to normal : Nutch-nutchgora #1347

2015-02-22 Thread Apache Jenkins Server
See https://builds.apache.org/job/Nutch-nutchgora/1347/



[jira] [Created] (NUTCH-1946) Upgrade to Gora 0.6

2015-02-22 Thread Lewis John McGibbney (JIRA)
Lewis John McGibbney created NUTCH-1946:
---

 Summary: Upgrade to Gora 0.6
 Key: NUTCH-1946
 URL: https://issues.apache.org/jira/browse/NUTCH-1946
 Project: Nutch
  Issue Type: Improvement
  Components: storage
Affects Versions: 2.3.1
Reporter: Lewis John McGibbney
 Fix For: 2.3.1


Apache Gora 0.6 was released recently.
We should upgrade before pushing Nutch 2.3.1, as it will come in very handy for
the new Docker containers.





Re: linkdb/current/part-00000/data does not exist

2015-02-22 Thread veeresh beeram
Hi,

I was unable to reproduce the linkdb error.

The NSIDC ADE 403 Forbidden error occurs because NSIDC seems to be blocking
requests whose User-Agent contains "nutch".
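
For anyone hitting this: the agent string Nutch sends is controlled by the
http.agent.name property in conf/nutch-site.xml. A minimal override looks
roughly like the following (the value is a placeholder; choose something that
identifies your crawler honestly):

<property>
  <name>http.agent.name</name>
  <value>MyResearchCrawler</value>
  <description>HTTP User-Agent name sent to crawled sites.</description>
</property>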

--
Thanks,
Veeresh

On 20 February 2015 at 15:26, Shuo Li  wrote:

> Hi,
>
> I'm trying to crawl NSF ACADIS with nutch-selenium. I've hit a problem where
> *linkdb/current/part-0/data does not exist*. I checked my directory and my
> files during crawling, and it appears this file sometimes exists and
> sometimes disappears. This is quite weird and strange.
>
> Another problem is that when we crawl NSIDC ADE, it gives us a 403
> Forbidden error. Does this mean NSIDC ADE is blocking us?
>
> The log of first error is in the bottom of this email. Any help would be
> appreciated.
>
> Regards,
> Shuo Li
>
>
>
>
>
> LinkDb: merging with existing linkdb: nsfacadis3Crawl/linkdb
> LinkDb: java.io.FileNotFoundException: File
> file:/vagrant/nutch/runtime/local/nsfacadis3Crawl/linkdb/current/part-0/data
> does not exist.
> at
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:402)
> at
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:255)
> at
> org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:47)
> at
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
> at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1081)
> at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1073)
> at org.apache.hadoop.mapred.JobClient.access$700(JobClient.java:179)
> at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:983)
> at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
> at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
> at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)
> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)
> at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:208)
> at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:316)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:276)
>


[jira] [Resolved] (NUTCH-1925) Upgrade Tika to version 1.7

2015-02-22 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-1925.
-
Resolution: Fixed

Committed @revision 1661539 in 2.3.1 HEAD

> Upgrade Tika to version 1.7
> ---
>
> Key: NUTCH-1925
> URL: https://issues.apache.org/jira/browse/NUTCH-1925
> Project: Nutch
>  Issue Type: Improvement
>  Components: build
>Reporter: Tyler Palsulich
>Assignee: Markus Jelsma
>Priority: Blocker
> Fix For: 1.10, 2.3.1
>
> Attachments: NUTCH-1925-2x.patch, NUTCH-1925.palsulich.p2.patch, 
> NUTCH-1925.palsulich.p2.v2.patch, NUTCH-1925.palsulich.patch, 
> NUTCH-1925.palsulich.v2.patch
>
>
> Hi Folks. Nutch currently uses version 1.6 of Tika. There were no significant
> API changes between 1.6 and 1.7, so this should be a one-line update.





Selenium error

2015-02-22 Thread Puranjay Rajpal
Hi, I have installed Selenium on my Mac, but when I try to crawl any website
I get the following lines and then the crawl just stops.

org.openqa.selenium.firefox.NotConnectedException: Unable to connect to
host 127.0.0.1 on port 7055 after 45000 ms.


I am not sure how to solve this.
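
One way to narrow this down is a standalone check outside Nutch (a minimal
sketch, assuming selenium-java is on the classpath; in Selenium 2.x the
FirefoxDriver talks to a browser extension on port 7055, which matches the
error above). If this fails the same way, the problem is the Firefox/Selenium
version pairing rather than the nutch-selenium plugin:

import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;

public class FirefoxSmokeTest {
  public static void main(String[] args) {
    // Launches a local Firefox instance via the Selenium driver extension.
    WebDriver driver = new FirefoxDriver();
    try {
      driver.get("http://example.com/");
      System.out.println("Loaded page with title: " + driver.getTitle());
    } finally {
      driver.quit(); // always shut the browser down, even on failure
    }
  }
}

A common cause is a Firefox auto-update to a version newer than the bundled
Selenium supports; pinning Firefox or upgrading Selenium usually resolves it.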