Varying Number of URLS Crawled.

2015-02-12 Thread Nagarjun Pola
Hi Everyone,

I started to use Nutch 1.10 for my homework and I see that every time I
perform a crawl using the same configuration and same seed urls I get a
different number of fetched urls. This occurs even when the old crawl data
is deleted.

This way I would not be able to identify which URLs had a problem being
fetched and if it was resolved later or not.
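
One way to see which URLs differ between two runs is to compare the fetched-URL lists directly. The following is only a minimal sketch: it assumes each crawl's fetched URLs have already been exported to a plain-text file, one URL per line, and the function name diff_url_lists is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: show URLs present in only one of two URL lists.
# Lines starting with "< " appear only in the first file,
# lines starting with "> " appear only in the second file.
diff_url_lists() {
  a_sorted=$(mktemp) || return 1
  b_sorted=$(mktemp) || return 1
  # comm requires sorted input; -u also drops duplicate URLs.
  sort -u "$1" > "$a_sorted"
  sort -u "$2" > "$b_sorted"
  comm -23 "$a_sorted" "$b_sorted" | sed 's/^/< /'
  comm -13 "$a_sorted" "$b_sorted" | sed 's/^/> /'
  rm -f "$a_sorted" "$b_sorted"
}
```

Calling diff_url_lists run1.txt run2.txt would then list the URLs that were fetched in one run but not in the other, which is a starting point for checking whether a fetch problem was resolved later.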

Any suggestions on how to solve this issue would be of great help.

Thank You.

Best,
Nagarjun Pola
University of Southern California


[jira] [Commented] (NUTCH-1730) Scoring-depth optionally not to increment depth for external hosts

2015-02-12 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14317829#comment-14317829
 ] 

Markus Jelsma commented on NUTCH-1730:
--

Anything to add to this modification?

 Scoring-depth optionally not to increment depth for external hosts
 --

 Key: NUTCH-1730
 URL: https://issues.apache.org/jira/browse/NUTCH-1730
 Project: Nutch
  Issue Type: New Feature
Affects Versions: 1.7
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.11

 Attachments: NUTCH-1730-trunk.patch


 Currently, the plugin always increments depth, even when coming from or going to
 external hosts.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (NUTCH-1939) Fetcher fails to follow redirects

2015-02-12 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-1939.

Resolution: Fixed

Committed to trunk, v1659227. Thanks, [~leoyey]!

 Fetcher fails to follow redirects
 -

 Key: NUTCH-1939
 URL: https://issues.apache.org/jira/browse/NUTCH-1939
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.9
Reporter: Sebastian Nagel
 Fix For: 1.10

 Attachments: NUTCH-1939.patch


 As reported by [~leoyey] in NUTCH-1735, which introduced the regression: with
 http.redirect.max > 0, Fetcher does not follow redirects.





[jira] [Commented] (NUTCH-1925) Upgrade Tika to version 1.7

2015-02-12 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14319064#comment-14319064
 ] 

Markus Jelsma commented on NUTCH-1925:
--

Yes, I'll check it in tomorrow. Any comments on other minor issues for 1.10 before
we decide on an RC?


 Upgrade Tika to version 1.7
 ---

 Key: NUTCH-1925
 URL: https://issues.apache.org/jira/browse/NUTCH-1925
 Project: Nutch
  Issue Type: Improvement
  Components: build
Reporter: Tyler Palsulich
Assignee: Markus Jelsma
Priority: Blocker
 Fix For: 1.10, 2.3.1

 Attachments: NUTCH-1925-2x.patch, NUTCH-1925.palsulich.patch, 
 NUTCH-1925.palsulich.v2.patch


 Hi Folks. Nutch currently uses version 1.6 of Tika. There were no significant 
 API changes between 1.6 and 1.7. So, this should be a one line update.





Re: Nutch-Selenium in Nutch 1.10

2015-02-12 Thread Shuo Li
I think I have possibly finished installing.

What you need to do:
0. Run git status and check out (revert) whatever you have modified.
1. patch -p0 < YOUR_PATCH_FILE
2. ant clean jar
3. ant runtime
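
The steps above can be sketched as a small script, run from the Nutch source root. This is only an illustration of the sequence: the function names and the DRY_RUN switch are assumptions added here so the commands can be previewed without running ant, and YOUR_PATCH_FILE remains a placeholder.

```shell
#!/bin/sh
# Run a command, or just print it when DRY_RUN=1 (DRY_RUN is a hypothetical
# switch introduced for this sketch, not something Nutch or ant provides).
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Apply the patch, then rebuild the jar and the runtime directory.
apply_and_build() {
  patch_file=$1
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "patch -p0 < $patch_file"
  else
    patch -p0 < "$patch_file" || return 1
  fi
  run ant clean jar || return 1
  run ant runtime
}
```

Setting DRY_RUN=1 before calling apply_and_build YOUR_PATCH_FILE prints the three commands in order, which makes it easy to confirm the sequence before touching the working copy.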

I will try crawling with Selenium later on. Hope this helps.

On Thu, Feb 12, 2015 at 9:20 AM, Mattmann, Chris A (3980) 
chris.a.mattm...@jpl.nasa.gov wrote:

  Yes, I believe you need to install X11. Why don't you try it and report
 back what you find? Thanks.

 Sent from my iPhone

 On Feb 12, 2015, at 8:28 AM, Jiaxin Ye jiaxi...@usc.edu wrote:

  Hi professor, but can we use Selenium on Mac?

 On Thursday, February 12, 2015, Mattmann, Chris A (3980) 
 chris.a.mattm...@jpl.nasa.gov wrote:

 You need Selenium, Jiaxin, in order to crawl dynamic pages in the
 polar dataset you have been assigned in my CSCI 572 search engines class.

 The instructions for integrating Selenium with Nutch 1.10-trunk
 are here:

 https://issues.apache.org/jira/browse/NUTCH-1933


 Cheers,
 Chris


 ++
 Chris Mattmann, Ph.D.
 Chief Architect
 Instrument Software and Science Data Systems Section (398)
 NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
 Office: 168-519, Mailstop: 168-527
 Email: chris.a.mattm...@nasa.gov
 WWW:  http://sunset.usc.edu/~mattmann/
 ++
 Adjunct Associate Professor, Computer Science Department
 University of Southern California, Los Angeles, CA 90089 USA
 ++






 -Original Message-
 From: Jiaxin Ye jiaxi...@usc.edu
 Reply-To: dev@nutch.apache.org dev@nutch.apache.org
 Date: Thursday, February 12, 2015 at 12:46 AM
 To: dev@nutch.apache.org dev@nutch.apache.org
 Subject: Re: Nutch-Selenium in Nutch 1.10

 Well, good choice. I am thinking of changing to Ubuntu now. The thing is, why
 do we need Selenium anyway? Is it just to make crawling easier?
 
 On Thu, Feb 12, 2015 at 12:25 AM, Shuo Li
 sli...@usc.edu wrote:
 
 Interestingly, I'm a Mac user, but I don't want to mess up my laptop, so I'm
 using Vagrant with Ubuntu Trusty. It doesn't have a GUI, but Xvfb can still
 be installed properly. The issue is that I don't know how to integrate
 Selenium with Nutch 1.10.
 
 On Thu, Feb 12, 2015 at 12:04 AM, Jiaxin Ye
 jiaxi...@usc.edu wrote:
 
 Hi all,
 
 Does anyone here know where to find a setup tutorial for Selenium on Mac?
 I find it difficult to install Xvfb on Mac.
 
 Best,
 Jiaxin
 
 On Tue, Feb 10, 2015 at 9:42 PM, Sapnashri Suresh
 sapna...@usc.edu wrote:
 
 Hi Shuo Li,
 
 We were facing a similar issue. Prof. Mattmann suggested we look into this
 patch for Selenium on Nutch 1.10:
 https://issues.apache.org/jira/browse/NUTCH-1933.
 
 Hope this helps!
 
 Thanks,
 Sapna
 
 On Tue, Feb 10, 2015 at 9:36 PM, Shuo Li
 sli...@usc.edu wrote:
 
 Yop,
 
 I'm trying to install Selenium in Nutch 1.10. However, this error pops
 up:
 
 error: package org.apache.nutch.storage does not exist
 
 I can only find this package in Nutch 2.x. Is there a way to use Selenium
 in 1.10?
 
 Any advice would be appreciated.
 
 Regards,
 Shuo Li
 
 --
 Graduate Student
 MS in CS (Data Science)
 Viterbi School of Engineering
 University of Southern California
 
 Phone: +1 650-307-9848




Re: nutch subscribe

2015-02-12 Thread Tyler Palsulich
Hi,

Please send a message to dev-subscr...@nutch.apache.org to subscribe to the
list.

Tyler
On Feb 12, 2015 6:54 PM, Poojan Jhaveri pjhav...@usc.edu wrote:





Re: org.mortbay.proxy package not found in nutch 1.x, Ref Class - ProxyTestbed

2015-02-12 Thread Preetam Pradeepkumar Shingavi
Cool. Issue resolved now.

Thanks Sebastian !

On Wed, Feb 11, 2015 at 12:21 PM, Sebastian Nagel 
wastl.na...@googlemail.com wrote:

 Hi,

 The jetty-client-6.1.22.jar
 is a dependency needed only for testing.
 Consequently, it's placed in
  build/test/lib/
 but only if you run the tests or call
  % ant resolve-test

 There is also a target
  % ant eclipse
 which writes a complete Eclipse project configuration.
 Sometimes, if dependencies change, you have to run it again.

 Of course, even with this config you have to run
  % ant resolve-default resolve-test
 after a clean to copy all dependencies into build/{lib,test/lib}/
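
 A quick way to confirm that the test dependencies have been resolved is to look for the jar in build/test/lib/. The sketch below is an illustration only: the function names are hypothetical, and it checks for any jetty-client jar rather than a specific version.

```shell
#!/bin/sh
# Hypothetical helper: succeed if a jetty-client jar exists in
# build/test/lib/ under the given Nutch source root (default: current dir).
has_test_jar() {
  root=${1:-.}
  for jar in "$root"/build/test/lib/jetty-client-*.jar; do
    # If the glob matched nothing, $jar is the literal pattern and -e fails.
    [ -e "$jar" ] && return 0
  done
  return 1
}

# Print a hint about which ant target to run when the jar is missing.
check_test_deps() {
  if has_test_jar "$1"; then
    echo "test dependencies present"
  else
    echo "missing: run 'ant resolve-test'"
  fi
}
```

 Running check_test_deps from the source root after an ant clean should report the jar as missing until the resolve targets have been run again.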

 Best,
 Sebastian

 On 02/11/2015 05:00 AM, Preetam Pradeepkumar Shingavi wrote:
  Hi,
 
  I am trying to configure Nutch 1.x in Eclipse, and I configured the build
  path to include all jars from the build-lib folder.
 
  The class ProxyTestbed.java has an error importing the following class:
  import org.mortbay.proxy.AsyncProxyServlet; (proxy package not found)
 
  As far as I can tell, this class should load from jetty-6.1.26.jar,
  but it is not actually present in that jar.
 
  Am I missing anything here? Do I need to download another jar?
 
  Thanks in advance !




nutch subscribe

2015-02-12 Thread Poojan Jhaveri



[jira] [Commented] (NUTCH-1939) Fetcher fails to follow redirects

2015-02-12 Thread Leo Ye (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14319277#comment-14319277
 ] 

Leo Ye commented on NUTCH-1939:
---

Good to see we fixed it. Thank you, [~wastl-nagel]

 Fetcher fails to follow redirects
 -

 Key: NUTCH-1939
 URL: https://issues.apache.org/jira/browse/NUTCH-1939
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.9
Reporter: Sebastian Nagel
 Fix For: 1.10

 Attachments: NUTCH-1939.patch


 As reported by [~leoyey] in NUTCH-1735, which introduced the regression: with
 http.redirect.max > 0, Fetcher does not follow redirects.





[jira] [Updated] (NUTCH-1323) AjaxNormalizer

2015-02-12 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1323:
-
Fix Version/s: (was: 1.11)
   1.10

 AjaxNormalizer
 --

 Key: NUTCH-1323
 URL: https://issues.apache.org/jira/browse/NUTCH-1323
 Project: Nutch
  Issue Type: New Feature
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.10

 Attachments: NUTCH-1323-1.6-1.patch, NUTCH-1323-1.8.patch


 A two-way normalizer for Nutch able to deal with AJAX URLs, converting them
 to _escaped_fragment_ URLs and back to AJAX URLs.
 https://developers.google.com/webmasters/ajax-crawling/
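
 To illustrate one direction of that mapping, here is a minimal shell sketch that turns a "#!" URL into its _escaped_fragment_ form. It is not the plugin's code (the plugin is written in Java); the function name is hypothetical, and only a handful of special characters are percent-encoded here.

```shell
#!/bin/sh
# Sketch: convert a "#!" (hashbang) URL into its _escaped_fragment_ form.
# URLs without "#!" are printed unchanged.
ajax_to_escaped() {
  url=$1
  case $url in
    *'#!'*) ;;
    *) printf '%s\n' "$url"; return 0 ;;
  esac
  base=${url%%'#!'*}   # everything before the first "#!"
  frag=${url#*'#!'}    # everything after the first "#!"
  # Percent-encode characters that would be ambiguous inside a query string.
  # "%" must be handled first so later escapes are not double-encoded.
  frag=$(printf '%s' "$frag" | sed -e 's/%/%25/g' -e 's/#/%23/g' \
    -e 's/&/%26/g' -e 's/+/%2B/g' -e 's/ /%20/g')
  # Append with "&" if the base already has a query string, else with "?".
  case $base in
    *\?*) sep='&' ;;
    *)    sep='?' ;;
  esac
  printf '%s%s_escaped_fragment_=%s\n' "$base" "$sep" "$frag"
}
```

 The reverse direction would strip the _escaped_fragment_ parameter, decode it, and append it after "#!" again.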





[jira] [Resolved] (NUTCH-1323) AjaxNormalizer

2015-02-12 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma resolved NUTCH-1323.
--
Resolution: Fixed

Just in time for 1.10. Committed to trunk in revision 1659167.


 AjaxNormalizer
 --

 Key: NUTCH-1323
 URL: https://issues.apache.org/jira/browse/NUTCH-1323
 Project: Nutch
  Issue Type: New Feature
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.10

 Attachments: NUTCH-1323-1.6-1.patch, NUTCH-1323-1.8.patch


 A two-way normalizer for Nutch able to deal with AJAX URLs, converting them
 to _escaped_fragment_ URLs and back to AJAX URLs.
 https://developers.google.com/webmasters/ajax-crawling/





[jira] [Commented] (NUTCH-1921) Optionally parse fetch_not_modified

2015-02-12 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14317826#comment-14317826
 ] 

Markus Jelsma commented on NUTCH-1921:
--

Anything to add to this optional setting?

 Optionally parse fetch_not_modified
 ---

 Key: NUTCH-1921
 URL: https://issues.apache.org/jira/browse/NUTCH-1921
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.9
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.11

 Attachments: NUTCH-1921-trunk.patch


 Records with fetch_not_modified are not parsed, are not passed through
 parse filters or index filters, and are not indexed. This is a huge
 problem if you have modified a parse filter, an indexing filter or any other
 behaviour in the pipeline, because the changes never show up in the index.





[jira] [Commented] (NUTCH-1684) ParseMeta to be added before fetch schedulers are run

2015-02-12 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14317828#comment-14317828
 ] 

Markus Jelsma commented on NUTCH-1684:
--

Anything to add to this? I think this can go in.

 ParseMeta to be added before fetch schedulers are run
 -

 Key: NUTCH-1684
 URL: https://issues.apache.org/jira/browse/NUTCH-1684
 Project: Nutch
  Issue Type: Improvement
  Components: crawldb
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.11

 Attachments: NUTCH-1684-trunk.patch


 FetchSchedulers cannot operate on parseMeta in the CrawlDatum because it is 
 added after the schedulers have run.





[jira] [Commented] (NUTCH-1913) LinkDB to implement db.ignore.external.links

2015-02-12 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14317874#comment-14317874
 ] 

Hudson commented on NUTCH-1913:
---

SUCCESS: Integrated in Nutch-trunk #2971 (See 
[https://builds.apache.org/job/Nutch-trunk/2971/])
NUTCH-1913 LinkDB to implement db.ignore.external.links (markus: 
http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1659169)
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/crawl/LinkDb.java


 LinkDB to implement db.ignore.external.links
 

 Key: NUTCH-1913
 URL: https://issues.apache.org/jira/browse/NUTCH-1913
 Project: Nutch
  Issue Type: New Feature
  Components: linkdb
Affects Versions: 1.9
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Trivial
 Fix For: 1.11

 Attachments: NUTCH-1913-trunk-v2.patch, NUTCH-1913-trunk.patch


 LinkDB needs an option to ignore external links.
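
 As a rough illustration of what "external" means here, a link can be treated as external when the source and target URLs have different hosts. The shell sketch below shows only that idea; it is not how LinkDb implements it, the function names are hypothetical, and IPv6 literal hosts are not handled.

```shell
#!/bin/sh
# Extract the lower-cased host of an http(s) URL using parameter expansion.
url_host() {
  rest=${1#*://}       # drop the scheme
  rest=${rest%%/*}     # drop the path
  rest=${rest%%\?*}    # drop a query that follows the host directly
  rest=${rest%%\#*}    # drop a fragment that follows the host directly
  rest=${rest##*@}     # drop userinfo
  rest=${rest%%:*}     # drop the port
  printf '%s\n' "$rest" | tr 'A-Z' 'a-z'
}

# Succeed (exit status 0) when the two URLs are on different hosts.
is_external_link() {
  [ "$(url_host "$1")" != "$(url_host "$2")" ]
}
```

 A filter built on is_external_link would simply skip any source/target pair for which the function succeeds.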





[jira] [Commented] (NUTCH-1323) AjaxNormalizer

2015-02-12 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14317873#comment-14317873
 ] 

Hudson commented on NUTCH-1323:
---

SUCCESS: Integrated in Nutch-trunk #2971 (See 
[https://builds.apache.org/job/Nutch-trunk/2971/])
NUTCH-1323 AjaxNormalizer (markus: 
http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1659167)
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/plugin/build.xml
* /nutch/trunk/src/plugin/urlnormalizer-ajax
* /nutch/trunk/src/plugin/urlnormalizer-ajax/build.xml
* /nutch/trunk/src/plugin/urlnormalizer-ajax/ivy.xml
* /nutch/trunk/src/plugin/urlnormalizer-ajax/plugin.xml
* /nutch/trunk/src/plugin/urlnormalizer-ajax/src
* /nutch/trunk/src/plugin/urlnormalizer-ajax/src/java
* /nutch/trunk/src/plugin/urlnormalizer-ajax/src/java/org
* /nutch/trunk/src/plugin/urlnormalizer-ajax/src/java/org/apache
* /nutch/trunk/src/plugin/urlnormalizer-ajax/src/java/org/apache/nutch
* /nutch/trunk/src/plugin/urlnormalizer-ajax/src/java/org/apache/nutch/net
* /nutch/trunk/src/plugin/urlnormalizer-ajax/src/java/org/apache/nutch/net/urlnormalizer
* /nutch/trunk/src/plugin/urlnormalizer-ajax/src/java/org/apache/nutch/net/urlnormalizer/ajax
* /nutch/trunk/src/plugin/urlnormalizer-ajax/src/java/org/apache/nutch/net/urlnormalizer/ajax/AjaxURLNormalizer.java
* /nutch/trunk/src/plugin/urlnormalizer-ajax/src/test
* /nutch/trunk/src/plugin/urlnormalizer-ajax/src/test/org
* /nutch/trunk/src/plugin/urlnormalizer-ajax/src/test/org/apache
* /nutch/trunk/src/plugin/urlnormalizer-ajax/src/test/org/apache/nutch
* /nutch/trunk/src/plugin/urlnormalizer-ajax/src/test/org/apache/nutch/net
* /nutch/trunk/src/plugin/urlnormalizer-ajax/src/test/org/apache/nutch/net/urlnormalizer
* /nutch/trunk/src/plugin/urlnormalizer-ajax/src/test/org/apache/nutch/net/urlnormalizer/ajax
* /nutch/trunk/src/plugin/urlnormalizer-ajax/src/test/org/apache/nutch/net/urlnormalizer/ajax/TestAjaxURLNormalizer.java


 AjaxNormalizer
 --

 Key: NUTCH-1323
 URL: https://issues.apache.org/jira/browse/NUTCH-1323
 Project: Nutch
  Issue Type: New Feature
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.10

 Attachments: NUTCH-1323-1.6-1.patch, NUTCH-1323-1.8.patch


 A two-way normalizer for Nutch able to deal with AJAX URLs, converting them
 to _escaped_fragment_ URLs and back to AJAX URLs.
 https://developers.google.com/webmasters/ajax-crawling/





[jira] [Commented] (NUTCH-1925) Upgrade Tika to version 1.7

2015-02-12 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14317815#comment-14317815
 ] 

Markus Jelsma commented on NUTCH-1925:
--

Committed to trunk in revision 1659168.


 Upgrade Tika to version 1.7
 ---

 Key: NUTCH-1925
 URL: https://issues.apache.org/jira/browse/NUTCH-1925
 Project: Nutch
  Issue Type: Improvement
  Components: build
Reporter: Tyler Palsulich
Assignee: Markus Jelsma
Priority: Blocker
 Fix For: 1.10, 2.3.1

 Attachments: NUTCH-1925.palsulich.patch, NUTCH-1925.palsulich.v2.patch


 Hi Folks. Nutch currently uses version 1.6 of Tika. There were no significant 
 API changes between 1.6 and 1.7. So, this should be a one line update.





Re: Nutch-Selenium in Nutch 1.10

2015-02-12 Thread Jiaxin Ye
Hi all,

Does anyone here know where to find a setup tutorial for Selenium on Mac? I
find it difficult to install Xvfb on Mac.

Best,
Jiaxin




Re: Nutch-Selenium in Nutch 1.10

2015-02-12 Thread Shuo Li
Interestingly, I'm a Mac user, but I don't want to mess up my laptop, so I'm
using Vagrant with Ubuntu Trusty. It doesn't have a GUI, but Xvfb can still be
installed properly. The issue is that I don't know how to integrate
Selenium with Nutch 1.10.






[jira] [Commented] (NUTCH-1913) LinkDB to implement db.ignore.external.links

2015-02-12 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14317816#comment-14317816
 ] 

Markus Jelsma commented on NUTCH-1913:
--

Thanks Sebastian, committed to trunk in revision 1659169!

 LinkDB to implement db.ignore.external.links
 

 Key: NUTCH-1913
 URL: https://issues.apache.org/jira/browse/NUTCH-1913
 Project: Nutch
  Issue Type: New Feature
  Components: linkdb
Affects Versions: 1.9
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Trivial
 Fix For: 1.11

 Attachments: NUTCH-1913-trunk-v2.patch, NUTCH-1913-trunk.patch


 LinkDB needs an option to ignore external links.





[jira] [Resolved] (NUTCH-1913) LinkDB to implement db.ignore.external.links

2015-02-12 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma resolved NUTCH-1913.
--
Resolution: Fixed

 LinkDB to implement db.ignore.external.links
 

 Key: NUTCH-1913
 URL: https://issues.apache.org/jira/browse/NUTCH-1913
 Project: Nutch
  Issue Type: New Feature
  Components: linkdb
Affects Versions: 1.9
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Trivial
 Fix For: 1.11

 Attachments: NUTCH-1913-trunk-v2.patch, NUTCH-1913-trunk.patch


 LinkDB needs an option to ignore external links.





Nutch-Selenium in Nutch 1.10

2015-02-12 Thread Jiaxin Ye
Hi Li, Shuo. You are so right. I finished installing and successfully ran
Nutch with Selenium and Firefox. I have a question though: does your
Firefox window pop up for every URL we crawl?

Hi Prof. Mattmann. I think this is the way to install Selenium on a Mac with
OS X higher than 10.6:

1. Download XQuartz (it's a .dmg file) and install it directly.
2. Download Nutch 1.10.
3. Download the patch and put it in the Nutch project directory.
4. patch -p0 < THE_PATCH_NAME
5. DO NOT update build.xml and ivy.xml as the Selenium tutorial on
GitHub tells you to. The patch already updates those .xml files for us,
and it also installs lib-selenium and protocol-selenium for
us (correct me if I am wrong).
6. Update the Tika dependency if needed.
7. Go to the Nutch project directory and run ant runtime.
8. Download Firefox.
9. Open a new terminal and type
xvfb -screen scrn 1024x758x34 (I think you can set it smaller if you
want...)
There may be some errors after entering the command (there were for me at
least). Manually create a /tmp/.X11-unix folder with sudo, then set its mode
to 1777. Rerun the command. Xvfb should now be working.
10. Go to runtime/local under the Nutch directory and run the crawl command.

Hope it helps. :)
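
The Xvfb part of the steps above can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the display number 99 and the 1024x768x24 geometry are example values, and the Xvfb line follows the form documented in its manual (a display name plus -screen screennum WxHxD). The command is printed rather than started.

```shell
#!/bin/sh
# Create the X11 socket directory with the sticky, world-writable mode 1777.
# The directory is a parameter so the function can be tried on a scratch
# path; the real location (/tmp/.X11-unix) usually needs sudo.
prepare_x11_socket_dir() {
  dir=${1:-/tmp/.X11-unix}
  mkdir -p "$dir" && chmod 1777 "$dir"
}

# Print (not run) an Xvfb invocation. Display number and geometry defaults
# are illustrative assumptions, not values taken from this thread.
xvfb_command() {
  display=${1:-99}
  geometry=${2:-1024x768x24}
  echo "Xvfb :$display -screen 0 $geometry"
}
```

With the socket directory in place, the printed command can be run in its own terminal before starting the crawl.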

Best,
Jiaxin








Re: Nutch-Selenium in Nutch 1.10

2015-02-12 Thread Mattmann, Chris A (3980)
This is great, Jiaxin. Can you please make a page on the Nutch
wiki that has this information?

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++







Re: Nutch-Selenium in Nutch 1.10

2015-02-12 Thread Jiaxin Ye
Sure. I will do it once I confirm it works...

On Thursday, February 12, 2015, Mattmann, Chris A (3980) 
chris.a.mattm...@jpl.nasa.gov wrote:

 This is great, Jiaxin, can you please make a wiki page on the Nutch
 wiki that has this information?

 ++
 Chris Mattmann, Ph.D.
 Chief Architect
 Instrument Software and Science Data Systems Section (398)
 NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
 Office: 168-519, Mailstop: 168-527
 Email: chris.a.mattm...@nasa.gov javascript:;
 WWW:  http://sunset.usc.edu/~mattmann/
 ++
 Adjunct Associate Professor, Computer Science Department
 University of Southern California, Los Angeles, CA 90089 USA
 ++






 -Original Message-
 From: Jiaxin Ye jiaxi...@usc.edu javascript:;
 Reply-To: dev@nutch.apache.org javascript:; dev@nutch.apache.org
 javascript:;
 Date: Thursday, February 12, 2015 at 9:39 PM
 To: dev@nutch.apache.org javascript:; dev@nutch.apache.org
 javascript:;
 Subject: Nutch-Selenium in Nutch 1.10

 Hi Li, Shuo. You are so right. I finished installing and successfully run
 the butch with selenium and Firefox. I have a question though, does your
 Firefox plug out for always all the urls we crawled?
 
 
 Hi Prof Mattmann. I think here is the way we install selenium on MAC with
 OS higher than 10.6 I think...
 
 
 1. Download XQuartz; it's a .dmg file, so install it directly
 2. Download Nutch 1.10
 3. Download the patch and put it in the Nutch project directory
 4. patch -p0 < THE PATCH NAME
 5. DO NOT update build.xml and ivy.xml as the Selenium tutorial on GitHub
 tells you to. The patch basically updates those .xml files for us, and it
 also installs lib-selenium and protocol-selenium for us (correct me if
 I am wrong)
 6. Update the Tika dependency if needed
 7. Go to the Nutch project directory and run ant runtime
 8. Download Firefox
 9. Open a new terminal and type
 xvfb -screen scrn 1024x758x34 (I think you can set it smaller if you
 want...)
 There may be some errors after entering the command (there were for me at
 least). Manually create a /tmp/.X11-unix folder with sudo, and then set
 its mode to 1777. Rerun the command. xvfb should then be working.
 10. Go to runtime/local inside the Nutch directory and run the crawl command
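For anyone following along, the manual fix in step 9 might look roughly like the sketch below. It assumes a POSIX shell with mkdir and chmod available; make_x11_socket_dir is a hypothetical helper name used only for illustration, not something shipped with Nutch or the X server.

```shell
# Sketch of the manual fix from step 9 (assumption: POSIX sh).
# make_x11_socket_dir is a hypothetical helper, not part of Nutch or Xvfb.
make_x11_socket_dir() {
  dir="${1:-/tmp/.X11-unix}"   # default is the folder named in step 9
  mkdir -p "$dir" || return 1  # create the folder if it is missing
  chmod 1777 "$dir"            # world-writable with the sticky bit set
}
```

Passing a scratch path as the first argument lets you try the helper without touching /tmp. On the real path the message says sudo was needed, so the same two commands would probably have to be run with sudo by hand.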
 
 
 Hope it helps. :)
 
 
 Best,
 Jiaxin
 
 
 
 
 
 On Thu, Feb 12, 2015 at 1:08 PM, Shuo Li
 sli...@usc.edu wrote:
 
 I think I have possibly finished installing.
 
 
 What you need to do:
 0. git status and checkout what you have modified.
 1. patch -p0 < YOUR_PATCH_FILE
 2. ant clean jar
 3. ant runtime
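The steps above could be wrapped in a small script along these lines. This is only a sketch under assumptions: run and build_patched_nutch are hypothetical names, the patch file name is whatever you downloaded, and it is meant to be run from the Nutch project directory with git, patch and ant on the PATH. Setting DRY_RUN=1 prints the commands instead of running them, and reverting local modifications (step 0) is left as a manual decision.

```shell
# Hypothetical wrapper around steps 0-3; names are illustrative only.
# With DRY_RUN=1 the commands are printed instead of executed.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then echo "$*"; else "$@"; fi
}

build_patched_nutch() {
  patch_file="$1"                       # path to the downloaded patch
  run git status || return 1            # step 0: see what is modified
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "patch -p0 < $patch_file"      # step 1, printed only
  else
    patch -p0 < "$patch_file" || return 1
  fi
  run ant clean jar || return 1         # step 2
  run ant runtime                       # step 3
}
```

For example, DRY_RUN=1 followed by build_patched_nutch YOUR_PATCH_FILE lists the four commands in order, which is a cheap way to check the sequence before letting it touch the source tree.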
 
 
 Will try crawling using Selenium later on. Hope this helped.
 
 
 On Thu, Feb 12, 2015 at 9:20 AM, Mattmann, Chris A (3980)
 chris.a.mattm...@jpl.nasa.gov wrote:
 
 Yes I believe you need to install X11 - why don't you try and report back
 what you find thanks.
 
 Sent from my iPhone
 
 On Feb 12, 2015, at 8:28 AM, Jiaxin Ye jiaxi...@usc.edu wrote:
 
 
 
 Hi professor, but can we use Selenium on Mac?
 
 On Thursday, February 12, 2015, Mattmann, Chris A (3980)
 chris.a.mattm...@jpl.nasa.gov wrote:
 
 You need Selenium Jiaxin, in order to crawl dynamic pages in the
 polar dataset you have been assigned in my CSCI 572 search engines class.
 
 The instructions for integrating Selenium with Nutch 1.10-trunk
 are here:
 
 https://issues.apache.org/jira/browse/NUTCH-1933
 
 
 Cheers,
 Chris
 
 
 ++
 Chris Mattmann, Ph.D.
 Chief Architect
 Instrument Software and Science Data Systems Section (398)
 NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
 Office: 168-519, Mailstop: 168-527
 Email: chris.a.mattm...@nasa.gov
 WWW:  http://sunset.usc.edu/~mattmann/
 ++
 Adjunct Associate Professor, Computer Science Department
 University of Southern California, Los Angeles, CA 90089 USA
 ++
 
 
 
 
 
 
 -Original Message-
 From: Jiaxin Ye jiaxi...@usc.edu
 Reply-To: dev@nutch.apache.org dev@nutch.apache.org
 Date: Thursday, February 12, 2015 at 12:46 AM
 To: dev@nutch.apache.org dev@nutch.apache.org
 Subject: Re: Nutch-Selenium in Nutch 1.10
 
 Well, good choice. I am thinking of changing to Ubuntu now. The thing is, why
 do we need Selenium anyway? Just to make crawling easier?
 
 On Thu, Feb 12, 2015 at 12:25 AM, Shuo Li
 sli...@usc.edu 

[jira] [Created] (NUTCH-1942) Remove TopLevelDomain

2015-02-12 Thread Julien Nioche (JIRA)
Julien Nioche created NUTCH-1942:


 Summary: Remove TopLevelDomain 
 Key: NUTCH-1942
 URL: https://issues.apache.org/jira/browse/NUTCH-1942
 Project: Nutch
  Issue Type: Task
Reporter: Julien Nioche
Priority: Minor
 Fix For: 1.11


We should leverage the domain related utilities from crawler-commons instead of 
duplicating them in the `org.apache.nutch.util.domain` package. For instance we 
could deprecate TopLevelDomain and call the corresponding class in CC instead. 
The resources in CC are more up-to-date and it is less code to maintain.

This would be a good task for someone willing to get to know the Nutch codebase 
better and impress us all with the extent of his/her skills.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Nutch-Selenium in Nutch 1.10

2015-02-12 Thread Mattmann, Chris A (3980)
You need Selenium Jiaxin, in order to crawl dynamic pages in the
polar dataset you have been assigned in my CSCI 572 search engines class.

The instructions for integrating Selenium with Nutch 1.10-trunk
are here: 

https://issues.apache.org/jira/browse/NUTCH-1933


Cheers,
Chris


++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++






-Original Message-
From: Jiaxin Ye jiaxi...@usc.edu
Reply-To: dev@nutch.apache.org dev@nutch.apache.org
Date: Thursday, February 12, 2015 at 12:46 AM
To: dev@nutch.apache.org dev@nutch.apache.org
Subject: Re: Nutch-Selenium in Nutch 1.10

Well, good choice. I am thinking of changing to Ubuntu now. The thing is, why
do we need Selenium anyway? Just to make crawling easier?

On Thu, Feb 12, 2015 at 12:25 AM, Shuo Li
sli...@usc.edu wrote:

Interestingly, I'm a mac user but I don't want to screw my laptop so I'm
using vagrant with Ubuntu Trusty. It doesn't have GUI but Xvfb can still
be installed properly. The issue would be I don't know how to integrate
Selenium with Nutch 1.10.

On Thu, Feb 12, 2015 at 12:04 AM, Jiaxin Ye
jiaxi...@usc.edu wrote:

Hi all,


Does anyone here know where to find a setup tutorial for Selenium on Mac?
I find it difficult to install Xvfb on mac.


Best,
Jiaxin


On Tue, Feb 10, 2015 at 9:42 PM, Sapnashri Suresh
sapna...@usc.edu wrote:

Hi Shuo Li,


We were facing a similar issue. Prof. Mattmann suggested we look into this
patch for Selenium on Nutch 1.10:
https://issues.apache.org/jira/browse/NUTCH-1933.


Hope this helps!


Thanks,
Sapna

On Tue, Feb 10, 2015 at 9:36 PM, Shuo Li
sli...@usc.edu wrote:

Yop,


I'm trying to install selenium in Nutch 1.10. However, this error pops
out:


error: package org.apache.nutch.storage does not exist



I can only find this package in Nutch 2.x. Is there a way to use Selenium
in 1.10? 


Any advice would be appreciated.


Regards,
Shuo Li










-- 
Graduate Student
MS in CS (Data Science)
Viterbi School of Engineering
University of Southern California


Phone: 
+1 650-307-9848

[jira] [Commented] (NUTCH-1939) Fetcher fails to follow redirects

2015-02-12 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14318099#comment-14318099
 ] 

Hudson commented on NUTCH-1939:
---

SUCCESS: Integrated in Nutch-trunk #2972 (See 
[https://builds.apache.org/job/Nutch-trunk/2972/])
NUTCH-1939 Fetcher fails to follow redirects (snagel: 
http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1659227)
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java


 Fetcher fails to follow redirects
 -

 Key: NUTCH-1939
 URL: https://issues.apache.org/jira/browse/NUTCH-1939
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.9
Reporter: Sebastian Nagel
 Fix For: 1.10

 Attachments: NUTCH-1939.patch


 As reported by [~leoyey] in NUTCH-1735, which introduced the regression: with 
 http.redirect.max > 0, Fetcher does not follow redirects.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-1925) Upgrade Tika to version 1.7

2015-02-12 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1925:
-
Attachment: NUTCH-1925-2x.patch

Patch for 2.x, it seems to be working. Please confirm.

 Upgrade Tika to version 1.7
 ---

 Key: NUTCH-1925
 URL: https://issues.apache.org/jira/browse/NUTCH-1925
 Project: Nutch
  Issue Type: Improvement
  Components: build
Reporter: Tyler Palsulich
Assignee: Markus Jelsma
Priority: Blocker
 Fix For: 1.10, 2.3.1

 Attachments: NUTCH-1925-2x.patch, NUTCH-1925.palsulich.patch, 
 NUTCH-1925.palsulich.v2.patch


 Hi Folks. Nutch currently uses version 1.6 of Tika. There were no significant 
 API changes between 1.6 and 1.7. So, this should be a one line update.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Nutch-Selenium in Nutch 1.10

2015-02-12 Thread Jiaxin Ye
Hi professor, but can we use Selenium on Mac?

On Thursday, February 12, 2015, Mattmann, Chris A (3980) 
chris.a.mattm...@jpl.nasa.gov wrote:

 You need Selenium Jiaxin, in order to crawl dynamic pages in the
 polar dataset you have been assigned in my CSCI 572 search engines class.

 The instructions for integrating Selenium with Nutch 1.10-trunk
 are here:

 https://issues.apache.org/jira/browse/NUTCH-1933


 Cheers,
 Chris


 ++
 Chris Mattmann, Ph.D.
 Chief Architect
 Instrument Software and Science Data Systems Section (398)
 NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
 Office: 168-519, Mailstop: 168-527
 Email: chris.a.mattm...@nasa.gov
 WWW:  http://sunset.usc.edu/~mattmann/
 ++
 Adjunct Associate Professor, Computer Science Department
 University of Southern California, Los Angeles, CA 90089 USA
 ++






 -Original Message-
 From: Jiaxin Ye jiaxi...@usc.edu
 Reply-To: dev@nutch.apache.org dev@nutch.apache.org
 Date: Thursday, February 12, 2015 at 12:46 AM
 To: dev@nutch.apache.org dev@nutch.apache.org
 Subject: Re: Nutch-Selenium in Nutch 1.10

 Well, good choice. I am thinking of changing to Ubuntu now. The thing is, why
 do we need Selenium anyway? Just to make crawling easier?
 
 On Thu, Feb 12, 2015 at 12:25 AM, Shuo Li
 sli...@usc.edu wrote:
 
 Interestingly, I'm a mac user but I don't want to screw my laptop so I'm
 using vagrant with Ubuntu Trusty. It doesn't have GUI but Xvfb can still
 be installed properly. The issue would be I don't know how to integrate
 Selenium with Nutch 1.10.
 
 On Thu, Feb 12, 2015 at 12:04 AM, Jiaxin Ye
 jiaxi...@usc.edu wrote:
 
 Hi all,
 
 
 Does anyone here know where to find a setup tutorial for Selenium on Mac?
 I find it difficult to install Xvfb on mac.
 
 
 Best,
 Jiaxin
 
 
 On Tue, Feb 10, 2015 at 9:42 PM, Sapnashri Suresh
 sapna...@usc.edu wrote:
 
 Hi Shuo Li,
 
 
 We were facing a similar issue. Prof. Mattmann suggested we look into this
 patch for Selenium on Nutch 1.10:
 https://issues.apache.org/jira/browse/NUTCH-1933.
 
 
 Hope this helps!
 
 
 Thanks,
 Sapna
 
 On Tue, Feb 10, 2015 at 9:36 PM, Shuo Li
 sli...@usc.edu wrote:
 
 Yop,
 
 
 I'm trying to install selenium in Nutch 1.10. However, this error pops
 out:
 
 
 error: package org.apache.nutch.storage does not exist
 
 
 
 I can only find this package in Nutch 2.x. Is there a way to use Selenium
 in 1.10?
 
 
 Any advice would be appreciated.
 
 
 Regards,
 Shuo Li
 
 
 
 
 
 
 
 
 
 
 --
 Graduate Student
 MS in CS (Data Science)
 Viterbi School of Engineering
 University of Southern California
 
 
 Phone:
 +1 650-307-9848




[jira] [Commented] (NUTCH-1942) Remove TopLevelDomain

2015-02-12 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14318308#comment-14318308
 ] 

Chris A. Mattmann commented on NUTCH-1942:
--

Julien can you tell me more about crawler-commons? You are part of that 
project, right? 

 Remove TopLevelDomain 
 --

 Key: NUTCH-1942
 URL: https://issues.apache.org/jira/browse/NUTCH-1942
 Project: Nutch
  Issue Type: Task
Reporter: Julien Nioche
Priority: Minor
  Labels: newbie
 Fix For: 1.11


 We should leverage the domain related utilities from crawler-commons instead 
 of duplicating them in the `org.apache.nutch.util.domain` package. For 
 instance we could deprecate TopLevelDomain and call the corresponding class 
 in CC instead. The resources in CC are more up-to-date and it is less code to 
 maintain.
 This would be a good task for someone willing to get to know the Nutch 
 codebase better and impress us all with the extent of his/her skills.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Nutch-Selenium in Nutch 1.10

2015-02-12 Thread Mattmann, Chris A (3980)
Yes I believe you need to install X11 - why don't you try and report back what 
you find thanks.

Sent from my iPhone

On Feb 12, 2015, at 8:28 AM, Jiaxin Ye 
jiaxi...@usc.edu wrote:

Hi professor, but can we use Selenium on Mac?

On Thursday, February 12, 2015, Mattmann, Chris A (3980) 
chris.a.mattm...@jpl.nasa.gov wrote:
You need Selenium Jiaxin, in order to crawl dynamic pages in the
polar dataset you have been assigned in my CSCI 572 search engines class.

The instructions for integrating Selenium with Nutch 1.10-trunk
are here:

https://issues.apache.org/jira/browse/NUTCH-1933


Cheers,
Chris


++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++






-Original Message-
From: Jiaxin Ye jiaxi...@usc.edu
Reply-To: dev@nutch.apache.org dev@nutch.apache.org
Date: Thursday, February 12, 2015 at 12:46 AM
To: dev@nutch.apache.org dev@nutch.apache.org
Subject: Re: Nutch-Selenium in Nutch 1.10

Well, good choice. I am thinking of changing to Ubuntu now. The thing is, why
do we need Selenium anyway? Just to make crawling easier?

On Thu, Feb 12, 2015 at 12:25 AM, Shuo Li
sli...@usc.edu wrote:

Interestingly, I'm a mac user but I don't want to screw my laptop so I'm
using vagrant with Ubuntu Trusty. It doesn't have GUI but Xvfb can still
be installed properly. The issue would be I don't know how to integrate
Selenium with Nutch 1.10.

On Thu, Feb 12, 2015 at 12:04 AM, Jiaxin Ye
jiaxi...@usc.edu wrote:

Hi all,


Does anyone here know where to find a setup tutorial for Selenium on Mac?
I find it difficult to install Xvfb on mac.


Best,
Jiaxin


On Tue, Feb 10, 2015 at 9:42 PM, Sapnashri Suresh
sapna...@usc.edu wrote:

Hi Shuo Li,


We were facing a similar issue. Prof. Mattmann suggested we look into this
patch for Selenium on Nutch 1.10:
https://issues.apache.org/jira/browse/NUTCH-1933.


Hope this helps!


Thanks,
Sapna

On Tue, Feb 10, 2015 at 9:36 PM, Shuo Li
sli...@usc.edu wrote:

Yop,


I'm trying to install selenium in Nutch 1.10. However, this error pops
out:


error: package org.apache.nutch.storage does not exist



I can only find this package in Nutch 2.x. Is there a way to use Selenium
in 1.10?


Any advice would be appreciated.


Regards,
Shuo Li










--
Graduate Student
MS in CS (Data Science)
Viterbi School of Engineering
University of Southern California


Phone:
+1 650-307-9848

[jira] [Updated] (NUTCH-1724) LinkDBReader to support regex output filtering

2015-02-12 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1724:
-
Attachment: NUTCH-1724-trunk.patch

Modified to adhere to Lewis' changes. Will commit shortly unless objected to.

 LinkDBReader to support regex output filtering
 --

 Key: NUTCH-1724
 URL: https://issues.apache.org/jira/browse/NUTCH-1724
 Project: Nutch
  Issue Type: Improvement
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.11

 Attachments: NUTCH-1724-trunk.patch, NUTCH-1724-trunk.patch






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)