Re: GSoC 2015

2015-02-05 Thread Talat Uyarer
Hi Folks,

Hadoop 2 support is ready for Nutch 2.x; I am just waiting for Gora 0.6. My ideas:

Sitemap, Jsoup (HTML5 parser), and RDF and Microformats support would be good.

Talat


2015-02-05 13:03 GMT+02:00 Markus Jelsma :
> Well, Hadoop 2.x sounds right indeed!
>
> -Original message-
> From: Julien Nioche
> Sent: Thursday 5th February 2015 1:34
> To: dev@nutch.apache.org
> Subject: Re: GSoC 2015
>
> Moving to Hadoop 2.x ?
>
> On 4 February 2015 at 14:42, Lewis John Mcgibbney wrote:
>
> Hi Folks,
>
> Does anyone have any good ideas for GSoC?
>
> Seb mentioned moving Nutch towards Spark so potentially a pluggable runtime 
> execution engine abstraction?
>
> I am currently working on a lot of security and authentication related work 
> so I would possibly be tempted to overhaul and improve that aspect of Nutch.
>
> Any other ideas?
>
> Thanks folks
> Lewis
>
> --
>
> Lewis
>
> --
>
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/ 
> http://www.digitalpebble.com 
> 
> http://twitter.com/digitalpebble 
>
>



-- 
Talat UYARER
Website: http://talat.uyarer.com
Twitter: http://twitter.com/talatuyarer
Linkedin: http://tr.linkedin.com/pub/talat-uyarer/10/142/304


[jira] [Commented] (NUTCH-1928) Indexing filter of documents by the MIME type

2015-02-05 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14308030#comment-14308030
 ] 

Lewis John McGibbney commented on NUTCH-1928:
-

Fantastic [~jorgelbg]. Does anyone else have comments?


> Indexing filter of documents by the MIME type
> -
>
> Key: NUTCH-1928
> URL: https://issues.apache.org/jira/browse/NUTCH-1928
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer, plugin
>Reporter: Jorge Luis Betancourt Gonzalez
>Assignee: Jorge Luis Betancourt Gonzalez
>  Labels: filter, mime-type, plugin
> Fix For: 1.10
>
> Attachments: mimetype-patch-v3.patch
>
>
> This allows indexed documents to be filtered by the MIME type property of the 
> crawled content. Basically, this lets you restrict the MIME types of 
> the content stored in the Solr/Elasticsearch index without the need 
> to restrict the crawling/parsing process, so there is no need to use the 
> URLFilter plugin family. It also addresses one particular corner case: certain 
> URLs don't have any format to filter on, such as some RSS feeds 
> (http://www.awesomesite.com/feed), and would otherwise end up in your index 
> mixed with all your HTML content.
> A configuration file can be specified in the {{mimetype.filter.file}} property 
> in {{nutch-site.xml}}. This file uses the same format as the 
> {{urlfilter-suffix}} plugin. If no {{mimetype.filter.file}} key is found, an 
> {{allow all}} policy is used instead, so all your crawled documents will be 
> indexed.
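
As an illustration of the behaviour described in the issue summary above, here is a minimal sketch in Python. It is not the plugin's code (the plugin itself ships in the attached patch); the helper names `read_property`, `make_mime_filter` and `build_filter`, the sample file name `mimetype-filter.txt`, and the simplification of the rules file to a plain allow-list are all assumptions made for illustration. It shows only the two ideas the description states: the filter file is looked up through the `mimetype.filter.file` key, and a missing key means an allow-all policy.

```python
import xml.etree.ElementTree as ET


def read_property(site_xml, name):
    """Return the value of a <property> entry in a Hadoop-style config, or None."""
    root = ET.fromstring(site_xml)
    for prop in root.iter("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return None


def make_mime_filter(allowed_types):
    """Build a predicate over MIME types; None means 'allow all'."""
    if allowed_types is None:
        return lambda mime_type: True
    allowed = {t.strip().lower() for t in allowed_types}
    return lambda mime_type: mime_type.strip().lower() in allowed


def build_filter(site_xml, load_allowed_types):
    """Pick the policy from the config.

    `load_allowed_types` is a hypothetical callable that reads the rules
    file and returns the MIME types to keep; the real file format is not
    reproduced here.
    """
    filter_file = read_property(site_xml, "mimetype.filter.file")
    if filter_file is None:
        return make_mime_filter(None)  # no key configured: allow everything
    return make_mime_filter(load_allowed_types(filter_file))


# Sample nutch-site.xml content; the file name is made up for this sketch.
SITE_XML = """<configuration>
  <property>
    <name>mimetype.filter.file</name>
    <value>mimetype-filter.txt</value>
  </property>
</configuration>"""
```

With this shape, a feed served as `application/rss+xml` is dropped when only `text/html` is allowed, and kept when the key is absent from the configuration.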



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-1928) Indexing filter of documents by the MIME type

2015-02-05 Thread Jorge Luis Betancourt Gonzalez (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14307974#comment-14307974
 ] 

Jorge Luis Betancourt Gonzalez commented on NUTCH-1928:
---

[~lewismc] I've updated the patch:

* I was generating the patch from our internal SVN repository, where we 
keep our plugins separate from the rest of the Nutch distribution, so the 
previous patch couldn't be applied from $NUTCH_HOME. I've now generated the 
patch from the 1.9 $NUTCH_HOME (sources).
* As usual you were right: I was using the deprecated JUnit test 
syntax, sorry for that. 

As usual really useful feedback!



[jira] [Updated] (NUTCH-1928) Indexing filter of documents by the MIME type

2015-02-05 Thread Jorge Luis Betancourt Gonzalez (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Luis Betancourt Gonzalez updated NUTCH-1928:
--
Attachment: mimetype-patch-v3.patch



[jira] [Updated] (NUTCH-1928) Indexing filter of documents by the MIME type

2015-02-05 Thread Jorge Luis Betancourt Gonzalez (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Luis Betancourt Gonzalez updated NUTCH-1928:
--
Attachment: (was: mimetype-patch-v2.patch)

> Indexing filter of documents by the MIME type
> -
>
> Key: NUTCH-1928
> URL: https://issues.apache.org/jira/browse/NUTCH-1928
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer, plugin
>Reporter: Jorge Luis Betancourt Gonzalez
>Assignee: Jorge Luis Betancourt Gonzalez
>  Labels: filter, mime-type, plugin
> Fix For: 1.10
>
>
> This allows indexed documents to be filtered by the MIME type property of the 
> crawled content. Basically, this lets you restrict the MIME types of 
> the content stored in the Solr/Elasticsearch index without the need 
> to restrict the crawling/parsing process, so there is no need to use the 
> URLFilter plugin family. It also addresses one particular corner case: certain 
> URLs don't have any format to filter on, such as some RSS feeds 
> (http://www.awesomesite.com/feed), and would otherwise end up in your index 
> mixed with all your HTML content.
> A configuration file can be specified in the {{mimetype.filter.file}} property 
> in {{nutch-site.xml}}. This file uses the same format as the 
> {{urlfilter-suffix}} plugin. If no {{mimetype.filter.file}} key is found, an 
> {{allow all}} policy is used instead, so all your crawled documents will be 
> indexed.





[Nutch Wiki] Trivial Update of "AdvancedAjaxInteraction" by LewisJohnMcgibbney

2015-02-05 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "AdvancedAjaxInteraction" page has been changed by LewisJohnMcgibbney:
https://wiki.apache.org/nutch/AdvancedAjaxInteraction?action=diff&rev1=1&rev2=2

  
  == Lets Begin with a Scenario ==
  
- xyz
+ So let's say that, as a Nutch crawl administrator, your client has tasked you 
with the following: '"Get me domain-specific material from a database such as 
NTIS"' (NTIS, the National Technical Information Service, serves as the 
largest central resource for government-funded scientific, technical, 
engineering, and business-related information available today).
+ What this really translates to is the following:
+  * use Nutch to log in to a database which requires 
[[https://wiki.apache.org/nutch/HttpPostAuthentication|HTTP POST 
authentication]]
+  * follow the redirect to the database landing query form
+  * submit a query to the form which will return a ranked list of search 
results for the given query
+  * interpret the JavaScript for each result in the ranked list
+  * use an 
[[http://nutch.apache.org/apidocs/apidocs-1.9/index.html?org/apache/nutch/parse/HtmlParseFilter.html|HtmlParseFilter]]
 to obtain high level article/document content
+  * submit a GET request to invoke JavaScript which will return a PDF of the 
full textual content for this document
+  * return the full document (PDF) content and metadata along with the HTML 
parse filter data
  
  == Related Development Issues ==
  


[INVITATION] Apache Nutch Google Summer of Code 2015

2015-02-05 Thread Lewis John Mcgibbney
Hi Folks,

The Nutch team is currently on the lookout for interested students willing
to take part in this year's Google Summer of Code program [0].
What is GSoC? A global program that offers students stipends to write code
for open source projects. In 2014 the Apache Nutch project participated in
a successful project which resulted in a web application for the Nutch 2.X
REST API.
We are currently soliciting applications ON ANY TOPIC/ISSUE from
students.
Interested students should write to dev@nutch.apache.org with their
questions/proposals, and one of the team will respond.
To give a helping hand, the PMC has already suggested a potential issue
which involves porting Nutch (trunk) to the Hadoop 2.X API [1]. We are,
however, open to any other topic students may be interested in and willing
to discuss.
Thanks in advance,
Lewis
(On behalf of Nutch PMC)

[0] https://www.google-melange.com/gsoc/homepage/google/gsoc2015
[1]
https://wiki.apache.org/nutch/GoogleSummerOfCode#NUTCH-1936_GSoC_2015_-_Move_Nutch_to_Hadoop_2.X

-- 
*Lewis*


[jira] [Commented] (NUTCH-1936) GSoC 2015 - Move Nutch to Hadoop 2.X

2015-02-05 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14307725#comment-14307725
 ] 

Lewis John McGibbney commented on NUTCH-1936:
-

https://wiki.apache.org/nutch/GoogleSummerOfCode#NUTCH-1936_GSoC_2015_-_Move_Nutch_to_Hadoop_2.X

> GSoC 2015 - Move Nutch to Hadoop 2.X
> 
>
> Key: NUTCH-1936
> URL: https://issues.apache.org/jira/browse/NUTCH-1936
> Project: Nutch
>  Issue Type: Task
>  Components: build
>Reporter: Lewis John McGibbney
>  Labels: gsoc2015
> Fix For: 2.4, 1.11
>
>
> The Nutch PMC 
> [discussed|http://www.mail-archive.com/dev%40nutch.apache.org/msg16250.html] 
> ideas for a good 2015 GSoC project. Porting the (trunk) 
> codebase to [Hadoop 2.X|http://hadoop.apache.org/docs/stable/] appears to be an 
> attractive option and one which would present an excellent learning 
> experience for a summer student.
> A more comprehensive description of this issue should be included within 
> either a mentor-defined project description or a successful student 
> application.





[Nutch Wiki] Trivial Update of "GoogleSummerOfCode" by LewisJohnMcgibbney

2015-02-05 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "GoogleSummerOfCode" page has been changed by LewisJohnMcgibbney:
https://wiki.apache.org/nutch/GoogleSummerOfCode?action=diff&rev1=7&rev2=8

  Primarily it should harvest documentation relating to these projects but most 
importantly it should act as a platform for anyone to contribute towards GSoC 
projects.
  
  == Projects ==
+ 
+ === 2015 ===
+  NUTCH-1936 GSoC 2015 - Move Nutch to Hadoop 2.X 
+ = Description =
+ The Nutch PMC discussed ideas for a good 2015 GSoC project. Porting the 
(trunk) codebase to Hadoop 2.X appears to be an attractive option and 
one which would present an excellent learning experience for a summer student.
+ 
+ = Student Proposal =
+ TODO
+ 
+ = Reports =
+ TODO
+ 
+ = Documentation =
+ TODO
+ 
+ = Jira Issues =
+ 
+  * [[https://issues.apache.org/jira/browse/NUTCH-1936|NUTCH-1936]] - GSoC 
2015 - Move Nutch to Hadoop 2.X (Parent Issue)
  
  === 2014 ===
   NUTCH-841 Create a Wicket-based Web Application for Nutch 


[jira] [Created] (NUTCH-1936) GSoC 2015 - Move Nutch to Hadoop 2.X

2015-02-05 Thread Lewis John McGibbney (JIRA)
Lewis John McGibbney created NUTCH-1936:
---

 Summary: GSoC 2015 - Move Nutch to Hadoop 2.X
 Key: NUTCH-1936
 URL: https://issues.apache.org/jira/browse/NUTCH-1936
 Project: Nutch
  Issue Type: Task
  Components: build
Reporter: Lewis John McGibbney
 Fix For: 2.4, 1.11


The Nutch PMC 
[discussed|http://www.mail-archive.com/dev%40nutch.apache.org/msg16250.html] 
ideas for a good 2015 GSoC project. Porting the (trunk) 
codebase to [Hadoop 2.X|http://hadoop.apache.org/docs/stable/] appears to be an 
attractive option and one which would present an excellent learning experience 
for a summer student.

A more comprehensive description of this issue should be included within either 
a mentor-defined project description or a successful student application.





[jira] [Commented] (NUTCH-1933) nutch-selenium plugin

2015-02-05 Thread Mo Omer (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14307539#comment-14307539
 ] 

Mo Omer commented on NUTCH-1933:


That's really cool to hear; I'll check out that link Lewis. As my employer no 
longer has the client for whom the project (a sort of contextual tagging 
service which derived html content via Nutch) was built, I haven't touched or 
thought of this in a while. A couple months ago, though, I found myself 
wondering if there are any better solutions available. 

Have you all evaluated WebEngine 
(http://docs.oracle.com/javase/8/javafx/api/javafx/scene/web/WebEngine.html)? 
Or setting up some sort of dom inside v8 and calling C funcs from Java?

One small additional note: the nutch-selenium plugin should also allow the 
time-delay (basically the time allowed for the page to render - including ajax 
etc.) to be configured.
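
The configurable render delay suggested above could look roughly like the following Python sketch. It is an illustration under stated assumptions, not code from nutch-selenium: `wait_until`, `timeout_s` and `poll_interval_s` are hypothetical names, and instead of a fixed sleep it polls a caller-supplied readiness check until a configurable timeout expires.

```python
import time


def wait_until(condition, timeout_s=3.0, poll_interval_s=0.1):
    """Poll `condition` until it returns True or `timeout_s` seconds elapse.

    Returns True if the condition was met in time, False otherwise.
    Both durations are plain parameters so they can come from configuration.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval_s)
```

A caller would pass whatever "page is rendered" check it has available; the point of the sketch is only that the maximum wait is a setting rather than a hard-coded value.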

> nutch-selenium plugin
> -
>
> Key: NUTCH-1933
> URL: https://issues.apache.org/jira/browse/NUTCH-1933
> Project: Nutch
>  Issue Type: Bug
>  Components: protocol
>Reporter: Mo Omer
>Assignee: Lewis John McGibbney
> Fix For: 1.10
>
> Attachments: NUTCH-selenium-trunk.patch
>
>
> I updated the [nutch-selenium|https://github.com/momer/nutch-selenium] 
> plugin to run against trunk.
> There is a good bit of work still to be done here; however, early testing 
> on my system indicates that it works. 





[jira] [Commented] (NUTCH-1933) nutch-selenium plugin

2015-02-05 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14307491#comment-14307491
 ] 

Lewis John McGibbney commented on NUTCH-1933:
-

[~momer], thanks for the feedback
bq. is there a "Beginners guide to helping out with Apache projects on Jira?"
http://wiki.apache.org/nutch/Becoming_A_Nutch_Developer
It would be *fantastic* if you were able to create a patch for us with your 
[selenium-grid plugin|https://github.com/momer/nutch-selenium-grid-plugin] as 
well. We are currently evaluating selenium as a mechanism for driving JS 
interaction prior to fetching the webpage and returning it to the parser. 
Improving your plugins is where I think we are going.



[jira] [Commented] (NUTCH-1933) nutch-selenium plugin

2015-02-05 Thread Mo Omer (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14307481#comment-14307481
 ] 

Mo Omer commented on NUTCH-1933:


Right on - glad you all found it useful enough to integrate. As I mentioned on 
GH, I'd definitely recommend also including the selenium-grid plugin, since 
it's a way saner approach to integrating with Selenium.

When I cobbled this together, I was under pretty hard deadline pressures, and 
left a lot of cruft in. All references/files belonging to the old html-unit 
should be removed, .idea files/directories which I'd missed in my .gitignore 
should be tossed out; HttpResponse.java should be nearly empty when completed; 
HttpWebClient should allow the tag which Selenium collects innerHtml for to be 
configured (right now it's just 'body' with no config options).

This, and some Hadoop work a couple weeks after putting this together, was 
really the first time I'd used Java (outside of JRuby, which doesn't 
really count), so I apologize for the wack code smells I left in.



[jira] [Comment Edited] (NUTCH-1933) nutch-selenium plugin

2015-02-05 Thread Mo Omer (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14307481#comment-14307481
 ] 

Mo Omer edited comment on NUTCH-1933 at 2/5/15 4:16 PM:


Right on - glad you all found it useful enough to integrate. As I mentioned on 
GH, I'd definitely recommend also including the selenium-grid plugin, since 
it's a way saner approach to integrating with Selenium. 

When I cobbled this together, I was under pretty hard deadline pressures, and 
left a lot of cruft in. All references/files belonging to the old html-unit 
should be removed, .idea files/directories which I'd missed in my .gitignore 
should be tossed out; HttpResponse.java should be nearly empty when completed; 
HttpWebClient should allow the tag which Selenium collects innerHtml for to be 
configured (right now it's just 'body' with no config options).

This, and some Hadoop work a couple weeks after putting this together, was 
really the first time I'd used Java (outside of JRuby, which doesn't 
really count), so I apologize for the wack code smells I left in.

Lastly, it would be really dope to have some mention in the code that I'd put 
it together originally; but I'd also appreciate some pointers in how to get 
more involved with Apache projects - is there a "Beginners guide to helping out 
with Apache projects on Jira?"




[jira] [Commented] (NUTCH-1933) nutch-selenium plugin

2015-02-05 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14307372#comment-14307372
 ] 

Lewis John McGibbney commented on NUTCH-1933:
-

Hi Folks, additionally we started a [wiki 
document|https://wiki.apache.org/nutch/AdvancedAjaxInteraction] which brings 
some more context to this issue. We will be populating this further as work 
goes on.



[jira] [Commented] (NUTCH-1933) nutch-selenium plugin

2015-02-05 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14307332#comment-14307332
 ] 

Chris A. Mattmann commented on NUTCH-1933:
--

We should note that [~momer], I believe, was the one who started this plugin. 
I've been talking with Mo about getting this contributed to Apache: 

https://github.com/momer/nutch-selenium/commit/029907b45ff65679c41f334f0f3ff16afb7acc07

So, I asked Mo to come over here and take a look at Lewis's patch. Thanks all.

> nutch-selenium plugin
> -
>
> Key: NUTCH-1933
> URL: https://issues.apache.org/jira/browse/NUTCH-1933
> Project: Nutch
>  Issue Type: Bug
>  Components: protocol
>Reporter: Lewis John McGibbney
> Fix For: 1.10
>
> Attachments: NUTCH-selenium-trunk.patch
>
>
> I updated the [nutch-selenium|https://github.com/momer/nutch-selenium] 
> plugin to run against trunk.
> There is a good bit of work still to be done here; however, early testing 
> on my system indicates that it works. 





[jira] [Updated] (NUTCH-1933) nutch-selenium plugin

2015-02-05 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-1933:
-
Assignee: Lewis John McGibbney

> nutch-selenium plugin
> -
>
> Key: NUTCH-1933
> URL: https://issues.apache.org/jira/browse/NUTCH-1933
> Project: Nutch
>  Issue Type: Bug
>  Components: protocol
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 1.10
>
> Attachments: NUTCH-selenium-trunk.patch
>
>
> I updated the [nutch-selenium|https://github.com/momer/nutch-selenium] 
> plugin to run against trunk.
> There is a good bit of work still to be done here; however, early testing 
> on my system indicates that it works. 





[jira] [Updated] (NUTCH-1933) nutch-selenium plugin

2015-02-05 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-1933:
-
Reporter: Mo Omer  (was: Lewis John McGibbney)



[jira] [Commented] (NUTCH-1933) nutch-selenium plugin

2015-02-05 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14307324#comment-14307324
 ] 

Lewis John McGibbney commented on NUTCH-1933:
-

arrgh..., it appears to be utter garbage Markus. Sorry about that.



[jira] [Commented] (NUTCH-1933) nutch-selenium plugin

2015-02-05 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14307136#comment-14307136
 ] 

Markus Jelsma commented on NUTCH-1933:
--

Hey, what's this? 
src/plugin/protocol-selenium/.idea



RE: GSoC 2015

2015-02-05 Thread Markus Jelsma
Well, Hadoop 2.x sounds right indeed!

-Original message-
From: Julien Nioche
Sent: Thursday 5th February 2015 1:34
To: dev@nutch.apache.org
Subject: Re: GSoC 2015

Moving to Hadoop 2.x ?

On 4 February 2015 at 14:42, Lewis John Mcgibbney <lewis.mcgibb...@gmail.com> wrote:

Hi Folks,

Does anyone have any good ideas for GSoC?

Seb mentioned moving Nutch towards Spark so potentially a pluggable runtime 
execution engine abstraction?

I am currently working on a lot of security and authentication related work so 
I would possibly be tempted to overhaul and improve that aspect of Nutch.

Any other ideas?

Thanks folks
Lewis

--

Lewis

--

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/ 
http://www.digitalpebble.com 

http://twitter.com/digitalpebble 




Re: Blog topic: Maxmind's GeoIP2 API being used in Apache Nutch 1.10

2015-02-05 Thread Lewis John Mcgibbney
Good Evening Susan,
Please see the GDoc below for the proposed blog post
https://docs.google.com/document/d/1MSVQayQwqEtovIl4A1McmEEm8Q-xt8h0AXpWZzlu_7o/edit?usp=sharing
I've given you edit permissions, so please either comment on it so that I can
fix it, or alternatively edit it yourself.
@dev, feedback would be very welcome.
Thanks folks.
Good night
Lewis



On Mon, Feb 2, 2015 at 4:49 AM, Susan Fendrock 
wrote:

> Hi Lewis!
>
> We've received the green light for you to provide us with your guest blog
> entry.
>
> It will be some weeks before it makes its way into our publishing
> schedule, so you have plenty of time to write it.
>
> We ask that the blog be no more than 500 words.
>
> Also, can you summarize what our GeoIP customers will learn from your
> entry?
>
> We like our blog topics to be actionable for our readers.
>
> I do expect that we will provide some editing assistance from our end.
>
> Look forward to working with you and learning more about your topic,
>
> Susan
>
> On Fri, Jan 30, 2015 at 12:10 PM, Lewis John Mcgibbney <
> lewis.mcgibb...@gmail.com> wrote:
>
>> Thank you Susan.
>> Have a great weekend,
>> Lewis
>>
>>
>> On Fri, Jan 30, 2015 at 5:00 AM, Susan Fendrock 
>> wrote:
>>
>>> Hi Lewis,
>>>
>>> Thanks for filling me in more on your project.
>>>
>>> I will review this with others here at MaxMind and get back to you, once
>>> I've determined if we will be able to take you up on your kind offer.
>>>
>>> Susan
>>>
>>> On Thu, Jan 29, 2015 at 6:37 PM, Lewis John Mcgibbney <
>>> lewis.mcgibb...@gmail.com> wrote:
>>>
>>>> Good Afternoon Susan,
>>>> Thanks for your email.
>>>> I am one of a number of developers on the Open Source Apache Nutch
>>>> project [0]. As described on our website, Nutch is a mature,
>>>> production-ready Web Crawler which powers search and discovery for a broad
>>>> spectrum of organizations across a wide range of use cases!
>>>> I recently developed a piece of code (see Jira issue NUTCH-1660 [1])
>>>> which leverages the Maxmind GeoIP2-java API [2] to reverse geocode
>>>> the servers from which we fetch webpages. Right now the code is
>>>> configured to use the GeoIP2 insights service. We can do this because we
>>>> are able to obtain an IP address from the socket connection. The IP address
>>>> is then used within the GeoIP2-java client API to locate and provide us
>>>> with a bunch of geocoded data relating to the server.
>>>> My idea here was basically to feature the open source development and
>>>> open source projects which use the Maxmind technology: something like a
>>>> featured post which promotes both the Maxmind product and the Apache Nutch
>>>> project.
>>>> If required, I could provide you with some nice visualizations of the
>>>> servers I visit during one of my crawls, potentially overlaying IP
>>>> locations on a static map.
>>>> Please let me know if this sounds interesting to you. I am mostly
>>>> interested in promoting the open source technology we are engaged in
>>>> writing.
>>>> A point to mention here as well is the licensing which enables this
>>>> work, which is the Apache Software License v2.0. You will see that the
>>>> Maxmind GeoIP2-java client driver is also licensed under this license [4].
>>>> Thanks for any feedback.
>>>> lewis
>>>>
>>>> [0] http://nutch.apache.org
>>>> [1] https://issues.apache.org/jira/browse/NUTCH-1660
>>>> [2] https://github.com/maxmind/GeoIP2-java
>>>> [3] http://maxmind.wpengine.com/2013/07/01/introducing-the-geoip2-beta/
>>>> [4] https://github.com/maxmind/GeoIP2-java/blob/master/LICENSE
>>>>
>>>> On Thu, Jan 29, 2015 at 8:19 AM, Lewis John Mcgibbney <
>>>> lewis.mcgibb...@gmail.com> wrote:
>>>>
>>>>> Hi Susan,
>>>>> Just acknowledging this email. I will write this up during my lunch
>>>>> hour today.
>>>>> Thanks
>>>>> lewis
>>>>>
>>>>> On Thu, Jan 29, 2015 at 6:36 AM, Susan Fendrock wrote:
>>>>>
>>>>>> Hello Lewis!
>>>>>>
>>>>>> Thanks for getting in touch with us about potentially providing a
>>>>>> contribution to our blog.
>>>>>>
>>>>>> Could you provide a brief summary of the blog post you are
>>>>>> envisioning?
>>>>>>
>>>>>> Look forward to learning more about your project,
>>>>>>
>>>>>> Susan
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Susan Fendrock
>>>>>> Product Marketing
>>>>>> MaxMind, Inc.
>>>>>>
>>>>>> 617-500-4493 ext. 820
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> *Lewis*
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> *Lewis*
>>>>
>>>
>>>
>>>
>>> --
>>> Susan Fendrock
>>> Product Marketing
>>> MaxMind, Inc.
>>>
>>> 617-500-4493 ext. 820
>>>
>>
>>
>>
>> --
>> *Lewis*
>>
>
>
>
> --
> Susan Fendrock
> Product Marketing
> MaxMind, Inc.
>
> 617-500-4493 ext. 820
>



-- 
*Lewis*