Re: Blog topic: Maxmind's GeoIP2 API being used in Apache Nutch 1.10

2015-01-30 Thread Lewis John Mcgibbney
Thank you Susan.
Have a great weekend.
Lewis


On Fri, Jan 30, 2015 at 5:00 AM, Susan Fendrock sfendr...@maxmind.com
wrote:

 Hi Lewis,

 Thanks for filling me in more on your project.

 I will review this with others here at MaxMind and get back to you, once
 I've determined if we will be able to take you up on your kind offer.

 Susan

 On Thu, Jan 29, 2015 at 6:37 PM, Lewis John Mcgibbney 
 lewis.mcgibb...@gmail.com wrote:

 Good Afternoon Susan,
 Thanks for your email.
 I am one of a number of developers on the Open Source Apache Nutch
 project [0]. As described on our website, Nutch is a well-matured,
 production-ready Web Crawler which powers search and discovery for a broad
 spectrum of organizations across an equally broad range of use cases!
 I recently developed a piece of code (see Jira issue NUTCH-1660 [1]) which
 leverages the MaxMind GeoIP2-java API [2] for reverse geocoding information
 about the servers from which we fetch webpages. Right now the code is
 configured to use the GeoIP2 Insights service. We can do this because we are
 able to obtain an IP address from the socket connection. The IP address is
 then used within the GeoIP2-java client API to locate the server and provide
 us with a collection of geocoded data relating to it.
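 For illustration, the flow described above can be sketched roughly as
 follows. This is a minimal sketch, not the NUTCH-1660 patch itself: the
 summarize() helper and the credential variables are assumptions, and the
 actual GeoIP2 Insights call (which needs a MaxMind account) is only
 indicated in comments.

```java
import java.net.InetAddress;
import java.util.Locale;

public class GeoSketch {
    /** Format a (country, latitude, longitude) triple the way a blog
     *  visualization might consume it. Purely illustrative helper. */
    public static String summarize(String country, double lat, double lon) {
        return String.format(Locale.ROOT, "%s @ %.4f,%.4f", country, lat, lon);
    }

    public static void main(String[] args) throws Exception {
        // The IP address we would geocode: in Nutch this comes from the
        // fetcher's socket connection; a literal address stands in here.
        InetAddress ip = InetAddress.getByName("128.101.101.101");
        // With MaxMind credentials, the Insights lookup would look roughly like:
        //   WebServiceClient client =
        //       new WebServiceClient.Builder(userId, licenseKey).build();
        //   InsightsResponse r = client.insights(ip);
        //   summarize(r.getCountry().getName(),
        //             r.getLocation().getLatitude(),
        //             r.getLocation().getLongitude());
        System.out.println(summarize("Example Country", 44.9759, -93.2166));
    }
}
```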
 My idea here was basically to feature the open source development and
 open source projects which use the MaxMind technology: something like a
 featured post which promotes both the MaxMind product and the Apache Nutch
 project.
 If required, I could provide you with some nice visualizations of the
 servers I visit during one of my crawls. These could potentially overlay IP
 locations on a static map.
 Please let me know if this sounds interesting to you. I am mostly
 interested in promoting the open source technology we are engaged in
 writing.
 A point to mention here as well is the licensing which enables this work:
 the Apache Software License v2.0. You will see that the MaxMind
 GeoIP2-java client driver is also released under this license [4].
 Thanks for any feedback.
 lewis

 [0] http://nutch.apache.org
 [1] https://issues.apache.org/jira/browse/NUTCH-1660
 [2] https://github.com/maxmind/GeoIP2-java
 [3] http://maxmind.wpengine.com/2013/07/01/introducing-the-geoip2-beta/
 [4] https://github.com/maxmind/GeoIP2-java/blob/master/LICENSE

 On Thu, Jan 29, 2015 at 8:19 AM, Lewis John Mcgibbney 
 lewis.mcgibb...@gmail.com wrote:

 Hi Susan,
 Just acknowledging this email. I will write this up during my lunch hour
 today.
 Thanks
 lewis

 On Thu, Jan 29, 2015 at 6:36 AM, Susan Fendrock sfendr...@maxmind.com
 wrote:

 Hello Lewis!

 Thanks for getting in touch with us about potentially providing a
 contribution to our blog.

 Could you provide a brief summary of the blog post you are envisioning?

 Look forward to learning more about your project,

 Susan


 --
 Susan Fendrock
 Product Marketing
 MaxMind, Inc.

 617-500-4493 ext. 820




 --
 *Lewis*




 --
 *Lewis*




 --
 Susan Fendrock
 Product Marketing
 MaxMind, Inc.

 617-500-4493 ext. 820




-- 
*Lewis*


[jira] [Created] (NUTCH-1928) Indexing filter of documents by the MIME type

2015-01-30 Thread Jorge Luis Betancourt Gonzalez (JIRA)
Jorge Luis Betancourt Gonzalez created NUTCH-1928:
-

 Summary: Indexing filter of documents by the MIME type
 Key: NUTCH-1928
 URL: https://issues.apache.org/jira/browse/NUTCH-1928
 Project: Nutch
  Issue Type: Improvement
  Components: indexer, plugin
Reporter: Jorge Luis Betancourt Gonzalez
 Fix For: 1.10


This allows filtering the indexed documents by the MIME type of the 
crawled content. Basically this will allow you to restrict the MIME type of the 
contents that will be stored in the Solr/Elasticsearch index without the need to 
restrict the crawling/parsing process, so there is no need to use the URLFilter 
plugin family. This also addresses one particular corner case where certain URLs 
don't have any format to filter on, such as some RSS feeds 
(http://www.awesomesite.com/feed), which would otherwise end up in your index 
mixed with all your HTML content.

A configuration file can be specified via the {{mimetype.filter.file}} property 
in {{nutch-site.xml}}. This file uses the same format as the 
{{urlfilter-suffix}} plugin. If no {{mimetype.filter.file}} key is found, an 
{{allow all}} policy is used instead, so all your crawled documents will be 
indexed.
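As a rough illustration of the configuration described above (the property and
file names are taken from the description; the rule file itself would follow
the {{urlfilter-suffix}} conventions and should be checked against that
plugin's documentation):

```xml
<!-- nutch-site.xml: hypothetical sketch pointing the MIME-type indexing
     filter at its rules file -->
<property>
  <name>mimetype.filter.file</name>
  <value>mimetype-filter.txt</value>
</property>
```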



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-1928) Indexing filter of documents by the MIME type

2015-01-30 Thread Jorge Luis Betancourt Gonzalez (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Luis Betancourt Gonzalez updated NUTCH-1928:
--
Attachment: mime-filter.patch

Adding the first version of the code

 Indexing filter of documents by the MIME type
 -

 Key: NUTCH-1928
 URL: https://issues.apache.org/jira/browse/NUTCH-1928
 Project: Nutch
  Issue Type: Improvement
  Components: indexer, plugin
Reporter: Jorge Luis Betancourt Gonzalez
  Labels: filter, mime-type, plugin
 Fix For: 1.10

 Attachments: mime-filter.patch


 This allows filtering the indexed documents by the MIME type of the 
 crawled content. Basically this will allow you to restrict the MIME type of 
 the contents that will be stored in the Solr/Elasticsearch index without the 
 need to restrict the crawling/parsing process, so there is no need to use the 
 URLFilter plugin family. This also addresses one particular corner case where 
 certain URLs don't have any format to filter on, such as some RSS feeds 
 (http://www.awesomesite.com/feed), which would otherwise end up in your index 
 mixed with all your HTML content.
 A configuration file can be specified via the {{mimetype.filter.file}} 
 property in {{nutch-site.xml}}. This file uses the same format as the 
 {{urlfilter-suffix}} plugin. If no {{mimetype.filter.file}} key is found, an 
 {{allow all}} policy is used instead, so all your crawled documents will be 
 indexed.





[jira] [Comment Edited] (NUTCH-1928) Indexing filter of documents by the MIME type

2015-01-30 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14299533#comment-14299533
 ] 

Lewis John McGibbney edited comment on NUTCH-1928 at 1/31/15 1:31 AM:
--

[~jorgelbg] two immediate observations.
 * Can you put license headers on all files?
 * Can you format the code with the 
[eclipse-codeformatter|http://svn.apache.org/repos/asf/nutch/branches/2.x/eclipse-codeformat.xml]?

Thanks, I'll look forward to trying this out.


was (Author: lewismc):
[~jorgelbg] two immediate observations.
 * Can you put license headers on all files?
 * Can you format the code with the 
[eclipse-codeformatter|http://svn.apache.org/repos/asf/nutch/branches/2.x/eclipse-codeformat.xml]?
Thanks, I'll look forward to trying this out.

 Indexing filter of documents by the MIME type
 -

 Key: NUTCH-1928
 URL: https://issues.apache.org/jira/browse/NUTCH-1928
 Project: Nutch
  Issue Type: Improvement
  Components: indexer, plugin
Reporter: Jorge Luis Betancourt Gonzalez
  Labels: filter, mime-type, plugin
 Fix For: 1.10

 Attachments: mime-filter.patch


 This allows filtering the indexed documents by the MIME type of the 
 crawled content. Basically this will allow you to restrict the MIME type of 
 the contents that will be stored in the Solr/Elasticsearch index without the 
 need to restrict the crawling/parsing process, so there is no need to use the 
 URLFilter plugin family. This also addresses one particular corner case where 
 certain URLs don't have any format to filter on, such as some RSS feeds 
 (http://www.awesomesite.com/feed), which would otherwise end up in your index 
 mixed with all your HTML content.
 A configuration file can be specified via the {{mimetype.filter.file}} 
 property in {{nutch-site.xml}}. This file uses the same format as the 
 {{urlfilter-suffix}} plugin. If no {{mimetype.filter.file}} key is found, an 
 {{allow all}} policy is used instead, so all your crawled documents will be 
 indexed.





[jira] [Commented] (NUTCH-1928) Indexing filter of documents by the MIME type

2015-01-30 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14299533#comment-14299533
 ] 

Lewis John McGibbney commented on NUTCH-1928:
-

[~jorgelbg] two immediate observations.
 * Can you put license headers on all files?
 * Can you format the code with the 
[eclipse-codeformatter|http://svn.apache.org/repos/asf/nutch/branches/2.x/eclipse-codeformat.xml]?
Thanks, I'll look forward to trying this out.

 Indexing filter of documents by the MIME type
 -

 Key: NUTCH-1928
 URL: https://issues.apache.org/jira/browse/NUTCH-1928
 Project: Nutch
  Issue Type: Improvement
  Components: indexer, plugin
Reporter: Jorge Luis Betancourt Gonzalez
  Labels: filter, mime-type, plugin
 Fix For: 1.10

 Attachments: mime-filter.patch


 This allows filtering the indexed documents by the MIME type of the 
 crawled content. Basically this will allow you to restrict the MIME type of 
 the contents that will be stored in the Solr/Elasticsearch index without the 
 need to restrict the crawling/parsing process, so there is no need to use the 
 URLFilter plugin family. This also addresses one particular corner case where 
 certain URLs don't have any format to filter on, such as some RSS feeds 
 (http://www.awesomesite.com/feed), which would otherwise end up in your index 
 mixed with all your HTML content.
 A configuration file can be specified via the {{mimetype.filter.file}} 
 property in {{nutch-site.xml}}. This file uses the same format as the 
 {{urlfilter-suffix}} plugin. If no {{mimetype.filter.file}} key is found, an 
 {{allow all}} policy is used instead, so all your crawled documents will be 
 indexed.





[jira] [Created] (NUTCH-1929) Consider implementing dependency injection for crawling HTTPS sites that use self-signed certificates

2015-01-30 Thread Lewis John McGibbney (JIRA)
Lewis John McGibbney created NUTCH-1929:
---

 Summary: Consider implementing dependency injection for crawling 
HTTPS sites that use self-signed certificates
 Key: NUTCH-1929
 URL: https://issues.apache.org/jira/browse/NUTCH-1929
 Project: Nutch
  Issue Type: Improvement
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 2.4, 1.11


It was mentioned [a while 
ago|http://www.mail-archive.com/user@nutch.apache.org/msg11416.html] that being 
able to crawl sites with a self-signed certificate required a simple code 
modification to the protocol-httpclient plugin.
{code}
in org.apache.nutch.protocol.httpclient.Http

Replace:

ProtocolSocketFactory factory = new SSLProtocolSocketFactory();

With:

ProtocolSocketFactory factory = new DummySSLProtocolSocketFactory();
{code}

I can confirm that this patch actually fixes the issue; however, the thread 
hangs on a question which was never answered.

Is there dependency injection that can be used?

This issue is to investigate the logic required to make this decision at 
runtime.
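One possible answer, sketched under assumptions: a boolean configuration
property (the name below is invented) decides at runtime which socket factory
to instantiate, so no code edit is needed per deployment. The factory types
are local stand-ins for the commons-httpclient ones so the sketch stays
self-contained; this is not the actual Nutch implementation.

```java
// Hypothetical sketch: choose the SSL socket factory from a config flag
// instead of hard-coding it. All names here are illustrative assumptions.
public class SocketFactorySelector {
    // Local stand-ins for the commons-httpclient types used in the plugin.
    public interface ProtocolSocketFactory {}
    public static class SSLProtocolSocketFactory implements ProtocolSocketFactory {}
    public static class DummySSLProtocolSocketFactory implements ProtocolSocketFactory {}

    /**
     * Pick the factory based on a flag such as a hypothetical
     * "https.trust.self.signed" property read from nutch-site.xml.
     */
    public static ProtocolSocketFactory select(boolean trustSelfSigned) {
        return trustSelfSigned
                ? new DummySSLProtocolSocketFactory()   // accept self-signed certs
                : new SSLProtocolSocketFactory();       // normal verification
    }

    public static void main(String[] args) {
        System.out.println(select(true).getClass().getSimpleName());
        System.out.println(select(false).getClass().getSimpleName());
    }
}
```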





[jira] [Commented] (NUTCH-1918) TikaParser specifies a default namespace when generating DOM

2015-01-30 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14298420#comment-14298420
 ] 

Hudson commented on NUTCH-1918:
---

SUCCESS: Integrated in Nutch-trunk #2956 (See 
[https://builds.apache.org/job/Nutch-trunk/2956/])
NUTCH-1918 TikaParser specifies a default namespace when generating DOM 
(jnioche: http://svn.apache.org/viewvc/nutch/trunk/?view=revrev=1655966)
* /nutch/trunk/CHANGES.txt
* 
/nutch/trunk/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMBuilder.java
* 
/nutch/trunk/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaParser.java


 TikaParser specifies a default namespace when generating DOM
 

 Key: NUTCH-1918
 URL: https://issues.apache.org/jira/browse/NUTCH-1918
 Project: Nutch
  Issue Type: Bug
  Components: parser
Reporter: Julien Nioche
 Fix For: 1.10

 Attachments: NUTCH-1918.patch


 The DOM generated by parse-tika differs from the one generated by 
 parse-html. Ideally we should be able to use either parser with the same 
 XPath expressions.
 This is related to [NUTCH-1592], but this time instead of being a matter of 
 uppercase, the problem comes from the namespace used. 
 This issue has been investigated and fixed in storm-crawler 
 [https://github.com/DigitalPebble/storm-crawler/pull/58].
 Here is what Guillaume explained there:
 bq. When parsing the content, Tika creates a properly formatted XHTML 
 document: all elements are created within the namespace XHTML.
 bq. However in XPath 1.0, there's no concept of a default namespace, so XPath 
 expressions such as //BODY don't match anything. To make this work we 
 should use //ns1:BODY and define a NamespaceContext which associates ns1 with 
 http://www.w3.org/1999/xhtml
 bq. To keep the XPath expressions simpler, I modified the DOMBuilder, which is 
 our SAX handler used to convert the SAX events into a DOM tree, to ignore a 
 default namespace, and the ParserBolt initializes it with the XHTML 
 namespace. This way //BODY matches.
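The XPath 1.0 behaviour described above can be reproduced with a small
standalone sketch using only the JDK's XPath support. A hand-written XHTML
snippet stands in for Tika's output, and the lowercase element names are an
assumption of this sample (the thread writes //BODY).

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

public class XhtmlXPathDemo {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    /** Parse a tiny document whose elements live in the XHTML namespace. */
    public static Document parse() {
        try {
            String xhtml = "<html xmlns=\"" + XHTML_NS + "\"><body>hello</body></html>";
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            return dbf.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    /** Evaluate an XPath, optionally binding the prefix ns1 to the XHTML namespace. */
    public static Node find(Document doc, String expr, boolean bindNs1) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            if (bindNs1) {
                xpath.setNamespaceContext(new NamespaceContext() {
                    public String getNamespaceURI(String prefix) {
                        return "ns1".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
                    }
                    public String getPrefix(String uri) { return null; }
                    public Iterator<String> getPrefixes(String uri) {
                        return Collections.<String>emptyList().iterator();
                    }
                });
            }
            return (Node) xpath.evaluate(expr, doc, XPathConstants.NODE);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse();
        // No default namespace in XPath 1.0, so //body matches nothing:
        System.out.println("//body     -> " + find(doc, "//body", false));
        // A bound prefix does match the namespaced element:
        System.out.println("//ns1:body -> " + find(doc, "//ns1:body", true).getNodeName());
    }
}
```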





[jira] [Resolved] (NUTCH-1889) Store all values from Tika metadata in Nutch metadata

2015-01-30 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche resolved NUTCH-1889.
--
Resolution: Fixed

Committed revision 1655960.


 Store all values from Tika metadata in Nutch metadata
 -

 Key: NUTCH-1889
 URL: https://issues.apache.org/jira/browse/NUTCH-1889
 Project: Nutch
  Issue Type: Improvement
  Components: parser
Reporter: Julien Nioche
Priority: Trivial
 Fix For: 1.10

 Attachments: NUTCH-1889.patch


 Tika metadata can be multivalued but we currently keep only the first value 
 in the TikaParser.
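 As a minimal illustration of the behaviour being fixed, the sketch below
 uses a plain map as a stand-in for a multivalued metadata container; the
 names here are assumptions, not the Tika API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a multivalued metadata container (not Tika's
// Metadata class): contrasts keeping only the first value with keeping all.
public class MultiValueSketch {
    private final Map<String, List<String>> data = new LinkedHashMap<>();

    public void add(String name, String value) {
        data.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    /** Behaviour before the fix: only the first value survives. */
    public String firstValue(String name) {
        List<String> values = data.get(name);
        return values == null ? null : values.get(0);
    }

    /** Behaviour after the fix: every value is carried over. */
    public List<String> allValues(String name) {
        List<String> values = data.get(name);
        return values == null ? Collections.<String>emptyList() : values;
    }

    public static void main(String[] args) {
        MultiValueSketch meta = new MultiValueSketch();
        meta.add("dc:creator", "Alice");
        meta.add("dc:creator", "Bob");
        System.out.println(meta.firstValue("dc:creator")); // drops the second author
        System.out.println(meta.allValues("dc:creator"));  // keeps both
    }
}
```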


