Re: Blog topic: MaxMind's GeoIP2 API being used in Apache Nutch 1.10
Thank you Susan. Have a great weekend.

Lewis

On Fri, Jan 30, 2015 at 5:00 AM, Susan Fendrock sfendr...@maxmind.com wrote:

Hi Lewis,

Thanks for filling me in more on your project. I will review this with others here at MaxMind and get back to you once I've determined if we will be able to take you up on your kind offer.

Susan

On Thu, Jan 29, 2015 at 6:37 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote:

Good Afternoon Susan,

Thanks for your email. I am one of a number of developers on the open source Apache Nutch project [0]. As described on our website, Nutch is a well-matured, production-ready web crawler which powers search and discovery for a broad spectrum of organizations over a broad spectrum of use cases!

I recently developed a piece of code (Jira issue NUTCH-1660 [1]) which leverages the MaxMind GeoIP2-java API [2] for reverse geocoding of the server information from which we fetch webpages. Right now the code is configured to use the GeoIP2 Insights service. We can do this because we are able to obtain an IP address from the socket connection. The IP address is then used within the GeoIP2-java client API to locate and provide us with a bunch of geocoded data relating to the server.

My idea here was basically to feature the open source development and the open source projects which use the MaxMind technology. Something like a featured post which promotes both the MaxMind product and the Apache Nutch project. If required, I could provide you with some nice visualizations for servers I visit during one of my crawls. These could potentially overlay IP locations on a static map.

Please let me know if this sounds interesting to you. I am mostly interested in promoting the open source technology we are engaged in writing. A point worth mentioning here as well is the licensing which enables this work: the Apache Software License v2.0. You will see that the MaxMind GeoIP2-java client driver is also licensed under this license [4].

Thanks for any feedback.
lewis

[0] http://nutch.apache.org
[1] https://issues.apache.org/jira/browse/NUTCH-1660
[2] https://github.com/maxmind/GeoIP2-java
[3] http://maxmind.wpengine.com/2013/07/01/introducing-the-geoip2-beta/
[4] https://github.com/maxmind/GeoIP2-java/blob/master/LICENSE

On Thu, Jan 29, 2015 at 8:19 AM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote:

Hi Susan,

Just acknowledging this email. I will write this up during my lunch hour today.

Thanks
lewis

On Thu, Jan 29, 2015 at 6:36 AM, Susan Fendrock sfendr...@maxmind.com wrote:

Hello Lewis!

Thanks for getting in touch with us about potentially providing a contribution to our blog. Could you provide a brief summary of the blog post you are envisioning?

Look forward to learning more about your project,
Susan

--
Susan Fendrock
Product Marketing
MaxMind, Inc.
617-500-4493 ext. 820

--
*Lewis*
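The lookup Lewis describes (obtaining the server's IP address and handing it to the GeoIP2-java client) can be sketched as follows. This is a minimal illustration, not the actual NUTCH-1660 code; the `WebServiceClient` portion is shown commented out because it requires the GeoIP2-java dependency and MaxMind account credentials.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

public class GeoIpLookupSketch {

    // Resolve a host name to the dotted-quad address a geocoding client needs.
    public static String resolveIp(String host) throws UnknownHostException {
        return InetAddress.getByName(host).getHostAddress();
    }

    public static void main(String[] args) throws Exception {
        String ip = resolveIp("127.0.0.1");
        System.out.println(ip);

        // Hypothetical use of the GeoIP2 Insights web service client
        // (requires the GeoIP2-java jar plus a MaxMind user id and license key):
        //
        // WebServiceClient client =
        //     new WebServiceClient.Builder(userId, licenseKey).build();
        // InsightsResponse response =
        //     client.insights(InetAddress.getByName(ip));
        // ... inspect response.getCity(), response.getLocation(), etc.
    }
}
```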
[jira] [Created] (NUTCH-1928) Indexing filter of documents by the MIME type
Jorge Luis Betancourt Gonzalez created NUTCH-1928:
--------------------------------------------------

Summary: Indexing filter of documents by the MIME type
Key: NUTCH-1928
URL: https://issues.apache.org/jira/browse/NUTCH-1928
Project: Nutch
Issue Type: Improvement
Components: indexer, plugin
Reporter: Jorge Luis Betancourt Gonzalez
Fix For: 1.10

This allows filtering the indexed documents by the MIME type of the crawled content. Basically this will allow you to restrict the MIME types of the content that will be stored in the Solr/Elasticsearch index without the need to restrict the crawling/parsing process, so there is no need to use the URLFilter plugin family. This also addresses one particular corner case where certain URLs don't have any suffix to filter on, such as some RSS feeds (http://www.awesomesite.com/feed), which would otherwise end up in your index mixed with all your HTML content.

A configuration file can be specified via the {{mimetype.filter.file}} property in {{nutch-site.xml}}. This file uses the same format as the {{urlfilter-suffix}} plugin. If no {{mimetype.filter.file}} key is found, an {{allow all}} policy is used instead, so all your crawled documents will be indexed.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
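A minimal sketch of the allow/deny behaviour described above, with illustrative class and method names rather than the actual plugin code. An empty rule set models the {{allow all}} default used when no {{mimetype.filter.file}} is configured.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

public class MimeTypeFilterSketch {

    private final Set<String> allowed;

    // allowedTypes would come from parsing the configured filter file;
    // an empty collection means no file was configured.
    public MimeTypeFilterSketch(Collection<String> allowedTypes) {
        this.allowed = new HashSet<>(allowedTypes);
    }

    // Accept a document when no rules are configured ("allow all"),
    // or when its MIME type appears in the configured list.
    public boolean accepts(String mimeType) {
        return allowed.isEmpty() || allowed.contains(mimeType);
    }
}
```

With {{text/html}} as the only rule, a feed served as {{application/rss+xml}} would be skipped at indexing time even though it was crawled and parsed.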
[jira] [Updated] (NUTCH-1928) Indexing filter of documents by the MIME type
[ https://issues.apache.org/jira/browse/NUTCH-1928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jorge Luis Betancourt Gonzalez updated NUTCH-1928:
--------------------------------------------------

Attachment: mime-filter.patch
Labels: filter, mime-type, plugin

Adding the first version of the code.
[jira] [Comment Edited] (NUTCH-1928) Indexing filter of documents by the MIME type
[ https://issues.apache.org/jira/browse/NUTCH-1928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14299533#comment-14299533 ]

Lewis John McGibbney edited comment on NUTCH-1928 at 1/31/15 1:31 AM:
----------------------------------------------------------------------

[~jorgelbg] two immediate observations:
* Can you put license headers on all files?
* Can you format the code with the [eclipse-codeformatter|http://svn.apache.org/repos/asf/nutch/branches/2.x/eclipse-codeformat.xml]?

Thanks, I'll look forward to trying this out.
[jira] [Commented] (NUTCH-1928) Indexing filter of documents by the MIME type
[ https://issues.apache.org/jira/browse/NUTCH-1928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14299533#comment-14299533 ]

Lewis John McGibbney commented on NUTCH-1928:
---------------------------------------------

[~jorgelbg] two immediate observations:
* Can you put license headers on all files?
* Can you format the code with the [eclipse-codeformatter|http://svn.apache.org/repos/asf/nutch/branches/2.x/eclipse-codeformat.xml]?

Thanks, I'll look forward to trying this out.
[jira] [Created] (NUTCH-1929) Consider implementing dependency injection for crawling HTTPS sites that use self-signed certificates
Lewis John McGibbney created NUTCH-1929:
----------------------------------------

Summary: Consider implementing dependency injection for crawling HTTPS sites that use self-signed certificates
Key: NUTCH-1929
URL: https://issues.apache.org/jira/browse/NUTCH-1929
Project: Nutch
Issue Type: Improvement
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Fix For: 2.4, 1.11

It was mentioned [a while ago|http://www.mail-archive.com/user@nutch.apache.org/msg11416.html] that being able to crawl sites with a self-signed certificate required a simple code modification to the protocol-httpclient plugin:

{code}
// in org.apache.nutch.protocol.httpclient.Http
// Replace:
ProtocolSocketFactory factory = new SSLProtocolSocketFactory();
// With:
ProtocolSocketFactory factory = new DummySSLProtocolSocketFactory();
{code}

I can confirm that this patch actually fixes the issue; however, the thread hangs on a question which was never answered: is there dependency injection that can be used? This issue should investigate the logic required to make this decision at runtime.
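One shape the runtime decision could take is sketched below, using stand-in types since the real ProtocolSocketFactory comes from commons-httpclient. The idea is to drive the choice from a configuration flag; the boolean here would be read from a hypothetical property (a name like "http.ssl.accept.self.signed" is invented for illustration).

```java
public class SocketFactorySketch {

    // Stand-ins for the commons-httpclient types used by protocol-httpclient.
    interface ProtocolSocketFactory {}
    static class SSLProtocolSocketFactory implements ProtocolSocketFactory {}
    static class DummySSLProtocolSocketFactory implements ProtocolSocketFactory {}

    // Pick the factory at runtime from configuration instead of hard-coding it.
    static ProtocolSocketFactory choose(boolean acceptSelfSigned) {
        return acceptSelfSigned
            ? new DummySSLProtocolSocketFactory()   // trusts self-signed certs
            : new SSLProtocolSocketFactory();       // normal certificate checks
    }
}
```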
[jira] [Commented] (NUTCH-1918) TikaParser specifies a default namespace when generating DOM
[ https://issues.apache.org/jira/browse/NUTCH-1918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14298420#comment-14298420 ]

Hudson commented on NUTCH-1918:
-------------------------------

SUCCESS: Integrated in Nutch-trunk #2956 (See [https://builds.apache.org/job/Nutch-trunk/2956/])
NUTCH-1918 TikaParser specifies a default namespace when generating DOM (jnioche: http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1655966)
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMBuilder.java
* /nutch/trunk/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaParser.java

TikaParser specifies a default namespace when generating DOM
------------------------------------------------------------

Key: NUTCH-1918
URL: https://issues.apache.org/jira/browse/NUTCH-1918
Project: Nutch
Issue Type: Bug
Components: parser
Reporter: Julien Nioche
Fix For: 1.10
Attachments: NUTCH-1918.patch

The DOM generated by parse-tika differs from the one produced by parse-html. Ideally we should be able to use either parser with the same XPath expressions. This is related to [NUTCH-1592], but this time instead of being a matter of uppercase, the problem comes from the namespace used. This issue has been investigated and fixed in storm-crawler [https://github.com/DigitalPebble/storm-crawler/pull/58]. Here is what Guillaume explained there:

bq. When parsing the content, Tika creates a properly formatted XHTML document: all elements are created within the XHTML namespace.
bq. However, in XPath 1.0 there's no concept of a default namespace, so XPath expressions such as //BODY don't match anything. To make this work we would have to use //ns1:BODY and define a NamespaceContext which associates ns1 with http://www.w3.org/1999/xhtml.
bq. To keep the XPath expressions simpler, I modified the DOMBuilder, our SAX handler used to convert the SAX events into a DOM tree, to ignore a default namespace, and the ParserBolt initializes it with the XHTML namespace. This way //BODY matches.
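Guillaume's point about XPath 1.0 and default namespaces can be reproduced with the JDK alone. In this self-contained sketch (not the Nutch or storm-crawler code, and using a lowercase element name), an unprefixed //body matches nothing in a namespace-aware XHTML document, while binding an ns1 prefix to the XHTML namespace makes the expression match.

```java
import java.io.ByteArrayInputStream;
import java.util.Iterator;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class XPathNamespaceSketch {

    // Count the nodes matched by expr in a tiny XHTML document,
    // optionally binding the ns1 prefix to the XHTML namespace.
    public static double count(String expr, boolean bindXhtmlPrefix) throws Exception {
        String xhtml =
            "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>text</body></html>";
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        Document doc = dbf.newDocumentBuilder()
            .parse(new ByteArrayInputStream(xhtml.getBytes("UTF-8")));

        XPath xp = XPathFactory.newInstance().newXPath();
        if (bindXhtmlPrefix) {
            xp.setNamespaceContext(new NamespaceContext() {
                public String getNamespaceURI(String prefix) {
                    return "ns1".equals(prefix)
                        ? "http://www.w3.org/1999/xhtml" : "";
                }
                public String getPrefix(String uri) { return null; }
                public Iterator<String> getPrefixes(String uri) { return null; }
            });
        }
        return (Double) xp.evaluate("count(" + expr + ")", doc, XPathConstants.NUMBER);
    }

    public static void main(String[] args) throws Exception {
        // XPath 1.0 treats an unprefixed name as having no namespace,
        // so the first expression finds nothing.
        System.out.println(count("//body", false));
        System.out.println(count("//ns1:body", true));
    }
}
```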
[jira] [Resolved] (NUTCH-1889) Store all values from Tika metadata in Nutch metadata
[ https://issues.apache.org/jira/browse/NUTCH-1889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche resolved NUTCH-1889.
----------------------------------

Resolution: Fixed

Committed revision 1655960.

Store all values from Tika metadata in Nutch metadata
-----------------------------------------------------

Key: NUTCH-1889
URL: https://issues.apache.org/jira/browse/NUTCH-1889
Project: Nutch
Issue Type: Improvement
Components: parser
Reporter: Julien Nioche
Priority: Trivial
Fix For: 1.10
Attachments: NUTCH-1889.patch

Tika metadata can be multivalued, but we currently keep only the first value in the TikaParser.
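The idea behind the fix can be sketched with plain maps standing in for the Tika and Nutch metadata classes (the actual patch works against their real APIs): copy every value recorded for a key rather than only the first.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MultiValuedMetadataSketch {

    // Copy all values per key. The pre-NUTCH-1889 behaviour amounted to
    // keeping only values.get(0) for each key, dropping the rest.
    public static Map<String, List<String>> copyAll(Map<String, List<String>> tikaMeta) {
        Map<String, List<String>> nutchMeta = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : tikaMeta.entrySet()) {
            nutchMeta.put(e.getKey(), new ArrayList<>(e.getValue()));
        }
        return nutchMeta;
    }
}
```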