[jira] [Assigned] (NUTCH-1959) Improving CommonCrawlFormat implementations
[ https://issues.apache.org/jira/browse/NUTCH-1959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann reassigned NUTCH-1959:
----------------------------------------

    Assignee: Chris A. Mattmann

> Improving CommonCrawlFormat implementations
> -------------------------------------------
>
> Key: NUTCH-1959
> URL: https://issues.apache.org/jira/browse/NUTCH-1959
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 1.9
> Reporter: Giuseppe Totaro
> Assignee: Chris A. Mattmann
> Priority: Minor
> Attachments: NUTCH-1959.patch
>
> {{CommonCrawlFormat}} is an interface for Java classes that implement methods for writing data into Common Crawl format. {{AbstractCommonCrawlFormat}} is an abstract class that implements {{CommonCrawlFormat}} and provides abstract methods for "CommonCrawl formatter" classes.
> You can find attached a PATCH that includes some improvements for {{CommonCrawlFormat}}-based classes:
> * {{CommonCrawlFormat}} and {{AbstractCommonCrawlFormat}} now provide only the {{getJsonData()}} method, responsible for emitting JSON data.
> * {{AbstractCommonCrawlFormat}} also provides the abstract methods that each subclass has to implement in order to handle JSON objects.
> * {{CommonCrawlFormatSimple}} is a {{StringBuilder}}-based formatter that now also provides escaping of JSON string values.
> This PATCH aims at providing a better interface for implementing/extending {{CommonCrawlFormat}} classes.
> I would really appreciate your feedback.
> Thanks a lot,
> Giuseppe

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
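The escaping of JSON string values mentioned for {{CommonCrawlFormatSimple}} boils down to the standard JSON escape rules; a minimal, hypothetical Java sketch (class and method names are illustrative, not the patch's actual code) could look like:

```java
// Hypothetical sketch of StringBuilder-based JSON string escaping,
// covering quotes, backslashes, and control characters.
public class JsonEscapeSketch {
    static String escapeJson(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\b': sb.append("\\b");  break;
                case '\f': sb.append("\\f");  break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:
                    if (c < 0x20) {
                        // remaining control characters must be \ u-escaped
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }
}
```

Whether the actual patch handles the full control-character range this way is not visible from the issue text; the switch above is just the commonly required minimum for valid JSON output.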
[jira] [Work started] (NUTCH-1960) JUnit test for dump method of CommonCrawlDataDumper
[ https://issues.apache.org/jira/browse/NUTCH-1960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Work on NUTCH-1960 started by Chris A. Mattmann.

> JUnit test for dump method of CommonCrawlDataDumper
> ---------------------------------------------------
>
> Key: NUTCH-1960
> URL: https://issues.apache.org/jira/browse/NUTCH-1960
> Project: Nutch
> Issue Type: Test
> Affects Versions: 1.9
> Reporter: Giuseppe Totaro
> Assignee: Chris A. Mattmann
> Priority: Minor
> Attachments: NUTCH-1960.patch, test-segments.tar.gz
>
> Hi all,
> you can find attached the PATCH including an extremely simple JUnit test for the {{dump}} method of the {{CommonCrawlDataDumper}} class.
> Essentially, it checks whether {{dump}} is able to create a given list of files from Nutch segments (in {{testresources}}).
> Thanks a lot,
> Giuseppe

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Work started] (NUTCH-1959) Improving CommonCrawlFormat implementations
[ https://issues.apache.org/jira/browse/NUTCH-1959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Work on NUTCH-1959 started by Chris A. Mattmann.

> Improving CommonCrawlFormat implementations
> -------------------------------------------
>
> Key: NUTCH-1959
> URL: https://issues.apache.org/jira/browse/NUTCH-1959
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 1.9
> Reporter: Giuseppe Totaro
> Assignee: Chris A. Mattmann
> Priority: Minor
> Attachments: NUTCH-1959.patch
>
> {{CommonCrawlFormat}} is an interface for Java classes that implement methods for writing data into Common Crawl format. {{AbstractCommonCrawlFormat}} is an abstract class that implements {{CommonCrawlFormat}} and provides abstract methods for "CommonCrawl formatter" classes.
> You can find attached a PATCH that includes some improvements for {{CommonCrawlFormat}}-based classes:
> * {{CommonCrawlFormat}} and {{AbstractCommonCrawlFormat}} now provide only the {{getJsonData()}} method, responsible for emitting JSON data.
> * {{AbstractCommonCrawlFormat}} also provides the abstract methods that each subclass has to implement in order to handle JSON objects.
> * {{CommonCrawlFormatSimple}} is a {{StringBuilder}}-based formatter that now also provides escaping of JSON string values.
> This PATCH aims at providing a better interface for implementing/extending {{CommonCrawlFormat}} classes.
> I would really appreciate your feedback.
> Thanks a lot,
> Giuseppe

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (NUTCH-1960) JUnit test for dump method of CommonCrawlDataDumper
[ https://issues.apache.org/jira/browse/NUTCH-1960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann reassigned NUTCH-1960:
----------------------------------------

    Assignee: Chris A. Mattmann

> JUnit test for dump method of CommonCrawlDataDumper
> ---------------------------------------------------
>
> Key: NUTCH-1960
> URL: https://issues.apache.org/jira/browse/NUTCH-1960
> Project: Nutch
> Issue Type: Test
> Affects Versions: 1.9
> Reporter: Giuseppe Totaro
> Assignee: Chris A. Mattmann
> Priority: Minor
> Attachments: NUTCH-1960.patch, test-segments.tar.gz
>
> Hi all,
> you can find attached the PATCH including an extremely simple JUnit test for the {{dump}} method of the {{CommonCrawlDataDumper}} class.
> Essentially, it checks whether {{dump}} is able to create a given list of files from Nutch segments (in {{testresources}}).
> Thanks a lot,
> Giuseppe

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-1962) Need to have mimetype-filter.txt file available by default
[ https://issues.apache.org/jira/browse/NUTCH-1962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14357929#comment-14357929 ]

Lewis John McGibbney commented on NUTCH-1962:
---------------------------------------------

Can you please upload and commit. Thanks Jorge.

*Lewis*

> Need to have mimetype-filter.txt file available by default
> ----------------------------------------------------------
>
> Key: NUTCH-1962
> URL: https://issues.apache.org/jira/browse/NUTCH-1962
> Project: Nutch
> Issue Type: Improvement
> Components: plugin
> Reporter: Lewis John McGibbney
> Fix For: 1.10
>
> By default the mimetype-filter.txt file quoted within nutch-default.xml is not available. We need to provide this as it is a PITA to constantly have to add it to new crawler configurations.
> https://github.com/apache/nutch/blob/trunk/conf/nutch-default.xml#L1616-L1625

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-1962) Need to have mimetype-filter.txt file available by default
[ https://issues.apache.org/jira/browse/NUTCH-1962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14357883#comment-14357883 ]

Jorge Luis Betancourt Gonzalez commented on NUTCH-1962:
-------------------------------------------------------

+1 actually I have an example file prepared, and I'm ready to commit.

> Need to have mimetype-filter.txt file available by default
> ----------------------------------------------------------
>
> Key: NUTCH-1962
> URL: https://issues.apache.org/jira/browse/NUTCH-1962
> Project: Nutch
> Issue Type: Improvement
> Components: plugin
> Reporter: Lewis John McGibbney
> Fix For: 1.10
>
> By default the mimetype-filter.txt file quoted within nutch-default.xml is not available. We need to provide this as it is a PITA to constantly have to add it to new crawler configurations.
> https://github.com/apache/nutch/blob/trunk/conf/nutch-default.xml#L1616-L1625

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (NUTCH-1962) Need to have mimetype-filter.txt file available by default
Lewis John McGibbney created NUTCH-1962:
----------------------------------------

Summary: Need to have mimetype-filter.txt file available by default
Key: NUTCH-1962
URL: https://issues.apache.org/jira/browse/NUTCH-1962
Project: Nutch
Issue Type: Improvement
Components: plugin
Reporter: Lewis John McGibbney
Fix For: 1.10

By default the mimetype-filter.txt file quoted within nutch-default.xml is not available. We need to provide this as it is a PITA to constantly have to add it to new crawler configurations.

https://github.com/apache/nutch/blob/trunk/conf/nutch-default.xml#L1616-L1625

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-1957) FileDumper output file name collisions
[ https://issues.apache.org/jira/browse/NUTCH-1957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14357691#comment-14357691 ]

Sebastian Nagel commented on NUTCH-1957:
----------------------------------------

Just a few thoughts to finally solve this problem (see also NUTCH-1950):
* a URL is a unique name for a resource in the www
* md5(url) should also give a unique identifier
** ok, there may be collisions, but if we take a 128-bit MD5 sum we definitely hit a file system limit before, namely the max. number of files (in one directory). A common practice to limit the number of files is to split the MD5 sum into blocks of 3-4 characters and use the first part(s) as directory hierarchy, e.g., {{d7/a0/9ded039d2833ff602ac9d4cd5a8d_http_en_wikipedia_org_wiki_100}}.
** md5(content) has the disadvantage that the same URL, if re-crawled, is possibly stored under a new file name
* everything else (extension, URL, file name) is only used to make the file name human readable. We can freely skip some parts and/or special characters -- we do not risk any collisions.
* "As the FileDumper and the CommonCrawlDataDumper using the same way to store file, we can make this a util." -- of course!

> FileDumper output file name collisions
> --------------------------------------
>
> Key: NUTCH-1957
> URL: https://issues.apache.org/jira/browse/NUTCH-1957
> Project: Nutch
> Issue Type: Bug
> Components: tool
> Affects Versions: 1.10
> Reporter: Renxia Wang
> Priority: Minor
> Labels: dumper, filename, tools
>
> The FileDumper extracts the file base name and extension and uses basename.extension (e.g. given the url https://www.aoncadis.org/contact/Yarrow%20Axford/project.html, the basename.extension will be project.html) as the file name to dump the file.
> Code from FileDumper.java:
> String url = key.toString();
> String baseName = FilenameUtils.getBaseName(url);
> String extension = FilenameUtils.getExtension(url);
> ...
> String filename = baseName + "." + extension;
> This introduces file name collisions and leads to loss of data when using bin/nutch dump.
> Sample logs:
> 2015-03-10 23:38:01,192 INFO tools.FileDumper - Dumping URL: http://beringsea.eol.ucar.edu/data/
> 2015-03-10 23:38:01,193 INFO tools.FileDumper - Skipping writing: [testFileName/.html]: file already exists
> 2015-03-10 23:38:16,717 INFO tools.FileDumper - Dumping URL: http://catalog.eol.ucar.edu/
> 2015-03-10 23:38:16,719 INFO tools.FileDumper - Skipping writing: [testFileName/.html]: file already exists
> 2015-03-10 23:38:46,411 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Carin%20Ashjian/project.html
> 2015-03-10 23:38:46,411 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,411 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Christopher%20Arp/project.html
> 2015-03-10 23:38:46,411 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,411 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Dr.%20Knut%20Aagaard/project.html
> 2015-03-10 23:38:46,412 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,412 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Eric%20C.%20Apel/project.html
> 2015-03-10 23:38:46,412 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,412 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/John%20T.%20Andrews/project.html
> 2015-03-10 23:38:46,412 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,412 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Juha%20Alatalo/project.html
> 2015-03-10 23:38:46,413 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,413 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Kerim%20Aydin/project.html
> 2015-03-10 23:38:46,413 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,413 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Knut%20Aagaard/project.html
> 2015-03-10 23:38:46,413 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,413 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Mary%20Albert/project.html
> 2015-03-10 23:38:46,414 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,414 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/c
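Sebastian's sharding scheme above (md5(url) split into short directory prefixes, plus a sanitized URL suffix purely for human readability) could be sketched like this in Java; the class and method names are illustrative, not Nutch code:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Illustrative sketch of hash-based, collision-free output naming:
// uniqueness comes from md5(url); the readable suffix is cosmetic.
public class ShardedNameSketch {

    static String md5Hex(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : d) sb.append(String.format("%02x", b));
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e); // MD5 is available on every JVM
        }
    }

    static String shardedPath(String url) {
        String md5 = md5Hex(url); // 32 hex chars
        // Human-readable suffix; collisions here are harmless.
        String readable = url.replaceAll("[^a-zA-Z0-9]+", "_");
        return md5.substring(0, 2) + "/" + md5.substring(2, 4) + "/"
                + md5.substring(4) + "_" + readable;
    }
}
```

For the Wikipedia example in the comment this produces a path of the form dd/dd/…_http_en_wikipedia_org_wiki_100, capping the number of entries per directory at 256 for each shard level.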
RE: Handling servers with wrong Last Modified HTTP header
Hello Jorge,

This is an interesting but very complicated issue. First of all, do not rely on HTTP headers: they are incorrect at any scale larger than very small. This is true not only for Last-Modified (due to dynamic CMSes) but for many other headers as well. You can even find website descriptions in headers such as Content-Type, madness! The only reliable source of a document's date, and optionally time, is within the document itself. This introduces two new problems: 1) what format and language, and 2) where exactly can you find it. Let's discuss these two issues.

The first is the most straightforward to deal with; it is a two-stage process. First you need to extract anything that resembles a date format that is used on Earth, including non-numeric dates such as month names. Then you have to pass all those date candidates through a series of carefully aligned date formats (SimpleDateFormat) with the appropriate Locale set. This stage requires that you have identified the language of the document, or the part of the document you are processing in the case of multi-language documents. Luckily, I uploaded preliminary work as a Nutch parse-plugin a few years ago that does exactly this; check out NUTCH-1414 [1]. You present the extractor with a language and a piece of text, in this case the document's extracted text. It is very basic and has many flaws, but it should work nicely if you present it with concise fragments of text.

The second part of the solution is more cumbersome to deal with. NUTCH-1414 uses the document's extracted text as the source for date extraction, and it really has no clue as to where the date is located in the document's structure. If you use Nutch's basic text extraction (extract all TEXT nodes) you will get bad results for most documents. This can be partially solved by relying on Boilerpipe's text extraction. But using Boilerpipe may in turn prevent you from extracting dates that would actually be found with no text extraction algorithm at all!
Please check out NUTCH-1414 and see if it works for you. Hopefully, in your case, it will do what you want it to do. I decided a few years ago to move the improved date extraction tool to a separate project, get rid of Boilerpipe altogether, and build a new tool from scratch that can interface with a date extraction tool and has support for looking up the exact spot of the document's date. It works on 95% of the many hundreds of real web page tests, so if you need something that works at scale you can contact me off list; the stuff has not been open sourced.

Have fun!

Markus

[1]: https://issues.apache.org/jira/browse/NUTCH-1414

-----Original message-----
> From: Jorge Luis Betancourt González
> Sent: Tuesday 10th March 2015 4:23
> To: dev@nutch.apache.org
> Subject: Handling servers with wrong Last Modified HTTP header
>
> Recently in the search app we are working on we've encountered a lot of websites that have a wrong and invalid date in the Last Modified HTTP header, meaning for instance that an article posted on a news site back in 2010 has a Last Modified header of just a few days back. This could be for any number of reasons:
>
> - A new comment was added to the site
> - Some cache invalidation occurring in the source code of the website that affects the article's page
> - Perhaps a new ad showing in the sidebar
> - Or just plain wrong header handling in the platform code
>
> For what I've seen this is handled by several CMSs, even allowing to "tweak" the published date. My question is basically if anyone on the list has a suggestion on how to tackle this or how to address this situation. For the particular case that we've been working on, most of the URLs have the published date in the URL in the form of yyyy/mm/dd (or some similar fashion), so this could be one way of "guessing" the publication date of the article. I realize that this is no silver bullet but I'd love to get some feedback on this type of situation.
> From my experience, when people filter by date in our frontend app, they are usually trying to get news/articles by publication date instead of by Last Modified date, and they are confused when the returned results have very old publication dates; they usually don't check whether it is a new comment, for instance.
>
> I'm leaving the "how to implement this" aside for now, just interested in discussing how to deal with this type of situation. As stated, in our particular case we can rely on the URL patterns for a very good portion, but I was hoping to agree on some general approach that could be integrated in Nutch.
>
> Regards,
>
> PS: Should I post this also to the user list?
>
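Jorge's fallback of guessing the publication date from the URL path can be sketched with a simple regex. This assumes a year/month/day layout in the path and is illustrative only, not an agreed Nutch approach; the class and method names are hypothetical:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: pull a publication date out of URLs like
// http://news.example.com/2010/03/11/some-article.html
public class UrlDateSketch {

    private static final Pattern DATE = Pattern.compile(
            "/((?:19|20)\\d{2})/(0[1-9]|1[0-2])/(0[1-9]|[12]\\d|3[01])/");

    /** Returns "yyyy-MM-dd" if the URL embeds a plausible date, else null. */
    static String guessDate(String url) {
        Matcher m = DATE.matcher(url);
        if (m.find()) {
            return m.group(1) + "-" + m.group(2) + "-" + m.group(3);
        }
        return null;
    }
}
```

A real integration would also have to handle single-digit months/days and validate against the calendar (e.g. 02/30), which the regex deliberately does not attempt.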
[jira] [Updated] (NUTCH-1960) JUnit test for dump method of CommonCrawlDataDumper
[ https://issues.apache.org/jira/browse/NUTCH-1960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Giuseppe Totaro updated NUTCH-1960:
-----------------------------------

    Fix Version/s: (was: 1.9)

> JUnit test for dump method of CommonCrawlDataDumper
> ---------------------------------------------------
>
> Key: NUTCH-1960
> URL: https://issues.apache.org/jira/browse/NUTCH-1960
> Project: Nutch
> Issue Type: Test
> Affects Versions: 1.9
> Reporter: Giuseppe Totaro
> Priority: Minor
> Attachments: NUTCH-1960.patch, test-segments.tar.gz
>
> Hi all,
> you can find attached the PATCH including an extremely simple JUnit test for the {{dump}} method of the {{CommonCrawlDataDumper}} class.
> Essentially, it checks whether {{dump}} is able to create a given list of files from Nutch segments (in {{testresources}}).
> Thanks a lot,
> Giuseppe

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (NUTCH-1960) JUnit test for dump method of CommonCrawlDataDumper
[ https://issues.apache.org/jira/browse/NUTCH-1960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Giuseppe Totaro updated NUTCH-1960:
-----------------------------------

    Affects Version/s: 1.9

> JUnit test for dump method of CommonCrawlDataDumper
> ---------------------------------------------------
>
> Key: NUTCH-1960
> URL: https://issues.apache.org/jira/browse/NUTCH-1960
> Project: Nutch
> Issue Type: Test
> Affects Versions: 1.9
> Reporter: Giuseppe Totaro
> Priority: Minor
> Attachments: NUTCH-1960.patch, test-segments.tar.gz
>
> Hi all,
> you can find attached the PATCH including an extremely simple JUnit test for the {{dump}} method of the {{CommonCrawlDataDumper}} class.
> Essentially, it checks whether {{dump}} is able to create a given list of files from Nutch segments (in {{testresources}}).
> Thanks a lot,
> Giuseppe

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (NUTCH-1961) Provide multipart compression of Common Crawl data
Giuseppe Totaro created NUTCH-1961:
-----------------------------------

Summary: Provide multipart compression of Common Crawl data
Key: NUTCH-1961
URL: https://issues.apache.org/jira/browse/NUTCH-1961
Project: Nutch
Issue Type: Wish
Affects Versions: 1.9
Reporter: Giuseppe Totaro
Priority: Minor

Using the {{-gzip}} option in {{CommonCrawlDataDumper}}, users are able to compress data and create a TAR archive (using the [Apache Commons Compress|http://commons.apache.org/proper/commons-compress]). We could also provide the option to make multipart compressed archives using a size threshold. I did some tests using a {{CountingOutputStream}} "in the middle" in order to count the bytes written, but it requires flushing the output streams at each iteration. Furthermore, _gzip_ does not support multipart compression (we can split the archive into multiple {{.tar.gz}} files, but they have to be unzipped individually), whereas _zip_ does (even though this feature is not yet supported in Apache Commons Compress).

I would really appreciate your feedback/ideas about this.

Thanks a lot,
Giuseppe

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
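The "CountingOutputStream in the middle" experiment described above amounts to counting bytes as they pass and deciding when to roll over to a new part. A minimal JDK-only sketch of the counting half (this is not the Commons Compress class of the same name, and the rollover policy is only an assumption) might be:

```java
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Counts bytes flowing to the underlying stream so the caller can
// decide when the current .tar.gz part is "full" and a new part
// should be started.
public class CountingOut extends FilterOutputStream {

    private long count = 0;

    public CountingOut(OutputStream out) {
        super(out);
    }

    @Override
    public void write(int b) throws IOException {
        out.write(b);
        count++;
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        out.write(b, off, len); // bypass FilterOutputStream's byte-by-byte loop
        count += len;
    }

    public long getCount() {
        return count;
    }

    /** Rollover decision: true once the part reaches the size threshold. */
    public boolean shouldRoll(long threshold) {
        return count >= threshold;
    }
}
```

Note the caveat from the issue still applies: if the counter sits below a gzip stream, the count only reflects compressed bytes after a flush, which is exactly the per-iteration flushing cost Giuseppe mentions.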
[jira] [Updated] (NUTCH-1960) JUnit test for dump method of CommonCrawlDataDumper
[ https://issues.apache.org/jira/browse/NUTCH-1960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Giuseppe Totaro updated NUTCH-1960:
-----------------------------------

    Attachment: test-segments.tar.gz

> JUnit test for dump method of CommonCrawlDataDumper
> ---------------------------------------------------
>
> Key: NUTCH-1960
> URL: https://issues.apache.org/jira/browse/NUTCH-1960
> Project: Nutch
> Issue Type: Test
> Reporter: Giuseppe Totaro
> Priority: Minor
> Fix For: 1.9
> Attachments: NUTCH-1960.patch, test-segments.tar.gz
>
> Hi all,
> you can find attached the PATCH including an extremely simple JUnit test for the {{dump}} method of the {{CommonCrawlDataDumper}} class.
> Essentially, it checks whether {{dump}} is able to create a given list of files from Nutch segments (in {{testresources}}).
> Thanks a lot,
> Giuseppe

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (NUTCH-1960) JUnit test for dump method of CommonCrawlDataDumper
Giuseppe Totaro created NUTCH-1960:
-----------------------------------

Summary: JUnit test for dump method of CommonCrawlDataDumper
Key: NUTCH-1960
URL: https://issues.apache.org/jira/browse/NUTCH-1960
Project: Nutch
Issue Type: Test
Reporter: Giuseppe Totaro
Priority: Minor
Fix For: 1.9
Attachments: NUTCH-1960.patch

Hi all,
you can find attached the PATCH including an extremely simple JUnit test for the {{dump}} method of the {{CommonCrawlDataDumper}} class.
Essentially, it checks whether {{dump}} is able to create a given list of files from Nutch segments (in {{testresources}}).

Thanks a lot,
Giuseppe

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (NUTCH-1960) JUnit test for dump method of CommonCrawlDataDumper
[ https://issues.apache.org/jira/browse/NUTCH-1960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Giuseppe Totaro updated NUTCH-1960:
-----------------------------------

    Attachment: NUTCH-1960.patch

> JUnit test for dump method of CommonCrawlDataDumper
> ---------------------------------------------------
>
> Key: NUTCH-1960
> URL: https://issues.apache.org/jira/browse/NUTCH-1960
> Project: Nutch
> Issue Type: Test
> Reporter: Giuseppe Totaro
> Priority: Minor
> Fix For: 1.9
> Attachments: NUTCH-1960.patch
>
> Hi all,
> you can find attached the PATCH including an extremely simple JUnit test for the {{dump}} method of the {{CommonCrawlDataDumper}} class.
> Essentially, it checks whether {{dump}} is able to create a given list of files from Nutch segments (in {{testresources}}).
> Thanks a lot,
> Giuseppe

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (NUTCH-1959) Improving CommonCrawlFormat implementations
[ https://issues.apache.org/jira/browse/NUTCH-1959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Giuseppe Totaro updated NUTCH-1959:
-----------------------------------

    Attachment: NUTCH-1959.patch

> Improving CommonCrawlFormat implementations
> -------------------------------------------
>
> Key: NUTCH-1959
> URL: https://issues.apache.org/jira/browse/NUTCH-1959
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 1.9
> Reporter: Giuseppe Totaro
> Priority: Minor
> Attachments: NUTCH-1959.patch
>
> {{CommonCrawlFormat}} is an interface for Java classes that implement methods for writing data into Common Crawl format. {{AbstractCommonCrawlFormat}} is an abstract class that implements {{CommonCrawlFormat}} and provides abstract methods for "CommonCrawl formatter" classes.
> You can find attached a PATCH that includes some improvements for {{CommonCrawlFormat}}-based classes:
> * {{CommonCrawlFormat}} and {{AbstractCommonCrawlFormat}} now provide only the {{getJsonData()}} method, responsible for emitting JSON data.
> * {{AbstractCommonCrawlFormat}} also provides the abstract methods that each subclass has to implement in order to handle JSON objects.
> * {{CommonCrawlFormatSimple}} is a {{StringBuilder}}-based formatter that now also provides escaping of JSON string values.
> This PATCH aims at providing a better interface for implementing/extending {{CommonCrawlFormat}} classes.
> I would really appreciate your feedback.
> Thanks a lot,
> Giuseppe

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (NUTCH-1959) Improving CommonCrawlFormat implementations
Giuseppe Totaro created NUTCH-1959:
-----------------------------------

Summary: Improving CommonCrawlFormat implementations
Key: NUTCH-1959
URL: https://issues.apache.org/jira/browse/NUTCH-1959
Project: Nutch
Issue Type: Improvement
Affects Versions: 1.9
Reporter: Giuseppe Totaro
Priority: Minor

{{CommonCrawlFormat}} is an interface for Java classes that implement methods for writing data into Common Crawl format. {{AbstractCommonCrawlFormat}} is an abstract class that implements {{CommonCrawlFormat}} and provides abstract methods for "CommonCrawl formatter" classes.

You can find attached a PATCH that includes some improvements for {{CommonCrawlFormat}}-based classes:
* {{CommonCrawlFormat}} and {{AbstractCommonCrawlFormat}} now provide only the {{getJsonData()}} method, responsible for emitting JSON data.
* {{AbstractCommonCrawlFormat}} also provides the abstract methods that each subclass has to implement in order to handle JSON objects.
* {{CommonCrawlFormatSimple}} is a {{StringBuilder}}-based formatter that now also provides escaping of JSON string values.

This PATCH aims at providing a better interface for implementing/extending {{CommonCrawlFormat}} classes.

I would really appreciate your feedback.

Thanks a lot,
Giuseppe

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-1957) FileDumper output file name collisions
[ https://issues.apache.org/jira/browse/NUTCH-1957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14357147#comment-14357147 ]

Renxia Wang commented on NUTCH-1957:
------------------------------------

Hi Giuseppe,

About the latter way, is it possible that the url contains special characters that cannot be used as part of a path/filename? If not, this way should work; however, it may make the downstream processing complicated, as the user would have to traverse all the paths to get the files, e.g. when posting the dumped data to Solr.

I am thinking of taking the MD5 of the file content and appending it to the end of the file basename, before the extension, like basename-md5.extension. Currently, the FileDumper uses the full path to the output file to calculate the MD5, but as the files are stored in the same dir, the MD5 may be the same, which still causes file name collisions. We may need to use the MD5 of the file content instead.

As the FileDumper and the CommonCrawlDataDumper use the same way to store files, we can make this a util.

Thanks,
Renxia

> FileDumper output file name collisions
> --------------------------------------
>
> Key: NUTCH-1957
> URL: https://issues.apache.org/jira/browse/NUTCH-1957
> Project: Nutch
> Issue Type: Bug
> Components: tool
> Affects Versions: 1.10
> Reporter: Renxia Wang
> Priority: Minor
> Labels: dumper, filename, tools
>
> The FileDumper extracts the file base name and extension and uses basename.extension (e.g. given the url https://www.aoncadis.org/contact/Yarrow%20Axford/project.html, the basename.extension will be project.html) as the file name to dump the file.
> Code from FileDumper.java:
> String url = key.toString();
> String baseName = FilenameUtils.getBaseName(url);
> String extension = FilenameUtils.getExtension(url);
> ...
> String filename = baseName + "." + extension;
> This introduces file name collisions and leads to loss of data when using bin/nutch dump.
> Sample logs:
> 2015-03-10 23:38:01,192 INFO tools.FileDumper - Dumping URL: http://beringsea.eol.ucar.edu/data/
> 2015-03-10 23:38:01,193 INFO tools.FileDumper - Skipping writing: [testFileName/.html]: file already exists
> 2015-03-10 23:38:16,717 INFO tools.FileDumper - Dumping URL: http://catalog.eol.ucar.edu/
> 2015-03-10 23:38:16,719 INFO tools.FileDumper - Skipping writing: [testFileName/.html]: file already exists
> 2015-03-10 23:38:46,411 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Carin%20Ashjian/project.html
> 2015-03-10 23:38:46,411 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,411 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Christopher%20Arp/project.html
> 2015-03-10 23:38:46,411 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,411 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Dr.%20Knut%20Aagaard/project.html
> 2015-03-10 23:38:46,412 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,412 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Eric%20C.%20Apel/project.html
> 2015-03-10 23:38:46,412 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,412 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/John%20T.%20Andrews/project.html
> 2015-03-10 23:38:46,412 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,412 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Juha%20Alatalo/project.html
> 2015-03-10 23:38:46,413 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,413 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Kerim%20Aydin/project.html
> 2015-03-10 23:38:46,413 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,413 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Knut%20Aagaard/project.html
> 2015-03-10 23:38:46,413 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,413 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Mary%20Albert/project.html
> 2015-03-10 23:38:46,414 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,414 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Yarrow%20Axford/project.html

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
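Renxia's proposed naming (basename, then the MD5 of the file content, then the extension) could be sketched as follows; the helper names are hypothetical, not the FileDumper API:

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch: basename-<md5(content)>.extension, so two different pages
// that both end in "project.html" no longer collide on disk.
public class DumpNameSketch {

    static String md5Hex(byte[] content) {
        try {
            byte[] d = MessageDigest.getInstance("MD5").digest(content);
            StringBuilder sb = new StringBuilder();
            for (byte b : d) sb.append(String.format("%02x", b));
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e); // MD5 is available on every JVM
        }
    }

    static String filename(String baseName, String extension, byte[] content) {
        return baseName + "-" + md5Hex(content) + "." + extension;
    }
}
```

As Sebastian notes elsewhere in the thread, hashing the content (rather than the URL) means a re-crawled page whose bytes changed gets a new file name, which may or may not be the desired behaviour.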
[jira] [Commented] (NUTCH-1957) FileDumper output file name collisions
[ https://issues.apache.org/jira/browse/NUTCH-1957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14357127#comment-14357127 ]

Giuseppe Totaro commented on NUTCH-1957:
----------------------------------------

Hi [~zhique],

I agree with your description. Using this "file-naming schema", some collisions may occur. If two or more files have the same basename but a different pathname, only the first file will be written, because all deserialized files are put into the same outputDir folder. Currently, the CommonCrawlDataDumper tool works in the same way.

I am working to solve it in the CommonCrawlDataDumper tool (but it is the same in FileDumper). We can use either a unique "key" value as filename (but it could be very long) or the same structure/hierarchy as the input. In the latter case, each output file has the same pathname as the original one.

Please give your feedback.

Thank you,
Giuseppe

> FileDumper output file name collisions
> --------------------------------------
>
> Key: NUTCH-1957
> URL: https://issues.apache.org/jira/browse/NUTCH-1957
> Project: Nutch
> Issue Type: Bug
> Components: tool
> Affects Versions: 1.10
> Reporter: Renxia Wang
> Priority: Minor
> Labels: dumper, filename, tools
>
> The FileDumper extracts the file base name and extension and uses basename.extension (e.g. given the url https://www.aoncadis.org/contact/Yarrow%20Axford/project.html, the basename.extension will be project.html) as the file name to dump the file.
> Code from FileDumper.java:
> String url = key.toString();
> String baseName = FilenameUtils.getBaseName(url);
> String extension = FilenameUtils.getExtension(url);
> ...
> String filename = baseName + "." + extension;
> This introduces file name collisions and leads to loss of data when using bin/nutch dump.
> Sample logs: > 2015-03-10 23:38:01,192 INFO tools.FileDumper - Dumping URL: > http://beringsea.eol.ucar.edu/data/ > 2015-03-10 23:38:01,193 INFO tools.FileDumper - Skipping writing: > [testFileName/.html]: file already exists > 2015-03-10 23:38:16,717 INFO tools.FileDumper - Dumping URL: > http://catalog.eol.ucar.edu/ > 2015-03-10 23:38:16,719 INFO tools.FileDumper - Skipping writing: > [testFileName/.html]: file already exists > 2015-03-10 23:38:46,411 INFO tools.FileDumper - Dumping URL: > https://www.aoncadis.org/contact/Carin%20Ashjian/project.html > 2015-03-10 23:38:46,411 INFO tools.FileDumper - Skipping writing: > [testFileName/project.html]: file already exists > 2015-03-10 23:38:46,411 INFO tools.FileDumper - Dumping URL: > https://www.aoncadis.org/contact/Christopher%20Arp/project.html > 2015-03-10 23:38:46,411 INFO tools.FileDumper - Skipping writing: > [testFileName/project.html]: file already exists > 2015-03-10 23:38:46,411 INFO tools.FileDumper - Dumping URL: > https://www.aoncadis.org/contact/Dr.%20Knut%20Aagaard/project.html > 2015-03-10 23:38:46,412 INFO tools.FileDumper - Skipping writing: > [testFileName/project.html]: file already exists > 2015-03-10 23:38:46,412 INFO tools.FileDumper - Dumping URL: > https://www.aoncadis.org/contact/Eric%20C.%20Apel/project.html > 2015-03-10 23:38:46,412 INFO tools.FileDumper - Skipping writing: > [testFileName/project.html]: file already exists > 2015-03-10 23:38:46,412 INFO tools.FileDumper - Dumping URL: > https://www.aoncadis.org/contact/John%20T.%20Andrews/project.html > 2015-03-10 23:38:46,412 INFO tools.FileDumper - Skipping writing: > [testFileName/project.html]: file already exists > 2015-03-10 23:38:46,412 INFO tools.FileDumper - Dumping URL: > https://www.aoncadis.org/contact/Juha%20Alatalo/project.html > 2015-03-10 23:38:46,413 INFO tools.FileDumper - Skipping writing: > [testFileName/project.html]: file already exists > 2015-03-10 23:38:46,413 INFO tools.FileDumper - Dumping URL: > 
https://www.aoncadis.org/contact/Kerim%20Aydin/project.html > 2015-03-10 23:38:46,413 INFO tools.FileDumper - Skipping writing: > [testFileName/project.html]: file already exists > 2015-03-10 23:38:46,413 INFO tools.FileDumper - Dumping URL: > https://www.aoncadis.org/contact/Knut%20Aagaard/project.html > 2015-03-10 23:38:46,413 INFO tools.FileDumper - Skipping writing: > [testFileName/project.html]: file already exists > 2015-03-10 23:38:46,413 INFO tools.FileDumper - Dumping URL: > https://www.aoncadis.org/contact/Mary%20Albert/project.html > 2015-03-10 23:38:46,414 INFO tools.FileDumper - Skipping writing: > [testFileName/project.html]: file already exists > 2015-03-10 23:38:46,414 INFO tools.FileDumper - Dumping URL: > https://www.aoncadis.org/contact/Yarrow%20Axford/project.html -- This message was sent by Atlassian JIRA (v6.3.4#6332)
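Giuseppe's second option above (keeping the same structure/hierarchy as the input) can be sketched roughly as follows. This is only an illustration, not Nutch code: the class and method names are made up, and `java.net.URI`/`java.nio.file.Paths` stand in for whatever path handling the real patch would use.

```java
import java.net.URI;
import java.nio.file.Paths;

// Illustrative sketch: mirror the source URL hierarchy under the output
// directory, so files that share a basename but live under different
// paths (e.g. .../Knut%20Aagaard/project.html vs .../Mary%20Albert/
// project.html) no longer collide on one output file name.
public class HierarchyNaming {
    public static String collisionFreePath(String outputDir, String url) {
        URI u = URI.create(url);
        // URI.getPath() returns the decoded path; fall back to a
        // placeholder name for bare host URLs (hypothetical choice).
        String path = (u.getPath() == null || u.getPath().isEmpty())
                ? "/index" : u.getPath();
        // Join outputDir + host + original path; Paths.get normalizes
        // any redundant separators.
        return Paths.get(outputDir, u.getHost(), path).toString();
    }
}
```

With this scheme the two AONCADIS URLs from the logs above map to two distinct output files, at the cost of a deeper output directory tree.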
Re: Google Summer of Code 2015 Mentor Registration
+1. 2015-03-06 22:01 GMT+02:00 Lewis John Mcgibbney : > Nutch PMC, > Please acknowledge my request to become a mentor for Google Summer of Code > 2015 projects for Apache > Nutch. > > My Melange username is lewismc. > > > -- Forwarded message -- > From: Ulrich Stärk > Date: Fri, Mar 6, 2015 at 11:32 AM > Subject: Google Summer of Code 2015 Mentor Registration > To: ment...@community.apache.org > > > Dear PMCs, > > I'm happy to announce that the ASF has made it onto the list of 137 accepted > organizations for > Google Summer of Code 2015! [1,2] > > It is now time for the mentors to sign up, so please pass this email on to > your community and > podlings. If you aren’t already subscribed to ment...@community.apache.org > you should do so now else > you might miss important information. > > Mentor signup requires two steps: mentor signup in Melange and PMC > acknowledgement. > > If you want to mentor a project in this year's SoC you will have to > > 1. Be an Apache committer. > 2. Register with Melange and set up a profile [3]. > 3. Add your username (formerly known as link_id) to [4]. This is NOT your > email address but your > Melange username. You can find it at the top of any page once you are logged > in. > 4. Request an acknowledgement from the PMC for which you want to mentor > projects. Use the below > template and do not forget to copy ment...@community.apache.org. > 5. Once a PMC member acknowledges the request to mentor, and only then, go > to [5] and send a > connection request. > > PMCs, read carefully please. > > We request that each mentor is acknowledged by a PMC member. This is to > ensure the mentor is in good > standing with the community. When you receive a request for acknowledgement, > please ACK it and cc > ment...@community.apache.org > > Lastly, it is not yet too late to record your ideas in Jira (see my previous > emails for details). 
> Students will now begin to explore ideas so if you haven’t already done so, > record your ideas > immediately! > > Cheers, > > Uli > > mentor request email template: > > to: private@.apache.org > cc: ment...@community.apache.org > subject: GSoC 2015 mentor request for > > PMC, > > please acknowledge my request to become a mentor for Google Summer of Code > 2015 projects for Apache > . > > My Melange username is . > > > > > > [1] http://www.google-melange.com/gsoc/org/list/public/google/gsoc2015 > [2] http://www.google-melange.com/gsoc/org2/google/gsoc2015/apache > [3] http://www.google-melange.com/gsoc/homepage/google/gsoc2015 > [4] https://svn.apache.org/repos/private/committers/GsocLinkId.txt > [5] > http://www.google-melange.com/gsoc/connection/start/user/google/gsoc2015/apache > > > > -- > Lewis -- Talat UYARER Websitesi: http://talat.uyarer.com Twitter: http://twitter.com/talatuyarer Linkedin: http://tr.linkedin.com/pub/talat-uyarer/10/142/304
Google Summer of Code 2015 Mentor Registration
Nutch PMC, Please acknowledge my request to become a mentor for Google Summer of Code 2015 projects for Apache Nutch. My Melange username is talat. -- Talat UYARER Websitesi: http://talat.uyarer.com Twitter: http://twitter.com/talatuyarer Linkedin: http://tr.linkedin.com/pub/talat-uyarer/10/142/304
[jira] [Commented] (NUTCH-1932) Automatically remove orphaned pages
[ https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14356711#comment-14356711 ] Markus Jelsma commented on NUTCH-1932: -- Hm yes, I've thought about using a scoring filter too. However, we do need some code in CrawlDbReducer.reduce() because in the end we want to completely remove the record from the CrawlDB. A work-around, maybe not elegant but useful, would be to introduce the CrawlDatum to URL filtering and normalizing. We have some other Nutch jobs that would benefit from having a method signature like normalize(String url, CrawlDatum datum, String scope); the same is true for filter. > Automatically remove orphaned pages > --- > > Key: NUTCH-1932 > URL: https://issues.apache.org/jira/browse/NUTCH-1932 > Project: Nutch > Issue Type: New Feature >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Fix For: 1.11 > > Attachments: NUTCH-1932.patch > > > Nutch should be able to automatically remove orphaned pages such as old > 404's, and not continue to revisit them. This requires NUTCH-1913. An inlink > count of 1 is enough. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
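The signature change Markus proposes above could look roughly like the sketch below. Everything here is hypothetical: CrawlDatum is stubbed minimally (the real class is org.apache.nutch.crawl.CrawlDatum), the interface name is invented, and OrphanFilter is just an example of a filter that uses the datum to drop pages with no inlinks, in the spirit of NUTCH-1932.

```java
// Sketch of a CrawlDatum-aware URL filter, as suggested in the comment
// above. Names and fields are assumptions, not actual Nutch APIs.
public class DatumFiltering {

    // Minimal stand-in for org.apache.nutch.crawl.CrawlDatum; the real
    // class carries status, fetch time, score, metadata, etc.
    static class CrawlDatum {
        final int inlinks; // hypothetical inlink count for the orphan check
        CrawlDatum(int inlinks) { this.inlinks = inlinks; }
    }

    // Hypothetical extended interface: filter sees the datum, not just
    // the URL string.
    interface DatumAwareURLFilter {
        /** Returns the URL to keep it, or null to reject the record. */
        String filter(String url, CrawlDatum datum, String scope);
    }

    // Example implementation: reject orphaned pages (no inlinks left),
    // which CrawlDbReducer.reduce() could then drop from the CrawlDB.
    static class OrphanFilter implements DatumAwareURLFilter {
        @Override
        public String filter(String url, CrawlDatum datum, String scope) {
            return datum.inlinks < 1 ? null : url;
        }
    }
}
```

The point of the extra parameter is that filtering decisions like "is this page orphaned?" depend on CrawlDB state, not on the URL string alone.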
[jira] [Created] (NUTCH-1957) FileDumper output file name collisions
Renxia Wang created NUTCH-1957: -- Summary: FileDumper output file name collisions Key: NUTCH-1957 URL: https://issues.apache.org/jira/browse/NUTCH-1957 Project: Nutch Issue Type: Bug Components: tool Affects Versions: 1.10 Reporter: Renxia Wang Priority: Minor The FileDumper extracts the file base name and extension and uses basename.extension as the file name to dump the file (e.g. given the URL https://www.aoncadis.org/contact/Yarrow%20Axford/project.html, basename.extension will be project.html). Code from FileDumper.java: String url = key.toString(); String baseName = FilenameUtils.getBaseName(url); String extension = FilenameUtils.getExtension(url); ... String filename = baseName + "." + extension; This introduces file name collisions and leads to loss of data when using bin/nutch dump. Sample logs: 2015-03-10 23:38:01,192 INFO tools.FileDumper - Dumping URL: http://beringsea.eol.ucar.edu/data/ 2015-03-10 23:38:01,193 INFO tools.FileDumper - Skipping writing: [testFileName/.html]: file already exists 2015-03-10 23:38:16,717 INFO tools.FileDumper - Dumping URL: http://catalog.eol.ucar.edu/ 2015-03-10 23:38:16,719 INFO tools.FileDumper - Skipping writing: [testFileName/.html]: file already exists 2015-03-10 23:38:46,411 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Carin%20Ashjian/project.html 2015-03-10 23:38:46,411 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists 2015-03-10 23:38:46,411 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Christopher%20Arp/project.html 2015-03-10 23:38:46,411 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists 2015-03-10 23:38:46,411 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Dr.%20Knut%20Aagaard/project.html 2015-03-10 23:38:46,412 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists 2015-03-10 23:38:46,412 INFO tools.FileDumper - Dumping URL: 
https://www.aoncadis.org/contact/Eric%20C.%20Apel/project.html 2015-03-10 23:38:46,412 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists 2015-03-10 23:38:46,412 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/John%20T.%20Andrews/project.html 2015-03-10 23:38:46,412 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists 2015-03-10 23:38:46,412 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Juha%20Alatalo/project.html 2015-03-10 23:38:46,413 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists 2015-03-10 23:38:46,413 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Kerim%20Aydin/project.html 2015-03-10 23:38:46,413 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists 2015-03-10 23:38:46,413 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Knut%20Aagaard/project.html 2015-03-10 23:38:46,413 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists 2015-03-10 23:38:46,413 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Mary%20Albert/project.html 2015-03-10 23:38:46,414 INFO tools.FileDumper - Skipping writing: [testFileName/project.html]: file already exists 2015-03-10 23:38:46,414 INFO tools.FileDumper - Dumping URL: https://www.aoncadis.org/contact/Yarrow%20Axford/project.html -- This message was sent by Atlassian JIRA (v6.3.4#6332)
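The collision in the FileDumper.java snippet quoted above can be demonstrated with a few lines of plain Java. This is an approximation, not the real code: it uses simple string handling where FileDumper uses Commons IO's FilenameUtils, and the ".html" fallback for directory URLs is an assumption made to match the "[testFileName/.html]" lines in the logs.

```java
// Demonstrates why the basename + "." + extension scheme collides:
// every .../<person>/project.html URL yields the same file name, and
// directory URLs (trailing slash) lose their name entirely.
public class CollisionDemo {
    public static String dumpName(String url) {
        // Take everything after the last '/', approximating
        // FilenameUtils.getBaseName + "." + getExtension.
        String last = url.substring(url.lastIndexOf('/') + 1);
        // Trailing-slash URLs leave an empty name; assume an ".html"
        // fallback to mirror the "[testFileName/.html]" log lines.
        return last.isEmpty() ? ".html" : last;
    }
}
```

Every distinct contact page in the logs above collapses onto the single name project.html, so only the first one dumped survives; all later ones hit the "file already exists" branch.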