[jira] [Assigned] (NUTCH-1959) Improving CommonCrawlFormat implementations

2015-03-11 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann reassigned NUTCH-1959:


Assignee: Chris A. Mattmann

> Improving CommonCrawlFormat implementations
> ---
>
> Key: NUTCH-1959
> URL: https://issues.apache.org/jira/browse/NUTCH-1959
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.9
>Reporter: Giuseppe Totaro
>Assignee: Chris A. Mattmann
>Priority: Minor
> Attachments: NUTCH-1959.patch
>
>
> {{CommonCrawlFormat}} is an interface for Java classes that implement methods 
> for writing data into Common Crawl format. {{AbstractCommonCrawlFormat}} is 
> an abstract class that implements {{CommonCrawlFormat}} and provides abstract 
> methods for "CommonCrawl formatter" classes.
> Attached is a PATCH that includes some improvements for 
> {{CommonCrawlFormat}}-based classes:
> * {{CommonCrawlFormat}} and {{AbstractCommonCrawlFormat}} now provide only 
> the {{getJsonData()}} method, responsible for returning the JSON data.
> * {{AbstractCommonCrawlFormat}} also provides the abstract methods that each 
> subclass has to implement in order to handle JSON objects.
> * {{CommonCrawlFormatSimple}} is a {{StringBuilder}}-based formatter that now 
> also provides escaping of JSON string values.
> This PATCH aims at providing a better interface for implementing/extending 
> {{CommonCrawlFormat}} classes.
> I would really appreciate your feedback.
> Thanks a lot,
> Giuseppe



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Work started] (NUTCH-1960) JUnit test for dump method of CommonCrawlDataDumper

2015-03-11 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-1960 started by Chris A. Mattmann.

> JUnit test for dump method of CommonCrawlDataDumper
> ---
>
> Key: NUTCH-1960
> URL: https://issues.apache.org/jira/browse/NUTCH-1960
> Project: Nutch
>  Issue Type: Test
>Affects Versions: 1.9
>Reporter: Giuseppe Totaro
>Assignee: Chris A. Mattmann
>Priority: Minor
> Attachments: NUTCH-1960.patch, test-segments.tar.gz
>
>
> Hi all,
> attached is the PATCH with an extremely simple JUnit test for the {{dump}} 
> method of the {{CommonCrawlDataDumper}} class.
> Essentially, it checks whether {{dump}} is able to create a given list of files 
> from Nutch segments (in {{testresources}}).
> Thanks a lot,
> Giuseppe





[jira] [Work started] (NUTCH-1959) Improving CommonCrawlFormat implementations

2015-03-11 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-1959 started by Chris A. Mattmann.

> Improving CommonCrawlFormat implementations
> ---
>
> Key: NUTCH-1959
> URL: https://issues.apache.org/jira/browse/NUTCH-1959
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.9
>Reporter: Giuseppe Totaro
>Assignee: Chris A. Mattmann
>Priority: Minor
> Attachments: NUTCH-1959.patch
>
>
> {{CommonCrawlFormat}} is an interface for Java classes that implement methods 
> for writing data into Common Crawl format. {{AbstractCommonCrawlFormat}} is 
> an abstract class that implements {{CommonCrawlFormat}} and provides abstract 
> methods for "CommonCrawl formatter" classes.
> Attached is a PATCH that includes some improvements for 
> {{CommonCrawlFormat}}-based classes:
> * {{CommonCrawlFormat}} and {{AbstractCommonCrawlFormat}} now provide only 
> the {{getJsonData()}} method, responsible for returning the JSON data.
> * {{AbstractCommonCrawlFormat}} also provides the abstract methods that each 
> subclass has to implement in order to handle JSON objects.
> * {{CommonCrawlFormatSimple}} is a {{StringBuilder}}-based formatter that now 
> also provides escaping of JSON string values.
> This PATCH aims at providing a better interface for implementing/extending 
> {{CommonCrawlFormat}} classes.
> I would really appreciate your feedback.
> Thanks a lot,
> Giuseppe





[jira] [Assigned] (NUTCH-1960) JUnit test for dump method of CommonCrawlDataDumper

2015-03-11 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann reassigned NUTCH-1960:


Assignee: Chris A. Mattmann

> JUnit test for dump method of CommonCrawlDataDumper
> ---
>
> Key: NUTCH-1960
> URL: https://issues.apache.org/jira/browse/NUTCH-1960
> Project: Nutch
>  Issue Type: Test
>Affects Versions: 1.9
>Reporter: Giuseppe Totaro
>Assignee: Chris A. Mattmann
>Priority: Minor
> Attachments: NUTCH-1960.patch, test-segments.tar.gz
>
>
> Hi all,
> attached is the PATCH with an extremely simple JUnit test for the {{dump}} 
> method of the {{CommonCrawlDataDumper}} class.
> Essentially, it checks whether {{dump}} is able to create a given list of files 
> from Nutch segments (in {{testresources}}).
> Thanks a lot,
> Giuseppe





[jira] [Commented] (NUTCH-1962) Need to have mimetype-filter.txt file available by default

2015-03-11 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14357929#comment-14357929
 ] 

Lewis John McGibbney commented on NUTCH-1962:
-

Can you please upload and commit. Thanks Jorge.

On Wednesday, March 11, 2015, Jorge Luis Betancourt Gonzalez (JIRA) <



-- 
*Lewis*


> Need to have mimetype-filter.txt file available by default
> --
>
> Key: NUTCH-1962
> URL: https://issues.apache.org/jira/browse/NUTCH-1962
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin
>Reporter: Lewis John McGibbney
> Fix For: 1.10
>
>
> By default the mimetype-filter.txt file referenced in nutch-default.xml is 
> not available. We need to provide it, as it is a PITA to constantly have to 
> add it to new crawler configurations.
> https://github.com/apache/nutch/blob/trunk/conf/nutch-default.xml#L1616-L1625





[jira] [Commented] (NUTCH-1962) Need to have mimetype-filter.txt file available by default

2015-03-11 Thread Jorge Luis Betancourt Gonzalez (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14357883#comment-14357883
 ] 

Jorge Luis Betancourt Gonzalez commented on NUTCH-1962:
---

+1 actually I have an example file prepared, and I'm ready to commit.

> Need to have mimetype-filter.txt file available by default
> --
>
> Key: NUTCH-1962
> URL: https://issues.apache.org/jira/browse/NUTCH-1962
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin
>Reporter: Lewis John McGibbney
> Fix For: 1.10
>
>
> By default the mimetype-filter.txt file referenced in nutch-default.xml is 
> not available. We need to provide it, as it is a PITA to constantly have to 
> add it to new crawler configurations.
> https://github.com/apache/nutch/blob/trunk/conf/nutch-default.xml#L1616-L1625





[jira] [Created] (NUTCH-1962) Need to have mimetype-filter.txt file available by default

2015-03-11 Thread Lewis John McGibbney (JIRA)
Lewis John McGibbney created NUTCH-1962:
---

 Summary: Need to have mimetype-filter.txt file available by default
 Key: NUTCH-1962
 URL: https://issues.apache.org/jira/browse/NUTCH-1962
 Project: Nutch
  Issue Type: Improvement
  Components: plugin
Reporter: Lewis John McGibbney
 Fix For: 1.10


By default the mimetype-filter.txt file referenced in nutch-default.xml is not 
available. We need to provide it, as it is a PITA to constantly have to add it 
to new crawler configurations.

https://github.com/apache/nutch/blob/trunk/conf/nutch-default.xml#L1616-L1625






[jira] [Commented] (NUTCH-1957) FileDumper output file name collisions

2015-03-11 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14357691#comment-14357691
 ] 

Sebastian Nagel commented on NUTCH-1957:


Just a few thoughts to finally solve this problem (see also NUTCH-1950):
* a URL is a unique name for a resource in the www
* md5(url) should also give a unique identifier
** ok, there may be collisions, but with a 128-bit MD5 sum we will definitely 
hit a file system limit first, namely the max. number of files (in one 
directory). A common practice to limit the number of files is to split the MD5 
sum into blocks of 3-4 characters and use the first part(s) as directory 
hierarchy, e.g., 
{{d7/a0/9ded039d2833ff602ac9d4cd5a8d_http_en_wikipedia_org_wiki_100}}.
** md5(content) has the disadvantage that the same URL if re-crawled is 
possibly stored under a new file name
* everything else (extension, URL, file name) is only used to make the file 
name human readable. We can freely skip some parts and/or special characters -- 
we do not risk any collisions.
* "As the FileDumper and the CommonCrawlDataDumper using the same way to store 
file, we can make this a util." -- of course!
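
To make the md5(url) idea above concrete, here is a minimal sketch in plain Java. It is only an illustration under assumptions: the class and method names (DumpPathSketch, md5Hex, dumpPath) are hypothetical and not part of Nutch, the hierarchy uses two 2-character blocks as in the example path above, and the readable suffix is capped at 32 characters purely for convenience.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper, not part of Nutch: derives a dump path from a URL. */
final class DumpPathSketch {

  /** Returns the 32-character lowercase hex MD5 digest of the URL (UTF-8 bytes). */
  static String md5Hex(String url) {
    try {
      MessageDigest md = MessageDigest.getInstance("MD5");
      byte[] digest = md.digest(url.getBytes(StandardCharsets.UTF_8));
      StringBuilder hex = new StringBuilder(digest.length * 2);
      for (byte b : digest) {
        hex.append(String.format("%02x", b & 0xff));
      }
      return hex.toString();
    } catch (NoSuchAlgorithmException e) {
      // Every Java platform is required to provide MD5.
      throw new IllegalStateException(e);
    }
  }

  /** Builds "xx/yy/rest-of-digest_readable" where xx and yy are digest prefixes. */
  static String dumpPath(String url) {
    String hex = md5Hex(url);
    // The readable part is cosmetic only: uniqueness comes from the digest.
    String readable = url.replaceAll("[^A-Za-z0-9]+", "_");
    if (readable.length() > 32) {
      readable = readable.substring(0, 32);
    }
    return hex.substring(0, 2) + "/" + hex.substring(2, 4) + "/"
        + hex.substring(4) + "_" + readable;
  }
}
```

Because the digest is computed from the URL, a re-crawled URL maps to the same path, and two URLs that share a base name such as project.html end up in different files.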


> FileDumper output file name collisions
> --
>
> Key: NUTCH-1957
> URL: https://issues.apache.org/jira/browse/NUTCH-1957
> Project: Nutch
>  Issue Type: Bug
>  Components: tool
>Affects Versions: 1.10
>Reporter: Renxia Wang
>Priority: Minor
>  Labels: dumper, filename, tools
>
> The FileDumper extracts the file base name and extension and uses 
> <baseName>.<extension> (e.g. given the url 
> https://www.aoncadis.org/contact/Yarrow%20Axford/project.html, the 
> <baseName>.<extension> will be project.html) as the file name to dump the 
> file. 
> Code from FileDumper.java: 
> String url = key.toString();
> String baseName = FilenameUtils.getBaseName(url);
> String extension = FilenameUtils.getExtension(url);
> ...
> String filename = baseName + "." + extension;
> This introduces file name collisions and leads to loss of data when using 
> bin/nutch dump. 
> Sample logs:
> 2015-03-10 23:38:01,192 INFO  tools.FileDumper - Dumping URL: 
> http://beringsea.eol.ucar.edu/data/
> 2015-03-10 23:38:01,193 INFO  tools.FileDumper - Skipping writing: 
> [testFileName/.html]: file already exists
> 2015-03-10 23:38:16,717 INFO  tools.FileDumper - Dumping URL: 
> http://catalog.eol.ucar.edu/
> 2015-03-10 23:38:16,719 INFO  tools.FileDumper - Skipping writing: 
> [testFileName/.html]: file already exists
> 2015-03-10 23:38:46,411 INFO  tools.FileDumper - Dumping URL: 
> https://www.aoncadis.org/contact/Carin%20Ashjian/project.html
> 2015-03-10 23:38:46,411 INFO  tools.FileDumper - Skipping writing: 
> [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,411 INFO  tools.FileDumper - Dumping URL: 
> https://www.aoncadis.org/contact/Christopher%20Arp/project.html
> 2015-03-10 23:38:46,411 INFO  tools.FileDumper - Skipping writing: 
> [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,411 INFO  tools.FileDumper - Dumping URL: 
> https://www.aoncadis.org/contact/Dr.%20Knut%20Aagaard/project.html
> 2015-03-10 23:38:46,412 INFO  tools.FileDumper - Skipping writing: 
> [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,412 INFO  tools.FileDumper - Dumping URL: 
> https://www.aoncadis.org/contact/Eric%20C.%20Apel/project.html
> 2015-03-10 23:38:46,412 INFO  tools.FileDumper - Skipping writing: 
> [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,412 INFO  tools.FileDumper - Dumping URL: 
> https://www.aoncadis.org/contact/John%20T.%20Andrews/project.html
> 2015-03-10 23:38:46,412 INFO  tools.FileDumper - Skipping writing: 
> [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,412 INFO  tools.FileDumper - Dumping URL: 
> https://www.aoncadis.org/contact/Juha%20Alatalo/project.html
> 2015-03-10 23:38:46,413 INFO  tools.FileDumper - Skipping writing: 
> [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,413 INFO  tools.FileDumper - Dumping URL: 
> https://www.aoncadis.org/contact/Kerim%20Aydin/project.html
> 2015-03-10 23:38:46,413 INFO  tools.FileDumper - Skipping writing: 
> [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,413 INFO  tools.FileDumper - Dumping URL: 
> https://www.aoncadis.org/contact/Knut%20Aagaard/project.html
> 2015-03-10 23:38:46,413 INFO  tools.FileDumper - Skipping writing: 
> [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,413 INFO  tools.FileDumper - Dumping URL: 
> https://www.aoncadis.org/contact/Mary%20Albert/project.html
> 2015-03-10 23:38:46,414 INFO  tools.FileDumper - Skipping writing: 
> [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,414 INFO  tools.FileDumper - Dumping URL: 
> https://www.aoncadis.org/c

RE: Handling servers with wrong Last Modified HTTP header

2015-03-11 Thread Markus Jelsma
Hello Jorge,

This is an interesting but very complicated issue. First of all, do not rely on 
HTTP headers; they are incorrect on any scale larger than very small. This is 
true for Last-Modified because of dynamic CMSs, but also for many other headers. You can 
even expect website descriptions in headers such as Content-Type, madness!

The only reliable source of a document's date and optionally time is within the 
document itself. This introduces two new problems: 1) what format and 
language, and 2) where exactly can you find it. Let's discuss these two issues.

The first is the most straightforward to deal with, it is a two-stage process. 
First you need to extract anything that resembles a date format that is used on 
Earth, this includes non-numeric dates such as month names. Then you have to 
pass all those date candidates through a series of carefully aligned date 
formats (SimpleDateFormat) and set the appropriate Locale. This stage requires 
that you have identified the language of the document, or the part of the 
document you are processing in case of multi-language documents.
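
As a rough illustration of the two stages just described, here is a minimal sketch. It is not the NUTCH-1414 code: the class name, the candidate regex and the two formats are assumptions chosen only to keep the example short, and a real extractor would need far more patterns and formats.

```java
import java.text.ParsePosition;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical two-stage date extractor; names and patterns are assumptions. */
final class DateCandidateSketch {

  // Stage 1: a deliberately rough candidate pattern, covering only
  // "11 March 2015"-style and "2015-03-11"-style dates.
  private static final Pattern CANDIDATE = Pattern.compile(
      "\\b\\d{1,2} \\p{L}+ \\d{4}\\b|\\b\\d{4}-\\d{2}-\\d{2}\\b");

  // Stage 2: formats are tried in this order for every candidate.
  private static final String[] FORMATS = {"yyyy-MM-dd", "d MMMM yyyy"};

  /** Returns the first candidate that parses under the given locale, or null. */
  static Date firstDate(String text, Locale locale) {
    Matcher m = CANDIDATE.matcher(text);
    while (m.find()) {
      String candidate = m.group();
      for (String format : FORMATS) {
        SimpleDateFormat sdf = new SimpleDateFormat(format, locale);
        sdf.setLenient(false);
        ParsePosition pos = new ParsePosition(0);
        Date parsed = sdf.parse(candidate, pos);
        // Accept only if the whole candidate was consumed by the format.
        if (parsed != null && pos.getIndex() == candidate.length()) {
          return parsed;
        }
      }
    }
    return null;
  }
}
```

The Locale argument is where the identified document language comes in: month names such as "March" only parse when the locale passed to SimpleDateFormat matches the language of the text.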

Luckily, a few years ago I uploaded preliminary work as a Nutch parse plugin 
that does exactly this; check out NUTCH-1414 [1]. You present the extractor 
with a language and a piece of text, in this case the document's extracted 
text. It is very basic and has many flaws but it should work nicely if you 
present it with concise fragments of text.

The second part of the solution is more cumbersome to deal with. NUTCH-1414 
uses the document's extracted text as source for date extraction, and it has 
really no clue as to where the date is located in the document's structure. If 
you use Nutch's basic text extraction (extract all TEXT nodes) you will get bad 
results for most documents. It can be partially solved by relying on 
Boilerpipe's text extraction. But using Boilerpipe may in turn prevent you from 
extracting dates that actually got extracted using no text extraction algorithm 
at all!

Please check out NUTCH-1414 and see if it works for you. Hopefully, in your 
case, it will do what you want it to do. A few years ago I decided to move 
the improved date extraction tool to a separate project, get rid of 
Boilerpipe altogether, and build a new tool from scratch that can interface with 
a date extraction tool and supports looking up the exact spot of the 
document's date. It works on 95% of the many hundreds of real web page tests, so 
if you need something that works at scale, you can contact me off list; the 
stuff has not been open sourced.

Have fun!
Markus

[1]: https://issues.apache.org/jira/browse/NUTCH-1414
 
-Original message-
> From:Jorge Luis Betancourt González 
> Sent: Tuesday 10th March 2015 4:23
> To: dev@nutch.apache.org
> Subject: Handling servers with wrong Last Modified HTTP header
> 
> Recently, in the search app we are working on, we've encountered a lot of 
> websites that have a wrong or invalid date in the Last Modified HTTP header, 
> meaning for instance that an article posted on a news site back in 2010 has a 
> Last Modified header of just a few days back. This could be for any number of 
> reasons:
> 
> - A new comment was added to the site
> - Some cache invalidation occurring in the source code of the website that 
> affects the article's page
> - Perhaps a new ad showing in the sidebar
> - Or just plain wrong header handling in the platform code
> 
> From what I've seen, several CMSs handle this, some even allowing the 
> published date to be "tweaked". My question is basically whether anyone on the 
> list has a suggestion on how to tackle or address this situation. In the 
> particular case we've been working on, most of the URLs contain the published 
> date in the form of /mm/dd (or something similar), so this could be one way 
> of "guessing" the publication date of the article. I realize that this is no 
> silver bullet, but I'd love to get some feedback on this type of situation. 
> In my experience, when people filter by date in our frontend app, they are 
> usually trying to get news/articles by publication date instead of the Last 
> Modified date, and they are confused when the returned results have very old 
> publication dates; they usually don't check whether, for instance, there is a 
> new comment.
> 
> I'm leaving the "how to implement this" aside for now and am just interested in 
> discussing how to deal with this type of situation. As stated, in our 
> particular case we can rely on the URL patterns for a very good portion, but I 
> was hoping to agree on some general approach that could be integrated in 
> Nutch.
> 
> Regards,
> 
> PS: Should I post this also to the user list? 
> 


[jira] [Updated] (NUTCH-1960) JUnit test for dump method of CommonCrawlDataDumper

2015-03-11 Thread Giuseppe Totaro (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Giuseppe Totaro updated NUTCH-1960:
---
Fix Version/s: (was: 1.9)

> JUnit test for dump method of CommonCrawlDataDumper
> ---
>
> Key: NUTCH-1960
> URL: https://issues.apache.org/jira/browse/NUTCH-1960
> Project: Nutch
>  Issue Type: Test
>Affects Versions: 1.9
>Reporter: Giuseppe Totaro
>Priority: Minor
> Attachments: NUTCH-1960.patch, test-segments.tar.gz
>
>
> Hi all,
> attached is the PATCH with an extremely simple JUnit test for the {{dump}} 
> method of the {{CommonCrawlDataDumper}} class.
> Essentially, it checks whether {{dump}} is able to create a given list of files 
> from Nutch segments (in {{testresources}}).
> Thanks a lot,
> Giuseppe





[jira] [Updated] (NUTCH-1960) JUnit test for dump method of CommonCrawlDataDumper

2015-03-11 Thread Giuseppe Totaro (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Giuseppe Totaro updated NUTCH-1960:
---
Affects Version/s: 1.9

> JUnit test for dump method of CommonCrawlDataDumper
> ---
>
> Key: NUTCH-1960
> URL: https://issues.apache.org/jira/browse/NUTCH-1960
> Project: Nutch
>  Issue Type: Test
>Affects Versions: 1.9
>Reporter: Giuseppe Totaro
>Priority: Minor
> Attachments: NUTCH-1960.patch, test-segments.tar.gz
>
>
> Hi all,
> attached is the PATCH with an extremely simple JUnit test for the {{dump}} 
> method of the {{CommonCrawlDataDumper}} class.
> Essentially, it checks whether {{dump}} is able to create a given list of files 
> from Nutch segments (in {{testresources}}).
> Thanks a lot,
> Giuseppe





[jira] [Created] (NUTCH-1961) Provide multipart compression of Common Crawl data

2015-03-11 Thread Giuseppe Totaro (JIRA)
Giuseppe Totaro created NUTCH-1961:
--

 Summary: Provide multipart compression of Common Crawl data
 Key: NUTCH-1961
 URL: https://issues.apache.org/jira/browse/NUTCH-1961
 Project: Nutch
  Issue Type: Wish
Affects Versions: 1.9
Reporter: Giuseppe Totaro
Priority: Minor


Using the {{-gzip}} option in {{CommonCrawlDataDumper}}, users are able to compress 
data and create a TAR archive (using [Apache Commons 
Compress|http://commons.apache.org/proper/commons-compress]). 
We could also offer the option of creating a multipart compressed archive 
based on a size threshold. I did some tests using a {{CountingOutputStream}} "in the 
middle" in order to count the bytes written, but this requires flushing the output 
streams at each iteration.
Furthermore, _gzip_ does not support multipart compression (we can split the 
archive in multiple {{.tar.gz}} files but they have to be unzipped 
individually), whereas _zip_ does (even though this feature is not supported 
yet in Apache Commons Compress).
I would really appreciate your feedback/ideas about this.
Thanks a lot,
Giuseppe
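
One possible way to sketch the threshold idea with the JDK alone is shown below. This is an assumption-laden illustration, not the approach tested above: it counts the uncompressed bytes handed to the writer instead of the compressed bytes behind a {{CountingOutputStream}}, which sidesteps the flushing problem at the price of a less precise part size. It also leaves out the TAR layer and keeps the parts in memory; the class name MultipartGzipSketch is hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPOutputStream;

/** Hypothetical writer that starts a new gzip part once a threshold is reached. */
final class MultipartGzipSketch {
  private final long threshold; // max uncompressed bytes per part
  private final List<ByteArrayOutputStream> parts = new ArrayList<>();
  private GZIPOutputStream current;
  private long rawBytesInPart;

  MultipartGzipSketch(long threshold) {
    this.threshold = threshold;
  }

  /** Writes one record; records are never split across parts. */
  void write(byte[] record) throws IOException {
    boolean partWouldOverflow =
        rawBytesInPart > 0 && rawBytesInPart + record.length > threshold;
    if (current == null || partWouldOverflow) {
      rollOver();
    }
    current.write(record);
    rawBytesInPart += record.length;
  }

  /** Closes the current part (if any) and opens a fresh one. */
  private void rollOver() throws IOException {
    if (current != null) {
      current.close();
    }
    ByteArrayOutputStream sink = new ByteArrayOutputStream();
    parts.add(sink);
    current = new GZIPOutputStream(sink);
    rawBytesInPart = 0;
  }

  /** Closes the last part and returns every part as a standalone gzip stream. */
  List<byte[]> finish() throws IOException {
    if (current != null) {
      current.close();
      current = null;
    }
    List<byte[]> result = new ArrayList<>();
    for (ByteArrayOutputStream part : parts) {
      result.add(part.toByteArray());
    }
    return result;
  }
}
```

Each returned part is a complete gzip stream, so, as noted above for split {{.tar.gz}} files, the parts have to be decompressed individually.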





[jira] [Updated] (NUTCH-1960) JUnit test for dump method of CommonCrawlDataDumper

2015-03-11 Thread Giuseppe Totaro (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Giuseppe Totaro updated NUTCH-1960:
---
Attachment: test-segments.tar.gz

> JUnit test for dump method of CommonCrawlDataDumper
> ---
>
> Key: NUTCH-1960
> URL: https://issues.apache.org/jira/browse/NUTCH-1960
> Project: Nutch
>  Issue Type: Test
>Reporter: Giuseppe Totaro
>Priority: Minor
> Fix For: 1.9
>
> Attachments: NUTCH-1960.patch, test-segments.tar.gz
>
>
> Hi all,
> attached is the PATCH with an extremely simple JUnit test for the {{dump}} 
> method of the {{CommonCrawlDataDumper}} class.
> Essentially, it checks whether {{dump}} is able to create a given list of files 
> from Nutch segments (in {{testresources}}).
> Thanks a lot,
> Giuseppe





[jira] [Created] (NUTCH-1960) JUnit test for dump method of CommonCrawlDataDumper

2015-03-11 Thread Giuseppe Totaro (JIRA)
Giuseppe Totaro created NUTCH-1960:
--

 Summary: JUnit test for dump method of CommonCrawlDataDumper
 Key: NUTCH-1960
 URL: https://issues.apache.org/jira/browse/NUTCH-1960
 Project: Nutch
  Issue Type: Test
Reporter: Giuseppe Totaro
Priority: Minor
 Fix For: 1.9
 Attachments: NUTCH-1960.patch

Hi all,
attached is the PATCH with an extremely simple JUnit test for the {{dump}} 
method of the {{CommonCrawlDataDumper}} class.
Essentially, it checks whether {{dump}} is able to create a given list of files from 
Nutch segments (in {{testresources}}).
Thanks a lot,
Giuseppe





[jira] [Updated] (NUTCH-1960) JUnit test for dump method of CommonCrawlDataDumper

2015-03-11 Thread Giuseppe Totaro (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Giuseppe Totaro updated NUTCH-1960:
---
Attachment: NUTCH-1960.patch

> JUnit test for dump method of CommonCrawlDataDumper
> ---
>
> Key: NUTCH-1960
> URL: https://issues.apache.org/jira/browse/NUTCH-1960
> Project: Nutch
>  Issue Type: Test
>Reporter: Giuseppe Totaro
>Priority: Minor
> Fix For: 1.9
>
> Attachments: NUTCH-1960.patch
>
>
> Hi all,
> attached is the PATCH with an extremely simple JUnit test for the {{dump}} 
> method of the {{CommonCrawlDataDumper}} class.
> Essentially, it checks whether {{dump}} is able to create a given list of files 
> from Nutch segments (in {{testresources}}).
> Thanks a lot,
> Giuseppe





[jira] [Updated] (NUTCH-1959) Improving CommonCrawlFormat implementations

2015-03-11 Thread Giuseppe Totaro (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Giuseppe Totaro updated NUTCH-1959:
---
Attachment: NUTCH-1959.patch

> Improving CommonCrawlFormat implementations
> ---
>
> Key: NUTCH-1959
> URL: https://issues.apache.org/jira/browse/NUTCH-1959
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.9
>Reporter: Giuseppe Totaro
>Priority: Minor
> Attachments: NUTCH-1959.patch
>
>
> {{CommonCrawlFormat}} is an interface for Java classes that implement methods 
> for writing data into Common Crawl format. {{AbstractCommonCrawlFormat}} is 
> an abstract class that implements {{CommonCrawlFormat}} and provides abstract 
> methods for "CommonCrawl formatter" classes.
> Attached is a PATCH that includes some improvements for 
> {{CommonCrawlFormat}}-based classes:
> * {{CommonCrawlFormat}} and {{AbstractCommonCrawlFormat}} now provide only 
> the {{getJsonData()}} method, responsible for returning the JSON data.
> * {{AbstractCommonCrawlFormat}} also provides the abstract methods that each 
> subclass has to implement in order to handle JSON objects.
> * {{CommonCrawlFormatSimple}} is a {{StringBuilder}}-based formatter that now 
> also provides escaping of JSON string values.
> This PATCH aims at providing a better interface for implementing/extending 
> {{CommonCrawlFormat}} classes.
> I would really appreciate your feedback.
> Thanks a lot,
> Giuseppe





[jira] [Created] (NUTCH-1959) Improving CommonCrawlFormat implementations

2015-03-11 Thread Giuseppe Totaro (JIRA)
Giuseppe Totaro created NUTCH-1959:
--

 Summary: Improving CommonCrawlFormat implementations
 Key: NUTCH-1959
 URL: https://issues.apache.org/jira/browse/NUTCH-1959
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.9
Reporter: Giuseppe Totaro
Priority: Minor


{{CommonCrawlFormat}} is an interface for Java classes that implement methods 
for writing data into Common Crawl format. {{AbstractCommonCrawlFormat}} is an 
abstract class that implements {{CommonCrawlFormat}} and provides abstract 
methods for "CommonCrawl formatter" classes.
Attached is a PATCH that includes some improvements for 
{{CommonCrawlFormat}}-based classes:
* {{CommonCrawlFormat}} and {{AbstractCommonCrawlFormat}} now provide only the 
{{getJsonData()}} method, responsible for returning the JSON data.
* {{AbstractCommonCrawlFormat}} also provides the abstract methods that each 
subclass has to implement in order to handle JSON objects.
* {{CommonCrawlFormatSimple}} is a {{StringBuilder}}-based formatter that now 
also provides escaping of JSON string values.
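
For readers unfamiliar with what escaping JSON string values involves in a {{StringBuilder}}-based formatter, the following is a minimal sketch. It is not the code from the attached patch; the class and method names are hypothetical, and it only covers the escapes that the JSON grammar requires (quote, backslash and control characters below 0x20).

```java
/** Hypothetical StringBuilder-based JSON string escaper; not the patch code. */
final class JsonEscapeSketch {

  /** Returns the value with JSON-mandated escapes applied, without outer quotes. */
  static String escape(String value) {
    StringBuilder sb = new StringBuilder(value.length() + 16);
    for (int i = 0; i < value.length(); i++) {
      char c = value.charAt(i);
      switch (c) {
        case '"':  sb.append("\\\""); break;
        case '\\': sb.append("\\\\"); break;
        case '\b': sb.append("\\b"); break;
        case '\f': sb.append("\\f"); break;
        case '\n': sb.append("\\n"); break;
        case '\r': sb.append("\\r"); break;
        case '\t': sb.append("\\t"); break;
        default:
          if (c < 0x20) {
            // Remaining control characters become a six-character hex escape.
            sb.append(String.format("\\u%04x", (int) c));
          } else {
            sb.append(c);
          }
      }
    }
    return sb.toString();
  }
}
```

A formatter would then wrap the result in double quotes when appending a string value, for example sb.append('"').append(JsonEscapeSketch.escape(title)).append('"'), where title is a placeholder variable.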

This PATCH aims at providing a better interface for implementing/extending 
{{CommonCrawlFormat}} classes.

I would really appreciate your feedback.
Thanks a lot,
Giuseppe





[jira] [Commented] (NUTCH-1957) FileDumper output file name collisions

2015-03-11 Thread Renxia Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14357147#comment-14357147
 ] 

Renxia Wang commented on NUTCH-1957:


Hi Giuseppe,

About the latter way, is it possible that the url contains special characters 
that cannot be used as part of a path/filename? If not, this way should work; 
however, it may complicate downstream processing, as the user would have to 
traverse all the paths to get the files, e.g. when posting the dump data to Solr. 

I am thinking of taking the MD5 of the file content and appending it to the end 
of the file basename, before the extension, like <basename>-<md5>.<extension>. 
Currently, the FileDumper uses the full path to the output file to calculate the 
MD5, but as the files are stored in the same dir, the MD5 may be the same, which 
still causes file name collisions. We may need to use the MD5 of the file 
content. 

As the FileDumper and the CommonCrawlDataDumper using the same way to store 
file, we can make this a util. 
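
A minimal sketch of the content-hash naming described above could look like the following. It is only an illustration: the class and method names are hypothetical, the base name and extension are split by hand so that the example needs nothing beyond the JDK, and it is not the FileDumper implementation.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical naming helper: base name, dash, MD5 of content, extension. */
final class ContentHashNameSketch {

  /** Builds the file name from the last URL path segment and the content digest. */
  static String name(String url, byte[] content) {
    String last = url.substring(url.lastIndexOf('/') + 1);
    int dot = last.lastIndexOf('.');
    String base = dot < 0 ? last : last.substring(0, dot);
    String ext = dot < 0 ? "" : last.substring(dot); // keeps the leading dot
    return base + "-" + md5Hex(content) + ext;
  }

  /** Returns the lowercase hex MD5 digest of the given bytes. */
  static String md5Hex(byte[] data) {
    try {
      byte[] digest = MessageDigest.getInstance("MD5").digest(data);
      StringBuilder hex = new StringBuilder(digest.length * 2);
      for (byte b : digest) {
        hex.append(String.format("%02x", b & 0xff));
      }
      return hex.toString();
    } catch (NoSuchAlgorithmException e) {
      // Every Java platform is required to provide MD5.
      throw new IllegalStateException(e);
    }
  }
}
```

With this scheme, two project.html pages with different content get different names, while two URLs serving byte-identical content would still map to the same name.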

Thanks,

Renxia

> FileDumper output file name collisions
> --
>
> Key: NUTCH-1957
> URL: https://issues.apache.org/jira/browse/NUTCH-1957
> Project: Nutch
>  Issue Type: Bug
>  Components: tool
>Affects Versions: 1.10
>Reporter: Renxia Wang
>Priority: Minor
>  Labels: dumper, filename, tools
>
> The FileDumper extracts the file base name and extension and uses 
> <baseName>.<extension> (e.g. given the url 
> https://www.aoncadis.org/contact/Yarrow%20Axford/project.html, the 
> <baseName>.<extension> will be project.html) as the file name to dump the 
> file. 
> Code from FileDumper.java: 
> String url = key.toString();
> String baseName = FilenameUtils.getBaseName(url);
> String extension = FilenameUtils.getExtension(url);
> ...
> String filename = baseName + "." + extension;
> This introduces file name collisions and leads to loss of data when using 
> bin/nutch dump. 
> Sample logs:
> 2015-03-10 23:38:01,192 INFO  tools.FileDumper - Dumping URL: 
> http://beringsea.eol.ucar.edu/data/
> 2015-03-10 23:38:01,193 INFO  tools.FileDumper - Skipping writing: 
> [testFileName/.html]: file already exists
> 2015-03-10 23:38:16,717 INFO  tools.FileDumper - Dumping URL: 
> http://catalog.eol.ucar.edu/
> 2015-03-10 23:38:16,719 INFO  tools.FileDumper - Skipping writing: 
> [testFileName/.html]: file already exists
> 2015-03-10 23:38:46,411 INFO  tools.FileDumper - Dumping URL: 
> https://www.aoncadis.org/contact/Carin%20Ashjian/project.html
> 2015-03-10 23:38:46,411 INFO  tools.FileDumper - Skipping writing: 
> [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,411 INFO  tools.FileDumper - Dumping URL: 
> https://www.aoncadis.org/contact/Christopher%20Arp/project.html
> 2015-03-10 23:38:46,411 INFO  tools.FileDumper - Skipping writing: 
> [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,411 INFO  tools.FileDumper - Dumping URL: 
> https://www.aoncadis.org/contact/Dr.%20Knut%20Aagaard/project.html
> 2015-03-10 23:38:46,412 INFO  tools.FileDumper - Skipping writing: 
> [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,412 INFO  tools.FileDumper - Dumping URL: 
> https://www.aoncadis.org/contact/Eric%20C.%20Apel/project.html
> 2015-03-10 23:38:46,412 INFO  tools.FileDumper - Skipping writing: 
> [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,412 INFO  tools.FileDumper - Dumping URL: 
> https://www.aoncadis.org/contact/John%20T.%20Andrews/project.html
> 2015-03-10 23:38:46,412 INFO  tools.FileDumper - Skipping writing: 
> [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,412 INFO  tools.FileDumper - Dumping URL: 
> https://www.aoncadis.org/contact/Juha%20Alatalo/project.html
> 2015-03-10 23:38:46,413 INFO  tools.FileDumper - Skipping writing: 
> [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,413 INFO  tools.FileDumper - Dumping URL: 
> https://www.aoncadis.org/contact/Kerim%20Aydin/project.html
> 2015-03-10 23:38:46,413 INFO  tools.FileDumper - Skipping writing: 
> [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,413 INFO  tools.FileDumper - Dumping URL: 
> https://www.aoncadis.org/contact/Knut%20Aagaard/project.html
> 2015-03-10 23:38:46,413 INFO  tools.FileDumper - Skipping writing: 
> [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,413 INFO  tools.FileDumper - Dumping URL: 
> https://www.aoncadis.org/contact/Mary%20Albert/project.html
> 2015-03-10 23:38:46,414 INFO  tools.FileDumper - Skipping writing: 
> [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,414 INFO  tools.FileDumper - Dumping URL: 
> https://www.aoncadis.org/contact/Yarrow%20Axford/project.html





[jira] [Commented] (NUTCH-1957) FileDumper output file name collisions

2015-03-11 Thread Giuseppe Totaro (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14357127#comment-14357127
 ] 

Giuseppe Totaro commented on NUTCH-1957:


Hi [~zhique], I agree with your description. With this file-naming scheme, 
collisions can occur. If two or more files have the same basename but 
different pathnames, only the first file is written, because all 
deserialized files are placed in the same outputDir folder. Currently, the 
CommonCrawlDataDumper tool works in the same way.
I am working on a fix in the CommonCrawlDataDumper tool (the same problem exists 
in FileDumper). We can use either a unique "key" value as the filename (but it 
could be very long) or the same structure/hierarchy as the input. In the latter 
case, each output file has the same pathname as the original one.
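
For illustration, the two naming options described above could be sketched roughly as follows. This is only a sketch under stated assumptions, not the code in FileDumper or CommonCrawlDataDumper: the class OutputNaming, both method names, and the index.html placeholder for directory-style URLs are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLEncoder;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of Nutch.
class OutputNaming {

  // Option 1: use the whole key (the URL) as the file name. Percent-encoding
  // removes '/' and ':' so the result is a single path segment; as noted
  // above, such names can become very long.
  static String keyAsFileName(String url) {
    try {
      return URLEncoder.encode(url, "UTF-8");
    } catch (UnsupportedEncodingException e) {
      throw new IllegalStateException("UTF-8 is always supported", e);
    }
  }

  // Option 2: mirror the input hierarchy as <outputDir>/<host>/<path>.
  // Assumes an absolute http(s) URL with a host component.
  static Path mirroredPath(Path outputDir, String url) {
    URI uri = URI.create(url);
    String path = uri.getRawPath(); // raw form keeps escapes such as %20
    if (path == null || path.isEmpty()) {
      path = "/";
    }
    if (path.endsWith("/")) {
      path += "index.html"; // assumption: placeholder name for directory URLs
    }
    Path base = outputDir.normalize();
    Path target = base.resolve(uri.getHost()).resolve(path.substring(1)).normalize();
    if (!target.startsWith(base)) {
      throw new IllegalArgumentException("URL escapes output directory: " + url);
    }
    return target;
  }

  public static void main(String[] args) {
    String a = "https://www.aoncadis.org/contact/Mary%20Albert/project.html";
    String b = "https://www.aoncadis.org/contact/Yarrow%20Axford/project.html";
    Path out = Paths.get("testFileName");
    System.out.println(keyAsFileName(a));
    System.out.println(mirroredPath(out, a));
    System.out.println(mirroredPath(out, b));
  }
}
```

With the second option the two project.html pages from the sample logs end up in different directories, so neither is skipped; the first option keeps a flat output folder at the cost of long, encoded names.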
Please give your feedback.
Thank you,
Giuseppe

> FileDumper output file name collisions
> --
>
> Key: NUTCH-1957
> URL: https://issues.apache.org/jira/browse/NUTCH-1957
> Project: Nutch
>  Issue Type: Bug
>  Components: tool
>Affects Versions: 1.10
>Reporter: Renxia Wang
>Priority: Minor
>  Labels: dumper, filename, tools
>
> The FileDumper extracts the file base name and extension and uses 
> <basename>.<extension> (e.g. given the URL 
> https://www.aoncadis.org/contact/Yarrow%20Axford/project.html, the 
> <basename>.<extension> will be project.html) as the file name to dump the 
> file. 
> Code from FileDumper.java: 
> String url = key.toString();
> String baseName = FilenameUtils.getBaseName(url);
> String extension = FilenameUtils.getExtension(url);
> ...
> String filename = baseName + "." + extension;
> This introduces file name collisions and leads to loss of data when using 
> bin/nutch dump. 
> Sample logs:
> 2015-03-10 23:38:01,192 INFO  tools.FileDumper - Dumping URL: 
> http://beringsea.eol.ucar.edu/data/
> 2015-03-10 23:38:01,193 INFO  tools.FileDumper - Skipping writing: 
> [testFileName/.html]: file already exists
> 2015-03-10 23:38:16,717 INFO  tools.FileDumper - Dumping URL: 
> http://catalog.eol.ucar.edu/
> 2015-03-10 23:38:16,719 INFO  tools.FileDumper - Skipping writing: 
> [testFileName/.html]: file already exists
> 2015-03-10 23:38:46,411 INFO  tools.FileDumper - Dumping URL: 
> https://www.aoncadis.org/contact/Carin%20Ashjian/project.html
> 2015-03-10 23:38:46,411 INFO  tools.FileDumper - Skipping writing: 
> [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,411 INFO  tools.FileDumper - Dumping URL: 
> https://www.aoncadis.org/contact/Christopher%20Arp/project.html
> 2015-03-10 23:38:46,411 INFO  tools.FileDumper - Skipping writing: 
> [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,411 INFO  tools.FileDumper - Dumping URL: 
> https://www.aoncadis.org/contact/Dr.%20Knut%20Aagaard/project.html
> 2015-03-10 23:38:46,412 INFO  tools.FileDumper - Skipping writing: 
> [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,412 INFO  tools.FileDumper - Dumping URL: 
> https://www.aoncadis.org/contact/Eric%20C.%20Apel/project.html
> 2015-03-10 23:38:46,412 INFO  tools.FileDumper - Skipping writing: 
> [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,412 INFO  tools.FileDumper - Dumping URL: 
> https://www.aoncadis.org/contact/John%20T.%20Andrews/project.html
> 2015-03-10 23:38:46,412 INFO  tools.FileDumper - Skipping writing: 
> [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,412 INFO  tools.FileDumper - Dumping URL: 
> https://www.aoncadis.org/contact/Juha%20Alatalo/project.html
> 2015-03-10 23:38:46,413 INFO  tools.FileDumper - Skipping writing: 
> [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,413 INFO  tools.FileDumper - Dumping URL: 
> https://www.aoncadis.org/contact/Kerim%20Aydin/project.html
> 2015-03-10 23:38:46,413 INFO  tools.FileDumper - Skipping writing: 
> [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,413 INFO  tools.FileDumper - Dumping URL: 
> https://www.aoncadis.org/contact/Knut%20Aagaard/project.html
> 2015-03-10 23:38:46,413 INFO  tools.FileDumper - Skipping writing: 
> [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,413 INFO  tools.FileDumper - Dumping URL: 
> https://www.aoncadis.org/contact/Mary%20Albert/project.html
> 2015-03-10 23:38:46,414 INFO  tools.FileDumper - Skipping writing: 
> [testFileName/project.html]: file already exists
> 2015-03-10 23:38:46,414 INFO  tools.FileDumper - Dumping URL: 
> https://www.aoncadis.org/contact/Yarrow%20Axford/project.html





Re: Google Summer of Code 2015 Mentor Registration

2015-03-11 Thread Talat Uyarer
+1.

2015-03-06 22:01 GMT+02:00 Lewis John Mcgibbney :
> Nutch PMC,
> Please acknowledge my request to become a mentor for Google Summer of Code
> 2015 projects for Apache
> Nutch.
>
> My Melange username is lewismc.
>
>
> -- Forwarded message --
> From: Ulrich Stärk 
> Date: Fri, Mar 6, 2015 at 11:32 AM
> Subject: Google Summer of Code 2015 Mentor Registration
> To: ment...@community.apache.org
>
>
> Dear PMCs,
>
> I'm happy to announce that the ASF has made it onto the list of 137 accepted
> organizations for
> Google Summer of Code 2015! [1,2]
>
> It is now time for the mentors to sign up, so please pass this email on to
> your community and
> podlings. If you aren’t already subscribed to ment...@community.apache.org
> you should do so now else
> you might miss important information.
>
> Mentor signup requires two steps: mentor signup in Melange and PMC
> acknowledgement.
>
> If you want to mentor a project in this year's SoC you will have to
>
> 1. Be an Apache committer.
> 2. Register with Melange and set up a profile [3].
> 3. Add your username (formerly known as link_id) to [4]. This is NOT your
> email address but your
> Melange username. You can find it at the top of any page once you are logged
> in.
> 4. Request an acknowledgement from the PMC for which you want to mentor
> projects. Use the below
> template and do not forget to copy ment...@community.apache.org.
> 5. Once a PMC member acknowledges the request to mentor, and only then, go
> to [5] and send a
> connection request.
>
> PMCs, read carefully please.
>
> We request that each mentor is acknowledged by a PMC member. This is to
> ensure the mentor is in good
> standing with the community. When you receive a request for acknowledgement,
> please ACK it and cc
> ment...@community.apache.org
>
> Lastly, it is not yet too late to record your ideas in Jira (see my previous
> emails for details).
> Students will now begin to explore ideas so if you haven’t already done so,
> record your ideas
> immediately!
>
> Cheers,
>
> Uli
>
> mentor request email template:
> 
> to: private@<project>.apache.org
> cc: ment...@community.apache.org
> subject: GSoC 2015 mentor request for 
>
> <project> PMC,
>
> please acknowledge my request to become a mentor for Google Summer of Code
> 2015 projects for Apache
> <project>.
>
> My Melange username is <username>.
>
> 
>
> 
>
> [1] http://www.google-melange.com/gsoc/org/list/public/google/gsoc2015
> [2] http://www.google-melange.com/gsoc/org2/google/gsoc2015/apache
> [3] http://www.google-melange.com/gsoc/homepage/google/gsoc2015
> [4] https://svn.apache.org/repos/private/committers/GsocLinkId.txt
> [5]
> http://www.google-melange.com/gsoc/connection/start/user/google/gsoc2015/apache
>
>
>
> --
> Lewis



-- 
Talat UYARER
Website: http://talat.uyarer.com
Twitter: http://twitter.com/talatuyarer
Linkedin: http://tr.linkedin.com/pub/talat-uyarer/10/142/304


Google Summer of Code 2015 Mentor Registration

2015-03-11 Thread Talat Uyarer
Nutch PMC,

Please acknowledge my request to become a mentor for Google Summer of
Code 2015 projects for Apache
Nutch.

My Melange username is talat.

-- 
Talat UYARER
Website: http://talat.uyarer.com
Twitter: http://twitter.com/talatuyarer
Linkedin: http://tr.linkedin.com/pub/talat-uyarer/10/142/304


[jira] [Commented] (NUTCH-1932) Automatically remove orphaned pages

2015-03-11 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14356711#comment-14356711
 ] 

Markus Jelsma commented on NUTCH-1932:
--

Hm yes, I've thought about using a scoring filter too. However, we do need some 
code in CrawlDbReducer.reduce() because in the end we want to completely remove 
the record from the CrawlDB. A work-around, maybe not elegant but useful, would 
be to pass the CrawlDatum to URL filtering and normalizing.
We have some other Nutch jobs that would benefit from having a method signature 
like normalize(String url, CrawlDatum datum, String scope); the same is true for 
filter.
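
To make the idea concrete, a datum-aware filter and normalizer might look roughly like the sketch below. All names here are hypothetical and none of this is existing Nutch API: DatumInfo is a minimal stand-in for CrawlDatum, and how a record gets flagged as orphaned is deliberately left out. The sketch follows the convention of the existing URL filters, where returning null means the URL is rejected.

```java
import java.net.MalformedURLException;

// Hypothetical sketch; these types do not exist in Nutch under these names.
class DatumAwareFilteringSketch {

  // Minimal stand-in for CrawlDatum, reduced to the one fact the example
  // needs. How "orphaned" is decided is outside the scope of this sketch.
  static final class DatumInfo {
    final boolean orphaned;

    DatumInfo(boolean orphaned) {
      this.orphaned = orphaned;
    }
  }

  // Normalizer variant that also receives the datum, following the
  // signature suggested in the comment above.
  interface DatumAwareNormalizer {
    String normalize(String url, DatumInfo datum, String scope)
        throws MalformedURLException;
  }

  // Filter variant that also receives the datum.
  interface DatumAwareFilter {
    // Returns the URL to keep the record, or null to drop it.
    String filter(String url, DatumInfo datum);
  }

  // Example filter: drop every record whose datum is flagged as orphaned.
  static final class OrphanFilter implements DatumAwareFilter {
    @Override
    public String filter(String url, DatumInfo datum) {
      return datum.orphaned ? null : url;
    }
  }

  public static void main(String[] args) {
    DatumAwareFilter filter = new OrphanFilter();
    System.out.println(filter.filter("http://example.org/old", new DatumInfo(true)));
    System.out.println(filter.filter("http://example.org/live", new DatumInfo(false)));
  }
}
```

A job that already runs URL filters could then remove a record simply because a filter returned null for it, without needing orphan-specific logic of its own.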

> Automatically remove orphaned pages
> ---
>
> Key: NUTCH-1932
> URL: https://issues.apache.org/jira/browse/NUTCH-1932
> Project: Nutch
>  Issue Type: New Feature
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.11
>
> Attachments: NUTCH-1932.patch
>
>
> Nutch should be able to automatically remove orphaned pages such as old 
> 404's, and not continue to revisit them. This requires NUTCH-1913. An inlink 
> count of 1 is enough.





[jira] [Created] (NUTCH-1957) FileDumper output file name collisions

2015-03-11 Thread Renxia Wang (JIRA)
Renxia Wang created NUTCH-1957:
--

 Summary: FileDumper output file name collisions
 Key: NUTCH-1957
 URL: https://issues.apache.org/jira/browse/NUTCH-1957
 Project: Nutch
  Issue Type: Bug
  Components: tool
Affects Versions: 1.10
Reporter: Renxia Wang
Priority: Minor


The FileDumper extracts the file base name and extension and uses 
<basename>.<extension> (e.g. given the URL 
https://www.aoncadis.org/contact/Yarrow%20Axford/project.html, the 
<basename>.<extension> will be project.html) as the file name to dump the file. 

Code from FileDumper.java: 

String url = key.toString();
String baseName = FilenameUtils.getBaseName(url);
String extension = FilenameUtils.getExtension(url);
...
String filename = baseName + "." + extension;

This introduces file name collisions and leads to loss of data when using 
bin/nutch dump. 
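
The collision can be reproduced outside Nutch with a small sketch. The class below is hypothetical and uses a simplified stand-in for the FilenameUtils calls (it keeps only the text after the last '/' of the URL), which is enough to show why distinct URLs map to the same output name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical demo class, not part of Nutch.
class FileNameCollisionDemo {

  // Simplified stand-in for the naming logic quoted above: keep only the
  // last path segment of the URL and discard the rest.
  static String flatName(String url) {
    return url.substring(url.lastIndexOf('/') + 1);
  }

  // Groups URLs by the flat file name they would be dumped to.
  static Map<String, List<String>> groupByFlatName(List<String> urls) {
    Map<String, List<String>> groups = new LinkedHashMap<>();
    for (String url : urls) {
      groups.computeIfAbsent(flatName(url), k -> new ArrayList<>()).add(url);
    }
    return groups;
  }

  public static void main(String[] args) {
    List<String> urls = Arrays.asList(
        "https://www.aoncadis.org/contact/Carin%20Ashjian/project.html",
        "https://www.aoncadis.org/contact/Mary%20Albert/project.html",
        "https://www.aoncadis.org/contact/Yarrow%20Axford/project.html");
    // Any name with more than one source URL is a collision: only the
    // first file is written and the others are skipped.
    groupByFlatName(urls).forEach((name, sources) ->
        System.out.println(name + " <- " + sources.size() + " URL(s)"));
  }
}
```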

Sample logs:
2015-03-10 23:38:01,192 INFO  tools.FileDumper - Dumping URL: 
http://beringsea.eol.ucar.edu/data/
2015-03-10 23:38:01,193 INFO  tools.FileDumper - Skipping writing: 
[testFileName/.html]: file already exists
2015-03-10 23:38:16,717 INFO  tools.FileDumper - Dumping URL: 
http://catalog.eol.ucar.edu/
2015-03-10 23:38:16,719 INFO  tools.FileDumper - Skipping writing: 
[testFileName/.html]: file already exists

2015-03-10 23:38:46,411 INFO  tools.FileDumper - Dumping URL: 
https://www.aoncadis.org/contact/Carin%20Ashjian/project.html
2015-03-10 23:38:46,411 INFO  tools.FileDumper - Skipping writing: 
[testFileName/project.html]: file already exists
2015-03-10 23:38:46,411 INFO  tools.FileDumper - Dumping URL: 
https://www.aoncadis.org/contact/Christopher%20Arp/project.html
2015-03-10 23:38:46,411 INFO  tools.FileDumper - Skipping writing: 
[testFileName/project.html]: file already exists
2015-03-10 23:38:46,411 INFO  tools.FileDumper - Dumping URL: 
https://www.aoncadis.org/contact/Dr.%20Knut%20Aagaard/project.html
2015-03-10 23:38:46,412 INFO  tools.FileDumper - Skipping writing: 
[testFileName/project.html]: file already exists
2015-03-10 23:38:46,412 INFO  tools.FileDumper - Dumping URL: 
https://www.aoncadis.org/contact/Eric%20C.%20Apel/project.html
2015-03-10 23:38:46,412 INFO  tools.FileDumper - Skipping writing: 
[testFileName/project.html]: file already exists
2015-03-10 23:38:46,412 INFO  tools.FileDumper - Dumping URL: 
https://www.aoncadis.org/contact/John%20T.%20Andrews/project.html
2015-03-10 23:38:46,412 INFO  tools.FileDumper - Skipping writing: 
[testFileName/project.html]: file already exists
2015-03-10 23:38:46,412 INFO  tools.FileDumper - Dumping URL: 
https://www.aoncadis.org/contact/Juha%20Alatalo/project.html
2015-03-10 23:38:46,413 INFO  tools.FileDumper - Skipping writing: 
[testFileName/project.html]: file already exists
2015-03-10 23:38:46,413 INFO  tools.FileDumper - Dumping URL: 
https://www.aoncadis.org/contact/Kerim%20Aydin/project.html
2015-03-10 23:38:46,413 INFO  tools.FileDumper - Skipping writing: 
[testFileName/project.html]: file already exists
2015-03-10 23:38:46,413 INFO  tools.FileDumper - Dumping URL: 
https://www.aoncadis.org/contact/Knut%20Aagaard/project.html
2015-03-10 23:38:46,413 INFO  tools.FileDumper - Skipping writing: 
[testFileName/project.html]: file already exists
2015-03-10 23:38:46,413 INFO  tools.FileDumper - Dumping URL: 
https://www.aoncadis.org/contact/Mary%20Albert/project.html
2015-03-10 23:38:46,414 INFO  tools.FileDumper - Skipping writing: 
[testFileName/project.html]: file already exists
2015-03-10 23:38:46,414 INFO  tools.FileDumper - Dumping URL: 
https://www.aoncadis.org/contact/Yarrow%20Axford/project.html


