Hi Lewis,

> 2) post-processing (Nutch) Hadoop sequence data by converting it to Parquet format?

Yes, but not directly - it's a multi-step process. The outcome:
https://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/

This Parquet index is optimized by sorting the rows by a special form of the URL [1] which
- drops the protocol or scheme
- reverses the host name and
- puts it in front of the remaining URL parts (path and query)
- with some additional normalization of path and query (e.g. sorting of query params)

One example:
  https://example.com/path/search?q=foo&l=en
becomes
  com,example)/path/search?l=en&q=foo

The SURT URL is similar to the URL format used by Nutch2
  com.example/https/path/search?q=foo&l=en
to address rows in the WebPage table [2]. This format is inspired by the BigTable paper [3]. The point is that URLs of the same host and domain are stored close together, which improves compression and allows efficient lookups by host or domain prefix, cf. [4].
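Just to make the transformation concrete, a minimal sketch in plain Java (illustrative only - not the canonical SURT implementation [1], which covers many more normalization cases such as lowercasing, ports and "www." stripping) could look like this:

  import java.net.URI;
  import java.util.Arrays;
  import java.util.Collections;

  /** Illustrative only: derive a SURT-like sort key from a URL. */
  public class SurtKeySketch {

    public static String surtKey(String url) throws Exception {
      URI u = new URI(url);

      // reverse the host name: example.com -> com,example
      String[] hostParts = u.getHost().split("\\.");
      Collections.reverse(Arrays.asList(hostParts));
      String reversedHost = String.join(",", hostParts);

      // normalize the query by sorting its parameters
      String query = "";
      if (u.getRawQuery() != null) {
        String[] params = u.getRawQuery().split("&");
        Arrays.sort(params);
        query = "?" + String.join("&", params);
      }

      // the scheme is dropped, the reversed host goes in front of path and query
      String path = (u.getRawPath() == null || u.getRawPath().isEmpty()) ? "/" : u.getRawPath();
      return reversedHost + ")" + path + query;
    }

    public static void main(String[] args) throws Exception {
      // prints: com,example)/path/search?l=en&q=foo
      System.out.println(surtKey("https://example.com/path/search?q=foo&l=en"));
    }
  }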
Ok, back to the question: both 1) and 2) are trivial if you do not care about writing optimal Parquet files: just define a schema following the fields read and written by the methods implementing the Writable interface, and convert.
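As a rough sketch (assuming Spark, the CrawlDb as input, and only a handful of CrawlDatum fields in the schema - adapt field names and types to your needs), such a conversion could look like:

  import org.apache.hadoop.io.Text;
  import org.apache.nutch.crawl.CrawlDatum;
  import org.apache.spark.api.java.JavaRDD;
  import org.apache.spark.api.java.JavaSparkContext;
  import org.apache.spark.sql.Dataset;
  import org.apache.spark.sql.Row;
  import org.apache.spark.sql.RowFactory;
  import org.apache.spark.sql.SparkSession;
  import org.apache.spark.sql.types.DataTypes;
  import org.apache.spark.sql.types.StructType;

  public class CrawlDbToParquet {
    public static void main(String[] args) {
      String crawlDbPath = args[0];   // path to the CrawlDb data
      String parquetOut  = args[1];

      SparkSession spark = SparkSession.builder()
          .appName("CrawlDbToParquet").getOrCreate();
      JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

      // Schema derived from a few CrawlDatum fields (extend as needed).
      StructType schema = new StructType()
          .add("url", DataTypes.StringType)
          .add("status", DataTypes.IntegerType)
          .add("fetch_time", DataTypes.LongType)
          .add("score", DataTypes.FloatType);

      // The Nutch job jar must be on the classpath so that the Writable
      // value class (CrawlDatum) can be deserialized.
      JavaRDD<Row> rows = jsc
          .sequenceFile(crawlDbPath, Text.class, CrawlDatum.class)
          .map(kv -> RowFactory.create(
              kv._1().toString(),
              (int) kv._2().getStatus(),
              kv._2().getFetchTime(),
              kv._2().getScore()));

      Dataset<Row> df = spark.createDataFrame(rows, schema);

      // Sorting (e.g. by URL or a SURT-style key) before writing gives
      // better compression and row-group pruning in the resulting Parquet.
      df.sort("url").write().parquet(parquetOut);

      spark.stop();
    }
  }

The same pattern should work for other Nutch data (e.g. segments), just with the respective Writable classes and schema fields.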
Parquet is easier to feed into various data processing systems because it integrates the schema. The Sequence file format requires that the Writable classes are provided - although Spark and other big data tools support Sequence files, this requirement is sometimes a blocker, also because Nutch does not ship a small "nutch-formats" jar.

Nevertheless, the price for Parquet is slower writing - which is ok for write-once-read-many use cases. But the typical use case for Nutch is "write-once-read-twice":
- segment: read for CrawlDb update and indexing
- CrawlDb: read during update and then replaced, in some cycles read for deduplication, statistics, etc.

Lewis, I'd be really interested to hear what your particular use case is - also because at Common Crawl we plan to provide more data in the Parquet format: page metadata, links and text dumps. Storing URLs and web page metadata efficiently was part of the motivation for Dremel [5], which in turn inspired Parquet [6].

Best,
Sebastian

[1] https://github.com/internetarchive/surt
[2] https://cwiki.apache.org/confluence/display/NUTCH/Nutch2Crawling#Nutch2Crawling-Introduction
[3] https://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf
[4] https://cloud.google.com/bigtable/docs/schema-design#domain-names
[5] https://research.google/pubs/pub36632/
[6] https://blog.twitter.com/engineering/en_us/a/2013/dremel-made-simple-with-parquet.html

On 5/4/21 11:14 PM, Lewis John McGibbney wrote:
> Hi user@,
> Has anyone experimented/accomplished either
> 1) writing Nutch data directly as Parquet format, or
> 2) post-processing (Nutch) Hadoop sequence data by converting it to Parquet format?
> Thank you
> lewismc

