Hi Lewis,

> 2) post-processing (Nutch) Hadoop sequence data by converting it to Parquet format?

Yes, but not directly - it's a multi-step process. The outcome:
https://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/

This Parquet index is optimized by sorting the rows by a special form of the URL [1] which
- drops the protocol or scheme
- reverses the host name and
- puts it in front of the remaining URL parts (path and query)
- with some additional normalization of path and query (e.g. sorting of query params)

One example:
  https://example.com/path/search?q=foo&l=en
becomes
  com,example)/path/search?l=en&q=foo

The SURT URL is similar to the URL format used by Nutch2
  com.example/https/path/search?q=foo&l=en
to address rows in the WebPage table [2]. This format is inspired by the BigTable paper [3]. The point is that URLs of the same host and domain are stored close together, which improves compression and allows efficient lookups by host or domain prefix, cf. [4].
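Just to make the transformation concrete, a minimal sketch in plain Java (illustrative only - not the canonical SURT implementation [1], which covers many more normalization cases such as lowercasing, ports and "www." stripping) could look like this:

  import java.net.URI;
  import java.util.Arrays;
  import java.util.Collections;

  /** Illustrative only: derive a SURT-like sort key from a URL. */
  public class SurtKeySketch {

    public static String surtKey(String url) throws Exception {
      URI u = new URI(url);

      // reverse the host name: example.com -> com,example
      String[] hostParts = u.getHost().split("\\.");
      Collections.reverse(Arrays.asList(hostParts));
      String reversedHost = String.join(",", hostParts);

      // normalize the query by sorting its parameters
      String query = "";
      if (u.getRawQuery() != null) {
        String[] params = u.getRawQuery().split("&");
        Arrays.sort(params);
        query = "?" + String.join("&", params);
      }

      // the scheme is dropped, the reversed host goes in front of path and query
      String path = (u.getRawPath() == null || u.getRawPath().isEmpty()) ? "/" : u.getRawPath();
      return reversedHost + ")" + path + query;
    }

    public static void main(String[] args) throws Exception {
      // prints: com,example)/path/search?l=en&q=foo
      System.out.println(surtKey("https://example.com/path/search?q=foo&l=en"));
    }
  }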
Ok, back to the question: both 1) and 2) are trivial if you do not care about writing optimal Parquet files: just define a schema following the fields read and written by the methods implementing the Writable interface, and convert.
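As a rough sketch (assuming Spark, the CrawlDb as input, and only a handful of CrawlDatum fields in the schema - adapt field names and types to your needs), such a conversion could look like:

  import org.apache.hadoop.io.Text;
  import org.apache.nutch.crawl.CrawlDatum;
  import org.apache.spark.api.java.JavaRDD;
  import org.apache.spark.api.java.JavaSparkContext;
  import org.apache.spark.sql.Dataset;
  import org.apache.spark.sql.Row;
  import org.apache.spark.sql.RowFactory;
  import org.apache.spark.sql.SparkSession;
  import org.apache.spark.sql.types.DataTypes;
  import org.apache.spark.sql.types.StructType;

  public class CrawlDbToParquet {
    public static void main(String[] args) {
      String crawlDbPath = args[0];   // path to the CrawlDb data
      String parquetOut  = args[1];

      SparkSession spark = SparkSession.builder()
          .appName("CrawlDbToParquet").getOrCreate();
      JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

      // Schema derived from a few CrawlDatum fields (extend as needed).
      StructType schema = new StructType()
          .add("url", DataTypes.StringType)
          .add("status", DataTypes.IntegerType)
          .add("fetch_time", DataTypes.LongType)
          .add("score", DataTypes.FloatType);

      // The Nutch job jar must be on the classpath so that the Writable
      // value class (CrawlDatum) can be deserialized.
      JavaRDD<Row> rows = jsc
          .sequenceFile(crawlDbPath, Text.class, CrawlDatum.class)
          .map(kv -> RowFactory.create(
              kv._1().toString(),
              (int) kv._2().getStatus(),
              kv._2().getFetchTime(),
              kv._2().getScore()));

      Dataset<Row> df = spark.createDataFrame(rows, schema);

      // Sorting (e.g. by URL or a SURT-style key) before writing gives
      // better compression and row-group pruning in the resulting Parquet.
      df.sort("url").write().parquet(parquetOut);

      spark.stop();
    }
  }

The same pattern should work for other Nutch data (e.g. segments), just with the respective Writable classes and schema fields.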
Parquet is easier to feed into various data processing systems because it integrates the schema. The Sequence file format requires that the Writable classes are provided - although Spark and other big data tools support Sequence files, this requirement is sometimes a blocker, also because Nutch does not ship a small "nutch-formats" jar.

Nevertheless, the price for Parquet is slower writing - which is ok for write-once-read-many use cases. But the typical use case for Nutch is "write-once-read-twice":
- segment: read for CrawlDb update and indexing
- CrawlDb: read during update and then replaced, in some cycles read for deduplication, statistics, etc.

Lewis, I'd be really interested to hear what your particular use case is - also because at Common Crawl we plan to provide more data in the Parquet format: page metadata, links and text dumps. Storing URLs and web page metadata efficiently was part of the motivation for Dremel [5], which in turn inspired Parquet [6].

Best,
Sebastian

[1] https://github.com/internetarchive/surt
[2] https://cwiki.apache.org/confluence/display/NUTCH/Nutch2Crawling#Nutch2Crawling-Introduction
[3] https://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf
[4] https://cloud.google.com/bigtable/docs/schema-design#domain-names
[5] https://research.google/pubs/pub36632/
[6] https://blog.twitter.com/engineering/en_us/a/2013/dremel-made-simple-with-parquet.html

On 5/4/21 11:14 PM, Lewis John McGibbney wrote:
> Hi user@,
> Has anyone experimented/accomplished either
> 1) writing Nutch data directly as Parquet format, or
> 2) post-processing (Nutch) Hadoop sequence data by converting it to Parquet format?
> Thank you
> lewismc

