Hi,

does anybody have a recommendation for a free and production-ready Hadoop setup?

- HDFS + YARN
- run Nutch as well as other MapReduce and Spark-on-YARN jobs
- with native library support: libhadoop.so and compression
  libs (bzip2, zstd, snappy)
- must run on AWS EC2 instances, including smaller ones
  (2 vCPUs, 16 GiB RAM), and read from / write to S3
- ideally,
  - Hadoop 3.3.0
  - Java 11 and
  - support for running on ARM machines

So far, Common Crawl has been using Cloudera CDH, but since there are
no free updates anymore, we are considering either switching to
Amazon EMR, buying a Cloudera subscription, or using vanilla Hadoop
(especially since only HDFS and YARN are required).

A dockerized setup is also an option (at least for development and
testing). So far, I've looked at [1]; the upgrade to Hadoop 3.3.0
was straightforward [2], but native library support is still missing.
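
To make "native library support" checkable, one option is to parse the
report printed by `hadoop checknative -a`. Below is a minimal Python
sketch, assuming the report uses lines of the form `name: true|false
[path]`. The sample text, its paths, and the helper names
(parse_checknative, missing) are made up for illustration and are not
taken from any particular image.

```python
import re

# Illustrative sample in the shape `hadoop checknative -a` prints.
# The paths and the true/false values are invented for this sketch.
SAMPLE = """\
Native library checking:
hadoop:  true /opt/hadoop/lib/native/libhadoop.so.1.0.0
zlib:    true /lib/x86_64-linux-gnu/libz.so.1
zstd  :  false
snappy:  false
bzip2:   true /lib/x86_64-linux-gnu/libbz2.so.1
"""

# Matches "<name> : true|false" at the start of a line; other lines
# (the header, log output) are ignored.
LINE = re.compile(r"^\s*(\w+)\s*:\s*(true|false)\b", re.IGNORECASE)


def parse_checknative(text):
    """Return a dict mapping library name -> True if reported as loaded."""
    status = {}
    for line in text.splitlines():
        m = LINE.match(line)
        if m:
            status[m.group(1).lower()] = m.group(2).lower() == "true"
    return status


def missing(text, required=("hadoop", "bzip2", "zstd", "snappy")):
    """Return the required libraries that are absent or reported as false."""
    status = parse_checknative(text)
    return [name for name in required if not status.get(name, False)]


if __name__ == "__main__":
    print(missing(SAMPLE))
```

On a real node, the text could be obtained with
subprocess.run(["hadoop", "checknative", "-a"], capture_output=True,
text=True) and the captured stdout passed to missing().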

Thanks,
Sebastian

[1] https://github.com/big-data-europe/docker-hadoop
[2] https://github.com/sebastian-nagel/docker-hadoop/tree/2.0.0-hadoop3.3.0-java11
