Hi, does anybody have a recommendation for a free and production-ready Hadoop setup?
- HDFS + YARN
- run Nutch but also other MapReduce and Spark-on-YARN jobs
- with native library support: libhadoop.so and compression libs (bzip2, zstd, snappy)
- must run on AWS EC2 instances and read/write to S3
  - including smaller instances (2 vCPUs, 16 GiB RAM)
- ideally:
  - Hadoop 3.3.0
  - Java 11 and
  - support to run on ARM machines

So far, Common Crawl uses Cloudera CDH, but with no free updates anymore we are considering either switching to Amazon EMR, getting a Cloudera subscription, or using vanilla Hadoop (esp. since only HDFS and YARN are required). A dockerized setup is also an option (at least for development and testing).

I've already looked at [1] - the upgrade to Hadoop 3.3.0 was straightforward [2], but native library support is still missing.

Thanks,
Sebastian

[1] https://github.com/big-data-europe/docker-hadoop
[2] https://github.com/sebastian-nagel/docker-hadoop/tree/2.0.0-hadoop3.3.0-java11
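
P.S. To make "native library support" and the S3 requirement concrete: below is a minimal sketch (plain Java, not taken from any of the setups above) of the check I'd like to see pass, roughly what "hadoop checknative -a" verifies on the command line. The s3a part assumes hadoop-aws and the AWS SDK are on the classpath (on EC2 the credentials can come from the instance profile); the bucket name is only a placeholder.

  import java.net.URI;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.util.NativeCodeLoader;

  public class NativeAndS3Check {
    public static void main(String[] args) throws Exception {
      // 1) libhadoop.so: true only if the native library was found on
      //    java.library.path (usually $HADOOP_HOME/lib/native) and loaded.
      boolean nativeLoaded = NativeCodeLoader.isNativeCodeLoaded();
      System.out.println("libhadoop loaded: " + nativeLoaded);
      if (nativeLoaded) {
        // getLibraryName() and buildSupportsZstd() are native methods,
        // so only call them after the load check succeeded.
        System.out.println("native library:  " + NativeCodeLoader.getLibraryName());
        System.out.println("zstd in build:   " + NativeCodeLoader.buildSupportsZstd());
      }

      // 2) S3 read access via the s3a connector ("example-bucket" is a placeholder).
      FileSystem fs = FileSystem.get(URI.create("s3a://example-bucket/"), new Configuration());
      for (FileStatus status : fs.listStatus(new Path("s3a://example-bucket/"))) {
        System.out.println(status.getPath());
      }
    }
  }

Compile against "$(hadoop classpath)" and run with the native directory on java.library.path, e.g.:

  java -cp "$(hadoop classpath):." -Djava.library.path="$HADOOP_HOME/lib/native" NativeAndS3Check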

