Hello Sebastian,

We have always used vanilla Apache Hadoop on our own physical servers
running the latest Debian, which also runs on ARM. It runs HDFS
and YARN and any other custom job you can think of. It has snappy
compression, which is a massive improvement for large data shuffling jobs,
it runs on Java 11, and if necessary even on AWS, though I dislike AWS.

You can easily read/write large files between HDFS and S3 without storing
them on the local filesystem, so it ticks that box too.
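
Just as an illustration (the bucket name and paths are placeholders, and it
assumes the s3a connector is configured with credentials in core-site.xml or
via the EC2 instance profile):

    hadoop distcp hdfs://namenode:8020/crawl/segments s3a://your-bucket/segments

distcp runs as a MapReduce job and streams directly between the two
filesystems, so nothing is staged on the local disk of the submitting machine.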

I don't know much about Docker, except that I don't like it either, but
that is personal. I do like vanilla Apache Hadoop.

Regards,
Markus



On Tue, Jun 1, 2021 at 16:35, Sebastian Nagel
<[email protected]> wrote:

> Hi,
>
> does anybody have a recommendation for a free and production-ready Hadoop
> setup?
>
> - HDFS + YARN
> - run Nutch but also other MapReduce and Spark-on-Yarn jobs
> - with native library support: libhadoop.so and compression
>    libs (bzip2, zstd, snappy)
> - must run on AWS EC2 instances and read/write to S3
> - including smaller ones (2 vCPUs, 16 GiB RAM)
> - ideally,
>    - Hadoop 3.3.0
>    - Java 11 and
>    - support to run on ARM machines
>
> So far, Common Crawl uses Cloudera CDH, but with no free updates
> anymore we are considering either switching to Amazon EMR, a Cloudera
> subscription, or vanilla Hadoop (esp. since only HDFS and YARN
> are required).
>
> A dockerized setup is also an option (at least, for development and
> testing). So far, I've looked at [1] - the upgrade to Hadoop 3.3.0
> was straightforward [2]. But native library support is still missing.
>
> Thanks,
> Sebastian
>
> [1] https://github.com/big-data-europe/docker-hadoop
> [2]
> https://github.com/sebastian-nagel/docker-hadoop/tree/2.0.0-hadoop3.3.0-java11
>
