Issue: We are using the wholeTextFiles() API to read files from S3, but this API is extremely slow for the reasons described below. The question is how to fix this issue.

Here is our analysis so far:

The problem is that we are using Spark's wholeTextFiles() API to read the S3 files. wholeTextFiles() works in two steps. First, the driver lists every file under the path to compute the input splits; only then do the executors read the file contents. Against S3, each listing and open is a separate HTTP request, so with many small files the driver-side listing alone can dominate the runtime.
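One common workaround, shown here as a minimal sketch rather than a definitive fix, is to take the file listing out of Hadoop's input-format machinery: enumerate the keys once on the driver with the S3 ListObjectsV2 API (one request per up to 1000 keys), then fetch the objects in parallel on the executors. The bucket name, prefix, and partition sizing are hypothetical placeholders, and the sketch assumes boto3 is installed on both the driver and the executors:

```python
import boto3
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

BUCKET = "my-bucket"   # hypothetical bucket, adjust to your layout
PREFIX = "input/"      # hypothetical prefix

# Step 1: enumerate keys on the driver via paginated ListObjectsV2 calls.
s3 = boto3.client("s3")
keys = []
for page in s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET, Prefix=PREFIX):
    keys.extend(obj["Key"] for obj in page.get("Contents", []))

# Step 2: fetch the object bodies in parallel on the executors.
def fetch(key_iter):
    client = boto3.client("s3")  # one client per partition, created executor-side
    for key in key_iter:
        body = client.get_object(Bucket=BUCKET, Key=key)["Body"].read()
        yield key, body.decode("utf-8")

# ~100 keys per partition is an arbitrary example value; tune for your file sizes.
files = sc.parallelize(keys, max(1, len(keys) // 100)).mapPartitions(fetch)
```

If your Spark version supports the wholetext option, spark.read.text(path, wholetext=True) gives you one row per file with far less ceremony; it still has to list the files, but through Spark's own file index rather than the per-file split computation behind wholeTextFiles().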
We use Jupyter on Hadoop (https://jupyterhub-on-hadoop.readthedocs.io/en/latest/) for developing Spark jobs directly inside the cluster they will run on. With that you have direct access to YARN and HDFS (fully secured) without any migration steps, and you can control the size of your Jupyter YARN container.
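For illustration, a minimal sketch of what that container sizing could look like in jupyterhub_config.py, assuming the YarnSpawner from the jupyterhub-yarnspawner project; the limit values are made-up examples, not recommendations:

```python
# jupyterhub_config.py -- sizing for the per-user notebook YARN containers.
c.JupyterHub.spawner_class = "yarnspawner.YarnSpawner"

c.YarnSpawner.mem_limit = "4G"      # memory for the container running the notebook
c.YarnSpawner.cpu_limit = 2         # vcores for the notebook container
c.YarnSpawner.queue = "jupyterhub"  # YARN queue to submit the containers to (assumed setting)
```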
Disclaimer: I'm a developer advocate for data engineering at JetBrains, so I'm definitely biased.

And if someone likes Zeppelin, there is an awesome integration of Zeppelin into IDEA via the Big Data Tools plugin: you can perform any explorations you want or need and then extract all your work into real code.