I think TableInputFormat will try to maintain as much locality as possible,
assigning one Spark partition per region and trying to assign that
partition to a YARN container/executor on the same node (assuming you're
using Spark over YARN). So the reason for the uneven distribution could be
that
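As a rough illustration of the locality effect described above (a toy model, not Spark or HBase code): if each region becomes one partition and each partition prefers the node hosting its region, then nodes hosting more regions get more partitions. The region and node names below are made up for the example, and the helper functions are hypothetical.

```python
from collections import Counter


def partitions_per_host(region_hosts):
    """Count how many one-per-region partitions would prefer each host."""
    return Counter(region_hosts.values())


def skew_ratio(counts):
    """Ratio of the busiest host's partition count to the mean count."""
    mean = sum(counts.values()) / len(counts)
    return max(counts.values()) / mean


# Hypothetical mapping of region -> node hosting that region (assumed data).
region_hosts = {
    "region-a": "node1",
    "region-b": "node1",
    "region-c": "node1",
    "region-d": "node2",
}

counts = partitions_per_host(region_hosts)
print(counts)
print(skew_ratio(counts))
```

A ratio well above 1 would suggest that locality alone could account for an uneven spread of tasks across executors.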
by the output of the dfsadmin command, so I am still trying to track that
down. The total allocated disk space of 28 TB should still be more than
enough.
Saad
On Sat, Apr 7, 2018 at 2:40 PM, Saad Mufti <saad.mu...@gmail.com> wrote:
> Thanks. I checked and it is using another
nputFormat.html
>
>
> Unfortunately, some input formats need a (local) tmp directory. Sometimes
> this cannot be avoided.
>
> See also the source:
>
> https://github.com/apache/hbase/blob/master/hbase-mapreduce/src/main/java/org/apache/hadoop/hbase/mapred/TableSnapshotInputFormat.java
Hi,
I have a simple ETL Spark job running on AWS EMR with Spark 2.2.1. The
input data is HBase files in AWS S3 using EMRFS, but there is no HBase
running on the Spark cluster itself. The job restores the HBase snapshot
into files on disk in another S3 folder used for temporary storage, then