Well, this sounds like a lot for “only” 17 billion records. However, you can
limit the resources of the job so it does not take all of them (it might take
a little longer).
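For the resource cap, a minimal sketch against the Spark 2.x SparkSession API; every number below is a placeholder to tune to what the cluster can spare, and the same settings are more commonly passed as spark-submit flags:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative values only; size these to what the cluster can spare.
val spark = SparkSession.builder()
  .appName("hbase-historical-load")
  // With dynamic allocation off, spark.executor.instances caps the total
  // number of executors the job will ever claim.
  .config("spark.dynamicAllocation.enabled", "false")
  .config("spark.executor.instances", "16")
  .config("spark.executor.cores", "4")
  .config("spark.executor.memory", "16g")
  .enableHiveSupport()
  .getOrCreate()
```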
Alternatively, did you try using the HBase tables directly in Hive as external
tables and doing a simple CTAS? Works better if Hive is on
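For reference, the external-table-plus-CTAS route might look roughly like this when issued to HiveServer2 over JDBC from Scala. The connection URL, table names, columns, and column-family mapping are all invented for illustration, and only table A is mapped here; the other tables would follow the same pattern:

```scala
import java.sql.DriverManager

Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection("jdbc:hive2://hiveserver:10000/default")
val stmt = conn.createStatement()

// Expose the HBase table to Hive without copying any data.
stmt.execute("""
  CREATE EXTERNAL TABLE hbase_a (rowkey STRING, col1 STRING)
  STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
  WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:col1")
  TBLPROPERTIES ("hbase.table.name" = "tableA")""")

// A single CTAS then materializes the join as Parquet inside Hive
// (hbase_b is assumed to be mapped the same way as hbase_a).
stmt.execute("""
  CREATE TABLE joined STORED AS PARQUET AS
  SELECT a.rowkey, a.col1, b.col1 AS b_col1
  FROM hbase_a a JOIN hbase_b b ON (a.rowkey = b.rowkey)""")

conn.close()
```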
Jorn,
This is a kind of one-time load from historical data into an analytical Hive
engine. The Hive version is 1.2.1 and the Spark version is 2.0.1, with the
MapR distribution. Writing every table to Parquet and reading it back could be
very time-consuming; currently the entire job can take ~8 hours on 8 nodes
with 100 GB of RAM each.
Hi,
Do you have a more detailed log/error message?
Also, can you please provide us with details on the tables (number of rows,
columns, size, etc.)?
Is this just a one-time thing or something regular?
If it is a one-time thing, then I would tend more towards putting each table in
HDFS (Parquet or ORC) and
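The message is cut off here, but combined with the reply above ("writing every table to parquet and reading it"), the staging idea is clear. A minimal sketch, assuming the HBase reads already produce DataFrames; the helper name and paths are hypothetical:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper: write one HBase-sourced DataFrame to HDFS as
// Parquet and hand back the staged copy, so the join later runs over
// splittable, columnar files instead of streaming straight out of HBase.
def stage(spark: SparkSession, df: DataFrame, path: String): DataFrame = {
  df.write.mode("overwrite").parquet(path)
  spark.read.parquet(path)
}

// Usage: val a = stage(spark, dfA, "/staging/table_a")
```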
Hello Spark Developers,
I have 3 tables that I am reading from HBase and want to do a join
transformation on and save the result to a Hive Parquet external table.
Currently my join is failing with a container failed error.
1. Read table A from HBase with ~17 billion records.
2. Repartition on the primary key of table A
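The message appears truncated at step 2; presumably the remaining steps read and repartition tables B and C and then run the join described above. A rough sketch of that pipeline, with the HBase read abstracted behind a hypothetical readFromHBase function and the key name and partition count as placeholders:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

def joinAndSave(spark: SparkSession, readFromHBase: String => DataFrame): Unit = {
  // Steps 1-2 as described: read table A, then repartition on its
  // primary key so every side of the join is hash-partitioned alike.
  val a = readFromHBase("tableA").repartition(4096, col("pk"))
  val b = readFromHBase("tableB").repartition(4096, col("pk"))
  val c = readFromHBase("tableC").repartition(4096, col("pk"))

  val joined = a.join(b, Seq("pk")).join(c, Seq("pk"))

  // Write Parquet into the directory that the Hive external table
  // (created separately via DDL) points at.
  joined.write.mode("overwrite").parquet("/warehouse/external/joined")
}
```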