Oh, you are using S3. As I remember, S3 has performance issues when
processing a large number of files.
On Wed, Dec 16, 2015 at 7:58 PM, Gourav Sengupta
wrote:
> The HIVE table has a very large number of partitions, around 365 * 5 * 10, and
> when I ask the hive metastore to
Hi Jeff,
sadly that does not resolve the issue. I am sure that the in-memory mapping of
partitions to physical file locations can be saved and restored in SPARK.
Regards,
Gourav Sengupta
On Wed, Dec 16, 2015 at 12:13 PM, Jeff Zhang wrote:
> oh, you are using S3. As I remember, S3 has
The HIVE table has a very large number of partitions, around 365 * 5 * 10
(roughly 18,250), and when I ask the hive metastore to start running queries
on it (the one with .count() or .show()) it takes around 2 hours before the
job starts in SPARK.
On the pyspark screen I can see that it is parsing the S3 locations
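If the two hours are spent enumerating partition paths, one option worth trying (a sketch only; the setting `spark.sql.hive.metastorePartitionPruning` exists in Spark 1.5+ but defaults to off, and the table/column names below are illustrative) is to push partition filters down to the metastore so Spark only lists the partitions a query actually touches:

```python
# Sketch: enable metastore partition pruning so Spark asks the Hive
# metastore only for the partitions matched by the query filter, instead
# of listing every S3 partition path up front. Verify the config key
# against the Spark version you run.
from pyspark import SparkConf

conf = (SparkConf()
        .set("spark.sql.hive.metastorePartitionPruning", "true"))

# With pruning enabled, filtering on the partition columns keeps the
# S3 listing limited to the matching partitions, e.g. (hypothetical table):
# sqlContext.sql("SELECT COUNT(*) FROM my_table WHERE dt = '2015-12-16'")
```

This helps only when queries filter on the partition columns; an unfiltered `.count()` over all 18,250 partitions still has to enumerate everything.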
Hi,
I have a HIVE table with a few thousand partitions (based on date and time).
It takes a long time to run the first time, and subsequently it is fast.
Is there a way to store the cache of partition lookups so that every time I
start a new SPARK instance (cannot keep my personal
>>> Currently it takes around 1.5 hours for me just to cache in the
partition information and after that I can see that the job gets queued in
the SPARK UI.
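The idea of persisting the partition lookup across sessions can be sketched in plain Python (everything here is hypothetical: `list_partitions_from_metastore` stands in for the slow metastore/S3 scan, and the cache path is an illustrative choice):

```python
import json
import os

CACHE_PATH = "/tmp/partition_cache.json"  # illustrative location

def list_partitions_from_metastore():
    # Stand-in for the expensive scan; in practice this might be something
    # like sqlContext.sql("SHOW PARTITIONS my_table").collect().
    return ["dt=2015-12-%02d/hr=%02d" % (d, h)
            for d in range(1, 4) for h in range(2)]

def cached_partitions(path=CACHE_PATH):
    # Fast path: a previous session already saved the listing to disk.
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    # Slow path: scan once, then persist for the next SPARK instance.
    parts = list_partitions_from_metastore()
    with open(path, "w") as f:
        json.dump(parts, f)
    return parts
```

The obvious caveat is staleness: the cache file must be invalidated whenever partitions are added or dropped, otherwise later sessions will work from an outdated listing.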
I guess you mean the stage of getting the split info. I suspect it might be
an issue with your cluster (or metadata store); usually it won't take