<https://stackoverflow.com/posts/59977690/timeline>
Hi,
I am trying to run thousands of partition-location updates on Parquet
partitions of different Hive tables in parallel from my client application.
I am using Spark SQL with Hive support enabled to submit the Hive queries.
spark.sql("ALTER TABLE mytable PARTITION (a=3, b=3) SET LOCATION '/newdata/mytable/a=3/b=3/part.parquet'")
I can see that all the queries are submitted from different threads in my
fork-join pool, but I couldn't get this operation to scale no matter how I
tuned the thread pool. I then started watching the Hive metastore logs and
saw that only one thread is doing all the writes.
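A minimal sketch of the submission pattern described above, under stated assumptions: the class and method names (`ParallelPartitionUpdate`, `alterLocationSql`, `submitAll`) are hypothetical, and the `Consumer<String>` is only a stand-in for a call such as `spark::sql`, so the sketch runs without a Spark dependency. It is not the original application code.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ForkJoinPool;
import java.util.function.Consumer;
import java.util.stream.Collectors;

class ParallelPartitionUpdate {

    // Builds the ALTER TABLE statement for one partition (hypothetical helper).
    static String alterLocationSql(String table, int a, int b, String location) {
        return String.format(
            "ALTER TABLE %s PARTITION (a=%d, b=%d) SET LOCATION '%s'",
            table, a, b, location);
    }

    // Submits every statement on the given pool and blocks until all have finished.
    // runSql is a stand-in for something like spark::sql (assumption).
    static void submitAll(List<String> statements, ForkJoinPool pool, Consumer<String> runSql) {
        List<CompletableFuture<Void>> futures = statements.stream()
            .map(sql -> CompletableFuture.runAsync(() -> runSql.accept(sql), pool))
            .collect(Collectors.toList());
        CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
    }

    public static void main(String[] args) {
        List<String> statements = List.of(
            alterLocationSql("mytable", 3, 3, "/newdata/mytable/a=3/b=3/part.parquet"));
        ForkJoinPool pool = new ForkJoinPool(8);
        // Print the submitting thread instead of calling Spark.
        submitAll(statements, pool,
            sql -> System.out.println(Thread.currentThread().getName() + " -> " + sql));
        pool.shutdown();
    }
}
```

Logging the client-side thread name like this only shows which application thread submitted each statement; it says nothing about which metastore thread serves it.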
2020-01-29T16:27:15,638 INFO [pool-6-thread-163]
metastore.HiveMetaStore: 163: source:10.250.70.14 get_table : db=mydb
tbl=mytable1
2020-01-29T16:27:15,638 INFO [pool-6-thread-163] HiveMetaStore.audit:
ugi=mycomp ip=10.250.70.14 cmd=source:10.250.70.14 get_table :
db=mydb tbl=mytable1
2020-01-29T16:27:15,653 INFO [pool-6-thread-163]
metastore.HiveMetaStore: 163: source:10.250.70.14 get_database: mydb
2020-01-29T16:27:15,653 INFO [pool-6-thread-163] HiveMetaStore.audit:
ugi=mycomp ip=10.250.70.14 cmd=source:10.250.70.14 get_database:
mydb
2020-01-29T16:27:15,655 INFO [pool-6-thread-163]
metastore.HiveMetaStore: 163: source:10.250.70.14 get_table : db=mydb
tbl=mytable2
2020-01-29T16:27:15,656 INFO [pool-6-thread-163] HiveMetaStore.audit:
ugi=mycomp ip=10.250.70.14 cmd=source:10.250.70.14 get_table :
db=mydb tbl=mytable2
2020-01-29T16:27:15,670 INFO [pool-6-thread-163]
metastore.HiveMetaStore: 163: source:10.250.70.14 get_database: mydb
2020-01-29T16:27:15,670 INFO [pool-6-thread-163] HiveMetaStore.audit:
ugi=mycomp ip=10.250.70.14 cmd=source:10.250.70.14 get_database:
mydb
2020-01-29T16:27:15,672 INFO [pool-6-thread-163]
metastore.HiveMetaStore: 163: source:10.250.70.14 get_table : db=mydb
tbl=mytable3
2020-01-29T16:27:15,672 INFO [pool-6-thread-163] HiveMetaStore.audit:
ugi=mycomp ip=10.250.70.14 cmd=source:10.250.70.14 get_table :
db=mydb tbl=mytable3
All actions are performed by a single thread, pool-6-thread-163. I have
scanned hundreds of lines and it is always the same thread. I don't see much
in the hiverserver.log file.
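Rather than scanning the log by eye, the per-thread distribution can be counted. The following is a rough sketch that assumes the log format shown above (a log level followed by the thread name in square brackets); the class name `MetastoreLogThreads` and the method `countByThread` are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MetastoreLogThreads {

    // Matches the bracketed thread name after the log level, e.g. "INFO [pool-6-thread-163]".
    private static final Pattern THREAD =
        Pattern.compile("\\b(?:TRACE|DEBUG|INFO|WARN|ERROR)\\s+\\[([^\\]]+)\\]");

    // Counts how many log lines each metastore thread produced.
    // Lines without a "LEVEL [thread]" prefix (wrapped continuations) are skipped.
    static Map<String, Integer> countByThread(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            Matcher m = THREAD.matcher(line);
            if (m.find()) {
                counts.merge(m.group(1), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.err.println("usage: MetastoreLogThreads <metastore-log-file>");
            return;
        }
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        countByThread(lines).forEach((thread, n) -> System.out.println(thread + "\t" + n));
    }
}
```

If the output contains a single thread name, that confirms the observation across the whole file and not only the lines inspected manually.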
I see the following default values in the Hive documentation:
hive.metastore.server.min.threads Default Value: 200
hive.metastore.server.max.threads Default Value: 100000
These should be more than enough, so why is just one thread doing all the
work? Is the worker thread bound to the client IP? That would make sense, as
I am submitting all jobs from a single machine.
Am I missing some configuration, or is there a problem with this approach on
my application side?
Thanks,
Nirav