<https://stackoverflow.com/posts/59977690/timeline>

Hi,

I am trying to run thousands of Parquet partition update operations on
different Hive tables in parallel from my client application. I am using
Spark SQL with Hive support enabled in my application to submit the Hive
queries.

spark.sql("ALTER TABLE mytable PARTITION (a=3, b=3) SET LOCATION '/newdata/mytable/a=3/b=3/part.parquet'")
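For context, here is a minimal sketch of how statements like the one above might be fanned out from a fork-join pool. It is not my actual application code: the class name `PartitionLocationUpdater`, the helper names, and the `runSql` callback (a stand-in for something like `spark::sql`) are all hypothetical, chosen only so the sketch runs without a Spark session.

```java
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical helper, for illustration only.
class PartitionLocationUpdater {

    // Builds the ALTER TABLE statement for one partition (a, b) of a table.
    static String alterLocationSql(String table, int a, int b, String location) {
        return String.format(
            "ALTER TABLE %s PARTITION (a=%d, b=%d) SET LOCATION '%s'",
            table, a, b, location);
    }

    // Submits every statement to a fork-join pool and waits for completion.
    // runSql is an assumption: in a real application it would wrap spark.sql(...).
    static void submitAll(List<String> statements, Consumer<String> runSql, int parallelism) {
        ForkJoinPool pool = new ForkJoinPool(parallelism);
        for (String stmt : statements) {
            pool.execute(() -> runSql.accept(stmt));
        }
        pool.shutdown();
        try {
            // Previously submitted tasks still run after shutdown(); wait for them.
            pool.awaitTermination(1, TimeUnit.HOURS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

The point of the sketch is only that the client side hands each statement to a different worker thread; it says nothing about how many connections reach the metastore.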

I can see that all the queries are submitted via different threads from my
fork-join pool. However, I couldn't scale this operation no matter how I
tweaked the thread pool. I then started watching the Hive metastore logs,
and I see that only one thread is making all the writes.

    2020-01-29T16:27:15,638  INFO [pool-6-thread-163] metastore.HiveMetaStore: 163: source:10.250.70.14 get_table : db=mydb tbl=mytable1
    2020-01-29T16:27:15,638  INFO [pool-6-thread-163] HiveMetaStore.audit: ugi=mycomp   ip=10.250.70.14 cmd=source:10.250.70.14 get_table : db=mydb tbl=mytable1
    2020-01-29T16:27:15,653  INFO [pool-6-thread-163] metastore.HiveMetaStore: 163: source:10.250.70.14 get_database: mydb
    2020-01-29T16:27:15,653  INFO [pool-6-thread-163] HiveMetaStore.audit: ugi=mycomp   ip=10.250.70.14 cmd=source:10.250.70.14 get_database: mydb
    2020-01-29T16:27:15,655  INFO [pool-6-thread-163] metastore.HiveMetaStore: 163: source:10.250.70.14 get_table : db=mydb tbl=mytable2
    2020-01-29T16:27:15,656  INFO [pool-6-thread-163] HiveMetaStore.audit: ugi=mycomp   ip=10.250.70.14 cmd=source:10.250.70.14 get_table : db=mydb tbl=mytable2
    2020-01-29T16:27:15,670  INFO [pool-6-thread-163] metastore.HiveMetaStore: 163: source:10.250.70.14 get_database: mydb
    2020-01-29T16:27:15,670  INFO [pool-6-thread-163] HiveMetaStore.audit: ugi=mycomp   ip=10.250.70.14 cmd=source:10.250.70.14 get_database: mydb
    2020-01-29T16:27:15,672  INFO [pool-6-thread-163] metastore.HiveMetaStore: 163: source:10.250.70.14 get_table : db=mydb tbl=mytable3
    2020-01-29T16:27:15,672  INFO [pool-6-thread-163] HiveMetaStore.audit: ugi=mycomp   ip=10.250.70.14 cmd=source:10.250.70.14 get_table : db=mydb tbl=mytable3

All actions are performed by a single thread, pool-6-thread-163. I have
scanned hundreds of lines and it is always the same thread. I don't see
much in the hiveserver.log file.
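Rather than eyeballing the log, the same check can be done by tallying the bracketed worker-thread names. The following is a rough sketch under the assumption that every metastore log entry carries a `[pool-N-thread-M]` field as in the excerpt above; the class name `MetastoreThreadTally` is hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
class MetastoreThreadTally {

    // Matches the bracketed worker name, e.g. "[pool-6-thread-163]".
    private static final Pattern THREAD =
        Pattern.compile("\\[(pool-\\d+-thread-\\d+)\\]");

    // Returns how many log lines were written by each worker thread.
    // Lines without a thread field (e.g. wrapped continuations) are skipped.
    static Map<String, Integer> countByThread(List<String> logLines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : logLines) {
            Matcher m = THREAD.matcher(line);
            if (m.find()) {
                counts.merge(m.group(1), 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

The lines could be read with `java.nio.file.Files.readAllLines` from wherever the metastore log lives. A result with a single key would confirm that one worker thread is serving every request.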

I see the following default values in the Hive documentation:

hive.metastore.server.min.threads Default Value: 200
hive.metastore.server.max.threads Default Value: 100000

These should be more than enough, so why is just one thread doing all the
work? Is the thread bound to the client IP? That would make sense, as I am
submitting all jobs from a single machine.


Am I missing some configuration, or is there a problem with this approach
on my application side?


Thanks,

Nirav
