Re: Metastore: How is the unique ID of new databases and tables determined?
Hi Eugene,

HMS depends on DataNucleus for identity value generation for the HMS tables. The ID is generated by DataNucleus when an object is made persistent, and the DataNucleus value generator produces values that are unique across different JVMs. As Zoltan said, DataNucleus tracks ID allocation for each model class in SEQUENCE_TABLE; there is no ID-generation code in the metastore itself.

Recently, to add dynamic partitions using direct SQL against the DB, a method getDataStoreId(Class modelClass) was added in
https://github.com/apache/hive/blob/master/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/DirectSqlInsertPart.java.
It fetches the next available ID directly from DataNucleus and is used for classes with the datastore identity type.

I am not sure how you are going to replicate into the target cluster; you may have to explore DataNucleus value generation further if this is what you are looking for.

Regards,
Venu

On Tue, Oct 24, 2023 at 6:00 PM Zoltán Rátkai wrote:
> Hi Eugene,
>
> the TBL_ID in the TBLS table is handled by DataNucleus, so AUTO_INCREMENT
> won't help, since TBL_ID is not defined as AUTO_INCREMENT.
>
> DataNucleus uses SEQUENCE_TABLE to store the current value for primary
> keys. In this table, these two rows are what you need to modify:
>
> org.apache.hadoop.hive.metastore.model.MDatabase
> org.apache.hadoop.hive.metastore.model.MTable
>
> e.g.:
>
> update SEQUENCE_TABLE set NEXT_VAL = 1 where
> SEQUENCE_NAME='org.apache.hadoop.hive.metastore.model.MTable';
>
> and do the same for org.apache.hadoop.hive.metastore.model.MDatabase.
>
> After that, when you create a table, the TBL_ID will start from this value.
> DataNucleus uses caching (default 10), so the next few tables may still
> use the old values. Try creating 10 simple tables like this:
>
> create table test1 (i int);
> ...
> create table test10 (i int);
>
> then drop them and check the TBL_ID.
>
> *Before doing this I recommend creating a backup of the Metastore DB!!*
>
> Also check this:
> https://community.cloudera.com/t5/Support-Questions/How-to-migrate-Hive-Table-From-one-cluster-to-another/m-p/235145
>
> Regards,
>
> Zoltan Ratkai
>
> On Sun, Oct 22, 2023 at 5:39 PM Eugene Miretsky wrote:
>
>> Hey!
>>
>> Looking for a way to control the IDs (DB_ID and TBL_ID) of newly
>> created databases and tables.
>>
>> We have a somewhat complicated use case where we replicate the metastore
>> (and data) from a source Hive cluster to a target cluster. However, new
>> tables can be added on both source and target, so we need a way to avoid
>> unique-ID collisions. One way would be to make sure all databases/tables
>> created in the target Hive start from a higher ID.
>>
>> We have tried to set AUTO_INCREMENT='1' on the metastore MySQL DB, but
>> it doesn't work. This makes us think the ID is generated by the metastore
>> code itself, but we cannot find the right place in the code, or whether it
>> is possible to control the logic.
>>
>> Any advice would be appreciated.
>>
>> Cheers,
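To make the caching behaviour Zoltan describes concrete, here is a small self-contained sketch (plain Python with sqlite3, illustrative only, not the real metastore schema or DataNucleus code) that mimics how a DataNucleus-style table value generator hands out IDs: it reserves a block of 10 from SEQUENCE_TABLE and serves from that cache, so bumping NEXT_VAL only takes effect once the current block is exhausted.

```python
import sqlite3

# Toy stand-in for the metastore's SEQUENCE_TABLE (column names mirror the
# real ones, but this is an illustration, not the actual HMS schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE SEQUENCE_TABLE (SEQUENCE_NAME TEXT PRIMARY KEY, NEXT_VAL INTEGER)")
conn.execute("INSERT INTO SEQUENCE_TABLE VALUES "
             "('org.apache.hadoop.hive.metastore.model.MTable', 1)")

class TableValueGenerator:
    """Mimics a DataNucleus table value generator with allocation size 10."""
    def __init__(self, conn, seq_name, cache_size=10):
        self.conn, self.seq_name, self.cache_size = conn, seq_name, cache_size
        self.cached = []  # IDs already reserved from the database

    def next_id(self):
        if not self.cached:
            # Reserve a whole block: read NEXT_VAL, then advance it by cache_size.
            (next_val,) = self.conn.execute(
                "SELECT NEXT_VAL FROM SEQUENCE_TABLE WHERE SEQUENCE_NAME = ?",
                (self.seq_name,)).fetchone()
            self.conn.execute(
                "UPDATE SEQUENCE_TABLE SET NEXT_VAL = ? WHERE SEQUENCE_NAME = ?",
                (next_val + self.cache_size, self.seq_name))
            self.cached = list(range(next_val, next_val + self.cache_size))
        return self.cached.pop(0)

gen = TableValueGenerator(conn, "org.apache.hadoop.hive.metastore.model.MTable")
first = gen.next_id()  # reserves a block of IDs 1..10, returns 1
# Bump NEXT_VAL the way Zoltan suggests -- but the generator still holds a cache:
conn.execute("UPDATE SEQUENCE_TABLE SET NEXT_VAL = 1000000 "
             "WHERE SEQUENCE_NAME = 'org.apache.hadoop.hive.metastore.model.MTable'")
ids = [gen.next_id() for _ in range(10)]  # drains the old block, then jumps
print(first, ids)
```

This is why, after updating NEXT_VAL on a live metastore, the next few tables may still receive IDs from the previously reserved block, and why Zoltan suggests creating and dropping around 10 throwaway tables.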
Re: Announce: Hive-MR3 with Celeborn,
Thanks. I will try.

---------- Replied Message ----------
From: Sungwoo Park
Date: 10/24/2023 20:08
To: user@hive.apache.org
Subject: Announce: Hive-MR3 with Celeborn,

Hi Hive users,

Before the impending release of MR3 1.8, we would like to announce the release of Hive-MR3 with Celeborn (Hive 3.1.3 on MR3 1.8 with Celeborn 0.3.1).

Apache Celeborn [1] is a remote shuffle service, similar to Magnet [2] and Apache Uniffle [3] (which was discussed on this Hive mailing list a while ago). Celeborn officially supports Spark and Flink, and we have implemented an MR3 extension for Celeborn.

In addition to all the benefits of using a remote shuffle service, Hive-MR3-Celeborn supports direct processing of mapper output on the reducer side, which means that reducers do not store mapper output on local disks (for unordered edges). In this way, Hive-MR3-Celeborn can eliminate over 95% of local disk writes when tested on the 10TB TPC-DS benchmark. This can be particularly useful when running Hive-MR3 on public clouds, where fast local disk storage is expensive or not available.

We have documented the usage of Hive-MR3-Celeborn in [4]. You can download Hive-MR3-Celeborn from [5].

FYI, MR3 is an execution engine providing native support for Hadoop, Kubernetes, and standalone mode [6]. Hive-MR3, its main application, provides the performance of LLAP yet is very easy to install and operate. If you are using Hive-Tez for running ETL jobs, switching to Hive-MR3 will give you a much higher throughput thanks to its advanced resource-sharing model.

We have recently opened a Slack channel. If interested, please join and ask any questions on MR3:
https://join.slack.com/t/mr3-help/shared_invite/zt-1wpqztk35-AN8JRDznTkvxFIjtvhmiNg

Thank you,

--- Sungwoo

[1] https://celeborn.apache.org/
[2] https://www.vldb.org/pvldb/vol13/p3382-shen.pdf
[3] https://uniffle.apache.org/
[4] https://mr3docs.datamonad.com/docs/mr3/features/celeborn/
[5] https://github.com/mr3project/mr3-release/releases/tag/v1.8
[6] https://mr3docs.datamonad.com/
Re: Re: Hive's performance for querying the Iceberg table is very poor.
HIVE-27734 is in progress; as I see it, we have a POC attached to the ticket, and I believe we should have it in 2-3 weeks.

> Also, after the release of 4.0.0, will we be able to do all TPCDS queries on ICEBERG except for normal HIVE tables?

Yep, I believe most of the TPC-DS queries would be supported even today on Hive master, but 4.0.0 will have them running for sure.

-Ayush

On Tue, 24 Oct 2023 at 14:51, lisoda wrote:
> Thanks.
> I would like to know if Hive currently supports pushing JOIN conditions
> down to ICEBERG table partitions.
> Because I see HIVE-27734 is not yet complete, what is its progress so far?
> Also, after the release of 4.0.0, will we be able to do all TPCDS queries
> on ICEBERG except for normal HIVE tables?
>
> On 2023-10-24 11:03:07, "Ayush Saxena" wrote:
>
> Hi Lisoda,
>
> The Iceberg jar for Hive 3.1.3 doesn't have a lot of changes; we did a
> bunch of improvements on the 4.x line for Hive-Iceberg. You can give
> Iceberg a try on the 4.0.0-beta-1 release mentioned here [1]; we have a
> bunch of improvements like vectorization and the like. If you want to give
> it a quick try on Docker, we have a Docker image published for that here
> [2], and Iceberg works out of the box there.
>
> Beyond that, feel free to create tickets if you find specific queries or
> scenarios that are problematic; we will be happy to chase them and get
> them sorted.
>
> PS. Not sure about StarRocks, FWIW. That is something we don't develop as
> part of Apache Hive, nor as part of the Apache Software Foundation to the
> best of my knowledge, so I would refrain from commenting about it on the
> "Apache Hive" ML.
>
> -Ayush
>
> [1] https://hive.apache.org/general/downloads/
> [2] https://hub.docker.com/r/apache/hive/tags
>
> On Tue, 24 Oct 2023 at 05:28, Albert Wong wrote:
>
>> Too bad. Tencent Games used StarRocks with Apache Iceberg to power
>> their analytics.
>> https://medium.com/starrocks-engineering/tencent-games-inside-scoop-the-road-to-cloud-native-with-starrocks-d7dcb2438e25
>>
>> On Mon, Oct 23, 2023 at 10:55 AM lisoda wrote:
>>
>>> We are not going to use StarRocks.
>>> MPP-architecture databases have natural limitations, and StarRocks does
>>> not necessarily perform better than Hive LLAP.
>>>
>>> Replied Message
>>> From: Albert Wong
>>> Date: 10/24/2023 01:39
>>> To: user@hive.apache.org
>>> Subject: Re: Hive's performance for querying the Iceberg table is very poor.
>>>
>>> I would try http://starrocks.io. StarRocks is an MPP OLAP database
>>> that can query Apache Iceberg, and we can cache the data for faster
>>> performance. We also have additional features like building materialized
>>> views that span Apache Iceberg, Apache Hudi, and Apache Hive. Here is
>>> a video of connecting the two products through a webinar StarRocks did
>>> with Tabular (authors of Apache Iceberg):
>>> https://www.youtube.com/watch?v=bAmcTrX7hCI=10s
>>>
>>> On Mon, Oct 23, 2023 at 7:18 AM lisoda wrote:
>>>
>>> Hi Team. I was recently testing Hive querying Iceberg tables, and I found
>>> that the performance is very poor, almost unusable in a production
>>> environment. Also, JOIN conditions cannot be pushed down to the Iceberg
>>> partitions. I'm using the 1.3.1 Hive runtime jar from the Iceberg
>>> community, with Hive 3.1.3 and Iceberg 1.3.1. Now I'm very frustrated
>>> because the performance is so bad that I can't deliver to my customers.
>>> How can I solve this problem?
>>> Details: https://apache-iceberg.slack.com/archives/C025PH0G1D4/p1695050248606629
>>> I would be grateful if someone could guide me.
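For readers wondering what "pushing the JOIN condition down to the Iceberg partition" (the subject of HIVE-27734, i.e. dynamic partition pruning) buys you, here is a hedged, engine-agnostic sketch in plain Python: the distinct join keys from the small side are computed first and used to skip whole partitions of the large side before scanning. All names below are illustrative; none are real Hive or Iceberg APIs.

```python
# Illustrative-only sketch of dynamic partition pruning (the idea behind
# HIVE-27734); none of these names are real Hive or Iceberg APIs.

# Large fact table, physically laid out as {partition_key: rows}.
sales_partitions = {
    "2023-10-01": [("tv", 2), ("radio", 1)],
    "2023-10-02": [("tv", 5)],
    "2023-10-03": [("phone", 7)],
}

# Small dimension table that the query joins against on the partition column.
promo_days = [{"day": "2023-10-02"}, {"day": "2023-10-03"}]

def join_without_pruning():
    scanned, out = 0, []
    for day, rows in sales_partitions.items():  # scans every partition
        scanned += 1
        if any(p["day"] == day for p in promo_days):
            out.extend(rows)
    return scanned, out

def join_with_pruning():
    keys = {p["day"] for p in promo_days}       # evaluate the build side first
    scanned, out = 0, []
    for day in keys & sales_partitions.keys():  # skip non-matching partitions
        scanned += 1
        out.extend(sales_partitions[day])
    return scanned, out

print(join_without_pruning()[0], join_with_pruning()[0])  # 3 vs 2 partitions scanned
```

Both versions produce the same join result, but the pruned plan never opens partitions whose keys cannot match; on a real table with thousands of partitions, that is where the performance difference comes from.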
submitting tasks failed in Spark standalone mode due to missing failureaccess jar file
Hi Team,

I use Spark 3.5.0 and start a Spark cluster with start-master.sh and start-worker.sh. When I run ./bin/spark-shell --master spark://LAPTOP-TC4A0SCV.:7077 I get these error logs:

```
23/10/24 12:00:46 ERROR TaskSchedulerImpl: Lost an executor 1 (already removed): Command exited with code 50
```

The worker's logs for the finished executors:

```
Spark Executor Command: "/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.372.b07-1.el7_9.x86_64/jre/bin/java"
  "-cp" "/root/spark-3.5.0-bin-hadoop3/conf/:/root/spark-3.5.0-bin-hadoop3/jars/*" "-Xmx1024M"
  "-Dspark.driver.port=43765" "-Djava.net.preferIPv6Addresses=false" "-XX:+IgnoreUnrecognizedVMOptions"
  "--add-opens=java.base/java.lang=ALL-UNNAMED" "--add-opens=java.base/java.lang.invoke=ALL-UNNAMED"
  "--add-opens=java.base/java.lang.reflect=ALL-UNNAMED" "--add-opens=java.base/java.io=ALL-UNNAMED"
  "--add-opens=java.base/java.net=ALL-UNNAMED" "--add-opens=java.base/java.nio=ALL-UNNAMED"
  "--add-opens=java.base/java.util=ALL-UNNAMED" "--add-opens=java.base/java.util.concurrent=ALL-UNNAMED"
  "--add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED" "--add-opens=java.base/sun.nio.ch=ALL-UNNAMED"
  "--add-opens=java.base/sun.nio.cs=ALL-UNNAMED" "--add-opens=java.base/sun.security.action=ALL-UNNAMED"
  "--add-opens=java.base/sun.util.calendar=ALL-UNNAMED" "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED"
  "-Djdk.reflect.useDirectMethodHandle=false" "org.apache.spark.executor.CoarseGrainedExecutorBackend"
  "--driver-url" "spark://CoarseGrainedScheduler@172.29.190.147:43765" "--executor-id" "0"
  "--hostname" "172.29.190.147" "--cores" "6" "--app-id" "app-20231024120037-0001"
  "--worker-url" "spark://Worker@172.29.190.147:34707" "--resourceProfileId" "0"
Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
23/10/24 12:00:39 INFO CoarseGrainedExecutorBackend: Started daemon with process name: 19535@LAPTOP-TC4A0SCV
23/10/24 12:00:39 INFO SignalUtils: Registering signal handler for TERM
23/10/24 12:00:39 INFO SignalUtils: Registering signal handler for HUP
23/10/24 12:00:39 INFO SignalUtils: Registering signal handler for INT
23/10/24 12:00:39 WARN Utils: Your hostname, LAPTOP-TC4A0SCV resolves to a loopback address: 127.0.1.1; using 172.29.190.147 instead (on interface eth0)
23/10/24 12:00:39 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
23/10/24 12:00:42 INFO CoarseGrainedExecutorBackend: Successfully registered with driver
23/10/24 12:00:42 INFO Executor: Starting executor ID 0 on host 172.29.190.147
23/10/24 12:00:42 INFO Executor: OS info Linux, 5.15.123.1-microsoft-standard-WSL2, amd64
23/10/24 12:00:42 INFO Executor: Java version 1.8.0_372
23/10/24 12:00:42 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 35227.
23/10/24 12:00:42 INFO NettyBlockTransferService: Server created on 172.29.190.147:35227
23/10/24 12:00:42 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
23/10/24 12:00:42 ERROR Inbox: An error happened while processing message in the inbox for Executor
java.lang.NoClassDefFoundError: org/sparkproject/guava/util/concurrent/internal/InternalFutureFailureAccess
	at java.lang.ClassLoader.defineClass1(Native Method)
	at java.lang.ClassLoader.defineClass(ClassLoader.java:756)
	at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
	at java.net.URLClassLoader.defineClass(URLClassLoader.java:473)
	at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
	at java.lang.ClassLoader.defineClass1(Native Method)
	at java.lang.ClassLoader.defineClass(ClassLoader.java:756)
	at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
	at java.net.URLClassLoader.defineClass(URLClassLoader.java:473)
	at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
	at
```
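The NoClassDefFoundError above names org/sparkproject/guava/util/concurrent/internal/InternalFutureFailureAccess, which (as the subject line suggests) normally ships in the failureaccess artifact, here relocated under org.sparkproject; the usual remedy is to make sure a jar providing that class is on the executor classpath. Since jars are just zip files, a small helper like the following can tell you which jar under a directory actually provides a given class. This is a hedged sketch, not a Spark tool; the path below is taken from the report above and should be adjusted for your installation.

```python
import os
import zipfile

def jars_providing(jar_dir, class_name):
    """Return the jar file names under jar_dir that contain the given class entry."""
    entry = class_name.replace(".", "/") + ".class"  # class entry path inside the jar
    hits = []
    for name in sorted(os.listdir(jar_dir)):
        if not name.endswith(".jar"):
            continue
        with zipfile.ZipFile(os.path.join(jar_dir, name)) as jar:
            if entry in jar.namelist():
                hits.append(name)
    return hits

if __name__ == "__main__":
    # Example path from the logs above; adjust for your installation.
    jar_dir = "/root/spark-3.5.0-bin-hadoop3/jars"
    missing = "org.sparkproject.guava.util.concurrent.internal.InternalFutureFailureAccess"
    if os.path.isdir(jar_dir):
        print(jars_providing(jar_dir, missing) or "class not found in any jar")
```

If the class is not found in any jar in that directory, placing a jar that contains it (e.g. the failureaccess artifact matching your Spark build's relocated Guava) into the jars directory on every worker should let the executors start.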