Re: Metastore: How is the unique ID of new databases and tables determined?

2023-10-24 Thread Venu Reddy
Hi Eugene,

HMS depends on DataNucleus for identity value generation for the HMS tables.
The id is generated by DataNucleus when an object is made persistent, and the
DataNucleus value generator generates values uniquely across different JVMs.
As Zoltan said, DataNucleus tracks the id allocation for each model class in
the SEQUENCE_TABLE. We don't have id generation code directly in the metastore
code. Recently, to add dynamic partitions using direct SQL against the
database, a method getDataStoreId(Class modelClass) was added in
https://github.com/apache/hive/blob/master/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/DirectSqlInsertPart.java.
It fetches the next available id directly from DataNucleus and is used for
classes that use the datastore identity type.
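
If it helps, here is a minimal sketch for inspecting what DataNucleus will
allocate next (assuming a MySQL-backed metastore; SEQUENCE_TABLE,
SEQUENCE_NAME and NEXT_VAL are the names Zoltan describes below):

-- Read-only check of the next ids DataNucleus will hand out for
-- databases and tables, run against the metastore backend DB.
SELECT SEQUENCE_NAME, NEXT_VAL
FROM SEQUENCE_TABLE
WHERE SEQUENCE_NAME IN (
  'org.apache.hadoop.hive.metastore.model.MDatabase',
  'org.apache.hadoop.hive.metastore.model.MTable'
);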

I'm not sure how you plan to replicate into the target cluster. You might have
to explore DataNucleus value generation further if this is what you are
looking for.

Regards,
Venu

On Tue, Oct 24, 2023 at 6:00 PM Zoltán Rátkai  wrote:

> Hi Eugene,
>
> the TBL_ID in the TBLS table is handled by Datanucleus, so AUTO_INCREMENT
> won't help, since TBL_ID is not defined as AUTO_INCREMENT.
>
> Datanucleus uses SEQUENCE_TABLE to store the actual value for primary
> keys. In this table, these two rows are what you need to modify:
>
> org.apache.hadoop.hive.metastore.model.MDatabase
> org.apache.hadoop.hive.metastore.model.MTable
>
> e.g:
> update SEQUENCE_TABLE set NEXT_VAL = 1  where
> SEQUENCE_NAME='org.apache.hadoop.hive.metastore.model.MTable';
> and do the same for org.apache.hadoop.hive.metastore.model.MDatabase as well.
>
> After that, if you create a table, its TBL_ID will be taken from this value.
> Datanucleus uses caching (default 10), so the next few tables may still
> use the old values. Try creating 10 simple tables like this:
>
> create table test1 (i int);
> ...
> create table test10 (i int);
> and then drop them and check the TBL_ID.
>
> *Before doing this I recommend creating a backup of the Metastore DB!!*
>
> Also check this:
>
> https://community.cloudera.com/t5/Support-Questions/How-to-migrate-Hive-Table-From-one-cluster-to-another/m-p/235145
>
> Regards,
>
> Zoltan Ratkai
>
> On Sun, Oct 22, 2023 at 5:39 PM Eugene Miretsky  wrote:
>
>> Hey!
>>
>> Looking for a way to control the ids (DB_ID and TABLE_ID) of newly
>> created  databases and tables.
>>
>> We have a somewhat complicated use case where we replicate the metastore
>> (and data) from a source Hive cluster to a target cluster. However new
>> tables can be added on both source and target. We need a way to avoid
>> unique Id collision. One way would be to make sure all databases/tables
>> created in the target Hive start from a higher Id.
>>
>> We have tried to set AUTO_INCREMENT='1' on a metastore MySQL db, but
>> it doesn't work. This makes us think the Id is generated by the Metastore
>> code itself, but we cannot find the right place in the code, or whether it
>> is possible to control the logic.
>>
>> Any advice would be appreciated.
>>
>> Cheers,
>>
>


Re: Metastore: How is the unique ID of new databases and tables determined?

2023-10-24 Thread Zoltán Rátkai
Hi Eugene,

the TBL_ID in the TBLS table is handled by Datanucleus, so AUTO_INCREMENT won't
help, since TBL_ID is not defined as AUTO_INCREMENT.

Datanucleus uses SEQUENCE_TABLE to store the actual value for primary keys.
In this table, these two rows are what you need to modify:

org.apache.hadoop.hive.metastore.model.MDatabase
org.apache.hadoop.hive.metastore.model.MTable

e.g.:
update SEQUENCE_TABLE set NEXT_VAL = 1 where
SEQUENCE_NAME='org.apache.hadoop.hive.metastore.model.MTable';
and do the same for org.apache.hadoop.hive.metastore.model.MDatabase as well.

After that, if you create a table, its TBL_ID will be taken from this value.
Datanucleus uses caching (default 10), so the next few tables may still use
the old values. Try creating 10 simple tables like this:

create table test1 (i int);
...
create table test10 (i int);
and then drop them and check the TBL_ID.
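
For your concrete goal (having the target cluster allocate ids from a higher
range so they cannot collide with the source), a minimal sketch along these
lines should work; the offset 1000000 is only an illustrative value that you
would size to your own setup:

-- On the TARGET metastore DB only: reserve a high id range for new objects.
update SEQUENCE_TABLE set NEXT_VAL = 1000000
where SEQUENCE_NAME = 'org.apache.hadoop.hive.metastore.model.MTable';
update SEQUENCE_TABLE set NEXT_VAL = 1000000
where SEQUENCE_NAME = 'org.apache.hadoop.hive.metastore.model.MDatabase';

-- Verify, then create and drop a few throwaway tables so the cached id block
-- is consumed and new ids really start from the new range.
select SEQUENCE_NAME, NEXT_VAL from SEQUENCE_TABLE
where SEQUENCE_NAME like 'org.apache.hadoop.hive.metastore.model.M%';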

*Before doing this I recommend creating a backup of the Metastore DB!!*

Also check this:
https://community.cloudera.com/t5/Support-Questions/How-to-migrate-Hive-Table-From-one-cluster-to-another/m-p/235145

Regards,

Zoltan Ratkai

On Sun, Oct 22, 2023 at 5:39 PM Eugene Miretsky  wrote:

> Hey!
>
> Looking for a way to control the ids (DB_ID and TABLE_ID) of newly
> created  databases and tables.
>
> We have a somewhat complicated use case where we replicate the metastore
> (and data) from a source Hive cluster to a target cluster. However new
> tables can be added on both source and target. We need a way to avoid
> unique Id collision. One way would be to make sure all databases/tables
> created in the target Hive start from a higher Id.
>
> We have tried to set AUTO_INCREMENT='1' on a metastore MySQL db, but
> it doesn't work. This makes us think the Id is generated by the Metastore
> code itself, but we cannot find the right place in the code, or whether it
> is possible to control the logic.
>
> Any advice would be appreciated.
>
> Cheers,
>


Re: Announce: Hive-MR3 with Celeborn,

2023-10-24 Thread lisoda
Thanks. I will try.



 Replied Message 
| From | Sungwoo Park |
| Date | 10/24/2023 20:08 |
| To | user@hive.apache.org |
| Cc | |
| Subject | Announce: Hive-MR3 with Celeborn, |
Hi Hive users,


Before the impending release of MR3 1.8, we would like to announce the release 
of Hive-MR3 with Celeborn (Hive 3.1.3 on MR3 1.8 with Celeborn 0.3.1).

Apache Celeborn [1] is a remote shuffle service, similar to Magnet [2] and Apache 
Uniffle [3] (which was discussed on this Hive mailing list a while ago). 
Celeborn officially supports Spark and Flink, and we have implemented an 
MR3 extension for Celeborn.

In addition to all the benefits of using a remote shuffle service, 
Hive-MR3-Celeborn supports direct processing of mapper output on the reducer 
side, which means that reducers do not store mapper output on local disks (for 
unordered edges). In this way, Hive-MR3-Celeborn can eliminate over 95% of 
local disk writes when tested on the 10TB TPC-DS benchmark. This can be 
particularly useful when running Hive-MR3 on public clouds where fast local 
disk storage is expensive or not available.

We have documented the usage of Hive-MR3-Celeborn in [4]. You can download 
Hive-MR3-Celeborn in [5].

FYI, MR3 is an execution engine providing native support for Hadoop, 
Kubernetes, and standalone mode [6]. Hive-MR3, its main application, provides 
the performance of LLAP yet is very easy to install and operate. If you are 
using Hive-Tez for running ETL jobs, switching to Hive-MR3 will give you a much 
higher throughput thanks to its advanced resource sharing model.

We have recently opened a Slack channel. If interested, please join the Slack 
channel and ask any question on MR3:

https://join.slack.com/t/mr3-help/shared_invite/zt-1wpqztk35-AN8JRDznTkvxFIjtvhmiNg

Thank you,

--- Sungwoo

[1] https://celeborn.apache.org/
[2] https://www.vldb.org/pvldb/vol13/p3382-shen.pdf
[3] https://uniffle.apache.org/
[4] https://mr3docs.datamonad.com/docs/mr3/features/celeborn/
[5] https://github.com/mr3project/mr3-release/releases/tag/v1.8
[6] https://mr3docs.datamonad.com/


Announce: Hive-MR3 with Celeborn,

2023-10-24 Thread Sungwoo Park
Hi Hive users,

Before the impending release of MR3 1.8, we would like to announce the
release of Hive-MR3 with Celeborn (Hive 3.1.3 on MR3 1.8 with Celeborn
0.3.1).

Apache Celeborn [1] is a remote shuffle service, similar to Magnet [2] and
Apache Uniffle [3] (which was discussed on this Hive mailing list a while
ago). Celeborn officially supports Spark and Flink, and we have implemented
an MR3 extension for Celeborn.

In addition to all the benefits of using a remote shuffle service,
Hive-MR3-Celeborn supports direct processing of mapper output on the
reducer side, which means that reducers do not store mapper output on local
disks (for unordered edges). In this way, Hive-MR3-Celeborn can eliminate
over 95% of local disk writes when tested on the 10TB TPC-DS benchmark.
This can be particularly useful when running Hive-MR3 on public clouds
where fast local disk storage is expensive or not available.

We have documented the usage of Hive-MR3-Celeborn in [4]. You can download
Hive-MR3-Celeborn in [5].

FYI, MR3 is an execution engine providing native support for Hadoop,
Kubernetes, and standalone mode [6]. Hive-MR3, its main application,
provides the performance of LLAP yet is very easy to install and operate.
If you are using Hive-Tez for running ETL jobs, switching to Hive-MR3 will
give you a much higher throughput thanks to its advanced resource sharing
model.

We have recently opened a Slack channel. If interested, please join the
Slack channel and ask any question on MR3:

https://join.slack.com/t/mr3-help/shared_invite/zt-1wpqztk35-AN8JRDznTkvxFIjtvhmiNg

Thank you,

--- Sungwoo

[1] https://celeborn.apache.org/
[2] https://www.vldb.org/pvldb/vol13/p3382-shen.pdf
[3] https://uniffle.apache.org/
[4] https://mr3docs.datamonad.com/docs/mr3/features/celeborn/
[5] https://github.com/mr3project/mr3-release/releases/tag/v1.8
[6] https://mr3docs.datamonad.com/


Re: Re: Hive's performance for querying the Iceberg table is very poor.

2023-10-24 Thread Ayush Saxena
HIVE-27734 is in progress; as I see, we have a POC attached to the ticket, so
we should have it in 2-3 weeks, I believe.

> Also, after the release of 4.0.0, will we be able to do all TPCDS queries
on ICEBERG except for normal HIVE tables?

Yep, I believe most of the TPCDS queries would be supported even today on
Hive master, but 4.0.0 would have them running for sure.

-Ayush

On Tue, 24 Oct 2023 at 14:51, lisoda  wrote:

> Thanks.
> I would like to know if Hive currently supports pushdown to Iceberg table
> partitions under a JOIN condition.
> Because I see HIVE-27734 is not yet complete, what is its progress so
> far?
> Also, after the release of 4.0.0, will we be able to do all TPCDS queries
> on ICEBERG except for normal HIVE tables?
>
>
>
>
>
> On 2023-10-24 11:03:07, "Ayush Saxena" wrote:
>
> Hi Lisoda,
>
> The iceberg jar for Hive 3.1.3 doesn't have a lot of changes; we did a
> bunch of improvements on the 4.x line for Hive-Iceberg. You can give
> Iceberg a try on the 4.0.0-beta-1 release mentioned here [1]; we have a
> bunch of improvements like vectorization and stuff like that. If you wanna
> give it a quick try on docker, we have a docker image published for that here
> [2] & Iceberg works out of the box there.
>
> Rest, feel free to create tickets if you find specific queries or
> scenarios which are problematic; we will be happy to chase them & get them
> sorted.
>
> PS. Not sure about StarRocks, FWIW. That is something we don't develop as
> part of Apache Hive nor as part of the Apache Software Foundation, to the
> best of my knowledge, so I would refrain from commenting about that on the
> "Apache Hive" ML.
>
> -Ayush
>
>
> [1] https://hive.apache.org/general/downloads/
> [2] https://hub.docker.com/r/apache/hive/tags
>
> On Tue, 24 Oct 2023 at 05:28, Albert Wong 
> wrote:
>
>> Too bad.   Tencent Games used StarRocks with Apache Iceberg to power
>> their analytics.
>> https://medium.com/starrocks-engineering/tencent-games-inside-scoop-the-road-to-cloud-native-with-starrocks-d7dcb2438e25.
>>
>>
>> On Mon, Oct 23, 2023 at 10:55 AM lisoda  wrote:
>>
>>> We are not going to use StarRocks.
>>> MPP-architecture databases have natural limitations, and StarRocks does
>>> not necessarily perform better than Hive LLAP.
>>>
>>>
>>>  Replied Message 
>>> From Albert Wong 
>>> Date 10/24/2023 01:39
>>> To user@hive.apache.org
>>> Cc
>>> Subject Re: Hive's performance for querying the Iceberg table is very
>>> poor.
>>> I would try http://starrocks.io.   StarRocks is an MPP OLAP database
>>> that can query Apache Iceberg and we can cache the data for faster
>>> performance.  We also have additional features like building materialized
>>> views that span across Apache Iceberg, Apache Hudi and Apache Hive.   Here
>>> is a video of connecting the 2 products through a webinar StarRocks did
>>> with Tabular (authors of Apache Iceberg).
>>> https://www.youtube.com/watch?v=bAmcTrX7hCI=10s
>>>
>>> On Mon, Oct 23, 2023 at 7:18 AM lisoda  wrote:
>>>
 Hi Team.
   I was recently testing Hive queries on Iceberg tables, and I found that
 Hive's performance when querying Iceberg tables is very poor, almost
 impossible to use in a production environment. Also, join conditions cannot
 be pushed down to the Iceberg partitions.
   I'm using the 1.3.1 Hive runtime jar from the Iceberg community.
   Currently I'm using Hive 3.1.3 and Iceberg 1.3.1.
   Now I'm very frustrated because the performance is so bad that I
 can't deliver to my customers. How can I solve this problem?
  Details:
 https://apache-iceberg.slack.com/archives/C025PH0G1D4/p1695050248606629
 I would be grateful if someone could guide me.

>>>


Re:Re: Hive's performance for querying the Iceberg table is very poor.

2023-10-24 Thread lisoda
Thanks.
I would like to know if Hive currently supports pushdown to Iceberg table
partitions under a JOIN condition.
Because I see HIVE-27734 is not yet complete, what is its progress so far?
Also, after the release of 4.0.0, will we be able to do all TPCDS queries on 
ICEBERG except for normal HIVE tables?
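
For context, here is a minimal sketch of the query shape I mean (all table and
column names are made up for illustration):

-- Hypothetical query shape: fact_events is an Iceberg table partitioned by dt,
-- and the only filter on dt arrives through the join key from the small table
-- recent_days, so the question is whether Hive can prune fact_events
-- partitions instead of scanning all of them.
select f.user_id, f.payload
from fact_events f
join recent_days d on f.dt = d.dt;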

On 2023-10-24 11:03:07, "Ayush Saxena" wrote:

Hi Lisoda,


The iceberg jar for Hive 3.1.3 doesn't have a lot of changes; we did a bunch of
improvements on the 4.x line for Hive-Iceberg. You can give Iceberg a try on
the 4.0.0-beta-1 release mentioned here [1]; we have a bunch of improvements
like vectorization and stuff like that. If you wanna give it a quick try on
docker, we have a docker image published for that here [2] & Iceberg works out
of the box there.


Rest, feel free to create tickets if you find specific queries or
scenarios which are problematic; we will be happy to chase them & get them
sorted.


PS. Not sure about StarRocks, FWIW. That is something we don't develop as part
of Apache Hive nor as part of the Apache Software Foundation, to the best of my
knowledge, so I would refrain from commenting about that on the "Apache Hive" ML.


-Ayush




[1] https://hive.apache.org/general/downloads/
[2] https://hub.docker.com/r/apache/hive/tags


On Tue, 24 Oct 2023 at 05:28, Albert Wong  wrote:

Too bad.   Tencent Games used StarRocks with Apache Iceberg to power their 
analytics.   
https://medium.com/starrocks-engineering/tencent-games-inside-scoop-the-road-to-cloud-native-with-starrocks-d7dcb2438e25.
   


On Mon, Oct 23, 2023 at 10:55 AM lisoda  wrote:

We are not going to use StarRocks.
MPP-architecture databases have natural limitations, and StarRocks does not
necessarily perform better than Hive LLAP.



 Replied Message 
| From | Albert Wong |
| Date | 10/24/2023 01:39 |
| To | user@hive.apache.org |
| Cc | |
| Subject | Re: Hive's performance for querying the Iceberg table is very poor. 
|
I would try http://starrocks.io.   StarRocks is an MPP OLAP database that can 
query Apache Iceberg and we can cache the data for faster performance.  We also 
have additional features like building materialized views that span across 
Apache Iceberg, Apache Hudi and Apache Hive.   Here is a video of connecting 
the 2 products through a webinar StarRocks did with Tabular (authors of Apache 
Iceberg).  https://www.youtube.com/watch?v=bAmcTrX7hCI=10s


On Mon, Oct 23, 2023 at 7:18 AM lisoda  wrote:

Hi Team.
  I was recently testing Hive queries on Iceberg tables, and I found that Hive's
performance when querying Iceberg tables is very poor, almost impossible to use
in a production environment. Also, join conditions cannot be pushed down to the
Iceberg partitions.
  I'm using the 1.3.1 Hive runtime jar from the Iceberg community.
  Currently I'm using Hive 3.1.3 and Iceberg 1.3.1.
  Now I'm very frustrated because the performance is so bad that I can't
deliver to my customers. How can I solve this problem?
 Details:  
https://apache-iceberg.slack.com/archives/C025PH0G1D4/p1695050248606629
I would be grateful if someone could guide me.

submitting tasks failed in Spark standalone mode due to missing failureaccess jar file

2023-10-24 Thread eab...@163.com
Hi Team.
I use Spark 3.5.0 and start a Spark cluster with start-master.sh and
start-worker.sh. When I run ./bin/spark-shell --master
spark://LAPTOP-TC4A0SCV.:7077, I get these error logs:
```
23/10/24 12:00:46 ERROR TaskSchedulerImpl: Lost an executor 1 (already 
removed): Command exited with code 50
```
  The worker's finished-executor logs:
```
Spark Executor Command: 
"/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.372.b07-1.el7_9.x86_64/jre/bin/java" 
"-cp" 
"/root/spark-3.5.0-bin-hadoop3/conf/:/root/spark-3.5.0-bin-hadoop3/jars/*" 
"-Xmx1024M" "-Dspark.driver.port=43765" "-Djava.net.preferIPv6Addresses=false" 
"-XX:+IgnoreUnrecognizedVMOptions" 
"--add-opens=java.base/java.lang=ALL-UNNAMED" 
"--add-opens=java.base/java.lang.invoke=ALL-UNNAMED" 
"--add-opens=java.base/java.lang.reflect=ALL-UNNAMED" 
"--add-opens=java.base/java.io=ALL-UNNAMED" 
"--add-opens=java.base/java.net=ALL-UNNAMED" 
"--add-opens=java.base/java.nio=ALL-UNNAMED" 
"--add-opens=java.base/java.util=ALL-UNNAMED" 
"--add-opens=java.base/java.util.concurrent=ALL-UNNAMED" 
"--add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED" 
"--add-opens=java.base/sun.nio.ch=ALL-UNNAMED" 
"--add-opens=java.base/sun.nio.cs=ALL-UNNAMED" 
"--add-opens=java.base/sun.security.action=ALL-UNNAMED" 
"--add-opens=java.base/sun.util.calendar=ALL-UNNAMED" 
"--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED" 
"-Djdk.reflect.useDirectMethodHandle=false" 
"org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" 
"spark://CoarseGrainedScheduler@172.29.190.147:43765" "--executor-id" "0" 
"--hostname" "172.29.190.147" "--cores" "6" "--app-id" 
"app-20231024120037-0001" "--worker-url" "spark://Worker@172.29.190.147:34707" 
"--resourceProfileId" "0"

Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
23/10/24 12:00:39 INFO CoarseGrainedExecutorBackend: Started daemon with 
process name: 19535@LAPTOP-TC4A0SCV
23/10/24 12:00:39 INFO SignalUtils: Registering signal handler for TERM
23/10/24 12:00:39 INFO SignalUtils: Registering signal handler for HUP
23/10/24 12:00:39 INFO SignalUtils: Registering signal handler for INT
23/10/24 12:00:39 WARN Utils: Your hostname, LAPTOP-TC4A0SCV resolves to a 
loopback address: 127.0.1.1; using 172.29.190.147 instead (on interface eth0)
23/10/24 12:00:39 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another 
address
23/10/24 12:00:42 INFO CoarseGrainedExecutorBackend: Successfully registered 
with driver
23/10/24 12:00:42 INFO Executor: Starting executor ID 0 on host 172.29.190.147
23/10/24 12:00:42 INFO Executor: OS info Linux, 
5.15.123.1-microsoft-standard-WSL2, amd64
23/10/24 12:00:42 INFO Executor: Java version 1.8.0_372
23/10/24 12:00:42 INFO Utils: Successfully started service 
'org.apache.spark.network.netty.NettyBlockTransferService' on port 35227.
23/10/24 12:00:42 INFO NettyBlockTransferService: Server created on 
172.29.190.147:35227
23/10/24 12:00:42 INFO BlockManager: Using 
org.apache.spark.storage.RandomBlockReplicationPolicy for block replication 
policy
23/10/24 12:00:42 ERROR Inbox: An error happened while processing message in 
the inbox for Executor
java.lang.NoClassDefFoundError: 
org/sparkproject/guava/util/concurrent/internal/InternalFutureFailureAccess
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:756)
at 
java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:473)
at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:756)
at 
java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:473)
at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
at