[jira] [Created] (HIVE-21651) Move protobuf serde into hive-exec.
Harish Jaiprakash created HIVE-21651: Summary: Move protobuf serde into hive-exec. Key: HIVE-21651 URL: https://issues.apache.org/jira/browse/HIVE-21651 Project: Hive Issue Type: Bug Components: HiveServer2 Reporter: Harish Jaiprakash Assignee: Harish Jaiprakash The serde and input format are not accessible without doing an add jar or modifying the Hive aux libs. Moving them into hive-exec will let us use the serde. The serde can't move to hive/serde since it depends on ProtobufMessageWriter, which is in hive-exec. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
Re: HIVE-21639 and Hive 2.3.5 release
Yuming and Hyukjin, I've committed HIVE-21639 and HIVE-21536 to branch-2.3. Let me know once you've tested this against Spark 3 and are ready to start the release process. Alan. On Thu, Apr 25, 2019 at 9:47 AM Alan Gates wrote: > Yuming Wang and Hyukjin Kwon have proposed releasing a Hive 2.3.5 that > Spark can use. They need to push a couple of back ports into branch-2.3 > first. See [1] and [2] for details. > > I'm willing to work with them to get this done. > > Alan. > > 1. > https://issues.apache.org/jira/browse/HIVE-21639?focusedCommentId=16822802&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16822802 > 2. > https://issues.apache.org/jira/browse/HIVE-21639?focusedCommentId=16826113&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16826113 > >
Re: A proposal for read-only external table for cloud-native Hive deployment
>reuse the transactional_properties and add 'read_only' as a new value. With >read-only tables, all INSERT, UPDATE, DELETE statements will fail at Hive >front-end. This is actually a common ask when it comes to OnPrem -> Cloud REPL streams, to avoid diverging. The replicated data having its own updates is very problematic for CDC style ACID replication into the cloud. Ranger authorization works great for this, though it is all-or-nothing right now. At some point in the future, I wish I could lock up specific fields from being updated in ACID. Cheers, Gopal
Re: A proposal for read-only external table for cloud-native Hive deployment
My suggestion does require a change to your ETL process, but it doesn't require you to copy the data into HDFS or to create storage clusters. Hive managed tables can reside in S3 with no problem. Alan. On Thu, Apr 25, 2019 at 2:18 PM Thai Bui wrote: > Your suggested workflow will work and it would require us to re-ETL data > from S3 to all over the place to multiple clusters. This is a cumbersome > approach since most of our data reside on S3 and clusters are somewhat > transient in nature (in the order of a few months for a redeployment & > don't have large HDFS capacity). > > We do scale clusters up and down for compute but not for storage since HDFS > is not easy to be scaled down on demand. It would be much more preferable > in this architecture to have Hive behaves as a pure compute engine that can > be accelerated through query result caching and materialized views. > > I'm not that familiar with Hive 3 implementation to know if this feature > would be simple to make. I was hoping to change only the front-end of Hive > and keep the ACID back-end implementation intact. For example, we could > reuse the transactional_properties and add 'read_only' as a new value. With > read-only tables, all INSERT, UPDATE, DELETE statements will fail at Hive > front-end. Thus, it ensures that the ACID properties are guaranteed and the > rest of ACID assumptions on the backend could continue to work. For DDL > operations, since it has to go through the metastore I think it would > automatically work with the current ACID code base and the only thing we > need to do is to enable (where it was disabled) and test it. > > On Wed, Apr 24, 2019 at 6:05 PM Alan Gates wrote: > > > Would a workflow like the following work then: > > 1. Non-Hive tool produces data > > 2. Do a Hive load into a managed table. This effectively takes a > snapshot > > of the data. > > 3. Now you still have the data for Non-Hive tools to operate on, and in > > Hive you get all the Hive 3 goodness. 
> > > > This would introduce an additional copy of the data. It would be > > interesting to look at adding a copy on write semantic to a partition to > > avoid this copy, but you don't need that to get going. > > > > I'm not opposed to what you're suggesting, I'm just wondering if there > are > > other ways that will save you work and that will keep Hive more simple. > > > > Alan. > > > > On Wed, Apr 24, 2019 at 2:07 PM Thai Bui wrote: > > > > > As I understand, read-only ACID tables only work if your table is a > > managed > > > table (so you'll have to create your table with CREATE TABLE > > > .. TBLPROPERTIES ('transactional_properties'='insert_only') ) and Hive > > will > > > control the data layout. > > > > > > Unfortunately, in my case, I'm concerned with external tables where > data > > is > > > written by other tools such as Spark, PySpark, Sqoop or older Hive > > clusters > > > and Hadoop-based systems to cloud storage such as S3. My wish is to > have > > > materialized views and query result caching work directly on those data > > if > > > and only if the table is registered as an external, read-only table in > > Hive > > > 3 via the same ACID mechanism. > > > > > > On Wed, Apr 24, 2019 at 3:35 PM Alan Gates > wrote: > > > > > > > Have you looked at the insert only ACID tables in Hive 3 ( > > > > https://issues.apache.org/jira/browse/HIVE-14535 )? These were > > designed > > > > specifically with the cloud in mind, since the way Hive traditionally > > > adds > > > > new data doesn't work well in the cloud. And they do not require > ORC, > > > they > > > > work with any file format. > > > > > > > > Alan. > > > > > > > > On Wed, Apr 24, 2019 at 12:04 PM Thai Bui > wrote: > > > > > > > > > Hello all, > > > > > > > > > > Hive 3 has brought significant changes to the community with the > > > support > > > > > for ACID tables as default managed tables. 
With ACID tables, we can > > use > > > > features such as materialized views, query result caching for BI > > tools > > > > and > > > > more. But without ACID tables such as external tables, Hive doesn't > > > > support > > > > any of these advanced features, which makes a majority of > cloud-native > > > > users > > > > like me sad :(. > > > > > > > > > > I propose we should support a more limited version of read-only > > > external > > > > tables such that materialized views and query result caching would > > > work. > > > > For example: > > > > > > > > > > CREATE EXTERNAL TABLE table_name (..) STORED AS ORC > > > > > LOCATION 's3://some-bucket/some-dir' > > > > > TBLPROPERTIES ('read-only'='true'); > > > > > > > > > > In such tables, any data modification operations such as INSERT and > > > > UPDATE > > > > > would fail, and DDL operations that "add" or "remove" partitions to > > the > > > > > table, such as "ALTER TABLE ... ADD PARTITION", would succeed. This > > would > > > > > make it possible for Hive to invalidate the cache and materialized > > > views > > > > > even when the table is an external table.
Re: A proposal for read-only external table for cloud-native Hive deployment
Your suggested workflow will work, but it would require us to re-ETL data from S3 all over the place to multiple clusters. This is a cumbersome approach since most of our data resides on S3 and clusters are somewhat transient in nature (on the order of a few months between redeployments, and they don't have large HDFS capacity). We do scale clusters up and down for compute but not for storage, since HDFS is not easy to scale down on demand. It would be much more preferable in this architecture to have Hive behave as a pure compute engine that can be accelerated through query result caching and materialized views. I'm not familiar enough with Hive 3's implementation to know if this feature would be simple to make. I was hoping to change only the front-end of Hive and keep the ACID back-end implementation intact. For example, we could reuse transactional_properties and add 'read_only' as a new value. With read-only tables, all INSERT, UPDATE, DELETE statements will fail at the Hive front-end. Thus, it ensures that the ACID properties are guaranteed and the rest of the ACID assumptions on the backend could continue to work. For DDL operations, since they have to go through the metastore, I think they would automatically work with the current ACID code base, and the only thing we need to do is to enable (where it was disabled) and test it. On Wed, Apr 24, 2019 at 6:05 PM Alan Gates wrote: > Would a workflow like the following work then: > 1. Non-Hive tool produces data > 2. Do a Hive load into a managed table. This effectively takes a snapshot > of the data. > 3. Now you still have the data for Non-Hive tools to operate on, and in > Hive you get all the Hive 3 goodness. > > This would introduce an additional copy of the data. It would be > interesting to look at adding a copy on write semantic to a partition to > avoid this copy, but you don't need that to get going. 
> > I'm not opposed to what you're suggesting, I'm just wondering if there are > other ways that will save you work and that will keep Hive more simple. > > Alan. > > On Wed, Apr 24, 2019 at 2:07 PM Thai Bui wrote: > > > As I understand, read-only ACID tables only work if your table is a > managed > > table (so you'll have to create your table with CREATE TABLE > > .. TBLPROPERTIES ('transactional_properties'='insert_only') ) and Hive > will > > control the data layout. > > > > Unfortunately, in my case, I'm concerned with external tables where data > is > > written by other tools such as Spark, PySpark, Sqoop or older Hive > clusters > > and Hadoop-based systems to cloud storage such as S3. My wish is to have > > materialized views and query result caching work directly on those data > if > > and only if the table is registered as an external, read-only table in > Hive > > 3 via the same ACID mechanism. > > > > On Wed, Apr 24, 2019 at 3:35 PM Alan Gates wrote: > > > > > Have you looked at the insert only ACID tables in Hive 3 ( > > > https://issues.apache.org/jira/browse/HIVE-14535 )? These were > designed > > > specifically with the cloud in mind, since the way Hive traditionally > > adds > > > new data doesn't work well in the cloud. And they do not require ORC, > > they > > > work with any file format. > > > > > > Alan. > > > > > > On Wed, Apr 24, 2019 at 12:04 PM Thai Bui wrote: > > > > > > > Hello all, > > > > > > > > Hive 3 has brought significant changes to the community with the > > support > > > > for ACID tables as default managed tables. With ACID tables, we can > use > > > > features such as materialized views, query result caching for BI > tools > > > and > > > > more. But without ACID tables such as external tables, Hive doesn't > > > support > > > > any of these advanced features which makes a majority of cloud-native > > > users > > > > like me sad :(. 
> > > > > > > > I propose we should support a more limited version of read-only > > external > > > > tables such that materialized views and query result caching would > > work. > > > > For example: > > > > > > > > CREATE EXTERNAL TABLE table_name (..) STORED AS ORC > > > > LOCATION 's3://some-bucket/some-dir' > > > > TBLPROPERTIES ('read-only'='true'); > > > > > > > > In such tables, any data modification operations such as INSERT and > > > UPDATE > > > > would fail, and DDL operations that "add" or "remove" partitions to > the > > > > table, such as "ALTER TABLE ... ADD PARTITION", would succeed. This > would > > > > make it possible for Hive to invalidate the cache and materialized > > views > > > > even when the table is an external table. > > > > > > > > Let me know what you guys think, and maybe I can start writing a > wiki > > > > document describing the approach in greater detail. > > > > > > > > Thanks, > > > > Thai > > > > > > > > > > > > > -- > > Thai > > > -- Thai
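The front-end check described in this thread (reusing transactional_properties with a new 'read_only' value and rejecting writes before they reach the back-end) can be sketched as follows. This is an illustrative Python sketch only, not Hive's actual Java analyzer code; the function names and the error type are hypothetical:

```python
# Illustrative sketch of the proposed front-end guard (hypothetical names,
# not actual Hive code). Per the proposal, a table is read-only when
# 'transactional_properties' carries the new 'read_only' value.

WRITE_OPS = {"INSERT", "UPDATE", "DELETE"}

def is_read_only(tbl_properties):
    """True if the (proposed) read-only marker is set on the table."""
    return tbl_properties.get("transactional_properties") == "read_only"

def check_operation(tbl_properties, op):
    """Reject data-modification statements at the front-end; DDL such as
    ALTER TABLE ... ADD PARTITION still passes through to the metastore."""
    if op.upper() in WRITE_OPS and is_read_only(tbl_properties):
        raise PermissionError(f"{op} is not allowed on a read-only table")

props = {"transactional_properties": "read_only"}
check_operation(props, "ALTER")  # DDL: allowed, no error raised
try:
    check_operation(props, "INSERT")
except PermissionError as e:
    print(f"rejected: {e}")  # prints: rejected: INSERT is not allowed on a read-only table
```

Because the guard fires before any write path runs, the ACID assumptions on the back-end stay intact, which is the point of the proposal.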
HIVE-21639 and Hive 2.3.5 release
Yuming Wang and Hyukjin Kwon have proposed releasing a Hive 2.3.5 that Spark can use. They need to push a couple of back ports into branch-2.3 first. See [1] and [2] for details. I'm willing to work with them to get this done. Alan. 1. https://issues.apache.org/jira/browse/HIVE-21639?focusedCommentId=16822802&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16822802 2. https://issues.apache.org/jira/browse/HIVE-21639?focusedCommentId=16826113&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16826113
Re: Hive Pulsar Integration
Hey, sorry, your image is not showing. Not sure why. On Wed, Apr 24, 2019 at 6:53 AM PengHui Li wrote: > Sorry for taking so long to reply, > > I drew a simple picture; hope it can help with the question. > The main point is to reduce reading messages from unnecessary topics > while reading data from a partitioned Hive table. > [image: image.png] > > On Sat, Apr 20, 2019 at 12:16 AM, Slim Bouguerra wrote: > >> Hi, I'm not sure I'm getting the question 100%. Can you share a design doc or >> outline the big picture in your mind? FYI, I am not very familiar with Pulsar, >> so please account for that :D >> But let me point out that Hive does not have the notion of partitions for >> tables backed by storage handlers; that is because, by definition, the table >> is not stored by Hive, which therefore cannot control the layout. >> >> I will be happy to look at any POC. >> Looking forward to hearing from you. >> >> On Wed, Apr 17, 2019 at 7:25 PM PengHui Li >> wrote: >> >> > @Slim >> > >> > I want to use a different Pulsar topic to store data for each Hive >> > partition. Is there a way to do this, or does this idea make sense? >> > >> > Can you give me some advice? >> > >> > >> > On Mon, Apr 15, 2019 at 6:22 PM, 李鹏辉gmail wrote: >> > >> > > I already have a simple implementation that can write and query >> > data. >> > > I read the design document and implementation for Kafka. >> > > There are some differences between its table partitioning and what I have in mind. >> > > >> > > I want Hive table partition locations to work with Pulsar topics. >> Different >> > > table partitions correspond to different topics. >> > > But I can't get the partition where the data will be written. >> > > >> > > I know that the drawback of doing this is that it will lose the ordering >> of >> > > the stream data itself. >> > > But it can reduce unnecessary data reading when querying. 
>> > > >> > > Best Regards >> > > >> > > Penghui >> > > Beijing, China >> > > >> > > >> > > > On Apr 13, 2019, at 21:43, Jörn Franke wrote: >> > > > >> > > > I think you need to develop a custom Hive serde + custom >> > > Hadoop InputFormat + custom Hive OutputFormat >> > > > >> > > >> On Apr 12, 2019, at 17:35, 李鹏辉gmail wrote: >> > > >> >> > > >> Hi guys, >> > > >> >> > > >> I have been working on the integration of Hive and Pulsar recently, but I have >> > > encountered some problems and hope to get help here. >> > > >> >> > > >> First of all, let me briefly describe the motivation. >> > > >> >> > > >> Pulsar can be used as an infinite stream for keeping both historic data >> > > and streaming data, so we want to use Pulsar as a storage extension for >> > > Hive. >> > > >> In this way, Hive can read the data in Pulsar naturally, and can also >> > > write data into Pulsar. >> > > >> We will benefit from the same data providing both interactive query >> > > and streaming capabilities. >> > > >> >> > > >> As an improvement, supporting data partitioning can make queries more >> > > efficient (e.g. partitioning by date or any other field). >> > > >> >> > > >> But: >> > > >> >> > > >> - How do we get the Hive table partition definition? >> > > >> - When a user inserts data into a Hive table, how do we determine which partition the data >> > > should be stored in? >> > > >> - When a user selects data from a Hive table, how do we determine which data is in >> > > a given partition? >> > > >> >> > > >> If Hive already exposes some mechanism to support this, please show me how >> > to >> > > use it. >> > > >> >> > > >> Best regards >> > > >> >> > > >> Penghui >> > > >> Beijing, China >> > > >> >> > > >> >> > > >> >> > > >> > > >> > >> > -- B-Slim ___/\/\/\___/\/\/\___/\/\/\___/\/\/\___/\/\/\___
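PengHui's idea of mapping each Hive table partition to its own Pulsar topic (so a partition-pruned query reads only the matching topics) can be sketched as follows. This is an illustrative Python sketch; the topic-naming convention here is a hypothetical choice for the sketch, not part of either project's API:

```python
# Hypothetical convention: one Pulsar topic per Hive partition, named by
# encoding the partition spec into the topic name. A query that prunes
# partitions then reads only the corresponding topics.

def topic_for_partition(tenant, namespace, table, partition_spec):
    """Encode a Hive partition spec (e.g. {'dt': '2019-04-24'}) into a
    Pulsar topic name under the table's namespace."""
    # Sort keys so the same spec always yields the same topic name.
    suffix = "-".join(f"{k}={v}" for k, v in sorted(partition_spec.items()))
    return f"persistent://{tenant}/{namespace}/{table}-{suffix}"

def topics_to_read(tenant, namespace, table, partition_specs):
    """A partition-pruned query reads only the topics for its surviving
    partitions, avoiding the unnecessary topic reads PengHui mentions."""
    return [topic_for_partition(tenant, namespace, table, spec)
            for spec in partition_specs]

print(topic_for_partition("public", "default", "events", {"dt": "2019-04-24"}))
# prints: persistent://public/default/events-dt=2019-04-24
```

The trade-off PengHui notes still applies: splitting the stream across topics loses the ordering of the stream as a whole, in exchange for reading less data per query.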
[jira] [Created] (HIVE-21650) QOutProcessor should provide configurable partial masks for qtests
Aditya Shah created HIVE-21650: -- Summary: QOutProcessor should provide configurable partial masks for qtests Key: HIVE-21650 URL: https://issues.apache.org/jira/browse/HIVE-21650 Project: Hive Issue Type: Improvement Components: Test, Testing Infrastructure Reporter: Aditya Shah Assignee: Aditya Shah Fix For: 4.0.0 QOutProcessor masks whole lines of output in q.out files whenever it sees one of the target mask patterns. This prevents a range of tests, for example verifying the directories created for an ACID table. An internal configuration through which we can provide additional partial masks would let us cover such cases and make our tests better.
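The difference between the current full-line masking and the proposed partial masking can be illustrated like this. This is a Python sketch of the idea only; QOutProcessor itself is Java, and the pattern/replacement pairs below are hypothetical examples, not actual Hive configuration:

```python
import re

# Full-line masking (the current behavior described in the issue): if a
# line matches any target pattern, the entire line is replaced.
def mask_full(line, patterns):
    if any(re.search(p, line) for p in patterns):
        return "#### A masked pattern was here ####"
    return line

# Proposed partial masking: replace only the matched portion, keeping the
# rest of the line testable (e.g. the delta directory of an ACID table).
def mask_partial(line, rules):
    for pattern, replacement in rules:
        line = re.sub(pattern, replacement, line)
    return line

line = "hdfs://localhost:20500/warehouse/t/delta_0000001_0000001_0000"
print(mask_full(line, [r"hdfs://\S+"]))
# prints: #### A masked pattern was here ####
print(mask_partial(line, [(r"hdfs://[^/]+", "hdfs://### masked host ###")]))
# prints: hdfs://### masked host ###/warehouse/t/delta_0000001_0000001_0000
```

With the partial mask, a qtest can still assert on the ACID directory layout while the unstable host and port are masked out.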
[jira] [Created] (HIVE-21649) Passing a non-existent jar in HIVE_AUX_JARS_PATH produces an incorrect error message
Todd Lipcon created HIVE-21649: -- Summary: Passing a non-existent jar in HIVE_AUX_JARS_PATH produces an incorrect error message Key: HIVE-21649 URL: https://issues.apache.org/jira/browse/HIVE-21649 Project: Hive Issue Type: Bug Components: Tez Reporter: Todd Lipcon I had configured HS2 with HIVE_AUX_JARS_PATH pointing to a non-existent postgres jar. This resulted in queries failing with the following error: {code} tez.DagUtils: Localizing resource because it does not exist: file:/data/1/todd/impala/fe/target/dependency/postgresql-42.2.5.jar to dest: hdfs://localhost:20500/tmp/hive/todd/_tez_session_dir/9de357d5-59bf-4faa-8973-5212a08bc41a-resources/postgresql-42.2.5.jar tez.DagUtils: Looks like another thread or process is writing the same file tez.DagUtils: Waiting for the file hdfs://localhost:20500/tmp/hive/todd/_tez_session_dir/9de357d5-59bf-4faa-8973-5212a08bc41a-resources/postgresql-42.2.5.jar (5 attempts, with 5000ms interval) tez.DagUtils: Could not find the jar that was being uploaded {code} This incorrect logging sent me on a wild goose chase looking for concurrency issues, HDFS issues, etc., before I realized that the jar in fact didn't exist on the local FS. This is due to some sketchy code that presumes any IOException is due to a writing conflict.
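The fix the report implies (verify that the source jar exists before assuming a concurrent writer and entering the wait-and-retry loop) can be sketched as follows. This is a Python illustration of the logic only, not the actual DagUtils Java code, and the function name is hypothetical:

```python
import os

def localize_resource(src_path, upload):
    """Illustrative logic only (hypothetical sketch, not Hive code).
    Before blaming another thread or process for writing the destination,
    check that the source actually exists; a missing aux jar should fail
    fast with an accurate message instead of a 5-attempt retry loop."""
    if not os.path.exists(src_path):
        # The behavior reported in HIVE-21649 lumped this case in with
        # "another thread or process is writing the same file".
        raise FileNotFoundError(
            f"Resource {src_path} does not exist on the local filesystem; "
            "check HIVE_AUX_JARS_PATH")
    try:
        upload(src_path)
    except IOError as e:
        # Only once the source is known to exist is a concurrent-writer
        # retry a plausible explanation for an upload failure.
        raise IOError(f"Upload of {src_path} failed; possible concurrent "
                      f"writer: {e}") from e

try:
    localize_resource("/no/such/dir/postgresql-42.2.5.jar", lambda p: None)
except FileNotFoundError as e:
    print(f"fast failure: {e}")
```

Separating the two failure modes this way turns the misleading "looks like another thread is writing the same file" message into an immediate, accurate one.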