[jira] [Created] (HIVE-21651) Move protobuf serde into hive-exec.

2019-04-25 Thread Harish Jaiprakash (JIRA)
Harish Jaiprakash created HIVE-21651:


 Summary: Move protobuf serde into hive-exec.
 Key: HIVE-21651
 URL: https://issues.apache.org/jira/browse/HIVE-21651
 Project: Hive
  Issue Type: Bug
  Components: HiveServer2
Reporter: Harish Jaiprakash
Assignee: Harish Jaiprakash


The serde and input format are not accessible without doing an add jar or 
modifying the Hive aux libs. Moving them into hive-exec will let us use the serde.

 

The serde can't be moved to hive/serde since it depends on ProtobufMessageWriter, 
which is in hive-exec.
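
For context, a sketch of the workaround described above, where the serde has to 
be brought into the session first; the jar path and the serde/input-format class 
names are placeholders, not the actual artifacts:

{code}
-- Workaround today: the jar must be added per session (or dropped into aux libs).
ADD JAR /tmp/hive-protobuf-serde.jar;

-- Placeholder serde and input-format class names for illustration only.
CREATE EXTERNAL TABLE proto_events (
  event_time BIGINT,
  payload    STRING
)
ROW FORMAT SERDE 'com.example.ProtobufMessageSerDe'
STORED AS
  INPUTFORMAT  'com.example.ProtobufMessageInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/data/proto_events';
{code}

With the classes bundled in hive-exec, the ADD JAR (or aux-libs change) step 
would no longer be needed.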



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: HIVE-21639 and Hive 2.3.5 release

2019-04-25 Thread Alan Gates
Yuming and Hyukjin, I've committed HIVE-21639 and HIVE-21536 to
branch-2.3.  Let me know once you've tested this against Spark 3 and are
ready to start the release process.

Alan.

On Thu, Apr 25, 2019 at 9:47 AM Alan Gates  wrote:

> Yuming Wang and Hyukjin Kwon have proposed releasing a Hive 2.3.5 that
> Spark can use.  They need to push a couple of back ports into branch-2.3
> first.  See [1] and [2] for details.
>
> I'm willing to work with them to get this done.
>
> Alan.
>
> 1.
> https://issues.apache.org/jira/browse/HIVE-21639?focusedCommentId=16822802&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16822802
> 2.
> https://issues.apache.org/jira/browse/HIVE-21639?focusedCommentId=16826113&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16826113
>
>


Re: A proposal for read-only external table for cloud-native Hive deployment

2019-04-25 Thread Gopal Vijayaraghavan
>reuse the transactional_properties and add 'read_only' as a new value. With
>read-only tables, all INSERT, UPDATE, DELETE statements will fail at Hive
>front-end. 

This is actually a common ask when it comes to OnPrem -> Cloud REPL streams, to 
keep the two copies from diverging.

The replicated data having its own updates is very problematic for CDC-style 
ACID replication into the cloud.
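
For readers following along, a minimal sketch of the REPL flow being referred to 
(database name and dump location are placeholders); the point is that the 
cloud-side copy has to stay read-only, or the next incremental load no longer 
applies cleanly:

-- On the on-prem source: dump the database (bootstrap or incremental).
REPL DUMP sales_db;

-- On the cloud target: replay the dump that was copied to cloud storage.
REPL LOAD sales_db FROM 's3a://repl-staging/sales_db/dump_1';

-- Any local INSERT/UPDATE/DELETE against the replicated tables on the target
-- makes the next incremental REPL LOAD diverge from the source.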

Ranger authorization works great for this, though it is all-or-nothing right 
now.

At some point in the future, I wish I could lock up specific fields from being 
updated in ACID.

Cheers,
Gopal




Re: A proposal for read-only external table for cloud-native Hive deployment

2019-04-25 Thread Alan Gates
My suggestion does require a change to your ETL process, but it doesn't
require you to copy the data into HDFS or to create storage clusters.  Hive
managed tables can reside in S3 with no problem.
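
For concreteness, a minimal sketch of that setup (bucket, database, and table 
names are hypothetical): in Hive 3 a database's LOCATION determines where its 
managed tables are created, so pointing it at S3 keeps managed ACID data on S3 
with no HDFS copy.

-- Database whose managed tables live under an S3 path (names are hypothetical).
CREATE DATABASE s3_warehouse LOCATION 's3a://my-bucket/warehouse/s3_warehouse.db';

-- A managed, full-ACID table in that database; its data resides on S3.
CREATE TABLE s3_warehouse.orders_acid (
  order_id BIGINT,
  amount   DECIMAL(10,2)
)
STORED AS ORC
TBLPROPERTIES ('transactional'='true');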

Alan.

Re: A proposal for read-only external table for cloud-native Hive deployment

2019-04-25 Thread Thai Bui
Your suggested workflow would work, but it would require us to re-ETL data
from S3 into multiple clusters all over the place. This is a cumbersome
approach since most of our data resides on S3 and clusters are somewhat
transient in nature (on the order of a few months before a redeployment, and
they don't have large HDFS capacity).

We do scale clusters up and down for compute but not for storage, since HDFS
is not easy to scale down on demand. It would be much preferable in this
architecture to have Hive behave as a pure compute engine that can be
accelerated through query result caching and materialized views.

I'm not familiar enough with the Hive 3 implementation to know whether this
feature would be simple to add. I was hoping to change only the front-end of
Hive and keep the ACID back-end implementation intact. For example, we could
reuse transactional_properties and add 'read_only' as a new value. With
read-only tables, all INSERT, UPDATE, and DELETE statements would fail at the
Hive front-end. That ensures the ACID properties are still guaranteed and the
rest of the ACID assumptions on the back-end can continue to hold. For DDL
operations, since they have to go through the metastore, I think they would
automatically work with the current ACID code base; the only thing we need to
do is enable them (where they were disabled) and test.
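
To make the idea concrete, a rough sketch of what the proposed front-end
behaviour could look like; the 'read_only' value is hypothetical (it does not
exist in Hive today) and the bucket path is a placeholder:

-- Hypothetical proposal syntax: a read-only external table.
CREATE EXTERNAL TABLE events (
  id      BIGINT,
  payload STRING
)
PARTITIONED BY (dt STRING)
STORED AS ORC
LOCATION 's3a://some-bucket/events'
TBLPROPERTIES ('transactional_properties'='read_only');

-- Partition-level DDL stays allowed, so caches and materialized views can be
-- invalidated when data changes underneath:
ALTER TABLE events ADD PARTITION (dt='2019-04-25');              -- would succeed

-- DML is rejected at the front-end:
INSERT INTO events PARTITION (dt='2019-04-25') VALUES (1, 'x');  -- would fail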

On Wed, Apr 24, 2019 at 6:05 PM Alan Gates  wrote:

> Would a workflow like the following work then:
> 1. Non-Hive tool produces data
> 2. Do a Hive load into a managed table.  This effectively takes a snapshot
> of the data.
> 3. Now you still have the data for Non-Hive tools to operate on, and in
> Hive you get all the Hive 3 goodness.
>
> This would introduce an additional copy of the data.  It would be
> interesting to look at adding a copy-on-write semantic to a partition to
> avoid this copy, but you don't need that to get going.
>
> I'm not opposed to what you're suggesting; I'm just wondering if there are
> other ways that will save you work and keep Hive simpler.
>
> Alan.
>
> On Wed, Apr 24, 2019 at 2:07 PM Thai Bui  wrote:
>
> > As I understand, read-only ACID tables only work if your table is a
> managed
> > table (so you'll have to create your table with CREATE TABLE
> > .. TBLPROPERTIES ('transactional_properties'='insert_only') ) and Hive
> will
> > control the data layout.
> >
> > Unfortunately, in my case, I'm concerned with external tables where data
> is
> > written by other tools such as Spark, PySpark, Sqoop or older Hive
> clusters
> > and Hadoop-based systems to cloud storage such as S3. My wish is to have
> > materialized views and query result caching work directly on those data
> if
> > and only if the table is registered as an external, read-only table in
> Hive
> > 3 via the same ACID mechanism.
> >
> > On Wed, Apr 24, 2019 at 3:35 PM Alan Gates  wrote:
> >
> > > Have you looked at the insert only ACID tables in Hive 3 (
> > > https://issues.apache.org/jira/browse/HIVE-14535 )?  These were
> designed
> > > specifically with the cloud in mind, since the way Hive traditionally
> > adds
> > > new data doesn't work well in the cloud.  And they do not require ORC,
> > they
> > > work with any file format.
> > >
> > > Alan.
> > >
> > > On Wed, Apr 24, 2019 at 12:04 PM Thai Bui  wrote:
> > >
> > > > Hello all,
> > > >
> > > > Hive 3 has brought significant changes to the community with the
> > support
> > > > for ACID tables as default managed tables. With ACID tables, we can
> use
> > > > features such as materialized views, query result caching for BI
> tools
> > > and
> > > > more. But without ACID tables such as external tables, Hive doesn't
> > > support
> > > > any of these advanced features which makes a majority of cloud-native
> > > users
> > > > like me sad :(.
> > > >
> > > > I propose we should support a more limited version of read-only
> > external
> > > > tables such that materialized views and query result caching would
> > work.
> > > > For example:
> > > >
> > > > CREATE EXTERNAL TABLE table_name (..) STORED AS ORC
> > > > LOCATION 's3://some-bucket/some-dir'
> > > > TBLPROPERTIES ('read-only'='true');
> > > >
> > > > In such tables, any data modification operations such as INSERT and
> > > UPDATE
> > > > would fail and DDL operations that "add" or "remove" partitions to
> the
> > > > table would succeed such as "ALTER TABLE ... ADD PARTITION". This
> would
> > > > make it possible for Hive to invalidate the cache and materialized
> > views
> > > > even when the table is an external table.
> > > >
> > > > Let me know what you guys think, and maybe I can start writing a wiki
> > > > document describing the approach in greater detail.
> > > >
> > > > Thanks,
> > > > Thai
> > > >
> > >
> >
> >
> > --
> > Thai
> >
>


-- 
Thai


HIVE-21639 and Hive 2.3.5 release

2019-04-25 Thread Alan Gates
Yuming Wang and Hyukjin Kwon have proposed releasing a Hive 2.3.5 that
Spark can use.  They need to push a couple of back ports into branch-2.3
first.  See [1] and [2] for details.

I'm willing to work with them to get this done.

Alan.

1.
https://issues.apache.org/jira/browse/HIVE-21639?focusedCommentId=16822802&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16822802
2.
https://issues.apache.org/jira/browse/HIVE-21639?focusedCommentId=16826113&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16826113


Re: Hive Pulsar Integration

2019-04-25 Thread Slim Bouguerra
Hey, sorry, your image is not showing. Not sure why.

On Wed, Apr 24, 2019 at 6:53 AM PengHui Li  wrote:

> Sorry for taking so long to reply.
>
> I drew a simple picture; hope it can help with the question.
> The main point is to reduce reading messages from unnecessary topics
> while reading data from a partitioned Hive table.
> [image: image.png]
>
> On Sat, Apr 20, 2019 at 12:16 AM, Slim Bouguerra wrote:
>
>> Hi, I am not sure I am getting the question 100%. Can you share a design doc
>> or outline the big picture you have in mind? FYI I am not very familiar with
>> Pulsar, so please account for that :D
>> But let me point out that Hive does not have the notion of partitions for
>> tables backed by storage handlers; that is because, by definition, the table
>> is not stored by Hive, so Hive cannot control the layout.
>>
>> I will be happy to look at any POC.
>> Looking forward to hearing from you.
>>
>> On Wed, Apr 17, 2019 at 7:25 PM PengHui Li 
>> wrote:
>>
>> > @Slim
>> >
>> > I want to use a different Pulsar topic to store the data for each Hive
>> > partition. Is there a way to do this, or does this idea make sense?
>> >
>> > Can you give me some advice?
>> >
>> >
>> > On Mon, Apr 15, 2019 at 6:22 PM, 李鹏辉gmail wrote:
>> >
>> > > I already have a simple implementation that can write data and query
>> > > data. I read the design document and the implementation of the Kafka
>> > > integration. Its notion of table partitions differs from what I have
>> > > in mind.
>> > >
>> > > I want Hive table partition locations to map to Pulsar topics: different
>> > > table partitions correspond to different topics. But I can't get the
>> > > partition that the data will be written to.
>> > >
>> > > I know that the drawback of doing this is that it loses the ordering of
>> > > the stream data itself, but it can reduce unnecessary data reading when
>> > > querying.
>> > >
>> > > Best Regards
>> > >
>> > > Penghui
>> > > Beijing,China
>> > >
>> > >
>> > >
>> > > > On Apr 13, 2019, at 21:43, Jörn Franke wrote:
>> > > >
>> > > > I think you need to develop a custom Hive SerDe + a custom Hadoop
>> > > > InputFormat + a custom Hive OutputFormat.
>> > > >
>> > > >> On Apr 12, 2019, at 17:35, 李鹏辉gmail wrote:
>> > > >>
>> > > >> Hi guys,
>> > > >>
>> > > >> I have been working on an integration of Hive and Pulsar recently,
>> > > >> but I have encountered some problems and hope to get help here.
>> > > >>
>> > > >> First of all, let me briefly describe the motivation.
>> > > >>
>> > > >> Pulsar can be used as an infinite stream that keeps both historic
>> > > >> data and streaming data, so we want to use Pulsar as a storage
>> > > >> extension for Hive. That way Hive can read the data in Pulsar
>> > > >> naturally, and can also write data into Pulsar. We would benefit from
>> > > >> the same data providing both interactive query and streaming
>> > > >> capabilities.
>> > > >>
>> > > >> As an improvement, supporting data partitioning can make queries more
>> > > >> efficient (e.g. partitioning by date or any other field).
>> > > >>
>> > > >> But:
>> > > >>
>> > > >> - How do we get the Hive table partition definition?
>> > > >> - When a user inserts data into a Hive table, how do we get the
>> > > >>   partition the data should be stored in?
>> > > >> - When a user selects data from a Hive table, how do we determine
>> > > >>   which partition the data is in?
>> > > >>
>> > > >> If Hive already exposes a mechanism to support this, please show me
>> > > >> how to use it.
>> > > >>
>> > > >> Best regards
>> > > >>
>> > > >> Penghui
>> > > >> Beijing, China
>> > > >>
>> > > >>
>> > > >>
>> > >
>> > >
>> >
>>
> --

B-Slim
___/\/\/\___/\/\/\___/\/\/\___/\/\/\___/\/\/\___
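
As an aside on Slim's point above about storage handlers: with the existing
Kafka handler the table is declared via STORED BY plus TBLPROPERTIES, and there
is no PARTITIONED BY clause, which is why a topic-per-partition mapping has
nowhere to hang today. A rough sketch for comparison; the topic and broker
values are placeholders, and the Pulsar handler class and property names are
purely hypothetical:

-- Existing Kafka storage handler, for comparison (values are placeholders).
CREATE EXTERNAL TABLE kafka_events (
  event_time BIGINT,
  payload    STRING
)
STORED BY 'org.apache.hadoop.hive.kafka.KafkaStorageHandler'
TBLPROPERTIES (
  'kafka.topic' = 'events',
  'kafka.bootstrap.servers' = 'broker-1:9092'
);

-- A Pulsar handler would likely look similar; the class and property names
-- below are hypothetical, and note there is still no PARTITIONED BY clause.
CREATE EXTERNAL TABLE pulsar_events (
  event_time BIGINT,
  payload    STRING
)
STORED BY 'com.example.pulsar.PulsarStorageHandler'
TBLPROPERTIES (
  'pulsar.topic' = 'persistent://tenant/ns/events',
  'pulsar.service.url' = 'pulsar://broker-1:6650'
);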


[jira] [Created] (HIVE-21650) QOutProcessor should provide configurable partial masks for qtests

2019-04-25 Thread Aditya Shah (JIRA)
Aditya Shah created HIVE-21650:
--

 Summary: QOutProcessor should provide configurable partial masks 
for qtests
 Key: HIVE-21650
 URL: https://issues.apache.org/jira/browse/HIVE-21650
 Project: Hive
  Issue Type: Improvement
  Components: Test, Testing Infrastructure
Reporter: Aditya Shah
Assignee: Aditya Shah
 Fix For: 4.0.0


QOutProcessor masks whole output lines in q.out files if it sees any of the 
target mask patterns. This restricts quite a few tests, for example verifying 
the directories being formed for an ACID table. Internal configuration through 
which we can provide additional partial masks to cover such cases would help us 
make our tests better.
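
To illustrate with a made-up example (the paths below are invented): today a
matching output line is replaced wholesale, e.g.

{code}
Raw output line:
  hdfs://localhost:9000/warehouse/acid_tbl/delta_0000001_0000001_0000
Line as masked in the q.out today:
#### A masked pattern was here ####
{code}

A configurable partial mask could keep the stable part of such a line (for
example the delta directory name) and mask only the variable host/path prefix.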



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HIVE-21649) Passing a non-existent jar in HIVE_AUX_JARS_PATH produces an incorrect error message

2019-04-25 Thread Todd Lipcon (JIRA)
Todd Lipcon created HIVE-21649:
--

 Summary: Passing a non-existent jar in HIVE_AUX_JARS_PATH 
produces an incorrect error message
 Key: HIVE-21649
 URL: https://issues.apache.org/jira/browse/HIVE-21649
 Project: Hive
  Issue Type: Bug
  Components: Tez
Reporter: Todd Lipcon


I had configured HS2 with HIVE_AUX_JARS_PATH pointing to a non-existent 
postgres jar. This resulted in queries failing with the following error:

{code}
 tez.DagUtils: Localizing resource because it does not exist: 
file:/data/1/todd/impala/fe/target/dependency/postgresql-42.2.5.jar to dest:
   
hdfs://localhost:20500/tmp/hive/todd/_tez_session_dir/9de357d5-59bf-4faa-8973-5212a08bc41a-resources/postgresql-42.2.5.jar
 tez.DagUtils: Looks like another thread or process is writing the same file
 tez.DagUtils: Waiting for the file 
hdfs://localhost:20500/tmp/hive/todd/_tez_session_dir/9de357d5-59bf-4faa-8973-5212a08bc41a-resources/postgresql-42.2.5.jar
 (5 attempts, with 5000ms interval)
 tez.DagUtils: Could not find the jar that was being uploaded
{code}

This incorrect logging sent me on a wild goose chase looking for concurrency 
issues, HDFS issues, etc., before I realized that the jar in fact didn't exist 
on the local FS. This is due to some sketchy code which presumes that any 
IOException is due to a write conflict.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)