Updates/deletes with OrcRecordUpdater

2015-03-20 Thread Elliot West
Hi, I'm trying to use the insert, update and delete methods on OrcRecordUpdater to programmatically mutate an ORC based Hive table (1.0.0). I've got inserts working correctly but I'm running into a problem with deletes and updates. I get an NPE which I have traced back to what seems like a missing

Adding update/delete to the hive-hcatalog-streaming API

2015-03-26 Thread Elliot West
Hi, I'd like to ascertain if it might be possible to add 'update' and 'delete' operations to the hive-hcatalog-streaming API. I've been looking at the API with interest for the last week as it appears to have the potential to help with some general data processing patterns that are prevalent where

Re: Adding update/delete to the hive-hcatalog-streaming API

2015-03-26 Thread Elliot West
g > merge like functionality, where you would upload all changes to a temp > table and then in one scan/transaction apply those changes. This is a > common way to handle these situations for data warehouses, and is much > easier than adding a primary key concept to Hive. > > Alan.

Re: Adding update/delete to the hive-hcatalog-streaming API

2015-03-26 Thread Elliot West
bulk (along with > operation markers to indicate insert/update/delete) and write those in a > delta file in one pass. > > Alan. > > Elliot West > March 26, 2015 at 15:10 > Hi, thanks for your quick reply. > > I see your point, but in my case would I not have the requir

Re: Adding update/delete to the hive-hcatalog-streaming API

2015-03-26 Thread Elliot West
Hi Mich, Yes, we have a timestamp on each record. Our processes effectively group by a key and order by time stamp. Cheers - Elliot.
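The grouping described above can be illustrated with a minimal sketch. It assumes each change record carries a key and a timestamp; the field names `key` and `ts` and the sample records are hypothetical and do not come from the thread.

```python
from collections import defaultdict
from operator import itemgetter


def group_and_order(records):
    """Group records by key, then order each group by timestamp (oldest first)."""
    groups = defaultdict(list)
    for record in records:
        groups[record["key"]].append(record)
    # sorted() is stable, so records with equal timestamps keep their input order.
    return {key: sorted(group, key=itemgetter("ts")) for key, group in groups.items()}


# Hypothetical change records, deliberately out of order.
changes = [
    {"key": "a", "ts": 3, "value": "second"},
    {"key": "b", "ts": 2, "value": "only"},
    {"key": "a", "ts": 1, "value": "first"},
]
ordered = group_and_order(changes)
```

With the records ordered per key, a downstream step can replay each key's changes in time order.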

Interpretation of transactional table base file format

2015-03-30 Thread Elliot West
I've been looking at the structure of the ORCFiles that back transaction tables in Hive. After a compaction I was surprised to find that the base file structure is identical to the delta structure: struct< operation:int, originalTransaction:bigint, bucket:int, rowId:bigint, c

Re: Interpretation of transactional table base file format

2015-03-30 Thread Elliot West
Ok, so both the source and Javadoc for org.apache.hadoop.hive.ql.io.orc.OrcInputFormat answer most of these questions. Apologies for the spam. Thanks - Elliot. On 30 March 2015 at 11:52, Elliot West wrote: > I've been looking at the structure of the ORCFiles that back transaction >

Re: Adding update/delete to the hive-hcatalog-streaming API

2015-04-01 Thread Elliot West
Hi Alan, Regarding the streaming changes, I've raised an issue and submitted patches here: https://issues.apache.org/jira/browse/HIVE-10165 Thanks - Elliot. On 26 March 2015 at 23:20, Alan Gates wrote: > > > Elliot West > March 26, 2015 at 15:58 > Hi Alan, > >

Transactional table read lifecycle

2015-04-17 Thread Elliot West
Hi, I'm working on a Cascading Tap that reads the data that backs a transactional Hive table. I've successfully utilised the in-built OrcInputFormat functionality to read and merge the deltas with the base and optionally pull in the RecordIdentifiers. However, I'm now considering what other steps I

ACID ORC file reader issue with uncompacted data

2015-04-29 Thread Elliot West
Hi, I'm implementing a tap to read Hive ORC ACID data into Cascading jobs and I've hit a couple of issues for a particular scenario. The case I have is when data has been written into a transactional table and a compaction has not yet occurred. This can be recreated like so: CREATE TABLE test_tab

Re: ACID ORC file reader issue with uncompacted data

2015-04-29 Thread Elliot West
ore 1st compaction is definitely a valid use case. > > From: Elliot West > Reply-To: "user@hive.apache.org" > Date: Wednesday, April 29, 2015 at 9:40 AM > To: "user@hive.apache.org" > Subject: ACID ORC file reader issue with uncompacted data > >

Re: ACID ORC file reader issue with uncompacted data

2015-05-01 Thread Elliot West
ent=Asia/country=India Partition keys derived as: 'continent=Asia' (INCORRECT) Cheers - Elliot. On 30 April 2015 at 17:40, Alan Gates wrote: > Are you using OrcInputFormat.getReader to get a reader? If so, it should > take care of these anomalies for you and mask your need to worr
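To illustrate what a correct derivation would yield for a nested partition path, here is a minimal sketch. The function name `partition_keys` and the example path are assumptions for illustration; this is not how OrcInputFormat or Cascading derive keys, and it ignores any escaping of special characters in partition values.

```python
def partition_keys(relative_path):
    """Split a Hive-style partition path into ordered (name, value) pairs."""
    pairs = []
    for segment in relative_path.strip("/").split("/"):
        name, separator, value = segment.partition("=")
        if not separator:
            continue  # not a key=value segment (for example a file or delta folder name)
        pairs.append((name, value))
    return pairs


# Assumed example path with two partition levels.
keys = partition_keys("continent=Asia/country=India")
```

Every `key=value` segment contributes a pair, so a two-level path should yield both the continent and the country rather than the first segment only.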

Re: ACID ORC file reader issue with uncompacted data

2015-05-18 Thread Elliot West
eturns > you the partition keys. That way you're not left putting ORC specific code > in Cascading. > > Alan. > > Elliot West > May 1, 2015 at 3:04 > Yes and no :-) We're initially using OrcFile.createReader to create a > Reader so that we can obt

Re: delta file compact take no effect

2015-06-11 Thread Elliot West
What do you see if you issue: SHOW COMPACTIONS; On Thursday, 11 June 2015, r7raul1...@163.com wrote: > > I use hive 1.1.0 on hadoop 2.5.0 > After I do some update operation on table u_data_txn. > My table create many delta file like: > drwxr-xr-x - hdfs hive 0 2015-02-06 22:52 > /user/hive/ware

Re: Hive Concurrency support

2015-08-21 Thread Elliot West
I presume you mean "into different partitions of a table at the same time"? This should be possible. It is certainly supported by the streaming API, which is probably where you want to look if you need to insert large volumes of data to multiple partitions concurrently. I can't see why it would not

Hive Concurrency support

2015-08-23 Thread Elliot West
operation. > > Please correct me if my understanding about the same is wrong. > (I am using hql inserts only for these operations) > > Thanks, > Suyog > On Aug 21, 2015 7:28 PM, "Elliot West" wrote: > >> I presume you mean "into different partitions of

Re: Hive Concurrency support

2015-08-23 Thread Elliot West
,it shows the shared > lock.Which does not allow the writes on the table. > > So I wanted to understand that , > > Does hive execute these two insert operations sequentially or it executes > it in parallel . > > Thanks, > Suyog > On Aug 23, 2015 4:23 PM, "El

Hive Macros roadmap

2015-09-11 Thread Elliot West
Hi, I noticed some time ago the Hive Macro feature. To me at least this seemed like an excellent addition to HQL, allowing the user to encapsulate complex column logic as an independent HQL, reusable macro while avoiding the complexities of Java UDFs. However, few people seem to be aware of them o

Re: Organising Hive Scripts

2015-09-14 Thread Elliot West
Hi Charles, You can also split out column level logic using Hive macros. These also allow re-use of said logic: hive> create temporary macro MYSIGMOID(x DOUBLE) > 2.0 / (1.0 + exp(-x)); OK hive> select MYSIGMOID(1.0) from dual; OK 1.4621171572600098 Cheers - Elliot. On 11 September 2015
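As a cross-check of the macro output quoted above, the same expression can be evaluated outside Hive. This is only a sketch of the arithmetic in the macro body; the Python function name is hypothetical.

```python
import math


def my_sigmoid(x):
    # Mirrors the macro body: 2.0 / (1.0 + exp(-x))
    return 2.0 / (1.0 + math.exp(-x))


result = my_sigmoid(1.0)
```

The result should agree with the value printed by Hive to within floating-point rounding.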

Decomposing nested Hive statements with views

2015-09-14 Thread Elliot West
Hello, We have many HQL scripts that select from nested sub-selects. In many cases the nesting can be a few levels deep: SELECT ... FROM ( SELECT ... FROM ( SELECT ... FROM ( SELECT ... FROM a WHERE ... ) A LEFT JOIN ( SELECT ... FROM b ) B ON (

Re: Decomposing nested Hive statements with views

2015-09-15 Thread Elliot West
On 15 September 2015 at 00:09, Gopal Vijayaraghavan wrote: > CTE Thank you for the in depth reply Gopal. I've just had a quick try out of CTEs but can't see how they address my original problem of decomposing a query into separate independent units. It seems that the CTE definition (' with' cla

Re: Better way to do UDF's for Hive

2015-10-01 Thread Elliot West
Perhaps a macro? CREATE TEMPORARY MACRO state_from_city (city string) /* HQL column logic */ ...; On 1 October 2015 at 14:11, Daniel Lopes wrote: > Hi, > > I'd like to know the good way to do a UDF for a single field, like > > SELECT > tbl.id AS id, > tbl.name AS name, > tbl.city A

Strange HiveConf behaviour

2015-10-08 Thread Elliot West
I've been writing some unit tests around some extensions to Hive recently and came across an issue that left me perplexed. If I execute: System.err.println(new HiveConf().get(FileSystem.FS_DEFAULT_NAME_KEY, FileSystem.DEFAULT_FS)); this prints: core-site.xml I suspect I need to provide a bett

Access to wiki (documenting locking requirements).

2015-10-22 Thread Elliot West
Hi, May I have access to edit the wiki? My confluence user name is 'teabot'. I've been looking briefly at ALTER TABLE CONCATENATE and noticed that the operation isn't listed on the Hive/Locking wiki page even though it acquires an exclusi

Locking when using the Metastore/HCatalog APIs.

2015-10-22 Thread Elliot West
I notice from the Hive locking wiki page that locks may be acquired for a range of HQL DDL operations. I wanted to know how the locking scheme is mapped/employed by equivalent operations in the Metastore and HCatalog APIs. Consider the

Re: the column names removed after insert select

2015-10-23 Thread Elliot West
I was seeing something similar in the initial ORC delta file when inserting rows into a newly created ACID table. Subsequent deltas had the correct column names. On 23 October 2015 at 08:25, patcharee wrote: > Hi > > I inserted a table from select (insert into table newtable select date, > hh,

Re: the column names removed after insert select

2015-10-23 Thread Elliot West
Excellent news. Thanks. On 23 October 2015 at 15:50, Prasanth Jayachandran < pjayachand...@hortonworks.com> wrote: > Hi > > This has been fixed recently > https://issues.apache.org/jira/browse/HIVE-4243 > > This used to be a problem with the way hive writes rows out. The > ObjectInspectors sent o

Re: Locking when using the Metastore/HCatalog APIs.

2015-10-28 Thread Elliot West
s like a simpler user model. > > From: Alan Gates > > Reply-To: "user@hive.apache.org > " < > user@hive.apache.org > > > Date: Tuesday, October 27, 2015 at 11:34 AM > To: "user@hive.apache.org > " < > user@hive.apache.org > > >

Re: Locking when using the Metastore/HCatalog APIs.

2015-10-28 Thread Elliot West
Captured in HIVE-12285. Thanks - Elliot. On 28 October 2015 at 08:54, Elliot West wrote: > Perhaps, I'd expect one might wish to make multiple API calls when holding > a lock, but a suitably implemented client may be able to manage this > seamlessly. Then again a user may als

Re: query orc file by hive

2015-11-09 Thread Elliot West
Hi, You can create a table and point the location property to the folder containing your ORC file: CREATE EXTERNAL TABLE orc_table ( ) STORED AS ORC LOCATION '/hdfs/folder/containing/orc/file' ; https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTable

Re: hive transaction strange behaviour

2015-11-13 Thread Elliot West
It is the compaction process that creates the base files. Check your configuration to ensure that compaction should be running. I believe the compactor should run periodically. You can also request a compaction using the appropriate ALTER TABLE HQL DDL command. Elliot. On Friday, 13 November 2015

Re: Bulk load in Hive transactions backed table

2015-11-18 Thread Elliot West
Are you loading new data (inserts) or mutating existing data (update/delete) or both? And by 'transactions' are you referring to Hive ACID transactional tables? If so: For new data, I think you should be able to use: INSERT INTO transactional_table ... FROM table_over_file_to_be_loaded For upda

Re: Hotels.com

2015-11-30 Thread Elliot West
This looks like a phishing attempt. Please do not open it. On 30 November 2015 at 10:32, @Sanjiv Singh wrote: > > > Regards > Sanjiv Singh > Mob : +091 9990-447-339 > > On Mon, Nov 30, 2015 at 1:16 PM, Amrit Jangid > wrote: > >> ?? >> >> On Mon, Nov 30, 2015 at 11:56 AM, Roshini Johri >>

Beeline user default configuration

2015-12-07 Thread Elliot West
Hi, Does ~/.beeline/beeline.properties serve the same purpose as ~/.hiverc? I'd like to be able to apply a user specific configuration when the beeline CLI is started. Additionally, is it possible to provide a default connect string in either a configuration file or environment variable? Thanks -

Synchronizing Hive metastores across clusters

2015-12-17 Thread Elliot West
Hello, I'm thinking about the steps required to repeatedly push Hive datasets out from a traditional Hadoop cluster into a parallel cloud based cluster. This is not a one off, it needs to be a constantly running sync process. As new tables and partitions are added in one cluster, they need to be s

Re: Synchronizing Hive metastores across clusters

2015-12-17 Thread Elliot West
server. > > > > HTH, > > > > Mich > > > > *From:* Mich Talebzadeh [mailto:m...@peridale.co.uk] > *Sent:* 17 December 2015 16:47 > *To:* user@hive.apache.org > *Subject:* RE: Synchronizing Hive metastores across clusters > > > > Are both clusters

Re: Synchronizing Hive metastores across clusters

2015-12-17 Thread Elliot West
> > *From:* Mich Talebzadeh [mailto:m...@peridale.co.uk] > *Sent:* 17 December 2015 16:47 > *To:* user@hive.apache.org > *Subject:* RE: Synchronizing Hive metastores across clusters > > > > Are both clusters in active/active mode or the cloud based cluster is > standby?

Re: Synchronizing Hive metastores across clusters

2015-12-18 Thread Elliot West
ation) > > > > > > Thanks, > > > > -Sushanth > > > > On Thu, Dec 17, 2015 at 11:22 AM, Eugene Koifman > > wrote: > >> Metastore supports MetaStoreEventListener and MetaStorePreEventListener > >> which may be useful here > >> > >>

Re: Attempt to do update or delete using transaction manager that does not support these operations. (state=42000,code=10294)

2015-12-22 Thread Elliot West
Hi, The input/output formats do not appear to be ORC, have you tried 'stored as orc'? Additionally you'll need to set the property 'transactional=true' on the table. Do you have the original create table statement? Cheers - Elliot. On Tuesday, 22 December 2015, Mich Talebzadeh wrote: > Hi, > >

Re: Attempt to do update or delete using transaction manager that does not support these operations. (state=42000,code=10294)

2015-12-22 Thread Elliot West
| > > | 'transactional'='true', | > > | 'transient_lastDdlTime'='1449831076') | > > +-+--+ > > 49 rows selected

Hive ExIm from on-premise HDP to Amazon EMR

2016-01-07 Thread Elliot West
Hello, Following on from my earlier post concerning syncing Hive data from an on premise cluster to the cloud, I've been experimenting with the IMPORT/EXPORT functionality to move data from an on-premise HDP cluster to Amazon EMR. I started out with some simple Exports/Imports as these can be the

Re: Hive ExIm from on-premise HDP to Amazon EMR

2016-01-07 Thread Elliot West
. On 7 January 2016 at 16:53, Elliot West wrote: > More information: This works if I move the export into EMR's HDFS and then > import from there to a new location in HDFS. It does not work across > FileSystems: > >- Import from S3 → EMR HDFS (fails in a similar m

Re: Writing hive column headers in 'Insert overwrite query'

2016-01-13 Thread Elliot West
Unfortunately there appears to be no nice way of doing this. I've seen others achieve a work around by UNIONing with a table of the same schema, containing a single row of the header names, and then finally sorting by a synthesised rank column (see: http://stackoverflow.com/a/25214480/74772). I be
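The workaround described above (union a header row with the data, then sort on a synthesised rank column) can be sketched outside Hive as follows. The function name, column names and rows are hypothetical; in HQL the same idea would be expressed with UNION ALL and a sort on the rank column.

```python
def with_header(column_names, rows):
    """Tag the header row with rank 0 and data rows with rank 1, then sort by rank."""
    header = [(0, tuple(column_names))]
    data = [(1, tuple(row)) for row in rows]
    combined = data + header  # a union gives no ordering guarantee
    combined.sort(key=lambda ranked: ranked[0])  # stable sort: header first, data order kept
    return [row for _rank, row in combined]


output = with_header(["id", "name"], [(1, "a"), (2, "b")])
```

The rank column exists only to force the header to the top and is dropped from the final output.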

Re: Writing hive column headers in 'Insert overwrite query'

2016-01-13 Thread Elliot West
I created an issue in the Hive Jira related to this. You may wish to vote on it or watch it if you believe it to be relevant. https://issues.apache.org/jira/browse/HIVE-12860 On 13 January 2016 at 09:43, Elliot West wrote: > Unfortunately there appears to be no nice way of doing this. I

Re: Synchronizing Hive metastores across clusters

2016-01-21 Thread Elliot West
On 18 December 2015 at 14:31, Elliot West wrote: > Eugene/Susanth, > > Thank you for pointing me in the direction of these features. I'll > investigate them further to see if I can put them to good use. > > Cheers - Elliot. > > On 17 December 2015 at 20:03, Sushant

Re: Using s3 as warehouse on emr

2016-01-22 Thread Elliot West
Related to this, might it be better to use the s3a protocol instead of s3n? https://wiki.apache.org/hadoop/AmazonS3 Additionally, can anyone advise when EMRFS is required when storing Hive tables in S3? http://docs.aws.amazon.com/ElasticMapReduce/latest/ManagementGuide/emr-overview-arch.html#emr-a

Re: Hive ExIm from on-premise HDP to Amazon EMR

2016-01-25 Thread Elliot West
might I still encounter the same bug I am seeing now? Thanks for your response. On Sunday, 24 January 2016, Artem Ervits wrote: > Have you looked at Apache Falcon? > On Jan 8, 2016 2:41 AM, "Elliot West" > wrote: > >> Further investigation appears to show this goin

Re: Hive table over S3 bucket with s3a

2016-02-02 Thread Elliot West
When I last looked at this it was recommended to simply regenerate the key as you suggest. On 2 February 2016 at 15:52, Terry Siu wrote: > Hi, > > I’m wondering if anyone has found a workaround for defining a Hive table > over a S3 bucket when the secret access key has ‘/‘ characters in it. I’m

Re: Hive on Spark Engine versus Spark using Hive metastore

2016-02-04 Thread Elliot West
Related to this and for the benefit of anyone who is using Hive: The issues around testing and some possible approaches are summarised here: https://cwiki.apache.org/confluence/display/Hive/Unit+testing+HQL Ultimately there are no elegant solutions to the limitations correctly described by Koert

Re: Is it ok to build an entire ETL/ELT data flow using HIVE queries?

2016-02-16 Thread Elliot West
I'd say that so long as you can achieve a similar quality of engineering as is possible with other software development domains, then 'yes, it is ok'. Specifically, our Hive projects are packaged as RPMs, built and released with Maven, covered by suites of unit tests developed with HiveRunner, and

Re: How to find hive version using hive editor in hue ?

2016-02-18 Thread Elliot West
See set command usage here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Cli#LanguageManualCli-HiveInteractiveShellCommands On Thursday, 18 February 2016, Abhishek Dubey wrote: > Thanks Bennie, It worked ! > > Would you like to tell us what else this command means, like what e

Retrying metastore clients

2016-05-16 Thread Elliot West
Hello, We have a fair amount of code that uses IMetaStoreClient implementation to talk over a network to Hive metastore instances. I was under the impression that the recommended way to create said clients was to use: org.apache.hive.hcatalog.common.HCatUtil.getHiveClient(HiveConf) However, I'm

Re: Copying all Hive tables from Prod to UAT

2016-05-25 Thread Elliot West
Hello, I've been looking at this recently for moving Hive tables from on-premise clusters to the cloud, but the principle should be the same for your use-case. If you wish to do this in an automated way, some tools worth considering are: - Hive's built in replication framework: https://cwik

Re: Spark support for update/delete operations on Hive ORC transactional tables

2016-06-02 Thread Elliot West
Related to this, there exists an API in Hive to simplify the integrations of other frameworks with Hive's ACID feature: See: https://cwiki.apache.org/confluence/display/Hive/HCatalog+Streaming+Mutation+API It contains code for maintaining heartbeats, handling locks and transactions, and submittin

Hive Metastore on Amazon Aurora

2016-07-11 Thread Elliot West
Hello, Is anyone running the Hive metastore database on Amazon Aurora?: https://aws.amazon.com/rds/aurora/details/. My expectation is that it should work nicely as it is derived from MySQL but I'd be keen to hear of users' experiences with this setup. Many thanks, Elliot.

Re: Hive Metastore on Amazon Aurora

2016-07-11 Thread Elliot West
operty which may arise > from relying on this email's technical content is explicitly disclaimed. > The author will in no case be liable for any monetary damages arising from > such loss, damage or destruction. > > > > On 11 July 2016 at 13:58, Elliot West wrote: > >

Hive authentication and authorisation in AWS.

2016-07-13 Thread Elliot West
Hello, I am attempting to setup a long running, shared Hive metastore in AWS. The intention is to have this serve as the core repository of metadata for shared datasets across multiple AWS accounts. Users will be able to spin up their own short-lived EMR clusters, Spark jobs, etc. and then locate

Iterating over partitions using the metastore API

2016-08-04 Thread Elliot West
Hello, I have a process that needs to iterate over all of the partitions in a table using the metastore API. The process should not need to know about the structure or meaning of the partition key values (i.e. whether they are dates, numbers, country names etc), or be required to know the existing

Re: Iterating over partitions using the metastore API

2016-08-04 Thread Elliot West
ay, you could ask the metastore directly via JDBC for all the > partitions, and get java.sql.ResultSet that can be iterated over. > > Regards, > > Furcy > > > On Thu, Aug 4, 2016 at 1:29 PM, Elliot West wrote: > >> Hello, >> >> I have a process that needs

Re: Unit testing macros

2016-09-30 Thread Elliot West
Hi, You can achieve this by storing the macro definition in a separate HQL file and 'import' this as needed. Unfortunately such imports are interpreted by your Hive client and the relevant command varies between client implementations: '!run' in Beeline and 'SOURCE' in Hive CLI. I raised a proposa

Re: s3a and hive

2016-11-15 Thread Elliot West
My gut feeling is that this is not something you should do (except for fun!) I'm fairly confident that somewhere in Hive, MR, or Tez, you'll hit some code that requires consistent, atomic move/copy/list/overwrite semantics from the warehouse filesystem. This is not something that the vanilla S3AFil

Interrogating a uniontype

2016-11-23 Thread Elliot West
Can anyone recommend a good approach for interrogating uniontype values in HQL? I note that the documentation states that the support for such types is limited to 'look-at-only' which I assume to mean that I may only dump out the value in its entirety, and extract sub-elements. Using the example be

Re: Interrogating a uniontype

2016-11-23 Thread Elliot West
Ah, I see that this can't be done with an array as there is no type common to all union indexes. Perhaps a struct with one field per indexed type? On Wed, 23 Nov 2016 at 17:29, Elliot West wrote: > Can anyone recommend a good approach for interrogating uniontype values in > HQL? I no

Re: Interrogating a uniontype

2016-11-24 Thread Elliot West
d them as json objects. > > regards > /Pelle > > On Wed, Nov 23, 2016 at 6:40 PM, Elliot West wrote: > > Ah, I see that this can't be done with an array as there is no type common > to all union indexes. Perhaps a struct with one field per indexed type? > > On Wed,

Re: Maintaining big and complex Hive queries

2016-12-15 Thread Elliot West
Some options are covered here, although there is no definitive guidance as far as I know: https://cwiki.apache.org/confluence/display/Hive/Unit+Testing+Hive+SQL#UnitTestingHiveSQL-Modularisation On 15 December 2016 at 17:08, Saumitra Shahapure < saumitra.offic...@gmail.com> wrote: > Hello, > > W

Re: Maintaining big and complex Hive queries

2016-12-15 Thread Elliot West
I notice that HPL/SQL is not mentioned on the page I referenced, however I expect that is another approach that you could use to modularise: https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=59690156 http://www.hplsql.org/doc On 15 December 2016 at 17:17, Elliot West wrote

Re: Column names in ORC file

2016-12-15 Thread Elliot West
Possibly related to HIVE-4243 which was fixed in Hive 2.0.0: https://issues.apache.org/jira/browse/HIVE-4243 On Thu, 15 Dec 2016 at 18:06, Daniel Haviv wrote: > Hi, > When I'm generating ORC files using spark the column names are written > into the ORC file but when generated using Hive I get t

Re: how to load ORC file into hive orc table

2016-12-17 Thread Elliot West
It looks as though your table is partitioned yet perhaps you haven't accounted for this when adding the data? Firstly it is good practice (and sometimes essential) to put the data into a partition folder of the form "timestamp=''". You may then need to add the partition depending on how you are cre

Re: hive2.1.0 one partition has two locations

2016-12-22 Thread Elliot West
I believe there is an issue with non-string type partition values. On some code path point they are incorrectly compared as strings when a numeric comparison should be used instead. Consequently, as '04' ≠ '4' you get two different partitions. To work around this you should ensure that only one num
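The difference between the two comparison styles can be shown with a minimal sketch. The function names are hypothetical and this is not Hive's code path; it only demonstrates why '04' and '4' diverge when compared as strings.

```python
def same_partition_as_strings(a, b):
    # Textual comparison: a leading zero makes the values differ.
    return a == b


def same_partition_as_numbers(a, b):
    # Numeric comparison: both values parse to the same integer.
    return int(a) == int(b)


string_match = same_partition_as_strings("04", "4")
numeric_match = same_partition_as_numbers("04", "4")
```

Under string comparison the two values look like distinct partitions, while a numeric comparison treats them as one.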

Re: tez + union stmt

2016-12-25 Thread Elliot West
I believe that tez will generate subfolders for unioned data. As far as I know, this is the expected behaviour and there is no alternative. Presumably this is to prevent multiple tasks from attempting to write the same file? We've experienced issues when switching from mr to tez; downstream jobs w

Re: tez + union stmt

2017-01-10 Thread Elliot West
tings in that > stackoverflow link look to me to be exactly what i need to set for MR jobs > to pick that data up that Tez created. > > Cheers, > Stephen. > > On Sun, Dec 25, 2016 at 2:45 AM, Elliot West wrote: > > I believe that tez will generate subfolders for unioned

Re: tez + union stmt

2017-01-11 Thread Elliot West
Thank you. On Wed, 11 Jan 2017 at 07:21, Chris Drome wrote: > Elliot, > > Mithun already created the following ticket to track the issue: > > https://issues.apache.org/jira/browse/HIVE-15575 > > chris > > > On Tuesday, January 10, 2017 11:05 PM, Elliot West

Re: VARCHAR or STRING fields in Hive

2017-01-16 Thread Elliot West
Internally it looks as though Hive simply represents CHAR/VARCHAR values using a Java String and so I would not expect a significant change in execution performance. The Hive JIRA suggests that these types were added to 'support for more SQL-compliant behavior, such as SQL string comparison semanti

Client agnostic metastore authorization in AWS

2017-04-24 Thread Elliot West
We’re operating long lived Hive metastore instances in AWS to provide a metadata source of truth for our data processing pipelines. These pipelines are not restricted to Hive SQL, but must use frameworks that can integrate with the metastore (such as Spark). We’re storing data in S3. As these are c

Metastore integration testing with S3

2017-06-26 Thread Elliot West
I'm trying to put together a metastore instance for the purposes of creating an integration test environment. The system in question reads and writes data into S3 and consequently manages Hive tables whose raw data lives in S3. I've been successfully using Minio (https://www.minio.io) to decouple o

Re: Metastore integration testing with S3

2017-06-26 Thread Elliot West
s in remote mode can you make sure > that these keys are present in the hive-site.xml of HMS? > > fs.s3a.access.key= > fs.s3a.secret.key= > > On Mon, Jun 26, 2017 at 11:12 AM, Elliot West wrote: > >> I'm trying to put together a metastore instance for the purposes of

Hive federation service

2017-07-27 Thread Elliot West
Hello, We've recently contributed our Hive federation service to the open source community: https://github.com/HotelsDotCom/waggle-dance Waggle Dance is a request routing Hive metastore proxy that allows tables to be concurrently accessed across multiple Hive deployments. It was created to tack

Re: Hive federation service

2017-07-27 Thread Elliot West
ons > 1. Can Waggle Dance deal with multiple kerberized Hadoop clusters? > 2. Do you support 3 layers in the hierarchy (i.e. cluster.database.table) > or 2 layers, with a requirement to avoid any possible name collisions in > the mapping layer. > 3. Is it compatible with JDBC? It

Re: Hive federation service

2017-07-27 Thread Elliot West
hierarchy (i.e. cluster.database.table) > or 2 layers, with a requirement to avoid any possible name collisions in > the mapping layer. > > 3. Is it compatible with JDBC? It wasn't clear to me since the diagrams > all mention thrift. > > > > Thanks! > >

Re: Hive Table with Avro Union

2017-07-31 Thread Elliot West
Hi Nishanth, While what you suggest is indeed feasible, it is not something that I'd recommend for the following reasons: 1. Consumers of the data will need to write conditional code in their HQL which will likely be difficult to write and maintain (although this might be unavoidable reg

Parameterized views

2017-10-02 Thread Elliot West
Hello, Does any version of Hive support parameterized views? I'm thinking of something like the following contrived example: CREATE VIEW x AS SELECT a FROM y WHERE date = ${mydate} I've not been able to get this to work in Hive 1.2.1 or 2.1.0 and wonder if this is the intended behavior, o

Re: Hive and Schema Registry

2017-10-13 Thread Elliot West
We also use this feature of AvroSerDe and find it very useful. In our case we copy the schema from our schema registry into S3 and reference it from there. In effect, we listen to the internal topic used to store schemas by our registry, and push to S3 whenever there is a new record. As well as bei

Re: ACID update operation not working as expected

2017-10-18 Thread Elliot West
Did you manage to get any further with this? On Fri, 6 Oct 2017 at 05:47, Manju A wrote: > > > Hi Team, > > > > > > > > Using flume interceptor , I am reading messages from kafka with key and > value pair. The key is represented by an integer variable called pk in > below code and the value of m

Re: Options for connecting to Apache Hive

2017-11-10 Thread Elliot West
Hi Jakob, Assuming that your Hive deployment is running HiveServer2, you could issue queries and obtain result sets via its Thrift API. Thrift has a broad set of language implementations, including C IIRC. I believe this is also the API used by Hive's JDBC connector, so it should be capable from a

Re: Cannot create external table on S3; class S3AFileSystem not found

2017-12-09 Thread Elliot West
Which distribution are you using? Do you have hadoop-aws on the class path? Is ‘/path/to/hadoop/install’ a literal value or a placeholder that you’re using for the actual location? Cheers, Elliot. On Sat, 9 Dec 2017 at 00:08, Scott Halgrim wrote: > Hi, > > I’ve been struggling with this for a fe

Re: Proposal: File based metastore

2018-01-30 Thread Elliot West
Hi Ryan, Is Hive support on the iceberg roadmap? Presumably its MetastoreClientFactory and storage API provide an integration point? Or is there perhaps some architectural detail that makes this impractical? I’m thinking not just of the ability to support Hive, but also the range of tooling that

HQL parser internals

2018-02-16 Thread Elliot West
Hello, We need to be able to parse and manipulate an HQL query. To date we’ve been intercepting and transforming the parse tree in the SemanticAnalyzerHook. However, ideally we wish to manipulate the original query as delivered by the user (or as close to it as possible), and we’re finding that th

Re: HQL parser internals

2018-02-19 Thread Elliot West
Thank you all for your rapid responses; some really useful information and pointers in there. We'll keep the list updated with our progress. On 18 February 2018 at 19:00, Dharmesh Kakadia wrote: > +1 for using ParseDriver for this. I also have used it to intercept and > augment query AST. > > A

Re: HQL parser internals

2018-03-19 Thread Elliot West
ns handled prior to parsing or within the parser itself? If in a pre-processing stage, is there any code or utility classes within Hive that we can use as a reference, or to provide this functionality? Cheers, Elliot. On 19 February 2018 at 11:10, Elliot West wrote: > Thank you all for your ra

Proposal: Apply SQL based authorization functions in the metastore.

2018-04-20 Thread Elliot West
deployments that use HS2 exclusively, the proposed metastore resident SQL based auth could either be disabled or used harmlessly in conjunction with the HS2 implementation. Thanks, Elliot. Elliot West Senior Engineer Data Platform Team Hotels.com

org.apache.hadoop.hive.ql.metadata.HiveMetaStoreClientFactory

2018-04-23 Thread Elliot West
Hello, I'm looking for an abstraction to use for integrating with different (non-Thrift) metadata catalog implementations. I know that AWS Glue manages this and so have explored in EMR (Hive 2.3.2) a little. I see that it uses the "org.apache.hadoop.hive.ql.metadata.HiveMetaStoreClientFactory" int

Hive remote databases/tables proposal

2018-04-26 Thread Elliot West
Hello, At the 2018 DataWorks conference in Berlin, Hotels.com presented Waggle Dance, a tool for federating multiple Hive clusters and providing the illusion of a unified data catalog from disparate instances. We’ve been running Waggle Dance in produ

Re: Hive remote databases/tables proposal

2018-04-27 Thread Elliot West
icalities. Thank you for your helpful reply. Elliot. On 26 April 2018 at 17:28, Johannes Alberti wrote: > Did you guys look at https://github.com/qubole/Hive-JDBC-Storage-Handler > and discussed the pros/cons/similarities of the qubole approach > > On Thu, Apr 26, 2018 at 4:01 AM

Re: May 2018 Hive User Group Meeting

2018-05-02 Thread Elliot West
+1 for streaming or a recording. Content looks excellent. On 2 May 2018 at 15:51, dan young wrote: > Looks like great talks, will this be streamed anywhere? > > On Wed, May 2, 2018, 8:48 AM Sahil Takiar wrote: > >> Hey Everyone, >> >> The agenda for the meetup has been set and I'm excited to sa

Re: Hive remote databases/tables proposal

2018-05-10 Thread Elliot West
and does not fall into the scope of the SerDe interface. The org.apache.hadoop.hive.ql.metadata.MetastoreClientFactory integration point looks promising for both of these cases, but I can only find an operational implementation of this in EMR. Cheers, Elliot. On 27 April 2018 at 17:32, Elliot West

Re: What does the ORC SERDE do

2018-05-13 Thread Elliot West
Hi Jörn, I’m curious to know how the SerDe framework provides the means to deal with partitions, table properties, and statistics? I was under the impression that these were in the domain of the metastore and I’ve not found anything in the SerDe interface related to these. I would appreciate if yo

org.apache.hadoop.hive.ql.metadata.HiveMetaStoreClientFactory

2018-05-14 Thread Elliot West
Hello, I've been looking at Amazon's integration of their Glue service with Hive in EMR and notice that they achieve this with: - An AWS Glue specific implementation of org.apache.hadoop.hive.ql.metadata.HiveMetaStoreClientFactory (com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogH

Re: Hive External Table on particular set of files.

2018-06-03 Thread Elliot West
On which type of file system are you storing the data? S3? HDFS? Other? On Sun, 3 Jun 2018 at 08:26, Mahender Sarangam wrote: > We are copying files from our upstream system which are in JSON GZ format. > They are following a pattern for very daily slice say MMDDHH > (2018053100) they are ma

Re: Hive storm streaming with s3 file system

2018-06-12 Thread Elliot West
I don't believe that S3 is currently a supported filesystem for transactional tables. I believe there are plans to make this so. On 12 June 2018 at 17:50, Abhishek Raj wrote: > Hi. I'm using HiveBolt from Apache Storm to stream >

Optimal approach for changing file format of a partitioned table

2018-08-04 Thread Elliot West
Hi, I’m trying to simply change the format of a very large partitioned table from Json to ORC. I’m finding that it is unexpectedly resource intensive, primarily due to a shuffle phase with the partition key. I end up running out of disk space in what looks like a spill to disk in the reducers. How
