Re: The build-in indexes in ORC file does not work.

2016-03-19 Thread Mich Talebzadeh
Hi Gopal, I am using Hive 2 on the Spark 1.3.1 engine. OK, this is only a test table. What would be the best way to create this table in Hive in ORC format? Thanks, Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: De-identification_in Hive

2016-03-19 Thread Ajay Chander
Jörn, I have around a hundred big CSV files on my local machine. Each file has some columns that contain sensitive information. I don't want to drop the columns manually. Now I have to bring those files into Hive external tables, but I want to make sure that the columns which have sensi

Re: hbase-1.1.1 & hive-1.0.1

2016-03-19 Thread Adam Hunt
Version information Hive 1.x will remain compatible with HBase 0.98.x and lower versions. Hive 2.x will be compatible with HBase 1.x and higher. (See HIVE-10990 for details.) Consumers wanting to work with HBase 1.x using Hive 1.x will need to com

Implementing PIVOT in Hive

2016-03-19 Thread Mahender Sarangam
Hi Team, We are looking to pivot some columns as rows. Is there any support for pivoting in Hive?
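One common approach, sketched here rather than taken from the thread, is conditional aggregation: one `SUM(CASE WHEN ...)` per output column, grouped by the row key. It assumes the goal is to turn row values into columns. The query uses only `CASE`, `GROUP BY`, and `ORDER BY`, which HiveQL also supports; Python's built-in `sqlite3` serves purely as a stand-in engine so the sketch can run anywhere, and the `sales` table with its `region`, `quarter`, and `amount` columns is hypothetical.

```python
import sqlite3

# Hypothetical table: one row per (region, quarter) observation.
# Each CASE branch becomes one pivoted output column.
PIVOT_SQL = """
SELECT region,
       SUM(CASE WHEN quarter = 'Q1' THEN amount ELSE 0 END) AS q1,
       SUM(CASE WHEN quarter = 'Q2' THEN amount ELSE 0 END) AS q2
FROM sales
GROUP BY region
ORDER BY region
"""


def pivot_sales(rows):
    """Load rows into an in-memory table and return the pivoted result."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE sales (region TEXT, quarter TEXT, amount INTEGER)"
        )
        conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
        return conn.execute(PIVOT_SQL).fetchall()
    finally:
        conn.close()


rows = [
    ("east", "Q1", 10),
    ("east", "Q2", 20),
    ("west", "Q1", 5),
    ("east", "Q1", 1),
]
result = pivot_sales(rows)
```

The set of output columns has to be known when the query is written, since each one is spelled out as its own `CASE` expression.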

Re: De-identification_in Hive

2016-03-19 Thread Jörn Franke
What are your requirements? Do you need to omit a column? Transform it? Make the anonymized version joinable, etc.? There is not simply one function. > On 17 Mar 2016, at 14:58, Ajay Chander wrote: > > Hi Everyone, > > I have a csv.file which has some sensitive data in a particular column in it.

Leveraging Hive calcite module

2016-03-19 Thread Srinivasan Hariharan02
Hi, We want to leverage Hive calcite module to find the query cost without executing the complete query. I have gone through the code of the calcite module https://github.com/apache/hive/tree/branch-2.0/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite Can anyone help how we can get the

hbase-1.1.1 & hive-1.0.1

2016-03-19 Thread songj songj
Hi all: I use hbase-1.1.1 & hive-1.0.1, but I cannot access HBase from Hive. Are these two versions incompatible? Diagnostic Messages for this Task: Error: java.lang.RuntimeException: java.lang.NoSuchMethodError: org.apache.hadoop.hbase.client.Put.setDurability(Lorg/apache/hadoop/hbase/clie

Re: Re: The build-in indexes in ORC file does not work.

2016-03-19 Thread Joseph
Hi Professor Gopal, > Most of your ~300s looks to be the fixed overheads of setting up each task. Maybe you are right. Perhaps the ORC indexes work normally in Hive, and the performance improvement is just not obvious because the fixed time overhead is so long. I will check this later.

Re: Hiding staging directory data on HDFS

2016-03-19 Thread chandra Reddy Bogala
The right mechanism is to implement Kerberos authentication. Think of how you protect data on UNIX. In the same way, users and groups can be created and file permissions can be given. If you don't protect it that way, a user can read any data, including ORC (with the ORC command line/tool). On Thursday, March 1

Re: De-identification_in Hive

2016-03-19 Thread Ajay Chander
Mich, thanks for looking into this. I have a 'csvfile.txt' on HDFS. I have created an external table 'xyz' to load that data into it. The data in one of the columns, 'ssn', needs to be masked. Is there any built-in function in Hive that I could use? On Thursday, March 17, 2016, Mich Talebzadeh wrote: > A

Hiding staging directory data on HDFS

2016-03-19 Thread Mich Talebzadeh
Hi, What are the best mechanisms for hiding data destined for Hive tables? Let us assume that we are loading tons of CSV files into Hive. The way I do it is: --1 Move .CSV data into HDFS staging area --2 Create an external table. --3 Create the ORC table if needed --4 Insert or append the data f

Re: Tez reducer parallelism ..

2016-03-19 Thread Gopal Vijayaraghavan
> So you're saying, since these windows are part of a single SELECT >projection they need to be serial? Yes, with a full shuffle of the result so far for each new OVER(). > row_number() OVER( PARTITION BY app, user, type ORDER BY ts >) as a_number, > row_number() OVER( P

Re: hive need access the hdfs of hbase?

2016-03-19 Thread Divya Gehlot
Hi, Please check where your zookeeper.znode.parent property is pointing. On 17 March 2016 at 15:21, songj songj wrote: > hi all: > I have 2 cluster,one is hive cluster(2.0.0),another is hbase > cluster(1.1.1), > this two clusters have dependent hdfs: > > hive cluster: > >fs.defaultFS

Re: The build-in indexes in ORC file does not work.

2016-03-19 Thread Mich Talebzadeh
I did some tests on Hive running on MR to get rid of Spark effects. In an ORC table that has been partitioned, partition elimination with predicate push down works and the query is narrowed to the partition itself. I can see that from the number of rows within that partition. For example below sa

Re: De-identification_in Hive

2016-03-19 Thread Ajay Chander
Mich, I am okay with replacing the column's data with some characters like asterisks. Thanks On Thursday, March 17, 2016, Mich Talebzadeh wrote: > Hi Ajay, > > Do you want to be able to unmask it (at any time) or just have it totally > scrambled (for example replace the column with random characte

Re: hive need access the hdfs of hbase?

2016-03-19 Thread Divya Gehlot
Do you have hbase-site.xml in your classpath? On 17 March 2016 at 17:08, songj songj wrote: > >zookeeper.znode.parent >/hbase > > > and I found it that ,bind any ip which the hive can access to > 'hbase-cluster' ,they are all ok! > > > > 2016-03-17 16:46 GMT+08:00 Divya Gehlot : > >> Hi,

Re: De-identification_in Hive

2016-03-19 Thread Mich Talebzadeh
Are you loading your CSV file from an external table into a Hive table? Basically, you want to scramble that column before putting it into the Hive table? Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
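If the scrambling is done before the file ever reaches HDFS, one option is a small pre-load pass over the CSV. The following is a minimal sketch under stated assumptions, not a method confirmed in the thread: the file has a header row, the sensitive column is named `ssn` (as in Ajay's later message), and replacing every character with an asterisk is acceptable. The helper name `mask_column` is hypothetical.

```python
import csv
import io


def mask_column(src, dst, column, fill="*"):
    """Copy CSV rows from src to dst, overwriting `column` with `fill` characters."""
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # Keep the original length so the column stays visually aligned;
        # use a fixed-width mask instead if the length itself is sensitive.
        row[column] = fill * len(row[column])
        writer.writerow(row)


# In-memory stand-ins for the real input and output files.
source = io.StringIO("name,ssn\nalice,123-45-6789\nbob,987-65-4321\n")
masked = io.StringIO()
mask_column(source, masked, "ssn")
masked_text = masked.getvalue()
```

With real files, `src` and `dst` would be handles from `open(..., newline="")`, and the masked output is what gets moved to the staging area.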

Re: De-identification_in Hive

2016-03-19 Thread Marcin Tustin
This is a classic transform-load problem. You'll want to anonymise it once before making it available for analysis. On Thursday, March 17, 2016, Ajay Chander wrote: > Hi Everyone, > > I have a csv.file which has some sensitive data in a particular column > in it. Now I have to create a table in

Re: De-identification_in Hive

2016-03-19 Thread Mich Talebzadeh
Then probably the easiest option would be an INSERT/SELECT from the external table to the target table, making that column NULL. Check the VAT column here, which I made NULL: DROP TABLE IF EXISTS stg_t2; CREATE EXTERNAL TABLE stg_t2 ( INVOICENUMBER string ,PAYMENTDATE string ,NET string ,VAT string ,TOT

hive 2.0.0 schematool explicit?

2016-03-19 Thread songj songj
Hive 2.0.0 requires running schematool explicitly: schematool -dbType derby -initSchema mysql Can I avoid running this command before I use Hive? Or is there another setting in hive-site.xml to make it the default?

How to work around non-executive /tmp with Hive in Parquet+Snappy compression?

2016-03-19 Thread Rex X
The local /tmp is configured as non-executable by the admin. When we run a "select ...limit 10" query on Hive, it copies some file to /tmp and tries to execute it. But since /tmp is non-executable, I always get thrown out of the Hive shell with some binding error. What is the setting to change this /tmp

How to append one column to an existing array column in Hive?

2016-03-19 Thread Rex X
For example, to append columnA to an existing array-type column B: select string_column_A, array_column_B, append(array_column_B, string_column_A) as AB from onetable; I did not find any append function as above in Hive. To be more accurate, I should say "set" instead of "ar

Re: Hiding staging directory data on HDFS

2016-03-19 Thread Gopal Vijayaraghavan
> --1 Move .CSV data into HDFS staging area Per-user staging areas with Kerberos auth are standard practice. As long as you're not running a vanilla Apache install (i.e. Ranger KMS + SSL certificates for KMS needed), you can encrypt users away from each other[1] or from threat of physical hardware

Re: The build-in indexes in ORC file does not work.

2016-03-19 Thread Gopal Vijayaraghavan
> I love to see these ORC table optimization help but it is not obvious to >me under what circumstances they bear fruit. Are you using Tez or LLAP? Your explain plans are clearly missing the optimizations I've added as part of Stinger.next. https://github.com/apache/hive/blob/master/ql/src/test/

Re: De-identification_in Hive

2016-03-19 Thread Ajay Chander
Tustin, Is there any way I can de-identify it in Hive? On Thursday, March 17, 2016, Marcin Tustin wrote: > This is a classic transform-load problem. You'll want to anonymise it once > before making it available for analysis. > > On Thursday, March 17, 2016, Ajay Chander > wrote: > >> Hi Everyone

Re: The build-in indexes in ORC file does not work.

2016-03-19 Thread Jörn Franke
How much data are you querying? What is the query? How selective is it supposed to be? What is the block size? > On 16 Mar 2016, at 11:23, Joseph wrote: > > Hi all, > > I know that ORC provides three levels of indexes within each file: file > level, stripe level, and row level. > The fi

Re: The build-in indexes in ORC file does not work.

2016-03-19 Thread Jörn Franke
Not sure it should work. How many rows are affected? Is the data sorted? Have you tried with Tez? Tez has some summary statistics that tell you if you use push down. Maybe you need to use HiveContext. Perhaps a bloom filter could make sense for you as well. > On 16 Mar 2016, at 12:45, Joseph wr

De-identification_in Hive

2016-03-19 Thread Ajay Chander
Hi Everyone, I have a csv.file which has some sensitive data in a particular column in it. Now I have to create a table in Hive and load the data into it. But when loading the data I have to make sure that the data is masked. Is there any built-in function which supports this, or do I have to