P.s.

There are some risky data manipulations going on there.
I’m not sure this is the desired result… ☺

hive> select CAST(REGEXP_REPLACE('And the Lord spake, saying, "First shalt thou 
take out the Holy Pin * Then shalt thou count to 3, no more, no less * 3 shall 
be the number thou shalt count, and the number of the counting shall be 3 * 4 
shalt thou not count, neither count thou 2, excepting that thou then proceed to 
3 * 5 is right out * Once the number 3, being the third number, be reached, 
then lobbest thou thy Holy Hand Grenade of Antioch towards thy foe, who, being 
naughty in My sight, shall snuff it *','[^\\d\\.]','') AS DECIMAL(20,2));
OK
33342353
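
A smaller, made-up illustration of the same pitfall: because the regex strips everything that is not a digit or a dot, all the numbers in the field get concatenated into a single value.

hive> select CAST(REGEXP_REPLACE('Invoice 123, total 45.67 GBP','[^\\d\\.]','') AS DECIMAL(20,2));
OK
12345.67

So the "total" comes back as 12345.67 rather than 45.67.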

From: Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
Sent: Tuesday, June 07, 2016 2:23 AM
To: user <user@hive.apache.org>
Subject: Re: Why does the user need write permission on the location of 
external hive table?

Hi Igor,

Hive can read from zipped files. If you are receiving a lot of external files, it 
makes sense to zip them and store them in a staging HDFS directory.

1) Download, say, these CSV files into your local file system and use bzip2 to 
zip them as part of ETL:

 ls -l
total 68
-rw-r--r-- 1 hduser hadoop 7334 Apr 25 11:29 nw_2011.csv.bz2
-rw-r--r-- 1 hduser hadoop 6235 Apr 25 11:29 nw_2012.csv.bz2
-rw-r--r-- 1 hduser hadoop 5476 Apr 25 11:29 nw_2013.csv.bz2
-rw-r--r-- 1 hduser hadoop 2725 Apr 25 11:29 nw_2014.csv.bz2
-rw-r--r-- 1 hduser hadoop 1868 Apr 25 11:29 nw_2015.csv.bz2
-rw-r--r-- 1 hduser hadoop  693 Apr 25 11:29 nw_2016.csv.bz2

Then put these files in a staging directory on HDFS using a shell script:


# TARGETDIR is the HDFS staging directory, e.g. /data/stg/accounts/nw/10124772
for FILE in `ls *.* | grep -v .ksh`
do
  echo "Bzipping ${FILE}"
  /usr/bin/bzip2 ${FILE}
  hdfs dfs -copyFromLocal ${FILE}.bz2 ${TARGETDIR}
done

OK now the files are saved in ${TARGETDIR}
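
You can sanity check the staging directory straight from the Hive prompt as well (the path here is just the staging location used in the table definition below):

hive> dfs -ls /data/stg/accounts/nw/10124772;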

Now create the external table pointing at this staging directory. There is no need 
to tell Hive that these files are compressed; it knows how to handle them. They are 
stored as text files.


DROP TABLE IF EXISTS stg_t2;
CREATE EXTERNAL TABLE stg_t2 (
 INVOICENUMBER string
,PAYMENTDATE string
,NET string
,VAT string
,TOTAL string
)
COMMENT 'from csv file from excel sheet nw_10124772'
ROW FORMAT serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE
LOCATION '/data/stg/accounts/nw/10124772'
TBLPROPERTIES ("skip.header.line.count"="1");
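
As a quick check that Hive reads the bz2 files transparently, just query the staging table (illustrative only):

SELECT invoicenumber, total FROM stg_t2 LIMIT 5;
SELECT COUNT(1) FROM stg_t2;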

Now create the internal (managed) Hive table. Note that I want this data to be 
compressed, so tell Hive to compress the table with ZLIB or SNAPPY.


DROP TABLE IF EXISTS t2;
CREATE TABLE t2 (
 INVOICENUMBER          INT
,PAYMENTDATE            date
,NET                    DECIMAL(20,2)
,VAT                    DECIMAL(20,2)
,TOTAL                  DECIMAL(20,2)
)
COMMENT 'from csv file from excel sheet nw_10124772'
CLUSTERED BY (INVOICENUMBER) INTO 256 BUCKETS
STORED AS ORC
TBLPROPERTIES ( "orc.compress"="ZLIB" );
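
If you want to confirm the compression codec on the ORC table afterwards, either of these will show it:

SHOW TBLPROPERTIES t2;
DESCRIBE FORMATTED t2;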

Put the data in the target table, doing the conversions and ignoring empty rows:

INSERT INTO TABLE t2
SELECT
          INVOICENUMBER
        , TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP(paymentdate,'dd/MM/yyyy'),'yyyy-MM-dd')) AS paymentdate
        , CAST(REGEXP_REPLACE(net,'[^\\d\\.]','') AS DECIMAL(20,2))
        , CAST(REGEXP_REPLACE(vat,'[^\\d\\.]','') AS DECIMAL(20,2))
        , CAST(REGEXP_REPLACE(total,'[^\\d\\.]','') AS DECIMAL(20,2))
FROM
stg_t2
WHERE
--        INVOICENUMBER > 0 AND
        CAST(REGEXP_REPLACE(total,'[^\\d\\.]','') AS DECIMAL(20,2)) > 0.0 -- exclude empty rows
;

So it is pretty straightforward.
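
A quick sanity check on the conversions after the load (illustrative only) could be:

SELECT COUNT(*) FROM t2 WHERE paymentdate IS NULL;   -- rows where the date failed to parse
SELECT MIN(total), MAX(total), SUM(total) FROM t2;   -- eyeball the converted amounts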

Now to your question

"it will affect performance, correct?"


Compression is a well-established technique and has been around in databases for a long time. Almost all RDBMSs (Oracle, Sybase, etc.) can compress data and backups through an option. Compression is more CPU intensive than leaving the data uncompressed, but the database handles turning the compressed data back into readable form when you query it. So yes, there is a performance price to pay, albeit a small one: more CPU to uncompress the data and present it. However, that is a small price to pay for reducing the storage cost of the data.
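
For completeness, Hive can also compress intermediate and final job output if you want to push the same trade-off further; these are standard settings, shown purely as an illustration:

SET hive.exec.compress.intermediate=true;
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;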

HTH


Dr Mich Talebzadeh



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 6 June 2016 at 23:18, Igor Kravzov 
<igork.ine...@gmail.com> wrote:
Mich, will Hive automatically detect and unzip zipped files? Or is there a 
special option in the table configuration?
It will affect performance, correct?

On Mon, Jun 6, 2016 at 4:14 PM, Mich Talebzadeh 
<mich.talebza...@gmail.com> wrote:
Hi Sandeep.

I tend to use Hive external tables as staging tables, but I still require write 
access to HDFS.

Zipped files work OK as well. For example, our CSV files are zipped using bzip2 to 
save space.

However, as a temporary solution you can disable permission checking in 
$HADOOP_HOME/etc/hadoop/hdfs-site.xml:

<property>
    <name>dfs.permissions</name>
    <value>false</value>
</property>

There are other ways as well.

Check this

http://stackoverflow.com/questions/11593374/permission-denied-at-hdfs

HTH







Dr Mich Talebzadeh



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 6 June 2016 at 21:00, Igor Kravzov 
<igork.ine...@gmail.com> wrote:
I see the files have the extension .gz. Are these zipped?
Did you try with unzipped files?
Maybe in order to read the data Hive needs to unzip the files but does not have 
write permission?
Just a wild guess...

On Tue, May 31, 2016 at 4:20 AM, Sandeep Giri 
<sand...@cloudxlab.com> wrote:
Yes, when I run hadoop fs -ls it gives the results correctly:

hadoop fs -ls /data/SentimentFiles/SentimentFiles/upload/data/tweets_raw/
Found 30 items
-rw-r--r--   3 hdfs hdfs       6148 2015-12-04 15:19 
/data/SentimentFiles/SentimentFiles/upload/data/tweets_raw/.DS_Store
-rw-r--r--   3 hdfs hdfs     803323 2015-12-04 15:19 
/data/SentimentFiles/SentimentFiles/upload/data/tweets_raw/FlumeData.1367523670393.gz
-rw-r--r--   3 hdfs hdfs     284355 2015-12-04 15:19 
/data/SentimentFiles/SentimentFiles/upload/data/tweets_raw/FlumeData.1367523670394.gz
....



On Tue, May 31, 2016 at 1:42 PM, Mich Talebzadeh 
<mich.talebza...@gmail.com> wrote:
Is this location correct and valid?

LOCATION '/data/SentimentFiles/SentimentFiles/upload/data/tweets_raw/'


Dr Mich Talebzadeh



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 31 May 2016 at 08:50, Sandeep Giri 
<sand...@cloudxlab.com> wrote:
Hi Hive Team,

As per my understanding, in Hive, you can create two kinds of tables: Managed 
and External.

In the case of a managed table, you own the data, and hence when you drop the 
table the data is deleted.

In the case of an external table, you don't have ownership of the data, and hence 
when you drop such a table the underlying data is not deleted; only the metadata 
is deleted.
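
For example (the table names and the location are just placeholders):

-- Managed table: DROP TABLE removes both the metadata and the data files
CREATE TABLE managed_example (id INT) STORED AS TEXTFILE;
DROP TABLE managed_example;

-- External table: DROP TABLE removes only the metadata; the files stay in HDFS
CREATE EXTERNAL TABLE external_example (id INT)
STORED AS TEXTFILE
LOCATION '/some/existing/hdfs/path';
DROP TABLE external_example;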

Now, I have recently observed that you cannot create an external table over a 
location on which you don't have write (modification) permission in HDFS. I 
completely fail to understand this.

Use case: It is quite common that the data you are churning is huge and 
read-only. So, to churn such data via Hive, will you have to copy this huge 
data to a location on which you have write permissions?

Please help.

My data is located in an HDFS folder 
(/data/SentimentFiles/SentimentFiles/upload/data/tweets_raw/), on which I only 
have read-only permission, and I am trying to execute the following command:

CREATE EXTERNAL TABLE tweets_raw (
        id BIGINT,
        created_at STRING,
        source STRING,
        favorited BOOLEAN,
        retweet_count INT,
        retweeted_status STRUCT<
        text:STRING,
        users:STRUCT<screen_name:STRING,name:STRING>>,
        entities STRUCT<
        urls:ARRAY<STRUCT<expanded_url:STRING>>,
        user_mentions:ARRAY<STRUCT<screen_name:STRING,name:STRING>>,
        hashtags:ARRAY<STRUCT<text:STRING>>>,
        text STRING,
        user1 STRUCT<
        screen_name:STRING,
        name:STRING,
        friends_count:INT,
        followers_count:INT,
        statuses_count:INT,
        verified:BOOLEAN,
        utc_offset:STRING, -- was INT but nulls are strings
        time_zone:STRING>,
        in_reply_to_screen_name STRING,
        year int,
        month int,
        day int,
        hour int
        )
        ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
        WITH SERDEPROPERTIES ("ignore.malformed.json" = "true")
        LOCATION '/data/SentimentFiles/SentimentFiles/upload/data/tweets_raw/'
        ;

It throws the following error:

FAILED: Execution Error, return code 1 from 
org.apache.hadoop.hive.ql.exec.DDLTask. 
MetaException(message:java.security.AccessControlException: Permission denied: 
user=sandeep, access=WRITE, 
inode="/data/SentimentFiles/SentimentFiles/upload/data/tweets_raw":hdfs:hdfs:drwxr-xr-x
        at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319)
        at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:219)
        at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190)
        at 
org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1771)
        at 
org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1755)
        at 
org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPathAccess(FSDirectory.java:1729)
        at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkAccess(FSNamesystem.java:8348)
        at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.checkAccess(NameNodeRpcServer.java:1978)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.checkAccess(ClientNamenodeProtocolServerSideTranslatorPB.java:1443)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2151)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2147)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2145)



--
Regards,
Sandeep Giri,
+1-(347) 781-4573 (US)
+91-953-899-8962 (IN)
www.CloudxLab.com (A Hadoop cluster for practicing)





--
Regards,
Sandeep Giri,
+1-(347) 781-4573 (US)
+91-953-899-8962 (IN)
www.CloudxLab.com



