Re: Hive s3 external table with sub directories

2015-10-22 Thread Sergey Shelukhin
I don’t think Hive picks up partitions automatically in this scenario. Maybe a 
ticket could be filed to add partitions based on some additional syntax, as 
this seems to be an occasionally used scenario. I’ve seen msck used as a hack 
to “restore” partitions into the metastore (it will find the directories and 
create the partitions if all goes well); note that partitions added later also 
won’t be picked up automatically, so it would have to be re-run.
Make sure to try it on a test directory first so you can see whether it works 
for you.
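
A minimal sketch of the msck approach, using the table name from the question 
below and assuming the partition directories follow Hive's key=value naming:

MSCK REPAIR TABLE trips;
-- then check what was registered
SHOW PARTITIONS trips;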

From: Hafiz Mujadid
Reply-To: "user@hive.apache.org"
Date: Thursday, October 22, 2015 at 03:57
To: "user@hive.apache.org"
Subject: Hive s3 external table with sub directories




RE: Double quotes in csv data

2015-10-22 Thread michael.england
Hi,

Is there an easier way to do this with a Serde? The data volumes could easily 
reach multi-terabyte level so it would be nice if Hive could handle this.

Thanks,
Michael

From: Vikas Parashar [mailto:para.vi...@gmail.com]
Sent: 21 October 2015 16:29
To: user@hive.apache.org
Subject: Re: Double quotes in csv data

Hi Michael,

You can write a Python/Perl script to transform the CSV.

On Wed, Oct 21, 2015 at 8:57 PM, michael.england wrote:
Hi,

I have some CSV data in which each field is wrapped in double quotes, e.g. 
“hello”, “”, “test”, “”.

I noticed that there is a CSV Serde now available in Hive 0.14. Is it possible 
to use this to strip the quotes when querying an external Hive table?

https://cwiki.apache.org/confluence/display/Hive/CSV+Serde

e.g. select * from table limit 5 should return:

hello test   
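
A minimal sketch of using that SerDe, assuming Hive 0.14+; the table name, 
column names and location are illustrative:

CREATE EXTERNAL TABLE quoted_csv (
  col1 STRING, col2 STRING, col3 STRING, col4 STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE
LOCATION '/path/to/csv/data';

-- the default separatorChar is ',' and the default quoteChar is '"',
-- so the surrounding quotes are stripped when the table is queried

One caveat: this SerDe treats every column as STRING, so numeric fields would 
need to be cast downstream.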







the number of files after merging

2015-10-22 Thread patcharee

Hi,

I am using the alter command below to merge the partitioned ORC files of one 
partition:


alter table X partition(zone=1,z=1,year=2009,month=1) CONCATENATE;

- How can I control the number of files after merging? I would like to 
get only one file per partition.

- Is it possible to concatenate the whole table, not one-by-one partition?

Thanks,
Patcharee
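
On the second question, a hedged sketch: as far as I know, CONCATENATE on a 
partitioned table must be given a partition spec, so concatenating the whole 
table means issuing one statement per partition, which can be scripted from 
SHOW PARTITIONS (partition values below are illustrative):

SHOW PARTITIONS X;
ALTER TABLE X PARTITION (zone=1, z=1, year=2009, month=1) CONCATENATE;
ALTER TABLE X PARTITION (zone=1, z=1, year=2009, month=2) CONCATENATE;
-- ...one statement per remaining partition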



Re: Hi, Hive People urgent question about [Distribute By] function

2015-10-22 Thread Gopal Vijayaraghavan

> So do you think, if we want the same result from Hive and Spark or the
> other framework, how could we try this one?

There's a special backwards compat slow codepath that gets triggered if
you do

set mapred.reduce.tasks=199; (or any number)

This will produce the exact same hash-code as the java hashcode for
Strings & Integers.

The bucket-id is determined by

(hashCode & Integer.MAX_VALUE) % numberOfBuckets

but this also triggers a non-stable sort on an entirely empty key, which
will shuffle the data so the output file's order bears no resemblance to
the input file's order.


Even with that setting, the only consistent layout produced by Hive is the
CLUSTER BY, which will sort on the same key used for distribution & uses
the java hashCode if the auto-parallelism is turned off by setting a fixed
reducer count.

Cheers,
Gopal
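
A minimal HiveQL sketch of the consistent setup described above, with 
illustrative table and column names:

-- fix the reducer count so auto-parallelism does not change the hash routing
set mapred.reduce.tasks=199;

INSERT OVERWRITE TABLE ordered_out
SELECT id, payload
FROM input_data
CLUSTER BY id;
-- CLUSTER BY distributes and sorts on the same key, giving a stable layout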




Hive function to convert numeric IP address to "dot" format?

2015-10-22 Thread Mark Sunderlin
Does Hive have a built-in function to return the dotted-quad representation
of an IP address, given the network address as an integer?

If not, does anyone have SQL they would be willing to share that does this?

What I am looking for: my data is in the "raw" format below, and I want it in
the "dot" format.

raw_format    dot_format
1661360106    99.6.87.234
1626458677    96.241.202.53
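
I am not aware of a built-in for this, but here is a minimal sketch using 
standard arithmetic and string functions, assuming raw_format is a BIGINT 
column (the table name is illustrative):

SELECT concat_ws('.',
         cast(floor(raw_format / 16777216) % 256 AS STRING),
         cast(floor(raw_format / 65536)    % 256 AS STRING),
         cast(floor(raw_format / 256)      % 256 AS STRING),
         cast(raw_format % 256             AS STRING)) AS dot_format
FROM ip_data;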

--
Mark E. Sunderlin
Data Architect // Data Solutions
P: 703-265-6935 // C: 540-327-6222 // 22000 AOL Way,  Dulles, VA  20166
AIM: MESunderlin // Slack: @Mark.Sunderlin


Hive s3 external table with sub directories

2015-10-22 Thread Hafiz Mujadid
I have the following S3 directory structure.

Data/
  Year=2015/
    Month=01/
      Day=01/
        files
      Day=02/
        files
    Month=02/
      Day=01/
        files
      Day=02/
        files
    .
    .
    .

  Year=2014/
    Month=01/
      Day=01/
        files
      Day=02/
        files
    Month=02/
      Day=01/
        files
      Day=02/
        files
So I am creating a Hive external table as follows:

CREATE external TABLE trips
(
  trip_id STRING, probe_id STRING, provider_id STRING,
  is_moving TINYINT, is_completed BOOLEAN, start_time STRING,
  start_lat DOUBLE, start_lon DOUBLE, start_lat_adj DOUBLE)
  PARTITIONED BY (year INT, month INT, day INT)
  STORED AS TEXTFILE
  LOCATION 's3n://accesskey:secretkey@bucket/data/';

When I run a query on this table, no data is returned and no exception is
thrown. If I place the same files in one directory only, without partitioning,
then it runs fine. I also tried setting

set mapred.input.dir.recursive=true;
set hive.mapred.supports.subdirectories=true;

Any idea where I am going wrong?
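
For reference, a hedged sketch of registering one of the existing directories 
explicitly, since partitions of an external table are not discovered 
automatically (partition values and the exact path are illustrative; one 
statement per Year/Month/Day directory, or use MSCK REPAIR TABLE as discussed 
in the reply above):

ALTER TABLE trips ADD IF NOT EXISTS
  PARTITION (year=2015, month=1, day=1)
  LOCATION 's3n://accesskey:secretkey@bucket/data/Year=2015/Month=01/Day=01/';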


Need suggestions on processing JSON junk (e.g., invalid double quotes) data using HIVE

2015-10-22 Thread Sam Joe
Hi,

After streaming Twitter data to HDFS using Flume, I'm trying to analyze it
with some Hive queries. The data is in JSON format and is not clean: it has
double quotes (") in the wrong places, which causes the Hive queries to fail.
I am getting the following error:

Failed with exception
java.io.IOException:org.apache.hadoop.hive.serde2.SerDeException:
org.codehaus.jackson.JsonParseException: Unexpected end-of-input: was
expecting closing '"' for name

The script used for creating the external table:

ADD JAR /usr/local/hive/apache-hive-1.2.1-bin/lib/hive-serdes-1.0-SNAPSHOT.jar;
set hive.support.sql11.reserved.keywords = false;
CREATE EXTERNAL TABLE tweets (
id BIGINT,
created_at STRING,
source STRING,
favorited BOOLEAN,
retweet_count INT,
retweeted_status STRUCT<
text:STRING,
user:STRUCT<screen_name:STRING,name:STRING>>,
entities STRUCT<
urls:ARRAY<STRUCT<expanded_url:STRING>>,
user_mentions:ARRAY<STRUCT<screen_name:STRING,name:STRING>>,
hashtags:ARRAY<STRUCT<text:STRING>>>,
text STRING,
user STRUCT<
screen_name:STRING,
name:STRING,
friends_count:INT,
followers_count:INT,
statuses_count:INT,
verified:BOOLEAN,
utc_offset:INT,
time_zone:STRING>,
in_reply_to_screen_name STRING)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION '/usr/local/hadoop/bin/tweets';

Since I would not know which rows contain the extra double quotes, I can't put
in an escape character. How can I escape the junk characters and process the
data successfully?

Appreciate any help.

Thanks,

Joel
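
One possible workaround, sketched under the assumption that only a minority of 
lines are malformed: expose the raw lines as a single STRING column and keep 
only the rows that the JSON parser accepts (get_json_object returns NULL for 
invalid JSON). Table names and the output path are illustrative.

CREATE EXTERNAL TABLE tweets_raw (line STRING)
LOCATION '/usr/local/hadoop/bin/tweets';

-- count the bad lines
SELECT count(*) FROM tweets_raw
WHERE get_json_object(line, '$.id') IS NULL;

-- write the parseable lines to a clean directory and point the
-- SerDe-backed tweets table (or a copy of it) at that location
INSERT OVERWRITE DIRECTORY '/user/hive/tweets_clean'
SELECT line FROM tweets_raw
WHERE get_json_object(line, '$.id') IS NOT NULL;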


Re: Need suggestions on processing JSON junk (e.g., invalid double quotes) data using HIVE

2015-10-22 Thread Sam Joe
Hi,

Please see the logs are given below:

hive> SELECT t.retweeted_screen_name,
>Sum(retweets) AS total_retweets,
>Count(*)  AS tweet_count
> FROM   (SELECT retweeted_status.user.screen_name AS retweeted_screen_name,
>retweeted_status.text,
>Max(retweet_count)AS retweets
> FROM   tweets
> GROUP  BY retweeted_status.user.screen_name,
>   retweeted_status.text) t
> GROUP  BY t.retweeted_screen_name
> ORDER  BY total_retweets DESC
> LIMIT  1;
Query ID = joe_20151022143018_f680c6fd-5d6d-4d5e-8d20-df25396a84d5
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 2
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1445537142761_0002, Tracking URL =
http://localhost:8088/proxy/application_1445537142761_0002/
Kill Command = /usr/local/hadoop/bin/hadoop job  -kill job_1445537142761_0002
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 2
2015-10-22 14:30:51,478 Stage-1 map = 0%,  reduce = 0%
2015-10-22 14:39:14,950 Stage-1 map = 69%,  reduce = 17%, Cumulative CPU 62.09 sec
2015-10-22 14:39:17,556 Stage-1 map = 70%,  reduce = 17%, Cumulative CPU 63.9 sec
2015-10-22 14:39:20,209 Stage-1 map = 71%,  reduce = 17%, Cumulative CPU 65.86 sec
2015-10-22 14:39:25,098 Stage-1 map = 72%,  reduce = 17%, Cumulative CPU 67.68 sec
2015-10-22 14:39:26,126 Stage-1 map = 74%,  reduce = 17%, Cumulative CPU 69.33 sec
2015-10-22 14:39:29,943 Stage-1 map = 75%,  reduce = 17%, Cumulative CPU 71.09 sec
2015-10-22 14:39:34,993 Stage-1 map = 77%,  reduce = 17%, Cumulative CPU 74.86 sec
2015-10-22 14:39:43,505 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 25.47 sec
MapReduce Total cumulative CPU time: 25 seconds 470 msec
Ended Job = job_1445537142761_0002 with errors
Error during job, obtaining debugging information...
Examining task ID: task_1445537142761_0002_m_01 (and more) from
job job_1445537142761_0002

Task with the most failures(1):
-
Task ID:
  task_1445537142761_0002_m_00
URL:
  
http://0.0.0.0:8088/taskdetails.jsp?jobid=job_1445537142761_0002=task_1445537142761_0002_m_00
-
Diagnostic Messages for this Task:
Error: java.lang.RuntimeException:
org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error
while processing writable
{"filter_level":"low","retweeted":false,"in_reply_to_screen_name":null,"possibly_sensitive":true,"truncated":false,"lang":"it","in_reply_to_status_id_str":null,"id":654395624406675456,"in_reply_to_user_id_str":null,"timestamp_ms":"1444855049598","in_reply_to_status_id":null,"created_at":"Wed
Oct 14 20:37:29 +
2015","favorite_count":0,"place":null,"coordinates":null,"text":"Samaritani:
\"E-fattura strumento chiave per la business intelligence nella PA\"
https://t.co/CIckEdB9EG","contributors":null,"geo":null,"entities":{"symbols":[],"urls":[{"expanded_url":"https://lnkd.in/eDRb_sv","indices":[79,102],"display_url":"lnkd.in/eDRb_sv","url":"https://t.co/CIckEdB9EG"}],"hashtags":[],"user_mentions":[]},"is_quote_status":false,"source":;http://www.linkedin.com/\;
rel=\"nofollow\">LinkedIn<\/a>","favorited":false,"in_reply_to_user_id":null,"retweet_count":0,"id_str":"654395624406675456","user":{"location":"pisa","default_profile":true,"profile_background_tile":false,"statuses_count":2924,"lang":"it","profile_link_color":"0084B4","profile_banner_url":"https://pbs.twimg.com/profile_banners/145360070/1422279238","id":145360070,"following":null,"protected":false,"favourites_count":660,"profile_text_color":"33","verified":false,"description":null,"contributors_enabled":false,"profile_sidebar_border_color":"C0DEED","name":"marco
andreozzi","profile_background_color":"C0DEED","created_at":"Tue May
18 19:49:58 +
2010","default_profile_image":false,"followers_count":178,"profile_image_url_https":"https://pbs.twimg.com/profile_images/643826897231724544/odFpg1zd_normal.jpg","geo_enabled":true,"profile_background_image_url":"http://abs.twimg.com/images/themes/theme1/bg.png","profile_background_image_url_https":"https://abs.twimg.com/images/themes/theme1/bg.png","follow_request_sent":null,"url":null,"utc_off
    at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:172)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at

Access to wiki (documenting locking requirements).

2015-10-22 Thread Elliot West
Hi,

May I have access to edit the wiki? My confluence user name is 'teabot'.

I've been looking briefly at ALTER TABLE CONCATENATE and noticed that the
operation isn't listed on the Hive/Locking wiki page even though it
acquires an exclusive lock.

Thanks - Elliot.


Hi, Hive People urgent question about [Distribute By] function

2015-10-22 Thread Philip Lee
Hello, I am working on Flink and Spark, majoring in Computer Science in
Berlin.

I have an important question. It comes from what I am doing these days,
which is translating Hive queries to Flink.

When mapping Hive's [Distribute By] onto the other framework, the
corresponding function should be partitionByHash on Flink. That spreads the
rows out by a hash key derived from Java's Object class, which is why the
results can differ between runs of the same query. But, as you know, a hash
function is meant to spread data out fairly and evenly.

Here is the question.
What about [Distribute By] on Hive? How does this function spread data out
fairly? Is it the same as the Flink case, i.e. could the results be
different after running the query many times?

Thanks,
Philip
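
For concreteness, a minimal HiveQL sketch of the construct in question, with 
illustrative table and column names: DISTRIBUTE BY only controls which reducer 
each row goes to, based on the hash of the key; it does not impose any order 
within a reducer's output.

-- rows with the same k always land in the same output file
SELECT k, v
FROM t
DISTRIBUTE BY k;

-- additionally orders rows within each reducer
SELECT k, v
FROM t
DISTRIBUTE BY k SORT BY v;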




-- 

==

Hae Joon Lee


Now, in Germany,

M.S. Candidate, Interested in Distributed System, Iterative Processing

Dept. of Computer Science, Informatik in German, TUB

Technical University of Berlin


In Korea,

M.S. Candidate, Computer Architecture Laboratory

Dept. of Computer Science, KAIST


Rm# 4414 CS Dept. KAIST

373-1 Guseong-dong, Yuseong-gu, Daejon, South Korea (305-701)


Mobile) 49) 015-251-448-278 in Germany, no cellular in Korea

==


Locking when using the Metastore/HCatalog APIs.

2015-10-22 Thread Elliot West
I notice from the Hive locking wiki page that locks may be acquired for a
range of HQL DDL operations. I wanted to know how that locking scheme is
mapped to/employed by the equivalent operations in the Metastore and
HCatalog APIs. Consider the operation:

alter table T1 drop partition P1


This apparently requires a shared lock on T1 and an exclusive lock on P1.
However, when following the invocation chain on the equivalent APIs I can
see no acquisition of such locks:

HCatClient.dropPartitions(...)
IMetaStoreClient.dropPartition(...)


I notice that IMetaStoreClient exposes methods to lock and unlock resources,
and I have used these when working with the ACID APIs. However, I can find
nothing similar on the HCatalog API. As a user of these APIs:

   1. Do I need to acquire/release locks as declared on the wiki page when
   using equivalent operations in these APIs?
   2. If not, why not?
   3. Ideally, should the HCatClient also expose lock/unlock methods like
   the Metastore API?
   4. If not, why not?

Thank you - Elliot.
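
As a small aside, a hedged way to check point 1 empirically (assuming a lock 
manager is configured, and mirroring the shorthand names above): run the API 
call and, from a Hive session, inspect the locks that would be visible on the 
HQL path.

SHOW LOCKS T1 EXTENDED;

If the API call shows no corresponding entries while the equivalent HQL DDL 
does, that would suggest the locks have to be acquired explicitly via 
IMetaStoreClient.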


Hive on Spark

2015-10-22 Thread Jone Zhang
1. How can I set the Storage Level when I use Hive on Spark?
2. Is there any intention of dynamically choosing between Hive on MapReduce
and Hive on Spark, based on SQL features?

Thanks in advance
Best regards
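
On the second point, as far as I know there is no automatic per-query choice 
between the two engines based on SQL features, but the engine can be switched 
per session or per statement; a minimal sketch (the table name is illustrative):

-- run on Spark
set hive.execution.engine=spark;
SELECT count(*) FROM some_table;

-- fall back to MapReduce
set hive.execution.engine=mr;
SELECT count(*) FROM some_table;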


Re: Hi, Hive People urgent question about [Distribute By] function

2015-10-22 Thread Philip Lee
Thanks for your help.

So do you think, if we want the same result from Hive and Spark or the other
framework, how could we try this one?
Could you tell me in detail?

Regards,
Philip

On Thu, Oct 22, 2015 at 6:25 PM, Gopal Vijayaraghavan  wrote:

>
> > When applying [Distribute By] on Hive to the framework, the function
> >should be partitionByHash on Flink. This is to spread out all the rows
> >distributed by a hash key from Object Class in Java.
>
> Hive does not use the Object hashCode - the identityHashCode is
> inconsistent, so Object.hashCode().
>
> ObjectInspectorUtils::hashCode() is the hashcode used by the DBY in hive
> (SORT BY uses a Random number generator).
>
> Cheers,
> Gopal
>
>


-- 

==

Hae Joon Lee


Now, in Germany,

M.S. Candidate, Interested in Distributed System, Iterative Processing

Dept. of Computer Science, Informatik in German, TUB

Technical University of Berlin


In Korea,

M.S. Candidate, Computer Architecture Laboratory

Dept. of Computer Science, KAIST


Rm# 4414 CS Dept. KAIST

373-1 Guseong-dong, Yuseong-gu, Daejon, South Korea (305-701)


Mobile) 49) 015-251-448-278 in Germany, no cellular in Korea

==


Re: Hi, Hive People urgent question about [Distribute By] function

2015-10-22 Thread Gopal Vijayaraghavan

> When applying [Distribute By] on Hive to the framework, the function
>should be partitionByHash on Flink. This is to spread out all the rows
>distributed by a hash key from Object Class in Java.

Hive does not use the Object hashCode - the identityHashCode is
inconsistent, so Object.hashCode().

ObjectInspectorUtils::hashCode() is the hashcode used by the DBY in hive
(SORT BY uses a Random number generator).

Cheers,
Gopal