Re: Chocolatey package for Windows

2013-08-27 Thread Ruslan Al-Fakikh
I've heard that there are some problems running Pig/Hive and Hadoop
itself on Windows...


On Mon, Aug 26, 2013 at 10:21 PM, Andrew Pennebaker
wrote:

> I love how you can "apt-get install hadoop-hive" in Ubuntu, and "brew
> install hive" on Mac. Could we submit a Chocolatey package for Windows, so
> that Windows users can easily install Hive with "chocolatey install hive"?
>


Re: Elastic MapReduce Hive & Avro SerDe

2013-07-04 Thread Ruslan Al-Fakikh
Hi.

My guess is that you could look it up in the Amazon EMR docs or mailing lists.
IIRC, CDH had the patch for Avro+Hive before it was included in Hive itself,
so Amazon EMR may have similar patches...

Ruslan


On Thu, Jul 4, 2013 at 12:20 PM, Dan Filimon wrote:

> Hi!
>
> I'm working on a few Avro MapReduce jobs whose output will end up on S3 to
> be processed by Hive.
> Amazon's latest Hive version [1] is 0.8.1 but Avro support was added in
> 0.9.1.
>
> I can only find the haivvreo project [2] that supports 0.7.
> Is this my only option?
>
> Thanks!
>
> [1] http://aws.amazon.com/elasticmapreduce/faqs/#hive-19
> [2] https://github.com/jghoman/haivvreo
>


Re: Locking in HIVE : How to use locking/unlocking features using hive java API ?

2012-12-10 Thread Ruslan Al-Fakikh
Hi Manish!

Why do you need a metadata backup? Can't you just store all the table create
statements in an init file? If you care about partitions that have been
created dynamically, then you can restore them from the data with RECOVER
PARTITIONS (if using Amazon EMR) or the analogous check command in a regular
Hadoop distro (I don't remember its exact name).

Ruslan


On Mon, Dec 10, 2012 at 12:48 PM, Manish Malhotra <
manish.hadoop.w...@gmail.com> wrote:

> Sending again, as I got no response.
>
> Can somebody from Hive dev group please review my approach and reply?
>
> Cheers,
> Manish
>
>
> On Thu, Dec 6, 2012 at 11:17 PM, Manish Malhotra <
> manish.hadoop.w...@gmail.com> wrote:
>
>> Hi,
>>
>> I'm building / designing a back-up and restore tool for hive data for
>> Disaster Recovery scenarios.
>>
>> I'm trying to understand the locking behavior of HIVE that is currently
>> supporting ZooKeeper for locking.
>>
>> My thought process is like this (early design):
>>
>> 1. Backing up the meta-data of hive.
>> 2. Backing up the data for hive tables on s3 or hdfs or NFS
>> 3. Restoring table(s):
>> a. Only Data
>> b. Schema and data
>>
>> So, to achieve 1st task, this is the flow I'm thinking.
>>
>> a. Check whether there is any exclusive lock on the Table, whose
>> meta-data needs to be backed up.
>>  if YES then don't do anything; wait and retry for the configured
>> number/frequency
>>  if NO: Then get the meta-data of the table and create the DDL
>> statement for HIVE including table / partition etc.
>>
>> For 2nd task:
>>
>> a. Check whether the table has any exclusive lock,
>> if NOT take shared lock and start copy, once done release the
>> shared lock.
>> if YES then wait and retry.
>>
>> For 3rd: Restoring:
>>
>> a. Only Data: Check if there is any lock on the table.
>>  if NO, then take the exclusive lock, insert the data
>> into table, release the lock.
>>  if YES then wait and retry.
>>
>> b. Schema and Data:
>>
>> Check if there is any lock on table/partition.
>>   if NO then Drop and create table/partitions.
>>   if YES then wait and retry.
>>  Once schema is created:
>>   take the exclusive lock, insert data, release lock.
>>
>>
>> Now I'm going to run this kind of job from my scheduler / WF engine.
>> I need input on following questions:
>>
>> a. Does this overall approach look good?
>> b. How can I take and release different locks explicitly using the Hive API?
>> ref: https://cwiki.apache.org/confluence/display/Hive/Locking
>>
>> If I understood correctly, as per this, Hive still doesn't support explicit
>> locking at the API level.
>> Is there any plan or patch to get this done?
>>
>> I saw some classes like *ZooKeeperHiveLock* etc., but I need to dig further
>> to see if I can use these classes for locking features.
>>
>> Thanks for your time and effort.
>>
>> Regards,
>> Manish
>>
>>
>>
>


Re: PK violation during Hive add partition

2012-12-10 Thread Ruslan Al-Fakikh
Hi!

Have you enabled Hive concurrency? Hive should not be accessed concurrently
if the appropriate property is not enabled.
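
For example (a sketch only; the property names are the standard ones, but the
ZooKeeper host names below are placeholders for your own quorum):

SET hive.support.concurrency=true;
SET hive.zookeeper.quorum=zk1.example.com,zk2.example.com,zk3.example.com;  -- placeholder hosts

These can also go into hive-site.xml so that every session picks them up.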

Ruslan

On Sat, Dec 8, 2012 at 6:01 AM, Karlen Lie  wrote:

> nal table, and the query below is run concurrently by multiple oo


Re: "Subject" etiquette

2012-11-22 Thread Ruslan Al-Fakikh
+1


On Thu, Nov 22, 2012 at 6:27 PM, Mohammad Tariq  wrote:

> +1
>
> Regards,
> Mohammad Tariq
>
>
>
> On Thu, Nov 22, 2012 at 7:47 PM, Dean Wampler <
> dean.wamp...@thinkbiganalytics.com> wrote:
>
>> As a service to everyone on this list, please fill in the "Subject" field
>> when you post to the list, especially if you want the rest of us to read
>> your message. Thank you.
>>
>> dean
>>
>> --
>> *Dean Wampler, Ph.D.*
>> thinkbiganalytics.com
>> +1-312-339-1330
>>
>>
>>
>


Re: how to transform the date format in hive?

2012-11-22 Thread Ruslan Al-Fakikh
Hi, also take a look at Hive date functions:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-DateFunctions
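
For example, a rough sketch for the question below (the column and table names
are made up; unix_timestamp() takes a SimpleDateFormat pattern):

SELECT (unix_timestamp(end_ts, 'MM/dd/yyyy HH:mm')
      - unix_timestamp(start_ts, 'MM/dd/yyyy HH:mm')) / 60 AS diff_minutes  -- end_ts/start_ts/some_table are made-up names
FROM some_table;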


On Thu, Nov 22, 2012 at 3:59 AM, Tom Hubina  wrote:

> You could convert them to unix time, which will give you two bigints that you
> can subtract to get seconds, and divide the result by 60 to get minutes, etc.
>
> Tom
>
>
> On Wed, Nov 21, 2012 at 3:15 PM, qiaoresearcher 
> wrote:
>
>> Hi All,
>>
>> Assume we have two time stamps like: 9/15/2002 8:05 and 9/15/2002 9:05
>>
>> If we need to evaluate the difference between these two time stamps in
>> terms of hours (or minutes), how do we do that?
>>
>> Hive seems to only support the format 'yyyy-MM-dd HH:mm:ss'; how can we
>> transform the two time stamps into a format that can be processed by
>> Hive?
>>
>> thanks for any suggestion in advance.
>>
>>
>>
>


Re: hive 0.7.1 Error: Non-Partition column appears in the partition specification

2012-11-16 Thread Ruslan Al-Fakikh
Hey Pavel,

Also note that dynamic partition values are selected by position, not by name,
and are taken as the last columns of the select clause.
So you have to have a column for the author partition in your outermost
'select'.

So your error messages are expected. In the first one you do not have a
column for the author partition in the outer select; in the second you have a
wrong partition declaration, as Nitin mentioned.
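
In other words, the working form of the query keeps the partition column as
the last column of the outer select, as in this thread:

INSERT OVERWRITE TABLE table2 PARTITION (author)
SELECT text, author FROM (SELECT text, 'Tolstoy' AS author FROM table1) tmp;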

Ruslan


On Thu, Nov 15, 2012 at 2:56 PM, Nitin Pawar wrote:

> you are complicating your query a little with AS and tmp tables
>
> you can just write simple query for same
> INSERT OVERWRITE TABLE table2 PARTITION (author)
> SELECT text, author FROM table1
>
> 'Tolstoy' is not a column in table1, so even that fails at query parsing
>
> if you want all the data where the author is Tolstoy, then you can use a
> where clause in the query, but in that case your partition would be the
> static (author='tolstoy')
>
>
> Thanks,
> Nitin
>
>
>
> On Thu, Nov 15, 2012 at 4:13 PM, Павел Мезенцев wrote:
>
>> Thank you for the right idea.
>> It is very strange, but the query that executes normally looks like:
>>
>> INSERT OVERWRITE TABLE table2 PARTITION (author)
>> SELECT text*, author* FROM (SELECT text, 'Tolstoy' AS author FROM
>> table1) tmp;
>>
>> Best regards
>> Mezentsev Pavel
>>
>>
>> 2012/11/15 Nitin Pawar 
>>
>>> when you add data to a partitioned table the partition column name in
>>> insert statement should match the table definition
>>>
>>> so try changing your insert query to "INSERT OVERWRITE TABLE table2
>>> PARTITION (author)"
>>> where author is the column in your table definition
>>>
>>> Thanks,
>>> Nitin
>>>
>>>
>>> On Thu, Nov 15, 2012 at 1:44 PM, Павел Мезенцев wrote:
>>>
 Hello all!

 I have a problem with dynamic partitions in hive 0.7.1.

 For example I have 2 tables:

 CREATE TABLE table1 (text STRING);
 CREATE TABLE table2 (text STRING) PARTITIONED BY (author STRING);

 And make insert into dynamic partition from table1 to table2
 SET hive.exec.dynamic.partition = true;
 SET hive.exec.dynamic.partition.mode = nonstrict;

 Query
 INSERT OVERWRITE TABLE table2 PARTITION (author)
 SELECT text FROM (SELECT text, 'Tolstoy' AS author FROM table1) tmp;

 fails with the error:
 FAILED: Error in semantic analysis: Line 1:23 Cannot insert into target
 table because column number/types are different author: Table insclause-0
 has 2 columns, but query has 1 columns.


 Query:
 INSERT OVERWRITE TABLE table2 PARTITION (new_author)
 SELECT text FROM (SELECT text, 'Tolstoy' AS new_author FROM table1) tmp;

 fails with the error:
 FAILED: Error in semantic analysis: Non-Partition column appears in the
 partition specification:  new_author


 What is happening? Is there any workaround for this problem?

 I know that I can use the static partition author = 'Tolstoy', but my real
 query is more complex and the dynamic partition is calculated from several
 input fields.

 Best regards
 Mezentsev Pavel
 Moscow.


>>>
>>>
>>> --
>>> Nitin Pawar
>>>
>>>
>>
>
>
> --
> Nitin Pawar
>
>


Re: How to conver into date in hive

2012-10-16 Thread Ruslan Al-Fakikh
Probably you could hardcode the mapping in a conditional expression, e.g.
Hive's if() function or a CASE ... WHEN expression.
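
For example, a hardcoded sketch (the column and table names are made up):

SELECT CASE month_num            -- month_num/some_table are made-up names
         WHEN '1' THEN 'Jan'
         WHEN '2' THEN 'Feb'
         WHEN '3' THEN 'Mar'
         -- ...and so on up to '12'
       END AS month_name
FROM some_table;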

Ruslan

On Tue, Oct 16, 2012 at 10:16 AM,   wrote:
> Hi all,
>
> Is there any way to convert month numbers into names?
> Like:
>
> 1   to   Jan
> 2   to   Feb
> 3   to   Mar
>
> and so on (here 1, 2, 3 have the string data type)
>
>
> Please suggest.
>
> Thanks & Regards
> Yogesh Kumar
>


Re: Multiple Hive Connection Issues

2012-10-10 Thread Ruslan Al-Fakikh
Just a note:

There is also HiveServer2, which fixes the concurrency issues, and it is
included in CDH 4.

On Tue, Oct 9, 2012 at 4:24 PM, nagarjuna kanamarlapudi
 wrote:
> Hi,
>
>
> I have a requirement of using multiple hive connections simultaneously to
> run multiple queries in parallel. I use JDBC Client to get the connection.
> (hive-0.7.1)
>
> Just going through one of the bugs of hive server at the link
> https://cwiki.apache.org/Hive/hiveserver.html
>
>
> The link says, "HiveServer can not handle concurrent requests from more
> than one client." Are we talking about multiple clients like JDBC, ODBC,
> Thrift Java client, CLI, HWI? If yes, should the Hive server handle multiple
> connections from the same client simultaneously?
>
>
>
> Regards,
> Nagarjuna


Re: On hive date functions

2012-10-09 Thread Ruslan Al-Fakikh
Hi

Try the hour() function from here:
https://cwiki.apache.org/Hive/languagemanual-udf.html#LanguageManualUDF-DateFunctions
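
For example, to group by hour you can combine to_date() and hour(). A sketch
with made-up column and table names:

SELECT to_date(ts) AS day, hour(ts) AS hr, count(*) AS cnt  -- ts/some_table are made-up names
FROM some_table
GROUP BY to_date(ts), hour(ts);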

Ruslan

On Tue, Oct 9, 2012 at 7:54 AM, Matthieu Labour  wrote:
> Hi
> Is it possible with Hive to truncate date to a specified precision?
> For example in Postgresql date_trunc('hour',timestamp '2001-02-16 20:38:40')
> will return 2001-02-16 20:00:00
> There is the to_date function in hive
> I am trying to achieve the following
> select distinct date_trunc('hour', timestamp) as hour, count(*) from table
> group by hour;
> Thank you for your help
> Matthieu
>


Re: Error on hive web interface

2012-09-28 Thread Ruslan Al-Fakikh
Hey,

Are you using Cloudera's distribution? If yes, as far as I know, they don't
support the Hive Web Interface and recommend Hue instead.

Best Regards

On Thu, Sep 27, 2012 at 10:38 PM, Germain Tanguy
 wrote:
> Hi
>
> I am a new user of Hive, on version 0.9.0. I am trying to use the Hive web
> interface and I get this error:
>
>
> 12/09/27 11:05:02 INFO hwi.HWIServer: HWI is starting up
> 12/09/27 11:05:02 INFO mortbay.log: Logging to 
> org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via 
> org.mortbay.log.Slf4jLog
> 12/09/27 11:05:02 INFO mortbay.log: jetty-6.1.14
> 12/09/27 11:05:02 INFO mortbay.log: Extract 
> jar:file:/Applications/hive-0.9.0/lib/hive-hwi-0.9.0.war!/ to 
> /var/folders/th/1f2knntx5kv5t866bnwlrr10gn/T/Jetty_0_0_0_0__hive.hwi.0.9.0.war__hwi__2l99ri/webapp
> 12/09/27 11:05:02 WARN mortbay.log: Failed startup of context 
> org.mortbay.jetty.webapp.WebAppContext@68de462{/hwi,jar:file:/Applications/hive-0.9.0/lib/hive-hwi-0.9.0.war!/}
> java.util.zip.ZipException: error in opening zip file
> at java.util.zip.ZipFile.open(Native Method)
> at java.util.zip.ZipFile.<init>(ZipFile.java:127)
> at java.util.jar.JarFile.<init>(JarFile.java:135)
> at java.util.jar.JarFile.<init>(JarFile.java:99)
> at 
> org.mortbay.jetty.webapp.TagLibConfiguration.configureWebApp(TagLibConfiguration.java:168)
> at 
> org.mortbay.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1231)
> at 
> org.mortbay.jetty.handler.ContextHandler.doStart(ContextHandler.java:517)
> at 
> org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:460)
> at 
> org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
> at 
> org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java:130)
> at 
> org.mortbay.jetty.handler.RequestLogHandler.doStart(RequestLogHandler.java:115)
> at 
> org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
> at 
> org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java:130)
> at org.mortbay.jetty.Server.doStart(Server.java:222)
> at 
> org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
> at org.apache.hadoop.hive.hwi.HWIServer.start(HWIServer.java:102)
> at org.apache.hadoop.hive.hwi.HWIServer.main(HWIServer.java:132)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
> 12/09/27 11:05:02 INFO mortbay.log: Started SocketConnector@0.0.0.0:
>
> Does someone have an idea where this comes from and how to fix it?
>
> Thanks for your help,
>
> Regards,
>
> Germain Tanguy.


Re: Hive File Sizes, Merging, and Splits

2012-09-26 Thread Ruslan Al-Fakikh
Hi,

Can you look up the file names processed by each mapper? You can do so by
looking at the running task UI, in the status column. Also, which split
property do you mean? Can you share your job's console output?
The best recommended way is to use a splittable format like Avro,
SequenceFiles, indexed LZO, etc. That way you won't have to worry about
file sizes and can just make the files big.
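
For example, a minimal sketch of converting a table to SequenceFile storage
(the table and column names below are made up):

CREATE TABLE events_seq (line STRING) STORED AS SEQUENCEFILE;  -- made-up names
INSERT OVERWRITE TABLE events_seq SELECT line FROM events;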

Best Regards

On Wed, Sep 26, 2012 at 4:12 AM, Connell, Chuck
 wrote:
> But remember that you are running on parallel machines. Depending on the
> hardware configuration, more map tasks is BETTER.
>
>
> 
> From: John Omernik [j...@omernik.com]
> Sent: Tuesday, September 25, 2012 7:11 PM
> To: user@hive.apache.org
> Subject: Re: Hive File Sizes, Merging, and Splits
>
> Isn't there an overhead associated with each map task? Based on that, my
> hypothesis is that if I pay attention to my data, merge up small files after
> load, and ensure split sizes are close to file sizes, I can keep the number
> of map tasks to an absolute minimum.
>
>
> On Tue, Sep 25, 2012 at 2:35 PM, Connell, Chuck 
> wrote:
>>
>> Why do you think the current generated code is inefficient?
>>
>>
>>
>>
>>
>>
>>
>> From: John Omernik [mailto:j...@omernik.com]
>> Sent: Tuesday, September 25, 2012 2:57 PM
>> To: user@hive.apache.org
>> Subject: Hive File Sizes, Merging, and Splits
>>
>>
>>
>> I am really struggling trying to make heads or tails out of how to
>> optimize the data in my tables for the best query times. I have a partition
>> that is compressed (gzip) RCFile data in two files
>>
>>
>>
>> total 421877
>>
>> 263715 -rwxr-xr-x 1 darkness darkness 270044140 2012-09-25 13:32 00_0
>>
>> 158162 -rwxr-xr-x 1 darkness darkness 161956948 2012-09-25 13:32 01_0
>>
>>
>>
>>
>>
>>
>>
>> No matter what I set my split settings to prior to the job, I always get
>> three mappers. My block size is 268435456, but the setting doesn't seem to
>> change anything. I can set the split size huge or small with no apparent
>> effect on the data.
>>
>>
>>
>>
>>
>> I know there are many esoteric items here, but is there any good
>> documentation on setting these things to make my queries on this data more
>> efficient? I am not sure why it needs three map tasks on this data; it
>> should really just grab two mappers. Not to mention, I thought gzip wasn't
>> splittable anyhow. So, from that standpoint, how does it even send data to
>> three mappers? If you know of some secret cache of documentation for Hive,
>> I'd love to read it.
>>
>>
>>
>> Thanks
>>
>>
>
>


Incomplete example on page cwiki.apache.org/Hive/compressedstorage.html

2012-09-17 Thread Ruslan Al-Fakikh
Hey guys,

I spent a lot of time figuring out that when I use the example from the page
cwiki.apache.org/Hive/compressedstorage.html:
SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK; -- NONE/RECORD/BLOCK (see below)
INSERT OVERWRITE TABLE raw_sequence SELECT * FROM raw;

the setting
SET io.seqfile.compression.type=BLOCK
actually doesn't work, and I had to do
set mapred.output.compression.type=BLOCK;
because the default for mapred.output.compression.type is RECORD.
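
So the variant of that example that worked for me is:

SET hive.exec.compress.output=true;
SET mapred.output.compression.type=BLOCK;
INSERT OVERWRITE TABLE raw_sequence SELECT * FROM raw;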

Please update the page

Ruslan


Re: How to overwrite a file inside hive table folder

2012-09-17 Thread Ruslan Al-Fakikh
Hi,

I think you could try to drop the partition and create a new one if your
table is partitioned.
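
For example (a rough sketch; the table name, partition column and path are made up):

ALTER TABLE some_table DROP PARTITION (dt='2012-09-16');   -- made-up names
ALTER TABLE some_table ADD PARTITION (dt='2012-09-16') LOCATION '/path/to/new/data';  -- placeholder path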

Ruslan

On Mon, Sep 17, 2012 at 8:12 AM, MiaoMiao  wrote:
> What do you mean by "a file"? If your HDFS dir contains several files
> and you want to overwrite one, then there is no way to do it with HiveQL.
> You can check out this link and see if it suits you.
>
> https://cwiki.apache.org/Hive/languagemanual-dml.html#LanguageManualDML-Writingdataintofilesystemfromqueries
>
> My way of exporting a HiveQL result as a single file is this:
> echo "some,column,name">some.file
> hive -S -e "select some, column, name from some_table"|sed -e
> "s/\t/,/" >> some.file
> On Sun, Sep 16, 2012 at 12:17 AM, Ramasubramanian
>  wrote:
>> Hi,
>>
>> We could not overwrite a file inside a Hive table folder. What is the command
>> to overwrite it?
>>
>> Regards,
>> Rams


Re: How to use the IF conditional function in Hive scripts

2012-09-13 Thread Ruslan Al-Fakikh
Hi,

I guess this goes beyond Hive scripts. You can use some kind of external
automation tool like Oozie, or a wrapper shell script.

Ruslan

On Thu, Sep 13, 2012 at 3:49 PM, Amila Maha Arachchi
 wrote:
> Hi,
>
> I am trying to write a Hive script which does some summarization. There
> are two summarizations: hourly and daily. I want to run the hourly
> summarization first, then check whether the current timestamp matches some
> value of my choosing (let's say 2012-09-13 23:00:00) and run the daily
> summarization.
>
> My question is, can I use the IF conditional function which is mentioned in
> [1]? I could not find any examples of this.
>
>
> [1]
> https://cwiki.apache.org/Hive/languagemanual-udf.html#LanguageManualUDF-ConditionalFunctions
>
> Thanks,
> Amila.


Re: Hive UDF intialization

2012-09-04 Thread Ruslan Al-Fakikh
Ravi,

It looks like you are missing the
ADD JAR ...
command
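
For example (the jar path below is just a placeholder):

ADD JAR /path/to/your-udfs.jar;   -- placeholder path
CREATE TEMPORARY FUNCTION ip_2_city AS "udfs.common.IP_2_GEO";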

Ruslan

On Tue, Sep 4, 2012 at 6:45 PM, Edward Capriolo  wrote:
> You could start with this:
>
> https://github.com/edwardcapriolo/hive-geoip
>
> On Tue, Sep 4, 2012 at 10:42 AM, Ravi Shetye  wrote:
>> Hi
>> I am trying to register a java udf which looks like
>>
>> public final class IP_2_GEO extends UDF {
>> String geo_file;
>> String geo_type;
>> public IP_2_GEO(String geo_file, String geo_type) {
>> this.geo_file = geo_file;
>> this.geo_type = geo_type;
>> }
>>
>> public Text evaluate(final Text ip) {
>> //IP to geo conversion based on the level passed in 'geo_type' and by look
>> up in 'geo_file'
>>
>> return new Text(geoString);
>> }
>> }
>>
>> I had a similar UDF in Pig and could register it by using
>> define ip_2_city com.udfs.common.IP_2_GEO('$geo_file', 'city');
>> define ip_2_country com.udfs.common.IP_2_GEO('$geo_file', 'country');
>>
>> but in Hive I am not able to register and initialize the UDF
>>
>> hive> create temporary function ip_2_city as
>> "udfs.common.IP_2_GEO('/mnt/ravi/GeoLiteCity.dat','city')";
>> FAILED: Class udfs.common.IP_2_GEO('/mnt/ravi/GeoLiteCity.dat','city') not
>> found
>> FAILED: Execution Error, return code 1 from
>> org.apache.hadoop.hive.ql.exec.FunctionTask
>> hive> create temporary function ip_2_city as "udfs.common.IP_2_GEO()";
>> FAILED: Class udfs.common.IP_2_GEO() not found
>> FAILED: Execution Error, return code 1 from
>> org.apache.hadoop.hive.ql.exec.FunctionTask
>> hive> create temporary function ip_2_city as "udfs.common.IP_2_GEO";
>> OK
>> Time taken: 0.0080 seconds
>>
>>
>> 1) Is there an alternate way to achieve what I expect from the command
>>
>> hive> create temporary function ip_2_city as
>> "udfs.common.IP_2_GEO('/mnt/ravi/GeoLiteCity.dat','city')"; ?
>>
>> 2) What is the common practice of converting ip to City and Country in hive?
>> --
>> RAVI SHETYE



-- 
Best Regards,
Ruslan Al-Fakikh


Re: Hive sort by using a single reducer

2012-09-04 Thread Ruslan Al-Fakikh
Hi

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SortBy#LanguageManualSortBy-DifferencebetweenSortByandOrderBy
Sort By will give you only partially sorted results if you have more
than one reducer.
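
If a total order over the top rows is really needed, ORDER BY is what gives
it. A sketch based on the query in this thread (note that the final ORDER BY
still goes through a single reducer):

INSERT INTO TABLE ddb_table
SELECT * FROM data_dump ORDER BY rank DESC LIMIT 1000000;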

Ruslan

On Mon, Sep 3, 2012 at 1:38 AM, Binesh Gummadi  wrote:
> Thanks for your quick reply. Rank is a column which has integer data. I am
> writing to a DynamoDB database, though. Not sure why only a single reducer
> is used. I will check the SQL with the explain command again and will report
> my findings. I will check your implementation too.
>
> 
> Binesh Gummadi
>
>
>
>
> On Sun, Sep 2, 2012 at 4:01 PM, Edward Capriolo 
> wrote:
>>
>>
>> Sort by does not have the single reducer restriction. Not sure which rank
>> you are using, but any one should allow you to sort and rank if the query is
>> written correctly. Our implementation on my github.com/edwardcapriolo allows
>> this.
>>
>> On Sunday, September 2, 2012, Binesh Gummadi 
>> wrote:
>> > I am trying to insert data into a table after selecting and sorting by a
>> > column. What I really want is order by a column and select the top million
>> > rows. I am using Amazon EMR hive cloud to process data.
>> > Here is my query
>> > INSERT INTO TABLE ddb_table SELECT * FROM data_dump sort by rank desc
>> > LIMIT 100;
>> > It creates two jobs. The first job runs rather quickly and the second
>> > job's reducer runs forever as it is running with a single reducer. Here is my
>> > question on
>> > stackoverflow(http://stackoverflow.com/questions/12233343/why-is-sort-by-always-using-single-reducer).
>> > According to docs "order by" clause has a limitation of 1 reducer. Does
>> > sort by has same limitation? Are there any other ways of solving the above
>> > requirement?
>> > 
>> > Binesh Gummadi
>> >
>> >
>
>



-- 
Best Regards,
Ruslan Al-Fakikh


RE: Continuous log analysis requires 'dynamic' partitions, is that possible?

2012-07-25 Thread Ruslan Al-fakikh
Bertrand,

 

Sorry, I don't have a link to the msck documentation. I haven't tried it
myself, I've just heard of it.

 

Thanks

 

From: Bertrand Dechoux [mailto:decho...@gmail.com] 
Sent: Wednesday, July 25, 2012 1:23 PM
To: user@hive.apache.org
Subject: Re: Continuous log analysis requires 'dynamic' partitions, is that
possible?

 

usage of msck :
msck table 
msck repair table 

BUT that won't help me.

I am using an external table with 'external' partitions (which do not follow
hive conventions).
So I first create an external table without a location and then I specify every
partition with an absolute location.
I don't think there is another way given my constraints. But if there is, I
will gladly read it.

So, with the current implementation and with regards to the parameters that
can be used with the current hive commands :
1) hive has no way to list table directory
2) hive has no way to understand which variable should be used for each
partitioning level

Conclusion: the only solution at the moment is to declare partitions for
hive in advance (thanks Edward). But that means that I do have to handle the
'synchronisation' of two 'pseudo' file trees: hdfs and hive partitions.

Bertrand

On Wed, Jul 25, 2012 at 10:51 AM, Bertrand Dechoux 
wrote:

@Puneet Khatod : I found that out. And that's why I am asking here. I guess
non AWS users might have the same problems and a way to solve it.

@Ruslan Al-fakikh : It seems great. Is there any documentation for msck? I
will find out with the diff file but is there a wiki page or a blog post
about it? It would be best. I could not find any.

@Edward Capriolo : I now feel silly. This is clearly a better approach than
my proposed hacks. The performance impact should be negligible, even more
when ensuring partition pruning. I am using hive to 'piggy back' on an
external way of writing data. So in my case, I could indeed tell in advance
to hive where the data will be written. (Same as you say but the logic is
reverse.) I guess I skipped over alter table touch. But it would not help
me. The partitions are external. And if I add partitions, I will do it with
cron and a shell file.

Bertrand





On Tue, Jul 24, 2012 at 7:24 PM, Edward Capriolo 
wrote:

Alter table touch will create partitions even if they have no data,
You can also just create partitions ahead of time and have your code
"know" where to write data.



On Tue, Jul 24, 2012 at 12:35 PM, Ruslan Al-fakikh
 wrote:
> If you are not using Amazon take a look at this:
>
> https://issues.apache.org/jira/browse/HIVE-874
>
>
>
> Ruslan
>
>
>
> From: Puneet Khatod [mailto:puneet.kha...@tavant.com]
> Sent: Tuesday, July 24, 2012 8:32 PM
> To: user@hive.apache.org
> Subject: RE: Continuous log analysis requires 'dynamic' partitions, is
that
> possible?
>
>
>
> If you are using Amazon (AWS), you can use 'recover partitions' to enable
> all top level partitions.
>
> This will add required dynamicity.
>
>
>
> Regards,
>
> Puneet Khatod
>
>
>
> From: Bertrand Dechoux [mailto:decho...@gmail.com]
> Sent: 24 July 2012 21:15
> To: user@hive.apache.org
> Subject: Continuous log analysis requires 'dynamic' partitions, is that
> possible?
>
>
>
> Hi,
>
> Let's say logs are stored inside hdfs using the following file tree
> ///.
> So for apache, that would be :
> /apache/01/01
> /apache/01/02
> ...
> /apache/02/01
> ...
>
> I would like to know how to define a table for this information. I found
out
> that the table should be external and should be using partitions.
> However, I did not find any way to dynamically create the partitions. Is
> there no automatic way to define them?
> In that case, the partition 'template' would be / with the
root
> being apache.
>
> I know how to 'hack a fix' : create a script which would generate all the
> "add partition statement" and run the resulting statements without caring
> about the results because partitions may not exist or may already have
been
> added. Better, I could parse the result of 'show partition' for the table
> and run only the relevant statement but it still feels like a hack.
>
> Is there any clean way to do it?
>
> Regards,
>
> Bertrand Dechoux
>





-- 
Bertrand Dechoux




-- 
Bertrand Dechoux



RE: Continuous log analysis requires 'dynamic' partitions, is that possible?

2012-07-24 Thread Ruslan Al-fakikh
If you are not using Amazon take a look at this:

https://issues.apache.org/jira/browse/HIVE-874

 

Ruslan

 

From: Puneet Khatod [mailto:puneet.kha...@tavant.com] 
Sent: Tuesday, July 24, 2012 8:32 PM
To: user@hive.apache.org
Subject: RE: Continuous log analysis requires 'dynamic' partitions, is that
possible?

 

If you are using Amazon (AWS), you can use 'recover partitions' to enable
all top level partitions.

This will add required dynamicity.

 

Regards,

Puneet Khatod

 

From: Bertrand Dechoux [mailto:decho...@gmail.com] 
Sent: 24 July 2012 21:15
To: user@hive.apache.org
Subject: Continuous log analysis requires 'dynamic' partitions, is that
possible?

 

Hi,

Let's say logs are stored inside hdfs using the following file tree
///.
So for apache, that would be :
/apache/01/01
/apache/01/02
...
/apache/02/01
...

I would like to know how to define a table for this information. I found out
that the table should be external and should be using partitions.
However, I did not find any way to dynamically create the partitions. Is
there no automatic way to define them?
In that case, the partition 'template' would be / with the root
being apache.

I know how to 'hack a fix' : create a script which would generate all the
"add partition statement" and run the resulting statements without caring
about the results because partitions may not exist or may already have been
added. Better, I could parse the result of 'show partition' for the table
and run only the relevant statement but it still feels like a hack.

Is there any clean way to do it?

Regards,

Bertrand Dechoux




Re: Hive upload

2012-07-04 Thread Ruslan Al-Fakikh
>
> Hi Yogesh
>
> The first issue (sqoop one).
> 1) Is the table newhive listed when you list tables using 'show tables'?
> 2) Are you seeing a directory 'newhive' in your Hive warehouse dir (usually
> /user/hive/warehouse)?
>
> If not, Sqoop is failing to create the Hive tables / load data into them;
> only the Sqoop import to HDFS is succeeding, and the Hive part is failing.
>
> If Hive in standalone mode works as desired, you need to check the Sqoop
> configuration.
>
> Regarding the second issue, can you check the storage location of NewTable
> and see whether there are files within. If so, then do a 'cat' of those
> files and see whether they have the correct data format.
>
> You can get the location of your table from the following command
> describe formatted NewTable;
> Regards
> Bejoy KS
>
> Sent from handheld, please excuse typos.
> 
> From: yogesh dhari 
> Date: Wed, 4 Jul 2012 11:09:02 +0530
> To: hive request
> ReplyTo: user@hive.apache.org
> Subject: Hive upload
>
> Hi all,
>
> I am trying to upload tables from an RDBMS to Hive through Sqoop. The Hive
> import completes successfully, but I didn't find any table in Hive; the
> imported table gets uploaded into the HDFS dir /user/hive/warehouse.
> I want it to be present in Hive. I used this command:
>
> sqoop import --connect jdbc:mysql://localhost:3306/Demo --username sqoop1
> --password SQOOP1 -table newone --hive-table newhive --create-hive-table
> --hive-import --target-dir /user/hive/warehouse/new
>
>
> And another thing:
> If I upload any file or table from HDFS or from local, then it uploads, but
> the data doesn't show in the Hive table.
>
> If I run the command
> Select * from NewTable;
> it returns
>
> NULL    NULL    NULL    NULL
>
>
> although the real data is
>
> Yogesh    4    Bangalore    1234
>
>
> Please Suggest and help
>
> Regards
> Yogesh Kumar
>
>
>
>



-- 
Best Regards,
Ruslan Al-Fakikh


Re: Quering RDBMS table in a Hive query

2012-06-18 Thread Ruslan Al-Fakikh
Bejoy,

Again, I do understand those two steps, and I do understand that I
have a lot of options for making them run in sequence, but from the
very beginning my point was to avoid having two steps. I want to have
a dataset in the Hive warehouse that I could query at any time with
just a Hive query, without any preliminary imports/queries. So
implementing a custom UDF/InputFormat looks best for now, except for
having too many RDBMS connections (one connection per mapper, as far as
I understand).

Thanks

On Sat, Jun 16, 2012 at 6:04 AM, Bejoy KS  wrote:
> Hi Ruslan
>
> The solution Esteban pointed out was
> 1. Import the lookup data from the RDBMS to HDFS/Hive (you can fire any ad hoc
> query here). If the data is just a few MBs, one or two maps/connections are enough.
>
> 2. A lookup on this smaller data can be achieved by joining it
> with the larger table.
>
> Now since the lookup table is small, enable map joins so that the lookup
> table is in the distributed cache and that data is used by map tasks for the join.
>
> The two sequential steps mentioned above can be scheduled using a workflow 
> manager as oozie.
>
> In simple terms you can place these steps in order in a shell script and just 
> execute the script.
>
>
> Regards
> Bejoy KS
>
> Sent from handheld, please excuse typos.
>
> -Original Message-
> From: Ruslan Al-Fakikh 
> Date: Sat, 16 Jun 2012 04:40:36
> To: 
> Reply-To: user@hive.apache.org
> Subject: Re: Quering RDBMS table in a Hive query
>
> Hi Esteban,
>
> Your solution is what I am trying to avoid, having to keep the hdfs
> data up-to-date. I know I can easily schedule a dependency between the
> Sqoop import job and the hive query job and currently we have a
> scheduling tool (opswise) for such things. But what if I just want to
> run an ad hoc query and forget to re-import the lookup data, etc?
> Maybe there is a way to put the Sqoop import as a hook for a
> particular hive table making it run before every query?
> But I understand the problem of having too many connections. I would
> like to have it only once and distribute it over all the mappers in a
> distributed cache or something like it. Isn't there a way for it?
>
> Ruslan
>
> On Fri, Jun 15, 2012 at 9:43 PM, Esteban Gutierrez  
> wrote:
>> Hi Ruslan,
>>
>> Jan's approach sounds like a good workaround only if you can use the output
>> in a mapjoin, but I don't think it will scale nicely if you have a very
>> large number of  tasks since that will translate as  DB connections to
>> MySQL. I think a more scalable and reliable way is just to schedule an Oozie
>> workflow to transfer the data from MySQL to HDFS using Sqoop and trigger the
>> Hive query once the transfer was done.
>>
>> cheers!
>> esteban.
>>
>> --
>> Cloudera, Inc.
>>
>>
>>
>>
>> On Fri, Jun 15, 2012 at 10:28 AM, Ruslan Al-Fakikh 
>> wrote:
>>>
>>> Thanks Jan
>>>
>>> On Fri, Jun 15, 2012 at 4:35 PM, Jan Dolinár  wrote:
>>> > On 6/15/12, Ruslan Al-Fakikh  wrote:
>>> >> I didn't know InputFormat and LineReader could help, though I didn't
>>> >> look at them closely. I was thinking about implementing a
>>> >> Table-Generating Function (UDTF) if there is no an already implemented
>>> >> solution.
>>> >
>>> > Both is possible, InputFormat and/or UD(T)F. It all depends on what
>>> > you need. I actually use both - in Input format I load lists of
>>> > allowed values to check the data and in UDF I query some other
>>> > database for values necessary only in some queries. Generally, I'd use
>>> >  InputFormat for situations where all jobs over given table would
>>> > require the additional data from RDBMS. Oppositely, in situations
>>> > where only few jobs out of many requires the RDBMS connection, I would
>>> > use UDF.
>>> >
>>> > I think that the difference in performance between the two is rather
>>> > small, if any. Also UDF is easier to write, so it might be the "weapon
>>> > of choice", at least if you don't already use custom InputFormat.
>>> >
>>> > Jan
>>
>>


Re: Quering RDBMS table in a Hive query

2012-06-15 Thread Ruslan Al-Fakikh
Hi Esteban,

Your solution is what I am trying to avoid: having to keep the HDFS
data up-to-date. I know I can easily schedule a dependency between the
Sqoop import job and the Hive query job, and currently we have a
scheduling tool (Opswise) for such things. But what if I just want to
run an ad hoc query and forget to re-import the lookup data, etc.?
Maybe there is a way to register the Sqoop import as a hook for a
particular Hive table so that it runs before every query?
But I understand the problem of having too many connections. I would
like to make the connection only once and distribute the data over all the
mappers in a distributed cache or something like it. Is there a way to do that?

Ruslan

On Fri, Jun 15, 2012 at 9:43 PM, Esteban Gutierrez  wrote:
> Hi Ruslan,
>
> Jan's approach sounds like a good workaround only if you can use the output
> in a mapjoin, but I don't think it will scale nicely if you have a very
> large number of  tasks since that will translate as  DB connections to
> MySQL. I think a more scalable and reliable way is just to schedule an Oozie
> workflow to transfer the data from MySQL to HDFS using Sqoop and trigger the
> Hive query once the transfer was done.
>
> cheers!
> esteban.
>
> --
> Cloudera, Inc.
>
>
>
>
> On Fri, Jun 15, 2012 at 10:28 AM, Ruslan Al-Fakikh 
> wrote:
>>
>> Thanks Jan
>>
>> On Fri, Jun 15, 2012 at 4:35 PM, Jan Dolinár  wrote:
>> > On 6/15/12, Ruslan Al-Fakikh  wrote:
>> >> I didn't know InputFormat and LineReader could help, though I didn't
>> >> look at them closely. I was thinking about implementing a
>> >> Table-Generating Function (UDTF) if there is no an already implemented
>> >> solution.
>> >
>> > Both is possible, InputFormat and/or UD(T)F. It all depends on what
>> > you need. I actually use both - in Input format I load lists of
>> > allowed values to check the data and in UDF I query some other
>> > database for values necessary only in some queries. Generally, I'd use
>> >  InputFormat for situations where all jobs over given table would
>> > require the additional data from RDBMS. Oppositely, in situations
>> > where only few jobs out of many requires the RDBMS connection, I would
>> > use UDF.
>> >
>> > I think that the difference in performance between the two is rather
>> > small, if any. Also UDF is easier to write, so it might be the "weapon
>> > of choice", at least if you don't already use custom InputFormat.
>> >
>> > Jan
>
>


Re: Quering RDBMS table in a Hive query

2012-06-15 Thread Ruslan Al-Fakikh
Thanks Jan

On Fri, Jun 15, 2012 at 4:35 PM, Jan Dolinár  wrote:
> On 6/15/12, Ruslan Al-Fakikh  wrote:
>> I didn't know InputFormat and LineReader could help, though I didn't
>> look at them closely. I was thinking about implementing a
>> Table-Generating Function (UDTF) if there is no an already implemented
>> solution.
>
> Both is possible, InputFormat and/or UD(T)F. It all depends on what
> you need. I actually use both - in Input format I load lists of
> allowed values to check the data and in UDF I query some other
> database for values necessary only in some queries. Generally, I'd use
>  InputFormat for situations where all jobs over given table would
> require the additional data from RDBMS. Oppositely, in situations
> where only few jobs out of many requires the RDBMS connection, I would
> use UDF.
>
> I think that the difference in performance between the two is rather
> small, if any. Also UDF is easier to write, so it might be the "weapon
> of choice", at least if you don't already use custom InputFormat.
>
> Jan


Re: Quering RDBMS table in a Hive query

2012-06-15 Thread Ruslan Al-Fakikh
Thanks Jan,

I didn't know InputFormat and LineReader could help, though I didn't
look at them closely. I was thinking about implementing a
Table-Generating Function (UDTF) if there is no already-implemented
solution.

Ruslan

On Thu, Jun 14, 2012 at 10:03 AM, Jan Dolinár  wrote:
> Hi Ruslan,
>
> I've been in a similar situation and solved it by writing a custom
> InputFormat and LineReader that load the data from MySQL in the
> constructor. In my case I use it just to check value ranges and
> similar stuff. If you want to join the data with what's in your HDFS
> files, you can do that as well; InputFormat allows you to add the
> columns easily. I'm not sure how well this solution would behave for
> bigger data, but for small data (I load about 5 tables, ~100 lines
> each) it works just fine.
>
> Best Regards,
> Jan
>
>
>
> On 6/13/12, Ruslan Al-Fakikh  wrote:
>> Hello to everyone,
>>
>> I need to join hdfs data with little data taken from RDBMS. A possible
>> solution is to import RDBMS data to a regular hive table using Sqoop,
>> but this way I'll have to keep that imported hive table up-to-date
>> which means that I will have to update it every time before joining in
>> a query.
>> Is there a way to load RDBMS data on the fly? Maybe a UDF which would
>> take RDBMS connection properties and load the data?
>>
>> Thanks in advance,
>> Ruslan Al-Fakikh
>>



-- 
Best Regards,
Ruslan Al-Fakikh


Quering RDBMS table in a Hive query

2012-06-13 Thread Ruslan Al-Fakikh
Hello to everyone,

I need to join HDFS data with a small amount of data taken from an RDBMS. A
possible solution is to import the RDBMS data into a regular Hive table using
Sqoop, but this way I'll have to keep that imported Hive table up-to-date,
which means that I will have to update it every time before joining it in
a query.
Is there a way to load the RDBMS data on the fly? Maybe a UDF which would
take RDBMS connection properties and load the data?

Thanks in advance,
Ruslan Al-Fakikh


Hadoop Russia user group

2012-05-31 Thread Ruslan Al-Fakikh
Hi everyone,

I've created a LinkedIn group for Russian-speaking folks. It is
about Hadoop and its components (Hive, Pig, etc.).

http://www.linkedin.com/groups/Hadoop-Russia-4468740?gid=4468740

Thanks,
Ruslan Al-Fakikh