Re: Possible to specify reducers for each stage?

2013-07-02 Thread Navis류승우
Currently it's not. https://issues.apache.org/jira/browse/HIVE-3946

2013/7/3 Felix.徐 :
> Hi all,
>
> Is it possible to specify the number of reducers for each stage? How?
>
> thanks!
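
For context while that JIRA is open: Hive reads a single reducer setting for
the whole query, so the usual workaround is to split the query into separate
statements and set the count before each one. A minimal sketch, with
hypothetical table names (src, stage1_out):

  -- stage 1: force 100 reducers for the heavy aggregation
  SET mapred.reduce.tasks=100;
  CREATE TABLE stage1_out AS
  SELECT key, count(*) AS cnt FROM src GROUP BY key;

  -- stage 2: the follow-up job only needs a few reducers
  SET mapred.reduce.tasks=10;
  SELECT cnt, count(*) FROM stage1_out GROUP BY cnt;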


Re: Unsubscribe

2013-07-02 Thread tofunmibabatunde
Unsubscribe
Sent from my BlackBerry® wireless handheld from Glo Mobile.

-Original Message-
From: Sundar 
Date: Wed, 3 Jul 2013 09:32:22 
To: 
Reply-To: user@hive.apache.org
Subject: Unsubscribe

Unsubscribe



Possible to specify reducers for each stage?

2013-07-02 Thread Felix.徐
Hi all,

Is it possible to specify the number of reducers for each stage? How?

thanks!


Unsubscribe

2013-07-02 Thread Sundar
Unsubscribe


Re: Hive Compile Error: WrappedRuntimeException

2013-07-02 Thread Xun Tang
To close the loop, I think the problem came from running a 32-bit Ubuntu VM
on VMware inside a Windows 7 host.
After replacing the 32-bit VM with a 64-bit one, the problem is gone...


On Sun, Jun 30, 2013 at 9:18 AM, Xun Tang  wrote:

> Hi,
>
> I am trying to compile hive 0.10.
>
> Build fails with error ‘WrappedRuntimeException’, related to this jira:
>
> https://issues.apache.org/jira/browse/HIVE-4231
>
> The jira says ‘easy workaround: unset _JAVA_OPTIONS’.
>
> However, I didn’t set any java options, nor did the command ‘unset
> _JAVA_OPTIONS’ change anything. Same error.
>
> Here is more info.
>
> OS: Ubuntu 12.10
>
> Java: sun 1.6.0_45
>
> Any suggestion welcome.
>
> Thanks,
>
> Alice
>


Re: Override COUNT() function

2013-07-02 Thread Navis류승우
As you expected, there is no documentation on it (like the other optimizers).

The Javadoc of the class might be helpful, but it doesn't seem detailed enough.

2013/7/2 Peter Marron :
> Thanks Navis,
>
> This is a very interesting class which I feel pretty sure that I would never 
> have found.
> Are  there any descriptions, motivations, documentation or examples anywhere?
> I suspect that there's nothing other than the source itself, but I had to ask.
>
> Regards,
>
> Z
> -Original Message-
> From: Navis류승우 [mailto:navis@nexr.com]
> Sent: 02 July 2013 08:50
> To: user@hive.apache.org
> Subject: Re: Override COUNT() function
>
> MetadataOnlyOptimizer changes a GBY on partition columns to a simple TableScan
> with a one-line dummy.
>
> I think similar things can be done with stats.
>
> 2013/6/28 Peter Marron :
>> Hi,
>>
>>
>>
>> I feel sure that someone has asked for this before, but here goes…
>>
>>
>>
>> In the case where I have the query
>>
>>
>>
>> SELECT COUNT(*) FROM table;
>>
>>
>>
>> There are many cases where I can determine the count immediately.
>>
>> (For example if I have run something like:
>>
>>
>>
>> ANALYZE TABLE tablename [PARTITION(partcol1[=val1], partcol2[=val2],
>> ...)] COMPUTE STATISTICS [noscan];
>>
>>
>>
>> then there seems to be a table property “numRows” which holds a count
>> of the number of rows.
>>
>> Now I know that the COUNT function can’t always be determined easily.
>>
>> If the query is more complicated, like
>>
>>
>>
>> SELECT COUNT(*) FROM table GROUP BY column;
>>
>>
>>
>> then obviously a simple scalar count is of no real use. But is there
>> some way
>>
>> to intercept the simple case and avoid running a table scan?
>>
>>
>>
>> One problem that I see is that the COUNT function is a UDAF and I am
>>
>> assuming that the presence of any aggregate function like this is
>> enough
>>
>> to force the query planner to require a Map/Reduce. Is there anyway
>>
>> to make the function look like a simple UDF for some queries? Or
>>
>> just for some tables? I guess that I’d be prepared to sacrifice the
>> full
>>
>> generality of the normal COUNT function for one which
>>
>> only functions correctly for the simple query on my tables.
>>
>>
>>
>> So is it possible to have a different COUNT function only on certain tables?
>>
>>
>>
>> Regards,
>>
>>
>>
>> Z
>>
>>
>
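
A hedged illustration of the stats route Navis mentions, assuming a
non-partitioned table t whose statistics are current:

  ANALYZE TABLE t COMPUTE STATISTICS;  -- populates numRows among other stats
  DESCRIBE FORMATTED t;                -- numRows shows up in Table Parameters

  -- later releases (Hive 0.13+, HIVE-5483) can answer the simple case from
  -- those stats without a scan once this flag is enabled:
  SET hive.compute.query.using.stats=true;
  SELECT COUNT(*) FROM t;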


Re: Experience of Hive local mode execution style

2013-07-02 Thread Edward Capriolo
Local mode is fast. In particular, older versions of hadoop take a lot of
time scheduling tasks, and there is a delay between the map and reduce phases.

Local mode really helps with those little delays.

On Monday, July 1, 2013, Guillaume Allain  wrote:
> Hi all,
>
> Would anybody have any comments or feedback about the hive local mode
execution? It is advertised as providing a boost to performance for small
data sets. It seems to fit nicely when running unit/integration tests on a
single node or virtual machine.
>
> My exact questions are the following :
>
> - How significantly does the local mode execution of queries diverge from
distributed mode? Could the results be different in some way?
>
> - I have encountered errors when running complex queries (with several
joins/distincts/group-bys) that seem to relate to configuration (see below).
I got no exact answer from the ML and I am kind of ready to dive into the
source code.
>
> Any idea where I should aim in order to solve that particular problem?
>
> Thanks in advance,
>
> Guillaume
>
> 
> From: Guillaume Allain
> Sent: 18 June 2013 12:14
> To: user@hive.apache.org
> Subject: FileNotFoundException when using hive local mode execution style
>
> Hi all,
>
> I plan to use hive local mode in order to speed up unit testing on (very)
small data sets. (Data is still on hdfs.) I switch on local mode by
setting the following variables:
>
> SET hive.exec.mode.local.auto=true;
> SET mapred.local.dir=/user;
> SET mapred.tmp.dir=file:///tmp;
> (plus creating needed directories and permissions)
>
> Simple GROUP BY, INNER and OUTER JOIN queries work just fine (with up to
3 jobs) with nice performance improvements.
>
> Unfortunately I ran into a
FileNotFoundException:/tmp/vagrant/hive_2013-06-17_16-10-05_614_7672774118904458113/-mr-1/1/emptyFile)
on some more complex query (4 jobs, distinct on top of several joins, see
below logs if needed).
>
> Any idea about that error? What other options am I missing to have a fully
functional local mode?
>
> Thanks in advance, Guillaume
>
> $ tail -50
/tmp/vagrant/vagrant_20130617171313_82baad8b-1961-4055-a52e-d8865b2cd4f8.lo
>
> 2013-06-17 16:10:05,669 INFO  exec.ExecDriver
(ExecDriver.java:execute(320)) - Using
org.apache.hadoop.hive.ql.io.CombineHiveInputFormat
> 2013-06-17 16:10:05,688 INFO  exec.ExecDriver
(ExecDriver.java:execute(342)) - adding libjars:
file:///opt/events-warehouse/build/jars/joda-time.jar,file:///opt/events-warehouse/build/jars/we7-hive-udfs.jar,file:///usr/lib/hive/lib/hive-json-serde-0.2.jar,file:///usr/lib/hive/lib/hive-builtins-0.9.0-cdh4.1.2.jar,file:///opt/events-warehouse/build/jars/guava.jar
> 2013-06-17 16:10:05,688 INFO  exec.ExecDriver
(ExecDriver.java:addInputPaths(840)) - Processing alias dc
> 2013-06-17 16:10:05,688 INFO  exec.ExecDriver
(ExecDriver.java:addInputPaths(858)) - Adding input file
hdfs://localhost/user/hive/warehouse/events_super_mart_test.db/dim_cohorts
> 2013-06-17 16:10:05,689 INFO  exec.Utilities
(Utilities.java:isEmptyPath(1807)) - Content Summary not cached for
hdfs://localhost/user/hive/warehouse/events_super_mart_test.db/dim_cohorts
> 2013-06-17 16:10:06,185 INFO  exec.ExecDriver
(ExecDriver.java:addInputPath(789)) - Changed input file to
file:/tmp/vagrant/hive_2013-06-17_16-10-05_614_7672774118904458113/-mr-1/1
> 2013-06-17 16:10:06,226 INFO  exec.ExecDriver
(ExecDriver.java:addInputPaths(840)) - Processing alias $INTNAME
> 2013-06-17 16:10:06,226 INFO  exec.ExecDriver
(ExecDriver.java:addInputPaths(858)) - Adding input file
hdfs://localhost/tmp/hive-vagrant/hive_2013-06-17_16-09-42_560_407729448242367/-mr-10004
> 2013-06-17 16:10:06,226 INFO  exec.Utilities
(Utilities.java:isEmptyPath(1807)) - Content Summary not cached for
hdfs://localhost/tmp/hive-vagrant/hive_2013-06-17_16-09-42_560_407729448242367/-mr-10004
> 2013-06-17 16:10:06,681 WARN  conf.Configuration
(Configuration.java:warnOnceIfDeprecated(808)) - session.id is deprecated.
Instead, use dfs.metrics.session-id
> 2013-06-17 16:10:06,682 INFO  jvm.JvmMetrics (JvmMetrics.java:init(76)) -
Initializing JVM Metrics with processName=JobTracker, sessionId=
> 2013-06-17 16:10:06,688 INFO  exec.ExecDriver
(ExecDriver.java:createTmpDirs(215)) - Making Temp Directory:
hdfs://localhost/tmp/hive-vagrant/hive_2013-06-17_16-09-42_560_407729448242367/-mr-10002
> 2013-06-17 16:10:06,706 WARN  mapred.JobClient
(JobClient.java:copyAndConfigureFiles(704)) - Use GenericOptionsParser for
parsing the arguments. Applications should implement Tool for the same.
> 2013-06-17 16:10:06,942 INFO  io.CombineHiveInputFormat
(CombineHiveInputFormat.java:getSplits(370)) - CombineHiveInputSplit
creating pool for
file:/tmp/vagrant/hive_2013-06-17_16-10-05_614_7672774118904458113/-mr-1/1;
using filter path
file:/tmp/vagrant/hive_2013-06-17_16-10-05_614_7672774118904458113/-mr-1/1
> 2013-06-17 16:10:06,943 INFO  io.CombineHiveInputFormat
(CombineHiveInputFormat.java:getSplits(370)) - CombineHiveInputSplit
c
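
For reference, the automatic local mode being enabled here is gated by input
size; a minimal sketch of the related knobs (values illustrative, and the
second property is named hive.exec.mode.local.auto.tasks.max in some older
releases):

  SET hive.exec.mode.local.auto=true;
  -- run locally only when total input is below this many bytes (128MB here)
  SET hive.exec.mode.local.auto.inputbytes.max=134217728;
  -- ...and the job needs no more than this many input files/map tasks
  SET hive.exec.mode.local.auto.input.files.max=4;
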

Unsubscribe

2013-07-02 Thread Ariel Leiva



Re: Dealing with different date formats

2013-07-02 Thread Stephen Sprague
not sure i fully understand your dilemma. have you investigated any of
the date functions listed here?

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-DateFunctions

seems to me you could pull the year and month from a date.  or if you have
an int then do some arithmetic to get the year and month.  eg. year =
floor(yyyymm / 100) and month = cast(yyyymm % 100 as int)  [% ==
modulus operator]

or am i not even answering your question?



On Tue, Jul 2, 2013 at 2:42 AM, Jérôme Verdier
wrote:

> Hi,
>
> I'm trying to translate some PL/SQL scripts into HiveQL, and dealing with
> an unusual date format.
>
> I added a variable in my hive script : '${hiveconf:in_co_an_mois}' which
> is a year/month date format, like this : 201307 (INT format).
>
> I would like to transform this into a date format, because I have to
> increment it (add one month/one year).
>
> Is there a way to do this in Hive?
>
> Thanks.
>
>
>
> --
> *Jérôme*
>
>
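
Following Stephen's arithmetic, a minimal sketch of bumping a yyyymm integer
by one month and one year (src is a hypothetical one-row table standing in
for the ${hiveconf:in_co_an_mois} variable):

  SELECT yyyymm + 100 AS plus_one_year,           -- 201307 -> 201407
         IF(yyyymm % 100 = 12,
            (FLOOR(yyyymm / 100) + 1) * 100 + 1,  -- roll December into January
            yyyymm + 1) AS plus_one_month         -- 201307 -> 201308
  FROM src;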


Re: Help in ROW_NUMBER() OVER (PARTITION BY) in Hive

2013-07-02 Thread Ashutosh Chauhan
Windowing functions (over clause) were introduced in 0.11. You need to
upgrade to 0.11 to make use of them.

Ashutosh


On Tue, Jul 2, 2013 at 7:19 AM, dyuti a  wrote:

> Hi Experts,
> I'm working on converting Teradata queries to a Hive environment (Hive
> version 0.10.0). The challenge that I am facing here is in converting the
> line below in the query.
>
>
> In SELECT clause:
>
> ROW_NUMBER() OVER (PARTITION BY CLMST_KEY2
> ORDER BY COUNTER) AS CLMST_ORDR_NBR
>
> When I searched, I found that instead of ROW_NUMBER() I can go with a
> ROW_SEQUENCE UDF. What do I do with the OVER clause and the rest?
>
> Do I have to upgrade the version to 0.11.0 or any other way?
>
>
> Any of your help is really appreciated.
>
> Thanks in advance!
>
>
>
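
Once on 0.11, the Teradata expression carries over almost verbatim; a minimal
sketch using the column names from the question (the table name is
hypothetical):

  SELECT clmst_key2,
         counter,
         ROW_NUMBER() OVER (PARTITION BY clmst_key2
                            ORDER BY counter) AS clmst_ordr_nbr
  FROM clmst_table;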


Re: Partition performance

2013-07-02 Thread Owen O'Malley
On Tue, Jul 2, 2013 at 2:34 AM, Peter Marron <
peter.mar...@trilliumsoftware.com> wrote:

>  Hi Owen,
>
> I’m curious about this advice about partitioning. Is there some
> fundamental reason why Hive
> is slow when the number of partitions is 10,000 rather than 1,000?
>

The precise numbers don't matter. I wanted to give people a ballpark range
that they should be looking at. Most tables at 1000 partitions won't cause
big slowdowns, but the cost scales with the number of partitions. By the
time you are at 10,000 the cost is noticeable. I have one customer who has
a table with 1.2 million partitions. That causes a lot of slowdowns.


> And the improvements
>
> that you mention are they going to be in version 12? Is there a JIRA
> raised so that I can track them?
>
> (It’s not currently a problem for me but I can see that I am going to need
> to be able to explain the situation.)
>

I think this is the one they will use:
https://issues.apache.org/jira/browse/HIVE-4051

-- Owen


Help in ROW_NUMBER() OVER (PARTITION BY) in Hive

2013-07-02 Thread dyuti a
Hi Experts,
I'm working on converting Teradata queries to a Hive environment (Hive
version 0.10.0). The challenge that I am facing here is in converting the
line below in the query.


In SELECT clause:

ROW_NUMBER() OVER (PARTITION BY CLMST_KEY2
ORDER BY COUNTER) AS CLMST_ORDR_NBR

When I searched, I found that instead of ROW_NUMBER() I can go with a
ROW_SEQUENCE UDF. What do I do with the OVER clause and the rest?

Do I have to upgrade the version to 0.11.0 or any other way?


Any of your help is really appreciated.

Thanks in advance!


Convenience methods .execute(String query) and fetchX(x) on TCLIService.thrift

2013-07-02 Thread Christian Schneider
Hi,
I compared the HiveServer 1 and HiverServer 2 Thrift implementation.

I'm asking myself why straightforward methods like .execute(String
query) and the three fetchX(x) methods defined in
org.apache.hadoop.hive.service.ThriftHive.Client aren't in
TCLIService.thrift.

Now we have to hassle with things like TOpenSessionReq and TOpenSessionResp
etc.

Best Regards,
Christian.


Re: Need urgent help in hive query

2013-07-02 Thread dyuti a
sure, will check on that Robin


-Thanks


On Sat, Jun 29, 2013 at 1:11 AM, Robin Morris  wrote:

>  It probably won't make a difference to performance — the subqueries will
> be executed in separate map-reduce jobs, and the output put on disk anyway.
>
>
>  First, get it to work; second, optimize.
>
>  Robin
>
>   From: dyuti a 
> Reply-To: "user@hive.apache.org" 
> Date: Friday, June 28, 2013 12:05 PM
> To: "user@hive.apache.org" 
> Subject: Re: Need urgent help in hive query
>
>   Hi Robin,
> Thanks for your reply. Hope this will help us, but still looking into the
> performance side too. If there is no other option then I'll have to go with this.
>
> Dti
>
>
> On Sat, Jun 29, 2013 at 12:03 AM, Robin Morris  wrote:
>
>>  Split up the query – put results of the subquery into a table, run the
>> final query on that, then drop the temporary table.
>>
>>  Robin
>>
>>   From: dyuti a 
>> Reply-To: "user@hive.apache.org" 
>> Date: Friday, June 28, 2013 10:54 AM
>> To: "user@hive.apache.org" , Michael Malak <
>> michaelma...@yahoo.com>
>> Subject: Re: Fwd: Need urgent help in hive query
>>
>>   Hi Michael,
>> Thanks for your help, is there any other possible options apart from this.
>>
>>
>> On Fri, Jun 28, 2013 at 10:33 PM, Michael Malak 
>> wrote:
>>
>>>  Just copy and paste the whole long expressions to their second
>>> occurrences.
>>>
>>>   --
>>> *From:* dyuti a 
>>> *To:* user@hive.apache.org
>>> *Sent:* Friday, June 28, 2013 10:58 AM
>>> *Subject:* Fwd: Need urgent help in hive query
>>>
>>>   Hi Experts,
>>>  I'm trying the below SQL query in Hive, which does not support
>>> column alias access in subsequent columns, as shown in the query. Is
>>> there any other way to rewrite it without using aliases? Any of your
>>> help is really appreciated.
>>>
>>> INSERT INTO CAUD
>>>(
>>>  pst_cc
>>>   ,pst_no
>>>   ,pst_ber
>>>   ,pst_tkn
>>>   ,pst_dtm
>>>   ,pst_nbr
>>>   ,pst_cde
>>>   ,pst_dte
>>> )
>>> SELECT der.cc
>>>   ,der.no
>>>   ,der.ber
>>>   ,der.tkn
>>>   ,der.dtm
>>>   ,der.nbr
>>>   ,der.cde
>>>   ,der.dte
>>>
>>> FROM (SELECT udp.cc
>>>,udp.no
>>>,udp.ber
>>>,udp.tkn
>>>,CASE WHEN udp.SYSTEM_DTE>1600 AND udp.SYSTEM_DTE<1 THEN
>>> udp.SYSTEM_DTE
>>> WHEN udp.DTE_OUT>1600 AND udp.DTE_OUT<1 THEN
>>> udp.DTE_OUT
>>> WHEN udp.DTE_IN>1600 AND udp.DTE_IN<1 THEN
>>> udp.DTE_IN
>>> ELSE '1231'
>>> END  AS DTE_OUT
>>>,CASE WHEN udp.TME_OUT <> 0 THEN udp.TME_OUT
>>> WHEN udp.TME_IN <> 0 THEN udp.TME_IN
>>> ELSE 0
>>> END AS TME_OUT
>>> ,TRIM(CAST(TME_OUT  AS CHAR(6))) AS TME_OUT1
>>> ,CAST(CAST(SUBSTR(TRIM(DTE_OUT),1,8)  AS CHAR(8)) AS DATE FORMAT
>>> 'mmdd')  AS DTE_OUT_O
>>> ,CASE WHEN TME_OUT>9 THEN CAST(TME_OUT1 AS CHAR(6))
>>> WHEN TME_OUT> AND TME_OUT<=9 THEN
>>> CAST('0'||TME_OUT1  AS CHAR(6))
>>> WHEN TME_OUT>999 AND TME_OUT<= THEN CAST('00'||TME_OUT1
>>> AS CHAR(6))
>>> WHEN TME_OUT>99 AND TME_OUT<=999 THEN CAST('000'||TME_OUT1
>>> AS CHAR(6))
>>> WHEN TME_OUT>9 AND TME_OUT<=99 THEN CAST(''||TME_OUT1 AS
>>> CHAR(6))
>>> WHEN TME_OUT>0 AND TME_OUT<=9 THEN CAST('0'||TME_OUT1 AS
>>> CHAR(6))
>>> WHEN TME_OUT=0 THEN '00'
>>> END AS TME_OUT2
>>>
>>> ,SUBSTR(TRIM(TME_OUT2),1,2)||':'||SUBSTR(TRIM(TME_OUT2),3,2)||':'||SUBSTR(TRIM(TME_OUT2),5,2)
>>> AS
>>> TME_OUT_O
>>> , CAST( DTE_OUT_O||' '||TME_OUT_O AS TIMESTAMP FORMAT
>>> 'MMDD:HH: MI:SS') AS DTM
>>> ,udp.nbr  AS  nbr
>>>
>>>FROM   STS_GNCAUDP   udp
>>>
>>> INNER JOIN LOAD_LOG LZ_LL  ON udp.LOG_KEY=LZ_LL.LOG_KEY
>>> INNER JOIN ESA_LOAD_LOG ESA_LL  ON
>>> LZ_LL.ESA_LOAD_LOG_KEY=ESA_LL.LOG_KEY
>>> AND ESA_LL.PBLSH_IND='$PBLSH_IND'
>>> AND ESA_LL.LOAD_END_DTM ='$HIGH_DATE_TIME'
>>> AND ESA_LL.SOR_CD= '$CLM_SOR_CD'
>>> AND ESA_LL.SUBJ_AREA_NM= '$SUBJ_AREA_NM'
>>> AND ESA_LL.WORK_FLOW_NM= '$WORK_FLOW_NM'
>>> QUALIFY ROW_NUMBER()  OVER  (PARTITION BY udp.cc,udp.pst_no,
>>> udp.cde,udp.nbr,udp.dte,udp.LOG_KEY
>>> ORDER BY DTM DESC)=1) AS der
>>> ;
>>>
>>>
>>>
>>> Thanks in advance!
>>> Dti
>>>
>>>
>>>
>>>
>>>
>>
>
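
A minimal sketch of the split Robin suggests, with the table names from the
question and the column list trimmed for brevity; the subquery is
materialized with CTAS, the final insert reads from it, and the scratch
table is dropped afterwards:

  -- stage 1: materialize the derived table (the inner SELECT above)
  CREATE TABLE tmp_der AS
  SELECT udp.cc, udp.no, udp.ber, udp.tkn   -- ...plus the derived CASE columns
  FROM sts_gncaudp udp
  INNER JOIN load_log lz_ll ON udp.log_key = lz_ll.log_key;
  -- note: the Teradata QUALIFY ROW_NUMBER()=1 filter has no Hive 0.10
  -- equivalent; on 0.11+ it becomes a windowed subquery with WHERE rn = 1

  -- stage 2: the final insert now reads from the scratch table
  INSERT INTO TABLE caud
  SELECT cc, no, ber, tkn FROM tmp_der;

  DROP TABLE tmp_der;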



Dealing with different date formats

2013-07-02 Thread Jérôme Verdier
Hi,

I'm trying to translate some PL/SQL scripts into HiveQL, and dealing with
an unusual date format.

I added a variable in my hive script : '${hiveconf:in_co_an_mois}' which is
a year/month date format, like this : 201307 (INT format).

I would like to transform this into a date format, because I have to
increment it (add one month/one year).

Is there a way to do this in Hive?

Thanks.



-- 
*Jérôme*


RE: Partition performance

2013-07-02 Thread Peter Marron
Hi Owen,

I’m curious about this advice about partitioning. Is there some fundamental 
reason why Hive
is slow when the number of partitions is 10,000 rather than 1,000? And the 
improvements
that you mention are they going to be in version 12? Is there a JIRA raised so 
that I can track them?
(It’s not currently a problem for me but I can see that I am going to need to 
be able to explain the situation.)

Warm regards,

Z

From: Owen O'Malley [mailto:omal...@apache.org]
Sent: 05 April 2013 00:26
To: user@hive.apache.org
Subject: Re: Partition performance

See slide #9 from my Optimizing Hive Queries talk 
http://www.slideshare.net/oom65/optimize-hivequeriespptx . Certainly, we will 
improve it, but for now you are much better off with 1,000 partitions than 
10,000.

-- Owen

On Thu, Apr 4, 2013 at 4:21 PM, Ramki Palle <ramki.pa...@gmail.com> wrote:
Is it possible for you to send the explain plan of these two queries?
Regards,
Ramki.

On Thu, Apr 4, 2013 at 4:06 PM, Sanjay Subramanian
<sanjay.subraman...@wizecommerce.com> wrote:
The slowdown is most likely due to the large number of partitions.
I believe the Hive book authors tell us to be cautious with large numbers of 
partitions :-)  and I abide by that.

Users
Please add your points of view and experiences

Thanks
sanjay

From: Ian <liu...@yahoo.com>
Reply-To: "user@hive.apache.org" <user@hive.apache.org>, Ian <liu...@yahoo.com>
Date: Thursday, April 4, 2013 4:01 PM
To: "user@hive.apache.org" <user@hive.apache.org>
Subject: Partition performance

Hi,

I created 3 years of hourly log files (26280 files in total), and use an 
external table with partitions to query them. I tried two partition methods.

1). Log files are stored as /test1/2013/04/02/16/00_0 (A directory per 
hour). Use date and hour as partition keys. Add 3 years of directories to the 
table partitions. So there are 26280 partitions.
CREATE EXTERNAL TABLE test1 (logline string) PARTITIONED BY (dt string, 
hr int);
ALTER TABLE test1 ADD PARTITION (dt='2013-04-02', hr=16) LOCATION 
'/test1/2013/04/02/16';

2). Log files are stored as /test2/2013/04/02/16_00_0 (A directory per day, 
24 files in each directory). Use date as partition key. Add 3 years of 
directories to the table partitions. So there are 1095 partitions.
CREATE EXTERNAL TABLE test2 (logline string) PARTITIONED BY (dt string);
ALTER TABLE test2 ADD PARTITION (dt='2013-04-02') LOCATION 
'/test2/2013/04/02';

When doing a simple query like
SELECT * FROM  test1/test2  WHERE  dt >= '2013-02-01' and dt <= '2013-02-14'
Using approach #1 takes 320 seconds, but #2 only takes 70 seconds.

I'm wondering why there is such a big performance difference between these two? 
These two approaches have the same number of files, only the directory 
structure is different. So Hive is going to load the same number of files. Why 
does the number of partitions have such a big impact? Does that mean #2 is a 
better partition strategy?

Thanks.







RE: Override COUNT() function

2013-07-02 Thread Peter Marron
Thanks Navis,

This is a very interesting class which I feel pretty sure that I would never 
have found.
Are  there any descriptions, motivations, documentation or examples anywhere?
I suspect that there's nothing other than the source itself, but I had to ask.

Regards,

Z
-Original Message-
From: Navis류승우 [mailto:navis@nexr.com] 
Sent: 02 July 2013 08:50
To: user@hive.apache.org
Subject: Re: Override COUNT() function

MetadataOnlyOptimizer changes a GBY on partition columns to a simple TableScan
with a one-line dummy.

I think similar things can be done with stats.

2013/6/28 Peter Marron :
> Hi,
>
>
>
> I feel sure that someone has asked for this before, but here goes…
>
>
>
> In the case where I have the query
>
>
>
> SELECT COUNT(*) FROM table;
>
>
>
> There are many cases where I can determine the count immediately.
>
> (For example if I have run something like:
>
>
>
> ANALYZE TABLE tablename [PARTITION(partcol1[=val1], partcol2[=val2], 
> ...)] COMPUTE STATISTICS [noscan];
>
>
>
> then there seems to be a table property “numRows” which holds a count 
> of the number of rows.
>
> Now I know that the COUNT function can’t always be determined easily.
>
> If the query is more complicated, like
>
>
>
> SELECT COUNT(*) FROM table GROUP BY column;
>
>
>
> then obviously a simple scalar count is of no real use. But is there 
> some way
>
> to intercept the simple case and avoid running a table scan?
>
>
>
> One problem that I see is that the COUNT function is a UDAF and I am
>
> assuming that the presence of any aggregate function like this is 
> enough
>
> to force the query planner to require a Map/Reduce. Is there anyway
>
> to make the function look like a simple UDF for some queries? Or
>
> just for some tables? I guess that I’d be prepared to sacrifice the 
> full
>
> generality of the normal COUNT function for one which
>
> only functions correctly for the simple query on my tables.
>
>
>
> So is it possible to have a different COUNT function only on certain tables?
>
>
>
> Regards,
>
>
>
> Z
>
>



RE:

2013-07-02 Thread Siddharth Tiwari
Please stop spamming the list.


Cheers !!!

Siddharth Tiwari

Have a refreshing day !!!
"Every duty is holy, and devotion to duty is the highest form of worship of 
God.” 

"Maybe other people will try to limit me but I don't limit myself"


> Date: Tue, 2 Jul 2013 01:18:17 -0700
> From: m_muthuku...@yahoo.com
> To: user@hive.apache.org; tecalu...@yahoogroups.com; saroj...@yahoo.com; 
> user-subscr...@hive.apache.org; suni...@yahoo.com; ashok_srin...@yahoo.com; 
> p...@fastcrypt.com; pgsql-j...@postgresql.org; mcignar...@tsgsportsgroup.com; 
> bi...@gkrellm.net; st...@mobilevibes.net
> Subject: Re: 
> 
> 
> http://hmydesign.ca/wel.come.phpcjvode30331761
> 
  


Re: Override COUNT() function

2013-07-02 Thread Navis류승우
MetadataOnlyOptimizer changes a GBY on partition columns to a simple
TableScan with a one-line dummy.

I think similar things can be done with stats.

2013/6/28 Peter Marron :
> Hi,
>
>
>
> I feel sure that someone has asked for this before, but here goes…
>
>
>
> In the case where I have the query
>
>
>
> SELECT COUNT(*) FROM table;
>
>
>
> There are many cases where I can determine the count immediately.
>
> (For example if I have run something like:
>
>
>
> ANALYZE TABLE tablename [PARTITION(partcol1[=val1], partcol2[=val2], ...)]
> COMPUTE STATISTICS [noscan];
>
>
>
> then there seems to be a table property “numRows” which holds a count of the
> number of rows.
>
> Now I know that the COUNT function can’t always be determined easily.
>
> If the query is more complicated, like
>
>
>
> SELECT COUNT(*) FROM table GROUP BY column;
>
>
>
> then obviously a simple scalar count is of no real use. But is there some
> way
>
> to intercept the simple case and avoid running a table scan?
>
>
>
> One problem that I see is that the COUNT function is a UDAF and I am
>
> assuming that the presence of any aggregate function like this is enough
>
> to force the query planner to require a Map/Reduce. Is there anyway
>
> to make the function look like a simple UDF for some queries? Or
>
> just for some tables? I guess that I’d be prepared to sacrifice the full
>
> generality of the normal COUNT function for one which
>
> only functions correctly for the simple query on my tables.
>
>
>
> So is it possible to have a different COUNT function only on certain tables?
>
>
>
> Regards,
>
>
>
> Z
>
>


Re: Correct way of using regexserde

2013-07-02 Thread Mohammad Tariq
No luck.

Warm Regards,
Tariq
cloudfront.blogspot.com


On Tue, Jul 2, 2013 at 1:03 PM, Matouk IFTISSEN
wrote:

> Yes, it is to create an external table that points to your data, with the
> regexp passed to the SerDe.
> Good day
>
>
> 2013/7/2 Mohammad Tariq 
>
>> Hello there,
>>
>>  Thanks a lot for the response. Do you mean creating an external
>> table? I have also tried deleting "output.regex", but same result. And the jar
>> is also added.
>>
>> Warm Regards,
>> Tariq
>> cloudfront.blogspot.com
>>
>>
>> On Mon, Jul 1, 2013 at 1:39 PM, Matouk IFTISSEN <
>> matouk.iftis...@ysance.com> wrote:
>>
>>> Hello,
>>> try to delete "output.regex" = "%1$s" and store your data in a file, then
>>> specify where the file is located:
>>> STORED AS TEXTFILE
>>> LOCATION '/..';
>>> and make sure that you add the jar "hive-contrib-0.10.0.jar" in the
>>> session, or that you have it on all Hadoop tasktrackers,
>>> like this:
>>> add jar path_where_is_the_jar_in_hive_lib\hive-contrib-0.9.0.jar ;
>>>
>>> Good luck
>>>
>>>
>>> 2013/7/1 Mohammad Tariq 
>>>
 Hello list,

  I would really appreciate it if someone could show me the correct
 way of using regexserde, as i'm having a hard time using it. I have
 verified my regex through
 http://www.regexplanet.com/advanced/java/index.html and it's working
 fine there. But when i'm using the same pattern with regexserde i'm getting
 NULL.

 My input looks like this :
 

 and I want to extract the characters enclosed between the angle
 brackets.

 This is the command i'm using :
 hive> CREATE TABLE s(f1 STRING) ROW FORMAT SERDE
 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
 > WITH SERDEPROPERTIES ( "input.regex" = "(\\w*)", "output.regex" =
 "%1$s") STORED AS TEXTFILE;

 LOAD works fine, but SELECT * gives me NULL.

 I am on hadoop-1.0.4 and hive-0.10.0

 Thank you so much for you time.


 Warm Regards,
 Tariq
 cloudfront.blogspot.com

>>>
>>>
>>
>
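
One detail that commonly yields all-NULL rows here: the contrib RegexSerDe
only populates columns when input.regex matches the entire line, one
capturing group per column. A minimal sketch, assuming the input lines look
like "<abc>" (the sample line itself was lost from the archive):

  CREATE TABLE s (f1 STRING)
  ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
  WITH SERDEPROPERTIES ("input.regex" = "<(\\w*)>")
  STORED AS TEXTFILE;
  -- a pattern like "(\\w*)" on its own cannot match a whole line that
  -- contains "<" or ">", so every row deserializes to NULL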


Re: Correct way of using regexserde

2013-07-02 Thread Matouk IFTISSEN
Yes, it is to create an external table that points to your data, with the regexp
passed to the SerDe.
Good day


2013/7/2 Mohammad Tariq 

> Hello there,
>
>  Thanks a lot for the response. Do you mean creating an external
> table? I have also tried deleting "output.regex", but same result. And the jar
> is also added.
>
> Warm Regards,
> Tariq
> cloudfront.blogspot.com
>
>
> On Mon, Jul 1, 2013 at 1:39 PM, Matouk IFTISSEN <
> matouk.iftis...@ysance.com> wrote:
>
>> Hello,
>> try to delete "output.regex" = "%1$s" and store your data in a file, then
>> specify where the file is located:
>> STORED AS TEXTFILE
>> LOCATION '/..';
>> and make sure that you add the jar "hive-contrib-0.10.0.jar" in the
>> session, or that you have it on all Hadoop tasktrackers,
>> like this:
>> add jar path_where_is_the_jar_in_hive_lib\hive-contrib-0.9.0.jar ;
>>
>> Good luck
>>
>>
>> 2013/7/1 Mohammad Tariq 
>>
>>> Hello list,
>>>
>>>  I would really appreciate it if someone could show me the correct
>>> way of using regexserde, as i'm having a hard time using it. I have
>>> verified my regex through
>>> http://www.regexplanet.com/advanced/java/index.html and it's working
>>> fine there. But when i'm using the same pattern with regexserde i'm getting
>>> NULL.
>>>
>>> My input looks like this :
>>> 
>>>
>>> and I want to extract the characters enclosed between the angle brackets.
>>>
>>> This is the command i'm using :
>>> hive> CREATE TABLE s(f1 STRING) ROW FORMAT SERDE
>>> 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
>>> > WITH SERDEPROPERTIES ( "input.regex" = "(\\w*)", "output.regex" =
>>> "%1$s") STORED AS TEXTFILE;
>>>
>>> LOAD works fine, but SELECT * gives me NULL.
>>>
>>> I am on hadoop-1.0.4 and hive-0.10.0
>>>
>>> Thank you so much for you time.
>>>
>>>
>>> Warm Regards,
>>> Tariq
>>> cloudfront.blogspot.com
>>>
>>
>>
>


RE: NoClassDefFoundError when creating a custom hive UDF

2013-07-02 Thread Rupinder Singh
So I found the getRequiredFiles() and getRequiredJars() methods that were added 
in HIVE-2561, and implemented getRequiredJars() to provide the full local paths 
to the jars. Still the same error.
Is there a built-in UDF that uses these functions or is there an example from 
someone who has implemented it successfully?

Thanks
Rupinder

From: Rupinder Singh [mailto:rsi...@care.com]
Sent: Tuesday, July 02, 2013 10:40 AM
To: user@hive.apache.org
Subject: NoClassDefFoundError when creating a custom hive UDF

Hi,

I have created a custom Hive UDF that has external JAR dependencies. I have 
added those jars to the Hive session using 'add jar' but when I try to create 
my function, I get a NoClassDefFoundError on the dependency class.
I am on Hive 0.8.1 running on Amazon EMR.

This is what happens when I try to create my function:
hive> add jar /home/hadoop/care/lib/wurfl-1.4.4.3.jar;
Added /home/hadoop/care/lib/wurfl-1.4.4.3.jar to class path
Added resource: /home/hadoop/care/lib/wurfl-1.4.4.3.jar
hive> add jar /home/hadoop/care/my_udf.jar;
Added /home/hadoop/care/my_udf.jar to class path
Added resource: /home/hadoop/care/my_udf.jar
hive> create temporary function is_bot as 'my.package.BotUDF';
java.lang.NoClassDefFoundError: 
net/sourceforge/wurfl/core/resource/WURFLResource
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:247)
at 
org.apache.hadoop.hive.ql.exec.FunctionTask.getUdfClass(FunctionTask.java:105)
at 
org.apache.hadoop.hive.ql.exec.FunctionTask.createFunction(FunctionTask.java:75)
at 
org.apache.hadoop.hive.ql.exec.FunctionTask.execute(FunctionTask.java:63)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:133)
at 
org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:57)
at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1336)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1127)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:935)
at 
org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:307)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:228)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:457)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:732)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:615)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:187)
Caused by: java.lang.ClassNotFoundException: 
net.sourceforge.wurfl.core.resource.WURFLResource
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
... 20 more
FAILED: Execution Error, return code -101 from 
org.apache.hadoop.hive.ql.exec.FunctionTask

Then I validated that hive can actually find the class that it is failing on:
hive> create temporary function is_bot as 
'net.sourceforge.wurfl.core.resource.WURFLResource';
FAILED: Class net.sourceforge.wurfl.core.resource.WURFLResource does not 
implement UDF, GenericUDF, or UDAF
FAILED: Execution Error, return code 1 from 
org.apache.hadoop.hive.ql.exec.FunctionTask

So it can see the class but doesn't see it as a UDF, which is right.

Just to verify that Hive will crib with a different error if I specify a 
non-existent class.
hive> create temporary function is_bot as 'non.existent.Class';
FAILED: Class non.existent.Class not found
FAILED: Execution Error, return code 1 from 
org.apache.hadoop.hive.ql.exec.FunctionTask

I have not found any documentation that indicates that dependencies need to be 
specified using a different mechanism. 
https://issues.apache.org/jira/browse/HIVE-2561 mentions an alternate 
mechanism, but I can't find it on the wiki or anywhere else.

Any help or pointers are appreciated.

Thanks
Rupinder
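
Two workarounds that usually sidestep this on older Hive versions (paths
hypothetical, mirroring the session above): put the dependency jar on the
auxiliary classpath when the CLI starts, or shade the dependency into the
UDF jar so a single ADD JAR carries everything.

  -- option 1: start the CLI with the dependency on the aux classpath:
  --   hive --auxpath /home/hadoop/care/lib/wurfl-1.4.4.3.jar
  -- option 2: register a fat/shaded UDF jar that bundles its dependencies
  ADD JAR /home/hadoop/care/my_udf-with-dependencies.jar;
  CREATE TEMPORARY FUNCTION is_bot AS 'my.package.BotUDF';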


