Re: Hive on Windows w/Hadoop 2.2.0

2014-02-04 Thread Lefty Leverenz
That's from the first sentence of Getting Started, which needs to be
rewritten:

> DISCLAIMER: Hive has only been tested on unix(linux) and mac systems using
> Java 1.6 for now - although it may very well work on other similar
> platforms. It does not work on Cygwin.


What should it say instead?

-- Lefty


On Tue, Feb 4, 2014 at 9:24 AM, Eric Hanson (BIG DATA) <
eric.n.han...@microsoft.com> wrote:

> HDP runs on Windows:
>
> http://hortonworks.com/products/hdp-windows/#install
>
> I don't know if it uses Cygwin or not, but everything needed to run Hive
> on Windows should be in the install package.
>
> Eric
>
> *From:* Ian Jackson [mailto:ian_jack...@trilliumsoftware.com]
> *Sent:* Tuesday, February 4, 2014 8:55 AM
> *To:* user@hive.apache.org
> *Subject:* Hive on Windows w/Hadoop 2.2.0
>
> Reading the Wiki (cwiki.apache.org/confluence/display/Hive/GettingStarted),
> it states "It does not work on Cygwin." Do further details exist, or hints
> to make it work on Windows?
>


Hive queries for disk usage analysis

2014-02-04 Thread Mungre,Surbhi
Hello All,

We are doing some analysis for which we need to determine things like the size
of the largest row or the size of the largest column. By size, I am referring
to disk space usage. Does Hive provide any functions to run such queries?
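
A hedged sketch of one approach, since Hive has no built-in per-row disk-size
function: approximate sizes with string lengths. This tracks uncompressed text
size (character counts), not compressed bytes on disk; my_table, col1, and
col2 are placeholder names.

-- Largest row (approximated as the concatenated width of its columns)
-- and largest value of a single column:
select max(length(concat_ws(',', cast(col1 as string), cast(col2 as string)))) as max_row_len,
       max(length(cast(col1 as string))) as max_col1_len
from my_table;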

Thanks,
Surbhi Mungre
Software Engineer
www.cerner.com

CONFIDENTIALITY NOTICE This message and any included attachments are from 
Cerner Corporation and are intended only for the addressee. The information 
contained in this message is confidential and may constitute inside or 
non-public information under international, federal, or state securities laws. 
Unauthorized forwarding, printing, copying, distribution, or use of such 
information is strictly prohibited and may be unlawful. If you are not the 
addressee, please promptly delete this message and notify the sender of the 
delivery error by e-mail or you may call Cerner's corporate offices in Kansas 
City, Missouri, U.S.A at (+1) (816)221-1024.


Re: GenericUDF Testing in Hive

2014-02-04 Thread Raj Hadoop

I want to do a simple test like this, but it is not working:

select ComplexUDFExample(List("a", "b", "c"), "b") from table1 limit 10;


FAILED: SemanticException [Error 10011]: Line 1:25 Invalid function 'List'
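
A hedged note on the error: HiveQL has no List() constructor; the built-in
array() function is the usual way to build the list argument, so the test
would presumably be:

select ComplexUDFExample(array('a', 'b', 'c'), 'b') from table1 limit 10;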






On Tuesday, February 4, 2014 2:34 PM, Raj Hadoop  wrote:
 
How do I test a Hive GenericUDF that accepts two parameters (List<T>, T)?

List<T> -- can it be the output of a collect_set()? Please advise.

I have a GenericUDF that takes (List<T>, T). I want to test how it works
through Hive.





On Monday, January 20, 2014 5:19 PM, Raj Hadoop  wrote:
 
 
The following is an example of a GenericUDF. I wanted to test it through a
Hive query, basically passing parameters something like "select
ComplexUDFExample('a','b','c') from employees limit 10".


 
 
https://github.com/rathboma/hive-extension-examples/blob/master/src/main/java/com/matthewrathbone/example/ComplexUDFExample.java
 
 
 
import java.util.List;

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.exec.UDFArgumentLengthException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ListObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.StringObjectInspector;

class ComplexUDFExample extends GenericUDF {
  ListObjectInspector listOI;
  StringObjectInspector elementOI;

  @Override
  public String getDisplayString(String[] arg0) {
    return "arrayContainsExample()"; // this should probably be better
  }

  @Override
  public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
    if (arguments.length != 2) {
      throw new UDFArgumentLengthException("arrayContainsExample only takes 2 arguments: List<T>, T");
    }
    // 1. Check we received the right object types.
    ObjectInspector a = arguments[0];
    ObjectInspector b = arguments[1];
    if (!(a instanceof ListObjectInspector) || !(b instanceof StringObjectInspector)) {
      throw new UDFArgumentException("first argument must be a list / array, second argument must be a string");
    }
    this.listOI = (ListObjectInspector) a;
    this.elementOI = (StringObjectInspector) b;

    // 2. Check that the list contains strings.
    if (!(listOI.getListElementObjectInspector() instanceof StringObjectInspector)) {
      throw new UDFArgumentException("first argument must be a list of strings");
    }

    // The return type of the function is a boolean, so provide the matching object inspector.
    return PrimitiveObjectInspectorFactory.javaBooleanObjectInspector;
  }

  @Override
  public Object evaluate(DeferredObject[] arguments) throws HiveException {
    // Get the list and string from the deferred objects using the object inspectors.
    List<String> list = (List<String>) this.listOI.getList(arguments[0].get());
    String arg = elementOI.getPrimitiveJavaObject(arguments[1].get());

    // Check for nulls.
    if (list == null || arg == null) {
      return null;
    }

    // See if the list contains the value we need.
    for (String s : list) {
      if (arg.equals(s)) return Boolean.TRUE;
    }
    return Boolean.FALSE;
  }
}
 
 
hive> select ComplexUDFExample('a','b','c') from email_list_1 limit 10;
FAILED: SemanticException [Error 10015]: Line 1:7 Arguments length mismatch
''c'': arrayContainsExample only takes 2 arguments: List<T>, T
 
--
 
How do I test this example in a Hive query? I know I am invoking it wrong,
but how can I invoke it correctly?
 
My requirement is to pass an array of strings as the first argument and
another string as the second argument in Hive, like below.
 
 
select col1, ComplexUDFExample(collect_set(col2), 'xyz')
from Employees
group by col1;
 
How do I do that?
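
A hedged sketch of the registration step all of the invocations above assume
(the jar path is a placeholder; the class package matches the GitHub link
above):

ADD JAR /path/to/hive-extension-examples.jar;
CREATE TEMPORARY FUNCTION ComplexUDFExample
  AS 'com.matthewrathbone.example.ComplexUDFExample';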
 
Thanks in advance.
 
Regards,
Raj

RE: Hive on Windows w/Hadoop 2.2.0

2014-02-04 Thread Eric Hanson (BIG DATA)
HDP runs on Windows:

http://hortonworks.com/products/hdp-windows/#install

I don't know if it uses Cygwin or not, but everything needed to run Hive on
Windows should be in the install package.

Eric

From: Ian Jackson [mailto:ian_jack...@trilliumsoftware.com]
Sent: Tuesday, February 4, 2014 8:55 AM
To: user@hive.apache.org
Subject: Hive on Windows w/Hadoop 2.2.0

Reading the Wiki (cwiki.apache.org/confluence/display/Hive/GettingStarted), it
states "It does not work on Cygwin." Do further details exist, or hints to
make it work on Windows?


Hive on Windows w/Hadoop 2.2.0

2014-02-04 Thread Ian Jackson
Reading the Wiki (cwiki.apache.org/confluence/display/Hive/GettingStarted), it
states "It does not work on Cygwin." Do further details exist, or hints to
make it work on Windows?


Re: Index not getting used for the queries

2014-02-04 Thread Thilina Gunarathne
Thanks, Peter, that helped. That property, combined with setting
'hive.optimize.index.filter' to 'true', got the index working.
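
For anyone following along, a sketch of the combination described (the
minsize threshold is in bytes and defaults to 5 GB; zero is an illustrative
value that lets the index apply regardless of estimated query size):

set hive.optimize.index.filter=true;
set hive.optimize.index.filter.compact.minsize=0;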

thanks,
Thilina


On Mon, Feb 3, 2014 at 6:12 PM, Peter Marron <
peter.mar...@trilliumsoftware.com> wrote:

>  Hi,
>
> Not sure if it is relevant to your problem, but I'm just checking that
> you know about
>
> hive.optimize.index.filter.compact.minsize
>
> It's set to 5 GB by default, and if the estimated query size is less
> than this then the index won't be used.
>
> HTH.
>
> Regards
>
> *Peter Marron*
> Senior Developer, Research & Development
> Office: +44 (0) 118-940-7609  peter.mar...@trilliumsoftware.com
> Theale Court First Floor, 11-13 High Street, Theale, RG7 5AH, UK
> www.trilliumsoftware.com
>
> *From:* Thilina Gunarathne [mailto:cset...@gmail.com]
> *Sent:* 03 February 2014 16:08
> *To:* user
> *Subject:* Index not getting used for the queries
>
>
>
> Dear all,
>
> I created a compact index for a table with several hundred million records
> as follows. The table is partitioned by month. The index on A and B was
> created successfully, but I can't see it getting used in queries. It
> would be great if one of you experts could shed some light on what I am
> missing. I'm using Hive 0.9.
>
> set hive.exec.parallel=false;
> CREATE INDEX idx_
> ON TABLE (a,b)
> AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
> WITH DEFERRED REBUILD
> COMMENT 'Index for  table. Indexing on A and B';
> ALTER INDEX idx_ on  REBUILD;
>
>
> hive> describe ;
> OK
> a        bigint
> ...
> b        bigint
> 
>
> month int
>
>
>
> hive> show index on ;
> OK
> idx_    a, b    default___p_idx___compact    Index for tm_top50_p
> table. Indexing on A and B
>
>
> hive> explain select a,b from tm_top50_p where a=113231 and
> month=201308;
> OK
> ABSTRACT SYNTAX TREE:
>   (TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME ))) (TOK_INSERT
> (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR
> (TOK_TABLE_OR_COL a)) (TOK_SELEXPR (TOK_TABLE_OR_COL b))) (TOK_WHERE (and
> (= (TOK_TABLE_OR_COL a) 113231) (= (TOK_TABLE_OR_COL month) 201308)
>
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 is a root stage
>
> STAGE PLANS:
>   Stage: Stage-1
> Map Reduce
>   Alias -> Map Operator Tree:
> 
>   TableScan
> alias: 
> Filter Operator
>   predicate:
>   expr: (a = 113231)
>   type: boolean
>   Select Operator
> expressions:
>   expr: a
>   type: bigint
>   expr: b
>   type: bigint
> outputColumnNames: _col0, _col1
> File Output Operator
>   compressed: false
>   GlobalTableId: 0
>   table:
>   input format:
> org.apache.hadoop.mapred.TextInputFormat
>   output format:
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>
>   Stage: Stage-0
> Fetch Operator
>   limit: -1
>
>   thanks a lot,
> Thilina
>
>
> --
> https://www.cs.indiana.edu/~tgunarat/
> http://www.linkedin.com/in/thilina
>
> http://thilina.gunarathne.org
>



-- 
https://www.cs.indiana.edu/~tgunarat/
http://www.linkedin.com/in/thilina
http://thilina.gunarathne.org

collect_list on two columns of the same row

2014-02-04 Thread ZORAIDA HIDALGO SANCHEZ
Dear all,

I need to call one of my UDFs, which receives two arrays as parameters. Each
of these arrays is the result of collect_list(col) after a group by. Does the
first position of the first array correspond to the first position of the
second array?

For instance, having this:

col11 col21 col31
col12 col22 col32
col13 col23 col33

I would expect:

select col3,
       result
from (
    select col3, collect_list(col1) as col1_list, collect_list(col2) as col2_list
    from my_table
    group by col3 ) tmp
lateral view my_udf(col1_list, col2_list) tmp as result
-- i.e. my_udf([col11,col12,col13], [col21,col22,col23])

Is that correct?
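
A hedged aside for readers: if alignment between the two lists ever matters,
a safer pattern in Hive is to collect both columns into a single array of
structs, so each element carries both values from the same source row (names
as in the example above):

select col3,
       collect_list(named_struct('c1', col1, 'c2', col2)) as pairs
from my_table
group by col3;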

Thanks.

Zoraida.-





Re: Optimising mappers for number of nodes

2014-02-04 Thread Lefty Leverenz
Actually that's mapred.max.split.size.  Hive doesn't have a configuration
parameter named "hive.max.split.size".
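
For example, a hedged sketch of applying the corrected parameter to the
4-node scenario quoted below (the split size is illustrative, in bytes):

set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
set mapred.max.split.size=3221225472;  -- ~3 GB per split, so ~4 map tasks for 12 GB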

-- Lefty


On Mon, Feb 3, 2014 at 10:59 AM, Prasanth Jayachandran <
pjayachand...@hortonworks.com> wrote:

> Hi
>
> hive.max.split.size can be tuned to decrease the number of mappers.
> Reference: http://www.slideshare.net/ye.mikez/hive-tuning (slide number
> 38)
>
> Also using CombineHiveInputFormat (default input format) will combine
> multiple small files to form a large split and hence less number of mappers.
>
> Thanks
> Prasanth Jayachandran
>
> On Feb 3, 2014, at 10:20 AM, KingDavies  wrote:
>
> Our platform has a 40GB raw data file that was compressed with LZO (12GB
> compressed) to reduce network IO to and from S3.
> Without indexing, the file is unsplittable, resulting in 1 map task and poor
> cluster utilisation.
> After indexing the file to be splittable, the Hive query produces 120 map
> tasks.
> However, with the 120 tasks distributed over a small 4-node cluster it
> takes longer to process the data than when it wasn't splittable and
> processing was done by a single node (1h20min vs 17min). This was with a
> fairly simple select-from-where query, without distinct, group by, or order.
> I'd like to utilise all nodes in the cluster to reduce query time. What's
> the best way to have the data crunched in parallel but with fewer mappers?


Issues in retrieval of hive data-reg

2014-02-04 Thread Selvi. rceg
I am doing my project on big data. I have installed HADOOP 1.2.1 and HIVE
0.11.0 on UBUNTU 11.10. I have created tables in Hive and structured the
contents of the tables using Hive. Now I need to retrieve the structured
data. Can you please help me retrieve it? Trying to find the location of the
data, I found it to be
hdfs://localhost:54310/user/hive/warehouse/table_name,
but I don't know where to find this location.

HADOOP - /usr/local/hadoop
HIVE - /usr/local/hadoop/hive 0.11.0
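
A hedged pointer: that warehouse path is an HDFS location, not a directory on
the local filesystem, so it can be inspected from the Hive CLI itself
(table_name is the placeholder from the question):

-- List the table's files in HDFS, from the Hive CLI:
dfs -ls /user/hive/warehouse/table_name;

-- Or simply read the structured data back through a query:
select * from table_name limit 10;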