Issues in retrieval of Hive data
I am doing my project on big data. I have installed *Hadoop 1.2.1* and *Hive 0.11.0* on *Ubuntu 11.10*. I have created tables in Hive and structured their contents using Hive. Now I need to retrieve the structured data. Can you please help me retrieve it? On trying to find the location of the data, I found it to be *hdfs://localhost:54310/user/hive/warehouse/table_name*, but I don't know where to find this location. Hadoop is installed at /usr/local/hadoop and Hive 0.11.0 at /usr/local/hadoop/hive.
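Note that the warehouse path lives in HDFS, not on the local filesystem, which is why it can't be found under /usr/local. A minimal sketch of how the data could be inspected and retrieved (the table name is taken from the path above; the export directory is illustrative):

```sql
-- Show the table's metadata, including the "Location:" field in HDFS
DESCRIBE FORMATTED table_name;

-- Retrieve the structured data directly through Hive
SELECT * FROM table_name;

-- Or export it to a local directory for use outside Hive
-- (the target path here is an assumption for illustration)
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/table_name_export'
SELECT * FROM table_name;
```

The raw files can also be listed from the shell with `hadoop fs -ls /user/hive/warehouse/table_name`.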
Re: Optimising mappers for number of nodes
Actually that's mapred.max.split.size. Hive doesn't have a configuration parameter named hive.max.split.size. -- Lefty

On Mon, Feb 3, 2014 at 10:59 AM, Prasanth Jayachandran pjayachand...@hortonworks.com wrote: Hi, hive.max.split.size can be tuned to decrease the number of mappers. Reference: http://www.slideshare.net/ye.mikez/hive-tuning (slide 38). Also, using CombineHiveInputFormat (the default input format) will combine multiple small files into a larger split, and hence fewer mappers. Thanks, Prasanth Jayachandran

On Feb 3, 2014, at 10:20 AM, KingDavies kingdav...@gmail.com wrote: Our platform has a 40GB raw data file that was LZO-compressed (12GB compressed) to reduce network IO to and from S3. Without indexing, the file is unsplittable, resulting in 1 map task and poor cluster utilisation. After indexing the file to make it splittable, the Hive query produces 120 map tasks. However, with the 120 tasks distributed over a small 4-node cluster, it takes longer to process the data than when it wasn't splittable and processing was done by a single node (1h20min vs 17min). This was with a fairly simple select-from-where query, without distinct, group by or order. I'd like to utilise all nodes in the cluster to reduce query time. What's the best way to have the data crunched in parallel but with fewer mappers?
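For reference, a sketch of the settings discussed in this thread (the split-size values are illustrative only; larger values mean fewer, bigger splits and hence fewer mappers):

```sql
-- Default in Hive; combines multiple small files into larger splits
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

-- Upper bound on split size; raising it reduces the number of mappers
-- (value below is an example: ~1GB per split)
set mapred.max.split.size=1073741824;
```

On a 4-node cluster, a rough starting point would be a split size that yields a small multiple of the available map slots rather than 120 tasks.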
collect_list on two columns of the same row
Dear all, I need to call one of my UDFs, which receives two arrays as parameters. Each of these arrays represents the result of collect_list(col) after a group by. Does the first position of the first array correspond with the first position of the second array? For instance, given this:

col11 col21 col31
col12 col22 col32
col13 col23 col33

I would expect:

select col3, result from (
  select col3, collect_list(col1) as col1_list, collect_list(col2) as col2_list
  from my_table
  group by col3
) tmp lateral view my_udf([col11,col12,col13],[col21,col22,col23]) tmp as result

Is that correct? Thanks. Zoraida.
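Worth noting: collect_list does not guarantee any particular ordering, so positional alignment between two independently collected arrays is not assured. A common workaround (a sketch, using the column names from the example above) is to collect both columns together as structs, so the pairing is preserved by construction:

```sql
-- Each struct keeps col1 and col2 from the same input row together,
-- so no cross-array alignment is needed
SELECT col3,
       collect_list(named_struct('c1', col1, 'c2', col2)) AS pairs
FROM my_table
GROUP BY col3;
```

The UDF would then take a single array of structs instead of two parallel arrays.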
Re: Index not getting used for the queries
Thanks Peter, it helped. That property, combined with setting 'hive.optimize.index.filter' to 'true', got the index working. Thanks, Thilina

On Mon, Feb 3, 2014 at 6:12 PM, Peter Marron peter.mar...@trilliumsoftware.com wrote: Hi, not sure if it is relevant to your problem, but I'm just checking that you know about hive.optimize.index.filter.compact.minsize. It's set to 5GB by default, and if the estimated query size is less than this, the index won't be used. HTH. Regards, *Peter Marron* Senior Developer, Research Development, Trillium Software

*From:* Thilina Gunarathne [mailto:cset...@gmail.com] *Sent:* 03 February 2014 16:08 *To:* user *Subject:* Index not getting used for the queries

Dear all, I created a compact index for a table with several hundred million records as follows. The table is partitioned by month. The index on A and B was created successfully, but I can't see it being used in the queries. It would be great if one of you experts could shed some light on what I am missing. I'm using Hive 0.9.

set hive.exec.parallel=false;
CREATE INDEX idx_ ON TABLE (a,b)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD
COMMENT 'Index for table. Indexing on A and B';
ALTER INDEX idx_ on REBUILD;

hive> describe ;
OK
a       bigint
...
b       bigint
month   int

hive> show index on ;
OK
idx_    a, b    default___p_idx___compact    Index for tm top50 table. Indexing on A and B

hive> explain select a,b from tm_top50_p where a=113231 and month=201308;
OK
ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME ))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL a)) (TOK_SELEXPR (TOK_TABLE_OR_COL b))) (TOK_WHERE (and (= (TOK_TABLE_OR_COL a) 113231) (= (TOK_TABLE_OR_COL month) 201308)))))

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        TableScan
          alias:
          Filter Operator
            predicate:
                expr: (a = 113231)
                type: boolean
            Select Operator
              expressions:
                    expr: a
                    type: bigint
                    expr: b
                    type: bigint
              outputColumnNames: _col0, _col1
              File Output Operator
                compressed: false
                GlobalTableId: 0
                table:
                    input format: org.apache.hadoop.mapred.TextInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1

thanks a lot, Thilina -- https://www.cs.indiana.edu/~tgunarat/ http://www.linkedin.com/in/thilina http://thilina.gunarathne.org
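For reference, a sketch of the two settings that resolved this thread (the minsize value is illustrative; table and index names follow the truncated DDL above):

```sql
-- Rewrite eligible queries to use the compact index
set hive.optimize.index.filter=true;

-- Minimum estimated input size below which the index is skipped;
-- the default is ~5GB, so 0 effectively forces index use
set hive.optimize.index.filter.compact.minsize=0;
```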
Hive on Windows w/Hadoop 2.2.0
Reading the Wiki (cwiki.apache.org/confluence/display/Hive/GettingStarted), it states that Hive does not work on Cygwin. Are there further details, or hints to make it work on Windows?
RE: Hive on Windows w/Hadoop 2.2.0
HDP runs on Windows: http://hortonworks.com/products/hdp-windows/#install I don't know whether it uses Cygwin, but everything needed to run Hive on Windows should be in the install package. Eric

From: Ian Jackson [mailto:ian_jack...@trilliumsoftware.com] Sent: Tuesday, February 4, 2014 8:55 AM To: user@hive.apache.org Subject: Hive on Windows w/Hadoop 2.2.0

Reading the Wiki (cwiki.apache.org/confluence/display/Hive/GettingStarted), it states that Hive does not work on Cygwin. Are there further details, or hints to make it work on Windows?
Re: GenericUDF Testing in Hive
How do I test a Hive GenericUDF which accepts two parameters, List<T> and T? Can the List<T> be the output of a collect_set? Please advise. I have a generic UDF which takes List<T>, T and I want to test how it works through Hive.

On Monday, January 20, 2014 5:19 PM, Raj Hadoop hadoop...@yahoo.com wrote: The following is an example of a GenericUDF. I wanted to test this through a Hive query, basically passing parameters something like select ComplexUDFExample('a','b','c') from employees limit 10. https://github.com/rathboma/hive-extension-examples/blob/master/src/main/java/com/matthewrathbone/example/ComplexUDFExample.java

class ComplexUDFExample extends GenericUDF {

  ListObjectInspector listOI;
  StringObjectInspector elementOI;

  @Override
  public String getDisplayString(String[] arg0) {
    return "arrayContainsExample()"; // this should probably be better
  }

  @Override
  public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
    if (arguments.length != 2) {
      throw new UDFArgumentLengthException("arrayContainsExample only takes 2 arguments: List<T>, T");
    }
    // 1. Check we received the right object types.
    ObjectInspector a = arguments[0];
    ObjectInspector b = arguments[1];
    if (!(a instanceof ListObjectInspector) || !(b instanceof StringObjectInspector)) {
      throw new UDFArgumentException("first argument must be a list / array, second argument must be a string");
    }
    this.listOI = (ListObjectInspector) a;
    this.elementOI = (StringObjectInspector) b;

    // 2. Check that the list contains strings
    if (!(listOI.getListElementObjectInspector() instanceof StringObjectInspector)) {
      throw new UDFArgumentException("first argument must be a list of strings");
    }

    // the return type of our function is a boolean, so we provide the correct object inspector
    return PrimitiveObjectInspectorFactory.javaBooleanObjectInspector;
  }

  @Override
  public Object evaluate(DeferredObject[] arguments) throws HiveException {
    // get the list and string from the deferred objects using the object inspectors
    List<String> list = (List<String>) this.listOI.getList(arguments[0].get());
    String arg = elementOI.getPrimitiveJavaObject(arguments[1].get());

    // check for nulls
    if (list == null || arg == null) {
      return null;
    }

    // see if our list contains the value we need
    for (String s : list) {
      if (arg.equals(s)) return new Boolean(true);
    }
    return new Boolean(false);
  }
}

hive> select ComplexUDFExample('a','b','c') from email_list_1 limit 10;
FAILED: SemanticException [Error 10015]: Line 1:7 Arguments length mismatch ''c'': arrayContainsExample only takes 2 arguments: List<T>, T

How do I test this example in a Hive query? I know I am invoking it wrong, but how can I invoke it correctly? My requirement is to pass an array of strings as the first argument and another string as the second argument in Hive, like below:

Select col1, ComplexUDFExample( collect_set(col2) , 'xyz') from Employees Group By col1;

How do I do that? Thanks in advance. Regards, Raj
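For what it's worth, a sketch of how this UDF could be invoked once registered (the jar path is an assumption for illustration; `array()` is Hive's built-in constructor for the list literal that the first argument expects):

```sql
-- Registration (jar path is illustrative)
ADD JAR /path/to/complex-udf.jar;
CREATE TEMPORARY FUNCTION ComplexUDFExample
  AS 'com.matthewrathbone.example.ComplexUDFExample';

-- A literal array as the first argument, a string as the second
SELECT ComplexUDFExample(array('a','b','c'), 'b') FROM employees LIMIT 10;

-- Or with collect_set, as in the requirement above
SELECT col1, ComplexUDFExample(collect_set(col2), 'xyz')
FROM Employees
GROUP BY col1;
```

The earlier failure came from passing three separate string arguments rather than one array plus one string.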
Re: GenericUDF Testing in Hive
I want to do a simple test like this, but it is not working:

select ComplexUDFExample(List(a, b, c), b) from table1 limit 10;
FAILED: SemanticException [Error 10011]: Line 1:25 Invalid function 'List'

On Tuesday, February 4, 2014 2:34 PM, Raj Hadoop hadoop...@yahoo.com wrote: [...]
Hive queries for disk usage analysis
Hello all, we are doing some analysis for which we need to determine things like the size of the largest row or the size of the largest column. By size, I am referring to disk space usage. Does Hive provide any functions to run such queries? Thanks, Surbhi Mungre Software Engineer, Cerner (www.cerner.com)
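Hive does not expose per-row physical disk usage directly, but the character lengths of the serialized column values can approximate it. A sketch under that assumption (table and column names are hypothetical):

```sql
-- Largest single column value, measured in characters
SELECT max(length(col1)) FROM my_table;

-- Approximate largest row, by concatenating the columns of interest;
-- this measures the text representation, not compressed on-disk bytes
SELECT max(length(concat_ws(',', cast(col1 AS string), cast(col2 AS string))))
FROM my_table;
```

For table-level sizes, `hadoop fs -du` on the table's warehouse directory gives the actual bytes on disk.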