ORC file question

2014-02-10 Thread Avrilia Floratou
Hi all, I'm running a query that scans a file stored in ORC format and extracts some columns. My file is about 92 GB, uncompressed. I kept the default stripe size. The MapReduce job generates 363 map tasks. I have noticed that the first 180 map tasks finish in 3 secs (each) and after they

get the Hive job status

2014-02-10 Thread kun yan
When running Hive over JDBC, how can I get the Hive job status? I'm currently using the following approach: Configuration conf = new Configuration(); JobConf job = new JobConf(conf); JobClient jc = new JobClient(job); // get cluster status // ClusterStatus cs = jc.getClusterStatus(); JobStatus[] jobStatus =

Re: Add few record(s) to a Hive table or a HDFS file on a daily basis

2014-02-10 Thread pandeesh
Why not use INSERT INTO for appending the new data? a) Load the new data into a staging table. b) INSERT INTO the final table. Sent from Windows Mail From: Raj Hadoop Sent: Monday, 10 February 2014 08:15 To: user, User Hi, My requirement is a typical Datawarehouse and
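The staging pattern described here could look roughly like the following in HiveQL (a sketch only; table names daily_staging and final_table and the HDFS path are hypothetical, not from the thread):

```sql
-- a) Load the day's delta into a staging table (illustrative path/table names)
LOAD DATA INPATH '/incoming/2014-02-10/' INTO TABLE daily_staging;

-- b) Append the staged rows to the final table; INSERT INTO has append
--    semantics (available since Hive 0.8), so existing data is kept.
INSERT INTO TABLE final_table
SELECT * FROM daily_staging;
```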

FUNCTION HIVE to DAYS OF WEEK

2014-02-10 Thread Eduardo Parra Valdes | BEEVA
Hello all! I wish to find a Hive function that returns the day of the week (Monday, Tuesday, etc.) given a timestamp parameter. Does anyone have an idea of how to do it? -- [image: BEEVA] *Eduardo Parra Valdés* eduardo.pa...@beeva.com BEE OUR CLIENT WWW.BEEVA.COM http://www.beeva.com/ Clara del

Re: ORC file question

2014-02-10 Thread Prasanth Jayachandran
Hi Avrilia, Is it a partitioned table? If so, approximately how many partitions are there and how many files are there? What is the value of hive.input.format? My suspicion is that there are ~180 files and each file is ~515MB in size. Since you mentioned you are using the default stripe size

Hbase + Hive scan performance

2014-02-10 Thread java8964
Hi, I know this has been asked before. I googled around this topic and tried to understand as much as possible, but I kept getting different answers from different places. So I'd like to describe what I have faced and see if someone can help me again on this topic. I created one table with one

Re: ORC file question

2014-02-10 Thread Avrilia Floratou
Hi Prasanth, No, it's not a partitioned table. The table consists of only one file (91.7 GB). When I created the table I loaded data from a text table into the ORC table and used only 1 map task, so that one large file is created instead of many small files. This is why I'm getting confused with

Can data be passed to the final mode init call in a UDAF?

2014-02-10 Thread John Meagher
I'm working on a UDAF that takes in a constant string that defines what the final output of the UDAF will be. In the mode=PARTIAL1 call to the init function, all the parameters are available and the constant can be read, so the output ObjectInspector can be built. I haven't found a way to pass

Re: FUNCTION HIVE to DAYS OF WEEK

2014-02-10 Thread Stephen Sprague
Oddly enough, I don't see one here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-DateFunctions However, you're not the only one who would find something like this useful; cf. https://issues.apache.org/jira/browse/HIVE-6046 In the meantime it appears as though
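One stopgap that works without a custom UDF: from_unixtime() accepts a Java SimpleDateFormat pattern, and the 'EEEE' pattern yields the full day-of-week name (a sketch; ts and some_table are hypothetical names):

```sql
-- 'EEEE' is a Java SimpleDateFormat pattern for the full day name,
-- e.g. 'Monday'. ts stands in for the timestamp column.
SELECT from_unixtime(unix_timestamp(ts), 'EEEE') AS day_of_week
FROM some_table;
```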

Re: ORC file question

2014-02-10 Thread Prasanth Jayachandran
Hi Avrilia, I have a few more questions: 1) Have you enabled ORC predicate pushdown by setting hive.optimize.index.filter? 2) What is the value of hive.input.format? 3) Which Hive version are you using? 4) What query are you using? Thanks Prasanth Jayachandran On Feb 10, 2014, at 1:26 PM,

Re: FUNCTION HIVE to DAYS OF WEEK

2014-02-10 Thread John Meagher
Here's one implementation of it: https://github.com/livingsocial/HiveSwarm#dayofweekdate. The code for it is pretty straightforward: https://github.com/livingsocial/HiveSwarm/blob/master/src/main/java/com/livingsocial/hive/udf/DayOfWeek.java On Mon, Feb 10, 2014 at 4:38 PM, Stephen Sprague
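The core of such a UDF is just mapping an epoch timestamp to a day name. A minimal standalone sketch of that logic (this is not the HiveSwarm code itself, which a real Hive UDF would wrap in an evaluate() method):

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.TextStyle;
import java.util.Locale;

public class DayOfWeekExample {
    // Map an epoch-seconds timestamp to its full day-of-week name in UTC.
    public static String dayName(long epochSeconds) {
        return Instant.ofEpochSecond(epochSeconds)
                .atZone(ZoneOffset.UTC)
                .getDayOfWeek()
                .getDisplayName(TextStyle.FULL, Locale.ENGLISH);
    }

    public static void main(String[] args) {
        // Epoch 0 is 1970-01-01, which fell on a Thursday.
        System.out.println(dayName(0L));
    }
}
```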

Re: ORC file question

2014-02-10 Thread Avrilia Floratou
Hi Prasanth, Here are the answers to your questions: 1) Yes, I have set both: set hive.optimize.ppd=true; set hive.optimize.index.filter=true; 2) From describe extended: inputFormat: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat 3) Hive 0.12 4) Select max(I1) from table; Thanks, Avrilia On

Re: ORC file question

2014-02-10 Thread Prasanth Jayachandran
2) From describe extended: inputFormat: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat OrcInputFormat can be bypassed if hive.input.format is set to CombineHiveInputFormat. There are two different split computation code paths, both of which may generate a different number of splits and hence
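Switching between the two split-computation paths is a session-level setting (a sketch of the non-combining path, which lets ORC's own InputFormat compute splits):

```sql
-- Use the per-InputFormat split path instead of
-- CombineHiveInputFormat's split-combining logic.
SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
```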

Re: ORC file question

2014-02-10 Thread Avrilia Floratou
Hi Prasanth, It seems that I was actually using hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat and that was generating 363 map tasks. I tried org.apache.hadoop.hive.ql.io.HiveInputFormat and I was actually able to get 182 map tasks and get rid of the short map
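The 182 tasks line up with back-of-the-envelope arithmetic: one ~91.7 GB file divided into roughly 512 MB splits (the 512 MB figure is an assumption for illustration; the real split size depends on HDFS block size and ORC stripe boundaries):

```java
public class SplitEstimate {
    // Rough split-count estimate: file size in MB divided by split size in MB.
    public static long estimate(double fileGb, long splitMb) {
        long fileMb = (long) (fileGb * 1024);
        return fileMb / splitMb;
    }

    public static void main(String[] args) {
        // ~91.7 GB file with ~512 MB splits gives an estimate close to
        // the 182 map tasks observed in the thread.
        System.out.println(estimate(91.7, 512));
    }
}
```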

Re: ORC file question

2014-02-10 Thread Prasanth Jayachandran
Great to hear! Thanks Prasanth Jayachandran On Feb 10, 2014, at 2:50 PM, Avrilia Floratou avrilia.flora...@gmail.com wrote: Hi Prasanth, It seems that I was actually using the hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat and that was generating 363 map tasks. I

Which side (map/reduce) does a UDF/UDTF/UDAF execute

2014-02-10 Thread Vaibhav Jain
Hi, I need to know the conditions based on which Hive decides to execute a UDF, UDTF, or UDAF on the map or reduce side. So far I have understood that UDFs are mostly executed on the map side and UDAFs on the reduce side. But are there any conditions in which a UDF can be executed on the reduce side

Re: Hbase + Hive scan performance

2014-02-10 Thread Navis류승우
The HBase storage handler uses its own InputFormat, so hbase.client.scanner.caching (which is used in hbase.TableInputFormat) does not work. It might be configurable via HIVE-2906, something like select empno, ename from hbase_emp ('hbase.scan.cache'='1000'), but I haven't tried it. bq. Is there any