Re: [BUG] Hive 3.1.2 ALTER TABLE statement

2021-04-20 Thread Devopam Mittra
Hi,
Could you please try the same statement with the CASCADE clause?
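A minimal sketch of that suggestion, reusing the statement from the report below (CASCADE also propagates the new column into the metadata of existing partitions):

ALTER TABLE schema.table ADD COLUMNS (new_column STRING) CASCADE;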

As a general rule, please avoid doing it in production directly unless that
is the last option.

Regards
Dev

On Mon, 19 Apr, 2021, 8:03 pm Vinícius Matheus Olivieri, <
olivierivi...@gmail.com> wrote:

> Hey guys!
>
> I am using HIVE 3.1.2 in a recently updated cluster of EMR 6.0.0 hosted in
> AWS and I am experiencing some problems when trying to execute a simple
> query in hive.
>
> The query in question is the following:
>
> ALTER TABLE schema.table ADD COLUMNS (new_column STRING);
>
> The table we are executing the query on has approximately 600k
> partitions.
>
> The version of Hive was updated recently to 3.1.2, as part of the whole package
> included in EMR 6.0.0.
>
> The curious thing is that when we were using Hive 2.3.6, the query worked
> without any issues. So I searched for whether the version update
> changed something in the execution of an ALTER TABLE, but I didn’t find
> anything relevant that could be the root cause of the problem.
>
> Could you guys help me see the light at the end of the tunnel?
>
>
> The log shown on the server side is the following:
>
> 2021-04-14T14:34:45,176 ERROR [HiveServer2-Background-Pool:
> Thread-221871([])]: exec.DDLTask (:()) - Failed
>
> org.apache.hadoop.hive.ql.metadata.HiveException: Unable to alter table.
> java.net.SocketTimeoutException: Read timed out
>
> at org.apache.hadoop.hive.ql.metadata.Hive.alterTable(Hive.java:721)
> ~[hive-exec-3.1.2-amzn-0.jar:3.1.2-amzn-0]
>
> at org.apache.hadoop.hive.ql.metadata.Hive.alterTable(Hive.java:698)
> ~[hive-exec-3.1.2-amzn-0.jar:3.1.2-amzn-0]
>
> at org.apache.hadoop.hive.ql.exec.DDLTask.alterTable(DDLTask.java:3966)
> ~[hive-exec-3.1.2-amzn-0.jar:3.1.2-amzn-0]
>
> at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:455)
> ~[hive-exec-3.1.2-amzn-0.jar:3.1.2-amzn-0]
>
> at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:205)
> ~[hive-exec-3.1.2-amzn-0.jar:3.1.2-amzn-0]
>
> at
> org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:97)
> ~[hive-exec-3.1.2-amzn-0.jar:3.1.2-amzn-0]
>
> at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:2664)
> ~[hive-exec-3.1.2-amzn-0.jar:3.1.2-amzn-0]
>
> at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:2335)
> ~[hive-exec-3.1.2-amzn-0.jar:3.1.2-amzn-0]
>
> at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:2011)
> ~[hive-exec-3.1.2-amzn-0.jar:3.1.2-amzn-0]
>
> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1709)
> ~[hive-exec-3.1.2-amzn-0.jar:3.1.2-amzn-0]
>
> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1703)
> ~[hive-exec-3.1.2-amzn-0.jar:3.1.2-amzn-0]
>
> at
> org.apache.hadoop.hive.ql.reexec.ReExecDriver.run(ReExecDriver.java:157)
> ~[hive-exec-3.1.2-amzn-0.jar:3.1.2-amzn-0]
>
> at
> org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:224)
> ~[hive-service-3.1.2-amzn-0.jar:3.1.2-amzn-0]
>
> at
> org.apache.hive.service.cli.operation.SQLOperation.access$700(SQLOperation.java:87)
> ~[hive-service-3.1.2-amzn-0.jar:3.1.2-amzn-0]
>
> at
> org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork$1.run(SQLOperation.java:316)
> ~[hive-service-3.1.2-amzn-0.jar:3.1.2-amzn-0]
>
> at java.security.AccessController.doPrivileged(Native Method)
> ~[?:1.8.0_242]
>
> at javax.security.auth.Subject.doAs(Subject.java:422) ~[?:1.8.0_242]
>
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
> ~[hadoop-common-3.2.1-amzn-0.jar:?]
>
> at
> org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork.run(SQLOperation.java:330)
> ~[hive-service-3.1.2-amzn-0.jar:3.1.2-amzn-0]
>
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> ~[?:1.8.0_242]
>
> at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_242]
>
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> ~[?:1.8.0_242]
>
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> ~[?:1.8.0_242]
>
> at java.lang.Thread.run(Thread.java:748) [?:1.8.0_242]
>
> Caused by: org.apache.thrift.transport.TTransportException:
> java.net.SocketTimeoutException: Read timed out
>
> at
> org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:129)
> ~[hive-exec-3.1.2-amzn-0.jar:3.1.2-amzn-0]
>
> at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
> ~[hive-exec-3.1.2-amzn-0.jar:3.1.2-amzn-0]
>
> at
> org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)
> ~[hive-exec-3.1.2-amzn-0.jar:3.1.2-amzn-0]
>
> at
> org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)
> ~[hive-exec-3.1.2-amzn-0.jar:3.1.2-amzn-0]
>
> at
> org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219)
> ~[hive-exec-3.1.2-amzn-0.jar:3.1.2-amzn-0]
>
> at org.apache.thrift.TServic

Re: Subquery failing - is there an alternative?

2019-12-08 Thread Devopam Mittra
Please try with a subquery alias.
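For reference, a sketch of one rewrite that usually parses on older Hive releases: compute the maximum once and cross join it (column names are taken from the question; this goes a bit beyond the alias hint above):

select s.id, s.first_name, s.last_name, s.change_seq
from person_source s
cross join (select max(change_seq) as max_seq from person_target) m
where s.change_seq > m.max_seq;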
Regards


On Mon, Dec 9, 2019, 6:06 AM Dan Horne  wrote:

> Hi All
>
> I'm trying to run the following subquery but it returns an error that says
> "cannot recognize input near 'select' 'max' '(' in expression specification"
>
> select id,
>
> first_name,
>
> last_name,
>
> change_seq
>
> from person_source
>
> where change_seq >
>
> (select max(change_seq) from person_target) A;
>
>
> If I replace the sub query with the actual maximum change_seq it works.
>
>
> Is there another construct that should work?
>
>
> Regards
>
>
> Dan
>


Re: Converting Hive Column from Varchar to String

2019-07-18 Thread Devopam Mittra
Perhaps the table has data in it that is beyond ASCII.
An easier way is to go for an additional column, update it with the data, and then drop
the older one after validating the records in the STRING-type column.
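A rough sketch of that idea, as a variant that rebuilds into a copy rather than updating in place (all names here are hypothetical):

CREATE TABLE my_table_fixed AS
SELECT id, CAST(my_varchar_col AS STRING) AS my_varchar_col, other_col
FROM my_table;
-- validate the records in my_table_fixed, then swap:
-- ALTER TABLE my_table RENAME TO my_table_old;
-- ALTER TABLE my_table_fixed RENAME TO my_table;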

Regards
Dev

On Thu, Jul 18, 2019, 4:44 AM William Shen 
wrote:

> Hi all,
>
> I assumed that it should be compatible to convert column type varchar to
> string, however, after running ALTER TABLE table CHANGE col col STRING, I
> encounter the following error when querying the column from hive:
>
> Failed with exception
> java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException:
> java.lang.ClassCastException:
> org.apache.hadoop.hive.serde2.io.HiveVarcharWritable cannot be cast to
> org.apache.hadoop.io.Text
>
> Anyone encountered this before, or know how to work around this?
>
> Thank you!
>
> - Will
>


Re: out of memory using Union operator and array column type

2019-03-11 Thread Devopam Mittra
hi Patrick,
Usually a DISTINCT is preferred on primary key columns instead of the
entire table - something typically referred to as SKEWNESS in the traditional
RDBMS world.
Doing it on an array will typically add further to the woes.

A typical workaround I have used in the past is to fall back to Unix,
doing the distinct with uniq or something similar, as it can easily
handle file sizes up to 2 GB (yes, I split bigger files and then merge them
suitably).

But that should not stop us from using something which is a released feature.
Maybe a JIRA can be filed to get it fixed.

regards
Dev

On Tue, Mar 12, 2019 at 12:43 AM Patrick Duin  wrote:

>
> Venkatesh:
> Increasing the memory: I've tried even bigger settings; that made the
> error appear after roughly twice as much time.
>
> Dev:
> So I know which table is giving the issue; following your previous
> suggestion I did a SELECT DISTINCT * FROM DELTA, which caused the same issue,
> so I think the DISTINCT is the likely cause. This would make sense since my
> original query worked for a UNION ALL.
>
> I'll try the simplest query I can reduce it to  with loads of memory and
> see if that gets anywhere. Other pointers are much appreciated.
>
> Thanks for the help!
>
>
>
>
>
> Op ma 11 mrt. 2019 om 19:35 schreef Devopam Mittra :
>
>> hi Patrick,
>> If it sounds worth trying please do the same:
>>
>> 1. Create physical table from table 1. (with filter clause)
>> 2. Create physical table from table 2. (with filter clause)
>> 3. Create interim table 2_1 with the DISTINCT clause.
>> 4. Create interim table 2_2 with the UNION clause.
>> 5. Do an INSERT OVERWRITE into target table from 2_2 above.
>>
>> This should help with isolating the error step.
>> Yes, increasing # mappers or memory helps - but I am usually averse to
>> this till I have exhausted all other options to be sure.
>>
>> regards
>> Dev
>>
>>
>> On Mon, Mar 11, 2019 at 9:21 PM Patrick Duin  wrote:
>>
>>> Very good question, Yes that does give the same problem.
>>>
>>> Op ma 11 mrt. 2019 om 16:28 schreef Devopam Mittra :
>>>
>>>> Can you please try doing SELECT DISTINCT * FROM DELTA into a physical
>>>> table first ?
>>>> regards
>>>> Dev
>>>>
>>>>
>>>> On Mon, Mar 11, 2019 at 7:59 PM Patrick Duin  wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I'm running into oom issue trying to do a Union all on a bunch of AVRO
>>>>> files.
>>>>>
>>>>> The query is something like this:
>>>>>
>>>>> with gold  as ( select * from table1 where local_date=2019-01-01),
>>>>>  delta as ( select * from table2 where local_date=2019-01-01)
>>>>> insert overwrite table3 PARTITION ('local_date')
>>>>> select * from gold
>>>>> union distinct
>>>>> select * from delta;
>>>>>
>>>>> UNION ALL works. The data size is in the low gigabytes and I'm running
>>>>> on 6 16 GB Nodes (I've tried larger and set memory settings higher but 
>>>>> that
>>>>> just postpones the error).
>>>>>
>>>>> Mappers fail with errors (stack traces not all the same)
>>>>>
>>>>> 2019-03-11 13:37:22,381 ERROR [main] org.apache.hadoop.mapred.YarnChild: 
>>>>> Error running child : java.lang.OutOfMemoryError: GC overhead limit 
>>>>> exceeded
>>>>>   at org.apache.hadoop.io.Text.setCapacity(Text.java:268)
>>>>>   at org.apache.hadoop.io.Text.set(Text.java:224)
>>>>>   at org.apache.hadoop.io.Text.set(Text.java:214)
>>>>>   at org.apache.hadoop.io.Text.(Text.java:93)
>>>>>   at 
>>>>> org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableStringObjectInspector.copyObject(WritableStringObjectInspector.java:36)
>>>>>   at 
>>>>> org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.copyToStandardObject(ObjectInspectorUtils.java:418)
>>>>>   at 
>>>>> org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.copyToStandardObject(ObjectInspectorUtils.java:442)
>>>>>   at 
>>>>> org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.copyToStandardObject(ObjectInspectorUtils.java:428)
>>>>>   at 
>>>>> org.apache.hadoop.hive.ql.exec.KeyWrapperFactory$ListKeyWrapper.deepCopyElements(KeyWrapperFactory.jav

Re: out of memory using Union operator and array column type

2019-03-11 Thread Devopam Mittra
hi Patrick,
If it sounds worth trying please do the same:

1. Create physical table from table 1. (with filter clause)
2. Create physical table from table 2. (with filter clause)
3. Create interim table 2_1 with the DISTINCT clause.
4. Create interim table 2_2 with the UNION clause.
5. Do an INSERT OVERWRITE into target table from 2_2 above.

This should help with isolating the error step.
Yes, increasing # mappers or memory helps - but I am usually averse to this
till I have exhausted all other options to be sure.
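A rough HiveQL shape for those steps, with placeholder table names and the filters from the original query (a sketch, not a drop-in script):

CREATE TABLE step1_gold  AS SELECT * FROM table1 WHERE local_date='2019-01-01';
CREATE TABLE step2_delta AS SELECT * FROM table2 WHERE local_date='2019-01-01';
CREATE TABLE step3_delta_distinct AS SELECT DISTINCT * FROM step2_delta;
CREATE TABLE step4_union AS
  SELECT * FROM (
    SELECT * FROM step1_gold
    UNION DISTINCT
    SELECT * FROM step3_delta_distinct
  ) u;
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE table3 PARTITION (local_date)
SELECT * FROM step4_union;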

regards
Dev


On Mon, Mar 11, 2019 at 9:21 PM Patrick Duin  wrote:

> Very good question, Yes that does give the same problem.
>
> Op ma 11 mrt. 2019 om 16:28 schreef Devopam Mittra :
>
>> Can you please try doing SELECT DISTINCT * FROM DELTA into a physical
>> table first ?
>> regards
>> Dev
>>
>>
>> On Mon, Mar 11, 2019 at 7:59 PM Patrick Duin  wrote:
>>
>>> Hi,
>>>
>>> I'm running into oom issue trying to do a Union all on a bunch of AVRO
>>> files.
>>>
>>> The query is something like this:
>>>
>>> with gold  as ( select * from table1 where local_date=2019-01-01),
>>>  delta as ( select * from table2 where local_date=2019-01-01)
>>> insert overwrite table3 PARTITION ('local_date')
>>> select * from gold
>>> union distinct
>>> select * from delta;
>>>
>>> UNION ALL works. The data size is in the low gigabytes and I'm running
>>> on 6 16 GB Nodes (I've tried larger and set memory settings higher but that
>>> just postpones the error).
>>>
>>> Mappers fail with errors (stack traces not all the same)
>>>
>>> 2019-03-11 13:37:22,381 ERROR [main] org.apache.hadoop.mapred.YarnChild: 
>>> Error running child : java.lang.OutOfMemoryError: GC overhead limit exceeded
>>> at org.apache.hadoop.io.Text.setCapacity(Text.java:268)
>>> at org.apache.hadoop.io.Text.set(Text.java:224)
>>> at org.apache.hadoop.io.Text.set(Text.java:214)
>>> at org.apache.hadoop.io.Text.(Text.java:93)
>>> at 
>>> org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableStringObjectInspector.copyObject(WritableStringObjectInspector.java:36)
>>> at 
>>> org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.copyToStandardObject(ObjectInspectorUtils.java:418)
>>> at 
>>> org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.copyToStandardObject(ObjectInspectorUtils.java:442)
>>> at 
>>> org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.copyToStandardObject(ObjectInspectorUtils.java:428)
>>> at 
>>> org.apache.hadoop.hive.ql.exec.KeyWrapperFactory$ListKeyWrapper.deepCopyElements(KeyWrapperFactory.java:152)
>>> at 
>>> org.apache.hadoop.hive.ql.exec.KeyWrapperFactory$ListKeyWrapper.deepCopyElements(KeyWrapperFactory.java:144)
>>> at 
>>> org.apache.hadoop.hive.ql.exec.KeyWrapperFactory$ListKeyWrapper.copyKey(KeyWrapperFactory.java:121)
>>> at 
>>> org.apache.hadoop.hive.ql.exec.GroupByOperator.processHashAggr(GroupByOperator.java:805)
>>> at 
>>> org.apache.hadoop.hive.ql.exec.GroupByOperator.processKey(GroupByOperator.java:719)
>>> at 
>>> org.apache.hadoop.hive.ql.exec.GroupByOperator.process(GroupByOperator.java:787)
>>> at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:897)
>>> at 
>>> org.apache.hadoop.hive.ql.exec.UnionOperator.process(UnionOperator.java:148)
>>> at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:897)
>>> at 
>>> org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:95)
>>> at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:897)
>>> at 
>>> org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:130)
>>> at 
>>> org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.forward(MapOperator.java:148)
>>> at 
>>> org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:547)
>>> at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:160)
>>> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
>>> at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:455)
>>> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:344)
>>> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
>>> at java.security.AccessController.doPrivileged(Native Method)
>>> at javax.

Re: out of memory using Union operator and array column type

2019-03-11 Thread Devopam Mittra
Can you please try doing SELECT DISTINCT * FROM DELTA into a physical table
first?
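Something along these lines, assuming the filtered delta rows have first been landed in a table (names are placeholders):

CREATE TABLE delta_distinct AS
SELECT DISTINCT * FROM delta;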
regards
Dev


On Mon, Mar 11, 2019 at 7:59 PM Patrick Duin  wrote:

> Hi,
>
> I'm running into an OOM issue trying to do a UNION ALL on a bunch of AVRO
> files.
>
> The query is something like this:
>
> with gold  as ( select * from table1 where local_date=2019-01-01),
>  delta as ( select * from table2 where local_date=2019-01-01)
> insert overwrite table3 PARTITION ('local_date')
> select * from gold
> union distinct
> select * from delta;
>
> UNION ALL works. The data size is in the low gigabytes and I'm running on
> 6 16 GB Nodes (I've tried larger and set memory settings higher but that
> just postpones the error).
>
> Mappers fail with errors (stack traces not all the same)
>
> 2019-03-11 13:37:22,381 ERROR [main] org.apache.hadoop.mapred.YarnChild: 
> Error running child : java.lang.OutOfMemoryError: GC overhead limit exceeded
>   at org.apache.hadoop.io.Text.setCapacity(Text.java:268)
>   at org.apache.hadoop.io.Text.set(Text.java:224)
>   at org.apache.hadoop.io.Text.set(Text.java:214)
>   at org.apache.hadoop.io.Text.(Text.java:93)
>   at 
> org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableStringObjectInspector.copyObject(WritableStringObjectInspector.java:36)
>   at 
> org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.copyToStandardObject(ObjectInspectorUtils.java:418)
>   at 
> org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.copyToStandardObject(ObjectInspectorUtils.java:442)
>   at 
> org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.copyToStandardObject(ObjectInspectorUtils.java:428)
>   at 
> org.apache.hadoop.hive.ql.exec.KeyWrapperFactory$ListKeyWrapper.deepCopyElements(KeyWrapperFactory.java:152)
>   at 
> org.apache.hadoop.hive.ql.exec.KeyWrapperFactory$ListKeyWrapper.deepCopyElements(KeyWrapperFactory.java:144)
>   at 
> org.apache.hadoop.hive.ql.exec.KeyWrapperFactory$ListKeyWrapper.copyKey(KeyWrapperFactory.java:121)
>   at 
> org.apache.hadoop.hive.ql.exec.GroupByOperator.processHashAggr(GroupByOperator.java:805)
>   at 
> org.apache.hadoop.hive.ql.exec.GroupByOperator.processKey(GroupByOperator.java:719)
>   at 
> org.apache.hadoop.hive.ql.exec.GroupByOperator.process(GroupByOperator.java:787)
>   at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:897)
>   at 
> org.apache.hadoop.hive.ql.exec.UnionOperator.process(UnionOperator.java:148)
>   at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:897)
>   at 
> org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:95)
>   at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:897)
>   at 
> org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:130)
>   at 
> org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.forward(MapOperator.java:148)
>   at 
> org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:547)
>   at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:160)
>   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
>   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:455)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:344)
>   at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
>   at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:169)
>
> I've tried Hive 2.3.2 and Hive 2.3.4, both tez and mr engines.
>
> I've tried running with more and fewer mappers, always hitting OOM.
>
> I'm running a similar query on different (much larger) data without issues, so I
> suspect something with the actual data.
>
> The table schema is this:
> c1           string
> c2           bigint
> c3           array>
> local_date  string
>
>
> I've narrowed it down and (not surprisingly) the 3rd column seems to be the
> cause of the issue. If I remove it, the union works just fine again.
>
> Does anyone have similar experiences? Perhaps any pointers on how to tackle this?
>
> Kind regards,
>
>  Patrick
>
>
>

-- 
Devopam Mittra
Life and Relations are not binary


Re: Programmatically determine version of Hive running on server

2018-08-13 Thread Devopam Mittra
Can you achieve that with
hive --version, using a simple script?
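If a query round trip is acceptable instead, recent Hive releases also expose the build version through a version() UDF (a sketch; worth checking that your build registers it):

SELECT version();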

regards
Dev


On Mon, Aug 13, 2018 at 2:49 PM Bohdan Kazydub 
wrote:

> Hi all,
>
> is it possible to determine a version of Hive running on server using Hive
> (HiveMetaStoreClient etc.) 2.3.3?
>
> Best regards,
> Bohdan
>


-- 
Devopam Mittra
Life and Relations are not binary


Re: Difference between join and inner join

2017-02-11 Thread Devopam Mittra
+1 to this question. I would appreciate it if someone can shed light on the
behind-the-scenes for this one. It might give us ideas for best practices.

Regards
Devopam

On 11 Feb 2017 8:02 pm, "Divya Gehlot"  wrote:

> Hi ,
> What's the difference between "join" and "inner join" in Hive?
>
> Thanks ,
> Divya
>
>


Re: VARCHAR or STRING fields in Hive

2017-01-16 Thread Devopam Mittra
A few things that might have an effect:

1. Compression (better if you are in VARCHAR with finite length, instead of
a STRING)
2. Multicharset support (like NVARCHAR)
3. LOBs from RDBMS world are more suitable to be typecast to STRING for
pure text data (not images e.g.)

regards
Devopam

On Mon, Jan 16, 2017 at 9:07 PM, Mich Talebzadeh 
wrote:

>
> Coming from DBMS background I tend to treat the columns in Hive similar to
> an RDBMS table. For example if a table created in Hive as Parquet I will
> use VARCHAR(30) for column that has been defined as VARCHAR(30) as source.
> If a column is defined as TEXT in RDBMS table I use STRING in Hive with a
> max size of 2GB I believe.
>
> My view is that it is more efficient storage-wise to have the Hive table
> created as VARCHAR as opposed to STRING.
>
> I have not really seen any performance difference if one uses VARCHAR or
> STRING. However, I believe there is a reason why one has VARCHAR in Hive as
> opposed to STRING.
>
> What is the thread view on this?
>
> Thanks
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>



-- 
Devopam Mittra
Life and Relations are not binary


Re: DateFunction

2017-01-16 Thread Devopam Mittra
hi Mahender,

I don't know your version of Hive.
Please try:
date_format(current_date, 'MMMM')
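For example (a sketch; date_format is available from Hive 1.2 onward and follows java.text.SimpleDateFormat patterns, where 'MMMM' gives the full month name):

SELECT date_format(current_date, 'MMMM');   -- e.g. 'January'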

regards
Dev


On Mon, Jan 16, 2017 at 6:56 PM, Jitendra Yadav 
wrote:

> Ref: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#
> LanguageManualUDF-DateFunctions.
>
> int  month(string date)
>
> Returns the month part of a date or a timestamp string: month("1970-11-01
> 00:00:00") = 11, month("1970-11-01") = 11.
>
> Does it fit in your requirement?.
>
> Thanks
>
> On Mon, Jan 16, 2017 at 12:21 PM, Mahender Sarangam <
> mahender.bigd...@outlook.com> wrote:
>
>> Hi,
>>
>> Is there any Date Function which returns Full Month Name for given time
>> stamp.
>>
>>
>


-- 
Devopam Mittra
Life and Relations are not binary


Re: Happy Diwali to those forum members who celebrate this great festival

2016-10-30 Thread Devopam Mittra
+1
Thanks and regards
Devopam

On 30 Oct 2016 9:37 pm, "Mich Talebzadeh"  wrote:

> Enjoy the festive season.
>
> Regards,
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>


Re: Need help with query

2016-09-12 Thread Devopam Mittra
Kindly learn about dynamic partitioning from the cwiki. That will be the perfect
solution to your requirement, in my opinion.
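A bare-bones illustration of a dynamic-partition insert (a sketch; the staging table and columns here are made-up placeholders):

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

insert overwrite table my_table partition (yyyymmdd)
select col1, col2, yyyymmdd    -- the partition column must come last in the select list
from my_staging_table;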
Regards
Dev

On 13 Sep 2016 12:49 am, "Igor Kravzov"  wrote:

> Hi,
>
> I have a query like this one
>
> alter table my_table
>   add if not exists partition (yyyymmdd=20160912) location
> '/mylocation/20160912';
>
> Is it possible to make so I don't have to change date every day?
> Something with  CURRENT_DATE;?
>
> Thanks in advance.
>


Re: What is the best way to store IPv6 address in Hive?

2016-06-28 Thread Devopam Mittra
My best bet would be the string data type itself, with partitioning to aid
partial search. Please do consider the fact that an IPv6 address is more
complicated than IPv4 in terms of searching.

Regards
Dev
On 28 Jun 2016 9:35 pm, "Igor Kuzmenko"  wrote:

> Currently I'm using ORC transactional tables, and i need to store a lot of
> data containing IP addresses.
> With IPv4 it can be an Integer (4 bytes exactly), but what about IPv6?
> Obviously it should be space efficient and easy to search for an exact match.
> As extra feature it would be good to do fast search with mask (10.120.*.*)
>


Re: count(*) not allowed in order by

2016-03-07 Thread Devopam Mittra
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics

This should help you; try RANK / DENSE_RANK as appropriate and mold it to
best use for yourself.
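A sketch of that shape using the names from the question (rank the grouped counts, then keep the top 25):

select id
from (
  select id, cnt, rank() over (order by cnt desc) as rnk
  from (select id, count(*) as cnt from T group by id) g
) ranked
where rnk <= 25;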
Regards
Dev
On Mar 7, 2016 10:35 PM, "Awhan Patnaik"  wrote:

> I have to take the first 25 IDs ranked by count(*). But the following is
> not allowed in Hive
>
> select id from T order by count(*) desc limit 25;
>
> Which yields a "NOt yet supported place for UDAF count". The way around it
> is the following
>
> select id, count(*) as cnt from T group by id order by cnt desc limit 25;
>
> However I need to put this query in a subquery like so
>
> select id, XXX from T t join  where t.id in (select id, count(*) as
> cnt from T group by id order by cnt desc limit 25) group by ... ;
>
> which does NOT work because a subquery is allowed to return only one
> thing. Here XXX are complex constructs like distance calculation and
> binning. These are time consuming and complex operations.
>
> The only way I see this would work is if I use the following sub-sub query
>
> select id, XXX from T t join ... where t.id in (select sub.id from
> (select id, count(*) as cnt from T group by id order by cnt desc limit
> 25)sub) group by ... ;
>
> The reason I don't want to put the limit in the outermost query is because
> those XXX queries are expensive and I don't want them to be performed on
> the entire result only to retain the top 25 and throw away the rest of the
> results. The count(*) operation of the rest of the IDs is not very
> expensive or time consuming.
>
> Is there some other elegant way of handling this without using a
> sub-sub-query approach?
>


Re: Is it ok to build an entire ETL/ELT data flow using HIVE queries?

2016-02-16 Thread Devopam Mittra
+1 for all suggestions provided already.

I have personally used Talend Big Data Studio in conjunction with Hive +
Cron/Autosys to build and manage a small DW.
I found it easy to rapidly build and deploy. It helps with email integration etc.,
which was my custom requirement (spool a few reports and share them via email at
routine intervals).

regards
Dev

On Tue, Feb 16, 2016 at 4:10 PM, Elliot West  wrote:

> I'd say that so long as you can achieve a similar quality of engineering
> as is possible with other software development domains, then 'yes, it is
> ok'.
>
> Specifically, our Hive projects are packaged as RPMs, built and released
> with Maven, covered by suites of unit tests developed with HiveRunner, and
> part of the same Jenkins CI process as other Java based projects.
> Decomposing large processes into sensible units is not as easy as with
> other frameworks so this may require more thought and care.
>
> More information here:
> https://cwiki.apache.org/confluence/display/Hive/Unit+testing+HQL
>
> You have many potential alternatives depending on which languages you are
> comfortable using: Pig, Flink, Cascading, Spark, Crunch, Scrunch, Scalding,
> etc.
>
> Elliot.
>
>
> On Tuesday, 16 February 2016, Ramasubramanian <
> ramasubramanian.naraya...@gmail.com> wrote:
>
>> Hi,
>>
>> Is it ok to build an entire ETL/ELT data flow using HIVE queries?
>>
>> Data is stored in HIVE. We have transactional and reference data. We need
>> to build a small warehouse.
>>
>> Need suggestion on alternatives too.
>>
>> Regards,
>> Rams
>
>


-- 
Devopam Mittra
Life and Relations are not binary


Re: Fastest way to get the row count

2016-01-13 Thread Devopam Mittra
hello Mahender,
I mostly use the Beeline CLI for such operations. My best bet is to parse the
output for the "... rows selected" line and use it for logging the row count.
The INFO lines are an overhead, but I can happily live with that to achieve my
objective.

Suggestions welcome, in case there is a cleaner way to achieve this.

regards
Dev

On Wed, Jan 13, 2016 at 1:26 AM, mahender bigdata <
mahender.bigd...@outlook.com> wrote:

> Hi Team,
>
> Is there any built-in command in hive to get ROWCOUNT of previous
> operation just like in SQL Server @@RowCount. For example: I have table
> with almost 2 GB of data per partition, We run select query on table with
> condition, which filters the records, we will be dumping those filtered
> records in to another table apart from inserting, we need insert ROWCOUNT
> inserted into another hive table. Rather than running Select count(1) or 
> Explain
> commands, is there better way to get row count of affected row. (previous
> step row count)
>
> Mahender
>



-- 
Devopam Mittra
Life and Relations are not binary


Re: Hive Support for Unicode languages

2015-12-09 Thread Devopam Mittra
Presuming you are using some utility like Cygwin/Putty etc. to access Hive
CLI , you would need to configure the settings of this application for
supporting extended charset display.

Easy/Quick option to verify the support will be by using some data browser
application (e.g. Talend ) to sample data from an existing table where you
have unicode sample records.

Hope I understood your requirement right...

regards
Dev

On Wed, Dec 9, 2015 at 11:33 PM, mahender bigdata <
mahender.bigd...@outlook.com> wrote:

> Any update ?
>
>
> On 12/5/2015 2:10 PM, Mahender Sarangam wrote:
>
> Its Windows Server 2012 OS.
>
> > From: jornfra...@gmail.com
> > Subject: Re: Hive Support for Unicode languages
> > Date: Fri, 4 Dec 2015 13:19:00 +0100
> > To: user@hive.apache.org
> >
> > What operating system are you using?
> >
> > > On 04 Dec 2015, at 01:25, mahender bigdata
>   wrote:
> > >
> > > Hi Team,
> > >
> > > Does hive supports Hive Unicode like UTF-8,UTF-16 and UTF-32. I would
> like to see different language supported in hive table. Is there any serde
> which can show exactly japanese, chineses character rather than showing
> symbols on Hive console.
> > >
> > > -Mahender
>
>
>


-- 
Devopam Mittra
Life and Relations are not binary


Re: Storing the Hive Query Results into Variable

2015-12-06 Thread Devopam Mittra
There are multiple ways of doing it. It will depend on your method of
implementation as well, i.e. what platform you are using to fire the
Hive queries.

Pig can do that for you (as already confirmed by another user).
Shell scripts (Unix-based implementation) also support it.
A very crude yet effective way: hive -e 'select * from table' >
/home/user/output$$.txt

You may want to check out the Hive CLI features available and then pick
what suits your need best:

1.
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+VariableSubstitution
2. primer:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Cli
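As a small illustration of the variable substitution described in link 1 (the variable name and query are made up):

SET hivevar:cutoff_date=2015-12-01;
SELECT COUNT(*) FROM my_table WHERE load_date >= '${hivevar:cutoff_date}';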

Hope this helps you to move forward

regards
Dev

On Fri, Dec 4, 2015 at 5:48 AM, mahender bigdata <
mahender.bigd...@outlook.com> wrote:

> Hi,
> Is there option available to store hive results into variable like
>
> select @i= count(*) from HiveTable.
>
> or
>
> Storing table results into a variable and making use of it at a later stage of
> the query. I tried using an HQL CTE, but the scope of a CTE is limited to the next select
> only. Is there a way to put intermediate results, like joined data, into a variable
> and make use of it at a later stage?
>
> Thanks,
> Mahender
>



-- 
Devopam Mittra
Life and Relations are not binary


Re: [ANNOUNCE] New PMC Member : John Pullokkaran

2015-11-24 Thread Devopam Mittra
+1
Congratulations !

regards
Dev


On Wed, Nov 25, 2015 at 4:29 AM, Ashutosh Chauhan 
wrote:

> On behalf of the Hive PMC I am delighted to announce John Pullokkaran is
> joining Hive PMC.
> John is a long time contributor in Hive and is focusing on compiler and
> optimizer areas these days.
> Please give John a warm welcome to the project!
>
> Ashutosh
>



-- 
Devopam Mittra
Life and Relations are not binary


Re: Hive alternatives?

2015-11-05 Thread Devopam Mittra
I agree with the suggestions presented already.
You may want to check Presto as an alternative as well.
But please remember, Presto is an added layer on top of Hive and not an
independent alternative.
It simplifies your semantic layer and querying while being faster than Hive.
For OLAP, I will always recommend a pre-calculated aggregate layer and avoiding
ad-hoc analytics as much as feasible.

Regards
Dev
On Nov 6, 2015 1:29 AM, "Alex Kamil"  wrote:

> 1) Apache Phoenix  + Mondrian
> 
> 2) Apache Spark 
>
> On Thu, Nov 5, 2015 at 2:49 PM, Jörn Franke  wrote:
>
>> First it depends on what you want to do exactly. Second, Hive > 1.2, Tez
>> as an Execution Engine (I recommend >= 0.8) and Orc as storage format can
>> be pretty quick depending on your use case. Additionally you may want to
>> employ compression which is a performance boost once you understand how
>> storage indexes and bloom filter work. Additionally , you need to think
>> about how you sort the data. Cf. also
>>
>> https://snippetessay.wordpress.com/2015/07/25/hive-optimizations-with-indexes-bloom-filters-and-statistics/
>>
>> However, you have to rethink how you define your technical data model. A
>> lot of prejoined data in a big flat table can be more performant when
>> using storage indexes and bloom filters than using standard indexes and
>> dimensional modeling.
>>
>> Besides Tez you can also use another execution engine in your
>> session (e.g. Spark) if this makes sense.
>>
>> Finally you have to review how yarn manages resources including
>> preemption, fair vs capacity scheduler etc.
>>
>> Btw the same holds also for relational database appliances, such as
>> Exadata. The standard approach dimensional modeling + standard indexes
>> there is often not anymore the most performant.
>>
>>
>>
>> > On 05 Nov 2015, at 20:04, Andrés Ivaldi  wrote:
>> >
>> > Hello,
>> > I was looking at Hive as an OLAP alternative, but I've read that it is quite
>> slow for that. Does anybody have experience with this, or a Hive alternative
>> for OLAP? Kylin is not an option because we need dynamic OLAP, like ROLAP.
>> >
>> > Regards,
>> >
>> > --
>> > Ing. Ivaldi Andres
>>
>
>


request for editing cwiki, username: devopam

2015-10-28 Thread Devopam Mittra
regards
Dev


Re: need help to keep the Hive logo on wikipedia

2015-10-18 Thread Devopam Mittra
hi Gopal,Lefty,
Thanks for the relevant pointers. Got this one fixed and attributed image
to ASF. FYI only.

regards
Dev


On Mon, Oct 19, 2015 at 12:57 AM, Gopal Vijayaraghavan 
wrote:

>
> > Rights to use the Hive logo are controlled by the Apache Software
> >Foundation, not by the Hive project.  You might find what you need here:
> >http://www.apache.org/foundation/marks/contact.
> >  (I found that link here:  May I use Apache product logos on the
> >cover/title of published books, articles, or other similar publications?
> ><http://www.apache.org/foundation/marks/faq/#titlelogo>)
> ...
> >> please refer to :
> >>https://commons.wikimedia.org/wiki/File:Apache_Hive_logo.jpg
>
> First up, this is not legal advice - but a recollection from a previous
> incident.
>
> The ASF link is about Trademark restrictions, while the permissions
> demanded by Wikipedia is about Copyright.
>
> Copyright license is covered by the APL v2, since the source image is
> checked into hive git repo under docs/images/.
>
> The reason Wikipedia has flagged this image is probably because it is
> marked as "Own work" by Devopam & not attributed correctly.
>
> Cheers,
> Gopal
>
>
>


-- 
Devopam Mittra
Life and Relations are not binary


need help to keep the Hive logo on wikipedia

2015-10-17 Thread Devopam Mittra
hello there,
please refer to :
https://commons.wikimedia.org/wiki/File:Apache_Hive_logo.jpg

I had uploaded this as-is after downloading from main site, and stated the
usage terms from http://www.apache.org/licenses/LICENSE-2.0.html

I have been asked now to provide explicit permission to use the logo on the
respective wikipedia page(s) else the logo image will be deleted... e.g.
https://it.wikipedia.org/wiki/Apache_Hive ,
https://en.wikipedia.org/wiki/Apache_Hive

Can someone help me with this ?

regards
-- 
Devopam Mittra
Life and Relations are not binary


Re: [ANNOUNCE] New Hive PMC Chair - Ashutosh Chauhan

2015-10-11 Thread Devopam Mittra
+1 Congratulations !

regards
Dev

On Thu, Sep 17, 2015 at 1:09 AM, Carl Steinbach  wrote:

> I am very happy to announce that Ashutosh Chauhan is taking over as the
> new VP of the Apache Hive project. Ashutosh has been a longtime contributor
> to Hive and has played a pivotal role in many of the major advances that
> have been made over the past couple of years. Please join me in
> congratulating Ashutosh on his new role!
>



-- 
Devopam Mittra
Life and Relations are not binary


Re: EXPORTing multiple partitions

2015-06-26 Thread Devopam Mittra
+1

regards
Dev

On Fri, Jun 26, 2015 at 12:50 PM, @Sanjiv Singh 
wrote:

> What about this :
>
> create view foo_copy_partition AS  select * from foo where id in (1,2,3);
> export table foo_copy_partition  to ‘path’;
> drop view foo_copy_partition ;
>
>
>
>
> Regards
> Sanjiv Singh
> Mob :  +091 9990-447-339
>
> On Fri, Jun 26, 2015 at 7:40 AM, Xuefu Zhang  wrote:
>
>> Hi Brian,
>>
>> If you think that is useful, please feel free to create a JIRA requesting
>> for it.
>>
>> Thanks,
>> Xuefu
>>
>> On Thu, Jun 25, 2015 at 10:36 AM, Brian Jeltema <
>> brian.jelt...@digitalenvoy.net> wrote:
>>
>>> Answering my own question:
>>>
>>>   create table foo_copy like foo;
>>>   insert into foo_copy partition (id) select * from foo where id in
>>> (1,2,3);
>>>   export table foo_copy to ‘path’;
>>>   drop table foo_copy;
>>>
>>> It would be nice if export could do this automatically, though.
>>>
>>> Brian
>>>
>>> On Jun 25, 2015, at 11:34 AM, Brian Jeltema <
>>> brian.jelt...@digitalenvoy.net> wrote:
>>>
>>> > Using Hive .13, I would like to export multiple partitions of a table,
>>> something conceptually like:
>>> >
>>> >   EXPORT TABLE foo PARTITION (id=1,2,3) to ‘path’
>>> >
>>> > Is there any way to accomplish this?
>>> >
>>> > Brian
>>>
>>>
>>
>


-- 
Devopam Mittra
Life and Relations are not binary


Re: Date Functions in Hive

2015-06-23 Thread Devopam Mittra
Please go through UDFs in detail :
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF

I presume that your column is STRING datatype.

You may want to do the following:

SELECT CAST(to_date(from_unixtime(unix_timestamp(your_column, "MMM dd, yyyy"))) AS DATE)
FROM your_table;

Once you have your DATE datatype conversion you may manipulate the output
to anything like MM-DD-YYYY as desired above (but I suggest you rethink why
it wouldn't simply be the default YYYY-MM-DD in the native DATE
datatype itself).

If you have a full month name like "January", then you will have to use
MMMM instead of MMM.

If you are simply trying to do STRING manipulation to change the column
appearance, I would rather suggest Unix / OS-level file-based manipulation
itself.

Apologies I haven't tested the code as I don't have access to my cluster
now.

Hope it helps

regards
Dev




On Tue, Jun 23, 2015 at 9:47 PM, saurabh  wrote:

> Hi,
>
> Need some suggestions on Date function in Hive.
> The data appearing in the source file is in the format of "May 31, 2015".
> I want to convert the same to MM-DD-YYYY format.
>
> Please suggest if this can be done with existing Hive 14.0 functionality
> or it will require custom UDF.
>
> Please let me know if any more information is required on the same.
>
> Thanks,
> Saurabh
>
>
>
>
>


-- 
Devopam Mittra
Life and Relations are not binary


Re: Updating hive metadata

2015-06-18 Thread Devopam Mittra
hi Prasanth,
I would not suggest tweaking Hive metastore info unless you have full
knowledge of all the tables that will get impacted by such a change.
And such things break a lot with upgrades, since this is quite unmanageable
manually.

Why don't you create my_managed_table_2 as type EXTERNAL and link it to the
copied data in the HDFS layer?
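A skeleton of that, reusing the location from your example (the column list is hypothetical and must mirror my_managed_table's schema):

CREATE EXTERNAL TABLE my_managed_table_2 (
  id INT,
  name STRING
)
LOCATION '/user/hive/warehouse/mydb.db/my_managed_table_2';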

regards
Dev

On Thu, Jun 18, 2015 at 11:40 PM, Chagarlamudi, Prasanth <
prasanth.chagarlam...@epsilon.com> wrote:

>  Hello,
>
> Is there a way to update metadata in hive?
>
>
>
> Created database *mydb*;
>
> Created *my_point_table*;
>
> Created *my_managed_table*;
>
> Insert into *my_managed_table* from *my_point_table*;
>
>
>
> Now,
>
> Create *my_point_table_2*;
>
> //Copy data from hive* managed_table *to* managed_table_2*’s location
>
> hdfs dfs -cp /user/hive/warehouse/mydb.db/my_managed_table
> /user/hive/warehouse/mydb.db/my_managed_table_2
>
> At this point, I am expecting the following query
>
> Select * from *my_managed_table_2*; to give me all the data I
> just copied from *my_managed_table*;
>
>
>
> How do I update the hive metastore to consider the data that I copied to
> *my_managed_table_2*? Is that even possible?
>
>
>
> Thanks in advance
>
> Prasanth Chagarlamudi
>
>
>
>
>
> --
>
> This e-mail and files transmitted with it are confidential, and are
> intended solely for the use of the individual or entity to whom this e-mail
> is addressed. If you are not the intended recipient, or the employee or
> agent responsible to deliver it to the intended recipient, you are hereby
> notified that any dissemination, distribution or copying of this
> communication is strictly prohibited. If you are not one of the named
> recipient(s) or otherwise have reason to believe that you received this
> message in error, please immediately notify sender by e-mail, and destroy
> the original message. Thank You.
>



-- 
Devopam Mittra
Life and Relations are not binary


Re: creating a table based on where column=x

2015-06-09 Thread Devopam Mittra
Please evaluate using unix file operations to simply select what you want
from the csv and use that extract in your external table.
A simple grep command should do the trick I suppose, but you will need to
test it out to suit your needs.

Hope it helps.
regards
Devopam


On Tue, Jun 9, 2015 at 9:50 PM, Abe Weinograd  wrote:

> Does a view not work for you?
>
> You create your external table as you described and a view which looks
> like the 2nd table.
>
> Abe
>
> On Tue, Jun 9, 2015 at 9:44 AM, Gary Clark  wrote:
>
>>  Hello,
>>
>>
>>
>> I would like to create a table based on a selection in hive. Currently I
>> am creating a table based on a location which contains the source csv files
>> and then deriving other tables from that table.
>>
>>
>>
>> I.e:
>>
>>
>>
>> create EXTERNAL TABLE initialtable (deployment_id tinyint,
>>
>>  A  bigint,
>>
>>  sample_date string,
>>
>>  charge smallint,
>>
>>  discharge smallint)
>>
>>  ROW FORMAT DELIMITED
>>
>>  FIELDS TERMINATED BY ','
>> LOCATION '/user/hue/data';
>>
>>
>>
>> CREATE TABLE derivedtable AS SELECT * FROM initialtable WHERE
>> deployment_id='22';
>>
>>
>>
>> I would just like to create a table where the deployment_id is 22 and not
>> have to create the initial table.
>>
>>
>>
>> If someone can shed some light on how to do this I would appreciate it.
>> The only example I have seen seem to follow the above?
>>
>>
>>
>>
>>
>> Thanks,
>>
>> Gary C
>>
>
>


-- 
Devopam Mittra
Life and Relations are not binary


Re: partition and bucket

2015-04-14 Thread Devopam Mittra
able creation
> level in Hive, the number of partitions/files will be fixed. In contrast,
> with partitioning you do not have this limitation.
>
>
>
> HTH
>
>
>
> Mich
>
>
>
>
>
> NOTE: The information in this email is proprietary and confidential. This
> message is for the designated recipient only, if you are not the intended
> recipient, you should destroy it immediately. Any information in this
> message shall not be understood as given or endorsed by Peridale Ltd, its
> subsidiaries or their employees, unless expressly so stated. It is the
> responsibility of the recipient to ensure that this email is virus free,
> therefore neither Peridale Ltd, its subsidiaries nor their employees accept
> any responsibility.
>
>
>
> *From:* Ashok Kumar [mailto:ashok34...@yahoo.com]
> *Sent:* 10 April 2015 17:46
> *To:* user@hive.apache.org
> *Subject:* partition and bucket
>
>
>
> Greeting all,
>
> Glad to join the user group. I am from DBA background Oracle/Sybase/MSSQL.
>
> I would like to understand partition and bucketing in Hive and the
> difference between.
>
> Shall be grateful if someone explains where shall I use partition or
> bucket for best practices.
>
> thanks
>
>
>



-- 
Devopam Mittra
Life and Relations are not binary


need input for heavy duty string manipulation

2015-03-15 Thread Devopam Mittra
hi there,
Need help / suggestion to formulate a hive-only best feasible/optimized
strategy for:


   - I have a table with list of reserved words - one word per row
   - I have a string wherein I want to search for the presence of each reserved
   word and take a count of it, e.g. if there are 5 matches from my reserved
   keyword list, then I expect the query output to be 5.
   - I need to repeat this exercise recursively for all the 'sentences' in
   another table, that has one sentence per row.


How should I do it to get the best (read quickest) performance without
getting too much into recursion , using Hive.

Advise/ suggestions requested.

Environment details: Hive 0.14 / HDP 2.2 , CentOS 6.5 , 2 NN + 6 DN

regards
Devopam


Re: how to add partitioned by column

2015-03-03 Thread Devopam Mittra
1. Unless the table is huge, you may please try moving the data to a
temp table and then reloading it into a new table with the desired
partitioning.
2. If it's huge, then create a separate table with the desired partitioning and
then load data from the existing table in chunks of partitions from the source
table (a rough sketch follows).
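A rough sketch of option 2 with placeholder names, assuming the source table is partitioned by (year, month) and has a date column from which the day can be derived:

CREATE TABLE my_table_day (col1 STRING, col2 INT)
PARTITIONED BY (year INT, month INT, day INT);

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE my_table_day PARTITION (year, month, day)
SELECT col1, col2, year, month, day(event_date) AS day
FROM my_table
WHERE year = 2015 AND month = 2;   -- repeat per chunk of source partitions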


Hope this helps

regards
Dev

On Tue, Mar 3, 2015 at 1:23 PM, smartzjp  wrote:

> I have created a table with "create table  partitioned by (year int,
> month int)".
> Now I want to add a day partition. How can I do it?   TKS
>



-- 
Devopam Mittra
Life and Relations are not binary


Re: Re: How to query data by page in Hive?

2015-02-24 Thread Devopam Mittra
Apologies for the extended delay in providing response. I totally forgot
about this one and got stuck in other things.

Please find below a sample I created based on the inputs received.

CREATE TABLE u_data(
userid int,
movieid int,
rating int,
unixtime string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t' ;

file contents of 'temp_load'
11020810223232
1104571232131
1123481231231
1124441212113
2102091231213
212441123121121
310206123211
51020412312321
6102741231231

LOAD DATA LOCAL INPATH '/home/exilant/Desktop/temp_load' OVERWRITE INTO
TABLE u_data;

WITH movietable AS (
  SELECT a.movieid,
         a.rating,
         ROW_NUMBER() OVER (ORDER BY movieid) AS rownum
  FROM (SELECT movieid,
               AVG(rating) AS rating
        FROM u_data
        GROUP BY movieid) a
)
SELECT movieid, rating
FROM movietable
WHERE rownum BETWEEN 3 AND 5;

So now I can achieve pagination by suitably choosing the values for rownum,
which I can calculate based on which page I am on and how many records there
are per page.

e.g. with 12 records per page, the page with pagenum = 3 (counting pages from 0)
will start from 12*3+1 and end at 12*4 (i.e. BETWEEN pagesize*pagenum+1 AND
pagesize*(pagenum+1))

Hope it helps

regards
Devopam


On Fri, Feb 6, 2015 at 6:18 AM, r7raul1...@163.com 
wrote:

> *Table structure :*
>  CREATE TABLE `u_data`(
> `userid` int,
> `movieid` int,
> `rating` int,
> `unixtime` string)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY '\t'
> STORED AS INPUTFORMAT
> 'org.apache.hadoop.mapred.TextInputFormat'
> OUTPUTFORMAT
> 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
> LOCATION
> 'hdfs://localhost:8020/user/hive/warehouse/u_data'
> TBLPROPERTIES (
> 'COLUMN_STATS_ACCURATE'='true',
> 'numFiles'='1',
> 'numRows'='0',
> 'rawDataSize'='0',
> 'totalSize'='1979173',
> 'transient_lastDdlTime'='1421076916')
>
> *columns :*
>movieid
>
> --
> r7raul1...@163.com
>
>
> *From:* Devopam Mittra 
> *Date:* 2015-02-05 18:48
> *To:* user@hive.apache.org
> *Subject:* Re: Re: How to query data by page in Hive?
> Please provide a valid table structure and the columns you wish to pick
> and I shall email you the query directly
>
>
> regards
> Devopam
>
> On Thu, Feb 5, 2015 at 3:20 PM, r7raul1...@163.com 
> wrote:
>
>> Thank you Devopam! Could you show me a  example?
>>
>> --
>> r7raul1...@163.com
>>
>>
>> *From:* Devopam Mittra 
>> *Date:* 2015-02-05 18:05
>> *To:* user@hive.apache.org
>> *Subject:* Re: How to query data by page in Hive?
>> You may want to use a ROW_NUMBER OR RANK / DENSE RANK in the inner query
>> and then select only a subset of it in the outer query to control
>> pagination. Based on your need, you may want to order the records as well ..
>>
>> Alternatively you may want to use CTE(
>> https://cwiki.apache.org/confluence/display/Hive/Common+Table+Expression)
>> for selecting the data in one go and then use row number to select as in
>> previous case.
>>
>> regards
>> Devopam
>>
>> On Thu, Feb 5, 2015 at 1:31 PM, r7raul1...@163.com 
>> wrote:
>>
>>> Hello,
>>>  How to query data by page in Hive?
>>>
>>> hive> select * from u_data a limit 1,2;
>>> FAILED: ParseException line 1:31 missing EOF at ',' near '1'
>>>
>>> --
>>> r7raul1...@163.com
>>>
>>
>>
>>
>> --
>> Devopam Mittra
>> Life and Relations are not binary
>>
>>
>
>
> --
> Devopam Mittra
> Life and Relations are not binary
>
>


-- 
Devopam Mittra
Life and Relations are not binary


Re: Remove duplicated rows

2015-02-18 Thread Devopam Mittra
hi Philippe,
Performance improvement has two factors: 1. availability (read: abundance)
of resources, 2. need for speed.
All "advice" usually addresses mainly these two factors, as far as I have
seen so far.

Since you are doing a full scan each day to identify duplicates, I suggested
keeping the data uncompressed (in fact you may do the same with just a separate
table containing the key columns).

Apologies that the post didn't come out clearly in terms of scalability; you
are definitely the best judge of the fit-gap analysis, being closest
to the issues. Dynamic partitions allow selectively targeting parts of the table
rather than a full table scan, so I presumed that scaling up should never be an issue ...

Couldn't be of much help here I guess :(
regards
Dev


On Wed, Feb 18, 2015 at 3:02 PM, Philippe Kernévez 
wrote:

> Hi Dev,
>
> > "Only suggestion is that please don't compress the files that are
> linked in the external table" .
> Why do you suggest that ? Presently, I divide my import time by 2 by using
> 'gz' files instead of the uncompressed 'csv'.
>
> > "You may want to leverage dynamic partition"
> I read your post, and I wonder if it can scale. It seems that you have
> hourly and daily aggregation. More or less 24*365 = 8'800 rows per year.
> We have about 2 billion (2E9) rows per year.
> Do you know if your approach can scale to such a factor?
> I also wonder about the impact of the partition columns on the map reduce
> job, they will probably produce a big number of files to produce those
> partitions.
>
>
> Regards,
> Philippe
>
>
>
> On Wed, Feb 18, 2015 at 2:49 AM, Devopam Mittra  wrote:
>
>> dear Philippe,
>> I asked specifically the version so as to understand what built-in
>> features you can leverage to your benefit.
>>
>> There are two ways that I can think of - to help you in your case :
>> 1. Use RANK as you have already noted in your original note - this will
>> be faster and more appropriate . Only suggestion is that please don't
>> compress the files that are linked in the external table.
>>
>> 2. You may want to leverage dynamic partition feature to your aid for
>> managing the duplicate records . I like it and use it extensively now a
>> days after 0.13 onward. Brief concept : dynamically partition the table on
>> one low skew column + 'key' column . Then simply INSERT OVERWRITE the
>> 'delta' and it will seamlessly overwrite only the affected rows and do the
>> redistribution of data in the table internally without you having to bother
>> about the code or the burden to do it. You may refer to this blog I wrote
>> quite some time back : http://linkd.in/1Fq3wdb
>> This technique will cost you a little overhead time with the MR job
>> getting kicked off and all , but is painless and seamless , so I like it
>> this way.
>>
>> regards
>> Dev
>>
>> On Wed, Feb 18, 2015 at 3:45 AM, Philippe Kernévez 
>> wrote:
>>
>>> Hi Dev,
>>>
>>> I'm using hive 0.14.
>>>
>>> Regards,
>>>
>>> On Tue, Feb 17, 2015 at 8:12 PM, Devopam Mittra 
>>> wrote:
>>>
>>>> Can you please state the hive version
>>>>
>>>> regards
>>>> Dev
>>>> +91 958 305 9899
>>>>
>>>> On Feb 17, 2015, at 11:01 PM, Philippe Kernévez 
>>>> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I have a table (named DEDUPLICATED) that contains about 1 billions rows.
>>>> Each day I receive a new file  with about 5 millions rows. About 10% of
>>>> those rows are duplicated (duplication occur inside a daily file but also
>>>> between files).
>>>> There are about 30 fields in the files.
>>>>
>>>> As for now I deduplicate all the data every day with the following
>>>> request :
>>>>
>>>>   INSERT OVERWRITE TABLE DEDUPLICATED
>>>> SELECT cl.*
>>>> FROM (
>>>> SELECT d.*, ROW_NUMBER() OVER (PARTITION BY d.KEY) AS pos
>>>> FROM DAILY d  -- DAILY is an external table that contains all
>>>> the daily files
>>>> ) cl
>>>> WHERE cl.pos = 1
>>>>
>>>> On the mailing list I saw another approach base on a "group by KEY"
>>>> request and use a 'select MAX(xxx)' for all non-key fields.
>>>>
>>>> My first question is : which of the both seems to be better ?
>>>> (the second one is quite harder to maintain as all the f

Re: Remove duplicated rows

2015-02-17 Thread Devopam Mittra
dear Philippe,
I asked specifically the version so as to understand what built-in features
you can leverage to your benefit.

There are two ways that I can think of - to help you in your case :
1. Use RANK as you have already noted in your original note - this will be
faster and more appropriate. The only suggestion is that please don't compress
the files that are linked in the external table.

2. You may want to leverage the dynamic partition feature to your aid for
managing the duplicate records. I like it and use it extensively nowadays,
0.13 onward. Brief concept: dynamically partition the table on
one low-skew column + the 'key' column. Then simply INSERT OVERWRITE the
'delta' and it will seamlessly overwrite only the affected rows and
redistribute the data in the table internally, without you having to bother
about the code or the burden to do it. You may refer to this blog I wrote
quite some time back: http://linkd.in/1Fq3wdb
This technique will cost you a little overhead time with the MR job getting
kicked off and all, but it is painless and seamless, so I like it this way.

regards
Dev

On Wed, Feb 18, 2015 at 3:45 AM, Philippe Kernévez 
wrote:

> Hi Dev,
>
> I'm using hive 0.14.
>
> Regards,
>
> On Tue, Feb 17, 2015 at 8:12 PM, Devopam Mittra  wrote:
>
>> Can you please state the hive version
>>
>> regards
>> Dev
>> +91 958 305 9899
>>
>> On Feb 17, 2015, at 11:01 PM, Philippe Kernévez 
>> wrote:
>>
>> Hi,
>>
>> I have a table (named DEDUPLICATED) that contains about 1 billions rows.
>> Each day I receive a new file  with about 5 millions rows. About 10% of
>> those rows are duplicated (duplication occur inside a daily file but also
>> between files).
>> There are about 30 fields in the files.
>>
>> As for now I deduplicate all the data every day with the following
>> request :
>>
>>   INSERT OVERWRITE TABLE DEDUPLICATED
>> SELECT cl.*
>> FROM (
>> SELECT d.*, ROW_NUMBER() OVER (PARTITION BY d.KEY) AS pos
>> FROM DAILY d  -- DAILY is an external table that contains all the
>> daily files
>> ) cl
>> WHERE cl.pos = 1
>>
>> On the mailing list I saw another approach base on a "group by KEY"
>> request and use a 'select MAX(xxx)' for all non-key fields.
>>
>> My first question is : which of the both seems to be better ?
>> (the second one is quite harder to maintain as all the fields should be
>> explicitly written in the request).
>>
>>
>>
>> The second question is : what is the best way to do the deduplication and
>> import on a incremental approach ?
>> Something like that ?
>>   INSERT TABLE DEDUPLICATED
>> SELECT cl.*
>> FROM (
>> SELECT d.*, ROW_NUMBER() OVER (PARTITION BY d.KEY) AS pos
>> FROM LAST_DAILY_FILE d -- ONLY the last file
>> ) cl
>> WHERE cl.pos = 1-- REQUIRED to remove all the duplication inside
>> the last file
>>AND cl.KEY NOT IN SELECT KEY FROM DEDUPLICATED  -- remove duplication
>> between the last file and all the existing files
>>
>> And the last question : for the last request, does an index on KEY help
>> with hive as it can help on a classical relational database ?
>>
>> Regards,
>> Philippe
>>
>>
>>
>> --
>> Philippe Kernévez
>>
>>
>>
>> Directeur technique (Suisse),
>> pkerne...@octo.com
>> +41 79 888 33 32
>>
>> Retrouvez OCTO sur OCTO Talk : http://blog.octo.com
>> OCTO Technology http://www.octo.com
>>
>>
>
>
> --
> Philippe Kernévez
>
>
>
> Directeur technique (Suisse),
> pkerne...@octo.com
> +41 79 888 33 32
>
> Retrouvez OCTO sur OCTO Talk : http://blog.octo.com
> OCTO Technology http://www.octo.com
>



-- 
Devopam Mittra
Life and Relations are not binary


Re: Remove duplicated rows

2015-02-17 Thread Devopam Mittra
Can you please state the hive version 

regards
Dev
+91 958 305 9899

> On Feb 17, 2015, at 11:01 PM, Philippe Kernévez  wrote:
> 
> Hi,
> 
> I have a table (named DEDUPLICATED) that contains about 1 billion rows.
> Each day I receive a new file with about 5 million rows. About 10% of those 
> rows are duplicated (duplication occurs inside a daily file but also between 
> files).
> There are about 30 fields in the files.
> 
> As for now I deduplicate all the data every day with the following request :
> 
>   INSERT OVERWRITE TABLE DEDUPLICATED
> SELECT cl.* 
> FROM (
> SELECT d.*, ROW_NUMBER() OVER (PARTITION BY d.KEY) AS pos 
> FROM DAILY d  -- DAILY is an external table that contains all the 
> daily files
> ) cl
> WHERE cl.pos = 1
>  
> On the mailing list I saw another approach based on a "group by KEY" request 
> that uses a 'select MAX(xxx)' for all non-key fields.
> 
> My first question is : which of the two seems to be better ?
> (the second one is quite harder to maintain as all the fields should be 
> explicitly written in the request).
> 
> 
> 
> The second question is : what is the best way to do the deduplication and 
> import with an incremental approach ?
> Something like that ?
>   INSERT TABLE DEDUPLICATED
> SELECT cl.* 
> FROM (
> SELECT d.*, ROW_NUMBER() OVER (PARTITION BY d.KEY) AS pos 
> FROM LAST_DAILY_FILE d -- ONLY the last file
> ) cl
> WHERE cl.pos = 1-- REQUIRED to remove all the duplication inside the 
> last file
>AND cl.KEY NOT IN SELECT KEY FROM DEDUPLICATED  -- remove duplication 
> between the last file and all the existing files
> 
> And the last question : for the last request, does an index on KEY help with 
> hive as it can help on a classical relational database ?
> 
> Regards,
> Philippe
> 
> 
> 
> -- 
> Philippe Kernévez
> 
> 
> 
> Directeur technique (Suisse), 
> pkerne...@octo.com
> +41 79 888 33 32
> 
> Retrouvez OCTO sur OCTO Talk : http://blog.octo.com
> OCTO Technology http://www.octo.com


Re: [ANNOUNCE] Apache Hive 1.0.0 Released

2015-02-05 Thread Devopam Mittra
+1

Updated https://en.wikipedia.org/wiki/Apache_Hive with the latest version
info

regards
Devopam

On Thu, Feb 5, 2015 at 11:46 PM, Thejas Nair  wrote:

> Congrats to all the users and contributors in the Apache Hive community!
> It is great that we finally move away from the 0.x versioning scheme
> to the 1.x versioning scheme for new releases. This is a great way of
> honoring the work from hive community that has made hive the defacto
> standard for SQL on Hadoop!
>
> Thanks for the hard work of driving the release Vikram!
>
>
> On Wed, Feb 4, 2015 at 3:07 PM, Vikram Dixit K 
> wrote:
> > The Apache Hive team is proud to announce the release of Apache
> > Hive version 1.0.0.
> >
> > The Apache Hive (TM) data warehouse software facilitates querying and
> > managing large datasets residing in distributed storage. Built on top
> > of Apache Hadoop (TM), it provides:
> >
> > * Tools to enable easy data extract/transform/load (ETL)
> >
> > * A mechanism to impose structure on a variety of data formats
> >
> > * Access to files stored either directly in Apache HDFS (TM) or in other
> >   data storage systems such as Apache HBase (TM)
> >
> > * Query execution via Apache Hadoop MapReduce and Apache Tez frameworks.
> >
> > For Hive release details and downloads, please
> > visit:https://hive.apache.org/downloads.html
> >
> > Hive 1.0.0 Release Notes are available here:
> >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12329278&styleName=Text&projectId=12310843
> >
> >
> > We would like to thank the many contributors who made this release
> > possible.
> >
> > Regards,
> >
> > The Apache Hive Team
>



-- 
Devopam Mittra
Life and Relations are not binary


Re: Re: How to query data by page in Hive?

2015-02-05 Thread Devopam Mittra
Please provide a valid table structure and the columns you wish to pick and
I shall email you the query directly


regards
Devopam

On Thu, Feb 5, 2015 at 3:20 PM, r7raul1...@163.com 
wrote:

> Thank you Devopam! Could you show me a  example?
>
> --
> r7raul1...@163.com
>
>
> *From:* Devopam Mittra 
> *Date:* 2015-02-05 18:05
> *To:* user@hive.apache.org
> *Subject:* Re: How to query data by page in Hive?
> You may want to use a ROW_NUMBER OR RANK / DENSE RANK in the inner query
> and then select only a subset of it in the outer query to control
> pagination. Based on your need, you may want to order the records as well ..
>
> Alternatively you may want to use CTE(
> https://cwiki.apache.org/confluence/display/Hive/Common+Table+Expression)
> for selecting the data in one go and then use row number to select as in
> previous case.
>
> regards
> Devopam
>
> On Thu, Feb 5, 2015 at 1:31 PM, r7raul1...@163.com 
> wrote:
>
>> Hello,
>>  How to query data by page in Hive?
>>
>> hive> select * from u_data a limit 1,2;
>> FAILED: ParseException line 1:31 missing EOF at ',' near '1'
>>
>> --
>> r7raul1...@163.com
>>
>
>
>
> --
> Devopam Mittra
> Life and Relations are not binary
>
>


-- 
Devopam Mittra
Life and Relations are not binary


Re: How to query data by page in Hive?

2015-02-05 Thread Devopam Mittra
You may want to use ROW_NUMBER or RANK / DENSE_RANK in the inner query
and then select only a subset of it in the outer query to control
pagination. Based on your need, you may want to order the records as well.

Alternatively you may want to use a CTE (
https://cwiki.apache.org/confluence/display/Hive/Common+Table+Expression)
to select the data in one go and then use the row number to pick rows as in
the previous case.
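
As a minimal sketch of the first option (assuming a Hive version with
windowing functions, i.e. 0.11+, the u_data table from your example and a
hypothetical ordering column userid - adjust names to your real schema):

-- Page 2 with a page size of 10, i.e. rows 11-20 of the ordered result.
-- The rn column comes along in the output; list explicit columns to drop it.
SELECT t.*
FROM (
  SELECT a.*, ROW_NUMBER() OVER (ORDER BY a.userid) AS rn
  FROM u_data a
) t
WHERE t.rn > 10 AND t.rn <= 20;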

regards
Devopam

On Thu, Feb 5, 2015 at 1:31 PM, r7raul1...@163.com 
wrote:

> Hello,
>  How to query data by page in Hive?
>
> hive> select * from u_data a limit 1,2;
> FAILED: ParseException line 1:31 missing EOF at ',' near '1'
>
> ------
> r7raul1...@163.com
>



-- 
Devopam Mittra
Life and Relations are not binary


Re: Hive - regexp_replace function for multiple strings

2015-02-03 Thread Devopam Mittra
hi Viral,
Unless you are strictly bound to change the text to achieve your
objectives... you may actually wish to explore ngrams and context_ngrams
combinations to uniquely identify the patterns you want to explore and move
them to a new table for further processing.

If it has to be done the replacing way only, it is better to do it at file
level on Unix for faster and cleaner results.
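
If it does have to stay inside Hive, here is a hedged sketch (not something
from this thread) that keeps a single regexp_replace and uses Java-regex
capture groups so the replacement simply drops the space; a group reference
that did not participate in the match expands to the empty string. The table
and column names are the ones from Viral's query:

-- Sketch: collapse the space inside each listed bigram; extend the
-- alternation with the remaining bigrams as needed.
CREATE TABLE vp_hiphop AS
SELECT userid, ntext,
       regexp_replace(ntext, '(hip) (hop)|(rock) (music)', '$1$2$3$4') AS ntext1
FROM vp_nlp_protext_males;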

regards
Devopam


On Wed, Feb 4, 2015 at 3:25 AM, Pradeep Gollakota 
wrote:

> I don't think this is doable using the out of the box regexp_replace()
> UDF. That way I would do it, is using a file to create a mapping between a
> regexp and it's replacement and write a custom UDF that loads this file and
> applies all regular expressions on the input.
>
> Hope this helps.
>
> On Tue, Feb 3, 2015 at 10:46 AM, Viral Parikh 
> wrote:
>
>> Hi Everyone,
>>
>> I am using hive 0.13! I want to find multiple tokens like "hip hop" and
>> "rock music" in my data and replace them with "hiphop" and "rockmusic" -
>> basically replace them without white space. I have used the regexp_replace
>> function in hive. Below is my query and it works great for above 2 examples.
>>
>> drop table vp_hiphop;
>> create table vp_hiphop as
>> select userid, ntext,
>>   regexp_replace(regexp_replace(ntext, 'hip hop', 'hiphop'), 'rock music', 'rockmusic') as ntext1
>> from vp_nlp_protext_males;
>>
>> But I have 100 such bigrams/ngrams and want to be able to do replace
>> efficiently where I just remove the whitespace. I can pattern match the
>> phrase - hip hop and rock music but in the replace I want to simply trim
>> the white spaces. Below is what I tried. I also tried using trim with
>> regexp_replace but it wants the third argument in the regexp_replace
>> function.
>>
>> drop table vp_hiphop;
>> create table vp_hiphop as
>> select userid, ntext,
>>   regexp_replace(ntext, '(hip hop)|(rock music)') as ntext1
>> from vp_nlp_protext_males;
>>
>>
>


-- 
Devopam Mittra
Life and Relations are not binary


Re: Which [open-souce] SQL engine atop Hadoop?

2015-02-03 Thread Devopam Mittra
hi Samuel,
Apologies for the delay in response as well as overlooking Presto mention
in your initial post itself.
#IMHO :
Presto is lightweight, easy to install and configure.
It does not support "UPDATE" .. hmm, I don't need updates in Big Data
analytics where I can have a temp / intermediate table, which will be faster
as well (by the way, I don't know how many others provide true UPDATE
capabilities).
I am happy with Hive itself, and don't need Presto for my ad-hoc analytics
since the overhead of the MR job kick-off is not overwhelming compared
to the total query execution time.
Presto is good for me when I need to run parameterized and fixed queries
from a dashboard directly on my HDP cluster, as it reduces my screen-staring
time.

Hope you find it helpful in your decision making.

regards
Devopam



On Tue, Feb 3, 2015 at 1:57 PM, Samuel Marks  wrote:

> Thanks Devopam,
>
> In my initial post I did mention Presto, with his review:
> " can query Hive, Cassandra <http://cassandra.apache.org/>, relational
> DBs &etc. Doesn't seem to be designed for low-latency responses across
> small clusters, or support UPDATE operations. It is optimized for data
> warehousing or analytics¹
> <http://prestodb.io/docs/current/overview/use-cases.html>"
>
> Your thoughts?
>
> Best,
>
> Samuel Marks
> http://linkedin.com/in/samuelmarks
> On 03/02/2015 6:06 pm, "Devopam Mittra"  wrote:
>
>> hi Samuel,
>> You may wish to evaluate Presto (https://prestodb.io/) , which has an
>> added advantage of being faster than conventional Hive due to no MR jobs
>> being fired.
>> It has a dependency on Hive metastore though , through which it derives
>> the mechanism to execute the queries directly on source files.
>> The only flip side I found was the absence of complex SQL syntax that
>> means creating a lot of intermediate tables for little complicated
>> calculations (and imho , all calculations become complex sooner than we
>> intend them to )
>>
>> regards
>> Devopam
>>
>> On Tue, Feb 3, 2015 at 10:30 AM, Samuel Marks 
>> wrote:
>>
>>> Alexander: So would you recommend using Phoenix for all but those kind
>>> of queries, and switching to Hive+Tez for the rest? - Is that feasible?
>>>
>>> Checking their documentation, it looks like it just might be:
>>> https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration
>>>
>>> There is some early work on a Hive + Phoenix integration on GitHub:
>>> https://github.com/nmaillard/Phoenix-Hive
>>>
>>> Saurabh: I am sure there are a variety of very good non open-source
>>> products on the market :) - However in this thread I am only looking at
>>> open-source options. Additionally I am planning on open-sourcing this
>>> project I am building using these tools, so it makes even more sense that
>>> the entire toolset and their dependencies are also open-source.
>>>
>>> Best,
>>>
>>> Samuel Marks
>>> http://linkedin.com/in/samuelmarks
>>>
>>> On Tue, Feb 3, 2015 at 2:33 PM, Saurabh B 
>>> wrote:
>>>
>>>> This is not open source but we are using Vertica and it works very
>>>> nicely for us. There is a 1TB community edition but above that it costs
>>>> money.
>>>> It has really advanced SQL (analytical functions, etc), works like an
>>>> RDBMS, has R/Java/C++ SDK and scales nicely. There is a similar option of
>>>> Redshift available but Vertica has more features (pattern matching
>>>> functions, etc).
>>>>
>>>> Again, not open source so I would be interested to know what you end up
>>>> going with and what your experience is.
>>>>
>>>> On Mon, Feb 2, 2015 at 12:08 AM, Samuel Marks 
>>>> wrote:
>>>>
>>>>> Well what I am seeking is a Big Data database that can work with Small
>>>>> Data also. I.e.: scaleable from one node to vast clusters; whilst
>>>>> maintaining relatively low latency throughout.
>>>>>
>>>>> Which fit into this category?
>>>>>
>>>>> Samuel Marks
>>>>> http://linkedin.com/in/samuelmarks
>>>>>
>>>>
>>>>
>>>
>>
>>
>> --
>> Devopam Mittra
>> Life and Relations are not binary
>>
>


-- 
Devopam Mittra
Life and Relations are not binary


Re: Which [open-souce] SQL engine atop Hadoop?

2015-02-02 Thread Devopam Mittra
hi Samuel,
You may wish to evaluate Presto (https://prestodb.io/), which has the added
advantage of being faster than conventional Hive due to no MR jobs being
fired.
It has a dependency on the Hive metastore though, through which it derives
the mechanism to execute the queries directly on source files.
The only flip side I found was the absence of complex SQL syntax, which means
creating a lot of intermediate tables for even slightly complicated
calculations (and imho, all calculations become complex sooner than we
intend them to).

regards
Devopam

On Tue, Feb 3, 2015 at 10:30 AM, Samuel Marks  wrote:

> Alexander: So would you recommend using Phoenix for all but those kind of
> queries, and switching to Hive+Tez for the rest? - Is that feasible?
>
> Checking their documentation, it looks like it just might be:
> https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration
>
> There is some early work on a Hive + Phoenix integration on GitHub:
> https://github.com/nmaillard/Phoenix-Hive
>
> Saurabh: I am sure there are a variety of very good non open-source
> products on the market :) - However in this thread I am only looking at
> open-source options. Additionally I am planning on open-sourcing this
> project I am building using these tools, so it makes even more sense that
> the entire toolset and their dependencies are also open-source.
>
> Best,
>
> Samuel Marks
> http://linkedin.com/in/samuelmarks
>
> On Tue, Feb 3, 2015 at 2:33 PM, Saurabh B 
> wrote:
>
>> This is not open source but we are using Vertica and it works very nicely
>> for us. There is a 1TB community edition but above that it costs money.
>> It has really advanced SQL (analytical functions, etc), works like an
>> RDBMS, has R/Java/C++ SDK and scales nicely. There is a similar option of
>> Redshift available but Vertica has more features (pattern matching
>> functions, etc).
>>
>> Again, not open source so I would be interested to know what you end up
>> going with and what your experience is.
>>
>> On Mon, Feb 2, 2015 at 12:08 AM, Samuel Marks 
>> wrote:
>>
>>> Well what I am seeking is a Big Data database that can work with Small
>>> Data also. I.e.: scaleable from one node to vast clusters; whilst
>>> maintaining relatively low latency throughout.
>>>
>>> Which fit into this category?
>>>
>>> Samuel Marks
>>> http://linkedin.com/in/samuelmarks
>>>
>>
>>
>


-- 
Devopam Mittra
Life and Relations are not binary


Re: Unable to retrieve date

2015-01-29 Thread Devopam Mittra
please consider not writing a new UDF and instead using the following
syntax with built-in UDFs itself .

SELECT
from_unixtime(unix_timestamp('07/Dec/2014:23:59:04','dd/MMM/yyyy:HH:mm:ss'),
'dd-MMM-yyyy HH:mm:ss')
FROM twitter_data_raw LIMIT 1;

feel free to play around with the formats to get desired results

regards
Devopam


On Thu, Jan 29, 2015 at 3:12 PM, Nagarjuna Vissarapu <
nagarjuna.v...@gmail.com> wrote:

> Hi All,
>
> I wrote hive UDF to convert
>
> 07/Dec/2014:23:59:04 to  07-Dec-2014 23:59:04
>
> I attached UDF. PFA ...
>
> I added jar to hive and created temporary function
>
>
> *stringTotimestampQuery: select stringTotimestamp(date) from rm_get limit
> 1;*
> *Result: *
> [image: Inline image 1]
>
> *Query: select to_date(stringTotimestamp(date)) from rm_get limit 1;*
> *Result: Null*
>
> What may be the issue?
> --
> With Thanks & Regards
> Nagarjuna Vissarapu
> 9052179339
>
>


-- 
Devopam Mittra
Life and Relations are not binary


Re: [ANNOUNCE] New Hive PMC Members - Szehon Ho, Vikram Dixit, Jason Dere, Owen O'Malley and Prasanth Jayachandran

2015-01-28 Thread Devopam Mittra
+1

Congratulations !!

warm regards
Devopam


On Thu, Jan 29, 2015 at 2:45 AM, Carl Steinbach  wrote:

> I am pleased to announce that Szehon Ho, Vikram Dixit, Jason Dere, Owen
> O'Malley and Prasanth Jayachandran have been elected to the Hive Project
> Management Committee. Please join me in congratulating these new PMC
> members!
>
> Thanks.
>
> - Carl
>



-- 
Devopam Mittra
Life and Relations are not binary


Re: Warning when running hive

2015-01-27 Thread Devopam Mittra
+1

I also need help on this one in particular

regards
Devopam

On Tue, Jan 27, 2015 at 7:44 PM, Philippe Kernévez 
wrote:

> Hi,
>
> I had several warning like " WARN conf.HiveConf: HiveConf of name
> hive.optimize.mapjoin.mapreduce does not exist" when I run hive.
>
> Extract from /etc/hive/conf/hive-site.xml :
> "
>   hive.optimize.mapjoin.mapreduce
>   true
> ""
>
> I don't find those properties in the hive configuration, but Ambari doesn't
> allow me to remove those parameters and has no documentation for them. Should
> I remove them ?
>
> The full list is :
> WARN conf.HiveConf: HiveConf of name hive.optimize.mapjoin.mapreduce does
> not exist
> WARN conf.HiveConf: HiveConf of name hive.heapsize does not exist
> WARN conf.HiveConf: HiveConf of name hive.server2.enable.impersonation
> does not exist
> WARN conf.HiveConf: HiveConf of name hive.semantic.analyzer.factory.impl
> does not exist
> WARN conf.HiveConf: HiveConf of name
> hive.auto.convert.sortmerge.join.noconditionaltask does not exist
>
> Regards,
> Philippe Kernévez
>
>


-- 
Devopam Mittra
Life and Relations are not binary


Re: convert empty strings to null

2015-01-26 Thread Devopam Mittra
You may simply try the below if it helps...

CASE LENGTH(colname) WHEN 0 THEN NULL ELSE colname END

if needed, use substring to derive the 'colname' from your json tuple
and apply / extend the same logic
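
For instance, a minimal sketch of that CASE applied right after json_tuple -
the staging table json_stage, its raw_json column, the JSON keys and the
target table are all hypothetical names, not from your setup:

-- Sketch: map empty JSON string values to NULL while loading the final table.
INSERT INTO TABLE final_avro_table
SELECT jt.user_id,
       CASE LENGTH(jt.city) WHEN 0 THEN NULL ELSE jt.city END AS city
FROM json_stage s
LATERAL VIEW json_tuple(s.raw_json, 'user_id', 'city') jt AS user_id, city;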

regards
Dev



On Mon, Jan 26, 2015 at 11:35 PM, Binal Jhaveri  wrote:

> My input file is a json file. I am loading that into a staging table. Then
> i am parsing the json using hive (Lateral view and json_tuple) and writing
> to a final avro table. My issue is some json values have empty strings. I
> want those empty strings to display as null..
> what is the best way to achieve this??
>



-- 
Devopam Mittra
Life and Relations are not binary


Hive Logo details

2015-01-22 Thread Devopam Mittra
hi,
We don't have the Hive logo available on the Wikipedia page.
Can someone please point me to the logo owner or the creative license
details + location so that I can add it there for proper display.

ref: https://en.wikipedia.org/wiki/Apache_Hive

-- 
Devopam Mittra
Life and Relations are not binary


Re: Hive UPDATE : query around usage

2015-01-18 Thread Devopam Mittra
hi Lefty,
Thanks again for the crisp response.
After going through it all over again, and brooding over the weekend, I
understood that what I am asking for is not there yet and I was trying to do
things in a (yet) impractical way.

I am sure it will be needed in future, if we are to use and promote
Hive as The 'SQL' engine for Hadoop in the longer run.

Till then I will rather stick to my old technique of handling UPDATE as
DELETE+INSERT and leveraging dynamic partitions to achieve the objective.


warm regards
Devopam


On Sat, Jan 17, 2015 at 1:48 PM, Lefty Leverenz 
wrote:

> I can't answer your question, but here are some links to documentation in
> the Hive wiki:
>
>- DML -- Update
>
> <https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Update>
>- Hive Transactions
><https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions>
>
>
> -- Lefty Leverenz
>
>
> On Fri, Jan 16, 2015 at 6:09 AM, Devopam Mittra  wrote:
>
>> hi All,
>> I wanted to UPDATE multiple rows in a table by joining it with another
>> table for identification of matching records (otherwise in the RDBMS world
>> also commonly known as SET based UPDATE).
>>
>> The idea is to update multiple records in one go , to downsize the lead
>> time for each MR job to kick off.
>>
>> Is it supported or I am being greedy here with my wishlist ?
>>
>> Also, at an atomic level update as well, is there a way I can update a
>> table without providing explicit variables (aka expression)  in the WHERE
>> clause.
>> e.g.: I want to update one row in a table and in the where clause i wish
>> to use a SELECT with other table to identify / match the record.
>>
>> Kindly drop a pointer to the syntax manual to look it up.
>>
>> Happy weekend !
>> --
>> Devopam Mittra
>> Life and Relations are not binary
>>
>
>


-- 
Devopam Mittra
Life and Relations are not binary


Hive UPDATE : query around usage

2015-01-16 Thread Devopam Mittra
hi All,
I wanted to UPDATE multiple rows in a table by joining it with another
table for identification of matching records (otherwise in the RDBMS world
also commonly known as SET based UPDATE).

The idea is to update multiple records in one go, to cut down the lead
time for each MR job to kick off.

Is it supported, or am I being greedy here with my wishlist ?

Also, for an atomic (single-row) update as well, is there a way I can update
a table without providing explicit literal values (aka expressions) in the
WHERE clause?
e.g.: I want to update one row in a table, and in the where clause I wish to
use a SELECT against another table to identify / match the record.

Kindly drop a pointer to the syntax manual to look it up.

Happy weekend !
-- 
Devopam Mittra
Life and Relations are not binary


Re: date_add statement problem

2015-01-16 Thread Devopam Mittra
hi Yashwanth,
Can you please check once if DATE_ADD('1985-01-01', day_key) works for you ?
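
If the SMALLINT column is still rejected, a hedged combination of both fixes
(dash-separated date literal plus an explicit cast) may be worth a try; the
table name below is just a placeholder:

SELECT DATE_ADD('1985-01-01', CAST(day_key AS INT)) AS actual_date
FROM your_table;  -- placeholder table name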

regards
Devopam


On Fri, Jan 16, 2015 at 1:12 PM, Yashwanth M 
wrote:

>  Hi,
>
>
>
> I’ve been facing some issues in using date_add()
>
>
>
> I have column ‘*day_key*’  that is short integer and stores day keys with
> respect to 1985/01/01.
>
>
>
> I need to extract actual date from this column.
>
>
>
> *Date_add(‘1985/01/01’,day_key)*
>
> However I get an error saying the second argument needs integer.
>
>
>
> I also did
>
> *Date_add(‘1985/01/01’,cast(day_key as int))*
>
> Resulted in an error saying the column is short.
>
>
>
> Please help me with this.
>
>
>
> Thanks
>
> Regards,
>
> Yashwanth M *|* +91 8884 177 884 *|*www.mu-sigma.com *|*
>
>
>  Disclaimer: http://www.mu-sigma.com/disclaimer.html
>



-- 
Devopam Mittra
Life and Relations are not binary


Re: struggling with Hive ACID - need help

2015-01-15 Thread Devopam Mittra
Any ideas as to where I am going wrong here? By the way, I observed that by
changing the hive txn manager, some other things are breaking in my existing
setup, so I have reverted to the old configuration for now to keep the
cluster in a workable mode.


regards
Devopam


On Wed, Jan 14, 2015 at 4:25 PM, Devopam Mittra  wrote:

> Env: Hive 0.14 , HDP2.2, Centos 6.6
>
> I created a simple table (ref syntax below)
> DROP TABLE IF EXISTS  twitter_user_info;
>
> CREATE TABLE twitter_user_info
> ( userid BIGINT COMMENT 'User Id',
>   username STRING COMMENT 'User Name',
>   is_latest BOOLEAN COMMENT 'Currently active',
>   last_modified_ts TIMESTAMP COMMENT 'Last modified timestamp'
> )
> COMMENT 'Twitter User Info Table'
> CLUSTERED BY (userid,is_latest) SORTED BY (userid) INTO 30 BUCKETS
> STORED AS ORC
> TBLPROPERTIES('transactional'='true');
>
>
> set the parameters in hive conf via ambari:
>
> set hive.support.concurrency=true;
> set hive.enforce.bucketing=true;
> set hive.exec.dynamic.partition.mode=nonstrict;
> set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
> set hive.compactor.initiator.on=true;
> set hive.compactor.worker.threads=100;
>
> I inserted few rows via the following query (not-relevant , yet pasting):
> INSERT INTO TABLE twitter_user_info
> SELECT created_by_id, created_by_name,'Y',MAX(created_at)
> FROM twitter_data B
> WHERE b.created_by_id IS NOT NULL
> AND b.created_by_id NOT IN (
>  SELECT created_by_id FROM
> (SELECT created_by_id,COUNT(DISTINCT
> created_by_name)
>  FROM twitter_data
>  WHERE created_by_id IS NOT NULL
>  GROUP BY created_by_id
>  HAVING COUNT(DISTINCT created_by_name) >1) A
> )
> GROUP BY b.created_by_id,b.created_by_name;
>
> For each operation below I am hitting the same error, and am clueless at
> this point .
>
> hive> UPDATE twitter_user_info
> > SET is_latest = false
> > WHERE userid = 298779403
> > AND last_modified_ts = '2015-01-13 14:25:14';
> *FAILED: NoMatchingMethodException No matching method for class
> org.apache.hadoop.hive.ql.udf.UDFToLong with
> (struct). Possible choices:
> _FUNC_(bigint)  _FUNC_(boolean)  _FUNC_(decimal(38,18))  _FUNC_(double)
> _FUNC_(float)  _FUNC_(int)  _FUNC_(smallint)  _FUNC_(string)
> _FUNC_(timestamp)  _FUNC_(tinyint)  _FUNC_(void)  *
> hive> DELETE FROM twitter_user_info WHERE userid=100;
>
> *FAILED: NoMatchingMethodException No matching method for class
> org.apache.hadoop.hive.ql.udf.UDFToLong with
> (struct). Possible choices:
> _FUNC_(bigint)  _FUNC_(boolean)  _FUNC_(decimal(38,18))  _FUNC_(double)
> _FUNC_(float)  _FUNC_(int)  _FUNC_(smallint)  _FUNC_(string)
> _FUNC_(timestamp)  _FUNC_(tinyint)  _FUNC_(void)  *
>
> Any pointers are welcome
>
> --
> Devopam Mittra
> Life and Relations are not binary
>



-- 
Devopam Mittra
Life and Relations are not binary


struggling with Hive ACID - need help

2015-01-14 Thread Devopam Mittra
Env: Hive 0.14 , HDP2.2, Centos 6.6

I created a simple table (ref syntax below)
DROP TABLE IF EXISTS  twitter_user_info;

CREATE TABLE twitter_user_info
( userid BIGINT COMMENT 'User Id',
  username STRING COMMENT 'User Name',
  is_latest BOOLEAN COMMENT 'Currently active',
  last_modified_ts TIMESTAMP COMMENT 'Last modified timestamp'
)
COMMENT 'Twitter User Info Table'
CLUSTERED BY (userid,is_latest) SORTED BY (userid) INTO 30 BUCKETS
STORED AS ORC
TBLPROPERTIES('transactional'='true');


set the parameters in hive conf via ambari:

set hive.support.concurrency=true;
set hive.enforce.bucketing=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
set hive.compactor.initiator.on=true;
set hive.compactor.worker.threads=100;

I inserted a few rows via the following query (not relevant, yet pasting it):
INSERT INTO TABLE twitter_user_info
SELECT created_by_id, created_by_name,'Y',MAX(created_at)
FROM twitter_data B
WHERE b.created_by_id IS NOT NULL
AND b.created_by_id NOT IN (
 SELECT created_by_id FROM
(SELECT created_by_id,COUNT(DISTINCT
created_by_name)
 FROM twitter_data
 WHERE created_by_id IS NOT NULL
 GROUP BY created_by_id
 HAVING COUNT(DISTINCT created_by_name) >1) A
)
GROUP BY b.created_by_id,b.created_by_name;

For each operation below I am hitting the same error, and am clueless at
this point .

hive> UPDATE twitter_user_info
> SET is_latest = false
> WHERE userid = 298779403
> AND last_modified_ts = '2015-01-13 14:25:14';
*FAILED: NoMatchingMethodException No matching method for class
org.apache.hadoop.hive.ql.udf.UDFToLong with
(struct). Possible choices:
_FUNC_(bigint)  _FUNC_(boolean)  _FUNC_(decimal(38,18))  _FUNC_(double)
_FUNC_(float)  _FUNC_(int)  _FUNC_(smallint)  _FUNC_(string)
_FUNC_(timestamp)  _FUNC_(tinyint)  _FUNC_(void)  *
hive> DELETE FROM twitter_user_info WHERE userid=100;

*FAILED: NoMatchingMethodException No matching method for class
org.apache.hadoop.hive.ql.udf.UDFToLong with
(struct). Possible choices:
_FUNC_(bigint)  _FUNC_(boolean)  _FUNC_(decimal(38,18))  _FUNC_(double)
_FUNC_(float)  _FUNC_(int)  _FUNC_(smallint)  _FUNC_(string)
_FUNC_(timestamp)  _FUNC_(tinyint)  _FUNC_(void)  *

Any pointers are welcome

-- 
Devopam Mittra
Life and Relations are not binary


Re: compare float column

2015-01-13 Thread Devopam Mittra
please try the following and report observation:

WHERE long = CAST(-41.338276 AS FLOAT)


regards
Devopam


On Tue, Jan 13, 2015 at 2:25 PM, Patcharee Thongtra <
patcharee.thong...@uni.no> wrote:

> Hi,
>
> I have a table with float columns. I tried to query based on the condition
> on a float column (called 'long'), but it failed (nothing returned).
>
> hive> select * from test_float where long == -41.338276;
> select * from test_float where long == -41.338276
> Status: Finished successfully
> OK
> Time taken: 14.262 seconds
>
> hive> select long from test_float;
> select long from test_float
> Status: Finished successfully
> OK
> -41.338276
> Time taken: 6.843 seconds, Fetched: 1 row(s)
>
>
> Any ideas? I am using hive version 0.13.
>
> BR,
> Patcharee
>
>
>
>
>


-- 
Devopam Mittra
Life and Relations are not binary


Re: [ANNOUNCE] New Hive PMC Member - Prasad Mujumdar

2014-12-09 Thread Devopam Mittra
+1 :)

regards
Devopam

On Wed, Dec 10, 2014 at 3:47 AM, Carl Steinbach  wrote:

> I am pleased to announce that Prasad Mujumdar has been elected to the Hive
> Project Management Committee. Please join me in congratulating Prasad!
>
> Thanks.
>
> - Carl
>



-- 
Devopam Mittra
Life and Relations are not binary


Re: How to do single user multiple access in hive

2014-11-08 Thread Devopam Mittra
hi Amit,

Please go for the solution / design that suits your need and not exactly
just because someone(like me) suggested.

1. I presume you are using Hue for the ease and not really for any
particular technical reason.
2. Try installing the Hive client on one of the machines that you have in your
cluster and see if you can safely switch over to it without major
disruptions (yes, there is a bit of learning needed, since Hue has
suggestive UI features to guide you and the client is purely terminal based). If
this works, then try installing the hive cli / client on all the machines
from which you would want to fire queries. Some reference :
https://cwiki.apache.org/confluence/display/Hive/HiveClient#HiveClient-CommandLine
This should be trivial if you have an Ambari-based installation. Even otherwise,
there is not much configuration required to install and use it.
3. Here is the answer to your query :
http://cloudera.github.io/hue/docs-3.6.0/user-guide/oozie.html , should you
want to implement oozie.
4. Despite my suggestion, please don't go to oozie or any other technology
unless you have a well defined need for it. Else the burden of additional
learning and impact on existing work is overwhelming at times and derails
the motive.

regards
Devopam





On Sat, Nov 8, 2014 at 12:34 AM, Amit Behera  wrote:

> hi Devopam,
>
> Thank you for replying.
>
> I am using Hue on the top of Hive. So can you please help me, how oozie
> will help me and how can I integrate oozie with this.
>
> Thanks
> Amit
>
>
>
> On Fri, Nov 7, 2014 at 7:58 PM, Devopam Mittra  wrote:
>
>> hi Amit,
>> Please try to see if Hive CLI (client) installed on the 'different'
>> machines helps you achieve your goal at the minimalist end.
>> If you use any other program like oozie (to submit your queries) etc.
>> then you can fire queries through the respective interfaces safely enough.
>>
>> regards
>> Devopam
>>
>>
>> On Fri, Nov 7, 2014 at 7:29 PM, Amit Behera  wrote:
>>
>>> Hi users,
>>>
>>> I have hive set up at multi node hadoop cluster.
>>> I want to run multiple queries on top of a table from different machines.
>>>
>>> So please help how to achieve multiple access on hive to run multiple
>>> queries simultaneously.
>>>
>>> Thanks
>>> Amit
>>>
>>
>>
>>
>> --
>> Devopam Mittra
>> Life and Relations are not binary
>>
>
>


-- 
Devopam Mittra
Life and Relations are not binary


Re: Setting job diagnostics to REDUCE capability required - error in hive

2014-11-07 Thread Devopam Mittra
Which scheduler are you using? FairScheduler might throw such issues...
And environment details please? Setting config parameters optimally in the
YARN/MR configs might help you, but please do so wisely as it may imbalance
other things if not implemented thoughtfully.
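
For what it is worth, the log says the reducers asked for 21238 MB while the
cluster's maximum container size is 8192 MB, so one hedged option (a sketch
to be verified against your environment, not a guaranteed fix) is to lower
the per-reducer memory request for the session - hive.exec.reducers.max only
caps the number of reducers, not their memory:

-- Sketch: keep the reducer container request within the 8192 MB cluster limit.
SET mapreduce.reduce.memory.mb=8192;
-- keep the reducer JVM heap below the container size (value is illustrative)
SET mapreduce.reduce.java.opts=-Xmx6553m;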

regards
Devopam

On Fri, Nov 7, 2014 at 7:56 PM, Ja Sam  wrote:

> I have a simple query with grouping. Something similar to bellow:
>   SELECT col1, col2, col3, min(date), count(*)
> FROM tblX
>   WHERE partitionDate="20141107"
>   GROUP BY col1, col2, col3;
>
> When I run this query through WebHCat everything works fine. But when I
> try to run it from hive shell I have error like this:
>
> Setting job diagnostics to REDUCE capability required is more than the
> supported max container capability in the cluster. Killing the Job.
> reduceResourceReqt: 21238 maxContainerCapability:8192
>
> I tried to setup in hive: SET hive.exec.reducers.max=8192
>
> but this doesn't change anything. What I did wrong?
>
>
>


-- 
Devopam Mittra
Life and Relations are not binary


Re: How to do single user multiple access in hive

2014-11-07 Thread Devopam Mittra
hi Amit,
Please try to see if Hive CLI (client) installed on the 'different'
machines helps you achieve your goal at the minimalist end.
If you use any other program like oozie (to submit your queries) etc. then
you can fire queries through the respective interfaces safely enough.

regards
Devopam


On Fri, Nov 7, 2014 at 7:29 PM, Amit Behera  wrote:

> Hi users,
>
> I have hive set up at multi node hadoop cluster.
> I want to run multiple queries on top of a table from different machines.
>
> So please help how to achieve multiple access on hive to run multiple
> queries simultaneously.
>
> Thanks
> Amit
>



-- 
Devopam Mittra
Life and Relations are not binary


Re: Parsing Hive Query to get table names and column names

2014-11-05 Thread Devopam Mittra
hi Ritesh,
Please reconsider your entire design; it might be helpful to do it now
rather than letting it become unmanageable later.

If unavoidable, please use a metadata-based approach for pre-calculating
and keeping the list of tables that you need to refresh prior to firing a
query on them (?)

Hope it helps.

regards
Devopam


On Wed, Nov 5, 2014 at 6:20 PM, Ritesh Gautam  wrote:

> hey Alok,
>  I want to do this so that I can refresh the dependent tables before I run
> my query, so that my query would now run on the current data.
>
> The queries are written manually, so that the only way to do this will be
> to parse the query.
> Isn't Hive somewhat different from SQL? I have already tried using
> JsqlParser but for some cases it doesn't works.
>
> Thanks,
> Regards,
> Ritesh
>
> On Wed, Nov 5, 2014 at 6:05 PM, Alok Kumar  wrote:
>
>> Hi,
>>
>> Why at the first place you would want this? ( just curious )
>>
>> Few thoughts -
>> a) Try to get it from the piece of code where these query are being
>> generated [ if not static in code!], that would be best place to get it.
>> b) [ if you don't have access to a) ] - try http://zql.sourceforge.net/
>> ,  it should be easier. Also check the licence.
>>
>> Thanks
>> Alok
>>
>> On Wed, Nov 5, 2014 at 5:47 PM, Ritesh Gautam 
>> wrote:
>>
>>> Hello,
>>> I am trying to parse hive queries so that I can get the table
>>> names on which the query is dependent on.
>>>
>>> I have tried the following :
>>> 1) downloaded the grammer and used ANTLR to generate the lexer and
>>> parser, but there are some errors as such when I try to build it:
>>> ..
>>>   symbol:   class RecognitionException
>>>   location: class HiveLexer
>>> HiveLexer.java:2432: error: cannot find symbol
>>> public final void mKW_ESCAPED() throws RecognitionException {
>>>^
>>>   symbol:   class RecognitionException
>>>   location: class HiveLexer
>>> HiveLexer.java:2453: error: cannot find symbol
>>> public final void mKW_COLLECTION() throws RecognitionException {
>>>   ^
>>>   symbol:   class RecognitionException
>>>   location: class HiveLexer
>>> 100 errors
>>>
>>> 2) I have tried using org.apache.hadoop.hive.ql.parse but I am stuck at
>>> this point:
>>>
>>> ANTLRStringStream input = new ANTLRStringStream("SELECT x FROM
>>> abc");
>>> HiveLexer lexer = new HiveLexer(input);
>>>     TokenStream tokens = new CommonTokenStream(lexer);
>>> HiveParser parser = new HiveParser(tokens);
>>> System.out.println(parser.statement());
>>>
>>> *How should I proceed from here to extract the table names and column
>>> names?*
>>> *And, Is the way I am doing it correct?*
>>>
>>> Thank You.
>>> Regards,
>>> Ritesh
>>>
>>>
>>
>>
>> --
>> Alok Kumar
>> http://sharepointorange.blogspot.in/
>>
>
>


-- 
Devopam Mittra
Life and Relations are not binary


Re: Unix script for identifying current active namenode in a HA cluster

2014-11-05 Thread Devopam Mittra
hi Nitin,
Thanks for the vital input around Hadoop Home addition. At times such
things totally go off the radar when you have customized your own
environment.

As suggested I have shared this on github :
https://github.com/devopam/hadoopHA
apologies if there is any problem on github as I have limited familiarity
with it :(


regards
Devopam



On Wed, Nov 5, 2014 at 12:31 PM, Nitin Pawar 
wrote:

> +1
> If you can optionally add hadoop home directory in the script and use that
> in path, it can be used out of the box.
>
> Also can you share this on github
>
> On Wed, Nov 5, 2014 at 10:02 AM, Devopam Mittra  wrote:
>
>> hi All,
>> Please find attached a simple shell script to dynamically determine the
>> active namenode in the HA Cluster and subsequently run the Hive job / query
>> via Talend OS generated workflows.
>>
>> It was tried successfully on a HDP2.1 cluster with 2 nn, 7 dn running on
>> CentOS 6.5.
>> Each ETL job invokes this script first in our framework to derive the NN
>> FQDN and then run the hive jobs subsequently to avoid failures.
>> Takes a max. of 2 secs to execute (small cost in our case, as compared to
>> dealing with a failure and then recalculating the NN to resubmit the job).
>>
>> Sharing it with you in case you can leverage the same without spending
>> effort to code it.
>>
>> Do share your feedback/ fixes if you spot any.
>>
>> --
>> Devopam Mittra
>> Life and Relations are not binary
>>
>
>
>
> --
> Nitin Pawar
>



-- 
Devopam Mittra
Life and Relations are not binary


findActiveNameNode.sh
Description: Bourne shell script


Unix script for identifying current active namenode in a HA cluster

2014-11-04 Thread Devopam Mittra
hi All,
Please find attached a simple shell script to dynamically determine the
active namenode in the HA Cluster and subsequently run the Hive job / query
via Talend OS generated workflows.

It was tried successfully on a HDP2.1 cluster with 2 nn, 7 dn running on
CentOS 6.5.
Each ETL job invokes this script first in our framework to derive the NN
FQDN and then run the hive jobs subsequently to avoid failures.
Takes a max. of 2 secs to execute (small cost in our case, as compared to
dealing with a failure and then recalculating the NN to resubmit the job).

Sharing it with you in case you can leverage the same without spending
effort to code it.

Do share your feedback/ fixes if you spot any.

-- 
Devopam Mittra
Life and Relations are not binary


findActiveNameNode.sh
Description: Bourne shell script


Re: [ANNOUNCE] New Hive PMC Member - Alan Gates

2014-10-27 Thread Devopam Mittra
Congratulations Alan !


regards
Dev

On Tue, Oct 28, 2014 at 4:08 AM, Carl Steinbach  wrote:

> I am pleased to announce that Alan Gates has been elected to the Hive
> Project Management Committee. Please join me in congratulating Alan!
>
> Thanks.
>
> - Carl
>



-- 
Devopam Mittra
Life and Relations are not binary


Re: Need to place an edit request for https://wiki.apache.org/hadoop/FrontPage

2014-10-24 Thread Devopam Mittra
hi Lefty,
Thanks a bunch for the clear instructions.

I have subscribed to the group and sent them an email request also.

warm regards
Dev

On Fri, Oct 24, 2014 at 11:50 AM, Lefty Leverenz 
wrote:

> Have you already created a Hadoop wiki login account
> <https://wiki.apache.org/hadoop/FrontPage?action=login>?
>
> I couldn't find any instructions, but a general search turned up this
> email to hadoop-common-dev
> <http://mail-archives.apache.org/mod_mbox/hadoop-common-dev/201401.mbox/%3CCAFiYk=p5Bfp4xC_8b4TGDAB+01Ec_R-HMXV+auhUG=jbpmj...@mail.gmail.com%3E>
> which suggests that's where you request edit permission (
> common-...@hadoop.apache.org
> <http://hadoop.apache.org/mailing_lists.html#Common>).
>
> -- Lefty
>
> On Fri, Oct 24, 2014 at 12:33 AM, Devopam Mittra 
> wrote:
>
>> hi ,
>> I wanted to make some edits to the Hadoop main page.
>> https://wiki.apache.org/hadoop/FrontPage
>>
>> Not able to trace the list of administrators / place a request for
>> granting edit rights.
>>
>> Any help/pointers will be big help !
>>
>> --
>> Devopam Mittra
>> Life and Relations are not binary
>>
>
>


-- 
Devopam Mittra
Life and Relations are not binary


Need to place an edit request for https://wiki.apache.org/hadoop/FrontPage

2014-10-23 Thread Devopam Mittra
hi ,
I wanted to make some edits to the Hadoop main page.
https://wiki.apache.org/hadoop/FrontPage

Not able to trace the list of administrators / place a request for granting
edit rights.

Any help/pointers will be big help !

-- 
Devopam Mittra
Life and Relations are not binary


Re: will I get conflict if I run 2 "INSERT INTO TABLE" in parallel ?

2014-10-23 Thread Devopam Mittra
Can you confirm that your table:
1. is partitioned (static or dynamic)
2. loading is done at a partition level or just based on the selection
criteria for each process i.e. it may run into multiple partitions based on
the input set of data in that process.

Also, rhetoric but please state hive version ..
regards
Dev

On Fri, Oct 24, 2014 at 1:10 AM, Yang  wrote:

> I have to run some processes in parallel, for different input datasets. in
> each of the processes, one step is insertion into a hive table, shared by
> all these processes.
>
> would I get conflicts if the insertions are run in parallel ? it's fine if
> I get a blocking, but I need to guarantee correctness of data.
>
> if there is conflict,  would "INSERT OVERWRITE PARTITION" get conflicts ?
> the different processes indeed process different partitions
>
> thanks
> Yang
>



-- 
Devopam Mittra
Life and Relations are not binary


Re: PLEASE HELP :: HOW TO DO INNER JOIN IN HIVE

2014-10-15 Thread Devopam Mittra
hi Yogesh,
Please try to leverage a common table expression (WITH ...) to achieve your
desired outcome
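
A minimal sketch with the table and column names from your mail (the WITH
clause needs Hive 0.13+; note that the plain JOIN ... ON form in Hive already
behaves as an inner join, so the CTE is mainly for readability):

-- Sketch: CTEs feeding an inner join; base names are from the original question.
WITH items AS (
  SELECT id, iteam FROM iteam_table
),
descs AS (
  SELECT id, Decsription FROM iteam_desc
)
SELECT a.iteam, b.Decsription
FROM items a
JOIN descs b ON (a.id = b.id);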

regards
Dev

> On Oct 15, 2014, at 1:18 PM, yogesh dhari  wrote:
> 
> Hello all,
> 
> I have a use case where I need to do inner join..
> 
> Like
> 
> select A.iteam , B.Decsription, 
> from iteam_table A INNER JOIN iteam_desc B
> on A.id = B.id
> 
> 
> As hive does not support Inner Join,
> 
> Please suggest how to do it
> 
> 
> Thanks in Advance
> 


Re: Insert into hive table without overwriting data

2014-10-08 Thread Devopam Mittra
The query should not really overwrite the data. Can you please check your
hive version and also validate that your table(s) don't have a dynamic
partitioning scheme... or that you are not firing a TRUNCATE before
inserting the data.

I assume you have done the data validation already (i.e. ensuring that data
is actually getting overwritten, and not just being duplicated).
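
For reference, a minimal illustration of the difference, using the table
names from your mail ('columns' stands in for your real column list):

-- Appends the selected rows; existing data stays.
INSERT INTO TABLE hive_table
SELECT columns FROM another_hive_table;

-- Replaces the existing data (or the targeted partitions) entirely.
INSERT OVERWRITE TABLE hive_table
SELECT columns FROM another_hive_table;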

regards
Dev

On Wed, Oct 8, 2014 at 8:21 PM, Thiago Henrique dos Santos Bento <
thiago.be...@b2wdigital.com> wrote:

>  Hi,
>
>
>
> When I insert data into a hive table like:
>
> Insert into hive_table
>
> Select columns
>
> From another_hive_table
>
>
>
> This query overwrite existing data. Any idea how to insert data without
> overwriting data?
>
>
>
> *Thiago Henrique S. Bento*
>
> B2W Digital / B2W Labs | BIT SP
>
> thiago.be...@b2wdigital.com
>
> Tel.: +55 (11) 3454-3024
>
>
> .
>
> *B**2**W DIGITAL*  #Americanas.com <http://americanas.com/>  #Shoptime.com
> <http://shoptime.com/>  #Submarino.com <http://submarino.com/>  #
> SouBarato.com <http://soubarato.com/>
>
>
>
>
>  --
>
>
> This message may include confidential information and only the intended
> addresses have the right to use it as is, or any part of it. A wrong
> transmission does not break its confidentiality. If you've received it
> because of a mistake or erroneous transmission, please notify the sender
> and delete it from your system immediately. This communication environment
> is controlled and monitored.
>
> B2W Digital
>
>
>



-- 
Devopam Mittra
Life and Relations are not binary


Re: audit hive query

2014-09-22 Thread Devopam Mittra
hi Siva,
If I were to address your problem, I would create a table in the hive
metastore DB and load the Hive query text + timing into it by writing my own
custom shell script.

And I would need to do this on all hive client machines (not just one),
maybe by suitably adding $HOSTNAME as another column.

If I have Java programs calling Hive SQL queries, I may need to log those
separately as well.
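
As a hedged sketch, such a custom audit table in the MySQL metastore database
could look something like the below - it is not part of the standard
metastore schema, and every name is illustrative:

-- Illustrative only: a custom audit table living alongside the metastore tables.
CREATE TABLE hive_query_audit (
  audit_id    BIGINT AUTO_INCREMENT PRIMARY KEY,
  client_host VARCHAR(255),  -- filled from $HOSTNAME by the loader script
  query_text  TEXT,          -- the statement pulled from .hivehistory
  executed_at TIMESTAMP
);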

Hope it helps.

regards
Dev


On Mon, Sep 22, 2014 at 6:15 PM, siva kumar  wrote:

> Hi Dev,
> When i run the hive query in hive shell, the query is stored
> in hive history file at default location.Since my metastore is sql for hive
> whenever i create a table with some columns in hive shell,it gets populated
> in sql metastore database.Simillarly,all the queries should be popoulated
> in a sql table. I want to view the hive queries over a period of
> time populated in a sql table, but not through the web browser. Can
> u suggest me some idea on dis scenario.
>
>
> Thanks and regards,
> siva
> On Mon, Sep 22, 2014 at 9:18 AM, Devopam Mittra  wrote:
>
>> Please check .hivehistory settings and it might be of easy / simple good
>> help .
>> If you are on Ambari based installation then simply looking into Resource
>> Manager UI should suffice the need.
>> If you are on native hadoop, then you can access the Yarn log through web
>> browser as well.
>>
>> Hope this helps
>>
>> regards
>> Dev
>>
>>
>> On Fri, Sep 19, 2014 at 4:01 PM, siva kumar  wrote:
>>
>>> Hi All,
>>>   I have a requirement to view the recently executed Hive
>>> queries in sql. I have configured sql as metastore to hive and im able to
>>> view hive tables,columns,partitions in sql . Simillarly,i need to view
>>> the hive queries executed on hive data.
>>>
>>>
>>>
>>> Any help is appreciated,
>>> sivakumar.c
>>>
>>
>>
>>
>> --
>> Devopam Mittra
>> Life and Relations are not binary
>>
>
>


-- 
Devopam Mittra
Life and Relations are not binary


Re: audit hive query

2014-09-21 Thread Devopam Mittra
Please check the .hivehistory file; it might be of easy / simple help.
If you are on an Ambari-based installation then simply looking into the
Resource Manager UI should suffice.
If you are on native hadoop, then you can access the YARN log through a web
browser as well.

Hope this helps

regards
Dev


On Fri, Sep 19, 2014 at 4:01 PM, siva kumar  wrote:

> Hi All,
>   I have a requirement to view the recently executed Hive queries
> in sql. I have configured sql as metastore to hive and im able to view hive
> tables,columns,partitions in sql . Simillarly,i need to view the hive
> queries executed on hive data.
>
>
>
> Any help is appreciated,
> sivakumar.c
>



-- 
Devopam Mittra
Life and Relations are not binary


Re: Hive or phoenix

2014-09-18 Thread Devopam Mittra
Ever tried Presto as a simpler, yet faster alternative option ?
It might help if used in conjunction with hive (I repeat, it is not a
replacement, but a complement).


regards
Dev

On Thu, Sep 18, 2014 at 3:18 PM, Prakash Hosalli <
prakash.hosa...@syncoms.com> wrote:

>
>
> Hi,
>
>
>
>
>
> I am in dilemma to  go ahead with Hive or phoenix.
>
>
>
> My requirement is achieving  *“Low Latency”* for the
> query.
>
>
>
> I went through both one with suitable records of data and
> I got following output.
>
>
>
> 1.   Hive ran all aggregation query with help of MR and took around 5
> Mins (approximately) with 1 Lakh of data.(With join, where clause, AND).
>
> 2.   Same query with Phoenix it took less than 30 seconds.
>
>
>
> But, Phoenix is not support its own security layer, which is concern to
> me. I need some clarification on following points
>
>
>
> 1.   Is it possible to run MR as a pre-complied one rather than on
> fly. If yes, then what are the step to take to fulfil this one.
>
>
>
>
>
>
>
>
>
> Thanks in advance,
>
>
>
>
>
> Regards,
>
> Prakash Hosalli
>
>
>



-- 
Devopam Mittra
Life and Relations are not binary


Re: How can I know one table is a partitioned table in hive?

2014-07-31 Thread Devopam Mittra
If you have MySQL as your metastore, you may use something similar to the
below:

SELECT tbl.TBL_NAME,COUNT(DISTINCT part.PART_NAME) AS partition_count
FROM metastore_db.TBLS tbl, metastore_db.PARTITIONS part
WHERE tbl.TBL_ID = part.TBL_ID
AND tbl.TBL_TYPE = 'MANAGED_TABLE'
GROUP BY tbl.TBL_NAME;

Do remember to change the db name to yours before testing it out :)
regards
Dev



On Thu, Jul 31, 2014 at 6:34 PM, Dan Fan  wrote:

>  The easiest way is to go to hive,
> Show create table table_name.
> Then you can know when the table, if the it is partitioned
>
>   From: Nitin Pawar 
> Reply-To: "user@hive.apache.org" 
> Date: Thursday, July 31, 2014 at 3:50 AM
> To: "user@hive.apache.org" 
> Subject: Re: How can I know one table is a partitioned table in hive?
>
>   what are the options you have?
> can you write a java code which can interact with hcatalog ?
> or you can do a describle table and check for partion column details in
> there
>
>
> On Thu, Jul 31, 2014 at 1:11 PM, 张甲超  wrote:
>
>> dear all,
>> I want know that one table is a partitioned table in hive, and
>> return the result to shell.
>> How can I do?
>>
>
>
>
>  --
> Nitin Pawar
>



-- 
Devopam Mittra
Life and Relations are not binary


Re: Tuning Triangle Joins on Hive

2014-07-31 Thread Devopam Mittra
Please try the following approach and let me know if you are not getting
better performance:

1. Ensure indexes are present on the dst and src columns of the table.
2. Create a subset first by joining r2 and r3 (i.e. r2.dst = r3.src AND
r2.src < r3.src) into a physical table, and then create an index on its new
src column as well.
3. Join this to r1 (see the sketch below).

If this approach works well, then try out the WITH ... SELECT form using the
same approach, just with no physical intermediate table being created.
Hope it helps..

regards
Dev




On Fri, Aug 1, 2014 at 12:58 AM, Firas Abuzaid 
wrote:

> Hi,
>
> We're running various "triangle" join queries on Hive 0.9.0, and we're
> wondering if we can get any better performance. Here's the query we're
> running:
>
> SELECT count(*)
> FROM table r1 JOIN table r2 ON (r1.dst = r2.src) JOIN table r3 ON (r2.dst
> = r3.src AND r3.dst = r1.src)
> WHERE r1.src < r2.src AND r2.src < r3.src;
>
> We're currently passing the following tuning parameters as well:
>
> set mapred.map.tasks=120;
> set mapred.reduce.tasks=120;
> set mapred.tasktracker.map.tasks.maximum=8;
> set mapred.tasktracker.reduce.tasks.maximum=8;
> set mapred.child.java.opts=-Xmx5120m;
>
> The dataset we're using has 5 million nodes and 70 million edges, and most
> of our time is spent on garbage collection. We have about 30 machines in
> our cluster, and each machine has 45GB of RAM. Any thoughts on how we can
> improve performance? Thanks in advance!
>



-- 
Devopam Mittra
Life and Relations are not binary


Re: Hive Data

2014-07-30 Thread Devopam Mittra
You may please give the DBpedia dataset a try - I am sure you won't be disappointed
:)

regards
Dev
+91 958 305 9899

> On Jul 30, 2014, at 6:05 PM, CHEBARO Abdallah  
> wrote:
> 
> Till now I don’t have a file. I am willing to search online for a sample 
> dataset that contains at least 1 million rows. If you know any link to a 
> sample file, it would be very much appreciated.
>  
> Thank you.
>  
> From: Nitin Pawar [mailto:nitinpawar...@gmail.com] 
> Sent: Wednesday, July 30, 2014 3:33 PM
> To: user@hive.apache.org
> Subject: Re: Hive Data
>  
> hive reads the files by the input format defined by the table schema. 
>  
> By default it reads the TextFile in which columns are separated by "CTRL+A" 
> key
>  
> if you have a csv file then you can use a csv serde. 
> there are lots of such file formats.
>  
> what does your file look like? 
>  
>  
> 
> On Wed, Jul 30, 2014 at 5:54 PM, CHEBARO Abdallah 
>  wrote:
> Hello,
>  
> I am interested in testing Hive with a huge sample data. Does Hive read all 
> data types? Should the file be a table?
>  
> Thank you
> 
> 
>  
> -- 
> Nitin Pawar


Re: SELECT specific data

2014-07-30 Thread Devopam Mittra
If you have a defined table, then loading partial columns is most easily
handled by mapping your partial-column file as an external table and then
inserting NULL for the remaining columns.
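
A minimal sketch of that route, assuming a hypothetical 5-column target table
whose columns are all strings and a comma-delimited file that carries only
col1, col3 and col5 (every name, the delimiter and the location are
illustrative):

-- Map only the columns that actually exist in the file.
CREATE EXTERNAL TABLE staging_partial (
  col1 STRING,
  col3 STRING,
  col5 STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/partial_files';

-- Fill the missing columns with NULL while loading the real table.
INSERT INTO TABLE target_table
SELECT col1, CAST(NULL AS STRING), col3, CAST(NULL AS STRING), col5
FROM staging_partial;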

regards
Devopam


On Wed, Jul 30, 2014 at 2:49 PM, CHEBARO Abdallah <
abdallah.cheb...@murex.com> wrote:

>  I am only using Hive and hadoop, nothing more.
>
>
>
> *From:* Devopam Mittra [mailto:devo...@gmail.com]
> *Sent:* Wednesday, July 30, 2014 12:15 PM
>
> *To:* user@hive.apache.org
> *Subject:* Re: SELECT specific data
>
>
>
> Are you using any tool to load data ? If yes, then the ETL tool will
> provide you such options.
>
> If not, then please explore unix file processing/external table route.
>
>
>
> On Wed, Jul 30, 2014 at 2:09 PM, CHEBARO Abdallah <
> abdallah.cheb...@murex.com> wrote:
>
> Hello,
>
>
>
> Thank you for your reply.
>
>
>
> Consider we have data divided into 5 columns (col1, col2, col3, col4,
> col5).
>
> So I can’t load directly col1, col3 and col5?
>
> If I can’t do it directly, can you provide me with an alternate solution?
>
>
>
> Thank you.
>
>
>
> *From:* Nitin Pawar [mailto:nitinpawar...@gmail.com]
> *Sent:* Wednesday, July 30, 2014 11:37 AM
> *To:* user@hive.apache.org
> *Subject:* Re: SELECT specific data
>
>
>
> you mean just by writing query then I think no.
>
>
>
> But if you want to read only first 3 columns of the data then it would
> work with just a single table and load data into
>
>
>
> On Wed, Jul 30, 2014 at 1:47 PM, CHEBARO Abdallah <
> abdallah.cheb...@murex.com> wrote:
>
> Hello,
>
>
>
> I am interested in selecting specific data from a source and loading it to
> a table. For example, if I have 5 columns in my dataset, I want to load 3
> columns of it. Is it possible to do it without create a second table?
>
>
>
> Thank you
>
>
>
>
>
>
> --
> Nitin Pawar
>
>
>
>
>
>
> --
> Devopam Mittra
> Life and Relations are not binary
>
>



-- 
Devopam Mittra
Life and Relations are not binary


Re: SELECT specific data

2014-07-30 Thread Devopam Mittra
Are you using any tool to load data? If yes, then the ETL tool will
provide you with such options.
If not, then please explore the unix file processing / external table route.
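For the unix file processing route, a minimal sketch (file names and column
positions are hypothetical, assuming a comma-delimited source file):

  # keep only columns 1, 3 and 5 of a comma-delimited file
  cut -d',' -f1,3,5 source_data.csv > partial_data.csv

The resulting file can then be loaded into the table, or exposed as an
external table, as usual.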


On Wed, Jul 30, 2014 at 2:09 PM, CHEBARO Abdallah <
abdallah.cheb...@murex.com> wrote:

>  Hello,
>
>
>
> Thank you for your reply.
>
>
>
> Consider we have data divided into 5 columns (col1, col2, col3, col4,
> col5).
>
> So I can’t directly load col1, col3 and col5?
>
> If I can’t do it directly, can you provide me with an alternate solution?
>
>
>
> Thank you.
>
>
>
> *From:* Nitin Pawar [mailto:nitinpawar...@gmail.com]
> *Sent:* Wednesday, July 30, 2014 11:37 AM
> *To:* user@hive.apache.org
> *Subject:* Re: SELECT specific data
>
>
>
> You mean just by writing a query? Then I think no.
>
>
>
> But if you want to read only the first 3 columns of the data, then it would
> work with just a single table: define only those columns and load the data
> into it.
>
>
>
> On Wed, Jul 30, 2014 at 1:47 PM, CHEBARO Abdallah <
> abdallah.cheb...@murex.com> wrote:
>
> Hello,
>
>
>
> I am interested in selecting specific data from a source and loading it to
> a table. For example, if I have 5 columns in my dataset, I want to load 3
> columns of it. Is it possible to do it without creating a second table?
>
>
>
> Thank you
>
>
>
>
>
>
> --
> Nitin Pawar
>
>



-- 
Devopam Mittra
Life and Relations are not binary


Re: Drop Partition by ID

2014-07-22 Thread Devopam Mittra
Please try using an escape character around the '%' if not already done so.
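For reference, a hedged sketch of the two drop variants usually worth trying
here (the table name is hypothetical; %3A is the URL-encoding of ':'):

  -- Variant 1: pass the partition value exactly as SHOW PARTITIONS printed it
  ALTER TABLE weblogs DROP PARTITION (log_date='2014-07-15-23%3A45%3A38');

  -- Variant 2: %3A decodes to ':', so the stored value may actually contain
  -- colons; try the decoded form as well
  ALTER TABLE weblogs DROP PARTITION (log_date='2014-07-15-23:45:38');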

regards
Dev


On Mon, Jul 21, 2014 at 7:32 PM, fab wol  wrote:

> Hi everyone,
>
> I have the following problem: I have a partitioned managed table (the
> partition column is a string which represents a date, e.g.
> log_date="2014-07-15"). Unfortunately there is one partition in there like
> this: log_date=2014-07-15-23%3A45%3A38 (copied from the show partitions
> stmt). This partition most likely got created by a wrong script (which is
> fixed).
>
> Now I want to delete this partition, but it doesn't work:
>
>    - alter table ... drop partition
>    (log_date='2014-07-15-23%3A45%3A38') gives no error, but the partition
>    still exists afterwards
>    - I tried escaping the %-signs with backslashes but had no luck with that
>    - I deleted the directory in HDFS and ran msck repair table
>    afterwards. It recognizes that the folder is missing but does not delete
>    the metadata
>
> So what can I do to get rid of the metadata? My next guess would be to go
> directly to the metastore DB and delete the metadata there. But what
> exactly has to be deleted? I guess there are several dependencies.
>
> Other idea: is there a possibility in Hive to delete a partition by a
> unique ID or something like that?
>
> Or what is needed to delete the partition with the normal "alter table drop
> partition" command?
>
> Cheers
> Wolli
>



-- 
Devopam Mittra
Life and Relations are not binary


Re: Hive giving errors while running queries

2014-07-06 Thread Devopam Mittra
Please check your YARN log. It will have details around this error.
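For example, assuming YARN log aggregation is enabled, something along these
lines should pull the container logs (the application id mirrors the job id
from your output):

  yarn logs -applicationId application_1404661589735_0002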

regards
Dev
+91 958 305 9899

> On Jul 7, 2014, at 10:46 AM, Ritesh Kumar Singh 
>  wrote:
> 
> hive>select COUNT(*) from movies;
> Total MapReduce jobs = 1
> Launching Job 1 out of 1
> Number of reduce tasks determined at compile time: 1
> In order to change the average load for a reducer (in bytes):
>   set hive.exec.reducers.bytes.per.reducer=<number>
> In order to limit the maximum number of reducers:
>   set hive.exec.reducers.max=<number>
> In order to set a constant number of reducers:
>   set mapred.reduce.tasks=<number>
> Starting Job = job_1404661589735_0002, Tracking URL = 
> http://gonephishing:8088/proxy/application_1404661589735_0002/
> Kill Command = /usr/local/hadoop/bin/hadoop job  -kill job_1404661589735_0002
> Hadoop job information for Stage-1: number of mappers: 0; number of reducers: 0
> 2014-07-07 10:11:14,770 Stage-1 map = 0%,  reduce = 0%
> Ended Job = job_1404661589735_0002 with errors
> Error during job, obtaining debugging information...
> FAILED: Execution Error, return code 2 from 
> org.apache.hadoop.hive.ql.exec.MapRedTask
> MapReduce Jobs Launched: 
> Job 0:  HDFS Read: 0 HDFS Write: 0 FAIL
> Total MapReduce CPU Time Spent: 0 msec
> 
> --
> Any suggestions are welcomed.


UDF development group details

2014-06-09 Thread Devopam Mittra
hi All,
Can you please redirect/connect me to the Hive custom UDF development group?

My intent is to create / co-develop custom UDFs for text analytics and data
mining over Hive directly.

-- 
Devopam Mittra
Life and Relations are not binary


Hive : ngrams : stopword usage

2014-05-27 Thread Devopam Mittra
hi All,
I was trying to explore the usage of ngrams with a custom stopword list.
Though the documentation states that it can be used, there is no example
provided and the syntax is also not apparent.

Any pointers are welcome
Please forgive me if this is a repost/known issue and I am still ignorant
about the same.

regards
Dev