Re: Hive footprint

2016-04-19 Thread Peyman Mohajerian
Hi Amey,

It is about seek vs scan. HBase is great when a rowkey or a range of
rowkeys is part of the where clause: then you do a seek, and ORC/Parquet
read off HDFS would not do better in the absence of an index. However, for a
Data Warehouse that is generally not what you do; you mostly do scans, e.g.
when doing aggregation you aren't looking for particular record(s). In this
case IO throughput dominates (generally), because you have to read lots
of data, and then reading large blocks of data and using header info
(predicate push-down) in ORC or Parquet will be faster compared to reading
lots of HFiles in HBase. Of course compaction in HBase can turn the files
into larger chunks, but it will still 'typically' be slower.
I should strongly emphasize that making statements about what is faster or
not is very dangerous; there can be many exceptions depending on the type
of query and other factors. When I did this test I was using map/reduce, and
with newer engines queries will be faster. Also, caching in HBase is
critical: if all your data is cached, you have lots of memory, and the system
isn't busy handling compaction and lots of new writes, then your read
performance in all cases will improve. Always do your own POC and use your
own data to test.
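To make the seek vs scan distinction concrete, the two query shapes I have in
mind look roughly like this (table and column names are made up):

  -- seek: a rowkey (or small rowkey range) in the where clause favors HBase
  select * from events_hbase where rowkey = 'device123#20160419';

  -- scan: an aggregation over a large slice favors ORC/Parquet on HDFS
  select device_type, count(*)
  from events_orc
  where event_date >= '2016-01-01'
  group by device_type;

The predicate push-down point above applies to the second case: ORC/Parquet
keep lightweight min/max stats per stripe/row group in their headers, so
filters can skip whole blocks before most of the data is even decompressed.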

Thanks,
Peyman



On Tue, Apr 19, 2016 at 2:26 AM, Amey Barve  wrote:

> Hi Peyman,
>
> You say: "you can use Hive storage handler to read data from HBase the
> performance would be lower than reading from HDFS directly for analytic."
> Why is it so? Is it slow as compared to ORC, Parquet, and even Text file
> format?
>
> Regards,
> Amey
>
> On Tue, Apr 19, 2016 at 4:32 AM, Peyman Mohajerian 
> wrote:
>
>> HBase can handle high read/write throughput, e.g. IoT use cases. It is
>> not an analytic engine; even though you can use the Hive storage handler to
>> read data from HBase, the performance would be lower than reading from HDFS
>> directly for analytics. But HBase has an index, the rowkey, and you can add a
>> secondary index, usually with Elasticsearch or other means. You can also
>> run Phoenix over HBase to do analytics, but again only if your data
>> collection/use case mandates HBase, e.g. small amounts of data from millions
>> of devices. It is common to copy data from HBase to HDFS (even though HBase
>> is sitting on top of HDFS) as ORC/Parquet for very large analytics. But
>> again you do have the choice of using Phoenix or Hive to run analytics over
>> HBase if you don't want to pay the cost of data copying.
>> HBase can only be part of a DW solution in a limited way, e.g. as an index
>> to data in HDFS, partition discovery, etc. Pretty soon it will be the
>> metadata store for Hive (optional, instead of an RDBMS). HBase can sit on the
>> edge of a DW to collect fast-landing data.
>> I don't see any competition between Hive and HBase; they work together, and I
>> don't see a modern DW having a monolithic engine: Tez+Spark+MPP+...
>>
>>
>>
>> On Mon, Apr 18, 2016 at 3:51 PM, Marcin Tustin 
>> wrote:
>>
>>> We use a hive with ORC setup now. Queries may take thousands of seconds
>>> with joins, and potentially tens of seconds with selects on very large
>>> tables.
>>>
>>> My understanding is that the goal of hbase is to provide much lower
>>> latency for queries. Obviously, this comes at the cost of not being able to
>>> perform joins. I don't actually use hbase, so I hesitate to say more about
>>> it.
>>>
>>> On Mon, Apr 18, 2016 at 6:48 PM, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>> Thanks Marcin.
>>>>
>>>> What is the definition of low latency here? Are you referring to the
>>>> performance of SQL against HBase tables compared to Hive. As I understand
>>>> HBase is a columnar database. Would it be possible to use Hive against ORC
>>>> to achieve the same?
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>>
>>>> LinkedIn * 
>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>>
>>>>
>>>> On 18 April 2016 at 23:43, Marcin Tustin  wrote:
>>>>
>>>>> HBase has a different use case - it's for low-latency querying of big
>>>>> tables. If you combined it with Hive, you might have something nice for
>>>>> certain queries, but I wouldn't think of t

Re: Hive footprint

2016-04-18 Thread Peyman Mohajerian
HBase can handle high read/write throughput, e.g. IoT use cases. It is not
an analytic engine; even though you can use the Hive storage handler to read
data from HBase, the performance would be lower than reading from HDFS
directly for analytics. But HBase has an index, the rowkey, and you can add a
secondary index, usually with Elasticsearch or other means. You can also run
Phoenix over HBase to do analytics, but again only if your data
collection/use case mandates HBase, e.g. small amounts of data from millions
of devices. It is common to copy data from HBase to HDFS (even though HBase
is sitting on top of HDFS) as ORC/Parquet for very large analytics. But again
you do have the choice of using Phoenix or Hive to run analytics over HBase
if you don't want to pay the cost of data copying.
HBase can only be part of a DW solution in a limited way, e.g. as an index to
data in HDFS, partition discovery, etc. Pretty soon it will be the metadata
store for Hive (optional, instead of an RDBMS). HBase can sit on the edge of a
DW to collect fast-landing data.
I don't see any competition between Hive and HBase; they work together, and I
don't see a modern DW having a monolithic engine: Tez+Spark+MPP+...
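For completeness, the Hive storage handler route mentioned above looks roughly
like this (table names and the column mapping are made up):

  CREATE EXTERNAL TABLE hbase_events (rowkey STRING, payload STRING)
  STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
  WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,d:payload')
  TBLPROPERTIES ('hbase.table.name' = 'events');

That gives you SQL over the HBase table without copying the data, but for large
scans the ORC/Parquet copy will generally still be the faster path, as
discussed above.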



On Mon, Apr 18, 2016 at 3:51 PM, Marcin Tustin 
wrote:

> We use a hive with ORC setup now. Queries may take thousands of seconds
> with joins, and potentially tens of seconds with selects on very large
> tables.
>
> My understanding is that the goal of hbase is to provide much lower
> latency for queries. Obviously, this comes at the cost of not being able to
> perform joins. I don't actually use hbase, so I hesitate to say more about
> it.
>
> On Mon, Apr 18, 2016 at 6:48 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Thanks Marcin.
>>
>> What is the definition of low latency here? Are you referring to the
>> performance of SQL against HBase tables compared to Hive. As I understand
>> HBase is a columnar database. Would it be possible to use Hive against ORC
>> to achieve the same?
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 18 April 2016 at 23:43, Marcin Tustin  wrote:
>>
>>> HBase has a different use case - it's for low-latency querying of big
>>> tables. If you combined it with Hive, you might have something nice for
>>> certain queries, but I wouldn't think of them as direct competitors.
>>>
>>> On Mon, Apr 18, 2016 at 6:34 PM, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 Hi,

 I notice that Impala is rarely mentioned these days.  I may be missing
 something. However, I gather it is coming to end now as I don't recall many
 use cases for it (or customers asking for it). In contrast, Hive has hold
 its ground with the new addition of Spark and Tez as execution engines,
 support for ACID and ORC and new stuff in Hive 2. In addition provided a
 good choice for its metastore it scales well.

 If Hive had the ability (organic) to have local variable and stored
 procedure support then it would be top notch Data Warehouse. Given its
 metastore, I don't see any technical reason why it cannot support these
 constructs.

 I was recently asked to comment on migration from commercial DWs to Big
 Data (primarily for TCO reason) and really could not recall any better
 candidate than Hive. Is HBase a viable alternative? Obviously whatever one
 decides there is still HDFS, a good engine for Hive (sounds like many
 prefer TEZ although I am a Spark fan) and the ubiquitous YARN.

 Let me know your thoughts.


 Dr Mich Talebzadeh



 LinkedIn * 
 https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 *



 http://talebzadehmich.wordpress.com



>>>
>>>
>>>
>>>
>>
>
>
>


Re: How to set hive.aux.jars.path in hive1.1.0?

2015-12-28 Thread Peyman Mohajerian
Maybe you need to also add:

HIVE_AUX_JARS_PATH = /path/to/JAR

e.g.: 
http://www.cloudera.com/content/www/en-us/documentation/archive/manager/4-x/4-8-0/Cloudera-Manager-Managing-Clusters/cmmc_hive_udf.html
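If you just want to rule out a bad path or permissions, it can also help to add
the jar for the session from beeline before creating the table (the jar path is
from your mail, the serde class is a placeholder):

  ADD JAR /root/serde.jar;
  CREATE TABLE t (col1 STRING)
  ROW FORMAT SERDE 'com.example.MySerDe';

If that works, the problem is that hive.aux.jars.path / HIVE_AUX_JARS_PATH isn't
being picked up by the HiveServer2 process, rather than anything wrong with the
serde itself.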


On Mon, Dec 28, 2015 at 11:52 PM, Lee S  wrote:

> Sorry, the  is , forgive my wrong typing. Anybody can
> help?
>
>
> On Tue, Dec 29, 2015 at 3:50 PM, Lee S  wrote:
>
>> Hi all:
>>I have a serde jar written by myself to deserialize some kind of data.
>>
>>I put the jar in the host of hiverserver and set the
>> *hive.aux.jars.path* property in hive-site.xml.
>>
>>   Then I use beeline connecting to hiveserver2, create a table with the
>> serde class, but it said
>>
>>class not found exception.  why? anybody can help?
>>
>> <property>
>>   <name>hive.aux.jars.path</name>
>>   <value>file:///root/serde.jar</value>
>> </property>
>>
>
>


Re: Best way to load CSV file into Hive

2015-10-31 Thread Peyman Mohajerian
If you find a way to escape the characters (some pre-processing step), then
you may find this useful:
https://cwiki.apache.org/confluence/display/Hive/CSV+Serde
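For example, with the CSV serde described on that page (built into newer Hive
releases, if I remember right), the definition would look roughly like this
(table and column names are placeholders):

  CREATE TABLE my_csv_table (id STRING, free_text STRING)
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
  WITH SERDEPROPERTIES ("separatorChar" = ",", "quoteChar" = "\"")
  STORED AS TEXTFILE;

The quoteChar handling is what deals with commas and quotes embedded inside a
free-text column.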

On Fri, Oct 30, 2015 at 11:36 AM, Martin Menzel 
wrote:

> Hi
> Do have access to the data source?
> If not you have first to find out if the data can be mapped to the columns
> in a unique way and for all rows. If yes maybe bindy can be a option to
> convert the data in a first step to tsv.
> I hope this helps.
> Regards
> Martin
> Am 30.10.2015 19:16 schrieb "Vijaya Narayana Reddy Bhoomi Reddy" <
> vijaya.bhoomire...@whishworks.com>:
>
>> Hi,
>>
>> I have a CSV file which contains hunderd thousand rows and about 200+
>> columns. Some of the columns have free text information, which means it
>> might contain characters like comma, colon, quotes etc with in the column
>> content.
>>
>> What is the best way to load such CSV file into Hive?
>>
>> Another serious issue, I have stored the file in a location in HDFS and
>> then created an external hive table on it. However, upon running Create
>> external table using HDP Hive View, the original CSV is no longer present
>> in the folder where it is meant to be. Not sure on how HDP processes and
>> where it is stored? My understanding was that EXTERNAL table wouldnt be
>> moved from their original HDFS location?
>>
>> Request someone to help out!
>>
>>
>> Thanks & Regards
>> Vijay
>>
>>
>>
>
>


Re: Data Deleted on Hive External Table

2015-08-25 Thread Peyman Mohajerian
The data was generated in some other cluster; they moved it to S3 and then
copied it to my cluster into the warehouse path. I then created a schema
over it. You are correct that this would not be the right process, and we
had no plans to do this in production; it was a POC. Nevertheless, in my
view 'external' should still carry the same meaning: 'Despite the fact
that the data is in the warehouse, I'm just experimenting with different
schema designs and am creating a temporary schema over this data, so
don't delete the content'. Perhaps instead of using 'external'
there are other options. Also, if 'external' doesn't mean anything in this
scenario, perhaps throw me an exception so I'm unable to create the table in
the first place.
Again, what I'm saying above is my logic and I could be wrong about something.
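For what it's worth, the behavior I expected is what you get when the data
lives outside the warehouse directory; a rough sketch (paths and names are made
up):

  CREATE EXTERNAL TABLE events (id INT, payload STRING)
  PARTITIONED BY (year STRING)
  LOCATION '/data/landing/events';

  DROP TABLE events;  -- expectation: the files under /data/landing/events stay put

The surprise in my case was that the same kind of drop removed the files,
apparently because they sat under /user/hive/warehouse.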



On Tue, Aug 25, 2015 at 7:09 AM, Jeetendra G 
wrote:

> if you put external in the table definition and point  INPATH to hive the
> original data(where data is landing from other source  ). then how come
> data will come to /user/hive/warehouse. /user/hive/warehouse should only be
> populated with data when its 'internal'?
>
> On Tue, Aug 25, 2015 at 7:33 PM, Peyman Mohajerian 
> wrote:
>
>> Hi Jeetendra,
>>
>> What I was originally saying is that if you drop the table, it will
>> delete the data despite the fact that you put 'external' in the
>> definition. I think this behavior is due to the fact that data is in
>> /user/hive/warehouse and therefore Hive assumes ownership and ignores the
>> 'external' directive! I would have assumed 'external' would still carry its
>> meaning and dropping the table would not delete the data, but I was wrong.
>> If I got this wrong, please challenge my conclusion.
>>
>> Thanks,
>> Peyman
>>
>> On Mon, Aug 24, 2015 at 11:22 PM, Jeetendra G 
>> wrote:
>>
>>> Hi Peyman
>>>
>>> I created a new Hive external table with partition column name of 'yr'
>>> instead of 'year' pointing to the same base directory.
>>> if this is a case how come /user/hive/warehouse having the data? it
>>> should not right?
>>>
>>> On Tue, Aug 25, 2015 at 4:41 AM, Peyman Mohajerian 
>>> wrote:
>>>
>>>> Hi Guys,
>>>>
>>>> I managed to delete some data in HDFS by dropping a partitioned
>>>> external Hive table. One explanation is that data resided in the
>>>> 'warehouse' directory of Hive and that had something to do with it?
>>>> An alternative explanation may be that my 'drop table' statement didn't
>>>> delete the data but my follow up 'create table' statement with a different
>>>> partition name did. Let me elaborate, files used to be in this directory
>>>> structure:
>>>> /user/hive/warehouse//year=2009
>>>>
>>>> I created a new Hive external table with partition column name of 'yr'
>>>> instead of 'year' pointing to the same base directory. Is it possible that
>>>> this create statement deleted the data (highly doubt that)? Either case
>>>> was unexpected to me!
>>>>
>>>> This is on Hive 1.0.
>>>>
>>>> Thanks,
>>>> Peyman
>>>>
>>>
>>>
>>
>


Re: Data Deleted on Hive External Table

2015-08-25 Thread Peyman Mohajerian
Hi Jeetendra,

What I was originally saying is that if you drop the table, it will delete
the data despite the fact that you put 'external' in the definition. I
think this behavior is due to the fact that data is in /user/hive/warehouse
and therefore Hive assumes ownership and ignores the 'external' directive!
I would have assumed 'external' would still carry its meaning and dropping
the table would not delete the data, but I was wrong.
If I got this wrong, please challenge my conclusion.

Thanks,
Peyman

On Mon, Aug 24, 2015 at 11:22 PM, Jeetendra G 
wrote:

> Hi Peyman
>
> I created a new Hive external table with partition column name of 'yr'
> instead of 'year' pointing to the same base directory.
> if this is a case how come /user/hive/warehouse having the data? it should
> not right?
>
> On Tue, Aug 25, 2015 at 4:41 AM, Peyman Mohajerian 
> wrote:
>
>> Hi Guys,
>>
>> I managed to delete some data in HDFS by dropping a partitioned external
>> Hive table. One explanation is that data resided in the 'warehouse'
>> directory of Hive and that had something to do with it?
>> An alternative explanation may be that my 'drop table' statement didn't
>> delete the data but my follow up 'create table' statement with a different
>> partition name did. Let me elaborate, files used to be in this directory
>> structure:
>> /user/hive/warehouse//year=2009
>>
>> I created a new Hive external table with partition column name of 'yr'
>> instead of 'year' pointing to the same base directory. Is it possible that
>> this create statement deleted the data (highly doubt that)? Either case
>> was unexpected to me!
>>
>> This is on Hive 1.0.
>>
>> Thanks,
>> Peyman
>>
>
>


Data Deleted on Hive External Table

2015-08-24 Thread Peyman Mohajerian
Hi Guys,

I managed to delete some data in HDFS by dropping a partitioned external
Hive table. One explanation is that data resided in the 'warehouse'
directory of Hive and that had something to do with it?
An alternative explanation may be that my 'drop table' statement didn't delete
the data but my follow up 'create table' statement with a different
partition name did. Let me elaborate, files used to be in this directory
structure:
/user/hive/warehouse//year=2009

I created a new Hive external table with partition column name of 'yr'
instead of 'year' pointing to the same base directory. Is it possible that
this create statement deleted the data (highly doubt that)? Either case
was unexpected to me!

This is on Hive 1.0.

Thanks,
Peyman


Re: Writing ORC Files

2015-04-07 Thread Peyman Mohajerian
Yep, I didn't see that. My guess is that what you are passing to 'addRow' is
incorrect; e.g. take a look at:
https://github.com/gautamphegde/HadoopCraft/blob/master/ORCOutput/src/main/java/ORCout/ORCMapper.java
It isn't the same thing, but you can see they are passing a list that is first
serialized.
Also, if you are on a later version of Hive you have an easier option:
http://henning.kropponline.de/2015/01/24/hive-streaming-with-storm/
https://github.com/apache/storm/tree/master/external/storm-hive
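If you go the Hive streaming route from those links, keep in mind the target
table has to be set up for it; as far as I know it needs to be ORC, bucketed,
and transactional (plus the ACID/txn settings on the server), roughly like this
(names are made up):

  CREATE TABLE conn_events (src STRING, dst STRING, ts TIMESTAMP)
  PARTITIONED BY (dt STRING)
  CLUSTERED BY (src) INTO 8 BUCKETS
  STORED AS ORC
  TBLPROPERTIES ('transactional' = 'true');

Then the storm-hive bolt (or the streaming API) appends rows into open
transactions instead of you managing ORC writers and flushes yourself.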

Good Luck

On Tue, Apr 7, 2015 at 9:24 AM, Grant Overby (groverby) 
wrote:

>  addRow() is called in execute(). Does something look wrong with the call?
>
> *Grant Overby*
> Software Engineer
> Cisco.com <http://www.cisco.com/>
> grove...@cisco.com
> Mobile: *865 724 4910 <865%20724%204910>*
>
>
>
>
>
>
>
>   From: Peyman Mohajerian 
> Reply-To: "user@hive.apache.org" 
> Date: Tuesday, April 7, 2015 at 12:20 PM
> To: "user@hive.apache.org" 
> Cc: "Bhavana Kamichetty (bkamiche)" 
> Subject: Re: Writing ORC Files
>
>   I think you have to call 'addRow' to the writer:
>
>
> https://hive.apache.org/javadocs/r0.12.0/api/org/apache/hadoop/hive/ql/io/orc/Writer.html
>
>  That's just based on the javadoc, i don't have any experience doing this.
>
> On Tue, Apr 7, 2015 at 8:43 AM, Grant Overby (groverby) <
> grove...@cisco.com> wrote:
>
>>   I have a Storm Trident Bolt for writing ORC File. The files are
>> created; however, they are always zero length. This code eventually causes
>> an OOME. I suspect I am missing some sort of flushing action, but don’t see
>> anything like that in the api.
>>
>>  My bolt follows. Any thoughts as to what I’m doing wrong or links to
>> reference uses of org.apache.hadoop.hive.ql.io.orc.Writer ?
>>
>>  package com.cisco.tinderbox.burner.trident.functions;
>>
>> import storm.trident.operation.BaseFunction;
>> import storm.trident.operation.TridentCollector;
>> import storm.trident.tuple.TridentTuple;
>>
>> import com.cisco.tinderbox.burner.io.system.CurrentUnixTime;
>> import com.cisco.tinderbox.burner.trident.Topology;
>> import com.cisco.tinderbox.model.ConnectionEvent;
>> import com.google.common.base.Throwables;
>>
>> import java.io.IOException;
>> import java.util.List;
>> import java.util.UUID;
>>
>> import org.apache.hadoop.conf.Configuration;
>> import org.apache.hadoop.fs.FileSystem;
>> import org.apache.hadoop.fs.Path;
>> import org.apache.hadoop.fs.RawLocalFileSystem;
>> import org.apache.hadoop.hive.ql.io.orc.OrcFile;
>> import org.apache.hadoop.hive.ql.io.orc.Writer;
>> import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
>> import org.apache.hive.hcatalog.streaming.FlatTableColumn;
>> import org.apache.hive.hcatalog.streaming.FlatTableObjectInspector;
>> import org.slf4j.Logger;
>> import org.slf4j.LoggerFactory;
>>
>> import static org.apache.hadoop.hive.ql.io.orc.CompressionKind.*;
>>
>> public class OrcSink extends BaseFunction {
>> private static final Logger logger = 
>> LoggerFactory.getLogger(OrcSink.class);
>> private static final CurrentUnixTime currentUnixTime = 
>> CurrentUnixTime.getInstance();
>> private static final long serialVersionUID = 7435558912956446385L;
>> private final String dbName;
>> private final String tableName;
>> private final List> fields;
>> private final String hdfsUrl;
>> private transient volatile int partition;
>> private transient volatile Writer writer;
>> private transient volatile Path path;
>>
>> public OrcSink(String hdfsUrl, String dbName, String tableName, 
>> List> fields) {
>> this.hdfsUrl = hdfsUrl;
>> this.dbName = dbName;
>> this.tableName = tableName;
>> this.fields = fields;
>> }
>>
>> @Override
>> public void cleanup() {
>> closeWriter();
>> }
>>
>> @Override
>> public synchroniz

Re: Writing ORC Files

2015-04-07 Thread Peyman Mohajerian
I think you have to call 'addRow' to the writer:

https://hive.apache.org/javadocs/r0.12.0/api/org/apache/hadoop/hive/ql/io/orc/Writer.html

That's just based on the javadoc, i don't have any experience doing this.

On Tue, Apr 7, 2015 at 8:43 AM, Grant Overby (groverby) 
wrote:

>   I have a Storm Trident Bolt for writing ORC File. The files are
> created; however, they are always zero length. This code eventually causes
> an OOME. I suspect I am missing some sort of flushing action, but don’t see
> anything like that in the api.
>
>  My bolt follows. Any thoughts as to what I’m doing wrong or links to
> reference uses of org.apache.hadoop.hive.ql.io.orc.Writer ?
>
>  package com.cisco.tinderbox.burner.trident.functions;
>
> import storm.trident.operation.BaseFunction;
> import storm.trident.operation.TridentCollector;
> import storm.trident.tuple.TridentTuple;
>
> import com.cisco.tinderbox.burner.io.system.CurrentUnixTime;
> import com.cisco.tinderbox.burner.trident.Topology;
> import com.cisco.tinderbox.model.ConnectionEvent;
> import com.google.common.base.Throwables;
>
> import java.io.IOException;
> import java.util.List;
> import java.util.UUID;
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.fs.RawLocalFileSystem;
> import org.apache.hadoop.hive.ql.io.orc.OrcFile;
> import org.apache.hadoop.hive.ql.io.orc.Writer;
> import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
> import org.apache.hive.hcatalog.streaming.FlatTableColumn;
> import org.apache.hive.hcatalog.streaming.FlatTableObjectInspector;
> import org.slf4j.Logger;
> import org.slf4j.LoggerFactory;
>
> import static org.apache.hadoop.hive.ql.io.orc.CompressionKind.*;
>
> public class OrcSink extends BaseFunction {
> private static final Logger logger = 
> LoggerFactory.getLogger(OrcSink.class);
> private static final CurrentUnixTime currentUnixTime = 
> CurrentUnixTime.getInstance();
> private static final long serialVersionUID = 7435558912956446385L;
> private final String dbName;
> private final String tableName;
> private final List> fields;
> private final String hdfsUrl;
> private transient volatile int partition;
> private transient volatile Writer writer;
> private transient volatile Path path;
>
> public OrcSink(String hdfsUrl, String dbName, String tableName, 
> List> fields) {
> this.hdfsUrl = hdfsUrl;
> this.dbName = dbName;
> this.tableName = tableName;
> this.fields = fields;
> }
>
> @Override
> public void cleanup() {
> closeWriter();
> }
>
> @Override
> public synchronized void execute(TridentTuple tuple, TridentCollector 
> collector) {
> try {
> refreshWriterIfNeeded();
> ConnectionEvent connectionEvent = (ConnectionEvent) 
> tuple.getValueByField(Topology.FIELD_CORRELATED);
> writer.addRow(connectionEvent);
> } catch (IOException e) {
> logger.error("could not write to orc", e);
> }
> }
>
> private void closeWriter() {
> if (writer != null) {
> try {
> writer.close();
> } catch (IOException e) {
> Throwables.propagate(e);
> } finally {
> writer = null;
> }
> }
> }
>
> private void createWriter() {
> try {
> Configuration fsConf = new Configuration();
> fsConf.set("fs.defaultFS", hdfsUrl);
> FileSystem fs = new RawLocalFileSystem(); 
> //FileSystem.get(fsConf);
> String fileName = System.currentTimeMillis() + "-" + 
> UUID.randomUUID().toString() + ".orc";
> path = new Path("/data/diska/orc/" + dbName + "/" + tableName + 
> "/" + partition + "/" + fileName);
> Configuration writerConf = new Configuration();
> ObjectInspector oi = new FlatTableObjectInspector(dbName + "." + 
> tableName, fields);
> int stripeSize = 250 * 1024 * 1024;
> int compressBufferSize = 256 * 1024;
> int rowIndexStride = 1;
> writer = OrcFile.createWriter(fs, path, writerConf, oi, 
> stripeSize, SNAPPY, compressBufferSize, rowIndexStride);
> } catch (IOException e) {
> throw Throwables.propagate(e);
> }
> }
>
> private void refreshWriter() {
> partition = currentUnixTime.getQuarterHour();
> closeWriter();
> createWriter();
> }
>
> private void refreshWriterIfNeeded() {
> if (writer == null || partition != currentUnixTime.getQuarterHour()) {
> refreshWriter();
> }
> }
> }
>
>
>
> *Grant Overby*
> Software Engineer
> Cisco.com 
> grove...@cisco.com
> Mobile: *865 724 4910 <865%20724%204910>*
>
>
>

Re: How to provide security in hive using impersonation?

2014-11-28 Thread Peyman Mohajerian
You can use a database: store the user records in there and look up the
username/password from it using simple JDBC.
For example, you may be able to use the same database that is keeping the Hive
metadata, typically MySQL or Postgres.

On Fri, Nov 28, 2014 at 8:55 PM, prasad bezavada 
wrote:

> Thank you so much for your reply .I already tried this custom/pluggable
> authentication , i have implemented a PasswdAuthenticationProvider class
> that validates the  given username and password while connecting to hive
> from java program,but here the list of usernames and password I saved in
> one property file  kind of file , when ever i want to connect to hive it
> just checks that the given username & password  with the entries in that
> property file, if entry is there then it will allow us to connect otherwise
> throws an error. And if we want to add an user simply we can add an user in
> that property file.But I don't want to use this property file or anything
> ,I just want to do it dynamically .Is there any way to do that ?Like we do
> entries in LDAP. Please Let me know..
>
> On Fri, Nov 28, 2014 at 10:49 PM, Peyman Mohajerian 
> wrote:
>
>> If you don't want to deal with LDAP, I know of one other way:
>>
>> <property>
>>   <name>hive.server2.authentication</name>
>>   <value>CUSTOM</value>
>> </property>
>>
>> <property>
>>   <name>hive.server2.custom.authentication.class</name>
>>   <value></value>
>> </property>
>>
>> You can implement PasswdAuthenticationProvider and put your
>> implementation class in (xxx). I think these guys have some examples:
>>
>> from http://shiro.apache.org/
>> <http://shiro.apache.org/download.html#latestBinary>
>>
>> But I haven't looked into it myself.
>>
>>
>>
>>
>>
>>
>> On Fri, Nov 28, 2014 at 5:05 AM, prasad bezavada <
>> prasadbezav...@gmail.com> wrote:
>>
>>> Hi ,
>>>
>>>
>>>  I am writing a java program to connect with hive and query the
>>> data from hive. From  my program i am connecting  to hive as follows
>>>  private static String driverName =
>>> "org.apache.hadoop.hive.jdbc.HiveDriver";
>>> public static void main(String[] args) throws SQLException {
>>> try {
>>>   Class.forName(driverName);
>>> } catch (ClassNotFoundException e) {
>>>   // TODO Auto-generated catch block
>>>   e.printStackTrace();
>>>   System.exit(1);
>>> }
>>> Connection con = DriverManager.getConnection(
>>> "jdbc:hive://localhost:1/default", "hive", "any");
>>> Statement stmt = con.createStatement();
>>> 
>>> My Problem is, the above connection is not taking the password ,i
>>> mean with any username and password it is allowing me to connect and ge
>>> get the data from hive. But I want to  restrict this and want to
>>> provide security so that
>>> only the  specified user would be able to connect and query hive.
>>> I tried :pluggable authentication and its working fine.
>>> But i want to do it dynamically.
>>>  i don't want to use LDAP or Kerberos.
>>>
>>> is there any way to do that?
>>>
>>> and can we create users and roles in hive
>>> ?if possible then how?
>>> when i use the command create role role_name its giving error i.e
>>> something like hive authorization incomplete or disabled
>>>
>>>
>>>
>>>
>>>
>>
>


Re: How to provide security in hive using impersonation?

2014-11-28 Thread Peyman Mohajerian
If you don't want to deal with LDAP, I know of one other way:



<property>
  <name>hive.server2.authentication</name>
  <value>CUSTOM</value>
</property>

<property>
  <name>hive.server2.custom.authentication.class</name>
  <value></value>
</property>

You can implement PasswdAuthenticationProvider and put your
implementation class in (xxx). I think these guys have some examples:

from http://shiro.apache.org/


But I haven't looked into it myself.






On Fri, Nov 28, 2014 at 5:05 AM, prasad bezavada 
wrote:

> Hi ,
>
>
>  I am writing a java program to connect with hive and query the
> data from hive. From  my program i am connecting  to hive as follows
>  private static String driverName =
> "org.apache.hadoop.hive.jdbc.HiveDriver";
> public static void main(String[] args) throws SQLException {
> try {
>   Class.forName(driverName);
> } catch (ClassNotFoundException e) {
>   // TODO Auto-generated catch block
>   e.printStackTrace();
>   System.exit(1);
> }
> Connection con = DriverManager.getConnection(
> "jdbc:hive://localhost:1/default", "hive", "any");
> Statement stmt = con.createStatement();
> 
> My Problem is, the above connection is not taking the password ,i mean
> with any username and password it is allowing me to connect and ge
> get the data from hive. But I want to  restrict this and want to
> provide security so that
> only the  specified user would be able to connect and query hive.
> I tried :pluggable authentication and its working fine.
> But i want to do it dynamically.
>  i don't want to use LDAP or Kerberos.
>
> is there any way to do that?
>
> and can we create users and roles in hive
> ?if possible then how?
> when i use the command create role role_name its giving error i.e
> something like hive authorization incomplete or disabled
>
>
>
>
>


Re: Hide HCatalog Metadata to unauthorized users

2014-10-31 Thread Peyman Mohajerian
One way is to use Hive's role-based security, SQL Standards Based
Authorization via HiveServer2:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Authorization#LanguageManualAuthorization-2SQLStandardsBasedAuthorizationinHiveServer2
available in Hive 0.13.
I'm also working for Teradata btw.
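With that enabled, you can scope visibility with roles; a rough sketch (the
role, user, and table names are made up):

  CREATE ROLE analysts;
  GRANT ROLE analysts TO USER stefan;
  GRANT SELECT ON TABLE sales.orders TO ROLE analysts;

Note this governs access through HiveServer2; a UI that talks to the metastore
directly (like the Hue HCatalog tab) may still list table names unless it also
goes through an authorized path.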

On Fri, Oct 31, 2014 at 10:56 AM, Papp, Stefan 
wrote:

>  Hi,
>
>
>
>
>
> I want to hide selected Databases and Tables from users. On file system
> level, we achieved it via file permissions. What we cannot do so far is to
> hide the corresponding metadata in HCatalog. When unsers login to Hue as
> WebUi, they are able to see all the tables and databases when they open the
> HCatalog Tab.
>
>
>
> Is there a way to restrict the view to the same rights as on file level?
>
>
>
> Thank you!
>
>
>
> I posted the same issue also on Stackoverflow:
> http://stackoverflow.com/questions/26658267/is-there-a-way-to-hide-hcatalog-tables-for-unauthorized-users-in-hue
>
>
>
> Stefan Papp
>
> Hadoop Consultant
>
> Teradata Vienna
> Storchengasse 1
>
> 1150 Wien
> Work: +43 (0)1 22715 000
>
> Mobile: +43 664 22 08 616
>
> EMAIL: stefan.p...@teradata.com
>
> Teradata
> Analytic Data Platforms | Applications | Services
>
>
>
>
>
>
>
>


Re: USE CASE:: Hierarchical Structure in Hive and Java

2014-10-22 Thread Peyman Mohajerian
I can think of two approaches: one is to use HBase and the HBaseStorageHandler
in Hive to read the data (handling hierarchical structures is easier in HBase);
the other is, in Hive, to just store the data from left to right in a single
row with a flexible number of columns. You can also store it as JSON or XML and
use a serde to read the data in either columnar or row format.
I'm sure there are other ways too.
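Just to illustrate the second idea, the management chain down to each person
can be kept in one row, for example as an array (table and column names are
made up):

  CREATE TABLE org_chart (
    employee  STRING,
    mgmt_path ARRAY<STRING>  -- e.g. ['senior director','director','senior manager']
  )
  STORED AS ORC;

Filtering a subtree is then array_contains(mgmt_path, 'director'), and LATERAL
VIEW explode(mgmt_path) gets you back to one row per level of the hierarchy if
you need that shape instead.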

On Wed, Oct 22, 2014 at 1:55 AM, yogesh dhari  wrote:

> Hello All,
>
> We are having a use case where we need to create the hierarchical
> structure using Hive and Java
>
> For example
> Lets say in an organisation we need to create Org chart
> i.e. Senior director -> director -> associate director -> senior manager
> -> manager -> senior associate -> associate -> Developer
> means parent child then sub child and so on.
>
> Input Source: summarized table which is getting populated after running
> the joins  which will run business logic and fetch the data from base table.
> Output: Table which store the parent child relationship in a hierarchical
> manner
>
>
> If anyone come across this kind of requirement/scenario kindly suggest the
> approach to proceed.
>
> Thanks in Advance
>
>
> Thanks & Regards
> Yogesh
>
>


Re: String to Timestamp conversion bug

2014-09-22 Thread Peyman Mohajerian
So I found out more detail about this issue. In:
select cast('2999-12-31 23:59:59' as timestamp) from table;
if the table has ORC data, you are using Hive 0.13, and you set
hive.vectorized.execution.enabled = true;
then this issue occurs. It may be related to HIVE-6656; I'm not certain of
that.
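So the workarounds I know of are either turning vectorization off for that
query or going through unix_timestamp, along the lines of (my_orc_table is a
placeholder):

  SET hive.vectorized.execution.enabled = false;
  select cast('2999-12-31 23:59:59' as timestamp) from my_orc_table;

  -- or, leaving vectorization on:
  select from_unixtime(unix_timestamp('2999-12-31 23:59:59')) from my_orc_table;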





On Wed, Sep 10, 2014 at 11:05 PM, Peyman Mohajerian 
wrote:

> It is using either 1.6 or 1.7, but i tested
> System.out.println(" " + Timestamp.valueOf("2999-12-31 23:59:59" ));
> on both 1.7 and 1.6 version and it works in both cases.
>
> On Wed, Sep 10, 2014 at 10:12 PM, Jason Dere 
> wrote:
>
>> Hmm that's odd .. it looks like this works for me:
>>
>> hive> select cast('2999-12-31 23:59:59' as timestamp);
>> OK
>> 2999-12-31 23:59:59
>> Time taken: 0.212 seconds, Fetched: 1 row(s)
>>
>> For string to timestamp conversion, it should be using
>> java.sql.Timestamp.valueOf().  What version of jvm are you using?
>>
>>
>>
>> On Sep 10, 2014, at 2:38 PM, Peyman Mohajerian 
>> wrote:
>>
>> Hi Guys,
>>
>> In Hive 0.13, for this conversion:
>> select cast('2999-12-31 23:59:59' as timestamp)
>> I get:
>> 1830-11-23 00:50:51.580896768
>> up to around year 2199 it works fine, the work around is to convert the
>> string to int and then back to timestamp:
>> from_unixtime(unix_timestamp('2999-12-31 23:59:59.00')
>>
>> But why is this issue happening in the first place?
>>
>> Thanks,
>>
>>
>>
>
>
>


Re: String to Timestamp conversion bug

2014-09-10 Thread Peyman Mohajerian
It is using either 1.6 or 1.7, but i tested
System.out.println(" " + Timestamp.valueOf("2999-12-31 23:59:59" ));
on both 1.7 and 1.6 version and it works in both cases.

On Wed, Sep 10, 2014 at 10:12 PM, Jason Dere  wrote:

> Hmm that's odd .. it looks like this works for me:
>
> hive> select cast('2999-12-31 23:59:59' as timestamp);
> OK
> 2999-12-31 23:59:59
> Time taken: 0.212 seconds, Fetched: 1 row(s)
>
> For string to timestamp conversion, it should be using
> java.sql.Timestamp.valueOf().  What version of jvm are you using?
>
>
>
> On Sep 10, 2014, at 2:38 PM, Peyman Mohajerian  wrote:
>
> Hi Guys,
>
> In Hive 0.13, for this conversion:
> select cast('2999-12-31 23:59:59' as timestamp)
> I get:
> 1830-11-23 00:50:51.580896768
> up to around year 2199 it works fine, the work around is to convert the
> string to int and then back to timestamp:
> from_unixtime(unix_timestamp('2999-12-31 23:59:59.00')
>
> But why is this issue happening in the first place?
>
> Thanks,
>
>
>


Re: String to Timestamp conversion bug

2014-09-10 Thread Peyman Mohajerian
The point of over-flow is:
2262-04-11 20:00:00
if you go a second earlier it works fine:
2262-04-11 19:23:59

On Wed, Sep 10, 2014 at 5:38 PM, Peyman Mohajerian 
wrote:

> Hi Guys,
>
> In Hive 0.13, for this conversion:
> select cast('2999-12-31 23:59:59' as timestamp)
> I get:
> 1830-11-23 00:50:51.580896768
> up to around year 2199 it works fine, the work around is to convert the
> string to int and then back to timestamp:
> from_unixtime(unix_timestamp('2999-12-31 23:59:59.00')
>
> But why is this issue happening in the first place?
>
> Thanks,
>


String to Timestamp conversion bug

2014-09-10 Thread Peyman Mohajerian
Hi Guys,

In Hive 0.13, for this conversion:
select cast('2999-12-31 23:59:59' as timestamp)
I get:
1830-11-23 00:50:51.580896768
Up to around year 2199 it works fine; the workaround is to convert the
string to an int and then back to a timestamp:
from_unixtime(unix_timestamp('2999-12-31 23:59:59.00'))

But why is this issue happening in the first place?

Thanks,


Re: Load CSV to hive

2014-09-04 Thread Peyman Mohajerian
you could use: https://github.com/ogrodnek/csv-serde
That's what I have done in the past, but with the latest version of Hive
there might be other options available too.


On Thu, Sep 4, 2014 at 12:43 PM, Ricardo Pompeu Ferreras <
ricardo.ferre...@gmail.com> wrote:

> Hi,
>
> How I create a table with this structure ?
>
> The separte are , comma,
> There are fields with "xx" date and "yyy" with value.
>
> 70268503,"2012-09-28 15:00:59",0,",01","18,01",,1,"2012-10-01 17:01:11",0,,
>
> help please
> --
> Ricardo
> SP, Brasil
>


Hive .13 issue with hive.map.aggr=true

2014-08-28 Thread Peyman Mohajerian
Hi Guys,

In this query:
select min(YYY), max(YYY) from  where trim(YYY) is not null and
trim(YYY)<>'';
we expect the following result:
#42684  ZYP7250455

Column 'YYY' is of type 'string'; the file format is ORC with Snappy
compression.
However we get:
ZYP7250455
(empty for the minimum), and the only workaround is to set:
hive.map.aggr=false;

Any idea what is going on here? I also made sure to update the table stats
using:
analyze table   PARTITION(monthly_capture) compute statistics;
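In case it helps anyone hitting the same thing, the full shape of the
workaround is just (table name omitted here, as above):

  set hive.map.aggr=false;
  select min(YYY), max(YYY) from <table> where trim(YYY) is not null and trim(YYY)<>'';

With map-side aggregation off, both the min and the max come back as expected.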

Thanks,


Re: how to create custom user defined data type in Hive

2014-08-25 Thread Peyman Mohajerian
The only option I know of in that case is using 'string' in Hive. You also have
to see how something like Sqoop will bring the data over; you may need to cast
the data type in Teradata first, using views. Those are my thoughts; there
could be other tricks.


On Mon, Aug 25, 2014 at 9:26 PM, reena upadhyay  wrote:

> Hi,
>
> As long as the data type is ANSI complaint, its equivalent type is
> available in Hive. But there are few data types that are database specific.
> Like there is a PERIOD data type in teradata, it is specific to teradata
> only, So how to map such columns in Hive?
>
> Thanks.
>
>
> On Tue, Aug 26, 2014 at 6:44 AM, Peyman Mohajerian 
> wrote:
>
>> As far as I know you cannot do that, and most likely you don't need it.
>> Here are sample mappings between the two systems:
>>
>>   Teradata                      Hive
>>   DECIMAL(x,y)                  double
>>   DATE, TIMESTAMP               timestamp
>>   INTEGER, SMALLINT, BYTEINT    int
>>   VARCHAR, CHAR                 string
>>   DECIMAL(x,0)                  bigint
>>
>> I would typically stage data in Hadoop as all strings and then move it to
>> Hive managed/ORC with the above mapping.
>>
>>
>>
>>
>> On Mon, Aug 25, 2014 at 8:42 PM, reena upadhyay 
>> wrote:
>>
>>> Hi,
>>>
>>> Is there any way to create custom user defined data type in Hive? I want
>>> to move some table data from teradata database to Hive. But in teradata
>>> database tables, there are few columns data type that are not supported in
>>> Hive. So to map the source table columns to my destination table columns in
>>> Hive, I want to create my own data type in Hive.
>>>
>>> I know about writing UDF's in Hive but have no idea about creating user
>>> defined data type in HIve. Any idea and example on the same would be of
>>> great help.
>>>
>>> Thanks.
>>>
>>
>>
>


Re: how to create custom user defined data type in Hive

2014-08-25 Thread Peyman Mohajerian
As far as I know you cannot do that, and most likely you don't need it. Here
are sample mappings between the two systems:

  Teradata                      Hive
  DECIMAL(x,y)                  double
  DATE, TIMESTAMP               timestamp
  INTEGER, SMALLINT, BYTEINT    int
  VARCHAR, CHAR                 string
  DECIMAL(x,0)                  bigint

I would typically stage data in Hadoop as all strings and then move it to
Hive managed/ORC with the above mapping.
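As a rough sketch of what that looks like for the PERIOD case mentioned earlier
(table and column names are made up, and the exact text format depends on how
you export it from Teradata):

  CREATE TABLE customer_stage (
    cust_id      STRING,
    valid_period STRING  -- Teradata PERIOD landed as plain text
  )
  STORED AS TEXTFILE;

From there you can pull the begin/end values out of the string with
split()/regexp_extract() when moving it into the managed ORC table.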



On Mon, Aug 25, 2014 at 8:42 PM, reena upadhyay  wrote:

> Hi,
>
> Is there any way to create custom user defined data type in Hive? I want
> to move some table data from teradata database to Hive. But in teradata
> database tables, there are few columns data type that are not supported in
> Hive. So to map the source table columns to my destination table columns in
> Hive, I want to create my own data type in Hive.
>
> I know about writing UDF's in Hive but have no idea about creating user
> defined data type in HIve. Any idea and example on the same would be of
> great help.
>
> Thanks.
>


Re: Difference between Hive and HCat table?

2014-08-14 Thread Peyman Mohajerian
Other tools, e.g. Pig, can access HCat and find out what the schema is. At
Teradata we look up the metadata directly from HCat and then read the data
in parallel from HDFS rather than taking the slower route that is HiveServer2.
So HCat is an important tool for vendors who want to import/export data to
Hadoop and don't want to have a direct dependency on Hive.


On Sat, Aug 9, 2014 at 10:04 AM, André Hacker 
wrote:

> Thank you Andrew and Lefty, that helped a lot with clarification.
>
> So the link tells me that, assuming a single metastore, everything done in
> the HCat CLI will be reflected in Hive CLI and vice versa, but there are
> some features exclusively available in Hive CLI and a few others
> exclusively in HCat CLI (only table groups/permissions as far as I can see).
>
> From my user perspective it still looks a bit redundant to distinguish
> these two CLIs. However, I understand that there are reasons to distinguish
> HCat, which is a very generic metadata layer, and Hive, which is one (the
> most popular one) of many engines running on HCat. The fact that HCat is
> bundled with Hive and at the same time separated was always a bit confusing
> to me, so I wanted to see if I missed something.
>
> So thanks again, this section in the documentation was just what I was
> looking for.
>
> André
> Am 05.08.2014 21:32 schrieb "Lefty Leverenz" :
>
>> Perhaps this documentation will help:  HCatalog CLI -- Hive CLI
>> 
>> .
>>
>> Also note the section that follows it, which begins "HCatalog supports
>> all Hive Data Definition Language except those operations that require
>> running a MapReduce job."
>>
>> -- Lefty
>>
>>
>> On Tue, Aug 5, 2014 at 5:00 AM, Andrew Mains 
>> wrote:
>>
>>> André,
>>>
>>> To my knowledge, your understanding is correct--given that both Hive and
>>> HCatalog are pointing to the same metastore instance, all HCatalog table
>>> operations should be
>>> reflected in Hive, and vice versa. You should be able to use the Hive
>>> CLI and hcat interchangeably to execute your DDL.
>>>
>>> Andrew
>>>
>>>
>>> On 8/5/14, 12:23 AM, André Hacker wrote:
>>>
 Hi,

 a very simple question: Is there a difference between a table in Hive
 and a table in HCat?
 In other words: Can I create a table in Hive that is invisible in HCat,
 or vice versa?
 (Assuming that Hive and HCat point to the same metastore)

 From my understanding, HCat is just a wrapper around the Hive
 metastore, so there should be no major difference. This is in line with my
 experience: If I create a table via the Hive CLI, it will be shown in HCat
 too when running hcat -e "show tables;". And vice versa.

 I ask because some online documentation makes me feel that I have to
 run my DDL in HCat to make it visible there. At least I didn't find
 documents that say that I can use either Hive CLI or HCat.

 Thanks,

 André Hacker


>>>
>>


Re: How to strip off double quotes while loading csv

2014-07-12 Thread Peyman Mohajerian
https://github.com/ogrodnek/csv-serde


On Sat, Jul 12, 2014 at 9:17 AM, Sarath P R  wrote:

> Is there any way to strip double quotes while loading csv file into hive
> table ?
>
> I am using Hive 0.12.
>
> --
> Thank You
> Sarath P R
> Contact +91 99 95 02 4287 | Twitter  | Blog
> 
>
>


testing subscription

2014-05-10 Thread Peyman Mohajerian
I have stopped receiving any email from this list!


Re: Compressed Data Column in Hive Table

2014-04-14 Thread Peyman Mohajerian
you can also build a UDF to decompress.


On Mon, Apr 14, 2014 at 5:31 PM, Abdelrahman Shettia <
ashet...@hortonworks.com> wrote:

> Hi Kanna,
>
> It may not be the best option, but you can select the column and insert it
> into a staging temp table.
>
>
> Thanks,
> Rahman
>
> On Apr 8, 2014, at 6:28 PM, Kanna Karanam  wrote:
>
> Hi – One of the columns in my hive table contains compressed data (Json
> document in Gzip format). Is there any recommended way to decompress it and
> explode the json doc?
> Thanks,
> Kanna
>
>
>


Re: get_json_object for nested field returning a String instead of an Array

2014-04-07 Thread Peyman Mohajerian
perhaps: https://github.com/rcongiu/Hive-JSON-Serde
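If you go with that serde, the nested field can be declared with its real type
instead of being pulled out of a string; a rough sketch (the field, partition,
and location come from your mail; the table name and serde class are my
assumptions):

  CREATE EXTERNAL TABLE ext_table_json (field ARRAY<STRUCT<id:INT>>)
  PARTITIONED BY (dt STRING)
  ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
  LOCATION '/users/abc/';

  SELECT f.id FROM ext_table_json LATERAL VIEW explode(field) t AS f;

Since 'field' is already an array of structs, explode works on it directly.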


On Mon, Apr 7, 2014 at 6:52 PM, Narayanan K  wrote:

> Hi all
>
> I am using get_json_object to read a json text file. I have created
> the external table as below :
>
> CREATE EXTERNAL TABLE EXT_TABLE ( json string)
> PARTITIONED BY (dt string)
> LOCATION '/users/abc/';
>
>
> The json data has some fields that are not simple fields but fields
> which are nested fields like -  "field" : [{"id":1},{"id":2}.. ].
>
> While using the get_json_object to retrieve that field, it is
> returning back a string instead of an Array. Hence I am not able to
> explode the array as it is a string.
>
> Is there some way we can get an array of get_json_object instead of a
> string so that we can perform explode on this nested field ? or Anyway
> we can convert the string into an array so that I can use explode ?
>
> Thanks in advance,
> Narayanan
>


Re: UDF reflect

2014-04-03 Thread Peyman Mohajerian
Maybe your intention is the following:
reflect("java.util.UUID", "randomUUID")
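That variant calls a static method with no arguments, so there is nothing to
pass in:

  SELECT reflect("java.util.UUID", "randomUUID") FROM my_table LIMIT 1;

my_table is just a placeholder. If the goal is only an integer hash of the
existing uid string, the built-in hash(uid_str) may be enough without going
through reflection at all.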


On Thu, Apr 3, 2014 at 2:33 AM, Szehon Ho  wrote:

> Hi, according to the description of the reflect UDF, you are trying to
> call java.util.UUID.hashcode(uidString), which doesnt seem to be an
> existing method on either java 6/7.
>
> http://docs.oracle.com/javase/7/docs/api/java/util/UUID.html#hashCode()
>
> Thanks
> Szehon
>
>
>
>
> On Wed, Apr 2, 2014 at 2:13 PM, Andy Srine  wrote:
>
>> Hi guys,
>>
>>
>> I am trying to use the reflect UDF for an UUID method and am getting an
>> exception. I believe this function should be available in java 1.6.0_31 the
>> system is running.
>>
>>
>> select reflect("java.util.UUID", "hashCode", uid_str) my_uid,
>>
>> ...
>>
>>
>> My suspicion is, this is because the hive column I am calling this on is
>> a string and not an UUID. So I nested the reflects as shown below to go
>> from a string to an UUID first and then to "hashCode" it.
>>
>>
>> reflect("java.util.UUID", "hashCode", reflect("java.util.UUID",
>> "fromString", uid_str)) my_uid,
>>
>>
>> In either case, I always get the exception below though the row of data
>> it prints has no null for the uid_str column. Any ideas?
>>
>>
>>  at
>> org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:565)
>>
>> at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:143)
>>
>> ... 8 more
>>
>> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: UDFReflect
>> getMethod
>>
>> at
>> org.apache.hadoop.hive.ql.udf.generic.GenericUDFReflect.evaluate(GenericUDFReflect.java:164)
>>
>> at
>> org.apache.hadoop.hive.ql.exec.ExprNodeGenericFuncEvaluator.evaluate(ExprNodeGenericFuncEvaluator.java:163)
>>
>> at
>> org.apache.hadoop.hive.ql.exec.KeyWrapperFactory$ListKeyWrapper.getNewKey(KeyWrapperFactory.java:113)
>>
>> at
>> org.apache.hadoop.hive.ql.exec.GroupByOperator.processOp(GroupByOperator.java:794)
>>
>> at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:474)
>>
>> at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:800)
>>
>> at
>> org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:84)
>>
>> at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:474)
>>
>> at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:800)
>>
>> at
>> org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:83)
>>
>> at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:474)
>>
>> at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:800)
>>
>> at
>> org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:548)
>>
>> ... 9 more
>>
>> Caused by: java.lang.NoSuchMethodException: java.util.UUID.hashCode(null)
>>
>> at java.lang.Class.getMethod(Class.java:1605)
>>
>> at
>> org.apache.hadoop.hive.ql.udf.generic.GenericUDFReflect.evaluate(GenericUDFReflect.java:160)
>>
>>
>> Thanks,
>>
>> Andy
>>
>>
>>
>


Re: disable internal tables

2014-01-30 Thread Peyman Mohajerian
This is a known issue; it will still write something at '/apps/hive/warehouse'.
It's best to assign a common group to your hive and hdfs users and assign
that group to both of these directories. I heard this issue is fixed in 0.12
or 0.13; others can confirm.


On Thu, Jan 30, 2014 at 8:27 AM, Alex Nastetsky wrote:

> Hi,
>
> I am trying to enforce all Hive tables to be created with EXTERNAL. The
> way I am doing this is by making the location of the warehouse
> (/apps/hive/warehouse in my case) to have permissions 000 (completely
> inaccessible).
>
> But then when I try to create an external table, I see that it still tries
> to write to /apps/hive/warehouse and, of course, fails:
>
> hive> CREATE EXTERNAL TABLE mytable(id INT, name STRING) ROW FORMAT
> DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS
> TEXTFILE LOCATION '/user/anastetsky/warehouse';
> Authorization failed:java.security.AccessControlException: action WRITE
> not permitted on path hdfs://:8020/apps/hive/warehouse for user
> anastetsky. Use show grant to get more details.
>
> What am I missing? Or is there a better way to enforce tables to be
> EXTERNAL?
>
> Thanks in advance,
> Alex.
>


Re: One information about the Hive

2014-01-13 Thread Peyman Mohajerian
I don't work for IBM, but found their training material helpful:
http://bigdatauniversity.com

There is a bit of a bias toward IBM's stack, but they do a good job of
teaching Hive in general.


On Mon, Jan 13, 2014 at 3:01 AM, Nitin Pawar wrote:

> The best way to answer your queries is,
>
> 1) set up a single node hadoop VM (there are readily available images from
> hortonworks and cloudera)
> 2) try to load data and see where it is stored (hive is a data access
> framework .. it does not store any data, information related to data is
> stored in metastore .. mainly hcatalog)
> 3) With hive its just writing queries and doing numbers, there are lot of
> file formats which do better with different kind of workloads.
>
> If you have basic understanding of hive and tried few queries you will
> find that hive is not a stand alone system (for now). It has hadoop
> mapreduce1 and hdfs then it has metastore then it has hive framework.
>
> You will need to understand bit more of hdfs as well.
>
> to answer your queries
>
> how the hive will connect with hadoop cluster,
>
> .. when you setup hive you can point it to a hadoop cluster or you can
> change these properties at table level.
>
>
> how the hive will  get the request,
>
> .. not sure what you mean by request .. if you mean the query then there
> are ways like hive cli (as I am aware development on this is getting less),
> then there are clients like beeline and then u have options of jdbc
> connections etc
>
>
> how the hive will process the request,
> .. how converts your query into an optimal mapreduce program and processes
> the data using that mapreduce program. How to convert a sql query to
> mapreduce program, you can look at ysmart framework from ohio university .
>
> after analysis ,where the analyzed data will be stored for further
> decision making
> .. hive does store any data automatically. You have to specifically
> mention where you want to save the data. a table or a file or something
> like that.
>
>
> On Mon, Jan 13, 2014 at 4:14 PM, Vikas Parashar wrote:
>
>> Thanks Prashant, Definitely i shall go through that if needed. But from
>>  my experience, what i have faced is that user will have some integration
>> problem with HADOOP 2.
>>
>>
>>  Hi Vikas
>>>
>>>  Welcome to the world of Hive !
>>>
>>>  The first book u should read is by Capriolo , Wampler, Rutherglen
>>> Programming Hive
>>> http://www.amazon.com/Programming-Hive-Edward-Capriolo/dp/1449319335
>>>
>>>  This is a must read. I have immensely benefited from this book and the
>>> hive user group (the group is kickass).
>>>
>>>  If u r not sure of the details of HDFS/Hadoop then the Hadoop
>>> Definitive Guide (Tom White) is a must read.
>>> My view would be u should know both very well eventually...
>>>
>>
>>
>>
>>>  I have setup Hadoop and Hive cluster in three ways
>>> [1] manually thru tarballs (lightweight but u need to know what u r
>>> installing and where)
>>> [2] CDH & Cloudera manager (heavyweight but it does things in the
>>> backgroundeasy to install and quick to setup on a sandbox and
>>> learn)...Plus Beeswax is s great starter UI for Hive queries
>>> [3] Using Amazon EMR Hive (I realize this is the easiest and the fastest
>>> to setup to learn Hive)
>>>
>>>  My suggestion , Don't go for option [1] - u learn a lot there but it
>>> could take time and u might feel frustrated as well
>>>
>>> using option [2] above , then I suggest
>>> - 1 or 2 boxes - i7 quad core (or u can use a 8 core AMD FX 8300) with
>>> 16-32GB RAM
>>> - download and install Cloudera manager
>>>
>>>  If u don't have access to box(es) to install hadoop/hive then the
>>> cheapest way  to learn is by using Amazon EMR
>>> - First create a S3 bucket and a folder to store a data file called
>>> songs.txt
>>>
>>>1,2,lennon,john,nowhere man
>>>   1,3,lennon,john,strawberry fields forever
>>>   2,1,mccartney,paul,penny lane
>>>   2,2,mccartney,paul,michelle
>>>   2,3,mccartney,paul,yesterday
>>>3,1,harrison,george,while my guitar gently weeps
>>> 3,2,harrison,george,i want to tell you
>>>3,3,harrison,george,think for yourself
>>>3,4,harrison,george,something
>>> 4,1,starr,ringo,octopuss garden
>>> 4,2,starr,ringo,with a liitle help from my friends
>>>
>>>  - Create a key pair from the AWS console and save the private key on
>>> your local desktop
>>>
>>>  - Create a EMR cluster with Hive installed
>>>
>>>  - ssh -i /path/on/your/desktop/to/amazonkeypair.pem   hadoop@
>>> .compute.amazonaws.com
>>>
>>>  - One the linux prompt
>>>-->   hive -e "CREATE EXTERNAL TABLE IF NOT EXISTS songs(id INT,
>>> SEQID INT, LASTNAME STRING, FIRSTNAME STRING, SONGNAME STRING) ROW FORMAT
>>> DELIMITED FIELDS TERMINATED BY ',' "
>>>   --> hive -e "select songname from songs where lastname='lennon' OR
>>> lastname = 'harrison'"
>>>
>>>  Hope this helps
>>>
>>>  Hive on !!!
>>>
>>>  sanjay
>>>
>>>
>>>
>>>
>>>
>>>
>>>id,seq,lastname,firstname,songname
>>>
>>>
>>>
>>>
>>>   

Re: Help on loading data stream to hive table.

2014-01-07 Thread Peyman Mohajerian
You may find summingbird relevant, I'm still investigating it:
https://blog.twitter.com/2013/streaming-mapreduce-with-summingbird


On Tue, Jan 7, 2014 at 11:39 AM, Alan Gates  wrote:

> I am not wise enough in the ways of Storm to tell you how you should
> partition data across bolts.  However, there is no need in Hive for all
> data for a partition to be in the same file, only in the same directory.
>  So if each bolt creates a file for each partition and then all those files
> are placed in one directory and loaded into Hive it will work.
>
> Alan.
>
> On Jan 6, 2014, at 6:26 PM, Chen Wang  wrote:
>
> > Alan,
> > the problem is that the data is partitioned by epoch ten hourly, and i
> want all data belong to that partition to be written into one file named
> with that partition. How can i share the file writer across different bolt?
> should I instruct data within the same partition to the same bolt?
> > Thanks,
> > Chen
> >
> >
> > On Fri, Jan 3, 2014 at 3:27 PM, Alan Gates 
> wrote:
> > You shouldn’t need to write each record to a separate file.  Each Storm
> bolt should be able to write to it’s own file, appending records as it
> goes.  As long as you only have one writer per file this should be fine.
>  You can then close the files every 15 minutes (or whatever works for you)
> and have a separate job that creates a new partition in your Hive table
> with the files created by your bolts.
> >
> > Alan.
> >
> > On Jan 2, 2014, at 11:58 AM, Chen Wang 
> wrote:
> >
> >> Guys,
> >> I am using storm to read data stream from our socket server, entry by
> entry, and then write them to file: one entry per file.  At some point, i
> need to import the data into my hive table. There are several approaches i
> could think of:
> >> 1. directly write to hive hdfs file whenever I get the entry(from our
> socket server). The problem is that this could be very inefficient,  since
> we have huge amount of data stream, and I would not want to write to hive
> hdfs one by one.
> >> Or
> >> 2 i can write the entries to files(normal file or hdfs file) on the
> disk, and then have a separate job to merge those small files into big one,
> and then load them into hive table.
> >> The problem with this is, a) how can I merge small files into big files
> for hive? b) what is the best file size to upload to hive?
> >>
> >> I am seeking advice on both approaches, and appreciate your insight.
> >> Thanks,
> >> Chen
> >>
> >
> >
> >
>
>
>