Re: hive will die or not?

2016-08-07 Thread Marcin Tustin
I think that's right. My testing (not very scientific) puts it on par with
Redshift for the datasets I use.

On Sunday, August 7, 2016, Edward Capriolo  wrote:

> A few entities have set out to "kill/take out/be better than" Hive.
> I seem to remember HadoopDB, Impala, Redshift, VoltDB...
>
> But apparently Hive is still around and probably faster:
> http://www.slideshare.net/hortonworks/hive-on-spark-is-
> blazing-fast-or-is-it-final
>
>
>
>
> On Sun, Aug 7, 2016 at 9:49 PM, 理  > wrote:
>
>> In my opinion, having multiple engines is not an advantage but the
>> reverse: it disperses the development effort.
>> Consider the activity: Spark SQL supports all of TPC-DS without modifying
>> the syntax, but Hive cannot. Consider the technology: DAG execution,
>> vectorization, etc. Spark SQL has these as well, and its code seems more
>> efficient.
>>
>>
>> regards
>> On 08/08/2016 08:48, Will Du
>>  wrote:
>>
>> First, Hive supports different engines; look forward to its dynamic
>> engine switching.
>> Second, look forward to third-generation Hadoop; MapReduce in memory will
>> fill the gap.
>>
>> Thanks,
>> Will
>>
>> On 2016年8月7日, at 20:27, 理 > > wrote:
>>
>> hi,
>>   Spark SQL is improving so fast, and Hive and Spark SQL are similar, so
>> will Hive be lost or not?
>>
>> regards
>>
>>
>>
>>
>>
>>
>




Re: Create Non-partitioned table from partitioned table using CREATE TABLE .. LIKE

2016-08-07 Thread Marcin Tustin
Yes, but a CREATE TABLE unpartitioned AS SELECT * FROM partitioned will
create an unpartitioned table with all the data in the partitioned table. It
won't lose the partition column, but nowhere do I see a need for that
column to be removed.
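
For concreteness, a minimal sketch (table names hypothetical; note that CTAS
creates a managed, unpartitioned table):

-- the old partition columns simply become ordinary columns
CREATE TABLE sales_unpart
  STORED AS ORC
  AS SELECT * FROM sales_part;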

On Sun, Aug 7, 2016 at 9:25 AM, Mich Talebzadeh 
wrote:

> Hi Marcin,
>
> The thread owner's question was:
>
> "Hi I've a scenario where I need to create a table from partitioned table
> but my destination table should not be partitioned. I won't be knowing the
> schema so I cannot create manually the destination table. By the way both
> tables are external tables."
>
> This can be easily achieved through Spark by reading the Hive external
> table (assuming that the thread owner knows its name and the Hive database
> name :)) into a DF
>
> The DF will display all the column names, and a filter on it can get rid of
> the partition columns.
>
> A new table can be created without those two columns and of course will not
> be partitioned.
>
>  HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 7 August 2016 at 13:17, Marcin Tustin  wrote:
>
>> Will CREATE TABLE sales5 AS SELECT * FROM SALES; not work for you?
>>
>> On Thu, Aug 4, 2016 at 5:05 PM, Nagabhushanam Bheemisetty <
>> nbheemise...@gmail.com> wrote:
>>
>>> Hi, I have a scenario where I need to create a table from a partitioned
>>> table, but my destination table should not be partitioned. I won't know the
>>> schema, so I cannot create the destination table manually. By the way, both
>>> tables are external tables.
>>>
>>
>>
>>
>>
>




Re: Create Non-partitioned table from partitioned table using CREATE TABLE .. LIKE

2016-08-07 Thread Marcin Tustin
Why exclude the partition column? The new table will still be unpartitioned
whether or not the partition column is excluded.
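
If the partition columns really do need to go, listing the remaining columns
explicitly in the CTAS is enough (column names below are hypothetical):

CREATE TABLE sales_unpart
  STORED AS ORC
  AS SELECT prod_id, cust_id, amount_sold  -- everything except the partition columns
  FROM sales_part;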

On Sun, Aug 7, 2016 at 8:23 AM, Markovitz, Dudu 
wrote:

> It won’t help him since ‘*’ represents all columns, including the partition
> columns, which he wants to exclude.
>
>
>
> Dudu
>
>
>
> *From:* Marcin Tustin [mailto:mtus...@handybook.com]
> *Sent:* Sunday, August 07, 2016 3:17 PM
> *To:* user@hive.apache.org
> *Subject:* Re: Crate Non-partitioned table from partitioned table using
> CREATE TABLE .. LIKE
>
>
>
> Will CREATE TABLE sales5 AS SELECT * FROM SALES; not work for you?
>
>
>
> On Thu, Aug 4, 2016 at 5:05 PM, Nagabhushanam Bheemisetty <
> nbheemise...@gmail.com> wrote:
>
> Hi, I have a scenario where I need to create a table from a partitioned
> table, but my destination table should not be partitioned. I won't know the
> schema, so I cannot create the destination table manually. By the way, both
> tables are external tables.
>
>
>
>
>
>
>
>
>




Re: Create Non-partitioned table from partitioned table using CREATE TABLE .. LIKE

2016-08-07 Thread Marcin Tustin
Will CREATE TABLE sales5 AS SELECT * FROM SALES; not work for you?

On Thu, Aug 4, 2016 at 5:05 PM, Nagabhushanam Bheemisetty <
nbheemise...@gmail.com> wrote:

> Hi, I have a scenario where I need to create a table from a partitioned
> table, but my destination table should not be partitioned. I won't know the
> schema, so I cannot create the destination table manually. By the way, both
> tables are external tables.
>




Re: Create table from orc file

2016-08-03 Thread Marcin Tustin
Correct, you need to specify the columns. If you created the file, I assume
you have a record of them.

Someone more familiar with the hive code will have to comment on the
exceptions.

On Wednesday, August 3, 2016, Johannes Stamminger <
johannes.stammin...@airbus.com> wrote:

> But doing so, I assume it does not detect the columns on its own; I have to
> specify them manually - or am I wrong? The ORC file I finally want to work
> with contains ~28000 columns (513MB size, ~50 rows, 3 structs with 2 of
> them containing ~14000 fields each) ...
>
> The Hive documentation for the CREATE TABLE statement shows the columns
> part as optional. In fact it seems required; at least I found no way to
> avoid it.
>
>
> For testing purposes I started with a smaller one and found two ways of
> bringing the data to hive. Unfortunately I actually fail on accessing it:
>
>
> a) create external table:
>
> Succeeding statement:
>
> create external table if not exists CFA1_Fan_Speed_DMC(record
> struct) stored as ORC location
> '...';
>
> with the specified location containing my existing ORC file, named exactly
> like the table, CFA1_Fan_Speed_DMC.
>
> But every selection for data results in:
>
> Error: java.io.IOException: java.lang.RuntimeException: Char length 256
> out of
> allowed range [1, 255] (state=,code=0)
>
> Tried with:
>  - select * from CFA1_Fan_Speed_DMC;
>  - select record from CFA1_Fan_Speed_DMC;
>  - select record.normalizedTime from CFA1_Fan_Speed_DMC;
>
>
> b) create table and load from file
>
> Succeeding statements:
>
> create table cfa1(record
> struct)
> stored as orc;
>
> load data inpath '.../CFA1_Fan_Speed_DMC' into table cfa1;
>
> Same statements for querying as above (of course using the different table
> name) still fail, but now with:
>
> Error: java.io.IOException: java.io.IOException: ORC does not support type
> conversion from file type bigint (1) to reader type
> struct (1) (state=,code=0)
>
>
>
> So what is wrong with the above?
>
>
> I should mention that I created the ORC files using the latest orc-core
> lib (1.1.2). That seems not to be the same implementation for ORC file
> access as the one used in Hive.
>
>
> Thanks for all hints!
>
>
>
> Am Mittwoch, 3. August 2016, 08:45:45 CEST schrieb Marcin Tustin:
> > Yes. Create an external table whose location contains only the orc
> file(s)
> > you want to include in the table.
> >
> > On Wed, Aug 3, 2016 at 7:53 AM, Johannes Stamminger <
> >
> > johannes.stammin...@airbus.com > wrote:
> > > Hi,
> > >
> > >
> > > is it possible to write data to an orc file(s) using the hive-orc api
> and
> > > to
> > > use such by hive (create a table from it)?
> > >
> > >
> > > Regards
> > > This email (including any attachments) may contain confidential and/or
> > > privileged information or information otherwise protected from
> disclosure.
> > > If you are not the intended recipient, please notify the sender
> > > immediately, do not copy this message or any attachments and do not
> use it
> > > for any purpose or disclose its content to any person, but delete this
> > > message and any attachments from your system. Astrium and Airbus Group
> > > companies disclaim any and all liability if this email transmission was
> > > virus corrupted, altered or falsified.
> > > -
> > > Airbus DS GmbH
> > > Vorsitzender des Aufsichtsrates: Bernhard Gerwert
> > > Geschäftsführung: Evert Dudok (Vorsitzender), Dr. Lars Immisch, Dr.
> > > Michael Menking, Dr. Johannes von Thadden
> > > Sitz der Gesellschaft: München - Registergericht: Amtsgericht München,
> HRB
> > > Nr. 107 647
> > > Ust. Ident. Nr. /VAT reg. no. DE167015356
>
>
> --
>johannes.stammin...@airbus.com  [2FE783D0 http://wwwkeys.PGP.net]
> -- <--{(@ --  AIRBUS Defence & Space
> Koenigsberger Str. 17, 28857 Barrien Ground SW Eng. & Del. (TSOTC 6)
> +49 4242 169582 (Tel + FAX) Airbus Allee 1, 28199 Bremen
> +49 174 7731593 (Mobile) +49 421 539 4152 (Tel) / 4378 (FAX)
>
> This email (including any attachments) may contain confidential and/or
> privileged information or information otherwise protected from disclosure.
> If you are not the intended recipient, please notify the sender
> immediately, do not copy this message or any attachments and do not use it
> for any purpose or disclose i

Re: Create table from orc file

2016-08-03 Thread Marcin Tustin
Yes. Create an external table whose location contains only the orc file(s)
you want to include in the table.
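
A minimal sketch of what that looks like, with a hypothetical schema and
directory (the directory should hold only the ORC files for this table):

CREATE EXTERNAL TABLE fan_speed (
  record STRUCT<normalizedTime: BIGINT, reading: DOUBLE>)
STORED AS ORC
LOCATION 'hdfs:///data/orc/fan_speed/';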

On Wed, Aug 3, 2016 at 7:53 AM, Johannes Stamminger <
johannes.stammin...@airbus.com> wrote:

> Hi,
>
>
> is it possible to write data to an orc file(s) using the hive-orc api and
> to
> use such by hive (create a table from it)?
>
>
> Regards
> This email (including any attachments) may contain confidential and/or
> privileged information or information otherwise protected from disclosure.
> If you are not the intended recipient, please notify the sender
> immediately, do not copy this message or any attachments and do not use it
> for any purpose or disclose its content to any person, but delete this
> message and any attachments from your system. Astrium and Airbus Group
> companies disclaim any and all liability if this email transmission was
> virus corrupted, altered or falsified.
> -
> Airbus DS GmbH
> Vorsitzender des Aufsichtsrates: Bernhard Gerwert
> Geschäftsführung: Evert Dudok (Vorsitzender), Dr. Lars Immisch, Dr.
> Michael Menking, Dr. Johannes von Thadden
> Sitz der Gesellschaft: München - Registergericht: Amtsgericht München, HRB
> Nr. 107 647
> Ust. Ident. Nr. /VAT reg. no. DE167015356




Re: A dedicated Web UI interface for Hive

2016-07-15 Thread Marcin Tustin
I was thinking of query and admin interfaces.

There's Ambari, which has plugins for introspecting what's up with Tez
sessions. I can't use those because I don't use the YARN history server (I
find it very flaky).

There's also Hue, which is a query interface.

If you're running on Spark as the execution engine, can you not use the
Spark UI for those applications to see what's up with Hive?

On Fri, Jul 15, 2016 at 3:19 AM, Mich Talebzadeh 
wrote:

> Hi Marcin,
>
> Which two web interfaces are these? I know the usual one on port 8088; any
> other one?
>
> I want something in line with what Spark provides. I thought Gopal has got
> something:
>
> [image: Inline images 1]
>
>
> Cheers
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 14 July 2016 at 23:29, Marcin Tustin  wrote:
>
>> What do you want it to do? There are at least two web interfaces I can
>> think of.
>>
>> On Thu, Jul 14, 2016 at 6:04 PM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Hi Gopal,
>>>
>>> If I recall you were working on a UI support for Hive. Currently the one
>>> available is the standard Hadoop one on port 8088.
>>>
>>> Do you have any timelines which release of Hive is going to have this
>>> facility?
>>>
>>> Thanks,
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>
>>
>>
>>
>




Re: A dedicated Web UI interface for Hive

2016-07-14 Thread Marcin Tustin
What do you want it to do? There are at least two web interfaces I can
think of.

On Thu, Jul 14, 2016 at 6:04 PM, Mich Talebzadeh 
wrote:

> Hi Gopal,
>
> If I recall you were working on a UI support for Hive. Currently the one
> available is the standard Hadoop one on port 8088.
>
> Do you have any timelines which release of Hive is going to have this
> facility?
>
> Thanks,
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>




Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-12 Thread Marcin Tustin
More like 2x than 10x as I recall.

On Tue, Jul 12, 2016 at 9:39 AM, Mich Talebzadeh 
wrote:

> thanks Marcin.
>
> What Is your guesstimate on the order of "faster" please?
>
> Cheers
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 12 July 2016 at 14:35, Marcin Tustin  wrote:
>
>> Quick note - my experience (no benchmarks) is that Tez without LLAP
>> (we're still not on hive 2) is faster than MR by some way. I haven't dug
>> into why that might be.
>>
>> On Tue, Jul 12, 2016 at 9:19 AM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Sorry, I completely missed your points.
>>>
>>> I was NOT talking about Exadata. I was comparing Oracle 12c caching with
>>> that of Oracle TimesTen. No one mentioned Exadata here, nor storage
>>> indexes, etc.
>>>
>>>
>>> So if Tez is not MR with DAG, could you give me an example of how it
>>> works? No opinions, but relevant to this point. I do not know much about
>>> Tez, as I stated before.
>>>
>>> Case in point: if Tez could do the job on its own, why is Tez used in
>>> conjunction with LLAP, as Martin alluded to as well in this thread?
>>>
>>>
>>> Having said that, I would be interested if you could provide a working
>>> example of Hive on Tez compared to Hive on MR.
>>>
>>> One experiment is worth hundreds of opinions
>>>
>>>
>>>
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 12 July 2016 at 13:31, Jörn Franke  wrote:
>>>
>>>>
>>>> I think the comparison with Oracle rdbms and oracle times ten is not so
>>>> good. There are times when the in-memory database of Oracle is slower than
>>>> the rdbms (especially in case of Exadata) due to the issue that in-memory -
>>>> as in Spark - means everything is in memory and everything is always
>>>> processed (no storage indexes , no bloom filters etc) which explains this
>>>> behavior quite well.
>>>>
>>>> Hence, I do not agree with the statement that tez is basically mr with
>>>> dag (or that llap is basically in-memory which is also not correct). This
>>>> is a wrong oversimplification and I do not think this is useful for the
>>>> community, but better is to understand when something can be used and when
>>>> not. In-memory is also not the solution to everything and if you look for
>>>> example behind SAP Hana or NoSql there is much more around this, which is
>>>> not even on the roadmap of Spark.
>>>>
>>>> Anyway, discovering good use case patterns should be done on
>>>> standardized benchmarks going beyond the select count etc
>>>>
>>>> On 12 Jul 2016, at 11:16, Mich Talebzadeh 
>>>> wrote:
>>>>
>>>> That is only a plan not what execution engine is doing.
>>>>
>>>> As I stated before Spark uses DAG + in-memory computing. MR is serial
>>>> on disk.
>>>>
>>>> The key is the execution here or rather the execution engine.
>>>>
>>>> In general
>>>>
>>>> The standard MapReduce  as I know reads the data from 

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-12 Thread Marcin Tustin
Quick note - my experience (no benchmarks) is that Tez without LLAP (we're
still not on hive 2) is faster than MR by some way. I haven't dug into why
that might be.
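
For anyone who wants to reproduce the comparison, the engine can be switched
per session (assuming Tez is installed on the cluster):

set hive.execution.engine=mr;   -- run the query on classic MapReduce
set hive.execution.engine=tez;  -- run the same query again on Tez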

On Tue, Jul 12, 2016 at 9:19 AM, Mich Talebzadeh 
wrote:

> Sorry, I completely missed your points.
>
> I was NOT talking about Exadata. I was comparing Oracle 12c caching with
> that of Oracle TimesTen. No one mentioned Exadata here, nor storage
> indexes, etc.
>
>
> So if Tez is not MR with DAG, could you give me an example of how it works?
> No opinions, but relevant to this point. I do not know much about Tez, as I
> stated before.
>
> Case in point: if Tez could do the job on its own, why is Tez used in
> conjunction with LLAP, as Martin alluded to as well in this thread?
>
>
> Having said that, I would be interested if you could provide a working
> example of Hive on Tez compared to Hive on MR.
>
> One experiment is worth hundreds of opinions
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 12 July 2016 at 13:31, Jörn Franke  wrote:
>
>>
>> I think the comparison with Oracle rdbms and oracle times ten is not so
>> good. There are times when the in-memory database of Oracle is slower than
>> the rdbms (especially in case of Exadata) due to the issue that in-memory -
>> as in Spark - means everything is in memory and everything is always
>> processed (no storage indexes , no bloom filters etc) which explains this
>> behavior quite well.
>>
>> Hence, I do not agree with the statement that tez is basically mr with
>> dag (or that llap is basically in-memory which is also not correct). This
>> is a wrong oversimplification and I do not think this is useful for the
>> community, but better is to understand when something can be used and when
>> not. In-memory is also not the solution to everything and if you look for
>> example behind SAP Hana or NoSql there is much more around this, which is
>> not even on the roadmap of Spark.
>>
>> Anyway, discovering good use case patterns should be done on standardized
>> benchmarks going beyond the select count etc
>>
>> On 12 Jul 2016, at 11:16, Mich Talebzadeh 
>> wrote:
>>
>> That is only a plan not what execution engine is doing.
>>
>> As I stated before Spark uses DAG + in-memory computing. MR is serial on
>> disk.
>>
>> The key is the execution here or rather the execution engine.
>>
>> In general
>>
>> The standard MapReduce, as I know it, reads the data from HDFS, applies the
>> map-reduce algorithm and writes back to HDFS. If there are many iterations
>> of map-reduce, then there will be many intermediate writes to HDFS. These
>> are all serial writes to disk. Each map-reduce step is completely
>> independent of other steps, and the executing engine does not have any
>> global knowledge of what map-reduce steps are going to come after each
>> map-reduce step. For many iterative algorithms this is inefficient, as the
>> data between each map-reduce pair gets written to and read from the file
>> system.
>>
>> The equivalent of parallelism in Big Data is deploying what is known as a
>> Directed Acyclic Graph (DAG) algorithm. In a nutshell, deploying a DAG
>> results in a fuller picture of global optimisation by deploying
>> parallelism, pipelining consecutive map steps into one, and not writing
>> intermediate data to HDFS. In short, this prevents writing data back and
>> forth after every reduce step, which for me is a significant improvement
>> compared to the classical MapReduce algorithm.
>>
>> Now Tez is basically MR with DAG. With Spark you get DAG + in-memory
>> computing. Think of it as a comparison between a classic RDBMS like Oracle
>> and IMDB like Oracle TimesTen with in-memory processing.
>>
>> The outcome is that Hive using Spark as execution engine is pretty
>> impressive. You have the advantage of Hive CBO + In-memory computing. If
>> you use Spark for all this (say Spark SQL) but no Hive, Spark uses its own
>> optimizer called Catalyst that does not have CBO yet plus in memory
>> computing.
>>
>> As usual your mileage varies.
>>
>> HTH
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage

Re: loading in ORC from big compressed file

2016-06-21 Thread Marcin Tustin
This is because a GZ file is not splittable at all. Basically, try creating
this from an uncompressed file or, even better, split up the file and put
the pieces in a directory in HDFS/S3/whatever.
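
A rough sketch of the Hive side of that, with hypothetical columns and paths
(decompressing and splitting the .gz itself has to happen outside Hive,
before the pieces are uploaded):

CREATE EXTERNAL TABLE stage_my_table (
  id BIGINT,
  payload STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/var/lib/txt/my_table_split/';

-- with many splittable files under that directory, this insert gets many mappers
INSERT INTO TABLE my_table SELECT * FROM stage_my_table;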

On Tue, Jun 21, 2016 at 7:45 PM, @Sanjiv Singh 
wrote:

> Hi ,
>
> I have big compressed data file *my_table.dat.gz* ( approx size 100 GB)
>
> # load staging table *STAGE_**my_table* from file *my_table.dat.gz*
>
> HIVE>> LOAD DATA  INPATH '/var/lib/txt/*my_table.dat.gz*' OVERWRITE INTO
> TABLE STAGE_my_table ;
>
> *# insert into ORC table "my_table"*
>
> HIVE>> INSERT INTO TABLE my_table SELECT * FROM TXT_my_table;
> 
> INFO  : Map 1: 0(+1)/1  Reducer 2: 0/1
> 
>
>
> Insertion into the ORC table has been going on for 5-6 hours. Everything
> seems to be going sequentially, with one mapper reading the complete file.
>
> Please suggest how I can improve the ORC table load.
>
>
>
>
> Regards
> Sanjiv Singh
> Mob :  +091 9990-447-339
>




Where are jars stored for permanent functions

2016-06-08 Thread Marcin Tustin
Hi All,

I just added local jars to my Hive session, created permanent functions,
and found that they are available across sessions and machines. This is of
course excellent, but I'm wondering where those jars are being stored. In
what setting or default directory would I find them?

My session was:

add jars /mnt/storage/spatial-sdk-hive-1.1.jar
/mnt/storage/esri-geometry-api-1.2.1.jar;

create function ST_GeomFromWKT as 'com.esri.hadoop.hive.ST_GeomFromWKT';


Then that function was available via the thriftserver.
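
For comparison, the alternative I know of is to put the jars on HDFS myself
and reference them explicitly in the function definition, which records the
jar URIs in the metastore (paths below are hypothetical):

CREATE FUNCTION ST_GeomFromWKT AS 'com.esri.hadoop.hive.ST_GeomFromWKT'
  USING JAR 'hdfs:///user/hive/udfs/spatial-sdk-hive-1.1.jar',
        JAR 'hdfs:///user/hive/udfs/esri-geometry-api-1.2.1.jar';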


Thanks,

Marcin




Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-30 Thread Marcin Tustin
Mich - it sounds like maybe you should try these benchmarks with Alluxio
abstracting the storage layer, and see how much of a difference it makes.
Alluxio should (if I understand it right) provide a lot of the optimisation
you're looking for with in-memory work.

I've never used it, but I would love to hear the experiences of people who
have.

On Mon, May 30, 2016 at 5:32 PM, Mich Talebzadeh 
wrote:

> I think we are going to move to a model that the computation stack will be
> separate from storage stack and moreover something like Hive that provides
> the means for persistent storage (well HDFS is the one that stores all the
> data) will have an in-memory type capability much like what Oracle TimesTen
> IMDB does with its big brother Oracle. Now TimesTen is effectively designed
> to provide in-memory capability for analytics for Oracle 12c. These two work 
> like
> an index or materialized view.  You write queries against tables -
> optimizer figures out whether to use row oriented storage and indexes to
> access (Oracle classic) or column non-indexed storage to answer (TimesTen).
> just one optimizer.
>
> I gather Hive will be like that eventually. It will decide based on the
> frequency of access where to look for data. Yes, we may have 10 TB of data
> on disk, but how much of it is frequently accessed (hot data)? The 80-20
> rule? In reality it may be just 2TB, or the most recent partitions, etc.
> The rest is cold data.
>
> cheers
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 30 May 2016 at 21:59, Michael Segel  wrote:
>
>> And you have MapR supporting Apache Drill.
>>
>> So these are all alternatives to Spark, and its not necessarily an either
>> or scenario. You can have both.
>>
>> On May 30, 2016, at 12:49 PM, Mich Talebzadeh 
>> wrote:
>>
>> Yep, Hortonworks supports Tez for one reason or another, and I am hopefully
>> going to test it as the query engine for Hive, though I think Spark will be
>> faster because of its in-memory support.
>>
>> Also, if you are independent, then you are better off dealing with Spark
>> and Hive without the need to support another stack like Tez.
>>
>> Cloudera supports Impala instead of Hive, but it is not something I have
>> used.
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 30 May 2016 at 20:19, Michael Segel  wrote:
>>
>>> Mich,
>>>
>>> Most people use vendor releases because they need to have the support.
>>> Hortonworks is the vendor who has the most skin in the game when it
>>> comes to Tez.
>>>
>>> If memory serves, Tez isn’t going to be M/R but a local execution
>>> engine? Then LLAP is the in-memory piece to speed up Tez?
>>>
>>> HTH
>>>
>>> -Mike
>>>
>>> On May 29, 2016, at 1:35 PM, Mich Talebzadeh 
>>> wrote:
>>>
>>> Thanks. I think the problem is that the Tez user group is exceptionally
>>> quiet. I just sent an email to the Hive user group to see if anyone has
>>> managed to build a vendor-independent version.
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 29 May 2016 at 21:23, Jörn Franke  wrote:
>>>
 Well I think it is different from MR. It has some optimizations which
 you do not find in MR. Especially the LLAP option in Hive2 makes it
 interesting.

 I think hive 1.2 works with 0.7 and 2.0 with 0.8 . At least for 1.2 it
 is integrated in the Hortonworks distribution.


 On 29 May 2016, at 21:43, Mich Talebzadeh 
 wrote:

 Hi Jorn,

 I started building apache-tez-0.8.2 but got few errors. Couple of guys
 from TEZ user group kindly gave a hand but I could not go very far (or may
 be I did not make enough efforts) making it work.

 That TEZ user group is very quiet as well.

 My understanding is TEZ is MR with DAG but of course Spark has both
 plus in-memory capability.

 It would be interesting to see what version of TEZ works as execution
 engine with Hive.

 Vendors are divided on this (use Hive with TEZ) or use Impala instead
 of Hive etc as I am sure you already know.

 Cheers,




 Dr Mich Talebzadeh


 LinkedIn * 
 https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 *


 http://talebzadehmich.wordpress

NullPointerException when dropping database backed by S3

2016-05-06 Thread Marcin Tustin
Hi All,

I have a database backed by an s3 bucket. When I try to drop that database,
I get a NullPointerException:

hive> drop database services_csvs cascade;

FAILED: Execution Error, return code 1 from
org.apache.hadoop.hive.ql.exec.DDLTask.
MetaException(message:java.lang.NullPointerException)


hive> describe database services_csvs;

OK

services_csvs s3a://ID:SECRETWITHOUTSLASHES@services-csvs/ mtustin USER

→ hive --version

WARNING: Use "yarn jar" to launch YARN applications.

Hive 1.2.1.2.3.4.0-3485

Subversion
git://c66-slave-20176e25-2/grid/0/jenkins/workspace/HDP-build-centos6/bigtop/build/hive/rpm/BUILD/hive-1.2.1.2.3.4.0
-r efb067075854961dfa41165d5802a62ae334a2db

Compiled by jenkins on Wed Dec 16 04:01:39 UTC 2015

From source with checksum 4ecc763ed826fd070121da702cbd17e9

Any ideas or suggestions would be greatly appreciated.


Thanks,

Marcin




Re: Making sqoop import use Spark engine as opposed to MapReduce for Hive

2016-04-30 Thread Marcin Tustin
No, the execution engines are not in general interchangeable. The Hive
project uses an abstraction layer to be able to plug in different execution
engines. I don't know if Sqoop uses Hive code, or if it uses an old
version, or what.

As with many things in the hadoop world, if you want to know if there's
something undocumented, your best bet is to look at the source code.

My suggestion would be to (1) make sure you're executing somewhere close to
the data - i.e. on nodemanagers colocated with datanodes; (2) profile to
make sure the slowness really is where you think; and (3) if you really
can't get the speed you need, try writing a small spark job to do the
export. Newer versions of spark seem faster.


On Sat, Apr 30, 2016 at 10:05 AM, Mich Talebzadeh  wrote:

> Hi Marcin,
>
> It is the speed, really: the speed at which data is ingested into Hive.
>
> Sqoop is two-stage, as I understand it:
>
>
>    1. Take the data out of the RDBMS via JDBC and put it in an external
>    HDFS file
>    2. Read that file and insert into a Hive table
>
>  The issue is the second part. In general I use Hive 2 with the Spark 1.3.1
> engine to put data into a Hive table. I wondered if there was such a
> parameter in Sqoop to use the Spark engine.
>
> Well, I gather this is easier said than done. I am importing a
> 1-billion-row table from Oracle:
>
> sqoop import --connect "jdbc:oracle:thin:@rhes564:1521:mydb12" --username
> scratchpad -P \
> --query "select * from scratchpad.dummy where \
> \$CONDITIONS" \
> --split-by ID \
> --hive-import  --hive-table "oraclehadoop.dummy" --target-dir
> "dummy"
>
>
> Now the fact that in hive-site.xml I have set hive.execution.engine=spark
> does not matter. Sqoop seems to internally set  hive.execution.engine=mr
> anyway.
>
> Maybe there should be an option --hive-execution-engine='mr/tez/spark'
> etc. in the above command?
>
> Cheers,
>
> Mich
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 30 April 2016 at 14:51, Marcin Tustin  wrote:
>
>> They're not simply interchangeable. Sqoop is written to use MapReduce.
>>
>> I actually implemented my own replacement for sqoop-export in spark,
>> which was extremely simple. It wasn't any faster, because the bottleneck
>> was the receiving database.
>>
>> Is your motivation here speed? Or correctness?
>>
>> On Sat, Apr 30, 2016 at 8:45 AM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> What is the simplest way of making sqoop import use spark engine as
>>> opposed to the default mapreduce when putting data into hive table. I did
>>> not see any parameter for this in sqoop command line doc.
>>>
>>> Thanks
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>
>>
>>
>>
>




Re: Making sqoop import use Spark engine as opposed to MapReduce for Hive

2016-04-30 Thread Marcin Tustin
They're not simply interchangeable. Sqoop is written to use MapReduce.

I actually implemented my own replacement for sqoop-export in spark, which
was extremely simple. It wasn't any faster, because the bottleneck was the
receiving database.

Is your motivation here speed? Or correctness?

On Sat, Apr 30, 2016 at 8:45 AM, Mich Talebzadeh 
wrote:

> Hi,
>
> What is the simplest way of making sqoop import use spark engine as
> opposed to the default mapreduce when putting data into hive table. I did
> not see any parameter for this in sqoop command line doc.
>
> Thanks
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>




Re: Hive footprint

2016-04-20 Thread Marcin Tustin
Could you expand on this? This sounds like something that would be great to
know, and probably fold into the wiki.
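
For context, the kind of bitmap index Mich lists below would have been
created along these lines (a sketch; whether the optimizer actually uses it
is the open question):

CREATE INDEX sales_cust_bix ON TABLE sales (cust_id)
  AS 'BITMAP' WITH DEFERRED REBUILD;

ALTER INDEX sales_cust_bix ON sales REBUILD;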

On Wed, Apr 20, 2016 at 11:57 AM, Jörn Franke  wrote:

> Hive has working indexes. However many people overlook that a block is
> usually much larger than in a relational database and thus do not use them
> right.
>
> On 19 Apr 2016, at 09:31, Mich Talebzadeh 
> wrote:
>
> The issue is that Hive has indexes (not an index store) but they don't
> work, so there we go. Maybe in later releases we can make use of these
> indexes for faster queries. Hive even allows bitmap indexes on the fact
> table, but they are never used by the CBO.
>
> show indexes on sales;
>
>
> +--------------------+----------+-------------+------------------------------------------+----------+---------+
> | idx_name           | tab_name | col_names   | idx_tab_name                             | idx_type | comment |
> +--------------------+----------+-------------+------------------------------------------+----------+---------+
> | sales_cust_bix     | sales    | cust_id     | oraclehadoop__sales_sales_cust_bix__     | bitmap   |         |
> | sales_channel_bix  | sales    | channel_id  | oraclehadoop__sales_sales_channel_bix__  | bitmap   |         |
> | sales_prod_bix     | sales    | prod_id     | oraclehadoop__sales_sales_prod_bix__     | bitmap   |         |
> | sales_promo_bix    | sales    | promo_id    | oraclehadoop__sales_sales_promo_bix__    | bitmap   |         |
> | sales_time_bix     | sales    | time_id     | oraclehadoop__sales_sales_time_bix__     | bitmap   |         |
> +--------------------+----------+-------------+------------------------------------------+----------+---------+
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 18 April 2016 at 23:51, Marcin Tustin  wrote:
>
>> We use a hive with ORC setup now. Queries may take thousands of seconds
>> with joins, and potentially tens of seconds with selects on very large
>> tables.
>>
>> My understanding is that the goal of hbase is to provide much lower
>> latency for queries. Obviously, this comes at the cost of not being able to
>> perform joins. I don't actually use hbase, so I hesitate to say more about
>> it.
>>
>> On Mon, Apr 18, 2016 at 6:48 PM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Thanks Marcin.
>>>
>>> What is the definition of low latency here? Are you referring to the
>>> performance of SQL against HBase tables compared to Hive. As I understand
>>> HBase is a columnar database. Would it be possible to use Hive against ORC
>>> to achieve the same?
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 18 April 2016 at 23:43, Marcin Tustin  wrote:
>>>
>>>> HBase has a different use case - it's for low-latency querying of big
>>>> tables. If you combined it with Hive, you might have something nice for
>>>> certain queries, but I wouldn't think of them as direct competitors.
>>>>
>>>> On Mon, Apr 18, 2016 at 6:34 PM, Mich Talebzadeh <
>>>> mich.talebza...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I notice that Impala is rarely mentioned these days.  I may be missing
>>>>> something. However, I gather it is coming to end now as I don't recall 
>>>>> many
>>>>> use cases for it (or customers asking for it). In contrast, Hive has hold
>>>>> its ground with the new addition of Spark and Tez as execution engines,
>>>>> support for ACID and ORC and new stuff in Hive 2. In addition provided a
>>>>> good choice for its metast

Re: Hive footprint

2016-04-18 Thread Marcin Tustin
We use a Hive-with-ORC setup now. Queries may take thousands of seconds
with joins, and potentially tens of seconds with selects on very large
tables.

My understanding is that the goal of hbase is to provide much lower latency
for queries. Obviously, this comes at the cost of not being able to perform
joins. I don't actually use hbase, so I hesitate to say more about it.

On Mon, Apr 18, 2016 at 6:48 PM, Mich Talebzadeh 
wrote:

> Thanks Marcin.
>
> What is the definition of low latency here? Are you referring to the
> performance of SQL against HBase tables compared to Hive. As I understand
> HBase is a columnar database. Would it be possible to use Hive against ORC
> to achieve the same?
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 18 April 2016 at 23:43, Marcin Tustin  wrote:
>
>> HBase has a different use case - it's for low-latency querying of big
>> tables. If you combined it with Hive, you might have something nice for
>> certain queries, but I wouldn't think of them as direct competitors.
>>
>> On Mon, Apr 18, 2016 at 6:34 PM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I notice that Impala is rarely mentioned these days.  I may be missing
>>> something. However, I gather it is coming to an end now, as I don't recall
>>> many use cases for it (or customers asking for it). In contrast, Hive has
>>> held its ground with the new addition of Spark and Tez as execution
>>> engines, support for ACID and ORC, and new stuff in Hive 2. In addition,
>>> provided a good choice for its metastore, it scales well.
>>>
>>> If Hive had the (organic) ability to have local variables and stored
>>> procedure support, then it would be a top-notch data warehouse. Given its
>>> metastore, I don't see any technical reason why it cannot support these
>>> constructs.
>>>
>>> I was recently asked to comment on migration from commercial DWs to Big
>>> Data (primarily for TCO reason) and really could not recall any better
>>> candidate than Hive. Is HBase a viable alternative? Obviously whatever one
>>> decides there is still HDFS, a good engine for Hive (sounds like many
>>> prefer TEZ although I am a Spark fan) and the ubiquitous YARN.
>>>
>>> Let me know your thoughts.
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>
>>
>>
>>
>




Re: Hive footprint

2016-04-18 Thread Marcin Tustin
HBase has a different use case - it's for low-latency querying of big
tables. If you combined it with Hive, you might have something nice for
certain queries, but I wouldn't think of them as direct competitors.

On Mon, Apr 18, 2016 at 6:34 PM, Mich Talebzadeh 
wrote:

> Hi,
>
> I notice that Impala is rarely mentioned these days.  I may be missing
> something. However, I gather it is coming to an end now, as I don't recall
> many use cases for it (or customers asking for it). In contrast, Hive has
> held its ground with the new addition of Spark and Tez as execution
> engines, support for ACID and ORC, and new stuff in Hive 2. In addition,
> provided a good choice for its metastore, it scales well.
>
> If Hive had the (organic) ability to have local variables and stored
> procedure support, then it would be a top-notch data warehouse. Given its
> metastore, I don't see any technical reason why it cannot support these
> constructs.
>
> I was recently asked to comment on migration from commercial DWs to Big
> Data (primarily for TCO reason) and really could not recall any better
> candidate than Hive. Is HBase a viable alternative? Obviously whatever one
> decides there is still HDFS, a good engine for Hive (sounds like many
> prefer TEZ although I am a Spark fan) and the ubiquitous YARN.
>
> Let me know your thoughts.
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>




Re: De-identification_in Hive

2016-03-19 Thread Marcin Tustin
This is a classic transform-load problem. You'll want to anonymise it once
before making it available for analysis.
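
As a sketch of that transform step (table and column names are hypothetical;
sha2() needs a reasonably recent Hive, otherwise a UDF is the fallback):

CREATE TABLE customers_clean STORED AS ORC AS
SELECT
  id,
  sha2(ssn, 256) AS ssn_hash,  -- one-way hash replaces the sensitive value
  city,
  signup_date
FROM customers_raw;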

On Thursday, March 17, 2016, Ajay Chander  wrote:

> Hi Everyone,
>
> I have a CSV file which has some sensitive data in a particular column.
> Now I have to create a table in Hive and load the data into it, but when
> loading the data I have to make sure that the data is masked. Is there any
> built-in function that supports this, or do I have to write a UDF?
> Any suggestions are appreciated. Thanks




Re: Hive alter table concatenate loses data - can parquet help?

2016-03-14 Thread Marcin Tustin
Thank you very much for thinking of this. I do not have such files. I will
file a bug as per your suggestion.

On Monday, March 14, 2016, Prasanth Jayachandran <
pjayachand...@hortonworks.com> wrote:

> Hi Marcin
>
> I came across this issue recently. Do you have old orc files (created with
> hive 0.11) in the table/partition? If so this patch is required
>
> https://issues.apache.org/jira/browse/HIVE-13285
>
> Thanks
> Prasanth
>
> On Mar 10, 2016, at 5:02 PM, Prasanth Jayachandran <
> pjayachand...@hortonworks.com
> > wrote:
>
> After hive 1.2.1 there is one patch that went in related to alter table
> concatenation. https://issues.apache.org/jira/browse/HIVE-12450
>
> I am not sure if its related though. Could you please file a bug for this?
> It will be great if you can attach a small enough repro for this issue. I
> can verify it and provide a fix in case of bug.
>
> Thanks
> Prasanth
>
> On Mar 8, 2016, at 5:52 AM, Marcin Tustin  > wrote:
>
> Hi Mich,
>
> ddl as below.
>
> Hi Prasanth,
>
> Hive version as reported by Hortonworks is 1.2.1.2.3.
>
> Thanks,
> Marcin
>
> CREATE TABLE ``(
>
>   `col1` string,
>
>   `col2` bigint,
>
>   `col3` string,
>
>   `col4` string,
>
>   `col4` string,
>
>   `col5` bigint,
>
>   `col6` string,
>
>   `col7` string,
>
>   `col8` string,
>
>   `col9` string,
>
>   `col10` boolean,
>
>   `col11` boolean,
>
>   `col12` string,
>
>   `metadata`
> struct,
>
>   `col14` string,
>
>   `col15` bigint,
>
>   `col16` double,
>
>   `col17` bigint)
>
> ROW FORMAT SERDE
>
>   'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
>
> STORED AS INPUTFORMAT
>
>   'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
>
> OUTPUTFORMAT
>
>   'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
>
> LOCATION
>
>   'hdfs://reporting-handy/'
>
> TBLPROPERTIES (
>
>   'COLUMN_STATS_ACCURATE'='true',
>
>   'numFiles'='2800',
>
>   'numRows'='297263',
>
>   'rawDataSize'='454748401',
>
>   'totalSize'='31310353',
>
>   'transient_lastDdlTime'='1457437204')
>
> Time taken: 1.062 seconds, Fetched: 34 row(s)
>
> On Tue, Mar 8, 2016 at 4:29 AM, Mich Talebzadeh  > wrote:
>
>> Hi
>>
>> can you please provide DDL for this table "show create table "
>>
>> Dr Mich Talebzadeh
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 7 March 2016 at 23:25, Marcin Tustin > > wrote:
>>
>>> Hi All,
>>>
>>> Following on from our Parquet vs ORC discussion, today I observed
>>> Hive's ALTER TABLE ... CONCATENATE command remove rows from an
>>> ORC-formatted table.
>>>
>>> 1. Has anyone else observed this (fuller description below)? And
>>> 2. How do Parquet users handle the file fragmentation issue?
>>>
>>> Description of the problem:
>>>
>>> Today I ran a query to count rows by date. Relevant days below:
>>> 2016-02-28 16866
>>> 2016-03-06 219
>>> 2016-03-07 2863
>>> I then ran concatenation on that table. Rerunning the same query
>>> resulted in:
>>>
>>> 2016-02-28 16866
>>> 2016-03-06 219
>>> 2016-03-07 1158
>>>
>>> Note reduced count for 2016-03-07
>>>
>>> I then ran concatenation a second time, and the query a third time:
>>> 2016-02-28 16344
>>> 2016-03-06 219
>>> 2016-03-07 1158
>>>
>>> Now the count for 2016-02-28 is reduced.
>>>
>>> This doesn't look like an elimination of duplicates occurring by design
>>> - these didn't all happen on the first run of concatenation. It looks like
>>> concatenation just kind of loses data.
>>>
>>>
>>>
>>>
>>>
>>
>
>
>
>
>




Re: How to rename a hive table without changing location?

2016-03-12 Thread Marcin Tustin
If you wish to keep it in its current location, consider creating an
external table.
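
A sketch of one way to do that (untested on your version; names and paths
are hypothetical). The key is to make the old table external before dropping
it, so the files are left in place:

-- keep the data when the old name is dropped
ALTER TABLE old_name SET TBLPROPERTIES ('EXTERNAL'='TRUE');

-- same schema, same files, new name
CREATE EXTERNAL TABLE new_name LIKE old_name
  LOCATION 'hdfs:///custom/path/to/old_name';

DROP TABLE old_name;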

On Saturday, March 12, 2016, Rex X  wrote:

> Hi Mich,
>
> I am doing this, because I need to update an existing big hive table,
> which can be stored in any arbitrary customized location on hdfs. But when
> we do Alter Table Rename, Hive will automatically move the files to the
> subdirectory of the corresponding database, /user/hive/warehouse/test.db/
> in your case.
>
> I want to keep its original location.
>
>
>
>
>
>
> On Sat, Mar 12, 2016 at 4:17 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com
> > wrote:
>
>> I do not see much point in renaming a table from A to B but still looking
>> at files A for this table. What is the purpose of renaming the table but
>> having the same file system?
>>
>> hive>
>> *create table a (col1 int);*hive>
>> *show create table a;*CREATE TABLE `a`(
>>   `col1` int)
>> ROW FORMAT SERDE
>>   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
>> STORED AS INPUTFORMAT
>>   'org.apache.hadoop.mapred.TextInputFormat'
>> OUTPUTFORMAT
>>   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
>>
>> *LOCATION*
>> * 'hdfs://rhes564:9000/user/hive/warehouse/test.db/a'*
>>
>> hive>
>> *alter table a rename to b;*hive> *show create table a;*
>> FAILED: SemanticException [Error 10001]: Table not found a
>>
>> hive> *show create table b;*
>> CREATE TABLE `b`(
>>   `col1` int)
>> ROW FORMAT SERDE
>>   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
>> STORED AS INPUTFORMAT
>>   'org.apache.hadoop.mapred.TextInputFormat'
>> OUTPUTFORMAT
>>   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
>>
>> *LOCATION*
>> *'hdfs://rhes564:9000/user/hive/warehouse/test.db/b'*
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 13 March 2016 at 00:01, Rex X > > wrote:
>>
>>> Based on the Hive doc below:
>>>
>>>
>>> Rename Table
>>>
>>> *ALTER TABLE table_name RENAME TO new_table_name;*
>>>
>>> This statement lets you change the name of a table to a different name.
>>>
>>> *As of version 0.6, a rename on a managed table moves its HDFS location
>>> as well. (Older Hive versions just renamed the table in the metastore
>>> without moving the HDFS location.)*
>>>
>>>
>>> Is there any way to rename a table without changing the location?
>>>
>>
>>
>




Re: Hive alter table concatenate loses data - can parquet help?

2016-03-08 Thread Marcin Tustin
Hi Mich,

ddl as below.

Hi Prasanth,

Hive version as reported by Hortonworks is 1.2.1.2.3.

Thanks,
Marcin

CREATE TABLE ``(

  `col1` string,

  `col2` bigint,

  `col3` string,

  `col4` string,

  `col4` string,

  `col5` bigint,

  `col6` string,

  `col7` string,

  `col8` string,

  `col9` string,

  `col10` boolean,

  `col11` boolean,

  `col12` string,

  `metadata`
struct,

  `col14` string,

  `col15` bigint,

  `col16` double,

  `col17` bigint)

ROW FORMAT SERDE

  'org.apache.hadoop.hive.ql.io.orc.OrcSerde'

STORED AS INPUTFORMAT

  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'

OUTPUTFORMAT

  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'

LOCATION

  'hdfs://reporting-handy/'

TBLPROPERTIES (

  'COLUMN_STATS_ACCURATE'='true',

  'numFiles'='2800',

  'numRows'='297263',

  'rawDataSize'='454748401',

  'totalSize'='31310353',

  'transient_lastDdlTime'='1457437204')

Time taken: 1.062 seconds, Fetched: 34 row(s)

On Tue, Mar 8, 2016 at 4:29 AM, Mich Talebzadeh 
wrote:

> Hi
>
> can you please provide DDL for this table "show create table "
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 7 March 2016 at 23:25, Marcin Tustin  wrote:
>
>> Hi All,
>>
>> Following on from our Parquet vs ORC discussion, today I observed
>> Hive's ALTER TABLE ... CONCATENATE command remove rows from an
>> ORC-formatted table.
>>
>> 1. Has anyone else observed this (fuller description below)? And
>> 2. How do Parquet users handle the file fragmentation issue?
>>
>> Description of the problem:
>>
>> Today I ran a query to count rows by date. Relevant days below:
>> 2016-02-28 16866
>> 2016-03-06 219
>> 2016-03-07 2863
>> I then ran concatenation on that table. Rerunning the same query resulted
>> in:
>>
>> 2016-02-28 16866
>> 2016-03-06 219
>> 2016-03-07 1158
>>
>> Note reduced count for 2016-03-07
>>
>> I then ran concatenation a second time, and the query a third time:
>> 2016-02-28 16344
>> 2016-03-06 219
>> 2016-03-07 1158
>>
>> Now the count for 2016-02-28 is reduced.
>>
>> This doesn't look like an elimination of duplicates occurring by design -
>> these didn't all happen on the first run of concatenation. It looks like
>> concatenation just kind of loses data.
>>
>>
>>
>>
>>
>




Re: Hive 2 insert error

2016-03-07 Thread Marcin Tustin
I believe updates and deletes have always had this constraint. It's at
least hinted at by:
https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions#HiveTransactions-ConfigurationValuestoSetforINSERT,UPDATE,DELETE

On Mon, Mar 7, 2016 at 7:46 PM, Mich Talebzadeh 
wrote:

> Hi,
>
> I noticed this one in Hive2.
>
> insert into sales3 select * from smallsales;
> FAILED: SemanticException [Error 10297]: Attempt to do update or delete on
> table sales3 that does not use an AcidOutputFormat or is not bucketed
>
> Is this something new in Hive 2 as I don't recall having this issue before?
>
> Table sales3 has been created as follows:
>
> CREATE TABLE `sales3`(
>   `prod_id` bigint,
>   `cust_id` bigint,
>   `time_id` timestamp,
>   `channel_id` bigint,
>   `promo_id` bigint,
>   `quantity_sold` decimal(10,0),
>   `amount_sold` decimal(10,0))
> ROW FORMAT SERDE
>   'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
> STORED AS INPUTFORMAT
>   'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
> OUTPUTFORMAT
>   'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
> LOCATION
>   'hdfs://rhes564:9000/user/hive/warehouse/oraclehadoop.db/sales3'
> TBLPROPERTIES (
>   'orc.compress'='SNAPPY',
>   'transactional'='true',
>   'transient_lastDdlTime'='1457396808')
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
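
For contrast, a minimal sketch of a DDL that satisfies the ACID requirement (the bucketing column and bucket count below are arbitrary choices, not a recommendation): the table must be stored as ORC, bucketed, and flagged transactional.

CREATE TABLE sales3_acid (
  prod_id bigint,
  cust_id bigint,
  time_id timestamp,
  channel_id bigint,
  promo_id bigint,
  quantity_sold decimal(10,0),
  amount_sold decimal(10,0))
CLUSTERED BY (prod_id) INTO 8 BUCKETS   -- bucketing is what sales3 is missing
STORED AS ORC
TBLPROPERTIES ('transactional'='true');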




Hive alter table concatenate loses data - can parquet help?

2016-03-07 Thread Marcin Tustin
Hi All,

Following on from our parquet vs orc discussion, today I observed
hive's alter table ... concatenate command remove rows from an ORC
formatted table.

1. Has anyone else observed this (fuller description below)? And
2. How do parquet users handle the file fragmentation issue?

Description of the problem:

Today I ran a query to count rows by date. Relevant days below:
2016-02-28 16866
2016-03-06 219
2016-03-07 2863
I then ran concatenation on that table. Rerunning the same query resulted
in:

2016-02-28 16866
2016-03-06 219
2016-03-07 1158

Note reduced count for 2016-03-07

I then ran concatenation a second time, and the query a third time:
2016-02-28 16344
2016-03-06 219
2016-03-07 1158

Now the count for 2016-02-28 is reduced.

This doesn't look like an elimination of duplicates occurring by design -
these didn't all happen on the first run of concatenation. It looks like
concatenation just kind of loses data.
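
One way to narrow down where the rows go missing (a sketch only; the table name is assumed) is to count rows per underlying ORC file with Hive's INPUT__FILE__NAME virtual column before and after concatenation, then diff the two result sets:

SELECT INPUT__FILE__NAME AS orc_file, COUNT(*) AS row_cnt
FROM events                    -- assumed table name
GROUP BY INPUT__FILE__NAME
ORDER BY orc_file;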




Re: Updating column in table throws error

2016-03-06 Thread Marcin Tustin
Don't bucket on columns you expect to update.

Potentially you could delete the whole row and reinsert it.
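
A rough sketch of that workaround, with table, column and values assumed purely for illustration (the table must be transactional for DELETE and INSERT ... VALUES to work):

DELETE FROM invoices WHERE invoicenumber = 1001;              -- drop the row carrying the old key
INSERT INTO TABLE invoices VALUES (1002, 'ACME Ltd', 49.99);  -- re-insert it with the new key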

On Sunday, March 6, 2016, Ashok Kumar  wrote:

> Hi gurus,
>
> I have an ORC table bucketed on invoicenumber with "transactional"="true"
>
> I am trying to update invoicenumber column used for bucketing this table
> but it comes back with
>
> Error: Error while compiling statement: FAILED: SemanticException [Error
> 10302]: Updating values of bucketing columns is not supported.  Column
> invoicenumber
>
> Any ideas how it can be solved?
>
> Thank you
>




Re: Parquet versus ORC

2016-03-06 Thread Marcin Tustin
If you google, you'll find benchmarks showing each to be faster than the
other. In so far as there's any reality to which is faster in any given
comparison, it seems to be a result of each incorporating ideas from the
other, or at least going through development cycles to beat each other.

ORC is very fast for working with hive, and we use it at Handy. That said,
the broader support for parquet might enable things like performing your
own insertions into tables by dropping new files in there, or doing your
own concatenation and cleanup.

In summary, until you benchmark your own usage I'd assume performance is
the same. If you're not going to benchmark, go by what's likely to be most
convenient.

On Sun, Mar 6, 2016 at 11:06 AM, Mich Talebzadeh 
wrote:

> Hi,
>
> Thanks for that link.
>
> It appears that the main advantages of Parquet is stated as and I quote:
>
> "Parquet is built to be used by anyone. The Hadoop ecosystem is rich with
> data processing frameworks, and we are not interested in playing favorites.
> We believe that an efficient, well-implemented columnar storage substrate
> should be useful to all frameworks without the cost of extensive and
> difficult to set up dependencies."
>
> Fair enough Parquet provides columnar format and compression. As I stated
> I do not know much about it. However, my understanding of ORC is that it
> provides better encoding of data, Predicate push down for some predicates
> plus support for ACID properties.
>
> As Alan Gates stated before (Hive user forum, "Difference between ORC and
> RC files" , 21 Dec 15) and I quote
>
> "Whether ORC is the best format for what you're doing depends on the data
> you're storing and how you are querying it.  If you are storing data where
> you know the schema and you are doing analytic type queries it's the best
> choice (in fairness, some would dispute this and choose Parquet, though
> much of what I said above (about ORC vs RC applies to Parquet as well).  If
> you are doing queries that select the whole row each time columnar formats
> like ORC won't be your friend.  Also, if you are storing self structured
> data such as JSON or Avro you may find text or Avro storage to be a better
> format.
>
> So what would be the main advantage(s) of Parquet over ORC please besides
> using queries that select whole row (much like "a row based" type
> relational database does).
>
>
> Cheers.
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 6 March 2016 at 15:34, Uli Bethke  wrote:
>
>> Curious why you think that Parquet does not have metadata at file, row
>> group or column level.
>> Please refer here to the type of metadata that Parquet supports in the
>> docs http://parquet.apache.org/documentation/latest/
>>
>>
>> On 06/03/2016 15:26, Mich Talebzadeh wrote:
>>
>> Hi.
>>
>> I have been hearing a fair bit about Parquet versus ORC tables.
>>
>> In a nutshell I can say that Parquet is a predecessor to ORC (both
>> provide columnar type storage) but I notice that it is still being used
>> especially with Spark users.
>>
>> In mitigation it appears that Spark users are reluctant to use ORC
>> despite the fact that with inbuilt Store Index it offers superior
>> optimisation with data and stats at file, stripe and row group level. Both
>> Parquet and ORC offer SNAPPY compression as well. ORC offers ZLIB as
>> default.
>>
>> There may be other than technical reasons for this adaption, for example
>> too much reliance on Hive plus the fact that it is easier to flatten
>> Parquet than ORC (whatever that means).
>>
>> I for myself use either text files or ORC with Hive and Spark and don't
>> really see any reason why I should adopt others like Avro, Parquet etc.
>>
>> Appreciate any verification or experience on this.
>>
>> Thanks,
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>>
>>
>> --
>> ___
>> Uli Bethke
>> Chair Hadoop User Group Ireland, www.hugireland.org
>> HUG Ireland is community sponsor of Hadoop Summit Europe in Dublin 
>> http://2016.hadoopsummit.org/dublin/
>>
>>
>




Data corruption/loss in hive

2016-01-22 Thread Marcin Tustin
Hi All,

I'm seeing some data loss/corruption in hive. This isn't HDFS-level
corruption - hdfs reports that the files and blocks are healthy.

I'm using managed ORC tables. Normally we write once an hour to each table,
with occasional concatenations through Hive. We perform the writing using
Spark 1.3.1 (via the Spark SQL interface), running either locally or over
YARN.

Occasionally we will run many insertion jobs against a table, generally
when backfilling data.

The data loss seems to happen more frequently when we are doing frequent
concatenations and multiple insertion jobs at once.

The problem goes away when we drop the table and reingest. The problem also
appears to be localised to specific orc files within the table - if we
delete the affected files (detectable by trying to orcdump each file), the
rest are just fine.

Has anyone seen this? Any suggestions for avoiding this or chasing down a
root cause?

Thanks,
Marcin




Re: the `use database` command will change the scheme of target table?

2016-01-19 Thread Marcin Tustin
That is the expected behaviour. Managed tables are created within the
directory of their host database.
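
A quick way to see which filesystem a database (and therefore its managed tables and CTAS targets) lives on is to inspect its location; the database names below are the ones from the thread:

DESCRIBE DATABASE EXTENDED temp;   -- location should carry an hdfs:// scheme
DESCRIBE DATABASE EXTENDED prd;    -- location should carry an s3a:// scheme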

On Tuesday, 19 January 2016, 董亚军  wrote:

> hi list,
>
> we use HDFS and S3 as Hive filesystems at the same time. Here is the issue:
>
>
> *scenario* 1:
>
> hive command:
>
> use default;
>
> create table temp.t1   // the database of temp which points to HDFS
> as
> select c1 from prd.t2; // the database of prd and the table t2 are all
> points to S3
>
> it works well.
>
>
> *scenario* 2:
>
> hive command:
>
> *use prd; *
>
> create table temp.t1   // the database of temp which points to HDFS
> as
> select c1 from prd.t2; // the database of prd and the table t2 are all
> point to S3
>
> the exception occurred with:
>
> Failed with exception Unable to move source
> s3a://warehouse-tmp/tmp/hive-ubuntu/hive_2016-01-20_xx/-ext-10001 to
> destination hdfs://hadoop-0/warehouse/temp.db/t1/
>
> and then, I try to change the Scratch space by the configuration key:
> hive.exec.scratchdir, and set the value to hdfs://hadoop-0/*tmp-foo*/...
> , but also failed with:
>
> Unable to move source s3a://warehouse-tmp*/tmp-foo* ... to
>
> it seems the *use database* command changes the scheme of the path for the
> target table?
>
> hive version: 0.13.1
>
>
> thanks.
>




Re: equivalent to identity column in Hive

2016-01-16 Thread Marcin Tustin
See this:
http://stackoverflow.com/questions/23082763/need-to-add-auto-increment-column-in-a-table-using-hive
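
The usual workaround is to generate a surrogate key with a windowing function rather than a true identity column; a sketch with an assumed table name (note the numbering is recomputed on every run and nothing guarantees stability across loads):

SELECT ROW_NUMBER() OVER () AS id, t.*
FROM some_table t;              -- table name assumed for illustration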

On Sat, Jan 16, 2016 at 11:52 AM, Ashok Kumar  wrote:

> Hi,
>
> Is there an equivalent to Microsoft IDENTITY column in Hive please.
>
> Thanks  and regards
>




Re: Loading data containing newlines

2016-01-15 Thread Marcin Tustin
You can open a file as an RDD of lines, and map whatever custom
tokenisation function you want over it; alternatively you can partition
down to a reasonable size and use mapPartitions to map the standard Python
csv parser over the partitions.

In general, the advantage of spark is that you can do anything you like
rather than being limited to a specific set of primitives.

On Fri, Jan 15, 2016 at 4:42 PM, Mich Talebzadeh 
wrote:

> Hi Marcin,
>
>
>
> Can you be specific in what way Spark is better suited for this operation
> compared to Hive?
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> *Sybase ASE 15 Gold Medal Award 2008*
>
> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>
>
> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
>
> Author of the books* "A Practitioner’s Guide to Upgrading to Sybase ASE
> 15", ISBN 978-0-9563693-0-7*.
>
> co-author *"Sybase Transact SQL Guidelines Best Practices", ISBN
> 978-0-9759693-0-4*
>
> *Publications due shortly:*
>
> *Complex Event Processing in Heterogeneous Environments*, ISBN:
> 978-0-9563693-3-8
>
> *Oracle and Sybase, Concepts and Contrasts*, ISBN: 978-0-9563693-1-4, volume
> one out shortly
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> NOTE: The information in this email is proprietary and confidential. This
> message is for the designated recipient only, if you are not the intended
> recipient, you should destroy it immediately. Any information in this
> message shall not be understood as given or endorsed by Peridale Technology
> Ltd, its subsidiaries or their employees, unless expressly so stated. It is
> the responsibility of the recipient to ensure that this email is virus
> free, therefore neither Peridale Technology Ltd, its subsidiaries nor their
> employees accept any responsibility.
>
>
>
> *From:* Marcin Tustin [mailto:mtus...@handybook.com]
> *Sent:* 15 January 2016 21:39
> *To:* user@hive.apache.org
> *Subject:* Re: Loading data containing newlines
>
>
>
> I second this. I've generally found anything else to be disappointing when
> working with data which is at all funky.
>
>
>
> On Wed, Jan 13, 2016 at 8:13 PM, Alexander Pivovarov 
> wrote:
>
> Time to use Spark and Spark-Sql in addition to Hive?
>
> It's probably going to happen sooner or later anyway.
>
>
>
> I sent you Spark solution yesterday.  (you just need to write 
> unbzip2AndCsvToListOfArrays(file:
> String): List[Array[String]]  function using BZip2CompressorInputStream
> and Super CSV API)
>
> you can download spark,  open spark-shell and run/debug the program on a
> single computer
>
>
>
> and then run it on cluster if needed   (e.g. Amazon EMR can spin up Spark
> cluster in 7 min)
>
>
>
> On Wed, Jan 13, 2016 at 4:13 PM, Gerber, Bryan W 
> wrote:
>
> 1.   hdfs dfs -copyFromLocal /incoming/files/*.bz2  hdfs://
> host.name/data/stg/table/
>
> 2.   CREATE EXTERNAL TABLE stg_ (cols…) ROW FORMAT serde
> 'org.apache.hadoop.hive.serde2.OpenCSVSerde' STORED AS TEXTFILE LOCATION
> ‘/data/stg/table/’
>
> 3.   CREATE TABLE  (cols…) STORE AS ORC  tblproperties
> ("orc.compress"="ZLIB");
>
> 4.   INSERT INTO TABLE  SELECT cols, udf1(cola),
> udf2(colb),functions(),etc. FROM ext_
>
> 5.   Delete files from hdfs://host.name/data/stg/table/
>
>
>
> This has been working quite well, until our newest data contains fields
> with embedded newlines.
>
>
>
> We are now looking into options further up the pipeline to see if we can
> condition the data earlier in the process.
>
>
>
> *From:* Mich Talebzadeh [mailto:m...@peridale.co.uk]
> *Sent:* Wednesday, January 13, 2016 10:34 AM
>
>
> *To:* user@hive.apache.org
> *Subject:* RE: Loading data containing newlines
>
>
>
> Thanks Brian.
>
>
>
> Just to clarify do you use something like below?
>
>
>
> 1.  hdfs dfs -copyFromLocal /var/tmp/t.bcp hdfs://
> rhes564.hedat.net:9000/misc/t.bcp
>
> 2.  CREATE EXTERNAL TABLE  name (col1 INT, col2 string, …) COMMENT
> 'load from bcp file'ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED
> AS ORC
>
>
>
> Cheers,
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAE

Re: Loading data containing newlines

2016-01-15 Thread Marcin Tustin
I second this. I've generally found anything else to be disappointing when
working with data which is at all funky.

On Wed, Jan 13, 2016 at 8:13 PM, Alexander Pivovarov 
wrote:

> Time to use Spark and Spark-Sql in addition to Hive?
> It's probably going to happen sooner or later anyway.
>
> I sent you Spark solution yesterday.  (you just need to write
> unbzip2AndCsvToListOfArrays(file: String): List[Array[String]]  function
> using BZip2CompressorInputStream and Super CSV API)
> you can download spark,  open spark-shell and run/debug the program on a
> single computer
>
> and then run it on cluster if needed   (e.g. Amazon EMR can spin up Spark
> cluster in 7 min)
>
> On Wed, Jan 13, 2016 at 4:13 PM, Gerber, Bryan W 
> wrote:
>
>> 1.   hdfs dfs -copyFromLocal /incoming/files/*.bz2  hdfs://
>> host.name/data/stg/table/
>>
>> 2.   CREATE EXTERNAL TABLE stg_ (cols…) ROW FORMAT serde
>> 'org.apache.hadoop.hive.serde2.OpenCSVSerde' STORED AS TEXTFILE LOCATION
>> ‘/data/stg/table/’
>>
>> 3.   CREATE TABLE  (cols…) STORE AS ORC  tblproperties
>> ("orc.compress"="ZLIB");
>>
>> 4.   INSERT INTO TABLE  SELECT cols, udf1(cola),
>> udf2(colb),functions(),etc. FROM ext_
>>
>> 5.   Delete files from hdfs://host.name/data/stg/table/
>>
>>
>>
>> This has been working quite well, until our newest data contains fields
>> with embedded newlines.
>>
>>
>>
>> We are now looking into options further up the pipeline to see if we can
>> condition the data earlier in the process.
>>
>>
>>
>> *From:* Mich Talebzadeh [mailto:m...@peridale.co.uk]
>> *Sent:* Wednesday, January 13, 2016 10:34 AM
>>
>> *To:* user@hive.apache.org
>> *Subject:* RE: Loading data containing newlines
>>
>>
>>
>> Thanks Brian.
>>
>>
>>
>> Just to clarify do you use something like below?
>>
>>
>>
>> 1.  hdfs dfs -copyFromLocal /var/tmp/t.bcp hdfs://
>> rhes564.hedat.net:9000/misc/t.bcp
>>
>> 2.  CREATE EXTERNAL TABLE  name (col1 INT, col2 string, …)
>> COMMENT 'load from bcp file'ROW FORMAT DELIMITED FIELDS TERMINATED BY
>> ',' STORED AS ORC
>>
>>
>>
>> Cheers,
>>
>>
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> *Sybase ASE 15 Gold Medal Award 2008*
>>
>> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>>
>>
>> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
>>
>> Author of the books* "A Practitioner’s Guide to Upgrading to Sybase ASE
>> 15", ISBN 978-0-9563693-0-7*.
>>
>> co-author *"Sybase Transact SQL Guidelines Best Practices", ISBN
>> 978-0-9759693-0-4*
>>
>> *Publications due shortly:*
>>
>> *Complex Event Processing in Heterogeneous Environments*, ISBN:
>> 978-0-9563693-3-8
>>
>> *Oracle and Sybase, Concepts and Contrasts*, ISBN: 978-0-9563693-1-4, volume
>> one out shortly
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> NOTE: The information in this email is proprietary and confidential. This
>> message is for the designated recipient only, if you are not the intended
>> recipient, you should destroy it immediately. Any information in this
>> message shall not be understood as given or endorsed by Peridale Technology
>> Ltd, its subsidiaries or their employees, unless expressly so stated. It is
>> the responsibility of the recipient to ensure that this email is virus
>> free, therefore neither Peridale Ltd, its subsidiaries nor their employees
>> accept any responsibility.
>>
>>
>>
>> *From:* Gerber, Bryan W [mailto:bryan.ger...@pnnl.gov]
>> *Sent:* 13 January 2016 18:12
>> *To:* user@hive.apache.org
>> *Subject:* RE: Loading data containing newlines
>>
>>
>>
>> We are pushing the compressed text files into HDFS directory for Hive
>> EXTERNAL table, then using an INSERT on the table using ORC storage. We are
>> letting Hive handle the ORC file creation process.
>>
>>
>>
>> *From:* Mich Talebzadeh [mailto:m...@peridale.co.uk ]
>>
>> *Sent:* Tuesday, January 12, 2016 4:41 PM
>> *To:* user@hive.apache.org
>> *Subject:* RE: Loading data containing newlines
>>
>>
>>
>> Hi Bryan,
>>
>>
>>
>> As a matter of interest are you loading text files into local directories
>> in encrypted format at all and then push it into HDFS/Hive as ORC?
>>
>>
>>
>> Thanks
>>
>>
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> *Sybase ASE 15 Gold Medal Award 2008*
>>
>> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>>
>>
>> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
>>
>> Author of the books* "A Practitioner’s Guide to Upgrading to Sybase ASE
>> 15", ISBN 978-0-9563693-0-7*.
>>
>> co-author *"Sybase Transact SQL Guidelines Best Practices", ISBN
>> 978-0-975

Re: foreign keys in Hive

2016-01-10 Thread Marcin Tustin
You can join on any equality criterion, just like in any other relational
database. Foreign keys in "standard" relational databases are primarily an
integrity constraint. Hive in general lacks integrity constraints.
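
In other words, a "foreign key"-style lookup is just an ordinary equality join; a trivial sketch with assumed table and column names:

SELECT o.order_id, c.customer_name
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id;   -- nothing enforces that the referenced key exists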

On Sun, Jan 10, 2016 at 9:45 AM, Ashok Kumar  wrote:

> hi,
>
> what is the equivalent to foreign keys in Hive?
>
> Thanks
>
>




Re: Running the same query on 1 billion rows fact table in Hive on Spark compared to Sybase IQ columnar database

2015-12-30 Thread Marcin Tustin
Yes, that's why I haven't had to compile anything.

On Wed, Dec 30, 2015 at 4:16 PM, Jörn Franke  wrote:

> HDP should have Tez already on board by default.
>
> On 30 Dec 2015, at 21:42, Marcin Tustin  wrote:
>
> I'm afraid I use the HDP distribution so I haven't yet had to compile
> anything. (Incidentally, this isn't a recommendation of HDP over anything
> else).
>
> On Wed, Dec 30, 2015 at 3:33 PM, Mich Talebzadeh 
> wrote:
>
>> Thanks Marcin
>>
>>
>>
>> Trying to build TEZ 0.7 in
>>
>>
>>
>> /usr/lib/apache-tez-0.7.0-src
>>
>>
>>
>> using
>>
>>
>>
>> mvn -X clean package -DskipTests=true -Dmaven.javadoc.skip=true
>>
>>
>>
>> with mvn version 3.2.5 (as opposed to 3.3) as I read that I can build it
>> OK with 3.2.5 following the same error as below
>>
>>
>>
>> mvn --version
>>
>> Apache Maven *3.2.5* (12a6b3acb947671f09b81f49094c53f426d8cea1;
>> 2014-12-14T17:29:23+00:00)
>>
>> Maven home: /usr/local/apache-maven/apache-maven-3.2.5
>>
>> Java version: 1.7.0_25, vendor: Oracle Corporation
>>
>> Java home: /usr/java/jdk1.7.0_25/jre
>>
>>
>>
>> *I get this error*
>>
>>
>>
>> [INFO] tez-ui . FAILURE [
>> 0.411 s]
>>
>> [
>>
>>
>>
>> DEBUG] -- end configuration --
>>
>> [INFO] Running 'npm install --color=false' in
>> /usr/lib/apache-tez-0.7.0-src/tez-ui/src/main/webapp
>>
>> [INFO]
>> /usr/lib/apache-tez-0.7.0-src/tez-ui/src/main/webapp/node/with_new_path.sh:
>> line 3: 23781 Aborted "$@"
>>
>>
>>
>>
>>
>> [ERROR] Failed to execute goal
>> com.github.eirslett:frontend-maven-plugin:0.0.16:npm (npm install) on
>> project tez-ui: Failed to run task: 'npm install --color=false' failed.
>> (error code 134) -> [Help 1]
>>
>> org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute
>> goal com.github.eirslett:frontend-maven-plugin:0.0.16:npm (npm install) on
>> project tez-ui: Failed to run task
>>
>>
>>
>>
>>
>> any ideas as there is little info available in net.
>>
>>
>>
>>
>>
>> Thanks
>>
>>
>>
>> Mich Talebzadeh
>>
>>
>>
>> *Sybase ASE 15 Gold Medal Award 2008*
>>
>> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>>
>>
>> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
>>
>> Author of the books* "A Practitioner’s Guide to Upgrading to Sybase ASE
>> 15", ISBN 978-0-9563693-0-7*.
>>
>> co-author *"Sybase Transact SQL Guidelines Best Practices", ISBN
>> 978-0-9759693-0-4*
>>
>> *Publications due shortly:*
>>
>> *Complex Event Processing in Heterogeneous Environments*, ISBN:
>> 978-0-9563693-3-8
>>
>> *Oracle and Sybase, Concepts and Contrasts*, ISBN: 978-0-9563693-1-4, volume
>> one out shortly
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> NOTE: The information in this email is proprietary and confidential. This
>> message is for the designated recipient only, if you are not the intended
>> recipient, you should destroy it immediately. Any information in this
>> message shall not be understood as given or endorsed by Peridale Technology
>> Ltd, its subsidiaries or their employees, unless expressly so stated. It is
>> the responsibility of the recipient to ensure that this email is virus
>> free, therefore neither Peridale Ltd, its subsidiaries nor their employees
>> accept any responsibility.
>>
>>
>>
>> *From:* Marcin Tustin [mailto:mtus...@handybook.com]
>> *Sent:* 30 December 2015 19:27
>>
>> *To:* user@hive.apache.org
>> *Subject:* Re: Running the same query on 1 billion rows fact table in
>> Hive on Spark compared to Sybase IQ columnar database
>>
>>
>>
>> I'm using TEZ 0.7.0.2.3 with hive 1.2.1.2.3. I can confirm that TEZ is
>> much faster than MR in pretty much all cases. Also, with hive, you'll make
>> sure you've performed optimizations like aligning ORC stripe sizes with
>> HDFS block sizes, and concatenated your tables (not so much an optimization
>> as a must for avoiding the small files problem).
>>
>>
>>
>> On Wed, Dec 30, 2015 at 2:19 PM, Mich Talebzadeh 
>> wrote

Re: Running the same query on 1 billion rows fact table in Hive on Spark compared to Sybase IQ columnar database

2015-12-30 Thread Marcin Tustin
I'm afraid I use the HDP distribution so I haven't yet had to compile
anything. (Incidentally, this isn't a recommendation of HDP over anything
else).

On Wed, Dec 30, 2015 at 3:33 PM, Mich Talebzadeh 
wrote:

> Thanks Marcin
>
>
>
> Trying to build TEZ 0.7 in
>
>
>
> /usr/lib/apache-tez-0.7.0-src
>
>
>
> using
>
>
>
> mvn -X clean package -DskipTests=true -Dmaven.javadoc.skip=true
>
>
>
> with mvn version 3.2.5 (as opposed to 3.3) as I read that I can build it
> OK with 3.2.5 following the same error as below
>
>
>
> mvn --version
>
> Apache Maven *3.2.5* (12a6b3acb947671f09b81f49094c53f426d8cea1;
> 2014-12-14T17:29:23+00:00)
>
> Maven home: /usr/local/apache-maven/apache-maven-3.2.5
>
> Java version: 1.7.0_25, vendor: Oracle Corporation
>
> Java home: /usr/java/jdk1.7.0_25/jre
>
>
>
> *I get this error*
>
>
>
> [INFO] tez-ui . FAILURE [
> 0.411 s]
>
> [
>
>
>
> DEBUG] -- end configuration --
>
> [INFO] Running 'npm install --color=false' in
> /usr/lib/apache-tez-0.7.0-src/tez-ui/src/main/webapp
>
> [INFO]
> /usr/lib/apache-tez-0.7.0-src/tez-ui/src/main/webapp/node/with_new_path.sh:
> line 3: 23781 Aborted "$@"
>
>
>
>
>
> [ERROR] Failed to execute goal
> com.github.eirslett:frontend-maven-plugin:0.0.16:npm (npm install) on
> project tez-ui: Failed to run task: 'npm install --color=false' failed.
> (error code 134) -> [Help 1]
>
> org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute
> goal com.github.eirslett:frontend-maven-plugin:0.0.16:npm (npm install) on
> project tez-ui: Failed to run task
>
>
>
>
>
> any ideas as there is little info available in net.
>
>
>
>
>
> Thanks
>
>
>
> Mich Talebzadeh
>
>
>
> *Sybase ASE 15 Gold Medal Award 2008*
>
> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>
>
> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
>
> Author of the books* "A Practitioner’s Guide to Upgrading to Sybase ASE
> 15", ISBN 978-0-9563693-0-7*.
>
> co-author *"Sybase Transact SQL Guidelines Best Practices", ISBN
> 978-0-9759693-0-4*
>
> *Publications due shortly:*
>
> *Complex Event Processing in Heterogeneous Environments*, ISBN:
> 978-0-9563693-3-8
>
> *Oracle and Sybase, Concepts and Contrasts*, ISBN: 978-0-9563693-1-4, volume
> one out shortly
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> NOTE: The information in this email is proprietary and confidential. This
> message is for the designated recipient only, if you are not the intended
> recipient, you should destroy it immediately. Any information in this
> message shall not be understood as given or endorsed by Peridale Technology
> Ltd, its subsidiaries or their employees, unless expressly so stated. It is
> the responsibility of the recipient to ensure that this email is virus
> free, therefore neither Peridale Ltd, its subsidiaries nor their employees
> accept any responsibility.
>
>
>
> *From:* Marcin Tustin [mailto:mtus...@handybook.com]
> *Sent:* 30 December 2015 19:27
>
> *To:* user@hive.apache.org
> *Subject:* Re: Running the same query on 1 billion rows fact table in
> Hive on Spark compared to Sybase IQ columnar database
>
>
>
> I'm using TEZ 0.7.0.2.3 with hive 1.2.1.2.3. I can confirm that TEZ is
> much faster than MR in pretty much all cases. Also, with hive, you'll make
> sure you've performed optimizations like aligning ORC stripe sizes with
> HDFS block sizes, and concatenated your tables (not so much an optimization
> as a must for avoiding the small files problem).
>
>
>
> On Wed, Dec 30, 2015 at 2:19 PM, Mich Talebzadeh 
> wrote:
>
> Thanks again Jorn.
>
>
>
>
>
> Both Hive and Sybase IQ are running on the same host. Yes for Sybase IQ I
> have compression enabled. The FACT table in IQ (sales) has LF (read bitmap)
> indexes on the time_id column. For the dimension table (times) I have
> time_id defined as primary key. Also Sybase IQ creates FP (fast projection)
> indexes on every column by default.
>
>
>
> Anyway I am trying to download and build TEZ. Do we know which version of
> TEZ works with Hive 1.2.1 please? 0.8 seems to be in alpha
>
>
>
> Thanks
>
>
>
> Mich Talebzadeh
>
>
>
> *Sybase ASE 15 Gold Medal Award 2008*
>
> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>
>
> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strate

Re: Running the same query on 1 billion rows fact table in Hive on Spark compared to Sybase IQ columnar database

2015-12-30 Thread Marcin Tustin
I'm using TEZ 0.7.0.2.3 with hive 1.2.1.2.3. I can confirm that TEZ is much
faster than MR in pretty much all cases. Also, with Hive, you'll want to make sure
you've performed optimizations like aligning ORC stripe sizes with HDFS
block sizes, and concatenated your tables (not so much an optimization as a
must for avoiding the small files problem).
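
On the stripe-size point, a hedged sketch of how such a table might be declared (columns and values assumed; here the stripe is set to 256 MB to match a 256 MB HDFS block):

CREATE TABLE sales_orc (
  prod_id bigint,
  amount_sold decimal(10,0))
STORED AS ORC
TBLPROPERTIES (
  'orc.stripe.size'='268435456',   -- bytes; align with the HDFS block size
  'orc.compress'='ZLIB');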

On Wed, Dec 30, 2015 at 2:19 PM, Mich Talebzadeh 
wrote:

> Thanks again Jorn.
>
>
>
>
>
> Both Hive and Sybase IQ are running on the same host. Yes for Sybase IQ I
> have compression enabled. The FACT table in IQ (sales) has LF (read bitmap)
> indexes on the time_id column. For the dimension table (times) I have
> time_id defined as primary key. Also Sybase IQ creates FP (fast projection)
> indexes on every column by default.
>
>
>
> Anyway I am trying to download and build TEZ. Do we know which version of
> TEZ works with Hive 1.2.1 please? 0.8 seems to be in alpha
>
>
>
> Thanks
>
>
>
> Mich Talebzadeh
>
>
>
> *Sybase ASE 15 Gold Medal Award 2008*
>
> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>
>
> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
>
> Author of the books* "A Practitioner’s Guide to Upgrading to Sybase ASE
> 15", ISBN 978-0-9563693-0-7*.
>
> co-author *"Sybase Transact SQL Guidelines Best Practices", ISBN
> 978-0-9759693-0-4*
>
> *Publications due shortly:*
>
> *Complex Event Processing in Heterogeneous Environments*, ISBN:
> 978-0-9563693-3-8
>
> *Oracle and Sybase, Concepts and Contrasts*, ISBN: 978-0-9563693-1-4, volume
> one out shortly
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> NOTE: The information in this email is proprietary and confidential. This
> message is for the designated recipient only, if you are not the intended
> recipient, you should destroy it immediately. Any information in this
> message shall not be understood as given or endorsed by Peridale Technology
> Ltd, its subsidiaries or their employees, unless expressly so stated. It is
> the responsibility of the recipient to ensure that this email is virus
> free, therefore neither Peridale Ltd, its subsidiaries nor their employees
> accept any responsibility.
>
>
>
> *From:* Jörn Franke [mailto:jornfra...@gmail.com]
> *Sent:* 30 December 2015 16:29
>
> *To:* user@hive.apache.org
> *Subject:* Re: Running the same query on 1 billion rows fact table in
> Hive on Spark compared to Sybase IQ columnar database
>
>
>
>
> Hmm i think the execution Engine TEZ has (currently) the most
> optimizations on Hive. What about your hardware - is it the same? Do you
> have also compression on Sybase?
>
> Alternatively you need to wait for Hive for interactive analytics (tez 0.8
> + llap).
>
>
> On 30 Dec 2015, at 13:47, Mich Talebzadeh  wrote:
>
> Hi Jorn,
>
>
>
> Thanks for your reply. My Hive version is 1.2.1 on Spark 1.3.1. I have not
> tried it on TEZ. I tried the query on the MR engine and it did not fare better.
> I also ran it without the STDDEV function and found out that the function did
> not slow it down.
>
>
>
> I tried a simple query as follows, built on the sales FACT table (1e9 rows) and
> the dimension table times (1826 rows)
>
>
>
> --
>
> -- Get the total amount sold for each calendar month
>
> --
>
> *SELECT t.calendar_month_desc, SUM(s.amount_sold)*
>
> *FROM sales s, times t WHERE s.time_id = t.time_id*
>
> *GROUP BY t.calendar_month_desc;*
>
>
>
> Now Sybase IQ comes back in around 30 seconds.
>
>
>
> Started query at Dec 30 2015 08:14:33:399AM
>
> (48 rows affected)
>
> Finished query at Dec 30 2015 08:15:04:640AM
>
>
>
> Whereas Hive with the following setting and running the same query
>
>
>
> set
> hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
>
> set hive.optimize.bucketmapjoin=true;
>
> set hive.optimize.bucketmapjoin.sortedmerge=true;
>
>
>
> Comes back in
>
>
>
> 48 rows selected (1514.687 seconds)
>
>
>
> I don’t know what else can be done. Obviously this is all schema on read
> so I am not sure I can change bucketing on FACT table based on one query
> alone!
>
>
>
>
>
>
>
> CREATE TABLE `times`(
>   `time_id` timestamp,
>   `day_name` varchar(9),
>   `day_number_in_week` int,
>   `day_number_in_month` int,
>   `calendar_week_number` int,
>   `fiscal_week_number` int,
>   `week_ending_day` timestamp,
>   `week_ending_day_id` bigint,
>   `calendar_month_number` i

Importing into a hive database with minimal unavailability or renaming a database

2015-12-18 Thread Marcin Tustin
Hi All,

We import our production database into hive on a schedule using sqoop.
Unfortunately, sqoop won't update the table schema in hive when the table
schema has changed in the source database.

Accordingly, to get updates to the table schema we drop the hive table
first.

Unfortunately, this causes the data to be unavailable in hive for a certain
period of time.

Accordingly, I'd like to know how people on this list have tackled the
issue. Is there a way to get sqoop to update the table schema in hive, or
can we import into a staging hive database and rename it?

Thanks,
Marcin
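
One pattern that keeps the unavailability down to a pair of metadata operations (a sketch with assumed table names; Hive cannot rename a database, so the swap is done table by table): have sqoop import into a staging table, then swap names.

ALTER TABLE users RENAME TO users_old;       -- users_staging was just loaded by sqoop
ALTER TABLE users_staging RENAME TO users;
DROP TABLE users_old;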
