Re: Vectorised Query Execution extension

2016-08-04 Thread Jörn Franke
Even if it is possible, it only makes sense up to a limit set by your CPU
and its caches.
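
Some rough arithmetic shows why the block size matters here. This is a Python sketch only: the 8-byte value width, the 32 KiB L1 data cache, and the 10-million-row column are assumptions for illustration, not figures from this thread or from Hive.

```python
BATCH_ROWS = 1024            # block size quoted from the Hive manual
VALUE_BYTES = 8              # assumption: one 64-bit value per row
L1_DATA_CACHE = 32 * 1024    # assumption: a 32 KiB L1 data cache


def vector_bytes(rows: int, value_bytes: int = VALUE_BYTES) -> int:
    """Memory needed to hold one column vector of the given length."""
    return rows * value_bytes


def vectors_fitting(cache_bytes: int, rows: int) -> int:
    """How many such column vectors fit in a cache of cache_bytes."""
    return cache_bytes // vector_bytes(rows)


per_vector = vector_bytes(BATCH_ROWS)                 # bytes for one 1024-row vector
in_l1 = vectors_fitting(L1_DATA_CACHE, BATCH_ROWS)    # vectors that fit in L1
whole_column = vector_bytes(10_000_000)               # bytes for a 10M-row column
print(per_vector, in_l1, whole_column)
```

Under those assumptions a handful of 1024-row vectors stay cache-resident, while a whole-column vector would not fit in any cache level.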

> On 04 Aug 2016, at 22:57, Mich Talebzadeh  wrote:
> 
> As I understand from the manual:
> 
> Vectorized query execution is a Hive feature that greatly reduces the CPU 
> usage for typical query operations like scans, filters, aggregates, and 
> joins. A standard query execution system processes one row at a time. This 
> involves long code .. Vectorized query execution streamlines operations 
> by processing a block of 1024 rows at a time. Within the block, each column 
> is stored as a vector (an array of a primitive data type).
> 
> As far as I can see, Vectorized query execution (VQE) can be applied to most 
> columns and SQL operations. Is it therefore possible to extend it beyond 1024 
> rows to include the whole column in the table?
> 
> VQE would be very useful especially with ORC as it basically means that one 
> can process the whole column separately thus improving performance of the 
> query.
> 
> HTH
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  


Re: hive concurrency not working

2016-08-04 Thread Raj hadoop
Thanks everyone.

We are raising a case with Hortonworks.

On Wed, Aug 3, 2016 at 6:44 PM, Raj hadoop  wrote:

> Dear All,
>
> In need of your help.
>
> We have a 4-node Hortonworks cluster, and the problem is that Hive allows
> only one user at a time.
>
> If a second user needs to log in, Hive does not work.
>
> Could someone please help me with this?
>
> Thanks,
> Rajesh
>


Re: Vectorised Query Execution extension

2016-08-04 Thread Gopal Vijayaraghavan
> Vectorized query execution streamlines operations by processing a block
> of 1024 rows at a time.

The real win of vectorization + columnar is that you get to take advantage
of them at the same time.

We get to execute the function once per 1024 rows when things are
repeating - particularly true when the repetition naturally clusters
together (like a Date field).
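
To make the repetition point concrete, here is a minimal Python sketch. It is not Hive's Java implementation; `ColumnVector`, `make_vector` and `apply_fn` are hypothetical names. When a batch holds one repeated value, the expression runs once for the batch instead of once per row.

```python
from dataclasses import dataclass
from typing import Callable, List

BATCH_SIZE = 1024  # rows per batch, as quoted from the manual


@dataclass
class ColumnVector:
    """Hypothetical stand-in for one column of a row batch."""
    values: List[int]
    is_repeating: bool = False  # True: values[0] stands for every row


def make_vector(rows: List[int]) -> ColumnVector:
    # Mark the vector as repeating when every row holds the same value.
    if rows and all(v == rows[0] for v in rows):
        return ColumnVector(values=[rows[0]], is_repeating=True)
    return ColumnVector(values=list(rows))


def apply_fn(fn: Callable[[int], int], vec: ColumnVector,
             counter: List[int]) -> ColumnVector:
    # counter[0] records how many times fn was actually invoked.
    if vec.is_repeating:
        counter[0] += 1
        return ColumnVector(values=[fn(vec.values[0])], is_repeating=True)
    counter[0] += len(vec.values)
    return ColumnVector(values=[fn(v) for v in vec.values])


calls = [0]
same_day = make_vector([20160804] * BATCH_SIZE)      # a clustered date column
years = apply_fn(lambda d: d // 10000, same_day, calls)
print(calls[0], years.values)
```

A larger block makes it less likely that every row in it shares one value, which is the first cost listed further down.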

And what's the likelihood that no row in a file has nulls? The current
mode of operation prevents small fractions of nulls from hurting the whole
runtime.

Making it 1024 also was because of the way Java -XX:+UseNUMA allocates
pages. If you look at LLAP startup options, you'll notice that it does
most of the allocations off the TLAB, to restrict the allocations to the
same NUMA zone as the thread.

No such easy solutions exist for larger allocations.

So:

1) we'd lose isRepeating=true when you increase the block size
2) we get slower memory with NUMA interleaving when you increase the size
of allocations.
3) we lose low-pause GC effects when allocating from the Humongous section
of the G1GC

The illusion of Java is that allocations are free. Imagine a count(1)
that allocates a huge array before returning, versus a reader which reuses
the same memory to read chunks and operate on them: which one would pause
more often for the GC?

> VQE would be very useful especially with ORC as it basically means that
> one can process the whole column separately thus improving performance of
> the query.

Why and How? 

The columnar layout already processes the whole column separately because of
how chunks are read out.

There's more to be done there for pure performance, of course - we could
run the pre-exec filters on the String dictionaries and then only run pure
int:int comparisons for the offsets, we could execute deterministic UDFs
once per dictionary offset to make the isRepeating model operate across a
whole ORC stripe.
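
A minimal Python sketch of that dictionary idea, under the assumption that a column is held as a list of distinct strings plus one integer offset per row (`filter_dictionary_encoded` is a hypothetical name, not an ORC or Hive API):

```python
from typing import Callable, List, Set


def filter_dictionary_encoded(
    dictionary: List[str],
    offsets: List[int],
    predicate: Callable[[str], bool],
) -> List[int]:
    """Return the row indexes whose decoded string satisfies predicate."""
    # Evaluate the expensive string predicate once per distinct value.
    matching: Set[int] = {i for i, s in enumerate(dictionary) if predicate(s)}
    # The per-row work is now a pure integer membership test.
    return [row for row, off in enumerate(offsets) if off in matching]


dictionary = ["http://mts.ru/a", "http://example.org", "http://mts.ru/b"]
offsets = [0, 1, 1, 2, 0, 1]  # one dictionary offset per row
rows = filter_dictionary_encoded(dictionary, offsets, lambda s: "mts.ru" in s)
print(rows)
```

The string predicate runs three times here, once per dictionary entry, regardless of how many rows there are.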

Cheers,
Gopal




Re: Create Non-partitioned table from partitioned table using CREATE TABLE .. LIKE

2016-08-04 Thread Mich Talebzadeh
Ok

Does it matter whether the table you create is accessible to Hive?

You can read your Hive table in Spark, assuming you know the table name:

// Read hive table. This one is partitioned
scala> val s = HiveContext.table("oraclehadoop.sales")
s: org.apache.spark.sql.DataFrame = [prod_id: bigint, cust_id: bigint,
time_id: timestamp, channel_id: bigint, promo_id: bigint, quantity_sold:
decimal(10,0), amount_sold: decimal(10,0), year: int, month: int]

// Save that table in Parquet format to an external directory, without the
// partition columns (year: int, month: int)
scala> val s2 = s.select('prod_id, 'cust_id, 'time_id, 'channel_id,
'promo_id, 'quantity_sold, 'amount_sold)
s2: org.apache.spark.sql.DataFrame = [prod_id: bigint, cust_id: bigint,
time_id: timestamp, channel_id: bigint, promo_id: bigint, quantity_sold:
decimal(10,0), amount_sold: decimal(10,0)]
scala> s2.write.mode("overwrite").parquet("/data/stg/newtable/sales5")
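
Since the schema is not known up front, the column list above would not have to be typed by hand. The following is a rough sketch in plain Python rather than Spark: it assumes the column names, types and partition column names have already been read from the metastore, and the helper names and the generated DDL are illustrative only.

```python
from typing import List, Sequence, Tuple

Column = Tuple[str, str]  # (name, type)


def non_partition_columns(schema: Sequence[Column],
                          partition_cols: Sequence[str]) -> List[Column]:
    """Drop the partition columns, keeping the original column order."""
    drop = {c.lower() for c in partition_cols}
    return [(n, t) for n, t in schema if n.lower() not in drop]


def create_table_ddl(table: str, columns: Sequence[Column], location: str) -> str:
    """Build a CREATE EXTERNAL TABLE statement over the written Parquet files."""
    cols = ", ".join(f"{n} {t}" for n, t in columns)
    return (f"CREATE EXTERNAL TABLE {table} ({cols}) "
            f"STORED AS PARQUET LOCATION '{location}'")


schema = [
    ("prod_id", "bigint"), ("cust_id", "bigint"), ("time_id", "timestamp"),
    ("channel_id", "bigint"), ("promo_id", "bigint"),
    ("quantity_sold", "decimal(10,0)"), ("amount_sold", "decimal(10,0)"),
    ("year", "int"), ("month", "int"),
]
kept = non_partition_columns(schema, ["year", "month"])
ddl = create_table_ddl("sales5", kept, "/data/stg/newtable/sales5")
print(ddl)
```

The kept names could feed the select, and the DDL string could be run in Hive if the new table does need to be visible there.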

HTH


Dr Mich Talebzadeh






On 4 August 2016 at 23:21, Nagabhushanam Bheemisetty  wrote:

> Yes, you are correct, that is just a metadata copy, and that is all I
> need, but without the partition :(
Re: Create Non-partitioned table from partitioned table using CREATE TABLE .. LIKE

2016-08-04 Thread Nagabhushanam Bheemisetty
Yes, you are correct, that is just a metadata copy, and that is all I need,
but without the partition :(

On Thu, Aug 4, 2016 at 5:15 PM Mich Talebzadeh 
wrote:

> Yes, but that essentially copies the metadata and leaves the partition
> there with no data. It is just an image copy. It won't help in this case.


Re: Create Non-partitioned table from partitioned table using CREATE TABLE .. LIKE

2016-08-04 Thread Mich Talebzadeh
Yes, but that essentially copies the metadata and leaves the partition there
with no data. It is just an image copy. It won't help in this case.

Dr Mich Talebzadeh






On 4 August 2016 at 23:08, Nagabhushanam Bheemisetty  wrote:

> Well, you can create it using LIKE.
>
>
> CREATE EXTERNAL TABLE sales5 LIKE sales;


Re: Create Non-partitioned table from partitioned table using CREATE TABLE .. LIKE

2016-08-04 Thread Nagabhushanam Bheemisetty
Well, you can create it using LIKE.


CREATE EXTERNAL TABLE sales5 LIKE sales;


On Thu, Aug 4, 2016 at 5:06 PM Mich Talebzadeh 
wrote:

> Which process creates the master table in Hive as an external table? There
> must be a process that creates the master table as external table?  Hive
> knows about the schema of that table. It is in Hive metastore.
>
> You cannot create an external table with CREATE EXTERNAL TABLE AS ...
>
> hive> CREATE EXTERNAL TABLE sales5 AS SELECT * FROM SALES;
> FAILED: SemanticException [Error 10070]: CREATE-TABLE-AS-SELECT cannot
> create external table


Re: Create Non-partitioned table from partitioned table using CREATE TABLE .. LIKE

2016-08-04 Thread Mich Talebzadeh
Which process creates the master table in Hive as an external table? There
must be a process that creates the master table as external table?  Hive
knows about the schema of that table. It is in Hive metastore.

You cannot create an external table with CREATE EXTERNAL TABLE AS ...

hive> CREATE EXTERNAL TABLE sales5 AS SELECT * FROM SALES;
FAILED: SemanticException [Error 10070]: CREATE-TABLE-AS-SELECT cannot
create external table

Dr Mich Talebzadeh






On 4 August 2016 at 22:56, Nagabhushanam Bheemisetty  wrote:

> I only get the table names that I need to ingest. So I don't know the
> master table schema upfront.
>
> Yes the new table based on master table which is partitioned but new table
> should not be partitioned and should not have partition column.


Re: Create Non-partitioned table from partitioned table using CREATE TABLE .. LIKE

2016-08-04 Thread Nagabhushanam Bheemisetty
I only get the names of the tables that I need to ingest, so I don't know the
master table schema up front.

Yes, the new table is based on the master table, which is partitioned, but the
new table should not be partitioned and should not have the partition column.

On Thu, Aug 4, 2016 at 4:54 PM Mich Talebzadeh 
wrote:

> Do you know the existing table schema? The new table schema will be based
> on that table without partitioning?


Re: Create Non-partitioned table from partitioned table using CREATE TABLE .. LIKE

2016-08-04 Thread Mich Talebzadeh
Do you know the existing table schema? The new table schema will be based
on that table without partitioning?



Dr Mich Talebzadeh






On 4 August 2016 at 22:05, Nagabhushanam Bheemisetty  wrote:

> Hi, I have a scenario where I need to create a table from a partitioned
> table, but my destination table should not be partitioned. I won't know the
> schema, so I cannot create the destination table manually. By the way, both
> tables are external tables.
>


Vectorised Query Execution extension

2016-08-04 Thread Mich Talebzadeh
As I understand from the manual:

Vectorized query execution is a Hive feature that greatly reduces the CPU
usage for typical query operations like scans, filters, aggregates, and
joins. A standard query execution system processes one row at a time. This
involves long code .. Vectorized query execution streamlines operations
by processing a block of 1024 rows at a time. Within the block, each column
is stored as a vector (an array of a primitive data type).

As far as I can see, Vectorized query execution (VQE) can be applied to
most columns and SQL operations. Is it therefore possible to extend it
beyond 1024 rows to include the whole column in the table?

VQE would be very useful especially with ORC as it basically means that one
can process the whole column separately thus improving performance of the
query.

HTH

Dr Mich Talebzadeh





Re: Hive LIKE predicate. '_' wildcard decrease performance

2016-08-04 Thread Gopal Vijayaraghavan
> where res_url like '%mts.ru%'
...
> where res_url like '%mts_ru%'
...
> Why does the '_' wildcard decrease performance?

Because it misses the fast path by just one "_".

ORC vectorized reader has a zero-copy check for 3 patterns - prefix,
suffix and middle.

That means "https://%", "%.html", "%mts.ru%" will hit the fast path -
which uses StringExpr::equal() which JITs into the following.

https://issues.apache.org/jira/secure/attachment/12748720/string-intrinsic-sse.png


In Hive-2.0, you can mix these up too to get "https:%mts%.html" in a
ChainedChecker.


Anything other than these 3 cases becomes a Regex and takes the slow path.

The pattern you mentioned gets rewritten into ".*mts.ru.*" and the inner
loop has a new String() as the input to the matcher + matcher.matches() in
it.
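
As an illustration of the split described above, here is a small Python sketch. It is not Hive's code: the function names are hypothetical, and escape characters and patterns without any wildcard are deliberately left out.

```python
import re


def classify_like(pattern: str) -> str:
    """Classify a SQL LIKE pattern as 'prefix', 'suffix', 'middle' or 'regex'."""
    body = pattern.strip("%")
    # Any '_' or an interior '%' rules out the three simple substring checks.
    if not body or "_" in body or "%" in body:
        return "regex"
    starts, ends = pattern.startswith("%"), pattern.endswith("%")
    if starts and ends:
        return "middle"
    if ends:
        return "prefix"
    if starts:
        return "suffix"
    return "regex"  # not one of the three shapes named above


def like_to_regex(pattern: str) -> str:
    """Translate LIKE wildcards into the equivalent regular expression."""
    out = []
    for ch in pattern:
        if ch == "%":
            out.append(".*")
        elif ch == "_":
            out.append(".")
        else:
            out.append(re.escape(ch))
    return "".join(out)


for p in ["https://%", "%.html", "%mts.ru%", "%mts_ru%"]:
    print(p, classify_like(p))
print(like_to_regex("%mts_ru%"))
```

Only the last pattern falls through to the regex branch, and its translation is the ".*mts.ru.*" mentioned above.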

I've put in some patches recently which rewrite it into lazy regexes like
".*?mts.ru.*?", so the regex DFA will be smaller (HIVE-13196).

That improves the case where the pattern is found, but does nothing to
improve the performance of the new String() GC garbage.

Cheers,
Gopal




Re: Malformed orc file

2016-08-04 Thread Prasanth Jayachandran
Hi

In the case of streaming, while a transaction is open the ORC file is not
closed and hence may not be flushed completely. Did the transaction commit
successfully? Or was any exception thrown during the writes or the commit?

Thanks
Prasanth

On Aug 3, 2016, at 6:09 AM, Igor Kuzmenko wrote:

Hello, I've got a malformed ORC file in my Hive table. The file was created by
the Hive Streaming API and I have no idea under what circumstances it became
corrupted.

File on google drive: 
link

Exception message when trying to perform select from table:
ERROR : Vertex failed, vertexName=Map 1,
vertexId=vertex_1468498236400_1106_6_00, diagnostics=[Task failed,
taskId=task_1468498236400_1106_6_00_00, diagnostics=[TaskAttempt 0 failed,
info=[Error: Failure while running task:java.lang.RuntimeException:
java.lang.RuntimeException: java.io.IOException:
org.apache.hadoop.hive.ql.io.FileFormatException: Malformed ORC file
hdfs://sorm-master01.msk.mts.ru:8020/apps/hive/warehouse/pstn_connections/dt=20160711/directory_number_last_digit=5/delta_71700156_71700255/bucket_0.
Invalid postscript length 0
    at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:173)
    at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:139)
    at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:344)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:181)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:172)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:172)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:168)
    at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: java.io.IOException:
org.apache.hadoop.hive.ql.io.FileFormatException: Malformed ORC file
hdfs://sorm-master01.msk.mts.ru:8020/apps/hive/warehouse/pstn_connections/dt=20160711/directory_number_last_digit=5/delta_71700156_71700255/bucket_0.
Invalid postscript length 0
    at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:196)
    at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.next(TezGroupedSplitsInputFormat.java:142)
    at org.apache.tez.mapreduce.lib.MRReaderMapred.next(MRReaderMapred.java:113)
    at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:61)
    at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run(MapRecordProcessor.java:326)
    at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:150)
    ... 14 more
Caused by: java.io.IOException:
org.apache.hadoop.hive.ql.io.FileFormatException: Malformed ORC file
hdfs://sorm-master01.msk.mts.ru:8020/apps/hive/warehouse/pstn_connections/dt=20160711/directory_number_last_digit=5/delta_71700156_71700255/bucket_0.
Invalid postscript length 0
    at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java:97)
    at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderCreationException(HiveIOExceptionHandlerUtil.java:57)
    at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:251)
    at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:193)
    ... 19 more
Caused by: org.apache.hadoop.hive.ql.io.FileFormatException: Malformed ORC file 

Re: Iterating over partitions using the metastore API

2016-08-04 Thread Elliot West
Thanks for your reply. I hadn't considered driving it from a list of
partition names.

To avoid the N+1 reads I am considering reading in batches like so:

   - Sorting the names
   - Taking every nth name (where n is the batch size) to use as a batch
   boundary.
   - Building a filter derived from boundary_name[n-1] and boundary_name[n].
   - Then selecting the batch using the filter and
   IMetaStoreClient.listPartitionsByFilter(...)

A drawback to this approach is that filters only support string key types
IIRC.
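
The boundary selection could be sketched roughly as follows. This is plain Python rather than the Java client; `batch_boundaries` is a hypothetical helper, and `names_in_range` is only an in-memory stand-in for the filtered metastore call. How a name range maps onto a filter expression is left out, since that depends on the partition keys.

```python
from typing import List, Optional, Tuple

Range = Tuple[str, Optional[str]]  # (inclusive lower, exclusive upper or None)


def batch_boundaries(names: List[str], batch_size: int) -> List[Range]:
    """Split the sorted names into ranges of at most batch_size names each."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    ordered = sorted(names)
    starts = ordered[::batch_size]  # every nth name opens a new batch
    # Pair each boundary with the next one; the last batch has no upper bound.
    return [
        (lo, starts[i + 1] if i + 1 < len(starts) else None)
        for i, lo in enumerate(starts)
    ]


def names_in_range(names: List[str], lo: str, hi: Optional[str]) -> List[str]:
    # Stand-in for the filtered metastore call: select one batch in memory.
    return [n for n in sorted(names) if n >= lo and (hi is None or n < hi)]


names = [f"dt=201608{d:02d}" for d in range(1, 11)]
ranges = batch_boundaries(names, 4)
for lo, hi in ranges:
    print(lo, hi, len(names_in_range(names, lo, hi)))
```

Only the list of names and one batch of partitions need to be held in memory at a time.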

Thanks,

Elliot.

On 4 August 2016 at 13:15, Furcy Pin  wrote:

> Hi Elliot,
>
> I guess you can use IMetaStoreClient.listPartitionNames instead, and
> then use IMetaStoreClient.getPartition for each partition.
> This might be slow though, as you will have to make 10 000 calls to get
> them.
>
> Another option I'd consider is connecting directly to the Hive metastore.
> This requires a little more configuration (granting your process read-only
> access to the metastore), and might make your implementation dependent
> on the metastore's underlying implementation (mysql, postgres, derby),
> unless you use an ORM to query it.
> Anyway, you could ask the metastore directly via JDBC for all the
> partitions, and get java.sql.ResultSet that can be iterated over.
>
> Regards,
>
> Furcy
>
>
> On Thu, Aug 4, 2016 at 1:29 PM, Elliot West  wrote:
>
>> Hello,
>>
>> I have a process that needs to iterate over all of the partitions in a
>> table using the metastore API.The process should not need to know about the
>> structure or meaning of the partition key values (i.e. whether they are
>> dates, numbers, country names etc), or be required to know the existing
>> range of partition values. Note that the process only needs to know about
>> one partition at any given time.
>>
>> Currently I am naively using the IMetaStoreClient.listPartitions(String,
>> String, short) method to retrieve all partitions but clearly this is not
>> scalable for tables with many 10,000s of partitions. I'm finding that even
>> with relatively large heaps I'm running into OOM exceptions when the
>> metastore API is building the List<Partition> return value. I've
>> experimented with using IMetaStoreClient.listPartitionSpecs(String,
>> String, int) but this too seems to have high memory requirements.
>>
>> Can anyone suggest how I can better iterate over partitions in a manner
>> that is more considerate of memory usage?
>>
>> Thanks,
>>
>> Elliot.
>>
>>
>


Hive LIKE predicate. '_' wildcard decreases performance

2016-08-04 Thread Igor Kuzmenko
I've got a Hive transactional table 'data_http' in ORC format, containing
around 100,000,000 rows.

When I execute the query:

select * from data_http
where res_url like '%mts.ru%'

it completes in 10 seconds.

But executing the query

select * from data_http
where res_url like '%mts_ru%'


takes more than 30 minutes.

Why does the '_' wildcard decrease performance?
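
For reference, here is a small sketch of how the two patterns differ in
meaning. '%' matches any run of characters and '_' matches exactly one
arbitrary character, so '%mts.ru%' is equivalent to a plain substring test
while '%mts_ru%' is not. Whether Hive actually takes a faster path for the
first form is an assumption on my part and is not confirmed anywhere in this
thread. The class LikeSemantics and its methods are made up for illustration
and ignore LIKE escape characters.

```java
import java.util.regex.Pattern;

/** Hypothetical helper showing SQL LIKE semantics via an equivalent regex. */
class LikeSemantics {

  /** Translates a LIKE pattern to a regex: '%' becomes ".*", '_' becomes ".". */
  static Pattern likeToRegex(String like) {
    StringBuilder regex = new StringBuilder();
    for (char c : like.toCharArray()) {
      if (c == '%') {
        regex.append(".*");
      } else if (c == '_') {
        regex.append('.');
      } else {
        // Every other character, including '.', is matched literally.
        regex.append(Pattern.quote(String.valueOf(c)));
      }
    }
    return Pattern.compile(regex.toString(), Pattern.DOTALL);
  }

  /** Returns true when the whole value matches the LIKE pattern. */
  static boolean like(String value, String pattern) {
    return likeToRegex(pattern).matcher(value).matches();
  }
}
```

With this helper, like(url, "%mts.ru%") gives the same answer as
url.contains("mts.ru"), whereas like(url, "%mts_ru%") also accepts values
such as "mtsXru" and so cannot be reduced to a single substring search.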


Re: Create table from orc file

2016-08-04 Thread Johannes Stamminger
Some progress: I could eliminate the error reported in a): the data file needs 
to be named 00_0 and must be placed in the directory denoted by the 
location given at table creation. This is what the error message is about? ;-)


Now the situation for a) is the same as for b):

Trying to fetch data by

select * from CFA1_Fan_Speed_DMC limit 1;

results in 

Error: java.io.IOException: java.io.IOException: ORC does not support type 
conversion from file type timestamp (1) to reader type 
struct (1) (state=,code=0)



But if I create a comparable table within hive, things do work:

create table x(y struct);

insert into x select named_struct('a', current_timestamp, 'b', bigint(42), 
'c', float(4.2)) from dummy limit 1;

select * from x;
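+-------------------------------------------------+
|                       x.y                       |
+-------------------------------------------------+
| {"a":"2016-08-04 14:49:01.636","b":42,"c":4.2}  |
+-------------------------------------------------+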



The tables look similar:

describe CFA1_Fan_Speed_DMC;
+-----------+------------+----------+
| col_name  | data_type  | comment  |
+-----------+------------+----------+
| record    | struct     |          |
+-----------+------------+----------+

describe x;
+-----------+------------+----------+
| col_name  | data_type  | comment  |
+-----------+------------+----------+
| y         | struct     |          |
+-----------+------------+----------+



So does anybody have an idea what might be wrong with my external table 
access? Anything that I could give a try?


Re: Iterating over partitions using the metastore API

2016-08-04 Thread Furcy Pin
Hi Elliot,

I guess you can use IMetaStoreClient.listPartitionNames instead, and then
use IMetaStoreClient.getPartition for each partition.
This might be slow though, as you will have to make 10 000 calls to get
them.
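
A minimal sketch of that name-driven iteration, so that only one partition
is held in memory at a time. To keep it self-contained it does not depend on
Hive: the per-name fetch is passed in as a function, and the class name
PartitionIteratorSketch is made up. With a real client the names would come
from listPartitionNames and the function would wrap getPartition, including
its checked exceptions, which is not shown here.

```java
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

/** Hypothetical iterator: fetches one partition per call, driven by names. */
class PartitionIteratorSketch<P> implements Iterator<P> {
  private final Iterator<String> names;
  private final Function<String, P> fetchOne;

  /**
   * @param partitionNames names such as "dt=2016-08-04" (cheap to hold in memory)
   * @param fetchOne       stand-in for a per-name metastore lookup
   */
  PartitionIteratorSketch(List<String> partitionNames, Function<String, P> fetchOne) {
    this.names = partitionNames.iterator();
    this.fetchOne = fetchOne;
  }

  @Override
  public boolean hasNext() {
    return names.hasNext();
  }

  @Override
  public P next() {
    // One remote call per partition: the N+1 cost mentioned above.
    return fetchOne.apply(names.next());
  }
}
```

The trade-off is latency for memory: nothing is fetched until next() is
called, but every partition costs a round trip.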

Another option I'd consider is connecting directly to the Hive metastore.
This requires a little more configuration (grant read-only access to your
process to the metastore), and might make your implementation dependent
on the metastore's underlying implementation (mysql, postgres, derby), unless
you use an ORM to query it.
Anyway, you could ask the metastore directly via JDBC for all the
partitions, and get java.sql.ResultSet that can be iterated over.
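
A rough sketch of the JDBC route, using only java.sql. The table and column
names (DBS, TBLS, PARTITIONS, PART_NAME and so on) are my assumption about
the metastore schema and should be checked against your metastore version.
Identifier quoting and result streaming behaviour also differ between
databases, which is exactly the dependency on the underlying implementation
mentioned above. The class name MetastoreJdbcSketch is made up.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.function.Consumer;

/** Hypothetical helper: streams partition names straight from the metastore DB. */
class MetastoreJdbcSketch {

  // Assumed schema: PARTITIONS -> TBLS -> DBS joined on TBL_ID and DB_ID.
  static final String PARTITION_NAMES_SQL =
      "SELECT p.PART_NAME FROM PARTITIONS p "
          + "JOIN TBLS t ON p.TBL_ID = t.TBL_ID "
          + "JOIN DBS d ON t.DB_ID = d.DB_ID "
          + "WHERE d.NAME = ? AND t.TBL_NAME = ? "
          + "ORDER BY p.PART_NAME";

  /** Calls the action once per partition name without building a full list. */
  static void forEachPartitionName(
      Connection conn, String db, String table, Consumer<String> action)
      throws SQLException {
    try (PreparedStatement ps = conn.prepareStatement(PARTITION_NAMES_SQL)) {
      ps.setFetchSize(1000); // only a hint; drivers may ignore it
      ps.setString(1, db);
      ps.setString(2, table);
      try (ResultSet rs = ps.executeQuery()) {
        while (rs.next()) {
          action.accept(rs.getString(1));
        }
      }
    }
  }
}
```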

Regards,

Furcy


On Thu, Aug 4, 2016 at 1:29 PM, Elliot West  wrote:

> Hello,
>
> I have a process that needs to iterate over all of the partitions in a
> table using the metastore API. The process should not need to know about the
> structure or meaning of the partition key values (i.e. whether they are
> dates, numbers, country names etc), or be required to know the existing
> range of partition values. Note that the process only needs to know about
> one partition at any given time.
>
> Currently I am naively using the IMetaStoreClient.listPartitions(String,
> String, short) method to retrieve all partitions but clearly this is not
> scalable for tables with many 10,000s of partitions. I'm finding that even
> with relatively large heaps I'm running into OOM exceptions when the
> metastore API is building the List<Partition> return value. I've
> experimented with using IMetaStoreClient.listPartitionSpecs(String,
> String, int) but this too seems to have high memory requirements.
>
> Can anyone suggest how I can better iterate over partitions in a manner
> that is more considerate of memory usage?
>
> Thanks,
>
> Elliot.
>
>


Iterating over partitions using the metastore API

2016-08-04 Thread Elliot West
Hello,

I have a process that needs to iterate over all of the partitions in a
table using the metastore API. The process should not need to know about the
structure or meaning of the partition key values (i.e. whether they are
dates, numbers, country names etc), or be required to know the existing
range of partition values. Note that the process only needs to know about
one partition at any given time.

Currently I am naively using the IMetaStoreClient.listPartitions(String,
String, short) method to retrieve all partitions but clearly this is not
scalable for tables with many 10,000s of partitions. I'm finding that even
with relatively large heaps I'm running into OOM exceptions when the
metastore API is building the List<Partition> return value. I've
experimented with using IMetaStoreClient.listPartitionSpecs(String, String,
int) but this too seems to have high memory requirements.

Can anyone suggest how I can better iterate over partitions in a manner
that is more considerate of memory usage?

Thanks,

Elliot.