Re: clarification please

2015-10-29 Thread Jörn Franke

> On 29 Oct 2015, at 06:43, Ashok Kumar wrote:
> 
> hi gurus,
> 
> kindly clarify the following please
> 
> Hive currently does not support indexes or indexes are not used in the query
Not correct. See https://snippetessay.wordpress.com
> The lowest granularity for concurrency is partition. If table is partitioned, 
> then partition will be lucked in DML operation
Not correct for select queries. For all other queries see Hive transactions.
> What is the best file format to store Hive table in HDFS? Is this ORC or Avro 
> that allow being split and support block compression?
Depends on your needs. Avro is an exchange format between different 
systems. ORC is very efficient for everything related to SQL-style analysis 
thanks to its internal indexes, Bloom filters, etc. Both ORC and Avro support 
block compression with several compression codecs.
> Text/CSV files. By default if file type is not specified at creation time, 
> Hive will default to text file?
Depends on how it is configured.

> 
> 
> Thanks


Re: clarification please

2015-10-29 Thread Ashok Kumar
Thank you sir. Very helpful 



Re: clarification please

2015-10-29 Thread Alan Gates




Ashok Kumar 
October 28, 2015 at 22:43
hi gurus,

kindly clarify the following please

  * Hive currently does not support indexes or indexes are not used in
the query

Mostly true.  There is a CREATE INDEX statement, but Hive does not use the 
resulting index by default.  Some storage formats (ORC, Parquet I think) 
have their own indices they use internally to speed up access.
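
As a rough sketch of what that looks like in HiveQL: the table `sales`, the 
column `customer_id`, and the index name are made-up for illustration, and 
index support was removed in Hive 3.0, so this applies to older releases only.

```sql
-- Hypothetical table, column, and index names, for illustration only.
CREATE INDEX idx_sales_customer
ON TABLE sales (customer_id)
AS 'COMPACT'
WITH DEFERRED REBUILD;

-- With DEFERRED REBUILD the index stays empty until it is rebuilt explicitly.
ALTER INDEX idx_sales_customer ON sales REBUILD;

-- Automatic index use is off by default; this asks the optimizer to consider it.
SET hive.optimize.index.filter=true;
```

Whether the optimizer then actually picks the index depends on the query and 
the Hive version, so treat this as a starting point rather than a guarantee.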


  * The lowest granularity for concurrency is partition. If table is
partitioned, then partition will be lucked in DML operation

lucked = locked?  I'm not sure what you intended here.  If you mean 
locked, then it depends.  By default Hive doesn't use locking.  You can 
set it up to do locking via ZooKeeper or as part of Hive transactions.  
They have different locking models.  See 
https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions and 
https://cwiki.apache.org/confluence/display/Hive/Locking for more 
information.
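
For orientation, a minimal sketch of the settings involved. These are 
normally placed in hive-site.xml rather than set per session, and the table 
name `sales` is hypothetical. Check the two wiki pages above for the exact 
combination your Hive version expects.

```sql
-- Concurrency support is off by default.
SET hive.support.concurrency=true;

-- Option A: ZooKeeper-based lock manager (also needs hive.zookeeper.quorum).
SET hive.lock.manager=org.apache.hadoop.hive.ql.lockmgr.zookeeper.ZooKeeperHiveLockManager;

-- Option B: the transaction manager used by Hive transactions.
-- SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

-- Inspect the locks currently held on a (hypothetical) table.
SHOW LOCKS sales;
```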


You can sub-partition using buckets, but for most queries partition is 
the lowest level of granularity.  Hive does a lot of work to make sure it 
reads only the relevant partitions for a query.
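
A small sketch of a partitioned and bucketed table, assuming a made-up 
`page_views` table with an arbitrary bucket count of 32:

```sql
-- Hypothetical table: one directory per view_date, 32 bucket files in each.
CREATE TABLE page_views (
  user_id BIGINT,
  url     STRING
)
PARTITIONED BY (view_date STRING)
CLUSTERED BY (user_id) INTO 32 BUCKETS;

-- The filter on the partition column lets Hive skip every other partition.
SELECT count(*)
FROM page_views
WHERE view_date = '2015-10-28';
```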


  * What is the best file format to store Hive table in HDFS? Is this
ORC or Avro that allow being split and support block compression?

It depends on what you want to do.  ORC and Parquet do better for 
traditional data warehousing type queries because they are columnar 
formats and have lots of optimizations built in for fast access, such as 
pushing filters down into the storage level.  People like Avro and other 
self-describing formats when their data brings its own structure.  We very 
frequently see pipelines where people dump Avro, text, etc. into Hive 
and then ETL it into ORC.
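
That kind of pipeline can be sketched roughly as follows. The table names, 
columns, and HDFS path are all hypothetical, and ZLIB is just one of the 
codecs ORC accepts.

```sql
-- Landing table over raw comma-delimited text files (hypothetical path).
CREATE EXTERNAL TABLE raw_events (
  event_time STRING,
  user_id    BIGINT,
  payload    STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/raw/events';

-- Query-side table in ORC with block compression.
CREATE TABLE events_orc (
  event_time STRING,
  user_id    BIGINT,
  payload    STRING
)
STORED AS ORC
TBLPROPERTIES ('orc.compress'='ZLIB');

-- The ETL step: rewrite the text data as ORC.
INSERT OVERWRITE TABLE events_orc
SELECT event_time, user_id, payload
FROM raw_events;
```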


  * Text/CSV files. By default if file type is not specified at
creation time, Hive will default to text file?

Out of the box yes, but you can change that in your Hive installation by 
setting hive.default.fileformat in your hive-site.xml.
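
The same property can also be tried out for a single session, which is an 
easy way to see the effect before touching hive-site.xml. The table name 
below is made up.

```sql
-- Session-level override of the default storage format.
SET hive.default.fileformat=ORC;

-- No STORED AS clause, so the default above applies.
CREATE TABLE t_default_format (id INT);

-- The InputFormat/OutputFormat lines show which format was used.
DESCRIBE FORMATTED t_default_format;
```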


Alan.



Thanks