There are a number of other things you can do with Hive to improve your simple
query.
First, it is best to upgrade to Hive 1.2.1 or above. The upgrade requires minimal effort; just make sure that you upgrade your metastore schema as well.
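The likely reason the bloom filter appeared to do nothing is the version: bloom filter support in ORC only arrived around the Hive 1.2 release, so on 0.14 the table property is most probably stored but never acted on by the ORC writer. Once on 1.2.1 or above, a minimal variant of your own DDL along these lines should take effect (the fpp property is optional and shown purely as an illustration; the filters are written at load time, so existing data would have to be rewritten, e.g. with INSERT OVERWRITE):

create table test
(
  col1 STRING,
  col2 STRING
) STORED AS ORC
tblproperties (
  "orc.bloom.filter.columns"="col1",  -- build a bloom filter on col1 in each ORC stripe
  "orc.bloom.filter.fpp"="0.05"       -- optional: target false positive probability
);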
Second, your first column appears to be unique (?), since it is an ID, so in all probability it is a monotonically increasing number. You can therefore deploy bucketing to spread your col1 uniformly across buckets, as its selectivity is 1.
I just wrote a simple shell routine to create 100K rows in an ORC table, divided into 256 buckets. I am running Hive on the Spark engine:
CREATE TABLE txtest (
  ID string
, DESCRIPTION string
)
CLUSTERED BY (ID) INTO 256 BUCKETS
STORED AS ORC
TBLPROPERTIES (
  "orc.create.index"="true",
  "orc.compress"="SNAPPY",
  "orc.stripe.size"="16777216",
  "orc.row.index.stride"="10000"
);
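(The shell routine itself is not reproduced here; the load step it performs boils down to something like the following, assuming a hypothetical staging table txtest_stage holding the 100K generated rows:)

set hive.enforce.bucketing = true;  -- hash rows into the 256 buckets on insert (off by default before Hive 2.x)
INSERT INTO TABLE txtest
SELECT id, description
FROM txtest_stage;                  -- hypothetical staging table with the generated rows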
INFO : Loading data to table test.txtest from
hdfs://rhes564:9000/user/hive/warehouse/test.db/txtest/.hive-staging_hive_2016-01-29_16-10-04_765_4205127232246216560-13/-ext-10000
INFO : Table test.txtest stats: [numFiles=256, numRows=100000,
totalSize=5653369, rawDataSize=22200000]
OK, let us just run a simple point query as you did:
0: jdbc:hive2://rhes564:10010/default> use test;
No rows affected (0.02 seconds)
0: jdbc:hive2://rhes564:10010/default> select * from txtest where id = '1234';
+------------+-----------------------------------------------------+--+
| txtest.id | txtest.description |
+------------+-----------------------------------------------------+--+
| 1234 | R_MaH9oKm_OfLbe8LumdSXzBvDFJFrUIkvUoRm_iDPdeegPEWf |
+------------+-----------------------------------------------------+--+
1 row selected (0.887 seconds)
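If you want to check that such a point query really prunes down to one bucket and a handful of ORC stripes rather than scanning the whole table, something along these lines can help (these settings exist in Hive, though their defaults vary by version):

set hive.optimize.ppd = true;           -- push the WHERE predicate down to the ORC reader
set hive.optimize.index.filter = true;  -- let the reader use the ORC row index (and bloom filters, where present) to skip stripes
explain extended
select * from txtest where id = '1234';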
HTH
Dr Mich Talebzadeh
LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
Sybase ASE 15 Gold Medal Award 2008
A Winning Strategy: Running the most Critical Financial Data on ASE 15
http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", ISBN 978-0-9563693-0-7, and co-author of "Sybase Transact SQL Guidelines Best Practices", ISBN 978-0-9759693-0-4.
Publications due shortly:
Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8
Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one
out shortly
http://talebzadehmich.wordpress.com/
From: Frank Luo [mailto:[email protected]]
Sent: 29 January 2016 00:47
To: [email protected]
Subject: bloom filter used in 0.14?
All,
I have a huge table on which I periodically want to run a select for some particular value. For example, suppose I have a table of the entire world population and I know that the id "1234" belongs to a criminal, so I want to pull his information out of the table.
Without any optimization, I have to use thousands of mappers to find just one id, which is not ideal. I tried enabling a bloom filter on the column I want to search on, but a simple query shows that the amount of data read is the same as without a bloom filter. So I am questioning whether the feature is enabled in the version I am on, which is 0.14. Does anyone know? If a bloom filter is not the way to go, does anyone have suggestions?
Here is the HQL:
create table test
(
  col1 STRING,
  col2 STRING
) STORED AS ORC
tblproperties ("orc.bloom.filter.columns"="col1");
select * from test where col1 = '1234';
Thx
Frank