“I'm assuming …we'd have to organize the input Parquet files into
subdirectories where each subdirectory contains data just for the given 'date'”
Well, no…
Given that you are doing CREATE TABLE and not CREATE EXTERNAL TABLE, you are
creating a Hive-managed table. If Hive is managing the table, Hive will take
care of organizing the parquet input files into subdirectories where each
subdirectory contains data just for the given partition; that is exactly what
the INSERT INTO db.mytable will do.
To ‘stage’ your data, you could create an EXTERNAL table. You have two options
here:
1)
CREATE EXTERNAL TABLE IF NOT EXISTS db.my_staging_table (
`item_id` string,
`ts` string,
`item_comments` string,
`dt` string,
`content_type` string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE
LOCATION '/hdfs/path/to/source/data/directory';
Then, to load your data:
INSERT INTO TABLE db.mytable PARTITION (`date`, content_type)
SELECT item_id, ts, item_comments, dt, content_type
FROM db.my_staging_table;
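Note that a dynamic-partition INSERT like the one above generally requires
dynamic partitioning to be enabled for the session; a minimal sketch (these are
standard Hive settings, though defaults vary by version):

SET hive.exec.dynamic.partition = true;
-- 'nonstrict' allows all partition columns to be dynamic:
SET hive.exec.dynamic.partition.mode = nonstrict;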
2)
Alternatively, if your input data needs any additional clean-up, it is
sometimes easier to give the external table a single column containing the
entire input line:
CREATE EXTERNAL TABLE IF NOT EXISTS db.my_staging_table (`input_line` string)
STORED AS TEXTFILE
LOCATION '/hdfs/path/to/source/data/directory';
Then, to load the data, you’d write a query that cleans up and parses out the
fields:
INSERT INTO TABLE db.mytable PARTITION (`date`, content_type)
SELECT * FROM (
  SELECT split(input_line, ',')[0] AS item_id,
         split(input_line, ',')[1] AS ts,
         split(input_line, ',')[2] AS item_comments,
         split(input_line, ',')[3] AS dt,
         split(input_line, ',')[4] AS content_type
  FROM db.my_staging_table
) subq_a;
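If the clean-up is more involved, the inner query can filter and normalize as
it parses; a hypothetical sketch (the trim() calls and the five-field check are
illustrative, not from the original thread):

INSERT INTO TABLE db.mytable PARTITION (`date`, content_type)
SELECT * FROM (
  SELECT trim(split(input_line, ',')[0]) AS item_id,
         trim(split(input_line, ',')[1]) AS ts,
         trim(split(input_line, ',')[2]) AS item_comments,
         trim(split(input_line, ',')[3]) AS dt,
         trim(split(input_line, ',')[4]) AS content_type
  FROM db.my_staging_table
  -- drop blank or malformed lines that don't have all 5 fields:
  WHERE input_line IS NOT NULL
    AND size(split(input_line, ',')) = 5
) subq_a;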
Then, once you have loaded the data that is currently in the HDFS path the
EXTERNAL table is mapped to (/hdfs/path/to/source/data/directory), just delete
the input files from that directory, add new incoming data, and repeat the
cycle.
From: Dmitry Goldenberg [mailto:[email protected]]
Sent: Thursday, April 06, 2017 3:19 PM
To: [email protected]
Subject: Re: Is it possible to use LOAD DATA INPATH with a PARTITIONED, STORED
AS PARQUET table?
I'm assuming, given this:
CREATE TABLE IF NOT EXISTS db.mytable (
`item_id` string,
`timestamp` string,
`item_comments` string)
PARTITIONED BY (`date` string, `content_type` string)
STORED AS PARQUET;
we'd have to organize the input Parquet files into subdirectories where each
subdirectory contains data just for the given 'date' (YYMMDD), then within that
subdirectory, content would be organized by content_type, one file per
content_type value. How does Hive make the association between a partition and
a subdirectory name, or know to look for files for content_type? And how would
it match content_type='Presentation' -- would the file just need to be named
"Presentation"?
On Thu, Apr 6, 2017 at 5:05 PM, Dmitry Goldenberg <[email protected]> wrote:
>> properly split and partition your data before using LOAD if you want hive to
>> be able to find it again.
If the destination table is defined as
CREATE TABLE IF NOT EXISTS db.mytable (
`item_id` string,
`timestamp` string,
`item_comments` string)
PARTITIONED BY (`date` string, `content_type` string)
STORED AS PARQUET;
and supposing that we have the data "in hand" (in memory or as CSV files), how
does one go about the 'proper split and partition' so that it adheres to
PARTITIONED BY (`date`, `content_type`)?
Thanks
On Thu, Apr 6, 2017 at 12:29 PM, Ryan Harris <[email protected]> wrote:
“If we represent our data as delimited files” … the question is how you plan to
get your data into these parquet files, since it doesn’t sound like your data
is already in that format. If your data is not already in parquet format, you
are going to need to run *some* process to get it into that format, so why not
just use hive (running a query on an external table) to perform the conversion?
“and Hive represents it as Parquet internally” … that entirely depends on the
declared STORED AS format when you define the table. The files backing the
hive table *could* be TEXT, sequence, RC, ORC, Parquet… If you declared the
table to be backed by delimited text, you could format your data into standard
text files (not parquet) and then add the data to the hive table using LOAD
DATA, as in the sketch below.
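A minimal sketch of that text-backed variant (the table name and the comma
delimiter are illustrative, not from the original thread):

CREATE TABLE IF NOT EXISTS db.my_text_table (
  `item_id` string,
  `ts` string,
  `item_comments` string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- LOAD DATA just moves the files into the table's warehouse directory;
-- no parsing or format conversion happens at load time:
LOAD DATA INPATH '/hdfs/path/to/source/data' INTO TABLE db.my_text_table;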
So, why NOT just use text for the table storage? Because then there is no way
to optimize future queries against that data.
One hypothetical workflow, assuming that your data is currently delimited:
You could either have a hive-managed table, with the table data stored as
TEXTFILE using some delimiter-based SerDe, and then use LOAD DATA to put your
original raw files into this table; OR, you could use an external table (not
managed by hive) to point to the data wherever it currently resides.
(The only difference between the two choices here is whether the original raw
files end up in ‘/user/hive/warehouse/tablename’ or stay in the HDFS path where
they currently reside.)
From there, you could query FROM that temp table, INSERT into your final
destination table, and the data will be formatted according to the data
definition of your destination table.
If you want to (for whatever reason) use LOAD DATA INPATH to shove the original
data directly into your final destination table you must
1) Ensure that the data is formatted into parquet files that are compatible
with the version of hive that you are running. The parquet format has been
used by a number of different projects; unfortunately, there are different
versions of parquet, and it cannot be taken for granted that any given parquet
file will be compatible with the version of hive you are using.
Testing and validation are required; see
https://github.com/Parquet/parquet-compatibility
2) Parquet files have internal partitioning to them, but from hive’s
perspective, hive partitions will still be separated into individual
directories. You’ll need to ensure that you properly split and partition your
data before using LOAD if you want hive to be able to find it again (see the
sketch below).
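In practice that means laying out one directory per partition and issuing one
LOAD per partition with an explicit PARTITION clause; a sketch with
illustrative paths and partition values:

LOAD DATA INPATH '/staging/date=170406/content_type=Presentation'
INTO TABLE db.mytable PARTITION (`date`='170406', content_type='Presentation');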
It doesn’t sound like your source data is currently formatted to match your
hive table formatting. If you are already processing the data with a spark
pipeline, happened to set the output of that processing to be delimited text,
and can just as easily change it to something that is compatible with your hive
table, then that may make sense to do. However, if you are going to require a
separate processing step to convert the data from delimited text to
hive-compatible parquet, I don’t see a reason to use any tool OTHER than hive
to perform that conversion.
LOAD DATA is generally used in situations where you *know* that the data
format is already 100% compatible with your destination table, which most
often occurs when the source of the data is the raw data backing an existing
hive-managed table (possibly copied/moved from a different cluster).
From: Dmitry Goldenberg [mailto:[email protected]]
Sent: Thursday, April 06, 2017 6:48 AM
To: [email protected]
Subject: Re: Is it possible to use LOAD DATA INPATH with a PARTITIONED, STORED
AS PARQUET table?
Thanks, Ryan.
I was actually more curious about scenario B. If we represent our data as
delimited files, why don't we just use LOAD DATA INPATH and load it right into
the final, parquet, partitioned table in one step, bypassing dealing with the
temp table?
Are there any advantages to having a temp table besides the validation? One
possible advantage could be making it a transactional table and being able to
run direct INSERTs into the temp table, avoiding having to deal with delimited
files and LOAD DATA INPATH.
If we go with route B, LOAD DATA INPATH directly into the parquet, partitioned
table, would we have to:
1) represent the input files as Parquet? - it looks like the data is still
delimited, and Hive represents it as Parquet internally
2) do anything specific in the input files / with the input files in order to
make partitioning work, or does Hive just take the data and take full care of
partitioning it?
On Tue, Apr 4, 2017 at 6:14 PM, Ryan Harris <[email protected]> wrote:
For A) I’d recommend mapping an EXTERNAL table to the raw/original source
files…then you can just run a SELECT query from the EXTERNAL source and INSERT
into your destination.
LOAD DATA can be very useful when you are trying to move data between two
tables that share the same schema, but where one table is partitioned and the
other is NOT. Once the files have been inserted into the unpartitioned table,
the source files from the hive warehouse can be added to the partitioned table
using LOAD DATA. Another place I’ve frequently used LOAD DATA is when
synchronizing hive table data between two clusters: the hive warehouse data
files can be copied from one cluster to the other with distcp, and then the
data files are loaded on the duplicate cluster using LOAD DATA to ensure the
metadata is recorded in the hive metastore.
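A sketch of that cross-cluster pattern (the namenode addresses, paths, and
partition spec are illustrative):

-- 1) copy the warehouse files to the destination cluster (shell):
--      hadoop distcp hdfs://src-nn/user/hive/warehouse/db.db/mytable/date=170406 \
--                    hdfs://dst-nn/staging/mytable/date=170406
-- 2) on the destination cluster, register the copied files in the metastore:
LOAD DATA INPATH '/staging/mytable/date=170406/content_type=Presentation'
INTO TABLE db.mytable PARTITION (`date`='170406', content_type='Presentation');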
From: Dmitry Goldenberg [mailto:[email protected]]
Sent: Tuesday, April 04, 2017 3:31 PM
To: [email protected]
Subject: Re: Is it possible to use LOAD DATA INPATH with a PARTITIONED, STORED
AS PARQUET table?
Right, that makes sense, Dudu.
So basically, if we have our data in "some form", and a goal of loading it into
a parquet, partitioned table in Hive, we have two choices:
A. Load this data into a temporary table first. Presumably, for this we should
be able to do a LOAD INPATH from delimited data files. Perhaps we could
designate the temp table as transactional and then simply do direct INSERTs
into this temp table? Then, as the second step, we'd do an INSERT... SELECT
to move the data into the destination table, and then DROP the temp table.
B. Represent the data as a delimited format and do a LOAD INPATH directly into
the destination table. Understandably, we lose the 'data verification' this
way. If we go this route, must the data in the input files be in the PARQUET
format or in a delimited format? I would guess, the former. And, how does
partitioning play into it? How would the input data need to be organized and
inserted so as to adhere to the partitions (the 'date' and 'content_type'
columns, in my example)?
On Tue, Apr 4, 2017 at 2:22 PM, Markovitz, Dudu <[email protected]> wrote:
“LOAD” is very misleading here; it is all done at the metadata level.
The data is not being touched. The data is not being verified. The “system”
does not have any clue whether the files’ format matches the table definition
and whether they can actually be used.
The data files are being “moved” (again, a metadata operation) from their
current HDFS location to the location defined for the table.
Later on, when you query the table, the files will be scanned. If they are in
the right format you’ll get results. If not, then no.
From: Dmitry Goldenberg [mailto:[email protected]]
Sent: Tuesday, April 04, 2017 8:54 PM
To: [email protected]
Subject: Re: Is it possible to use LOAD DATA INPATH with a PARTITIONED, STORED
AS PARQUET table?
Thanks, Dudu. I think there's a disconnect here. We're using LOAD INPATH on a
few tables to achieve the effect of actual insertion of records. Is it not the
case that the LOAD causes the data to get inserted into Hive?
Based on that I'd like to understand whether we can get away with using LOAD
INPATH instead of INSERT/SELECT FROM.
On Apr 4, 2017, at 1:43 PM, Markovitz, Dudu <[email protected]> wrote:
I just want to verify that you understand the following:
• LOAD DATA INPATH is just an HDFS file-movement operation.
You can achieve the same results by using hdfs dfs -mv …
• LOAD DATA LOCAL INPATH is just a file-copy operation from the local
filesystem to HDFS.
You can achieve the same results by using hdfs dfs -put …
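To make the equivalence concrete (a sketch with a hypothetical unpartitioned
table; note that for a partitioned table the metastore still has to learn about
a new partition separately, e.g. via MSCK REPAIR TABLE or ALTER TABLE ... ADD
PARTITION):

-- roughly equivalent, for a hypothetical unpartitioned table:
LOAD DATA INPATH '/tmp/datafile' INTO TABLE db.my_unpartitioned_table;
-- hdfs dfs -mv /tmp/datafile /user/hive/warehouse/db.db/my_unpartitioned_table/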
From: Dmitry Goldenberg [mailto:[email protected]]
Sent: Tuesday, April 04, 2017 7:48 PM
To: [email protected]<mailto:[email protected]>
Subject: Re: Is it possible to use LOAD DATA INPATH with a PARTITIONED, STORED
AS PARQUET table?
Dudu,
This is still in the design stages; we have a way to get the data from its
source. The data is *not* in the Parquet format; it's up to us to format it in
the best and most efficient way. We can roll with CSV or Parquet; ultimately
the data must make it into a pre-defined PARQUET, PARTITIONED table in Hive.
Thanks,
- Dmitry
On Tue, Apr 4, 2017 at 12:20 PM, Markovitz, Dudu <[email protected]> wrote:
Are your files already in Parquet format?
From: Dmitry Goldenberg [mailto:[email protected]]
Sent: Tuesday, April 04, 2017 7:03 PM
To: [email protected]
Subject: Re: Is it possible to use LOAD DATA INPATH with a PARTITIONED, STORED
AS PARQUET table?
Thanks, Dudu.
Just to reiterate: the way I'm reading your response is that yes, we can use
LOAD INPATH for a PARQUET, PARTITIONED table, provided that the data in the
delimited file is properly formatted. Then we can LOAD it into the table
(mytable in my example) directly and avoid the creation of the temp table
(origtable in my example). Correct so far?
I did not quite follow the latter part of your response:
>> You should only create an external table which is an interface to read the
>> files and use it in an INSERT operation.
My assumption was that we would LOAD INPATH and not have to use INSERT at all.
Am I missing something in grokking this latter part of your response?
Thanks,
- Dmitry
On Tue, Apr 4, 2017 at 11:26 AM, Markovitz, Dudu <[email protected]> wrote:
Since LOAD DATA INPATH only moves files, the answer is very simple.
If your files are already in a format that matches the destination table
(storage type, number and types of columns, etc.) then yes; if not, then no.
But: you don’t need to load the files into an intermediary table.
You should instead create an external table, which is an interface for reading
the files, and use it in an INSERT operation.
Dudu
From: Dmitry Goldenberg [mailto:[email protected]]
Sent: Tuesday, April 04, 2017 4:52 PM
To: [email protected]
Subject: Is it possible to use LOAD DATA INPATH with a PARTITIONED, STORED AS
PARQUET table?
We have a table such as the following defined:
CREATE TABLE IF NOT EXISTS db.mytable (
`item_id` string,
`timestamp` string,
`item_comments` string)
PARTITIONED BY (`date` string, `content_type` string)
STORED AS PARQUET;
Currently we insert data into this PARQUET, PARTITIONED table as follows, using
an intermediary table:
INSERT INTO TABLE db.mytable PARTITION (`date`, content_type)
SELECT itemid AS item_id, itemts AS `timestamp`, item_comments, `date`,
       content_type
FROM db.origtable
WHERE `date` = '${SELECTED_DATE}'
GROUP BY itemid, itemts, item_comments, `date`, content_type;
Our question is, would it be possible to use the LOAD DATA INPATH.. INTO TABLE
syntax to load the data from delimited data files into 'mytable' rather than
populating mytable from the intermediary table?
I see in the Hive documentation that:
* Load operations are currently pure copy/move operations that move datafiles
into locations corresponding to Hive tables.
* If the table is partitioned, then one must specify a specific partition of
the table by specifying values for all of the partitioning columns.
This seems to indicate that using LOAD is possible; however looking at this
discussion:
http://grokbase.com/t/hive/user/114frbfg0y/can-i-use-hive-dynamic-partition-while-loading-data-into-tables,
perhaps not?
We'd like to understand if using LOAD in the case of PARQUET, PARTITIONED
tables is possible and if so, then how does one go about using LOAD in that
case?
Thanks,
- Dmitry