ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.hive.serde2.io.TimestampWritable

2018-07-25 Thread Dmitry Goldenberg
Hi,

I apologize for the wide distribution and if this is not the right mailing
list for this.

We write Avro files to Parquet and load them to HDFS so they can be
accessed via an EXTERNAL Hive table.  These records have two timestamp
fields which are expressed in the Avro schema as type = long and
logicalType=timestamp-millis.

When trying to do a SELECT * FROM the table, we get the error included
below. Basically, the long values cannot be converted to timestamps. This
appears similar to https://issues.apache.org/jira/browse/HIVE-13534.

Could someone suggest a workaround? Would we have to make the timestamp
fields strings? Was hoping to avoid that... Thanks

-Dmitry

Bad status for request TFetchResultsReq(fetchType=0,
operationHandle=TOperationHandle(hasResultSet=True, modifiedRowCount=None,
operationType=0,
operationId=THandleIdentifier(secret='\x94yB\xb2\xf47K\x98\xaa\xce\\\xab\xdc_\xcdH',
guid='~\xf9\xd5x\x1e\xe1I*\x91\xddu\x92\xa7\xec\xc6\xda')),
orientation=4, maxRows=100):

TFetchResultsResp(status=TStatus(errorCode=0,
errorMessage='java.io.IOException:
org.apache.hadoop.hive.ql.metadata.HiveException:
java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be
cast to org.apache.hadoop.hive.serde2.io.TimestampWritable',
sqlState=None, infoMessages=[
'*org.apache.hive.service.cli.HiveSQLException:java.io.IOException:
org.apache.hadoop.hive.ql.metadata.HiveException:
java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be
cast to org.apache.hadoop.hive.serde2.io.TimestampWritable:14:13',
'org.apache.hive.service.cli.operation.SQLOperation:getNextRowSet:SQLOperation.java:463',
'org.apache.hive.service.cli.operation.OperationManager:getOperationNextRowSet:OperationManager.java:294',
'org.apache.hive.service.cli.session.HiveSessionImpl:fetchResults:HiveSessionImpl.java:769',
'org.apache.hive.service.cli.CLIService:fetchResults:CLIService.java:462',
'org.apache.hive.service.cli.thrift.ThriftCLIService:FetchResults:ThriftCLIService.java:694',
'org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults:getResult:TCLIService.java:1553',
'org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults:getResult:TCLIService.java:1538',
'org.apache.thrift.ProcessFunction:process:ProcessFunction.java:39',
'org.apache.thrift.TBaseProcessor:process:TBaseProcessor.java:39',
'org.apache.hadoop.hive.thrift.HadoopThriftAuthBridge$Server$TUGIAssumingProcessor:process:HadoopThriftAuthBridge.java:747',
'org.apache.thrift.server.TThreadPoolServer$WorkerProcess:run:TThreadPoolServer.java:286',
'java.util.concurrent.ThreadPoolExecutor:runWorker:ThreadPoolExecutor.java:1149',
'java.util.concurrent.ThreadPoolExecutor$Worker:run:ThreadPoolExecutor.java:624',
'java.lang.Thread:run:Thread.java:748',
'*java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException:
java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be
cast to org.apache.hadoop.hive.serde2.io.TimestampWritable:16:2',
'org.apache.hadoop.hive.ql.exec.FetchTask:fetch:FetchTask.java:154',
'org.apache.hadoop.hive.ql.Driver:getResults:Driver.java:2069',
'org.apache.hive.service.cli.operation.SQLOperation:getNextRowSet:SQLOperation.java:458',
'*org.apache.hadoop.hive.ql.metadata.HiveException:java.lang.ClassCastException:
org.apache.hadoop.io.LongWritable cannot be cast to
org.apache.hadoop.hive.serde2.io.TimestampWritable:23:7',
'org.apache.hadoop.hive.ql.exec.ListSinkOperator:processOp:ListSinkOperator.java:90',
'org.apache.hadoop.hive.ql.exec.Operator:forward:Operator.java:815',
'org.apache.hadoop.hive.ql.exec.SelectOperator:processOp:SelectOperator.java:84',
'org.apache.hadoop.hive.ql.exec.Operator:forward:Operator.java:815',
'org.apache.hadoop.hive.ql.exec.TableScanOperator:processOp:TableScanOperator.java:98',
'org.apache.hadoop.hive.ql.exec.FetchOperator:pushRow:FetchOperator.java:425',
'org.apache.hadoop.hive.ql.exec.FetchOperator:pushRow:FetchOperator.java:417',
'org.apache.hadoop.hive.ql.exec.FetchTask:fetch:FetchTask.java:140',
'*java.lang.ClassCastException:org.apache.hadoop.io.LongWritable cannot be
cast to org.apache.hadoop.hive.serde2.io.TimestampWritable:28:5',
'org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableTimestampObjectInspector:getPrimitiveJavaObject:WritableTimestampObjectInspector.java:39',
'org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableTimestampObjectInspector:getPrimitiveJavaObject:WritableTimestampObjectInspector.java:25',
'org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils:copyToStandardObject:ObjectInspectorUtils.java:336',
'org.apache.hadoop.hive.serde2.SerDeUtils:toThriftPayload:SerDeUtils.java:167',
'org.apache.hadoop.hive.ql.exec.FetchFormatter$ThriftFormatter:convert:FetchFormatter.java:61',
'org.apache.hadoop.hive.ql.exec.ListSinkOperator:processOp:ListSinkOperator.java:87'],
statusCode=3), results=None, hasMoreRows=None)
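
One hedged workaround sketch for the problem above, with hypothetical
table, column, and path names: declare the millisecond fields as BIGINT so
that Parquet's INT64 values read cleanly, then convert at query time
through a view:

-- Hedged sketch, hypothetical names: read the millis as BIGINT,
-- convert in a view.
CREATE EXTERNAL TABLE db.events_raw (
  `item_id`        string,
  `created_millis` bigint,  -- Avro: type=long, logicalType=timestamp-millis
  `updated_millis` bigint)
STORED AS PARQUET
LOCATION '/data/events';

-- from_unixtime() takes seconds (and uses the session time zone), hence
-- DIV 1000; sub-second precision is dropped here. Direct
-- numeric-to-TIMESTAMP casts are avoided because their semantics changed
-- across Hive versions (see HIVE-3454).
CREATE VIEW db.events AS
SELECT item_id,
       CAST(from_unixtime(created_millis DIV 1000) AS timestamp) AS created_ts,
       CAST(from_unixtime(updated_millis DIV 1000) AS timestamp) AS updated_ts
FROM db.events_raw;

This keeps the physical files unchanged; only the Hive-side types move.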


Re: Is it possible to use LOAD DATA INPATH with a PARTITIONED, STORED AS PARQUET table?

2017-04-06 Thread Dmitry Goldenberg
I'm assuming, given this:

CREATE TABLE IF NOT EXISTS db.mytable (
  `item_id` string,
  `timestamp` string,
  `item_comments` string)
PARTITIONED BY (`date` string, `content_type` string)
STORED AS PARQUET;

we'd have to organize the input Parquet files into subdirectories, where
each subdirectory contains data just for the given 'date' (YYMMDD); then,
within that subdirectory, content would be organized by content_type, one
file per content_type value. How does Hive associate a partition with a
subdirectory name, or know to look for files for content_type? And how
would it match content_type='Presentation' -- would the file just need to
be named "Presentation"?
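
For reference, Hive ties a partition to a key=value directory name under
the table location, not to a file name; a small sketch with hypothetical
paths:

-- Hive resolves partitions from directory names such as
--   /user/hive/warehouse/db.db/mytable/date=20170406/content_type=Presentation/part-00000.parquet
-- The Parquet file name itself does not matter.
ALTER TABLE db.mytable ADD IF NOT EXISTS
  PARTITION (`date`='20170406', `content_type`='Presentation');

-- Or, once the directories are laid out under the table location, let
-- the metastore discover them all at once:
MSCK REPAIR TABLE db.mytable;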


On Thu, Apr 6, 2017 at 5:05 PM, Dmitry Goldenberg 
wrote:

> >> properly split and partition your data before using LOAD if you want
> hive to be able to find it again.
>
> If the destination table is defined as
> CREATE TABLE IF NOT EXISTS db.mytable (
>   `item_id` string,
>   `timestamp` string,
>   `item_comments` string)
> PARTITIONED BY (`date` string, `content_type` string)
> STORED AS PARQUET;
>
> and supposing that we have the data "in hand" (in memory or as CSV files)
> how does one go about the 'proper split and partition' so it adheres to:
> PARTITIONED BY (`date` string, `content_type` string)  ?
>
> Thanks
>
>
> On Thu, Apr 6, 2017 at 12:29 PM, Ryan Harris wrote:
>
>> “If we represent our data as delimited files” ….the question is how you
>> plan on getting your data into these parquet files since it doesn’t sound
>> like your data is already in that format….
>>
>>
>>
>> If your data is not already in parquet format, you are going to need to
>> run **some** process to get it into that format…why not just use hive
>> (running a query on an external table) to perform the conversion?
>>
>>
>>
>> “and Hive represents it as Parquet internally” That entirely depends on
>> the declared STORED AS format when you define the table.  The files backing
>> the hive table **could** be TEXT, sequence, RC, ORC, Parquet…  If you
>> declared the table to be backed by delimited text, you could format your
>> data into standard text files (not parquet) and then add the data to the
>> hive table using LOAD DATA.
>>
>>
>>
>> So, why NOT use text data for the table storage?  There is no way to
>> optimize future queries against that data.
>>
>>
>>
>> One hypothetical workflow assuming that your data is currently delimited….
>>
>>
>>
>> You could either have a hive managed table, with the table data stored as
>> TEXTFILE using some delimiter based SerDe, and you could then use LOAD DATA
>> to put your original raw files into this table.   OR, you could use an
>> external table (not managed by hive) to point to the data wherever it
>> currently resides.  (The only difference between the two choices here is
>> whether the original raw files end up in ‘/user/hive/warehouse/tablename’
>> or the current HDFS path where they reside.)
>>
>>
>>
>> From there, you could query FROM that temp table, INSERT into your final
>> destination table, and the data will be formatted according to the data
>> definition of your destination table.
>>
>>
>>
>>
>>
>> If you want to (for whatever reason) use LOAD DATA INPATH to shove the
>> original data directly into your final destination table you must
>>
>> 1)  Ensure that the data is formatted into parquet files that are
>> compatible with the version of hive that you are running.   The parquet
>> format has been used by a number of different projects; unfortunately, there
>> are different versions of parquet, and it cannot be taken for granted that
>> any parquet file will be compatible with the version of hive you are
>> using.  Testing and validation is required…see
>> https://github.com/Parquet/parquet-compatibility
>>
>> 2)  Parquet files have internal partitioning to them, but from
>> hive’s perspective, hive partitions will still be separated into individual
>> directories.  You’ll need to ensure that you properly split and partition
>> your data before using LOAD if you want hive to be able to find it again.
>>
>>
>>
>> It doesn’t sound like your source data is currently formatted to match
>> your hive table formatting.  If you are already processing the data with a
>> spark pipeline and you just happened to set the output of that processing
>> to be delimited text and you can just as easily change it to something that
>> is compatible with your hive table….then that may make sense to do.

Re: Is it possible to use LOAD DATA INPATH with a PARTITIONED, STORED AS PARQUET table?

2017-04-06 Thread Dmitry Goldenberg
>> properly split and partition your data before using LOAD if you want
hive to be able to find it again.

If the destination table is defined as
CREATE TABLE IF NOT EXISTS db.mytable (
  `item_id` string,
  `timestamp` string,
  `item_comments` string)
PARTITIONED BY (`date` string, `content_type` string)
STORED AS PARQUET;

and supposing that we have the data "in hand" (in memory or as CSV files)
how does one go about the 'proper split and partition' so it adheres to:
PARTITIONED BY (`date` string, `content_type` string)  ?

Thanks
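
A hedged sketch of one way to get that split, with hypothetical staging
names and paths: expose the CSV files through a delimited staging table,
then let a dynamic-partition INSERT do the splitting:

-- Staging table over the delimited files; Hive reads them in place.
CREATE EXTERNAL TABLE db.mytable_staging (
  `item_id`       string,
  `timestamp`     string,
  `item_comments` string,
  `date`          string,
  `content_type`  string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/staging/mytable';

-- Dynamic partitioning must be switched on for this INSERT.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Hive routes each row into its date=/content_type= partition directory.
INSERT INTO TABLE db.mytable PARTITION (`date`, `content_type`)
SELECT item_id, `timestamp`, item_comments, `date`, `content_type`
FROM db.mytable_staging;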


On Thu, Apr 6, 2017 at 12:29 PM, Ryan Harris 
wrote:

> “If we represent our data as delimited files” ….the question is how you
> plan on getting your data into these parquet files since it doesn’t sound
> like your data is already in that format….
>
>
>
> If your data is not already in parquet format, you are going to need to
> run **some** process to get it into that format…why not just use hive
> (running a query on an external table) to perform the conversion?
>
>
>
> “and Hive represents it as Parquet internally” That entirely depends on
> the declared STORED AS format when you define the table.  The files backing
> the hive table **could** be TEXT, sequence, RC, ORC, Parquet…  If you
> declared the table to be backed by delimited text, you could format your
> data into standard text files (not parquet) and then add the data to the
> hive table using LOAD DATA.
>
>
>
> So, why NOT use text data for the table storage?  There is no way to
> optimize future queries against that data.
>
>
>
> One hypothetical workflow assuming that your data is currently delimited….
>
>
>
> You could either have a hive managed table, with the table data stored as
> TEXTFILE using some delimiter based SerDe, and you could then use LOAD DATA
> to put your original raw files into this table.   OR, you could use an
> external table (not managed by hive) to point to the data wherever it
> currently resides.  (The only difference between the two choices here is
> whether the original raw files end up in ‘/user/hive/warehouse/tablename’
> or the current HDFS path where they reside.)
>
>
>
> From there, you could query FROM that temp table, INSERT into your final
> destination table, and the data will be formatted according to the data
> definition of your destination table.
>
>
>
>
>
> If you want to (for whatever reason) use LOAD DATA INPATH to shove the
> original data directly into your final destination table you must
>
> 1)  Ensure that the data is formatted into parquet files that are
> compatible with the version of hive that you are running.   The parquet
> format has been used by a number of different projects; unfortunately, there
> are different versions of parquet, and it cannot be taken for granted that
> any parquet file will be compatible with the version of hive you are
> using.  Testing and validation is required…see
> https://github.com/Parquet/parquet-compatibility
>
> 2)  Parquet files have internal partitioning to them, but from hive’s
> perspective, hive partitions will still be separated into individual
> directories.  You’ll need to ensure that you properly split and partition
> your data before using LOAD if you want hive to be able to find it again.
>
>
>
> It doesn’t sound like your source data is currently formatted to match
> your hive table formatting.  If you are already processing the data with a
> spark pipeline and you just happened to set the output of that processing
> to be delimited text and you can just as easily change it to something that
> is compatible with your hive table….then that may make sense to do.
> However, if you are going to require a separate processing step to convert
> the data from delimited text to hive-compatible parquet, I don’t see a
> reason to use any tool OTHER than hive to perform that conversion.
>
>
>
> LOAD DATA is generally used in situations where you **know** that the
> data format is already 100% exactly compatible with your destination
> table….which most often occurs when the source of the data is the raw data
> backing an existing hive managed table (possibly copied/moved from a
> different cluster).
>
>
>
>
>
>
>
> *From:* Dmitry Goldenberg [mailto:dgoldenb...@hexastax.com]
> *Sent:* Thursday, April 06, 2017 6:48 AM
> *To:* user@hive.apache.org
> *Subject:* Re: Is it possible to use LOAD DATA INPATH with a PARTITIONED,
> STORED AS PARQUET table?
>
>
>
>
> Thanks, Ryan.
>
>
>
> I was actually more curious about scenario B. If we represent our data as
> delimited files, why don't we just use LOAD DATA INPATH and load it right
> into the final, parquet, partitioned table in one step, bypassing dealing
> with the temp table?

Re: Is it possible to use LOAD DATA INPATH with a PARTITIONED, STORED AS PARQUET table?

2017-04-06 Thread Dmitry Goldenberg
Thank you, Ryan and Furcy for your detailed responses.

Our application doesn't necessarily have to have the data in the CSV
format. We read data from "a source" and load it in memory (not all at
once), basically as a continuous stream of records. These are meant to be
processed and written to Hive, either on a batch basis (finite dataset)
or continuously (streaming mode).

What I'm gathering is that we're best off representing this data as CSV
files, loading these into a temp table, then performing an INSERT..SELECT
FROM to have Hive do the right thing with regard to both PARQUET
conversion and partitioning.

The alternative is for us to do a LOAD DATA INPATH into the final
destination (PARQUET, partitioned) table, but we'd have to convert the data
to Parquet ourselves. That has the appeal of not having to deal with the
temp table (which we'd need to create, then eventually drop).

Thanks again,
- Dmitry
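
For that LOAD alternative, a minimal hedged sketch (paths and partition
values hypothetical): LOAD cannot partition dynamically, so each call moves
one set of pre-built Parquet files into one explicitly named partition:

-- The files must already be Parquet matching the table schema;
-- LOAD only moves them into the partition's directory.
LOAD DATA INPATH '/staging/ready/presentations-20170406'
INTO TABLE db.mytable
PARTITION (`date`='20170406', `content_type`='Presentation');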

On Thu, Apr 6, 2017 at 12:29 PM, Ryan Harris 
wrote:

> “If we represent our data as delimited files” ….the question is how you
> plan on getting your data into these parquet files since it doesn’t sound
> like your data is already in that format….
>
>
>
> If your data is not already in parquet format, you are going to need to
> run **some** process to get it into that format…why not just use hive
> (running a query on an external table) to perform the conversion?
>
>
>
> “and Hive represents it as Parquet internally” That entirely depends on
> the declared STORED AS format when you define the table.  The files backing
> the hive table **could** be TEXT, sequence, RC, ORC, Parquet…  If you
> declared the table to be backed by delimited text, you could format your
> data into standard text files (not parquet) and then add the data to the
> hive table using LOAD DATA.
>
>
>
> So, why NOT use text data for the table storage?  There is no way to
> optimize future queries against that data.
>
>
>
> One hypothetical workflow assuming that your data is currently delimited….
>
>
>
> You could either have a hive managed table, with the table data stored as
> TEXTFILE using some delimiter based SerDe, and you could then use LOAD DATA
> to put your original raw files into this table.   OR, you could use an
> external table (not managed by hive) to point to the data wherever it
> currently resides.  (The only difference between the two choices here is
> whether the original raw files end up in ‘/user/hive/warehouse/tablename’
> or the current HDFS path where they reside.)
>
>
>
> From there, you could query FROM that temp table, INSERT into your final
> destination table, and the data will be formatted according to the data
> definition of your destination table.
>
>
>
>
>
> If you want to (for whatever reason) use LOAD DATA INPATH to shove the
> original data directly into your final destination table you must
>
> 1)  Ensure that the data is formatted into parquet files that are
> compatible with the version of hive that you are running.   The parquet
> format has been used by a number of different projects; unfortunately, there
> are different versions of parquet, and it cannot be taken for granted that
> any parquet file will be compatible with the version of hive you are
> using.  Testing and validation is required…see
> https://github.com/Parquet/parquet-compatibility
>
> 2)  Parquet files have internal partitioning to them, but from hive’s
> perspective, hive partitions will still be separated into individual
> directories.  You’ll need to ensure that you properly split and partition
> your data before using LOAD if you want hive to be able to find it again.
>
>
>
> It doesn’t sound like your source data is currently formatted to match
> your hive table formatting.  If you are already processing the data with a
> spark pipeline and you just happened to set the output of that processing
> to be delimited text and you can just as easily change it to something that
> is compatible with your hive table….then that may make sense to do.
> However, if you are going to require a separate processing step to convert
> the data from delimited text to hive-compatible parquet, I don’t see a
> reason to use any tool OTHER than hive to perform that conversion.
>
>
>
> LOAD DATA is generally used in situations where you **know** that the
> data format is already 100% exactly compatible with your destination
> table….which most often occurs when the source of the data is the raw data
> backing an existing hive managed table (possibly copied/moved from a
> different cluster).
>
>
>
>
>
>
>
> *From:* Dmitry Goldenberg [mailto:dgoldenb...@hexastax.com]
> *Sent:* Thursday, April 06, 2017 6:48 AM
> *To:* user@hive.apache.org

Re: Is it possible to use LOAD DATA INPATH with a PARTITIONED, STORED AS PARQUET table?

2017-04-06 Thread Dmitry Goldenberg
Thanks, Ryan.

I was actually more curious about scenario B. If we represent our data as
delimited files, why don't we just use LOAD DATA INPATH and load it right
into the final, parquet, partitioned table in one step, bypassing dealing
with the temp table?

Are there any advantages to having a temp table besides the validation? One
advantage could possibly be making it a transactional table and being able
to run direct INSERT's into the temp table, avoiding having to deal with
delimited files and LOAD DATA INPATH.

If we go with route B, LOAD DATA INPATH directly into the parquet,
partitioned table, would we have to:

1) represent the input files as Parquet? - it looks like the data is still
delimited, and Hive represents it as Parquet internally
2) do anything specific in the input files / with the input files in order
to make partitioning work, or does Hive just take the data and take full
care of partitioning it?



On Tue, Apr 4, 2017 at 6:14 PM, Ryan Harris 
wrote:

> For A) I’d recommend mapping an EXTERNAL table to the raw/original source
> files…then you can just run a SELECT query from the EXTERNAL source and
> INSERT into your destination.
>
>
>
> LOAD DATA can be very useful when you are trying to move data between two
> tables that share the same schema but 1 table is partitioned and the other
> table is NOT partitioned…once the files have been inserted into the
> unpartitioned table the source files from the hive warehouse can be added
> to the partitioned table using LOAD DATA.  Another place I’ve frequently
> used LOAD DATA is when synchronizing hive table data between two clusters:
> the hive warehouse data files can be copied from one cluster to the other
> with distcp and then loaded on the duplicate cluster using LOAD DATA to
> ensure the metadata is recorded in the hive metastore.
>
>
>
> *From:* Dmitry Goldenberg [mailto:dgoldenb...@hexastax.com]
> *Sent:* Tuesday, April 04, 2017 3:31 PM
> *To:* user@hive.apache.org
> *Subject:* Re: Is it possible to use LOAD DATA INPATH with a PARTITIONED,
> STORED AS PARQUET table?
>
>
>
>
> Right, that makes sense, Dudu.
>
>
>
> So basically, if we have our data in "some form", and a goal of loading it
> into a parquet, partitioned table in Hive, we have two choices:
>
>
>
> A. Load this data into a temporary table first. Presumably, for this we
> should be able to do a LOAD INPATH, from delimited data files. Perhaps we
> could designate the temp table as transactional and then simply do direct
> INSERT's into this temp table - ? Then, as the second step, we'd do an
> INSERT... SELECT, to move the data into the destination table, and then
> DROP the temp table.
>
>
>
> B. Represent the data as a delimited format and do a LOAD INPATH directly
> into the destination table. Understandably, we lose the 'data verification'
> this way. If we go this route, must the data in the input files be in the
> PARQUET format or in a delimited format?  I would guess, the former.  And,
> how does partitioning play into it?  How would the input data need to be
> organized and inserted so as to adhere to the partitions (the 'date' and
> 'content-type' columns, in my example)?
>
>
>
>
>
>
>
> On Tue, Apr 4, 2017 at 2:22 PM, Markovitz, Dudu 
> wrote:
>
> “LOAD” is very misleading here. It is all done at the metadata level.
>
> The data is not being touched. The data is not being verified. The
> “system” does not have any clue whether the files' format matches the table
> definition and they can actually be used.
>
> The data files are being “moved” (again,  a metadata operation) from their
> current HDFS location to the location defined for the table.
>
> Later on, when you query the table, the files will be scanned. If they are
> in the right format you'll get results. If not, then no.
>
>
>
> *From:* Dmitry Goldenberg [mailto:dgoldenb...@hexastax.com]
> *Sent:* Tuesday, April 04, 2017 8:54 PM
> *To:* user@hive.apache.org
> *Subject:* Re: Is it possible to use LOAD DATA INPATH with a PARTITIONED,
> STORED AS PARQUET table?
>
>
>
> Thanks, Dudu. I think there's a disconnect here. We're using LOAD INPATH
> on a few tables to achieve the effect of actual insertion of records. Is it
> not the case that the LOAD causes the data to get inserted into Hive?
>
> Based on that I'd like to understand whether we can get away with using
> LOAD INPATH instead of INSERT/SELECT FROM.
>
>
> On Apr 4, 2017, at 1:43 PM, Markovitz, Dudu  wrote:
>
> I just want to verify that you understand the following:
>
>
>
> · LOAD DATA INPATH is just a HDFS file movement operation.

Re: Is it possible to use LOAD DATA INPATH with a PARTITIONED, STORED AS PARQUET table?

2017-04-04 Thread Dmitry Goldenberg
Right, that makes sense, Dudu.

So basically, if we have our data in "some form", and a goal of loading it
into a parquet, partitioned table in Hive, we have two choices:

A. Load this data into a temporary table first. Presumably, for this we
should be able to do a LOAD INPATH, from delimited data files. Perhaps we
could designate the temp table as transactional and then simply do direct
INSERT's into this temp table - ? Then, as the second step, we'd do an
INSERT... SELECT, to move the data into the destination table, and then
DROP the temp table.

B. Represent the data as a delimited format and do a LOAD INPATH directly
into the destination table. Understandably, we lose the 'data verification'
this way. If we go this route, must the data in the input files be in the
PARQUET format or in a delimited format?  I would guess, the former.  And,
how does partitioning play into it?  How would the input data need to be
organized and inserted so as to adhere to the partitions (the 'date' and
'content-type' columns, in my example)?



On Tue, Apr 4, 2017 at 2:22 PM, Markovitz, Dudu 
wrote:

> “LOAD” is very misleading here. It is all done at the metadata level.
>
> The data is not being touched. The data is not being verified. The
> “system” does not have any clue whether the files' format matches the table
> definition and they can actually be used.
>
> The data files are being “moved” (again,  a metadata operation) from their
> current HDFS location to the location defined for the table.
>
> Later on, when you query the table, the files will be scanned. If they are
> in the right format you'll get results. If not, then no.
>
>
>
> *From:* Dmitry Goldenberg [mailto:dgoldenb...@hexastax.com]
> *Sent:* Tuesday, April 04, 2017 8:54 PM
> *To:* user@hive.apache.org
> *Subject:* Re: Is it possible to use LOAD DATA INPATH with a PARTITIONED,
> STORED AS PARQUET table?
>
>
>
> Thanks, Dudu. I think there's a disconnect here. We're using LOAD INPATH
> on a few tables to achieve the effect of actual insertion of records. Is it
> not the case that the LOAD causes the data to get inserted into Hive?
>
> Based on that I'd like to understand whether we can get away with using
> LOAD INPATH instead of INSERT/SELECT FROM.
>
>
> On Apr 4, 2017, at 1:43 PM, Markovitz, Dudu  wrote:
>
> I just want to verify that you understand the following:
>
>
>
> · LOAD DATA INPATH is just a HDFS file movement operation.
>
> You can achieve the same results by using *hdfs dfs -mv …*
>
>
>
> ·     LOAD DATA LOCAL  INPATH is just a file copying operation from
> the shell to the HDFS.
>
> You can achieve the same results by using *hdfs dfs -put …*
>
>
>
>
>
> *From:* Dmitry Goldenberg [mailto:dgoldenb...@hexastax.com
> ]
> *Sent:* Tuesday, April 04, 2017 7:48 PM
> *To:* user@hive.apache.org
> *Subject:* Re: Is it possible to use LOAD DATA INPATH with a PARTITIONED,
> STORED AS PARQUET table?
>
>
>
> Dudu,
>
>
>
> This is still in design stages, so we have a way to get the data from its
> source. The data is *not* in the Parquet format.  It's up to us to format
> it the best and most efficient way.  We can roll with CSV or Parquet;
> ultimately the data must make it into a pre-defined PARQUET, PARTITIONED
> table in Hive.
>
>
>
> Thanks,
>
> - Dmitry
>
>
>
> On Tue, Apr 4, 2017 at 12:20 PM, Markovitz, Dudu 
> wrote:
>
> Are your files already in Parquet format?
>
>
>
> *From:* Dmitry Goldenberg [mailto:dgoldenb...@hexastax.com]
> *Sent:* Tuesday, April 04, 2017 7:03 PM
> *To:* user@hive.apache.org
> *Subject:* Re: Is it possible to use LOAD DATA INPATH with a PARTITIONED,
> STORED AS PARQUET table?
>
>
>
> Thanks, Dudu.
>
>
>
> Just to re-iterate; the way I'm reading your response is that yes, we can
> use LOAD INPATH for a PARQUET, PARTITIONED table, provided that the data in
> the delimited file is properly formatted.  Then we can LOAD it into the
> table (mytable in my example) directly and avoid the creation of the temp
> table (origtable in my example).  Correct so far?
>
>
>
> I did not quite follow the latter part of your response:
>
> >> You should only create an external table which is an interface to read
> the files and use it in an INSERT operation.
>
>
>
> My assumption was that we would LOAD INPATH and not have to use INSERT
> altogether.  Am I missing something in grokking this latter part of your
> response?
>
>
>
> Thanks,
>
> - Dmitry
>
>
>
> On Tue, Apr 4, 2017 at 11:26 AM, Markovitz, Dudu 
> wrote:
>
> Since LOAD DATA INPATH only moves files, the answer is very simple.

Re: Is it possible to use LOAD DATA INPATH with a PARTITIONED, STORED AS PARQUET table?

2017-04-04 Thread Dmitry Goldenberg
Thanks, Dudu. I think there's a disconnect here. We're using LOAD INPATH on a 
few tables to achieve the effect of actual insertion of records. Is it not the 
case that the LOAD causes the data to get inserted into Hive?

Based on that I'd like to understand whether we can get away with using LOAD 
INPATH instead of INSERT/SELECT FROM.

> On Apr 4, 2017, at 1:43 PM, Markovitz, Dudu  wrote:
> 
> I just want to verify that you understand the following:
>  
> · LOAD DATA INPATH is just a HDFS file movement operation.
> You can achieve the same results by using hdfs dfs -mv …
>  
> · LOAD DATA LOCAL  INPATH is just a file copying operation from the 
> shell to the HDFS.
> You can achieve the same results by using hdfs dfs -put …
>  
>  
> From: Dmitry Goldenberg [mailto:dgoldenb...@hexastax.com] 
> Sent: Tuesday, April 04, 2017 7:48 PM
> To: user@hive.apache.org
> Subject: Re: Is it possible to use LOAD DATA INPATH with a PARTITIONED, 
> STORED AS PARQUET table?
>  
> Dudu,
>  
> This is still in design stages, so we have a way to get the data from its 
> source. The data is *not* in the Parquet format.  It's up to us to format it 
> the best and most efficient way.  We can roll with CSV or Parquet; ultimately 
> the data must make it into a pre-defined PARQUET, PARTITIONED table in Hive.
>  
> Thanks,
> - Dmitry
>  
> On Tue, Apr 4, 2017 at 12:20 PM, Markovitz, Dudu  
> wrote:
> Are your files already in Parquet format?
>  
> From: Dmitry Goldenberg [mailto:dgoldenb...@hexastax.com] 
> Sent: Tuesday, April 04, 2017 7:03 PM
> To: user@hive.apache.org
> Subject: Re: Is it possible to use LOAD DATA INPATH with a PARTITIONED, 
> STORED AS PARQUET table?
>  
> Thanks, Dudu.
>  
> Just to re-iterate; the way I'm reading your response is that yes, we can use 
> LOAD INPATH for a PARQUET, PARTITIONED table, provided that the data in the 
> delimited file is properly formatted.  Then we can LOAD it into the table 
> (mytable in my example) directly and avoid the creation of the temp table 
> (origtable in my example).  Correct so far?
>  
> I did not quite follow the latter part of your response:
> >> You should only create an external table which is an interface to read the 
> >> files and use it in an INSERT operation.
>  
> My assumption was that we would LOAD INPATH and not have to use INSERT 
> altogether.  Am I missing something in grokking this latter part of your
> response?
>  
> Thanks,
> - Dmitry
>  
> On Tue, Apr 4, 2017 at 11:26 AM, Markovitz, Dudu  
> wrote:
> Since LOAD DATA INPATH only moves files, the answer is very simple.
> If your files are already in a format that matches the destination table
> (storage type, number and types of columns, etc.) then – yes; and if not,
> then – no.
>  
> But –
> You don’t need to load the files into intermediary table.
> You should only create an external table which is an interface to read the 
> files and use it in an INSERT operation.
>  
> Dudu
>  
> From: Dmitry Goldenberg [mailto:dgoldenb...@hexastax.com] 
> Sent: Tuesday, April 04, 2017 4:52 PM
> To: user@hive.apache.org
> Subject: Is it possible to use LOAD DATA INPATH with a PARTITIONED, STORED AS 
> PARQUET table?
>  
> We have a table such as the following defined:
> CREATE TABLE IF NOT EXISTS db.mytable (
>   `item_id` string,
>   `timestamp` string,
>   `item_comments` string)
> PARTITIONED BY (`date` string, `content_type` string)
> STORED AS PARQUET;
> 
> Currently we insert data into this PARQUET, PARTITIONED table as follows, 
> using an intermediary table:
> 
> INSERT INTO TABLE db.mytable PARTITION (`date`, `content_type`)
> SELECT itemid AS item_id, itemts AS `timestamp`, `date`, content_type
> FROM db.origtable
> WHERE `date` = '${SELECTED_DATE}'
> GROUP BY itemid, itemts, `date`, content_type;
> 
> Our question is, would it be possible to use the LOAD DATA INPATH.. INTO 
> TABLE syntax to load the data from delimited data files into 'mytable' rather 
> than populating mytable from the intermediary table?
>  
> I see in the Hive documentation that:
> * Load operations are currently pure copy/move operations that move datafiles 
> into locations corresponding to Hive tables.
> * If the table is partitioned, then one must specify a specific partition of 
> the table by specifying values for all of the partitioning columns.
>  
> This seems to indicate that using LOAD is possible; however looking at this 
> discussion: 
> http://grokbase.com/t/hive/user/114frbfg0y/can-i-use-hive-dynamic-partition-while-loading-data-into-tables,
>  perhaps not?
>  
> We'd like to understand if using LOAD in the case of PARQUET, PARTITIONED 
> tables is possible and if so, then how does one go about using LOAD in that 
> case?
>  
> Thanks,
> - Dmitry
>  
>  
>  


Re: Is it possible to use LOAD DATA INPATH with a PARTITIONED, STORED AS PARQUET table?

2017-04-04 Thread Dmitry Goldenberg
Dudu,

This is still in design stages, so we have a way to get the data from its
source. The data is *not* in the Parquet format.  It's up to us to format
it the best and most efficient way.  We can roll with CSV or Parquet;
ultimately the data must make it into a pre-defined PARQUET, PARTITIONED
table in Hive.

Thanks,
- Dmitry

On Tue, Apr 4, 2017 at 12:20 PM, Markovitz, Dudu 
wrote:

> Are your files already in Parquet format?
>
>
>
> *From:* Dmitry Goldenberg [mailto:dgoldenb...@hexastax.com]
> *Sent:* Tuesday, April 04, 2017 7:03 PM
> *To:* user@hive.apache.org
> *Subject:* Re: Is it possible to use LOAD DATA INPATH with a PARTITIONED,
> STORED AS PARQUET table?
>
>
>
> Thanks, Dudu.
>
>
>
> Just to re-iterate; the way I'm reading your response is that yes, we can
> use LOAD INPATH for a PARQUET, PARTITIONED table, provided that the data in
> the delimited file is properly formatted.  Then we can LOAD it into the
> table (mytable in my example) directly and avoid the creation of the temp
> table (origtable in my example).  Correct so far?
>
>
>
> I did not quite follow the latter part of your response:
>
> >> You should only create an external table which is an interface to read
> the files and use it in an INSERT operation.
>
>
>
> My assumption was that we would LOAD INPATH and not have to use INSERT
> altogether.  Am I missing something in grokking this latter part of your
> response?
>
>
>
> Thanks,
>
> - Dmitry
>
>
>
> On Tue, Apr 4, 2017 at 11:26 AM, Markovitz, Dudu 
> wrote:
>
> Since LOAD DATA INPATH only moves files, the answer is very simple.
>
> If you’re files are already in a format that matches the destination table
> (storage type, number and types of columns etc.) then – yes and if not,
> then – no.
>
>
>
> But –
>
> You don’t need to load the files into intermediary table.
>
> You should only create an external table which is an interface to read the
> files and use it in an INSERT operation.
>
>
>
> Dudu
>
>
>
> *From:* Dmitry Goldenberg [mailto:dgoldenb...@hexastax.com]
> *Sent:* Tuesday, April 04, 2017 4:52 PM
> *To:* user@hive.apache.org
> *Subject:* Is it possible to use LOAD DATA INPATH with a PARTITIONED,
> STORED AS PARQUET table?
>
>
>
> We have a table such as the following defined:
>
> CREATE TABLE IF NOT EXISTS db.mytable (
>   `item_id` string,
>   `timestamp` string,
>   `item_comments` string)
> PARTITIONED BY (`date` string, `content_type` string)
> STORED AS PARQUET;
>
> Currently we insert data into this PARQUET, PARTITIONED table as follows,
> using an intermediary table:
>
> INSERT INTO TABLE db.mytable PARTITION (`date`, `content_type`)
> SELECT itemid AS item_id, itemts AS `timestamp`, `date`, content_type
> FROM db.origtable
> WHERE `date` = '${SELECTED_DATE}'
> GROUP BY itemid, itemts, `date`, content_type;
>
> Our question is, would it be possible to use the LOAD DATA INPATH.. INTO
> TABLE syntax to load the data from delimited data files into 'mytable'
> rather than populating mytable from the intermediary table?
>
>
>
> I see in the Hive documentation that:
>
> * Load operations are currently pure copy/move operations that move
> datafiles into locations corresponding to Hive tables.
>
> * If the table is partitioned, then one must specify a specific partition
> of the table by specifying values for all of the partitioning columns.
>
>
>
> This seems to indicate that using LOAD is possible; however looking at
> this discussion:
> http://grokbase.com/t/hive/user/114frbfg0y/can-i-use-hive-dynamic-partition-while-loading-data-into-tables,
> perhaps not?
>
>
>
> We'd like to understand if using LOAD in the case of PARQUET, PARTITIONED
> tables is possible and if so, then how does one go about using LOAD in that
> case?
>
>
>
> Thanks,
>
> - Dmitry
>
>
>
>
>


Re: Is it possible to use LOAD DATA INPATH with a PARTITIONED, STORED AS PARQUET table?

2017-04-04 Thread Dmitry Goldenberg
Thanks, Dudu.

Just to re-iterate; the way I'm reading your response is that yes, we can
use LOAD INPATH for a PARQUET, PARTITIONED table, provided that the data in
the delimited file is properly formatted.  Then we can LOAD it into the
table (mytable in my example) directly and avoid the creation of the temp
table (origtable in my example).  Correct so far?

I did not quite follow the latter part of your response:
>> You should only create an external table which is an interface to read
the files and use it in an INSERT operation.

My assumption was that we would LOAD INPATH and not have to use INSERT
altogether.  Am I missing something in grokking this latter part of your
response?

Thanks,
- Dmitry
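
For reference, a minimal sketch of the distinction Dudu seems to be
drawing, with hypothetical names: the external table is never LOADed into
at all; it reads the delimited files where they already sit, and the INSERT
is what converts and partitions:

-- External table: a read interface over files already in HDFS.
CREATE EXTERNAL TABLE db.items_in (
  `item_id`       string,
  `timestamp`     string,
  `item_comments` string,
  `date`          string,
  `content_type`  string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/incoming/items';

-- This INSERT replaces LOAD: it rewrites the rows as Parquet and lays
-- them out by partition (dynamic partitioning flags must be enabled,
-- as in the earlier staging-table sketch).
INSERT INTO TABLE db.mytable PARTITION (`date`, `content_type`)
SELECT item_id, `timestamp`, item_comments, `date`, `content_type`
FROM db.items_in;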

On Tue, Apr 4, 2017 at 11:26 AM, Markovitz, Dudu 
wrote:

> Since LOAD DATA INPATH only moves files, the answer is very simple.
>
> If you’re files are already in a format that matches the destination table
> (storage type, number and types of columns etc.) then – yes and if not,
> then – no.
>
>
>
> But –
>
> You don’t need to load the files into intermediary table.
>
> You should only create an external table which is an interface to read the
> files and use it in an INSERT operation.
>
>
>
> Dudu
>
>
>
> *From:* Dmitry Goldenberg [mailto:dgoldenb...@hexastax.com]
> *Sent:* Tuesday, April 04, 2017 4:52 PM
> *To:* user@hive.apache.org
> *Subject:* Is it possible to use LOAD DATA INPATH with a PARTITIONED,
> STORED AS PARQUET table?
>
>
>
> We have a table such as the following defined:
>
> CREATE TABLE IF NOT EXISTS db.mytable (
>   `item_id` string,
>   `timestamp` string,
>   `item_comments` string)
> PARTITIONED BY (`date` string, `content_type` string)
> STORED AS PARQUET;
>
> Currently we insert data into this PARQUET, PARTITIONED table as follows,
> using an intermediary table:
>
> INSERT INTO TABLE db.mytable PARTITION (`date`, `content_type`)
> SELECT itemid AS item_id, itemts AS `timestamp`, `date`, content_type
> FROM db.origtable
> WHERE `date` = '${SELECTED_DATE}'
> GROUP BY itemid, itemts, `date`, content_type;
>
> Our question is, would it be possible to use the LOAD DATA INPATH.. INTO
> TABLE syntax to load the data from delimited data files into 'mytable'
> rather than populating mytable from the intermediary table?
>
>
>
> I see in the Hive documentation that:
>
> * Load operations are currently pure copy/move operations that move
> datafiles into locations corresponding to Hive tables.
>
> * If the table is partitioned, then one must specify a specific partition
> of the table by specifying values for all of the partitioning columns.
>
>
>
> This seems to indicate that using LOAD is possible; however looking at
> this discussion:
> http://grokbase.com/t/hive/user/114frbfg0y/can-i-use-hive-dynamic-partition-while-loading-data-into-tables,
> perhaps not?
>
>
>
> We'd like to understand if using LOAD in the case of PARQUET, PARTITIONED
> tables is possible and if so, then how does one go about using LOAD in that
> case?
>
>
>
> Thanks,
>
> - Dmitry
>
>
>


Is it possible to use LOAD DATA INPATH with a PARTITIONED, STORED AS PARQUET table?

2017-04-04 Thread Dmitry Goldenberg
We have a table such as the following defined:

CREATE TABLE IF NOT EXISTS db.mytable (
  `item_id` string,
  `timestamp` string,
  `item_comments` string)
PARTITIONED BY (`date` string, `content_type` string)
STORED AS PARQUET;

Currently we insert data into this PARQUET, PARTITIONED table as follows,
using an intermediary table:

INSERT INTO TABLE db.mytable PARTITION (`date`, `content_type`)
SELECT itemid AS item_id, itemts AS `timestamp`, `date`, content_type
FROM db.origtable
WHERE `date` = '${SELECTED_DATE}'
GROUP BY itemid, itemts, `date`, content_type;
Our question is, would it be possible to use the LOAD DATA INPATH.. INTO
TABLE syntax to load the data from delimited data files into 'mytable'
rather than populating mytable from the intermediary table?

I see in the Hive documentation that:
* Load operations are currently pure copy/move operations that move
datafiles into locations corresponding to Hive tables.
* If the table is partitioned, then one must specify a specific partition
of the table by specifying values for all of the partitioning columns.

This seems to indicate that using LOAD is possible; however looking at this
discussion:
http://grokbase.com/t/hive/user/114frbfg0y/can-i-use-hive-dynamic-partition-while-loading-data-into-tables,
perhaps not?

We'd like to understand if using LOAD in the case of PARQUET, PARTITIONED
tables is possible and if so, then how does one go about using LOAD in that
case?

Thanks,
- Dmitry