RE: Question about the Payload in Hudi

2019-05-17 Thread FIXED-TERM Cheng Yuanbin (CR/PJ-AI-S1)
Hi,

I am very interested in fixing this behavior. I have already implemented a new 
Payload for our use case which can handle both the delta record and the Parquet 
record.
However, there are still some problems in that implementation.
Going back to question 3: the Payload has three methods, and my question is 
about the 'combineAndGetUpdateValue' method.

The preCombine method picks the payload with the greatest ordering value. The 
comparison is between two different Payload objects, and both of them carry the 
comparable field 'orderingVal' to indicate the order.
However, in 'combineAndGetUpdateValue' the comparison is between the current 
Payload and the current IndexedRecord in Parquet.

The question is: how can I get the 'orderingVal' of this IndexedRecord?
After searching nearly all the code related to the Payload, I found that Hudi 
first reads 'PRECOMBINE_FIELD_OPT_KEY' from the Hudi config, then extracts that 
field from the record as the orderingVal, and finally wraps both the 
'orderingVal' and the 'record' into a Payload object. So inside the Payload 
object I cannot access the value of 'PRECOMBINE_FIELD_OPT_KEY', yet that field 
name is necessary for extracting the orderingVal from the 'IndexedRecord'.

In our case I can hard-code the field name in 'combineAndGetUpdateValue'; below 
is the method I am currently using. But I don't like this approach, and it is 
obviously not a good way to solve the problem.

@Override
public Optional<IndexedRecord> combineAndGetUpdateValue(IndexedRecord currentValue,
    Schema schema) throws IOException {
  GenericRecord record = HoodieAvroUtils.bytesToAvro(this.recordBytes, schema);
  Long thisDATE = (Long) record.get("RESULT_DATE_UTS");
  Long currDATE = (Long) ((GenericRecord) currentValue).get("RESULT_DATE_UTS");
  if (currDATE.compareTo(thisDATE) < 0) {
    return Optional.of(record);
  } else {
    return Optional.of(currentValue);
  }
}

My idea is to carry the 'PRECOMBINE_FIELD_OPT_KEY' value inside the Payload 
itself, but that means I would have to change quite a few files in Hudi. Would 
you mind if I do this?
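
To make the idea concrete, here is a rough sketch of the method I have in mind. 
The 'orderingField' member is hypothetical: it stands for the 
PRECOMBINE_FIELD_OPT_KEY value that Hudi would need to hand to the payload (for 
example via an extra constructor argument), which is exactly the change I am 
proposing.

@Override
public Optional<IndexedRecord> combineAndGetUpdateValue(IndexedRecord currentValue,
    Schema schema) throws IOException {
  GenericRecord incoming = HoodieAvroUtils.bytesToAvro(this.recordBytes, schema);
  // 'orderingField' is assumed to hold the configured precombine field name.
  Comparable incomingVal = (Comparable) incoming.get(this.orderingField);
  Comparable currentVal = (Comparable) ((GenericRecord) currentValue).get(this.orderingField);
  // Keep whichever side has the greater ordering value, mirroring preCombine.
  if (currentVal.compareTo(incomingVal) > 0) {
    return Optional.of(currentValue);
  } else {
    return Optional.of(incoming);
  }
}
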
Otherwise, if you have any other idea or suggestion about how to fix this 
behavior, I would be happy to discuss it. I am glad to put together a patch for 
it.

Thanks so much for the reply and help.

Mit freundlichen Grüßen / Best regards

Yuanbin Cheng
CR/PJ-AI-S1  



-Original Message-
From: Vinoth Chandar  
Sent: Friday, May 17, 2019 8:02 AM
To: dev@hudi.apache.org
Subject: Re: Question about the Payload in Hudi

Hi,

What you mentioned is correct.

@Override
public Optional<IndexedRecord> combineAndGetUpdateValue(IndexedRecord currentValue,
    Schema schema) throws IOException {
  // combining strategy here trivially ignores currentValue on disk and writes this record
  return getInsertValue(schema);
}

I think we could change this behavior to match pre-combining. Are you 
interested in sending a patch?

Thanks
Vinoth

On Fri, May 17, 2019 at 7:18 AM Vinoth Chandar  wrote:

> Thanks for the clear example. Let me check this out and get back shortly.
>
> On Thu, May 16, 2019 at 5:29 PM Yanjia Li 
> wrote:
>
>> Hello Vinoth,
>>
>> I could add an example here to clarify this question.
>>
>> We have DF1{id:1, ts: 9} and DF2{id:1, ts:1; id:1, ts:2}. We save DF1 
>> first, then upsert DF2 to DF1. With the default payload, we will have 
>> the final result DF{id:1, ts:2}. But we are looking for DF{id:1, 
>> ts:9}. If I didn’t understand wrong, the precombine only combine the 
>> data in the delta dataframe, which is DF2 in the example. And the 
>> default payload only guarantees that we keep the latest time stamp in 
>> the current batch. In this example, the newer data arrived before the 
>> older data. We would like to confirm that whether we will need to 
>> write our own payload to handle this case. It will also be helpful to 
>> know if anyone else had similar issue before.
>>
>> Thanks so much!
>> Gary
>>
>> On Thu, May 16, 2019 at 2:49 PM Vinoth Chandar  wrote:
>>
>> > Hi,
>> >
>> > (Please subscribe to the mailing list, so the message actually 
>> > comes
>> over
>> > directly to the list.)
>> >
>> > On 1, the default payload overwrites the record on storage with new
>> coming
>> > record, if the precombine field has a higher value. for e.g, if you 
>> > use timestamp field, then it will overwrite with latest record 
>> > while it will not overwrite if you accidentally write a much older record.
>> >
>> > On 2, I think you can achieve this by setting the precombine key
>> properly..
>> > IIUC, you don't want the older record to overwrite the newer record?
>> >
>> > On 3, you can configure the PRECOMBINE key as documented here 
>> > http://hudi.apache.org/configurations.html#PRECOMBINE_FIELD_OPT_KEY ?
>> >
>> > Hope that helps. Please let me know if I missed something.
>> >
>> >
>> > Thanks
>> > Vinoth
>> >
>> > On Thu, May 16, 2019 at 7:07 AM FIXED-TERM Cheng Yuanbin 
>> > (CR/PJ-AI-S1) < 

Re: Upgrade HUDI to Hive 2.x

2019-05-17 Thread Vinoth Chandar
I am in favor of deprecating Hive 1.x unless someone has a strong objection.
Most cloud offerings like EMR and Dataproc support Hive 2.x, and Hive 3.x
adoption is going to grow.
This seems like a move in the right direction.

/thanks/vinoth

On Fri, May 17, 2019 at 11:55 AM nishith agarwal 
wrote:

> All,
>
> Is anyone using Hudi with Hive 1.x ? Currently, Hudi has a dependency on
> Hive 1.x and works against Hive 2.x by using specific profiles.
> There are non-backwards compatible changes in the HiveRecordReader for Hive
> 1.x vs Hive 2.x. I'm planning to upgrade to Hive 2.x which would
> essentially mean Hudi's realtime view (HudiRealtimeInputFormat) will NOT
> work with Hive 1.x anymore (mostly if the schema has nested columns). Also,
> I'm un-sure if Hive 2.x protocol is backward compatible with Hive 1.x (we
> depend on forwards compatibility right now for Hudi to work with 2.x and
> beyond).
> Let me know what you guys think.
>
> Thanks,
> Nishith
>


Upgrade HUDI to Hive 2.x

2019-05-17 Thread nishith agarwal
All,

Is anyone using Hudi with Hive 1.x ? Currently, Hudi has a dependency on
Hive 1.x and works against Hive 2.x by using specific profiles.
There are non-backwards compatible changes in the HiveRecordReader for Hive
1.x vs Hive 2.x. I'm planning to upgrade to Hive 2.x which would
essentially mean Hudi's realtime view (HudiRealtimeInputFormat) will NOT
work with Hive 1.x anymore (mostly if the schema has nested columns). Also,
I'm unsure if the Hive 2.x protocol is backward compatible with Hive 1.x (we
depend on forward compatibility right now for Hudi to work with 2.x and
beyond).
Let me know what you guys think.

Thanks,
Nishith


Re: Read RO table in Spark as hive table | No records returned

2019-05-17 Thread Vinoth Chandar
Glad you got it working. Any reason why you are not using the Hive sync tool to
manage the table creation/registration in Hive?
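
For reference, a minimal programmatic invocation of the sync tool looks roughly
like the sketch below. The class and field names are from memory and worth
double-checking against the hoodie-hive module; the database, table, path and
JDBC URL are just placeholders:

HiveSyncConfig cfg = new HiveSyncConfig();       // com.uber.hoodie.hive.HiveSyncConfig
cfg.databaseName = "hudi";
cfg.tableName = "emp_cow";
cfg.basePath = "/apps/hive/warehouse/emp_cow";   // base path of the Hudi dataset
cfg.jdbcUrl = "jdbc:hive2://localhost:10000";    // HiveServer2 JDBC URL (placeholder)
cfg.hiveUser = "hive";
cfg.hivePass = "hive";
// partition-related settings (partition fields, value extractor) omitted for brevity

HiveConf hiveConf = new HiveConf();
FileSystem fs = FileSystem.get(hiveConf);
// Creates/updates the Hive table and registers its partitions from the dataset metadata.
new HiveSyncTool(cfg, hiveConf, fs).syncHoodieTable();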

On Fri, May 17, 2019 at 7:04 AM satish.sidnakoppa...@gmail.com <
satish.sidnakoppa...@gmail.com> wrote:

>
>
> On 2019/05/17 12:45:26, satish.sidnakoppa...@gmail.com <
> satish.sidnakoppa...@gmail.com> wrote:
> >
> >
> > On 2019/05/17 12:37:10, satish.sidnakoppa...@gmail.com <
> satish.sidnakoppa...@gmail.com> wrote:
> > > Hi Team,
> > >
> > > Data is returned when queried from hive.
> > > But not in spark ,Could you assist in finding the gap.
> > >
> > > Details below
> > >
> > > **Approach 1 ---
> successful
> > >
> > > select * from emp_cow limit 2;
> > > 20190503171506  20190503171506_0_4244   default
> 71ff4cc6-bd8e-4c48-a075-98f32efc14b2_0_20190503171506.parquet  413Vivian
> Walter -1641   1556883906604   608806001   511.63  146186820
>  401217383000
> > > 20190503171506  20190503171506_0_4258   default
> 71ff4cc6-bd8e-4c48-a075-98f32efc14b2_0_20190503171506.parquet  813Oprah
> Gross   -32255  1556883906604   761166471   536.4   151647300
>  816189568000
> > >
> > > **Approach 2 ---
> successful
> > >
> > >
> spark.read.format("com.uber.hoodie").load("/apps/hive/warehouse/emp_cow_03/default/*").show
> > >
> +---++--+--++--+--+-+-+-+-+-+-+
> > >
> |_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|
>  _hoodie_file_name|emp_id|  emp_name|emp_short|   ts|
> emp_long|emp_float| emp_date|emp_timestamp|
> > >
> +---++--+--++--+--+-+-+-+-+-+-+
> > > | 20190503171506|20190503171506_0_424| 4|
>  default|71ff4cc6-bd8e-4c4...| 4|   13Vivian Walter|
> -1641|1556883906604|608806001|   511.63|146186820| 401217383000|
> > > +
> > >
> > > **Approach 3 --- No
> records
> > >
> > >
> > > ***To read RO table as a Hive table using Spark
> > > But when I read from spark as hive table - no records returned.
> > >
> > >
> > > sqlContext.sql("select * from hudi.emp_cow").show;  in scala
> console
> > > select * from hudi.emp_cow   
> in spark console
> > >
> > > NO result.
> > >
> > > Only headers/column names are printed.
> > >
> > >
> > > FYI Table DDL
> > >
> > >
> > > CREATE EXTERNAL TABLE `emp_cow`(
> > >   `_hoodie_commit_time` string,
> > >   `_hoodie_commit_seqno` string,
> > >   `_hoodie_record_key` string,
> > >   `_hoodie_partition_path` string,
> > >   `_hoodie_file_name` string,
> > >   `emp_id` int,
> > >   `emp_name` string,
> > >   `emp_short` int,
> > >   `ts` bigint,
> > >   `emp_long` bigint,
> > >   `emp_float` float,
> > >   `emp_date` bigint,
> > >   `emp_timestamp` bigint)
> > > ROW FORMAT SERDE
> > >   'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
> > > STORED AS INPUTFORMAT
> > >   'com.uber.hoodie.hadoop.HoodieInputFormat'
> > > OUTPUTFORMAT
> > >   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
> > > LOCATION
> > >   '/apps/hive/warehouse/emp_cow'
> > >
> >
> >
> >
> > Fixed the typo mistake
> >
> > path is /apps/hive/warehouse/emp_cow
> > table name is emp_cow
> >
>
>
> Issue fixed.
>
> Path in table creation was incorrect.
>
> LOCATION   '/apps/hive/warehouse/emp_cow'
> should
> LOCATION   '/apps/hive/warehouse/emp_cow/default'
>


Re: Question about the Payload in Hudi

2019-05-17 Thread Vinoth Chandar
Hi,

What you mentioned is correct.

@Override
public Optional<IndexedRecord> combineAndGetUpdateValue(IndexedRecord currentValue,
    Schema schema) throws IOException {
  // combining strategy here trivially ignores currentValue on disk and writes this record
  return getInsertValue(schema);
}
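
For comparison, preCombine keeps the payload with the greater ordering value,
roughly along these lines (quoting from memory):

@Override
public OverwriteWithLatestAvroPayload preCombine(OverwriteWithLatestAvroPayload another) {
  // pick the payload with the greatest ordering value
  if (another.orderingVal.compareTo(orderingVal) > 0) {
    return another;
  } else {
    return this;
  }
}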

I think we could change this behavior to match pre-combining. Are you
interested in sending a patch?

Thanks
Vinoth

On Fri, May 17, 2019 at 7:18 AM Vinoth Chandar  wrote:

> Thanks for the clear example. Let me check this out and get back shortly.
>
> On Thu, May 16, 2019 at 5:29 PM Yanjia Li 
> wrote:
>
>> Hello Vinoth,
>>
>> I could add an example here to clarify this question.
>>
>> We have DF1{id:1, ts: 9} and DF2{id:1, ts:1; id:1, ts:2}. We save DF1
>> first, then upsert DF2 to DF1. With the default payload, we will have the
>> final result DF{id:1, ts:2}. But we are looking for DF{id:1, ts:9}. If I
>> didn’t understand wrong, the precombine only combine the data in the delta
>> dataframe, which is DF2 in the example. And the default payload only
>> guarantees that we keep the latest time stamp in the current batch. In
>> this
>> example, the newer data arrived before the older data. We would like to
>> confirm that whether we will need to write our own payload to handle this
>> case. It will also be helpful to know if anyone else had similar issue
>> before.
>>
>> Thanks so much!
>> Gary
>>
>> On Thu, May 16, 2019 at 2:49 PM Vinoth Chandar  wrote:
>>
>> > Hi,
>> >
>> > (Please subscribe to the mailing list, so the message actually comes
>> over
>> > directly to the list.)
>> >
>> > On 1, the default payload overwrites the record on storage with new
>> coming
>> > record, if the precombine field has a higher value. for e.g, if you use
>> > timestamp field, then it will overwrite with latest record while it will
>> > not overwrite if you accidentally write a much older record.
>> >
>> > On 2, I think you can achieve this by setting the precombine key
>> properly..
>> > IIUC, you don't want the older record to overwrite the newer record?
>> >
>> > On 3, you can configure the PRECOMBINE key as documented here
>> > http://hudi.apache.org/configurations.html#PRECOMBINE_FIELD_OPT_KEY ?
>> >
>> > Hope that helps. Please let me know if I missed something.
>> >
>> >
>> > Thanks
>> > Vinoth
>> >
>> > On Thu, May 16, 2019 at 7:07 AM FIXED-TERM Cheng Yuanbin (CR/PJ-AI-S1) <
>> > fixed-term.yuanbin.ch...@us.bosch.com> wrote:
>> >
>> > > Hi Hudi,
>> > >
>> > > We want to use Apache Hudi for immigrating our data pipeline from
>> > batching
>> > > to incremental.
>> > > We face several questions about the Hudi. We so appreciate you can
>> help
>> > us
>> > > figure them out.
>> > >
>> > >
>> > > 1.  In the default Payload (OverwriteWithLatestAvroPayload), this
>> > > payload only concern and merge the records with the same key value in
>> the
>> > > Delta dataframe (new coming records), right?
>> > >
>> > > 2.  In our usage case, we want to keep the latest record in our
>> > > system. However. In the default Payload, if the Delta dataframe
>> contains
>> > > the older record than the record in the record already written in
>> Hudi,
>> > it
>> > > will simply overwrite them, which is not what we want. Do you have
>> some
>> > > suggestions about how to get the global latest record in Hudi?
>> > >
>> > > 3.  We have implemented a custom Payload class in order to get the
>> > > global latest record. However, we found that in the Payload class, we
>> > have
>> > > to hard-code the PRECOMBINE_FIELD_OPT_KEY value in Payload to get the
>> > value
>> > > in currentValue in order to compare them. Can I ask is any method I
>> can
>> > get
>> > > PRECOMBINE_FIELD_OPT_KEY in Payload, or is there any suggested method
>> for
>> > > dealing with this issue.
>> > >
>> > > Thanks so much!
>> > >
>> > > Mit freundlichen Grüßen / Best regards
>> > >
>> > > Yuanbin Cheng
>> > >
>> > >
>> > >
>> >
>>
>


Re: Question about the Payload in Hudi

2019-05-17 Thread Vinoth Chandar
Thanks for the clear example. Let me check this out and get back shortly.

On Thu, May 16, 2019 at 5:29 PM Yanjia Li  wrote:

> Hello Vinoth,
>
> I could add an example here to clarify this question.
>
> We have DF1{id:1, ts: 9} and DF2{id:1, ts:1; id:1, ts:2}. We save DF1
> first, then upsert DF2 to DF1. With the default payload, we will have the
> final result DF{id:1, ts:2}. But we are looking for DF{id:1, ts:9}. If I
> didn’t understand wrong, the precombine only combine the data in the delta
> dataframe, which is DF2 in the example. And the default payload only
> guarantees that we keep the latest time stamp in the current batch. In this
> example, the newer data arrived before the older data. We would like to
> confirm that whether we will need to write our own payload to handle this
> case. It will also be helpful to know if anyone else had similar issue
> before.
>
> Thanks so much!
> Gary
>
> On Thu, May 16, 2019 at 2:49 PM Vinoth Chandar  wrote:
>
> > Hi,
> >
> > (Please subscribe to the mailing list, so the message actually comes over
> > directly to the list.)
> >
> > On 1, the default payload overwrites the record on storage with new
> coming
> > record, if the precombine field has a higher value. for e.g, if you use
> > timestamp field, then it will overwrite with latest record while it will
> > not overwrite if you accidentally write a much older record.
> >
> > On 2, I think you can achieve this by setting the precombine key
> properly..
> > IIUC, you don't want the older record to overwrite the newer record?
> >
> > On 3, you can configure the PRECOMBINE key as documented here
> > http://hudi.apache.org/configurations.html#PRECOMBINE_FIELD_OPT_KEY ?
> >
> > Hope that helps. Please let me know if I missed something.
> >
> >
> > Thanks
> > Vinoth
> >
> > On Thu, May 16, 2019 at 7:07 AM FIXED-TERM Cheng Yuanbin (CR/PJ-AI-S1) <
> > fixed-term.yuanbin.ch...@us.bosch.com> wrote:
> >
> > > Hi Hudi,
> > >
> > > We want to use Apache Hudi for immigrating our data pipeline from
> > batching
> > > to incremental.
> > > We face several questions about the Hudi. We so appreciate you can help
> > us
> > > figure them out.
> > >
> > >
> > > 1.  In the default Payload (OverwriteWithLatestAvroPayload), this
> > > payload only concern and merge the records with the same key value in
> the
> > > Delta dataframe (new coming records), right?
> > >
> > > 2.  In our usage case, we want to keep the latest record in our
> > > system. However. In the default Payload, if the Delta dataframe
> contains
> > > the older record than the record in the record already written in Hudi,
> > it
> > > will simply overwrite them, which is not what we want. Do you have some
> > > suggestions about how to get the global latest record in Hudi?
> > >
> > > 3.  We have implemented a custom Payload class in order to get the
> > > global latest record. However, we found that in the Payload class, we
> > have
> > > to hard-code the PRECOMBINE_FIELD_OPT_KEY value in Payload to get the
> > value
> > > in currentValue in order to compare them. Can I ask is any method I can
> > get
> > > PRECOMBINE_FIELD_OPT_KEY in Payload, or is there any suggested method
> for
> > > dealing with this issue.
> > >
> > > Thanks so much!
> > >
> > > Mit freundlichen Grüßen / Best regards
> > >
> > > Yuanbin Cheng
> > >
> > >
> > >
> >
>


Re: Read RO table in Spark as hive table | No records returned

2019-05-17 Thread satish . sidnakoppa . it



On 2019/05/17 12:45:26, satish.sidnakoppa...@gmail.com 
 wrote: 
> 
> 
> On 2019/05/17 12:37:10, satish.sidnakoppa...@gmail.com 
>  wrote: 
> > Hi Team,
> > 
> > Data is returned when queried from hive.
> > But not in spark ,Could you assist in finding the gap.
> > 
> > Details below
> > 
> > **Approach 1 --- 
> > successful
> > 
> > select * from emp_cow limit 2;
> > 20190503171506  20190503171506_0_4244   default 
> > 71ff4cc6-bd8e-4c48-a075-98f32efc14b2_0_20190503171506.parquet  413Vivian 
> > Walter -1641   1556883906604   608806001   511.63  146186820   
> > 401217383000
> > 20190503171506  20190503171506_0_4258   default 
> > 71ff4cc6-bd8e-4c48-a075-98f32efc14b2_0_20190503171506.parquet  813Oprah 
> > Gross   -32255  1556883906604   761166471   536.4   151647300   
> > 816189568000
> > 
> > **Approach 2 --- 
> > successful
> > 
> > spark.read.format("com.uber.hoodie").load("/apps/hive/warehouse/emp_cow_03/default/*").show
> > +---++--+--++--+--+-+-+-+-+-+-+
> > |_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|
> >_hoodie_file_name|emp_id|  emp_name|emp_short|   ts| 
> > emp_long|emp_float| emp_date|emp_timestamp|
> > +---++--+--++--+--+-+-+-+-+-+-+
> > | 20190503171506|20190503171506_0_424| 4|   
> > default|71ff4cc6-bd8e-4c4...| 4|   13Vivian Walter|
> > -1641|1556883906604|608806001|   511.63|146186820| 401217383000|
> > +
> > 
> > **Approach 3 --- No 
> > records
> > 
> > 
> > ***To read RO table as a Hive table using Spark
> > But when I read from spark as hive table - no records returned.
> > 
> > 
> > sqlContext.sql("select * from hudi.emp_cow").show;  in scala console 
> > select * from hudi.emp_cow    in 
> > spark console
> > 
> > NO result.
> > 
> > Only headers/column names are printed.
> > 
> > 
> > FYI Table DDL
> > 
> > 
> > CREATE EXTERNAL TABLE `emp_cow`(
> >   `_hoodie_commit_time` string,
> >   `_hoodie_commit_seqno` string,
> >   `_hoodie_record_key` string,
> >   `_hoodie_partition_path` string,
> >   `_hoodie_file_name` string,
> >   `emp_id` int,
> >   `emp_name` string,
> >   `emp_short` int,
> >   `ts` bigint,
> >   `emp_long` bigint,
> >   `emp_float` float,
> >   `emp_date` bigint,
> >   `emp_timestamp` bigint)
> > ROW FORMAT SERDE
> >   'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
> > STORED AS INPUTFORMAT
> >   'com.uber.hoodie.hadoop.HoodieInputFormat'
> > OUTPUTFORMAT
> >   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
> > LOCATION
> >   '/apps/hive/warehouse/emp_cow'
> > 
> 
> 
> 
> Fixed the typo mistake 
> 
> path is /apps/hive/warehouse/emp_cow
> table name is emp_cow
> 


Issue fixed.

The path in the table creation DDL was incorrect:

LOCATION   '/apps/hive/warehouse/emp_cow'
should be
LOCATION   '/apps/hive/warehouse/emp_cow/default'


Re: Read RO table in Spark as hive table | No records returned

2019-05-17 Thread satish . sidnakoppa . it



On 2019/05/17 12:37:10, satish.sidnakoppa...@gmail.com 
 wrote: 
> Hi Team,
> 
> Data is returned when queried from hive.
> But not in spark ,Could you assist in finding the gap.
> 
> Details below
> 
> **Approach 1 --- 
> successful
> 
> select * from emp_cow limit 2;
> 20190503171506  20190503171506_0_4244   default 
> 71ff4cc6-bd8e-4c48-a075-98f32efc14b2_0_20190503171506.parquet  413Vivian 
> Walter -1641   1556883906604   608806001   511.63  146186820   
> 401217383000
> 20190503171506  20190503171506_0_4258   default 
> 71ff4cc6-bd8e-4c48-a075-98f32efc14b2_0_20190503171506.parquet  813Oprah Gross 
>   -32255  1556883906604   761166471   536.4   151647300   816189568000
> 
> **Approach 2 --- 
> successful
> 
> spark.read.format("com.uber.hoodie").load("/apps/hive/warehouse/emp_cow_03/default/*").show
> +---++--+--++--+--+-+-+-+-+-+-+
> |_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|
>_hoodie_file_name|emp_id|  emp_name|emp_short|   ts| 
> emp_long|emp_float| emp_date|emp_timestamp|
> +---++--+--++--+--+-+-+-+-+-+-+
> | 20190503171506|20190503171506_0_424| 4|   
> default|71ff4cc6-bd8e-4c4...| 4|   13Vivian Walter|
> -1641|1556883906604|608806001|   511.63|146186820| 401217383000|
> +
> 
> **Approach 3 --- No 
> records
> 
> 
> ***To read RO table as a Hive table using Spark
> But when I read from spark as hive table - no records returned.
> 
> 
> sqlContext.sql("select * from hudi.emp_cow").show;  in scala console 
> select * from hudi.emp_cow    in 
> spark console
> 
> NO result.
> 
> Only headers/column names are printed.
> 
> 
> FYI Table DDL
> 
> 
> CREATE EXTERNAL TABLE `emp_cow`(
>   `_hoodie_commit_time` string,
>   `_hoodie_commit_seqno` string,
>   `_hoodie_record_key` string,
>   `_hoodie_partition_path` string,
>   `_hoodie_file_name` string,
>   `emp_id` int,
>   `emp_name` string,
>   `emp_short` int,
>   `ts` bigint,
>   `emp_long` bigint,
>   `emp_float` float,
>   `emp_date` bigint,
>   `emp_timestamp` bigint)
> ROW FORMAT SERDE
>   'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
> STORED AS INPUTFORMAT
>   'com.uber.hoodie.hadoop.HoodieInputFormat'
> OUTPUTFORMAT
>   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
> LOCATION
>   '/apps/hive/warehouse/emp_cow'
> 



Fixed the typo:

path is /apps/hive/warehouse/emp_cow
table name is emp_cow


Read RO table in Spark as hive table | No records returned

2019-05-17 Thread satish . sidnakoppa . it
Hi Team,

Data is returned when queried from Hive, but not from Spark. Could you assist
in finding the gap?

Details below

**Approach 1 --- 
successful

select * from emp_cow limit 2;
20190503171506  20190503171506_0_4244   default 
71ff4cc6-bd8e-4c48-a075-98f32efc14b2_0_20190503171506.parquet  413Vivian Walter 
-1641   1556883906604   608806001   511.63  146186820   401217383000
20190503171506  20190503171506_0_4258   default 
71ff4cc6-bd8e-4c48-a075-98f32efc14b2_0_20190503171506.parquet  813Oprah Gross   
-32255  1556883906604   761166471   536.4   151647300   816189568000

**Approach 2 --- 
successful

spark.read.format("com.uber.hoodie").load("/apps/hive/warehouse/emp_cow_03/default/*").show
+---++--+--++--+--+-+-+-+-+-+-+
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|
   _hoodie_file_name|emp_id|  emp_name|emp_short|   ts| 
emp_long|emp_float| emp_date|emp_timestamp|
+---++--+--++--+--+-+-+-+-+-+-+
| 20190503171506|20190503171506_0_424| 4|   
default|71ff4cc6-bd8e-4c4...| 4|   13Vivian Walter|
-1641|1556883906604|608806001|   511.63|146186820| 401217383000|
+

**Approach 3 --- No 
records


***To read RO table as a Hive table using Spark
But when I read from spark as hive table - no records returned.


sqlContext.sql("select * from hudi.emp_cow_03").show;  in scala console 
select * from hudi.emp_cow_03    in 
spark console

NO result.

Only headers/column names are printed.


FYI Table DDL


CREATE EXTERNAL TABLE `emp_cow`(
  `_hoodie_commit_time` string,
  `_hoodie_commit_seqno` string,
  `_hoodie_record_key` string,
  `_hoodie_partition_path` string,
  `_hoodie_file_name` string,
  `emp_id` int,
  `emp_name` string,
  `emp_short` int,
  `ts` bigint,
  `emp_long` bigint,
  `emp_float` float,
  `emp_date` bigint,
  `emp_timestamp` bigint)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'com.uber.hoodie.hadoop.HoodieInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  'hdfs://nn10.htrunk.com/apps/hive/warehouse/emp_cow'