Hi Vinoth,

Thank you for your time. Neither of these two issues is blocking as of now,
but priority should be given to the S3 connections.
Further to our discussion, I shall focus on the strategy below:
Disable dynamic allocation, rerun the job, and check whether the clean job
is parallelized or is running on a single executor.
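For reference, dynamic allocation could be turned off at submit time roughly as below (the flag names are standard Spark configuration; the class name, jar, and executor count are only placeholders):

```shell
# Sketch: pin a fixed number of executors so the clean stage's parallelism
# is easy to observe. Placeholders: com.example.IngestJob, ingest-job.jar.
spark-submit \
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.executor.instances=10 \
  --class com.example.IngestJob \
  ingest-job.jar
```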

Try to find the exact option, along the lines of
"df.option("hoodie.clean.automatic", "false")", to disable the cleaning job
and see if that solves the issue.
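As a sketch, the option would sit on the DataSource write path something like the below (assuming the 0.4.x "com.uber.hoodie" format name used elsewhere in this thread; the record key field, table name, and save path are placeholders):

```scala
// Sketch only: disable Hudi's automatic cleaning during the write.
// "hoodie.clean.automatic" is the config discussed above; everything
// marked "placeholder" would need to match the real table.
df.write
  .format("com.uber.hoodie")
  .option("hoodie.clean.automatic", "false")
  .option("hoodie.datasource.write.recordkey.field", "_row_key") // placeholder key field
  .option("hoodie.table.name", "hudirec")                        // placeholder table name
  .mode("append")
  .save("s3://bucket/path/hudirec")                              // placeholder path
```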

I shall relay back my findings tomorrow.

Thanks for all the support,
Kabeer.

On Mar 26 2019, at 2:06 am, Vinoth Chandar <[email protected]> wrote:
> Will give this a shot as well. Between this and the S3 thing, what's
> blocking progress? both? ;) ?
>
> On Sat, Mar 23, 2019 at 7:10 PM Kabeer Ahmed <[email protected]> wrote:
> > Hi Vinoth,
> > Thank you for looking into this. I am planning to try this out over the
> > weekend if possible. I have just downloaded the 0.4.6 version of Hudi.
> > I think we can start with a very simple schema as below (copied from the
> > Hudi's own example)?
> >
> > val EXAMPLE_SCHEMA = "{\"type\": \"record\"," + "\"name\": \"hudirec\"," +
> > "\"fields\": [ " +
> > "{\"name\": \"timestamp\", \"type\": \"double\"}," +
> > "{\"name\": \"_row_key\", \"type\": \"string\"}," +
> > "{\"name\": \"trade_date\", \"type\": \"string\"}," +
> > "{\"name\": \"bats\", \"type\": \"int\"}]}";
> >
> > And sample data could be (cricket bats):
> > kabeer,2018-11-15T07:35:54.387Z,7
> > vinoth,2018-11-16T09:35:54.387Z,9
> > I did try passing several combinations in the timestamp field above to
> > set the logicalType to timestamp, but with no success. I was using the
> > Hive 1.1 compile-time flag, but I was not worried about reading data
> > through Hive. I could see in the generated parquet that the timestamp
> > field was NOT in the INT96 timestamp format that parquet expects.
> >
> > Keep me posted as to how you get along with this and I shall keep you
> > posted if I find any joy sooner than yourself.
> > Thanks
> > Kabeer.
> >
> > On Mar 23 2019, at 12:05 am, Vinoth Chandar <[email protected]> wrote:
> > > Hi Kabeer,
> > >
> > > I spent time looking at the issue and its other linked issues as well.
> > > High level, seems like we need to change the data type mappings for these
> > > date/timestamp types..
> > > It does seem doable, given Avro also supports date/timestamp types..
> > >
> > > Do you have some sample schema/data generation that we can start with?
> > > Thanks
> > > Vinoth
> > >
> > > On Fri, Mar 15, 2019 at 11:19 AM Vinoth Chandar <[email protected]>
> > wrote:
> > > > Hi Kabeer,
> > > > Thanks for bringing this up. I don't think we have actually hit this
> > > > before :)
> > > >
> > > > Let me spend some time understanding the issue and get back to you
> > > > Thanks
> > > > Vinoth
> > > >
> > > > On Thu, Mar 14, 2019 at 10:46 PM Kabeer Ahmed <[email protected]>
> > > > wrote:
> > > >
> > > > > Hi,
> > > > > https://github.com/apache/incubator-hudi/issues/547 has resulted
> > > > > in the jira https://issues.apache.org/jira/browse/HUDI-12.
> > > > > The requirement is to be able to interpret timestamp from CSV and
> > > > > store it in the parquet table. Does anyone have a working example
> > > > > on these lines?
> > > > > Going by the Hudi example from the GitHub:
> > > > > Timestamp is being encoded in avro as double:
> > > > > https://github.com/apache/incubator-hudi/blob/master/hoodie-client/src/test/java/com/uber/hoodie/common/HoodieTestDataGenerator.java#L69
> > > > >
> > > > > The end result is that the parquet field for timestamp is not of
> > > > > the timestamp (INT96) type.
> > > > >
> > > > > My best guess is that this would have been a requirement at Uber
> > > > > (tracking trips in minutes and seconds), and I wonder how it is
> > > > > being handled there.
> > > > >
> > > > > If anyone else has handled this and has an example that can be
> > > > > shared, it will be much appreciated.
> > > > > Kabeer Ahmed, http://www.linkedin.com/in/kabeerahmed
