Hi Nishith,

I have checked the data; there are no nulls in that field. Is there any other possible cause of this error?
Thanks,
Qian

On Oct 8, 2019, 10:55 AM -0700, Qian Wang <[email protected]>, wrote:
> Hi Nishith,
>
> Thanks for your response.
> The session_date is one field in my original dataset. I have some questions
> about the schema parameter:
>
> 1. Do I need to create the target table?
> 2. My source data is in Parquet format; why does the tool need a schema file as a
> parameter?
> 3. Can I use a schema file in Avro format?
>
> The schema looks like:
>
> {"type":"record","name":"PathExtractData","doc":"Path event extract fact data","fields":[
> {"name":"SESSION_DATE","type":"string"},
> {"name":"SITE_ID","type":"int"},
> {"name":"GUID","type":"string"},
> {"name":"SESSION_KEY","type":"long"},
> {"name":"USER_ID","type":"string"},
> {"name":"STEP","type":"int"},
> {"name":"PAGE_ID","type":"int"}
> ]}
>
> Thanks.
>
> Best,
> Qian
> On Oct 8, 2019, 10:47 AM -0700, nishith agarwal <[email protected]>, wrote:
> > Qian,
> >
> > It looks like the partitionPathField that you specified (session_date) is
> > missing, or the code is unable to grab it from your payload. Is this field a
> > top-level field or a nested field in your schema?
> > (Currently, the HDFSParquetImporter tool looks for your partitionPathField only at
> > the top level, for example genericRecord.get("session_date").)
> >
> > Thanks,
> > Nishith
> >
> >
> > On Tue, Oct 8, 2019 at 10:12 AM Qian Wang <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > Thanks for your response.
> > >
> > > Now I am trying to convert an existing dataset to a Hudi-managed dataset,
> > > using hdfsparquetimport in hudi-cli. I encountered the following error:
> > >
> > > 19/10/08 09:50:59 INFO DAGScheduler: Job 1 failed: countByKey at
> > > HoodieBloomIndex.java:148, took 2.913761 s
> > > 19/10/08 09:50:59 ERROR HDFSParquetImporter: Error occurred.
> > > org.apache.hudi.exception.HoodieUpsertException: Failed to upsert for
> > > commit time 20191008095056
> > >
> > > Caused by: org.apache.hudi.exception.HoodieIOException: partition key is
> > > missing. :session_date
> > >
> > > My command in hudi-cli is as follows:
> > >
> > > hdfsparquetimport --upsert false --srcPath /path/to/source --targetPath
> > > /path/to/target --tableName xxx --tableType COPY_ON_WRITE --rowKeyField
> > > _row_key --partitionPathField session_date --parallelism 1500
> > > --schemaFilePath /path/to/avro/schema --format parquet --sparkMemory 6g
> > > --retry 2
> > >
> > > Could you please tell me how to solve this problem? Thanks.
> > >
> > > Best,
> > > Qian
> > > On Oct 6, 2019, 9:15 AM -0700, Qian Wang <[email protected]>, wrote:
> > > > Hi,
> > > >
> > > > I have some questions about using Hudi in my company's prod env:
> > > >
> > > > 1. When I migrate history tables in HDFS, I tried hudi-cli and the
> > > > HDFSParquetImporter tool. How can I specify Spark parameters for this tool,
> > > > such as the YARN queue?
> > > > 2. Hudi needs to write metadata to Hive, and it uses HiveMetastoreClient
> > > > and Hive JDBC. What should I do if Hive has Kerberos authentication enabled?
> > > >
> > > > Thanks.
> > > >
> > > > Best,
> > > > Qian
> > >
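[Editor's note] The lookup Nishith describes, genericRecord.get("session_date"), is top-level and case-sensitive, so it can come back empty even when the data has no nulls: note that the schema in the thread declares SESSION_DATE in upper case while the command passes session_date in lower case. A minimal Python sketch of that failure mode, using a plain dict as a stand-in for an Avro GenericRecord (the records and helper below are hypothetical, for illustration only):

```python
def extract_partition_path(record, field):
    """Mimic a top-level, case-sensitive genericRecord.get(field) lookup."""
    value = record.get(field)
    if value is None:
        # Mirrors the error message seen in the thread.
        raise ValueError("partition key is missing. :" + field)
    return str(value)

# Record shaped like the posted schema: the field exists, but in upper case.
flat = {"SESSION_DATE": "2019-10-08", "SITE_ID": 0}
# Record where the field is nested one level down.
nested = {"session": {"session_date": "2019-10-08"}}

for record in (flat, nested):
    try:
        extract_partition_path(record, "session_date")
    except ValueError as exc:
        print(exc)  # partition key is missing. :session_date
```

Both lookups fail for the same reason the importer reports: a top-level `get` never sees a differently-cased or nested field, so checking for nulls alone will not surface the problem.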
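[Editor's note] Separately, the schema as pasted in the thread contains word-processor "smart quotes" around several field names, which would make it invalid both to `json.loads` and to Avro's schema parser. A quick stdlib check that the cleaned-up schema text parses and lists the expected field names:

```python
import json

# The Avro schema from the thread, with plain ASCII quotes throughout.
schema_text = '''{"type":"record","name":"PathExtractData","doc":"Path event extract fact data","fields":[
{"name":"SESSION_DATE","type":"string"},
{"name":"SITE_ID","type":"int"},
{"name":"GUID","type":"string"},
{"name":"SESSION_KEY","type":"long"},
{"name":"USER_ID","type":"string"},
{"name":"STEP","type":"int"},
{"name":"PAGE_ID","type":"int"}
]}'''

schema = json.loads(schema_text)
field_names = [f["name"] for f in schema["fields"]]
print(field_names[0])  # SESSION_DATE
```

Note that "session_date" does not appear in `field_names`; only "SESSION_DATE" does, which is consistent with the case-sensitivity issue discussed above.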
