Hi Qian

I think you are using the default COW (Copy On Write) table type. In your 
previous run you appear to have written 44G of data, and when you did a second 
write another 44G was written, which doubled the size to 88G.
Could you please clear all the data out of the target folder, start fresh, and 
then report back the size?
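If it helps, below is a rough sketch of what a clean re-import could look like 
from spark-shell with the Hudi bundle on the classpath. The paths, table name 
and field names are only placeholders taken from your earlier hdfsparquetimport 
command, so adjust them to your setup:

import org.apache.spark.sql.SaveMode

// Read the original parquet data (placeholder path).
val df = spark.read.parquet("/path/to/source")

df.write
  .format("hudi")  // "org.apache.hudi" on older releases
  .option("hoodie.table.name", "xxx")
  .option("hoodie.datasource.write.operation", "bulk_insert")
  .option("hoodie.datasource.write.recordkey.field", "_row_key")
  .option("hoodie.datasource.write.partitionpath.field", "session_date")
  .option("hoodie.datasource.write.precombine.field", "session_date")  // assumption: any ordering field from your schema
  .mode(SaveMode.Overwrite)  // Overwrite clears the target path so you start from a clean state
  .save("/path/to/target")

Writing once into an empty target should leave the Hudi dataset roughly the 
size of the source parquet, plus the per-record metadata columns.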
On Oct 12 2019, at 12:04 am, nishith agarwal <[email protected]> wrote:
> Qian,
>
> These columns will be present for every Hudi dataset. These columns are
> used to provide incremental queries on Hudi datasets so you can get
> changelogs and build incremental ETLs/pipelines.
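>
> For instance, an incremental pull might look roughly like the sketch below
> from spark-shell (the exact option keys vary a bit across Hudi versions, so
> treat the names as indicative rather than exact, and substitute your own
> target path):
>
> // Pull only records committed after this instant (placeholder value).
> val beginTime = "20191001000000"
>
> val incDf = spark.read
>   .format("hudi")
>   .option("hoodie.datasource.query.type", "incremental")
>   .option("hoodie.datasource.read.begin.instanttime", beginTime)
>   .load("/path/to/target")
>
> // The _hoodie_* columns tell you which commit produced each record.
> incDf.select("_hoodie_commit_time", "_hoodie_record_key").show(false)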
>
> Thanks,
> Nishith
>
> On Fri, Oct 11, 2019 at 4:00 PM Qian Wang <[email protected]> wrote:
> > Hi,
> > I found that after I converted to a Hudi-managed dataset, several columns
> > were added:
> >
> > _hoodie_commit_time, _hoodie_commit_seqno, _hoodie_record_key,
> > _hoodie_partition_path, _hoodie_file_name
> >
> > Are these columns added to the table permanently, or only temporarily? Thanks.
> > Best,
> > Qian
> > On Oct 11, 2019, 3:39 PM -0700, Qian Wang <[email protected]>, wrote:
> > > Hi,
> > >
> > > I have successfully converted the parquet data into a Hudi-managed
> > > dataset. However, the previous data size was about 44G, and after the
> > > conversion by Hudi it is about 88G. Why did the data size almost double?
> > >
> > > Best,
> > > Qian
> > > On Oct 11, 2019, 1:57 PM -0700, Qian Wang <[email protected]>, wrote:
> > > > Hi Kabeer,
> > > >
> > > > Thanks for your detailed explanation. I will try it again and will
> > > > update you with the result.
> > > >
> > > > Best,
> > > > Qian
> > > > On Oct 11, 2019, 1:49 PM -0700, Kabeer Ahmed <[email protected]>, wrote:
> > > > > Hi Qian,
> > > > >
> > > > > If there are no nulls in the data, then most likely it is an issue with
> > > > > the data types being stored. I have seen this issue again and again, and
> > > > > in the most recent case it was due to me storing a double value when I
> > > > > had actually declared the schema as IntegerType. I can reproduce this
> > > > > with an example to prove the point, but I think you should look into
> > > > > your data.
> > > > > If possible I would recommend you run something like:
> > > > > https://stackoverflow.com/questions/33270907/how-to-validate-contents-of-spark-dataframe
> > > > > This will show you if there is any value in any column that is against
> > > > > the declared schema type. And when you fix that, the errors will go away.
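> > > > > As a rough illustration (the column name and type below are hypothetical,
> > > > > so swap in the ones from your own schema), you can count the values that
> > > > > do not survive a cast to the declared type:
> > > > >
> > > > > import org.apache.spark.sql.functions.col
> > > > >
> > > > > val df = spark.read.parquet("/path/to/source")
> > > > > // Values that are present but cannot be cast to the declared int type.
> > > > > val bad = df.filter(col("SITE_ID").isNotNull && col("SITE_ID").cast("int").isNull)
> > > > > println(bad.count())
> > > > > bad.show(20, false)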
> > > > > Keep us posted on how you get along with this.
> > > > > Thanks
> > > > > Kabeer.
> > > > >
> > > > > On Oct 9 2019, at 12:24 am, nishith agarwal <[email protected]> wrote:
> > > > > > Hmm, AVRO is case-sensitive but I've not had issues reading fields from
> > > > > > GenericRecords with lower or upper case, so I'm not 100% confident on
> > > > > > what the resolution for lower vs upper case is. Have you tried using the
> > > > > > partitionpath field names in upper case (in case your schema field is
> > > > > > also upper case)?
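> > > > > > A quick way to confirm the exact case is to print the field names
> > > > > > straight from the parquet schema (just a sketch, using the source path
> > > > > > from your command):
> > > > > >
> > > > > > val schema = spark.read.parquet("/path/to/source").schema
> > > > > > // Look for SESSION_DATE vs session_date here.
> > > > > > schema.fieldNames.foreach(println)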
> > > > > >
> > > > > > -Nishith
> > > > > > On Tue, Oct 8, 2019 at 4:00 PM Qian Wang <[email protected]> wrote:
> > > > > > > Hi Nishith,
> > > > > > > I have checked the data and there are no nulls in that field. Is
> > > > > > > there any other possible cause for this error?
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Qian
> > > > > > > On Oct 8, 2019, 10:55 AM -0700, Qian Wang <[email protected]>, wrote:
> > > > > > > > Hi Nishith,
> > > > > > > >
> > > > > > > > Thanks for your response.
> > > > > > > > The session_date is one field in my original dataset. I have some
> > > > > > > > questions about the schema parameter:
> > > > > > > >
> > > > > > > > 1. Do I need to create the target table?
> > > > > > > > 2. My source data is in Parquet format, so why does the tool need
> > > > > > > > the schema file as a parameter?
> > > > > > > > 3. Can I use a schema file in Avro format?
> > > > > > > >
> > > > > > > > The schema looks like:
> > > > > > > > {"type":"record","name":"PathExtractData","doc":"Path event
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> > extract fact
> > > > > > > data”,”fields”:[
> > > > > > > > {“name”:”SESSION_DATE”,”type”:”string”},
> > > > > > > > {“name”:”SITE_ID”,”type”:”int”},
> > > > > > > > {“name”:”GUID”,”type”:”string”},
> > > > > > > > {“name”:”SESSION_KEY”,”type”:”long”},
> > > > > > > > {“name”:”USER_ID”,”type”:”string”},
> > > > > > > > {“name”:”STEP”,”type”:”int”},
> > > > > > > > {“name”:”PAGE_ID”,”type”:”int”}
> > > > > > > > ]}
> > > > > > > >
> > > > > > > > Thanks.
> > > > > > > > Best,
> > > > > > > > Qian
> > > > > > > > On Oct 8, 2019, 10:47 AM -0700, nishith agarwal <[email protected]>, wrote:
> > > > > > > > > Qian,
> > > > > > > > >
> > > > > > > > > It looks like the partitionPathField that you specified
> > > > > > > > > (session_date) is missing or the code is unable to grab it from
> > > > > > > > > your payload. Is this field a top-level field or a nested field
> > > > > > > > > in your schema? (Currently, the HDFSImporterTool looks for your
> > > > > > > > > partitionPathField only at the top level, for example
> > > > > > > > > genericRecord.get("session_date").)
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Nishith
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Tue, Oct 8, 2019 at 10:12 AM Qian Wang <[email protected]> wrote:
> > > > > > > > > > Hi,
> > > > > > > > > > Thanks for your response.
> > > > > > > > > > Now I tried to convert an existing dataset to a Hudi-managed
> > > > > > > > > > dataset using hdfsparquetimport in hudi-cli, and I encountered
> > > > > > > > > > the following error:
> > > > > > > > > >
> > > > > > > > > > 19/10/08 09:50:59 INFO DAGScheduler: Job 1 failed: countByKey at
> > > > > > > > > > HoodieBloomIndex.java:148, took 2.913761 s
> > > > > > > > > > 19/10/08 09:50:59 ERROR HDFSParquetImporter: Error occurred.
> > > > > > > > > > org.apache.hudi.exception.HoodieUpsertException: Failed to upsert
> > > > > > > > > > for commit time 20191008095056
> > > > > > > > > >
> > > > > > > > > > Caused by: org.apache.hudi.exception.HoodieIOException: partition
> > > > > > > > > > key is missing. :session_date
> > > > > > > > > >
> > > > > > > > > > My command in hudi-cli is as follows:
> > > > > > > > > > hdfsparquetimport --upsert false --srcPath /path/to/source
> > > > > > > > > > --targetPath /path/to/target --tableName xxx --tableType COPY_ON_WRITE
> > > > > > > > > > --rowKeyField _row_key --partitionPathField session_date
> > > > > > > > > > --parallelism 1500 --schemaFilePath /path/to/avro/schema
> > > > > > > > > > --format parquet --sparkMemory 6g --retry 2
> > > > > > > > > >
> > > > > > > > > > Could you please tell me how to solve this problem? Thanks.
> > > > > > > > > > Best,
> > > > > > > > > > Qian
> > > > > > > > > > On Oct 6, 2019, 9:15 AM -0700, Qian Wang <[email protected]>, wrote:
> > > > > > > > > > > Hi,
> > > > > > > > > > >
> > > > > > > > > > > I have some questions when I try to use Hudi in my company’s
> > > > > > > > > > > prod env:
> > > > > > > > > > >
> > > > > > > > > > > 1. When I migrate the history tables in HDFS, I tried to use
> > > > > > > > > > > hudi-cli and the HDFSParquetImporter tool. How can I specify
> > > > > > > > > > > Spark parameters in this tool, such as the Yarn queue, etc.?
> > > > > > > > > > > 2. Hudi needs to write metadata to Hive, and it uses
> > > > > > > > > > > HiveMetastoreClient and HiveJDBC. What should I do if Hive has
> > > > > > > > > > > Kerberos authentication enabled?
> > > > > > > > > > >
> > > > > > > > > > > Thanks.
> > > > > > > > > > > Best,
> > > > > > > > > > > Qian
