I am so sorry to bother you. It worked; there was a typo on my end. My apologies.
On Wed, Aug 19, 2020 at 7:01 PM tanu dua <[email protected]> wrote:
> Hi Gary,
> I am getting an exception while loading Hudi tables using a glob path.
> Does it work? Has anyone tried it? If I use the path without {}, it works.
>
> Caused by: org.apache.spark.sql.AnalysisException: Path does not exist:
> file:/C:/Hudi/data/co/A/2019/{3,4};
>     at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary$1.apply(DataSource.scala:552)
>     at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary$1.apply(DataSource.scala:545)
>
> On Tue, Jun 30, 2020 at 7:39 PM Tanuj <[email protected]> wrote:
>
>> Thanks a lot. I understand now.
>>
>> On 2020/06/27 02:45:52, Gary Li <[email protected]> wrote:
>>> Hi,
>>>
>>> If you use a year=xxx/month=xxx folder structure, you can use
>>>
>>>     Dataset<Row> df = spark.read().format("hudi").schema(schema).load(<base_path> + <table_name>);
>>>
>>> Without a glob postfix, Spark can automatically discover the partition
>>> information, just like it does for regular parquet files:
>>> https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#partition-discovery
>>>
>>> If you use something like 2020/06 instead, you may need to build the
>>> glob string and pass it to load() to skip the unnecessary partitions, e.g.
>>>
>>>     .load(<base_path> + <table_name> + "2020/{05,06}")
>>>
>>> Or list one parquet file from each of the partitions and use a map
>>> function to load one row from each path with a limit clause.
>>>
>>> On Fri, Jun 26, 2020 at 8:33 AM Tanuj <[email protected]> wrote:
>>>
>>>> Hi,
>>>> We have created a table with a partition depth of 2, as year/month.
>>>> We need to read data from Hudi in a Spark Streaming layer, where we
>>>> get a batch of, say, 10 rows that we then need to use to read from
>>>> Hudi. We are reading it like this:
>>>>
>>>>     // Read from Hudi
>>>>     Dataset<Row> df = spark.read().format("hudi").schema(schema).load(<base_path> + <table_name> + "/*/*");
>>>>
>>>>     // Apply filters
>>>>     df = df.filter(df.col("year").isin(<vals>))
>>>>            .filter(df.col("month").isin(<vals>))
>>>>            .filter(df.col("id").isin(<vals>));
>>>>
>>>> Is this the best way to read the data? Will Hudi take care of reading
>>>> only from the matching partitions, or do we need to take care of that
>>>> ourselves? For example, if I need to read just one row, we can build
>>>> the full path and read it, which will read the parquet file from that
>>>> partition quickly, but here our requirement is to read data from
>>>> multiple partitions.
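
A minimal sketch of Gary's "build the glob string" suggestion, in the same
Java Spark API used in the thread. The buildGlobPath helper, the example base
path and table name, and the zero-padded month folder names are illustrative
assumptions, not details from the thread.

    import java.util.List;
    import java.util.stream.Collectors;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class HudiGlobRead {

        // Build a path like <base>/<table>/2019/{03,04} so Spark lists only
        // the month partitions needed for the current micro-batch.
        static String buildGlobPath(String basePath, String tableName,
                                    String year, List<Integer> months) {
            String monthGlob = months.stream()
                    .map(m -> String.format("%02d", m)) // match zero-padded folder names
                    .collect(Collectors.joining(",", "{", "}"));
            return basePath + "/" + tableName + "/" + year + "/" + monthGlob;
        }

        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("hudi-glob-read")
                    .getOrCreate();

            // Hypothetical locations; substitute your own base path and table.
            String globPath = buildGlobPath("file:///tmp/hudi", "my_table", "2019", List.of(3, 4));
            // -> file:///tmp/hudi/my_table/2019/{03,04}

            Dataset<Row> df = spark.read().format("hudi").load(globPath);

            // Row-level pruning still happens with ordinary filters.
            df.filter(df.col("id").isin("a", "b")).show();
        }
    }

Spark expands the {..,..} alternation through Hadoop path globbing before it
reads anything, so the pattern must resolve to directories that actually
exist; if it matches nothing, Spark raises the same "Path does not exist"
AnalysisException quoted above, consistent with the typo resolution at the
top of the thread.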

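For comparison, a sketch of the partition-discovery route Gary mentions
first, assuming the table had instead been written with hive-style folder
names (year=2019/month=03/...). The path and the filter values are
hypothetical.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class HudiPartitionDiscoveryRead {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("hudi-partition-discovery")
                    .getOrCreate();

            // No glob postfix: with year=/month= directories, Spark discovers
            // year and month as partition columns, as it does for plain parquet.
            Dataset<Row> df = spark.read()
                    .format("hudi")
                    .load("file:///tmp/hudi/my_table"); // hypothetical base path + table

            // Filters on discovered partition columns are pushed down, so only
            // the matching year=/month= directories are scanned.
            df.filter(df.col("year").equalTo(2019))
              .filter(df.col("month").isin(3, 4))
              .show();
        }
    }

This avoids building glob strings at all, at the cost of committing to
hive-style partition folder names when the table is written.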