[Question] HoodieROTablePathFilter not accept dir path

2020-09-08 Thread Raymond Xu
https://github.com/apache/hudi/blob/9bcd3221fd440081dbae70e89d08539c3b484862/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieROTablePathFilter.java#L120-L121

As shown in the 2 lines above, it does not seem to work with directory
Path.
It should work for both `new Path("base/partition")` and `new
Path("base/partition/")`, but it only works for the former case. In the
latter case, `folder` will be "base/partition" and `path` will be
"base/partition/", which will always result in returning false.
A potential bug?


Re: [Question] HoodieROTablePathFilter not accept dir path

2020-09-08 Thread Balaji Varadarajan
 Hi Raymond,
IIRC, we need to give a blob path to make  HoodieROTablePathFilter to work 
correctly (e.g: "base/partition/*"). The path-cache is at partition level and 
not at table level so we need to extract the partition-path correctly to be 
used as look-up key. To extract partition-path, the challenge here is "Path" 
type does not have APIs to quickly figure if a path is a directory or not and 
we should avoid making RPC calls here. 
Thanks,Balaji.V
On Tuesday, September 8, 2020, 09:56:49 AM PDT, Raymond Xu 
 wrote:  
 
 
https://github.com/apache/hudi/blob/9bcd3221fd440081dbae70e89d08539c3b484862/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieROTablePathFilter.java#L120-L121

As shown in the 2 lines above, it does not seem to work with directory
Path.
It should work for both `new Path("base/partition")` and `new
Path("base/partition/")`, but it only works for the former case. In the
latter case, `folder` will be "base/partition" and `path` will be
"base/partition/", which will always result in returning false.
A potential bug?
  

Re: [Question] HoodieROTablePathFilter not accept dir path

2020-09-09 Thread Raymond Xu
Hi Balaji, not sure if I fully get it.
I'm attempting to refer to this test case
https://github.com/apache/hudi/blob/9bcd3221fd440081dbae70e89d08539c3b484862/hudi-hadoop-mr/src/test/java/org/apache/hudi/hadoop/TestHoodieROTablePathFilter.java#L63-L65

where a partition path is supposed to be accepted.
If I change L64 to
Path partitionPath = new Path(Paths.get(basePath, "2017/01/01").toUri());

Then it resulted in not being accepted due to partitionPath ending with `/`
(a directory path). To me, this seems to be a corner case not being
covered. Could you kindly confirm the expectation please? Thanks.

On Tue, Sep 8, 2020 at 8:58 PM Balaji Varadarajan
 wrote:

>  Hi Raymond,
> IIRC, we need to give a blob path to make  HoodieROTablePathFilter to work
> correctly (e.g: "base/partition/*"). The path-cache is at partition level
> and not at table level so we need to extract the partition-path correctly
> to be used as look-up key. To extract partition-path, the challenge here is
> "Path" type does not have APIs to quickly figure if a path is a directory
> or not and we should avoid making RPC calls here.
> Thanks,Balaji.V
> On Tuesday, September 8, 2020, 09:56:49 AM PDT, Raymond Xu <
> xu.shiyan.raym...@gmail.com> wrote:
>
>
> https://github.com/apache/hudi/blob/9bcd3221fd440081dbae70e89d08539c3b484862/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieROTablePathFilter.java#L120-L121
>
> As shown in the 2 lines above, it does not seem to work with directory
> Path.
> It should work for both `new Path("base/partition")` and `new
> Path("base/partition/")`, but it only works for the former case. In the
> latter case, `folder` will be "base/partition" and `path` will be
> "base/partition/", which will always result in returning false.
> A potential bug?
>


Re: [Question] HoodieROTablePathFilter not accept dir path

2020-09-11 Thread Balaji Varadarajan
 Hi Raymond,

HoodieROPathFilter is supposed to return true only for file matches belonging 
to latest version if the path refers to a Hudi partition or if the path refers 
to a non-hoodie partition or dataset.  
I looked at the test-case you referred. It only works because the path filter 
wrongly assumes it is a non-hoodie path. You can run this in debug mode to see 
the code path.  From the usage perspective, this is used only from Spark 
(InMemoryFileIndex) where only the files are passed to this filter. So, I 
wouldn't classify this as a bug. But, it makes sense to make it consistent for 
both cases 
Balaji.V

Thanks,Balaji.VOn Wednesday, September 9, 2020, 07:45:09 AM PDT, Raymond Xu 
 wrote:  
 
 Hi Balaji, not sure if I fully get it.
I'm attempting to refer to this test case
https://github.com/apache/hudi/blob/9bcd3221fd440081dbae70e89d08539c3b484862/hudi-hadoop-mr/src/test/java/org/apache/hudi/hadoop/TestHoodieROTablePathFilter.java#L63-L65

where a partition path is supposed to be accepted.
If I change L64 to
Path partitionPath = new Path(Paths.get(basePath, "2017/01/01").toUri());

Then it resulted in not being accepted due to partitionPath ending with `/`
(a directory path). To me, this seems to be a corner case not being
covered. Could you kindly confirm the expectation please? Thanks.

On Tue, Sep 8, 2020 at 8:58 PM Balaji Varadarajan
 wrote:

>  Hi Raymond,
> IIRC, we need to give a blob path to make  HoodieROTablePathFilter to work
> correctly (e.g: "base/partition/*"). The path-cache is at partition level
> and not at table level so we need to extract the partition-path correctly
> to be used as look-up key. To extract partition-path, the challenge here is
> "Path" type does not have APIs to quickly figure if a path is a directory
> or not and we should avoid making RPC calls here.
> Thanks,Balaji.V
>    On Tuesday, September 8, 2020, 09:56:49 AM PDT, Raymond Xu <
> xu.shiyan.raym...@gmail.com> wrote:
>
>
> https://github.com/apache/hudi/blob/9bcd3221fd440081dbae70e89d08539c3b484862/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieROTablePathFilter.java#L120-L121
>
> As shown in the 2 lines above, it does not seem to work with directory
> Path.
> It should work for both `new Path("base/partition")` and `new
> Path("base/partition/")`, but it only works for the former case. In the
> latter case, `folder` will be "base/partition" and `path` will be
> "base/partition/", which will always result in returning false.
> A potential bug?
>
  

Re: [Question] HoodieROTablePathFilter not accept dir path

2020-09-12 Thread Raymond Xu
Understood. Thank you, Balaji, for the clarification.

On Fri, Sep 11, 2020 at 2:17 PM Balaji Varadarajan
 wrote:

>  Hi Raymond,
>
> HoodieROPathFilter is supposed to return true only for file matches
> belonging to latest version if the path refers to a Hudi partition or if
> the path refers to a non-hoodie partition or dataset.
> I looked at the test-case you referred. It only works because the path
> filter wrongly assumes it is a non-hoodie path. You can run this in debug
> mode to see the code path.  From the usage perspective, this is used only
> from Spark (InMemoryFileIndex) where only the files are passed to this
> filter. So, I wouldn't classify this as a bug. But, it makes sense to make
> it consistent for both cases
> Balaji.V
>
> Thanks,Balaji.VOn Wednesday, September 9, 2020, 07:45:09 AM PDT,
> Raymond Xu  wrote:
>
>  Hi Balaji, not sure if I fully get it.
> I'm attempting to refer to this test case
>
> https://github.com/apache/hudi/blob/9bcd3221fd440081dbae70e89d08539c3b484862/hudi-hadoop-mr/src/test/java/org/apache/hudi/hadoop/TestHoodieROTablePathFilter.java#L63-L65
>
> where a partition path is supposed to be accepted.
> If I change L64 to
> Path partitionPath = new Path(Paths.get(basePath, "2017/01/01").toUri());
>
> Then it resulted in not being accepted due to partitionPath ending with `/`
> (a directory path). To me, this seems to be a corner case not being
> covered. Could you kindly confirm the expectation please? Thanks.
>
> On Tue, Sep 8, 2020 at 8:58 PM Balaji Varadarajan
>  wrote:
>
> >  Hi Raymond,
> > IIRC, we need to give a blob path to make  HoodieROTablePathFilter to
> work
> > correctly (e.g: "base/partition/*"). The path-cache is at partition level
> > and not at table level so we need to extract the partition-path correctly
> > to be used as look-up key. To extract partition-path, the challenge here
> is
> > "Path" type does not have APIs to quickly figure if a path is a directory
> > or not and we should avoid making RPC calls here.
> > Thanks,Balaji.V
> >On Tuesday, September 8, 2020, 09:56:49 AM PDT, Raymond Xu <
> > xu.shiyan.raym...@gmail.com> wrote:
> >
> >
> >
> https://github.com/apache/hudi/blob/9bcd3221fd440081dbae70e89d08539c3b484862/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieROTablePathFilter.java#L120-L121
> >
> > As shown in the 2 lines above, it does not seem to work with directory
> > Path.
> > It should work for both `new Path("base/partition")` and `new
> > Path("base/partition/")`, but it only works for the former case. In the
> > latter case, `folder` will be "base/partition" and `path` will be
> > "base/partition/", which will always result in returning false.
> > A potential bug?
> >
>