[Question] HoodieROTablePathFilter not accept dir path
https://github.com/apache/hudi/blob/9bcd3221fd440081dbae70e89d08539c3b484862/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieROTablePathFilter.java#L120-L121 As shown in the 2 lines above, it does not seem to work with directory Path. It should work for both `new Path("base/partition")` and `new Path("base/partition/")`, but it only works for the former case. In the latter case, `folder` will be "base/partition" and `path` will be "base/partition/", which will always result in returning false. A potential bug?
Re: [Question] HoodieROTablePathFilter not accept dir path
Hi Raymond, IIRC, we need to give a blob path to make HoodieROTablePathFilter to work correctly (e.g: "base/partition/*"). The path-cache is at partition level and not at table level so we need to extract the partition-path correctly to be used as look-up key. To extract partition-path, the challenge here is "Path" type does not have APIs to quickly figure if a path is a directory or not and we should avoid making RPC calls here. Thanks,Balaji.V On Tuesday, September 8, 2020, 09:56:49 AM PDT, Raymond Xu wrote: https://github.com/apache/hudi/blob/9bcd3221fd440081dbae70e89d08539c3b484862/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieROTablePathFilter.java#L120-L121 As shown in the 2 lines above, it does not seem to work with directory Path. It should work for both `new Path("base/partition")` and `new Path("base/partition/")`, but it only works for the former case. In the latter case, `folder` will be "base/partition" and `path` will be "base/partition/", which will always result in returning false. A potential bug?
Re: [Question] HoodieROTablePathFilter not accept dir path
Hi Balaji, not sure if I fully get it. I'm attempting to refer to this test case https://github.com/apache/hudi/blob/9bcd3221fd440081dbae70e89d08539c3b484862/hudi-hadoop-mr/src/test/java/org/apache/hudi/hadoop/TestHoodieROTablePathFilter.java#L63-L65 where a partition path is supposed to be accepted. If I change L64 to Path partitionPath = new Path(Paths.get(basePath, "2017/01/01").toUri()); Then it resulted in not being accepted due to partitionPath ending with `/` (a directory path). To me, this seems to be a corner case not being covered. Could you kindly confirm the expectation please? Thanks. On Tue, Sep 8, 2020 at 8:58 PM Balaji Varadarajan wrote: > Hi Raymond, > IIRC, we need to give a blob path to make HoodieROTablePathFilter to work > correctly (e.g: "base/partition/*"). The path-cache is at partition level > and not at table level so we need to extract the partition-path correctly > to be used as look-up key. To extract partition-path, the challenge here is > "Path" type does not have APIs to quickly figure if a path is a directory > or not and we should avoid making RPC calls here. > Thanks,Balaji.V > On Tuesday, September 8, 2020, 09:56:49 AM PDT, Raymond Xu < > xu.shiyan.raym...@gmail.com> wrote: > > > https://github.com/apache/hudi/blob/9bcd3221fd440081dbae70e89d08539c3b484862/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieROTablePathFilter.java#L120-L121 > > As shown in the 2 lines above, it does not seem to work with directory > Path. > It should work for both `new Path("base/partition")` and `new > Path("base/partition/")`, but it only works for the former case. In the > latter case, `folder` will be "base/partition" and `path` will be > "base/partition/", which will always result in returning false. > A potential bug? >
Re: [Question] HoodieROTablePathFilter not accept dir path
Hi Raymond, HoodieROPathFilter is supposed to return true only for file matches belonging to latest version if the path refers to a Hudi partition or if the path refers to a non-hoodie partition or dataset. I looked at the test-case you referred. It only works because the path filter wrongly assumes it is a non-hoodie path. You can run this in debug mode to see the code path. From the usage perspective, this is used only from Spark (InMemoryFileIndex) where only the files are passed to this filter. So, I wouldn't classify this as a bug. But, it makes sense to make it consistent for both cases Balaji.V Thanks,Balaji.VOn Wednesday, September 9, 2020, 07:45:09 AM PDT, Raymond Xu wrote: Hi Balaji, not sure if I fully get it. I'm attempting to refer to this test case https://github.com/apache/hudi/blob/9bcd3221fd440081dbae70e89d08539c3b484862/hudi-hadoop-mr/src/test/java/org/apache/hudi/hadoop/TestHoodieROTablePathFilter.java#L63-L65 where a partition path is supposed to be accepted. If I change L64 to Path partitionPath = new Path(Paths.get(basePath, "2017/01/01").toUri()); Then it resulted in not being accepted due to partitionPath ending with `/` (a directory path). To me, this seems to be a corner case not being covered. Could you kindly confirm the expectation please? Thanks. On Tue, Sep 8, 2020 at 8:58 PM Balaji Varadarajan wrote: > Hi Raymond, > IIRC, we need to give a blob path to make HoodieROTablePathFilter to work > correctly (e.g: "base/partition/*"). The path-cache is at partition level > and not at table level so we need to extract the partition-path correctly > to be used as look-up key. To extract partition-path, the challenge here is > "Path" type does not have APIs to quickly figure if a path is a directory > or not and we should avoid making RPC calls here. > Thanks,Balaji.V > On Tuesday, September 8, 2020, 09:56:49 AM PDT, Raymond Xu < > xu.shiyan.raym...@gmail.com> wrote: > > > https://github.com/apache/hudi/blob/9bcd3221fd440081dbae70e89d08539c3b484862/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieROTablePathFilter.java#L120-L121 > > As shown in the 2 lines above, it does not seem to work with directory > Path. > It should work for both `new Path("base/partition")` and `new > Path("base/partition/")`, but it only works for the former case. In the > latter case, `folder` will be "base/partition" and `path` will be > "base/partition/", which will always result in returning false. > A potential bug? >
Re: [Question] HoodieROTablePathFilter not accept dir path
Understood. Thank you, Balaji, for the clarification. On Fri, Sep 11, 2020 at 2:17 PM Balaji Varadarajan wrote: > Hi Raymond, > > HoodieROPathFilter is supposed to return true only for file matches > belonging to latest version if the path refers to a Hudi partition or if > the path refers to a non-hoodie partition or dataset. > I looked at the test-case you referred. It only works because the path > filter wrongly assumes it is a non-hoodie path. You can run this in debug > mode to see the code path. From the usage perspective, this is used only > from Spark (InMemoryFileIndex) where only the files are passed to this > filter. So, I wouldn't classify this as a bug. But, it makes sense to make > it consistent for both cases > Balaji.V > > Thanks,Balaji.VOn Wednesday, September 9, 2020, 07:45:09 AM PDT, > Raymond Xu wrote: > > Hi Balaji, not sure if I fully get it. > I'm attempting to refer to this test case > > https://github.com/apache/hudi/blob/9bcd3221fd440081dbae70e89d08539c3b484862/hudi-hadoop-mr/src/test/java/org/apache/hudi/hadoop/TestHoodieROTablePathFilter.java#L63-L65 > > where a partition path is supposed to be accepted. > If I change L64 to > Path partitionPath = new Path(Paths.get(basePath, "2017/01/01").toUri()); > > Then it resulted in not being accepted due to partitionPath ending with `/` > (a directory path). To me, this seems to be a corner case not being > covered. Could you kindly confirm the expectation please? Thanks. > > On Tue, Sep 8, 2020 at 8:58 PM Balaji Varadarajan > wrote: > > > Hi Raymond, > > IIRC, we need to give a blob path to make HoodieROTablePathFilter to > work > > correctly (e.g: "base/partition/*"). The path-cache is at partition level > > and not at table level so we need to extract the partition-path correctly > > to be used as look-up key. To extract partition-path, the challenge here > is > > "Path" type does not have APIs to quickly figure if a path is a directory > > or not and we should avoid making RPC calls here. > > Thanks,Balaji.V > >On Tuesday, September 8, 2020, 09:56:49 AM PDT, Raymond Xu < > > xu.shiyan.raym...@gmail.com> wrote: > > > > > > > https://github.com/apache/hudi/blob/9bcd3221fd440081dbae70e89d08539c3b484862/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieROTablePathFilter.java#L120-L121 > > > > As shown in the 2 lines above, it does not seem to work with directory > > Path. > > It should work for both `new Path("base/partition")` and `new > > Path("base/partition/")`, but it only works for the former case. In the > > latter case, `folder` will be "base/partition" and `path` will be > > "base/partition/", which will always result in returning false. > > A potential bug? > > >