Thank you Vinoth and Balaji. As soon as you have a patch, I can get the tests
going on S3 and relay back the results.
Vinoth:
My proficiency with the code is nowhere near yours or Balaji's. I have never
looked in depth at whether calls such as getPartitions() are translated into
filesystem calls.
But as far as I understand S3, even a simple get-partition call that walks
three levels down to retrieve objects results in an S3 connection. Let me give
a gist of what we do in the pseudo-code below.
insertHudiRecords()
{
  // Prepare 1st HUDI DF to write.
  df1.write.format(com.uber.hudi).... // S3 connections increase by 750, in line with the # of partitions we have.

  // Prepare another HUDI DF to write.
  df2.write.format(com.uber.hudi)... // S3 connections are still at the 750 level and another 750 are added on top.
}
// S3 connections are released only when the Spark process running the above routine finishes, i.e. when the application actually exits.
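From what I understand of the Hadoop S3 connectors (an assumption on my part,
not something I have traced through the Hudi code), the connections live that
long because the FileSystem instance is cached JVM-wide. A rough sketch of what
I mean, with a placeholder bucket name:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class FsCacheSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // FileSystem.get() returns a JVM-wide cached instance keyed by scheme,
    // authority and user; the S3 client and its HTTP connection pool live
    // inside that instance.
    FileSystem fs = FileSystem.get(URI.create("s3a://some-bucket/"), conf);
    // Every partition listed during a write can add pooled connections, and the
    // cached instance is only released by an explicit fs.close() or by the
    // shutdown hook Hadoop registers to close all cached filesystems -- which
    // matches what we see: ~750 per write, released only when the app exits.
    fs.close();
  }
}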
Thank you for all your responses.
On Mar 25 2019, at 4:15 pm, [email protected] wrote:
> +1, incremental cleaning is scheduled work. I will be working on this
> immediately after HUDI-1.
> Balaji.V
>
> On Sunday, March 24, 2019, 7:42:03 PM PDT, Vinoth Chandar
> <[email protected]> wrote:
>
> Hi Kabeer,
> You are right. HUDI-1 alone won't be sufficient; we need to do a follow-on.
> IIRC this is already planned work (Balaji?).
> I filed https://issues.apache.org/jira/browse/HUDI-80 to separate this from
> HUDI-1.
>
> On to the issue you are facing: it seems like the connections to S3 keep
> hanging around? I don't think cleaning actually opens any files; it simply
> lists and deletes. We could call fs.close(), which would probably shut the
> connections down, but we need to think that through more, since filesystem
> caching is a tricky issue. I filed https://issues.apache.org/jira/browse/HUDI-81
> separately to track this. If you can help me track the connections to S3
> etc., I can take a stab and maybe we can test the patch in your environment?
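> To make the caching concern concrete, here is a rough sketch of my worry
> (placeholder bucket, not actual Hudi code): closing the cached instance would
> also close it for every other user in the same JVM, whereas an uncached
> instance owned by the cleaner could be closed safely.
>
> import java.net.URI;
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
>
> public class FsCloseSketch {
>   public static void main(String[] args) throws Exception {
>     Configuration conf = new Configuration();
>     URI uri = URI.create("s3a://some-bucket/");
>     FileSystem a = FileSystem.get(uri, conf);
>     FileSystem b = FileSystem.get(uri, conf); // same cached instance as 'a'
>     a.close();                                // 'b' is now closed too -- this is the tricky part
>
>     // Safer: an uncached instance that the cleaner alone owns and closes.
>     try (FileSystem cleanerFs = FileSystem.newInstance(uri, conf)) {
>       cleanerFs.listStatus(new Path("/")); // list + delete would go here
>     }
>   }
> }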
>
> We can work on the ticket. Please share your JIRA id so I can add you as a
> contributor, which gives you commenting rights etc. on JIRA.
>
> Thanks
> Vinoth
>
>
>
> On Sun, Mar 24, 2019 at 2:11 PM Kabeer Ahmed <[email protected]> wrote:
> > Hi Vinoth,
> > Thank you for your response. I thought of reducing the cleaner parallelism,
> > which is min(200, table_partitions). But it wouldn't have an effect: regardless
> > of the parallelism, there will be an attempt to scan all files (reduced
> > parallelism might merely slow the process down).
> > So, as stated, on a table with 750+ partitions I did notice that connections
> > would increase, and I have now been forced to keep the S3 connection limit
> > at 5k due to this issue.
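> > (For reference, the 5k is the S3 HTTP connection pool size. Assuming the S3A
> > connector -- and I am going from memory on the exact key name -- it is set on
> > the Hadoop configuration roughly like this:)
> >
> > import org.apache.hadoop.conf.Configuration;
> >
> > public class S3ConnectionLimitSketch {
> >   public static void main(String[] args) {
> >     Configuration hadoopConf = new Configuration();
> >     // Assumed key: fs.s3a.connection.maximum (max pooled connections per S3A filesystem).
> >     hadoopConf.set("fs.s3a.connection.maximum", "5000");
> >   }
> > }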
> > I also looked into the brief description of the JIRA:
> > https://issues.apache.org/jira/browse/HUDI-1.
> > This is a very nice optimisation to have, but I don't think it will help
> > alleviate the concerns on S3. On HDFS, this JIRA will definitely help reduce
> > the # of NameNode connections, but S3 objects will still need to be opened to
> > clear them, so the problem will not go away.
> > I think the effective work has to be along the lines of reworking how the
> > partitions are cleaned in the routine below:
> >
> > // File: HoodieCopyOnWriteTable.java
> > public List<HoodieCleanStat> clean(JavaSparkContext jsc) {
> >   try {
> >     FileSystem fs = getMetaClient().getFs();
> >     List<String> partitionsToClean = FSUtils
> >         .getAllPartitionPaths(fs, getMetaClient().getBasePath(),
> >             config.shouldAssumeDatePartitioning());
> >     logger.info("Partitions to clean up : " + partitionsToClean + ", with policy " + config
> >         .getCleanerPolicy());
> >     if (partitionsToClean.isEmpty()) {
> >       logger.info("Nothing to clean here mom. It is already clean");
> >       return Collections.emptyList();
> >     }
> >     return cleanPartitionPaths(partitionsToClean, jsc);
> >   } catch (IOException e) {
> >     throw new HoodieIOException("Failed to clean up after commit", e);
> >   }
> > }
> > In the above routine, none of the connections that get opened are closed. I
> > think the work should be along the lines of closing the connections in this
> > routine after the cleaning operation (i.e. adding file-close logic that is
> > executed in parallel for every file opened by the Spark executors), as in the
> > sketch below.
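> > (A hypothetical sketch only, not a tested patch: FileSystem.newInstance() gives
> > an instance that bypasses the FileSystem cache, so the cleaner can close it at
> > the end without touching the cached filesystem the rest of the job relies on.)
> >
> > public List<HoodieCleanStat> clean(JavaSparkContext jsc) {
> >   FileSystem cachedFs = getMetaClient().getFs();
> >   // Uncached copy owned by the cleaner; auto-closed when this method returns.
> >   try (FileSystem fs = FileSystem.newInstance(cachedFs.getUri(), cachedFs.getConf())) {
> >     List<String> partitionsToClean = FSUtils
> >         .getAllPartitionPaths(fs, getMetaClient().getBasePath(),
> >             config.shouldAssumeDatePartitioning());
> >     if (partitionsToClean.isEmpty()) {
> >       return Collections.emptyList();
> >     }
> >     return cleanPartitionPaths(partitionsToClean, jsc);
> >   } catch (IOException e) {
> >     throw new HoodieIOException("Failed to clean up after commit", e);
> >   }
> > }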
> >
> > Please feel free to correct me if you think I have goofed up somewhere.
> > Thanks
> > Kabeer.
> >
> > PS: There is so much going on, and I need to make progress with the work at
> > hand. Otherwise I would have loved to spend the time and send a PR.
> > On Mar 24 2019, at 7:04 am, Vinoth Chandar <[email protected]> wrote:
> > > Hi Kabeer,
> > >
> > > No need to apologize :)
> > > The mailing list works a lot better for reporting issues. We can respond much
> > > quicker, since it's not buried with all the other GitHub events.
> > >
> > > On what you saw, the cleaner does list all partitions currently. Have you
> > > tried reducing cleaner parallelism if limiting connections is your goal?
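> > > For reference, a rough sketch of how that knob can be set on the writer (the
> > > exact key and format names here are from memory -- please verify against the
> > > config classes):
> > >
> > > import org.apache.spark.sql.Dataset;
> > > import org.apache.spark.sql.Row;
> > > import org.apache.spark.sql.SaveMode;
> > >
> > > public class CleanerParallelismSketch {
> > >   // Assumed option key: hoodie.cleaner.parallelism
> > >   static void writeWithLowerCleanerParallelism(Dataset<Row> df, String basePath) {
> > >     df.write()
> > >         .format("com.uber.hoodie")
> > >         .option("hoodie.cleaner.parallelism", "20")
> > >         .mode(SaveMode.Append)
> > >         .save(basePath);
> > >   }
> > > }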
> > >
> > > Also, some good news: once https://issues.apache.org/jira/browse/HUDI-1
> > > lands (it is currently being reviewed), a follow-on is to rework the cleaner
> > > incrementally on top of it, which should help a lot here.
> > >
> > >
> > > Thanks
> > > Vinoth
> > >
> > > On Sat, Mar 23, 2019 at 7:39 PM Kabeer Ahmed <[email protected]>
> > wrote:
> > > > Hi,
> > > > I have just raised this issue and thought I would share it with the
> > > > community in case someone else is experiencing it. Apologies in advance if
> > > > this is a redundant email.
> > > > Thanks
> > > > Kabeer.
> > >
> >
>
>