Hi Kabeer,

Actually, knowing things like S3 creating connections three levels down helps us
immensely already! :) We typically test with HDFS, which just talks to the
namenode for RPC.
For context, are you running Hudi in a streaming job that does not exit?

Thanks
Vinoth

On Mon, Mar 25, 2019 at 2:58 PM Kabeer Ahmed <[email protected]> wrote:
> Thank you Vinoth and Balaji. As soon as you have a patch, I can get the
> tests going on S3 and relay back the results.
>
> Vinoth:
> My proficiency with the code can never be any better than yours and
> Balaji's. I have never looked in depth at whether fs calls
> (getPartitions()) are translated into filesystem calls.
>
> But as far as my understanding of S3 goes, even a simple get partition
> that goes 3 levels down to retrieve the objects results in an S3
> connection. Let me give a gist of what we do in the pseudo-code below:
>
> insertHudiRecords()
> {
>   // Prepare 1st HUDI DF to write.
>   // S3 connections increase by 750, in line with the # of partitions we have.
>   df1.write.format(com.uber.hudi)....
>
>   // Prepare another HUDI DF to write.
>   // S3 connections are at the 750 level & another 750 are added on top.
>   df2.write.format(com.uber.hudi)...
> }
>
> // S3 connections are released only when the Spark process in the above
> // routine is finished, i.e. the actual application exits.
>
> Thank you for all your responses.
>
> On Mar 25 2019, at 4:15 pm, [email protected] wrote:
> > +1. Incremental cleaning is scheduled work. I will be working on this
> > immediately after HUDI-1.
> >
> > Balaji.V
> >
> > On Sunday, March 24, 2019, 7:42:03 PM PDT, Vinoth Chandar
> > <[email protected]> wrote:
> >
> > Hi Kabeer,
> > You are right. HUDI-1 alone won't be sufficient. We need to do a
> > follow-on. IIRC this is already planned work (balaji?). Filed
> > https://issues.apache.org/jira/browse/HUDI-80 to separate this from
> > HUDI-1.
> >
> > On to the issue you are facing: it seems the connections to S3 keep
> > hanging around? I don't think cleaning actually opens any files; it
> > simply lists and deletes. We could call fs.close(), which would
> > probably shut the connections down.
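The accumulation Kabeer describes can be sketched as a self-contained toy model. This is not Hadoop's actual code: the class name, the `touchPartitions` method, and the one-connection-per-partition cost are all assumptions for illustration. The point it captures is that filesystem instances are cached per URI, so both `df.write` calls share one client whose pooled connections only drain when the process exits.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the behaviour described above: filesystem instances are
// cached per URI, so both df.write calls share one client, and the
// connections it accumulates are only released when the process exits.
public class FsCacheSketch {
    private static final Map<String, FsCacheSketch> CACHE = new HashMap<>();
    int openConnections = 0;

    static FsCacheSketch get(String uri) {
        // mirrors FileSystem.get(): return the cached instance if present
        return CACHE.computeIfAbsent(uri, u -> new FsCacheSketch());
    }

    // assume each partition touched costs one pooled connection
    void touchPartitions(int partitions) {
        openConnections += partitions;
    }

    public static void main(String[] args) {
        FsCacheSketch fs1 = FsCacheSketch.get("s3a://bucket");
        fs1.touchPartitions(750);   // first df.write: 750 connections
        FsCacheSketch fs2 = FsCacheSketch.get("s3a://bucket");
        fs2.touchPartitions(750);   // second df.write hits the same cached client
        System.out.println(fs1 == fs2);            // true: one shared instance
        System.out.println(fs1.openConnections);   // 1500
    }
}
```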
> > But we need to think that through more, since fs caching is a tricky
> > issue. I filed https://issues.apache.org/jira/browse/HUDI-81 separately
> > to track this. If you can help me track the connections to S3 etc., I
> > can take a stab and maybe we can test the patch in your environment?
> >
> > We can work on the ticket. Please share your jira id, so I can add you
> > as a contributor, giving you commenting etc. on jira.
> >
> > Thanks
> > Vinoth
> >
> > On Sun, Mar 24, 2019 at 2:11 PM Kabeer Ahmed <[email protected]> wrote:
> > > Hi Vinoth,
> > > Thank you for your response. I thought of reducing cleaner
> > > parallelism, which is Min(200, table_partitions). But it wouldn't
> > > have an effect: regardless of parallelism, there will be an attempt
> > > to scan all files (reduced parallelism might, albeit, slow the
> > > process).
> > > So, as stated, on a table with 750+ partitions I did notice that
> > > connections would increase, and I have now been forced to keep the S3
> > > connection limit at 5k due to this issue.
> > > I also looked into the brief description of the jira:
> > > https://issues.apache.org/jira/browse/HUDI-1
> > > This is a very nice optimisation to have, but I don't think it will
> > > help alleviate the concerns on S3. On HDFS, this jira will definitely
> > > help reduce the # of namenode connections, but S3 objects will need
> > > to be opened to clear them and the problem will not go away.
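Why closing a cached filesystem is "a tricky issue" can be seen with another small sketch. Again a toy model with assumed names, not Hadoop's implementation: because the cache hands every caller the same shared instance, the cleaner closing "its" handle also closes the handle a concurrent writer is still using.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the fs-caching hazard: close() on a cached, shared
// instance breaks every other component that holds the same instance.
public class CloseHazardSketch {
    private static final Map<String, CloseHazardSketch> CACHE = new HashMap<>();
    private boolean closed = false;

    static CloseHazardSketch get(String uri) {
        return CACHE.computeIfAbsent(uri, u -> new CloseHazardSketch());
    }

    void close() { closed = true; }

    String listFiles() {
        if (closed) {
            throw new IllegalStateException("filesystem closed");
        }
        return "listing ok";
    }

    public static void main(String[] args) {
        CloseHazardSketch cleanerFs = CloseHazardSketch.get("s3a://bucket");
        CloseHazardSketch writerFs = CloseHazardSketch.get("s3a://bucket");
        cleanerFs.close();   // cleaner tries to release S3 connections
        try {
            writerFs.listFiles();   // writer, sharing the cached instance, now fails
        } catch (IllegalStateException e) {
            System.out.println("writer broken: " + e.getMessage());
        }
    }
}
```

For what it's worth, Hadoop's usual escapes from this hazard are `FileSystem.newInstance()` (an uncached instance you own and may close) or disabling the cache via `fs.<scheme>.impl.disable.cache`, both of which trade the shared-close problem for extra setup cost.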
> > > I think the effective work has to be along the lines of cleaning up
> > > the partitions in the routine below:
> > >
> > > // File: HoodieCopyOnWriteTable.java
> > > public List<HoodieCleanStat> clean(JavaSparkContext jsc) {
> > >   try {
> > >     FileSystem fs = getMetaClient().getFs();
> > >     List<String> partitionsToClean = FSUtils
> > >         .getAllPartitionPaths(fs, getMetaClient().getBasePath(),
> > >             config.shouldAssumeDatePartitioning());
> > >     logger.info("Partitions to clean up : " + partitionsToClean
> > >         + ", with policy " + config.getCleanerPolicy());
> > >     if (partitionsToClean.isEmpty()) {
> > >       logger.info("Nothing to clean here mom. It is already clean");
> > >       return Collections.emptyList();
> > >     }
> > >     return cleanPartitionPaths(partitionsToClean, jsc);
> > >   } catch (IOException e) {
> > >     throw new HoodieIOException("Failed to clean up after commit", e);
> > >   }
> > > }
> > >
> > > In the above routine, none of the connections opened are closed. I
> > > think the work should be along the lines of closing the connections
> > > in this routine after the cleaning operation (i.e. file close logic
> > > added so that it is executed in parallel for every file opened by the
> > > Spark executors).
> > >
> > > Please feel free to correct me if you think I have goofed up
> > > somewhere.
> > > Thanks
> > > Kabeer.
> > >
> > > PS: There is so much going on and there is a need to progress with
> > > the stuff at hand at work. Otherwise I would have loved to spend time
> > > and send a PR.
> > >
> > > On Mar 24 2019, at 7:04 am, Vinoth Chandar <[email protected]> wrote:
> > > > Hi Kabeer,
> > > >
> > > > No need to apologize :)
> > > > The mailing list works a lot better for reporting issues. We can
> > > > respond much quicker, since it is not buried with all the other
> > > > github events.
> > > >
> > > > On what you saw: the cleaner does list all partitions currently.
> > > > Have you tried reducing cleaner parallelism, if limiting
> > > > connections is your goal?
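Kabeer's suggestion of per-file close logic can be sketched generically. The names here are hypothetical stand-ins, not Hudi's API, and a `ByteArrayInputStream` stands in for whatever handle the cleaner would open per partition. The shape of the fix is try-with-resources: the handle is released as soon as that partition is processed, rather than lingering until the application exits.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.List;

// Sketch of closing per-partition handles eagerly; cleanPartition is a
// hypothetical stand-in for the work done inside cleanPartitionPaths().
public class EagerCloseSketch {

    static int cleanPartition(String partitionPath) throws IOException {
        // try-with-resources guarantees close() even if the work throws,
        // so no handle (and no pooled connection) lingers until exit
        try (InputStream handle =
                 new ByteArrayInputStream(partitionPath.getBytes())) {
            return handle.available();   // stand-in for list + delete work
        }
    }

    public static void main(String[] args) throws IOException {
        List<String> partitions = Arrays.asList("2019/03/23", "2019/03/24");
        for (String p : partitions) {
            cleanPartition(p);
        }
        System.out.println("cleaned " + partitions.size() + " partitions");
    }
}
```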
> > > > Also, some good news: once
> > > > https://issues.apache.org/jira/browse/HUDI-1 is landed (currently
> > > > being reviewed), a follow-on is to rework the cleaner incrementally
> > > > on top, which should help a lot here.
> > > >
> > > > Thanks
> > > > Vinoth
> > > >
> > > > On Sat, Mar 23, 2019 at 7:39 PM Kabeer Ahmed <[email protected]>
> > > > wrote:
> > > > > Hi,
> > > > > I have just raised this issue and thought to share it with the
> > > > > community in case someone else is experiencing it. Apologies in
> > > > > advance if this is a redundant email.
> > > > > Thanks
> > > > > Kabeer.
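For reference, the cleaner parallelism discussed in the thread, Min(200, table_partitions), works out as below. The cap of 200 is taken from Kabeer's message, not looked up in Hudi's configuration.

```java
// Sketch of the cleaner parallelism formula quoted in the thread:
// the number of Spark tasks is the partition count, capped at 200.
public class CleanerParallelismSketch {
    static final int CAP = 200;   // cap as stated in the thread

    static int cleanerParallelism(int tablePartitions) {
        return Math.min(CAP, tablePartitions);
    }

    public static void main(String[] args) {
        System.out.println(cleanerParallelism(750));   // 200: Kabeer's table hits the cap
        System.out.println(cleanerParallelism(50));    // 50: one task per partition
    }
}
```

This also illustrates Kabeer's point that lowering parallelism cannot reduce the total S3 traffic: fewer concurrent tasks still scan every one of the 750+ partitions, just more slowly.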
