Hi Vinoth,

Thank you for your response. I did consider reducing the cleaner parallelism, which is
Min(200, table_partitions). But it wouldn't have an effect: regardless of the
parallelism, every file is still scanned (reduced parallelism would merely slow
the process down).
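To illustrate what I mean by parallelism only slowing things down, the cleaner
effectively fans the partition list out over Spark, roughly as in the sketch
below (this is not the actual code; listAndCleanPartition() is a made-up
placeholder). Capping the parallelism only limits how many partitions are
worked on concurrently; every partition, and hence every file under it, still
gets listed once.

import java.util.List;
import org.apache.spark.api.java.JavaSparkContext;

// Rough sketch only, not the actual cleaner code; listAndCleanPartition() is a
// made-up placeholder for the per-partition listing/cleaning work.
static void cleanAllPartitions(JavaSparkContext jsc, List<String> partitionsToClean) {
  // Fewer slices => fewer concurrent S3 calls, but the same total number of calls.
  int parallelism = Math.min(200, partitionsToClean.size());
  jsc.parallelize(partitionsToClean, parallelism)
      .foreach(partitionPath -> listAndCleanPartition(partitionPath));
}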
So, as stated, on a table with 750+ partitions I did notice the number of
connections climbing, and I have now been forced to keep the S3 connection
limit at 5k because of this issue.
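(For concreteness, by the S3 connection limit I mean the connection pool size
of the S3 filesystem client. I am assuming the S3A connector here; on a
different S3 filesystem implementation the property will differ.)

// Assuming the S3A connector: the pool we had to grow to 5000 so the
// cleaner's listing/delete traffic stops exhausting it.
jsc.hadoopConfiguration().set("fs.s3a.connection.maximum", "5000");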
I also looked into the brief description of the jira: 
https://issues.apache.org/jira/browse/HUDI-1.
This is a very nice optimisation to have, but I don't think it will alleviate
the concerns on S3. On HDFS this JIRA will definitely help reduce the number of
NameNode connections, but on S3 the objects will still need to be opened in
order to clean them, so the problem will not go away.
I think the effective work has to be along the lines of how the partitions are
cleaned in the routine below:

// File: HoodieCopyOnWriteTable.java
public List<HoodieCleanStat> clean(JavaSparkContext jsc) {
  try {
    FileSystem fs = getMetaClient().getFs();
    List<String> partitionsToClean = FSUtils.getAllPartitionPaths(fs, getMetaClient().getBasePath(),
        config.shouldAssumeDatePartitioning());
    logger.info("Partitions to clean up : " + partitionsToClean + ", with policy "
        + config.getCleanerPolicy());
    if (partitionsToClean.isEmpty()) {
      logger.info("Nothing to clean here mom. It is already clean");
      return Collections.emptyList();
    }
    return cleanPartitionPaths(partitionsToClean, jsc);
  } catch (IOException e) {
    throw new HoodieIOException("Failed to clean up after commit", e);
  }
}
In the above routine, the connections that get opened are never closed. I think
the work should be along the lines of cleaning up the connections in this
routine after the cleaning operation completes (i.e. adding file-close logic
that runs in parallel for every file opened by the Spark executors).
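To make that concrete, here is a very rough, untested sketch of the shape I am
picturing. This is not the actual Hudi code: inspectFile() is a made-up
placeholder for whatever per-file inspection the cleaner does, and a real patch
would plug into cleanPartitionPaths() rather than sit beside it.

import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.spark.api.java.JavaSparkContext;

static void cleanPartitionsClosingConnections(JavaSparkContext jsc, String basePath,
    List<String> partitionsToClean) {
  jsc.parallelize(partitionsToClean, Math.min(200, partitionsToClean.size()))
      .foreach(partitionPath -> {
        // Build the FileSystem on the executor; a real patch would reuse a
        // serializable Hadoop configuration instead of a fresh Configuration().
        FileSystem fs = new Path(basePath).getFileSystem(new Configuration());
        for (FileStatus file : fs.listStatus(new Path(basePath, partitionPath))) {
          try (FSDataInputStream in = fs.open(file.getPath())) {
            inspectFile(in); // placeholder for whatever the cleaner needs to read
          }
          // The stream (and its S3 connection) is released here, instead of
          // lingering until the whole clean() call returns.
        }
      });
}

The essential bit is the try-with-resources: whatever is opened for a file is
released on the executor straight away instead of piling up into thousands of
open S3 connections.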

Please feel free to correct me if you think I have goofed up somewhere.
Thanks
Kabeer.

PS: There is a lot going on and I need to make progress on the work at hand;
otherwise I would have loved to spend the time and send a PR.
On Mar 24 2019, at 7:04 am, Vinoth Chandar <[email protected]> wrote:
> Hi Kabeer,
>
> No need to apologize :)
> Mailing list works a lot better for reporting issues. We can respond much
> quicker, since it's not buried with all the other GitHub events
>
> On what you saw, the cleaner does list all partitions currently. Have you
> tried reducing cleaner parallelism if limiting connections is your goal?
>
> Also some good news is, once
> https://issues.apache.org/jira/browse/HUDI-1 is landed (currently being
> reviewed), a follow on is to rework the cleaner incrementally on top which
> should help a lot here.
>
>
> Thanks
> Vinoth
>
> On Sat, Mar 23, 2019 at 7:39 PM Kabeer Ahmed <[email protected]> wrote:
> > Hi,
> > I have just raised this issue and thought to share with the community if
> > someone else is experiencing this. Apologies in advance if this is a
> > redundant email.
> > Thanks
> > Kabeer.
>
>
