Wahno opened a new issue, #50: URL: https://github.com/apache/doris-spark-connector/issues/50
### Search before asking

- [X] I had searched in the [issues](https://github.com/apache/incubator-doris/issues?q=is%3Aissue) and found no similar issues.

### Description

When the input consists of a very large number of small files (for example, thousands of files with only 100 rows each), the RDD has at least as many partitions as there are files. Each request then carries very little data, but the number of requests is large. This leads to the too-many-versions problem and can even cause the import to fail.

### Solution

Add a parameter that caps the number of RDD partitions, with a default of `Integer.MAX_VALUE`. The user controls this parameter, so a repartition can be performed before writing to reduce the number of partitions.

### Are you willing to submit PR?

- [X] Yes I am willing to submit a PR!

### Code of Conduct

- [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)
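
The partition cap proposed in the Solution section can be illustrated with a minimal sketch in plain Python. This is not the connector's code: `target_partitions`, `coalesce`, and the `max_partitions` argument are hypothetical names chosen for illustration, and lists stand in for RDD partitions. In Spark itself, the equivalent step would presumably be a call such as `rdd.coalesce(n)` before writing, applied only when the RDD has more partitions than the cap.

```python
# Hypothetical default, mirroring the Integer.MAX_VALUE default proposed in the issue.
INT_MAX = 2**31 - 1


def target_partitions(current: int, max_partitions: int = INT_MAX) -> int:
    """Return the partition count after applying the cap (never increases it)."""
    if max_partitions < 1:
        raise ValueError("max_partitions must be >= 1")
    return min(current, max_partitions)


def coalesce(partitions, max_partitions=INT_MAX):
    """Merge many small partitions into at most max_partitions larger ones."""
    n = target_partitions(len(partitions), max_partitions)
    if n == len(partitions):
        # Already within the cap: leave the partitioning unchanged.
        return [list(p) for p in partitions]
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        # Round-robin assignment of whole partitions to the merged groups.
        merged[i % n].extend(part)
    return merged


# Simulate 3000 small files of 100 rows each: one partition per file.
small = [[(f, r) for r in range(100)] for f in range(3000)]
capped = coalesce(small, max_partitions=30)

print(len(small), "->", len(capped))  # 3000 -> 30
print(len(capped[0]))                 # 10000
```

With the default cap nothing changes, which matches the issue's intent that existing jobs keep their behavior unless the user opts in. With a cap of 30, the 3000 requests of 100 rows become 30 requests of 10000 rows, so far fewer versions are produced.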
