[GitHub] [doris-spark-connector] Wahno opened a new issue, #50: [Enhancement] Fix performance issues caused by small file issues

GitBox Tue, 06 Sep 2022 04:43:53 -0700


Wahno opened a new issue, #50:
URL: https://github.com/apache/doris-spark-connector/issues/50


   ### Search before asking
   
   - [X] I have searched the 
[issues](https://github.com/apache/incubator-doris/issues?q=is%3Aissue) and 
found no similar issues.
   
   
   ### Description
   
   When the input consists of many small files, for example thousands of files 
or more with only 100 records each, the number of RDD partitions is greater 
than or equal to the number of files. Each request then carries very little 
data, but the number of requests is large, which leads to too many versions 
and can even cause the import to fail.
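   The arithmetic behind this can be sketched as follows. The figures (5000 
files, 100 records each) are illustrative only, and the sketch assumes at 
least one request per partition, which the description implies but does not 
state exactly.

```python
def min_request_count(num_files: int) -> int:
    """Lower bound on requests, assuming one request per RDD partition
    and at least one partition per file (assumption, see above)."""
    return num_files


def total_records(num_files: int, records_per_file: int) -> int:
    """Total amount of data spread across all of those requests."""
    return num_files * records_per_file


# Illustrative figures, not measurements from the issue.
files, per_file = 5000, 100
print(min_request_count(files))        # number of small requests
print(total_records(files, per_file))  # records carried in total
```

   With these figures, 500000 records end up split across at least 5000 
requests of 100 records each.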
   
   ### Solution
   
   Add a parameter for the maximum number of RDD partitions, with a default of 
Integer.MAX_VALUE, which effectively leaves the partition count uncapped. The 
user controls this parameter, so that a repartition can be performed in 
advance to reduce the number of partitions.
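   A minimal sketch of how such a cap might behave, written in plain Python 
rather than against the Spark API. The helper names `target_partitions` and 
`records_per_partition` and the parameter name `max_partitions` are 
hypothetical; the issue does not name the actual configuration key.

```python
INT_MAX = 2**31 - 1  # value of Java's Integer.MAX_VALUE, the proposed default


def target_partitions(current: int, max_partitions: int = INT_MAX) -> int:
    """Partition count after applying the user-controlled cap (hypothetical)."""
    if max_partitions < 1:
        raise ValueError("max_partitions must be at least 1")
    # Only ever reduce the partition count; never increase it.
    return min(current, max_partitions)


def records_per_partition(total_records: int, partitions: int) -> int:
    """Approximate records per partition after an even repartition (ceiling)."""
    return -(-total_records // partitions)


# Illustrative figures: 5000 partitions holding 500000 records in total.
print(target_partitions(5000))       # default cap leaves the count unchanged
capped = target_partitions(5000, 50)
print(capped, records_per_partition(500_000, capped))
```

   With the default, nothing changes for existing users; a smaller value trades 
many small requests for fewer, larger ones.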
   
   ### Are you willing to submit PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

