Github user jianqiao commented on a diff in the pull request:

    https://github.com/apache/incubator-quickstep/pull/19#discussion_r66470302
  
    --- Diff: relational_operators/TextScanOperator.cpp ---
    @@ -274,439 +116,293 @@ TextScanWorkOrder::TextScanWorkOrder(const std::size_t query_id,
     
     void TextScanWorkOrder::execute() {
       const CatalogRelationSchema &relation = output_destination_->getRelation();
    +  std::vector<Tuple> tuples;
     
    -  string current_row_string;
    -  if (is_file_) {
    -    FILE *file = std::fopen(filename_.c_str(), "r");
    -    if (file == nullptr) {
    -      throw TextScanReadError(filename_);
    -    }
    +  constexpr std::size_t kSmallBufferSize = 0x4000;
    --- End diff --
    
    This is the buffer size for processing the last row of the text segment.
    
    For each text segment, we (1) scan from the first newline (`\n`) 
character in the segment to the last newline character in the segment, and 
then (2) scan from the _last_ newline character in _this_ text segment to the 
_first_ newline character in the _next_ text segment (corner cases are also 
handled).
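
As a rough illustration of step (1), the sketch below finds the part of a segment that lies between its first and last newline. The helper name `CompleteRowsRange` and the use of an in-memory `std::string` are assumptions made for this example; they are not taken from the pull request, and the corner cases mentioned above (for example, the first segment of a file) are not handled.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical helper, not part of the patch: returns the [begin, end)
// byte offsets of the rows that lie entirely inside `segment`, i.e. the
// bytes after the first '\n' up to and including the last '\n'.
std::pair<std::size_t, std::size_t> CompleteRowsRange(
    const std::string &segment) {
  const std::size_t first = segment.find('\n');
  if (first == std::string::npos) {
    // No newline at all: this segment contains no complete row.
    return {0, 0};
  }
  const std::size_t last = segment.rfind('\n');
  return {first + 1, last + 1};
}
```

Whatever follows the returned `end` offset is the start of the "tail row" that step (2) completes by reading into the next segment.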
    
    For (2), how much data from the _next_ segment should we load from disk? 
Since it is just one row, in most cases we do not want to load much. So the 
load buffer starts at 1024 bytes, and we keep appending the buffer's contents 
to a `std::string` until a `\n` is found. If this "tail row" is really large, 
the buffer grows up to 0x4000 bytes.

