Kasper Sørensen created METAMODEL-31:
----------------------------------------

             Summary: CSV module: Inefficient insert/append implementation for 
non-FileResources
                 Key: METAMODEL-31
                 URL: https://issues.apache.org/jira/browse/METAMODEL-31
             Project: Metamodel
          Issue Type: Bug
    Affects Versions: 4.0
            Reporter: Kasper Sørensen


I recently noticed very poor performance when inserting records into a CSV 
resource that was a virtual file in a third-party system. We have our own 
implementation of the Resource interface for this virtual file type. The 
Resource implementation itself is efficient enough, so I was not sure why 
inserting through the CSV module took so much longer than appending data to 
the resource directly.

The answer is in the CsvUpdateCallback class, line 101 and onwards:

{code}
            // generic handling for any kind of resource
            final Action<OutputStream> action = new Action<OutputStream>() {
                @Override
                public void run(OutputStream out) throws Exception {
                    final String encoding = _configuration.getEncoding();
                    final OutputStreamWriter writer = new OutputStreamWriter(out, encoding);
                    writer.write(line);
                    writer.flush();
                }
            };
            if (append) {
                _resource.append(action);
            } else {
                _resource.write(action);
            }
{code}

It seems that there is an if-block specifically for FileResources. For a 
FileResource, a trick is applied so that the FileOutputStream is reused for 
each inserted record.

The trouble is that for other types of Resources, the above code is used: a 
separate append operation is requested for each record. This typically 
involves opening and closing the output stream. When this happens for EACH 
record, it comes at a severe penalty in the end.

A suggested fix would be to keep a buffer of inserts in memory. Every time the 
buffer reaches, say, 1000 records, it would be flushed in a single append 
operation. This should dramatically improve the overall performance.
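
To make the idea concrete, here is a rough, self-contained sketch of such a 
buffer. It is not the actual MetaModel code: BufferedLineAppender, 
AppendTarget and StreamAction are hypothetical names that only stand in for 
the Resource/Action pair shown above, and the threshold is just a constructor 
argument.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

/** Sketch only: collects lines in memory and appends them in batches. */
class BufferedLineAppender implements AutoCloseable {

    /** Hypothetical stand-in for a resource that supports appending. */
    interface AppendTarget {
        void append(StreamAction action) throws IOException;
    }

    /** Hypothetical callback that receives the opened output stream. */
    interface StreamAction {
        void run(OutputStream out) throws IOException;
    }

    private final AppendTarget target;
    private final Charset charset;
    private final int maxBufferedLines;
    private final List<String> buffer = new ArrayList<>();

    BufferedLineAppender(AppendTarget target, Charset charset, int maxBufferedLines) {
        if (maxBufferedLines < 1) {
            throw new IllegalArgumentException("maxBufferedLines must be >= 1");
        }
        this.target = target;
        this.charset = charset;
        this.maxBufferedLines = maxBufferedLines;
    }

    /** Buffers one already formatted CSV line; flushes when the buffer is full. */
    void addLine(String line) throws IOException {
        buffer.add(line);
        if (buffer.size() >= maxBufferedLines) {
            flush();
        }
    }

    /** Writes all buffered lines in a single append operation. */
    void flush() throws IOException {
        if (buffer.isEmpty()) {
            return;
        }
        final List<String> pending = new ArrayList<>(buffer);
        target.append(out -> {
            final Writer writer = new OutputStreamWriter(out, charset);
            for (String line : pending) {
                writer.write(line);
            }
            writer.flush();
        });
        // Only discard the lines once the append has succeeded.
        buffer.clear();
    }

    /** Flushes whatever is left, e.g. at the end of the update script. */
    @Override
    public void close() throws IOException {
        flush();
    }
}
```

With a threshold of 1000, inserting 2500 records through this class would 
result in three append operations (two full batches plus the remainder on 
close) instead of 2500. The trade-off is that up to 1000 lines are held in 
memory and are lost if the process dies before the next flush.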



--
This message was sent by Atlassian JIRA
(v6.1#6144)
