[ https://issues.apache.org/jira/browse/PARQUET-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Gabor Szadovszky resolved PARQUET-1808.
---------------------------------------
    Resolution: Fixed

> SimpleGroup.toString() uses String += and so has poor performance
> -----------------------------------------------------------------
>
>                 Key: PARQUET-1808
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1808
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-mr
>    Affects Versions: 1.11.0
>            Reporter: Randy Tidd
>            Assignee: Shankar Koirala
>            Priority: Minor
>              Labels: pull-request-available
>
> This method in SimpleGroup uses `+=` for String concatenation which is a
> known performance problem in Java, the performance degrades exponentially the
> more strings that are added.
> [https://github.com/apache/parquet-mr/blob/d69192809d0d5ec36c0d8c126c8bed09ee3cee35/parquet-column/src/main/java/org/apache/parquet/example/data/simple/SimpleGroup.java#L50]
> We ran into a performance problem whereby a single column in a Parquet file
> was defined as a group:
> {code:java}
> optional group customer_ids (LIST) {
>   repeated group list {
>     optional binary element (STRING);
>   }
> }{code}
>
> and had over 31,000 values. Reading this single column took over 8 minutes
> due to time spent in the `toString()` method. Using a different
> implementation that uses `StringBuffer` like this:
> {code:java}
> StringBuffer result = new StringBuffer();
> int i = 0;
> for (Type field : schema.getFields()) {
>   String name = field.getName();
>   List<Object> values = data[i];
>   ++i;
>   if (values != null) {
>     if (values.size() > 0) {
>       for (Object value : values) {
>         result.append(indent);
>         result.append(name);
>         if (value == null) {
>           result.append(": NULL\n");
>         } else if (value instanceof Group){
>           result.append("\n");
>           result.append(betterToString((SimpleGroup)value, indent+" "));
>         } else {
>           result.append(": ");
>           result.append(value.toString());
>           result.append("\n");
>         }
>       }
>     }
>   }
> }
> return result.toString();{code}
> reduced that time to less than 500 milliseconds.
> The existing implementation is really poor and exhibits an infamous Java
> string performance issue and should be fixed.
> This was a significant problem for us but we were able to work around it so I
> am marking this issue as "Minor".



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
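
For anyone who wants to see the effect described in the issue without a Parquet file, here is a minimal standalone sketch. It is not the SimpleGroup code and not the committed fix. The class name ConcatSketch, both method names, the field name "element" and the count of 10,000 values are all made up for illustration. It uses StringBuilder, the unsynchronized counterpart of the StringBuffer in the reporter's snippet. Each `+=` in a loop copies the whole accumulated string, so the total work grows roughly quadratically with the number of appends, whereas appending to one buffer stays roughly linear.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; none of these names exist in parquet-mr.
final class ConcatSketch {

    // Mirrors the `+=` pattern: each iteration allocates a new String
    // and copies everything accumulated so far.
    static String joinWithPlusEquals(String name, List<String> values, String indent) {
        String result = "";
        for (String value : values) {
            result += indent + name + ": " + value + "\n";
        }
        return result;
    }

    // Produces the same text, but appends into a single growable buffer.
    static String joinWithBuilder(String name, List<String> values, String indent) {
        StringBuilder result = new StringBuilder();
        for (String value : values) {
            result.append(indent).append(name).append(": ").append(value).append('\n');
        }
        return result.toString();
    }

    public static void main(String[] args) {
        // Arbitrary size chosen for the demo; the issue mentions over 31,000 values.
        List<String> values = new ArrayList<>();
        for (int i = 0; i < 10_000; i++) {
            values.add("id-" + i);
        }

        long t0 = System.nanoTime();
        String slow = joinWithPlusEquals("element", values, "  ");
        long t1 = System.nanoTime();
        String fast = joinWithBuilder("element", values, "  ");
        long t2 = System.nanoTime();

        // Timings depend on the JVM and machine, so they are only printed.
        System.out.println("same output: " + slow.equals(fast));
        System.out.println("+= took ms: " + (t1 - t0) / 1_000_000);
        System.out.println("StringBuilder took ms: " + (t2 - t1) / 1_000_000);
    }
}
```

The two methods return identical strings, so swapping one for the other changes only the cost, not the output. In the real toString() the recursion into nested groups adds further copying on top of this, which the flat loop here does not model.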