[jira] [Commented] (LUCENE-6779) Reduce memory allocated by CompressingStoredFieldsWriter to write large strings

Dawid Weiss (JIRA) Thu, 03 Sep 2015 13:55:00 -0700

    [ https://issues.apache.org/jira/browse/LUCENE-6779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14729765#comment-14729765 ]


Dawid Weiss commented on LUCENE-6779:
-------------------------------------

{code}
+  /** Writes UTF8 into the given OutputStream by first writing to the given scratch array
+   * and then writing the contents of the scratch array to the OutputStream. The given scratch byte array
+   * is used to buffer intermediate data before it is written to the byte buffer.
+   *
+   * @return the number of bytes written
+   */
+  public static int writeUTF16toUTF8(final CharSequence s, final int offset, final int len, final DataOutput dataOutput, final byte[] scratch) throws IOException {
{code}

Isn't this a mix of two things (buffering and coding)? I think it'd be nicer to 
have the DataOutput (or some decorator) take care of the buffering aspects and 
the routine could then focus on transcoding from UTF16 to UTF8.
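
A minimal sketch of that separation, under stated assumptions: it uses the JDK's java.io.OutputStream and BufferedOutputStream as stand-ins for Lucene's DataOutput and a buffering decorator, and the class name Utf8TranscodeSketch is hypothetical. Replacing unpaired surrogates with U+FFFD is a choice made for this sketch, not something taken from the patch.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical sketch: the routine only transcodes; the stream it is given decides how to buffer.
final class Utf8TranscodeSketch {

  /** Transcodes s[offset, offset + len) from UTF-16 to UTF-8 and returns the number of bytes written. */
  static int writeUTF16toUTF8(CharSequence s, int offset, int len, OutputStream out) throws IOException {
    int written = 0;
    final int end = offset + len;
    for (int i = offset; i < end; i++) {
      final char c = s.charAt(i);
      if (c < 0x80) {                       // 1 byte: ASCII
        out.write(c);
        written += 1;
      } else if (c < 0x800) {               // 2 bytes
        out.write(0xC0 | (c >> 6));
        out.write(0x80 | (c & 0x3F));
        written += 2;
      } else if (Character.isHighSurrogate(c) && i + 1 < end
          && Character.isLowSurrogate(s.charAt(i + 1))) {
        // 4 bytes: a valid surrogate pair forms one supplementary code point
        final int cp = Character.toCodePoint(c, s.charAt(++i));
        out.write(0xF0 | (cp >> 18));
        out.write(0x80 | ((cp >> 12) & 0x3F));
        out.write(0x80 | ((cp >> 6) & 0x3F));
        out.write(0x80 | (cp & 0x3F));
        written += 4;
      } else {
        // 3 bytes: a BMP character, or U+FFFD in place of an unpaired surrogate (sketch's choice)
        final int cp = Character.isSurrogate(c) ? 0xFFFD : c;
        out.write(0xE0 | (cp >> 12));
        out.write(0x80 | ((cp >> 6) & 0x3F));
        out.write(0x80 | (cp & 0x3F));
        written += 3;
      }
    }
    return written;
  }

  public static void main(String[] args) throws IOException {
    ByteArrayOutputStream sink = new ByteArrayOutputStream();
    // Buffering lives entirely in the decorator; the transcoder never sees a scratch array.
    try (OutputStream buffered = new BufferedOutputStream(sink, 8192)) {
      int n = writeUTF16toUTF8("large string \u20ac", 0, 14, buffered);
      System.out.println(n + " bytes transcoded");
    }
  }
}
```

With this shape the caller picks the buffer size (or skips buffering altogether) without touching the transcoding loop.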

Also, most of the hardcoded constants and checks for surrogate pairs, etc. have 
counterparts in Character.* methods (and they should inline very well).
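
To illustrate, here is a rough sketch of a UTF-8 length computation (the kind of first pass a double-pass encoder needs) that relies on Character.isHighSurrogate and Character.isLowSurrogate instead of comparing against 0xD800..0xDFFF by hand. The class and method names are hypothetical and not from the patch; unpaired surrogates are counted as 3 bytes on the assumption that the encoder would emit U+FFFD for them.

```java
// Hypothetical sketch: surrogate handling via java.lang.Character instead of hardcoded ranges.
final class Utf8LengthSketch {

  /** Returns how many UTF-8 bytes s[offset, offset + len) would occupy. */
  static long utf8Length(CharSequence s, int offset, int len) {
    long bytes = 0;
    final int end = offset + len;
    for (int i = offset; i < end; i++) {
      final char c = s.charAt(i);
      if (c < 0x80) {
        bytes += 1;
      } else if (c < 0x800) {
        bytes += 2;
      } else if (Character.isHighSurrogate(c) && i + 1 < end
          && Character.isLowSurrogate(s.charAt(i + 1))) {
        // Equivalent to checking c in [0xD800, 0xDBFF] and the next char in [0xDC00, 0xDFFF].
        bytes += 4;
        i++; // the low surrogate is consumed together with the high one
      } else {
        bytes += 3; // BMP character, or an unpaired surrogate counted as U+FFFD
      }
    }
    return bytes;
  }

  public static void main(String[] args) {
    String s = "a\u00e9\u20ac\uD83D\uDE00";
    System.out.println(utf8Length(s, 0, s.length()));
  }
}
```

The named constants Character.MIN_HIGH_SURROGATE, Character.MAX_HIGH_SURROGATE, Character.MIN_LOW_SURROGATE and Character.MAX_LOW_SURROGATE are also available where a range comparison reads better than a method call.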

> Reduce memory allocated by CompressingStoredFieldsWriter to write large 
> strings
> -------------------------------------------------------------------------------
>
>                 Key: LUCENE-6779
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6779
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/codecs
>            Reporter: Shalin Shekhar Mangar
>         Attachments: LUCENE-6779.patch
>
>
> In SOLR-7927, I am trying to reduce the memory required to index very large 
> documents (between 10 and 100MB), and one of the places that allocates a lot 
> of heap is the UTF8 encoding in CompressingStoredFieldsWriter. The same 
> problem existed in JavaBinCodec, and we reduced its memory allocation by 
> falling back to a double-pass approach in SOLR-7971 when the UTF8 size of the 
> string is greater than 64KB.
> I propose to make the same changes to CompressingStoredFieldsWriter as we 
> made to JavaBinCodec in SOLR-7971.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

