[ https://issues.apache.org/jira/browse/SOLR-10981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16070259#comment-16070259 ]
Andrew Lundgren edited comment on SOLR-10981 at 6/30/17 6:20 PM:
-----------------------------------------------------------------

Because this uses the Content-Encoding header rather than the Content-Type header, it does not interfere with the existing Content-Type header usage.

In the patched ContentStreamBase, line 84 takes the content type from the connection. The gzip contentEncoding check on line 89 does not change it, so code using the content stream is unaffected. If the contentEncoding is not set, the code also checks whether the file name ends with ".gz". This could be dropped, though it seemed a reasonable usage.

In the patched ContentStreamBase, lines 117 and 121, the content type of a FileStream is determined by the first character found in the stream. Because the stream is already open and the gunzip stream is applied over the input stream, the code that determines the content type is unaffected. The FileStream will work with any existing format that is gzipped, as it determines the content type from the first character of the decompressed stream. (Attached is a new patch that makes this method use the getStream method on line 117, which applies the gzip layer, rather than opening the file itself.)

I agree that using a generic {{Content-Type: application/gzip}} would lead to confusion. To me, the gzip layer is the encoding of the content, not the type itself. By using the content encoding you can handle gzip at a lower layer and keep all of your content-type support untouched.

See: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Encoding
See: https://www.iana.org/assignments/media-types/media-types.xhtml

The current handling of a FileStream and of a file:// URL is inconsistent: the FileStream tries to guess the content type from the first character, while the file:// URL uses MIME types to determine the content type.
They seemingly should be consistent, though I did not try to make them consistent because, as the FileStream implementation points out, its implementation is buggy.
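
As a rough illustration of the approach described in the comment (this is not the code from the attached patch), the sketch below applies the gunzip layer first and only then guesses the content type from the first character of the decompressed stream. The class and method names (GzipAwareStreams, isGzipped, open, sniffContentType) are hypothetical. The choice to treat either a "gzip" content encoding or a ".gz" suffix as sufficient, and the two-entry character-to-type mapping, are assumptions made for the example.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Locale;
import java.util.zip.GZIPInputStream;

// Illustrative sketch only; names and behavior are assumptions, not the SOLR-10981 patch.
final class GzipAwareStreams {

    private GzipAwareStreams() {}

    // True when the Content-Encoding value is "gzip" or the file/URL name
    // ends in ".gz". Either argument may be null.
    static boolean isGzipped(String contentEncoding, String name) {
        if (contentEncoding != null && "gzip".equalsIgnoreCase(contentEncoding.trim())) {
            return true;
        }
        return name != null && name.toLowerCase(Locale.ROOT).endsWith(".gz");
    }

    // Wraps the raw stream in a gunzip layer when needed. The result is
    // buffered so that callers can peek at the first byte via mark/reset.
    static InputStream open(InputStream raw, String contentEncoding, String name)
            throws IOException {
        InputStream in = isGzipped(contentEncoding, name) ? new GZIPInputStream(raw) : raw;
        return new BufferedInputStream(in);
    }

    // Guesses a content type from the first character of the (already
    // decompressed) stream, then rewinds so no data is consumed.
    // Returns null when the first character is not recognized.
    static String sniffContentType(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(1);
        int first = in.read();
        in.reset();
        if (first == '<') {
            return "application/xml";
        }
        if (first == '{') {
            return "application/json";
        }
        return null;
    }
}
```

Because the sniffing step only ever sees the stream returned by open, it behaves the same for gzipped and plain input, which is the point made above about handling gzip at a lower layer than the content type.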
> Allow update to load gzip files
> --------------------------------
>
>                 Key: SOLR-10981
>                 URL: https://issues.apache.org/jira/browse/SOLR-10981
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public)
>          Components: SolrJ
>    Affects Versions: 6.6
>            Reporter: Andrew Lundgren
>              Labels: patch
>             Fix For: 4.10.4, 6.6, master (7.0)
>
>         Attachments: SOLR-10981.patch, SOLR-10981.patch, SOLR-10981.patch
>
>
> We currently import large CSV files. We store them in gzip files as they compress at around 80%.
> To import them we must gunzip them and then import them. After that we no longer need the decompressed files.
> This patch allows directly opening either URL, or local files that are gzipped.
> For URLs, to determine if the file is gzipped, it will check the content encoding=="gzip" or if the file ends in ".gz"
> For files, if the file ends in ".gz" then it will assume the file is gzipped.
> I have tested the patch with 4.10.4, 6.6.0 and master from git.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)