[jira] [Comment Edited] (SOLR-10981) Allow update to load gzip files

Andrew Lundgren (JIRA) Fri, 30 Jun 2017 11:21:37 -0700

    [ 
https://issues.apache.org/jira/browse/SOLR-10981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16070259#comment-16070259
 ]


Andrew Lundgren edited comment on SOLR-10981 at 6/30/17 6:20 PM:
-----------------------------------------------------------------

Because this uses the Content-Encoding header rather than the Content-Type 
header, it does not interfere with the existing Content-Type header usage. 

In the patched ContentStreamBase, on line 84, the content type is taken from 
the connection.  Because the content type is not changed by the gzip 
contentEncoding check on line 89, code using the content stream is unaffected.  
If the contentEncoding is not set, the code also checks whether the file name 
ends with ".gz".  This fallback could be dropped, though it seemed a reasonable 
usage.
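
The detection described above can be sketched roughly as follows.  This is 
not the code from the attached patch; the class and method names 
(GzipStreamSketch, looksGzipped, maybeGunzip) are hypothetical, and it assumes 
the ".gz" suffix is consulted only when no content encoding is set.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipStreamSketch {

  /** Assumption: the ".gz" suffix is only a fallback when no encoding is set. */
  static boolean looksGzipped(String contentEncoding, String name) {
    if (contentEncoding != null) {
      return "gzip".equalsIgnoreCase(contentEncoding.trim());
    }
    return name != null && name.toLowerCase(Locale.ROOT).endsWith(".gz");
  }

  /** Wraps the raw stream in a gunzip layer when needed; the content type is never touched. */
  static InputStream maybeGunzip(InputStream raw, String contentEncoding, String name)
      throws IOException {
    return looksGzipped(contentEncoding, name) ? new GZIPInputStream(raw) : raw;
  }

  /** Helper for the demo: gzip a string in memory. */
  static byte[] gzip(String text) throws IOException {
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    try (GZIPOutputStream out = new GZIPOutputStream(buf)) {
      out.write(text.getBytes(StandardCharsets.UTF_8));
    }
    return buf.toByteArray();
  }

  /** Helper for the demo: read a stream fully as UTF-8 text. */
  static String readAll(InputStream in) throws IOException {
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    byte[] chunk = new byte[4096];
    int n;
    while ((n = in.read(chunk)) != -1) {
      buf.write(chunk, 0, n);
    }
    return new String(buf.toByteArray(), StandardCharsets.UTF_8);
  }

  public static void main(String[] args) throws IOException {
    byte[] body = gzip("id,name\n1,solr\n");
    // No Content-Encoding header, so the ".gz" suffix decides.
    try (InputStream in = maybeGunzip(new ByteArrayInputStream(body), null, "data.csv.gz")) {
      System.out.print(readAll(in));
    }
  }
}
```

In a real URL stream the contentEncoding argument would come from something 
like URLConnection.getContentEncoding(), which returns null when the header is 
absent.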

In the patched ContentStreamBase, on lines 117 and 121, the content type of a 
FileStream is determined by the first character found in the stream.  Because 
the stream is already open and the gunzip stream has been applied over the 
input stream, the code that determines the content type is unaffected.  The 
FileStream will work with any existing format that is gzipped, as it determines 
the content type based on the first character of the decompressed stream.  
(Attached is a new patch that makes this method call getStream on line 117, 
which applies the gzip layer, rather than opening the file itself.)
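
A minimal sketch of why sniffing the first character still works once the 
gunzip layer is in place.  The class and method names are hypothetical, and 
the mapping of '<' to XML and '{' to JSON is an assumption made for 
illustration, not a quote of the patched code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class FirstCharSniffSketch {

  /** Guesses a content type from the first byte of the stream; null if unknown. */
  static String guessContentType(InputStream in) throws IOException {
    int first = in.read();
    if (first == '<') {
      return "application/xml";   // assumed mapping, for illustration only
    }
    if (first == '{') {
      return "application/json";  // assumed mapping, for illustration only
    }
    return null;
  }

  /** Helper for the demo: gzip a string in memory. */
  static byte[] gzip(String text) throws IOException {
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    try (GZIPOutputStream out = new GZIPOutputStream(buf)) {
      out.write(text.getBytes(StandardCharsets.UTF_8));
    }
    return buf.toByteArray();
  }

  public static void main(String[] args) throws IOException {
    byte[] gz = gzip("{\"id\":\"1\"}");

    // Sniffing the raw bytes sees the gzip magic byte 0x1f, so nothing matches.
    System.out.println(guessContentType(new ByteArrayInputStream(gz)));

    // Sniffing through the gunzip layer sees the '{' of the decompressed JSON.
    System.out.println(
        guessContentType(new GZIPInputStream(new ByteArrayInputStream(gz))));
  }
}
```

The second call is the situation described above: the sniffing code reads from 
the already-decompressed stream, so it needs no knowledge of gzip.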

I agree that using a generic {{Content-Type: application/gzip}} would lead to 
confusion.  To me, the gzip layer is the encoding of the content, not the type 
itself.  By using the content encoding, you can handle the gzip at a lower 
layer and keep all of your content-type support untouched.

See: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Encoding
See: https://www.iana.org/assignments/media-types/media-types.xhtml

The current handling of a FileStream and of a file:// URL is inconsistent: the 
FileStream tries to guess the content type based on the first character, while 
the file:// URL uses mime-types to determine the content.  They seemingly 
should be consistent, though I did not try to make them consistent; as the 
FileStream implementation points out, its implementation is buggy.





> Allow update to load gzip files 
> --------------------------------
>
>                 Key: SOLR-10981
>                 URL: https://issues.apache.org/jira/browse/SOLR-10981
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: SolrJ
>    Affects Versions: 6.6
>            Reporter: Andrew Lundgren
>              Labels: patch
>             Fix For: 4.10.4, 6.6, master (7.0)
>
>         Attachments: SOLR-10981.patch, SOLR-10981.patch, SOLR-10981.patch
>
>
> We currently import large CSV files.  We store them in gzip files as they 
> compress at around 80%.
> To import them we must gunzip them and then import them.  After that we no 
> longer need the decompressed files.
> This patch allows directly opening gzipped files, whether they are given as 
> URLs or as local files.
> For URLs, the file is treated as gzipped if the content encoding is "gzip" 
> or if the file name ends in ".gz".
> For files, if the file name ends in ".gz", it is assumed to be gzipped.
> I have tested the patch with 4.10.4, 6.6.0 and master from git.



