Re: Solr and HTTP headers

Jan Høydahl Mon, 03 Jul 2017 08:01:10 -0700

https://issues.apache.org/jira/browse/SOLR-10998
https://issues.apache.org/jira/browse/SOLR-10999


--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

> On 30 Jun 2017, at 23:57, Uwe Schindler <u...@thetaphi.de> wrote:
> 
> Hi Jan,
>> Inspired by SOLR-10981 "Allow update to load gzip files", where the proposal
>> is to obey the Content-Encoding HTTP request header to update a compressed
>> stream, I started looking at other headers to do things in more
>> industry-standard ways.
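A minimal sketch of what obeying Content-Encoding on an incoming request body could look like. The function name decode_request_body is hypothetical and this is not Solr code; it only assumes the body has already been read into memory as bytes.

```python
import gzip


def decode_request_body(body: bytes, content_encoding: str = "") -> bytes:
    """Undo the content coding named in a request's Content-Encoding header."""
    encoding = content_encoding.strip().lower()
    if encoding in ("", "identity"):
        return body  # nothing to undo
    if encoding in ("gzip", "x-gzip"):
        return gzip.decompress(body)
    # Assumption: any other coding is rejected rather than guessed at.
    raise ValueError(f"unsupported Content-Encoding: {content_encoding!r}")


compressed = gzip.compress(b"<add><doc/></add>")
print(decode_request_body(compressed, "gzip"))
```

A caller would probably map the ValueError to an HTTP error such as 415 Unsupported Media Type, but that choice is outside this sketch.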
>> 
>> Accept:
>> 
>>  Advertises which content types, expressed as MIME types, the client is able
>> to understand
>>  https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Accept
>> 
>>  Could replace or at least be an alternative to “wt”. Examples:
>>  Accept: application/xml
>>  Accept: text/csv
>> 
>>  Issue: Most browsers send a long Accept header, typically
>>  application/xml,text/html,*/*, and now that JSON is the default for Solr,
>>  we’d need to serve JSON if the Accept header includes "*/*".
> 
> That's known under the term "Content Negotiation". I come from the scientific / 
> library publisher world... We use that every day!
> 
> So the problem you describe is well known and must be solved by an algorithm 
> that takes care of browsers. Each media type in an "Accept" header can carry 
> a score. When parsing the header, you split on commas, then parse each item 
> and look for the score (in the parameter "q"). In addition, browsers 
> generally send some special MIME types that clearly identify them as 
> browsers 😊
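The parsing step described above (split on commas, then look for the "q" parameter) could be sketched roughly as follows. The function name parse_accept is made up for illustration, and treating a malformed q as 0 is an assumption of this sketch, not something the thread specifies.

```python
from typing import List, Tuple


def parse_accept(header: str) -> List[Tuple[str, float]]:
    """Split an Accept header into (media_type, q) pairs, in header order."""
    entries = []
    for item in header.split(","):
        parts = [p.strip() for p in item.split(";")]
        media_type = parts[0].lower()
        if not media_type:
            continue  # skip empty items such as a trailing comma
        q = 1.0  # a missing q parameter means full preference
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value.strip())
                except ValueError:
                    q = 0.0  # assumption: treat a malformed q as "not wanted"
        entries.append((media_type, q))
    return entries


print(parse_accept("text/csv;q=0.8, application/json, */*;q=0.1"))
```

Other parameters such as charset are simply ignored here; only the media type and its score are kept.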
> 
> One example from the scientific publishing world, where access to digital 
> object identifiers has been standardized on the "Content-Negotiation" 
> mechanism for approximately 10 years, is this one:
> 
> Accept: application/rdf+xml;charset=ISO-8859-1;q=0.5, 
> application/vnd.citationstyles.csl+json;q=1.0, */*;q=0.1
> 
> This tells the webserver that you would like to get the citation of the DOI 
> as citeproc-json, but alternatively take it as RDF. The */* is just there 
> because you would as a last chance also accept anything else (like HTML).
> 
> So the order and scores are important. First sort by "q" score in descending 
> order; if scores are equal, keep the order in the list. The first match wins.
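The ordering rule above (highest "q" first, header order on ties, first match wins) might look like this as a sketch. All names here (parse_accept, matches, negotiate) are hypothetical. The code follows the rule as stated in this mail and does not implement the extra precedence that the HTTP specification gives to more specific media ranges.

```python
from typing import List, Optional, Tuple


def parse_accept(header: str) -> List[Tuple[str, float]]:
    """Return (media_type, q) pairs in header order; q defaults to 1.0."""
    entries = []
    for item in header.split(","):
        media_type, *params = [p.strip() for p in item.split(";")]
        if not media_type:
            continue
        q = 1.0
        for param in params:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # assumption: malformed q means "not wanted"
        entries.append((media_type.lower(), q))
    return entries


def matches(pattern: str, media_type: str) -> bool:
    """True if an Accept pattern such as */* or text/* covers media_type."""
    if pattern == "*/*":
        return True
    p_type, _, p_sub = pattern.partition("/")
    m_type, _, m_sub = media_type.partition("/")
    return p_type == m_type and p_sub in ("*", m_sub)


def negotiate(header: str, available: List[str]) -> Optional[str]:
    """Pick the server media type for the client's best-ranked Accept entry."""
    # sorted() is stable, so equal q values keep their order in the header.
    ranked = sorted(parse_accept(header), key=lambda e: e[1], reverse=True)
    for pattern, q in ranked:
        if q <= 0:
            continue  # q=0 means "explicitly not acceptable"
        for media_type in available:  # server's own preference order
            if matches(pattern, media_type):
                return media_type
    return None  # nothing acceptable; the caller decides what to do


header = ("application/rdf+xml;charset=ISO-8859-1;q=0.5, "
          "application/vnd.citationstyles.csl+json;q=1.0, */*;q=0.1")
print(negotiate(header, ["text/html", "application/rdf+xml"]))
```

With the example header from this mail and a server that only offers HTML and RDF, the sketch selects application/rdf+xml, because the citeproc-json entry has no match and */* ranks last.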
> 
> The algorithm is used in the library/scientific publishing world and is well 
> understood. E.g. see this DOI (Digital Object Identifier) and its URL to 
> the landing page (I work for PANGAEA, too...):
> 
> https://doi.pangaea.de/10.1594/PANGAEA.867475
> 
> By default it shows the landing page, if visited by a browser, but if you 
> want to have the metadata in JSON-LD format, do:
> 
> Uwe Schindler@VEGA:~ > curl -H 'Accept: application/ld+json' 'https://doi.pangaea.de/10.1594/PANGAEA.867475'
> {"@context":"http://schema.org/","@type":"Dataset","identifier":"doi:10.1594/PANGAEA.867475","url":"https://doi.pangaea.de/10.1594/PANGAEA.867475","name":"Response of Arctic 
> benthic bacterial deep-sea communities to different detritus composition 
> during an ex-situ high pressure 
> experiment","creator":[{"@type":"Person","name":"Hoffmann, Katy...]}
> 
> If you want to download the data behind the URL (or if it would be a 
> scientific paper - the PDF):
> 
> curl -H 'Accept: text/tab-separated-values, */*;q=.5' 'https://doi.pangaea.de/10.1594/PANGAEA.867475'
> 
> Here I also added the */* with a lower score. As the server is able to give 
> you text/tab-separated-values, it returns that by preference.
> 
> If your client accepts BIBTEX citations or Endnote (RIS) ones you can send a 
> header, too. So you can fetch the citation of an item in a machine readable 
> format the same way - and you can ask the server for any variant - 
> standardized across all scientific publishers! Which one you got back is in 
> the response's Content-Type 😊
> 
> If the server cannot satisfy any of your Accepts, it will send an HTTP error 
> 406:
> 
> Uwe Schindler@VEGA:~ > curl -I -H 'Accept: foo/bar' 'https://doi.pangaea.de/10.1594/PANGAEA.867475'
> HTTP/1.1 406 Not Acceptable
> Server: PANGAEA/1.0
> Date: Fri, 30 Jun 2017 21:49:31 GMT
> X-robots-tag: noindex,nofollow,noarchive
> Content-length: 139
> Content-type: text/html
> X-ua-compatible: IE=Edge
> X-content-type-options: nosniff
> Strict-transport-security: max-age=31536000
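To illustrate the 406 behaviour on the server side, here is a small WSGI sketch. It is not how PANGAEA or Solr implement it. The RENDERERS table and the names acceptable and app are invented for the example, and it only understands exact media types and */* (no text/* style ranges).

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical renderers keyed by media type; a real server would
# serialize real data here.
RENDERERS: Dict[str, Callable[[], bytes]] = {
    "application/json": lambda: b'{"status": "ok"}',
    "text/csv": lambda: b"status\nok\n",
}


def acceptable(header: str) -> List[str]:
    """Media ranges from the Accept header with q > 0, best q first."""
    ranges = []
    for item in header.split(","):
        name, *params = [p.strip() for p in item.split(";")]
        q = 1.0
        for param in params:
            key, _, value = param.partition("=")
            if key.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # assumption: malformed q means "not wanted"
        if name and q > 0:
            ranges.append((name.lower(), q))
    # Stable sort: entries with equal q keep their header order.
    return [name for name, _ in sorted(ranges, key=lambda r: r[1], reverse=True)]


def app(environ, start_response) -> Iterable[bytes]:
    """WSGI app: serve the best acceptable representation, else 406."""
    header = environ.get("HTTP_ACCEPT", "*/*")  # no header means "anything"
    for wanted in acceptable(header):
        for media_type, render in RENDERERS.items():
            if wanted in ("*/*", media_type):
                body = render()
                start_response("200 OK", [("Content-Type", media_type),
                                          ("Content-Length", str(len(body)))])
                return [body]
    body = b"No acceptable representation available.\n"
    start_response("406 Not Acceptable", [("Content-Type", "text/plain"),
                                          ("Content-Length", str(len(body)))])
    return [body]
```

The app can be served for a quick manual test with wsgiref.simple_server.make_server("", 8000, app) from the standard library and then queried with curl -H 'Accept: ...' as in the examples above.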
> 
> The IDF / CrossRef / DataCite organizations (including PANGAEA...) have good 
> code that also parses the "Accept" header so that stupid browsers with many 
> plugins (like Internet Explorer) don't kill you. So basically you look for 
> specific media types and the catch-all accept header, and if it looks like a 
> browser, kill it. E.g. Internet Explorer always sends application/xml with a 
> high score.
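A rough sketch of such a browser check. The thread does not list which media types mark a browser, so the BROWSER_MARKERS set below is purely an assumption for illustration, as are the function names and the choice of application/json as the fallback.

```python
# Assumption: illustrative marker types only; real deployments tune this list.
BROWSER_MARKERS = {"text/html", "application/xhtml+xml"}


def looks_like_browser(accept_header: str) -> bool:
    """Heuristic: a catch-all */* combined with an HTML-ish media type."""
    types = {item.split(";")[0].strip().lower()
             for item in accept_header.split(",")}
    return "*/*" in types and bool(types & BROWSER_MARKERS)


def effective_accept(accept_header: str, default: str = "application/json") -> str:
    """Ignore a browser's Accept header and fall back to the server default."""
    return default if looks_like_browser(accept_header) else accept_header


print(effective_accept("application/xml,text/html,*/*"))
print(effective_accept("text/tab-separated-values, */*;q=.5"))
```

The first call is treated as a browser and falls back to the default, while the second keeps the client's own header because it carries no browser marker.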
> 
> With this type of content negotiation, you can safely remove the wt=xxx param 
> or make it optional.
> 
> For compression, you normally do the same (the gzip filter in Jetty uses the 
> same algorithm), although browsers behave well on compression and you can 
> trust the header when the client sends it. The problem with sending data *to* 
> Solr is that you don't know what the server accepts because you are sending 
> data first...
> 
>> Accept-Encoding:
>> 
>>  Advertises which content encoding, usually a compression algorithm, the
>> client is able to understand
>>  https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Accept-Encoding
> 
> That's usual practice; every stupid browser on earth does it by default. To 
> enable it in Solr, just add the Gzip filter to Jetty. For sending data TO 
> Solr it's not so easy, see above.
> 
>>  Could enable compression of large search results. SOLR-856 suggests that
>>  this is implemented, but it does not work. Seems it is only implemented for
>>  replication. I’d expect this to be useful for large /export or /stream
>>  requests. Example:
>>  Accept-Encoding: gzip
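For the response side, a sketch of honouring Accept-Encoding: gzip. In Solr this would normally be left to Jetty's gzip support rather than hand-written, so the functions accepts_gzip and encode_body below are only an illustration of the idea.

```python
import gzip
from typing import Dict, Tuple


def accepts_gzip(accept_encoding: str) -> bool:
    """True if the Accept-Encoding header allows gzip with a non-zero q."""
    weights: Dict[str, float] = {}
    for item in accept_encoding.split(","):
        coding, *params = [p.strip() for p in item.split(";")]
        q = 1.0
        for param in params:
            key, _, value = param.partition("=")
            if key.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # assumption: malformed q means "not wanted"
        weights[coding.lower()] = q
    # An explicit gzip entry wins over the * wildcard.
    return weights.get("gzip", weights.get("*", 0.0)) > 0


def encode_body(body: bytes, accept_encoding: str) -> Tuple[bytes, Dict[str, str]]:
    """Return the (possibly compressed) body and the headers to add."""
    if accepts_gzip(accept_encoding):
        return gzip.compress(body), {"Content-Encoding": "gzip",
                                     "Vary": "Accept-Encoding"}
    return body, {}


payload, headers = encode_body(b"id,title\n" * 1000, "gzip")
print(len(payload), headers)
```

Repetitive payloads such as large /export results compress well, which is where this would be expected to help most.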
>> 
>> 
>> 
>> What do you think?
> 
> Strong +1
> 
>> --
>> Jan Høydahl, search solution architect
>> Cominvent AS - www.cominvent.com
>> 
>> 
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
