https://issues.apache.org/jira/browse/SOLR-10998
https://issues.apache.org/jira/browse/SOLR-10999
--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

> On 30 Jun 2017, at 23:57, Uwe Schindler <u...@thetaphi.de> wrote:
>
> Hi Jan,
>
>> Inspired by SOLR-10981 "Allow update to load gzip files", where the proposal
>> is to obey the Content-Encoding HTTP request header to update a compressed
>> stream, I started looking at other headers to do things in more
>> industry-standard ways.
>>
>> Accept:
>>
>> Advertises which content types, expressed as MIME types, the client is able
>> to understand
>> https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Accept
>>
>> Could replace or at least be an alternative to "wt". Examples:
>> Accept: application/xml
>> Accept: text/csv
>>
>> Issue: Most browsers send a long Accept header, typically
>> application/xml,text/html,*/*, and now that JSON is the default for Solr,
>> we'd need to serve JSON if the Accept header includes "*/*"
>
> That's known under the term "Content Negotiation". I come from the
> scientific / library publisher world... We use that every day!
>
> So the problem you describe is well known and must be solved by some
> algorithm that takes care of browsers. Entries in an "Accept" header can
> carry an additional score after the media type. When parsing the Accept
> header, you split on commas and then parse each item and look for the score
> (in parameter "q"). In addition, browsers generally send some special MIME
> types that clearly identify them as browsers 😊
>
> One example from the scientific publishing world, where access to digital
> object identifiers has been standardized on the "Content-Negotiation"
> mechanism for approx. 10 years, is this one:
>
> Accept: application/rdf+xml;charset=ISO-8859-1;q=0.5,
>   application/vnd.citationstyles.csl+json;q=1.0, */*;q=0.1
>
> This tells the web server that you would like to get the citation of the DOI
> as citeproc-json, but would alternatively take it as RDF. The */* is just
> there because, as a last resort, you would also accept anything else (like
> HTML).
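
For illustration, the parsing described above (split on commas, then look for the "q" parameter of each item) could be sketched roughly as follows. This is only a minimal sketch in Python and not Solr or PANGAEA code: the names parse_accept and negotiate are hypothetical, a missing q is assumed to mean 1.0, an unparsable q is assumed to mean "not acceptable", entries are ranked by q with list order breaking ties, and "*/*" is assumed to map to the server's first supported type. Browser detection is deliberately left out.

```python
def parse_accept(header):
    """Return (media_type, q) pairs from an Accept header, best match first."""
    entries = []
    for position, item in enumerate(header.split(",")):
        parts = [p.strip() for p in item.split(";")]
        media_type = parts[0].lower()
        if not media_type:
            continue  # skip empty items, e.g. from a trailing comma
        q = 1.0  # assumption: no q parameter means full preference
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value.strip())
                except ValueError:
                    q = 0.0  # assumption: unparsable q means "not acceptable"
        entries.append((media_type, q, position))
    # Highest q first; equal scores keep their order in the header.
    entries.sort(key=lambda e: (-e[1], e[2]))
    return [(media_type, q) for media_type, q, _ in entries]


def negotiate(header, supported):
    """Pick a type from `supported` (server preference order), or None for 406."""
    if not header or not header.strip():
        return supported[0]  # no Accept header: serve the server default
    for media_type, q in parse_accept(header):
        if q <= 0:
            continue  # q=0 marks a type the client refuses
        if media_type == "*/*":
            return supported[0]
        if media_type.endswith("/*"):
            prefix = media_type[:-1]  # "text/*" -> "text/"
            for candidate in supported:
                if candidate.startswith(prefix):
                    return candidate
        elif media_type in supported:
            return media_type
    return None  # nothing matched: the caller would answer 406 Not Acceptable


if __name__ == "__main__":
    header = ("application/rdf+xml;charset=ISO-8859-1;q=0.5, "
              "application/vnd.citationstyles.csl+json;q=1.0, */*;q=0.1")
    print(parse_accept(header))
    print(negotiate(header, ["application/json", "application/xml"]))
```

With a supported list that starts with application/json, a header that only matches through "*/*" ends up with JSON, which is the behaviour asked for in the original proposal.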
>
> So the order and scores are important. First sort by "q" score, highest
> first, and if scores are equal, keep the order in the list. First wins.
>
> The algorithm is used in the library/scientific publishing world and is well
> understood. E.g. see this DOI (Digital Object Identifier) and its URL to
> the landing page (I work for PANGAEA, too...):
>
> https://doi.pangaea.de/10.1594/PANGAEA.867475
>
> By default it shows the landing page if visited by a browser, but if you
> want to have the metadata in JSON-LD format, do:
>
> Uwe Schindler@VEGA:~ > curl -H 'Accept: application/ld+json' 'https://doi.pangaea.de/10.1594/PANGAEA.867475'
> {"@context":"http://schema.org/","@type":"Dataset","identifier":"doi:10.1594/PANGAEA.867475","url":"https://doi.pangaea.de/10.1594/PANGAEA.867475","name":"Response of Arctic benthic bacterial deep-sea communities to different detritus composition during an ex-situ high pressure experiment","creator":[{"@type":"Person","name":"Hoffmann, Katy...]}
>
> If you want to download the data behind the URL (or, if it were a
> scientific paper, the PDF):
>
> curl -H 'Accept: text/tab-separated-values, */*;q=.5' 'https://doi.pangaea.de/10.1594/PANGAEA.867475'
>
> Here I also added the */* with a lower score. As the server is able to give
> you text/tab-separated-values, it returns that by preference.
>
> If your client accepts BibTeX citations or EndNote (RIS) ones you can send a
> header, too. So you can fetch the citation of an item in a machine-readable
> format the same way - and you can ask the server for any variant -
> standardized across all scientific publishers!
> Which one you got back is in the response's Content-Type 😊
>
> If the server cannot satisfy any of your Accepts, it will send an HTTP error
> 406:
>
> Uwe Schindler@VEGA:~ > curl -I -H 'Accept: foo/bar' 'https://doi.pangaea.de/10.1594/PANGAEA.867475'
> HTTP/1.1 406 Not Acceptable
> Server: PANGAEA/1.0
> Date: Fri, 30 Jun 2017 21:49:31 GMT
> X-robots-tag: noindex,nofollow,noarchive
> Content-length: 139
> Content-type: text/html
> X-ua-compatible: IE=Edge
> X-content-type-options: nosniff
> Strict-transport-security: max-age=31536000
>
> The IDF / CrossRef / DataCite organizations (including PANGAEA...) have good
> code that also parses the "Accept" header so that stupid browsers with many
> plugins (like Internet Explorer) don't kill you. So basically you look for
> specific media types and the catch-all Accept entry, and if it looks like a
> browser, kill it. E.g. Internet Explorer always sends application/xml with a
> high score.
>
> With this type of content negotiation, you can safely remove the wt=xxx
> param or make it optional.
>
> For compression, you normally do the same (the gzip filter in Jetty uses the
> same algorithm), although browsers behave well on compression and you can
> trust the header when the client sends it. The problem with sending data
> *to* Solr is that you don't know what the server accepts because you are
> sending data first...
>
>> Accept-Encoding:
>>
>> Advertises which content encoding, usually a compression algorithm, the
>> client is able to understand
>> https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Accept-Encoding
>
> That's usual practice, every stupid browser on earth does it by default. To
> enable it in Solr, just add the Gzip filter to Jetty. For sending data TO
> Solr it's not so easy, see above.
>
>> Could enable compression of large search results.
>> SOLR-856 suggests that this is implemented, but it does not work. Seems it
>> is only implemented for replication. I'd expect this to be useful for
>> large /export or /stream requests. Example:
>> Accept-Encoding: gzip
>>
>> What do you think?
>
> Strong +1
>
>> --
>> Jan Høydahl, search solution architect
>> Cominvent AS - www.cominvent.com
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org