Could you clarify how exactly your friend transfers data between Spark and PostgreSQL? My understanding is that data is typically exported from Spark as a file, usually in a printable text format, so binary data needs some sort of encoding. I am not familiar with the HLL extension for PostgreSQL that you are referring to. From a quick look at their test CSV files, they seem to encode binary data as hexadecimal (\xHHHH...). Each byte is then encoded as two characters, which doubles the size. Base64 is a more efficient way of encoding binary data as printable text: it turns every 3 bytes into 4 characters (as opposed to 1 byte into 2 characters for hex).
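To make the size difference concrete, here is a minimal Python sketch. It assumes a 3000-byte random buffer as a stand-in for a serialized sketch (the size and the variable names are made up for illustration, not taken from either library) and compares the hex form with the \x prefix against base64.

```python
import base64
import os

# Hypothetical stand-in for a serialized sketch: 3000 random bytes.
raw = os.urandom(3000)

# Hex text form: "\x" prefix followed by two characters per byte.
hex_text = "\\x" + raw.hex()

# Base64: four characters for every three bytes of input.
b64_text = base64.b64encode(raw).decode("ascii")

print(len(raw), len(hex_text), len(b64_text))  # prints: 3000 6002 4000

# Both encodings decode back to the original bytes.
assert bytes.fromhex(hex_text[2:]) == raw
assert base64.b64decode(b64_text) == raw
```

For this input, hex needs 6002 characters and base64 needs 4000, which matches the 1-to-2 and 3-to-4 ratios.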
All these approaches are debatable, of course. In our experience, base64 is used widely in such cases. For example, in our production systems at Verizon Media sketches are often prepared on Hadoop clusters using Pig or Hive, then exported in base64, and imported into Druid.

Our documentation certainly needs improvement. Thanks for bringing this to our attention. Let us know what your expectations and practices are with respect to importing and exporting data. We will see if any changes are needed on our side.

On Mon, Jul 6, 2020 at 2:48 AM Evans Ye <[email protected]> wrote:

> Hi team,
>
> This is a newbie question.
> One of my friend in Taiwan is using Spark to write DataSketches to
> Postgres. When it comes to estimation he got the data corruption error, and
> then realize that the summary written in Postgres should be base64 encoded
> to comply with the format.
>
> https://github.com/apache/incubator-datasketches-postgresql/blob/3b553ef4dc7d2c988c41ab56695c5b082d3ce308/src/common.c#L37-L60
>
> He found the other Postgres implementation of HLL does not do base64
> though[1].
>
> I just want to learn that what are the considerations for doing base64? Is
> it a convention that should be easy to inference or we should document it?
>
> Evans
>
> [1]
> https://github.com/citusdata/postgresql-hll?fbclid=IwAR3GP2xgdCOsESuKRsqU4mJ7oeE7p-CPGrgeVUODRwVVShiOGBETfz5A4T8
