Could you clarify exactly how your friend transfers data between Spark and
PostgreSQL?
My understanding is that data is usually exported from Spark as a file in a
printable text format, so binary data needs some sort of encoding. I am not
familiar with the HLL extension for PostgreSQL you are referring to. From a
quick glance at their test CSV files, it seems that they encode binary data
as hexadecimal (\xHHHH...). Each byte is therefore encoded as two
characters, which doubles the size. Base64 is a more efficient way of
encoding binary data as printable text: it expands every 3 bytes to 4
characters (as opposed to 1 byte to 2 characters).
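
To make the size difference concrete, here is a minimal sketch in Python
using only the standard library. The 300-byte payload is a made-up stand-in
for a serialized sketch, not real DataSketches output, and the \x prefix
simply mimics the hex notation seen in the other extension's test files.

```python
import base64
import binascii

# Hypothetical stand-in for a serialized sketch: 300 arbitrary bytes.
payload = bytes(range(256)) + bytes(44)

# Hex text: two characters per byte, plus a leading "\x" marker.
hex_text = "\\x" + binascii.hexlify(payload).decode("ascii")

# Base64 text: four characters for every three bytes.
b64_text = base64.b64encode(payload).decode("ascii")

print(len(payload), len(hex_text), len(b64_text))  # prints: 300 602 400

# Decoding the base64 text recovers the original bytes exactly.
assert base64.b64decode(b64_text) == payload
```

For this payload, hex takes 602 characters and base64 takes 400, which
matches the 1-to-2 and 3-to-4 ratios.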

All these approaches are debatable, of course. In our experience, base64 is
widely used in such cases. For example, in our production systems at
Verizon Media, sketches are often prepared on Hadoop clusters using Pig or
Hive, then exported in base64 and imported into Druid.

Our documentation certainly needs improvement. Thanks for bringing this to
our attention.

Let us know what your expectations and practices are with respect to
importing and exporting data. We will see if any changes are needed on our
side.


On Mon, Jul 6, 2020 at 2:48 AM Evans Ye <[email protected]> wrote:

> Hi team,
>
> This is a newbie question.
> One of my friend in Taiwan is using Spark to write DataSketches to
> Postgres. When it comes to estimation he got the data corruption error, and
> then realize that the summary written in Postgres should be base64 encoded
> to comply with the format.
>
>
> https://github.com/apache/incubator-datasketches-postgresql/blob/3b553ef4dc7d2c988c41ab56695c5b082d3ce308/src/common.c#L37-L60
>
> He found the other Postgres implementation of HLL does not do base64
> though[1].
>
> I just want to learn that what are the considerations for doing base64? Is
> it a convention that should be easy to inference or we should document it?
>
> Evans
>
> [1]
> https://github.com/citusdata/postgresql-hll?fbclid=IwAR3GP2xgdCOsESuKRsqU4mJ7oeE7p-CPGrgeVUODRwVVShiOGBETfz5A4T8
>
>
>
