Hello,

I made a utility program to dump the contents of a PostgreSQL database in Apache Arrow format.
Apache Arrow is a columnar data format for structured data, actively developed by the Spark community and others. It is a suitable representation for static, read-only data with a large number of rows, and many data analytics tools support Apache Arrow as a common data exchange format. See:
  https://arrow.apache.org/

* pg2arrow
  https://github.com/heterodb/pg2arrow

Usage:
  $ ./pg2arrow -h localhost postgres -c 'SELECT * FROM hogehoge LIMIT 10000' -o /tmp/hogehoge.arrow
    --> fetches the results of the query, then writes them out to "/tmp/hogehoge.arrow"

  $ ./pg2arrow --dump /tmp/hogehoge.arrow
    --> shows the schema definition of "/tmp/hogehoge.arrow"

  $ python
  >>> import pyarrow as pa
  >>> X = pa.RecordBatchFileReader("/tmp/hogehoge.arrow").read_all()
  >>> X.schema
  id: int32
  a: int64
  b: double
  c: struct<x: int32, y: double, z: decimal(30, 11), memo: string>
    child 0, x: int32
    child 1, y: double
    child 2, z: decimal(30, 11)
    child 3, memo: string
  d: string
  e: double
  ymd: date32[day]
    --> reads the Apache Arrow file using PyArrow, then shows its schema definition

It is also the groundwork for my current development, arrow_fdw, which allows scanning the configured Apache Arrow file(s) as if they were a regular PostgreSQL table. I expect the integration of arrow_fdw with the SSD-to-GPU Direct SQL feature of PG-Strom will pull out the maximum capability of the latest hardware (NVME and GPU). It is likely an ideal configuration for processing log data generated by many sensors.

Please check it out. Comments, ideas, bug reports, and other feedback are welcome.

As an aside, NVIDIA announced their RAPIDS framework to exchange data frames on GPU among multiple ML/analytics solutions. It also uses Apache Arrow as the common format for data exchange, so this work is also groundwork for that.
  https://www.nvidia.com/en-us/deep-learning-ai/solutions/data-science/

Thanks,
--
HeteroDB, Inc / The PG-Strom Project
KaiGai Kohei <kai...@heterodb.com>