Cool, thanks for sharing the sample!

Aldrin Montana
Computer Science PhD Student
UC Santa Cruz
On Fri, May 6, 2022 at 1:31 PM Howard Engelhart <[email protected]> wrote:

> Thanks Aldrin and Weston! Following your suggestions I was able to encode
> the schema such that Athena recognized it. In case it helps anyone,
> here's some sample code:
>
> import { Schema, Field, Utf8, Table, RecordBatchStreamWriter, Int32, Bool,
>          DateMillisecond, DateDay } from 'apache-arrow';
>
> const s = new Schema([
>   new Field('name', new Utf8),
>   new Field('address', new Utf8),
>   new Field('active', new Bool),
>   new Field('count', new Int32),
>   new Field('birthday', new DateDay),
>   new Field('created', new DateMillisecond)
> ]);
> const w = new RecordBatchStreamWriter();
> w.write(new Table(s));
> const encodedSchema = Buffer.from(w.toUint8Array(true)).toString('base64');
>
> On Fri, May 6, 2022 at 3:53 PM Aldrin <[email protected]> wrote:
>
>> I didn't think of this as a possible solution, for some reason, but I
>> think it actually makes a lot of sense. Just as a reference, this is
>> something I currently do when storing data in a key-value interface:
>>
>> - I write a buffer with no batches.
>> - I write batches in separate buffers, sized to fully utilize the space
>>   for each key-value pair.
>>
>> It is then possible to read the key-value entry that contains only a
>> schema.
>>
>> My approach for doing this can be seen in [1]; I use the StreamWriter
>> because I want an in-memory format that is streamable.
>>
>> [1]: https://gitlab.com/skyhookdm/skytether-singlecell/-/blob/mainline/src/cpp/processing/dataformats.cpp#L16
>>
>> Aldrin Montana
>> Computer Science PhD Student
>> UC Santa Cruz
>>
>> On Fri, May 6, 2022 at 12:04 PM Weston Pace <[email protected]> wrote:
>>
>>> Can you serialize the schema by creating an IPC file with zero record
>>> batches? I apologize, but I do not know the JS API as well. Maybe
>>> you can create a table from just a schema (or a schema and a set of
>>> empty arrays) and then turn that into an IPC file?
>>> This shouldn't add too much overhead.
>>>
>>> On Thu, May 5, 2022 at 8:23 AM Howard Engelhart
>>> <[email protected]> wrote:
>>> >
>>> > I'm looking to implement an Athena federated query custom connector
>>> > using the arrow js lib. I'm getting stuck on figuring out how to
>>> > encode a Schema properly for the Athena GetTableResponse. I have
>>> > found an example using Python that does something like this
>>> > (paraphrasing):
>>> >
>>> > import pyarrow as pa
>>> > .....
>>> > return {
>>> >     "@type": "GetTableResponse",
>>> >     "catalogName": self.catalogName,
>>> >     "tableName": {"schemaName": self.databaseName,
>>> >                   "tableName": self.tableName},
>>> >     "schema": {"schema": base64.b64encode(
>>> >         pa.schema(....args...).serialize().slice(4)).decode("utf-8")},
>>> >     "partitionColumns": self.partitions,
>>> >     "requestType": self.request_type
>>> > }
>>> >
>>> > What I'm looking for is the js equivalent of
>>> >
>>> > pa.schema(....args...).serialize()
>>> >
>>> > Is there one? If not, could someone point me in the right direction
>>> > on how to code up something similar?
