[ https://issues.apache.org/jira/browse/ARROW-4753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Antoine Pitrou updated ARROW-4753:
----------------------------------
    Fix Version/s:     (was: 4.0.0)
                   5.0.0

> [C++] Extension types and layouts for text-optimized data structures
> --------------------------------------------------------------------
>
>                 Key: ARROW-4753
>                 URL: https://issues.apache.org/jira/browse/ARROW-4753
>             Project: Apache Arrow
>          Issue Type: Wish
>          Components: C++, Format
>         Environment: C/C++
>            Reporter: Edmon Begoli
>            Priority: Minor
>              Labels: features
>             Fix For: 5.0.0
>
>
> Narrative (text), by default, is notoriously inefficient to store on disk
> or in memory. In its most basic form, it is a long sequence of bytes with no
> indexing or other optimized layout structure.
>
> There are data structures such as
> [tries|https://en.wikipedia.org/wiki/Trie],
> [DAFSAs|https://en.wikipedia.org/wiki/Deterministic_acyclic_finite_state_automaton],
> or [b-tries|https://dl.acm.org/citation.cfm?id=1541552] that support more
> efficient storage and lookup of phrases.
>
> We would like to enable Arrow to serialize from/to these efficient
> structures, acting as the format/carrier between high-performance text
> processing steps that operate on binary data structures (lookups, spellers,
> or more advanced NLP routines).
>
> So, it could be something like:
>
> text.to_arrow(infer=true|dafsa|trie|b-trie) : arrow
> // writes Arrow as the format for the specified encoding. This could be
> // implicit if we could store the encoding in some kind of manifest.
>
> arrow.to_text(infer=true|dafsa|trie|b-trie) : string
> // restores text from the Arrow format, and from a specified encoding,
> // same as above.
>
> On the dev mailing list we are discussing the creation of a contrib folder
> where such features could be optionally included for Arrow.
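
For illustration only, the round trip proposed in the issue (text into a trie-encoded, column-oriented form and back) might look roughly like the sketch below. It uses plain Python lists as stand-ins for Arrow arrays and does not call any Arrow API. The names `ColumnarTrie`, `to_columnar_trie`, and `from_columnar_trie`, as well as the three-column layout itself, are hypothetical and are not part of Arrow or of the issue.

```python
from typing import Dict, List, NamedTuple


class ColumnarTrie(NamedTuple):
    """Hypothetical flat layout: one entry per trie node; node 0 is the root."""
    label: List[str]      # character on the edge into the node ("" for the root)
    parent: List[int]     # index of the parent node (-1 for the root)
    terminal: List[bool]  # True if a stored word ends at this node


def to_columnar_trie(words: List[str]) -> ColumnarTrie:
    """Insert each word into a trie and record the nodes as parallel columns."""
    label, parent, terminal = [""], [-1], [False]
    # Child lookup tables are only needed while building; they are not serialized.
    children: List[Dict[str, int]] = [{}]
    for word in words:
        node = 0
        for ch in word:
            nxt = children[node].get(ch)
            if nxt is None:
                # First time this prefix is seen: append a new node to every column.
                nxt = len(label)
                label.append(ch)
                parent.append(node)
                terminal.append(False)
                children.append({})
                children[node][ch] = nxt
            node = nxt
        terminal[node] = True
    return ColumnarTrie(label, parent, terminal)


def from_columnar_trie(trie: ColumnarTrie) -> List[str]:
    """Restore the stored words by walking from each terminal node up to the root."""
    words = []
    for start, is_word in enumerate(trie.terminal):
        if not is_word:
            continue
        chars = []
        node = start
        while node != -1:
            chars.append(trie.label[node])
            node = trie.parent[node]
        words.append("".join(reversed(chars)))
    return sorted(words)
```

Because shared prefixes are stored once, the words "arrow", "array", and "art" (13 characters in total) need only 9 nodes in this layout, including the root. Whether such columns would be best expressed as an Arrow extension type or as a new layout is exactly the open question the issue raises.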