Edmon Begoli created ARROW-4753:
-----------------------------------
Summary: Support optionally, and as an extension, an encoding
layout for text-optimized data structures
Key: ARROW-4753
URL: https://issues.apache.org/jira/browse/ARROW-4753
Project: Apache Arrow
Issue Type: Wish
Environment: C/C++
Reporter: Edmon Begoli
Narrative text is, by default, notoriously inefficient to store on disk or in
memory. In its most basic form it is a long sequence of bytes with no indexing
or other optimized layout structure.
There are data structures such as [tries|https://en.wikipedia.org/wiki/Trie],
[DAFSAs|https://en.wikipedia.org/wiki/Deterministic_acyclic_finite_state_automaton]
or [b-tries|https://dl.acm.org/citation.cfm?id=1541552] that support more
efficient storage and lookup of phrases.
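To illustrate why such structures help, here is a minimal trie sketch in
Python. It is not part of the proposal: the class and method names are
hypothetical, and it only shows how shared prefixes are stored once and how a
lookup walks one node per character.

```python
class TrieNode:
    def __init__(self):
        self.children = {}     # maps a character to the child TrieNode
        self.terminal = False  # True if a stored word ends at this node


class Trie:
    """Illustrative trie; names are assumptions, not an Arrow API."""

    def __init__(self):
        self.root = TrieNode()
        self.node_count = 1    # the root counts as one node

    def insert(self, word):
        node = self.root
        for ch in word:
            if ch not in node.children:
                node.children[ch] = TrieNode()
                self.node_count += 1
            node = node.children[ch]
        node.terminal = True

    def contains(self, word):
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.terminal


trie = Trie()
for w in ["arrow", "array", "arrays"]:
    trie.insert(w)
```

The three words total 16 characters, but the trie holds them in 9 nodes
(including the root) because the prefix "arr" is shared. A lookup costs one
step per character of the query, independent of how many words are stored.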
We would like to enable Arrow to serialize to and from these efficient
structures, so that it can act as the format/carrier between high-performance
text processing steps that prefer to operate on binary data structures
(lookups, spellers, or more advanced NLP routines).
So it could be something like:
text.to_arrow(infer=true|dafsa|trie|b-trie) : arrow
// writes Arrow as the format for the specified encoding. This could be
implicit if we could store the encoding in some kind of manifest.
arrow.to_text(infer=true|dafsa|trie|b-trie) : string
// restores text from the Arrow format, and from a specified encoding, same as
above.
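A rough sketch of what such a round trip might look like, assuming the trie
encoding is flattened into parallel columns. This is an assumption for
illustration only: the function names to_columns and from_columns and the
three-column layout (parent, label, terminal) are hypothetical, and plain
Python lists stand in for Arrow arrays. It restores the set of stored words,
not the original running text or its order.

```python
def to_columns(words):
    """Flatten a trie of `words` into three parallel columns (hypothetical layout)."""
    parent = [-1]       # index of the parent node; -1 marks the root
    label = [""]        # character on the edge from the parent
    terminal = [False]  # True if a word ends at this node
    children = [{}]     # build-time index only, not part of the output
    for word in words:
        node = 0
        for ch in word:
            nxt = children[node].get(ch)
            if nxt is None:
                nxt = len(parent)
                children[node][ch] = nxt
                parent.append(node)
                label.append(ch)
                terminal.append(False)
                children.append({})
            node = nxt
        terminal[node] = True
    return {"parent": parent, "label": label, "terminal": terminal}


def from_columns(cols):
    """Restore the sorted word list using only the columns."""
    words = []
    for i, is_end in enumerate(cols["terminal"]):
        if not is_end:
            continue
        chars = []
        node = i
        while node != 0:  # walk parent links back to the root
            chars.append(cols["label"][node])
            node = cols["parent"][node]
        words.append("".join(reversed(chars)))
    return sorted(words)


cols = to_columns(["arrow", "array", "arrays"])
restored = from_columns(cols)
```

Each column has one entry per trie node, which is the kind of fixed-width,
columnar shape that could map onto Arrow arrays. The choice of encoding
(trie here) would still need to be recorded somewhere, such as the manifest
mentioned above.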
On the dev mailing list we are discussing the creation of a contrib folder
where such features could be optionally included for Arrow.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)