Hi Jacques,

I agree with you. It's worth distinguishing between two different forms of ORC:
* The raw binary files
* The "dataset format" that is maintained by the Hive libraries. For
  example, I don't think there is any practical way for us to handle
  this in C++.

It may be that without the 2nd bullet point not many use cases are enabled.

On Fri, Dec 13, 2019 at 1:17 PM Jacques Nadeau <jacq...@apache.org> wrote:
>
> To clarify, I don't really question the value. That was the wrong word. I
> question the benefit/value tradeoff. You've got two options, it seems:
>
> - Support Orc without ACID (solves a much smaller set of use cases/users)
> - Support Orc with ACID (an order of magnitude more implementation work)
>
> On Fri, Dec 13, 2019 at 11:15 AM Jacques Nadeau <jacq...@apache.org> wrote:
>
> > I question the value of adding the Orc format. The format is fragmented,
> > with the main tool writing it (Hive) writing a version of the format
> > (ACID v2) that can't be consumed by systems that only use the Orc
> > libraries (since they don't support ACID). If you want to consume that
> > data, you have to depend on internal Hive code (which is only written
> > in Java).
> >
> > On Thu, Dec 12, 2019 at 2:49 PM Wes McKinney <wesmck...@gmail.com> wrote:
> >
> >> FWIW, the incremental effort of adding new data formats to the C++
> >> Datasets API should be relatively low. I think we should even document
> >> in broad terms how users can define their own data sources or file
> >> formats.
> >>
> >> On Wed, Dec 11, 2019 at 4:19 PM Neal Richardson
> >> <neal.p.richard...@gmail.com> wrote:
> >> >
> >> > Hi William,
> >> > ORC is part of the C++ Datasets grand vision: see
> >> > <https://docs.google.com/document/d/1bVhzifD38qDypnSjtf8exvpP3sSB5x_Kw9m-n66FB2c/edit#heading=h.22aikbvt54fv>.
> >> > That said, I don't think anyone in the Arrow community is currently
> >> > prioritizing work on ORC, and we'd welcome contributions in that area.
> >> >
> >> > For a view of what open issues we have for ORC (at least for C++), see
> >> > <https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20resolution%20%3D%20Unresolved%20AND%20component%20in%20(%22C%2B%2B%22%2C%20%22C%2B%2B%20-%20Dataset%22)%20AND%20text%20~%20ORC%20ORDER%20BY%20updated%20DESC%2C%20priority%20DESC>,
> >> > though that's surely not an exhaustive list of ORC-related features one
> >> > could want.
> >> >
> >> > Neal
> >> >
> >> > On Wed, Dec 11, 2019 at 12:49 PM William Callaghan <wcal...@gmail.com>
> >> > wrote:
> >> >
> >> > > Hi there,
> >> > >
> >> > > Not sure if this is the appropriate place, but I had done some
> >> > > searching and could not find anything with regard to supporting ORC
> >> > > datasets. I see that Parquet datasets are supported (where a dataset
> >> > > could contain multiple Parquet files), but I do not see this for ORC
> >> > > (only the ability to read a single ORC file and not multiple, or
> >> > > nested ORCs, i.e. a directory with subdirectories (indices) with
> >> > > corresponding ORC files underneath).
> >> > >
> >> > > I'm wondering, does Arrow currently have support for nested ORC
> >> > > structures? If not, is this planned?
> >> > >
> >> > > Thank you.
> >> > > Regards,
> >> > > William