Re: Planned Support for ORC Dataset?

Jacques Nadeau Fri, 13 Dec 2019 11:17:48 -0800

To clarify, I don't really question the value. That was the wrong word. I
question the cost/value tradeoff. You've got two options, it seems:


- Support ORC without ACID (solves a much smaller set of use cases/users)
- Support ORC with ACID (an order of magnitude more implementation work)

On Fri, Dec 13, 2019 at 11:15 AM Jacques Nadeau <[email protected]> wrote:

> I question the value of adding the ORC format. The format is fragmented:
> the main tool writing it (Hive) writes a version of the format (ACID
> v2) that can't be consumed by systems that only use the ORC libraries
> (since they don't support ACID). If you want to consume that data, you have
> to depend on internal Hive code (which is only written in Java).
>
> On Thu, Dec 12, 2019 at 2:49 PM Wes McKinney <[email protected]> wrote:
>
>> FWIW, the incremental effort of adding new data formats to the C++
>> Datasets API should be relatively low. I think we should even document
>> in broad terms how users can define their own data sources or file
>> formats.
>>
>> On Wed, Dec 11, 2019 at 4:19 PM Neal Richardson
>> <[email protected]> wrote:
>> >
>> > Hi William,
>> > ORC is part of the C++ Datasets grand vision; see:
>> > https://docs.google.com/document/d/1bVhzifD38qDypnSjtf8exvpP3sSB5x_Kw9m-n66FB2c/edit#heading=h.22aikbvt54fv
>> > That said, I don't think anyone in the Arrow community is currently
>> > prioritizing work on ORC, and we'd welcome contributions in that area.
>> >
>> > For a view of what open issues we have for ORC (at least for C++), see:
>> > https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20resolution%20%3D%20Unresolved%20AND%20component%20in%20(%22C%2B%2B%22%2C%20%22C%2B%2B%20-%20Dataset%22)%20AND%20text%20~%20ORC%20ORDER%20BY%20updated%20DESC%2C%20priority%20DESC
>> > That is surely not an exhaustive list of ORC-related features one
>> > could want, though.
>> >
>> > Neal
>> >
>> > On Wed, Dec 11, 2019 at 12:49 PM William Callaghan <[email protected]>
>> > wrote:
>> >
>> > > Hi there,
>> > >
>> > > Not sure if this is the appropriate place, but I did some searching
>> > > and could not find anything regarding support for ORC datasets. I see
>> > > that Parquet datasets are supported (where a dataset can contain multiple
>> > > Parquet files), but I do not see this for ORC (only the ability to read a
>> > > single ORC file, not multiple or nested ORC files, i.e. a directory with
>> > > subdirectories (indices) with corresponding ORC files underneath).
>> > >
>> > > I'm wondering, does Arrow currently have support for nested ORC structures?
>> > > If not, is this planned?
>> > >
>> > > Thank you.
>> > > Regards,
>> > > William
>> > >
>>
>
