Re: New Codec for ORC

2022-05-20 Thread Dongjoon Hyun
+1 for Owen's advice.

BTW, Rust ORC reader & writer sounds like a great idea.

Dongjoon.


On Thu, May 19, 2022 at 10:28 PM Owen O'Malley wrote:

> This is interesting, but it sounds like it corresponds more to the rle
> encoding that we do rather than the generic compression code.
>
> Has anyone done a Java version of the library? It is faster to iterate on
> this kind of design in Java. On the other hand, I’ve heard that someone is
> thinking about doing a Rust ORC reader & writer, but it isn’t done yet. 😊
>
> .. Owen
>
> > On May 19, 2022, at 17:16, Dongjoon Hyun wrote:
> >
> > Thank you for sharing, Martin.
> >
> > For codec one, you can take advantage of our benchmark suite
> > (NONE/ZLIB/SNAPPY/ZSTD) to show the benefits in ORC format.
> >
> >
> > https://github.com/apache/orc/blob/main/java/bench/core/src/java/org/apache/orc/bench/core/CompressionKind.java#L34-L37
> >
> > For the Spark connector one, I'd like to recommend you to send dev@spark
> > too. You will get attention in both parts (codec and connector).
> >
> > Then, I'm looking forward to seeing your benchmark result.
> >
> > Dongjoon.
> >
> >
> >> On Thu, May 19, 2022 at 4:12 PM Martin Loncaric wrote:
> >>
> >> I've developed a stable codec for numerical columns called Quantile
> >> Compression.
> >> It has about 30% higher compression ratio than even Zstd for similar
> >> compression and decompression time. It achieves this by tailoring to the
> >> data type (floats, ints, timestamps, bools).
> >>
> >> I'm using it in my own projects, and a few others have adopted it, but it
> >> would also be perfect for ORC columns. Assuming a 50-50 split between
> >> text-like and numerical data, it could reduce the average ORC file size by
> >> over 10% with no extra compute cost. Incorporating it into ORC would be
> >> quite powerful since the codec by itself only works on a single flat column
> >> of non-nullable numbers.
> >>
> >> Would the ORC community be interested in this? How can we make this
> >> available to users? I've already built a Spark connector for a project
> >> using this codec and gotten fast query times.
> >>
> >> Thanks,
> >> Martin
> >>
>


Re: New Codec for ORC

2022-05-20 Thread Ian Joiner
There is already a Rust ORC reader:
https://rustrepo.com/repo/travisbrown-orcrs
We still need a writer though. If I have 6 months to do so I can write one.
Then I can also integrate it into Arrow Rust.

Ian



Re: New Codec for ORC

2022-05-20 Thread Dongjoon Hyun
Thank you for sharing it, Ian.

Dongjoon.




RE: Re: New Codec for ORC

2022-05-20 Thread Martin Loncaric
Yes, I agree this would be better as an RLE encoding alternative (though it
isn't really RLE). Is there any ORC flag to choose between RLE
implementations, or is there only one available per specification? Is there
a benchmark suite for these encodings as well?
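
To make the distinction concrete, here is a minimal sketch of the two layers being discussed: a type-aware column encoding first, then a generic byte-level codec over the encoded bytes. This is a toy run-length encoder, not ORC's actual RLE format, and the names `rle_encode`, `rle_decode`, and `to_bytes` are hypothetical; zlib simply stands in for the generic compression step.

```python
import struct
import zlib
from itertools import groupby


def rle_encode(values):
    """Toy run-length encoding: collapse each run into a (value, run_length) pair."""
    return [(v, len(list(run))) for v, run in groupby(values)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original list."""
    return [v for v, n in pairs for _ in range(n)]


def to_bytes(pairs):
    """Illustrative fixed-width layout: 8-byte signed value, 4-byte unsigned count."""
    return b"".join(struct.pack("<qI", v, n) for v, n in pairs)


# Layer 1: column-level encoding that knows it is looking at integers.
column = [7] * 1000 + [8] * 1000 + [9] * 24
encoded = rle_encode(column)
raw = to_bytes(encoded)

# Layer 2: generic compression that only ever sees bytes.
compressed = zlib.compress(raw)

assert rle_decode(encoded) == column
assert zlib.decompress(compressed) == raw
```

A numeric codec of the kind proposed would, under this framing, replace layer 1 for the columns it supports rather than compete with layer 2.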

I'm working on a simple Java library right now (with a subset of
functionality via JNI), should have it out in a couple of days. You can try
it out then.
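
A back-of-the-envelope check of the "over 10%" figure from the original proposal, as a sketch only. It assumes the 50-50 split is measured on already-compressed bytes and that a 30% higher compression ratio shrinks the numeric half by a factor of 1/1.3; `file_size_reduction` is a hypothetical helper, not part of any library.

```python
def file_size_reduction(numeric_fraction, ratio_gain):
    """Fractional reduction in total file size.

    numeric_fraction: share of the current file taken by numeric columns (assumed).
    ratio_gain: relative improvement in compression ratio, e.g. 0.30 for 30%.
    """
    # A ratio that is (1 + gain) times higher means the same data takes
    # 1 / (1 + gain) of its former size.
    new_numeric = numeric_fraction / (1.0 + ratio_gain)
    other = 1.0 - numeric_fraction  # text-like data is left unchanged
    return 1.0 - (other + new_numeric)


reduction = file_size_reduction(0.5, 0.30)
print(f"{reduction:.1%}")  # prints 11.5%
```

The estimate is sensitive to the assumed split: with a smaller numeric share, the overall reduction shrinks proportionally.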



Re: New Codec for ORC

2022-05-20 Thread Dongjoon Hyun
BTW, the license of that Rust ORC reader looks unusual to me.

`ANTI-CAPITALIST SOFTWARE LICENSE (v 1.4)`

I guess we need to skip that repository.

Dongjoon




Re: New Codec for ORC

2022-05-20 Thread Ian Joiner
Uh. I didn’t realize that. Give me 6 months and I can provide both the
reader and the writer.

Ian



Re: New Codec for ORC

2022-05-20 Thread Dongjoon Hyun
Hi, Ian.

It was not a request to, or criticism of, that repo (or you) at all. There is
no time limit on anyone's contributions.

This is the Apache ORC dev mailing list, which discusses Apache ORC related
matters, especially those on

- https://orc.apache.org (The Apache ORC website)
- https://github.com/apache/orc (Commits / GitHub Issues / PRs)
- https://issues.apache.org (Apache JIRA Issues)
- Some other ASF resources (other ASF projects repo and mailing lists)

In general, we are supposed to focus on what has already been contributed
through ASF channels.
That is a little different from users, who have broader options in choosing
what they use.

Best,
Dongjoon.




Re: New Codec for ORC

2022-05-20 Thread Ian Joiner
Hi Dongjoon,

Haha, I understand. As the guy who wrote the ORC write adapter in Arrow and
wants to understand both Rust and the internals of big data formats better, I’d
love to help out.

I will file a self-assigned issue then.

Ian


[jira] [Created] (ORC-1180) Implement an ORC writer in Rust

2022-05-20 Thread Ian Alexander Joiner (Jira)
Ian Alexander Joiner created ORC-1180:
--------------------------------------

    Summary: Implement an ORC writer in Rust
        Key: ORC-1180
        URL: https://issues.apache.org/jira/browse/ORC-1180
    Project: ORC
 Issue Type: New Feature
   Reporter: Ian Alexander Joiner
   Assignee: Ian Alexander Joiner






--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ORC-1181) Implement an ORC reader in Rust

2022-05-20 Thread Ian Alexander Joiner (Jira)
Ian Alexander Joiner created ORC-1181:
--------------------------------------

    Summary: Implement an ORC reader in Rust
        Key: ORC-1181
        URL: https://issues.apache.org/jira/browse/ORC-1181
    Project: ORC
 Issue Type: New Feature
   Reporter: Ian Alexander Joiner
   Assignee: Ian Alexander Joiner





