Is this expected behavior of the ORC ACID writers? If so, is it documented
somewhere?
-dain
Dain Sundstrom
Co-founder @ Presto Software Foundation, Co-creator of Presto
(https://prestosql.io)
> On Jun 14, 2019, at 6:17 PM, Owen O'Malley wrote:
>
> The Hive ACID format uses a
It would be nice if there were some reserved space in the enums for
experimentation like this.
-dain
> On Jun 6, 2019, at 4:14 AM, Praveen Krishna
> wrote:
>
After a bit more investigation, it looks like this was a regression only
present in Hive 2.0 to 2.2.
> On Mar 1, 2019, at 5:58 PM, Dain Sundstrom wrote:
>
> Thanks Owen,
rs ago.
> On Mar 1, 2019, at 3:19 PM, Owen O'Malley wrote:
>
> The goal of WriterVersion is to record changes to the writer software so
> that the readers can cope with unknown bugs.
Is there any documentation on which type coercions should be supported when the
partition schema does not match the file schema?
-dain
> On Oct 8, 2018, at 5:19 PM, Xiening Dai wrote:
>> On Oct 7, 2018, at 6:42 AM, Dain Sundstrom wrote:
>>> On Oct 6, 2018, at 11:42 AM, Owen O'Malley wrote:
>>>
>>> On Mon, Oct 1, 2018 at 3:56 PM Dain Sundstrom wrote:
>>>>>> * Breakin
> On Oct 6, 2018, at 11:42 AM, Owen O'Malley wrote:
>
> On Mon, Oct 1, 2018 at 3:56 PM Dain Sundstrom wrote:
>
>>
>> Interesting idea. This could help some processors of the data. Also, if
>> the format has this, it would be good to support "clustered
can be easily turned into a
>> key range. For example “WHERE id > 0 and id <= 100” will be translated into
>> range (0, 100], and this key range can be passed down all the way to the
>> Orc reader. Then we only need to load the corresponding row groups that
>> covers thi
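The key-range pushdown quoted above can be sketched roughly as follows. This is illustrative Python only, not ORC's or Presto's API; the stats and helper names are made up:

```python
# Translate a predicate such as "WHERE id > 0 AND id <= 100" into the
# half-open range (0, 100], then keep only the row groups whose
# [min, max] column statistics overlap that range.

def overlaps(range_lo, range_hi, rg_min, rg_max):
    """True if the range (range_lo, range_hi] can contain any value of a
    row group whose column stats are [rg_min, rg_max]."""
    return rg_max > range_lo and rg_min <= range_hi

# Hypothetical per-row-group (min, max) stats for column "id".
row_group_stats = [(-50, -1), (0, 40), (35, 120), (150, 300)]

lo, hi = 0, 100  # from "id > 0 AND id <= 100"
selected = [i for i, (mn, mx) in enumerate(row_group_stats)
            if overlaps(lo, hi, mn, mx)]
print(selected)  # [1, 2] -- only those row groups need to be read
```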
lock compression and the
data encodings, can have dramatic effects on the format. Can we consider
limiting the compression to LZ4 and ZSTD (or maybe just ZSTD), and then design
encodings that play well with them? Also, ZSTD can have a pre-trained
"dictionary" that might help with specific encodings… Just a thought.
-dain
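For the pre-trained "dictionary" idea, the standard library's zlib exposes the same mechanism in miniature (a preset dictionary shared between compressor and decompressor); zstd's trained dictionaries generalize it. A stdlib-only sketch with made-up sample data:

```python
import zlib

# Made-up sample data; the dictionary holds byte sequences we expect to
# recur in the payloads, so matches can reference it instead of literals.
dictionary = b"https://prestosql.io/ https://orc.apache.org/"
data = b"https://orc.apache.org/docs/compression.html"

plain = zlib.compress(data)

comp = zlib.compressobj(zdict=dictionary)
with_dict = comp.compress(data) + comp.flush()

# The decompressor must be handed the same preset dictionary.
decomp = zlib.decompressobj(zdict=dictionary)
assert decomp.decompress(with_dict) == data
assert len(with_dict) < len(plain)  # the dictionary pays off on small payloads
```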
Dain Sundstrom created ORC-369:
--
Summary: Seek to end of BitFieldReader causes EOFException
Key: ORC-369
URL: https://issues.apache.org/jira/browse/ORC-369
Project: ORC
Issue Type: Bug
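For context, the bug class in ORC-369 can be shown with a deliberately minimal bit reader (made-up code, not ORC's BitFieldReader): a reader that eagerly refills its byte buffer on every seek hits EOF when positioned exactly at the end of the data, even though no bit is ever read.

```python
import io

class BitReader:
    """Toy bit-field reader illustrating the ORC-369 bug class."""

    def __init__(self, data: bytes):
        self.stream = io.BytesIO(data)
        self.current = 0
        self.bits_left = 0

    def seek_eager(self, bit_position: int) -> None:
        """Buggy: always refills the byte buffer, so seeking to the
        position just past the last bit raises EOFError."""
        self.stream.seek(bit_position // 8)
        byte = self.stream.read(1)
        if not byte:
            raise EOFError("read past end of stream")
        self.current = byte[0]
        self.bits_left = 8 - (bit_position % 8)

    def seek_lazy(self, bit_position: int) -> None:
        """Fixed: just record the position; read on the next bit request."""
        self.stream.seek(bit_position // 8)
        self.bits_left = 0
        self.skip_bits = bit_position % 8

data = bytes([0b10100000])       # one byte: valid bit positions are 0..7
BitReader(data).seek_lazy(8)     # fine: nothing is read yet
# BitReader(data).seek_eager(8)  # raises EOFError at end of stream
```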
Our expectation is maybe in a quarter.
-dain
> On May 17, 2018, at 11:42 AM, Xiening Dai <xndai@live.com> wrote:
>
> Hi Dain,
>
> Do you have a rough timeline for when the Java zstd compressor will
> be available? Thanks.
>
>
>> On May 7,
The fixes are released in v0.11.
-dain
> On May 6, 2018, at 9:36 PM, Xiening Dai <xndai@live.com> wrote:
>
> Thanks for clarification. It makes sense to wait for your fixes. Thx.
>
>> On May 5, 2018, at 1:04 PM, Dain Sundstrom <d...@iq80.com> wrote:
>>
tly. We missed
this one because the default native implementation was not adding checksums so
the code wasn’t actually being tested.
-dain
0.11 is released.
-dain
Sent from my iPhone
> On May 4, 2018, at 1:41 PM, Owen O'Malley <owen.omal...@gmail.com> wrote:
>
> I just upgraded ORC to use aircompressor 0.10. I assume we'll want to move
> to 0.11 before we use zstd?
>
> .. Owen
>
>> On Fri, May 4,
.
-dain
> On May 4, 2018, at 11:19 AM, Xiening Dai <xndai@live.com> wrote:
>
> Hi all,
>
> I think the major reason that we don’t support zstd compressor today is that
> there’s no native Java library currently. But I do see a Java decompressor in
> pr
I’m pretty sure that JMH has a classpath exception, just like the JVM, so you
can link to it but can't ship it.
-dain
> On Apr 20, 2018, at 9:09 AM, Owen O'Malley <owen.omal...@gmail.com> wrote:
>
> On Thu, Apr 19, 2018 at 1:20 PM, Dain Sundstrom <d...@iq80.com> wrote
I don’t think there is anything like JMH, or any team that understands Java
microbenchmarking as well. Have you tried asking them to open source the APIs
under a better license so you can code against them?
-dain
> On Apr 18, 2018, at 8:52 AM, Owen O'Malley <owen.omal...@gmail.com&
Github user dain commented on a diff in the pull request:
https://github.com/apache/orc/pull/245#discussion_r181164570
--- Diff: site/_docs/encodings.md ---
@@ -109,10 +109,20 @@ DIRECT_V2 | PRESENT | Yes | Boolean
RLE
Decimal was introduced in Hive 0.11
e IO optimizer ends up reading the full streams anyway (e.g., a seek on
a disk is about as expensive as reading ~1MiB of data so you coalesce reads
with a gap less than ~1MiB to avoid the extra seek).
-dain
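The coalescing rule described above is easy to sketch. Assumed here: reads are (offset, length) pairs and the threshold is the ~1 MiB figure from the text; this is an illustration, not ORC's reader code.

```python
# Merge sorted (offset, length) read ranges whose inter-read gap is under
# ~1 MiB, since a seek costs about as much as reading that gap.
MAX_GAP = 1 << 20  # ~1 MiB

def coalesce(ranges, max_gap=MAX_GAP):
    merged = []
    for off, length in sorted(ranges):
        if merged:
            last_off, last_len = merged[-1]
            if off - (last_off + last_len) < max_gap:
                # Extend the previous range to cover this one.
                merged[-1] = (last_off, max(last_len, off + length - last_off))
                continue
        merged.append((off, length))
    return merged

reads = [(0, 100_000), (150_000, 50_000), (5_000_000, 10_000)]
print(coalesce(reads))  # [(0, 200000), (5000000, 10000)]
```

The first two reads merge because their 50 KB gap is cheaper than a second seek; the third stays a separate IO.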
) number columns
mixed with (large) string columns. If you only want the numbers, you end up
doing a lot of IOs (because of the large string columns in the middle), and with
this model you have a higher chance of getting a shared IO.
-dain
> On Mar 26, 2018, at 4:23 PM, Owen O'Malley <owe
Does the ORC spec require that a file start with “ORC”?
-dain
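Per the ORC specification, a file begins with the 3-byte magic "ORC" (the postscript at the end of the file also carries the magic). A minimal sniff of a header buffer, as an illustrative helper rather than any ORC API:

```python
# Check the 3-byte header magic that the ORC spec places at file offset 0.
def looks_like_orc(header: bytes) -> bool:
    return header[:3] == b"ORC"

print(looks_like_orc(b"ORC\x01"))  # True
print(looks_like_orc(b"PAR1"))     # False
```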
Github user dain commented on the issue:
https://github.com/apache/orc/pull/169
@xndai if the min or max happens to be a multi-megabyte value, it can be
really expensive for the reader. Additionally, for filtering the first few
bytes are the most valuable (they establish the range).
---
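One way to keep stats bounds cheap while preserving the valuable leading bytes is truncation: a truncated minimum is still a valid lower bound, and a truncated maximum can be bumped so it stays an upper bound. A sketch with made-up helper names, on raw byte strings; real UTF-8 values would additionally need character-boundary handling:

```python
# Hypothetical helpers (not ORC's API) showing bound-preserving truncation.
def truncate_min(value: bytes, limit: int) -> bytes:
    # A prefix always sorts <= the full value, so it stays a lower bound.
    return value[:limit]

def truncate_max(value: bytes, limit: int) -> bytes:
    # Bump the last kept byte so the result sorts >= the full value.
    if len(value) <= limit:
        return value
    prefix = bytearray(value[:limit])
    for i in reversed(range(limit)):
        if prefix[i] < 0xFF:
            prefix[i] += 1
            return bytes(prefix[: i + 1])
        # carry: this byte is 0xFF, drop it and try to bump the previous one
    return value  # every kept byte is 0xFF: cannot shorten safely

huge = b"x" * 4_000_000          # a multi-megabyte stats value
assert truncate_min(huge, 16) <= huge
assert truncate_max(huge, 16) >= huge
```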
Github user dain commented on the issue:
https://github.com/apache/orc/pull/163
Also this is a backwards incompatible change, so we would, at the very
least, need to do the trick where it is disabled by default in the writer until
the reader is rolled out everywhere.
---
Github user dain commented on the issue:
https://github.com/apache/orc/pull/163
We were considering doing this internally, and then we ran into a production
bug where files got truncated to zero bytes. Since empty files are illegal, we
could find all of the affected partitions easily.
t; unusable. Before throwing that switch, he would get none of the benefits
>> of ORC 2.0. Is this summary correct?
>>
>
> Yes, exactly.
I think the important part is not to change the APIs, so tools can be updated
by just upgrading the dependency.
-dain
encodings, we should pick encodings that play well with vectorization, which is
coming in Java 10 (Java 9 also has vastly improved auto-vectorization).
-dain
> On Aug 4, 2017, at 9:29 AM, Owen O'Malley <owen.omal...@gmail.com> wrote:
>
> All,
> We've started the process of upda
Github user dain commented on a diff in the pull request:
https://github.com/apache/orc/pull/132#discussion_r122580567
--- Diff: java/core/src/java/org/apache/orc/impl/OrcTail.java ---
@@ -70,8 +70,11 @@ public long getFileModificationTime() {
public
Github user dain commented on a diff in the pull request:
https://github.com/apache/orc/pull/132#discussion_r122575288
--- Diff: proto/orc_proto.proto ---
@@ -221,15 +227,32 @@ message PostScript {
// [0, 12] = Hive 0.12
repeated uint32 version = 4 [packed = true
Github user dain commented on a diff in the pull request:
https://github.com/apache/orc/pull/132#discussion_r122575252
--- Diff: java/core/src/java/org/apache/orc/impl/OrcTail.java ---
@@ -70,8 +70,11 @@ public long getFileModificationTime() {
public
Github user dain commented on a diff in the pull request:
https://github.com/apache/orc/pull/132#discussion_r122521442
--- Diff: proto/orc_proto.proto ---
@@ -221,15 +227,29 @@ message PostScript {
// [0, 12] = Hive 0.12
repeated uint32 version = 4 [packed = true
Recently I have been working on a custom writer for Presto, and while doing so
I kept notes on sections of the documentation that might have problems. Some of
these may have already been addressed:
## Compression
see https://orc.apache.org/docs/compression.html
I think the hex sequence for 10
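For reference, the spec's compression chunk header is three little-endian bytes holding the chunk length shifted left one bit, with the low bit set when the chunk is stored uncompressed ("original"). A sketch of the encoding:

```python
# Encode an ORC compression chunk header: 3 bytes, little-endian,
# value = (length << 1) | isOriginal.
def chunk_header(length: int, is_original: bool) -> bytes:
    return ((length << 1) | int(is_original)).to_bytes(3, "little")

# A 100,000-byte compressed chunk:
print(chunk_header(100_000, False).hex())  # 400d03
```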
Github user dain commented on a diff in the pull request:
https://github.com/apache/orc/pull/132#discussion_r122507019
--- Diff: proto/orc_proto.proto ---
@@ -221,15 +227,29 @@ message PostScript {
// [0, 12] = Hive 0.12
repeated uint32 version = 4 [packed = true
Github user dain commented on a diff in the pull request:
https://github.com/apache/orc/pull/132#discussion_r122506293
--- Diff: java/core/src/java/org/apache/orc/OrcFile.java ---
@@ -108,66 +108,118 @@ public int getMinor() {
}
}
+ public enum
On reading HIVE-12055, referenced from the docs for writer version 3, I’m not
sure what was fixed. Does anyone remember?
-dain
/6072e3aed88d9246e1130abadf3c15a88e975b4e#diff-340d190f994d92658b24aae1edf610b3
Is writer version "1 = HIVE-8732 fixed" after 0.14? If so, I can update my
reader to detect this.
-dain
> On Jun 6, 2017, at 3:36 PM, Owen O'Malley <owen.omal...@gmail.com> wrote:
>
> On Tue, Jun 6, 2017 at 3:02 P
u want to transition back to Java Strings, we would need to be a bit smarter.
-dain
> On Jun 6, 2017, at 3:39 PM, Owen O'Malley <owen.omal...@gmail.com> wrote:
>
> I'm confused. TimestampStatistics uses integers not strings.
>
> .. Owen
>
> On Mon, Jun 5, 2017 at 9:53
at the first surrogate
pair, so the value is slightly smaller than the min or larger than the max, and
still a valid UTF-8 sequence.
Thoughts?
-dain
> On Dec 12, 2016, at 4:48 PM, Dain Sundstrom <d...@iq80.com> wrote:
> On Dec 12, 2016, at 4:36 PM, Owen O'Malley <omal...@apache.org> wrote:
>>> I think this should also be documented in the statistics section which
>> also uses UTF-16 BE, which is at lea
Github user dain commented on the issue:
https://github.com/apache/orc/pull/78
Just curious, how do you plan on using this?
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
Github user dain commented on the issue:
https://github.com/apache/orc/pull/76
Can you describe why you want to make this change?
---
bloom filters from these files.
-dain
Sounds good to me.
Should we add a version field to the BLOOM_FILTER_UTF8 to deal with any future
problems?
One other thought, in the protobuf definition I think it would be more
efficient to have the bitset encoded as a byte[] to avoid the boxed long array.
-dain
> On Sep 8, 2016, at 3
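The byte[] encoding suggested above amounts to packing the bloom filter's 64-bit bitset words into one bytes field. A stdlib sketch with made-up helpers, not the BLOOM_FILTER_UTF8 definition itself:

```python
import struct

# Pack 64-bit bitset words little-endian into a single bytes value,
# instead of a repeated uint64 field that materializes boxed longs.
def longs_to_bytes(words):
    return struct.pack("<%dQ" % len(words), *words)

def bytes_to_longs(data):
    return list(struct.unpack("<%dQ" % (len(data) // 8), data))

words = [0xDEADBEEF, 0, 0xFFFFFFFFFFFFFFFF]
encoded = longs_to_bytes(words)
assert bytes_to_longs(encoded) == words
print(len(encoded))  # 24: eight bytes per word, no per-element overhead
```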