[jira] [Created] (ARROW-5916) [C++] Allow RecordBatch.length to be less than array lengths

2019-07-11 Thread John Muehlhausen (JIRA)
John Muehlhausen created ARROW-5916:
---

 Summary: [C++] Allow RecordBatch.length to be less than array 
lengths
 Key: ARROW-5916
 URL: https://issues.apache.org/jira/browse/ARROW-5916
 Project: Apache Arrow
  Issue Type: New Feature
Reporter: John Muehlhausen
 Attachments: test.arrow_ipc

0.13 ignored RecordBatch.length.  0.14 requires that RecordBatch.length and 
array length be equal.  As per 
[https://lists.apache.org/thread.html/2692dd8fe09c92aa313bded2f4c2d4240b9ef75a8604ec214eb02571@%3Cdev.arrow.apache.org%3E]
 , we discussed changing this so that RecordBatch.length can be any value in 
[0, array length].

 If RecordBatch.length is less than array length, the reader should ignore the 
portion of the array(s) beyond RecordBatch.length.  This will allow partially 
populated batches to be read in scenarios identified in the above discussion.

{code:c++}
  Status GetFieldMetadata(int field_index, ArrayData* out) {
    auto nodes = metadata_->nodes();
    // pop off a field
    if (field_index >= static_cast<int>(nodes->size())) {
      return Status::Invalid("Ran out of field metadata, likely malformed");
    }
    const flatbuf::FieldNode* node = nodes->Get(field_index);

    // out->length = node->length();   // previous behavior
    out->length = metadata_->length();  // proposed change
    out->null_count = node->null_count();
    out->offset = 0;
    return Status::OK();
  }
{code}
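
For illustration, a hypothetical reader-side helper (not part of the proposed patch; the name is only for the example) that truncates columns to RecordBatch.length when the arrays are longer:

{code:c++}
#include <memory>
#include <vector>
#include "arrow/array.h"
#include "arrow/record_batch.h"

// Return a batch whose columns are logically truncated to batch->num_rows(),
// ignoring the trailing (unpopulated) portion of longer arrays.
std::shared_ptr<arrow::RecordBatch> TruncateToBatchLength(
    const std::shared_ptr<arrow::RecordBatch>& batch) {
  std::vector<std::shared_ptr<arrow::Array>> columns;
  columns.reserve(batch->num_columns());
  for (int i = 0; i < batch->num_columns(); ++i) {
    std::shared_ptr<arrow::Array> column = batch->column(i);
    if (column->length() > batch->num_rows()) {
      column = column->Slice(0, batch->num_rows());
    }
    columns.push_back(column);
  }
  return arrow::RecordBatch::Make(batch->schema(), batch->num_rows(), columns);
}
{code}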

Attached is a test IPC file containing a batch with length 1 and array length 3.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


Re: [Discuss] Support an alternative memory layout for varchar/varbinary vectors

2019-07-11 Thread Fan Liya
@Wes McKinney,

Thanks a lot for the brainstorming. I think your ideas are reasonable and
feasible.
About IPC, my idea is that we can send the vector as a PointerStringVector,
and receive it as a VarCharVector, so that the overhead of memory
compaction can be hidden.
What do you think?
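
For illustration, a minimal sketch (the element struct and function are assumed
names, not an existing Arrow API) of how the send side could compact such a
pointer-based vector into the contiguous varbinary layout (offsets + data):

  #include <cstdint>
  #include <vector>

  // Assumed element layout of the proposed PointerStringVector: each value is
  // a (pointer, length) pair that may live in any memory segment.
  struct PointerString {
    const uint8_t* data;
    int32_t length;
  };

  // Compact into Arrow's varbinary layout: int32 offsets plus one data buffer.
  void CompactToVarBinary(const std::vector<PointerString>& values,
                          std::vector<int32_t>* offsets,
                          std::vector<uint8_t>* data) {
    offsets->clear();
    data->clear();
    offsets->push_back(0);
    for (const PointerString& s : values) {
      data->insert(data->end(), s.data, s.data + s.length);
      offsets->push_back(static_cast<int32_t>(data->size()));
    }
  }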

Best,
Liya Fan

On Fri, Jul 12, 2019 at 11:07 AM Fan Liya  wrote:

> @Uwe L. Korn
>
> Thanks a lot for the suggestion. I think this is exactly what we are doing
> right now.
>
> Best,
> Liya Fan
>
> On Thu, Jul 11, 2019 at 9:44 PM Wes McKinney  wrote:
>
>> hi Liya -- have you thought about implementing this as an
>> ExtensionType / ExtensionVector? You actually can already do this, so
>> if this helps you reference strings stored in some external memory
>> then that seems reasonable. Such a PointerStringVector could have a
>> method that converts it into the Arrow varbinary columnar
>> representation.
>>
>> You wouldn't be able to put such an object into the IPC binary
>> protocol, though. If that's a requirement (being able to use the IPC
>> protocol) for this kind of data, before going any further in the
>> discussion I would suggest that you work out exactly how such data
>> would be moved from one process address space to another (using
>> Buffers).
>>
>> - Wes
>>
>> On Thu, Jul 11, 2019 at 7:35 AM Uwe L. Korn  wrote:
>> >
>> > Hello Liya Fan,
>> >
>> > here your best approach is to copy into the Arrow format as you can
>> then use this as the basis for working with the Arrow-native representation
>> as well as your internal representation. You will have to use two different
>> offset vector as those two will always differ but in the case of your
>> internal representation, you don't have the requirement of consecutive data
>> as Arrow has but you can still work with the strings just as before even
>> when stored consecutively.
>> >
>> > Uwe
>> >
>> > On Thu, Jul 11, 2019, at 2:24 PM, Fan Liya wrote:
>> > > Hi Korn,
>> > >
>> > > Thanks a lot for your comments.
>> > >
>> > > In my opinion, your comments make sense to me. Allowing
>> non-consecutive
>> > > memory segments will break some good design choices of Arrow.
>> > > However, there are wide-spread user requirements for non-consecutive
>> memory
>> > > segments. I am wondering how can we help such users. What advice we
>> can
>> > > give to them?
>> > >
>> > > Memory copy/move can be a solution, but is there a better solution?
>> > > Is there a third alternative? Can we virtualize the non-consecutive
>> memory
>> > > segments into a consecutive one? (Although performance overhead is
>> > > unavoidable.)
>> > >
>> > > What do you think? Let's brain-storm it.
>> > >
>> > > Best,
>> > > Liya Fan
>> > >
>> > >
>> > > On Thu, Jul 11, 2019 at 8:05 PM Uwe L. Korn  wrote:
>> > >
>> > > > Hello Liya,
>> > > >
>> > > > I'm quite -1 on this type as Arrow is about efficient columnar
>> structures.
>> > > > We have opened the standard also to matrix-like types but always
>> keep the
>> > > > constraint of consecutive memory. Now also adding types where
>> memory is no
>> > > > longer consecutive but spread in the heap will make the scope of the
>> > > > project much wider (It seems that we then just turn into a general
>> > > > serialization framework).
>> > > >
>> > > > One of the ideas of a common standard is that some need to make
>> > > > compromises. I think in this case it is a necessary compromise to
>> not allow
>> > > > all kind of string representations.
>> > > >
>> > > > Uwe
>> > > >
>> > > > On Thu, Jul 11, 2019, at 6:01 AM, Fan Liya wrote:
>> > > > > Hi all,
>> > > > >
>> > > > >
>> > > > > We are thinking of providing varchar/varbinary vectors with a
>> different
>> > > > > memory layout which exists in a wide range of systems. The memory
>> layout
>> > > > is
>> > > > > different from that of VarCharVector in the following ways:
>> > > > >
>> > > > >
>> > > > >1.
>> > > > >
>> > > > >Instead of storing (start offset, end offset), the new layout
>> stores
>> > > > >(start offset, length)
>> > > > >2.
>> > > > >
>> > > > >The content of varchars may not be in a consecutive memory
>> region.
>> > > > >Instead, it can be in arbitrary memory address.
>> > > > >
>> > > > >
>> > > > > Due to these differences in memory layout, it incurs performance
>> overhead
>> > > > > when converting data between existing systems and VarCharVectors.
>> > > > >
>> > > > > The above difference 1 seems insignificant, while difference 2 is
>> > > > difficult
>> > > > > to overcome. However, the scenario of difference 2 is prevalent in
>> > > > > practice: for example we store strings in a series of memory
>> segments.
>> > > > > Whenever a segment is full, we request a new one. However, these
>> memory
>> > > > > segments may not be consecutive, because other processes/threads
>> are also
>> > > > > requesting/releasing memory segments in the meantime.
>> > > > >
>> > > > > So we are wondering if it is possible to support such memory
>> layout in
>> > > > > Arrow.

Re: [Discuss] Support an alternative memory layout for varchar/varbinary vectors

2019-07-11 Thread Fan Liya
@Uwe L. Korn

Thanks a lot for the suggestion. I think this is exactly what we are doing
right now.

Best,
Liya Fan

On Thu, Jul 11, 2019 at 9:44 PM Wes McKinney  wrote:

> hi Liya -- have you thought about implementing this as an
> ExtensionType / ExtensionVector? You actually can already do this, so
> if this helps you reference strings stored in some external memory
> then that seems reasonable. Such a PointerStringVector could have a
> method that converts it into the Arrow varbinary columnar
> representation.
>
> You wouldn't be able to put such an object into the IPC binary
> protocol, though. If that's a requirement (being able to use the IPC
> protocol) for this kind of data, before going any further in the
> discussion I would suggest that you work out exactly how such data
> would be moved from one process address space to another (using
> Buffers).
>
> - Wes
>
> On Thu, Jul 11, 2019 at 7:35 AM Uwe L. Korn  wrote:
> >
> > Hello Liya Fan,
> >
> > here your best approach is to copy into the Arrow format as you can then
> use this as the basis for working with the Arrow-native representation as
> well as your internal representation. You will have to use two different
> offset vector as those two will always differ but in the case of your
> internal representation, you don't have the requirement of consecutive data
> as Arrow has but you can still work with the strings just as before even
> when stored consecutively.
> >
> > Uwe
> >
> > On Thu, Jul 11, 2019, at 2:24 PM, Fan Liya wrote:
> > > Hi Korn,
> > >
> > > Thanks a lot for your comments.
> > >
> > > In my opinion, your comments make sense to me. Allowing non-consecutive
> > > memory segments will break some good design choices of Arrow.
> > > However, there are wide-spread user requirements for non-consecutive
> memory
> > > segments. I am wondering how can we help such users. What advice we can
> > > give to them?
> > >
> > > Memory copy/move can be a solution, but is there a better solution?
> > > Is there a third alternative? Can we virtualize the non-consecutive
> memory
> > > segments into a consecutive one? (Although performance overhead is
> > > unavoidable.)
> > >
> > > What do you think? Let's brain-storm it.
> > >
> > > Best,
> > > Liya Fan
> > >
> > >
> > > On Thu, Jul 11, 2019 at 8:05 PM Uwe L. Korn  wrote:
> > >
> > > > Hello Liya,
> > > >
> > > > I'm quite -1 on this type as Arrow is about efficient columnar
> structures.
> > > > We have opened the standard also to matrix-like types but always
> keep the
> > > > constraint of consecutive memory. Now also adding types where memory
> is no
> > > > longer consecutive but spread in the heap will make the scope of the
> > > > project much wider (It seems that we then just turn into a general
> > > > serialization framework).
> > > >
> > > > One of the ideas of a common standard is that some need to make
> > > > compromises. I think in this case it is a necessary compromise to
> not allow
> > > > all kind of string representations.
> > > >
> > > > Uwe
> > > >
> > > > On Thu, Jul 11, 2019, at 6:01 AM, Fan Liya wrote:
> > > > > Hi all,
> > > > >
> > > > >
> > > > > We are thinking of providing varchar/varbinary vectors with a
> different
> > > > > memory layout which exists in a wide range of systems. The memory
> layout
> > > > is
> > > > > different from that of VarCharVector in the following ways:
> > > > >
> > > > >
> > > > >1.
> > > > >
> > > > >Instead of storing (start offset, end offset), the new layout
> stores
> > > > >(start offset, length)
> > > > >2.
> > > > >
> > > > >The content of varchars may not be in a consecutive memory
> region.
> > > > >Instead, it can be in arbitrary memory address.
> > > > >
> > > > >
> > > > > Due to these differences in memory layout, it incurs performance
> overhead
> > > > > when converting data between existing systems and VarCharVectors.
> > > > >
> > > > > The above difference 1 seems insignificant, while difference 2 is
> > > > difficult
> > > > > to overcome. However, the scenario of difference 2 is prevalent in
> > > > > practice: for example we store strings in a series of memory
> segments.
> > > > > Whenever a segment is full, we request a new one. However, these
> memory
> > > > > segments may not be consecutive, because other processes/threads
> are also
> > > > > requesting/releasing memory segments in the meantime.
> > > > >
> > > > > So we are wondering if it is possible to support such memory
> layout in
> > > > > Arrow. I think there are more systems that are trying to adopting
> Arrow,
> > > > > but are hindered by such difficulty.
> > > > >
> > > > > Would you please give your valuable feedback?
> > > > >
> > > > >
> > > > > Best,
> > > > >
> > > > > Liya Fan
> > > > >
> > > >
> > >
>


Re: [DISCUSS] Need for 0.14.1 release due to Python package problems, Parquet forward compatibility problems

2019-07-11 Thread Francois Saint-Jacques
I just merged PARQUET-1623; I think it's worth including since it
fixes an invalid memory write. Note that I couldn't resolve/close the
Parquet issue; do I have to be a contributor to the project?

François

On Thu, Jul 11, 2019 at 6:10 PM Wes McKinney  wrote:
>
> I just merged Eric's 2nd patch ARROW-5908 and I went through all the
> patches since the release commit and have come up with the following
> list of 32 fix-only patches to pick into a maintenance branch:
>
> https://gist.github.com/wesm/1e4ac14baaa8b27bf13b071d2d715014
>
> Note there's still unresolved Parquet forward/backward compatibility
> issues in C++ that we haven't merged patches for yet, so that is
> pending.
>
> Are there any other patches / JIRA issues people would like to see
> resolved in a patch release?
>
> Thanks
>
> On Thu, Jul 11, 2019 at 3:03 PM Wes McKinney  wrote:
> >
> > Eric -- you are free to set the Fix Version prior to the patch being merged
> >
> > On Thu, Jul 11, 2019 at 3:01 PM Eric Erhardt
> >  wrote:
> > >
> > > The two C# fixes I'd like in the 0.14.1 release are:
> > >
> > > https://issues.apache.org/jira/browse/ARROW-5887 - already marked with 
> > > 0.14.1 fix version.
> > > https://issues.apache.org/jira/browse/ARROW-5908 - hasn't been resolved 
> > > yet. The PR https://github.com/apache/arrow/pull/4851 has one approver 
> > > and the Rust failure doesn't appear to be caused by my change.
> > >
> > > I assume I shouldn't mark ARROW-5908 with a 0.14.1 fix version until the 
> > > PR has been merged.
> > >
> > > -Original Message-
> > > From: Neal Richardson 
> > > Sent: Thursday, July 11, 2019 11:59 AM
> > > To: dev@arrow.apache.org
> > > Subject: Re: [DISCUSS] Need for 0.14.1 release due to Python package 
> > > problems, Parquet forward compatibility problems
> > >
> > > I just moved https://issues.apache.org/jira/browse/ARROW-5850 from 1.0.0 to 0.14.1.
> > >
> > > On Thu, Jul 11, 2019 at 8:12 AM Wes McKinney  wrote:
> > >
> > > > To limit uncertainty, I'm going to start preparing a 0.14.1 patch
> > > > release branch. I will update the list with the patches that are being
> > > > cherry-picked. If other folks could give me a list of other PRs that
> > > > need to be backported I will add them to the list. Any JIRA that needs
> > > > to be included should have the "0.14.1" fix version added so we can
> > > > keep track
> > > >
> > > > On Wed, Jul 10, 2019 at 9:48 PM Joris Van den Bossche
> > > >  wrote:
> > > > >
> > > > > I personally prefer 0.14.1 over 0.15.0. I think that is clearer in
> > > > > communication, as we are fixing regressions of the 0.14.0 release.
> > > > >
> > > > > (but I haven't been involved much in releases, so certainly no
> > > > > strong
> > > > > opinion)
> > > > >
> > > > > Joris
> > > > >
> > > > >
> > > > > On Wed, 10 Jul 2019 at 15:07, Wes McKinney wrote:
> > > > >
> > > > > > hi folks,
> > > > > >
> > > > > > Are there any opinions / strong feelings about the two options:
> > > > > >
> > > > > > * Prepare patch 0.14.1 release from a maintenance branch
> > > > > > * Release 0.15.0 out of master
> > > > > >
> > > > > > Aside from the Parquet forward compatibility issues we're still
> > > > > > discussing, and Eric's C# patch PR 4836, are there any other
> > > > > > issues that need to be fixed before we go down one of these paths?
> > > > > >
> > > > > > Would anyone like to help with release management? I can do so if
> > > > > > necessary, but I've already done a lot of release management :)
> > > > > >
> > > > > > - Wes
> > > > > >
> > > > > > On Tue, Jul 9, 2019 at 4:13 PM Wes McKinney 
> > > > wrote:
> > > > > > >
> > > > > > > Hi Eric -- of course!
> > > > > > >
> > > > > > > On Tue, Jul 9, 2019, 4:03 PM Eric Erhardt <
> > > > eric.erha...@microsoft.com.invalid>
> > > > > > wrote:
> > > > > > >>
> > > > > > >> Can we propose getting changes other than Python or Parquet
> > > > > > >> related
> > > > > > into this release?
> > > > > > >>
> > > > > > >> For example, I found a critical issue in the C# implementation
> > > > that, if
> > > > > > possible, I'd like to get included in a patch release.
> > > > > > https://github.com/apache/arrow/pull/4836
> > > > > > >>
> > > > > > >> Eric
> > > > > > >>
> > > > > > >> -Original Message-
> > > > > > >> From: Wes McKinney 
> > > > > > >> Sent: Tuesday, July 9, 2019 7:59 AM
> > > > > > >> To:

Re: [DISCUSS] Need for 0.14.1 release due to Python package problems, Parquet forward compatibility problems

2019-07-11 Thread Wes McKinney
I just merged Eric's 2nd patch ARROW-5908 and I went through all the
patches since the release commit and have come up with the following
list of 32 fix-only patches to pick into a maintenance branch:

https://gist.github.com/wesm/1e4ac14baaa8b27bf13b071d2d715014

Note there's still unresolved Parquet forward/backward compatibility
issues in C++ that we haven't merged patches for yet, so that is
pending.

Are there any other patches / JIRA issues people would like to see
resolved in a patch release?

Thanks

On Thu, Jul 11, 2019 at 3:03 PM Wes McKinney  wrote:
>
> Eric -- you are free to set the Fix Version prior to the patch being merged
>
> On Thu, Jul 11, 2019 at 3:01 PM Eric Erhardt
>  wrote:
> >
> > The two C# fixes I'd like in the 0.14.1 release are:
> >
> > https://issues.apache.org/jira/browse/ARROW-5887 - already marked with 
> > 0.14.1 fix version.
> > https://issues.apache.org/jira/browse/ARROW-5908 - hasn't been resolved 
> > yet. The PR https://github.com/apache/arrow/pull/4851 has one approver and 
> > the Rust failure doesn't appear to be caused by my change.
> >
> > I assume I shouldn't mark ARROW-5908 with a 0.14.1 fix version until the PR 
> > has been merged.
> >
> > -Original Message-
> > From: Neal Richardson 
> > Sent: Thursday, July 11, 2019 11:59 AM
> > To: dev@arrow.apache.org
> > Subject: Re: [DISCUSS] Need for 0.14.1 release due to Python package 
> > problems, Parquet forward compatibility problems
> >
> > I just moved https://issues.apache.org/jira/browse/ARROW-5850 from 1.0.0 to 0.14.1.
> >
> > On Thu, Jul 11, 2019 at 8:12 AM Wes McKinney  wrote:
> >
> > > To limit uncertainty, I'm going to start preparing a 0.14.1 patch
> > > release branch. I will update the list with the patches that are being
> > > cherry-picked. If other folks could give me a list of other PRs that
> > > need to be backported I will add them to the list. Any JIRA that needs
> > > to be included should have the "0.14.1" fix version added so we can
> > > keep track
> > >
> > > On Wed, Jul 10, 2019 at 9:48 PM Joris Van den Bossche
> > >  wrote:
> > > >
> > > > I personally prefer 0.14.1 over 0.15.0. I think that is clearer in
> > > > communication, as we are fixing regressions of the 0.14.0 release.
> > > >
> > > > (but I haven't been involved much in releases, so certainly no
> > > > strong
> > > > opinion)
> > > >
> > > > Joris
> > > >
> > > >
> > > > On Wed, 10 Jul 2019 at 15:07, Wes McKinney wrote:
> > > >
> > > > > hi folks,
> > > > >
> > > > > Are there any opinions / strong feelings about the two options:
> > > > >
> > > > > * Prepare patch 0.14.1 release from a maintenance branch
> > > > > * Release 0.15.0 out of master
> > > > >
> > > > > Aside from the Parquet forward compatibility issues we're still
> > > > > discussing, and Eric's C# patch PR 4836, are there any other
> > > > > issues that need to be fixed before we go down one of these paths?
> > > > >
> > > > > Would anyone like to help with release management? I can do so if
> > > > > necessary, but I've already done a lot of release management :)
> > > > >
> > > > > - Wes
> > > > >
> > > > > On Tue, Jul 9, 2019 at 4:13 PM Wes McKinney 
> > > wrote:
> > > > > >
> > > > > > Hi Eric -- of course!
> > > > > >
> > > > > > On Tue, Jul 9, 2019, 4:03 PM Eric Erhardt <
> > > eric.erha...@microsoft.com.invalid>
> > > > > wrote:
> > > > > >>
> > > > > >> Can we propose getting changes other than Python or Parquet
> > > > > >> related
> > > > > into this release?
> > > > > >>
> > > > > >> For example, I found a critical issue in the C# implementation
> > > that, if
> > > > > possible, I'd like to get included in a patch release.
> > > > > https://github.com/apache/arrow/pull/4836
> > > > > >>
> > > > > >> Eric
> > > > > >>
> > > > > >> -Original Message-
> > > > > >> From: Wes McKinney 
> > > > > >> Sent: Tuesday, July 9, 2019 7:59 AM
> > > > > >> To: dev@arrow.apache.org
> > > > > >> Subject: Re: [DISCUSS] Need for 0.14.1 release due to Python
> > > > > >> package
> > > > > problems, Parquet forward compatibility problems
> > > > > >>
> > > > > >> On Tue, Jul 9, 2019 at 12:02 AM Sutou Kouhei
> > > > > >> 
> > > > > wrote:
> > > > > >> >
> > > > > >> > Hi,
> > > > > >> >
> > > > > >> > > If the problems can be resolved quickly, I should think we
> > > could cut
> > > > > >> > > an RC for 0.14.1 by the end of this week. The RC could
> 

[jira] [Created] (ARROW-5915) [C++] [Python] Set up testing for backwards compatibility of the parquet reader

2019-07-11 Thread Joris Van den Bossche (JIRA)
Joris Van den Bossche created ARROW-5915:


 Summary: [C++] [Python] Set up testing for backwards compatibility 
of the parquet reader
 Key: ARROW-5915
 URL: https://issues.apache.org/jira/browse/ARROW-5915
 Project: Apache Arrow
  Issue Type: Test
  Components: C++, Python
Reporter: Joris Van den Bossche


Given the recent parquet compat problems, we should have better testing for 
this.

For easy testing of backwards compatibility, we could add some files (with 
different types) written with older versions, add them to 
/pyarrow/tests/data/parquet (we already have some files from 0.7 there) and 
ensure they are read correctly with the current version.

Similar to what Kartothek is doing: 
https://github.com/JDASoftwareGroup/kartothek/tree/master/reference-data/arrow-compat
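
For illustration, a hypothetical check on the C++ side (the file path and expected table are placeholders) of what reading one of these reference files could look like:

{code:c++}
#include <memory>
#include <string>
#include "arrow/api.h"
#include "arrow/io/file.h"
#include "parquet/arrow/reader.h"

// Read a Parquet file written by an older Arrow version and verify that its
// contents match the expected table.
arrow::Status CheckCompatFile(const std::string& path,
                              const std::shared_ptr<arrow::Table>& expected) {
  std::shared_ptr<arrow::io::ReadableFile> infile;
  ARROW_RETURN_NOT_OK(arrow::io::ReadableFile::Open(path, &infile));
  std::unique_ptr<parquet::arrow::FileReader> reader;
  ARROW_RETURN_NOT_OK(
      parquet::arrow::OpenFile(infile, arrow::default_memory_pool(), &reader));
  std::shared_ptr<arrow::Table> table;
  ARROW_RETURN_NOT_OK(reader->ReadTable(&table));
  return table->Equals(*expected)
             ? arrow::Status::OK()
             : arrow::Status::Invalid("Contents do not match for ", path);
}
{code}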





--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


Re: [DISCUSS] Need for 0.14.1 release due to Python package problems, Parquet forward compatibility problems

2019-07-11 Thread Wes McKinney
Eric -- you are free to set the Fix Version prior to the patch being merged

On Thu, Jul 11, 2019 at 3:01 PM Eric Erhardt
 wrote:
>
> The two C# fixes I'd like in the 0.14.1 release are:
>
> https://issues.apache.org/jira/browse/ARROW-5887 - already marked with 0.14.1 
> fix version.
> https://issues.apache.org/jira/browse/ARROW-5908 - hasn't been resolved yet. 
> The PR https://github.com/apache/arrow/pull/4851 has one approver and the 
> Rust failure doesn't appear to be caused by my change.
>
> I assume I shouldn't mark ARROW-5908 with a 0.14.1 fix version until the PR 
> has been merged.
>
> -Original Message-
> From: Neal Richardson 
> Sent: Thursday, July 11, 2019 11:59 AM
> To: dev@arrow.apache.org
> Subject: Re: [DISCUSS] Need for 0.14.1 release due to Python package 
> problems, Parquet forward compatibility problems
>
> I just moved https://issues.apache.org/jira/browse/ARROW-5850 from 1.0.0 to 0.14.1.
>
> On Thu, Jul 11, 2019 at 8:12 AM Wes McKinney  wrote:
>
> > To limit uncertainty, I'm going to start preparing a 0.14.1 patch
> > release branch. I will update the list with the patches that are being
> > cherry-picked. If other folks could give me a list of other PRs that
> > need to be backported I will add them to the list. Any JIRA that needs
> > to be included should have the "0.14.1" fix version added so we can
> > keep track
> >
> > On Wed, Jul 10, 2019 at 9:48 PM Joris Van den Bossche
> >  wrote:
> > >
> > > I personally prefer 0.14.1 over 0.15.0. I think that is clearer in
> > > communication, as we are fixing regressions of the 0.14.0 release.
> > >
> > > (but I haven't been involved much in releases, so certainly no
> > > strong
> > > opinion)
> > >
> > > Joris
> > >
> > >
> > > On Wed, 10 Jul 2019 at 15:07, Wes McKinney wrote:
> > >
> > > > hi folks,
> > > >
> > > > Are there any opinions / strong feelings about the two options:
> > > >
> > > > * Prepare patch 0.14.1 release from a maintenance branch
> > > > * Release 0.15.0 out of master
> > > >
> > > > Aside from the Parquet forward compatibility issues we're still
> > > > discussing, and Eric's C# patch PR 4836, are there any other
> > > > issues that need to be fixed before we go down one of these paths?
> > > >
> > > > Would anyone like to help with release management? I can do so if
> > > > necessary, but I've already done a lot of release management :)
> > > >
> > > > - Wes
> > > >
> > > > On Tue, Jul 9, 2019 at 4:13 PM Wes McKinney 
> > wrote:
> > > > >
> > > > > Hi Eric -- of course!
> > > > >
> > > > > On Tue, Jul 9, 2019, 4:03 PM Eric Erhardt <
> > eric.erha...@microsoft.com.invalid>
> > > > wrote:
> > > > >>
> > > > >> Can we propose getting changes other than Python or Parquet
> > > > >> related
> > > > into this release?
> > > > >>
> > > > >> For example, I found a critical issue in the C# implementation
> > that, if
> > > > possible, I'd like to get included in a patch release.
> > > > https://github.com/apache/arrow/pull/4836
> > > > >>
> > > > >> Eric
> > > > >>
> > > > >> -Original Message-
> > > > >> From: Wes McKinney 
> > > > >> Sent: Tuesday, July 9, 2019 7:59 AM
> > > > >> To: dev@arrow.apache.org
> > > > >> Subject: Re: [DISCUSS] Need for 0.14.1 release due to Python
> > > > >> package
> > > > problems, Parquet forward compatibility problems
> > > > >>
> > > > >> On Tue, Jul 9, 2019 at 12:02 AM Sutou Kouhei
> > > > >> 
> > > > wrote:
> > > > >> >
> > > > >> > Hi,
> > > > >> >
> > > > >> > > If the problems can be resolved quickly, I should think we
> > could cut
> > > > >> > > an RC for 0.14.1 by the end of this week. The RC could
> > > > >> > > either
> > be cut
> > > > >> > > from a maintenance branch or out of master -- any thoughts
> > > > >> > > about this (cutting from master is definitely easier)?
> > > > >> >
> > > > >> > How about just releasing 0.15.0 from master?
> > > > >> > It'll be simpler than creating a patch release.
> > > > >> >
> > > > >>
> > > > >> I'd be fine with that, too.
> > > > >>
> > > > >> >
> > > > >> > Thanks,
> > > > >> > --
> > > > >> > kou
> > > > >> >
> > > > >> > In  > > > nmvwuy8wxxddcctobuuamy4ee...@mail.gmail.com>
> > > > >> >   "[DISCUSS] Need for 0.14.1 release due to Python package
> > problems,
> > > > Parquet forward compatibility problems" on Mon, 8 Jul 2019
> > > > 11:32:07
> > -0500,
> > > > >> >   Wes McKinney  wrote:
> > > > >> >
> > > > >> > > hi folks,
> 

RE: [DISCUSS] Need for 0.14.1 release due to Python package problems, Parquet forward compatibility problems

2019-07-11 Thread Eric Erhardt
The two C# fixes I'd like in the 0.14.1 release are:

https://issues.apache.org/jira/browse/ARROW-5887 - already marked with 0.14.1 
fix version.
https://issues.apache.org/jira/browse/ARROW-5908 - hasn't been resolved yet. 
The PR https://github.com/apache/arrow/pull/4851 has one approver and the Rust 
failure doesn't appear to be caused by my change.

I assume I shouldn't mark ARROW-5908 with a 0.14.1 fix version until the PR has 
been merged.

-Original Message-
From: Neal Richardson  
Sent: Thursday, July 11, 2019 11:59 AM
To: dev@arrow.apache.org
Subject: Re: [DISCUSS] Need for 0.14.1 release due to Python package problems, 
Parquet forward compatibility problems

I just moved https://issues.apache.org/jira/browse/ARROW-5850 from 1.0.0 to 0.14.1.

On Thu, Jul 11, 2019 at 8:12 AM Wes McKinney  wrote:

> To limit uncertainty, I'm going to start preparing a 0.14.1 patch 
> release branch. I will update the list with the patches that are being 
> cherry-picked. If other folks could give me a list of other PRs that 
> need to be backported I will add them to the list. Any JIRA that needs 
> to be included should have the "0.14.1" fix version added so we can 
> keep track
>
> On Wed, Jul 10, 2019 at 9:48 PM Joris Van den Bossche 
>  wrote:
> >
> > I personally prefer 0.14.1 over 0.15.0. I think that is clearer in 
> > communication, as we are fixing regressions of the 0.14.0 release.
> >
> > (but I haven't been involved much in releases, so certainly no 
> > strong
> > opinion)
> >
> > Joris
> >
> >
> > On Wed, 10 Jul 2019 at 15:07, Wes McKinney wrote:
> >
> > > hi folks,
> > >
> > > Are there any opinions / strong feelings about the two options:
> > >
> > > * Prepare patch 0.14.1 release from a maintenance branch
> > > * Release 0.15.0 out of master
> > >
> > > Aside from the Parquet forward compatibility issues we're still 
> > > discussing, and Eric's C# patch PR 4836, are there any other 
> > > issues that need to be fixed before we go down one of these paths?
> > >
> > > Would anyone like to help with release management? I can do so if 
> > > necessary, but I've already done a lot of release management :)
> > >
> > > - Wes
> > >
> > > On Tue, Jul 9, 2019 at 4:13 PM Wes McKinney 
> wrote:
> > > >
> > > > Hi Eric -- of course!
> > > >
> > > > On Tue, Jul 9, 2019, 4:03 PM Eric Erhardt <
> eric.erha...@microsoft.com.invalid>
> > > wrote:
> > > >>
> > > >> Can we propose getting changes other than Python or Parquet 
> > > >> related
> > > into this release?
> > > >>
> > > >> For example, I found a critical issue in the C# implementation
> that, if
> > > possible, I'd like to get included in a patch release.
> > > https://github.com/apache/arrow/pull/4836
> > > >>
> > > >> Eric
> > > >>
> > > >> -Original Message-
> > > >> From: Wes McKinney 
> > > >> Sent: Tuesday, July 9, 2019 7:59 AM
> > > >> To: dev@arrow.apache.org
> > > >> Subject: Re: [DISCUSS] Need for 0.14.1 release due to Python 
> > > >> package
> > > problems, Parquet forward compatibility problems
> > > >>
> > > >> On Tue, Jul 9, 2019 at 12:02 AM Sutou Kouhei 
> > > >> 
> > > wrote:
> > > >> >
> > > >> > Hi,
> > > >> >
> > > >> > > If the problems can be resolved quickly, I should think we
> could cut
> > > >> > > an RC for 0.14.1 by the end of this week. The RC could 
> > > >> > > either
> be cut
> > > >> > > from a maintenance branch or out of master -- any thoughts 
> > > >> > > about this (cutting from master is definitely easier)?
> > > >> >
> > > >> > How about just releasing 0.15.0 from master?
> > > >> > It'll be simpler than creating a patch release.
> > > >> >
> > > >>
> > > >> I'd be fine with that, too.
> > > >>
> > > >> >
> > > >> > Thanks,
> > > >> > --
> > > >> > kou
> > > >> >
> > > >> > In  > > nmvwuy8wxxddcctobuuamy4ee...@mail.gmail.com>
> > > >> >   "[DISCUSS] Need for 0.14.1 release due to Python package
> problems,
> > > Parquet forward compatibility problems" on Mon, 8 Jul 2019 
> > > 11:32:07
> -0500,
> > > >> >   Wes McKinney  wrote:
> > > >> >
> > > >> > > hi folks,
> > > >> > >
> > > >> > > Perhaps unsurprisingly due to the expansion of our Python
> packages,
> > > >> > > a number of things are broken in 0.14.0 that we should fix
> sooner
> > > >> > > than the next major release. I'll try to send a complete 
> > > >> > > list to this thread to give a status within a day or two. 
> > > >> > > Other
> problems may
> > > >> > > ari

Re: [Python] Wheel questions

2019-07-11 Thread Wes McKinney
On Thu, Jul 11, 2019 at 11:26 AM Antoine Pitrou  wrote:
>
>
> On 11/07/2019 at 17:52, Krisztián Szűcs wrote:
> > Hi All,
> >
> > I have a couple of questions about the wheel packaging:
> > - why do we build an arrow namespaced boost on linux and osx, could we link
> > statically like with the windows wheels?
>
> No idea.  Boost shouldn't leak in the public APIs, so theoretically a
> static build would be fine...

In principle the privately-namespaced Boost could be statically
linked. We are using bcp to change the C++ namespace of the symbols so
that our Boost symbols don't conflict with other wheels' Boost symbols
(which may have come from a different Boost version).

I'll let Uwe comment further on the desire for dynamic linking

>
> > - do we explicitly say somewhere in the linux wheels to link the 3rdparty
> > dependencies statically or just implicitly, by removing (or not building)
> > the shared libs for the 3rdparty dependencies?
>
> It's implicit by removing the shared libs (or not building them).
> Some time ago the compression libs were always linked statically by
> default, but it was changed to dynamic along the time, probably to
> please system packagers.

I think only libz shared library is being bundled, for security reasons

>
> > - couldn't we use the 3rdparty toolchain to build the smaller 3rdparty
> > dependencies for the linux wheels instead of building them manually in the
> > manylinux docker image - it'd be easier to say _SOURCE=BUNDLED
>
> I don't think so.  The conda-forge and Anaconda packages use a different
> build chain (different compiler, different libstdc++ version) and may
> not be usable directly on manylinux-compliant systems.

I think you may misunderstand. Krisztian is suggesting building the
dependencies through the ExternalProject mechanism during "docker run"
on the image rather than caching pre-built versions in the Docker
image.

For small dependencies, I don't see why we couldn't use the BUNDLED
approach. This might spare us having to maintain some of the build
scripts. It will strictly increase build times, though -- I think the
reason that everything is cached now is to save on build times (which
have historically been quite long)

>
> Regards
>
> Antoine.


Re: [Discuss] IPC Specification, flatbuffers and unaligned memory accesses

2019-07-11 Thread Wes McKinney
Hi Francois -- copying the metadata into memory isn't the end of the world
but it's a pretty ugly wart. This affects every IPC protocol message
everywhere.

We have an opportunity to address the wart now but such a fix post-1.0.0
will be much more difficult.

On Thu, Jul 11, 2019, 2:05 PM Francois Saint-Jacques <
fsaintjacq...@gmail.com> wrote:

> If the data buffers are still aligned, then I don't think we should
> add a breaking change just for avoiding the copy on the metadata? I'd
> expect said metadata to be small enough that zero-copy doesn't really
> affect performance.
>
> François
>
> On Sun, Jun 30, 2019 at 4:01 AM Micah Kornfield 
> wrote:
> >
> > While working on trying to fix undefined behavior for unaligned memory
> > accesses [1], I ran into an issue with the IPC specification [2] which
> > prevents us from ever achieving zero-copy memory mapping and having
> aligned
> > accesses (i.e. clean UBSan runs).
> >
> > Flatbuffer metadata needs 8-byte alignment to guarantee aligned accesses.
> >
> > In the IPC format we align each message to 8-byte boundaries.  We then
> > write an int32_t integer to denote the size of the flatbuffer metadata,
> > followed immediately  by the flatbuffer metadata.  This means the
> > flatbuffer metadata will never be 8 byte aligned.
> >
> > Do people care?  A simple fix  would be to use int64_t instead of int32_t
> > for length.  However, any fix essentially breaks all previous client
> > library versions or incurs a memory copy.
> >
> > [1] https://github.com/apache/arrow/pull/4757
> > [2] https://arrow.apache.org/docs/ipc.html
>


[jira] [Created] (ARROW-5914) [CI] Build bundled dependencies in docker build step

2019-07-11 Thread Francois Saint-Jacques (JIRA)
Francois Saint-Jacques created ARROW-5914:
-

 Summary: [CI] Build bundled dependencies in docker build step
 Key: ARROW-5914
 URL: https://issues.apache.org/jira/browse/ARROW-5914
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Continuous Integration
Reporter: Francois Saint-Jacques
 Fix For: 1.0.0


In the recently introduced ARROW-5803, some heavy dependencies (Thrift, 
Protobuf, Flatbuffers, gRPC) are built at each invocation of docker-compose 
build (thus on each Travis test).

We should aim to build the third-party dependencies in the docker build phase 
instead, to exploit caching and docker-compose pull, so that the CI step doesn't 
need to build said dependencies each time.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


Re: [Discuss] IPC Specification, flatbuffers and unaligned memory accesses

2019-07-11 Thread Francois Saint-Jacques
If the data buffers are still aligned, then I don't think we should
add a breaking change just for avoiding the copy on the metadata? I'd
expect said metadata to be small enough that zero-copy doesn't really
affect performance.

François

On Sun, Jun 30, 2019 at 4:01 AM Micah Kornfield  wrote:
>
> While working on trying to fix undefined behavior for unaligned memory
> accesses [1], I ran into an issue with the IPC specification [2] which
> prevents us from ever achieving zero-copy memory mapping and having aligned
> accesses (i.e. clean UBSan runs).
>
> Flatbuffer metadata needs 8-byte alignment to guarantee aligned accesses.
>
> In the IPC format we align each message to 8-byte boundaries.  We then
> write an int32_t integer to denote the size of the flatbuffer metadata,
> followed immediately  by the flatbuffer metadata.  This means the
> flatbuffer metadata will never be 8 byte aligned.
>
> Do people care?  A simple fix  would be to use int64_t instead of int32_t
> for length.  However, any fix essentially breaks all previous client
> library versions or incurs a memory copy.
>
> [1] https://github.com/apache/arrow/pull/4757
> [2] https://arrow.apache.org/docs/ipc.html


Re: [Discuss] IPC Specification, flatbuffers and unaligned memory accesses

2019-07-11 Thread Wes McKinney
Hi Bryan -- it wouldn't be forward compatible when using the 8 byte prefix,
but using the scheme we are proposing old clients would see the new prefix
as malformed (metadata length 0xFFFFFFFF = -1) rather than crashing.

We could possibly expose a forward compatibility option to write the 4 byte
prefix for the benefit of old clients, though that makes the implementation
more complicated
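
To make the framing concrete, a rough sketch of writing the proposed 8-byte
prefix (the helper name is illustrative; the continuation value is the
0xFFFFFFFF marker discussed below):

  #include <cstdint>
  #include <cstring>
  #include <vector>

  // Old prefix: <int32 metadata length>. Proposed prefix:
  // <0xFFFFFFFF continuation><int32 metadata length>, so the flatbuffer
  // metadata that follows starts on an 8-byte boundary and an old reader
  // sees -1 rather than a bogus length.
  void WriteMessagePrefix(int32_t metadata_length, std::vector<uint8_t>* out) {
    const uint32_t kContinuation = 0xFFFFFFFF;
    out->resize(8);
    std::memcpy(out->data(), &kContinuation, sizeof(kContinuation));
    std::memcpy(out->data() + 4, &metadata_length, sizeof(metadata_length));
  }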

On Thu, Jul 11, 2019, 12:47 PM Bryan Cutler  wrote:

> So the proposal here will still be backwards compatible with a 4 byte
> prefix? Can you explain a little more how this might work if I have an
> older version of Java using 4 byte prefix and a new version of C++/Python
> with an 8 byte one for a roundtrip Java -> Python -> Java?
>
> On Wed, Jul 10, 2019 at 6:11 AM Wes McKinney  wrote:
>
> > The issue is fairly esoteric, so it will probably take more time to
> > collect feedback. I could create a C++ implementation of this if it
> > helps with the process.
> >
> > On Wed, Jul 10, 2019 at 2:25 AM Micah Kornfield 
> > wrote:
> > >
> > > Does anybody else have thoughts on this?   Other language contributors?
> > >
> > > It seems like we still might not have enough of a consensus for a vote?
> > >
> > > Thanks,
> > > Micah
> > >
> > >
> > >
> > >
> > > On Tue, Jul 2, 2019 at 7:32 AM Wes McKinney 
> wrote:
> > >
> > > > Correct. The encapsulated IPC message will just be 4 bytes bigger.
> > > >
> > > > On Tue, Jul 2, 2019, 9:31 AM Antoine Pitrou 
> > wrote:
> > > >
> > > > >
> > > > > I guess I still dont understand how the IPC stream format works :-/
> > > > >
> > > > > To put it clearly: what happens in Flight?  Will a Flight message
> > > > > automatically get the "stream continuation message" in front of it?
> > > > >
> > > > >
> > > > > Le 02/07/2019 à 16:15, Wes McKinney a écrit :
> > > > > > On Tue, Jul 2, 2019 at 4:23 AM Antoine Pitrou <
> anto...@python.org>
> > > > > wrote:
> > > > > >>
> > > > > >>
> > > > > >> Le 02/07/2019 à 00:20, Wes McKinney a écrit :
> > > > > >>> Thanks for the references.
> > > > > >>>
> > > > > >>> If we decided to make a change around this, we could call the
> > first 4
> > > > > >>> bytes a stream continuation marker to make it slightly less
> ugly
> > > > > >>>
> > > > > >>> * 0xFFFFFFFF: continue
> > > > > >>> * 0x00000000: stop
> > > > > >>
> > > > > >> Do you mean it would be a separate IPC message?
> > > > > >
> > > > > > No, I think this is only about how we could change the message
> > prefix
> > > > > > from 4 bytes to 8 bytes
> > > > > >
> > > > > >
> > > > >
> > > >
> >
> https://github.com/apache/arrow/blob/master/docs/source/format/IPC.rst#encapsulated-message-format
> > > > > >
> > > > > > Currently a 0x00000000 (0 metadata size) is used as an
> > end-of-stream
> > > > > > marker. So what I was saying is that the first 8 bytes could be
> > > > > >
> > > > > > <4 bytes: stream continuation> <4 bytes: metadata size> <metadata>
> > > > > >
> > > > > >>
> > > > > >>
> > > > > >>>
> > > > > >>> On Mon, Jul 1, 2019 at 4:35 PM Micah Kornfield <
> > > > emkornfi...@gmail.com>
> > > > > wrote:
> > > > > 
> > > > >  Hi Wes,
> > > > >  I'm not an expert on this either, my inclination mostly comes
> > from
> > > > > some research I've done.  I think it is important to distinguish
> two
> > > > cases:
> > > > >  1.  unaligned access at the processor instruction level
> > > > >  2.  undefined behavior
> > > > > 
> > > > >  From my reading unaligned access is fine on most modern
> > > > architectures
> > > > > and it seems the performance penalty has mostly been eliminated.
> > > > > 
> > > > >  Undefined behavior is a compiler/language concept.  The
> problem
> > is
> > > > > the compiler can choose to do anything in UB scenarios, not just
> the
> > > > > "obvious" translation.  Specifically, the compiler is under no
> > obligation
> > > > > to generate the unaligned access instructions, and if it doesn't
> > SEGVs
> > > > > ensue.  Two examples, both of which relate to SIMD optimizations
> are
> > > > linked
> > > > > below.
> > > > > 
> > > > >  I tend to be on the conservative side with this type of thing
> > but if
> > > > > we have experts on the the ML that can offer a more informed
> > opinion, I
> > > > > would love to hear it.
> > > > > 
> > > > >  [1]
> > > > >
> http://pzemtsov.github.io/2016/11/06/bug-story-alignment-on-x86.html
> > > > >  [2] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65709
> > > > > 
> > > > >  On Mon, Jul 1, 2019 at 1:41 PM Wes McKinney <
> > wesmck...@gmail.com>
> > > > > wrote:
> > > > > >
> > > > > > The <0x> solution is downright ugly
> but I
> > > > think
> > > > > > it's one of the only ways that achieves
> > > > > >
> > > > > > * backward compatibility (new clients can read old data)
> > > > > > * opt-in forward compatibility (if we want to go to the labor
> > of
> > > > > doing
> > > > > > so, sort of dangerous)
> > > > > > * old clients receiving new data do not blow up (they wil

Re: Adding a new encoding for FP data - unsubscribe

2019-07-11 Thread Bryan Cutler
Mani, please send a reply to dev-unsubscr...@arrow.apache.org to remove
yourself from the list.

On Thu, Jul 11, 2019 at 11:10 AM mani vannan 
wrote:

> All,
>
> Can someone please help me unsubscribe from this group?
>
> Thank you.
>
> -Original Message-
> From: Radev, Martin 
> Sent: Thursday, July 11, 2019 2:08 PM
> To: dev@arrow.apache.org; emkornfi...@gmail.com
> Cc: Raoofy, Amir ; Karlstetter, Roman <
> roman.karlstet...@tum.de>
> Subject: Re: Adding a new encoding for FP data
>
> Hello Micah,
>
>
> the changes will go to the C++ implementation of Parquet within Arrow.
>
> In that sense, if Arrow uses the compression and encoding methods
> available in Parquet in any way, I expect a benefit.
>
>
> My plan is to add the new encoding to parquet-cpp and parquer-mr (java).
>
>
> If you have any more questions or concerns, let me know.
>
> I am close to done with my patch.
>
>
> Regards,
>
> Martin
>
>
> 
> From: Micah Kornfield 
> Sent: Thursday, July 11, 2019 5:26:26 PM
> To: dev@arrow.apache.org
> Cc: Raoofy, Amir; Karlstetter, Roman
> Subject: Re: Adding a new encoding for FP data
>
> Hi Martin,
> Can you clarify whether you were expecting the encoding to only be used in
> Parquet, or more generally in Arrow?
>
> Thanks,
> Micah
>
> On Thu, Jul 11, 2019 at 7:06 AM Wes McKinney  wrote:
>
> > hi folks,
> >
> > If you could participate in Micah's discussion about compression and
> > encoding generally at
> >
> >
> > https://lists.apache.org/thread.html/a99124e57c14c3c9ef9d98f3c80cfe1dd
> > 25496bf3ff7046778add937@%3Cdev.arrow.apache.org%3E
> >
> > it would be helpful. I personally think that Arrow would benefit from
> > an alternate protocol message type to the current RecordBatch (as
> > defined in Message.fbs) that allows for encoded or compressed columns.
> > This won't be an overnight change (more on the order of months of
> > work), but it's worth taking the time to carefully consider the
> > implications of developing and supporting such a feature for the long
> > term
> >
> > On Thu, Jul 11, 2019 at 5:34 AM Fan Liya  wrote:
> > >
> > > Hi Radev,
> > >
> > > Thanks a lot for providing so much technical details. I need to read
> > > them carefully.
> > >
> > > I think FP encoding is definitely a useful feature.
> > > I hope this feature can be implemented in Arrow soon, so that we can
> > > use
> > it
> > > in our system.
> > >
> > > Best,
> > > Liya Fan
> > >
> > > On Thu, Jul 11, 2019 at 5:55 PM Radev, Martin 
> > wrote:
> > >
> > > > Hello Liya Fan,
> > > >
> > > >
> > > > this explains the technique but for a more complex case:
> > > >
> > > >
> > https://fgiesen.wordpress.com/2011/01/24/x86-code-compression-in-kkrun
> > chy/
>
>
>
> > > >
> > > > For FP data, the approach which seemed to be the best is the
> following.
> > > >
> > > > Say we have a buffer of two 32-bit floating point values:
> > > >
> > > > buf = [af, bf]
> > > >
> > > > We interpret each FP value as a 32-bit uint and look at each
> > > > individual byte. We have 8 bytes in total for this small input.
> > > >
> > > > buf = [af0, af1, af2, af3, bf0, bf1, bf2, bf3]
> > > >
> > > > Then we apply stream splitting and the new buffer becomes:
> > > >
> > > > newbuf = [af0, bf0, af1, bf1, af2, bf2, af3, bf3]
> > > >
> > > > We compress newbuf.
> > > >
> > > > Due to similarities the sign bits, mantissa bits and MSB exponent
> > bits, we
> > > > might have a lot more repetitions in data. For scientific data,
> > > > the
> > 2nd and
> > > > 3rd byte for 32-bit data is probably largely noise. Thus in the
> > original
> > > > representation we would always have a few bytes of data which
> > > > could
> > appear
> > > > somewhere else in the buffer and then a couple bytes of possible
> > noise. In
> > > > the new representation we have a long stream of data which could
> > compress
> > > > well and then a sequence of noise towards the end.
> > > >
> > > > This transformation improved compression ratio as can be seen in
> > > > the report.
> > > >
> > > > It also improved speed for ZSTD. This could be because ZSTD makes
> > > > a decision of how to compress the data - RLE, new huffman tree,
> > > > huffman
> > tree
> > > > of the previous frame, raw representation. Each can potentially
> > achieve a
> > > > different compression ratio and compression/decompression speed.
> > > > It
> > turned
> > > > out that when the transformation is applied, zstd would attempt to
> > compress
> > > > fewer frames and copy the other. This could lead to less attempts
> > > > 

RE: Adding a new encoding for FP data - unsubscribe

2019-07-11 Thread mani vannan
All, 

Can someone please help me unsubscribe from this group?

Thank you.

-Original Message-
From: Radev, Martin  
Sent: Thursday, July 11, 2019 2:08 PM
To: dev@arrow.apache.org; emkornfi...@gmail.com
Cc: Raoofy, Amir ; Karlstetter, Roman 

Subject: Re: Adding a new encoding for FP data

Hello Micah,


the changes will go to the C++ implementation of Parquet within Arrow.

In that sense, if Arrow uses the compression and encoding methods available in 
Parquet in any way, I expect a benefit.


My plan is to add the new encoding to parquet-cpp and parquer-mr (java).


If you have any more questions or concerns, let me know.

I am close to done with my patch.


Regards,

Martin



From: Micah Kornfield 
Sent: Thursday, July 11, 2019 5:26:26 PM
To: dev@arrow.apache.org
Cc: Raoofy, Amir; Karlstetter, Roman
Subject: Re: Adding a new encoding for FP data

Hi Martin,
Can you clarify whether you were expecting the encoding to only be used in Parquet, or 
more generally in Arrow?

Thanks,
Micah

On Thu, Jul 11, 2019 at 7:06 AM Wes McKinney  wrote:

> hi folks,
>
> If you could participate in Micah's discussion about compression and 
> encoding generally at
>
>
> https://lists.apache.org/thread.html/a99124e57c14c3c9ef9d98f3c80cfe1dd
> 25496bf3ff7046778add937@%3Cdev.arrow.apache.org%3E
>
> it would be helpful. I personally think that Arrow would benefit from 
> an alternate protocol message type to the current RecordBatch (as 
> defined in Message.fbs) that allows for encoded or compressed columns.
> This won't be an overnight change (more on the order of months of 
> work), but it's worth taking the time to carefully consider the 
> implications of developing and supporting such a feature for the long 
> term
>
> On Thu, Jul 11, 2019 at 5:34 AM Fan Liya  wrote:
> >
> > Hi Radev,
> >
> > Thanks a lot for providing so much technical details. I need to read 
> > them carefully.
> >
> > I think FP encoding is definitely a useful feature.
> > I hope this feature can be implemented in Arrow soon, so that we can 
> > use
> it
> > in our system.
> >
> > Best,
> > Liya Fan
> >
> > On Thu, Jul 11, 2019 at 5:55 PM Radev, Martin 
> wrote:
> >
> > > Hello Liya Fan,
> > >
> > >
> > > this explains the technique but for a more complex case:
> > >
> > >
> https://fgiesen.wordpress.com/2011/01/24/x86-code-compression-in-kkrun
> chy/



> > >
> > > For FP data, the approach which seemed to be the best is the following.
> > >
> > > Say we have a buffer of two 32-bit floating point values:
> > >
> > > buf = [af, bf]
> > >
> > > We interpret each FP value as a 32-bit uint and look at each 
> > > individual byte. We have 8 bytes in total for this small input.
> > >
> > > buf = [af0, af1, af2, af3, bf0, bf1, bf2, bf3]
> > >
> > > Then we apply stream splitting and the new buffer becomes:
> > >
> > > newbuf = [af0, bf0, af1, bf1, af2, bf2, af3, bf3]
> > >
> > > We compress newbuf.
> > >
> > > Due to similarities the sign bits, mantissa bits and MSB exponent
> bits, we
> > > might have a lot more repetitions in data. For scientific data, 
> > > the
> 2nd and
> > > 3rd byte for 32-bit data is probably largely noise. Thus in the
> original
> > > representation we would always have a few bytes of data which 
> > > could
> appear
> > > somewhere else in the buffer and then a couple bytes of possible
> noise. In
> > > the new representation we have a long stream of data which could
> compress
> > > well and then a sequence of noise towards the end.
> > >
> > > This transformation improved compression ratio as can be seen in 
> > > the report.
> > >
> > > It also improved speed for ZSTD. This could be because ZSTD makes 
> > > a decision of how to compress the data - RLE, new huffman tree, 
> > > huffman
> tree
> > > of the previous frame, raw representation. Each can potentially
> achieve a
> > > different compression ratio and compression/decompression speed. 
> > > It
> turned
> > > out that when the transformation is applied, zstd would attempt to
> compress
> > > fewer frames and copy the other. This could lead to less attempts 
> > > to
> build
> > > a huffman tree. It's hard to pin-point the exact reason.
> > >
> > > I did not try other lossless text compressors but I expect similar
> results.
> > >
> > > For code, I can polish my patches, create a Jira task and submit 
> > > the patches for review.
> > >
> > >
> > > Regards,
> > >
> > > Martin
> > >
> > >
> > > 
> > > From: Fan Liya 
> > > Sent: Thursday, July 11, 2019 11:32:53 AM
> > > To: dev@ar

Re: Adding a new encoding for FP data

2019-07-11 Thread Radev, Martin
Hello Micah,


the changes will go to the C++ implementation of Parquet within Arrow.

In that sense, if Arrow uses the compression and encoding methods available in 
Parquet in any way, I expect a benefit.


My plan is to add the new encoding to parquet-cpp and parquer-mr (java).


If you have any more questions or concerns, let me know.

I am close to done with my patch.
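
For reference, a minimal sketch of the byte "stream splitting" transform
described in the quoted discussion below, assuming 32-bit floats (illustrative
only, not the actual patch):

  #include <cstdint>
  #include <vector>

  // Transpose the bytes of a float buffer so that all byte-0s come first,
  // then all byte-1s, and so on; the result is then handed to a compressor
  // such as zstd. For [af, bf] this yields [af0, bf0, af1, bf1, ...].
  std::vector<uint8_t> StreamSplit(const std::vector<float>& values) {
    const size_t n = values.size();
    std::vector<uint8_t> out(n * sizeof(float));
    const uint8_t* raw = reinterpret_cast<const uint8_t*>(values.data());
    for (size_t i = 0; i < n; ++i) {
      for (size_t b = 0; b < sizeof(float); ++b) {
        out[b * n + i] = raw[i * sizeof(float) + b];
      }
    }
    return out;
  }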


Regards,

Martin



From: Micah Kornfield 
Sent: Thursday, July 11, 2019 5:26:26 PM
To: dev@arrow.apache.org
Cc: Raoofy, Amir; Karlstetter, Roman
Subject: Re: Adding a new encoding for FP data

Hi Martin,
Can you clarify whether you were expecting the encoding to only be used in Parquet,
or more generally in Arrow?

Thanks,
Micah

On Thu, Jul 11, 2019 at 7:06 AM Wes McKinney  wrote:

> hi folks,
>
> If you could participate in Micah's discussion about compression and
> encoding generally at
>
>
> https://lists.apache.org/thread.html/a99124e57c14c3c9ef9d98f3c80cfe1dd25496bf3ff7046778add937@%3Cdev.arrow.apache.org%3E
>
> it would be helpful. I personally think that Arrow would benefit from
> an alternate protocol message type to the current RecordBatch (as
> defined in Message.fbs) that allows for encoded or compressed columns.
> This won't be an overnight change (more on the order of months of
> work), but it's worth taking the time to carefully consider the
> implications of developing and supporting such a feature for the long
> term
>
> On Thu, Jul 11, 2019 at 5:34 AM Fan Liya  wrote:
> >
> > Hi Radev,
> >
> > Thanks a lot for providing so much technical details. I need to read them
> > carefully.
> >
> > I think FP encoding is definitely a useful feature.
> > I hope this feature can be implemented in Arrow soon, so that we can use
> it
> > in our system.
> >
> > Best,
> > Liya Fan
> >
> > On Thu, Jul 11, 2019 at 5:55 PM Radev, Martin 
> wrote:
> >
> > > Hello Liya Fan,
> > >
> > >
> > > this explains the technique but for a more complex case:
> > >
> > >
> https://fgiesen.wordpress.com/2011/01/24/x86-code-compression-in-kkrunchy/



> > >
> > > For FP data, the approach which seemed to be the best is the following.
> > >
> > > Say we have a buffer of two 32-bit floating point values:
> > >
> > > buf = [af, bf]
> > >
> > > We interpret each FP value as a 32-bit uint and look at each individual
> > > byte. We have 8 bytes in total for this small input.
> > >
> > > buf = [af0, af1, af2, af3, bf0, bf1, bf2, bf3]
> > >
> > > Then we apply stream splitting and the new buffer becomes:
> > >
> > > newbuf = [af0, bf0, af1, bf1, af2, bf2, af3, bf3]
> > >
> > > We compress newbuf.
> > >
> > > Due to similarities in the sign bits, mantissa bits and MSB exponent
> bits, we
> > > might have a lot more repetitions in data. For scientific data, the
> 2nd and
> > > 3rd byte for 32-bit data is probably largely noise. Thus in the
> original
> > > representation we would always have a few bytes of data which could
> appear
> > > somewhere else in the buffer and then a couple bytes of possible
> noise. In
> > > the new representation we have a long stream of data which could
> compress
> > > well and then a sequence of noise towards the end.
> > >
> > > This transformation improved compression ratio as can be seen in the
> > > report.
> > >
> > > It also improved speed for ZSTD. This could be because ZSTD makes a
> > > decision of how to compress the data - RLE, new huffman tree, huffman
> tree
> > > of the previous frame, raw representation. Each can potentially
> achieve a
> > > different compression ratio and compression/decompression speed. It
> turned
> > > out that when the transformation is applied, zstd would attempt to
> compress
> > > fewer frames and copy the other. This could lead to less attempts to
> build
> > > a huffman tree. It's hard to pin-point the exact reason.
> > >
> > > I did not try other lossless text compressors but I expect similar
> results.
> > >
> > > For code, I can polish my patches, create a Jira task and submit the
> > > patches for review.
> > >
> > >
> > > Regards,
> > >
> > > Martin
> > >
> > >
> > > 
> > > From: Fan Liya 
> > > Sent: Thursday, July 11, 2019 11:32:53 AM
> > > To: dev@arrow.apache.org
> > > Cc: Raoofy, Amir; Karlstetter, Roman
> > > Subject: Re: Adding a new encoding for FP data
> > >
> > > Hi Radev,
> > >
> > > Thanks for the information. It seems interesting.
> > > IMO, Arrow has much to do for data compression. However, it seems
> there are
> > > some differences for memory data compression and external storage data
> > 

Re: [Discuss] IPC Specification, flatbuffers and unaligned memory accesses

2019-07-11 Thread Bryan Cutler
So the proposal here will still be backwards compatible with a 4 byte
prefix? Can you explain a little more how this might work if I have an
older version of Java using 4 byte prefix and a new version of C++/Python
with an 8 byte one for a roundtrip Java -> Python -> Java?
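
For reference, here is a minimal sketch of what an old 4-byte reader would see
when handed the proposed prefix, assuming the proposal is a 4-byte 0xFFFFFFFF
continuation marker followed by the usual 32-bit little-endian metadata length
(the exact framing is still under discussion):

import struct

METADATA_SIZE = 128  # example flatbuffer metadata length

# Current framing: a single 4-byte little-endian length prefix.
old_prefix = struct.pack('<i', METADATA_SIZE)

# Proposed framing: 4-byte continuation marker followed by the length.
new_prefix = struct.pack('<Ii', 0xFFFFFFFF, METADATA_SIZE)

# An old reader that interprets the first 4 bytes as a signed length
# sees -1 (an impossible length) instead of a garbage value.
assert struct.unpack('<i', new_prefix[:4])[0] == -1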

On Wed, Jul 10, 2019 at 6:11 AM Wes McKinney  wrote:

> The issue is fairly esoteric, so it will probably take more time to
> collect feedback. I could create a C++ implementation of this if it
> helps with the process.
>
> On Wed, Jul 10, 2019 at 2:25 AM Micah Kornfield 
> wrote:
> >
> > Does anybody else have thoughts on this?   Other language contributors?
> >
> > It seems like we still might not have enough of a consensus for a vote?
> >
> > Thanks,
> > Micah
> >
> >
> >
> >
> > On Tue, Jul 2, 2019 at 7:32 AM Wes McKinney  wrote:
> >
> > > Correct. The encapsulated IPC message will just be 4 bytes bigger.
> > >
> > > On Tue, Jul 2, 2019, 9:31 AM Antoine Pitrou 
> wrote:
> > >
> > > >
> > > > I guess I still dont understand how the IPC stream format works :-/
> > > >
> > > > To put it clearly: what happens in Flight?  Will a Flight message
> > > > automatically get the "stream continuation message" in front of it?
> > > >
> > > >
> > > > Le 02/07/2019 à 16:15, Wes McKinney a écrit :
> > > > > On Tue, Jul 2, 2019 at 4:23 AM Antoine Pitrou 
> > > > wrote:
> > > > >>
> > > > >>
> > > > >> Le 02/07/2019 à 00:20, Wes McKinney a écrit :
> > > > >>> Thanks for the references.
> > > > >>>
> > > > >>> If we decided to make a change around this, we could call the
> first 4
> > > > >>> bytes a stream continuation marker to make it slightly less ugly
> > > > >>>
> > > > >>> * 0xFFFFFFFF: continue
> > > > >>> * 0x00000000: stop
> > > > >>
> > > > >> Do you mean it would be a separate IPC message?
> > > > >
> > > > > No, I think this is only about how we could change the message
> prefix
> > > > > from 4 bytes to 8 bytes
> > > > >
> > > > >
> > > >
> > >
> https://github.com/apache/arrow/blob/master/docs/source/format/IPC.rst#encapsulated-message-format
> > > > >
> > > > > Currently a 0x00000000 (0 metadata size) is used as an
> end-of-stream
> > > > > marker. So what I was saying is that the first 8 bytes could be
> > > > >
> > > > > <4 bytes: stream continuation> <4 bytes: metadata size>
> > > > >
> > > > >>
> > > > >>
> > > > >>>
> > > > >>> On Mon, Jul 1, 2019 at 4:35 PM Micah Kornfield <
> > > emkornfi...@gmail.com>
> > > > wrote:
> > > > 
> > > >  Hi Wes,
> > > >  I'm not an expert on this either, my inclination mostly comes
> from
> > > > some research I've done.  I think it is important to distinguish two
> > > cases:
> > > >  1.  unaligned access at the processor instruction level
> > > >  2.  undefined behavior
> > > > 
> > > >  From my reading unaligned access is fine on most modern
> > > architectures
> > > > and it seems the performance penalty has mostly been eliminated.
> > > > 
> > > >  Undefined behavior is a compiler/language concept.  The problem
> is
> > > > the compiler can choose to do anything in UB scenarios, not just the
> > > > "obvious" translation.  Specifically, the compiler is under no
> obligation
> > > > to generate the unaligned access instructions, and if it doesn't
> SEGVs
> > > > ensue.  Two examples, both of which relate to SIMD optimizations are
> > > linked
> > > > below.
> > > > 
> > > >  I tend to be on the conservative side with this type of thing
> but if
> > > > we have experts on the the ML that can offer a more informed
> opinion, I
> > > > would love to hear it.
> > > > 
> > > >  [1]
> > > > http://pzemtsov.github.io/2016/11/06/bug-story-alignment-on-x86.html
> > > >  [2] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65709
> > > > 
> > > >  On Mon, Jul 1, 2019 at 1:41 PM Wes McKinney <
> wesmck...@gmail.com>
> > > > wrote:
> > > > >
> > > > > The <0xFFFFFFFF><metadata size> solution is downright ugly but I
> > > think
> > > > > it's one of the only ways that achieves
> > > > >
> > > > > * backward compatibility (new clients can read old data)
> > > > > * opt-in forward compatibility (if we want to go to the labor
> of
> > > > doing
> > > > > so, sort of dangerous)
> > > > > * old clients receiving new data do not blow up (they will see
> a
> > > > > metadata length of -1)
> > > > >
> > > > > NB <0xFFFFFFFF><128> would look like:
> > > > >
> > > > > In [13]: np.array([(2 << 32) - 1, 128], dtype=np.uint32)
> > > > > Out[13]: array([4294967295,128], dtype=uint32)
> > > > >
> > > > > In [14]: np.array([(2 << 32) - 1, 128],
> > > > > dtype=np.uint32).view(np.int32)
> > > > > Out[14]: array([ -1, 128], dtype=int32)
> > > > >
> > > > > In [15]: np.array([(2 << 32) - 1, 128],
> > > > dtype=np.uint32).view(np.uint8)
> > > > > Out[15]: array([255, 255, 255, 255, 128,   0,   0,   0],
> > > dtype=uint8)
> > > > >
> > > > > Flatbuffers are 32-bit limited so we don't need all 64 bits.
> > > > >
>

Re: [DISCUSS] Need for 0.14.1 release due to Python package problems, Parquet forward compatibility problems

2019-07-11 Thread Neal Richardson
I just moved https://issues.apache.org/jira/browse/ARROW-5850 from 1.0.0 to
0.14.1.

On Thu, Jul 11, 2019 at 8:12 AM Wes McKinney  wrote:

> To limit uncertainty, I'm going to start preparing a 0.14.1 patch
> release branch. I will update the list with the patches that are being
> cherry-picked. If other folks could give me a list of other PRs that
> need to be backported I will add them to the list. Any JIRA that needs
> to be included should have the "0.14.1" fix version added so we can
> keep track
>
> On Wed, Jul 10, 2019 at 9:48 PM Joris Van den Bossche
>  wrote:
> >
> > I personally prefer 0.14.1 over 0.15.0. I think that is clearer in
> > communication, as we are fixing regressions of the 0.14.0 release.
> >
> > (but I haven't been involved much in releases, so certainly no strong
> > opinion)
> >
> > Joris
> >
> >
> > Op wo 10 jul. 2019 om 15:07 schreef Wes McKinney :
> >
> > > hi folks,
> > >
> > > Are there any opinions / strong feelings about the two options:
> > >
> > > * Prepare patch 0.14.1 release from a maintenance branch
> > > * Release 0.15.0 out of master
> > >
> > > Aside from the Parquet forward compatibility issues we're still
> > > discussing, and Eric's C# patch PR 4836, are there any other issues
> > > that need to be fixed before we go down one of these paths?
> > >
> > > Would anyone like to help with release management? I can do so if
> > > necessary, but I've already done a lot of release management :)
> > >
> > > - Wes
> > >
> > > On Tue, Jul 9, 2019 at 4:13 PM Wes McKinney 
> wrote:
> > > >
> > > > Hi Eric -- of course!
> > > >
> > > > On Tue, Jul 9, 2019, 4:03 PM Eric Erhardt <
> eric.erha...@microsoft.com.invalid>
> > > wrote:
> > > >>
> > > >> Can we propose getting changes other than Python or Parquet related
> > > into this release?
> > > >>
> > > >> For example, I found a critical issue in the C# implementation
> that, if
> > > possible, I'd like to get included in a patch release.
> > > https://github.com/apache/arrow/pull/4836
> > > >>
> > > >> Eric
> > > >>
> > > >> -Original Message-
> > > >> From: Wes McKinney 
> > > >> Sent: Tuesday, July 9, 2019 7:59 AM
> > > >> To: dev@arrow.apache.org
> > > >> Subject: Re: [DISCUSS] Need for 0.14.1 release due to Python package
> > > problems, Parquet forward compatibility problems
> > > >>
> > > >> On Tue, Jul 9, 2019 at 12:02 AM Sutou Kouhei 
> > > wrote:
> > > >> >
> > > >> > Hi,
> > > >> >
> > > >> > > If the problems can be resolved quickly, I should think we
> could cut
> > > >> > > an RC for 0.14.1 by the end of this week. The RC could either
> be cut
> > > >> > > from a maintenance branch or out of master -- any thoughts about
> > > >> > > this (cutting from master is definitely easier)?
> > > >> >
> > > >> > How about just releasing 0.15.0 from master?
> > > >> > It'll be simpler than creating a patch release.
> > > >> >
> > > >>
> > > >> I'd be fine with that, too.
> > > >>
> > > >> >
> > > >> > Thanks,
> > > >> > --
> > > >> > kou
> > > >> >
> > > >> > In  > > nmvwuy8wxxddcctobuuamy4ee...@mail.gmail.com>
> > > >> >   "[DISCUSS] Need for 0.14.1 release due to Python package
> problems,
> > > Parquet forward compatibility problems" on Mon, 8 Jul 2019 11:32:07
> -0500,
> > > >> >   Wes McKinney  wrote:
> > > >> >
> > > >> > > hi folks,
> > > >> > >
> > > >> > > Perhaps unsurprisingly due to the expansion of our Python
> packages,
> > > >> > > a number of things are broken in 0.14.0 that we should fix
> sooner
> > > >> > > than the next major release. I'll try to send a complete list to
> > > >> > > this thread to give a status within a day or two. Other
> problems may
> > > >> > > arise in the next 48 hours as more people install the package.
> > > >> > >
> > > >> > > If the problems can be resolved quickly, I should think we
> could cut
> > > >> > > an RC for 0.14.1 by the end of this week. The RC could either
> be cut
> > > >> > > from a maintenance branch or out of master -- any thoughts about
> > > >> > > this (cutting from master is definitely easier)?
> > > >> > >
> > > >> > > Would someone (who is not Kou) be able to assist with creating
> the
> > > RC?
> > > >> > >
> > > >> > > Thanks,
> > > >> > > Wes
> > >
>


Re: [Python] Wheel questions

2019-07-11 Thread Antoine Pitrou


Le 11/07/2019 à 17:52, Krisztián Szűcs a écrit :
> Hi All,
> 
> I have a couple of questions about the wheel packaging:
> - why do we build an arrow namespaced boost on linux and osx, could we link
> statically like with the windows wheels?

No idea.  Boost shouldn't leak in the public APIs, so theoretically a
static build would be fine...

> - do we explicitly say somewhere in the linux wheels to link the 3rdparty
> dependencies statically or just implicitly, by removing (or not building)
> the shared libs for the 3rdparty dependencies?

It's implicit by removing the shared libs (or not building them).
Some time ago the compression libs were always linked statically by
default, but it was changed to dynamic over time, probably to
please system packagers.

> - couldn't we use the 3rdparty toolchain to build the smaller 3rdparty
> dependencies for the linux wheels instead of building them manually in the
> manylinux docker image - it'd be easier to say _SOURCE=BUNDLED

I don't think so.  The conda-forge and Anaconda packages use a different
build chain (different compiler, different libstdc++ version) and may
not be usable directly on manylinux-compliant systems.

Regards

Antoine.


[Python] Wheel questions

2019-07-11 Thread Krisztián Szűcs
Hi All,

I have a couple of questions about the wheel packaging:
- why do we build an arrow namespaced boost on linux and osx, could we link
statically like with the windows wheels?
- do we explicitly say somewhere in the linux wheels to link the 3rdparty
dependencies statically or just implicitly, by removing (or not building)
the shared libs for the 3rdparty dependencies?
- couldn't we use the 3rdparty toolchain to build the smaller 3rdparty
dependencies for the linux wheels instead of building them manually in the
manylinux docker image - it'd be easier to say _SOURCE=BUNDLED

Regards, Krisztian


Re: Adding a new encoding for FP data

2019-07-11 Thread Micah Kornfield
Hi Martin,
Can you clarify whether you were expecting the encoding to only be used in Parquet,
or more generally in Arrow?

Thanks,
Micah

On Thu, Jul 11, 2019 at 7:06 AM Wes McKinney  wrote:

> hi folks,
>
> If you could participate in Micah's discussion about compression and
> encoding generally at
>
>
> https://lists.apache.org/thread.html/a99124e57c14c3c9ef9d98f3c80cfe1dd25496bf3ff7046778add937@%3Cdev.arrow.apache.org%3E
>
> it would be helpful. I personally think that Arrow would benefit from
> an alternate protocol message type to the current RecordBatch (as
> defined in Message.fbs) that allows for encoded or compressed columns.
> This won't be an overnight change (more on the order of months of
> work), but it's worth taking the time to carefully consider the
> implications of developing and supporting such a feature for the long
> term
>
> On Thu, Jul 11, 2019 at 5:34 AM Fan Liya  wrote:
> >
> > Hi Radev,
> >
> > Thanks a lot for providing so many technical details. I need to read them
> > carefully.
> >
> > I think FP encoding is definitely a useful feature.
> > I hope this feature can be implemented in Arrow soon, so that we can use
> it
> > in our system.
> >
> > Best,
> > Liya Fan
> >
> > On Thu, Jul 11, 2019 at 5:55 PM Radev, Martin 
> wrote:
> >
> > > Hello Liya Fan,
> > >
> > >
> > > this explains the technique but for a more complex case:
> > >
> > >
> https://fgiesen.wordpress.com/2011/01/24/x86-code-compression-in-kkrunchy/
> > >
> > > For FP data, the approach which seemed to be the best is the following.
> > >
> > > Say we have a buffer of two 32-bit floating point values:
> > >
> > > buf = [af, bf]
> > >
> > > We interpret each FP value as a 32-bit uint and look at each individual
> > > byte. We have 8 bytes in total for this small input.
> > >
> > > buf = [af0, af1, af2, af3, bf0, bf1, bf2, bf3]
> > >
> > > Then we apply stream splitting and the new buffer becomes:
> > >
> > > newbuf = [af0, bf0, af1, bf1, af2, bf2, af3, bf3]
> > >
> > > We compress newbuf.
> > >
> > > Due to similarities in the sign bits, mantissa bits and MSB exponent
> bits, we
> > > might have a lot more repetitions in data. For scientific data, the
> 2nd and
> > > 3rd byte for 32-bit data is probably largely noise. Thus in the
> original
> > > representation we would always have a few bytes of data which could
> appear
> > > somewhere else in the buffer and then a couple bytes of possible
> noise. In
> > > the new representation we have a long stream of data which could
> compress
> > > well and then a sequence of noise towards the end.
> > >
> > > This transformation improved compression ratio as can be seen in the
> > > report.
> > >
> > > It also improved speed for ZSTD. This could be because ZSTD makes a
> > > decision of how to compress the data - RLE, new huffman tree, huffman
> tree
> > > of the previous frame, raw representation. Each can potentially
> achieve a
> > > different compression ratio and compression/decompression speed. It
> turned
> > > out that when the transformation is applied, zstd would attempt to
> compress
> > > fewer frames and copy the other. This could lead to less attempts to
> build
> > > a huffman tree. It's hard to pin-point the exact reason.
> > >
> > > I did not try other lossless text compressors but I expect similar
> results.
> > >
> > > For code, I can polish my patches, create a Jira task and submit the
> > > patches for review.
> > >
> > >
> > > Regards,
> > >
> > > Martin
> > >
> > >
> > > 
> > > From: Fan Liya 
> > > Sent: Thursday, July 11, 2019 11:32:53 AM
> > > To: dev@arrow.apache.org
> > > Cc: Raoofy, Amir; Karlstetter, Roman
> > > Subject: Re: Adding a new encoding for FP data
> > >
> > > Hi Radev,
> > >
> > > Thanks for the information. It seems interesting.
> > > IMO, Arrow has much to do for data compression. However, it seems
> there are
> > > some differences for memory data compression and external storage data
> > > compression.
> > >
> > > Could you please provide some reference for stream splitting?
> > >
> > > Best,
> > > Liya Fan
> > >
> > > On Thu, Jul 11, 2019 at 5:15 PM Radev, Martin 
> wrote:
> > >
> > > > Hello people,
> > > >
> > > >
> > > > there has been discussion in the Apache Parquet mailing list on
> adding a
> > > > new encoder for FP data.
> > > > The reason for this is that the supported compressors by Apache
> Parquet
> > > > (zstd, gzip, etc) do not compress well raw FP data.
> > > >
> > > >
> > > > In my investigation it turns out that a very simple technique,
> > > > named stream splitting, can improve the compression ratio and even
> speed
> > > > for some of the compressors.
> > > >
> > > > You can read about the results here:
> > > >
> https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view
> > > >
> > > >
> > > > I went through the developer guide for Apache Arrow and wrote a
> patch to
> > > > add the new encoding and test coverage for it.
> > > >
> > > > I will polish my patch and wor

Re: [DISCUSS] Need for 0.14.1 release due to Python package problems, Parquet forward compatibility problems

2019-07-11 Thread Wes McKinney
To limit uncertainty, I'm going to start preparing a 0.14.1 patch
release branch. I will update the list with the patches that are being
cherry-picked. If other folks could give me a list of other PRs that
need to be backported I will add them to the list. Any JIRA that needs
to be included should have the "0.14.1" fix version added so we can
keep track

On Wed, Jul 10, 2019 at 9:48 PM Joris Van den Bossche
 wrote:
>
> I personally prefer 0.14.1 over 0.15.0. I think that is clearer in
> communication, as we are fixing regressions of the 0.14.0 release.
>
> (but I haven't been involved much in releases, so certainly no strong
> opinion)
>
> Joris
>
>
> Op wo 10 jul. 2019 om 15:07 schreef Wes McKinney :
>
> > hi folks,
> >
> > Are there any opinions / strong feelings about the two options:
> >
> > * Prepare patch 0.14.1 release from a maintenance branch
> > * Release 0.15.0 out of master
> >
> > Aside from the Parquet forward compatibility issues we're still
> > discussing, and Eric's C# patch PR 4836, are there any other issues
> > that need to be fixed before we go down one of these paths?
> >
> > Would anyone like to help with release management? I can do so if
> > necessary, but I've already done a lot of release management :)
> >
> > - Wes
> >
> > On Tue, Jul 9, 2019 at 4:13 PM Wes McKinney  wrote:
> > >
> > > Hi Eric -- of course!
> > >
> > > On Tue, Jul 9, 2019, 4:03 PM Eric Erhardt 
> > > 
> > wrote:
> > >>
> > >> Can we propose getting changes other than Python or Parquet related
> > into this release?
> > >>
> > >> For example, I found a critical issue in the C# implementation that, if
> > possible, I'd like to get included in a patch release.
> > https://github.com/apache/arrow/pull/4836
> > >>
> > >> Eric
> > >>
> > >> -Original Message-
> > >> From: Wes McKinney 
> > >> Sent: Tuesday, July 9, 2019 7:59 AM
> > >> To: dev@arrow.apache.org
> > >> Subject: Re: [DISCUSS] Need for 0.14.1 release due to Python package
> > problems, Parquet forward compatibility problems
> > >>
> > >> On Tue, Jul 9, 2019 at 12:02 AM Sutou Kouhei 
> > wrote:
> > >> >
> > >> > Hi,
> > >> >
> > >> > > If the problems can be resolved quickly, I should think we could cut
> > >> > > an RC for 0.14.1 by the end of this week. The RC could either be cut
> > >> > > from a maintenance branch or out of master -- any thoughts about
> > >> > > this (cutting from master is definitely easier)?
> > >> >
> > >> > How about just releasing 0.15.0 from master?
> > >> > It'll be simpler than creating a patch release.
> > >> >
> > >>
> > >> I'd be fine with that, too.
> > >>
> > >> >
> > >> > Thanks,
> > >> > --
> > >> > kou
> > >> >
> > >> > In  > nmvwuy8wxxddcctobuuamy4ee...@mail.gmail.com>
> > >> >   "[DISCUSS] Need for 0.14.1 release due to Python package problems,
> > Parquet forward compatibility problems" on Mon, 8 Jul 2019 11:32:07 -0500,
> > >> >   Wes McKinney  wrote:
> > >> >
> > >> > > hi folks,
> > >> > >
> > >> > > Perhaps unsurprisingly due to the expansion of our Python packages,
> > >> > > a number of things are broken in 0.14.0 that we should fix sooner
> > >> > > than the next major release. I'll try to send a complete list to
> > >> > > this thread to give a status within a day or two. Other problems may
> > >> > > arise in the next 48 hours as more people install the package.
> > >> > >
> > >> > > If the problems can be resolved quickly, I should think we could cut
> > >> > > an RC for 0.14.1 by the end of this week. The RC could either be cut
> > >> > > from a maintenance branch or out of master -- any thoughts about
> > >> > > this (cutting from master is definitely easier)?
> > >> > >
> > >> > > Would someone (who is not Kou) be able to assist with creating the
> > RC?
> > >> > >
> > >> > > Thanks,
> > >> > > Wes
> >


[jira] [Created] (ARROW-5913) Add support for Parquet's BYTE_STREAM_SPLIT encoding

2019-07-11 Thread Martin Radev (JIRA)
Martin Radev created ARROW-5913:
---

 Summary: Add support for Parquet's BYTE_STREAM_SPLIT encoding
 Key: ARROW-5913
 URL: https://issues.apache.org/jira/browse/ARROW-5913
 Project: Apache Arrow
  Issue Type: Wish
  Components: C++
Reporter: Martin Radev


*From the Parquet issue ( https://issues.apache.org/jira/browse/PARQUET-1622 ):*

Apache Parquet does not have any encodings suitable for FP data and the 
available text compressors (zstd, gzip, etc) do not handle FP data very well.

It is possible to apply a simple data transformation named "stream splitting". 
One such transformation is "byte stream splitting", which creates K streams of 
length N where K is the number of bytes in the data type (4 for floats, 8 for 
doubles) and N is the number of elements in the sequence.
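
As an illustration, here is a rough NumPy sketch of the transform (not the 
proposed Parquet encoder itself):

{code:python}
import numpy as np

def byte_stream_split(values):
    # View the values as an (N, K) table of raw bytes, K = bytes per element.
    raw = values.view(np.uint8).reshape(len(values), values.dtype.itemsize)
    # Emit all first bytes, then all second bytes, and so on.
    return raw.T.tobytes()

def byte_stream_merge(data, dtype, count):
    # Inverse transform: rebuild the original values from the K byte streams.
    k = np.dtype(dtype).itemsize
    raw = np.frombuffer(data, dtype=np.uint8).reshape(k, count)
    return np.ascontiguousarray(raw.T).view(dtype).reshape(count)

a = np.array([1.5, -2.25, 3e-4], dtype=np.float32)
assert np.array_equal(byte_stream_merge(byte_stream_split(a), np.float32, len(a)), a)
{code}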

The transformed data compresses significantly better on average than the 
original data, and in some cases there is also an improvement in compression 
and decompression speed.

You can read a more detailed report here:
[https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view
]

*Apache Arrow can benefit from the reduced requirements for storing FP parquet 
column data and improvements in decompression speed.*



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


Re: Adding a new encoding for FP data

2019-07-11 Thread Wes McKinney
hi folks,

If you could participate in Micah's discussion about compression and
encoding generally at

https://lists.apache.org/thread.html/a99124e57c14c3c9ef9d98f3c80cfe1dd25496bf3ff7046778add937@%3Cdev.arrow.apache.org%3E

it would be helpful. I personally think that Arrow would benefit from
an alternate protocol message type to the current RecordBatch (as
defined in Message.fbs) that allows for encoded or compressed columns.
This won't be an overnight change (more on the order of months of
work), but it's worth taking the time to carefully consider the
implications of developing and supporting such a feature for the long
term

On Thu, Jul 11, 2019 at 5:34 AM Fan Liya  wrote:
>
> Hi Radev,
>
> Thanks a lot for providing so many technical details. I need to read them
> carefully.
>
> I think FP encoding is definitely a useful feature.
> I hope this feature can be implemented in Arrow soon, so that we can use it
> in our system.
>
> Best,
> Liya Fan
>
> On Thu, Jul 11, 2019 at 5:55 PM Radev, Martin  wrote:
>
> > Hello Liya Fan,
> >
> >
> > this explains the technique but for a more complex case:
> >
> > https://fgiesen.wordpress.com/2011/01/24/x86-code-compression-in-kkrunchy/
> >
> > For FP data, the approach which seemed to be the best is the following.
> >
> > Say we have a buffer of two 32-bit floating point values:
> >
> > buf = [af, bf]
> >
> > We interpret each FP value as a 32-bit uint and look at each individual
> > byte. We have 8 bytes in total for this small input.
> >
> > buf = [af0, af1, af2, af3, bf0, bf1, bf2, bf3]
> >
> > Then we apply stream splitting and the new buffer becomes:
> >
> > newbuf = [af0, bf0, af1, bf1, af2, bf2, af3, bf3]
> >
> > We compress newbuf.
> >
> > Due to similarities in the sign bits, mantissa bits and MSB exponent bits, we
> > might have a lot more repetitions in data. For scientific data, the 2nd and
> > 3rd byte for 32-bit data is probably largely noise. Thus in the original
> > representation we would always have a few bytes of data which could appear
> > somewhere else in the buffer and then a couple bytes of possible noise. In
> > the new representation we have a long stream of data which could compress
> > well and then a sequence of noise towards the end.
> >
> > This transformation improved compression ratio as can be seen in the
> > report.
> >
> > It also improved speed for ZSTD. This could be because ZSTD makes a
> > decision of how to compress the data - RLE, new huffman tree, huffman tree
> > of the previous frame, raw representation. Each can potentially achieve a
> > different compression ratio and compression/decompression speed. It turned
> > out that when the transformation is applied, zstd would attempt to compress
> > fewer frames and copy the other. This could lead to less attempts to build
> > a huffman tree. It's hard to pin-point the exact reason.
> >
> > I did not try other lossless text compressors but I expect similar results.
> >
> > For code, I can polish my patches, create a Jira task and submit the
> > patches for review.
> >
> >
> > Regards,
> >
> > Martin
> >
> >
> > 
> > From: Fan Liya 
> > Sent: Thursday, July 11, 2019 11:32:53 AM
> > To: dev@arrow.apache.org
> > Cc: Raoofy, Amir; Karlstetter, Roman
> > Subject: Re: Adding a new encoding for FP data
> >
> > Hi Radev,
> >
> > Thanks for the information. It seems interesting.
> > IMO, Arrow has much to do for data compression. However, it seems there are
> > some differences for memory data compression and external storage data
> > compression.
> >
> > Could you please provide some reference for stream splitting?
> >
> > Best,
> > Liya Fan
> >
> > On Thu, Jul 11, 2019 at 5:15 PM Radev, Martin  wrote:
> >
> > > Hello people,
> > >
> > >
> > > there has been discussion in the Apache Parquet mailing list on adding a
> > > new encoder for FP data.
> > > The reason for this is that the supported compressors by Apache Parquet
> > > (zstd, gzip, etc) do not compress well raw FP data.
> > >
> > >
> > > In my investigation it turns out that a very simple technique,
> > > named stream splitting, can improve the compression ratio and even speed
> > > for some of the compressors.
> > >
> > > You can read about the results here:
> > > https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view
> > >
> > >
> > > I went through the developer guide for Apache Arrow and wrote a patch to
> > > add the new encoding and test coverage for it.
> > >
> > > I will polish my patch and work in parallel to extend the Apache Parquet
> > > format for the new encoding.
> > >
> > >
> > > If you have any concerns, please let me know.
> > >
> > >
> > > Regards,
> > >
> > > Martin
> > >
> > >
> >


[jira] [Created] (ARROW-5912) [Python] conversion from datetime objects with mixed timezones should normalize to UTC

2019-07-11 Thread Joris Van den Bossche (JIRA)
Joris Van den Bossche created ARROW-5912:


 Summary: [Python] conversion from datetime objects with mixed 
timezones should normalize to UTC
 Key: ARROW-5912
 URL: https://issues.apache.org/jira/browse/ARROW-5912
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche


Currently, when having objects with mixed timezones, they are each separately 
interpreted as their local time:

{code:python}
>>> ts_pd_paris = pd.Timestamp("1970-01-01 01:00", tz="Europe/Paris")
>>> ts_pd_paris
Timestamp('1970-01-01 01:00:00+0100', tz='Europe/Paris')
>>> ts_pd_helsinki = pd.Timestamp("1970-01-01 02:00", tz="Europe/Helsinki")
>>> ts_pd_helsinki
Timestamp('1970-01-01 02:00:00+0200', tz='Europe/Helsinki')

>>> a = pa.array([ts_pd_paris, ts_pd_helsinki])
>>> a

[
  1970-01-01 01:00:00.00,
  1970-01-01 02:00:00.00
]
>>> a.type
TimestampType(timestamp[us])
{code}

So both times are actually about the same moment in time (the same value in 
UTC; in pandas their stored {{value}} is also the same), but once converted to 
pyarrow, they are both tz-naive but no longer the same time. That seems rather 
unexpected and a source for bugs.

I think a better option would be to normalize to UTC, and result in a tz-aware 
TimestampArray with UTC as timezone. 
That is also the behaviour of pandas if you force the conversion to result in 
datetimes (by default pandas will keep them as object array preserving the 
different timezones).
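
For reference, this is the pandas behaviour referred to above when a datetime 
result is forced, shown here only to illustrate the proposed normalize-to-UTC 
semantics:

{code:python}
>>> import pandas as pd
>>> ts_pd_paris = pd.Timestamp("1970-01-01 01:00", tz="Europe/Paris")
>>> ts_pd_helsinki = pd.Timestamp("1970-01-01 02:00", tz="Europe/Helsinki")
>>> pd.to_datetime([ts_pd_paris, ts_pd_helsinki], utc=True)
DatetimeIndex(['1970-01-01 00:00:00+00:00', '1970-01-01 00:00:00+00:00'],
              dtype='datetime64[ns, UTC]', freq=None)
{code}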



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


Re: New CI system: Ursabot

2019-07-11 Thread Krisztián Szűcs
Hi Eric!

On Thu, Jul 11, 2019 at 3:34 PM Eric Erhardt
 wrote:

> My apologies if this is already covered in the docs, but I couldn't find
> it.
>
> How do I re-run a single leg in the Ursabot tests? The 'AMD64 Debian 9
> Rust 1.35' failed on my PR, and I wanted to try re-running just that leg,
> but the only option I found was to re-run all Ursabot legs.
>
Currently you can't restart a single builder, just the whole buildset.
There is a ticket about supporting @ursabot build 
command and another ticket to provide control access for apache
members. Once the latter one is set up, I can also grant access
for contributors outside of apache.

BTW I think you can safely ignore the rust failure because it uses
the nightly toolchain.

>
> Eric
>
> -Original Message-
> From: Krisztián Szűcs 
> Sent: Friday, June 14, 2019 9:48 AM
> To: dev@arrow.apache.org
> Subject: New CI system: Ursabot
>
> Hello All,
>
> We're developing a buildbot application to utilize Ursa Labs’
> physical machines called Ursabot. Buildbot [1] is used by major open
> source projects, like CPython and WebKit [2].
>
> The source code is hosted at [3], the web interface is accessible at [4].
> The repository contains a short guide about the goals, implementation and
> the interfaces we can drive ursabot. The most notable way to trigger
> ursabot builds is via sending github comments mentioning @ursabot machine
> account, for more see [5].
>
> Currently we have builders for the C++ implementation and the Python
> bindings on AMD64 and ARM64 architectures.
> It is quite easy to attach workers to the buildmaster [7], so we can scale
> our build cluster to test and run on-demand builds (like benchmarks,
> packaging tasks) on more platforms.
>
> Yesterday we've enabled the github status push reporter to improve the
> visibility of ursabot, although we were testing the builders in the last
> couple of weeks. I hope no one has a hard objection against this new CI.
> Arrow has already started to outgrow Travis-CI and Appveyor's capacity and
> we're trying to make the build system quicker and more robust.
>
> Please don't hesitate to ask any questions!
>
> Thanks, Krisztian
>
> [1]: http://buildbot.net/
> [2]: https://github.com/buildbot/buildbot/wiki/SuccessStories
> [3]: https://github.com/ursa-labs/ursabot
> [4]: https://ci.ursalabs.org
> [5]: https://github.com/ursa-labs/ursabot#driving-ursabot
> [7]: https://github.com/ursa-labs/ursabot/blob/master/default.yaml#L115
>


Re: [Discuss] Support an alternative memory layout for varchar/varbinary vectors

2019-07-11 Thread Wes McKinney
hi Liya -- have you thought about implementing this as an
ExtensionType / ExtensionVector? You actually can already do this, so
if this helps you reference strings stored in some external memory
then that seems reasonable. Such a PointerStringVector could have a
method that converts it into the Arrow varbinary columnar
representation.

You wouldn't be able to put such an object into the IPC binary
protocol, though. If that's a requirement (being able to use the IPC
protocol) for this kind of data, before going any further in the
discussion I would suggest that you work out exactly how such data
would be moved from one process address space to another (using
Buffers).

- Wes

On Thu, Jul 11, 2019 at 7:35 AM Uwe L. Korn  wrote:
>
> Hello Liya Fan,
>
> here your best approach is to copy into the Arrow format as you can then use 
> this as the basis for working with the Arrow-native representation as well as 
> your internal representation. You will have to use two different offset 
> vector as those two will always differ but in the case of your internal 
> representation, you don't have the requirement of consecutive data as Arrow 
> has but you can still work with the strings just as before even when stored 
> consecutively.
>
> Uwe
>
> On Thu, Jul 11, 2019, at 2:24 PM, Fan Liya wrote:
> > Hi Korn,
> >
> > Thanks a lot for your comments.
> >
> > In my opinion, your comments make sense to me. Allowing non-consecutive
> > memory segments will break some good design choices of Arrow.
> > However, there are wide-spread user requirements for non-consecutive memory
> > segments. I am wondering how we can help such users. What advice can we
> > give them?
> >
> > Memory copy/move can be a solution, but is there a better solution?
> > Is there a third alternative? Can we virtualize the non-consecutive memory
> > segments into a consecutive one? (Although performance overhead is
> > unavoidable.)
> >
> > What do you think? Let's brain-storm it.
> >
> > Best,
> > Liya Fan
> >
> >
> > On Thu, Jul 11, 2019 at 8:05 PM Uwe L. Korn  wrote:
> >
> > > Hello Liya,
> > >
> > > I'm quite -1 on this type as Arrow is about efficient columnar structures.
> > > We have opened the standard also to matrix-like types but always keep the
> > > constraint of consecutive memory. Now also adding types where memory is no
> > > longer consecutive but spread in the heap will make the scope of the
> > > project much wider (It seems that we then just turn into a general
> > > serialization framework).
> > >
> > > One of the ideas of a common standard is that some need to make
> > > compromises. I think in this case it is a necessary compromise to not 
> > > allow
> > > all kind of string representations.
> > >
> > > Uwe
> > >
> > > On Thu, Jul 11, 2019, at 6:01 AM, Fan Liya wrote:
> > > > Hi all,
> > > >
> > > >
> > > > We are thinking of providing varchar/varbinary vectors with a different
> > > > memory layout which exists in a wide range of systems. The memory layout
> > > is
> > > > different from that of VarCharVector in the following ways:
> > > >
> > > >
> > > >1.
> > > >
> > > >Instead of storing (start offset, end offset), the new layout stores
> > > >(start offset, length)
> > > >2.
> > > >
> > > >The content of varchars may not be in a consecutive memory region.
> > > >Instead, it can be in arbitrary memory address.
> > > >
> > > >
> > > > Due to these differences in memory layout, it incurs performance 
> > > > overhead
> > > > when converting data between existing systems and VarCharVectors.
> > > >
> > > > The above difference 1 seems insignificant, while difference 2 is
> > > difficult
> > > > to overcome. However, the scenario of difference 2 is prevalent in
> > > > practice: for example we store strings in a series of memory segments.
> > > > Whenever a segment is full, we request a new one. However, these memory
> > > > segments may not be consecutive, because other processes/threads are 
> > > > also
> > > > requesting/releasing memory segments in the meantime.
> > > >
> > > > So we are wondering if it is possible to support such memory layout in
> > > > Arrow. I think there are more systems that are trying to adopt Arrow,
> > > > but are hindered by such difficulty.
> > > >
> > > > Would you please give your valuable feedback?
> > > >
> > > >
> > > > Best,
> > > >
> > > > Liya Fan
> > > >
> > >
> >


RE: New CI system: Ursabot

2019-07-11 Thread Eric Erhardt
My apologies if this is already covered in the docs, but I couldn't find it.

How do I re-run a single leg in the Ursabot tests? The 'AMD64 Debian 9 Rust 
1.35' failed on my PR, and I wanted to try re-running just that leg, but the 
only option I found was to re-run all Ursabot legs.

Eric

-Original Message-
From: Krisztián Szűcs  
Sent: Friday, June 14, 2019 9:48 AM
To: dev@arrow.apache.org
Subject: New CI system: Ursabot

Hello All,

We're developing a buildbot application to utilize Ursa Labs’
physical machines called Ursabot. Buildbot [1] is used by major open source 
projects, like CPython and WebKit [2].

The source code is hosted at [3], the web interface is accessible at [4]. The 
repository contains a short guide about the goals, implementation and the 
interfaces we can drive ursabot. The most notable way to trigger ursabot builds 
is via sending github comments mentioning @ursabot machine account, for more 
see [5].

Currently we have builders for the C++ implementation and the Python bindings 
on AMD64 and ARM64 architectures.
It is quite easy to attach workers to the buildmaster [7], so we can scale our 
build cluster to test and run on-demand builds (like benchmarks, packaging 
tasks) on more platforms.

Yesterday we've enabled the github status push reporter to improve the 
visibility of ursabot, although we were testing the builders in the last couple 
of weeks. I hope no one has a hard objection against this new CI. Arrow has 
already started to outgrow Travis-CI and Appveyor's capacity and we're trying 
to make the build system quicker and more robust.

Please don't hesitate to ask any questions!

Thanks, Krisztian

[1]: http://buildbot.net/
[2]: https://github.com/buildbot/buildbot/wiki/SuccessStories
[3]: https://github.com/ursa-labs/ursabot
[4]: https://ci.ursalabs.org
[5]: https://github.com/ursa-labs/ursabot#driving-ursabot
[7]: https://github.com/ursa-labs/ursabot/blob/master/default.yaml#L115


Re: [Discuss] Support an alternative memory layout for varchar/varbinary vectors

2019-07-11 Thread Uwe L. Korn
Hello Liya Fan,

here your best approach is to copy into the Arrow format as you can then use 
this as the basis for working with the Arrow-native representation as well as 
your internal representation. You will have to use two different offset vector 
as those two will always differ but in the case of your internal 
representation, you don't have the requirement of consecutive data as Arrow has 
but you can still work with the strings just as before even when stored 
consecutively.

Uwe

On Thu, Jul 11, 2019, at 2:24 PM, Fan Liya wrote:
> Hi Korn,
> 
> Thanks a lot for your comments.
> 
> In my opinion, your comments make sense to me. Allowing non-consecutive
> memory segments will break some good design choices of Arrow.
> However, there are wide-spread user requirements for non-consecutive memory
> segments. I am wondering how we can help such users. What advice can we
> give them?
> 
> Memory copy/move can be a solution, but is there a better solution?
> Is there a third alternative? Can we virtualize the non-consecutive memory
> segments into a consecutive one? (Although performance overhead is
> unavoidable.)
> 
> What do you think? Let's brain-storm it.
> 
> Best,
> Liya Fan
> 
> 
> On Thu, Jul 11, 2019 at 8:05 PM Uwe L. Korn  wrote:
> 
> > Hello Liya,
> >
> > I'm quite -1 on this type as Arrow is about efficient columnar structures.
> > We have opened the standard also to matrix-like types but always keep the
> > constraint of consecutive memory. Now also adding types where memory is no
> > longer consecutive but spread in the heap will make the scope of the
> > project much wider (It seems that we then just turn into a general
> > serialization framework).
> >
> > One of the ideas of a common standard is that some need to make
> > compromises. I think in this case it is a necessary compromise to not allow
> > all kind of string representations.
> >
> > Uwe
> >
> > On Thu, Jul 11, 2019, at 6:01 AM, Fan Liya wrote:
> > > Hi all,
> > >
> > >
> > > We are thinking of providing varchar/varbinary vectors with a different
> > > memory layout which exists in a wide range of systems. The memory layout
> > is
> > > different from that of VarCharVector in the following ways:
> > >
> > >
> > >1.
> > >
> > >Instead of storing (start offset, end offset), the new layout stores
> > >(start offset, length)
> > >2.
> > >
> > >The content of varchars may not be in a consecutive memory region.
> > >Instead, it can be in arbitrary memory address.
> > >
> > >
> > > Due to these differences in memory layout, it incurs performance overhead
> > > when converting data between existing systems and VarCharVectors.
> > >
> > > The above difference 1 seems insignificant, while difference 2 is
> > difficult
> > > to overcome. However, the scenario of difference 2 is prevalent in
> > > practice: for example we store strings in a series of memory segments.
> > > Whenever a segment is full, we request a new one. However, these memory
> > > segments may not be consecutive, because other processes/threads are also
> > > requesting/releasing memory segments in the meantime.
> > >
> > > So we are wondering if it is possible to support such memory layout in
> > > Arrow. I think there are more systems that are trying to adopt Arrow,
> > > but are hindered by such difficulty.
> > >
> > > Would you please give your valuable feedback?
> > >
> > >
> > > Best,
> > >
> > > Liya Fan
> > >
> >
>


Re: [Discuss] Support an alternative memory layout for varchar/varbinary vectors

2019-07-11 Thread Antoine Pitrou


Same as Uwe.

Regards

Antoine.


Le 11/07/2019 à 14:05, Uwe L. Korn a écrit :
> Hello Liya,
> 
> I'm quite -1 on this type as Arrow is about efficient columnar structures. We 
> have opened the standard also to matrix-like types but always keep the 
> constraint of consecutive memory. Now also adding types where memory is no 
> longer consecutive but spread in the heap will make the scope of the project 
> much wider (It seems that we then just turn into a general serialization 
> framework).
> 
> One of the ideas of a common standard is that some need to make compromises. 
> I think in this case it is a necessary compromise to not allow all kind of 
> string representations.
> 
> Uwe
> 
> On Thu, Jul 11, 2019, at 6:01 AM, Fan Liya wrote:
>> Hi all,
>>
>>
>> We are thinking of providing varchar/varbinary vectors with a different
>> memory layout which exists in a wide range of systems. The memory layout is
>> different from that of VarCharVector in the following ways:
>>
>>
>>1.
>>
>>Instead of storing (start offset, end offset), the new layout stores
>>(start offset, length)
>>2.
>>
>>The content of varchars may not be in a consecutive memory region.
>>Instead, it can be in arbitrary memory address.
>>
>>
>> Due to these differences in memory layout, it incurs performance overhead
>> when converting data between existing systems and VarCharVectors.
>>
>> The above difference 1 seems insignificant, while difference 2 is difficult
>> to overcome. However, the scenario of difference 2 is prevalent in
>> practice: for example we store strings in a series of memory segments.
>> Whenever a segment is full, we request a new one. However, these memory
>> segments may not be consecutive, because other processes/threads are also
>> requesting/releasing memory segments in the meantime.
>>
>> So we are wondering if it is possible to support such memory layout in
>> Arrow. I think there are more systems that are trying to adopt Arrow,
>> but are hindered by such difficulty.
>>
>> Would you please give your valuable feedback?
>>
>>
>> Best,
>>
>> Liya Fan
>>


Re: [Discuss] Support an alternative memory layout for varchar/varbinary vectors

2019-07-11 Thread Fan Liya
Hi Korn,

Thanks a lot for your comments.

In my opinion, your comments make sense to me. Allowing non-consecutive
memory segments will break some good design choices of Arrow.
However, there are wide-spread user requirements for non-consecutive memory
segments. I am wondering how we can help such users. What advice can we
give them?

Memory copy/move can be a solution, but is there a better solution?
Is there a third alternative? Can we virtualize the non-consecutive memory
segments into a consecutive one? (Although performance overhead is
unavoidable.)

What do you think? Let's brain-storm it.

Best,
Liya Fan


On Thu, Jul 11, 2019 at 8:05 PM Uwe L. Korn  wrote:

> Hello Liya,
>
> I'm quite -1 on this type as Arrow is about efficient columnar structures.
> We have opened the standard also to matrix-like types but always keep the
> constraint of consecutive memory. Now also adding types where memory is no
> longer consecutive but spread in the heap will make the scope of the
> project much wider (It seems that we then just turn into a general
> serialization framework).
>
> One of the ideas of a common standard is that some need to make
> compromises. I think in this case it is a necessary compromise to not allow
> all kind of string representations.
>
> Uwe
>
> On Thu, Jul 11, 2019, at 6:01 AM, Fan Liya wrote:
> > Hi all,
> >
> >
> > We are thinking of providing varchar/varbinary vectors with a different
> > memory layout which exists in a wide range of systems. The memory layout
> is
> > different from that of VarCharVector in the following ways:
> >
> >
> >1.
> >
> >Instead of storing (start offset, end offset), the new layout stores
> >(start offset, length)
> >2.
> >
> >The content of varchars may not be in a consecutive memory region.
> >Instead, it can be in arbitrary memory address.
> >
> >
> > Due to these differences in memory layout, it incurs performance overhead
> > when converting data between existing systems and VarCharVectors.
> >
> > The above difference 1 seems insignificant, while difference 2 is
> difficult
> > to overcome. However, the scenario of difference 2 is prevalent in
> > practice: for example we store strings in a series of memory segments.
> > Whenever a segment is full, we request a new one. However, these memory
> > segments may not be consecutive, because other processes/threads are also
> > requesting/releasing memory segments in the meantime.
> >
> > So we are wondering if it is possible to support such memory layout in
> > Arrow. I think there are more systems that are trying to adopt Arrow,
> > but are hindered by such difficulty.
> >
> > Would you please give your valuable feedback?
> >
> >
> > Best,
> >
> > Liya Fan
> >
>


Re: [Discuss] Support an alternative memory layout for varchar/varbinary vectors

2019-07-11 Thread Uwe L. Korn
Hello Liya,

I'm quite -1 on this type as Arrow is about efficient columnar structures. We 
have opened the standard also to matrix-like types but always keep the 
constraint of consecutive memory. Now also adding types where memory is no 
longer consecutive but spread in the heap will make the scope of the project 
much wider (It seems that we then just turn into a general serialization 
framework).

One of the ideas of a common standard is that some need to make compromises. I 
think in this case it is a necessary compromise to not allow all kind of string 
representations.

Uwe

On Thu, Jul 11, 2019, at 6:01 AM, Fan Liya wrote:
> Hi all,
> 
> 
> We are thinking of providing varchar/varbinary vectors with a different
> memory layout which exists in a wide range of systems. The memory layout is
> different from that of VarCharVector in the following ways:
> 
> 
>1.
> 
>Instead of storing (start offset, end offset), the new layout stores
>(start offset, length)
>2.
> 
>The content of varchars may not be in a consecutive memory region.
>Instead, it can be in arbitrary memory address.
> 
> 
> Due to these differences in memory layout, it incurs performance overhead
> when converting data between existing systems and VarCharVectors.
> 
> The above difference 1 seems insignificant, while difference 2 is difficult
> to overcome. However, the scenario of difference 2 is prevalent in
> practice: for example we store strings in a series of memory segments.
> Whenever a segment is full, we request a new one. However, these memory
> segments may not be consecutive, because other processes/threads are also
> requesting/releasing memory segments in the meantime.
> 
> So we are wondering if it is possible to support such memory layout in
> Arrow. I think there are more systems that are trying to adopt Arrow,
> but are hindered by such difficulty.
> 
> Would you please give your valuable feedback?
> 
> 
> Best,
> 
> Liya Fan
>
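
To make the difference concrete, here is a small illustrative sketch in plain
Python (not Arrow APIs) of the two layouts for the strings "ab" and "cde":

# Arrow varbinary layout: one consecutive data buffer plus an offsets buffer.
data = b"abcde"
offsets = [0, 2, 5]              # value i is data[offsets[i]:offsets[i + 1]]
assert data[offsets[1]:offsets[2]] == b"cde"

# Layout discussed in this thread: (start offset, length) pairs that may point
# into separate, non-consecutive segments.
segments = [b"ab", b"..cde.."]
views = [(0, 0, 2), (1, 2, 3)]   # (segment index, start offset, length)
seg, start, length = views[1]
assert segments[seg][start:start + length] == b"cde"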


[jira] [Created] (ARROW-5911) [Java] Make ListVector and MapVector create reader lazily

2019-07-11 Thread Liya Fan (JIRA)
Liya Fan created ARROW-5911:
---

 Summary: [Java] Make ListVector and MapVector create reader lazily
 Key: ARROW-5911
 URL: https://issues.apache.org/jira/browse/ARROW-5911
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


The current implementation creates the reader eagerly, which may waste 
resources and time. This issue changes the behavior to create the reader lazily.

This is a follow-up issue for ARROW-5897.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-5910) read_tensor() fails on non-seekable streams

2019-07-11 Thread Karsten Krispin (JIRA)
Karsten Krispin created ARROW-5910:
--

 Summary: read_tensor() fails on non-seekable streams
 Key: ARROW-5910
 URL: https://issues.apache.org/jira/browse/ARROW-5910
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.13.0
 Environment: pyarrow installed via pip, pyarrow==0.13.0
Reporter: Karsten Krispin


when reading a tensor from from a compressed pyarrow stream, it fails with
{code:java}
Traceback (most recent call last):
 File "test.py", line 10, in 
 tensor = pa.read_tensor(in_stream)
 File "pyarrow/ipc.pxi", line 470, in pyarrow.lib.read_tensor
 File "pyarrow/io.pxi", line 153, in 
pyarrow.lib.NativeFile.get_random_access_file
 File "pyarrow/io.pxi", line 182, in pyarrow.lib.NativeFile._assert_seekable
OSError: only valid on seekable files{code}
example code:
{code:java}
import pyarrow as pa
import numpy as np

a = np.random.random(size = (100,110,3) )

out_stream = pa.output_stream('test.pa', compression='gzip', buffer_size=None)
pa.write_tensor(pa.Tensor.from_numpy(a), out_stream)

in_stream = pa.input_stream('test.pa', compression='gzip', buffer_size=None)
tensor = pa.read_tensor(in_stream)
b = pa.Tensor.to_numpy(tensor){code}
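
A possible workaround until non-seekable inputs are supported (an untested 
sketch) is to read the decompressed stream fully into memory and wrap it in a 
seekable BufferReader:

{code:python}
import pyarrow as pa

in_stream = pa.input_stream('test.pa', compression='gzip')
buf = in_stream.read()  # decompress the whole stream into memory
tensor = pa.read_tensor(pa.BufferReader(buf))  # BufferReader is seekable
b = tensor.to_numpy()
{code}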



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


Re: Adding a new encoding for FP data

2019-07-11 Thread Fan Liya
Hi Radev,

Thanks a lot for providing so many technical details. I need to read them
carefully.

I think FP encoding is definitely a useful feature.
I hope this feature can be implemented in Arrow soon, so that we can use it
in our system.

Best,
Liya Fan

On Thu, Jul 11, 2019 at 5:55 PM Radev, Martin  wrote:

> Hello Liya Fan,
>
>
> this explains the technique but for a more complex case:
>
> https://fgiesen.wordpress.com/2011/01/24/x86-code-compression-in-kkrunchy/
>
> For FP data, the approach which seemed to be the best is the following.
>
> Say we have a buffer of two 32-bit floating point values:
>
> buf = [af, bf]
>
> We interpret each FP value as a 32-bit uint and look at each individual
> byte. We have 8 bytes in total for this small input.
>
> buf = [af0, af1, af2, af3, bf0, bf1, bf2, bf3]
>
> Then we apply stream splitting and the new buffer becomes:
>
> newbuf = [af0, bf0, af1, bf1, af2, bf2, af3, bf3]
>
> We compress newbuf.
>
> Due to similarities in the sign bits, mantissa bits and MSB exponent bits, we
> might have a lot more repetitions in data. For scientific data, the 2nd and
> 3rd byte for 32-bit data is probably largely noise. Thus in the original
> representation we would always have a few bytes of data which could appear
> somewhere else in the buffer and then a couple bytes of possible noise. In
> the new representation we have a long stream of data which could compress
> well and then a sequence of noise towards the end.
>
> This transformation improved compression ratio as can be seen in the
> report.
>
> It also improved speed for ZSTD. This could be because ZSTD makes a
> decision about how to compress the data - RLE, a new Huffman tree, the Huffman
> tree of the previous frame, or a raw representation. Each can potentially
> achieve a different compression ratio and compression/decompression speed. It
> turned out that when the transformation is applied, zstd would attempt to
> compress fewer frames and copy the others. This could lead to fewer attempts
> to build a Huffman tree. It's hard to pin-point the exact reason.
>
> I did not try other lossless text compressors but I expect similar results.
>
> For code, I can polish my patches, create a Jira task and submit the
> patches for review.
>
>
> Regards,
>
> Martin
>
>
> 
> From: Fan Liya 
> Sent: Thursday, July 11, 2019 11:32:53 AM
> To: dev@arrow.apache.org
> Cc: Raoofy, Amir; Karlstetter, Roman
> Subject: Re: Adding a new encoding for FP data
>
> Hi Radev,
>
> Thanks for the information. It seems interesting.
> IMO, Arrow has much to do for data compression. However, it seems there are
> some differences between in-memory data compression and external storage data
> compression.
>
> Could you please provide some reference for stream splitting?
>
> Best,
> Liya Fan
>
> On Thu, Jul 11, 2019 at 5:15 PM Radev, Martin  wrote:
>
> > Hello people,
> >
> >
> > there has been discussion in the Apache Parquet mailing list on adding a
> > new encoder for FP data.
> > The reason for this is that the compressors supported by Apache Parquet
> > (zstd, gzip, etc.) do not compress raw FP data well.
> >
> >
> > In my investigation it turns out that a very simple technique,
> > named stream splitting, can improve the compression ratio and even speed
> > for some of the compressors.
> >
> > You can read about the results here:
> > https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view
> >
> >
> > I went through the developer guide for Apache Arrow and wrote a patch to
> > add the new encoding and test coverage for it.
> >
> > I will polish my patch and work in parallel to extend the Apache Parquet
> > format for the new encoding.
> >
> >
> > If you have any concerns, please let me know.
> >
> >
> > Regards,
> >
> > Martin
> >
> >
>


Re: Adding a new encoding for FP data

2019-07-11 Thread Radev, Martin
Hello Liya Fan,


this explains the technique but for a more complex case:

https://fgiesen.wordpress.com/2011/01/24/x86-code-compression-in-kkrunchy/

For FP data, the approach which seemed to be the best is the following.

Say we have a buffer of two 32-bit floating point values:

buf = [af, bf]

We interpret each FP value as a 32-bit uint and look at each individual byte. 
We have 8 bytes in total for this small input.

buf = [af0, af1, af2, af3, bf0, bf1, bf2, bf3]

Then we apply stream splitting and the new buffer becomes:

newbuf = [af0, bf0, af1, bf1, af2, bf2, af3, bf3]

We compress newbuf.
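
A minimal sketch of this transposition (NumPy-based, for illustration only; the 
function names and the NumPy dependency are mine, not part of any Arrow or 
Parquet API):

import numpy as np

def split_streams(values):
    # View each float32 as 4 raw bytes: [af0, af1, af2, af3, bf0, bf1, ...]
    raw = np.ascontiguousarray(values, dtype=np.float32).view(np.uint8).reshape(-1, 4)
    # Regroup by byte position: [af0, bf0, ..., af1, bf1, ..., af3, bf3, ...]
    return raw.T.tobytes()

def merge_streams(data, count):
    # Invert the transform: rebuild each value from its four byte streams.
    raw = np.frombuffer(data, dtype=np.uint8).reshape(4, count)
    return np.ascontiguousarray(raw.T).view(np.float32).reshape(-1)

Round-tripping merge_streams(split_streams(a), len(a)) returns the original 
values; only the byte layout handed to the compressor changes. The same idea 
applies to 64-bit floats with eight byte streams.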

Due to similarities in the sign bits, mantissa bits and MSB exponent bits, we 
might have a lot more repetitions in data. For scientific data, the 2nd and 3rd 
byte for 32-bit data is probably largely noise. Thus in the original 
representation we would always have a few bytes of data which could appear 
somewhere else in the buffer and then a couple bytes of possible noise. In the 
new representation we have a long stream of data which could compress well and 
then a sequence of noise towards the end.

This transformation improved compression ratio as can be seen in the report.

It also improved speed for ZSTD. This could be because ZSTD makes a decision about 
how to compress the data - RLE, a new Huffman tree, the Huffman tree of the previous 
frame, or a raw representation. Each can potentially achieve a different compression 
ratio and compression/decompression speed. It turned out that when the 
transformation is applied, zstd would attempt to compress fewer frames and copy 
the others. This could lead to fewer attempts to build a Huffman tree. It's hard 
to pin-point the exact reason.

I did not try other lossless text compressors but I expect similar results.

For code, I can polish my patches, create a Jira task and submit the patches 
for review.


Regards,

Martin



From: Fan Liya 
Sent: Thursday, July 11, 2019 11:32:53 AM
To: dev@arrow.apache.org
Cc: Raoofy, Amir; Karlstetter, Roman
Subject: Re: Adding a new encoding for FP data

Hi Radev,

Thanks for the information. It seems interesting.
IMO, Arrow has much to do for data compression. However, it seems there are
some differences between in-memory data compression and external storage data
compression.

Could you please provide some reference for stream splitting?

Best,
Liya Fan

On Thu, Jul 11, 2019 at 5:15 PM Radev, Martin  wrote:

> Hello people,
>
>
> there has been discussion in the Apache Parquet mailing list on adding a
> new encoder for FP data.
> The reason for this is that the compressors supported by Apache Parquet
> (zstd, gzip, etc.) do not compress raw FP data well.
>
>
> In my investigation it turns out that a very simple technique,
> named stream splitting, can improve the compression ratio and even speed
> for some of the compressors.
>
> You can read about the results here:
> https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view
>
>
> I went through the developer guide for Apache Arrow and wrote a patch to
> add the new encoding and test coverage for it.
>
> I will polish my patch and work in parallel to extend the Apache Parquet
> format for the new encoding.
>
>
> If you have any concerns, please let me know.
>
>
> Regards,
>
> Martin
>
>


Re: Adding a new encoding for FP data

2019-07-11 Thread Fan Liya
Hi Radev,

Thanks for the information. It seems interesting.
IMO, Arrow has much to do for data compression. However, it seems there are
some differences between in-memory data compression and external storage data
compression.

Could you please provide some reference for stream splitting?

Best,
Liya Fan

On Thu, Jul 11, 2019 at 5:15 PM Radev, Martin  wrote:

> Hello people,
>
>
> there has been discussion in the Apache Parquet mailing list on adding a
> new encoder for FP data.
> The reason for this is that the compressors supported by Apache Parquet
> (zstd, gzip, etc.) do not compress raw FP data well.
>
>
> In my investigation it turns out that a very simple technique,
> named stream splitting, can improve the compression ratio and even speed
> for some of the compressors.
>
> You can read about the results here:
> https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view
>
>
> I went through the developer guide for Apache Arrow and wrote a patch to
> add the new encoding and test coverage for it.
>
> I will polish my patch and work in parallel to extend the Apache Parquet
> format for the new encoding.
>
>
> If you have any concerns, please let me know.
>
>
> Regards,
>
> Martin
>
>


Adding a new encoding for FP data

2019-07-11 Thread Radev, Martin
Hello people,


there has been discussion in the Apache Parquet mailing list on adding a new 
encoder for FP data.
The reason for this is that the compressors supported by Apache Parquet (zstd, 
gzip, etc.) do not compress raw FP data well.


In my investigation it turns out that a very simple technique, named 
stream splitting, can improve the compression ratio and even speed for some of 
the compressors.

You can read about the results here: 
https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view


I went through the developer guide for Apache Arrow and wrote a patch to add 
the new encoding and test coverage for it.

I will polish my patch and work in parallel to extend the Apache Parquet format 
for the new encoding.


If you have any concerns, please let me know.


Regards,

Martin