Re: CPP : arrow symbols.map issue

2020-04-03 Thread Brian Bowman
Have not tried to reproduce in a containerized environment, Wes.

-Brian

On 4/3/20, 11:42 AM, "Wes McKinney"  wrote:

EXTERNAL

Are you able to reproduce the issue in a Dockerfile?

Because you are building with gcc 7.4, you need to ensure either that you
build everything with the gcc < 5 ABI or, if you want the new gcc ABI, that
the machine where you deploy has libstdc++ for gcc 7. Using Red Hat's
devtoolset toolchain is also an option.
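
A quick way to see which std::string ABI a given translation unit is compiled
against is the _GLIBCXX_USE_CXX11_ABI macro that libstdc++ defines; the probe
below is only an illustration, not Arrow code. Building it with the same flags
as the application (adding -D_GLIBCXX_USE_CXX11_ABI=0 forces the old ABI)
makes a mismatch easy to spot.

#include <cstdio>

int main() {
#if defined(_GLIBCXX_USE_CXX11_ABI) && _GLIBCXX_USE_CXX11_ABI
  std::printf("new (gcc >= 5, cxx11) std::string ABI\n");
#else
  std::printf("old (gcc < 5) std::string ABI\n");
#endif
  return 0;
}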

On Fri, Apr 3, 2020, 10:01 AM Brian Bowman 
wrote:

> Antoine/Wes,
>
> Thanks for your assistance!
>
> Here is the relevant info.  We suspect that our production build machines
> being at RHEL 6.7 is an issue.
>
> OS: RHEL 6.7
> Tools:  gcc 7.4,  bison-3.2, cmake-3.13.1, automake 1.16.1, autoconf 2.69,
> libtool 2.4.6, pkgconf 1.1.0, texinfo 6.6, help2man 1.47.11, ld
> 2.20.51.0.2-5.43.el6
>
> Best,
>
> -Brian
>
> On 4/2/20, 1:22 PM, "Wes McKinney"  wrote:
>
> EXTERNAL
>
> On Thu, Apr 2, 2020 at 12:06 PM Antoine Pitrou 
> wrote:
> >
    >     >
> > Hi,
> >
> > On Thu, 2 Apr 2020 16:56:06 +
> > Brian Bowman  wrote:
> > > A new high-performance file system we are working with returns an
> error while writing a .parquet file.   The following arrow symbol does not
> resolve properly and the error is masked.
> > >
> > > libparquet.so: undefined symbol:
> _ZNK5arrow6Status8ToStringB5cxx11Ev
> > >
> > >  > nm libarrow.so* | grep -i 
ZNK5arrow6Status8ToStringB5cxx11Ev
> > >  002b7760 t _ZNK5arrow6Status8ToStringB5cxx11Ev
> > >  002b7760 t _ZNK5arrow6Status8ToStringB5cxx11Ev
> >
> > For clarity, you should use `nm --demangle`.  This will give you the
> > actual C++ symbol, i.e. "arrow::Status::ToString[abi:cxx11]() 
const".
> >
> > > One of our Linux dev/build experts tracked this down to an issue
> in arrow open source.  He says the lowercase ‘t’ (text) code (… 7760 t 
_ZNK
> …) in the nm command output is incorrect and it should instead be an
> uppercase ‘T’.
> >
> > I have the right output here:
> >
> > $ nm --demangle --defined-only --dynamic .../libarrow.so | \
> > grep Status::ToString
> > 012f1ff0 T arrow::Status::ToString[abi:cxx11]() const
> >
> > Which toolchain (linker etc.) are you using?
>
> My guess is also that you have a mixed-gcc-toolchain problem. What
> compiler/linker (and gcc toolchain, if you built with Clang) was used
> to produce libparquet.so (or where did you obtain the package), and
> which toolchain are you using to build and link your application?
>
> > Regards
> >
> > Antoine.
> >
> >
>
>
>




Re: CPP : arrow symbols.map issue

2020-04-03 Thread Brian Bowman
Antoine/Wes,

Thanks for your assistance! 

Here is the relevant info.  We suspect that our production build machines being 
at RHEL 6.7 is an issue.

OS: RHEL 6.7
Tools:  gcc 7.4,  bison-3.2, cmake-3.13.1, automake 1.16.1, autoconf 2.69, 
libtool 2.4.6, pkgconf 1.1.0, texinfo 6.6, help2man 1.47.11, ld 
2.20.51.0.2-5.43.el6

Best,

-Brian

On 4/2/20, 1:22 PM, "Wes McKinney"  wrote:

EXTERNAL

On Thu, Apr 2, 2020 at 12:06 PM Antoine Pitrou  wrote:
>
>
> Hi,
>
> On Thu, 2 Apr 2020 16:56:06 +
> Brian Bowman  wrote:
> > A new high-performance file system we are working with returns an error 
while writing a .parquet file.   The following arrow symbol does not resolve 
properly and the error is masked.
> >
> > libparquet.so: undefined symbol: _ZNK5arrow6Status8ToStringB5cxx11Ev
> >
> >  > nm libarrow.so* | grep -i ZNK5arrow6Status8ToStringB5cxx11Ev
> >  002b7760 t _ZNK5arrow6Status8ToStringB5cxx11Ev
> >  002b7760 t _ZNK5arrow6Status8ToStringB5cxx11Ev
>
> For clarity, you should use `nm --demangle`.  This will give you the
> actual C++ symbol, i.e. "arrow::Status::ToString[abi:cxx11]() const".
>
> > One of our Linux dev/build experts tracked this down to an issue in 
arrow open source.  He says the lowercase ‘t’ (text) code (… 7760 t _ZNK …) in 
the nm command output is incorrect and it should instead be an uppercase ‘T’.
>
> I have the right output here:
>
> $ nm --demangle --defined-only --dynamic .../libarrow.so | \
> grep Status::ToString
> 012f1ff0 T arrow::Status::ToString[abi:cxx11]() const
>
> Which toolchain (linker etc.) are you using?

My guess is also that you have a mixed-gcc-toolchain problem. What
compiler/linker (and gcc toolchain, if you built with Clang) was used
to produce libparquet.so (or where did you obtain the package), and
which toolchain are you using to build and link your application?

> Regards
>
> Antoine.
>
>




CPP : arrow symbols.map issue

2020-04-02 Thread Brian Bowman
A new high-performance file system we are working with returns an error while 
writing a .parquet file.   The following arrow symbol does not resolve properly 
and the error is masked.

libparquet.so: undefined symbol: _ZNK5arrow6Status8ToStringB5cxx11Ev

 > nm libarrow.so* | grep -i ZNK5arrow6Status8ToStringB5cxx11Ev
 002b7760 t _ZNK5arrow6Status8ToStringB5cxx11Ev
 002b7760 t _ZNK5arrow6Status8ToStringB5cxx11Ev
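
As an aside, the mangled name above can also be decoded programmatically; a
small standalone sketch using the cxxabi demangler (not Arrow code):

#include <cstdio>
#include <cstdlib>
#include <cxxabi.h>

int main() {
  const char* mangled = "_ZNK5arrow6Status8ToStringB5cxx11Ev";
  int status = 0;
  // __cxa_demangle returns a malloc'd string when status == 0.
  char* demangled = abi::__cxa_demangle(mangled, nullptr, nullptr, &status);
  std::printf("%s\n", status == 0 ? demangled : "demangling failed");
  std::free(demangled);
  return 0;
}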

One of our Linux dev/build experts tracked this down to an issue in arrow open 
source.  He says the lowercase ‘t’ (text) code (… 7760 t _ZNK …) in the nm 
command output is incorrect and it should instead be an uppercase ‘T’.

He traced the problem to this file:

../cpp/src/arrow/symbols.map

Here’s an update with his fix.  Lines 27-30 are new.  Nothing else changes.

  1 # Licensed to the Apache Software Foundation (ASF) under one
  2 # or more contributor license agreements.  See the NOTICE file
  3 # distributed with this work for additional information
  4 # regarding copyright ownership.  The ASF licenses this file
  5 # to you under the Apache License, Version 2.0 (the
  6 # "License"); you may not use this file except in compliance
  7 # with the License.  You may obtain a copy of the License at
  8 #
  9 #   http://www.apache.org/licenses/LICENSE-2.0
10 #
11 # Unless required by applicable law or agreed to in writing,
12 # software distributed under the License is distributed on an
13 # "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
14 # KIND, either express or implied.  See the License for the
15 # specific language governing permissions and limitations
16 # under the License.
17
18 {
19   global:
20 extern "C++" {
21   # The leading asterisk is required for symbols such as
22   # "typeinfo for arrow::SomeClass".
23   # Unfortunately this will also catch template specializations
24   # (from e.g. STL or Flatbuffers) involving Arrow types.
25   *arrow::*;
26   *arrow_vendored::*;
27   *ToString*;
28   *key*;
29   *str*;
30   *value*;
31 };
32 # Also export C-level helpers
33 arrow_*;
34 pyarrow_*;
35
36   # Symbols marked as 'local' are not exported by the DSO and thus may not
37   # be used by client applications.  Everything except the above falls here.
38   # This ensures we hide symbols of static dependencies.
39   local:
40 *;
41
42 };

We have made these changes in our local clones of the Arrow open source 
repositories.  I’m passing this along for the community’s review.  Reply with 
a link and I’ll enter a JIRA ticket if needed.

-Brian
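
For anyone verifying a rebuilt library, a runtime probe with dlopen/dlsym is a
quick way to confirm whether the symbol in question actually resolves. This is
only a sketch (link with -ldl; the library path is illustrative):

#include <dlfcn.h>
#include <cstdio>

int main() {
  // Illustrative path; point this at the freshly built libparquet.so.
  void* handle = dlopen("./libparquet.so", RTLD_NOW | RTLD_GLOBAL);
  if (handle == nullptr) {
    std::printf("dlopen failed: %s\n", dlerror());
    return 1;
  }
  void* sym = dlsym(handle, "_ZNK5arrow6Status8ToStringB5cxx11Ev");
  std::printf("symbol %s\n", sym != nullptr ? "resolved" : "NOT resolved");
  dlclose(handle);
  return 0;
}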






Re: Workaround for Thrift download ERRORs

2019-07-15 Thread Brian Bowman
Wes,

Here are the cmake Thrift log lines from a build of an apache-arrow git clone 
on 06Jul2019, where cmake successfully downloads Thrift.
 
-- Checking for module 'thrift'
-- No package 'thrift' found
-- Could NOT find Thrift (missing: THRIFT_STATIC_LIB) 
Building Apache Thrift from source
Downloading Apache Thrift from 
http://mirror.metrocast.net/apache//thrift/0.12.0/thrift-0.12.0.tar.gz

Do you still want a JIRA issue entered, given that this git clone works and is 
a bit newer than the arrow-0.14.0 release .tar?

- Brian


On 7/15/19, 12:39 PM, "Wes McKinney"  wrote:

EXTERNAL

hi Brian,

Can you please open a JIRA issue?

Does running the "get_apache_mirror.py" script work for you by itself?

$ python cpp/build-support/get_apache_mirror.py
https://www-eu.apache.org/dist/

- Wes
    
    On Mon, Jul 15, 2019 at 10:54 AM Brian Bowman  wrote:
>
> Is there a workaround for the following error?
>
> requests.exceptions.SSLError: hostname 'www.apache.org' doesn't match 
either of '*.openoffice.org', 
'openoffice.org'/thrift/0.12.0/thrift-0.12.0.tar.gz
>
> I’ve inflated apache-arrow-0.14.0.tar and the thrift-0.12.0.tar.gz is not 
being found during cmake.  This results in downstream compile errors during 
make.
>
> Here’s the log info from cmake:
>
> -- Checking for module 'thrift'
> --   No package 'thrift' found
> -- Could NOT find Thrift (missing: THRIFT_STATIC_LIB THRIFT_INCLUDE_DIR 
THRIFT_COMPILER)
> Building Apache Thrift from source
> Downloading Apache Thrift from Traceback (most recent call last):
>   File "…/apache-arrow-0.14.0/cpp/build-support/get_apache_mirror.py", 
line 38, in 
> suggested_mirror = get_url('https://www.apache.org/dyn/'
>   File "…/apache-arrow-0.14.0/cpp/build-support/get_apache_mirror.py", 
line 27, in get_url
> return requests.get(url).content
>   File "/usr/lib/python2.6/site-packages/requests/api.py", line 68, in get
> return request('get', url, **kwargs)
>   File "/usr/lib/python2.6/site-packages/requests/api.py", line 50, in 
request
> response = session.request(method=method, url=url, **kwargs)
>   File "/usr/lib/python2.6/site-packages/requests/sessions.py", line 464, 
in request
> resp = self.send(prep, **send_kwargs)
>   File "/usr/lib/python2.6/site-packages/requests/sessions.py", line 576, 
in send
> r = adapter.send(request, **kwargs)
>   File "/usr/lib/python2.6/site-packages/requests/adapters.py", line 431, 
in send
> raise SSLError(e, request=request)
> requests.exceptions.SSLError: hostname 'www.apache.org' doesn't match 
either of '*.openoffice.org', 
'openoffice.org'/thrift/0.12.0/thrift-0.12.0.tar.gz
>
>
> Thanks,
>
>
> Brian
>




Workaround for Thrift download ERRORs

2019-07-15 Thread Brian Bowman
Is there a workaround for the following error?

requests.exceptions.SSLError: hostname 'www.apache.org' doesn't match either of 
'*.openoffice.org', 'openoffice.org'/thrift/0.12.0/thrift-0.12.0.tar.gz

I’ve inflated apache-arrow-0.14.0.tar and the thrift-0.12.0.tar.gz is not being 
found during cmake.  This results in downstream compile errors during make.

Here’s the log info from cmake:

-- Checking for module 'thrift'
--   No package 'thrift' found
-- Could NOT find Thrift (missing: THRIFT_STATIC_LIB THRIFT_INCLUDE_DIR 
THRIFT_COMPILER)
Building Apache Thrift from source
Downloading Apache Thrift from Traceback (most recent call last):
  File "…/apache-arrow-0.14.0/cpp/build-support/get_apache_mirror.py", line 38, 
in 
suggested_mirror = get_url('https://www.apache.org/dyn/'
  File "…/apache-arrow-0.14.0/cpp/build-support/get_apache_mirror.py", line 27, 
in get_url
return requests.get(url).content
  File "/usr/lib/python2.6/site-packages/requests/api.py", line 68, in get
return request('get', url, **kwargs)
  File "/usr/lib/python2.6/site-packages/requests/api.py", line 50, in request
response = session.request(method=method, url=url, **kwargs)
  File "/usr/lib/python2.6/site-packages/requests/sessions.py", line 464, in 
request
resp = self.send(prep, **send_kwargs)
  File "/usr/lib/python2.6/site-packages/requests/sessions.py", line 576, in 
send
r = adapter.send(request, **kwargs)
  File "/usr/lib/python2.6/site-packages/requests/adapters.py", line 431, in 
send
raise SSLError(e, request=request)
requests.exceptions.SSLError: hostname 'www.apache.org' doesn't match either of 
'*.openoffice.org', 'openoffice.org'/thrift/0.12.0/thrift-0.12.0.tar.gz


Thanks,


Brian



Re: Need 64-bit Integer length for Parquet ByteArray Type

2019-04-26 Thread Brian Bowman
Hello Wes,

Thanks for the info!  I'm working to better understand Parquet/Arrow design and 
development processes.   No hurry for LARGE_BYTE_ARRAY.

-Brian


On 4/26/19, 11:14 AM, "Wes McKinney"  wrote:

EXTERNAL

hi Brian,

I doubt that such a change could be made on a short time horizon.
Collecting feedback and building consensus (if it is even possible)
with stakeholders would take some time. The appropriate place to have
the discussion is here on the mailing list, though

Thanks

On Mon, Apr 8, 2019 at 1:37 PM Brian Bowman  wrote:
>
> Hello Wes/all,
>
> A new LARGE_BYTE_ARRAY type in Parquet would satisfy SAS' needs without 
resorting to other alternatives.  Is this something that could be done in 
Parquet over the next few months?  I have a lot of experience with file 
formats/storage layer internals and can contribute for Parquet C++.
>
> -Brian
>
> On 4/5/19, 3:44 PM, "Wes McKinney"  wrote:
>
> EXTERNAL
>
> hi Brian,
>
> Just to comment from the C++ side -- the 64-bit issue is a limitation
> of the Parquet format itself and not related to the C++
> implementation. It would be possibly interesting to add a
> LARGE_BYTE_ARRAY type with 64-bit offset encoding (we are discussing
> doing much the same in Apache Arrow for in-memory)
>
> - Wes
>
> On Fri, Apr 5, 2019 at 2:11 PM Ryan Blue  
wrote:
> >
> > I don't think that's what you would want to do. Parquet will 
eventually
> > compress large values, but not after making defensive copies and 
attempting
> > to encode them. In the end, it will be a lot more overhead, plus 
the work
> > to make it possible. I think you'd be much better off compressing 
before
> > storing in Parquet if you expect good compression rates.
> >
> > On Fri, Apr 5, 2019 at 11:29 AM Brian Bowman  
wrote:
> >
> > > My hope is that these large ByteArray values will encode/compress 
to a
> > > fraction of their original size.  FWIW, cpp/src/parquet/
> > > column_writer.cc/.h has int64_t offset and length fields all over 
the
> > > place.
> > >
> > > External file references to BLOBS are doable but not the elegant,
> > > integrated solution I was hoping for.
> > >
> > > -Brian
> > >
> > > On Apr 5, 2019, at 1:53 PM, Ryan Blue  wrote:
> > >
> > > *EXTERNAL*
> > > Looks like we will need a new encoding for this:
> > > https://github.com/apache/parquet-format/blob/master/Encodings.md
> > >
> > > That doc specifies that the plain encoding uses a 4-byte length. 
That's
> > > not going to be a quick fix.
> > >
> > > Now that I'm thinking about this a bit more, does it make sense 
to support
> > > byte arrays that are more than 2GB? That's far larger than the 
size of a
> > > row group, let alone a page. This would completely break memory 
management
> > > in the JVM implementation.
> > >
> > > Can you solve this problem using a BLOB type that references an 
external
> > > file with the gigantic values? Seems to me that values this large 
should go
> > > in separate files, not in a Parquet file where it would destroy 
any benefit
> > > from using the format.
> > >
> > > On Fri, Apr 5, 2019 at 10:43 AM Brian Bowman 
 wrote:
> > >
> > >> Hello Ryan,
> > >>
> > >> Looks like it's limited by both the Parquet implementation and 
the Thrift
> > >> message methods.  Am I missing anything?
> > >>
> > >> From cpp/src/parquet/types.h
> > >>
> > >> struct ByteArray {
> > >>   ByteArray() : len(0), ptr(NULLPTR) {}
> > >>   ByteArray(uint32_t len, const uint8_t* ptr) : len(len), 
ptr(ptr) {}
> > >>   uint32_t len;
> > >>   const uint8_t* ptr;
> > >> };
> > >>
> > >> From cpp/src/parquet/thrift.h
> > >>
> > >> inline void DeserializeThriftMsg(const uint8_t* buf, uint32_t* 
len, T*
>   

Re: [VOTE] Add 64-bit offset list, binary, string (utf8) types to the Arrow columnar format

2019-04-26 Thread Brian Bowman
Can non-Arrow PMC members/committers vote?  

If so, +1 

-Brian

On 4/25/19, 4:34 PM, "Wes McKinney"  wrote:

EXTERNAL

In a recent mailing list discussion [1] Micah Kornfield has proposed
to add new list and variable-size binary and unicode types to the
Arrow columnar format with 64-bit signed integer offsets, to be used
in addition to the existing 32-bit offset varieties. These will be
implemented as new types in the Type union in Schema.fbs (the
particular names can be debated in the PR that implements them):

LargeList
LargeBinary
LargeString [UTF8]

While very large contiguous columns are not a principal use case for
the columnar format, it has been observed empirically that there are
applications that use the format to represent datasets where
realizations of data can sometimes exceed the 2^31 - 1 "capacity" of a
column and cannot be easily (or at all) split into smaller chunks.

Please vote whether to accept the changes. The vote will be open for at
least 72 hours.

[ ] +1 Accept the additions to the columnar format
[ ] +0
[ ] -1 Do not accept the changes because...

Thanks,
Wes

[1]: 
https://lists.apache.org/thread.html/8088eca21b53906315e2bbc35eb2d246acf10025b5457eccc7a0e8a3@%3Cdev.arrow.apache.org%3E
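
These types later landed in the Arrow C++ library as arrow::LargeList,
arrow::LargeBinary and arrow::LargeString (64-bit offsets). A minimal usage
sketch, assuming an Arrow C++ release that includes them (0.14.0 or later):

#include <memory>

#include "arrow/api.h"

arrow::Status BuildLargeStringArray(std::shared_ptr<arrow::Array>* out) {
  // LargeStringBuilder uses int64_t offsets, so a single column can hold
  // more than 2^31 - 1 bytes of character data.
  arrow::LargeStringBuilder builder;
  ARROW_RETURN_NOT_OK(builder.Append("first value"));
  ARROW_RETURN_NOT_OK(builder.Append("second value"));
  return builder.Finish(out);
}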




Re: Need 64-bit Integer length for Parquet ByteArray Type

2019-04-08 Thread Brian Bowman
Hello Wes/all,

A new LARGE_BYTE_ARRAY type in Parquet would satisfy SAS' needs without 
resorting to other alternatives.  Is this something that could be done in 
Parquet over the next few months?  I have a lot of experience with file 
formats/storage layer internals and can contribute for Parquet C++.

-Brian

On 4/5/19, 3:44 PM, "Wes McKinney"  wrote:

EXTERNAL

hi Brian,

Just to comment from the C++ side -- the 64-bit issue is a limitation
of the Parquet format itself and not related to the C++
implementation. It would be possibly interesting to add a
LARGE_BYTE_ARRAY type with 64-bit offset encoding (we are discussing
doing much the same in Apache Arrow for in-memory)

- Wes

On Fri, Apr 5, 2019 at 2:11 PM Ryan Blue  wrote:
>
> I don't think that's what you would want to do. Parquet will eventually
> compress large values, but not after making defensive copies and 
attempting
> to encode them. In the end, it will be a lot more overhead, plus the work
> to make it possible. I think you'd be much better off compressing before
> storing in Parquet if you expect good compression rates.
    >
> On Fri, Apr 5, 2019 at 11:29 AM Brian Bowman  wrote:
>
> > My hope is that these large ByteArray values will encode/compress to a
> > fraction of their original size.  FWIW, cpp/src/parquet/
> > column_writer.cc/.h has int64_t offset and length fields all over the
> > place.
> >
> > External file references to BLOBS are doable but not the elegant,
> > integrated solution I was hoping for.
> >
> > -Brian
> >
> > On Apr 5, 2019, at 1:53 PM, Ryan Blue  wrote:
> >
> > *EXTERNAL*
> > Looks like we will need a new encoding for this:
> > https://github.com/apache/parquet-format/blob/master/Encodings.md
> >
> > That doc specifies that the plain encoding uses a 4-byte length. That's
> > not going to be a quick fix.
> >
> > Now that I'm thinking about this a bit more, does it make sense to 
support
> > byte arrays that are more than 2GB? That's far larger than the size of a
> > row group, let alone a page. This would completely break memory 
management
> > in the JVM implementation.
> >
> > Can you solve this problem using a BLOB type that references an external
> > file with the gigantic values? Seems to me that values this large 
should go
> > in separate files, not in a Parquet file where it would destroy any 
benefit
> > from using the format.
> >
> > On Fri, Apr 5, 2019 at 10:43 AM Brian Bowman  
wrote:
> >
> >> Hello Ryan,
> >>
> >> Looks like it's limited by both the Parquet implementation and the 
Thrift
> >> message methods.  Am I missing anything?
> >>
> >> From cpp/src/parquet/types.h
> >>
> >> struct ByteArray {
> >>   ByteArray() : len(0), ptr(NULLPTR) {}
> >>   ByteArray(uint32_t len, const uint8_t* ptr) : len(len), ptr(ptr) {}
> >>   uint32_t len;
> >>   const uint8_t* ptr;
> >> };
> >>
> >> From cpp/src/parquet/thrift.h
> >>
> >> inline void DeserializeThriftMsg(const uint8_t* buf, uint32_t* len, T*
> >> deserialized_msg) {
> >> inline int64_t SerializeThriftMsg(T* obj, uint32_t len, OutputStream*
> >> out)
> >>
> >> -Brian
> >>
> >> On 4/5/19, 1:32 PM, "Ryan Blue"  wrote:
> >>
> >> EXTERNAL
> >>
> >> Hi Brian,
> >>
> >> This seems like something we should allow. What imposes the current
> >> limit?
> >> Is it in the thrift format, or just the implementations?
> >>
> >> On Fri, Apr 5, 2019 at 10:23 AM Brian Bowman 
> >> wrote:
> >>
> >> > All,
> >> >
> >> > SAS requires support for storing varying-length character and
> >> binary blobs
> >> > with a 2^64 max length in Parquet.   Currently, the ByteArray len
> >> field is
> >> > a uint32_t.   Looks like this will require incrementing the 
Parquet
> >> file
> >> > format version and changing ByteArray len to uint64_t.
> >> >
> >> > Have there been any requests for this or other Parquet 
developments
> >> that
> >> > require file format versioning changes?
> >> >
> >> > I realize this is a non-trivial ask.  Thanks for considering it.
> >> >
> >> > -Brian
> >> >
> >>
> >>
> >> --
> >> Ryan Blue
> >> Software Engineer
> >> Netflix
> >>
> >>
> >>
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
> >
> >
>
> --
> Ryan Blue
> Software Engineer
> Netflix




Re: Need 64-bit Integer length for Parquet ByteArray Type

2019-04-05 Thread Brian Bowman
Thanks Ryan,

After further pondering this, I came to similar conclusions.

Compress the data before putting it into a Parquet ByteArray, and if that’s not 
feasible, reference it in an external/persisted data structure.

Another alternative is to create one or more “shadow columns” to store the 
overflow horizontally.

-Brian

On Apr 5, 2019, at 3:11 PM, Ryan Blue  wrote:


EXTERNAL

I don't think that's what you would want to do. Parquet will eventually 
compress large values, but not after making defensive copies and attempting to 
encode them. In the end, it will be a lot more overhead, plus the work to make 
it possible. I think you'd be much better off compressing before storing in 
Parquet if you expect good compression rates.

On Fri, Apr 5, 2019 at 11:29 AM Brian Bowman  wrote:
My hope is that these large ByteArray values will encode/compress to a fraction 
of their original size.  FWIW, 
cpp/src/parquet/column_writer.cc/.h has int64_t 
offset and length fields all over the place.

External file references to BLOBS are doable but not the elegant, integrated 
solution I was hoping for.

-Brian

On Apr 5, 2019, at 1:53 PM, Ryan Blue  wrote:


EXTERNAL

Looks like we will need a new encoding for this: 
https://github.com/apache/parquet-format/blob/master/Encodings.md

That doc specifies that the plain encoding uses a 4-byte length. That's not 
going to be a quick fix.

Now that I'm thinking about this a bit more, does it make sense to support byte 
arrays that are more than 2GB? That's far larger than the size of a row group, 
let alone a page. This would completely break memory management in the JVM 
implementation.

Can you solve this problem using a BLOB type that references an external file 
with the gigantic values? Seems to me that values this large should go in 
separate files, not in a Parquet file where it would destroy any benefit from 
using the format.

On Fri, Apr 5, 2019 at 10:43 AM Brian Bowman  wrote:
Hello Ryan,

Looks like it's limited by both the Parquet implementation and the Thrift 
message methods.  Am I missing anything?

From cpp/src/parquet/types.h

struct ByteArray {
  ByteArray() : len(0), ptr(NULLPTR) {}
  ByteArray(uint32_t len, const uint8_t* ptr) : len(len), ptr(ptr) {}
  uint32_t len;
  const uint8_t* ptr;
};

From cpp/src/parquet/thrift.h

inline void DeserializeThriftMsg(const uint8_t* buf, uint32_t* len, T* 
deserialized_msg) {
inline int64_t SerializeThriftMsg(T* obj, uint32_t len, OutputStream* out)

-Brian

On 4/5/19, 1:32 PM, "Ryan Blue"  wrote:

EXTERNAL

Hi Brian,

This seems like something we should allow. What imposes the current limit?
Is it in the thrift format, or just the implementations?

On Fri, Apr 5, 2019 at 10:23 AM Brian Bowman  wrote:

> All,
>
> SAS requires support for storing varying-length character and binary blobs
> with a 2^64 max length in Parquet.   Currently, the ByteArray len field is
> a uint32_t.   Looks like this will require incrementing the Parquet file
> format version and changing ByteArray len to uint64_t.
>
> Have there been any requests for this or other Parquet developments that
> require file format versioning changes?
>
> I realize this is a non-trivial ask.  Thanks for considering it.
>
> -Brian
>


--
Ryan Blue
Software Engineer
Netflix




--
Ryan Blue
Software Engineer
Netflix


--
Ryan Blue
Software Engineer
Netflix


Re: Need 64-bit Integer length for Parquet ByteArray Type

2019-04-05 Thread Brian Bowman
My hope is that these large ByteArray values will encode/compress to a fraction 
of their original size.  FWIW, 
cpp/src/parquet/column_writer.cc/.h has int64_t 
offset and length fields all over the place.

External file references to BLOBS are doable but not the elegant, integrated 
solution I was hoping for.

-Brian

On Apr 5, 2019, at 1:53 PM, Ryan Blue  wrote:


EXTERNAL

Looks like we will need a new encoding for this: 
https://github.com/apache/parquet-format/blob/master/Encodings.md

That doc specifies that the plain encoding uses a 4-byte length. That's not 
going to be a quick fix.
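
Concretely, the plain encoding writes each BYTE_ARRAY value as a 4-byte
little-endian length followed by the raw bytes, which is where the 2^32 - 1
ceiling comes from. A rough sketch of that layout (illustration only, not
parquet-cpp code; assumes a little-endian host):

#include <cstdint>
#include <cstring>
#include <vector>

// One PLAIN-encoded BYTE_ARRAY value: [4-byte LE length][value bytes].
std::vector<uint8_t> PlainEncodeByteArray(const uint8_t* data, uint32_t len) {
  std::vector<uint8_t> out(sizeof(len) + len);
  std::memcpy(out.data(), &len, sizeof(len));
  std::memcpy(out.data() + sizeof(len), data, len);
  return out;
}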

Now that I'm thinking about this a bit more, does it make sense to support byte 
arrays that are more than 2GB? That's far larger than the size of a row group, 
let alone a page. This would completely break memory management in the JVM 
implementation.

Can you solve this problem using a BLOB type that references an external file 
with the gigantic values? Seems to me that values this large should go in 
separate files, not in a Parquet file where it would destroy any benefit from 
using the format.

On Fri, Apr 5, 2019 at 10:43 AM Brian Bowman  wrote:
Hello Ryan,

Looks like it's limited by both the Parquet implementation and the Thrift 
message methods.  Am I missing anything?

From cpp/src/parquet/types.h

struct ByteArray {
  ByteArray() : len(0), ptr(NULLPTR) {}
  ByteArray(uint32_t len, const uint8_t* ptr) : len(len), ptr(ptr) {}
  uint32_t len;
  const uint8_t* ptr;
};

From cpp/src/parquet/thrift.h

inline void DeserializeThriftMsg(const uint8_t* buf, uint32_t* len, T* 
deserialized_msg) {
inline int64_t SerializeThriftMsg(T* obj, uint32_t len, OutputStream* out)

-Brian

On 4/5/19, 1:32 PM, "Ryan Blue"  wrote:

EXTERNAL

Hi Brian,

This seems like something we should allow. What imposes the current limit?
Is it in the thrift format, or just the implementations?

On Fri, Apr 5, 2019 at 10:23 AM Brian Bowman  wrote:

> All,
>
> SAS requires support for storing varying-length character and binary blobs
> with a 2^64 max length in Parquet.   Currently, the ByteArray len field is
> a uint32_t.   Looks like this will require incrementing the Parquet file
> format version and changing ByteArray len to uint64_t.
>
> Have there been any requests for this or other Parquet developments that
> require file format versioning changes?
>
> I realize this is a non-trivial ask.  Thanks for considering it.
>
> -Brian
>


--
Ryan Blue
Software Engineer
Netflix




--
Ryan Blue
Software Engineer
Netflix


Re: Need 64-bit Integer length for Parquet ByteArray Type

2019-04-05 Thread Brian Bowman
Hello Ryan,

Looks like it's limited by both the Parquet implementation and the Thrift 
message methods.  Am I missing anything?

From cpp/src/parquet/types.h 

struct ByteArray {
  ByteArray() : len(0), ptr(NULLPTR) {}
  ByteArray(uint32_t len, const uint8_t* ptr) : len(len), ptr(ptr) {}
  uint32_t len;
  const uint8_t* ptr;
};

From cpp/src/parquet/thrift.h

inline void DeserializeThriftMsg(const uint8_t* buf, uint32_t* len, T* 
deserialized_msg) {
inline int64_t SerializeThriftMsg(T* obj, uint32_t len, OutputStream* out) 
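
If the format ever grew a LARGE_BYTE_ARRAY type, the C++ side would presumably
need a 64-bit counterpart to the struct above. A purely hypothetical sketch,
not part of parquet-cpp:

#include <cstdint>

// Hypothetical 64-bit-length variant of parquet::ByteArray.
struct LargeByteArray {
  LargeByteArray() : len(0), ptr(nullptr) {}
  LargeByteArray(uint64_t len, const uint8_t* ptr) : len(len), ptr(ptr) {}
  uint64_t len;
  const uint8_t* ptr;
};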

-Brian

On 4/5/19, 1:32 PM, "Ryan Blue"  wrote:

EXTERNAL

Hi Brian,

This seems like something we should allow. What imposes the current limit?
Is it in the thrift format, or just the implementations?

On Fri, Apr 5, 2019 at 10:23 AM Brian Bowman  wrote:

> All,
>
> SAS requires support for storing varying-length character and binary blobs
> with a 2^64 max length in Parquet.   Currently, the ByteArray len field is
> a uint32_t.   Looks like this will require incrementing the Parquet file
> format version and changing ByteArray len to uint64_t.
>
> Have there been any requests for this or other Parquet developments that
> require file format versioning changes?
>
> I realize this is a non-trivial ask.  Thanks for considering it.
>
> -Brian
>


--
Ryan Blue
Software Engineer
Netflix




Need 64-bit Integer length for Parquet ByteArray Type

2019-04-05 Thread Brian Bowman
All,

SAS requires support for storing varying-length character and binary blobs with 
a 2^64 max length in Parquet.   Currently, the ByteArray len field is a 
uint32_t.   Looks like this will require incrementing the Parquet file format 
version and changing ByteArray len to uint64_t.

Have there been any requests for this or other Parquet developments that 
require file format versioning changes?

I realize this is a non-trivial ask.  Thanks for considering it.

-Brian


Re: Passing File Descriptors in the Low-Level API

2019-03-16 Thread Brian Bowman
Thanks Wes!

I'm working on the integrating and testing the necessary changes in our dev 
environment.  I'll submit a PR once things are working.

Best,

Brian 

On 3/16/19, 4:24 PM, "Wes McKinney"  wrote:

EXTERNAL

hi Brian,

Please feel free to submit a PR to add the requisite APIs that you
need for your application. Antoine or I or others should be able to
give prompt feedback since we know this code pretty well.

Thanks
Wes

On Sat, Mar 16, 2019 at 11:40 AM Brian Bowman  wrote:
>
> Hi Wes,
>
> Thanks for the quick reply!  To be clear, the usage I'm working on needs 
to own both the Open FileDescriptor and corresponding mapped memory.  In other 
words ...
>
> SAS component does both open() and mmap() which could be for READ or 
WRITE.
>
> -> Calls low-level Parquet APIs to read an existing file or write a new 
one.  The open() and mmap() flags are guaranteed to be correct.
>
> At some later point SAS component does an unmap() and close().
>
> -Brian
>
>
> On 3/14/19, 3:42 PM, "Wes McKinney"  wrote:
>
> hi Brian,
>
> This is mostly an Arrow platform question so I'm copying the Arrow 
mailing list.
>
> You can open a file using an existing file descriptor using 
ReadableFile::Open
>
> 
https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/file.h#L145
>
> The documentation for this function says:
>
> "The file descriptor becomes owned by the ReadableFile, and will be
> closed on Close() or destruction."
>
> If you want to do the equivalent thing, but using memory mapping, I
> think you'll need to add a corresponding API to MemoryMappedFile. This
> is more perilous because of the API requirements of mmap -- you need
> to pass the right flags and they may need to be the same flags that
> were passed when opening the file descriptor, see
>
> 
https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/file.cc#L378
>
>     and
    >
> 
https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/file.cc#L476
>
> - Wes
>
> On Thu, Mar 14, 2019 at 1:47 PM Brian Bowman  
wrote:
> >
> >  The ReadableFile class (arrow/io/file.cc) has utility methods 
where a FileDescriptor is either passed in or returned, but I don’t see how 
this surfaces through the API.
> >
> > Is there a way for application code to control the open lifetime of 
mmap()’d Parquet files by passing an already open FileDescriptor to Parquet 
low-level API open/close methods?
> >
> > Thanks,
> >
> > Brian
> >
>
>
>




Re: Passing File Descriptors in the Low-Level API

2019-03-16 Thread Brian Bowman
Hi Wes,

Thanks for the quick reply!  To be clear, the usage I'm working on needs to own 
both the Open FileDescriptor and corresponding mapped memory.  In other words 
...

SAS component does both open() and mmap() which could be for READ or WRITE.

-> Calls low-level Parquet APIs to read an existing file or write a new one.  
The open() and mmap() flags are guaranteed to be correct.

At some later point SAS component does an unmap() and close(). 

-Brian


On 3/14/19, 3:42 PM, "Wes McKinney"  wrote:

hi Brian,

This is mostly an Arrow platform question so I'm copying the Arrow mailing 
list.

You can open a file using an existing file descriptor using 
ReadableFile::Open

https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/file.h#L145

The documentation for this function says:

"The file descriptor becomes owned by the ReadableFile, and will be
closed on Close() or destruction."

If you want to do the equivalent thing, but using memory mapping, I
think you'll need to add a corresponding API to MemoryMappedFile. This
is more perilous because of the API requirements of mmap -- you need
to pass the right flags and they may need to be the same flags that
were passed when opening the file descriptor, see

https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/file.cc#L378

and

https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/file.cc#L476

- Wes
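
A minimal sketch of the ReadableFile path described above, assuming roughly
the Arrow C++ API of that era (exact signatures vary by release): the caller
opens the descriptor, and Arrow takes ownership of it on Open.

#include <fcntl.h>

#include <memory>

#include "arrow/io/file.h"
#include "arrow/status.h"

arrow::Status ReadViaExistingFd(const char* path) {
  int fd = ::open(path, O_RDONLY);  // descriptor opened by the caller
  if (fd < 0) {
    return arrow::Status::IOError("open() failed");
  }
  std::shared_ptr<arrow::io::ReadableFile> file;
  // ReadableFile takes ownership of fd and closes it on Close()/destruction.
  ARROW_RETURN_NOT_OK(arrow::io::ReadableFile::Open(fd, &file));
  // ... hand `file` to the Parquet/Arrow readers ...
  return file->Close();
}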
    
    On Thu, Mar 14, 2019 at 1:47 PM Brian Bowman  wrote:
>
>  The ReadableFile class (arrow/io/file.cc) has utility methods where a 
FileDescriptor is either passed in or returned, but I don’t see how this 
surfaces through the API.
>
> Is there a way for application code to control the open lifetime of 
mmap()’d Parquet files by passing an already open FileDescriptor to Parquet 
low-level API open/close methods?
>
> Thanks,
>
> Brian
>





Re: (Ab)using parquet files on S3 storage for a huge logging database

2018-09-19 Thread Brian Bowman
Distributed row-level indexing has been done well in a particular large-scale 
data system that I'm very familiar with, albeit within a row-wise organization. 

-Brian 

On 9/19/18, 5:04 PM, "Paul Rogers"  wrote:

EXTERNAL

Hi Gerlando,

Parquet does not allow row-level indexing because some data for a row might 
not even exist; it is encoded in data about a group of similar rows.

In the world of Big Data, it seems that the most common practice is to 
simply scan all the data to find the bits you want. Indexing is very hard in a 
distributed system. (See "Building Data Intensive Applications" from O'Reilly 
for a good summary.) Parquet is optimized for this case.

You use partitions to whittle down the haystacks (set of Parquet files) you 
must search. Then you use Drill to scan those haystacks to find the needle.

Thanks,
- Paul



On Wednesday, September 19, 2018, 12:30:36 PM PDT, Brian Bowman 
 wrote:

 Gerlando,

AFAIK Parquet does not yet support indexing.  I believe it does store 
min/max values at the row batch (or maybe it's page) level which may help 
eliminate large "swaths" of data depending on how actual data values 
corresponding to a search predicate are distributed across large Parquet files.

I have an interest in the future of indexing within the native Parquet 
structure as well.  It will be interesting to see where this discussion goes 
from here.

-Brian

On 9/19/18, 3:21 PM, "Gerlando Falauto"  wrote:

EXTERNAL

Thank you all guys, you've been extremely helpful with your ideas.
I'll definitely have a look at all your suggestions to see what others 
have
been doing in this respect.

What I forgot to mention was that while the service uses the S3 API, 
it's
not provided by AWS so any solution should be based on a cloud offering
from a different big player (it's the three-letter big-blue one, in case
you're wondering).

However, I'm still not clear as to how Drill (or pyarrow) would be able 
to
gather data with random access. In any database, you just build an 
index on
the fields you're going to run most of your queries over, and then the
database takes care of everything else.

With Parquet, as I understand, you can do folder-based partitioning (is
that called "hive" partitioning?) so that you can get random access over
let's say
source=source1/date=20180918/*.parquet.
I assume drill could be instructed into doing this or even figure it 
out by
itself, by just looking at the folder structure.
What I still don't get though, is how to "index" the parquet file(s), so
that random (rather than sequential) access can be performed over the 
whole
file.
Brian mentioned metadata, I had a quick look at the parquet 
specification
and I sortof understand it somehow resembles an index.
Yet I fail to understand how such an index could be built (if at all
possible), for instance using pyarrow (or any other tool, for that 
matter)
for reading and/or writing.

Thank you!
Gerlando

On Wed, Sep 19, 2018 at 7:55 PM Ted Dunning  
wrote:

> The effect of rename can be had by handling a small inventory file 
that is
> updated atomically.
>
> Having real file semantics is sooo much nicer, though.
>
>
>
> On Wed, Sep 19, 2018 at 1:51 PM Bill Glennon  
wrote:
>
> > Also, may want to take a look at https://aws.amazon.com/athena/.
> >
> > Thanks,
> > Bill
> >
> > On Wed, Sep 19, 2018 at 1:43 PM Paul Rogers 

> > wrote:
> >
> > > Hi Gerlando,
> > >
> > > I believe AWS has entire logging pipeline they offer. If you want
> > > something quick, perhaps look into that offering.
> > >
> > > What you describe is pretty much the classic approach to log
> aggregation:
> > > partition data, gather data incrementally, then later 
consolidate. A
> > while
> > > back, someone invented the term "lambda architecture" for this 
idea.
> You
> > > should be able to find examples of how others have done something
> > similar.
> > >
> > > Drill can scan directories of files. So, in your buckets 
(source-date)
> > > directories, you can have multiple files. If you rece

Re: (Ab)using parquet files on S3 storage for a huge logging database

2018-09-19 Thread Brian Bowman
Gerlando,

AFAIK Parquet does not yet support indexing.  I believe it does store min/max 
values at the row batch (or maybe it's page) level which may help eliminate 
large "swaths" of data depending on how actual data values corresponding to a 
search predicate are distributed across large Parquet files.
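
Those min/max statistics live in the row group metadata and can be inspected
(and used for row-group pruning) without touching the data pages. A rough
parquet-cpp sketch, assuming roughly the API of the time (names may differ by
version):

#include <memory>
#include <string>

#include "parquet/api/reader.h"

void ScanRowGroupStats(const std::string& path) {
  std::unique_ptr<parquet::ParquetFileReader> reader =
      parquet::ParquetFileReader::OpenFile(path);
  std::shared_ptr<parquet::FileMetaData> md = reader->metadata();
  for (int rg = 0; rg < md->num_row_groups(); ++rg) {
    auto rg_md = md->RowGroup(rg);
    for (int col = 0; col < rg_md->num_columns(); ++col) {
      auto col_md = rg_md->ColumnChunk(col);
      if (col_md->is_stats_set() && col_md->statistics()->HasMinMax()) {
        // Compare min/max against a search predicate here to decide
        // whether this row group needs to be read at all.
      }
    }
  }
}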

I have an interest in the future of indexing within the native Parquet 
structure as well.  It will be interesting to see where this discussion goes 
from here.

-Brian 

On 9/19/18, 3:21 PM, "Gerlando Falauto"  wrote:

EXTERNAL

Thank you all guys, you've been extremely helpful with your ideas.
I'll definitely have a look at all your suggestions to see what others have
been doing in this respect.

What I forgot to mention was that while the service uses the S3 API, it's
not provided by AWS so any solution should be based on a cloud offering
from a different big player (it's the three-letter big-blue one, in case
you're wondering).

However, I'm still not clear as to how Drill (or pyarrow) would be able to
gather data with random access. In any database, you just build an index on
the fields you're going to run most of your queries over, and then the
database takes care of everything else.

With Parquet, as I understand, you can do folder-based partitioning (is
that called "hive" partitioning?) so that you can get random access over
let's say
source=source1/date=20180918/*.parquet.
I assume drill could be instructed into doing this or even figure it out by
itself, by just looking at the folder structure.
What I still don't get though, is how to "index" the parquet file(s), so
that random (rather than sequential) access can be performed over the whole
file.
Brian mentioned metadata, I had a quick look at the parquet specification
and I sortof understand it somehow resembles an index.
Yet I fail to understand how such an index could be built (if at all
possible), for instance using pyarrow (or any other tool, for that matter)
for reading and/or writing.

Thank you!
Gerlando

On Wed, Sep 19, 2018 at 7:55 PM Ted Dunning  wrote:

> The effect of rename can be had by handling a small inventory file that is
> updated atomically.
>
> Having real file semantics is sooo much nicer, though.
>
>
>
> On Wed, Sep 19, 2018 at 1:51 PM Bill Glennon  wrote:
>
> > Also, may want to take a look at https://aws.amazon.com/athena/.
> >
> > Thanks,
> > Bill
> >
> > On Wed, Sep 19, 2018 at 1:43 PM Paul Rogers 
> > wrote:
> >
> > > Hi Gerlando,
> > >
> > > I believe AWS has entire logging pipeline they offer. If you want
> > > something quick, perhaps look into that offering.
> > >
> > > What you describe is pretty much the classic approach to log
> aggregation:
> > > partition data, gather data incrementally, then later consolidate. A
> > while
> > > back, someone invented the term "lambda architecture" for this idea.
> You
> > > should be able to find examples of how others have done something
> > similar.
> > >
> > > Drill can scan directories of files. So, in your buckets (source-date)
> > > directories, you can have multiple files. If you receive data, say,
> > every 5
> > > or 10 minutes, you can just create a separate file for each new drop 
of
> > > data. You'll end up with many files, but you can query the data as it
> > > arrives.
> > >
> > > Then, later, say once per day, you can consolidate the files into a 
few
> > > big files. The only trick is the race condition of doing the
> > consolidation
> > > while running queries. Not sure how to do that on S3, since you can't
> > > exploit rename operations as you can on Linux. Anyone have suggestions
> > for
> > > this step?
> > >
> > > Thanks,
> > > - Paul
> > >
> > >
> > >
> > > On Wednesday, September 19, 2018, 6:23:13 AM PDT, Gerlando Falauto
> <
> > > gerlando.fala...@gmail.com> wrote:
> > >
> > >  Hi,
> > >
> > > I'm looking for a way to store huge amounts of logging data in the
> cloud
> > > from about 100 different data sources, each producing about 50MB/day
> (so
> > > it's something like 5GB/day).
> > > The target storage would be an S3 object storage for cost-efficiency
> > > reasons.
> > > I would like to be able to store (i.e. append-like) data in realtime,
> and
> > > retrieve data based on time frame and data source with fast access. I
> was
> > > thinking of partitioning data based on datasource and calendar day, so
> to
> > > have one file per day, per data source, each 50MB.
> > >
> > > I played around with pyarrow and parquet (using s3fs), and came across
> > the
> > > following limitations:
> > >
> > > 1)

Re: (Ab)using parquet files on S3 storage for a huge logging database

2018-09-19 Thread Brian Bowman
Gerlando is correct that S3 objects, once created, are immutable.  They cannot 
be updated in place, appended to, or even renamed.   However, S3 supports seeking 
to offsets within the object being read.  The challenge is knowing where to 
read within the S3 object, which to perform well will require metadata that can 
be derived by doing minimal I/O operations prior to seeking/reading the needed 
parts of the S3 object.
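
For Parquet specifically, the "where to read" metadata is self-describing: a
reader first fetches the last 8 bytes of the object (a 4-byte little-endian
footer length plus the magic "PAR1"), then fetches the footer, and only then
issues ranged reads for the column chunks it needs. A sketch of locating the
footer from any seekable source (illustration only; assumes a little-endian
host):

#include <cstdint>
#include <cstring>

// Trailer layout: [footer][4-byte LE footer length]["PAR1"].
// Returns the footer start offset, or -1 if this is not a Parquet file.
int64_t FooterStartOffset(int64_t object_size, const uint8_t trailer[8]) {
  if (std::memcmp(trailer + 4, "PAR1", 4) != 0) {
    return -1;
  }
  uint32_t footer_len;
  std::memcpy(&footer_len, trailer, sizeof(footer_len));
  return object_size - 8 - static_cast<int64_t>(footer_len);
}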

-Brian

On 9/19/18, 9:23 AM, "Gerlando Falauto"  wrote:

EXTERNAL

Hi,

I'm looking for a way to store huge amounts of logging data in the cloud
from about 100 different data sources, each producing about 50MB/day (so
it's something like 5GB/day).
The target storage would be an S3 object storage for cost-efficiency
reasons.
I would like to be able to store (i.e. append-like) data in realtime, and
retrieve data based on time frame and data source with fast access. I was
thinking of partitioning data based on datasource and calendar day, so to
have one file per day, per data source, each 50MB.

I played around with pyarrow and parquet (using s3fs), and came across the
following limitations:

1) I found no way to append to existing files. I believe that's some
limitation with S3, but it could be worked around by using datasets
instead. In principle, I believe I could also trigger some daily job which
coalesces, today's data into a single file, if having too much
fragmentation causes any disturbance. Would that make any sense?

2) When reading, if I'm only interested in a small portion of the data (for
instance, based on a timestamp field), I obviously wouldn't want to have to
read (i.e. download) the whole file. I believe Parquet was designed to
handle huge amounts of data with relatively fast access. Yet I fail to
understand if there's some way to allow for random access, particularly
when dealing with a file stored within S3.
The following code snippet refers to a 150MB dataset composed of 1000
rowgroups of 150KB each. I was expecting it to run very fast, yet it
apparently downloads the whole file (pyarrow 0.9.0):

fs = s3fs.S3FileSystem(key=access_key, secret=secret_key,
client_kwargs=client_kwargs)
with fs.open(bucket_uri) as f:
pf = pq.ParquetFile(f)
print(pf.num_row_groups) # yields 1000
pf.read_row_group(1)

3) I was also expecting to be able to perform some sort of query, but I'm
also failing to see how to specify index columns or such.

What am I missing? Did I get it all wrong?

Thank you!
Gerlando




Re: IDE Prefs for Arrow/Parquet C++ development/debugging

2018-09-02 Thread Brian Bowman
Thanks Wes,

I’ve begun using either CLI gdb or emacs/gud-gdb to get into Arrow/Parquet, yet 
CLion looks like it might improve productivity.

-Brian

On Sep 2, 2018, at 5:01 PM, Wes McKinney  wrote:

EXTERNAL

hey Brian,

I personally do all my work with Emacs+gdb on the command line, but
I've heard CLion is really useful for debugging and so I've been
meaning to set it up myself. I think Phillip C has used CLion some in
the past

I recently opened https://issues.apache.org/jira/browse/ARROW-3118
about documenting the setup process for this.

Apache Impala has a guide for using CLion with Impala so that's a
reasonable starting point
https://cwiki.apache.org/confluence/display/IMPALA/IntelliJ+and+CLion+Setup+for+Impala+Development

- Wes
On Sun, Sep 2, 2018 at 4:49 PM Brian Bowman  wrote:

Community,

I’m curious what your preferred IDEs are for Arrow/Parquet C++ work?  Is anyone 
using CLion from Jetbrains?

Thanks,

Brian


IDE Prefs for Arrow/Parquet C++ development/debugging

2018-09-02 Thread Brian Bowman
Community,

I’m curious what your preferred IDEs are for Arrow/Parquet C++ work?  Is anyone 
using CLion from Jetbrains?

Thanks,  

Brian

Re: arrow-glib 0.10.0

2018-08-23 Thread Brian Bowman
Kouhei,

Thank you for the these details!  I've built both the Apache Parquet and Arrow 
libraries and also installed their respective binaries and includes directly 
via yum.   

I plan to integrate both Arrow and Parquet into a threaded C environment.   
Arrow for in-memory buffer management and Parquet for columnar 'on-storage' 
support.

-Brian 
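
On the C++ side, the bridge between the two libraries is the parquet::arrow
layer. A minimal read sketch, assuming roughly the parquet-cpp API of that
time (names may differ by version):

#include <memory>
#include <string>

#include "arrow/api.h"
#include "arrow/io/file.h"
#include "parquet/arrow/reader.h"

arrow::Status ReadParquetToTable(const std::string& path,
                                 std::shared_ptr<arrow::Table>* table) {
  std::shared_ptr<arrow::io::ReadableFile> infile;
  ARROW_RETURN_NOT_OK(arrow::io::ReadableFile::Open(path, &infile));
  std::unique_ptr<parquet::arrow::FileReader> reader;
  ARROW_RETURN_NOT_OK(
      parquet::arrow::OpenFile(infile, arrow::default_memory_pool(), &reader));
  return reader->ReadTable(table);  // reads the whole file into a Table
}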

On 8/22/18, 11:10 PM, "Kouhei Sutou"  wrote:

EXTERNAL

Hi,

Arrow GLib is as capable/robust as Arrow C++ because Arrow
GLib is a wrapper of Arrow C++ and most features are
covered.

But you can't read/write Parquet data because Arrow C++
doesn't have these features. The features are provided by
parquet-cpp https://github.com/apache/parquet-cpp .

So you need to use both Arrow GLib and Parquet GLib. For now, Parquet
GLib only has features for reading Parquet data into Arrow
data and writing Arrow data as Parquet data.


Thanks,
--
kou

In <5beed7f6-78f9-474e-8aa3-e40d02e5b...@sas.com>
  "arrow-glib 0.10.0" on Wed, 22 Aug 2018 20:17:40 +,
  Brian Bowman  wrote:

> I hope this is not too naïve a question.  Is arrow-glib 
0.10.0<https://arrow.apache.org/docs/c_glib/> as capable/robust as the Arrow 
C++ library<https://arrow.apache.org/docs/cpp/>, especially with regard to 
reading and ultimately writing the parquet file format?
>
> Thanks,
>
> Brian
>




Re: arrow-glib 0.10.0

2018-08-22 Thread Brian Bowman
Thanks Wes,

Just discovered that!

-Brian 

On 8/22/18, 5:20 PM, "Wes McKinney"  wrote:

EXTERNAL

Hi Brian

The C GLib library is a wrapper for the C++ library, so it's the same code
executing under the hood.

Wes


On Wed, Aug 22, 2018, 4:17 PM Brian Bowman  wrote:

> I hope this is not too naïve a question.  Is arrow-glib 0.10.0<
> https://arrow.apache.org/docs/c_glib/> as capable/robust as the Arrow C++
> library<https://arrow.apache.org/docs/cpp/>, especially with regard to
> reading and ultimately writing the parquet file format?
>
> Thanks,
>
> Brian
>
>




arrow-glib 0.10.0

2018-08-22 Thread Brian Bowman
I hope this is not too naïve a question.  Is arrow-glib 
0.10.0 as capable/robust as the Arrow 
C++ library, especially with regard to 
reading and ultimately writing the parquet file format?

Thanks,

Brian



Re: Parquet Build issues

2018-08-20 Thread Brian Bowman
All,

My final hurdle to make parquet was updating zlib on a fresh Ubuntu VM:  sudo 
apt-get install zlib1g-dev

The happy result:

 54838568 Aug 20 18:17 build/debug/libparquet.a
 15 Aug 20 18:17 build/debug/libparquet.so -> libparquet.so.1
19 Aug 20 18:17 build/debug/libparquet.so.1 -> libparquet.so.1.4.1
21768264 Aug 20 18:17 build/debug/libparquet.so.1.4.1
54838568 Aug 20 18:17 build/latest/libparquet.a
15 Aug 20 18:17 build/latest/libparquet.so -> libparquet.so.1
19 Aug 20 18:17 build/latest/libparquet.so.1 -> libparquet.so.1.4.1
21768264 Aug 20 18:17 build/latest/libparquet.so.1.4.1


Best,

Brian

On 8/17/18, 1:16 PM, "Brian Bowman"  wrote:

Thanks for the quick reply Wes!  Indeed, I need to set up a fresh Linux 
system with the correct tooling.  I'll send an update once that's done.

-Brian 

On 8/17/18, 12:27 PM, "Wes McKinney"  wrote:

EXTERNAL

hi Brian,

It doesn't look like you're using GNU make. Can you detail your build
environment / OS?

- Wes
    
    On Fri, Aug 17, 2018 at 9:54 AM, Brian Bowman  
wrote:
> All,
>
> It’s been 2-3 years since I joined this email list and I’ve not 
contributed yet.  I’ve just begun working with Parquet/Arrow and have 
downloaded the parquet-cpp-master 1.10.0 bundle from 
GITHUB<https://github.com/apache/parquet-format>.
>
> cmake appears to run successfully but make gets the errors seen 
below.  I’m not sure how to proceed with diagnosing the cause of these make 
errors.  Any help is appreciated from list members – including pointing me to a 
different email list :).  Full disclosure – SAS has its own internal build 
system so I’m no expert on cmake/make.   Please point me to foundational 
reading on these tools if that’s what I’m missing.
    >
> Thanks,
>
> Brian
>
> Brian Bowman
> Principal Software Developer
> Analytic Server R&D
> SAS Institute Inc.
>
> brian.bow...@sas.com
>
> ___
>
> [ 14%] No update step for 'zstd_ep'
> [ 16%] No patch step for 'zstd_ep'
> [ 17%] No configure step for 'zstd_ep'
> [ 18%] Performing build step for 'zstd_ep'
> make[6]: 
".../Apache/parquet-cpp-master/arrow_ep-prefix/src/arrow_ep-build/zstd_ep-prefix/src/zstd_ep/Makefile"
 line 19: Missing dependency operator
> make[6]: 
".../Apache/parquet-cpp-master/arrow_ep-prefix/src/arrow_ep-build/zstd_ep-prefix/src/zstd_ep/Makefile"
 line 21: Need an operator
> make[6]: 
".../Apache/parquet-cpp-master/arrow_ep-prefix/src/arrow_ep-build/zstd_ep-prefix/src/zstd_ep/Makefile"
 line 23: Need an operator
> make[6]: 
".../Apache/parquet-cpp-master/arrow_ep-prefix/src/arrow_ep-build/zstd_ep-prefix/src/zstd_ep/Makefile"
 line 111: Missing dependency operator
> make[6]: 
".../Apache/parquet-cpp-master/arrow_ep-prefix/src/arrow_ep-build/zstd_ep-prefix/src/zstd_ep/Makefile"
 line 271: Need an operator
> make[6]: 
".../Apache/parquet-cpp-master/arrow_ep-prefix/src/arrow_ep-build/zstd_ep-prefix/src/zstd_ep/Makefile"
 line 274: Missing dependency operator
> make[6]: 
".../Apache/parquet-cpp-master/arrow_ep-prefix/src/arrow_ep-build/zstd_ep-prefix/src/zstd_ep/Makefile"
 line 277: Need an operator
> make[6]: 
".../Apache/parquet-cpp-master/arrow_ep-prefix/src/arrow_ep-build/zstd_ep-prefix/src/zstd_ep/Makefile"
 line 283: Missing dependency operator
> make[6]: 
".../Apache/parquet-cpp-master/arrow_ep-prefix/src/arrow_ep-build/zstd_ep-prefix/src/zstd_ep/Makefile"
 line 325: Need an operator
> make[6]: Fatal errors encountered -- cannot continue
> make[6]: stopped in 
.../Apache/parquet-cpp-master/arrow_ep-prefix/src/arrow_ep-build/zstd_ep-prefix/src/zstd_ep
> *** Error code 1
>
> Stop.
> make[5]: stopped in 
.../Apache/parquet-cpp-master/arrow_ep-prefix/src/arrow_ep-build
> *** Error code 1
>
> Stop.
> make[4]: stopped in 
.../Apache/parquet-cpp-master/arrow_ep-prefix/src/arrow_ep-build
> *** Error code 1
>
> Stop.
> make[3]: stopped in 
.../Apache/parquet-cpp-master/arrow_ep-prefix/src/arrow_ep-build
> *** Error code 1
>
> Stop.
> make[2]: stopped in .../Apache/parquet-cpp-master
> *** Error code 1
>
> Stop.
> make[1]: stopped in .../Apache/parquet-cpp-master
> *** Error code 1
>
> Stop.
> make: stopped in .../Apache/parquet-cpp-master
>
>
>
>
>






Re: Parquet Build issues

2018-08-17 Thread Brian Bowman
Thanks for the quick reply Wes!  Indeed, I need to set up a fresh Linux system 
with the correct tooling.  I'll send an update once that's done.

-Brian 

On 8/17/18, 12:27 PM, "Wes McKinney"  wrote:

EXTERNAL

hi Brian,

It doesn't look like you're using GNU make. Can you detail your build
environment / OS?

- Wes

On Fri, Aug 17, 2018 at 9:54 AM, Brian Bowman  wrote:
> All,
>
> It’s been 2-3 years since I joined this email list and I’ve not 
contributed yet.  I’ve just begun working with Parquet/Arrow and have 
downloaded the parquet-cpp-master 1.10.0 bundle from 
GITHUB<https://github.com/apache/parquet-format>.
>
> cmake appears to run successfully but make gets the errors seen below.  
I’m not sure how to proceed with diagnosing the cause of these make errors.  
Any help is appreciated from list members – including pointing me to a 
different email list :).Full disclosure – SAS has its own internal build 
system so I’m no expert on cmake/make.   Please point me to foundational 
reading on these tools if that’s what I’m missing.
>
    > Thanks,
>
> Brian
>
> Brian Bowman
> Principal Software Developer
> Analytic Server R&D
> SAS Institute Inc.
>
> brian.bow...@sas.com
>
> ___
>
> [ 14%] No update step for 'zstd_ep'
> [ 16%] No patch step for 'zstd_ep'
> [ 17%] No configure step for 'zstd_ep'
> [ 18%] Performing build step for 'zstd_ep'
> make[6]: 
".../Apache/parquet-cpp-master/arrow_ep-prefix/src/arrow_ep-build/zstd_ep-prefix/src/zstd_ep/Makefile"
 line 19: Missing dependency operator
> make[6]: 
".../Apache/parquet-cpp-master/arrow_ep-prefix/src/arrow_ep-build/zstd_ep-prefix/src/zstd_ep/Makefile"
 line 21: Need an operator
> make[6]: 
".../Apache/parquet-cpp-master/arrow_ep-prefix/src/arrow_ep-build/zstd_ep-prefix/src/zstd_ep/Makefile"
 line 23: Need an operator
> make[6]: 
".../Apache/parquet-cpp-master/arrow_ep-prefix/src/arrow_ep-build/zstd_ep-prefix/src/zstd_ep/Makefile"
 line 111: Missing dependency operator
> make[6]: 
".../Apache/parquet-cpp-master/arrow_ep-prefix/src/arrow_ep-build/zstd_ep-prefix/src/zstd_ep/Makefile"
 line 271: Need an operator
> make[6]: 
".../Apache/parquet-cpp-master/arrow_ep-prefix/src/arrow_ep-build/zstd_ep-prefix/src/zstd_ep/Makefile"
 line 274: Missing dependency operator
> make[6]: 
".../Apache/parquet-cpp-master/arrow_ep-prefix/src/arrow_ep-build/zstd_ep-prefix/src/zstd_ep/Makefile"
 line 277: Need an operator
> make[6]: 
".../Apache/parquet-cpp-master/arrow_ep-prefix/src/arrow_ep-build/zstd_ep-prefix/src/zstd_ep/Makefile"
 line 283: Missing dependency operator
> make[6]: 
".../Apache/parquet-cpp-master/arrow_ep-prefix/src/arrow_ep-build/zstd_ep-prefix/src/zstd_ep/Makefile"
 line 325: Need an operator
> make[6]: Fatal errors encountered -- cannot continue
> make[6]: stopped in 
.../Apache/parquet-cpp-master/arrow_ep-prefix/src/arrow_ep-build/zstd_ep-prefix/src/zstd_ep
> *** Error code 1
>
> Stop.
> make[5]: stopped in 
.../Apache/parquet-cpp-master/arrow_ep-prefix/src/arrow_ep-build
> *** Error code 1
>
> Stop.
> make[4]: stopped in 
.../Apache/parquet-cpp-master/arrow_ep-prefix/src/arrow_ep-build
> *** Error code 1
>
> Stop.
> make[3]: stopped in 
.../Apache/parquet-cpp-master/arrow_ep-prefix/src/arrow_ep-build
> *** Error code 1
>
> Stop.
> make[2]: stopped in .../Apache/parquet-cpp-master
> *** Error code 1
>
> Stop.
> make[1]: stopped in .../Apache/parquet-cpp-master
> *** Error code 1
>
> Stop.
> make: stopped in .../Apache/parquet-cpp-master
>
>
>
>
>




Parquet Build issues

2018-08-17 Thread Brian Bowman
All,

It’s been 2-3 years since I joined this email list and I’ve not contributed 
yet.  I’ve just begun working with Parquet/Arrow and have downloaded the 
parquet-cpp-master 1.10.0 bundle from 
GITHUB<https://github.com/apache/parquet-format>.

cmake appears to run successfully but make gets the errors seen below.  I’m not 
sure how to proceed with diagnosing the cause of these make errors.  Any help 
is appreciated from list members – including pointing me to a different email 
list :).  Full disclosure – SAS has its own internal build system so I’m no 
expert on cmake/make.   Please point me to foundational reading on these tools 
if that’s what I’m missing.

Thanks,

Brian

Brian Bowman
Principal Software Developer
Analytic Server R&D
SAS Institute Inc.

brian.bow...@sas.com

___

[ 14%] No update step for 'zstd_ep'
[ 16%] No patch step for 'zstd_ep'
[ 17%] No configure step for 'zstd_ep'
[ 18%] Performing build step for 'zstd_ep'
make[6]: 
".../Apache/parquet-cpp-master/arrow_ep-prefix/src/arrow_ep-build/zstd_ep-prefix/src/zstd_ep/Makefile"
 line 19: Missing dependency operator
make[6]: 
".../Apache/parquet-cpp-master/arrow_ep-prefix/src/arrow_ep-build/zstd_ep-prefix/src/zstd_ep/Makefile"
 line 21: Need an operator
make[6]: 
".../Apache/parquet-cpp-master/arrow_ep-prefix/src/arrow_ep-build/zstd_ep-prefix/src/zstd_ep/Makefile"
 line 23: Need an operator
make[6]: 
".../Apache/parquet-cpp-master/arrow_ep-prefix/src/arrow_ep-build/zstd_ep-prefix/src/zstd_ep/Makefile"
 line 111: Missing dependency operator
make[6]: 
".../Apache/parquet-cpp-master/arrow_ep-prefix/src/arrow_ep-build/zstd_ep-prefix/src/zstd_ep/Makefile"
 line 271: Need an operator
make[6]: 
".../Apache/parquet-cpp-master/arrow_ep-prefix/src/arrow_ep-build/zstd_ep-prefix/src/zstd_ep/Makefile"
 line 274: Missing dependency operator
make[6]: 
".../Apache/parquet-cpp-master/arrow_ep-prefix/src/arrow_ep-build/zstd_ep-prefix/src/zstd_ep/Makefile"
 line 277: Need an operator
make[6]: 
".../Apache/parquet-cpp-master/arrow_ep-prefix/src/arrow_ep-build/zstd_ep-prefix/src/zstd_ep/Makefile"
 line 283: Missing dependency operator
make[6]: 
".../Apache/parquet-cpp-master/arrow_ep-prefix/src/arrow_ep-build/zstd_ep-prefix/src/zstd_ep/Makefile"
 line 325: Need an operator
make[6]: Fatal errors encountered -- cannot continue
make[6]: stopped in 
.../Apache/parquet-cpp-master/arrow_ep-prefix/src/arrow_ep-build/zstd_ep-prefix/src/zstd_ep
*** Error code 1

Stop.
make[5]: stopped in 
.../Apache/parquet-cpp-master/arrow_ep-prefix/src/arrow_ep-build
*** Error code 1

Stop.
make[4]: stopped in 
.../Apache/parquet-cpp-master/arrow_ep-prefix/src/arrow_ep-build
*** Error code 1

Stop.
make[3]: stopped in 
.../Apache/parquet-cpp-master/arrow_ep-prefix/src/arrow_ep-build
*** Error code 1

Stop.
make[2]: stopped in .../Apache/parquet-cpp-master
*** Error code 1

Stop.
make[1]: stopped in .../Apache/parquet-cpp-master
*** Error code 1

Stop.
make: stopped in .../Apache/parquet-cpp-master







Re: Arrow-Parquet integration location (Was: Arrow cpp travis-ci build broken)

2016-09-06 Thread Brian Bowman
Forgive me if interposing my first post for the Apache Arrow project on this 
thread is incorrect procedure. 

What Julien proposes with each storage layer producing Arrow Record Batches is 
exactly how I envision it working and would certainly make Arrow integration 
with SAS much more palatable.  This is likely true for other storage layer 
providers as well. 

Brian Bowman (SAS)

> On Sep 6, 2016, at 7:52 PM, Julien Le Dem  wrote:
> 
> Thanks Wes,
> No worries, I know you are on top of those things.
> On a side note, I was wondering if the arrow-parquet integration should be
> in Parquet instead.
> Parquet would depend on Arrow and not the other way around.
> Arrow provides the API and each storage layer (Parquet, Kudu, Cassandra,
> ...) provides a way to produce Arrow Record Batches.
> thoughts?
> 
>> On Tue, Sep 6, 2016 at 3:37 PM, Wes McKinney  wrote:
>> 
>> hi Julien,
>> 
>> I'm very sorry about the inconvenience with this and the delay in
>> getting it sorted out. I will triage this evening by disabling the
>> Parquet tests in Arrow until we get the current problems under
>> control. When we re-enable the Parquet tests in Travis CI I agree we
>> should pin the version SHA.
>> 
>> - Wes
>> 
>>> On Tue, Sep 6, 2016 at 5:30 PM, Julien Le Dem  wrote:
>>> The Arrow cpp travis-ci build is broken right now because it depends on
>>> parquet-cpp which has changed in an incompatible way. [1] [2] (or so it
>>> looks to me)
>>> Since parquet-cpp is not released yet it is totally fine to make
>>> incompatible API changes.
>>> However, we may want to pin the Arrow to Parquet dependency (on a git
>> sha?)
>>> to prevent cross project changes from breaking the master build.
>>> Since I'm not one of the core cpp dev on those projects I mainly want to
>>> start that conversation rather than prescribe a solution. Feel free to
>> take
>>> this as a straw man and suggest something else.
>>> 
>>> [1] https://travis-ci.org/apache/arrow/jobs/156080555
>>> [2]
>>> https://github.com/apache/arrow/blob/2d8ec789365f3c0f82b1f22d76160d
>> 5af150dd31/ci/travis_before_script_cpp.sh
>>> 
>>> 
>>> --
>>> Julien
> 
> 
> 
> -- 
> Julien