Re: arrow-glib 0.10.0

2018-08-22 Thread Kouhei Sutou
Hi,

Arrow GLib is essentially as capable and robust as Arrow C++,
because Arrow GLib is a wrapper around Arrow C++ and covers
most of its features.

However, you can't read or write Parquet data with Arrow GLib
alone, because Arrow C++ doesn't have these features. They are
provided by parquet-cpp https://github.com/apache/parquet-cpp .

So you need to use both Arrow GLib and Parquet GLib. For now,
Parquet GLib only supports reading Parquet data into Arrow data
and writing Arrow data as Parquet data.


Thanks,
--
kou

In <5beed7f6-78f9-474e-8aa3-e40d02e5b...@sas.com>
  "arrow-glib 0.10.0" on Wed, 22 Aug 2018 20:17:40 +,
  Brian Bowman  wrote:

> I hope this is not too naïve a question.  Is arrow-glib 0.10.0 as
> capable/robust as the Arrow C++ library, especially with regard to
> reading and ultimately writing the Parquet file format?
> 
> Thanks,
> 
> Brian
> 


[jira] [Created] (ARROW-3111) [Java] Enable changing default logging level when running tests

2018-08-22 Thread Bryan Cutler (JIRA)
Bryan Cutler created ARROW-3111:
---

             Summary: [Java] Enable changing default logging level when running tests
                 Key: ARROW-3111
                 URL: https://issues.apache.org/jira/browse/ARROW-3111
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Java
            Reporter: Bryan Cutler
            Assignee: Bryan Cutler


Currently, tests use the logback logger, which has a default level of DEBUG. We
should provide a way to change this level so that CI can run a build without
printing DEBUG messages when needed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: arrow-glib 0.10.0

2018-08-22 Thread Brian Bowman
Thanks Wes,

Just discovered that!

-Brian 

On 8/22/18, 5:20 PM, "Wes McKinney"  wrote:

Hi Brian

The C GLib library is a wrapper for the C++ library, so it's the same code
executing under the hood.

Wes


On Wed, Aug 22, 2018, 4:17 PM Brian Bowman  wrote:

> I hope this is not too naïve a question.  Is arrow-glib 0.10.0
> <https://arrow.apache.org/docs/c_glib/> as capable/robust as the Arrow C++
> library, especially with regard to
> reading and ultimately writing the Parquet file format?
>
> Thanks,
>
> Brian
>
>




Re: arrow-glib 0.10.0

2018-08-22 Thread Wes McKinney
Hi Brian

The C GLib library is a wrapper for the C++ library, so it's the same code
executing under the hood.

Wes


On Wed, Aug 22, 2018, 4:17 PM Brian Bowman  wrote:

> I hope this is not too naïve a question.  Is arrow-glib 0.10.0
> <https://arrow.apache.org/docs/c_glib/> as capable/robust as the Arrow C++
> library, especially with regard to
> reading and ultimately writing the Parquet file format?
>
> Thanks,
>
> Brian
>
>


Re: Using Arrow IPC/GPU for column-like structures

2018-08-22 Thread Wes McKinney
Thanks!

We have already implemented GPU IPC for CUDA:

https://github.com/apache/arrow/blob/master/cpp/src/arrow/gpu/cuda_arrow_ipc.h

Is it possible to use these APIs? If not, what could be changed or added to
allow you to? I don't think it's worthwhile to maintain an alternative
implementation of the IPC protocol in a third party package. The results
can be converted to the C data structure that you listed.

Wes

On Wed, Aug 22, 2018, 4:56 PM Pearu Peterson 
wrote:

> Hi Wes,
>
> Yes, sorry for the mess. Here is the message in plain text:
>
> The libgdf project defines a column structure that in a simplified form
> could be represented as
>
> typedef struct {
> void *data;  // column data
> unsigned char *valid;  // validity mask, one bit per column item
> size_t size; // nof items
> enum {INT8, INT16, ...} dtype; // type of column item
> size_t null_count;   // nof non-valid items
> } my_column_t;
>
> The aim is to implement IPC protocol for sharing my_column_t data between
> host and GPU devices.
>
> What would be the most sensible way to do that using tools available in
> Arrow library?
>
> We are currently considering the following approaches:
>
> 1. Re-using Arrow Array: my_column_t and Arrow Array have one-to-one
> correspondence regarding data content.
>
> 2. Defining new Arrow format MyColumn (using Arrow Tensor as an example):
>
> table MyColumn {
>   /// The type of data contained in a value cell.
>   type: Type;
>   /// The number of non-valid items
>   null_count: long;
>   /// The location and size of the column's data
>   data: Buffer;
>   /// The location and size of the column's mask
>   valid: Buffer;
> }
>
> We are uncertain which approach would be easiest to implement and maintain,
> be efficient (0-copy), or would make sense at all.
>
> Defining Arrow MyColumn seems appealing because of about 7 times less code
> in Arrow Tensor than in Arrow Array. However, Arrow Array includes validity
> mask already.
>
> What do you think?
>
> Best regards,
> Pearu
>
>
> On Wed, Aug 22, 2018 at 11:53 PM, Wes McKinney 
> wrote:
>
> > Hi Pearu,
> >
> > Seems the formatting of your email got messed up a little bit. Can you
> > resend with some more line breaks?
> >
> > Thanks
> >
> >


Re: Using Arrow IPC/GPU for column-like structures

2018-08-22 Thread Pearu Peterson
Hi Wes,

Yes, sorry for the mess. Here is the message in plain text:

The libgdf project defines a column structure that in a simplified form
could be represented as

typedef struct {
    void *data;                     // column data
    unsigned char *valid;           // validity mask, one bit per column item
    size_t size;                    // number of items
    enum {INT8, INT16, ...} dtype;  // type of column item
    size_t null_count;              // number of non-valid items
} my_column_t;

The aim is to implement an IPC protocol for sharing my_column_t data between
host and GPU devices.

What would be the most sensible way to do that using the tools available in
the Arrow library?

We are currently considering the following approaches:

1. Re-using Arrow Array: my_column_t and Arrow Array have a one-to-one
correspondence regarding data content.

2. Defining a new Arrow format MyColumn (using Arrow Tensor as an example):

table MyColumn {
  /// The type of data contained in a value cell.
  type: Type;
  /// The number of non-valid items
  null_count: long;
  /// The location and size of the column's data
  data: Buffer;
  /// The location and size of the column's mask
  valid: Buffer;
}

We are unsure which approach would be easiest to implement and maintain,
which would be efficient (zero-copy), or whether either makes sense at all.

Defining an Arrow MyColumn seems appealing because Arrow Tensor has about 7
times less code than Arrow Array. However, Arrow Array already includes a
validity mask.
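
To make the role of the validity mask concrete, here is a minimal sketch in C.
It is not libgdf or Arrow code. It assumes the mask follows Arrow's bitmap
convention (a set bit means the item is valid, and item i is stored in byte
i / 8 at bit i % 8, least-significant bit first) and that a NULL mask means
"all items valid". The names col_dtype_t, COL_INT8, COL_INT16,
column_is_valid and column_count_nulls are hypothetical, and the dtype enum
is cut down to two members for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reduced stand-in for the dtype enum of my_column_t (hypothetical names). */
typedef enum { COL_INT8, COL_INT16 } col_dtype_t;

/* Same fields as the simplified my_column_t described in this message. */
typedef struct {
    void *data;            /* column data */
    unsigned char *valid;  /* validity mask, one bit per item (set = valid) */
    size_t size;           /* number of items */
    col_dtype_t dtype;     /* type of column item */
    size_t null_count;     /* number of non-valid items */
} my_column_t;

/* Returns 1 if item i is valid, 0 if it is null.
   Assumption: least-significant-bit numbering, NULL mask = all valid. */
static int column_is_valid(const my_column_t *col, size_t i)
{
    assert(col != NULL && i < col->size);
    if (col->valid == NULL) {
        return 1;
    }
    return (col->valid[i / 8] >> (i % 8)) & 1;
}

/* Recomputes the number of null items by scanning the validity mask. */
static size_t column_count_nulls(const my_column_t *col)
{
    size_t nulls = 0;
    for (size_t i = 0; i < col->size; ++i) {
        if (!column_is_valid(col, i)) {
            ++nulls;
        }
    }
    return nulls;
}
```

Under these assumptions, data, valid, size and null_count correspond
directly to the value buffer, validity bitmap, length and null count of a
primitive Arrow Array, which is the one-to-one correspondence mentioned in
approach 1.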

What do you think?

Best regards,
Pearu


On Wed, Aug 22, 2018 at 11:53 PM, Wes McKinney  wrote:

> Hi Pearu,
>
> Seems the formatting of your email got messed up a little bit. Can you
> resend with some more line breaks?
>
> Thanks
>
>


Re: Using Arrow IPC/GPU for column-like structures

2018-08-22 Thread Wes McKinney
Hi Pearu,

Seems the formatting of your email got messed up a little bit. Can you
resend with some more line breaks?

Thanks




Using Arrow IPC/GPU for column-like structures

2018-08-22 Thread Pearu Peterson
Hi,

The libgdf project defines a column structure that in a simplified form
could be represented as

typedef struct {
    void *data;                     // column data
    unsigned char *valid;           // validity mask // one bit per column item
    size_t size;                    // nof items
    enum {INT8, INT16, ...} dtype;  // type of column item
    size_t null_count;              // nof non-valid items
} my_column_t;

The aim is to implement IPC protocol for sharing my_column_t data between
host and GPU devices. What would be the most sensible way to do that using
tools available in Arrow library?

We are currently considering the following approaches:

1. Re-using Arrow Array (C++): my_column_t and Arrow Array have one-to-one
correspondence regarding data content.

2. Defining new Arrow format MyColumn (using Arrow Tensor as an example):

table MyColumn {
  /// The type of data contained in a value cell.
  type: Type;
  /// The number of non-valid items
  null_count: long;
  /// The location and size of the column's data
  data: Buffer;
  /// The location and size of the column's mask
  valid: Buffer;
}

We are uncertain which approach would be easiest to implement and maintain,
be efficient (0-copy), or would make sense at all.

Defining Arrow MyColumn seems appealing because of about 7 times less code
in Arrow Tensor than in Arrow Array. However, Arrow Array includes validity
mask already.

What do you think?

Best regards,
Pearu


arrow-glib 0.10.0

2018-08-22 Thread Brian Bowman
I hope this is not too naïve a question.  Is arrow-glib 0.10.0 as
capable/robust as the Arrow C++ library, especially with regard to
reading and ultimately writing the Parquet file format?

Thanks,

Brian



[jira] [Created] (ARROW-3110) [C++] Compilation warnings with gcc 7.3.0

2018-08-22 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-3110:
-

             Summary: [C++] Compilation warnings with gcc 7.3.0
                 Key: ARROW-3110
                 URL: https://issues.apache.org/jira/browse/ARROW-3110
             Project: Apache Arrow
          Issue Type: Task
          Components: C++
    Affects Versions: 0.10.0
            Reporter: Antoine Pitrou
            Assignee: Antoine Pitrou


This is happening when building in release mode:
{code}
../src/arrow/python/python_to_arrow.cc: In function 'arrow::Status arrow::py::detail::BuilderAppend(arrow::BinaryBuilder*, PyObject*, bool*)':
../src/arrow/python/python_to_arrow.cc:388:56: warning: 'length' may be used uninitialized in this function [-Wmaybe-uninitialized]
   if (ARROW_PREDICT_FALSE(builder->value_data_length() + length > kBinaryMemoryLimit)) {
^
../src/arrow/python/python_to_arrow.cc:385:11: note: 'length' was declared here
   int32_t length;
   ^~
In file included from ../src/arrow/python/serialize.cc:32:0:
../src/arrow/builder.h: In member function 'arrow::Status arrow::py::SequenceBuilder::Update(int64_t, int8_t*)':
../src/arrow/builder.h:413:5: warning: 'offset32' may be used uninitialized in this function [-Wmaybe-uninitialized]
 raw_data_[length_++] = val;
 ^
../src/arrow/python/serialize.cc:90:13: note: 'offset32' was declared here
 int32_t offset32;
 ^~~~
{code}
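
For readers unfamiliar with this class of warning, the following is a small
standalone sketch in C of the general pattern. It is not the Arrow code and
not necessarily the fix that will be applied. GCC emits
-Wmaybe-uninitialized when it cannot prove that a variable is assigned on
every path before it is read, which often happens when a helper fills an
out-parameter only on success. The names get_length and would_exceed are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: writes *length only when it succeeds (returns 0). */
static int get_length(const char *s, int32_t *length)
{
    if (s == NULL) {
        return -1;  /* error path: *length is left untouched */
    }
    *length = (int32_t)strlen(s);
    return 0;
}

/* Returns 1 if appending s would exceed limit, 0 if not, -1 on error. */
static int would_exceed(const char *s, int32_t used, int32_t limit)
{
    assert(used >= 0 && limit >= 0);

    /* Initialized at declaration, so no path can read an indeterminate
       value even if the compiler cannot follow the helper's control flow. */
    int32_t length = 0;

    if (get_length(s, &length) != 0) {
        return -1;
    }
    return used + length > limit;
}
```

Initializing at the declaration is one common way to silence the warning; it
costs a single store and keeps the error path well-defined.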






Python 3.7 (was: Re: Arrow Sync)

2018-08-22 Thread Antoine Pitrou


On 22/08/2018 at 18:27, Wes McKinney wrote:
> - Python 3.7
>   - Easiest to do Linux wheels for now
>   - Will take a while for conda-forge toolchain to catch up

Note that I'm currently developing with Anaconda-originated packages and it
seems to work fine.

Regards

Antoine.


> 
> On Wed, Aug 22, 2018 at 11:29 AM, Phillip Cloud  wrote:
>> I won't be able to make the call today, I have a conflict.
>>
>> On Wed, Aug 22, 2018 at 11:04 AM Wes McKinney  wrote:
>>
>>> No worries. It's 12pm Eastern today at
>>> https://meet.google.com/vtm-teks-phx
>>>
>>> On Wed, Aug 22, 2018 at 10:56 AM, Siddharth Teotia 
>>> wrote:
>>>> I have a clash this morning so won't be able to join the call.
>>>


Re: Arrow Sync

2018-08-22 Thread Wes McKinney
Notes from today's sync

- Wes (Ursa Labs)
  - Parquet merge
  - Scalars in C++
- Uwe (Blue Yonder)
  - Same topics
- Li (Two Sigma)
  - Arrow Flight
- Laurent (Dremio)

- Arrow Flight
  - PR ARROW-249, more feedback needed
  - Wes planning to work on C++ prototype
  - Future: Extending Calcite Avatica with Arrow Flight
- Database clients
  - Optimized Arrow interface to native protocols
  - ODBC separate interface for interacting with databases
  - ODBC drivers can be fast or slow based on the implementation
- Scalars in C++
  - Impala UDF scalar API as one inspiration
  - Need type metadata
  - Explore kernel extensions for type casting for scalars
- Parquet C++ merge
  - Vote up on dev@ mailing list now
  - Parquet release happening soon
  - Wes/Uwe investigating solutions for the git commit history
- Python 3.7
  - Easiest to do Linux wheels for now
  - Will take a while for conda-forge toolchain to catch up

On Wed, Aug 22, 2018 at 11:29 AM, Phillip Cloud  wrote:
> I won't be able to make the call today, I have a conflict.
>
> On Wed, Aug 22, 2018 at 11:04 AM Wes McKinney  wrote:
>
>> No worries. It's 12pm Eastern today at
>> https://meet.google.com/vtm-teks-phx
>>
>> On Wed, Aug 22, 2018 at 10:56 AM, Siddharth Teotia 
>> wrote:
>> > I have a clash this morning so won't be able to join the call.
>>


Re: Arrow Sync

2018-08-22 Thread Phillip Cloud
I won't be able to make the call today, I have a conflict.

On Wed, Aug 22, 2018 at 11:04 AM Wes McKinney  wrote:

> No worries. It's 12pm Eastern today at
> https://meet.google.com/vtm-teks-phx
>
> On Wed, Aug 22, 2018 at 10:56 AM, Siddharth Teotia 
> wrote:
> > I have a clash this morning so won't be able to join the call.
>


Re: Arrow Sync

2018-08-22 Thread Wes McKinney
No worries. It's 12pm Eastern today at https://meet.google.com/vtm-teks-phx

On Wed, Aug 22, 2018 at 10:56 AM, Siddharth Teotia  wrote:
> I have a clash this morning so won't be able to join the call.


Arrow Sync

2018-08-22 Thread Siddharth Teotia
I have a clash this morning so won't be able to join the call.


[jira] [Created] (ARROW-3109) [Python] Add Python 3.7 virtualenvs to manylinux1 container

2018-08-22 Thread Uwe L. Korn (JIRA)
Uwe L. Korn created ARROW-3109:
--

             Summary: [Python] Add Python 3.7 virtualenvs to manylinux1 container
                 Key: ARROW-3109
                 URL: https://issues.apache.org/jira/browse/ARROW-3109
             Project: Apache Arrow
          Issue Type: Task
          Components: Python
            Reporter: Uwe L. Korn
            Assignee: Uwe L. Korn
             Fix For: 0.11.0








[jira] [Created] (ARROW-3108) [C++] arrow::PrettyPrint for Table instances

2018-08-22 Thread Uwe L. Korn (JIRA)
Uwe L. Korn created ARROW-3108:
--

             Summary: [C++] arrow::PrettyPrint for Table instances
                 Key: ARROW-3108
                 URL: https://issues.apache.org/jira/browse/ARROW-3108
             Project: Apache Arrow
          Issue Type: New Feature
          Components: C++
    Affects Versions: 0.10.0
            Reporter: Uwe L. Korn
             Fix For: 0.12.0


Extend the {{arrow::PrettyPrint}} functionality to also support 
{{arrow::Table}} instances in addition to {{RecordBatch}}.





[jira] [Created] (ARROW-3107) [C++] arrow::PrettyPrint for Column instances

2018-08-22 Thread Uwe L. Korn (JIRA)
Uwe L. Korn created ARROW-3107:
--

             Summary: [C++] arrow::PrettyPrint for Column instances
                 Key: ARROW-3107
                 URL: https://issues.apache.org/jira/browse/ARROW-3107
             Project: Apache Arrow
          Issue Type: New Feature
          Components: C++
    Affects Versions: 0.10.0
            Reporter: Uwe L. Korn
             Fix For: 0.12.0


Currently, we support {{arrow::ChunkedArray}} instances in {{PrettyPrint}}. We
should also support columns. The main addition here is that the column's
field will also be printed.


