Re: [VOTE] Apache Arrow JavaScript 0.3.1 - RC1

2018-03-15 Thread Uwe L. Korn
+1 (binding). Ran js-verify-release-candidate.sh with Node 9.8.0

On Thu, Mar 15, 2018, at 1:50 AM, Wes McKinney wrote:
> +1 (binding). Ran js-verify-release-candidate.sh with Node 8.10.0 LTS
> 
> On Wed, Mar 14, 2018 at 8:40 PM, Paul Taylor  wrote:
> > +1 (non-binding)
> >
> >> On Mar 14, 2018, at 5:10 PM, Wes McKinney  wrote:
> >>
> >> Hello all,
> >>
> >> I'd like to propose the following release candidate (rc1) of Apache Arrow
> >> JavaScript version 0.3.1.
> >>
> >> The source release rc1 is hosted at [1].
> >>
> >> This release candidate is based on commit
> >> 077bd53df590cafe26fc784b3c6d03bf1ac24f67
> >>
> >> Please download, verify checksums and signatures, run the unit tests, and 
> >> vote
> >> on the release. The easiest way is to use the JavaScript-specific release
> >> verification script dev/release/js-verify-release-candidate.sh.
> >>
> >> The vote will be open for at least 24 hours and will close once
> >> enough PMCs have approved the release.
> >>
> >> [ ] +1 Release this as Apache Arrow JavaScript 0.3.1
> >> [ ] +0
> >> [ ] -1 Do not release this as Apache Arrow JavaScript 0.3.1 because...
> >>
> >>
> >> How to validate a release signature:
> >> https://httpd.apache.org/dev/verification.html
> >>
> >> [1]: 
> >> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-js-0.3.1-rc1/
> >> [2]: 
> >> https://github.com/apache/arrow/tree/077bd53df590cafe26fc784b3c6d03bf1ac24f67
> >
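
For anyone checking the artifacts by hand instead of running the verification script, the checksum step can be sketched in Python as below. This is only a sketch: it assumes a SHA-256 digest is published next to the tarball (which digest files actually accompany the release is an assumption here), the helper names are hypothetical, and it does not cover the GPG signature check described at the httpd link.

```python
import hashlib
import os
import tempfile


def sha256_of(path, chunk_size=1 << 20):
    """Stream the file in chunks so large tarballs are not loaded whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(path, expected_hex):
    """Compare the computed digest with the published one (hypothetical helper)."""
    return sha256_of(path) == expected_hex.strip().lower()


# Demo on a throwaway file standing in for the downloaded tarball.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"example tarball bytes")
    demo_path = tmp.name

expected = hashlib.sha256(b"example tarball bytes").hexdigest()
ok = checksum_matches(demo_path, expected)
bad = checksum_matches(demo_path, "0" * 64)
os.remove(demo_path)
print(ok, bad)
```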


[jira] [Created] (ARROW-2314) [Python] Union array slicing is defective

2018-03-15 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-2314:
-

 Summary: [Python] Union array slicing is defective
 Key: ARROW-2314
 URL: https://issues.apache.org/jira/browse/ARROW-2314
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Python
Affects Versions: 0.8.0
Reporter: Antoine Pitrou


{code:python}
>>> a = pa.UnionArray.from_sparse(pa.array([0,1,1], type=pa.int8()),
...     [pa.array(["a", "b", "c"]), pa.array([2,3,4])])
>>> a

[
  'a',
  3,
  4
]
>>> a[1:]

[
  2,
  3
]
{code}
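
For reference, the semantics of a sparse union can be modelled in plain Python without pyarrow. This is a minimal sketch, not pyarrow's implementation, and the helper names are hypothetical. In a sparse union every child has the same length as the union, and element i is taken from the child selected by the type id, at position i. Under this model the slice a[1:] would be expected to yield [3, 4] rather than the [2, 3] shown above.

```python
def sparse_union_get(type_ids, children, i):
    """Element i of a sparse union: the type id picks the child,
    and the child is indexed at the same position i."""
    return children[type_ids[i]][i]


def sparse_union_slice(type_ids, children, start, stop=None):
    """Slice a sparse union by slicing the type ids and every child alike."""
    return type_ids[start:stop], [child[start:stop] for child in children]


type_ids = [0, 1, 1]
children = [["a", "b", "c"], [2, 3, 4]]

full = [sparse_union_get(type_ids, children, i) for i in range(len(type_ids))]

s_ids, s_children = sparse_union_slice(type_ids, children, 1)
sliced = [sparse_union_get(s_ids, s_children, i) for i in range(len(s_ids))]

print(full)    # ['a', 3, 4]
print(sliced)  # [3, 4]
```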



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [VOTE] Apache Arrow JavaScript 0.3.1 - RC1

2018-03-15 Thread Brian Hulette
+1 (non-binding). Ran js-verify-release-candidate.sh with Node 8.9.1 on 
Ubuntu 16.04. Thanks Wes!






[jira] [Created] (ARROW-2315) [C++/Python] Add method to flatten a struct array

2018-03-15 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-2315:
-

 Summary: [C++/Python] Add method to flatten a struct array
 Key: ARROW-2315
 URL: https://issues.apache.org/jira/browse/ARROW-2315
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Python
Affects Versions: 0.9.0
Reporter: Antoine Pitrou
Assignee: Antoine Pitrou


See ARROW-1886. We want to be able to take a StructArray and flatten it into 
independent field arrays.
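
What flattening means can be sketched in plain Python. This is only an illustration with a hypothetical helper, not the proposed C++ or pyarrow API. It assumes a struct array is modelled as a mapping of field name to values plus a validity list for the struct slots, and that a null struct slot should become a null in every flattened field (whether the real method propagates parent nulls this way is an assumption).

```python
def flatten_struct(fields, validity):
    """fields: dict of field name -> list of values, all of equal length.
    validity: list of bools, False where the struct slot itself is null.
    Returns one independent list per field, with None wherever the
    parent struct slot is null."""
    return {
        name: [value if ok else None for value, ok in zip(values, validity)]
        for name, values in fields.items()
    }


fields = {"x": [1, 2, 3], "y": ["a", None, "c"]}
validity = [True, True, False]

flat = flatten_struct(fields, validity)
print(flat)  # {'x': [1, 2, None], 'y': ['a', None, None]}
```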





[jira] [Created] (ARROW-2316) [C++] Revert Buffer::mutable_data member to always inline

2018-03-15 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2316:
---

 Summary: [C++] Revert Buffer::mutable_data member to always inline
 Key: ARROW-2316
 URL: https://issues.apache.org/jira/browse/ARROW-2316
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Wes McKinney
Assignee: Wes McKinney
 Fix For: 0.9.0


If not, linkers must remember to define {{NDEBUG}} depending on whether Arrow 
was built in release mode or not





[ANNOUNCE] New Arrow PMC member: Phillip Cloud

2018-03-15 Thread Wes McKinney
On behalf of the Apache Arrow PMC, I'm happy to announce that Phillip
Cloud has been invited to join the PMC.

Congrats Phillip, and thank you for your hard work growing the Arrow
project and developer community!

- Wes


Re: unwrap_table in c++ library

2018-03-15 Thread Wes McKinney
hi Viktor,

> might be more frequently checked than the slack channel as i initially posted 
> my problem there

Yes -- the mailing list or JIRA is the place you want to go for help.
Slack is helpful for real-time interactions but a lot of us (like me)
do not spend any time there. Asynchronous, publicly viewable /
searchable discussions are better for a worldwide project like this.

I am not sure about the linking problem -- this function is being used
successfully in the turbodbc project:

https://github.com/blue-yonder/turbodbc/blob/c6dd8e2a3d575054b4ec0fe6cbab9a60019ebb0a/cpp/turbodbc_arrow/Library/src/set_arrow_parameters.cpp

It would be nice to fix the warning re: C linkage. That's a Cython
artifact, I think. Could you open a JIRA about this?

Thanks
Wes

On Wed, Mar 14, 2018 at 12:40 PM, Viktor Gal  wrote:
> Hi,
>
> i’m writing this email as i thought it might be more frequently checked than 
> the slack channel as i initially posted my problem there:
> i’m trying to use `arrow::py::unwrap_table` in a c++ library that would use 
> pyarrow tables…. unfortunately when i try to load the generated python module 
> (shared lib) in the python interpreter, i’m getting a `Symbol not found: 
> _unwrap_table` exception…. i thought the problem was that i havent linked the 
> library with `libarrow_python.dylib`, but even after linking with that lib 
> i’m still getting the same error… anybody has any ideas why there’s `Symbol 
> not found` error?
>
> note that when i compile my shared library i’m getting the following warning:
> warning: 'unwrap_table' has C-linkage specified, but returns user-defined 
> type 'arrow::Status' which is incompatible with C [-Wreturn-type-c-linkage]
> ARROW_EXPORT Status unwrap_table(PyObject* table, std::shared_ptr<Table>* out);
>
> the main idea is that i would pass the pyarrow table to my lightweight python 
> library that is basically a wrapper around a c++ library. In that wrapper i 
> would unwrap the pyarrow Table to a simple std::shared_ptr<arrow::Table>
> which i would pass to the c++ library, as it can handle standard arrow data 
> structures.
>
> cheers,
> viktor
>


Re: [ANNOUNCE] New Arrow PMC member: Phillip Cloud

2018-03-15 Thread Li Jin
Congrats!

On Thu, Mar 15, 2018 at 11:32 AM, Wes McKinney  wrote:

> On behalf of the Apache Arrow PMC, I'm happy to announce that Phillip
> Cloud has been invited to join the PMC.
>
> Congrats Phillip, and thank you for your hard work growing the Arrow
> project and developer community!
>
> - Wes
>


Re: [ANNOUNCE] New Arrow PMC member: Phillip Cloud

2018-03-15 Thread Phillip Cloud
Thanks Wes and Li,

Very glad to be part of the project, and look forward to continuing to make
Arrow even better. Exciting times!

On Thu, Mar 15, 2018 at 1:31 PM Li Jin  wrote:

> Congrats!
>
> On Thu, Mar 15, 2018 at 11:32 AM, Wes McKinney 
> wrote:
>
> > On behalf of the Apache Arrow PMC, I'm happy to announce that Phillip
> > Cloud has been invited to join the PMC.
> >
> > Congrats Phillip, and thank you for your hard work growing the Arrow
> > project and developer community!
> >
> > - Wes
> >
>


[jira] [Created] (ARROW-2317) [Python] fix C linkage warning

2018-03-15 Thread Viktor Gal (JIRA)
Viktor Gal created ARROW-2317:
-

 Summary: [Python] fix C linkage warning
 Key: ARROW-2317
 URL: https://issues.apache.org/jira/browse/ARROW-2317
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Viktor Gal


When using the pyarrow interface from a C++ library, one gets the following 
compiler warning:
{quote}{{warning: 'unwrap_table' has C-linkage specified, but returns 
user-defined type 'arrow::Status' which is incompatible with C 
[-Wreturn-type-c-linkage]}}
{{ARROW_EXPORT Status unwrap_table(PyObject* table, std::shared_ptr<Table>* out);}}
{quote}
This is due to a Cython artifact.

 





Re: [ANNOUNCE] New Arrow PMC member: Phillip Cloud

2018-03-15 Thread Philipp Moritz
Congrats Phillip!

On Thu, Mar 15, 2018 at 10:43 AM, Phillip Cloud  wrote:

> Thanks Wes and Li,
>
> Very glad to be part of the project, and look forward to continuing to make
> Arrow even better. Exciting times!
>
> On Thu, Mar 15, 2018 at 1:31 PM Li Jin  wrote:
>
> > Congrats!
> >
> > On Thu, Mar 15, 2018 at 11:32 AM, Wes McKinney 
> > wrote:
> >
> > > On behalf of the Apache Arrow PMC, I'm happy to announce that Phillip
> > > Cloud has been invited to join the PMC.
> > >
> > > Congrats Phillip, and thank you for your hard work growing the Arrow
> > > project and developer community!
> > >
> > > - Wes
> > >
> >
>


[jira] [Created] (ARROW-2318) [C++] TestPlasmaStore.MultipleClientTest is flaky (hangs) in release builds

2018-03-15 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2318:
---

 Summary: [C++] TestPlasmaStore.MultipleClientTest is flaky (hangs) 
in release builds
 Key: ARROW-2318
 URL: https://issues.apache.org/jira/browse/ARROW-2318
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Wes McKinney
 Fix For: 0.9.0


Here's my setup:

* Ubuntu 16.04
* clang 5.0 (gcc 5.4.0 base toolchain)

I can reliably get {{release/client_tests}} to hang. I hit this while looking 
at a possible 0.9.0 release candidate





Re: Working towards getting 0.9.0 release candidate up next week

2018-03-15 Thread Wes McKinney
Unfortunately, we aren't quite out of the woods yet.

https://issues.apache.org/jira/browse/ARROW-2318

Robert or Philipp, could you please advise on this issue?

I think if we're learning anything from this experience, we need to
work on our continuous integration and delivery. We shouldn't be
finding out about all these problems at the last minute -- the master
branch should be kept in a releasable state every day (i.e. the
release verification script passes on a snapshot of master)

On Thu, Mar 15, 2018 at 2:06 AM, Kouhei Sutou  wrote:
> https://github.com/apache/arrow/pull/1752 will solve this.
>
> --
> kou
>
> In <20180315.135146.1891617944364686374@clear-code.com>
>   "Re: Working towards getting 0.9.0 release candidate up next week" on Thu, 
> 15 Mar 2018 13:51:46 +0900 (JST),
>   Kouhei Sutou  wrote:
>
>> Sure. Please wait for a while.
>>
>> --
>> kou
>>
>> In 
>>   "Re: Working towards getting 0.9.0 release candidate up next week" on Thu, 
>> 15 Mar 2018 00:49:02 -0400,
>>   Wes McKinney  wrote:
>>
>>> I ran into another problem:
>>>
>>> https://issues.apache.org/jira/browse/ARROW-2313
>>>
>>> @Kou, could you have a look at this? Thanks
>>>
>>> On Wed, Mar 14, 2018 at 10:36 PM, Wes McKinney  wrote:
>>>> Thanks Paul! This seems to have done the trick. I will get the RC out
>>>> later this evening
>>>>
>>>> On Wed, Mar 14, 2018 at 9:50 PM, Paul Taylor wrote:
>>>>> This should fix it: https://github.com/apache/arrow/pull/1751
>>>>>
>>>>>> On Mar 14, 2018, at 6:42 PM, Wes McKinney wrote:
>>>>>>
>>>>>> Last item https://issues.apache.org/jira/browse/ARROW-2312
>>>>>>
>>>>>> I can start the release vote as soon as we have the release
>>>>>> verification script working again
>>>>>>
>>>>>> On Wed, Mar 14, 2018 at 2:28 PM, Wes McKinney wrote:
>>>>>>> OK, patch is up for ARROW-2307:
>>>>>>> https://github.com/apache/arrow/pull/1747. Once that is reviewed and
>>>>>>> merged I will start the release vote
>>>>>>>
>>>>>>> On Wed, Mar 14, 2018 at 10:18 AM, Wes McKinney wrote:
>>>>>>>> I'm going to have a look at ARROW-2307 to see if it's an easy fix. If
>>>>>>>> not, I will go ahead with the RC
>>>>>>>>
>>>>>>>> On Wed, Mar 14, 2018 at 7:36 AM, Uwe L. Korn wrote:
>>>>>>>>> The mentioned bugfixes were merged and I also tested the
>>>>>>>>> Arrow<->Parquet-cpp as well as the Arrow<->Dask(parquet) integration.
>>>>>>>>> Both seem to work fine. From my side it looks like we're ready to
>>>>>>>>> make an RC.
>>>>>>>>>
>>>>>>>>> Uwe
>>>>>>>>>
>>>>>>>>> On Wed, Mar 14, 2018, at 5:07 AM, Wes McKinney wrote:
>>>>>>>>>> I fixed these bugs
>>>>>>>>>>
>>>>>>>>>> https://github.com/apache/arrow/pull/1742
>>>>>>>>>> https://github.com/apache/arrow/pull/1743
>>>>>>>>>>
>>>>>>>>>> As soon as these patches go in, we can cut the RC0. I can do this as
>>>>>>>>>> soon as tomorrow (Wednesday) morning
>>>>>>>>>>
>>>>>>>>>> On Tue, Mar 13, 2018 at 5:00 PM, Wes McKinney wrote:
>>>>>>>>>>> I found 2 bugs -- ARROW-2304 and ARROW-2306 -- while doing some final
>>>>>>>>>>> testing of master (stuff we haven't been testing in CI -- we _really_
>>>>>>>>>>> need to set up nightly CI jobs for more time consuming tests, like
>>>>>>>>>>> HDFS, that we want to test periodically but perhaps not on every
>>>>>>>>>>> commit). I'm going to see if these are easy to fix
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Mar 13, 2018 at 11:21 AM, Wes McKinney wrote:
>>>>>>>>>>>> I fixed ARROW-2227 in https://github.com/apache/arrow/pull/1740, so if
>>>>>>>>>>>> someone could review that would be great.
>>>>>>>>>>>>
>>>>>>>>>>>> I am going to make sure the C++/Python HDFS tests run locally, then
>>>>>>>>>>>> after the patch above is merged we should be good to make the RC.
>>>>>>>>>>>>
>>>>>>>>>>>> Note to other onlookers -- please feel free to keep writing new
>>>>>>>>>>>> patches not mentioned here. They may just have to go in to the next
>>>>>>>>>>>> release
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks
>>>>>>>>>>>> Wes
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Mar 13, 2018 at 12:56 AM, Wes McKinney wrote:
>>>>>>>>>>>>> Things are looking pretty good. I'm waiting on a build for ARROW-1643
>>>>>>>>>>>>> to go in, and there's a last blocker bug (ARROW-2227) that we ought to
>>>>>>>>>>>>> try to fix tomorrow before cutting the RC. I can start the vote after
>>>>>>>>>>>>> that if nothing else comes up
>>>>>>>>>>>>>
>>>>>>>>>>>>> I created ARROW-2300 in the course of trying to test ARROW-1643. I
>>>>>>>>>>>>> have an alternative way to run the HDFS tests (since we aren't running
>>>>>>>>>>>>> them in Travis CI) so most likely I will try out the HDFS tests and
>>>>>>>>>>>>> then move ARROW-2300 to the next release milestone.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>> Wes
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Mar 12, 2018 at 12:03 AM, Wes McKinney 

Re: Working towards getting 0.9.0 release candidate up next week

2018-03-15 Thread Philipp Moritz
Thanks, I'm trying to reproduce it now!


Re: C++ optimize stream output

2018-03-15 Thread Wes McKinney
hi Rares,

sorry for the delay in writing back. To your questions

   - Does this look good decent? Could the API be used in more efficient
   ways in order to achieve the same goal?

This seems perfectly reasonable with the API as it is now

   - On each node side, do steps #3 and #7 copy data?

Step 3 does not copy data, but step 7 does. You could do better by
writing directly to the network protocol instead of using
io::BufferOutputStream (which accumulates data in an in-memory
buffer). It would be useful to have a buffered writer to limit the
number of writes to the socket (or whatever protocol is being used).
Feel free to open a JIRA about this

   - On the coordinator side, do steps #3.2, and #3.3 copy data?

Per ARROW-2189 -- do you get these problems if you build from source?
I'm trying to understand if this is a packaging issue or a code issue

Step 3.2 does not copy data. Step 3.3 does write data to the FileOutputStream.

   - On the coordinator side, do I really need to read and write a record
   batch? Could I copy the buffer directly somehow?

No, you don't need to necessarily. The idea of the Message* classes in
arrow::ipc is to facilitate transporting messages while being agnostic
to their contents. This would be a useful test case to flesh out these
APIs. Could you please open some JIRAs about this? There are already
some Message-related JIRAs open so take a look at what is there
already.

Thanks
Wes


On Mon, Feb 26, 2018 at 10:22 AM, Rares Vernica  wrote:
> Hello,
>
> I am using the C++ API to serialize and centralize data over the network. I
> am wondering if I am using the API in an efficient way.
>
> I have multiple nodes and a coordinator communicating over the network. I
> do not have fine control over the network communication. Individual nodes
> write one chunk of data to the network. The coordinator will receive all
> the chunks and can loop over them.
>
> On each node I do the following (the code is here
> 
> ):
>
>1. Append data to Builders
>2. Finish Builders and get Arrays
>3. Create Record Batch from Arrays
>4. Create Pool Buffer
>5. Create Buffer Output Stream using Pool Buffer
>6. Open Record Batch Stream Writer using Buffer Output Stream
>7. Write Record Batch to writer
>8. Write Buffer data to network
>
> On the coordinator I do the following (the code is here
> 
> ):
>
>1. Open File Output Stream
>2. Open Record Batch Stream Writer using File Output Stream
>3. For each chunk retrieved from the network
>   1. Create Buffer Reader using data retrieved from the network (the
>   code is here
>   
> 
>   )
>   2. Create Record Batch Stream Reader using Buffer Reader and read
>   Record Batch (I plan to use ReadRecordBatch, but I'm having
> issues like in
>   ARROW-2189)
>   3. Write Record Batch
>
> A few questions:
>
>- Does this look good decent? Could the API be used in more efficient
>ways in order to achieve the same goal?
>- On each node side, do steps #3 and #7 copy data?
>- On the coordinator side, do steps #3.2, and #3.3 copy data?
>- On the coordinator side, do I really need to read and write a record
>batch? Could I copy the buffer directly somehow?
>
> Thank you so much!
> Rares
>
>
>
> (Could the same Pool Buffer be reused across calls?)


[jira] [Created] (ARROW-2319) [C++] Add buffered output class implementing OutputStream interface

2018-03-15 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2319:
---

 Summary: [C++] Add buffered output class implementing OutputStream 
interface
 Key: ARROW-2319
 URL: https://issues.apache.org/jira/browse/ARROW-2319
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Wes McKinney
 Fix For: 0.10.0


The purpose of this is to throttle smaller writes to the actual underlying 
{{OutputStream}} interface, which might be a file or network protocol. 
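
The idea can be sketched in a few lines of Python. This is only an illustration of the buffering behaviour, not a proposed design for the C++ class; BufferedOutput, CountingSink and the buffer_size parameter are hypothetical names. Small writes accumulate in memory and are forwarded to the underlying sink only once the buffer reaches the threshold, or on an explicit flush.

```python
import io


class BufferedOutput:
    """Hypothetical wrapper: collects small writes and forwards them to the
    underlying sink once at least `buffer_size` bytes are pending."""

    def __init__(self, sink, buffer_size=4096):
        self._sink = sink
        self._buffer_size = buffer_size
        self._pending = bytearray()

    def write(self, data):
        self._pending += data
        if len(self._pending) >= self._buffer_size:
            self.flush()

    def flush(self):
        # Forward whatever is pending in a single write to the sink.
        if self._pending:
            self._sink.write(bytes(self._pending))
            self._pending.clear()


class CountingSink(io.BytesIO):
    """In-memory stand-in for a file or socket that counts write() calls."""

    def __init__(self):
        super().__init__()
        self.write_calls = 0

    def write(self, data):
        self.write_calls += 1
        return super().write(data)


sink = CountingSink()
writer = BufferedOutput(sink, buffer_size=16)
for _ in range(10):
    writer.write(b"abcd")  # 40 bytes arriving in 4-byte pieces
writer.flush()

print(sink.write_calls)      # 3
print(len(sink.getvalue()))  # 40
```

With a 16-byte threshold, the ten 4-byte writes reach the sink as three larger writes, which is the effect wanted for a socket or file.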





Re: C++ optimize stream output

2018-03-15 Thread Wes McKinney
I opened https://issues.apache.org/jira/browse/ARROW-2319 per the
buffered output point



[jira] [Created] (ARROW-2320) [C++] Vendored Boost build does not build regex library

2018-03-15 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2320:
---

 Summary: [C++] Vendored Boost build does not build regex library
 Key: ARROW-2320
 URL: https://issues.apache.org/jira/browse/ARROW-2320
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Wes McKinney
 Fix For: 0.9.0


This can be hit via {{-DARROW_BOOST_VENDORED=ON}}


