[jira] [Created] (ARROW-3935) [Packaging/Docker] Mount ccache directory in docker-compose setup

2018-12-04 Thread Krisztian Szucs (JIRA)
Krisztian Szucs created ARROW-3935:
--

 Summary: [Packaging/Docker] Mount ccache directory in 
docker-compose setup
 Key: ARROW-3935
 URL: https://issues.apache.org/jira/browse/ARROW-3935
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Packaging
Reporter: Krisztian Szucs


Hopefully this will speed up compilation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-3936) Add _O_NOINHERIT to the file open flags on Windows

2018-12-04 Thread Philip Felton (JIRA)
Philip Felton created ARROW-3936:


 Summary: Add _O_NOINHERIT to the file open flags on Windows
 Key: ARROW-3936
 URL: https://issues.apache.org/jira/browse/ARROW-3936
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Philip Felton


Unlike Linux, Windows doesn't let you delete a file that another process 
currently has open. So if you create a child process while a Parquet file is 
open, with the current code the file handle is inherited by the child process, 
and the parent process can't delete the file after closing it until the child 
process has terminated.

By default, Win32 file handles are not inheritable (likely because of the 
aforementioned problems). The exception is _wsopen_s, which creates 
inheritable descriptors in order to maintain POSIX compatibility.

This is a serious problem for us.

We would argue that specifying _O_NOINHERIT by default in the _MSC_VER path is 
a sensible approach and would likely be the correct behaviour as it matches the 
main Win32 API.

However, it could be that some developers rely on the current inheritable 
behaviour, in which case the Arrow public API should take a boolean argument 
controlling whether the created file descriptor is inheritable. But this would 
break API backward compatibility (unless a new overloaded method is introduced).

Is forking and inheriting Arrow's internal file descriptors something that 
Arrow actually means to support?

What do we think of the proposed fix?
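
As a point of comparison only (this is Python, not Arrow's C++ code): Python has created file descriptors as non-inheritable by default since 3.4 (PEP 446), with an explicit opt-in for callers that need inheritance. The sketch below shows that shape. The helper `open_for_read` and its `inheritable` parameter are hypothetical names that mirror the boolean argument floated above; they are not an Arrow API.

```python
import os
import tempfile


def open_for_read(path, inheritable=False):
    """Open path read-only and return a file descriptor.

    Hypothetical helper: the `inheritable` flag mirrors the boolean
    argument discussed above; it is not an Arrow API.
    """
    fd = os.open(path, os.O_RDONLY)  # non-inheritable by default (PEP 446)
    if inheritable:
        os.set_inheritable(fd, True)  # explicit opt-in for callers that need it
    return fd


def demo():
    """Return the inheritable flag for the default and the opt-in case."""
    # Create a throwaway file so the example is self-contained.
    tmp = tempfile.NamedTemporaryFile(delete=False)
    tmp.write(b"data")
    tmp.close()
    try:
        fd = open_for_read(tmp.name)
        default_flag = os.get_inheritable(fd)
        os.close(fd)
        fd = open_for_read(tmp.name, inheritable=True)
        opt_in_flag = os.get_inheritable(fd)
        os.close(fd)
    finally:
        os.remove(tmp.name)
    return default_flag, opt_in_flag
```

Defaulting to non-inheritable and making inheritance an opt-in keeps existing callers safe, at the cost of changing behaviour for anyone who relied on the old default.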





Re: [Go] High memory usage on CSV read into table

2018-12-04 Thread Daniel Harper
Sorry, I've been away at re:Invent.

I just tried out what's currently on master (with the chunked change, which
looks like it has been merged). I'll do the breakdown of the different parts
later, but as a high-level look, these are the numbers from running the same
script as described above:

https://docs.google.com/spreadsheets/d/1SE4S-wcKQ5cwlHoN7rQm7XOZLjI0HSyMje6q-zLvUHM/edit?usp=sharing

It looks to me like the change has definitely helped, with memory usage
dropping to around 300 MB, although the usage doesn't really change that
much once the chunk size is > 1000.




Daniel Harper
http://djhworld.github.io


On Fri, 23 Nov 2018 at 10:58, Sebastien Binet wrote:

> On Mon, Nov 19, 2018 at 11:29 PM Wes McKinney  wrote:
>
> > That seems buggy then. There is only 4.125 bytes of overhead per
> > string value on average (a 32-bit offset, plus a valid bit)
> > On Mon, Nov 19, 2018 at 5:02 PM Daniel Harper 
> > wrote:
> > >
> > > Uncompressed
> > >
> > > $ ls -la concurrent_streams.csv
> > > -rw-r--r-- 1 danielharper 112M Nov 16 19:21 concurrent_streams.csv
> > >
> > > $ wc -l concurrent_streams.csv
> > >  1007481 concurrent_streams.csv
> > >
> > >
> > > Daniel Harper
> > > http://djhworld.github.io
> > >
> > >
> > > On Mon, 19 Nov 2018 at 21:55, Wes McKinney 
> wrote:
> > >
> > > > I'm curious how the file is only 100MB if it's producing ~6GB of
> > > > strings in memory. Is it compressed?
> > > > On Mon, Nov 19, 2018 at 4:48 PM Daniel Harper 
> > > > wrote:
> > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > I've tried the new code and that seems to have shaved about 1GB of
> > > > > > memory off, so the heap is about 8.84GB now, here is the updated
> > > > > > pprof output
> > > > > > https://i.imgur.com/itOHqBf.png
> > > > > >
> > > > > > It looks like the majority of allocations are in the
> > > > > > memory.GoAllocator
> > > > > >
> > > > > > (pprof) top
> > > > > > Showing nodes accounting for 8.84GB, 100% of 8.84GB total
> > > > > > Showing top 10 nodes out of 41
> > > > > >       flat  flat%   sum%        cum   cum%
> > > > > >     4.24GB 47.91% 47.91%     4.24GB 47.91%  github.com/apache/arrow/go/arrow/memory.(*GoAllocator).Allocate
> > > > > >     2.12GB 23.97% 71.88%     2.12GB 23.97%  github.com/apache/arrow/go/arrow/memory.NewResizableBuffer (inline)
> > > > > >     1.07GB 12.07% 83.95%     1.07GB 12.07%  github.com/apache/arrow/go/arrow/array.NewData
> > > > > >     0.83GB  9.38% 93.33%     0.83GB  9.38%  github.com/apache/arrow/go/arrow/array.NewStringData
> > > > > >     0.33GB  3.69% 97.02%     1.31GB 14.79%  github.com/apache/arrow/go/arrow/array.(*BinaryBuilder).newData
> > > > > >     0.18GB  2.04% 99.06%     0.18GB  2.04%  github.com/apache/arrow/go/arrow/array.NewChunked
> > > > > >     0.07GB  0.78% 99.85%     0.07GB  0.78%  github.com/apache/arrow/go/arrow/array.NewInt64Data
> > > > > >     0.01GB  0.15%   100%     0.21GB  2.37%  github.com/apache/arrow/go/arrow/array.(*Int64Builder).newData
> > > > > >          0     0%   100%        6GB 67.91%  github.com/apache/arrow/go/arrow/array.(*BinaryBuilder).Append
> > > > > >          0     0%   100%     4.03GB 45.54%  github.com/apache/arrow/go/arrow/array.(*BinaryBuilder).Reserve
> > > > > >
> > > > > >
> > > > > > I'm a bit busy at the moment but I'll probably repeat the same test
> > > > > > on the other Arrow implementations (e.g. Java) to see if they
> > > > > > allocate a similar amount.
> >
>
> I've implemented chunking over there:
>
> - https://github.com/apache/arrow/pull/3019
>
> could you try with a couple of chunking values?
> e.g.:
> - csv.WithChunk(-1): reads the whole file into memory, creates one big
> record
> - csv.WithChunk(nrows/10): creates 10 records
>
> also, it would be great to try to disentangle the memory usage of the "CSV
> reading part" from the "Table creation" one:
> - have some perf numbers w/o storing all these Records into a []Record
> slice,
> - have some perf numbers w/ only storing these Records into a []Record
> slice,
> - have some perf numbers w/ storing the records into the slice + creating
> the Table.
>
> hth,
> -s
>
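
A quick back-of-the-envelope check of the figures quoted above, offered as a sketch rather than a measurement: with a 32-bit offset plus one validity bit per value, the fixed overhead is 4.125 bytes per string. The thread does not say how many string columns the file has, so `string_columns` below is an assumed parameter.

```python
OFFSET_BITS = 32   # one 32-bit offset per string value
VALIDITY_BITS = 1  # one validity bit per value

# Line count reported by `wc -l concurrent_streams.csv` in the thread
# (it may include a header line).
ROWS = 1007481


def overhead_bytes_per_value():
    """Average fixed overhead per string value, in bytes."""
    return (OFFSET_BITS + VALIDITY_BITS) / 8


def estimated_overhead_mib(rows, string_columns):
    """Offset + validity overhead in MiB for rows x string_columns values.

    `string_columns` is an assumption; the thread does not state it.
    """
    total_bytes = rows * string_columns * overhead_bytes_per_value()
    return total_bytes / (1024 * 1024)
```

For the reported line count this comes to roughly 4 MiB per string column, so even a file with many string columns would add tens of MiB on top of the 112M of raw data, which is far from a multi-gigabyte heap.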


Re: [Go] High memory usage on CSV read into table

2018-12-04 Thread Sebastien Binet
On Tue, Dec 4, 2018 at 10:23 PM Daniel Harper wrote:

> Sorry I've been away at reinvent.
>
> Just tried out what's currently on master (with the chunked change that
> looks like it has merged). I'll do the break down of the different parts
> later but as a high level look at just running the same script as described
> above these are the numbers
>
>
> https://docs.google.com/spreadsheets/d/1SE4S-wcKQ5cwlHoN7rQm7XOZLjI0HSyMje6q-zLvUHM/edit?usp=sharing
>


>
> Looks to me like the change has definitely helped, with memory usage
> dropping to around 300mb, although the usage doesn't really change that
> much once chunk size is > 1000
>

good. you might want to try with a chunk size of -1 (this loads the whole
CSV file into memory in one fell swoop.)
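
To illustrate what the chunk size controls, here is a standard-library Python sketch. It is not the Go csv package's implementation: it only groups rows into batches of `chunk` rows, with a chunk of -1 putting every row into a single batch. The names `read_in_chunks` and `batch_sizes` are hypothetical.

```python
import csv
import io


def read_in_chunks(source, chunk):
    """Yield lists of CSV rows, at most `chunk` rows each.

    chunk == -1 mimics the "whole file in one record" mode described above.
    """
    if chunk == 0 or chunk < -1:
        raise ValueError("chunk must be -1 or a positive integer")
    batch = []
    for row in csv.reader(source):
        batch.append(row)
        if chunk != -1 and len(batch) == chunk:
            yield batch
            batch = []
    if batch:  # final partial batch, or everything when chunk == -1
        yield batch


def batch_sizes(text, chunk):
    """Return the number of rows in each batch for the given CSV text."""
    return [len(b) for b in read_in_chunks(io.StringIO(text), chunk)]
```

With five input rows, `batch_sizes(text, 2)` gives `[2, 2, 1]` and `batch_sizes(text, -1)` gives `[5]`. If the caller drops each batch after processing it, only one batch is held at a time, which is the effect that keeps peak memory low for small chunk sizes.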

also, there's this PR which should probably also reduce the memory pressure:
- https://github.com/apache/arrow/pull/3073

cheers,
-s


>
>
>
>
> Daniel Harper
> http://djhworld.github.io
>
>
> On Fri, 23 Nov 2018 at 10:58, Sebastien Binet  wrote:
>
> > On Mon, Nov 19, 2018 at 11:29 PM Wes McKinney 
> wrote:
> >
> > > That seems buggy then. There is only 4.125 bytes of overhead per
> > > string value on average (a 32-bit offset, plus a valid bit)
> > > On Mon, Nov 19, 2018 at 5:02 PM Daniel Harper 
> > > wrote:
> > > >
> > > > Uncompressed
> > > >
> > > > $ ls -la concurrent_streams.csv
> > > > -rw-r--r-- 1 danielharper 112M Nov 16 19:21 concurrent_streams.csv
> > > >
> > > > $ wc -l concurrent_streams.csv
> > > >  1007481 concurrent_streams.csv
> > > >
> > > >
> > > > Daniel Harper
> > > > http://djhworld.github.io
> > > >
> > > >
> > > > On Mon, 19 Nov 2018 at 21:55, Wes McKinney 
> > wrote:
> > > >
> > > > > I'm curious how the file is only 100MB if it's producing ~6GB of
> > > > > strings in memory. Is it compressed?
> > > > > On Mon, Nov 19, 2018 at 4:48 PM Daniel Harper <
> djharpe...@gmail.com>
> > > > > wrote:
> > > > > >
> > > > > > > Thanks,
> > > > > > >
> > > > > > > I've tried the new code and that seems to have shaved about 1GB
> > > > > > > of memory off, so the heap is about 8.84GB now, here is the
> > > > > > > updated pprof output
> > > > > > > https://i.imgur.com/itOHqBf.png
> > > > > > >
> > > > > > > It looks like the majority of allocations are in the
> > > > > > > memory.GoAllocator
> > > > > > >
> > > > > > > (pprof) top
> > > > > > > Showing nodes accounting for 8.84GB, 100% of 8.84GB total
> > > > > > > Showing top 10 nodes out of 41
> > > > > > >       flat  flat%   sum%        cum   cum%
> > > > > > >     4.24GB 47.91% 47.91%     4.24GB 47.91%  github.com/apache/arrow/go/arrow/memory.(*GoAllocator).Allocate
> > > > > > >     2.12GB 23.97% 71.88%     2.12GB 23.97%  github.com/apache/arrow/go/arrow/memory.NewResizableBuffer (inline)
> > > > > > >     1.07GB 12.07% 83.95%     1.07GB 12.07%  github.com/apache/arrow/go/arrow/array.NewData
> > > > > > >     0.83GB  9.38% 93.33%     0.83GB  9.38%  github.com/apache/arrow/go/arrow/array.NewStringData
> > > > > > >     0.33GB  3.69% 97.02%     1.31GB 14.79%  github.com/apache/arrow/go/arrow/array.(*BinaryBuilder).newData
> > > > > > >     0.18GB  2.04% 99.06%     0.18GB  2.04%  github.com/apache/arrow/go/arrow/array.NewChunked
> > > > > > >     0.07GB  0.78% 99.85%     0.07GB  0.78%  github.com/apache/arrow/go/arrow/array.NewInt64Data
> > > > > > >     0.01GB  0.15%   100%     0.21GB  2.37%  github.com/apache/arrow/go/arrow/array.(*Int64Builder).newData
> > > > > > >          0     0%   100%        6GB 67.91%  github.com/apache/arrow/go/arrow/array.(*BinaryBuilder).Append
> > > > > > >          0     0%   100%     4.03GB 45.54%  github.com/apache/arrow/go/arrow/array.(*BinaryBuilder).Reserve
> > > > > > >
> > > > > > >
> > > > > > > I'm a bit busy at the moment but I'll probably repeat the same
> > > > > > > test on the other Arrow implementations (e.g. Java) to see if
> > > > > > > they allocate a similar amount.
> > >
> >
> > I've implemented chunking over there:
> >
> > - https://github.com/apache/arrow/pull/3019
> >
> > could you try with a couple of chunking values?
> > e.g.:
> > - csv.WithChunk(-1): reads the whole file into memory, creates one big
> > record
> > - csv.WithChunk(nrows/10): creates 10 records
> >
> > also, it would be great to try to disentangle the memory usage of the
> "CSV
> > reading part" from the "Table creation" one:
> > - have some perf numbers w/o storing all these Records into a []Record
> > slice,
> > - have some perf numbers w/ only storing these Records into a []Record
> > slice,
> > - have some perf numbers w/ storing the records into the slice + creating
> > the Table.
> >
> > hth,
> > -s
> >
>


Re: Timeline for Arrow 0.12.0 release

2018-12-04 Thread Wes McKinney
Andy, Paddy, or other Rust developers -- could you review the 6 issues
in TODO in the 0.12 backlog and either assign them or move them to the
next release if they aren't going to be completed this week or next?


On Fri, Nov 30, 2018 at 4:34 PM Wes McKinney wrote:
>
> hi folks,
>
> Tomorrow is December 1. The last major Arrow release (0.11.0) took
> place on October 8. Given how much work has happened in the project in
> the last ~2 months, I think it would be great to complete the next
> major release before the end-of-year holidays set in.
>
> I've been curating the JIRA backlog the last couple of weeks, and have
> just created a 0.12.0 release wiki page to help us stay organized
>
> https://cwiki.apache.org/confluence/display/ARROW/Arrow+0.12.0+Release
>
> Given that there are only 3 full working weeks between now and
> Christmas, I think we should be in position to cut a release by the
> end of the week of December 10, i.e. by Friday December 14. Not all of
> the TODO issues have to be completed to make the release, but it would
> be good to push to complete as much as possible. Please help by
> reviewing the backlog, and if possible, assigning issues to yourself
> that you'd like to pursue in the next 2 weeks.
>
> Let me know if this sounds reasonable, or any concerns.
>
> Thanks
> Wes


Re: Timeline for Arrow 0.12.0 release

2018-12-04 Thread Andy Grove
I'd love to tackle the three related issues for supporting simple
math/comparison operations on primitive arrays and casting primitive arrays,
but since the change to use the Rust specialization feature I'm a bit stuck
and need some assistance applying the math operations to the numeric types
only, and not to the boolean primitives. I have added a comment to
https://github.com/apache/arrow/pull/3033. If I can get help solving this
for that PR then I should be able to handle the others. I'll also do some
research and try to figure this out myself.

Andy.






On Tue, Dec 4, 2018 at 7:03 PM Wes McKinney wrote:

> Andy, Paddy, or other Rust developers -- could you review the 6 issues
> in TODO in the 0.12 backlog and either assign them or move them to the
> next release if they aren't going to be completed this week or next?
>
>
> On Fri, Nov 30, 2018 at 4:34 PM Wes McKinney  wrote:
> >
> > hi folks,
> >
> > Tomorrow is December 1. The last major Arrow release (0.11.0) took
> > place on October 8. Given how much work has happened in the project in
> > the last ~2 months, I think it would be great to complete the next
> > major release before the end-of-year holidays set in.
> >
> > I've been curating the JIRA backlog the last couple of weeks, and have
> > just created a 0.12.0 release wiki page to help us stay organized
> >
> > https://cwiki.apache.org/confluence/display/ARROW/Arrow+0.12.0+Release
> >
> > Given that there are only 3 full working weeks between now and
> > Christmas, I think we should be in position to cut a release by the
> > end of the week of December 10, i.e. by Friday December 14. Not all of
> > the TODO issues have to be completed to make the release, but it would
> > be good to push to complete as much as possible. Please help by
> > reviewing the backlog, and if possible, assigning issues to yourself
> > that you'd like to pursue in the next 2 weeks.
> >
> > Let me know if this sounds reasonable, or any concerns.
> >
> > Thanks
> > Wes
>


Arrow JS 0.4.0 Release

2018-12-04 Thread Brian Hulette
Hi all,
It's been quite a while since our last major Arrow JS release (0.3.0 on
February 22!), and since then we've added several new features that will
make Arrow JS much easier to adopt. We've added convenience functions for
creating Arrow vectors and tables natively in JavaScript, an IPC writer,
and a row proxy interface that will make integrating with existing JS
libraries much simpler.

I think it's time we cut 0.4.0, so I spent some time closing out or
postponing the last few JIRAs in JS-0.4.0. I got it down to just one JIRA
which involves documenting the release process - hopefully we can close
that out as we go through it again.

Please let me know if you think it makes sense to cut JS-0.4.0 now, or if
you have any concerns.

Brian


[jira] [Created] (ARROW-3937) [Rust] Rust nightly build is failing

2018-12-04 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-3937:
---

 Summary: [Rust] Rust nightly build is failing
 Key: ARROW-3937
 URL: https://issues.apache.org/jira/browse/ARROW-3937
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust
Reporter: Wes McKinney
 Fix For: 0.12.0


See recent CI failures such as 
https://travis-ci.org/apache/arrow/jobs/463656608#L650





Rust nightly + formatting changes

2018-12-04 Thread Andy Grove
One of the challenges of using Rust nightly is that things can change
often. Master builds have been failing because CI is picking up the latest
nightly all the time, and the formatting rules change over time.

One option to avoid this is to have CI use a specific nightly version
rather than always using the latest, and then we can use PRs to update the
nightly version being used. For example we can have the CI script run the
following command:

rustup default nightly-2018-12-05

If there are no objections I will create a JIRA issue for this.

Andy.


Re: Rust nightly + formatting changes

2018-12-04 Thread Chao Sun
One question is whether we should restrict this to rustfmt only but still
use nightly for testing. Also, how often should we update the nightly
version? Does it require a JIRA every time?

Another option is to run stable rustfmt but use nightly for CI. Since we're
already running CI with stable, this should be an easy change.

On Tue, Dec 4, 2018 at 8:54 PM Andy Grove wrote:

> One of the challenges of using Rust nightly is that things can change
> often. Master builds have been failing because CI is picking up the latest
> nightly all the time, and the formatting rules change over time.
>
> One option to avoid this is to have CI use a specific nightly version
> rather than always using the latest, and then we can use PRs to update the
> nightly version being used. For example we can have the CI script run the
> following command:
>
> rustup default nightly-2018-12-05
>
> If there are no objections I will create a JIRA issue for this.
>
> Andy.
>


[jira] [Created] (ARROW-3938) [Packaging] Stop referring to java/pom.xml to get version information

2018-12-04 Thread Kouhei Sutou (JIRA)
Kouhei Sutou created ARROW-3938:
---

 Summary: [Packaging] Stop referring to java/pom.xml to get version 
information
 Key: ARROW-3938
 URL: https://issues.apache.org/jira/browse/ARROW-3938
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Packaging
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou
 Fix For: 0.12.0


https://github.com/apache/arrow/pull/3096#issuecomment-444345068

I want to stop the current version sharing style (referring to {{java/pom.xml}} 
from C++, Python, C, Ruby, ...).
It introduces complexity. For example, we generate {{version.rb}} dynamically 
to create a Ruby package: 
https://github.com/apache/arrow/blob/master/ruby/red-arrow/version.rb

I think that we can just replace all versions in {{cpp/CMakeLists.txt}}, 
{{python/setup.py}}, {{c_glib/configure.ac}}, {{ruby/*/lib/*/version.rb}}, 
{{rust/Cargo.toml}}, ... with {{sed}} in the release process.
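
A rough sketch of the kind of substitution proposed here, written in Python rather than sed. The sample file contents and version strings are made up for illustration and are not the actual Arrow build files; `bump_version` is a hypothetical helper.

```python
import re


def bump_version(text, old, new):
    """Replace occurrences of version `old` with `new` in `text`.

    The lookarounds keep the match from firing inside a longer version
    string, e.g. 10.11.0 when old is 0.11.0.
    """
    pattern = r"(?<![\w.])" + re.escape(old) + r"(?![\w.])"
    # A function replacement avoids backslash-escape surprises in `new`.
    return re.sub(pattern, lambda match: new, text)


# Made-up sample contents, keyed by the file names listed above.
SAMPLE_FILES = {
    "rust/Cargo.toml": 'version = "0.11.0-SNAPSHOT"\n',
    "python/setup.py": "version='0.11.0-SNAPSHOT'\n",
}


def bump_all(files, old, new):
    """Apply bump_version to every file body and return the new mapping."""
    return {name: bump_version(body, old, new) for name, body in files.items()}
```

Calling `bump_all(SAMPLE_FILES, "0.11.0-SNAPSHOT", "0.12.0")` rewrites each sample body in one pass, which is the same effect a per-file sed expression in the release script would have.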





[jira] [Created] (ARROW-3939) [Rust] Remove macro definition for ListArrayBuilder

2018-12-04 Thread Chao Sun (JIRA)
Chao Sun created ARROW-3939:
---

 Summary: [Rust] Remove macro definition for ListArrayBuilder
 Key: ARROW-3939
 URL: https://issues.apache.org/jira/browse/ARROW-3939
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust
Reporter: Chao Sun
Assignee: Chao Sun


Currently `ListArrayBuilder` is implemented using a macro, and only for a few 
value builder types. We should lift this restriction and allow creation of list 
builders with arbitrary value builder types.





Re: Rust nightly + formatting changes

2018-12-04 Thread Marco Neumann
rustfmt will hit 1.0 tomorrow (with the Rust 1.31 release) and will guarantee 
stable formatting across releases. As far as I understand, this "formatting 
style stability" will apply to all channels, so we could run rustfmt on the same 
channel(s) as our CI. We might need a JIRA ticket to change from 
"rustfmt preview" to normal rustfmt in our CI scripts. 

Reference: https://www.ncameron.org/blog/rustfmt-1-rc/

On December 5, 2018 6:55:40 AM GMT+01:00, Chao Sun wrote:
>One question is whether we should only restrict this on rustfmt but
>still
>use nightly for testing. Also, how often should we update the nightly
>version? does it require a JIRA every time?
>
>Another option is to run stable rustfmt but use nightly for CI. Since
>we're
>already running CI with stable, this should be an easy change.
>
>On Tue, Dec 4, 2018 at 8:54 PM Andy Grove 
>wrote:
>
>> One of the challenges of using Rust nightly is that things can change
>> often. Master builds have been failing because CI is picking up the
>latest
>> nightly all the time, and the formatting rules change over time.
>>
>> One option to avoid this is to have CI use a specific nightly version
>> rather than always using the latest, and then we can use PRs to
>update the
>> nightly version being used. For example we can have the CI script run
>the
>> following command:
>>
>> rustup default nightly-2018-12-05
>>
>> If there are no objections I will create a JIRA issue for this.
>>
>> Andy.
>>