[jira] [Created] (ARROW-2954) [Plasma] Store object_id only once in object table

2018-07-31 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-2954:
-

 Summary: [Plasma] Store object_id only once in object table
 Key: ARROW-2954
 URL: https://issues.apache.org/jira/browse/ARROW-2954
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Plasma (C++)
Reporter: Philipp Moritz
Assignee: Philipp Moritz
 Fix For: 0.10.0


This is the first part of ARROW-2953: eliminating the duplicated storage of the
object id in both the key and the value of the object hash table.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2953) [Plasma] Store memory usage

2018-07-31 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-2953:
-

 Summary: [Plasma] Store memory usage
 Key: ARROW-2953
 URL: https://issues.apache.org/jira/browse/ARROW-2953
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Philipp Moritz


While doing some memory profiling on the store, it became clear that the
per-object metadata currently takes up much more space than it should. In
particular, for each object:
 * The object id (20 bytes) is stored three times
 * The object checksum (8 bytes) is stored twice
 * data_size and metadata_size (each 8 bytes) are stored twice

We can therefore significantly reduce the metadata overhead with some 
refactoring.
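A minimal sketch of the intended layout (illustrative C++ only, not the actual Plasma code; all type and helper names here are made up): the entry stored as the hash-table value no longer carries its own copy of the object id, because the id is always available as the map key.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// Illustrative stand-in for Plasma's 20-byte UniqueID.
using ObjectID = std::string;

struct ObjectTableEntry {
  int64_t data_size;
  int64_t metadata_size;
  uint64_t checksum;
  // Note: no ObjectID member -- the id is recoverable from the map key,
  // so storing it here again would duplicate 20 bytes per object.
};

using ObjectTable = std::unordered_map<ObjectID, ObjectTableEntry>;

// The id is still available wherever entries are visited, via the key.
inline int64_t TotalSize(const ObjectTable& table, const ObjectID& id) {
  auto it = table.find(id);
  if (it == table.end()) return -1;
  return it->second.data_size + it->second.metadata_size;
}
```

Nothing is lost by dropping the duplicate field: any code iterating the table sees the id as `kv.first`.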





Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

2018-07-31 Thread Wes McKinney
hi,

On Tue, Jul 31, 2018 at 4:56 PM, Deepak Majeti  wrote:
> I think the circular dependency can be broken if we build a new library for
> the platform code. This will also make it easy for other projects such as
> ORC to use it.
> I also remember your proposal a while ago of having a separate project for
> the platform code.  That project can live in the arrow repo. However, one
> has to clone the entire apache arrow repo but can just build the platform
> code. This will be temporary until we can find a new home for it.
>
> The dependency will look like:
> libarrow(arrow core / bindings) <- libparquet (parquet core) <-
> libplatform(platform api)
>
> CI workflow will clone the arrow project twice, once for the platform
> library and once for the arrow-core/bindings library.

This seems like an interesting proposal; the best place to work toward
this goal (if it is even possible; the build system interactions and
ASF release management are the hard problems) is to have all of the
code in a single repository. ORC could already be using Arrow if it
wanted, but the ORC contributors aren't active in Arrow.

>
> There is no doubt that the collaborations between the Arrow and Parquet
> communities so far have been very successful.
> The reason to maintain this relationship moving forward is to continue to
> reap the mutual benefits.
> We should continue to take advantage of sharing code as well. However, I
> don't see any code sharing opportunities between arrow-core and the
> parquet-core. Both have different functions.

I think you mean the Arrow columnar format. The Arrow columnar format
is only one part of a project that has become quite large already
(https://www.slideshare.net/wesm/apache-arrow-crosslanguage-development-platform-for-inmemory-data-105427919).

>
> We are at a point where the parquet-cpp public API is pretty stable. We
> already passed that difficult stage. My take on arrow and parquet is to
> keep them nimble since we can.

I believe that parquet-core still has significant progress ahead of it. We
have done little work in asynchronous IO and concurrency which would
yield both improved read and write throughput. This aligns well with
other concurrency and async-IO work planned in the Arrow platform. I
believe that more development will happen on parquet-core once the
development process issues are resolved by having a single codebase,
single build system, and a single CI framework.

I have some gripes about design decisions made early in parquet-cpp,
like the use of C++ exceptions. So while "stability" is a reasonable
goal I think we should still be open to making significant changes in
the interest of long term progress.

Having now worked on these projects for more than two and a half years,
and being the most frequent contributor to both codebases, I'm sadly far
past the "breaking point" and not willing to continue contributing in
a significant way to parquet-cpp if the projects remained structured
as they are now. It's hampering progress and not serving the
community.

- Wes

>
>
>
>
> On Tue, Jul 31, 2018 at 3:17 PM Wes McKinney  wrote:
>
>> > The current Arrow adaptor code for parquet should live in the arrow
>> repo. That will remove a majority of the dependency issues. Joshua's work
>> would not have been blocked in parquet-cpp if that adapter was in the arrow
>> repo.  This will be similar to the ORC adaptor.
>>
>> This has been suggested before, but I don't see how it would alleviate
>> any issues because of the significant dependencies on other parts of
>> the Arrow codebase. What you are proposing is:
>>
>> - (Arrow) arrow platform
>> - (Parquet) parquet core
>> - (Arrow) arrow columnar-parquet adapter interface
>> - (Arrow) Python bindings
>>
>> To make this work, somehow Arrow core / libarrow would have to be
>> built before invoking the Parquet core part of the build system. You
>> would need to pass dependent targets across different CMake build
>> systems; I don't know if it's possible (I spent some time looking into
>> it earlier this year). This is what I meant by the lack of a "concrete
>> and actionable plan". The only thing that would really work would be
>> for the Parquet core to be "included" in the Arrow build system
>> somehow rather than using ExternalProject. Currently Parquet builds
>> Arrow using ExternalProject, and Parquet is unknown to the Arrow build
>> system because it's only depended upon by the Python bindings.
>>
>> And even if a solution could be devised, it would not wholly resolve
>> the CI workflow issues.
>>
>> You could make Parquet completely independent of the Arrow codebase,
>> but at that point there is little reason to maintain a relationship
>> between the projects or their communities. We have spent a great deal
>> of effort refactoring the two projects to enable as much code sharing
>> as there is now.
>>
>> - Wes
>>
>> On Tue, Jul 31, 2018 at 2:29 PM, Wes McKinney  wrote:
>> >> If you still strongly feel that the only way forward is to clone the
>> 

Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

2018-07-31 Thread Wes McKinney
> The current Arrow adaptor code for parquet should live in the arrow repo. 
> That will remove a majority of the dependency issues. Joshua's work would not 
> have been blocked in parquet-cpp if that adapter was in the arrow repo.  This 
> will be similar to the ORC adaptor.

This has been suggested before, but I don't see how it would alleviate
any issues because of the significant dependencies on other parts of
the Arrow codebase. What you are proposing is:

- (Arrow) arrow platform
- (Parquet) parquet core
- (Arrow) arrow columnar-parquet adapter interface
- (Arrow) Python bindings

To make this work, somehow Arrow core / libarrow would have to be
built before invoking the Parquet core part of the build system. You
would need to pass dependent targets across different CMake build
systems; I don't know if it's possible (I spent some time looking into
it earlier this year). This is what I meant by the lack of a "concrete
and actionable plan". The only thing that would really work would be
for the Parquet core to be "included" in the Arrow build system
somehow rather than using ExternalProject. Currently Parquet builds
Arrow using ExternalProject, and Parquet is unknown to the Arrow build
system because it's only depended upon by the Python bindings.
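For context, the ExternalProject arrangement described above looks roughly like the following (a simplified, hypothetical CMake sketch, not the actual parquet-cpp build code; the repository pin, paths, and target names are illustrative):

```cmake
# Sketch: building Arrow as an ExternalProject from a dependent build.
# Arrow is configured and built in its own opaque sub-build, so its real
# CMake targets are NOT visible here -- only the installed artifacts are.
include(ExternalProject)

ExternalProject_Add(arrow_ep
  GIT_REPOSITORY https://github.com/apache/arrow.git
  GIT_TAG master
  SOURCE_SUBDIR cpp
  CMAKE_ARGS -DCMAKE_INSTALL_PREFIX=<INSTALL_DIR>
  INSTALL_DIR ${CMAKE_BINARY_DIR}/arrow_install)

ExternalProject_Get_Property(arrow_ep install_dir)

# The dependent target can only link against the produced library file;
# it cannot consume Arrow's CMake targets across the sub-build boundary.
add_library(arrow_shared SHARED IMPORTED)
set_target_properties(arrow_shared PROPERTIES
  IMPORTED_LOCATION ${install_dir}/lib/libarrow.so)
add_dependencies(arrow_shared arrow_ep)
```

Because the sub-build is opaque, dependent targets cannot be passed between the two CMake systems, which is the awkwardness described above.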

And even if a solution could be devised, it would not wholly resolve
the CI workflow issues.

You could make Parquet completely independent of the Arrow codebase,
but at that point there is little reason to maintain a relationship
between the projects or their communities. We have spent a great deal
of effort refactoring the two projects to enable as much code sharing
as there is now.

- Wes

On Tue, Jul 31, 2018 at 2:29 PM, Wes McKinney  wrote:
>> If you still strongly feel that the only way forward is to clone the 
>> parquet-cpp repo and part ways, I will withdraw my concern. Having two 
>> parquet-cpp repos is in no way a better approach.
>
> Yes, indeed. In my view, the next best option after a monorepo is to
> fork. That would obviously be a bad outcome for the community.
>
> It doesn't look like I will be able to convince you that a monorepo is
> a good idea; what I would ask instead is that you be willing to give
> it a shot, and if it turns out in the way you're describing (which I
> don't think it will) then I suggest that we fork at that point.
>
> - Wes
>
> On Tue, Jul 31, 2018 at 2:14 PM, Deepak Majeti  
> wrote:
>> Wes,
>>
>> Unfortunately, I cannot show you any practical fact-based problems of a
>> non-existent Arrow-Parquet mono-repo.
>> Bringing in related Apache community experiences is more meaningful than
>> how mono-repos work at Google and other big organizations.
>> We solely depend on volunteers and cannot hire full-time developers.
>> You are very well aware of how difficult it has been to find more
>> contributors and maintainers for Arrow. parquet-cpp already has a low
>> contribution rate to its core components.
>>
>> We should aim to ensure that new volunteers who want to contribute
>> bug-fixes/features spend the least amount of time figuring out
>> the project repo. We can never come up with an automated build system that
>> caters to every possible environment.
>> My only concern is if the mono-repo will make it harder for new developers
>> to work on parquet-cpp core just due to the additional code, build and test
>> dependencies.
>> I am not saying that the Arrow community/committers will be less
>> co-operative.
>> I just don't think the mono-repo structure model will be sustainable in an
>> open source community unless there are long-term vested interests. We can't
>> predict that.
>>
>> The current circular dependency problem between Arrow and Parquet is a
>> major problem for the community, and resolving it is important.
>>
>> The current Arrow adaptor code for parquet should live in the arrow repo.
>> That will remove a majority of the dependency issues.
>> Joshua's work would not have been blocked in parquet-cpp if that adapter
>> was in the arrow repo.  This will be similar to the ORC adaptor.
>>
>> The platform API code is pretty stable at this point. Minor changes in the
>> future to this code should not be the main reason to combine the arrow
>> parquet repos.
>>
>> "*I question whether it's worth the community's time long term to wear
>> ourselves out defining custom "ports" / virtual interfaces in each library
>> to plug components together rather than utilizing common platform APIs.*"
>>
>> My answer to your question below would be "Yes". Modularity/separation is
>> very important in an open source community where priorities of contributors
>> are often short term.
>> Retention is low, and therefore the acquisition costs should be low as
>> well. This is the community-over-code approach, in my view. Minor code
>> duplication is not a deal breaker.
>> ORC, Parquet, Arrow, etc. are all different components in the big data
>> space serving their own functions.
>>
>> If you still strongly feel 

Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

2018-07-31 Thread Julian Hyde
A controlled fork doesn’t sound like a terrible option. Copy the code from 
parquet into arrow, and for a limited period of time it would be the primary. 
When that period is over, the code in parquet becomes the primary.

During the period in which arrow has the primary, the parquet release 
manager will have to synchronize parquet’s copy of the code (probably by 
patches) before making releases.

Julian


> On Jul 31, 2018, at 11:29 AM, Wes McKinney  wrote:
> 
>> If you still strongly feel that the only way forward is to clone the 
>> parquet-cpp repo and part ways, I will withdraw my concern. Having two 
>> parquet-cpp repos is in no way a better approach.
> 
> Yes, indeed. In my view, the next best option after a monorepo is to
> fork. That would obviously be a bad outcome for the community.
> 
> It doesn't look like I will be able to convince you that a monorepo is
> a good idea; what I would ask instead is that you be willing to give
> it a shot, and if it turns out in the way you're describing (which I
> don't think it will) then I suggest that we fork at that point.
> 
> - Wes
> 
> On Tue, Jul 31, 2018 at 2:14 PM, Deepak Majeti  
> wrote:
>> Wes,
>> 
>> Unfortunately, I cannot show you any practical fact-based problems of a
>> non-existent Arrow-Parquet mono-repo.
>> Bringing in related Apache community experiences is more meaningful than
>> how mono-repos work at Google and other big organizations.
>> We solely depend on volunteers and cannot hire full-time developers.
>> You are very well aware of how difficult it has been to find more
>> contributors and maintainers for Arrow. parquet-cpp already has a low
>> contribution rate to its core components.
>> 
>> We should aim to ensure that new volunteers who want to contribute
>> bug-fixes/features spend the least amount of time figuring out
>> the project repo. We can never come up with an automated build system that
>> caters to every possible environment.
>> My only concern is if the mono-repo will make it harder for new developers
>> to work on parquet-cpp core just due to the additional code, build and test
>> dependencies.
>> I am not saying that the Arrow community/committers will be less
>> co-operative.
>> I just don't think the mono-repo structure model will be sustainable in an
>> open source community unless there are long-term vested interests. We can't
>> predict that.
>> 
>> The current circular dependency problem between Arrow and Parquet is a
>> major problem for the community, and resolving it is important.
>> 
>> The current Arrow adaptor code for parquet should live in the arrow repo.
>> That will remove a majority of the dependency issues.
>> Joshua's work would not have been blocked in parquet-cpp if that adapter
>> was in the arrow repo.  This will be similar to the ORC adaptor.
>> 
>> The platform API code is pretty stable at this point. Minor changes in the
>> future to this code should not be the main reason to combine the arrow
>> parquet repos.
>> 
>> "*I question whether it's worth the community's time long term to wear
>> ourselves out defining custom "ports" / virtual interfaces in each library
>> to plug components together rather than utilizing common platform APIs.*"
>> 
>> My answer to your question below would be "Yes". Modularity/separation is
>> very important in an open source community where priorities of contributors
>> are often short term.
>> Retention is low, and therefore the acquisition costs should be low as
>> well. This is the community-over-code approach, in my view. Minor code
>> duplication is not a deal breaker.
>> ORC, Parquet, Arrow, etc. are all different components in the big data
>> space serving their own functions.
>> 
>> If you still strongly feel that the only way forward is to clone the
>> parquet-cpp repo and part ways, I will withdraw my concern. Having two
>> parquet-cpp repos is in no way a better approach.
>> 
>> 
>> 
>> 
>> On Tue, Jul 31, 2018 at 10:28 AM Wes McKinney  wrote:
>> 
>>> @Antoine
>>> 
>>> > By the way, one concern with the monorepo approach: it would slightly
>>> > increase Arrow CI times (which are already too large).
>>> 
>>> A typical CI run in Arrow is taking about 45 minutes:
>>> https://travis-ci.org/apache/arrow/builds/410119750
>>> 
>>> A Parquet run takes about 28 minutes:
>>> https://travis-ci.org/apache/parquet-cpp/builds/410147208
>>> 
>>> Inevitably we will need to create some kind of bot to run certain
>>> builds on-demand based on commit / PR metadata or on request.
>>> 
>>> The slowest build in Arrow (the Arrow C++/Python one) could be made
>>> substantially shorter by moving some of the slower parts (like the
>>> Python ASV benchmarks) from being tested every commit to nightly or
>>> on demand. Using ASAN instead of valgrind in Travis would also
>>> improve build times (the valgrind build could be moved to a nightly
>>> exhaustive test run).
>>> 
>>> - Wes
>>> 
>>> On Mon, Jul 30, 2018 at 10:54 PM, Wes McKinney 
>>> wrote:
> I would like 

Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

2018-07-31 Thread Deepak Majeti
Wes,

Unfortunately, I cannot show you any practical fact-based problems of a
non-existent Arrow-Parquet mono-repo.
Bringing in related Apache community experiences is more meaningful than
how mono-repos work at Google and other big organizations.
We solely depend on volunteers and cannot hire full-time developers.
You are very well aware of how difficult it has been to find more
contributors and maintainers for Arrow. parquet-cpp already has a low
contribution rate to its core components.

We should aim to ensure that new volunteers who want to contribute
bug-fixes/features spend the least amount of time figuring out
the project repo. We can never come up with an automated build system that
caters to every possible environment.
My only concern is if the mono-repo will make it harder for new developers
to work on parquet-cpp core just due to the additional code, build and test
dependencies.
I am not saying that the Arrow community/committers will be less
co-operative.
I just don't think the mono-repo structure model will be sustainable in an
open source community unless there are long-term vested interests. We can't
predict that.

The current circular dependency problem between Arrow and Parquet is a
major problem for the community, and resolving it is important.

The current Arrow adaptor code for parquet should live in the arrow repo.
That will remove a majority of the dependency issues.
Joshua's work would not have been blocked in parquet-cpp if that adapter
was in the arrow repo.  This will be similar to the ORC adaptor.

The platform API code is pretty stable at this point. Minor changes in the
future to this code should not be the main reason to combine the arrow
parquet repos.

"*I question whether it's worth the community's time long term to wear
ourselves out defining custom "ports" / virtual interfaces in each library
to plug components together rather than utilizing common platform APIs.*"

My answer to your question below would be "Yes". Modularity/separation is
very important in an open source community where priorities of contributors
are often short term.
Retention is low, and therefore the acquisition costs should be low as
well. This is the community-over-code approach, in my view. Minor code
duplication is not a deal breaker.
ORC, Parquet, Arrow, etc. are all different components in the big data
space serving their own functions.

If you still strongly feel that the only way forward is to clone the
parquet-cpp repo and part ways, I will withdraw my concern. Having two
parquet-cpp repos is in no way a better approach.




On Tue, Jul 31, 2018 at 10:28 AM Wes McKinney  wrote:

> @Antoine
>
> > By the way, one concern with the monorepo approach: it would slightly
> > increase Arrow CI times (which are already too large).
>
> A typical CI run in Arrow is taking about 45 minutes:
> https://travis-ci.org/apache/arrow/builds/410119750
>
> A Parquet run takes about 28 minutes:
> https://travis-ci.org/apache/parquet-cpp/builds/410147208
>
> Inevitably we will need to create some kind of bot to run certain
> builds on-demand based on commit / PR metadata or on request.
>
> The slowest build in Arrow (the Arrow C++/Python one) could be made
> substantially shorter by moving some of the slower parts (like the
> Python ASV benchmarks) from being tested every commit to nightly or
> on demand. Using ASAN instead of valgrind in Travis would also
> improve build times (the valgrind build could be moved to a nightly
> exhaustive test run).
>
> - Wes
>
> On Mon, Jul 30, 2018 at 10:54 PM, Wes McKinney 
> wrote:
> >> I would like to point out that arrow's use of orc is a great example of
> how it would be possible to manage parquet-cpp as a separate codebase. That
> gives me hope that the projects could be managed separately some day.
> >
> > Well, I don't know that ORC is the best example. The ORC C++ codebase
> > features several areas of duplicated logic which could be replaced by
> > components from the Arrow platform for better platform-wide
> > interoperability:
> >
> >
> https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/OrcFile.hh#L37
> > https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/Int128.hh
> >
> https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/MemoryPool.hh
> > https://github.com/apache/orc/blob/master/c%2B%2B/src/io/InputStream.hh
> > https://github.com/apache/orc/blob/master/c%2B%2B/src/io/OutputStream.hh
> >
> > ORC's use of symbols from Protocol Buffers was actually a cause of
> > bugs that we had to fix in Arrow's build system to prevent them from
> > leaking to third party linkers when statically linked (ORC is only
> > available for static linking at the moment AFAIK).
> >
> > I question whether it's worth the community's time long term to wear
> > ourselves out defining custom "ports" / virtual interfaces in each
> > library to plug components together rather than utilizing common
> > platform APIs.
> >
> > - Wes
> >
> > On Mon, Jul 30, 2018 at 10:45 

[jira] [Created] (ARROW-2952) [C++] Dockerfile for running include-what-you-use checks

2018-07-31 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2952:
---

 Summary: [C++] Dockerfile for running include-what-you-use checks
 Key: ARROW-2952
 URL: https://issues.apache.org/jira/browse/ARROW-2952
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney


It would be valuable to have a no-nonsense, reproducible IWYU report. Every 
time I want to run this report on a new machine, I lose time building the 
correct version of IWYU and remembering how to correctly run the report.





[jira] [Created] (ARROW-2951) [CI] Changes in format/ should cause Appveyor builds to run

2018-07-31 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2951:
---

 Summary: [CI] Changes in format/ should cause Appveyor builds to 
run
 Key: ARROW-2951
 URL: https://issues.apache.org/jira/browse/ARROW-2951
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Continuous Integration
Reporter: Wes McKinney


Currently they are skipped: 
https://github.com/apache/arrow/blob/master/appveyor.yml#L23





Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

2018-07-31 Thread Wes McKinney
@Antoine

> By the way, one concern with the monorepo approach: it would slightly 
> increase Arrow CI times (which are already too large).

A typical CI run in Arrow is taking about 45 minutes:
https://travis-ci.org/apache/arrow/builds/410119750

A Parquet run takes about 28 minutes:
https://travis-ci.org/apache/parquet-cpp/builds/410147208

Inevitably we will need to create some kind of bot to run certain
builds on-demand based on commit / PR metadata or on request.

The slowest build in Arrow (the Arrow C++/Python one) could be made
substantially shorter by moving some of the slower parts (like the
Python ASV benchmarks) from being tested every commit to nightly or
on demand. Using ASAN instead of valgrind in Travis would also
improve build times (the valgrind build could be moved to a nightly
exhaustive test run).

- Wes

On Mon, Jul 30, 2018 at 10:54 PM, Wes McKinney  wrote:
>> I would like to point out that arrow's use of orc is a great example of how 
>> it would be possible to manage parquet-cpp as a separate codebase. That 
>> gives me hope that the projects could be managed separately some day.
>
> Well, I don't know that ORC is the best example. The ORC C++ codebase
> features several areas of duplicated logic which could be replaced by
> components from the Arrow platform for better platform-wide
> interoperability:
>
> https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/OrcFile.hh#L37
> https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/Int128.hh
> https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/MemoryPool.hh
> https://github.com/apache/orc/blob/master/c%2B%2B/src/io/InputStream.hh
> https://github.com/apache/orc/blob/master/c%2B%2B/src/io/OutputStream.hh
>
> ORC's use of symbols from Protocol Buffers was actually a cause of
> bugs that we had to fix in Arrow's build system to prevent them from
> leaking to third party linkers when statically linked (ORC is only
> available for static linking at the moment AFAIK).
>
> I question whether it's worth the community's time long term to wear
> ourselves out defining custom "ports" / virtual interfaces in each
> library to plug components together rather than utilizing common
> platform APIs.
>
> - Wes
>
> On Mon, Jul 30, 2018 at 10:45 PM, Joshua Storck  
> wrote:
>> Your point about the constraints of the ASF release process is well
>> taken, and as a developer who's trying to work in the current environment I
>> would be much happier if the codebases were merged. The main issues I worry
>> about when you put codebases like these together are:
>>
>> 1. The delineation of API's become blurred and the code becomes too coupled
>> 2. Release of artifacts that are lower in the dependency tree are delayed
>> by artifacts higher in the dependency tree
>>
>> If the project/release management is structured well and someone keeps an
>> eye on the coupling, then I don't have any concerns.
>>
>> I would like to point out that arrow's use of orc is a great example of how
>> it would be possible to manage parquet-cpp as a separate codebase. That
>> gives me hope that the projects could be managed separately some day.
>>
>> On Mon, Jul 30, 2018 at 10:23 PM Wes McKinney  wrote:
>>
>>> hi Josh,
>>>
>>> > I can imagine use cases for parquet that don't involve arrow and tying
>>> them together seems like the wrong choice.
>>>
>>> Apache is "Community over Code"; right now it's the same people
>>> building these projects -- my argument (which I think you agree with?)
>>> is that we should work more closely together until the community grows
>>> large enough to support larger-scope process than we have now. As
>>> you've seen, our process isn't serving developers of these projects.
>>>
>>> > I also think build tooling should be pulled into its own codebase.
>>>
>>> I don't see how this can possibly be practical taking into
>>> consideration the constraints imposed by the combination of the GitHub
>>> platform and the ASF release process. I'm all for being idealistic,
>>> but right now we need to be practical. Unless we can devise a
>>> practical procedure that can accommodate at least 1 patch per day
>>> which may touch both code and build system simultaneously without
>>> being a hindrance to contributor or maintainer, I don't see how we can
>>> move forward.
>>>
>>> > That being said, I think it makes sense to merge the codebases in the
>>> short term with the express purpose of separating them in the near  term.
>>>
>>> I would agree but only if separation can be demonstrated to be
>>> practical and result in net improvements in productivity and community
>>> growth. I think experience has clearly demonstrated that the current
>>> separation is impractical, and is causing problems.
>>>
>>> Per Julian's and Ted's comments, I think we need to consider
>>> development process and ASF releases separately. My argument is as
>>> follows:
>>>
>>> * Monorepo for development (for practicality)
>>> * Releases structured according to the desires of 

[jira] [Created] (ARROW-2950) [C++] Clean up util/bit-util.h

2018-07-31 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-2950:
-

 Summary: [C++] Clean up util/bit-util.h
 Key: ARROW-2950
 URL: https://issues.apache.org/jira/browse/ARROW-2950
 Project: Apache Arrow
  Issue Type: Task
Reporter: Antoine Pitrou








[jira] [Created] (ARROW-2949) [CI] repo.continuum.io can be flaky in builds

2018-07-31 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2949:
---

 Summary: [CI] repo.continuum.io can be flaky in builds
 Key: ARROW-2949
 URL: https://issues.apache.org/jira/browse/ARROW-2949
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Continuous Integration
Reporter: Wes McKinney


I have seen this flakiness in several builds:

{code}
++wget --no-verbose -O miniconda.sh 
https://repo.continuum.io/miniconda/Miniconda3-latest-MacOSX-x86_64.sh
wget: unable to resolve host address 'repo.continuum.io'
{code}

e.g. https://travis-ci.org/apache/arrow/jobs/410201987
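A common mitigation is to retry the download a few times before failing the build. A sketch (the `retry` helper below is illustrative, not something that exists in the Arrow CI scripts):

```shell
# Retry a flaky command a fixed number of times before giving up.
# "retry" is an illustrative helper, not part of the Arrow CI scripts.
retry() {
  local attempts="$1"; shift
  local n=1
  until "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      echo "command failed after $attempts attempts: $*" >&2
      return 1
    fi
    n=$((n + 1))
    sleep 1
  done
}

# Usage in a CI script would look like:
# retry 3 wget --no-verbose -O miniconda.sh \
#   https://repo.continuum.io/miniconda/Miniconda3-latest-MacOSX-x86_64.sh
```

wget's own `--tries` option covers some failures, but a wrapper like this also catches transient DNS resolution errors such as the one above.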





[jira] [Created] (ARROW-2948) [Packaging] Generate changelog with crossbow

2018-07-31 Thread Krisztian Szucs (JIRA)
Krisztian Szucs created ARROW-2948:
--

 Summary: [Packaging] Generate changelog with crossbow
 Key: ARROW-2948
 URL: https://issues.apache.org/jira/browse/ARROW-2948
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Packaging
Reporter: Krisztian Szucs
Assignee: Krisztian Szucs


Basically a port of 
https://github.com/apache/arrow/blob/master/dev/release/changelog.py





Re: Reading PageHeader separately from reading entire page

2018-07-31 Thread Wes McKinney
hi Renato,

Sounds like a useful feature to have (to be able to inspect data page
metadata without decoding all the data inside). You'll need to propose
a change and patch to Apache Parquet.

Speaking of which, we're having a discussion on the Arrow and Parquet
mailing lists about easing Parquet-related development process for
both communities:

https://lists.apache.org/thread.html/4bc135b4e933b959602df48bc3d5978ab7a4299d83d4295da9f498ac@%3Cdev.parquet.apache.org%3E

- Wes

On Mon, Jul 30, 2018 at 12:02 PM, Renato Marroquín Mogrovejo
 wrote:
> Hi Arrow devs,
>
> I am trying to separate reading only page headers from reading
> (reading + uncompressing + deserializing) an entire page.
> The current SerializedPageReader::NextPage() does both things at the same
> time.
> I tried importing format::PageHeader into a separate project linking
> against a build of parquet-cpp, but I can't; I guess that is because it is
> not exported, right?
> Any suggestions/pointers/ideas are highly appreciated!
> Thanks!
>
>
> Renato M.
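For illustration, the split being asked about could look something like this hypothetical interface (a self-contained sketch, not the actual parquet-cpp API; `PageHeader` and `PageStream` here are simplified stand-ins for the real Thrift-generated types and reader):

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Simplified stand-in for the Thrift-generated parquet::format::PageHeader.
struct PageHeader {
  int32_t uncompressed_size;
  int32_t compressed_size;
  int32_t num_values;
};

// Sketch of splitting NextPage() into a cheap header step and an optional
// decode step, so callers can inspect page metadata without decompressing
// the page body.
class PageStream {
 public:
  explicit PageStream(
      std::vector<std::pair<PageHeader, std::vector<uint8_t>>> pages)
      : pages_(std::move(pages)) {}

  // Step 1: read only the header (cheap, no decompression).
  const PageHeader* NextHeader() {
    if (index_ >= pages_.size()) return nullptr;
    return &pages_[index_].first;
  }

  // Step 2a: the caller decides to skip the page body entirely.
  void SkipPage() { ++index_; }

  // Step 2b: the caller wants the data; decompression and deserialization
  // would happen here in a real implementation.
  std::vector<uint8_t> ReadPage() { return pages_[index_++].second; }

 private:
  std::vector<std::pair<PageHeader, std::vector<uint8_t>>> pages_;
  size_t index_ = 0;
};
```

With an interface like this, scanning page statistics or skipping unwanted pages costs only the header decode, not the full page read.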


[jira] [Created] (ARROW-2947) [Packaging] Remove Ubuntu Artful

2018-07-31 Thread Kouhei Sutou (JIRA)
Kouhei Sutou created ARROW-2947:
---

 Summary: [Packaging] Remove Ubuntu Artful
 Key: ARROW-2947
 URL: https://issues.apache.org/jira/browse/ARROW-2947
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Packaging
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou








[jira] [Created] (ARROW-2946) [Packaging] Stop to use PWD in debian/rules

2018-07-31 Thread Kouhei Sutou (JIRA)
Kouhei Sutou created ARROW-2946:
---

 Summary: [Packaging] Stop to use PWD in debian/rules
 Key: ARROW-2946
 URL: https://issues.apache.org/jira/browse/ARROW-2946
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Packaging
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou








[jira] [Created] (ARROW-2945) [Packaging] Update argument check for 02-source.sh

2018-07-31 Thread Kouhei Sutou (JIRA)
Kouhei Sutou created ARROW-2945:
---

 Summary: [Packaging] Update argument check for 02-source.sh
 Key: ARROW-2945
 URL: https://issues.apache.org/jira/browse/ARROW-2945
 Project: Apache Arrow
  Issue Type: Bug
  Components: Packaging
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou





