[jira] [Created] (ARROW-2954) [Plasma] Store object_id only once in object table
Philipp Moritz created ARROW-2954: - Summary: [Plasma] Store object_id only once in object table Key: ARROW-2954 URL: https://issues.apache.org/jira/browse/ARROW-2954 Project: Apache Arrow Issue Type: Improvement Components: Plasma (C++) Reporter: Philipp Moritz Assignee: Philipp Moritz Fix For: 0.10.0 This is the first part of ARROW-2953, i.e. the duplicated storage of the object id both in the key and the value of the object hash table. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
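The idea behind this issue (sketched here in Python purely for illustration; the actual object table is a C++ hash map inside the Plasma store) is that a hash table already stores the key, so the value record need not repeat the 20-byte object id:

```python
# Hypothetical illustration of the refactoring: the entry layout below is
# invented for this sketch, not the real Plasma ObjectTableEntry.
oid = b"\x01" * 20  # a 20-byte object id

# Before: the value duplicates the object id already used as the key.
table_before = {oid: {"object_id": oid, "data_size": 1024}}

# After: the key alone carries the id; each entry shrinks by 20 bytes.
table_after = {oid: {"data_size": 1024}}
```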
[jira] [Created] (ARROW-2953) [Plasma] Store memory usage
Philipp Moritz created ARROW-2953: - Summary: [Plasma] Store memory usage Key: ARROW-2953 URL: https://issues.apache.org/jira/browse/ARROW-2953 Project: Apache Arrow Issue Type: Improvement Reporter: Philipp Moritz While doing some memory profiling on the store, it became clear that at the moment the metadata of the objects takes up much more space than it should. In particular, for each object:
* The object id (20 bytes) is stored three times
* The object checksum (8 bytes) is stored twice
* data_size and metadata_size (each 8 bytes) are stored twice
We can therefore significantly reduce the metadata overhead with some refactoring.
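Taking the figures quoted above at face value, a quick back-of-the-envelope calculation (a sketch only; the real layout lives in the C++ object table) shows how many redundant bytes each object carries:

```python
# Per-object metadata redundancy in the Plasma store, using the sizes
# quoted in ARROW-2953. "Redundant" counts every copy beyond the first.
OBJECT_ID = 20      # bytes, stored three times -> two redundant copies
CHECKSUM = 8        # bytes, stored twice       -> one redundant copy
DATA_SIZE = 8       # bytes, stored twice       -> one redundant copy
METADATA_SIZE = 8   # bytes, stored twice       -> one redundant copy

redundant = 2 * OBJECT_ID + CHECKSUM + DATA_SIZE + METADATA_SIZE
print(redundant)  # 64 redundant bytes per object
```

So the refactoring could save roughly 64 bytes of metadata per stored object, which adds up quickly for stores holding many small objects.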
Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass
hi, On Tue, Jul 31, 2018 at 4:56 PM, Deepak Majeti wrote: > I think the circular dependency can be broken if we build a new library for > the platform code. This will also make it easy for other projects such as > ORC to use it. > I also remember your proposal a while ago of having a separate project for > the platform code. That project can live in the arrow repo. However, one > has to clone the entire apache arrow repo but can just build the platform > code. This will be temporary until we can find a new home for it. > > The dependency will look like: > libarrow(arrow core / bindings) <- libparquet (parquet core) <- > libplatform(platform api) > > CI workflow will clone the arrow project twice, once for the platform > library and once for the arrow-core/bindings library. This seems like an interesting proposal; the best place to work toward this goal (if it is even possible; the build system interactions and ASF release management are the hard problems) is to have all of the code in a single repository. ORC could already be using Arrow if it wanted, but the ORC contributors aren't active in Arrow. > > There is no doubt that the collaborations between the Arrow and Parquet > communities so far have been very successful. > The reason to maintain this relationship moving forward is to continue to > reap the mutual benefits. > We should continue to take advantage of sharing code as well. However, I > don't see any code sharing opportunities between arrow-core and the > parquet-core. Both have different functions. I think you mean the Arrow columnar format. The Arrow columnar format is only one part of a project that has become quite large already (https://www.slideshare.net/wesm/apache-arrow-crosslanguage-development-platform-for-inmemory-data-105427919). > > We are at a point where the parquet-cpp public API is pretty stable. We > already passed that difficult stage. My take at arrow and parquet is to > keep them nimble since we can. 
I believe that parquet-core still has progress to make ahead of it. We have done little work in asynchronous IO and concurrency which would yield both improved read and write throughput. This aligns well with other concurrency and async-IO work planned in the Arrow platform. I believe that more development will happen on parquet-core once the development process issues are resolved by having a single codebase, single build system, and a single CI framework. I have some gripes about design decisions made early in parquet-cpp, like the use of C++ exceptions. So while "stability" is a reasonable goal, I think we should still be open to making significant changes in the interest of long term progress. Having now worked on these projects for more than 2 and a half years and been the most frequent contributor to both codebases, I'm sadly far past the "breaking point" and not willing to continue contributing in a significant way to parquet-cpp if the projects remain structured as they are now. It's hampering progress and not serving the community. - Wes > > > > > On Tue, Jul 31, 2018 at 3:17 PM Wes McKinney wrote: > >> > The current Arrow adaptor code for parquet should live in the arrow >> repo. That will remove a majority of the dependency issues. Joshua's work >> would not have been blocked in parquet-cpp if that adapter was in the arrow >> repo. This will be similar to the ORC adaptor. >> >> This has been suggested before, but I don't see how it would alleviate >> any issues because of the significant dependencies on other parts of >> the Arrow codebase. What you are proposing is: >> >> - (Arrow) arrow platform >> - (Parquet) parquet core >> - (Arrow) arrow columnar-parquet adapter interface >> - (Arrow) Python bindings >> >> To make this work, somehow Arrow core / libarrow would have to be >> built before invoking the Parquet core part of the build system. 
You >> would need to pass dependent targets across different CMake build >> systems; I don't know if it's possible (I spent some time looking into >> it earlier this year). This is what I meant by the lack of a "concrete >> and actionable plan". The only thing that would really work would be >> for the Parquet core to be "included" in the Arrow build system >> somehow rather than using ExternalProject. Currently Parquet builds >> Arrow using ExternalProject, and Parquet is unknown to the Arrow build >> system because it's only depended upon by the Python bindings. >> >> And even if a solution could be devised, it would not wholly resolve >> the CI workflow issues. >> >> You could make Parquet completely independent of the Arrow codebase, >> but at that point there is little reason to maintain a relationship >> between the projects or their communities. We have spent a great deal >> of effort refactoring the two projects to enable as much code sharing >> as there is now. >> >> - Wes >> >> On Tue, Jul 31, 2018 at 2:29 PM, Wes McKinney wrote: >> >> If you still strongly feel that the only way forward is to clone the >>
Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass
> The current Arrow adaptor code for parquet should live in the arrow repo. > That will remove a majority of the dependency issues. Joshua's work would not > have been blocked in parquet-cpp if that adapter was in the arrow repo. This > will be similar to the ORC adaptor. This has been suggested before, but I don't see how it would alleviate any issues because of the significant dependencies on other parts of the Arrow codebase. What you are proposing is: - (Arrow) arrow platform - (Parquet) parquet core - (Arrow) arrow columnar-parquet adapter interface - (Arrow) Python bindings To make this work, somehow Arrow core / libarrow would have to be built before invoking the Parquet core part of the build system. You would need to pass dependent targets across different CMake build systems; I don't know if it's possible (I spent some time looking into it earlier this year). This is what I meant by the lack of a "concrete and actionable plan". The only thing that would really work would be for the Parquet core to be "included" in the Arrow build system somehow rather than using ExternalProject. Currently Parquet builds Arrow using ExternalProject, and Parquet is unknown to the Arrow build system because it's only depended upon by the Python bindings. And even if a solution could be devised, it would not wholly resolve the CI workflow issues. You could make Parquet completely independent of the Arrow codebase, but at that point there is little reason to maintain a relationship between the projects or their communities. We have spent a great deal of effort refactoring the two projects to enable as much code sharing as there is now. - Wes On Tue, Jul 31, 2018 at 2:29 PM, Wes McKinney wrote: >> If you still strongly feel that the only way forward is to clone the >> parquet-cpp repo and part ways, I will withdraw my concern. Having two >> parquet-cpp repos is no way a better approach. > > Yes, indeed. In my view, the next best option after a monorepo is to > fork. 
That would obviously be a bad outcome for the community. > > It doesn't look like I will be able to convince you that a monorepo is > a good idea; what I would ask instead is that you be willing to give > it a shot, and if it turns out in the way you're describing (which I > don't think it will) then I suggest that we fork at that point. > > - Wes > > On Tue, Jul 31, 2018 at 2:14 PM, Deepak Majeti > wrote: >> Wes, >> >> Unfortunately, I cannot show you any practical fact-based problems of a >> non-existent Arrow-Parquet mono-repo. >> Bringing in related Apache community experiences are more meaningful than >> how mono-repos work at Google and other big organizations. >> We solely depend on volunteers and cannot hire full-time developers. >> You are very well aware of how difficult it has been to find more >> contributors and maintainers for Arrow. parquet-cpp already has a low >> contribution rate to its core components. >> >> We should target to ensure that new volunteers who want to contribute >> bug-fixes/features should spend the least amount of time in figuring out >> the project repo. We can never come up with an automated build system that >> caters to every possible environment. >> My only concern is if the mono-repo will make it harder for new developers >> to work on parquet-cpp core just due to the additional code, build and test >> dependencies. >> I am not saying that the Arrow community/committers will be less >> co-operative. >> I just don't think the mono-repo structure model will be sustainable in an >> open source community unless there are long-term vested interests. We can't >> predict that. >> >> The current circular dependency problems between Arrow and Parquet is a >> major problem for the community and it is important. >> >> The current Arrow adaptor code for parquet should live in the arrow repo. >> That will remove a majority of the dependency issues. 
>> Joshua's work would not have been blocked in parquet-cpp if that adapter >> was in the arrow repo. This will be similar to the ORC adaptor. >> >> The platform API code is pretty stable at this point. Minor changes in the >> future to this code should not be the main reason to combine the arrow >> parquet repos. >> >> "I question whether it's worth the community's time long term to wear >> ourselves out defining custom "ports" / virtual interfaces in each library >> to plug components together rather than utilizing common platform APIs." >> >> My answer to your question below would be "Yes". Modularity/separation is >> very important in an open source community where priorities of contributors >> are often short term. >> The retention is low and therefore the acquisition costs should be low as >> well. This is the community over code approach according to me. Minor code >> duplication is not a deal breaker. >> ORC, Parquet, Arrow, etc. are all different components in the big data >> space serving their own functions. >> >> If you still strongly feel
Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass
A controlled fork doesn’t sound like a terrible option. Copy the code from parquet into arrow, and for a limited period of time it would be the primary. When that period is over, the code in parquet becomes the primary. During the period during which arrow has the primary, the parquet release manager will have to synchronize parquet’s copy of the code (probably by patches) before making releases. Julian > On Jul 31, 2018, at 11:29 AM, Wes McKinney wrote: > >> If you still strongly feel that the only way forward is to clone the >> parquet-cpp repo and part ways, I will withdraw my concern. Having two >> parquet-cpp repos is no way a better approach. > > Yes, indeed. In my view, the next best option after a monorepo is to > fork. That would obviously be a bad outcome for the community. > > It doesn't look like I will be able to convince you that a monorepo is > a good idea; what I would ask instead is that you be willing to give > it a shot, and if it turns out in the way you're describing (which I > don't think it will) then I suggest that we fork at that point. > > - Wes > > On Tue, Jul 31, 2018 at 2:14 PM, Deepak Majeti > wrote: >> Wes, >> >> Unfortunately, I cannot show you any practical fact-based problems of a >> non-existent Arrow-Parquet mono-repo. >> Bringing in related Apache community experiences are more meaningful than >> how mono-repos work at Google and other big organizations. >> We solely depend on volunteers and cannot hire full-time developers. >> You are very well aware of how difficult it has been to find more >> contributors and maintainers for Arrow. parquet-cpp already has a low >> contribution rate to its core components. >> >> We should target to ensure that new volunteers who want to contribute >> bug-fixes/features should spend the least amount of time in figuring out >> the project repo. We can never come up with an automated build system that >> caters to every possible environment. 
>> My only concern is if the mono-repo will make it harder for new developers >> to work on parquet-cpp core just due to the additional code, build and test >> dependencies. >> I am not saying that the Arrow community/committers will be less >> co-operative. >> I just don't think the mono-repo structure model will be sustainable in an >> open source community unless there are long-term vested interests. We can't >> predict that. >> >> The current circular dependency problems between Arrow and Parquet are a >> major problem for the community and it is important. >> >> The current Arrow adaptor code for parquet should live in the arrow repo. >> That will remove a majority of the dependency issues. >> Joshua's work would not have been blocked in parquet-cpp if that adapter >> was in the arrow repo. This will be similar to the ORC adaptor. >> >> The platform API code is pretty stable at this point. Minor changes in the >> future to this code should not be the main reason to combine the arrow >> parquet repos. >> >> "I question whether it's worth the community's time long term to wear >> ourselves out defining custom "ports" / virtual interfaces in each library >> to plug components together rather than utilizing common platform APIs." >> >> My answer to your question below would be "Yes". Modularity/separation is >> very important in an open source community where priorities of contributors >> are often short term. >> The retention is low and therefore the acquisition costs should be low as >> well. This is the community over code approach according to me. Minor code >> duplication is not a deal breaker. >> ORC, Parquet, Arrow, etc. are all different components in the big data >> space serving their own functions. >> >> If you still strongly feel that the only way forward is to clone the >> parquet-cpp repo and part ways, I will withdraw my concern. Having two >> parquet-cpp repos is in no way a better approach. 
>> >> >> >> >> On Tue, Jul 31, 2018 at 10:28 AM Wes McKinney wrote: >> >>> @Antoine >>> By the way, one concern with the monorepo approach: it would slightly >>> increase Arrow CI times (which are already too large). >>> >>> A typical CI run in Arrow is taking about 45 minutes: >>> https://travis-ci.org/apache/arrow/builds/410119750 >>> >>> Parquet run takes about 28 >>> https://travis-ci.org/apache/parquet-cpp/builds/410147208 >>> >>> Inevitably we will need to create some kind of bot to run certain >>> builds on-demand based on commit / PR metadata or on request. >>> >>> The slowest build in Arrow (the Arrow C++/Python one) build could be >>> made substantially shorter by moving some of the slower parts (like >>> the Python ASV benchmarks) from being tested every-commit to nightly >>> or on demand. Using ASAN instead of valgrind in Travis would also >>> improve build times (valgrind build could be moved to a nightly >>> exhaustive test run) >>> >>> - Wes >>> >>> On Mon, Jul 30, 2018 at 10:54 PM, Wes McKinney >>> wrote: > I would like
Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass
Wes, Unfortunately, I cannot show you any practical fact-based problems of a non-existent Arrow-Parquet mono-repo. Bringing in related Apache community experiences is more meaningful than how mono-repos work at Google and other big organizations. We solely depend on volunteers and cannot hire full-time developers. You are very well aware of how difficult it has been to find more contributors and maintainers for Arrow. parquet-cpp already has a low contribution rate to its core components. We should aim to ensure that new volunteers who want to contribute bug-fixes/features spend the least amount of time figuring out the project repo. We can never come up with an automated build system that caters to every possible environment. My only concern is if the mono-repo will make it harder for new developers to work on parquet-cpp core just due to the additional code, build and test dependencies. I am not saying that the Arrow community/committers will be less co-operative. I just don't think the mono-repo structure model will be sustainable in an open source community unless there are long-term vested interests. We can't predict that. The current circular dependency problems between Arrow and Parquet are a major problem for the community and it is important. The current Arrow adaptor code for parquet should live in the arrow repo. That will remove a majority of the dependency issues. Joshua's work would not have been blocked in parquet-cpp if that adapter was in the arrow repo. This will be similar to the ORC adaptor. The platform API code is pretty stable at this point. Minor changes in the future to this code should not be the main reason to combine the arrow parquet repos. "I question whether it's worth the community's time long term to wear ourselves out defining custom "ports" / virtual interfaces in each library to plug components together rather than utilizing common platform APIs." My answer to your question below would be "Yes". 
Modularity/separation is very important in an open source community where priorities of contributors are often short term. The retention is low and therefore the acquisition costs should be low as well. This is the community over code approach according to me. Minor code duplication is not a deal breaker. ORC, Parquet, Arrow, etc. are all different components in the big data space serving their own functions. If you still strongly feel that the only way forward is to clone the parquet-cpp repo and part ways, I will withdraw my concern. Having two parquet-cpp repos is no way a better approach. On Tue, Jul 31, 2018 at 10:28 AM Wes McKinney wrote: > @Antoine > > > By the way, one concern with the monorepo approach: it would slightly > increase Arrow CI times (which are already too large). > > A typical CI run in Arrow is taking about 45 minutes: > https://travis-ci.org/apache/arrow/builds/410119750 > > Parquet run takes about 28 > https://travis-ci.org/apache/parquet-cpp/builds/410147208 > > Inevitably we will need to create some kind of bot to run certain > builds on-demand based on commit / PR metadata or on request. > > The slowest build in Arrow (the Arrow C++/Python one) build could be > made substantially shorter by moving some of the slower parts (like > the Python ASV benchmarks) from being tested every-commit to nightly > or on demand. Using ASAN instead of valgrind in Travis would also > improve build times (valgrind build could be moved to a nightly > exhaustive test run) > > - Wes > > On Mon, Jul 30, 2018 at 10:54 PM, Wes McKinney > wrote: > >> I would like to point out that arrow's use of orc is a great example of > how it would be possible to manage parquet-cpp as a separate codebase. That > gives me hope that the projects could be managed separately some day. > > > > Well, I don't know that ORC is the best example. 
The ORC C++ codebase > > features several areas of duplicated logic which could be replaced by > > components from the Arrow platform for better platform-wide > > interoperability: > > > > > https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/OrcFile.hh#L37 > > https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/Int128.hh > > > https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/MemoryPool.hh > > https://github.com/apache/orc/blob/master/c%2B%2B/src/io/InputStream.hh > > https://github.com/apache/orc/blob/master/c%2B%2B/src/io/OutputStream.hh > > > > ORC's use of symbols from Protocol Buffers was actually a cause of > > bugs that we had to fix in Arrow's build system to prevent them from > > leaking to third party linkers when statically linked (ORC is only > > available for static linking at the moment AFAIK). > > > > I question whether it's worth the community's time long term to wear > > ourselves out defining custom "ports" / virtual interfaces in each > > library to plug components together rather than utilizing common > > platform APIs. > > > > - Wes > > > > On Mon, Jul 30, 2018 at 10:45
[jira] [Created] (ARROW-2952) [C++] Dockerfile for running include-what-you-use checks
Wes McKinney created ARROW-2952: --- Summary: [C++] Dockerfile for running include-what-you-use checks Key: ARROW-2952 URL: https://issues.apache.org/jira/browse/ARROW-2952 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney It would be valuable to have a no-nonsense reproducible IWYU report. Every time I want to run this report on a new machine, I lose time building the correct version of IWYU and remembering how to correctly run the report.
[jira] [Created] (ARROW-2951) [CI] Changes in format/ should cause Appveyor builds to run
Wes McKinney created ARROW-2951: --- Summary: [CI] Changes in format/ should cause Appveyor builds to run Key: ARROW-2951 URL: https://issues.apache.org/jira/browse/ARROW-2951 Project: Apache Arrow Issue Type: Improvement Components: Continuous Integration Reporter: Wes McKinney Currently they are skipped https://github.com/apache/arrow/blob/master/appveyor.yml#L23
Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass
@Antoine > By the way, one concern with the monorepo approach: it would slightly > increase Arrow CI times (which are already too large). A typical CI run in Arrow is taking about 45 minutes: https://travis-ci.org/apache/arrow/builds/410119750 A Parquet run takes about 28 minutes: https://travis-ci.org/apache/parquet-cpp/builds/410147208 Inevitably we will need to create some kind of bot to run certain builds on-demand based on commit / PR metadata or on request. The slowest build in Arrow (the Arrow C++/Python one) could be made substantially shorter by moving some of the slower parts (like the Python ASV benchmarks) from being tested every-commit to nightly or on demand. Using ASAN instead of valgrind in Travis would also improve build times (the valgrind build could be moved to a nightly exhaustive test run) - Wes On Mon, Jul 30, 2018 at 10:54 PM, Wes McKinney wrote: >> I would like to point out that arrow's use of orc is a great example of how >> it would be possible to manage parquet-cpp as a separate codebase. That >> gives me hope that the projects could be managed separately some day. > > Well, I don't know that ORC is the best example. The ORC C++ codebase > features several areas of duplicated logic which could be replaced by > components from the Arrow platform for better platform-wide > interoperability: > > https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/OrcFile.hh#L37 > https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/Int128.hh > https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/MemoryPool.hh > https://github.com/apache/orc/blob/master/c%2B%2B/src/io/InputStream.hh > https://github.com/apache/orc/blob/master/c%2B%2B/src/io/OutputStream.hh > > ORC's use of symbols from Protocol Buffers was actually a cause of > bugs that we had to fix in Arrow's build system to prevent them from > leaking to third party linkers when statically linked (ORC is only > available for static linking at the moment AFAIK). 
> > I question whether it's worth the community's time long term to wear > ourselves out defining custom "ports" / virtual interfaces in each > library to plug components together rather than utilizing common > platform APIs. > > - Wes > > On Mon, Jul 30, 2018 at 10:45 PM, Joshua Storck > wrote: >> You're point about the constraints of the ASF release process are well >> taken and as a developer who's trying to work in the current environment I >> would be much happier if the codebases were merged. The main issues I worry >> about when you put codebases like these together are: >> >> 1. The delineation of API's become blurred and the code becomes too coupled >> 2. Release of artifacts that are lower in the dependency tree are delayed >> by artifacts higher in the dependency tree >> >> If the project/release management is structured well and someone keeps an >> eye on the coupling, then I don't have any concerns. >> >> I would like to point out that arrow's use of orc is a great example of how >> it would be possible to manage parquet-cpp as a separate codebase. That >> gives me hope that the projects could be managed separately some day. >> >> On Mon, Jul 30, 2018 at 10:23 PM Wes McKinney wrote: >> >>> hi Josh, >>> >>> > I can imagine use cases for parquet that don't involve arrow and tying >>> them together seems like the wrong choice. >>> >>> Apache is "Community over Code"; right now it's the same people >>> building these projects -- my argument (which I think you agree with?) >>> is that we should work more closely together until the community grows >>> large enough to support larger-scope process than we have now. As >>> you've seen, our process isn't serving developers of these projects. >>> >>> > I also think build tooling should be pulled into its own codebase. >>> >>> I don't see how this can possibly be practical taking into >>> consideration the constraints imposed by the combination of the GitHub >>> platform and the ASF release process. 
I'm all for being idealistic, >>> but right now we need to be practical. Unless we can devise a >>> practical procedure that can accommodate at least 1 patch per day >>> which may touch both code and build system simultaneously without >>> being a hindrance to contributor or maintainer, I don't see how we can >>> move forward. >>> >>> > That being said, I think it makes sense to merge the codebases in the >>> short term with the express purpose of separating them in the near term. >>> >>> I would agree but only if separation can be demonstrated to be >>> practical and result in net improvements in productivity and community >>> growth. I think experience has clearly demonstrated that the current >>> separation is impractical, and is causing problems. >>> >>> Per Julian's and Ted's comments, I think we need to consider >>> development process and ASF releases separately. My argument is as >>> follows: >>> >>> * Monorepo for development (for practicality) >>> * Releases structured according to the desires of
[jira] [Created] (ARROW-2950) [C++] Clean up util/bit-util.h
Antoine Pitrou created ARROW-2950: - Summary: [C++] Clean up util/bit-util.h Key: ARROW-2950 URL: https://issues.apache.org/jira/browse/ARROW-2950 Project: Apache Arrow Issue Type: Task Reporter: Antoine Pitrou
[jira] [Created] (ARROW-2949) [CI] repo.continuum.io can be flaky in builds
Wes McKinney created ARROW-2949: --- Summary: [CI] repo.continuum.io can be flaky in builds Key: ARROW-2949 URL: https://issues.apache.org/jira/browse/ARROW-2949 Project: Apache Arrow Issue Type: Improvement Components: Continuous Integration Reporter: Wes McKinney I have seen this flakiness in several builds: {code} ++wget --no-verbose -O miniconda.sh https://repo.continuum.io/miniconda/Miniconda3-latest-MacOSX-x86_64.sh wget: unable to resolve host address ‘repo.continuum.io’ {code} e.g. https://travis-ci.org/apache/arrow/jobs/410201987
[jira] [Created] (ARROW-2948) [Packaging] Generate changelog with crossbow
Krisztian Szucs created ARROW-2948: -- Summary: [Packaging] Generate changelog with crossbow Key: ARROW-2948 URL: https://issues.apache.org/jira/browse/ARROW-2948 Project: Apache Arrow Issue Type: Sub-task Components: Packaging Reporter: Krisztian Szucs Assignee: Krisztian Szucs Basically the port of https://github.com/apache/arrow/blob/master/dev/release/changelog.py
Re: Reading PageHeader separately from reading entire page
hi Renato, Sounds like a useful feature to have (to be able to inspect data page metadata without decoding all the data inside). You'll need to propose a change and patch to Apache Parquet. Speaking of which, we're having a discussion on the Arrow and Parquet mailing lists about easing the Parquet-related development process for both communities: https://lists.apache.org/thread.html/4bc135b4e933b959602df48bc3d5978ab7a4299d83d4295da9f498ac@%3Cdev.parquet.apache.org%3E - Wes On Mon, Jul 30, 2018 at 12:02 PM, Renato Marroquín Mogrovejo wrote: > Hi Arrow devs, > > I am trying to separate reading only page headers from reading > (reading+uncompressing+serializing) their entire content. > The current SerializedPageReader::NextPage() does both things at the same > time. > I tried importing format::PageHeader into a separate project linking > against a build of parquet-cpp, but I can't; I guess it is because it is > not exported, right? > Any suggestions/pointers/ideas are highly appreciated! > Thanks! > > > Renato M.
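For readers following along: the pattern Renato is asking for is "decode the page header, then seek past the (possibly compressed) page body instead of reading it". The sketch below illustrates only that scan pattern, using an invented length-prefixed toy format rather than the real Thrift-compact-encoded format::PageHeader (which parquet-cpp does not currently export); it is not the parquet-cpp API.

```python
import io
import struct

# Toy page layout (hypothetical, for illustration only):
#   [4-byte LE header length][header bytes][4-byte LE body length][body bytes]
# Real Parquet data pages use a Thrift-encoded PageHeader instead.

def write_page(buf, header: bytes, body: bytes):
    buf.write(struct.pack("<I", len(header)))
    buf.write(header)
    buf.write(struct.pack("<I", len(body)))
    buf.write(body)

def scan_headers(buf):
    """Collect page headers without reading or decompressing page bodies."""
    headers = []
    while True:
        raw = buf.read(4)
        if not raw:
            break  # end of stream
        (hlen,) = struct.unpack("<I", raw)
        headers.append(buf.read(hlen))
        (blen,) = struct.unpack("<I", buf.read(4))
        buf.seek(blen, io.SEEK_CUR)  # skip the page body entirely
    return headers

buf = io.BytesIO()
write_page(buf, b"header-1", b"x" * 1000)
write_page(buf, b"header-2", b"y" * 2000)
buf.seek(0)
print(scan_headers(buf))  # [b'header-1', b'header-2']
```

In parquet-cpp the equivalent change would mean splitting SerializedPageReader::NextPage() into a header-decoding step and a separate body-read step, which is the kind of patch the reply above suggests proposing upstream.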
[jira] [Created] (ARROW-2947) [Packaging] Remove Ubuntu Artful
Kouhei Sutou created ARROW-2947: --- Summary: [Packaging] Remove Ubuntu Artful Key: ARROW-2947 URL: https://issues.apache.org/jira/browse/ARROW-2947 Project: Apache Arrow Issue Type: Improvement Components: Packaging Reporter: Kouhei Sutou Assignee: Kouhei Sutou
[jira] [Created] (ARROW-2946) [Packaging] Stop to use PWD in debian/rules
Kouhei Sutou created ARROW-2946: --- Summary: [Packaging] Stop to use PWD in debian/rules Key: ARROW-2946 URL: https://issues.apache.org/jira/browse/ARROW-2946 Project: Apache Arrow Issue Type: Improvement Components: Packaging Reporter: Kouhei Sutou Assignee: Kouhei Sutou
[jira] [Created] (ARROW-2945) [Packaging] Update argument check for 02-source.sh
Kouhei Sutou created ARROW-2945: --- Summary: [Packaging] Update argument check for 02-source.sh Key: ARROW-2945 URL: https://issues.apache.org/jira/browse/ARROW-2945 Project: Apache Arrow Issue Type: Bug Components: Packaging Reporter: Kouhei Sutou Assignee: Kouhei Sutou