[jira] [Commented] (PARQUET-1345) [C++] It is possible to overflow a TMemoryBuffer when serializing the file metadata
[ https://issues.apache.org/jira/browse/PARQUET-1345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17208158#comment-17208158 ] Uwe Korn commented on PARQUET-1345: --- Turns out this was not due to many categorical columns but due to a huge number (>1 million) of RowGroups. We cannot fix this, as Thrift messages are capped at 2GiB, but we could probably raise a better error message. > [C++] It is possible to overflow a TMemoryBuffer when serializing the file > metadata > --- > > Key: PARQUET-1345 > URL: https://issues.apache.org/jira/browse/PARQUET-1345 > Project: Parquet > Issue Type: Bug > Components: parquet-cpp >Reporter: Wes McKinney >Priority: Major > > I'm not sure if this is fixable, but see issue reported to Arrow: > https://github.com/apache/arrow/issues/2077 -- This message was sent by Atlassian Jira (v8.3.4#803005)
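The 2 GiB cap mentioned in the comment can be illustrated with a back-of-envelope calculation. This is a hedged sketch, not parquet-cpp code; the per-row-group metadata footprint of a few KiB is an assumed figure for illustration (the real size depends on the number of columns and their statistics):

```python
# Back-of-envelope sketch (not parquet-cpp code): why a file with more than
# a million row groups can push the serialized Thrift footer past the 2 GiB
# cap of a TMemoryBuffer. The ~3 KiB per-row-group figure is an assumption
# for illustration; the real size depends on column count and statistics.
THRIFT_MESSAGE_CAP = 2 * 1024**3  # a TMemoryBuffer is limited to 2 GiB

def estimated_footer_size(num_row_groups, bytes_per_row_group=3 * 1024):
    """Rough size of the serialized FileMetaData for a given row-group count."""
    return num_row_groups * bytes_per_row_group

# With >1 million row groups the footer alone can exceed the Thrift cap,
# so the writer cannot serialize the metadata at all.
print(estimated_footer_size(1_000_000) > THRIFT_MESSAGE_CAP)
```

Under these assumptions the only remedy on the writer side is fewer, larger row groups; the issue itself concludes the cap cannot be lifted, only reported more clearly.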
[jira] [Commented] (PARQUET-1345) [C++] It is possible to overflow a TMemoryBuffer when serializing the file metadata
[ https://issues.apache.org/jira/browse/PARQUET-1345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17205420#comment-17205420 ] Uwe Korn commented on PARQUET-1345: --- One of the reasons this can occur is when one has a pandas DataFrame with many categorical columns; the pandas metadata may then become very large. > [C++] It is possible to overflow a TMemoryBuffer when serializing the file > metadata > --- > > Key: PARQUET-1345 > URL: https://issues.apache.org/jira/browse/PARQUET-1345 > Project: Parquet > Issue Type: Bug > Components: parquet-cpp >Reporter: Wes McKinney >Priority: Major > > I'm not sure if this is fixable, but see issue reported to Arrow: > https://github.com/apache/arrow/issues/2077 -- This message was sent by Atlassian Jira (v8.3.4#803005)
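To illustrate the point about pandas metadata size, here is a stdlib-only sketch. The JSON layout below is hypothetical (`fake_pandas_metadata` and its fields are illustrative, not the real `pandas` schema blob that pyarrow writes into the key-value metadata); it only shows how a per-column payload makes an embedded JSON blob grow with the column count:

```python
import json

# Illustrative sketch (not the real pandas-metadata layout): a schema is
# embedded as a JSON blob in the file metadata, and each categorical column
# carries its own payload. With many such columns the blob grows linearly
# and can become very large.
def fake_pandas_metadata(num_columns, num_categories):
    columns = [
        {
            "name": f"cat_{i}",
            "pandas_type": "categorical",
            # hypothetical field: a large per-column payload
            "metadata": {"categories": [f"value_{j}" for j in range(num_categories)]},
        }
        for i in range(num_columns)
    ]
    return json.dumps({"columns": columns})

small = len(fake_pandas_metadata(10, 100))
large = len(fake_pandas_metadata(1000, 100))
print(small, large)  # the blob grows roughly linearly with the column count
```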
[jira] [Created] (PARQUET-1825) [C++] Fix compilation error in column_io_benchmark.cc
Uwe Korn created PARQUET-1825: - Summary: [C++] Fix compilation error in column_io_benchmark.cc Key: PARQUET-1825 URL: https://issues.apache.org/jira/browse/PARQUET-1825 Project: Parquet Issue Type: Improvement Components: parquet-cpp Reporter: Uwe Korn Assignee: Uwe Korn Leftover of [https://github.com/apache/arrow/pull/6690] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (PARQUET-1783) [C++] Parquet statistics wrong for dictionary type
[ https://issues.apache.org/jira/browse/PARQUET-1783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17029811#comment-17029811 ] Uwe Korn commented on PARQUET-1783: --- The problem is somewhere in the Parquet C++ code, as statistics are computed there. > [C++] Parquet statistics wrong for dictionary type > -- > > Key: PARQUET-1783 > URL: https://issues.apache.org/jira/browse/PARQUET-1783 > Project: Parquet > Issue Type: Bug > Components: parquet-cpp >Affects Versions: cpp-1.6.0 >Reporter: Florian Jetter >Priority: Major > > h3. Observed behaviour > Statistics for categorical data are equivalent for all row groups and refer > to the entire {{CategoricalDtype}} instead of the data included in the row > group. > h3. Expected behaviour > The row group statistics should only include data which is part of the actual > row group, not the entire {{CategoricalDtype}} > h3. Minimal example > {code:python} > import pandas as pd > import pyarrow as pa > import pyarrow.parquet as pq > test_df = pd.DataFrame({"categorical": pd.Categorical(["1", "42"])}) > table = pa.Table.from_pandas(test_df) > pq.write_table( > table, > "test_parquet", > chunk_size=1, > ) > test_parquet = pq.ParquetFile("test_parquet") > test_parquet.metadata.row_group(0).column(0).statistics > {code} > {code:java} > Out[1]: > > has_min_max: True > min: 1 > max: 42 > null_count: 0 > distinct_count: 0 > num_values: 1 > physical_type: BYTE_ARRAY > logical_type: String > converted_type (legacy): UTF8 > {code} > Expected would be > {{min:1}} {{max:1}} instead of {{max: 42}} for the first row group > > Tested with > pandas==1.0.0 > pyarrow==bd08d0ecbe355b9e0de7d07e8b9ff6ccdb150e73 (current master / > essentially 0.16.0) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Moved] (PARQUET-1783) [C++] Parquet statistics wrong for dictionary type
[ https://issues.apache.org/jira/browse/PARQUET-1783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Korn moved ARROW-7732 to PARQUET-1783: -- Component/s: (was: C++) parquet-cpp Key: PARQUET-1783 (was: ARROW-7732) Affects Version/s: (was: 0.15.1) (was: 0.16.0) cpp-1.6.0 Workflow: patch-available, re-open possible (was: jira) Project: Parquet (was: Apache Arrow) > [C++] Parquet statistics wrong for dictionary type > -- > > Key: PARQUET-1783 > URL: https://issues.apache.org/jira/browse/PARQUET-1783 > Project: Parquet > Issue Type: Bug > Components: parquet-cpp >Affects Versions: cpp-1.6.0 >Reporter: Florian Jetter >Priority: Major > > h3. Observed behaviour > Statistics for categorical data are equivalent for all row groups and refer > to the entire {{CategoricalDtype}} instead of the data included in the row > group. > h3. Expected behaviour > The row group statistics should only include data which is part of the actual > row group, not the entire {{CategoricalDtype}} > h3. Minimal example > {code:python} > import pandas as pd > import pyarrow as pa > import pyarrow.parquet as pq > test_df = pd.DataFrame({"categorical": pd.Categorical(["1", "42"])}) > table = pa.Table.from_pandas(test_df) > pq.write_table( > table, > "test_parquet", > chunk_size=1, > ) > test_parquet = pq.ParquetFile("test_parquet") > test_parquet.metadata.row_group(0).column(0).statistics > {code} > {code:java} > Out[1]: > > has_min_max: True > min: 1 > max: 42 > null_count: 0 > distinct_count: 0 > num_values: 1 > physical_type: BYTE_ARRAY > logical_type: String > converted_type (legacy): UTF8 > {code} > Expected would be > {{min:1}} {{max:1}} instead of {{max: 42}} for the first row group > > Tested with > pandas==1.0.0 > pyarrow==bd08d0ecbe355b9e0de7d07e8b9ff6ccdb150e73 (current master / > essentially 0.16.0) -- This message was sent by Atlassian Jira (v8.3.4#803005)
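The reported behaviour can be modelled in plain Python. This sketch is illustrative only (it is not the parquet-cpp statistics code): it contrasts statistics taken over the whole dictionary, which come out identical for every row group, with statistics taken over the values each row group actually references:

```python
# Plain-Python model of the reported bug (not the actual parquet-cpp code).
# A dictionary-encoded column stores a dictionary of distinct values plus,
# per row group, the indices into that dictionary.
dictionary = ["1", "42"]
row_groups = [[0], [1]]  # chunk_size=1 in the report: one value per row group

# Observed behaviour: statistics derived from the whole dictionary are the
# same for every row group, regardless of which values it contains.
buggy_stats = [(min(dictionary), max(dictionary)) for _ in row_groups]

# Expected behaviour: statistics over the values the row group references.
correct_stats = [
    (min(dictionary[i] for i in rg), max(dictionary[i] for i in rg))
    for rg in row_groups
]

print(buggy_stats)    # [('1', '42'), ('1', '42')] -- both row groups alike
print(correct_stats)  # [('1', '1'), ('42', '42')]
```

Note that min/max here are lexicographic string comparisons, matching the BYTE_ARRAY/UTF8 column in the report, which is why `max` is `42` rather than a numeric maximum.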
[jira] [Created] (PARQUET-1779) format: Update merge script
Uwe Korn created PARQUET-1779: - Summary: format: Update merge script Key: PARQUET-1779 URL: https://issues.apache.org/jira/browse/PARQUET-1779 Project: Parquet Issue Type: Improvement Components: parquet-format Reporter: Uwe Korn Assignee: Uwe Korn Fix For: format-2.8.0 The current merge script is not compatible with Python 3; copy over the merge script from the Arrow project, a development that originally started from merge_parquet.py. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (PARQUET-1777) add Parquet logo vector files to repo
[ https://issues.apache.org/jira/browse/PARQUET-1777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Korn resolved PARQUET-1777. --- Fix Version/s: format-2.8.0 Resolution: Fixed Issue resolved by pull request 157 [https://github.com/apache/parquet-format/pull/157] > add Parquet logo vector files to repo > - > > Key: PARQUET-1777 > URL: https://issues.apache.org/jira/browse/PARQUET-1777 > Project: Parquet > Issue Type: Task > Components: parquet-format >Reporter: Julien Le Dem >Assignee: Julien Le Dem >Priority: Major > Labels: pull-request-available > Fix For: format-2.8.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (PARQUET-1689) [C++] Stream API: Allow for columns/rows to be skipped when reading
[ https://issues.apache.org/jira/browse/PARQUET-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Korn resolved PARQUET-1689. --- Fix Version/s: cpp-1.6.0 Resolution: Fixed Issue resolved by pull request 5797 [https://github.com/apache/arrow/pull/5797] > [C++] Stream API: Allow for columns/rows to be skipped when reading > --- > > Key: PARQUET-1689 > URL: https://issues.apache.org/jira/browse/PARQUET-1689 > Project: Parquet > Issue Type: Improvement > Components: parquet-cpp >Reporter: Gawain BOLTON >Assignee: Gawain BOLTON >Priority: Major > Labels: pull-request-available > Fix For: cpp-1.6.0 > > Time Spent: 5h 40m > Remaining Estimate: 0h > > It can be useful to be able to skip rows and/or columns when reading data. > The ColumnReader class already allows for data to be skipped. > This new StreamReader class could use this functionality to allow for users > to skip columns and rows when using the StreamReader API. > I will propose this functionality by submitting a PR. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (PARQUET-1686) Automate site generation
[ https://issues.apache.org/jira/browse/PARQUET-1686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16963045#comment-16963045 ] Uwe Korn commented on PARQUET-1686: --- In Arrow we are using Jekyll with Github Actions to automatically deploy our site: [https://github.com/apache/arrow-site/blob/master/.github/workflows/deploy.yml] > Automate site generation > > > Key: PARQUET-1686 > URL: https://issues.apache.org/jira/browse/PARQUET-1686 > Project: Parquet > Issue Type: Improvement > Components: parquet-site >Reporter: Gabor Szadovszky >Priority: Major > Labels: documentation > > We moved our site source to [github|https://github.com/apache/parquet-site]. > It is much better than svn but still not working as it should. Currently, we > have to generate the site manually before checking in. It would be much > better if the site generation would be automatic so we can simply accept PRs > on the source files. > One option to achieve this is the [Pelican CMS > System|https://blog.getpelican.com/] as described at [.asf.yaml features for > git > repositories|https://cwiki.apache.org/confluence/display/INFRA/.asf.yaml+features+for+git+repositories#id-.asf.yamlfeaturesforgitrepositories-StaticwebsitecontentgenerationviaPelicanCMS]. > Not sure if this is the best solution though. Another solution might be to > trigger a jenkins build for the changes on master and after generating the > site with middleman commit the files to the branch asf-site. -- This message was sent by Atlassian Jira (v8.3.4#803005)
Re: Preparing for parquet-cpp 0.1
We already have https://issues.apache.org/jira/browse/PARQUET-713, closed as duplicate ;) Especially the dev scripts seem to originate from somewhere else? Is there something we have to take care of because of parquet-cpp's origin? Also I made a PR to run RAT in the CI to check the licenses: https://github.com/apache/parquet-cpp/pull/189 Runs nicely but we still have to deal with the things Ryan mentioned. On 08.11.16 19:23, Julien Le Dem wrote: I created a jira for the release: https://issues.apache.org/jira/browse/PARQUET-774 please add blockers to that jira if they need to be in the release. On Tue, Nov 8, 2016 at 10:07 AM, Ryan Blue wrote: Do you guys intend to release convenience binaries in addition to the initial source release? If so, I think you'll have to include a license/notice that includes the third party dependencies. Also, license should be used to record third-party licensed works that are included in the source distribution. The bit packing code should be in there, rather than in notice. Notice is for required third-party notices and isn't the file where third-party licensing information should be accumulated. rb On Tue, Nov 8, 2016 at 10:00 AM, Wes McKinney wrote: I think we are ready to make a release once PARQUET-702 is merged. Is there any more licensing / NOTICE review work to do? On Fri, Nov 4, 2016 at 10:29 AM, Deepak Majeti wrote: I would like to get PARQUET-764 and PARQUET-702 into the release as well. Both of them belong to me. I plan to finish PARQUET-702 by Monday. If someone can take over PARQUET-764, it will be easier. On Fri, Nov 4, 2016 at 3:04 AM, Uwe Korn wrote: Hello, given that we have reached a point parquet-cpp is working quite nicely and a minimal set of features is implemented, I would like to continue to make a release in the next days. I would wait for PARQUET-726 [1] to be merged and then setup the release scripts and ask for a vote. Is there anything else someone wants to get in before the initial release? 
Uwe [1] https://github.com/apache/parquet-cpp/pull/184 -- regards, Deepak Majeti -- Ryan Blue Software Engineer Netflix
Preparing for parquet-cpp 0.1
Hello, given that we have reached a point parquet-cpp is working quite nicely and a minimal set of features is implemented, I would like to continue to make a release in the next days. I would wait for PARQUET-726 [1] to be merged and then setup the release scripts and ask for a vote. Is there anything else someone wants to get in before the initial release? Uwe [1] https://github.com/apache/parquet-cpp/pull/184
Re: [VOTE] Release Apache Parquet 1.9.0 RC1
Hello Ryan, sadly I have failing tests with the RC. Seems like they are locale-dependent ("," vs "."). Rerunning with LANG=en_US.UTF-8 sadly did not solve this; is there some other magic I need to provide to switch JVM locales? % cat parquet-column/target/surefire-reports/org.apache.parquet.column.statistics.TestStatistics.txt --- Test set: org.apache.parquet.column.statistics.TestStatistics --- Tests run: 9, Failures: 2, Errors: 0, Skipped: 0, Time elapsed: 0.024 sec <<< FAILURE! testFloatMinMax(org.apache.parquet.column.statistics.TestStatistics) Time elapsed: 0.01 sec <<< FAILURE! org.junit.ComparisonFailure: expected:num_nulls: 0> but was: at org.junit.Assert.assertEquals(Assert.java:125) at org.junit.Assert.assertEquals(Assert.java:147) at org.apache.parquet.column.statistics.TestStatistics.testFloatMinMax(TestStatistics.java:235) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20) at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:263) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:68) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:47) at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231) at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60) at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229) at org.junit.runners.ParentRunner.access$000(ParentRunner.java:50) at 
org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:222) at org.junit.runners.ParentRunner.run(ParentRunner.java:300) at org.apache.maven.surefire.junit4.JUnit4TestSet.execute(JUnit4TestSet.java:53) at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:123) at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:104) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.maven.surefire.util.ReflectionUtils.invokeMethodWithArray(ReflectionUtils.java:164) at org.apache.maven.surefire.booter.ProviderFactory$ProviderProxy.invoke(ProviderFactory.java:110) at org.apache.maven.surefire.booter.SurefireStarter.invokeProvider(SurefireStarter.java:175) at org.apache.maven.surefire.booter.SurefireStarter.runSuitesInProcessWhenForked(SurefireStarter.java:107) at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:68) testDoubleMinMax(org.apache.parquet.column.statistics.TestStatistics) Time elapsed: 0 sec <<< FAILURE! 
org.junit.ComparisonFailure: expected:num_nulls: 0> but was: at org.junit.Assert.assertEquals(Assert.java:125) at org.junit.Assert.assertEquals(Assert.java:147) at org.apache.parquet.column.statistics.TestStatistics.testDoubleMinMax(TestStatistics.java:296) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20) at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:263) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:68) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:47) at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231) at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60) at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229) at org.junit.runners.ParentRunner.access$000(P
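The "," vs "." failures above are typical of locale-dependent formatting. Here is a small Python sketch of the same failure shape (the values are hypothetical; the real failing assertions compare rendered statistics strings in Java):

```python
import locale

# Sketch of the "," vs "." failure shape (hypothetical values; the real
# test compares rendered statistics strings in Java). A locale-aware
# conversion uses the locale's decimal separator, so comparing its output
# against a "."-formatted literal breaks under a comma-decimal locale such
# as de_DE, while repr() is locale-independent.
locale.setlocale(locale.LC_NUMERIC, "C")  # the "C" locale uses "." as decimal point

value = 1.5
assert locale.str(value) == "1.5"  # holds only because LC_NUMERIC is "C";
                                   # under de_DE this would render as "1,5"
assert repr(value) == "1.5"        # stable regardless of locale
```

The analogous fix on the JVM side is to pin the test locale (e.g. `-Duser.language=en -Duser.country=US`) rather than rely on the environment's `LANG`.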
Re: [Draft report] Apache Parquet
+1 On 13.10.16 02:43, Julien Le Dem wrote: Report from the Apache Parquet committee [Julien Le Dem] ## Description: Parquet is a standard and interoperable columnar file format for efficient analytics. ## Issues: there are no issues requiring board attention at this time ## Activity: The community has been converging toward a 1.9 release. The vote will start in the coming days. Discussions about better encoding and vectorization APIs are ongoing. The parquet-cpp repo has reached a stable state and should release soon. Integration with arrow-cpp is now in the parquet-cpp repo. ## Health report: The PMC and committer list are growing. Discussion is happening on the mailing list, JIRA and regular hangout sync ups. Notes are sent to the mailing list. ## PMC changes: - Currently 22 PMC members. - Wes McKinney was added to the PMC on Thu Sep 01 2016 ## Committer base changes: - Currently 25 committers. - Uwe Korn was added as a committer on Sun Sep 04 2016 ## Releases: - Last release was Format 2.3.1 on Thu Dec 17 2015 ## Mailing list activity: - Activity on the mailing list is still relatively unchanged - JIRAs are resolved at about the same pace they are opened. - dev@parquet.apache.org: - 172 subscribers (up 9 in the last 3 months): - 486 emails sent to list (394 in previous quarter) ## JIRA activity: - 85 JIRA tickets created in the last 3 months - 74 JIRA tickets closed/resolved in the last 3 months
Re: Python Parquet package
Sounds reasonable to me. I will then continue to implement the missing interfaces for Parquet in pyarrow.parquet. @wesm Can you take care that we easily depend on a pinned version of parquet-cpp in pyarrow's travis builds? Uwe > On 21.09.2016 at 20:07, Wes McKinney wrote: > > I don't agree with this approach right now. Here are my reasons: > > 1. The Parquet Python integration will need to depend both on PyArrow > and the Arrow C++ libraries, so these libraries would generally need > to be developed together > > 2. PyArrow would need to define and maintain a C++ or Cython API so > that the equivalent of the current pyarrow.parquet library can access > C-level data. For example: > > https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.pyx#L31 > > Cython does permit cross-project C API access (we are already doing > cross-module Cython API access within pyarrow). This adds additional > complexity that I think we should avoid for now. > > 3. Maintaining a separate C++ build toolchain for a Python package > adds additional maintenance and packaging burden on us > > My inclination is to keep the code where it is and make the Parquet > extension optional. > > - Wes > > On Wed, Sep 21, 2016 at 10:16 AM, Uwe Korn wrote: >> Hello, >> >> as we have moved the Arrow<->Parquet C++ integration into parquet-cpp, we >> still have to decide on how we are going to proceed with the Arrow<->Parquet >> Python integration. For the moment, it seems that the best way to go ahead >> is to pull the pyarrow.parquet module out into a separate Python package. >> From an organisational point, I'm unclear how I should proceed here. Should >> we put this in a separate repo? If so, as part of the Apache organisation? >> >> Uwe
Python Parquet package
Hello, as we have moved the Arrow<->Parquet C++ integration into parquet-cpp, we still have to decide on how we are going to proceed with the Arrow<->Parquet Python integration. For the moment, it seems that the best way to go ahead is to pull the pyarrow.parquet module out into a separate Python package. From an organisational point, I'm unclear how I should proceed here. Should we put this in a separate repo? If so, as part of the Apache organisation? Uwe
Re: Cannot load Parquet files created with parquet-cpp in Drill
Happy to report back that this is really a parquet-cpp issue and not something in Drill. Kudos to Deepak Majeti for finding that we did not set the dictionary_page_offset in the C++ code. Uwe On 07.09.16 21:08, Kunal Khatua wrote: Hi Uwe I believe you're using the latest Apache Drill 1.8.0. From a quick look at the stack trace, it appears to be a potential bug in Drill's interpretation of dictionary-encoded data. One way to verify that your C++ implementation of Parquet is correct would be to generate your data without dictionary encoding before attempting to see if Drill can read that. Regards Kunal On Wed 7-Sep-2016 5:30:32 AM, Uwe Korn wrote: Hello, I'm currently looking at the correctness of our C++ implementation of Parquet and noticed that I cannot load these files in Drill. Although this is probably a bug in the C++ implementation, I don't understand what causes the error. Using the Java parquet-tools, I can read these files. I'm using Apache Drill 1.8.0 on OSX. I've posted the error output from Drill and the parquet file as a gist: https://gist.github.com/xhochy/d4441a5ff2025b877df43fecd4466a11 If anyone could have a short look into this and tell me why Drill cannot read the file, you would really help me to fix the parquet-cpp issues. Kind Regards, Uwe
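A minimal sketch of why an unset `dictionary_page_offset` can trip a reader. The reader logic below is a guess at the general failure mode, not Drill's actual code; the field names follow the Parquet `ColumnMetaData` Thrift struct, where an unset optional i64 commonly surfaces as 0:

```python
# Minimal sketch of the failure mode (illustrative, not Drill's reader code).
# ColumnMetaData carries an optional dictionary_page_offset; if the writer
# never sets it, a reader may see 0 and conclude there is no dictionary page.
def first_page_offset(column_meta):
    """Byte offset at which a reader starts consuming this column chunk."""
    dict_off = column_meta.get("dictionary_page_offset", 0)
    if dict_off:  # a dictionary page precedes the data pages
        return dict_off
    return column_meta["data_page_offset"]

# A writer that sets the field vs. one that left it at its default of 0
# (offsets are made-up example values).
correct_meta = {"dictionary_page_offset": 4, "data_page_offset": 120}
buggy_meta = {"dictionary_page_offset": 0, "data_page_offset": 120}

print(first_page_offset(correct_meta))  # 4: the dictionary page is read
print(first_page_offset(buggy_meta))    # 120: the dictionary page is skipped,
                                        # so dictionary-encoded pages cannot
                                        # be decoded
```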
Cannot load Parquet files created with parquet-cpp in Drill
Hello, I'm currently looking at the correctness of our C++ implementation of Parquet and noticed that I cannot load these files in Drill. Although this is probably a bug in the C++ implementation, I don't understand what causes the error. Using the Java parquet-tools, I can read these files. I'm using Apache Drill 1.8.0 on OSX. I've posted the error output from Drill and the parquet file as a gist: https://gist.github.com/xhochy/d4441a5ff2025b877df43fecd4466a11 If anyone could have a short look into this and tell me why Drill cannot read the file, you would really help me to fix the parquet-cpp issues. Kind Regards, Uwe
Re: Arrow-Parquet integration location (Was: Arrow cpp travis-ci build broken)
Hello, I'm also in favour of switching the dependency direction between Parquet and Arrow as this would avoid a lot of duplicate code in both projects as well as parquet-cpp profiting from functionality that is available in Arrow. @wesm: go ahead with the JIRAs and I'll add comments or will pick some of them up. Cheers Uwe On 07.09.16 04:41, Wes McKinney wrote: hi Julien, It makes sense to move the Parquet support for Arrow into Parquet itself and invert the dependency. I had thought that the coupling to Arrow C++'s IO subsystem might be tighter, but the connection between memory allocators and file abstractions is fairly simple: https://github.com/apache/arrow/blob/master/cpp/src/arrow/parquet/io.h I'll open appropriate JIRAs and Uwe and I can coordinate on the refactoring. The exposure of the Parquet functionality in Python should stay inside Arrow for now, but mainly because it would make developing the Python side of things much more difficult if we split things up right now. - Wes On Tue, Sep 6, 2016 at 8:27 PM, Brian Bowman wrote: Forgive me if interposing my first post for the Apache Arrow project on this thread is incorrect procedure. What Julien proposes with each storage layer producing Arrow Record Batches is exactly how I envision it working and would certainly make Arrow integration with SAS much more palatable. This is likely true for other storage layer providers as well. Brian Bowman (SAS) On Sep 6, 2016, at 7:52 PM, Julien Le Dem wrote: Thanks Wes, No worries, I know you are on top of those things. On a side note, I was wondering if the arrow-parquet integration should be in Parquet instead. Parquet would depend on Arrow and not the other way around. Arrow provides the API and each storage layer (Parquet, Kudu, Cassandra, ...) provides a way to produce Arrow Record Batches. thoughts? On Tue, Sep 6, 2016 at 3:37 PM, Wes McKinney wrote: hi Julien, I'm very sorry about the inconvenience with this and the delay in getting it sorted out. 
I will triage this evening by disabling the Parquet tests in Arrow until we get the current problems under control. When we re-enable the Parquet tests in Travis CI I agree we should pin the version SHA. - Wes On Tue, Sep 6, 2016 at 5:30 PM, Julien Le Dem wrote: The Arrow cpp travis-ci build is broken right now because it depends on parquet-cpp which has changed in an incompatible way. [1] [2] (or so it looks to me) Since parquet-cpp is not released yet it is totally fine to make incompatible API changes. However, we may want to pin the Arrow to Parquet dependency (on a git sha?) to prevent cross project changes from breaking the master build. Since I'm not one of the core cpp dev on those projects I mainly want to start that conversation rather than prescribe a solution. Feel free to take this as a straw man and suggest something else. [1] https://travis-ci.org/apache/arrow/jobs/156080555 [2] https://github.com/apache/arrow/blob/2d8ec789365f3c0f82b1f22d76160d 5af150dd31/ci/travis_before_script_cpp.sh -- Julien -- Julien
Re: Reviving Parquet sync ups
+1 for a sync up and for the European friendly time. Should be able to join this time. On 01.09.16 08:02, Julien Le Dem wrote: Hi Piyush, You are totally right. Sync ups are an important part of keeping the community informed and making progress. I'll schedule one for next week. Thursday 10 am PT? Julien On Aug 31, 2016, at 18:54, Piyush Narang wrote: hi folks, A few months back we used have Parquet community sync ups via hangouts which were a nice opportunity to chat with other Parquet developers and discuss major / minor agenda items (e.g. 1.9.0 release / Parquet 2.0 etc) and things folks were working on. As it has been a while since the last sync up, I was wondering if there would there be interest in reviving this? Thanks, -- - Piyush
Re: Parquet Vectorized Read hackathon
Yes, I'm GMT +1 On 05.07.16 18:52, Julien Le Dem wrote: If there are people interested in the cpp implementation we’ll talk about that too. I’m happy to give context or help with the encoding. In particular a Parquet -> Arrow vectorized converter would be great. Are you GMT +1 ? We can schedule a 1 hour slot in the morning for discussing with remote folks in Europe. (same in afternoon if there are people joining from Asia) Julien On Jul 5, 2016, at 2:37 AM, Uwe Korn wrote: Hello, this effort is only for the parquet-mr project or would there also be some work/benefit for parquet-cpp? If so, I might join briefly in a hangout but due to the timezone shift, I probably will not be able to be awake all the time. Uwe On 02.07.16 01:01, Julien Le Dem wrote: Dear Parquet dev list, There have been efforts in several projects for vectorized reads of Parquet. We had discussed during the Parquet sync up to organize a hackathon to brainstorm and look into a shared implementation. Some projects that would benefit: - Apache Drill - Apache Arrow - Apache Spark - Presto - Apache Hive I'm planning to organize this at the Dremio office in Mountain View with optionally a hangout for people who would want to join remotely. I'm adding to the "to:" people that have expressed interest or could be interested but that's not an exhaustive list. Please respond to this email if you wish to be included. Who's interested and what dates would work between this Tuesday 7/5 and Wednesday 7/20 ?
Re: Parquet Vectorized Read hackathon
7/12 and 7/14 are both ok for me. I'm mainly interested in the parquet-cpp -> Arrow C++ -> PyArrow path for now. Encodings other than plain encoding are currently on my near-future roadmap. On 05.07.16 19:00, Julien Le Dem wrote: 7/14 works better for me. For now we have for 7/14: - OK for 7/14: Jacques, Ryan, Julien - Please confirm the date (and time): Deepak, Cheng, Uwe Please send a short description of the projects you’re working on and what your particular interest is. On Jul 5, 2016, at 9:50 AM, Ryan Blue wrote: I'm in, and both 7/12 and 7/14 work for me. rb On Tue, Jul 5, 2016 at 9:15 AM, Jacques Nadeau wrote: Great idea, Julien! I vote for 7/12 or 7/14 On Tue, Jul 5, 2016 at 2:37 AM, Uwe Korn wrote: Hello, this effort is only for the parquet-mr project or would there also be some work/benefit for parquet-cpp? If so, I might join briefly in a hangout but due to the timezone shift, I probably will not be able to be awake all the time. Uwe On 02.07.16 01:01, Julien Le Dem wrote: Dear Parquet dev list, There have been efforts in several projects for vectorized reads of Parquet. We had discussed during the Parquet sync up to organize a hackathon to brainstorm and look into a shared implementation. Some projects that would benefit: - Apache Drill - Apache Arrow - Apache Spark - Presto - Apache Hive I'm planning to organize this at the Dremio office in Mountain View with optionally a hangout for people who would want to join remotely. I'm adding to the "to:" people that have expressed interest or could be interested but that's not an exhaustive list. Please respond to this email if you wish to be included. Who's interested and what dates would work between this Tuesday 7/5 and Wednesday 7/20 ? -- Ryan Blue Software Engineer Netflix
Re: Parquet Vectorized Read hackathon
Hello, this effort is only for the parquet-mr project or would there also be some work/benefit for parquet-cpp? If so, I might join briefly in a hangout but due to the timezone shift, I probably will not be able to be awake all the time. Uwe On 02.07.16 01:01, Julien Le Dem wrote: Dear Parquet dev list, There have been efforts in several projects for vectorized reads of Parquet. We had discussed during the Parquet sync up to organize a hackathon to brainstorm and look into a shared implementation. Some projects that would benefit: - Apache Drill - Apache Arrow - Apache Spark - Presto - Apache Hive I'm planning to organize this at the Dremio office in Mountain View with optionally a hangout for people who would want to join remotely. I'm adding to the "to:" people that have expressed interest or could be interested but that's not an exhaustive list. Please respond to this email if you wish to be included. Who's interested and what dates would work between this Tuesday 7/5 and Wednesday 7/20 ?
List of Additions to Parquet 2
Hello, I'm currently looking at the differences between Parquet 1 and Parquet 2 to implement these versions as a switch in parquet-cpp. The only list I could find is the rather undetailed changelog [1]. Is there maybe some better list or do I need to go through the referenced changeset entries myself to find the actual differences? (If the latter is the case, I'd also make a PR afterwards that augments the documentation with some "(since version 2.0)" markings.) But I'm hoping a bit that there is some blog post or so out there that could make my life easier. Thanks, Uwe [1] https://github.com/apache/parquet-format/blob/master/CHANGES.md
Re: Parquet sync up
Can you add me with xho...@gmail.com to the sync Google calendar so I get notified? Cheers Uwe On 16.05.16 18:20, Julien Le Dem wrote: Wes: I maintain a Google calendar invite. People can send me their email address to be notified of the sync. Otherwise I send reminders on the dev list, but it looks like last time I missed sending an earlier reminder. Cheng: On the Parquet side for vectorization, you can always bypass the assembly and access the column readers directly. Nezih/Ryan/Dan have some work done around this with Presto. Other projects like Drill or Spark have a custom reader based on the column readers. We're discussing making a shared implementation in Parquet itself. On Mon, May 16, 2016 at 12:32 AM, Xu, Cheng A wrote: Hi, it looks like the vectorization work is still ongoing, and I'd like to support Hive vectorization for Parquet. Is there an early version of Parquet with the vectorization feature ready that I could use to continue the work on the Hive side? Thank you in advance. -Original Message- From: Julien Le Dem [mailto:jul...@dremio.com] Sent: Friday, May 13, 2016 8:34 AM To: dev@parquet.apache.org Subject: Re: Parquet sync up The next sync up will be around Strata London in early June, where I'll happen to be. We will do it in the morning Pacific time, evening Europe time. Notes from this sync: attendees: - Julien (Dremio) - Alex, Piyush (Twitter) - Ryan (Netflix) Parquet 2.0 encodings discussion: - Jira open to finalize encodings: PARQUET-588: 2.0 encodings finalization. - Ryan is doing experiments to measure efficiency on their data - Alex and Piyush are looking at encoding selection strategies: how to pick the best encoding for the data automatically 1.9 release: - last blocker: PARQUET-400 (readFully() behavior) needs an update from Jason. Possibly Piyush could pick it up if Jason is busy Brotli integration:
- Ryan has been working on Brotli compression algorithm integration - for a similar compression cost as Snappy, much better compression ratio - embeds a native library, similar to the Snappy integration - looking into possibly statically linking the native library - PR available on parquet-format and parquet-mr Vectorized read: - towards the end of June we will organize a Parquet vectorized read hackathon for all interested parties (make yourself known if interested, we'll send more details later; remote participation through a hangout is possible) Lazy projections at runtime: - Alex has been looking into lazy Thrift objects for parquet-thrift to minimize assembly cost in existing Scalding jobs that don't declare the columns they need. Next sync will be in the morning PT. On Thu, May 12, 2016 at 5:42 AM, Deepak Majeti wrote: I am sorry for missing this meeting as well. My interest is also to improve parquet-cpp reader/writer performance. I will work with Uwe and Wes on this. My other interest is in supporting predicate pushdown. I will work on this in parallel with performance. Thanks! On Thu, May 12, 2016 at 4:05 AM, Uwe Korn wrote: I'm sorry I wasn't able to join today again (traveling). We could choose an early Pacific time to make the meeting accessible to both Asia and Europe -- I would suggest 8 or 9 AM Pacific. 8 or 9 am PT would work for me (CEST), 4pm PT is just not manageable. Also: do we have a calendar where I can see in advance when sync ups are? Currently I'm working on the Parquet integration with Arrow and on building a Python interface for libarrow-parquet. Once we have a basic working version, I will look into implementing missing features in the writer and improving general read/write performance in parquet-cpp.
Uwe http://timesched.pocoo.org/?date=2016-05-11&tz=pacific-standard-time!,de:berlin,cn:shanghai,us:new-york-city:ny I did not have much time for Parquet C++ development over the last 6 weeks, but plan to help Uwe complete the writer implementation and work toward a more complete Apache Arrow integration (this is in progress here: https://github.com/apache/arrow/tree/master/cpp/src/arrow/parquet) Other items of immediate interest: - C++ API to the file metadata (read + write) - Conda packaging for built artifacts (to make parquet-cpp easier for Python programmers to install portably when the time comes). I got Thrift C++ into conda-forge this week so this should not be hard now https://github.com/conda-forge/thrift-cpp-feedstock - Expanding column scan benchmarks (thanks Uwe for kickstarting the benchmarking effort!) - Perf improvements for the RLE decoder Thanks Wes On Wed, May 11, 2016 at 4:04 PM, Julien Le Dem wrote: The actual hangout url is https://hangouts.google.com/hangouts/_/dremio.com/parquet-sync-up On Wed, May 11, 2016 at 3:57 PM, Julien Le Dem wrote: starting in 5 mins: https://plus.google.com/hangouts/_/event/parquet_sync_up On Wed, May 11, 2016 at 1:53 PM, Julien Le Dem wrote: It is happening at 4pm PT on google hangout https://plus.goog
Re: Parquet sync up
I'm sorry I wasn't able to join today again (traveling). We could choose an early Pacific time to make the meeting accessible to both Asia and Europe -- I would suggest 8 or 9 AM Pacific. 8 or 9 am PT would work for me (CEST), 4pm PT is just not manageable. Also: do we have a calendar where I can see in advance when sync ups are? Currently I'm working on the Parquet integration with Arrow and on building a Python interface for libarrow-parquet. Once we have a basic working version, I will look into implementing missing features in the writer and improving general read/write performance in parquet-cpp. Uwe http://timesched.pocoo.org/?date=2016-05-11&tz=pacific-standard-time!,de:berlin,cn:shanghai,us:new-york-city:ny I did not have much time for Parquet C++ development over the last 6 weeks, but plan to help Uwe complete the writer implementation and work toward a more complete Apache Arrow integration (this is in progress here: https://github.com/apache/arrow/tree/master/cpp/src/arrow/parquet) Other items of immediate interest: - C++ API to the file metadata (read + write) - Conda packaging for built artifacts (to make parquet-cpp easier for Python programmers to install portably when the time comes). I got Thrift C++ into conda-forge this week so this should not be hard now https://github.com/conda-forge/thrift-cpp-feedstock - Expanding column scan benchmarks (thanks Uwe for kickstarting the benchmarking effort!) - Perf improvements for the RLE decoder Thanks Wes On Wed, May 11, 2016 at 4:04 PM, Julien Le Dem wrote: The actual hangout url is https://hangouts.google.com/hangouts/_/dremio.com/parquet-sync-up On Wed, May 11, 2016 at 3:57 PM, Julien Le Dem wrote: starting in 5 mins: https://plus.google.com/hangouts/_/event/parquet_sync_up On Wed, May 11, 2016 at 1:53 PM, Julien Le Dem wrote: It is happening at 4pm PT on google hangout https://plus.google.com/hangouts/_/event/parquet_sync_up (we can do a different time next time, based on timezone preferences.
Afternoon is better for Asia. Morning is better for Europe) -- Julien
Re: Parquet sync up
Hello, due to me being in Europe, this is a very inconvenient time. Thus I'd rather write a longer mail instead of joining. As a bit of input, here is what I'm up to at the moment:
* Write support in a basic form for parquet-cpp (no compression, fixed encodings, excessive memory usage, ...) is nearly done. I hope to open the final PR for discussion next week.
* Remaining tasks until I make the PR:
  * a bit of code cleanup
  * going through the API again to make it consistent
  * metadata for RowGroups and ColumnChunks
Afterwards I would look into one of the following tasks w.r.t. parquet-cpp:
* WriterProperties to specify compression, encoding, etc. on a global and per-column basis
* performance benchmarks for write
* integration of Parquet support in Apache Arrow to use it with Python
* reducing the memory usage of the initial Writer implementation (for this we would probably need to extend the encoders a bit)
If anyone else is also looking into this, I'm happy to collaborate ;) Cheers Uwe On 21.04.16 00:51, Julien Le Dem wrote: It is happening at 4pm PT on google hangout https://plus.google.com/hangouts/_/event/parquet_sync_up
C++: API Documentation Style/Tool
Hello, I would like to start adding API documentation comments to the parquet-cpp code I'm currently working on. By default, I would use doxygen and doxygen-style comments for the API. Are there any other suggestions/best practices you would prefer? Greetings Uwe
Retrieving the full/expanded name of a column in parquet-cpp
Hello, while using parquet-cpp, I'm trying to figure out how to reliably determine at which index a named/nested column sits. In my example, I have a nested column "neighbours.array" but may also add more columns named "??.array" at a later point. Until now I used "column->descr()->name()" inside a loop over all columns in a RowGroup to determine if the current column is the one I want to read. This works fine for "top-level" columns, but for neighbours.array this only returns "array", the name of the primitive node in the schema description. To solve my problem:
1. Do we already have a reliable way to determine which column index "neighbours.array" is?
2. We could add a fullname (or differently named) function to the column description.
3. We could have a map on Reader or RowGroup level that maps the expanded name to an index.
If there is no solution yet, I'd be happy to implement 2 or 3 (or an alternative approach). My schema is as follows (generated via ParquetAvroWriter):

required group com.xhochy.AdjacencyArray {
  required int32 id
  required int32 degree
  required group neighbours {
    repeated int32 array
  }
}

Greetings, Uwe
Re: Parquet-cpp dependency on C++11
Hello, Ubuntu 12.04 (i.e., its default GCC 4.6) has C++11 support; it is only partial, but it covers the most common features. It is named C++0x there, as the standard had not been finalized at the date of the GCC 4.6 release. A good overview is https://gcc.gnu.org/projects/cxx0x.html and the linked status subpages. Cheers, Uwe On 07.03.16 18:24, Ryan Blue wrote: From some quick searching, it looks like C++11 is supported on Ubuntu 14.04 LTS but not on 12.04 LTS. Considering that 14.04 is already nearly 2 years old (and 16.04 comes out soon), I think it is fairly reasonable to depend on C++11 even though 12.04 still has another 2 years of life. Everyone has had 2 years to update to the current LTS. I only looked into Ubuntu, but I'm guessing that this is about the same for Red Hat or CentOS. I think we should stay with C++11 and expect anyone on the old releases to install newer C++ libs if they want to use Parquet-CPP, unless there's some reason I'm missing why this is a more wide-spread problem than it looks like. rb On Mon, Mar 7, 2016 at 9:02 AM, Wes McKinney wrote: hello, responses inline On Mon, Mar 7, 2016 at 8:22 AM, Aliaksei Sandryhaila wrote: Hi Wes and Julien, At this point, parquet-cpp is heavily reliant on C++11 features and semantics. Believe it or not :), there are plenty of companies still running older versions of Linux that do not support C++11. Removing this dependency will make parquet-cpp usable (and much more appealing) to them. Just to be clear -- is this a problem for you specifically? Any other context would be helpful. It is not especially difficult to set up a portable C++11 build toolchain even on Linux distributions that do not have a new enough gcc in their package repository. Both Impala and Kudu have recently developed isolated 3rd-party toolchains to facilitate development and packaging for these systems. See for example https://github.com/cloudera/native-toolchain We would like to make parquet-cpp C++03 compatible.
The end goal is to have a library that can compile with and without the -std=c++11 flag. There are two parts to this process. The first one is to redefine or remove C++11 keywords and constructs, such as auto, unique_ptr, std::move, or range-based for loops. The other part is to evaluate our use of C++11 features that are harder to replace, such as shared_ptr, make_shared(), etc., and either write our own implementation for them or modify code where appropriate (such as replacing shared_ptr with unique_ptr where possible). We can do this either by maintaining a separate feature branch and periodically pulling new code from parquet-cpp, or by implementing the compatibility functionality directly in parquet-cpp (all future PRs will be tested for C++03 compatibility during CI builds). I'm fairly negative on dropping C++11 in trunk / main library development -- it would be a hardship for me personally, and additionally deter software engineers who are increasingly coming back to C++ development because of C++11/14. This leaves legacy C++<11 projects that wish to use parquet-cpp as a 3rd-party dependency somewhat out in the cold. One approach is to provide a wrapper API for projects that cannot interact with APIs that use C++11 facilities (like std::unique_ptr). The same approach could be used to provide a C API for the project. A wrapper API would be much easier to maintain and test without having a separate branch to keep in sync -- there might be some pitfalls here that I'm not aware of, so let me know what you think. Thanks, Wes What are your thoughts on this? Thank you, Aliaksei.