[jira] [Commented] (PARQUET-1345) [C++] It is possible to overflow a TMemoryBuffer when serializing the file metadata
[ https://issues.apache.org/jira/browse/PARQUET-1345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17208158#comment-17208158 ] Uwe Korn commented on PARQUET-1345: --- Turns out this was not due to many categorical columns but due to a huge number (>1 million) of RowGroups. We cannot fix this, as Thrift messages are capped at 2GiB, but we could probably raise a better error message. > [C++] It is possible to overflow a TMemoryBuffer when serializing the file > metadata > --- > > Key: PARQUET-1345 > URL: https://issues.apache.org/jira/browse/PARQUET-1345 > Project: Parquet > Issue Type: Bug > Components: parquet-cpp >Reporter: Wes McKinney >Priority: Major > > I'm not sure if this is fixable, but see issue reported to Arrow: > https://github.com/apache/arrow/issues/2077 -- This message was sent by Atlassian Jira (v8.3.4#803005)
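The 2 GiB cap mentioned in the comment can be illustrated with a back-of-envelope calculation. This is a hedged sketch, not parquet-cpp code; the per-row-group metadata footprint of a few KiB is an assumed figure for illustration (the real size depends on the number of columns and their statistics):

```python
# Back-of-envelope sketch (not parquet-cpp code): why a file with more than
# a million row groups can push the serialized Thrift footer past the 2 GiB
# cap of a TMemoryBuffer. The ~3 KiB per-row-group figure is an assumption
# for illustration; the real size depends on column count and statistics.
THRIFT_MESSAGE_CAP = 2 * 1024**3  # a TMemoryBuffer is limited to 2 GiB

def estimated_footer_size(num_row_groups, bytes_per_row_group=3 * 1024):
    """Rough size of the serialized FileMetaData for a given row-group count."""
    return num_row_groups * bytes_per_row_group

# With >1 million row groups the footer alone can exceed the Thrift cap,
# so the writer cannot serialize the metadata at all.
print(estimated_footer_size(1_000_000) > THRIFT_MESSAGE_CAP)
```

Under these assumptions the only remedy on the writer side is fewer, larger row groups; the issue itself concludes the cap cannot be lifted, only reported more clearly.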
[jira] [Commented] (PARQUET-1345) [C++] It is possible to overflow a TMemoryBuffer when serializing the file metadata
[ https://issues.apache.org/jira/browse/PARQUET-1345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17205420#comment-17205420 ] Uwe Korn commented on PARQUET-1345: --- One of the reasons this can occur is when one has a pandas DataFrame with many categorical columns; the pandas metadata may then become very large. > [C++] It is possible to overflow a TMemoryBuffer when serializing the file > metadata > --- > > Key: PARQUET-1345 > URL: https://issues.apache.org/jira/browse/PARQUET-1345 > Project: Parquet > Issue Type: Bug > Components: parquet-cpp >Reporter: Wes McKinney >Priority: Major > > I'm not sure if this is fixable, but see issue reported to Arrow: > https://github.com/apache/arrow/issues/2077 -- This message was sent by Atlassian Jira (v8.3.4#803005)
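To illustrate the point about pandas metadata size, here is a stdlib-only sketch. The JSON layout below is hypothetical (`fake_pandas_metadata` and its fields are illustrative, not the real `pandas` schema blob that pyarrow writes into the key-value metadata); it only shows how a per-column payload makes an embedded JSON blob grow with the column count:

```python
import json

# Illustrative sketch (not the real pandas-metadata layout): a schema is
# embedded as a JSON blob in the file metadata, and each categorical column
# carries its own payload. With many such columns the blob grows linearly
# and can become very large.
def fake_pandas_metadata(num_columns, num_categories):
    columns = [
        {
            "name": f"cat_{i}",
            "pandas_type": "categorical",
            # hypothetical field: a large per-column payload
            "metadata": {"categories": [f"value_{j}" for j in range(num_categories)]},
        }
        for i in range(num_columns)
    ]
    return json.dumps({"columns": columns})

small = len(fake_pandas_metadata(10, 100))
large = len(fake_pandas_metadata(1000, 100))
print(small, large)  # the blob grows roughly linearly with the column count
```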
[jira] [Created] (PARQUET-1825) [C++] Fix compilation error in column_io_benchmark.cc
Uwe Korn created PARQUET-1825: - Summary: [C++] Fix compilation error in column_io_benchmark.cc Key: PARQUET-1825 URL: https://issues.apache.org/jira/browse/PARQUET-1825 Project: Parquet Issue Type: Improvement Components: parquet-cpp Reporter: Uwe Korn Assignee: Uwe Korn Leftover of [https://github.com/apache/arrow/pull/6690] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (PARQUET-1783) [C++] Parquet statistics wrong for dictionary type
[ https://issues.apache.org/jira/browse/PARQUET-1783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17029811#comment-17029811 ] Uwe Korn commented on PARQUET-1783: --- The problem is somewhere in the Parquet C++ code, as statistics are computed there. > [C++] Parquet statistics wrong for dictionary type > -- > > Key: PARQUET-1783 > URL: https://issues.apache.org/jira/browse/PARQUET-1783 > Project: Parquet > Issue Type: Bug > Components: parquet-cpp >Affects Versions: cpp-1.6.0 >Reporter: Florian Jetter >Priority: Major > > h3. Observed behaviour > Statistics for categorical data are equivalent for all row groups and refer > to the entire {{CategoricalDtype}} instead of the data included in the row > group. > h3. Expected behaviour > The row group statistics should only include data which is part of the actual > row group, not the entire {{CategoricalDtype}} > h3. Minimal example > {code:python} > import pandas as pd > import pyarrow as pa > import pyarrow.parquet as pq > test_df = pd.DataFrame({"categorical": pd.Categorical(["1", "42"])}) > table = pa.Table.from_pandas(test_df) > pq.write_table( > table, > "test_parquet", > chunk_size=1, > ) > test_parquet = pq.ParquetFile("test_parquet") > test_parquet.metadata.row_group(0).column(0).statistics > {code} > {code:java} > Out[1]: > > has_min_max: True > min: 1 > max: 42 > null_count: 0 > distinct_count: 0 > num_values: 1 > physical_type: BYTE_ARRAY > logical_type: String > converted_type (legacy): UTF8 > {code} > Expected would be > {{min:1}} {{max:1}} instead of {{max: 42}} for the first row group > > Tested with > pandas==1.0.0 > pyarrow==bd08d0ecbe355b9e0de7d07e8b9ff6ccdb150e73 (current master / > essentially 0.16.0) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Moved] (PARQUET-1783) [C++] Parquet statistics wrong for dictionary type
[ https://issues.apache.org/jira/browse/PARQUET-1783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Korn moved ARROW-7732 to PARQUET-1783: -- Component/s: (was: C++) parquet-cpp Key: PARQUET-1783 (was: ARROW-7732) Affects Version/s: (was: 0.15.1) (was: 0.16.0) cpp-1.6.0 Workflow: patch-available, re-open possible (was: jira) Project: Parquet (was: Apache Arrow) > [C++] Parquet statistics wrong for dictionary type > -- > > Key: PARQUET-1783 > URL: https://issues.apache.org/jira/browse/PARQUET-1783 > Project: Parquet > Issue Type: Bug > Components: parquet-cpp >Affects Versions: cpp-1.6.0 >Reporter: Florian Jetter >Priority: Major > > h3. Observed behaviour > Statistics for categorical data are equivalent for all row groups and refer > to the entire {{CategoricalDtype}} instead of the data included in the row > group. > h3. Expected behaviour > The row group statistics should only include data which is part of the actual > row group, not the entire {{CategoricalDtype}} > h3. Minimal example > {code:python} > import pandas as pd > import pyarrow as pa > import pyarrow.parquet as pq > test_df = pd.DataFrame({"categorical": pd.Categorical(["1", "42"])}) > table = pa.Table.from_pandas(test_df) > pq.write_table( > table, > "test_parquet", > chunk_size=1, > ) > test_parquet = pq.ParquetFile("test_parquet") > test_parquet.metadata.row_group(0).column(0).statistics > {code} > {code:java} > Out[1]: > > has_min_max: True > min: 1 > max: 42 > null_count: 0 > distinct_count: 0 > num_values: 1 > physical_type: BYTE_ARRAY > logical_type: String > converted_type (legacy): UTF8 > {code} > Expected would be > {{min:1}} {{max:1}} instead of {{max: 42}} for the first row group > > Tested with > pandas==1.0.0 > pyarrow==bd08d0ecbe355b9e0de7d07e8b9ff6ccdb150e73 (current master / > essentially 0.16.0) -- This message was sent by Atlassian Jira (v8.3.4#803005)
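The reported behaviour can be modelled in plain Python. This sketch is illustrative only (it is not the parquet-cpp statistics code): it contrasts statistics taken over the whole dictionary, which come out identical for every row group, with statistics taken over the values each row group actually references:

```python
# Plain-Python model of the reported bug (not the actual parquet-cpp code).
# A dictionary-encoded column stores a dictionary of distinct values plus,
# per row group, the indices into that dictionary.
dictionary = ["1", "42"]
row_groups = [[0], [1]]  # chunk_size=1 in the report: one value per row group

# Observed behaviour: statistics derived from the whole dictionary are the
# same for every row group, regardless of which values it contains.
buggy_stats = [(min(dictionary), max(dictionary)) for _ in row_groups]

# Expected behaviour: statistics over the values the row group references.
correct_stats = [
    (min(dictionary[i] for i in rg), max(dictionary[i] for i in rg))
    for rg in row_groups
]

print(buggy_stats)    # [('1', '42'), ('1', '42')] -- both row groups alike
print(correct_stats)  # [('1', '1'), ('42', '42')]
```

Note that min/max here are lexicographic string comparisons, matching the BYTE_ARRAY/UTF8 column in the report, which is why `max` is `42` rather than a numeric maximum.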
[jira] [Created] (PARQUET-1779) format: Update merge script
Uwe Korn created PARQUET-1779: - Summary: format: Update merge script Key: PARQUET-1779 URL: https://issues.apache.org/jira/browse/PARQUET-1779 Project: Parquet Issue Type: Improvement Components: parquet-format Reporter: Uwe Korn Assignee: Uwe Korn Fix For: format-2.8.0 The current merge script is not compatible with Python 3; copy over the merge script from the Arrow project, a development that originally started from merge_parquet.py. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (PARQUET-1777) add Parquet logo vector files to repo
[ https://issues.apache.org/jira/browse/PARQUET-1777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Korn resolved PARQUET-1777. --- Fix Version/s: format-2.8.0 Resolution: Fixed Issue resolved by pull request 157 [https://github.com/apache/parquet-format/pull/157] > add Parquet logo vector files to repo > - > > Key: PARQUET-1777 > URL: https://issues.apache.org/jira/browse/PARQUET-1777 > Project: Parquet > Issue Type: Task > Components: parquet-format >Reporter: Julien Le Dem >Assignee: Julien Le Dem >Priority: Major > Labels: pull-request-available > Fix For: format-2.8.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (PARQUET-1689) [C++] Stream API: Allow for columns/rows to be skipped when reading
[ https://issues.apache.org/jira/browse/PARQUET-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Korn resolved PARQUET-1689. --- Fix Version/s: cpp-1.6.0 Resolution: Fixed Issue resolved by pull request 5797 [https://github.com/apache/arrow/pull/5797] > [C++] Stream API: Allow for columns/rows to be skipped when reading > --- > > Key: PARQUET-1689 > URL: https://issues.apache.org/jira/browse/PARQUET-1689 > Project: Parquet > Issue Type: Improvement > Components: parquet-cpp >Reporter: Gawain BOLTON >Assignee: Gawain BOLTON >Priority: Major > Labels: pull-request-available > Fix For: cpp-1.6.0 > > Time Spent: 5h 40m > Remaining Estimate: 0h > > It can be useful to be able to skip rows and/or columns when reading data. > The ColumnReader class already allows for data to be skipped. > This new StreamReader class could use this functionality to allow for users > to skip columns and rows when using the StreamReader API. > I will propose this functionality by submitting a PR. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (PARQUET-1686) Automate site generation
[ https://issues.apache.org/jira/browse/PARQUET-1686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16963045#comment-16963045 ] Uwe Korn commented on PARQUET-1686: --- In Arrow we are using Jekyll with Github Actions to automatically deploy our site: [https://github.com/apache/arrow-site/blob/master/.github/workflows/deploy.yml] > Automate site generation > > > Key: PARQUET-1686 > URL: https://issues.apache.org/jira/browse/PARQUET-1686 > Project: Parquet > Issue Type: Improvement > Components: parquet-site >Reporter: Gabor Szadovszky >Priority: Major > Labels: documentation > > We moved our site source to [github|https://github.com/apache/parquet-site]. > It is much better than svn but still not working as it should. Currently, we > have to generate the site manually before checking in. It would be much > better if the site generation would be automatic so we can simply accept PRs > on the source files. > One option to achieve this is the [Pelican CMS > System|https://blog.getpelican.com/] as described at [.asf.yaml features for > git > repositories|https://cwiki.apache.org/confluence/display/INFRA/.asf.yaml+features+for+git+repositories#id-.asf.yamlfeaturesforgitrepositories-StaticwebsitecontentgenerationviaPelicanCMS]. > Not sure if this is the best solution though. Another solution might be to > trigger a jenkins build for the changes on master and after generating the > site with middleman commit the files to the branch asf-site. -- This message was sent by Atlassian Jira (v8.3.4#803005)
Re: Preparing for parquet-cpp 0.1
We already have https://issues.apache.org/jira/browse/PARQUET-713, closed as duplicate ;) Especially the dev scripts seem to originate from somewhere else? Is there something we have to take care of because of parquet-cpp's origin? Also I made a PR to run RAT in the CI to check the licenses: https://github.com/apache/parquet-cpp/pull/189 Runs nicely but we still have to deal with the things Ryan mentioned. On 08.11.16 19:23, Julien Le Dem wrote: I created a jira for the release: https://issues.apache.org/jira/browse/PARQUET-774 please add blockers to that jira if they need to be in the release. On Tue, Nov 8, 2016 at 10:07 AM, Ryan Blue wrote: Do you guys intend to release convenience binaries in addition to the initial source release? If so, I think you'll have to include a license/notice that includes the third party dependencies. Also, license should be used to record third-party licensed works that are included in the source distribution. The bit packing code should be in there, rather than in notice. Notice is for required third-party notices and isn't the file where third-party licensing information should be accumulated. rb On Tue, Nov 8, 2016 at 10:00 AM, Wes McKinney wrote: I think we are ready to make a release once PARQUET-702 is merged. Is there any more licensing / NOTICE review work to do? On Fri, Nov 4, 2016 at 10:29 AM, Deepak Majeti wrote: I would like to get PARQUET-764 and PARQUET-702 into the release as well. Both of them belong to me. I plan to finish PARQUET-702 by Monday. If someone can take over PARQUET-764, it will be easier. On Fri, Nov 4, 2016 at 3:04 AM, Uwe Korn wrote: Hello, given that we have reached a point parquet-cpp is working quite nicely and a minimal set of features is implemented, I would like to continue to make a release in the next days. I would wait for PARQUET-726 [1] to be merged and then setup the release scripts and ask for a vote. Is there anything else someone wants to get in before the initial release? 
Uwe [1] https://github.com/apache/parquet-cpp/pull/184 -- regards, Deepak Majeti -- Ryan Blue Software Engineer Netflix
Preparing for parquet-cpp 0.1
Hello, given that we have reached a point parquet-cpp is working quite nicely and a minimal set of features is implemented, I would like to continue to make a release in the next days. I would wait for PARQUET-726 [1] to be merged and then setup the release scripts and ask for a vote. Is there anything else someone wants to get in before the initial release? Uwe [1] https://github.com/apache/parquet-cpp/pull/184
Re: [VOTE] Release Apache Parquet 1.9.0 RC1
Hello Ryan, sadly I have failing tests with the RC. Seems like they are locale-dependent ("," vs "."). Rerunning with LANG=en_US.UTF-8 sadly did not solve this; is there some other magic I need to provide to switch JVM locales? % cat parquet-column/target/surefire-reports/org.apache.parquet.column.statistics.TestStatistics.txt --- Test set: org.apache.parquet.column.statistics.TestStatistics --- Tests run: 9, Failures: 2, Errors: 0, Skipped: 0, Time elapsed: 0.024 sec <<< FAILURE! testFloatMinMax(org.apache.parquet.column.statistics.TestStatistics) Time elapsed: 0.01 sec <<< FAILURE! org.junit.ComparisonFailure: expected:num_nulls: 0> but was: at org.junit.Assert.assertEquals(Assert.java:125) at org.junit.Assert.assertEquals(Assert.java:147) at org.apache.parquet.column.statistics.TestStatistics.testFloatMinMax(TestStatistics.java:235) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20) at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:263) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:68) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:47) at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231) at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60) at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229) at org.junit.runners.ParentRunner.access$000(ParentRunner.java:50) at 
org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:222) at org.junit.runners.ParentRunner.run(ParentRunner.java:300) at org.apache.maven.surefire.junit4.JUnit4TestSet.execute(JUnit4TestSet.java:53) at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:123) at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:104) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.maven.surefire.util.ReflectionUtils.invokeMethodWithArray(ReflectionUtils.java:164) at org.apache.maven.surefire.booter.ProviderFactory$ProviderProxy.invoke(ProviderFactory.java:110) at org.apache.maven.surefire.booter.SurefireStarter.invokeProvider(SurefireStarter.java:175) at org.apache.maven.surefire.booter.SurefireStarter.runSuitesInProcessWhenForked(SurefireStarter.java:107) at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:68) testDoubleMinMax(org.apache.parquet.column.statistics.TestStatistics) Time elapsed: 0 sec <<< FAILURE! 
org.junit.ComparisonFailure: expected:num_nulls: 0> but was: at org.junit.Assert.assertEquals(Assert.java:125) at org.junit.Assert.assertEquals(Assert.java:147) at org.apache.parquet.column.statistics.TestStatistics.testDoubleMinMax(TestStatistics.java:296) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20) at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:263) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:68) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:47) at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231) at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60) at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229) at org.junit.runners.ParentRunner.access$000(P
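The "," vs "." failures above are typical of locale-dependent formatting. Here is a small Python sketch of the same failure shape (the values are hypothetical; the real failing assertions compare rendered statistics strings in Java):

```python
import locale

# Sketch of the "," vs "." failure shape (hypothetical values; the real
# test compares rendered statistics strings in Java). A locale-aware
# conversion uses the locale's decimal separator, so comparing its output
# against a "."-formatted literal breaks under a comma-decimal locale such
# as de_DE, while repr() is locale-independent.
locale.setlocale(locale.LC_NUMERIC, "C")  # the "C" locale uses "." as decimal point

value = 1.5
assert locale.str(value) == "1.5"  # holds only because LC_NUMERIC is "C";
                                   # under de_DE this would render as "1,5"
assert repr(value) == "1.5"        # stable regardless of locale
```

The analogous fix on the JVM side is to pin the test locale (e.g. `-Duser.language=en -Duser.country=US`) rather than rely on the environment's `LANG`.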
Re: [Draft report] Apache Parquet
+1 On 13.10.16 02:43, Julien Le Dem wrote: Report from the Apache Parquet committee [Julien Le Dem] ## Description: Parquet is a standard and interoperable columnar file format for efficient analytics. ## Issues: there are no issues requiring board attention at this time ## Activity: The community has been converging toward a 1.9 release. The vote will start in the coming days. Discussions about better encoding and vectorization APIs are ongoing. The parquet-cpp repo has reached a stable state and should release soon. Integration with arrow-cpp is now in the parquet-cpp repo. ## Health report: The PMC and committer list are growing. Discussion is happening on the mailing list, JIRA and regular hangout sync ups. Notes are sent to the mailing list. ## PMC changes: - Currently 22 PMC members. - Wes McKinney was added to the PMC on Thu Sep 01 2016 ## Committer base changes: - Currently 25 committers. - Uwe Korn was added as a committer on Sun Sep 04 2016 ## Releases: - Last release was Format 2.3.1 on Thu Dec 17 2015 ## Mailing list activity: - Activity on the mailing list is still relatively unchanged - JIRAs are resolved at about the same pace they are opened. - dev@parquet.apache.org: - 172 subscribers (up 9 in the last 3 months): - 486 emails sent to list (394 in previous quarter) ## JIRA activity: - 85 JIRA tickets created in the last 3 months - 74 JIRA tickets closed/resolved in the last 3 months
Re: Python Parquet package
Sounds reasonable to me. I will then continue to implement the missing interfaces for Parquet in pyarrow.parquet. @wesm Can you take care that we easily depend on a pinned version of parquet-cpp in pyarrow's travis builds? Uwe > On 21.09.2016 at 20:07, Wes McKinney wrote: > > I don't agree with this approach right now. Here are my reasons: > > 1. The Parquet Python integration will need to depend both on PyArrow > and the Arrow C++ libraries, so these libraries would generally need > to be developed together > > 2. PyArrow would need to define and maintain a C++ or Cython API so > that the equivalent of the current pyarrow.parquet library can access > C-level data. For example: > > https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.pyx#L31 > > Cython does permit cross-project C API access (we are already doing > cross-module Cython API access within pyarrow). This adds additional > complexity that I think we should avoid for now. > > 3. Maintaining a separate C++ build toolchain for a Python package > adds additional maintenance and packaging burden on us > > My inclination is to keep the code where it is and make the Parquet > extension optional. > > - Wes > > On Wed, Sep 21, 2016 at 10:16 AM, Uwe Korn wrote: >> Hello, >> >> as we have moved the Arrow<->Parquet C++ integration into parquet-cpp, we >> still have to decide on how we are going to proceed with the Arrow<->Parquet >> Python integration. For the moment, it seems that the best way to go ahead >> is to pull the pyarrow.parquet module out into a separate Python package. >> From an organisational point, I'm unclear how I should proceed here. Should >> we put this in a separate repo? If so, as part of the Apache organisation? >> >> Uwe
Python Parquet package
Hello, as we have moved the Arrow<->Parquet C++ integration into parquet-cpp, we still have to decide on how we are going to proceed with the Arrow<->Parquet Python integration. For the moment, it seems that the best way to go ahead is to pull the pyarrow.parquet module out into a separate Python package. From an organisational point, I'm unclear how I should proceed here. Should we put this in a separate repo? If so, as part of the Apache organisation? Uwe
Re: Cannot load Parquet files created with parquet-cpp in Drill
Happy to report back that this is really a parquet-cpp issue and not something in Drill. Kudos to Deepak Majeti for finding that we did not set the dictionary_page_offset in the C++ code. Uwe On 07.09.16 21:08, Kunal Khatua wrote: Hi Uwe I believe you're using the latest Apache Drill 1.8.0. From a quick look at the stack trace, it appears to be a potential bug in Drill's interpretation of dictionary-encoded data. One way to verify that your C++ implementation of Parquet is correct would be to generate your data without dictionary encoding before attempting to see if Drill can read that. Regards Kunal On Wed 7-Sep-2016 5:30:32 AM, Uwe Korn wrote: Hello, I'm currently looking at the correctness of our C++ implementation of Parquet and noticed that I cannot load these files in Drill. Although this is probably a bug in the C++ implementation, I don't understand what causes the error. Using the Java parquet-tools, I can read these files. I'm using Apache Drill 1.8.0 on OSX. I've posted the error output from Drill and the parquet file as a gist: https://gist.github.com/xhochy/d4441a5ff2025b877df43fecd4466a11 If anyone could have a short look into this and tell me why Drill cannot read the file, you would really help me to fix the parquet-cpp issues. Kind Regards, Uwe
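A minimal sketch of why an unset `dictionary_page_offset` can trip a reader. The reader logic below is a guess at the general failure mode, not Drill's actual code; the field names follow the Parquet `ColumnMetaData` Thrift struct, where an unset optional i64 commonly surfaces as 0:

```python
# Minimal sketch of the failure mode (illustrative, not Drill's reader code).
# ColumnMetaData carries an optional dictionary_page_offset; if the writer
# never sets it, a reader may see 0 and conclude there is no dictionary page.
def first_page_offset(column_meta):
    """Byte offset at which a reader starts consuming this column chunk."""
    dict_off = column_meta.get("dictionary_page_offset", 0)
    if dict_off:  # a dictionary page precedes the data pages
        return dict_off
    return column_meta["data_page_offset"]

# A writer that sets the field vs. one that left it at its default of 0
# (offsets are made-up example values).
correct_meta = {"dictionary_page_offset": 4, "data_page_offset": 120}
buggy_meta = {"dictionary_page_offset": 0, "data_page_offset": 120}

print(first_page_offset(correct_meta))  # 4: the dictionary page is read
print(first_page_offset(buggy_meta))    # 120: the dictionary page is skipped,
                                        # so dictionary-encoded pages cannot
                                        # be decoded
```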
Cannot load Parquet files created with parquet-cpp in Drill
Hello, I'm currently looking at the correctness of our C++ implementation of Parquet and noticed that I cannot load these files in Drill. Although this is probably a bug in the C++ implementation, I don't understand what causes the error. Using the Java parquet-tools, I can read these files. I'm using Apache Drill 1.8.0 on OSX. I've posted the error output from Drill and the parquet file as a gist: https://gist.github.com/xhochy/d4441a5ff2025b877df43fecd4466a11 If anyone could have a short look into this and tell me why Drill cannot read the file, you would really help me to fix the parquet-cpp issues. Kind Regards, Uwe
Re: Arrow-Parquet integration location (Was: Arrow cpp travis-ci build broken)
Hello, I'm also in favour of switching the dependency direction between Parquet and Arrow as this would avoid a lot of duplicate code in both projects as well as parquet-cpp profiting from functionality that is available in Arrow. @wesm: go ahead with the JIRAs and I'll add comments or will pick some of them up. Cheers Uwe On 07.09.16 04:41, Wes McKinney wrote: hi Julien, It makes sense to move the Parquet support for Arrow into Parquet itself and invert the dependency. I had thought that the coupling to Arrow C++'s IO subsystem might be tighter, but the connection between memory allocators and file abstractions is fairly simple: https://github.com/apache/arrow/blob/master/cpp/src/arrow/parquet/io.h I'll open appropriate JIRAs and Uwe and I can coordinate on the refactoring. The exposure of the Parquet functionality in Python should stay inside Arrow for now, but mainly because it would make developing the Python side of things much more difficult if we split things up right now. - Wes On Tue, Sep 6, 2016 at 8:27 PM, Brian Bowman wrote: Forgive me if interposing my first post for the Apache Arrow project on this thread is incorrect procedure. What Julien proposes with each storage layer producing Arrow Record Batches is exactly how I envision it working and would certainly make Arrow integration with SAS much more palatable. This is likely true for other storage layer providers as well. Brian Bowman (SAS) On Sep 6, 2016, at 7:52 PM, Julien Le Dem wrote: Thanks Wes, No worries, I know you are on top of those things. On a side note, I was wondering if the arrow-parquet integration should be in Parquet instead. Parquet would depend on Arrow and not the other way around. Arrow provides the API and each storage layer (Parquet, Kudu, Cassandra, ...) provides a way to produce Arrow Record Batches. thoughts? On Tue, Sep 6, 2016 at 3:37 PM, Wes McKinney wrote: hi Julien, I'm very sorry about the inconvenience with this and the delay in getting it sorted out. 
I will triage this evening by disabling the Parquet tests in Arrow until we get the current problems under control. When we re-enable the Parquet tests in Travis CI I agree we should pin the version SHA. - Wes On Tue, Sep 6, 2016 at 5:30 PM, Julien Le Dem wrote: The Arrow cpp travis-ci build is broken right now because it depends on parquet-cpp which has changed in an incompatible way. [1] [2] (or so it looks to me) Since parquet-cpp is not released yet it is totally fine to make incompatible API changes. However, we may want to pin the Arrow to Parquet dependency (on a git sha?) to prevent cross project changes from breaking the master build. Since I'm not one of the core cpp dev on those projects I mainly want to start that conversation rather than prescribe a solution. Feel free to take this as a straw man and suggest something else. [1] https://travis-ci.org/apache/arrow/jobs/156080555 [2] https://github.com/apache/arrow/blob/2d8ec789365f3c0f82b1f22d76160d 5af150dd31/ci/travis_before_script_cpp.sh -- Julien -- Julien
Re: Reviving Parquet sync ups
+1 for a sync up and for the European friendly time. Should be able to join this time. On 01.09.16 08:02, Julien Le Dem wrote: Hi Piyush, You are totally right. Sync ups are an important part of keeping the community informed and making progress. I'll schedule one for next week. Thursday 10 am PT? Julien On Aug 31, 2016, at 18:54, Piyush Narang wrote: hi folks, A few months back we used have Parquet community sync ups via hangouts which were a nice opportunity to chat with other Parquet developers and discuss major / minor agenda items (e.g. 1.9.0 release / Parquet 2.0 etc) and things folks were working on. As it has been a while since the last sync up, I was wondering if there would there be interest in reviving this? Thanks, -- - Piyush
Re: Parquet Vectorized Read hackathon
Yes, I'm GMT +1 On 05.07.16 18:52, Julien Le Dem wrote: If there are people interested in the cpp implementation we’ll talk about that too. I’m happy to give context or help with the encoding. In particular a Parquet -> Arrow vectorized converter would be great. Are you GMT +1 ? We can schedule a 1 hour slot in the morning for discussing with remote folks in Europe. (same in afternoon if there are people joining from Asia) Julien On Jul 5, 2016, at 2:37 AM, Uwe Korn wrote: Hello, this effort is only for the parquet-mr project or would there also be some work/benefit for parquet-cpp? If so, I might join briefly in a hangout but due to the timezone shift, I probably will not be able to be awake all the time. Uwe On 02.07.16 01:01, Julien Le Dem wrote: Dear Parquet dev list, There have been efforts in several projects for vectorized reads of Parquet. We had discussed during the Parquet sync up to organize a hackathon to brainstorm and look into a shared implementation. Some projects that would benefit: - Apache Drill - Apache Arrow - Apache Spark - Presto - Apache Hive I'm planning to organize this at the Dremio office in Mountain View with optionally a hangout for people who would want to join remotely. I'm adding to the "to:" people that have expressed interest or could be interested but that's not an exhaustive list. Please respond to this email if you wish to be included. Who's interested and what dates would work between this Tuesday 7/5 and Wednesday 7/20 ?
Re: Parquet Vectorized Read hackathon
7/12 and 7/14 are both ok for me. I'm mainly interested in the parquet-cpp -> Arrow C++ -> PyArrow path for now. Encodings other than plain encoding are currently on my near-future roadmap. On 05.07.16 19:00, Julien Le Dem wrote: 7/14 works better for me. For now we have for 7/14: - OK for 7/14: Jacques, Ryan, Julien - Please confirm the date (and time): Deepak, Cheng, Uwe Please send a short description of the projects you’re working on and what your particular interest is. On Jul 5, 2016, at 9:50 AM, Ryan Blue wrote: I'm in, and both 7/12 and 7/14 work for me. rb On Tue, Jul 5, 2016 at 9:15 AM, Jacques Nadeau wrote: Great idea, Julien! I vote for 7/12 or 7/14 On Tue, Jul 5, 2016 at 2:37 AM, Uwe Korn wrote: Hello, this effort is only for the parquet-mr project or would there also be some work/benefit for parquet-cpp? If so, I might join briefly in a hangout but due to the timezone shift, I probably will not be able to be awake all the time. Uwe On 02.07.16 01:01, Julien Le Dem wrote: Dear Parquet dev list, There have been efforts in several projects for vectorized reads of Parquet. We had discussed during the Parquet sync up to organize a hackathon to brainstorm and look into a shared implementation. Some projects that would benefit: - Apache Drill - Apache Arrow - Apache Spark - Presto - Apache Hive I'm planning to organize this at the Dremio office in Mountain View with optionally a hangout for people who would want to join remotely. I'm adding to the "to:" people that have expressed interest or could be interested but that's not an exhaustive list. Please respond to this email if you wish to be included. Who's interested and what dates would work between this Tuesday 7/5 and Wednesday 7/20 ? -- Ryan Blue Software Engineer Netflix
Re: Parquet Vectorized Read hackathon
Hello, this effort is only for the parquet-mr project or would there also be some work/benefit for parquet-cpp? If so, I might join briefly in a hangout but due to the timezone shift, I probably will not be able to be awake all the time. Uwe On 02.07.16 01:01, Julien Le Dem wrote: Dear Parquet dev list, There have been efforts in several projects for vectorized reads of Parquet. We had discussed during the Parquet sync up to organize a hackathon to brainstorm and look into a shared implementation. Some projects that would benefit: - Apache Drill - Apache Arrow - Apache Spark - Presto - Apache Hive I'm planning to organize this at the Dremio office in Mountain View with optionally a hangout for people who would want to join remotely. I'm adding to the "to:" people that have expressed interest or could be interested but that's not an exhaustive list. Please respond to this email if you wish to be included. Who's interested and what dates would work between this Tuesday 7/5 and Wednesday 7/20 ?
List of Additions to Parquet 2
Hello, I'm currently looking at the differences between Parquet 1 and Parquet 2 to implement these versions as a switch in parquet-cpp. The only list I could find is the rather undetailed changelog [1]. Is there maybe some better list or do I need to go through the referenced changeset entries myself to find the actual differences? (If the latter is the case, I'd also make a PR afterwards that augments the documentation with some "(since version 2.0)" markings.) But I'm hoping a bit that there is some blog post or so out there that could make my life easier. Thanks, Uwe [1] https://github.com/apache/parquet-format/blob/master/CHANGES.md
Re: Parquet sync up
Can you add me with xho...@gmail.com to the sync Google calendar so I get notified? Cheers Uwe On 16.05.16 18:20, Julien Le Dem wrote: Wes: I maintain a Google calendar invite. People can send me their email address to be notified of the sync. Otherwise I send reminders on the dev list, but it looks like last time I missed sending an earlier reminder. Cheng: On the Parquet side for vectorization, you can always bypass the assembly and access the column readers directly. Nezih/Ryan/Dan have some work done around this with Presto. Other projects like Drill or Spark have a custom reader based on the column readers. We're discussing making a shared implementation in Parquet itself. On Mon, May 16, 2016 at 12:32 AM, Xu, Cheng A wrote: Hi, it looks like the vectorization work is still ongoing, and I'd like to support Hive vectorization for Parquet. Is there an early version of Parquet with the vectorization feature ready that I could use to continue the work on the Hive side? Thank you in advance. -Original Message- From: Julien Le Dem [mailto:jul...@dremio.com] Sent: Friday, May 13, 2016 8:34 AM To: dev@parquet.apache.org Subject: Re: Parquet sync up The next sync up will be around Strata London in early June, where I'll happen to be. We will do it in the morning Pacific time, evening Europe time. Notes from this sync: attendees: - Julien (Dremio) - Alex, Piyush (Twitter) - Ryan (Netflix) Parquet 2.0 encodings discussion: - Jira open to finalize encodings: PARQUET-588: 2.0 encodings finalization. - Ryan is doing experiments to measure efficiency on their data - Alex and Piyush are looking at encoding selection strategies: how to pick the best encoding for the data automatically 1.9 release: - last blocker: PARQUET-400 (readFully() behavior) needs an update from Jason. Possibly Piyush could pick it up if Jason is busy Brotli integration:
- Ryan has been working on Brotli compression algorithm integration - for a similar compression cost as Snappy, much better compression ratio - embeds a native library, similar to the Snappy integration - looking into possibly statically linking the native library - PR available on parquet-format and parquet-mr Vectorized read: - towards the end of June we will organize a Parquet vectorized read hackathon for all interested parties (make yourself known if interested, we'll send more details later; remote participation through a hangout is possible) Lazy projections at runtime: - Alex has been looking into lazy Thrift objects for parquet-thrift to minimize assembly cost in existing Scalding jobs that don't declare the columns they need. Next sync will be in the morning PT. On Thu, May 12, 2016 at 5:42 AM, Deepak Majeti wrote: I am sorry for missing this meeting as well. My interest is also to improve parquet-cpp reader/writer performance. I will work with Uwe and Wes on this. My other interest is in supporting predicate pushdown. I will work on this in parallel with performance. Thanks! On Thu, May 12, 2016 at 4:05 AM, Uwe Korn wrote: I'm sorry I wasn't able to join today again (traveling). We could choose an early Pacific time to make the meeting accessible to both Asia and Europe -- I would suggest 8 or 9 AM Pacific. 8 or 9 am PT would work for me (CEST), 4pm PT is just not manageable. Also: do we have a calendar where I can see in advance when sync ups are? Currently I'm working on the Parquet integration with Arrow and on building a Python interface for libarrow-parquet. Once we have a basic working version, I will look into implementing missing features in the writer and improving general read/write performance in parquet-cpp.
Uwe http://timesched.pocoo.org/?date=2016-05-11&tz=pacific-standard-time!,de:berlin,cn:shanghai,us:new-york-city:ny I did not have much time for Parquet C++ development over the last 6 weeks, but plan to help Uwe complete the writer implementation and work toward a more complete Apache Arrow integration (this is in progress here: https://github.com/apache/arrow/tree/master/cpp/src/arrow/parquet) Other items of immediate interest: - C++ API to the file metadata (read + write) - Conda packaging for built artifacts (to make parquet-cpp easier for Python programmers to install portably when the time comes). I got Thrift C++ into conda-forge this week so this should not be hard now https://github.com/conda-forge/thrift-cpp-feedstock - Expanding column scan benchmarks (thanks Uwe for kickstarting the benchmarking effort!) - Perf improvements for the RLE decoder Thanks Wes On Wed, May 11, 2016 at 4:04 PM, Julien Le Dem wrote: The actual hangout url is https://hangouts.google.com/hangouts/_/dremio.com/parquet-sync-up On Wed, May 11, 2016 at 3:57 PM, Julien Le Dem wrote: starting in 5 mins: https://plus.google.com/hangouts/_/event/parquet_sync_up On Wed, May 11, 2016 at 1:53 PM, Julien Le Dem wrote: It is happening at 4pm PT on google hangout https://plus.goog
Re: Parquet sync up
I'm sorry I wasn't able to join today again (traveling). We could choose an early Pacific time to make the meeting accessible to both Asia and Europe -- I would suggest 8 or 9 AM Pacific. 8 or 9 am PT would work for me (CEST), 4pm PT is just not manageable. Also: do we have a calendar where I can see in advance when sync ups are? Currently I'm working on the Parquet integration with Arrow and on building a Python interface for libarrow-parquet. Once we have a basic working version, I will look into implementing missing features in the writer and improving general read/write performance in parquet-cpp. Uwe http://timesched.pocoo.org/?date=2016-05-11&tz=pacific-standard-time!,de:berlin,cn:shanghai,us:new-york-city:ny I did not have much time for Parquet C++ development over the last 6 weeks, but plan to help Uwe complete the writer implementation and work toward a more complete Apache Arrow integration (this is in progress here: https://github.com/apache/arrow/tree/master/cpp/src/arrow/parquet) Other items of immediate interest: - C++ API to the file metadata (read + write) - Conda packaging for built artifacts (to make parquet-cpp easier for Python programmers to install portably when the time comes). I got Thrift C++ into conda-forge this week so this should not be hard now https://github.com/conda-forge/thrift-cpp-feedstock - Expanding column scan benchmarks (thanks Uwe for kickstarting the benchmarking effort!) - Perf improvements for the RLE decoder Thanks Wes On Wed, May 11, 2016 at 4:04 PM, Julien Le Dem wrote: The actual hangout url is https://hangouts.google.com/hangouts/_/dremio.com/parquet-sync-up On Wed, May 11, 2016 at 3:57 PM, Julien Le Dem wrote: starting in 5 mins: https://plus.google.com/hangouts/_/event/parquet_sync_up On Wed, May 11, 2016 at 1:53 PM, Julien Le Dem wrote: It is happening at 4pm PT on google hangout https://plus.google.com/hangouts/_/event/parquet_sync_up (we can do a different time next time, based on timezone preferences.
Afternoon is better for Asia. Morning is better for Europe) -- Julien
Re: Parquet sync up
Hello, due to me being in Europe, this is a very inconvenient time. Thus I'd rather write a longer mail instead of joining. As a bit of input, here is what I'm up to at the moment:
* Write support in a basic form for parquet-cpp (no compression, fixed encodings, excessive memory usage, ...) is nearly done. I hope to open the final PR for discussion next week.
* Remaining tasks until I make the PR:
  * a bit of code cleanup
  * going through the API again to make it consistent
  * metadata for RowGroups and ColumnChunks
Afterwards I would look into one of the following tasks w.r.t. parquet-cpp:
* WriterProperties to specify compression, encoding, etc. on a global and per-column basis
* performance benchmarks for write
* integration of Parquet support in Apache Arrow to use it with Python
* reducing the memory usage of the initial Writer implementation (for this we would probably need to extend the encoders a bit)
If anyone else is also looking into this, I'm happy to collaborate ;) Cheers Uwe On 21.04.16 00:51, Julien Le Dem wrote: It is happening at 4pm PT on google hangout https://plus.google.com/hangouts/_/event/parquet_sync_up
C++: API Documentation Style/Tool
Hello, I would like to start adding API documentation comments to the parquet-cpp code I'm currently working on. By default, I would use doxygen and doxygen-style comments for the API. Are there any other suggestions/best practices you would prefer? Greetings Uwe
Retrieving the full/expanded name of a column in parquet-cpp
Hello, while using parquet-cpp, I'm trying to figure out how to reliably determine at which index a named/nested column sits. In my example, I have a nested column "neighbours.array" but may also add more columns named "??.array" at a later point. Until now I used "column->descr()->name()" inside a loop over all columns in a RowGroup to determine if the current column is the one I want to read. This works fine for "top-level" columns, but for neighbours.array this only returns "array", the name of the primitive node in the schema description. To solve my problem:
1. Do we already have a reliable way to determine which column index "neighbours.array" is?
2. We could add a fullname (or differently named) function to the column description.
3. We could have a map on Reader or RowGroup level that maps the expanded name to an index.
If there is no solution yet, I'd be happy to implement 2 or 3 (or an alternative approach). My schema is as follows (generated via ParquetAvroWriter):

required group com.xhochy.AdjacencyArray {
  required int32 id
  required int32 degree
  required group neighbours {
    repeated int32 array
  }
}

Greetings, Uwe
Re: Parquet-cpp dependency on C++11
Hello, Ubuntu 12.04 (i.e., its default GCC 4.6) has C++11 support; it is only partial, but it covers the most common features. It is named C++0x there, as the standard had not been finalized at the date of the GCC 4.6 release. A good overview is https://gcc.gnu.org/projects/cxx0x.html and the linked status subpages. Cheers, Uwe On 07.03.16 18:24, Ryan Blue wrote: From some quick searching, it looks like C++11 is supported on Ubuntu 14.04 LTS but not on 12.04 LTS. Considering that 14.04 is already nearly 2 years old (and 16.04 comes out soon), I think it is fairly reasonable to depend on C++11 even though 12.04 still has another 2 years of life. Everyone has had 2 years to update to the current LTS. I only looked into Ubuntu, but I'm guessing that this is about the same for Red Hat or CentOS. I think we should stay with C++11 and expect anyone on the old releases to install newer C++ libs if they want to use Parquet-CPP, unless there's some reason I'm missing why this is a more wide-spread problem than it looks like. rb On Mon, Mar 7, 2016 at 9:02 AM, Wes McKinney wrote: hello, responses inline On Mon, Mar 7, 2016 at 8:22 AM, Aliaksei Sandryhaila wrote: Hi Wes and Julien, At this point, parquet-cpp is heavily reliant on C++11 features and semantics. Believe it or not :), there are plenty of companies still running older versions of Linux that do not support C++11. Removing this dependency will make parquet-cpp usable (and much more appealing) to them. Just to be clear -- is this a problem for you specifically? Any other context would be helpful. It is not especially difficult to set up a portable C++11 build toolchain even on Linux distributions that do not have a new enough gcc in their package repository. Both Impala and Kudu have recently developed isolated 3rd-party toolchains to facilitate development and packaging for these systems. See for example https://github.com/cloudera/native-toolchain We would like to make parquet-cpp C++03 compatible.
The end goal is to have a library that can compile with and without the -std=c++11 flag. There are two parts to this process. The first one is to redefine or remove C++11 keywords and constructs, such as auto, unique_ptr, std::move, or range-based for loops. The other part is to evaluate our use of C++11 features that are harder to replace, such as shared_ptr, make_shared(), etc., and either write our own implementation for them or modify code where appropriate (such as replacing shared_ptr with unique_ptr where possible). We can do this either by maintaining a separate feature branch and periodically pulling new code from parquet-cpp, or by implementing the compatibility functionality directly in parquet-cpp (all future PRs will be tested for C++03 compatibility during CI builds). I'm fairly negative on dropping C++11 in trunk / main library development -- it would be a hardship for me personally, and additionally deter software engineers who are increasingly coming back to C++ development because of C++11/14. This leaves legacy C++<11 projects that wish to use parquet-cpp as a 3rd-party dependency somewhat out in the cold. One approach is to provide a wrapper API for projects that cannot interact with APIs that use C++11 facilities (like std::unique_ptr). The same approach could be used to provide a C API for the project. A wrapper API would be much easier to maintain and test without having a separate branch to keep in sync -- there might be some pitfalls here that I'm not aware of, so let me know what you think. Thanks, Wes What are your thoughts on this? Thank you, Aliaksei.