Re: [DISCUSS] Other website improvements?

2020-07-28 Thread Maarten Ballintijn
The new website looks very nice. Looking around for a bit, some areas where extra content might be helpful:
- A search feature in the top bar
- More info, more prominence for Parquet, Gandiva, Plasma, others?
- A vision and/or current roadmap for the project. The analysis / computation aspects seem

Re: [Python][Documentation] Add column limit recommendations Parquet page

2020-05-11 Thread Maarten Ballintijn
that using python is that much slower? Are my observations correct? > On Sat, May 9, 2020 at 4:28 PM Maarten Ballintijn wrote: >> >> Wes, >> >> "Users would be well advised to not write columns with large numbers (> >> 1000) of columns" >> You

Re: [Python][Documentation] Add column limit recommendations Parquet page

2020-05-09 Thread Maarten Ballintijn
Wes, "Users would be well advised to not write columns with large numbers (> 1000) of columns" You've mentioned this before and as this is in my experience not an uncommon use-case can you maybe expand a bit on the following related questions. (use-cases include daily or minute data for a few 1

Re: Coordinating / scheduling C++ Parquet-Arrow nested data work (ARROW-1644 and others)

2020-03-12 Thread Maarten Ballintijn
Hi Micah, How does the performance change for “flat” schemas? (particularly in the case of a large number of columns) Thanks, Maarten > On Mar 11, 2020, at 11:53 PM, Micah Kornfield wrote: > > Another status update. I've integrated the level generation code with the > parquet writing code [

Re: Documentation fixlet

2020-01-13 Thread Maarten Ballintijn
I’ve some more documentation fixes, shall I add these to the PR? > On Jan 13, 2020, at 3:54 PM, Maarten Ballintijn wrote: > > Done, first time, feedback welcome! > > >> On Jan 13, 2020, at 2:50 PM, Wes McKinney wrote: >> >> Would you like to submit a PR?

Re: Documentation fixlet

2020-01-13 Thread Maarten Ballintijn
Done, first time, feedback welcome! > On Jan 13, 2020, at 2:50 PM, Wes McKinney wrote: > > Would you like to submit a PR? > > On Mon, Jan 13, 2020 at 1:25 PM Maarten Ballintijn wrote: >> >> Hello, >> >> It looks like “--file arrow/ci

[jira] [Created] (ARROW-7561) [Doc] fix conda environment command

2020-01-13 Thread Maarten Ballintijn (Jira)
Maarten Ballintijn created ARROW-7561: Summary: [Doc] fix conda environment command Key: ARROW-7561 URL: https://issues.apache.org/jira/browse/ARROW-7561 Project: Apache Arrow Issue

Documentation fixlet

2020-01-13 Thread Maarten Ballintijn
Hello, It looks like “--file arrow/ci/conda_env_gandiva.yml \” is missing in: https://arrow.apache.org/docs/python/development.html#using-conda Cheers, Maarten.

Re: Human-readable version of Arrow Schema?

2019-12-07 Thread Maarten Ballintijn
Is there a syntax specified for schemas? Cheers, Maarten. > On Dec 6, 2019, at 5:01 PM, Micah Kornfield wrote: > > Hi Christian, > As far as I know no-one is working on a canonical text representation for > schemas. A JSON serializer exists for integration test purposes, but > IMO it should

Re: PyArrow.Table schema.metadata issue

2019-11-27 Thread Maarten Ballintijn
Hi Aaron, The schema is immutable; add_metadata returns a new schema object which includes the metadata. So I think this does what you want: schema = schema.add_metadata(meta) If not, experts will chime in hopefully. Cheers, Maarten. > On Nov 28, 2019, at 12:41 AM, Aaron Chu wrote: > > De

Re: pyarrow read_csv with different amount of columns per row

2019-11-19 Thread Maarten Ballintijn
Hi Elisa, One option is to preprocess the file and add the missing columns. You can do this using two passes (reading once to determine the number of columns and once writing out the lines filled out to the right number of columns). This does not need to take a lot of memory as you can read line
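
The two-pass preprocessing described above could look roughly like the following, using only the standard library. The function name pad_csv, the file names, and the empty-string fill value are hypothetical choices, and the sketch assumes a plain comma-separated file that the csv module can parse.

```python
import csv
import os
import tempfile


def pad_csv(src_path, dst_path, fill=""):
    """Pad every row of a ragged CSV file to the width of its widest row."""
    # Pass 1: stream through the file once to find the maximum column count.
    with open(src_path, newline="") as fin:
        width = max((len(row) for row in csv.reader(fin)), default=0)

    # Pass 2: stream through again, writing each row padded to that width.
    with open(src_path, newline="") as fin, open(dst_path, "w", newline="") as fout:
        writer = csv.writer(fout)
        for row in csv.reader(fin):
            writer.writerow(row + [fill] * (width - len(row)))
    return width


# Small demonstration on a throwaway file with rows of 3, 2 and 1 columns.
tmpdir = tempfile.mkdtemp()
ragged = os.path.join(tmpdir, "ragged.csv")
padded = os.path.join(tmpdir, "padded.csv")
with open(ragged, "w", newline="") as f:
    f.write("a,b,c\n1,2\n3\n")

width = pad_csv(ragged, padded)
with open(padded, newline="") as f:
    rows = list(csv.reader(f))
print(width, rows)
```

Both passes read one row at a time, so memory use stays small regardless of file size; the padded file should then have a consistent column count for a CSV reader that requires one.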

Re: [C++] The quest for zero-dependency builds

2019-10-20 Thread Maarten Ballintijn
Devs, I would request to be as conservative as possible in choosing (keeping) a build system. For developers, packagers and even end-users for some languages the build system is just another dependency. Even if CMake is not ideal, it has become quite ubiquitous which is a huge plus. Maybe it

Re: Parquet file reading performance

2019-10-01 Thread Maarten Ballintijn
the dtype on the pd.Series constructor triggers another code path which is a further factor ~5 slower. > On Oct 1, 2019, at 7:07 AM, Joris Van den Bossche > wrote: > > Some answers to the other questions: > > On Sat, 28 Sep 2019 at 22:16, Maarten Ballintijn wrote: >

Re: Parquet file reading performance

2019-09-28 Thread Maarten Ballintijn
_table('testdata.dt.parquet') >> 43 ms ± 1.75 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) >> >> In [6]: table_int = pq.read_table('testdata.int.parquet') >> >> In [7]: table_datetime = pq.read_table('testdata.dt.parquet') >

Re: Parquet file reading performance

2019-09-24 Thread Maarten Ballintijn
e: > > hi > > On Tue, Sep 24, 2019 at 9:26 AM Maarten Ballintijn wrote: >> >> Hi Wes, >> >> Thanks for your quick response. >> >> Yes, we’re using Python 3.7.4, from miniconda and conda-forge, and: >>

Re: Parquet file reading performance

2019-09-24 Thread Maarten Ballintijn
ance issues you're seeing. > > Thanks > Wes > > On Mon, Sep 23, 2019 at 5:52 PM Maarten Ballintijn wrote: >> >> Greetings, >> >> We have Pandas DataFrames with typically about 6,000 rows using >> DateTimeIndex. >> They have about 20,000