Sounds good.
In general I would say that this is a good opportunity to make
improvements around random data generation. For example, I don't think
we have an API for generating a RecordBatch given a schema and some
options (e.g. probability of nulls, distribution of list sizes), but
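The kind of generator being proposed could be sketched as follows. This is plain Python over nested lists, not the actual Arrow API; the function name and options (`null_probability`, `max_list_size`) are hypothetical stand-ins for the knobs mentioned above:

```python
import random

def random_list_column(n_rows, null_probability=0.1, max_list_size=5, seed=0):
    """Sketch of random data generation for a list<int64>-like column.

    Hypothetical API, illustrating the options discussed on the list:
    a probability of nulls and a (here: uniform) distribution of list
    sizes. Returns plain Python lists, with None for a null list.
    """
    rng = random.Random(seed)  # seeded for reproducible test data
    column = []
    for _ in range(n_rows):
        if rng.random() < null_probability:
            column.append(None)  # null list slot
        else:
            size = rng.randint(0, max_list_size)  # uniform list-size distribution
            column.append([rng.randint(0, 100) for _ in range(size)])
    return column
```

A real API would presumably take a full schema and recurse per field; this only shows the per-column shape of the idea.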
Hi Wes,
Thanks that seems like a good characterization. I opened up some JIRA
subtasks on ARROW-1644 which go into a little more detail on tasks that can
probably be worked on in parallel (I've only assigned ones to myself
that I'm actively working on, happy to discuss/collaborate on the
hi Micah,
Sounds good. It seems like there are a few projects where people might
be able to work without stepping on each other's toes
A. Array reassembly from raw repetition/definition levels (I would
guess this would be your focus)
B. Schema and data generation for round-trip correctness and
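Task A refers to Parquet's Dremel-style encoding, where nesting is flattened into definition/repetition levels. A minimal sketch of reassembly for a single nullable `list<int>` column, using a simplified level scheme (def 0 = null list, 1 = empty list, 2 = present value; rep 0 = new record, 1 = continues the current list) — illustrative only, not the Arrow C++ code:

```python
def assemble_lists(def_levels, rep_levels, values):
    """Reassemble a nullable list<int> column from def/rep levels.

    Simplified encoding assumed here:
      def 0 -> null list, def 1 -> empty list, def 2 -> present value
      rep 0 -> value starts a new record, rep 1 -> continues current list
    """
    records = []
    it = iter(values)
    for d, r in zip(def_levels, rep_levels):
        if r == 0:
            # start a new top-level record
            if d == 0:
                records.append(None)      # null list, no value consumed
                continue
            records.append([])
            if d == 1:
                continue                  # empty list, no value consumed
            records[-1].append(next(it))  # first value of a new list
        else:
            records[-1].append(next(it))  # continuation within the same list
    return records
```

Real Parquet levels also encode nullability at every nesting depth, which is what makes mixed lists-of-structs cases hard; this collapses that to one level of lists.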
Hi Wes,
Yes, I'm making progress and at this point I anticipate being able to
finish it off by next release, possibly without support for round tripping
fixed size lists. I've been spending some time thinking about different
approaches and have started coding some of the building blocks, which I
hi Micah,
I'm glad that we have the write side of nested data completed for 0.17.0.
As far as completing the read side and then implementing sufficient
testing to exercise corner cases in end-to-end reads/writes, do you
anticipate being able to work on this in the next 4-6 weeks (obviously
the state
Hi Igor,
It looks like a good start. Thank you! I left feedback on the PR. I
should have some time over the next few weeks or so to handle the
combinations of lists/structs that aren't currently supported in the core
reader. As I get further into the code I'll chime in here if there are
other
Hi Micah,
Finally got around to doing some work on the reader's side. Like you
suggested, I started with https://issues.apache.org/jira/browse/ARROW-7960.
I never programmed C++ professionally so I opened the PR
https://github.com/apache/arrow/pull/6758 as soon as possible to collect
feedback
>
> I'd be OK with having the flag so long as the new code is the default.
> Otherwise we'll never find out about the corner cases.
Completely agree, my PR makes the new code the default.
On Fri, Mar 13, 2020 at 6:44 AM Wes McKinney wrote:
On Thu, Mar 12, 2020 at 10:39 PM Micah Kornfield wrote:
>
> Maarten, I don't expect regressions for flat cases (I'm going to try to run
> benchmarks comparison tonight).
>
> In terms of the flag, I'm more concerned about some corner case I didn't
> think of in testing or a workload that for some
Maarten, I don't expect regressions for flat cases (I'm going to try to run
benchmarks comparison tonight).
In terms of the flag, I'm more concerned about some corner case I didn't
think of in testing or a workload that for some reason is better with the
prior code. If either of these arise I
Maarten -- AFAIK Micah's work only affects nested / non-flat column
paths, so flat data should not be impacted. Since we have a partial
implementation of writes for nested data (lists-of-lists and
structs-of-structs, but no mix of the two) that was the performance
difference I was referencing.
On
Hi Micah,
How does the performance change for “flat” schemas?
(particularly in the case of a large number of columns)
Thanks,
Maarten
> On Mar 11, 2020, at 11:53 PM, Micah Kornfield wrote:
>
> Another status update. I've integrated the level generation code with the
> parquet writing code
hi Micah,
Great to hear about the progress, I'll help with code review.
FWIW, if the new code passes the existing unit tests I would be in
favor of deleting the old code so that we're fully invested in making
the new code suitably fast. Jump in with two feet, so to speak.
Thanks
Wes
On Wed,
Another status update. I've integrated the level generation code with the
parquet writing code [1].
After that PR is merged I'll add bindings in Python to control versions of
the level generation algorithm and plan on moving on to the read side.
Thanks,
Micah
[1]
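For context, the level generation being integrated here is the write-side inverse of reassembly: walk the nested values and emit definition/repetition levels. A simplified sketch for a nullable `list<int>` column, using the same toy encoding (def 0 = null list, 1 = empty list, 2 = present value) — not the actual C++ implementation:

```python
def generate_levels(records):
    """Generate def/rep levels for a nullable list<int> column.

    Write-side counterpart of reassembly, with the same simplified
    encoding: def 0 = null list, 1 = empty list, 2 = present value;
    rep 0 = first value of a record, rep 1 = continuation of a list.
    """
    def_levels, rep_levels, values = [], [], []
    for rec in records:
        if rec is None:
            def_levels.append(0)
            rep_levels.append(0)          # null list emits one level pair
        elif len(rec) == 0:
            def_levels.append(1)
            rep_levels.append(0)          # empty list emits one level pair
        else:
            for i, v in enumerate(rec):
                def_levels.append(2)
                rep_levels.append(0 if i == 0 else 1)
                values.append(v)
    return def_levels, rep_levels, values
```

Round-tripping through a matching reassembly routine is the kind of end-to-end correctness check discussed earlier in this thread.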
Hi Igor,
If you have the time https://issues.apache.org/jira/browse/ARROW-7960 might
be a good task to pick up for this. I think it should be a relatively small
amount of code, so it is probably a good contribution to the project. Once
that is wrapped up we can see where we both are.
Cheers,
Micah
Hi Micah, I actually got involved with another personal project and had to
postpone my contribution to arrow a bit. The good news is that I'm almost
done with it, so I could help you with the read side very soon. Any ideas
how we could coordinate this?
Em qua., 26 de fev. de 2020 às 21:06, Wes
hi Micah -- great news on the level generation PR. I'll try to carve
out some time for reviewing over the coming week.
On Wed, Feb 26, 2020 at 3:10 AM Micah Kornfield wrote:
>
> Hi Igor,
> I was wondering if you have made any progress on this?
>
> I posted a new PR [1] which I believe handles
Hi Igor,
I was wondering if you have made any progress on this?
I posted a new PR [1] which I believe handles the difficult algorithmic
part of writing. There will be some follow-ups but I think this PR might
take a while to review, so I was thinking of starting to take a look at the
read side
>
> Glad to hear about the progress. As I mentioned on #2, what do you
> think about setting up a feature branch for you to merge PRs into?
> Then the branch can be iterated on and we can merge it back when it's
> feature complete and does not have perf regressions for the flat
> read/write path.
hi Micah
On Mon, Feb 3, 2020 at 12:01 AM Micah Kornfield wrote:
>
> Just to give an update. I've been a little bit delayed, but my progress is
> as follows:
> 1. Had 1 PR merged that will exercise basic end-to-end tests.
> 2. Have another PR open that allows a configuration option in C++ to
>
Hi Igor,
Thanks for the offer.
There is an old PR [1] that contains the generic logic for reading and
writing, which last time I looked seemed like a reasonable start for the
read portion (I think the write portion can potentially be made more
memory efficient (at the expense of some
Hi, I would love to help with this issue. I'm aware that this is a huge
task for a first contribution to arrow, but I feel that I could help with
the read path.
Reading parquet seems like an extremely complex task since both hive[0] and
spark[1] tried to implement a "vectorized" version and they
Just to give an update. I've been a little bit delayed, but my progress is
as follows:
1. Had 1 PR merged that will exercise basic end-to-end tests.
2. Have another PR open that allows a configuration option in C++ to
determine which algorithm version to use for reading/writing, the existing
Hi Wes,
I'm still interested in doing the work. But I don't want to hold anybody up
if they have bandwidth.
In order to actually make progress on this, my plan will be to:
1. Help with the current Java review backlog through early next week or so
(this has been taking the majority of my time allocated
hi folks,
I think we have reached a point where the incomplete C++ Parquet
nested data assembly/disassembly is harming the value of several
other parts of the project, for example the Datasets API. As another
example, it's possible to ingest nested data from JSON but not write
it to Parquet in