Correct.

On Tue, May 28, 2019 at 3:13 PM Anton Okolnychyi wrote:
> Alright, so we are talking about reading Parquet data into
> ArrowRecordBatches and then exposing them as ColumnarBatches in Spark,
> where Spark ColumnVectors actually wrap Arrow FieldVectors, correct?
>
> - Anton
>

Alright, so we are talking about reading Parquet data into ArrowRecordBatches
and then exposing them as ColumnarBatches in Spark, where Spark ColumnVectors
actually wrap Arrow FieldVectors, correct?
- Anton

> On 28 May 2019, at 21:24, Ryan Blue wrote:
>
> From a performance viewpoint, this [...]
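
For concreteness, a minimal sketch (an illustration, not Iceberg's actual
code) of the wrapping described above, using Spark's public ArrowColumnVector
and ColumnarBatch classes:

    import org.apache.arrow.vector.FieldVector;
    import org.apache.arrow.vector.VectorSchemaRoot;
    import org.apache.spark.sql.vectorized.ArrowColumnVector;
    import org.apache.spark.sql.vectorized.ColumnVector;
    import org.apache.spark.sql.vectorized.ColumnarBatch;

    public class ArrowBatchBridge {
      // Expose the vectors of a loaded Arrow record batch as a Spark
      // ColumnarBatch; each Spark ColumnVector wraps an Arrow FieldVector.
      public static ColumnarBatch toColumnarBatch(VectorSchemaRoot root) {
        ColumnVector[] columns = new ColumnVector[root.getFieldVectors().size()];
        for (int i = 0; i < columns.length; i++) {
          FieldVector vector = root.getFieldVectors().get(i);
          columns[i] = new ArrowColumnVector(vector);
        }
        ColumnarBatch batch = new ColumnarBatch(columns);
        batch.setNumRows(root.getRowCount());
        return batch;
      }
    }
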
On Fri, May 24, 2019 at 8:28 PM Ryan Blue wrote:

> if the Iceberg reader were to wrap Arrow or ColumnarBatch behind an
> Iterator[InternalRow] interface, it would still not work, right? Because it
> seems to me there is a lot more going on upstream in the operator execution
> path that would be needed to [...]
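
As a point of reference, even behind a row interface a columnar batch can be
drained through Spark's own adapter; a tiny sketch (illustrative only,
assuming the public ColumnarBatch API):

    import java.util.Iterator;
    import org.apache.spark.sql.catalyst.InternalRow;
    import org.apache.spark.sql.vectorized.ColumnarBatch;

    public class RowBridge {
      // Adapt a columnar batch to the Iterator[InternalRow] interface that
      // Spark's non-vectorized operators consume.
      public static Iterator<InternalRow> asRows(ColumnarBatch batch) {
        return batch.rowIterator();
      }
    }
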
On Tue, May 28, 2019 at 11:19 AM Daniel Weeks wrote:

> Hey Anton,
>
> #1) Part of the reason Iceberg has a custom reader is to help resolve some
> of the Iceberg-specific aspects of how Parquet files are read (e.g., column
> resolution by id, Iceberg expressions). Also, it's been a struggle to get
> agreement on a good vectorized API. I don't believe the [...]
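
To make the column-resolution-by-id point concrete, a sketch of the idea (not
Iceberg's actual reader code), assuming parquet-mr's Type#getId() carries the
field ids that Iceberg writes into the file schema:

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.parquet.schema.MessageType;
    import org.apache.parquet.schema.Type;

    public class FieldIdResolution {
      // Index the file's top-level columns by field id, so projection
      // survives column renames, unlike resolution by name.
      public static Map<Integer, Type> columnsById(MessageType fileSchema) {
        Map<Integer, Type> byId = new HashMap<>();
        for (Type column : fileSchema.getFields()) {
          if (column.getId() != null) {
            byId.put(column.getId().intValue(), column);
          }
        }
        return byId;
      }
    }
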
Hi Anton,

On point #5, I would suggest doing the work either in Apache Arrow or in the
Parquet Java project -- we are developing both the Parquet C++ and Rust
codebases within the apache/arrow repository, so I think you would find an
active community there. I know that there has been a lot of interest [...]
Hi,

I see more and more questions around the Iceberg Parquet reader. I think it
would be useful to have a thread that clarifies all open questions and
explains the long-term plan.

1. Am I correct that performance is the main reason to have a custom reader
in Iceberg? Are there any other purposes? [...]
> You’re right that the first thing that Spark does is to get each row as
> InternalRow. But we still get a benefit from vectorizing the data
> materialization to Arrow itself. Spark execution is not vectorized, but that
> can be updated in Spark later (I think there’s a proposal).

I am not [...]

Hm, this is actually a good question.
My understanding is that we shouldn't explicitly define partitioning by
year/month/day/hour on the same column. Instead, we should be fine with hour
only. Iceberg produces ordinals for time-based partition functions. As far as I
remember, Ryan was planning [...]
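
To spell out what "ordinals for time-based partition functions" means (a
sketch of the spec's definition, not code from Iceberg): hour is hours since
the Unix epoch, so coarser granularities like day remain derivable from it,
which is why partitioning by hour alone is enough:

    import java.time.Instant;
    import java.time.temporal.ChronoUnit;

    public class TimeTransforms {
      // Iceberg's hour transform: hours since 1970-01-01 00:00:00 UTC.
      public static int hourOrdinal(Instant ts) {
        return (int) ChronoUnit.HOURS.between(Instant.EPOCH, ts);
      }

      // The day ordinal is recoverable from the hour ordinal.
      public static int dayOrdinal(int hourOrdinal) {
        return Math.floorDiv(hourOrdinal, 24);
      }
    }
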
A while back I bumped into an issue with what seems to be an inconsistency
in the partition spec API, or maybe it's just an implementation bug.
Attempting to have multiple partition specs on the same schema field, I
bumped into an issue regarding the fact that while the API allows for
multiple [...]
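
A hypothetical repro sketch of the case being described (the schema and field
names are made up), using the PartitionSpec builder:

    import org.apache.iceberg.PartitionSpec;
    import org.apache.iceberg.Schema;
    import org.apache.iceberg.types.Types;

    public class MultipleTransformsRepro {
      public static PartitionSpec build() {
        Schema schema = new Schema(
            Types.NestedField.required(1, "ts", Types.TimestampType.withZone()));
        // Two time-based transforms over the same source field; whether the
        // builder should accept or reject this is the question raised above.
        return PartitionSpec.builderFor(schema)
            .hour("ts")
            .day("ts")
            .build();
      }
    }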