Re: Approaching Vectorized Reading in Iceberg ..

2019-05-28 Thread Ryan Blue
Correct. On Tue, May 28, 2019 at 3:13 PM Anton Okolnychyi wrote: > Alright, so we are talking about reading Parquet data into > ArrowRecordBatches and then exposing them as ColumnarBatches in Spark, > where Spark ColumnVectors actually wrap Arrow FieldVectors, correct? > > - Anton > > > On 28
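
A minimal sketch of the wrapping being discussed, assuming a Parquet row group has already been decoded into an Arrow VectorSchemaRoot (this is not the actual Iceberg code; ArrowBatchAdapter is a made-up name):

    import java.util.List;

    import org.apache.arrow.vector.FieldVector;
    import org.apache.arrow.vector.VectorSchemaRoot;
    import org.apache.spark.sql.vectorized.ArrowColumnVector;
    import org.apache.spark.sql.vectorized.ColumnVector;
    import org.apache.spark.sql.vectorized.ColumnarBatch;

    public class ArrowBatchAdapter {
      // Wrap each Arrow FieldVector in Spark's ArrowColumnVector and expose the
      // whole record batch to Spark as a ColumnarBatch.
      public static ColumnarBatch toColumnarBatch(VectorSchemaRoot root) {
        List<FieldVector> vectors = root.getFieldVectors();
        ColumnVector[] columns = new ColumnVector[vectors.size()];
        for (int i = 0; i < vectors.size(); i++) {
          columns[i] = new ArrowColumnVector(vectors.get(i));
        }
        ColumnarBatch batch = new ColumnarBatch(columns);
        batch.setNumRows(root.getRowCount());
        return batch;
      }
    }

No data is copied in this step; the Spark column vectors read directly from the Arrow buffers.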

Re: Approaching Vectorized Reading in Iceberg ..

2019-05-28 Thread Anton Okolnychyi
Alright, so we are talking about reading Parquet data into ArrowRecordBatches and then exposing them as ColumnarBatches in Spark, where Spark ColumnVectors actually wrap Arrow FieldVectors, correct? - Anton > On 28 May 2019, at 21:24, Ryan Blue wrote: > > From a performance viewpoint, this

Re: Approaching Vectorized Reading in Iceberg ..

2019-05-28 Thread Owen O'Malley
On Fri, May 24, 2019 at 8:28 PM Ryan Blue wrote: > if Iceberg Reader was to wrap Arrow or ColumnarBatch behind an > Iterator[InternalRow] interface, it would still not work right? Coz it > seems to me there is a lot more going on upstream in the operator execution > path that would be needed to
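
Mechanically, putting ColumnarBatches behind an Iterator[InternalRow] can look roughly like the sketch below (made-up names, not the actual reader); the question raised in this thread is whether the operator path upstream can take advantage of it.

    import java.util.Collections;
    import java.util.Iterator;
    import java.util.NoSuchElementException;

    import org.apache.spark.sql.catalyst.InternalRow;
    import org.apache.spark.sql.vectorized.ColumnarBatch;

    class RowIteratorAdapter {
      // Flatten an iterator of ColumnarBatches into an Iterator<InternalRow>;
      // each row is a cursor over the batch's column vectors, not a copy.
      static Iterator<InternalRow> asRows(Iterator<ColumnarBatch> batches) {
        return new Iterator<InternalRow>() {
          private Iterator<InternalRow> current = Collections.emptyIterator();

          @Override
          public boolean hasNext() {
            while (!current.hasNext() && batches.hasNext()) {
              current = batches.next().rowIterator();
            }
            return current.hasNext();
          }

          @Override
          public InternalRow next() {
            if (!hasNext()) {
              throw new NoSuchElementException();
            }
            return current.next();
          }
        };
      }
    }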

Re: Future of Iceberg Parquet Reader

2019-05-28 Thread Wes McKinney
On Tue, May 28, 2019 at 11:19 AM Daniel Weeks wrote: > > Hey Anton, > > #1) Part of the reason Iceberg has a custom reader is to help resolve some of > the Iceberg specific aspects of how parquet files are read (e.g. column > resolution by id, iceberg expressions). Also, it's been a struggle

Re: Future of Iceberg Parquet Reader

2019-05-28 Thread Daniel Weeks
Hey Anton, #1) Part of the reason Iceberg has a custom reader is to help resolve some of the Iceberg-specific aspects of how Parquet files are read (e.g. column resolution by id, Iceberg expressions). Also, it's been a struggle to get agreement on a good vectorized API. I don't believe the
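
To make the first point concrete, a hedged sketch of that reader path using the generic Iceberg APIs (method names from memory; the filter and field names are placeholders): the projection is resolved against the file by field id, and Iceberg expressions are pushed down into the Parquet reader.

    import org.apache.iceberg.Schema;
    import org.apache.iceberg.data.Record;
    import org.apache.iceberg.data.parquet.GenericParquetReaders;
    import org.apache.iceberg.expressions.Expressions;
    import org.apache.iceberg.io.CloseableIterable;
    import org.apache.iceberg.io.InputFile;
    import org.apache.iceberg.parquet.Parquet;

    class ReaderSketch {
      static CloseableIterable<Record> open(InputFile file, Schema projection) {
        return Parquet.read(file)
            .project(projection)                               // columns matched by field id, not name
            .filter(Expressions.greaterThanOrEqual("ts", 0L))  // Iceberg expression pushdown
            .createReaderFunc(fileSchema -> GenericParquetReaders.buildReader(projection, fileSchema))
            .build();
      }
    }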

Re: Future of Iceberg Parquet Reader

2019-05-28 Thread Wes McKinney
hi Anton, On point #5, I would suggest doing the work either in Apache Arrow or in the Parquet Java project -- we are developing both Parquet C++ and Rust codebases within the apache/arrow repository so I think you would find an active community there. I know that there has been a lot of interest

Future of Iceberg Parquet Reader

2019-05-28 Thread Anton Okolnychyi
Hi, I see more and more questions around the Iceberg Parquet reader. I think it would be useful to have a thread that clarifies all open questions and explains the long-term plan. 1. Am I correct that performance is the main reason to have a custom reader in Iceberg? Are there any other purposes?

Re: Approaching Vectorized Reading in Iceberg ..

2019-05-28 Thread Anton Okolnychyi
> You’re right that the first thing that Spark does is to get each row as > InternalRow. But we still get a benefit from vectorizing the data > materialization to Arrow itself. Spark execution is not vectorized, but that > can be updated in Spark later (I think there’s a proposal). > I am not
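
A toy illustration of what vectorizing the materialization means, independent of how Spark executes afterwards (the int[] stands in for a decoded Parquet column chunk; none of this is the actual reader code):

    import org.apache.arrow.memory.RootAllocator;
    import org.apache.arrow.vector.IntVector;

    class Materialize {
      // Decoded values for one column land in an Arrow vector in a tight,
      // columnar loop instead of being set field-by-field into row objects.
      static IntVector fill(int[] decoded, RootAllocator allocator) {
        IntVector vector = new IntVector("id", allocator);
        vector.allocateNew(decoded.length);
        for (int i = 0; i < decoded.length; i++) {
          vector.set(i, decoded[i]);
        }
        vector.setValueCount(decoded.length);
        return vector;
      }
    }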

Re: Need help trying to figure out if the issue on multiple partition specs on same field is a tracked issue or not

2019-05-28 Thread Anton Okolnychyi
Hm, this is actually a good question. My understanding is that we shouldn't explicitly define partitioning by year/month/day/hour on the same column. Instead, we should be fine with hour only. Iceberg produces ordinals for time-based partition functions. As far as I remember, Ryan was planning
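
In other words, a spec like the one sketched below should be enough on its own: the hour transform's ordinals can be rolled up to the coarser year/month/day granularities, so there is no need to also declare those on the same column (field names and ids are made up):

    import org.apache.iceberg.PartitionSpec;
    import org.apache.iceberg.Schema;
    import org.apache.iceberg.types.Types;

    class SpecSketch {
      static final Schema SCHEMA = new Schema(
          Types.NestedField.required(1, "event_time", Types.TimestampType.withZone()),
          Types.NestedField.required(2, "id", Types.LongType.get()));

      // hour("event_time") alone is sufficient; year/month/day on the same column are redundant
      static final PartitionSpec SPEC = PartitionSpec.builderFor(SCHEMA)
          .hour("event_time")
          .build();
    }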

Need help trying to figure out if the issue on multiple partition specs on same field is a tracked issue or not

2019-05-28 Thread filip
A while back I bumped into an issue with what seems to be an inconsistency in the partition spec API, or maybe it's just an implementation bug. Attempting to have multiple partition specs on the same schema field, I ran into the fact that while the API allows for multiple