master branch does not work with spark-shell 2.3.0

2019-07-24 Thread Lian Jiang
Hi,

I am new to Iceberg. I just built the master branch and ran the resulting
jars with spark-shell 2.3.0 on an EMR cluster.


spark-shell --jars gradle-wrapper.jar,iceberg-api-66fa048.jar,iceberg-common-66fa048.jar,iceberg-core-66fa048.jar,iceberg-data-66fa048.jar,iceberg-hive-66fa048.jar,iceberg-hive-66fa048-tests.jar,iceberg-orc-66fa048.jar,iceberg-parquet-66fa048.jar,iceberg-pig-66fa048.jar,iceberg-spark-66fa048.jar,iceberg-spark-runtime-66fa048.jar


Below is the error:


import spark.implicits._

val someDF = Seq(
  (8, "bat"),
  (64, "mouse"),
  (-27, "horse")
).toDF("number", "word")

someDF.write.format("iceberg").mode("overwrite").save("/tmp/hello.table")


java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;)V
  at org.apache.iceberg.spark.source.IcebergSource.createWriter(IcebergSource.java:72)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:254)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:225)
  ... 53 elided
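
For reference, here is a quick way to check which Guava jar the shell
actually resolved (a diagnostic sketch, not from the build; the
checkArgument(boolean, String, Object) overload in the stack trace only
exists in newer Guava releases, so an older Guava bundled by Spark or EMR
shadowing Iceberg's would explain this kind of NoSuchMethodError):

// Print the jar that Preconditions was loaded from; if it points at an old
// bundled Guava rather than the version Iceberg was built against, the
// error above is a classpath conflict.
println(classOf[com.google.common.base.Preconditions]
  .getProtectionDomain.getCodeSource.getLocation)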

Any idea what causes this error? Does it mean the master branch does not
work with Spark 2.3.0?

Thanks for any clue!


Re: Approaching Vectorized Reading in Iceberg ..

2019-07-24 Thread Daniel Weeks
Gautam,

I've created a branch off current master:
https://github.com/apache/incubator-iceberg/tree/vectorized-read

I've also created a milestone, so feel free to add issues and we can
associate them with the milestone:
https://github.com/apache/incubator-iceberg/milestone/2

-Dan

On Wed, Jul 24, 2019 at 4:21 PM Gautam  wrote:

> +1 on having a branch. Lemme know once you do; I'll rebase and open a PR
> against it.
>
> Will get back to you on perf numbers soon.
>
> On Wed, Jul 24, 2019 at 2:03 PM Ryan Blue  wrote:
>
>> Thanks Gautam!
>>
>> We'll start taking a look at your code. What do you think about creating
>> a branch in the Iceberg repository where we can work on improving it
>> together, before merging it into master?
>>
>> Also, you mentioned performance comparisons. Do you have any early
>> results to share?
>>
>> rb
>>
>> On Tue, Jul 23, 2019 at 3:40 PM Gautam  wrote:
>>
>>> Hello Folks,
>>>
>>> I have checked in a WIP branch [1] with a working version of vectorized
>>> reads for the Iceberg reader. Here's the diff [2].
>>>
>>> *Implementation Notes:*
>>>  - Iceberg's Reader adds a `SupportsScanColumnarBatch` mixin to instruct
>>> the DataSourceV2ScanExec to use `planBatchPartitions()` instead of the
>>> usual `planInputPartitions()`. It returns instances of `ColumnarBatch` on
>>> each iteration (see the sketch after these notes).
>>>  - `ArrowSchemaUtil` contains the Iceberg-to-Arrow type conversion. This
>>> was copied from [3], added by @Daniel Weeks. Thanks for that!
>>>  - `VectorizedParquetValueReaders` contains the ParquetValueReaders used
>>> for reading/decoding the Parquet row groups (aka page stores, as they are
>>> referred to in the code).
>>>  - `VectorizedSparkParquetReaders` contains the visitor implementations
>>> to map Parquet types to appropriate value readers. I implemented the struct
>>> visitor so that the root schema can be mapped properly. This has the added
>>> benefit of vectorization support for structs, so yay!
>>>  - For the initial version, the value readers read an entire row group
>>> into a single Arrow FieldVector. This, I'd imagine, will require tuning
>>> for the right batch size, but I've gone with one batch per row group for now.
>>>  - Arrow FieldVectors are wrapped using `ArrowColumnVector`, which is
>>> Spark's ColumnVector implementation backed by Arrow. This is the first
>>> contact point between the Spark and Arrow interfaces.
>>>  - ArrowColumnVectors are stitched together into a `ColumnarBatch` by
>>> `ColumnarBatchReader`. This is my replacement for `InternalRowReader`,
>>> which maps structs to columnar batches. This allows us to have nested
>>> structs, where each level of nesting would be a nested columnar batch.
>>> Lemme know what you think of this approach.
>>>  - I've added value readers for all supported primitive types listed in
>>> `AvroDataTest`. There's a corresponding test for the vectorized reader
>>> under `TestSparkParquetVectorizedReader`.
>>>  - I haven't fixed all the Checkstyle errors, so you will have to turn
>>> Checkstyle off in build.gradle, and skip tests while building... sorry! :-(
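>>>
>>> As a rough sketch of the Spark-facing wiring above (assuming Spark 2.4's
>>> DataSourceV2 API; `SketchReader` and its empty plan are placeholders, not
>>> the actual Iceberg reader):
>>>
>>> import java.util.{Collections, List => JList}
>>> import org.apache.arrow.vector.FieldVector
>>> import org.apache.spark.sql.sources.v2.reader.{DataSourceReader, InputPartition, SupportsScanColumnarBatch}
>>> import org.apache.spark.sql.types.StructType
>>> import org.apache.spark.sql.vectorized.{ArrowColumnVector, ColumnVector, ColumnarBatch}
>>>
>>> // The mixin: when present, DataSourceV2ScanExec plans batch partitions
>>> // instead of row partitions, and each partition's reader yields
>>> // ColumnarBatch instances.
>>> class SketchReader(schema: StructType) extends DataSourceReader
>>>     with SupportsScanColumnarBatch {
>>>   override def readSchema(): StructType = schema
>>>   override def planBatchInputPartitions(): JList[InputPartition[ColumnarBatch]] =
>>>     Collections.emptyList() // real code returns one partition per split
>>> }
>>>
>>> // The stitching: wrap each Arrow FieldVector in Spark's ArrowColumnVector
>>> // and assemble the wrapped columns into a single ColumnarBatch.
>>> def toBatch(vectors: Array[FieldVector], rowCount: Int): ColumnarBatch = {
>>>   val columns = vectors.map(v => new ArrowColumnVector(v): ColumnVector)
>>>   val batch = new ColumnarBatch(columns)
>>>   batch.setNumRows(rowCount)
>>>   batch
>>> }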
>>>
>>> *P.S.* There's some unused code under ArrowReader.java; ignore it. It's
>>> left over from my previous vectorization implementation, which I've kept
>>> around to compare performance against.
>>>
>>> Lemme know what folks think of the approach. I'm getting this working
>>> for our scale test benchmark and will report back with numbers. Feel free
>>> to run your own benchmarks and share.
>>>
>>> Cheers,
>>> -Gautam.
>>>
>>>
>>>
>>>
>>> [1] -
>>> https://github.com/prodeezy/incubator-iceberg/tree/issue-9-support-arrow-based-reading-WIP
>>> [2] -
>>> https://github.com/apache/incubator-iceberg/compare/master...prodeezy:issue-9-support-arrow-based-reading-WIP
>>> [3] -
>>> https://github.com/apache/incubator-iceberg/blob/72e3485510e9cbec05dd30e2e7ce5d03071f400d/core/src/main/java/org/apache/iceberg/arrow/ArrowSchemaUtil.java
>>>
>>>
>>> On Mon, Jul 22, 2019 at 2:33 PM Gautam  wrote:
>>>
 Will do. Doing a bit of housekeeping on the code and also adding more
 primitive type support.

 On Mon, Jul 22, 2019 at 1:41 PM Matt Cheah  wrote:

> Would it be possible to put the work-in-progress code in open source?
>
>
>
> *From: *Gautam 
> *Reply-To: *"dev@iceberg.apache.org" 
> *Date: *Monday, July 22, 2019 at 9:46 AM
> *To: *Daniel Weeks 
> *Cc: *Ryan Blue , Iceberg Dev List <
> dev@iceberg.apache.org>
> *Subject: *Re: Approaching Vectorized Reading in Iceberg ..
>
>
>
> That would be great!
>
>
>
> On Mon, Jul 22, 2019 at 9:12 AM Daniel Weeks 
> wrote:
>
> Hey Gautam,
>
>
>
> We also have a couple people looking into vectorized reading (into
> Arrow memory).  I think it would be good for us to get together and see if
> we can collaborate on a common approach for this.
>
>
>
> I'll reach out directly and see if we can get together.
>
>
>
> -Dan
>
>
>
>

Re: Approaching Vectorized Reading in Iceberg ..

2019-07-24 Thread Gautam
+1 on having a branch. Lemme know once you do; I'll rebase and open a PR
against it.

Will get back to you on perf numbers soon.

On Wed, Jul 24, 2019 at 2:03 PM Ryan Blue  wrote:

> Thanks Gautam!
>
> We'll start taking a look at your code. What do you think about creating a
> branch in the Iceberg repository where we can work on improving it
> together, before merging it into master?
>
> Also, you mentioned performance comparisons. Do you have any early results
> to share?
>
> rb
>
> On Tue, Jul 23, 2019 at 3:40 PM Gautam  wrote:
>
>> Hello Folks,
>>
>> I have checked in a WIP branch [1] with a working version of vectorized
>> reads for the Iceberg reader. Here's the diff [2].
>>
>> *Implementation Notes:*
>>  - Iceberg's Reader adds a `SupportsScanColumnarBatch` mixin to instruct
>> the DataSourceV2ScanExec to use `planBatchPartitions()` instead of the
>> usual `planInputPartitions()`. It returns instances of `ColumnarBatch` on
>> each iteration.
>>  - `ArrowSchemaUtil` contains the Iceberg-to-Arrow type conversion. This
>> was copied from [3], added by @Daniel Weeks. Thanks for that!
>>  - `VectorizedParquetValueReaders` contains the ParquetValueReaders used
>> for reading/decoding the Parquet row groups (aka page stores, as they are
>> referred to in the code).
>>  - `VectorizedSparkParquetReaders` contains the visitor implementations
>> to map Parquet types to appropriate value readers. I implemented the struct
>> visitor so that the root schema can be mapped properly. This has the added
>> benefit of vectorization support for structs, so yay!
>>  - For the initial version, the value readers read an entire row group
>> into a single Arrow FieldVector. This, I'd imagine, will require tuning
>> for the right batch size, but I've gone with one batch per row group for now.
>>  - Arrow FieldVectors are wrapped using `ArrowColumnVector`, which is
>> Spark's ColumnVector implementation backed by Arrow. This is the first
>> contact point between the Spark and Arrow interfaces.
>>  - ArrowColumnVectors are stitched together into a `ColumnarBatch` by
>> `ColumnarBatchReader`. This is my replacement for `InternalRowReader`,
>> which maps structs to columnar batches. This allows us to have nested
>> structs, where each level of nesting would be a nested columnar batch.
>> Lemme know what you think of this approach.
>>  - I've added value readers for all supported primitive types listed in
>> `AvroDataTest`. There's a corresponding test for the vectorized reader
>> under `TestSparkParquetVectorizedReader`.
>>  - I haven't fixed all the Checkstyle errors, so you will have to turn
>> Checkstyle off in build.gradle, and skip tests while building... sorry! :-(
>>
>> *P.S.* There's some unused code under ArrowReader.java; ignore it. It's
>> left over from my previous vectorization implementation, which I've kept
>> around to compare performance against.
>>
>> Lemme know what folks think of the approach. I'm getting this working for
>> our scale test benchmark and will report back with numbers. Feel free to
>> run your own benchmarks and share.
>>
>> Cheers,
>> -Gautam.
>>
>>
>>
>>
>> [1] -
>> https://github.com/prodeezy/incubator-iceberg/tree/issue-9-support-arrow-based-reading-WIP
>> [2] -
>> https://github.com/apache/incubator-iceberg/compare/master...prodeezy:issue-9-support-arrow-based-reading-WIP
>> [3] -
>> https://github.com/apache/incubator-iceberg/blob/72e3485510e9cbec05dd30e2e7ce5d03071f400d/core/src/main/java/org/apache/iceberg/arrow/ArrowSchemaUtil.java
>>
>>
>> On Mon, Jul 22, 2019 at 2:33 PM Gautam  wrote:
>>
>>> Will do. Doing a bit of housekeeping on the code and also adding more
>>> primitive type support.
>>>
>>> On Mon, Jul 22, 2019 at 1:41 PM Matt Cheah  wrote:
>>>
 Would it be possible to put the work-in-progress code in open source?



 *From: *Gautam 
 *Reply-To: *"dev@iceberg.apache.org" 
 *Date: *Monday, July 22, 2019 at 9:46 AM
 *To: *Daniel Weeks 
 *Cc: *Ryan Blue , Iceberg Dev List <
 dev@iceberg.apache.org>
 *Subject: *Re: Approaching Vectorized Reading in Iceberg ..



 That would be great!



 On Mon, Jul 22, 2019 at 9:12 AM Daniel Weeks 
 wrote:

 Hey Gautam,



 We also have a couple people looking into vectorized reading (into
 Arrow memory).  I think it would be good for us to get together and see if
 we can collaborate on a common approach for this.



 I'll reach out directly and see if we can get together.



 -Dan



 On Sun, Jul 21, 2019 at 10:35 PM Gautam 
 wrote:

 Figured this out. I'm returning the ColumnarBatch iterator directly, without
 projection, with the schema set appropriately in `readSchema()`. The empty
 result was due to valuesRead not being set correctly on FileIterator. Fixed
 that and things are working. Will circle back with numbers soon.



 On Fri, Jul 19, 2019 at 5:22 PM Gautam  wrote:

 Hey Guys,


Re: Approaching Vectorized Reading in Iceberg ..

2019-07-24 Thread Ryan Blue
Thanks Gautam!

We'll start taking a look at your code. What do you think about creating a
branch in the Iceberg repository where we can work on improving it
together, before merging it into master?

Also, you mentioned performance comparisons. Do you have any early results
to share?

rb

On Tue, Jul 23, 2019 at 3:40 PM Gautam  wrote:

> Hello Folks,
>
> I have checked in a WIP branch [1] with a working version of vectorized
> reads for the Iceberg reader. Here's the diff [2].
>
> *Implementation Notes:*
>  - Iceberg's Reader adds a `SupportsScanColumnarBatch` mixin to instruct
> the DataSourceV2ScanExec to use `planBatchPartitions()` instead of the
> usual `planInputPartitions()`. It returns instances of `ColumnarBatch` on
> each iteration.
>  - `ArrowSchemaUtil` contains the Iceberg-to-Arrow type conversion. This
> was copied from [3], added by @Daniel Weeks. Thanks for that!
>  - `VectorizedParquetValueReaders` contains the ParquetValueReaders used
> for reading/decoding the Parquet row groups (aka page stores, as they are
> referred to in the code).
>  - `VectorizedSparkParquetReaders` contains the visitor implementations to
> map Parquet types to appropriate value readers. I implemented the struct
> visitor so that the root schema can be mapped properly. This has the added
> benefit of vectorization support for structs, so yay!
>  - For the initial version, the value readers read an entire row group
> into a single Arrow FieldVector. This, I'd imagine, will require tuning
> for the right batch size, but I've gone with one batch per row group for now.
>  - Arrow FieldVectors are wrapped using `ArrowColumnVector`, which is
> Spark's ColumnVector implementation backed by Arrow. This is the first
> contact point between the Spark and Arrow interfaces.
>  - ArrowColumnVectors are stitched together into a `ColumnarBatch` by
> `ColumnarBatchReader`. This is my replacement for `InternalRowReader`,
> which maps structs to columnar batches. This allows us to have nested
> structs, where each level of nesting would be a nested columnar batch.
> Lemme know what you think of this approach.
>  - I've added value readers for all supported primitive types listed in
> `AvroDataTest`. There's a corresponding test for the vectorized reader
> under `TestSparkParquetVectorizedReader`.
>  - I haven't fixed all the Checkstyle errors, so you will have to turn
> Checkstyle off in build.gradle, and skip tests while building... sorry! :-(
>
> *P.S.* There's some unused code under ArrowReader.java; ignore it. It's
> left over from my previous vectorization implementation, which I've kept
> around to compare performance against.
>
> Lemme know what folks think of the approach. I'm getting this working for
> our scale test benchmark and will report back with numbers. Feel free to
> run your own benchmarks and share.
>
> Cheers,
> -Gautam.
>
>
>
>
> [1] -
> https://github.com/prodeezy/incubator-iceberg/tree/issue-9-support-arrow-based-reading-WIP
> [2] -
> https://github.com/apache/incubator-iceberg/compare/master...prodeezy:issue-9-support-arrow-based-reading-WIP
> [3] -
> https://github.com/apache/incubator-iceberg/blob/72e3485510e9cbec05dd30e2e7ce5d03071f400d/core/src/main/java/org/apache/iceberg/arrow/ArrowSchemaUtil.java
>
>
> On Mon, Jul 22, 2019 at 2:33 PM Gautam  wrote:
>
>> Will do. Doing a bit of housekeeping on the code and also adding more
>> primitive type support.
>>
>> On Mon, Jul 22, 2019 at 1:41 PM Matt Cheah  wrote:
>>
>>> Would it be possible to put the work-in-progress code in open source?
>>>
>>>
>>>
>>> *From: *Gautam 
>>> *Reply-To: *"dev@iceberg.apache.org" 
>>> *Date: *Monday, July 22, 2019 at 9:46 AM
>>> *To: *Daniel Weeks 
>>> *Cc: *Ryan Blue , Iceberg Dev List <
>>> dev@iceberg.apache.org>
>>> *Subject: *Re: Approaching Vectorized Reading in Iceberg ..
>>>
>>>
>>>
>>> That would be great!
>>>
>>>
>>>
>>> On Mon, Jul 22, 2019 at 9:12 AM Daniel Weeks  wrote:
>>>
>>> Hey Gautam,
>>>
>>>
>>>
>>> We also have a couple people looking into vectorized reading (into Arrow
>>> memory).  I think it would be good for us to get together and see if we can
>>> collaborate on a common approach for this.
>>>
>>>
>>>
>>> I'll reach out directly and see if we can get together.
>>>
>>>
>>>
>>> -Dan
>>>
>>>
>>>
>>> On Sun, Jul 21, 2019 at 10:35 PM Gautam  wrote:
>>>
>>> Figured this out. I'm returning the ColumnarBatch iterator directly, without
>>> projection, with the schema set appropriately in `readSchema()`. The empty
>>> result was due to valuesRead not being set correctly on FileIterator. Fixed
>>> that and things are working. Will circle back with numbers soon.
>>>
>>>
>>>
>>> On Fri, Jul 19, 2019 at 5:22 PM Gautam  wrote:
>>>
>>> Hey Guys,
>>>
>>>    Sorry about the delay on this. Just got back to getting a basic
>>> working implementation of vectorization for primitive types in Iceberg.
>>>
>>>
>>>
>>> *Here's what I have so far:*
>>>
>>>
>>>
>>> I have added `ParquetValueReader` implementations for some basic
>>> primitive types that build the respectiv

[DISCUSS] Implementation strategies for supporting Iceberg tables in Hive

2019-07-24 Thread Adrien Guillo
Hi Iceberg folks,

In the last few months, we (the data infrastructure team at Airbnb) have
been closely following the project. We are currently evaluating potential
strategies to migrate our data warehouse to Iceberg. However, we have a
very large Hive deployment, which means we can’t really do so without
support for Iceberg tables in Hive.

We have been thinking about implementation strategies. Here are some
thoughts that we would like to share with you:

– Implementing a new `RawStore`

This is something that has been mentioned several times on the mailing list
and seems to indicate that adding support for Iceberg tables in Hive could
be achieved without client-side modifications. Does that mean that the
Metastore is the only process manipulating Iceberg metadata (snapshots,
manifests)? Does that mean that, for instance, the `listPartition*` calls to
the Metastore return the DataFiles associated with each partition? Per our
understanding, supporting Iceberg tables in Hive with this strategy will
most likely require updating the RawStore interface AND at least some
client-side changes. In addition, with this strategy the Metastore bears new
responsibilities, which contradicts one of the Iceberg design goals:
offloading more work to jobs and removing the metastore as a bottleneck. In
the Iceberg world, not much is needed from the Metastore: it just keeps
track of the metadata location and provides a mechanism for atomically
updating this location (basically, what is done in the `HiveTableOperations`
class). We would like to design a solution that relies as little as possible
on the Metastore, so that in the future we have the option to replace our
fleet of Metastores with a simpler system.
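
To illustrate, a minimal sketch of that pointer swap (the in-memory
`MetadataPointer` below is a stand-in for the Metastore; only the
compare-and-swap shape mirrors `HiveTableOperations`, nothing else is real):

import java.util.concurrent.atomic.AtomicReference

// The Metastore's single responsibility in the Iceberg world: hold the
// current metadata location and update it atomically.
final class MetadataPointer(initial: String) {
  private val loc = new AtomicReference[String](initial)
  def read(): String = loc.get()
  // A commit succeeds only if nobody moved the pointer since we read it.
  def swap(expected: String, next: String): Boolean =
    loc.compareAndSet(expected, next)
}

val pointer = new MetadataPointer("hdfs://warehouse/db/t/metadata/v1.json")
val base = pointer.read()
// ... write snapshots/manifests and a new metadata file, then:
val committed = pointer.swap(base, "hdfs://warehouse/db/t/metadata/v2.json")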


– Implementing a new `HiveStorageHandler`

We are working on implementing custom `InputFormat` and `OutputFormat`
classes for Iceberg (more on that in the next paragraph) and they would fit
in nicely with the `HiveStorageHandler` and `HiveStoragePredicateHandler`
interfaces. However, the `HiveMetaHook` interface does not seem rich enough
to accommodate all the workflows; for instance, no hooks run on `ALTER ...`
or `INSERT ...` commands.



– Proof of concept

We set out to write a proof of concept that would allow us to learn and
experiment. We based our work on the 2.3 branch. Here’s the state of the
project and the paths we explored:

DDL commands
We support some commands such as `CREATE TABLE ...`, `DESC ...`, `SHOW
PARTITIONS`. They are all implemented in the client and mostly rely on the
`HiveCatalog` class to do the work.

Read path
We are in the process of implementing a custom `FileInputFormat` that
receives an Iceberg table identifier and a serialized filter expression
(`ExprNodeDesc`) as input. This is similar in a lot of ways to what you can
find in the `PigParquetReader` class in the `iceberg-pig` package or in the
`HiveHBaseTableInputFormat` class in Hive.


Write path
We have made less progress on that front, but we see a path forward:
implementing a custom `OutputFormat` that keeps track of the files being
written and gathers statistics. Each task can then dump this information to
HDFS. From there, the final Hive `MoveTask` can merge those “pre-manifest”
files to create a new snapshot and commit the new version of the table.
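
A sketch of that flow (all names here are hypothetical; only the shape,
per-task “pre-manifest” files merged by a final step, follows the paragraph
above):

case class DataFileEntry(path: String, rowCount: Long)

// Each task dumps the list of files it wrote, plus simple stats, as a
// "pre-manifest" (a real implementation would write to HDFS).
def writePreManifest(out: java.io.File, files: Seq[DataFileEntry]): Unit = {
  val w = new java.io.PrintWriter(out)
  try files.foreach(f => w.println(s"${f.path}\t${f.rowCount}"))
  finally w.close()
}

// The final MoveTask-like step merges all pre-manifests; a real
// implementation would turn these entries into a new snapshot and commit.
def mergePreManifests(preManifests: Seq[java.io.File]): Seq[DataFileEntry] =
  preManifests.flatMap { f =>
    val src = scala.io.Source.fromFile(f)
    try src.getLines().toList.map { line =>
      val Array(path, n) = line.split('\t')
      DataFileEntry(path, n.toLong)
    } finally src.close()
  }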


We hope that our observations will start a healthy conversation about
supporting Iceberg tables in Hive :)


Cheers,
Adrien