Re: [DISCUSS] Switch to JDK 11 for releases?

2023-04-24 Thread Mass Dosage
I agree with Ryan; unless you can change the source version there's not
that much point.

On the Hive front, as you can see from that ticket it's been open for 4(!)
years and hasn't received much action recently. I think it's one of the
reasons AWS EMR still defaults to Java 8. It would be really great if they
could finally push that one over the finish line.
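
For what it's worth, the --release idea that Jack raises further down the
thread would look roughly like this in Gradle. This is just a sketch assuming a
reasonably recent Gradle version, not something that is wired into our build
today:

// build.gradle: compile on a newer JDK but target Java 8 bytecode and APIs
tasks.withType(JavaCompile).configureEach {
  options.release = 8
}

The advantage of --release over plain source/target levels is that the compiler
also checks that only JDK 8 APIs are referenced, which is what matters if we
keep publishing a single set of JDK 8-compatible artifacts.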

On Sat, 22 Apr 2023 at 20:43, Ryan Blue  wrote:

> I think in order to publish multiple versions we would need to have
> different artifact names, like Scala uses (e.g. _2.12).
>
> It probably also wouldn't help. If we have to remain compatible with JDK
> 8, then publishing some artifacts for JDK 11 would still mean only using
> JDK 8 features. The source version is what we care about more, so if we
> can't change it then we can't really do anything else.
>
> On Sat, Apr 22, 2023 at 10:12 AM Jack Ye  wrote:
>
>> Would it be an option to use --release flag to control the release target
>> version, and publish 2 versions of the library to Maven, 1 for JDK8 and 1
>> for JDK11?
>>
>> Jack
>>
>> On Fri, Apr 21, 2023 at 5:17 PM Ryan Blue  wrote:
>>
>>> Looks like Hive isn't quite done migrating to Java 11:
>>> https://issues.apache.org/jira/browse/HIVE-22415
>>>
>>> I'm not sure whether that's still a problem, but we currently don't
>>> build Hive 3 support unless we're using Java 8. That makes me think that
>>> dropping JDK 8 support would probably also make it a lot more difficult for
>>> Hive to do releases based on Iceberg. Even with some of the integration
>>> moving into the Hive project, if we started shipping JDK 11 Jars then Hive
>>> would no longer be able to update.
>>>
>>> Ryan
>>>
>>> On Fri, Apr 21, 2023 at 5:02 PM Anton Okolnychyi
>>>  wrote:
>>>
 Sorry, I wasn’t clear that I also imply dropping JDK 8 (unless there is
 a good reason to keep it?).

 - Anton

 On Apr 21, 2023, at 4:59 PM, Ryan Blue  wrote:

 Would we also drop support for JDK 8?

 On Fri, Apr 21, 2023 at 4:58 PM Anton Okolnychyi <
 aokolnyc...@apple.com.invalid> wrote:

> Following up on the discussion in the Spark 2.4 thread, shall we move
> to JDK 11 for releases as Spark 2.4 support has been dropped?
>
> - Anton



 --
 Ryan Blue
 Tabular



>>>
>>> --
>>> Ryan Blue
>>> Tabular
>>>
>>
>
> --
> Ryan Blue
> Tabular
>


Re: Cannot build iceberg locally

2021-05-06 Thread Mass Dosage
Hello Taher,

Can you share a bit more of the error message you're seeing? Perhaps attach
a longer portion of the log showing all the gradle(?) output? Where exactly
is the problem occurring that you can't resolve classes in the relocated
package?

Thanks,

Adrian

On Thu, 6 May 2021 at 13:28, Taher Koitawala  wrote:

> Hi All,
>Very silly help needed. I am trying to work on the metadata
> file version test cases and I want to build iceberg locally. I cloned the
> master branch and ran
>
>- ./gradlew build -x test
>
> on the root directory. Everything builds; however, I am still not able to
> resolve the org.apache.iceberg.relocated package. What am I missing?
>
> Regards,
> Taher Koitawala
>


Re: introductory Iceberg blog post

2021-01-28 Thread Mass Dosage
Ah great, I wasn't aware there was such a thing, thank you Jack!

On Thu, 28 Jan 2021 at 20:19, Jack Ye  wrote:

> I have added it to the blog page PR:
> https://github.com/apache/iceberg/pull/2177
> -Jack
>
> On Thu, Jan 28, 2021 at 10:46 AM Ryan Blue 
> wrote:
>
>> Thanks for sharing this, Adrian!
>>
>> On Thu, Jan 28, 2021 at 1:54 AM Mass Dosage  wrote:
>>
>>> Hello all,
>>>
>>> As you may be aware Expedia Group helped contribute Hive read support to
>>> Iceberg last year. We finally got around to publishing a blog post about
>>> this which also includes an overview of Iceberg and why we think it's so
>>> useful. If you're interested you can read it here:
>>>
>>>
>>> https://medium.com/expedia-group-tech/a-short-introduction-to-apache-iceberg-d34f628b6799
>>>
>>> Thanks,
>>>
>>> Adrian
>>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>


introductory Iceberg blog post

2021-01-28 Thread Mass Dosage
Hello all,

As you may be aware Expedia Group helped contribute Hive read support to
Iceberg last year. We finally got around to publishing a blog post about
this which also includes an overview of Iceberg and why we think it's so
useful. If you're interested you can read it here:

https://medium.com/expedia-group-tech/a-short-introduction-to-apache-iceberg-d34f628b6799

Thanks,

Adrian


Re: Welcoming Peter Vary as a new committer!

2021-01-25 Thread Mass Dosage
Nice one, well done Peter!

On Mon, 25 Jan 2021 at 19:46, Daniel Weeks  wrote:

> Congratulations, Peter!
>
> On Mon, Jan 25, 2021, 11:27 AM Jungtaek Lim 
> wrote:
>
>> Congratulations Peter! Well deserved!
>>
>> On Tue, Jan 26, 2021 at 3:40 AM Wing Yew Poon 
>> wrote:
>>
>>> Congratulations Peter!
>>>
>>>
>>> On Mon, Jan 25, 2021 at 10:35 AM Russell Spitzer <
>>> russell.spit...@gmail.com> wrote:
>>>
 Congratulations!

 On Jan 25, 2021, at 12:34 PM, Jacques Nadeau 
 wrote:

 Congrats Peter! Thanks for all your great work

 On Mon, Jan 25, 2021 at 10:24 AM Ryan Blue  wrote:

> Hi everyone,
>
> I'd like to welcome Peter Vary as a new Iceberg committer.
>
> Thanks for all your contributions, Peter!
>
> rb
>
> --
> Ryan Blue
>




Re: S3 strong read-after-write consistency

2020-12-14 Thread Mass Dosage
I had a call with some developers from S3, asked about this, and they said
this change should resolve the "negative caching" issue.

Atomic renames are on their radar but they said this will take a lot of
work on their part.
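
On the S3FileIO question that Jungtaek raises further down the thread, a rough
sketch of how a Spark job could opt into it via catalog properties is below.
This assumes the io-impl catalog property and the S3FileIO class from the AWS
module, so treat the exact property and class names as illustrative:

spark-shell \
  --conf spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.my_catalog.type=hive \
  --conf spark.sql.catalog.my_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
  --conf spark.sql.catalog.my_catalog.warehouse=s3://my-bucket/warehouse

With something like that, the Hive catalog still tracks the metadata pointer,
but file reads and writes go through S3FileIO instead of the Hadoop FileSystem
API.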

On Fri, 4 Dec 2020 at 21:57, Ryan Blue  wrote:

> It isn't clear whether this S3 consistency change also fixes the negative
> caching (HEAD when file doesn't exist causes later HEAD to not see the
> file), but I think that it does not fix it because there was a PR opened to
> add consistency using LIST before a HEAD operation.
>
> I think it is still a good idea to use the new S3FileIO for S3 tables.
>
> On Wed, Dec 2, 2020 at 2:11 AM Jungtaek Lim 
> wrote:
>
>> What about S3FileIO implementation? I see some issue filed that even with
>> Hive catalog working with S3 brings unexpected issues, and S3FileIO
>> supposed to fix the issue (according to Ryan). Is it safe without S3FileIO
>> to use Hive catalog + Hadoop API for S3 now?
>>
>> On Wed, Dec 2, 2020 at 6:54 PM Vivekanand Vellanki wrote:
>>
>>> Iceberg tables backed by HadoopTables and HadoopCatalog require an
>>> atomic rename. This is not yet supported with S3.
>>>
>>> On Wed, Dec 2, 2020 at 3:20 PM Mass Dosage  wrote:
>>>
>>>> Hello all,
>>>>
>>>> Yesterday AWS announced that S3 now has strong read-after-write
>>>> consistency:
>>>>
>>>>
>>>> https://aws.amazon.com/blogs/aws/amazon-s3-update-strong-read-after-write-consistency
>>>>
>>>> https://aws.amazon.com/s3/consistency/
>>>>
>>>> Does this mean that Iceberg tables backed by HadoopTables and
>>>> HadoopCatalog can now be used on S3 in addition to HDFS?
>>>>
>>>> Thanks,
>>>>
>>>> Adrian
>>>>
>>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


S3 strong read-after-write consistency

2020-12-02 Thread Mass Dosage
Hello all,

Yesterday AWS announced that S3 now has strong read-after-write consistency:

https://aws.amazon.com/blogs/aws/amazon-s3-update-strong-read-after-write-consistency

https://aws.amazon.com/s3/consistency/

Does this mean that Iceberg tables backed by HadoopTables and HadoopCatalog
can now be used on S3 in addition to HDFS?

Thanks,

Adrian


Re: Iceberg/Hive properties handling

2020-11-27 Thread Mass Dosage
I like these suggestions, comments inline below on the last round...

On Thu, 26 Nov 2020 at 09:45, Zoltán Borók-Nagy 
wrote:

> Hi,
>
> The above aligns with what we did in Impala, i.e. we store information
> about table loading in HMS table properties. We are just a bit more
> explicit about which catalog to use.
> We have table property 'iceberg.catalog' to determine the catalog type,
> right now the supported values are 'hadoop.tables', 'hadoop.catalog', and
> 'hive.catalog'. Additional table properties can be set based on the catalog
> type.
>
> So, if the value of 'iceberg.catalog' is
>

I'm all for renaming this; having "mr" in the property name is confusing.


>
>- hadoop.tables
>   - the table location is used to load the table
>
The only question I have is whether we should have this as the default, i.e. if
you don't set a catalog it will assume it's HadoopTables and use the location.
Or should we require this property to always be present, to be consistent
and avoid any "magic"?


>
>- hadoop.catalog
>   - Required table property 'iceberg.catalog_location' specifies the
>   location of the hadoop catalog in the file system
>   - Optional table property 'iceberg.table_identifier' specifies the
>   table id. If it's not set, then <database_name>.<table_name> is used as
>   table identifier
>
I like this as it would allow you to use a different database and table
name in Hive as opposed to the Hadoop Catalog - at the moment they have to
match. The only thing here is that I think Hive requires a table LOCATION
to be set, and it's then confusing as there are now two locations on the
table. I'm not sure whether in the Hive storage handler or SerDe etc. we
can get Hive to not require that, and maybe even disallow it from being set.
That would probably be best in conjunction with this. Another solution
would be to not have the 'iceberg.catalog_location' property but instead
use the table LOCATION for this, but that's a bit confusing from a Hive
point of view.
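
To make that concrete, here is a purely illustrative sketch of what such a
table could look like with the proposed properties. The property names are the
ones proposed above and nothing here is agreed syntax:

CREATE EXTERNAL TABLE hive_db.customers
STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'
-- Hive insists on a location, even though 'iceberg.catalog_location' is what
-- would actually be used to load the table
LOCATION 'hdfs:///warehouse/hadoop_catalog/db/customers'
TBLPROPERTIES (
  'iceberg.catalog'='hadoop.catalog',
  'iceberg.catalog_location'='hdfs:///warehouse/hadoop_catalog',
  'iceberg.table_identifier'='db.customers'
);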


>- hive.catalog
>   - Optional table property 'iceberg.table_identifier' specifies the
>   table id. If it's not set, then <database_name>.<table_name> is used as
>   table identifier
>   - We have the assumption that the current Hive metastore stores the
>   table, i.e. we don't support external Hive metastores currently
>
These sound fine for Hive catalog tables that are created outside of the
automatic Hive table creation (see https://iceberg.apache.org/hive/ ->
Using Hive Catalog); we'd just need to document how you can create these
yourself and that one could use a different Hive database and table name, etc.


> Independent of catalog implementations, but we also have table property
> 'iceberg.file_format' to specify the file format for the data files.
>

OK, I don't think we need that for Hive?


> We haven't released it yet, so we are open to changes, but I think these
> properties are reasonable and it would be great if we could standardize the
> properties across engines that use HMS as the primary metastore of tables.
>
>
If others agree, I think we should create an issue where we document the
above changes so it's very clear what we're doing, and can then go and
implement them and update the docs etc.


> Cheers,
> Zoltan
>
>
> On Thu, Nov 26, 2020 at 2:20 AM Ryan Blue 
> wrote:
>
>> Yes, I think that is a good summary of the principles.
>>
>> #4 is correct because we provide some information that is informational
>> (Hive schema) or tracked only by the metastore (best-effort current user).
>> I also agree that it would be good to have a table identifier in HMS table
>> metadata when loading from an external table. That gives us a way to handle
>> name conflicts.
>>
>> On Wed, Nov 25, 2020 at 5:14 PM Jacques Nadeau 
>> wrote:
>>
>>> Minor error, my last example should have been:
>>>
>>> db1.table1_etl_branch => nessie.folder1.folder2.folder3.table1@etl_branch
>>>
>>> --
>>> Jacques Nadeau
>>> CTO and Co-Founder, Dremio
>>>
>>>
>>> On Wed, Nov 25, 2020 at 4:56 PM Jacques Nadeau 
>>> wrote:
>>>
 I agree with Ryan on the core principles here. As I understand them:

1. Iceberg metadata describes all properties of a table
2. Hive table properties describe "how to get to" Iceberg metadata
(which catalog + possibly ptr, path, token, etc)
3. There could be default "how to get to" information set at a
global level
4. Best-effort schema should be stored in the table properties in
HMS. This should be done for information schema retrieval purposes within
Hive but should be ignored during Hive/other tool execution.

 Is that a fair summary of your statements Ryan (except 4, which I just
 added)?

 One comment I have on #2 is that for different catalogs and use cases,
 I think it can be somewhat more complex where it would be desirable for a
 table that initially existed without Hive that was later exposed in Hive to
 support a ptr/path/token for how the table 

Re: CI logging question

2020-11-23 Thread Mass Dosage
Thanks for following up here with the solution and the steps for accessing
the logs!

On Mon, 23 Nov 2020 at 08:59, Peter Vary  wrote:

> Hi Team,
>
> Ryan pushed my changes. Thanks for the review and the merge!
>
> The final solution was to create a log file for every package which will
> contain the StdErr / StdOut of the tests. These will be stored in the
> /build/testlogs/.log file.
> Like /build/testlogs/iceberg-parquet.log:
>
> 
> - Test log for: Test 
> testRowGroupSizeConfigurableWithWriter(org.apache.iceberg.parquet.TestParquet)
> 
> StdErr log4j:WARN No appenders could be found for logger 
> (org.apache.hadoop.util.NativeCodeLoader).
> StdErr log4j:WARN Please initialize the log4j system properly.
> StdErr log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig 
> for more info.
> 
> - Test log for: Test 
> testListProjection(org.apache.iceberg.avro.TestParquetReadProjection)
> 
> StdErr [Test worker] INFO 
> org.apache.parquet.hadoop.InternalParquetRecordReader - RecordReader 
> initialized will read a total of 1 records.
> StdErr [Test worker] INFO 
> org.apache.parquet.hadoop.InternalParquetRecordReader - at row 0. reading 
> next block
> StdErr [Test worker] INFO 
> org.apache.parquet.hadoop.InternalParquetRecordReader - block read in memory 
> in 1 ms. row count = 1
>
> If there is a failure in the CI run then these logs are archived. "By
> default, GitHub stores build logs and artifacts for 90 days", and they are 
> accessible
> through the "Artifacts/test logs" on the top right corner of the failed run.
> See:
>
> This could help us investigate flaky failures. That said, if you find
> flaky tests for the Hive/Tez related tests, please notify me, Laszlo Pinter
> or Marton Bod.
>
> Thanks,
> Peter
>
>
> On Nov 19, 2020, at 16:02, Peter Vary  wrote:
>
> Created the pull request for it:
> https://github.com/apache/iceberg/pull/1789
>
> You can turn it on for manual builds by
>
> *export CI=true*
>
>
> Any reviewers would be welcome!
> Thanks,
> Peter
>
> On Nov 18, 2020, at 10:11, Mass Dosage  wrote:
>
> I can definitely see how having more detailed logs could be useful so I
> like what you're suggesting. I guess another option could be to make this
> configurable so you can pass in an argument to turn on the
> "showStandardStreams", by default it's false but while you're debugging
> this issue it would be turned on?
>
> On Wed, 18 Nov 2020 at 09:03, Peter Vary 
> wrote:
>
>> Hi Team,
>>
>> Recently I have been working on trying to reproduce the following CI
>> failure without success:
>>
>>
>>
>>
>>
>> *org.apache.iceberg.mr.hive.TestHiveIcebergStorageHandlerWithCustomCatalog
>> > testScanTable[fileFormat=PARQUET, engine=tez] FAILED
>> java.lang.IllegalArgumentException: Failed to execute Hive query 'SELECT *
>> FROM default.customers ORDER BY customer_id DESC': Error while processing
>> statement: FAILED: Execution Error, return code 1 from
>> org.apache.hadoop.hive.ql.exec.tez.TezTaskCaused by:
>> org.apache.hive.service.cli.HiveSQLException: Error while processing
>> statement: FAILED: Execution Error, return code 1 from
>> org.apache.hadoop.hive.ql.exec.tez.TezTask*
>>
>>
>> Since I was unsuccessful reproing the case, and the provided error
>> message in CI logs are not really helpful this means I can not fix this
>> flaky test for now. :(
>>
>> After Marton Bods changes for adding logs for tests (
>> https://github.com/apache/iceberg/pull/1712), we could have more info
>> about the failures in the test logs (
>> *build/test-results/test/binary/output.bin*), but I am not sure if that
>> is retained and accessible after a CI run.
>>
>> I would like to propose adding the following to the build.gradle for the
>> CI runs:
>>
>> test {
>>   testLogging {
>>     if ("true".equalsIgnoreCase(System.getenv('CI'))) {
>>       events "failed", "passed"
>>       testLogging.showStandardStreams = true
>>     } else {
>>       events "failed"
>>     }
>>     exceptionFormat "full"
>>   }
>> }
>>
>>
>> This would add the logs printed during the tests to the standard output
>> for the CI runs. Example can be seen here (
>> https://github.com/pvary/iceberg/runs/1405960983) - only enabled
>> standard streams for the hive related tests in this patch to see the
>> results.
>>
>> Pros:
>>
>>- Easily accessible log information for the failed runs
>>
>> Cons:
>>
>>- Harder to read CI logs
>>- Possible cost associated with retaining the logs
>>
>>
>> I think having more logs would be great, but I am not sure who pays the
>> bill and whether having bigger logs could cause any problem and whether the
>> CI is able to handle the increased amount of data.
>>
>> Any thoughts, comments, ideas?
>>
>> Thanks,
>> Peter
>>
>
>
>


Re: Proposal for additional fields in Iceberg manifest files

2020-11-20 Thread Mass Dosage
+1 - I also like the idea of having more data profiling info for the
partition, but I worry about hostnames and IP addresses and maintaining those
as things change, especially if you have hundreds of hosts. I'd rather
leave that to the name node.

On Fri, 20 Nov 2020 at 17:48, Ryan Blue  wrote:

> Thanks Vivekanand!
>
> I made some comments on the doc. Overall, I think a partition index is a
> good idea. We've thought about adding sketches that contain skew estimates
> for certain columns in a partition so that we can do better join
> estimation. Getting a start on how we would store data like this is a good
> step.
>
> I'm a bit more skeptical about locality information, since it would get
> out of date and require rewriting old, large manifests.
>
> On Fri, Nov 20, 2020 at 1:44 AM Vivekanand Vellanki 
> wrote:
>
>> Hi,
>>
>> I would like to propose additional fields in Iceberg manifest files
>> 
>> to support the following scenarios:
>>
>>- Partition index to include per-partition stats to help support
>>planning
>>- Data locality information to support split assignment in
>>distributed query engines
>>
>> Comments are welcome.
>>
>> --
>> Thanks
>> Vivek
>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re: CI logging question

2020-11-18 Thread Mass Dosage
I can definitely see how having more detailed logs could be useful, so I
like what you're suggesting. I guess another option could be to make this
configurable: you could pass in an argument to turn on
"showStandardStreams", which would be false by default but could be turned
on while you're debugging this issue.
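
Something along these lines is roughly what I had in mind - a sketch only,
using a Gradle project property instead of the CI environment variable (the
property name is just an example):

// enable with: ./gradlew test -PshowStandardStreams
test {
  testLogging {
    events "failed"
    exceptionFormat "full"
    showStandardStreams = project.hasProperty('showStandardStreams')
  }
}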

On Wed, 18 Nov 2020 at 09:03, Peter Vary  wrote:

> Hi Team,
>
> Recently I have been working on trying to reproduce the following CI
> failure without success:
>
>
>
>
>
> *org.apache.iceberg.mr.hive.TestHiveIcebergStorageHandlerWithCustomCatalog
> > testScanTable[fileFormat=PARQUET, engine=tez] FAILED
> java.lang.IllegalArgumentException: Failed to execute Hive query 'SELECT *
> FROM default.customers ORDER BY customer_id DESC': Error while processing
> statement: FAILED: Execution Error, return code 1 from
> org.apache.hadoop.hive.ql.exec.tez.TezTaskCaused by:
> org.apache.hive.service.cli.HiveSQLException: Error while processing
> statement: FAILED: Execution Error, return code 1 from
> org.apache.hadoop.hive.ql.exec.tez.TezTask*
>
>
> Since I was unsuccessful reproing the case, and the provided error message
> in CI logs are not really helpful this means I can not fix this flaky test
> for now. :(
>
> After Marton Bods changes for adding logs for tests (
> https://github.com/apache/iceberg/pull/1712), we could have more info
> about the failures in the test logs (
> *build/test-results/test/binary/output.bin*), but I am not sure if that
> is retained and accessible after a CI run.
>
> I would like to propose adding the following to the build.gradle for the
> CI runs:
>
> test {
>   testLogging {
>     if ("true".equalsIgnoreCase(System.getenv('CI'))) {
>       events "failed", "passed"
>       testLogging.showStandardStreams = true
>     } else {
>       events "failed"
>     }
>     exceptionFormat "full"
>   }
> }
>
>
> This would add the logs printed during the tests to the standard output
> for the CI runs. Example can be seen here (
> https://github.com/pvary/iceberg/runs/1405960983) - only enabled standard
> streams for the hive related tests in this patch to see the results.
>
> Pros:
>
>- Easily accessible log information for the failed runs
>
> Cons:
>
>- Harder to read CI logs
>- Possible cost associated with retaining the logs
>
>
> I think having more logs would be great, but I am not sure who pays the
> bill and whether having bigger logs could cause any problem and whether the
> CI is able to handle the increased amount of data.
>
> Any thoughts, comments, ideas?
>
> Thanks,
> Peter
>


Re: [VOTE] Release Apache Iceberg 0.10.0 RC5

2020-11-09 Thread Mass Dosage
+1 (non-binding)

I tested the Hive read path in distributed mode for HadoopTables-backed
Iceberg tables and it worked fine.

On Sun, 8 Nov 2020 at 18:06, Anton Okolnychyi 
wrote:

> Hi everyone,
>
> I propose the following RC to be released as official Apache Iceberg
> 0.10.0 release.
>
> The commit id is c344762c8ad11d67da16dc2ee678eb542ea4c495
> * This corresponds to the tag: apache-iceberg-0.10.0-rc5
> * https://github.com/apache/iceberg/commits/apache-iceberg-0.10.0-rc5
> *
> https://github.com/apache/iceberg/tree/c344762c8ad11d67da16dc2ee678eb542ea4c495
>
> The release tarball, signature, and checksums are here:
> * https://dist.apache.org/repos/dist/dev/iceberg/apache-iceberg-0.10.0-rc5
>
> You can find the KEYS file here (make sure to import the new key that was
> used to sign the release):
> * https://dist.apache.org/repos/dist/dev/iceberg/KEYS
>
> Convenience binary artifacts are staged in Nexus. The Maven repository URL
> is:
> * https://repository.apache.org/content/repositories/orgapacheiceberg-1013
>
> This release includes important changes:
>
> * Flink support
> * Hive read support
> * ORC support fixes and improvements
> * Application of row-level delete files on read
> * Snapshot partition summary
> * Ability to load LocationProvider dynamically
> * Sort spec
>
> Please download, verify, and test.
>
> Please vote in the next 72 hours.
>
> [ ] +1 Release this as Apache Iceberg 0.10.0
> [ ] +0
> [ ] -1 Do not release this because…
>
> Thanks,
> Anton
>


Re: [VOTE] Release Apache Iceberg 0.10.0 RC4

2020-11-05 Thread Mass Dosage
+1 non-binding on RC4. I tested out the Hive read path on a distributed
cluster using HadoopTables.

On Thu, 5 Nov 2020 at 04:46, Dongjoon Hyun  wrote:

> +1 for 0.10.0 RC4.
>
> Bests,
> Dongjoon.
>
> On Wed, Nov 4, 2020 at 7:17 PM Jingsong Li  wrote:
>
>> +1
>>
>> 1. Download the source tarball, signature (.asc), and checksum
>> (.sha512):   OK
>> 2. Import gpg keys: download KEYS and run gpg --import
>> /path/to/downloaded/KEYS (optional if this hasn’t changed) :  OK
>> 3. Verify the signature by running: gpg --verify
>> apache-iceberg-xx.tar.gz.asc:  OK
>> 4. Verify the checksum by running: sha512sum -c
>> apache-iceberg-xx.tar.gz.sha512 :  OK
>> 5. Untar the archive and go into the source directory: tar xzf
>> apache-iceberg-xx.tar.gz && cd apache-iceberg-xx:  OK
>> 6. Run RAT checks to validate license headers: dev/check-license: OK
>> 7. Build and test the project: ./gradlew build (use Java 8) :   OK
>>
>> Best,
>> Jingsong
>>
>> On Thu, Nov 5, 2020 at 7:38 AM Ryan Blue 
>> wrote:
>>
>>> +1
>>>
>>>- Validated checksum and signature
>>>- Ran license checks
>>>- Built and ran tests
>>>- Queried a Hadoop FS table created with 0.9.0 in Spark 3.0.1
>>>- Created a Hive table from Spark 3.0.1
>>>- Tested metadata tables from Spark
>>>- Tested Hive and Hadoop table reads in Hive 2.3.7
>>>
>>> I was able to read both Hadoop and Hive tables created in Spark from
>>> Hive using:
>>>
>>> add jar /home/blue/Downloads/iceberg-hive-runtime-0.10.0.jar;
>>> create external table hadoop_table
>>>   stored by 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'
>>>   location 'file:/home/blue/tmp/hadoop-warehouse/default/test';
>>> select * from hadoop_table;
>>>
>>> set iceberg.mr.catalog=hive;
>>> select * from hive_table;
>>>
>>> The hive_table needed engine.hive.enabled=true set in table properties
>>> by Spark using:
>>>
>>> alter table hive_table set tblproperties ('engine.hive.enabled'='true')
>>>
>>> Hive couldn’t read the #snapshots metadata table for Hadoop. It failed
>>> with this error:
>>>
>>> Failed with exception 
>>> java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: 
>>> java.lang.ClassCastException: java.lang.Long cannot be cast to 
>>> java.time.OffsetDateTime
>>>
>>> I also couldn’t read the Hadoop table once iceberg.mr.catalog was set
>>> in my environment, so I think we have a bit more work to do to clean up
>>> Hive table configuration.
>>>
>>> On Wed, Nov 4, 2020 at 12:54 AM Ryan Murray  wrote:
>>>
 +1 (non-binding)

 1. Download the source tarball, signature (.asc), and checksum
 (.sha512):   OK
 2. Import gpg keys: download KEYS and run gpg --import
 /path/to/downloaded/KEYS (optional if this hasn’t changed) :  OK
 3. Verify the signature by running: gpg --verify
 apache-iceberg-xx.tar.gz.asc:  I got a warning "gpg: WARNING: This key is
 not certified with a trusted signature! gpg:  There is no
 indication that the signature belongs to the owner." but it passed
 4. Verify the checksum by running: sha512sum -c
 apache-iceberg-xx.tar.gz.sha512 :  OK
 5. Untar the archive and go into the source directory: tar xzf
 apache-iceberg-xx.tar.gz && cd apache-iceberg-xx:  OK
 6. Run RAT checks to validate license headers: dev/check-license: OK
 7. Build and test the project: ./gradlew build (use Java 8 & Java 11)
 :   OK


 On Wed, Nov 4, 2020 at 2:56 AM OpenInx  wrote:

> +1 for 0.10.0 RC4
>
> 1. Download the source tarball, signature (.asc), and checksum
> (.sha512):   OK
> 2. Import gpg keys: download KEYS and run gpg --import
> /path/to/downloaded/KEYS (optional if this hasn’t changed) :  OK
> 3. Verify the signature by running: gpg --verify
> apache-iceberg-xx.tar.gz.asc:  OK
> 4. Verify the checksum by running: sha512sum -c
> apache-iceberg-xx.tar.gz.sha512 :  OK
> 5. Untar the archive and go into the source directory: tar xzf
> apache-iceberg-xx.tar.gz && cd apache-iceberg-xx:  OK
> 6. Run RAT checks to validate license headers: dev/check-license: OK
> 7. Build and test the project: ./gradlew build (use Java 8) :   OK
>
> On Wed, Nov 4, 2020 at 8:25 AM Anton Okolnychyi
>  wrote:
>
>> Hi everyone,
>>
>> I propose the following RC to be released as official Apache Iceberg
>> 0.10.0 release.
>>
>> The commit id is d39fad00b7dded98121368309f381473ec21e85f
>> * This corresponds to the tag: apache-iceberg-0.10.0-rc4
>> * https://github.com/apache/iceberg/commits/apache-iceberg-0.10.0-rc4
>> *
>> https://github.com/apache/iceberg/tree/d39fad00b7dded98121368309f381473ec21e85f
>>
>> The release tarball, signature, and checksums are here:
>> *
>> https://dist.apache.org/repos/dist/dev/iceberg/apache-iceberg-0.10.0-rc4/
>>
>> You can find the KEYS file here (make sure to import the new key that
>> was used to sign the release):

Re: [VOTE] Release Apache Iceberg 0.10.0 RC2

2020-11-02 Thread Mass Dosage
+1 (non-binding)

I ran the RC against a set of integration tests I have for a subset of the
Hive2 read functionality on a distributed cluster and it worked fine.
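
For anyone else verifying the candidate, the steps quoted further down this
thread could be scripted roughly like this. This is only a sketch; the exact
tarball name and dist URL should be taken from the vote email:

# placeholders for version, RC number and file names
RC_URL=https://dist.apache.org/repos/dist/dev/iceberg/apache-iceberg-0.10.0-rc2
curl -L https://dist.apache.org/repos/dist/dev/iceberg/KEYS | gpg --import
curl -LO $RC_URL/apache-iceberg-0.10.0.tar.gz
curl -LO $RC_URL/apache-iceberg-0.10.0.tar.gz.asc
curl -LO $RC_URL/apache-iceberg-0.10.0.tar.gz.sha512
gpg --verify apache-iceberg-0.10.0.tar.gz.asc
sha512sum -c apache-iceberg-0.10.0.tar.gz.sha512
tar xzf apache-iceberg-0.10.0.tar.gz && cd apache-iceberg-0.10.0
dev/check-license
./gradlew build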

On Mon, 2 Nov 2020 at 04:05, Simon Su  wrote:

> + 1 (non-binding)
> 1. Build code pass all UTs.
> 2. Test Flink iceberg sink failover, test exactly-once.
>
> On Mon, Nov 2, 2020 at 11:35 AM Junjie Chen wrote:
>
>> + 1 (non-binding)
>>
>> I ran step 1-7 on my cloud virtual machine (centos 7, java 1.8.0_171),
>> all passed.
>>
>>
>>
>>
>>
>> On Mon, Nov 2, 2020 at 10:36 AM OpenInx  wrote:
>>
>>> +1 for 0.10.0 RC2
>>>
>>> 1. Download the source tarball, signature (.asc), and checksum
>>> (.sha512):   OK
>>> 2. Import gpg keys: download KEYS and run gpg --import
>>> /path/to/downloaded/KEYS (optional if this hasn’t changed) :  OK
>>> 3. Verify the signature by running: gpg --verify
>>> apache-iceberg-xx-incubating.tar.gz.asc:  OK
>>> 4. Verify the checksum by running: sha512sum -c
>>> apache-iceberg-xx-incubating.tar.gz.sha512 :  OK
>>> 5. Untar the archive and go into the source directory: tar xzf
>>> apache-iceberg-xx-incubating.tar.gz && cd apache-iceberg-xx-incubating:  OK
>>> 6. Run RAT checks to validate license headers: dev/check-license: OK
>>> 7. Build and test the project: ./gradlew build (use Java 8) :   OK
>>>
>>> BTW,  I think we may need a bash script to do the above verification
>>> automatically ,  so I created an issue for it:
>>> https://github.com/apache/iceberg/issues/1700
>>>
>>> Thanks all for the work.
>>>
>>> On Sun, Nov 1, 2020 at 5:40 AM Dongjoon Hyun 
>>> wrote:
>>>
 +1 for 0.10.0 RC2 (non-binding)

 I checked the followings.
 1. Checksum and GPG signatures
 2. Gradle build and tests on Java 1.8.0_272
 3. Manual integration tests with Hive Metastore 2.3.7 and Apache Spark
 2.3.7/3.0.1.

 Thank you!

 Bests,
 Dongjoon.


 On Fri, Oct 30, 2020 at 3:25 PM Russell Spitzer <
 russell.spit...@gmail.com> wrote:

> +1 (non-binding) Downloaded and ran build with Java HotSpot(TM) 64-Bit
> Server VM 18.9 (build 11.0.7+8-LTS, mixed mode). All tests passed :)
>
> On Fri, Oct 30, 2020 at 4:05 PM Anton Okolnychyi
>  wrote:
>
>> Here is the link to steps we normally use to validate a release
>> candidate:
>>
>> https://lists.apache.org/thread.html/rd5e6b1656ac80252a9a7d473b36b6227da91d07d86d4ba4bee10df66%40%3Cdev.iceberg.apache.org%3E
>> 
>>
>> - Anton
>>
>> On 30 Oct 2020, at 14:03, Anton Okolnychyi <
>> aokolnyc...@apple.com.INVALID> wrote:
>>
>> Hi everyone,
>>
>> I propose the following RC to be released as official Apache Iceberg
>> 0.10.0 release.
>>
>> The commit id is 37f21b72fb55503e6e40b1555b7ea1af61dfdfc7
>> * This corresponds to the tag: apache-iceberg-0.10.0-rc2
>> * https://github.com/apache/iceberg/commits/apache-iceberg-0.10.0-rc2
>> *
>> https://github.com/apache/iceberg/tree/37f21b72fb55503e6e40b1555b7ea1af61dfdfc7
>>
>> The release tarball, signature, and checksums are here:
>> *
>> https://dist.apache.org/repos/dist/dev/iceberg/apache-iceberg-0.10.0-rc2/
>>
>> You can find the KEYS file here (make sure to import the new key that
>> was used to sign the release):
>> * https://dist.apache.org/repos/dist/dev/iceberg/KEYS
>>
>> Convenience binary artifacts are staged in Nexus. The Maven
>> repository URL is:
>> *
>> https://repository.apache.org/content/repositories/orgapacheiceberg-1011/
>>
>> This release includes important changes:
>>
>> * Flink support
>> * Hive read support
>> * ORC support fixes and improvements
>> * Application of row-level delete files on read
>> * Snapshot partition summary
>> * Ability to load LocationProvider dynamically
>> * Sort spec
>>
>> Please download, verify, and test.
>>
>> Please vote in the next 72 hours.
>>
>> [ ] +1 Release this as Apache Iceberg 0.10.0
>> [ ] +0
>> [ ] -1 Do not release this because…
>>
>> Thanks,
>> Anton
>>
>>
>>
>>
>>
>> --
>> Best Regards
>>
>


Re: Travis build question

2020-09-16 Thread Mass Dosage
This is what it's failing with, right?

org.apache.iceberg.hadoop.TestHadoopCatalog > testVersionHintFile FAILED
org.apache.iceberg.exceptions.NoSuchTableException: Table does not
exist: tbl
at 
org.apache.iceberg.BaseMetastoreCatalog.loadTable(BaseMetastoreCatalog.java:108)
at 
org.apache.iceberg.hadoop.TestHadoopCatalog.testVersionHintFile(TestHadoopCatalog.java:522)


I haven't seen that build failure on Travis before. In the past there were
some issues with tests failing due to timezones, but those were all fixed.
Have you tried merging master into your branch recently?

On Wed, 16 Sep 2020 at 17:15, Peter Vary  wrote:

> Hi Team,
>
> Struggling with a PR (https://github.com/apache/iceberg/pull/1465) where
> a test is green on my runs in IntelliJ, and also green if I run the test
> with command line, and I even run them successfully on linux with the
> command:
>
> ./gradlew :iceberg-core:test
>
>
> The problem is that the test is failing on travis. Any quick ideas how is
> the travis env different from the various test environments above?
>
> Thanks,
> Peter
>


Re: Iceberg sync notes - 9 September 2020

2020-09-15 Thread Mass Dosage
I'm fine with not waiting for Hive projection. What is in master now is
enough to do an end-to-end Hive read; I'd prefer to have that out there
sooner so we can start trying it out, as opposed to delaying this release
for the projection.

Thanks,

Adrian

On Mon, 14 Sep 2020 at 23:38, Ryan Blue  wrote:

> Hi everyone,
>
> I just update the Iceberg sync doc
> 
> with my notes. Feel free to add corrections or additional context!
>
> There was quite a bit of discussion, so I want to highlight a few things
> that we talked about for more discussion on the dev list:
>
> 1. 0.10.0 blocker issues
> - Java 11 flaky tests (Fixed in PR #1446
> )
> - Flink checkpoint Java serialization errors (PR #1438
> )
> - Probably will *not* wait for Hive projection
> - Please bring up any other blockers!
> 2. The general consensus was that adding a time offset parameter (PR #1368
> ) is not a good solution.
> Instead we should consider using hourly partitioning or adding custom
> partition functions.
> 3. We discussed trying to make snapshot timestamps monotonically
> increasing, but though that it was probably not worth pursuing (already
> mentioned on the dev list thread).
>
> rb
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re: Upgrade components to Hive 3 and Hadoop 3

2020-09-14 Thread Mass Dosage
+1 for doing this in a way that keeps Hive 2 support as that's still our
primary Hive version in production and will be for quite some time.

On Mon, 14 Sep 2020 at 09:00, Marton Bod  wrote:

> Hi Ryan,
>
> Thanks, I absolutely agree with you that we should keep support for Hive2
> as well. I've created a draft PR, which highlights the changes:
> https://github.com/apache/iceberg/pull/1455
> We can continue the discussion there.
>
> Thanks,
>
> Marton
>
> On Fri, 11 Sep 2020 at 22:20, Ryan Blue  wrote:
>
>> Hi Marton, could you share a link to your branch with the changes? It
>> would be great to see what needs to be done. A quick summary would help as
>> well.
>>
>> Knowing what changes between Hive 2 and 3 in our iceberg-hive-metastore
>> project is important because we would ideally use whatever is available at
>> runtime. I'm all for upgrading to enable people on the newer versions, but
>> I think we want to make sure we maintain support for people that haven't
>> migrated yet as well.
>>
>> rb
>>
>> On Fri, Sep 11, 2020 at 3:28 AM Marton Bod  wrote:
>>
>>> Hi Team,
>>>
>>> We would like to start a discussion on upgrading Iceberg components to
>>> use Hive 3 and Hadoop 3. We have a fork where we have bumped up the hive
>>> and hadoop dependency versions and made the necessary changes to get all
>>> tests to pass.
>>>
>>> As some components cannot (e.g. spark2) or might not want to upgrade
>>> yet, our solution was to create a separate iceberg-hive2-metastore module,
>>> which would keep on using Hive2 and Hadoop2. This would give an option for
>>> each component to do the upgrade at their own pace or not at all.
>>>
>>> At this point, our primary goal is to upgrade iceberg-mr to Hive3.
>>> Upgrading iceberg-flink posed no major issues either, but of course it's up
>>> to the Flink iceberg community to make this call. As for spark2/spark3, we
>>> have left them for now to use Hive2.
>>>
>>> Any thoughts from the community on this upgrade?
>>>
>>> Thank you,
>>>
>>> Marton
>>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>


Re: [DISCUSS] Rename iceberg-hive module?

2020-09-03 Thread Mass Dosage
I have raised a PR for this: https://github.com/apache/iceberg/pull/1418

Please take a look and comment.

Thanks,

Adrian

On Thu, 20 Aug 2020 at 17:30, Ryan Blue  wrote:

> Sounds unanimous. Thanks, everyone!
>
> On Thu, Aug 20, 2020 at 9:10 AM John Zhuge  wrote:
>
>> +1 for the rename
>>
>> On Thu, Aug 20, 2020 at 7:22 AM Junjie Chen 
>> wrote:
>>
>>> +1 for `iceberg-hive-metastore`, also +1 to have a new module to contain
>>> the `iceberg-mr`.
>>>
>>> On Thu, Aug 20, 2020 at 8:13 PM Saisai Shao 
>>> wrote:
>>>
>>>> +1 for the changes.
>>>>
>>>> On Thu, Aug 20, 2020 at 5:46 PM Mass Dosage wrote:
>>>>
>>>>> +1 for `iceberg-hive-metastore` as I found this confusing when I first
>>>>> started working with the code.
>>>>>
>>>>> On Thu, 20 Aug 2020 at 03:27, Jungtaek Lim <
>>>>> kabhwan.opensou...@gmail.com> wrote:
>>>>>
>>>>>> +1 for `iceberg-hive-metastore` and also +1 for RD's proposal.
>>>>>>
>>>>>> Thanks,
>>>>>> Jungtaek Lim (HeartSaVioR)
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Aug 20, 2020 at 11:20 AM Jingsong Li 
>>>>>> wrote:
>>>>>>
>>>>>>> +1 for `iceberg-hive-metastore`
>>>>>>>
>>>>>>> I'm confused about `iceberg-hive` and `iceberg-mr`.
>>>>>>>
>>>>>>> Best,
>>>>>>> Jingsong
>>>>>>>
>>>>>>> On Thu, Aug 20, 2020 at 9:48 AM Dongjoon Hyun <
>>>>>>> dongjoon.h...@gmail.com> wrote:
>>>>>>>
>>>>>>>> +1 for `iceberg-hive-metastore`.
>>>>>>>>
>>>>>>>> Maybe, is `Apache Iceberg 1.0.0` a good candidate to have that
>>>>>>>> breaking change?
>>>>>>>>
>>>>>>>> Bests,
>>>>>>>> Dongjoon.
>>>>>>>>
>>>>>>>> On Wed, Aug 19, 2020 at 6:35 PM RD  wrote:
>>>>>>>>
>>>>>>>>> I'm +1 for this rename.  I think we should keep the iceberg-mr
>>>>>>>>> module as is and maybe add a new module iceberg-hive-exec [not sure 
>>>>>>>>> if it
>>>>>>>>> is a good idea to salvage iceberg-hive for this purpose] which 
>>>>>>>>> contains
>>>>>>>>> hive specific StorageHandler, Serde and IcebergHivInputFormat classes.
>>>>>>>>>
>>>>>>>>> -R
>>>>>>>>>
>>>>>>>>> On Wed, Aug 19, 2020 at 5:06 PM Ryan Blue  wrote:
>>>>>>>>>
>>>>>>>>>> In the discussion this morning, we talked about what to name the
>>>>>>>>>> runtime module we want to add for Hive, iceberg-hive-runtime.
>>>>>>>>>> Unfortunately, iceberg-hive is the Hive _metastore_ module, so it is 
>>>>>>>>>> a bit
>>>>>>>>>> misleading to name the Hive runtime module iceberg-hive-runtime. It 
>>>>>>>>>> was
>>>>>>>>>> also pointed out that the iceberg-hive module is confusing for other
>>>>>>>>>> reasons: someone unfamiliar with it would expect to use it to work 
>>>>>>>>>> with
>>>>>>>>>> Hive, but it has no InputFormat or StorageHandler classes.
>>>>>>>>>>
>>>>>>>>>> Both problems are a result of a poor name for iceberg-hive. Maybe
>>>>>>>>>> we should rename iceberg-hive to iceberg-hive-metastore.
>>>>>>>>>>
>>>>>>>>>> The drawback is that a module people could use will disappear
>>>>>>>>>> (I'm assuming we won't rename iceberg-mr to iceberg-hive right 
>>>>>>>>>> away). But
>>>>>>>>>> most people probably use a runtime Jar, so it might be a good time 
>>>>>>>>>> to make
>>>>>>>>>> this change before there are more people depending on it.
>>>>>>>>>>
>>>>>>>>>> What does everyone think? Should we do the rename?
>>>>>>>>>>
>>>>>>>>>> rb
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Ryan Blue
>>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Best, Jingsong Lee
>>>>>>>
>>>>>>
>>>
>>> --
>>> Best Regards
>>>
>>
>>
>> --
>> John Zhuge
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re: Question about Iceberg release cadence

2020-08-27 Thread Mass Dosage
I'm all for a release. The only thing still required for basic Hive read
support (other than documentation, of course!) is producing a *single* jar
that can be added to Hive's classpath; the PR for that is at
https://github.com/apache/iceberg/pull/1267.

Thanks,

Adrian

On Thu, 27 Aug 2020 at 01:26, Anton Okolnychyi
 wrote:

> +1 on releasing structured streaming source. I should be able to do one
> more review round tomorrow.
>
> - Anton
>
> On 26 Aug 2020, at 17:12, Jungtaek Lim 
> wrote:
>
> I hope we include Spark structured streaming read as well in the next
> release; that was proposed in Feb this year and still around. Quoting my
> comment on benefit of the streaming read on Spark;
>
> This would be the major feature to cover the gap on use case for
>> structured streaming between Delta Lake and Iceberg. There's a technical
>> limitation on Spark structured streaming itself (global watermark), which
>> requires workaround via splitting query into multiple queries &
>> intermediate storage supporting end-to-end exactly once. Delta Lake covers
>> the case, and I really would like to see the case also covered by Iceberg.
>> I see there're lots of works in progress on the milestone (and these are
>> great features which should be done), but after this we cover both batch
>> and streaming workloads being done with Spark, which is a huge step forward
>> on Spark users.
>
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
> On Thu, Aug 27, 2020 at 1:13 AM Ryan Blue 
> wrote:
>
>> Hi Marton,
>>
>> 0.9.0 was released about 6 weeks ago, so I don't think we've planned when
>> the next release will be yet. I think it's a good idea to release soon,
>> though. The Flink sink is close to being ready as well and I'd like to get
>> both of those released so that the contributors can start using them.
>>
>> Seems like a good question for the broader community: how about a release
>> in the next month or so for Hive reads and the Flink sink?
>>
>> rb
>>
>> On Wed, Aug 26, 2020 at 8:58 AM Marton Bod  wrote:
>>
>>> Hi Team,
>>>
>>> I was wondering whether there is a release cadence already in place for
>>> Iceberg, e.g. how often releases will take place approximately? Which
>>> commits/features as release candidates in the near term?
>>>
>>> We're looking to integrate Iceberg into Hive, however, the current 0.9.1
>>> release does not yet contain the StorageHandler code in iceberg-mr. Knowing
>>> the approximate release timelines would help greatly with our integration
>>> planning.
>>>
>>> Of course, happy to get involved with ongoing dev/stability efforts to
>>> help achieve a new release of this module.
>>>
>>> Thanks a lot,
>>> Marton
>>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>
>


Re: Hive Iceberg writes

2020-08-27 Thread Mass Dosage
We're definitely interested in this too but haven't started work on it yet.
It has been discussed at our community syncs as something quite a few
people are interested in, so if nobody else responds, a good starting point
would probably be an early WIP PR that everyone can follow and contribute
to.

Thanks,

Adrian

On Wed, 26 Aug 2020 at 17:35, Ryan Blue  wrote:

> I think Edgar and Adrien who have been contributing support for ORC and
> Hive are interested in this as well.
>
> On Wed, Aug 26, 2020 at 9:22 AM Peter Vary 
> wrote:
>
>> Hi Team,
>>
>> We are thinking about implementing HiveOutputFormat, so writes through
>> Hive can work as well.
>> Has anybody working on this? Do you know any ongoing effort related to
>> Hive writes?
>> Asking because we would like to prevent duplicate effort.
>> Also if anyone has some good pointers to start for an Iceberg noobie, it
>> would be good.
>>
>> Thanks,
>> Peter
>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re: [DISCUSS] Rename iceberg-hive module?

2020-08-20 Thread Mass Dosage
+1 for `iceberg-hive-metastore` as I found this confusing when I first
started working with the code.

On Thu, 20 Aug 2020 at 03:27, Jungtaek Lim 
wrote:

> +1 for `iceberg-hive-metastore` and also +1 for RD's proposal.
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
>
>
> On Thu, Aug 20, 2020 at 11:20 AM Jingsong Li 
> wrote:
>
>> +1 for `iceberg-hive-metastore`
>>
>> I'm confused about `iceberg-hive` and `iceberg-mr`.
>>
>> Best,
>> Jingsong
>>
>> On Thu, Aug 20, 2020 at 9:48 AM Dongjoon Hyun 
>> wrote:
>>
>>> +1 for `iceberg-hive-metastore`.
>>>
>>> Maybe, is `Apache Iceberg 1.0.0` a good candidate to have that breaking
>>> change?
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>> On Wed, Aug 19, 2020 at 6:35 PM RD  wrote:
>>>
 I'm +1 for this rename.  I think we should keep the iceberg-mr module
 as is and maybe add a new module iceberg-hive-exec [not sure if it is a
 good idea to salvage iceberg-hive for this purpose] which contains hive
 specific StorageHandler, Serde and IcebergHivInputFormat classes.

 -R

 On Wed, Aug 19, 2020 at 5:06 PM Ryan Blue  wrote:

> In the discussion this morning, we talked about what to name the
> runtime module we want to add for Hive, iceberg-hive-runtime.
> Unfortunately, iceberg-hive is the Hive _metastore_ module, so it is a bit
> misleading to name the Hive runtime module iceberg-hive-runtime. It was
> also pointed out that the iceberg-hive module is confusing for other
> reasons: someone unfamiliar with it would expect to use it to work with
> Hive, but it has no InputFormat or StorageHandler classes.
>
> Both problems are a result of a poor name for iceberg-hive. Maybe we
> should rename iceberg-hive to iceberg-hive-metastore.
>
> The drawback is that a module people could use will disappear (I'm
> assuming we won't rename iceberg-mr to iceberg-hive right away). But most
> people probably use a runtime Jar, so it might be a good time to make this
> change before there are more people depending on it.
>
> What does everyone think? Should we do the rename?
>
> rb
>
> --
> Ryan Blue
>

>>
>> --
>> Best, Jingsong Lee
>>
>


Re: [DISCUSS] July board report

2020-07-08 Thread Mass Dosage
LGTM!

On Tue, 7 Jul 2020 at 21:27, Ryan Blue  wrote:

> Hi everyone,
>
> Here's my draft report for July. Feel free to comment and suggest updates
> that I've missed. Thanks!
>
> rb
>
> ## Description:
> Apache Iceberg is a table format for huge analytic datasets that is
> designed
> for high performance and ease of use.
>
> ## Issues:
> There are no issues requiring board attention.
>
> ## Membership Data:
> Apache Iceberg was founded 2020-05-19 (2 months ago)
> There are currently 9 committers and 9 PMC members in this project.
> The Committer-to-PMC ratio is 1:1.
>
> Community changes, past quarter:
> - No new PMC members (project graduated recently).
> - No new committers were added.
>
> ## Project Activity:
> In July, the community held one sync meeting to discuss general topics, and
> one specifically to discuss how to include both groups that have been
> working
> on integration with Hive.
>
> To address the question on the last board report, the community sync
> meetings
> are video conferences that anyone in the community is welcome to attend.
> The
> discussion is documented and summarized for anyone that can't attend. We
> have
> found these to be a good way to exchange context and ideas more quickly,
> but
> recognize that this isn't the best way for some people to participate and
> so
> we don't consider these a forum for making decisions or voting. If we come
> to
> a tentative conclusion on a topic, it is still open for further discussion
> on the dev list. The idea for this comes from the Parquet community that
> has
> been doing this for several years.
>
> Development activity:
> * Spark vectorized reads for flat schemas was merged and benchmarked
> * The Spark 3 integration branch was merged into master
> * Name mapping for Parquet files without IDs was committed
> * An action to compact data files was added
> * Support was added for managing and adding delete files in table metadata
> * Refactoring to support reuse Spark components for Flink
> * Several PRs for Flink support have been committed and more are open
> * CI tests for JDK 11 have been added
>
> The community also plans to release 0.9.0 with Spark 3 support soon.
>
> ## Community Health:
> Most community metrics have again increased in the last month, although dev
> list traffic is a bit lower. More importantly, the community has made
> further
> progress on several large areas with different groups leading the efforts,
> like Hive support, Spark 3 support, and Flink support.
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Iceberg at Subsurface Conference

2020-07-08 Thread Mass Dosage
Hello all,

You might be interested to know that Christine Mathiesen and I will be
presenting our work on adding Hive read support to Iceberg at the upcoming
Subsurface Cloud Data Lake conference. The talk is entitled "Hiveberg:
Integrating Apache Iceberg with the Hive Metastore". You can register for
free for the (online) conference at https://subsurfaceconf.com/summer2020;
it will be held on July 30. There are some other interesting talks lined up
too - maybe see you there!

Thanks,

Adrian


Re: failing tests on master

2020-06-29 Thread Mass Dosage
Yes, I merged them into our branches this afternoon and can confirm that
the tests now pass. Thanks!

On Mon, 29 Jun 2020 at 19:24, Ryan Blue  wrote:

> I merged this over the weekend, so it should be fixed now. Did it work for
> you?
>
> On Fri, Jun 26, 2020 at 11:48 AM Mass Dosage  wrote:
>
>> Cool, I look forward to trying it out 😀
>>
>> On Fri, 26 Jun 2020, 18:56 Ryan Blue,  wrote:
>>
>>> Yes, I think Edgar has addressed this. I have it on my list of reviews
>>> for today.
>>>
>>> On Fri, Jun 26, 2020 at 10:27 AM Edgar Rodriguez
>>>  wrote:
>>>
>>>> There's already a fix for this in
>>>> https://github.com/apache/iceberg/pull/1127
>>>>
>>>> Cheers,
>>>>
>>>> On Fri, Jun 26, 2020 at 5:26 AM Mass Dosage 
>>>> wrote:
>>>>
>>>>> Hello all,
>>>>>
>>>>> For the past week or so I've noticed failing builds on a local
>>>>> checkout of master.
>>>>>
>>>>> I have raised an issue here:
>>>>>
>>>>> https://github.com/apache/iceberg/issues/1113
>>>>>
>>>>> (there was initially one failing test, there are now two)
>>>>>
>>>>> Someone else raised a similar issue with one of the same failing tests
>>>>> and then another one:
>>>>>
>>>>> https://github.com/apache/iceberg/issues/1116
>>>>>
>>>>> What can we do in order to get these 3 failing tests resolved? I think
>>>>> the first step should be to get the Travis build to fail on them so it's
>>>>> clear to everyone that there is a problem here and a failing master build
>>>>> should be #1 priority to resolve. Perhaps the Travis build could be 
>>>>> changed
>>>>> to use a different timezone? I generally recommend using a timezone,
>>>>> charset etc. in the CI system that is different to whatever most of the
>>>>> developers have as their default in order to catch these kinds of issues.
>>>>>
>>>>> This is making it really hard to develop with confidence as one can't
>>>>> tell whether failing tests are due to changes I am making or not.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Adrian
>>>>>
>>>>
>>>>
>>>> --
>>>> Edgar R
>>>>
>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re: failing tests on master

2020-06-26 Thread Mass Dosage
Cool, I look forward to trying it out 😀

On Fri, 26 Jun 2020, 18:56 Ryan Blue,  wrote:

> Yes, I think Edgar has addressed this. I have it on my list of reviews for
> today.
>
> On Fri, Jun 26, 2020 at 10:27 AM Edgar Rodriguez
>  wrote:
>
>> There's already a fix for this in
>> https://github.com/apache/iceberg/pull/1127
>>
>> Cheers,
>>
>> On Fri, Jun 26, 2020 at 5:26 AM Mass Dosage  wrote:
>>
>>> Hello all,
>>>
>>> For the past week or so I've noticed failing builds on a local checkout
>>> of master.
>>>
>>> I have raised an issue here:
>>>
>>> https://github.com/apache/iceberg/issues/1113
>>>
>>> (there was initially one failing test, there are now two)
>>>
>>> Someone else raised a similar issue with one of the same failing tests
>>> and then another one:
>>>
>>> https://github.com/apache/iceberg/issues/1116
>>>
>>> What can we do in order to get these 3 failing tests resolved? I think
>>> the first step should be to get the Travis build to fail on them so it's
>>> clear to everyone that there is a problem here and a failing master build
>>> should be #1 priority to resolve. Perhaps the Travis build could be changed
>>> to use a different timezone? I generally recommend using a timezone,
>>> charset etc. in the CI system that is different to whatever most of the
>>> developers have as their default in order to catch these kinds of issues.
>>>
>>> This is making it really hard to develop with confidence as one can't
>>> tell whether failing tests are due to changes I am making or not.
>>>
>>> Thanks,
>>>
>>> Adrian
>>>
>>
>>
>> --
>> Edgar R
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


failing tests on master

2020-06-26 Thread Mass Dosage
Hello all,

For the past week or so I've noticed failing builds on a local checkout of
master.

I have raised an issue here:

https://github.com/apache/iceberg/issues/1113

(there was initially one failing test, there are now two)

Someone else raised a similar issue with one of the same failing tests and
then another one:

https://github.com/apache/iceberg/issues/1116

What can we do in order to get these 3 failing tests resolved? I think the
first step should be to get the Travis build to fail on them so it's clear
to everyone that there is a problem here; a failing master build should
be the #1 priority to resolve. Perhaps the Travis build could be changed to use
a different timezone? I generally recommend using a timezone, charset etc.
in the CI system that is different from whatever most of the developers have
as their default, in order to catch these kinds of issues.
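
As a sketch of the timezone idea, assuming we did it in build.gradle rather
than in the Travis configuration (the values are just examples):

// pin all test JVMs to a fixed timezone so results don't depend on the host's default
test {
  jvmArgs '-Duser.timezone=UTC'
  environment 'TZ', 'UTC'
}

That way timezone-dependent tests would fail the same way for everyone,
regardless of where the machine running them happens to be.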

This is making it really hard to develop with confidence as one can't tell
whether failing tests are due to changes I am making or not.

Thanks,

Adrian


Re: Iceberg community sync this week

2020-06-17 Thread Mass Dosage
I can't do Friday at 9:00 PDT but could do 11 or 12 PDT.

On Wed, 17 Jun 2020 at 01:30, Ryan Blue  wrote:

> Sounds like we should not plan to move the sync tomorrow, and should set
> up another discussion about getting the Hive work in and deduplicating
> effort.
>
> When is a time that would work? I'm out on Thursday, but I could do Friday
> morning at 9:00 or 11:00 PDT (17:00 or 19:00 UTC). Would either of those
> work?
>
> rb
>
> On Tue, Jun 16, 2020 at 9:49 AM Ryan Blue  wrote:
>
>> Thanks, Junjie. I'll add row-level deletes to the list.
>>
>> Can you open a PR with the README and site updates you're talking about?
>> I think that sounds like a good idea.
>>
>> On Tue, Jun 16, 2020 at 7:30 AM Junjie Chen 
>> wrote:
>>
>>> Hi
>>>
>>> I 'd like to add one topic about row-level delete planning. We have
>>> several large changes recently while the supporting for metadata column in
>>> the Spark side has no update.
>>>
>>> For discussion, how about adding slack channel link to the README and
>>> Iceberg website? The slack channel could make good and in-time
>>> discussion according to experiences in other projects. Currently, we almost
>>> don't have any discussion on the mail list.
>>>
>>>
>>> On Tue, Jun 16, 2020 at 1:49 AM Ryan Blue  wrote:
>>>
 Hi everyone,

 The next Iceberg community sync is currently scheduled for Wednesday,
 at 17:30 PDT. Please reply with topics you'd like to discuss.

 Also, there has been activity on the InputFormat and Hive support that
 I think people want to discuss this week, since we have two groups that
 have made significant progress on different implementations. I'd like to
 talk about how to get this work in, but this week's sync is scheduled for
 the evening PDT, which is really late for one of the groups to join from
 Europe. I think we should either move the sync to 9:00 PDT so they can join
 or have a second one to discuss getting the Hive support in.

 Whether we move the sync to the morning or have a second one to discuss
 Hive support depends on what topics we have for people that would like to
 attend from Asia time zones, so please reply with items to discuss or let
 me know if you don't plan on making it.

 Thanks!

 rb

 --
 Ryan Blue

>>>
>>>
>>> --
>>> Best Regards
>>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re: CI for Iceberg

2020-06-05 Thread Mass Dosage
Ok, that makes sense.

I pulled master a few hours ago and was still getting those errors.
Christine also had one of them fail. Could it be timezone related? I
haven't had a chance to look into the failures in detail yet, will try
again on Monday.


On Fri, 5 Jun 2020, 17:15 Ryan Blue,  wrote:

> These can creep in because pull requests are tested against the current
> master when the test starts and not re-tested when master changes. So you
> can merge two passing pull requests and end up with a failure in master. We
> generally fix those pretty quickly because all subsequent pull requests
> will fail.
>
> I think what you ran into here was an issue with a new test that pushed
> down timestamp filters and an ORC timestamp correctness bug in stats. That
> should be fixed now. The current master builds are green.
>
> On Fri, Jun 5, 2020 at 5:38 AM Mass Dosage  wrote:
>
>> I've now looked more closely and see that the Travis file does actually
>> build Iceberg ;) I'm still curious how something managed to get merged
>> into master while failing the tests, though.
>>
>> On Fri, 5 Jun 2020 at 13:13, Mass Dosage  wrote:
>>
>>> Hello all,
>>>
>>> I just wanted to know if there is any CI set up for Iceberg? I noticed
>>> that if I pull the current master branch I get failing tests (see below for
>>> stack traces, Ryan - we talked about this last night but it's still
>>> happening). So this made me wonder why there isn't some CI set up to check
>>> that every PR actually successfully passes the build. I noticed there is a
>>> .travis.yml but this doesn't seem to run the gradle build. Should we add
>>> that or create a GitHub action to do this? I think it would be a really
>>> good safety net to have, to reduce the likelihood of broken code and tests
>>> getting into master.
>>>
>>> Below are the two tests which are currently failing for me:
>>>
>>> > Task :iceberg-data:test
>>>
>>> org.apache.iceberg.data.TestLocalScan >
>>> testFilterWithDateAndTimestamp[1] FAILED
>>> java.lang.AssertionError
>>> at org.junit.Assert.fail(Assert.java:86)
>>> at org.junit.Assert.assertTrue(Assert.java:41)
>>> at org.junit.Assert.assertTrue(Assert.java:52)
>>> at
>>> org.apache.iceberg.data.TestLocalScan.testFilterWithDateAndTimestamp(TestLocalScan.java:486)
>>>
>>> org.apache.iceberg.data.TestMetricsRowGroupFilterTypes > testEq[20]
>>> FAILED
>>> java.lang.AssertionError: Should read: value is in the row group:
>>> 2018-06-29
>>> at org.junit.Assert.fail(Assert.java:88)
>>> at org.junit.Assert.assertTrue(Assert.java:41)
>>> at
>>> org.apache.iceberg.data.TestMetricsRowGroupFilterTypes.testEq(TestMetricsRowGroupFilterTypes.java:284)
>>>
>>> 211 tests completed, 2 failed, 6 skipped
>>>
>>> > Task :iceberg-data:test FAILED
>>>
>>> Thanks,
>>>
>>> Adrian
>>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re: CI for Iceberg

2020-06-05 Thread Mass Dosage
I've now looked more closely and see that the Travis file does actually
build Iceberg ;) I'm still curious how something managed to get merged into
master while failing the tests, though.

On Fri, 5 Jun 2020 at 13:13, Mass Dosage  wrote:

> Hello all,
>
> I just wanted to know if there is any CI set up for Iceberg? I noticed
> that if I pull the current master branch I get failing tests (see below for
> stack traces, Ryan - we talked about this last night but it's still
> happening). So this made me wonder why there isn't some CI set up to check
> that every PR actually successfully passes the build. I noticed there is a
> .travis.yml but this doesn't seem to run the gradle build. Should we add
> that or create a GitHub action to do this? I think it would be a really
> good safety net to have, to reduce the likelihood of broken code and tests
> getting into master.
>
> Below are the two tests which are currently failing for me:
>
> > Task :iceberg-data:test
>
> org.apache.iceberg.data.TestLocalScan > testFilterWithDateAndTimestamp[1]
> FAILED
> java.lang.AssertionError
> at org.junit.Assert.fail(Assert.java:86)
> at org.junit.Assert.assertTrue(Assert.java:41)
> at org.junit.Assert.assertTrue(Assert.java:52)
> at
> org.apache.iceberg.data.TestLocalScan.testFilterWithDateAndTimestamp(TestLocalScan.java:486)
>
> org.apache.iceberg.data.TestMetricsRowGroupFilterTypes > testEq[20] FAILED
> java.lang.AssertionError: Should read: value is in the row group:
> 2018-06-29
> at org.junit.Assert.fail(Assert.java:88)
> at org.junit.Assert.assertTrue(Assert.java:41)
> at
> org.apache.iceberg.data.TestMetricsRowGroupFilterTypes.testEq(TestMetricsRowGroupFilterTypes.java:284)
>
> 211 tests completed, 2 failed, 6 skipped
>
> > Task :iceberg-data:test FAILED
>
> Thanks,
>
> Adrian
>


CI for Iceberg

2020-06-05 Thread Mass Dosage
Hello all,

I just wanted to know if there is any CI set up for Iceberg? I noticed that
if I pull the current master branch I get failing tests (see below for
stack traces, Ryan - we talked about this last night but it's still
happening). So this made me wonder why there isn't some CI set up to check
that every PR actually successfully passes the build. I noticed there is a
.travis.yml but this doesn't seem to run the gradle build. Should we add
that or create a GitHub action to do this? I think it would be a really
good safety net to have, to reduce the likelihood of broken code and tests
getting into master.

Below are the two tests which are currently failing for me:

> Task :iceberg-data:test

org.apache.iceberg.data.TestLocalScan > testFilterWithDateAndTimestamp[1]
FAILED
java.lang.AssertionError
at org.junit.Assert.fail(Assert.java:86)
at org.junit.Assert.assertTrue(Assert.java:41)
at org.junit.Assert.assertTrue(Assert.java:52)
at
org.apache.iceberg.data.TestLocalScan.testFilterWithDateAndTimestamp(TestLocalScan.java:486)

org.apache.iceberg.data.TestMetricsRowGroupFilterTypes > testEq[20] FAILED
java.lang.AssertionError: Should read: value is in the row group:
2018-06-29
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.assertTrue(Assert.java:41)
at
org.apache.iceberg.data.TestMetricsRowGroupFilterTypes.testEq(TestMetricsRowGroupFilterTypes.java:284)

211 tests completed, 2 failed, 6 skipped

> Task :iceberg-data:test FAILED

Thanks,

Adrian


Re: [VOTE] Graduate to a top-level project

2020-05-15 Thread Mass Dosage
+1 as a member of the community (non-binding)

On Thu, 14 May 2020 at 23:12, Gautam  wrote:

> +1  We'v come a long way :-)
>
> On Wed, May 13, 2020 at 1:07 AM Dongjoon Hyun 
> wrote:
>
>> +1 for graduation!
>>
>> Bests,
>> Dongjoon.
>>
>> On Tue, May 12, 2020 at 11:59 PM Driesprong, Fokko 
>> wrote:
>>
>>> +1
>>>
>>> On Wed, 13 May 2020 at 08:58, jiantao yu wrote:
>>>
 +1 for graduation.


 On 13 May 2020, at 12:50 PM, Jun H. wrote:

 +1 for graduation.


 On Tue, May 12, 2020 at 9:41 PM 李响  wrote:


 +1 non-binding. My honor to be a part of this.

 On Wed, May 13, 2020 at 10:16 AM OpenInx  wrote:


 +1 for graduation.  It's a great news that we've prepared to graduate.

 (non-binding).

 On Wed, May 13, 2020 at 9:50 AM Saisai Shao 
 wrote:


 +1 for graduation.

 On Wed, 13 May 2020 at 09:33, Junjie Chen wrote:


 +1

 On Wed, May 13, 2020 at 8:07 AM RD  wrote:


 +1 for graduation!

 On Tue, May 12, 2020 at 3:50 PM John Zhuge  wrote:


 +1

 On Tue, May 12, 2020 at 3:33 PM parth brahmbhatt <
 brahmbhatt.pa...@gmail.com> wrote:


 +1

 On Tue, May 12, 2020 at 3:31 PM Anton Okolnychyi
  wrote:


 +1 for graduation

 On 12 May 2020, at 15:30, Ryan Blue  wrote:

 +1

 Jacques, I agree. I'll make sure to let you know about the IPMC vote
 because we'd love to have your support there as well.

 On Tue, May 12, 2020 at 3:02 PM Jacques Nadeau 
 wrote:


 I'm +1.

 (I think that is non-binding here but binding at the incubator level)
 --
 Jacques Nadeau
 CTO and Co-Founder, Dremio


 On Tue, May 12, 2020 at 2:35 PM Romin Parekh 
 wrote:


 +1

 On Tue, May 12, 2020 at 2:32 PM Owen O'Malley 
 wrote:


 +1

 On Tue, May 12, 2020 at 2:16 PM Ryan Blue  wrote:


 Hi everyone,

 I propose that the Iceberg community should petition to graduate from
 the Apache Incubator to a top-level project.

 Here is the draft board resolution:

 Establish the Apache Iceberg Project

 WHEREAS, the Board of Directors deems it to be in the best interests of
 the Foundation and consistent with the Foundation's purpose to establish
 a Project Management Committee charged with the creation and maintenance
 of open-source software, for distribution at no charge to the public,
 related to managing huge analytic datasets using a standard at-rest
 table format that is designed for high performance and ease of use.

 NOW, THEREFORE, BE IT RESOLVED, that a Project Management Committee
 (PMC), to be known as the "Apache Iceberg Project", be and hereby is
 established pursuant to Bylaws of the Foundation; and be it further

 RESOLVED, that the Apache Iceberg Project be and hereby is responsible
 for the creation and maintenance of software related to managing huge
 analytic datasets using a standard at-rest table format that is designed
 for high performance and ease of use; and be it further

 RESOLVED, that the office of "Vice President, Apache Iceberg" be and
 hereby is created, the person holding such office to serve at the
 direction of the Board of Directors as the chair of the Apache Iceberg
 Project, and to have primary responsibility for management of the
 projects within the scope of responsibility of the Apache Iceberg
 Project; and be it further

 RESOLVED, that the persons listed immediately below be and hereby are
 appointed to serve as the initial members of the Apache Iceberg Project:

 * Anton Okolnychyi 
 * Carl Steinbach   
 * Daniel C. Weeks  
 * James R. Taylor  
 * Julien Le Dem
 * Owen O'Malley
 * Parth Brahmbhatt 
 * Ratandeep Ratti  
 * Ryan Blue

 NOW, THEREFORE, BE IT FURTHER RESOLVED, that Ryan Blue be appointed to
 the office of Vice President, Apache Iceberg, to serve in accordance
 with and subject to the direction of the Board of Directors and the
 Bylaws of the Foundation until death, resignation, retirement, removal
 or disqualification, or until a successor is appointed; and be it
 further

 RESOLVED, that the Apache Iceberg Project be and hereby is tasked with
 the migration and rationalization of the Apache Incubator Iceberg
 podling; and be it further

 RESOLVED, that all responsibilities pertaining to the Apache Incubator
 Iceberg podling encumbered upon the Apache Incubator PMC are hereafter
 discharged.

 Please vote in the next 72 hours.

 [ ] +1 Petition the IPMC to graduate to top-level project
 [ ] +0
 [ ] -1 Wait to graduate because . . .

 --
 Ryan Blue




 --
 Thanks,

Re: [VOTE] Release Apache Iceberg 0.8.0-incubating RC2

2020-04-30 Thread Mass Dosage
The build for RC2 worked fine for me; I didn't get a failure on
"TestHiveTableConcurrency". Perhaps there is some kind of race condition in
the test? I have seen timeout errors like that when running tests on an
overloaded machine; could that have been the case here?

On Thu, 30 Apr 2020 at 08:32, OpenInx  wrote:

> I checked RC2; it seems TestHiveTableConcurrency is broken and may need
> to be fixed.
>
> 1. Download the tarball and check the signature & checksum: OK
> 2. license checking: RAT checks passed.
> 3. Build and test the project (java8):
> org.apache.iceberg.hive.TestHiveTableConcurrency >
> testConcurrentConnections FAILED
> java.lang.AssertionError: Timeout
> at org.junit.Assert.fail(Assert.java:88)
> at org.junit.Assert.assertTrue(Assert.java:41)
> at
> org.apache.iceberg.hive.TestHiveTableConcurrency.testConcurrentConnections(TestHiveTableConcurrency.java:106)
>
> On Thu, Apr 30, 2020 at 9:29 AM Ryan Blue  wrote:
>
>> Hi everyone,
>>
>> I propose the following candidate to be released as the official Apache
>> Iceberg 0.8.0-incubating release.
>>
>> The commit id is 8c05a2f5f1c8b111c049d43cf15cd8a51920dda1
>> * This corresponds to the tag: apache-iceberg-0.8.0-incubating-rc2
>> *
>> https://github.com/apache/incubator-iceberg/commits/apache-iceberg-0.8.0-incubating-rc2
>> * https://github.com/apache/incubator-iceberg/tree/8c05a2f5
>>
>> The release tarball, signature, and checksums are here:
>> *
>> https://dist.apache.org/repos/dist/dev/incubator/iceberg/apache-iceberg-0.8.0-incubating-rc2/
>>
>> You can find the KEYS file here:
>> * https://dist.apache.org/repos/dist/dev/incubator/iceberg/KEYS
>>
>> Convenience binary artifacts are staged in Nexus. The Maven repository
>> URL is:
>> *
>> https://repository.apache.org/content/repositories/orgapacheiceberg-1006/
>>
>> This release contains many bug fixes and several new features:
>> * Actions to remove orphaned files and to optimize metadata for query
>> performance
>> * Support for ORC data files
>> * Snapshot cherry-picking
>> * Incremental scan planning based on table history
>> * In and notIn expressions
>> * An InputFormat for writing MR jobs
>>
>> Please download, verify, and test.
>>
>> Please vote in the next 72 hours.
>>
>> [ ] +1 Release this as Apache Iceberg 0.8.0-incubating
>> [ ] +0
>> [ ] -1 Do not release this because...
>>
>> --
>> Ryan Blue
>>
>


Re: [VOTE] Release Apache Iceberg 0.8.0-incubating RC1

2020-04-29 Thread Mass Dosage
+1 (non-binding) (I assume only Apache/Iceberg members have binding votes?)

Similar to others I verified:

√ RAT checks passed
√ signature is correct
√ checksum is correct
√ build from source
√ run tests locally

Thanks,

Adrian

On Tue, 28 Apr 2020 at 21:45, Ryan Blue  wrote:

> Here are the steps to verify the release that I sent out last time, for
> anyone that doesn’t want to look them up:
>
>1. Download the source tarball, signature (.asc), and checksum
>(.sha512) from
>
> https://dist.apache.org/repos/dist/dev/incubator/iceberg/apache-iceberg-0.8.0-incubating-rc1/
>2. Import gpg keys: download KEYS and run gpg --import
>/path/to/downloaded/KEYS (optional if this hasn’t changed)
>3. Verify the signature by running: gpg --verify
>apache-iceberg-0.8.0-incubating.tar.gz.asc
>4. Verify the checksum by running: sha512sum -c
>apache-iceberg-0.8.0-incubating.tar.gz.sha512
>5. Untar the archive and go into the source directory: tar xzf
>apache-iceberg-0.8.0-incubating.tar.gz && cd 
> apache-iceberg-0.8.0-incubating
>6. Run RAT checks to validate license headers: dev/check-license
>7. Build and test the project: ./gradlew build (use Java 8)
>
> You can also validate the LICENSE and NOTICE documentation, which is
> included in the source tarball, as well as the staged binary artifacts. The
> latest update to the spark-runtime Jars was PR #966
>  if you’d like to
> review it.
>
> To validate the convenience binaries, add the Maven URL from the email
> above to a downstream project and update your Iceberg dependency to
> 0.8.0-incubating, like this:
>
>   repositories {
> maven {
>   name 'stagedIceberg'
>   url 
> 'https://repository.apache.org/content/repositories/orgapacheiceberg-1005/'
> }
>   }
>
>   ext {
> icebergVersion = '0.8.0-incubating'
>   }
>
> Then run the downstream project’s tests.
>
> Thanks for reviewing and voting, everyone!
>
> rb
>
> On Tue, Apr 28, 2020 at 1:39 PM Ryan Blue  wrote:
>
>> Hi everyone,
>>
>> I propose the following RC to be released as official Apache Iceberg
>> 0.8.0-incubating release.
>>
>> The commit id is 4c2dd0ac2c832cc425b33d3b578025fa4e295392
>> * This corresponds to the tag: apache-iceberg-0.8.0-incubating-rc1
>> *
>> https://github.com/apache/incubator-iceberg/commits/apache-iceberg-0.8.0-incubating-rc1
>> * https://github.com/apache/incubator-iceberg/tree/4c2dd0ac
>>
>> The release tarball, signature, and checksums are here:
>> *
>> https://dist.apache.org/repos/dist/dev/incubator/iceberg/apache-iceberg-0.8.0-incubating-rc1/
>>
>> You can find the KEYS file here:
>> * https://dist.apache.org/repos/dist/dev/incubator/iceberg/KEYS
>>
>> Convenience binary artifacts are staged in Nexus. The Maven repository
>> URL is:
>> *
>> https://repository.apache.org/content/repositories/orgapacheiceberg-1005/
>>
>> This release contains many bug fixes and several new features:
>> * Actions to remove orphaned files and to optimize metadata for query
>> performance
>> * Support for ORC data files
>> * Snapshot cherry-picking
>> * Incremental scan planning based on table history
>> * In and notIn expressions
>> * An InputFormat for writing MR jobs
>>
>> Please download, verify, and test.
>>
>> Please vote in the next 72 hours.
>>
>> [ ] +1 Release this as Apache Iceberg 0.8.0-incubating
>> [ ] +0
>> [ ] -1 Do not release this because...
>>
>>
>> --
>> Ryan Blue
>>
>
>
> --
> Ryan Blue
>


Re: Iceberg community sync notes - 15 April 2020

2020-04-17 Thread Mass Dosage
Cool. I've raised a draft PR for the approach we discussed on the call:

https://github.com/apache/incubator-iceberg/pull/935/files

It's incomplete, but I've put some notes explaining that; it would be nice to
know what others think of the above approach and whether they have better ideas.

Another approach that we used successfully was to shade and relocate Guava
in every Iceberg subproject that uses it; that way you can depend on it
"normally", but the build file gets pretty messy with shadow jar versions of
everything. I can raise a WIP PR for that approach to compare if anyone
is interested.
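
For anyone curious, the per-subproject version is roughly the following with
the Gradle Shadow plugin. This is only a minimal sketch; the plugin version,
Guava version and relocated package name are placeholders rather than what an
actual PR would use:

  // Hypothetical build.gradle fragment for one subproject that uses Guava.
  plugins {
    id 'java'
    id 'com.github.johnrengelman.shadow' version '5.2.0'
  }

  dependencies {
    implementation 'com.google.guava:guava:28.2-jre'
  }

  shadowJar {
    // Bundle Guava and move it to a private package so engines like Spark,
    // Hive and Pig can keep their own Guava on the classpath without clashes.
    relocate 'com.google.common', 'org.apache.iceberg.shaded.com.google.common'
  }

The messy part is that every module consuming such a subproject then has to
depend on its shadow jar output rather than the plain jar, which is where the
"shadow jar versions of everything" noise in the build files comes from.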

Thanks,

Adrian



On Fri, 17 Apr 2020 at 15:58, RD  wrote:

> Thanks for the correction, Adrian.  I've filed the GitHub ticket here:
> https://github.com/apache/incubator-iceberg/issues/934 . There are 2
> approaches mentioned there with pros/cons. It will be good to get the
> community's feedback on how to proceed.
>
> -best,
> R.
>
> On Fri, Apr 17, 2020 at 6:28 AM Mass Dosage  wrote:
>
>> Thanks for the detailed notes Ryan. My thoughts on a few of the topics...
>>
>> 0.8.0 release - my general preference is to release early and release
>> often. If features aren't ready why wait? Why not go with a 0.8.0 release
>> now and then a 0.9.0 (or whatever) a couple of weeks later with the other
>> features? I know with Apache projects this can sometimes be a challenge
>> with all the ceremony around a release, getting votes etc. but I don't
>> think that's such a problem in the incubating stage?
>>
>> A clarification on the InputFormats - I think the DDL Ratandeep was
>> referring to was more like "SHOW PARTITIONS" rather than "ADD PARTITIONS"
>> i.e. the "read" path but for statements other than "SELECT" etc. Also, to
>> be clear the `mapreduce` InputFormat that was contributed - it sounds like
>> that works for Pig but I don't think it will work for Hive 1 or 2 since
>> they use the `mapred` API for InputFormats. This is what we have attempted
>> to cover in our InputFormat. I raised a WIP PR for it yesterday at
>> https://github.com/apache/incubator-iceberg/pull/933 and would
>> appreciate feedback from anyone interested in it.
>>
>> Thanks for sharing the Avro hack for shading and relocating Guava. Should
>> I create a ticket on GitHub to capture this work? We'll then have a go at
>> implementing it.
>>
>> Thanks,
>>
>> Adrian
>>
>>
>> On Fri, 17 Apr 2020 at 04:07, OpenInx  wrote:
>>
>>> Thanks for writing this up.
>>> The views from the Netflix branch are a great feature; is there any plan to
>>> port them to Apache Iceberg?
>>>
>>> On Fri, Apr 17, 2020 at 5:31 AM Ryan Blue 
>>> wrote:
>>>
>>>> Here are my notes from yesterday’s sync. As usual, feel free to add to
>>>> this if I missed something.
>>>>
>>>> There were a couple of questions raised during the sync that we’d like
>>>> to open up to anyone who wasn’t able to attend:
>>>>
>>>>- Should we wait for the parallel metadata rewrite action before
>>>>cutting 0.8.0 candidates?
>>>>- Should we wait for ORC metrics before cutting 0.8.0 candidates?
>>>>
>>>> In the sync, we thought that it would be good to wait and get these in.
>>>> Please reply to this if you agree or disagree.
>>>>
>>>> Thanks!
>>>>
>>>> *Attendees*:
>>>>
>>>>- Ryan Blue
>>>>- Dan Weeks
>>>>- Anjali Norwood
>>>>- Jun Ma
>>>>- Ratandeep Ratti
>>>>- Pavan
>>>>- Christine Mathiesen
>>>>- Gautam Kowshik
>>>>- Mass Dosage
>>>>- Filip
>>>>- Ryan Murray
>>>>
>>>> *Topics*:
>>>>
>>>>- 0.8.0 release blockers: actions, ORC metrics
>>>>- Row-level delete update
>>>>- Parquet vectorized read update
>>>>- InputFormats and Hive support
>>>>- Netflix branch
>>>>
>>>> *Discussion*:
>>>>
>>>>- 0.8.0 release
>>>>   - Ryan: we planned to get a candidate out this week, but I think
>>>>   we may want to wait on 2 things that are about ready
>>>>   - First: Anton is contributing an action to rewrite manifests in
>>>>   parallel that is close. Anyone interested? (Gautam is interested)
>>>>   - Second: ORC is passing correctness tests, but 

Re: Iceberg community sync notes - 15 April 2020

2020-04-17 Thread Mass Dosage
Thanks for the detailed notes Ryan. My thoughts on a few of the topics...

0.8.0 release - my general preference is to release early and release
often. If features aren't ready why wait? Why not go with a 0.8.0 release
now and then a 0.9.0 (or whatever) a couple of weeks later with the other
features? I know with Apache projects this can sometimes be a challenge
with all the ceremony around a release, getting votes etc. but I don't
think that's such a problem in the incubating stage?

A clarification on the InputFormats - I think the DDL Ratandeep was
referring to was more like "SHOW PARTITIONS" rather than "ADD PARTITIONS"
i.e. the "read" path but for statements other than "SELECT" etc. Also, to
be clear the `mapreduce` InputFormat that was contributed - it sounds like
that works for Pig but I don't think it will work for Hive 1 or 2 since
they use the `mapred` API for InputFormats. This is what we have attempted
to cover in our InputFormat. I raised a WIP PR for it yesterday at
https://github.com/apache/incubator-iceberg/pull/933 and would appreciate
feedback from anyone interested in it.

Thanks for sharing the Avro hack for shading and relocating Guava. Should I
create a ticket on GitHub to capture this work? We'll then have a go at
implementing it.

Thanks,

Adrian


On Fri, 17 Apr 2020 at 04:07, OpenInx  wrote:

> Thanks for writing this up.
> The views from the Netflix branch are a great feature; is there any plan to
> port them to Apache Iceberg?
>
> On Fri, Apr 17, 2020 at 5:31 AM Ryan Blue 
> wrote:
>
>> Here are my notes from yesterday’s sync. As usual, feel free to add to
>> this if I missed something.
>>
>> There were a couple of questions raised during the sync that we’d like to
>> open up to anyone who wasn’t able to attend:
>>
>>- Should we wait for the parallel metadata rewrite action before
>>cutting 0.8.0 candidates?
>>- Should we wait for ORC metrics before cutting 0.8.0 candidates?
>>
>> In the sync, we thought that it would be good to wait and get these in.
>> Please reply to this if you agree or disagree.
>>
>> Thanks!
>>
>> *Attendees*:
>>
>>- Ryan Blue
>>- Dan Weeks
>>- Anjali Norwood
>>- Jun Ma
>>- Ratandeep Ratti
>>- Pavan
>>- Christine Mathiesen
>>- Gautam Kowshik
>>- Mass Dosage
>>- Filip
>>- Ryan Murray
>>
>> *Topics*:
>>
>>- 0.8.0 release blockers: actions, ORC metrics
>>- Row-level delete update
>>- Parquet vectorized read update
>>- InputFormats and Hive support
>>- Netflix branch
>>
>> *Discussion*:
>>
>>- 0.8.0 release
>>   - Ryan: we planned to get a candidate out this week, but I think
>>   we may want to wait on 2 things that are about ready
>>   - First: Anton is contributing an action to rewrite manifests in
>>   parallel that is close. Anyone interested? (Gautam is interested)
>>   - Second: ORC is passing correctness tests, but doesn’t have
>>   column-level metrics. Should we wait for this?
>>   - Ratandeep: ORC also lacks predicate push-down support
>>   - Ryan: I think metrics are more important than PPD because PPD is
>>   task side and metrics help reduce the number of tasks. If we wait on 
>> one,
>>   I’d prefer to wait on metrics
>>   - Ratandeep will look into whether he or Shardul can work on this
>>   - General consensus was to wait for these features before getting
>>   a candidate out
>>- Row-level deletes
>>   - Good progress in several PRs on adding the parallel v2 write
>>   path, as Owen suggested last sync
>>   - Junjie contributed an update to the spec for file/position
>>   delete files
>>- Parquet vectorized read
>>   - Dan: flat schema reads are primarily waiting on reviews
>>   - Dan: is anyone interested in complex type support?
>>   - Gautam needs struct and map support. 0.14.0 doesn’t support maps
>>   - Ryan (Murray): 0.17.0 will have lists, structs, and maps, but
>>   not maps of structs
>>   - Ryan (Blue): Because we have a translation layer in Iceberg to
>>   pass off to Spark, we don’t actually need support in Arrow. We are
>>   currently stuck on 0.14.0 because of changes that prevent us from 
>> avoiding
>>   a null check (see this comment
>>   <https://github.com/apache/incubator-iceberg/pull/723/files#r367667500>
>>   )
>>-
>>
>>InputFormat and Hive support
>>- Ratandeep: Generic (mapreduce) In

Re: [Discuss] Merge spark-3 branch into master

2020-03-26 Thread Mass Dosage
wrote on Wed, 4 Mar 2020 at 10:01:
>>>>>>
>>>>>>> Thanks Matt,
>>>>>>>
>>>>>>> If branching is the only choice, then we would potentially have two
>>>>>>> *master* branches until spark-3 is vastly adopted. That will somehow
>>>>>>> increase the maintenance burden and lead to inconsistency. IMO I'm OK 
>>>>>>> with
>>>>>>> the branching way, just think that we should have a clear way to keep
>>>>>>> tracking of two branches.
>>>>>>>
>>>>>>> Best,
>>>>>>> Saisai
>>>>>>>
>>>>>>> Matt Cheah  于2020年3月4日周三 上午9:50写道:
>>>>>>>
>>>>>>>> I think it’s generally dangerous and error-prone to try to support
>>>>>>>> two versions of the same library in the same build, in the same 
>>>>>>>> published
>>>>>>>> artifacts. This is the stance that Baseline
>>>>>>>> <https://github.com/palantir/gradle-baseline> + Gradle Consistent
>>>>>>>> Versions <https://github.com/palantir/gradle-consistent-versions>
>>>>>>>> takes. Gradle Consistent Versions is specifically opinionated towards
>>>>>>>> building against one version of a library across all modules in the 
>>>>>>>> build.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> I would think that branching would be the best way to build and
>>>>>>>> publish against multiple versions of a dependency.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> -Matt Cheah
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> *From: *Saisai Shao 
>>>>>>>> *Reply-To: *"dev@iceberg.apache.org" 
>>>>>>>> *Date: *Tuesday, March 3, 2020 at 5:45 PM
>>>>>>>> *To: *Iceberg Dev List 
>>>>>>>> *Cc: *Ryan Blue 
>>>>>>>> *Subject: *Re: [Discuss] Merge spark-3 branch into master
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> I didn't realized that Gradle cannot support two different versions
>>>>>>>> in one build. I think I did such things for Livy to build scala 2.10 
>>>>>>>> and
>>>>>>>> 2.11 jars simultaneously with Maven. I'm not so familiar with Gradle 
>>>>>>>> thing,
>>>>>>>> I can take a shot to see if there's some hacky ways to make it work.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Besides, are we saying that we will move to spark-3 support after
>>>>>>>> 0.8 release in the master branch to replace Spark-2, or we maintain two
>>>>>>>> branches for both spark-2 and spark-3 and make two releases? From
>>>>>>>> my understanding, the adoption of spark-3 may not be so fast, and there
>>>>>>>> still has lots users who stick on spark-2. Ideally, it might be better 
>>>>>>>> to
>>>>>>>> support two versions in a near future.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>> Saisai
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Mass Dosage  于2020年3月4日周三 上午1:33写道:
>>>>>>>>
>>>>>>>> +1 for a 0.8.0 release with Spark 2.4 and then move on for Spark
>>>>>>>> 3.0 when it's ready.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, 3 Mar 2020 at 16:32, Ryan Blue 
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Thanks for bringing this up, Saisai. I tried to do this a couple of
>>>>>>>> months ago, but ran into a problem with dependency locks. I couldn't 
>>>>>>>> get
>>>>>>>> two different versions of Spark packages in the build with baseline, 
>>>>>>>> but
>>>>>>>> maybe I was missing something. If you can get it working, I think it's 
>>>>>>>> a
>>>>>>>> great idea to get this into master.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Otherwise, I was thinking about proposing an 0.8.0 release in the
>>>>>>>> next month or so based on Spark 2.4. Then we could merge the branch 
>>>>>>>> into
>>>>>>>> master and do another release for Spark 3.0 when it's ready.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> rb
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Mar 3, 2020 at 6:07 AM Saisai Shao 
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Hi team,
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> I was thinking of merging spark-3 branch into master, also per the
>>>>>>>> discussion before we could make spark-2 and spark-3 coexisted into 2
>>>>>>>> different sub-modules. With this, one build could generate both 
>>>>>>>> spark-2 and
>>>>>>>> spark-3 runtime jars, user could pick either at preference.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> One concern is that they share lots of common code in read/write
>>>>>>>> path, this will increase the maintenance overhead to keep consistency 
>>>>>>>> of
>>>>>>>> two copies.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> So I'd like to hear your thoughts, any suggestions on it?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>> Saisai
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>>
>>>>>>>> Ryan Blue
>>>>>>>>
>>>>>>>> Software Engineer
>>>>>>>>
>>>>>>>> Netflix
>>>>>>>>
>>>>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Software Engineer
>>>>> Netflix
>>>>>
>>>>
>>


Re: Shall we start a regular community sync up?

2020-03-18 Thread Mass Dosage
We're in London so that wouldn't work for us, but it's up to you, obviously,
based on where most of the committers are.

On Wed, 18 Mar 2020, 17:13 Ryan Blue,  wrote:

> Yes, I agree! What days work for everyone? Since most people are in UTC-7
> and UTC+8, it probably makes sense to do something in the evening here in
> California, right?
>
> On Wed, Mar 18, 2020 at 10:06 AM Mass Dosage  wrote:
>
>> +1 to monthly or fortnightly.
>>
>> On Wed, 18 Mar 2020 at 16:22, Miao Wang  wrote:
>>
>>> +1. Monthly or Bi-Weekly.
>>>
>>>
>>>
>>> *From: *OpenInx 
>>> *Reply-To: *"dev@iceberg.apache.org" 
>>> *Date: *Wednesday, March 18, 2020 at 8:20 AM
>>> *To: *"dev@iceberg.apache.org" 
>>> *Cc: *Ryan Blue 
>>> *Subject: *Re: Shall we start a regular community sync up?
>>>
>>>
>>>
>>> +1
>>>
>>>
>>>
>>> On Wed, Mar 18, 2020 at 10:30 PM Saisai Shao 
>>> wrote:
>>>
>>> Hi team,
>>>
>>>
>>>
>>> With more companies and developers joining the community, I was
>>> wondering if we could have a regular sync-up to discuss anything about
>>> Iceberg, like milestones, feature design, etc. I think this would be quite
>>> helpful for growing the community and moving the project forward.
>>>
>>>
>>>
>>> Would like to hear your thoughts.
>>>
>>>
>>>
>>> Best regards,
>>>
>>> Saisai
>>>
>>>
>>>
>>>
>>>
>>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re: Shall we start a regular community sync up?

2020-03-18 Thread Mass Dosage
+1 to monthly or fortnightly.

On Wed, 18 Mar 2020 at 16:22, Miao Wang  wrote:

> +1. Monthly or Bi-Weekly.
>
>
>
> *From: *OpenInx 
> *Reply-To: *"dev@iceberg.apache.org" 
> *Date: *Wednesday, March 18, 2020 at 8:20 AM
> *To: *"dev@iceberg.apache.org" 
> *Cc: *Ryan Blue 
> *Subject: *Re: Shall we start a regular community sync up?
>
>
>
> +1
>
>
>
> On Wed, Mar 18, 2020 at 10:30 PM Saisai Shao 
> wrote:
>
> Hi team,
>
>
>
> With more companies and developers joining the community, I was
> wondering if we could have a regular sync-up to discuss anything about
> Iceberg, like milestones, feature design, etc. I think this would be quite
> helpful for growing the community and moving the project forward.
>
>
>
> Would like to hear your thoughts.
>
>
>
> Best regards,
>
> Saisai
>
>
>
>
>
>


Re: [Discuss] Merge spark-3 branch into master

2020-03-03 Thread Mass Dosage
+1 for a 0.8.0 release with Spark 2.4, and then moving on to Spark 3.0 when
it's ready.
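
For reference, the sub-module layout Saisai describes below could look
roughly like the following; module names, Spark coordinates and versions are
purely illustrative (and this still runs into the dependency-lock problem
Ryan mentions, since one build would then contain two Spark versions):

  // Hypothetical settings.gradle
  include 'iceberg-spark2', 'iceberg-spark2-runtime'
  include 'iceberg-spark3', 'iceberg-spark3-runtime'

  // Hypothetical iceberg-spark2/build.gradle
  dependencies {
    compileOnly 'org.apache.spark:spark-sql_2.11:2.4.5'
  }

  // Hypothetical iceberg-spark3/build.gradle
  dependencies {
    compileOnly 'org.apache.spark:spark-sql_2.12:3.0.0-preview2'
  }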

On Tue, 3 Mar 2020 at 16:32, Ryan Blue  wrote:

> Thanks for bringing this up, Saisai. I tried to do this a couple of months
> ago, but ran into a problem with dependency locks. I couldn't get two
> different versions of Spark packages in the build with baseline, but maybe
> I was missing something. If you can get it working, I think it's a great
> idea to get this into master.
>
> Otherwise, I was thinking about proposing an 0.8.0 release in the next
> month or so based on Spark 2.4. Then we could merge the branch into master
> and do another release for Spark 3.0 when it's ready.
>
> rb
>
> On Tue, Mar 3, 2020 at 6:07 AM Saisai Shao  wrote:
>
>> Hi team,
>>
>> I was thinking of merging the spark-3 branch into master; also, per the
>> earlier discussion, we could make spark-2 and spark-3 coexist in 2
>> different sub-modules. With this, one build could generate both spark-2 and
>> spark-3 runtime jars, and users could pick either as they prefer.
>>
>> One concern is that they share lots of common code in the read/write path;
>> this will increase the maintenance overhead of keeping the two copies
>> consistent.
>>
>> So I'd like to hear your thoughts, any suggestions on it?
>>
>> Thanks
>> Saisai
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re: Hive Metastore integration future

2020-02-19 Thread Mass Dosage
Regarding the Hive versions... from what I've seen over on the Hive mailing
lists, it sounds like Spark 3.0 will target Hive 2.3.x, depending on
whether they can get the Hive community to do a new point release of Hive
before they finalise Spark 3.0. So hopefully that gets sorted in time; then
Iceberg can move to Spark 3, and using Hive 2.3 should be easier.

On Sat, 15 Feb 2020, 00:21 Ryan Blue,  wrote:

> Sorry for the late reply, everyone. This slipped to the bottom of my inbox.
>
> As for the Hive version, we were using 1.2.1 for the Spark runtime because
> that was the default in Spark. I think that changes in Spark 3.0, so when
> we move master over to build for 3.0, we should update the default Hive
> version as well. I think that should solve a lot of these problems
> because Spark will no longer be reliant on such an old version and we can
> continue to use the version that Spark provides. Does that sound like
> a good plan?
>
> For the metastore project, I'm not sure whether I would include it in
> Iceberg or not. I wouldn't want Iceberg to suffer from feature creep, but I
> think that good integration between a new metastore and Iceberg would be
> really beneficial. I'd be happy either way, as long as we don't make it
> impossible for users that still rely on the Hive metastore or other
> implementations.
>
> rb
>
> On Wed, Jan 29, 2020 at 10:01 AM Kristopher Kane 
> wrote:
>
>> " It would be simply to gain full functionality of Hive" . That should
>> read Iceberg.
>>
>> On Wed, Jan 29, 2020 at 12:55 PM Kristopher Kane  wrote:
>>
>>> Adrian, "I'd imagine that keeping binary compatibility across Hive,
>>> Spark and Iceberg will be quite a challenge."  Yeah, this is what I'm
>>> afraid of over time.  Iceberg's big draw for me is only maintaining a
>>> processing engine (Spark), Iceberg and cloud storage compatibility and any
>>> potential Iceberg use wouldn't even be with the rest of the Hive ecosystem.
>>> It would be simply to gain full functionality of Hive via a ready-to-use
>>> metastore which, right now, defaults to Hive.  Hive 3, with Ranger and
>>> Atlas based security, takes things even further away for Spark, as
>>> it does not allow interaction with Hive-intrinsic services like the
>>> metastore anyway.  It might be that you can run the Hive 3 metastore for
>>> now, but the paths forward don't suggest that will be accessible for much
>>> longer.
>>>
>>> Ryan, when you said, "I'd really love to see a new metastore project,"
>>> did you mean internal to the Iceberg project?
>>>
>>> Kris
>>>
>>> On Wed, Jan 29, 2020 at 12:17 PM Mass Dosage 
>>> wrote:
>>>
>>>> On the topic of Hive versions - we've definitely experienced some
>>>> issues trying to programmatically use the iceberg-spark-runtime artifact in
>>>> unit tests (it uses Hive 1.2 as mentioned above). We then tried to also use
>>>> some other common Hive testing libraries like HiveRunner
>>>> <https://github.com/klarna/HiveRunner/> and BeeJU
>>>> <https://github.com/HotelsDotCom/beeju> which in turn use Hive 2.3. We
>>>> then ended up with exceptions (e.g. "Method not found") due to
>>>> incompatibilities between the Hive library classes and had to abandon the
>>>> testing libraries. I can share these exceptions if that would be useful but
>>>> I'd imagine that keeping binary compatibility across Hive, Spark and
>>>> Iceberg will be quite a challenge. I'd prefer Iceberg defaulting to Hive
>>>> 2.3.x over 1.2 as 1.2 is pretty old, I don't think any of the commercial
>>>> Hadoop vendors officially support it any more and I think it's used a lot
>>>> less now than 2.x but I could be wrong. Alternatively a way to pick and
>>>> choose a Hive version would be great but probably quite a bit of work to
>>>> pull off...
>>>>
>>>> Adrian
>>>>
>>>> On Wed, 29 Jan 2020 at 16:59, Ryan Blue 
>>>> wrote:
>>>>
>>>>> Hi Kris,
>>>>>
>>>>> We use version 1.2.1 because the part that we're using hasn't changed
>>>>> much and we want to ensure compatibility with old metastore versions.
>>>>> Iceberg should work with newer metastores, and feel free to open a bug if
>>>>> you find a problem with one. We'll make sure to fix it to be compatible
>>>>> 

Re: Hive Metastore integration future

2020-01-29 Thread Mass Dosage
On the topic of Hive versions - we've definitely experienced some issues
trying to programmatically use the iceberg-spark-runtime artifact in unit
tests (it uses Hive 1.2 as mentioned above). We then tried to also use some
other common HIve testing libraries like HiveRunner
 and BeeJU
 which in turn use Hive 2.3. We then
ended up with exceptions (e.g. "Method not found") due to incompatibilities
between the Hive library classes and had to abandon the testing libraries.
I can share these exceptions if that would be useful, but I'd imagine that
keeping binary compatibility across Hive, Spark and Iceberg will be quite a
challenge. I'd prefer Iceberg defaulting to Hive 2.3.x over 1.2: 1.2 is
pretty old, I don't think any of the commercial Hadoop vendors officially
support it any more, and I think it's used a lot less now than 2.x, but I
could be wrong. Alternatively, a way to pick and choose a Hive version would
be great, but probably quite a bit of work to pull off...
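
In the meantime, one workaround that might be worth trying is forcing a
single Hive version across the test classpath so that iceberg-spark-runtime,
HiveRunner and BeeJU all resolve against the same classes. A rough sketch
follows; the module coordinates and version are examples only, and this just
papers over the conflict rather than fixing the underlying incompatibility,
so it won't help if the clashing classes are bundled inside a jar rather
than pulled in transitively:

  // Hypothetical build.gradle fragment in the consuming project.
  configurations.all {
    resolutionStrategy {
      // Resolve every Hive module to one version across the whole classpath.
      force 'org.apache.hive:hive-exec:2.3.6',
            'org.apache.hive:hive-metastore:2.3.6'
    }
  }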

Adrian

On Wed, 29 Jan 2020 at 16:59, Ryan Blue  wrote:

> Hi Kris,
>
> We use version 1.2.1 because the part that we're using hasn't changed much
> and we want to ensure compatibility with old metastore versions. Iceberg
> should work with newer metastores, and feel free to open a bug if you find
> a problem with one. We'll make sure to fix it to be compatible with a range
> of versions.
>
> I'm not sure what people are going to want eventually. Right now, we know
> that many people use the Hive metastore to track tables, so it makes sense
> to support it as an option. Iceberg allows you to plug in your own
> metastore easily because we know that lots of places (Netflix included)
> have their own metastore implementations. I'd really love to see a new
> metastore project, but I don't think that Iceberg should be opinionated
> about which one you use.
>
> rb
>
> On Wed, Jan 29, 2020 at 7:32 AM Kristopher Kane 
> wrote:
>
>> Hi Iceberg.
>>
>> It looks like for most cases where non-atomic rename is required, using
>> the Hive metastore is the baseline, with the ability to implement a custom one.
>>
>> I couldn't find mailing list history or a GitHub issue that suggests that
>> Iceberg will implement its own. Is that intended for the future?
>>
>> I ask because Iceberg's metastore version pin is 1.2.1, which is very
>> old.  Someone using Iceberg with a Hive metastore might find it difficult
>> to keep pace with Hive upgrades.
>>
>> Related:  Is the intention here that existing Hive users would use the
>> store that they have and new Iceberg users would implement custom?
>>
>> Appreciate help in understanding,
>>
>> Kris
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>