[GitHub] mccheah opened a new pull request #19: Set derby log system property earlier

2018-11-28 Thread GitBox
mccheah opened a new pull request #19: Set derby log system property earlier URL: https://github.com/apache/incubator-iceberg/pull/19 Otherwise a hive/derby.log file is generated, likely by default when starting the thrift server.

[GitHub] rdblue closed pull request #17: Remove filter and iterable methods from Snapshot.

2018-11-28 Thread GitBox
rdblue closed pull request #17: Remove filter and iterable methods from Snapshot. URL: https://github.com/apache/incubator-iceberg/pull/17 This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance: As this i

[GitHub] danielcweeks commented on issue #17: Remove filter and iterable methods from Snapshot.

2018-11-28 Thread GitBox
danielcweeks commented on issue #17: Remove filter and iterable methods from Snapshot. URL: https://github.com/apache/incubator-iceberg/pull/17#issuecomment-442646567 +1 This is an automated message from the Apache Git Servic

[GitHub] rdblue closed pull request #2: Support dateCreated expressions in ScanSummary.

2018-11-28 Thread GitBox
rdblue closed pull request #2: Support dateCreated expressions in ScanSummary. URL: https://github.com/apache/incubator-iceberg/pull/2

[GitHub] mccheah opened a new issue #18: Avro files in tests should be generated under the build folder

2018-11-28 Thread GitBox
mccheah opened a new issue #18: Avro files in tests should be generated under the build folder URL: https://github.com/apache/incubator-iceberg/issues/18

[GitHub] rdsr commented on issue #17: Remove filter and iterable methods from Snapshot.

2018-11-28 Thread GitBox
rdsr commented on issue #17: Remove filter and iterable methods from Snapshot. URL: https://github.com/apache/incubator-iceberg/pull/17#issuecomment-442634067 That makes sense. Thanks!

[GitHub] mccheah commented on issue #11: Ignore .avro, .avro.crc, and hive/derby.log.

2018-11-28 Thread GitBox
mccheah commented on issue #11: Ignore .avro, .avro.crc, and hive/derby.log. URL: https://github.com/apache/incubator-iceberg/pull/11#issuecomment-442633733 Yup agreed, I didn't take the step to look into how these files are generated - can do that instead.

[GitHub] rdsr commented on issue #11: Ignore .avro, .avro.crc, and hive/derby.log.

2018-11-28 Thread GitBox
rdsr commented on issue #11: Ignore .avro, .avro.crc, and hive/derby.log. URL: https://github.com/apache/incubator-iceberg/pull/11#issuecomment-442632443 A slightly better way is to set the right properties so that these files are generated under the **build** folder of the Gradle module.

[GitHub] rdblue commented on issue #11: Ignore .avro, .avro.crc, and hive/derby.log.

2018-11-28 Thread GitBox
rdblue commented on issue #11: Ignore .avro, .avro.crc, and hive/derby.log. URL: https://github.com/apache/incubator-iceberg/pull/11#issuecomment-442630879 Is it possible to clean up after tests run instead of ignoring them? T

Re: merge-on-read?

2018-11-28 Thread Erik Wright
On Wed, Nov 28, 2018 at 4:32 PM Owen O'Malley wrote: > For Hive's ACID, we started with deltas that had three options per row: > insert, delete, edit. Since that didn't enable predicate pushdown in the > common case where there are a large number of inserts, we went to the model > of just using

[GitHub] mccheah removed a comment on issue #11: Ignore .avro, .avro.crc, and hive/derby.log.

2018-11-28 Thread GitBox
mccheah removed a comment on issue #11: Ignore .avro, .avro.crc, and hive/derby.log. URL: https://github.com/apache/incubator-iceberg/pull/11#issuecomment-442627937 @rdblue thoughts?

[GitHub] mccheah commented on issue #11: Ignore .avro, .avro.crc, and hive/derby.log.

2018-11-28 Thread GitBox
mccheah commented on issue #11: Ignore .avro, .avro.crc, and hive/derby.log. URL: https://github.com/apache/incubator-iceberg/pull/11#issuecomment-442627937 @rdblue thoughts?

[GitHub] mccheah commented on issue #11: Ignore .avro, .avro.crc, and hive/derby.log.

2018-11-28 Thread GitBox
mccheah commented on issue #11: Ignore .avro, .avro.crc, and hive/derby.log. URL: https://github.com/apache/incubator-iceberg/pull/11#issuecomment-442627965 @rdblue thoughts?

Re: merge-on-read?

2018-11-28 Thread Owen O'Malley
For Hive's ACID, we started with deltas that had three options per row: insert, delete, edit. Since that didn't enable predicate pushdown in the common case where there are a large number of inserts, we went to the model of just using inserts and deletes in separate files. Queries that modifying t

Re: merge-on-read?

2018-11-28 Thread Erik Wright
Those are both really neat use cases, but the one I had in mind was what Ryan mentioned. It's something that Hoodie apparently supports or is building support for, and it's an important use case for the systems that my colleagues and I are building. There are three scenarios: - An Extract syst

[GitHub] rdblue commented on issue #17: Remove filter and iterable methods from Snapshot.

2018-11-28 Thread GitBox
rdblue commented on issue #17: Remove filter and iterable methods from Snapshot. URL: https://github.com/apache/incubator-iceberg/pull/17#issuecomment-442568451 The original idea was that you would be able to easily list all the files in a Snapshot. That may be a good idea still, but nothin

[GitHub] rdsr edited a comment on issue #17: Remove filter and iterable methods from Snapshot.

2018-11-28 Thread GitBox
rdsr edited a comment on issue #17: Remove filter and iterable methods from Snapshot. URL: https://github.com/apache/incubator-iceberg/pull/17#issuecomment-442563584 What was the original motivation for adding these?

[GitHub] rdsr commented on issue #17: Remove filter and iterable methods from Snapshot.

2018-11-28 Thread GitBox
rdsr commented on issue #17: Remove filter and iterable methods from Snapshot. URL: https://github.com/apache/incubator-iceberg/pull/17#issuecomment-442563584 What was the original motivation for adding this?

Re: merge-on-read?

2018-11-28 Thread Owen O'Malley
I’m not sure what use case Erik is looking for, but I’ve had users that want to do the equivalent of HBase’s column families. They want some of the columns to be stored separately and then merged together on read. The requirements would be that there is a 1:1 mapping between rows in the matching

[GitHub] rdblue opened a new pull request #17: Remove filter and iterable methods from Snapshot.

2018-11-28 Thread GitBox
rdblue opened a new pull request #17: Remove filter and iterable methods from Snapshot. URL: https://github.com/apache/incubator-iceberg/pull/17 These are not used.

Re: merge-on-read?

2018-11-28 Thread Ryan Blue
What do you mean by merge on read? A few people I've talked to are interested in building delete and upsert features. Those would create files that track the changes, which would be merged at read time to apply them. Is that what you mean? rb On Tue, Nov 27, 2018 at 12:26 PM Erik Wright wrote:
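As a rough illustration of the idea in this reply (change files that are merged at read time to apply deletes and upserts), here is a minimal Python sketch. The record layout, the `id` key, and the function name `merge_on_read` are assumptions made for the example; nothing here reflects an actual Iceberg API or file format.

```python
from typing import Dict, Iterable, Iterator, Set

Row = Dict[str, object]


def merge_on_read(
    base_rows: Iterable[Row],
    deleted_ids: Set[int],
    upserts: Dict[int, Row],
) -> Iterator[Row]:
    """Yield the current view of a table by applying tracked changes
    to the base rows while reading (hypothetical model)."""
    replaced = set()
    for row in base_rows:
        key = row["id"]
        if key in deleted_ids:
            # The delete file says this base row is gone.
            continue
        if key in upserts:
            # The upsert file carries a newer version of this row.
            replaced.add(key)
            yield upserts[key]
        else:
            yield row
    # Upserted keys that replaced no base row are emitted as new rows.
    # A key that is both deleted and upserted is treated here as
    # "delete, then re-insert".
    for key, row in upserts.items():
        if key not in replaced:
            yield row


base = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}, {"id": 3, "v": "c"}]
deleted = {2}
changes = {3: {"id": 3, "v": "C"}, 4: {"id": 4, "v": "d"}}
result = list(merge_on_read(base, deleted, changes))
```

In this sketch the base data is never rewritten; the reader pays the cost of applying the changes on every scan.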

Re: Status of Spark Integration, Questions

2018-11-28 Thread Ryan Blue
This depends on how you write the data. If you read and then overwrite what you read, then it would work and be reasonably efficient. Iceberg supports this. On the other hand, if you read and then append, then you’ll get duplicates and won’t remove any rows. So you have to choose the right write s
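The difference between the two write strategies can be shown with a toy in-memory table. This is only a sketch under stated assumptions: the list-based "table", the deduplication step, and the helper names `append` and `overwrite` are hypothetical and do not correspond to Iceberg or Spark APIs.

```python
def append(table, new_rows):
    """Add new rows and keep everything that was already in the table."""
    return table + new_rows


def overwrite(table, read_rows, new_rows):
    """Replace exactly the rows that were read with the new rows."""
    remaining = [row for row in table if row not in read_rows]
    return remaining + new_rows


table = [("a", 1), ("b", 2), ("b", 2)]

# Read everything, then drop exact duplicates while preserving order.
read_rows = list(table)
cleaned = list(dict.fromkeys(read_rows))

appended = append(table, cleaned)                  # old rows are still present
overwritten = overwrite(table, read_rows, cleaned)  # old rows are replaced
```

With append, the original rows stay in the table next to the cleaned copies; with overwrite, only the cleaned rows remain.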

Re: Presto Partitioning question

2018-11-28 Thread Ryan Blue
Iceberg has 2 main types of partitions:
* identity, named for the identity transform, which are just like Hive partitions where data from some column is used to partition without modification (identity transform)
* hidden, which derive partition values from some column but don't expose those value
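
A small Python sketch of the distinction, offered as an illustration only: an identity transform passes the column value through unchanged as the partition value, while a hidden partition derives its value from a column. The `day` transform, the record fields, and the function names below are assumptions for the example, not Iceberg's API.

```python
from datetime import datetime, timezone


def identity(value):
    """Identity transform: the partition value is the column value itself."""
    return value


def day(ts: datetime) -> str:
    """Derived transform: the partition value is the UTC date of a timestamp."""
    return ts.astimezone(timezone.utc).date().isoformat()


record = {
    "region": "us",
    "event_time": datetime(2018, 11, 28, 16, 32, tzinfo=timezone.utc),
}

# Partitioned directly by a column, as in a Hive-style partition.
identity_partition = identity(record["region"])

# Partitioned by a value derived from a column; the record itself
# carries no separate "day" field.
hidden_partition = day(record["event_time"])
```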