[jira] [Commented] (PARQUET-1954) TCP connection leak in parquet dump
[ https://issues.apache.org/jira/browse/PARQUET-1954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17258586#comment-17258586 ] ASF GitHub Bot commented on PARQUET-1954: - shangxinli commented on pull request #849: URL: https://github.com/apache/parquet-mr/pull/849#issuecomment-754286697 Good catch! Thanks for the effort! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > TCP connection leak in parquet dump > --- > > Key: PARQUET-1954 > URL: https://issues.apache.org/jira/browse/PARQUET-1954 > Project: Parquet > Issue Type: Bug > Components: parquet-mr >Reporter: xiepengjie >Priority: Major > > Hi, > When i'm trying to dump a parquet file, i find the TCP connection leak in > org.apache.parquet.tools.command.DumpCommand#dump : > > {code:java} > ... > ParquetFileReader freader = null; > if (showmd) { > try { > long group = 0; > for (BlockMetaData block : blocks) { > ... > freader = new ParquetFileReader( > conf, meta.getFileMetaData(), inpath, rblocks, columns); > ... > out.flushColumns(); > } > } finally { > if (freader != null) { > freader.close(); > } > } > } > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[GitHub] [parquet-mr] shangxinli commented on pull request #849: PARQUET-1954: TCP connection leak in parquet dump
shangxinli commented on pull request #849: URL: https://github.com/apache/parquet-mr/pull/849#issuecomment-754286697 Good catch! Thanks for the effort! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (PARQUET-1951) Allow different strategies to combine key values when merging parquet files
[ https://issues.apache.org/jira/browse/PARQUET-1951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17258584#comment-17258584 ] ASF GitHub Bot commented on PARQUET-1951: - shangxinli commented on a change in pull request #847: URL: https://github.com/apache/parquet-mr/pull/847#discussion_r551623352 ## File path: parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata/ConcatenatingKeyValueMetadataMergeStrategy.java ## @@ -0,0 +1,43 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.parquet.hadoop.metadata; + +import java.util.HashMap; +import java.util.Map; +import java.util.Set; +import java.util.stream.Collectors; + +/** + * Strategy to concatenate if there are multiple values for a given key in metadata. + * Note: use this with caution. This is expected to work only for certain use cases. + */ +public class ConcatenatingKeyValueMetadataMergeStrategy implements KeyValueMetadataMergeStrategy { + /** + * @param keyValueMetaData the merged app specific metadata + * + * @throws NullPointerException if keyValueMetaData is {@code null} + */ + public Map merge(Map> keyValueMetaData) { +Map mergedKeyValues = new HashMap(); +for (Map.Entry> entry : keyValueMetaData.entrySet()) { + mergedKeyValues.put(entry.getKey(), entry.getValue().stream().collect(Collectors.joining(","))); Review comment: The delimiter needs to be parameterized in other non-hudi cases. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Allow different strategies to combine key values when merging parquet files > --- > > Key: PARQUET-1951 > URL: https://issues.apache.org/jira/browse/PARQUET-1951 > Project: Parquet > Issue Type: Improvement >Reporter: satish >Priority: Minor > > I work on Apache Hudi project. We store some additional metadata in parquet > files (key range in the file, for example). So the metadata is different in > different parquet files that we want to merge these files. > Here is what I'm thinking: > 1) Merge command takes additional command line option: --strategy > . > 2) We introduce new strategy class in parquet-hadoop to keep the same > behavior as today. > We can extend that class and provide our custom implementation. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[GitHub] [parquet-mr] shangxinli commented on a change in pull request #847: [PARQUET-1951] Allow different strategies to combine key values when …
shangxinli commented on a change in pull request #847: URL: https://github.com/apache/parquet-mr/pull/847#discussion_r551623352 ## File path: parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata/ConcatenatingKeyValueMetadataMergeStrategy.java ## @@ -0,0 +1,43 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.parquet.hadoop.metadata; + +import java.util.HashMap; +import java.util.Map; +import java.util.Set; +import java.util.stream.Collectors; + +/** + * Strategy to concatenate if there are multiple values for a given key in metadata. + * Note: use this with caution. This is expected to work only for certain use cases. + */ +public class ConcatenatingKeyValueMetadataMergeStrategy implements KeyValueMetadataMergeStrategy { + /** + * @param keyValueMetaData the merged app specific metadata + * + * @throws NullPointerException if keyValueMetaData is {@code null} + */ + public Map merge(Map> keyValueMetaData) { +Map mergedKeyValues = new HashMap(); +for (Map.Entry> entry : keyValueMetaData.entrySet()) { + mergedKeyValues.put(entry.getKey(), entry.getValue().stream().collect(Collectors.joining(","))); Review comment: The delimiter needs to be parameterized in other non-hudi cases. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (PARQUET-1951) Allow different strategies to combine key values when merging parquet files
[ https://issues.apache.org/jira/browse/PARQUET-1951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17258581#comment-17258581 ] ASF GitHub Bot commented on PARQUET-1951: - shangxinli commented on a change in pull request #847: URL: https://github.com/apache/parquet-mr/pull/847#discussion_r551621357 ## File path: parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata/KeyValueMetadataMergeStrategy.java ## @@ -0,0 +1,44 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.parquet.hadoop.metadata; + +import org.apache.parquet.schema.MessageType; + +import java.io.Serializable; +import java.util.HashMap; +import java.util.Map; +import java.util.Map.Entry; +import java.util.Objects; +import java.util.Set; + +import static java.util.Collections.unmodifiableMap; + +/** + * Strategy to merge metadata. Possible implementations: + * 1) Expect metadata to be exactly identical. Otherwise throw error (this is the default strategy implementation) + * 2) Custom strategy to merge metadata if the values are not identical. Example: concatenate values. + */ +public interface KeyValueMetadataMergeStrategy extends Serializable { + /** + * @param keyValueMetaData the merged app specific metadata + * + * @throws NullPointerException if keyValueMetaData is {@code null} + */ + Map merge(Map> keyValueMetaData); Review comment: People may already rely on the 'set' to do the dedup. ## File path: parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata/KeyValueMetadataMergeStrategy.java ## @@ -0,0 +1,44 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.parquet.hadoop.metadata; + +import org.apache.parquet.schema.MessageType; + +import java.io.Serializable; +import java.util.HashMap; +import java.util.Map; +import java.util.Map.Entry; +import java.util.Objects; +import java.util.Set; + +import static java.util.Collections.unmodifiableMap; + +/** + * Strategy to merge metadata. Possible implementations: + * 1) Expect metadata to be exactly identical. Otherwise throw error (this is the default strategy implementation) + * 2) Custom strategy to merge metadata if the values are not identical. Example: concatenate values. + */ +public interface KeyValueMetadataMergeStrategy extends Serializable { + /** + * @param keyValueMetaData the merged app specific metadata + * + * @throws NullPointerException if keyValueMetaData is {@code null} Review comment: The signature doesn't have NullPointerException. ## File path: parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata/GlobalMetaData.java ## @@ -87,19 +87,25 @@ public String toString() { * Will merge the metadata as if it was coming from a single file. * (for all part files written together this will always work) * If there are conflicting values an exception will be thrown + * + * Provided for backward compatibility * @return the merged version of this */ - public FileMetaData merge() { + public FileMetaData merge() { Review comment: Remove one space in the beginning This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For
[GitHub] [parquet-mr] shangxinli commented on a change in pull request #847: [PARQUET-1951] Allow different strategies to combine key values when …
shangxinli commented on a change in pull request #847: URL: https://github.com/apache/parquet-mr/pull/847#discussion_r551621357 ## File path: parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata/KeyValueMetadataMergeStrategy.java ## @@ -0,0 +1,44 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.parquet.hadoop.metadata; + +import org.apache.parquet.schema.MessageType; + +import java.io.Serializable; +import java.util.HashMap; +import java.util.Map; +import java.util.Map.Entry; +import java.util.Objects; +import java.util.Set; + +import static java.util.Collections.unmodifiableMap; + +/** + * Strategy to merge metadata. Possible implementations: + * 1) Expect metadata to be exactly identical. Otherwise throw error (this is the default strategy implementation) + * 2) Custom strategy to merge metadata if the values are not identical. Example: concatenate values. + */ +public interface KeyValueMetadataMergeStrategy extends Serializable { + /** + * @param keyValueMetaData the merged app specific metadata + * + * @throws NullPointerException if keyValueMetaData is {@code null} + */ + Map merge(Map> keyValueMetaData); Review comment: People may already rely on the 'set' to do the dedup. ## File path: parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata/KeyValueMetadataMergeStrategy.java ## @@ -0,0 +1,44 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.parquet.hadoop.metadata; + +import org.apache.parquet.schema.MessageType; + +import java.io.Serializable; +import java.util.HashMap; +import java.util.Map; +import java.util.Map.Entry; +import java.util.Objects; +import java.util.Set; + +import static java.util.Collections.unmodifiableMap; + +/** + * Strategy to merge metadata. Possible implementations: + * 1) Expect metadata to be exactly identical. Otherwise throw error (this is the default strategy implementation) + * 2) Custom strategy to merge metadata if the values are not identical. Example: concatenate values. + */ +public interface KeyValueMetadataMergeStrategy extends Serializable { + /** + * @param keyValueMetaData the merged app specific metadata + * + * @throws NullPointerException if keyValueMetaData is {@code null} Review comment: The signature doesn't have NullPointerException. ## File path: parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata/GlobalMetaData.java ## @@ -87,19 +87,25 @@ public String toString() { * Will merge the metadata as if it was coming from a single file. * (for all part files written together this will always work) * If there are conflicting values an exception will be thrown + * + * Provided for backward compatibility * @return the merged version of this */ - public FileMetaData merge() { + public FileMetaData merge() { Review comment: Remove one space in the beginning This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (PARQUET-1951) Allow different strategies to combine key values when merging parquet files
[ https://issues.apache.org/jira/browse/PARQUET-1951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17258562#comment-17258562 ] ASF GitHub Bot commented on PARQUET-1951: - satishkotha edited a comment on pull request #847: URL: https://github.com/apache/parquet-mr/pull/847#issuecomment-754266730 > I have a couple of comments but the change itself looks good. > > Meanwhile, I have some problems with the concept. `parquet-tools` is a simple command in most of the environments. It means that it is not clear how to extend it with your own implementation. > I would suggest adding this change to the API only (in `parquet-hadoop`) so you can extend the implementation from the code. If you want to extend `parquet-tools` as well I would suggest adding your implementation to parquet and let the user choose from the existing implementations (by its simple name like `strict` or others not the class name). @gszadovszky I added one more strategy for merging key values. I would prefer to keep this in parquet-tools to make this easy to run for one-off debugging use cases. Let me know if you strong opinion against this. Also, the actual merge strategy I need is specific to my use-case. So I'd like to keep referencing the class name. Here is an example to help explain: This is how I run parquet-tools java -cp my-app-jar:parquet-tools/target/parquet-tools-1.12.0-SNAPSHOT.jar:hadoop-jars merge --s f1.parquet f2.parquet merged.parquet Parquet files we write have additional metadata "range" and "count" (Count is not rowcount, its specific to my usecase tracking distinct values for a column). 1) file f1.parquet metadata will have - key: "range", value : "a-d" - key: "count", value: "12000" 2) file f2.parquet metadata will have - key: "range", value: "e-f" - key: "count", value: "1" When we merge, we want to resulting parquet file to have metadata - key: "range", value: "a-f" - key: "count", value: "15000" So merge strategy is different for different keys we store in parquet footer. This is specific to how we use parquet and very likely not usable for other use cases, so I prefer not to expose this strategy in parquet library. Let me know if you have any suggestions. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Allow different strategies to combine key values when merging parquet files > --- > > Key: PARQUET-1951 > URL: https://issues.apache.org/jira/browse/PARQUET-1951 > Project: Parquet > Issue Type: Improvement >Reporter: satish >Priority: Minor > > I work on Apache Hudi project. We store some additional metadata in parquet > files (key range in the file, for example). So the metadata is different in > different parquet files that we want to merge these files. > Here is what I'm thinking: > 1) Merge command takes additional command line option: --strategy > . > 2) We introduce new strategy class in parquet-hadoop to keep the same > behavior as today. > We can extend that class and provide our custom implementation. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (PARQUET-1951) Allow different strategies to combine key values when merging parquet files
[ https://issues.apache.org/jira/browse/PARQUET-1951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17258563#comment-17258563 ] ASF GitHub Bot commented on PARQUET-1951: - satishkotha edited a comment on pull request #847: URL: https://github.com/apache/parquet-mr/pull/847#issuecomment-754266730 > I have a couple of comments but the change itself looks good. > > Meanwhile, I have some problems with the concept. `parquet-tools` is a simple command in most of the environments. It means that it is not clear how to extend it with your own implementation. > I would suggest adding this change to the API only (in `parquet-hadoop`) so you can extend the implementation from the code. If you want to extend `parquet-tools` as well I would suggest adding your implementation to parquet and let the user choose from the existing implementations (by its simple name like `strict` or others not the class name). @gszadovszky I added one more strategy for merging key values. I would prefer to keep this in parquet-tools to make this easy to run for one-off debugging use cases. Let me know if you strong opinion against this. Also, the actual merge strategy I need is specific to my use-case. So I'd like to keep referencing the class name. Here is an example to help explain: This is how I run parquet-tools java -cp my-app-jar:parquet-tools/target/parquet-tools-1.12.0-SNAPSHOT.jar:hadoop-jars merge --s myClass f1.parquet f2.parquet merged.parquet Parquet files we write have additional metadata "range" and "count" (Count is not rowcount, its specific to my usecase tracking distinct values for a column). 1) file f1.parquet metadata will have - key: "range", value : "a-d" - key: "count", value: "12000" 2) file f2.parquet metadata will have - key: "range", value: "e-f" - key: "count", value: "1" When we merge, we want to resulting parquet file to have metadata - key: "range", value: "a-f" - key: "count", value: "15000" So merge strategy is different for different keys we store in parquet footer. This is specific to how we use parquet and very likely not usable for other use cases, so I prefer not to expose this strategy in parquet library. Let me know if you have any suggestions. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Allow different strategies to combine key values when merging parquet files > --- > > Key: PARQUET-1951 > URL: https://issues.apache.org/jira/browse/PARQUET-1951 > Project: Parquet > Issue Type: Improvement >Reporter: satish >Priority: Minor > > I work on Apache Hudi project. We store some additional metadata in parquet > files (key range in the file, for example). So the metadata is different in > different parquet files that we want to merge these files. > Here is what I'm thinking: > 1) Merge command takes additional command line option: --strategy > . > 2) We introduce new strategy class in parquet-hadoop to keep the same > behavior as today. > We can extend that class and provide our custom implementation. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[GitHub] [parquet-mr] satishkotha edited a comment on pull request #847: [PARQUET-1951] Allow different strategies to combine key values when …
satishkotha edited a comment on pull request #847: URL: https://github.com/apache/parquet-mr/pull/847#issuecomment-754266730 > I have a couple of comments but the change itself looks good. > > Meanwhile, I have some problems with the concept. `parquet-tools` is a simple command in most of the environments. It means that it is not clear how to extend it with your own implementation. > I would suggest adding this change to the API only (in `parquet-hadoop`) so you can extend the implementation from the code. If you want to extend `parquet-tools` as well I would suggest adding your implementation to parquet and let the user choose from the existing implementations (by its simple name like `strict` or others not the class name). @gszadovszky I added one more strategy for merging key values. I would prefer to keep this in parquet-tools to make this easy to run for one-off debugging use cases. Let me know if you strong opinion against this. Also, the actual merge strategy I need is specific to my use-case. So I'd like to keep referencing the class name. Here is an example to help explain: This is how I run parquet-tools java -cp my-app-jar:parquet-tools/target/parquet-tools-1.12.0-SNAPSHOT.jar:hadoop-jars merge --s myClass f1.parquet f2.parquet merged.parquet Parquet files we write have additional metadata "range" and "count" (Count is not rowcount, its specific to my usecase tracking distinct values for a column). 1) file f1.parquet metadata will have - key: "range", value : "a-d" - key: "count", value: "12000" 2) file f2.parquet metadata will have - key: "range", value: "e-f" - key: "count", value: "1" When we merge, we want to resulting parquet file to have metadata - key: "range", value: "a-f" - key: "count", value: "15000" So merge strategy is different for different keys we store in parquet footer. This is specific to how we use parquet and very likely not usable for other use cases, so I prefer not to expose this strategy in parquet library. Let me know if you have any suggestions. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [parquet-mr] satishkotha edited a comment on pull request #847: [PARQUET-1951] Allow different strategies to combine key values when …
satishkotha edited a comment on pull request #847: URL: https://github.com/apache/parquet-mr/pull/847#issuecomment-754266730 > I have a couple of comments but the change itself looks good. > > Meanwhile, I have some problems with the concept. `parquet-tools` is a simple command in most of the environments. It means that it is not clear how to extend it with your own implementation. > I would suggest adding this change to the API only (in `parquet-hadoop`) so you can extend the implementation from the code. If you want to extend `parquet-tools` as well I would suggest adding your implementation to parquet and let the user choose from the existing implementations (by its simple name like `strict` or others not the class name). @gszadovszky I added one more strategy for merging key values. I would prefer to keep this in parquet-tools to make this easy to run for one-off debugging use cases. Let me know if you strong opinion against this. Also, the actual merge strategy I need is specific to my use-case. So I'd like to keep referencing the class name. Here is an example to help explain: This is how I run parquet-tools java -cp my-app-jar:parquet-tools/target/parquet-tools-1.12.0-SNAPSHOT.jar:hadoop-jars merge --s f1.parquet f2.parquet merged.parquet Parquet files we write have additional metadata "range" and "count" (Count is not rowcount, its specific to my usecase tracking distinct values for a column). 1) file f1.parquet metadata will have - key: "range", value : "a-d" - key: "count", value: "12000" 2) file f2.parquet metadata will have - key: "range", value: "e-f" - key: "count", value: "1" When we merge, we want to resulting parquet file to have metadata - key: "range", value: "a-f" - key: "count", value: "15000" So merge strategy is different for different keys we store in parquet footer. This is specific to how we use parquet and very likely not usable for other use cases, so I prefer not to expose this strategy in parquet library. Let me know if you have any suggestions. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (PARQUET-1951) Allow different strategies to combine key values when merging parquet files
[ https://issues.apache.org/jira/browse/PARQUET-1951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17258561#comment-17258561 ] ASF GitHub Bot commented on PARQUET-1951: - satishkotha commented on pull request #847: URL: https://github.com/apache/parquet-mr/pull/847#issuecomment-754266730 > I have a couple of comments but the change itself looks good. > > Meanwhile, I have some problems with the concept. `parquet-tools` is a simple command in most of the environments. It means that it is not clear how to extend it with your own implementation. > I would suggest adding this change to the API only (in `parquet-hadoop`) so you can extend the implementation from the code. If you want to extend `parquet-tools` as well I would suggest adding your implementation to parquet and let the user choose from the existing implementations (by its simple name like `strict` or others not the class name). @gszadovszky I added one more strategy for merging key values. I would prefer to keep this in parquet-tools to make this easy to run for one-off debugging use cases. Let me know if you strong opinion against this. Also, the actual merge strategy I need is specific to my use-case. So I'd like to keep referencing the class name. Here is an example to help explain: This is how I run parquet-tools java -cp my-app-jar:parquet-tools/target/parquet-tools-1.12.0-SNAPSHOT.jar: merge --s f1.parquet f2.parquet merged.parquet Parquet files we write have additional metadata "range" and "count" (Count is not rowcount, its specific to my usecase tracking distinct values for a column). 1) file f1.parquet metadata will have - key: "range", value : "a-d" - key: "count", value: "12000" 2) file f2.parquet metadata will have - key: "range", value: "e-f" - key: "count", value: "1" When we merge, we want to resulting parquet file to have metadata - key: "range", value: "a-f" - key: "count", value: "15000" So merge strategy is different for different keys we store in parquet footer. This is specific to how we use parquet and very likely not usable for other use cases, so I prefer not to expose this strategy in parquet library. Let me know if you have any suggestions. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Allow different strategies to combine key values when merging parquet files > --- > > Key: PARQUET-1951 > URL: https://issues.apache.org/jira/browse/PARQUET-1951 > Project: Parquet > Issue Type: Improvement >Reporter: satish >Priority: Minor > > I work on Apache Hudi project. We store some additional metadata in parquet > files (key range in the file, for example). So the metadata is different in > different parquet files that we want to merge these files. > Here is what I'm thinking: > 1) Merge command takes additional command line option: --strategy > . > 2) We introduce new strategy class in parquet-hadoop to keep the same > behavior as today. > We can extend that class and provide our custom implementation. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[GitHub] [parquet-mr] satishkotha commented on pull request #847: [PARQUET-1951] Allow different strategies to combine key values when …
satishkotha commented on pull request #847: URL: https://github.com/apache/parquet-mr/pull/847#issuecomment-754266730 > I have a couple of comments but the change itself looks good. > > Meanwhile, I have some problems with the concept. `parquet-tools` is a simple command in most of the environments. It means that it is not clear how to extend it with your own implementation. > I would suggest adding this change to the API only (in `parquet-hadoop`) so you can extend the implementation from the code. If you want to extend `parquet-tools` as well I would suggest adding your implementation to parquet and let the user choose from the existing implementations (by its simple name like `strict` or others not the class name). @gszadovszky I added one more strategy for merging key values. I would prefer to keep this in parquet-tools to make this easy to run for one-off debugging use cases. Let me know if you strong opinion against this. Also, the actual merge strategy I need is specific to my use-case. So I'd like to keep referencing the class name. Here is an example to help explain: This is how I run parquet-tools java -cp my-app-jar:parquet-tools/target/parquet-tools-1.12.0-SNAPSHOT.jar: merge --s f1.parquet f2.parquet merged.parquet Parquet files we write have additional metadata "range" and "count" (Count is not rowcount, its specific to my usecase tracking distinct values for a column). 1) file f1.parquet metadata will have - key: "range", value : "a-d" - key: "count", value: "12000" 2) file f2.parquet metadata will have - key: "range", value: "e-f" - key: "count", value: "1" When we merge, we want to resulting parquet file to have metadata - key: "range", value: "a-f" - key: "count", value: "15000" So merge strategy is different for different keys we store in parquet footer. This is specific to how we use parquet and very likely not usable for other use cases, so I prefer not to expose this strategy in parquet library. Let me know if you have any suggestions. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (PARQUET-1951) Allow different strategies to combine key values when merging parquet files
[ https://issues.apache.org/jira/browse/PARQUET-1951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17258542#comment-17258542 ] ASF GitHub Bot commented on PARQUET-1951: - satishkotha commented on a change in pull request #847: URL: https://github.com/apache/parquet-mr/pull/847#discussion_r551598592 ## File path: parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata/KeyValueMetadataMergeStrategy.java ## @@ -0,0 +1,44 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.parquet.hadoop.metadata; + +import org.apache.parquet.schema.MessageType; + +import java.io.Serializable; +import java.util.HashMap; +import java.util.Map; +import java.util.Map.Entry; +import java.util.Objects; +import java.util.Set; + +import static java.util.Collections.unmodifiableMap; + +/** + * Strategy to merge metadata. Possible implementations: + * 1) Expect metadata to be exactly identical. Otherwise throw error (this is the default strategy implementation) + * 2) Custom strategy to merge metadata if the values are not identical. Example: concatenate values. + */ +public interface KeyValueMetadataMergeStrategy extends Serializable { + /** + * @param keyValueMetaData the merged app specific metadata + * + * @throws NullPointerException if keyValueMetaData is {@code null} + */ + Map merge(Map> keyValueMetaData); Review comment: GlobalMetadata uses Map> https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata/GlobalMetaData.java#L41. So merge method takes in same argument. Do you think it makes sense to change GlobalMetadata to track items in List? I'm not sure if it would be backward compatible and if there are use cases expecting this to be Set. Without changing GlobalMetadata#keyValueMetaData, it may not be useful to change this interface. Let me know what you think. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Allow different strategies to combine key values when merging parquet files > --- > > Key: PARQUET-1951 > URL: https://issues.apache.org/jira/browse/PARQUET-1951 > Project: Parquet > Issue Type: Improvement >Reporter: satish >Priority: Minor > > I work on Apache Hudi project. We store some additional metadata in parquet > files (key range in the file, for example). So the metadata is different in > different parquet files that we want to merge these files. > Here is what I'm thinking: > 1) Merge command takes additional command line option: --strategy > . > 2) We introduce new strategy class in parquet-hadoop to keep the same > behavior as today. > We can extend that class and provide our custom implementation. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (PARQUET-1951) Allow different strategies to combine key values when merging parquet files
[ https://issues.apache.org/jira/browse/PARQUET-1951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17258540#comment-17258540 ] ASF GitHub Bot commented on PARQUET-1951: - satishkotha commented on a change in pull request #847: URL: https://github.com/apache/parquet-mr/pull/847#discussion_r551598725 ## File path: parquet-tools/src/main/java/org/apache/parquet/tools/command/MergeCommand.java ## @@ -41,6 +46,21 @@ "is the destination parquet file" }; + private static final String DEFAULT_KEY_VALUE_MERGE_STRATEGY = new DefaultKeyValueMetadataMergeStrategy().getClass().getName(); Review comment: My bad. Fixed. Thanks This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Allow different strategies to combine key values when merging parquet files > --- > > Key: PARQUET-1951 > URL: https://issues.apache.org/jira/browse/PARQUET-1951 > Project: Parquet > Issue Type: Improvement >Reporter: satish >Priority: Minor > > I work on Apache Hudi project. We store some additional metadata in parquet > files (key range in the file, for example). So the metadata is different in > different parquet files that we want to merge these files. > Here is what I'm thinking: > 1) Merge command takes additional command line option: --strategy > . > 2) We introduce new strategy class in parquet-hadoop to keep the same > behavior as today. > We can extend that class and provide our custom implementation. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[GitHub] [parquet-mr] satishkotha commented on a change in pull request #847: [PARQUET-1951] Allow different strategies to combine key values when …
satishkotha commented on a change in pull request #847: URL: https://github.com/apache/parquet-mr/pull/847#discussion_r551598592 ## File path: parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata/KeyValueMetadataMergeStrategy.java ## @@ -0,0 +1,44 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.parquet.hadoop.metadata; + +import org.apache.parquet.schema.MessageType; + +import java.io.Serializable; +import java.util.HashMap; +import java.util.Map; +import java.util.Map.Entry; +import java.util.Objects; +import java.util.Set; + +import static java.util.Collections.unmodifiableMap; + +/** + * Strategy to merge metadata. Possible implementations: + * 1) Expect metadata to be exactly identical. Otherwise throw error (this is the default strategy implementation) + * 2) Custom strategy to merge metadata if the values are not identical. Example: concatenate values. + */ +public interface KeyValueMetadataMergeStrategy extends Serializable { + /** + * @param keyValueMetaData the merged app specific metadata + * + * @throws NullPointerException if keyValueMetaData is {@code null} + */ + Map merge(Map> keyValueMetaData); Review comment: GlobalMetadata uses Map> https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata/GlobalMetaData.java#L41. So merge method takes in same argument. Do you think it makes sense to change GlobalMetadata to track items in List? I'm not sure if it would be backward compatible and if there are use cases expecting this to be Set. Without changing GlobalMetadata#keyValueMetaData, it may not be useful to change this interface. Let me know what you think. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [parquet-mr] satishkotha commented on a change in pull request #847: [PARQUET-1951] Allow different strategies to combine key values when …
satishkotha commented on a change in pull request #847: URL: https://github.com/apache/parquet-mr/pull/847#discussion_r551598725 ## File path: parquet-tools/src/main/java/org/apache/parquet/tools/command/MergeCommand.java ## @@ -41,6 +46,21 @@ "is the destination parquet file" }; + private static final String DEFAULT_KEY_VALUE_MERGE_STRATEGY = new DefaultKeyValueMetadataMergeStrategy().getClass().getName(); Review comment: My bad. Fixed. Thanks This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (PARQUET-1951) Allow different strategies to combine key values when merging parquet files
[ https://issues.apache.org/jira/browse/PARQUET-1951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17258539#comment-17258539 ] ASF GitHub Bot commented on PARQUET-1951: - satishkotha commented on a change in pull request #847: URL: https://github.com/apache/parquet-mr/pull/847#discussion_r551598592 ## File path: parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata/KeyValueMetadataMergeStrategy.java ## @@ -0,0 +1,44 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.parquet.hadoop.metadata; + +import org.apache.parquet.schema.MessageType; + +import java.io.Serializable; +import java.util.HashMap; +import java.util.Map; +import java.util.Map.Entry; +import java.util.Objects; +import java.util.Set; + +import static java.util.Collections.unmodifiableMap; + +/** + * Strategy to merge metadata. Possible implementations: + * 1) Expect metadata to be exactly identical. Otherwise throw error (this is the default strategy implementation) + * 2) Custom strategy to merge metadata if the values are not identical. Example: concatenate values. + */ +public interface KeyValueMetadataMergeStrategy extends Serializable { + /** + * @param keyValueMetaData the merged app specific metadata + * + * @throws NullPointerException if keyValueMetaData is {@code null} + */ + Map merge(Map> keyValueMetaData); Review comment: GlobalMetadata uses Map> https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata/GlobalMetaData.java#L41. So merge method takes in same argument. Do you think it makes sense to change GlobalMetadata to track items in List? I'm not sure if it would be backward compatible and if there are use cases expecting this to be Set. Without changing GlobalMetadata#keyValueMetaData, it may not be useful to change this interface. Let me know what you think. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Allow different strategies to combine key values when merging parquet files > --- > > Key: PARQUET-1951 > URL: https://issues.apache.org/jira/browse/PARQUET-1951 > Project: Parquet > Issue Type: Improvement >Reporter: satish >Priority: Minor > > I work on Apache Hudi project. We store some additional metadata in parquet > files (key range in the file, for example). So the metadata is different in > different parquet files that we want to merge these files. > Here is what I'm thinking: > 1) Merge command takes additional command line option: --strategy > . > 2) We introduce new strategy class in parquet-hadoop to keep the same > behavior as today. > We can extend that class and provide our custom implementation. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[GitHub] [parquet-mr] satishkotha commented on a change in pull request #847: [PARQUET-1951] Allow different strategies to combine key values when …
satishkotha commented on a change in pull request #847: URL: https://github.com/apache/parquet-mr/pull/847#discussion_r551598592 ## File path: parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata/KeyValueMetadataMergeStrategy.java ## @@ -0,0 +1,44 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.parquet.hadoop.metadata; + +import org.apache.parquet.schema.MessageType; + +import java.io.Serializable; +import java.util.HashMap; +import java.util.Map; +import java.util.Map.Entry; +import java.util.Objects; +import java.util.Set; + +import static java.util.Collections.unmodifiableMap; + +/** + * Strategy to merge metadata. Possible implementations: + * 1) Expect metadata to be exactly identical. Otherwise throw error (this is the default strategy implementation) + * 2) Custom strategy to merge metadata if the values are not identical. Example: concatenate values. + */ +public interface KeyValueMetadataMergeStrategy extends Serializable { + /** + * @param keyValueMetaData the merged app specific metadata + * + * @throws NullPointerException if keyValueMetaData is {@code null} + */ + Map merge(Map> keyValueMetaData); Review comment: GlobalMetadata uses Map> https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata/GlobalMetaData.java#L41. So merge method takes in same argument. Do you think it makes sense to change GlobalMetadata to track items in List? I'm not sure if it would be backward compatible and if there are use cases expecting this to be Set. Without changing GlobalMetadata#keyValueMetaData, it may not be useful to change this interface. Let me know what you think. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (PARQUET-1951) Allow different strategies to combine key values when merging parquet files
[ https://issues.apache.org/jira/browse/PARQUET-1951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17258533#comment-17258533 ] ASF GitHub Bot commented on PARQUET-1951: - satishkotha commented on a change in pull request #847: URL: https://github.com/apache/parquet-mr/pull/847#discussion_r551595794 ## File path: parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata/DefaultKeyValueMetadataMergeStrategy.java ## @@ -0,0 +1,44 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.parquet.hadoop.metadata; + +import java.util.HashMap; +import java.util.Map; +import java.util.Set; + +/** + * Default strategy to throw errors in metadata if there are multiple values for a given key in metadata. + */ +public class DefaultKeyValueMetadataMergeStrategy implements KeyValueMetadataMergeStrategy { Review comment: Good suggestion. Fixed it. Thanks! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Allow different strategies to combine key values when merging parquet files > --- > > Key: PARQUET-1951 > URL: https://issues.apache.org/jira/browse/PARQUET-1951 > Project: Parquet > Issue Type: Improvement >Reporter: satish >Priority: Minor > > I work on Apache Hudi project. We store some additional metadata in parquet > files (key range in the file, for example). So the metadata is different in > different parquet files that we want to merge these files. > Here is what I'm thinking: > 1) Merge command takes additional command line option: --strategy > . > 2) We introduce new strategy class in parquet-hadoop to keep the same > behavior as today. > We can extend that class and provide our custom implementation. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[GitHub] [parquet-mr] satishkotha commented on a change in pull request #847: [PARQUET-1951] Allow different strategies to combine key values when …
satishkotha commented on a change in pull request #847: URL: https://github.com/apache/parquet-mr/pull/847#discussion_r551595794 ## File path: parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata/DefaultKeyValueMetadataMergeStrategy.java ## @@ -0,0 +1,44 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.parquet.hadoop.metadata; + +import java.util.HashMap; +import java.util.Map; +import java.util.Set; + +/** + * Default strategy to throw errors in metadata if there are multiple values for a given key in metadata. + */ +public class DefaultKeyValueMetadataMergeStrategy implements KeyValueMetadataMergeStrategy { Review comment: Good suggestion. Fixed it. Thanks! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: Query on striping parquet files to maintain Row group alignment
Hi Jayjeet, I assume you are using parquet-mr (and not other parquet implementations like parquet-cpp, Impala etc.). I am not sure if I got your request correctly. You may configure the size of the row group by setting the config parquet.block.size. You may also check parquet.writer.max-padding so the row groups will fit exactly into the blocks. See details about the available configs at https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/README.md. Currently, parquet-mr does not have the functionality to automatically close a parquet file and start a new one during writing. Regards, Gabor On Thu, Dec 31, 2020 at 5:58 AM Jayjeet Chakraborty < jayjeetchakrabort...@gmail.com> wrote: > Hi all, > > I am trying to figure out if a large Parquet file can be striped across > multiple small files based on a Row group chunk size where each stripe > would naturally end up containing data pages from a single row group. So, > if I say my writer "write a parquet file in chunks of 128 MB (assuming my > row groups are of around 128MB), each of my chunks ends up being > self-contained row group, maybe except the last chunk which has the footer > contents. Is this possible? Can we fix the row group size (the amount of > disk space a row group uses) while writing parquet files ? Thanks a lot. >
[jira] [Commented] (PARQUET-1827) UUID type currently not supported by parquet-mr
[ https://issues.apache.org/jira/browse/PARQUET-1827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17258287#comment-17258287 ] Gabor Szadovszky commented on PARQUET-1827: --- [~vitalii], there are some open points for 1.12.0. I think we will know more about a release date after the parquet sync next week. > UUID type currently not supported by parquet-mr > --- > > Key: PARQUET-1827 > URL: https://issues.apache.org/jira/browse/PARQUET-1827 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Affects Versions: 1.11.0 >Reporter: Brad Smith >Assignee: Gabor Szadovszky >Priority: Major > Labels: pull-request-available > Fix For: 1.12.0 > > > The parquet-format project introduced a new UUID logical type in version 2.4: > [https://github.com/apache/parquet-format/blob/master/CHANGES.md] > This would be a useful type to have available in some circumstances, but it > currently isn't supported in the parquet-mr library. Hopefully this feature > can be implemented at some point. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (PARQUET-1827) UUID type currently not supported by parquet-mr
[ https://issues.apache.org/jira/browse/PARQUET-1827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Szadovszky updated PARQUET-1827: -- Fix Version/s: 1.12.0 > UUID type currently not supported by parquet-mr > --- > > Key: PARQUET-1827 > URL: https://issues.apache.org/jira/browse/PARQUET-1827 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Affects Versions: 1.11.0 >Reporter: Brad Smith >Assignee: Gabor Szadovszky >Priority: Major > Labels: pull-request-available > Fix For: 1.12.0 > > > The parquet-format project introduced a new UUID logical type in version 2.4: > [https://github.com/apache/parquet-format/blob/master/CHANGES.md] > This would be a useful type to have available in some circumstances, but it > currently isn't supported in the parquet-mr library. Hopefully this feature > can be implemented at some point. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (PARQUET-1951) Allow different strategies to combine key values when merging parquet files
[ https://issues.apache.org/jira/browse/PARQUET-1951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17258139#comment-17258139 ] ASF GitHub Bot commented on PARQUET-1951: - gszadovszky commented on a change in pull request #847: URL: https://github.com/apache/parquet-mr/pull/847#discussion_r551240027 ## File path: parquet-tools/src/main/java/org/apache/parquet/tools/command/MergeCommand.java ## @@ -41,6 +46,21 @@ "is the destination parquet file" }; + private static final String DEFAULT_KEY_VALUE_MERGE_STRATEGY = new DefaultKeyValueMetadataMergeStrategy().getClass().getName(); Review comment: You do not need to instantiate an object to get its class. Simply use `DefaultKeyValueMetadataMergeStrategy.class` ## File path: parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata/KeyValueMetadataMergeStrategy.java ## @@ -0,0 +1,44 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.parquet.hadoop.metadata; + +import org.apache.parquet.schema.MessageType; + +import java.io.Serializable; +import java.util.HashMap; +import java.util.Map; +import java.util.Map.Entry; +import java.util.Objects; +import java.util.Set; + +import static java.util.Collections.unmodifiableMap; + +/** + * Strategy to merge metadata. Possible implementations: + * 1) Expect metadata to be exactly identical. Otherwise throw error (this is the default strategy implementation) + * 2) Custom strategy to merge metadata if the values are not identical. Example: concatenate values. + */ +public interface KeyValueMetadataMergeStrategy extends Serializable { + /** + * @param keyValueMetaData the merged app specific metadata + * + * @throws NullPointerException if keyValueMetaData is {@code null} + */ + Map merge(Map> keyValueMetaData); Review comment: I think, this interface does not support all types of the potential merge strategy implementations. For example if you would like to use the value which is common in the most files you cannot do it because you only have the distinct values. What do you think about using a `List` instead of a `Set`? ## File path: parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata/DefaultKeyValueMetadataMergeStrategy.java ## @@ -0,0 +1,44 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.parquet.hadoop.metadata; + +import java.util.HashMap; +import java.util.Map; +import java.util.Set; + +/** + * Default strategy to throw errors in metadata if there are multiple values for a given key in metadata. + */ +public class DefaultKeyValueMetadataMergeStrategy implements KeyValueMetadataMergeStrategy { Review comment: I do not really like the naming of this class. This is the default implementation, it's fine but the name should suggest what it does. I'm quite sure there is a better naming but what about `StrictKeyValueMetadataMergeStrategy`? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Allow different strategies to combine key values when merging parquet files >
[GitHub] [parquet-mr] gszadovszky commented on a change in pull request #847: [PARQUET-1951] Allow different strategies to combine key values when …
gszadovszky commented on a change in pull request #847: URL: https://github.com/apache/parquet-mr/pull/847#discussion_r551240027 ## File path: parquet-tools/src/main/java/org/apache/parquet/tools/command/MergeCommand.java ## @@ -41,6 +46,21 @@ "is the destination parquet file" }; + private static final String DEFAULT_KEY_VALUE_MERGE_STRATEGY = new DefaultKeyValueMetadataMergeStrategy().getClass().getName(); Review comment: You do not need to instantiate an object to get its class. Simply use `DefaultKeyValueMetadataMergeStrategy.class` ## File path: parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata/KeyValueMetadataMergeStrategy.java ## @@ -0,0 +1,44 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.parquet.hadoop.metadata; + +import org.apache.parquet.schema.MessageType; + +import java.io.Serializable; +import java.util.HashMap; +import java.util.Map; +import java.util.Map.Entry; +import java.util.Objects; +import java.util.Set; + +import static java.util.Collections.unmodifiableMap; + +/** + * Strategy to merge metadata. Possible implementations: + * 1) Expect metadata to be exactly identical. Otherwise throw error (this is the default strategy implementation) + * 2) Custom strategy to merge metadata if the values are not identical. Example: concatenate values. + */ +public interface KeyValueMetadataMergeStrategy extends Serializable { + /** + * @param keyValueMetaData the merged app specific metadata + * + * @throws NullPointerException if keyValueMetaData is {@code null} + */ + Map merge(Map> keyValueMetaData); Review comment: I think, this interface does not support all types of the potential merge strategy implementations. For example if you would like to use the value which is common in the most files you cannot do it because you only have the distinct values. What do you think about using a `List` instead of a `Set`? ## File path: parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata/DefaultKeyValueMetadataMergeStrategy.java ## @@ -0,0 +1,44 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.parquet.hadoop.metadata; + +import java.util.HashMap; +import java.util.Map; +import java.util.Set; + +/** + * Default strategy to throw errors in metadata if there are multiple values for a given key in metadata. + */ +public class DefaultKeyValueMetadataMergeStrategy implements KeyValueMetadataMergeStrategy { Review comment: I do not really like the naming of this class. This is the default implementation, it's fine but the name should suggest what it does. I'm quite sure there is a better naming but what about `StrictKeyValueMetadataMergeStrategy`? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org