[jira] [Commented] (PARQUET-1954) TCP connection leak in parquet dump

2021-01-04 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17258586#comment-17258586
 ] 

ASF GitHub Bot commented on PARQUET-1954:
-

shangxinli commented on pull request #849:
URL: https://github.com/apache/parquet-mr/pull/849#issuecomment-754286697


   Good catch! Thanks for the effort! 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> TCP connection leak in parquet dump
> ---
>
> Key: PARQUET-1954
> URL: https://issues.apache.org/jira/browse/PARQUET-1954
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Reporter: xiepengjie
>Priority: Major
>
> Hi,
> When i'm trying to dump a parquet file, i find the TCP connection leak in 
> org.apache.parquet.tools.command.DumpCommand#dump :
>  
> {code:java}
> ...
> ParquetFileReader freader = null;
> if (showmd) {
> try {
> long group = 0;
> for (BlockMetaData block : blocks) {
> ...
> freader = new ParquetFileReader(
> conf, meta.getFileMetaData(), inpath, rblocks, columns);
> ...
> out.flushColumns();
> }
> } finally {
> if (freader != null) {
> freader.close();
> }
> }
> }
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [parquet-mr] shangxinli commented on pull request #849: PARQUET-1954: TCP connection leak in parquet dump

2021-01-04 Thread GitBox


shangxinli commented on pull request #849:
URL: https://github.com/apache/parquet-mr/pull/849#issuecomment-754286697


   Good catch! Thanks for the effort! 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (PARQUET-1951) Allow different strategies to combine key values when merging parquet files

2021-01-04 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17258584#comment-17258584
 ] 

ASF GitHub Bot commented on PARQUET-1951:
-

shangxinli commented on a change in pull request #847:
URL: https://github.com/apache/parquet-mr/pull/847#discussion_r551623352



##
File path: 
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata/ConcatenatingKeyValueMetadataMergeStrategy.java
##
@@ -0,0 +1,43 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.parquet.hadoop.metadata;
+
+import java.util.HashMap;
+import java.util.Map;
+import java.util.Set;
+import java.util.stream.Collectors;
+
+/**
+ *  Strategy to concatenate if there are multiple values for a given key in 
metadata.
+ *  Note: use this with caution. This is expected to work only for certain use 
cases.
+ */
+public class ConcatenatingKeyValueMetadataMergeStrategy implements 
KeyValueMetadataMergeStrategy {
+  /**
+   * @param keyValueMetaData the merged app specific metadata
+   *
+   * @throws NullPointerException if keyValueMetaData is {@code null}
+   */
+  public Map merge(Map> keyValueMetaData) {
+Map mergedKeyValues = new HashMap();
+for (Map.Entry> entry : keyValueMetaData.entrySet()) {
+  mergedKeyValues.put(entry.getKey(), 
entry.getValue().stream().collect(Collectors.joining(",")));

Review comment:
   The delimiter needs to be parameterized in other non-hudi cases.  





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Allow different strategies to combine key values when merging parquet files
> ---
>
> Key: PARQUET-1951
> URL: https://issues.apache.org/jira/browse/PARQUET-1951
> Project: Parquet
>  Issue Type: Improvement
>Reporter: satish
>Priority: Minor
>
> I work on Apache Hudi project. We store some additional metadata in parquet 
> files (key range in the file, for example).  So the metadata is different in 
> different parquet files that we want to merge these files. 
> Here is what I'm thinking:
> 1) Merge command takes additional command line option: --strategy 
> . 
> 2) We introduce new strategy class in parquet-hadoop to keep the same 
> behavior as today.  
> We can extend that class and provide our custom implementation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [parquet-mr] shangxinli commented on a change in pull request #847: [PARQUET-1951] Allow different strategies to combine key values when …

2021-01-04 Thread GitBox


shangxinli commented on a change in pull request #847:
URL: https://github.com/apache/parquet-mr/pull/847#discussion_r551623352



##
File path: 
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata/ConcatenatingKeyValueMetadataMergeStrategy.java
##
@@ -0,0 +1,43 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.parquet.hadoop.metadata;
+
+import java.util.HashMap;
+import java.util.Map;
+import java.util.Set;
+import java.util.stream.Collectors;
+
+/**
+ *  Strategy to concatenate if there are multiple values for a given key in 
metadata.
+ *  Note: use this with caution. This is expected to work only for certain use 
cases.
+ */
+public class ConcatenatingKeyValueMetadataMergeStrategy implements 
KeyValueMetadataMergeStrategy {
+  /**
+   * @param keyValueMetaData the merged app specific metadata
+   *
+   * @throws NullPointerException if keyValueMetaData is {@code null}
+   */
+  public Map merge(Map> keyValueMetaData) {
+Map mergedKeyValues = new HashMap();
+for (Map.Entry> entry : keyValueMetaData.entrySet()) {
+  mergedKeyValues.put(entry.getKey(), 
entry.getValue().stream().collect(Collectors.joining(",")));

Review comment:
   The delimiter needs to be parameterized in other non-hudi cases.  





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (PARQUET-1951) Allow different strategies to combine key values when merging parquet files

2021-01-04 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17258581#comment-17258581
 ] 

ASF GitHub Bot commented on PARQUET-1951:
-

shangxinli commented on a change in pull request #847:
URL: https://github.com/apache/parquet-mr/pull/847#discussion_r551621357



##
File path: 
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata/KeyValueMetadataMergeStrategy.java
##
@@ -0,0 +1,44 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.parquet.hadoop.metadata;
+
+import org.apache.parquet.schema.MessageType;
+
+import java.io.Serializable;
+import java.util.HashMap;
+import java.util.Map;
+import java.util.Map.Entry;
+import java.util.Objects;
+import java.util.Set;
+
+import static java.util.Collections.unmodifiableMap;
+
+/**
+ * Strategy to merge metadata. Possible implementations:
+ * 1) Expect metadata to be exactly identical. Otherwise throw error (this is 
the default strategy implementation)
+ * 2) Custom strategy to merge metadata if the values are not identical. 
Example: concatenate values.
+ */
+public interface KeyValueMetadataMergeStrategy extends Serializable {
+  /**
+   * @param keyValueMetaData the merged app specific metadata
+   *
+   * @throws NullPointerException if keyValueMetaData is {@code null}
+   */
+  Map merge(Map> keyValueMetaData);

Review comment:
   People may already rely on the 'set' to do the dedup.

##
File path: 
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata/KeyValueMetadataMergeStrategy.java
##
@@ -0,0 +1,44 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.parquet.hadoop.metadata;
+
+import org.apache.parquet.schema.MessageType;
+
+import java.io.Serializable;
+import java.util.HashMap;
+import java.util.Map;
+import java.util.Map.Entry;
+import java.util.Objects;
+import java.util.Set;
+
+import static java.util.Collections.unmodifiableMap;
+
+/**
+ * Strategy to merge metadata. Possible implementations:
+ * 1) Expect metadata to be exactly identical. Otherwise throw error (this is 
the default strategy implementation)
+ * 2) Custom strategy to merge metadata if the values are not identical. 
Example: concatenate values.
+ */
+public interface KeyValueMetadataMergeStrategy extends Serializable {
+  /**
+   * @param keyValueMetaData the merged app specific metadata
+   *
+   * @throws NullPointerException if keyValueMetaData is {@code null}

Review comment:
   The signature doesn't have NullPointerException. 

##
File path: 
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata/GlobalMetaData.java
##
@@ -87,19 +87,25 @@ public String toString() {
* Will merge the metadata as if it was coming from a single file.
* (for all part files written together this will always work)
* If there are conflicting values an exception will be thrown
+   * 
+   * Provided for backward compatibility
* @return the merged version of this
*/
-  public FileMetaData merge() {
+   public FileMetaData merge() {

Review comment:
   Remove one space in the beginning





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For 

[GitHub] [parquet-mr] shangxinli commented on a change in pull request #847: [PARQUET-1951] Allow different strategies to combine key values when …

2021-01-04 Thread GitBox


shangxinli commented on a change in pull request #847:
URL: https://github.com/apache/parquet-mr/pull/847#discussion_r551621357



##
File path: 
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata/KeyValueMetadataMergeStrategy.java
##
@@ -0,0 +1,44 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.parquet.hadoop.metadata;
+
+import org.apache.parquet.schema.MessageType;
+
+import java.io.Serializable;
+import java.util.HashMap;
+import java.util.Map;
+import java.util.Map.Entry;
+import java.util.Objects;
+import java.util.Set;
+
+import static java.util.Collections.unmodifiableMap;
+
+/**
+ * Strategy to merge metadata. Possible implementations:
+ * 1) Expect metadata to be exactly identical. Otherwise throw error (this is 
the default strategy implementation)
+ * 2) Custom strategy to merge metadata if the values are not identical. 
Example: concatenate values.
+ */
+public interface KeyValueMetadataMergeStrategy extends Serializable {
+  /**
+   * @param keyValueMetaData the merged app specific metadata
+   *
+   * @throws NullPointerException if keyValueMetaData is {@code null}
+   */
+  Map merge(Map> keyValueMetaData);

Review comment:
   People may already rely on the 'set' to do the dedup.

##
File path: 
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata/KeyValueMetadataMergeStrategy.java
##
@@ -0,0 +1,44 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.parquet.hadoop.metadata;
+
+import org.apache.parquet.schema.MessageType;
+
+import java.io.Serializable;
+import java.util.HashMap;
+import java.util.Map;
+import java.util.Map.Entry;
+import java.util.Objects;
+import java.util.Set;
+
+import static java.util.Collections.unmodifiableMap;
+
+/**
+ * Strategy to merge metadata. Possible implementations:
+ * 1) Expect metadata to be exactly identical. Otherwise throw error (this is 
the default strategy implementation)
+ * 2) Custom strategy to merge metadata if the values are not identical. 
Example: concatenate values.
+ */
+public interface KeyValueMetadataMergeStrategy extends Serializable {
+  /**
+   * @param keyValueMetaData the merged app specific metadata
+   *
+   * @throws NullPointerException if keyValueMetaData is {@code null}

Review comment:
   The signature doesn't have NullPointerException. 

##
File path: 
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata/GlobalMetaData.java
##
@@ -87,19 +87,25 @@ public String toString() {
* Will merge the metadata as if it was coming from a single file.
* (for all part files written together this will always work)
* If there are conflicting values an exception will be thrown
+   * 
+   * Provided for backward compatibility
* @return the merged version of this
*/
-  public FileMetaData merge() {
+   public FileMetaData merge() {

Review comment:
   Remove one space in the beginning





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (PARQUET-1951) Allow different strategies to combine key values when merging parquet files

2021-01-04 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17258562#comment-17258562
 ] 

ASF GitHub Bot commented on PARQUET-1951:
-

satishkotha edited a comment on pull request #847:
URL: https://github.com/apache/parquet-mr/pull/847#issuecomment-754266730


   > I have a couple of comments but the change itself looks good.
   > 
   > Meanwhile, I have some problems with the concept. `parquet-tools` is a 
simple command in most of the environments. It means that it is not clear how 
to extend it with your own implementation.
   > I would suggest adding this change to the API only (in `parquet-hadoop`) 
so you can extend the implementation from the code. If you want to extend 
`parquet-tools` as well I would suggest adding your implementation to parquet 
and let the user choose from the existing implementations (by its simple name 
like `strict` or others not the class name).
   
   @gszadovszky I added one more strategy for merging key values. I would 
prefer to keep this in parquet-tools to make this easy to run for one-off 
debugging use cases.  Let me know if you strong opinion against this.
   
   Also, the actual merge strategy I need is specific to my use-case. So I'd 
like to keep referencing the class name. Here is an example to help explain:
   
   This is how I run parquet-tools
   java -cp  
my-app-jar:parquet-tools/target/parquet-tools-1.12.0-SNAPSHOT.jar:hadoop-jars 
merge --s  f1.parquet f2.parquet merged.parquet
   
   Parquet files we write have additional metadata "range" and "count" (Count 
is not rowcount, its specific to my usecase tracking distinct values for a 
column).
   
   1) file f1.parquet metadata will have 
   
   - key: "range", value : "a-d"
   - key: "count", value: "12000"
   
   2) file f2.parquet metadata will have 
   - key: "range", value: "e-f"
   - key: "count", value: "1"
   
   When we merge, we want to resulting parquet file to have metadata 
   - key: "range", value: "a-f"
   - key: "count", value: "15000"
   
   So merge strategy is different for different keys we store in parquet 
footer. This is specific to how we use parquet and very likely not usable for 
other use cases, so I prefer not to expose this strategy in parquet library.
   
   Let me know if you have any suggestions.
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Allow different strategies to combine key values when merging parquet files
> ---
>
> Key: PARQUET-1951
> URL: https://issues.apache.org/jira/browse/PARQUET-1951
> Project: Parquet
>  Issue Type: Improvement
>Reporter: satish
>Priority: Minor
>
> I work on Apache Hudi project. We store some additional metadata in parquet 
> files (key range in the file, for example).  So the metadata is different in 
> different parquet files that we want to merge these files. 
> Here is what I'm thinking:
> 1) Merge command takes additional command line option: --strategy 
> . 
> 2) We introduce new strategy class in parquet-hadoop to keep the same 
> behavior as today.  
> We can extend that class and provide our custom implementation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1951) Allow different strategies to combine key values when merging parquet files

2021-01-04 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17258563#comment-17258563
 ] 

ASF GitHub Bot commented on PARQUET-1951:
-

satishkotha edited a comment on pull request #847:
URL: https://github.com/apache/parquet-mr/pull/847#issuecomment-754266730


   > I have a couple of comments but the change itself looks good.
   > 
   > Meanwhile, I have some problems with the concept. `parquet-tools` is a 
simple command in most of the environments. It means that it is not clear how 
to extend it with your own implementation.
   > I would suggest adding this change to the API only (in `parquet-hadoop`) 
so you can extend the implementation from the code. If you want to extend 
`parquet-tools` as well I would suggest adding your implementation to parquet 
and let the user choose from the existing implementations (by its simple name 
like `strict` or others not the class name).
   
   @gszadovszky I added one more strategy for merging key values. I would 
prefer to keep this in parquet-tools to make this easy to run for one-off 
debugging use cases.  Let me know if you strong opinion against this.
   
   Also, the actual merge strategy I need is specific to my use-case. So I'd 
like to keep referencing the class name. Here is an example to help explain:
   
   This is how I run parquet-tools
   java -cp  
my-app-jar:parquet-tools/target/parquet-tools-1.12.0-SNAPSHOT.jar:hadoop-jars 
merge --s myClass f1.parquet f2.parquet merged.parquet
   
   Parquet files we write have additional metadata "range" and "count" (Count 
is not rowcount, its specific to my usecase tracking distinct values for a 
column).
   
   1) file f1.parquet metadata will have 
   
   - key: "range", value : "a-d"
   - key: "count", value: "12000"
   
   2) file f2.parquet metadata will have 
   - key: "range", value: "e-f"
   - key: "count", value: "1"
   
   When we merge, we want to resulting parquet file to have metadata 
   - key: "range", value: "a-f"
   - key: "count", value: "15000"
   
   So merge strategy is different for different keys we store in parquet 
footer. This is specific to how we use parquet and very likely not usable for 
other use cases, so I prefer not to expose this strategy in parquet library.
   
   Let me know if you have any suggestions.
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Allow different strategies to combine key values when merging parquet files
> ---
>
> Key: PARQUET-1951
> URL: https://issues.apache.org/jira/browse/PARQUET-1951
> Project: Parquet
>  Issue Type: Improvement
>Reporter: satish
>Priority: Minor
>
> I work on Apache Hudi project. We store some additional metadata in parquet 
> files (key range in the file, for example).  So the metadata is different in 
> different parquet files that we want to merge these files. 
> Here is what I'm thinking:
> 1) Merge command takes additional command line option: --strategy 
> . 
> 2) We introduce new strategy class in parquet-hadoop to keep the same 
> behavior as today.  
> We can extend that class and provide our custom implementation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [parquet-mr] satishkotha edited a comment on pull request #847: [PARQUET-1951] Allow different strategies to combine key values when …

2021-01-04 Thread GitBox


satishkotha edited a comment on pull request #847:
URL: https://github.com/apache/parquet-mr/pull/847#issuecomment-754266730


   > I have a couple of comments but the change itself looks good.
   > 
   > Meanwhile, I have some problems with the concept. `parquet-tools` is a 
simple command in most of the environments. It means that it is not clear how 
to extend it with your own implementation.
   > I would suggest adding this change to the API only (in `parquet-hadoop`) 
so you can extend the implementation from the code. If you want to extend 
`parquet-tools` as well I would suggest adding your implementation to parquet 
and let the user choose from the existing implementations (by its simple name 
like `strict` or others not the class name).
   
   @gszadovszky I added one more strategy for merging key values. I would 
prefer to keep this in parquet-tools to make this easy to run for one-off 
debugging use cases.  Let me know if you strong opinion against this.
   
   Also, the actual merge strategy I need is specific to my use-case. So I'd 
like to keep referencing the class name. Here is an example to help explain:
   
   This is how I run parquet-tools
   java -cp  
my-app-jar:parquet-tools/target/parquet-tools-1.12.0-SNAPSHOT.jar:hadoop-jars 
merge --s myClass f1.parquet f2.parquet merged.parquet
   
   Parquet files we write have additional metadata "range" and "count" (Count 
is not rowcount, its specific to my usecase tracking distinct values for a 
column).
   
   1) file f1.parquet metadata will have 
   
   - key: "range", value : "a-d"
   - key: "count", value: "12000"
   
   2) file f2.parquet metadata will have 
   - key: "range", value: "e-f"
   - key: "count", value: "1"
   
   When we merge, we want to resulting parquet file to have metadata 
   - key: "range", value: "a-f"
   - key: "count", value: "15000"
   
   So merge strategy is different for different keys we store in parquet 
footer. This is specific to how we use parquet and very likely not usable for 
other use cases, so I prefer not to expose this strategy in parquet library.
   
   Let me know if you have any suggestions.
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [parquet-mr] satishkotha edited a comment on pull request #847: [PARQUET-1951] Allow different strategies to combine key values when …

2021-01-04 Thread GitBox


satishkotha edited a comment on pull request #847:
URL: https://github.com/apache/parquet-mr/pull/847#issuecomment-754266730


   > I have a couple of comments but the change itself looks good.
   > 
   > Meanwhile, I have some problems with the concept. `parquet-tools` is a 
simple command in most of the environments. It means that it is not clear how 
to extend it with your own implementation.
   > I would suggest adding this change to the API only (in `parquet-hadoop`) 
so you can extend the implementation from the code. If you want to extend 
`parquet-tools` as well I would suggest adding your implementation to parquet 
and let the user choose from the existing implementations (by its simple name 
like `strict` or others not the class name).
   
   @gszadovszky I added one more strategy for merging key values. I would 
prefer to keep this in parquet-tools to make this easy to run for one-off 
debugging use cases.  Let me know if you strong opinion against this.
   
   Also, the actual merge strategy I need is specific to my use-case. So I'd 
like to keep referencing the class name. Here is an example to help explain:
   
   This is how I run parquet-tools
   java -cp  
my-app-jar:parquet-tools/target/parquet-tools-1.12.0-SNAPSHOT.jar:hadoop-jars 
merge --s  f1.parquet f2.parquet merged.parquet
   
   Parquet files we write have additional metadata "range" and "count" (Count 
is not rowcount, its specific to my usecase tracking distinct values for a 
column).
   
   1) file f1.parquet metadata will have 
   
   - key: "range", value : "a-d"
   - key: "count", value: "12000"
   
   2) file f2.parquet metadata will have 
   - key: "range", value: "e-f"
   - key: "count", value: "1"
   
   When we merge, we want to resulting parquet file to have metadata 
   - key: "range", value: "a-f"
   - key: "count", value: "15000"
   
   So merge strategy is different for different keys we store in parquet 
footer. This is specific to how we use parquet and very likely not usable for 
other use cases, so I prefer not to expose this strategy in parquet library.
   
   Let me know if you have any suggestions.
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (PARQUET-1951) Allow different strategies to combine key values when merging parquet files

2021-01-04 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17258561#comment-17258561
 ] 

ASF GitHub Bot commented on PARQUET-1951:
-

satishkotha commented on pull request #847:
URL: https://github.com/apache/parquet-mr/pull/847#issuecomment-754266730


   > I have a couple of comments but the change itself looks good.
   > 
   > Meanwhile, I have some problems with the concept. `parquet-tools` is a 
simple command in most of the environments. It means that it is not clear how 
to extend it with your own implementation.
   > I would suggest adding this change to the API only (in `parquet-hadoop`) 
so you can extend the implementation from the code. If you want to extend 
`parquet-tools` as well I would suggest adding your implementation to parquet 
and let the user choose from the existing implementations (by its simple name 
like `strict` or others not the class name).
   
   @gszadovszky I added one more strategy for merging key values. I would 
prefer to keep this in parquet-tools to make this easy to run for one-off 
debugging use cases.  Let me know if you strong opinion against this.
   
   Also, the actual merge strategy I need is specific to my use-case. So I'd 
like to keep referencing the class name. Here is an example to help explain:
   
   This is how I run parquet-tools
   java -cp  
my-app-jar:parquet-tools/target/parquet-tools-1.12.0-SNAPSHOT.jar: 
merge --s  f1.parquet f2.parquet merged.parquet
   
   Parquet files we write have additional metadata "range" and "count" (Count 
is not rowcount, its specific to my usecase tracking distinct values for a 
column).
   
   1) file f1.parquet metadata will have 
   
   - key: "range", value : "a-d"
   - key: "count", value: "12000"
   
   2) file f2.parquet metadata will have 
   - key: "range", value: "e-f"
   - key: "count", value: "1"
   
   When we merge, we want to resulting parquet file to have metadata 
   - key: "range", value: "a-f"
   - key: "count", value: "15000"
   
   So merge strategy is different for different keys we store in parquet 
footer. This is specific to how we use parquet and very likely not usable for 
other use cases, so I prefer not to expose this strategy in parquet library.
   
   Let me know if you have any suggestions.
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Allow different strategies to combine key values when merging parquet files
> ---
>
> Key: PARQUET-1951
> URL: https://issues.apache.org/jira/browse/PARQUET-1951
> Project: Parquet
>  Issue Type: Improvement
>Reporter: satish
>Priority: Minor
>
> I work on Apache Hudi project. We store some additional metadata in parquet 
> files (key range in the file, for example).  So the metadata is different in 
> different parquet files that we want to merge these files. 
> Here is what I'm thinking:
> 1) Merge command takes additional command line option: --strategy 
> . 
> 2) We introduce new strategy class in parquet-hadoop to keep the same 
> behavior as today.  
> We can extend that class and provide our custom implementation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [parquet-mr] satishkotha commented on pull request #847: [PARQUET-1951] Allow different strategies to combine key values when …

2021-01-04 Thread GitBox


satishkotha commented on pull request #847:
URL: https://github.com/apache/parquet-mr/pull/847#issuecomment-754266730


   > I have a couple of comments but the change itself looks good.
   > 
   > Meanwhile, I have some problems with the concept. `parquet-tools` is a 
simple command in most of the environments. It means that it is not clear how 
to extend it with your own implementation.
   > I would suggest adding this change to the API only (in `parquet-hadoop`) 
so you can extend the implementation from the code. If you want to extend 
`parquet-tools` as well I would suggest adding your implementation to parquet 
and let the user choose from the existing implementations (by its simple name 
like `strict` or others not the class name).
   
   @gszadovszky I added one more strategy for merging key values. I would 
prefer to keep this in parquet-tools to make this easy to run for one-off 
debugging use cases.  Let me know if you strong opinion against this.
   
   Also, the actual merge strategy I need is specific to my use-case. So I'd 
like to keep referencing the class name. Here is an example to help explain:
   
   This is how I run parquet-tools
   java -cp  
my-app-jar:parquet-tools/target/parquet-tools-1.12.0-SNAPSHOT.jar: 
merge --s  f1.parquet f2.parquet merged.parquet
   
   Parquet files we write have additional metadata "range" and "count" (Count 
is not rowcount, its specific to my usecase tracking distinct values for a 
column).
   
   1) file f1.parquet metadata will have 
   
   - key: "range", value : "a-d"
   - key: "count", value: "12000"
   
   2) file f2.parquet metadata will have 
   - key: "range", value: "e-f"
   - key: "count", value: "1"
   
   When we merge, we want to resulting parquet file to have metadata 
   - key: "range", value: "a-f"
   - key: "count", value: "15000"
   
   So merge strategy is different for different keys we store in parquet 
footer. This is specific to how we use parquet and very likely not usable for 
other use cases, so I prefer not to expose this strategy in parquet library.
   
   Let me know if you have any suggestions.
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (PARQUET-1951) Allow different strategies to combine key values when merging parquet files

2021-01-04 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17258542#comment-17258542
 ] 

ASF GitHub Bot commented on PARQUET-1951:
-

satishkotha commented on a change in pull request #847:
URL: https://github.com/apache/parquet-mr/pull/847#discussion_r551598592



##
File path: 
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata/KeyValueMetadataMergeStrategy.java
##
@@ -0,0 +1,44 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.parquet.hadoop.metadata;
+
+import org.apache.parquet.schema.MessageType;
+
+import java.io.Serializable;
+import java.util.HashMap;
+import java.util.Map;
+import java.util.Map.Entry;
+import java.util.Objects;
+import java.util.Set;
+
+import static java.util.Collections.unmodifiableMap;
+
+/**
+ * Strategy to merge metadata. Possible implementations:
+ * 1) Expect metadata to be exactly identical. Otherwise throw error (this is 
the default strategy implementation)
+ * 2) Custom strategy to merge metadata if the values are not identical. 
Example: concatenate values.
+ */
+public interface KeyValueMetadataMergeStrategy extends Serializable {
+  /**
+   * @param keyValueMetaData the merged app specific metadata
+   *
+   * @throws NullPointerException if keyValueMetaData is {@code null}
+   */
+  Map merge(Map> keyValueMetaData);

Review comment:
   GlobalMetadata uses Map> 
https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata/GlobalMetaData.java#L41.
 So merge method takes in same argument.
   
   Do you think it makes sense to change GlobalMetadata to track items in 
List? I'm not sure if it would be backward compatible and if there are 
use cases expecting this to be Set.  
   
   Without changing GlobalMetadata#keyValueMetaData, it may not be useful to 
change this interface. Let me know what you think.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Allow different strategies to combine key values when merging parquet files
> ---
>
> Key: PARQUET-1951
> URL: https://issues.apache.org/jira/browse/PARQUET-1951
> Project: Parquet
>  Issue Type: Improvement
>Reporter: satish
>Priority: Minor
>
> I work on Apache Hudi project. We store some additional metadata in parquet 
> files (key range in the file, for example).  So the metadata is different in 
> different parquet files that we want to merge these files. 
> Here is what I'm thinking:
> 1) Merge command takes additional command line option: --strategy 
> . 
> 2) We introduce new strategy class in parquet-hadoop to keep the same 
> behavior as today.  
> We can extend that class and provide our custom implementation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1951) Allow different strategies to combine key values when merging parquet files

2021-01-04 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17258540#comment-17258540
 ] 

ASF GitHub Bot commented on PARQUET-1951:
-

satishkotha commented on a change in pull request #847:
URL: https://github.com/apache/parquet-mr/pull/847#discussion_r551598725



##
File path: 
parquet-tools/src/main/java/org/apache/parquet/tools/command/MergeCommand.java
##
@@ -41,6 +46,21 @@
   "is the destination parquet file"
   };
 
+  private static final String DEFAULT_KEY_VALUE_MERGE_STRATEGY = new 
DefaultKeyValueMetadataMergeStrategy().getClass().getName();

Review comment:
   My bad. Fixed. Thanks





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Allow different strategies to combine key values when merging parquet files
> ---
>
> Key: PARQUET-1951
> URL: https://issues.apache.org/jira/browse/PARQUET-1951
> Project: Parquet
>  Issue Type: Improvement
>Reporter: satish
>Priority: Minor
>
> I work on Apache Hudi project. We store some additional metadata in parquet 
> files (key range in the file, for example).  So the metadata is different in 
> different parquet files that we want to merge these files. 
> Here is what I'm thinking:
> 1) Merge command takes additional command line option: --strategy 
> . 
> 2) We introduce new strategy class in parquet-hadoop to keep the same 
> behavior as today.  
> We can extend that class and provide our custom implementation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [parquet-mr] satishkotha commented on a change in pull request #847: [PARQUET-1951] Allow different strategies to combine key values when …

2021-01-04 Thread GitBox


satishkotha commented on a change in pull request #847:
URL: https://github.com/apache/parquet-mr/pull/847#discussion_r551598592



##
File path: 
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata/KeyValueMetadataMergeStrategy.java
##
@@ -0,0 +1,44 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.parquet.hadoop.metadata;
+
+import org.apache.parquet.schema.MessageType;
+
+import java.io.Serializable;
+import java.util.HashMap;
+import java.util.Map;
+import java.util.Map.Entry;
+import java.util.Objects;
+import java.util.Set;
+
+import static java.util.Collections.unmodifiableMap;
+
+/**
+ * Strategy to merge metadata. Possible implementations:
+ * 1) Expect metadata to be exactly identical. Otherwise throw error (this is 
the default strategy implementation)
+ * 2) Custom strategy to merge metadata if the values are not identical. 
Example: concatenate values.
+ */
+public interface KeyValueMetadataMergeStrategy extends Serializable {
+  /**
+   * @param keyValueMetaData the merged app specific metadata
+   *
+   * @throws NullPointerException if keyValueMetaData is {@code null}
+   */
+  Map merge(Map> keyValueMetaData);

Review comment:
   GlobalMetadata uses Map> 
https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata/GlobalMetaData.java#L41.
 So merge method takes in same argument.
   
   Do you think it makes sense to change GlobalMetadata to track items in 
List? I'm not sure if it would be backward compatible and if there are 
use cases expecting this to be Set.  
   
   Without changing GlobalMetadata#keyValueMetaData, it may not be useful to 
change this interface. Let me know what you think.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [parquet-mr] satishkotha commented on a change in pull request #847: [PARQUET-1951] Allow different strategies to combine key values when …

2021-01-04 Thread GitBox


satishkotha commented on a change in pull request #847:
URL: https://github.com/apache/parquet-mr/pull/847#discussion_r551598725



##
File path: 
parquet-tools/src/main/java/org/apache/parquet/tools/command/MergeCommand.java
##
@@ -41,6 +46,21 @@
   "is the destination parquet file"
   };
 
+  private static final String DEFAULT_KEY_VALUE_MERGE_STRATEGY = new 
DefaultKeyValueMetadataMergeStrategy().getClass().getName();

Review comment:
   My bad. Fixed. Thanks





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (PARQUET-1951) Allow different strategies to combine key values when merging parquet files

2021-01-04 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17258539#comment-17258539
 ] 

ASF GitHub Bot commented on PARQUET-1951:
-

satishkotha commented on a change in pull request #847:
URL: https://github.com/apache/parquet-mr/pull/847#discussion_r551598592



##
File path: 
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata/KeyValueMetadataMergeStrategy.java
##
@@ -0,0 +1,44 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.parquet.hadoop.metadata;
+
+import org.apache.parquet.schema.MessageType;
+
+import java.io.Serializable;
+import java.util.HashMap;
+import java.util.Map;
+import java.util.Map.Entry;
+import java.util.Objects;
+import java.util.Set;
+
+import static java.util.Collections.unmodifiableMap;
+
+/**
+ * Strategy to merge metadata. Possible implementations:
+ * 1) Expect metadata to be exactly identical. Otherwise throw error (this is 
the default strategy implementation)
+ * 2) Custom strategy to merge metadata if the values are not identical. 
Example: concatenate values.
+ */
+public interface KeyValueMetadataMergeStrategy extends Serializable {
+  /**
+   * @param keyValueMetaData the merged app specific metadata
+   *
+   * @throws NullPointerException if keyValueMetaData is {@code null}
+   */
+  Map merge(Map> keyValueMetaData);

Review comment:
   GlobalMetadata uses Map> 
https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata/GlobalMetaData.java#L41.
 So merge method takes in same argument.
   
   Do you think it makes sense to change GlobalMetadata to track items in 
List? I'm not sure if it would be backward compatible and if there are 
use cases expecting this to be Set.  
   
   Without changing GlobalMetadata#keyValueMetaData, it may not be useful to 
change this interface. Let me know what you think.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Allow different strategies to combine key values when merging parquet files
> ---
>
> Key: PARQUET-1951
> URL: https://issues.apache.org/jira/browse/PARQUET-1951
> Project: Parquet
>  Issue Type: Improvement
>Reporter: satish
>Priority: Minor
>
> I work on Apache Hudi project. We store some additional metadata in parquet 
> files (key range in the file, for example).  So the metadata is different in 
> different parquet files that we want to merge these files. 
> Here is what I'm thinking:
> 1) Merge command takes additional command line option: --strategy 
> . 
> 2) We introduce new strategy class in parquet-hadoop to keep the same 
> behavior as today.  
> We can extend that class and provide our custom implementation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [parquet-mr] satishkotha commented on a change in pull request #847: [PARQUET-1951] Allow different strategies to combine key values when …

2021-01-04 Thread GitBox


satishkotha commented on a change in pull request #847:
URL: https://github.com/apache/parquet-mr/pull/847#discussion_r551598592



##
File path: 
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata/KeyValueMetadataMergeStrategy.java
##
@@ -0,0 +1,44 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.parquet.hadoop.metadata;
+
+import org.apache.parquet.schema.MessageType;
+
+import java.io.Serializable;
+import java.util.HashMap;
+import java.util.Map;
+import java.util.Map.Entry;
+import java.util.Objects;
+import java.util.Set;
+
+import static java.util.Collections.unmodifiableMap;
+
+/**
+ * Strategy to merge metadata. Possible implementations:
+ * 1) Expect metadata to be exactly identical. Otherwise throw error (this is 
the default strategy implementation)
+ * 2) Custom strategy to merge metadata if the values are not identical. 
Example: concatenate values.
+ */
+public interface KeyValueMetadataMergeStrategy extends Serializable {
+  /**
+   * @param keyValueMetaData the merged app specific metadata
+   *
+   * @throws NullPointerException if keyValueMetaData is {@code null}
+   */
+  Map merge(Map> keyValueMetaData);

Review comment:
   GlobalMetadata uses Map> 
https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata/GlobalMetaData.java#L41.
 So merge method takes in same argument.
   
   Do you think it makes sense to change GlobalMetadata to track items in 
List? I'm not sure if it would be backward compatible and if there are 
use cases expecting this to be Set.  
   
   Without changing GlobalMetadata#keyValueMetaData, it may not be useful to 
change this interface. Let me know what you think.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (PARQUET-1951) Allow different strategies to combine key values when merging parquet files

2021-01-04 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17258533#comment-17258533
 ] 

ASF GitHub Bot commented on PARQUET-1951:
-

satishkotha commented on a change in pull request #847:
URL: https://github.com/apache/parquet-mr/pull/847#discussion_r551595794



##
File path: 
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata/DefaultKeyValueMetadataMergeStrategy.java
##
@@ -0,0 +1,44 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.parquet.hadoop.metadata;
+
+import java.util.HashMap;
+import java.util.Map;
+import java.util.Set;
+
+/**
+ *  Default strategy to throw errors in metadata if there are multiple values 
for a given key in metadata.
+ */
+public class DefaultKeyValueMetadataMergeStrategy implements 
KeyValueMetadataMergeStrategy {

Review comment:
   Good suggestion. Fixed it. Thanks!





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Allow different strategies to combine key values when merging parquet files
> ---
>
> Key: PARQUET-1951
> URL: https://issues.apache.org/jira/browse/PARQUET-1951
> Project: Parquet
>  Issue Type: Improvement
>Reporter: satish
>Priority: Minor
>
> I work on Apache Hudi project. We store some additional metadata in parquet 
> files (key range in the file, for example).  So the metadata is different in 
> different parquet files that we want to merge these files. 
> Here is what I'm thinking:
> 1) Merge command takes additional command line option: --strategy 
> . 
> 2) We introduce new strategy class in parquet-hadoop to keep the same 
> behavior as today.  
> We can extend that class and provide our custom implementation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [parquet-mr] satishkotha commented on a change in pull request #847: [PARQUET-1951] Allow different strategies to combine key values when …

2021-01-04 Thread GitBox


satishkotha commented on a change in pull request #847:
URL: https://github.com/apache/parquet-mr/pull/847#discussion_r551595794



##
File path: 
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata/DefaultKeyValueMetadataMergeStrategy.java
##
@@ -0,0 +1,44 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.parquet.hadoop.metadata;
+
+import java.util.HashMap;
+import java.util.Map;
+import java.util.Set;
+
+/**
+ *  Default strategy to throw errors in metadata if there are multiple values 
for a given key in metadata.
+ */
+public class DefaultKeyValueMetadataMergeStrategy implements 
KeyValueMetadataMergeStrategy {

Review comment:
   Good suggestion. Fixed it. Thanks!





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




Re: Query on striping parquet files to maintain Row group alignment

2021-01-04 Thread Gabor Szadovszky
Hi Jayjeet,

I assume you are using parquet-mr (and not other parquet implementations
like parquet-cpp, Impala etc.).

I am not sure if I got your request correctly. You may configure the size
of the row group by setting the config parquet.block.size. You may also
check parquet.writer.max-padding so the row groups will fit exactly into
the blocks. See details about the available configs at
https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/README.md.
Currently, parquet-mr does not have the functionality to automatically
close a parquet file and start a new one during writing.

Regards,
Gabor

On Thu, Dec 31, 2020 at 5:58 AM Jayjeet Chakraborty <
jayjeetchakrabort...@gmail.com> wrote:

> Hi all,
>
> I am trying to figure out if a large Parquet file can be striped across
> multiple small files based on a Row group chunk size where each stripe
> would naturally end up containing data pages from a single row group. So,
> if I say my writer "write a parquet file in chunks of 128 MB (assuming my
> row groups are of around 128MB), each of my chunks ends up being
> self-contained row group, maybe except the last chunk which has the footer
> contents. Is this possible? Can we fix the row group size (the amount of
> disk space a row group uses) while writing parquet files ? Thanks a lot.
>


[jira] [Commented] (PARQUET-1827) UUID type currently not supported by parquet-mr

2021-01-04 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17258287#comment-17258287
 ] 

Gabor Szadovszky commented on PARQUET-1827:
---

[~vitalii], there are some open points for 1.12.0. I think we will know more 
about a release date after the parquet sync next week.

> UUID type currently not supported by parquet-mr
> ---
>
> Key: PARQUET-1827
> URL: https://issues.apache.org/jira/browse/PARQUET-1827
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.11.0
>Reporter: Brad Smith
>Assignee: Gabor Szadovszky
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.12.0
>
>
> The parquet-format project introduced a new UUID logical type in version 2.4:
> [https://github.com/apache/parquet-format/blob/master/CHANGES.md]
> This would be a useful type to have available in some circumstances, but it 
> currently isn't supported in the parquet-mr library. Hopefully this feature 
> can be implemented at some point.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (PARQUET-1827) UUID type currently not supported by parquet-mr

2021-01-04 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky updated PARQUET-1827:
--
Fix Version/s: 1.12.0

> UUID type currently not supported by parquet-mr
> ---
>
> Key: PARQUET-1827
> URL: https://issues.apache.org/jira/browse/PARQUET-1827
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.11.0
>Reporter: Brad Smith
>Assignee: Gabor Szadovszky
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.12.0
>
>
> The parquet-format project introduced a new UUID logical type in version 2.4:
> [https://github.com/apache/parquet-format/blob/master/CHANGES.md]
> This would be a useful type to have available in some circumstances, but it 
> currently isn't supported in the parquet-mr library. Hopefully this feature 
> can be implemented at some point.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1951) Allow different strategies to combine key values when merging parquet files

2021-01-04 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17258139#comment-17258139
 ] 

ASF GitHub Bot commented on PARQUET-1951:
-

gszadovszky commented on a change in pull request #847:
URL: https://github.com/apache/parquet-mr/pull/847#discussion_r551240027



##
File path: 
parquet-tools/src/main/java/org/apache/parquet/tools/command/MergeCommand.java
##
@@ -41,6 +46,21 @@
   "is the destination parquet file"
   };
 
+  private static final String DEFAULT_KEY_VALUE_MERGE_STRATEGY = new 
DefaultKeyValueMetadataMergeStrategy().getClass().getName();

Review comment:
   You do not need to instantiate an object to get its class. Simply use 
`DefaultKeyValueMetadataMergeStrategy.class`

##
File path: 
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata/KeyValueMetadataMergeStrategy.java
##
@@ -0,0 +1,44 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.parquet.hadoop.metadata;
+
+import org.apache.parquet.schema.MessageType;
+
+import java.io.Serializable;
+import java.util.HashMap;
+import java.util.Map;
+import java.util.Map.Entry;
+import java.util.Objects;
+import java.util.Set;
+
+import static java.util.Collections.unmodifiableMap;
+
+/**
+ * Strategy to merge metadata. Possible implementations:
+ * 1) Expect metadata to be exactly identical. Otherwise throw error (this is 
the default strategy implementation)
+ * 2) Custom strategy to merge metadata if the values are not identical. 
Example: concatenate values.
+ */
+public interface KeyValueMetadataMergeStrategy extends Serializable {
+  /**
+   * @param keyValueMetaData the merged app specific metadata
+   *
+   * @throws NullPointerException if keyValueMetaData is {@code null}
+   */
+  Map merge(Map> keyValueMetaData);

Review comment:
   I think, this interface does not support all types of the potential 
merge strategy implementations. For example if you would like to use the value 
which is common in the most files you cannot do it because you only have the 
distinct values. What do you think about using a `List` instead of a 
`Set`?

##
File path: 
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata/DefaultKeyValueMetadataMergeStrategy.java
##
@@ -0,0 +1,44 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.parquet.hadoop.metadata;
+
+import java.util.HashMap;
+import java.util.Map;
+import java.util.Set;
+
+/**
+ *  Default strategy to throw errors in metadata if there are multiple values 
for a given key in metadata.
+ */
+public class DefaultKeyValueMetadataMergeStrategy implements 
KeyValueMetadataMergeStrategy {

Review comment:
   I do not really like the naming of this class. This is the default 
implementation, it's fine but the name should suggest what it does. I'm quite 
sure there is a better naming but what about 
`StrictKeyValueMetadataMergeStrategy`?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Allow different strategies to combine key values when merging parquet files
> 

[GitHub] [parquet-mr] gszadovszky commented on a change in pull request #847: [PARQUET-1951] Allow different strategies to combine key values when …

2021-01-04 Thread GitBox


gszadovszky commented on a change in pull request #847:
URL: https://github.com/apache/parquet-mr/pull/847#discussion_r551240027



##
File path: 
parquet-tools/src/main/java/org/apache/parquet/tools/command/MergeCommand.java
##
@@ -41,6 +46,21 @@
   "is the destination parquet file"
   };
 
+  private static final String DEFAULT_KEY_VALUE_MERGE_STRATEGY = new 
DefaultKeyValueMetadataMergeStrategy().getClass().getName();

Review comment:
   You do not need to instantiate an object to get its class. Simply use 
`DefaultKeyValueMetadataMergeStrategy.class`

##
File path: 
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata/KeyValueMetadataMergeStrategy.java
##
@@ -0,0 +1,44 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.parquet.hadoop.metadata;
+
+import org.apache.parquet.schema.MessageType;
+
+import java.io.Serializable;
+import java.util.HashMap;
+import java.util.Map;
+import java.util.Map.Entry;
+import java.util.Objects;
+import java.util.Set;
+
+import static java.util.Collections.unmodifiableMap;
+
+/**
+ * Strategy to merge metadata. Possible implementations:
+ * 1) Expect metadata to be exactly identical. Otherwise throw error (this is 
the default strategy implementation)
+ * 2) Custom strategy to merge metadata if the values are not identical. 
Example: concatenate values.
+ */
+public interface KeyValueMetadataMergeStrategy extends Serializable {
+  /**
+   * @param keyValueMetaData the merged app specific metadata
+   *
+   * @throws NullPointerException if keyValueMetaData is {@code null}
+   */
+  Map merge(Map> keyValueMetaData);

Review comment:
   I think, this interface does not support all types of the potential 
merge strategy implementations. For example if you would like to use the value 
which is common in the most files you cannot do it because you only have the 
distinct values. What do you think about using a `List` instead of a 
`Set`?

##
File path: 
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata/DefaultKeyValueMetadataMergeStrategy.java
##
@@ -0,0 +1,44 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.parquet.hadoop.metadata;
+
+import java.util.HashMap;
+import java.util.Map;
+import java.util.Set;
+
+/**
+ *  Default strategy to throw errors in metadata if there are multiple values 
for a given key in metadata.
+ */
+public class DefaultKeyValueMetadataMergeStrategy implements 
KeyValueMetadataMergeStrategy {

Review comment:
   I do not really like the naming of this class. This is the default 
implementation, it's fine but the name should suggest what it does. I'm quite 
sure there is a better naming but what about 
`StrictKeyValueMetadataMergeStrategy`?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org