[jira] Commented: (HIVE-537) Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)
[ https://issues.apache.org/jira/browse/HIVE-537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12918795#action_12918795 ]

Amareshwari Sriramadasu commented on HIVE-537:
----------------------------------------------

bq. Constants.java is a generated file ? Can you change serde/if/serde.thrift

After adding the constant to serde/if/serde.thrift, do I need to regenerate the Java file? If yes, how should I do it?

Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)
-------------------------------------------------------------------------------

Key: HIVE-537
URL: https://issues.apache.org/jira/browse/HIVE-537
Project: Hadoop Hive
Issue Type: New Feature
Reporter: Zheng Shao
Assignee: Amareshwari Sriramadasu
Fix For: 0.7.0
Attachments: HIVE-537.1.patch, patch-537-1.txt, patch-537-2.txt, patch-537-3.txt, patch-537-4.txt, patch-537.txt

There are already some cases inside the code where we use heterogeneous data: JoinOperator and UnionOperator (in the sense that different parents can pass in records with different ObjectInspectors). We currently use the Operator's parentID to distinguish them. However, that approach does not extend to more complex plans that might be needed in the future.

We will support the union type like this:

{code}
TypeDefinition:
  type: primitivetype | structtype | arraytype | maptype | uniontype
  uniontype: union < tag : type (, tag : type)* >

Example:
  union<0:int,1:double,2:array<string>,3:struct<a:int,b:string>>

Example of serialized data format:
  We will first store the tag byte before we serialize the object. On deserialization, we will first read out the tag byte, then we know what is the current type of the following object, so we can deserialize it successfully.

Interface for ObjectInspector:
interface UnionObjectInspector {
  /** Returns the array of OIs that are for each of the tags */
  ObjectInspector[] getObjectInspectors();
  /** Return the tag of the object. */
  byte getTag(Object o);
  /** Return the field based on the tag value associated with the Object. */
  Object getField(Object o);
};

An example serialization format (using delimited format, with ' ' as first-level delimiter and '=' as second-level delimiter)
userid:int,log:union<0:struct<touserid:int,message:string>,1:string>
123 1=login
123 0=243=helloworld
123 1=logout
{code}

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
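The tag-first serialization described in the issue can be illustrated with a small sketch. This is not Hive's LazySimpleSerDe code; the class `TaggedUnionSketch` and its methods are hypothetical names made up for illustration, and the sketch assumes the tag is written as decimal text followed by the '=' delimiter, as in the example rows above. The field after the tag is kept as raw text; a real SerDe would hand it to the deserializer selected by the tag.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration of tag-first union encoding; not Hive code. */
final class TaggedUnionSketch {

    /** A decoded union value: the tag plus the raw text of the field that the tag selects. */
    static final class TaggedValue {
        final byte tag;
        final String rawField;

        TaggedValue(byte tag, String rawField) {
            this.tag = tag;
            this.rawField = rawField;
        }
    }

    /** Writes the tag first, then the delimiter, then the already-serialized field. */
    static String serialize(byte tag, String rawField, char delimiter) {
        return Byte.toString(tag) + delimiter + rawField;
    }

    /** Reads the tag up to the first delimiter; everything after it belongs to the field. */
    static TaggedValue deserialize(String text, char delimiter) {
        int split = text.indexOf(delimiter);
        if (split < 0) {
            throw new IllegalArgumentException("missing tag delimiter: " + text);
        }
        byte tag = Byte.parseByte(text.substring(0, split));
        return new TaggedValue(tag, text.substring(split + 1));
    }

    public static void main(String[] args) {
        // Rows from the issue's example: "userid log", where log is the union column.
        List<String> rows = Arrays.asList("123 1=login", "123 0=243=helloworld", "123 1=logout");
        for (String row : rows) {
            String[] columns = row.split(" ", 2); // ' ' is the first-level delimiter
            TaggedValue log = deserialize(columns[1], '=');
            System.out.println("userid=" + columns[0] + " tag=" + log.tag + " field=" + log.rawField);
        }
    }
}
```

Splitting only on the first '=' leaves the struct text `243=helloworld` intact for tag 0, so the nested fields can be parsed in a later step once the tag has identified the type.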
[jira] Commented: (HIVE-537) Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)
[ https://issues.apache.org/jira/browse/HIVE-537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12918740#action_12918740 ]

Namit Jain commented on HIVE-537:
---------------------------------

Otherwise it looks good to me

Key: HIVE-537
Assignee: Amareshwari Sriramadasu
Fix For: 0.7.0
Attachments: HIVE-537.1.patch, patch-537-1.txt, patch-537-2.txt, patch-537-3.txt, patch-537-4.txt, patch-537.txt
[jira] Commented: (HIVE-537) Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)
[ https://issues.apache.org/jira/browse/HIVE-537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12913475#action_12913475 ]

Amareshwari Sriramadasu commented on HIVE-537:
----------------------------------------------

Zheng, can you give an example usage of the union type as a UDF? I looked at the Struct, Map and Array UDFs, but Union is quite different from them because it holds only one object at any point in time.

Key: HIVE-537
Assignee: Amareshwari Sriramadasu
Attachments: HIVE-537.1.patch, patch-537-1.txt, patch-537.txt
[jira] Commented: (HIVE-537) Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)
[ https://issues.apache.org/jira/browse/HIVE-537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12913670#action_12913670 ]

Zheng Shao commented on HIVE-537:
---------------------------------

{code}
union<T0,T1,T2> create_union(byte tag, T0 o0, T1 o1, T2 o2, ...)

Some real examples:
union<School,Company> create_union( is_student ? 0 : 1, school, company)
{code}

Depending on the value of the tag, the returned union object will choose to store only the object corresponding to that tag.

Key: HIVE-537
Assignee: Amareshwari Sriramadasu
Attachments: HIVE-537.1.patch, patch-537-1.txt, patch-537.txt
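The create_union behaviour described in this comment can be sketched in plain Java. This is only an illustration under stated assumptions: `CreateUnionSketch`, `UnionValue` and `createUnion` are hypothetical names, not the eventual Hive UDF, and the candidates are passed as untyped objects rather than through ObjectInspectors.

```java
/** Hypothetical illustration of create_union semantics; not the Hive UDF. */
final class CreateUnionSketch {

    /** A union value that stores only the object selected by its tag. */
    static final class UnionValue {
        final byte tag;
        final Object field;

        UnionValue(byte tag, Object field) {
            this.tag = tag;
            this.field = field;
        }
    }

    /** Picks candidates[tag] and discards the other candidates. */
    static UnionValue createUnion(byte tag, Object... candidates) {
        if (tag < 0 || tag >= candidates.length) {
            throw new IllegalArgumentException("tag " + tag + " has no matching candidate");
        }
        return new UnionValue(tag, candidates[tag]);
    }

    public static void main(String[] args) {
        boolean isStudent = true; // assumed input, mirroring is_student in the comment
        UnionValue u = createUnion((byte) (isStudent ? 0 : 1), "some-school", "some-company");
        System.out.println("tag=" + u.tag + " field=" + u.field);
    }
}
```

The range check on the tag is an addition of this sketch; the comment does not say how an out-of-range tag should be handled.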
[jira] Commented: (HIVE-537) Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)
[ https://issues.apache.org/jira/browse/HIVE-537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12912420#action_12912420 ]

HBase Review Board commented on HIVE-537:
-----------------------------------------

Message from: Amareshwari Sriramadasu amar...@yahoo-inc.com

bq. On 2010-09-15 15:15:08, Zheng Shao wrote:
bq. Overall looks like a good first step. We need to change Hive.g, add UDF etc to allow users to use it in the Hive language.

Zheng, there is already a keyword (KW_UNION: 'UNION') used for union/union all operations. Do you think we should use a different keyword for specifying the union type?

- Amareshwari

---
This is an automatically generated e-mail. To reply, visit: http://review.cloudera.org/r/795/#review1231
---

Key: HIVE-537
Assignee: Amareshwari Sriramadasu
Attachments: HIVE-537.1.patch, patch-537-1.txt, patch-537.txt
[jira] Commented: (HIVE-537) Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)
[ https://issues.apache.org/jira/browse/HIVE-537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12912614#action_12912614 ]

Zheng Shao commented on HIVE-537:
---------------------------------

I think so. Let's use a different name for the UDF. Using 'UNION' as UDF name will not cause grammar ambiguity, but it may cause other issues in the future.

Zheng

Key: HIVE-537
Assignee: Amareshwari Sriramadasu
Attachments: HIVE-537.1.patch, patch-537-1.txt, patch-537.txt
[jira] Commented: (HIVE-537) Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)
[ https://issues.apache.org/jira/browse/HIVE-537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12909919#action_12909919 ]

HBase Review Board commented on HIVE-537:
-----------------------------------------

Message from: Zheng Shao zsh...@gmail.com

---
This is an automatically generated e-mail. To reply, visit: http://review.cloudera.org/r/795/#review1231
---

Overall looks like a good first step. We need to change Hive.g, add UDF etc to allow users to use it in the Hive language.

trunk/serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/ObjectInspectorFactory.java
http://review.cloudera.org/r/795/#comment4192

unioin -> union

trunk/serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/ObjectInspectorUtils.java
http://review.cloudera.org/r/795/#comment4193

We cannot compare 2 union objects like this. We need to first compare their TAG. Only when the TAG is the same shall we compare the field.

- Zheng

Key: HIVE-537
Assignee: Amareshwari Sriramadasu
Attachments: HIVE-537.1.patch, patch-537-1.txt, patch-537.txt
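The comparison rule in the second review comment (compare the tags first, and compare the fields only when the tags match) can be sketched as follows. This is not the code in ObjectInspectorUtils; `UnionCompareSketch`, `UnionValue` and `exampleComparators` are hypothetical names, and the sketch assumes one comparator per tag is available, where Hive would use the per-tag ObjectInspectors instead.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical illustration of tag-then-field comparison; not Hive code. */
final class UnionCompareSketch {

    /** A union value carrying its tag and the single field that the tag selects. */
    static final class UnionValue {
        final byte tag;
        final Object field;

        UnionValue(byte tag, Object field) {
            this.tag = tag;
            this.field = field;
        }
    }

    /** Orders by tag first; fields are compared only when both tags are equal. */
    static int compare(UnionValue a, UnionValue b, List<Comparator<Object>> fieldComparators) {
        int byTag = Byte.compare(a.tag, b.tag);
        if (byTag != 0) {
            return byTag;
        }
        return fieldComparators.get(a.tag).compare(a.field, b.field);
    }

    /** Comparators for an assumed union<0:int,1:string>. */
    static List<Comparator<Object>> exampleComparators() {
        List<Comparator<Object>> comparators = new ArrayList<>();
        comparators.add((x, y) -> Integer.compare((Integer) x, (Integer) y));
        comparators.add((x, y) -> ((String) x).compareTo((String) y));
        return comparators;
    }

    public static void main(String[] args) {
        UnionValue five = new UnionValue((byte) 0, 5);
        UnionValue text = new UnionValue((byte) 1, "login");
        // The tags differ, so the fields (an Integer and a String) are never compared.
        System.out.println(compare(five, text, exampleComparators()));
    }
}
```

Comparing tags first also avoids casting fields of different types against each other, which is presumably what makes a direct field comparison unsafe.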
[jira] Commented: (HIVE-537) Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)
[ https://issues.apache.org/jira/browse/HIVE-537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12906759#action_12906759 ]

HBase Review Board commented on HIVE-537:
-----------------------------------------

Message from: Amareshwari Sriramadasu amar...@yahoo-inc.com

---
This is an automatically generated e-mail. To reply, visit: http://review.cloudera.org/r/795/
---

Review request for Hive Developers.

Summary
---

Adds Union type to Standard ObjectInspectors, TypeInfo and Lazy ObjectInspectors.

This addresses bug HIVE-537.
http://issues.apache.org/jira/browse/HIVE-537

Diffs
---

trunk/serde/src/gen-java/org/apache/hadoop/hive/serde/Constants.java 991812
trunk/serde/src/java/org/apache/hadoop/hive/serde2/SerDeUtils.java 991812
trunk/serde/src/java/org/apache/hadoop/hive/serde2/binarysortable/BinarySortableSerDe.java 991812
trunk/serde/src/java/org/apache/hadoop/hive/serde2/lazy/LazyFactory.java 991812
trunk/serde/src/java/org/apache/hadoop/hive/serde2/lazy/LazySimpleSerDe.java 991812
trunk/serde/src/java/org/apache/hadoop/hive/serde2/lazy/LazyUnion.java PRE-CREATION
trunk/serde/src/java/org/apache/hadoop/hive/serde2/lazy/objectinspector/LazyObjectInspectorFactory.java 991812
trunk/serde/src/java/org/apache/hadoop/hive/serde2/lazy/objectinspector/LazyUnionObjectInspector.java PRE-CREATION
trunk/serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/ObjectInspector.java 991812
trunk/serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/ObjectInspectorFactory.java 991812
trunk/serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/ObjectInspectorUtils.java 991812
trunk/serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/StandardUnionObjectInspector.java PRE-CREATION
trunk/serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/UnionObject.java PRE-CREATION
trunk/serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/UnionObjectInspector.java PRE-CREATION
trunk/serde/src/java/org/apache/hadoop/hive/serde2/typeinfo/TypeInfo.java 991812
trunk/serde/src/java/org/apache/hadoop/hive/serde2/typeinfo/TypeInfoFactory.java 991812
trunk/serde/src/java/org/apache/hadoop/hive/serde2/typeinfo/TypeInfoUtils.java 991812
trunk/serde/src/java/org/apache/hadoop/hive/serde2/typeinfo/UnionTypeInfo.java PRE-CREATION
trunk/serde/src/test/org/apache/hadoop/hive/serde2/lazy/TestLazyArrayMapStruct.java 991812
trunk/serde/src/test/org/apache/hadoop/hive/serde2/objectinspector/TestStandardObjectInspectors.java 991812

Diff: http://review.cloudera.org/r/795/diff

Testing
---

Thanks,
Amareshwari

Key: HIVE-537
Assignee: Amareshwari Sriramadasu
Attachments: HIVE-537.1.patch, patch-537-1.txt, patch-537.txt
[jira] Commented: (HIVE-537) Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)
[ https://issues.apache.org/jira/browse/HIVE-537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904099#action_12904099 ]

Amareshwari Sriramadasu commented on HIVE-537:
----------------------------------------------

Min, any update on the patch?

Key: HIVE-537
Assignee: Min Zhou
Attachments: HIVE-537.1.patch
[jira] Commented: (HIVE-537) Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)
[ https://issues.apache.org/jira/browse/HIVE-537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12727999#action_12727999 ]

Min Zhou commented on HIVE-537:
-------------------------------

Zheng, how would you get the field value from an object without an ordinal?

Key: HIVE-537
Assignee: Min Zhou
Attachments: HIVE-537.1.patch
[jira] Commented: (HIVE-537) Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)
[ https://issues.apache.org/jira/browse/HIVE-537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12727846#action_12727846 ]

Zheng Shao commented on HIVE-537:
---------------------------------

@HIVE-537.1.patch:

1. Can you remove the property changes? These java files don't need to be executable:
Property changes on: src/java/org/apache/hadoop/hive/serde2/objectinspector/StandardUnionObjectInspector.java
___
Name: svn:executable
   + *
2. UnionObjectInspector.java: byte getTag(Object o, int ordinal); We don't need ordinal here.
3. Can you add union to TypeInfoUtils.java: class TypeInfoParser as well?
4. We need some test cases. Please take a look at TestStandardObjectInspectors.java
5. We need to add the capability of serializing/deserializing Union types to LazySimpleSerDe.

Key: HIVE-537
Assignee: Min Zhou
Attachments: HIVE-537.1.patch
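Item 2 above asks for getTag(Object o) without an ordinal. One way to read that, sketched here under stated assumptions, is that the union object itself carries its tag, so the inspector needs no extra argument. The interface below is copied from the issue description; `UnionInspectorSketch`, `TaggedObject`, `SimpleUnionInspector` and the one-method `ObjectInspector` placeholder are hypothetical stand-ins for illustration, not the classes in the patch.

```java
/** Hypothetical illustration of a tag-carrying union object; not the patch's classes. */
final class UnionInspectorSketch {

    /** Placeholder standing in for Hive's ObjectInspector; only a type name is modeled. */
    interface ObjectInspector {
        String getTypeName();
    }

    /** The interface as given in the issue description. */
    interface UnionObjectInspector {
        /** Returns the array of OIs that are for each of the tags. */
        ObjectInspector[] getObjectInspectors();

        /** Return the tag of the object. */
        byte getTag(Object o);

        /** Return the field based on the tag value associated with the Object. */
        Object getField(Object o);
    }

    /** In-memory union value that stores its own tag next to the selected field. */
    static final class TaggedObject {
        final byte tag;
        final Object field;

        TaggedObject(byte tag, Object field) {
            this.tag = tag;
            this.field = field;
        }
    }

    /** Reads tag and field straight from the TaggedObject, so no ordinal is required. */
    static final class SimpleUnionInspector implements UnionObjectInspector {
        private final ObjectInspector[] inspectors;

        SimpleUnionInspector(ObjectInspector[] inspectors) {
            this.inspectors = inspectors;
        }

        @Override
        public ObjectInspector[] getObjectInspectors() {
            return inspectors;
        }

        @Override
        public byte getTag(Object o) {
            return ((TaggedObject) o).tag;
        }

        @Override
        public Object getField(Object o) {
            return ((TaggedObject) o).field;
        }
    }

    public static void main(String[] args) {
        ObjectInspector[] ois = { () -> "int", () -> "string" };
        SimpleUnionInspector oi = new SimpleUnionInspector(ois);
        Object value = new TaggedObject((byte) 1, "login");
        byte tag = oi.getTag(value);
        System.out.println(oi.getObjectInspectors()[tag].getTypeName() + ": " + oi.getField(value));
    }
}
```

A caller first asks for the tag, then uses it to pick the matching inspector from getObjectInspectors() and interpret the result of getField.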
[jira] Commented: (HIVE-537) Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)
[ https://issues.apache.org/jira/browse/HIVE-537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12725532#action_12725532 ]

Min Zhou commented on HIVE-537:
-------------------------------

Even if UnionObjectInspector is implemented, DynamicSerDe does not seem to support a schema with a union type, which Thrift can't recognize. We must find a way to solve this. Any suggestions?

Key: HIVE-537
Assignee: Zheng Shao
[jira] Commented: (HIVE-537) Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)
[ https://issues.apache.org/jira/browse/HIVE-537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12724916#action_12724916 ] Min Zhou commented on HIVE-537: --- we've done a test about this issue, dataset: 700m records. first approach, each distinct count needs 119 seconds, that's means 10 distinct count needs at least 1190 seconds. second approach where distinct keys were distinguished by a tag, 10 distinct count need 148 seconds. Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map) --- Key: HIVE-537 URL: https://issues.apache.org/jira/browse/HIVE-537 Project: Hadoop Hive Issue Type: New Feature Reporter: Zheng Shao Assignee: Zheng Shao There are already some cases inside the code that we use heterogeneous data: JoinOperator, and UnionOperator (in the sense that different parents can pass in records with different ObjectInspectors). We currently use Operator's parentID to distinguish that. However that approach does not extend to more complex plans that might be needed in the future. We will support the union type like this: {code} TypeDefinition: type: primitivetype | structtype | arraytype | maptype | uniontype uniontype: union tag : type (, tag : type)* Example: union0:int,1:double,2:arraystring,3:structa:int,b:string Example of serialized data format: We will first store the tag byte before we serialize the object. On deserialization, we will first read out the tag byte, then we know what is the current type of the following object, so we can deserialize it successfully. Interface for ObjectInspector: interface UnionObjectInspector { /** Returns the array of OIs that are for each of the tags */ ObjectInspector[] getObjectInspectors(); /** Return the tag of the object. */ byte getTag(Object o); /** Return the field based on the tag value associated with the Object. */ Object getField(Object o); }; {code} -- This message is automatically generated by JIRA. 
- You can reply to this email to add a comment to the issue online.
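The tag-byte serialization described in the issue text above can be illustrated with a small, self-contained Java sketch. It is not Hive's implementation: the class name `TaggedUnionSketch`, the three tag constants, and the length-prefixed string encoding are all assumptions made for this example. It only shows the idea of writing the tag first and reading it back first so the reader knows how to decode the bytes that follow.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; these names are not part of Hive.
final class TaggedUnionSketch {
    // Tag values assumed for this sketch: 0 = int, 1 = double, 2 = string.
    static final byte TAG_INT = 0;
    static final byte TAG_DOUBLE = 1;
    static final byte TAG_STRING = 2;

    /** Writes the tag byte first, then the value encoded according to the tag. */
    static byte[] serialize(byte tag, Object value) {
        switch (tag) {
            case TAG_INT:
                return ByteBuffer.allocate(1 + 4).put(tag).putInt((Integer) value).array();
            case TAG_DOUBLE:
                return ByteBuffer.allocate(1 + 8).put(tag).putDouble((Double) value).array();
            case TAG_STRING:
                // Assumed encoding: 4-byte length followed by UTF-8 bytes.
                byte[] utf8 = ((String) value).getBytes(StandardCharsets.UTF_8);
                return ByteBuffer.allocate(1 + 4 + utf8.length)
                        .put(tag).putInt(utf8.length).put(utf8).array();
            default:
                throw new IllegalArgumentException("unknown tag " + tag);
        }
    }

    /** Returns the tag stored in the first byte, in the spirit of getTag(Object). */
    static byte tagOf(byte[] data) {
        return data[0];
    }

    /** Reads the tag byte, then decodes the value whose type the tag selects. */
    static Object deserialize(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data);
        byte tag = buf.get();
        switch (tag) {
            case TAG_INT:
                return buf.getInt();
            case TAG_DOUBLE:
                return buf.getDouble();
            case TAG_STRING:
                byte[] utf8 = new byte[buf.getInt()];
                buf.get(utf8);
                return new String(utf8, StandardCharsets.UTF_8);
            default:
                throw new IllegalArgumentException("unknown tag " + tag);
        }
    }
}
```

For example, `TaggedUnionSketch.deserialize(TaggedUnionSketch.serialize(TaggedUnionSketch.TAG_INT, 42))` round-trips the integer, and `tagOf` recovers the tag without decoding the payload.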
[jira] Commented: (HIVE-537) Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)
[ https://issues.apache.org/jira/browse/HIVE-537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718373#action_12718373 ]

Min Zhou commented on HIVE-537:
---

For the first approach:

O(mN/p) + O(m(N/p log (N/p))) + O(mN/r) + O(m)

I don't agree with you about this O(m) term. It would indeed be a very large cost. You should also add the cost of joining all the results into one at the end.

For the second approach, I think it should be:

O(N/p) + O(mN/p log (mN/p)) + O(mN/r)
[jira] Commented: (HIVE-537) Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)
[ https://issues.apache.org/jira/browse/HIVE-537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715878#action_12715878 ]

Zheng Shao commented on HIVE-537:
---

An example usage is multiple distinct aggregations. Min Zhou talked with me offline and showed that doing multiple distincts in a single map-reduce job can be much faster than doing them separately and then joining the results.

{code}
Query:
  select a, count(distinct b), count(distinct c), sum(d)

Plan:
  Map side:
    Emit: distribution_key: a, sort_key: a, 0, b, value: d
    Emit: distribution_key: a, sort_key: a, 1, c, value: nothing
  Reduce side:
    Group By:
      a, 0, count(distinct b), sum(d)
      a, 1, count(distinct c)
    Flatten:
      a, count(distinct b), sum(d), count(distinct c)
{code}
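Zheng Shao's single-job plan for multiple distincts can be simulated in memory to see why the tag in the sort key matters. The sketch below is only an illustration under assumptions: the class `MultiDistinctSketch`, the `Emitted` record, and the string-array row layout `{a, b, c, d}` are hypothetical, and an in-memory sort stands in for the map-reduce shuffle. Because records arrive sorted by (a, tag, value), the reducer can count distinct values by comparing each record with the previous one.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory simulation; not Hive operator code.
final class MultiDistinctSketch {
    /** One shuffled record: sort key (a, tag, distinctValue) plus the payload d. */
    static final class Emitted {
        final String a;
        final int tag;
        final String distinctValue;
        final long d;

        Emitted(String a, int tag, String distinctValue, long d) {
            this.a = a;
            this.tag = tag;
            this.distinctValue = distinctValue;
            this.d = d;
        }
    }

    /** Map side: each row {a, b, c, d} yields one record per distinct column. */
    static List<Emitted> map(List<String[]> rows) {
        List<Emitted> out = new ArrayList<>();
        for (String[] r : rows) {
            out.add(new Emitted(r[0], 0, r[1], Long.parseLong(r[3]))); // tag 0: b, value d
            out.add(new Emitted(r[0], 1, r[2], 0L));                   // tag 1: c, no value
        }
        return out;
    }

    /** Reduce side: returns a -> {count(distinct b), sum(d), count(distinct c)}. */
    static Map<String, long[]> reduce(List<Emitted> records) {
        List<Emitted> sorted = new ArrayList<>(records);
        // Stand-in for the shuffle sort on (a, tag, value).
        sorted.sort(Comparator.comparing((Emitted e) -> e.a)
                .thenComparingInt(e -> e.tag)
                .thenComparing(e -> e.distinctValue));
        Map<String, long[]> result = new LinkedHashMap<>();
        Emitted prev = null;
        for (Emitted e : sorted) {
            long[] agg = result.computeIfAbsent(e.a, k -> new long[3]);
            // A change in any part of the sort key means a new distinct value.
            boolean newDistinct = prev == null
                    || !prev.a.equals(e.a)
                    || prev.tag != e.tag
                    || !prev.distinctValue.equals(e.distinctValue);
            if (newDistinct) {
                agg[e.tag == 0 ? 0 : 2]++;
            }
            if (e.tag == 0) {
                agg[1] += e.d; // sum(d) rides along with the tag-0 records only
            }
            prev = e;
        }
        return result;
    }
}
```

Calling `MultiDistinctSketch.reduce(MultiDistinctSketch.map(rows))` yields one flattened array per value of `a`, which corresponds to the Flatten step of the plan.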
[jira] Commented: (HIVE-537) Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)
[ https://issues.apache.org/jira/browse/HIVE-537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12716015#action_12716015 ]

Ashish Thusoo commented on HIVE-537:
---

One thing to be careful about is that you will be increasing the number of rows crossing the map/reduce boundary, which, if there are a lot of distincts, can lead to data explosion and a subsequent slowdown in the sort. By that I mean the following:

Suppose we have a query with m different distincts, a base table with N rows, p mappers, and r reducers.

By doing multiple map/reduce jobs, the predominant terms in our complexity are

O(mN/p) + O(m(N/p log (N/p))) + O(mN/r) + O(m)

i.e. map side scan + map side sort + reduce side merge + fixed cost of starting the map/reduce jobs.

However, with the current approach the corresponding formula will be

O(mN/p) + O(mN/p log (mN/p)) + O(mN/r)
= O(mN/p) + O(mN/p log (N/p)) + O(mN/p log m) + O(mN/r)

There may be situations where one is better than the other. Something to keep in mind.
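To get a feel for when each approach wins, the two formulas in Ashish Thusoo's comment can be evaluated numerically. The sketch below is a rough illustration only: big-O expressions hide constant factors, so the class `DistinctCostSketch`, the unit-cost assumption for every term, the base-2 logarithm, and the `startup` parameter standing in for the fixed per-job cost are all assumptions of this example rather than measured values.

```java
// Hypothetical cost model; every term is given unit weight for illustration.
final class DistinctCostSketch {
    static double log2(double x) {
        return Math.log(x) / Math.log(2);
    }

    /** m separate jobs: each scans N/p rows, sorts them, merges N/r rows, and pays a startup cost. */
    static double separateJobs(double m, double n, double p, double r, double startup) {
        double perMapper = n / p;
        return m * (perMapper + perMapper * log2(perMapper) + n / r + startup);
    }

    /** One job with tagged keys: mN/p rows emitted and sorted per mapper, mN/r rows merged per reducer. */
    static double singleJob(double m, double n, double p, double r) {
        double perMapper = m * n / p;
        return perMapper + perMapper * log2(perMapper) + m * n / r;
    }

    /** Extra sort cost of the single job, the (mN/p) log m term from the comment. */
    static double extraSortCost(double m, double n, double p) {
        return (m * n / p) * log2(m);
    }
}
```

Under these assumptions the single job costs more than the separate jobs by `extraSortCost(m, n, p)` minus `m * startup`, so the trade-off depends on how large the fixed job startup cost is relative to the extra `log m` factor in the sort.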