[jira] [Created] (HIVE-13315) Option to reuse existing restored HBase snapshots
Liyin Tang created HIVE-13315: - Summary: Option to reuse existing restored HBase snapshots Key: HIVE-13315 URL: https://issues.apache.org/jira/browse/HIVE-13315 Project: Hive Issue Type: Improvement Components: HBase Handler Reporter: Liyin Tang Assignee: Sushanth Sowmyan HiveHBaseTableSnapshotInputFormat currently has to restore the HBase snapshot for every query. It would be useful to have a table property that specifies an existing restored snapshot. When that property is set, the job can skip the restore stage, which reduces query time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
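A minimal sketch of the requested behaviour, assuming the option is exposed as a plain string table property. The property name below is hypothetical (the issue does not name one), and the class is a stand-in that only models the decision to skip the restore stage, not the actual HiveHBaseTableSnapshotInputFormat code.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

class RestoredSnapshotOption {
    // Hypothetical property name: HIVE-13315 asks for such a property but does not name it.
    static final String RESTORED_SNAPSHOT_DIR_KEY = "example.hbase.snapshot.restored.dir";

    /** Returns the location of an already-restored snapshot, if the table properties name one. */
    static Optional<String> existingRestoreDir(Map<String, String> tableProperties) {
        String dir = tableProperties.get(RESTORED_SNAPSHOT_DIR_KEY);
        if (dir == null || dir.trim().isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(dir.trim());
    }

    /** The restore stage is only needed when no existing restore is configured. */
    static boolean needsRestore(Map<String, String> tableProperties) {
        return !existingRestoreDir(tableProperties).isPresent();
    }

    public static void main(String[] args) {
        Map<String, String> props = new HashMap<>();
        System.out.println("restore needed: " + needsRestore(props));
        props.put(RESTORED_SNAPSHOT_DIR_KEY, "/tmp/restored/snapshot1");
        System.out.println("restore needed: " + needsRestore(props));
    }
}
```

Treating a blank value the same as a missing one keeps the default behaviour (restore per query) unless the property is set deliberately.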
[jira] [Commented] (HIVE-2095) auto convert map join bug
[ https://issues.apache.org/jira/browse/HIVE-2095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13017208#comment-13017208 ]

Liyin Tang commented on HIVE-2095:
--
It looks good to me. Thanks, Yongqiang.

> auto convert map join bug
>
> Key: HIVE-2095
> URL: https://issues.apache.org/jira/browse/HIVE-2095
> Project: Hive
> Issue Type: Bug
> Reporter: He Yongqiang
> Assignee: He Yongqiang
> Attachments: HIVE-2095.1.patch
>
> 1) When considering a table as the big table candidate for a map join, if Hive can determine at compile time that the total known size of all the other tables (excluding the candidate) is bigger than a configured value, the candidate is a bad one and should not be put into the plan. Otherwise, filtering it out at runtime may take more time.
> 2) Added a null check for backup tasks. Otherwise a NullPointerException is thrown.
> 3) CommonJoinResolver needs to know the full mapping of pathToAliases. Otherwise it will make the wrong decision.
> 4) Changes made to ConditionalResolverCommonJoin: added pathToAliases, aliasToSize (the alias's input size that is known at compile time, from inputSummary), and the intermediate dir path. The logic is to go over all of pathToAliases and, for each path that is from the intermediate dir path, add that path's size to all aliases. Finally, the big table is chosen based on the size information and on other data such as aliasToTask.
> 5) The conditional task's children contain wrong options, which may cause the join to fail or to produce incorrect results. When getting all possible children for the conditional task, a whitelist of big tables should be used. Only tables in this whitelist can be considered as a big table.
> Here is the logic:
> + * Get a list of big table candidates. Only the tables in the returned set can
> + * be used as big table in the join operation.
> + *
> + * The logic here is to scan the join condition array from left to right. If
> + * see a inner join and the bigTableCandidates is empty, add both side of this
> + * inner join to big table candidates. If see a left outer join, and the
> + * bigTableCandidates is empty, add the left side to it, and if the
> + * bigTableCandidates is not empty, do nothing (which means the
> + * bigTableCandidates is from left side). If see a right outer join, clear the
> + * bigTableCandidates, and add right side to the bigTableCandidates, it means
> + * the right side of a right outer join always win. If see a full outer join,
> + * return null immediately (no one can be the big table, can not do a
> + * mapjoin).

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
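The candidate-selection rule quoted in the patch comment can be sketched in Java as follows. This is only an illustration written from that description, not the code in HIVE-2095.1.patch: the JoinType enum, the JoinCond class, and the method name are assumptions made for the example, and tables are identified by their position in the join.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class BigTableCandidatesSketch {
    enum JoinType { INNER, LEFT_OUTER, RIGHT_OUTER, FULL_OUTER }

    /** Hypothetical join condition: a join type plus the positions of its two inputs. */
    static final class JoinCond {
        final JoinType type;
        final int left;
        final int right;

        JoinCond(JoinType type, int left, int right) {
            this.type = type;
            this.left = left;
            this.right = right;
        }
    }

    /** Scans the conditions left to right; returns null when no map join is possible. */
    static Set<Integer> bigTableCandidates(List<JoinCond> conds) {
        Set<Integer> candidates = new LinkedHashSet<>();
        for (JoinCond c : conds) {
            switch (c.type) {
                case FULL_OUTER:
                    return null; // no table can be the big table
                case INNER:
                    if (candidates.isEmpty()) {
                        candidates.add(c.left);
                        candidates.add(c.right);
                    }
                    break;
                case LEFT_OUTER:
                    if (candidates.isEmpty()) {
                        candidates.add(c.left);
                    }
                    break; // non-empty: keep the candidates found so far
                case RIGHT_OUTER:
                    candidates.clear(); // the right side of a right outer join always wins
                    candidates.add(c.right);
                    break;
            }
        }
        return candidates;
    }

    public static void main(String[] args) {
        List<JoinCond> conds = Arrays.asList(
            new JoinCond(JoinType.LEFT_OUTER, 0, 1),
            new JoinCond(JoinType.RIGHT_OUTER, 1, 2));
        System.out.println(bigTableCandidates(conds)); // prints [2]
    }
}
```

Returning null for a full outer join mirrors the quoted comment; callers have to check for it before building map-join alternatives.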
[jira] [Commented] (HIVE-2095) auto convert map join bug
[ https://issues.apache.org/jira/browse/HIVE-2095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13016871#comment-13016871 ] Liyin Tang commented on HIVE-2095: -- I will take a look.
[jira] [Commented] (HIVE-1966) mapjoin operator should not load hashtable for each new inputfile if the hashtable to be loaded is already there.
[ https://issues.apache.org/jira/browse/HIVE-1966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13010510#comment-13010510 ] Liyin Tang commented on HIVE-1966: -- +1 > mapjoin operator should not load hashtable for each new inputfile if the > hashtable to be loaded is already there. > - > > Key: HIVE-1966 > URL: https://issues.apache.org/jira/browse/HIVE-1966 > Project: Hive > Issue Type: Improvement >Reporter: He Yongqiang >Assignee: Liyin Tang > Attachments: HIVE-1966.1.patch > > -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HIVE-1965) Auto convert mapjoin should not throw exception if the top operator is union operator.
[ https://issues.apache.org/jira/browse/HIVE-1965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13010508#comment-13010508 ] Liyin Tang commented on HIVE-1965: -- +1 > Auto convert mapjoin should not throw exception if the top operator is union > operator. > -- > > Key: HIVE-1965 > URL: https://issues.apache.org/jira/browse/HIVE-1965 > Project: Hive > Issue Type: Bug >Reporter: He Yongqiang >Assignee: Liyin Tang > Attachments: HIVE-1965.1.patch > > -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Updated: (HIVE-1845) Some attributes in the Eclipse template file is deprecated
[ https://issues.apache.org/jira/browse/HIVE-1845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liyin Tang updated HIVE-1845: - Status: Patch Available (was: Open) > Some attributes in the Eclipse template file is deprecated > > > Key: HIVE-1845 > URL: https://issues.apache.org/jira/browse/HIVE-1845 > Project: Hive > Issue Type: Bug >Reporter: Liyin Tang >Assignee: Liyin Tang > Attachments: hive-1845-1.patch > > > In the eclipse template file, it will reference this jar file, which is > deprecated. > /@PROJECT@/build/metastore/hive-mod...@hive_version@.jar > So the correct one should be: > /@PROJECT@/build/metastore/hive-metasto...@hive_version@.jar > Just update all the eclipse template files. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HIVE-1845) Some attributes in the Eclipse template file is deprecated
[ https://issues.apache.org/jira/browse/HIVE-1845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liyin Tang updated HIVE-1845: - Attachment: hive-1845-1.patch Update all the eclipse template files. > Some attributes in the Eclipse template file is deprecated > > > Key: HIVE-1845 > URL: https://issues.apache.org/jira/browse/HIVE-1845 > Project: Hive > Issue Type: Bug >Reporter: Liyin Tang >Assignee: Liyin Tang > Attachments: hive-1845-1.patch > > > In the eclipse template file, it will reference this jar file, which is > deprecated. > /@PROJECT@/build/metastore/hive-mod...@hive_version@.jar > So the correct one should be: > /@PROJECT@/build/metastore/hive-metasto...@hive_version@.jar > Just update all the eclipse template files. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (HIVE-1845) Some attributes in the Eclipse template file is deprecated
Some attributes in the Eclipse template file is deprecated Key: HIVE-1845 URL: https://issues.apache.org/jira/browse/HIVE-1845 Project: Hive Issue Type: Bug Reporter: Liyin Tang Assignee: Liyin Tang The Eclipse template file references the following jar file, which is deprecated: /@PROJECT@/build/metastore/hive-mod...@hive_version@.jar The correct one should be: /@PROJECT@/build/metastore/hive-metasto...@hive_version@.jar All the Eclipse template files need this update. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HIVE-1842) Add the local flag to all the map red tasks, if the query is running locally.
[ https://issues.apache.org/jira/browse/HIVE-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liyin Tang updated HIVE-1842: - Attachment: hive-1842-1.patch Add the local flag to all the map red tasks, if the query is running locally. > Add the local flag to all the map red tasks, if the query is running locally. > - > > Key: HIVE-1842 > URL: https://issues.apache.org/jira/browse/HIVE-1842 > Project: Hive > Issue Type: Sub-task > Components: Query Processor >Affects Versions: 0.4.1 >Reporter: Liyin Tang >Assignee: Liyin Tang > Attachments: hive-1842-1.patch > > -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HIVE-1842) Add the local flag to all the map red tasks, if the query is running locally.
[ https://issues.apache.org/jira/browse/HIVE-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liyin Tang updated HIVE-1842: - Status: Patch Available (was: Open) Add the local flag to all the map red tasks, if the query is running locally. > Add the local flag to all the map red tasks, if the query is running locally. > - > > Key: HIVE-1842 > URL: https://issues.apache.org/jira/browse/HIVE-1842 > Project: Hive > Issue Type: Sub-task > Components: Query Processor >Affects Versions: 0.4.1 >Reporter: Liyin Tang >Assignee: Liyin Tang > Attachments: hive-1842-1.patch > > -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (HIVE-1842) Add the local flag to all the map red tasks, if the query is running locally.
Add the local flag to all the map red tasks, if the query is running locally. - Key: HIVE-1842 URL: https://issues.apache.org/jira/browse/HIVE-1842 Project: Hive Issue Type: Sub-task Components: Query Processor Affects Versions: 0.4.1 Reporter: Liyin Tang Assignee: Liyin Tang -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HIVE-1830) mappers in group followed by joins may die OOM
[ https://issues.apache.org/jira/browse/HIVE-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liyin Tang updated HIVE-1830: - Attachment: hive-1830-5.patch 1) Remove the debug statements 2) Add the memory threshold to group by desc. > mappers in group followed by joins may die OOM > -- > > Key: HIVE-1830 > URL: https://issues.apache.org/jira/browse/HIVE-1830 > Project: Hive > Issue Type: Bug >Reporter: Namit Jain >Assignee: Liyin Tang > Attachments: hive-1830-1.patch, hive-1830-2.patch, hive-1830-3.patch, > hive-1830-4.patch, hive-1830-5.patch > > -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HIVE-1830) mappers in group followed by joins may die OOM
[ https://issues.apache.org/jira/browse/HIVE-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liyin Tang updated HIVE-1830: - Attachment: hive-1830-4.patch 1) Add more descriptions in the config file 2) Set the memory usage of hashtable sink op and group by op into their desc. The memory usage is deterministic after compiling stage. > mappers in group followed by joins may die OOM > -- > > Key: HIVE-1830 > URL: https://issues.apache.org/jira/browse/HIVE-1830 > Project: Hive > Issue Type: Bug >Reporter: Namit Jain >Assignee: Liyin Tang > Attachments: hive-1830-1.patch, hive-1830-2.patch, hive-1830-3.patch, > hive-1830-4.patch > > -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HIVE-1830) mappers in group followed by joins may die OOM
[ https://issues.apache.org/jira/browse/HIVE-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liyin Tang updated HIVE-1830: - Attachment: hive-1830-3.patch Carefully measure the memory usage of the map-side group by, and flush more frequently when the remaining free memory drops below a threshold. > mappers in group followed by joins may die OOM > -- > > Key: HIVE-1830 > URL: https://issues.apache.org/jira/browse/HIVE-1830 > Project: Hive > Issue Type: Bug > Reporter: Namit Jain > Assignee: Liyin Tang > Attachments: hive-1830-1.patch, hive-1830-2.patch, hive-1830-3.patch > > -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
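A rough sketch of the "flush when free memory is low" policy described in this comment. It is not the code from hive-1830-3.patch: the class name, the threshold parameter, and the counting aggregate are assumptions for illustration. The free-memory reading is injected so the policy can be exercised without actually exhausting the heap.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.DoubleSupplier;

class MemoryBoundedGroupBy {
    private final Map<String, Long> partialCounts = new HashMap<>();
    private final double minFreeFraction;      // hypothetical threshold, e.g. 0.10 = 10% of max heap
    private final DoubleSupplier freeFraction; // injected so the policy can be tested
    private int flushCount = 0;

    MemoryBoundedGroupBy(double minFreeFraction, DoubleSupplier freeFraction) {
        this.minFreeFraction = minFreeFraction;
        this.freeFraction = freeFraction;
    }

    /** Fraction of the maximum heap that is still unused, as reported by the JVM. */
    static double jvmFreeFraction() {
        Runtime rt = Runtime.getRuntime();
        long used = rt.totalMemory() - rt.freeMemory();
        return 1.0 - (double) used / rt.maxMemory();
    }

    /** Aggregates one row, then flushes if the remaining memory is below the threshold. */
    void add(String key) {
        partialCounts.merge(key, 1L, Long::sum);
        if (freeFraction.getAsDouble() < minFreeFraction) {
            flush();
        }
    }

    /** A real operator would forward the partial aggregates downstream before clearing. */
    void flush() {
        partialCounts.clear();
        flushCount++;
    }

    int flushCount() { return flushCount; }

    int bufferedKeys() { return partialCounts.size(); }

    public static void main(String[] args) {
        MemoryBoundedGroupBy groupBy =
            new MemoryBoundedGroupBy(0.10, MemoryBoundedGroupBy::jvmFreeFraction);
        groupBy.add("a");
        groupBy.add("b");
        System.out.println("flushes so far: " + groupBy.flushCount());
    }
}
```

Checking memory on every row is the simplest form; checking every N rows would reduce the overhead at the cost of a slower reaction.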
[jira] Updated: (HIVE-1830) mappers in group followed by joins may die OOM
[ https://issues.apache.org/jira/browse/HIVE-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liyin Tang updated HIVE-1830: - Attachment: hive-1830-2.patch Add a new test: auto_join26.q > mappers in group followed by joins may die OOM > -- > > Key: HIVE-1830 > URL: https://issues.apache.org/jira/browse/HIVE-1830 > Project: Hive > Issue Type: Bug >Reporter: Namit Jain >Assignee: Liyin Tang > Attachments: hive-1830-1.patch, hive-1830-2.patch > > -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HIVE-1830) mappers in group followed by joins may die OOM
[ https://issues.apache.org/jira/browse/HIVE-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liyin Tang updated HIVE-1830: - Attachment: hive-1830-1.patch > mappers in group followed by joins may die OOM > -- > > Key: HIVE-1830 > URL: https://issues.apache.org/jira/browse/HIVE-1830 > Project: Hive > Issue Type: Bug >Reporter: Namit Jain >Assignee: Liyin Tang > Attachments: hive-1830-1.patch > > -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HIVE-1827) Audit how many queries will be run in the local mode
[ https://issues.apache.org/jira/browse/HIVE-1827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liyin Tang updated HIVE-1827: - Attachment: hive-1827-1.patch Add a new attribute isLocalMode in Task. > Audit how many queries will be run in the local mode > > > Key: HIVE-1827 > URL: https://issues.apache.org/jira/browse/HIVE-1827 > Project: Hive > Issue Type: New Feature >Reporter: Liyin Tang >Assignee: Liyin Tang > Attachments: hive-1827-1.patch > > > Hive can run query in local mode. It would be nice to track and audit how > many queries will be run in the local mode. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-1832) Dynamically allocate and measure memory usage when a map join op followed by a group by op
[ https://issues.apache.org/jira/browse/HIVE-1832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12967268#action_12967268 ] Liyin Tang commented on HIVE-1832: -- Duplicate of Hive-1830 > Dynamically allocate and measure memory usage when a map join op followed by > a group by op > -- > > Key: HIVE-1832 > URL: https://issues.apache.org/jira/browse/HIVE-1832 > Project: Hive > Issue Type: Improvement > Components: Query Processor >Reporter: Liyin Tang >Assignee: Liyin Tang > > Right now, if a map join operator followed by a map-side group by, this map > reduce task will be memory intensive task. > Memory usage should be carefully measured and bounded in order not to run out > of memory. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (HIVE-1832) Dynamically allocate and measure memory usage when a map join op followed by a group by op
Dynamically allocate and measure memory usage when a map join op followed by a group by op -- Key: HIVE-1832 URL: https://issues.apache.org/jira/browse/HIVE-1832 Project: Hive Issue Type: Improvement Components: Query Processor Reporter: Liyin Tang Assignee: Liyin Tang Right now, if a map join operator is followed by a map-side group by, the map reduce task is memory intensive. Memory usage should be carefully measured and bounded so that the task does not run out of memory. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (HIVE-1827) Audit how many queries will be run in the local mode
Audit how many queries will be run in the local mode Key: HIVE-1827 URL: https://issues.apache.org/jira/browse/HIVE-1827 Project: Hive Issue Type: New Feature Reporter: Liyin Tang Assignee: Liyin Tang Hive can run queries in local mode. It would be nice to track and audit how many queries are run in local mode. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (HIVE-1700) Optimiza JDBM to make mapjoin faster
[ https://issues.apache.org/jira/browse/HIVE-1700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Liyin Tang reassigned HIVE-1700:

Assignee: Liyin Tang

> Optimiza JDBM to make mapjoin faster
>
> Key: HIVE-1700
> URL: https://issues.apache.org/jira/browse/HIVE-1700
> Project: Hive
> Issue Type: Improvement
> Reporter: He Yongqiang
> Assignee: Liyin Tang
>
> copied from email:
> From: Joydeep Sen Sarma
> Sent: Tuesday, October 12, 2010 11:11 AM
> To: Yongqiang He; Liyin Tang; Namit Jain
> Subject: RE: Optimize jdbm
> It seems like we should move all deserialization to Hive land. JDBM should just work on byte arrays for both keys and values (since the output of the serializer used by Hive is byte comparable, that seems to suffice).
>
> From: Yongqiang He
> Sent: Tuesday, October 12, 2010 10:22 AM
> To: Liyin Tang; Namit Jain
> Cc: Joydeep Sen Sarma
> Subject: Optimize jdbm
> 1. HTree.get() costs 70% of the total time. It could help a lot to have a bloom filter here to avoid unneeded get() calls when we know for sure that the given key is not in JDBM. (We can generate the bloom filter when doing the JDBM sink, and read it into memory when doing the read.)
> 2. HTree.get() deserializes both key and value until it finds a matching key. We could deserialize only the key, and deserialize the value only once the key matches.
> Any others?

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
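The first suggestion in the quoted email (a bloom filter in front of HTree.get()) can be illustrated with a small, self-contained sketch. Everything here is an assumption made for illustration: a HashMap stands in for the JDBM HTree, the filter uses two simple hash functions over a BitSet, and none of the names come from JDBM or Hive.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

class BloomGuardedLookup {
    private final BitSet bits;
    private final int numBits;
    // Stand-in for the on-disk JDBM HTree; values stay serialized as byte arrays.
    private final Map<String, byte[]> table = new HashMap<>();
    private int tableLookups = 0;

    BloomGuardedLookup(int numBits) {
        this.numBits = numBits;
        this.bits = new BitSet(numBits);
    }

    private int firstIndex(String key) {
        return Math.floorMod(key.hashCode(), numBits);
    }

    private int secondIndex(String key) {
        int h = key.hashCode();
        return Math.floorMod((h >>> 16) ^ (h * 0x9E3779B9), numBits);
    }

    /** Called while building the table (the "sink" side): record the key in the filter. */
    void put(String key, byte[] serializedValue) {
        bits.set(firstIndex(key));
        bits.set(secondIndex(key));
        table.put(key, serializedValue);
    }

    /** Called on the read side: consult the filter before touching the table. */
    byte[] get(String key) {
        if (!bits.get(firstIndex(key)) || !bits.get(secondIndex(key))) {
            return null; // definitely absent, so the expensive lookup is skipped
        }
        tableLookups++; // possibly present; false positives still reach the table
        return table.get(key);
    }

    int tableLookups() { return tableLookups; }

    public static void main(String[] args) {
        BloomGuardedLookup lookup = new BloomGuardedLookup(1 << 16);
        lookup.put("key1", new byte[] {42});
        System.out.println(lookup.get("key1")[0]);
        System.out.println(lookup.get("missing") == null);
    }
}
```

A bloom filter never reports a stored key as absent, so the guard cannot change join results; it only removes some of the lookups for keys that are not in the table.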
[jira] Resolved: (HIVE-1700) Optimiza JDBM to make mapjoin faster
[ https://issues.apache.org/jira/browse/HIVE-1700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liyin Tang resolved HIVE-1700. -- Resolution: Won't Fix Release Note: The JDBM component has been removed from Hive. No need to optimize this any more.
[jira] Updated: (HIVE-1811) Show the time the local task takes
[ https://issues.apache.org/jira/browse/HIVE-1811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liyin Tang updated HIVE-1811: - Attachment: hive-1811-1.patch The original showTime code has a potential bug when the local task takes more than 60 seconds. This patch fixes it. > Show the time the local task takes > -- > > Key: HIVE-1811 > URL: https://issues.apache.org/jira/browse/HIVE-1811 > Project: Hive > Issue Type: Improvement > Components: Query Processor > Affects Versions: 0.7.0 > Reporter: Liyin Tang > Assignee: Liyin Tang > Fix For: 0.7.0 > > Attachments: hive-1811-1.patch > > > After the local tasks finish, show how much time they took. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
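The comment does not say what the showTime bug was. As an assumption, one way elapsed-time code goes wrong past 60 seconds is by printing only the seconds component and dropping the minutes. The sketch below (the class and method names are hypothetical, not Hive's) derives both parts from the total elapsed time.

```java
class ElapsedTimeFormatter {
    /** Formats a duration so that values of 60 seconds or more keep their minutes. */
    static String format(long elapsedMillis) {
        long totalSeconds = elapsedMillis / 1000;
        long minutes = totalSeconds / 60;
        long seconds = totalSeconds % 60;
        if (minutes > 0) {
            return minutes + " min " + seconds + " sec";
        }
        return seconds + " sec";
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        // ... the local task would run here ...
        long elapsed = System.currentTimeMillis() - start;
        System.out.println("Local task took " + format(elapsed));
        System.out.println(format(125_000)); // prints "2 min 5 sec"
    }
}
```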
[jira] Updated: (HIVE-1804) Mapjoin will fail if there are no files associating with the join tables
[ https://issues.apache.org/jira/browse/HIVE-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liyin Tang updated HIVE-1804: - Attachment: hive-1804-3.patch Since some other patches were committed recently, I regenerated the patch after an svn update. Please review. > Mapjoin will fail if there are no files associating with the join tables > > > Key: HIVE-1804 > URL: https://issues.apache.org/jira/browse/HIVE-1804 > Project: Hive > Issue Type: Bug > Components: Query Processor > Affects Versions: 0.7.0 > Reporter: Liyin Tang > Assignee: Liyin Tang > Fix For: 0.7.0 > > Attachments: hive-1804-1.patch, hive-1804-2.patch, hive-1804-3.patch > > > If there are empty tables with no files associated, the map join will fail. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (HIVE-1811) Show the time the local task takes
Show the time the local task takes -- Key: HIVE-1811 URL: https://issues.apache.org/jira/browse/HIVE-1811 Project: Hive Issue Type: Improvement Components: Query Processor Affects Versions: 0.7.0 Reporter: Liyin Tang Assignee: Liyin Tang Fix For: 0.7.0 After the local tasks finish, show how much time they took. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HIVE-1792) track the joins which are being converted to map-join automatically
[ https://issues.apache.org/jira/browse/HIVE-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liyin Tang updated HIVE-1792: - Attachment: hive-1792-4.patch 1) Remove unrelated change from this patch 2) Set the backup tag in the common join resolver by the way, I generated the diff camp. > track the joins which are being converted to map-join automatically > --- > > Key: HIVE-1792 > URL: https://issues.apache.org/jira/browse/HIVE-1792 > Project: Hive > Issue Type: New Feature > Components: Query Processor >Affects Versions: 0.7.0 >Reporter: Liyin Tang >Assignee: Liyin Tang > Fix For: 0.7.0 > > Attachments: hive-1792-1.patch, hive-1792-2.patch, hive-1792-3.patch, > hive-1792-4.patch > > > We should be able to track how many queries (join) got converted to > map-join -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-1792) track the joins which are being converted to map-join automatically
[ https://issues.apache.org/jira/browse/HIVE-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935234#action_12935234 ] Liyin Tang commented on HIVE-1792: -- There are 2 cases in which the common join runs. One is when the resolver of the conditional task returns the common join. The other is when the map join local task fails. If the tag is not reset when getting the backup task, how can these 2 cases be distinguished? > track the joins which are being converted to map-join automatically > --- > > Key: HIVE-1792 > URL: https://issues.apache.org/jira/browse/HIVE-1792 > Project: Hive > Issue Type: New Feature > Components: Query Processor > Affects Versions: 0.7.0 > Reporter: Liyin Tang > Assignee: Liyin Tang > Fix For: 0.7.0 > > Attachments: hive-1792-1.patch, hive-1792-2.patch, hive-1792-3.patch > > > We should be able to track how many queries (join) got converted to map-join -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HIVE-1792) track the joins which are being converted to map-join automatically
[ https://issues.apache.org/jira/browse/HIVE-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liyin Tang updated HIVE-1792: - Attachment: hive-1792-3.patch Since Hive-1785 has been committed, I generate the diff again. So this diff does not include any code in Hive-1785. Please review. > track the joins which are being converted to map-join automatically > --- > > Key: HIVE-1792 > URL: https://issues.apache.org/jira/browse/HIVE-1792 > Project: Hive > Issue Type: New Feature > Components: Query Processor >Affects Versions: 0.7.0 >Reporter: Liyin Tang >Assignee: Liyin Tang > Fix For: 0.7.0 > > Attachments: hive-1792-1.patch, hive-1792-2.patch, hive-1792-3.patch > > > We should be able to track how many queries (join) got converted to > map-join -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-1785) change Pre/Post Query Hooks to take in 1 parameter: HookContext
[ https://issues.apache.org/jira/browse/HIVE-1785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935186#action_12935186 ] Liyin Tang commented on HIVE-1785: -- Thanks for the review, John. I have created a sub-task (HIVE-1810) to change the XML description. Please take a look. > change Pre/Post Query Hooks to take in 1 parameter: HookContext > --- > > Key: HIVE-1785 > URL: https://issues.apache.org/jira/browse/HIVE-1785 > Project: Hive > Issue Type: Improvement > Components: Query Processor > Affects Versions: 0.7.0 > Reporter: Namit Jain > Assignee: Liyin Tang > Fix For: 0.7.0 > > Attachments: hive-1785_3.patch, hive-1785_4.patch, hive-1785_6.patch, hive_1785_1.patch, hive_1785_2.patch > > > This way, it would be possible to add new parameters to the hooks without changing the existing hooks. > This will be an incompatible change, and all the hooks need to change to the new API. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
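The motivation in the issue description (one context parameter, so new inputs can be added without changing existing hooks) can be shown with stand-in types. The Context class and Hook interface below are simplified, hypothetical stand-ins, not Hive's HookContext or ExecuteWithHookContext, and the fields are chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class HookContextSketch {
    /** Stand-in context object: every input a hook may need travels in this one object. */
    static final class Context {
        final String queryString;
        final String userName;
        // A later field can be added here without touching any existing hook implementation.

        Context(String queryString, String userName) {
            this.queryString = queryString;
            this.userName = userName;
        }
    }

    /** Stand-in for a hook interface whose single parameter is the context. */
    interface Hook {
        void run(Context context);
    }

    /** Runs every registered hook with the same context object. */
    static void runHooks(List<Hook> hooks, Context context) {
        for (Hook hook : hooks) {
            hook.run(context);
        }
    }

    public static void main(String[] args) {
        List<Hook> preHooks = new ArrayList<>();
        preHooks.add(ctx -> System.out.println(ctx.userName + " runs: " + ctx.queryString));
        runHooks(preHooks, new Context("SELECT 1", "alice"));
    }
}
```

The trade-off matches the one noted in the issue: switching to the single-parameter form is a one-time incompatible change, after which the hook signature no longer needs to change.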
[jira] Updated: (HIVE-1810) a followup patch for changing the description of hive.exec.pre/post.hooks in conf/hive-default.xml
[ https://issues.apache.org/jira/browse/HIVE-1810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liyin Tang updated HIVE-1810: - Status: Patch Available (was: Open) Patch is available > a followup patch for changing the description of hive.exec.pre/post.hooks in > conf/hive-default.xml > -- > > Key: HIVE-1810 > URL: https://issues.apache.org/jira/browse/HIVE-1810 > Project: Hive > Issue Type: Sub-task >Reporter: Liyin Tang >Assignee: Liyin Tang > Attachments: hive-1810-1.patch > > > a followup patch for changing the description of hive.exec.pre/post.hooks in > conf/hive-default.xml -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HIVE-1810) a followup patch for changing the description of hive.exec.pre/post.hooks in conf/hive-default.xml
[ https://issues.apache.org/jira/browse/HIVE-1810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liyin Tang updated HIVE-1810: - Attachment: hive-1810-1.patch Changed hive-default.xml. New pre/post hooks should implement the ExecuteWithHookContext interface. > a followup patch for changing the description of hive.exec.pre/post.hooks in conf/hive-default.xml > -- > > Key: HIVE-1810 > URL: https://issues.apache.org/jira/browse/HIVE-1810 > Project: Hive > Issue Type: Sub-task > Reporter: Liyin Tang > Assignee: Liyin Tang > Attachments: hive-1810-1.patch > > > a followup patch for changing the description of hive.exec.pre/post.hooks in conf/hive-default.xml -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (HIVE-1810) a followup patch for changing the description of hive.exec.pre/post.hooks in conf/hive-default.xml
a followup patch for changing the description of hive.exec.pre/post.hooks in conf/hive-default.xml -- Key: HIVE-1810 URL: https://issues.apache.org/jira/browse/HIVE-1810 Project: Hive Issue Type: Sub-task Reporter: Liyin Tang Assignee: Liyin Tang a followup patch for changing the description of hive.exec.pre/post.hooks in conf/hive-default.xml -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-1792) track the joins which are being converted to map-join automatically
[ https://issues.apache.org/jira/browse/HIVE-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935171#action_12935171 ] Liyin Tang commented on HIVE-1792: -- Still need this change to tag on all the join tasks > track the joins which are being converted to map-join automatically > --- > > Key: HIVE-1792 > URL: https://issues.apache.org/jira/browse/HIVE-1792 > Project: Hive > Issue Type: New Feature > Components: Query Processor >Affects Versions: 0.7.0 >Reporter: Liyin Tang >Assignee: Liyin Tang > Fix For: 0.7.0 > > Attachments: hive-1792-1.patch, hive-1792-2.patch > > > We should be able to track how many queries (join) got converted to > map-join -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HIVE-1792) track the joins which are being converted to map-join automatically
[ https://issues.apache.org/jira/browse/HIVE-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liyin Tang updated HIVE-1792: - Attachment: hive-1792-2.patch > track the joins which are being converted to map-join automatically > --- > > Key: HIVE-1792 > URL: https://issues.apache.org/jira/browse/HIVE-1792 > Project: Hive > Issue Type: New Feature > Components: Query Processor >Affects Versions: 0.7.0 >Reporter: Liyin Tang >Assignee: Liyin Tang > Fix For: 0.7.0 > > Attachments: hive-1792-1.patch, hive-1792-2.patch > > > We should be able to track how many queries (join) got converted to > map-join -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (HIVE-1808) but in auto_join25.q
[ https://issues.apache.org/jira/browse/HIVE-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liyin Tang reassigned HIVE-1808: Assignee: Liyin Tang > but in auto_join25.q > > > Key: HIVE-1808 > URL: https://issues.apache.org/jira/browse/HIVE-1808 > Project: Hive > Issue Type: Bug >Reporter: Liyin Tang >Assignee: Liyin Tang > Attachments: hive-1808-1.patch > > > In this test case, there are 2 SET statements: > set hive.mapjoin.localtask.max.memory.usage = 0.0001; > set hive.mapjoin.check.memory.rows = 2; > But in HiveConf, the names of these 2 conf variable do not match with each > other. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HIVE-1808) bug in auto_join25.q
[ https://issues.apache.org/jira/browse/HIVE-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liyin Tang updated HIVE-1808: - Summary: bug in auto_join25.q (was: but in auto_join25.q) > bug in auto_join25.q > > > Key: HIVE-1808 > URL: https://issues.apache.org/jira/browse/HIVE-1808 > Project: Hive > Issue Type: Bug >Reporter: Liyin Tang >Assignee: Liyin Tang > Attachments: hive-1808-1.patch > > > In this test case, there are 2 SET statements: > set hive.mapjoin.localtask.max.memory.usage = 0.0001; > set hive.mapjoin.check.memory.rows = 2; > But in HiveConf, the names of these 2 conf variable do not match with each > other. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HIVE-1808) but in auto_join25.q
[ https://issues.apache.org/jira/browse/HIVE-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liyin Tang updated HIVE-1808: - Attachment: hive-1808-1.patch The bug fixed in this patch > but in auto_join25.q > > > Key: HIVE-1808 > URL: https://issues.apache.org/jira/browse/HIVE-1808 > Project: Hive > Issue Type: Bug >Reporter: Liyin Tang > Attachments: hive-1808-1.patch > > > In this test case, there are 2 SET statements: > set hive.mapjoin.localtask.max.memory.usage = 0.0001; > set hive.mapjoin.check.memory.rows = 2; > But in HiveConf, the names of these 2 conf variable do not match with each > other. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HIVE-1792) track the joins which are being converted to map-join automatically
[ https://issues.apache.org/jira/browse/HIVE-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liyin Tang updated HIVE-1792: - Attachment: (was: hive-1792-2.patch) > track the joins which are being converted to map-join automatically > --- > > Key: HIVE-1792 > URL: https://issues.apache.org/jira/browse/HIVE-1792 > Project: Hive > Issue Type: New Feature > Components: Query Processor >Affects Versions: 0.7.0 >Reporter: Liyin Tang >Assignee: Liyin Tang > Fix For: 0.7.0 > > Attachments: hive-1792-1.patch > > > We should be able to track how many queries (join) got converted to > map-join -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HIVE-1792) track the joins which are being converted to map-join automatically
[ https://issues.apache.org/jira/browse/HIVE-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liyin Tang updated HIVE-1792: - Attachment: hive-1792-2.patch The previous patch included the fix for another JIRA (HIVE-1808). I have now separated it into 2 patches. This patch includes only the diff related to the map join measurement. > track the joins which are being converted to map-join automatically > --- > > Key: HIVE-1792 > URL: https://issues.apache.org/jira/browse/HIVE-1792 > Project: Hive > Issue Type: New Feature > Components: Query Processor >Affects Versions: 0.7.0 >Reporter: Liyin Tang >Assignee: Liyin Tang > Fix For: 0.7.0 > > Attachments: hive-1792-1.patch, hive-1792-2.patch > > > We should be able to track how many queries (join) got converted to > map-join
[jira] Created: (HIVE-1808) but in auto_join25.q
but in auto_join25.q Key: HIVE-1808 URL: https://issues.apache.org/jira/browse/HIVE-1808 Project: Hive Issue Type: Bug Reporter: Liyin Tang In this test case, there are 2 SET statements: set hive.mapjoin.localtask.max.memory.usage = 0.0001; set hive.mapjoin.check.memory.rows = 2; But in HiveConf, the names of these 2 conf variables do not match with each other.
[jira] Updated: (HIVE-1792) track the joins which are being converted to map-join automatically
[ https://issues.apache.org/jira/browse/HIVE-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liyin Tang updated HIVE-1792: - Attachment: hive-1792-1.patch Add a new hook, MapJoinCounterHook, which measures how many joins are converted into common joins and how many map joins revert back to common joins. This new post hook implements the new hook interface with HookContext. Please review. > track the joins which are being converted to map-join automatically > --- > > Key: HIVE-1792 > URL: https://issues.apache.org/jira/browse/HIVE-1792 > Project: Hive > Issue Type: New Feature > Components: Query Processor >Affects Versions: 0.7.0 >Reporter: Liyin Tang >Assignee: Liyin Tang > Fix For: 0.7.0 > > Attachments: hive-1792-1.patch > > > We should be able to track how many queries (join) got converted to > map-join
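For readers following the thread, the counting idea can be sketched roughly as below. This is only an illustration and not the code in hive-1792-1.patch: it assumes every finished join task carries a tag saying how it ran, and the JoinTaskTag values and the JoinTagCounter class are hypothetical names invented for this sketch.

```java
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

// Hypothetical task tags. The real tag names used by the patch are not shown in this thread.
enum JoinTaskTag { COMMON_JOIN, CONVERTED_MAPJOIN, BACKUP_COMMON_JOIN }

// Hypothetical helper: tallies how many finished join tasks carried each tag.
final class JoinTagCounter {
    private JoinTagCounter() {}

    static Map<JoinTaskTag, Integer> count(List<JoinTaskTag> finishedTaskTags) {
        Map<JoinTaskTag, Integer> counts = new EnumMap<>(JoinTaskTag.class);
        // Start every tag at zero so callers can report a count even when none occurred.
        for (JoinTaskTag tag : JoinTaskTag.values()) {
            counts.put(tag, 0);
        }
        for (JoinTaskTag tag : finishedTaskTags) {
            counts.merge(tag, 1, Integer::sum);
        }
        return counts;
    }
}
```

A post hook built this way would read the tags of the completed tasks from whatever context object it receives and report the resulting map.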
[jira] Updated: (HIVE-1804) Mapjoin will fail if there are no files associating with the join tables
[ https://issues.apache.org/jira/browse/HIVE-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liyin Tang updated HIVE-1804: - Attachment: hive-1804-2.patch Remove all the debug print statements. Please review > Mapjoin will fail if there are no files associating with the join tables > > > Key: HIVE-1804 > URL: https://issues.apache.org/jira/browse/HIVE-1804 > Project: Hive > Issue Type: Bug > Components: Query Processor >Affects Versions: 0.7.0 >Reporter: Liyin Tang >Assignee: Liyin Tang > Fix For: 0.7.0 > > Attachments: hive-1804-1.patch, hive-1804-2.patch > > > If there are some empty tables without any file associated, the map join will > fail. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HIVE-1785) change Pre/Post Query Hooks to take in 1 parameter: HookContext
[ https://issues.apache.org/jira/browse/HIVE-1785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liyin Tang updated HIVE-1785: - Attachment: hive-1785_6.patch Thanks for the careful review, and sorry for submitting the wrong patch earlier. This patch makes all the changes agreed in the earlier discussion and removes irrelevant code. Please review. > change Pre/Post Query Hooks to take in 1 parameter: HookContext > --- > > Key: HIVE-1785 > URL: https://issues.apache.org/jira/browse/HIVE-1785 > Project: Hive > Issue Type: Improvement > Components: Query Processor >Affects Versions: 0.7.0 >Reporter: Namit Jain >Assignee: Liyin Tang > Fix For: 0.7.0 > > Attachments: hive-1785_3.patch, hive-1785_4.patch, hive-1785_6.patch, > hive_1758_5.patch, hive_1785_1.patch, hive_1785_2.patch > > > This way, it would be possible to add new parameters to the hooks without > changing the existing hooks. > This will be a incompatible change, and all the hooks need to change to the > new API
[jira] Updated: (HIVE-1804) Mapjoin will fail if there are no files associating with the join tables
[ https://issues.apache.org/jira/browse/HIVE-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liyin Tang updated HIVE-1804: - Attachment: hive-1804-1.patch If the partition desc is empty, then create an empty hashtable file for it. > Mapjoin will fail if there are no files associating with the join tables > > > Key: HIVE-1804 > URL: https://issues.apache.org/jira/browse/HIVE-1804 > Project: Hive > Issue Type: Bug > Components: Query Processor >Affects Versions: 0.7.0 >Reporter: Liyin Tang >Assignee: Liyin Tang > Fix For: 0.7.0 > > Attachments: hive-1804-1.patch > > > If there are some empty tables without any file associated, the map join will > fail. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HIVE-1804) Mapjoin will fail if there are no files associating with the join tables
[ https://issues.apache.org/jira/browse/HIVE-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liyin Tang updated HIVE-1804: - Status: Patch Available (was: Open) If the partition desc is empty, just create an empty hashtable file. > Mapjoin will fail if there are no files associating with the join tables > > > Key: HIVE-1804 > URL: https://issues.apache.org/jira/browse/HIVE-1804 > Project: Hive > Issue Type: Bug > Components: Query Processor >Affects Versions: 0.7.0 >Reporter: Liyin Tang >Assignee: Liyin Tang > Fix For: 0.7.0 > > > If there are some empty tables without any file associated, the map join will > fail.
[jira] Updated: (HIVE-1785) change Pre/Post Query Hooks to take in 1 parameter: HookContext
[ https://issues.apache.org/jira/browse/HIVE-1785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liyin Tang updated HIVE-1785: - Status: Patch Available (was: Open) > change Pre/Post Query Hooks to take in 1 parameter: HookContext > --- > > Key: HIVE-1785 > URL: https://issues.apache.org/jira/browse/HIVE-1785 > Project: Hive > Issue Type: Improvement > Components: Query Processor >Reporter: Namit Jain >Assignee: Liyin Tang > Attachments: hive-1785_3.patch, hive-1785_4.patch, hive_1758_5.patch, > hive_1785_1.patch, hive_1785_2.patch > > > This way, it would be possible to add new parameters to the hooks without > changing the existing hooks. > This will be a incompatible change, and all the hooks need to change to the > new API -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HIVE-1797) Compressed the hashtable dump file before put into distributed cache
[ https://issues.apache.org/jira/browse/HIVE-1797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liyin Tang updated HIVE-1797: - Status: Patch Available (was: Open) > Compressed the hashtable dump file before put into distributed cache > > > Key: HIVE-1797 > URL: https://issues.apache.org/jira/browse/HIVE-1797 > Project: Hive > Issue Type: Improvement > Components: Query Processor >Affects Versions: 0.7.0 >Reporter: Liyin Tang >Assignee: Liyin Tang > Attachments: hive-1797.patch, hive-1797_3.patch > > > Clearly, the size of small table is the performance bottleneck for map join. > Because the size of the small table will affect the memory usage and dumped > hashtable file. > That means there are 2 boundaries of the map join performance. > 1)The memory usage for local task and mapred task > 2)The dumped hashtable file size for distributed cache > The reason that test case in last email spends most of the execution time on > initializing is because it hits the second boundary. > Since we have already bound the memory usage, one thing we can do is to let > the performance never hits the secondary bound before it hits the first > boundary. > Assuming the heap size is 1.6 G and the small table file size is 15M > compressed (75M uncompressed), > local task can roughly hold that 1.5M unique rows in memory. > Roughly the dumped file size will be 150M, which is too large to put into the > distributed cache. > > From experiments, we can basically conclude when the dumped file size is > smaller than 30M. > The distributed cache works well and all the mappers will be initialized in > a short time (less than 30 secs). > One easy implementation is to compress the hashtable file. > I use the gzip to compress the hashtable file and the file size is compressed > from 100M to 13M. > After several tests, all the mappers will be initialized in less than 23 secs. > But this solution adds some decompression overhead to each mapper. 
> Mappers on the same machine will do the duplicated decompression work. > Maybe in the future, we can let the distributed cache to support this. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (HIVE-1804) Mapjoin will fail if there are no files associating with the join tables
Mapjoin will fail if there are no files associating with the join tables Key: HIVE-1804 URL: https://issues.apache.org/jira/browse/HIVE-1804 Project: Hive Issue Type: Bug Components: Query Processor Affects Versions: 0.7.0 Reporter: Liyin Tang Assignee: Liyin Tang Fix For: 0.7.0 If there are some empty tables without any file associated, the map join will fail.
[jira] Updated: (HIVE-1785) change Pre/Post Query Hooks to take in 1 parameter: HookContext
[ https://issues.apache.org/jira/browse/HIVE-1785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liyin Tang updated HIVE-1785: - Attachment: hive_1758_5.patch 1) Deprecate the old interface. 2) Let the existing Prehook and Posthook implement the new interface. 3) The task tag for each task. > change Pre/Post Query Hooks to take in 1 parameter: HookContext > --- > > Key: HIVE-1785 > URL: https://issues.apache.org/jira/browse/HIVE-1785 > Project: Hive > Issue Type: Improvement > Components: Query Processor >Reporter: Namit Jain >Assignee: Liyin Tang > Attachments: hive-1785_3.patch, hive-1785_4.patch, hive_1758_5.patch, > hive_1785_1.patch, hive_1785_2.patch > > > This way, it would be possible to add new parameters to the hooks without > changing the existing hooks. > This will be a incompatible change, and all the hooks need to change to the > new API
[jira] Updated: (HIVE-1797) Compressed the hashtable dump file before put into distributed cache
[ https://issues.apache.org/jira/browse/HIVE-1797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liyin Tang updated HIVE-1797: - Attachment: hive-1797_3.patch In this patch, all the dumped hashtable files are compressed and packaged as a tar.gz file, and this tar file is put into the distributed cache. The distributed cache will decompress the file for the mapper. If multiple mappers are on the same machine, the distributed cache will decompress it only once. > Compressed the hashtable dump file before put into distributed cache > > > Key: HIVE-1797 > URL: https://issues.apache.org/jira/browse/HIVE-1797 > Project: Hive > Issue Type: Improvement > Components: Query Processor >Affects Versions: 0.7.0 >Reporter: Liyin Tang >Assignee: Liyin Tang > Attachments: hive-1797.patch, hive-1797_3.patch > > > Clearly, the size of small table is the performance bottleneck for map join. > Because the size of the small table will affect the memory usage and dumped > hashtable file. > That means there are 2 boundaries of the map join performance. > 1)The memory usage for local task and mapred task > 2)The dumped hashtable file size for distributed cache > The reason that test case in last email spends most of the execution time on > initializing is because it hits the second boundary. > Since we have already bound the memory usage, one thing we can do is to let > the performance never hits the secondary bound before it hits the first > boundary. > Assuming the heap size is 1.6 G and the small table file size is 15M > compressed (75M uncompressed), > local task can roughly hold that 1.5M unique rows in memory. > Roughly the dumped file size will be 150M, which is too large to put into the > distributed cache. > > From experiments, we can basically conclude when the dumped file size is > smaller than 30M. > The distributed cache works well and all the mappers will be initialized in > a short time (less than 30 secs). 
> One easy implementation is to compress the hashtable file. > I use the gzip to compress the hashtable file and the file size is compressed > from 100M to 13M. > After several tests, all the mappers will be initialized in less than 23 secs. > But this solution adds some decompression overhead to each mapper. > Mappers on the same machine will do the duplicated decompression work. > Maybe in the future, we can let the distributed cache to support this. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HIVE-1797) Compressed the hashtable dump file before put into distributed cache
[ https://issues.apache.org/jira/browse/HIVE-1797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liyin Tang updated HIVE-1797: - Attachment: hive-1797_2.patch In this patch, all the dumped hashtable files are compressed and packaged as a tar.gz file, and this tar file is put into the distributed cache. The distributed cache will decompress the file for the mapper. If multiple mappers are on the same machine, the distributed cache will decompress it only once. Please review. > Compressed the hashtable dump file before put into distributed cache > > > Key: HIVE-1797 > URL: https://issues.apache.org/jira/browse/HIVE-1797 > Project: Hive > Issue Type: Improvement > Components: Query Processor >Affects Versions: 0.7.0 >Reporter: Liyin Tang >Assignee: Liyin Tang > Attachments: hive-1797.patch > > > Clearly, the size of small table is the performance bottleneck for map join. > Because the size of the small table will affect the memory usage and dumped > hashtable file. > That means there are 2 boundaries of the map join performance. > 1)The memory usage for local task and mapred task > 2)The dumped hashtable file size for distributed cache > The reason that test case in last email spends most of the execution time on > initializing is because it hits the second boundary. > Since we have already bound the memory usage, one thing we can do is to let > the performance never hits the secondary bound before it hits the first > boundary. > Assuming the heap size is 1.6 G and the small table file size is 15M > compressed (75M uncompressed), > local task can roughly hold that 1.5M unique rows in memory. > Roughly the dumped file size will be 150M, which is too large to put into the > distributed cache. > > From experiments, we can basically conclude when the dumped file size is > smaller than 30M. > The distributed cache works well and all the mappers will be initialized in > a short time (less than 30 secs). 
> One easy implementation is to compress the hashtable file. > I use the gzip to compress the hashtable file and the file size is compressed > from 100M to 13M. > After several tests, all the mappers will be initialized in less than 23 secs. > But this solution adds some decompression overhead to each mapper. > Mappers on the same machine will do the duplicated decompression work. > Maybe in the future, we can let the distributed cache to support this. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HIVE-1797) Compressed the hashtable dump file before put into distributed cache
[ https://issues.apache.org/jira/browse/HIVE-1797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liyin Tang updated HIVE-1797: - Attachment: (was: hive-1797_2.patch) > Compressed the hashtable dump file before put into distributed cache > > > Key: HIVE-1797 > URL: https://issues.apache.org/jira/browse/HIVE-1797 > Project: Hive > Issue Type: Improvement > Components: Query Processor >Affects Versions: 0.7.0 >Reporter: Liyin Tang >Assignee: Liyin Tang > Attachments: hive-1797.patch > > > Clearly, the size of small table is the performance bottleneck for map join. > Because the size of the small table will affect the memory usage and dumped > hashtable file. > That means there are 2 boundaries of the map join performance. > 1)The memory usage for local task and mapred task > 2)The dumped hashtable file size for distributed cache > The reason that test case in last email spends most of the execution time on > initializing is because it hits the second boundary. > Since we have already bound the memory usage, one thing we can do is to let > the performance never hits the secondary bound before it hits the first > boundary. > Assuming the heap size is 1.6 G and the small table file size is 15M > compressed (75M uncompressed), > local task can roughly hold that 1.5M unique rows in memory. > Roughly the dumped file size will be 150M, which is too large to put into the > distributed cache. > > From experiments, we can basically conclude when the dumped file size is > smaller than 30M. > The distributed cache works well and all the mappers will be initialized in > a short time (less than 30 secs). > One easy implementation is to compress the hashtable file. > I use the gzip to compress the hashtable file and the file size is compressed > from 100M to 13M. > After several tests, all the mappers will be initialized in less than 23 secs. > But this solution adds some decompression overhead to each mapper. 
> Mappers on the same machine will do the duplicated decompression work. > Maybe in the future, we can let the distributed cache to support this. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (HIVE-1798) Clear empty files in Hive
Clear empty files in Hive -- Key: HIVE-1798 URL: https://issues.apache.org/jira/browse/HIVE-1798 Project: Hive Issue Type: Improvement Reporter: Liyin Tang Assignee: Liyin Tang There are 4 empty files in Hive right now. We should delete them from trunk. D ql/src/java/org/apache/hadoop/hive/ql/exec/JDBMDummyOperator.java D ql/src/java/org/apache/hadoop/hive/ql/exec/JDBMSinkOperator.java D ql/src/java/org/apache/hadoop/hive/ql/plan/JDBMSinkDesc.java D ql/src/java/org/apache/hadoop/hive/ql/plan/JDBMDummyDesc.java D ql/src/java/org/apache/hadoop/hive/ql/util/JoinUtil.java
[jira] Updated: (HIVE-1797) Compressed the hashtable dump file before put into distributed cache
[ https://issues.apache.org/jira/browse/HIVE-1797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liyin Tang updated HIVE-1797: - Attachment: hive-1797.patch Compress the hashtable dumped file by gzip before adding to distributed cache. > Compressed the hashtable dump file before put into distributed cache > > > Key: HIVE-1797 > URL: https://issues.apache.org/jira/browse/HIVE-1797 > Project: Hive > Issue Type: Improvement > Components: Query Processor >Affects Versions: 0.7.0 >Reporter: Liyin Tang >Assignee: Liyin Tang > Attachments: hive-1797.patch > > > Clearly, the size of small table is the performance bottleneck for map join. > Because the size of the small table will affect the memory usage and dumped > hashtable file. > That means there are 2 boundaries of the map join performance. > 1)The memory usage for local task and mapred task > 2)The dumped hashtable file size for distributed cache > The reason that test case in last email spends most of the execution time on > initializing is because it hits the second boundary. > Since we have already bound the memory usage, one thing we can do is to let > the performance never hits the secondary bound before it hits the first > boundary. > Assuming the heap size is 1.6 G and the small table file size is 15M > compressed (75M uncompressed), > local task can roughly hold that 1.5M unique rows in memory. > Roughly the dumped file size will be 150M, which is too large to put into the > distributed cache. > > From experiments, we can basically conclude when the dumped file size is > smaller than 30M. > The distributed cache works well and all the mappers will be initialized in > a short time (less than 30 secs). > One easy implementation is to compress the hashtable file. > I use the gzip to compress the hashtable file and the file size is compressed > from 100M to 13M. > After several tests, all the mappers will be initialized in less than 23 secs. 
> But this solution adds some decompression overhead to each mapper. > Mappers on the same machine will do the duplicated decompression work. > Maybe in the future, we can let the distributed cache to support this. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (HIVE-1797) Compressed the hashtable dump file before put into distributed cache
Compressed the hashtable dump file before put into distributed cache Key: HIVE-1797 URL: https://issues.apache.org/jira/browse/HIVE-1797 Project: Hive Issue Type: Improvement Components: Query Processor Affects Versions: 0.7.0 Reporter: Liyin Tang Assignee: Liyin Tang Clearly, the size of the small table is the performance bottleneck for map join, because it affects both the memory usage and the size of the dumped hashtable file. That means there are 2 boundaries on map join performance: 1) The memory usage for the local task and the mapred task 2) The dumped hashtable file size for the distributed cache The test case in the last email spends most of its execution time on initializing because it hits the second boundary. Since we have already bounded the memory usage, one thing we can do is make sure the performance never hits the second boundary before it hits the first. Assuming the heap size is 1.6G and the small table file size is 15M compressed (75M uncompressed), the local task can hold roughly 1.5M unique rows in memory. The dumped file size will be roughly 150M, which is too large to put into the distributed cache. From experiments, we can basically conclude that when the dumped file size is smaller than 30M, the distributed cache works well and all the mappers are initialized in a short time (less than 30 secs). One easy implementation is to compress the hashtable file. I use gzip to compress the hashtable file, and the file size is reduced from 100M to 13M. After several tests, all the mappers are initialized in less than 23 secs. But this solution adds some decompression overhead to each mapper: mappers on the same machine will duplicate the decompression work. Maybe in the future, we can let the distributed cache support this.
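As a rough illustration of the gzip step described above, the sketch below compresses and restores a dump held in memory using only the JDK's java.util.zip classes. It is not the code in hive-1797.patch: the class name HashTableDumpGzip is hypothetical, and the real patch works on files destined for the distributed cache rather than on byte arrays.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper for this sketch. Byte arrays stand in for the dumped hashtable file.
final class HashTableDumpGzip {
    private HashTableDumpGzip() {}

    // Gzip the dump before it would be shipped to the mappers.
    static byte[] compress(byte[] dump) {
        ByteArrayOutputStream packed = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(packed)) {
            gzip.write(dump);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The gzip stream is closed here, so the trailer has been written.
        return packed.toByteArray();
    }

    // Each mapper pays this decompression cost before it can load the hashtable.
    static byte[] decompress(byte[] packed) {
        ByteArrayOutputStream dump = new ByteArrayOutputStream();
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(packed))) {
            byte[] chunk = new byte[8192];
            int read;
            while ((read = gzip.read(chunk)) != -1) {
                dump.write(chunk, 0, read);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return dump.toByteArray();
    }
}
```

The compression ratio depends entirely on the data; the 100M to 13M figure quoted above comes from the reporter's experiment, not from this sketch.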
[jira] Updated: (HIVE-1785) change Pre/Post Query Hooks to take in 1 parameter: HookContext
[ https://issues.apache.org/jira/browse/HIVE-1785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liyin Tang updated HIVE-1785: - Attachment: hive-1785_4.patch In this patch, I add the Hook interface above the Pre/PostExecute and ExecuteWithHookContext interfaces. In the future, users need only implement ExecuteWithHookContext instead of Pre/PostExecute. It is also compatible with old hooks. > change Pre/Post Query Hooks to take in 1 parameter: HookContext > --- > > Key: HIVE-1785 > URL: https://issues.apache.org/jira/browse/HIVE-1785 > Project: Hive > Issue Type: Improvement > Components: Query Processor >Reporter: Namit Jain >Assignee: Liyin Tang > Attachments: hive-1785_3.patch, hive-1785_4.patch, hive_1785_1.patch, > hive_1785_2.patch > > > This way, it would be possible to add new parameters to the hooks without > changing the existing hooks. > This will be a incompatible change, and all the hooks need to change to the > new API
[jira] Commented: (HIVE-1785) change Pre/Post Query Hooks to take in 1 parameter: HookContext
[ https://issues.apache.org/jira/browse/HIVE-1785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12933096#action_12933096 ] Liyin Tang commented on HIVE-1785: -- How about adding one more layer over the Pre/PostExecute interfaces and calling it Hook? Both ExecuteWithHookContext and Pre/PostExecute would then implement this Hook interface. At run time, we use reflection to see whether the hook is an ExecuteWithHookContext or a Pre/PostExecute. > change Pre/Post Query Hooks to take in 1 parameter: HookContext > --- > > Key: HIVE-1785 > URL: https://issues.apache.org/jira/browse/HIVE-1785 > Project: Hive > Issue Type: Improvement > Components: Query Processor >Reporter: Namit Jain >Assignee: Liyin Tang > Attachments: hive-1785_3.patch, hive_1785_1.patch, hive_1785_2.patch > > > This way, it would be possible to add new parameters to the hooks without > changing the existing hooks. > This will be a incompatible change, and all the hooks need to change to the > new API
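The layering proposed in this comment might look roughly like the sketch below. It is an illustration under stated assumptions, not Hive's actual API: only the name Hook comes from the comment, while OldStylePreExecute, ContextAwareHook, SketchHookContext, HookDispatcher and all method signatures are hypothetical stand-ins, and a plain instanceof check is used as the run-time type test.

```java
import java.util.ArrayList;
import java.util.List;

// Common marker layer, as proposed in the comment.
interface Hook {}

// Hypothetical stand-in for the old-style hooks that take individual parameters.
interface OldStylePreExecute extends Hook {
    void run(String sessionUser);
}

// Hypothetical stand-in for the new-style hooks that take one context object.
interface ContextAwareHook extends Hook {
    void run(SketchHookContext context);
}

// Hypothetical context holder. New fields can be added here without touching hook signatures.
final class SketchHookContext {
    final String sessionUser;
    final String queryId;

    SketchHookContext(String sessionUser, String queryId) {
        this.sessionUser = sessionUser;
        this.queryId = queryId;
    }
}

final class HookDispatcher {
    private HookDispatcher() {}

    // Runs every hook through whichever interface it implements and records the path taken.
    static List<String> runAll(List<Hook> hooks, SketchHookContext context) {
        List<String> dispatched = new ArrayList<>();
        for (Hook hook : hooks) {
            if (hook instanceof ContextAwareHook) {
                ((ContextAwareHook) hook).run(context);
                dispatched.add("context");
            } else if (hook instanceof OldStylePreExecute) {
                // Old hooks keep working: unpack the fields they expect from the context.
                ((OldStylePreExecute) hook).run(context.sessionUser);
                dispatched.add("legacy");
            } else {
                throw new IllegalArgumentException("Unsupported hook type: " + hook.getClass());
            }
        }
        return dispatched;
    }
}
```

Checking the context-aware interface first means a hook that implements both interfaces is called only once, through the newer one.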
[jira] Updated: (HIVE-1785) change Pre/Post Query Hooks to take in 1 parameter: HookContext
[ https://issues.apache.org/jira/browse/HIVE-1785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liyin Tang updated HIVE-1785: - Attachment: hive-1785_3.patch In order to stay compatible, we check whether the hook implements the interface that runs with the hook context. If not, we just call the original interface. > change Pre/Post Query Hooks to take in 1 parameter: HookContext > --- > > Key: HIVE-1785 > URL: https://issues.apache.org/jira/browse/HIVE-1785 > Project: Hive > Issue Type: Improvement > Components: Query Processor >Reporter: Namit Jain >Assignee: Liyin Tang > Attachments: hive-1785_3.patch, hive_1785_1.patch, hive_1785_2.patch > > > This way, it would be possible to add new parameters to the hooks without > changing the existing hooks. > This will be a incompatible change, and all the hooks need to change to the > new API
[jira] Updated: (HIVE-1642) Convert join queries to map-join based on size of table/row
[ https://issues.apache.org/jira/browse/HIVE-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liyin Tang updated HIVE-1642: - Attachment: hive-1642_11.patch When the local task runs out of memory, do NOT print anything and just return from this process, because calling l4j to print will make it worse. Sorry for so many minor changes this afternoon. > Convert join queries to map-join based on size of table/row > --- > > Key: HIVE-1642 > URL: https://issues.apache.org/jira/browse/HIVE-1642 > Project: Hive > Issue Type: Improvement > Components: Query Processor >Reporter: Namit Jain >Assignee: Liyin Tang > Fix For: 0.7.0 > > Attachments: hive-1642_10.patch, hive-1642_11.patch, > hive-1642_5.patch, hive-1642_6.patch, hive-1642_7.patch, hive-1642_9.patch, > hive_1642_1.patch, hive_1642_2.patch, hive_1642_4.patch > > > Based on the number of rows and size of each table, Hive should automatically > be able to convert a join into map-join.
[jira] Updated: (HIVE-1642) Convert join queries to map-join based on size of table/row
[ https://issues.apache.org/jira/browse/HIVE-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liyin Tang updated HIVE-1642: - Attachment: hive-1642_10.patch After discussion, we think the function replaceWithConditionalTask is not general enough to be put in the Task class, so we moved it back to the CommonJoinResolver class. > Convert join queries to map-join based on size of table/row > --- > > Key: HIVE-1642 > URL: https://issues.apache.org/jira/browse/HIVE-1642 > Project: Hive > Issue Type: Improvement > Components: Query Processor >Reporter: Namit Jain >Assignee: Liyin Tang > Fix For: 0.7.0 > > Attachments: hive-1642_10.patch, hive-1642_5.patch, > hive-1642_6.patch, hive-1642_7.patch, hive-1642_9.patch, hive_1642_1.patch, > hive_1642_2.patch, hive_1642_4.patch > > > Based on the number of rows and size of each table, Hive should automatically > be able to convert a join into map-join. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HIVE-1642) Convert join queries to map-join based on size of table/row
[ https://issues.apache.org/jira/browse/HIVE-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liyin Tang updated HIVE-1642: - Attachment: hive-1642_9.patch some minor changes in ConditionalResolverCommonJoin.java > Convert join queries to map-join based on size of table/row > --- > > Key: HIVE-1642 > URL: https://issues.apache.org/jira/browse/HIVE-1642 > Project: Hive > Issue Type: Improvement > Components: Query Processor >Reporter: Namit Jain >Assignee: Liyin Tang > Fix For: 0.7.0 > > Attachments: hive-1642_5.patch, hive-1642_6.patch, hive-1642_7.patch, > hive-1642_9.patch, hive_1642_1.patch, hive_1642_2.patch, hive_1642_4.patch > > > Based on the number of rows and size of each table, Hive should automatically > be able to convert a join into map-join. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HIVE-1642) Convert join queries to map-join based on size of table/row
[ https://issues.apache.org/jira/browse/HIVE-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liyin Tang updated HIVE-1642: - Attachment: hive-1642_7.patch In Task.java:
  public void replaceWithConditionalTask(ConditionalTask cndTsk, PhysicalContext physicalContext) {
    // take care of parent tasks
    ...
    // take care of children tasks
    List<Task<? extends Serializable>> oldChildTasks = this.getChildTasks();
    if (oldChildTasks != null) {
      for (Task<? extends Serializable> tsk : cndTsk.getListTasks()) {
        if (tsk.equals(this)) {
          // avoid redundantly adding this task again
          continue;
        }
        for (Task<? extends Serializable> oldChild : oldChildTasks) {
          tsk.addDependentTask(oldChild);
        }
      }
    }
  }
> Convert join queries to map-join based on size of table/row > --- > > Key: HIVE-1642 > URL: https://issues.apache.org/jira/browse/HIVE-1642 > Project: Hive > Issue Type: Improvement > Components: Query Processor >Reporter: Namit Jain >Assignee: Liyin Tang > Fix For: 0.7.0 > > Attachments: hive-1642_5.patch, hive-1642_6.patch, hive-1642_7.patch, > hive_1642_1.patch, hive_1642_2.patch, hive_1642_4.patch > > > Based on the number of rows and size of each table, Hive should automatically > be able to convert a join into map-join. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HIVE-1642) Convert join queries to map-join based on size of table/row
[ https://issues.apache.org/jira/browse/HIVE-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liyin Tang updated HIVE-1642: - Attachment: hive-1642_6.patch Remove the getBackupTask interface from all the Conditional Resolver > Convert join queries to map-join based on size of table/row > --- > > Key: HIVE-1642 > URL: https://issues.apache.org/jira/browse/HIVE-1642 > Project: Hive > Issue Type: Improvement > Components: Query Processor >Reporter: Namit Jain >Assignee: Liyin Tang > Fix For: 0.7.0 > > Attachments: hive-1642_5.patch, hive-1642_6.patch, hive_1642_1.patch, > hive_1642_2.patch, hive_1642_4.patch > > > Based on the number of rows and size of each table, Hive should automatically > be able to convert a join into map-join. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HIVE-1642) Convert join queries to map-join based on size of table/row
[ https://issues.apache.org/jira/browse/HIVE-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liyin Tang updated HIVE-1642: - Attachment: (was: hive-1642_5.patch) > Convert join queries to map-join based on size of table/row > --- > > Key: HIVE-1642 > URL: https://issues.apache.org/jira/browse/HIVE-1642 > Project: Hive > Issue Type: Improvement > Components: Query Processor >Reporter: Namit Jain >Assignee: Liyin Tang > Fix For: 0.7.0 > > Attachments: hive-1642_5.patch, hive_1642_1.patch, hive_1642_2.patch, > hive_1642_4.patch > > > Based on the number of rows and size of each table, Hive should automatically > be able to convert a join into map-join. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HIVE-1642) Convert join queries to map-join based on size of table/row
[ https://issues.apache.org/jira/browse/HIVE-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liyin Tang updated HIVE-1642: - Attachment: hive-1642_5.patch Add more detailed description on configuration xml file Revert the DriverContext.java, since there should be no change on this file. > Convert join queries to map-join based on size of table/row > --- > > Key: HIVE-1642 > URL: https://issues.apache.org/jira/browse/HIVE-1642 > Project: Hive > Issue Type: Improvement > Components: Query Processor >Reporter: Namit Jain >Assignee: Liyin Tang > Fix For: 0.7.0 > > Attachments: hive-1642_5.patch, hive_1642_1.patch, hive_1642_2.patch, > hive_1642_4.patch > > > Based on the number of rows and size of each table, Hive should automatically > be able to convert a join into map-join. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HIVE-1642) Convert join queries to map-join based on size of table/row
[ https://issues.apache.org/jira/browse/HIVE-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liyin Tang updated HIVE-1642: - Attachment: hive-1642_5.patch Add more descriptions to the configuration files. Revert the DriverContext. > Convert join queries to map-join based on size of table/row > --- > > Key: HIVE-1642 > URL: https://issues.apache.org/jira/browse/HIVE-1642 > Project: Hive > Issue Type: Improvement > Components: Query Processor >Reporter: Namit Jain >Assignee: Liyin Tang > Fix For: 0.7.0 > > Attachments: hive-1642_5.patch, hive_1642_1.patch, hive_1642_2.patch, > hive_1642_4.patch > > > Based on the number of rows and size of each table, Hive should automatically > be able to convert a join into map-join. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-1785) change Pre/Post Query Hooks to take in 1 parameter: HookContext
[ https://issues.apache.org/jira/browse/HIVE-1785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12932623#action_12932623 ] Liyin Tang commented on HIVE-1785: -- I generated the diff on top of HIVE-1642. Please ignore the irrelevant code and output files. Sorry for the inconvenience. > change Pre/Post Query Hooks to take in 1 parameter: HookContext > --- > > Key: HIVE-1785 > URL: https://issues.apache.org/jira/browse/HIVE-1785 > Project: Hive > Issue Type: Improvement > Components: Query Processor >Reporter: Namit Jain >Assignee: Liyin Tang > Attachments: hive_1785_1.patch, hive_1785_2.patch > > > This way, it would be possible to add new parameters to the hooks without > changing the existing hooks. > This will be a incompatible change, and all the hooks need to change to the > new API -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HIVE-1785) change Pre/Post Query Hooks to take in 1 parameter: HookContext
[ https://issues.apache.org/jira/browse/HIVE-1785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liyin Tang updated HIVE-1785: - Attachment: hive_1785_2.patch Thanks for John's comments. Now in Driver.java:
  for (PostExecute peh : getPostExecHooks()) {
    if (peh instanceof ExecuteWithHookContext) {
      ((ExecuteWithHookContext) peh).run(hookContext);
    } else {
      peh.run(SessionState.get(), plan.getInputs(), plan.getOutputs(),
          (SessionState.get() != null ? SessionState.get().getLineageState().getLineageInfo() : null),
          ShimLoader.getHadoopShims().getUGIForConf(conf));
    }
  }
Let's discuss the interface of HookContext: how much information should we keep in it? Right now, I keep the query plan, job conf and completed tasks. > change Pre/Post Query Hooks to take in 1 parameter: HookContext > --- > > Key: HIVE-1785 > URL: https://issues.apache.org/jira/browse/HIVE-1785 > Project: Hive > Issue Type: Improvement > Components: Query Processor >Reporter: Namit Jain >Assignee: Liyin Tang > Attachments: hive_1785_1.patch, hive_1785_2.patch > > > This way, it would be possible to add new parameters to the hooks without > changing the existing hooks. > This will be a incompatible change, and all the hooks need to change to the > new API -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
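The compatibility check described in this update can be illustrated in isolation. The following is only a minimal sketch, not Hive code: `LegacyHook`, `ContextHook` and `SketchHookContext` are hypothetical stand-ins for `PostExecute`, `ExecuteWithHookContext` and `HookContext`, and the legacy signature is reduced to a single string argument.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for HookContext: one object carrying everything a hook may need. */
final class SketchHookContext {
    final String queryId;

    SketchHookContext(String queryId) {
        this.queryId = queryId;
    }
}

/** Hypothetical stand-in for the old hook API with a fixed parameter list. */
interface LegacyHook {
    void run(String queryId);
}

/** Hypothetical stand-in for the new single-parameter hook API. */
interface ContextHook {
    void run(SketchHookContext context);
}

final class HookDispatchSketch {
    /** Calls each hook through the new API if it implements it, otherwise through the old one. */
    static List<String> runHooks(List<Object> hooks, SketchHookContext context) {
        List<String> log = new ArrayList<>();
        for (Object hook : hooks) {
            if (hook instanceof ContextHook) {
                ((ContextHook) hook).run(context);
                log.add("context");
            } else if (hook instanceof LegacyHook) {
                // Old hooks keep working: unpack the context into the old arguments.
                ((LegacyHook) hook).run(context.queryId);
                log.add("legacy");
            }
        }
        return log;
    }

    public static void main(String[] args) {
        List<Object> hooks = new ArrayList<>();
        hooks.add((LegacyHook) id -> System.out.println("legacy hook saw " + id));
        hooks.add((ContextHook) ctx -> System.out.println("context hook saw " + ctx.queryId));
        System.out.println(runHooks(hooks, new SketchHookContext("q1")));
    }
}
```

With this shape, adding a field to the context object does not change the signature of any existing hook, which is the motivation stated in the issue description.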
[jira] Updated: (HIVE-1642) Convert join queries to map-join based on size of table/row
[ https://issues.apache.org/jira/browse/HIVE-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liyin Tang updated HIVE-1642: - Attachment: hive_1642_4.patch This patch formats the output of local task. > Convert join queries to map-join based on size of table/row > --- > > Key: HIVE-1642 > URL: https://issues.apache.org/jira/browse/HIVE-1642 > Project: Hive > Issue Type: Improvement > Components: Query Processor >Reporter: Namit Jain >Assignee: Liyin Tang > Fix For: 0.7.0 > > Attachments: hive_1642_1.patch, hive_1642_2.patch, hive_1642_4.patch > > > Based on the number of rows and size of each table, Hive should automatically > be able to convert a join into map-join. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HIVE-1642) Convert join queries to map-join based on size of table/row
[ https://issues.apache.org/jira/browse/HIVE-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liyin Tang updated HIVE-1642: - Attachment: hive_1642_2.patch Thanks for the comments. I have updated the patch according to the review comments. > Convert join queries to map-join based on size of table/row > --- > > Key: HIVE-1642 > URL: https://issues.apache.org/jira/browse/HIVE-1642 > Project: Hive > Issue Type: Improvement > Components: Query Processor >Reporter: Namit Jain >Assignee: Liyin Tang > Fix For: 0.7.0 > > Attachments: hive_1642_1.patch, hive_1642_2.patch > > > Based on the number of rows and size of each table, Hive should automatically > be able to convert a join into map-join. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-1642) Convert join queries to map-join based on size of table/row
[ https://issues.apache.org/jira/browse/HIVE-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12932132#action_12932132 ] Liyin Tang commented on HIVE-1642: -- There are 2 kinds of backup: 1) task level and 2) branch level. I think the approach you mentioned above is branch level: the conditional task maintains a tree, and if one branch fails, it tries another branch. I think both of them are fine right now, but branch level is more complicated to implement, because the backup may not be a single task but a tree of tasks. The design goal would be to replace one branch of tasks with another branch. I think the problem right now is that there are 2 tasks involved in MapJoin. Imagine that, 3 months ago, there was no map join local task. It would have been very easy to implement this: once the map join task fails, we replace it with the backup task. That is task-level backup. The problem is that we split the map join task into 2 tasks. But we can still logically argue that the local task is PART of the map reduce task; actually, they do come from the same task. That's why, if the failed task is the local task, we look ahead one more task. In the future, we may have more situations of this kind, splitting one task into multiple tasks. Then we may need a loop here: if this task was split from another task, keep looking ahead. Any other thoughts? > Convert join queries to map-join based on size of table/row > --- > > Key: HIVE-1642 > URL: https://issues.apache.org/jira/browse/HIVE-1642 > Project: Hive > Issue Type: Improvement > Components: Query Processor >Reporter: Namit Jain >Assignee: Liyin Tang > Fix For: 0.7.0 > > Attachments: hive_1642_1.patch > > > Based on the number of rows and size of each table, Hive should automatically > be able to convert a join into map-join. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
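The look-ahead loop suggested in this comment could look roughly like the sketch below. It is an illustration only: `SketchTask` and its fields (`child`, `backup`, `partOfChild`) are hypothetical and do not correspond to the real Hive `Task` class.

```java
/** Hypothetical, simplified task node; not the real Hive Task class. */
final class SketchTask {
    final String name;
    SketchTask child;      // next task in the chain, or null
    SketchTask backup;     // task to run if this logical task fails, or null
    boolean partOfChild;   // true if this task was split off from its child

    SketchTask(String name) {
        this.name = name;
    }
}

final class BackupLookupSketch {
    /**
     * Finds the backup for a failed task. While the failed task is only one
     * piece of a task that was split into several, keep looking ahead to the
     * task that actually owns the backup.
     */
    static SketchTask findBackup(SketchTask failed) {
        SketchTask current = failed;
        while (current.partOfChild && current.child != null) {
            current = current.child;
        }
        return current.backup;
    }

    public static void main(String[] args) {
        SketchTask localTask = new SketchTask("mapJoinLocalTask");
        SketchTask mapJoinTask = new SketchTask("mapJoinTask");
        SketchTask commonJoinTask = new SketchTask("commonJoinTask");

        // The local task was split from the map join task, which owns the backup.
        localTask.child = mapJoinTask;
        localTask.partOfChild = true;
        mapJoinTask.backup = commonJoinTask;

        System.out.println(findBackup(localTask).name);
    }
}
```

The loop generalizes the "look ahead one more task" rule: a chain of any length resolves to the same backup as long as each split piece is flagged.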
[jira] Commented: (HIVE-1642) Convert join queries to map-join based on size of table/row
[ https://issues.apache.org/jira/browse/HIVE-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12931995#action_12931995 ] Liyin Tang commented on HIVE-1642: -- Thanks for reviewing.
1. I will add these parameters to the config xml file.
2. By default hive.auto.convert.join = false right now, so the existing test cases won't be affected.
3. I am also thinking about putting the backup task into the task directly, which is the simplest way to implement this. My only concern is that de/serializing the task will take more time.
4. I will remove the print statement.
5. The same as point 3.
6. I will fix it; it was an svn synchronization problem.
7. Right now the backup task is generated at execution time. That's why it is not easy to make it work with the explain task. If we put the backup task into the task directly, we can solve this problem, and we should then set the backup task at compile time instead of execution time. The only cost is the task serialization time.
8. We need to reuse the code of MapJoinProcessor, which uses the join tree and row resolver to generate the new map join operator. So each time we generate a new map join operator, we need a deep copy of the join tree and op context, and several classes need to be Serializable.
9. I generated the output of these test cases by first setting hive.auto.convert.join = false and then resetting the flag to true, so I can compare whether the results are correct. Since the join results are correct right now, I can add explain to the test cases.
10. I will fix the conditional task to make it more generic.
> Convert join queries to map-join based on size of table/row > --- > > Key: HIVE-1642 > URL: https://issues.apache.org/jira/browse/HIVE-1642 > Project: Hive > Issue Type: Improvement > Components: Query Processor >Reporter: Namit Jain >Assignee: Liyin Tang > Fix For: 0.7.0 > > Attachments: hive_1642_1.patch > > > Based on the number of rows and size of each table, Hive should automatically > be able to convert a join into map-join. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (HIVE-1792) track the joins which are being converted to map-join automatically
track the joins which are being converted to map-join automatically --- Key: HIVE-1792 URL: https://issues.apache.org/jira/browse/HIVE-1792 Project: Hive Issue Type: New Feature Components: Query Processor Affects Versions: 0.7.0 Reporter: Liyin Tang Assignee: Liyin Tang Fix For: 0.7.0 We should be able to track how many queries (join) got converted to map-join -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HIVE-1785) change Pre/Post Query Hooks to take in 1 parameter: HookContext
[ https://issues.apache.org/jira/browse/HIVE-1785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liyin Tang updated HIVE-1785: - Attachment: hive_1785_1.patch In this patch I have changed the interface of the pre hook and post hook, so all hooks now take a HookContext as their parameter. HookContext holds the QueryPlan, the HiveConf and a list of completed tasks. It will be easy to extend HookContext in the future if hooks need more information. By the way, I generated the diff on top of HIVE-1642 (converting join into map join), assuming this patch will be committed after HIVE-1642. Please review :) > change Pre/Post Query Hooks to take in 1 parameter: HookContext > --- > > Key: HIVE-1785 > URL: https://issues.apache.org/jira/browse/HIVE-1785 > Project: Hive > Issue Type: Improvement > Components: Query Processor >Reporter: Namit Jain >Assignee: Liyin Tang > Attachments: hive_1785_1.patch > > > This way, it would be possible to add new parameters to the hooks without > changing the existing hooks. > This will be a incompatible change, and all the hooks need to change to the > new API -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (HIVE-1688) In the MapJoinOperator, the code uses tag as alias, which is not always true
[ https://issues.apache.org/jira/browse/HIVE-1688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liyin Tang resolved HIVE-1688. -- Resolution: Fixed Fix Version/s: 0.7.0 > In the MapJoinOperator, the code uses tag as alias, which is not always true > > > Key: HIVE-1688 > URL: https://issues.apache.org/jira/browse/HIVE-1688 > Project: Hive > Issue Type: Bug > Components: Drivers >Affects Versions: 0.6.0, 0.7.0 >Reporter: Liyin Tang >Assignee: Liyin Tang > Fix For: 0.7.0 > > Original Estimate: 24h > Remaining Estimate: 24h > > In the MapJoinOperator and SMBMapJoinOperator, the code uses tag as alias, > which is not always true. > Actually, alias = order[tag] -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-1688) In the MapJoinOperator, the code uses tag as alias, which is not always true
[ https://issues.apache.org/jira/browse/HIVE-1688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12931710#action_12931710 ] Liyin Tang commented on HIVE-1688: -- This bug has been fixed in Hive-1641 earlier and that patch has been committed. Thanks > In the MapJoinOperator, the code uses tag as alias, which is not always true > > > Key: HIVE-1688 > URL: https://issues.apache.org/jira/browse/HIVE-1688 > Project: Hive > Issue Type: Bug > Components: Drivers >Affects Versions: 0.6.0, 0.7.0 >Reporter: Liyin Tang >Assignee: Liyin Tang > Original Estimate: 24h > Remaining Estimate: 24h > > In the MapJoinOperator and SMBMapJoinOperator, the code uses tag as alias, > which is not always true. > Actually, alias = order[tag] -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-1642) Convert join queries to map-join based on size of table/row
[ https://issues.apache.org/jira/browse/HIVE-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12931705#action_12931705 ] Liyin Tang commented on HIVE-1642: -- In the case A left outer join B right outer join C, A must be a small table. I have a test case, auto_join25.q, to test the backup task. There are several queries in this test case. The idea is simply to set hive.hashtable.max.memory.usage = 0.1, which means that if the local task uses more than 10% of memory, it will abort. Obviously, all local tasks will always fail in this test case, so the backup task will run after the local task fails. > Convert join queries to map-join based on size of table/row > --- > > Key: HIVE-1642 > URL: https://issues.apache.org/jira/browse/HIVE-1642 > Project: Hive > Issue Type: Improvement > Components: Query Processor >Reporter: Namit Jain >Assignee: Liyin Tang > Fix For: 0.7.0 > > Attachments: hive_1642_1.patch > > > Based on the number of rows and size of each table, Hive should automatically > be able to convert a join into map-join. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
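A memory guard of the kind controlled by hive.hashtable.max.memory.usage might be sketched as follows. This is an assumption about the general technique, not the code in the patch: the class and method names are hypothetical, and only the standard `java.lang.management` API is used to read heap usage.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

/** Hypothetical helper; a sketch of a usage-fraction check, not the Hive implementation. */
final class MemoryGuardSketch {
    /**
     * Returns true when used/max is above the configured fraction,
     * e.g. 0.9 for 90% of the heap or 0.1 for 10%.
     */
    static boolean exceeds(long usedBytes, long maxBytes, double maxUsageFraction) {
        if (maxBytes <= 0) {
            // getMax() reports -1 when the maximum is undefined; never abort in that case.
            return false;
        }
        return (double) usedBytes / maxBytes > maxUsageFraction;
    }

    /** Applies the check to the current JVM heap. */
    static boolean heapExceeds(double maxUsageFraction) {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        return exceeds(heap.getUsed(), heap.getMax(), maxUsageFraction);
    }

    public static void main(String[] args) {
        // A local task would run a check like this periodically while building
        // the hash table and stop itself once the check returns true.
        System.out.println("heap above 90%: " + heapExceeds(0.9));
    }
}
```

Setting the fraction very low, as the test case does with 0.1, makes the check trip almost immediately, which forces the backup path to run.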
[jira] Updated: (HIVE-1642) Convert join queries to map-join based on size of table/row
[ https://issues.apache.org/jira/browse/HIVE-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liyin Tang updated HIVE-1642: - Attachment: hive_1642_1.patch > Convert join queries to map-join based on size of table/row > --- > > Key: HIVE-1642 > URL: https://issues.apache.org/jira/browse/HIVE-1642 > Project: Hive > Issue Type: Improvement > Components: Query Processor >Reporter: Namit Jain >Assignee: Liyin Tang > Fix For: 0.7.0 > > Attachments: hive_1642_1.patch > > > Based on the number of rows and size of each table, Hive should automatically > be able to convert a join into map-join. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-1642) Convert join queries to map-join based on size of table/row
[ https://issues.apache.org/jira/browse/HIVE-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12931599#action_12931599 ] Liyin Tang commented on HIVE-1642: -- I just finished converting common join into map join based on the file size. There are 3 flags to control this optimization.
1) set hive.auto.convert.join = true; This enables the optimization. By default this flag is disabled right now, in order not to break any existing test cases. I also added 25 additional test cases, auto_join0.q - auto_join25.q, which cover this optimization code.
2) set hive.hashtable.max.memory.usage = 0.9; This means that if the memory usage of the local task exceeds 90% of its heap size, the local task will abort by itself. The Driver will see that the local work failed and won't submit the MapJoinTask (a map-only MapRedTask) to Hadoop; instead, it will submit the original CommonJoinTask to Hadoop.
3) set hive.smalltable.filesize = 25000000L; This means that if the total file size of the small tables is less than 25M, the map join task will run. If not, the original common join task runs.
The basic flow is as follows. For each common join, create a conditional task.
1) For each join table, generate a map join task by assuming this table is the big table.
   a. The left side of a right outer join must be a small table.
   b. The right side of a left outer join must be a small table.
   c. No full outer join can be optimized.
   d. E.g. A left outer join B right outer join C: only C can be the big table.
   e. E.g. A right outer join B left outer join C: only B can be the big table.
   f. E.g. A left outer join B left outer join C: only A can be the big table.
   g. E.g. A right outer join B right outer join C: both B and C can be the big table.
2) Put all the generated map join tasks into the conditional task and set the mapping between each big table's alias and the corresponding map join task.
3) At execution time, the resolver reads the input file sizes. If the input file size of the small tables is less than a threshold, then run the converted map join task.
4) Set a backup task for each map join task. The backup task is the original common join task. This mapping is set at execution time.
5) If the map join task returns abnormally, launch the backup task.
> Convert join queries to map-join based on size of table/row > --- > > Key: HIVE-1642 > URL: https://issues.apache.org/jira/browse/HIVE-1642 > Project: Hive > Issue Type: Improvement > Components: Query Processor >Reporter: Namit Jain >Assignee: Liyin Tang > Fix For: 0.7.0 > > > Based on the number of rows and size of each table, Hive should automatically > be able to convert a join into map-join. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
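The candidate rules and the size check in the flow above can be sketched as follows. This is an illustration under stated assumptions, not the patch itself: `JoinCond`, `bigTableCandidates` and `chooseBigTable` are hypothetical names, each join is assumed to be given as explicit (left, right, type) pairs (how a multi-way join decomposes into pairs is not specified in the comment), and picking the first qualifying candidate is an arbitrary choice made for the sketch.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class BigTableSketch {
    enum JoinType { INNER, LEFT_OUTER, RIGHT_OUTER, FULL_OUTER }

    /** One pairwise join condition between two table aliases (hypothetical representation). */
    static final class JoinCond {
        final String left;
        final String right;
        final JoinType type;

        JoinCond(String left, String right, JoinType type) {
            this.left = left;
            this.right = right;
            this.type = type;
        }
    }

    /** Applies rules a-c: returns the aliases that are still allowed to be the big table. */
    static Set<String> bigTableCandidates(List<JoinCond> conds) {
        Set<String> candidates = new LinkedHashSet<>();
        for (JoinCond c : conds) {
            candidates.add(c.left);
            candidates.add(c.right);
        }
        for (JoinCond c : conds) {
            switch (c.type) {
                case FULL_OUTER:
                    return new LinkedHashSet<>();      // rule c: cannot be optimized
                case LEFT_OUTER:
                    candidates.remove(c.right);        // rule b: right side must be small
                    break;
                case RIGHT_OUTER:
                    candidates.remove(c.left);         // rule a: left side must be small
                    break;
                default:
                    break;
            }
        }
        return candidates;
    }

    /**
     * Run-time check: returns the first candidate for which all the other
     * (small) tables together are under the threshold, or null to signal
     * that the common join should run instead.
     */
    static String chooseBigTable(Set<String> candidates, Map<String, Long> sizes, long threshold) {
        long total = 0;
        for (long size : sizes.values()) {
            total += size;
        }
        for (String candidate : candidates) {
            Long size = sizes.get(candidate);
            if (size != null && total - size < threshold) {
                return candidate;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Example f, written as two pairwise conditions that share A on the left.
        List<JoinCond> conds = Arrays.asList(
            new JoinCond("A", "B", JoinType.LEFT_OUTER),
            new JoinCond("A", "C", JoinType.LEFT_OUTER));
        Map<String, Long> sizes = new LinkedHashMap<>();
        sizes.put("A", 1_000_000_000L);
        sizes.put("B", 10_000_000L);
        sizes.put("C", 5_000_000L);
        Set<String> candidates = bigTableCandidates(conds);
        System.out.println(candidates + " -> " + chooseBigTable(candidates, sizes, 25_000_000L));
    }
}
```

A null result from `chooseBigTable` corresponds to the branch where the conditional task falls back to the common join.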
[jira] Updated: (HIVE-1754) Remove JDBM component from Map Join
[ https://issues.apache.org/jira/browse/HIVE-1754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liyin Tang updated HIVE-1754: - Attachment: hive-1754_9.patch > Remove JDBM component from Map Join > --- > > Key: HIVE-1754 > URL: https://issues.apache.org/jira/browse/HIVE-1754 > Project: Hive > Issue Type: Improvement > Components: Query Processor >Affects Versions: 0.6.0, 0.7.0 >Reporter: Liyin Tang >Assignee: Liyin Tang > Fix For: 0.7.0 > > Attachments: Hive-1754.patch, Hive-1754_2.patch, Hive-1754_3.patch, > hive-1754_4.patch, hive-1754_5.patch, hive-1754_7.patch, hive-1754_9.patch > > > Right now, JDBM is the major performance bottleneck of performance. > With the growth of the small table, the PUT and GET operation will take most > of execution time. > Map Join is designed to load the data of small table into memory. > If the data is too large to hold in memory, then there is no need to use the > map join strategy. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HIVE-1754) Remove JDBM component from Map Join
[ https://issues.apache.org/jira/browse/HIVE-1754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liyin Tang updated HIVE-1754: - Attachment: hive-1754_7.patch Change the code style according to the reviewer comments > Remove JDBM component from Map Join > --- > > Key: HIVE-1754 > URL: https://issues.apache.org/jira/browse/HIVE-1754 > Project: Hive > Issue Type: Improvement > Components: Query Processor >Affects Versions: 0.6.0, 0.7.0 >Reporter: Liyin Tang >Assignee: Liyin Tang > Fix For: 0.7.0 > > Attachments: Hive-1754.patch, Hive-1754_2.patch, Hive-1754_3.patch, > hive-1754_4.patch, hive-1754_5.patch, hive-1754_7.patch > > > Right now, JDBM is the major performance bottleneck of performance. > With the growth of the small table, the PUT and GET operation will take most > of execution time. > Map Join is designed to load the data of small table into memory. > If the data is too large to hold in memory, then there is no need to use the > map join strategy. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (HIVE-1647) Incorrect initialization of thread local variable inside IOContext ( implementation is not threadsafe )
[ https://issues.apache.org/jira/browse/HIVE-1647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liyin Tang resolved HIVE-1647. -- Resolution: Fixed Release Note: This problem is fixed in Hive-1754 > Incorrect initialization of thread local variable inside IOContext ( > implementation is not threadsafe ) > > > Key: HIVE-1647 > URL: https://issues.apache.org/jira/browse/HIVE-1647 > Project: Hive > Issue Type: Bug > Components: Server Infrastructure >Affects Versions: 0.6.0, 0.7.0 >Reporter: Raman Grover >Assignee: Liyin Tang > Fix For: 0.7.0 > > Attachments: HIVE-1647.patch > > Original Estimate: 0.17h > Remaining Estimate: 0.17h > > Bug in org.apache.hadoop.hive.ql.io.IOContext > in relation to initialization of thread local variable. > > public class IOContext { > > private static ThreadLocal<IOContext> threadLocal = new > ThreadLocal<IOContext>(){ }; > > static { > if (threadLocal.get() == null) { > threadLocal.set(new IOContext()); > } > } > > In a multi-threaded environment, the thread that gets to load the class first > for the JVM (assuming threads share the classloader), > gets to initialize itself correctly by executing the code in the static > block. Once the class is loaded, > any subsequent threads would have their respective threadlocal variable as > null. Since IOContext > is set during initialization of HiveRecordReader, In a scenario where > multiple threads get to acquire > an instance of HiveRecordReader, it would result in a NPE for all but the > first thread that gets to load the class in the VM. > > Is the above scenario of multiple threads initializing HiveRecordReader a > typical one ? or we could just provide the following fix... > > private static ThreadLocal<IOContext> threadLocal = new > ThreadLocal<IOContext>(){ > protected synchronized IOContext initialValue() { > return new IOContext(); > } > }; -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
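The difference between the two initialization patterns in the report can be reproduced with a small standalone program. This is a sketch with hypothetical names (`SketchContext`, `ThreadLocalInitSketch`), not the IOContext class itself: it sets one thread-local in a static block and gives the other an `initialValue()` override, then reads both from a second thread.

```java
/** Hypothetical stand-in for IOContext. */
final class SketchContext { }

final class ThreadLocalInitSketch {
    // Pattern from the bug report: the static block runs once, in whichever
    // thread initializes the class, so only that thread ever gets a value.
    static final ThreadLocal<SketchContext> EAGER = new ThreadLocal<>();

    static {
        EAGER.set(new SketchContext());
    }

    // Proposed fix: initialValue() runs lazily in each thread on its first get().
    static final ThreadLocal<SketchContext> LAZY = new ThreadLocal<SketchContext>() {
        @Override
        protected SketchContext initialValue() {
            return new SketchContext();
        }
    };

    /** Reads both thread-locals from a fresh thread; returns {eagerIsNull, lazyIsNull}. */
    static boolean[] probeFromNewThread() {
        final boolean[] result = new boolean[2];
        Thread worker = new Thread(() -> {
            result[0] = EAGER.get() == null;
            result[1] = LAZY.get() == null;
        });
        worker.start();
        try {
            worker.join();   // join() makes the worker's writes visible to the caller
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return result;
    }

    public static void main(String[] args) {
        boolean[] result = probeFromNewThread();
        System.out.println("eager value missing in new thread: " + result[0]);
        System.out.println("lazy value missing in new thread: " + result[1]);
    }
}
```

Because `initialValue()` is invoked per thread, it needs no synchronization of its own; on Java 8 and later, `ThreadLocal.withInitial(SketchContext::new)` expresses the same thing more compactly.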
[jira] Resolved: (HIVE-1775) Assertation on inputObjInspectors.length in Groupy operator
[ https://issues.apache.org/jira/browse/HIVE-1775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liyin Tang resolved HIVE-1775. -- Resolution: Fixed Release Note: This bug is fixed in Hive-1754 > Assertation on inputObjInspectors.length in Groupy operator > --- > > Key: HIVE-1775 > URL: https://issues.apache.org/jira/browse/HIVE-1775 > Project: Hive > Issue Type: Bug > Components: Query Processor >Affects Versions: 0.6.0, 0.7.0 >Reporter: Liyin Tang >Assignee: Liyin Tang > Fix For: 0.7.0 > > > In the Groupby Operator: > Line 188: assert (inputObjInspectors.length == 1); > But this assertion may not necessary true -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HIVE-1754) Remove JDBM component from Map Join
[ https://issues.apache.org/jira/browse/HIVE-1754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liyin Tang updated HIVE-1754: - Attachment: hive-1754_5.patch This patch clears the conflicts of test output file completely. > Remove JDBM component from Map Join > --- > > Key: HIVE-1754 > URL: https://issues.apache.org/jira/browse/HIVE-1754 > Project: Hive > Issue Type: Improvement > Components: Query Processor >Affects Versions: 0.6.0, 0.7.0 >Reporter: Liyin Tang >Assignee: Liyin Tang > Fix For: 0.7.0 > > Attachments: Hive-1754.patch, Hive-1754_2.patch, Hive-1754_3.patch, > hive-1754_4.patch, hive-1754_5.patch > > > Right now, JDBM is the major performance bottleneck of performance. > With the growth of the small table, the PUT and GET operation will take most > of execution time. > Map Join is designed to load the data of small table into memory. > If the data is too large to hold in memory, then there is no need to use the > map join strategy. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HIVE-1754) Remove JDBM component from Map Join
[ https://issues.apache.org/jira/browse/HIVE-1754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liyin Tang updated HIVE-1754: - Attachment: hive-1754_4.patch Resolved all the output conflicts in this patch > Remove JDBM component from Map Join > --- > > Key: HIVE-1754 > URL: https://issues.apache.org/jira/browse/HIVE-1754 > Project: Hive > Issue Type: Improvement > Components: Query Processor >Affects Versions: 0.6.0, 0.7.0 >Reporter: Liyin Tang >Assignee: Liyin Tang > Fix For: 0.7.0 > > Attachments: Hive-1754.patch, Hive-1754_2.patch, Hive-1754_3.patch, > hive-1754_4.patch > > > Right now, JDBM is the major performance bottleneck of performance. > With the growth of the small table, the PUT and GET operation will take most > of execution time. > Map Join is designed to load the data of small table into memory. > If the data is too large to hold in memory, then there is no need to use the > map join strategy. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HIVE-1754) Remove JDBM component from Map Join
[ https://issues.apache.org/jira/browse/HIVE-1754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liyin Tang updated HIVE-1754: - Attachment: Hive-1754_3.patch Fixes a bug when the join value is null. Also fixes HIVE-1775. > Remove JDBM component from Map Join > --- > > Key: HIVE-1754 > URL: https://issues.apache.org/jira/browse/HIVE-1754 > Project: Hive > Issue Type: Improvement > Components: Query Processor > Affects Versions: 0.6.0, 0.7.0 > Reporter: Liyin Tang > Assignee: Liyin Tang > Fix For: 0.7.0 > > Attachments: Hive-1754.patch, Hive-1754_2.patch, Hive-1754_3.patch > > > Right now, JDBM is the major performance bottleneck. > As the small table grows, the PUT and GET operations take most > of the execution time. > Map Join is designed to load the data of the small table into memory. > If the data is too large to hold in memory, then there is no need to use the > map join strategy.
[jira] Commented: (HIVE-1775) Assertation on inputObjInspectors.length in Groupy operator
[ https://issues.apache.org/jira/browse/HIVE-1775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929869#action_12929869 ] Liyin Tang commented on HIVE-1775: -- Resolved in the HIVE-1754 patch. > Assertation on inputObjInspectors.length in Groupy operator > --- > > Key: HIVE-1775 > URL: https://issues.apache.org/jira/browse/HIVE-1775 > Project: Hive > Issue Type: Bug > Components: Query Processor > Affects Versions: 0.6.0, 0.7.0 > Reporter: Liyin Tang > Assignee: Liyin Tang > Fix For: 0.7.0 > > > In the Groupby Operator: > Line 188: assert (inputObjInspectors.length == 1); > But this assertion is not necessarily true.
[jira] Commented: (HIVE-1775) Assertation on inputObjInspectors.length in Groupy operator
[ https://issues.apache.org/jira/browse/HIVE-1775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929480#action_12929480 ] Liyin Tang commented on HIVE-1775: -- Yes. I can just comment out this assertion. > Assertation on inputObjInspectors.length in Groupy operator > --- > > Key: HIVE-1775 > URL: https://issues.apache.org/jira/browse/HIVE-1775 > Project: Hive > Issue Type: Bug > Components: Query Processor > Affects Versions: 0.6.0, 0.7.0 > Reporter: Liyin Tang > Assignee: Liyin Tang > Fix For: 0.7.0 > > > In the Groupby Operator: > Line 188: assert (inputObjInspectors.length == 1); > But this assertion is not necessarily true.
[jira] Created: (HIVE-1775) Assertation on inputObjInspectors.length in Groupy operator
Assertation on inputObjInspectors.length in Groupy operator --- Key: HIVE-1775 URL: https://issues.apache.org/jira/browse/HIVE-1775 Project: Hive Issue Type: Bug Components: Query Processor Affects Versions: 0.6.0, 0.7.0 Reporter: Liyin Tang Assignee: Liyin Tang Fix For: 0.7.0 In the Groupby Operator: Line 188: assert (inputObjInspectors.length == 1); But this assertion is not necessarily true.
[jira] Updated: (HIVE-1754) Remove JDBM component from Map Join
[ https://issues.apache.org/jira/browse/HIVE-1754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liyin Tang updated HIVE-1754: - Attachment: Hive-1754_2.patch Removes JDBM from Hive completely. > Remove JDBM component from Map Join > --- > > Key: HIVE-1754 > URL: https://issues.apache.org/jira/browse/HIVE-1754 > Project: Hive > Issue Type: Improvement > Components: Query Processor > Affects Versions: 0.6.0, 0.7.0 > Reporter: Liyin Tang > Assignee: Liyin Tang > Fix For: 0.7.0 > > Attachments: Hive-1754.patch, Hive-1754_2.patch > > > Right now, JDBM is the major performance bottleneck. > As the small table grows, the PUT and GET operations take most > of the execution time. > Map Join is designed to load the data of the small table into memory. > If the data is too large to hold in memory, then there is no need to use the > map join strategy.
[jira] Commented: (HIVE-1754) Remove JDBM component from Map Join
[ https://issues.apache.org/jira/browse/HIVE-1754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12928262#action_12928262 ] Liyin Tang commented on HIVE-1754: -- This patch has some potential bugs. I will fix them today and upload a new one. > Remove JDBM component from Map Join > --- > > Key: HIVE-1754 > URL: https://issues.apache.org/jira/browse/HIVE-1754 > Project: Hive > Issue Type: Improvement > Components: Query Processor > Affects Versions: 0.6.0, 0.7.0 > Reporter: Liyin Tang > Assignee: Liyin Tang > Fix For: 0.7.0 > > Attachments: Hive-1754.patch > > > Right now, JDBM is the major performance bottleneck. > As the small table grows, the PUT and GET operations take most > of the execution time. > Map Join is designed to load the data of the small table into memory. > If the data is too large to hold in memory, then there is no need to use the > map join strategy.
[jira] Updated: (HIVE-1754) Remove JDBM component from Map Join
[ https://issues.apache.org/jira/browse/HIVE-1754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liyin Tang updated HIVE-1754: - Attachment: Hive-1754.patch > Remove JDBM component from Map Join > --- > > Key: HIVE-1754 > URL: https://issues.apache.org/jira/browse/HIVE-1754 > Project: Hive > Issue Type: Improvement > Components: Query Processor > Affects Versions: 0.6.0, 0.7.0 > Reporter: Liyin Tang > Assignee: Liyin Tang > Fix For: 0.7.0 > > Attachments: Hive-1754.patch > > > Right now, JDBM is the major performance bottleneck. > As the small table grows, the PUT and GET operations take most > of the execution time. > Map Join is designed to load the data of the small table into memory. > If the data is too large to hold in memory, then there is no need to use the > map join strategy.
[jira] Updated: (HIVE-1754) Remove JDBM component from Map Join
[ https://issues.apache.org/jira/browse/HIVE-1754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liyin Tang updated HIVE-1754: - Status: Patch Available (was: Open) This patch makes the following changes: 1) Removes JDBM from Hive. 2) Stores all the data of the small table in an in-memory hashtable. 3) Creates a lightweight RowContainer: MapJoinRowContainer. 4) Optimizes MapJoinObjectKey: if there are only one or two join keys, it uses MapJoinSingleKey or MapJoinDoubleKeys instead of MapJoinObjectKey. > Remove JDBM component from Map Join > --- > > Key: HIVE-1754 > URL: https://issues.apache.org/jira/browse/HIVE-1754 > Project: Hive > Issue Type: Improvement > Components: Query Processor > Affects Versions: 0.6.0, 0.7.0 > Reporter: Liyin Tang > Assignee: Liyin Tang > Fix For: 0.7.0 > > Attachments: Hive-1754.patch > > > Right now, JDBM is the major performance bottleneck. > As the small table grows, the PUT and GET operations take most > of the execution time. > Map Join is designed to load the data of the small table into memory. > If the data is too large to hold in memory, then there is no need to use the > map join strategy.
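The key specialization described in item 4 can be pictured with a small Java sketch. This is only an illustration of the general idea (a fixed-field key for one or two join keys, a generic fallback otherwise), assuming nothing about the patch itself: the class names `JoinKeySketch`, `SingleKey`, `DoubleKeys` and the method `keyFor` are hypothetical and are not the classes from Hive-1754.patch.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical illustration only; not the code from the HIVE-1754 patch.
final class JoinKeySketch {

    // One join key: keep it in a single field instead of an Object[].
    static final class SingleKey {
        private final Object key;
        SingleKey(Object key) { this.key = key; }
        @Override public boolean equals(Object o) {
            return o instanceof SingleKey && Objects.equals(key, ((SingleKey) o).key);
        }
        @Override public int hashCode() { return Objects.hashCode(key); }
    }

    // Two join keys: two fields, again with no array allocation.
    static final class DoubleKeys {
        private final Object first;
        private final Object second;
        DoubleKeys(Object first, Object second) { this.first = first; this.second = second; }
        @Override public boolean equals(Object o) {
            if (!(o instanceof DoubleKeys)) return false;
            DoubleKeys other = (DoubleKeys) o;
            return Objects.equals(first, other.first) && Objects.equals(second, other.second);
        }
        @Override public int hashCode() {
            return 31 * Objects.hashCode(first) + Objects.hashCode(second);
        }
    }

    // Pick the cheapest key representation for the given number of join keys.
    static Object keyFor(Object... joinKeys) {
        if (joinKeys.length == 1) return new SingleKey(joinKeys[0]);
        if (joinKeys.length == 2) return new DoubleKeys(joinKeys[0], joinKeys[1]);
        // Generic fallback: a list copy with element-wise equals/hashCode.
        return Arrays.asList(joinKeys.clone());
    }

    public static void main(String[] args) {
        // In-memory hashtable for the small table, keyed by the join key.
        Map<Object, String> smallTable = new HashMap<>();
        smallTable.put(keyFor("k1"), "row for k1");
        smallTable.put(keyFor("k1", 7), "row for (k1, 7)");
        System.out.println(smallTable.get(keyFor("k1")));
        System.out.println(smallTable.get(keyFor("k1", 7)));
    }
}
```

The design choice being illustrated is that a dedicated key class avoids allocating and hashing an array when the number of join keys is known to be small; whether the actual patch does this in the same way is not stated in the message.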
[jira] Resolved: (HIVE-1702) optimize JDBM to make mapjoin faster
[ https://issues.apache.org/jira/browse/HIVE-1702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liyin Tang resolved HIVE-1702. -- Resolution: Won't Fix Release Note: JDBM will be removed from Hive > optimize JDBM to make mapjoin faster > > > Key: HIVE-1702 > URL: https://issues.apache.org/jira/browse/HIVE-1702 > Project: Hive > Issue Type: Improvement > Components: Query Processor > Reporter: Namit Jain > Assignee: Liyin Tang > > Htree.get() costs 70% of the total time. It could help a lot to have a bloom filter > here to avoid unneeded get() calls when we know for sure the given key is not in > JDBM. (We can generate the bloom filter when doing the JDBM sink, and read it > into memory when reading.) > Copied from https://issues.apache.org/jira/browse/HIVE-1700
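The bloom filter idea proposed in HIVE-1702 can be sketched in Java as follows. This is a minimal sketch under stated assumptions, not JDBM or Hive code: the class `BloomGuardSketch`, its sizing parameters, and the `guardedGet` helper are hypothetical, and a plain `Map` stands in for the expensive on-disk store. The point it shows is that a filter never reports a stored key as absent, so a negative answer lets the lookup be skipped safely.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not JDBM's or Hive's API.
final class BloomGuardSketch {
    private final BitSet bits;
    private final int size;    // number of bits in the filter
    private final int hashes;  // number of probe positions per key

    BloomGuardSketch(int size, int hashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashes = hashes;
    }

    // Double hashing: derive the i-th probe position from the key's hashCode.
    private int index(Object key, int i) {
        int h1 = key.hashCode();
        int h2 = (h1 >>> 16) | 1; // force an odd step
        return Math.floorMod(h1 + i * h2, size);
    }

    // Called while building the store (the "sink" phase in the issue text).
    void add(Object key) {
        for (int i = 0; i < hashes; i++) bits.set(index(key, i));
    }

    // false means definitely absent; true means possibly present.
    boolean mightContain(Object key) {
        for (int i = 0; i < hashes; i++) {
            if (!bits.get(index(key, i))) return false;
        }
        return true;
    }

    // Skip the expensive lookup when the filter rules the key out.
    static String guardedGet(BloomGuardSketch filter, Map<String, String> store, String key) {
        if (!filter.mightContain(key)) return null;
        return store.get(key); // stand-in for the costly get()
    }

    public static void main(String[] args) {
        Map<String, String> store = new HashMap<>();
        BloomGuardSketch filter = new BloomGuardSketch(1 << 16, 3);
        store.put("a", "1");
        filter.add("a");
        System.out.println(guardedGet(filter, store, "a"));
        System.out.println(guardedGet(filter, store, "missing"));
    }
}
```

A false positive only costs one unnecessary lookup, so correctness does not depend on the filter size; the size and hash count only affect how many lookups are avoided.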
[jira] Resolved: (HIVE-1733) Make the bucket size of JDBM configurable
[ https://issues.apache.org/jira/browse/HIVE-1733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liyin Tang resolved HIVE-1733. -- Resolution: Not A Problem Release Note: The JDBM will be removed from Hive > Make the bucket size of JDBM configurable > -- > > Key: HIVE-1733 > URL: https://issues.apache.org/jira/browse/HIVE-1733 > Project: Hive > Issue Type: Task > Components: Query Processor > Affects Versions: 0.6.0, 0.7.0 > Reporter: Liyin Tang > Assignee: Liyin Tang > > Right now the JDBM bucket size is hard-coded as 256. > To better configure and improve the performance of the JDBM component, > it is necessary to make the bucket size configurable.
[jira] Updated: (HIVE-1756) failures in fatal.q in TestNegativeCliDriver
[ https://issues.apache.org/jira/browse/HIVE-1756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liyin Tang updated HIVE-1756: - Attachment: Hive-1756.patch Removes fatal.q. > failures in fatal.q in TestNegativeCliDriver > > > Key: HIVE-1756 > URL: https://issues.apache.org/jira/browse/HIVE-1756 > Project: Hive > Issue Type: Bug > Components: Query Processor > Reporter: Namit Jain > Assignee: Liyin Tang > Attachments: Hive-1756.patch > > > This is probably caused by HIVE-1641
[jira] Updated: (HIVE-1757) test cleanup for Hive-1641
[ https://issues.apache.org/jira/browse/HIVE-1757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liyin Tang updated HIVE-1757: - Attachment: Hive-1757.patch Removes some unnecessary print statements. > test cleanup for Hive-1641 > -- > > Key: HIVE-1757 > URL: https://issues.apache.org/jira/browse/HIVE-1757 > Project: Hive > Issue Type: Improvement > Reporter: He Yongqiang > Assignee: Liyin Tang > Attachments: Hive-1757.patch
[jira] Updated: (HIVE-1641) add map joined table to distributed cache
[ https://issues.apache.org/jira/browse/HIVE-1641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liyin Tang updated HIVE-1641: - Attachment: Hive-1641(6).patch New diff that removes some print and log statements. > add map joined table to distributed cache > - > > Key: HIVE-1641 > URL: https://issues.apache.org/jira/browse/HIVE-1641 > Project: Hive > Issue Type: Improvement > Components: Query Processor > Affects Versions: 0.7.0 > Reporter: Namit Jain > Assignee: Liyin Tang > Fix For: 0.7.0 > > Attachments: Hive-1641(3).txt, Hive-1641(4).patch, > Hive-1641(5).patch, Hive-1641(6).patch, Hive-1641.patch > > > Currently, the mappers directly read the map-joined table from HDFS, which > makes it difficult to scale. > We end up getting lots of timeouts once the number of mappers is beyond a > few thousand, due to > concurrent mappers. > It would be a good idea to put the mapped file into the distributed cache and read > from there instead.
[jira] Commented: (HIVE-1756) failures in fatal.q in TestNegativeCliDriver
[ https://issues.apache.org/jira/browse/HIVE-1756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12925628#action_12925628 ] Liyin Tang commented on HIVE-1756: -- The fatal.q test is: set hive.mapjoin.maxsize=1; set hive.task.progress=true; select /*+ mapjoin(b) */ * from src a join src b on (a.key=b.key); But right now there is no max size for map join, so the MapRed task returns normally (0), and JUnit fails this test query. Shall I support the max size parameter or just skip this test case? > failures in fatal.q in TestNegativeCliDriver > > > Key: HIVE-1756 > URL: https://issues.apache.org/jira/browse/HIVE-1756 > Project: Hive > Issue Type: Bug > Components: Query Processor > Reporter: Namit Jain > Assignee: Liyin Tang > > This is probably caused by HIVE-1641
[jira] Resolved: (HIVE-1723) The result of left semi join is not correct
[ https://issues.apache.org/jira/browse/HIVE-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liyin Tang resolved HIVE-1723. -- Resolution: Fixed Release Note: This bug is resolved in Hive-1641 > The result of left semi join is not correct > --- > > Key: HIVE-1723 > URL: https://issues.apache.org/jira/browse/HIVE-1723 > Project: Hive > Issue Type: Bug > Reporter: Liyin Tang > Assignee: Liyin Tang > > In the test case semijoin.q, there is a query: > select /*+ mapjoin(b) */ a.key from t3 a left semi join t1 b on a.key = b.key > sort by a.key; > I think this query will return a wrong result if table t1 has more than > 25000 distinct keys. > To keep it simple, I tried a very similar query: > select /*+ mapjoin(b) */ a.key from test_semijoin a left semi join > test_semijoin b on a.key = b.key sort by a.key; > The table test_semijoin looks like > 0 0 > 1 1 > 2 2 > 3 3 > 4 4 > 5 5 > ...... > ... > 25000 25000 > 25001 25001 > ... > ... > 25999 25999 > 26000 26000 > So the correct result of this query should be the same > keys as in table test_semijoin itself. > Actually, the result is only part of that: only from 0 to 24544. > 0 > 1 > 2 > .. > .. > 24543 > 24544
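The expected result in the HIVE-1723 report can be checked against a small reference implementation of left semi join semantics. The Java sketch below is an independent illustration, not Hive's map join code: the class `SemiJoinCheckSketch` and its method names are hypothetical, and the data simply mirrors the keys 0 through 26000 listed in the report.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical reference check; not Hive's implementation.
final class SemiJoinCheckSketch {

    // Left semi join on key: emit each left key once if any right key matches it.
    static List<Integer> leftSemiJoin(List<Integer> left, List<Integer> right) {
        Set<Integer> rightKeys = new HashSet<>(right);
        List<Integer> out = new ArrayList<>();
        for (Integer key : left) {
            // A null key never matches under SQL equality, so skip it.
            if (key != null && rightKeys.contains(key)) out.add(key);
        }
        return out;
    }

    // Keys 0..maxKey inclusive, as in the test_semijoin table from the report.
    static List<Integer> keysUpTo(int maxKey) {
        List<Integer> keys = new ArrayList<>();
        for (int i = 0; i <= maxKey; i++) keys.add(i);
        return keys;
    }

    public static void main(String[] args) {
        List<Integer> keys = keysUpTo(26000);
        List<Integer> result = leftSemiJoin(keys, keys);
        // Joining the table with itself should return every key.
        System.out.println(result.size());                 // prints 26001
        System.out.println(result.get(result.size() - 1)); // prints 26000
    }
}
```

Under these semantics the self-join returns all 26001 keys, so a result that stops at 24544 is missing rows, which is consistent with the bug described in the report.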