[jira] [Commented] (HIVE-6900) HostUtil.getTaskLogUrl signature change causes compilation to fail

2014-04-24 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13980111#comment-13980111
 ] 

Vinod Kumar Vavilapalli commented on HIVE-6900:
---

We can fix it in 2.4.1, and Hive can depend on that release if that is the route we take.

I see you filed MAPREDUCE-5857. It is strictly a YARN issue, so I'll move it to 
the right sub-project.

> HostUtil.getTaskLogUrl signature change causes compilation to fail
> --
>
> Key: HIVE-6900
> URL: https://issues.apache.org/jira/browse/HIVE-6900
> Project: Hive
>  Issue Type: Bug
>  Components: Shims
>Affects Versions: 0.13.0, 0.14.0
>Reporter: Chris Drome
> Attachments: HIVE-6900.1.patch.txt
>
>
> The signature for HostUtil.getTaskLogUrl has changed between Hadoop-2.3 and 
> Hadoop-2.4.
> Code in 
> shims/0.23/src/main/java/org/apache/hadoop/hive/shims/Hadoop23Shims.java 
> works with the Hadoop-2.3 method and causes a compilation failure with Hadoop-2.4.
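
One possible way to keep a shim compiling against both signatures is to resolve the method by reflection. The following is only a minimal sketch under stated assumptions: OldHostUtil and NewHostUtil are hypothetical stand-ins (not the real Hadoop classes), their parameter lists and URL format are invented for illustration, and taskLogUrl is a made-up helper name. It addresses compilation only; it says nothing about whether the resulting URL is still valid.

```java
import java.lang.reflect.Method;

public class TaskLogUrlShim {

    // Hypothetical stand-in for an older API with a three-argument method.
    public static class OldHostUtil {
        public static String getTaskLogUrl(String host, String port, String attemptId) {
            return "http://" + host + ":" + port + "/tasklog?attemptid=" + attemptId;
        }
    }

    // Hypothetical stand-in for a newer API that takes an extra leading argument.
    public static class NewHostUtil {
        public static String getTaskLogUrl(String scheme, String host, String port,
                                           String attemptId) {
            return scheme + host + ":" + port + "/tasklog?attemptid=" + attemptId;
        }
    }

    // Try the four-argument form first, then fall back to the three-argument form.
    // Nothing here references either signature at compile time.
    public static String taskLogUrl(Class<?> hostUtil, String host, String port,
                                    String attemptId) {
        try {
            try {
                Method m = hostUtil.getMethod("getTaskLogUrl",
                        String.class, String.class, String.class, String.class);
                return (String) m.invoke(null, "http://", host, port, attemptId);
            } catch (NoSuchMethodException e) {
                Method m = hostUtil.getMethod("getTaskLogUrl",
                        String.class, String.class, String.class);
                return (String) m.invoke(null, host, port, attemptId);
            }
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("No usable getTaskLogUrl method found", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(taskLogUrl(OldHostUtil.class, "nm1", "8042", "attempt_1"));
        System.out.println(taskLogUrl(NewHostUtil.class, "nm1", "8042", "attempt_1"));
    }
}
```

The trade-off of this approach is that signature mismatches surface at run time rather than at compile time, which is why restoring the previous API upstream is the alternative discussed in this thread.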



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HIVE-6900) HostUtil.getTaskLogUrl signature change causes compilation to fail

2014-04-23 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13979014#comment-13979014
 ] 

Vinod Kumar Vavilapalli commented on HIVE-6900:
---

I looked at the issue together with [~jdere]. I haven't reviewed the patch, but 
overall it should let the compilation pass. The eventual link is used elsewhere 
in Hive to pull the logs and do some processing. The link used in the patch 
will still not work, as the URLs changed completely.

We can do this in two halves:
 - Fix compilation for now.
 - Then follow up in YARN with the right API that can expose logs to users, 
and change Hive to use that.

For the compilation fix, we can put back the previous API in YARN via 
MAPREDUCE-5830 or we can do the fix as done here in Hive.

Thoughts?

> HostUtil.getTaskLogUrl signature change causes compilation to fail
> --
>
> Key: HIVE-6900
> URL: https://issues.apache.org/jira/browse/HIVE-6900
> Project: Hive
>  Issue Type: Bug
>  Components: Shims
>Affects Versions: 0.13.0, 0.14.0
>Reporter: Chris Drome
> Attachments: HIVE-6900.1.patch.txt
>
>
> The signature for HostUtil.getTaskLogUrl has changed between Hadoop-2.3 and 
> Hadoop-2.4.
> Code in 
> shims/0.23/src/main/java/org/apache/hadoop/hive/shims/Hadoop23Shims.java 
> works with the Hadoop-2.3 method and causes a compilation failure with Hadoop-2.4.





[jira] [Commented] (HIVE-5317) Implement insert, update, and delete in Hive with full ACID support

2014-03-04 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-5317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1392#comment-1392
 ] 

Vinod Kumar Vavilapalli commented on HIVE-5317:
---

bq. MAPREDUCE-279, at 109, currently outscores us. There may be others, but it 
would be cool to have more watchers than Yarn.
Hehe, looks like we have a race. I'll go ask some of us YARN folks who are also 
watching this JIRA to stop watching this one :D

> Implement insert, update, and delete in Hive with full ACID support
> ---
>
> Key: HIVE-5317
> URL: https://issues.apache.org/jira/browse/HIVE-5317
> Project: Hive
>  Issue Type: New Feature
>Reporter: Owen O'Malley
>Assignee: Owen O'Malley
> Attachments: InsertUpdatesinHive.pdf
>
>
> Many customers want to be able to insert, update and delete rows from Hive 
> tables with full ACID support. The use cases are varied, but the form of the 
> queries that should be supported are:
> * INSERT INTO tbl SELECT …
> * INSERT INTO tbl VALUES ...
> * UPDATE tbl SET … WHERE …
> * DELETE FROM tbl WHERE …
> * MERGE INTO tbl USING src ON … WHEN MATCHED THEN ... WHEN NOT MATCHED THEN 
> ...
> * SET TRANSACTION LEVEL …
> * BEGIN/END TRANSACTION
> Use Cases
> * Once an hour, a set of inserts and updates (up to 500k rows) for various 
> dimension tables (eg. customer, inventory, stores) needs to be processed. The 
> dimension tables have primary keys and are typically bucketed and sorted on 
> those keys.
> * Once a day a small set (up to 100k rows) of records needs to be deleted for 
> regulatory compliance.
> * Once an hour a log of transactions is exported from an RDBMS and the fact 
> tables need to be updated (up to 1m rows) to reflect the new data. The 
> transactions are a combination of inserts, updates, and deletes. The table is 
> partitioned and bucketed.





[jira] [Commented] (HIVE-6098) Merge Tez branch into trunk

2013-12-23 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13855980#comment-13855980
 ] 

Vinod Kumar Vavilapalli commented on HIVE-6098:
---

If it is not too late, how about changing _hive.optimize.tez_ to be 
_hive.execution-engine_, taking values _(MapReduce, Tez, etc.)_ or something like 
that?
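
To make the suggestion concrete, here is a rough sketch of how a single engine-selection property could replace the boolean toggle. Everything in it is an assumption for illustration: the ExecutionEngineConf class, the Engine enum, and the fallback to the old boolean are hypothetical, and the property name simply mirrors the one proposed above rather than any setting Hive actually ships.

```java
import java.util.Locale;
import java.util.Properties;

public class ExecutionEngineConf {

    // Hypothetical set of engines; more values could be added later.
    public enum Engine { MAPREDUCE, TEZ }

    // Resolve the engine from the proposed property, falling back to the
    // existing boolean toggle when the new property is not set.
    public static Engine resolve(Properties conf) {
        String value = conf.getProperty("hive.execution-engine");
        if (value != null) {
            // Throws IllegalArgumentException for unknown engine names.
            return Engine.valueOf(value.trim().toUpperCase(Locale.ROOT));
        }
        boolean tez = Boolean.parseBoolean(conf.getProperty("hive.optimize.tez", "false"));
        return tez ? Engine.TEZ : Engine.MAPREDUCE;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty("hive.execution-engine", "tez");
        System.out.println(resolve(conf));
    }
}
```

An enumerated value leaves room for further engines without adding one boolean per engine, which seems to be the point of the proposal.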

> Merge Tez branch into trunk
> ---
>
> Key: HIVE-6098
> URL: https://issues.apache.org/jira/browse/HIVE-6098
> Project: Hive
>  Issue Type: New Feature
>Affects Versions: 0.12.0
>Reporter: Gunther Hagleitner
>Assignee: Gunther Hagleitner
> Attachments: HIVE-6098.1.patch, HIVE-6098.2.patch, HIVE-6098.3.patch, 
> hive-on-tez-conf.txt
>
>
> I think the Tez branch is at a point where we can consider merging it back 
> into trunk after review. 
> Tez itself has had its first release, most hive features are available on Tez 
> and the test coverage is decent. There are a few known limitations, all of 
> which can be handled in trunk as far as I can tell (i.e.: None of them are 
> large disruptive changes that still require a branch.)
> Limitations:
> - Union all is not yet supported on Tez
> - SMB is not yet supported on Tez
> - Bucketed map-join is executed as broadcast join (bucketing is ignored)
> Since the user is free to toggle hive.optimize.tez, it's obviously possible 
> to just run these on MR.
> I am hoping to follow the approach that was taken with vectorization and 
> shoot for a merge instead of single commit. This would retain history of the 
> branch. Also in vectorization we required at least three +1s before merge, 
> I'm hoping to go with that as well.
> I will add a combined patch to this ticket for review purposes (not for 
> commit). I'll also attach instructions to run on a cluster if anyone wants to 
> try.





[jira] [Commented] (HIVE-4801) hive.mapred.map.tasks.speculative.execution is not used to configure Hadoop jobs

2013-07-08 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13702697#comment-13702697
 ] 

Vinod Kumar Vavilapalli commented on HIVE-4801:
---

Or just deprecate hive.mapred.reduce.tasks.speculative.execution.

> hive.mapred.map.tasks.speculative.execution is not used to configure Hadoop 
> jobs
> 
>
> Key: HIVE-4801
> URL: https://issues.apache.org/jira/browse/HIVE-4801
> Project: Hive
>  Issue Type: Bug
>  Components: Configuration
>Affects Versions: 0.10.0
>Reporter: Chu Tong
>Assignee: Chu Tong
> Attachments: HIVE-4801.patch
>
>
> Hive does not honor the hive.mapred.map.tasks.speculative.execution parameter 
> when it comes to configuring Hadoop jobs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-4160) Vectorized Query Execution in Hive

2013-07-04 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13699820#comment-13699820
 ] 

Vinod Kumar Vavilapalli commented on HIVE-4160:
---

A huge +1 to that. Having a common set of operators will be a huge win. That 
said, I already see that the current branch follows Hive's operator base 
classes, uses HiveConf etc. I believe that, with a little effort, this can be 
cleaned up and pulled apart into a separate Maven module that everyone can use.

Some points to think about:
 - The target location of the module. The dependency graph can become 
unwieldy.
 - Given the use of the base Operator, OperatorDesc etc. from Hive, if there is 
any interest and commitment, we should do this ASAP while we only have a 
handful of operators.
 - Have one other project demonstrate how it can be reused across ecosystem 
projects. Pig would be great; just a few operators would be a great start.

Thoughts?

> Vectorized Query Execution in Hive
> --
>
> Key: HIVE-4160
> URL: https://issues.apache.org/jira/browse/HIVE-4160
> Project: Hive
>  Issue Type: New Feature
>Reporter: Jitendra Nath Pandey
>Assignee: Jitendra Nath Pandey
> Attachments: Hive-Vectorized-Query-Execution-Design.docx, 
> Hive-Vectorized-Query-Execution-Design-rev2.docx, 
> Hive-Vectorized-Query-Execution-Design-rev3.docx, 
> Hive-Vectorized-Query-Execution-Design-rev3.docx, 
> Hive-Vectorized-Query-Execution-Design-rev3.pdf, 
> Hive-Vectorized-Query-Execution-Design-rev4.docx, 
> Hive-Vectorized-Query-Execution-Design-rev4.pdf, 
> Hive-Vectorized-Query-Execution-Design-rev5.docx, 
> Hive-Vectorized-Query-Execution-Design-rev5.pdf, 
> Hive-Vectorized-Query-Execution-Design-rev6.docx, 
> Hive-Vectorized-Query-Execution-Design-rev6.pdf, 
> Hive-Vectorized-Query-Execution-Design-rev7.docx, 
> Hive-Vectorized-Query-Execution-Design-rev8.docx, 
> Hive-Vectorized-Query-Execution-Design-rev8.pdf, 
> Hive-Vectorized-Query-Execution-Design-rev9.docx, 
> Hive-Vectorized-Query-Execution-Design-rev9.pdf
>
>
> The Hive query execution engine currently processes one row at a time. A 
> single row of data goes through all the operators before the next row can be 
> processed. This mode of processing is very inefficient in terms of CPU usage. 
> Research has demonstrated that this yields very low instructions per cycle 
> [MonetDB X100]. Also currently Hive heavily relies on lazy deserialization 
> and data columns go through a layer of object inspectors that identify column 
> type, deserialize data and determine appropriate expression routines in the 
> inner loop. These layers of virtual method calls further slow down the 
> processing. 
> This work will add support for vectorized query execution to Hive, where, 
> instead of individual rows, batches of about a thousand rows at a time are 
> processed. Each column in the batch is represented as a vector of a primitive 
> data type. The inner loop of execution scans these vectors very fast, 
> avoiding method calls, deserialization, unnecessary if-then-else, etc. This 
> substantially reduces CPU time used, and gives excellent instructions per 
> cycle (i.e. improved processor pipeline utilization). See the attached design 
> specification for more details.
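
As a rough illustration of the batch-at-a-time idea described above, the sketch below adds two columns of a batch in a tight loop over primitive arrays. It is a simplified sketch, not Hive's implementation: VectorizedAddSketch, LongBatch, and addColumns are hypothetical names, and the batch size of 1024 is only an assumption consistent with "about a thousand rows".

```java
public class VectorizedAddSketch {

    // Assumed batch size, in line with "about a thousand rows at a time".
    public static final int BATCH_SIZE = 1024;

    // A batch holds one primitive array per column plus the number of valid rows.
    public static final class LongBatch {
        public final long[][] columns;
        public int size;

        public LongBatch(int numColumns) {
            this.columns = new long[numColumns][BATCH_SIZE];
        }
    }

    // columns[out] = columns[left] + columns[right] for every valid row.
    // The inner loop touches only primitive arrays: no per-row method calls,
    // no object inspection, no deserialization.
    public static void addColumns(LongBatch batch, int left, int right, int out) {
        long[] a = batch.columns[left];
        long[] b = batch.columns[right];
        long[] result = batch.columns[out];
        for (int i = 0; i < batch.size; i++) {
            result[i] = a[i] + b[i];
        }
    }

    public static void main(String[] args) {
        LongBatch batch = new LongBatch(3);
        batch.size = 4;
        for (int i = 0; i < batch.size; i++) {
            batch.columns[0][i] = i;
            batch.columns[1][i] = 10L * i;
        }
        addColumns(batch, 0, 1, 2);
        System.out.println(batch.columns[2][3]);
    }
}
```

In a row-at-a-time engine the same addition would be dispatched once per row through the operator tree; here the dispatch cost is paid once per batch.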



[jira] [Updated] (HIVE-3952) merge map-job followed by map-reduce job

2013-04-29 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-3952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HIVE-3952:
--

Attachment: HIVE-3952-20130428-branch-0.11-bugfix.txt

Patch with only the bug fix. The previously failing tests pass now.

> merge map-job followed by map-reduce job
> 
>
> Key: HIVE-3952
> URL: https://issues.apache.org/jira/browse/HIVE-3952
> Project: Hive
>  Issue Type: Improvement
>  Components: Query Processor
>Reporter: Namit Jain
>Assignee: Vinod Kumar Vavilapalli
> Fix For: 0.11.0
>
> Attachments: hive.3952.1.patch, HIVE-3952-20130226.txt, 
> HIVE-3952-20130227.1.txt, HIVE-3952-20130301.txt, HIVE-3952-20130421.txt, 
> HIVE-3952-20130424.txt, HIVE-3952-20130428-branch-0.11-bugfix.txt, 
> HIVE-3952-20130428-branch-0.11.txt, HIVE-3952-20130428-branch-0.11-v2.txt
>
>
> Consider the query like:
> select count(*) FROM
> ( select idOne, idTwo, value FROM
>   bigTable   
>   JOIN
> 
>   smallTableOne on (bigTable.idOne = smallTableOne.idOne) 
>   
>   ) firstjoin 
> 
> JOIN  
> 
> smallTableTwo on (firstjoin.idTwo = smallTableTwo.idTwo);
> where smallTableOne and smallTableTwo are smaller than 
> hive.auto.convert.join.noconditionaltask.size and
> hive.auto.convert.join.noconditionaltask is set to true.
> The joins are collapsed into mapjoins, and it leads to a map-only job
> (for the map-joins) followed by a map-reduce job (for the group by).
> Ideally, the map-only job should be merged with the following map-reduce job.



[jira] [Updated] (HIVE-3952) merge map-job followed by map-reduce job

2013-04-28 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-3952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HIVE-3952:
--

Attachment: HIVE-3952-20130428-branch-0.11-v2.txt

That was a case of a bad merge; here's the correct one. All the previously 
failing tests pass for me now.

> merge map-job followed by map-reduce job
> 
>
> Key: HIVE-3952
> URL: https://issues.apache.org/jira/browse/HIVE-3952
> Project: Hive
>  Issue Type: Improvement
>  Components: Query Processor
>Reporter: Namit Jain
>Assignee: Vinod Kumar Vavilapalli
> Fix For: 0.11.0
>
> Attachments: hive.3952.1.patch, HIVE-3952-20130226.txt, 
> HIVE-3952-20130227.1.txt, HIVE-3952-20130301.txt, HIVE-3952-20130421.txt, 
> HIVE-3952-20130424.txt, HIVE-3952-20130428-branch-0.11.txt, 
> HIVE-3952-20130428-branch-0.11-v2.txt
>
>
> Consider the query like:
> select count(*) FROM
> ( select idOne, idTwo, value FROM
>   bigTable   
>   JOIN
> 
>   smallTableOne on (bigTable.idOne = smallTableOne.idOne) 
>   
>   ) firstjoin 
> 
> JOIN  
> 
> smallTableTwo on (firstjoin.idTwo = smallTableTwo.idTwo);
> where smallTableOne and smallTableTwo are smaller than 
> hive.auto.convert.join.noconditionaltask.size and
> hive.auto.convert.join.noconditionaltask is set to true.
> The joins are collapsed into mapjoins, and it leads to a map-only job
> (for the map-joins) followed by a map-reduce job (for the group by).
> Ideally, the map-only job should be merged with the following map-reduce job.



[jira] [Commented] (HIVE-3952) merge map-job followed by map-reduce job

2013-04-28 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-3952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13644239#comment-13644239
 ] 

Vinod Kumar Vavilapalli commented on HIVE-3952:
---

Sure, looking..

> merge map-job followed by map-reduce job
> 
>
> Key: HIVE-3952
> URL: https://issues.apache.org/jira/browse/HIVE-3952
> Project: Hive
>  Issue Type: Improvement
>  Components: Query Processor
>Reporter: Namit Jain
>Assignee: Vinod Kumar Vavilapalli
> Fix For: 0.11.0
>
> Attachments: hive.3952.1.patch, HIVE-3952-20130226.txt, 
> HIVE-3952-20130227.1.txt, HIVE-3952-20130301.txt, HIVE-3952-20130421.txt, 
> HIVE-3952-20130424.txt, HIVE-3952-20130428-branch-0.11.txt
>
>
> Consider the query like:
> select count(*) FROM
> ( select idOne, idTwo, value FROM
>   bigTable   
>   JOIN
> 
>   smallTableOne on (bigTable.idOne = smallTableOne.idOne) 
>   
>   ) firstjoin 
> 
> JOIN  
> 
> smallTableTwo on (firstjoin.idTwo = smallTableTwo.idTwo);
> where smallTableOne and smallTableTwo are smaller than 
> hive.auto.convert.join.noconditionaltask.size and
> hive.auto.convert.join.noconditionaltask is set to true.
> The joins are collapsed into mapjoins, and it leads to a map-only job
> (for the map-joins) followed by a map-reduce job (for the group by).
> Ideally, the map-only job should be merged with the following map-reduce job.



[jira] [Updated] (HIVE-3952) merge map-job followed by map-reduce job

2013-04-28 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-3952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HIVE-3952:
--

Attachment: HIVE-3952-20130428-branch-0.11.txt

Patch against branch-0.11, in case someone is interested.

> merge map-job followed by map-reduce job
> 
>
> Key: HIVE-3952
> URL: https://issues.apache.org/jira/browse/HIVE-3952
> Project: Hive
>  Issue Type: Improvement
>  Components: Query Processor
>Reporter: Namit Jain
>Assignee: Vinod Kumar Vavilapalli
> Fix For: 0.12.0
>
> Attachments: hive.3952.1.patch, HIVE-3952-20130226.txt, 
> HIVE-3952-20130227.1.txt, HIVE-3952-20130301.txt, HIVE-3952-20130421.txt, 
> HIVE-3952-20130424.txt, HIVE-3952-20130428-branch-0.11.txt
>
>
> Consider the query like:
> select count(*) FROM
> ( select idOne, idTwo, value FROM
>   bigTable   
>   JOIN
> 
>   smallTableOne on (bigTable.idOne = smallTableOne.idOne) 
>   
>   ) firstjoin 
> 
> JOIN  
> 
> smallTableTwo on (firstjoin.idTwo = smallTableTwo.idTwo);
> where smallTableOne and smallTableTwo are smaller than 
> hive.auto.convert.join.noconditionaltask.size and
> hive.auto.convert.join.noconditionaltask is set to true.
> The joins are collapsed into mapjoins, and it leads to a map-only job
> (for the map-joins) followed by a map-reduce job (for the group by).
> Ideally, the map-only job should be merged with the following map-reduce job.



[jira] [Commented] (HIVE-3952) merge map-job followed by map-reduce job

2013-04-24 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-3952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13641463#comment-13641463
 ] 

Vinod Kumar Vavilapalli commented on HIVE-3952:
---

Thanks for the patch update, Namit!

Thanks also to Ashutosh, and to Namit again, for all the reviews!

> merge map-job followed by map-reduce job
> 
>
> Key: HIVE-3952
> URL: https://issues.apache.org/jira/browse/HIVE-3952
> Project: Hive
>  Issue Type: Improvement
>  Components: Query Processor
>Reporter: Namit Jain
>Assignee: Vinod Kumar Vavilapalli
> Attachments: hive.3952.1.patch, HIVE-3952-20130226.txt, 
> HIVE-3952-20130227.1.txt, HIVE-3952-20130301.txt, HIVE-3952-20130421.txt, 
> HIVE-3952-20130424.txt
>
>
> Consider the query like:
> select count(*) FROM
> ( select idOne, idTwo, value FROM
>   bigTable   
>   JOIN
> 
>   smallTableOne on (bigTable.idOne = smallTableOne.idOne) 
>   
>   ) firstjoin 
> 
> JOIN  
> 
> smallTableTwo on (firstjoin.idTwo = smallTableTwo.idTwo);
> where smallTableOne and smallTableTwo are smaller than 
> hive.auto.convert.join.noconditionaltask.size and
> hive.auto.convert.join.noconditionaltask is set to true.
> The joins are collapsed into mapjoins, and it leads to a map-only job
> (for the map-joins) followed by a map-reduce job (for the group by).
> Ideally, the map-only job should be merged with the following map-reduce job.



[jira] [Updated] (HIVE-3952) merge map-job followed by map-reduce job

2013-04-24 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-3952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HIVE-3952:
--

Status: Patch Available  (was: Open)

> merge map-job followed by map-reduce job
> 
>
> Key: HIVE-3952
> URL: https://issues.apache.org/jira/browse/HIVE-3952
> Project: Hive
>  Issue Type: Improvement
>  Components: Query Processor
>Reporter: Namit Jain
>Assignee: Vinod Kumar Vavilapalli
> Attachments: HIVE-3952-20130226.txt, HIVE-3952-20130227.1.txt, 
> HIVE-3952-20130301.txt, HIVE-3952-20130421.txt, HIVE-3952-20130424.txt
>
>
> Consider the query like:
> select count(*) FROM
> ( select idOne, idTwo, value FROM
>   bigTable   
>   JOIN
> 
>   smallTableOne on (bigTable.idOne = smallTableOne.idOne) 
>   
>   ) firstjoin 
> 
> JOIN  
> 
> smallTableTwo on (firstjoin.idTwo = smallTableTwo.idTwo);
> where smallTableOne and smallTableTwo are smaller than 
> hive.auto.convert.join.noconditionaltask.size and
> hive.auto.convert.join.noconditionaltask is set to true.
> The joins are collapsed into mapjoins, and it leads to a map-only job
> (for the map-joins) followed by a map-reduce job (for the group by).
> Ideally, the map-only job should be merged with the following map-reduce job.



[jira] [Updated] (HIVE-3952) merge map-job followed by map-reduce job

2013-04-24 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-3952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HIVE-3952:
--

Attachment: HIVE-3952-20130424.txt

Sigh, the patch is broken again. Updating it.

Also addressed the review comments on the review board. Added one more test for 
validating this.

> merge map-job followed by map-reduce job
> 
>
> Key: HIVE-3952
> URL: https://issues.apache.org/jira/browse/HIVE-3952
> Project: Hive
>  Issue Type: Improvement
>  Components: Query Processor
>Reporter: Namit Jain
>Assignee: Vinod Kumar Vavilapalli
> Attachments: HIVE-3952-20130226.txt, HIVE-3952-20130227.1.txt, 
> HIVE-3952-20130301.txt, HIVE-3952-20130421.txt, HIVE-3952-20130424.txt
>
>
> Consider the query like:
> select count(*) FROM
> ( select idOne, idTwo, value FROM
>   bigTable   
>   JOIN
> 
>   smallTableOne on (bigTable.idOne = smallTableOne.idOne) 
>   
>   ) firstjoin 
> 
> JOIN  
> 
> smallTableTwo on (firstjoin.idTwo = smallTableTwo.idTwo);
> where smallTableOne and smallTableTwo are smaller than 
> hive.auto.convert.join.noconditionaltask.size and
> hive.auto.convert.join.noconditionaltask is set to true.
> The joins are collapsed into mapjoins, and it leads to a map-only job
> (for the map-joins) followed by a map-reduce job (for the group by).
> Ideally, the map-only job should be merged with the following map-reduce job.



[jira] [Updated] (HIVE-3952) merge map-job followed by map-reduce job

2013-04-21 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-3952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HIVE-3952:
--

Attachment: HIVE-3952-20130421.txt

Thanks for the info Ashutosh.

Attaching an updated patch against the latest trunk. It also fixes the offending 
test-related issues. The latest patch is also on the review board. Thanks.

> merge map-job followed by map-reduce job
> 
>
> Key: HIVE-3952
> URL: https://issues.apache.org/jira/browse/HIVE-3952
> Project: Hive
>  Issue Type: Improvement
>  Components: Query Processor
>Reporter: Namit Jain
>Assignee: Vinod Kumar Vavilapalli
> Attachments: HIVE-3952-20130226.txt, HIVE-3952-20130227.1.txt, 
> HIVE-3952-20130301.txt, HIVE-3952-20130421.txt
>
>
> Consider the query like:
> select count(*) FROM
> ( select idOne, idTwo, value FROM
>   bigTable   
>   JOIN
> 
>   smallTableOne on (bigTable.idOne = smallTableOne.idOne) 
>   
>   ) firstjoin 
> 
> JOIN  
> 
> smallTableTwo on (firstjoin.idTwo = smallTableTwo.idTwo);
> where smallTableOne and smallTableTwo are smaller than 
> hive.auto.convert.join.noconditionaltask.size and
> hive.auto.convert.join.noconditionaltask is set to true.
> The joins are collapsed into mapjoins, and it leads to a map-only job
> (for the map-joins) followed by a map-reduce job (for the group by).
> Ideally, the map-only job should be merged with the following map-reduce job.



[jira] [Updated] (HIVE-3952) merge map-job followed by map-reduce job

2013-04-21 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-3952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HIVE-3952:
--

Status: Patch Available  (was: Open)

> merge map-job followed by map-reduce job
> 
>
> Key: HIVE-3952
> URL: https://issues.apache.org/jira/browse/HIVE-3952
> Project: Hive
>  Issue Type: Improvement
>  Components: Query Processor
>Reporter: Namit Jain
>Assignee: Vinod Kumar Vavilapalli
> Attachments: HIVE-3952-20130226.txt, HIVE-3952-20130227.1.txt, 
> HIVE-3952-20130301.txt, HIVE-3952-20130421.txt
>
>
> Consider the query like:
> select count(*) FROM
> ( select idOne, idTwo, value FROM
>   bigTable   
>   JOIN
> 
>   smallTableOne on (bigTable.idOne = smallTableOne.idOne) 
>   
>   ) firstjoin 
> 
> JOIN  
> 
> smallTableTwo on (firstjoin.idTwo = smallTableTwo.idTwo);
> where smallTableOne and smallTableTwo are smaller than 
> hive.auto.convert.join.noconditionaltask.size and
> hive.auto.convert.join.noconditionaltask is set to true.
> The joins are collapsed into mapjoins, and it leads to a map-only job
> (for the map-joins) followed by a map-reduce job (for the group by).
> Ideally, the map-only job should be merged with the following map-reduce job.



[jira] [Reopened] (HIVE-4313) Build fails with OOM in mvn-init stage

2013-04-15 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli reopened HIVE-4313:
---


I am also running into this. Shouldn't this be 'fixed' somehow? Either address 
why the memory usage suddenly increased or just change some build settings, perhaps?

Reopening this anyway.

> Build fails with OOM in mvn-init stage
> --
>
> Key: HIVE-4313
> URL: https://issues.apache.org/jira/browse/HIVE-4313
> Project: Hive
>  Issue Type: Wish
>  Components: Build Infrastructure
> Environment: ubuntu 10.10, 32bit
>Reporter: Navis
>Priority: Minor
>
> Recently the Hive build has frequently been failing with OOM, with exceptions like:
> {noformat}
> mvn-init:
>  [echo] hcatalog-server-extensions
>   [get] Destination already exists (skipping): 
> /home/navis/apache/oss-hive/hcatalog/build/maven-ant-tasks-2.1.3.jar
> Caught an exception while logging the end of the build.  Exception was:
> java.lang.OutOfMemoryError: PermGen space
> java.lang.OutOfMemoryError: PermGen space
>   at java.lang.Throwable.getStackTraceElement(Native Method)
>   at java.lang.Throwable.getOurStackTrace(Throwable.java:591)
>   at java.lang.Throwable.printStackTrace(Throwable.java:462)
>   at java.lang.Throwable.printStackTrace(Throwable.java:451)
>   at org.apache.tools.ant.Main.runBuild(Main.java:828)
>   at org.apache.tools.ant.Main.startAnt(Main.java:218)
>   at org.apache.tools.ant.launch.Launcher.run(Launcher.java:280)
>   at org.apache.tools.ant.launch.Launcher.main(Launcher.java:109)
> PermGen space
> {noformat}
> or
> {noformat}
> mvn-init:
>  [echo] hcatalog-server-extensions
>   [get] Destination already exists (skipping): 
> /home/navis/apache/oss-hive/hcatalog/build/maven-ant-tasks-2.1.3.jar
> java.lang.OutOfMemoryError: PermGen space
>   at org.apache.tools.ant.Main.runBuild(Main.java:826)
>   at org.apache.tools.ant.Main.startAnt(Main.java:218)
>   at org.apache.tools.ant.launch.Launcher.run(Launcher.java:280)
>   at org.apache.tools.ant.launch.Launcher.main(Launcher.java:109)
> PermGen space
> {noformat}



[jira] [Updated] (HIVE-4105) Hive MapJoinOperator unnecessarily deserializes values for all join-keys

2013-04-15 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-4105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HIVE-4105:
--

Attachment: HIVE-4105-20130415.txt

Yes, the clearing of the row should happen independent of row-generation. 
Attaching updated patch addressing the review comment.

> Hive MapJoinOperator unnecessarily deserializes values for all join-keys
> 
>
> Key: HIVE-4105
> URL: https://issues.apache.org/jira/browse/HIVE-4105
> Project: Hive
>  Issue Type: Bug
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Vinod Kumar Vavilapalli
> Attachments: HIVE-4105-20130301.1.txt, HIVE-4105-20130301.txt, 
> HIVE-4105-20130415.txt, HIVE-4105.patch
>
>
> We can avoid this for inner-joins. Hive does an explicit value 
> de-serialization up front, even for those rows which won't emit output. In 
> these cases, we can make do with just key de-serialization.
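
The idea in the description can be sketched as follows: deserialize the key first, probe the small table, and only deserialize the value when the inner join will actually emit a row. This is a simplified sketch under stated assumptions, not the MapJoinOperator code: LazyValueJoinSketch and its methods are hypothetical, rows are modelled as a pair of UTF-8 byte arrays, and the small table is a plain in-memory map.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LazyValueJoinSketch {

    // Counts how many value payloads were actually deserialized.
    public int valueDeserializations = 0;

    public static byte[] bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    private static String deserialize(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    // Each big-table row is {serializedKey, serializedValue}.
    public List<String> innerJoin(List<byte[][]> bigRows, Map<String, String> smallTable) {
        List<String> out = new ArrayList<>();
        for (byte[][] row : bigRows) {
            String key = deserialize(row[0]);      // the key is always needed for the probe
            String match = smallTable.get(key);
            if (match == null) {
                continue;                          // inner join: no output, value bytes stay untouched
            }
            valueDeserializations++;
            String value = deserialize(row[1]);    // the value is needed only for emitted rows
            out.add(key + "," + value + "," + match);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> small = new HashMap<>();
        small.put("k1", "dim1");
        List<byte[][]> big = new ArrayList<>();
        big.add(new byte[][] { bytes("k1"), bytes("v1") });
        big.add(new byte[][] { bytes("k2"), bytes("v2") });
        LazyValueJoinSketch join = new LazyValueJoinSketch();
        System.out.println(join.innerJoin(big, small));
        System.out.println(join.valueDeserializations);
    }
}
```

For outer joins on the big-table side the value would still be needed for unmatched rows, which is presumably why the description limits the optimization to inner joins.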



[jira] [Updated] (HIVE-4105) Hive MapJoinOperator unnecessarily deserializes values for all join-keys

2013-04-09 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-4105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HIVE-4105:
--

Assignee: Vinod Kumar Vavilapalli
  Status: Patch Available  (was: Open)

> Hive MapJoinOperator unnecessarily deserializes values for all join-keys
> 
>
> Key: HIVE-4105
> URL: https://issues.apache.org/jira/browse/HIVE-4105
> Project: Hive
>  Issue Type: Bug
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Vinod Kumar Vavilapalli
> Attachments: HIVE-4105-20130301.1.txt, HIVE-4105-20130301.txt, 
> HIVE-4105.patch
>
>
> We can avoid this for inner-joins. Hive does an explicit value 
> de-serialization up front, even for those rows which won't emit output. In 
> these cases, we can make do with just key de-serialization.



[jira] [Updated] (HIVE-4105) Hive MapJoinOperator unnecessarily deserializes values for all join-keys

2013-04-09 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-4105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HIVE-4105:
--

Attachment: HIVE-4105.patch

Latest patch addressing Vikram's comments.

Created review board request at https://reviews.apache.org/r/10323/.

> Hive MapJoinOperator unnecessarily deserializes values for all join-keys
> 
>
> Key: HIVE-4105
> URL: https://issues.apache.org/jira/browse/HIVE-4105
> Project: Hive
>  Issue Type: Bug
>Reporter: Vinod Kumar Vavilapalli
> Attachments: HIVE-4105-20130301.1.txt, HIVE-4105-20130301.txt, 
> HIVE-4105.patch
>
>
> We can avoid this for inner joins. Hive does an explicit value 
> de-serialization up front, even for those rows which won't emit output. In 
> these cases, we can make do with just key de-serialization.



[jira] [Updated] (HIVE-3952) merge map-job followed by map-reduce job

2013-04-05 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-3952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HIVE-3952:
--

Status: Patch Available  (was: Open)

> merge map-job followed by map-reduce job
> 
>
> Key: HIVE-3952
> URL: https://issues.apache.org/jira/browse/HIVE-3952
> Project: Hive
>  Issue Type: Improvement
>  Components: Query Processor
>Reporter: Namit Jain
>Assignee: Vinod Kumar Vavilapalli
> Attachments: HIVE-3952-20130226.txt, HIVE-3952-20130227.1.txt, 
> HIVE-3952-20130301.txt
>
>
> Consider the query like:
> select count(*) FROM
> ( select idOne, idTwo, value FROM
>   bigTable   
>   JOIN
> 
>   smallTableOne on (bigTable.idOne = smallTableOne.idOne) 
>   
>   ) firstjoin 
> 
> JOIN  
> 
> smallTableTwo on (firstjoin.idTwo = smallTableTwo.idTwo);
> where smallTableOne and smallTableTwo are smaller than 
> hive.auto.convert.join.noconditionaltask.size and
> hive.auto.convert.join.noconditionaltask is set to true.
> The joins are collapsed into mapjoins, and it leads to a map-only job
> (for the map-joins) followed by a map-reduce job (for the group by).
> Ideally, the map-only job should be merged with the following map-reduce job.



[jira] [Commented] (HIVE-3952) merge map-job followed by map-reduce job

2013-04-05 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-3952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13624177#comment-13624177
 ] 

Vinod Kumar Vavilapalli commented on HIVE-3952:
---

The patch named HIVE-3952-20130227.1.txt still applies on trunk.

Created a review board request: https://reviews.apache.org/r/10321/

> merge map-job followed by map-reduce job
> 
>
> Key: HIVE-3952
> URL: https://issues.apache.org/jira/browse/HIVE-3952
> Project: Hive
>  Issue Type: Improvement
>  Components: Query Processor
>Reporter: Namit Jain
>Assignee: Vinod Kumar Vavilapalli
> Attachments: HIVE-3952-20130226.txt, HIVE-3952-20130227.1.txt, 
> HIVE-3952-20130301.txt
>
>
> Consider the query like:
> select count(*) FROM
> ( select idOne, idTwo, value FROM
>   bigTable   
>   JOIN
> 
>   smallTableOne on (bigTable.idOne = smallTableOne.idOne) 
>   
>   ) firstjoin 
> 
> JOIN  
> 
> smallTableTwo on (firstjoin.idTwo = smallTableTwo.idTwo);
> where smallTableOne and smallTableTwo are smaller than 
> hive.auto.convert.join.noconditionaltask.size and
> hive.auto.convert.join.noconditionaltask is set to true.
> The joins are collapsed into mapjoins, and it leads to a map-only job
> (for the map-joins) followed by a map-reduce job (for the group by).
> Ideally, the map-only job should be merged with the following map-reduce job.



[jira] [Updated] (HIVE-3952) merge map-job followed by map-reduce job

2013-03-01 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-3952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HIVE-3952:
--

Attachment: HIVE-3952-20130301.txt

Ran my new test again, passes. This patch can be applied on top of HIVE-4106.

> merge map-job followed by map-reduce job
> 
>
> Key: HIVE-3952
> URL: https://issues.apache.org/jira/browse/HIVE-3952
> Project: Hive
>  Issue Type: Improvement
>  Components: Query Processor
>Reporter: Namit Jain
>Assignee: Vinod Kumar Vavilapalli
> Attachments: HIVE-3952-20130226.txt, HIVE-3952-20130227.1.txt, 
> HIVE-3952-20130301.txt
>
>
> Consider the query like:
> select count(*) FROM
> ( select idOne, idTwo, value FROM
>   bigTable   
>   JOIN
> 
>   smallTableOne on (bigTable.idOne = smallTableOne.idOne) 
>   
>   ) firstjoin 
> 
> JOIN  
> 
> smallTableTwo on (firstjoin.idTwo = smallTableTwo.idTwo);
> where smallTableOne and smallTableTwo are smaller than 
> hive.auto.convert.join.noconditionaltask.size and
> hive.auto.convert.join.noconditionaltask is set to true.
> The joins are collapsed into mapjoins, and it leads to a map-only job
> (for the map-joins) followed by a map-reduce job (for the group by).
> Ideally, the map-only job should be merged with the following map-reduce job.



[jira] [Updated] (HIVE-4105) Hive MapJoinOperator unnecessarily deserializes values for all join-keys

2013-03-01 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-4105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HIVE-4105:
--

Attachment: HIVE-4105-20130301.1.txt

Patch upmerged to the latest trunk.

> Hive MapJoinOperator unnecessarily deserializes values for all join-keys
> 
>
> Key: HIVE-4105
> URL: https://issues.apache.org/jira/browse/HIVE-4105
> Project: Hive
>  Issue Type: Bug
>Reporter: Vinod Kumar Vavilapalli
> Attachments: HIVE-4105-20130301.1.txt, HIVE-4105-20130301.txt
>
>
> We can avoid this for inner joins. Hive does an explicit value 
> de-serialization up front, even for those rows which won't emit output. In 
> these cases, we can make do with just key de-serialization.



[jira] [Updated] (HIVE-4105) Hive MapJoinOperator unnecessarily deserializes values for all join-keys

2013-03-01 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-4105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HIVE-4105:
--

Attachment: HIVE-4105-20130301.txt

Here's a patch to avoid value de-serialization where it is not needed in the 
case of an inner join.

In my microbenchmark, where I was map-joining a big table with a small table, 
this brought the task execution time down from 15 seconds to 10 seconds on about 
3 million records in the big table, with the second table being very small and 
the output small too. Note that you won't see this much of an improvement for 
non-selective inner joins.

If folks are interested, I'll try productionizing the benchmark.
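
To illustrate the idea in this comment, here is a minimal Java sketch of an inner join that de-serializes the key first and touches the value only when the key matches. It is not the code from the attached patch: the class LazyValueJoinSketch, the RawRow type and the deserializeValue helper are all hypothetical stand-ins, and the "de-serialization" is just string manipulation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; none of these names come from Hive's MapJoinOperator.
final class LazyValueJoinSketch {
    /** A big-table row whose key and value are both still in serialized form. */
    static final class RawRow {
        final String serializedKey;
        final String serializedValue;

        RawRow(String serializedKey, String serializedValue) {
            this.serializedKey = serializedKey;
            this.serializedValue = serializedValue;
        }
    }

    /** Counts value de-serializations so the saving is observable. */
    static int valueDeserializations = 0;

    static String deserializeValue(String raw) {
        valueDeserializations++;
        return raw.toUpperCase(); // stand-in for the real (expensive) work
    }

    /** Inner join of the big table against an in-memory small table. */
    static List<String> innerJoin(List<RawRow> bigTable, Map<String, String> smallTable) {
        List<String> out = new ArrayList<>();
        for (RawRow row : bigTable) {
            String key = row.serializedKey.trim(); // stand-in for key de-serialization
            String match = smallTable.get(key);
            if (match == null) {
                continue; // inner join: no match means no output, so skip the value
            }
            out.add(key + ":" + deserializeValue(row.serializedValue) + ":" + match);
        }
        return out;
    }
}
```

With a selective join, most rows take the `continue` branch and never pay for value de-serialization, which is consistent with the note above that non-selective inner joins gain much less.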

> Hive MapJoinOperator unnecessarily deserializes values for all join-keys
> 
>
> Key: HIVE-4105
> URL: https://issues.apache.org/jira/browse/HIVE-4105
> Project: Hive
>  Issue Type: Bug
>Reporter: Vinod Kumar Vavilapalli
> Attachments: HIVE-4105-20130301.txt
>
>
> We can avoid this for inner joins. Hive does an explicit value 
> de-serialization up front, even for those rows which won't emit output. In 
> these cases, we can make do with just key de-serialization.



[jira] [Created] (HIVE-4105) Hive MapJoinOperator unnecessarily deserializes values for all join-keys

2013-03-01 Thread Vinod Kumar Vavilapalli (JIRA)
Vinod Kumar Vavilapalli created HIVE-4105:
-

 Summary: Hive MapJoinOperator unnecessarily deserializes values 
for all join-keys
 Key: HIVE-4105
 URL: https://issues.apache.org/jira/browse/HIVE-4105
 Project: Hive
  Issue Type: Bug
Reporter: Vinod Kumar Vavilapalli


We can avoid this for inner joins. Hive does an explicit value de-serialization 
up front, even for those rows which won't emit output. In these cases, we can 
make do with just key de-serialization.




[jira] [Commented] (HIVE-4014) Hive+RCFile is not doing column pruning and reading much more data than necessary

2013-03-01 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13591058#comment-13591058
 ] 

Vinod Kumar Vavilapalli commented on HIVE-4014:
---

Okay, I cannot reproduce this on trunk, though I was consistently hitting this 
on hive-0.10. I'll try hive-0.10 again to be sure some other patch fixed this.

[~tamastarjanyi], what version are you using?

> Hive+RCFile is not doing column pruning and reading much more data than 
> necessary
> -
>
> Key: HIVE-4014
> URL: https://issues.apache.org/jira/browse/HIVE-4014
> Project: Hive
>  Issue Type: Bug
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Vinod Kumar Vavilapalli
>
> Even with simple projection queries, I see that the HDFS bytes read counter 
> doesn't show any reduction in the amount of data read.



[jira] [Commented] (HIVE-3952) merge map-job followed by map-reduce job

2013-02-28 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-3952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13590141#comment-13590141
 ] 

Vinod Kumar Vavilapalli commented on HIVE-3952:
---

Okay, will do.

Tests passed except the two about input37.

> merge map-job followed by map-reduce job
> 
>
> Key: HIVE-3952
> URL: https://issues.apache.org/jira/browse/HIVE-3952
> Project: Hive
>  Issue Type: Improvement
>  Components: Query Processor
>Reporter: Namit Jain
>Assignee: Vinod Kumar Vavilapalli
> Attachments: HIVE-3952-20130226.txt, HIVE-3952-20130227.1.txt
>
>
> Consider the query like:
> select count(*) FROM
> ( select idOne, idTwo, value FROM
>   bigTable   
>   JOIN
> 
>   smallTableOne on (bigTable.idOne = smallTableOne.idOne) 
>   
>   ) firstjoin 
> 
> JOIN  
> 
> smallTableTwo on (firstjoin.idTwo = smallTableTwo.idTwo);
> where smallTableOne and smallTableTwo are smaller than 
> hive.auto.convert.join.noconditionaltask.size and
> hive.auto.convert.join.noconditionaltask is set to true.
> The joins are collapsed into mapjoins, and it leads to a map-only job
> (for the map-joins) followed by a map-reduce job (for the group by).
> Ideally, the map-only job should be merged with the following map-reduce job.



[jira] [Updated] (HIVE-3952) merge map-job followed by map-reduce job

2013-02-27 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-3952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HIVE-3952:
--

Attachment: HIVE-3952-20130227.1.txt

Thanks for trying this, Amareshwari!

I've added your "INSERT OVERWRITE DIRECTORY "/dir Select " case to the test.

Here's an updated patch that should work for you, can you please try again? Tx.

> merge map-job followed by map-reduce job
> 
>
> Key: HIVE-3952
> URL: https://issues.apache.org/jira/browse/HIVE-3952
> Project: Hive
>  Issue Type: Improvement
>  Components: Query Processor
>Reporter: Namit Jain
>Assignee: Vinod Kumar Vavilapalli
> Attachments: HIVE-3952-20130226.txt, HIVE-3952-20130227.1.txt
>
>
> Consider the query like:
> select count(*) FROM
> ( select idOne, idTwo, value FROM
>   bigTable   
>   JOIN
> 
>   smallTableOne on (bigTable.idOne = smallTableOne.idOne) 
>   
>   ) firstjoin 
> 
> JOIN  
> 
> smallTableTwo on (firstjoin.idTwo = smallTableTwo.idTwo);
> where smallTableOne and smallTableTwo are smaller than 
> hive.auto.convert.join.noconditionaltask.size and
> hive.auto.convert.join.noconditionaltask is set to true.
> The joins are collapsed into mapjoins, and it leads to a map-only job
> (for the map-joins) followed by a map-reduce job (for the group by).
> Ideally, the map-only job should be merged with the following map-reduce job.



[jira] [Updated] (HIVE-3952) merge map-job followed by map-reduce job

2013-02-26 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-3952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HIVE-3952:
--

Status: Patch Available  (was: Open)

I am running tests in the background. The multiJoin test passes though.

> merge map-job followed by map-reduce job
> 
>
> Key: HIVE-3952
> URL: https://issues.apache.org/jira/browse/HIVE-3952
> Project: Hive
>  Issue Type: Improvement
>  Components: Query Processor
>Reporter: Namit Jain
>Assignee: Vinod Kumar Vavilapalli
> Attachments: HIVE-3952-20130226.txt
>
>
> Consider the query like:
> select count(*) FROM
> ( select idOne, idTwo, value FROM
>   bigTable   
>   JOIN
> 
>   smallTableOne on (bigTable.idOne = smallTableOne.idOne) 
>   
>   ) firstjoin 
> 
> JOIN  
> 
> smallTableTwo on (firstjoin.idTwo = smallTableTwo.idTwo);
> where smallTableOne and smallTableTwo are smaller than 
> hive.auto.convert.join.noconditionaltask.size and
> hive.auto.convert.join.noconditionaltask is set to true.
> The joins are collapsed into mapjoins, and it leads to a map-only job
> (for the map-joins) followed by a map-reduce job (for the group by).
> Ideally, the map-only job should be merged with the following map-reduce job.



[jira] [Updated] (HIVE-3952) merge map-job followed by map-reduce job

2013-02-26 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-3952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HIVE-3952:
--

Attachment: HIVE-3952-20130226.txt

Had a patch for a while but took a bit of time to clean it up. Attached. Here's 
what it does:
 - The changes are in CommonJoinResolver, which collapses multi-way joins into 
a single task doing all the map-joins.
 - Every time a join is converted to a map-join, I also inspect the child task 
to see if it is an MR job, and if so merge the map-only job (M) with it.
 - Added a test to multiJoin1.q to check that an M-MR sequence collapses into a 
single MR job.
 - The memory model after this patch is very complicated; it all depends on 
what operations are performed in the second MR job. AFAIU, we also don't have a 
clear memory model for HIVE-3952 multi-way map-join either. So for now, I just 
added a config "hive.optimize.mapjoin.mapreduce" to control this. I think we 
need a bigger JIRA to figure out memory restrictions when we have these 
multiple optimizations in play.
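
The merge step described in the list above can be sketched roughly as follows. This is only an illustration under simplifying assumptions, not the code in CommonJoinResolver: the Task class, its fields and mergeWithChild are hypothetical, and the boolean parameter stands in for the "hive.optimize.mapjoin.mapreduce" setting.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of a task graph; not Hive's actual Task or MapredWork classes.
final class MapJoinMergeSketch {
    static final class Task {
        final List<String> mapOps = new ArrayList<>();
        final List<String> reduceOps = new ArrayList<>();
        final List<Task> children = new ArrayList<>();

        boolean isMapOnly() {
            return reduceOps.isEmpty();
        }
    }

    /**
     * If merging is enabled and the map-only task has exactly one child that is
     * a map-reduce job, fold the parent's map-side operators into the child and
     * return the child. Otherwise return the task unchanged.
     */
    static Task mergeWithChild(Task mapOnly, boolean mergeEnabled) {
        if (!mergeEnabled || !mapOnly.isMapOnly() || mapOnly.children.size() != 1) {
            return mapOnly;
        }
        Task child = mapOnly.children.get(0);
        if (child.isMapOnly()) {
            return mapOnly; // the child is not a map-reduce job, nothing to merge
        }
        // The map-joins must run before the child's own map-side work.
        child.mapOps.addAll(0, mapOnly.mapOps);
        return child;
    }
}
```

The sketch deliberately ignores the memory question raised in the last bullet: after the merge, the hash tables for the map-joins and whatever the child's map side needs have to fit in the same task, which is why a switch to turn the merge off is useful.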

Please review. Thanks!

> merge map-job followed by map-reduce job
> 
>
> Key: HIVE-3952
> URL: https://issues.apache.org/jira/browse/HIVE-3952
> Project: Hive
>  Issue Type: Improvement
>  Components: Query Processor
>Reporter: Namit Jain
>Assignee: Vinod Kumar Vavilapalli
> Attachments: HIVE-3952-20130226.txt
>
>
> Consider the query like:
> select count(*) FROM
> ( select idOne, idTwo, value FROM
>   bigTable   
>   JOIN
> 
>   smallTableOne on (bigTable.idOne = smallTableOne.idOne) 
>   
>   ) firstjoin 
> 
> JOIN  
> 
> smallTableTwo on (firstjoin.idTwo = smallTableTwo.idTwo);
> where smallTableOne and smallTableTwo are smaller than 
> hive.auto.convert.join.noconditionaltask.size and
> hive.auto.convert.join.noconditionaltask is set to true.
> The joins are collapsed into mapjoins, and it leads to a map-only job
> (for the map-joins) followed by a map-reduce job (for the group by).
> Ideally, the map-only job should be merged with the following map-reduce job.



[jira] [Commented] (HIVE-4014) Hive+RCFile is not doing column pruning and reading much more data than necessary

2013-02-12 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13577216#comment-13577216
 ] 

Vinod Kumar Vavilapalli commented on HIVE-4014:
---

I already tracked it down, will upload a patch soon..

> Hive+RCFile is not doing column pruning and reading much more data than 
> necessary
> -
>
> Key: HIVE-4014
> URL: https://issues.apache.org/jira/browse/HIVE-4014
> Project: Hive
>  Issue Type: Bug
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Vinod Kumar Vavilapalli
>
> Even with simple projection queries, I see that the HDFS bytes read counter 
> doesn't show any reduction in the amount of data read.



[jira] [Created] (HIVE-4014) Hive+RCFile is not doing column pruning and reading much more data than necessary

2013-02-12 Thread Vinod Kumar Vavilapalli (JIRA)
Vinod Kumar Vavilapalli created HIVE-4014:
-

 Summary: Hive+RCFile is not doing column pruning and reading much 
more data than necessary
 Key: HIVE-4014
 URL: https://issues.apache.org/jira/browse/HIVE-4014
 Project: Hive
  Issue Type: Bug
Reporter: Vinod Kumar Vavilapalli
Assignee: Vinod Kumar Vavilapalli


Even with simple projection queries, I see that the HDFS bytes read counter 
doesn't show any reduction in the amount of data read.



[jira] [Commented] (HIVE-3992) Hive RCFile::sync(long) does a sub-sequence linear search for sync blocks

2013-02-06 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-3992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13572927#comment-13572927
 ] 

Vinod Kumar Vavilapalli commented on HIVE-3992:
---

This is one of the big things that is solved by the ORC file (HIVE-3874). Not 
saying that it shouldn't be fixed in RCFile, but we will need to modify RCFile 
to similarly include some kind of file header/footer to index into the 
row-groups.

> Hive RCFile::sync(long) does a sub-sequence linear search for sync blocks
> -
>
> Key: HIVE-3992
> URL: https://issues.apache.org/jira/browse/HIVE-3992
> Project: Hive
>  Issue Type: Bug
> Environment: Ubuntu x86_64/java-1.6/hadoop-2.0.3
>Reporter: Gopal V
> Attachments: select-join-limit.html
>
>
> The following function does some bad I/O
> {code}
> public synchronized void sync(long position) throws IOException {
>   ...
>   try {
>     seek(position + 4); // skip escape
>     in.readFully(syncCheck);
>     int syncLen = sync.length;
>     for (int i = 0; in.getPos() < end; i++) {
>       int j = 0;
>       for (; j < syncLen; j++) {
>         if (sync[j] != syncCheck[(i + j) % syncLen]) {
>           break;
>         }
>       }
>       if (j == syncLen) {
>         in.seek(in.getPos() - SYNC_SIZE); // position before sync
>         return;
>       }
>       syncCheck[i % syncLen] = in.readByte();
>     }
>   }
>   ...
> }
> {code}
> This causes a rather large number of readByte() calls which are passed on to a 
> ByteBuffer via a single byte array.
> This results in a rather large amount of CPU being burnt in the linear 
> search for the sync pattern in the input RCFile (up to 92% for a skewed 
> example - a trivial map-join + limit 100).
> This behaviour should ideally be avoided, or at least replaced by a rolling 
> hash for efficient comparison, since the sync marker has a known width of 16 bytes.
> Attached the stack trace from a Yourkit profile.
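
As a rough illustration of the rolling-hash suggestion in the description above, here is a small Java sketch that looks for a fixed-width marker in an in-memory buffer with a Rabin-Karp style hash. It is not RCFile code: SyncMarkerSearchSketch and indexOf are hypothetical names, and the sketch assumes the caller has already read a chunk of the file into a byte array rather than calling readByte() once per byte.

```java
// Hypothetical helper, not part of RCFile: finds a fixed-width marker in a
// buffer using a rolling hash, so each input byte is hashed only once.
final class SyncMarkerSearchSketch {
    private static final int BASE = 257;

    /** Returns the index of the first occurrence of marker in data, or -1. */
    static int indexOf(byte[] data, byte[] marker) {
        int m = marker.length;
        if (m == 0) {
            return 0;
        }
        if (data.length < m) {
            return -1;
        }

        // BASE^(m-1); int overflow acts as an implicit modulus of 2^32.
        int high = 1;
        for (int i = 1; i < m; i++) {
            high *= BASE;
        }

        int want = 0;
        int have = 0;
        for (int i = 0; i < m; i++) {
            want = want * BASE + (marker[i] & 0xff);
            have = have * BASE + (data[i] & 0xff);
        }

        for (int start = 0; ; start++) {
            // Compare bytes only when the hashes agree, to rule out collisions.
            if (have == want && matches(data, start, marker)) {
                return start;
            }
            if (start + m >= data.length) {
                return -1;
            }
            // Slide the window: drop data[start], append data[start + m].
            have = (have - (data[start] & 0xff) * high) * BASE + (data[start + m] & 0xff);
        }
    }

    private static boolean matches(byte[] data, int start, byte[] marker) {
        for (int j = 0; j < marker.length; j++) {
            if (data[start + j] != marker[j]) {
                return false;
            }
        }
        return true;
    }
}
```

Searching a buffered chunk this way replaces the per-byte stream reads with one bulk read plus a constant amount of arithmetic per byte. A real reader would also have to handle a marker that straddles two chunks, which this sketch leaves out.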



[jira] [Commented] (HIVE-3874) Create a new Optimized Row Columnar file format for Hive

2013-02-06 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13572677#comment-13572677
 ] 

Vinod Kumar Vavilapalli commented on HIVE-3874:
---

Would it make sense to create a (very) temporary svn branch for capturing 
various bug fixes from (possibly) different contributors on sub-JIRAs?

> Create a new Optimized Row Columnar file format for Hive
> 
>
> Key: HIVE-3874
> URL: https://issues.apache.org/jira/browse/HIVE-3874
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Reporter: Owen O'Malley
>Assignee: Owen O'Malley
> Attachments: hive.3874.2.patch, OrcFileIntro.pptx, orc.tgz
>
>
> There are several limitations of the current RC File format that I'd like to 
> address by creating a new format:
> * each column value is stored as a binary blob, which means:
> ** the entire column value must be read, decompressed, and deserialized
> ** the file format can't use smarter type-specific compression
> ** push down filters can't be evaluated
> * the start of each row group needs to be found by scanning
> * user metadata can only be added to the file when the file is created
> * the file doesn't store the number of rows per a file or row group
> * there is no mechanism for seeking to a particular row number, which is 
> required for external indexes.
> * there is no mechanism for storing light weight indexes within the file to 
> enable push-down filters to skip entire row groups.
> * the type of the rows aren't stored in the file



[jira] [Assigned] (HIVE-3952) merge map-job followed by map-reduce job

2013-02-04 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-3952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli reassigned HIVE-3952:
-

Assignee: Vinod Kumar Vavilapalli

I'd like to take a stab at it..

> merge map-job followed by map-reduce job
> 
>
> Key: HIVE-3952
> URL: https://issues.apache.org/jira/browse/HIVE-3952
> Project: Hive
>  Issue Type: Improvement
>  Components: Query Processor
>Reporter: Namit Jain
>Assignee: Vinod Kumar Vavilapalli
>
> Consider the query like:
> select count(*) FROM
> ( select idOne, idTwo, value FROM
>   bigTable   
>   JOIN
> 
>   smallTableOne on (bigTable.idOne = smallTableOne.idOne) 
>   
>   ) firstjoin 
> 
> JOIN  
> 
> smallTableTwo on (firstjoin.idTwo = smallTableTwo.idTwo);
> where smallTableOne and smallTableTwo are smaller than 
> hive.auto.convert.join.noconditionaltask.size and
> hive.auto.convert.join.noconditionaltask is set to true.
> The joins are collapsed into mapjoins, and it leads to a map-only job
> (for the map-joins) followed by a map-reduce job (for the group by).
> Ideally, the map-only job should be merged with the following map-reduce job.



[jira] [Commented] (HIVE-3784) de-emphasize mapjoin hint

2013-01-31 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-3784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13568173#comment-13568173
 ] 

Vinod Kumar Vavilapalli commented on HIVE-3784:
---

[~namit] Sorry I was away and couldn't reply back.

Thanks for addressing my use-case, I'll play with it!

> de-emphasize mapjoin hint
> -
>
> Key: HIVE-3784
> URL: https://issues.apache.org/jira/browse/HIVE-3784
> Project: Hive
>  Issue Type: Improvement
>  Components: Query Processor
>Reporter: Namit Jain
>Assignee: Namit Jain
> Fix For: 0.11.0
>
> Attachments: hive.3784.10.patch, hive.3784.11.patch, 
> hive.3784.12.patch, hive.3784.13.patch, hive.3784.14.patch, 
> hive.3784.15.patch, hive.3784.16.patch, hive.3784.17.patch, 
> hive.3784.18.patch, hive.3784.19.patch, hive.3784.1.patch, 
> hive.3784.21.patch, hive.3784.22.patch, hive.3784.2.patch, hive.3784.3.patch, 
> hive.3784.4.patch, hive.3784.5.patch, hive.3784.6.patch, hive.3784.7.patch, 
> hive.3784.8.patch, hive.3784.9.patch
>
>
> hive.auto.convert.join has been around for a long time, and is pretty stable.
> When mapjoin hint was created, the above parameter did not exist.
> The only reason for the user to specify a mapjoin currently is if they want
> it to be converted to a bucketed-mapjoin or a sort-merge bucketed mapjoin.
> Eventually, that should also go away, but that may take some time to 
> stabilize.
> There are many rules in SemanticAnalyzer to handle the following trees:
> ReduceSink -> MapJoin
> Union  -> MapJoin
> MapJoin-> MapJoin
> This should not be supported anymore. In any of the above scenarios, the
> user can get the mapjoin behavior by setting hive.auto.convert.join to true
> and not specifying the hint. This will simplify the code a lot.
> What does everyone think ?



[jira] [Commented] (HIVE-3874) Create a new Optimized Row Columnar file format for Hive

2013-01-10 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13550898#comment-13550898
 ] 

Vinod Kumar Vavilapalli commented on HIVE-3874:
---

bq. Can the index be made optional ? In our typical use-case, the old data is 
hardly queried - so we are willing to trade off cpu, and not
support skipping rows for old data to save some space.
The way I understand it, index creation can be specified during creation, so it 
can be made optional. To start with, we may in fact have no indices and then 
add them later.

> Create a new Optimized Row Columnar file format for Hive
> 
>
> Key: HIVE-3874
> URL: https://issues.apache.org/jira/browse/HIVE-3874
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Reporter: Owen O'Malley
>Assignee: Owen O'Malley
> Attachments: OrcFileIntro.pptx
>
>
> There are several limitations of the current RC File format that I'd like to 
> address by creating a new format:
> * each column value is stored as a binary blob, which means:
> ** the entire column value must be read, decompressed, and deserialized
> ** the file format can't use smarter type-specific compression
> ** push down filters can't be evaluated
> * the start of each row group needs to be found by scanning
> * user metadata can only be added to the file when the file is created
> * the file doesn't store the number of rows per a file or row group
> * there is no mechanism for seeking to a particular row number, which is 
> required for external indexes.
> * there is no mechanism for storing light weight indexes within the file to 
> enable push-down filters to skip entire row groups.
> * the type of the rows aren't stored in the file



[jira] [Commented] (HIVE-3874) Create a new Optimized Row Columnar file format for Hive

2013-01-10 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13549416#comment-13549416
 ] 

Vinod Kumar Vavilapalli commented on HIVE-3874:
---

Bumping up the version number for ORC and transparently forwarding old data to 
the current file format should work, no?

> Create a new Optimized Row Columnar file format for Hive
> 
>
> Key: HIVE-3874
> URL: https://issues.apache.org/jira/browse/HIVE-3874
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Reporter: Owen O'Malley
>Assignee: Owen O'Malley
> Attachments: OrcFileIntro.pptx
>
>
> There are several limitations of the current RC File format that I'd like to 
> address by creating a new format:
> * each column value is stored as a binary blob, which means:
> ** the entire column value must be read, decompressed, and deserialized
> ** the file format can't use smarter type-specific compression
> ** push down filters can't be evaluated
> * the start of each row group needs to be found by scanning
> * user metadata can only be added to the file when the file is created
> * the file doesn't store the number of rows per a file or row group
> * there is no mechanism for seeking to a particular row number, which is 
> required for external indexes.
> * there is no mechanism for storing light weight indexes within the file to 
> enable push-down filters to skip entire row groups.
> * the type of the rows aren't stored in the file



[jira] [Commented] (HIVE-3874) Create a new Optimized Row Columnar file format for Hive

2013-01-09 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13549396#comment-13549396
 ] 

Vinod Kumar Vavilapalli commented on HIVE-3874:
---

+100 !

> Create a new Optimized Row Columnar file format for Hive
> 
>
> Key: HIVE-3874
> URL: https://issues.apache.org/jira/browse/HIVE-3874
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Reporter: Owen O'Malley
>Assignee: Owen O'Malley
> Attachments: OrcFileIntro.pptx
>
>
> There are several limitations of the current RC File format that I'd like to 
> address by creating a new format:
> * each column value is stored as a binary blob, which means:
> ** the entire column value must be read, decompressed, and deserialized
> ** the file format can't use smarter type-specific compression
> ** push down filters can't be evaluated
> * the start of each row group needs to be found by scanning
> * user metadata can only be added to the file when the file is created
> * the file doesn't store the number of rows per file or row group
> * there is no mechanism for seeking to a particular row number, which is 
> required for external indexes.
> * there is no mechanism for storing lightweight indexes within the file to 
> enable push-down filters to skip entire row groups.
> * the type of the rows isn't stored in the file



[jira] [Commented] (HIVE-3784) de-emphasize mapjoin hint

2012-12-20 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-3784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13537479#comment-13537479
 ] 

Vinod Kumar Vavilapalli commented on HIVE-3784:
---

I was trying out the patch, and my earlier concern resurfaced.
bq.  With different join keys, it needs some work to merge into a single MR 
anyway - that work is independent of this change.
That isn't true. Even today, I am able to get Hive to automatically merge a 
multi-way map-join with different join keys into a single map-only job. With 
this patch, we lose that functionality. For example, the following runs as a 
single map-only job:
{noformat}
select /*+MAPJOIN(smallTableTwo)*/ idOne, idTwo, value FROM
( select /*+MAPJOIN(smallTableOne)*/ idOne, idTwo, value FROM
  bigTable
  JOIN
  smallTableOne on (bigTable.idOne = smallTableOne.idOne)
  ) firstjoin
JOIN
smallTableTwo on (firstjoin.idTwo = smallTableTwo.idTwo)
{noformat}
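
For comparison, a minimal sketch of the hint-free form that the issue description proposes, relying on hive.auto.convert.join instead of MAPJOIN hints. The column qualifiers (which table supplies idTwo and value) are assumptions added only to keep the column references unambiguous; they are not taken from the query above. Whether Hive plans this form as a single map-only job is exactly the open question here, so the sketch does not claim it.
{noformat}
-- Assumption: both small tables are small enough for automatic conversion.
set hive.auto.convert.join=true;

select firstjoin.idOne, firstjoin.idTwo, firstjoin.value FROM
( select bigTable.idOne, bigTable.idTwo, smallTableOne.value FROM
  bigTable
  JOIN
  smallTableOne on (bigTable.idOne = smallTableOne.idOne)
  ) firstjoin
JOIN
smallTableTwo on (firstjoin.idTwo = smallTableTwo.idTwo)
{noformat}
Running EXPLAIN on both forms and comparing the number of map-reduce stages would show whether the merge into a single job is preserved.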


> de-emphasize mapjoin hint
> -
>
> Key: HIVE-3784
> URL: https://issues.apache.org/jira/browse/HIVE-3784
> Project: Hive
>  Issue Type: Improvement
>  Components: Query Processor
>Reporter: Namit Jain
>Assignee: Namit Jain
> Attachments: hive.3784.1.patch, hive.3784.2.patch, hive.3784.3.patch, 
> hive.3784.4.patch, hive.3784.5.patch
>
>
> hive.auto.convert.join has been around for a long time, and is pretty stable.
> When mapjoin hint was created, the above parameter did not exist.
> The only reason for the user to specify a mapjoin currently is if they want
> it to be converted to a bucketed-mapjoin or a sort-merge bucketed mapjoin.
> Eventually, that should also go away, but that may take some time to 
> stabilize.
> There are many rules in SemanticAnalyzer to handle the following trees:
> ReduceSink -> MapJoin
> Union  -> MapJoin
> MapJoin-> MapJoin
> This should not be supported anymore. In any of the above scenarios, the
> user can get the mapjoin behavior by setting hive.auto.convert.join to true
> and not specifying the hint. This will simplify the code a lot.
> What does everyone think ?



[jira] [Commented] (HIVE-3784) de-emphasize mapjoin hint

2012-12-14 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-3784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13532548#comment-13532548
 ] 

Vinod Kumar Vavilapalli commented on HIVE-3784:
---

Thanks for the clarification, Namit!

> de-emphasize mapjoin hint
> -
>
> Key: HIVE-3784
> URL: https://issues.apache.org/jira/browse/HIVE-3784
> Project: Hive
>  Issue Type: Improvement
>  Components: Query Processor
>Reporter: Namit Jain
>Assignee: Namit Jain
> Attachments: hive.3784.1.patch, hive.3784.2.patch, hive.3784.3.patch, 
> hive.3784.4.patch, hive.3784.5.patch
>
>
> hive.auto.convert.join has been around for a long time, and is pretty stable.
> When mapjoin hint was created, the above parameter did not exist.
> The only reason for the user to specify a mapjoin currently is if they want
> it to be converted to a bucketed-mapjoin or a sort-merge bucketed mapjoin.
> Eventually, that should also go away, but that may take some time to 
> stabilize.
> There are many rules in SemanticAnalyzer to handle the following trees:
> ReduceSink -> MapJoin
> Union  -> MapJoin
> MapJoin-> MapJoin
> This should not be supported anymore. In any of the above scenarios, the
> user can get the mapjoin behavior by setting hive.auto.convert.join to true
> and not specifying the hint. This will simplify the code a lot.
> What does everyone think ?



[jira] [Commented] (HIVE-3784) de-emphasize mapjoin hint

2012-12-13 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-3784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13531451#comment-13531451
 ] 

Vinod Kumar Vavilapalli commented on HIVE-3784:
---

Hi, a couple of questions:
 - Does this rule out bucketed map-join, or will hive.optimize.bucketmapjoin 
continue to work? If it is the former, shouldn't fixing that be a blocker for 
this?
 - Also, does this rule out map join of multiple small tables in a single 
map-only job? As discussed on HIVE-3652, giving map-join hints to a nested join 
automatically converts it into a single map-join map.

bq. also optimizes a lot of queries - mapjoin followed by groupby.
This is great!

> de-emphasize mapjoin hint
> -
>
> Key: HIVE-3784
> URL: https://issues.apache.org/jira/browse/HIVE-3784
> Project: Hive
>  Issue Type: Improvement
>  Components: Query Processor
>Reporter: Namit Jain
>Assignee: Namit Jain
> Attachments: hive.3784.1.patch, hive.3784.2.patch, hive.3784.3.patch, 
> hive.3784.4.patch
>
>
> hive.auto.convert.join has been around for a long time, and is pretty stable.
> When mapjoin hint was created, the above parameter did not exist.
> The only reason for the user to specify a mapjoin currently is if they want
> it to be converted to a bucketed-mapjoin or a sort-merge bucketed mapjoin.
> Eventually, that should also go away, but that may take some time to 
> stabilize.
> There are many rules in SemanticAnalyzer to handle the following trees:
> ReduceSink -> MapJoin
> Union  -> MapJoin
> MapJoin-> MapJoin
> This should not be supported anymore. In any of the above scenarios, the
> user can get the mapjoin behavior by setting hive.auto.convert.join to true
> and not specifying the hint. This will simplify the code a lot.
> What does everyone think ?
