[jira] [Commented] (HIVE-1662) Add file pruning into Hive.
[ https://issues.apache.org/jira/browse/HIVE-1662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13595482#comment-13595482 ]

Joydeep Sen Sarma commented on HIVE-1662:
-----------------------------------------

Question: Utilities.getInputSummary() doesn't go through CHIF (CombineHiveInputFormat) AFAIK (looking at the 0.8 code). Will reducer estimation work with this patch?

> Add file pruning into Hive.
> ---------------------------
>
>                 Key: HIVE-1662
>                 URL: https://issues.apache.org/jira/browse/HIVE-1662
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>            Assignee: Navis
>         Attachments: HIVE-1662.D8391.1.patch, HIVE-1662.D8391.2.patch
>
> Hive now supports the filename virtual column. If a filename filter is present in a query, Hive should be able to add only those files that pass the filter to the input paths.

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HIVE-3874) Create a new Optimized Row Columnar file format for Hive
[ https://issues.apache.org/jira/browse/HIVE-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13556977#comment-13556977 ]

Joydeep Sen Sarma commented on HIVE-3874:
-----------------------------------------

A couple of observations:

- One use case mentioned is external indices. But in my experience, secondary index pointers have little correlation with the primary key ordering. If the use case is to speed up secondary index lookups, then one will be forced to consider smaller row groups. At that point this starts breaking down: large row groups are good for scanning and compression, but poor for lookups. A possible way out is a two-level structure: stripes or chunks as the unit of compression (column dictionaries maintained at this level), with a smaller unit for row groups (a single 250MB chunk holds many smaller row groups, all encoded using a common dictionary). This can give a good balance of compression and lookup capability. At that point, I believe, we are close to an HFile data structure, and I think converging HFile* so it works well for Hive would be a great goal. A lot of people would benefit from letting HBase do the indexing and letting Hive/Hadoop chomp on HBase-produced HFiles.

- Another use case mentioned is pruning based on column ranges. Once again, this typically only benefits columns whose values are correlated with the primary row order. Timestamps, and anything correlated with timestamps, do benefit, but other columns don't. In systems like Netezza this is used as a substitute for partitioning. The issue is that pruning at the block level is not enough, because one has already generated a large number of splits for MR to chomp on, and a large number of splits makes processing really slow even if everything is pruned out inside each mapper.
Unless that issue is addressed, most users would end up repartitioning their data (using Hive's dynamic partitioning) based on column values, and the whole column-range machinery would largely go unused.

> Create a new Optimized Row Columnar file format for Hive
> --------------------------------------------------------
>
>                 Key: HIVE-3874
>                 URL: https://issues.apache.org/jira/browse/HIVE-3874
>             Project: Hive
>          Issue Type: Improvement
>          Components: Serializers/Deserializers
>            Reporter: Owen O'Malley
>            Assignee: Owen O'Malley
>         Attachments: OrcFileIntro.pptx
>
> There are several limitations of the current RCFile format that I'd like to address by creating a new format:
> * each column value is stored as a binary blob, which means:
> ** the entire column value must be read, decompressed, and deserialized
> ** the file format can't use smarter type-specific compression
> ** push-down filters can't be evaluated
> * the start of each row group needs to be found by scanning
> * user metadata can only be added to the file when the file is created
> * the file doesn't store the number of rows per file or row group
> * there is no mechanism for seeking to a particular row number, which is required for external indexes
> * there is no mechanism for storing lightweight indexes within the file to enable push-down filters to skip entire row groups
> * the type of the rows isn't stored in the file
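For illustration, the column-range pruning being discussed can be sketched as follows. This is a hypothetical sketch, not ORC's actual reader API: each row group carries min/max statistics for a column, and a range predicate skips any group whose range cannot overlap the predicate.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of min/max-based row-group pruning; the class and
 *  method names are illustrative, not ORC's actual API. */
class RowGroupStats {
    final long min, max;  // column min/max for one row group

    RowGroupStats(long min, long max) { this.min = min; this.max = max; }

    /** True if no row in this group can satisfy lo <= value <= hi. */
    boolean canSkip(long lo, long hi) {
        return max < lo || min > hi;
    }
}

public class PruneSketch {
    public static void main(String[] args) {
        // Works well when values correlate with row order (e.g. timestamps):
        // group ranges are narrow and disjoint, so most groups are skipped.
        List<RowGroupStats> groups = Arrays.asList(
            new RowGroupStats(0, 99),
            new RowGroupStats(100, 199),
            new RowGroupStats(200, 299));

        long read = groups.stream().filter(g -> !g.canSkip(150, 250)).count();
        System.out.println(read + " of " + groups.size() + " groups read");
        // For an uncorrelated column, every group's min/max tends to span the
        // whole domain and nothing is skipped -- the point made in the comment.
    }
}
```

The limitation raised in the comment shows up directly here: skipping only helps when a group's min/max range is narrow, which happens only for columns correlated with row order.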
[jira] [Commented] (HIVE-3275) Fix autolocal1.q testcase failure when building hive on hadoop0.23 MR2
[ https://issues.apache.org/jira/browse/HIVE-3275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13418029#comment-13418029 ]

Joydeep Sen Sarma commented on HIVE-3275:
-----------------------------------------

That sounds like a reasonable approach. It's a Hive test, not a Hadoop one, so as long as Hive is trying to generate a non-local-mode job (I am guessing that's what's being tested here) and that's verified against some Hadoop tree, we are good.

> Fix autolocal1.q testcase failure when building hive on hadoop0.23 MR2
> ----------------------------------------------------------------------
>
>                 Key: HIVE-3275
>                 URL: https://issues.apache.org/jira/browse/HIVE-3275
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Zhenxiao Luo
>            Assignee: Zhenxiao Luo
>         Attachments: HIVE-3275.1.patch.txt
>
> autolocal1.q is failing only on hadoop0.23 MR2, due to a cluster initialization problem:
>
> Begin query: autolocal1.q
> diff -a /var/lib/jenkins/workspace/zhenxiao-CDH4-Hive-0.9.0/build/ql/test/logs/clientnegative/autolocal1.q.out /var/lib/jenkins/workspace/zhenxiao-CDH4-Hive-0.9.0/ql/src/test/results/clientnegative/autolocal1.q.out
> 5c5
> < Job Submission failed with exception 'java.io.IOException(Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses.)'
> ---
> > Job Submission failed with exception 'java.lang.IllegalArgumentException(Does not contain a valid host:port authority: abracadabra)'
> Exception: Client execution results failed with error code = 1
> See build/ql/tmp/hive.log, or try "ant test ... -Dtest.silent=false" to get more logs.
> Failed query: autolocal1.q
[jira] [Created] (HIVE-2125) alter table concatenate fails and deletes data
alter table concatenate fails and deletes data
----------------------------------------------

                 Key: HIVE-2125
                 URL: https://issues.apache.org/jira/browse/HIVE-2125
             Project: Hive
          Issue Type: Bug
            Reporter: Joydeep Sen Sarma
            Priority: Critical

The number of reducers is not set by this command (unlike other Hive queries). Since mapred.reduce.tasks=-1 (to let Hive infer it automatically), the jobtracker fails the job (the number of reducers cannot be negative).

hive> alter table ad_imps_2 partition(ds='2009-06-16') concatenate;
alter table ad_imps_2 partition(ds='2009-06-16') concatenate;
Starting Job = job_201103101203_453180, Tracking URL = http://curium.data.facebook.com:50030/jobdetails.jsp?jobid=job_201103101203_453180
Kill Command = /mnt/vol/hive/sites/curium/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=curium.data.facebook.com:50029 -kill job_201103101203_453180
Hadoop job information for null: number of mappers: 0; number of reducers: 0
2011-04-22 10:21:24,046 null map = 100%, reduce = 100%
Ended Job = job_201103101203_453180 with errors
Moved to trash: /user/facebook/warehouse/ad_imps_2/_backup.ds=2009-06-16

After the job fails, the partition is deleted. Thankfully it's still in the trash.
[jira] [Assigned] (HIVE-2125) alter table concatenate fails and deletes data
[ https://issues.apache.org/jira/browse/HIVE-2125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joydeep Sen Sarma reassigned HIVE-2125:
---------------------------------------

    Assignee: He Yongqiang
[jira] [Created] (HIVE-2100) virtual column references inside subqueries cause execution exceptions
virtual column references inside subqueries cause execution exceptions
----------------------------------------------------------------------

                 Key: HIVE-2100
                 URL: https://issues.apache.org/jira/browse/HIVE-2100
             Project: Hive
          Issue Type: Bug
            Reporter: Joydeep Sen Sarma

Example:

create table jssarma_nilzma_bad as
select a.fname, a.offset, a.val
from (select hash(eventid, userid, eventtime, browsercookie, userstate, useragent, userip, serverip, clienttime, geoid, countrycode, actionid, lastimpressionid, lastnavimpressionid, impressiontype, fullurl, fullreferrer, pagesection, modulesection, adsection) as val,
             INPUT__FILE__NAME as fname,
             BLOCK__OFFSET__INSIDE__FILE as offset
      from nectar_impression_lzma_unverified
      where ds='2010-07-28') a
join jssarma_hc_diff b on (a.val = b.val);

causes:

Caused by: java.lang.RuntimeException: Map operator initialization failed
        at org.apache.hadoop.hive.ql.exec.ExecMapper.configure(ExecMapper.java:121)
        ... 18 more
Caused by: java.lang.RuntimeException: cannot find field input__file__name from [org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector$MyField@664310d0, ... (20 more UnionStructObjectInspector$MyField instances elided) ...]
        at org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.getStandardStructFieldRef(ObjectInspectorUtils.java:321)
        at org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector.getStructFieldRef(UnionStructObjectInspector.java:96)
        at org.apache.hadoop.hive.ql.exec.ExprNodeColumnEvaluator.initialize(ExprNodeColumnEvaluator.java:57)
        at org.apache.hadoop.hive.ql.exec.Operator.initEvaluators(Operator.java:878)
        at org.apache.hadoop.hive.ql.exec.Operator.initEvaluatorsAndReturnStruct(Operator.java:904)
        at org.apache.hadoop.hive.ql.exec.SelectOperator.initializeOp(SelectOperator.java:60)
        at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:357)
        at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:433)
        at org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:389)
        at org.apache.hadoop.hive.ql.exec.FilterOperator.initializeOp(FilterOperator.java:73)
        at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:357)
        at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:433)
        at org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:389)
        at org.apache.hadoop.hive.ql.exec.TableScanOperator.initializeOp(TableScanOperator.java:133)
        at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:357)
        at org.apache.hadoop.hive.ql.exec.MapOperator.initializeOp(MapOperator.java:444)
        at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:357)
        at org.apache.hadoop.hive.ql.exec.ExecMapper.configure(ExecMapper.java:98)
        ... 18 more

Running the subquery separately avoids the issue.
[jira] [Commented] (HIVE-2052) PostHook and PreHook API to add flag to indicate it is pre or post hook plus cache for content summary
[ https://issues.apache.org/jira/browse/HIVE-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13009427#comment-13009427 ]

Joydeep Sen Sarma commented on HIVE-2052:
-----------------------------------------

Committed - thanks Siying!

> PostHook and PreHook API to add flag to indicate it is pre or post hook plus cache for content summary
> ------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-2052
>                 URL: https://issues.apache.org/jira/browse/HIVE-2052
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Siying Dong
>            Assignee: Siying Dong
>            Priority: Minor
>         Attachments: HIVE-2051.3.patch, HIVE-2052.1.patch, HIVE-2052.2.patch, HIVE-2052.3.patch
>
> This will allow hooks to share some information better and reduce their latency.
[jira] [Commented] (HIVE-2051) getInputSummary() to call FileSystem.getContentSummary() in parallel
[ https://issues.apache.org/jira/browse/HIVE-2051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13009426#comment-13009426 ]

Joydeep Sen Sarma commented on HIVE-2051:
-----------------------------------------

Committed - thanks Siying.

> getInputSummary() to call FileSystem.getContentSummary() in parallel
> --------------------------------------------------------------------
>
>                 Key: HIVE-2051
>                 URL: https://issues.apache.org/jira/browse/HIVE-2051
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Siying Dong
>            Assignee: Siying Dong
>            Priority: Minor
>         Attachments: HIVE-2051.1.patch, HIVE-2051.2.patch, HIVE-2051.3.patch, HIVE-2051.4.patch, HIVE-2051.5.patch
>
> getInputSummary() currently calls FileSystem.getContentSummary() one by one, which can be extremely slow when the number of input paths is huge. By calling those functions in parallel, we can cut latency in most cases.
[jira] [Commented] (HIVE-2051) getInputSummary() to call FileSystem.getContentSummary() in parallel
[ https://issues.apache.org/jira/browse/HIVE-2051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13009237#comment-13009237 ]

Joydeep Sen Sarma commented on HIVE-2051:
-----------------------------------------

+1. Will commit after running tests.
[jira] Commented: (HIVE-2051) getInputSummary() to call FileSystem.getContentSummary() in parallel
[ https://issues.apache.org/jira/browse/HIVE-2051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008903#comment-13008903 ]

Joydeep Sen Sarma commented on HIVE-2051:
-----------------------------------------

Based on http://www.ibm.com/developerworks/java/library/j-jtp05236.html it seems that the right thing to do here is to catch the InterruptedException and then call Thread.currentThread().interrupt() (grep for 'swallow interrupt' in that article). We could also rethrow it, but the problem would then merely be punted to the higher layer (which will probably ignore it as well).
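The "don't swallow the interrupt" pattern from that article can be sketched like this (an illustrative example, not the Hive patch itself): a method that cannot propagate InterruptedException restores the thread's interrupt status so that callers can still observe the cancellation.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

/** Illustrative sketch (not the Hive patch): drain a queue without
 *  propagating InterruptedException, restoring interrupt status instead. */
public class InterruptSketch {
    static long drainSum(BlockingQueue<Long> queue) {
        long total = 0;
        while (true) {
            try {
                Long v = queue.poll(10, TimeUnit.MILLISECONDS);
                if (v == null) {
                    return total;  // queue drained
                }
                total += v;
            } catch (InterruptedException e) {
                // Don't swallow the interrupt: re-assert it so the caller
                // (or a higher layer) can still react to the cancellation.
                Thread.currentThread().interrupt();
                return total;
            }
        }
    }

    public static void main(String[] args) {
        BlockingQueue<Long> q = new LinkedBlockingQueue<>();
        q.add(1L); q.add(2L); q.add(3L);
        System.out.println(drainSum(q));  // prints 6
    }
}
```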
[jira] Commented: (HIVE-2052) PostHook and PreHook API to add flag to indicate it is pre or post hook plus cache for content summary
[ https://issues.apache.org/jira/browse/HIVE-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008558#comment-13008558 ]

Joydeep Sen Sarma commented on HIVE-2052:
-----------------------------------------

Looks like the new patch is for HIVE-2051, not HIVE-2052!
[jira] Commented: (HIVE-2051) getInputSummary() to call FileSystem.getContentSummary() in parallel
[ https://issues.apache.org/jira/browse/HIVE-2051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008549#comment-13008549 ]

Joydeep Sen Sarma commented on HIVE-2051:
-----------------------------------------

Siying - I think we shouldn't ignore ExecutionException. The main benefit of checking each task's status is that we can find out whether any of them failed (indicated by an ExecutionException). We can also remove the executor.awaitTermination() call (same feedback as the comments above).

Also, do you want to make the core of this routine synchronized (perhaps on the Context object, which is one per query)? There really is no point running more than one of these per query at a time. (We could move this whole routine to the Context object if that seems like a better place, or at least make the call from the Context object, where it can be marked as a synchronized method.)

Otherwise looks good. Please upload a new patch and I will test and commit.
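The suggestion above, calling get() on every Future so task failures surface, might look like the following sketch. This is illustrative only: string lengths stand in for FileSystem.getContentSummary(), and the class and method names are not from the actual patch.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Illustrative sketch of the review suggestion: submit per-path work to an
 *  executor and get() every Future, so a failure in any task surfaces as an
 *  ExecutionException rather than being silently ignored. */
public class ParallelSummarySketch {
    static long totalSize(List<String> paths) throws IOException {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Future<Long>> futures = new ArrayList<>();
            for (String p : paths) {
                // Stand-in for fs.getContentSummary(new Path(p)).getLength()
                futures.add(pool.submit(() -> (long) p.length()));
            }
            long total = 0;
            for (Future<Long> f : futures) {
                try {
                    total += f.get();  // blocks; surfaces task failures
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IOException(e);
                } catch (ExecutionException e) {
                    throw new IOException(e.getCause());
                }
            }
            return total;  // all futures done; no awaitTermination() needed
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(totalSize(Arrays.asList("/a", "/bb", "/ccc")));  // prints 9
    }
}
```

Since every Future has completed by the time the loop finishes, shutdown() without awaitTermination() suffices, which is the code-reduction point made above.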
[jira] Commented: (HIVE-2052) PostHook and PreHook API to add flag to indicate it is pre or post hook plus cache for content summary
[ https://issues.apache.org/jira/browse/HIVE-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008238#comment-13008238 ]

Joydeep Sen Sarma commented on HIVE-2052:
-----------------------------------------

Small nits:

- setInputPathToContentSummary() is called twice on the same HookContext object
- we are setting the hook type again and again (we can do it once, before calling postexecute)

Should inputPathToContentSummary be marked final in the hook and passed in via the constructor? (Why would we ever change the map to a new one?)
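The final-field suggestion can be sketched like this (class and field names are illustrative, not Hive's actual HookContext): the map is injected once via the constructor, so pre and post hooks share one cache that can never be swapped out, only populated.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch, not Hive's actual HookContext: the content-summary
 *  map is final and set once in the constructor, so it can never be
 *  replaced, only populated. */
class HookContextSketch {
    private final Map<String, Long> inputPathToContentSummary;

    HookContextSketch(Map<String, Long> inputPathToContentSummary) {
        this.inputPathToContentSummary = inputPathToContentSummary;
    }

    Map<String, Long> getInputPathToContentSummary() {
        return inputPathToContentSummary;
    }
}

public class FinalFieldDemo {
    public static void main(String[] args) {
        Map<String, Long> shared = new HashMap<>();
        HookContextSketch preHook = new HookContextSketch(shared);
        HookContextSketch postHook = new HookContextSketch(shared);

        // The pre hook caches a summary; the post hook sees the same entry
        // because both contexts were handed the same (final) map.
        preHook.getInputPathToContentSummary().put("/warehouse/t1", 42L);
        System.out.println(postHook.getInputPathToContentSummary().get("/warehouse/t1"));  // prints 42
    }
}
```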
[jira] Commented: (HIVE-2051) getInputSummary() to call FileSystem.getContentSummary() in parallel
[ https://issues.apache.org/jira/browse/HIVE-2051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008221#comment-13008221 ]

Joydeep Sen Sarma commented on HIVE-2051:
-----------------------------------------

My bad - I thought Carl == M IS :-). Looking at the .3 patch, I am concerned about this code:

+    result.get();
+} catch (InterruptedException e) {
+    throw new IOException(e);

In a different block of code further down from this one, we ignore InterruptedException. It seems safer to ignore them (I am just not sure whether there's any reason to get a valid thread interrupt in the calling thread, and if so, what the thread is supposed to do in that case).

Is it necessary for the executor to terminate if all the tasks given to it are already terminated? (Trivial point, but it might reduce the code a bit.)
[jira] Commented: (HIVE-2051) getInputSummary() to call FileSystem.getContentSummary() in parallel
[ https://issues.apache.org/jira/browse/HIVE-2051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13007323#comment-13007323 ]

Joydeep Sen Sarma commented on HIVE-2051:
-----------------------------------------

Looked at the latest patch from Carl. I don't get it - why should we pay the cost of creating a thread when one is not required?
[jira] Updated: (HIVE-2039) remove hadoop version check from hive cli shell script
[ https://issues.apache.org/jira/browse/HIVE-2039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joydeep Sen Sarma updated HIVE-2039:
------------------------------------

    Fix Version/s: 0.8.0
     Release Note: The Hive startup shell script allows bypassing the 'hadoop version' subcommand by setting the HADOOP_VERSION environment variable. This can reduce the latency of starting up the Hive CLI.
           Status: Patch Available  (was: Reopened)

> remove hadoop version check from hive cli shell script
> ------------------------------------------------------
>
>                 Key: HIVE-2039
>                 URL: https://issues.apache.org/jira/browse/HIVE-2039
>             Project: Hive
>          Issue Type: Improvement
>          Components: CLI
>            Reporter: Joydeep Sen Sarma
>            Assignee: Joydeep Sen Sarma
>             Fix For: 0.8.0
>         Attachments: HIVE-2039.1.patch
>
> Looking at CLI startup times, one thing I noticed is that the version check in execHiveCmd.sh consumes 0.5-1s of wall-clock time (depending on where Hive is installed).
> AFAIK Hive doesn't support versions less than 0.20 right now, and this check only tests whether the version is less than 0.20. So we should be able to safely take it out. Please comment if that is not the case.
[jira] Updated: (HIVE-2039) remove hadoop version check from hive cli shell script
[ https://issues.apache.org/jira/browse/HIVE-2039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joydeep Sen Sarma updated HIVE-2039:
------------------------------------

    Attachment: HIVE-2039.1.patch

Skip the Hadoop version check (via the hadoop subcommand) if the version is supplied by an environment variable.
[jira] Reopened: (HIVE-2039) remove hadoop version check from hive cli shell script
[ https://issues.apache.org/jira/browse/HIVE-2039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joydeep Sen Sarma reopened HIVE-2039:
-------------------------------------

    Assignee: Joydeep Sen Sarma

Spoke too fast - the version check has moved to a different place, but it's still there. I plan to provide a way to avoid the version check (via an env var) because it's pretty expensive. It would still run by default.
[jira] Resolved: (HIVE-2039) remove hadoop version check from hive cli shell script
[ https://issues.apache.org/jira/browse/HIVE-2039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joydeep Sen Sarma resolved HIVE-2039.
-------------------------------------

    Resolution: Invalid

Never mind. Looks like it's already gone and I was looking at an old version.
[jira] Created: (HIVE-2039) remove hadoop version check from hive cli shell script
remove hadoop version check from hive cli shell script
------------------------------------------------------

                 Key: HIVE-2039
                 URL: https://issues.apache.org/jira/browse/HIVE-2039
             Project: Hive
          Issue Type: Improvement
          Components: CLI
            Reporter: Joydeep Sen Sarma

Looking at CLI startup times, one thing I noticed is that the version check in execHiveCmd.sh consumes 0.5-1s of wall-clock time (depending on where Hive is installed).

AFAIK Hive doesn't support versions less than 0.20 right now, and this check only tests whether the version is less than 0.20. So we should be able to safely take it out. Please comment if that is not the case.
[jira] Updated: (HIVE-2037) Merge result file size should honor hive.merge.size.per.task
[ https://issues.apache.org/jira/browse/HIVE-2037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joydeep Sen Sarma updated HIVE-2037:
------------------------------------

       Resolution: Fixed
    Fix Version/s: 0.8.0
           Status: Resolved  (was: Patch Available)

> Merge result file size should honor hive.merge.size.per.task
> ------------------------------------------------------------
>
>                 Key: HIVE-2037
>                 URL: https://issues.apache.org/jira/browse/HIVE-2037
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Ning Zhang
>            Assignee: Ning Zhang
>             Fix For: 0.8.0
>         Attachments: HIVE-2037.patch
>
> The merge job sets mapred.min.split.size to the value of hive.merge.size.per.task, which roughly equals the output file size. However, the input split size is also determined by mapred.min.split.size.per.node, mapred.min.split.size.per.rack, and mapred.max.split.size. They should be set to the same value as hive.merge.size.per.task as well.
[jira] Commented: (HIVE-2037) Merge result file size should honor hive.merge.size.per.task
[ https://issues.apache.org/jira/browse/HIVE-2037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13004866#comment-13004866 ] Joydeep Sen Sarma commented on HIVE-2037:

Committed. Thanks Ning.
[jira] Commented: (HIVE-2037) Merge result file size should honor hive.merge.size.per.task
[ https://issues.apache.org/jira/browse/HIVE-2037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13004802#comment-13004802 ] Joydeep Sen Sarma commented on HIVE-2037:

Looks OK - please run the tests and I will commit.
[jira] Resolved: (HIVE-1833) Task-cleanup task should be disabled
[ https://issues.apache.org/jira/browse/HIVE-1833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joydeep Sen Sarma resolved HIVE-1833.

Resolution: Fixed
Fix Version/s: 0.8.0

> Task-cleanup task should be disabled
>
> Key: HIVE-1833
> URL: https://issues.apache.org/jira/browse/HIVE-1833
> Project: Hive
> Issue Type: Improvement
> Components: Server Infrastructure
> Reporter: Scott Chen
> Assignee: Scott Chen
> Fix For: 0.8.0
>
> Attachments: HIVE-1833.1.txt, HIVE-1833.txt
>
> Currently when a task fails, a cleanup attempt will be scheduled right after that. This is unnecessary and increases latency. MapReduce will allow disabling this (see MAPREDUCE-2206). After that patch is committed, we should set the JobConf in Hive to disable the cleanup task.
[jira] Commented: (HIVE-1833) Task-cleanup task should be disabled
[ https://issues.apache.org/jira/browse/HIVE-1833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13002165#comment-13002165 ] Joydeep Sen Sarma commented on HIVE-1833:

Committed - thanks Scott!
[jira] Commented: (HIVE-1833) Task-cleanup task should be disabled
[ https://issues.apache.org/jira/browse/HIVE-1833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13000609#comment-13000609 ] Joydeep Sen Sarma commented on HIVE-1833:

Can you also make this change in shims/src/0.20S/java/org/apache/hadoop/hive/shims/Hadoop20SShims.java? It looks like we have two shims for hadoop-0.20 (one for hadoop-0.20 with security).
[jira] Created: (HIVE-1968) data corruption with multi-table insert
data corruption with multi-table insert

Key: HIVE-1968
URL: https://issues.apache.org/jira/browse/HIVE-1968
Project: Hive
Issue Type: Bug
Components: Query Processor
Affects Versions: 0.7.0
Reporter: Joydeep Sen Sarma

I had to run a conversion process to compute a checksum (sum(hash(all-columns))) of a table and convert it to a different compression format. Trying to be clever, I did both in a single pass by doing something equivalent to:

  from (select col1, col2, hash(col1, col2) as val from table_to_be_converted) i
  insert overwrite table table_to_be_generated select i.col1, i.col2
  insert overwrite table table_to_be_converted_checksum select sum(hash(i.val));

The plan looked correct. However, the data produced was erroneous - the checksums and the data were both wrong (and consistent with each other). I know this because:
- the checksum computed by the above query didn't match the checksum on the input table when calculated separately
- the checksum of the data output by this query (first insert clause) didn't match the input table's checksum (neither the one computed by the query above, nor the one computed separately)

Later on, I broke this query up into two independent ones, and the data and checksums were good (i.e. they all matched up). So it seems like there is some data corruption happening in multi-table insert.
[jira] Updated: (HIVE-1852) Reduce unnecessary DFSClient.rename() calls
[ https://issues.apache.org/jira/browse/HIVE-1852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joydeep Sen Sarma updated HIVE-1852:

Resolution: Fixed
Fix Version/s: 0.7.0
Status: Resolved (was: Patch Available)

Committed. Thanks Ning.

> Reduce unnecessary DFSClient.rename() calls
>
> Key: HIVE-1852
> URL: https://issues.apache.org/jira/browse/HIVE-1852
> Project: Hive
> Issue Type: Improvement
> Reporter: Ning Zhang
> Assignee: Ning Zhang
> Fix For: 0.7.0
>
> Attachments: HIVE-1852.2.patch, HIVE-1852.3.patch, HIVE-1852.4.patch, HIVE-1852.5.patch, HIVE-1852.6.patch, HIVE-1852.patch
>
> On the Hive client side (MoveTask etc.), DFSClient.rename() is called for every file inside a directory. This is very expensive for a large directory on a busy DFS namenode. We should replace it with a single rename() call on the whole directory.
[jira] Commented: (HIVE-1852) Reduce unnecessary DFSClient.rename() calls
[ https://issues.apache.org/jira/browse/HIVE-1852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12973883#action_12973883 ] Joydeep Sen Sarma commented on HIVE-1852:

Hive.java:1564 - this should read fs.rename(srcs[0]) (since srcf may have been a wildcard that matched a single dir).

Hive.java:1574 - we can optimize this loop, I think. If the wildcard does not match a single directory, then it has to match a set of files; the load semantic analyzer already enforces this. So we don't need a second listStatus and loop over the entries here - we can directly move each of the srcs into destf/src.getName().

We have lost the atomic move for the wildcard case. I think that's OK (it's not used much, I would imagine) - but at least leave a note/todo saying that it would be nice to have this atomic.

The new tests look pretty good to me - the load/move case with wildcards is getting covered. We could add one where the load path is a wildcard that matches a single dir, to cover the first comment here.
[jira] Commented: (HIVE-1852) Reduce unnecessary DFSClient.rename() calls
[ https://issues.apache.org/jira/browse/HIVE-1852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12973387#action_12973387 ] Joydeep Sen Sarma commented on HIVE-1852:

Regarding wildcards: load data inpath /x/*.txt - does that work?

The copy task should not happen if the source and destination file systems are the same (in the load command). I think you may have observed the copy task kick in because, in the Hive unit test environment, the default file system is pfile:// and the local file system is not the same as pfile:// (from a Java class standpoint), so CopyTask kicks in. You might want to try a load command without the 'local' keyword.

Also, regarding the question about FileSystem.delete versus FsShell: testing should be easy because we have two working file systems in our test environment (file:// and pfile://). We can create a partition pointing to one filesystem, then try to change its location to the other filesystem and make sure things work.
[jira] Commented: (HIVE-1852) Reduce unnecessary DFSClient.rename() calls
[ https://issues.apache.org/jira/browse/HIVE-1852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12973353#action_12973353 ] Joydeep Sen Sarma commented on HIVE-1852:

Cool - the FsShell removal sounds good unless Yongqiang says otherwise.

I am pretty sure this patch breaks the load command with a wildcard, though. It seems to me that the load command simply passes the input path (with the wildcard pattern) to the loadTable/loadPartition methods (via LoadTableDesc). These methods were previously capable of handling wildcards that matched a set of files; now they will not be able to do that. Ning - can you confirm this? (Maybe add a test trying to load a wildcard pattern?)

On a more minor note: the checkPaths call that got taken out was checking for the presence of nested subdirectories inside the path being loaded. Is this no longer necessary? (Do we support directories within partitions/tables automatically at query time?)
[jira] Commented: (HIVE-1852) Reduce unnecessary DFSClient.rename() calls
[ https://issues.apache.org/jira/browse/HIVE-1852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12973300#action_12973300 ] Joydeep Sen Sarma commented on HIVE-1852:

Ning - do you know why we had the FsShell and -rmr code (in Hive.replaceFiles)? I don't think it was there originally, and there must have been a reason why it got put in. This patch is taking it out and I wanted to be sure this is OK. (I am wondering if the -rmr is there to handle non-HDFS file systems.)
[jira] Commented: (HIVE-1852) Reduce unnecessary DFSClient.rename() calls
[ https://issues.apache.org/jira/browse/HIVE-1852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12971991#action_12971991 ] Joydeep Sen Sarma commented on HIVE-1852:

What happens when there are multiple srcs?

  srcs = fs.listStatus(srcf);
  for (FileStatus src : srcs) {
    if (!fs.rename(src.getPath(), tmppath)) {

We can't rename multiple sources to the same target. This happens when the input path is a glob (which I think the load command allows). Can we special-case this optimization (apply it only when srcs.length == 1)?
[jira] Commented: (HIVE-1852) Reduce unnecessary DFSClient.rename() calls
[ https://issues.apache.org/jira/browse/HIVE-1852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12971926#action_12971926 ] Joydeep Sen Sarma commented on HIVE-1852:

Oh - sorry - my previous comment was wrong. If you rename src to tmpPath, you lose all previous contents of tmpPath, which is not what the function does: it retains contents of tmpPath that don't collide with contents of src (the load command uses this, I think, and merges data in). Maybe we need a different call for the specific case you are trying to optimize.

If this is not being picked up by the tests, that's pretty bad.
[jira] Commented: (HIVE-1852) Reduce unnecessary DFSClient.rename() calls
[ https://issues.apache.org/jira/browse/HIVE-1852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12971922#action_12971922 ] Joydeep Sen Sarma commented on HIVE-1852:

You need a test where src is a directory. Please try different variants of load commands.
[jira] Commented: (HIVE-1852) Reduce unnecessary DFSClient.rename() calls
[ https://issues.apache.org/jira/browse/HIVE-1852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12971920#action_12971920 ] Joydeep Sen Sarma commented on HIVE-1852:

The old code renames src/item_i to tmpPath/item_i. The new code renames src to tmpPath/src, so item_i's final position is tmpPath/src/item_i. What am I missing?
[jira] Commented: (HIVE-1852) Reduce unnecessary DFSClient.rename() calls
[ https://issues.apache.org/jira/browse/HIVE-1852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12971893#action_12971893 ] Joydeep Sen Sarma commented on HIVE-1852:

Are you sure this is OK? It seems we have changed the semantics: the old code takes each file from underneath the dir and moves it into the final location, while the new code moves the directory itself underneath the final location. There's one extra level of directory in the new code that's not there in the old code.

Also, the semantics in terms of collisions change because of this. If we create a subdir, then collisions that would occur in the old code may not occur in the new code (because of the rename).
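The semantics gap debated in this review thread - per-file moves that merge into an existing target versus one whole-directory rename that nests the source - can be sketched with java.nio.file in place of the Hadoop FileSystem API. This is an illustration of the two behaviors, not the committed Hive code; the class and method names are invented for the sketch.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class RenameSemantics {
    // Old behavior: move each child of src into dest, merging with whatever
    // dest already holds. One namenode-style call per file.
    static void moveChildren(Path src, Path dest) throws IOException {
        try (DirectoryStream<Path> items = Files.newDirectoryStream(src)) {
            for (Path item : items) {
                Files.move(item, dest.resolve(item.getFileName()));
            }
        }
    }

    // Proposed optimization: a single rename of the whole directory. Note the
    // different result: src itself ends up *under* dest (one extra directory
    // level), and nothing is merged with dest's existing contents.
    static void moveWhole(Path src, Path dest) throws IOException {
        Files.move(src, dest.resolve(src.getFileName()));
    }
}
```

After moveChildren, dest contains src's files directly; after moveWhole, dest contains a subdirectory named after src - which is exactly the extra level the comment above flags.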
[jira] Resolved: (HIVE-1846) change hive assumption that local mode mappers/reducers always run in same jvm
[ https://issues.apache.org/jira/browse/HIVE-1846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joydeep Sen Sarma resolved HIVE-1846.

Resolution: Fixed

Committed the new patch. Thanks Ram.
[jira] Updated: (HIVE-1846) change hive assumption that local mode mappers/reducers always run in same jvm
[ https://issues.apache.org/jira/browse/HIVE-1846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joydeep Sen Sarma updated HIVE-1846:

Assignee: Ramkumar Vadali (was: Joydeep Sen Sarma)
[jira] Resolved: (HIVE-1846) change hive assumption that local mode mappers/reducers always run in same jvm
[ https://issues.apache.org/jira/browse/HIVE-1846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joydeep Sen Sarma resolved HIVE-1846.

Resolution: Fixed
Fix Version/s: 0.7.0

Committed - thanks Ram.
[jira] Commented: (HIVE-1675) SAXParseException on plan.xml during local mode.
[ https://issues.apache.org/jira/browse/HIVE-1675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12970201#action_12970201 ] Joydeep Sen Sarma commented on HIVE-1675:

The combination of auto-local and parallel execution had a bug that was recently fixed (HIVE-1776).

The stack reported above is weird - we should not be entering this code path. As you mentioned, HIVE-1846 will fix this - but currently, the getMapRedWork call invoked from different places in the stack should be satisfied from an in-memory cache (and shouldn't hit the file). The file is not written out for local mode (because there is an assumption that everything runs in the same JVM). I am unable to explain this. I think it is worth fixing, because the plan file processing takes some time and it's better to retrieve the plan from memory where possible.

> SAXParseException on plan.xml during local mode.
>
> Key: HIVE-1675
> URL: https://issues.apache.org/jira/browse/HIVE-1675
> Project: Hive
> Issue Type: Bug
> Components: Query Processor
> Affects Versions: 0.7.0
> Reporter: Bennie Schut
> Assignee: Bennie Schut
> Fix For: 0.7.0
>
> Attachments: HIVE-1675.patch, local_10005_plan.xml, local_10006_plan.xml
>
> When Hive switches to local mode (hive.exec.mode.local.auto=true) I receive a SAX parser exception on the plan.xml. If I set hive.exec.mode.local.auto=false I get the correct results.
[jira] Created: (HIVE-1846) change hive assumption that local mode mappers/reducers always run in same jvm
change hive assumption that local mode mappers/reducers always run in same jvm

Key: HIVE-1846
URL: https://issues.apache.org/jira/browse/HIVE-1846
Project: Hive
Issue Type: Bug
Reporter: Joydeep Sen Sarma
Assignee: Joydeep Sen Sarma

We are trying out a version of Hadoop local mode that runs multiple mappers/reducers by spawning JVMs for them. In this mode, Hive mappers fail to read the plan file. It seems that we assume (in the setMapredWork call) that local mode mappers/reducers will run in the same JVM (so we can cache the current plan in a global variable and avoid serializing it to a path). This needs to get fixed.
[jira] Commented: (HIVE-1778) simultaneously launched queries collide on hive intermediate directories
[ https://issues.apache.org/jira/browse/HIVE-1778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12968859#action_12968859 ] Joydeep Sen Sarma commented on HIVE-1778:

Whatever works - for example, we could hash the query string and the time (perhaps a nanosecond timer) to come up with a better seed for the random generator.
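The collision described in HIVE-1778 and the seeding idea suggested above can be sketched as follows. This is illustrative only - the class and method names are invented here, and this is not the committed Hive fix; it just shows why same-millisecond seeds collide and how folding the query string and a nanosecond timer into the seed makes simultaneous queries diverge.

```java
import java.util.Random;

public class ExecutionIdSeed {
    // Two Randoms seeded with the same millisecond value yield identical
    // sequences - this is how two queries launched in the same millisecond
    // could pick identical intermediate directory names.
    static boolean collides(long millis) {
        return new Random(millis).nextLong() == new Random(millis).nextLong();
    }

    // The seed suggested in the comment: combine a hash of the query string
    // with a nanosecond timer, so two queries started in the same millisecond
    // (or even the same query started twice) get different seeds.
    static long betterSeed(String queryString) {
        return ((long) queryString.hashCode() << 32) ^ System.nanoTime();
    }
}
```

With the original scheme, a workflow engine firing several queries in one millisecond makes collisions likely; with the query hash folded in, even identical launch times produce distinct seeds for distinct queries.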
[jira] Commented: (HIVE-1695) MapJoin followed by ReduceSink should be done as single MapReduce Job
[ https://issues.apache.org/jira/browse/HIVE-1695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12966247#action_12966247 ] Joydeep Sen Sarma commented on HIVE-1695:

A couple of things to watch out for:

- Mapjoin uses a lot of memory on the mapper. I am not sure how the memory settings are controlled, but we need to make sure that the map-join and the sort (imposed by the ReduceSink) don't blow through the task heap limits. In case the RS comes from a group-by, the map-side hash aggregation will also use memory.

- The work Liyin has been doing converts regular joins into map joins automatically. I believe he generates several plans (map-join and sort-merge join) and chooses one of them at runtime. Will the technique discussed here apply to map-join plans generated by auto-map-joins? (I am not sure, so asking.)

> MapJoin followed by ReduceSink should be done as single MapReduce Job
>
> Key: HIVE-1695
> URL: https://issues.apache.org/jira/browse/HIVE-1695
> Project: Hive
> Issue Type: Improvement
> Components: Query Processor
> Reporter: Amareshwari Sriramadasu
>
> Currently MapJoin followed by ReduceSink runs as two MapReduce jobs: one map-only job followed by a map-reduce job. It can be combined into a single MapReduce job.
[jira] Updated: (HIVE-1776) parallel execution and auto-local mode combine to place plan file in wrong file system
[ https://issues.apache.org/jira/browse/HIVE-1776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joydeep Sen Sarma updated HIVE-1776:

Attachment: HIVE-1776.2.patch

> parallel execution and auto-local mode combine to place plan file in wrong file system
>
> Key: HIVE-1776
> URL: https://issues.apache.org/jira/browse/HIVE-1776
> Project: Hive
> Issue Type: Bug
> Reporter: Joydeep Sen Sarma
> Assignee: Joydeep Sen Sarma
> Attachments: HIVE-1776.1.patch, HIVE-1776.2.patch
>
> A query (that I can't reproduce verbatim) submits a job to an MR cluster with a plan file that is resident on the local file system. This job obviously fails. This seems to result from an interaction between parallel execution and auto-local mode (parallel execution tries to run one local and one remote job at the same time). Turning off either parallel execution or auto-local mode seems to fix the problem.
[jira] Commented: (HIVE-1776) parallel execution and auto-local mode combine to place plan file in wrong file system
[ https://issues.apache.org/jira/browse/HIVE-1776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930368#action_12930368 ] Joydeep Sen Sarma commented on HIVE-1776:

Yeah, it was - but shoot, I forgot to take out the corresponding call in the finally block that restores the tracker. Will upload a new patch. These calls are no longer necessary because we are using a cloned configuration object that is discarded once the task completes.
[jira] Updated: (HIVE-1776) parallel execution and auto-local mode combine to place plan file in wrong file system
[ https://issues.apache.org/jira/browse/HIVE-1776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joydeep Sen Sarma updated HIVE-1776:

Assignee: Joydeep Sen Sarma
Status: Patch Available (was: Open)
[jira] Updated: (HIVE-1776) parallel execution and auto-local mode combine to place plan file in wrong file system
[ https://issues.apache.org/jira/browse/HIVE-1776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joydeep Sen Sarma updated HIVE-1776: Attachment: HIVE-1776.1.patch the problem is that tasks are trying to modify the shared hive configuration object and trampling each other. fix is to clone the configuration object before modifying it in the Task. > parallel execution and auto-local mode combine to place plan file in wrong > file system > -- > > Key: HIVE-1776 > URL: https://issues.apache.org/jira/browse/HIVE-1776 > Project: Hive > Issue Type: Bug >Reporter: Joydeep Sen Sarma > Attachments: HIVE-1776.1.patch > > > A query (that i can't reproduce verbatim) submits a job to a MR cluster with > a plan file that is resident on the local file system. This job obviously > fails. > This seems to result from an interaction between the parallel execution > (which is trying to run one local and one remote job at the same time). > Turning off either the parallel execution mode or the auto-local mode seems > to fix the problem.
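The clone-before-modify fix described above can be sketched as follows. This is an illustration of the pattern only: a HashMap stands in for Hadoop's JobConf, and the class and method names are invented for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified sketch of the HIVE-1776 fix: each Task clones the shared
// configuration before modifying it, so parallel tasks (e.g. one running
// locally and one on the cluster) cannot trample each other's settings.
// A HashMap stands in for Hadoop's JobConf; the real patch clones the
// JobConf instead.
public class TaskConfClone {
    // Shared session-level configuration (one per Hive session).
    static final Map<String, String> sessionConf = new HashMap<>();

    // Returns a private copy for one task; mutations stay local to the task.
    static Map<String, String> confForTask(String jobTracker) {
        Map<String, String> clone = new HashMap<>(sessionConf); // copy, don't share
        clone.put("mapred.job.tracker", jobTracker);            // task-local override
        return clone;
    }
}
```

Because each task mutates only its own copy, there is also no need for a finally block that restores the tracker afterwards - the clone is simply discarded when the task completes.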
[jira] Created: (HIVE-1778) simultaneously launched queries collide on hive intermediate directories
simultaneously launched queries collide on hive intermediate directories Key: HIVE-1778 URL: https://issues.apache.org/jira/browse/HIVE-1778 Project: Hive Issue Type: Bug Reporter: Joydeep Sen Sarma we saw one instance of multiple queries for the same user launched in parallel (from a workflow engine) use the same intermediate directories. which is obviously super bad but not surprising considering how we allocate them: Random rand = new Random(); String executionId = "hive_" + format.format(new Date()) + "_" + Math.abs(rand.nextLong()); Java documentation says: Two Random objects created within the same millisecond will have the same sequence of random numbers.
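One collision-proof alternative for allocating the execution id is sketched below. This is a suggestion for illustration, not the committed fix; the timestamp format string is assumed.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.UUID;

// Sketch of a collision-proof replacement for Hive's executionId
// (HIVE-1778). Instead of a time-seeded Random (two instances created in
// the same millisecond could historically produce identical sequences),
// a random UUID guarantees uniqueness across concurrently launched
// queries. The "hive_" prefix and timestamp mirror the original scheme;
// the date format pattern here is assumed, not copied from Hive.
public class ExecutionId {
    static String next() {
        SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd_HH-mm-ss_SSS");
        return "hive_" + format.format(new Date()) + "_" + UUID.randomUUID();
    }
}
```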
[jira] Created: (HIVE-1776) parallel execution and auto-local mode combine to place plan file in wrong file system
parallel execution and auto-local mode combine to place plan file in wrong file system -- Key: HIVE-1776 URL: https://issues.apache.org/jira/browse/HIVE-1776 Project: Hive Issue Type: Bug Reporter: Joydeep Sen Sarma A query (that i can't reproduce verbatim) submits a job to a MR cluster with a plan file that is resident on the local file system. This job obviously fails. This seems to result from an interaction between the parallel execution (which is trying to run one local and one remote job at the same time). Turning off either the parallel execution mode or the auto-local mode seems to fix the problem.
[jira] Commented: (HIVE-1721) use bloom filters to improve the performance of joins
[ https://issues.apache.org/jira/browse/HIVE-1721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12927555#action_12927555 ] Joydeep Sen Sarma commented on HIVE-1721: - @Siying - that's a good question. I don't know statistically how common it is - but we have heard requests along these lines. for example, one use case we have seen is a project that wanted data for a reasonably large subset of the users - 0.2% of users were interesting in that case, but even 0.2% is very large for us. people also use semi-joins, and that pretty much says that people want to filter rows out. > use bloom filters to improve the performance of joins > - > > Key: HIVE-1721 > URL: https://issues.apache.org/jira/browse/HIVE-1721 > Project: Hive > Issue Type: New Feature > Components: Query Processor >Reporter: Namit Jain >Assignee: Siying Dong > > In case of map-joins, it is likely that the big table will not find many > matching rows from the small table. > Currently, we perform a hash-map lookup for every row in the big table, which > can be pretty expensive. > It might be useful to try out a bloom-filter containing all the elements in > the small table. > Each element from the big table is first searched in the bloom filter, and > only in case of a positive match, > the small table hash table is explored.
[jira] Commented: (HIVE-1721) use bloom filters to improve the performance of joins
[ https://issues.apache.org/jira/browse/HIVE-1721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12927544#action_12927544 ] Joydeep Sen Sarma commented on HIVE-1721: - a bloom filter takes 10 bits per entry (at a reasonable false-positive probability - i remember reading this value from wikipedia). Our java hash tables take 2000 bytes per key-value pair (based on tests done by Liyin for reasonable sized keys/values). So the idea is that if the small table is too big to be loaded into memory - but small enough that its bloom filter can be stored in memory - then we can first do a filter of the large table and then do the join. > use bloom filters to improve the performance of joins > - > > Key: HIVE-1721 > URL: https://issues.apache.org/jira/browse/HIVE-1721 > Project: Hive > Issue Type: New Feature > Components: Query Processor >Reporter: Namit Jain >Assignee: Siying Dong > > In case of map-joins, it is likely that the big table will not find many > matching rows from the small table. > Currently, we perform a hash-map lookup for every row in the big table, which > can be pretty expensive. > It might be useful to try out a bloom-filter containing all the elements in > the small table. > Each element from the big table is first searched in the bloom filter, and > only in case of a positive match, > the small table hash table is explored.
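The sizing arithmetic above can be sanity-checked with the standard bloom filter formula m = -n·ln(p)/(ln 2)² bits for n keys at false-positive rate p, which works out to roughly 10 bits per key at p ≈ 1%. The 2000-bytes-per-entry figure below is the empirical number quoted in the comment, not a general constant.

```java
// Back-of-the-envelope memory comparison: bloom filter bits per key vs
// the ~2000 bytes/entry measured for Hive's in-memory hash tables.
public class BloomSizing {
    // Optimal bloom filter size per key: -ln(p) / (ln 2)^2 bits.
    static double bloomBitsPerKey(double p) {
        return -Math.log(p) / (Math.log(2) * Math.log(2));
    }

    // How many times smaller the bloom filter is than the hash table
    // for n keys at false-positive rate p.
    static double memoryRatio(long n, double p, long hashBytesPerEntry) {
        double bloomBytes = n * bloomBitsPerKey(p) / 8.0;
        return (n * (double) hashBytesPerEntry) / bloomBytes;
    }
}
```

At p = 1% and 2000 bytes per hash entry, the bloom filter is over a thousand times smaller - which is why a table too big for a map-join hash table can still fit comfortably as a bloom filter.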
[jira] Commented: (HIVE-1721) use bloom filters to improve the performance of map joins
[ https://issues.apache.org/jira/browse/HIVE-1721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12922154#action_12922154 ] Joydeep Sen Sarma commented on HIVE-1721: - i am not so sure about this. consider a hash table which has a very large number of buckets (relative to the number of elements in the hashtable). a lookup inside the hashtable stops as soon as we hit an empty bucket. this requires us to only compute the hashcode(). if #buckets >> #elements - then for a miss - the likely average cost of the miss should only be the cost of the hashcode routine. now consider a bloom filter. here we have to compute multiple hash codes (or at least one). on top of that - with the added bloom filter - there's an added cost for each positive (many hashcode computations). so from this reasoning a bloom filter doesn't look any cheaper than a sparse hash table for small-table joins. note that the hashtables in java do allow specification of number of buckets - so the strategy outlined here (of deliberately constructing a sparse hash table) is a feasible one. Stepping back - this makes sense - because Bloom filters are designed for large data sets (or at least data sets that don't easily fit in memory) - not small ones (that fit easily in memory). --- It would be more interesting to consider Bloom filters to cover join scenarios that cannot be performed with map join. for example - if the small table had 1M keys and map-join is not able to handle that large a hash table - then one can use bloom filters: - filter (probabilistically) large table against medium sized table by looking up against bloom filter of medium-sized table (map-side bloom filter). (Note - this is not a join - just a filter) - take filtered output and do sort-merge join against medium sized table (by now the data size should be greatly reduced and the cost of sorting would go down tremendously). 
there's lots of literature around this - it's a pretty well known technique. it's quite different from what's proposed in this jira. > use bloom filters to improve the performance of map joins > - > > Key: HIVE-1721 > URL: https://issues.apache.org/jira/browse/HIVE-1721 > Project: Hadoop Hive > Issue Type: New Feature > Components: Query Processor >Reporter: Namit Jain >Assignee: Liyin Tang > > In case of map-joins, it is likely that the big table will not find many > matching rows from the small table. > Currently, we perform a hash-map lookup for every row in the big table, which > can be pretty expensive. > It might be useful to try out a bloom-filter containing all the elements in > the small table. > Each element from the big table is first searched in the bloom filter, and > only in case of a positive match, > the small table hash table is explored.
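The filter-then-join plan described in the comment above can be illustrated with a minimal bloom filter acting as a map-side pre-filter. This is an illustrative sketch, not Hive code; the double-hashing construction and all class and method names are invented for the example.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Sketch of the filter-then-join idea: build a bloom filter over the
// medium-sized table's join keys, stream the large table through it, and
// keep only rows that (probabilistically) match. Only the survivors go
// on to the sort-merge join, so the expensive sort sees far less data.
public class BloomPrefilter {
    final BitSet bits;
    final int m, k;

    BloomPrefilter(int m, int k) { this.bits = new BitSet(m); this.m = m; this.k = k; }

    void add(String key) {
        int h1 = key.hashCode(), h2 = h1 >>> 16 | 1;   // two derived hashes
        for (int i = 0; i < k; i++)
            bits.set(Math.floorMod(h1 + i * h2, m));
    }

    boolean mightContain(String key) {
        int h1 = key.hashCode(), h2 = h1 >>> 16 | 1;
        for (int i = 0; i < k; i++)
            if (!bits.get(Math.floorMod(h1 + i * h2, m))) return false; // definite miss
        return true; // possible hit (may be a false positive)
    }

    // Map-side pass: keep only big-table rows whose key might join.
    static List<String> prefilter(List<String> bigTableKeys, BloomPrefilter f) {
        List<String> survivors = new ArrayList<>();
        for (String key : bigTableKeys)
            if (f.mightContain(key)) survivors.add(key);
        return survivors;
    }
}
```

Note the asymmetry that makes this safe: the filter can pass a non-matching key (a false positive, later discarded by the real join) but can never drop a matching one, so the subsequent sort-merge join still produces the correct result.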
[jira] Commented: (HIVE-1620) Patch to write directly to S3 from Hive
[ https://issues.apache.org/jira/browse/HIVE-1620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920808#action_12920808 ] Joydeep Sen Sarma commented on HIVE-1620: - i agree that the speed efficiency may be worth the tradeoff in consistency. as you say - the messaging is critical. can we gate this feature on a new hive option that makes the user conscious of this tradeoff? regarding the cleanup - please look at jobClose method in FileSinkOperator (I think). if the hive client is still functioning at the time the job fails - we can make an attempt to clean things up there (assuming that the file names are unique - which i am not sure about right now because we made some changes to shorten file names (that might have to be undone for this feature)). one thing we have experienced in the past is that hadoop tasks continue to do stuff even after the job is technically 'complete'. so i think while the cleanup can help the 99% use case - there will be marginal cases where the output directory gets written to when it shouldn't. so having this gated on an option would still be worthwhile IMHO (for users who cannot afford the speed-accuracy tradeoff). > Patch to write directly to S3 from Hive > --- > > Key: HIVE-1620 > URL: https://issues.apache.org/jira/browse/HIVE-1620 > Project: Hadoop Hive > Issue Type: New Feature >Reporter: Vaibhav Aggarwal >Assignee: Vaibhav Aggarwal > Attachments: HIVE-1620.patch > > > We want to submit a patch to Hive which allows user to write files directly > to S3. > This patch allow user to specify an S3 location as the table output location > and hence eliminates the need of copying data from HDFS to S3. > Users can run Hive queries directly over the data stored in S3. > This patch helps integrate hive with S3 better and quicker.
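The option-gating suggested in the comment above could look like the following sketch. The option name hive.exec.s3.direct.write is hypothetical (chosen for illustration, not an actual Hive setting), and java.util.Properties stands in for HiveConf.

```java
import java.util.Properties;

// Sketch of gating direct-to-S3 writes behind an explicit option, so
// users consciously opt in to the consistency tradeoff. Defaults to off.
public class S3WriteGate {
    // Hypothetical option name, invented for this example.
    static final String DIRECT_WRITE_KEY = "hive.exec.s3.direct.write";

    // Direct writes stay disabled unless the user explicitly enables them.
    static boolean directWriteEnabled(Properties conf) {
        return Boolean.parseBoolean(conf.getProperty(DIRECT_WRITE_KEY, "false"));
    }
}
```

Defaulting to off means users who cannot tolerate the marginal cases (e.g. straggler tasks writing after the job is 'complete') are unaffected unless they opt in.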