[jira] [Commented] (HIVE-1662) Add file pruning into Hive.
[ https://issues.apache.org/jira/browse/HIVE-1662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13595482#comment-13595482 ]

Joydeep Sen Sarma commented on HIVE-1662:
-----------------------------------------

Question: Utilities.getInputSummary() doesn't go through CHIF (CombineHiveInputFormat) AFAIK (looking at the 0.8 code). Will reducer estimation work with this patch?

> Add file pruning into Hive.
> ---------------------------
>
>                 Key: HIVE-1662
>                 URL: https://issues.apache.org/jira/browse/HIVE-1662
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>            Assignee: Navis
>         Attachments: HIVE-1662.D8391.1.patch, HIVE-1662.D8391.2.patch
>
> Hive now supports the filename virtual column. If a filename filter is present in a query, Hive should be able to add only those files that pass the filter to the input paths.

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HIVE-3874) Create a new Optimized Row Columnar file format for Hive
[ https://issues.apache.org/jira/browse/HIVE-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13556977#comment-13556977 ]

Joydeep Sen Sarma commented on HIVE-3874:
-----------------------------------------

A couple of observations:

- One use case mentioned is external indices. But in my experience, secondary index pointers have little correlation with the primary key ordering. If the use case is to speed up secondary index lookups, then one will be forced to consider smaller row groups. At that point this starts breaking down: large row groups are good for scanning and compression, but poor for lookups. A possible way out is a two-level structure: stripes or chunks as the unit of compression (column dictionaries maintained at this level), with a smaller unit for row groups (a single 250MB chunk holds many smaller row groups, all encoded using a common dictionary). This can give a good balance of compression and lookup capability. At that point, I believe, we are close to an HFile data structure, and I think converging HFile* so it works well for Hive would be a great goal. A lot of people would benefit from letting HBase do the indexing and letting Hive/Hadoop chomp on HBase-produced HFiles.

- Another use case mentioned is pruning based on column ranges. Once again, this typically only benefits columns whose values are correlated with the primary row order. Timestamps, and anything correlated with timestamps, do benefit, but other columns don't. In systems like Netezza this is used as a substitute for partitioning. The issue is that pruning at the block level is not enough, because one has already generated a large number of splits for MR to chomp on, and a large number of splits makes processing really slow even if everything is pruned out inside each mapper.
Unless that issue is addressed, most users would end up repartitioning their data (using Hive's dynamic partitioning) based on column values, and the whole column-range machinery would largely go unused.

> Create a new Optimized Row Columnar file format for Hive
> --------------------------------------------------------
>
>                 Key: HIVE-3874
>                 URL: https://issues.apache.org/jira/browse/HIVE-3874
>             Project: Hive
>          Issue Type: Improvement
>          Components: Serializers/Deserializers
>            Reporter: Owen O'Malley
>            Assignee: Owen O'Malley
>         Attachments: OrcFileIntro.pptx
>
> There are several limitations of the current RCFile format that I'd like to address by creating a new format:
> * each column value is stored as a binary blob, which means:
> ** the entire column value must be read, decompressed, and deserialized
> ** the file format can't use smarter type-specific compression
> ** push-down filters can't be evaluated
> * the start of each row group needs to be found by scanning
> * user metadata can only be added to the file when the file is created
> * the file doesn't store the number of rows per file or row group
> * there is no mechanism for seeking to a particular row number, which is required for external indexes
> * there is no mechanism for storing lightweight indexes within the file to enable push-down filters to skip entire row groups
> * the type of the rows isn't stored in the file
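For illustration, the column-range pruning being discussed can be sketched as follows. This is a hypothetical sketch, not ORC's actual reader API: each row group carries min/max statistics for a column, and a range predicate skips any group whose range cannot overlap the predicate.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of min/max-based row-group pruning; the class and
 *  method names are illustrative, not ORC's actual API. */
class RowGroupStats {
    final long min, max;  // column min/max for one row group

    RowGroupStats(long min, long max) { this.min = min; this.max = max; }

    /** True if no row in this group can satisfy lo <= value <= hi. */
    boolean canSkip(long lo, long hi) {
        return max < lo || min > hi;
    }
}

public class PruneSketch {
    public static void main(String[] args) {
        // Works well when values correlate with row order (e.g. timestamps):
        // group ranges are narrow and disjoint, so most groups are skipped.
        List<RowGroupStats> groups = Arrays.asList(
            new RowGroupStats(0, 99),
            new RowGroupStats(100, 199),
            new RowGroupStats(200, 299));

        long read = groups.stream().filter(g -> !g.canSkip(150, 250)).count();
        System.out.println(read + " of " + groups.size() + " groups read");
        // For an uncorrelated column, every group's min/max tends to span the
        // whole domain and nothing is skipped -- the point made in the comment.
    }
}
```

The limitation raised in the comment shows up directly here: skipping only helps when a group's min/max range is narrow, which happens only for columns correlated with row order.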
[jira] [Commented] (HIVE-3275) Fix autolocal1.q testcase failure when building hive on hadoop0.23 MR2
[ https://issues.apache.org/jira/browse/HIVE-3275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13418029#comment-13418029 ]

Joydeep Sen Sarma commented on HIVE-3275:
-----------------------------------------

That sounds like a reasonable approach. It's a Hive test, not a Hadoop one, so as long as Hive is trying to generate a non-local-mode job (I am guessing that's what's being tested here) and that's verified against some Hadoop tree, we are good.

> Fix autolocal1.q testcase failure when building hive on hadoop0.23 MR2
> ----------------------------------------------------------------------
>
>                 Key: HIVE-3275
>                 URL: https://issues.apache.org/jira/browse/HIVE-3275
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Zhenxiao Luo
>            Assignee: Zhenxiao Luo
>         Attachments: HIVE-3275.1.patch.txt
>
> autolocal1.q is failing only on hadoop0.23 MR2, due to a cluster initialization problem:
>
> Begin query: autolocal1.q
> diff -a /var/lib/jenkins/workspace/zhenxiao-CDH4-Hive-0.9.0/build/ql/test/logs/clientnegative/autolocal1.q.out /var/lib/jenkins/workspace/zhenxiao-CDH4-Hive-0.9.0/ql/src/test/results/clientnegative/autolocal1.q.out
> 5c5
> < Job Submission failed with exception 'java.io.IOException(Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses.)'
> ---
> > Job Submission failed with exception 'java.lang.IllegalArgumentException(Does not contain a valid host:port authority: abracadabra)'
> Exception: Client execution results failed with error code = 1
> See build/ql/tmp/hive.log, or try "ant test ... -Dtest.silent=false" to get more logs.
> Failed query: autolocal1.q
[jira] [Created] (HIVE-2125) alter table concatenate fails and deletes data
alter table concatenate fails and deletes data
----------------------------------------------

                 Key: HIVE-2125
                 URL: https://issues.apache.org/jira/browse/HIVE-2125
             Project: Hive
          Issue Type: Bug
            Reporter: Joydeep Sen Sarma
            Priority: Critical

The number of reducers is not set by this command (unlike other Hive queries). Since mapred.reduce.tasks=-1 (to let Hive infer it automatically), the jobtracker fails the job (the number of reducers cannot be negative).

hive> alter table ad_imps_2 partition(ds='2009-06-16') concatenate;
alter table ad_imps_2 partition(ds='2009-06-16') concatenate;
Starting Job = job_201103101203_453180, Tracking URL = http://curium.data.facebook.com:50030/jobdetails.jsp?jobid=job_201103101203_453180
Kill Command = /mnt/vol/hive/sites/curium/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=curium.data.facebook.com:50029 -kill job_201103101203_453180
Hadoop job information for null: number of mappers: 0; number of reducers: 0
2011-04-22 10:21:24,046 null map = 100%, reduce = 100%
Ended Job = job_201103101203_453180 with errors
Moved to trash: /user/facebook/warehouse/ad_imps_2/_backup.ds=2009-06-16

After the job fails, the partition is deleted. Thankfully it's still in the trash.
[jira] [Assigned] (HIVE-2125) alter table concatenate fails and deletes data
[ https://issues.apache.org/jira/browse/HIVE-2125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joydeep Sen Sarma reassigned HIVE-2125:
---------------------------------------

    Assignee: He Yongqiang
[jira] [Created] (HIVE-2100) virtual column references inside subqueries cause execution exceptions
virtual column references inside subqueries cause execution exceptions
----------------------------------------------------------------------

                 Key: HIVE-2100
                 URL: https://issues.apache.org/jira/browse/HIVE-2100
             Project: Hive
          Issue Type: Bug
            Reporter: Joydeep Sen Sarma

Example:

create table jssarma_nilzma_bad as
select a.fname, a.offset, a.val
from (select hash(eventid, userid, eventtime, browsercookie, userstate, useragent, userip, serverip, clienttime, geoid, countrycode, actionid, lastimpressionid, lastnavimpressionid, impressiontype, fullurl, fullreferrer, pagesection, modulesection, adsection) as val,
             INPUT__FILE__NAME as fname,
             BLOCK__OFFSET__INSIDE__FILE as offset
      from nectar_impression_lzma_unverified
      where ds='2010-07-28') a
join jssarma_hc_diff b on (a.val = b.val);

causes:

Caused by: java.lang.RuntimeException: Map operator initialization failed
        at org.apache.hadoop.hive.ql.exec.ExecMapper.configure(ExecMapper.java:121)
        ... 18 more
Caused by: java.lang.RuntimeException: cannot find field input__file__name from [org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector$MyField@664310d0, ... (20 more UnionStructObjectInspector$MyField instances elided) ...]
        at org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.getStandardStructFieldRef(ObjectInspectorUtils.java:321)
        at org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector.getStructFieldRef(UnionStructObjectInspector.java:96)
        at org.apache.hadoop.hive.ql.exec.ExprNodeColumnEvaluator.initialize(ExprNodeColumnEvaluator.java:57)
        at org.apache.hadoop.hive.ql.exec.Operator.initEvaluators(Operator.java:878)
        at org.apache.hadoop.hive.ql.exec.Operator.initEvaluatorsAndReturnStruct(Operator.java:904)
        at org.apache.hadoop.hive.ql.exec.SelectOperator.initializeOp(SelectOperator.java:60)
        at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:357)
        at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:433)
        at org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:389)
        at org.apache.hadoop.hive.ql.exec.FilterOperator.initializeOp(FilterOperator.java:73)
        at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:357)
        at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:433)
        at org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:389)
        at org.apache.hadoop.hive.ql.exec.TableScanOperator.initializeOp(TableScanOperator.java:133)
        at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:357)
        at org.apache.hadoop.hive.ql.exec.MapOperator.initializeOp(MapOperator.java:444)
        at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:357)
        at org.apache.hadoop.hive.ql.exec.ExecMapper.configure(ExecMapper.java:98)
        ... 18 more

Running the subquery separately avoids the issue.
[jira] [Commented] (HIVE-2052) PostHook and PreHook API to add flag to indicate it is pre or post hook plus cache for content summary
[ https://issues.apache.org/jira/browse/HIVE-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13009427#comment-13009427 ]

Joydeep Sen Sarma commented on HIVE-2052:
-----------------------------------------

Committed - thanks Siying!

> PostHook and PreHook API to add flag to indicate it is pre or post hook plus cache for content summary
> ------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-2052
>                 URL: https://issues.apache.org/jira/browse/HIVE-2052
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Siying Dong
>            Assignee: Siying Dong
>            Priority: Minor
>         Attachments: HIVE-2051.3.patch, HIVE-2052.1.patch, HIVE-2052.2.patch, HIVE-2052.3.patch
>
> This will allow hooks to share some information better and reduce their latency.
[jira] [Commented] (HIVE-2051) getInputSummary() to call FileSystem.getContentSummary() in parallel
[ https://issues.apache.org/jira/browse/HIVE-2051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13009426#comment-13009426 ]

Joydeep Sen Sarma commented on HIVE-2051:
-----------------------------------------

Committed - thanks Siying.

> getInputSummary() to call FileSystem.getContentSummary() in parallel
> --------------------------------------------------------------------
>
>                 Key: HIVE-2051
>                 URL: https://issues.apache.org/jira/browse/HIVE-2051
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Siying Dong
>            Assignee: Siying Dong
>            Priority: Minor
>         Attachments: HIVE-2051.1.patch, HIVE-2051.2.patch, HIVE-2051.3.patch, HIVE-2051.4.patch, HIVE-2051.5.patch
>
> getInputSummary() currently calls FileSystem.getContentSummary() one by one, which can be extremely slow when the number of input paths is huge. By calling those functions in parallel, we can cut latency in most cases.
[jira] [Commented] (HIVE-2051) getInputSummary() to call FileSystem.getContentSummary() in parallel
[ https://issues.apache.org/jira/browse/HIVE-2051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13009237#comment-13009237 ]

Joydeep Sen Sarma commented on HIVE-2051:
-----------------------------------------

+1. Will commit after running tests.
[jira] Commented: (HIVE-2051) getInputSummary() to call FileSystem.getContentSummary() in parallel
[ https://issues.apache.org/jira/browse/HIVE-2051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008903#comment-13008903 ]

Joydeep Sen Sarma commented on HIVE-2051:
-----------------------------------------

Based on http://www.ibm.com/developerworks/java/library/j-jtp05236.html it seems that the right thing to do here is to catch the InterruptedException and then call Thread.currentThread().interrupt() (grep for 'swallow interrupt' in that article). We could also rethrow it, but the problem would then merely be punted to the higher layer (which will probably ignore it as well).
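The "don't swallow the interrupt" pattern from that article can be sketched like this (an illustrative example, not the Hive patch itself): a method that cannot propagate InterruptedException restores the thread's interrupt status so that callers can still observe the cancellation.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

/** Illustrative sketch (not the Hive patch): drain a queue without
 *  propagating InterruptedException, restoring interrupt status instead. */
public class InterruptSketch {
    static long drainSum(BlockingQueue<Long> queue) {
        long total = 0;
        while (true) {
            try {
                Long v = queue.poll(10, TimeUnit.MILLISECONDS);
                if (v == null) {
                    return total;  // queue drained
                }
                total += v;
            } catch (InterruptedException e) {
                // Don't swallow the interrupt: re-assert it so the caller
                // (or a higher layer) can still react to the cancellation.
                Thread.currentThread().interrupt();
                return total;
            }
        }
    }

    public static void main(String[] args) {
        BlockingQueue<Long> q = new LinkedBlockingQueue<>();
        q.add(1L); q.add(2L); q.add(3L);
        System.out.println(drainSum(q));  // prints 6
    }
}
```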
[jira] Commented: (HIVE-2052) PostHook and PreHook API to add flag to indicate it is pre or post hook plus cache for content summary
[ https://issues.apache.org/jira/browse/HIVE-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008558#comment-13008558 ]

Joydeep Sen Sarma commented on HIVE-2052:
-----------------------------------------

Looks like the new patch is for HIVE-2051, not HIVE-2052!
[jira] Commented: (HIVE-2051) getInputSummary() to call FileSystem.getContentSummary() in parallel
[ https://issues.apache.org/jira/browse/HIVE-2051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008549#comment-13008549 ]

Joydeep Sen Sarma commented on HIVE-2051:
-----------------------------------------

Siying - I think we shouldn't ignore ExecutionException. The main benefit of checking each task's status is that we can find out whether any of them failed (indicated by an ExecutionException). We can also remove the executor.awaitTermination() call (same feedback as the comments above).

Also, do you want to make the core of this routine synchronized (perhaps on the Context object, which is one per query)? There really is no point running more than one of these per query at a time. (We could move this whole routine to the Context object if that seems like a better place, or at least make the call from the Context object, where it can be marked as a synchronized method.)

Otherwise looks good. Please upload a new patch and I will test and commit.
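The suggestion above, calling get() on every Future so task failures surface, might look like the following sketch. This is illustrative only: string lengths stand in for FileSystem.getContentSummary(), and the class and method names are not from the actual patch.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Illustrative sketch of the review suggestion: submit per-path work to an
 *  executor and get() every Future, so a failure in any task surfaces as an
 *  ExecutionException rather than being silently ignored. */
public class ParallelSummarySketch {
    static long totalSize(List<String> paths) throws IOException {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Future<Long>> futures = new ArrayList<>();
            for (String p : paths) {
                // Stand-in for fs.getContentSummary(new Path(p)).getLength()
                futures.add(pool.submit(() -> (long) p.length()));
            }
            long total = 0;
            for (Future<Long> f : futures) {
                try {
                    total += f.get();  // blocks; surfaces task failures
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IOException(e);
                } catch (ExecutionException e) {
                    throw new IOException(e.getCause());
                }
            }
            return total;  // all futures done; no awaitTermination() needed
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(totalSize(Arrays.asList("/a", "/bb", "/ccc")));  // prints 9
    }
}
```

Since every Future has completed by the time the loop finishes, shutdown() without awaitTermination() suffices, which is the code-reduction point made above.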
[jira] Commented: (HIVE-2052) PostHook and PreHook API to add flag to indicate it is pre or post hook plus cache for content summary
[ https://issues.apache.org/jira/browse/HIVE-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008238#comment-13008238 ]

Joydeep Sen Sarma commented on HIVE-2052:
-----------------------------------------

Small nits:

- setInputPathToContentSummary() is called twice on the same HookContext object
- we are setting the hook type again and again (we can do it once, before calling postexecute)

Should inputPathToContentSummary be marked final in the hook and passed in via the constructor? (Why would we ever change the map to a new one?)
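The final-field suggestion can be sketched like this (class and field names are illustrative, not Hive's actual HookContext): the map is injected once via the constructor, so pre and post hooks share one cache that can never be swapped out, only populated.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch, not Hive's actual HookContext: the content-summary
 *  map is final and set once in the constructor, so it can never be
 *  replaced, only populated. */
class HookContextSketch {
    private final Map<String, Long> inputPathToContentSummary;

    HookContextSketch(Map<String, Long> inputPathToContentSummary) {
        this.inputPathToContentSummary = inputPathToContentSummary;
    }

    Map<String, Long> getInputPathToContentSummary() {
        return inputPathToContentSummary;
    }
}

public class FinalFieldDemo {
    public static void main(String[] args) {
        Map<String, Long> shared = new HashMap<>();
        HookContextSketch preHook = new HookContextSketch(shared);
        HookContextSketch postHook = new HookContextSketch(shared);

        // The pre hook caches a summary; the post hook sees the same entry
        // because both contexts were handed the same (final) map.
        preHook.getInputPathToContentSummary().put("/warehouse/t1", 42L);
        System.out.println(postHook.getInputPathToContentSummary().get("/warehouse/t1"));  // prints 42
    }
}
```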
[jira] Commented: (HIVE-2051) getInputSummary() to call FileSystem.getContentSummary() in parallel
[ https://issues.apache.org/jira/browse/HIVE-2051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008221#comment-13008221 ]

Joydeep Sen Sarma commented on HIVE-2051:
-----------------------------------------

My bad - I thought Carl == M IS :-). Looking at the .3 patch, I am concerned about this code:

+    result.get();
+} catch (InterruptedException e) {
+    throw new IOException(e);

In a different block of code further down from this one, we ignore InterruptedException. It seems safer to ignore them (I am just not sure whether there's any reason to get a valid thread interrupt in the calling thread, and if so, what the thread is supposed to do in that case).

Is it necessary for the executor to terminate if all the tasks given to it are already terminated? (Trivial point, but it might reduce the code a bit.)
[jira] Commented: (HIVE-2051) getInputSummary() to call FileSystem.getContentSummary() in parallel
[ https://issues.apache.org/jira/browse/HIVE-2051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13007323#comment-13007323 ]

Joydeep Sen Sarma commented on HIVE-2051:
-----------------------------------------

Looked at the latest patch from Carl. I don't get it - why should we pay the cost of creating a thread when one is not required?
[jira] Updated: (HIVE-2039) remove hadoop version check from hive cli shell script
[ https://issues.apache.org/jira/browse/HIVE-2039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joydeep Sen Sarma updated HIVE-2039:
------------------------------------

    Fix Version/s: 0.8.0
     Release Note: The Hive startup shell script allows bypassing the 'hadoop version' subcommand by setting the HADOOP_VERSION environment variable. This can reduce the latency of starting up the Hive CLI.
           Status: Patch Available  (was: Reopened)

> remove hadoop version check from hive cli shell script
> ------------------------------------------------------
>
>                 Key: HIVE-2039
>                 URL: https://issues.apache.org/jira/browse/HIVE-2039
>             Project: Hive
>          Issue Type: Improvement
>          Components: CLI
>            Reporter: Joydeep Sen Sarma
>            Assignee: Joydeep Sen Sarma
>             Fix For: 0.8.0
>         Attachments: HIVE-2039.1.patch
>
> Looking at CLI startup times, one thing I noticed is that the version check in execHiveCmd.sh consumes 0.5-1s of wall-clock time (depending on where Hive is installed).
> AFAIK Hive doesn't support versions less than 0.20 right now, and this check only tests whether the version is less than 0.20. So we should be able to safely take it out. Please comment if that is not the case.
[jira] Updated: (HIVE-2039) remove hadoop version check from hive cli shell script
[ https://issues.apache.org/jira/browse/HIVE-2039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joydeep Sen Sarma updated HIVE-2039:
------------------------------------

    Attachment: HIVE-2039.1.patch

Skip the Hadoop version check (via the hadoop subcommand) if the version is supplied by an environment variable.
[jira] Reopened: (HIVE-2039) remove hadoop version check from hive cli shell script
[ https://issues.apache.org/jira/browse/HIVE-2039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joydeep Sen Sarma reopened HIVE-2039:
-------------------------------------

    Assignee: Joydeep Sen Sarma

Spoke too fast - the version check has moved to a different place, but it's still there. I plan to provide a way to avoid the version check (via an env var) because it's pretty expensive. It would still run by default.
[jira] Resolved: (HIVE-2039) remove hadoop version check from hive cli shell script
[ https://issues.apache.org/jira/browse/HIVE-2039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joydeep Sen Sarma resolved HIVE-2039.
-------------------------------------

    Resolution: Invalid

Never mind. Looks like it's already gone and I was looking at an old version.
[jira] Created: (HIVE-2039) remove hadoop version check from hive cli shell script
remove hadoop version check from hive cli shell script
------------------------------------------------------

                 Key: HIVE-2039
                 URL: https://issues.apache.org/jira/browse/HIVE-2039
             Project: Hive
          Issue Type: Improvement
          Components: CLI
            Reporter: Joydeep Sen Sarma

Looking at CLI startup times, one thing I noticed is that the version check in execHiveCmd.sh consumes 0.5-1s of wall-clock time (depending on where Hive is installed).

AFAIK Hive doesn't support versions less than 0.20 right now, and this check only tests whether the version is less than 0.20. So we should be able to safely take it out. Please comment if that is not the case.
[jira] Updated: (HIVE-2037) Merge result file size should honor hive.merge.size.per.task
[ https://issues.apache.org/jira/browse/HIVE-2037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joydeep Sen Sarma updated HIVE-2037:
------------------------------------

       Resolution: Fixed
    Fix Version/s: 0.8.0
           Status: Resolved  (was: Patch Available)

> Merge result file size should honor hive.merge.size.per.task
> ------------------------------------------------------------
>
>                 Key: HIVE-2037
>                 URL: https://issues.apache.org/jira/browse/HIVE-2037
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Ning Zhang
>            Assignee: Ning Zhang
>             Fix For: 0.8.0
>         Attachments: HIVE-2037.patch
>
> The merge job sets mapred.min.split.size to the value of hive.merge.size.per.task, which roughly equals the output file size. However, the input split size is also determined by mapred.min.split.size.per.node, mapred.min.split.size.per.rack, and mapred.max.split.size. They should be set to the same value as hive.merge.size.per.task as well.
[jira] Commented: (HIVE-2037) Merge result file size should honor hive.merge.size.per.task
[ https://issues.apache.org/jira/browse/HIVE-2037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13004866#comment-13004866 ] Joydeep Sen Sarma commented on HIVE-2037:

Committed. Thanks Ning.
[jira] Commented: (HIVE-2037) Merge result file size should honor hive.merge.size.per.task
[ https://issues.apache.org/jira/browse/HIVE-2037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13004802#comment-13004802 ] Joydeep Sen Sarma commented on HIVE-2037:

Looks OK - please run the tests and I will commit.
[jira] Resolved: (HIVE-1833) Task-cleanup task should be disabled
[ https://issues.apache.org/jira/browse/HIVE-1833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joydeep Sen Sarma resolved HIVE-1833.

Resolution: Fixed
Fix Version/s: 0.8.0

> Task-cleanup task should be disabled
>
> Key: HIVE-1833
> URL: https://issues.apache.org/jira/browse/HIVE-1833
> Project: Hive
> Issue Type: Improvement
> Components: Server Infrastructure
> Reporter: Scott Chen
> Assignee: Scott Chen
> Fix For: 0.8.0
>
> Attachments: HIVE-1833.1.txt, HIVE-1833.txt
>
> Currently when a task fails, a cleanup attempt will be scheduled right after that. This is unnecessary and increases latency. MapReduce will allow disabling this (see MAPREDUCE-2206). After that patch is committed, we should set the JobConf in Hive to disable the cleanup task.
[jira] Commented: (HIVE-1833) Task-cleanup task should be disabled
[ https://issues.apache.org/jira/browse/HIVE-1833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13002165#comment-13002165 ] Joydeep Sen Sarma commented on HIVE-1833:

Committed - thanks Scott!
[jira] Commented: (HIVE-1833) Task-cleanup task should be disabled
[ https://issues.apache.org/jira/browse/HIVE-1833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13000609#comment-13000609 ] Joydeep Sen Sarma commented on HIVE-1833:

Can you also make this change in shims/src/0.20S/java/org/apache/hadoop/hive/shims/Hadoop20SShims.java? It looks like we have two shims for hadoop-0.20 (one for hadoop-0.20 with security).
[jira] Created: (HIVE-1968) data corruption with multi-table insert
data corruption with multi-table insert

Key: HIVE-1968
URL: https://issues.apache.org/jira/browse/HIVE-1968
Project: Hive
Issue Type: Bug
Components: Query Processor
Affects Versions: 0.7.0
Reporter: Joydeep Sen Sarma

I had to run a conversion process to compute a checksum (sum(hash(all-columns))) of a table and convert it to a different compression format. Trying to be clever, I did both in a single pass by doing something equivalent to:

  from (select col1, col2, hash(col1, col2) as val from table_to_be_converted) i
  insert overwrite table table_to_be_generated select i.col1, i.col2
  insert overwrite table table_to_be_converted_checksum select sum(hash(i.val));

The plan looked correct. However, the data produced was erroneous - the checksums and the data were both wrong (and consistent with each other). I know this because:
- the checksum computed by the above query didn't match the checksum on the input table when calculated separately
- the checksum of the data output by this query (first insert clause) didn't match the input table's checksum (neither the one computed by the query above, nor the one computed separately)

Later on, I broke this query up into two independent ones, and the data and checksums were good (i.e. they all matched up). So it seems like there is some data corruption happening in multi-table insert.
[jira] Updated: (HIVE-1852) Reduce unnecessary DFSClient.rename() calls
[ https://issues.apache.org/jira/browse/HIVE-1852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joydeep Sen Sarma updated HIVE-1852:

Resolution: Fixed
Fix Version/s: 0.7.0
Status: Resolved (was: Patch Available)

Committed. Thanks Ning.

> Reduce unnecessary DFSClient.rename() calls
>
> Key: HIVE-1852
> URL: https://issues.apache.org/jira/browse/HIVE-1852
> Project: Hive
> Issue Type: Improvement
> Reporter: Ning Zhang
> Assignee: Ning Zhang
> Fix For: 0.7.0
>
> Attachments: HIVE-1852.2.patch, HIVE-1852.3.patch, HIVE-1852.4.patch, HIVE-1852.5.patch, HIVE-1852.6.patch, HIVE-1852.patch
>
> On the Hive client side (MoveTask etc.), DFSClient.rename() is called for every file inside a directory. This is very expensive for a large directory on a busy DFS namenode. We should replace it with a single rename() call on the whole directory.
[jira] Commented: (HIVE-1852) Reduce unnecessary DFSClient.rename() calls
[ https://issues.apache.org/jira/browse/HIVE-1852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12973883#action_12973883 ] Joydeep Sen Sarma commented on HIVE-1852:

Hive.java:1564 - this should read fs.rename(srcs[0]) (since srcf may have been a wildcard that matched a single dir).

Hive.java:1574 - we can optimize this loop, I think. If the wildcard does not match a single directory, then it has to match a set of files; the load semantic analyzer already enforces this. So we don't need a second listStatus and loop over the entries here - we can directly move each of the srcs into destf/src.getName().

We have lost the atomic move for the wildcard case. I think that's OK (it's not used much, I would imagine) - but at least leave a note/todo saying that it would be nice to have this atomic.

The new tests look pretty good to me - the load/move case with wildcards is getting covered. We could add one where the load path is a wildcard that matches a single dir, to cover the first comment here.
[jira] Commented: (HIVE-1852) Reduce unnecessary DFSClient.rename() calls
[ https://issues.apache.org/jira/browse/HIVE-1852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12973387#action_12973387 ] Joydeep Sen Sarma commented on HIVE-1852:

Regarding wildcards: load data inpath /x/*.txt - does that work?

The copy task should not happen if the source and destination file systems are the same (in the load command). I think you may have observed the copy task kick in because, in the Hive unit test environment, the default file system is pfile:// and the local file system is not the same as pfile:// (from a Java class standpoint), so CopyTask kicks in. You might want to try a load command without the 'local' keyword.

Also, regarding the question about FileSystem.delete versus FsShell: testing should be easy because we have two working file systems in our test environment (file:// and pfile://). We can create a partition pointing to one filesystem, then try to change its location to the other filesystem and make sure things work.
[jira] Commented: (HIVE-1852) Reduce unnecessary DFSClient.rename() calls
[ https://issues.apache.org/jira/browse/HIVE-1852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12973353#action_12973353 ] Joydeep Sen Sarma commented on HIVE-1852:

Cool - the FsShell removal sounds good unless Yongqiang says otherwise.

I am pretty sure this patch breaks the load command with a wildcard, though. It seems to me that the load command simply passes the input path (with the wildcard pattern) to the loadTable/loadPartition methods (via LoadTableDesc). These methods were previously capable of handling wildcards that matched a set of files; now they will not be able to do that. Ning - can you confirm this? (Maybe add a test trying to load a wildcard pattern?)

On a more minor note: the checkPaths call that got taken out was checking for the presence of nested subdirectories inside the path being loaded. Is this no longer necessary? (Do we support directories within partitions/tables automatically at query time?)
[jira] Commented: (HIVE-1852) Reduce unnecessary DFSClient.rename() calls
[ https://issues.apache.org/jira/browse/HIVE-1852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12973300#action_12973300 ] Joydeep Sen Sarma commented on HIVE-1852:

Ning - do you know why we had the FsShell and -rmr code (in Hive.replaceFiles)? I don't think it was there originally, and there must have been a reason why it got put in. This patch is taking it out and I wanted to be sure this is OK. (I am wondering if the -rmr is there to handle non-HDFS file systems.)
[jira] Commented: (HIVE-1852) Reduce unnecessary DFSClient.rename() calls
[ https://issues.apache.org/jira/browse/HIVE-1852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12971991#action_12971991 ] Joydeep Sen Sarma commented on HIVE-1852:

What happens when there are multiple srcs?

  srcs = fs.listStatus(srcf);
  for (FileStatus src : srcs) {
    if (!fs.rename(src.getPath(), tmppath)) {

We can't rename multiple sources to the same target. This happens when the input path is a glob (which I think the load command allows). Can we special-case this optimization (apply it only when srcs.length == 1)?
[jira] Commented: (HIVE-1852) Reduce unnecessary DFSClient.rename() calls
[ https://issues.apache.org/jira/browse/HIVE-1852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12971926#action_12971926 ] Joydeep Sen Sarma commented on HIVE-1852:

Oh - sorry - my previous comment was wrong. If you rename src to tmpPath, you lose all previous contents of tmpPath, which is not what the function does: it retains contents of tmpPath that don't collide with contents of src (the load command uses this, I think, and merges data in). Maybe we need a different call for the specific case you are trying to optimize.

If this is not being picked up by the tests, that's pretty bad.
[jira] Commented: (HIVE-1852) Reduce unnecessary DFSClient.rename() calls
[ https://issues.apache.org/jira/browse/HIVE-1852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12971922#action_12971922 ] Joydeep Sen Sarma commented on HIVE-1852:

You need a test where src is a directory. Please try different variants of load commands.
[jira] Commented: (HIVE-1852) Reduce unnecessary DFSClient.rename() calls
[ https://issues.apache.org/jira/browse/HIVE-1852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12971920#action_12971920 ] Joydeep Sen Sarma commented on HIVE-1852:

The old code renames src/item_i to tmpPath/item_i. The new code renames src to tmpPath/src, so item_i's final position is tmpPath/src/item_i. What am I missing?
[jira] Commented: (HIVE-1852) Reduce unnecessary DFSClient.rename() calls
[ https://issues.apache.org/jira/browse/HIVE-1852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12971893#action_12971893 ] Joydeep Sen Sarma commented on HIVE-1852:

Are you sure this is OK? It seems we have changed the semantics: the old code takes each file from underneath the dir and moves it into the final location, while the new code moves the directory itself underneath the final location. There's one extra level of directory in the new code that's not there in the old code.

Also, the semantics in terms of collisions change because of this. If we create a subdir, then collisions that would occur in the old code may not occur in the new code (because of the rename).
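The semantics gap debated in this review thread - per-file moves that merge into an existing target versus one whole-directory rename that nests the source - can be sketched with java.nio.file in place of the Hadoop FileSystem API. This is an illustration of the two behaviors, not the committed Hive code; the class and method names are invented for the sketch.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class RenameSemantics {
    // Old behavior: move each child of src into dest, merging with whatever
    // dest already holds. One namenode-style call per file.
    static void moveChildren(Path src, Path dest) throws IOException {
        try (DirectoryStream<Path> items = Files.newDirectoryStream(src)) {
            for (Path item : items) {
                Files.move(item, dest.resolve(item.getFileName()));
            }
        }
    }

    // Proposed optimization: a single rename of the whole directory. Note the
    // different result: src itself ends up *under* dest (one extra directory
    // level), and nothing is merged with dest's existing contents.
    static void moveWhole(Path src, Path dest) throws IOException {
        Files.move(src, dest.resolve(src.getFileName()));
    }
}
```

After moveChildren, dest contains src's files directly; after moveWhole, dest contains a subdirectory named after src - which is exactly the extra level the comment above flags.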
[jira] Resolved: (HIVE-1846) change hive assumption that local mode mappers/reducers always run in same jvm
[ https://issues.apache.org/jira/browse/HIVE-1846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joydeep Sen Sarma resolved HIVE-1846.

Resolution: Fixed

Committed the new patch. Thanks Ram.
[jira] Updated: (HIVE-1846) change hive assumption that local mode mappers/reducers always run in same jvm
[ https://issues.apache.org/jira/browse/HIVE-1846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joydeep Sen Sarma updated HIVE-1846:

Assignee: Ramkumar Vadali (was: Joydeep Sen Sarma)
[jira] Resolved: (HIVE-1846) change hive assumption that local mode mappers/reducers always run in same jvm
[ https://issues.apache.org/jira/browse/HIVE-1846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joydeep Sen Sarma resolved HIVE-1846.

Resolution: Fixed
Fix Version/s: 0.7.0

Committed - thanks Ram.
[jira] Commented: (HIVE-1675) SAXParseException on plan.xml during local mode.
[ https://issues.apache.org/jira/browse/HIVE-1675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12970201#action_12970201 ] Joydeep Sen Sarma commented on HIVE-1675:

The combination of auto-local and parallel execution had a bug that was recently fixed (HIVE-1776).

The stack reported above is weird - we should not be entering this code path. As you mentioned, HIVE-1846 will fix this - but currently, the getMapRedWork call invoked from different places in the stack should be satisfied from an in-memory cache (and shouldn't hit the file). The file is not written out for local mode (because there is an assumption that everything runs in the same JVM). I am unable to explain this. I think it is worth fixing, because the plan file processing takes some time and it's better to retrieve the plan from memory where possible.

> SAXParseException on plan.xml during local mode.
>
> Key: HIVE-1675
> URL: https://issues.apache.org/jira/browse/HIVE-1675
> Project: Hive
> Issue Type: Bug
> Components: Query Processor
> Affects Versions: 0.7.0
> Reporter: Bennie Schut
> Assignee: Bennie Schut
> Fix For: 0.7.0
>
> Attachments: HIVE-1675.patch, local_10005_plan.xml, local_10006_plan.xml
>
> When Hive switches to local mode (hive.exec.mode.local.auto=true) I receive a SAX parser exception on the plan.xml. If I set hive.exec.mode.local.auto=false I get the correct results.
[jira] Created: (HIVE-1846) change hive assumption that local mode mappers/reducers always run in same jvm
change hive assumption that local mode mappers/reducers always run in same jvm

Key: HIVE-1846
URL: https://issues.apache.org/jira/browse/HIVE-1846
Project: Hive
Issue Type: Bug
Reporter: Joydeep Sen Sarma
Assignee: Joydeep Sen Sarma

We are trying out a version of Hadoop local mode that runs multiple mappers/reducers by spawning JVMs for them. In this mode, Hive mappers fail to read the plan file. It seems that we assume (in the setMapredWork call) that local mode mappers/reducers will run in the same JVM (so we can cache the current plan in a global variable and avoid serializing it to a path). This needs to get fixed.
[jira] Commented: (HIVE-1778) simultaneously launched queries collide on hive intermediate directories
[ https://issues.apache.org/jira/browse/HIVE-1778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12968859#action_12968859 ] Joydeep Sen Sarma commented on HIVE-1778:

Whatever works - for example, we could hash the query string and the time (perhaps a nanosecond timer) to come up with a better seed for the random generator.
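The collision described in HIVE-1778 and the seeding idea suggested above can be sketched as follows. This is illustrative only - the class and method names are invented here, and this is not the committed Hive fix; it just shows why same-millisecond seeds collide and how folding the query string and a nanosecond timer into the seed makes simultaneous queries diverge.

```java
import java.util.Random;

public class ExecutionIdSeed {
    // Two Randoms seeded with the same millisecond value yield identical
    // sequences - this is how two queries launched in the same millisecond
    // could pick identical intermediate directory names.
    static boolean collides(long millis) {
        return new Random(millis).nextLong() == new Random(millis).nextLong();
    }

    // The seed suggested in the comment: combine a hash of the query string
    // with a nanosecond timer, so two queries started in the same millisecond
    // (or even the same query started twice) get different seeds.
    static long betterSeed(String queryString) {
        return ((long) queryString.hashCode() << 32) ^ System.nanoTime();
    }
}
```

With the original scheme, a workflow engine firing several queries in one millisecond makes collisions likely; with the query hash folded in, even identical launch times produce distinct seeds for distinct queries.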
[jira] Commented: (HIVE-1695) MapJoin followed by ReduceSink should be done as single MapReduce Job
[ https://issues.apache.org/jira/browse/HIVE-1695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12966247#action_12966247 ] Joydeep Sen Sarma commented on HIVE-1695:

A couple of things to watch out for:

- Mapjoin uses a lot of memory on the mapper. I am not sure how the memory settings are controlled, but we need to make sure that the map-join and the sort (imposed by the ReduceSink) don't blow through the task heap limits. In case the RS comes from a group-by, the map-side hash aggregation will also use memory.

- The work Liyin has been doing converts regular joins into map joins automatically. I believe he generates several plans (map-join and sort-merge join) and chooses one of them at runtime. Will the technique discussed here apply to map-join plans generated by auto-map-joins? (I am not sure, so asking.)

> MapJoin followed by ReduceSink should be done as single MapReduce Job
>
> Key: HIVE-1695
> URL: https://issues.apache.org/jira/browse/HIVE-1695
> Project: Hive
> Issue Type: Improvement
> Components: Query Processor
> Reporter: Amareshwari Sriramadasu
>
> Currently MapJoin followed by ReduceSink runs as two MapReduce jobs: one map-only job followed by a map-reduce job. It can be combined into a single MapReduce job.
[jira] Updated: (HIVE-1776) parallel execution and auto-local mode combine to place plan file in wrong file system
[ https://issues.apache.org/jira/browse/HIVE-1776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joydeep Sen Sarma updated HIVE-1776:

Attachment: HIVE-1776.2.patch

> parallel execution and auto-local mode combine to place plan file in wrong file system
>
> Key: HIVE-1776
> URL: https://issues.apache.org/jira/browse/HIVE-1776
> Project: Hive
> Issue Type: Bug
> Reporter: Joydeep Sen Sarma
> Assignee: Joydeep Sen Sarma
> Attachments: HIVE-1776.1.patch, HIVE-1776.2.patch
>
> A query (that I can't reproduce verbatim) submits a job to an MR cluster with a plan file that is resident on the local file system. This job obviously fails. This seems to result from an interaction between parallel execution and auto-local mode (parallel execution tries to run one local and one remote job at the same time). Turning off either parallel execution or auto-local mode seems to fix the problem.
[jira] Commented: (HIVE-1776) parallel execution and auto-local mode combine to place plan file in wrong file system
[ https://issues.apache.org/jira/browse/HIVE-1776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930368#action_12930368 ] Joydeep Sen Sarma commented on HIVE-1776:

Yeah, it was - but shoot, I forgot to take out the corresponding call in the finally block that restores the tracker. Will upload a new patch. These calls are no longer necessary because we are using a cloned configuration object that is discarded once the task completes.
[jira] Updated: (HIVE-1776) parallel execution and auto-local mode combine to place plan file in wrong file system
[ https://issues.apache.org/jira/browse/HIVE-1776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joydeep Sen Sarma updated HIVE-1776:

Assignee: Joydeep Sen Sarma
Status: Patch Available (was: Open)
[jira] Updated: (HIVE-1776) parallel execution and auto-local mode combine to place plan file in wrong file system
[ https://issues.apache.org/jira/browse/HIVE-1776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joydeep Sen Sarma updated HIVE-1776: Attachment: HIVE-1776.1.patch the problem is that tasks are trying to modify the shared hive configuration object and trampling each other. fix is to clone the configuration object before modifying it in the Task. > parallel execution and auto-local mode combine to place plan file in wrong > file system > -- > > Key: HIVE-1776 > URL: https://issues.apache.org/jira/browse/HIVE-1776 > Project: Hive > Issue Type: Bug >Reporter: Joydeep Sen Sarma > Attachments: HIVE-1776.1.patch > > > A query (that i can't reproduce verbatim) submits a job to a MR cluster with > a plan file that is resident on the local file system. This job obviously > fails. > This seems to result from an interaction between the parallel execution > (which is trying to run one local and one remote job at the same time). > Turning off either the parallel execution mode or the auto-local mode seems > to fix the problem.
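The clone-before-modify fix described above can be sketched as follows. This is an illustration of the pattern only: a HashMap stands in for Hadoop's JobConf, and the class and method names are invented for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified sketch of the HIVE-1776 fix: each Task clones the shared
// configuration before modifying it, so parallel tasks (e.g. one running
// locally and one on the cluster) cannot trample each other's settings.
// A HashMap stands in for Hadoop's JobConf; the real patch clones the
// JobConf instead.
public class TaskConfClone {
    // Shared session-level configuration (one per Hive session).
    static final Map<String, String> sessionConf = new HashMap<>();

    // Returns a private copy for one task; mutations stay local to the task.
    static Map<String, String> confForTask(String jobTracker) {
        Map<String, String> clone = new HashMap<>(sessionConf); // copy, don't share
        clone.put("mapred.job.tracker", jobTracker);            // task-local override
        return clone;
    }
}
```

Because each task mutates only its own copy, there is also no need for a finally block that restores the tracker afterwards - the clone is simply discarded when the task completes.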
[jira] Created: (HIVE-1778) simultaneously launched queries collide on hive intermediate directories
simultaneously launched queries collide on hive intermediate directories Key: HIVE-1778 URL: https://issues.apache.org/jira/browse/HIVE-1778 Project: Hive Issue Type: Bug Reporter: Joydeep Sen Sarma we saw one instance of multiple queries for the same user launched in parallel (from a workflow engine) use the same intermediate directories. which is obviously super bad but not surprising considering how we allocate them: Random rand = new Random(); String executionId = "hive_" + format.format(new Date()) + "_" + Math.abs(rand.nextLong()); Java documentation says: Two Random objects created within the same millisecond will have the same sequence of random numbers.
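One collision-proof alternative for allocating the execution id is sketched below. This is a suggestion for illustration, not the committed fix; the timestamp format string is assumed.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.UUID;

// Sketch of a collision-proof replacement for Hive's executionId
// (HIVE-1778). Instead of a time-seeded Random (two instances created in
// the same millisecond could historically produce identical sequences),
// a random UUID guarantees uniqueness across concurrently launched
// queries. The "hive_" prefix and timestamp mirror the original scheme;
// the date format pattern here is assumed, not copied from Hive.
public class ExecutionId {
    static String next() {
        SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd_HH-mm-ss_SSS");
        return "hive_" + format.format(new Date()) + "_" + UUID.randomUUID();
    }
}
```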
[jira] Created: (HIVE-1776) parallel execution and auto-local mode combine to place plan file in wrong file system
parallel execution and auto-local mode combine to place plan file in wrong file system -- Key: HIVE-1776 URL: https://issues.apache.org/jira/browse/HIVE-1776 Project: Hive Issue Type: Bug Reporter: Joydeep Sen Sarma A query (that i can't reproduce verbatim) submits a job to a MR cluster with a plan file that is resident on the local file system. This job obviously fails. This seems to result from an interaction between the parallel execution (which is trying to run one local and one remote job at the same time). Turning off either the parallel execution mode or the auto-local mode seems to fix the problem.
[jira] Commented: (HIVE-1721) use bloom filters to improve the performance of joins
[ https://issues.apache.org/jira/browse/HIVE-1721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12927555#action_12927555 ] Joydeep Sen Sarma commented on HIVE-1721: - @Siying - that's a good question. I don't know statistically how common it is - but we have heard requests along these lines. for example, one use case we have seen is a project that wanted data for a reasonably large subset of the users - 0.2% of users were interesting in that case, but even 0.2% is very large for us. people also use semi-joins, and that pretty much says that people want to filter rows out. > use bloom filters to improve the performance of joins > - > > Key: HIVE-1721 > URL: https://issues.apache.org/jira/browse/HIVE-1721 > Project: Hive > Issue Type: New Feature > Components: Query Processor >Reporter: Namit Jain >Assignee: Siying Dong > > In case of map-joins, it is likely that the big table will not find many > matching rows from the small table. > Currently, we perform a hash-map lookup for every row in the big table, which > can be pretty expensive. > It might be useful to try out a bloom-filter containing all the elements in > the small table. > Each element from the big table is first searched in the bloom filter, and > only in case of a positive match, > the small table hash table is explored.
[jira] Commented: (HIVE-1721) use bloom filters to improve the performance of joins
[ https://issues.apache.org/jira/browse/HIVE-1721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12927544#action_12927544 ] Joydeep Sen Sarma commented on HIVE-1721: - a bloom filter takes 10 bits per entry (at a reasonable false-positive probability - i remember reading this value from wikipedia). Our java hash tables take 2000 bytes per key-value pair (based on tests done by Liyin for reasonable sized keys/values). So the idea is that if the small table is too big to be loaded into memory - but small enough that its bloom filter can be stored in memory - then we can first do a filter of the large table and then do the join. > use bloom filters to improve the performance of joins > - > > Key: HIVE-1721 > URL: https://issues.apache.org/jira/browse/HIVE-1721 > Project: Hive > Issue Type: New Feature > Components: Query Processor >Reporter: Namit Jain >Assignee: Siying Dong > > In case of map-joins, it is likely that the big table will not find many > matching rows from the small table. > Currently, we perform a hash-map lookup for every row in the big table, which > can be pretty expensive. > It might be useful to try out a bloom-filter containing all the elements in > the small table. > Each element from the big table is first searched in the bloom filter, and > only in case of a positive match, > the small table hash table is explored.
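The sizing arithmetic above can be sanity-checked with the standard bloom filter formula m = -n·ln(p)/(ln 2)² bits for n keys at false-positive rate p, which works out to roughly 10 bits per key at p ≈ 1%. The 2000-bytes-per-entry figure below is the empirical number quoted in the comment, not a general constant.

```java
// Back-of-the-envelope memory comparison: bloom filter bits per key vs
// the ~2000 bytes/entry measured for Hive's in-memory hash tables.
public class BloomSizing {
    // Optimal bloom filter size per key: -ln(p) / (ln 2)^2 bits.
    static double bloomBitsPerKey(double p) {
        return -Math.log(p) / (Math.log(2) * Math.log(2));
    }

    // How many times smaller the bloom filter is than the hash table
    // for n keys at false-positive rate p.
    static double memoryRatio(long n, double p, long hashBytesPerEntry) {
        double bloomBytes = n * bloomBitsPerKey(p) / 8.0;
        return (n * (double) hashBytesPerEntry) / bloomBytes;
    }
}
```

At p = 1% and 2000 bytes per hash entry, the bloom filter is over a thousand times smaller - which is why a table too big for a map-join hash table can still fit comfortably as a bloom filter.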
[jira] Commented: (HIVE-1721) use bloom filters to improve the performance of map joins
[ https://issues.apache.org/jira/browse/HIVE-1721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12922154#action_12922154 ] Joydeep Sen Sarma commented on HIVE-1721: - i am not so sure about this. consider a hash table which has a very large number of buckets (relative to the number of elements in the hashtable). a lookup inside the hashtable stops as soon as we hit an empty bucket. this requires us to only compute the hashcode(). if #buckets >> #elements - then for a miss - the likely average cost of the miss should only be the cost of the hashcode routine. now consider a bloom filter. here we have to compute multiple hash codes (or at least one). on top of that - with the added bloom filter - there's an added cost for each positive (many hashcode computations). so from this reasoning a bloom filter doesn't look any cheaper than a sparse hash table for small-table joins. note that the hashtables in java do allow specification of number of buckets - so the strategy outlined here (of deliberately constructing a sparse hash table) is a feasible one. Stepping back - this makes sense - because Bloom filters are designed for large data sets (or at least data sets that don't easily fit in memory) - not small ones (that fit easily in memory). --- It would be more interesting to consider Bloom filters to cover join scenarios that cannot be performed with map join. for example - if the small table had 1M keys and map-join is not able to handle that large a hash table - then one can use bloom filters: - filter (probabilistically) large table against medium sized table by looking up against bloom filter of medium-sized table (map-side bloom filter). (Note - this is not a join - just a filter) - take filtered output and do sort-merge join against medium sized table (by now the data size should be greatly reduced and the cost of sorting would go down tremendously). 
there's lots of literature around this - it's a pretty well known technique. it's quite different from what's proposed in this jira. > use bloom filters to improve the performance of map joins > - > > Key: HIVE-1721 > URL: https://issues.apache.org/jira/browse/HIVE-1721 > Project: Hadoop Hive > Issue Type: New Feature > Components: Query Processor >Reporter: Namit Jain >Assignee: Liyin Tang > > In case of map-joins, it is likely that the big table will not find many > matching rows from the small table. > Currently, we perform a hash-map lookup for every row in the big table, which > can be pretty expensive. > It might be useful to try out a bloom-filter containing all the elements in > the small table. > Each element from the big table is first searched in the bloom filter, and > only in case of a positive match, > the small table hash table is explored.
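The filter-then-join plan described in the comment above can be illustrated with a minimal bloom filter acting as a map-side pre-filter. This is an illustrative sketch, not Hive code; the double-hashing construction and all class and method names are invented for the example.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Sketch of the filter-then-join idea: build a bloom filter over the
// medium-sized table's join keys, stream the large table through it, and
// keep only rows that (probabilistically) match. Only the survivors go
// on to the sort-merge join, so the expensive sort sees far less data.
public class BloomPrefilter {
    final BitSet bits;
    final int m, k;

    BloomPrefilter(int m, int k) { this.bits = new BitSet(m); this.m = m; this.k = k; }

    void add(String key) {
        int h1 = key.hashCode(), h2 = h1 >>> 16 | 1;   // two derived hashes
        for (int i = 0; i < k; i++)
            bits.set(Math.floorMod(h1 + i * h2, m));
    }

    boolean mightContain(String key) {
        int h1 = key.hashCode(), h2 = h1 >>> 16 | 1;
        for (int i = 0; i < k; i++)
            if (!bits.get(Math.floorMod(h1 + i * h2, m))) return false; // definite miss
        return true; // possible hit (may be a false positive)
    }

    // Map-side pass: keep only big-table rows whose key might join.
    static List<String> prefilter(List<String> bigTableKeys, BloomPrefilter f) {
        List<String> survivors = new ArrayList<>();
        for (String key : bigTableKeys)
            if (f.mightContain(key)) survivors.add(key);
        return survivors;
    }
}
```

Note the asymmetry that makes this safe: the filter can pass a non-matching key (a false positive, later discarded by the real join) but can never drop a matching one, so the subsequent sort-merge join still produces the correct result.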
[jira] Commented: (HIVE-1620) Patch to write directly to S3 from Hive
[ https://issues.apache.org/jira/browse/HIVE-1620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920808#action_12920808 ] Joydeep Sen Sarma commented on HIVE-1620: - i agree that the speed efficiency may be worth the tradeoff in consistency. as you say - the messaging is critical. can we gate this feature on a new hive option that makes the user conscious of this tradeoff? regarding the cleanup - please look at jobClose method in FileSinkOperator (I think). if the hive client is still functioning at the time the job fails - we can make an attempt to clean things up there (assuming that the file names are unique - which i am not sure about right now because we made some changes to shorten file names (that might have to be undone for this feature)). one thing we have experienced in the past is that hadoop tasks continue to do stuff even after the job is technically 'complete'. so i think while the cleanup can help the 99% use case - there will be marginal cases where the output directory gets written to when it shouldn't. so having this gated on an option would still be worthwhile IMHO (for users who cannot afford the speed-accuracy tradeoff). > Patch to write directly to S3 from Hive > --- > > Key: HIVE-1620 > URL: https://issues.apache.org/jira/browse/HIVE-1620 > Project: Hadoop Hive > Issue Type: New Feature >Reporter: Vaibhav Aggarwal >Assignee: Vaibhav Aggarwal > Attachments: HIVE-1620.patch > > > We want to submit a patch to Hive which allows user to write files directly > to S3. > This patch allow user to specify an S3 location as the table output location > and hence eliminates the need of copying data from HDFS to S3. > Users can run Hive queries directly over the data stored in S3. > This patch helps integrate hive with S3 better and quicker.
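The option-gating suggested in the comment above could look like the following sketch. The option name hive.exec.s3.direct.write is hypothetical (chosen for illustration, not an actual Hive setting), and java.util.Properties stands in for HiveConf.

```java
import java.util.Properties;

// Sketch of gating direct-to-S3 writes behind an explicit option, so
// users consciously opt in to the consistency tradeoff. Defaults to off.
public class S3WriteGate {
    // Hypothetical option name, invented for this example.
    static final String DIRECT_WRITE_KEY = "hive.exec.s3.direct.write";

    // Direct writes stay disabled unless the user explicitly enables them.
    static boolean directWriteEnabled(Properties conf) {
        return Boolean.parseBoolean(conf.getProperty(DIRECT_WRITE_KEY, "false"));
    }
}
```

Defaulting to off means users who cannot tolerate the marginal cases (e.g. straggler tasks writing after the job is 'complete') are unaffected unless they opt in.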