[jira] Commented: (HIVE-417) Implement Indexing in Hive

2010-07-08 Thread Jeff Hammerbacher (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886634#action_12886634
 ] 

Jeff Hammerbacher commented on HIVE-417:


Hey,

Any chance you guys could post a more detailed design document for 
"full-fledged index support"? I'm quite curious to read up on it.

Thanks,
Jeff

> Implement Indexing in Hive
> --
>
> Key: HIVE-417
> URL: https://issues.apache.org/jira/browse/HIVE-417
> Project: Hadoop Hive
>  Issue Type: New Feature
>  Components: Metastore, Query Processor
>Affects Versions: 0.3.0, 0.3.1, 0.4.0, 0.6.0
>Reporter: Prasad Chakka
>Assignee: He Yongqiang
> Attachments: hive-417.proto.patch, hive-417-2009-07-18.patch, 
> hive-indexing.3.patch, hive-indexing.5.thrift.patch, 
> indexing_with_ql_rewrites_trunk_953221.patch
>
>
> Implement indexing on Hive so that lookup and range queries are efficient.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HIVE-1305) add progress in join and groupby

2010-07-08 Thread Siying Dong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siying Dong updated HIVE-1305:
--

Attachment: hive.1305.3.patch

> add progress in join and groupby
> 
>
> Key: HIVE-1305
> URL: https://issues.apache.org/jira/browse/HIVE-1305
> Project: Hadoop Hive
>  Issue Type: Bug
>  Components: Query Processor
>Reporter: Namit Jain
>Assignee: Siying Dong
> Attachments: hive.1305.1.patch, hive.1305.2.patch, hive.1305.3.patch
>
>
> The operators join and groupby can consume a lot of rows before producing any 
> output. 
> All operators which do not have an output for every input should report 
> progress periodically.
> Currently, it is only being done for ScriptOperator and FilterOperator.




[jira] Updated: (HIVE-1455) lateral view does not work with column pruning

2010-07-08 Thread He Yongqiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

He Yongqiang updated HIVE-1455:
---

Attachment: hive.1455.1.patch

running tests now.

> lateral view does not work with column pruning 
> ---
>
> Key: HIVE-1455
> URL: https://issues.apache.org/jira/browse/HIVE-1455
> Project: Hadoop Hive
>  Issue Type: Bug
>Reporter: He Yongqiang
>Assignee: He Yongqiang
> Fix For: 0.6.0, 0.7.0
>
> Attachments: hive.1455.1.patch
>
>





[jira] Created: (HIVE-1455) lateral view does not work with column pruning

2010-07-08 Thread He Yongqiang (JIRA)
lateral view does not work with column pruning 
---

 Key: HIVE-1455
 URL: https://issues.apache.org/jira/browse/HIVE-1455
 Project: Hadoop Hive
  Issue Type: Bug
Reporter: He Yongqiang
Assignee: He Yongqiang
 Fix For: 0.6.0, 0.7.0







[jira] Commented: (HIVE-1305) add progress in join and groupby

2010-07-08 Thread He Yongqiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886608#action_12886608
 ] 

He Yongqiang commented on HIVE-1305:


Overall looks good to me.

minor comments:

1. In GroupByOp's flush, should
countAfterReport = 0;
be placed at the beginning of the function?
2. In AbstractMapjoin, is
heartbeatInterval = HiveConf.getIntVar(hconf,
HiveConf.ConfVars.HIVESENDHEARTBEAT);
still needed? The parent common join op already has that.
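
For readers following the review, here is a minimal sketch of the periodic progress reporting under discussion. It is not the patch itself: only countAfterReport and heartbeatInterval come from the comments above, while ProgressCallbackSketch and BufferingOperatorSketch are hypothetical names made up for illustration. It assumes an operator that buffers rows and emits nothing until flush is called.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a progress callback; not Hive's or Hadoop's actual API. */
interface ProgressCallbackSketch {
    void progress();
}

/** Illustrative operator that buffers rows and reports progress while consuming input. */
class BufferingOperatorSketch {
    private final int heartbeatInterval;          // rows consumed between progress reports
    private final ProgressCallbackSketch reporter;
    private final List<String> buffered = new ArrayList<>();
    private int countAfterReport = 0;             // rows consumed since the last report

    BufferingOperatorSketch(int heartbeatInterval, ProgressCallbackSketch reporter) {
        this.heartbeatInterval = heartbeatInterval;
        this.reporter = reporter;
    }

    /** Consumes one row without emitting output; reports every heartbeatInterval rows. */
    void process(String row) {
        buffered.add(row);
        countAfterReport++;
        if (countAfterReport >= heartbeatInterval) {
            reporter.progress();
            countAfterReport = 0;
        }
    }

    /** Emits everything buffered so far; the counter is reset at the start of the function. */
    List<String> flush() {
        countAfterReport = 0;
        List<String> out = new ArrayList<>(buffered);
        buffered.clear();
        return out;
    }
}
```

Resetting the counter at the top of flush means it is cleared even if a later step in the function exits early, which is one possible reason for the placement suggested in item 1.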




> add progress in join and groupby
> 
>
> Key: HIVE-1305
> URL: https://issues.apache.org/jira/browse/HIVE-1305
> Project: Hadoop Hive
>  Issue Type: Bug
>  Components: Query Processor
>Reporter: Namit Jain
>Assignee: Siying Dong
> Attachments: hive.1305.1.patch, hive.1305.2.patch
>
>
> The operators join and groupby can consume a lot of rows before producing any 
> output. 
> All operators which do not have an output for every input should report 
> progress periodically.
> Currently, it is only being done for ScriptOperator and FilterOperator.




[jira] Updated: (HIVE-1096) Hive Variables

2010-07-08 Thread Edward Capriolo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Edward Capriolo updated HIVE-1096:
--

Attachment: hive-1096-12.patch.txt

Changed interpolate to substitute. Added the substitution logic to file, dfs, 
set, and query processor.

> Hive Variables
> --
>
> Key: HIVE-1096
> URL: https://issues.apache.org/jira/browse/HIVE-1096
> Project: Hadoop Hive
>  Issue Type: New Feature
>  Components: Query Processor
>Reporter: Edward Capriolo
>Assignee: Edward Capriolo
> Fix For: 0.6.0, 0.7.0
>
> Attachments: 1096-9.diff, hive-1096-10-patch.txt, 
> hive-1096-11-patch.txt, hive-1096-12.patch.txt, hive-1096-2.diff, 
> hive-1096-7.diff, hive-1096-8.diff, hive-1096.diff
>
>
> From mailing list:
> --Amazon Elastic MapReduce version of Hive seems to have a nice feature 
> called "Variables." Basically you can define a variable via command-line 
> while invoking hive with -d DT=2009-12-09 and then refer to the variable via 
> ${DT} within the hive queries. This could be extremely useful. I can't seem 
> to find this feature even on trunk. Is this feature currently anywhere in the 
> roadmap?--
> This could be implemented in many places.
> A simple place to put this is 
> in Driver.compile or Driver.run; we can do string substitutions at that level, 
> and further downstream need not be affected. 
> There could be some benefits to doing this further downstream (parser, plan), 
> but based on the simple needs we may not need to overthink this.
> I will get started on implementing in compile unless someone wants to discuss 
> this more.
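
The string-substitution approach described in the issue can be sketched roughly as follows. This is an illustration only and not the code in the attached patches: the class name VariableSubstitutionSketch, the method name substitute, and the rule for what counts as a variable name are all assumptions. It replaces ${NAME} with a value from a map before the query text would be handed to the compiler, and leaves unknown variables untouched.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative ${NAME} substitution applied to query text before compilation. */
class VariableSubstitutionSketch {
    // Assumed naming rule: a letter or underscore, then letters, digits, or underscores.
    private static final Pattern VAR =
        Pattern.compile("\\$\\{([A-Za-z_][A-Za-z0-9_]*)\\}");

    static String substitute(String query, Map<String, String> vars) {
        Matcher m = VAR.matcher(query);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = vars.get(m.group(1));
            // Unknown variables are left as written rather than replaced with a guess.
            String replacement = (value != null) ? value : m.group(0);
            // quoteReplacement keeps '$' and '\' in the value from being read as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

With DT mapped to 2009-12-09, as in the -d DT=2009-12-09 example from the mailing list, a query containing dt='${DT}' would come out as dt='2009-12-09'.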




[jira] Updated: (HIVE-1096) Hive Variables

2010-07-08 Thread Edward Capriolo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Edward Capriolo updated HIVE-1096:
--

Status: Patch Available  (was: Open)

> Hive Variables
> --
>
> Key: HIVE-1096
> URL: https://issues.apache.org/jira/browse/HIVE-1096
> Project: Hadoop Hive
>  Issue Type: New Feature
>  Components: Query Processor
>Reporter: Edward Capriolo
>Assignee: Edward Capriolo
> Fix For: 0.6.0, 0.7.0
>
> Attachments: 1096-9.diff, hive-1096-10-patch.txt, 
> hive-1096-11-patch.txt, hive-1096-12.patch.txt, hive-1096-2.diff, 
> hive-1096-7.diff, hive-1096-8.diff, hive-1096.diff
>
>
> From mailing list:
> --Amazon Elastic MapReduce version of Hive seems to have a nice feature 
> called "Variables." Basically you can define a variable via command-line 
> while invoking hive with -d DT=2009-12-09 and then refer to the variable via 
> ${DT} within the hive queries. This could be extremely useful. I can't seem 
> to find this feature even on trunk. Is this feature currently anywhere in the 
> roadmap?--
> This could be implemented in many places.
> A simple place to put this is 
> in Driver.compile or Driver.run; we can do string substitutions at that level, 
> and further downstream need not be affected. 
> There could be some benefits to doing this further downstream (parser, plan), 
> but based on the simple needs we may not need to overthink this.
> I will get started on implementing in compile unless someone wants to discuss 
> this more.




[jira] Commented: (HIVE-1305) add progress in join and groupby

2010-07-08 Thread He Yongqiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886593#action_12886593
 ] 

He Yongqiang commented on HIVE-1305:


will take a look.

> add progress in join and groupby
> 
>
> Key: HIVE-1305
> URL: https://issues.apache.org/jira/browse/HIVE-1305
> Project: Hadoop Hive
>  Issue Type: Bug
>  Components: Query Processor
>Reporter: Namit Jain
>Assignee: Siying Dong
> Attachments: hive.1305.1.patch, hive.1305.2.patch
>
>
> The operators join and groupby can consume a lot of rows before producing any 
> output. 
> All operators which do not have an output for every input should report 
> progress periodically.
> Currently, it is only being done for ScriptOperator and FilterOperator.




[jira] Commented: (HIVE-1454) insert overwrite and CTAS fail in hive local mode

2010-07-08 Thread He Yongqiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886592#action_12886592
 ] 

He Yongqiang commented on HIVE-1454:


no, i only committed it to trunk. Do you need me to commit this to 0.6 as well?

> insert overwrite and CTAS fail in hive local mode
> -
>
> Key: HIVE-1454
> URL: https://issues.apache.org/jira/browse/HIVE-1454
> Project: Hadoop Hive
>  Issue Type: Bug
>  Components: Query Processor
>Reporter: Joydeep Sen Sarma
>Assignee: Joydeep Sen Sarma
>Priority: Blocker
> Fix For: 0.7.0
>
> Attachments: hive-1454.1.patch
>
>
> this is because of the changes in HIVE-543. We switched to using local 
> storage for intermediate data for local mode queries. However there are code 
> paths that are incorrectly allocating intermediate storage where they should 
> be allocating external file system storage (based on table/directory uri). 
> This is causing regressions in running queries in local mode.




[jira] Commented: (HIVE-417) Implement Indexing in Hive

2010-07-08 Thread John Sichi (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886587#action_12886587
 ] 

John Sichi commented on HIVE-417:
-

Based on discussion with Yongqiang, we've decided to go for "Full-fledged index 
support".


> Implement Indexing in Hive
> --
>
> Key: HIVE-417
> URL: https://issues.apache.org/jira/browse/HIVE-417
> Project: Hadoop Hive
>  Issue Type: New Feature
>  Components: Metastore, Query Processor
>Affects Versions: 0.3.0, 0.3.1, 0.4.0, 0.6.0
>Reporter: Prasad Chakka
>Assignee: He Yongqiang
> Attachments: hive-417.proto.patch, hive-417-2009-07-18.patch, 
> hive-indexing.3.patch, hive-indexing.5.thrift.patch, 
> indexing_with_ql_rewrites_trunk_953221.patch
>
>
> Implement indexing on Hive so that lookup and range queries are efficient.




[jira] Updated: (HIVE-1305) add progress in join and groupby

2010-07-08 Thread Siying Dong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siying Dong updated HIVE-1305:
--

Attachment: hive.1305.2.patch

> add progress in join and groupby
> 
>
> Key: HIVE-1305
> URL: https://issues.apache.org/jira/browse/HIVE-1305
> Project: Hadoop Hive
>  Issue Type: Bug
>  Components: Query Processor
>Reporter: Namit Jain
>Assignee: Siying Dong
> Attachments: hive.1305.1.patch, hive.1305.2.patch
>
>
> The operators join and groupby can consume a lot of rows before producing any 
> output. 
> All operators which do not have an output for every input should report 
> progress periodically.
> Currently, it is only being done for ScriptOperator and FilterOperator.




[jira] Updated: (HIVE-1305) add progress in join and groupby

2010-07-08 Thread Siying Dong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siying Dong updated HIVE-1305:
--

Attachment: hive.1305.1.patch

> add progress in join and groupby
> 
>
> Key: HIVE-1305
> URL: https://issues.apache.org/jira/browse/HIVE-1305
> Project: Hadoop Hive
>  Issue Type: Bug
>  Components: Query Processor
>Reporter: Namit Jain
>Assignee: Siying Dong
> Attachments: hive.1305.1.patch
>
>
> The operators join and groupby can consume a lot of rows before producing any 
> output. 
> All operators which do not have an output for every input should report 
> progress periodically.
> Currently, it is only being done for ScriptOperator and FilterOperator.




[jira] Updated: (HIVE-1428) ALTER TABLE ADD PARTITION fails with a remote Thrift metastore

2010-07-08 Thread Paul Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Yang updated HIVE-1428:


Status: Open  (was: Patch Available)

> ALTER TABLE ADD PARTITION fails with a remote Thrift metastore
> --
>
> Key: HIVE-1428
> URL: https://issues.apache.org/jira/browse/HIVE-1428
> Project: Hadoop Hive
>  Issue Type: Bug
>  Components: Metastore
>Affects Versions: 0.6.0, 0.7.0
>Reporter: Paul Yang
>Assignee: Pradeep Kamath
> Attachments: HIVE-1428-2.patch, HIVE-1428.patch, 
> TestHiveMetaStoreRemote.java
>
>
> If the hive cli is configured to use a remote metastore, ALTER TABLE ... ADD 
> PARTITION commands will fail with an error similar to the following:
> [prade...@chargesize:~/dev/howl]hive --auxpath ult-serde.jar -e "ALTER TABLE 
> mytable add partition(datestamp = '20091101', srcid = '10',action) location 
> '/user/pradeepk/mytable/20091101/10';"
> 10/06/16 17:08:59 WARN conf.Configuration: DEPRECATED: hadoop-site.xml found 
> in the classpath. Usage of hadoop-site.xml is deprecated. Instead use 
> core-site.xml, mapred-site.xml and hdfs-site.xml to override properties of 
> core-default.xml, mapred-default.xml and hdfs-default.xml respectively
> Hive history 
> file=/tmp/pradeepk/hive_job_log_pradeepk_201006161709_1934304805.txt
> FAILED: Error in metadata: org.apache.thrift.TApplicationException: 
> get_partition failed: unknown result
> FAILED: Execution Error, return code 1 from 
> org.apache.hadoop.hive.ql.exec.DDLTask
> [prade...@chargesize:~/dev/howl]
> This is due to a check that tries to retrieve the partition to see if it 
> exists. If it does not, an attempt is made to pass a null value from the 
> metastore. Since thrift does not support null return values, an exception is 
> thrown.
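
One common way to avoid returning null across an RPC boundary is to signal "not found" with a declared exception. The sketch below illustrates that idea only; it is not taken from the attached patches. NoSuchObjectExceptionSketch and PartitionLookupSketch are hypothetical stand-ins, not Hive's thrift-generated classes, and partitions are reduced to a name-to-location map.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a "no such object" exception declared on the service method. */
class NoSuchObjectExceptionSketch extends Exception {
    NoSuchObjectExceptionSketch(String message) {
        super(message);
    }
}

/** Illustrative lookup that never returns null to its caller. */
class PartitionLookupSketch {
    private final Map<String, String> partitions = new HashMap<>();

    void addPartition(String name, String location) {
        partitions.put(name, location);
    }

    /** Throws instead of returning null when the partition is missing. */
    String getPartition(String name) throws NoSuchObjectExceptionSketch {
        String location = partitions.get(name);
        if (location == null) {
            throw new NoSuchObjectExceptionSketch("partition not found: " + name);
        }
        return location;
    }

    /** Existence check of the kind ADD PARTITION needs, built on the throwing lookup. */
    boolean partitionExists(String name) {
        try {
            getPartition(name);
            return true;
        } catch (NoSuchObjectExceptionSketch e) {
            return false;
        }
    }
}
```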




[jira] Commented: (HIVE-1428) ALTER TABLE ADD PARTITION fails with a remote Thrift metastore

2010-07-08 Thread Paul Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886559#action_12886559
 ] 

Paul Yang commented on HIVE-1428:
-

Fix looks pretty good - two things though:

1. The field identifiers for NoSuchObjectException and MetaException in 
hive_metastore.thrift should be swapped - this is because thrift uses those 
identifiers for versioning and we want to be consistent.

2. TestHiveMetaStoreRemote and TestHiveMetaStore share quite a bit of code. Can 
you extract this out to a separate class? Or maybe roll the remote metastore 
client into TestHiveMetastore.

> ALTER TABLE ADD PARTITION fails with a remote Thrift metastore
> --
>
> Key: HIVE-1428
> URL: https://issues.apache.org/jira/browse/HIVE-1428
> Project: Hadoop Hive
>  Issue Type: Bug
>  Components: Metastore
>Affects Versions: 0.6.0, 0.7.0
>Reporter: Paul Yang
>Assignee: Pradeep Kamath
> Attachments: HIVE-1428-2.patch, HIVE-1428.patch, 
> TestHiveMetaStoreRemote.java
>
>
> If the hive cli is configured to use a remote metastore, ALTER TABLE ... ADD 
> PARTITION commands will fail with an error similar to the following:
> [prade...@chargesize:~/dev/howl]hive --auxpath ult-serde.jar -e "ALTER TABLE 
> mytable add partition(datestamp = '20091101', srcid = '10',action) location 
> '/user/pradeepk/mytable/20091101/10';"
> 10/06/16 17:08:59 WARN conf.Configuration: DEPRECATED: hadoop-site.xml found 
> in the classpath. Usage of hadoop-site.xml is deprecated. Instead use 
> core-site.xml, mapred-site.xml and hdfs-site.xml to override properties of 
> core-default.xml, mapred-default.xml and hdfs-default.xml respectively
> Hive history 
> file=/tmp/pradeepk/hive_job_log_pradeepk_201006161709_1934304805.txt
> FAILED: Error in metadata: org.apache.thrift.TApplicationException: 
> get_partition failed: unknown result
> FAILED: Execution Error, return code 1 from 
> org.apache.hadoop.hive.ql.exec.DDLTask
> [prade...@chargesize:~/dev/howl]
> This is due to a check that tries to retrieve the partition to see if it 
> exists. If it does not, an attempt is made to pass a null value from the 
> metastore. Since thrift does not support null return values, an exception is 
> thrown.




[jira] Commented: (HIVE-1454) insert overwrite and CTAS fail in hive local mode

2010-07-08 Thread Joydeep Sen Sarma (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886558#action_12886558
 ] 

Joydeep Sen Sarma commented on HIVE-1454:
-

Yongqiang - did u commit this to 0.6 as well?

> insert overwrite and CTAS fail in hive local mode
> -
>
> Key: HIVE-1454
> URL: https://issues.apache.org/jira/browse/HIVE-1454
> Project: Hadoop Hive
>  Issue Type: Bug
>  Components: Query Processor
>Reporter: Joydeep Sen Sarma
>Assignee: Joydeep Sen Sarma
>Priority: Blocker
> Fix For: 0.7.0
>
> Attachments: hive-1454.1.patch
>
>
> this is because of the changes in HIVE-543. We switched to using local 
> storage for intermediate data for local mode queries. However there are code 
> paths that are incorrectly allocating intermediate storage where they should 
> be allocating external file system storage (based on table/directory uri). 
> This is causing regressions in running queries in local mode.




[jira] Commented: (HIVE-1245) allow access to values stored as non-strings in HBase

2010-07-08 Thread John Sichi (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886541#action_12886541
 ] 

John Sichi commented on HIVE-1245:
--

For atomic types, we could extend the column-level mapping directive to allow 
for three options

* string
* binary
* use table-level default

So where we currently have a:b, we would support a:b:string and a:b:binary.

The table-level default would be set in a separate serde property 
hbase.storedtype.atomic, with a default value of string for 
backwards-compatibility.

Then something similar for compound types, but with json and delimited as 
options?  I haven't thought about all the combinations, and what to do with 
column families.
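
To make the atomic-type proposal concrete, here is a rough sketch of how such a directive might be parsed. Nothing here exists in Hive as written: ColumnMappingSketch, StoredType, and parse are hypothetical names, and the tableDefault argument stands for whatever value hbase.storedtype.atomic would carry. Compound types and family-only mappings are deliberately out of scope.

```java
/** Illustrative parser for the proposed family:qualifier[:storedtype] directive. */
class ColumnMappingSketch {
    enum StoredType { STRING, BINARY }

    final String family;
    final String qualifier;
    final StoredType storedType;

    private ColumnMappingSketch(String family, String qualifier, StoredType storedType) {
        this.family = family;
        this.qualifier = qualifier;
        this.storedType = storedType;
    }

    /** Parses "a:b", "a:b:string", or "a:b:binary"; a missing third part uses the table default. */
    static ColumnMappingSketch parse(String spec, StoredType tableDefault) {
        String[] parts = spec.split(":", -1);
        if (parts.length != 2 && parts.length != 3) {
            throw new IllegalArgumentException("bad column mapping: " + spec);
        }
        StoredType storedType = tableDefault;
        if (parts.length == 3) {
            if (parts[2].equals("string")) {
                storedType = StoredType.STRING;
            } else if (parts[2].equals("binary")) {
                storedType = StoredType.BINARY;
            } else {
                throw new IllegalArgumentException("unknown stored type: " + parts[2]);
            }
        }
        return new ColumnMappingSketch(parts[0], parts[1], storedType);
    }
}
```

Existing a:b mappings keep working unchanged as long as the table-level default stays string, which is the backwards-compatibility point made above.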


> allow access to values stored as non-strings in HBase
> -
>
> Key: HIVE-1245
> URL: https://issues.apache.org/jira/browse/HIVE-1245
> Project: Hadoop Hive
>  Issue Type: Improvement
>  Components: HBase Handler
>Affects Versions: 0.6.0
>Reporter: John Sichi
>Assignee: John Sichi
>
> See  test case in
> http://mail-archives.apache.org/mod_mbox/hadoop-hive-user/201003.mbox/browser




[jira] Commented: (HIVE-1229) replace dependencies on HBase deprecated API

2010-07-08 Thread John Sichi (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886523#action_12886523
 ] 

John Sichi commented on HIVE-1229:
--

Instead of cachedColumnNameBytes, is it possible to instead keep a separate 
array of the names in byte form?  If we're doing all access positionally, that 
would allow us to skip the hash map lookups.
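
A minimal sketch of the positional idea, assuming column names are known up front and always accessed by index. The class name ColumnNameBytesSketch and its methods are hypothetical and are not taken from the patch; the UTF-8 encoding is likewise an assumption.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

/** Illustrative positional cache of column names in byte form. */
class ColumnNameBytesSketch {
    private final byte[][] columnNameBytes;

    /** Encodes each name once, in column order, so later access needs no hashing. */
    ColumnNameBytesSketch(List<String> columnNames) {
        columnNameBytes = new byte[columnNames.size()][];
        for (int i = 0; i < columnNames.size(); i++) {
            columnNameBytes[i] = columnNames.get(i).getBytes(StandardCharsets.UTF_8);
        }
    }

    /** Plain array index instead of a hash map lookup keyed by name. */
    byte[] nameBytesAt(int position) {
        return columnNameBytes[position];
    }

    int size() {
        return columnNameBytes.length;
    }
}
```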


> replace dependencies on HBase deprecated API
> 
>
> Key: HIVE-1229
> URL: https://issues.apache.org/jira/browse/HIVE-1229
> Project: Hadoop Hive
>  Issue Type: Improvement
>  Components: HBase Handler
>Affects Versions: 0.6.0
>Reporter: John Sichi
>Assignee: Basab Maulik
> Fix For: 0.7.0
>
> Attachments: HIVE-1229.1.patch, HIVE-1229.2.patch, HIVE-1229.3.patch
>
>
> Some of these dependencies are on the old Hadoop mapred packages; others are 
> HBase-specific.  The former have to wait until the rest of Hive moves over to 
> the new Hadoop mapreduce package, but the HBase-specific ones don't have to 
> wait.




[jira] Updated: (HIVE-1454) insert overwrite and CTAS fail in hive local mode

2010-07-08 Thread He Yongqiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

He Yongqiang updated HIVE-1454:
---

   Status: Resolved  (was: Patch Available)
Fix Version/s: 0.7.0
   (was: 0.6.0)
   Resolution: Fixed

I just committed. Thanks Joydeep!

> insert overwrite and CTAS fail in hive local mode
> -
>
> Key: HIVE-1454
> URL: https://issues.apache.org/jira/browse/HIVE-1454
> Project: Hadoop Hive
>  Issue Type: Bug
>  Components: Query Processor
>Reporter: Joydeep Sen Sarma
>Assignee: Joydeep Sen Sarma
>Priority: Blocker
> Fix For: 0.7.0
>
> Attachments: hive-1454.1.patch
>
>
> this is because of the changes in HIVE-543. We switched to using local 
> storage for intermediate data for local mode queries. However there are code 
> paths that are incorrectly allocating intermediate storage where they should 
> be allocating external file system storage (based on table/directory uri). 
> This is causing regressions in running queries in local mode.




[jira] Updated: (HIVE-1229) replace dependencies on HBase deprecated API

2010-07-08 Thread John Sichi (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Sichi updated HIVE-1229:
-

Status: Patch Available  (was: Open)

> replace dependencies on HBase deprecated API
> 
>
> Key: HIVE-1229
> URL: https://issues.apache.org/jira/browse/HIVE-1229
> Project: Hadoop Hive
>  Issue Type: Improvement
>  Components: HBase Handler
>Affects Versions: 0.6.0
>Reporter: John Sichi
>Assignee: Basab Maulik
> Fix For: 0.7.0
>
> Attachments: HIVE-1229.1.patch, HIVE-1229.2.patch, HIVE-1229.3.patch
>
>
> Some of these dependencies are on the old Hadoop mapred packages; others are 
> HBase-specific.  The former have to wait until the rest of Hive moves over to 
> the new Hadoop mapreduce package, but the HBase-specific ones don't have to 
> wait.




[jira] Assigned: (HIVE-1428) ALTER TABLE ADD PARTITION fails with a remote Thrift metastore

2010-07-08 Thread John Sichi (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Sichi reassigned HIVE-1428:


Assignee: Pradeep Kamath

Setting assignee to Pradeep.

(Pradeep, I just added you as a contributor on Hive, so you should be able to 
assign issues to yourself going forward.)


> ALTER TABLE ADD PARTITION fails with a remote Thrift metastore
> --
>
> Key: HIVE-1428
> URL: https://issues.apache.org/jira/browse/HIVE-1428
> Project: Hadoop Hive
>  Issue Type: Bug
>  Components: Metastore
>Affects Versions: 0.6.0, 0.7.0
>Reporter: Paul Yang
>Assignee: Pradeep Kamath
> Attachments: HIVE-1428-2.patch, HIVE-1428.patch, 
> TestHiveMetaStoreRemote.java
>
>
> If the hive cli is configured to use a remote metastore, ALTER TABLE ... ADD 
> PARTITION commands will fail with an error similar to the following:
> [prade...@chargesize:~/dev/howl]hive --auxpath ult-serde.jar -e "ALTER TABLE 
> mytable add partition(datestamp = '20091101', srcid = '10',action) location 
> '/user/pradeepk/mytable/20091101/10';"
> 10/06/16 17:08:59 WARN conf.Configuration: DEPRECATED: hadoop-site.xml found 
> in the classpath. Usage of hadoop-site.xml is deprecated. Instead use 
> core-site.xml, mapred-site.xml and hdfs-site.xml to override properties of 
> core-default.xml, mapred-default.xml and hdfs-default.xml respectively
> Hive history 
> file=/tmp/pradeepk/hive_job_log_pradeepk_201006161709_1934304805.txt
> FAILED: Error in metadata: org.apache.thrift.TApplicationException: 
> get_partition failed: unknown result
> FAILED: Execution Error, return code 1 from 
> org.apache.hadoop.hive.ql.exec.DDLTask
> [prade...@chargesize:~/dev/howl]
> This is due to a check that tries to retrieve the partition to see if it 
> exists. If it does not, an attempt is made to pass a null value from the 
> metastore. Since thrift does not support null return values, an exception is 
> thrown.




[jira] Commented: (HIVE-1454) insert overwrite and CTAS fail in hive local mode

2010-07-08 Thread Joydeep Sen Sarma (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886442#action_12886442
 ] 

Joydeep Sen Sarma commented on HIVE-1454:
-

It's difficult to add queries that can test this. The reason is that we need 
two distinct, valid Hadoop FileSystems to replicate this problem. 
This is not trivial.

medium term - as part of HIVE-1408 - i will add a new dummy Hadoop filesystem 
(that will just use local storage underneath). Then we will be able to 
systematically test this using our regression queries.

> insert overwrite and CTAS fail in hive local mode
> -
>
> Key: HIVE-1454
> URL: https://issues.apache.org/jira/browse/HIVE-1454
> Project: Hadoop Hive
>  Issue Type: Bug
>  Components: Query Processor
>Reporter: Joydeep Sen Sarma
>Assignee: Joydeep Sen Sarma
>Priority: Blocker
> Fix For: 0.6.0
>
> Attachments: hive-1454.1.patch
>
>
> this is because of the changes in HIVE-543. We switched to using local 
> storage for intermediate data for local mode queries. However there are code 
> paths that are incorrectly allocating intermediate storage where they should 
> be allocating external file system storage (based on table/directory uri). 
> This is causing regressions in running queries in local mode.




[jira] Commented: (HIVE-1454) insert overwrite and CTAS fail in hive local mode

2010-07-08 Thread He Yongqiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886445#action_12886445
 ] 

He Yongqiang commented on HIVE-1454:


+1. will commit after tests pass.

> insert overwrite and CTAS fail in hive local mode
> -
>
> Key: HIVE-1454
> URL: https://issues.apache.org/jira/browse/HIVE-1454
> Project: Hadoop Hive
>  Issue Type: Bug
>  Components: Query Processor
>Reporter: Joydeep Sen Sarma
>Assignee: Joydeep Sen Sarma
>Priority: Blocker
> Fix For: 0.6.0
>
> Attachments: hive-1454.1.patch
>
>
> this is because of the changes in HIVE-543. We switched to using local 
> storage for intermediate data for local mode queries. However there are code 
> paths that are incorrectly allocating intermediate storage where they should 
> be allocating external file system storage (based on table/directory uri). 
> This is causing regressions in running queries in local mode.




[jira] Commented: (HIVE-287) count distinct on multiple columns does not work

2010-07-08 Thread John Sichi (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886428#action_12886428
 ] 

John Sichi commented on HIVE-287:
-

Regarding DISTINCT:  I agree with Arvind; this information should be provided 
to the UDAF so that it can reject invocations that don't make sense.  Once this 
validation is passed, the distinct elimination is still implemented generically 
inside of Hive (upstream of the UDAF).

Regarding F(*):  let's discriminate three cases.

COUNT(*):  this really means COUNT(), not COUNT(x,y,z).  This is a very 
important distinction to make from an optimizer perspective, because we want to 
be able to push down projection to avoid I/O and other processing for columns 
whose values we will never look at.

SUM(*) and similar ones:  these we should disallow.

MY_UDAF(*), or MY_UDAF(t.*):  this is similar to Pradeep's case that came up 
recently on the mailing list, and it needs to expand to MY_UDAF(x,y,z), not 
MY_UDAF().  I think the patch is currently doing MY_UDAF(), which isn't what he 
wants.

My recommendation is that we commit Arvind's patch as is, then create a 
followup JIRA issue to do what Pradeep is looking for (the expansion of * in 
the semantic analyzer) for both UDF and UDAF, but with a special case for 
COUNT. UDAF authors will be able to decide whether or not to reject the star 
syntax, since in the common case of a UDAF expecting a limited number of 
parameters, the star won't make sense.
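
The three cases for F(*) could be sketched as below. This only illustrates the rule being proposed and is not semantic analyzer code: StarExpansionSketch and expandStar are hypothetical names, and treating AVG, MIN, and MAX as the "similar ones" to SUM is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Illustrative handling of the star argument for aggregate functions. */
class StarExpansionSketch {
    // Assumed set of built-ins for which a star argument is rejected.
    private static final Set<String> DISALLOW_STAR =
        new HashSet<>(Arrays.asList("SUM", "AVG", "MIN", "MAX"));

    /** Returns the argument list that F(*) stands for, given the table's columns. */
    static List<String> expandStar(String functionName, List<String> tableColumns) {
        String name = functionName.toUpperCase(Locale.ROOT);
        if (name.equals("COUNT")) {
            // COUNT(*) means COUNT(): no columns are referenced, so projection can be pruned.
            return new ArrayList<>();
        }
        if (DISALLOW_STAR.contains(name)) {
            throw new IllegalArgumentException(name + "(*) is not allowed");
        }
        // Any other function, such as MY_UDAF, sees the full column list.
        return new ArrayList<>(tableColumns);
    }
}
```

A UDAF that expects a fixed number of parameters would still be free to reject the expanded list during its own validation.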


> count distinct on multiple columns does not work
> 
>
> Key: HIVE-287
> URL: https://issues.apache.org/jira/browse/HIVE-287
> Project: Hadoop Hive
>  Issue Type: Bug
>  Components: Query Processor
>Reporter: Namit Jain
>Assignee: Arvind Prabhakar
> Attachments: HIVE-287-1.patch, HIVE-287-2.patch, HIVE-287-3.patch, 
> HIVE-287-4.patch, HIVE-287-5-branch-0.6.patch, HIVE-287-5-trunk.patch
>
>
> The following query does not work:
> select count(distinct col1, col2) from Tbl




[jira] Updated: (HIVE-1454) insert overwrite and CTAS fail in hive local mode

2010-07-08 Thread Joydeep Sen Sarma (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joydeep Sen Sarma updated HIVE-1454:


Status: Patch Available  (was: Open)

> insert overwrite and CTAS fail in hive local mode
> -
>
> Key: HIVE-1454
> URL: https://issues.apache.org/jira/browse/HIVE-1454
> Project: Hadoop Hive
>  Issue Type: Bug
>  Components: Query Processor
>Reporter: Joydeep Sen Sarma
>Assignee: Joydeep Sen Sarma
>Priority: Blocker
> Fix For: 0.6.0
>
> Attachments: hive-1454.1.patch
>
>
> this is because of the changes in HIVE-543. We switched to using local 
> storage for intermediate data for local mode queries. However there are code 
> paths that are incorrectly allocating intermediate storage where they should 
> be allocating external file system storage (based on table/directory uri). 
> This is causing regressions in running queries in local mode.




[jira] Updated: (HIVE-1454) insert overwrite and CTAS fail in hive local mode

2010-07-08 Thread Joydeep Sen Sarma (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joydeep Sen Sarma updated HIVE-1454:


Attachment: hive-1454.1.patch

> insert overwrite and CTAS fail in hive local mode
> -
>
> Key: HIVE-1454
> URL: https://issues.apache.org/jira/browse/HIVE-1454
> Project: Hadoop Hive
>  Issue Type: Bug
>  Components: Query Processor
>Reporter: Joydeep Sen Sarma
>Assignee: Joydeep Sen Sarma
>Priority: Blocker
> Fix For: 0.6.0
>
> Attachments: hive-1454.1.patch
>
>
> this is because of the changes in HIVE-543. We switched to using local 
> storage for intermediate data for local mode queries. However there are code 
> paths that are incorrectly allocating intermediate storage where they should 
> be allocating external file system storage (based on table/directory uri). 
> This is causing regressions in running queries in local mode.




[jira] Created: (HIVE-1454) insert overwrite and CTAS fail in hive local mode

2010-07-08 Thread Joydeep Sen Sarma (JIRA)
insert overwrite and CTAS fail in hive local mode
-

 Key: HIVE-1454
 URL: https://issues.apache.org/jira/browse/HIVE-1454
 Project: Hadoop Hive
  Issue Type: Bug
  Components: Query Processor
Reporter: Joydeep Sen Sarma
Assignee: Joydeep Sen Sarma
Priority: Blocker
 Fix For: 0.6.0


this is because of the changes in HIVE-543. We switched to using local storage 
for intermediate data for local mode queries. However there are code paths that 
are incorrectly allocating intermediate storage where they should be allocating 
external file system storage (based on table/directory uri). This is causing 
regressions in running queries in local mode.




[jira] Updated: (HIVE-1408) add option to let hive automatically run in local mode based on tunable heuristics

2010-07-08 Thread Joydeep Sen Sarma (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joydeep Sen Sarma updated HIVE-1408:


Status: Patch Available  (was: Open)

> add option to let hive automatically run in local mode based on tunable 
> heuristics
> --
>
> Key: HIVE-1408
> URL: https://issues.apache.org/jira/browse/HIVE-1408
> Project: Hadoop Hive
>  Issue Type: New Feature
>  Components: Query Processor
>Reporter: Joydeep Sen Sarma
>Assignee: Joydeep Sen Sarma
> Attachments: 1408.1.patch
>
>
> as a followup to HIVE-543 - we should have a simple option (enabled by 
> default) to let hive run in local mode if possible.
> two levels of options are desirable:
> 1. hive.exec.mode.local.auto=true/false // control whether local mode is 
> automatically chosen
> 2. Options to control different heuristics, some naive examples:
>  hive.exec.mode.local.auto.input.size.max=1G // don't choose local mode 
> if data > 1G
>  hive.exec.mode.local.auto.script.enable=true/false // choose if local 
> mode is enabled for queries with user scripts
> this can be implemented as a pre/post execution hook. It makes sense to 
> provide this as a standard hook in the hive codebase since it's likely to 
> improve response time for many users (especially for test queries).
> the initial proposal is to choose this at a query level and not at per 
> hive-task (ie. hadoop job) level. per job-level requires more changes to 
> compilation (to not pre-commit to hdfs or local scratch directories at 
> compile time).




[jira] Commented: (HIVE-417) Implement Indexing in Hive

2010-07-08 Thread He Yongqiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886380#action_12886380
 ] 

He Yongqiang commented on HIVE-417:
---

I think the SUMMARY index's mapper code is commented out in the uploaded patch.

> Implement Indexing in Hive
> --
>
> Key: HIVE-417
> URL: https://issues.apache.org/jira/browse/HIVE-417
> Project: Hadoop Hive
>  Issue Type: New Feature
>  Components: Metastore, Query Processor
>Affects Versions: 0.3.0, 0.3.1, 0.4.0, 0.6.0
>Reporter: Prasad Chakka
>Assignee: He Yongqiang
> Attachments: hive-417.proto.patch, hive-417-2009-07-18.patch, 
> hive-indexing.3.patch, hive-indexing.5.thrift.patch, 
> indexing_with_ql_rewrites_trunk_953221.patch
>
>
> Implement indexing on Hive so that lookup and range queries are efficient.




[jira] Updated: (HIVE-1428) ALTER TABLE ADD PARTITION fails with a remote Thrift metastore

2010-07-08 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated HIVE-1428:
-

Attachment: HIVE-1428-2.patch

New patch with unit tests included. Thanks for the suggestion of using threads, 
Paul; I have used that in the unit test. I have made a change in HiveConf to 
enable the unit test. The remaining changes in the patch are as in the first 
version: getPartition() now throws NoSuchObjectException when no partition 
exists. This mainly changes the generated thrift code (to add throws in the 
method signature) and the other code that interacts with it, which now catches 
the exception and sets the partition object to null.
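
To make the pattern concrete, here is a rough Python sketch (not the patch itself, which changes Java and generated thrift code). `FakeMetastore`, `lookup_partition`, and the local `NoSuchObjectException` class are hypothetical stand-ins; the only behavior taken from the patch notes is "throw when the partition is missing, catch on the caller side and use null".

```python
class NoSuchObjectException(Exception):
    """Local stand-in for the metastore exception named in the patch notes."""


class FakeMetastore:
    """Hypothetical in-memory metastore used only for this sketch."""

    def __init__(self, partitions):
        self._partitions = dict(partitions)

    def get_partition(self, table, spec):
        key = (table, tuple(sorted(spec.items())))
        try:
            return self._partitions[key]
        except KeyError:
            # Raise instead of returning a null value over the wire.
            raise NoSuchObjectException(f"no partition {spec} in {table}") from None


def lookup_partition(metastore, table, spec):
    """Caller side: translate the exception back into 'no partition'."""
    try:
        return metastore.get_partition(table, spec)
    except NoSuchObjectException:
        return None
```

With this shape, the existence check before ADD PARTITION sees `None` for a missing partition rather than a transport-level failure.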

> ALTER TABLE ADD PARTITION fails with a remote Thrift metastore
> --
>
> Key: HIVE-1428
> URL: https://issues.apache.org/jira/browse/HIVE-1428
> Project: Hadoop Hive
>  Issue Type: Bug
>  Components: Metastore
>Affects Versions: 0.6.0, 0.7.0
>Reporter: Paul Yang
> Attachments: HIVE-1428-2.patch, HIVE-1428.patch, 
> TestHiveMetaStoreRemote.java
>
>
> If the hive cli is configured to use a remote metastore, ALTER TABLE ... ADD 
> PARTITION commands will fail with an error similar to the following:
> [prade...@chargesize:~/dev/howl]hive --auxpath ult-serde.jar -e "ALTER TABLE 
> mytable add partition(datestamp = '20091101', srcid = '10',action) location 
> '/user/pradeepk/mytable/20091101/10';"
> 10/06/16 17:08:59 WARN conf.Configuration: DEPRECATED: hadoop-site.xml found 
> in the classpath. Usage of hadoop-site.xml is deprecated. Instead use 
> core-site.xml, mapred-site.xml and hdfs-site.xml to override properties of 
> core-default.xml, mapred-default.xml and hdfs-default.xml respectively
> Hive history 
> file=/tmp/pradeepk/hive_job_log_pradeepk_201006161709_1934304805.txt
> FAILED: Error in metadata: org.apache.thrift.TApplicationException: 
> get_partition failed: unknown result
> FAILED: Execution Error, return code 1 from 
> org.apache.hadoop.hive.ql.exec.DDLTask
> [prade...@chargesize:~/dev/howl]
> This is due to a check that tries to retrieve the partition to see if it 
> exists. If it does not, an attempt is made to pass a null value from the 
> metastore. Since thrift does not support null return values, an exception is 
> thrown.




[jira] Updated: (HIVE-1428) ALTER TABLE ADD PARTITION fails with a remote Thrift metastore

2010-07-08 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated HIVE-1428:
-

Status: Patch Available  (was: Open)

HIVE-1428-2.patch is ready for review.

> ALTER TABLE ADD PARTITION fails with a remote Thrift metastore
> --
>
> Key: HIVE-1428
> URL: https://issues.apache.org/jira/browse/HIVE-1428
> Project: Hadoop Hive
>  Issue Type: Bug
>  Components: Metastore
>Affects Versions: 0.6.0, 0.7.0
>Reporter: Paul Yang
> Attachments: HIVE-1428-2.patch, HIVE-1428.patch, 
> TestHiveMetaStoreRemote.java
>
>
> If the hive cli is configured to use a remote metastore, ALTER TABLE ... ADD 
> PARTITION commands will fail with an error similar to the following:
> [prade...@chargesize:~/dev/howl]hive --auxpath ult-serde.jar -e "ALTER TABLE 
> mytable add partition(datestamp = '20091101', srcid = '10',action) location 
> '/user/pradeepk/mytable/20091101/10';"
> 10/06/16 17:08:59 WARN conf.Configuration: DEPRECATED: hadoop-site.xml found 
> in the classpath. Usage of hadoop-site.xml is deprecated. Instead use 
> core-site.xml, mapred-site.xml and hdfs-site.xml to override properties of 
> core-default.xml, mapred-default.xml and hdfs-default.xml respectively
> Hive history 
> file=/tmp/pradeepk/hive_job_log_pradeepk_201006161709_1934304805.txt
> FAILED: Error in metadata: org.apache.thrift.TApplicationException: 
> get_partition failed: unknown result
> FAILED: Execution Error, return code 1 from 
> org.apache.hadoop.hive.ql.exec.DDLTask
> [prade...@chargesize:~/dev/howl]
> This is due to a check that tries to retrieve the partition to see if it 
> exists. If it does not, an attempt is made to pass a null value from the 
> metastore. Since thrift does not support null return values, an exception is 
> thrown.




[jira] Commented: (HIVE-287) count distinct on multiple columns does not work

2010-07-08 Thread Arvind Prabhakar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886339#action_12886339
 ] 

Arvind Prabhakar commented on HIVE-287:
---

@Zheng: Welcome to the party.

bq. Why do we put the DISTINCT in the information? DISTINCT is currently done 
by the framework, instead of individual UDAF. This is good because the logic of 
removing duplicates is common for all UDAFs. We do support SUM(DISTINCT val).

Providing the information in the parameter specification is not the same as 
enforcing its interpretation. This is provided primarily to ensure that UDAFs 
that rely on this information can make appropriate decisions. For example, we 
wanted to disallow the invocation {{COUNT( EXPR1, EXPR2 ...)}} in favor of 
{{COUNT(*DISTINCT* EXPR1, EXPR2 ...)}}. Without this information, the count 
UDAF will not be able to enforce the latter syntax.
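
As a rough sketch of the idea (in Python, with hypothetical names; this is not the resolver interface from the patch), a UDAF that receives the DISTINCT flag alongside its arguments can make that decision itself:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParameterInfo:
    """Hypothetical parameter specification handed to a UDAF."""
    arg_count: int
    is_distinct: bool


def check_count_parameters(info):
    """Reject COUNT(EXPR1, EXPR2 ...) unless DISTINCT was specified."""
    if info.arg_count > 1 and not info.is_distinct:
        raise ValueError("COUNT with multiple expressions requires DISTINCT")
```

The framework still removes duplicates; the flag is only consulted here to validate the call.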

bq. Why do we special-case "\*"? It seems to me that "\*" is just a short-cut. Hive 
already supports regex-based multi-column specification, so that we can say 
`abc.*` for all columns with name starting with abc. The compiler should just 
expand * and give all the columns to the UDAF.

If you wish to use \* as a regular expression, you would have to quote it as a 
string - {{COUNT('\*')}}. This is different from the invocation as specified in 
SQL which treats \* as a terminal symbol. So if it is OK to deviate from the 
standard representation, the user can easily use the quoted string 
representation to achieve the effect similar to {{COUNT(col1, col2 ..)}}. The 
semantics of this should be more like {{COUNT(DISTINCT EXPR1, EXPR2 ...)}} as 
opposed to {{COUNT(\*)}}.

bq. Since COUNT(\*) is a special-case in the SQL standard (COUNT(\*) is 
different from COUNT(col) even if the table has a single column col), I think 
we should just special-case that and replace that with count(1) at some place.

Are you suggesting that we allow the grammar to express {{COUNT(\*)}} syntax, 
but in the lexical analysis stage turn it into a {{COUNT(1)}}? I can see how 
that may work - but personally I am not a fan of such an approach. 

> count distinct on multiple columns does not work
> 
>
> Key: HIVE-287
> URL: https://issues.apache.org/jira/browse/HIVE-287
> Project: Hadoop Hive
>  Issue Type: Bug
>  Components: Query Processor
>Reporter: Namit Jain
>Assignee: Arvind Prabhakar
> Attachments: HIVE-287-1.patch, HIVE-287-2.patch, HIVE-287-3.patch, 
> HIVE-287-4.patch, HIVE-287-5-branch-0.6.patch, HIVE-287-5-trunk.patch
>
>
> The following query does not work:
> select count(distinct col1, col2) from Tbl




[jira] Commented: (HIVE-417) Implement Indexing in Hive

2010-07-08 Thread Prafulla Tekawade (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886308#action_12886308
 ] 

Prafulla Tekawade commented on HIVE-417:


Hi Yongqiang,
I am having a problem creating SUMMARY indexes.
The index is not built by the update index command.
The COMPACT SUMMARY index works fine. Is there any problem with
the creation of the SUMMARY index table?


> Implement Indexing in Hive
> --
>
> Key: HIVE-417
> URL: https://issues.apache.org/jira/browse/HIVE-417
> Project: Hadoop Hive
>  Issue Type: New Feature
>  Components: Metastore, Query Processor
>Affects Versions: 0.3.0, 0.3.1, 0.4.0, 0.6.0
>Reporter: Prasad Chakka
>Assignee: He Yongqiang
> Attachments: hive-417.proto.patch, hive-417-2009-07-18.patch, 
> hive-indexing.3.patch, hive-indexing.5.thrift.patch, 
> indexing_with_ql_rewrites_trunk_953221.patch
>
>
> Implement indexing on Hive so that lookup and range queries are efficient.




[jira] Updated: (HIVE-1408) add option to let hive automatically run in local mode based on tunable heuristics

2010-07-08 Thread Joydeep Sen Sarma (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joydeep Sen Sarma updated HIVE-1408:


Attachment: 1408.1.patch

v1 - i will update with tests.

couple of main objectives:
1. decide whether each mr job can be run locally
2. decide whether local disk can be used for intermediate data (if all jobs are 
going to run locally)

right now - both #1 and #2 are code complete - but only #1 has been enabled in 
the code (#2 needs more testing)

the general strategy is:
- after compilation/optimization - look at input size of each mr job.
- if all the jobs are small - then we can use local disk for intermediate data 
(#2)
- else - we use hdfs for intermediate input and before launching each job - we 
(re)test whether the input data set is such that we can execute locally.
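
a rough python sketch of the strategy above, for illustration only. the function names and the single size threshold are assumptions (the threshold loosely corresponds to the hive.exec.mode.local.auto.input.size.max idea in the issue description); the real patch works on the compiled plan, not on a list of numbers.

```python
def choose_scratch_storage(job_input_sizes, max_local_bytes):
    """Pick where intermediate data goes after compilation/optimization.

    If every mr job is small, local disk can hold intermediate data (#2);
    otherwise intermediate data stays on hdfs.
    """
    all_small = all(size <= max_local_bytes for size in job_input_sizes)
    return "local" if all_small else "hdfs"


def should_run_locally(current_input_size, max_local_bytes):
    """(Re)test a single job just before launch (#1)."""
    return current_input_size <= max_local_bytes
```

the per-job retest matters because the input of a later job is only known once the earlier jobs have produced it.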

had to do substantial restructuring to make this happen:
a. MapRedTask is now a wrapper around ExecDriver. This allows us to have a 
single task implementation for running mr jobs. mapredtask decides at execute 
time whether it should run locally or not.
b. Context.java is pretty much rewritten - the path management code was 
somewhat buggy (in particular isMRTmpFileURI was incorrect). the code was 
rewritten to make it easy to swizzle tmp paths to be directed to local 
disk after plan generation
c. added a small cache for DFS file metadata (sizes). this is because 
we look up file metadata many times over now (for determining local mode as well 
as for estimating reducer count) and this cuts the overhead of repeated DFS rpcs
d. most test output changes are because of altered temporary path naming 
convention due to (b)
e. bug fixes: CTAS and RCFileOutputFormat were broken for local mode execution. 
some cleanup (debug log statements should be wrapped in isDebugEnabled() checks).
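
the metadata cache in (c) can be pictured with the following python sketch. it is only an illustration under assumed names (`FileSizeCache`, `fetch_size`); the patch's actual cache lives in the java code and may differ in scope and invalidation.

```python
class FileSizeCache:
    """Memoize file sizes so repeated lookups do not hit the DFS again."""

    def __init__(self, fetch_size):
        # fetch_size is any callable path -> size; it stands in for a DFS rpc.
        self._fetch_size = fetch_size
        self._sizes = {}

    def size_of(self, path):
        if path not in self._sizes:
            self._sizes[path] = self._fetch_size(path)
        return self._sizes[path]
```

the same cached size can then serve both the local-mode decision and the reducer-count estimate for a query.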


> add option to let hive automatically run in local mode based on tunable 
> heuristics
> --
>
> Key: HIVE-1408
> URL: https://issues.apache.org/jira/browse/HIVE-1408
> Project: Hadoop Hive
>  Issue Type: New Feature
>  Components: Query Processor
>Reporter: Joydeep Sen Sarma
>Assignee: Joydeep Sen Sarma
> Attachments: 1408.1.patch
>
>
> as a followup to HIVE-543 - we should have a simple option (enabled by 
> default) to let hive run in local mode if possible.
> two levels of options are desirable:
> 1. hive.exec.mode.local.auto=true/false // control whether local mode is 
> automatically chosen
> 2. Options to control different heuristics, some naiive examples:
>  hive.exec.mode.local.auto.input.size.max=1G // don't choose local mode 
> if data > 1G
>  hive.exec.mode.local.auto.script.enable=true/false // choose if local 
> mode is enabled for queries with user scripts
> this can be implemented as a pre/post execution hook. It makes sense to 
> provide this as a standard hook in the hive codebase since it's likely to 
> improve response time for many users (especially for test queries).
> the initial proposal is to choose this at a query level and not at per 
> hive-task (ie. hadoop job) level. per job-level requires more changes to 
> compilation (to not pre-commit to hdfs or local scratch directories at 
> compile time).




[jira] Commented: (HIVE-1452) Mapside join on non partitioned table with partitioned table causes error

2010-07-08 Thread He Yongqiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886272#action_12886272
 ] 

He Yongqiang commented on HIVE-1452:


Not sure what's happening here. It would be great if you could provide a 
testcase to reproduce it.
The parameter "hive.mapjoin.cache.numrows" (default 25K) controls 
when to flush the in-memory hashmap (whose value object is 
MapJoinObjectValue). You may want to use a small number for this parameter in 
your testcase.

My guess is that we may need to call 
{noformat} 
out.flush();
{noformat} 
in MapJoinObjectValue's writeExternal method (MapJoinObjectValue line 131).
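
To show why a missing flush could surface as an EOFException on the read side, here is a small Python analogy. It is not Hive's Java code, and whether this is the actual cause of the reported error is still only a guess; `write_record` and `read_record` are hypothetical helpers that write a length-prefixed payload through a buffered writer.

```python
import io
import struct


def write_record(payload, flush):
    """Write a length-prefixed record through a buffered writer.

    Returns the bytes that actually reached the underlying stream.
    """
    raw = io.BytesIO()
    out = io.BufferedWriter(raw)
    out.write(struct.pack(">i", len(payload)))
    out.write(payload)
    if flush:
        out.flush()
    # Without the flush, small writes are still sitting in the buffer.
    return raw.getvalue()


def read_record(data):
    """Read one length-prefixed record, failing on truncated input."""
    stream = io.BytesIO(data)
    header = stream.read(4)
    if len(header) < 4:
        raise EOFError("stream ended before the length prefix")
    (length,) = struct.unpack(">i", header)
    payload = stream.read(length)
    if len(payload) < length:
        raise EOFError("stream ended inside the payload")
    return payload
```

If the bytes are captured before the writer is flushed, the reader hits end-of-stream while reading the length, which resembles the readInt failure in the stack trace quoted below.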

> Mapside join on non partitioned table with partitioned table causes error
> -
>
> Key: HIVE-1452
> URL: https://issues.apache.org/jira/browse/HIVE-1452
> Project: Hadoop Hive
>  Issue Type: Bug
>  Components: CLI
>Affects Versions: 0.6.0
>Reporter: Viraj Bhat
>
> I am running a script which involves two tables: one is dynamically 
> partitioned and stored as RCFormat, and the other is stored as a TXT file.
> The TXT file is around 397MB in size and has around 24 million rows.
> {code}
> drop table joinquery;
> create external table joinquery (
>   id string,
>   type string,
>   sec string,
>   num string,
>   url string,
>   cost string,
>   listinfo array >
> ) 
> STORED AS TEXTFILE
> LOCATION '/projects/joinquery';
> CREATE EXTERNAL TABLE idtable20mil(
> id string
> )
> STORED AS TEXTFILE
> LOCATION '/projects/idtable20mil';
> insert overwrite table joinquery
>select 
>   /*+ MAPJOIN(idtable20mil) */
>   rctable.id,
>   rctable.type,
>   rctable.map['sec'],
>   rctable.map['num'],
>   rctable.map['url'],
>   rctable.map['cost'],
>   rctable.listinfo
> from rctable
> JOIN  idtable20mil on (rctable.id = idtable20mil.id)
> where
> rctable.id is not null and
> rctable.part='value' and
> rctable.subpart='value' and
> rctable.pty='100' and
> rctable.uniqid='1000'
> order by id;
> {code}
> Result:
> Possible error:
>   Data file split:string,part:string,subpart:string,subsubpart:string> is 
> corrupted.
> Solution:
>   Replace file. i.e. by re-running the query that produced the source table / 
> partition.
> -
> If I look at mapper logs.
> {verbatim}
> Caused by: java.io.IOException: java.io.EOFException
>   at 
> org.apache.hadoop.hive.ql.exec.persistence.MapJoinObjectValue.readExternal(MapJoinObjectValue.java:109)
>   at 
> java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1792)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1751)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1329)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:351)
>   at 
> org.apache.hadoop.hive.ql.util.jdbm.htree.HashBucket.readExternal(HashBucket.java:284)
>   at 
> java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1792)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1751)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1329)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:351)
>   at 
> org.apache.hadoop.hive.ql.util.jdbm.helper.Serialization.deserialize(Serialization.java:106)
>   at 
> org.apache.hadoop.hive.ql.util.jdbm.helper.DefaultSerializer.deserialize(DefaultSerializer.java:106)
>   at 
> org.apache.hadoop.hive.ql.util.jdbm.recman.BaseRecordManager.fetch(BaseRecordManager.java:360)
>   at 
> org.apache.hadoop.hive.ql.util.jdbm.recman.BaseRecordManager.fetch(BaseRecordManager.java:332)
>   at 
> org.apache.hadoop.hive.ql.util.jdbm.htree.HashDirectory.get(HashDirectory.java:195)
>   at org.apache.hadoop.hive.ql.util.jdbm.htree.HTree.get(HTree.java:155)
>   at 
> org.apache.hadoop.hive.ql.exec.persistence.HashMapWrapper.get(HashMapWrapper.java:114)
>   ... 11 more
> Caused by: java.io.EOFException
>   at java.io.DataInputStream.readInt(DataInputStream.java:375)
>   at 
> java.io.ObjectInputStream$BlockDataInputStream.readInt(ObjectInputStream.java:2776)
>   at java.io.ObjectInputStream.readInt(ObjectInputStream.java:950)
>   at org.apache.hadoop.io.BytesWritable.readFields(BytesWritable.java:153)
>   at 
> org.apache.hadoop.hive.ql.exec.persistence.MapJoinObjectValue.readExternal(MapJoinObjectValue.java:98)
> {verbatim}
> I am trying to create a testcase, which can demonstrate this error.




ReviewBoard Tips

2010-07-08 Thread Carl Steinbach
I'm really excited to see more people using ReviewBoard for their
Hive JIRAs. I want to remind everyone that when creating a
review request it is really important to set the "Bugs"
and "Groups" fields. The "Bugs" field should be set to the ID of
the Hive JIRA, e.g. "HIVE-756". ReviewBoard needs this
information in order to automatically post review comments back
to the JIRA ticket. The "Groups" field should be set
to "hive". This ensures that ReviewBoard will send the review
request and comments to the hive-dev mailing list.

Thanks.

Carl