[jira] Commented: (HIVE-1139) GroupByOperator sometimes throws OutOfMemory error when there are too many distinct keys

2010-06-09 Thread Ning Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12877303#action_12877303
 ] 

Ning Zhang commented on HIVE-1139:
--

Arvind, I remember I ran into this (non-serializable) problem before, and it 
boils down to supporting ObjectInspector in the same way that RowContainer 
does. A complete Map interface would be good to have, but I would be very 
cautious about the performance penalty it may add. If the new API is not 
needed for now, I would vote for the ObjectInspector support first. 
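The spill-to-disk idea behind HashMapWrapper can be sketched in a few lines (a hypothetical simplification, not the actual Hive class, which goes through Hive's serialization layer): keep a bounded number of entries in memory and append overflow entries to a temp file.

```java
import java.io.*;
import java.util.*;

// Hypothetical sketch of the spill-to-disk idea: keep at most `threshold`
// distinct keys in memory and append overflow entries to a temp file.
public class SpillingMap {
    private final int threshold;
    private final Map<String, String> memory = new HashMap<>();
    private final File spillFile;

    public SpillingMap(int threshold) {
        this.threshold = threshold;
        try {
            this.spillFile = File.createTempFile("spill", ".tmp");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        spillFile.deleteOnExit();
    }

    public void put(String key, String value) {
        if (memory.size() < threshold || memory.containsKey(key)) {
            memory.put(key, value);               // still fits in memory
            return;
        }
        // Overflow: append to disk instead of growing the heap.
        try (Writer w = new FileWriter(spillFile, true)) {
            w.write(key + "\t" + value + "\n");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public String get(String key) {
        if (memory.containsKey(key)) {
            return memory.get(key);
        }
        // Fall back to a linear scan of the spill file.
        try (BufferedReader r = new BufferedReader(new FileReader(spillFile))) {
            String line;
            while ((line = r.readLine()) != null) {
                String[] kv = line.split("\t", 2);
                if (kv[0].equals(key)) {
                    return kv[1];
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return null;
    }

    public int inMemorySize() { return memory.size(); }
}
```

The real class also has to bound lookup cost (a linear file scan as above would be far too slow for a join or group-by), which is where the serialization format matters.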

> GroupByOperator sometimes throws OutOfMemory error when there are too many 
> distinct keys
> 
>
> Key: HIVE-1139
> URL: https://issues.apache.org/jira/browse/HIVE-1139
> Project: Hadoop Hive
>  Issue Type: Bug
>Reporter: Ning Zhang
>Assignee: Arvind Prabhakar
>
> When a partial aggregation is performed on a mapper, a HashMap is created 
> to keep all distinct keys in main memory. This can lead to an OOM exception 
> when there are too many distinct keys for a particular mapper. A workaround 
> is to set the map split size smaller so that each mapper processes fewer 
> rows. A better solution is to use the persistent HashMapWrapper (currently 
> used in CommonJoinOperator) to spill overflow rows to disk. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-417) Implement Indexing in Hive

2010-06-09 Thread Prafulla Tekawade (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12877295#action_12877295
 ] 

Prafulla Tekawade commented on HIVE-417:


Yes Ashish,
That's what I had in mind.

The rewrite system would need metadata, and hence it should be invoked 
after the semantic analysis phase, which would make the metadata available.


> Implement Indexing in Hive
> --
>
> Key: HIVE-417
> URL: https://issues.apache.org/jira/browse/HIVE-417
> Project: Hadoop Hive
>  Issue Type: New Feature
>  Components: Metastore, Query Processor
>Affects Versions: 0.3.0, 0.3.1, 0.4.0, 0.6.0
>Reporter: Prasad Chakka
>Assignee: He Yongqiang
> Attachments: hive-417.proto.patch, hive-417-2009-07-18.patch, 
> hive-indexing.3.patch
>
>
> Implement indexing on Hive so that lookup and range queries are efficient.




[jira] Commented: (HIVE-1139) GroupByOperator sometimes throws OutOfMemory error when there are too many distinct keys

2010-06-09 Thread Arvind Prabhakar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12877239#action_12877239
 ] 

Arvind Prabhakar commented on HIVE-1139:


Ashish - no problem - let me explain: the problem being addressed by this JIRA 
is that {{GroupByOperator}} (and possibly other aggregation operators) uses 
in-memory maps to store intermediate keys, which can lead to an 
{{OutOfMemoryError}} when the number of such keys is large. One suggested 
workaround is to use the {{HashMapWrapper}} class, which alleviates the memory 
concern since it is capable of spilling the excess data to disk.

The {{HashMapWrapper}}, however, uses Java serialization to write out the 
excess data. This does not work when the data contains non-serializable 
objects such as the {{Writable}} types ({{Text}}, etc.). What I have done so 
far is modify the {{HashMapWrapper}} to support the full {{java.util.Map}} 
interface. However, when I tried updating the {{GroupByOperator}} to use it, I 
ran into the said serialization problem.

That's why I was suggesting that perhaps we should decouple the serialization 
problem from enhancing the {{HashMapWrapper}} and let the latter be checked in 
independently.
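The serialization failure is easy to reproduce outside Hive: plain Java serialization requires every stored object to implement {{java.io.Serializable}}, which Hadoop {{Writable}} types such as {{Text}} do not. The {{Value}} class below is a hypothetical stand-in for a {{Writable}}, so this sketch needs no Hadoop dependency.

```java
import java.io.*;

// Demonstrates why writing map contents with ObjectOutputStream fails when
// the stored objects do not implement java.io.Serializable, as is the case
// for Hadoop Writable types like Text. "Value" is a hypothetical stand-in.
public class SerializationDemo {
    static class Value {                 // note: does NOT implement Serializable
        final String payload;
        Value(String payload) { this.payload = payload; }
    }

    // Returns true if the object survives Java serialization.
    static boolean javaSerializable(Object o) {
        try (ObjectOutputStream out =
                 new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (NotSerializableException e) {
            return false;                // the failure mode described above
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Writable types instead serialize through their own write(DataOutput)/readFields(DataInput) methods, which is why supporting ObjectInspector-based serialization (as RowContainer does) sidesteps the problem.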

> GroupByOperator sometimes throws OutOfMemory error when there are too many 
> distinct keys
> 
>
> Key: HIVE-1139
> URL: https://issues.apache.org/jira/browse/HIVE-1139
> Project: Hadoop Hive
>  Issue Type: Bug
>Reporter: Ning Zhang
>Assignee: Arvind Prabhakar
>
> When a partial aggregation is performed on a mapper, a HashMap is created 
> to keep all distinct keys in main memory. This can lead to an OOM exception 
> when there are too many distinct keys for a particular mapper. A workaround 
> is to set the map split size smaller so that each mapper processes fewer 
> rows. A better solution is to use the persistent HashMapWrapper (currently 
> used in CommonJoinOperator) to spill overflow rows to disk. 




[jira] Commented: (HIVE-417) Implement Indexing in Hive

2010-06-09 Thread Ashish Thusoo (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12877236#action_12877236
 ] 

Ashish Thusoo commented on HIVE-417:


A couple of comments on this:

A complication of doing the rewrite just after parse is that you lose the 
ability to report errors that correspond to the original query. Also, the 
metadata that you need to do the rewrite is only available after phase 1 of 
semantic analysis. So, in my opinion, the rewrite should be done after 
semantic analysis but before plan generation. Is that what you had in mind?

so something like...

[Query parser]
[Query semantic analysis]
[Query optimization]
...


> Implement Indexing in Hive
> --
>
> Key: HIVE-417
> URL: https://issues.apache.org/jira/browse/HIVE-417
> Project: Hadoop Hive
>  Issue Type: New Feature
>  Components: Metastore, Query Processor
>Affects Versions: 0.3.0, 0.3.1, 0.4.0, 0.6.0
>Reporter: Prasad Chakka
>Assignee: He Yongqiang
> Attachments: hive-417.proto.patch, hive-417-2009-07-18.patch, 
> hive-indexing.3.patch
>
>
> Implement indexing on Hive so that lookup and range queries are efficient.




[jira] Created: (HIVE-1398) Support union all without an outer select *

2010-06-09 Thread Ashish Thusoo (JIRA)
Support union all without an outer select *
---

 Key: HIVE-1398
 URL: https://issues.apache.org/jira/browse/HIVE-1398
 Project: Hadoop Hive
  Issue Type: Improvement
  Components: Query Processor
Reporter: Ashish Thusoo
Assignee: Ashish Thusoo


In Hive, a union all query has to be wrapped in a subquery, as shown below:

select * from 
(select c1 from t1
  union all
  select c2 from t2) u;

This JIRA proposes to fix that to support

select c1 from t1
union all
select c2 from t2;





[jira] Commented: (HIVE-1139) GroupByOperator sometimes throws OutOfMemory error when there are too many distinct keys

2010-06-09 Thread Ashish Thusoo (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12877232#action_12877232
 ] 

Ashish Thusoo commented on HIVE-1139:
-

Arvind, I thought the whole point of this JIRA was to make HashMapWrapper 
support java.util.Map, no? If that would be a separate JIRA, what would this 
one be for? Sorry for being a bit dense here, but if you could clarify, that 
would be great.

Thanks,
Ashish


> GroupByOperator sometimes throws OutOfMemory error when there are too many 
> distinct keys
> 
>
> Key: HIVE-1139
> URL: https://issues.apache.org/jira/browse/HIVE-1139
> Project: Hadoop Hive
>  Issue Type: Bug
>Reporter: Ning Zhang
>Assignee: Arvind Prabhakar
>
> When a partial aggregation is performed on a mapper, a HashMap is created 
> to keep all distinct keys in main memory. This can lead to an OOM exception 
> when there are too many distinct keys for a particular mapper. A workaround 
> is to set the map split size smaller so that each mapper processes fewer 
> rows. A better solution is to use the persistent HashMapWrapper (currently 
> used in CommonJoinOperator) to spill overflow rows to disk. 




[jira] Commented: (HIVE-1397) histogram() UDAF for a numerical column

2010-06-09 Thread Ashish Thusoo (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12877233#action_12877233
 ] 

Ashish Thusoo commented on HIVE-1397:
-

+1.

This would be a cool contribution.


> histogram() UDAF for a numerical column
> ---
>
> Key: HIVE-1397
> URL: https://issues.apache.org/jira/browse/HIVE-1397
> Project: Hadoop Hive
>  Issue Type: New Feature
>  Components: Query Processor
>Affects Versions: 0.6.0
>Reporter: Mayank Lahiri
>Assignee: Mayank Lahiri
> Fix For: 0.6.0
>
>
> A histogram() UDAF to generate an approximate histogram of a numerical (byte, 
> short, double, long, etc.) column. The result is returned as a map of (x,y) 
> histogram pairs, and can be plotted in Gnuplot using impulses (for example). 
> The algorithm is currently adapted from "A streaming parallel decision tree 
> algorithm" by Ben-Haim and Tom-Tov, JMLR 11 (2010), and uses space 
> proportional to the number of histogram bins specified. It has no 
> approximation guarantees, but seems to work well when there is a lot of data 
> and a large number (e.g. 50-100) of histogram bins specified.
> A typical call might be:
> SELECT histogram(val, 10) FROM some_table;
> where the result would be a histogram with 10 bins, returned as a Hive map 
> object.
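As a rough sketch of the bin-merging idea (a simplified illustration, not the actual UDAF: the real algorithm uses the merge and interpolation procedures from the Ben-Haim and Tom-Tov paper), a bounded histogram can merge its two closest bins whenever a new point exceeds the bin budget:

```java
import java.util.*;

// Simplified sketch of a streaming histogram in the spirit of Ben-Haim &
// Tom-Tov: keep at most maxBins (center, count) pairs; when a new point
// exceeds the budget, merge the two closest bins into their weighted mean.
public class StreamingHistogram {
    private final int maxBins;
    // Sorted map from bin center to count.
    private final TreeMap<Double, Long> bins = new TreeMap<>();

    public StreamingHistogram(int maxBins) { this.maxBins = maxBins; }

    public void add(double x) {
        bins.merge(x, 1L, Long::sum);
        if (bins.size() > maxBins) {
            mergeClosest();
        }
    }

    private void mergeClosest() {
        Double prev = null, a = null, b = null;
        double best = Double.POSITIVE_INFINITY;
        for (Double c : bins.keySet()) {
            if (prev != null && c - prev < best) {
                best = c - prev;
                a = prev;
                b = c;
            }
            prev = c;
        }
        long ca = bins.remove(a), cb = bins.remove(b);
        // The weighted mean keeps the total count and approximate location.
        bins.put((a * ca + b * cb) / (ca + cb), ca + cb);
    }

    public SortedMap<Double, Long> bins() { return bins; }
}
```

Space stays proportional to the number of bins, matching the description above, and the result maps naturally onto the (x, y) map object the UDAF returns.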




[jira] Updated: (HIVE-1373) Missing connection pool plugin in Eclipse classpath

2010-06-09 Thread Ashish Thusoo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashish Thusoo updated HIVE-1373:


   Status: Resolved  (was: Patch Available)
 Hadoop Flags: [Reviewed]
Fix Version/s: 0.6.0
   Resolution: Fixed

Committed. Thanks Vinithra!!


> Missing connection pool plugin in Eclipse classpath
> ---
>
> Key: HIVE-1373
> URL: https://issues.apache.org/jira/browse/HIVE-1373
> Project: Hadoop Hive
>  Issue Type: Bug
>  Components: Build Infrastructure
> Environment: Eclipse, Linux
>Reporter: Vinithra Varadharajan
>Assignee: Vinithra Varadharajan
> Fix For: 0.6.0
>
> Attachments: HIVE-1373.patch
>
>
> In a recent checkin, connection pool dependency was introduced but eclipse 
> .classpath file was not updated.  This causes launch configurations from 
> within Eclipse to fail.
> {code}
> hive> show tables;
> show tables;
> 10/05/26 14:59:46 INFO parse.ParseDriver: Parsing command: show tables
> 10/05/26 14:59:46 INFO parse.ParseDriver: Parse Completed
> 10/05/26 14:59:46 INFO ql.Driver: Semantic Analysis Completed
> 10/05/26 14:59:46 INFO ql.Driver: Returning Hive schema: 
> Schema(fieldSchemas:[FieldSchema(name:tab_name, type:string, comment:from 
> deserializer)], properties:null)
> 10/05/26 14:59:46 INFO ql.Driver: query plan = 
> file:/tmp/vinithra/hive_2010-05-26_14-59-46_058_1636674338194744357/queryplan.xml
> 10/05/26 14:59:46 INFO ql.Driver: Starting command: show tables
> 10/05/26 14:59:46 INFO metastore.HiveMetaStore: 0: Opening raw store with 
> implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
> 10/05/26 14:59:46 INFO metastore.ObjectStore: ObjectStore, initialize called
> FAILED: Error in metadata: javax.jdo.JDOFatalInternalException: Error 
> creating transactional connection factory
> NestedThrowables:
> java.lang.reflect.InvocationTargetException
> 10/05/26 14:59:47 ERROR exec.DDLTask: FAILED: Error in metadata: 
> javax.jdo.JDOFatalInternalException: Error creating transactional connection 
> factory
> NestedThrowables:
> java.lang.reflect.InvocationTargetException
> org.apache.hadoop.hive.ql.metadata.HiveException: 
> javax.jdo.JDOFatalInternalException: Error creating transactional connection 
> factory
> NestedThrowables:
> java.lang.reflect.InvocationTargetException
>   at org.apache.hadoop.hive.ql.metadata.Hive.getTablesForDb(Hive.java:491)
>   at 
> org.apache.hadoop.hive.ql.metadata.Hive.getTablesByPattern(Hive.java:472)
>   at org.apache.hadoop.hive.ql.metadata.Hive.getAllTables(Hive.java:458)
>   at org.apache.hadoop.hive.ql.exec.DDLTask.showTables(DDLTask.java:504)
>   at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:176)
>   at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:107)
>   at 
> org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:55)
>   at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:631)
>   at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:504)
>   at org.apache.hadoop.hive.ql.Driver.run(Driver.java:382)
>   at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:138)
>   at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:197)
>   at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:303)
> Caused by: javax.jdo.JDOFatalInternalException: Error creating transactional 
> connection factory
> NestedThrowables:
> java.lang.reflect.InvocationTargetException
>   at 
> org.datanucleus.jdo.NucleusJDOHelper.getJDOExceptionForNucleusException(NucleusJDOHelper.java:395)
>   at 
> org.datanucleus.jdo.JDOPersistenceManagerFactory.freezeConfiguration(JDOPersistenceManagerFactory.java:547)
>   at 
> org.datanucleus.jdo.JDOPersistenceManagerFactory.getPersistenceManagerFactory(JDOPersistenceManagerFactory.java:175)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>   at java.lang.reflect.Method.invoke(Method.java:597)
>   at javax.jdo.JDOHelper$16.run(JDOHelper.java:1956)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.jdo.JDOHelper.invoke(JDOHelper.java:1951)
>   at 
> javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1159)
>   at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:803)
>   at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:698)
>   at 
> org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:191)
>   at 
> org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(Ob

[jira] Commented: (HIVE-1139) GroupByOperator sometimes throws OutOfMemory error when there are too many distinct keys

2010-06-09 Thread Arvind Prabhakar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12877222#action_12877222
 ] 

Arvind Prabhakar commented on HIVE-1139:


If there is interest, I can file a separate JIRA for modifying 
{{HashMapWrapper}} to support the {{java.util.Map}} interface and decouple that 
work from this JIRA. I think there is a lot of benefit in doing just that. 
Also, we could have this JIRA depend upon that as a prerequisite.



> GroupByOperator sometimes throws OutOfMemory error when there are too many 
> distinct keys
> 
>
> Key: HIVE-1139
> URL: https://issues.apache.org/jira/browse/HIVE-1139
> Project: Hadoop Hive
>  Issue Type: Bug
>Reporter: Ning Zhang
>Assignee: Arvind Prabhakar
>
> When a partial aggregation is performed on a mapper, a HashMap is created 
> to keep all distinct keys in main memory. This can lead to an OOM exception 
> when there are too many distinct keys for a particular mapper. A workaround 
> is to set the map split size smaller so that each mapper processes fewer 
> rows. A better solution is to use the persistent HashMapWrapper (currently 
> used in CommonJoinOperator) to spill overflow rows to disk. 




[jira] Commented: (HIVE-1397) histogram() UDAF for a numerical column

2010-06-09 Thread Edward Capriolo (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12877220#action_12877220
 ] 

Edward Capriolo commented on HIVE-1397:
---

Looks great. Can't wait.

> histogram() UDAF for a numerical column
> ---
>
> Key: HIVE-1397
> URL: https://issues.apache.org/jira/browse/HIVE-1397
> Project: Hadoop Hive
>  Issue Type: New Feature
>  Components: Query Processor
>Affects Versions: 0.6.0
>Reporter: Mayank Lahiri
>Assignee: Mayank Lahiri
> Fix For: 0.6.0
>
>
> A histogram() UDAF to generate an approximate histogram of a numerical (byte, 
> short, double, long, etc.) column. The result is returned as a map of (x,y) 
> histogram pairs, and can be plotted in Gnuplot using impulses (for example). 
> The algorithm is currently adapted from "A streaming parallel decision tree 
> algorithm" by Ben-Haim and Tom-Tov, JMLR 11 (2010), and uses space 
> proportional to the number of histogram bins specified. It has no 
> approximation guarantees, but seems to work well when there is a lot of data 
> and a large number (e.g. 50-100) of histogram bins specified.
> A typical call might be:
> SELECT histogram(val, 10) FROM some_table;
> where the result would be a histogram with 10 bins, returned as a Hive map 
> object.




[jira] Commented: (HIVE-1386) HiveQL SQL Compliance (Umbrella)

2010-06-09 Thread Edward Capriolo (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12877204#action_12877204
 ] 

Edward Capriolo commented on HIVE-1386:
---

As a side note, maybe with something like a total-order partitioner we could 
do a true map/reduce ORDER BY.
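The core of that idea can be sketched as follows (a hypothetical illustration, not Hadoop's actual TotalOrderPartitioner API): sample the input to pick sorted split points, then route each key to the partition whose range contains it, so that concatenating the reducer outputs in order yields a globally sorted result.

```java
import java.util.*;

// Sketch of total-order partitioning: given sorted split points (typically
// obtained by sampling the input), assign each key the index of its range.
// Partition i then receives only keys that sort before every key of i+1.
public class TotalOrderSketch {
    static int partition(String key, String[] splitPoints) {
        // binarySearch returns -(insertionPoint) - 1 on a miss; keys equal
        // to a split point go to the partition on its right.
        int pos = Arrays.binarySearch(splitPoints, key);
        return pos >= 0 ? pos + 1 : -(pos + 1);
    }
}
```

With n split points there are n + 1 partitions, so the number of reduce tasks no longer has to be 1 to get a total order.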

> HiveQL SQL Compliance (Umbrella)
> 
>
> Key: HIVE-1386
> URL: https://issues.apache.org/jira/browse/HIVE-1386
> Project: Hadoop Hive
>  Issue Type: Improvement
>  Components: Query Processor
>Reporter: Carl Steinbach
>
> This is an umbrella ticket to track work related to HiveQL compliance with 
> the SQL standard, e.g. supported query syntax, data types, views, catalog 
> access, etc.




Re: Cannot access more than one hive prompt

2010-06-09 Thread Edward Capriolo
On Wed, Jun 9, 2010 at 3:42 PM, jaydeep vishwakarma <
jaydeep.vishwaka...@mkhoj.com> wrote:

> Hi,
>
> I am trying to access two Hive prompts from the same machine. Only the first
> one works, but the other prompt shows the following error when running a
> simple select query:
>
> FAILED: Error in semantic analysis: Unable to fetch table employee
>
> How can I access more than one Hive prompt on the same system?
>
> Regards,
> Jaydeep
>
> The information contained in this communication is intended solely for the
> use of the individual or entity to whom it is addressed and others
> authorized to receive it. It may contain confidential or legally privileged
> information. If you are not the intended recipient you are hereby notified
> that any disclosure, copying, distribution or taking any action in reliance
> on the contents of this information is strictly prohibited and may be
> unlawful. If you have received this communication in error, please notify us
> immediately by responding to this email and then delete it from your system.
> The firm is neither liable for the proper and complete transmission of the
> information contained in this communication nor for any delay in its
> receipt.
>

I remember once asking the same question :)

http://wiki.apache.org/hadoop/HiveDerbyServerMode


[jira] Commented: (HIVE-1386) HiveQL SQL Compliance (Umbrella)

2010-06-09 Thread Jeff Hammerbacher (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12877183#action_12877183
 ] 

Jeff Hammerbacher commented on HIVE-1386:
-

The discussion in HIVE-61 seems to indicate that SORT BY was decided upon as 
the syntax and that, to simulate ORDER BY behavior, one should set the number 
of reduce tasks to 1. I don't have an instance of Hive running nearby, but if 
ORDER BY is now in the syntax, could you please update the language guide?

> HiveQL SQL Compliance (Umbrella)
> 
>
> Key: HIVE-1386
> URL: https://issues.apache.org/jira/browse/HIVE-1386
> Project: Hadoop Hive
>  Issue Type: Improvement
>  Components: Query Processor
>Reporter: Carl Steinbach
>
> This is an umbrella ticket to track work related to HiveQL compliance with 
> the SQL standard, e.g. supported query syntax, data types, views, catalog 
> access, etc.




Cannot access more than one hive prompt

2010-06-09 Thread jaydeep vishwakarma

Hi,

I am trying to access two Hive prompts from the same machine. Only the first
one works, but the other prompt shows the following error when running a
simple select query:

FAILED: Error in semantic analysis: Unable to fetch table employee

How can I access more than one Hive prompt on the same system?

Regards,
Jaydeep



[jira] Commented: (HIVE-417) Implement Indexing in Hive

2010-06-09 Thread He Yongqiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12877144#action_12877144
 ] 

He Yongqiang commented on HIVE-417:
---

Plan sounds perfectly good to me!

> Implement Indexing in Hive
> --
>
> Key: HIVE-417
> URL: https://issues.apache.org/jira/browse/HIVE-417
> Project: Hadoop Hive
>  Issue Type: New Feature
>  Components: Metastore, Query Processor
>Affects Versions: 0.3.0, 0.3.1, 0.4.0, 0.6.0
>Reporter: Prasad Chakka
>Assignee: He Yongqiang
> Attachments: hive-417.proto.patch, hive-417-2009-07-18.patch, 
> hive-indexing.3.patch
>
>
> Implement indexing on Hive so that lookup and range queries are efficient.




[jira] Commented: (HIVE-417) Implement Indexing in Hive

2010-06-09 Thread Prafulla Tekawade (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12877049#action_12877049
 ] 

Prafulla Tekawade commented on HIVE-417:


I was thinking of adding something called a query rewrite module.
It would be a rule-based query rewrite system that rewrites a
query into a semantically equivalent query that is more optimized
and/or uses indexes (not just for scans, but for other query
operators, e.g. GROUP BY).

Eg.

select distinct c1
from t1;

If we have a dense index ('compact summary index' in this hive
indexing patch) on c1, this query can be replaced with a query on
the index table itself:

select idx_key
from t1_cmpct_sum_idx;

Similar query transformations can happen for other queries.

The module would be placed just before the optimizer and would help
the optimizer. The module structure looks like this:

[Query parser]
[Query rewrites] --> new phase
[Query optimization]
[Query execution planner]
[Query execution engine]

The rewrite module is 'generic', not just for the above indexing case,
but for other cases too, e.g. OR predicates to a union (for efficiency?),
outer join to a union of anti and semi joins, moving 'order by' out of
a union subquery, etc.

The aim is to implement very simple, lightweight rewrite support,
implement the indexing-related rewrites (the rewrite above does not
even need a new run-time map-reduce operator), and integrate indexing
support quickly and cleanly. As noted above, this rewrite phase is
rule-based (not cost-based), a sort of early optimization.

Let me know what you think. I'll start by reading your patch.
This would cover most of TODO 1; TODOs 2 and 3 will have to be
looked into.
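A toy illustration of the kind of rule such a rewrite phase might apply (everything here is hypothetical: a real implementation would match on the analyzed query tree and consult the metastore for available indexes, not on query strings):

```java
import java.util.*;
import java.util.regex.*;

// Toy rule-based rewriter: if a dense ("compact summary") index is known for
// a column, rewrite SELECT DISTINCT col FROM table into a scan of the index
// table. Matching on SQL text is purely illustrative; a real rewrite phase
// would operate on the semantic-analysis output.
public class DistinctToIndexRule {
    // table.column -> index table name (hypothetical metadata lookup).
    private final Map<String, String> denseIndexes = new HashMap<>();

    public void registerIndex(String table, String column, String indexTable) {
        denseIndexes.put(table + "." + column, indexTable);
    }

    public String rewrite(String query) {
        Matcher m = Pattern
            .compile("select\\s+distinct\\s+(\\w+)\\s+from\\s+(\\w+)",
                     Pattern.CASE_INSENSITIVE)
            .matcher(query.trim());
        if (m.matches()) {
            String indexTable = denseIndexes.get(m.group(2) + "." + m.group(1));
            if (indexTable != null) {
                // Index keys are exactly the distinct values of the column.
                return "select idx_key from " + indexTable;
            }
        }
        return query;   // no applicable rule: leave the query unchanged
    }
}
```

The important property of rule-based rewrites is visible even in this sketch: a rule either fires and produces a semantically equivalent query, or leaves the input untouched, so rules compose safely in a pipeline.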

> Implement Indexing in Hive
> --
>
> Key: HIVE-417
> URL: https://issues.apache.org/jira/browse/HIVE-417
> Project: Hadoop Hive
>  Issue Type: New Feature
>  Components: Metastore, Query Processor
>Affects Versions: 0.3.0, 0.3.1, 0.4.0, 0.6.0
>Reporter: Prasad Chakka
>Assignee: He Yongqiang
> Attachments: hive-417.proto.patch, hive-417-2009-07-18.patch, 
> hive-indexing.3.patch
>
>
> Implement indexing on Hive so that lookup and range queries are efficient.




Re: Question about Hadoop task side-effect files//

2010-06-09 Thread Gerrit van Vuuren
Hi, 

Frameworks like Pig and Hive already avoid this (unless you write your own 
stores/writers). Each mapper or reducer (depending on where you write your 
final data from) writes to its own unique file on HDFS. Have a look at the 
contents of a table in Hive, which is normally a folder on HDFS containing 
multiple files. Inserting into a Hive table just writes another file into 
that folder. 
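The naming scheme can be sketched as follows (hypothetical names and layout, not Hadoop's actual output-committer paths): each task attempt derives a distinct file name from its own partition and attempt id, so no two writers ever share a file, and a failed attempt's partial output can be discarded wholesale.

```java
// Hypothetical illustration of per-task output naming: every attempt writes
// under its own temporary directory, and only the committed attempt's
// part-NNNNN file is promoted into the table folder.
public class TaskOutputNames {
    static String outputFile(String jobOutputDir, int partition, int attempt) {
        // e.g. /warehouse/t1/_tmp/attempt_3_0/part-00003
        return String.format("%s/_tmp/attempt_%d_%d/part-%05d",
                             jobOutputDir, partition, attempt, partition);
    }
}
```

Speculative execution is then harmless: two attempts of the same task write to different directories, and only one is ever moved into place.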


- Original Message -
From: wuxy 
To: hive-dev@hadoop.apache.org 
Sent: Wed Jun 09 07:08:22 2010
Subject: Question about Hadoop task side-effect files//


I found the following section at the end of chapter 6 of the book,

'Task side-effect files':
"Care needs to be taken to ensure that multiple instances of the same task
don't try to write to the same file. There are two problems to avoid: if a
task failed and was retried, then the old partial output would still be
present when the second task ran, and it would have to delete the old file
first. Second, with speculative execution enabled, two instances of the same
task could try to write to the same file simultaneously." 
---
In this description, "two instances of the same task could try to write to
the same file simultaneously" is a case that should be avoided.
Can anyone confirm this for me and, if possible, tell me the reason
behind it.

Thanks.

Steven. Wu